We are using GemBox to do OCR of PDFs. I need to put the Tesseract data files in a different directory. My application is in C# .NET, so how can I set which directory the library looks in for the data files?
I can’t find anything in the documentation.
Hi David,
You can set the OcrReadOptions.TesseractDataPath property, like this:
var readOptions = new OcrReadOptions() { TesseractDataPath = "languagedata" };
readOptions.Languages.Add(OcrLanguages.German);
using (PdfDocument document = OcrReader.Read("GermanDocument.pdf", readOptions))
{
// ...
}
See the second example (“OCR with different languages”) on the following page:
I hope this helps.
Regards,
Mario
Mario, thanks. That solved the near-term problem.
I've got a new problem. The app is deployed on Azure in an App Service, and the Tesseract DLLs are reported as missing, even though they are in the x64 folder under the current directory.
How do I set that Library path?
Edit #1:
So now I can’t even get it to work locally. I have updated to the latest version I can find: 17.0.1485.
I set the Library Path as follows before creating my OcrReadOptions:
OcrReadOptions.LibraryPath = @"C:\Users\p001056H\Documents\binfix\parsonsgptadminconsole\bin\Debug\net8.0\x64";
readOptions = new OcrReadOptions()
{
    TesseractDataPath = GetEnvironmentVariable("GEMBOX_TESSERACT_DATA"),
    KeepContent = true
};
The following is the error in the console:
An unhandled exception has occurred while executing the request. System.DllNotFoundException: Failed to find DLLs for tesseract or leptonica. Try setting the GemBox.Pdf.Ocr.OcrReadOptions.LibraryPath static property. ---> System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.DllNotFoundException: Failed to find library "leptonica-1.80.0.dll" for platform x64.
But when I look at the directory:
p001056H@8KVJCY3 MINGW64 ~/Documents/binfix/parsonsgptadminconsole (PGPT-57)
$ ls bin/Debug/net8.0/x64/
leptonica-1.80.0.dll libleptonica-1.80.0.so libtesseract41.so tesseract41.dll
Hi David,
Can you please try again with these NuGet packages:
Install-Package GemBox.Pdf -Version 17.0.1486-hotfix
Install-Package GemBox.Pdf.Ocr -Version 25.0.1486-hotfix
It should now work out of the box. Both TesseractDataPath and LibraryPath will now additionally be looked for inside the Assembly.GetEntryAssembly().Location directory.
Does this solve your issue?
Regards,
Mario
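For anyone who still needs to set LibraryPath by hand, here is a minimal sketch of a pre-check that builds the x64 path from the entry assembly's directory instead of hard-coding it, and lists which native files are absent. The NativeLibraryCheck class and its method names are hypothetical helpers, not part of GemBox; the file names are the ones from the directory listing above. It only rules out a wrong path and will not explain a load failure when the files are present.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Reflection;

// Hypothetical helper, not part of GemBox: uses only System.IO and System.Reflection.
public static class NativeLibraryCheck
{
    // Directory that contains the entry assembly. Falls back to
    // AppContext.BaseDirectory when no entry assembly location is available
    // (for example in some test hosts or single-file deployments).
    public static string GetEntryAssemblyDirectory()
    {
        var location = Assembly.GetEntryAssembly()?.Location;
        var directory = string.IsNullOrEmpty(location) ? null : Path.GetDirectoryName(location);
        return string.IsNullOrEmpty(directory) ? AppContext.BaseDirectory : directory;
    }

    // Returns the names from fileNames that do not exist in the given directory.
    public static List<string> FindMissing(string directory, IEnumerable<string> fileNames)
    {
        return fileNames
            .Where(name => !File.Exists(Path.Combine(directory, name)))
            .ToList();
    }
}

// Possible usage before creating OcrReadOptions (assumes the x64 subfolder layout
// shown in this thread):
//
//   var dir = Path.Combine(NativeLibraryCheck.GetEntryAssemblyDirectory(), "x64");
//   var missing = NativeLibraryCheck.FindMissing(
//       dir, new[] { "leptonica-1.80.0.dll", "tesseract41.dll" });
//   if (missing.Count == 0)
//       OcrReadOptions.LibraryPath = dir;
//   else
//       Console.Error.WriteLine("Missing in " + dir + ": " + string.Join(", ", missing));
```

Deriving the path at run time keeps the same code working locally and on Azure, where the absolute path differs from the development machine.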