Setting a different directory for the tesseract data

We are using GemBox to OCR PDFs. I need to put the Tesseract data files in a different directory. My application is written in C# on .NET, so how can I set which directory the library looks in for the data files?
I can’t find anything in the documentation.

Hi David,

You can set the OcrReadOptions.TesseractDataPath property, like this:

var readOptions = new OcrReadOptions() { TesseractDataPath = "languagedata" };
readOptions.Languages.Add(OcrLanguages.German);
using (PdfDocument document = OcrReader.Read("GermanDocument.pdf", readOptions))
{
    // ...
}
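
A relative value such as "languagedata" can end up pointing at different places depending on how the process is started. The following is a minimal sketch, not part of GemBox, that turns a folder name into an absolute path next to the application's binaries and lists the *.traineddata files found there. The ResolveDataPath and FindLanguageFiles helpers and the folder layout are assumptions for illustration only; the resulting path could then be assigned to TesseractDataPath.

```csharp
using System;
using System.IO;

// Hypothetical helper (not a GemBox API): build an absolute path to a data
// folder located next to the application's binaries.
static string ResolveDataPath(string folderName) =>
    Path.GetFullPath(Path.Combine(AppContext.BaseDirectory, folderName));

// Hypothetical helper: return the Tesseract language files (*.traineddata)
// in a folder, or an empty array if the folder does not exist.
static string[] FindLanguageFiles(string dataPath) =>
    Directory.Exists(dataPath)
        ? Directory.GetFiles(dataPath, "*.traineddata")
        : Array.Empty<string>();

// "languagedata" is the folder name used in the reply above.
string dataPath = ResolveDataPath("languagedata");
string[] languageFiles = FindLanguageFiles(dataPath);

Console.WriteLine($"Data path: {dataPath}");
Console.WriteLine(languageFiles.Length > 0
    ? $"Found {languageFiles.Length} language file(s)."
    : "No *.traineddata files found; check the deployment output.");
```

Running a check like this at startup makes a missing or misplaced data folder visible before the first OCR call.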

See the second example (“OCR with different languages”) on the following page:

I hope this helps.

Regards,
Mario

Mario, thanks. That solved the near-term problem.

I have a new problem. The application is deployed on Azure in an App Service, and the Tesseract DLLs are reported as missing, even though they are in the x64 folder under the current directory.

How do I set that Library path?

Edit #1:
So now I can’t even get it to work locally. I have updated to the latest version I can find: 17.0.1485.
I set the Library Path as follows before creating my OcrReadOptions:

OcrReadOptions.LibraryPath = @"C:\Users\p001056H\Documents\binfix\parsonsgptadminconsole\bin\Debug\net8.0\x64";
readOptions = new OcrReadOptions()
{
    TesseractDataPath = GetEnvironmentVariable("GEMBOX_TESSERACT_DATA"),
    KeepContent = true
};

The following is the error in the console:

  An unhandled exception has occurred while executing the request.
  System.DllNotFoundException: Failed to find DLLs for tesseract or leptonica. Try setting the GemBox.Pdf.Ocr.OcrReadOptions.LibraryPath static property.
   ---> System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.
   ---> System.DllNotFoundException: Failed to find library "leptonica-1.80.0.dll" for platform x64.

But when I look at the directory:
p001056H@8KVJCY3 MINGW64 ~/Documents/binfix/parsonsgptadminconsole (PGPT-57)
$ ls bin/Debug/net8.0/x64/
leptonica-1.80.0.dll libleptonica-1.80.0.so libtesseract41.so tesseract41.dll
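
A file that shows up in a directory listing can still fail to load, for example when the process architecture differs from the DLL's or when one of the DLL's own dependencies is missing. Below is a rough diagnostic sketch, assuming .NET Core 3.0 or later (for NativeLibrary). It checks that each file exists and then tries to load it directly. The ProbeNativeLibrary helper and the "x64" folder next to the binaries are illustrative assumptions, not part of GemBox; the file names are taken from the listing above.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical diagnostic helper (not a GemBox API): report whether a native
// library file exists and whether the runtime can actually load it.
static string ProbeNativeLibrary(string directory, string fileName)
{
    string fullPath = Path.Combine(directory, fileName);
    if (!File.Exists(fullPath))
        return $"{fileName}: file not found in {directory}";

    // TryLoad returns false instead of throwing when the load fails,
    // e.g. because of a wrong architecture or a missing dependent DLL.
    if (NativeLibrary.TryLoad(fullPath, out IntPtr handle))
    {
        NativeLibrary.Free(handle);
        return $"{fileName}: loaded successfully";
    }
    return $"{fileName}: present but failed to load";
}

// Assumed layout: native libraries in an "x64" folder next to the binaries.
string nativeDir = Path.Combine(AppContext.BaseDirectory, "x64");
Console.WriteLine($"64-bit process: {Environment.Is64BitProcess}");
foreach (string name in new[] { "leptonica-1.80.0.dll", "tesseract41.dll" })
    Console.WriteLine(ProbeNativeLibrary(nativeDir, name));
```

If a file is reported as present but failing to load, the cause is more likely the environment (process bitness or a missing runtime dependency) than the search path.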

Hi David,

Can you please try again with these NuGet packages:

Install-Package GemBox.Pdf -Version 17.0.1486-hotfix
Install-Package GemBox.Pdf.Ocr -Version 25.0.1486-hotfix

It should now work out of the box: both TesseractDataPath and LibraryPath are now additionally looked for in the directory of Assembly.GetEntryAssembly().Location.
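
To see which directory that corresponds to in a given deployment, a short sketch like the one below may help. It assumes the lookup starts from the folder containing the entry assembly, which this reply implies but does not spell out, and the GetEntryAssemblyDirectory helper is illustrative rather than a GemBox API. Because GetEntryAssembly() can return null and Location is an empty string in single-file deployments, the sketch falls back to AppContext.BaseDirectory.

```csharp
using System;
using System.IO;
using System.Reflection;

// Hypothetical helper (not a GemBox API): directory of the entry assembly,
// falling back to the application base directory when it is unavailable.
static string GetEntryAssemblyDirectory()
{
    string? location = Assembly.GetEntryAssembly()?.Location;
    string? directory = string.IsNullOrEmpty(location)
        ? null
        : Path.GetDirectoryName(location);
    return string.IsNullOrEmpty(directory) ? AppContext.BaseDirectory : directory;
}

string entryDir = GetEntryAssemblyDirectory();
bool hasX64Folder = Directory.Exists(Path.Combine(entryDir, "x64"));

Console.WriteLine($"Entry assembly directory: {entryDir}");
Console.WriteLine($"x64 folder present: {hasX64Folder}");
```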

Does this solve your issue?

Regards,
Mario