No definition for Content?

Theory · August 9, 2023, 12:23pm

In your examples you use “page.Content.ToString()”, which gives me an error saying there is no definition for it. Sadly, I only have the error message in German; translated, it reads:

“PdfPage” does not contain a definition for “Content”, and no accessible extension method “Content” accepting a first argument of type “PdfPage” could be found (are you missing a using directive or an assembly reference?).

This error is thrown by your basic example: Read text from PDF files with C# / VB.NET applications

Also, I need to make sure I get all the text on every layer, and I need to perform an additional OCR scan to make sure the PDF has text.

Then I need to split the document on keywords, which I can’t do as long as I can’t guarantee that all the text on each layer is accessible. Maybe I could convert it from PDF to PDF/A, if that has any impact on the layering.

Maybe I made a simple mistake, but this is where I’m at.

Using Visual Studio Code

mario.gembox · August 9, 2023, 12:45pm

Hi Kevin,

Unfortunately, without having some sort of a sample project from you I cannot say for sure what problem you have.

Nevertheless, I believe the issue you’re experiencing is because you’re using some older version of GemBox.Pdf. So, please try again using the current latest version:

Also note that besides PdfPage.Content.ToString(), with the latest version, you can access PdfPage.Content.Text which is of PdfText type. This type includes some useful members (like for finding or deleting some text).
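
For reference, a minimal sketch of the text-reading loop discussed above, assuming a recent GemBox.Pdf NuGet package is referenced. The file name “input.pdf” is a placeholder, and the free-mode license key is used only for illustration.

```csharp
using System;
using GemBox.Pdf;

class ReadPdfText
{
    static void Main()
    {
        // Free-mode key; replace it with your own serial key if you have one.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // "input.pdf" is a placeholder path, not a file from this thread.
        using (var document = PdfDocument.Load("input.pdf"))
        {
            foreach (var page in document.Pages)
            {
                // Content.ToString() returns the text content of the page.
                Console.WriteLine(page.Content.ToString());
            }
        }
    }
}
```

If the compiler still reports that PdfPage has no Content member, the referenced package is probably older than the version the online example targets.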

Regarding the OCR, check out the following example:

I hope this helps.

Regards,
Mario

mario.gembox · August 10, 2023, 6:34am

Hi Kevin,

Please send us your input PDF file (the one that’s targeted with pdfFilePaths[0]) so that we can reproduce the issue and investigate it.

Regards,
Mario

Theory · August 10, 2023, 8:26am

Am I correct that there is no page number on PdfPage? Do I have to count the pages myself?

Also, I can just have multiple statements like readOptions.Languages.Add(OcrLanguages.German); to support multiple languages at once, right?

mario.gembox · August 10, 2023, 8:39am

You retrieve the PdfPage object from the PdfDocument.Pages collection. The page number is the same as the index of that object in its parent collection.

Nevertheless, here is how you can get the page number:

PdfDocument document;
PdfPage page;
// ...
int index = document.Pages.IndexOf(page);
int pageNumber = index + 1;

Also yes, you can add multiple OcrLanguages.
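
A minimal sketch combining both answers, under some assumptions: OcrReader.Read and OcrLanguages.English are taken from the GemBox.Pdf OCR example rather than from this thread, the OCR language data is assumed to be installed, and “scan.pdf” is a placeholder file name.

```csharp
using System;
using GemBox.Pdf;
using GemBox.Pdf.Ocr;

class OcrWithPageNumbers
{
    static void Main()
    {
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Each Add call registers one more language for the OCR engine.
        var readOptions = new OcrReadOptions();
        readOptions.Languages.Add(OcrLanguages.German);
        readOptions.Languages.Add(OcrLanguages.English); // assumed member

        // "scan.pdf" is a placeholder input file.
        using (PdfDocument document = OcrReader.Read("scan.pdf", readOptions))
        {
            for (int index = 0; index < document.Pages.Count; index++)
            {
                // The page number is the zero-based index plus one.
                int pageNumber = index + 1;
                string text = document.Pages[index].Content.ToString();
                Console.WriteLine($"Page {pageNumber}: {text}");
            }
        }
    }
}
```

Iterating by index avoids a separate IndexOf lookup for every page when all pages are processed in order.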

Theory · August 10, 2023, 8:44am

Alright, thank you very much.

Theory · August 10, 2023, 3:02pm

I’m a bit confused about OCR. I tried to implement it with your examples, but I get errors even though my editor is in admin mode:

Unhandled exception. System.UnauthorizedAccessException: Access to the path ‘C:\Lokale Dateien\aXc\Desktop\OCR’ is denied.
at Microsoft.Win32.SafeHandles.SafeFileHandle.CreateFile(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options)
at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
at System.IO.Strategies.OSFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
at System.IO.Strategies.FileStreamHelpers.ChooseStrategyCore(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
at System.IO.File.Open(String path, FileMode mode, FileAccess access, FileShare share)
at tring , Boolean , Boolean , Uri&)
at tring , Boolean , Boolean )
at GemBox.Pdf.PdfDocument.(String )
at GemBox.Pdf.PdfSaveOptions.nj4tq8q2uybq2vcvtvpjsxcrymkqkg58(PdfDocument , Stream , String )
at GemBox.Pdf.PdfDocument.Save(String path, SaveOptions options)
at GemBox.Pdf.PdfDocument.Save(String path)
at Program.Main() in C:\Lokale Dateien\aXc\Projects\PDF_Splitter\csharp\PDF_Splitter\Program.cs:line 24

Would be lovely if you had an idea

mario.gembox · August 11, 2023, 2:31am

Please check this SO question:

I hope this will help you resolve that issue.

Theory · August 11, 2023, 6:33am

Well, that was a stupid mistake: I forgot to use the contextual variable. Thank you :*
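
For anyone hitting the same exception: the path in the trace (‘…\Desktop\OCR’) looks like a folder rather than a file, and on Windows opening a folder as a file typically fails with UnauthorizedAccessException regardless of admin rights. A minimal sketch of building a full file path before saving, using only System.IO; the helper name OutputPaths.ForFile is hypothetical.

```csharp
using System;
using System.IO;

static class OutputPaths
{
    // Hypothetical helper: returns a full file path inside outputFolder,
    // creating the folder first if it does not exist yet.
    public static string ForFile(string outputFolder, string fileName)
    {
        Directory.CreateDirectory(outputFolder);
        return Path.Combine(outputFolder, fileName);
    }
}
```

The result can then be passed to the save call, for example document.Save(OutputPaths.ForFile(folder, "result.pdf")), so that Save receives a file path and not the folder itself.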

Theory · August 11, 2023, 6:46am

Another question I have: if I use OCR on PDFs that already have text on them (let’s say a PDF from a Word document), OCR produces white pages. It works with pictures, though. Can’t I process them again? Do I have to check whether OCR has already been performed?

mario.gembox · August 11, 2023, 6:52am

You need to set the OcrReadOptions.KeepContent property to true.
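
A minimal sketch of that setting, assuming OcrReader.Read from the GemBox.Pdf OCR example is the entry point in use; both file names are placeholders.

```csharp
using GemBox.Pdf;
using GemBox.Pdf.Ocr;

class OcrKeepContent
{
    static void Main()
    {
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // KeepContent = true asks the reader to preserve the existing page
        // content instead of replacing it with the OCR result alone.
        var readOptions = new OcrReadOptions { KeepContent = true };

        // "word-export.pdf" stands for a PDF that already contains text.
        using (PdfDocument document = OcrReader.Read("word-export.pdf", readOptions))
        {
            document.Save("word-export-ocr.pdf");
        }
    }
}
```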

Theory · August 11, 2023, 7:20am

Does this also keep pictures and other things? So far it recognizes text from an image, but if it doesn’t, it doesn’t keep the other part of it?

And lastly, please say yes regarding PDF to PDF/A conversion now being present in GemBox.Pdf, because I couldn’t find it in the documentation. But it shows on the product page, if I remember correctly, or was that only for GemBox.Document? A post from 3 years ago mentioned it as “not being present now but in the future”.

mario.gembox · August 11, 2023, 7:55am

The OcrReadOptions.KeepContent will keep all other elements that are not processed by the OCR engine. The OcrReadOptions.KeepImage ( ~~OcrReadOptions.KeepPictures~~) will keep those image elements that are processed by the OCR engine.

Also regarding the PDF/A, I’m afraid that the situation is still the same.
GemBox.Document can create a PDF/A file, but GemBox.Pdf can neither create a new PDF/A file nor convert a PDF to PDF/A. However, if you have an input PDF/A file, you could process it with GemBox.Pdf and keep it as PDF/A.

Theory · August 11, 2023, 8:10am

So what can I do? Do we have to get an additional license for GemBox.Document, and then I can add “using GemBox.Document” and simply convert the memory stream?

Also, KeepPictures doesn’t exist.

mario.gembox · August 11, 2023, 9:15am

What do you mean by “convert the memory stream”, what is the content of that stream?
In other words, what is your input file, is it PDF?
Can you send us that file so that we can investigate if GemBox.Document will be a good fit for you?
Nevertheless, yes in that case you would need the license for both GemBox.Pdf and GemBox.Document.

Lastly, I apologize, the right name is OcrReadOptions.KeepImage. You mentioned “Keep Pictures”, so I thought you had seen this property and written its name correctly…

Theory · August 11, 2023, 9:29am

No worries, and we already have a license; I just thought we only had one for GemBox.Pdf.

And yes, the document in the memory stream is a PDF file. I had it all in one class, but I’ll move it out to its own class, make another class to convert it, and just call those from Main. It would be nice if, with a license for GemBox.Document, I could pass that in with the save options (for the PdfDocument, as if it were the same class, with the PDF save options depending on the license to decide whether I can use it or not).

No, sadly I haven’t. I’m not really warming up to your documentation.