In your examples you use “page.Content.ToString()”, which gives me an error saying there is no definition for it. Sadly, I only have the error messages in German; translated, it reads:
“PdfPage” does not contain a definition for “Content”, and no accessible extension method “Content” accepting a first argument of type “PdfPage” could be found
(are you missing a using directive or an assembly reference?).
Also, I need to make sure I get all the text on every layer, and I need to perform an additional OCR scan to make sure the PDF has text.
Then I need to split the document on keywords, which won’t work reliably as long as I can’t guarantee that all the text on each layer is accessible. Maybe I could convert it from PDF to PDF/A, if that has any impact on the layering.
Maybe I made a simple mistake, but this is where I’m at.
Unfortunately, without having some sort of a sample project from you I cannot say for sure what problem you have.
Nevertheless, I believe the issue you’re experiencing is caused by using an older version of GemBox.Pdf. So, please try again using the latest version:
Also note that besides PdfPage.Content.ToString(), with the latest version, you can access PdfPage.Content.Text which is of PdfText type. This type includes some useful members (like for finding or deleting some text).
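For reference, here is a minimal sketch of reading the text of every page with PdfPage.Content.ToString(). It assumes a recent GemBox.Pdf version is referenced in the project; the file name “input.pdf” is only a placeholder, and the license key shown is the free-mode key, to be replaced with your own.

```csharp
using System;
using GemBox.Pdf;

class Program
{
    static void Main()
    {
        // Free-mode key; replace with your own license key if you have one.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // "input.pdf" is a placeholder path used for illustration only.
        using (var document = PdfDocument.Load("input.pdf"))
        {
            foreach (var page in document.Pages)
            {
                // Write the text content of the current page to the console.
                Console.WriteLine(page.Content.ToString());
            }
        }
    }
}
```

If this still fails to compile with the same “Content” error, the project is most likely still referencing an older GemBox.Pdf assembly.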
Regarding the OCR, check out the following example:
Another question I have: if I use OCR on PDFs that already have text on them (let’s say a PDF created from a Word document), OCR produces white pages. It works with pictures, though. Can’t I process them again? Do I have to check whether OCR has already been performed?
Does this also Keep Pictures and other things? So far it recognizes text from an image, but if it doesn’t recognize something, it doesn’t keep that other part.
And lastly, please say yes regarding PDF to PDF/A conversion now being present in GemBox.Pdf, because I couldn’t find it in the documentation. It shows on the product page, if I remember correctly, or was that only for GemBox.Document? Three years ago a post mentioned it as “not being present now but in the future”.
The OcrReadOptions.KeepContent will keep all other elements that are not processed by the OCR engine. The OcrReadOptions.KeepImage (OcrReadOptions.KeepPictures) will keep those image elements that are processed by the OCR engine.
Also regarding the PDF/A, I’m afraid that the situation is still the same.
GemBox.Document can create a PDF/A file, but GemBox.Pdf can neither create a new PDF/A file nor convert a PDF to PDF/A. However, if you have an input PDF/A file, you could process it with GemBox.Pdf and keep it as PDF/A.
What do you mean by “convert the memory stream”? What is the content of that stream?
In other words, what is your input file, is it PDF?
Can you send us that file so that we can investigate if GemBox.Document will be a good fit for you?
Nevertheless, yes, in that case you would need a license for both GemBox.Pdf and GemBox.Document.
Lastly, I apologize, the right name is OcrReadOptions.KeepImage. You mentioned “Keep Pictures”, so I thought you had seen this property and written its name correctly…
No worries, and we already have a license; I just thought we only had it for GemBox.Pdf.
And yes, the document in the memory stream is a PDF file. I had it all in one class, but I’ll move it out to its own class, make another class to convert it, and just call those from main. It would be nice if, with a license for GemBox.Document, I could pass the save options as an argument to PdfDocument (as if it were the same class, with the PDF save options being usable or not depending on the license).
No, sadly I haven’t. I’m not really getting comfortable with your documentation.