Reconstruct text lines based on PdfPoint values

entge001 · September 13, 2020, 6:28pm

Hi, how do I get the PdfPoint X and Y values?

While working through the sample code for reading text from a PDF (VB.NET), I found that the lines come out mixed up in page.Content.ToString.
I saw that PdfPoint(x, y) stores the position of the text object: X is the position along the line and Y is the position of the line.

Based on the X and Y values I want to reconstruct the lines as they are visible in the PDF.
Any tips/thoughts?

Thanks in advance,
Gerard

mario.gembox · September 14, 2020, 9:39am

Hi Gerard,

The problem is that GemBox.Pdf doesn’t provide width and height information for PdfTextContent elements, which is why this is currently not possible.

Typically, you can reconstruct the content of each line by inspecting the Y values (when the value changes, the text elements belong to a different line). But without knowing the width, you cannot be sure whether, or how many, spaces lie between text elements that have the same Y value but a different X value.

Nevertheless, note that we do intend to add support for this in the future.
But unfortunately, at this moment, I cannot say exactly when it will be available.
Please note that we prioritize feature requests by the number of users requesting them, and currently we’re working on some other features that have greater priority.

Regards,
Mario

entge001 · September 14, 2020, 1:23pm

Mario,

Thanks for your quick and detailed response! I am impressed.
I found a partial solution through the example, which shows how to use:

For Each textElement In document.Pages...
    Dim location = textElement.Location

The next step is to sort the textElements by their Y and X locations. This should more or less give the original layout.
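
Roughly what I have in mind, as a minimal sketch only. TextItem, GroupIntoLines and yTolerance are my own placeholder names, not GemBox.Pdf types; each item just carries the text plus the X and Y read from textElement.Location. I am assuming Y grows upwards (origin at the bottom-left, as in default PDF user space), so larger Y means higher on the page. Joining with a single space is a guess because, as Mario said, the widths are not known.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

' Hypothetical stand-in for a text element: only its text and X/Y position.
Public Class TextItem
    Public Property Text As String
    Public Property X As Double
    Public Property Y As Double
End Class

Public Module LineGrouping
    ' Sorts items top to bottom (assumption: larger Y is higher on the page),
    ' starts a new line whenever Y differs from the first item of the current
    ' line by more than yTolerance, and orders each line left to right by X.
    Public Function GroupIntoLines(items As IEnumerable(Of TextItem),
                                   yTolerance As Double) As List(Of String)
        Dim lines As New List(Of String)()
        Dim current As New List(Of TextItem)()

        For Each item In items.OrderByDescending(Function(i) i.Y).ThenBy(Function(i) i.X)
            If current.Count > 0 AndAlso Math.Abs(current(0).Y - item.Y) > yTolerance Then
                lines.Add(JoinLine(current))
                current = New List(Of TextItem)()
            End If
            current.Add(item)
        Next

        If current.Count > 0 Then lines.Add(JoinLine(current))
        Return lines
    End Function

    ' Joins the pieces of one line with a single space (a guess, since widths are unknown).
    Private Function JoinLine(line As List(Of TextItem)) As String
        Return String.Join(" ", line.OrderBy(Function(i) i.X).Select(Function(i) i.Text))
    End Function
End Module
```

The tolerance is there because pieces of the same visual line may not report exactly the same Y; a value of 0 falls back to strict equality.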

Thanks again and “keep up the good work”!

Regards,
Gerard

mario.gembox · September 14, 2020, 3:15pm

Hi Gerard,

Just as an FYI, on the second Reading example you can see exactly how to obtain X and Y values.

Regards,
Mario

entge001 · September 22, 2020, 3:37pm

Hi Mario,

I'm making progress with my project and slowly finding my way in the PDF world.
I just could not figure out how this works:

For Each textElement In page _
    .Content.Elements.All() _
    .Where(Function(element) element.ElementType = PdfContentElementType.Text) _
    .Cast(Of PdfTextContent)()

Beginning with .Content.Elements.All(), it seems to be a kind of query statement.
Can you point me to some documentation on how to use it?
I can fiddle around with it a bit, but I want to know what I am doing and why.

Thanks in advance

Regards,
Gerard

mario.gembox · September 23, 2020, 6:18am

Hi Gerard,

Please check the Content Streams and Resources help page, it contains information about PDF content elements.

The page.Content is a root element, it’s of a PdfContentGroup type and can have child elements in the PdfContentElementCollection (the Content.Elements part). The Elements.All() method will return all content elements underneath the collection, so you get all elements including all children of all group elements.
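
To illustrate the idea of "all elements including children of groups", here is a conceptual sketch on a plain tree. Node and AllDescendants are hypothetical names used only for illustration; this is not how GemBox.Pdf implements Elements.All(), and the depth-first order shown is an assumption of the sketch.

```vbnet
Imports System.Collections.Generic
Imports System.Linq

' Hypothetical tree node: a group has children, a leaf has none.
Public Class Node
    Public Property Name As String
    Public Property Children As New List(Of Node)()
End Class

Public Module FlattenDemo
    ' Returns every node underneath root (root itself excluded), depth-first:
    ' each child is yielded first, followed by its own descendants.
    Public Iterator Function AllDescendants(root As Node) As IEnumerable(Of Node)
        For Each child In root.Children
            Yield child
            For Each descendant In AllDescendants(child)
                Yield descendant
            Next
        Next
    End Function
End Module
```

The result is a flat sequence, which is what makes it convenient to chain LINQ methods onto it.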

This is a good starting point for adding any further iteration requirements you have by using the LINQ extension methods. In this case, .Where keeps only the text elements and .Cast casts them to the PdfTextContent type.
These two (.Where plus .Cast) could be replaced with just .OfType(Of PdfTextContent)() and you would end up iterating through the same text elements.
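
The equivalence can be seen with plain .NET types. In the sketch below, Element, TextElement and PathElement are hypothetical stand-ins rather than GemBox.Pdf classes, and the .Where uses a type test instead of an ElementType comparison, so treat it as an illustration of the LINQ methods only.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

' Hypothetical stand-ins for content element types; not GemBox.Pdf classes.
Public Class Element
End Class

Public Class TextElement
    Inherits Element
    Public Property Text As String
End Class

Public Class PathElement
    Inherits Element
End Class

Public Module OfTypeDemo
    ' Filter first, then cast every remaining element to TextElement.
    Public Function TextsViaWhereCast(elements As IEnumerable(Of Element)) As List(Of String)
        Return elements _
            .Where(Function(e) TypeOf e Is TextElement) _
            .Cast(Of TextElement)() _
            .Select(Function(t) t.Text) _
            .ToList()
    End Function

    ' OfType filters and casts in one step.
    Public Function TextsViaOfType(elements As IEnumerable(Of Element)) As List(Of String)
        Return elements _
            .OfType(Of TextElement)() _
            .Select(Function(t) t.Text) _
            .ToList()
    End Function
End Module
```

Both functions return the texts of the same elements in the same order; OfType simply skips anything that is not of the requested type.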

I hope this helps.

Regards,
Mario