
GemBox Support Forum

How to extract tabular data from PDF?

Is there a straightforward way to extract tables from a PDF? I couldn’t find out whether such a feature is supported.

From the example "Read text from PDF files with C# / VB.NET applications" I can see how to extract text and some extra information such as Bounds and Format. Maybe one can deduce from that information whether there is a table in the PDF and what its structure is :thinking:

Hi Ivan,

Unfortunately no, there is currently no easy or straightforward way to do this with GemBox.Pdf.
You would need to read both text elements and path elements (lines) to achieve the desired extraction.
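
To illustrate the kind of deduction involved, here is a minimal, library-agnostic sketch in Python. It is not GemBox code: the `Fragment` type, the function names, and the tolerance values are all hypothetical, and it assumes you have already read each text element's string and bounds with your PDF library. It only groups text by position; it ignores path elements (ruling lines), which a more robust approach would also use to find cell borders.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical container for one text element and its bounds (PDF units)."""
    text: str
    left: float
    bottom: float
    right: float
    top: float


def group_rows(fragments, tolerance=2.0):
    """Group fragments whose vertical centers are within `tolerance` of a row's first fragment."""
    rows = []
    # PDF coordinates start at the bottom-left, so larger y means higher on the page.
    for frag in sorted(fragments, key=lambda f: -(f.bottom + f.top) / 2):
        center = (frag.bottom + frag.top) / 2
        if rows and abs(rows[-1]["center"] - center) <= tolerance:
            rows[-1]["items"].append(frag)
        else:
            rows.append({"center": center, "items": [frag]})
    return [sorted(r["items"], key=lambda f: f.left) for r in rows]


def column_anchors(rows, tolerance=5.0):
    """Cluster left edges into column positions (assumes left-aligned columns)."""
    anchors = []
    for x in sorted(f.left for row in rows for f in row):
        if not anchors or x - anchors[-1] > tolerance:
            anchors.append(x)
    return anchors


def to_grid(fragments, row_tol=2.0, col_tol=5.0):
    """Return a list of rows, each a list of cell strings ('' for empty cells)."""
    rows = group_rows(fragments, row_tol)
    anchors = column_anchors(rows, col_tol)
    grid = []
    for row in rows:
        cells = [""] * len(anchors)
        for frag in row:
            # Assign the fragment to the column whose anchor is nearest its left edge.
            idx = min(range(len(anchors)), key=lambda i: abs(anchors[i] - frag.left))
            cells[idx] = (cells[idx] + " " + frag.text).strip()
        grid.append(cells)
    return grid


# Made-up sample data standing in for text elements read from a page.
sample = [
    Fragment("Item", 50, 700, 80, 712),
    Fragment("Qty", 200, 700, 220, 712),
    Fragment("Price", 300, 700.5, 330, 712.5),
    Fragment("Widget", 50, 680, 90, 692),
    Fragment("2", 201, 680, 207, 692),
    Fragment("9.99", 300, 680, 325, 692),
    Fragment("Gadget", 50, 660, 92, 672),
    Fragment("4.50", 300, 660, 325, 672),
]

for cells in to_grid(sample):
    print(cells)
```

The tolerances are guesses and would need tuning per document. Right-aligned or centered columns, multi-line cells, and merged cells would all break this simple left-edge clustering, which is where the line (path) elements become necessary.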

Note that we do have a feature request for this, so please feel free to vote for it to increase its priority:
https://support.gemboxsoftware.com/community/view/export-text-in-a-structural-manner

However, I’m afraid I cannot say when it will be available. It is not on our current roadmap, so it won’t be done this year, and I cannot say anything about later at this moment.

Lastly, we have this feature in our other component, GemBox.Document; see the example "Read and extract PDF text in C# and VB.NET".
However, I must point out that GemBox.Document’s PDF reader never left the BETA stage. For more information, read "Support level for reading PDF format (beta)".

Regards,
Mario

Thanks for the clarification.

We may give GemBox.Document a try to decide whether it meets our needs. By the way, do you have more examples somewhere (e.g. in the GitHub repo) of PDF documents from which tables are successfully extracted, besides https://www.gemboxsoftware.com/document/examples/305/resources/CustomInvoice.pdf from the example?

Best Regards

Hi Ivan,

No, that is the only sample file for that example on GitHub as well:

Regards,
Mario