How to extract tabular data from PDF?

immitev · July 22, 2021, 12:42pm

Is there a straightforward way to extract tables from a PDF? I couldn’t find whether such a feature is supported.

From the "Read text from PDF files with C# / VB.NET applications" example I can see how to extract text and some extra info like Bounds and Format. Maybe one can deduce from that information whether there is a table in the PDF and what its structure is.

mario.gembox · July 23, 2021, 8:29am

Hi Ivan,

Unfortunately no, there is currently no easy or straightforward way to do this with GemBox.Pdf.
You would need to read both text elements and path elements (lines) to achieve the desired extraction.
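
To illustrate the idea, here is a minimal, library-agnostic sketch in Python. It does not use the GemBox.Pdf API. It assumes you have already collected the coordinates of the horizontal and vertical ruling lines (from the path elements) and the center point of each text element's bounds. The function name `build_table` and the sample data are hypothetical, and the sketch only handles axis-aligned lines that form a regular grid without merged cells.

```python
from bisect import bisect_right


def build_table(h_lines, v_lines, fragments):
    """Place text fragments into a grid defined by ruling lines.

    h_lines:   y coordinates of horizontal ruling lines
    v_lines:   x coordinates of vertical ruling lines
    fragments: (text, x, y) tuples, where (x, y) is the center of the
               text element's bounds
    """
    # The PDF y axis points up, so order row boundaries from top to bottom.
    ys = sorted(set(h_lines), reverse=True)
    xs = sorted(set(v_lines))
    rows, cols = len(ys) - 1, len(xs) - 1
    table = [["" for _ in range(cols)] for _ in range(rows)]
    neg_ys = [-y for y in ys]  # ascending order, as bisect requires

    # Visit fragments in reading order: top to bottom, then left to right.
    for text, x, y in sorted(fragments, key=lambda f: (-f[2], f[1])):
        col = bisect_right(xs, x) - 1
        row = bisect_right(neg_ys, -y) - 1
        # Fragments outside the grid (e.g. page footers) are skipped.
        if 0 <= row < rows and 0 <= col < cols:
            cell = table[row][col]
            table[row][col] = f"{cell} {text}" if cell else text
    return table


# Hypothetical sample data: a 2 x 2 table plus a footer outside the grid.
h_lines = [700, 680, 660]
v_lines = [50, 200, 350]
fragments = [
    ("Item", 100, 690), ("Price", 250, 690),
    ("Widget", 100, 670), ("9.99", 250, 670),
    ("Page 1", 300, 40),
]
print(build_table(h_lines, v_lines, fragments))
# [['Item', 'Price'], ['Widget', '9.99']]
```

Real documents usually need extra handling, such as tolerance for slightly misaligned lines, merged cells, and tables drawn without any ruling lines.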

Note that we do have a feature request for this, so please feel free to vote for it to increase its priority:
https://support.gemboxsoftware.com/community/view/export-text-in-a-structural-manner

But I’m afraid that I cannot say when it will be available. It is not on our current roadmap, so it won’t be done this year, and I cannot say anything about later at this moment.

Lastly, we have this feature in our other component, GemBox.Document; see the "Read and extract PDF text in C# and VB.NET" example.
However, I must point out that GemBox.Document’s PDF reader never left the beta stage. For more information, read "Support level for reading PDF format (beta)".

Regards,
Mario

immitev · July 27, 2021, 11:07am

Thanks for the clarification.

We may give GemBox.Document a try to decide if it meets our needs. By the way, do you have somewhere (e.g. in the GitHub repo) more examples of PDF documents from which tables are successfully extracted, besides https://www.gemboxsoftware.com/document/examples/305/resources/CustomInvoice.pdf from the example?

Best Regards

mario.gembox · July 27, 2021, 1:21pm

Hi Ivan,

No, that is the only sample file for that example on GitHub as well:

Regards,
Mario