Reading a PDF document

I am loading a document with many pages, each of which looks something like an invoice. When I load the entire page text as shown in the “reading text” example, the result isn’t usable because the text has no order: items from all over the page are strung together with spaces between the words.

I looked closer using the other example, “reading additional information about text”. However, the elements come back in no particular order there either, and even words are split. For instance, “Account Number: 30012345” on the document might have “Acc” as one text element and “ount:” as the next one, followed by multiple lines of other data, with “30012345” appearing somewhere further down the list.

What I need is a way to read the page text from top to bottom, left to right, so I can coordinate values. Can this be done?

Hi James,

Notice that the “Reading additional information about a text” example gives you the position of each text element. If you order the elements by their position, you will get the expected text.

We do plan to provide a straightforward API for that, but for now, you can use this ToVisuallyOrderedString extension method:

using System;
using System.Collections.Generic;
using System.Text;
using GemBox.Pdf.Content;

namespace GemBox.Pdf
{
    internal static class PdfContentExtensions
    {
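        // Returns the text of the group line by line, ordered by descending Y (top to bottom
        // in the default PDF coordinate system) and, within a line, by ascending X.
        // maxLocationYDelta: an element joins the current line while its Y location is less
        //   than this far below the first element of that line.
        // maxNonSpaceHorizontalDistance: a ' ' is inserted between two neighboring elements
        //   on a line when the horizontal gap between them is larger than this.
        public static string ToVisuallyOrderedString(this PdfContentGroup group, PdfMatrix transform = default, double maxLocationYDelta = 1, double maxNonSpaceHorizontalDistance = 1)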
        {
            var list = new List<PdfTextContentInfo>();

            // Gather all PdfTextContent elements, their bounds and locations.
            var contentEnumerator = group.Elements.All(transform).GetEnumerator();
            while (contentEnumerator.MoveNext())
                if (contentEnumerator.Current.ElementType == PdfContentElementType.Text)
                {
                    var element = (PdfTextContent)contentEnumerator.Current;

                    transform = contentEnumerator.Transform;

                    var bounds = element.Bounds;
                    transform.Transform(ref bounds);

                    var location = transform.Transform(element.Location);

                    list.Add(new PdfTextContentInfo(element, bounds, location));
                }

            // Sort them vertically.
            list.Sort((a, b) => Math.Sign(b.Location.Y - a.Location.Y));

            // Group them into lines.
            var lines = new List<List<PdfTextContentInfo>>();
            List<PdfTextContentInfo> currentLine = null;
            double initialLocationY = 0;
            foreach (var item in list)
            {
                var locationY = item.Location.Y;
                if (currentLine == null)
                {
                    currentLine = new List<PdfTextContentInfo>() { item };
                    initialLocationY = locationY;
                }
                else
                {
                    var locationYDelta = initialLocationY - locationY;
                    if (locationYDelta < maxLocationYDelta)
                        currentLine.Add(item);
                    else
                    {
                        lines.Add(currentLine);

                        currentLine = new List<PdfTextContentInfo>() { item };
                        initialLocationY = locationY;
                    }
                }
            }
            if (currentLine != null)
                lines.Add(currentLine);

            var sb = new StringBuilder();

            foreach (var line in lines)
            {
                // Sort horizontally.
                line.Sort((a, b) => Math.Sign(a.Location.X - b.Location.X));

                sb.Append(line[0].Element.ToString());
                var previousTextRightEdge = line[0].Bounds.Right;

                for (int i = 1; i < line.Count; ++i)
                {
                    var item = line[i];

                    // Add ' ' between, if needed.
                    var horizontalDistance = item.Location.X - previousTextRightEdge;
                    if (horizontalDistance > maxNonSpaceHorizontalDistance)
                        sb.Append(' ');

                    sb.Append(item.Element.ToString());
                    previousTextRightEdge = item.Bounds.Right;
                }

                sb.AppendLine();
            }

            return sb.ToString();
        }

        private readonly struct PdfTextContentInfo
        {
            public readonly PdfTextContent Element;

            public readonly PdfQuad Bounds;

            public readonly PdfPoint Location;

            public PdfTextContentInfo(PdfTextContent element, PdfQuad bounds, PdfPoint location)
            {
                this.Element = element;
                this.Bounds = bounds;
                this.Location = location;
            }

            public override string ToString() => this.Element.ToString() + ' ' + this.Location.ToString();
        }
    }
}

And here is how you can use it:

using (var document = PdfDocument.Load("input.pdf"))
{
    foreach (var page in document.Pages)
    {
        string text = page.Content.ToVisuallyOrderedString();
        // ...
    }
}
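
Once the text is ordered, a label and its value should normally end up on the same line, so you can match them with a regular expression. Here is a minimal sketch of that idea. It does not use GemBox.Pdf at all; the FindValue helper and the sample text are made up for illustration, and it assumes that your labels are followed by a colon as in “Account Number: 30012345”:

```csharp
using System;
using System.Text.RegularExpressions;

internal static class LabelledValueExample
{
    // Hypothetical helper (not part of GemBox.Pdf): returns the text that follows
    // "<label>:" on the same line of the ordered text, or null if there is none.
    public static string FindValue(string orderedText, string label)
    {
        // [ \t]* instead of \s* keeps the match from spilling over to the next line.
        // \r?$ with RegexOptions.Multiline accepts both "\r\n" and "\n" line endings.
        var pattern = Regex.Escape(label) + @"[ \t]*:[ \t]*(?<value>\S[^\r\n]*?)[ \t]*\r?$";
        var match = Regex.Match(orderedText, pattern, RegexOptions.Multiline);
        return match.Success ? match.Groups["value"].Value : null;
    }

    public static void Main()
    {
        // Made-up stand-in for the string returned by ToVisuallyOrderedString().
        var text = "ACME Ltd. Invoice\r\nAccount Number: 30012345\r\nTotal: 99.50\r\n";

        Console.WriteLine(FindValue(text, "Account Number")); // prints 30012345
    }
}
```

If a label and its value come out on different lines, or words run together, try adjusting the maxLocationYDelta and maxNonSpaceHorizontalDistance arguments of ToVisuallyOrderedString before changing the pattern.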

Regards,
Mario

Fantastic. Works like a charm. Glad I didn’t have to figure this out myself; it would have taken me forever. This is exactly what I needed. Thanks.