Trouble loading PDF document

Arild · March 17, 2021, 1:35pm

Hi

I’m trying to load a PDF document using GemBox.Pdf and the Load method like this:

var pdf = PdfDocument.Load(@"C:\Temp\test.pdf");

The problem is that I get an exception saying "Invalid character ‘R’ was read at index 0 in keyword “obj”". I can open the PDF document without any errors using my browser, Adobe Acrobat, or Foxit Reader.

Is it a bug somewhere or is GemBox.Pdf very strict compared to regular PDF readers?

Thank you

Arild

mario.gembox · March 18, 2021, 2:39am

Hi Arild,

Yes, GemBox.Pdf is very strict.
However, we do remove some restrictions if we notice that other PDF applications ignore some irregularity.

Can you send us your “test.pdf” file so that we can investigate this?

Regards,
Mario

Arild · March 18, 2021, 8:45am

Hi Mario

Thank you for your quick response

I would love to send you the PDF file, but since it contains sensitive information, I can’t do so without first removing a lot of text. I tried to manually remove all the streams from the PDF using Notepad++, but I could not do that while keeping the same error message. If you have any suggestions for how I can remove the text, please let me know.

Regards
Arild

mario.gembox · March 18, 2021, 9:30am

Hi Arild,

I’m afraid that’s not something that can easily be done in a text editor.

Perhaps you could consider sending us your document privately by submitting a support ticket.

Regards,
Mario

stipo.gembox · March 19, 2021, 12:03pm

Hi Arild,

Please try to apply the following code on your file:

using System;
using System.IO;
using System.Text.RegularExpressions;

var desktop = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);

var fileName = "your-file-name.pdf";

var str = new string(Array.ConvertAll(File.ReadAllBytes(Path.Combine(desktop, fileName)), b => (char)b));

if (str.Contains("/XRef") || str.Contains("/ObjStm"))
{
    Console.WriteLine("PDF file probably contains cross-reference stream or object stream and cannot be processed.");
    return;
}

// Strip all streams.
str = Regex.Replace(str, @"(?<start>\bstream(\r\n|\n))(?<data>(.|\n)*?)(?<end>(\r\n|\r|\n)endstream\b)", m => m.Groups["start"].Value + new string(' ', m.Groups["data"].Length) + m.Groups["end"].Value);

// Strip all literal strings.
bool hasBalancedParentheses;
do
{
    hasBalancedParentheses = false;
    str = Regex.Replace(str, @"\((.|\n)*?\)", m =>
    {
        var lastIndex = m.Value.LastIndexOf('(');
        if (lastIndex > 0 && m.Value[lastIndex - 1] != '\\')
        {
            hasBalancedParentheses = true;
            return '(' + new string(' ', m.Length - 1);
        }

        return '(' + new string(' ', m.Length - 2) + ')';
    });
}
while (hasBalancedParentheses);

// Strip all hexadecimal strings.
str = Regex.Replace(str, @"((?<!<)<(?!<))[^>]*?>", m => '<' + new string('0', m.Length - 2) + '>');

File.WriteAllBytes(Path.Combine(desktop, Path.GetFileNameWithoutExtension(fileName) + "-stripped" + Path.GetExtension(fileName)), Array.ConvertAll(str.ToCharArray(), c => (byte)c));

If your file contains neither cross-reference streams (/XRef) nor object streams (/ObjStm), this code should replace the content of all streams and literal strings with spaces, and the content of all hexadecimal strings with zeros, without invalidating the cross-reference table, because every replacement preserves the original byte length.

Changing other objects is not required because PDF encryption encrypts only strings and streams.
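As a rough sanity check of that length-preserving property, something like the following could be run on the original and stripped content. This is only a sketch and not part of GemBox.Pdf: the names OffsetCheck and SameLayout are hypothetical, and it assumes both files were read into strings one char per byte, the same way as in the code above.

```csharp
using System;
using System.Text.RegularExpressions;

static class OffsetCheck
{
    // Returns true when the stripped text has the same length as the original
    // and every "N G obj" header found in the stripped text sits at the same
    // offset, with the same characters, in the original text.
    // Hypothetical helper, not a GemBox.Pdf API.
    public static bool SameLayout(string original, string stripped)
    {
        // A different length means some byte offsets must have moved.
        if (original.Length != stripped.Length)
            return false;

        // Object headers look like "12 0 obj" (object number, generation, keyword).
        foreach (Match m in Regex.Matches(stripped, @"\b\d+ \d+ obj\b"))
        {
            // Compare the header against the same position in the original.
            if (string.CompareOrdinal(original, m.Index, m.Value, 0, m.Length) != 0)
                return false;
        }

        return true;
    }
}
```

Calling OffsetCheck.SameLayout with the original string and the stripped str should return true if no object has moved. It does not validate the cross-reference table itself; it only checks that the object headers are still at the offsets they had before.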

Regards,
Stipo

Arild · March 19, 2021, 12:58pm

Thank you, Stipo! I was now able to anonymize the PDF document while maintaining the "Invalid character ‘R’ was read at index 0 in keyword “obj”" exception. I will upload the PDF in a support ticket.