Image to PDF Page

Hey there, it's me again. I have a case where I have images (single pages). When I follow the example on the site, the image is only scaled to a small fraction of the page. Also, if I continue to work with those PDFs, they end up with extra content: from images totalling about 4 MB it creates a document of 15 MB. I wonder how that should be solved.

My code for loading an image as a page:

public void MergeFilesIntoPdf(GroupedFile groupedFile)
{
    string outputFolderPath = "path";
    using (var document = new PdfDocument())
    {
        // Nothing to merge if the group contains no files.
        if (groupedFile.GroupedFiles.FirstOrDefault() == null)
        {
            return;
        }

        foreach (var imagePath in groupedFile.GroupedFiles)
        {
            var page = document.Pages.Add();
            var image = PdfImage.Load(imagePath);

            // Draw the image at the page origin (0, 0); in PDF coordinates
            // that is the bottom-left corner of the page.
            page.Content.DrawImage(image, new PdfPoint(0, 0));

            // Set the page size for each page to the size of its image.
            page.SetMediaBox(0, 0, image.Width, image.Height);
        }
        }

        string pdfFileName = $"{groupedFile.SimilarPart}.pdf";
        string pdfFilePath = Path.Combine(outputFolderPath, pdfFileName);
        document.Save(pdfFilePath);
    }
}

Weirdly, I had a version where the page had the correct size, but then the first one didn’t, and other similar things happened. Would you mind telling me:

  • how to properly load the image as a “page” so that the file size and the page/image scaling are correct and, once the scaling is correct, there is no empty first page,
  • what properties I can set to reduce the file size so that it doesn’t triple after doing OCR etc. (PDFs don’t, only the image pages do that),
  • also, your Images to PDF example covers only one page, but my TIFF file has all pages in one file. I haven’t tested it yet: can I just load it like a JPG, or can I convert it directly to a PDF?

I thought PDF documents might get initialized with an empty page, but they don’t. Then again, sometimes I get an empty first page in the PDF, which is rather confusing.

Hi,

Apologies, but I’m not 100% sure that I understood what you’re asking here.

Are you saying that previously the PdfImage.Load resulted in a different image size when loading the same “imagePath” file?

Your code looks fine, that is how you would load an image into a PDF page.

Unfortunately, GemBox.Pdf currently doesn’t have an API for compressing PDF (reducing PDF size). We do plan to provide this in the future, but at the moment I cannot say exactly when that will be.

You’ll need to load each TIFF frame separately. So, you could convert each TIFF frame to PNG using something like this:
https://stackoverflow.com/questions/3566650/convert-multipage-tiff-to-png-net
Then use the PdfImage.Load method to import those resulting PNG images.
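A minimal sketch of the frame-splitting step described above, assuming System.Drawing is available (on .NET 6 and later, System.Drawing.Common is supported on Windows only). The class name TiffSplitter, the method SplitToPng, and the output file naming are hypothetical, chosen just for illustration:

```csharp
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

public static class TiffSplitter
{
    // Saves every frame (page) of a multi-page TIFF as a separate PNG file
    // and returns the paths of the created files in page order.
    public static List<string> SplitToPng(string tiffPath, string outputFolder)
    {
        var pngPaths = new List<string>();
        string baseName = Path.GetFileNameWithoutExtension(tiffPath);

        using (var tiff = Image.FromFile(tiffPath))
        {
            // For TIFF files, the pages are exposed through FrameDimension.Page.
            int frameCount = tiff.GetFrameCount(FrameDimension.Page);

            for (int i = 0; i < frameCount; i++)
            {
                // Make frame i the active one, then save it as PNG.
                tiff.SelectActiveFrame(FrameDimension.Page, i);
                string pngPath = Path.Combine(outputFolder, $"{baseName}_{i + 1}.png");
                tiff.Save(pngPath, ImageFormat.Png);
                pngPaths.Add(pngPath);
            }
        }

        return pngPaths;
    }
}
```

Each returned path could then be passed to PdfImage.Load in the same loop that adds the pages.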

Yes, the PdfDocument is initialized with an empty PdfDocument.Pages collection, it has no pages.
Regarding the empty first page, I was unable to reproduce this issue.
I would need a small Visual Studio project from you that reproduces this issue so that I can investigate it.

Regards,
Mario

Regarding the image size: it seems setting the media box to the picture size isn’t working as intended; if I delete it, it works. Aside from that, I was also unable to reproduce the issue with the empty page (with the same code, don’t ask me why it happened). Also, I have a file which always gives me an OCR error (outside of box range or something). I can share the file with you. I tried converting it to different formats after splitting the TIFF file, but it seems the empty page isn’t something OCR likes very much. It would be nice if you could test with that PDF file. Just let me know how you want me to share it.
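One possible explanation for the media box observation is a unit mismatch. This is an assumption, it is not confirmed anywhere in this thread. PDF page boxes are measured in points (1/72 inch), so if a width and height in pixels were passed to SetMediaBox, the page would come out larger than the drawn image. Whether PdfImage.Width and PdfImage.Height report pixels or points is something to check in the GemBox.Pdf documentation. The helper below is a hypothetical sketch of the conversion only; the names PageSizing and PixelsToPoints are not part of any library:

```csharp
using System;

public static class PageSizing
{
    // Converts a pixel dimension to PDF points (1 point = 1/72 inch)
    // for an image scanned at the given resolution in dots per inch.
    public static double PixelsToPoints(int pixels, double dpi)
    {
        if (dpi <= 0)
            throw new ArgumentOutOfRangeException(nameof(dpi), "DPI must be positive.");

        return pixels * 72.0 / dpi;
    }
}
```

For example, an A4 page scanned at 300 DPI is 2480 x 3508 pixels, which corresponds to 595.2 x 841.92 points.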

Also, I really wish I could reduce the size somehow, especially for “ugly” scans where the image has a lot of unnecessary information (regarding OCR). I mean, I could always read out the text, rewrite it on the page, remove the image, reduce the image file, and put it back as a background or something.

Edit: The OCR error reads as follows:
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

This issue leads to my next question: can I set these arguments?

Kindly
K

I was unable to reproduce this, please send us a small Visual Studio project that reproduces your issue so that we can investigate it.

Yes, please do.

Yes, you could try something like that.

As mentioned in my previous message, note that in the future this will be easier to accomplish (when we introduce a compression feature in GemBox.Pdf).

Currently, there is no public API for this.
Nevertheless, we’ll investigate your file and see what we can do about this.

How do you want me to share the file? :)

Whatever is easier for you. You can send us a download link for it, send it via email, or send it by submitting a support ticket (see Contact page).

Hi,

Please try again with this latest bugfix version:

Install-Package GemBox.Pdf -Version 17.0.1418-hotfix
Install-Package GemBox.Pdf.Ocr -Version 17.0.1418-hotfix

I hope this helps.

Regards,
Mario

Nice, it worked. If I’m allowed to ask: what did you do? :D You wouldn’t just suppress the error, would you? ;D (Please don’t take this as an insult, it’s meant as a joke.)

Thank you very much <3

Yes, we just suppressed the error because the resulting output looks ok.

Also, we were unable to expose that thresholding_method at the moment because the interop that we’re currently using doesn’t have direct support for it so it would require quite some time to implement its support. Nevertheless, it is doable and if we get more requests for this we will work on it.

But as mentioned above, in your case, the result is fine so you wouldn’t see a difference even with the thresholding_method usage.

Last, just as an FYI, you could still preprocess the image yourself and do the binarization (that is what the option controls) and then pass the image to OcrReader.
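A minimal sketch of such a preprocessing step, assuming the image has already been decoded into an array of 8-bit grayscale pixel values. It uses a plain global Otsu threshold purely as an illustration; this is not necessarily what any particular thresholding_method value does, and the class and method names are hypothetical:

```csharp
using System;

public static class Binarizer
{
    // Computes a global Otsu threshold for 8-bit grayscale pixels by
    // maximizing the between-class variance of background and foreground.
    public static int OtsuThreshold(byte[] gray)
    {
        var histogram = new long[256];
        foreach (byte value in gray) histogram[value]++;

        long total = gray.Length;
        double sumAll = 0;
        for (int i = 0; i < 256; i++) sumAll += i * (double)histogram[i];

        double sumBackground = 0;
        long weightBackground = 0;
        double bestVariance = -1;
        int bestThreshold = 0;

        for (int t = 0; t < 256; t++)
        {
            weightBackground += histogram[t];
            if (weightBackground == 0) continue;   // no pixels at or below t yet

            long weightForeground = total - weightBackground;
            if (weightForeground == 0) break;      // all pixels are at or below t

            sumBackground += t * (double)histogram[t];
            double meanBackground = sumBackground / weightBackground;
            double meanForeground = (sumAll - sumBackground) / weightForeground;
            double diff = meanBackground - meanForeground;
            double variance = (double)weightBackground * weightForeground * diff * diff;

            if (variance > bestVariance)
            {
                bestVariance = variance;
                bestThreshold = t;
            }
        }

        return bestThreshold;
    }

    // Maps pixels above the threshold to white (255) and the rest to black (0).
    public static byte[] Binarize(byte[] gray, int threshold)
    {
        var result = new byte[gray.Length];
        for (int i = 0; i < gray.Length; i++)
            result[i] = gray[i] > threshold ? (byte)255 : (byte)0;
        return result;
    }
}
```

The binarized pixels would then need to be written back to an image file before being handed to OCR; that step depends on the imaging library in use and is left out here.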

Regards,
Mario

Alright, thank you very much for the update. I will see how it works; if we run into an error, I will do so. Thanks, and have a beautiful day.

K

Just to make sure: have you read through the links I added? Because if the error rate is 30%, in my opinion it’s something you should fix, not gonna lie. No offense, but that would not really be tolerable. One guy on GitHub ran a huge test with something like 50k scanned pages :P

Sorry, what exactly are you referring to?

The thing that we read in that GitHub issue was the following:

6587 scans of 371629 finished with “Empty page” when Tesseract used the default binarization.

Which is ~2%.
Also, he continues:

105 scans of those 6587 still were “Empty page” when Tesseract was used with “-c thresholding_method=2” …

That doesn’t mean that if he used thresholding_method=2 on all 300,000 documents that he would get only 105 empty pages. He would probably get a lot of empty pages where the other binarization worked.
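The percentages behind the figures quoted above can be checked with a small helper. This is only a sketch of the arithmetic; the names EmptyPageStats and Percent are hypothetical:

```csharp
using System;

public static class EmptyPageStats
{
    // Returns what percentage of 'whole' is represented by 'part'.
    public static double Percent(long part, long whole)
    {
        if (whole <= 0)
            throw new ArgumentOutOfRangeException(nameof(whole), "Total must be positive.");

        return 100.0 * part / whole;
    }
}
```

Percent(6587, 371629) gives roughly 1.77, the "~2%" above. Percent(105, 6587) gives roughly 1.59, but that is a share of the 6587 retried scans only, not of all documents.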

Anyway, if/when we introduce the support for this option then users will be able to play around with this value and see what fits them better. But note that there is no one value that’ll fit everyone.

Then my memory tricked me, thanks for clarifying and sorry.