Image to PDF Page

Theory · September 18, 2023, 11:08am

Hey there, it's me again. I have a case where I have images (single pages). When I follow the example on the site, the image only fills a small part of the page. Also, if I continue to work with those PDFs, they end up with extra content: from images with a total of about 4 MB it creates a document of 15 MB. I wonder how that should be solved.

My code for loading an image as a page:

public void MergeFilesIntoPdf(GroupedFile groupedFile)
{
    string outputFolderPath = "path";
    using (var document = new PdfDocument())
    {
        var firstImagePath = groupedFile.GroupedFiles.FirstOrDefault();

        if (firstImagePath == null)
        {
            return; // No files to merge.
        }

        var firstImage = PdfImage.Load(firstImagePath);
        double pageWidth = firstImage.Width;
        double pageHeight = firstImage.Height;

        foreach (var imagePath in groupedFile.GroupedFiles)
        {
            var page = document.Pages.Add();
            var image = PdfImage.Load(imagePath);
            pageWidth = image.Width;
            pageHeight = image.Height;
            // Draw the image at the top-left corner of each page.
            page.Content.DrawImage(image, new PdfPoint(0, 0));

            // Set the page size for each page.
            page.SetMediaBox(0, 0, pageWidth, pageHeight);
        }

        string pdfFileName = $"{groupedFile.SimilarPart}.pdf";
        string pdfFilePath = Path.Combine(outputFolderPath, pdfFileName);
        document.Save(pdfFilePath);
    }
}

Weirdly, I had a version where the page had the correct size but then the first one didn’t, and other similar things happened. Would you mind telling me:

how to properly load the image as a “page” so that the file size and the page/image scaling are right and, once the scaling is correct, there is no empty first page,
what properties I can set there to reduce the file size, so that after doing OCR etc. it isn’t tripling in file size (PDFs don’t, it’s only the image pages that do that).
Also, your example on Images to PDF is only for one page, but my TIFF file has all pages in one file. Since I haven’t tested it yet: can I just load it like a JPG, or can I directly convert it to a PDF?

I thought PDF documents might get initialized with an empty page, but they aren’t. Then again, sometimes I get an empty first page in the PDF, which is rather confusing.

mario.gembox · September 19, 2023, 7:20am

Hi,

I apologize, but I’m not 100% sure that I understood what you’re asking here.

Are you saying that previously the PdfImage.Load resulted in a different image size when loading the same “imagePath” file?

Your code looks fine, that is how you would load an image into a PDF page.
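If the page size has to stay fixed (for example A4) instead of following the image, the scaling the original question asks about comes down to a single ratio. The following is only a sketch, assuming all values are in the same unit (PDF points); `PageFit` and `FitInside` are hypothetical names and not part of GemBox.Pdf, and how the resulting size is handed to the drawing call is deliberately left out:

```csharp
using System;

public static class PageFit
{
    // Hypothetical helper: returns the largest size that keeps the image's
    // aspect ratio and still fits inside the given page size.
    public static (double Width, double Height) FitInside(
        double imageWidth, double imageHeight,
        double pageWidth, double pageHeight)
    {
        if (imageWidth <= 0 || imageHeight <= 0)
            throw new ArgumentException("Image dimensions must be positive.");

        // The smaller of the two ratios is the one that avoids overflowing the page.
        double scale = Math.Min(pageWidth / imageWidth, pageHeight / imageHeight);
        return (imageWidth * scale, imageHeight * scale);
    }
}
```

For a page of 595 x 842 points, `PageFit.FitInside(image.Width, image.Height, 595, 842)` would give the size to draw the image at; the approach in the posted code (making the page follow the image) does not need this step.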

Unfortunately, GemBox.Pdf currently doesn’t have an API for compressing PDF (reducing PDF size). We do plan to provide this in the future, but at the moment I cannot say exactly when that will be.

You’ll need to load each TIFF frame separately. So, you could convert each TIFF frame to PNG using something like this:
https://stackoverflow.com/questions/3566650/convert-multipage-tiff-to-png-net
Then use the PdfImage.Load method to import those resulting PNG images.
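A minimal sketch of that frame-by-frame conversion, along the lines of the linked Stack Overflow answer. It assumes System.Drawing is available (System.Drawing.Common on newer .NET, which is Windows-only); `TiffSplitter` and `SplitToPng` are hypothetical names used only for illustration:

```csharp
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

public static class TiffSplitter
{
    // Saves every frame (page) of a multi-page TIFF as its own PNG file
    // and returns the paths of the files that were written, in page order.
    public static List<string> SplitToPng(string tiffPath, string outputFolder)
    {
        var pngPaths = new List<string>();
        Directory.CreateDirectory(outputFolder);

        using (var tiff = Image.FromFile(tiffPath))
        {
            int frameCount = tiff.GetFrameCount(FrameDimension.Page);
            string baseName = Path.GetFileNameWithoutExtension(tiffPath);

            for (int i = 0; i < frameCount; i++)
            {
                // Make frame i the active one; Save then writes only that frame.
                tiff.SelectActiveFrame(FrameDimension.Page, i);

                string pngPath = Path.Combine(outputFolder, $"{baseName}_{i + 1:D3}.png");
                tiff.Save(pngPath, ImageFormat.Png);
                pngPaths.Add(pngPath);
            }
        }

        return pngPaths;
    }
}
```

Each returned path can then be passed to PdfImage.Load inside the loop from the original post, one page per PNG.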

Yes, the PdfDocument is initialized with an empty PdfDocument.Pages collection, it has no pages.
Regarding the empty first page, I was unable to reproduce this issue.
I would need a small Visual Studio project from you that reproduces this issue so that I can investigate it.

Regards,
Mario

Theory · September 20, 2023, 6:53am

Regarding the image size: it seems that setting the media box to the picture size isn’t working as intended; if I delete that call, it works. Aside from that, I was also unable to reproduce the issue with the empty page (with the same code, don’t ask me why it happened). Also, I have a file which always gives me an OCR error (outside of box range or something). I can share the file with you. I tried converting it to different formats after splitting the TIFF file, but it seems the empty page isn’t something OCR likes very much. It would be nice if you could test with that PDF file. Just let me know how you want me to share it.

Also, I really wish I could reduce the size somehow, especially if we have “ugly” scans where the image has a lot of unnecessary information (regarding OCR). I mean, I could always read out the text, rewrite it on the page, get rid of the image, reduce the image file and put it back as a background or something.

Edit: The OCR error reads as follows:
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

github.com/tesseract-ocr/tesseract

Tesseract Empty Page

opened 08:35AM - 16 Jun 20 UTC

M3ssman

bug bounding box binarization

### Environment
* **Tesseract Version**: tesseract 4.1.1-rc2-21-gf4ef lepto…nica-1.78.0 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0 Found AVX2 Found AVX Found FMA Found SSE Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
* **Platform**: Ubuntu 18.04 LTS
* Model Configs tested: `frk`, `Fraktur` (from `tessdata_best`), `gt4hist_5000k` (gt4hist-Model with 5000k Iterations)

### Current Behavior:
When using rather large uncompressed TIF-Files (ca. 80 MB) from Project "Digitalisierung historischer deutscher Zeitschriften" for about 5 Pages (or even less) of 1000 Images we get ALTO-Files missing valid OCR-Date. When run with `tesseract 0046.tif 0046 -l frk alto` it only alerts `Empy Page!!` and exits in < 20 seconds.
[0046-alto.zip](https://github.com/tesseract-ocr/tesseract/files/4785292/0046-alto.zip) [0046-tif.zip](https://send.firefox.com/download/fef4efcbc2b40db3/#nkNQ5MBdts8lro_HgtS5TQ)
Generated ALTO-File and TIF-Image included.

### Expected Behavior:
Produce ALTO-XML with contents.

### Suggested Fix:
No idea.

This issue leads to my next question: can I set these arguments?

github.com/tesseract-ocr/tesseract


Kindly
K

mario.gembox · September 21, 2023, 6:31am

I was unable to reproduce this, please send us a small Visual Studio project that reproduces your issue so that we can investigate it.

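Regarding sharing the file that triggers the OCR error: yes, please do.

Regarding reading out the text and putting a reduced image back as a background: yes, you could try something like that.

As mentioned in my previous message, note that in the future this will be easier to accomplish (when we introduce a compression feature in GemBox.Pdf).

Regarding setting those Tesseract arguments: currently, there is no public API for this.
Nevertheless, we’ll investigate your file and see what we can do about this.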

Theory · September 21, 2023, 7:46am

How do you want me to share the file?

mario.gembox · September 21, 2023, 8:02am

Whatever is easier for you. You can send us a download link for it, send it via email, or send it by submitting a support ticket (see Contact page).

mario.gembox · September 25, 2023, 6:30pm

Hi,

Please try again with this latest bugfix version:

Install-Package GemBox.Pdf -Version 17.0.1418-hotfix
Install-Package GemBox.Pdf.Ocr -Version 17.0.1418-hotfix

I hope this helps.

Regards,
Mario

Theory · September 26, 2023, 11:54am

Nice, it worked. If I’m allowed to ask: what did you do :D? Not just suppress the error, you wouldn’t do that, would you ;D (please don’t take this as an insult, it’s meant as a joke).

Thank you very much <3

mario.gembox · September 26, 2023, 12:16pm

Yes, we just suppressed the error because the resulting output looks ok.

Also, we are unable to expose that thresholding_method option at the moment because the interop that we’re currently using doesn’t have direct support for it, so implementing it would require quite some time. Nevertheless, it is doable, and if we get more requests for this we will work on it.

But as mentioned above, in your case, the result is fine so you wouldn’t see a difference even with the thresholding_method usage.

Last, just as an FYI, you could still preprocess the image yourself and do the binarization (that is what the option controls) and then pass the image to OcrReader.
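A rough sketch of what such a preprocessing step could look like, assuming System.Drawing is available (Windows-only on newer .NET). It uses a plain fixed global threshold, which is simpler than the Otsu or Sauvola methods Tesseract chooses between, and `Binarizer`, `BinarizeToPng` and the default of 128 are hypothetical choices for illustration only:

```csharp
using System.Drawing;
using System.Drawing.Imaging;

public static class Binarizer
{
    // Writes a pure black-and-white copy of the input image as PNG.
    // Pixels darker than 'threshold' (0..255) become black, all others white.
    public static void BinarizeToPng(string inputPath, string outputPath, int threshold = 128)
    {
        using (var source = new Bitmap(inputPath))
        using (var result = new Bitmap(source.Width, source.Height, PixelFormat.Format24bppRgb))
        {
            // Keep the scan resolution so the DPI information is not lost.
            result.SetResolution(source.HorizontalResolution, source.VerticalResolution);

            for (int y = 0; y < source.Height; y++)
            {
                for (int x = 0; x < source.Width; x++)
                {
                    Color c = source.GetPixel(x, y);

                    // Weighted luminance (ITU-R BT.601 weights), range 0..255.
                    double luminance = 0.299 * c.R + 0.587 * c.G + 0.114 * c.B;

                    result.SetPixel(x, y, luminance < threshold ? Color.Black : Color.White);
                }
            }

            result.Save(outputPath, ImageFormat.Png);
        }
    }
}
```

GetPixel/SetPixel are slow on large scans, so this is only meant to show the idea; the resulting PNG would then be the file handed to OcrReader instead of the original scan.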

Regards,
Mario

Theory · September 26, 2023, 2:08pm

Alright, thank you very much for the update. I will see how it works; if we run into an error I will do so. Thanks, and have a beautiful day.

K

Theory · September 26, 2023, 3:31pm

Just to make sure: have you read through the links I added? Because if the error rate is 30%, IMO it’s something you guys should fix, NGL. No offense, but that would not really be tolerable, since one guy on GitHub ran a huge test with something like 50k scanned pages.

mario.gembox · September 27, 2023, 9:04am

Sorry, what exactly are you referring to?

The thing that we read in that GitHub issue was the following:

6587 scans of 371629 finished with “Empty page” when Tesseract used the default binarization.

Which is ~2%.
Also, he continues:

105 scans of those 6587 still were “Empty page” when Tesseract was used with -c thresholding_method=2" …

That doesn’t mean that if he used thresholding_method=2 on all 300,000 documents he would get only 105 empty pages. He would probably get a lot of empty pages where the other binarization worked.
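For reference, the two ratios behind those numbers can be checked with a few lines. This is just the arithmetic on the figures quoted above (`EmptyPageRates` is a hypothetical name): 6587 of 371629 is about 1.77%, and 105 of 6587 is about 1.59% of the re-run scans, which says nothing about the scans that were never re-run with the other method.

```csharp
using System;
using System.Globalization;

public static class EmptyPageRates
{
    // Share of 'part' in 'whole', expressed as a percentage.
    public static double Percent(int part, int whole) => 100.0 * part / whole;

    public static void Main()
    {
        const int totalScans = 371629;          // all scans in the quoted test
        const int emptyWithDefault = 6587;      // "Empty page" with default binarization
        const int stillEmptyWithMethod2 = 105;  // of those 6587, still empty with thresholding_method=2

        // Rate over all scans with the default binarization.
        Console.WriteLine(Percent(emptyWithDefault, totalScans)
            .ToString("F2", CultureInfo.InvariantCulture) + "% of all scans");

        // Rate over the re-run subset only; not comparable to the first number.
        Console.WriteLine(Percent(stillEmptyWithMethod2, emptyWithDefault)
            .ToString("F2", CultureInfo.InvariantCulture) + "% of the re-run scans");
    }
}
```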

Anyway, if/when we introduce the support for this option then users will be able to play around with this value and see what fits them better. But note that there is no one value that’ll fit everyone.

Theory · September 27, 2023, 9:41am

Then my memory tricked me, thanks for clarifying and sorry.