What could be the reason why the same PDF, saved as a PNG, results in different PNGs depending on the machine?
Visually the PNG files are the same, but they are different in size. Exactly the same code is used on both machines.
Hi Jochen,
Are you using the same version of GemBox.Pdf on both machines?
Can you check the following:
string version = GemBox.Pdf.ComponentInfo.Version;
Regards,
Mario
Hi Mario,
The version seems to be the same. To be sure, I cleared my NuGet cache, which forced a re-fetch from the NuGet servers, but it didn’t help.
I also discovered that GemBox.Pdf is probably not causing the issue. I thought the PDFs were equal, but actually they aren’t.
The flow I’m using is Word doc => PDF => PNG. The PNG is used as a means to compare the PDF in a unit test.
The conversion from Word to PDF is done via GemBox.Document. The unit test then uses GemBox.Pdf to convert the PDF to a PNG.
Anyway, I discovered that the PDFs generated by GemBox.Document are not equal. Visually they are, but there is a size difference:
Regards,
Jochen
Hi Jochen,
Can you send us your PDF and PNG files so we can investigate them?
Also, can you send us your Word file so I can try to reproduce your outputs?
Regards,
Mario
I’ve created a small sample app, which you can download here: Gembox.zip (LimeWire)
using System;
using System.IO;
using GemBox.Pdf;

internal static class Program
{
    static Program()
    {
        // Free mode license key.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");
    }

    private static void Main(string[] args)
    {
        // Load PDF
        var pdf = PdfDocument.Load(File.OpenRead("doc.pdf"), PdfLoadOptions.Default);

        // Save as PNG
        var imageOptions = new ImageSaveOptions(ImageSaveFormat.Png);
        pdf.Save(File.OpenWrite("doc.current.png"), imageOptions);

        // Compare the size of the new PNG with the PNG produced on another machine
        Console.WriteLine($"Current file size: {new FileInfo("doc.current.png").Length}");
        Console.WriteLine($"Original file size: {new FileInfo("doc.other.png").Length}");
    }
}
The result on one computer is the following:
Current file size: 41924
Original file size: 35540
But on other machines the sizes are similar. The image itself always looks the same, but it’s annoying because this piece of code is used in unit tests to verify that documents remain the same after .NET, GemBox, or other updates.
Is the “doc.other.png” generated from the same input “doc.pdf” or a different PDF file?
If it’s from a different PDF file, can you send that as well?
Also, can I presume the same DOCX file is used to generate those PDF files? Can you send that as well?
Hi Mario,
I’ve tried this code on different machines and it seems that it always fails on Windows 11 24H2. But it does work on all other Windows versions, for example Windows 11 23H2.
Everything is generated with the same pdf. There is a docx also, but it’s harder to share this one.
But I picked a random pdf from the web and the result is similar.
Current file size: 983040
Original file size: 1033391
The new code can be found at Gembox.zip
Hi Jochen,
I apologize for the late response. It took us some time to analyze your files, locate the differences, and determine what is causing them.
We analyzed the differences between files magic.windows.11.23H2.png and magic.windows.11.24H2.png from your GemBox.zip solution.
The file magic.windows.11.24H2.png is 983,040 bytes, while the file magic.windows.11.23H2.png is 1,033,391 bytes, which makes the magic.windows.11.24H2.png 50,351 bytes smaller.
The following screenshot shows the differences between PNG datastream chunks:
Further analysis of the uncompressed IDAT chunks shows that the 24H2 file is missing the last 123 pixel rows. Analysis of the unfiltered pixel data PngFilterType-23H2.txt and PngFilterType-24H2.txt shows that row 2839 (zero-indexed) uses filter type Sub in the 23H2 and Up in the 24H2. The last 123 rows from the 23H2 (that are missing in the 24H2) use the None filter.
Analysis of the Bgra32 pixel data Row-B-G-R-diff.txt and Row-B-G-R-A-diff.txt shows that the majority of the differences are due to differences in the alpha channel. Differences in the last 123 rows (from row 3910 to row 4031) are hard-coded to 127-127-127-127 as these rows are missing in the 24H2 file.
The conclusion is that these files are almost identical, except for small differences in the transparency of pixels, and that the 24H2 file is missing the last 123 rows of pixels and the IEND chunk.
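The chunk-level check described above can be reproduced with a few lines of code. The following is only a minimal sketch, not the tool used for the analysis in this thread: the class name PngChunks and its members are hypothetical, it relies solely on the PNG chunk layout (4-byte big-endian length, 4-byte type, data, 4-byte CRC), and it does not verify CRCs.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not part of GemBox.Pdf.
public static class PngChunks
{
    private static readonly byte[] Signature = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };

    // Lists every chunk in file order: its type, its declared data length,
    // and whether the whole chunk (including the CRC) is present in the file.
    public static List<(string Type, uint Length, bool Complete)> Read(byte[] png)
    {
        if (png.Length < 8 || !png.AsSpan(0, 8).SequenceEqual(Signature))
            throw new InvalidDataException("Not a PNG datastream.");

        var chunks = new List<(string Type, uint Length, bool Complete)>();
        long pos = 8;
        while (pos + 8 <= png.Length)
        {
            uint length = BinaryPrimitives.ReadUInt32BigEndian(png.AsSpan((int)pos, 4));
            string type = Encoding.ASCII.GetString(png, (int)pos + 4, 4);
            long end = pos + 12 + length; // 4 (length) + 4 (type) + data + 4 (CRC)
            bool complete = end <= png.Length;
            chunks.Add((type, length, complete));
            if (!complete) break; // the file ends in the middle of this chunk
            pos = end;
        }
        return chunks;
    }

    // A valid PNG datastream must end with a complete IEND chunk.
    public static bool EndsWithIend(byte[] png)
    {
        var chunks = Read(png);
        return chunks.Count > 0 && chunks[^1].Type == "IEND" && chunks[^1].Complete;
    }
}
```

Calling PngChunks.EndsWithIend(File.ReadAllBytes("magic.windows.11.24H2.png")) would be expected to return false for a file that was cut off inside its last IDAT chunk, as described above.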
We tried converting the magic.pdf from your GemBox.zip solution on our Windows 11 23H2 and 24H2 machines. The output file GemBox.magic.windows.11.23H2.png is 1,033,391 bytes, while GemBox.magic.windows.11.24H2.png is 984,468 bytes, which makes the 24H2 file 48,923 bytes smaller. But here the 24H2 file is a valid PNG file. It is not missing the required IEND chunk at its end. Only its last IDAT chunk is 48,923 bytes smaller. The only difference in the unfiltered pixel data is that row 2839 (zero-indexed) again uses filter type Sub in the 23H2 file and Up in the 24H2 file. But the number of rows is equal in both files, 4032. Analysis of the Bgra32 pixel data shows zero differences between these two files.
The final conclusion is that the image file created on 24H2 is pixel-wise identical to the one created on 23H2. The size of 24H2 is about 5% smaller, probably because it used the Up filter type instead of the Sub for row 2839, allowing better compression.
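To see which filter type each row uses, the IDAT data can be inflated and the first byte of every scanline read. The sketch below is an illustration under stated assumptions, not the analysis tool used here: PngFilters is a hypothetical name, the image is assumed to be non-interlaced, and ZLibStream requires .NET 6 or later. Filter type values are 0 = None, 1 = Sub, 2 = Up, 3 = Average, 4 = Paeth.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper for illustration; assumes a non-interlaced PNG.
public static class PngFilters
{
    // Returns the filter type byte of every scanline that can be decompressed.
    public static List<byte> ReadRowFilterTypes(byte[] png)
    {
        int width = 0, bitsPerPixel = 0;
        var idat = new MemoryStream();
        int pos = 8; // skip the 8-byte PNG signature
        while (pos + 8 <= png.Length)
        {
            int length = (int)BinaryPrimitives.ReadUInt32BigEndian(png.AsSpan(pos, 4));
            string type = Encoding.ASCII.GetString(png, pos + 4, 4);
            int dataStart = pos + 8;
            if (type == "IHDR")
            {
                width = (int)BinaryPrimitives.ReadUInt32BigEndian(png.AsSpan(dataStart, 4));
                int bitDepth = png[dataStart + 8];
                int channels = png[dataStart + 9] switch
                {
                    0 => 1, 2 => 3, 3 => 1, 4 => 2, 6 => 4,
                    _ => throw new InvalidDataException("Unknown color type.")
                };
                if (png[dataStart + 12] != 0)
                    throw new NotSupportedException("Interlaced images are not handled.");
                bitsPerPixel = bitDepth * channels;
            }
            else if (type == "IDAT")
            {
                // Copy what is present, even if the chunk is cut off by the end of the file.
                idat.Write(png, dataStart, Math.Min(length, png.Length - dataStart));
            }
            pos = dataStart + length + 4; // skip data and CRC
        }

        idat.Position = 0;
        var row = new byte[1 + (width * bitsPerPixel + 7) / 8]; // filter byte + pixel bytes
        var filters = new List<byte>();
        using var zlib = new ZLibStream(idat, CompressionMode.Decompress);
        try
        {
            while (ReadFully(zlib, row) == row.Length)
                filters.Add(row[0]);
        }
        catch (InvalidDataException)
        {
            // Truncated or corrupt compressed data: keep the rows read so far.
        }
        return filters;
    }

    private static int ReadFully(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = stream.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }
}
```

Comparing the two returned lists index by index would point out rows whose filter type differs, and a shorter list indicates missing rows.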
We are unsure why PNG encoding differs on 23H2 and 24H2. GemBox.Pdf uses WPF for encoding to PNG, and WPF uses the Windows Imaging Component (WIC) underneath. It is possible that there have been some changes in the WIC libraries between Windows 11 23H2 and 24H2.
Your file created on 24H2 is an invalid PNG file. It seems that the (several thousand) bytes (totaling 123 pixel rows) at the end of the file were simply deleted. We are unsure what caused this deletion.
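One thing that may be worth ruling out (an assumption, not something confirmed in this thread) is the way the sample program handles its output stream. It passes File.OpenWrite(...) to Save and reads FileInfo.Length without disposing the stream. A FileStream that has not been disposed or flushed may still hold buffered bytes, and File.OpenWrite does not truncate a file that already exists. It may be a coincidence, but 983,040 is exactly 240 × 4,096 bytes. The sketch below, with the hypothetical helper name WriteAndMeasure, creates a fresh file, disposes the stream, and only then measures the size.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of GemBox.Pdf.
public static class StreamSaving
{
    // Runs 'write' against a newly created (or truncated) file, disposes the
    // stream so all buffered bytes reach the disk, then returns the file size.
    public static long WriteAndMeasure(string path, Action<Stream> write)
    {
        using (var output = File.Create(path))
        {
            write(output);
        }
        return new FileInfo(path).Length;
    }
}
```

With the sample program from this thread, the call would look like StreamSaving.WriteAndMeasure("doc.current.png", s => pdf.Save(s, imageOptions)).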
Even if you are able to figure out why the end of the 24H2 file was deleted, the solution to the issue is not to compare the sizes of the files. Instead, use a PNG decoder like WPF or GDI+ to compare the pixel width, the pixel height, and the actual B-G-R-A pixel values, since the encoded (compressed) data might change based on the actual implementation.
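A pixel-based comparison along these lines could look like the following. This is a minimal sketch assuming a Windows project with WPF enabled; PngPixelComparer and its members are hypothetical names. It decodes both files to Bgra32 and compares dimensions and pixel values instead of file sizes. Decoding a truncated PNG may throw, which a unit test would then report as a failure.

```csharp
using System;
using System.IO;
using System.Windows.Media;
using System.Windows.Media.Imaging;

// Hypothetical helper for illustration; requires a WPF-enabled Windows project.
public static class PngPixelComparer
{
    // Decodes the first frame of an image file into a Bgra32 byte buffer.
    public static byte[] DecodeBgra32(string path, out int width, out int height)
    {
        using var stream = File.OpenRead(path);
        var decoder = BitmapDecoder.Create(
            stream, BitmapCreateOptions.PreservePixelFormat, BitmapCacheOption.OnLoad);
        BitmapSource frame = new FormatConvertedBitmap(
            decoder.Frames[0], PixelFormats.Bgra32, null, 0);

        width = frame.PixelWidth;
        height = frame.PixelHeight;
        int stride = width * 4; // 4 bytes per Bgra32 pixel
        var pixels = new byte[stride * height];
        frame.CopyPixels(pixels, stride, 0);
        return pixels;
    }

    // Counts pixels whose B, G, R, or A value differs between two Bgra32 buffers.
    public static int CountDifferingPixels(byte[] a, byte[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Pixel buffers differ in size.");

        int count = 0;
        for (int i = 0; i + 3 < a.Length; i += 4)
        {
            if (a[i] != b[i] || a[i + 1] != b[i + 1] ||
                a[i + 2] != b[i + 2] || a[i + 3] != b[i + 3])
                count++;
        }
        return count;
    }

    // True when both images have the same dimensions and identical pixels.
    public static bool HaveSamePixels(string pathA, string pathB)
    {
        var a = DecodeBgra32(pathA, out int widthA, out int heightA);
        var b = DecodeBgra32(pathB, out int widthB, out int heightB);
        return widthA == widthB && heightA == heightB && CountDifferingPixels(a, b) == 0;
    }
}
```

A unit test could then assert PngPixelComparer.HaveSamePixels("doc.current.png", "doc.other.png") rather than comparing file lengths.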
Regards,
Stipo
Hi Stipo,
Thx for the thorough investigation and explanation.
I’ve changed my code to compare the images based on a hash and a threshold. FYI, the library I used for this is ImageHash.
I’ll keep using the file created on Windows 11 23H2 as the baseline since, according to your investigation, the 24H2 file is missing the IEND chunk and is not actually valid.
Kind regards,
Jochen