Description
In some cases, PDFs contain image resources that are no longer referenced by any page.
Example files:
pdf-optimization-original.pdf
pdf-optimization-page-removed.pdf
The original file had a second page containing an otter image. That page was removed with a third-party PDF editing tool, but the image resource is still in the PDF.
If you compare the two PDFs, their sizes are almost the same, because the image from the original page two is still taking up space in the file.
C:\Software\pdfcpu_0.6.0_Windows_x86_64>pdfcpu.exe image list pdf-optimization-original.pdf
pages: all
pdf-optimization-original.pdf:
2 images available(241 KB)
Page Obj# │ Id │ Type SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │ Size │ Filters
━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━
1 14 │ Image14 │ image │ 1384 │ 865 │ DeviceRGB 3 8 * │ 133 KB │ DCTDecode
━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━
2 17 │ Image17 │ image │ 1384 │ 865 │ DeviceRGB 3 8 * │ 108 KB │ DCTDecode
C:\Software\pdfcpu_0.6.0_Windows_x86_64>pdfcpu.exe image list pdf-optimization-page-removed.pdf
pages: all
pdf-optimization-page-removed.pdf:
1 images available(133 KB)
Page Obj# │ Id │ Type SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │ Size │ Filters
━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━
1 14 │ Image14 │ image │ 1384 │ 865 │ DeviceRGB 3 8 * │ 133 KB │ DCTDecode
Would it be possible to add an optimization option to remove such "orphan" image resources?
The ocrmypdf tool does something similar during its PDF optimization (see optimize.py). Being able to perform this optimization with pdfcpu as well would allow us to simplify our toolchain and reduce the number of required dependencies.
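To illustrate what I mean by "orphan" resources, here is a minimal sketch of the idea in Python. It is not how pdfcpu or ocrmypdf actually implement it, and it assumes the leftover image is simply no longer reachable from the document catalog. The object graph is hypothetical: only the object numbers 14 and 17 come from the image listings above, and `find_unreachable` is a name made up for this example.

```python
from collections import deque


def find_unreachable(objects, root):
    """Return the object numbers that cannot be reached from `root`.

    `objects` maps an object number to the object numbers it references.
    This is the "mark" phase of a simple mark-and-sweep pass.
    """
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for ref in objects.get(current, ()):
            # Follow each reference once; this also copes with cycles
            # such as a page pointing back to its parent page tree.
            if ref in objects and ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return sorted(set(objects) - seen)


# Hypothetical object graph (object number -> referenced object numbers).
# Only 14 and 17 are taken from the listings above; the rest is invented.
objects = {
    1: [2],         # catalog -> page tree
    2: [3],         # page tree -> the one remaining page
    3: [2, 4, 14],  # page -> parent, content stream, Image14
    4: [],          # content stream
    14: [],         # Image14, still used on page 1
    17: [],         # Image17, left behind after page 2 was removed
}

print(find_unreachable(objects, root=1))  # [17]
```

An optimization step could then drop the objects reported as unreachable when writing the output file.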
EDIT: optimization output via ocrmypdf:
$ ls -hal pdf-optimization-page-removed*.pdf
-rw-r--r-- 1 root root 279.0K Feb 21 14:20 pdf-optimization-page-removed.pdf
$ ocrmypdf --skip-text pdf-optimization-page-removed.pdf pdf-optimization-page-removed-optimized.pdf
Scan: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 30.07page/s]
INFO - Using Tesseract OpenMP thread limit 3
INFO - 1: skipping all processing on this page
OCR: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00, 179.26page/s]
WARNING - Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
INFO - Optimize ratio: 1.00 savings: 0.0%
WARNING - Output file is okay but is not PDF/A (seems to be No PDF/A metadata in XMP)
$ ls -hal pdf-optimization-page-removed*.pdf
-rw-r--r-- 1 root root 146.8K Feb 21 14:26 pdf-optimization-page-removed-optimized.pdf
-rw-r--r-- 1 root root 279.0K Feb 21 14:20 pdf-optimization-page-removed.pdf
PDF optimized with ocrmypdf:
pdf-optimization-page-removed-optimized.pdf
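As a quick check on the two file sizes reported by `ls` above, the reduction can be worked out like this (a small arithmetic sketch; the variable names are just for illustration):

```python
# File sizes in KB, as reported by `ls -hal` above.
original_kb = 279.0
optimized_kb = 146.8

saved_kb = original_kb - optimized_kb
savings_pct = 100 * saved_kb / original_kb

print(f"saved {saved_kb:.1f} KB ({savings_pct:.1f}%)")  # saved 132.2 KB (47.4%)
```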
Thank you very much, best regards from Tyrol
Andreas