Improve optimization - add option to remove unreferenced images #807

@xelan

In some cases, PDFs may contain image resources that are no longer referenced by any page.

Example files:
pdf-optimization-original.pdf
pdf-optimization-page-removed.pdf

The original file had a second page containing an otter image. That page was removed with a third-party PDF editing tool, but the image resource is still in the PDF.

If you compare the two PDFs, their file sizes are almost the same: the image from the removed second page still takes up space in the file.

C:\Software\pdfcpu_0.6.0_Windows_x86_64>pdfcpu.exe image list pdf-optimization-original.pdf
pages: all

pdf-optimization-original.pdf:
2 images available(241 KB)
Page Obj# │ Id      │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━
   1   14 │ Image14 │ image                  │  1384 │    865 │  DeviceRGB    3   8    *   │ 133 KB │ DCTDecode
━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━
   2   17 │ Image17 │ image                  │  1384 │    865 │  DeviceRGB    3   8    *   │ 108 KB │ DCTDecode

C:\Software\pdfcpu_0.6.0_Windows_x86_64>pdfcpu.exe image list pdf-optimization-page-removed.pdf
pages: all

pdf-optimization-page-removed.pdf:
1 images available(133 KB)
Page Obj# │ Id      │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━
   1   14 │ Image14 │ image                  │  1384 │    865 │  DeviceRGB    3   8    *   │ 133 KB │ DCTDecode

Would it be possible to add an optimization option to remove such "orphan" image resources?
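
To illustrate what I mean by "orphan", here is a minimal sketch of a reachability check on a toy object graph. It is not how pdfcpu or ocrmypdf implement this. The function name `find_unreachable` and the object numbers 1 to 3 are made up for illustration; only objects 14 and 17 correspond to the images in the listings above.

```python
from collections import deque


def find_unreachable(objects, roots):
    """Return the object numbers that cannot be reached from the roots.

    `objects` maps an object number to the object numbers it references.
    This is a toy model, not a real PDF parser.
    """
    seen = set()
    queue = deque(r for r in roots if r in objects)
    while queue:
        num = queue.popleft()
        if num in seen:
            continue
        seen.add(num)
        for ref in objects.get(num, ()):
            if ref in objects and ref not in seen:
                queue.append(ref)
    return set(objects) - seen


# Hypothetical object graph, loosely modelled on the page-removed example.
objects = {
    1: [2],   # catalog -> page tree (assumed numbering)
    2: [3],   # page tree -> remaining page 1 (assumed numbering)
    3: [14],  # page 1 -> Image14
    14: [],   # image still used on page 1
    17: [],   # image of the removed page 2, no longer referenced
}

print(sorted(find_unreachable(objects, roots=[1])))  # prints [17]
```

An optimization option could drop every object reported by such a check before writing the file.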

The ocrmypdf tool does something similar during its PDF optimization (see optimize.py). Being able to perform this optimization with pdfcpu as well would allow us to simplify our toolchain and reduce the number of required dependencies.

EDIT: optimization output via ocrmypdf:

$ ls -hal pdf-optimization-page-removed*.pdf
-rw-r--r--    1 root     root      279.0K Feb 21 14:20 pdf-optimization-page-removed.pdf

$ ocrmypdf --skip-text  pdf-optimization-page-removed.pdf pdf-optimization-page-removed-optimized.pdf
Scan: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 30.07page/s]
   INFO - Using Tesseract OpenMP thread limit 3
   INFO -    1: skipping all processing on this page                                                                                                                                  
OCR: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00, 179.26page/s]
WARNING - Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
   INFO - Optimize ratio: 1.00 savings: 0.0%
WARNING - Output file is okay but is not PDF/A (seems to be No PDF/A metadata in XMP)

$ ls -hal pdf-optimization-page-removed*.pdf
-rw-r--r--    1 root     root      146.8K Feb 21 14:26 pdf-optimization-page-removed-optimized.pdf
-rw-r--r--    1 root     root      279.0K Feb 21 14:20 pdf-optimization-page-removed.pdf
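
For reference, the reduction implied by the two sizes in this listing can be worked out as below. The helper name `savings_percent` is just for illustration. Note that the "Optimize ratio" line in the ocrmypdf log reports 0.0%, so that figure apparently does not reflect this size difference.

```python
def savings_percent(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (1 - after / before) * 100


before_k = 279.0  # pdf-optimization-page-removed.pdf, as shown by ls -h
after_k = 146.8   # pdf-optimization-page-removed-optimized.pdf

print(f"{savings_percent(before_k, after_k):.1f}%")  # prints 47.4%
```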

PDF optimized with ocrmypdf:
pdf-optimization-page-removed-optimized.pdf

Thank you very much, best regards from Tyrol
Andreas
