Improve optimization - add option to remove unreferenced images #807

In some cases, PDFs contain image resources that are no longer referenced by any page.

Example files: 
[pdf-optimization-original.pdf](https://github.com/pdfcpu/pdfcpu/files/14359922/pdf-optimization-original.pdf)
[pdf-optimization-page-removed.pdf](https://github.com/pdfcpu/pdfcpu/files/14359923/pdf-optimization-page-removed.pdf)

The original file had a second page containing an otter image. That page has been removed with a third-party PDF editing tool, but the image resource is still in the PDF.

If you compare the two PDFs, their sizes are almost the same: the original image from page two is still taking up space in the file.

```
C:\Software\pdfcpu_0.6.0_Windows_x86_64>pdfcpu.exe image list pdf-optimization-original.pdf
pages: all

pdf-optimization-original.pdf:
2 images available(241 KB)
Page Obj# │ Id      │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━
   1   14 │ Image14 │ image                  │  1384 │    865 │  DeviceRGB    3   8    *   │ 133 KB │ DCTDecode
━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━
   2   17 │ Image17 │ image                  │  1384 │    865 │  DeviceRGB    3   8    *   │ 108 KB │ DCTDecode

C:\Software\pdfcpu_0.6.0_Windows_x86_64>pdfcpu.exe image list pdf-optimization-page-removed.pdf
pages: all

pdf-optimization-page-removed.pdf:
1 images available(133 KB)
Page Obj# │ Id      │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━
   1   14 │ Image14 │ image                  │  1384 │    865 │  DeviceRGB    3   8    *   │ 133 KB │ DCTDecode
```

Would it be possible to add an optimization option to remove such "orphan" image resources?
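
Conceptually, such an option could work as a reachability pass: walk the object graph from the document catalog and drop every image object that is never reached. The following is only a minimal sketch of that idea in Python, using a hypothetical, simplified object model (a dict mapping object numbers to the object numbers they reference). It is not pdfcpu's or ocrmypdf's actual API, and apart from the image object numbers 14 and 17 taken from the listing above, the object numbers are made up for illustration.

```python
from collections import deque


def reachable_objects(refs, root):
    """Return the set of object numbers reachable from `root` (breadth-first)."""
    seen = {root}
    queue = deque([root])
    while queue:
        obj = queue.popleft()
        for target in refs.get(obj, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def orphan_images(refs, images, root):
    """Return the image object numbers that cannot be reached from `root`."""
    live = reachable_objects(refs, root)
    return sorted(images - live)


# Hypothetical object graph, loosely modeled on the page-removed example.
refs = {
    1: [2],      # catalog -> page tree (assumed object numbers)
    2: [3],      # page tree -> page 1
    3: [4, 14],  # page 1 -> content stream and Image14
    4: [],       # content stream
    14: [],      # Image14, still used on page 1
    17: [],      # Image17, still stored but no longer referenced
}
images = {14, 17}

orphans = orphan_images(refs, images, root=1)
print(orphans)  # [17]
```

This only covers objects that are completely unreferenced. If an editor leaves the image listed in a resource dictionary that no content stream actually uses, detecting it would additionally require scanning the content streams for the image names they draw.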

The ocrmypdf tool does something similar during its PDF optimization (see [optimize.py](https://github.com/ocrmypdf/OCRmyPDF/blob/main/src/ocrmypdf/optimize.py)). Being able to perform this optimization with pdfcpu as well would let us simplify our toolchain and reduce the number of required dependencies.

EDIT: optimization output via `ocrmypdf`:
```bash
$ ls -hal pdf-optimization-page-removed*.pdf
-rw-r--r--    1 root     root      279.0K Feb 21 14:20 pdf-optimization-page-removed.pdf

$ ocrmypdf --skip-text  pdf-optimization-page-removed.pdf pdf-optimization-page-removed-optimized.pdf
Scan: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 30.07page/s]
   INFO - Using Tesseract OpenMP thread limit 3
   INFO -    1: skipping all processing on this page                                                                                                                                  
OCR: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00, 179.26page/s]
WARNING - Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
   INFO - Optimize ratio: 1.00 savings: 0.0%
WARNING - Output file is okay but is not PDF/A (seems to be No PDF/A metadata in XMP)

$ ls -hal pdf-optimization-page-removed*.pdf
-rw-r--r--    1 root     root      146.8K Feb 21 14:26 pdf-optimization-page-removed-optimized.pdf
-rw-r--r--    1 root     root      279.0K Feb 21 14:20 pdf-optimization-page-removed.pdf
```
PDF optimized with `ocrmypdf`:
[pdf-optimization-page-removed-optimized.pdf](https://github.com/pdfcpu/pdfcpu/files/14360077/pdf-optimization-page-removed-optimized.pdf)

Thank you very much, best regards from Tyrol
Andreas
