WriteImageToDisk does not use unique filenames for extracted images which can lead to images being overwritten #935

@dzflack

[x] Your issue is based on the latest commit - Using latest release v0.8.0
[x] State your OS and OS version - macOS 14.6.1

Description

The WriteImageToDisk function is responsible for writing extracted image files to disk.

The generated filename is not built from values that are unique per image, so different images can be written under the same filename. As a result, extracted images can be overwritten.
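
To illustrate the collision, here is a minimal sketch. It assumes the filename follows the pattern `<base>_<page>_<id>.<ext>`, which is inferred only from the output shown in the example below. `collidingName` is a hypothetical stand-in, not pdfcpu's actual code.

```go
package main

import "fmt"

// collidingName is a hypothetical helper that mirrors the naming pattern
// visible in the extract output (<base>_<page>_<id>.<ext>).
// It is NOT pdfcpu's implementation.
func collidingName(base string, page int, id, ext string) string {
	return fmt.Sprintf("%s_%d_%s.%s", base, page, id, ext)
}

func main() {
	// Two distinct images on page 3 that share the resource id "img1".
	first := collidingName("8020932-report", 3, "img1", "png")
	second := collidingName("8020932-report", 3, "img1", "png")

	// Both names are identical, so the second write replaces the first file.
	fmt.Println(first)
	fmt.Println(second)
	fmt.Println("collision:", first == second)
}
```

Nothing in the name distinguishes the two images, so whichever one is written last is the only one left on disk.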

Example

Listing images in sample PDF (which unfortunately I am unable to share) via pdfcpu image list 8020932-report.pdf:

Page Obj# │ Id   │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━
   3   23 │ img1 │ image    *             │  1000 │    367 │  DeviceRGB    3   8        │   1 KB │ FlateDecode
       24 │ img0 │ image                  │  1000 │    367 │ DeviceGray    1   8        │  10 KB │ FlateDecode
       70 │ img1 │ image    *             │   834 │     84 │  DeviceRGB    3   8        │   2 KB │ FlateDecode

This shows two images on the same page (objects 23 and 70) that share the same id, namely img1.

Extracting the images from this page causes one of the files to be overwritten:

pdfcpu extract -m=i -p 3 8020932-report.pdf .
extracting images from 8020932-report.pdf into ./ ...
optimizing...
pages: 3
writing 8020932-report_3_img0.png
writing 8020932-report_3_img1.png
writing 8020932-report_3_img1.png

# Three images extracted, only two files exist:
❯ ls -laht *.png
-rw-r--r--  1 user  staff   1.6K 25 Aug 14:27 8020932-report_3_img1.png
-rw-r--r--  1 user  staff    14K 25 Aug 14:27 8020932-report_3_img0.png
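
One possible fix, sketched here only as a suggestion, would be to include the object number from the listing above in the filename. `uniqueName` and its parameters are hypothetical and not part of pdfcpu. The sketch assumes each extracted image has a distinct object number, as is the case for objects 23 and 70 here.

```go
package main

import "fmt"

// uniqueName is a hypothetical helper, not pdfcpu's API.
// It appends the PDF object number to the name so that two images
// sharing a page and a resource id still get different filenames.
// Assumption: every extracted image has a distinct object number.
func uniqueName(base string, page int, id string, objNr int, ext string) string {
	return fmt.Sprintf("%s_%d_%s_%d.%s", base, page, id, objNr, ext)
}

func main() {
	// Page, id and object numbers taken from the image listing in this issue.
	fmt.Println(uniqueName("8020932-report", 3, "img1", 23, "png"))
	fmt.Println(uniqueName("8020932-report", 3, "img0", 24, "png"))
	fmt.Println(uniqueName("8020932-report", 3, "img1", 70, "png"))
}
```

With this scheme the three images on page 3 would get three different filenames. The trade-off is that output filenames change for existing users, so a maintainer might prefer adding the suffix only when a name has already been used.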
