Introduce a way to radically reduce the output file size (sacrificing image quality) #541

@heinrich-ulbricht

Is your feature request related to a problem? Please describe.
My use case is "scanning" documents with a smartphone camera, then archiving those "scans" as low-quality monochrome images. OCR, however, should run on the high-quality images first, before the quality is reduced.

I describe this in more detail here: #443 (comment)

Furthermore I see a discussion covering a similar topic here: #293

Describe the solution you'd like
I want greater control over the quality of the images embedded into the PDF (after OCR has been done). I can imagine the following solutions (each point is a complete solution on its own):

  • add a parameter that forces all images to be converted to 1 bpp images (low effort)
  • add a parameter allowing arbitrary shell commands to be passed that will be executed by OCRmyPDF on the images in the temporary folder, before OCRmyPDF handles them further (high effort, security implications?)
  • introduce multiple parameters that allow for more control of the things that go on in the optimization step (probably here) (medium effort?)
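To illustrate what the first option (forcing 1 bpp) amounts to, here is a minimal pure-Python sketch that thresholds 8-bit grayscale pixels and packs them at one bit per pixel. The function name `pack_1bpp`, the threshold of 128, and the sample pixel values are assumptions for illustration only; none of this is OCRmyPDF code.

```python
def pack_1bpp(gray_rows, threshold=128):
    """Threshold 8-bit grayscale rows and pack them at 1 bit per pixel.

    Bits are packed most-significant-bit first and each row is padded to a
    whole byte. A set bit means "white", a clear bit means "black".
    """
    packed = bytearray()
    for row in gray_rows:
        for start in range(0, len(row), 8):
            byte = 0
            for offset, value in enumerate(row[start:start + 8]):
                if value >= threshold:
                    byte |= 0x80 >> offset
            packed.append(byte)
    return bytes(packed)


# A tiny 10x2 "scan": dark ink (30) on a light page (220). Values are made up.
rows = [
    [220, 220, 30, 30, 220, 220, 220, 30, 220, 220],
    [220] * 10,
]
packed = pack_1bpp(rows)
print(len(packed), "bytes packed vs", sum(len(r) for r in rows), "bytes at 8 bpp")
```

Before any compression, the packed form is roughly one eighth the size of 8-bit grayscale (and about 1/24 of 24-bit RGB). That is where most of the size reduction for monochrome archiving would come from.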

Additional context
I'm currently evaluating how to achieve my goal with the least effort. I see two approaches:

  1. let OCRmyPDF do its thing on high-quality images/PDFs, then post-process the result with a Python script that uses pikepdf to replace the high-quality images in the PDF with low-quality ones (I have a working PoC, but it's not pretty)
  2. modify OCRmyPDF
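The first approach could look roughly like the sketch below. It is not the PoC mentioned above. It assumes pikepdf and Pillow are installed and that every image can be decoded by `PdfImage.as_pil_image()`. The function name `replace_images_with_1bpp` and the threshold of 128 are hypothetical. Soft masks, image masks, and inline images are ignored here and would need separate handling.

```python
import zlib


def replace_images_with_1bpp(src_pdf, dst_pdf, threshold=128):
    """Rewrite each image XObject in src_pdf as 1 bpp DeviceGray; save to dst_pdf.

    Returns the number of image objects that were replaced.
    """
    # Imported lazily so this module can be imported without the optional
    # third-party dependencies (pikepdf, Pillow).
    import pikepdf
    from pikepdf import Name, PdfImage

    seen = set()  # images shared between pages are rewritten only once
    with pikepdf.open(src_pdf) as pdf:
        for page in pdf.pages:
            for raw in page.images.values():
                if raw.objgen in seen:
                    continue
                seen.add(raw.objgen)

                # Decode, convert to grayscale, then apply a hard threshold.
                gray = PdfImage(raw).as_pil_image().convert("L")
                bilevel = gray.point(
                    lambda v: 255 if v >= threshold else 0, mode="1"
                )

                # Pillow's mode "1" bytes are already packed 8 pixels per byte,
                # with rows padded to a byte boundary, as PDF expects.
                raw.write(zlib.compress(bilevel.tobytes()), filter=Name.FlateDecode)
                raw.ColorSpace = Name.DeviceGray
                raw.BitsPerComponent = 1
                raw.Width, raw.Height = bilevel.size
                if "/Decode" in raw:
                    del raw["/Decode"]  # drop any decode array of the old image
        pdf.save(dst_pdf)
    return len(seen)
```

A call such as `replace_images_with_1bpp("ocr_output.pdf", "archive.pdf")` (file names are placeholders) would run after OCRmyPDF has finished, so OCR still sees the high-quality images. Flate is used here only because it needs no extra encoder. G4 or JBIG2 would usually compress bilevel scans better.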

I'm not sure about the second approach - where would be a good point to start? One approach could be:

  1. using PNG images in the input PDF file, then
  2. forcing pngquant to convert them to 1 bpp (here?)
  3. this could trigger PNG rewriting as G4 (here)

@jbarlow83 Does this sound right?
