Implement optional downsampling as part of preprocessing #443

@jbarlow83

(As requested by @tcurdt)

Some users want optional downsampling of images, but we don't want to do this unconditionally, because the file may contain high-resolution scans (e.g. film).

Like Acrobat, we should downsample only images above some threshold.
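A minimal sketch of what a threshold rule could look like, assuming the threshold is expressed in DPI. The function name `downsample_size` and the default values (450 and 300 DPI) are hypothetical placeholders for illustration, not settings from OCRmyPDF or Acrobat.

```python
from typing import Optional, Tuple


def downsample_size(
    width_px: int,
    height_px: int,
    dpi: float,
    threshold_dpi: float = 450.0,  # assumed value, for illustration only
    target_dpi: float = 300.0,     # assumed value, for illustration only
) -> Optional[Tuple[int, int]]:
    """Return new pixel dimensions, or None if the image should be left alone."""
    if target_dpi > threshold_dpi:
        raise ValueError("target_dpi must not exceed threshold_dpi")
    if dpi <= threshold_dpi:
        # At or below the threshold: keep the image untouched.
        return None
    scale = target_dpi / dpi
    # Never let a dimension collapse to zero pixels.
    return max(1, round(width_px * scale)), max(1, round(height_px * scale))
```

With these assumed defaults, a 600 DPI scan would be reduced to 300 DPI, while a 300 DPI scan would pass through unchanged. Keeping the threshold above the target avoids resampling images that are only marginally over the target resolution.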

TBD:

  • What to do if we have no reason to flatten the PDF to images (no deskew or clean)? In that case we should probably downsample image by image rather than page by page, which suggests putting downsampling in the optimizer.
  • Generally it makes sense to downsample any image that would be too big for Tesseract, because at least then we can do something with it.
  • Test each image before OCR, downsample images that are too big, and then scale them up.
  • Up to a point, it is better to aggressively lower JPEG quality than to downsample, because JPEG encoders will save the bit budget for high-detail areas. This starts to break down for very large images, because they take up a lot of memory when decompressed and make the file unwieldy. So we may want to satisfy a request to downsample with some combination of downsampling and JPEG quality lowering.
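One way the trade-off in the last bullet might be sketched: prefer lowering JPEG quality, and resize only when an image is too large in pixel dimensions or in decompressed memory. Everything here is an assumption for illustration: the names `Plan`, `decompressed_bytes` and `plan_reduction`, the 256 MiB memory budget, the quality values, and the `max_side_px` default, which should be checked against the Tesseract build in use.

```python
import math
from typing import NamedTuple


class Plan(NamedTuple):
    scale: float       # linear resize factor; 1.0 keeps the pixel dimensions
    jpeg_quality: int  # quality value to hand to a JPEG encoder


def decompressed_bytes(width_px: int, height_px: int, channels: int = 3) -> int:
    """Approximate memory needed for the raw 8-bit-per-channel bitmap."""
    return width_px * height_px * channels


def plan_reduction(
    width_px: int,
    height_px: int,
    channels: int = 3,
    max_decompressed_bytes: int = 256 * 1024 * 1024,  # assumed budget
    max_side_px: int = 32767,   # assumed OCR engine limit; verify locally
    base_quality: int = 85,     # assumed quality when resizing
    reduced_quality: int = 60,  # assumed quality when not resizing
) -> Plan:
    scale = 1.0
    longest = max(width_px, height_px)
    if longest > max_side_px:
        # Too large for the OCR engine: shrink to fit the longest side.
        scale = min(scale, max_side_px / longest)
    size = decompressed_bytes(width_px, height_px, channels)
    if size > max_decompressed_bytes:
        # Pixel count grows with scale squared, hence the square root.
        scale = min(scale, math.sqrt(max_decompressed_bytes / size))
    if scale < 1.0:
        return Plan(scale, base_quality)
    # Small enough to keep every pixel: save space through quality alone.
    return Plan(1.0, reduced_quality)
```

In this sketch an ordinary page scan keeps its dimensions and only gets a lower quality setting, while a very large image is resized first. A real policy could also lower quality on resized images; the split shown is just one possible combination.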
