
bug: Indexed colour (palette-based) PDF images returned with format: "raw" and colourspace: null making them undecoded and unusable #561

@MacJedi42


4.5.4 fixes the panic, and most images in the PDFs I have re-tested now extract successfully.

However, there's a remaining issue with indexed colour (palette-based) images. These are returned with format: "raw" and colourspace: null, which makes them impossible to decode downstream.

Example from my logs (consistent across 4 PDFs):

Cannot convert image to PNG, skipping
  imageIndex: 0
  originalFormat: "raw"
  dataSize: 3773
  width: 2460
  height: 1014
  colorspace: null

pdfimages -list confirms these are indexed color images:

page  num  type   width height color  comp bpc  enc
  1    0   image  2460  1014   index  1    8    image

The image data is 3,773 bytes for a 2460×1014 image. That matches neither uncompressed RGB (7,483,320 bytes) nor uncompressed 8-bit grayscale or one-byte-per-pixel indexed data (2,494,440 bytes), so it is not a 1:1 pixel mapping. This is compressed palette-indexed data. Without the palette or a colourspace identifier, we have no way to decode it to a usable format.
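The size comparison above can be reproduced with a few lines of Python. This is only a sanity-check sketch: the helper name `expected_raw_sizes` is made up for illustration, and it assumes 8 bits per component with no row padding.

```python
def expected_raw_sizes(width: int, height: int) -> dict[str, int]:
    """Uncompressed byte counts, assuming 8 bits per component and no row padding."""
    pixels = width * height
    return {
        "indexed_or_gray_8bit": pixels,  # 1 byte per pixel (palette index or gray level)
        "rgb_8bit": pixels * 3,          # 3 bytes per pixel
    }


sizes = expected_raw_sizes(2460, 1014)
data_size = 3773  # dataSize reported in the log above

# True only if the payload matches one of the uncompressed layouts.
looks_uncompressed = data_size in sizes.values()
print(sizes, looks_uncompressed)
```

Since 3,773 matches none of the computed sizes, the payload cannot be an uncompressed pixel buffer at these dimensions.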

Would it be possible to pre-decode indexed images to RGB before returning? Kreuzberg should already have access to the PDF's colour palette during extraction. Applying a palette lookup and returning decoded RGB data (with colourspace: "DeviceRGB") would make these images immediately usable, consistent with how JPEG/PNG images are already returned as decoded data.
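For illustration, the palette lookup I have in mind looks roughly like the sketch below. It is not Kreuzberg's API: `expand_indexed_to_rgb` is a hypothetical helper, and it assumes the stream has already been decompressed to one index byte per pixel (8 bpc) and that the palette is a flat sequence of RGB triples, as with an Indexed colour space over DeviceRGB.

```python
def expand_indexed_to_rgb(indices: bytes, palette: bytes, width: int, height: int) -> bytes:
    """Expand 8-bit palette indices into packed RGB bytes (hypothetical helper)."""
    if len(indices) != width * height:
        raise ValueError("expected exactly one index byte per pixel")
    if len(palette) % 3 != 0:
        raise ValueError("palette must be a flat sequence of RGB triples")

    entries = len(palette) // 3
    out = bytearray(width * height * 3)
    for i, idx in enumerate(indices):
        if idx >= entries:
            raise ValueError(f"index {idx} is outside the {entries}-entry palette")
        # Copy the 3-byte RGB entry for this pixel into the output buffer.
        out[3 * i : 3 * i + 3] = palette[3 * idx : 3 * idx + 3]
    return bytes(out)


# Tiny 2x2 example with a three-colour palette: red, green, blue.
palette = bytes([255, 0, 0, 0, 255, 0, 0, 0, 255])
indices = bytes([0, 1, 2, 0])
rgb = expand_indexed_to_rgb(indices, palette, width=2, height=2)
```

The result is a plain RGB buffer of width × height × 3 bytes, which is what downstream PNG conversion would expect when the colourspace is "DeviceRGB".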

Metadata

Labels: bug
Status: Done