bug: Indexed colour (palette-based) PDF images returned with format: "raw" and colourspace: null, making them undecoded and unusable #561

4.5.4 fixes the panic, and most images in the PDFs I re-tested now extract successfully.

However, there's a remaining issue with indexed colour (palette-based) images. These are returned with format: "raw" and colourspace: null, which makes them impossible to decode downstream.

Example from my logs (consistent across 4 PDFs):

```
Cannot convert image to PNG, skipping
  imageIndex: 0
  originalFormat: "raw"
  dataSize: 3773
  width: 2460
  height: 1014
  colorspace: null
```

`pdfimages -list` confirms these are indexed color images:
```
page  num  type   width height color  comp bpc  enc
  1    0   image  2460  1014   index  1    8    image
```

The image data is 3,773 bytes for a 2460×1014 image. That matches neither uncompressed RGB (7,483,320 bytes) nor uncompressed grayscale or 8-bit indexed data (2,494,440 bytes, one byte per pixel), so it appears to be palette-indexed data that is still compressed. Without the palette or a colourspace identifier, we have no way to decode this to a usable format.
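For reference, the size comparison can be reproduced with a short sketch. The helper name `expected_raw_sizes` is hypothetical, not part of Kreuzberg, and it assumes uncompressed rows padded to whole bytes:

```python
def expected_raw_sizes(width: int, height: int, bits_per_component: int = 8) -> dict[str, int]:
    """Uncompressed byte counts for common image layouts (hypothetical helper)."""

    def size(components: int) -> int:
        # Each row is padded up to a whole number of bytes.
        row_bytes = (width * components * bits_per_component + 7) // 8
        return row_bytes * height

    # An indexed image stores one palette index per pixel, so it is the same
    # size as grayscale at the same bit depth.
    return {"rgb": size(3), "gray": size(1), "indexed": size(1)}


sizes = expected_raw_sizes(2460, 1014)
actual = 3773  # dataSize from the log above

# True when the payload matches none of the uncompressed layouts.
unexplained = actual not in sizes.values()
```

At 3,773 bytes the payload is several hundred times smaller than any uncompressed layout, which is why I suspect it has not been decompressed.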

Would it be possible to pre-decode indexed images to RGB before returning? Kreuzberg should already have access to the PDF's colour palette during extraction. Applying a palette lookup and returning decoded RGB data (with colourspace: "DeviceRGB") would make these images immediately usable, consistent with how JPEG/PNG images are already returned as decoded data.
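To illustrate what I mean by a palette lookup, here is a minimal sketch. It is not Kreuzberg's API: the function name `indexed_to_rgb` is hypothetical, and it assumes the index data has already been decompressed, uses 8 bits per index, and that the palette is a flat RGB table (3 bytes per entry), as with an Indexed colour space whose base is DeviceRGB:

```python
def indexed_to_rgb(indices: bytes, palette: bytes, width: int, height: int) -> bytes:
    """Expand 8-bit palette indices into packed RGB bytes (hypothetical helper)."""
    if len(indices) != width * height:
        raise ValueError("expected exactly one index byte per pixel")
    if len(palette) % 3 != 0:
        raise ValueError("palette must hold 3 bytes (R, G, B) per entry")

    entries = len(palette) // 3
    out = bytearray(width * height * 3)
    for i, idx in enumerate(indices):
        if idx >= entries:
            raise ValueError(f"index {idx} is outside the palette ({entries} entries)")
        # Copy the palette's RGB triple for this pixel into the output buffer.
        out[3 * i : 3 * i + 3] = palette[3 * idx : 3 * idx + 3]
    return bytes(out)


# Tiny 2x2 example with a three-colour palette: red, green, blue.
palette = bytes([255, 0, 0, 0, 255, 0, 0, 0, 255])
rgb = indexed_to_rgb(bytes([0, 1, 2, 1]), palette, width=2, height=2)
```

The result is plain 8-bit RGB that any downstream PNG encoder can consume without knowing about the palette.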
