4.5.4 fixes the panic, and most images in the PDFs I re-tested now extract successfully.
However, there's a remaining issue with indexed colour (palette-based) images. These are returned with format: "raw" and colorspace: null, which makes them impossible to decode downstream.
Example from my logs (consistent across 4 PDFs):
Cannot convert image to PNG, skipping
imageIndex: 0
originalFormat: "raw"
dataSize: 3773
width: 2460
height: 1014
colorspace: null
pdfimages -list confirms these are indexed color images:
page   num  type   width height color comp bpc  enc
   1     0 image    2460  1014  index   1   8  image
The image data is 3,773 bytes for a 2460×1014 image. That matches neither raw RGB (7,483,320 bytes) nor a one-byte-per-pixel layout such as 8-bit grayscale or uncompressed palette indices (2,494,440 bytes), so this is still-compressed palette-indexed data. Without the palette or a colourspace identifier, we have no way to decode it to a usable format.
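For anyone who wants to check the size comparison, here is a small Python sketch. The function name and dictionary keys are my own, and it assumes 8 bits per component with no row padding:

```python
def expected_raw_sizes(width: int, height: int) -> dict[str, int]:
    """Uncompressed byte counts for common 8-bit pixel layouts."""
    pixels = width * height
    return {
        "rgb8": pixels * 3,   # 3 bytes per pixel
        "gray8": pixels,      # 1 byte per pixel
        "indexed8": pixels,   # 1 palette index per pixel
    }


sizes = expected_raw_sizes(2460, 1014)
returned = 3773  # dataSize from the log above

# None of the uncompressed layouts matches the returned size,
# so the payload has to be compressed (or otherwise encoded).
matches = [name for name, size in sizes.items() if size == returned]
print(sizes["rgb8"], sizes["gray8"], matches)
```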
Would it be possible to pre-decode indexed images to RGB before returning them? Kreuzberg should already have access to the PDF's colour palette during extraction. Applying a palette lookup and returning decoded RGB data (with colorspace: "DeviceRGB") would make these images immediately usable, consistent with how JPEG/PNG images are already returned as decoded data.
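To illustrate the kind of palette lookup I mean, here is a minimal Python sketch. It is not Kreuzberg's code or API: the function name is hypothetical, and it assumes the image stream has already been decompressed to one 8-bit index per pixel and that the palette is a flat RGB lookup table (3 bytes per entry, base colour space DeviceRGB).

```python
def expand_indexed_to_rgb(indices: bytes, palette: bytes) -> bytes:
    """Map 8-bit palette indices to packed RGB bytes.

    Assumes `palette` is a flat table of RGB triples and `indices`
    holds one already-decompressed index byte per pixel.
    """
    if len(palette) % 3 != 0:
        raise ValueError("palette length must be a multiple of 3")
    entries = len(palette) // 3

    rgb = bytearray(len(indices) * 3)
    for i, idx in enumerate(indices):
        if idx >= entries:
            raise ValueError(f"index {idx} outside palette of {entries} entries")
        # Copy the RGB triple for this index into the output buffer.
        rgb[3 * i : 3 * i + 3] = palette[3 * idx : 3 * idx + 3]
    return bytes(rgb)


# Two-colour palette (red, blue) applied to three pixels.
palette = bytes([255, 0, 0, 0, 0, 255])
pixels = expand_indexed_to_rgb(bytes([0, 1, 1]), palette)
```

The output is width × height × 3 bytes, which a downstream consumer can treat as plain DeviceRGB.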