fix(docx): extract image ref from <a:blip> when it contains child elements #591

Merged
Goldziher merged 1 commit into kreuzberg-dev:main from gvaxx:fix/docx-blip-with-extlst-children
Mar 27, 2026

gvaxx commented Mar 26, 2026

Problem

Fixes #590

When Word generates a DOCX with certain image settings (e.g. high-quality print output), it adds an <a:extLst> child inside <a:blip>:

<a:blip r:embed="rId4" cstate="print">
  <a:extLst>
    <a:ext uri="{28A0092B-C50C-407E-A947-70E740481C1C}">
      <a14:useLocalDpi val="0"/>
    </a:ext>
  </a:extLst>
</a:blip>

Because the element now has children, quick-xml emits Event::Start instead of Event::Empty. The image reference r:embed is an attribute on the opening tag, but the code handled blip only in the Event::Empty arm. As a result, image_ref was never populated, the image was silently skipped, and result.images returned None.

Fix

Added b"blip" to the Event::Start match arm in parse_drawing() so that the r:embed / r:link attribute is read whether the element is self-closing or has children.
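
The shape of the fix can be sketched with a simplified, std-only model of a pull parser. This is an illustration under stated assumptions, not the code in parse_drawing(): the Event, Tag, image_ref, and blip_with_children names below are hypothetical stand-ins and are not quick-xml's types or kreuzberg's functions. The point it shows is that attributes live on the opening tag, so one match arm should cover both the Start and Empty forms.

```rust
/// Hypothetical stand-in for a pull parser's events (not quick-xml's types).
#[derive(Debug)]
enum Event {
    /// `<name ...>`: an element that has children.
    Start(Tag),
    /// `<name .../>`: a self-closing element.
    Empty(Tag),
    /// `</name>`: a closing tag.
    End,
}

/// Hypothetical tag model: element name plus its attributes as key/value pairs.
#[derive(Debug)]
struct Tag {
    name: &'static str,
    attrs: Vec<(&'static str, &'static str)>,
}

/// Returns the relationship id (r:embed or r:link) of the first `blip` tag.
fn image_ref(events: &[Event]) -> Option<String> {
    for event in events {
        match event {
            // Attributes sit on the opening tag, so the Start and Empty
            // forms are handled by the same arm.
            Event::Start(tag) | Event::Empty(tag) if tag.name == "blip" => {
                let found = tag
                    .attrs
                    .iter()
                    .find(|(key, _)| *key == "r:embed" || *key == "r:link");
                if let Some((_, value)) = found {
                    return Some(value.to_string());
                }
            }
            _ => {}
        }
    }
    None
}

/// Event sequence modelling the `<a:blip>` with an `<a:extLst>` child from the
/// Problem section (namespace prefixes on element names omitted for brevity).
fn blip_with_children() -> Vec<Event> {
    vec![
        Event::Start(Tag { name: "blip", attrs: vec![("r:embed", "rId4"), ("cstate", "print")] }),
        Event::Start(Tag { name: "extLst", attrs: vec![] }),
        Event::End,
        Event::End,
    ]
}
```

In this model, a version of image_ref that matched only Event::Empty would return None for blip_with_children(), which mirrors the silent skip described in the Problem section.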

Testing

  • Added regression test test_parse_blip_with_extlst_children reproducing the exact XML structure from a real-world DOCX that originally triggered the bug
  • All existing extraction::docx::drawing tests pass: cargo test -p kreuzberg --lib --features office -- extraction::docx::drawing
  • Manually verified: before the fix, result.images = None; after the fix, a 399 KB JPEG (680×951 px) is correctly returned in result.images

When Word saves a document with high-quality image settings, it adds an
<a:extLst> child element inside <a:blip>, making the XML parser emit
Event::Start instead of Event::Empty for that tag. The image reference
(r:embed) is an attribute on the opening tag, so it must be read in the
Start arm too — not only in the Empty arm.

Fixes kreuzberg-dev#590

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Goldziher merged commit ed8a6c4 into kreuzberg-dev:main on Mar 27, 2026

Development

Successfully merging this pull request may close these issues.

bug: result.images is always None for DOCX files in Python binding (v4.x)