Merge changes from upstream #35

Merged
KRRT7 merged 5 commits into codeflash/optimize-CustomPDFPageInterpreter._patch_current_chars_with_render_mode-mm3h21a8 from optimize-CustomPDFPageInterpreter-pr-branch
Feb 27, 2026

@qued (Collaborator) commented Feb 27, 2026

Does this all look right?

KRRT7 and others added 5 commits February 26, 2026 15:38
The _last_patched_idx approach overwrites previously patched chars
when cur_item reverts after a figure with text ops. Instead, each
do_TJ/do_Tj snapshots len(objs) before super() and only patches
from that index.
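
The snapshot idea in the commit message above can be sketched without pdfminer. Everything in this block is a hypothetical stand-in: `Char`, `Container`, `BaseInterpreter`, `SnapshotInterpreter`, and `_patch_from` are illustrative names, not the project's or pdfminer's actual classes. The sketch assumes only that a text op appends chars to whatever `cur_item` currently is.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Char:
    text: str
    rendermode: Optional[int] = None  # None means "not patched yet"


@dataclass
class Container:
    objs: List[Char] = field(default_factory=list)


class BaseInterpreter:
    """Hypothetical stand-in for the base class: a text op appends chars to cur_item."""

    def __init__(self) -> None:
        self.cur_item = Container()
        self.render_mode = 0

    def do_TJ(self, seq: List[str]) -> None:
        for s in seq:
            self.cur_item.objs.extend(Char(c) for c in s)


class SnapshotInterpreter(BaseInterpreter):
    def do_TJ(self, seq: List[str]) -> None:
        # Snapshot the length before super() so only chars added by this op are patched.
        start = len(self.cur_item.objs)
        super().do_TJ(seq)
        self._patch_from(start)

    def _patch_from(self, start: int) -> None:
        for ch in self.cur_item.objs[start:]:
            ch.rendermode = self.render_mode


interp = SnapshotInterpreter()
page = interp.cur_item
interp.do_TJ(["abc"])      # three chars on the page, render mode 0

figure = Container()
interp.cur_item = figure   # a figure begins: cur_item switches containers
interp.render_mode = 3
interp.do_TJ(["xy"])       # text op inside the figure

interp.cur_item = page     # the figure ends: cur_item reverts to the page
interp.do_TJ(["d"])        # patched from index 3 only

page_modes = [c.rendermode for c in page.objs]
figure_modes = [c.rendermode for c in figure.objs]
```

Because the start index is taken from the container that is current at call time, no index carries over from the figure to the page, so the page's first three chars keep the render mode they were patched with.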
pdfminer's base do_Tj delegates to self.do_TJ([s]), which already
dispatches to the overridden do_TJ. The do_Tj override was patching
the same char range a second time.

Repro (add print traces to do_TJ/do_Tj, run against any PDF):

    from unstructured.partition.pdf_image.pdfminer_utils import open_pdfminer_pages_generator

    with open("example-docs/pdf/reliance.pdf", "rb") as f:
        for page, layout in open_pdfminer_pages_generator(f):
            break

Before this fix, every Tj op produces two patch calls with the same
start index:

    [TRACE] do_TJ patching from 9
    [TRACE] do_Tj patching from 9   <- redundant
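
The double patch can also be reproduced in isolation. The sketch below is a minimal model, not the project's code: `BaseInterpreter`, `TracingTJ`, `TracingTJAndTj`, and the `trace` list are hypothetical names, and the only behavior taken from the commit message is that the base `do_Tj` delegates to `self.do_TJ([s])`.

```python
from typing import List, Tuple


class BaseInterpreter:
    """Hypothetical stand-in whose do_Tj delegates to do_TJ, as described above."""

    def __init__(self) -> None:
        self.objs: List[str] = []

    def do_TJ(self, seq: List[str]) -> None:
        for s in seq:
            self.objs.extend(s)  # one entry per character

    def do_Tj(self, s: str) -> None:
        self.do_TJ([s])  # dynamic dispatch reaches any do_TJ override


class TracingTJ(BaseInterpreter):
    """Overrides only do_TJ; Tj ops arrive here through the base delegation."""

    def __init__(self) -> None:
        super().__init__()
        self.trace: List[Tuple[str, int]] = []

    def do_TJ(self, seq: List[str]) -> None:
        start = len(self.objs)
        super().do_TJ(seq)
        self.trace.append(("do_TJ", start))  # stands in for the patch call


class TracingTJAndTj(TracingTJ):
    """Also overrides do_Tj, which repeats the patch for the same range."""

    def do_Tj(self, s: str) -> None:
        start = len(self.objs)
        super().do_Tj(s)  # already triggers the overridden do_TJ
        self.trace.append(("do_Tj", start))


redundant = TracingTJAndTj()
redundant.do_TJ(["012345678"])  # nine chars already on the page
redundant.do_Tj("x")

fixed = TracingTJ()
fixed.do_TJ(["012345678"])
fixed.do_Tj("x")
```

In this model `redundant.trace` ends with two entries sharing start index 9, matching the trace above, while `fixed.trace` records a single patch for the same Tj op.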
@KRRT7 KRRT7 merged commit 6fa1716 into codeflash/optimize-CustomPDFPageInterpreter._patch_current_chars_with_render_mode-mm3h21a8 Feb 27, 2026
1 check passed