
fix: avoid O(N²) re-scanning in _patch_current_chars_with_render_mode #4269

Closed
qued wants to merge 13 commits into main from
codeflash/optimize-CustomPDFPageInterpreter._patch_current_chars_with_render_mode-mm3h21a8

Conversation

@qued
Contributor

@qued qued commented Feb 26, 2026

Problem

_patch_current_chars_with_render_mode is called on every do_TJ/do_Tj text operator during PDF parsing. The original implementation re-scans the entire cur_item._objs list each time, checking hasattr(item, "rendermode") to skip already-patched items. For a page with N characters across M text operations, this is O(N*M) — effectively quadratic.

Memray profiling showed this function as the #1 allocator: 17.57 GB total across 549M allocations in a session processing just 4 files.

Fix

Track the last-patched index so each call only processes newly-added LTChar objects. Reset automatically when cur_item changes (new page or figure).

Before: O(N²) per page — re-scans all accumulated objects on every text operator
After: O(N) per page — each object visited exactly once
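
The index-tracking idea can be sketched with stand-in classes (`FakeChar` and `Interp` below are illustrative, not the real pdfminer types; the reset-on-`cur_item`-change part described above is omitted for brevity):

```python
class FakeChar:
    """Stand-in for pdfminer's LTChar."""


class Interp:
    def __init__(self, objs):
        self.objs = objs            # stands in for device.cur_item._objs
        self._last_patched_idx = 0  # index of the first not-yet-patched object
        self.render = 0             # stands in for textstate.render

    def patch(self):
        # Only visit objects appended since the last call: O(N) total per
        # page instead of re-scanning the whole list on every text operator.
        for obj in self.objs[self._last_patched_idx:]:
            if isinstance(obj, FakeChar):
                obj.rendermode = self.render
        self._last_patched_idx = len(self.objs)


interp = Interp([FakeChar(), FakeChar()])
interp.patch()                      # patches the first two chars
interp.objs.append(FakeChar())
interp.render = 3
interp.patch()                      # only the newly appended char is visited
```

Each object is assigned a render mode exactly once; subsequent calls start from the saved index rather than position 0.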

@KRRT7, copied your PR as an internal PR here.


Note

Low Risk
Low risk because no actual modifications were present in the supplied diff/repo snapshot; if the intended performance change exists elsewhere, it isn’t visible here.

Overview
The provided full_diff appears empty (+++ /dev/null only), and repository search found no occurrences of _patch_current_chars_with_render_mode, do_TJ/do_Tj, LTChar, or rendermode, so there are no verifiable code changes to summarize in this PR snapshot.

Written by Cursor Bugbot for commit 594c2da. This will update automatically on new commits.

KRRT7 and others added 7 commits February 26, 2026 07:35
Runtime improvement: the change reduces average call time from ~174μs to ~128μs (≈35% faster overall), with the biggest wins on workloads that iterate many items (e.g., the 1,000-item test improved ~49%).

What changed
- Combined two checks into one short‑circuiting conditional:
  - Old: check hasattr(item, "rendermode") first, continue if present, then isinstance(item, LTChar) before assignment.
  - New: if isinstance(item, LTChar) and not hasattr(item, "rendermode"): assign.
- Removed the explicit continue and the separate hasattr branch; the logic is identical from a correctness perspective.
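
The two orderings can be sketched side by side (`LTChar` here is a minimal stand-in class, not the pdfminer import):

```python
class LTChar:
    pass


def patch_old(items, render_mode):
    # Original order: hasattr runs for every item, even non-LTChar ones.
    for item in items:
        if hasattr(item, "rendermode"):
            continue
        if isinstance(item, LTChar):
            item.rendermode = render_mode


def patch_new(items, render_mode):
    # Optimized order: isinstance runs first, so hasattr only runs for
    # LTChar items; short-circuiting skips it entirely for the rest.
    for item in items:
        if isinstance(item, LTChar) and not hasattr(item, "rendermode"):
            item.rendermode = render_mode


items_old = [object(), LTChar()]
items_new = [object(), LTChar()]
patch_old(items_old, 3)
patch_new(items_new, 3)
```

Both produce identical results; the new form just does less work per non-LTChar item.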

Why this speeds things up
- Fewer attribute lookups per loop iteration. In Python attribute access (hasattr / attribute lookup) is relatively expensive. The original code executed hasattr(item, "rendermode") for every item, even for non-LTChar objects. By testing isinstance(item, LTChar) first, the hasattr call is avoided for non-LTChar items (the common case), so we save an attribute lookup per non-LTChar item.
- Short‑circuit evaluation reduces bytecode and branching overhead (no separate continue branch).
- These savings compound in the hot path: this method is invoked from do_TJ and do_Tj while parsing text, so it runs many times per page. The profiler and tests show the loop-level savings lead to measurable end-to-end runtime improvement.

Profiler & test evidence
- Overall profiler total time dropped from 0.001776s → 0.001371s.
- The large-scale test (1000 LTChar objects) went from 113μs → 76.2μs (~49.5% faster), demonstrating the optimization scales with number of items.
- Many unit/test cases also show smaller but consistent improvements (see annotated_tests). A single small regression was observed in one micro-benchmark (+~5% in a very narrow case where items are always LTChar and already patched), which is an acceptable trade-off given the overall runtime/throughput gains.

Impact on workloads
- Best for PDFs with many text items or mixed-type _objs lists (many non-LTChar items): savings are greatest because we avoid unnecessary hasattr calls for non-LTChar entries.
- Safe to merge: behavior is preserved (pre-existing rendermode attributes are still honored), and dependencies/visibility are unchanged.

Summary
- Primary benefit: reduced runtime (35% overall, large improvements on heavier inputs).
- Cause: eliminated redundant attribute checks via a single short‑circuiting conditional, lowering per-item overhead in a hot path.
- Trade-off: negligible; one tiny micro-benchmark showed a small regression, but the overall throughput and real-workload performance improved significantly.
Track last-patched index so each do_TJ/do_Tj call only processes
newly-added LTChar objects instead of re-scanning the entire _objs list.
Resets automatically when cur_item changes (new page or figure).
Runtime improvement (primary): The optimized function runs ~7% faster overall (181 μs → 168 μs). That runtime reduction is the reason this change was accepted.

What changed (concrete optimizations)
- Single-step cur_item lookup: replaced "hasattr(self.device, 'cur_item') and self.device.cur_item" with cur_item = getattr(self.device, 'cur_item', None) and an early return. This reduces two attribute lookups into one and short-circuits the common no-cur-item case.
- Single getattr for _objs: replaced the conditional expression cur_item._objs if hasattr(cur_item, "_objs") else [] with objs = getattr(cur_item, "_objs", []). That avoids an extra hasattr call and repeated attribute access.
- Avoid repeated len/index work: replaced index-based loop for i in range(start, len(objs)): obj = objs[i] with direct iteration over the sub-list objs[start:] (for obj in objs[start:]). Also added an explicit if start < len(objs): guard so we don't build a slice when there's nothing to process.
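
The refactored shape described above can be sketched as follows (`Device`, `Item`, and `Char` are stand-ins for the real pdfminer types, and the patched index is passed explicitly rather than stored on the interpreter):

```python
class Char:
    pass


class Item:
    def __init__(self):
        self._objs = []


class Device:
    def __init__(self, cur_item=None):
        if cur_item is not None:
            self.cur_item = cur_item


def patch(device, render_mode, last_patched_idx):
    # Single getattr replaces hasattr + attribute fetch, and the early
    # return handles the common no-cur_item case cheaply.
    cur_item = getattr(device, "cur_item", None)
    if not cur_item:
        return last_patched_idx
    objs = getattr(cur_item, "_objs", [])
    start = last_patched_idx
    if start < len(objs):            # avoid building an empty slice
        for obj in objs[start:]:     # direct iteration, no index arithmetic
            if isinstance(obj, Char):
                obj.rendermode = render_mode
    return len(objs)


item = Item()
item._objs.extend([Char(), Char()])
idx = patch(Device(item), 0, 0)
```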

Why these changes speed things up (mechanics)
- Fewer Python-level attribute lookups: getattr replaces two operations (hasattr + attribute fetch) with one, and caching cur_item and objs removes repeated attribute access inside the hot path. Attribute lookup in Python is comparatively expensive, so reducing them reduces per-call overhead.
- Reduced indexing overhead: using direct iteration avoids repeated integer arithmetic and two list index operations per iteration (compute index, __getitem__). That lowers per-iteration Python overhead inside the tight loop that checks and patches LTChar objects.
- Early-return common negative case: when device.cur_item is missing/falsy the function now returns earlier with less work, which helps where many interpreter calls have no cur_item.
- Guarding slice creation: creating objs[start:] can allocate a new list. The added start < len(objs) check prevents unnecessary allocations on the common empty/no-new-items case; when the slice is needed it's typically small (we only process newly appended items), so the copy cost is minimal compared to saved per-iteration overhead.

Why this matters in context (hot path)
- This method is invoked from do_TJ and do_Tj (text painting operators) in the interpreter. Those are hot paths during PDF text processing: the method can be called many times per page. Small per-call savings accumulate, so a ~7% per-call runtime improvement is meaningful.

Behavioral and workload impact (based on tests)
- Regression tests show the biggest wins on heavy workloads: large-scale patching and repeated-appends (1000-item and many-iteration cases) observe sizable reductions (e.g., ~16% and ~26% in annotated tests). Those are precisely the cases where per-iteration overhead dominates.
- Some tiny regressions in a few microbench tests (single small calls) are visible in the annotated tests (sub-microsecond differences or small percent slowdowns). These are minor and expected trade-offs for the net runtime improvement across typical workloads.
- Memory trade-off: objs[start:] creates a shallow copy of the sub-list. In typical usage this sub-list is small (only newly appended items) so the allocation cost is small and outweighed by per-iteration savings. If you have pathological cases where you repeatedly slice very large tails, that could increase temporary memory pressure — but tests and typical interpreter usage show net benefit.

Correctness
- The logic is preserved: render_mode is applied only to new LTChar objects and _last_patched_idx is updated the same way. Using getattr defaults keeps previous behavior for missing attributes.

Summary
- Net effect: fewer attribute lookups, less indexing overhead, and an early-exit optimize the common and hot cases encountered during PDF text interpretation, producing the measured 7% runtime speedup. The small memory/alloc cost of slicing is guarded and, in practice, outweighed by the reduced CPU overhead on the hot path (do_TJ/do_Tj).
…erpreter._patch_current_chars_with_render_mode-mm3lbz82

⚡️ Speed up method `CustomPDFPageInterpreter._patch_current_chars_with_render_mode` by 7%
Use idiomatic slice iteration instead of index-based loop, remove
redundant guard and trailing return.
@KRRT7
Collaborator

KRRT7 commented Feb 26, 2026

cur_item is managed as a stack: begin_figure pushes, end_figure pops. And textstate.render can change independently via Tr. So if there are text ops inside a figure, _patched_cur_item gets set to the figure, and when cur_item reverts to the page, _last_patched_idx resets to 0 and overwrites old chars.

@KRRT7
Collaborator

KRRT7 commented Feb 26, 2026

I think we can have each do_TJ/do_Tj snapshot len(objs) before super() and only patch from that index. Let me write it up.

@qued
Contributor Author

qued commented Feb 26, 2026

@claude can you review this change? Of particular concern to me is that before the change there was a guard against "re-patching" an obj. Not having that guard is a problem only if:

  1. It's possible for an obj to be visited twice, and
  2. self.textstate can change between calls.

@claude

claude bot commented Feb 26, 2026

Claude finished @qued's task


Review: PDF Parsing Optimization

Todo List:

  • Read the modified file to understand the implementation
  • Analyze the concerns about re-patching guards
  • Evaluate the cur_item stack behavior with figures
  • Provide comprehensive review feedback

Critical Issue Identified ⚠️

The optimization has a serious correctness bug that can corrupt render mode data.

The Problem

The original implementation had this guard:

if isinstance(item, LTChar) and not hasattr(item, "rendermode"):
    item.rendermode = render_mode

The new implementation removes this guard:

for obj in objs[self._last_patched_idx:]:
    if isinstance(obj, LTChar):
        obj.rendermode = render_mode  # Always overwrites!

Root Cause

PDFMiner uses a stack for cur_item management:

  • begin_figure pushes a new item onto the stack
  • end_figure pops back to the parent item
  • When returning to a parent item, the identity check _patched_cur_item is not cur_item triggers
  • This resets _last_patched_idx = 0 and re-processes all characters from the beginning
  • All previously-patched characters get overwritten with the current textstate.render value

Concrete Example

  1. Page processes text with render_mode=0 (visible) → characters get rendermode=0
  2. Figure begins, processes text with render_mode=3 (invisible) → new characters get rendermode=3
  3. Figure ends, cur_item returns to page
  4. Identity check fails → _last_patched_idx = 0
  5. Next text operation re-processes all page characters with current render_mode
  6. Result: All previously visible text (rendermode=0) gets incorrectly marked as invisible (rendermode=3)
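
The failure sequence above can be reproduced with minimal stand-ins (`Char` for `LTChar`, plain lists for `cur_item._objs`, and opaque identity tokens for the page and figure; not the real pdfminer objects):

```python
class Char:
    def __init__(self):
        self.rendermode = None


state = {"patched_item": None, "last_idx": 0}


def patch(objs, cur_item_id, render_mode):
    # Mirrors the stateful reset: an identity change resets the index to 0.
    if state["patched_item"] is not cur_item_id:
        state["last_idx"] = 0
        state["patched_item"] = cur_item_id
    for obj in objs[state["last_idx"]:]:
        if isinstance(obj, Char):
            obj.rendermode = render_mode
    state["last_idx"] = len(objs)


page = object()    # identity token for the page item
figure = object()  # identity token for a figure pushed onto cur_item

# 1. Page text with render mode 0 (visible).
page_objs = [Char()]
patch(page_objs, page, 0)

# 2-3. Figure text with render mode 3 (invisible), then the figure ends.
figure_objs = [Char()]
patch(figure_objs, figure, 3)

# 4-6. Back on the page: the identity check fails, the index resets, and
# the page's previously visible char is overwritten with render mode 3.
patch(page_objs, page, 3)
```

After the final call, `page_objs[0].rendermode` is 3, not the 0 it was originally assigned, which is exactly the corruption described.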

Your Two Conditions Analysis

  1. It's possible for an objs to be visited twice

YES - This happens when cur_item returns from figure stack operations.

  2. self.textstate can change between calls.

YES - textstate.render changes via Tr operations and can be different when revisiting objects.

Recommendation

The optimization approach is fundamentally flawed. Consider @KRRT7's suggestion to track the starting index at the beginning of each do_TJ/do_Tj call instead of relying on persistent state across calls.

A safer approach:

def _patch_current_chars_with_render_mode(self):
    cur_item = getattr(self.device, "cur_item", None)
    if not cur_item:
        return
    objs = getattr(cur_item, "_objs", ())
    render_mode = self.textstate.render
    
    # Find first unpatched character instead of tracking state
    start_idx = 0
    for i, obj in enumerate(objs):
        if isinstance(obj, LTChar) and not hasattr(obj, "rendermode"):
            start_idx = i
            break
    
    # Only patch unpatched characters
    for obj in objs[start_idx:]:
        if isinstance(obj, LTChar) and not hasattr(obj, "rendermode"):
            obj.rendermode = render_mode

This preserves the performance benefit while maintaining correctness.


@KRRT7
Collaborator

KRRT7 commented Feb 26, 2026

The _last_patched_idx / _patched_cur_item approach has a bug when there are text ops inside a figure: _patched_cur_item gets set to the figure, and when cur_item reverts to the page via end_figure, the index resets to 0 and overwrites previously patched chars.

Instead, each do_TJ/do_Tj can snapshot len(objs) before super() and pass it in:

def _patch_current_chars_with_render_mode(self, start: int):
    """Add render_mode to LTChar objects added since index `start`."""
    cur_item = getattr(self.device, "cur_item", None)
    if not cur_item:
        return
    render_mode = self.textstate.render
    for obj in getattr(cur_item, "_objs", ())[start:]:
        if isinstance(obj, LTChar):
            obj.rendermode = render_mode

def do_TJ(self, seq):
    start = len(getattr(getattr(self.device, "cur_item", None), "_objs", ()))
    super().do_TJ(seq)
    self._patch_current_chars_with_render_mode(start)

def do_Tj(self, s):
    start = len(getattr(getattr(self.device, "cur_item", None), "_objs", ()))
    super().do_Tj(s)
    self._patch_current_chars_with_render_mode(start)

@KRRT7
Collaborator

KRRT7 commented Feb 26, 2026

@claude review the #4269 (comment) implementation

@claude

claude bot commented Feb 26, 2026

Claude finished @KRRT7's task


Review: KRRT7's Proposed Implementation

Todo List:

  • Read the current implementation to understand the bug
  • Analyze KRRT7's proposed solution
  • Evaluate correctness and edge cases
  • Provide comprehensive review feedback

Excellent Solution! ✅

KRRT7's proposed approach completely eliminates the correctness bug while maintaining the O(N) performance improvement. This is a much cleaner and more robust design.

Current Bug Analysis

The existing implementation (lines 28-35) has a critical flaw:

# Reset index when cur_item changes (new page or figure)
if getattr(self, "_patched_cur_item", None) is not cur_item:
    self._last_patched_idx = 0  # ⚠️ DANGEROUS RESET
    self._patched_cur_item = cur_item
for obj in objs[self._last_patched_idx:]:
    if isinstance(obj, LTChar):
        obj.rendermode = render_mode  # Overwrites existing rendermode!

Problem: When cur_item returns from a figure stack operation, the identity check fails, _last_patched_idx resets to 0, and ALL previously-patched characters get their rendermode overwritten with the current textstate.render value.

KRRT7's Solution Analysis

def _patch_current_chars_with_render_mode(self, start: int):
    """Add render_mode to LTChar objects added since index `start`."""
    cur_item = getattr(self.device, "cur_item", None)
    if not cur_item:
        return
    render_mode = self.textstate.render
    for obj in getattr(cur_item, "_objs", ())[start:]:
        if isinstance(obj, LTChar):
            obj.rendermode = render_mode

def do_TJ(self, seq):
    start = len(getattr(getattr(self.device, "cur_item", None), "_objs", ()))
    super().do_TJ(seq)
    self._patch_current_chars_with_render_mode(start)

def do_Tj(self, s):
    start = len(getattr(getattr(self.device, "cur_item", None), "_objs", ()))
    super().do_Tj(s)
    self._patch_current_chars_with_render_mode(start)

Why This Approach Works ✅

  1. No Persistent State: Eliminates _last_patched_idx and _patched_cur_item completely
  2. Per-Operation Scoping: Each do_TJ/do_Tj call snapshots the starting index before processing
  3. Immune to Stack Operations: Figure push/pop operations cannot corrupt the patching logic
  4. Preserves Original Render Modes: Never re-processes already-patched characters
  5. Maintains Performance: Still O(N) - each character visited exactly once

Edge Cases Handled

  • Figure nesting: Start index is captured per text operation, unaffected by cur_item changes
  • Render mode changes: Each operation uses its own textstate.render value
  • Empty operations: Safe handling with getattr() and empty range behavior
  • Multiple consecutive operations: Each gets its own independent start index

Minor Suggestions

The nested getattr() calls could be simplified for readability:

def do_TJ(self, seq):
    cur_item = getattr(self.device, "cur_item", None)
    start = len(getattr(cur_item, "_objs", ())) if cur_item else 0
    super().do_TJ(seq)
    self._patch_current_chars_with_render_mode(start)

Verdict

This is a superior solution that:

  • ✅ Fixes the correctness bug completely
  • ✅ Maintains O(N) performance improvement
  • ✅ Simplifies the code by removing stateful tracking
  • ✅ Is more robust against edge cases
  • ✅ Follows the principle of minimal state and clear responsibilities

Strongly recommend implementing this approach instead of the current one.


KRRT7 and others added 3 commits February 26, 2026 15:38
The _last_patched_idx approach overwrites previously patched chars
when cur_item reverts after a figure with text ops. Instead, each
do_TJ/do_Tj snapshots len(objs) before super() and only patches
from that index.

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

KRRT7 and others added 2 commits February 26, 2026 18:49
pdfminer's base do_Tj delegates to self.do_TJ([s]), which already
dispatches to the overridden do_TJ. The do_Tj override was patching
the same char range a second time.

Repro (add print traces to do_TJ/do_Tj, run against any PDF):

    from unstructured.partition.pdf_image.pdfminer_utils import open_pdfminer_pages_generator

    with open("example-docs/pdf/reliance.pdf", "rb") as f:
        for page, layout in open_pdfminer_pages_generator(f):
            break

Before this fix, every Tj op produces two patch calls with the same
start index:

    [TRACE] do_TJ patching from 9
    [TRACE] do_Tj patching from 9   <- redundant
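
The de-duplication can be sketched with stand-in classes (`Base` mimics pdfminer's delegation of `do_Tj` to `do_TJ`; `Interp` stands in for the custom interpreter and counts patch calls instead of patching real chars):

```python
class Base:
    def do_TJ(self, seq):
        self.calls.append(("TJ", seq))

    def do_Tj(self, s):
        # Mirrors pdfminer's base behavior: do_Tj delegates to do_TJ,
        # which dispatches to the subclass override.
        self.do_TJ([s])


class Interp(Base):
    def __init__(self):
        self.calls = []
        self.patches = 0

    def do_TJ(self, seq):
        super().do_TJ(seq)
        self.patches += 1  # snapshot-and-patch happens here only


interp = Interp()
interp.do_Tj("hello")  # routes through the overridden do_TJ exactly once
```

Because the base `do_Tj` already dispatches through the overridden `do_TJ`, a separate patching override on `do_Tj` would patch the same char range a second time; with only `do_TJ` overridden, each text operator triggers exactly one patch call.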
@KRRT7
Collaborator

KRRT7 commented Feb 27, 2026

https://gist.github.com/KRRT7/a146991126969d5dc7e50873bfc51d1e

got a benchmark script set up with the most recent diff

   chars     old (s)     new (s)   speedup
--------------------------------------------
     500      0.0002      0.0001      3.7x
    1000      0.0007      0.0001      6.7x
    2000      0.0024      0.0002     12.8x
    5000      0.0146      0.0005     31.0x
   10000      0.0567      0.0009     60.2x
   50000      1.4087      0.0048    292.6x
  100000      5.9000      0.0107    551.7x
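
Not the linked gist, but a minimal harness along the same lines can be sketched under the assumption that the old code re-scans the full list on every append while the new code slices from a saved index:

```python
import timeit


class Char:
    pass


def old_style(n):
    objs = []
    for _ in range(n):
        objs.append(Char())
        for obj in objs:  # full re-scan every call: O(N^2) overall
            if isinstance(obj, Char) and not hasattr(obj, "rendermode"):
                obj.rendermode = 0
    return objs


def new_style(n):
    objs, last = [], 0
    for _ in range(n):
        objs.append(Char())
        for obj in objs[last:]:  # only newly appended items: O(N) overall
            if isinstance(obj, Char):
                obj.rendermode = 0
        last = len(objs)
    return objs


for n in (500, 1000, 2000):
    t_old = timeit.timeit(lambda: old_style(n), number=3)
    t_new = timeit.timeit(lambda: new_style(n), number=3)
    print(f"{n:6d}  {t_old:.4f}s  {t_new:.4f}s  {t_old / t_new:.1f}x")
```

The speedup ratio grows with the character count, matching the shape of the table above.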

@qued qued added this pull request to the merge queue Feb 27, 2026
@qued qued removed this pull request from the merge queue due to a manual request Feb 27, 2026
…_patch_current_chars_with_render_mode-mm3h21a8
@qued
Contributor Author

qued commented Feb 27, 2026

Closing as #4266 merged instead.

@qued qued closed this Feb 27, 2026