
fix: avoid O(N²) re-scanning in _patch_current_chars_with_render_mode#33

Merged
KRRT7 merged 5 commits into leak-control from
codeflash/optimize-CustomPDFPageInterpreter._patch_current_chars_with_render_mode-mm3h21a8
on Feb 26, 2026

Conversation


@codeflash-ai codeflash-ai bot commented Feb 26, 2026

Problem

_patch_current_chars_with_render_mode is called on every do_TJ/do_Tj text operator during PDF parsing. The original implementation re-scanned the entire cur_item._objs list each time, checking hasattr(item, "rendermode") to skip already-patched items. For a page with N characters across M text operations, this is O(N*M) — effectively quadratic.

Memray profiling showed this function as the #1 allocator: 17.57 GB total across 549M allocations in a session processing just 4 files.

Fix

Track the last-patched index so each call only processes newly-added LTChar objects. Reset automatically when cur_item changes (new page or figure).

- Before: O(N²) per page — re-scans all accumulated objects on every text operator
- After: O(N) per page — each object visited exactly once
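A minimal sketch of the index-tracking approach, with a hypothetical `LTChar` stub standing in for `pdfminer.layout.LTChar` (the real method lives on `CustomPDFPageInterpreter`; the `Patcher` and `Page` names here are illustrative, not the actual classes):

```python
class LTChar:
    """Hypothetical stub for pdfminer.layout.LTChar."""

class Page:
    """Minimal stand-in for the device's cur_item container."""
    def __init__(self):
        self._objs = []

class Patcher:
    """Tracks the last-patched index so each call only touches new items."""
    def __init__(self):
        self._last_item = None
        self._last_idx = 0

    def patch(self, cur_item, render_mode):
        # Reset the cursor when cur_item changes (new page or figure).
        if cur_item is not self._last_item:
            self._last_item = cur_item
            self._last_idx = 0
        objs = getattr(cur_item, "_objs", [])
        # Visit only objects appended since the previous call.
        for obj in objs[self._last_idx:]:
            if isinstance(obj, LTChar) and not hasattr(obj, "rendermode"):
                obj.rendermode = render_mode
        self._last_idx = len(objs)
```

Across M text operators on a page of N characters, each object enters the slice exactly once, so total work is O(N) rather than O(N·M).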

Runtime improvement: the change reduces average call time from ~174μs to ~128μs (≈35% faster overall), with the biggest wins on workloads that iterate many items (e.g., the 1,000-item test improved ~49%).

What changed
- Combined two checks into one short‑circuiting conditional:
  - Old: check hasattr(item, "rendermode") first, continue if present, then isinstance(item, LTChar) before assignment.
  - New: if isinstance(item, LTChar) and not hasattr(item, "rendermode"): assign.
- Removed the explicit continue and the separate hasattr branch; the logic is identical from a correctness perspective.
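Side by side, the two loop bodies (again with a hypothetical stub for `pdfminer.layout.LTChar`); both leave pre-existing `rendermode` attributes untouched:

```python
class LTChar:
    """Hypothetical stub for pdfminer.layout.LTChar."""

def patch_old(objs, render_mode):
    # Old: hasattr runs for every item, even non-LTChar ones.
    for item in objs:
        if hasattr(item, "rendermode"):
            continue
        if isinstance(item, LTChar):
            item.rendermode = render_mode

def patch_new(objs, render_mode):
    # New: isinstance first, so hasattr is skipped for non-LTChar items
    # (the common case in a mixed _objs list).
    for item in objs:
        if isinstance(item, LTChar) and not hasattr(item, "rendermode"):
            item.rendermode = render_mode
```

For every combination (LTChar or not, `rendermode` present or not) the two functions take the same action, which is why the change is purely a performance refactor.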

Why this speeds things up
- Fewer attribute lookups per loop iteration. In Python, attribute access (hasattr or a direct lookup) is relatively expensive. The original code executed hasattr(item, "rendermode") for every item, even for non-LTChar objects. By testing isinstance(item, LTChar) first, the hasattr call is avoided for non-LTChar items (the common case), saving an attribute lookup per non-LTChar item.
- Short‑circuit evaluation reduces bytecode and branching overhead (no separate continue branch).
- These savings compound in the hot path: this method is invoked from do_TJ and do_Tj while parsing text, so it runs many times per page. The profiler and tests show the loop-level savings lead to measurable end-to-end runtime improvement.

Profiler & test evidence
- Overall profiler total time dropped from 0.001776s → 0.001371s.
- The large-scale test (1000 LTChar objects) went from 113μs → 76.2μs (~49.5% faster), demonstrating the optimization scales with number of items.
- Many unit/test cases also show smaller but consistent improvements (see annotated_tests). A single small regression was observed in one micro-benchmark (+~5% in a very narrow case where items are always LTChar and already patched), which is an acceptable trade-off given the overall runtime/throughput gains.

Impact on workloads
- Best for PDFs with many text items or mixed-type _objs lists (many non-LTChar items): savings are greatest because we avoid unnecessary hasattr calls for non-LTChar entries.
- Safe to merge: behavior is preserved (pre-existing rendermode attributes are still honored), and dependencies/visibility are unchanged.

Summary
- Primary benefit: reduced runtime (35% overall, large improvements on heavier inputs).
- Cause: eliminated redundant attribute checks via a single short‑circuiting conditional, lowering per-item overhead in a hot path.
- Trade-off: negligible; one tiny micro-benchmark showed a small regression, but the overall throughput and real-workload performance improved significantly.
@codeflash-ai codeflash-ai bot requested a review from KRRT7 February 26, 2026 13:01
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Feb 26, 2026
KRRT7 and others added 2 commits February 26, 2026 08:32
Track last-patched index so each do_TJ/do_Tj call only processes
newly-added LTChar objects instead of re-scanning the entire _objs list.
Resets automatically when cur_item changes (new page or figure).
Runtime improvement (primary): The optimized function runs ~7% faster overall (181 μs → 168 μs). That runtime reduction is the reason this change was accepted.

What changed (concrete optimizations)
- Single-step cur_item lookup: replaced "hasattr(self.device, 'cur_item') and self.device.cur_item" with cur_item = getattr(self.device, 'cur_item', None) and an early return. This reduces two attribute lookups into one and short-circuits the common no-cur-item case.
- Single getattr for _objs: replaced the conditional expression cur_item._objs if hasattr(cur_item, "_objs") else [] with objs = getattr(cur_item, "_objs", []). That avoids an extra hasattr call and repeated attribute access.
- Avoid repeated len/index work: replaced index-based loop for i in range(start, len(objs)): obj = objs[i] with direct iteration over the sub-list objs[start:] (for obj in objs[start:]). Also added an explicit if start < len(objs): guard so we don't build a slice when there's nothing to process.
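Combining the three changes, a hedged sketch of the resulting body (attribute names like `_last_patched_idx` follow the description above; `LTChar`, `Device`, and `Interp` are illustrative stubs, not the real pdfminer or interpreter classes):

```python
class LTChar:
    """Hypothetical stub for pdfminer.layout.LTChar."""

class Device:
    """Minimal device stand-in; gains cur_item when a page opens."""

class Interp:
    """Stand-in for CustomPDFPageInterpreter; holds the patch cursor."""
    def __init__(self, device):
        self.device = device
        self._last_patched_item = None
        self._last_patched_idx = 0

def patch_current_chars(interp, render_mode):
    # Single-step lookup: one getattr instead of hasattr + attribute
    # fetch, with an early return for the missing/falsy cur_item case.
    cur_item = getattr(interp.device, "cur_item", None)
    if not cur_item:
        return
    # Single getattr with a default instead of hasattr + conditional.
    objs = getattr(cur_item, "_objs", [])
    # Reset the cursor on a new page or figure.
    if cur_item is not interp._last_patched_item:
        interp._last_patched_item = cur_item
        interp._last_patched_idx = 0
    start = interp._last_patched_idx
    # Guard slice creation: no list copy when there is nothing new.
    if start < len(objs):
        for obj in objs[start:]:  # direct iteration, no per-step indexing
            if isinstance(obj, LTChar) and not hasattr(obj, "rendermode"):
                obj.rendermode = render_mode
        interp._last_patched_idx = len(objs)
```

The slice `objs[start:]` allocates only when new items exist, and in typical use it covers just the handful of chars appended since the last do_TJ/do_Tj call.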

Why these changes speed things up (mechanics)
- Fewer Python-level attribute lookups: getattr replaces two operations (hasattr + attribute fetch) with one, and caching cur_item and objs removes repeated attribute access inside the hot path. Attribute lookup in Python is comparatively expensive, so reducing them reduces per-call overhead.
- Reduced indexing overhead: using direct iteration avoids repeated integer arithmetic and two list index operations per iteration (compute index, __getitem__). That lowers per-iteration Python overhead inside the tight loop that checks and patches LTChar objects.
- Early-return common negative case: when device.cur_item is missing/falsy the function now returns earlier with less work, which helps where many interpreter calls have no cur_item.
- Guarding slice creation: creating objs[start:] can allocate a new list. The added start < len(objs) check prevents unnecessary allocations on the common empty/no-new-items case; when the slice is needed it's typically small (we only process newly appended items), so the copy cost is minimal compared to saved per-iteration overhead.

Why this matters in context (hot path)
- This method is invoked from do_TJ and do_Tj (text painting operators) in the interpreter. Those are hot paths during PDF text processing: the method can be called many times per page. Small per-call savings accumulate, so a ~7% per-call runtime improvement is meaningful.

Behavioral and workload impact (based on tests)
- Regression tests show the biggest wins on heavy workloads: large-scale patching and repeated-appends (1000-item and many-iteration cases) observe sizable reductions (e.g., ~16% and ~26% in annotated tests). Those are precisely the cases where per-iteration overhead dominates.
- Some tiny regressions in a few microbench tests (single small calls) are visible in the annotated tests (sub-microsecond differences or small percent slowdowns). These are minor and expected trade-offs for the net runtime improvement across typical workloads.
- Memory trade-off: objs[start:] creates a shallow copy of the sub-list. In typical usage this sub-list is small (only newly appended items) so the allocation cost is small and outweighed by per-iteration savings. If you have pathological cases where you repeatedly slice very large tails, that could increase temporary memory pressure — but tests and typical interpreter usage show net benefit.

Correctness
- The logic is preserved: render_mode is applied only to new LTChar objects and _last_patched_idx is updated the same way. Using getattr defaults keeps previous behavior for missing attributes.

Summary
- Net effect: fewer attribute lookups, less indexing overhead, and an early-exit optimize the common and hot cases encountered during PDF text interpretation, producing the measured 7% runtime speedup. The small memory/alloc cost of slicing is guarded and, in practice, outweighed by the reduced CPU overhead on the hot path (do_TJ/do_Tj).
@KRRT7 KRRT7 closed this Feb 26, 2026
@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-CustomPDFPageInterpreter._patch_current_chars_with_render_mode-mm3h21a8 branch February 26, 2026 15:05
@KRRT7 KRRT7 restored the codeflash/optimize-CustomPDFPageInterpreter._patch_current_chars_with_render_mode-mm3h21a8 branch February 26, 2026 15:05
@KRRT7 KRRT7 reopened this Feb 26, 2026
KRRT7 and others added 2 commits February 26, 2026 15:12
…erpreter._patch_current_chars_with_render_mode-mm3lbz82

⚡️ Speed up method `CustomPDFPageInterpreter._patch_current_chars_with_render_mode` by 7%
Use idiomatic slice iteration instead of index-based loop, remove
redundant guard and trailing return.
@KRRT7 KRRT7 changed the title ⚡️ Speed up method CustomPDFPageInterpreter._patch_current_chars_with_render_mode by 36% fix: avoid O(N²) re-scanning in _patch_current_chars_with_render_mode Feb 26, 2026
@KRRT7 KRRT7 merged commit abef17a into leak-control Feb 26, 2026
1 check failed
@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-CustomPDFPageInterpreter._patch_current_chars_with_render_mode-mm3h21a8 branch February 26, 2026 15:46
@KRRT7 KRRT7 restored the codeflash/optimize-CustomPDFPageInterpreter._patch_current_chars_with_render_mode-mm3h21a8 branch February 26, 2026 17:01