Conversation
Runtime improvement: the change reduces average call time from ~174μs to ~128μs (≈35% faster overall), with the biggest wins on workloads that iterate many items (e.g., the 1,000-item test improved ~49%).

What changed
- Combined two checks into one short-circuiting conditional:
  - Old: check `hasattr(item, "rendermode")` first, `continue` if present, then `isinstance(item, LTChar)` before assignment.
  - New: `if isinstance(item, LTChar) and not hasattr(item, "rendermode"):` assign.
- Removed the explicit `continue` and the separate `hasattr` branch; the logic is identical from a correctness perspective.

Why this speeds things up
- Fewer attribute lookups per loop iteration. In Python, attribute access (`hasattr` / attribute lookup) is relatively expensive. The original code executed `hasattr(item, "rendermode")` for every item, even for non-`LTChar` objects. By testing `isinstance(item, LTChar)` first, the `hasattr` call is avoided for non-`LTChar` items (the common case), so we save an attribute lookup per non-`LTChar` item.
- Short-circuit evaluation reduces bytecode and branching overhead (no separate `continue` branch).
- These savings compound in the hot path: this method is invoked from `do_TJ` and `do_Tj` while parsing text, so it runs many times per page. The profiler and tests show the loop-level savings lead to a measurable end-to-end runtime improvement.

Profiler & test evidence
- Overall profiler total time dropped from 0.001776s to 0.001371s.
- The large-scale test (1000 `LTChar` objects) went from 113μs to 76.2μs (~49.5% faster), demonstrating the optimization scales with the number of items.
- Many unit/test cases also show smaller but consistent improvements (see annotated_tests). A single small regression was observed in one micro-benchmark (+~5%, in a very narrow case where items are always `LTChar` and already patched), which is an acceptable trade-off given the overall runtime/throughput gains.
Impact on workloads
- Best for PDFs with many text items or mixed-type `_objs` lists (many non-`LTChar` items): savings are greatest because we avoid unnecessary `hasattr` calls for non-`LTChar` entries.
- Safe to merge: behavior is preserved (pre-existing `rendermode` attributes are still honored), and dependencies/visibility are unchanged.

Summary
- Primary benefit: reduced runtime (35% overall, with larger improvements on heavier inputs).
- Cause: eliminated redundant attribute checks via a single short-circuiting conditional, lowering per-item overhead in a hot path.
- Trade-off: negligible; one tiny micro-benchmark showed a small regression, but overall throughput and real-workload performance improved significantly.
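The restructured check can be sketched like this (a rough illustration of the before/after shape described above, not the PR's exact code; the `LTChar` stub stands in for pdfminer's character layout object):

```python
# Illustrative sketch of the old vs. new conditional shape.
# LTChar here is a stand-in stub for pdfminer's layout character class.
class LTChar:
    pass


def patch_old(objs, render_mode):
    # Old shape: hasattr runs for every item, even non-LTChar ones.
    for item in objs:
        if hasattr(item, "rendermode"):
            continue
        if isinstance(item, LTChar):
            item.rendermode = render_mode


def patch_new(objs, render_mode):
    # New shape: isinstance runs first, so hasattr is skipped entirely
    # for non-LTChar items (the common case), and there is no continue.
    for item in objs:
        if isinstance(item, LTChar) and not hasattr(item, "rendermode"):
            item.rendermode = render_mode
```

Both variants honor a pre-existing `rendermode` attribute; only the order and number of checks differ.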
Track the last-patched index so each `do_TJ`/`do_Tj` call only processes newly added `LTChar` objects instead of re-scanning the entire `_objs` list. Resets automatically when `cur_item` changes (new page or figure).
Runtime improvement (primary): the optimized function runs ~7% faster overall (181 μs → 168 μs). That runtime reduction is the reason this change was accepted.

What changed (concrete optimizations)
- Single-step `cur_item` lookup: replaced `hasattr(self.device, 'cur_item') and self.device.cur_item` with `cur_item = getattr(self.device, 'cur_item', None)` and an early return. This collapses two attribute lookups into one and short-circuits the common no-`cur_item` case.
- Single `getattr` for `_objs`: replaced the conditional expression `cur_item._objs if hasattr(cur_item, "_objs") else []` with `objs = getattr(cur_item, "_objs", [])`. That avoids an extra `hasattr` call and repeated attribute access.
- Avoid repeated `len`/index work: replaced the index-based loop `for i in range(start, len(objs)): obj = objs[i]` with direct iteration over the sub-list (`for obj in objs[start:]`). Also added an explicit `if start < len(objs):` guard so we don't build a slice when there's nothing to process.

Why these changes speed things up (mechanics)
- Fewer Python-level attribute lookups: `getattr` replaces two operations (`hasattr` plus attribute fetch) with one, and caching `cur_item` and `objs` removes repeated attribute access inside the hot path. Attribute lookup in Python is comparatively expensive, so reducing it reduces per-call overhead.
- Reduced indexing overhead: direct iteration avoids repeated integer arithmetic and two list-index operations per iteration (compute index, `__getitem__`). That lowers per-iteration Python overhead inside the tight loop that checks and patches `LTChar` objects.
- Early return in the common negative case: when `device.cur_item` is missing or falsy, the function now returns earlier with less work, which helps when many interpreter calls have no `cur_item`.
- Guarding slice creation: creating `objs[start:]` can allocate a new list.
The added `start < len(objs)` check prevents unnecessary allocations in the common empty/no-new-items case; when the slice is needed, it is typically small (we only process newly appended items), so the copy cost is minimal compared to the saved per-iteration overhead.

Why this matters in context (hot path)
- This method is invoked from `do_TJ` and `do_Tj` (text painting operators) in the interpreter. Those are hot paths during PDF text processing: the method can be called many times per page. Small per-call savings accumulate, so a ~7% per-call runtime improvement is meaningful.

Behavioral and workload impact (based on tests)
- Regression tests show the biggest wins on heavy workloads: large-scale patching and repeated-append cases (the 1000-item and many-iteration tests) see sizable reductions (e.g., ~16% and ~26% in annotated tests). Those are precisely the cases where per-iteration overhead dominates.
- A few tiny regressions appear in microbench tests (single small calls): sub-microsecond differences or small percentage slowdowns. These are minor and expected trade-offs for the net runtime improvement across typical workloads.
- Memory trade-off: `objs[start:]` creates a shallow copy of the sub-list. In typical usage this sub-list is small (only newly appended items), so the allocation cost is small and outweighed by per-iteration savings. If you have pathological cases where you repeatedly slice very large tails, that could increase temporary memory pressure, but tests and typical interpreter usage show a net benefit.

Correctness
- The logic is preserved: `render_mode` is applied only to new `LTChar` objects and `_last_patched_idx` is updated the same way. Using `getattr` defaults keeps the previous behavior for missing attributes.

Summary
- Net effect: fewer attribute lookups, less indexing overhead, and an early exit optimize the common and hot cases encountered during PDF text interpretation, producing the measured ~7% runtime speedup.
The small memory/alloc cost of slicing is guarded and, in practice, outweighed by the reduced CPU overhead on the hot path (do_TJ/do_Tj).
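Pieced together from the description above, the optimized method might look roughly like this (a sketch only; the `_Device`, `_TextState`, and `_Interp` stubs are hypothetical scaffolding added for self-containment, not the PR's actual classes):

```python
# Sketch of the getattr/early-return/slice-guard pattern described above.
class LTChar:
    """Stand-in stub for pdfminer's layout character class."""


class _Device:
    def __init__(self):
        self.cur_item = None  # populated with a layout container during parsing


class _TextState:
    render = 1  # stand-in for the current text render mode


class _Interp:
    def __init__(self):
        self.device = _Device()
        self.textstate = _TextState()
        self._last_patched_idx = 0

    def _patch_current_chars_with_render_mode(self):
        # One getattr with a default replaces hasattr + attribute fetch.
        cur_item = getattr(self.device, "cur_item", None)
        if not cur_item:
            return  # early exit for the common no-cur_item case
        objs = getattr(cur_item, "_objs", [])
        start = self._last_patched_idx
        if start < len(objs):  # avoid allocating an empty slice
            render_mode = self.textstate.render
            for obj in objs[start:]:  # direct iteration, no index arithmetic
                if isinstance(obj, LTChar):
                    obj.rendermode = render_mode
            self._last_patched_idx = len(objs)


# Usage sketch: patch the two new chars appended to the current layout item.
interp = _Interp()
page = type("Page", (), {})()
page._objs = [LTChar(), "not a char", LTChar()]
interp.device.cur_item = page
interp._patch_current_chars_with_render_mode()
```

A second call with no newly appended objects takes the `start < len(objs)` fast path and does no slicing at all.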
…erpreter._patch_current_chars_with_render_mode-mm3lbz82 ⚡️ Speed up method `CustomPDFPageInterpreter._patch_current_chars_with_render_mode` by 7%
Use idiomatic slice iteration instead of index-based loop, remove redundant guard and trailing return.
I think we can have each
@claude can you review this change? Of particular concern to me is that before the change there was a guard against "re-patching" an already-patched character, and that guard appears to be gone now.
Claude finished @qued's task — Review: PDF Parsing Optimization
Critical Issue Identified
The `_last_patched_idx` approach overwrites previously patched chars when `cur_item` changes back. Instead, each `do_TJ`/`do_Tj` can snapshot `len(objs)` before calling `super()` and patch only from that index:

```python
def _patch_current_chars_with_render_mode(self, start: int):
    """Add render_mode to LTChar objects added since index `start`."""
    cur_item = getattr(self.device, "cur_item", None)
    if not cur_item:
        return
    render_mode = self.textstate.render
    for obj in getattr(cur_item, "_objs", ())[start:]:
        if isinstance(obj, LTChar):
            obj.rendermode = render_mode

def do_TJ(self, seq):
    start = len(getattr(getattr(self.device, "cur_item", None), "_objs", ()))
    super().do_TJ(seq)
    self._patch_current_chars_with_render_mode(start)

def do_Tj(self, s):
    start = len(getattr(getattr(self.device, "cur_item", None), "_objs", ()))
    super().do_Tj(s)
    self._patch_current_chars_with_render_mode(start)
```
@claude review the #4269 (comment) implementation
Claude finished @KRRT7's task — Review: KRRT7's Proposed Implementation
Excellent Solution! ✅

KRRT7's proposed approach completely eliminates the correctness bug while maintaining the O(N) performance improvement. This is a much cleaner and more robust design.

Current Bug Analysis

The existing implementation (lines 28-35) has a critical flaw:

```python
# Reset index when cur_item changes (new page or figure)
if getattr(self, "_patched_cur_item", None) is not cur_item:
    self._last_patched_idx = 0  # ⚠️ DANGEROUS RESET
    self._patched_cur_item = cur_item
for obj in objs[self._last_patched_idx:]:
    if isinstance(obj, LTChar):
        obj.rendermode = render_mode  # Overwrites existing rendermode!
```

Problem: when `cur_item` reverts to a previously seen container (e.g., the page after a figure containing text ops), the index resets to 0 and already-patched chars are re-patched with the current render mode, overwriting the values they were originally patched with.

KRRT7's Solution Analysis

```python
def _patch_current_chars_with_render_mode(self, start: int):
    """Add render_mode to LTChar objects added since index `start`."""
    cur_item = getattr(self.device, "cur_item", None)
    if not cur_item:
        return
    render_mode = self.textstate.render
    for obj in getattr(cur_item, "_objs", ())[start:]:
        if isinstance(obj, LTChar):
            obj.rendermode = render_mode

def do_TJ(self, seq):
    start = len(getattr(getattr(self.device, "cur_item", None), "_objs", ()))
    super().do_TJ(seq)
    self._patch_current_chars_with_render_mode(start)

def do_Tj(self, s):
    start = len(getattr(getattr(self.device, "cur_item", None), "_objs", ()))
    super().do_Tj(s)
    self._patch_current_chars_with_render_mode(start)
```

Why This Approach Works ✅
Edge Cases Handled
Minor Suggestions

The nested `getattr` calls in `do_TJ`/`do_Tj` are dense; they could be flattened for readability:

```python
def do_TJ(self, seq):
    cur_item = getattr(self.device, "cur_item", None)
    start = len(getattr(cur_item, "_objs", ())) if cur_item else 0
    super().do_TJ(seq)
    self._patch_current_chars_with_render_mode(start)
```

Verdict

This is a superior solution: it eliminates the correctness bug, keeps the O(N) improvement, and removes the fragile `_last_patched_idx`/`_patched_cur_item` state.
Strongly recommend implementing this approach instead of the current one.
The `_last_patched_idx` approach overwrites previously patched chars when `cur_item` reverts after a figure with text ops. Instead, each `do_TJ`/`do_Tj` snapshots `len(objs)` before `super()` and only patches from that index.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
pdfminer's base `do_Tj` delegates to `self.do_TJ([s])`, which already dispatches to the overridden `do_TJ`. The `do_Tj` override was patching the same char range a second time.
Repro (add print traces to `do_TJ`/`do_Tj`, run against any PDF):

```python
from unstructured.partition.pdf_image.pdfminer_utils import open_pdfminer_pages_generator

with open("example-docs/pdf/reliance.pdf", "rb") as f:
    for page, layout in open_pdfminer_pages_generator(f):
        break
```
Before this fix, every Tj op produces two patch calls with the same start index:

```
[TRACE] do_TJ patching from 9
[TRACE] do_Tj patching from 9   <- redundant
```
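The delegation chain behind this double patch can be modeled in a few lines (a toy model under the delegation behavior described above, not pdfminer's actual classes): because the base `do_Tj` routes through `do_TJ`, overriding `do_TJ` alone is sufficient, and a separate `do_Tj` override patches twice.

```python
# Toy model of the delegation bug: Base.do_Tj dispatches to self.do_TJ,
# mirroring pdfminer's base interpreter behavior described above.
calls = []


class Base:
    def do_TJ(self, seq):
        pass

    def do_Tj(self, s):
        # Base behavior: delegate the simple operator to the list form.
        self.do_TJ([s])


class DoublePatching(Base):
    def do_TJ(self, seq):
        super().do_TJ(seq)
        calls.append("patch via do_TJ")

    def do_Tj(self, s):
        super().do_Tj(s)                  # already dispatches to do_TJ above
        calls.append("patch via do_Tj")   # redundant second patch


class Fixed(Base):
    # Only do_TJ is overridden; the base do_Tj routes through it.
    def do_TJ(self, seq):
        super().do_TJ(seq)
        calls.append("patch via do_TJ")


calls.clear()
DoublePatching().do_Tj("abc")
double = list(calls)  # two patch calls for one Tj op

calls.clear()
Fixed().do_Tj("abc")
fixed = list(calls)   # a single patch call
```

Dropping the `do_Tj` override, as in `Fixed`, leaves exactly one patch call per Tj operator.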
got a benchmark script set up with the most recent diff
…_patch_current_chars_with_render_mode-mm3h21a8
Closing as #4266 merged instead.
Problem

`_patch_current_chars_with_render_mode` is called on every `do_TJ`/`do_Tj` text operator during PDF parsing. The original implementation re-scans the entire `cur_item._objs` list each time, checking `hasattr(item, "rendermode")` to skip already-patched items. For a page with N characters across M text operations, this is O(N*M): effectively quadratic. Memray profiling showed this function as the #1 allocator: 17.57 GB total across 549M allocations in a session processing just 4 files.

Fix

Track the last-patched index so each call only processes newly added `LTChar` objects. Reset automatically when `cur_item` changes (new page or figure).

Before: O(N²) per page; re-scans all accumulated objects on every text operator.
After: O(N) per page; each object is visited exactly once.
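The complexity difference can be illustrated with a toy counter model (the `simulate` helper is hypothetical, counting visited objects rather than running the real interpreter):

```python
# Toy model: M text operators each append one char, then patching runs.
# Counting visited objects shows full re-scan is quadratic in M while
# index tracking is linear.
def simulate(track_index, num_ops=100):
    objs, visited, last = [], 0, 0
    for _ in range(num_ops):
        objs.append(object())          # one new char per text operator
        start = last if track_index else 0
        visited += len(objs) - start   # objects examined by this patch call
        if track_index:
            last = len(objs)
    return visited


rescan = simulate(track_index=False)   # visits 1 + 2 + ... + M = M*(M+1)/2
tracked = simulate(track_index=True)   # visits exactly M objects
```

For M = 100 operators, the re-scan variant touches 5,050 objects while the tracked variant touches 100, matching the O(N²) versus O(N) claim above.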
@KRRT7, copied your PR as an internal PR here.
Note
Low Risk
Low risk because no actual modifications were present in the supplied diff/repo snapshot; if the intended performance change exists elsewhere, it isn’t visible here.
Overview
The provided `full_diff` appears empty (`+++ /dev/null` only), and repository search found no occurrences of `_patch_current_chars_with_render_mode`, `do_TJ`/`do_Tj`, `LTChar`, or `rendermode`, so there are no verifiable code changes to summarize in this PR snapshot.

Written by Cursor Bugbot for commit 594c2da. This will update automatically on new commits.