fix: avoid O(N²) re-scanning in _patch_current_chars_with_render_mode#4266
Conversation
Runtime improvement: the change reduces average call time from ~174μs to ~128μs (≈35% faster overall), with the biggest wins on workloads that iterate many items (e.g., the 1,000-item test improved ~49%).

**What changed**
- Combined two checks into one short-circuiting conditional:
  - Old: check `hasattr(item, "rendermode")` first, `continue` if present, then `isinstance(item, LTChar)` before assignment.
  - New: `if isinstance(item, LTChar) and not hasattr(item, "rendermode"):` assign.
- Removed the explicit `continue` and the separate `hasattr` branch; the logic is identical from a correctness perspective.

**Why this speeds things up**
- Fewer attribute lookups per loop iteration. Attribute access in Python (`hasattr` / attribute lookup) is relatively expensive. The original code executed `hasattr(item, "rendermode")` for every item, even for non-LTChar objects. By testing `isinstance(item, LTChar)` first, the `hasattr` call is skipped for non-LTChar items (the common case), saving an attribute lookup per non-LTChar item.
- Short-circuit evaluation reduces bytecode and branching overhead (no separate `continue` branch).
- These savings compound in the hot path: this method is invoked from `do_TJ` and `do_Tj` while parsing text, so it runs many times per page. The profiler and tests show the loop-level savings translate into measurable end-to-end improvement.

**Profiler & test evidence**
- Overall profiler total time dropped from 0.001776s to 0.001371s.
- The large-scale test (1000 LTChar objects) went from 113μs to 76.2μs (~49.5% faster), demonstrating that the optimization scales with the number of items.
- Many unit tests also show smaller but consistent improvements (see annotated_tests). A single small regression was observed in one micro-benchmark (+~5%, in a narrow case where items are always LTChar and already patched), an acceptable trade-off given the overall runtime/throughput gains.
**Impact on workloads**
- Best for PDFs with many text items or mixed-type `_objs` lists (many non-LTChar items): savings are greatest because we avoid unnecessary `hasattr` calls for non-LTChar entries.
- Safe to merge: behavior is preserved (pre-existing `rendermode` attributes are still honored), and dependencies/visibility are unchanged.

**Summary**
- Primary benefit: reduced runtime (35% overall, with large improvements on heavier inputs).
- Cause: eliminated redundant attribute checks via a single short-circuiting conditional, lowering per-item overhead in a hot path.
- Trade-off: negligible; one tiny micro-benchmark showed a small regression, but overall throughput and real-workload performance improved significantly.
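The before/after conditionals described above can be sketched as follows. This is a minimal illustration, not the actual interpreter code: `LTChar` stands in for `pdfminer.layout.LTChar`, and `LTOther` is a hypothetical non-character layout object.

```python
class LTChar:
    """Minimal stand-in for pdfminer's LTChar."""


class LTOther:
    """Stand-in for a non-character layout object (never patched)."""


def patch_old(objs, render_mode):
    # Old shape: hasattr runs for every item, even non-LTChar ones.
    for item in objs:
        if hasattr(item, "rendermode"):
            continue
        if isinstance(item, LTChar):
            item.rendermode = render_mode


def patch_new(objs, render_mode):
    # New shape: isinstance is tested first, so non-LTChar items
    # short-circuit past the hasattr call entirely.
    for item in objs:
        if isinstance(item, LTChar) and not hasattr(item, "rendermode"):
            item.rendermode = render_mode
```

Both versions produce the same result; the difference is only in how many attribute lookups the common (non-LTChar, or already-patched) cases pay.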
Track last-patched index so each do_TJ/do_Tj call only processes newly-added LTChar objects instead of re-scanning the entire _objs list. Resets automatically when cur_item changes (new page or figure).
Runtime improvement (primary): the optimized function runs ~7% faster overall (181μs → 168μs). That runtime reduction is the reason this change was accepted.

**What changed (concrete optimizations)**
- Single-step `cur_item` lookup: replaced `hasattr(self.device, "cur_item") and self.device.cur_item` with `cur_item = getattr(self.device, "cur_item", None)` and an early return. This collapses two attribute lookups into one and short-circuits the common no-cur-item case.
- Single `getattr` for `_objs`: replaced the conditional expression `cur_item._objs if hasattr(cur_item, "_objs") else []` with `objs = getattr(cur_item, "_objs", [])`, avoiding an extra `hasattr` call and repeated attribute access.
- Avoid repeated len/index work: replaced the index-based loop `for i in range(start, len(objs)): obj = objs[i]` with direct iteration over the sub-list (`for obj in objs[start:]`). Also added an explicit `if start < len(objs):` guard so we don't build a slice when there's nothing to process.

**Why these changes speed things up (mechanics)**
- Fewer Python-level attribute lookups: `getattr` replaces two operations (`hasattr` + attribute fetch) with one, and caching `cur_item` and `objs` removes repeated attribute access inside the hot path. Attribute lookup in Python is comparatively expensive, so reducing it lowers per-call overhead.
- Reduced indexing overhead: direct iteration avoids repeated integer arithmetic and a list `__getitem__` per iteration, lowering per-iteration overhead inside the tight loop that checks and patches LTChar objects.
- Early return in the common negative case: when `device.cur_item` is missing or falsy, the function now returns earlier with less work, which helps when many interpreter calls have no `cur_item`.
- Guarded slice creation: creating `objs[start:]` can allocate a new list.
The added `start < len(objs)` check prevents unnecessary allocations in the common empty/no-new-items case; when the slice is needed it's typically small (we only process newly appended items), so the copy cost is minimal compared to the saved per-iteration overhead.

**Why this matters in context (hot path)**
- This method is invoked from `do_TJ` and `do_Tj` (text painting operators) in the interpreter. Those are hot paths during PDF text processing: the method can be called many times per page. Small per-call savings accumulate, so a ~7% per-call runtime improvement is meaningful.

**Behavioral and workload impact (based on tests)**
- Regression tests show the biggest wins on heavy workloads: large-scale patching and repeated-appends (1000-item and many-iteration cases) see sizable reductions (e.g., ~16% and ~26% in annotated tests). Those are precisely the cases where per-iteration overhead dominates.
- Some tiny regressions in a few microbenchmark tests (single small calls) are visible in the annotated tests (sub-microsecond differences or small percentage slowdowns). These are minor, expected trade-offs for the net runtime improvement across typical workloads.
- Memory trade-off: `objs[start:]` creates a shallow copy of the sub-list. In typical usage this sub-list is small (only newly appended items), so the allocation cost is small and outweighed by per-iteration savings. Pathological cases that repeatedly slice very large tails could increase temporary memory pressure, but tests and typical interpreter usage show a net benefit.

**Correctness**
- The logic is preserved: `render_mode` is applied only to new LTChar objects, and `_last_patched_idx` is updated the same way. Using `getattr` defaults keeps the previous behavior for missing attributes.

**Summary**
- Net effect: fewer attribute lookups, less indexing overhead, and an early exit optimize the common and hot cases encountered during PDF text interpretation, producing the measured ~7% runtime speedup.
The small memory/alloc cost of slicing is guarded and, in practice, outweighed by the reduced CPU overhead on the hot path (do_TJ/do_Tj).
…erpreter._patch_current_chars_with_render_mode-mm3lbz82 ⚡️ Speed up method `CustomPDFPageInterpreter._patch_current_chars_with_render_mode` by 7%
Use idiomatic slice iteration instead of index-based loop, remove redundant guard and trailing return.
```python
if getattr(self, "_patched_cur_item", None) is not cur_item:
    self._last_patched_idx = 0
    self._patched_cur_item = cur_item
for obj in objs[self._last_patched_idx:]:
    if isinstance(obj, LTChar):
        obj.rendermode = render_mode
self._last_patched_idx = len(objs)
```
The way I understand this logic, `_last_patched_idx` can either be 0 or `len(objs)`, so the net effect is to either iterate through all of `objs` or to skip all of them. If so, why not just gate the whole process on whether `cur_item` has changed? Something like:
```python
# current
if getattr(self, "_patched_cur_item", None) is not cur_item:
    self._last_patched_idx = 0
    self._patched_cur_item = cur_item
for obj in objs[self._last_patched_idx:]:
    if isinstance(obj, LTChar):
        obj.rendermode = render_mode
self._last_patched_idx = len(objs)
```

```python
# suggested
if getattr(self, "_patched_cur_item", None) is not cur_item:
    self._patched_cur_item = cur_item
    for obj in objs:
        if isinstance(obj, LTChar):
            obj.rendermode = render_mode
```
I think it's easier to understand what's going on this way.
This also seems different from the previous code: this version will "re-patch" objects that have already been patched, while the old code had a guard against that. That would only matter if `self.textstate.render` can change between calls of this method while `cur_item` changes but later changes back, and I'm not familiar enough with this flow to know whether that can happen.
@claude can you investigate this?
Thanks for the review! The key detail is that _patch_current_chars_with_render_mode is called after every do_TJ/do_Tj, and each super() call appends new LTChar objects to cur_item._objs before we get here. So for a page with multiple text operations, _objs grows between calls while cur_item stays the same.
With the suggested approach, only the first text operation per cur_item would patch its characters — all subsequent ones on the same page would be skipped since cur_item hasn't changed.
The _last_patched_idx tracking is what gives us O(N) amortized — each call slices from where the last one left off, so every LTChar is visited exactly once.
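The amortized-O(N) claim above can be demonstrated with a toy model. This is illustration only: `patch_calls` is a hypothetical helper where each loop iteration stands in for one `do_TJ`/`do_Tj` call appending characters, and each visit stands in for the `isinstance` check plus patch.

```python
def patch_calls(num_ops, chars_per_op, track_index):
    """Count per-object visits across num_ops simulated text operators."""
    objs, visits, start = [], 0, 0
    for _ in range(num_ops):
        # Each text operator appends new chars before the patch runs.
        objs.extend(object() for _ in range(chars_per_op))
        begin = start if track_index else 0  # tracked vs full re-scan
        for _ in objs[begin:]:
            visits += 1
        start = len(objs)
    return visits
```

With 100 operators of 10 characters each, index tracking visits each of the 1,000 objects exactly once, while the full re-scan visits 10 + 20 + … + 1,000 = 50,500 objects: the quadratic blow-up the PR describes.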
actually I think you're onto something, going to debug and come up with a fix.
> So for a page with multiple text operations, `_objs` grows between calls while `cur_item` stays the same.
Gotcha, that's a piece I was missing.
Opened duplicate #4269 as an internal PR for full access to CI tools.
The _last_patched_idx approach overwrites previously patched chars when cur_item reverts after a figure with text ops. Instead, each do_TJ/do_Tj snapshots len(objs) before super() and only patches from that index.
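The snapshot approach can be sketched like this. It is a simplified model, not the PR's code: `BaseInterpreter.do_TJ` simulates pdfminer's base behavior of appending new `LTChar` objects, and `objs` flattens away the real `device.cur_item._objs` indirection.

```python
class LTChar:
    """Stand-in for pdfminer's LTChar."""


class BaseInterpreter:
    """Simulated base class: do_TJ appends one LTChar per element."""
    def __init__(self):
        self.objs = []

    def do_TJ(self, seq):
        self.objs.extend(LTChar() for _ in seq)


class PatchingInterpreter(BaseInterpreter):
    render_mode = 0

    def do_TJ(self, seq):
        start = len(self.objs)   # snapshot BEFORE super() appends
        super().do_TJ(seq)
        for obj in self.objs[start:]:  # patch only the newly added chars
            if isinstance(obj, LTChar):
                obj.rendermode = self.render_mode
```

Because the start index is taken per call rather than stored on the interpreter, a `cur_item` that reverts after a nested figure cannot cause earlier characters to be re-patched with a later render mode.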
pdfminer's base do_Tj delegates to self.do_TJ([s]), which already
dispatches to the overridden do_TJ. The do_Tj override was patching
the same char range a second time.
Repro (add print traces to do_TJ/do_Tj, run against any PDF):

```python
from unstructured.partition.pdf_image.pdfminer_utils import open_pdfminer_pages_generator

with open("example-docs/pdf/reliance.pdf", "rb") as f:
    for page, layout in open_pdfminer_pages_generator(f):
        break
```

Before this fix, every Tj op produces two patch calls with the same start index:

```
[TRACE] do_TJ patching from 9
[TRACE] do_Tj patching from 9   <- redundant
```
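The delegation causing the double patch can be modeled in miniature. This sketch uses stand-in classes: `Base.do_Tj` mirrors pdfminer's base `do_Tj`, which delegates to `self.do_TJ([s])` and therefore dispatches to the subclass override.

```python
class Base:
    def do_Tj(self, s):
        self.do_TJ([s])  # mirrors pdfminer's delegation to do_TJ

    def do_TJ(self, seq):
        pass


class Patching(Base):
    def __init__(self):
        self.patch_count = 0

    def do_TJ(self, seq):
        super().do_TJ(seq)
        self.patch_count += 1  # patch once here...

    def do_Tj(self, s):
        super().do_Tj(s)
        self.patch_count += 1  # ...and again here: redundant
```

One `do_Tj` call increments the counter twice, matching the doubled `[TRACE]` lines above; dropping the patch from the `do_Tj` override leaves exactly one patch per operator.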
…r-branch Merge changes from upstream
aef3bc4
Problem
`_patch_current_chars_with_render_mode` is called on every `do_TJ`/`do_Tj` text operator during PDF parsing. The original implementation re-scans the entire `cur_item._objs` list each time, checking `hasattr(item, "rendermode")` to skip already-patched items. For a page with N characters across M text operations, this is O(N*M) — effectively quadratic.

Memray profiling showed this function as the #1 allocator: 17.57 GB total across 549M allocations in a session processing just 4 files.
Fix
Track the last-patched index so each call only processes newly-added `LTChar` objects. Reset automatically when `cur_item` changes (new page or figure).

Before: O(N²) per page — re-scans all accumulated objects on every text operator
After: O(N) per page — each object visited exactly once