Comparing changes
base repository: Unstructured-IO/unstructured
base: 0.22.6
head repository: Unstructured-IO/unstructured
compare: 0.22.10
5 commits · 25 files changed · 3 contributors
Commits on Mar 26, 2026
fix(chunking): preserve nested table structure in reconstruction (#4301)
## Summary
- Fix `_merge_table_chunks()` to merge only top-level rows from each chunk HTML table.
- Prevent nested table rows from being hoisted into the reconstructed root table.
- Add regression coverage to verify nested table structure is preserved.

## Finding Reference
- #4291 (comment)

## Validation
- `unset VIRTUAL_ENV && CI=false uv run --no-sync pytest -q test_unstructured/chunking/test_base.py -k "reconstruct_tables_from_a_mixed_element_list or preserves_nested_table_structure" --maxfail=1`
- `unset VIRTUAL_ENV && CI=false uv run --no-sync pytest -q test_unstructured/chunking/test_base.py test_unstructured/chunking/test_dispatch.py --maxfail=1`
- Smoke check:

  ```bash
  unset VIRTUAL_ENV && uv run --no-sync python - <<'PY'
  from unstructured.partition.text import partition_text
  elements = partition_text(text="Codex initializer smoke test")
  assert elements, "partition_text returned no elements"
  print(f"partition_text smoke check passed ({len(elements)} elements)")
  PY
  ```

- `unset VIRTUAL_ENV && CI=false uv run --no-sync pytest -q test_unstructured/partition/test_text.py --maxfail=1`

Authored by codex.
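The fix hinges on iterating only the root table's direct child rows rather than every `<tr>` in the subtree. A minimal sketch of that distinction, using stdlib `xml.etree.ElementTree` on XHTML-style fragments — `merge_table_chunks` here is a hypothetical stand-in, not the actual `_merge_table_chunks()` implementation:

```python
import xml.etree.ElementTree as ET

def merge_table_chunks(chunk_htmls):
    """Merge only the top-level <tr> rows of each chunk's root <table>."""
    merged = ET.Element("table")
    for html in chunk_htmls:
        table = ET.fromstring(html)
        # Iterate direct children only. Using table.iter("tr") instead
        # would also pull rows out of nested tables -- the bug being fixed.
        for row in list(table):
            if row.tag == "tr":
                merged.append(row)
    return ET.tostring(merged, encoding="unicode")

chunks = [
    "<table><tr><td>a</td></tr></table>",
    "<table><tr><td><table><tr><td>inner</td></tr></table></td></tr></table>",
]
merged_html = merge_table_chunks(chunks)
# The nested table stays intact inside its cell; the merged root table
# has exactly two top-level rows, one from each chunk.
top_level_rows = [c for c in ET.fromstring(merged_html) if c.tag == "tr"]
```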
Commit: 94b3ffd
Commits on Mar 27, 2026
Replace lazyproperty with functools.cached_property (#4282)
## Summary
- Replace custom `lazyproperty` descriptor with stdlib `functools.cached_property`.
- Fix a bug where 26 properties returning `None` were re-evaluated on every access instead of caching: `lazyproperty.__get__` uses `if value is None` to detect a cache miss, so any property that legitimately returns `None` re-runs on every access.
- Slight performance improvement on cached reads: `cached_property` is a non-data descriptor, so after first access the `__dict__` entry shadows the descriptor directly (a plain dict lookup instead of a `__get__` call).
- Nothing in the codebase ever assigns to a lazyproperty-decorated attribute, so dropping the write-protection of the data descriptor has no behavioral impact.
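The `None`-caching bug is easy to reproduce with a simplified stand-in for the old descriptor (this sketch is not the project's actual `lazyproperty` code, but it models the `if value is None` cache-miss check and the write-protecting `__set__` that made it a data descriptor):

```python
import functools

class lazyproperty:  # simplified stand-in for the old custom descriptor
    def __init__(self, fget):
        self._fget = fget
        self._name = fget.__name__

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        cached = obj.__dict__.get(self._name)
        if cached is None:            # BUG: a legitimate None result
            cached = self._fget(obj)  # looks like a cache miss
            obj.__dict__[self._name] = cached
        return cached

    def __set__(self, obj, value):
        # write-protection makes this a *data* descriptor, so __get__
        # always runs even after a value is stored in obj.__dict__
        raise AttributeError("lazyproperty is read-only")

class OldStyle:
    calls = 0
    @lazyproperty
    def maybe_none(self):
        OldStyle.calls += 1
        return None   # legitimately None

class NewStyle:
    calls = 0
    @functools.cached_property
    def maybe_none(self):
        NewStyle.calls += 1
        return None

old, new = OldStyle(), NewStyle()
_ = old.maybe_none; _ = old.maybe_none  # fget runs on *every* access
_ = new.maybe_none; _ = new.maybe_none  # fget runs once; __dict__ shadows
```

After two reads, the old descriptor has re-run its getter twice, while `cached_property` ran it once and then served the cached `None` straight from the instance `__dict__`.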
Commit: 7c5855b
mem: reduce PaddleOCR rec_batch_num from 6 to 1 (#4295)
Reduce PaddleOCR `rec_batch_num` from 6 (the default) to 1. Paddle's native inference engine allocates 500 MiB memory-arena chunks proportional to recognition batch size. With `rec_batch_num=6`, four chunks are allocated during text recognition; setting it to 1 reduces this to one chunk.

| Setting | Peak memory |
|---------|-------------|
| `rec_batch_num=6` | 7,184 MiB |
| `rec_batch_num=1` | 2,684 MiB |
| **Delta** | **-4,500 MiB (-62.6%)** |

Measured with `memray run` on `layout-parser-paper-with-table.pdf` through `partition()` with hi_res + PaddleOCR table OCR. On CPU, batch processing doesn't parallelize — it is sequential within `predictor.run()` — so smaller batches just allocate less workspace memory.

## Reproduce
Requires `unstructured[pdf]`, `paddlepaddle`, `unstructured-paddleocr`, and `memray`.

```bash
cat > /tmp/bench_paddle.py << 'SCRIPT'
from unstructured.partition.auto import partition

elements = partition(
    filename="example-docs/layout-parser-paper.pdf",
    strategy="hi_res",
    pdf_infer_table_structure=True,
    ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
)
print(f"Partitioned: {len(elements)} elements")
SCRIPT

# Baseline (main branch, rec_batch_num=6):
git checkout main
memray run --native --trace-python-allocators -o /tmp/paddle_baseline.bin /tmp/bench_paddle.py
memray stats /tmp/paddle_baseline.bin | grep "Peak memory"

# With this change (rec_batch_num=1):
git checkout mem/paddle-rec-batch-num
memray run --native --trace-python-allocators -o /tmp/paddle_opt.bin /tmp/bench_paddle.py
memray stats /tmp/paddle_opt.bin | grep "Peak memory"
```
Commit: 47f4728
Commits on Mar 31, 2026
fix: isolate Table elements in pre-chunks (#4307)
## Summary
This change enforces the documented table-isolation guarantees in chunking:
- `Table` and `TableChunk` are always staged in their own pre-chunk and never combined with adjacent non-table elements into a `CompositeElement`.
- `PreChunkCombiner` will not merge pre-chunks when either side contains a table-family element, preventing "table gets wrapped/merged" behavior when `combine_text_under_n_chars` is enabled.
- Shared helper functions centralize the table-isolation checks in `unstructured.chunking.base`.

Also includes:
- Updated/adjusted chunking tests to reflect the new behavior.
- A dedicated `test_table_isolation.py` regression suite.
- Version bump and `CHANGELOG.md` entry to document the fix.

Closes #3921
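The combiner-side rule amounts to a simple predicate over the two candidate pre-chunks. A sketch of the idea — `is_table_family` and `can_combine` are illustrative names, not the actual helpers added to `unstructured.chunking.base`, and pre-chunks are modeled as plain lists of element category strings:

```python
TABLE_FAMILY = {"Table", "TableChunk"}

def is_table_family(category: str) -> bool:
    return category in TABLE_FAMILY

def can_combine(pre_chunk_a, pre_chunk_b) -> bool:
    """Two pre-chunks may merge only when neither side contains a
    table-family element, regardless of combine_text_under_n_chars."""
    return not any(is_table_family(c) for c in (*pre_chunk_a, *pre_chunk_b))

# text pre-chunks still combine; anything touching a table stays isolated
ok = can_combine(["Title", "NarrativeText"], ["NarrativeText"])
blocked = can_combine(["Table"], ["NarrativeText"])
```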
Commit: 6360ef7
feat(chunking): repeat table headers on continuation chunks (#4298)
## Behavior summary

### Before
- Oversized table chunks only preserved headers in the first chunk; continuation chunks could lose column context.
- Table header semantics (`<thead>` / `<th>`) were not retained as explicit row-level metadata after compactification.

### After
- Added `repeat_table_headers` (default `True`) to chunking APIs and strategy plumbing:
  - `chunk_elements(..., repeat_table_headers=...)`
  - `chunk_by_title(..., repeat_table_headers=...)`
  - `add_chunking_strategy(...)` forwarded args/docs
- `_TableChunker` now detects contiguous leading header rows and repeats them on non-initial continuation chunks.
- Repeated header rows are prepended to both continuation chunk text and `text_as_html`.
- First chunk behavior remains unchanged relative to legacy output.
- Added a guardrail: if a repeated header row would consume more than half the chunk window, the splitter falls back to legacy non-repeating behavior.

## Invariants
- No body-row drop, duplication, or reordering across emitted continuation chunks.
- Opt-out behavior (`repeat_table_headers=False`) matches legacy table splitting behavior.
- Chunk windows still respect max-size constraints, including near-boundary continuation windows.
- Only contiguous leading header rows are repeated; later non-leading header-like rows are not promoted.

## Edge cases covered
- No headers, single leading header row, multiple leading header rows.
- Header detection from both `<thead>` and `<th>` rows.
- Exact-fit and near-boundary continuation sizing.
- Cascading repetition across 3+ continuation chunks.
- Pathologically large header rows trigger safe fallback to non-repeating behavior.
- Strategy-path forwarding validated through `partition_html(..., chunking_strategy="by_title")`.

## Test evidence
- `uv run --no-sync pytest -q test_unstructured/chunking/test_dispatch.py` (6 passed)
- `uv run --no-sync pytest -q test_unstructured/chunking/test_base.py -k "Describe_TableChunker"` (26 passed)
- `uv run --no-sync pytest -q test_unstructured/chunking/test_title.py::test_add_chunking_strategy_forwards_repeat_table_headers` (1 passed)
- `uv run --no-sync pytest -q test_unstructured/chunking/test_title.py -k "repeat_table_headers"` (5 passed)
- `uv run --with python-docx pytest -q test_unstructured/chunking/test_basic.py -k "repeat_table_headers"` (4 passed)
- `uv run --no-sync pytest -q test_unstructured/common/test_html_table.py` (26 passed)

Authored by codex.
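The splitting rule, including the half-window guardrail, can be modeled with a toy row-count splitter. The real `_TableChunker` measures characters over HTML rows; the function name and the row-count window below are illustrative only:

```python
def split_table_rows(rows, n_header_rows, window, repeat_table_headers=True):
    """Split `rows` into chunks of at most `window` rows, repeating the
    contiguous leading header rows on every continuation chunk."""
    headers, body = rows[:n_header_rows], rows[n_header_rows:]
    if not body:
        return [rows]
    # Guardrail: if the repeated headers would consume more than half
    # the window, fall back to legacy non-repeating splitting.
    if not repeat_table_headers or len(headers) > window // 2:
        return [rows[i:i + window] for i in range(0, len(rows), window)]
    per_chunk = window - len(headers)  # body rows that fit beside headers
    return [headers + body[i:i + per_chunk]
            for i in range(0, len(body), per_chunk)]

chunks = split_table_rows(["hdr", "r1", "r2", "r3", "r4", "r5"], 1, 3)
# every chunk carries the header row; body rows keep their order with
# no drops or duplicates across continuation chunks
```

Note how the first chunk comes out identical to what legacy fixed-size splitting would produce, matching the "first chunk behavior remains unchanged" invariant above.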
Commit: b6cf510
This comparison is too large for GitHub to render. To see it locally, run:

```
git diff 0.22.6...0.22.10
```