Comparing changes
base repository: Unstructured-IO/unstructured
base: 0.22.6
head repository: Unstructured-IO/unstructured
compare: 0.22.10
5 commits · 25 files changed · 3 contributors
Commits on Mar 26, 2026
fix(chunking): preserve nested table structure in reconstruction (#4301)
## Summary
- Fix `_merge_table_chunks()` to merge only top-level rows from each chunk HTML table.
- Prevent nested table rows from being hoisted into the reconstructed root table.
- Add regression coverage to verify nested table structure is preserved.

## Finding Reference
- #4291 (comment)

## Validation
- `unset VIRTUAL_ENV && CI=false uv run --no-sync pytest -q test_unstructured/chunking/test_base.py -k "reconstruct_tables_from_a_mixed_element_list or preserves_nested_table_structure" --maxfail=1`
- `unset VIRTUAL_ENV && CI=false uv run --no-sync pytest -q test_unstructured/chunking/test_base.py test_unstructured/chunking/test_dispatch.py --maxfail=1`
- Smoke check:

  ```bash
  unset VIRTUAL_ENV && uv run --no-sync python - <<'PY'
  from unstructured.partition.text import partition_text
  elements = partition_text(text="Codex initializer smoke test")
  assert elements, "partition_text returned no elements"
  print(f"partition_text smoke check passed ({len(elements)} elements)")
  PY
  ```

- `unset VIRTUAL_ENV && CI=false uv run --no-sync pytest -q test_unstructured/partition/test_text.py --maxfail=1`

Authored by codex.
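The fix hinges on iterating only the root table's direct child rows rather than every `<tr>` in the subtree. A minimal sketch of that distinction, using stdlib `xml.etree.ElementTree` on XHTML-style fragments — `merge_table_chunks` here is a hypothetical stand-in, not the actual `_merge_table_chunks()` implementation:

```python
import xml.etree.ElementTree as ET

def merge_table_chunks(chunk_htmls):
    """Merge only the top-level <tr> rows of each chunk's root <table>."""
    merged = ET.Element("table")
    for html in chunk_htmls:
        table = ET.fromstring(html)
        # Iterate direct children only. Using table.iter("tr") instead
        # would also pull rows out of nested tables -- the bug being fixed.
        for row in list(table):
            if row.tag == "tr":
                merged.append(row)
    return ET.tostring(merged, encoding="unicode")

chunks = [
    "<table><tr><td>a</td></tr></table>",
    "<table><tr><td><table><tr><td>inner</td></tr></table></td></tr></table>",
]
merged_html = merge_table_chunks(chunks)
# The nested table stays intact inside its cell; the merged root table
# has exactly two top-level rows, one from each chunk.
top_level_rows = [c for c in ET.fromstring(merged_html) if c.tag == "tr"]
```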
Commit: 94b3ffd
Commits on Mar 27, 2026
Replace lazyproperty with functools.cached_property (#4282)
## Summary
- Replace custom `lazyproperty` descriptor with stdlib `functools.cached_property`.
- Fix a bug where 26 properties returning `None` were re-evaluated on every access instead of caching: `lazyproperty.__get__` uses `if value is None` to detect a cache miss, so any property that legitimately returns `None` re-runs on every access.
- Slight performance improvement on cached reads: `cached_property` is a non-data descriptor, so after first access the `__dict__` entry shadows the descriptor directly (a plain dict lookup instead of a `__get__` call).
- Nothing in the codebase ever assigns to a lazyproperty-decorated attribute, so dropping the write-protection of the data descriptor has no behavioral impact.
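The `None`-caching bug is easy to reproduce with a simplified stand-in for the old descriptor (this sketch is not the project's actual `lazyproperty` code, but it models the `if value is None` cache-miss check and the write-protecting `__set__` that made it a data descriptor):

```python
import functools

class lazyproperty:  # simplified stand-in for the old custom descriptor
    def __init__(self, fget):
        self._fget = fget
        self._name = fget.__name__

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        cached = obj.__dict__.get(self._name)
        if cached is None:            # BUG: a legitimate None result
            cached = self._fget(obj)  # looks like a cache miss
            obj.__dict__[self._name] = cached
        return cached

    def __set__(self, obj, value):
        # write-protection makes this a *data* descriptor, so __get__
        # always runs even after a value is stored in obj.__dict__
        raise AttributeError("lazyproperty is read-only")

class OldStyle:
    calls = 0
    @lazyproperty
    def maybe_none(self):
        OldStyle.calls += 1
        return None   # legitimately None

class NewStyle:
    calls = 0
    @functools.cached_property
    def maybe_none(self):
        NewStyle.calls += 1
        return None

old, new = OldStyle(), NewStyle()
_ = old.maybe_none; _ = old.maybe_none  # fget runs on *every* access
_ = new.maybe_none; _ = new.maybe_none  # fget runs once; __dict__ shadows
```

After two reads, the old descriptor has re-run its getter twice, while `cached_property` ran it once and then served the cached `None` straight from the instance `__dict__`.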
Commit: 7c5855b
mem: reduce PaddleOCR rec_batch_num from 6 to 1 (#4295)
Reduce PaddleOCR `rec_batch_num` from 6 (the default) to 1. Paddle's native inference engine allocates 500 MiB memory-arena chunks proportional to recognition batch size. With `rec_batch_num=6`, four chunks are allocated during text recognition; setting it to 1 reduces this to one chunk.

| Setting | Peak memory |
|---------|-------------|
| `rec_batch_num=6` | 7,184 MiB |
| `rec_batch_num=1` | 2,684 MiB |
| **Delta** | **-4,500 MiB (-62.6%)** |

Measured with `memray run` on `layout-parser-paper-with-table.pdf` through `partition()` with hi_res + PaddleOCR table OCR. On CPU, batch processing doesn't parallelize — it is sequential within `predictor.run()` — so smaller batches just allocate less workspace memory.

## Reproduce
Requires `unstructured[pdf]`, `paddlepaddle`, `unstructured-paddleocr`, and `memray`.

```bash
cat > /tmp/bench_paddle.py << 'SCRIPT'
from unstructured.partition.auto import partition

elements = partition(
    filename="example-docs/layout-parser-paper.pdf",
    strategy="hi_res",
    pdf_infer_table_structure=True,
    ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
)
print(f"Partitioned: {len(elements)} elements")
SCRIPT

# Baseline (main branch, rec_batch_num=6):
git checkout main
memray run --native --trace-python-allocators -o /tmp/paddle_baseline.bin /tmp/bench_paddle.py
memray stats /tmp/paddle_baseline.bin | grep "Peak memory"

# With this change (rec_batch_num=1):
git checkout mem/paddle-rec-batch-num
memray run --native --trace-python-allocators -o /tmp/paddle_opt.bin /tmp/bench_paddle.py
memray stats /tmp/paddle_opt.bin | grep "Peak memory"
```
Commit: 47f4728
Commits on Mar 31, 2026
fix: isolate Table elements in pre-chunks (#4307)
## Summary
This change enforces the documented table-isolation guarantees in chunking:
- `Table` and `TableChunk` are always staged in their own pre-chunk and never combined with adjacent non-table elements into a `CompositeElement`.
- `PreChunkCombiner` will not merge pre-chunks when either side contains a table-family element, preventing "table gets wrapped/merged" behavior when `combine_text_under_n_chars` is enabled.
- Shared helper functions centralize the table-isolation checks in `unstructured.chunking.base`.

Also includes:
- Updated/adjusted chunking tests to reflect the new behavior.
- A dedicated `test_table_isolation.py` regression suite.
- Version bump and `CHANGELOG.md` entry to document the fix.

Closes #3921
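The combiner-side rule amounts to a simple predicate over the two candidate pre-chunks. A sketch of the idea — `is_table_family` and `can_combine` are illustrative names, not the actual helpers added to `unstructured.chunking.base`, and pre-chunks are modeled as plain lists of element category strings:

```python
TABLE_FAMILY = {"Table", "TableChunk"}

def is_table_family(category: str) -> bool:
    return category in TABLE_FAMILY

def can_combine(pre_chunk_a, pre_chunk_b) -> bool:
    """Two pre-chunks may merge only when neither side contains a
    table-family element, regardless of combine_text_under_n_chars."""
    return not any(is_table_family(c) for c in (*pre_chunk_a, *pre_chunk_b))

# text pre-chunks still combine; anything touching a table stays isolated
ok = can_combine(["Title", "NarrativeText"], ["NarrativeText"])
blocked = can_combine(["Table"], ["NarrativeText"])
```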
Commit: 6360ef7
feat(chunking): repeat table headers on continuation chunks (#4298)
## Behavior summary

### Before
- Oversized table chunks only preserved headers in the first chunk; continuation chunks could lose column context.
- Table header semantics (`<thead>` / `<th>`) were not retained as explicit row-level metadata after compactification.

### After
- Added `repeat_table_headers` (default `True`) to chunking APIs and strategy plumbing:
  - `chunk_elements(..., repeat_table_headers=...)`
  - `chunk_by_title(..., repeat_table_headers=...)`
  - `add_chunking_strategy(...)` forwarded args/docs
- `_TableChunker` now detects contiguous leading header rows and repeats them on non-initial continuation chunks.
- Repeated header rows are prepended to both continuation chunk text and `text_as_html`.
- First chunk behavior remains unchanged relative to legacy output.
- Added a guardrail: if a repeated header row would consume more than half the chunk window, the splitter falls back to legacy non-repeating behavior.

## Invariants
- No body-row drop, duplication, or reordering across emitted continuation chunks.
- Opt-out behavior (`repeat_table_headers=False`) matches legacy table splitting behavior.
- Chunk windows still respect max-size constraints, including near-boundary continuation windows.
- Only contiguous leading header rows are repeated; later non-leading header-like rows are not promoted.

## Edge cases covered
- No headers, single leading header row, multiple leading header rows.
- Header detection from both `<thead>` and `<th>` rows.
- Exact-fit and near-boundary continuation sizing.
- Cascading repetition across 3+ continuation chunks.
- Pathologically large header rows trigger safe fallback to non-repeating behavior.
- Strategy-path forwarding validated through `partition_html(..., chunking_strategy="by_title")`.

## Test evidence
- `uv run --no-sync pytest -q test_unstructured/chunking/test_dispatch.py` (6 passed)
- `uv run --no-sync pytest -q test_unstructured/chunking/test_base.py -k "Describe_TableChunker"` (26 passed)
- `uv run --no-sync pytest -q test_unstructured/chunking/test_title.py::test_add_chunking_strategy_forwards_repeat_table_headers` (1 passed)
- `uv run --no-sync pytest -q test_unstructured/chunking/test_title.py -k "repeat_table_headers"` (5 passed)
- `uv run --with python-docx pytest -q test_unstructured/chunking/test_basic.py -k "repeat_table_headers"` (4 passed)
- `uv run --no-sync pytest -q test_unstructured/common/test_html_table.py` (26 passed)

Authored by codex.
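The splitting rule, including the half-window guardrail, can be modeled with a toy row-count splitter. The real `_TableChunker` measures characters over HTML rows; the function name and the row-count window below are illustrative only:

```python
def split_table_rows(rows, n_header_rows, window, repeat_table_headers=True):
    """Split `rows` into chunks of at most `window` rows, repeating the
    contiguous leading header rows on every continuation chunk."""
    headers, body = rows[:n_header_rows], rows[n_header_rows:]
    if not body:
        return [rows]
    # Guardrail: if the repeated headers would consume more than half
    # the window, fall back to legacy non-repeating splitting.
    if not repeat_table_headers or len(headers) > window // 2:
        return [rows[i:i + window] for i in range(0, len(rows), window)]
    per_chunk = window - len(headers)  # body rows that fit beside headers
    return [headers + body[i:i + per_chunk]
            for i in range(0, len(body), per_chunk)]

chunks = split_table_rows(["hdr", "r1", "r2", "r3", "r4", "r5"], 1, 3)
# every chunk carries the header row; body rows keep their order with
# no drops or duplicates across continuation chunks
```

Note how the first chunk comes out identical to what legacy fixed-size splitting would produce, matching the "first chunk behavior remains unchanged" invariant above.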
Commit: b6cf510
This comparison is too large for GitHub to render. To see it locally, run:

```
git diff 0.22.6...0.22.10
```