
Comparing changes

base repository: Unstructured-IO/unstructured
base: 0.22.6
head repository: Unstructured-IO/unstructured
compare: 0.22.10
  • 5 commits
  • 25 files changed
  • 3 contributors

Commits on Mar 26, 2026

  1. fix(chunking): preserve nested table structure in reconstruction (#4301)

    ## Summary
    - Fix `_merge_table_chunks()` to merge only top-level rows from each
    chunk HTML table.
    - Prevent nested table rows from being hoisted into the reconstructed
    root table.
    - Add regression coverage to verify nested table structure is preserved.
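
The fix described above can be sketched as follows. `merge_table_chunks` and the sample markup are illustrative stand-ins, not the library's actual `_merge_table_chunks()` implementation; the key point is iterating only over each chunk table's direct `<tr>` children rather than every `<tr>` in the subtree:

```python
import xml.etree.ElementTree as ET

def merge_table_chunks(chunk_htmls):
    """Merge split table chunks back into one table (hypothetical sketch).

    Only direct-child rows of each chunk's root <table> are moved into the
    merged table; rows belonging to a nested <table> stay where they are.
    """
    merged = ET.Element("table")
    for html in chunk_htmls:
        root = ET.fromstring(html)
        # Iterate only over *top-level* rows (direct children); using
        # root.iter("tr") instead would also hoist nested-table rows.
        for row in list(root):
            if row.tag == "tr":
                merged.append(row)
    return ET.tostring(merged, encoding="unicode")

chunk_1 = "<table><tr><td>outer<table><tr><td>inner</td></tr></table></td></tr></table>"
chunk_2 = "<table><tr><td>more</td></tr></table>"
merged = merge_table_chunks([chunk_1, chunk_2])
print(merged)
```

The merged root table ends up with exactly two top-level rows, and the inner table survives intact inside the first row's cell.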
    
    ## Finding Reference
    - #4291 (comment)
    
    ## Validation
    - `unset VIRTUAL_ENV && CI=false uv run --no-sync pytest -q
    test_unstructured/chunking/test_base.py -k
    "reconstruct_tables_from_a_mixed_element_list or
    preserves_nested_table_structure" --maxfail=1`
    - `unset VIRTUAL_ENV && CI=false uv run --no-sync pytest -q
    test_unstructured/chunking/test_base.py
    test_unstructured/chunking/test_dispatch.py --maxfail=1`
    - `unset VIRTUAL_ENV && uv run --no-sync python - <<'PY'
    from unstructured.partition.text import partition_text
    
    elements = partition_text(text="Codex initializer smoke test")
    assert elements, "partition_text returned no elements"
    print(f"partition_text smoke check passed ({len(elements)} elements)")
    PY`
    - `unset VIRTUAL_ENV && CI=false uv run --no-sync pytest -q
    test_unstructured/partition/test_text.py --maxfail=1`
    
    authored by codex
    cragwolfe authored Mar 26, 2026
    commit 94b3ffd

Commits on Mar 27, 2026

  1. Replace lazyproperty with functools.cached_property (#4282)

    ## Summary
    - Replace custom `lazyproperty` descriptor with stdlib
    `functools.cached_property`
    - Fix a bug where 26 properties that legitimately return `None` were
    re-evaluated on every access instead of being cached:
    `lazyproperty.__get__` uses `if value is None` to detect a cache miss,
    so a cached `None` is indistinguishable from an empty cache
    - Slight performance improvement on cached reads: `cached_property` is
    a non-data descriptor, so after the first access the instance
    `__dict__` entry shadows the descriptor directly (a plain dict lookup
    instead of a `__get__` call)
    - Nothing in the codebase ever assigns to a lazyproperty-decorated
    attribute, so dropping the write-protection from the data descriptor has
    no behavioral impact
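
The caching bug can be reproduced with a minimal stand-in for the old descriptor. This `lazyproperty` is a sketch of the pattern described above (a data descriptor whose cache-miss check is `if value is None`), not the library's exact code:

```python
import functools

class lazyproperty:
    """Sketch of the old descriptor: caches in the instance __dict__ but
    treats a stored None as a cache miss."""

    def __init__(self, fget):
        self._fget = fget
        self._name = fget.__name__

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        value = obj.__dict__.get(self._name)
        if value is None:  # bug: a legitimately cached None looks uncached
            value = self._fget(obj)
            obj.__dict__[self._name] = value
        return value

    def __set__(self, obj, value):  # data descriptor: blocks assignment
        raise AttributeError("can't set attribute")

class Old:
    def __init__(self):
        self.calls = 0

    @lazyproperty
    def maybe(self):
        self.calls += 1
        return None  # a legitimate None result

class New:
    def __init__(self):
        self.calls = 0

    @functools.cached_property
    def maybe(self):
        self.calls += 1
        return None

p = Old()
p.maybe; p.maybe
f = New()
f.maybe; f.maybe
print(p.calls, f.calls)  # -> 2 1: the old descriptor re-runs, cached_property does not
```

With `cached_property`, the first access writes `None` into the instance `__dict__`, and because it is a non-data descriptor that entry wins all later lookups, so the getter runs exactly once.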
    KRRT7 authored Mar 27, 2026
    commit 7c5855b
  2. mem: reduce PaddleOCR rec_batch_num from 6 to 1 (#4295)

    Reduce PaddleOCR `rec_batch_num` from its default of 6 to 1. Paddle's
    native inference engine allocates 500 MiB memory-arena chunks in
    proportion to the recognition batch size: with `rec_batch_num=6`, four
    such chunks are allocated during text recognition; setting it to 1
    reduces this to a single chunk.
    
    
    ![benchmark](https://raw.githubusercontent.com/codeflash-ai/codeflash/pr-assets/images/paddle-rec-batch-num-bench.png)
    
    | Setting | Peak memory |
    |---------|------------|
    | `rec_batch_num=6` | 7,184 MiB |
    | `rec_batch_num=1` | 2,684 MiB |
    | **Delta** | **-4,500 MiB (-62.6%)** |
    
    Measured with `memray run` on `layout-parser-paper-with-table.pdf`
    through `partition()` with hi_res + PaddleOCR table OCR. On CPU, batch
    processing doesn't parallelize; it runs sequentially within
    `predictor.run()`, so smaller batches simply allocate less workspace
    memory.
    
    ## Reproduce
    
    Requires `unstructured[pdf]`, `paddlepaddle`, `unstructured-paddleocr`,
    and `memray`.
    
    ```bash
    cat > /tmp/bench_paddle.py << 'SCRIPT'
    from unstructured.partition.auto import partition
    elements = partition(
        filename="example-docs/layout-parser-paper.pdf",
        strategy="hi_res",
        pdf_infer_table_structure=True,
        ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
    )
    print(f"Partitioned: {len(elements)} elements")
    SCRIPT
    
    # Baseline (main branch, rec_batch_num=6):
    git checkout main
    memray run --native --trace-python-allocators -o /tmp/paddle_baseline.bin /tmp/bench_paddle.py
    memray stats /tmp/paddle_baseline.bin | grep "Peak memory"
    
    # With this change (rec_batch_num=1):
    git checkout mem/paddle-rec-batch-num
    memray run --native --trace-python-allocators -o /tmp/paddle_opt.bin /tmp/bench_paddle.py
    memray stats /tmp/paddle_opt.bin | grep "Peak memory"
    ```
    KRRT7 authored Mar 27, 2026
    commit 47f4728

Commits on Mar 31, 2026

  1. fix: isolate Table elements in pre-chunks (#4307)

    ## Summary 
    This change enforces the documented table-isolation guarantees in
    chunking:
    
    - Table and TableChunk are always staged in their own pre-chunk and
    never combined with adjacent non-table elements into a CompositeElement.
    - `PreChunkCombiner` will not merge pre-chunks when either side contains
    a table-family element, preventing "table gets wrapped/merged" behavior
    when `combine_text_under_n_chars` is enabled.
    - Shared helper functions centralize the table-isolation checks in
    unstructured.chunking.base.
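
A minimal sketch of the isolation rules, assuming hypothetical helper names (`is_table_family`, `can_combine`) and a simplified element type; the real shared helpers live in `unstructured.chunking.base` and operate on the library's element classes:

```python
from dataclasses import dataclass

@dataclass
class Element:
    """Stand-in for an unstructured document element."""
    category: str
    text: str

TABLE_FAMILY = {"Table", "TableChunk"}

def is_table_family(element):
    """True for Table/TableChunk elements, which must stay isolated."""
    return element.category in TABLE_FAMILY

def can_combine(pre_chunk_a, pre_chunk_b):
    """A combiner may merge two pre-chunks only when neither side
    contains a table-family element."""
    return not any(is_table_family(e) for e in pre_chunk_a + pre_chunk_b)

text_chunk = [Element("NarrativeText", "intro paragraph")]
table_chunk = [Element("Table", "<table>...</table>")]
print(can_combine(text_chunk, table_chunk))   # False: table stays isolated
print(can_combine(text_chunk, [Element("Title", "heading")]))  # True
```

Under this rule, even a tiny table below the `combine_text_under_n_chars` threshold is never folded into a neighboring `CompositeElement`.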
    
    Also includes:
    
    - Updated/adjusted chunking tests to reflect the new behavior.
    - Added a dedicated test_table_isolation.py regression suite.
    - Version bump + CHANGELOG.md entry to document the fix.
    
    Closes #3921
    claytonlin1110 authored Mar 31, 2026
    commit 6360ef7
  2. feat(chunking): repeat table headers on continuation chunks (#4298)

    ## Behavior summary
    
    ### Before
    - Oversized table chunks only preserved headers in the first chunk;
    continuation chunks could lose column context.
    - Table header semantics (`<thead>` / `<th>`) were not retained as
    explicit row-level metadata after compactification.
    
    ### After
    - Added `repeat_table_headers` (default `True`) to chunking APIs and
    strategy plumbing:
      - `chunk_elements(..., repeat_table_headers=...)`
      - `chunk_by_title(..., repeat_table_headers=...)`
      - `add_chunking_strategy(...)` forwarded args/docs
    - `_TableChunker` now detects contiguous leading header rows and repeats
    them on non-initial continuation chunks.
    - Repeated header rows are prepended to both continuation chunk text and
    `text_as_html`.
    - First chunk behavior remains unchanged relative to legacy output.
    - Added a guardrail: if the repeated header rows would consume more than
    half the chunk window, the splitter falls back to legacy non-repeating
    behavior.
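
The splitting behavior above can be sketched at a high level. `split_table_rows`, its row-count window, and the `is_header` predicate are illustrative stand-ins for `_TableChunker`'s character-budget splitter:

```python
def split_table_rows(rows, is_header, max_rows, repeat_table_headers=True):
    """Split rows into chunks of at most max_rows rows, repeating the
    contiguous leading header rows on every continuation chunk."""
    # Only contiguous *leading* header rows qualify; a header-like row
    # appearing later in the body is never promoted.
    n_header = 0
    while n_header < len(rows) and is_header(rows[n_header]):
        n_header += 1

    # Guardrail: if the headers would consume more than half the window,
    # fall back to legacy non-repeating splitting.
    if not repeat_table_headers or n_header > max_rows // 2:
        return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]

    header, body = rows[:n_header], rows[n_header:]
    if not body:
        return [header] if header else []
    budget = max_rows - n_header  # body rows per chunk, headers included
    return [header + body[i:i + budget] for i in range(0, len(body), budget)]


chunks = split_table_rows(
    ["H1", "r1", "r2", "r3", "r4"], lambda r: r.startswith("H"), max_rows=3
)
# -> [["H1", "r1", "r2"], ["H1", "r3", "r4"]]
```

Note the invariants hold: body rows are neither dropped, duplicated, nor reordered, every chunk stays within the window, and `repeat_table_headers=False` reproduces plain fixed-size splitting.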
    
    ## Invariants
    - No body-row drop, duplication, or reordering across emitted
    continuation chunks.
    - Opt-out behavior (`repeat_table_headers=False`) matches legacy table
    splitting behavior.
    - Chunk windows still respect max-size constraints, including
    near-boundary continuation windows.
    - Only contiguous leading header rows are repeated; later non-leading
    header-like rows are not promoted.
    
    ## Edge cases covered
    - No headers, single leading header row, multiple leading header rows.
    - Header detection from both `<thead>` and `<th>` rows.
    - Exact-fit and near-boundary continuation sizing.
    - Cascading repetition across 3+ continuation chunks.
    - Pathologically large header rows trigger safe fallback to
    non-repeating behavior.
    - Strategy-path forwarding validated through `partition_html(...,
    chunking_strategy="by_title")`.
    
    ## Test evidence
    - `uv run --no-sync pytest -q
    test_unstructured/chunking/test_dispatch.py` (6 passed)
    - `uv run --no-sync pytest -q test_unstructured/chunking/test_base.py -k
    "Describe_TableChunker"` (26 passed)
    - `uv run --no-sync pytest -q
    test_unstructured/chunking/test_title.py::test_add_chunking_strategy_forwards_repeat_table_headers`
    (1 passed)
    - `uv run --no-sync pytest -q test_unstructured/chunking/test_title.py
    -k "repeat_table_headers"` (5 passed)
    - `uv run --with python-docx pytest -q
    test_unstructured/chunking/test_basic.py -k "repeat_table_headers"` (4
    passed)
    - `uv run --no-sync pytest -q
    test_unstructured/common/test_html_table.py` (26 passed)
    
    authored by codex
    cragwolfe authored Mar 31, 2026
    commit b6cf510