feat: tablechunks can reconstruct table #4291

Merged
qued merged 15 commits into main from ml-1016/tablechunks-can-reconstruct-table
Mar 26, 2026
qued commented Mar 23, 2026

Note

Medium Risk
Changes core table-chunking behavior by adding new metadata fields and reconstruction logic; risk is mainly around backward compatibility and correct ordering/HTML merging of split tables.

Overview
Adds end-to-end support for reassembling split tables after chunking. TableChunk now receives stable sequencing metadata (table_id, chunk_index) when a Table is split, and a new reconstruct_table_from_chunks() helper in unstructured.chunking.dispatch groups and merges TableChunks back into full Table elements (including merged text_as_html when available).

Updates ElementMetadata to carry the new fields (dropped during consolidation), bumps version to 0.22.4, and adds unit tests covering reconstruction across mixed element streams and edge cases like missing chunk_index.

Written by Cursor Bugbot for commit 1e732a3.
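
To make the overview concrete, here is a minimal sketch of the group-then-order idea behind reconstruction. It is not the code of `reconstruct_table_from_chunks()`: the `Chunk` class and `reconstruct_tables()` function are hypothetical stand-ins, the result is a plain dict of joined text rather than `Table` elements, and sorting chunks with a missing `chunk_index` last is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a TableChunk; not the library's class."""

    text: str
    table_id: Optional[str] = None
    chunk_index: Optional[int] = None


def reconstruct_tables(chunks: List[Chunk]) -> Dict[str, str]:
    """Group chunks by table_id, order each group by chunk_index, join the text."""
    groups: Dict[str, List[Chunk]] = {}
    for chunk in chunks:
        if chunk.table_id is None:
            # Not part of a split table, so there is nothing to reassemble.
            continue
        groups.setdefault(chunk.table_id, []).append(chunk)

    tables: Dict[str, str] = {}
    for table_id, members in groups.items():
        # Assumption: chunks without chunk_index sort last. The sort is stable,
        # so such chunks keep their relative input order.
        members.sort(key=lambda c: (c.chunk_index is None, c.chunk_index or 0))
        tables[table_id] = " ".join(c.text for c in members)
    return tables


stream = [
    Chunk("c d", table_id="t1", chunk_index=1),
    Chunk("intro paragraph"),
    Chunk("a b", table_id="t1", chunk_index=0),
    Chunk("x", table_id="t2", chunk_index=0),
]
tables = reconstruct_tables(stream)
```

Grouping with a dict keeps tables in first-seen order, which matters when the chunks arrive interleaved with other elements.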

lxml.append() moves elements, disrupting the lazy iterator. Using
list() materializes all rows before the loop so none are skipped.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
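
The pitfall in this commit message can be reproduced without lxml. The sketch below uses the standard library's `xml.etree.ElementTree`, whose `append()` does not detach an element from its old parent, so the move is emulated with an explicit `remove()`. `move_rows()` and `make_table()` are hypothetical helpers, and the exact number of skipped rows differs between lxml and the stdlib.

```python
import xml.etree.ElementTree as ET


def make_table(n_rows: int) -> ET.Element:
    """Build a <table> with n_rows <tr> children."""
    table = ET.Element("table")
    for i in range(n_rows):
        ET.SubElement(table, "tr").text = f"row {i}"
    return table


def move_rows(src: ET.Element, dst: ET.Element, materialize: bool) -> int:
    """Move the children of src into dst and return how many were moved."""
    # list(src) snapshots the children before the loop mutates src.
    rows = list(src) if materialize else src
    moved = 0
    for row in rows:
        # Emulate lxml's move-on-append: detach from the old parent first.
        src.remove(row)
        dst.append(row)
        moved += 1
    return moved


# Iterating src directly while it shrinks skips rows.
lazy_count = move_rows(make_table(4), ET.Element("table"), materialize=False)
# Snapshotting with list() visits every row.
safe_count = move_rows(make_table(4), ET.Element("table"), materialize=True)
```

Only the materialized loop moves all four rows; the lazy loop stops early because the container changes underneath the iterator.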
qued and others added 3 commits March 23, 2026 19:15
When a table has HTML but takes the text-only chunking path (due to
small hard_max), the deep-copied metadata retained the full original
text_as_html. This would cause reconstruct_table_from_chunks to
duplicate rows. Now explicitly set text_as_html=None for text-only
chunks.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
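
A small sketch of the fix described above, assuming a simplified metadata object. `Meta` and `text_only_chunk_metadata()` are hypothetical names, not the library's `ElementMetadata` API.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class Meta:
    """Hypothetical stand-in for ElementMetadata."""

    text_as_html: Optional[str] = None
    page_number: Optional[int] = None


def text_only_chunk_metadata(table_meta: Meta) -> Meta:
    """Copy the table's metadata for a text-only chunk, without its HTML."""
    meta = copy.deepcopy(table_meta)
    # A text-only chunk carries no HTML fragment of its own. Keeping the full
    # original HTML on every chunk would repeat all rows once per chunk when
    # the chunks' HTML is merged later.
    meta.text_as_html = None
    return meta


original = Meta(text_as_html="<table><tr><td>a</td></tr></table>", page_number=3)
chunk_metas = [text_only_chunk_metadata(original) for _ in range(3)]
```

The deep copy leaves the source table's metadata untouched while every other field still carries over to the chunks.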
…hunks

Replace the parent_id linked list approach for table reconstruction with
explicit chunk sequencing metadata per ML-1020:
- table_id: shared UUID for all chunks from the same table
- chunk_index: 0-based position in the chunk sequence
- total_chunks: total number of chunks for the table

Update reconstruct_table_from_chunks to group by table_id and order by
chunk_index instead of walking parent_id chains.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
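
The splitting side of this scheme might look like the sketch below. `SequencedChunk` and `tag_chunks()` are hypothetical names, not the chunker's real code. `total_chunks` is left out because a later commit in this PR removes it, and unsplit tables get no sequencing metadata, as the later tests in this PR describe.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SequencedChunk:
    """Hypothetical stand-in for a TableChunk with sequencing metadata."""

    text: str
    table_id: Optional[str] = None
    chunk_index: Optional[int] = None


def tag_chunks(pieces: List[str]) -> List[SequencedChunk]:
    """Attach a shared table_id and a 0-based chunk_index to split pieces."""
    if len(pieces) <= 1:
        # An unsplit table needs no sequencing metadata.
        return [SequencedChunk(text) for text in pieces]
    table_id = str(uuid.uuid4())  # one UUID shared by every chunk of the table
    return [SequencedChunk(text, table_id, i) for i, text in enumerate(pieces)]


split = tag_chunks(["header row", "row 1", "row 2"])
unsplit = tag_chunks(["small table"])
```

Unlike a `parent_id` chain, each chunk here is self-describing, so a consumer can reorder a table's chunks even when they arrive shuffled.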
cragwolfe added a commit that referenced this pull request Mar 25, 2026
Fixes high-severity table reconstruction corruption by merging only top-level rows from each chunk's table HTML, preventing nested table rows from being hoisted.

Adds regression coverage for nested-table HTML reconstruction.

Finding reference: #4291 (comment)
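
The difference between merging all rows and merging only top-level rows can be sketched as follows. This is an illustration, not the code of `_merge_table_chunks()`: it uses the standard library's `xml.etree.ElementTree` instead of lxml, assumes each chunk's HTML is well-formed XML, and `top_level_rows()` and `merge_tables()` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET
from typing import List


def top_level_rows(table: ET.Element) -> List[ET.Element]:
    """Return the rows owned by this table itself, ignoring nested tables."""
    rows: List[ET.Element] = []
    for child in table:
        if child.tag == "tr":
            rows.append(child)
        elif child.tag in ("thead", "tbody", "tfoot"):
            # findall("tr") matches direct children only, never descendants.
            rows.extend(child.findall("tr"))
    return rows


def merge_tables(html_fragments: List[str]) -> ET.Element:
    """Merge the top-level rows of each chunk's table into one table."""
    merged = ET.Element("table")
    for fragment in html_fragments:
        merged.extend(top_level_rows(ET.fromstring(fragment)))
    return merged


chunk_a = "<table><tr><td>a<table><tr><td>inner</td></tr></table></td></tr></table>"
chunk_b = "<table><tbody><tr><td>b</td></tr></tbody></table>"
merged = merge_tables([chunk_a, chunk_b])

# A descendant search would also collect the nested row and hoist it.
naive_rows = list(ET.fromstring(chunk_a).iter("tr"))
```

Here `merged` has two direct rows and the nested table stays inside its cell, whereas the descendant search over `chunk_a` alone already finds two rows.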
qued and others added 5 commits March 25, 2026 17:42
Remove total_chunks per business guidance. Keep table_id and
chunk_index for table reconstruction. Add tests verifying chunk
sequencing metadata is set correctly on split tables and absent
on unsplit tables.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

qued added this pull request to the merge queue Mar 26, 2026
Merged via the queue into main with commit 78dfb30 Mar 26, 2026
54 checks passed
qued deleted the ml-1016/tablechunks-can-reconstruct-table branch March 26, 2026 17:28
github-merge-queue bot pushed a commit that referenced this pull request Mar 26, 2026
## Summary
- Fix `_merge_table_chunks()` to merge only top-level rows from each
chunk HTML table.
- Prevent nested table rows from being hoisted into the reconstructed
root table.
- Add regression coverage to verify nested table structure is preserved.

## Finding Reference
- #4291 (comment)

## Validation
- `unset VIRTUAL_ENV && CI=false uv run --no-sync pytest -q test_unstructured/chunking/test_base.py -k "reconstruct_tables_from_a_mixed_element_list or preserves_nested_table_structure" --maxfail=1`
- `unset VIRTUAL_ENV && CI=false uv run --no-sync pytest -q test_unstructured/chunking/test_base.py test_unstructured/chunking/test_dispatch.py --maxfail=1`
- `unset VIRTUAL_ENV && uv run --no-sync python - <<'PY'
from unstructured.partition.text import partition_text

elements = partition_text(text="Codex initializer smoke test")
assert elements, "partition_text returned no elements"
print(f"partition_text smoke check passed ({len(elements)} elements)")
PY`
- `unset VIRTUAL_ENV && CI=false uv run --no-sync pytest -q test_unstructured/partition/test_text.py --maxfail=1`

authored by codex