Skip to content

fix: isolate Table elements in pre-chunks#4307

Merged
PastelStorm merged 3 commits intoUnstructured-IO:mainfrom
claytonlin1110:fix/chunk-table-isolation
Mar 31, 2026
Merged

fix: isolate Table elements in pre-chunks#4307
PastelStorm merged 3 commits intoUnstructured-IO:mainfrom
claytonlin1110:fix/chunk-table-isolation

Conversation

@claytonlin1110
Copy link
Copy Markdown
Contributor

Summary

This change enforces the documented table-isolation guarantees in chunking:

  • Table and TableChunk are always staged in their own pre-chunk and never combined with adjacent non-table elements into a CompositeElement.
  • PreChunkCombiner will not merge pre-chunks when either side contains a table-family element, preventing “table gets wrapped/merged” behavior when combine_text_under_n_chars is enabled.
  • Shared helper functions centralize the table-isolation checks in unstructured.chunking.base.

Also includes:

  • Updated/adjusted chunking tests to reflect the new behavior.
  • Added a dedicated test_table_isolation.py regression suite.
  • Version bump + CHANGELOG.md entry to document the fix.

Closes #3921

Copy link
Copy Markdown
Contributor

@PastelStorm PastelStorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

High: Table isolation is still violated when overlap_all=True, so adjacent text/table chunks continue to leak content across the boundary.

Description: The PR blocks table/text from sharing the same _elements list and blocks PreChunkCombiner merges, but it does not isolate the separate overlap state. PreChunkBuilder.flush() carries overlap_tail forward into _overlap_prefix, and tables still consume that prefix when the next pre-chunk is emitted. That means a "standalone" Table can still contain prior narrative text, and a text chunk after a table can still start with table content.

Why it is risky: The isolation invariant in the PR description is stronger than "not in the same CompositeElement". Here, hidden state crosses the boundary through _overlap_prefix, so callers still get mixed-content chunks under a supported option. That is silent data corruption, not just a presentation issue: embeddings, indexing, table reconstruction, and downstream classifiers will operate on polluted table/text payloads even though the element types look correct.

Concrete failure scenario: In production, a retrieval pipeline enables overlap_all=True for recall and chunks documents containing text/table boundaries. On this branch, chunk_elements([Text("Alpha beta gamma delta."), Table("H\nC")], overlap=5, overlap_all=True, new_after_n_chars=0) yields a table chunk with text "elta.\nH\nC". The inverse boundary leaks too: chunk_elements([Table("H\nC"), Text("Omega sigma tau upsilon.")], ...) yields the next CompositeElement as "H C\n\nOmega sigma tau upsilon.". So the PR still breaks its advertised isolation guarantee under a documented option.

The mechanism is visible here:

        # -- copy element list, don't use original or it may change contents as builder proceeds --
        pre_chunk = PreChunk(elements, self._overlap_prefix, self._opts)
        # -- clear builder before yield so we're not sensitive to the timing of how/when this
        # -- iterator is exhausted and can add elements for the next pre-chunk immediately.
        self._reset_state(pre_chunk.overlap_tail)
        yield pre_chunk
...
    def _reset_state(self, overlap_prefix: str) -> None:
        """Set working-state values back to "empty", ready to accumulate next pre-chunk."""
        self._overlap_prefix = overlap_prefix
        self._elements.clear()
        self._text_segments = [overlap_prefix] if overlap_prefix else []
        self._text_len = len(overlap_prefix)
        if len(self._elements) == 1 and isinstance(self._elements[0], Table):
            yield from _TableChunker.iter_chunks(
                self._elements[0], self._overlap_prefix, self._opts
            )
        else:
            yield from _Chunker.iter_chunks(self._elements, self._text, self._opts)
...
        if self._overlap_prefix:
            yield self._overlap_prefix
        for e in self._elements:
            if e.text and len(e.text):
                # -- preserve all whitespace for code snippets to maintain formatting --
                if isinstance(e, CodeSnippet):
                    yield e.text
    @cached_property
    def _text_with_overlap(self) -> str:
        """The text for this chunk, including the overlap-prefix when present."""
        overlap_prefix = self._overlap_prefix
        table_text = "" if not self._table.text else self._table.text.strip()
        # -- use row-separator between overlap and table-text --
        return overlap_prefix + "\n" + table_text if overlap_prefix else table_text

Missing High-Value Tests

  • Add an end-to-end chunk_elements() test for text -> table with overlap_all=True asserting the Table text contains only table content.
  • Add the symmetric table -> text test with overlap_all=True asserting the following CompositeElement does not start with table text.
  • Add the same coverage for chunk_by_title(), since it shares the same PreChunkBuilder/PreChunk overlap path.
  • The new test_table_isolation.py suite passes, but it never exercises overlap or overlap_all, so it would not catch this regression.

Verdict

Not Merge. I did not find other Critical/High issues with high confidence, but this one is a real correctness gap in the PR’s stated isolation guarantee.

@PastelStorm PastelStorm added this pull request to the merge queue Mar 31, 2026
Merged via the queue into Unstructured-IO:main with commit 6360ef7 Mar 31, 2026
53 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat/chunk_elements

2 participants