fix: isolate Table elements in pre-chunks#4307
fix: isolate Table elements in pre-chunks#4307PastelStorm merged 3 commits intoUnstructured-IO:mainfrom
Conversation
PastelStorm
left a comment
There was a problem hiding this comment.
Findings
High: Table isolation is still violated when overlap_all=True, so adjacent text/table chunks continue to leak content across the boundary.
Description: The PR blocks table/text from sharing the same _elements list and blocks PreChunkCombiner merges, but it does not isolate the separate overlap state. PreChunkBuilder.flush() carries overlap_tail forward into _overlap_prefix, and tables still consume that prefix when the next pre-chunk is emitted. That means a "standalone" Table can still contain prior narrative text, and a text chunk after a table can still start with table content.
Why it is risky: The isolation invariant in the PR description is stronger than "not in the same CompositeElement". Here, hidden state crosses the boundary through _overlap_prefix, so callers still get mixed-content chunks under a supported option. That is silent data corruption, not just a presentation issue: embeddings, indexing, table reconstruction, and downstream classifiers will operate on polluted table/text payloads even though the element types look correct.
Concrete failure scenario: In production, a retrieval pipeline enables overlap_all=True for recall and chunks documents containing text/table boundaries. On this branch, chunk_elements([Text("Alpha beta gamma delta."), Table("H\nC")], overlap=5, overlap_all=True, new_after_n_chars=0) yields a table chunk with text "elta.\nH\nC". The inverse boundary leaks too: chunk_elements([Table("H\nC"), Text("Omega sigma tau upsilon.")], ...) yields the next CompositeElement as "H C\n\nOmega sigma tau upsilon.". So the PR still breaks its advertised isolation guarantee under a documented option.
The mechanism is visible here:
# -- copy element list, don't use original or it may change contents as builder proceeds --
pre_chunk = PreChunk(elements, self._overlap_prefix, self._opts)
# -- clear builder before yield so we're not sensitive to the timing of how/when this
# -- iterator is exhausted and can add elements for the next pre-chunk immediately.
self._reset_state(pre_chunk.overlap_tail)
yield pre_chunk
...
def _reset_state(self, overlap_prefix: str) -> None:
"""Set working-state values back to "empty", ready to accumulate next pre-chunk."""
self._overlap_prefix = overlap_prefix
self._elements.clear()
self._text_segments = [overlap_prefix] if overlap_prefix else []
self._text_len = len(overlap_prefix) if len(self._elements) == 1 and isinstance(self._elements[0], Table):
yield from _TableChunker.iter_chunks(
self._elements[0], self._overlap_prefix, self._opts
)
else:
yield from _Chunker.iter_chunks(self._elements, self._text, self._opts)
...
if self._overlap_prefix:
yield self._overlap_prefix
for e in self._elements:
if e.text and len(e.text):
# -- preserve all whitespace for code snippets to maintain formatting --
if isinstance(e, CodeSnippet):
yield e.text @cached_property
def _text_with_overlap(self) -> str:
"""The text for this chunk, including the overlap-prefix when present."""
overlap_prefix = self._overlap_prefix
table_text = "" if not self._table.text else self._table.text.strip()
# -- use row-separator between overlap and table-text --
return overlap_prefix + "\n" + table_text if overlap_prefix else table_textMissing High-Value Tests
- Add an end-to-end
chunk_elements()test fortext -> tablewithoverlap_all=Trueasserting theTabletext contains only table content. - Add the symmetric
table -> texttest withoverlap_all=Trueasserting the followingCompositeElementdoes not start with table text. - Add the same coverage for
chunk_by_title(), since it shares the samePreChunkBuilder/PreChunkoverlap path. - The new
test_table_isolation.pysuite passes, but it never exercisesoverlaporoverlap_all, so it would not catch this regression.
Verdict
Not Merge. I did not find other Critical/High issues with high confidence, but this one is a real correctness gap in the PR’s stated isolation guarantee.
Summary
This change enforces the documented table-isolation guarantees in chunking:
Also includes:
Closes #3921