fix: gracefully handle invalid html string during chunking #4243

Merged
badGarnet merged 7 commits into main from fix/gracefully-handle-invalid-html-string on Feb 18, 2026

@badGarnet
Collaborator

This PR fixes an issue where an invalid text_as_html value passed to the HTML-based table chunking logic can cause chunking to fail, as the following stack trace shows:

    |   File "/app/unstructured/unstructured/chunking/base.py", line 594, in iter_chunks
    |     yield from _TableChunker.iter_chunks(
    |   File "/app/unstructured/unstructured/chunking/base.py", line 837, in _iter_chunks
    |     html_size = measure(self._html) if self._html else 0
    |                                        ^^^^^^^^^^
    |   File "/app/unstructured/unstructured/utils.py", line 154, in __get__
    |     value = self._fget(obj)
    |             ^^^^^^^^^^^^^^^
    |   File "/app/unstructured/unstructured/chunking/base.py", line 866, in _html
    |     if not (html_table := self._html_table):
    |                           ^^^^^^^^^^^^^^^^
    |   File "/app/unstructured/unstructured/utils.py", line 154, in __get__
    |     value = self._fget(obj)
    |             ^^^^^^^^^^^^^^^
    |   File "/app/unstructured/unstructured/chunking/base.py", line 884, in _html_table
    |     return HtmlTable.from_html_text(text_as_html)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/app/unstructured/unstructured/common/html_table.py", line 61, in from_html_text
    |     root = fragment_fromstring(html_text)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/app/.venv/lib/python3.12/site-packages/lxml/html/__init__.py", line 810, in fragment_fromstring
    |     elements = fragments_fromstring(
    |                ^^^^^^^^^^^^^^^^^^^^^
    |   File "/app/.venv/lib/python3.12/site-packages/lxml/html/__init__.py", line 780, in fragments_fromstring
    |     raise etree.ParserError(
    | lxml.etree.ParserError: There is leading text: '```html\n'

The solution is to catch the parser error in _html_table (unstructured/chunking/base.py) and return None instead. This way we fall back to text-based chunking for this element and log a warning.
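The fallback described above can be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the code merged in this PR: `ParserError` here is a stand-in for `lxml.etree.ParserError`, and `parse_table_html`, `html_table_or_none`, and `chunking_mode` are hypothetical names that only mimic the roles of `HtmlTable.from_html_text` and `_html_table`.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


class ParserError(Exception):
    """Stand-in for lxml.etree.ParserError so the sketch runs without lxml."""


def parse_table_html(html_text: str) -> str:
    """Hypothetical stand-in for HtmlTable.from_html_text.

    Rejects input that has text before the first tag, mimicking the
    "There is leading text" failure shown in the stack trace.
    """
    if not html_text.lstrip().startswith("<"):
        raise ParserError(f"There is leading text: {html_text[:8]!r}")
    return html_text.strip()


def html_table_or_none(text_as_html: Optional[str]) -> Optional[str]:
    """Return the parsed table, or None when the HTML is missing or invalid."""
    if not text_as_html:
        return None
    try:
        return parse_table_html(text_as_html)
    except ParserError as exc:
        # Log and swallow the error so the caller can fall back to text chunking.
        logger.warning("Invalid text_as_html, using text-based chunking: %s", exc)
        return None


def chunking_mode(text_as_html: Optional[str]) -> str:
    """Pick HTML-based chunking only when a table could actually be parsed."""
    return "html" if html_table_or_none(text_as_html) is not None else "text"
```

With this shape, an input wrapped in a Markdown code fence (the case in the stack trace) yields `None` from `html_table_or_none`, so the caller takes the text-based path instead of raising.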

@badGarnet badGarnet marked this pull request as ready for review February 18, 2026 04:00
    -    return HtmlTable.from_html_text(text_as_html)
    +    try:
    +        return HtmlTable.from_html_text(text_as_html)
    +    except ParserError:
Contributor

We should also catch ValueError here and add a test to cover this new path.
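One way the suggested test could look, as a rough sketch under stated assumptions: `safe_parse` and `make_failing_parser` are hypothetical helpers, and `ParserError` is a local stand-in for `lxml.etree.ParserError` so the example stays stdlib-only. The real test would exercise the chunker in `unstructured/chunking/base.py` instead.

```python
from typing import Callable, Optional


class ParserError(Exception):
    """Stand-in for lxml.etree.ParserError (assumption: keeps the sketch stdlib-only)."""


def safe_parse(parse: Callable[[str], object], text_as_html: str) -> Optional[object]:
    """Hypothetical helper: return None instead of raising on invalid HTML."""
    try:
        return parse(text_as_html)
    except (ParserError, ValueError):
        return None


def make_failing_parser(exc_type: type) -> Callable[[str], object]:
    """Build a fake parser that always raises the given exception type."""
    def parse(_html: str) -> object:
        raise exc_type("simulated parse failure")
    return parse


def test_invalid_html_returns_none() -> None:
    # Both error types should be swallowed and reported as "no table".
    for exc_type in (ParserError, ValueError):
        assert safe_parse(make_failing_parser(exc_type), "<table>") is None


def test_valid_html_passes_through() -> None:
    assert safe_parse(lambda html: html.upper(), "<table/>") == "<TABLE/>"


def test_unrelated_errors_still_propagate() -> None:
    # Only the two expected error types are caught; anything else surfaces.
    try:
        safe_parse(make_failing_parser(KeyError), "<table>")
    except KeyError:
        return
    raise AssertionError("KeyError should not be swallowed")
```

Catching only the two named exception types, rather than a bare `Exception`, keeps unexpected failures visible while still covering the fallback path.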

@socket-security

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff: Added
Package: github/docker/setup-compose-action@364cc21a5de5b1ee4a7f5f9d3fa374ce0ccde746
Supply Chain Security: 99
Vulnerability: 100
Quality: 100
Maintenance: 100
License: 100

@badGarnet badGarnet added this pull request to the merge queue Feb 18, 2026
Merged via the queue into main with commit c1f819c Feb 18, 2026
50 checks passed
@badGarnet badGarnet deleted the fix/gracefully-handle-invalid-html-string branch February 18, 2026 20:08
2 participants