Comparing changes

> [!NOTE]
> **Medium Risk**
> Introduces a new automated publishing workflow and modifies dependency-install semantics in CI/Docker, which could cause release or build failures if credentials, tags, or lockfile expectations are misconfigured.
>
> **Overview**
> Adds an automated release pipeline: a new `release.yml` workflow triggers on published GitHub releases, validates the tag matches `unstructured.__version__`, builds via `uv build`, publishes to PyPI using trusted publishing, and *best-effort* uploads the same artifacts to Azure Artifacts via `twine`.
>
> Across CI, Docker, and Make targets, replaces `uv sync --frozen` with `uv sync --locked` and adds `uv run --no-sync` where `uv sync` already ran to avoid implicit re-syncing; introduces a new `release` dependency group (adds `twine`), bumps version to `0.20.2`, and updates `uv.lock` accordingly.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c9555c9. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
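
The tag check described in the overview can be illustrated with a small sketch. This is not the code in `release.yml`; the helper name `tag_matches_version` and the tolerance for a leading `v` on the tag are assumptions made for illustration only.

```python
def tag_matches_version(tag: str, version: str) -> bool:
    """Return True when a release tag names the same version as the package."""
    # Assumption: a tag may carry an optional leading "v" (e.g. "v0.20.2").
    normalized = tag[1:] if tag.startswith("v") else tag
    return normalized == version


# Stand-in for unstructured.__version__ so the sketch runs on its own.
package_version = "0.20.2"

for tag in ("0.20.2", "v0.20.2", "0.20.1"):
    status = "ok" if tag_matches_version(tag, package_version) else "mismatch"
    print(f"{tag}: {status}")
```

A release job would typically fail fast on a mismatch so that a mistagged release never reaches the build and publish steps.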

#4215)

Closes #3864

## Summary
- Fixes issue where bold text in PDFs is extracted with duplicate characters (e.g., "BOLD" → "BBOOLLDD")
- Some PDF generators simulate bold by rendering each character twice at slightly offset positions
- Added character-level deduplication based on position proximity to detect and remove these duplicates

## Problem
When extracting text from certain PDFs, bold text appears duplicated:
```python
from unstructured.partition.pdf import partition_pdf

# Before fix
elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)
# Output: ">60>60" instead of ">60"
```

## Solution
Added character-level deduplication that:
- Compares consecutive characters' text content and position
- Removes duplicates where the same character appears within 3 pixels (configurable)
- Preserves spaces and other non-character elements (LTAnno objects)

```python
from unstructured.partition.pdf import partition_pdf

# After fix
elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)
# Output: ">60" ✓
```

## Configuration
```bash
# Default: 3.0 pixels (enabled)
export PDF_CHAR_DUPLICATE_THRESHOLD=3.0

# Disable deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=0

# More aggressive deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=5.0
```
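
The position-proximity idea can be sketched as follows. This is a minimal illustration, not the code added in the PR: the `Char` dataclass and `dedupe_chars` helper are hypothetical, characters are modeled as plain `(text, x, y)` records rather than pdfminer objects, and LTAnno handling is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Char:
    """Hypothetical stand-in for an extracted character and its position."""
    text: str
    x: float
    y: float


def dedupe_chars(chars, threshold=3.0):
    """Drop a character that repeats the previously kept one within `threshold`."""
    # A threshold of 0 (or less) disables deduplication, mirroring the
    # PDF_CHAR_DUPLICATE_THRESHOLD=0 setting described above.
    if threshold <= 0:
        return list(chars)
    kept = []
    for ch in chars:
        prev = kept[-1] if kept else None
        is_duplicate = (
            prev is not None
            and prev.text == ch.text
            and abs(prev.x - ch.x) < threshold
            and abs(prev.y - ch.y) < threshold
        )
        if not is_duplicate:
            kept.append(ch)
    return kept


# "BOOK" rendered as fake bold: every glyph is drawn twice, 0.5 px apart.
fake_bold = [
    Char(letter, x + offset, 0.0)
    for letter, x in (("B", 0.0), ("O", 10.0), ("O", 20.0), ("K", 30.0))
    for offset in (0.0, 0.5)
]
print("".join(c.text for c in dedupe_chars(fake_bold)))
```

Comparing positions as well as text is what keeps a genuine double letter (the two `O`s, 10 px apart in this example) from being collapsed.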

Improve PDF fast strategy cold-start latency by lazy-loading hi-res-only imports in [pdf.py](https://github.com/Unstructured-IO/unstructured/blob/1c3d5e6ef7b6123a2d8739bf9a8c3afecc3dd127/unstructured/partition/pdf.py). This reduces first-call startup overhead without changing partition behavior.

Local benchmarks show a significant fast strategy cold-start speedup of ~35% from 2.75s -> 1.78s. They also show a small hi_res slowdown (~2-4%), which is acceptable given the fast improvements.

Benchmark was run on 6 PDFs:
- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/DA-1p.pdf
- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/chevron-page.pdf
- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/embedded-images-tables.pdf
- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/fake-memo-with-duplicate-page.pdf
- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/interface-config-guide-p93.pdf
- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/layout-parser-paper-fast.pdf

---

> [!NOTE]
> **Medium Risk**
> Touches core PDF partitioning by changing import timing and locations; behavior should be unchanged but there is some risk of missed/conditional imports causing runtime errors in less-tested hi_res/OCR/analysis paths.
>
> **Overview**
> Improves PDF `fast` strategy cold-start performance by **lazy-loading hi-res-only dependencies** in `unstructured/partition/pdf.py` (moving several `pdf_image`/`unstructured_inference`-related imports into `_partition_pdf_or_image_local` and other hi-res/OCR-only code paths), while keeping the `fast` path lighter.
>
> Adds `scripts/performance/quick_partition_bench.py` for quick local cold vs warm partition timing across one or more PDFs, updates the table metrics helper to import `convert_pdf_to_images` from `pdf_image_utils`, and bumps the library version to `0.20.4` with corresponding changelog entry.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit b66ae0e. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
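
The lazy-loading pattern itself is simple to sketch. The example below is illustrative only: `partition` is a hypothetical toy function, and the standard-library `unicodedata` module stands in for the hi-res-only dependencies that the PR actually defers.

```python
def partition(text: str, strategy: str = "fast") -> list:
    """Toy partitioner: split text into non-empty, stripped lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if strategy == "fast":
        # The fast path returns without ever importing the heavy dependency.
        return lines
    if strategy == "hi_res":
        # Deferred import: the cost is paid on the first hi_res call only.
        # `unicodedata` is merely a stand-in for a genuinely heavy module.
        import unicodedata

        return [unicodedata.normalize("NFKC", line) for line in lines]
    raise ValueError(f"unknown strategy: {strategy!r}")


print(partition("first line\n\n  second line  "))
print(partition("\ufb01le", strategy="hi_res"))
```

The trade-off matches the benchmark above: callers that never take the heavy path skip its import cost, while the heavy path pays a small extra cost for the function-level import.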

This PR fixes an issue where an invalid `text_as_html` value passed into the HTML-based table chunking logic can cause chunking to fail, as the following stack trace shows:
```
| File "/app/unstructured/unstructured/chunking/base.py", line 594, in iter_chunks
|   yield from _TableChunker.iter_chunks(
| File "/app/unstructured/unstructured/chunking/base.py", line 837, in _iter_chunks
|   html_size = measure(self._html) if self._html else 0
|                                      ^^^^^^^^^^
| File "/app/unstructured/unstructured/utils.py", line 154, in __get__
|   value = self._fget(obj)
|           ^^^^^^^^^^^^^^^
| File "/app/unstructured/unstructured/chunking/base.py", line 866, in _html
|   if not (html_table := self._html_table):
|                         ^^^^^^^^^^^^^^^^
| File "/app/unstructured/unstructured/utils.py", line 154, in __get__
|   value = self._fget(obj)
|           ^^^^^^^^^^^^^^^
| File "/app/unstructured/unstructured/chunking/base.py", line 884, in _html_table
|   return HtmlTable.from_html_text(text_as_html)
|          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/app/unstructured/unstructured/common/html_table.py", line 61, in from_html_text
|   root = fragment_fromstring(html_text)
|          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/app/.venv/lib/python3.12/site-packages/lxml/html/__init__.py", line 810, in fragment_fromstring
|   elements = fragments_fromstring(
|              ^^^^^^^^^^^^^^^^^^^^^
| File "/app/.venv/lib/python3.12/site-packages/lxml/html/__init__.py", line 780, in fragments_fromstring
|   raise etree.ParserError(
|         ^^^^^^^^^^^^^^^^^^
| lxml.etree.ParserError: There is leading text: '```html\n'
```
The solution is to catch the parser error in `_html_table` in `unstructured/chunking/base.py` and return `None` instead. This way we fall back to text-based chunking for this element and log a warning.
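
A minimal sketch of the catch-and-fall-back shape follows. It is not the change made in `base.py`: to stay self-contained it uses a local `ParserError` class and a toy `parse_table_fragment` function as stand-ins for `lxml.etree.ParserError` and the real HTML parsing, and `html_table_or_none` is a hypothetical helper name.

```python
import logging

logger = logging.getLogger(__name__)


class ParserError(Exception):
    """Stand-in for lxml.etree.ParserError so the sketch has no dependencies."""


def parse_table_fragment(html_text: str) -> str:
    """Toy parser: rejects input that has text before the first tag."""
    stripped = html_text.strip()
    if not stripped.startswith("<"):
        raise ParserError(f"There is leading text: {stripped[:8]!r}")
    return stripped


def html_table_or_none(text_as_html):
    """Return the parsed table, or None when the HTML is missing or invalid."""
    if not text_as_html:
        return None
    try:
        return parse_table_fragment(text_as_html)
    except ParserError as exc:
        # Log and signal "no usable HTML" so the caller can chunk by text.
        logger.warning("invalid text_as_html, using text-based chunking: %s", exc)
        return None


print(html_table_or_none("<table><tr><td>a</td></tr></table>"))
print(html_table_or_none("```html\n<table></table>"))
```

Returning `None` reuses the path the caller already takes when an element has no `text_as_html` at all, so no separate error branch is needed downstream.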

This PR addresses an issue where hashing element IDs loses the reference to the parent ID. This happens when calling `partition_html`, where the partition process has already assigned parent IDs to elements based on the HTML structure before `apply_metadata` is called, i.e., before element ID hashing happens. This fix ensures that the parent references are preserved after hashing.
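
One way to keep parent references intact is to record an old-to-new ID mapping while hashing and then rewrite every `parent_id` through it. The sketch below assumes that approach; the `Element` dataclass, the `hash_ids` helper, and the hashing recipe (SHA-256 over position and text) are hypothetical and are not the library's actual implementation.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical minimal element: an ID, some text, an optional parent."""
    id: str
    text: str
    parent_id: Optional[str] = None


def hash_ids(elements):
    """Replace each ID with a hash and remap parent_id to the new value."""
    # First pass: compute every new ID while the old IDs are still in place.
    mapping = {}
    for index, element in enumerate(elements):
        payload = f"{index}:{element.text}".encode("utf-8")
        mapping[element.id] = hashlib.sha256(payload).hexdigest()[:32]
    # Second pass: swap in the new IDs and follow the mapping for parents.
    for element in elements:
        element.id = mapping[element.id]
        if element.parent_id is not None:
            element.parent_id = mapping.get(element.parent_id, element.parent_id)
    return elements


title = Element(id="tmp-1", text="Quarterly report")
paragraph = Element(id="tmp-2", text="Revenue grew.", parent_id="tmp-1")
hash_ids([title, paragraph])
print(paragraph.parent_id == title.id)
```

Doing the work in two passes means a child still resolves correctly even when it appears before its parent in the element list.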


Commits on Feb 13, 2026

Commits on Feb 17, 2026

Commits on Feb 18, 2026

Commits on Feb 19, 2026
