Comparing changes
base repository: Unstructured-IO/unstructured
base: 0.20.1
head repository: Unstructured-IO/unstructured
compare: 0.20.6
- 5 commits
- 31 files changed
- 4 contributors
Commits on Feb 13, 2026
Automate pypi publishing (#4239)
> [!NOTE]
> **Medium Risk**
> Introduces a new automated publishing workflow and modifies dependency-install semantics in CI/Docker, which could cause release or build failures if credentials, tags, or lockfile expectations are misconfigured.
>
> **Overview**
> Adds an automated release pipeline: a new `release.yml` workflow triggers on published GitHub releases, validates the tag matches `unstructured.__version__`, builds via `uv build`, publishes to PyPI using trusted publishing, and *best-effort* uploads the same artifacts to Azure Artifacts via `twine`.
>
> Across CI, Docker, and Make targets, replaces `uv sync --frozen` with `uv sync --locked` and adds `uv run --no-sync` where `uv sync` already ran to avoid implicit re-syncing; introduces a new `release` dependency group (adds `twine`), bumps version to `0.20.2`, and updates `uv.lock` accordingly.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c9555c9.</sup>
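The tag-versus-version validation mentioned in the overview could look roughly like the sketch below. The workflow's actual check is not shown in this comparison, so the function names and the handling of a `refs/tags/` prefix or a leading `v` are assumptions made for illustration.

```python
def normalize_tag(tag: str) -> str:
    """Strip a 'refs/tags/' prefix and an optional leading 'v' (both assumed conventions)."""
    name = tag.removeprefix("refs/tags/")
    return name.removeprefix("v")


def tag_matches_version(tag: str, package_version: str) -> bool:
    """Return True when the release tag names exactly the packaged version."""
    return normalize_tag(tag) == package_version
```

In a release job, `package_version` would presumably be read from `unstructured.__version__`, and the job would stop before building if the function returned `False`.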
Emily Voss authored Feb 13, 2026
Commit 104d29f
fix: remove duplicate characters caused by fake bold rendering in PDFs (#4215)

Closes #3864

## Summary
- Fixes an issue where bold text in PDFs is extracted with duplicate characters (e.g., "BOLD" → "BBOOLLDD")
- Some PDF generators simulate bold by rendering each character twice at slightly offset positions
- Added character-level deduplication based on position proximity to detect and remove these duplicates

## Problem
When extracting text from certain PDFs, bold text appears duplicated:

```python
from unstructured.partition.pdf import partition_pdf

# Before fix
elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60>60" instead of ">60"
```

## Solution
Added character-level deduplication that:
- Compares consecutive characters' text content and position
- Removes duplicates where the same character appears within 3 pixels (configurable)
- Preserves spaces and other non-character elements (LTAnno objects)

```python
from unstructured.partition.pdf import partition_pdf

# After fix
elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60" ✓
```

## Configuration
```bash
# Default: 3.0 pixels (enabled)
export PDF_CHAR_DUPLICATE_THRESHOLD=3.0

# Disable deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=0

# More aggressive deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=5.0
```
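As a rough illustration of position-based deduplication, here is a minimal sketch. It is not the code from this commit: the `Char` class, the `dedupe_chars` function, and the strict `<` comparison on both axes are assumptions, and elements without a position (such as LTAnno objects) are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical stand-in for a positioned PDF character."""
    text: str
    x: float
    y: float


def dedupe_chars(chars: list[Char], threshold: float = 3.0) -> list[Char]:
    """Drop a character that repeats the previously kept one within `threshold` units."""
    if threshold <= 0:
        # A threshold of 0 disables deduplication, mirroring the env var setting above.
        return list(chars)
    kept: list[Char] = []
    for ch in chars:
        prev = kept[-1] if kept else None
        is_fake_bold_copy = (
            prev is not None
            and ch.text == prev.text
            and abs(ch.x - prev.x) < threshold
            and abs(ch.y - prev.y) < threshold
        )
        if not is_fake_bold_copy:
            kept.append(ch)
    return kept
```

Comparing positions as well as text is what lets a genuine double letter survive: the two "l" glyphs in "hello" sit a full character width apart, well beyond a 3-unit threshold.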
Commit 8096b5a
Commits on Feb 17, 2026
Improve fast partition cold start (#4242)
Improve PDF fast strategy cold-start latency by lazy-loading hi-res-only imports in [pdf.py](https://github.com/Unstructured-IO/unstructured/blob/1c3d5e6ef7b6123a2d8739bf9a8c3afecc3dd127/unstructured/partition/pdf.py). This reduces first-call startup overhead without changing partition behavior.

Local benchmarks show a significant fast strategy cold-start speedup of ~35%, from 2.75s to 1.78s. They also show a small hi_res slowdown (~2-4%), which is acceptable given the fast improvements.

Benchmark was run on 6 PDFs:

- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/DA-1p.pdf
- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/chevron-page.pdf
- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/embedded-images-tables.pdf
- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/fake-memo-with-duplicate-page.pdf
- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/interface-config-guide-p93.pdf
- https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/layout-parser-paper-fast.pdf

---

> [!NOTE]
> **Medium Risk**
> Touches core PDF partitioning by changing import timing and locations; behavior should be unchanged but there is some risk of missed/conditional imports causing runtime errors in less-tested hi_res/OCR/analysis paths.
>
> **Overview**
> Improves PDF `fast` strategy cold-start performance by **lazy-loading hi-res-only dependencies** in `unstructured/partition/pdf.py` (moving several `pdf_image`/`unstructured_inference`-related imports into `_partition_pdf_or_image_local` and other hi-res/OCR-only code paths), while keeping the `fast` path lighter.
>
> Adds `scripts/performance/quick_partition_bench.py` for quick local cold vs warm partition timing across one or more PDFs, updates the table metrics helper to import `convert_pdf_to_images` from `pdf_image_utils`, and bumps the library version to `0.20.4` with corresponding changelog entry.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit b66ae0e.</sup>
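The lazy-loading pattern itself can be shown with a small sketch. It is not taken from `pdf.py`: the `partition` function is hypothetical, and the standard-library `colorsys` module merely stands in for a heavy hi-res-only dependency.

```python
import sys


def is_loaded(module_name: str) -> bool:
    """Report whether a module has already been imported in this process."""
    return module_name in sys.modules


def partition(strategy: str = "fast") -> str:
    """Toy partitioner: only the hi_res branch pays for its extra import."""
    if strategy == "hi_res":
        # Deferred import: executed on the first hi_res call, not at module load time.
        import colorsys  # stand-in for a heavy hi-res-only dependency

        hue, _, _ = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)
        return f"hi_res result (hue={hue})"
    return "fast result"
```

Calling `partition("fast")` never triggers the deferred import, which is the source of the cold-start saving. The cost moves to the first `hi_res` call instead, and each later call pays a small import-cache lookup, which may account for part of the small hi_res slowdown reported above.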
Commit e1f75a3
Commits on Feb 18, 2026
fix: gracefully handle invalid html string during chunking (#4243)
This PR fixes an issue where an invalid `text_as_html` input to the HTML-based table chunking logic can cause chunking to fail, as the following stack trace shows:

```
  |   File "/app/unstructured/unstructured/chunking/base.py", line 594, in iter_chunks
  |     yield from _TableChunker.iter_chunks(
  |   File "/app/unstructured/unstructured/chunking/base.py", line 837, in _iter_chunks
  |     html_size = measure(self._html) if self._html else 0
  |                                        ^^^^^^^^^^
  |   File "/app/unstructured/unstructured/utils.py", line 154, in __get__
  |     value = self._fget(obj)
  |             ^^^^^^^^^^^^^^^
  |   File "/app/unstructured/unstructured/chunking/base.py", line 866, in _html
  |     if not (html_table := self._html_table):
  |                           ^^^^^^^^^^^^^^^^
  |   File "/app/unstructured/unstructured/utils.py", line 154, in __get__
  |     value = self._fget(obj)
  |             ^^^^^^^^^^^^^^^
  |   File "/app/unstructured/unstructured/chunking/base.py", line 884, in _html_table
  |     return HtmlTable.from_html_text(text_as_html)
  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/app/unstructured/unstructured/common/html_table.py", line 61, in from_html_text
  |     root = fragment_fromstring(html_text)
  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/app/.venv/lib/python3.12/site-packages/lxml/html/__init__.py", line 810, in fragment_fromstring
  |     elements = fragments_fromstring(
  |                ^^^^^^^^^^^^^^^^^^^^^
  |   File "/app/.venv/lib/python3.12/site-packages/lxml/html/__init__.py", line 780, in fragments_fromstring
  |     raise etree.ParserError(
  | lxml.etree.ParserError: There is leading text: '```html\n'
```

The solution is to catch the parser error in `_html_table` in `unstructured/chunking/base.py` and return `None` instead. This way we fall back to text-based chunking for this element, with a warning log.
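The catch-and-fall-back behavior described above can be sketched as follows. To stay self-contained, the sketch uses a toy parser and a hypothetical `TableParseError` in place of `lxml.etree.ParserError`; `html_table_or_none` and `chunk_table` are illustrative names, not the library's API.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


class TableParseError(ValueError):
    """Hypothetical stand-in for lxml.etree.ParserError."""


def parse_table_html(html_text: str) -> str:
    """Toy parser: rejects input that has text before the first tag."""
    stripped = html_text.strip()
    if not stripped.startswith("<"):
        raise TableParseError(f"There is leading text: {html_text[:8]!r}")
    return stripped


def html_table_or_none(text_as_html: Optional[str]) -> Optional[str]:
    """Return the parsed table, or None when the HTML is missing or invalid."""
    if not text_as_html:
        return None
    try:
        return parse_table_html(text_as_html)
    except TableParseError as exc:
        logger.warning("Invalid text_as_html, falling back to text chunking: %s", exc)
        return None


def chunk_table(text: str, text_as_html: Optional[str]) -> dict:
    """Chunk from HTML when it parses; otherwise fall back to the plain text."""
    table = html_table_or_none(text_as_html)
    if table is None:
        return {"mode": "text", "content": text}
    return {"mode": "html", "content": table}
```

Returning `None` lets invalid HTML take the same path as missing HTML, so the caller needs only one fallback branch.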
Commit c1f819c
Commits on Feb 19, 2026
fix: remap parent id after hashing (#4245)
This PR addresses an issue where hashing element IDs breaks parent ID references. This happens when calling `partition_html`, where the partition process assigns parent IDs to elements based on the HTML structure before `apply_metadata` is called, i.e., before element ID hashing happens. This fix ensures that each parent reference still points to the same element after hashing.
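One way to keep parent references intact is to record an old-to-new ID mapping while hashing and then rewrite every `parent_id` through it. The sketch below illustrates that idea only: the `Element` class, the function name, and the hash input (position plus text) are assumptions, not the library's actual scheme.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical minimal element with an ID and an optional parent reference."""
    id: str
    text: str
    parent_id: Optional[str] = None


def hash_ids_and_remap_parents(elements: list[Element]) -> None:
    """Replace each ID with a hash, then point parent_id at the parent's new ID."""
    # First pass: assign hashed IDs and remember old -> new.
    id_map: dict[str, str] = {}
    for index, el in enumerate(elements):
        new_id = hashlib.sha256(f"{index}:{el.text}".encode()).hexdigest()[:32]
        id_map[el.id] = new_id
        el.id = new_id
    # Second pass: rewrite parent references using the mapping.
    for el in elements:
        if el.parent_id is not None:
            el.parent_id = id_map.get(el.parent_id, el.parent_id)
```

Two passes are used so that a child listed before its parent is still remapped correctly; a `parent_id` with no entry in the mapping is left as it was.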
Commit e2d8b7a