
Comparing changes

Repository: Unstructured-IO/unstructured
base: 0.20.1
compare: 0.20.6

  • 5 commits
  • 31 files changed
  • 4 contributors

Commits on Feb 13, 2026

  1. Automate pypi publishing (#4239)

    <!-- CURSOR_SUMMARY -->
    > [!NOTE]
    > **Medium Risk**
    > Introduces a new automated publishing workflow and modifies
    dependency-install semantics in CI/Docker, which could cause release or
    build failures if credentials, tags, or lockfile expectations are
    misconfigured.
    > 
    > **Overview**
    > Adds an automated release pipeline: a new `release.yml` workflow
    triggers on published GitHub releases, validates the tag matches
    `unstructured.__version__`, builds via `uv build`, publishes to PyPI
    using trusted publishing, and *best-effort* uploads the same artifacts
    to Azure Artifacts via `twine`.
    > 
    > Across CI, Docker, and Make targets, replaces `uv sync --frozen` with
    `uv sync --locked` and adds `uv run --no-sync` where `uv sync` already
    ran to avoid implicit re-syncing; introduces a new `release` dependency
    group (adds `twine`), bumps version to `0.20.2`, and updates `uv.lock`
    accordingly.
    > 
    > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c9555c9.</sup>
    <!-- /CURSOR_SUMMARY -->
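    The tag-vs-version validation the workflow performs can be sketched as follows (a minimal illustration, not the workflow's actual code; the function name is hypothetical, and in the real workflow the version would come from `unstructured.__version__`):

```python
# Sketch: a release tag is accepted only if it matches the package version.
def tag_matches_version(tag: str, version: str) -> bool:
    """Accept tags with or without a leading 'v' prefix."""
    return tag.lstrip("v") == version

print(tag_matches_version("0.20.2", "0.20.2"))   # True
print(tag_matches_version("v0.20.2", "0.20.2"))  # True
print(tag_matches_version("0.20.1", "0.20.2"))   # False
```

    Failing this check early keeps a mis-tagged release from ever reaching PyPI.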
    Emily Voss authored Feb 13, 2026 · commit 104d29f
  2. fix: remove duplicate characters caused by fake bold rendering in PDFs (#4215)
    
    Closes #3864
    
    ## Summary
    - Fixes issue where bold text in PDFs is extracted with duplicate
    characters (e.g., "BOLD" → "BBOOLLDD")
    - Some PDF generators simulate bold by rendering each character twice at
    slightly offset positions
    - Added character-level deduplication based on position proximity to
    detect and remove these duplicates
    
    ## Problem
    When extracting text from certain PDFs, bold text appears duplicated:
    ```python
    # Before fix
    elements = partition_pdf("document.pdf", strategy="fast")
    print(elements[0].text)  # Output: ">60>60" instead of ">60"
    ```
    
    ## Solution
    Added character-level deduplication that:
    - Compares consecutive characters' text content and position
    - Removes duplicates where same character appears within 3 pixels
    (configurable)
    - Preserves spaces and other non-character elements (LTAnno objects)
    
    ```python
    # After fix
    elements = partition_pdf("document.pdf", strategy="fast")
    print(elements[0].text)  # Output: ">60" ✓
    ```
    
    ## Configuration
    ```bash
    # Default: 3.0 pixels (enabled)
    export PDF_CHAR_DUPLICATE_THRESHOLD=3.0
    
    # Disable deduplication
    export PDF_CHAR_DUPLICATE_THRESHOLD=0
    
    # More aggressive deduplication
    export PDF_CHAR_DUPLICATE_THRESHOLD=5.0
    ```
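    The deduplication logic can be sketched as follows (a simplified illustration based on the description above, not the library's actual implementation; the function name and the `(text, x, y)` tuple layout are hypothetical):

```python
# Drop a character that repeats the previous one within `threshold` pixels,
# which is how fake-bold PDFs render "B" as two slightly offset "B" glyphs.
def dedupe_chars(chars, threshold=3.0):
    """chars: list of (text, x, y) tuples in reading order."""
    if threshold <= 0:  # threshold 0 disables deduplication
        return chars
    out = []
    for ch in chars:
        if out:
            prev = out[-1]
            if (ch[0] == prev[0]
                    and abs(ch[1] - prev[1]) <= threshold
                    and abs(ch[2] - prev[2]) <= threshold):
                continue  # duplicate glyph rendered at a slight offset
        out.append(ch)
    return out

chars = [("B", 0.0, 0.0), ("B", 0.5, 0.0), ("O", 6.0, 0.0), ("O", 6.5, 0.0)]
print("".join(c[0] for c in dedupe_chars(chars)))  # "BO"
```

    Because the comparison requires both matching text and near-identical position, legitimately repeated letters (e.g. "OO" in "BOOK") sit far enough apart to survive.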
    bittoby authored Feb 13, 2026 · commit 8096b5a

Commits on Feb 17, 2026

  1. Improve fast partition cold start (#4242)

    Improve PDF fast strategy cold-start latency by lazy-loading hi-res-only
    imports in
    [pdf.py](https://github.com/Unstructured-IO/unstructured/blob/1c3d5e6ef7b6123a2d8739bf9a8c3afecc3dd127/unstructured/partition/pdf.py).
    
    This reduces first-call startup overhead without changing partition
    behavior.
    
    Local benchmarks show a ~35% fast-strategy cold-start speedup
    (2.75 s → 1.78 s). They also show a small hi_res slowdown (~2–4%),
    which is acceptable given the fast-path improvement.

    The benchmark was run on these 6 PDFs:
    
    https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/DA-1p.pdf
    
    https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/chevron-page.pdf
    
    https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/embedded-images-tables.pdf
    
    https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/fake-memo-with-duplicate-page.pdf
    
    https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/interface-config-guide-p93.pdf
    
    https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/pdf/layout-parser-paper-fast.pdf
    
    <!-- CURSOR_SUMMARY -->
    ---
    
    > [!NOTE]
    > **Medium Risk**
    > Touches core PDF partitioning by changing import timing and locations;
    behavior should be unchanged but there is some risk of
    missed/conditional imports causing runtime errors in less-tested
    hi_res/OCR/analysis paths.
    > 
    > **Overview**
    > Improves PDF `fast` strategy cold-start performance by **lazy-loading
    hi-res-only dependencies** in `unstructured/partition/pdf.py` (moving
    several `pdf_image`/`unstructured_inference`-related imports into
    `_partition_pdf_or_image_local` and other hi-res/OCR-only code paths),
    while keeping the `fast` path lighter.
    > 
    > Adds `scripts/performance/quick_partition_bench.py` for quick local
    cold vs warm partition timing across one or more PDFs, updates the table
    metrics helper to import `convert_pdf_to_images` from `pdf_image_utils`,
    and bumps the library version to `0.20.4` with corresponding changelog
    entry.
    > 
    > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit b66ae0e.</sup>
    <!-- /CURSOR_SUMMARY -->
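    The lazy-import pattern described above can be sketched like this (an illustrative stand-in, not the actual `pdf.py` code; the function names are hypothetical and the stdlib `json` module plays the role of a heavy inference dependency):

```python
def partition_fast(text):
    """Fast path: no heavy imports at module load time or here."""
    return text.split()

def partition_hi_res(text):
    # Heavy dependency imported only when the hi_res path actually runs;
    # 'json' stands in for an expensive inference library in this sketch.
    import json
    return json.dumps(text.split())
```

    Moving the import inside the function defers its cost to the first hi_res call, which is why the fast path gets faster while hi_res pays a small one-time penalty.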
    CyMule authored Feb 17, 2026 · commit e1f75a3

Commits on Feb 18, 2026

  1. fix: gracefully handle invalid html string during chunking (#4243)

    This PR fixes an issue where an invalid `text_as_html` input to the
    HTML-based table chunking logic could cause chunking to fail, as the
    following stack trace shows:
    
    ```
        |   File "/app/unstructured/unstructured/chunking/base.py", line 594, in iter_chunks
        |     yield from _TableChunker.iter_chunks(
        |   File "/app/unstructured/unstructured/chunking/base.py", line 837, in _iter_chunks
        |     html_size = measure(self._html) if self._html else 0
        |                                        ^^^^^^^^^^
        |   File "/app/unstructured/unstructured/utils.py", line 154, in __get__
        |     value = self._fget(obj)
        |             ^^^^^^^^^^^^^^^
        |   File "/app/unstructured/unstructured/chunking/base.py", line 866, in _html
        |     if not (html_table := self._html_table):
        |                           ^^^^^^^^^^^^^^^^
        |   File "/app/unstructured/unstructured/utils.py", line 154, in __get__
        |     value = self._fget(obj)
        |             ^^^^^^^^^^^^^^^
        |   File "/app/unstructured/unstructured/chunking/base.py", line 884, in _html_table
        |     return HtmlTable.from_html_text(text_as_html)
        |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        |   File "/app/unstructured/unstructured/common/html_table.py", line 61, in from_html_text
        |     root = fragment_fromstring(html_text)
        |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        |   File "/app/.venv/lib/python3.12/site-packages/lxml/html/__init__.py", line 810, in fragment_fromstring
        |     elements = fragments_fromstring(
        |                ^^^^^^^^^^^^^^^^^^^^^
        |   File "/app/.venv/lib/python3.12/site-packages/lxml/html/__init__.py", line 780, in fragments_fromstring
        |     raise etree.ParserError(
        | lxml.etree.ParserError: There is leading text: '```html\n'
    ```
    
    The solution is to catch the parser error in `_html_table` in
    `unstructured/chunking/base.py` and return `None` instead. This way we
    fall back to text-based chunking for the element, with a warning log.
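    The fallback can be sketched like this (a stdlib-based stand-in: the real code uses lxml's `fragment_fromstring` and catches `lxml.etree.ParserError`, while this sketch uses `xml.etree.ElementTree`; the function name is hypothetical):

```python
import xml.etree.ElementTree as ET

def html_table_or_none(text_as_html):
    """Return a parsed element, or None when the markup is unparseable."""
    try:
        return ET.fromstring(text_as_html)
    except ET.ParseError:
        # e.g. leading text like "```html\n" makes the fragment unparseable
        return None

print(html_table_or_none("```html\n<table></table>"))  # None
```

    Returning `None` instead of raising lets the caller degrade gracefully to text-based chunking rather than aborting the whole chunking run.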
    badGarnet authored Feb 18, 2026 · commit c1f819c

Commits on Feb 19, 2026

  1. fix: remap parent id after hashing (#4245)

    This PR addresses an issue where hashing element IDs loses the parent-ID
    references. This happens when calling `partition_html`, where the
    partition process assigns parent IDs to elements based on HTML structure
    before `apply_metadata` is called, i.e., before element-ID hashing
    happens. This fix ensures that parent references remain correct after
    hashing.
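    The remapping can be illustrated with a minimal sketch (names and data structures are hypothetical, not the library's actual code): when IDs change, each `parent_id` must be rewritten through the same old-to-new mapping.

```python
# Rewrite both element IDs and parent references through one mapping so
# parent links still point at the right elements after IDs are hashed.
def remap_parent_ids(elements, old_to_new):
    """elements: list of dicts with 'id' and an optional 'parent_id'."""
    for el in elements:
        el["id"] = old_to_new[el["id"]]
        if el.get("parent_id") is not None:
            el["parent_id"] = old_to_new[el["parent_id"]]
    return elements

els = [{"id": "a", "parent_id": None}, {"id": "b", "parent_id": "a"}]
remap_parent_ids(els, {"a": "h1", "b": "h2"})
print(els[1]["parent_id"])  # h1
```

    Without the remap, the child would keep pointing at the pre-hash ID `"a"`, which no longer exists among the elements.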
    badGarnet authored Feb 19, 2026 · commit e2d8b7a