
Comparing changes

Choose two branches to see what’s changed or to start a new pull request.

base repository: Unstructured-IO/unstructured
base: 0.21.5
...
head repository: Unstructured-IO/unstructured
compare: 0.22.4
  • 13 commits
  • 43 files changed
  • 12 contributors

Commits on Feb 24, 2026

  1. feat: add create_file_from_elements() to re-create document files from elements (#4259)
    
    ## Summary
    Adds `create_file_from_elements()` in `unstructured.staging.base` so
    users can re-build a document file from a list of elements (reverse of
    partition). Supports the workflow: partition -> modify elements (e.g.
    replace Image with NarrativeText using alt text) -> write back to file.
    
    Closes #3994.
    
    ## Changes
    - **`unstructured/staging/base.py`**: New
    `create_file_from_elements(elements, format="markdown"|"html"|"text",
    filename=None, ...)` that delegates to `elements_to_md`,
    `elements_to_html`, or `elements_to_text` and optionally writes to a
    file.
    - **`test_unstructured/staging/test_base.py`**: Tests for markdown,
    text, and HTML output and for unsupported format raising `ValueError`.
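
A minimal sketch of the dispatch-and-write pattern described above, not the library's implementation. The converter functions are hypothetical stand-ins for `elements_to_md`, `elements_to_html`, and `elements_to_text`, elements are modelled as plain strings, and returning the rendered content is an assumption.

```python
import html
from pathlib import Path
from typing import Callable, Optional


# Hypothetical stand-ins for the real converters. Each takes a list of text
# snippets (real elements are richer objects) and returns a single string.
def _to_markdown(texts: list[str]) -> str:
    return "\n\n".join(texts)


def _to_html(texts: list[str]) -> str:
    return "\n".join(f"<p>{html.escape(t)}</p>" for t in texts)


def _to_text(texts: list[str]) -> str:
    return "\n".join(texts)


_CONVERTERS: dict[str, Callable[[list[str]], str]] = {
    "markdown": _to_markdown,
    "html": _to_html,
    "text": _to_text,
}


def create_file_sketch(
    texts: list[str], format: str = "markdown", filename: Optional[str] = None
) -> str:
    """Render texts in the requested format and optionally write them to a file."""
    if format not in _CONVERTERS:
        # Mirrors the tested behaviour: unsupported formats raise ValueError.
        raise ValueError(f"unsupported format: {format!r}")
    content = _CONVERTERS[format](texts)
    if filename is not None:
        Path(filename).write_text(content, encoding="utf-8")
    return content
```

Keeping the converters in a lookup table means a new output format only needs one more entry, and the unsupported-format check stays in one place.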
    claytonlin1110 authored Feb 24, 2026
    d0f8620

Commits on Feb 25, 2026

  1. Bump dependencies (#4265)

    > [!NOTE]
    > **Medium Risk**
    > Dependency upgrades (especially the `transformers` major version and
    > the `weaviate-client` API shift) can introduce runtime or test
    > regressions; the CI runner change may also surface environment-specific
    > failures.
    >
    > **Overview**
    > Bumps the release to `0.21.7` and updates dependency pins in
    > `pyproject.toml`, notably moving to `wrapt` 2.x+, `transformers` 5.x,
    > and `weaviate-client` 4.x (including constraint updates).
    >
    > Updates the Weaviate staging integration test to use the
    > `weaviate.connect_to_embedded()` / `collections.*` API instead of the
    > legacy `Client`/schema API. CI unit-test jobs are moved from
    > `ubuntu-latest` to the `opensource-linux-8core` runner, and `.gitignore`
    > now ignores `.venv*` directories.
    >
    > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 8e26501.</sup>
    PastelStorm authored Feb 25, 2026
    031b0cf

Commits on Feb 27, 2026

  1. fix: avoid O(N²) re-scanning in _patch_current_chars_with_render_mode (#4266)
    
    ## Problem
    
    `_patch_current_chars_with_render_mode` is called on every
    `do_TJ`/`do_Tj` text operator during PDF parsing. The original
    implementation re-scans the entire `cur_item._objs` list each time,
    checking `hasattr(item, "rendermode")` to skip already-patched items.
    For a page with N characters across M text operations, this is O(N*M) —
    effectively quadratic.
    
    Memray profiling showed this function as the #1 allocator: 17.57 GB
    total across 549M allocations in a session processing just 4 files.
    
    ## Fix
    
    Track the last-patched index so each call only processes newly-added
    `LTChar` objects. Reset automatically when `cur_item` changes (new page
    or figure).
    
    **Before:** O(N²) per page — re-scans all accumulated objects on every
    text operator
    **After:** O(N) per page — each object visited exactly once
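
The idea can be sketched as follows. This is an illustration under stated assumptions rather than the patched function: `RenderModePatcher` is a hypothetical name, items are plain namespace objects instead of `LTChar`, and the container list stands in for `cur_item._objs`.

```python
from types import SimpleNamespace


class RenderModePatcher:
    """Tags newly appended items with a render mode, visiting each item once."""

    def __init__(self) -> None:
        self._container = None  # the list seen on the previous call
        self._next_index = 0    # first index that has not been patched yet

    def patch(self, objs: list, render_mode: int) -> int:
        # A different container (new page or figure) resets the cursor.
        if objs is not self._container:
            self._container = objs
            self._next_index = 0
        visited = 0
        # Only walk the tail that was appended since the last call.
        for i in range(self._next_index, len(objs)):
            objs[i].rendermode = render_mode
            visited += 1
        self._next_index = len(objs)
        return visited


page = [SimpleNamespace() for _ in range(3)]
patcher = RenderModePatcher()
first = patcher.patch(page, 0)    # patches the 3 initial items
page.extend(SimpleNamespace() for _ in range(2))
second = patcher.patch(page, 3)   # patches only the 2 new items
```

Holding a reference to the container and comparing with `is` avoids relying on `id()`, which can be reused once an old list is garbage collected.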
    
    ---------
    
    Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
    Co-authored-by: Alan Bertl <[email protected]>
    3 people authored Feb 27, 2026
    aef3bc4

Commits on Mar 3, 2026

  1. ac14f57
  2. Adds Form Element (#4272)

    The `form` class currently maps to `NarrativeText`. This change maps it
    to `Form` and adds a new `Form` element class.
    aadland6 authored Mar 3, 2026
    6aeb74f

Commits on Mar 4, 2026

  1. feat: audio speech to text partition (#4264)

    ## Summary
    
    Enables partitioning of WAV audio files into document elements by
    transcribing with an optional speech-to-text (STT) agent, defaulting to
    Whisper.
    Closes #4029
    
    ## Changes:
    - New partition_audio() and routing for FileType.WAV so partition()
    supports audio.
    - Pluggable STT layer: SpeechToTextAgent interface and
    SpeechToTextAgentWhisper implementation.
    - Optional extra audio in pyproject.toml (openai-whisper); all-docs
    includes audio.
    - Config: STT_AGENT (and STT_AGENT_MODULES_WHITELIST) for choosing the
    STT implementation.
    
    ## Usage
    Run `pip install "unstructured[audio]"`, then call `partition("file.wav")`
    or `partition_audio("file.wav", language="en")`.
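
A rough sketch of what a pluggable STT layer can look like, under stated assumptions: `SttAgentSketch`, `CannedAgent`, and `partition_audio_sketch` are hypothetical names, the `transcribe()` signature is not taken from the library, and elements are modelled as plain dicts.

```python
from typing import Optional, Protocol


class SttAgentSketch(Protocol):
    """Hypothetical interface: any object with a matching transcribe() qualifies."""

    def transcribe(self, path: str, language: Optional[str] = None) -> str: ...


class CannedAgent:
    """Stand-in agent that returns a fixed transcript instead of running a model."""

    def __init__(self, transcript: str) -> None:
        self._transcript = transcript

    def transcribe(self, path: str, language: Optional[str] = None) -> str:
        return self._transcript


def partition_audio_sketch(
    path: str, agent: SttAgentSketch, language: Optional[str] = None
) -> list[dict]:
    """Transcribe the file and turn each non-empty transcript line into an element."""
    transcript = agent.transcribe(path, language=language)
    return [
        {"type": "NarrativeText", "text": line.strip()}
        for line in transcript.splitlines()
        if line.strip()
    ]


elements = partition_audio_sketch("file.wav", CannedAgent("Hello.\n\n World. "))
```

Because the partitioner depends only on the interface, a Whisper-backed agent and a test double are interchangeable.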
    claytonlin1110 authored Mar 4, 2026
    4da154b
  2. Add a check for complex pdfs (#4268)

    This checks whether a PDF file is likely a complex document that is
    mostly vector graphics, such as
    mini-holistic-3-v1-Eng_Civil-Structural-Drawing_p001.pdf, by comparing
    the ratio of vector images to text elements.
    
    To limit the overhead added to every file, the check only runs on files
    above a minimum file size.
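
The shape of such a check might look like the sketch below. The function name, both threshold values, and the handling of pages with no text are assumptions made for illustration; the commit message does not state them.

```python
def is_likely_complex_pdf(
    file_size_bytes: int,
    vector_count: int,
    text_count: int,
    min_file_size_bytes: int = 1_000_000,  # hypothetical threshold
    ratio_threshold: float = 10.0,         # hypothetical threshold
) -> bool:
    """Return True when vector graphics dominate text in a sufficiently large file."""
    # Cheap gate first: small files skip the ratio check entirely.
    if file_size_bytes < min_file_size_bytes:
        return False
    # No text at all: treat any vector content as "complex" (an assumption).
    if text_count == 0:
        return vector_count > 0
    return vector_count / text_count >= ratio_threshold
```

Putting the size gate first keeps the common case (small, text-heavy files) at a single integer comparison.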
    aadland6 authored Mar 4, 2026
    f6fcba4

Commits on Mar 16, 2026

  1. chore: disable fail-build on Anchore container scan (#4285)

    ## Summary
    - Sets `fail-build: false` on the Anchore `scan-action@v3` step in the
    CI workflow
    - Critical vulnerability findings will still be reported in the scan
    output, but will no longer block the pipeline
    
    ## Test plan
    - [ ] Verify CI pipeline runs and the Anchore scan step completes
    without failing the build
    - [ ] Confirm scan results are still visible in the workflow logs
    
    🤖 Generated with [Claude Code](https://claude.com/claude-code)
    
    ---
    
    > [!NOTE]
    > **Low Risk**
    > Low risk workflow-only change; CI will no longer block merges on
    > critical vulnerability findings, which reduces enforcement rather than
    > altering runtime behavior.
    >
    > **Overview**
    > Updates the CI `test_dockerfile` job to set `fail-build: false` for
    > the `anchore/scan-action@v3` container scan.
    >
    > Critical (fixed) vulnerabilities will still be reported in the scan
    > output, but they will no longer fail the pipeline.
    >
    > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit b01f263.</sup>
    
    Co-authored-by: Claude Opus 4.6 <[email protected]>
    lawrence-u10d and claude authored Mar 16, 2026
    5585e98
  2. feat: make telemetry off by default (#4281)

    ## Summary
    Closes #3940
    
    ## Changes
    - **Behavior:** `scarf_analytics()` sends the ping only when
    `UNSTRUCTURED_TELEMETRY_ENABLED=true` (or `1`). Opt-out env vars
    `DO_NOT_TRACK` and `SCARF_NO_ANALYTICS` are still respected and take
    precedence.
    - **Docs:** README Analytics section and logger comment updated to
    describe the new default and opt-in/opt-out.
    - **Tests:** New `DescribeScarfAnalytics` tests for default off, opt-in
    (`true`/`1`), and opt-out overriding opt-in.
    - **Changelog:** Entry under 0.21.13.
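
The precedence rules above can be sketched like this. It is not the body of `scarf_analytics()`: the helper name is hypothetical, and treating only `true`/`1` as "set" for the opt-out variables is an assumption.

```python
import os
from typing import Mapping, Optional

_TRUTHY = {"true", "1"}


def _is_set(env: Mapping[str, str], name: str) -> bool:
    return env.get(name, "").strip().lower() in _TRUTHY


def telemetry_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Off by default; opt-in is honoured only when no opt-out variable is set."""
    env = os.environ if env is None else env
    # Opt-out variables take precedence over the opt-in variable.
    if _is_set(env, "DO_NOT_TRACK") or _is_set(env, "SCARF_NO_ANALYTICS"):
        return False
    return _is_set(env, "UNSTRUCTURED_TELEMETRY_ENABLED")
```

Passing the environment as a mapping keeps the decision easy to test without mutating `os.environ`.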
    
    ---------
    
    Co-authored-by: Lawrence Elitzer (LoLo) <[email protected]>
    claytonlin1110 and lawrence-u10d authored Mar 16, 2026
    cb16853

Commits on Mar 20, 2026

  1. fix(deps): Update security vulnerability in pypdf to v6.9.1 [SECURITY] (#4248)
    
    This PR contains the following updates:
    
    | Package | Change |
    |---|---|
    | [pypdf](https://redirect.github.com/py-pdf/pypdf) ([changelog](https://pypdf.readthedocs.io/en/latest/meta/CHANGELOG.html)) | `6.7.3` → `6.9.1` |
    
    ### GitHub Vulnerability Alerts
    
    #### [CVE-2026-28351](https://redirect.github.com/py-pdf/pypdf/security/advisories/GHSA-f2v5-7jq9-h8cg)
    
    ### Impact
    
    An attacker who uses this vulnerability can craft a PDF which leads to
    large memory usage. This requires parsing the content stream using the
    RunLengthDecode filter.
    
    ### Patches
    This has been fixed in
    [pypdf==6.7.4](https://redirect.github.com/py-pdf/pypdf/releases/tag/6.7.4).
    
    ### Workarounds
    If you cannot upgrade yet, consider applying the changes from PR
    [#3664](https://redirect.github.com/py-pdf/pypdf/pull/3664).
    
    #### [CVE-2026-28804](https://redirect.github.com/py-pdf/pypdf/security/advisories/GHSA-9m86-7pmv-2852)
    
    ### Impact
    An attacker who uses this vulnerability can craft a PDF which leads to
    long runtimes. This requires accessing a stream which uses the
    `/ASCIIHexDecode` filter.
    
    ### Patches
    This has been fixed in
    [pypdf==6.7.5](https://redirect.github.com/py-pdf/pypdf/releases/tag/6.7.5).
    
    ### Workarounds
    If you cannot upgrade yet, consider applying the changes from PR
    [#3666](https://redirect.github.com/py-pdf/pypdf/pull/3666).
    
    #### [CVE-2026-31826](https://redirect.github.com/py-pdf/pypdf/security/advisories/GHSA-hqmh-ppp3-xvm7)
    
    ### Impact
    
    An attacker who uses this vulnerability can craft a PDF which leads to
    large memory usage. This requires parsing a content stream with a rather
    large `/Length` value, regardless of the actual data length inside the
    stream.
    
    ### Patches
    This has been fixed in
    [pypdf==6.8.0](https://redirect.github.com/py-pdf/pypdf/releases/tag/6.8.0).
    
    ### Workarounds
    If you cannot upgrade yet, consider applying the changes from PR
    [#3675](https://redirect.github.com/py-pdf/pypdf/pull/3675).
    
    As far as we are aware, this mostly affects reading from buffers of
    unknown size, as returned by `open("file.pdf", mode="rb")` for example.
    Passing a file path or a `BytesIO` buffer to *pypdf* instead does not
    seem to trigger the vulnerability.
    
    #### [CVE-2026-33123](https://redirect.github.com/py-pdf/pypdf/security/advisories/GHSA-qpxp-75px-xjcp)
    
    ### Impact
    An attacker who uses this vulnerability can craft a PDF which leads to
    long runtimes and/or large memory usage. This requires accessing an
    array-based stream with lots of entries.
    
    ### Patches
    This has been fixed in
    [pypdf==6.9.1](https://redirect.github.com/py-pdf/pypdf/releases/tag/6.9.1).
    
    ### Workarounds
    If you cannot upgrade yet, consider applying the changes from PR
    [#3686](https://redirect.github.com/py-pdf/pypdf/pull/3686).
    
    ---
    
    ### Release Notes
    
    <details>
    <summary>py-pdf/pypdf (pypdf)</summary>
    
    ### [`v6.9.1`](https://redirect.github.com/py-pdf/pypdf/blob/HEAD/CHANGELOG.md#Version-691-2026-03-17)
    
    [Compare
    Source](https://redirect.github.com/py-pdf/pypdf/compare/6.9.0...6.9.1)
    
    ##### Security (SEC)
    
    - Improve performance and limit length of array-based content streams
    ([#3686](https://redirect.github.com/py-pdf/pypdf/issues/3686))
    
    [Full
    Changelog](https://redirect.github.com/py-pdf/pypdf/compare/6.9.0...6.9.1)
    
    ### [`v6.9.0`](https://redirect.github.com/py-pdf/pypdf/blob/HEAD/CHANGELOG.md#Version-691-2026-03-17)
    
    [Compare
    Source](https://redirect.github.com/py-pdf/pypdf/compare/6.8.0...6.9.0)
    
    ##### Security (SEC)
    
    - Improve performance and limit length of array-based content streams
    ([#3686](https://redirect.github.com/py-pdf/pypdf/issues/3686))
    
    [Full
    Changelog](https://redirect.github.com/py-pdf/pypdf/compare/6.9.0...6.9.1)
    
    ### [`v6.8.0`](https://redirect.github.com/py-pdf/pypdf/blob/HEAD/CHANGELOG.md#Version-690-2026-03-15)
    
    [Compare
    Source](https://redirect.github.com/py-pdf/pypdf/compare/6.7.5...6.8.0)
    
    ##### New Features (ENH)
    
    - Expose /Perms verification result on Encryption object
    ([#3672](https://redirect.github.com/py-pdf/pypdf/issues/3672))
    
    ##### Performance Improvements (PI)
    
    - Fix O(n²) performance in NameObject read/write
    ([#3679](https://redirect.github.com/py-pdf/pypdf/issues/3679))
    - Batch-parse all objects in ObjStm on first access
    ([#3677](https://redirect.github.com/py-pdf/pypdf/issues/3677))
    
    ##### Bug Fixes (BUG)
    
    - Avoid sharing array-based content streams between pages
    ([#3681](https://redirect.github.com/py-pdf/pypdf/issues/3681))
    - Avoid accessing invalid page when inserting blank page under some
    conditions
    ([#3529](https://redirect.github.com/py-pdf/pypdf/issues/3529))
    
    [Full
    Changelog](https://redirect.github.com/py-pdf/pypdf/compare/6.8.0...6.9.0)
    
    ### [`v6.7.5`](https://redirect.github.com/py-pdf/pypdf/blob/HEAD/CHANGELOG.md#Version-680-2026-03-09)
    
    [Compare
    Source](https://redirect.github.com/py-pdf/pypdf/compare/6.7.4...6.7.5)
    
    ##### Security (SEC)
    
    - Limit allowed `/Length` value of stream
    ([#3675](https://redirect.github.com/py-pdf/pypdf/issues/3675))
    
    ##### New Features (ENH)
    
    - Add /IRT (in-reply-to) support for markup annotations
    ([#3631](https://redirect.github.com/py-pdf/pypdf/issues/3631))
    
    ##### Documentation (DOC)
    
    - Avoid using `PageObject.replace_contents` on PdfReader
    ([#3669](https://redirect.github.com/py-pdf/pypdf/issues/3669))
    - Document how to disable jbig2dec calls
    
    [Full
    Changelog](https://redirect.github.com/py-pdf/pypdf/compare/6.7.5...6.8.0)
    
    ### [`v6.7.4`](https://redirect.github.com/py-pdf/pypdf/blob/HEAD/CHANGELOG.md#Version-675-2026-03-02)
    
    [Compare
    Source](https://redirect.github.com/py-pdf/pypdf/compare/6.7.3...6.7.4)
    
    ##### Security (SEC)
    
    - Improve the performance of the ASCIIHexDecode filter
    ([#3666](https://redirect.github.com/py-pdf/pypdf/issues/3666))
    
    [Full
    Changelog](https://redirect.github.com/py-pdf/pypdf/compare/6.7.4...6.7.5)
    
    </details>
    
    
    Co-authored-by: utic-renovate[bot] <235200891+utic-renovate[bot]@users.noreply.github.com>
    utic-renovate[bot] authored Mar 20, 2026
    cc89c8c

Commits on Mar 25, 2026

  1. 1d66b0c
  2. feat: custom Markdown extensions for partition_md (#4292)

    ## Summary
    Closes #4006 
    - Adds support for custom Markdown `extensions` when calling
    `partition_md`, defaulting to `["tables"]` for backward compatibility.
    - Invalid `extensions` values log a warning and fall back to
    `["tables"]`.
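
The fallback behaviour can be sketched as below. `resolve_extensions` is a hypothetical helper, and what counts as an "invalid" value (here: anything that is not a list or tuple of non-empty strings) is an assumption; the real validation in `partition_md` may differ.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)

DEFAULT_EXTENSIONS = ["tables"]


def resolve_extensions(extensions: Any = None) -> list[str]:
    """Return the extensions to use, falling back to the default on bad input."""
    if extensions is None:
        return list(DEFAULT_EXTENSIONS)
    is_valid = isinstance(extensions, (list, tuple)) and all(
        isinstance(ext, str) and ext for ext in extensions
    )
    if not is_valid:
        # Warn and fall back rather than raising, as the summary describes.
        logger.warning(
            "Invalid extensions %r; falling back to %r", extensions, DEFAULT_EXTENSIONS
        )
        return list(DEFAULT_EXTENSIONS)
    return list(extensions)
```

Returning a fresh list each time prevents callers from mutating the shared default.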
    
    ## Motivation
    Fixes incorrect parsing when fenced code blocks contain `#` lines
    (treated as headings without `fenced_code`).
    
    ## How to use
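    ```python
    from unstructured.partition.md import partition_md
    
    # Sample input: the "#" line inside the fence is code, not a heading.
    md = "```python\n# a comment, not a heading\nprint('hi')\n```\n"
    
    elements = partition_md(text=md, extensions=["fenced_code"])
    ```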
    claytonlin1110 authored Mar 25, 2026
    47f42b1

Commits on Mar 26, 2026

  1. feat: tablechunks can reconstruct table (#4291)

    > [!NOTE]
    > **Medium Risk**
    > Changes core table-chunking behavior by adding new metadata fields and
    > reconstruction logic; risk is mainly around backward compatibility and
    > correct ordering/HTML merging of split tables.
    >
    > **Overview**
    > Adds end-to-end support for reassembling split tables after chunking.
    > `TableChunk` now receives stable sequencing metadata (`table_id`,
    > `chunk_index`) when a `Table` is split, and a new
    > `reconstruct_table_from_chunks()` helper in
    > `unstructured.chunking.dispatch` groups and merges `TableChunk`s back
    > into full `Table` elements (including merged `text_as_html` when
    > available).
    >
    > Updates `ElementMetadata` to carry the new fields (dropped during
    > consolidation), bumps version to `0.22.4`, and adds unit tests covering
    > reconstruction across mixed element streams and edge cases like missing
    > `chunk_index`.
    >
    > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 1e732a3.</sup>
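
The grouping-and-ordering step can be sketched as follows. This is not `reconstruct_table_from_chunks()` itself: `ChunkSketch` and `reconstruct_tables` are hypothetical names, joining text with a single space is an assumption, and chunks without a `chunk_index` are simply kept in arrival order after the indexed ones.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChunkSketch:
    """Simplified stand-in for an element carrying table sequencing metadata."""

    text: str
    table_id: Optional[str] = None
    chunk_index: Optional[int] = None


def reconstruct_tables(chunks: list[ChunkSketch]) -> dict[str, str]:
    """Group chunks by table_id and join each group's text in chunk_index order."""
    groups: dict[str, list[ChunkSketch]] = {}
    for chunk in chunks:
        if chunk.table_id is None:
            continue  # not part of a split table; ignored in this sketch
        groups.setdefault(chunk.table_id, []).append(chunk)

    tables: dict[str, str] = {}
    for table_id, members in groups.items():
        # Stable sort: indexed chunks first by index, un-indexed ones after
        # them in their original arrival order.
        ordered = sorted(
            members, key=lambda c: (c.chunk_index is None, c.chunk_index or 0)
        )
        tables[table_id] = " ".join(c.text for c in ordered)
    return tables


stream = [
    ChunkSketch("b", "t1", 1),
    ChunkSketch("a paragraph"),
    ChunkSketch("a", "t1", 0),
    ChunkSketch("x", "t2", 0),
]
tables = reconstruct_tables(stream)
```

Grouping on `table_id` rather than adjacency is what lets reconstruction work on mixed element streams where chunks from different tables interleave.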
    
    ---------
    
    Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
    Co-authored-by: cragwolfe <[email protected]>
    3 people authored Mar 26, 2026
    78dfb30