Comparing changes
base repository: Unstructured-IO/unstructured
base: 0.21.5
head repository: Unstructured-IO/unstructured
compare: 0.22.4
- 13 commits
- 43 files changed
- 12 contributors
Commits on Feb 24, 2026
-
feat: add create_file_from_elements() to re-create document files from elements (#4259)
## Summary
Adds `create_file_from_elements()` in `unstructured.staging.base` so users can re-build a document file from a list of elements (reverse of partition). Supports the workflow: partition -> modify elements (e.g. replace Image with NarrativeText using alt text) -> write back to file. Closes #3994.
## Changes
- **`unstructured/staging/base.py`**: New `create_file_from_elements(elements, format="markdown"|"html"|"text", filename=None, ...)` that delegates to `elements_to_md`, `elements_to_html`, or `elements_to_text` and optionally writes to a file.
- **`test_unstructured/staging/test_base.py`**: Tests for markdown, text, and HTML output and for unsupported format raising `ValueError`.
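The delegate-by-format behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the code in `unstructured/staging/base.py`: `create_file_sketch`, the three `_to_*` renderers, and the dict-based elements are hypothetical stand-ins, and returning the rendered string is an assumption.

```python
import html
from pathlib import Path

# Hypothetical stand-ins for elements_to_md / elements_to_html / elements_to_text.
# Elements are plain dicts here, not unstructured Element objects.
def _to_markdown(elements):
    return "\n\n".join(
        "# " + e["text"] if e["type"] == "Title" else e["text"] for e in elements
    )

def _to_html(elements):
    parts = []
    for e in elements:
        tag = "h1" if e["type"] == "Title" else "p"
        parts.append(f"<{tag}>{html.escape(e['text'])}</{tag}>")
    return "\n".join(parts)

def _to_text(elements):
    return "\n\n".join(e["text"] for e in elements)

_RENDERERS = {"markdown": _to_markdown, "html": _to_html, "text": _to_text}

def create_file_sketch(elements, format="markdown", filename=None):
    """Render elements in the requested format and optionally write them to a file."""
    if format not in _RENDERERS:
        # Mirrors the documented behavior: unsupported formats raise ValueError.
        raise ValueError(f"Unsupported format: {format!r}")
    content = _RENDERERS[format](elements)
    if filename is not None:
        Path(filename).write_text(content, encoding="utf-8")
    return content  # assumption: the sketch returns the rendered string

sample = [
    {"type": "Title", "text": "Report"},
    {"type": "NarrativeText", "text": "A & B"},
]
print(create_file_sketch(sample, format="html"))
```

A lookup table keeps the supported formats in one place, so adding a format means adding one renderer and one dictionary entry.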
Commit d0f8620
Commits on Feb 25, 2026
-
> [!NOTE]
> **Medium Risk**
> Dependency upgrades (especially `transformers` major version and `weaviate-client` API shift) can introduce runtime or test regressions; CI runner change may also surface environment-specific failures.
>
> **Overview**
> Bumps the release to `0.21.7` and updates dependency pins in `pyproject.toml`, notably moving to `wrapt` 2.x+, `transformers` 5.x, and `weaviate-client` 4.x (including constraint updates).
>
> Updates the Weaviate staging integration test to use the `weaviate.connect_to_embedded()` / `collections.*` API instead of the legacy `Client`/schema API. CI unit-test jobs are moved from `ubuntu-latest` to the `opensource-linux-8core` runner, and `.gitignore` now ignores `.venv*` directories.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 8e26501.</sup>
Commit 031b0cf
Commits on Feb 27, 2026
-
fix: avoid O(N²) re-scanning in _patch_current_chars_with_render_mode (#4266)
## Problem
`_patch_current_chars_with_render_mode` is called on every `do_TJ`/`do_Tj` text operator during PDF parsing. The original implementation re-scans the entire `cur_item._objs` list each time, checking `hasattr(item, "rendermode")` to skip already-patched items. For a page with N characters across M text operations, this is O(N*M), which is effectively quadratic.

Memray profiling showed this function as the #1 allocator: 17.57 GB total across 549M allocations in a session processing just 4 files.
## Fix
Track the last-patched index so each call only processes newly-added `LTChar` objects. Reset automatically when `cur_item` changes (new page or figure).

**Before:** O(N²) per page: re-scans all accumulated objects on every text operator
**After:** O(N) per page: each object visited exactly once

---------
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Alan Bertl <[email protected]>
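The index-tracking idea in the fix can be illustrated in isolation. The sketch below is not the code from the commit: `Char`, `RenderModePatcher`, and the `visits` counter are hypothetical names, and the reset is keyed on the identity of the container list as a stand-in for `cur_item` changing.

```python
class Char:
    """Hypothetical stand-in for an LTChar-like layout object."""

    def __init__(self, text):
        self.text = text


class RenderModePatcher:
    """Patches only objects appended since the previous call."""

    def __init__(self):
        self._container = None  # the list that was patched last time
        self._next_index = 0    # first index that has not been patched yet
        self.visits = 0         # counts visited objects, for illustration only

    def patch(self, objs, render_mode):
        # A different container (new page or figure) resets the bookmark.
        if objs is not self._container:
            self._container = objs
            self._next_index = 0
        for i in range(self._next_index, len(objs)):
            item = objs[i]
            self.visits += 1
            if isinstance(item, Char):
                item.rendermode = render_mode
        self._next_index = len(objs)


patcher = RenderModePatcher()
page = []
for op in range(3):  # three text operators, two new characters each
    page.extend(Char(c) for c in "ab")
    patcher.patch(page, render_mode=op)
print(patcher.visits)  # each of the six objects is visited once
```

With a full re-scan, the same loop would visit 2 + 4 + 6 = 12 objects; the bookmark keeps the total equal to the number of objects.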
Commit aef3bc4
Commits on Mar 3, 2026
-
Commit ac14f57
-
The form class currently maps to NarrativeText. This updates it to Form and adds a new class for form.
Commit 6aeb74f
Commits on Mar 4, 2026
-
feat: audio speech to text partition (#4264)
## Summary
Enables partitioning of WAV audio files into document elements by transcribing with an optional speech-to-text (STT) agent, defaulting to Whisper. Closes #4029
## Changes
- New `partition_audio()` and routing for `FileType.WAV` so `partition()` supports audio.
- Pluggable STT layer: `SpeechToTextAgent` interface and `SpeechToTextAgentWhisper` implementation.
- Optional extra `audio` in `pyproject.toml` (`openai-whisper`); `all-docs` includes `audio`.
- Config: `STT_AGENT` (and `STT_AGENT_MODULES_WHITELIST`) for choosing the STT implementation.
## Usage
`pip install "unstructured[audio]"`, then `partition("file.wav")` or `partition_audio("file.wav", language="en")`.
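The pluggable STT layer can be pictured as a small interface plus interchangeable implementations. The following is a rough sketch under stated assumptions, not the library's classes: `SpeechToTextAgentSketch`, `CannedAgent`, `partition_audio_sketch`, the `transcribe` method name, and the single-element output are all hypothetical.

```python
from abc import ABC, abstractmethod


class SpeechToTextAgentSketch(ABC):
    """Hypothetical interface: anything that turns an audio file into text."""

    @abstractmethod
    def transcribe(self, filename, language=None):
        """Return the transcript of the audio file as a string."""


class CannedAgent(SpeechToTextAgentSketch):
    """Test double that returns a fixed transcript instead of running a model."""

    def __init__(self, transcript):
        self._transcript = transcript

    def transcribe(self, filename, language=None):
        return self._transcript


def partition_audio_sketch(filename, agent, language=None):
    """Transcribe with the given agent and wrap the text in one dict element.

    Assumption: a single NarrativeText-like element per file; how the real
    partitioner segments a transcript is not stated in the commit message.
    """
    transcript = agent.transcribe(filename, language=language).strip()
    if not transcript:
        return []
    return [
        {
            "type": "NarrativeText",
            "text": transcript,
            "metadata": {"filename": filename},
        }
    ]


elements = partition_audio_sketch("file.wav", CannedAgent("  hello world  "), language="en")
print(elements)
```

Keeping the agent behind an interface is what lets a configuration value choose the implementation without the partitioner changing.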
Commit 4da154b
-
Add a check for complex pdfs (#4268)
This checks whether a PDF file is likely a complex document that is mostly vector graphics, like mini-holistic-3-v1-Eng_Civil-Structural-Drawing_p001.pdf, by comparing the ratio of vector images to text elements. To limit the overhead added to every file, the check only runs on files above a minimum file size.
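A ratio check gated by file size might look like the sketch below. It only illustrates the shape of the heuristic: the function name, both thresholds, and the handling of pages with no text are assumptions, since the commit message does not give the actual values.

```python
def is_likely_complex_pdf(
    file_size_bytes,
    vector_count,
    text_count,
    min_file_size_bytes=1_000_000,  # hypothetical threshold
    ratio_threshold=10.0,           # hypothetical threshold
):
    """Return True when vector graphics dominate text in a large enough file."""
    # Small files skip the check entirely, which keeps the common case cheap.
    if file_size_bytes < min_file_size_bytes:
        return False
    # Assumption: no text at all counts as complex if any vector graphics exist.
    if text_count == 0:
        return vector_count > 0
    return vector_count / text_count >= ratio_threshold


print(is_likely_complex_pdf(5_000_000, vector_count=5000, text_count=20))
```

Putting the size test first means counting vector and text objects can be skipped for small files in a real pipeline.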
Commit f6fcba4
Commits on Mar 16, 2026
-
chore: disable fail-build on Anchore container scan (#4285)
## Summary
- Sets `fail-build: false` on the Anchore `scan-action@v3` step in the CI workflow
- Critical vulnerability findings will still be reported in the scan output, but will no longer block the pipeline
## Test plan
- [ ] Verify CI pipeline runs and the Anchore scan step completes without failing the build
- [ ] Confirm scan results are still visible in the workflow logs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---
> [!NOTE]
> **Low Risk**
> Low risk workflow-only change; CI will no longer block merges on critical vulnerability findings, which reduces enforcement rather than altering runtime behavior.
>
> **Overview**
> Updates the CI `test_dockerfile` job to set `fail-build: false` for the `anchore/scan-action@v3` container scan.
>
> Critical (fixed) vulnerabilities will still be reported in the scan output, but they will no longer fail the pipeline.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit b01f263.</sup>

Co-authored-by: Claude Opus 4.6 <[email protected]>
Commit 5585e98
-
feat: make telemetry off by default (#4281)
## Summary
Closes #3940
## Changes
- **Behavior:** `scarf_analytics()` sends the ping only when `UNSTRUCTURED_TELEMETRY_ENABLED=true` (or `1`). Opt-out env vars `DO_NOT_TRACK` and `SCARF_NO_ANALYTICS` are still respected and take precedence.
- **Docs:** README Analytics section and logger comment updated to describe the new default and opt-in/opt-out.
- **Tests:** New `DescribeScarfAnalytics` tests for default off, opt-in (`true`/`1`), and opt-out overriding opt-in.
- **Changelog:** Entry under 0.21.13.

---------
Co-authored-by: Lawrence Elitzer (LoLo) <[email protected]>
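The opt-in and opt-out precedence described above reduces to a small predicate over environment variables. This is a sketch rather than the body of `scarf_analytics()`: the function name `telemetry_enabled` is hypothetical, and treating the opt-out variables as set only for `true`/`1`, case-insensitively, is an assumption.

```python
import os

_TRUTHY = {"true", "1"}


def _is_truthy(env, name):
    # Assumption: values are compared case-insensitively after trimming whitespace.
    return env.get(name, "").strip().lower() in _TRUTHY


def telemetry_enabled(env=None):
    """Return True only when telemetry is opted in and not opted out."""
    env = os.environ if env is None else env
    # Opt-out variables win over the opt-in variable.
    if _is_truthy(env, "DO_NOT_TRACK") or _is_truthy(env, "SCARF_NO_ANALYTICS"):
        return False
    return _is_truthy(env, "UNSTRUCTURED_TELEMETRY_ENABLED")


print(telemetry_enabled({}))  # default is off
print(telemetry_enabled({"UNSTRUCTURED_TELEMETRY_ENABLED": "1"}))
```

Passing the environment as a mapping makes the three cases from the tests bullet (default off, opt-in, opt-out overriding opt-in) easy to exercise without touching the real process environment.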
Commit cb16853
Commits on Mar 20, 2026
-
fix(deps): Update security vulnerability in pypdf to v6.9.1 [SECURITY] (#4248)
This PR contains the following updates:

| Package | Change |
|---|---|
| [pypdf](https://redirect.github.com/py-pdf/pypdf) ([changelog](https://pypdf.readthedocs.io/en/latest/meta/CHANGELOG.html)) | `6.7.3` → `6.9.1` |

### GitHub Vulnerability Alerts
#### [CVE-2026-28351](https://redirect.github.com/py-pdf/pypdf/security/advisories/GHSA-f2v5-7jq9-h8cg)
### Impact
An attacker who uses this vulnerability can craft a PDF which leads to large memory usage. This requires parsing the content stream using the RunLengthDecode filter.
### Patches
This has been fixed in [pypdf==6.7.4](https://redirect.github.com/py-pdf/pypdf/releases/tag/6.7.4).
### Workarounds
If you cannot upgrade yet, consider applying the changes from PR [#3664](https://redirect.github.com/py-pdf/pypdf/pull/3664).

#### [CVE-2026-28804](https://redirect.github.com/py-pdf/pypdf/security/advisories/GHSA-9m86-7pmv-2852)
### Impact
An attacker who uses this vulnerability can craft a PDF which leads to long runtimes. This requires accessing a stream which uses the `/ASCIIHexDecode` filter.
### Patches
This has been fixed in [pypdf==6.7.5](https://redirect.github.com/py-pdf/pypdf/releases/tag/6.7.5).
### Workarounds
If you cannot upgrade yet, consider applying the changes from PR [#3666](https://redirect.github.com/py-pdf/pypdf/pull/3666).

#### [CVE-2026-31826](https://redirect.github.com/py-pdf/pypdf/security/advisories/GHSA-hqmh-ppp3-xvm7)
### Impact
An attacker who uses this vulnerability can craft a PDF which leads to large memory usage. This requires parsing a content stream with a rather large `/Length` value, regardless of the actual data length inside the stream.
### Patches
This has been fixed in [pypdf==6.8.0](https://redirect.github.com/py-pdf/pypdf/releases/tag/6.8.0).
### Workarounds
If you cannot upgrade yet, consider applying the changes from PR [#3675](https://redirect.github.com/py-pdf/pypdf/pull/3675). As far as we are aware, this mostly affects reading from buffers of unknown size, as returned by `open("file.pdf", mode="rb")` for example. Passing a file path or a `BytesIO` buffer to *pypdf* instead does not seem to trigger the vulnerability.

#### [CVE-2026-33123](https://redirect.github.com/py-pdf/pypdf/security/advisories/GHSA-qpxp-75px-xjcp)
### Impact
An attacker who uses this vulnerability can craft a PDF which leads to long runtimes and/or large memory usage. This requires accessing an array-based stream with lots of entries.
### Patches
This has been fixed in [pypdf==6.9.1](https://redirect.github.com/py-pdf/pypdf/releases/tag/6.9.1).
### Workarounds
If you cannot upgrade yet, consider applying the changes from PR [#3686](https://redirect.github.com/py-pdf/pypdf/pull/3686).

---
### Release Notes
<details>
<summary>py-pdf/pypdf (pypdf)</summary>

### [`v6.9.1`](https://redirect.github.com/py-pdf/pypdf/blob/HEAD/CHANGELOG.md#Version-691-2026-03-17)
[Compare Source](https://redirect.github.com/py-pdf/pypdf/compare/6.9.0...6.9.1)
##### Security (SEC)
- Improve performance and limit length of array-based content streams ([#3686](https://redirect.github.com/py-pdf/pypdf/issues/3686))

[Full Changelog](https://redirect.github.com/py-pdf/pypdf/compare/6.9.0...6.9.1)

### [`v6.9.0`](https://redirect.github.com/py-pdf/pypdf/blob/HEAD/CHANGELOG.md#Version-691-2026-03-17)
[Compare Source](https://redirect.github.com/py-pdf/pypdf/compare/6.8.0...6.9.0)
##### Security (SEC)
- Improve performance and limit length of array-based content streams ([#3686](https://redirect.github.com/py-pdf/pypdf/issues/3686))

[Full Changelog](https://redirect.github.com/py-pdf/pypdf/compare/6.9.0...6.9.1)

### [`v6.8.0`](https://redirect.github.com/py-pdf/pypdf/blob/HEAD/CHANGELOG.md#Version-690-2026-03-15)
[Compare Source](https://redirect.github.com/py-pdf/pypdf/compare/6.7.5...6.8.0)
##### New Features (ENH)
- Expose /Perms verification result on Encryption object ([#3672](https://redirect.github.com/py-pdf/pypdf/issues/3672))
##### Performance Improvements (PI)
- Fix O(n²) performance in NameObject read/write ([#3679](https://redirect.github.com/py-pdf/pypdf/issues/3679))
- Batch-parse all objects in ObjStm on first access ([#3677](https://redirect.github.com/py-pdf/pypdf/issues/3677))
##### Bug Fixes (BUG)
- Avoid sharing array-based content streams between pages ([#3681](https://redirect.github.com/py-pdf/pypdf/issues/3681))
- Avoid accessing invalid page when inserting blank page under some conditions ([#3529](https://redirect.github.com/py-pdf/pypdf/issues/3529))

[Full Changelog](https://redirect.github.com/py-pdf/pypdf/compare/6.8.0...6.9.0)

### [`v6.7.5`](https://redirect.github.com/py-pdf/pypdf/blob/HEAD/CHANGELOG.md#Version-680-2026-03-09)
[Compare Source](https://redirect.github.com/py-pdf/pypdf/compare/6.7.4...6.7.5)
##### Security (SEC)
- Limit allowed `/Length` value of stream ([#3675](https://redirect.github.com/py-pdf/pypdf/issues/3675))
##### New Features (ENH)
- Add /IRT (in-reply-to) support for markup annotations ([#3631](https://redirect.github.com/py-pdf/pypdf/issues/3631))
##### Documentation (DOC)
- Avoid using `PageObject.replace_contents` on PdfReader ([#3669](https://redirect.github.com/py-pdf/pypdf/issues/3669))
- Document how to disable jbig2dec calls

[Full Changelog](https://redirect.github.com/py-pdf/pypdf/compare/6.7.5...6.8.0)

### [`v6.7.4`](https://redirect.github.com/py-pdf/pypdf/blob/HEAD/CHANGELOG.md#Version-675-2026-03-02)
[Compare Source](https://redirect.github.com/py-pdf/pypdf/compare/6.7.3...6.7.4)
##### Security (SEC)
- Improve the performance of the ASCIIHexDecode filter ([#3666](https://redirect.github.com/py-pdf/pypdf/issues/3666))

[Full Changelog](https://redirect.github.com/py-pdf/pypdf/compare/6.7.4...6.7.5)

</details>

---
### Configuration
📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.
♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---
This PR has been generated by [Renovate Bot](https://redirect.github.com/renovatebot/renovate).

Co-authored-by: utic-renovate[bot] <235200891+utic-renovate[bot]@users.noreply.github.com>
Commit cc89c8c
Commits on Mar 25, 2026
-
Commit 1d66b0c
-
feat: custom Markdown extensions for partition_md (#4292)
## Summary
Closes #4006
- Adds support for custom Markdown `extensions` when calling `partition_md`, defaulting to `["tables"]` for backward compatibility.
- Invalid `extensions` values log a warning and fall back to `["tables"]`.
## Motivation
Fixes incorrect parsing when fenced code blocks contain `#` lines (treated as headings without `fenced_code`).
## How to use
```python
from unstructured.partition.md import partition_md

md = "# Title\n\nBody text."  # sample input, not part of the original snippet
elements = partition_md(text=md, extensions=["fenced_code"])
```
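The warn-and-fall-back rule for `extensions` can be sketched separately from the partitioner. The helper below is hypothetical (`resolve_extensions` is not a documented function), and the definition of "invalid" used here, anything other than a list or tuple of strings, is an assumption.

```python
import logging

logger = logging.getLogger(__name__)

DEFAULT_EXTENSIONS = ["tables"]


def resolve_extensions(extensions=None):
    """Return a usable list of Markdown extension names.

    None means "use the default"; invalid values log a warning and also
    fall back to the default, as the summary above describes.
    """
    if extensions is None:
        return list(DEFAULT_EXTENSIONS)
    if isinstance(extensions, (list, tuple)) and all(
        isinstance(name, str) for name in extensions
    ):
        return list(extensions)
    logger.warning(
        "Invalid extensions value %r; falling back to %r", extensions, DEFAULT_EXTENSIONS
    )
    return list(DEFAULT_EXTENSIONS)


print(resolve_extensions(["fenced_code"]))
print(resolve_extensions("fenced_code"))  # a bare string is treated as invalid here
```

Returning a fresh list each time avoids callers mutating the shared default.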
Commit 47f42b1
Commits on Mar 26, 2026
-
feat: tablechunks can reconstruct table (#4291)
> [!NOTE]
> **Medium Risk**
> Changes core table-chunking behavior by adding new metadata fields and reconstruction logic; risk is mainly around backward compatibility and correct ordering/HTML merging of split tables.
>
> **Overview**
> Adds end-to-end support for reassembling split tables after chunking. `TableChunk` now receives stable sequencing metadata (`table_id`, `chunk_index`) when a `Table` is split, and a new `reconstruct_table_from_chunks()` helper in `unstructured.chunking.dispatch` groups and merges `TableChunk`s back into full `Table` elements (including merged `text_as_html` when available).
>
> Updates `ElementMetadata` to carry the new fields (dropped during consolidation), bumps version to `0.22.4`, and adds unit tests covering reconstruction across mixed element streams and edge cases like missing `chunk_index`.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 1e732a3.</sup>

---------
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: cragwolfe <[email protected]>
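Grouping by `table_id` and ordering by `chunk_index` is the core of the reconstruction idea. The sketch below shows that step on plain dataclasses; it is not `reconstruct_table_from_chunks()` itself. The `Chunk` type, the function name, the dict return value, and the choice to place chunks without a `chunk_index` last in stream order are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for an element in a chunked stream."""

    kind: str  # e.g. "TableChunk" or "NarrativeText"
    text: str
    table_id: Optional[str] = None
    chunk_index: Optional[int] = None


def reconstruct_tables_sketch(elements):
    """Merge TableChunk texts per table_id, ordered by chunk_index."""
    groups = {}  # table_id -> list of (stream position, chunk)
    for pos, el in enumerate(elements):
        if el.kind == "TableChunk" and el.table_id is not None:
            groups.setdefault(el.table_id, []).append((pos, el))

    merged = {}
    for table_id, members in groups.items():
        # Indexed chunks come first in index order; chunks with a missing
        # index keep their stream order at the end (an assumption of this sketch).
        members.sort(
            key=lambda m: (
                m[1].chunk_index is None,
                m[1].chunk_index if m[1].chunk_index is not None else 0,
                m[0],
            )
        )
        merged[table_id] = " ".join(chunk.text for _, chunk in members)
    return merged


stream = [
    Chunk("TableChunk", "c d", table_id="t1", chunk_index=1),
    Chunk("NarrativeText", "intro"),
    Chunk("TableChunk", "a b", table_id="t1", chunk_index=0),
    Chunk("TableChunk", "x", table_id="t2", chunk_index=0),
    Chunk("TableChunk", "e f", table_id="t1"),
]
print(reconstruct_tables_sketch(stream))
```

Sorting on an explicit index rather than stream position is what makes the result stable when chunks are reordered or interleaved with other elements downstream.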
Commit 78dfb30
You can try running this command locally to see the comparison on your machine:
git diff 0.21.5...0.22.4