feat: handling larger more complex PDF docs (and fix) #1663
Merged
Contributor
Pull Request Overview
This PR improves PDF processing by integrating the extractous library to better handle text extraction (including from URLs) and large PDF files, and it also refines the CI pipeline with more aggressive disk cleanup steps.
- Replace low-level PDF text extraction with extractous for both file and URL-based PDFs
- Add tests for URL text extraction and image extraction error handling, and support for large PDF files
- Enhance CI workflows with additional cleanup steps and dependency updates
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| crates/goose-mcp/src/computercontroller/pdf_tool.rs | Introduces extractous library for enhanced PDF text extraction and improved handling of large texts |
| .github/workflows/ci.yml | Adds aggressive disk cleanup and artifact removal steps to the CI process |
| crates/goose-mcp/Cargo.toml | Adds extractous dependency for improved PDF processing |
| crates/goose-mcp/src/computercontroller/mod.rs | Updates PDF tool description to clarify URL support |
Comments suppressed due to low confidence (1)
.github/workflows/ci.yml:83
- The removal command for '/opt/hostedtoolcache' is duplicated (also appearing on line 85). Consider removing the duplicate to simplify the CI script.
sudo rm -rf /opt/hostedtoolcache
baxen
approved these changes
Mar 13, 2025
Collaborator
baxen
left a comment
Looks good! But I'd like to test this out; @wendytang is on it.
wendytang
reviewed
Mar 13, 2025
rg 'search term' {}\n\n\
Or view portions of it:\n\
head -n 50 {}\n\
tail -n 50 {}\n\
This isn't output to the user, correct? Mainly an FYI for goose, iiuc.
Collaborator
Author
Yes, these are instructions for goose to use in case of large files.
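For readers following the thread, here is a minimal sketch of the idea being discussed: when extracted PDF text is too large to return inline, save it to disk and hand goose a pointer plus shell hints instead. This is not the code from `pdf_tool.rs`. The function name `text_or_pointer`, the `max_inline_chars` parameter, and the file name `extracted_pdf_text.txt` are all assumptions made for illustration, and the sketch uses only the standard library rather than extractous.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper (not from the PR): returns the text itself when it is
/// small enough, otherwise writes it under `cache_dir` and returns
/// instructions that tell the agent how to search or page through the file.
fn text_or_pointer(text: &str, cache_dir: &Path, max_inline_chars: usize) -> io::Result<String> {
    let char_count = text.chars().count();
    if char_count <= max_inline_chars {
        // Small document: return the extracted text unchanged.
        return Ok(text.to_string());
    }

    // Large document: persist the full text so nothing is lost.
    fs::create_dir_all(cache_dir)?;
    let path = cache_dir.join("extracted_pdf_text.txt"); // assumed file name
    fs::write(&path, text)?;
    let shown = path.display().to_string();

    // The message mirrors the hints quoted in the review (rg, head, tail).
    Ok(format!(
        "Extracted text is large ({char_count} characters) and was saved to:\n\
         {shown}\n\n\
         Search it with:\n\
         rg 'search term' {shown}\n\n\
         Or view portions of it:\n\
         head -n 50 {shown}\n\
         tail -n 50 {shown}\n"
    ))
}
```

Returning a pointer rather than the full text keeps the tool response small, while the saved file still lets goose reach any part of the document on demand.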
wendytang
approved these changes
Mar 13, 2025
wendytang
added a commit
that referenced
this pull request
Mar 13, 2025
This reverts commit 4c31832.
wendytang
added a commit
that referenced
this pull request
Mar 13, 2025
wendytang
added a commit
that referenced
this pull request
Mar 13, 2025
michaelneale
added a commit
that referenced
this pull request
Mar 14, 2025
* main: (32 commits) ui: load builtins (#1679) chore(release): release version 1.0.14 (#1676) Revert "feat: handling larger more complex PDF docs (and fix) (#1663)" (#1675) fix: uvshim default to existing uv configuration (#1670) fix: handle interruptions during tool responses (#1651) feat: Copy error message button in toast (#1658) feat: handling larger more complex PDF docs (and fix) (#1663) Add Filesystem Tutorial (#1666) docs: figma blog post (#1647) docs: updating goose modes doc (#1665) docs: Add running tasks guide (#1626) docs: Add experimental features (#1644) feat(cli): add better error message, support stdin via -i - or just no args (#1660) feat: extensions read config (#1637) fix: trigger words for memory (#1654) fix: cleanup keyboard shortcut indication (#1642) Extensions load in background and show pending state (#1657) Extension error toast stays until dismissed, and error message cleanup (#1653) fix: remove other category in settings (#1641) fix: restore image outputs from tool calls (#1640) ...
kalvinnchau
added a commit
that referenced
this pull request
Mar 14, 2025
* origin/main: (29 commits) ui: reorganize extensions settings (#1702) feat: google_drive write tools and read comment tool (#1650) fix: developer builtin name (#1699) chore: update extensions section to work with new endpoints (#1696) chore: move things around (#1662) ui: extensions state updates (#1674) docs: goose ollama blog, updated (#1691) ui: load builtins (#1679) chore(release): release version 1.0.14 (#1676) Revert "feat: handling larger more complex PDF docs (and fix) (#1663)" (#1675) fix: uvshim default to existing uv configuration (#1670) fix: handle interruptions during tool responses (#1651) feat: Copy error message button in toast (#1658) feat: handling larger more complex PDF docs (and fix) (#1663) Add Filesystem Tutorial (#1666) docs: figma blog post (#1647) docs: updating goose modes doc (#1665) docs: Add running tasks guide (#1626) docs: Add experimental features (#1644) feat(cli): add better error message, support stdin via -i - or just no args (#1660) ...
laanak08
added a commit
that referenced
this pull request
Mar 16, 2025
* main: (31 commits) feat: add default metrics for core evals (#1602) feat(google_drive): use oauth2 crate for PKCE support, make token storage generic over Serializable (#1645) ui: reorganize extensions settings (#1702) feat: google_drive write tools and read comment tool (#1650) fix: developer builtin name (#1699) chore: update extensions section to work with new endpoints (#1696) chore: move things around (#1662) ui: extensions state updates (#1674) docs: goose ollama blog, updated (#1691) ui: load builtins (#1679) chore(release): release version 1.0.14 (#1676) Revert "feat: handling larger more complex PDF docs (and fix) (#1663)" (#1675) fix: uvshim default to existing uv configuration (#1670) fix: handle interruptions during tool responses (#1651) feat: Copy error message button in toast (#1658) feat: handling larger more complex PDF docs (and fix) (#1663) Add Filesystem Tutorial (#1666) docs: figma blog post (#1647) docs: updating goose modes doc (#1665) docs: Add running tasks guide (#1626) ...
ahau-square
pushed a commit
that referenced
this pull request
May 2, 2025
ahau-square
pushed a commit
that referenced
this pull request
May 2, 2025
cbruyndoncx
pushed a commit
to cbruyndoncx/goose
that referenced
this pull request
Jul 20, 2025
cbruyndoncx
pushed a commit
to cbruyndoncx/goose
that referenced
this pull request
Jul 20, 2025
Using a new library, https://crates.io/crates/extractous, which seems to perform vastly better (still using low-level objects to extract images when needed).
Enhancements:
Fixes: #1664