fix(aiCore): improve PDF processing robustness for aggregator providers#13641
fix(aiCore): improve PDF processing robustness for aggregator providers#13641
Conversation
Remove provider whitelist guard from PDF file processing so that all PDFs produce a rich FilePart with pre-extracted text fallback. Add a new pdfCompatibilityPlugin middleware that converts PDF FileParts to TextParts for providers that don't support native PDF input. This fixes aggregator providers (cherryin, new-api type) always falling back to text extraction even when the underlying model supports native PDF. Also add user-facing notifications at 5 silent failure paths where PDF processing previously failed without informing the user. Fixes #13603, fixes #13638 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: icarus <[email protected]>
…ider ID getAiSdkProviderId is unreliable for aggregator providers (e.g. cherryin resolves to 'cherryin' instead of a standard AI SDK ID). Switch to checking provider.type (ProviderType) which correctly reflects the API protocol and ensures aggregator providers with type 'openai' pass through native PDF support. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: icarus <[email protected]>
EurFelux
left a comment
There was a problem hiding this comment.
Self-Review: PR #13641
Overall
Clean design — moving PDF compatibility from upstream whitelist gating to a downstream middleware plugin is the right approach for aggregator providers. Tests are thorough and cover all the important cases.
Issues Found
1. Dead code: supportsPdfInput and PDF_NATIVE_PROVIDER_IDS (Low)
fileProcessor.ts no longer imports supportsPdfInput, and grep confirms no other file uses it. PDF_NATIVE_PROVIDER_IDS is also only referenced inside supportsPdfInput. Both exports are now dead code.
The PR description says "it remains available for other potential consumers", but per project conventions (avoid backwards-compatibility hacks), consider removing them or at least marking as @deprecated.
2. Two PDF provider lists may drift (Medium)
Two separate "providers that support PDF" lists exist:
pdfCompatibilityPlugin.ts→PDF_NATIVE_PROVIDER_TYPES(byprovider.type, API protocol level)modelCapabilities.ts→PDF_NATIVE_PROVIDER_IDS(by AI SDK provider ID)
Since supportsPdfInput is dead code, the second list has no consumers. Removing it would leave a single source of truth in the plugin.
No Issues (Verified)
- Pre-extraction overhead acknowledged in PR description — local file reads are negligible cost
- Plugin execution order (
enforce: 'pre', before Anthropic cache) is correct - Toast notifications on all 5 failure paths — good UX improvement
- Test coverage: supported/unsupported providers, aggregators, mixed content, edge cases
…xtraction Upgrade pdf-parse to 2.x which supports browser environments. Add extractPdfText() utility in shared that extracts text directly from FilePart base64 data. This eliminates providerOptions workaround and IPC calls — the plugin now extracts text purely in the renderer process. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: icarus <[email protected]>
Self-Review: PR #13641 (Updated — includes commit
|
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: icarus <[email protected]>
…ATIVE_PROVIDER_IDS These are no longer consumed after fileProcessor stopped using the provider whitelist guard. PDF compatibility is now handled by the pdfCompatibilityPlugin using provider.type instead. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: icarus <[email protected]>
There was a problem hiding this comment.
I re-reviewed the updated version of this PR.
The revised approach is cleaner than the previous iteration: it removes the providerOptions-based pre-extraction workaround and instead introduces a shared extractPdfText() utility plus on-demand extraction inside pdfCompatibilityPlugin only for providers that do not support native PDF input. That keeps the native-PDF path untouched while still fixing aggregator-provider compatibility.
I also checked the test changes again. In addition to the plugin tests, the new shared utility tests now cover base64 / Uint8Array / ArrayBuffer inputs for PDF text extraction, which matches the actual risk introduced by the refactor.
Overall, the responsibilities are better separated now (fileProcessor produces FileParts / fallback, plugin handles compatibility), and I did not find a new correctness or compatibility blocker in the updated version. Approving.
| * @param data - PDF content as Uint8Array, ArrayBuffer, base64-encoded string, or URL | ||
| * @returns Extracted text content | ||
| */ | ||
| export async function extractPdfText(data: Uint8Array | ArrayBuffer | string | URL): Promise<string> { |
There was a problem hiding this comment.
Note
This comment was translated by Claude.
Wouldn't it be better to put this in the main process? Running it with Node.js might provide better performance. If the browser needs it, just call directly through IPC.
Original Content
这个放到main process 是不是更好,nodejs来运行,性能可能会更好,如果browser需要的话,直接通过ipc调用。
There was a problem hiding this comment.
Note
This comment was translated by Claude.
Both are compatible, but going through IPC requires a bit more modification.
Original Content
都兼容,走 ipc 要多改一点
There was a problem hiding this comment.
Note
This comment was translated by Claude.
URL type cannot go through IPC, so it's handled in the renderer, other types go through IPC
Original Content
URL 类型走不了 ipc,放 renderer 处理了,其他类型走 ipc 了
| @@ -0,0 +1,39 @@ | |||
| import { PDFParse } from 'pdf-parse' | |||
There was a problem hiding this comment.
Note
This comment was translated by Claude.
Can this library only extract text from normal PDFs? If the PDF is not normal, such as a scanned document, it probably cannot extract any text.
Original Content
这个库是不是只能提取正常的pdf的,如果是不太正常的,比如扫描的,应该是提取不了任何文字出来
There was a problem hiding this comment.
Note
This issue/comment/review was translated by Claude.
This requires OCR
Original Content
这种要走 ocr
…ocess via IPC Run pdf-parse in the main (Node.js) process for better performance. URL inputs are still handled directly in the renderer since they cannot be serialized over IPC. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: icarus <[email protected]>
| * @param data - PDF content as Uint8Array, ArrayBuffer, base64-encoded string, or URL | ||
| * @returns Extracted text content | ||
| */ | ||
| export async function extractPdfText(data: Uint8Array | ArrayBuffer | string | URL): Promise<string> { |
There was a problem hiding this comment.
Note
This comment was translated by Claude.
Consider adding a TODO.
Original Content
可以加个TODOThere was a problem hiding this comment.
Note
This comment was translated by Claude.
Just change it to OCR later.
Original Content
就是后面换成OCRThere was a problem hiding this comment.
Note
This comment was translated by Claude.
OCR should go through a service, not in this util. Let's keep this function pure, and also determining what scenarios need OCR is a problem in itself.
Original Content
OCR需要走service,不能放到这个util里。让这个函数保持pure吧,而且判断什么场景下需要ocr也是个问题
|
Note This comment was translated by Claude. This only affects Original Content这个只影响 |
|
(index):10 [pdfCompatibilityPlugin] Failed to extract text from PDF somefile.pdf: Error: Error invoking remote method 'pdf:extractText': Error: Setting up fake worker failed: "Cannot find module '/Users/user/Code/cherry-studio/out/main/pdf.worker.mjs' imported from /Users/user/Code/cherry-studio/out/main/index.js". |
…rt list (#13809) ### What this PR does Before this PR: - `PDF_NATIVE_PROVIDER_TYPES` included `'openai'`, `'new-api'`, and `'gateway'` types, assuming all OpenAI-compatible providers support native PDF file input via the `file` part type. - Sending a PDF to providers like Moonshot/Kimi (which have `type: 'openai'`) resulted in a 400 error: `"invalid part type: file"`. - A special-case `isCherryAI` check was needed because CherryAI also has `type: 'openai'` but doesn't support native PDF. After this PR: - Only first-party provider protocols (`openai-response`, `anthropic`, `gemini`, `azure-openai`, `vertexai`, `aws-bedrock`, `vertex-anthropic`) are in `PDF_NATIVE_PROVIDER_TYPES`. - All `type: 'openai'` providers (Moonshot, DeepSeek, Groq, CherryAI, cherryin, etc.) correctly have PDFs converted to text before sending. - The CherryAI special-case check is removed as it's no longer needed. ### Why we need it and why it was done in this way The following tradeoffs were made: - Removed `'openai'` entirely from the native set rather than adding per-provider ID exceptions, because the actual OpenAI provider uses `type: 'openai-response'` (not `'openai'`), and the vast majority of `type: 'openai'` providers are third-party APIs that don't support `file` parts. - Also removed `'new-api'` and `'gateway'` aggregator types, since these route to various backends — it's safer to convert PDFs to text and let specific backends handle text rather than risk `file` part errors. The following alternatives were considered: - Adding individual provider IDs (like `moonshot`) to a blocklist — rejected as it's whack-a-mole; new OpenAI-compatible providers would keep hitting the same bug. - Keeping `'openai'` in the set and adding more ID-based exceptions — rejected for the same reason. Links to places where the discussion took place: - PR #13641 introduced the `pdfCompatibilityPlugin` with the overly broad provider type set - PR #13777 added the CherryAI special case as a point fix ### Breaking changes None. Providers that previously had PDFs silently fail with 400 errors will now correctly receive extracted text content instead. ### Special notes for your reviewer - The actual OpenAI provider uses `type: 'openai-response'`, which remains in the native set — real OpenAI API users are unaffected. - All existing tests updated to match new behavior. Test suite passes fully (3811 tests). - The `isCherryAI` special case from PR #13777 is removed since CherryAI (`type: 'openai'`) is now naturally handled by the conversion path. ### Checklist - [x] PR: The PR description is expressive enough and will help future contributors - [x] Code: [Write code that humans can understand](https://en.wikiquote.org/wiki/Martin_Fowler#code-for-humans) and [Keep it simple](https://en.wikipedia.org/wiki/KISS_principle) - [x] Refactor: You have [left the code cleaner than you found it (Boy Scout Rule)](https://learning.oreilly.com/library/view/97-things-every/9780596809515/ch08.html) - [x] Upgrade: Impact of this change on upgrade flows was considered and addressed if required - [ ] Documentation: A [user-guide update](https://docs.cherry-ai.com) was considered and is present (link) or not required. Check this only when the PR introduces or changes a user-facing feature or behavior. - [x] Self-review: I have reviewed my own code (e.g., via [`/gh-pr-review`](/.claude/skills/gh-pr-review/SKILL.md), `gh pr diff`, or GitHub UI) before requesting review from others ### Release note ```release-note Fixed PDF file upload failing with "invalid part type: file" error for OpenAI-compatible providers (Moonshot, DeepSeek, Groq, etc.). PDFs are now correctly converted to text for these providers. ``` --------- Signed-off-by: suyao <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> Co-authored-by: Phantom <[email protected]>

What this PR does
Before this PR:
supportsPdfInput()hardcodes a provider ID whitelist, causing aggregator providers (cherryin, new-api type) to always fall back to text extraction even when the underlying model supports native PDF.After this PR:
pdfCompatibilityPluginmiddleware handles compatibility by converting PDF FileParts to TextParts for providers that don't support native PDF input, usingpdf-parse2.x to extract text directly from FilePart data in the renderer process.pdf-parseupgraded to 2.x (browser-compatible, TypeScript, extracts text from buffer/base64 directly).extractPdfText()for PDF text extraction from any data format.Fixes #13603, fixes #13638
Why we need it and why it was done in this way
The following tradeoffs were made:
provider.type(API protocol) instead ofgetAiSdkProviderId()to determine PDF support, because aggregator providers (cherryin, new-api, gateway) resolve to non-standard AI SDK IDs but speak standard protocols that support PDF.pdf-parse2.x is used directly in the renderer process to extract text from base64 PDF data, eliminating the need for IPC calls or providerOptions workarounds.The following alternatives were considered:
getAiSdkProviderId()— rejected because it returns unreliable IDs for aggregator providers.Links to places where the discussion took place:
Breaking changes
None. This is a transparent improvement — providers that previously supported native PDF still get FileParts, and providers that didn't now get TextParts instead of silently dropped files.
Special notes for your reviewer
pdf-parseupgraded from 1.1.1 to 2.4.5. The existing patch (patches/pdf-parse-npm-1.1.1-04a6109b2a.patch) is for the indirect dependency (viaoffice-text-extractor) and remains untouched.@langchain/communityhas a peer dep on[email protected]but onlyVoyageEmbeddingsis imported from it —PDFLoaderis not used, so the peer warning is harmless.pdf-parsepeer dependency warning for@langchain/communityis safe only whenPDFLoaderis not being used. IfPDFLoaderis ever imported in the future, this incompatibility must be addressed.qwen-long/qwen-docmodels usehandleLargeFileUpload→fileid://reference →SystemModelMessage. They never reach the plugin as FilePart, so they are unaffected.nullfromconvertFileBlockToFilePartand use the text fallback path unchanged.isPdfFilePartis a proper TypeScript type guard. Noastype assertions in the plugin code.Checklist
/gh-pr-review,gh pr diff, or GitHub UI) before requesting review from othersRelease note