fix(aiCore): improve PDF processing robustness for aggregator providers by EurFelux · Pull Request #13641 · CherryHQ/cherry-studio

EurFelux · 2026-03-19T11:32:26Z

What this PR does

Before this PR:

supportsPdfInput() hardcodes a provider ID whitelist, causing aggregator providers (cherryin, new-api type) to always fall back to text extraction even when the underlying model supports native PDF.
5 places where PDF processing fails silently — files get dropped from messages with no user notification.

After this PR:

Message preparation always produces the richest representation (FilePart) for all PDFs regardless of provider.
A new pdfCompatibilityPlugin middleware handles compatibility by converting PDF FileParts to TextParts for providers that don't support native PDF input, using pdf-parse 2.x to extract text directly from FilePart data in the renderer process.
All 5 failure paths now show user-facing toast notifications (warnings/errors).
pdf-parse upgraded to 2.x (browser-compatible, TypeScript, extracts text from buffer/base64 directly).
New shared utility extractPdfText() for PDF text extraction from any data format.

Fixes #13603, fixes #13638

Why we need it and why it was done in this way

The following tradeoffs were made:

The plugin uses provider.type (API protocol) instead of getAiSdkProviderId() to determine PDF support, because aggregator providers (cherryin, new-api, gateway) resolve to non-standard AI SDK IDs but speak standard protocols that support PDF.
pdf-parse 2.x is used directly in the renderer process to extract text from base64 PDF data, eliminating the need for IPC calls or providerOptions workarounds.
The plugin runs for every request but short-circuits immediately for supported providers (Set lookup).

The following alternatives were considered:

Extending the whitelist to cover more providers — rejected because it's a whack-a-mole approach.
Using getAiSdkProviderId() — rejected because it returns unreliable IDs for aggregator providers.
Pre-extracting text in fileProcessor and passing via providerOptions — rejected as it misuses providerOptions and adds unnecessary IO for native providers.
Adding IPC methods for buffer-based text extraction — rejected as pdf-parse 2.x runs directly in renderer.

Links to places where the discussion took place:

[Bug]: native PDF file uploading is unavailable for some providers #13603 (provider whitelist issue)
[Bug]: PDF files silently dropped from messages — 5 failure paths with no user notification #13638 (silent failure paths)

Breaking changes

None. This is a transparent improvement — providers that previously supported native PDF still get FileParts, and providers that didn't now get TextParts instead of silently dropped files.

Special notes for your reviewer

pdf-parse upgraded from 1.1.1 to 2.4.5. The existing patch (patches/pdf-parse-npm-1.1.1-04a6109b2a.patch) is for the indirect dependency (via office-text-extractor) and remains untouched.
@langchain/community has a peer dep on [email protected] but only VoyageEmbeddings is imported from it — PDFLoader is not used, so the peer warning is harmless.
NOTE: The pdf-parse peer dependency warning for @langchain/community is safe only when PDFLoader is not being used. If PDFLoader is ever imported in the future, this incompatibility must be addressed.
qwen-long/qwen-doc models use handleLargeFileUpload → fileid:// reference → SystemModelMessage. They never reach the plugin as FilePart, so they are unaffected.
Non-PDF documents (Word, Excel) still return null from convertFileBlockToFilePart and use the text fallback path unchanged.
isPdfFilePart is a proper TypeScript type guard. No as type assertions in the plugin code.

Checklist

PR: The PR description is expressive enough and will help future contributors
Code: Write code that humans can understand and Keep it simple
Refactor: You have left the code cleaner than you found it (Boy Scout Rule)
Upgrade: Impact of this change on upgrade flows was considered and addressed if required
Documentation: A user-guide update was considered and is present (link) or not required. Check this only when the PR introduces or changes a user-facing feature or behavior.
Self-review: I have reviewed my own code (e.g., via /gh-pr-review, gh pr diff, or GitHub UI) before requesting review from others

Release note

Improved PDF file processing: aggregator providers (cherryin, new-api, etc.) now correctly handle native PDF input instead of always falling back to text extraction. Added user-facing notifications when PDF processing fails instead of silently dropping files.

Remove provider whitelist guard from PDF file processing so that all PDFs produce a rich FilePart with pre-extracted text fallback. Add a new pdfCompatibilityPlugin middleware that converts PDF FileParts to TextParts for providers that don't support native PDF input. This fixes aggregator providers (cherryin, new-api type) always falling back to text extraction even when the underlying model supports native PDF. Also add user-facing notifications at 5 silent failure paths where PDF processing previously failed without informing the user. Fixes #13603, fixes #13638 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: icarus <[email protected]>

…ider ID getAiSdkProviderId is unreliable for aggregator providers (e.g. cherryin resolves to 'cherryin' instead of a standard AI SDK ID). Switch to checking provider.type (ProviderType) which correctly reflects the API protocol and ensures aggregator providers with type 'openai' pass through native PDF support. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: icarus <[email protected]>

EurFelux

Self-Review: PR #13641

Overall

Clean design — moving PDF compatibility from upstream whitelist gating to a downstream middleware plugin is the right approach for aggregator providers. Tests are thorough and cover all the important cases.

Issues Found

1. Dead code: `supportsPdfInput` and `PDF_NATIVE_PROVIDER_IDS` (Low)

fileProcessor.ts no longer imports supportsPdfInput, and grep confirms no other file uses it. PDF_NATIVE_PROVIDER_IDS is also only referenced inside supportsPdfInput. Both exports are now dead code.

The PR description says "it remains available for other potential consumers", but per project conventions (avoid backwards-compatibility hacks), consider removing them or at least marking as @deprecated.

2. Two PDF provider lists may drift (Medium)

Two separate "providers that support PDF" lists exist:

pdfCompatibilityPlugin.ts → PDF_NATIVE_PROVIDER_TYPES (by provider.type, API protocol level)
modelCapabilities.ts → PDF_NATIVE_PROVIDER_IDS (by AI SDK provider ID)

Since supportsPdfInput is dead code, the second list has no consumers. Removing it would leave a single source of truth in the plugin.

No Issues (Verified)

Pre-extraction overhead acknowledged in PR description — local file reads are negligible cost
Plugin execution order (enforce: 'pre', before Anthropic cache) is correct
Toast notifications on all 5 failure paths — good UX improvement
Test coverage: supported/unsupported providers, aggregators, mixed content, edge cases

…xtraction Upgrade pdf-parse to 2.x which supports browser environments. Add extractPdfText() utility in shared that extracts text directly from FilePart base64 data. This eliminates providerOptions workaround and IPC calls — the plugin now extracts text purely in the renderer process. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: icarus <[email protected]>

EurFelux · 2026-03-19T12:38:25Z

Self-Review: PR #13641 (Updated — includes commit `280096c`)

Changes Since Last Review

Major refactoring in commit 280096c:

pdf-parse upgraded from 1.x to 2.x — new API using PDFParse class
New shared utility packages/shared/utils/pdf.ts — extractPdfText() works in both Node.js and browser
Plugin now extracts text directly from base64 data via pdf-parse, instead of relying on pre-extracted text in providerOptions
fileProcessor.ts simplified — no more pre-extraction or providerOptions.cherryStudio.pdfTextContent hack
Tests updated accordingly

This is a cleaner design — text extraction now happens at the right layer (the plugin that needs it), not upstream.

Issues

1. Bundle size impact of pdfjs-dist in renderer (Medium)

pdf-parse 2.x depends on [email protected], which is a ~2MB+ library. Since the plugin runs in the renderer process and imports @shared/utils/pdf, pdfjs-dist will be bundled into the renderer bundle.

Previously, PDF text extraction happened via IPC (window.api.file.read()) in the main process, keeping pdfjs-dist out of the renderer. Consider:

Whether the bundle size increase is acceptable
Whether the extraction could still be done via IPC for non-native providers to keep the renderer bundle lean
Running pnpm analyze:renderer to measure actual impact

2. Dead code: `supportsPdfInput` and `PDF_NATIVE_PROVIDER_IDS` (Low — unchanged from previous review)

Still no consumers for these exports after fileProcessor.ts removed its import. Consider removing or marking as @deprecated.

Verified — No Issues

extractPdfText utility: Well-designed with proper destroy() cleanup in finally blocks, handles multiple input types (base64, Uint8Array, ArrayBuffer, URL)
Type safety improved: isPdfFilePart() type guard is cleaner than the previous any casts
hasPdf early-exit: Smart optimization to avoid creating new objects for messages without PDFs
Tests: Both the shared utility tests (with real minimal PDF) and plugin tests (with mocks) are solid
Toast notifications: All failure paths covered
Plugin execution order: enforce: 'pre' remains correct

Resolution:

Issue 1 (bundle size): Accepted — pdf-parse 2.x in renderer is acceptable for now.
Issue 2 (dead code): Fixed in commit d7ab2de — removed supportsPdfInput and PDF_NATIVE_PROVIDER_IDS.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: icarus <[email protected]>

…ATIVE_PROVIDER_IDS These are no longer consumed after fileProcessor stopped using the provider whitelist guard. PDF compatibility is now handled by the pdfCompatibilityPlugin using provider.type instead. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: icarus <[email protected]>

cherry-ai-bot

I re-reviewed the updated version of this PR.

The revised approach is cleaner than the previous iteration: it removes the providerOptions-based pre-extraction workaround and instead introduces a shared extractPdfText() utility plus on-demand extraction inside pdfCompatibilityPlugin only for providers that do not support native PDF input. That keeps the native-PDF path untouched while still fixing aggregator-provider compatibility.

I also checked the test changes again. In addition to the plugin tests, the new shared utility tests now cover base64 / Uint8Array / ArrayBuffer inputs for PDF text extraction, which matches the actual risk introduced by the refactor.

Overall, the responsibilities are better separated now (fileProcessor produces FileParts / fallback, plugin handles compatibility), and I did not find a new correctness or compatibility blocker in the updated version. Approving.

beyondkmp · 2026-03-19T13:46:50Z

packages/shared/utils/pdf.ts

+ * @param data - PDF content as Uint8Array, ArrayBuffer, base64-encoded string, or URL
+ * @returns Extracted text content
+ */
+export async function extractPdfText(data: Uint8Array | ArrayBuffer | string | URL): Promise<string> {


Note
This comment was translated by Claude.

Wouldn't it be better to put this in the main process? Running it with Node.js might provide better performance. If the browser needs it, just call directly through IPC.

Original Content

这个放到main process 是不是更好，nodejs来运行，性能可能会更好，如果browser需要的话，直接通过ipc调用。

Note
This comment was translated by Claude.

Both are compatible, but going through IPC requires a bit more modification.

Original Content

都兼容，走 ipc 要多改一点

Note
This comment was translated by Claude.

URL type cannot go through IPC, so it's handled in the renderer, other types go through IPC

Original Content

URL 类型走不了 ipc，放 renderer 处理了，其他类型走 ipc 了

beyondkmp · 2026-03-19T13:49:24Z

packages/shared/utils/pdf.ts

@@ -0,0 +1,39 @@
+import { PDFParse } from 'pdf-parse'


Note
This comment was translated by Claude.

Can this library only extract text from normal PDFs? If the PDF is not normal, such as a scanned document, it probably cannot extract any text.

Original Content

这个库是不是只能提取正常的pdf的，如果是不太正常的，比如扫描的，应该是提取不了任何文字出来

Note
This issue/comment/review was translated by Claude.

This requires OCR

Original Content

这种要走 ocr

…ocess via IPC Run pdf-parse in the main (Node.js) process for better performance. URL inputs are still handled directly in the renderer since they cannot be serialized over IPC. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: icarus <[email protected]>

DeJeune · 2026-03-19T15:38:22Z

packages/shared/utils/pdf.ts

+ * @param data - PDF content as Uint8Array, ArrayBuffer, base64-encoded string, or URL
+ * @returns Extracted text content
+ */
+export async function extractPdfText(data: Uint8Array | ArrayBuffer | string | URL): Promise<string> {


Note
This comment was translated by Claude.

Consider adding a TODO.

Original Content
可以加个TODO

Note
This comment was translated by Claude.

Just change it to OCR later.

Original Content
就是后面换成OCR

Note
This comment was translated by Claude.

OCR should go through a service, not in this util. Let's keep this function pure, and also determining what scenarios need OCR is a problem in itself.

Original Content

OCR需要走service，不能放到这个util里。让这个函数保持pure吧，而且判断什么场景下需要ocr也是个问题

src/renderer/src/aiCore/plugins/pdfCompatibilityPlugin.ts

DeJeune

LGTM

kangfenmao · 2026-03-20T03:37:48Z

We need to see if this needs to be modified.

EurFelux · 2026-03-20T06:04:28Z

Note

This comment was translated by Claude.

This only affects pdf-parse@v1; the directory structure in v2 is completely different. I'm also not entirely sure why this needs to be excluded, but it does not affect the operation of pdf-parse@v2.

Original Content

这个只影响pdf-parse@v1, v2的目录结构完全不同。我也不是很清楚为什么这里要排除掉，但不影响pdf-parse@v2的运行

kangfenmao · 2026-03-25T03:24:43Z

(index):10 [pdfCompatibilityPlugin] Failed to extract text from PDF somefile.pdf: Error: Error invoking remote method 'pdf:extractText': Error: Setting up fake worker failed: "Cannot find module '/Users/user/Code/cherry-studio/out/main/pdf.worker.mjs' imported from /Users/user/Code/cherry-studio/out/main/index.js".

…rt list (#13809) ### What this PR does Before this PR: - `PDF_NATIVE_PROVIDER_TYPES` included `'openai'`, `'new-api'`, and `'gateway'` types, assuming all OpenAI-compatible providers support native PDF file input via the `file` part type. - Sending a PDF to providers like Moonshot/Kimi (which have `type: 'openai'`) resulted in a 400 error: `"invalid part type: file"`. - A special-case `isCherryAI` check was needed because CherryAI also has `type: 'openai'` but doesn't support native PDF. After this PR: - Only first-party provider protocols (`openai-response`, `anthropic`, `gemini`, `azure-openai`, `vertexai`, `aws-bedrock`, `vertex-anthropic`) are in `PDF_NATIVE_PROVIDER_TYPES`. - All `type: 'openai'` providers (Moonshot, DeepSeek, Groq, CherryAI, cherryin, etc.) correctly have PDFs converted to text before sending. - The CherryAI special-case check is removed as it's no longer needed. ### Why we need it and why it was done in this way The following tradeoffs were made: - Removed `'openai'` entirely from the native set rather than adding per-provider ID exceptions, because the actual OpenAI provider uses `type: 'openai-response'` (not `'openai'`), and the vast majority of `type: 'openai'` providers are third-party APIs that don't support `file` parts. - Also removed `'new-api'` and `'gateway'` aggregator types, since these route to various backends — it's safer to convert PDFs to text and let specific backends handle text rather than risk `file` part errors. The following alternatives were considered: - Adding individual provider IDs (like `moonshot`) to a blocklist — rejected as it's whack-a-mole; new OpenAI-compatible providers would keep hitting the same bug. - Keeping `'openai'` in the set and adding more ID-based exceptions — rejected for the same reason. Links to places where the discussion took place: - PR #13641 introduced the `pdfCompatibilityPlugin` with the overly broad provider type set - PR #13777 added the CherryAI special case as a point fix ### Breaking changes None. Providers that previously had PDFs silently fail with 400 errors will now correctly receive extracted text content instead. ### Special notes for your reviewer - The actual OpenAI provider uses `type: 'openai-response'`, which remains in the native set — real OpenAI API users are unaffected. - All existing tests updated to match new behavior. Test suite passes fully (3811 tests). - The `isCherryAI` special case from PR #13777 is removed since CherryAI (`type: 'openai'`) is now naturally handled by the conversion path. ### Checklist - [x] PR: The PR description is expressive enough and will help future contributors - [x] Code: [Write code that humans can understand](https://en.wikiquote.org/wiki/Martin_Fowler#code-for-humans) and [Keep it simple](https://en.wikipedia.org/wiki/KISS_principle) - [x] Refactor: You have [left the code cleaner than you found it (Boy Scout Rule)](https://learning.oreilly.com/library/view/97-things-every/9780596809515/ch08.html) - [x] Upgrade: Impact of this change on upgrade flows was considered and addressed if required - [ ] Documentation: A [user-guide update](https://docs.cherry-ai.com) was considered and is present (link) or not required. Check this only when the PR introduces or changes a user-facing feature or behavior. - [x] Self-review: I have reviewed my own code (e.g., via [`/gh-pr-review`](/.claude/skills/gh-pr-review/SKILL.md), `gh pr diff`, or GitHub UI) before requesting review from others ### Release note ```release-note Fixed PDF file upload failing with "invalid part type: file" error for OpenAI-compatible providers (Moonshot, DeepSeek, Groq, etc.). PDFs are now correctly converted to text for these providers. ``` --------- Signed-off-by: suyao <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> Co-authored-by: Phantom <[email protected]>

EurFelux and others added 2 commits March 19, 2026 19:23

EurFelux marked this pull request as ready for review March 19, 2026 11:46

EurFelux commented Mar 19, 2026

View reviewed changes

EurFelux and others added 2 commits March 19, 2026 20:39

chore(i18n): sync and sort i18n keys for PDF error notifications

aaef703

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: icarus <[email protected]>

EurFelux requested review from DeJeune, MyPrototypeWhat and beyondkmp March 19, 2026 12:42

cherry-ai-bot bot approved these changes Mar 19, 2026

View reviewed changes

EurFelux requested a review from kangfenmao March 19, 2026 13:12

beyondkmp reviewed Mar 19, 2026

View reviewed changes

EurFelux requested a review from 0xfullex as a code owner March 19, 2026 14:21

DeJeune reviewed Mar 19, 2026

View reviewed changes

src/renderer/src/aiCore/plugins/pdfCompatibilityPlugin.ts Show resolved Hide resolved

DeJeune approved these changes Mar 20, 2026

View reviewed changes

MyPrototypeWhat approved these changes Mar 20, 2026

View reviewed changes

DeJeune merged commit fb42d4d into main Mar 20, 2026
44 checks passed

DeJeune deleted the fix/pdf-processing-robustness branch March 20, 2026 06:29

DeJeune mentioned this pull request Mar 26, 2026

fix(aiCore): remove openai-compatible providers from PDF native support list #13809

Merged

6 tasks

ninime09 mentioned this pull request Mar 30, 2026

[Bug]: PDF and other files cannot be recognized #13866

Open

10 tasks

Conversation

EurFelux commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does

Why we need it and why it was done in this way

Breaking changes

Special notes for your reviewer

Checklist

Release note

Uh oh!

EurFelux left a comment

Choose a reason for hiding this comment

Self-Review: PR #13641

Overall

Issues Found

1. Dead code: supportsPdfInput and PDF_NATIVE_PROVIDER_IDS (Low)

2. Two PDF provider lists may drift (Medium)

No Issues (Verified)

Uh oh!

EurFelux commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Self-Review: PR #13641 (Updated — includes commit 280096c)

Changes Since Last Review

Issues

1. Bundle size impact of pdfjs-dist in renderer (Medium)

2. Dead code: supportsPdfInput and PDF_NATIVE_PROVIDER_IDS (Low — unchanged from previous review)

Verified — No Issues

Uh oh!

cherry-ai-bot bot left a comment

Choose a reason for hiding this comment

Uh oh!

beyondkmp Mar 19, 2026 • edited by kangfenmao Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

EurFelux Mar 19, 2026 • edited by kangfenmao Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

EurFelux Mar 19, 2026 • edited by kangfenmao Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

beyondkmp Mar 19, 2026 • edited by kangfenmao Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

EurFelux Mar 19, 2026 • edited by kangfenmao Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

DeJeune Mar 19, 2026 • edited by kangfenmao Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

EurFelux Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

DeJeune Mar 20, 2026 • edited by kangfenmao Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

EurFelux Mar 20, 2026 • edited by kangfenmao Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

DeJeune left a comment

Choose a reason for hiding this comment

Uh oh!

kangfenmao commented Mar 20, 2026

Uh oh!

EurFelux commented Mar 20, 2026 • edited by kangfenmao Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

kangfenmao commented Mar 25, 2026

Uh oh!

Reviewers

Assignees

EurFelux commented Mar 19, 2026 •

edited

Loading

1. Dead code: `supportsPdfInput` and `PDF_NATIVE_PROVIDER_IDS` (Low)

EurFelux commented Mar 19, 2026 •

edited

Loading

Self-Review: PR #13641 (Updated — includes commit `280096c`)

2. Dead code: `supportsPdfInput` and `PDF_NATIVE_PROVIDER_IDS` (Low — unchanged from previous review)

beyondkmp Mar 19, 2026 •

edited by kangfenmao

Loading

EurFelux Mar 19, 2026 •

edited by kangfenmao

Loading

EurFelux Mar 19, 2026 •

edited by kangfenmao

Loading

beyondkmp Mar 19, 2026 •

edited by kangfenmao

Loading

EurFelux Mar 19, 2026 •

edited by kangfenmao

Loading

DeJeune Mar 19, 2026 •

edited by kangfenmao

Loading

DeJeune Mar 20, 2026 •

edited by kangfenmao

Loading

EurFelux Mar 20, 2026 •

edited by kangfenmao

Loading

EurFelux commented Mar 20, 2026 •

edited by kangfenmao

Loading