Skip to content

fix(aiCore): improve PDF processing robustness for aggregator providers#13641

Merged
DeJeune merged 6 commits intomainfrom
fix/pdf-processing-robustness
Mar 20, 2026
Merged

fix(aiCore): improve PDF processing robustness for aggregator providers#13641
DeJeune merged 6 commits intomainfrom
fix/pdf-processing-robustness

Conversation

@EurFelux
Copy link
Copy Markdown
Collaborator

@EurFelux EurFelux commented Mar 19, 2026

What this PR does

Before this PR:

  • supportsPdfInput() hardcodes a provider ID whitelist, causing aggregator providers (cherryin, new-api type) to always fall back to text extraction even when the underlying model supports native PDF.
  • 5 places where PDF processing fails silently — files get dropped from messages with no user notification.

After this PR:

  • Message preparation always produces the richest representation (FilePart) for all PDFs regardless of provider.
  • A new pdfCompatibilityPlugin middleware handles compatibility by converting PDF FileParts to TextParts for providers that don't support native PDF input, using pdf-parse 2.x to extract text directly from FilePart data in the renderer process.
  • All 5 failure paths now show user-facing toast notifications (warnings/errors).
  • pdf-parse upgraded to 2.x (browser-compatible, TypeScript, extracts text from buffer/base64 directly).
  • New shared utility extractPdfText() for PDF text extraction from any data format.

Fixes #13603, fixes #13638

Why we need it and why it was done in this way

The following tradeoffs were made:

  • The plugin uses provider.type (API protocol) instead of getAiSdkProviderId() to determine PDF support, because aggregator providers (cherryin, new-api, gateway) resolve to non-standard AI SDK IDs but speak standard protocols that support PDF.
  • pdf-parse 2.x is used directly in the renderer process to extract text from base64 PDF data, eliminating the need for IPC calls or providerOptions workarounds.
  • The plugin runs for every request but short-circuits immediately for supported providers (Set lookup).

The following alternatives were considered:

  • Extending the whitelist to cover more providers — rejected because it's a whack-a-mole approach.
  • Using getAiSdkProviderId() — rejected because it returns unreliable IDs for aggregator providers.
  • Pre-extracting text in fileProcessor and passing via providerOptions — rejected as it misuses providerOptions and adds unnecessary IO for native providers.
  • Adding IPC methods for buffer-based text extraction — rejected as pdf-parse 2.x runs directly in renderer.

Links to places where the discussion took place:

Breaking changes

None. This is a transparent improvement — providers that previously supported native PDF still get FileParts, and providers that didn't now get TextParts instead of silently dropped files.

Special notes for your reviewer

  • pdf-parse upgraded from 1.1.1 to 2.4.5. The existing patch (patches/pdf-parse-npm-1.1.1-04a6109b2a.patch) is for the indirect dependency (via office-text-extractor) and remains untouched.
  • @langchain/community has a peer dep on [email protected] but only VoyageEmbeddings is imported from it — PDFLoader is not used, so the peer warning is harmless.
  • NOTE: The pdf-parse peer dependency warning for @langchain/community is safe only when PDFLoader is not being used. If PDFLoader is ever imported in the future, this incompatibility must be addressed.
  • qwen-long/qwen-doc models use handleLargeFileUploadfileid:// reference → SystemModelMessage. They never reach the plugin as FilePart, so they are unaffected.
  • Non-PDF documents (Word, Excel) still return null from convertFileBlockToFilePart and use the text fallback path unchanged.
  • isPdfFilePart is a proper TypeScript type guard. No as type assertions in the plugin code.

Checklist

Release note

Improved PDF file processing: aggregator providers (cherryin, new-api, etc.) now correctly handle native PDF input instead of always falling back to text extraction. Added user-facing notifications when PDF processing fails instead of silently dropping files.

EurFelux and others added 2 commits March 19, 2026 19:23
Remove provider whitelist guard from PDF file processing so that all PDFs
produce a rich FilePart with pre-extracted text fallback. Add a new
pdfCompatibilityPlugin middleware that converts PDF FileParts to TextParts
for providers that don't support native PDF input. This fixes aggregator
providers (cherryin, new-api type) always falling back to text extraction
even when the underlying model supports native PDF.

Also add user-facing notifications at 5 silent failure paths where PDF
processing previously failed without informing the user.

Fixes #13603, fixes #13638

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: icarus <[email protected]>
…ider ID

getAiSdkProviderId is unreliable for aggregator providers (e.g. cherryin
resolves to 'cherryin' instead of a standard AI SDK ID). Switch to
checking provider.type (ProviderType) which correctly reflects the API
protocol and ensures aggregator providers with type 'openai' pass through
native PDF support.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: icarus <[email protected]>
@EurFelux EurFelux marked this pull request as ready for review March 19, 2026 11:46
Copy link
Copy Markdown
Collaborator Author

@EurFelux EurFelux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-Review: PR #13641

Overall

Clean design — moving PDF compatibility from upstream whitelist gating to a downstream middleware plugin is the right approach for aggregator providers. Tests are thorough and cover all the important cases.

Issues Found

1. Dead code: supportsPdfInput and PDF_NATIVE_PROVIDER_IDS (Low)

fileProcessor.ts no longer imports supportsPdfInput, and grep confirms no other file uses it. PDF_NATIVE_PROVIDER_IDS is also only referenced inside supportsPdfInput. Both exports are now dead code.

The PR description says "it remains available for other potential consumers", but per project conventions (avoid backwards-compatibility hacks), consider removing them or at least marking as @deprecated.

2. Two PDF provider lists may drift (Medium)

Two separate "providers that support PDF" lists exist:

  • pdfCompatibilityPlugin.tsPDF_NATIVE_PROVIDER_TYPES (by provider.type, API protocol level)
  • modelCapabilities.tsPDF_NATIVE_PROVIDER_IDS (by AI SDK provider ID)

Since supportsPdfInput is dead code, the second list has no consumers. Removing it would leave a single source of truth in the plugin.

No Issues (Verified)

  • Pre-extraction overhead acknowledged in PR description — local file reads are negligible cost
  • Plugin execution order (enforce: 'pre', before Anthropic cache) is correct
  • Toast notifications on all 5 failure paths — good UX improvement
  • Test coverage: supported/unsupported providers, aggregators, mixed content, edge cases

…xtraction

Upgrade pdf-parse to 2.x which supports browser environments. Add
extractPdfText() utility in shared that extracts text directly from
FilePart base64 data. This eliminates providerOptions workaround and
IPC calls — the plugin now extracts text purely in the renderer process.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: icarus <[email protected]>
@EurFelux
Copy link
Copy Markdown
Collaborator Author

EurFelux commented Mar 19, 2026

Self-Review: PR #13641 (Updated — includes commit 280096c)

Changes Since Last Review

Major refactoring in commit 280096c:

  • pdf-parse upgraded from 1.x to 2.x — new API using PDFParse class
  • New shared utility packages/shared/utils/pdf.tsextractPdfText() works in both Node.js and browser
  • Plugin now extracts text directly from base64 data via pdf-parse, instead of relying on pre-extracted text in providerOptions
  • fileProcessor.ts simplified — no more pre-extraction or providerOptions.cherryStudio.pdfTextContent hack
  • Tests updated accordingly

This is a cleaner design — text extraction now happens at the right layer (the plugin that needs it), not upstream.

Issues

1. Bundle size impact of pdfjs-dist in renderer (Medium)

pdf-parse 2.x depends on [email protected], which is a ~2MB+ library. Since the plugin runs in the renderer process and imports @shared/utils/pdf, pdfjs-dist will be bundled into the renderer bundle.

Previously, PDF text extraction happened via IPC (window.api.file.read()) in the main process, keeping pdfjs-dist out of the renderer. Consider:

  • Whether the bundle size increase is acceptable
  • Whether the extraction could still be done via IPC for non-native providers to keep the renderer bundle lean
  • Running pnpm analyze:renderer to measure actual impact

2. Dead code: supportsPdfInput and PDF_NATIVE_PROVIDER_IDS (Low — unchanged from previous review)

Still no consumers for these exports after fileProcessor.ts removed its import. Consider removing or marking as @deprecated.

Verified — No Issues

  • extractPdfText utility: Well-designed with proper destroy() cleanup in finally blocks, handles multiple input types (base64, Uint8Array, ArrayBuffer, URL)
  • Type safety improved: isPdfFilePart() type guard is cleaner than the previous any casts
  • hasPdf early-exit: Smart optimization to avoid creating new objects for messages without PDFs
  • Tests: Both the shared utility tests (with real minimal PDF) and plugin tests (with mocks) are solid
  • Toast notifications: All failure paths covered
  • Plugin execution order: enforce: 'pre' remains correct

Resolution:

  • Issue 1 (bundle size): Accepted — pdf-parse 2.x in renderer is acceptable for now.
  • Issue 2 (dead code): Fixed in commit d7ab2de — removed supportsPdfInput and PDF_NATIVE_PROVIDER_IDS.

EurFelux and others added 2 commits March 19, 2026 20:39
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: icarus <[email protected]>
…ATIVE_PROVIDER_IDS

These are no longer consumed after fileProcessor stopped using the
provider whitelist guard. PDF compatibility is now handled by the
pdfCompatibilityPlugin using provider.type instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: icarus <[email protected]>
Copy link
Copy Markdown
Contributor

@cherry-ai-bot cherry-ai-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-reviewed the updated version of this PR.

The revised approach is cleaner than the previous iteration: it removes the providerOptions-based pre-extraction workaround and instead introduces a shared extractPdfText() utility plus on-demand extraction inside pdfCompatibilityPlugin only for providers that do not support native PDF input. That keeps the native-PDF path untouched while still fixing aggregator-provider compatibility.

I also checked the test changes again. In addition to the plugin tests, the new shared utility tests now cover base64 / Uint8Array / ArrayBuffer inputs for PDF text extraction, which matches the actual risk introduced by the refactor.

Overall, the responsibilities are better separated now (fileProcessor produces FileParts / fallback, plugin handles compatibility), and I did not find a new correctness or compatibility blocker in the updated version. Approving.

@EurFelux EurFelux requested a review from kangfenmao March 19, 2026 13:12
* @param data - PDF content as Uint8Array, ArrayBuffer, base64-encoded string, or URL
* @returns Extracted text content
*/
export async function extractPdfText(data: Uint8Array | ArrayBuffer | string | URL): Promise<string> {
Copy link
Copy Markdown
Collaborator

@beyondkmp beyondkmp Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

This comment was translated by Claude.

Wouldn't it be better to put this in the main process? Running it with Node.js might provide better performance. If the browser needs it, just call directly through IPC.


Original Content

这个放到main process 是不是更好,nodejs来运行,性能可能会更好,如果browser需要的话,直接通过ipc调用。

Copy link
Copy Markdown
Collaborator Author

@EurFelux EurFelux Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

This comment was translated by Claude.

Both are compatible, but going through IPC requires a bit more modification.


Original Content

都兼容,走 ipc 要多改一点

Copy link
Copy Markdown
Collaborator Author

@EurFelux EurFelux Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

This comment was translated by Claude.

URL type cannot go through IPC, so it's handled in the renderer, other types go through IPC


Original Content

URL 类型走不了 ipc,放 renderer 处理了,其他类型走 ipc 了

@@ -0,0 +1,39 @@
import { PDFParse } from 'pdf-parse'
Copy link
Copy Markdown
Collaborator

@beyondkmp beyondkmp Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

This comment was translated by Claude.

Can this library only extract text from normal PDFs? If the PDF is not normal, such as a scanned document, it probably cannot extract any text.


Original Content

这个库是不是只能提取正常的pdf的,如果是不太正常的,比如扫描的,应该是提取不了任何文字出来

Copy link
Copy Markdown
Collaborator Author

@EurFelux EurFelux Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

This issue/comment/review was translated by Claude.

This requires OCR


Original Content

这种要走 ocr

…ocess via IPC

Run pdf-parse in the main (Node.js) process for better performance.
URL inputs are still handled directly in the renderer since they
cannot be serialized over IPC.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: icarus <[email protected]>
@EurFelux EurFelux requested a review from 0xfullex as a code owner March 19, 2026 14:21
* @param data - PDF content as Uint8Array, ArrayBuffer, base64-encoded string, or URL
* @returns Extracted text content
*/
export async function extractPdfText(data: Uint8Array | ArrayBuffer | string | URL): Promise<string> {
Copy link
Copy Markdown
Collaborator

@DeJeune DeJeune Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

This comment was translated by Claude.

Consider adding a TODO.


Original Content 可以加个TODO

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what todo

Copy link
Copy Markdown
Collaborator

@DeJeune DeJeune Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

This comment was translated by Claude.

Just change it to OCR later.


Original Content 就是后面换成OCR

Copy link
Copy Markdown
Collaborator Author

@EurFelux EurFelux Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

This comment was translated by Claude.

OCR should go through a service, not in this util. Let's keep this function pure, and also determining what scenarios need OCR is a problem in itself.


Original Content

OCR需要走service,不能放到这个util里。让这个函数保持pure吧,而且判断什么场景下需要ocr也是个问题

Copy link
Copy Markdown
Collaborator

@DeJeune DeJeune left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kangfenmao
Copy link
Copy Markdown
Collaborator

image

We need to see if this needs to be modified.

@EurFelux
Copy link
Copy Markdown
Collaborator Author

EurFelux commented Mar 20, 2026

Note

This comment was translated by Claude.

This only affects pdf-parse@v1; the directory structure in v2 is completely different. I'm also not entirely sure why this needs to be excluded, but it does not affect the operation of pdf-parse@v2.


Original Content

这个只影响pdf-parse@v1, v2的目录结构完全不同。我也不是很清楚为什么这里要排除掉,但不影响pdf-parse@v2的运行

@DeJeune DeJeune merged commit fb42d4d into main Mar 20, 2026
44 checks passed
@DeJeune DeJeune deleted the fix/pdf-processing-robustness branch March 20, 2026 06:29
@kangfenmao
Copy link
Copy Markdown
Collaborator

(index):10 [pdfCompatibilityPlugin] Failed to extract text from PDF somefile.pdf: Error: Error invoking remote method 'pdf:extractText': Error: Setting up fake worker failed: "Cannot find module '/Users/user/Code/cherry-studio/out/main/pdf.worker.mjs' imported from /Users/user/Code/cherry-studio/out/main/index.js".

kangfenmao pushed a commit that referenced this pull request Mar 26, 2026
…rt list (#13809)

### What this PR does

Before this PR:
- `PDF_NATIVE_PROVIDER_TYPES` included `'openai'`, `'new-api'`, and
`'gateway'` types, assuming all OpenAI-compatible providers support
native PDF file input via the `file` part type.
- Sending a PDF to providers like Moonshot/Kimi (which have `type:
'openai'`) resulted in a 400 error: `"invalid part type: file"`.
- A special-case `isCherryAI` check was needed because CherryAI also has
`type: 'openai'` but doesn't support native PDF.

After this PR:
- Only first-party provider protocols (`openai-response`, `anthropic`,
`gemini`, `azure-openai`, `vertexai`, `aws-bedrock`, `vertex-anthropic`)
are in `PDF_NATIVE_PROVIDER_TYPES`.
- All `type: 'openai'` providers (Moonshot, DeepSeek, Groq, CherryAI,
cherryin, etc.) correctly have PDFs converted to text before sending.
- The CherryAI special-case check is removed as it's no longer needed.

### Why we need it and why it was done in this way

The following tradeoffs were made:
- Removed `'openai'` entirely from the native set rather than adding
per-provider ID exceptions, because the actual OpenAI provider uses
`type: 'openai-response'` (not `'openai'`), and the vast majority of
`type: 'openai'` providers are third-party APIs that don't support
`file` parts.
- Also removed `'new-api'` and `'gateway'` aggregator types, since these
route to various backends — it's safer to convert PDFs to text and let
specific backends handle text rather than risk `file` part errors.

The following alternatives were considered:
- Adding individual provider IDs (like `moonshot`) to a blocklist —
rejected as it's whack-a-mole; new OpenAI-compatible providers would
keep hitting the same bug.
- Keeping `'openai'` in the set and adding more ID-based exceptions —
rejected for the same reason.

Links to places where the discussion took place:
- PR #13641 introduced the `pdfCompatibilityPlugin` with the overly
broad provider type set
- PR #13777 added the CherryAI special case as a point fix

### Breaking changes

None. Providers that previously had PDFs silently fail with 400 errors
will now correctly receive extracted text content instead.

### Special notes for your reviewer

- The actual OpenAI provider uses `type: 'openai-response'`, which
remains in the native set — real OpenAI API users are unaffected.
- All existing tests updated to match new behavior. Test suite passes
fully (3811 tests).
- The `isCherryAI` special case from PR #13777 is removed since CherryAI
(`type: 'openai'`) is now naturally handled by the conversion path.

### Checklist

- [x] PR: The PR description is expressive enough and will help future
contributors
- [x] Code: [Write code that humans can
understand](https://en.wikiquote.org/wiki/Martin_Fowler#code-for-humans)
and [Keep it simple](https://en.wikipedia.org/wiki/KISS_principle)
- [x] Refactor: You have [left the code cleaner than you found it (Boy
Scout
Rule)](https://learning.oreilly.com/library/view/97-things-every/9780596809515/ch08.html)
- [x] Upgrade: Impact of this change on upgrade flows was considered and
addressed if required
- [ ] Documentation: A [user-guide update](https://docs.cherry-ai.com)
was considered and is present (link) or not required. Check this only
when the PR introduces or changes a user-facing feature or behavior.
- [x] Self-review: I have reviewed my own code (e.g., via
[`/gh-pr-review`](/.claude/skills/gh-pr-review/SKILL.md), `gh pr diff`,
or GitHub UI) before requesting review from others

### Release note

```release-note
Fixed PDF file upload failing with "invalid part type: file" error for OpenAI-compatible providers (Moonshot, DeepSeek, Groq, etc.). PDFs are now correctly converted to text for these providers.
```

---------

Signed-off-by: suyao <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Phantom <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

5 participants