gemini prompt caching #3181

Merged

pashpashpash merged 14 commits into main from pashpashpash/gemini-prompt-caching on May 2, 2025

Conversation

@pashpashpash (Contributor) commented Apr 29, 2025

Description

Improved pricing calculation for Gemini and Vertex, plus more robust caching and cache tracking for both providers.

  • Cache Creation/Update: The logic correctly identifies when to create a new cache (first message for a task) or update it (subsequent messages with new content). It uses the updateCacheContent function, which creates a new cache with the full content (old + new) due to Gemini API limitations.
  • Cache Cleanup: The updateCacheContent function now attempts to delete the old cache in the background using setTimeout after successfully creating the new one. This deletion is non-blocking; if it fails (e.g., due to permissions causing a 403 error), the error is logged, but the main operation continues. The old cache will eventually expire based on its TTL, providing a fallback cleanup mechanism. This addresses the previous 403 error issue.
  • Robustness: The use of the isCacheBusy flag helps prevent race conditions during cache updates. The non-blocking deletion makes the process more resilient to potential API errors during cleanup.
  • Cost Accounting: The costs are correctly split into immediate (input, output, cache read) and ongoing (cache storage). The calculateCost method returns only immediate costs for per-message display, while getTaskOngoingCosts calculates the storage cost based on the total tokens tracked in taskCacheTokens.
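The create-then-retire flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual code: `CacheClient`, `TaskCache`, and the method signatures are hypothetical stand-ins for the real Gemini SDK calls in gemini.ts.

```typescript
// Hypothetical stand-in for the Gemini caching API used by the handler.
interface CacheClient {
  createCache(contents: string[]): Promise<string>; // resolves to the new cache name
  deleteCache(name: string): Promise<void>;
}

class TaskCache {
  private cacheName: string | null = null;
  private isCacheBusy = false; // guards against overlapping updates (race conditions)

  constructor(private client: CacheClient) {}

  // "Updating" a Gemini cache here means creating a new cache with the full
  // content (old + new) and retiring the old one, per the API limitation
  // described above.
  async updateCacheContent(fullContents: string[]): Promise<string | null> {
    if (this.isCacheBusy) return this.cacheName; // skip if an update is in flight
    this.isCacheBusy = true;
    try {
      const oldName = this.cacheName;
      this.cacheName = await this.client.createCache(fullContents);
      if (oldName) {
        // Non-blocking cleanup: failures (e.g. a 403) are logged, never thrown;
        // the old cache still expires via its TTL as a fallback.
        setTimeout(() => {
          this.client
            .deleteCache(oldName)
            .catch((err) => console.error("background cache delete failed:", err));
        }, 0);
      }
      return this.cacheName;
    } finally {
      this.isCacheBusy = false;
    }
  }
}
```

The busy flag is released in a `finally` block so that a failed create does not leave the cache permanently locked.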

Test Procedure

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • ♻️ Refactor Changes
  • 💅 Cosmetic Changes
  • 📚 Documentation update
  • 🏃 Workflow Changes

Pre-flight Checklist

  • Changes are limited to a single feature, bugfix or chore (split larger changes into separate PRs)
  • Tests are passing (npm test) and code is formatted and linted (npm run format && npm run lint)
  • I have created a changeset using npm run changeset (required for user-facing changes)
  • I have reviewed contributor guidelines

Screenshots

Additional Notes


Important

Enhance Gemini and Vertex API integration with improved caching strategy, accurate cost calculation, and updated UI components.

  • Caching and Cost Calculation:
    • Implement task-based caching strategy in GeminiHandler in gemini.ts, creating a single cache per task and reusing it for subsequent turns.
    • Introduce updateCacheContent() to handle cache updates and deletions, ensuring non-blocking operations.
    • Calculate immediate and ongoing costs separately, using calculateCost() and getTaskOngoingCosts().
  • API Integration:
    • Add GeminiHandler to VertexHandler in vertex.ts for handling Gemini models.
    • Update ApiHandlerOptions in api.ts to include new configuration options for Gemini and Vertex.
  • UI Enhancements:
    • Update TaskHeader.tsx to display cache and cost information.
    • Modify ApiOptions.tsx to include new settings for Gemini and Vertex APIs.
  • Miscellaneous:
    • Add node-cache dependency in package.json for caching support.
    • Update model pricing and tier information in api.ts to reflect new cost structures.
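The immediate/ongoing cost split can be illustrated with a rough sketch. The prices below are placeholders for illustration only (the real figures live in api.ts), and the function names mirror but do not reproduce the actual implementation.

```typescript
// Placeholder per-million-token prices (NOT the real values from api.ts).
const PRICES = {
  input: 1.25,              // $ per 1M input tokens
  output: 5.0,              // $ per 1M output tokens
  cacheRead: 0.31,          // $ per 1M tokens read from cache
  cacheStoragePerHour: 4.5, // $ per 1M cached tokens per hour of storage
};
const MILLION = 1_000_000;

// Immediate, per-message cost: input + output + cache reads.
function calculateCost(
  inputTokens: number,
  outputTokens: number,
  cacheReadTokens: number,
): number {
  return (
    (inputTokens * PRICES.input +
      outputTokens * PRICES.output +
      cacheReadTokens * PRICES.cacheRead) /
    MILLION
  );
}

// Ongoing cost: storage for all tokens currently tracked for the task's cache.
function getTaskOngoingCosts(taskCacheTokens: number, hoursHeld: number): number {
  return (taskCacheTokens / MILLION) * PRICES.cacheStoragePerHour * hoursHeld;
}
```

At the assumed $4.50 per 1M tokens per hour storage rate, a million cached tokens held for 15 minutes costs $1.125, which matches the figure discussed in the review conversation.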

This description was created by Ellipsis for f3c07f3.

@changeset-bot commented Apr 29, 2025

🦋 Changeset detected

Latest commit: 8de8bf1

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package:

Name        Type
claude-dev  Minor


github-actions bot (Contributor) commented Apr 29, 2025

Coverage Report

Extension Coverage

Base branch: 45%

PR branch: 46%

✅ Coverage increased or remained the same

Webview Coverage

Base branch: 11%

PR branch: 11%

✅ Coverage increased or remained the same

Overall Assessment

Test coverage has been maintained or improved

Last updated: 2025-05-02T04:42:45.827765

@pashpashpash pashpashpash marked this pull request as ready for review May 1, 2025 02:12
@ellipsis-dev bot (Contributor) left a comment


Caution

Changes requested ❌

Reviewed everything up to e0db81c in 2 minutes and 46 seconds.
  • Reviewed 1190 lines of code in 7 files
  • Skipped 1 file when reviewing.
  • Skipped posting 8 draft comments. View those below.
1. src/api/providers/gemini.ts:96
  • Draft comment:
    Consider using async/await for the cache creation flow instead of the promise chain for improved readability.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% This is a code quality suggestion that could improve readability. However, this is background processing code that intentionally doesn't await completion. The promise chain actually makes the asynchronous, non-blocking nature of this code more explicit. Using async/await might make it less clear that this is intentionally not awaited. The comment assumes async/await is always more readable, but in this case the promise chain better expresses the intent of background processing. The current implementation clearly shows the flow of background operations. While async/await can improve readability in many cases, here the promise chain better communicates the non-blocking nature of the cache creation process. The comment should be deleted as the current promise chain implementation is actually more appropriate for this background processing code.
2. src/utils/cost.ts:14
  • Draft comment:
    Ensure the default prices and tiered adjustments are well documented; returning undefined when inputPrice or outputPrice is missing may lead to silent failures.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 50% <= threshold 50% The comment is suggesting that returning undefined could lead to silent failures, which is a valid concern. However, it also asks to ensure documentation, which is not allowed. I need to determine if the comment is more about the potential issue or the documentation.
3. webview-ui/src/components/settings/ApiOptions.tsx:230
  • Draft comment:
    The createDropdown function is large and tightly coupled; consider extracting it into a separate component to improve readability and maintainability.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
4. webview-ui/src/components/settings/ApiOptions.tsx:1815
  • Draft comment:
    Ensure that callback URLs in getOpenRouterAuthUrl are URL-encoded to avoid issues with special characters in the uriScheme.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
5. src/api/providers/gemini.ts:67
  • Draft comment:
    The caching eligibility check uses a multiplier of 4 on CONTEXT_CACHE_TOKEN_MINIMUM. It isn’t immediately clear why 4 is chosen. Consider adding a brief comment explaining the rationale behind this multiplier.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% While the comment raises a valid point about documentation, this seems like a minor implementation detail. The code already has good comments explaining the high-level purpose of the cache eligibility check. The '4' multiplier appears to be a tuning parameter based on the character-per-token estimation used elsewhere in the code. Adding another comment here would add noise without significant value. I might be undervaluing the importance of documenting magic numbers. Future maintainers might waste time trying to understand why 4 was chosen. The relationship to the 4 characters-per-token estimation is clear from the codebase context, and the exact multiplier value isn't critical - it's just a rough heuristic for when caching might be worthwhile. The comment should be deleted. The multiplier is a tuning parameter whose exact value isn't critical, and its relationship to token estimation is clear from context.
6. src/api/providers/vertex.ts:28
  • Draft comment:
    For model IDs that include 'claude', the implementation is left as a comment. Consider adding a TODO or placeholder to explicitly handle or warn about this incomplete branch.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50% The comment appears to be incorrect in its premise. The code shows clear signs of intentional refactoring where Gemini handling is extracted while Claude handling remains. The "..." comment is likely a placeholder in the diff to avoid showing unchanged code. This is not a case of missing implementation, but rather unchanged implementation. I could be wrong about the "..." - maybe it really does mean the implementation is missing. The comment could be pointing out a real issue. The surrounding context strongly suggests this is just a diff simplification - the comments on lines 29-30 explicitly state "Your existing Claude implementation remains unchanged" and "This preserves your current Claude-specific functionality". The comment should be deleted as it incorrectly assumes missing implementation where the code is actually just preserving existing functionality.
7. webview-ui/src/components/settings/ApiOptions.tsx:1944
  • Draft comment:
    The 'Supports browser use' item uses the same 'supportsImages' property as 'Supports Images'. If these represent distinct features, consider introducing a separate flag to avoid confusion.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
8. src/utils/cost.ts:26
  • Draft comment:
    The tiered pricing lookup in calculateApiCostInternal relies on tiers sorted by contextWindow. While the fallback uses the last tier, consider adding a brief comment to clarify that the highest tier should use Infinity as the upper limit for robust matching.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.

Workflow ID: wflow_vhZhppHNlCHqwb5F


@pashpashpash changed the title from "gemini prompt caching (wip)" to "gemini prompt caching" on May 1, 2025
@ellipsis-dev bot (Contributor) left a comment


Caution

Changes requested ❌

Reviewed 6f1a41f in 2 minutes and 11 seconds.
  • Reviewed 235 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 draft comments. View those below.
1. src/api/providers/vertex.ts:45
  • Draft comment:
    Consider adding an explicit type annotation for the 'stream' variable to improve clarity and type safety.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% Adding an explicit 'any' type would actually reduce type safety by opting out of TypeScript's type inference. The stream variable's type should be correctly inferred from the AnthropicVertex SDK's return type. Adding 'any' would hide potential type errors and make the code less safe. Perhaps there's a specific reason why type inference isn't working well here that I'm missing? Maybe the SDK types are incomplete? Even if there are SDK type issues, explicitly typing as 'any' would be worse than letting TypeScript infer what it can. If there are type issues, they should be fixed in the SDK or by using a more specific type. The comment should be deleted as it suggests adding an 'any' type which would reduce type safety rather than improve it.
2. src/api/providers/vertex.ts:74
  • Draft comment:
    The message mapping logic for ephemeral cache control (lines 74–112) is very similar to the one in the default case (lines 133–144). Consider refactoring this into a helper function to reduce duplication.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
3. src/api/providers/vertex.ts:43
  • Draft comment:
    Variable 'budget_tokens' is written in snake_case while most other variables use camelCase. Confirm if this naming is intentional (e.g. matching API expectations) or consider standardizing for consistency.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
4. src/api/providers/vertex.ts:170
  • Draft comment:
    In the 'message_stop' case, only a break from the switch is used. If 'message_stop' signifies the end of the stream, consider exiting the loop to prevent processing any further chunks.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None

Workflow ID: wflow_lDunSZLwIJ6IXIsq


@pashpashpash pashpashpash requested a review from saoudrizwan May 1, 2025 03:16
@nischalsrinivas left a comment
This is a great improvement. I like how cache information is shown; it will be tremendously useful to see how the cache grows over time.

Enhanced caching system for GeminiHandler in gemini.ts using node-cache with a default TTL of 15 minutes

But I'm curious: can the default TTL be configured?

@arafatkatze (Contributor) left a comment


This algorithm is not the best way to go about caching.

@joaomj commented May 1, 2025

Cache information will be shown when using what providers? OpenRouter, Gemini?

@arafatkatze (Contributor) commented May 1, 2025

@pashpashpash Here's the problem and the fix.

At the moment every user turn triggers two separate network calls:

  • models.generateContentStream – cheap, because you only send the delta.
  • caches.create – expensive, because you re-upload the entire context.

Because caches.create is executed on every turn, the full prompt is written to the cache over and over again. With the default TTL of 15 minutes this costs roughly $4.50 per million tokens per hour, i.e. $1.125 for every million tokens kept for 15 minutes—a charge that the user never sees in the normal token accounting.

Recommended changes

  • One cache per task ID
    • When the task starts, call caches.create once if the cache doesn't exist.
    • On subsequent turns, if the cache exists, update that same cache instead of creating a new one.
    • If the cache no longer exists because its TTL expired, just create a new one.

Stable cache key

  • Make sure cacheKey is a stable conversation / task identifier (e.g. taskId). If you fall back to Date.now() the key changes every call and the cache is never reused.

Cost visibility

We may never get a perfect breakdown from the API, but we can:
  • Count cacheWriteTokens and cacheReadTokens separately.
  • Apply the published $/1M prices for cache writes and reads.
That at least prevents us from grossly misrepresenting the costs.

The net effect: we create the cache once, keep extending it, avoid dozens of unused caches, and eliminate the hidden double spend on every turn.
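The recommended flow reduces to a get-or-create keyed by a stable task ID. A minimal sketch under those assumptions follows; the function names and callbacks are illustrative, not the real handler API.

```typescript
// taskId -> cache name; a stable key means the cache survives across turns.
const taskCaches = new Map<string, string>();

async function getOrCreateCache(
  taskId: string, // must be stable per conversation; never Date.now()
  createCache: (taskId: string) => Promise<string>,
  updateCache: (cacheName: string) => Promise<void>,
): Promise<string> {
  const existing = taskCaches.get(taskId);
  if (existing) {
    // Subsequent turn: extend the existing cache instead of creating a new one.
    await updateCache(existing);
    return existing;
  }
  // First turn, or the previous cache expired via TTL: create once.
  const name = await createCache(taskId);
  taskCaches.set(taskId, name);
  return name;
}
```

With this shape, caches.create runs once per task instead of once per turn, which removes the repeated full-context upload described above.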

@arafatkatze force-pushed the pashpashpash/gemini-prompt-caching branch from d871446 to 2024be7 on May 2, 2025 03:44
@arafatkatze force-pushed the pashpashpash/gemini-prompt-caching branch from 2024be7 to f3c07f3 on May 2, 2025 03:44
arafatkatze previously approved these changes May 2, 2025

@arafatkatze (Contributor) left a comment


This PR works, but @pashpashpash and I would still like to be vigilant.

We have tried our best to surface the ongoing cost of cache usage and to make the best use of the available cache to save on costs; given how often we hit the same parts of the cache over and over, this PR makes sense.

This PR has been tested on Vertex and Gemini for caching.

@dcbartlett (Contributor) left a comment


I'm going to review this more when I have access to my computer. The changes commented on should happen for now; I might have more later.

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
@pashpashpash pashpashpash merged commit 61d2f42 into main May 2, 2025
7 checks passed
@github-actions github-actions bot mentioned this pull request Dec 21, 2025
7 participants