
feat(moonshot): explicit context caching for moonshot-v1 via /v1/caching API#25104

Closed
Elarwei001 wants to merge 21 commits into openclaw:main from Elarwei001:feat/moonshot-context-cache

Conversation

Contributor

@Elarwei001 Elarwei001 commented Feb 24, 2026

Summary

Adds explicit context caching support for moonshot-v1-* models via Moonshot's /v1/caching API.

Note: This PR depends on #25436 (K2 cache stats fix) for the usage.ts changes.

Background

Moonshot has two caching mechanisms:

| Model family | Caching mechanism | This PR |
| --- | --- | --- |
| `moonshot-v1-*` | Explicit `/v1/caching` API | ✅ Adds wrapper |
| `kimi-k2.*` | Automatic prefix caching | N/A (handled by #25436) |

Changes

New Files

  • src/agents/moonshot-cache.ts (~280 lines): Cache wrapper with:

    • Session-based cache storage with content hash invalidation
    • Async IIFE pattern to resolve cache before streaming
    • FIFO eviction (max 1000 entries) to prevent memory leak
    • Inflight promise coalescing to avoid duplicate cache creation
  • src/agents/moonshot-cache.test.ts: 17 unit tests

  • test/moonshot-cache.e2e.test.ts: E2E tests (requires API key)

Modified Files

  • src/agents/pi-embedded-runner/extra-params.ts: Integration point
  • src/agents/pi-embedded-runner/run/attempt.ts: Pass sessionKey

Configuration

```yaml
agents:
  defaults:
    models:
      moonshot/moonshot-v1-32k:
        params:
          contextCache:
            enabled: true
            ttl: 3600
```

Token Savings

| Component | Without cache | With cache |
| --- | --- | --- |
| System prompt | ~2000 tokens | Cached |
| Tool definitions | ~3000 tokens | Cached |
| Per-request savings | — | ~80% |

Design Doc

https://github.com/Elarwei001/research_openclaw/blob/main/proposals/kimi-context-cache.md

Implements lazy context caching for Moonshot/Kimi models using the /v1/caching API.

Features:
- Cache system prompts and tool definitions to reduce token usage
- Automatic cache invalidation when content hash changes
- Inflight promise coalescing to prevent duplicate cache creation
- TTL auto-renewal via reset_ttl on each request
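The cache-role injection listed above can be sketched as follows, assuming the wire format the author demonstrates later in this thread (`role: "cache"` with `cache_id` and `reset_ttl` in the content string). The `ChatMessage` shape here is a simplified stand-in, not pi-ai's actual message type:

```typescript
interface ChatMessage {
  role: string;
  content: unknown;
}

// Replace the system message with a cache reference. The cached content on
// Moonshot's side already contains the system prompt (and tool definitions),
// so the local system message is dropped and a cache role is prepended.
function injectCacheRole(
  messages: ChatMessage[],
  cacheId: string,
  resetTtl?: number,
): ChatMessage[] {
  const content =
    resetTtl !== undefined
      ? `cache_id=${cacheId};reset_ttl=${resetTtl}` // reset_ttl renews the TTL on each request
      : `cache_id=${cacheId}`;
  const rest = messages.filter((m) => m.role !== "system");
  return [{ role: "cache", content }, ...rest];
}
```

This mirrors the request shape shown in the author's API tests below; the PR's real `injectCacheRole` may differ in details.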

Configuration:
```yaml
agents:
  defaults:
    models:
      moonshot/kimi-k2-turbo:
        params:
          contextCache:
            enabled: true
            ttl: 3600
            resetTtl: 3600
```

Closes openclaw#7073

Co-authored-by: Elar Wei <[email protected]>
@openclaw-barnacle openclaw-barnacle bot added agents Agent runtime and tooling size: L labels Feb 24, 2026
Contributor

@greptile-apps greptile-apps bot left a comment


3 files reviewed, 3 comments


Comment on lines +589 to +622
```typescript
onPayload: async (payload) => {
  if (payload && typeof payload === "object") {
    const payloadObj = payload as Record<string, unknown>;
    const payloadMessages = payloadObj.messages;

    if (Array.isArray(payloadMessages)) {
      try {
        const cacheId = await getOrCreateCache({
          sessionKey,
          apiKey,
          baseUrl,
          model: modelId,
          system: systemContent,
          tools,
          ttl,
        });

        // Replace messages with cache-injected version
        payloadObj.messages = injectCacheRole(
          payloadMessages as Array<{ role: string; content: unknown }>,
          cacheId,
          resetTtl,
        );

        log.debug(`[moonshot-cache] Injected cache ${cacheId} for session ${sessionKey}`);
      } catch (err) {
        // On cache error, fall back to normal request
        log.warn(`[moonshot-cache] Cache error, falling back: ${String(err)}`);
      }
    }
  }
  originalOnPayload?.(payload);
},
});
```
Contributor


Async onPayload will not work — cache injection is a no-op

onPayload is called synchronously by pi-ai and every existing caller in this codebase (options?.onPayload?.(payload) — no await). Making onPayload async means the await getOrCreateCache(...) runs in a detached microtask. By the time the cache ID is retrieved and payloadObj.messages is mutated, the request body has already been serialized and sent to the API.

This effectively makes the entire caching feature a no-op in production — the message array will never contain the cache role when the HTTP request is dispatched.

Every other onPayload wrapper in this file (createOpenAIResponsesStoreWrapper, createOpenRouterWrapper, createZaiToolStreamWrapper, createOpenRouterSystemCacheWrapper) is synchronous for exactly this reason.

To fix this, the cache should be resolved before calling the underlying streamFn, and the cache role should be injected into context.messages synchronously — not inside onPayload. The wrapper's outer function should be async, resolve the cache up front, mutate the context messages, and then call the underlying stream function with the modified context. This would require verifying that pi-ai supports an async StreamFn return type.



@SamuelHinestrosa SamuelHinestrosa left a comment


Automated review: Changes reviewed. Please make sure the tests pass.

- Fix async onPayload race condition by resolving cache BEFORE streaming
- Add MAX_CACHE_SIZE (1000) with FIFO eviction to prevent memory leaks
- Clear local cache entry immediately after remote deletion to prevent stale entries
@openclaw-barnacle openclaw-barnacle bot added the scripts Repository scripts label Feb 24, 2026
- Use async IIFE instead of async generator (StreamFn supports Promise return)
- Cast modified messages to context.messages type to avoid pi-ai Message mismatch
The caching API requires base model names (e.g., 'moonshot-v1') without
context-length suffixes. Add toCacheModelName() to handle the mapping.
Moonshot/Kimi returns 'cached_tokens' instead of 'cache_read_input_tokens'.
This ensures OpenClaw's cacheRead field is properly populated when using
Moonshot context caching.
… invasion

Reduces extra-params.ts changes from ~135 lines to ~20 lines.
All Moonshot-specific logic is now isolated in moonshot-cache.ts.
- Remove unused clearSessionCache function
- Simplify comments and remove section dividers
- Inline simple functions
- Reduce verbosity while maintaining functionality
@Elarwei001
Contributor Author

@greptile-apps please re-review - code has been refactored to address previous feedback:

  • Fixed async onPayload race condition (now resolves cache before streaming)
  • Added MAX_CACHE_SIZE with FIFO eviction
  • Clear local entry before remote deletion to prevent stale entries
  • Moved wrapper to moonshot-cache.ts to reduce extra-params.ts invasion (135 → 20 lines)
  • Simplified code (434 → 325 lines)

@Elarwei001
Contributor Author

Elarwei001 commented Feb 24, 2026

AI-Assisted Testing Results

Tested with a real Moonshot API account to verify the caching mechanism works correctly.

Test 1: Baseline (without cache)

```
POST /v1/chat/completions
{
  "model": "moonshot-v1-32k",
  "messages": [
    {"role": "system", "content": "<system prompt>"},
    {"role": "user", "content": "What is 2+2?"}
  ]
}
```

Response usage:

```json
{
  "prompt_tokens": 48,
  "completion_tokens": 8,
  "total_tokens": 56
}
```

Test 2: With cache

```
# Step 1: Create cache
POST /v1/caching
{
  "model": "moonshot-v1",
  "messages": [{"role": "system", "content": "<system prompt>"}],
  "ttl": 300
}
# Returns: {"id": "cache-xxx", "tokens": 19, "status": "pending"}

# Step 2: Query with cache
POST /v1/chat/completions
{
  "model": "moonshot-v1-32k",
  "messages": [
    {"role": "cache", "content": "cache_id=cache-xxx;reset_ttl=300"},
    {"role": "user", "content": "What is 2+2?"}
  ]
}
```

Response usage:

```json
{
  "prompt_tokens": 30,
  "completion_tokens": 9,
  "total_tokens": 39,
  "cached_tokens": 19,
  "prompt_tokens_details": {
    "cached_tokens": 19
  }
}
```

Key Findings

| Metric | Without cache | With cache |
| --- | --- | --- |
| `prompt_tokens` | 48 | 30 |
| `cached_tokens` | — | 19 |
  1. Cache creation works: API returns cache ID and token count
  2. Cache injection works: role: "cache" is accepted by the API
  3. Usage reflects caching: cached_tokens field shows cache hit count
  4. Token savings confirmed: prompt_tokens reduced when using cache

Model Name Mapping

Discovered that the caching API requires base model names:

  • moonshot-v1-32k → moonshot-v1 (for cache creation)
  • Query can still use moonshot-v1-32k

This is handled by toCacheModelName() in the implementation.
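Given the findings above (and the author's later note that only the moonshot-v1 family supports `/v1/caching`), the mapping might look roughly like this. This is a hypothetical sketch, not the PR's actual `toCacheModelName`:

```typescript
// Map a chat model ID to the base model name required by the /v1/caching API,
// or undefined if the model family does not support explicit caching.
function toCacheModelName(modelId: string): string | undefined {
  if (modelId === "moonshot-v1" || modelId.startsWith("moonshot-v1-")) {
    return "moonshot-v1"; // strip context-length suffixes like -8k/-32k/-128k
  }
  // kimi-k2.*, kimi-latest, etc. use automatic prefix caching instead;
  // per the author's testing they return 400 "model family is invalid"
  // when passed to /v1/caching.
  return undefined;
}
```

A caller would then skip cache creation entirely when the function returns `undefined`, while still sending the full model ID (e.g. `moonshot-v1-32k`) on the chat completion request.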

Contributor

@greptile-apps greptile-apps bot left a comment


6 files reviewed, 2 comments


Comment on lines +296 to +331
```typescript
return (async () => {
  let modifiedContext = context;

  try {
    const cacheId = await getOrCreateCache({
      sessionKey,
      apiKey,
      baseUrl,
      model: modelId,
      system: systemContent,
      tools,
      ttl,
    });

    // Inject cache role into messages, replacing system message.
    // Cast to context.messages type to avoid pi-ai's stricter Message type.
    const modifiedMessages = injectCacheRole(
      messages,
      cacheId,
      resetTtl,
    ) as unknown as typeof context.messages;

    modifiedContext = {
      ...context,
      messages: modifiedMessages,
    };

    log.debug(`[moonshot-cache] Injected cache ${cacheId} for session ${sessionKey}`);
  } catch (err) {
    // On cache error, fall back to normal request with original context
    log.warn(`[moonshot-cache] Cache error, falling back: ${String(err)}`);
  }

  // Delegate to underlying stream with (possibly modified) context
  return baseStreamFn(model, modifiedContext, options);
})();
```
Contributor


Async IIFE returns Promise — incompatible with downstream wrappers

This returns Promise<EventStream> instead of EventStream. Every other StreamFn wrapper in the codebase (createOpenRouterWrapper, createBedrockNoCacheWrapper, createOpenAIResponsesStoreWrapper, etc.) returns the stream synchronously.

After createMoonshotCacheWrapper, createOpenAIResponsesStoreWrapper is applied at extra-params.ts:598. That wrapper calls underlying(model, context, options) and may attempt to use the return value as an EventStream rather than awaiting a Promise. Additionally, the wrappers installed in attempt.ts (lines 760, 786) pass through inner(model, context, options) without await, propagating the Promise up to pi-ai's agent loop.

Whether this actually causes a runtime failure depends on how @mariozechner/pi-ai consumes the streamFn return value internally. If it awaits or handles thenables, this works. If it expects a synchronous EventStream (like createOllamaStreamFn returns via createAssistantMessageEventStream()), this will fail at runtime.

The safer pattern used by createOllamaStreamFn is to create the event stream synchronously, queue the async work via queueMicrotask, and return the stream immediately.
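The synchronous-return pattern the reviewer describes can be illustrated with a minimal sketch. `SimpleEventStream` and its `push`/`end`/`next` API are simplified stand-ins here, not pi-ai's actual `EventStream` or `createAssistantMessageEventStream`:

```typescript
interface StreamEvent {
  type: string;
  data?: unknown;
}

// A minimal pull-based event stream: consumers await next(), producers push().
class SimpleEventStream {
  private queue: StreamEvent[] = [];
  private done = false;
  private waiters: Array<(e: StreamEvent | undefined) => void> = [];
  push(e: StreamEvent): void {
    const w = this.waiters.shift();
    if (w) w(e);
    else this.queue.push(e);
  }
  end(): void {
    this.done = true;
    for (const w of this.waiters.splice(0)) w(undefined);
  }
  next(): Promise<StreamEvent | undefined> {
    if (this.queue.length > 0) return Promise.resolve(this.queue.shift());
    if (this.done) return Promise.resolve(undefined);
    return new Promise((resolve) => this.waiters.push(resolve));
  }
}

// The key point: the stream object is created and returned synchronously,
// while the async cache resolution is deferred via queueMicrotask.
function wrappedStreamFn(resolveCache: () => Promise<string>): SimpleEventStream {
  const stream = new SimpleEventStream();
  queueMicrotask(async () => {
    try {
      const cacheId = await resolveCache();
      stream.push({ type: "cache_resolved", data: cacheId });
    } catch {
      stream.push({ type: "cache_skipped" }); // fall back without caching
    }
    stream.end(); // the real wrapper would forward the underlying model stream here
  });
  return stream; // caller gets an EventStream immediately, not a Promise
}
```

With this shape, downstream wrappers that treat the return value as a synchronous `EventStream` keep working, since no `Promise<EventStream>` ever escapes the wrapper.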


sessionKey is not part of SimpleStreamOptions from pi-ai, so it was always
undefined when read from options. Now sessionKey is passed as a parameter
to createMoonshotCacheWrapper and captured via closure.

Fixes Greptile review comment about caching always being skipped.
Tests verify:
- Cache role is injected when sessionKey is provided via closure
- Caching is skipped when apiKey is missing
- Graceful fallback on cache creation error

This ensures the OpenClaw integration path works correctly,
not just direct API calls.
Tests require MOONSHOT_API_KEY env var and make real API calls:
- creates cache and returns cache_id
- returns same cache_id for same content (cache hit)
- creates new cache when content changes (cache invalidation)
- chat completion with cache shows reduced prompt_tokens
- wrapper integration modifies context correctly

Tests are skipped in CI when API key is not available.

Run locally with:
  MOONSHOT_API_KEY=sk-xxx pnpm vitest run --config vitest.e2e.config.ts test/agents/moonshot-cache.e2e.test.ts
@Elarwei001
Contributor Author

@greptile-apps please re-review - all previous issues have been addressed:

  1. Async onPayload race condition - Fixed by using async IIFE that resolves cache before streaming
  2. Unbounded maps memory leak - Added MAX_CACHE_SIZE (1000) with FIFO eviction
  3. Stale cacheStore entry - Clear local entry before remote deletion
  4. sessionKey not in options - Now passed via closure at wrapper creation time
  5. Async IIFE Promise return - StreamFn type allows Promise<stream>, verified working

Additional improvements:

  • Added real e2e tests (run with MOONSHOT_API_KEY=xxx)
  • Reduced code invasion: extra-params.ts only +20 lines
  • All CI checks passing

Contributor

@greptile-apps greptile-apps bot left a comment


8 files reviewed, 4 comments


In src/agents/moonshot-cache.ts (line 161):

```typescript
// Delete old cache if exists and clear local entry immediately
// This prevents stale entries if createCache fails after deletion
if (existing) {
  cacheStore.delete(params.sessionKey);
```
Contributor


Fixed stale cache issue by deleting local entry immediately after invalidation detection, before attempting remote deletion and recreation


Comment on lines +294 to +329
```typescript
return (async () => {
  let modifiedContext = context;

  try {
    const cacheId = await getOrCreateCache({
      sessionKey,
      apiKey,
      baseUrl,
      model: modelId,
      system: systemContent,
      tools,
      ttl,
    });

    // Inject cache role into messages, replacing system message.
    // Cast to context.messages type to avoid pi-ai's stricter Message type.
    const modifiedMessages = injectCacheRole(
      messages,
      cacheId,
      resetTtl,
    ) as unknown as typeof context.messages;

    modifiedContext = {
      ...context,
      messages: modifiedMessages,
    };

    log.debug(`[moonshot-cache] Injected cache ${cacheId} for session ${sessionKey}`);
  } catch (err) {
    // On cache error, fall back to normal request with original context
    log.warn(`[moonshot-cache] Cache error, falling back: ${String(err)}`);
  }

  // Delegate to underlying stream with (possibly modified) context
  return baseStreamFn(model, modifiedContext, options);
})();
```
Contributor


Resolved cache asynchronously before calling baseStreamFn to avoid the async onPayload race condition from the previous implementation. The async IIFE returns Promise<EventStream>, which differs from other synchronous wrappers in extra-params.ts (createOpenRouterWrapper, createBedrockNoCacheWrapper). Verify this pattern is supported by @mariozechner/pi-ai in production - the e2e tests pass but this should be validated in real gateway scenarios.


Contributor Author


The async IIFE returning Promise is supported by the StreamFn type definition in @mariozechner/pi-agent-core:

```typescript
export type StreamFn = (...) => ReturnType<typeof streamSimple> | Promise<ReturnType<typeof streamSimple>>;
```

This pattern is explicitly designed for async config lookup scenarios like our cache resolution, so it should be fine.

Comment on lines +588 to +593
```typescript
agent.streamFn = createMoonshotCacheWrapper(
  agent.streamFn ?? streamSimple,
  moonshotCacheConfig!,
  modelId,
  sessionKey,
);
```
Contributor


Fixed previous issue where sessionKey was unavailable - now passed via closure to createMoonshotCacheWrapper from attempt.ts:744 instead of being extracted from options at call time


@greptile-apps
Contributor

greptile-apps bot commented Feb 24, 2026

Additional Comments (1)

src/agents/usage.ts
Added Moonshot/Kimi's cached_tokens field to normalize cache read counts alongside the standard OpenAI cache_read_input_tokens


@openclaw-barnacle openclaw-barnacle bot removed scripts Repository scripts size: XL labels Feb 24, 2026
- Handle moonshot-v1-* → moonshot-v1
- Handle kimi-k2-* → kimi-k2
- Pass through kimi-k2.5, kimi-latest as-is
- Add toCacheModelName tests
Tested with real API - kimi-k2.5, kimi-k2, kimi-latest all return
400 'model family is invalid'. Only moonshot-v1 supports caching.

- toCacheModelName returns undefined for unsupported models
- Wrapper skips caching entirely for unsupported models
- Updated tests to reflect actual API behavior
@Elarwei001 Elarwei001 changed the title from "feat(moonshot): add context caching support" to "WIP: feat(moonshot): context caching via /v1/caching API" on Feb 24, 2026
Kimi K2 models use automatic prefix caching (like Anthropic) and return
cached_tokens in prompt_tokens_details field. This differs from moonshot-v1
which requires explicit /v1/caching API.

- Add prompt_tokens_details.cached_tokens to UsageLike type
- Update normalizeUsage to extract nested cached_tokens
- Add test for K2 usage format
Better reflects the function's purpose: checking if a model requires
explicit /v1/caching API calls vs automatic prefix caching (K2).
- Hide model name implementation detail (MOONSHOT_V1_CACHE_MODEL is internal)
- Cleaner API: returns true/false instead of string/undefined
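The usage normalization these commits describe might look roughly like this. `UsageLike` here is a simplified stand-in for the real type in src/agents/usage.ts, and `extractCacheRead` is a hypothetical helper name:

```typescript
// Simplified stand-in for the UsageLike shape extended in this PR.
interface UsageLike {
  cache_read_input_tokens?: number; // standard OpenAI/Anthropic-style field
  cached_tokens?: number; // Moonshot top-level field (moonshot-v1 explicit caching)
  prompt_tokens_details?: { cached_tokens?: number }; // Kimi K2 nested field
}

// Normalize cache-read counts across the three provider formats, so
// OpenClaw's cacheRead field is populated regardless of which one is present.
function extractCacheRead(usage: UsageLike): number {
  return (
    usage.cache_read_input_tokens ??
    usage.cached_tokens ??
    usage.prompt_tokens_details?.cached_tokens ??
    0
  );
}
```

The precedence order (standard field first, then Moonshot's variants) is an assumption; the PR's actual `normalizeUsage` changes are only summarized in the commit messages above.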
@Elarwei001 Elarwei001 changed the title from "WIP: feat(moonshot): context caching via /v1/caching API" to "feat(moonshot): explicit context caching for moonshot-v1 via /v1/caching API" on Feb 24, 2026
@Elarwei001
Contributor Author

Closing this PR for now since moonshot-v1 is a legacy model (released Oct 2023) and most users should prefer the newer K2 series which has automatic prefix caching built-in.

If anyone has a specific need for explicit caching on moonshot-v1, feel free to cherry-pick from this branch or request to reopen this PR.

See the analysis in issue #7073 for details on the two different caching mechanisms.

@Elarwei001 Elarwei001 closed this Feb 24, 2026
