fix: return ContextLengthExceeded when prompt exceeds effective KV cache size#7815

Merged
DOsinga merged 1 commit into main from fix/local-inference-kv-cache-context-overflow
Mar 11, 2026
Conversation

@DOsinga DOsinga commented Mar 11, 2026

Problem

When using local inference with llama.cpp, if the prompt token count exceeds the effective context size (which can be capped by context_limit, n_ctx_train, or available memory), the KV cache is allocated with fewer slots than the prompt requires. Attempting to prefill more tokens than the cache can hold causes llama.cpp to return an opaque error:

Execution error: Prefill decode failed: Decode Error 1: NoKvCacheSlot

This happens because validate_and_compute_context only checked the prompt against the memory-based limit, but not against the final effective_ctx value. For example, with prompt_token_count = 5000 and effective_ctx = 4096, a 4096-token KV cache is created and then 5000 tokens are fed into it — failing around token 4096.
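To make the overflow concrete, here is a minimal sketch of how an effective context size might be derived as the smallest of the three caps mentioned above. The function and variable names are illustrative, not the actual goose code:

```rust
// Hypothetical helper: the effective context is the smallest of the
// user-configured limit, the model's trained context, and a memory-based cap.
fn effective_ctx(context_limit: usize, n_ctx_train: usize, memory_cap: usize) -> usize {
    context_limit.min(n_ctx_train).min(memory_cap)
}

fn main() {
    // A 5000-token prompt against a 4096-slot KV cache: without a guard,
    // prefill runs out of slots around token 4096 (NoKvCacheSlot).
    let prompt_token_count = 5000;
    let ctx = effective_ctx(8192, 4096, 6000);
    assert_eq!(ctx, 4096);
    assert!(prompt_token_count > ctx);
    println!("effective_ctx = {}", ctx);
}
```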

Fix

Add a guard in validate_and_compute_context that checks prompt_token_count >= effective_ctx and returns a clear ContextLengthExceeded error with an actionable message, instead of letting the decode fail with an opaque KV cache error.
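A minimal sketch of that guard, assuming a simplified error type; the enum shape and message wording are illustrative rather than the exact goose implementation:

```rust
// Hypothetical error type standing in for the real ContextLengthExceeded variant.
#[derive(Debug, PartialEq)]
enum InferenceError {
    ContextLengthExceeded { prompt_tokens: usize, effective_ctx: usize },
}

// Guard: reject the prompt up front instead of letting prefill exhaust
// the KV cache and surface an opaque NoKvCacheSlot decode error.
fn validate_prompt_fits(
    prompt_token_count: usize,
    effective_ctx: usize,
) -> Result<(), InferenceError> {
    if prompt_token_count >= effective_ctx {
        return Err(InferenceError::ContextLengthExceeded {
            prompt_tokens: prompt_token_count,
            effective_ctx,
        });
    }
    Ok(())
}

fn main() {
    assert!(validate_prompt_fits(1000, 4096).is_ok());
    let err = validate_prompt_fits(5000, 4096).unwrap_err();
    let InferenceError::ContextLengthExceeded { prompt_tokens, effective_ctx } = err;
    // An actionable message the caller can surface to the user.
    println!(
        "prompt has {} tokens but the effective context is {}; \
         raise context_limit or shorten the prompt",
        prompt_tokens, effective_ctx
    );
}
```

Note the check uses `>=` rather than `>`, matching the PR: generation needs at least one free slot beyond the prompt.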

Testing

  • All existing inference_engine unit tests pass
  • cargo clippy --all-targets -- -D warnings clean
  • cargo fmt clean

…che size

When the prompt token count exceeds the effective context size (capped by
context_limit, n_ctx_train, or available memory), the KV cache is allocated
with fewer slots than needed. Attempting to prefill more tokens than the
cache can hold causes llama.cpp to return a NoKvCacheSlot error.

Add a guard in validate_and_compute_context that checks prompt_token_count
against effective_ctx and returns a clear ContextLengthExceeded error
instead of letting the decode fail with an opaque KV cache error.
@DOsinga DOsinga enabled auto-merge March 11, 2026 19:10
@DOsinga DOsinga added this pull request to the merge queue Mar 11, 2026
Merged via the queue into main with commit f462d73 Mar 11, 2026
20 checks passed
@DOsinga DOsinga deleted the fix/local-inference-kv-cache-context-overflow branch March 11, 2026 19:21
lifeizhou-ap added a commit that referenced this pull request Mar 12, 2026
* main: (270 commits)
  test(acp): align provider and server test parity (#7822)
  fix(acp): register MCP extensions when resuming a session (#7806)
  fix(goose): load .gitignore in prompt_manager for hint file filtering (#7795)
  fix: remap max_completion_tokens to max_tokens for OpenAI-compatible providers (#7765)
  fix(openai): preserve Responses API tool call/output linkage (#7759)
  chore(deps): bump @hono/node-server from 1.19.9 to 1.19.11 in /evals/open-model-gym/mcp-harness (#7687)
  fix: return ContextLengthExceeded when prompt exceeds effective KV cache size (#7815)
  feat: MCP Roots support (#7790)
  fix(google): use `includeThoughts/part.thought` for thinking handling (#7593)
  refactor: simplify tokenizer initialization — remove unnecessary Result wrapper (#7744)
  Fix model selector showing wrong model in tabs (#7784)
  Stop collecting goosed stderr after startup (#7814)
  fix: avoid word splitting by space for windows shell commands (#7781) (#7810)
  Simplify and make it not break on linux (#7813)
  Add preferred microphone selection  (#7805)
  Remove dependency on posthog-rs (#7811)
  feat: load hints in nested subdirs (#7772)
  feat(acp): add read tool and delegate filesystem I/O to ACP clients (#7668)
  Support secret interpolation in streamable HTTP extension URLs (#7782)
  More logging for command injection classifier model training (#7779)
  ...
