Conversation
fix: return ContextLengthExceeded when prompt exceeds effective KV cache size

When the prompt token count exceeds the effective context size (capped by `context_limit`, `n_ctx_train`, or available memory), the KV cache is allocated with fewer slots than needed. Attempting to prefill more tokens than the cache can hold causes llama.cpp to return a `NoKvCacheSlot` error. Add a guard in `validate_and_compute_context` that checks `prompt_token_count` against `effective_ctx` and returns a clear `ContextLengthExceeded` error instead of letting the decode fail with an opaque KV cache error.
jh-block approved these changes on Mar 11, 2026

lifeizhou-ap added a commit that referenced this pull request on Mar 12, 2026
* main: (270 commits)
  test(acp): align provider and server test parity (#7822)
  fix(acp): register MCP extensions when resuming a session (#7806)
  fix(goose): load .gitignore in prompt_manager for hint file filtering (#7795)
  fix: remap max_completion_tokens to max_tokens for OpenAI-compatible providers (#7765)
  fix(openai): preserve Responses API tool call/output linkage (#7759)
  chore(deps): bump @hono/node-server from 1.19.9 to 1.19.11 in /evals/open-model-gym/mcp-harness (#7687)
  fix: return ContextLengthExceeded when prompt exceeds effective KV cache size (#7815)
  feat: MCP Roots support (#7790)
  fix(google): use `includeThoughts/part.thought` for thinking handling (#7593)
  refactor: simplify tokenizer initialization — remove unnecessary Result wrapper (#7744)
  Fix model selector showing wrong model in tabs (#7784)
  Stop collecting goosed stderr after startup (#7814)
  fix: avoid word splitting by space for windows shell commands (#7781) (#7810)
  Simplify and make it not break on linux (#7813)
  Add preferred microphone selection (#7805)
  Remove dependency on posthog-rs (#7811)
  feat: load hints in nested subdirs (#7772)
  feat(acp): add read tool and delegate filesystem I/O to ACP clients (#7668)
  Support secret interpolation in streamable HTTP extension URLs (#7782)
  More logging for command injection classifier model training (#7779)
  ...
Problem
When using local inference with llama.cpp, if the prompt token count exceeds the effective context size (which can be capped by `context_limit`, `n_ctx_train`, or available memory), the KV cache is allocated with fewer slots than the prompt requires. Attempting to prefill more tokens than the cache can hold causes llama.cpp to return an opaque `NoKvCacheSlot` error.

This happens because `validate_and_compute_context` only checked the prompt against the memory-based limit, not against the final `effective_ctx` value. For example, with `prompt_token_count = 5000` and `effective_ctx = 4096`, a 4096-token KV cache is created and then 5000 tokens are fed into it, failing around token 4096.

Fix

Add a guard in `validate_and_compute_context` that checks `prompt_token_count >= effective_ctx` and returns a clear `ContextLengthExceeded` error with an actionable message, instead of letting the decode fail with an opaque KV cache error.
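For illustration, here is a minimal sketch of what such a guard could look like. The names `validate_and_compute_context`, `prompt_token_count`, `effective_ctx`, `context_limit`, `n_ctx_train`, and `ContextLengthExceeded` come from the description above; the function signature, the `InferenceError` enum, the `memory_based_limit` parameter, and the error message wording are assumptions for this sketch and will differ from the actual change.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; the real crate's error enum,
/// variants, and fields may differ.
#[derive(Debug)]
pub enum InferenceError {
    ContextLengthExceeded {
        prompt_tokens: usize,
        effective_ctx: usize,
    },
}

impl fmt::Display for InferenceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            InferenceError::ContextLengthExceeded { prompt_tokens, effective_ctx } => write!(
                f,
                "prompt is {prompt_tokens} tokens but the effective context window is \
                 {effective_ctx} tokens; shorten the prompt or raise context_limit"
            ),
        }
    }
}

/// Sketch of the guard: derive the effective context size from the configured
/// limits, then reject prompts that cannot fit before the KV cache is allocated.
pub fn validate_and_compute_context(
    prompt_token_count: usize,
    context_limit: Option<usize>,
    n_ctx_train: usize,
    memory_based_limit: usize, // assumed stand-in for the memory-derived cap
) -> Result<usize, InferenceError> {
    // effective_ctx is capped by the user-configured limit, the model's
    // training context, and whatever fits in available memory.
    let effective_ctx = context_limit
        .unwrap_or(usize::MAX)
        .min(n_ctx_train)
        .min(memory_based_limit);

    // New guard: a prompt that fills or overflows the KV cache can never be
    // prefilled, so fail early with an actionable error instead of letting
    // llama.cpp report NoKvCacheSlot partway through the decode.
    if prompt_token_count >= effective_ctx {
        return Err(InferenceError::ContextLengthExceeded {
            prompt_tokens: prompt_token_count,
            effective_ctx,
        });
    }

    Ok(effective_ctx)
}

fn main() {
    // Mirrors the example above: 5000 prompt tokens against a 4096-token
    // effective context now fails up front with a clear error.
    match validate_and_compute_context(5000, None, 4096, 8192) {
        Ok(ctx) => println!("effective_ctx = {ctx}"),
        Err(e) => eprintln!("ContextLengthExceeded: {e}"),
    }
}
```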
Testing

- `inference_engine` unit tests pass
- `cargo clippy --all-targets -- -D warnings` clean
- `cargo fmt` clean