Port general engine improvements for TinyLlama #513
…block that delegates to it, supporting multiple blocks and unclosed tags
- Add enable_thinking_if_supported param to parse_options_json and
thinking field to construct_response_json
- Refactor KV cache to store head_dim per-layer via LayerCache.head_dim,
get_layer_head_dim(), and vector-based init() signature
- Remove dim parameter from update_from_graph; derive per-layer inside loop
- Update set_window_size to iterate per-layer, skipping dim=0 layers
- Add KVCache::remove_token_range for surgical token removal from cache
- Add KVCache::compact_to_windows for per-layer compaction to target windows
- Add Model::remove_thinking_tokens (back-to-front multi-range removal)
- Add Model::compact_kv_cache virtual and get_kv_layer_dims virtual
- Add GemmaModel3n::get_kv_layer_dims override returning 0 for shared layers
- Add tool call tag auto-detection in gemma::parse_function_calls
supporting both <start_function_call> and <|tool_call> formats
- Add enable_thinking_if_supported param to format_chat_prompt and
format_qwen_style with thinking injection for Qwen models
- Add test_qwen_thinking.cpp with 6 tests for thinking option parsing,
tag stripping, response JSON, and prompt injection
- Update test_kv_cache.cpp for new init signature and fix pre-existing
bug where graph.input() returning node ID 0 caused layer skipping
Signed-off-by: Noah Cylich <[email protected]>
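The commit message describes `Model::remove_thinking_tokens` as "back-to-front multi-range removal". A minimal self-contained sketch of why back-to-front ordering matters when erasing several ranges (the vector-of-ints token buffer and the helper name here are illustrative, not the engine's actual KV structures):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Erase several [start, start+count) ranges from a token buffer.
// Processing the ranges back-to-front means earlier start offsets are
// never shifted by removals that happen behind them, so each range can
// be erased with its original indices.
static void remove_token_ranges(std::vector<int>& tokens,
                                std::vector<std::pair<std::size_t, std::size_t>> ranges) {
    std::sort(ranges.begin(), ranges.end());
    for (auto it = ranges.rbegin(); it != ranges.rend(); ++it) {
        std::size_t start = it->first, count = it->second;
        assert(start + count <= tokens.size());
        tokens.erase(tokens.begin() + start, tokens.begin() + start + count);
    }
}
```

For example, removing ranges {1,2} and {5,2} from the tokens 0..7 leaves {0, 3, 4, 7}; erasing front-to-back with the same unadjusted indices would remove the wrong tail range.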
Pull request overview
This PR ports a set of engine improvements from the tinyllama branch. The main areas of change are: adding per-layer head_dim support to the KV cache (previously a single value for all layers), introducing a thinking feature for Qwen models (enabling/disabling <think> token injection and extraction), and adding tool call auto-detection for Gemma models.
Changes:
- KV cache refactoring: `head_dim` is now stored per-layer in `LayerCache.head_dim`, `init()` takes a vector of dims instead of a scalar, and `update_from_graph` and `set_window_size` derive the dim per-layer. New methods `remove_token_range` and `compact_to_windows` are added for surgical cache manipulation. Tests are updated accordingly.
- Thinking feature: adds an `enable_thinking_if_supported` parameter to `parse_options_json`, `format_chat_prompt`, and `format_qwen_style`; adds a `thinking` field to `construct_response_json`; adds `strip_thinking_block` and `strip_tag_blocks` helpers.
- Other: Gemma tool call auto-detection (`<|tool_call>` format), a `GemmaModel3n::get_kv_layer_dims` override for shared layers, and a new test file `test_qwen_thinking.cpp`.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `cactus/engine/engine.h` | Adds `head_dim` to `LayerCache`, replaces the scalar `head_dim` with a vector in `KVCache`, adds `get_layer_head_dim`, `remove_token_range`, `compact_to_windows`, `remove_thinking_tokens`, `compact_kv_cache`, and `get_kv_layer_dims` virtuals, and adds `enable_thinking_if_supported` to `format_chat_prompt` |
| `cactus/engine/engine_cache.cpp` | Implements per-layer `head_dim` in init/set_window/update; adds `remove_token_range` and `compact_to_windows` |
| `cactus/engine/engine_model.cpp` | Wires `get_kv_layer_dims()` into `kv_cache_.init`; removes the `head_dim` param from `update_from_graph`; adds `remove_thinking_tokens` |
| `cactus/engine/engine_tokenizer.cpp` | Adds the `enable_thinking_if_supported` param; refactors thinking injection from QWEN3P5-only to both QWEN types |
| `cactus/ffi/cactus_utils.h` | Adds `enable_thinking_if_supported` to `parse_options_json`; adds `strip_tag_blocks` and `strip_thinking_block`; adds a `thinking` field to `construct_response_json` |
| `cactus/ffi/gemma_tools.h` | Adds `<\|tool_call>` format auto-detection in `parse_function_calls` |
| `cactus/models/model.h` | Adds the `get_kv_layer_dims` override declaration for `GemmaModel3n` |
| `cactus/models/model_gemma3n.cpp` | Implements `get_kv_layer_dims` returning 0 for shared KV layers |
| `tests/test_kv_cache.cpp` | Updates tests for the new init signature; adds a dummy graph input to fix the node ID=0 bug |
| `tests/test_qwen_thinking.cpp` | New test file with 6 tests for thinking option parsing, tag stripping, response JSON, and prompt injection |
```cpp
size_t scaler_per_tok = num_kv_heads * dim / KV_QUANT_GROUP_SIZE;
erase_floats(layer.key_scales, start * scaler_per_tok, count * scaler_per_tok, tail_tokens * scaler_per_tok);
erase_floats(layer.value_scales, start * scaler_per_tok, count * scaler_per_tok, tail_tokens * scaler_per_tok);
```
The scaler_per_tok variable on line 501 uses integer truncation division (num_kv_heads * dim / KV_QUANT_GROUP_SIZE) rather than the ceiling division used everywhere else (e.g., compact_to_windows at line 532: (dim + KV_QUANT_GROUP_SIZE - 1) / KV_QUANT_GROUP_SIZE). If dim is not a multiple of KV_QUANT_GROUP_SIZE (32), the scale buffer offsets will be computed incorrectly, causing data corruption when erasing scale entries. Use the same ceiling formula as compact_to_windows and the rest of the codebase.
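The gap between the two formulas is easy to demonstrate. A small self-contained sketch (the helper names and the sample `num_kv_heads`/`dim` values are illustrative; only `KV_QUANT_GROUP_SIZE = 32` comes from the review comment):

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t KV_QUANT_GROUP_SIZE = 32;

// Truncating form from the flagged line: undercounts the per-token
// scale entries whenever dim is not a multiple of the group size.
std::size_t scales_trunc(std::size_t num_kv_heads, std::size_t dim) {
    return num_kv_heads * dim / KV_QUANT_GROUP_SIZE;
}

// Ceiling form used by compact_to_windows: rounds each head's dim up
// to a whole number of quant groups before multiplying by head count.
std::size_t scales_ceil(std::size_t num_kv_heads, std::size_t dim) {
    return num_kv_heads * ((dim + KV_QUANT_GROUP_SIZE - 1) / KV_QUANT_GROUP_SIZE);
}
```

With, say, `num_kv_heads = 4` and `dim = 48`, the truncating form yields 6 entries per token while the ceiling form yields 8, so every offset derived from the truncated count lands short of the real scale data.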
```cpp
erase_bytes(layer.values, start * bytes_per_tok, count * bytes_per_tok, tail_tokens * bytes_per_tok);

if (precision == Precision::INT8 && !layer.key_scales.empty()) {
    size_t scaler_per_tok = num_kv_heads * dim / KV_QUANT_GROUP_SIZE;
```
The variable name scaler_per_tok is a misspelling; it should be scales_per_tok (consistent with scales_per_token used elsewhere in the same file, e.g., line 141 and line 533).
`tests/test_qwen_thinking.cpp` (outdated)
```cpp
if (!model_path) { std::cout << "  [SKIP] CACTUS_TEST_MODEL not set\n"; return true; }

cactus_model_t model = cactus_init(model_path, nullptr, false);
if (!model) return false;
```
In `test_complete_thinking_toggle`, when `cactus_init` fails to load the model, the test returns false without printing a skip message (line 105: `if (!model) return false;`). The analogous test `test_prompt_thinking_injection`, by contrast, returns true with a skip message at line 76. Because of this inconsistency, a model-load failure in `test_complete_thinking_toggle` is reported as a test failure rather than a skip, which is unexpected for a test that is conditional on `CACTUS_TEST_MODEL` being set.
```cpp
const std::string CALL_START = (response.find("<|tool_call>") != std::string::npos)
                                   ? "<|tool_call>" : "<start_function_call>";
const std::string CALL_END = (CALL_START == "<|tool_call>")
                                 ? "<tool_call|>" : "<end_function_call>";
```
The new <|tool_call> format is detected but the actual parsing logic at line 516 only handles content starting with "call:". If <|tool_call>...<tool_call|> content doesn't use the same call:function_name{...} format as <start_function_call>, then tool calls using this format will be silently swallowed — the tags are erased from the response (line 548) but no function call is extracted. At minimum this should be documented; if the format is different, the parsing branch needs to be added.
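To make the concern concrete, here is a simplified, self-contained sketch of the detect-then-extract pattern. The helper names (`detect_call_tags`, `extract_call_payload`) are illustrative, not the real `parse_function_calls`; the sketch assumes both tag formats wrap the same payload, which is exactly the assumption the review says needs documenting:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Pick whichever tag pair appears in the response; falls back to the
// <start_function_call> form when neither marker is found.
std::pair<std::string, std::string> detect_call_tags(const std::string& response) {
    if (response.find("<|tool_call>") != std::string::npos)
        return {"<|tool_call>", "<tool_call|>"};
    return {"<start_function_call>", "<end_function_call>"};
}

// Extract the raw payload between the detected tags. The review notes the
// real parser additionally requires the payload to start with "call:"; a
// <|tool_call> payload in any other shape would be erased but never parsed.
std::string extract_call_payload(const std::string& response) {
    auto tags = detect_call_tags(response);
    std::size_t s = response.find(tags.first);
    if (s == std::string::npos) return "";
    s += tags.first.size();
    std::size_t e = response.find(tags.second, s);
    if (e == std::string::npos) return "";
    return response.substr(s, e - s);
}
```

If the `<|tool_call>` payload format differs from `call:function_name{...}`, a second parsing branch keyed on the detected tag pair is the natural place to add it.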
```cpp
for (size_t i = 0; i < num_layers; i++) {
    layer_caches[i].head_dim = layer_dims[i];
}
```
KVCache::init accesses layer_dims[i] for i in [0, num_layers) without verifying that layer_dims.size() >= num_layers. If a caller passes a vector shorter than num_layers, this results in undefined behavior. A defensive assert or bounds check should be added, e.g., assert(layer_dims.size() >= num_layers).
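A defensive version of that loop is straightforward. This sketch uses minimal stand-in types (the real `LayerCache` and `KVCache` live in `cactus/engine/engine.h` and carry much more state); the point is only the bounds check before the per-layer assignment:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative stand-ins for the engine types.
struct LayerCache { std::size_t head_dim = 0; };

struct KVCache {
    std::vector<LayerCache> layer_caches;

    // Defensive init(): reject a dims vector shorter than the layer
    // count instead of reading layer_dims[i] out of bounds.
    bool init(std::size_t num_layers, const std::vector<std::size_t>& layer_dims) {
        if (layer_dims.size() < num_layers) return false;  // or assert(...)
        layer_caches.resize(num_layers);
        for (std::size_t i = 0; i < num_layers; i++)
            layer_caches[i].head_dim = layer_dims[i];
        return true;
    }
};
```

Whether to `assert` (catch programmer error in debug builds) or return an error (tolerate bad input in release builds) is a project-convention choice; either is better than silent undefined behavior.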
`tests/test_qwen_thinking.cpp` (outdated)
```cpp
int r1 = cactus_complete(model, msgs, buf, sizeof(buf),
    R"({"max_tokens":128,"enable_thinking_if_supported":true,"telemetry_enabled":false})",
    nullptr, nullptr, nullptr);
std::string resp1(buf);
bool ok1 = r1 > 0 && resp1.find("\"success\":true") != std::string::npos;

handle->model->reset_cache();
handle->processed_tokens.clear();

int r2 = cactus_complete(model, msgs, buf, sizeof(buf),
    R"({"max_tokens":128,"enable_thinking_if_supported":false,"telemetry_enabled":false})",
    nullptr, nullptr, nullptr);
std::string resp2(buf);
bool ok2 = r2 > 0 && resp2.find("\"thinking\"") == std::string::npos;

cactus_destroy(model);
return ok1 && ok2;
```
The test_complete_thinking_toggle test does not effectively verify that thinking is enabled or disabled. ok1 only checks that the completion succeeded ("success":true), not that a "thinking" field is present when thinking is enabled. ok2 checks that "thinking" is absent when disabled — but since cactus_complete.cpp never calls strip_thinking_block or passes thinking to construct_response_json, this check will always pass regardless of whether the feature works. The test needs to also verify resp1.find("\"thinking\"") != std::string::npos once the wiring is added.
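Once the wiring lands, the checks could be tightened along these lines. This is a sketch, not the test file's actual code; the helper names and the sample response strings are invented for illustration:

```cpp
#include <cassert>
#include <string>

// True when the response JSON carries a "thinking" field.
bool thinking_present(const std::string& resp) {
    return resp.find("\"thinking\"") != std::string::npos;
}

// resp1 is the response with thinking enabled, resp2 with it disabled.
bool verify_thinking_toggle(const std::string& resp1, const std::string& resp2) {
    // ok1 must require the "thinking" field in addition to success, so the
    // test fails (rather than silently passes) when the wiring is missing.
    bool ok1 = resp1.find("\"success\":true") != std::string::npos &&
               thinking_present(resp1);
    bool ok2 = !thinking_present(resp2);
    return ok1 && ok2;
}
```

With the current test, both runs produce responses without a `thinking` field, so the positive case never exercises the feature; the tightened `ok1` makes the toggle observable in both directions.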
```diff
  bool& telemetry_enabled,
  bool* auto_handoff = nullptr,
  size_t* cloud_timeout_ms = nullptr,
- bool* handoff_with_images = nullptr) {
+ bool* handoff_with_images = nullptr,
+ bool* enable_thinking_if_supported = nullptr) {
```
The `enable_thinking_if_supported` option is parsed in `parse_options_json` and the prompt formatting functions accept it, but the main completion path (`cactus_complete.cpp`) never wires these together. Since `cactus_complete.cpp` is not modified in this PR, the new feature is silently broken:
- `parse_options_json` is called without the `&enable_thinking_if_supported` pointer argument, so the parsed value is discarded.
- `format_chat_prompt` is called without passing `enable_thinking_if_supported`, so it uses the default value of `true` regardless of what the caller requested.
- `strip_thinking_block` is never called after generation, so the `thinking` field never appears in the response JSON.

The `test_complete_thinking_toggle` test will pass for the wrong reason: `resp2.find("\"thinking\"")` returns `npos` because thinking is never extracted, not because thinking was disabled. The wiring changes to `cactus_complete.cpp` are missing from this PR.
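For reference, here is a hedged sketch of what a `strip_thinking_block`-style helper can look like; the PR's real helper lives in `cactus/ffi/cactus_utils.h` and may differ in details, but the commit message says it supports multiple blocks and unclosed tags, which this illustrative version does:

```cpp
#include <cassert>
#include <string>

// Move <think>...</think> content out of `response` into `thinking`.
// Handles multiple blocks and a trailing unclosed <think> tag.
void strip_thinking_block(std::string& response, std::string& thinking) {
    const std::string open = "<think>", close = "</think>";
    std::size_t pos;
    while ((pos = response.find(open)) != std::string::npos) {
        std::size_t start = pos + open.size();
        std::size_t end = response.find(close, start);
        if (end == std::string::npos) {
            // Unclosed tag: everything after <think> is thinking content.
            thinking += response.substr(start);
            response.erase(pos);
            return;
        }
        thinking += response.substr(start, end - start);
        response.erase(pos, end + close.size() - pos);
    }
}
```

Calling a helper like this after generation, then passing the extracted text into `construct_response_json`, is the missing step that would make the `thinking` field appear in responses.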
…and added cli updates