Port general engine improvements for TinyLlama #513
…block that delegates to it, supporting multiple blocks and unclosed tags
- Add enable_thinking_if_supported param to parse_options_json and
thinking field to construct_response_json
- Refactor KV cache to store head_dim per-layer via LayerCache.head_dim,
get_layer_head_dim(), and vector-based init() signature
- Remove dim parameter from update_from_graph; derive per-layer inside loop
- Update set_window_size to iterate per-layer, skipping dim=0 layers
- Add KVCache::remove_token_range for surgical token removal from cache
- Add KVCache::compact_to_windows for per-layer compaction to target windows
- Add Model::remove_thinking_tokens (back-to-front multi-range removal)
- Add Model::compact_kv_cache virtual and get_kv_layer_dims virtual
- Add GemmaModel3n::get_kv_layer_dims override returning 0 for shared layers
- Add tool call tag auto-detection in gemma::parse_function_calls
supporting both <start_function_call> and <|tool_call> formats
- Add enable_thinking_if_supported param to format_chat_prompt and
format_qwen_style with thinking injection for Qwen models
- Add test_qwen_thinking.cpp with 6 tests for thinking option parsing,
tag stripping, response JSON, and prompt injection
- Update test_kv_cache.cpp for new init signature and fix pre-existing
bug where graph.input() returning node ID 0 caused layer skipping
Signed-off-by: Noah Cylich <[email protected]>
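The commit message describes `Model::remove_thinking_tokens` as "back-to-front multi-range removal". A minimal self-contained sketch of why back-to-front ordering matters when erasing several ranges (the vector-of-ints token buffer and the helper name here are illustrative, not the engine's actual KV structures):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Erase several [start, start+count) ranges from a token buffer.
// Processing the ranges back-to-front means earlier start offsets are
// never shifted by removals that happen behind them, so each range can
// be erased with its original indices.
static void remove_token_ranges(std::vector<int>& tokens,
                                std::vector<std::pair<std::size_t, std::size_t>> ranges) {
    std::sort(ranges.begin(), ranges.end());
    for (auto it = ranges.rbegin(); it != ranges.rend(); ++it) {
        std::size_t start = it->first, count = it->second;
        assert(start + count <= tokens.size());
        tokens.erase(tokens.begin() + start, tokens.begin() + start + count);
    }
}
```

For example, removing ranges {1,2} and {5,2} from the tokens 0..7 leaves {0, 3, 4, 7}; erasing front-to-back with the same unadjusted indices would remove the wrong tail range.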
Pull request overview
This PR ports a set of engine improvements from the tinyllama branch. The main areas of change are: adding per-layer head_dim support to the KV cache (previously a single value for all layers), introducing a thinking feature for Qwen models (enabling/disabling <think> token injection and extraction), and adding tool call auto-detection for Gemma models.
Changes:
- KV cache refactoring: `head_dim` is now stored per-layer in `LayerCache.head_dim`, `init()` takes a vector of dims instead of a scalar, and `update_from_graph` and `set_window_size` derive the dim per-layer. New methods `remove_token_range` and `compact_to_windows` are added for surgical cache manipulation. Tests are updated accordingly.
- Thinking feature: adds an `enable_thinking_if_supported` parameter to `parse_options_json`, `format_chat_prompt`, and `format_qwen_style`; adds a `thinking` field to `construct_response_json`; adds `strip_thinking_block` and `strip_tag_blocks` helpers.
- Other: Gemma tool call auto-detection (`<|tool_call>` format), a `GemmaModel3n::get_kv_layer_dims` override for shared layers, and a new test file `test_qwen_thinking.cpp`.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `cactus/engine/engine.h` | Adds `head_dim` to `LayerCache`, replaces the scalar `head_dim` with a vector in `KVCache`, adds `get_layer_head_dim`, `remove_token_range`, `compact_to_windows`, `remove_thinking_tokens`, `compact_kv_cache`, and `get_kv_layer_dims` virtuals, and adds `enable_thinking_if_supported` to `format_chat_prompt` |
| `cactus/engine/engine_cache.cpp` | Implements per-layer `head_dim` in init/set_window/update; adds `remove_token_range` and `compact_to_windows` |
| `cactus/engine/engine_model.cpp` | Wires `get_kv_layer_dims()` into `kv_cache_.init`; removes the `head_dim` param from `update_from_graph`; adds `remove_thinking_tokens` |
| `cactus/engine/engine_tokenizer.cpp` | Adds the `enable_thinking_if_supported` param; refactors thinking injection from QWEN3P5-only to both QWEN types |
| `cactus/ffi/cactus_utils.h` | Adds `enable_thinking_if_supported` to `parse_options_json`; adds `strip_tag_blocks` and `strip_thinking_block`; adds a `thinking` field to `construct_response_json` |
| `cactus/ffi/gemma_tools.h` | Adds `<\|tool_call>` format auto-detection in `parse_function_calls` |
| `cactus/models/model.h` | Adds the `get_kv_layer_dims` override declaration for `GemmaModel3n` |
| `cactus/models/model_gemma3n.cpp` | Implements `get_kv_layer_dims` returning 0 for shared KV layers |
| `tests/test_kv_cache.cpp` | Updates tests for the new init signature; adds a dummy graph input to fix the node ID=0 bug |
| `tests/test_qwen_thinking.cpp` | New test file with 6 tests for thinking option parsing, tag stripping, response JSON, and prompt injection |
```cpp
size_t scaler_per_tok = num_kv_heads * dim / KV_QUANT_GROUP_SIZE;
erase_floats(layer.key_scales, start * scaler_per_tok, count * scaler_per_tok, tail_tokens * scaler_per_tok);
erase_floats(layer.value_scales, start * scaler_per_tok, count * scaler_per_tok, tail_tokens * scaler_per_tok);
```
The scaler_per_tok variable on line 501 uses integer truncation division (num_kv_heads * dim / KV_QUANT_GROUP_SIZE) rather than the ceiling division used everywhere else (e.g., compact_to_windows at line 532: (dim + KV_QUANT_GROUP_SIZE - 1) / KV_QUANT_GROUP_SIZE). If dim is not a multiple of KV_QUANT_GROUP_SIZE (32), the scale buffer offsets will be computed incorrectly, causing data corruption when erasing scale entries. Use the same ceiling formula as compact_to_windows and the rest of the codebase.
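The gap between the two formulas is easy to demonstrate. A small self-contained sketch (the helper names and the sample `num_kv_heads`/`dim` values are illustrative; only `KV_QUANT_GROUP_SIZE = 32` comes from the review comment):

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t KV_QUANT_GROUP_SIZE = 32;

// Truncating form from the flagged line: undercounts the per-token
// scale entries whenever dim is not a multiple of the group size.
std::size_t scales_trunc(std::size_t num_kv_heads, std::size_t dim) {
    return num_kv_heads * dim / KV_QUANT_GROUP_SIZE;
}

// Ceiling form used by compact_to_windows: rounds each head's dim up
// to a whole number of quant groups before multiplying by head count.
std::size_t scales_ceil(std::size_t num_kv_heads, std::size_t dim) {
    return num_kv_heads * ((dim + KV_QUANT_GROUP_SIZE - 1) / KV_QUANT_GROUP_SIZE);
}
```

With, say, `num_kv_heads = 4` and `dim = 48`, the truncating form yields 6 entries per token while the ceiling form yields 8, so every offset derived from the truncated count lands short of the real scale data.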
```cpp
erase_bytes(layer.values, start * bytes_per_tok, count * bytes_per_tok, tail_tokens * bytes_per_tok);

if (precision == Precision::INT8 && !layer.key_scales.empty()) {
    size_t scaler_per_tok = num_kv_heads * dim / KV_QUANT_GROUP_SIZE;
```
The variable name scaler_per_tok is a misspelling; it should be scales_per_tok (consistent with scales_per_token used elsewhere in the same file, e.g., line 141 and line 533).
`tests/test_qwen_thinking.cpp` (outdated)
```cpp
if (!model_path) { std::cout << "  [SKIP] CACTUS_TEST_MODEL not set\n"; return true; }

cactus_model_t model = cactus_init(model_path, nullptr, false);
if (!model) return false;
```
In `test_complete_thinking_toggle`, when `cactus_init` fails to load the model, the test returns false without printing a skip message (line 105: `if (!model) return false;`). The analogous test `test_prompt_thinking_injection`, by contrast, returns true with a skip message at line 76. Because of this inconsistency, a model-load failure in `test_complete_thinking_toggle` is reported as a test failure rather than a skip, which is unexpected for a test that is conditional on `CACTUS_TEST_MODEL` being set.
```cpp
const std::string CALL_START = (response.find("<|tool_call>") != std::string::npos)
                                   ? "<|tool_call>" : "<start_function_call>";
const std::string CALL_END = (CALL_START == "<|tool_call>")
                                 ? "<tool_call|>" : "<end_function_call>";
```
The new <|tool_call> format is detected but the actual parsing logic at line 516 only handles content starting with "call:". If <|tool_call>...<tool_call|> content doesn't use the same call:function_name{...} format as <start_function_call>, then tool calls using this format will be silently swallowed — the tags are erased from the response (line 548) but no function call is extracted. At minimum this should be documented; if the format is different, the parsing branch needs to be added.
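To make the concern concrete, here is a simplified, self-contained sketch of the detect-then-extract pattern. The helper names (`detect_call_tags`, `extract_call_payload`) are illustrative, not the real `parse_function_calls`; the sketch assumes both tag formats wrap the same payload, which is exactly the assumption the review says needs documenting:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Pick whichever tag pair appears in the response; falls back to the
// <start_function_call> form when neither marker is found.
std::pair<std::string, std::string> detect_call_tags(const std::string& response) {
    if (response.find("<|tool_call>") != std::string::npos)
        return {"<|tool_call>", "<tool_call|>"};
    return {"<start_function_call>", "<end_function_call>"};
}

// Extract the raw payload between the detected tags. The review notes the
// real parser additionally requires the payload to start with "call:"; a
// <|tool_call> payload in any other shape would be erased but never parsed.
std::string extract_call_payload(const std::string& response) {
    auto tags = detect_call_tags(response);
    std::size_t s = response.find(tags.first);
    if (s == std::string::npos) return "";
    s += tags.first.size();
    std::size_t e = response.find(tags.second, s);
    if (e == std::string::npos) return "";
    return response.substr(s, e - s);
}
```

If the `<|tool_call>` payload format differs from `call:function_name{...}`, a second parsing branch keyed on the detected tag pair is the natural place to add it.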
```cpp
for (size_t i = 0; i < num_layers; i++) {
    layer_caches[i].head_dim = layer_dims[i];
}
```
KVCache::init accesses layer_dims[i] for i in [0, num_layers) without verifying that layer_dims.size() >= num_layers. If a caller passes a vector shorter than num_layers, this results in undefined behavior. A defensive assert or bounds check should be added, e.g., assert(layer_dims.size() >= num_layers).
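A defensive version of that loop is straightforward. This sketch uses minimal stand-in types (the real `LayerCache` and `KVCache` live in `cactus/engine/engine.h` and carry much more state); the point is only the bounds check before the per-layer assignment:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative stand-ins for the engine types.
struct LayerCache { std::size_t head_dim = 0; };

struct KVCache {
    std::vector<LayerCache> layer_caches;

    // Defensive init(): reject a dims vector shorter than the layer
    // count instead of reading layer_dims[i] out of bounds.
    bool init(std::size_t num_layers, const std::vector<std::size_t>& layer_dims) {
        if (layer_dims.size() < num_layers) return false;  // or assert(...)
        layer_caches.resize(num_layers);
        for (std::size_t i = 0; i < num_layers; i++)
            layer_caches[i].head_dim = layer_dims[i];
        return true;
    }
};
```

Whether to `assert` (catch programmer error in debug builds) or return an error (tolerate bad input in release builds) is a project-convention choice; either is better than silent undefined behavior.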
`tests/test_qwen_thinking.cpp` (outdated)
```cpp
int r1 = cactus_complete(model, msgs, buf, sizeof(buf),
    R"({"max_tokens":128,"enable_thinking_if_supported":true,"telemetry_enabled":false})",
    nullptr, nullptr, nullptr);
std::string resp1(buf);
bool ok1 = r1 > 0 && resp1.find("\"success\":true") != std::string::npos;

handle->model->reset_cache();
handle->processed_tokens.clear();

int r2 = cactus_complete(model, msgs, buf, sizeof(buf),
    R"({"max_tokens":128,"enable_thinking_if_supported":false,"telemetry_enabled":false})",
    nullptr, nullptr, nullptr);
std::string resp2(buf);
bool ok2 = r2 > 0 && resp2.find("\"thinking\"") == std::string::npos;

cactus_destroy(model);
return ok1 && ok2;
```
The test_complete_thinking_toggle test does not effectively verify that thinking is enabled or disabled. ok1 only checks that the completion succeeded ("success":true), not that a "thinking" field is present when thinking is enabled. ok2 checks that "thinking" is absent when disabled — but since cactus_complete.cpp never calls strip_thinking_block or passes thinking to construct_response_json, this check will always pass regardless of whether the feature works. The test needs to also verify resp1.find("\"thinking\"") != std::string::npos once the wiring is added.
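Once the wiring lands, the checks could be tightened along these lines. This is a sketch, not the test file's actual code; the helper names and the sample response strings are invented for illustration:

```cpp
#include <cassert>
#include <string>

// True when the response JSON carries a "thinking" field.
bool thinking_present(const std::string& resp) {
    return resp.find("\"thinking\"") != std::string::npos;
}

// resp1 is the response with thinking enabled, resp2 with it disabled.
bool verify_thinking_toggle(const std::string& resp1, const std::string& resp2) {
    // ok1 must require the "thinking" field in addition to success, so the
    // test fails (rather than silently passes) when the wiring is missing.
    bool ok1 = resp1.find("\"success\":true") != std::string::npos &&
               thinking_present(resp1);
    bool ok2 = !thinking_present(resp2);
    return ok1 && ok2;
}
```

With the current test, both runs produce responses without a `thinking` field, so the positive case never exercises the feature; the tightened `ok1` makes the toggle observable in both directions.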
```diff
  bool& telemetry_enabled,
  bool* auto_handoff = nullptr,
  size_t* cloud_timeout_ms = nullptr,
- bool* handoff_with_images = nullptr) {
+ bool* handoff_with_images = nullptr,
+ bool* enable_thinking_if_supported = nullptr) {
```
The `enable_thinking_if_supported` option is parsed in `parse_options_json` and the prompt formatting functions accept it, but the main completion path (`cactus_complete.cpp`) never wires these together. Since `cactus_complete.cpp` is not modified in this PR, the new feature is silently broken:
- `parse_options_json` is called without the `&enable_thinking_if_supported` pointer argument, so the parsed value is discarded.
- `format_chat_prompt` is called without passing `enable_thinking_if_supported`, so it uses the default value of `true` regardless of what the caller requested.
- `strip_thinking_block` is never called after generation, so the `thinking` field never appears in the response JSON.

The `test_complete_thinking_toggle` test will pass for the wrong reason: `resp2.find("\"thinking\"")` returns `npos` because thinking is never extracted, not because thinking was disabled. The wiring changes to `cactus_complete.cpp` are missing from this PR.
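For reference, here is a hedged sketch of what a `strip_thinking_block`-style helper can look like; the PR's real helper lives in `cactus/ffi/cactus_utils.h` and may differ in details, but the commit message says it supports multiple blocks and unclosed tags, which this illustrative version does:

```cpp
#include <cassert>
#include <string>

// Move <think>...</think> content out of `response` into `thinking`.
// Handles multiple blocks and a trailing unclosed <think> tag.
void strip_thinking_block(std::string& response, std::string& thinking) {
    const std::string open = "<think>", close = "</think>";
    std::size_t pos;
    while ((pos = response.find(open)) != std::string::npos) {
        std::size_t start = pos + open.size();
        std::size_t end = response.find(close, start);
        if (end == std::string::npos) {
            // Unclosed tag: everything after <think> is thinking content.
            thinking += response.substr(start);
            response.erase(pos);
            return;
        }
        thinking += response.substr(start, end - start);
        response.erase(pos, end + close.size() - pos);
    }
}
```

Calling a helper like this after generation, then passing the extracted text into `construct_response_json`, is the missing step that would make the `thinking` field appear in responses.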
…and added cli updates