
Port general engine improvements for TinyLlama#513

Merged
HenryNdubuaku merged 2 commits into main from engine-improvements
Mar 10, 2026

Conversation

@ncylich
Collaborator

@ncylich ncylich commented Mar 10, 2026

  • Add strip_tag_blocks generic helper and refactor strip_thinking_block
    to delegate to it, supporting multiple blocks and unclosed tags
  • Add enable_thinking_if_supported param to parse_options_json and
    thinking field to construct_response_json
  • Refactor KV cache to store head_dim per-layer via LayerCache.head_dim,
    get_layer_head_dim(), and vector-based init() signature
  • Remove dim parameter from update_from_graph; derive per-layer inside loop
  • Update set_window_size to iterate per-layer, skipping dim=0 layers
  • Add KVCache::remove_token_range for surgical token removal from cache
  • Add KVCache::compact_to_windows for per-layer compaction to target windows
  • Add Model::remove_thinking_tokens (back-to-front multi-range removal)
  • Add Model::compact_kv_cache virtual and get_kv_layer_dims virtual
  • Add GemmaModel3n::get_kv_layer_dims override returning 0 for shared layers
  • Add tool call tag auto-detection in gemma::parse_function_calls
    supporting both <start_function_call> and <|tool_call> formats
  • Add enable_thinking_if_supported param to format_chat_prompt and
    format_qwen_style with thinking injection for Qwen models
  • Add test_qwen_thinking.cpp with 6 tests for thinking option parsing,
    tag stripping, response JSON, and prompt injection
  • Update test_kv_cache.cpp for new init signature and fix pre-existing
    bug where graph.input() returning node ID 0 caused layer skipping
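The first bullet's helper can be illustrated with a minimal sketch. This is not the repository's actual code: the function names match the bullet, but the signature and the exact behavior details (iterative removal, dropping the tail on an unclosed tag) are assumptions.

```cpp
#include <string>

// Sketch of a generic tag-block stripper: removes every open..close block
// from `text`; if an opening tag is never closed, drops everything from
// that tag to the end (the "unclosed tags" case from the bullet list).
std::string strip_tag_blocks(std::string text,
                             const std::string& open_tag,
                             const std::string& close_tag) {
    size_t start;
    while ((start = text.find(open_tag)) != std::string::npos) {
        size_t end = text.find(close_tag, start + open_tag.size());
        if (end == std::string::npos) {
            text.erase(start);  // unclosed tag: drop the tail
        } else {
            text.erase(start, end + close_tag.size() - start);
        }
    }
    return text;
}

// strip_thinking_block can then delegate with the <think> tag pair.
std::string strip_thinking_block(const std::string& text) {
    return strip_tag_blocks(text, "<think>", "</think>");
}
```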


Signed-off-by: Noah Cylich <[email protected]>
Copilot AI review requested due to automatic review settings March 10, 2026 03:17
Contributor

Copilot AI left a comment


Pull request overview

This PR ports a set of engine improvements from the tinyllama branch. The main areas of change are: adding per-layer head_dim support to the KV cache (previously a single value for all layers), introducing a thinking feature for Qwen models (enabling/disabling <think> token injection and extraction), and adding tool call auto-detection for Gemma models.

Changes:

  • KV Cache refactoring: head_dim is now stored per-layer in LayerCache.head_dim, init() takes a vector of dims instead of a scalar, update_from_graph and set_window_size derive dim per-layer. New methods remove_token_range and compact_to_windows are added for surgical cache manipulation. Tests updated accordingly.
  • Thinking feature: Adds enable_thinking_if_supported parameter to parse_options_json, format_chat_prompt, and format_qwen_style; adds thinking field to construct_response_json; adds strip_thinking_block and strip_tag_blocks helpers.
  • Other: Gemma tool call auto-detection (<|tool_call> format), GemmaModel3n::get_kv_layer_dims override for shared layers, and a new test file test_qwen_thinking.cpp.
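One detail worth making concrete: Model::remove_thinking_tokens is described in the change list as "back-to-front multi-range removal". The rationale is that erasing ranges from the highest start index downward keeps the earlier ranges' indices valid. A toy version over a plain token vector (the real code operates on the KV cache; the struct and function names here are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative stand-in for a [start, start+count) removal range.
struct Range { std::size_t start; std::size_t count; };

// Erase several ranges from `tokens`. Sorting by descending start means
// each erase only shifts elements *after* the ranges still to be removed,
// so their indices stay valid -- the "back-to-front" property.
void remove_ranges_back_to_front(std::vector<int>& tokens,
                                 std::vector<Range> ranges) {
    std::sort(ranges.begin(), ranges.end(),
              [](const Range& a, const Range& b) { return a.start > b.start; });
    for (const Range& r : ranges) {
        tokens.erase(tokens.begin() + r.start,
                     tokens.begin() + r.start + r.count);
    }
}
```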

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
cactus/engine/engine.h Adds head_dim to LayerCache, replaces scalar head_dim with vector in KVCache, adds get_layer_head_dim, remove_token_range, compact_to_windows, remove_thinking_tokens, compact_kv_cache, get_kv_layer_dims virtuals, and enable_thinking_if_supported to format_chat_prompt
cactus/engine/engine_cache.cpp Implements per-layer head_dim in init/set_window/update; adds remove_token_range and compact_to_windows
cactus/engine/engine_model.cpp Wires get_kv_layer_dims() into kv_cache_.init; removes head_dim param from update_from_graph; adds remove_thinking_tokens
cactus/engine/engine_tokenizer.cpp Adds enable_thinking_if_supported param; refactors thinking injection from QWEN3P5-only to both QWEN types
cactus/ffi/cactus_utils.h Adds enable_thinking_if_supported to parse_options_json; adds strip_tag_blocks, strip_thinking_block; adds thinking field to construct_response_json
cactus/ffi/gemma_tools.h Adds <|tool_call> format auto-detection in parse_function_calls
cactus/models/model.h Adds get_kv_layer_dims override declaration for GemmaModel3n
cactus/models/model_gemma3n.cpp Implements get_kv_layer_dims returning 0 for shared KV layers
tests/test_kv_cache.cpp Updates tests for new init signature; adds dummy graph input to fix node ID=0 bug
tests/test_qwen_thinking.cpp New test file with 6 tests for thinking option parsing, tag stripping, response JSON, and prompt injection


Comment on lines +501 to +503
size_t scaler_per_tok = num_kv_heads * dim / KV_QUANT_GROUP_SIZE;
erase_floats(layer.key_scales, start * scaler_per_tok, count * scaler_per_tok, tail_tokens * scaler_per_tok);
erase_floats(layer.value_scales, start * scaler_per_tok, count * scaler_per_tok, tail_tokens * scaler_per_tok);

Copilot AI Mar 10, 2026


The scaler_per_tok variable on line 501 uses integer truncation division (num_kv_heads * dim / KV_QUANT_GROUP_SIZE) rather than the ceiling division used everywhere else (e.g., compact_to_windows at line 532: (dim + KV_QUANT_GROUP_SIZE - 1) / KV_QUANT_GROUP_SIZE). If dim is not a multiple of KV_QUANT_GROUP_SIZE (32), the scale buffer offsets will be computed incorrectly, causing data corruption when erasing scale entries. Use the same ceiling formula as compact_to_windows and the rest of the codebase.
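The arithmetic gap the reviewer describes can be shown in isolation. These are illustrative helpers, not codebase functions, and the exact grouping (per-head vs. whole-row) in the real code may differ; the point is only that the two formulas disagree whenever dim is not a multiple of the group size:

```cpp
#include <cstddef>

constexpr std::size_t KV_QUANT_GROUP_SIZE = 32;

// Truncating form used on the flagged line: rounds the scale count down.
constexpr std::size_t scales_truncating(std::size_t num_kv_heads, std::size_t dim) {
    return num_kv_heads * dim / KV_QUANT_GROUP_SIZE;
}

// Ceiling form matching the (dim + G - 1) / G pattern the review cites,
// applied per head so partial groups still get a scale entry.
constexpr std::size_t scales_ceiling(std::size_t num_kv_heads, std::size_t dim) {
    return num_kv_heads *
           ((dim + KV_QUANT_GROUP_SIZE - 1) / KV_QUANT_GROUP_SIZE);
}
```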

erase_bytes(layer.values, start * bytes_per_tok, count * bytes_per_tok, tail_tokens * bytes_per_tok);

if (precision == Precision::INT8 && !layer.key_scales.empty()) {
size_t scaler_per_tok = num_kv_heads * dim / KV_QUANT_GROUP_SIZE;

Copilot AI Mar 10, 2026


The variable name scaler_per_tok is a misspelling; it should be scales_per_tok (consistent with scales_per_token used elsewhere in the same file, e.g., line 141 and line 533).

if (!model_path) { std::cout << " [SKIP] CACTUS_TEST_MODEL not set\n"; return true; }

cactus_model_t model = cactus_init(model_path, nullptr, false);
if (!model) return false;

Copilot AI Mar 10, 2026


In test_complete_thinking_toggle, when cactus_init successfully loads the model but then fails, it returns false without printing a skip message (line 105: if (!model) return false;). However, in the analogous test test_prompt_thinking_injection at line 76, it returns true with a skip message. This inconsistency means a failure to load the model in test_complete_thinking_toggle will cause the test suite to report a failure rather than a skip, which is unexpected since the test is conditional on having a CACTUS_TEST_MODEL set.
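The consistent convention the review asks for can be sketched as follows. This is not the repository's test code: try_load_model is an illustrative stand-in for cactus_init (here stubbed to always fail so the sketch is self-contained), and the point is only that both the missing env var and a failed load report [SKIP] and return true, so only genuine assertion failures fail the suite.

```cpp
#include <cstdlib>
#include <iostream>

// Stub stand-in for the real model loader; always "fails" in this sketch.
void* try_load_model(const char*) { return nullptr; }

bool test_with_optional_model() {
    const char* model_path = std::getenv("CACTUS_TEST_MODEL");
    if (!model_path) {
        std::cout << " [SKIP] CACTUS_TEST_MODEL not set\n";
        return true;  // skip, not fail
    }
    void* model = try_load_model(model_path);
    if (!model) {
        std::cout << " [SKIP] model failed to load\n";
        return true;  // same convention: environment problem, not a test bug
    }
    // ... real assertions would go here ...
    return true;
}
```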

Comment on lines +503 to +506
const std::string CALL_START = (response.find("<|tool_call>") != std::string::npos)
? "<|tool_call>" : "<start_function_call>";
const std::string CALL_END = (CALL_START == "<|tool_call>")
? "<tool_call|>" : "<end_function_call>";

Copilot AI Mar 10, 2026


The new <|tool_call> format is detected but the actual parsing logic at line 516 only handles content starting with "call:". If <|tool_call>...<tool_call|> content doesn't use the same call:function_name{...} format as <start_function_call>, then tool calls using this format will be silently swallowed — the tags are erased from the response (line 548) but no function call is extracted. At minimum this should be documented; if the format is different, the parsing branch needs to be added.
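The detection quoted above can be factored into a small helper, which also makes the tag pairing easier to test in isolation. The tag strings come from the diff; the helper name and pair-returning shape are illustrative:

```cpp
#include <string>
#include <utility>

// Returns the {open, close} tag pair to use for parsing, based on which
// opener appears in the response: <|tool_call>...<tool_call|> if present,
// otherwise the <start_function_call>...<end_function_call> default.
std::pair<std::string, std::string> detect_call_tags(const std::string& response) {
    if (response.find("<|tool_call>") != std::string::npos) {
        return {"<|tool_call>", "<tool_call|>"};
    }
    return {"<start_function_call>", "<end_function_call>"};
}
```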

Comment on lines +19 to +21
for (size_t i = 0; i < num_layers; i++) {
layer_caches[i].head_dim = layer_dims[i];
}

Copilot AI Mar 10, 2026


KVCache::init accesses layer_dims[i] for i in [0, num_layers) without verifying that layer_dims.size() >= num_layers. If a caller passes a vector shorter than num_layers, this results in undefined behavior. A defensive assert or bounds check should be added, e.g., assert(layer_dims.size() >= num_layers).
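The suggested check is small; a sketch with simplified stand-ins for LayerCache and the surrounding class (the assert and loop mirror the review and the quoted diff, everything else is illustrative). Note that a dim of 0 is legitimate here, since GemmaModel3n::get_kv_layer_dims returns 0 for shared layers:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for the per-layer cache entry.
struct LayerCache { std::size_t head_dim = 0; };

// Defensive version of the flagged loop: verify layer_dims supplies a
// head_dim for every layer before indexing into it.
void init_layer_dims(std::vector<LayerCache>& layer_caches,
                     const std::vector<std::size_t>& layer_dims) {
    assert(layer_dims.size() >= layer_caches.size() &&
           "layer_dims must supply a head_dim for every layer");
    for (std::size_t i = 0; i < layer_caches.size(); i++) {
        layer_caches[i].head_dim = layer_dims[i];  // 0 = shared/skipped layer
    }
}
```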

Comment on lines +118 to +134
int r1 = cactus_complete(model, msgs, buf, sizeof(buf),
R"({"max_tokens":128,"enable_thinking_if_supported":true,"telemetry_enabled":false})",
nullptr, nullptr, nullptr);
std::string resp1(buf);
bool ok1 = r1 > 0 && resp1.find("\"success\":true") != std::string::npos;

handle->model->reset_cache();
handle->processed_tokens.clear();

int r2 = cactus_complete(model, msgs, buf, sizeof(buf),
R"({"max_tokens":128,"enable_thinking_if_supported":false,"telemetry_enabled":false})",
nullptr, nullptr, nullptr);
std::string resp2(buf);
bool ok2 = r2 > 0 && resp2.find("\"thinking\"") == std::string::npos;

cactus_destroy(model);
return ok1 && ok2;

Copilot AI Mar 10, 2026


The test_complete_thinking_toggle test does not effectively verify that thinking is enabled or disabled. ok1 only checks that the completion succeeded ("success":true), not that a "thinking" field is present when thinking is enabled. ok2 checks that "thinking" is absent when disabled — but since cactus_complete.cpp never calls strip_thinking_block or passes thinking to construct_response_json, this check will always pass regardless of whether the feature works. The test needs to also verify resp1.find("\"thinking\"") != std::string::npos once the wiring is added.

Comment on lines 707 to +711
bool& telemetry_enabled,
bool* auto_handoff = nullptr,
size_t* cloud_timeout_ms = nullptr,
bool* handoff_with_images = nullptr) {
bool* handoff_with_images = nullptr,
bool* enable_thinking_if_supported = nullptr) {

Copilot AI Mar 10, 2026


The enable_thinking_if_supported option is parsed in parse_options_json and the prompt formatting functions accept it, but the main completion path (cactus_complete.cpp) never wires these together. Since cactus_complete.cpp is not modified in this PR, the new feature is silently broken:

  1. parse_options_json is called without the &enable_thinking_if_supported pointer argument, so the parsed value is discarded.
  2. format_chat_prompt is called without passing enable_thinking_if_supported, so it uses the default value of true regardless of what the caller requested.
  3. strip_thinking_block is never called after generation, so the thinking field never appears in the response JSON.

The test_complete_thinking_toggle test will pass for the wrong reason: resp2.find("\"thinking\"") returns npos because thinking is never extracted, not because thinking was disabled. The wiring changes to cactus_complete.cpp are missing from this PR.
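The wiring the review says is missing follows a simple shape: the parsed flag must flow into prompt formatting before generation, and gate thinking extraction after generation. A purely illustrative sketch of the post-generation half, with hypothetical types and an inline strip standing in for strip_thinking_block (the real signatures in cactus_complete.cpp may differ):

```cpp
#include <string>

// Hypothetical options struct; in the real code this field would come from
// parse_options_json via the new enable_thinking_if_supported out-param.
struct Options { bool enable_thinking_if_supported = true; };

// After generation, strip the <think> block only when thinking is enabled;
// the extracted content would feed construct_response_json's thinking field.
std::string run_completion_sketch(const Options& opts, std::string generated) {
    if (opts.enable_thinking_if_supported) {
        const std::string OPEN = "<think>", CLOSE = "</think>";
        size_t s = generated.find(OPEN);
        size_t e = (s == std::string::npos)
                       ? std::string::npos
                       : generated.find(CLOSE, s + OPEN.size());
        if (s != std::string::npos && e != std::string::npos) {
            generated.erase(s, e + CLOSE.size() - s);
        }
    }
    return generated;
}
```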

@ncylich ncylich changed the title Port general engine improvements from tinyllama branch Port general engine improvements for TinyLlama Mar 10, 2026
@HenryNdubuaku HenryNdubuaku merged commit f69b3c6 into main Mar 10, 2026
4 of 6 checks passed