Conversation
Signed-off-by: jakmro <[email protected]>
Pull request overview
This PR adds per-segment timestamp output to the speech-to-text (STT) pipeline (Whisper timestamps and Parakeet-TDT token timings) and plumbs those segments through the FFI JSON responses and test harness.
Changes:
- Emit `"segments"` (start/end/text) in `cactus_transcribe` JSON responses and surface them in streaming transcribe responses.
- Extend the audio decoding API (`decode_with_audio`) to optionally return per-token timing (implemented for Parakeet-TDT).
- Update STT prompts/tests to enable Whisper timestamps (remove `<|notimestamps|>`) and record segments in test metrics.
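For illustration, a response carrying two segments might look like the following (the `"segments"` field with start/end/text is what this PR adds; the surrounding field names and exact formatting here are hypothetical):

```json
{
  "text": " Hello world.",
  "segments": [
    { "start": 0.0, "end": 0.48, "text": "Hello" },
    { "start": 0.48, "end": 0.9, "text": "world." }
  ]
}
```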
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_utils.h | Adds segments field to test metrics. |
| tests/test_utils.cpp | Parses/prints "segments" from response JSON. |
| tests/test_stt.cpp | Updates Whisper prompt to allow timestamps. |
| tests/asr.cpp | Updates Whisper prompt to allow timestamps. |
| rust/cactus-sys/tests/stt.rs | Updates Whisper prompt to allow timestamps. |
| cactus/models/model_whisper.cpp | Updates decode_with_audio signature (timing args ignored). |
| cactus/models/model_parakeet_tdt.cpp | Returns token timing via new decode_with_audio out-params; adds timed token struct. |
| cactus/models/model_parakeet.cpp | Updates decode_with_audio signature (timing args ignored). |
| cactus/models/model_moonshine.cpp | Updates decode_with_audio signature (timing args ignored). |
| cactus/models/model.h | Extends model interface and adds Parakeet-TDT timed token types/storage. |
| cactus/ffi/cactus_utils.h | Adds TranscriptSegment and serializes "segments" into response JSON. |
| cactus/ffi/cactus_transcribe.cpp | Builds timestamped segments for Whisper and Parakeet-TDT; includes segments in JSON response. |
| cactus/ffi/cactus_stream.cpp | Adds segment parsing/serialization for streaming responses and uses segments for confirmation logic. |
| cactus/engine/engine_model.cpp | Updates base Model::decode_with_audio signature to include timing out-params. |
| cactus/engine/engine.h | Extends virtual decode_with_audio API to include timing out-params. |
cactus/ffi/cactus_transcribe.cpp (outdated)
```cpp
if (is_terminal_transcription_piece(piece)) {
    break;
}
if (std::find(tokens_to_remove.begin(), tokens_to_remove.end(), piece) != tokens_to_remove.end()) {
```
`tokens_to_remove` filtering uses `continue` before appending `next_token` to `tokens` / incrementing `completion_tokens`. If the model ever emits a filtered token (e.g. `<|startoftranscript|>`), the decoder context never advances and the loop can repeatedly sample the same token until `max_chunk_tokens` is exhausted, skewing token counts/throughput and potentially producing empty output. Consider always pushing the token into `tokens` (and counting it) but skipping only the user-visible output (`final_text`/segments/callback).
Suggested change:

```cpp
if (std::find(tokens_to_remove.begin(), tokens_to_remove.end(), piece) != tokens_to_remove.end()) {
    // Advance decoding context and accounting even for filtered tokens,
    // but skip adding to user-visible text or invoking the callback.
    tokens.emplace_back(next_token);
    completion_tokens++;
```
```cpp
const bool new_word = !piece.empty() && piece[0] == ' ';
if (new_word && !segment_text.empty()) {
    final_text += segment_text;
    segments.emplace_back(
        segment_start,
        segment_end,
        segment_text[0] == ' '
            ? (segment_text.erase(0, 1), std::move(segment_text))
            : std::move(segment_text)
    );
}
if (segment_text.empty()) segment_start = tok_time_start + orig_offset;
segment_text += piece;
segment_end = tok_time_end + orig_offset;
```
In the Parakeet-TDT branch, a segment is flushed via `segments.emplace_back(..., std::move(segment_text))`, but `segment_text` is not cleared afterward. A moved-from `std::string` is left in a valid but unspecified state, so it is not guaranteed to be empty; subsequent `segment_text.empty()` / `+=` uses may behave inconsistently. Clear/reset `segment_text` after moving (or avoid moving until the end) to keep segment boundaries/timestamps deterministic.
```diff
 const auto model_type = handle->model_handle->model->get_config().model_type;
 bool is_moonshine = model_type == cactus::engine::Config::ModelType::MOONSHINE;
 bool is_parakeet =
     model_type == cactus::engine::Config::ModelType::PARAKEET ||
     model_type == cactus::engine::Config::ModelType::PARAKEET_TDT;

 std::string prompt = is_moonshine ? "" :
     "<|startoftranscript|><|" + handle->options.language + "|><|transcribe|><|notimestamps|>";

 cactus::telemetry::setStreamMode(true);
 const int result = cactus_transcribe(
     handle->model_handle,
     nullptr,
-    (is_moonshine || is_parakeet) ? "" : "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>",
+    (is_moonshine || is_parakeet) ? "" : "<|startoftranscript|><|en|><|transcribe|>",
     handle->transcribe_response_buffer,
```
`cactus_stream_transcribe_process` parses and stores `options.language`, but the prompt passed to `cactus_transcribe` is hard-coded to `<|en|>` (and ignores `handle->options.language`). This makes the stream API's language option ineffective/regressed. Use the configured language when building the Whisper prompt (or remove the option if it's intentionally unsupported).
```cpp
static std::vector<TranscriptSegment> parse_segments(const std::string& transcribe_json) {
    std::vector<TranscriptSegment> segs;
    std::string arr = json_array(transcribe_json, "segments");
    size_t i = 1;
    while (i < arr.size() - 1) {
        while (i < arr.size() && (arr[i] == ',' || std::isspace((unsigned char)arr[i]))) ++i;
        if (i >= arr.size() - 1 || arr[i] != '{') break;
        int depth = 1;
        size_t obj_start = i++;
        while (i < arr.size() && depth > 0) {
            if (arr[i] == '{') ++depth;
            else if (arr[i] == '}') --depth;
            ++i;
        }
        std::string obj = arr.substr(obj_start, i - obj_start);
        segs.emplace_back(
            static_cast<float>(json_number(obj, "start")),
            static_cast<float>(json_number(obj, "end")),
            json_string(obj, "text")
        );
    }
    return segs;
}
```
`parse_segments()` extracts each segment object by counting `{`/`}` and uses `json_string(obj, "text")`, which does not handle escaped quotes/backslashes. Transcripts can contain `"`, causing `json_string` to terminate early and segment parsing/confirmation to break; brace-counting can also be confused by braces inside strings. Consider either (a) building segments directly from the existing in-memory `TranscriptSegment` structures instead of round-tripping through JSON, or (b) implementing minimal JSON string escape handling when scanning/extracting segment objects/fields.
Signed-off-by: jakmro <[email protected]>
No description provided.