Speech-to-Text Timestamps #515

Merged
HenryNdubuaku merged 6 commits into main from timestamps on Mar 11, 2026
Conversation

@jakmro (Collaborator) commented Mar 11, 2026

No description provided.

jakmro added 4 commits March 10, 2026 22:28
Signed-off-by: jakmro <[email protected]> (all 4 commits)
Copilot AI review requested due to automatic review settings March 11, 2026 03:29
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds per-segment timestamp output to the speech-to-text (STT) pipeline (Whisper timestamps and Parakeet-TDT token timings) and plumbs those segments through the FFI JSON responses and test harness.

Changes:

  • Emit "segments" (start/end/text) in cactus_transcribe JSON responses and surface them in streaming transcribe responses.
  • Extend the audio decoding API (decode_with_audio) to optionally return per-token timing (implemented for Parakeet-TDT).
  • Update STT prompts/tests to enable Whisper timestamps (remove <|notimestamps|>) and record segments in test metrics.
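To picture the new "segments" payload, here is a minimal hedged sketch of a segment struct and serializer; the names below are illustrative only (the PR's actual TranscriptSegment lives in cactus/ffi/cactus_utils.h and its serialization may differ):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative stand-in for the PR's TranscriptSegment (start/end in seconds).
struct TranscriptSegment {
    float start;
    float end;
    std::string text;
};

// Escape quotes and backslashes so transcript text stays valid JSON.
static std::string json_escape(const std::string& s) {
    std::string out;
    for (char c : s) {
        if (c == '"' || c == '\\') out += '\\';
        out += c;
    }
    return out;
}

// Build the kind of "segments" array the response JSON would carry.
static std::string serialize_segments(const std::vector<TranscriptSegment>& segs) {
    std::string out = "[";
    for (std::size_t i = 0; i < segs.size(); ++i) {
        if (i) out += ",";
        out += "{\"start\":" + std::to_string(segs[i].start) +
               ",\"end\":" + std::to_string(segs[i].end) +
               ",\"text\":\"" + json_escape(segs[i].text) + "\"}";
    }
    return out + "]";
}
```

Note that escaping on the way out matters for the round-trip concern raised in the parse_segments review comment below.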

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Summary per file:

  • tests/test_utils.h: Adds segments field to test metrics.
  • tests/test_utils.cpp: Parses/prints "segments" from response JSON.
  • tests/test_stt.cpp: Updates Whisper prompt to allow timestamps.
  • tests/asr.cpp: Updates Whisper prompt to allow timestamps.
  • rust/cactus-sys/tests/stt.rs: Updates Whisper prompt to allow timestamps.
  • cactus/models/model_whisper.cpp: Updates decode_with_audio signature (timing args ignored).
  • cactus/models/model_parakeet_tdt.cpp: Returns token timing via new decode_with_audio out-params; adds timed token struct.
  • cactus/models/model_parakeet.cpp: Updates decode_with_audio signature (timing args ignored).
  • cactus/models/model_moonshine.cpp: Updates decode_with_audio signature (timing args ignored).
  • cactus/models/model.h: Extends model interface and adds Parakeet-TDT timed token types/storage.
  • cactus/ffi/cactus_utils.h: Adds TranscriptSegment and serializes "segments" into response JSON.
  • cactus/ffi/cactus_transcribe.cpp: Builds timestamped segments for Whisper and Parakeet-TDT; includes segments in JSON response.
  • cactus/ffi/cactus_stream.cpp: Adds segment parsing/serialization for streaming responses and uses segments for confirmation logic.
  • cactus/engine/engine_model.cpp: Updates base Model::decode_with_audio signature to include timing out-params.
  • cactus/engine/engine.h: Extends virtual decode_with_audio API to include timing out-params.


if (is_terminal_transcription_piece(piece)) {
break;
}
if (std::find(tokens_to_remove.begin(), tokens_to_remove.end(), piece) != tokens_to_remove.end()) {

Copilot AI Mar 11, 2026


tokens_to_remove filtering uses continue before appending next_token to tokens / incrementing completion_tokens. If the model ever emits a filtered token (e.g. <|startoftranscript|>), the decoder context never advances and the loop can repeatedly sample the same token until max_chunk_tokens is exhausted, skewing token counts/throughput and potentially producing empty output. Consider always pushing the token into tokens (and counting it) but skipping only the user-visible output (final_text/segments/callback).

Suggested change:

  if (std::find(tokens_to_remove.begin(), tokens_to_remove.end(), piece) != tokens_to_remove.end()) {
+     // Advance decoding context and accounting even for filtered tokens,
+     // but skip adding to user-visible text or invoking the callback.
+     tokens.emplace_back(next_token);
+     completion_tokens++;
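A self-contained sketch of the behavior this suggestion aims for; decode_filtered and its signature are hypothetical, not the PR's actual loop, but the key point carries over: a filtered token still advances state and is counted, and only the user-visible output is skipped.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch: decode a stream of already-detokenized pieces, dropping special
// tokens from the visible transcript without stalling the decode loop.
static std::string decode_filtered(const std::vector<std::string>& pieces,
                                   const std::vector<std::string>& tokens_to_remove,
                                   int& completion_tokens) {
    std::string final_text;
    for (const auto& piece : pieces) {
        ++completion_tokens;  // always count, even filtered tokens
        if (std::find(tokens_to_remove.begin(), tokens_to_remove.end(), piece)
                != tokens_to_remove.end()) {
            continue;  // skip only the user-visible output
        }
        final_text += piece;
    }
    return final_text;
}
```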

Comment on lines +472 to +485
const bool new_word = !piece.empty() && piece[0] == ' ';
if (new_word && !segment_text.empty()) {
final_text += segment_text;
segments.emplace_back(
segment_start,
segment_end,
segment_text[0] == ' '
? (segment_text.erase(0, 1), std::move(segment_text))
: std::move(segment_text)
);
}
if (segment_text.empty()) segment_start = tok_time_start + orig_offset;
segment_text += piece;
segment_end = tok_time_end + orig_offset;

Copilot AI Mar 11, 2026


In the Parakeet-TDT branch, a segment is flushed via segments.emplace_back(..., std::move(segment_text)) but segment_text is not cleared afterward. Relying on a moved-from std::string to be empty is not guaranteed; subsequent segment_text.empty() / += uses may behave inconsistently. Clear/reset segment_text after moving (or avoid moving until the end) to keep segment boundaries/timestamps deterministic.
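One way to make the flush deterministic is to clear the string explicitly after the move, since a moved-from std::string is only guaranteed to be in a valid but unspecified state. A minimal sketch, with a hypothetical flush_segment helper and Segment struct standing in for the PR's inline logic:

```cpp
#include <string>
#include <utility>
#include <vector>

struct Segment { float start; float end; std::string text; };

// Flush the accumulated word into `segments`, trimming the leading space
// and explicitly resetting segment_text so the next word starts clean.
static void flush_segment(std::vector<Segment>& segments,
                          float start, float end, std::string& segment_text) {
    if (!segment_text.empty() && segment_text[0] == ' ')
        segment_text.erase(0, 1);
    segments.push_back({start, end, std::move(segment_text)});
    segment_text.clear();  // do not rely on the moved-from state being empty
}
```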

Comment on lines 372 to 383
const auto model_type = handle->model_handle->model->get_config().model_type;
bool is_moonshine = model_type == cactus::engine::Config::ModelType::MOONSHINE;
bool is_parakeet =
model_type == cactus::engine::Config::ModelType::PARAKEET ||
model_type == cactus::engine::Config::ModelType::PARAKEET_TDT;

std::string prompt = is_moonshine ? "" :
"<|startoftranscript|><|" + handle->options.language + "|><|transcribe|><|notimestamps|>";

cactus::telemetry::setStreamMode(true);
const int result = cactus_transcribe(
handle->model_handle,
nullptr,
(is_moonshine || is_parakeet) ? "" : "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>",
(is_moonshine || is_parakeet) ? "" : "<|startoftranscript|><|en|><|transcribe|>",
handle->transcribe_response_buffer,

Copilot AI Mar 11, 2026


cactus_stream_transcribe_process parses and stores options.language, but the prompt passed to cactus_transcribe is hard-coded to <|en|> (and ignores handle->options.language). This makes the stream API’s language option ineffective/regressed. Use the configured language when building the Whisper prompt (or remove the option if it’s intentionally unsupported).
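A small sketch of the first option, using the language already parsed into the handle's options; build_whisper_prompt is a hypothetical helper name, and the token layout simply mirrors the prompt string the diff above already constructs:

```cpp
#include <string>

// Build the Whisper prompt from the configured language instead of
// hard-coding <|en|>. No <|notimestamps|>, since this PR enables timestamps.
static std::string build_whisper_prompt(const std::string& language) {
    return "<|startoftranscript|><|" + language + "|><|transcribe|>";
}
```

The call site would then pass build_whisper_prompt(handle->options.language) for Whisper models and "" for Moonshine/Parakeet, as the surrounding diff already does.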

Comment on lines +143 to +165
static std::vector<TranscriptSegment> parse_segments(const std::string& transcribe_json) {
std::vector<TranscriptSegment> segs;
std::string arr = json_array(transcribe_json, "segments");
size_t i = 1;
while (i < arr.size() - 1) {
while (i < arr.size() && (arr[i] == ',' || std::isspace((unsigned char)arr[i]))) ++i;
if (i >= arr.size() - 1 || arr[i] != '{') break;
int depth = 1;
size_t obj_start = i++;
while (i < arr.size() && depth > 0) {
if (arr[i] == '{') ++depth;
else if (arr[i] == '}') --depth;
++i;
}
std::string obj = arr.substr(obj_start, i - obj_start);
segs.emplace_back(
static_cast<float>(json_number(obj, "start")),
static_cast<float>(json_number(obj, "end")),
json_string(obj, "text")
);
}
return segs;
}

Copilot AI Mar 11, 2026


parse_segments() extracts each segment object by counting {/} and uses json_string(obj, "text"), which does not handle escaped quotes/backslashes. Transcripts can contain ", causing json_string to terminate early and segment parsing/confirmation to break; brace-counting can also be confused by braces inside strings. Consider either (a) parsing segments directly from the existing in-memory TranscriptSegment structures instead of round-tripping through JSON, or (b) implementing minimal JSON string escape handling when scanning/extracting segment objects/fields.
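Option (b) could look like the sketch below: an escape-aware string scanner that honors backslash escapes, so an embedded `\"` does not terminate the field early. extract_json_string is an illustrative helper, not the codebase's json_string(); a full fix would apply the same skip-escapes rule while brace-counting too.

```cpp
#include <cstddef>
#include <string>

// `pos` points at the opening quote of a JSON string; scan to the matching
// unescaped close quote, decoding the common escapes along the way.
static std::string extract_json_string(const std::string& s, std::size_t pos) {
    std::string out;
    for (std::size_t i = pos + 1; i < s.size(); ++i) {
        char c = s[i];
        if (c == '\\' && i + 1 < s.size()) {
            char e = s[++i];
            switch (e) {
                case 'n':  out += '\n'; break;
                case 't':  out += '\t'; break;
                case '"':  out += '"';  break;
                case '\\': out += '\\'; break;
                default:   out += e;    break;  // pass other escapes through
            }
        } else if (c == '"') {
            break;  // unescaped quote closes the string
        } else {
            out += c;
        }
    }
    return out;
}
```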

@HenryNdubuaku HenryNdubuaku merged commit 849ce3f into main Mar 11, 2026
5 of 6 checks passed