Conversation
Signed-off-by: jakmro <[email protected]>
Pull request overview
This PR adds per-segment timestamp output to the speech-to-text (STT) pipeline (Whisper timestamps and Parakeet-TDT token timings) and plumbs those segments through the FFI JSON responses and test harness.
Changes:
- Emit `"segments"` (start/end/text) in `cactus_transcribe` JSON responses and surface them in streaming transcribe responses.
- Extend the audio decoding API (`decode_with_audio`) to optionally return per-token timing (implemented for Parakeet-TDT).
- Update STT prompts/tests to enable Whisper timestamps (remove `<|notimestamps|>`) and record segments in test metrics.
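For illustration, a response carrying two segments might look like the following (the `"segments"` field with start/end/text is what this PR adds; the surrounding field names and exact formatting here are hypothetical):

```json
{
  "text": " Hello world.",
  "segments": [
    { "start": 0.0, "end": 0.48, "text": "Hello" },
    { "start": 0.48, "end": 0.9, "text": "world." }
  ]
}
```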
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_utils.h | Adds segments field to test metrics. |
| tests/test_utils.cpp | Parses/prints "segments" from response JSON. |
| tests/test_stt.cpp | Updates Whisper prompt to allow timestamps. |
| tests/asr.cpp | Updates Whisper prompt to allow timestamps. |
| rust/cactus-sys/tests/stt.rs | Updates Whisper prompt to allow timestamps. |
| cactus/models/model_whisper.cpp | Updates decode_with_audio signature (timing args ignored). |
| cactus/models/model_parakeet_tdt.cpp | Returns token timing via new decode_with_audio out-params; adds timed token struct. |
| cactus/models/model_parakeet.cpp | Updates decode_with_audio signature (timing args ignored). |
| cactus/models/model_moonshine.cpp | Updates decode_with_audio signature (timing args ignored). |
| cactus/models/model.h | Extends model interface and adds Parakeet-TDT timed token types/storage. |
| cactus/ffi/cactus_utils.h | Adds TranscriptSegment and serializes "segments" into response JSON. |
| cactus/ffi/cactus_transcribe.cpp | Builds timestamped segments for Whisper and Parakeet-TDT; includes segments in JSON response. |
| cactus/ffi/cactus_stream.cpp | Adds segment parsing/serialization for streaming responses and uses segments for confirmation logic. |
| cactus/engine/engine_model.cpp | Updates base Model::decode_with_audio signature to include timing out-params. |
| cactus/engine/engine.h | Extends virtual decode_with_audio API to include timing out-params. |
cactus/ffi/cactus_transcribe.cpp (outdated)
```cpp
if (is_terminal_transcription_piece(piece)) {
    break;
}
if (std::find(tokens_to_remove.begin(), tokens_to_remove.end(), piece) != tokens_to_remove.end()) {
```
`tokens_to_remove` filtering uses `continue` before appending `next_token` to `tokens` / incrementing `completion_tokens`. If the model ever emits a filtered token (e.g. `<|startoftranscript|>`), the decoder context never advances and the loop can repeatedly sample the same token until `max_chunk_tokens` is exhausted, skewing token counts/throughput and potentially producing empty output. Consider always pushing the token into `tokens` (and counting it) but skipping only the user-visible output (`final_text`/segments/callback).
Suggested change:

```cpp
if (std::find(tokens_to_remove.begin(), tokens_to_remove.end(), piece) != tokens_to_remove.end()) {
    // Advance decoding context and accounting even for filtered tokens,
    // but skip adding to user-visible text or invoking the callback.
    tokens.emplace_back(next_token);
    completion_tokens++;
```
```cpp
const bool new_word = !piece.empty() && piece[0] == ' ';
if (new_word && !segment_text.empty()) {
    final_text += segment_text;
    segments.emplace_back(
        segment_start,
        segment_end,
        segment_text[0] == ' '
            ? (segment_text.erase(0, 1), std::move(segment_text))
            : std::move(segment_text)
    );
}
if (segment_text.empty()) segment_start = tok_time_start + orig_offset;
segment_text += piece;
segment_end = tok_time_end + orig_offset;
```
In the Parakeet-TDT branch, a segment is flushed via `segments.emplace_back(..., std::move(segment_text))`, but `segment_text` is not cleared afterward. A moved-from `std::string` is left in a valid but unspecified state, so it is not guaranteed to be empty; subsequent `segment_text.empty()` / `+=` uses may behave inconsistently. Clear/reset `segment_text` after moving (or avoid moving until the end) to keep segment boundaries/timestamps deterministic.
```diff
 const auto model_type = handle->model_handle->model->get_config().model_type;
 bool is_moonshine = model_type == cactus::engine::Config::ModelType::MOONSHINE;
 bool is_parakeet =
     model_type == cactus::engine::Config::ModelType::PARAKEET ||
     model_type == cactus::engine::Config::ModelType::PARAKEET_TDT;

 std::string prompt = is_moonshine ? "" :
     "<|startoftranscript|><|" + handle->options.language + "|><|transcribe|><|notimestamps|>";

 cactus::telemetry::setStreamMode(true);
 const int result = cactus_transcribe(
     handle->model_handle,
     nullptr,
-    (is_moonshine || is_parakeet) ? "" : "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>",
+    (is_moonshine || is_parakeet) ? "" : "<|startoftranscript|><|en|><|transcribe|>",
     handle->transcribe_response_buffer,
```
`cactus_stream_transcribe_process` parses and stores `options.language`, but the prompt passed to `cactus_transcribe` is hard-coded to `<|en|>` (and ignores `handle->options.language`). This makes the stream API's language option ineffective/regressed. Use the configured language when building the Whisper prompt (or remove the option if it's intentionally unsupported).
```cpp
static std::vector<TranscriptSegment> parse_segments(const std::string& transcribe_json) {
    std::vector<TranscriptSegment> segs;
    std::string arr = json_array(transcribe_json, "segments");
    size_t i = 1;
    while (i < arr.size() - 1) {
        while (i < arr.size() && (arr[i] == ',' || std::isspace((unsigned char)arr[i]))) ++i;
        if (i >= arr.size() - 1 || arr[i] != '{') break;
        int depth = 1;
        size_t obj_start = i++;
        while (i < arr.size() && depth > 0) {
            if (arr[i] == '{') ++depth;
            else if (arr[i] == '}') --depth;
            ++i;
        }
        std::string obj = arr.substr(obj_start, i - obj_start);
        segs.emplace_back(
            static_cast<float>(json_number(obj, "start")),
            static_cast<float>(json_number(obj, "end")),
            json_string(obj, "text")
        );
    }
    return segs;
}
```
`parse_segments()` extracts each segment object by counting `{`/`}` and uses `json_string(obj, "text")`, which does not handle escaped quotes/backslashes. Transcripts can contain `"`, causing `json_string` to terminate early and segment parsing/confirmation to break; brace-counting can also be confused by braces inside strings. Consider either (a) building segments directly from the existing in-memory `TranscriptSegment` structures instead of round-tripping through JSON, or (b) implementing minimal JSON string escape handling when scanning/extracting segment objects/fields.
Signed-off-by: jakmro <[email protected]>
No description provided.