Signed-off-by: jakmro <[email protected]>
Pull request overview
Adds support for “long transcription” by chunking audio inputs in the cactus_transcribe FFI path and expanding the STT test suite to cover a longer WAV fixture.
Changes:
- Updated STT tests to support selecting an audio fixture and added a new long-transcription test case.
- Implemented audio chunking (VAD-aware or fixed 30s chunks) in `cactus_transcribe` with limited text context carryover between chunks.
- Minor refactors in transcription preprocessing (e.g., mel normalization and option handling).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `tests/test_stt.cpp` | Renames the transcription test helper to accept an audio filename and adds `transcription_long` using `test_long.wav`. |
| `cactus/ffi/cactus_transcribe.cpp` | Adds chunk-based transcription flow for long audio inputs and related token/context handling. |
```diff
     while (pos < opts.size() && std::isspace(static_cast<unsigned char>(opts[pos]))) ++pos;
     try {
-        cloud_handoff_threshold = std::stof(opts.substr(pos));
+        cloud_handoff_threshold = std::stof(opts.c_str() + pos);
```
`std::stof` does not have an overload that accepts a `const char*`. `std::stof(opts.c_str() + pos)` will fail to compile; use `std::stof(opts.substr(pos))` (as before) or parse with `std::strtof`/`std::from_chars` while tracking the end position.
Suggested change:
```diff
-        cloud_handoff_threshold = std::stof(opts.c_str() + pos);
+        cloud_handoff_threshold = std::stof(opts.substr(pos));
```
```diff
+    std::vector<std::vector<float>> chunks;
+
     if (use_vad) {
         auto* vad = static_cast<SileroVADModel*>(handle->vad_model.get());
-        auto segments = vad->get_speech_timestamps(audio_buffer, {});
-
-        std::vector<float> speech_audio;
-        for (const auto& segment : segments) {
-            speech_audio.insert(
-                speech_audio.end(),
-                audio_buffer.begin() + segment.start,
-                audio_buffer.begin() + std::min(segment.end, audio_buffer.size())
+        auto vad_segments = vad->get_speech_timestamps(audio_samples, {});
+        chunks.reserve(vad_segments.size());
+
+        std::vector<float> current;
+        for (const auto& seg : vad_segments) {
+            size_t end = std::min(seg.end, audio_samples.size());
+            if (current.size() + (end - seg.start) > MAX_CHUNK_SAMPLES) {
+                chunks.emplace_back(std::move(current));
+                current = {};
+            }
+            current.insert(
+                current.end(),
+                audio_samples.begin() + seg.start,
+                audio_samples.begin() + end
+            );
+        }
-        audio_buffer = std::move(speech_audio);
-
-        if (audio_buffer.empty()) {
-            CACTUS_LOG_DEBUG("transcribe", "VAD detected only silence, returning empty transcription");
+        if (!current.empty()) {
+            chunks.emplace_back(std::move(current));
+        }
+
+        if (chunks.empty()) {
+            CACTUS_LOG_DEBUG("transcribe", "VAD detected only silence, returning empty transcription");
             auto vad_end_time = std::chrono::high_resolution_clock::now();
             double vad_total_time_ms = std::chrono::duration_cast<std::chrono::microseconds>(vad_end_time - start_time).count() / 1000.0;

             std::string json = construct_response_json("", {}, 0.0, vad_total_time_ms, 0.0, 0.0, 0, 0, 1.0f);

             if (json.size() >= buffer_size) {
                 handle_error_response("Response buffer too small", response_buffer, buffer_size);
                 cactus::telemetry::recordTranscription(handle->model_name.c_str(), false, 0.0, 0.0, 0.0, 0, 0.0, "Response buffer too small");
                 return -1;
             }

             cactus::telemetry::recordTranscription(handle->model_name.c_str(), true, 0.0, 0.0, vad_total_time_ms, 0, get_ram_usage_mb(), "");
             std::strcpy(response_buffer, json.c_str());
             return static_cast<int>(json.size());
         }
+    } else {
+        chunks.reserve((audio_samples.size() + MAX_CHUNK_SAMPLES - 1) / MAX_CHUNK_SAMPLES);
+        for (size_t start = 0; start < audio_samples.size(); start += MAX_CHUNK_SAMPLES) {
+            size_t end = std::min(start + MAX_CHUNK_SAMPLES, audio_samples.size());
+            chunks.emplace_back(audio_samples.begin() + start, audio_samples.begin() + end);
+        }
```
The current chunking strategy materializes `std::vector<std::vector<float>> chunks` and copies audio samples into each chunk (`emplace_back` with iterator ranges / `current.insert`). For long inputs this duplicates the audio in memory and can significantly increase peak RAM usage. Consider iterating over the audio in a streaming fashion (store start/end indices or use spans/views) and build/process one chunk at a time instead of keeping all chunks resident.
```diff
-    std::string audio_path = std::string(g_assets_path) + "/test.wav";
+    std::string audio_path = std::string(g_assets_path) + "/" + audio_file;
     std::cout << "Transcript: ";
```
`g_assets_path` is used to build `audio_path` without checking whether `CACTUS_TEST_ASSETS` is set. If `CACTUS_TEST_TRANSCRIBE_MODEL` is set but `CACTUS_TEST_ASSETS` is not, `std::string(g_assets_path)` will dereference a null pointer and crash the test binary. Consider adding a skip (similar to other tests) when `!g_assets_path`.