long transcription #482

Merged
HenryNdubuaku merged 1 commit into main from infinite_transcription
Mar 3, 2026
Conversation

Collaborator

@jakmro jakmro commented Mar 3, 2026

No description provided.

Signed-off-by: jakmro <[email protected]>
Copilot AI review requested due to automatic review settings March 3, 2026 03:34
Contributor

Copilot AI left a comment

Pull request overview

Adds support for “long transcription” by chunking audio inputs in the cactus_transcribe FFI path and expanding the STT test suite to cover a longer WAV fixture.

Changes:

  • Updated STT tests to support selecting an audio fixture and added a new long-transcription test case.
  • Implemented audio chunking (VAD-aware or fixed 30s chunks) in cactus_transcribe with limited text context carryover between chunks.
  • Minor refactors in transcription preprocessing (e.g., mel normalization and option handling).
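The chunk count and the "limited text context carryover" mentioned above can be sketched as follows. This is only an illustration under stated assumptions: `chunk_count` and `carry_context` are hypothetical names, tokens are assumed to be plain `int` ids, and the PR's actual carryover mechanism is not shown in this conversation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: number of fixed-size chunks needed to cover
// `total_samples`, using the same ceiling division as the reserve() call
// in the diff.
std::size_t chunk_count(std::size_t total_samples, std::size_t max_chunk_samples) {
    if (max_chunk_samples == 0) return 0;
    return (total_samples + max_chunk_samples - 1) / max_chunk_samples;
}

// Hypothetical helper: keep only the last `max_context` tokens produced for
// the previous chunk, so the prompt for the next chunk stays bounded.
// The token type (int) is an assumption made for this sketch.
std::vector<int> carry_context(const std::vector<int>& previous_tokens,
                               std::size_t max_context) {
    if (previous_tokens.size() <= max_context) return previous_tokens;
    return std::vector<int>(
        previous_tokens.end() - static_cast<std::ptrdiff_t>(max_context),
        previous_tokens.end());
}
```

Bounding the carried context keeps per-chunk decoding cost roughly constant regardless of how long the recording is, at the price of forgetting earlier text.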

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/test_stt.cpp | Renames the transcription test helper to accept an audio filename and adds transcription_long using test_long.wav. |
| cactus/ffi/cactus_transcribe.cpp | Adds chunk-based transcription flow for long audio inputs and related token/context handling. |

      while (pos < opts.size() && std::isspace(static_cast<unsigned char>(opts[pos]))) ++pos;
      try {
-         cloud_handoff_threshold = std::stof(opts.substr(pos));
+         cloud_handoff_threshold = std::stof(opts.c_str() + pos);

Copilot AI Mar 3, 2026

std::stof does not have an overload that accepts a const char*. std::stof(opts.c_str() + pos) will fail to compile; use std::stof(opts.substr(pos)) (as before) or parse with std::strtof/std::from_chars while tracking the end position.

Suggested change
- cloud_handoff_threshold = std::stof(opts.c_str() + pos);
+ cloud_handoff_threshold = std::stof(opts.substr(pos));

Copilot uses AI. Check for mistakes.
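The `std::strtof` alternative named in the comment above could look roughly like this. Note that `std::stof` takes a `const std::string&`, so either `std::stof` spelling ends up constructing a `std::string`, whereas `std::strtof` reads the C string directly and reports where parsing stopped. `parse_float_at` is a hypothetical helper, not code from the PR.

```cpp
#include <cctype>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical helper (not part of the PR): parse a float starting at `pos`
// in `opts`. On success, `pos` is advanced past the consumed characters.
std::optional<float> parse_float_at(const std::string& opts, std::size_t& pos) {
    // Skip leading whitespace, as the surrounding code in the diff does.
    while (pos < opts.size() &&
           std::isspace(static_cast<unsigned char>(opts[pos]))) {
        ++pos;
    }
    if (pos >= opts.size()) return std::nullopt;

    const char* begin = opts.c_str() + pos;
    char* end = nullptr;
    errno = 0;
    const float value = std::strtof(begin, &end);

    // No characters consumed: there was no number at this position.
    if (end == begin) return std::nullopt;
    // ERANGE: the value was out of range for float.
    if (errno == ERANGE) return std::nullopt;

    pos += static_cast<std::size_t>(end - begin);
    return value;
}
```

Unlike `std::stof`, this reports failure through the return value instead of throwing, so the surrounding `try` block would no longer be needed for this call.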
Comment on lines +158 to +202
+     std::vector<std::vector<float>> chunks;

      if (use_vad) {
          auto* vad = static_cast<SileroVADModel*>(handle->vad_model.get());
-         auto segments = vad->get_speech_timestamps(audio_buffer, {});

-         std::vector<float> speech_audio;
-         for (const auto& segment : segments) {
-             speech_audio.insert(
-                 speech_audio.end(),
-                 audio_buffer.begin() + segment.start,
-                 audio_buffer.begin() + std::min(segment.end, audio_buffer.size())
+         auto vad_segments = vad->get_speech_timestamps(audio_samples, {});
+         chunks.reserve(vad_segments.size());

+         std::vector<float> current;
+         for (const auto& seg : vad_segments) {
+             size_t end = std::min(seg.end, audio_samples.size());
+             if (current.size() + (end - seg.start) > MAX_CHUNK_SAMPLES) {
+                 chunks.emplace_back(std::move(current));
+                 current = {};
+             }
+             current.insert(
+                 current.end(),
+                 audio_samples.begin() + seg.start,
+                 audio_samples.begin() + end
              );
          }
-         audio_buffer = std::move(speech_audio);

-         if (audio_buffer.empty()) {
-             CACTUS_LOG_DEBUG("transcribe", "VAD detected only silence, returning empty transcription");
+         if (!current.empty()) {
+             chunks.emplace_back(std::move(current));
+         }

+         if (chunks.empty()) {
+             CACTUS_LOG_DEBUG("transcribe", "VAD detected only silence, returning empty transcription");
              auto vad_end_time = std::chrono::high_resolution_clock::now();
              double vad_total_time_ms = std::chrono::duration_cast<std::chrono::microseconds>(vad_end_time - start_time).count() / 1000.0;

              std::string json = construct_response_json("", {}, 0.0, vad_total_time_ms, 0.0, 0.0, 0, 0, 1.0f);

              if (json.size() >= buffer_size) {
                  handle_error_response("Response buffer too small", response_buffer, buffer_size);
                  cactus::telemetry::recordTranscription(handle->model_name.c_str(), false, 0.0, 0.0, 0.0, 0, 0.0, "Response buffer too small");
                  return -1;
              }

              cactus::telemetry::recordTranscription(handle->model_name.c_str(), true, 0.0, 0.0, vad_total_time_ms, 0, get_ram_usage_mb(), "");
              std::strcpy(response_buffer, json.c_str());
              return static_cast<int>(json.size());
          }
+     } else {
+         chunks.reserve((audio_samples.size() + MAX_CHUNK_SAMPLES - 1) / MAX_CHUNK_SAMPLES);
+         for (size_t start = 0; start < audio_samples.size(); start += MAX_CHUNK_SAMPLES) {
+             size_t end = std::min(start + MAX_CHUNK_SAMPLES, audio_samples.size());
+             chunks.emplace_back(audio_samples.begin() + start, audio_samples.begin() + end);
+         }

Copilot AI Mar 3, 2026

The current chunking strategy materializes std::vector<std::vector<float>> chunks and copies audio samples into each chunk (emplace_back with iterator ranges / current.insert). For long inputs this duplicates audio in memory and can significantly increase peak RAM usage. Consider iterating over the audio in a streaming fashion (store start/end indices or use spans/views) and build/process one chunk at a time instead of keeping all chunks resident.
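One way to read the suggestion above is to plan chunks as index ranges into the original buffer and copy samples for only one chunk at a time. The sketch below is an assumption-laden illustration, not the PR's code: `Range`, `ChunkPlan`, `plan_chunks`, and `materialize` are hypothetical names. It also splits a single segment that is longer than the chunk limit, which goes beyond what the diff shows.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Half-open sample range [start, end) into the original audio buffer.
struct Range {
    std::size_t start;
    std::size_t end;
};

// A planned chunk is a list of ranges, not a copy of the samples.
using ChunkPlan = std::vector<Range>;

// Hypothetical planner: greedily pack speech segments into chunks holding at
// most `max_chunk` samples each, without copying any audio.
std::vector<ChunkPlan> plan_chunks(const std::vector<Range>& segments,
                                   std::size_t total_samples,
                                   std::size_t max_chunk) {
    std::vector<ChunkPlan> plans;
    if (max_chunk == 0) return plans;

    ChunkPlan current;
    std::size_t current_len = 0;
    for (const Range& seg : segments) {
        std::size_t start = std::min(seg.start, total_samples);
        const std::size_t end = std::min(seg.end, total_samples);
        if (start >= end) continue;

        // Close the current chunk when the whole segment does not fit in it.
        if (!current.empty() && current_len + (end - start) > max_chunk) {
            plans.push_back(std::move(current));
            current = {};
            current_len = 0;
        }
        // Only reachable with an empty current chunk: split oversized segments.
        while (end - start > max_chunk) {
            plans.push_back(ChunkPlan{Range{start, start + max_chunk}});
            start += max_chunk;
        }
        current.push_back(Range{start, end});
        current_len += end - start;
    }
    if (!current.empty()) plans.push_back(std::move(current));
    return plans;
}

// Copy the samples for a single planned chunk. Only this one chunk is
// resident in addition to the original buffer.
std::vector<float> materialize(const std::vector<float>& audio,
                               const ChunkPlan& plan) {
    std::vector<float> out;
    for (const Range& r : plan) {
        out.insert(out.end(),
                   audio.begin() + static_cast<std::ptrdiff_t>(r.start),
                   audio.begin() + static_cast<std::ptrdiff_t>(r.end));
    }
    return out;
}
```

A caller would loop over the planned chunks, call `materialize` for each, transcribe it, and let the temporary buffer go out of scope before moving on. Peak extra memory is then about one chunk rather than a full second copy of the speech audio.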

Comment on lines 531 to 533

-     std::string audio_path = std::string(g_assets_path) + "/test.wav";
+     std::string audio_path = std::string(g_assets_path) + "/" + audio_file;
      std::cout << "Transcript: ";

Copilot AI Mar 3, 2026

g_assets_path is used to build audio_path without checking whether CACTUS_TEST_ASSETS is set. If CACTUS_TEST_TRANSCRIBE_MODEL is set but CACTUS_TEST_ASSETS is not, std::string(g_assets_path) will dereference a null pointer and crash the test binary. Consider adding a skip (similar to other tests) when !g_assets_path.
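A guard along the lines suggested above might look like this. Constructing a `std::string` from a null `const char*` is undefined behavior, so the check has to happen before the concatenation. `asset_path` is a hypothetical helper, and it is an assumption here that `g_assets_path` is a `const char*` that stays null when `CACTUS_TEST_ASSETS` is unset.

```cpp
#include <optional>
#include <string>

// Hypothetical helper (not from the PR): build the path to a test asset only
// when the assets directory is known. `assets_dir` stands in for
// g_assets_path, which is assumed to be null when CACTUS_TEST_ASSETS is unset.
std::optional<std::string> asset_path(const char* assets_dir,
                                      const std::string& file) {
    // std::string(nullptr) is undefined behavior, so check first.
    if (assets_dir == nullptr || *assets_dir == '\0') return std::nullopt;
    return std::string(assets_dir) + "/" + file;
}
```

In the test, an empty result would translate into skipping the case, in the same way the comment says other tests already skip when their prerequisites are missing.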

@HenryNdubuaku HenryNdubuaku merged commit c4e54fb into main Mar 3, 2026
8 of 10 checks passed

3 participants