Signed-off-by: jakmro <[email protected]>
Pull request overview
Adds support for “long transcription” by chunking audio inputs in the cactus_transcribe FFI path and expanding the STT test suite to cover a longer WAV fixture.
Changes:
- Updated STT tests to support selecting an audio fixture and added a new long-transcription test case.
- Implemented audio chunking (VAD-aware or fixed 30s chunks) in `cactus_transcribe` with limited text context carryover between chunks.
- Minor refactors in transcription preprocessing (e.g., mel normalization and option handling).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `tests/test_stt.cpp` | Renames the transcription test helper to accept an audio filename and adds `transcription_long` using `test_long.wav`. |
| `cactus/ffi/cactus_transcribe.cpp` | Adds chunk-based transcription flow for long audio inputs and related token/context handling. |
```diff
     while (pos < opts.size() && std::isspace(static_cast<unsigned char>(opts[pos]))) ++pos;
     try {
-        cloud_handoff_threshold = std::stof(opts.substr(pos));
+        cloud_handoff_threshold = std::stof(opts.c_str() + pos);
```
`std::stof` does not have an overload that accepts a `const char*`. `std::stof(opts.c_str() + pos)` will fail to compile; use `std::stof(opts.substr(pos))` (as before) or parse with `std::strtof`/`std::from_chars` while tracking the end position.
Suggested change:
```diff
-        cloud_handoff_threshold = std::stof(opts.c_str() + pos);
+        cloud_handoff_threshold = std::stof(opts.substr(pos));
```
```diff
+    std::vector<std::vector<float>> chunks;
+
     if (use_vad) {
         auto* vad = static_cast<SileroVADModel*>(handle->vad_model.get());
-        auto segments = vad->get_speech_timestamps(audio_buffer, {});
-
-        std::vector<float> speech_audio;
-        for (const auto& segment : segments) {
-            speech_audio.insert(
-                speech_audio.end(),
-                audio_buffer.begin() + segment.start,
-                audio_buffer.begin() + std::min(segment.end, audio_buffer.size())
+        auto vad_segments = vad->get_speech_timestamps(audio_samples, {});
+        chunks.reserve(vad_segments.size());
+
+        std::vector<float> current;
+        for (const auto& seg : vad_segments) {
+            size_t end = std::min(seg.end, audio_samples.size());
+            if (current.size() + (end - seg.start) > MAX_CHUNK_SAMPLES) {
+                chunks.emplace_back(std::move(current));
+                current = {};
+            }
+            current.insert(
+                current.end(),
+                audio_samples.begin() + seg.start,
+                audio_samples.begin() + end
+            );
+        }
-        audio_buffer = std::move(speech_audio);
-
-        if (audio_buffer.empty()) {
-            CACTUS_LOG_DEBUG("transcribe", "VAD detected only silence, returning empty transcription");
+        if (!current.empty()) {
+            chunks.emplace_back(std::move(current));
+        }
+
+        if (chunks.empty()) {
+            CACTUS_LOG_DEBUG("transcribe", "VAD detected only silence, returning empty transcription");
             auto vad_end_time = std::chrono::high_resolution_clock::now();
             double vad_total_time_ms = std::chrono::duration_cast<std::chrono::microseconds>(vad_end_time - start_time).count() / 1000.0;

             std::string json = construct_response_json("", {}, 0.0, vad_total_time_ms, 0.0, 0.0, 0, 0, 1.0f);

             if (json.size() >= buffer_size) {
                 handle_error_response("Response buffer too small", response_buffer, buffer_size);
                 cactus::telemetry::recordTranscription(handle->model_name.c_str(), false, 0.0, 0.0, 0.0, 0, 0.0, "Response buffer too small");
                 return -1;
             }

             cactus::telemetry::recordTranscription(handle->model_name.c_str(), true, 0.0, 0.0, vad_total_time_ms, 0, get_ram_usage_mb(), "");
             std::strcpy(response_buffer, json.c_str());
             return static_cast<int>(json.size());
         }
+    } else {
+        chunks.reserve((audio_samples.size() + MAX_CHUNK_SAMPLES - 1) / MAX_CHUNK_SAMPLES);
+        for (size_t start = 0; start < audio_samples.size(); start += MAX_CHUNK_SAMPLES) {
+            size_t end = std::min(start + MAX_CHUNK_SAMPLES, audio_samples.size());
+            chunks.emplace_back(audio_samples.begin() + start, audio_samples.begin() + end);
+        }
```
The current chunking strategy materializes `std::vector<std::vector<float>> chunks` and copies audio samples into each chunk (`emplace_back` with iterator ranges / `current.insert`). For long inputs this duplicates the audio in memory and can significantly increase peak RAM usage. Consider iterating over the audio in a streaming fashion (store start/end indices or use spans/views) and build/process one chunk at a time instead of keeping all chunks resident.
```diff
-    std::string audio_path = std::string(g_assets_path) + "/test.wav";
+    std::string audio_path = std::string(g_assets_path) + "/" + audio_file;
     std::cout << "Transcript: ";
```
`g_assets_path` is used to build `audio_path` without checking whether `CACTUS_TEST_ASSETS` is set. If `CACTUS_TEST_TRANSCRIBE_MODEL` is set but `CACTUS_TEST_ASSETS` is not, `std::string(g_assets_path)` will dereference a null pointer and crash the test binary. Consider adding a skip (similar to other tests) when `!g_assets_path`.