Add dynamic max_tokens estimation based on audio length in cactus_tra…#499
…nscribe Signed-off-by: jakmro <[email protected]>
Pull request overview
Adds an automatic max_tokens default for cactus_transcribe requests by estimating token budget from the input audio duration, so longer audio doesn’t prematurely stop at the prior fixed default.
Changes:
- Detect the absence of "max_tokens" in options_json and compute an estimated token budget from the audio_samples duration.
- Apply different tokens/sec heuristics depending on whether the model is Parakeet vs non-Parakeet.
if (opts.find("\"max_tokens\"") == std::string::npos) {
    const float audio_length_sec = static_cast<float>(audio_samples.size()) / static_cast<float>(WHISPER_SAMPLE_RATE);
    const float tps = is_parakeet ? 30.0f : 20.0f;
    const size_t estimated = static_cast<size_t>(audio_length_sec * tps);
    max_tokens = std::max<size_t>(estimated, 100);
}
The dynamic default reintroduces hard-coded constants (100, and the chosen tokens/sec values) instead of building on the existing defaults coming from parse_options_json. Since max_tokens is already initialized to its default before this block, consider using the current max_tokens value as the floor (and possibly moving the tps values to config/constants) so the default behavior stays consistent if parse_options_json defaults change later.
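One way this suggestion could look is sketched below. It is only an illustration: the helper name `estimate_max_tokens`, the constant names, and the idea of passing the sample rate and the already-parsed default as parameters are assumptions, not code from the PR. The tokens/sec values are copied from the diff.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical named constants; the values mirror the literals in the diff.
constexpr float kParakeetTokensPerSec = 30.0f;
constexpr float kDefaultTokensPerSec = 20.0f;

// Hypothetical helper: returns the token budget for a request that did not
// set "max_tokens". `default_max_tokens` is assumed to be the value that
// parse_options_json already produced, so it serves as the floor.
inline std::size_t estimate_max_tokens(std::size_t num_samples,
                                       std::size_t sample_rate,
                                       bool is_parakeet,
                                       std::size_t default_max_tokens) {
    // Avoid dividing by zero if the sample rate is misconfigured.
    if (sample_rate == 0) {
        return default_max_tokens;
    }
    const float audio_length_sec =
        static_cast<float>(num_samples) / static_cast<float>(sample_rate);
    const float tps = is_parakeet ? kParakeetTokensPerSec : kDefaultTokensPerSec;
    const std::size_t estimated = static_cast<std::size_t>(audio_length_sec * tps);
    // Never go below the existing default, whatever it currently is.
    return std::max(estimated, default_max_tokens);
}
```

At the call site this might read `max_tokens = estimate_max_tokens(audio_samples.size(), WHISPER_SAMPLE_RATE, is_parakeet, max_tokens);`, which keeps the floor tied to whatever default parse_options_json supplies.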
if (opts.find("\"max_tokens\"") == std::string::npos) {
    const float audio_length_sec = static_cast<float>(audio_samples.size()) / static_cast<float>(WHISPER_SAMPLE_RATE);
    const float tps = is_parakeet ? 30.0f : 20.0f;
    const size_t estimated = static_cast<size_t>(audio_length_sec * tps);
    max_tokens = std::max<size_t>(estimated, 100);
}
max_tokens now scales with total audio length with no explicit upper bound when the caller doesn’t provide max_tokens. For very long audio inputs this can substantially increase decode work across many chunks (previously the default 100 would stop early), which is a potential resource/abuse risk for any service exposing this FFI. Consider clamping the estimated value to a reasonable configured maximum (or a maximum derived from chunk count/max_tps) to keep worst-case runtime bounded by policy.
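A minimal sketch of the clamping idea follows. The helper `clamp_estimated_tokens` and the cap `kMaxEstimatedTokens` are hypothetical, and the value 4096 is purely illustrative; a real limit would come from whatever policy or configuration the project settles on.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical policy cap; 4096 is an illustrative value, not from the PR.
constexpr std::size_t kMaxEstimatedTokens = 4096;

// Hypothetical helper: bounds an audio-derived estimate between the existing
// default (floor) and a configured maximum (cap).
inline std::size_t clamp_estimated_tokens(std::size_t estimated,
                                          std::size_t floor_tokens,
                                          std::size_t cap_tokens = kMaxEstimatedTokens) {
    // If the cap is configured below the floor, the floor wins.
    const std::size_t upper = std::max(floor_tokens, cap_tokens);
    return std::min(std::max(estimated, floor_tokens), upper);
}
```

With a bound like this, the worst-case decode work for a request that omits max_tokens no longer grows with audio length; callers who need more can still pass max_tokens explicitly.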