feat(stt): custom vocabulary biasing for all speech models by vyomshah05 · Pull Request #451 · cactus-compute/cactus

vyomshah05 · 2026-02-25T22:27:25Z

Part of #396.

Adds custom vocabulary biasing for all speech models (Whisper, Moonshine).

This PR supersedes #436, which added the bias infrastructure to WhisperModel only. Following the review, the implementation now lives in the base Model class, so it applies to all speech models.

Changes

cactus/engine/engine.h: added vocab_bias_ field and set_vocab_bias() to base Model class so all speech models inherit it automatically
cactus/engine/engine_model.cpp: merge vocab_bias_ into tool_constrainer bias map inside Model::decode() before passing to gb->sample()
cactus/ffi/cactus_transcribe.cpp: parse custom_vocabulary and vocabulary_boost from options_json, tokenize each word, call set_vocab_bias()
cactus/ffi/cactus_stream.cpp: same parsing for the streaming path
tests/test_stt.cpp: added test_vocab_bias_base_class, which verifies the full chain: JSON parsing → tokenization → bias map → decode
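
For illustration, the option parsing step listed above could look roughly like the sketch below. This is not the parser used in cactus_transcribe.cpp: the helper names extract_string_array and extract_number are hypothetical, and the shape of options_json (a string array under custom_vocabulary and a number under vocabulary_boost) is an assumption inferred from the option names. The helpers handle only flat JSON without escaped quotes or brackets inside strings.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: collect the string elements of a flat JSON array
// stored under `key`. No support for escapes or nested structures.
std::vector<std::string> extract_string_array(const std::string& json,
                                              const std::string& key) {
    std::vector<std::string> out;
    const std::string needle = "\"" + key + "\"";
    const size_t pos = json.find(needle);
    if (pos == std::string::npos) return out;
    const size_t open = json.find('[', pos + needle.size());
    if (open == std::string::npos) return out;
    const size_t close = json.find(']', open);
    if (close == std::string::npos) return out;

    size_t i = open + 1;
    while (i < close) {
        const size_t q1 = json.find('"', i);        // opening quote
        if (q1 == std::string::npos || q1 >= close) break;
        const size_t q2 = json.find('"', q1 + 1);   // closing quote
        if (q2 == std::string::npos || q2 > close) break;
        out.push_back(json.substr(q1 + 1, q2 - q1 - 1));
        i = q2 + 1;
    }
    return out;
}

// Hypothetical helper: read a numeric value stored under `key`,
// returning `fallback` when the key is missing or not a number.
float extract_number(const std::string& json, const std::string& key,
                     float fallback) {
    const std::string needle = "\"" + key + "\"";
    const size_t pos = json.find(needle);
    if (pos == std::string::npos) return fallback;
    const size_t colon = json.find(':', pos + needle.size());
    if (colon == std::string::npos) return fallback;
    const char* start = json.c_str() + colon + 1;
    char* end = nullptr;
    const float value = std::strtof(start, &end);   // skips leading spaces
    return end == start ? fallback : value;
}
```

Each extracted word would then be passed through the model's tokenizer, and the resulting token ids handed to set_vocab_bias() together with the boost.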

How it works

When custom_vocabulary is passed in options_json, the FFI layer tokenizes
each word and builds a token_id → boost map. Inside Model::decode() this
map is merged with the existing tool constrainer bias and passed to gb->sample().
Boost values are clamped to [0, 20] to prevent degenerate outputs.
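
The flow described above can be sketched in a few lines. This is a minimal illustration, not the code from the PR: the BiasMap alias and the functions clamp_boost, build_vocab_bias, and merge_bias are hypothetical names, and the choice to sum boosts when a token id appears in both maps is an assumption, since the description does not say how collisions are resolved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical alias: token id -> additive bias applied at sampling time.
using BiasMap = std::unordered_map<uint32_t, float>;

// Clamp a requested boost to the [0, 20] range stated in the description.
inline float clamp_boost(float boost) {
    return std::clamp(boost, 0.0f, 20.0f);
}

// Build a token_id -> boost map from words that are already tokenized.
// Every token id of every word receives the same clamped boost.
BiasMap build_vocab_bias(const std::vector<std::vector<uint32_t>>& tokenized_words,
                         float boost) {
    const float clamped = clamp_boost(boost);
    assert(clamped >= 0.0f && clamped <= 20.0f);
    BiasMap bias;
    for (const auto& token_ids : tokenized_words) {
        for (uint32_t id : token_ids) {
            bias[id] = clamped;
        }
    }
    return bias;
}

// Merge the vocabulary bias into an existing constrainer bias map.
// Assumption: biases for the same token id are summed.
BiasMap merge_bias(const BiasMap& constrainer_bias, const BiasMap& vocab_bias) {
    BiasMap merged = constrainer_bias;
    for (const auto& [id, boost] : vocab_bias) {
        merged[id] += boost;  // operator[] value-initializes missing entries to 0
    }
    return merged;
}
```

With this sketch, a requested boost of 50 is stored as 20, and tokens that appear only in the constrainer map keep their original bias after the merge.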

Testing

All stt tests pass. Debug output from cactus_transcribe.cpp confirmed the full chain works:

[vocab_bias] parsed 3 words, boost=15
  word='Omeprazole' token_ids=46 76 595 424 4765 306
  word='HIPAA' token_ids=39 9139 5265
  word='Cactus' token_ids=34 34775

Signed-off-by: ammesatyajit <[email protected]>

feat(stt): add custom vocabulary hotword biasing for transcription

4a7993b


ammesatyajit force-pushed the feat/custom-vocabulary-hotword-biasing branch from 8c3c543 to 4a7993b on March 7, 2026 12:21

HenryNdubuaku merged commit d5cd5c0 into cactus-compute:main Mar 7, 2026
5 of 6 checks passed
