model: Add Qwen3-ASR batch transcription engine. #957
andrewleech wants to merge 2 commits into cjpais:main
Conversation
We probably should do a tar if we are bringing this in rather than a new methodology. We can do verification but it should be a separate PR

Do you mean tar the individual model files into a single download? Would you want to keep it as a tar on disk and unpack on the fly during load, or just unpack at the end of download?

Look at how the other ones are done
Force-pushed 859d1d9 to 5ce1bd7
VirenMohindra left a comment:

nice work on the engine integration — follows the existing patterns cleanly. left a few inline comments on things to address.
src-tauri/Cargo.toml (outdated)

```diff
 tar = "0.4.44"
 flate2 = "1.0"
-transcribe-rs = { version = "0.2.8", features = ["whisper", "parakeet", "moonshine", "sense_voice", "gigaam"] }
+transcribe-rs = { git = "https://github.com/andrewleech/transcribe-rs", branch = "feat/qwen3-batch", features = ["whisper", "parakeet", "moonshine", "sense_voice", "gigaam", "qwen3"] }
```
this pins the entire project to a personal fork branch. if andrewleech/transcribe-rs feat/qwen3-batch goes away, handy can't build. this should wait for upstream transcribe-rs PR to merge and ship on crates.io before merging.
Yes this is all still WIP (should have made this a draft pr sorry - If/when my library PR there is finished & merged I'll fix this.
Agreed — this is in a separate DROP: commit at the tip specifically so it can be dropped cleanly when transcribe-rs PR #48 merges. The commit message makes this explicit.
```rust
partial_size: 0,
is_directory: true,
engine_type: EngineType::Qwen3,
accuracy_score: 0.75,
```
these huggingface URLs point to personal repos (andrewleech/qwen3-asr-*-onnx). are the model files uploaded yet? the PR description says "not yet runtime-tested (model files not yet uploaded to HuggingFace)" — nobody can verify this works until they are. the 1.7B model is 7.3 GB, worth runtime testing before merge.
I've been fixing, testing, and improving the models in various formats over the last few days; the current uploads are hopefully release ready. I'm just reviewing my transcribe-rs changes (which use these) more thoroughly before I push that up again, then I'll update this PR to match.
Models are now uploaded and download URLs are live — both FP32 and Int4 variants for 0.6B and 1.7B. Runtime tested on Windows with the 0.6B Int4 model.
src-tauri/src/managers/model.rs (outdated)

```rust
supported_languages: vec![],
is_custom: true,
},
},
```
unrelated whitespace change — should be dropped from this PR.
Resolved in subsequent rebases — the current diff has no stray whitespace changes.
```rust
}
LoadedEngine::GigaAM(gigaam_engine) => gigaam_engine
    .transcribe_samples(audio, None)
    .map_err(|e| anyhow::anyhow!("GigaAM transcription failed: {}", e)),
```
`transcribe_samples(audio, None)` — no inference params despite qwen3 supporting 12 languages. fine for an initial PR but worth a follow-up to wire up language selection like whisper/sensevoice do. users selecting e.g. zh-Hans in settings won't get that passed through to the engine.
Qwen3-ASR auto-detects language from the audio — there's no language token in the prompt template (unlike Whisper). The transcribe-rs Qwen3 engine only exposes `max_tokens`, with no language field. `supports_language_selection: false` is set correctly to prevent the UI from showing a language selector that would have no effect. If the engine adds language hints in the future, this can be wired up then.
Good call — this is feasible as a follow-up. I investigated the Qwen3-ASR prompt structure and there is a mechanism for language hinting.
How the model handles language:

The model outputs `language {Name}<asr_text>{transcription}` — it auto-detects the language, emits a language tag, then the `<asr_text>` special token (ID 151704), then the transcription. The `strip_language_prefix` function in transcribe-rs removes this prefix before returning the result.

How language forcing would work:

Instead of letting the model generate the language tag, you prefill the assistant turn with the language prefix tokens. The prompt currently ends with:

```
<|im_start|>assistant\n
```

To force e.g. Chinese, you'd extend it to:

```
<|im_start|>assistant\n language Chinese<asr_text>
```

The token IDs are `[11528, <lang_name_token>, 151704]`, where 11528 = "language", the language name is a single BPE token (e.g. 6364 = English, 8453 = Chinese, 10769 = Japanese), and 151704 = `<asr_text>`.
What would need to change (transcribe-rs):

- Add a language-name-to-token-ID mapping (~57 entries matching `KNOWN_LANGUAGES`)
- Add `language: Option<String>` to `Qwen3Params`
- Extend `build_prompt_ids` to append `[11528, lang_token, 151704]` after the assistant prefix when language is set
- Adjust `strip_language_prefix` to not strip when language was forced (it's already been consumed by the prefill)

What would need to change (Handy):

- Pass `options.language` through to the Qwen3 engine (currently uses `TranscribeOptions::default()` since `supports_language_selection: false`)
- Set `supports_language_selection: true` on the Qwen3 model entries
- The frontend language selector would then appear for Qwen3 models, and the selected language would flow through `TranscribeOptions.language` → `Qwen3Params.language` → prefilled prompt tokens

I'll track this as a follow-up to the transcribe-rs Qwen3 engine.
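A minimal Rust sketch of the prefill step described above. The token IDs are the ones quoted in this thread; the three-entry language map is an illustrative subset (the real change would cover all ~57 `KNOWN_LANGUAGES`), and the function names are hypothetical, not the actual transcribe-rs API:

```rust
// Token IDs quoted in this thread; the map below is a small illustrative
// subset, not the full ~57-entry KNOWN_LANGUAGES table.
const TOK_LANGUAGE: u32 = 11528; // "language"
const TOK_ASR_TEXT: u32 = 151704; // <asr_text>

fn lang_name_token(language: &str) -> Option<u32> {
    match language {
        "English" => Some(6364),
        "Chinese" => Some(8453),
        "Japanese" => Some(10769),
        _ => None,
    }
}

// Hypothetical extension of build_prompt_ids: append the forced-language
// prefix after the "<|im_start|>assistant\n" tokens when a language is set,
// so the decoder skips auto-detection entirely.
fn extend_prompt_ids(mut prompt_ids: Vec<u32>, language: Option<&str>) -> Vec<u32> {
    if let Some(tok) = language.and_then(lang_name_token) {
        prompt_ids.extend_from_slice(&[TOK_LANGUAGE, tok, TOK_ASR_TEXT]);
    }
    prompt_ids
}

fn main() {
    let assistant_prefix = vec![1, 2, 3]; // stand-in for the real prompt IDs
    assert_eq!(
        extend_prompt_ids(assistant_prefix.clone(), Some("Chinese")),
        vec![1, 2, 3, 11528, 8453, 151704]
    );
    assert_eq!(extend_prompt_ids(assistant_prefix, None), vec![1, 2, 3]);
    println!("ok");
}
```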
Force-pushed 5ce1bd7 to 0774612
Updated to address review feedback and rebase onto current main. Changes since the original PR:

The DROP commit (
Force-pushed c80f923 to ca4e555
great job. I'll take a look at this holistically and test e2e once cjpais/transcribe-rs#48 lands. feel free to ping me directly!
Force-pushed f85dc4b to 0f12525
- Local OpenAI-compatible API server on /v1/audio/transcriptions (cjpais#509)
- GNOME system shortcuts via gsettings for Wayland (cjpais#572)
- Wake word detection infrastructure with settings (cjpais#618)
- Live transcription mode settings (overlay/clipboard) (cjpais#832)
- Qwen3-ASR engine type placeholder for future transcribe-rs (cjpais#957)
- Prioritized microphone device list with fallback (cjpais#1070)
- Flatpak detection helper (cjpais#548)
- Storybook dev script (cjpais#784)
Force-pushed e431445 to 606d124
Add Qwen3-ASR 0.6B and 1.7B models (FP32 and Int4 variants) using tar.gz archive downloads. Uses the new transcribe-rs SpeechModel API with quantization derived from model ID suffix.
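The suffix-based derivation mentioned above could look roughly like this; the enum, function name, and model ID strings are assumptions for illustration, not the PR's actual code:

```rust
// Hypothetical sketch: derive the quantization variant from a model ID
// suffix such as "-int4" or "-fp32"; names are illustrative, not Handy's.
#[derive(Debug, PartialEq)]
enum Quantization {
    Fp32,
    Int4,
}

fn quantization_from_id(model_id: &str) -> Quantization {
    if model_id.ends_with("-int4") {
        Quantization::Int4
    } else {
        // "-fp32" and unsuffixed IDs fall through to full precision
        Quantization::Fp32
    }
}

fn main() {
    assert_eq!(quantization_from_id("qwen3-asr-0.6b-int4"), Quantization::Int4);
    assert_eq!(quantization_from_id("qwen3-asr-1.7b-fp32"), Quantization::Fp32);
    println!("ok");
}
```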
Force-pushed 606d124 to e124e52
Summary
Qwen3-ASR is a multilingual speech recognition model (12 languages) with two size variants (0.6B, 1.7B). I'm adding it as a new batch transcription engine backed by transcribe-rs PR #48.
Qwen3 models are distributed as individual ONNX files (encoder, decoder, embeddings, tokenizer, config) rather than a single tar.gz archive. The existing download infrastructure only handled single-file tar.gz downloads, so the model manager now supports multi-file downloads with per-file SHA-256 verification, resume, and cancellation. The shared download logic was extracted into a `download_single_file` helper used by both the single-file and multi-file paths.

```mermaid
flowchart LR
    A[download_model] --> B{files empty?}
    B -->|yes| C[download_single_file<br/>tar.gz path]
    B -->|no| D[download_model_files]
    D --> E[verify existing files<br/>SHA-256 + size]
    E --> F[download_single_file<br/>per missing file]
    F --> G[remove .incomplete marker]
```

The second commit (`chore: temporarily pin transcribe-rs to git branch`) should be dropped once the upstream transcribe-rs PR merges and a crates.io release includes qwen3 support.

Testing

- `cargo check` clean, 35 unit tests pass
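The resume step in the flow above (skip files that already exist and verify) can be sketched as follows. `needs_download` is a hypothetical helper, not the PR's actual code, and the SHA-256 comparison is omitted here since the standard library has no hash function (the real check would use a hashing crate such as sha2):

```rust
use std::fs;
use std::path::Path;

// Hypothetical helper mirroring the resume step: a file is skipped only if
// it already exists with the expected size. The real per-file check in the
// PR also compares a SHA-256 digest, omitted in this sketch.
fn needs_download(path: &Path, expected_size: u64) -> bool {
    match fs::metadata(path) {
        Ok(meta) => !meta.is_file() || meta.len() != expected_size,
        Err(_) => true, // missing or unreadable: (re)download
    }
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("qwen3-dl-sketch");
    fs::create_dir_all(&dir)?;
    let file = dir.join("encoder.onnx");
    fs::write(&file, b"12345")?;
    assert!(!needs_download(&file, 5)); // complete: skip
    assert!(needs_download(&file, 999)); // size mismatch: re-download
    assert!(needs_download(&dir.join("decoder.onnx"), 10)); // missing
    println!("ok");
    Ok(())
}
```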
The model registry entries are verbose (~160 lines each for the 9-file manifests with SHA-256 hashes). An alternative would be loading manifests from external JSON, but that adds a fetch dependency to model discovery. Keeping them inline is consistent with existing model entries and avoids the extra complexity.
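For a rough picture of why the inline entries run long, one file record of a 9-file manifest might be shaped like this; the struct and field names are hypothetical, not Handy's actual types:

```rust
// Illustrative only: hypothetical manifest types, each Qwen3 entry would
// carry ~9 such file records (URL, size, SHA-256), hence ~160 lines each.
struct ModelFile {
    filename: &'static str,
    url: &'static str,
    size_bytes: u64,
    sha256: &'static str,
}

struct ModelEntry {
    id: &'static str,
    files: &'static [ModelFile],
}

fn main() {
    let entry = ModelEntry {
        id: "qwen3-asr-0.6b-int4", // assumed ID format
        files: &[ModelFile {
            filename: "encoder.onnx",
            url: "https://example.invalid/encoder.onnx", // placeholder
            size_bytes: 0,                               // placeholder
            sha256: "0000000000000000",                  // placeholder
        }],
    };
    assert_eq!(entry.files.len(), 1);
    println!("{}", entry.id);
}
```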