
model: Add Qwen3-ASR batch transcription engine. #957

Open
andrewleech wants to merge 2 commits into cjpais:main from andrewleech:feat/qwen3-batch-standalone

Conversation

@andrewleech

Summary

Qwen3-ASR is a multilingual speech recognition model (12 languages) with two size variants (0.6B, 1.7B). I'm adding it as a new batch transcription engine backed by transcribe-rs PR #48.

Qwen3 models are distributed as individual ONNX files (encoder, decoder, embeddings, tokenizer, config) rather than a single tar.gz archive. The existing download infrastructure only handled single-file tar.gz downloads, so the model manager now supports multi-file downloads with per-file SHA-256 verification, resume, and cancellation. The shared download logic was extracted into a download_single_file helper used by both the single-file and multi-file paths.

```mermaid
flowchart LR
    A[download_model] --> B{files empty?}
    B -->|yes| C[download_single_file<br/>tar.gz path]
    B -->|no| D[download_model_files]
    D --> E[verify existing files<br/>SHA-256 + size]
    E --> F[download_single_file<br/>per missing file]
    F --> G[remove .incomplete marker]
```
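The verify-existing-files step above can be sketched as pure decision logic. This is a hypothetical illustration, not the PR's actual code: `FileEntry` and `files_to_download` are made-up names, and real hashing (the sha2 crate) is abstracted behind a precomputed `(size, hash)` map so the selection logic stays visible.

```rust
use std::collections::HashMap;

// Hypothetical manifest entry; field names are illustrative, not the PR's struct.
struct FileEntry {
    name: &'static str,
    expected_size: u64,
    expected_sha256: &'static str,
}

/// Decide which files still need downloading: a file is skipped only when it
/// exists locally with the expected size AND a matching SHA-256. `local` maps
/// file name -> (size, sha256 hex) for files already on disk.
fn files_to_download<'a>(
    manifest: &'a [FileEntry],
    local: &HashMap<&str, (u64, String)>,
) -> Vec<&'a str> {
    manifest
        .iter()
        .filter(|f| match local.get(f.name) {
            Some((size, hash)) => {
                *size != f.expected_size || hash.as_str() != f.expected_sha256
            }
            None => true, // missing entirely
        })
        .map(|f| f.name)
        .collect()
}

fn main() {
    let manifest = [
        FileEntry { name: "encoder.onnx", expected_size: 10, expected_sha256: "aaa" },
        FileEntry { name: "decoder.onnx", expected_size: 20, expected_sha256: "bbb" },
        FileEntry { name: "tokenizer.json", expected_size: 5, expected_sha256: "ccc" },
    ];
    let mut local = HashMap::new();
    local.insert("encoder.onnx", (10, "aaa".to_string())); // intact -> skipped
    local.insert("decoder.onnx", (20, "xxx".to_string())); // hash mismatch -> re-download
    let todo = files_to_download(&manifest, &local);
    assert_eq!(todo, vec!["decoder.onnx", "tokenizer.json"]);
    println!("{todo:?}");
}
```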

The second commit (chore: temporarily pin transcribe-rs to git branch) should be dropped once the upstream transcribe-rs PR merges and a crates.io release includes qwen3 support.

Testing

  • cargo check clean, 35 unit tests pass
  • Frontend lint clean
  • Not yet runtime-tested (model files not yet uploaded to HuggingFace)

Trade-offs and Alternatives

The model registry entries are verbose (~160 lines each for the 9-file manifests with SHA-256 hashes). An alternative would be loading manifests from external JSON, but that adds a fetch dependency to model discovery. Keeping them inline is consistent with existing model entries and avoids the extra complexity.

@cjpais
Owner

cjpais commented Mar 4, 2026

We probably should do a tar if we are bringing this in rather than a new methodology.

We can do verification but it should be a separate PR

@andrewleech
Author

> We probably should do a tar if we are bringing this in rather than a new methodology.

Do you mean tar the individual model files into a single download?

Would you want to keep it as a tar on disk and unpack on the fly during load, or just unpack at the end of download?

@cjpais
Owner

cjpais commented Mar 5, 2026

Look at how the other ones are done

@andrewleech andrewleech force-pushed the feat/qwen3-batch-standalone branch 2 times, most recently from 859d1d9 to 5ce1bd7 Compare March 6, 2026 03:23
Contributor

@VirenMohindra left a comment


Nice work on the engine integration — follows the existing patterns cleanly. Left a few inline comments on things to address.

tar = "0.4.44"
flate2 = "1.0"
transcribe-rs = { version = "0.2.8", features = ["whisper", "parakeet", "moonshine", "sense_voice", "gigaam"] }
transcribe-rs = { git = "https://github.com/andrewleech/transcribe-rs", branch = "feat/qwen3-batch", features = ["whisper", "parakeet", "moonshine", "sense_voice", "gigaam", "qwen3"] }
Contributor

@VirenMohindra Mar 16, 2026


This pins the entire project to a personal fork branch. If andrewleech/transcribe-rs feat/qwen3-batch goes away, Handy can't build. This should wait for the upstream transcribe-rs PR to merge and ship on crates.io before merging.

Author


Yes, this is all still WIP (I should have made this a draft PR, sorry). If/when my library PR there is finished and merged, I'll fix this.

Author


Agreed — this is in a separate DROP: commit at the tip specifically so it can be dropped cleanly when transcribe-rs PR #48 merges. The commit message makes this explicit.

partial_size: 0,
is_directory: true,
engine_type: EngineType::Qwen3,
accuracy_score: 0.75,
Contributor


These HuggingFace URLs point to personal repos (andrewleech/qwen3-asr-*-onnx). Are the model files uploaded yet? The PR description says "not yet runtime-tested (model files not yet uploaded to HuggingFace)" — nobody can verify this works until they are. The 1.7B model is 7.3 GB, worth runtime testing before merge.

Author


I've been fixing, testing, and improving the models in various formats over the last few days; the current uploads are hopefully pretty well release-ready. I'm just reviewing my transcribe-rs changes (which use these) more thoroughly before I push that up again, then I'll update this PR to match.

Author


Models are now uploaded and download URLs are live — both FP32 and Int4 variants for 0.6B and 1.7B. Runtime tested on Windows with the 0.6B Int4 model.

supported_languages: vec![],
is_custom: true,
},
},
Contributor


Unrelated whitespace change — should be dropped from this PR.

Author


Resolved in subsequent rebases — the current diff has no stray whitespace changes.

}
LoadedEngine::GigaAM(gigaam_engine) => gigaam_engine
.transcribe_samples(audio, None)
.map_err(|e| anyhow::anyhow!("GigaAM transcription failed: {}", e)),
Contributor


transcribe_samples(audio, None) — no inference params despite Qwen3 supporting 12 languages. Fine for an initial PR, but worth a follow-up to wire up language selection like Whisper/SenseVoice do. Users selecting e.g. zh-Hans in settings won't get that passed through to the engine.

Author


Qwen3-ASR auto-detects language from the audio — there's no language token in the prompt template (unlike Whisper). The transcribe-rs Qwen3 engine only exposes max_tokens, with no language field. supports_language_selection: false is set correctly to prevent the UI from showing a language selector that would have no effect. If the engine adds language hints in the future, this can be wired up then.

Author


Good call — this is feasible as a follow-up. I investigated the Qwen3-ASR prompt structure and there is a mechanism for language hinting.

How the model handles language:

The model outputs language {Name}<asr_text>{transcription} — it auto-detects the language, emits a language tag, then the <asr_text> special token (ID 151704), then the transcription. The strip_language_prefix function in transcribe-rs removes this prefix before returning the result.
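Assuming that output shape, a minimal prefix stripper could look like the sketch below; this is an illustration of the described behavior, and the real strip_language_prefix in transcribe-rs may differ in details.

```rust
/// Remove the "language {Name}<asr_text>" prefix the model emits before
/// the transcription, returning only the transcription text.
fn strip_language_prefix(decoded: &str) -> &str {
    // Raw decoder output looks like: "language English<asr_text>hello world"
    match decoded.split_once("<asr_text>") {
        Some((_prefix, transcription)) => transcription,
        None => decoded, // no prefix found; return unchanged
    }
}

fn main() {
    assert_eq!(
        strip_language_prefix("language English<asr_text>hello world"),
        "hello world"
    );
    assert_eq!(strip_language_prefix("no prefix here"), "no prefix here");
    println!("ok");
}
```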

How language forcing would work:

Instead of letting the model generate the language tag, you prefill the assistant turn with the language prefix tokens. The prompt currently ends with:

<|im_start|>assistant\n

To force e.g. Chinese, you'd extend it to:

<|im_start|>assistant\n language Chinese<asr_text>

The token IDs are: [11528, <lang_name_token>, 151704] where 11528 = "language", the language name is a single BPE token (e.g. 6364=English, 8453=Chinese, 10769=Japanese), and 151704 = <asr_text>.

The decoder then starts generating from after the prefilled tokens, skipping auto-detection entirely and producing transcription in the specified language.

What would need to change (transcribe-rs):

  • Add a language-name-to-token-ID mapping (~57 entries matching KNOWN_LANGUAGES)
  • Add language: Option<String> to Qwen3Params
  • Extend build_prompt_ids to append [11528, lang_token, 151704] after the assistant prefix when language is set
  • Adjust strip_language_prefix to not strip when language was forced (it's already been consumed by the prefill)
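The prefill extension above can be sketched like this; append_language_prefill and language_token are hypothetical names rather than transcribe-rs API, and only the three language tokens quoted above are included.

```rust
// Token IDs taken from the discussion above: 11528 = "language",
// 151704 = <asr_text>; each language name is a single BPE token.
fn language_token(lang: &str) -> Option<u32> {
    match lang {
        "English" => Some(6364),
        "Chinese" => Some(8453),
        "Japanese" => Some(10769),
        // ...a full mapping would cover the ~57 KNOWN_LANGUAGES entries
        _ => None,
    }
}

/// Append the forced-language prefill after the assistant prefix so the
/// decoder skips auto-detection: [11528, <lang token>, 151704].
fn append_language_prefill(prompt_ids: &mut Vec<u32>, language: Option<&str>) {
    if let Some(tok) = language.and_then(language_token) {
        prompt_ids.extend_from_slice(&[11528, tok, 151704]);
    }
}

fn main() {
    // Stand-in for the tokenized "<|im_start|>assistant\n" prompt tail.
    let mut ids = vec![1, 2, 3];
    append_language_prefill(&mut ids, Some("Chinese"));
    assert_eq!(ids, vec![1, 2, 3, 11528, 8453, 151704]);

    // Unknown or unset language leaves the prompt untouched (auto-detect).
    let mut ids2 = vec![1, 2, 3];
    append_language_prefill(&mut ids2, None);
    assert_eq!(ids2, vec![1, 2, 3]);
    println!("ok");
}
```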

What would need to change (Handy):

  • Pass options.language through to the Qwen3 engine (currently uses TranscribeOptions::default() since supports_language_selection: false)
  • Set supports_language_selection: true on the Qwen3 model entries
  • The frontend language selector would then appear for Qwen3 models, and the selected language would flow through TranscribeOptions.language → Qwen3Params.language → prefilled prompt tokens

I'll track this as a follow-up to the transcribe-rs Qwen3 engine.

@andrewleech andrewleech force-pushed the feat/qwen3-batch-standalone branch from 5ce1bd7 to 0774612 Compare March 18, 2026 13:14
@andrewleech
Author

andrewleech commented Mar 18, 2026

Updated to address review feedback and rebase onto current main.

Changes since the original PR:

  • Switched from multi-file downloads to tar.gz archives as requested — Qwen3 models now download as a single .tar.gz and extract to a directory, matching the existing Parakeet/Moonshine/SenseVoice pattern. The ModelFile struct, per-file SHA-256 verification, .incomplete markers, and download_model_files() method are all removed. The sha2 crate dependency is also dropped.

  • Rebased onto current main including the transcribe-rs 0.3.1 migration (SpeechModel trait, onnx::* module paths), Canary engine, DirectML acceleration, and other upstream changes. The Qwen3 integration now uses the new API (Qwen3Model::load(&path, &Quantization) + model.transcribe(&audio, &options)).

  • Expanded language support from 12 to 55 languages, matching the engine's actual KNOWN_LANGUAGES list cross-referenced against the frontend language constants.

  • Translated model descriptions in all 17 locale files (previously English-only placeholders in non-English locales).

  • Added HTTP read_timeout(30s) to the download client and documented lock ordering in finish_downloading.

  • Int4 quantized variants are supported in the transcription code path (quantization derived from model ID suffix) but the int4 model entries are not yet registered — they'll be added once the int4 tar.gz archives finish uploading to HuggingFace.

  • FP32 and Int4 model archives:

The DROP commit (pin transcribe-rs to git branch) should be dropped when cjpais/transcribe-rs#48 is merged.
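The quantization-from-model-ID-suffix derivation mentioned above can be sketched as follows; the "-int4" suffix convention and the enum shape here are assumptions based on the description, not the actual PR code.

```rust
// Hypothetical illustration: derive the quantization variant to pass to
// Qwen3Model::load from the model ID's suffix.
#[derive(Debug, PartialEq)]
enum Quantization {
    Fp32,
    Int4,
}

fn quantization_from_model_id(model_id: &str) -> Quantization {
    if model_id.ends_with("-int4") {
        Quantization::Int4
    } else {
        Quantization::Fp32 // default for un-suffixed (FP32) entries
    }
}

fn main() {
    assert_eq!(quantization_from_model_id("qwen3-asr-0.6b-int4"), Quantization::Int4);
    assert_eq!(quantization_from_model_id("qwen3-asr-1.7b"), Quantization::Fp32);
    println!("ok");
}
```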

@VirenMohindra
Contributor

VirenMohindra commented Mar 19, 2026

Great job. I'll take a look at this holistically and test e2e once cjpais/transcribe-rs#48 lands. Feel free to ping me directly!

@andrewleech andrewleech force-pushed the feat/qwen3-batch-standalone branch 2 times, most recently from f85dc4b to 0f12525 Compare March 20, 2026 00:39
DylanBricar added a commit to DylanBricar/Phonara that referenced this pull request Mar 20, 2026
- Local OpenAI-compatible API server on /v1/audio/transcriptions (cjpais#509)
- GNOME system shortcuts via gsettings for Wayland (cjpais#572)
- Wake word detection infrastructure with settings (cjpais#618)
- Live transcription mode settings (overlay/clipboard) (cjpais#832)
- Qwen3-ASR engine type placeholder for future transcribe-rs (cjpais#957)
- Prioritized microphone device list with fallback (cjpais#1070)
- Flatpak detection helper (cjpais#548)
- Storybook dev script (cjpais#784)
@andrewleech andrewleech force-pushed the feat/qwen3-batch-standalone branch 2 times, most recently from e431445 to 606d124 Compare March 23, 2026 05:17
pi-anl added 2 commits March 25, 2026 13:06
Add Qwen3-ASR 0.6B and 1.7B models (FP32 and Int4 variants) using
tar.gz archive downloads. Uses the new transcribe-rs SpeechModel API
with quantization derived from model ID suffix.
@andrewleech andrewleech force-pushed the feat/qwen3-batch-standalone branch from 606d124 to e124e52 Compare March 25, 2026 02:06