
model: Add Qwen3-ASR batch transcription engine. #957

Open
andrewleech wants to merge 2 commits into cjpais:main from andrewleech:feat/qwen3-batch-standalone

Conversation

@andrewleech

Summary

Qwen3-ASR is a multilingual speech recognition model (12 languages) with two size variants (0.6B, 1.7B). I'm adding it as a new batch transcription engine backed by transcribe-rs PR #48.

Qwen3 models are distributed as individual ONNX files (encoder, decoder, embeddings, tokenizer, config) rather than a single tar.gz archive. The existing download infrastructure only handled single-file tar.gz downloads, so the model manager now supports multi-file downloads with per-file SHA-256 verification, resume, and cancellation. The shared download logic was extracted into a download_single_file helper used by both the single-file and multi-file paths.

```mermaid
flowchart LR
    A[download_model] --> B{files empty?}
    B -->|yes| C[download_single_file<br/>tar.gz path]
    B -->|no| D[download_model_files]
    D --> E[verify existing files<br/>SHA-256 + size]
    E --> F[download_single_file<br/>per missing file]
    F --> G[remove .incomplete marker]
```
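The verify-existing-files step above can be sketched as pure decision logic. This is a hypothetical illustration, not the PR's actual code: `FileEntry` and `files_to_download` are made-up names, and real hashing (the sha2 crate) is abstracted behind a precomputed `(size, hash)` map so the selection logic stays visible.

```rust
use std::collections::HashMap;

// Hypothetical manifest entry; field names are illustrative, not the PR's struct.
struct FileEntry {
    name: &'static str,
    expected_size: u64,
    expected_sha256: &'static str,
}

/// Decide which files still need downloading: a file is skipped only when it
/// exists locally with the expected size AND a matching SHA-256. `local` maps
/// file name -> (size, sha256 hex) for files already on disk.
fn files_to_download<'a>(
    manifest: &'a [FileEntry],
    local: &HashMap<&str, (u64, String)>,
) -> Vec<&'a str> {
    manifest
        .iter()
        .filter(|f| match local.get(f.name) {
            Some((size, hash)) => {
                *size != f.expected_size || hash.as_str() != f.expected_sha256
            }
            None => true, // missing entirely
        })
        .map(|f| f.name)
        .collect()
}

fn main() {
    let manifest = [
        FileEntry { name: "encoder.onnx", expected_size: 10, expected_sha256: "aaa" },
        FileEntry { name: "decoder.onnx", expected_size: 20, expected_sha256: "bbb" },
        FileEntry { name: "tokenizer.json", expected_size: 5, expected_sha256: "ccc" },
    ];
    let mut local = HashMap::new();
    local.insert("encoder.onnx", (10, "aaa".to_string())); // intact -> skipped
    local.insert("decoder.onnx", (20, "xxx".to_string())); // hash mismatch -> re-download
    let todo = files_to_download(&manifest, &local);
    assert_eq!(todo, vec!["decoder.onnx", "tokenizer.json"]);
    println!("{todo:?}");
}
```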

The second commit (chore: temporarily pin transcribe-rs to git branch) should be dropped once the upstream transcribe-rs PR merges and a crates.io release includes qwen3 support.

Testing

  • cargo check clean, 35 unit tests pass
  • Frontend lint clean
  • Not yet runtime-tested (model files not yet uploaded to HuggingFace)

Trade-offs and Alternatives

The model registry entries are verbose (~160 lines each for the 9-file manifests with SHA-256 hashes). An alternative would be loading manifests from external JSON, but that adds a fetch dependency to model discovery. Keeping them inline is consistent with existing model entries and avoids the extra complexity.

@cjpais
Owner

cjpais commented Mar 4, 2026

We probably should do a tar if we are bringing this in rather than a new methodology.

We can do verification but it should be a separate PR

@andrewleech
Author

> We probably should do a tar if we are bringing this in rather than a new methodology.

Do you mean tar the individual model files into a single download?

Would you want to keep it as a tar on disk and unpack on the fly during load, or just unpack at the end of download?

@cjpais
Owner

cjpais commented Mar 5, 2026

Look at how the other ones are done

@andrewleech andrewleech force-pushed the feat/qwen3-batch-standalone branch 2 times, most recently from 859d1d9 to 5ce1bd7 Compare March 6, 2026 03:23
Contributor

@VirenMohindra left a comment


Nice work on the engine integration — follows the existing patterns cleanly. Left a few inline comments on things to address.

tar = "0.4.44"
flate2 = "1.0"
transcribe-rs = { version = "0.2.8", features = ["whisper", "parakeet", "moonshine", "sense_voice", "gigaam"] }
transcribe-rs = { git = "https://github.com/andrewleech/transcribe-rs", branch = "feat/qwen3-batch", features = ["whisper", "parakeet", "moonshine", "sense_voice", "gigaam", "qwen3"] }
Contributor

@VirenMohindra Mar 16, 2026


This pins the entire project to a personal fork branch. If andrewleech/transcribe-rs feat/qwen3-batch goes away, Handy can't build. This should wait for the upstream transcribe-rs PR to merge and ship on crates.io before merging.

Author


Yes, this is all still WIP (I should have made this a draft PR, sorry). If/when my library PR there is finished and merged, I'll fix this.

Author


Agreed — this is in a separate DROP: commit at the tip specifically so it can be dropped cleanly when transcribe-rs PR #48 merges. The commit message makes this explicit.

partial_size: 0,
is_directory: true,
engine_type: EngineType::Qwen3,
accuracy_score: 0.75,
Contributor


These HuggingFace URLs point to personal repos (andrewleech/qwen3-asr-*-onnx). Are the model files uploaded yet? The PR description says "not yet runtime-tested (model files not yet uploaded to HuggingFace)" — nobody can verify this works until they are. The 1.7B model is 7.3 GB, worth runtime testing before merge.

Author


I've been fixing, testing, and improving the models in various formats over the last few days; the current uploads are hopefully pretty well release-ready. I'm just reviewing my transcribe-rs changes (which use these) more thoroughly before I push that up again, then I'll update this PR to match.

Author


Models are now uploaded and download URLs are live — both FP32 and Int4 variants for 0.6B and 1.7B. Runtime tested on Windows with the 0.6B Int4 model.

supported_languages: vec![],
is_custom: true,
},
},
Contributor


Unrelated whitespace change — should be dropped from this PR.

Author


Resolved in subsequent rebases — the current diff has no stray whitespace changes.

}
LoadedEngine::GigaAM(gigaam_engine) => gigaam_engine
.transcribe_samples(audio, None)
.map_err(|e| anyhow::anyhow!("GigaAM transcription failed: {}", e)),
Contributor


transcribe_samples(audio, None) — no inference params despite Qwen3 supporting 12 languages. Fine for an initial PR, but worth a follow-up to wire up language selection like Whisper/SenseVoice do. Users selecting e.g. zh-Hans in settings won't get that passed through to the engine.

Author


Qwen3-ASR auto-detects language from the audio — there's no language token in the prompt template (unlike Whisper). The transcribe-rs Qwen3 engine only exposes max_tokens, with no language field. supports_language_selection: false is set correctly to prevent the UI from showing a language selector that would have no effect. If the engine adds language hints in the future, this can be wired up then.

Author


Good call — this is feasible as a follow-up. I investigated the Qwen3-ASR prompt structure and there is a mechanism for language hinting.

How the model handles language:

The model outputs language {Name}<asr_text>{transcription} — it auto-detects the language, emits a language tag, then the <asr_text> special token (ID 151704), then the transcription. The strip_language_prefix function in transcribe-rs removes this prefix before returning the result.
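Assuming that output shape, a minimal prefix stripper could look like the sketch below; this is an illustration of the described behavior, and the real strip_language_prefix in transcribe-rs may differ in details.

```rust
/// Remove the "language {Name}<asr_text>" prefix the model emits before
/// the transcription, returning only the transcription text.
fn strip_language_prefix(decoded: &str) -> &str {
    // Raw decoder output looks like: "language English<asr_text>hello world"
    match decoded.split_once("<asr_text>") {
        Some((_prefix, transcription)) => transcription,
        None => decoded, // no prefix found; return unchanged
    }
}

fn main() {
    assert_eq!(
        strip_language_prefix("language English<asr_text>hello world"),
        "hello world"
    );
    assert_eq!(strip_language_prefix("no prefix here"), "no prefix here");
    println!("ok");
}
```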

How language forcing would work:

Instead of letting the model generate the language tag, you prefill the assistant turn with the language prefix tokens. The prompt currently ends with:

<|im_start|>assistant\n

To force e.g. Chinese, you'd extend it to:

<|im_start|>assistant\n language Chinese<asr_text>

The token IDs are: [11528, <lang_name_token>, 151704] where 11528 = "language", the language name is a single BPE token (e.g. 6364=English, 8453=Chinese, 10769=Japanese), and 151704 = <asr_text>.

The decoder then starts generating from after the prefilled tokens, skipping auto-detection entirely and producing transcription in the specified language.

What would need to change (transcribe-rs):

  • Add a language-name-to-token-ID mapping (~57 entries matching KNOWN_LANGUAGES)
  • Add language: Option<String> to Qwen3Params
  • Extend build_prompt_ids to append [11528, lang_token, 151704] after the assistant prefix when language is set
  • Adjust strip_language_prefix to not strip when language was forced (it's already been consumed by the prefill)
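The prefill extension above can be sketched like this; append_language_prefill and language_token are hypothetical names rather than transcribe-rs API, and only the three language tokens quoted above are included.

```rust
// Token IDs taken from the discussion above: 11528 = "language",
// 151704 = <asr_text>; each language name is a single BPE token.
fn language_token(lang: &str) -> Option<u32> {
    match lang {
        "English" => Some(6364),
        "Chinese" => Some(8453),
        "Japanese" => Some(10769),
        // ...a full mapping would cover the ~57 KNOWN_LANGUAGES entries
        _ => None,
    }
}

/// Append the forced-language prefill after the assistant prefix so the
/// decoder skips auto-detection: [11528, <lang token>, 151704].
fn append_language_prefill(prompt_ids: &mut Vec<u32>, language: Option<&str>) {
    if let Some(tok) = language.and_then(language_token) {
        prompt_ids.extend_from_slice(&[11528, tok, 151704]);
    }
}

fn main() {
    // Stand-in for the tokenized "<|im_start|>assistant\n" prompt tail.
    let mut ids = vec![1, 2, 3];
    append_language_prefill(&mut ids, Some("Chinese"));
    assert_eq!(ids, vec![1, 2, 3, 11528, 8453, 151704]);

    // Unknown or unset language leaves the prompt untouched (auto-detect).
    let mut ids2 = vec![1, 2, 3];
    append_language_prefill(&mut ids2, None);
    assert_eq!(ids2, vec![1, 2, 3]);
    println!("ok");
}
```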

What would need to change (Handy):

  • Pass options.language through to the Qwen3 engine (currently uses TranscribeOptions::default() since supports_language_selection: false)
  • Set supports_language_selection: true on the Qwen3 model entries
  • The frontend language selector would then appear for Qwen3 models, and the selected language would flow through TranscribeOptions.language → Qwen3Params.language → prefilled prompt tokens

I'll track this as a follow-up to the transcribe-rs Qwen3 engine.

@andrewleech andrewleech force-pushed the feat/qwen3-batch-standalone branch from 5ce1bd7 to 0774612 Compare March 18, 2026 13:14
@andrewleech
Author

andrewleech commented Mar 18, 2026

Updated to address review feedback and rebase onto current main.

Changes since the original PR:

  • Switched from multi-file downloads to tar.gz archives as requested — Qwen3 models now download as a single .tar.gz and extract to a directory, matching the existing Parakeet/Moonshine/SenseVoice pattern. The ModelFile struct, per-file SHA-256 verification, .incomplete markers, and download_model_files() method are all removed. The sha2 crate dependency is also dropped.

  • Rebased onto current main including the transcribe-rs 0.3.1 migration (SpeechModel trait, onnx::* module paths), Canary engine, DirectML acceleration, and other upstream changes. The Qwen3 integration now uses the new API (Qwen3Model::load(&path, &Quantization) + model.transcribe(&audio, &options)).

  • Expanded language support from 12 to 55 languages, matching the engine's actual KNOWN_LANGUAGES list cross-referenced against the frontend language constants.

  • Translated model descriptions in all 17 locale files (previously English-only placeholders in non-English locales).

  • Added HTTP read_timeout(30s) to the download client and documented lock ordering in finish_downloading.

  • Int4 quantized variants are supported in the transcription code path (quantization derived from model ID suffix) but the int4 model entries are not yet registered — they'll be added once the int4 tar.gz archives finish uploading to HuggingFace.

  • FP32 and Int4 model archives:

The DROP commit (pin transcribe-rs to git branch) should be dropped when cjpais/transcribe-rs#48 is merged.
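The quantization-from-model-ID-suffix derivation mentioned above can be sketched as follows; the "-int4" suffix convention and the enum shape here are assumptions based on the description, not the actual PR code.

```rust
// Hypothetical illustration: derive the quantization variant to pass to
// Qwen3Model::load from the model ID's suffix.
#[derive(Debug, PartialEq)]
enum Quantization {
    Fp32,
    Int4,
}

fn quantization_from_model_id(model_id: &str) -> Quantization {
    if model_id.ends_with("-int4") {
        Quantization::Int4
    } else {
        Quantization::Fp32 // default for un-suffixed (FP32) entries
    }
}

fn main() {
    assert_eq!(quantization_from_model_id("qwen3-asr-0.6b-int4"), Quantization::Int4);
    assert_eq!(quantization_from_model_id("qwen3-asr-1.7b"), Quantization::Fp32);
    println!("ok");
}
```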

@VirenMohindra
Contributor

VirenMohindra commented Mar 19, 2026

Great job. I'll take a look at this holistically and test e2e once cjpais/transcribe-rs#48 lands. Feel free to ping me directly!

@andrewleech andrewleech force-pushed the feat/qwen3-batch-standalone branch 2 times, most recently from f85dc4b to 0f12525 Compare March 20, 2026 00:39
DylanBricar added a commit to DylanBricar/Phonara that referenced this pull request Mar 20, 2026
- Local OpenAI-compatible API server on /v1/audio/transcriptions (cjpais#509)
- GNOME system shortcuts via gsettings for Wayland (cjpais#572)
- Wake word detection infrastructure with settings (cjpais#618)
- Live transcription mode settings (overlay/clipboard) (cjpais#832)
- Qwen3-ASR engine type placeholder for future transcribe-rs (cjpais#957)
- Prioritized microphone device list with fallback (cjpais#1070)
- Flatpak detection helper (cjpais#548)
- Storybook dev script (cjpais#784)
@andrewleech andrewleech force-pushed the feat/qwen3-batch-standalone branch 2 times, most recently from e431445 to 606d124 Compare March 23, 2026 05:17
pi-anl added 2 commits March 25, 2026 13:06
Add Qwen3-ASR 0.6B and 1.7B models (FP32 and Int4 variants) using
tar.gz archive downloads. Uses the new transcribe-rs SpeechModel API
with quantization derived from model ID suffix.
@andrewleech andrewleech force-pushed the feat/qwen3-batch-standalone branch from 606d124 to e124e52 Compare March 25, 2026 02:06