
gpu: Add runtime GPU execution provider selection. #958

Closed
andrewleech wants to merge 2 commits into cjpais:main from andrewleech:feat/gpu-providers-standalone

Conversation

@andrewleech

Summary

ORT-based transcription engines (Parakeet, Moonshine, SenseVoice) can use GPU acceleration via ONNX Runtime execution providers, but until now the build had no way to select one. This adds compile-time Cargo feature flags (gpu-directml, gpu-cuda, gpu-coreml, webgpu) that gate the available providers, plus a runtime settings dropdown that lets the user switch between them without restarting.
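The feature gating might look roughly like the following Cargo fragment. This is a hedged sketch: the feature names (gpu-directml, gpu-cuda, gpu-coreml, webgpu) come from the PR summary, but the downstream transcribe-rs feature names they forward to are assumptions, not the actual diff.

```toml
# Hypothetical sketch of the compile-time feature wiring; the real
# mapping to transcribe-rs / ONNX Runtime features lives in the PR diff.
[features]
default = []
gpu-directml = ["transcribe-rs/directml"]
gpu-cuda     = ["transcribe-rs/cuda"]
gpu-coreml   = ["transcribe-rs/coreml"]
webgpu       = ["transcribe-rs/webgpu"]
```

With no GPU feature enabled, only the auto and cpu providers exist, which is why default builds show no new UI.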

Backed by transcribe-rs PR #49 which adds GpuProvider enum, set_gpu_provider(), available_providers().
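The upstream surface, per that PR description, is roughly the following. Treat this as a mock: the names (GpuProvider, available_providers) are from the PR summary, but the exact signatures, variants, and the parsing helper below are assumptions, not the real transcribe-rs API.

```rust
// Mock of the transcribe-rs additions named above; shapes are
// assumptions from the PR summary, not the actual crate API.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GpuProvider {
    Auto,
    Cpu,
    DirectMl,
    Cuda,
    CoreMl,
    WebGpu,
}

impl GpuProvider {
    /// Parse a persisted settings value back into a provider
    /// (hypothetical helper for illustration).
    pub fn from_setting(s: &str) -> Option<Self> {
        match s {
            "auto" => Some(Self::Auto),
            "cpu" => Some(Self::Cpu),
            "directml" => Some(Self::DirectMl),
            "cuda" => Some(Self::Cuda),
            "coreml" => Some(Self::CoreMl),
            "webgpu" => Some(Self::WebGpu),
            _ => None,
        }
    }
}

/// Providers compiled into this build; `auto` and `cpu` are always
/// present, GPU providers would be appended behind feature gates.
pub fn available_providers() -> Vec<GpuProvider> {
    vec![GpuProvider::Auto, GpuProvider::Cpu]
}
```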

```mermaid
flowchart LR
    A[Startup] --> B{Persisted provider<br/>available in build?}
    B -->|yes| C[set_gpu_provider]
    B -->|no| D[Reset to auto<br/>+ warn]
    D --> C
    E[User changes setting] --> F{Model loaded?}
    F -->|ORT engine| G[Unload + reload<br/>with new EP]
    F -->|Whisper| H[Skip reload<br/>whisper.cpp ignores EP]
    F -->|Transcription in flight| I[Reject + revert]
```
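The startup branch of the flowchart amounts to a validate-or-reset step. A minimal sketch, with names and signature assumed for illustration rather than taken from the diff:

```rust
// Hypothetical sketch of the persisted-provider validation from the
// flowchart: keep the saved value if this build compiled it in,
// otherwise fall back to "auto" and flag a warning.
fn resolve_startup_provider(persisted: &str, available: &[&str]) -> (String, bool) {
    if available.contains(&persisted) {
        (persisted.to_string(), false) // use the saved provider as-is
    } else {
        eprintln!("provider '{}' not in this build; resetting to auto", persisted);
        ("auto".to_string(), true) // reset + warn
    }
}
```

For example, a build without gpu-directml that finds "directml" persisted from a previous build would resolve to "auto" with the warning flag set.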

The dropdown only appears when a GPU EP beyond auto+cpu is compiled in, so default builds see no UI change.

The second commit (chore: temporarily pin transcribe-rs to git branch) should be dropped once the upstream transcribe-rs PR merges and a crates.io release includes GPU provider support.

Benchmarks

Tested on AMD Ryzen AI 9 365 (Zen 5) with integrated AMD Radeon 860M iGPU. The autoregressive decoding loop in current ORT-based engines doesn't benefit from GPU parallelism — iGPU memory bandwidth is shared with CPU and per-token overhead dominates.

DirectML (Qwen3-ASR decoder only — encoder hits a DML Conv2d NaN bug and must stay on CPU):

| Metric | Result |
| --- | --- |
| 1.7B decoder | ~7% speedup |
| 0.6B decoder | ~9% slowdown (GPU dispatch overhead > benefit) |

WebGPU (Qwen3-ASR encoder + decoder, native Windows, D3D12 backend):

| Model | Audio | CPU RTF | CPU time | WebGPU RTF | WebGPU time | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| 0.6B | 11s | 2.54x | 4.33s | 2.42x | 4.55s | -5% |
| 0.6B | 30s | 1.45x | 20.37s | 1.38x | 21.45s | -5% |
| 1.7B | 11s | 1.31x | 8.38s | 1.16x | 9.47s | -12% |
| 1.7B | 30s | 0.98x | 30.18s | 0.66x | 45.16s | -50% |
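For readers unfamiliar with the metrics: the table appears to use the usual definitions, RTF as audio length divided by wall-clock time (higher is faster) and delta as relative change in time versus the CPU baseline. A small sketch of that arithmetic, assuming those definitions hold here:

```rust
// Assumed metric definitions for the benchmark table; illustrative
// only, not the actual benchmark harness.

/// Real-time factor: how many seconds of audio are processed per
/// second of wall-clock time.
fn rtf(audio_secs: f64, wall_secs: f64) -> f64 {
    audio_secs / wall_secs
}

/// Relative change versus the CPU baseline, in percent
/// (negative = slower than CPU).
fn delta_pct(cpu_secs: f64, gpu_secs: f64) -> f64 {
    (cpu_secs - gpu_secs) / cpu_secs * 100.0
}
```

For instance, the 0.6B/11s row: 11s of audio in 4.33s gives an RTF of about 2.54x, and 4.55s versus 4.33s is roughly a -5% delta, matching the table.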

On this hardware, GPU acceleration doesn't help. The infrastructure is in place for users with discrete GPUs where the cost/benefit may be different, and the setting defaults to auto so there's no regression for everyone else.

Testing

  • cargo check clean, 33 unit tests pass
  • Frontend lint clean
  • Runtime-tested on Windows with --features gpu-directml (setting persists, model reloads, stale provider resets correctly on next launch)

Trade-offs and Alternatives

Whisper models (whisper.cpp backend) don't use ORT, so GPU provider changes are intentionally skipped for them — the reload would be a no-op. An alternative would be hiding the setting entirely when only Whisper models are available, but the current approach is simpler and the dropdown already hides itself when no GPU EP is compiled in.
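The provider-change handling described above and in the flowchart reduces to a three-way decision. A hedged sketch with hypothetical type names (the real engine enum and reload path are in the PR, not reproduced here):

```rust
// Sketch of the reload gating: reject while transcribing, reload
// ORT engines, skip whisper.cpp (which ignores ONNX Runtime EPs).

#[derive(Debug, PartialEq)]
enum EngineKind {
    Ort,        // Parakeet, Moonshine, SenseVoice
    WhisperCpp, // whisper.cpp backend, no ORT execution providers
}

#[derive(Debug, PartialEq)]
enum ProviderChange {
    Reload, // unload + reload the model with the new EP
    Skip,   // no-op reload avoided
    Reject, // revert the setting; a transcription is in flight
}

fn on_provider_change(engine: EngineKind, transcribing: bool) -> ProviderChange {
    if transcribing {
        ProviderChange::Reject
    } else {
        match engine {
            EngineKind::Ort => ProviderChange::Reload,
            EngineKind::WhisperCpp => ProviderChange::Skip,
        }
    }
}
```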

@cjpais
Owner

cjpais commented Mar 4, 2026

Okay, I'm open to this, but I really need you to do some work on the CI/CD too. I get that it works for you at build time, but this cannot be pulled in if it cannot be distributed across platforms properly.

@andrewleech
Author

andrewleech commented Mar 4, 2026

To be fair, it doesn't work for me either, but that's an AMD laptop issue (even though they brand it as an AI laptop) ;-)

Good point regardless. Sorry for throwing this up without tests, my bad. All my testing scripts were too local to include in the commit, and I overlooked that they were all I had.

I'll look into what can be covered on a GitHub runner; if nothing else, the program flow/logic.

pi-anl added 2 commits March 5, 2026 15:31
1. Add compile-time feature flags (gpu-directml, gpu-cuda, gpu-coreml, webgpu) that enable GPU acceleration for ORT-based engines (Parakeet, Moonshine, SenseVoice). A runtime GPU provider setting in Advanced Settings lets users switch between available providers; changing the provider reloads any loaded ORT model. Stale provider values from a previous build are reset to "auto" at startup.

2. Points at andrewleech/transcribe-rs feat/gpu-providers (PR cjpais/transcribe-rs#49). Drop this commit once GPU provider support is published to crates.io.
@cjpais
Owner

cjpais commented Mar 16, 2026

I'm going to close this, but we will add the option. I just want to start it from scratch.

@cjpais cjpais closed this Mar 16, 2026