
Prefer DirectML for Windows ONNX transcription models#985

Closed
ferologics wants to merge 2 commits into cjpais:main from ferologics:feat/windows-onnx-directml

Conversation

@ferologics
Contributor

@ferologics ferologics commented Mar 9, 2026

Summary

  • patch Handy's transcribe-rs dependency to a forked git revision with Windows DirectML support for ONNX models
  • prefer DirectMLExecutionProvider on Windows, with explicit CPU fallback if provider registration fails
  • log whether DirectML registration succeeded or fell back to CPU
  • clean a few existing Rust warnings touched during validation
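The prefer-DirectML-with-fallback behavior described above can be sketched as follows. This is an illustrative stand-in, not the actual code from the patched transcribe-rs: `select_provider` and its closure argument model the ONNX Runtime provider-registration call, whose real API is elided here.

```rust
// Sketch of the "prefer DirectML, fall back to CPU" pattern. The closure
// stands in for the real ONNX Runtime provider-registration call.
fn select_provider(register_directml: impl Fn() -> Result<(), String>) -> &'static str {
    match register_directml() {
        // GPU path; the real code logs successful DirectML registration here.
        Ok(()) => "DirectMLExecutionProvider",
        Err(err) => {
            // Explicit CPU fallback when provider registration fails,
            // logged so the fallback is visible in handy.log.
            eprintln!("DirectML registration failed ({err}); falling back to CPU");
            "CPUExecutionProvider"
        }
    }
}
```

The key property this models is that a registration failure is surfaced in the log rather than silently degrading to CPU.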

Validation

  • cargo check
  • cargo check --release
  • launched the built Handy app locally and confirmed handy.log shows successful DirectML registration for the Parakeet ONNX sessions
  • ran a local dev-build transcription successfully after the change
  • benchmarked the patched Parakeet path on a real Handy recording (244.38s audio): 35.368s before vs 6.99s after (~5.1x faster, ~35x realtime)
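As a sanity check, the quoted speedup and realtime factors follow directly from the measured times:

```rust
// Derive the benchmark ratios from the measured times above.
fn speedup(before_s: f64, after_s: f64) -> f64 {
    before_s / after_s
}

fn realtime_factor(audio_s: f64, decode_s: f64) -> f64 {
    audio_s / decode_s
}

fn main() {
    let s = speedup(35.368, 6.99);          // ≈ 5.06x faster
    let r = realtime_factor(244.38, 6.99);  // ≈ 34.96x realtime
    println!("speedup {s:.2}x, realtime {r:.2}x");
}
```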

Dependency patch
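A patch to a forked git revision, as described in the summary, would typically take roughly this shape in Handy's `Cargo.toml`. The URL and revision below are placeholders; the actual values are in the PR diff.

```toml
# Hypothetical shape of the dependency patch; the real fork URL and
# revision are in the PR diff, these values are placeholders.
[patch.crates-io]
transcribe-rs = { git = "https://github.com/<fork>/transcribe-rs", rev = "<revision>" }
```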

@ferologics ferologics force-pushed the feat/windows-onnx-directml branch 2 times, most recently from 3b27bd4 to 32b0150 Compare March 9, 2026 23:40
@github-actions

🧪 Test Build Ready

Build artifacts for PR #985 are available for testing.

Download artifacts from workflow run

Artifacts expire after 30 days.

@cjpais
Owner

cjpais commented Mar 10, 2026

@ferologics can you see if this still helps inference speed for you? Would be quite curious whether it just falls back to CPU or works out of the box. I kind of think that since it's DirectML it might just work on Win 11; would be curious about Win 10 too

@ferologics
Contributor Author

Tested the CI-built Windows artifact locally and it looks good.

What I checked:

  • downloaded handy-pr-985-x86_64-pc-windows-msvc from run 22882036165
  • extracted the MSI payload and confirmed the packaged app includes DirectML.dll
  • launched the packaged handy.exe
  • triggered a real start/stop transcription cycle against my normal Handy setup

Result:

  • handy.log shows successful DirectML registration in the packaged build:
    • ONNX Runtime session registered DirectMLExecutionProvider on Windows (device 0) with CPU fallback enabled
  • I saw that log for the Parakeet encoder / decoder / nemo sessions
  • I did not see the CPU fallback warning

So on my Windows 11 machine the CI-built artifact is still taking the intended GPU path, not silently dropping back to CPU.

@cjpais
Owner

cjpais commented Mar 10, 2026

Solid. This is amazing news. I will test on my Windows machine when I can and see how it goes as well. I'm curious how this will play with integrated GPUs.

I am slightly wondering if we will need to provide an option to disable this, just in case CPU is faster for someone. I know another PR in transcribe-rs had something like this. Might be worth considering

@ferologics ferologics marked this pull request as ready for review March 12, 2026 15:08
@ferologics
Contributor Author

Good callout. I agree an opt-out could be useful just in case CPU ends up better for some setups.

I’m not sure it needs to be tackled in this PR unless you think it’s important for landing it — happy to add it if you feel it’s essential, otherwise we can keep this one focused and follow up separately.

@cjpais
Owner

cjpais commented Mar 13, 2026

Let me think about it, I want to give this a test myself on my machine and go from there. I will be able to test probably tomorrow

@cjpais
Owner

cjpais commented Mar 13, 2026

Okay, I gave this a quick run. We definitely need a toggle before shipping this, and the default should be off. It probably should be in experimental settings. Possibly it should be a dropdown, since we may add CUDA, etc. to it in the future. Not sure exactly how we are going to handle this generically, but we will cross that bridge when we get there.

The reason for this is that DirectML is 4x slower than CPU on my test machine with an integrated GPU (testing with Parakeet v3). I suspect a lot of users have integrated GPUs, and we cannot impact their performance.

Maybe #958 is relevant here and worth combining efforts. Pinging @andrewleech for thoughts and opinions.

I know there is also #1023, which we need to do. Also cc: @intech. I am not ready to move to 0.3.0 of transcribe-rs quite yet, mostly because we need to have a solid design for supporting acceleration in the app. Basically all these PRs are interrelated, so I would love any help thinking about this.

Opinions and thoughts welcome. I will likely be making some changes to transcribe-rs soon enough which might change some of the API footprint as it relates to acceleration.

@andrewleech

andrewleech commented Mar 13, 2026

In #958 I was testing DirectML as well as WebGPU, and generally found that on my iGPU they gave worse performance on most models, depending on model architecture and format/quantization.

I didn't keep DirectML enabled by default because it's in maintenance mode / effectively deprecated. As far as I could tell, WebGPU is the most supported framework that covers both Nvidia and AMD.

However, WebGPU was slightly slower for me than DirectML in my tests, IIRC, though there are some WebGPU settings needed to ensure it's not running under the "default browser settings" restrictions.

CUDA would likely give better performance on compatible hardware, but ORT then bundles ~100 MB of binaries, so you probably don't want it included/enabled by default.

My PR adds compile flags to select which GPU frameworks to include, along with a drop-down setting to choose what's enabled.
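A compile-time selection like the one described could look roughly like this in `Cargo.toml`. The feature names and comments below are illustrative, not the actual layout from #958:

```toml
# Illustrative feature layout; the actual names in #958 may differ.
[features]
default = []    # CPU-only by default, matching the "off by default" requirement
directml = []   # would gate the ONNX Runtime DirectML execution provider
webgpu = []     # would gate the WebGPU execution provider
cuda = []       # note: would pull in large (~100 MB) runtime binaries
```

Each feature would then gate both the compiled-in provider support and the corresponding entry in the in-app dropdown.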

@cjpais
Owner

cjpais commented Mar 16, 2026

Closing because I will be submitting a PR for this and pulling it in

@cjpais cjpais closed this Mar 16, 2026
