I was tired of guessing which Whisper model size to use for speech-to-text, so I ran a quick evaluation on my own setup to figure it out.
- This was a "back of the envelope" style experiment and not intended as a definitive evaluation of local ASR inference.
- My hardware (AMD GPU, ROCm) isn't ideal for STT
- I use "STT" (speech-to-text) and "ASR" (automatic speech recognition) more or less interchangeably here
- There are many variables that affect ASR accuracy, from your microphone and background noise through to how you speak. I've seen measurable differences in results depending on which mic I use, among other things.
- Inference: CPU (!)
All tests were run on audio/001.wav - you can listen to the actual sample used in this evaluation.
Source of truth (reference text): text/recorded/01_conversational_casual.txt
I've attempted something like this (informally!) on Android with Futo but can never remember the results ...
This seemed like a big project but I thought it was worth it. If you're spending hours per day doing transcription (I am) then this is probably very worth your time. You can almost certainly set up a more robust evaluation than I can.
And more specifically: on my hardware, where do diminishing returns set in?
Diminishing returns, in the case of STT, means something like this (to me): the point where further reductions in WER aren't worth the transcription becoming annoyingly laggy. Colloquially, we'll call this the "sweet spot."
Answer: on my hardware, it's at roughly Whisper Medium. What counts as acceptable latency / the "sweet spot" depends heavily on your STT workload; for live transcription, even small lags are obviously far more problematic.
On my test sample:
- tiny: 15.05% WER - Fast but rough
- base: 9.95% WER - Significant improvement
- small: 11.17% WER - Slight regression (interesting!)
- medium: 6.07% WER - Best accuracy in my test
- large-v3-turbo: 7.04% WER - Good balance
My takeaway: The biggest accuracy jump was from tiny → base. After that, diminishing returns for the speed cost.
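If you want to sanity-check WER figures like these yourself, here's a minimal sketch using the jiwer package. The transcript path and the normalization steps are my assumptions, not necessarily what this repo's scripts do.

```python
# Minimal WER check: compare a model's transcript against the reference text.
# Requires `pip install jiwer`. The transcript path is hypothetical, and the
# normalization below is one reasonable choice, not necessarily the repo's.
import jiwer

reference = open("text/recorded/01_conversational_casual.txt").read()
hypothesis = open("transcript_base.txt").read()  # hypothetical model output

# Strip casing/punctuation so WER counts word errors, not formatting quirks.
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

wer = jiwer.wer(normalize(reference), normalize(hypothesis))
print(f"WER: {wer:.2%}")
```

Normalizing casing and punctuation matters here; without it, WER mostly measures formatting differences rather than recognition errors.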
The same question extends to the other Whisper engines and variants. They're wonderful, but I've always been curious how their accuracy compares.
What I found: on my test, faster-whisper matched openai-whisper's accuracy and was slightly faster.
Testing the same base model on my hardware:
- faster-whisper: 9.95% WER in 5.01s
- openai-whisper: 9.95% WER in 6.17s
My takeaway: Identical accuracy on this sample, faster-whisper was ~1.2x faster. Good enough reason for me to use it.
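For a rough idea of what that comparison looks like in code, here's a sketch that runs one clip through both engines on CPU and times them. The int8 compute type, and the fact that the timing includes model load, are my own choices rather than exactly what the repo's compare_engines.py does.

```python
# Rough engine comparison on a single clip: openai-whisper vs faster-whisper,
# both using the "base" model on CPU. Timing here includes model load.
import time

import whisper                           # pip install openai-whisper
from faster_whisper import WhisperModel  # pip install faster-whisper

AUDIO = "audio/001.wav"

# openai-whisper (reference implementation)
start = time.perf_counter()
ow_model = whisper.load_model("base", device="cpu")
ow_text = ow_model.transcribe(AUDIO)["text"]
print(f"openai-whisper: {time.perf_counter() - start:.2f}s")

# faster-whisper (CTranslate2 backend); int8 keeps CPU inference light
start = time.perf_counter()
fw_model = WhisperModel("base", device="cpu", compute_type="int8")
segments, _info = fw_model.transcribe(AUDIO)
fw_text = "".join(segment.text for segment in segments)
print(f"faster-whisper: {time.perf_counter() - start:.2f}s")
```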
What I found: For my use case, base or small seems like the sweet spot.
On my hardware:
- tiny: Super fast (2.73s) but 15% WER is rough for my needs
- base: Good balance - 10% WER in 5s
- small: Similar to base, slightly slower
- medium: Best accuracy (6% WER) but 7x slower than tiny
- large-v3-turbo: 33s for 7% WER - more than I need
My takeaway: For daily transcription of my own voice, base or small hits the sweet spot for me.
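If you'd rather not use the repo scripts, a size sweep along these lines is easy to sketch with faster-whisper plus jiwer. The model names, the int8-on-CPU setting, and the bare-bones normalization are my assumptions.

```python
# Sketch of a model-size sweep on one clip with faster-whisper + jiwer.
# "large-v3-turbo" needs a reasonably recent faster-whisper release.
import time

import jiwer
from faster_whisper import WhisperModel

AUDIO = "audio/001.wav"
REFERENCE = open("text/recorded/01_conversational_casual.txt").read()

for size in ["tiny", "base", "small", "medium", "large-v3-turbo"]:
    model = WhisperModel(size, device="cpu", compute_type="int8")
    start = time.perf_counter()
    segments, _info = model.transcribe(AUDIO)
    text = "".join(segment.text for segment in segments)  # generator runs here
    elapsed = time.perf_counter() - start
    # Lowercasing only; stricter normalization would shift the numbers a bit.
    wer = jiwer.wer(REFERENCE.lower(), text.lower())
    print(f"{size:>15}: {wer:.2%} WER in {elapsed:.2f}s")
```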
There are many other ASR models I'd love to try, but on AMD it's just too much work to resolve the dependencies. If you're on NVIDIA with CUDA, though, there are many more questions worth exploring. For round two on my hardware, I'd love to look at how much modern ASR outperforms "legacy" STT.
My personal answer: base model with faster-whisper.
Why it works for me:
- ~10% WER is acceptable for my daily use (I dictate a lot)
- 5 seconds per clip is fast enough
- 140MB model size is manageable
- Good balance for my workflow
When I'd use something else:
- tiny: Quick tests or long recordings where speed matters more
- medium/large: Publishing/professional work where I need better accuracy
I also threw distil-whisper into the mix to see if it lived up to the speed claims.
On my test:
- faster-whisper: 9.95% WER, 4.87s ✓
- openai-whisper: 9.95% WER, 6.51s
- distil-whisper: 21.6% WER, 38.49s ✗
My takeaway: distil-whisper was both slower AND less accurate on my sample. Unexpected, but that's what I got.
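If you want to try distil-whisper yourself, the usual route is the Hugging Face transformers pipeline, which is also less CPU-optimized than the CTranslate2-backed engines and may account for part of that gap. Here's a minimal sketch; the checkpoint name and chunking setting are just one plausible configuration, not necessarily the exact one behind the numbers above.

```python
# Minimal distil-whisper sketch via the transformers ASR pipeline on CPU.
# Requires: pip install transformers torch (plus ffmpeg for audio decoding).
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3",  # assumed checkpoint
    device=-1,            # -1 = CPU
    chunk_length_s=30,    # needed for clips longer than 30 seconds
)

result = asr("audio/001.wav")
print(result["text"])
```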
- Best accuracy (on this sample): medium (6.07% WER)
- Fastest: tiny (2.73s)
- My choice for daily use: base (9.95% WER, 5s)
- Recommended engine: faster-whisper
Want to benchmark on your own voice and hardware? Here's how:
- Set up the conda environment (see setup.md)
- Record your own audio samples and create reference transcriptions
- Put audio in audio/, reference text in text/
- Run the scripts:
```bash
# Test all model sizes
python scripts/test_all_sizes.py --audio audio/your_test.wav --reference text/your_test.txt

# Compare engines
python scripts/compare_engines.py --audio audio/your_test.wav --reference text/your_test.txt

# Generate visualizations from your results
python scripts/visualize_results.py
```

My test environment:
- GPU: AMD Radeon RX 7700 XT (but using CPU inference)
- CPU: Intel Core i7-12700F
- RAM: 64 GB
- OS: Ubuntu 25.04
Your performance will differ based on your setup.
Models get downloaded to ~/models/stt/ with subdirectories for different engines (openai-whisper, faster-whisper, etc).
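Both engines accept a download_root argument if you want to control where those files land. Here's a sketch matching that layout; the per-engine subdirectory names are just my convention, nothing the libraries enforce.

```python
# Point both engines at ~/models/stt/ instead of their default caches.
from pathlib import Path

import whisper
from faster_whisper import WhisperModel

STT_MODELS = Path.home() / "models" / "stt"

ow_model = whisper.load_model(
    "base", download_root=str(STT_MODELS / "openai-whisper")
)
fw_model = WhisperModel(
    "base", device="cpu", download_root=str(STT_MODELS / "faster-whisper")
)
```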
License: MIT




