Local Whisper

License: MIT · Platform: macOS (Apple Silicon) · Python 3.11+

Local Whisper is a local-first voice app for macOS, iOS, and Android.

On macOS, it gives you fast push-to-talk dictation, grammar cleanup, text replacements, selected-text shortcuts, and offline text-to-speech from the menu bar.

On mobile, it brings private recording, local history, modes, model management, and keyboard setup to iOS and Android.

Double-tap to record, tap to stop, and the text lands on your clipboard. Parakeet-TDT v3 is the default speech-to-text engine.

[Screenshot: Local Whisper macOS settings with Parakeet-TDT v3 active]

Quick Start (macOS)

Apple Silicon required. Microphone and Accessibility permissions are needed.

git clone https://github.com/gabrimatic/local-whisper.git
cd local-whisper
./setup.sh

One command. The setup script installs dependencies, downloads core local models, builds the Swift UI, configures auto-start, and creates the wh alias.

| Action | Key |
|---|---|
| Start recording | Double-tap Right Option |
| Hold to record | Hold Right Option past double-tap threshold |
| Stop and transcribe | Tap Right Option or Space |
| Cancel | Esc |
| Read selected text aloud | ⌥T |
| Stop speech | ⌥T again or Esc |

What It Does

  • On-device transcription via MLX. Parakeet-TDT v3 is the default; Qwen3-ASR and WhisperKit are alternatives.
  • Grammar correction with pluggable backends: Apple Intelligence, Ollama, or LM Studio; or disable it entirely.
  • Text-to-speech reads any selected text aloud. Works in any app, multiple voices, streaming playback, offline via Kokoro MLX.
  • Text replacements for spoken-to-corrected mappings.
  • Audio processing: VAD, silence trimming, noise reduction, normalization.
  • Keyboard shortcuts for proofreading, rewriting, and prompt engineering on selected text.
  • CLI: wh whisper, wh listen, wh transcribe for scripting and automation.
  • Native macOS UI: menu bar, floating overlay, and settings window.
  • Mobile apps: Flutter iOS and Android surfaces for private recording, history, modes, model management, and keyboard setup.
  • Auto-backup of every recording and transcription.

Keyboard Shortcuts

| Shortcut | Action |
|---|---|
| ⌥T | Read selected text aloud (again or Esc to stop) |
| Ctrl+Shift+G | Proofread selected text |
| Ctrl+Shift+R | Rewrite selected text |
| Ctrl+Shift+P | Optimize selected text as an LLM prompt |

Results go to clipboard. TTS plays through speakers.

Feedback

  • Sounds: Pop on start, Glass on success, Basso on failure
  • Menu bar: animated waveform (recording), speaker icon (speech)
  • Overlay: live timer (0.0) while recording · ··· while processing · "Copied" on success · "Failed" on error · "Speaking..." during TTS

[Screenshot: Local Whisper recording overlay while dictating into notes]


Transcription Engines

Switch via Settings, wh engine <name>, or config.

Parakeet-TDT v3 (default)

In-process via parakeet-mlx. No server. Multilingual: English plus 24 European languages. The default checkpoint is NVIDIA Parakeet-TDT 0.6B v3, served through the mlx-community/parakeet-tdt-0.6b-v3 MLX snapshot. Long audio is handled via overlapping chunks.

| Setting | Default | Notes |
|---|---|---|
| model | mlx-community/parakeet-tdt-0.6b-v3 | Downloaded by setup.sh |
| timeout | 0 | No limit |
| chunk_duration | 120.0 | Seconds per chunk for long audio. 0 disables chunking (pair with local attention). |
| overlap_duration | 15.0 | Overlap between consecutive chunks. |
| decoding | "greedy" | Or "beam" for a small quality bump at higher cost. |
| beam_size / length_penalty / patience / duration_reward | 5 / 0.013 / 3.5 / 0.67 | Beam-only tuning knobs. |
| local_attention | false | Reduces peak RAM for unchunked long audio. |
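As a hedged example, enabling beam decoding with the defaults above might look like this in ~/.whisper/config.toml (the [transcription.parakeet_v3] section name is an assumption; see docs/configuration.md for the exact layout):

[transcription.parakeet_v3]
model = "mlx-community/parakeet-tdt-0.6b-v3"
chunk_duration = 120.0    # seconds per chunk; 0 disables chunking
overlap_duration = 15.0   # overlap between consecutive chunks
decoding = "beam"         # "greedy" by default; "beam" decodes slightly better at higher cost
beam_size = 5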

Qwen3-ASR (English only)

In-process via qwen3-asr-mlx. No local server required. Native long-audio support, up to 20 minutes in a single pass. English only. Switch with wh engine qwen3_asr.

| Setting | Default | Notes |
|---|---|---|
| model | mlx-community/Qwen3-ASR-1.7B-bf16 | Downloaded by setup.sh |
| timeout | 0 | No limit |
| repetition_penalty | 1.2 | Higher suppresses repetition loops |
| repetition_context_size | 100 | Tokens considered for repetition penalty |
| chunk_duration | 1200 | Seconds per internal chunk for very long audio |
| max_tokens | 0 | 0 = auto-scale from duration. Cap for faster short-clip decode. |

WhisperKit (alternative)

Whisper on Apple Neural Engine via Argmax. Install with brew install whisperkit-cli, switch with wh engine whisperkit.

| Model | Notes |
|---|---|
| tiny / tiny.en | Fastest, lowest accuracy |
| base / base.en | |
| small / small.en | |
| whisper-large-v3-v20240930 | Best accuracy (default) |

Text-to-Speech

Kokoro-82M via kokoro-mlx. Runs in-process after install. No local server required. Streaming playback starts before full synthesis completes.

Toggle from the menu bar or Settings > Voice. Activating the feature downloads the Kokoro voice model (~170 MB) and uses the spaCy en_core_web_sm dictionary plus system espeak-ng. Running ./setup.sh while the toggle is on pre-fetches everything so the first speak has no wait.

Usage:

  • ⌥T on selected text in any app. Press ⌥T again, Esc, or start a recording to stop.
  • CLI: wh whisper "text", wh whisper --voice af_bella "text", or pipe stdin with echo "hello" | wh whisper.

The overlay shows "Generating speech..." during synthesis, then "Speaking..." during playback. The shortcut is configurable via tts.speak_shortcut in config.
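For example, a hedged sketch of remapping that shortcut (the shortcut string syntax is an assumption; check docs/configuration.md for the accepted format):

[tts]
speak_shortcut = "alt+t"   # assumption: syntax for the ⌥T binding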

Voices

Multiple presets available. Default is Sky (af_sky).

| Voice | ID | Type |
|---|---|---|
| Heart | af_heart | American female |
| Bella | af_bella | American female |
| Nova | af_nova | American female |
| Sky | af_sky (default) | American female |
| Sarah | af_sarah | American female |
| Nicole | af_nicole | American female |
| Alice | bf_alice | British female |
| Emma | bf_emma | British female |
| Adam | am_adam | American male |
| Echo | am_echo | American male |
| Eric | am_eric | American male |
| Liam | am_liam | American male |
| Daniel | bm_daniel | British male |
| George | bm_george | British male |

Grammar Backends

Optional. Pick a grammar backend or disable it:

| Backend | Requirements | Notes |
|---|---|---|
| Apple Intelligence | macOS 26+, Apple Silicon, Apple Intelligence enabled | On-device Foundation Models |
| Ollama | Ollama installed and running | Works on any Mac |
| LM Studio | LM Studio with a model loaded and the local server started | Works on any Mac |
| Disabled | None | Transcription only |

Switch from menu bar (instant), wh backend <name> (restarts), or Settings.

Ollama setup
  1. Download from ollama.com
  2. Pull a model and start the server:
ollama pull gemma3:4b-it-qat
ollama serve
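You can confirm the server is up by listing the installed models (a standard Ollama API endpoint; the port comes from the Privacy table below):

curl http://localhost:11434/api/tags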
LM Studio setup
  1. Download from lmstudio.ai
  2. Download and load a model (e.g., google/gemma-3-4b)
  3. Start the local server: Developer tab > Start Server

Loading a model does not start the server; start it from the Developer tab.
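To confirm the LM Studio server is reachable (the same check used in Troubleshooting below):

curl http://localhost:1234/v1/models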


Usage

CLI

wh controls everything:

wh                  # Status and help
wh status           # Service status, PID, grammar backend
wh start            # Launch the service
wh stop             # Stop the service
wh restart          # Restart (rebuilds Swift UI if sources changed)
wh build            # Rebuild Swift UI app

wh engine           # Show current engine and list available
wh engine whisperkit  # Switch transcription engine
wh backend          # Show current grammar backend and list available
wh backend ollama   # Switch grammar backend

wh replace          # Show text replacement rules
wh replace add "gonna" "going to"
wh replace remove "gonna"
wh replace on|off   # Enable or disable replacements
wh replace import rules.csv   # Bulk-import rules (CSV, TSV, "a"="b", or a -> b); example below

wh whisper "text"   # Speak text aloud via Kokoro TTS
wh whisper --voice af_bella "text"
echo "hello" | wh whisper

wh listen           # Record until silence, output transcription
wh listen 30        # Record up to 30 seconds
wh listen --raw     # Raw transcription, no grammar

wh transcribe recording.wav
wh transcribe --raw audio.wav

wh stats            # Show usage statistics computed from history
wh export           # Export history to ~/Desktop/local-whisper-history.md
wh export --format json --out ~/Downloads/history.json
wh export --format txt --limit 50

wh config           # Interactive config editor (static summary when piped)
wh config edit      # Open config.toml in $EDITOR
wh config path      # Print config file path
wh doctor           # Check system health
wh doctor --fix     # Auto-repair issues
wh doctor --report  # Write a shareable diagnostic report
wh log              # Tail service log
wh update           # Pull, upgrade deps, warm up models, rebuild, restart
wh version          # Show version
wh uninstall        # Remove Local Whisper
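For wh replace import, a minimal hypothetical rules.csv with one spoken,corrected pair per line (TSV, "a"="b", and a -> b files import the same way):

gonna,going to
wanna,want to
btw,by the way

Import it with wh replace import rules.csv.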

Voice Dictation Commands

Speak these phrases anywhere in a dictation and Local Whisper replaces them with the literal punctuation or whitespace:

| Spoken | Inserted |
|---|---|
| "new line" | newline |
| "new paragraph" | blank line |
| "period" | . |
| "comma" | , |
| "question mark" | ? |
| "exclamation mark" | ! |
| "colon" | : |
| "semicolon" | ; |
| "dash" | " - " (space, hyphen, space) |
| "open paren" / "close paren" | ( / ) |
| "scratch that" | deletes the current sentence fragment |

Custom commands go under [dictation.commands] in ~/.whisper/config.toml. The pass runs before grammar correction, so grammar sees well-punctuated sentences.
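A hedged sketch of custom commands, mirroring the built-ins above (the exact value semantics are an assumption; see docs/configuration.md):

[dictation.commands]
"ellipsis" = "..."
"open bracket" = "["
"close bracket" = "]"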

Menu Bar

| Item | What it does |
|---|---|
| Status | Current state with active engine and backend subtitle |
| Engine | Switch transcription engine in-place |
| Grammar | Switch grammar backend in-place |
| Replacements | Toggle; shows rule count |
| Retry Last / Copy Last | Re-transcribe or re-copy |
| Transcriptions | Recent entries; click to copy |
| Recordings | Audio files; click to reveal in Finder |
| Settings… | Full sidebar settings window |
| Service | Restart, Check for Updates, Open Service Log |
| Quit Local Whisper | Exit |

Settings

Sidebar layout with focused panels:

[Screenshot: Local Whisper macOS settings]

| Panel | Covers |
|---|---|
| Recording | Trigger key, double-tap window, audio cleanup, duration limits |
| Transcription | Engine picker plus per-engine sampling and decoding parameters |
| Grammar | Master toggle, backend picker, per-backend connection and limits |
| Voice | Text-to-speech voice and shortcut, dictation command help |
| Vocabulary | Searchable replacement editor with import / export |
| Output | Overlay, sounds, notifications, paste-at-cursor, history limit |
| Shortcuts | Proofread / rewrite / prompt-engineer keybindings, full cheatsheet |
| Activity | Sessions, words, 30-day chart, top words, top replacement triggers |
| Advanced | Storage paths, service log, doctor, restart, update |
| About | Version, credits, replay tutorial |

Settings are saved to ~/.whisper/config.toml. Fields that require a restart warn and offer an immediate restart.


Configuration

Settings are stored in ~/.whisper/config.toml. Edit them from the app, with wh config, or directly.

Common fields:

| Section | What to change |
|---|---|
| [hotkey] | Trigger key and double-tap timing |
| [transcription] | Active engine: parakeet_v3, qwen3_asr, or whisperkit |
| [grammar] | Grammar backend and enable/disable state |
| [audio] | VAD, noise reduction, normalization, pre-buffer, and duration limits |
| [ui] | Overlay, sounds, notifications, and auto-paste |
| [tts] | Text-to-speech toggle and shortcut |

See docs/configuration.md for the full TOML reference.
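A minimal sketch combining keys quoted elsewhere in this README (section names come from the table above; key names not quoted in this README, such as engine and backend, are assumptions):

[hotkey]
double_tap_threshold = 0.4   # default double-tap window, in seconds

[transcription]
engine = "parakeet_v3"       # assumption: key name; values per the table

[grammar]
backend = "ollama"           # assumption: key name

[ui]
show_overlay = true          # documented under Troubleshooting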


Privacy

Audio recording, transcription, grammar correction, replacements, and text-to-speech run on-device or against localhost services. Setup, model downloads, wh update, and wh doctor --fix can use the network to install packages, fetch models, or update the checkout.

| Component | Runs at |
|---|---|
| Parakeet-TDT v3 | In-process MLX |
| Qwen3-ASR | In-process MLX |
| Kokoro TTS | In-process MLX |
| WhisperKit | localhost:50060 |
| Apple Intelligence | On-device |
| Ollama | localhost:11434 |
| LM Studio | localhost:1234 |

Models cached at ~/.whisper/models/. Config and backups at ~/.whisper/.

After the required models and optional local services are installed, dictation and cleanup do not send audio or transcript text to cloud APIs.


Architecture

Python headless service (LaunchAgent). Swift owns all UI.

Python (LaunchAgent, headless)
  ├── Recording, transcription, grammar, replacements, clipboard, hotkeys
  ├── Text-to-Speech (Kokoro-82M, in-process)
  ├── IPC server at ~/.whisper/ipc.sock (Swift UI communication)
  ├── Command server at ~/.whisper/cmd.sock (CLI commands)
  ├── Pipeline watchdog (per-stage timeouts, skip-don't-freeze)
  └── Crash recovery (~/.whisper/processing.marker + current_session.jsonl)

Swift (subprocess, all UI)
  ├── Menu bar with grammar submenus and transcription history
  ├── Floating overlay pill (recording, processing, speaking states)
  ├── NSWorkspace sleep/wake observer → resync_audio IPC
  └── Settings window (auto-fetches Ollama / LM Studio models on open)

The LaunchAgent uses KeepAlive={SuccessfulExit=false} with ThrottleInterval=10 so real crashes auto-restart while clean stops (wh stop, wh restart, user-error exits) don't relaunch. Recordings longer than five minutes take the chunked pipeline: each VAD segment is transcribed, grammar-corrected, and persisted to ~/.whisper/current_session.jsonl before the next chunk runs, so a crash mid-lecture recovers the completed chunks on next boot instead of losing everything.
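A minimal Python sketch of that append-before-next-chunk pattern (function names here are hypothetical; the real logic lives in app_pipeline.py and backup.py):

import json
from pathlib import Path

SESSION = Path.home() / ".whisper" / "current_session.jsonl"

def run_chunked(segments, transcribe, correct):
    """Transcribe and persist each VAD segment before starting the next,
    so a crash mid-recording loses at most the in-flight chunk."""
    for i, segment in enumerate(segments):
        text = correct(transcribe(segment))   # per-chunk ASR + grammar pass
        with SESSION.open("a") as f:          # durable before the next chunk runs
            f.write(json.dumps({"chunk": i, "text": text}) + "\n")

def recover():
    """On startup, reload any chunks completed before a crash."""
    if not SESSION.exists():
        return []
    with SESSION.open() as f:
        return [json.loads(line)["text"] for line in f]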

Data flow
┌───────────────────────────────────────────────────────────┐
│  Microphone → pre-buffer (ring) + live capture            │
└──────────────────────────┬────────────────────────────────┘
                           ▼
┌───────────────────────────────────────────────────────────┐
│  Audio Processing                                         │
│  VAD → silence trim → noise reduction → normalize         │
└──────────────────────────┬────────────────────────────────┘
                           ▼
┌───────────────────────────────────────────────────────────┐
│  Transcription Engine                                     │
│                                                           │
│  Parakeet-TDT v3 (default) │ Qwen3-ASR   │ WhisperKit     │
│  Multilingual, MLX         │ English, MLX│ localhost:50060│
│  120s chunked              │ Long audio  │ Split at 28s   │
└──────────────────────────┬────────────────────────────────┘
                           ▼
┌───────────────────────────────────────────────────────────┐
│  Grammar Correction                                       │
│                                                           │
│  Apple Intelligence  │  Ollama        │  LM Studio        │
│  On-device           │  localhost LLM │  OpenAI-compatible│
└──────────────────────────┬────────────────────────────────┘
                           ▼
┌───────────────────────────────────────────────────────────┐
│  Text Replacements                                        │
│  Case-insensitive, word-boundary-aware regex              │
└──────────────────────────┬────────────────────────────────┘
                           ▼
┌───────────────────────────────────────────────────────────┐
│  Clipboard · Saved to ~/.whisper/                         │
│  (auto_paste: pasted at cursor, clipboard preserved)      │
└───────────────────────────────────────────────────────────┘

Troubleshooting

"This process is not trusted"

Grant Accessibility to the wh process, not your terminal app. System Settings opens automatically on first run.

If it didn't:

open x-apple.systempreferences:com.apple.preference.security?Privacy_Accessibility

Enable wh, then wh restart.

Double-tap not working

Tap twice within 0.4s (default). Adjust double_tap_threshold in config.

Apple Intelligence not working

Verify:

  1. macOS 26 or later
  2. Apple Silicon (M1/M2/M3/M4)
  3. Apple Intelligence enabled in System Settings > Apple Intelligence & Siri

Ollama not working

Verify:

  1. Ollama installed: ollama.com
  2. Model pulled: ollama pull gemma3:4b-it-qat
  3. Server running: ollama serve

LM Studio not working

Verify:

  1. LM Studio installed: lmstudio.ai
  2. A model is downloaded and loaded
  3. Local server is running (most common issue): Developer tab > Start Server
  4. Confirm with: curl http://localhost:1234/v1/models

Loading a model does not start the server.

Slow first transcription

setup.sh pre-downloads and warms the built-in local models used by transcription and TTS; it does not pull Ollama or LM Studio models for you. If you skip setup, the first transcription loads the models on demand. After that, they load from disk.

Empty transcription

  • Speak clearly, close to the microphone
  • Check microphone permissions in System Settings
  • Confirm the correct input device is selected

Overlay not showing

Check show_overlay = true in ~/.whisper/config.toml.


Development

python3 -m venv .venv && source .venv/bin/activate
pip install -e .

wh build              # Build Swift UI (one-time)
wh                    # Run the service
pytest tests/         # Run the full test suite

Mobile Apps

The Flutter app lives in src/flutter/local_whisper.

| Surface | Status | Notes |
|---|---|---|
| Flutter iOS app | Native transcription wired | Uses AVAudioEngine plus WhisperKit/Core ML through the native Swift bridge. |
| Flutter Android app | Native shell ready | Recording bridge, input method, setup flow, model management, history, modes, and QA seeding are in place. Production ASR adapter is still pending. |

See docs/mobile.md for setup flow, keyboard behavior, model packs, Android notes, and mobile checks.

Adding an Engine or Grammar Backend

Engines: implement TranscriptionEngine in engines/, register in ENGINE_REGISTRY. Grammar backends: implement GrammarBackend in backends/, register in BACKEND_REGISTRY.

Menu, CLI, and Settings auto-generate from the registries.
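A hedged sketch of a new engine (the method name and registry call are assumptions; mirror engines/base.py and engines/parakeet.py for the real signatures):

# engines/my_engine.py (hypothetical)
from .base import TranscriptionEngine

class MyEngine(TranscriptionEngine):
    name = "my_engine"

    def transcribe(self, wav_path: str) -> str:
        """Run ASR on the recorded WAV and return plain text."""
        raise NotImplementedError

# Register it so Menu, CLI, and Settings pick it up automatically
# (exact registry location assumed):
# ENGINE_REGISTRY["my_engine"] = MyEngine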

Project structure
local-whisper/
├── pyproject.toml
├── setup.sh
├── tests/
│   ├── test_flow.py
│   └── fixtures/
├── LocalWhisperUI/                  # Swift UI app
│   ├── Package.swift
│   └── Sources/LocalWhisperUI/
│       ├── AppMain.swift            # @main entry point
│       ├── AppState.swift           # Observable state, IPC handler, ConnectionState
│       ├── IPCClient.swift          # Unix socket client + connection-state publishing
│       ├── IPCMessages.swift        # Codable message types
│       ├── Theme.swift              # Typography, spacing, radii, tones, accents
│       ├── MenuBarView.swift        # Menu bar dropdown (status + connection state)
│       ├── OverlayWindowController.swift
│       ├── OverlayView.swift        # Floating pill: waveform + state
│       ├── OnboardingView.swift     # First-launch + replay tutorial
│       ├── SettingsView.swift       # Sidebar root + section enum
│       ├── RecordingPanel.swift     # Trigger key + audio cleanup
│       ├── TranscriptionPanel.swift # Engine picker + per-engine params
│       ├── GrammarPanel.swift       # Backend picker + Ollama / LM Studio / AI
│       ├── VoicePanel.swift         # TTS + dictation commands
│       ├── VocabularyPanel.swift    # Replacements editor (search, import / export)
│       ├── OutputPanel.swift        # Overlay, sounds, notifications, paste, history
│       ├── ShortcutsPanel.swift     # Text-transform keybindings + cheatsheet
│       ├── ActivityPanel.swift      # Usage stats with 30-day chart
│       ├── AdvancedPanel.swift      # Live status, storage, diagnostics, service control
│       ├── AboutView.swift          # Hero, credits, links, replay tutorial
│       ├── SharedViews.swift        # DeferredText fields, StatusPill, InlineNotice, headers
│       └── Constants.swift
├── src/flutter/local_whisper/        # Flutter iOS and Android app
│   ├── lib/
│   │   ├── main.dart
│   │   └── src/
│   │       ├── app.dart              # Tabs, record/history/modes/models/settings UI
│   │       ├── native_speech_service.dart
│   │       ├── model_store.dart      # Local model catalog/install state
│   │       ├── text_polisher.dart    # Offline cleanup and modes
│   │       ├── history_store.dart    # Local persistence
│   │       └── models.dart
│   ├── ios/Runner/
│   │   ├── LocalSpeechBridge.swift   # AVAudioEngine + WhisperKit bridge
│   │   ├── AppDelegate.swift
│   │   └── SceneDelegate.swift
│   ├── ios/LocalWhisperKeyboard/     # Native keyboard extension
│   ├── android/app/src/main/
│   │   ├── AndroidManifest.xml       # App identity, microphone, haptics, input method
│   │   ├── kotlin/info/gabrimatic/localwhisper/
│   │   │   ├── MainActivity.kt       # Method/event channels for recording and setup
│   │   │   └── LocalWhisperInputMethodService.kt
│   │   └── res/                      # Launch theme, keyboard XML, launcher icons
│   └── test/
└── src/whisper_voice/
    ├── app.py              # App class + service_main (imports mixins)
    ├── app_ipc.py          # IPCMixin: IPC send/receive
    ├── app_recording.py    # RecordingMixin: keyboard + recording lifecycle
    ├── app_pipeline.py     # PipelineMixin: transcription pipeline
    ├── app_commands.py     # CommandsMixin: CLI command handlers
    ├── app_switching.py    # SwitchingMixin: engine/backend switching
    ├── cli/                # CLI package (wh)
    │   ├── constants.py    # Colors, path constants
    │   ├── lifecycle.py    # start/stop/status
    │   ├── build.py        # Swift UI build, restart
    │   ├── settings.py     # engine/backend/replace commands
    │   ├── editor.py       # Interactive config TUI
    │   ├── client.py       # whisper/listen/transcribe socket client
    │   ├── doctor.py       # wh doctor + wh update
    │   └── main.py         # help, version, cli_main dispatcher
    ├── config/             # Config package
    │   ├── schema.py       # Dataclasses + DEFAULT_CONFIG
    │   ├── loader.py       # load_config, get_config, singleton
    │   ├── toml_helpers.py # _find/_replace_in_section, _serialize_toml_value
    │   └── mutations.py    # add/remove_replacement, update_config_field
    ├── ipc_server.py       # IPC server (Swift UI)
    ├── cmd_server.py       # Command server (CLI)
    ├── audio.py            # Recording and pre-buffer
    ├── audio_processor.py  # VAD, noise reduction, normalization
    ├── backup.py           # History persistence
    ├── grammar.py          # Grammar backend factory
    ├── transcriber.py      # Engine routing
    ├── utils.py            # Helpers
    ├── shortcuts.py        # Text transformation shortcuts
    ├── key_interceptor.py  # CGEvent tap
    ├── tts_processor.py    # TTS shortcut handler
    ├── tts/
    │   ├── base.py         # TTSProvider base
    │   └── kokoro_tts.py   # Kokoro provider (MLX)
    ├── engines/
    │   ├── base.py             # TranscriptionEngine base
    │   ├── parakeet.py         # Parakeet-TDT v3 (MLX, default)
    │   ├── qwen3_asr.py        # Qwen3-ASR (MLX)
    │   ├── whisperkit.py       # WhisperKit (localhost)
    │   ├── status.py           # Cache status + on-disk size reporting
    │   └── download_progress.py # HF preflight + inline progress bar IPC
    └── backends/
        ├── base.py         # Backend base
        ├── modes.py        # Transformation modes
        ├── ollama/
        ├── lm_studio/
        └── apple_intelligence/

Data stored in ~/.whisper/:

~/.whisper/
├── config.toml             # Settings
├── ipc.sock                # Python/Swift IPC
├── cmd.sock                # CLI commands
├── LocalWhisperUI.app      # Swift UI (built by setup.sh)
├── last_recording.wav
├── last_raw.txt            # Before grammar
├── last_transcription.txt  # Final text
├── audio_history/
├── history/                # Last 100 transcriptions
└── models/                 # Parakeet-TDT, Qwen3-ASR, Kokoro TTS

Credits

parakeet-mlx (MLX port of NVIDIA Parakeet) · Parakeet-TDT by NVIDIA NeMo · qwen3-asr-mlx (MLX port of Qwen3-ASR) · kokoro-mlx (MLX port of Kokoro-82M) · Qwen3-ASR by Qwen Team · Kokoro-82M · WhisperKit by Argmax · Apple Intelligence · Apple FM SDK · Ollama · LM Studio · SwiftUI

Legal notices

Trademarks

"Whisper" is a trademark of OpenAI. "Apple Intelligence" is a trademark of Apple Inc. "WhisperKit" is a trademark of Argmax, Inc. "Qwen" is a trademark of Alibaba Cloud. "Parakeet" and "NeMo" are trademarks of NVIDIA Corporation. "Ollama" and "LM Studio" are trademarks of their respective owners.

This project is not affiliated with, endorsed by, or sponsored by OpenAI, Apple, Argmax, Alibaba Cloud, NVIDIA, or any other trademark holder. All trademark names are used solely to describe compatibility with their respective technologies.

Third-Party Licenses

This project depends on pynput, licensed under LGPL-3.0. When installed via pip (the default), pynput is dynamically linked and fully compatible with this project's MIT license.

All other dependencies use MIT, BSD, or Apache 2.0 licenses. See each package for details.

License

MIT License. See LICENSE for details.


Created by Soroush Yousefpour

"Buy Me A Coffee"