Agent-first Rust ASR orchestration stack with adaptive backend routing, real-time NDJSON streaming, and SQLite-backed persistence.
curl -fsSL "https://raw.githubusercontent.com/Dicklesworthstone/franken_whisper/main/install.sh?$(date +%s)" | bash
Or build from source:
git clone https://github.com/Dicklesworthstone/franken_whisper.git
cd franken_whisper && cargo build --release
Speech-to-text pipelines are fragmented. You need whisper.cpp for speed, insanely-fast-whisper for GPU batching, and whisper-diarization for speaker identification. Each has its own CLI, output format, error handling, and deployment story. Orchestrating them from scripts means parsing inconsistent stdout, handling timeouts manually, and losing run history.
Agent workflows need structured, streaming, machine-readable output, not human-oriented terminal decorations that break when piped.
franken_whisper is a single Rust binary that wraps all three backends behind a unified interface with:
- Adaptive backend routing: Bayesian decision contract selects the best engine per-request with explicit loss matrix, posterior calibration, and deterministic fallback
- Real-time NDJSON streaming: every pipeline stage emits sequenced, timestamped events with stable schema (v1.0.0) for agent consumption
- Durable run history: every transcription is persisted to SQLite with full event logs, replay envelopes, and JSONL export/import
- Graceful cancellation: Ctrl+C propagates through the entire pipeline via cancellation tokens with proper resource cleanup
- TTY audio transport: low-bandwidth audio relay over PTY links using mulaw+zlib+base64 NDJSON with handshake, integrity checks, and deterministic retransmission
- Zero-dependency audio decode: MP3, AAC, FLAC, WAV, OGG decoded natively via symphonia with no ffmpeg needed for common formats
| Feature | whisper.cpp | insanely-fast-whisper | whisper-diarization | franken_whisper |
|---|---|---|---|---|
| Streaming output | partial | no | no | NDJSON stage events |
| Machine-readable errors | no | no | no | 12 structured error codes |
| Adaptive backend selection | -- | -- | -- | Bayesian routing |
| Run persistence | no | no | no | SQLite + JSONL |
| Diarization | no | yes (HF token) | yes | yes (any backend) |
| GPU acceleration | CUDA/Metal | CUDA/MPS | CUDA | frankentorch/frankenjax |
| Cancellation support | SIGKILL | SIGKILL | SIGKILL | graceful token-based |
| TTY audio relay | no | no | no | mulaw+zlib+b64 NDJSON |
| Native audio decode | WAV only | needs ffmpeg | needs ffmpeg | MP3/AAC/FLAC/WAV/OGG/ALAC |
| Memory safety | C++ | Python | Python | #![forbid(unsafe_code)] |
# Transcribe any audio file -- MP3/FLAC/OGG/AAC decoded natively, no ffmpeg needed
cargo run -- transcribe --input meeting.mp3 --json
# Transcribe a video file -- audio extracted automatically via ffmpeg fallback
cargo run -- transcribe --input presentation.mp4 --json
# Stream real-time pipeline events (agent mode)
cargo run -- robot run --input meeting.mp3 --backend auto
# Speculative streaming: fast partial results with quality corrections
cargo run -- robot run --input meeting.mp3 --speculative \
--fast-model tiny.en --quality-model large-v3
# Transcribe with speaker diarization
cargo run -- transcribe --input meeting.mp3 --diarize --hf-token "$HF_TOKEN" --json
# TinyDiarize: whisper.cpp's built-in speaker-turn detection (no HF token needed)
cargo run -- transcribe --input meeting.mp3 --tiny-diarize --json
# Discover available backends and their capabilities
cargo run -- robot backends
# System health check (backends, ffmpeg, database, resources)
cargo run -- robot health
# Query run history
cargo run -- runs --limit 10 --format json
# Export runs to portable JSONL snapshot (full or incremental)
cargo run -- sync export-jsonl --output ./snapshot
# TTY audio: encode, transmit over lossy link, decode
cargo run -- tty-audio encode --input audio.wav > frames.ndjson
cat frames.ndjson | cargo run -- tty-audio decode --output restored.wav
Every command produces structured NDJSON on stdout. Human-friendly output is the exception (plain transcribe mode), not the rule. The robot subcommand is the primary interface. It emits sequenced stage events with stable schema versioning so upstream agents can parse output without fragile regex.
Given identical inputs and parameters, franken_whisper produces identical outputs. The retransmit loop, replay envelopes, and conformance harness all enforce determinism. Random elements (UUIDs, timestamps) are isolated to metadata fields, never to computational outputs.
Every error has a structured code (FW-IO, FW-CMD-TIMEOUT, FW-BACKEND-UNAVAILABLE, etc.) and propagates through the NDJSON event stream. Cancellation tokens allow in-flight work to checkpoint and clean up rather than being killed mid-write.
The 10-stage pipeline (Ingest, Normalize, VAD, Separate, Backend, Accelerate, Align, Punctuate, Diarize, Persist) is composed dynamically per-request. Stages are skipped when unnecessary, budgeted independently, and profiled automatically.
The entire codebase uses #![forbid(unsafe_code)]. Memory safety is enforced at the compiler level, not by convention.
franken_whisper can transcribe MP3, AAC, FLAC, WAV, OGG, and other common audio files without ffmpeg, Python, or any other runtime dependency beyond the backend engine itself. The built-in Rust audio decoder (symphonia) handles format detection, codec decoding, sample rate conversion, and channel mixing entirely in-process. ffmpeg is only invoked as a fallback for video files and exotic codecs, and even then it is auto-provisioned if missing.
The whisper ecosystem has dozens of tools. This diagram shows where franken_whisper fits:
+--------------------------------------------------------------+
| INFERENCE ENGINES (run models) |
| |
| whisper.cpp (C++, CPU/Metal/CUDA, ~47k stars) |
| faster-whisper (Python/CTranslate2, ~14k stars) |
| OpenAI Whisper (Python/PyTorch, ~95k stars) |
+--------------------------------------------------------------+
|
+------------------------------v-------------------------------+
| ENHANCED PIPELINES (add features on top) |
| |
| WhisperX (faster-whisper + wav2vec2 + pyannote) |
| whisper-diarization (Whisper + Demucs + TitaNet) |
| insanely-fast-whisper (HF Transformers, max GPU) |
| whisper-timestamped (DTW word timestamps) |
+--------------------------------------------------------------+
|
+------------------------------v-------------------------------+
| ORCHESTRATION (manage engines/pipelines) |
| |
| > franken_whisper < (Rust, Bayesian routing, |
| 10-stage pipeline, speculative streaming, |
| conformance validation, evidence-based decisions) |
+--------------------------------------------------------------+
Most tools in the ecosystem occupy one level. franken_whisper occupies the orchestration level: it wraps inference engines and enhanced pipelines behind a unified interface, then adds capabilities that none of them provide individually.
| Capability | whisper.cpp | faster-whisper | WhisperX | WhisperLive | WhisperS2T | franken_whisper |
|---|---|---|---|---|---|---|
| Language | C++ | Python | Python | Python | Python | Rust |
| Multi-backend | -- | -- | -- | 3 backends | 4 backends | 3 backends + 3 native pilots |
| Backend selection | -- | -- | -- | manual | manual | Bayesian adaptive routing |
| Pipeline stages | monolithic | monolithic | 3-stage | monolithic | monolithic | 10 composable stages |
| Per-stage budgets | -- | -- | -- | -- | -- | independent timeouts |
| Speculative streaming | -- | -- | -- | single-model | -- | dual-model fast+quality |
| Conformance validation | -- | -- | -- | -- | -- | cross-engine 50ms tolerance |
| Native rollout governance | -- | -- | -- | -- | -- | 5-stage shadow->sole |
| Memory safety | C++ | Python GC | Python GC | Python GC | Python GC | #![forbid(unsafe_code)] |
| Capability | whisper.cpp | faster-whisper | WhisperX | WhisperLive | franken_whisper |
|---|---|---|---|---|---|
| Run history | none | none | none | none | SQLite + JSONL export |
| Decision audit trail | -- | -- | -- | -- | 200-entry evidence ledger |
| Replay envelopes | -- | -- | -- | -- | SHA-256 content hashing |
| Replay packs | -- | -- | -- | -- | 4-artifact reproducibility bundle |
| Structured errors | exit code | exceptions | exceptions | -- | 12 FW-* error codes |
| NDJSON streaming | partial | -- | -- | WebSocket | sequenced stage events |
| Cancellation | SIGKILL | KeyboardInterrupt | -- | -- | cooperative CancellationToken |
| Resource cleanup | none guaranteed | GC | GC | GC | RAII + bounded finalizers |
| Latency profiling | -- | -- | -- | -- | per-stage with tuning recs |
| Capability | whisper.cpp | faster-whisper | WhisperX | franken_whisper |
|---|---|---|---|---|
| Native audio decode | WAV only | -- (needs ffmpeg) | -- (needs ffmpeg) | MP3/AAC/FLAC/WAV/OGG/ALAC (symphonia) |
| ffmpeg required? | for non-WAV | yes | yes | no (fallback only) |
| Video audio extraction | -- | -- | -- | automatic (-vn flag) |
| TTY audio transport | -- | -- | -- | mulaw+zlib+b64 NDJSON |
| Microphone capture | -- | -- | -- | platform-specific ffmpeg |
| Auto-provision ffmpeg | -- | -- | -- | downloads static binary if missing |
| Capability | Groq Whisper API | Deepgram Nova-3 | AssemblyAI | franken_whisper |
|---|---|---|---|---|
| Runs locally | no | no | no | yes |
| Open source | no | no | no | yes (MIT) |
| Data leaves machine | yes | yes | yes | never |
| Cost per hour of audio | ~$0.04 | ~$0.75 | ~$0.65 | $0 (your hardware) |
| Inference speed | very fast | fast | moderate | depends on backend |
| Multi-model routing | -- | -- | -- | Bayesian adaptive |
| Diarization | limited | yes | yes | yes (any backend) |
| Custom pipeline stages | -- | -- | -- | 10 composable stages |
curl -fsSL "https://raw.githubusercontent.com/Dicklesworthstone/franken_whisper/main/install.sh?$(date +%s)" | bash
Options: --system (install to /usr/local/bin), --easy-mode (auto-update PATH), --verify (self-test), --version vX.Y.Z, --uninstall.
git clone https://github.com/Dicklesworthstone/franken_whisper.git
cd franken_whisper
# minimal build
cargo build --release
# with TUI support
cargo build --release --features tui
# with GPU acceleration
cargo build --release --features gpu-frankentorch
cargo build --release --features gpu-frankenjax
The release profile is optimized for size (opt-level = "z", LTO, single codegen unit, stripped symbols).
- Rust nightly (2024 edition)
- ffmpeg (optional): only needed for video files, exotic audio codecs symphonia cannot decode, and live microphone capture; the built-in Rust audio decoder handles MP3, AAC, FLAC, WAV, OGG, and other common formats natively with zero external dependencies
- Backend binaries (at least one):
  - `whisper-cli` (from whisper.cpp); override: `FRANKEN_WHISPER_WHISPER_CPP_BIN`
  - `insanely-fast-whisper` (Python); override: `FRANKEN_WHISPER_INSANELY_FAST_BIN`
  - `python3` with `pyannote.audio` (for diarization backend); override: `FRANKEN_WHISPER_PYTHON_BIN`
- HuggingFace token (for diarization): `--hf-token` or `FRANKEN_WHISPER_HF_TOKEN` / `HF_TOKEN`
franken_whisper depends on sibling projects via Cargo path dependencies:
../frankensqlite # SQLite persistence (fsqlite crate)
../frankentui # TUI (optional, feature: tui)
../frankentorch # GPU acceleration (optional, feature: gpu-frankentorch)
../frankenjax # GPU acceleration (optional, feature: gpu-frankenjax)
# plain text output
cargo run -- transcribe --input audio.mp3
# full JSON report (includes segments, timing, backend info)
cargo run -- transcribe --input audio.mp3 --json
# specific backend
cargo run -- transcribe --input audio.mp3 --backend whisper_cpp --json
# with language hint
cargo run -- transcribe --input audio.mp3 --language ja --json
# real-time NDJSON event stream
cargo run -- robot run --input audio.mp3 --backend auto
Output (one JSON object per line):
{"event":"run_start","schema_version":"1.0.0","request":{"input":"audio.mp3","backend":"auto"}}
{"event":"stage","schema_version":"1.0.0","run_id":"...","seq":1,"stage":"ingest","code":"ingest.start","message":"materializing input"}
{"event":"stage","schema_version":"1.0.0","run_id":"...","seq":2,"stage":"normalize","code":"normalize.ok","message":"audio normalized"}
{"event":"run_complete","schema_version":"1.0.0","run_id":"...","backend":"whisper_cpp","transcript":"Hello world..."}
cargo run -- transcribe \
--input meeting.mp3 \
--diarize \
--hf-token "$HF_TOKEN" \
--min-speakers 2 \
--max-speakers 5 \
--json
# record 30 seconds from default mic
cargo run -- transcribe --mic --mic-seconds 30 --json
# specific device
cargo run -- transcribe --mic --mic-device "hw:0" --json
# pipe audio bytes
cat audio.mp3 | cargo run -- transcribe --stdin --json
Core transcription command. Runs the full pipeline: ingest, normalize, backend execution, optional acceleration, and persistence.
cargo run -- transcribe [OPTIONS]
Input (mutually exclusive):
| Flag | Description |
|---|---|
| `--input <PATH>` | Audio/video file path |
| `--stdin` | Read audio bytes from stdin |
| `--mic` | Capture from microphone via ffmpeg |
Backend & Model:
| Flag | Default | Description |
|---|---|---|
| `--backend <KIND>` | `auto` | `auto`, `whisper_cpp`, `insanely_fast`, `whisper_diarization` |
| `--model <MODEL>` | backend-specific | Model name/path forwarded to backend |
| `--language <LANG>` | auto-detect | Language hint (ISO 639-1) |
| `--translate` | false | Translate to English |
| `--diarize` | false | Enable speaker diarization |
Output:
| Flag | Description |
|---|---|
| `--json` | Full JSON run report |
| `--output-txt` | Plain text (whisper.cpp) |
| `--output-vtt` | WebVTT subtitles |
| `--output-srt` | SRT subtitles |
| `--output-csv` | CSV |
| `--output-json-full` | Extended JSON with metadata |
| `--output-lrc` | LRC karaoke format |
Storage:
| Flag | Default | Description |
|---|---|---|
| `--db <PATH>` | `.franken_whisper/storage.sqlite3` | SQLite database path |
| `--no-persist` | false | Skip persistence |
Inference Tuning (whisper.cpp):
| Flag | Default | Description |
|---|---|---|
| `--threads <N>` | 4 | Computation threads |
| `--processors <N>` | 1 | Parallel processors |
| `--no-gpu` | false | Force CPU-only |
| `--beam-size <N>` | 5 | Beam search width |
| `--best-of <N>` | 5 | Sampling candidates |
| `--temperature <F>` | 0.0 | Sampling temperature |
| `--temperature-increment <F>` | -- | Temperature fallback increment |
| `--entropy-threshold <F>` | -- | Entropy threshold for fallback |
| `--logprob-threshold <F>` | -- | Log probability threshold |
| `--no-speech-threshold <F>` | -- | No-speech probability threshold |
| `--max-context <N>` | -- | Maximum context tokens from prior segment |
| `--max-segment-length <N>` | -- | Maximum segment length in characters |
| `--no-timestamps` | false | Suppress timestamps |
| `--detect-language-only` | false | Detect language and exit (no transcription) |
| `--split-on-word` | false | Split segments on word boundaries |
| `--no-fallback` | false | Disable temperature fallback |
| `--suppress-nst` | false | Suppress non-speech tokens |
| `--tiny-diarize` | false | Enable TinyDiarize (speaker-turn token injection) |
| `--prompt <TEXT>` | -- | Initial prompt to guide transcription style |
| `--carry-initial-prompt` | false | Carry prompt across segments |
Audio Windowing (whisper.cpp):
| Flag | Default | Description |
|---|---|---|
| `--offset-ms <N>` | 0 | Start transcription at offset (ms) |
| `--duration-ms <N>` | -- | Transcribe only this duration (ms) |
| `--audio-ctx <N>` | -- | Audio context size (tokens) |
| `--word-threshold <F>` | -- | Word-level timestamp confidence threshold |
| `--suppress-regex <REGEX>` | -- | Suppress tokens matching regex |
VAD (Voice Activity Detection):
| Flag | Default | Description |
|---|---|---|
| `--vad` | false | Enable Voice Activity Detection |
| `--vad-model <PATH>` | -- | Custom VAD model path |
| `--vad-threshold <F>` | -- | Speech detection threshold |
| `--vad-min-speech-ms <N>` | -- | Minimum speech duration (ms) |
| `--vad-min-silence-ms <N>` | -- | Minimum silence duration (ms) |
| `--vad-max-speech-s <F>` | -- | Maximum speech duration (seconds) |
| `--vad-speech-pad-ms <N>` | -- | Speech padding (ms) |
| `--vad-samples-overlap <N>` | -- | Sample overlap between windows |
Batching (insanely-fast-whisper):
| Flag | Default | Description |
|---|---|---|
| `--batch-size <N>` | 24 | Parallel inference batch size |
| `--gpu-device <DEV>` | auto | GPU device (`0`, `cuda:0`, `mps`) |
| `--flash-attention` | false | Enable Flash Attention 2 |
| `--hf-token <TOKEN>` | env | HuggingFace token for diarization |
| `--timestamp-level` | `chunk` | `chunk` or `word` granularity |
| `--transcript-path <PATH>` | -- | Override transcript output path |
Diarization:
| Flag | Description |
|---|---|
| `--num-speakers <N>` | Exact speaker count |
| `--min-speakers <N>` | Minimum speakers |
| `--max-speakers <N>` | Maximum speakers |
| `--no-stem` | Disable vocal isolation (Demucs source separation) |
| `--suppress-numerals` | Spell out numbers for alignment stability |
| `--diarization-model <MODEL>` | Override whisper model for diarization stage |
Speculative Streaming:
| Flag | Default | Description |
|---|---|---|
| `--speculative` | false | Enable dual-model speculative cancel-correct mode |
| `--fast-model <MODEL>` | -- | Fast model for low-latency partial transcripts |
| `--quality-model <MODEL>` | -- | Quality model for correction/verification |
| `--speculative-window-ms <N>` | 3000 | Sliding window size (ms) |
| `--speculative-overlap-ms <N>` | 500 | Window overlap (ms) |
| `--correction-tolerance-wer <F>` | -- | WER tolerance for confirmation vs. retraction |
| `--no-adaptive` | false | Disable adaptive window sizing |
| `--always-correct` | false | Force quality model on every window (evaluation mode) |
Agent-first interface with structured NDJSON output.
# streaming transcription with stage events
cargo run -- robot run [TRANSCRIBE_OPTIONS]
# emit JSON schema for all event types
cargo run -- robot schema
# discover backends and capabilities
cargo run -- robot backends
# system health diagnostics (backends, ffmpeg, database, resources)
cargo run -- robot health
# query routing decision history
cargo run -- robot routing-history [--run-id <ID>] [--limit 20]
Robot Event Types (12 total):
| Event | Description |
|---|---|
| `run_start` | Request accepted, pipeline starting |
| `stage` | Pipeline stage progress (sequenced, timestamped) |
| `run_complete` | Transcription finished with full result |
| `run_error` | Pipeline failed with structured error code |
| `backends.discovery` | Backend discovery response with per-backend capabilities |
| `routing_decision` | Backend routing decision with posterior snapshot and evidence |
| `health.report` | System health diagnostics (backend/ffmpeg/DB/resource status) |
| `transcript.partial` | Speculative fast-model partial transcript (immediate) |
| `transcript.confirm` | Quality model confirms partial (drift within tolerance) |
| `transcript.retract` | Quality model retracts partial (drift exceeds tolerance) |
| `transcript.correct` | Quality model correction with corrected segments |
| `transcript.speculation_stats` | Aggregate speculation pipeline statistics |
Stage Codes:
Stages emit paired *.start / *.ok codes (or *.error on failure, *.skip when not needed):
ingest.start, ingest.ok, normalize.start, normalize.ok, vad.start, vad.ok, separate.start, separate.ok, backend.start, backend.ok, backend.routing.decision_contract, accelerate.start, accelerate.ok, align.start, align.ok, punctuate.start, punctuate.ok, diarize.start, diarize.ok, persist.start, persist.ok, orchestration.latency_profile
Health Report: The robot health command probes all subsystems and returns a structured diagnostic:
{
"event": "health.report",
"schema_version": "1.0.0",
"ts": "2026-02-22T00:00:00Z",
"backends": [{"name": "whisper.cpp", "available": true, "path": null, "version": "1.7.2", "issues": []}],
"ffmpeg": {"name": "ffmpeg", "available": true, "path": "/usr/bin/ffmpeg", "version": null, "issues": []},
"database": {"name": "database", "available": true, "path": ".franken_whisper/storage.sqlite3", "version": null, "issues": []},
"resources": {"disk_free_bytes": 12345, "disk_total_bytes": 67890, "memory_available_bytes": 11111, "memory_total_bytes": 22222},
"overall_status": "ok"
}
Query persisted run history.
cargo run -- runs [--limit 20] [--format plain|json|ndjson] [--id <RUN_ID>]
| Flag | Default | Description |
|---|---|---|
| `--limit <N>` | 20 | Max recent runs |
| `--format` | `plain` | `plain` (table), `json` (pretty), `ndjson` (streaming) |
| `--id <UUID>` | -- | Fetch specific run details |
One-way JSONL snapshot export/import.
# export
cargo run -- sync export-jsonl --output ./snapshot [--db <PATH>]
# import
cargo run -- sync import-jsonl --input ./snapshot --conflict-policy reject|skip|overwrite|overwrite-strict
Export produces: runs.jsonl, segments.jsonl, events.jsonl, manifest.json (with SHA-256 checksums).
Low-bandwidth audio transport over TTY/PTY links using the mulaw+zlib+b64 NDJSON protocol.
# encode audio to NDJSON frames
cargo run -- tty-audio encode --input audio.wav [--chunk-ms 200]
# decode NDJSON frames to WAV
cat frames.ndjson | cargo run -- tty-audio decode --output restored.wav [--recovery fail_closed|skip_missing]
# generate retransmit plan from lossy stream
cat frames.ndjson | cargo run -- tty-audio retransmit-plan
# emit individual control frames
cargo run -- tty-audio control handshake
cargo run -- tty-audio control ack --up-to-seq 42
cargo run -- tty-audio control backpressure --remaining-capacity 64
cargo run -- tty-audio control retransmit-request --sequences 1,2,4
cargo run -- tty-audio control retransmit-response --sequences 1,2,4
# automated retransmit loop with strategy escalation
cat frames.ndjson | cargo run -- tty-audio control retransmit-loop --rounds 3
# convenience shorthands
cargo run -- tty-audio send-control handshake|eof|reset
cat frames.ndjson | cargo run -- tty-audio retransmit --rounds 3
Recovery Strategies:
The retransmit loop escalates recovery effort across rounds:
Simple (1 frame/round) -> Redundant (2 frames/round) -> Escalate (4 frames/round)
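The escalation chain above can be read as a small state table. The following is a minimal sketch, not the project's implementation: the type and function names are hypothetical, and it assumes the loop moves up exactly one strategy per round and then stays at the top, which the text above does not spell out.

```rust
/// Recovery strategies in the order the retransmit loop escalates through them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RecoveryStrategy {
    Simple,
    Redundant,
    Escalate,
}

impl RecoveryStrategy {
    /// Frames per round for this strategy, as listed in the escalation chain.
    fn frames_per_round(self) -> u32 {
        match self {
            RecoveryStrategy::Simple => 1,
            RecoveryStrategy::Redundant => 2,
            RecoveryStrategy::Escalate => 4,
        }
    }
}

/// ASSUMPTION: one escalation step per round (0-based), saturating at
/// `Escalate`. The real loop may escalate on other signals.
fn strategy_for_round(round: u32) -> RecoveryStrategy {
    match round {
        0 => RecoveryStrategy::Simple,
        1 => RecoveryStrategy::Redundant,
        _ => RecoveryStrategy::Escalate,
    }
}
```

Under this assumption, a `--rounds 3` run would use each strategy once, in order.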
Integrity Checks:
Each frame carries optional CRC32 and SHA-256 hashes of raw (pre-compression) audio bytes. Mismatches cause frame drops (skip_missing) or stream failure (fail_closed).
See docs/tty-audio-protocol.md for the full protocol specification.
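To make the integrity check concrete, here is a hedged sketch of how a decoder might verify a frame's CRC32 and apply the two recovery modes. It assumes the common IEEE 802.3 CRC-32 variant (reflected polynomial 0xEDB88320); the protocol specification is the authority on which variant is actually used. `RecoveryPolicy`, `FrameOutcome`, and `check_frame` are illustrative names, not the crate's API.

```rust
/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
/// ASSUMPTION: this is the variant the protocol uses.
fn crc32(bytes: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in bytes {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RecoveryPolicy {
    FailClosed,
    SkipMissing,
}

#[derive(Debug, PartialEq, Eq)]
enum FrameOutcome {
    Accept,
    Drop,
    AbortStream,
}

/// Checks the raw (pre-compression) audio bytes against an optional CRC32.
/// A frame without a checksum is accepted because the hashes are optional.
fn check_frame(raw: &[u8], expected_crc: Option<u32>, policy: RecoveryPolicy) -> FrameOutcome {
    match expected_crc {
        Some(expected) if expected != crc32(raw) => match policy {
            RecoveryPolicy::SkipMissing => FrameOutcome::Drop,
            RecoveryPolicy::FailClosed => FrameOutcome::AbortStream,
        },
        _ => FrameOutcome::Accept,
    }
}
```

The bitwise loop trades speed for brevity; a table-driven CRC would be the usual choice on a hot path.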
Interactive TUI for human operators (feature-gated, requires --features tui).
cargo run --features tui -- tui
Features:
- Live transcription view: Real-time segment display with auto-scroll behavior
- Speaker labels and timestamps: Each segment displays start/end times, speaker identification, and confidence scores
- Runs list: Browse persisted run history with timing and backend info
- Timeline view: Visual timeline of pipeline stages with duration bars
- Event detail panes: Inspect individual NDJSON events with full payload
- Segment retention: Caps display at 10,000 segments with oldest-first drain
- Keyboard navigation: Focus cycling between panes, vim-style keybindings
Built on the FrankenTUI framework.
| Variable | Default | Description |
|---|---|---|
| `FRANKEN_WHISPER_WHISPER_CPP_BIN` | `whisper-cli` | whisper.cpp binary name/path |
| `FRANKEN_WHISPER_INSANELY_FAST_BIN` | `insanely-fast-whisper` | insanely-fast-whisper binary |
| `FRANKEN_WHISPER_PYTHON_BIN` | `python3` | Python interpreter for diarization |
| `FRANKEN_WHISPER_HF_TOKEN` | -- | HuggingFace token (preferred over `HF_TOKEN`) |
| `HF_TOKEN` | -- | HuggingFace token (fallback) |
| `FRANKEN_WHISPER_DIARIZATION_DEVICE` | -- | GPU device for diarization backend |
| `FRANKEN_WHISPER_STATE_DIR` | `.franken_whisper` | State directory root |
| `FRANKEN_WHISPER_DB` | `.franken_whisper/storage.sqlite3` | SQLite database path |
| `FRANKEN_WHISPER_FFMPEG_BIN` | auto | Explicit ffmpeg binary path override |
| `FRANKEN_WHISPER_FFPROBE_BIN` | auto | Explicit ffprobe binary path override |
| `FRANKEN_WHISPER_AUTO_PROVISION_FFMPEG` | `1` | Auto-provision local ffmpeg/ffprobe bundle when system binaries are missing (`0`/`false` disables) |
| `FRANKEN_WHISPER_FORCE_FFMPEG_NORMALIZE` | `0` | Force file normalization through ffmpeg even when the built-in Rust decoder can handle the format (`1`/`true` enables) |
| `FRANKEN_WHISPER_NATIVE_EXECUTION` | `0` | Enable native in-process engine dispatch (`1`/`true`) |
| `FRANKEN_WHISPER_BRIDGE_NATIVE_RECOVERY` | `1` | In bridge-only mode, allow recoverable bridge failures to fall back to native engines (`0`/`false` disables) |
| `FRANKEN_WHISPER_NATIVE_ROLLOUT_STAGE` | `primary` | Native engine rollout stage |
| `RUST_LOG` | -- | tracing filter (e.g. `franken_whisper=debug`) |
| Feature | Description |
|---|---|
| `tui` | Enable interactive TUI via frankentui |
| `gpu-frankentorch` | Enable frankentorch GPU acceleration |
| `gpu-frankenjax` | Enable frankenjax GPU acceleration |
No features are enabled by default.
The auto backend uses adaptive Bayesian routing:
Non-diarization priority: whisper_cpp > insanely_fast > whisper_diarization
Diarization priority: insanely_fast > whisper_diarization > whisper_cpp
Each auto run emits a backend.routing.decision_contract stage event with explicit state/action/loss/posterior/calibration terms. The router falls back to deterministic static priority when calibration score drops below 0.3 or Brier score exceeds 0.35.
Native Rust engine replacements follow a staged rollout:
| Stage | Behavior |
|---|---|
| `shadow` | Deterministic bridge execution only; native conformance validated out-of-band |
| `validated` | Deterministic bridge execution only with stricter conformance gating |
| `fallback` | Deterministic bridge execution only; fallback policy and evidence paths hardened |
| `primary` | Native preferred with deterministic bridge fallback (requires `FRANKEN_WHISPER_NATIVE_EXECUTION=1`) |
| `sole` | Native only (requires `FRANKEN_WHISPER_NATIVE_EXECUTION=1`) |
+------------------------------------+
| CLI / Robot |
| (clap + NDJSON emit) |
+------------------------------------+
|
+-----------------v------------------+
| FrankenWhisperEngine |
| (orchestrator.rs) |
| |
| 10-Stage Pipeline: |
| 1. Ingest |
| 2. Normalize |
| 3. VAD |
| 4. Source Separate |
| 5. Backend Execution |
| 6. Accelerate (GPU) |
| 7. Alignment |
| 8. Punctuation |
| 9. Diarization |
| 10. Persist |
+------------------------------------+
| | |
+------------------+ +----------+ +------------------+
| Backends | | Accel | | Storage |
| | | | | |
| whisper.cpp | | frank | | fsqlite |
| insanely-fast | | torch | | (SQLite WAL) |
| whisper-diar | | frank | | |
| native pilots | | jax | | JSONL export |
+------------------+ +----------+ +------------------+
+------------------+ +------------------+ +------------------+
| TTY Audio | | Conformance | | Replay |
| | | | | |
| mulaw+zlib+b64 | | 50ms tolerance | | SHA-256 content |
| NDJSON transport | | cross-engine | | hash envelopes |
| handshake/retry | | comparator | | drift detection |
+------------------+ +------------------+ +------------------+
- Ingest: Materialize input from file, stdin, or microphone capture
- Normalize: Convert to 16kHz mono WAV via built-in Rust decoder (ffmpeg fallback for video/exotic formats)
- VAD: (Optional) Voice Activity Detection to skip silence
- Source Separate: (Optional) Vocal isolation for cleaner transcription
- Backend: Dispatch to selected engine (adaptive routing or explicit)
- Accelerate: (Optional) GPU confidence normalization via frankentorch/frankenjax
- Alignment: (Optional) Forced alignment for word-level timestamps
- Punctuation: (Optional) Punctuation restoration
- Diarization: (Optional) Speaker identification and labeling
- Persist: Write run report, segments, and events to SQLite
Each stage emits *.start and *.ok events to the NDJSON stream with timing, sequence numbers, and structured payloads.
When --backend auto is selected, franken_whisper uses a formal Bayesian decision contract to choose the best engine for each request rather than trying backends in a fixed order.
State Space (3 states):
- `all_available`: all three backends found on PATH and responsive
- `partial_available`: 1-2 backends operational
- `none_available`: nothing usable
Action Space (4 actions):
- `try_whisper_cpp`, `try_insanely_fast`, `try_diarization` (reordered per-request based on `--diarize`)
- `fallback_error`: return structured error when nothing is available
Loss Matrix:
The router maintains a 3x4 loss matrix (states x actions). Each cell contains an expected cost computed from three weighted factors:
cost = (0.45 x latency_cost) + (0.35 x quality_cost) + (0.20 x failure_cost)
- Latency cost scales with audio duration (short/medium/long buckets) and backend latency proxy
- Quality cost depends on backend capability relative to the request (diarization support, GPU availability)
- Failure cost is `(1.0 - p_success) x 100`, where `p_success` comes from the Bayesian posterior
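As a worked instance of the weighted cost, the sketch below evaluates one loss-matrix cell from its three factors. The weights and the failure-cost formula come from the text above; the function names and the sample inputs are illustrative assumptions, and how the real router scales `latency_cost` and `quality_cost` before weighting is not shown here.

```rust
// Weights from the loss-matrix formula above.
const W_LATENCY: f64 = 0.45;
const W_QUALITY: f64 = 0.35;
const W_FAILURE: f64 = 0.20;

/// Failure cost as described: (1.0 - p_success) x 100.
fn failure_cost(p_success: f64) -> f64 {
    (1.0 - p_success) * 100.0
}

/// Expected cost for one (state, action) cell.
/// ASSUMPTION: `latency_cost` and `quality_cost` arrive already scaled.
fn expected_cost(latency_cost: f64, quality_cost: f64, p_success: f64) -> f64 {
    W_LATENCY * latency_cost + W_QUALITY * quality_cost + W_FAILURE * failure_cost(p_success)
}
```

For example, with a latency cost of 20, a quality cost of 10, and `p_success = 0.7`, the cell evaluates to 0.45 x 20 + 0.35 x 10 + 0.20 x 30 = 18.5.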
Bayesian Posterior:
Each backend starts with a Beta distribution prior reflecting expected reliability:
| Backend | Prior | Interpretation |
|---|---|---|
| whisper_cpp | Beta(7, 3) | Strong expectation of success |
| insanely_fast | Beta(6, 4) | Moderate expectation |
| whisper_diarization | Beta(5, 5) | Weakest prior (most uncertain) |
After each run, the posterior is updated with the observed outcome.
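A minimal sketch of that update, assuming the textbook Beta-Bernoulli rule (a success adds 1 to alpha, a failure adds 1 to beta). The text above does not state the increment the router actually uses, so treat the unit step and the `BetaPosterior` type as assumptions.

```rust
/// Beta(alpha, beta) belief about a backend's success rate.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BetaPosterior {
    alpha: f64,
    beta: f64,
}

impl BetaPosterior {
    /// Posterior mean: alpha / (alpha + beta).
    fn mean(&self) -> f64 {
        self.alpha / (self.alpha + self.beta)
    }

    /// ASSUMPTION: standard conjugate update with a unit increment per run.
    fn observe(&mut self, success: bool) {
        if success {
            self.alpha += 1.0;
        } else {
            self.beta += 1.0;
        }
    }
}
```

Starting from the whisper_cpp prior Beta(7, 3), the mean is 0.70; one observed success moves it to 8/11 (about 0.727), and a following failure brings it back to 8/12 (about 0.667).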
Calibration & Fallback:
The router tracks a sliding window of 50 prediction-outcome pairs and computes a Brier score. The adaptive router falls back to deterministic static priority when any of these hold:
- Fewer than 5 observations (insufficient data)
- Calibration score < 0.3 (posterior margin too narrow)
- Brier score > 0.35 (predictions don't match reality)
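The three fallback conditions can be sketched as a bounded window plus one predicate. This is an illustration under assumptions: the Brier score is computed in its usual form (mean squared difference between predicted probability and the 0/1 outcome), the calibration score is taken as an input because its derivation is not given here, and `CalibrationWindow` is a hypothetical name.

```rust
use std::collections::VecDeque;

const WINDOW: usize = 50; // sliding window of prediction-outcome pairs
const MIN_OBSERVATIONS: usize = 5;
const MIN_CALIBRATION: f64 = 0.3;
const MAX_BRIER: f64 = 0.35;

struct CalibrationWindow {
    pairs: VecDeque<(f64, bool)>,
}

impl CalibrationWindow {
    fn new() -> Self {
        Self { pairs: VecDeque::with_capacity(WINDOW) }
    }

    /// Records a (predicted success probability, actual outcome) pair,
    /// evicting the oldest pair once the window is full.
    fn record(&mut self, predicted: f64, succeeded: bool) {
        if self.pairs.len() == WINDOW {
            self.pairs.pop_front();
        }
        self.pairs.push_back((predicted, succeeded));
    }

    /// Mean squared error between predictions and 0/1 outcomes.
    fn brier_score(&self) -> Option<f64> {
        if self.pairs.is_empty() {
            return None;
        }
        let sum: f64 = self
            .pairs
            .iter()
            .map(|&(p, ok)| (p - if ok { 1.0 } else { 0.0 }).powi(2))
            .sum();
        Some(sum / self.pairs.len() as f64)
    }

    /// True when the router should use deterministic static priority.
    fn should_fall_back(&self, calibration_score: f64) -> bool {
        if self.pairs.len() < MIN_OBSERVATIONS {
            return true;
        }
        let brier = self.brier_score().unwrap_or(f64::INFINITY);
        calibration_score < MIN_CALIBRATION || brier > MAX_BRIER
    }
}
```

Five confident predictions (0.9) that all succeed give a Brier score of about 0.01, so adaptive routing stays on; five more that all fail push it to about 0.41 and trigger the fallback.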
Latency Proxy Model:
Backend latency is estimated as a function of audio duration with per-backend parameters:
latency_cost = base + (sqrt(audio_duration_seconds) * multiplier)
| Backend | Base Cost | Multiplier (normal) | Multiplier (diarize) |
|---|---|---|---|
| whisper_cpp | 18.0 | 1.0 | 1.25 |
| insanely_fast | 8.0 | 1.0 | 1.25 |
| whisper_diarization | 18.0 | 1.0 | 1.25 |
When empirical latency data is available (>= 5 observations), the estimate blends prior and empirical: (0.6 * prior_latency) + (0.4 * empirical_latency).
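The proxy and the blend can be combined in a few lines. The sketch below uses the formula and the 0.6/0.4 blend from the text; the struct and function names are hypothetical, and representing the empirical data as a `(mean, observation_count)` pair is an assumption about how the router summarizes its history.

```rust
/// Per-backend latency parameters from the table above.
struct LatencyParams {
    base: f64,
    multiplier: f64,
}

/// latency_cost = base + sqrt(audio_duration_seconds) * multiplier
fn prior_latency(params: &LatencyParams, audio_duration_seconds: f64) -> f64 {
    params.base + audio_duration_seconds.sqrt() * params.multiplier
}

/// Blends prior and empirical estimates once at least 5 observations exist.
/// ASSUMPTION: `empirical` is (mean latency cost, number of observations).
fn blended_latency(prior: f64, empirical: Option<(f64, usize)>) -> f64 {
    match empirical {
        Some((mean, count)) if count >= 5 => 0.6 * prior + 0.4 * mean,
        _ => prior,
    }
}
```

For whisper_cpp (base 18.0, multiplier 1.0) and 100 seconds of audio, the prior is 18 + 10 = 28. With five or more observations averaging 20, the blended estimate is 0.6 x 28 + 0.4 x 20 = 24.8.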
Quality Proxy Model:
Each backend has a quality score that varies based on whether diarization is requested:
| Backend | Quality (normal) | Quality (diarize) |
|---|---|---|
| whisper_cpp | 0.84 | 0.55 |
| insanely_fast | 0.80 | 0.65 |
| whisper_diarization | 0.60 | 0.60 |
The quality score feeds into the posterior success probability: p_success = (alpha + quality_score * 2.0 + diarize_boost) / (alpha + beta + quality_terms + penalty_terms).
Availability Penalties:
The loss matrix applies sharp penalties when backends are unavailable:
| State | Penalty |
|---|---|
| Available | +0 |
| Partially available | +333 |
| Unavailable | +1,000 |
These penalties dominate the loss calculation, ensuring the router never selects an unavailable backend even if its quality/latency profile is otherwise attractive.
Policy Versioning:
The routing policy is versioned (backend-selection-v1.0). The loss_matrix_hash field in evidence entries enables detecting when the policy weights changed between runs, supporting reproducibility audits.
Evidence Ledger:
Every routing decision is recorded in a circular buffer (capacity: 200 entries) containing the decision ID, trace ID, observed state, chosen action, posterior snapshot, calibration metrics, and whether fallback was triggered.
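A fixed-capacity circular buffer of this kind is straightforward to model with `VecDeque`. The sketch below is illustrative only: `EvidenceEntry` carries a small subset of the fields listed above, and the type and method names are assumptions rather than the crate's actual definitions.

```rust
use std::collections::VecDeque;

const LEDGER_CAPACITY: usize = 200;

/// Illustrative subset of the fields an evidence entry carries.
#[derive(Debug, Clone, PartialEq)]
struct EvidenceEntry {
    decision_id: u64,
    chosen_action: String,
    fallback_triggered: bool,
}

struct EvidenceLedger {
    entries: VecDeque<EvidenceEntry>,
}

impl EvidenceLedger {
    fn new() -> Self {
        Self { entries: VecDeque::with_capacity(LEDGER_CAPACITY) }
    }

    /// Appends an entry, evicting the oldest one when the ledger is full.
    fn record(&mut self, entry: EvidenceEntry) {
        if self.entries.len() == LEDGER_CAPACITY {
            self.entries.pop_front();
        }
        self.entries.push_back(entry);
    }

    fn len(&self) -> usize {
        self.entries.len()
    }

    fn oldest(&self) -> Option<&EvidenceEntry> {
        self.entries.front()
    }
}
```

After 205 decisions the ledger holds the most recent 200, so the oldest surviving entry is the sixth one recorded.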
Each pipeline stage runs under an independent millisecond budget. Default budgets:
| Stage | Budget | Rationale |
|---|---|---|
| Ingest | 15s | File I/O or mic capture |
| Normalize | 180s | Audio decode + resample |
| VAD | 10s | Lightweight energy detection |
| Source Separate | 30s | Demucs-style vocal isolation |
| Backend | 900s (15 min) | Full inference (long audio on CPU) |
| Accelerate | 20s | GPU confidence normalization |
| Align | 30s | CTC forced alignment |
| Punctuate | 10s | Punctuation model inference |
| Diarize | 30s | Speaker clustering |
| Persist | 20s | SQLite transaction |
| Cleanup | 5s | Finalizer timeout |
Every budget is overridable via FRANKEN_WHISPER_STAGE_BUDGET_<STAGE>_MS environment variables.
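One plausible way to resolve an override is sketched below. It assumes `<STAGE>` is the upper-cased stage name (for example `BACKEND`) and that an unset or unparsable value falls back to the default; neither detail is confirmed above, and the helper names are hypothetical. The lookup is passed in as a closure so the logic can be exercised without touching the process environment.

```rust
/// Builds the override variable name for a stage.
/// ASSUMPTION: <STAGE> is the upper-cased stage name, e.g. "BACKEND".
fn budget_env_var(stage: &str) -> String {
    format!("FRANKEN_WHISPER_STAGE_BUDGET_{}_MS", stage.to_uppercase())
}

/// Resolves a stage budget in milliseconds.
/// ASSUMPTION: a missing or unparsable value keeps the default budget.
fn resolve_budget_ms(
    stage: &str,
    default_ms: u64,
    lookup: impl Fn(&str) -> Option<String>,
) -> u64 {
    lookup(&budget_env_var(stage))
        .and_then(|raw| raw.trim().parse::<u64>().ok())
        .unwrap_or(default_ms)
}
```

In a real process the lookup would typically be `|key| std::env::var(key).ok()`, for example `resolve_budget_ms("backend", 900_000, |key| std::env::var(key).ok())`.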
Automatic Latency Profiling:
After each run, the orchestrator emits an orchestration.latency_profile stage event with per-stage timing decomposition. The profiler computes a utilization ratio (service_ms / budget_ms) and emits tuning recommendations:
| Utilization | Recommendation |
|---|---|
| <= 30% | decrease_budget_candidate |
| 30-90% | keep_budget |
| >= 90% | increase_budget (suggest 1.25x current) |
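The classification in the table above can be sketched as a small pure function. This is a minimal illustration, not the project's profiler: the `BudgetRecommendation` enum and the `recommend` function are hypothetical names, and the handling of the exact 30% and 90% boundaries is an assumption.

```rust
/// Tuning recommendation derived from how much of a stage budget was used.
/// (Hypothetical type; variant names mirror the recommendation strings above.)
#[derive(Debug, PartialEq)]
pub enum BudgetRecommendation {
    DecreaseBudgetCandidate,
    KeepBudget,
    /// Carries the suggested new budget in milliseconds (1.25x current).
    IncreaseBudget { suggested_budget_ms: u64 },
}

/// Classify a stage by its utilization ratio (service_ms / budget_ms).
/// Assumption: exactly 30% counts as "decrease", exactly 90% as "increase".
pub fn recommend(service_ms: u64, budget_ms: u64) -> BudgetRecommendation {
    if budget_ms == 0 {
        // No meaningful ratio exists for a zero budget; leave it untouched.
        return BudgetRecommendation::KeepBudget;
    }
    let utilization = service_ms as f64 / budget_ms as f64;
    if utilization <= 0.30 {
        BudgetRecommendation::DecreaseBudgetCandidate
    } else if utilization >= 0.90 {
        BudgetRecommendation::IncreaseBudget {
            // 1.25x computed in integer arithmetic.
            suggested_budget_ms: budget_ms + budget_ms / 4,
        }
    } else {
        BudgetRecommendation::KeepBudget
    }
}
```

For example, a Backend stage that used 850s of its 900s budget would be classified as `IncreaseBudget` with a suggested budget of 1,125,000 ms.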
Every completed run produces a ReplayEnvelope containing SHA-256 hashes:
+-------------------------------------------------+
| ReplayEnvelope |
+-------------------------------------------------+
| input_content_hash: SHA-256(normalized WAV) |
| backend_identity: "whisper-cli-v1.7.2" |
| backend_version: "1.7.2" |
| output_payload_hash: SHA-256(raw backend JSON) |
+-------------------------------------------------+
Given identical input audio and the same backend version, the output hash should be identical. If it changes between runs, something drifted.
Self-contained replay packs capture everything needed to reproduce and analyze a run:
replay_pack/
env.json # EnvSnapshot: OS, arch, backend identity/version, fw version
manifest.json # PackManifest: trace_id, run_id, timestamps, content hashes
repro.lock # ReproLock: routing evidence, replay envelope, request params
tolerance_manifest.json # ToleranceManifest: schema version, timestamp tolerance
Replay Pack Artifact Details:
| File | Struct | Contents |
|---|---|---|
| env.json | EnvSnapshot | OS, architecture, backend identity/version, franken_whisper version (compile-time CARGO_PKG_VERSION) |
| manifest.json | PackManifest | trace_id, run_id, start/finish timestamps, input/output SHA-256 hashes, segment/event/evidence counts |
| repro.lock | ReproLock | Routing evidence chain, frozen replay envelope, original backend request, diarize flag |
| tolerance_manifest.json | ToleranceManifest | Schema version (tolerance-manifest-v1), timestamp tolerance in seconds, text/speaker exactness flags, native rollout stage, segment/event counts |
All four files are deterministic: the same input RunReport produces byte-identical output across runs and machines. This property is critical for regression detection: if the same audio produces different replay packs on different runs, something in the pipeline changed.
The conformance module enforces cross-engine compatibility using a 50ms canonical timestamp tolerance. Segment comparison counts violations:
| Violation Type | Condition |
|---|---|
| Text mismatch | Segment text differs at same index |
| Speaker mismatch | Speaker label differs (optional check) |
| Timestamp violation | start/end differs by > 50ms |
| Length mismatch | Different segment counts |
Includes overlap detection, WER approximation (Levenshtein-based), and segment invariant validation (finite timestamps, non-negative values, confidence in [0.0, 1.0], non-empty text).
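A Levenshtein-based WER approximation of the kind mentioned above can be sketched as follows. This is an illustrative version only: the function name `approx_wer`, whitespace tokenization, and the empty-reference convention are assumptions, not the conformance module's actual behavior.

```rust
/// Word error rate approximated as word-level Levenshtein distance divided by
/// the number of reference words. (Hypothetical helper for illustration.)
pub fn approx_wer(reference: &str, hypothesis: &str) -> f64 {
    let r: Vec<&str> = reference.split_whitespace().collect();
    let h: Vec<&str> = hypothesis.split_whitespace().collect();
    if r.is_empty() {
        // Convention chosen for this sketch: an empty reference scores 0.0
        // only when the hypothesis is empty too.
        return if h.is_empty() { 0.0 } else { 1.0 };
    }
    // Rolling-row dynamic programme: prev[j] = distance(r[..i], h[..j]).
    let mut prev: Vec<usize> = (0..=h.len()).collect();
    for (i, rw) in r.iter().enumerate() {
        let mut cur = vec![i + 1; h.len() + 1];
        for (j, hw) in h.iter().enumerate() {
            let substitution = prev[j] + usize::from(rw != hw);
            let deletion = prev[j + 1] + 1;
            let insertion = cur[j] + 1;
            cur[j + 1] = substitution.min(deletion).min(insertion);
        }
        prev = cur;
    }
    prev[h.len()] as f64 / r.len() as f64
}
```

One substituted word in a four-word reference yields 0.25; dropping two of four words yields 0.5.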
Dual-model streaming pattern for real-time transcription with quality corrections:
Audio Stream
|
+---> WindowManager (sliding windows with overlap)
| |
| +---> Fast Model ---> PartialTranscript (status: Pending)
| | |
| | v emit "transcript.partial" event
| |
| +---> Quality Model ---> CorrectionDrift analysis
| |
| +- drift below tolerance ---> "transcript.confirm"
| +- drift above tolerance ---> "transcript.retract" + corrected text
|
+---> CorrectionTracker (adaptive thresholds)
The CorrectionTracker maintains running drift statistics and adaptively adjusts confirmation thresholds.
Input audio is normalized to 16 kHz, mono, 16-bit PCM WAV:
Input file (any format)
|
+-> Built-in Rust decoder (PRIMARY)
| symphonia: MP3, AAC, FLAC, WAV, OGG, Vorbis, ALAC, PCM variants
| Resampler: linear interpolation to 16 kHz
| Channel mixer: stereo/surround -> mono via sample averaging
| Output: normalized_16k_mono.wav (PCM S16LE)
|
+-> ffmpeg subprocess (FALLBACK -- only if built-in decoder fails)
Triggered for: video files, exotic codecs (AC3, DTS, Opus-in-MKV, etc.)
Args: -hide_banner -loglevel error -y -i <input> -vn -ar 16000 -ac 1 -c:a pcm_s16le <output>
ffmpeg fallback chain:
- Explicit binary path (FRANKEN_WHISPER_FFMPEG_BIN)
- System-installed ffmpeg on PATH
- Auto-provisioned local binary (linux/x86_64)
- If all fail: FW-CMD-MISSING error with actionable message
Set FRANKEN_WHISPER_FORCE_FFMPEG_NORMALIZE=1 to bypass the built-in decoder and always use ffmpeg.
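The channel mixer and linear-interpolation resampler in the built-in decoder path can be illustrated with a short sketch. The function names below are hypothetical, samples are assumed to be already decoded to interleaved `f32` in the range [-1.0, 1.0], and no anti-aliasing filter is applied; the project's own implementation may differ in all of these respects.

```rust
/// Average interleaved multi-channel samples into mono.
pub fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Resample mono audio by linear interpolation between neighbouring samples.
pub fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == 0 || to_hz == 0 {
        return Vec::new();
    }
    let out_len = (input.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    let step = from_hz as f64 / to_hz as f64;
    (0..out_len)
        .map(|i| {
            // Fractional position of output sample i in the input signal.
            let pos = i as f64 * step;
            let left = (pos.floor() as usize).min(input.len() - 1);
            let right = (left + 1).min(input.len() - 1);
            let frac = (pos - left as f64) as f32;
            input[left] * (1.0 - frac) + input[right] * frac
        })
        .collect()
}

/// Convert a float sample in [-1.0, 1.0] to signed 16-bit PCM (S16LE payload).
pub fn to_pcm_s16(sample: f32) -> i16 {
    (sample.clamp(-1.0, 1.0) * i16::MAX as f32).round() as i16
}
```

A 48 kHz stereo input would pass through `downmix_to_mono(samples, 2)` and then `resample_linear(&mono, 48_000, 16_000)` before being written as PCM.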
The storage layer uses fsqlite (from the frankensqlite project) with three tables:
runs (run_id PK, started_at, finished_at, backend, input_path,
request_json, result_json, transcript, replay_json, ...)
segments (run_id FK, idx, start_sec, end_sec, speaker, text, confidence)
events (run_id FK, seq, ts_rfc3339, stage, code, message, payload_json)
Atomic Persistence with Retry:
All inserts are wrapped in a single transaction with 8 retry attempts and exponential backoff (5ms base). The cancellation token is checked before each COMMIT.
Cancellation-Safe Writes:
The token checkpoint pattern ensures no partial data reaches the database:
SAVEPOINT sp_persist_N
INSERT INTO runs ...
INSERT INTO segments ... (N rows)
INSERT INTO events ... (M rows)
token.checkpoint()? <-- rolls back if cancelled
RELEASE SAVEPOINT sp_persist_N
If the token fires between inserts, the savepoint rolls back cleanly. If the process is killed during RELEASE, SQLite's journal recovery handles it on next open. The storage layer uses savepoints (not top-level transactions) so that concurrent sessions can nest persist calls without deadlocking.
Schema Migrations:
When opening older databases missing expected columns (e.g., runs.replay_json, runs.acceleration_json), the storage layer performs a safe migration:
- Switch journal mode from WAL to DELETE (more reliable for DDL)
- Execute ALTER TABLE ... ADD COLUMN
- Restore WAL mode
- If migration fails: log the error, leave the database untouched
For severely corrupted databases, the recovery path is JSONL-based: export from a known-good source, create a fresh database, import via sync import-jsonl.
Each backend has a bridge adapter that spawns an external process and parses its output. The adapters normalize diverse output formats into a uniform TranscriptionResult.
whisper.cpp Bridge (whisper_cpp.rs):
Spawns whisper-cli (or FRANKEN_WHISPER_WHISPER_CPP_BIN) with the audio file and requested parameters. Parses the JSON output file looking for:
{
"text": "full transcript...",
"language": "en",
"segments": [
{
"start": 0.0,
"end": 2.5,
"text": "Hello world",
"confidence": 0.95
}
]
}
The parser handles multiple JSON layouts: "transcription", "segments", or "chunks" arrays. For word-level timestamps, it extracts from nested "words" arrays within each segment.
insanely-fast-whisper Bridge (insanely_fast.rs):
Spawns insanely-fast-whisper (or FRANKEN_WHISPER_INSANELY_FAST_BIN). Shares the same JSON segment extraction logic as whisper.cpp since both produce compatible output. Falls back to joining segment texts if the root "text" key is missing.
whisper-diarization Bridge (whisper_diarization.rs):
Spawns a Python script via python3 (or FRANKEN_WHISPER_PYTHON_BIN). Parses two output files:
- .txt file: Full transcript text
- .srt file: SRT subtitle format with speaker labels
The SRT parser handles timing in both comma (00:01:23,456) and dot (00:01:23.456) separator formats. Speaker labels are extracted from patterns like [SPEAKER_00], SPEAKER_00: text, spk0: text, or s0: text.
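Accepting both separator styles can be sketched with the standard library alone. The helpers below are hypothetical (`parse_srt_timestamp`, `parse_srt_timing`) and cover only the timing line; speaker-label extraction is not shown, and the bridge's real parser may be more lenient or stricter than this.

```rust
/// Parse an SRT timestamp such as "00:01:23,456" or "00:01:23.456" into seconds.
pub fn parse_srt_timestamp(ts: &str) -> Option<f64> {
    let ts = ts.trim();
    // Accept either ',' or '.' between seconds and milliseconds.
    let (hms, millis) = ts.rsplit_once(|c: char| c == ',' || c == '.')?;
    if millis.len() != 3 || !millis.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let mut parts = hms.split(':');
    let hours: u64 = parts.next()?.parse().ok()?;
    let minutes: u64 = parts.next()?.parse().ok()?;
    let seconds: u64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() || minutes > 59 || seconds > 59 {
        return None;
    }
    let millis: u64 = millis.parse().ok()?;
    Some((hours * 3600 + minutes * 60 + seconds) as f64 + millis as f64 / 1000.0)
}

/// Parse a cue timing line "start --> end" into (start_sec, end_sec).
pub fn parse_srt_timing(line: &str) -> Option<(f64, f64)> {
    let (start, end) = line.split_once("-->")?;
    Some((parse_srt_timestamp(start)?, parse_srt_timestamp(end)?))
}
```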
Every transcription produces a RunReport, the complete record of what happened:
RunReport
run_id: "fw-run-abc123"
trace_id: "1710000000000-random64"
started_at_rfc3339: "2026-03-17T06:00:00Z"
finished_at_rfc3339: "2026-03-17T06:00:05Z"
input_path: "/path/to/audio.mp3"
normalized_wav_path: "/tmp/normalized_16k_mono.wav"
request: TranscribeRequest { ... } -- full input parameters
result: TranscriptionResult { ... } -- backend output
transcript: "Hello world..."
segments: [TranscriptionSegment { ... }] -- timed chunks
language: Some("en")
acceleration: AccelerationReport { ... } -- confidence normalization metadata
events: [RunEvent { ... }] -- pipeline stage events (sequenced)
warnings: ["..."] -- non-fatal issues
evidence: [Value { ... }] -- routing decision evidence
replay: ReplayEnvelope { ... } -- SHA-256 hashes for deterministic replay
The report is both persisted to SQLite (split across runs, segments, and events tables) and optionally emitted as JSON via --json or as streaming NDJSON events in robot mode.
In robot mode (robot run), the pipeline emits events in real time via an mpsc channel:
+-------------------+
| CLI (main.rs) |
| |
| event_rx poll |<--+
| (every 40ms) | |
+-------------------+ |
| | mpsc channel
v |
+-------------------+ |
| stdout (NDJSON) | | StreamedRunEvent { run_id, event }
| one line per | |
| event | |
+-------------------+ |
|
+-------------------+ |
| Pipeline Worker |---+
| (background |
| thread) |
+-------------------+
The CLI thread polls the receive end of the channel every 40ms, formatting each event as a single NDJSON line on stdout. The pipeline worker thread runs transcribe_with_stream() which emits StreamedRunEvent wrappers containing (run_id, RunEvent) pairs. When the worker completes, the CLI emits a final run_complete or run_error event.
Schema Contract Guarantees:
| Guarantee | Enforcement |
|---|---|
| event and schema_version present on every event | Hardcoded in all emit_* functions |
| seq strictly increasing per run | Auto-incremented from events.len() |
| ts non-decreasing per run | Generated from Utc::now().to_rfc3339() |
| run_complete is always the final event | Emitted only after pipeline returns |
| Stage events follow pipeline order | Orchestrator executes stages sequentially |
The TTY audio protocol begins with a version and codec negotiation before any audio frames flow:
Encoder Decoder
| |
|-- Handshake { |
| min_version: 1, |
| max_version: 2, |
| supported_codecs: ["mulaw+zlib+b64"] |
| } --------------------------------------> |
| |
| <--------------------------------------- |
| HandshakeAck { |
| negotiated_version: 1, |
| negotiated_codec: "mulaw+zlib+b64" |
| } |
| |
|-- AudioFrame { seq: 0, ... } ------------> |
|-- AudioFrame { seq: 1, ... } ------------> |
| ... |
|-- SessionClose { last_data_seq: N } -----> |
| |
| <--- Ack { up_to_seq: N } |
Version Negotiation: The encoder advertises its supported version range. The decoder picks the highest version both support. If ranges don't overlap, the handshake fails.
Codec Negotiation: Currently only "mulaw+zlib+b64" is defined. The protocol is extensible; future codecs (e.g., opus+b64) can be added by extending the supported_codecs array.
Session Close: The encoder sends SessionClose { reason, last_data_seq } to signal end of stream. The decoder verifies it has received all frames up to last_data_seq. Missing frames trigger the retransmit protocol.
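The negotiation rules described above (highest mutually supported version, failure when ranges do not overlap, codec chosen from the offered list) can be sketched as two small functions. `VersionRange`, `negotiate_version`, and `negotiate_codec` are hypothetical names used for illustration, and preferring the first mutually supported codec in the encoder's order is an assumption.

```rust
/// Inclusive protocol version range advertised by one side of the link.
#[derive(Debug, Clone, Copy)]
pub struct VersionRange {
    pub min: u32,
    pub max: u32,
}

/// Pick the highest version both sides support, or None if ranges are disjoint.
pub fn negotiate_version(encoder: VersionRange, decoder: VersionRange) -> Option<u32> {
    let lowest_common = encoder.min.max(decoder.min);
    let highest_common = encoder.max.min(decoder.max);
    if lowest_common <= highest_common {
        Some(highest_common)
    } else {
        None
    }
}

/// Pick the first codec offered by the encoder that the decoder also supports.
pub fn negotiate_codec<'a>(offered: &[&'a str], supported: &[&str]) -> Option<&'a str> {
    offered
        .iter()
        .copied()
        .find(|codec| supported.iter().any(|s| *s == *codec))
}
```

With the handshake shown in the diagram, an encoder advertising versions 1 to 2 and a decoder that supports only version 1 would settle on version 1.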
The retransmit system is designed to be fully deterministic for testing and debugging:
- Given the same frame buffer and the same loss pattern, the output and report are byte-identical across runs
- There are no timing dependencies; timeout_ms is advisory (used for reporting) with no actual sleeps or waits
- Frame recovery proceeds in sequence-number order (not arrival order)
- Strategy escalation follows a fixed chain: Simple -> Redundant -> Escalate
- The inject_loss() method resets all prior recovery state, ensuring clean separation between test scenarios
This determinism enables comprehensive fuzz testing of the retransmit protocol without flaky timing-dependent test failures.
When ffmpeg is needed but not installed, franken_whisper can automatically download a static binary (Linux x86_64 only):
Source: https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
Flow:
- Check FRANKEN_WHISPER_AUTO_PROVISION_FFMPEG (default: 1 / enabled)
- Check if provisioned binary already exists at {state_root}/tools/ffmpeg/bin/ffmpeg
- If missing: download bundle via curl -fsSL or wget --quiet (whichever is available)
- Extract from .tar.xz archive via tar -xf
- Copy ffmpeg and ffprobe to {state_root}/tools/ffmpeg/bin/
- Set executable permissions (chmod 755)
- Verify the extracted binaries are executable
Safeguards:
- 180-second download timeout prevents hanging on slow connections
- Download is atomic: temp directory used during extraction, then moved into place
- Failure is non-fatal: logs a warning and continues (the built-in Rust decoder handles most audio formats anyway)
- Can be disabled entirely with FRANKEN_WHISPER_AUTO_PROVISION_FFMPEG=0
- Non-Linux/non-x86_64 platforms get an actionable error message explaining how to install ffmpeg manually
Ctrl+C
|
v
ctrlc handler
| sets AtomicBool (SeqCst)
v
CancellationToken.checkpoint()
| returns Err(Cancelled) at next checkpoint
v
Pipeline stage catches Cancelled
| rolls back any in-progress transaction
| cleans up temp files via finalizers
v
CLI exits with code 130 (128 + SIGINT)
| Code | Meaning |
|---|---|
| FW-IO | I/O error (file not found, permission denied) |
| FW-JSON | JSON serialization/deserialization failure |
| FW-CMD-MISSING | Required external binary not found on PATH |
| FW-CMD-FAILED | Backend subprocess exited with non-zero status |
| FW-CMD-TIMEOUT | Backend subprocess exceeded timeout |
| FW-BACKEND-UNAVAILABLE | No suitable backend found for request |
| FW-INVALID-REQUEST | Malformed or contradictory request parameters |
| FW-STORAGE | SQLite persistence error |
| FW-UNSUPPORTED | Requested feature not available |
| FW-MISSING-ARTIFACT | Expected output file not produced by backend |
| FW-CANCELLED | Operation cancelled via token or Ctrl+C |
| FW-STAGE-TIMEOUT | Pipeline stage exceeded its budget |
Robot Error Code Mapping:
In robot mode, the 12 internal error variants are grouped into 6 robot-specific codes for agent consumption:
| Robot Code | Internal Variants | When |
|---|---|---|
| FW-ROBOT-TIMEOUT | CommandTimedOut, StageTimeout | Any timeout during pipeline execution |
| FW-ROBOT-BACKEND | BackendUnavailable | No suitable backend found |
| FW-ROBOT-REQUEST | InvalidRequest | Malformed CLI arguments |
| FW-ROBOT-STORAGE | Storage | SQLite persistence failure |
| FW-ROBOT-CANCELLED | Cancelled | Ctrl+C or deadline cancellation |
| FW-ROBOT-EXEC | All others (Io, Json, CommandMissing, CommandFailed, Unsupported, MissingArtifact) | General execution failure |
This simplification lets agents handle errors with a small match table rather than parsing 12 variants.
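The grouping in the table above amounts to a single exhaustive match. The sketch below uses a hypothetical `FwErrorKind` enum whose variant names are taken from the "Internal Variants" column; the real error type presumably carries payloads (messages, paths, exit codes) that are omitted here.

```rust
/// Internal error variants, named after the "Internal Variants" column above.
/// (Payload-free stand-in for the real error type.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FwErrorKind {
    Io,
    Json,
    CommandMissing,
    CommandFailed,
    CommandTimedOut,
    BackendUnavailable,
    InvalidRequest,
    Storage,
    Unsupported,
    MissingArtifact,
    Cancelled,
    StageTimeout,
}

impl FwErrorKind {
    /// Map the 12 internal variants onto the 6 robot-facing codes.
    pub fn robot_code(self) -> &'static str {
        match self {
            Self::CommandTimedOut | Self::StageTimeout => "FW-ROBOT-TIMEOUT",
            Self::BackendUnavailable => "FW-ROBOT-BACKEND",
            Self::InvalidRequest => "FW-ROBOT-REQUEST",
            Self::Storage => "FW-ROBOT-STORAGE",
            Self::Cancelled => "FW-ROBOT-CANCELLED",
            Self::Io
            | Self::Json
            | Self::CommandMissing
            | Self::CommandFailed
            | Self::Unsupported
            | Self::MissingArtifact => "FW-ROBOT-EXEC",
        }
    }
}
```

Listing every variant instead of using a `_` arm means that adding a thirteenth variant produces a compile error until its robot code is chosen.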
Every backend (bridge or native) implements the Engine trait:
pub trait Engine: Send + Sync {
fn name(&self) -> &'static str; // "whisper.cpp", "insanely-fast-whisper", etc.
fn kind(&self) -> BackendKind; // WhisperCpp, InsanelyFast, WhisperDiarization
fn capabilities(&self) -> EngineCapabilities;
fn is_available(&self) -> bool; // PATH probe via `which` crate
fn run(
&self,
request: &TranscribeRequest,
normalized_wav: &Path,
work_dir: &Path,
timeout: Duration,
) -> FwResult<TranscriptionResult>;
}
EngineCapabilities describe what each backend supports:
| Capability | whisper.cpp | insanely-fast | whisper-diarization |
|---|---|---|---|
| supports_diarization | false | true (HF token) | true |
| supports_translation | true | true | false |
| supports_word_timestamps | true | true (word level) | false |
| supports_gpu | true (CUDA/Metal) | true (CUDA/MPS) | true (CUDA) |
| supports_streaming | false | false | false |
These capabilities feed into the Bayesian router's quality proxy: a backend that doesn't support a requested feature gets a lower quality score for that request.
Backend Availability Probing:
Availability is checked via the which crate (equivalent to running which whisper-cli on the command line):
pub fn command_exists(program: &str) -> bool {
which::which(program).is_ok()
}
Each backend can be overridden with an environment variable (FRANKEN_WHISPER_WHISPER_CPP_BIN, etc.), in which case the override path is checked directly for existence.
The process module provides three execution modes with increasing safety guarantees:
run_command -- fire and forget with captured output:
Spawn child -> wait -> return (stdout, stderr, exit_status)
run_command_with_timeout -- bounded execution:
Spawn child -> poll exit every 50ms -> if timeout: kill + return TimeoutError
run_command_cancellable -- full cooperative cancellation:
Spawn child
loop:
poll child.try_wait()
if exited: return output
token.checkpoint()? <-- if cancelled: kill child, return Err(Cancelled)
sleep 50ms
hard_timeout safety net: kill child regardless
The 50ms poll interval means cancellation response time is bounded to ~50ms. The child process receives SIGKILL (not SIGTERM), ensuring immediate termination of backend subprocesses that may be doing heavy GPU inference.
The TTY audio codec uses mu-law compression, a standard telephony algorithm that compresses 16-bit PCM to 8-bit with logarithmic companding:
Encoding (linear PCM -> mu-law):
1. Input: 16-bit signed integer sample
2. Clamp to [-32635, 32635] (mu-law representable range)
3. Add bias: sample = |sample| + 132
4. Find segment: position of highest set bit (determines compression curve)
5. Extract mantissa: 4 bits from the segment position
6. Combine: segment (3 bits) + mantissa (4 bits) + sign (1 bit) = 8 bits
7. Invert all bits (wire format convention)
Decoding (mu-law -> linear PCM):
1. Invert all bits
2. Extract sign, segment, mantissa
3. Reconstruct: ((mantissa << 3) + bias) << (segment + 1) - bias
4. Apply sign
This compression achieves ~2:1 ratio (16-bit -> 8-bit) while preserving speech intelligibility. Combined with zlib compression and base64 encoding, the full pipeline is:
Raw PCM (16-bit) -> mu-law (8-bit) -> zlib compress -> base64 encode -> NDJSON line
The inverse pipeline runs on decode. CRC32 and SHA-256 integrity hashes are computed on the raw (pre-compression) audio bytes, so corruption at any stage of the pipeline is detected.
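The companding step can be illustrated with a safe-Rust sketch. It follows the widely used G.711-style reference formulation (bias 132, clip 32635), in which decoding shifts by the segment number; the function names are hypothetical, and the project's codec may order or scale these steps differently. The zlib and base64 stages are omitted because they need external crates.

```rust
const BIAS: i32 = 132; // 0x84, added before segment search
const CLIP: i32 = 32635; // largest magnitude that survives the bias

/// Compress one 16-bit PCM sample to an 8-bit mu-law byte.
pub fn mulaw_encode(sample: i16) -> u8 {
    let mut magnitude = sample as i32;
    let sign = if magnitude < 0 {
        magnitude = -magnitude;
        0x80
    } else {
        0x00
    };
    magnitude = magnitude.min(CLIP) + BIAS;
    // Segment = index of the highest set bit, counted from bit 7 (0) to bit 14 (7).
    let mut segment = 7;
    let mut mask = 0x4000;
    while segment > 0 && (magnitude & mask) == 0 {
        segment -= 1;
        mask >>= 1;
    }
    // Mantissa = the 4 bits just below the segment's leading bit.
    let mantissa = (magnitude >> (segment + 3)) & 0x0F;
    // Wire format inverts all bits.
    !((sign | (segment << 4) | mantissa) as u8)
}

/// Expand one mu-law byte back to a 16-bit PCM sample.
pub fn mulaw_decode(byte: u8) -> i16 {
    let u = !byte;
    let segment = ((u >> 4) & 0x07) as i32;
    let mantissa = (u & 0x0F) as i32;
    let magnitude = ((((mantissa << 3) + BIAS) << segment) - BIAS) as i16;
    if u & 0x80 != 0 {
        -magnitude
    } else {
        magnitude
    }
}
```

The round trip is lossy by design: quantization error grows with amplitude, which is acceptable for speech because small signals keep fine resolution.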
Each audio frame is a single NDJSON line with this structure:
{
"protocol_version": 1,
"seq": 42,
"codec": "mulaw+zlib+b64",
"sample_rate_hz": 16000,
"channels": 1,
"payload_b64": "eJztwTEBAAAAwqD1T20ND...",
"crc32": 3141592653,
"payload_sha256": "a1b2c3d4e5f6..."
}
| Field | Type | Required | Description |
|---|---|---|---|
| protocol_version | u32 | yes | Protocol version (1 = audio, 2 = audio + transcript) |
| seq | u64 | yes | Strictly increasing sequence number |
| codec | string | yes | Compression codec identifier |
| sample_rate_hz | u32 | yes | Audio sample rate (always 16000 for whisper) |
| channels | u8 | yes | Channel count (always 1 for mono) |
| payload_b64 | string | yes | Base64-encoded compressed audio data |
| crc32 | u32 | optional | CRC32 of raw (pre-compression) audio bytes |
| payload_sha256 | string | optional | SHA-256 hex digest of raw audio bytes |
Control frames use the same NDJSON line format but with a "type" field instead of "seq":
{"type": "handshake", "min_version": 1, "max_version": 2, "supported_codecs": ["mulaw+zlib+b64"]}
{"type": "ack", "up_to_seq": 42}
{"type": "backpressure", "remaining_capacity": 64}
{"type": "session_close", "reason": "complete", "last_data_seq": 100}
The conformance comparator aligns expected vs. observed segment lists index-by-index:
Input: expected[0..N], observed[0..M], tolerance
1. If N != M: set length_mismatch = true
2. For i in 0..min(N, M):
a. Compare text:
if tolerance.require_text_exact && expected[i].text != observed[i].text:
text_mismatches += 1
b. Compare speaker:
if tolerance.require_speaker_exact && expected[i].speaker != observed[i].speaker:
speaker_mismatches += 1
c. Compare timestamps:
if |expected[i].start_sec - observed[i].start_sec| > tolerance.timestamp_tolerance_sec:
timestamp_violations += 1
if |expected[i].end_sec - observed[i].end_sec| > tolerance.timestamp_tolerance_sec:
timestamp_violations += 1
3. Return SegmentComparisonReport {
length_mismatch,
text_mismatches,
speaker_mismatches,
timestamp_violations,
segments_compared: min(N, M),
}
Default Tolerance Values:
| Parameter | Default | Meaning |
|---|---|---|
| timestamp_tolerance_sec | 0.05 (50ms) | Maximum acceptable timestamp drift |
| require_text_exact | true | Text must match exactly |
| require_speaker_exact | false | Speaker labels not required to match |
The 50ms timestamp tolerance (CANONICAL_TIMESTAMP_TOLERANCE_SEC) is the single source of truth across the entire codebase. Conformance tests, native engine rollout gates, and replay comparison all reference this constant.
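The comparator pseudocode and default tolerances translate fairly directly into Rust. The sketch below reuses the field names from the pseudocode; the struct definitions themselves (`Segment`, `Tolerance`, `SegmentComparisonReport`) and the `Segment::new` helper are simplified assumptions rather than the module's actual types.

```rust
#[derive(Debug, Clone)]
pub struct Segment {
    pub start_sec: f64,
    pub end_sec: f64,
    pub text: String,
    pub speaker: Option<String>,
}

impl Segment {
    /// Convenience constructor for a segment without a speaker label.
    pub fn new(start_sec: f64, end_sec: f64, text: &str) -> Self {
        Self { start_sec, end_sec, text: text.to_string(), speaker: None }
    }
}

#[derive(Debug, Clone, Copy)]
pub struct Tolerance {
    pub timestamp_tolerance_sec: f64,
    pub require_text_exact: bool,
    pub require_speaker_exact: bool,
}

impl Default for Tolerance {
    /// Defaults from the table above: 50ms, exact text, speakers optional.
    fn default() -> Self {
        Self { timestamp_tolerance_sec: 0.05, require_text_exact: true, require_speaker_exact: false }
    }
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct SegmentComparisonReport {
    pub length_mismatch: bool,
    pub text_mismatches: usize,
    pub speaker_mismatches: usize,
    pub timestamp_violations: usize,
    pub segments_compared: usize,
}

/// Compare segment lists index-by-index over the first min(N, M) entries.
pub fn compare_segments(expected: &[Segment], observed: &[Segment], tol: Tolerance) -> SegmentComparisonReport {
    let mut report = SegmentComparisonReport {
        length_mismatch: expected.len() != observed.len(),
        ..Default::default()
    };
    for (e, o) in expected.iter().zip(observed) {
        report.segments_compared += 1;
        if tol.require_text_exact && e.text != o.text {
            report.text_mismatches += 1;
        }
        if tol.require_speaker_exact && e.speaker != o.speaker {
            report.speaker_mismatches += 1;
        }
        // Start and end are checked separately, so one segment can add two violations.
        if (e.start_sec - o.start_sec).abs() > tol.timestamp_tolerance_sec {
            report.timestamp_violations += 1;
        }
        if (e.end_sec - o.end_sec).abs() > tol.timestamp_tolerance_sec {
            report.timestamp_violations += 1;
        }
    }
    report
}
```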
The pipeline is composed using a builder pattern rather than hardcoded stage lists:
// Default 10-stage pipeline
let config = PipelineBuilder::default_stages().build()?;
// Custom pipeline (skip stages you don't need)
let config = PipelineBuilder::new()
.stage(PipelineStage::Ingest)
.stage(PipelineStage::Normalize)
.stage(PipelineStage::Backend)
.stage(PipelineStage::Persist)
.build()?;
// Remove a stage from defaults
let config = PipelineBuilder::default_stages()
.without(PipelineStage::Vad)
.without(PipelineStage::Diarize)
.build()?;
The build() method validates the pipeline: it ensures Ingest comes before Normalize, Normalize comes before Backend, and Persist (if present) is last. build_unchecked() skips validation for testing.
The FinalizerRegistry ensures resources are cleaned up even on cancellation or panic:
enum Finalizer {
TempDir(PathBuf), // Remove temporary directory
Custom(Box<dyn Fn()>), // User-provided cleanup function
Process(u32), // Kill subprocess by PID
}
Execution semantics:
- Finalizers run in LIFO order (last registered, first cleaned up)
- run_all_bounded(budget_ms) enforces a per-finalizer timeout, so a hung cleanup cannot block shutdown indefinitely
- The default cleanup budget is 5 seconds (from the pipeline's Cleanup stage budget)
- Process finalizers send SIGKILL (immediate termination, no graceful shutdown for subprocesses)
- Temp directory finalizers use std::fs::remove_dir_all
- If a finalizer panics, the remaining finalizers still run (catch_unwind)
franken_whisper integrates several sibling crates from the FrankenSuite ecosystem:
franken_whisper
|
+-- fsqlite (frankensqlite) Pure-Rust SQLite implementation
| +-- fsqlite-types Core SQLite value types
|
+-- franken-kernel (asupersync) Budget, TraceId, time utilities
+-- franken-evidence (asupersync) Evidence ledger primitives
+-- franken-decision (asupersync) Decision contract framework
|
+-- [optional] ftui (frankentui) Terminal UI framework
+-- [optional] ft-api (frankentorch) GPU tensor operations
+-- [optional] ft-core (frankentorch)
+-- [optional] fj-api (frankenjax) JAX-based GPU compute
+-- [optional] fj-core (frankenjax)
Third-party dependencies (non-optional):
| Crate | Version | Purpose |
|---|---|---|
| clap | 4.5 | CLI argument parsing with derive macros |
| serde + serde_json | 1.0 | JSON serialization/deserialization |
| chrono | 0.4 | Timestamp handling (RFC-3339) |
| uuid | 1.15 | Run ID generation (v4 random) |
| sha2 | 0.10 | SHA-256 content hashing |
| crc32fast | 1.4 | CRC32 integrity checksums |
| base64 | 0.22 | Base64 encoding for TTY wire format |
| flate2 | 1.1 | Zlib compression (TTY audio, JSONL sync) |
| symphonia | 0.5 | Native audio decoding (MP3, AAC, FLAC, OGG, WAV) |
| hound | 3.5 | WAV file writing |
| which | 7.0 | Backend binary PATH discovery |
| ctrlc | 3.4 | Ctrl+C signal handling |
| tracing | 0.1 | Structured logging and diagnostics |
| thiserror | 2.0 | Error type derive macros |
| tempfile | 3.17 | Temporary file/directory management |
The codebase enforces strict linting beyond #![forbid(unsafe_code)]:
[lints.clippy]
enum_glob_use = "warn" # No wildcard enum imports
explicit_into_iter_loop = "warn" # Use .iter() not .into_iter() on references
explicit_iter_loop = "warn" # Prefer for x in &collection
flat_map_option = "warn" # Use .filter_map() instead of .flat_map() when the closure returns an Option
implicit_clone = "warn" # Prefer .clone() over implicit copies
semicolon_if_nothing_returned = "warn" # Consistent semicolons on unit functions
unused_self = "warn" # Flag methods that don't use self
All CI gates run cargo clippy --all-targets -- -D warnings, which promotes these warnings to hard errors. This prevents common Rust anti-patterns from accumulating in the codebase.
Why Bayesian routing over multi-armed bandits?
Multi-armed bandits (UCB, Thompson sampling) optimize for a single reward signal. Backend selection involves multiple conflicting objectives (latency, quality, failure risk) that vary per-request (diarization changes the optimal backend). The Bayesian decision contract with an explicit loss matrix handles this naturally: each (state, action) pair has a multi-factor cost, and the posterior captures per-backend reliability independent of the cost model. Bandits would need to collapse the multi-factor cost into a single scalar reward, losing the ability to reason about tradeoffs.
Why savepoints instead of top-level transactions?
Top-level BEGIN/COMMIT transactions don't nest in SQLite. If a caller is already inside a transaction (e.g., a concurrent session), a nested BEGIN either fails or starts an implicit savepoint depending on the SQLite driver. Explicit SAVEPOINT/RELEASE always nest correctly and make the isolation boundaries visible in the code. The naming convention (sp_persist_N, fw_session_name) provides debuggability when inspecting WAL state.
Why mu-law over Opus for TTY audio?
Opus is a superior audio codec, but it requires a native C library (libopus) which conflicts with #![forbid(unsafe_code)]. Mu-law is trivially implementable in safe Rust (bit manipulation only), universally understood by telephony systems, and sufficient for speech at 16 kHz. Combined with zlib compression, the bandwidth overhead vs. Opus is modest (~30% more) while maintaining the zero-unsafe-code guarantee. A future opus+b64 codec can be added via the protocol's codec negotiation without breaking existing deployments.
Why not whisper-rs (Rust FFI bindings)?
whisper-rs provides Rust bindings to the whisper.cpp C++ library via FFI. This is necessarily unsafe because the entire inference engine runs through a foreign function interface. franken_whisper takes a different approach: it orchestrates whisper.cpp as an external subprocess, preserving memory safety at the cost of subprocess overhead (~50ms per invocation). The native engine pilots (in-process Rust) are being developed as pure-Rust reimplementations that don't need FFI, with the 5-stage rollout governance ensuring quality parity before replacing the bridge adapters.
Why a 10-stage pipeline instead of a monolithic transcribe function?
Stage isolation provides three benefits. First, independent budgets: a slow normalize stage cannot eat into the backend's time budget. Second, observable progress: agents see exactly which stage is running via NDJSON events. Third, composability: the PipelineBuilder can skip stages that are not needed, avoiding unnecessary work. The overhead of stage management is negligible (~1ms per stage transition) compared to actual inference time (seconds to minutes).
Why NDJSON over WebSocket or gRPC?
NDJSON (newline-delimited JSON) has three advantages for agent consumption. First, zero dependencies: any language can parse it with a JSON library and readline(). Second, pipe-friendly: works with jq, grep, head, tail, and standard Unix tools. Third, TTY-safe: can flow over SSH, serial links, and PTY connections where binary protocols cannot. The tradeoff is higher bandwidth than binary protocols, but for a speech-to-text pipeline where the bottleneck is inference (not I/O), the difference is irrelevant.
Every adaptive controller in franken_whisper follows a formal "alien-artifact engineering contract," a design discipline that prevents adaptive systems from making unbounded bad decisions.
The problem it solves: Adaptive algorithms (Bayesian routers, auto-tuners, ML-based controllers) can behave unpredictably when their models are wrong. A Bayesian router with a bad prior will confidently make terrible decisions. An auto-tuner with noisy data will oscillate. The standard response is "just add more data" or "tune the hyperparameters," but for a CLI tool that runs on user machines, there's no ops team watching dashboards.
The contract requires every adaptive controller to declare:
| Component | Purpose | Example (Backend Router) |
|---|---|---|
| State space | What does the controller observe? | 3 availability states (all/partial/none) |
| Action space | What can it decide? | 4 actions (try each backend + error) |
| Loss matrix | What's the cost of each state x action? | 3x4 matrix: latency(45%) + quality(35%) + failure(20%) |
| Posterior terms | How confident is the model? | Beta distribution per backend |
| Calibration metric | How accurate are predictions? | Brier score on 50-observation sliding window |
| Deterministic fallback | What happens when the model is wrong? | Static priority list |
| Fallback trigger | When does fallback activate? | Brier > 0.35 or calibration < 0.3 or < 5 observations |
| Evidence ledger | Audit trail of all decisions | Circular buffer of 200 RoutingEvidenceLedgerEntry records |
Why this matters: The contract guarantees bounded worst-case behavior. Even if the Bayesian model is perfectly miscalibrated, the system falls back to a simple priority list that always works. The evidence ledger makes every decision inspectable after the fact. The loss matrix makes the tradeoffs explicit rather than buried in code.
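The calibration metric and fallback trigger from the table can be sketched with a bounded window of (predicted probability, outcome) pairs. Everything named here is hypothetical (`CalibrationWindow`, `should_fall_back`), and because the document does not define how the separate calibration score is computed, the sketch takes it as an input parameter.

```rust
use std::collections::VecDeque;

const WINDOW: usize = 50; // sliding window size from the contract table
const MIN_OBSERVATIONS: usize = 5;
const MAX_BRIER: f64 = 0.35;
const MIN_CALIBRATION: f64 = 0.3;

/// Sliding window of (predicted success probability, actual outcome) pairs.
#[derive(Default)]
pub struct CalibrationWindow {
    observations: VecDeque<(f64, bool)>,
}

impl CalibrationWindow {
    /// Record one outcome, evicting the oldest entry once the window is full.
    pub fn record(&mut self, predicted_success: f64, succeeded: bool) {
        if self.observations.len() == WINDOW {
            self.observations.pop_front();
        }
        self.observations.push_back((predicted_success, succeeded));
    }

    pub fn len(&self) -> usize {
        self.observations.len()
    }

    /// Brier score: mean squared error between prediction and 0/1 outcome.
    pub fn brier_score(&self) -> Option<f64> {
        if self.observations.is_empty() {
            return None;
        }
        let total: f64 = self
            .observations
            .iter()
            .map(|&(p, ok)| (p - if ok { 1.0 } else { 0.0 }).powi(2))
            .sum();
        Some(total / self.observations.len() as f64)
    }

    /// Fallback fires on too few observations, a poor Brier score, or a low
    /// externally supplied calibration score.
    pub fn should_fall_back(&self, calibration_score: f64) -> bool {
        if self.observations.len() < MIN_OBSERVATIONS {
            return true;
        }
        let brier = self.brier_score().unwrap_or(f64::INFINITY);
        brier > MAX_BRIER || calibration_score < MIN_CALIBRATION
    }
}
```

A router that keeps predicting 0.9 for a backend that keeps failing accumulates a Brier score of 0.81, well past the 0.35 threshold, so the static priority list takes over.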
Controllers using this contract:
- Backend router (Bayesian backend selection)
- Adaptive bitrate controller (TTY audio link quality)
- Budget tuner (pipeline stage timeout recommendations)
- Correction tracker (speculation confirmation thresholds)
- Speculative window controller (adaptive window sizing)
The 10-stage pipeline is not a hardcoded sequence. It is composed dynamically per-request based on the input source, backend capabilities, and user flags.
PipelineCx (Pipeline Context):
Every pipeline run creates a PipelineCx that carries shared state through all stages:
| Field | Type | Purpose |
|---|---|---|
| trace_id | TraceId | Unique identifier from (timestamp_ms, random_u64) |
| deadline | Option<DateTime<Utc>> | Absolute wall-clock deadline for the entire pipeline |
| budget | Budget | Remaining time budget (decremented by stage service times) |
| evidence | Vec<Value> | JSON evidence accumulator for post-hoc analysis |
| finalizers | FinalizerRegistry | Cleanup handlers run on pipeline exit (bounded to 5s) |
CancellationToken (Copy + Send + Sync):
A lightweight handle extracted from PipelineCx for passing into background threads and subprocess monitors:
struct CancellationToken {
deadline: Option<DateTime<Utc>>,
}
The token's checkpoint() method checks two conditions: (1) has Ctrl+C been pressed (global AtomicBool), and (2) has the deadline passed. If either is true, it returns Err(Cancelled). This is polled cooperatively: stages call checkpoint() at safe points (between loop iterations, before COMMIT, after subprocess completion).
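The two-condition checkpoint can be sketched using only the standard library. This version is an approximation under stated assumptions: it uses a monotonic `std::time::Instant` deadline instead of the `DateTime<Utc>` shown above, the `INTERRUPTED` flag stands in for whatever the ctrlc handler sets, and `sum_chunks` is a made-up stage body that shows where checkpoints go.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::{Duration, Instant};

/// Process-wide flag that a Ctrl+C handler would set (assumed name).
pub static INTERRUPTED: AtomicBool = AtomicBool::new(false);

#[derive(Debug, PartialEq, Eq)]
pub struct Cancelled;

/// Copyable handle; only the optional deadline is stored per token.
#[derive(Debug, Clone, Copy)]
pub struct CancellationToken {
    deadline: Option<Instant>,
}

impl CancellationToken {
    pub fn new(deadline: Option<Instant>) -> Self {
        Self { deadline }
    }

    pub fn with_timeout(timeout: Duration) -> Self {
        Self { deadline: Some(Instant::now() + timeout) }
    }

    /// Err(Cancelled) if the interrupt flag is set or the deadline has passed.
    pub fn checkpoint(&self) -> Result<(), Cancelled> {
        if INTERRUPTED.load(Ordering::SeqCst) {
            return Err(Cancelled);
        }
        match self.deadline {
            Some(deadline) if Instant::now() >= deadline => Err(Cancelled),
            _ => Ok(()),
        }
    }
}

/// Example stage body: poll the token between loop iterations.
pub fn sum_chunks(token: CancellationToken, chunks: &[u32]) -> Result<u32, Cancelled> {
    let mut total = 0;
    for chunk in chunks {
        token.checkpoint()?;
        total += chunk;
    }
    Ok(total)
}
```

Because the token is `Copy`, it can be handed to worker threads and subprocess monitors without reference counting; the shared state lives in the global flag.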
Stage Budget Isolation:
Each stage has an independent timeout budget. A slow normalization stage cannot eat into the backend's time budget. Budgets are configured via environment variables (FRANKEN_WHISPER_STAGE_BUDGET_<STAGE>_MS) and profiled automatically. After each run, the orchestrator computes utilization ratios and emits tuning recommendations: decrease_budget_candidate (<=30% utilized), keep_budget (30-90%), or increase_budget (>=90%, suggests 1.25x current).
Dynamic Stage Composition:
Not every run executes all 10 stages. The pipeline skips stages that aren't needed:
| Condition | Skipped Stages |
|---|---|
| Input is already 16kHz mono WAV | Normalize (passthrough) |
| No --diarize flag | Diarize |
| No --vad flag | VAD |
| No GPU features compiled | Accelerate (CPU fallback used inline) |
| --no-persist flag | Persist |
| Backend doesn't support alignment | Align |
| --no-stem flag set | Source Separate |
| VAD detects only silence | All post-Backend stages |
Skipped stages still emit *.skip events to the NDJSON stream so agents can distinguish "not needed" from "failed."
The acceleration stage normalizes per-segment confidence scores into a proper probability distribution. Raw backend confidences are often uncalibrated; whisper.cpp and insanely-fast-whisper use different scoring scales, so normalization is necessary for meaningful cross-backend comparison.
Algorithm:
- Extract confidence values from all segments
- Replace missing/invalid values (NaN, infinity, zero, negative) with a text-length-based baseline: ln(1 + char_count) + 1.0
- Compute pre-mass: sum(confidences) before normalization
- Apply softmax normalization (GPU path via frankentorch/frankenjax, or CPU fallback)
- Compute post-mass: sum(normalized) (should equal 1.0)
- Record both masses in the AccelerationReport for validation
Numerically Stable Softmax (CPU path):
max_val = max(finite values) -- prevent overflow
exps[i] = exp(value[i] - max_val) -- shift by max
output[i] = exps[i] / sum(exps) -- normalize to sum=1.0
Non-finite values (NaN, infinity) map to 0.0 in the output. If the sum is near-zero (all values are degenerate), the result falls back to a uniform distribution 1/N.
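A minimal CPU-path sketch of the baseline substitution and the guarded softmax described above. The function names (`baseline_confidence`, `sanitize_confidence`, `stable_softmax`) are illustrative and not taken from the crate.

```rust
/// Text-length-based baseline for missing or invalid confidences: ln(1 + char_count) + 1.0.
pub fn baseline_confidence(text: &str) -> f64 {
    (1.0 + text.chars().count() as f64).ln() + 1.0
}

/// Keeps a raw confidence only if it is finite and strictly positive.
pub fn sanitize_confidence(raw: Option<f64>, text: &str) -> f64 {
    match raw {
        Some(c) if c.is_finite() && c > 0.0 => c,
        _ => baseline_confidence(text),
    }
}

/// Numerically stable softmax: non-finite inputs map to 0.0, and a degenerate
/// input (no finite values, or a near-zero sum) falls back to uniform 1/N.
pub fn stable_softmax(values: &[f64]) -> Vec<f64> {
    let n = values.len();
    if n == 0 {
        return Vec::new();
    }
    let uniform = vec![1.0 / n as f64; n];
    let max_val = values
        .iter()
        .copied()
        .filter(|v| v.is_finite())
        .fold(f64::NEG_INFINITY, f64::max);
    if !max_val.is_finite() {
        return uniform;
    }
    // Shifting by the maximum keeps every exponent <= 0, so exp() cannot overflow.
    let exps: Vec<f64> = values
        .iter()
        .map(|&v| if v.is_finite() { (v - max_val).exp() } else { 0.0 })
        .collect();
    let sum: f64 = exps.iter().sum();
    if sum < f64::EPSILON {
        return uniform;
    }
    exps.iter().map(|e| e / sum).collect()
}
```

Summing the returned vector gives the post-mass, which should be 1.0 up to floating-point rounding.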
Acceleration Paths:
| Path | Trigger | Method |
|---|---|---|
| frankentorch | --features gpu-frankentorch | Tensor softmax via FrankenTorchSession |
| frankenjax | --features gpu-frankenjax | JAX-based normalization via fj_api |
| CPU fallback | no GPU features | Numerically stable softmax with NaN/inf guards |
The transition from external bridge adapters (spawning whisper-cli, python3) to in-process native Rust engines follows a 5-stage rollout with conformance gating at each stage. This prevents a buggy native engine from silently degrading transcription quality.
Rollout Stages:
Shadow --> Validated --> Fallback --> Primary --> Sole
|          |             |            |           |
|          |             |            |           +- Native only, bridge removed
|          |             |            +- Native preferred, bridge fallback on error
|          |             +- Bridge preferred, native fallback hardened
|          +- Bridge only, stricter conformance gating
+- Bridge only, native conformance validated out-of-band
Conformance Gate: At each stage transition, the conformance harness compares native vs. bridge output on a test corpus. The 50ms canonical timestamp tolerance is the single source of truth. A native engine that produces timestamps >50ms different from the bridge adapter for the same audio is blocked from promotion.
Segment Validation Rules:
- Timestamps must be finite (no NaN, no infinity)
- Start and end times must be non-negative
- Start must be <= end
- No overlapping segments (configurable epsilon: 1 microsecond default)
- Confidence scores must be in [0.0, 1.0]
- Text must be non-empty
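The rules above translate into a straightforward validator. This is a sketch under assumptions: the `Segment` struct, the `validate_segments` function, and the violation messages are hypothetical, with field names borrowed from the segment schema described elsewhere in this document.

```rust
/// Hypothetical segment shape used only for this illustration.
#[derive(Debug, Clone)]
pub struct Segment {
    pub start_sec: f64,
    pub end_sec: f64,
    pub text: String,
    pub confidence: Option<f64>,
}

/// Default overlap epsilon: 1 microsecond, expressed in seconds.
pub const DEFAULT_OVERLAP_EPSILON_SEC: f64 = 1e-6;

/// Returns one message per violated rule; an empty vector means the segments are valid.
pub fn validate_segments(segments: &[Segment], epsilon_sec: f64) -> Vec<String> {
    let mut violations = Vec::new();
    for (i, seg) in segments.iter().enumerate() {
        if !seg.start_sec.is_finite() || !seg.end_sec.is_finite() {
            violations.push(format!("segment {i}: non-finite timestamp"));
        } else {
            if seg.start_sec < 0.0 || seg.end_sec < 0.0 {
                violations.push(format!("segment {i}: negative timestamp"));
            }
            if seg.start_sec > seg.end_sec {
                violations.push(format!("segment {i}: start after end"));
            }
            // Overlap: this segment ends after the next one starts, beyond epsilon.
            if let Some(next) = segments.get(i + 1) {
                if next.start_sec.is_finite() && seg.end_sec > next.start_sec + epsilon_sec {
                    violations.push(format!("segment {i}: overlaps segment {}", i + 1));
                }
            }
        }
        if let Some(c) = seg.confidence {
            // A NaN confidence fails the range check and is reported as well.
            if !(0.0..=1.0).contains(&c) {
                violations.push(format!("segment {i}: confidence outside [0.0, 1.0]"));
            }
        }
        if seg.text.trim().is_empty() {
            violations.push(format!("segment {i}: empty text"));
        }
    }
    violations
}
```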
Runtime Control:
Two environment variables jointly control native engine behavior:
- FRANKEN_WHISPER_NATIVE_ROLLOUT_STAGE: which stage the deployment is at
- FRANKEN_WHISPER_NATIVE_EXECUTION: whether native dispatch is enabled at runtime (0/1)
Both must agree for native engines to actually execute. Setting NATIVE_EXECUTION=1 with stage shadow has no effect; the stage gate prevents native execution regardless of the runtime flag.
Execution Path Metadata:
Every backend.ok and replay.envelope stage event includes explicit execution-path metadata: implementation (bridge or native), execution_mode, native_rollout_stage, and native_fallback_error (populated when native fails and bridge recovers).
The speculative streaming system combines dual-model execution with Bayesian window sizing, drift quantification, and deterministic fallback.
WindowManager:
Divides the audio stream into overlapping windows. Each window gets a unique window_id, an SHA-256 hash of its audio content, and slots for both the fast and quality model results. Window sizes range from 1,000ms to 30,000ms, with the default starting at the configured --speculative-window-ms (default: 3,000ms).
CorrectionDrift Metrics:
When the quality model disagrees with the fast model, the system quantifies the disagreement using four metrics:
| Metric | Meaning | Typical Range |
|---|---|---|
| wer_approx | Approximate Word Error Rate (Levenshtein on word sequences) | 0.0 (identical) to 1.0 (completely different) |
| confidence_delta | Absolute difference in mean segment confidence | 0.0 to 1.0 |
| segment_count_delta | quality_count - fast_count | -N to +N |
| text_edit_distance | Levenshtein distance on concatenated transcript text | 0 to unbounded |
CorrectionTolerance (When to confirm vs. retract):
A partial transcript is confirmed when all drift metrics fall within tolerance, and retracted (with correction) when any metric exceeds its threshold:
| Threshold | Default Value | Meaning |
|---|---|---|
| max_wer | 0.1 (10%) | Maximum word error rate before retraction |
| max_confidence_delta | 0.15 | Maximum confidence difference |
| max_edit_distance | 50 characters | Maximum text edit distance |
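The confirm-or-retract rule can be sketched as a single predicate over the drift metrics. The struct layouts below are assumptions that reuse the metric and threshold names from the tables; `should_confirm` is a hypothetical helper. The table lists no threshold for segment_count_delta, so this sketch does not gate on it.

```rust
/// Drift between fast and quality results (field names taken from the metrics table).
pub struct CorrectionDrift {
    pub wer_approx: f64,
    pub confidence_delta: f64,
    pub segment_count_delta: i64,
    pub text_edit_distance: usize,
}

/// Thresholds with the documented defaults.
pub struct CorrectionTolerance {
    pub max_wer: f64,
    pub max_confidence_delta: f64,
    pub max_edit_distance: usize,
}

impl Default for CorrectionTolerance {
    fn default() -> Self {
        Self { max_wer: 0.1, max_confidence_delta: 0.15, max_edit_distance: 50 }
    }
}

/// True when every gated metric is within tolerance (confirm);
/// false when any metric exceeds its threshold (retract and correct).
pub fn should_confirm(drift: &CorrectionDrift, tolerance: &CorrectionTolerance) -> bool {
    drift.wer_approx <= tolerance.max_wer
        && drift.confidence_delta <= tolerance.max_confidence_delta
        && drift.text_edit_distance <= tolerance.max_edit_distance
}
```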
SpeculationWindowController (Adaptive Sizing):
The window controller uses the same alien-artifact engineering contract as the backend router:
- State space: Observed correction rate (fraction of windows needing correction)
- Posterior: Beta(alpha, beta) distribution over expected correction rate
- Calibration: Sliding window of 20 prediction-outcome pairs with Brier score tracking
- Fallback trigger: Brier score > 0.25 with >= 10 observations
The controller adjusts window size based on correction patterns:
| Pattern | Action | Rationale |
|---|---|---|
| High correction rate (> 25%) | Shrink window by step_ms | Smaller windows reduce correction latency |
| Low correction rate (< 6.25%) | Grow window by step_ms | Larger windows reduce overhead |
| Runaway corrections (> 75%) | Force minimum window size | System is clearly struggling |
| 20 consecutive zero corrections | Shrink (counterintuitive) | May be over-tolerant, tighten to validate |
| High WER (> 12.5%) | Shrink window | Fast model consistently wrong at this scale |
ConcurrentTwoLaneExecutor:
Runs both models in parallel lanes with independent timeout budgets. Results are collected asynchronously, and the faster result (always the fast model by design) is emitted immediately while the quality result triggers correction logic when it arrives.
The built-in normalizer (normalize_to_wav_with_builtin_decoder) is a pure-Rust audio pipeline that produces whisper-compatible WAV without spawning any subprocess:
Format Detection: Symphonia's get_probe().format() uses file extension hints and magic-byte probing to identify the container format. Supported containers include MP3 (MPEG Layer III), MP4/M4A (AAC), FLAC, WAV/RIFF, OGG (Vorbis), and WavPack.
Decoding Loop:
for each packet in format_reader:
decoded = codec_decoder.decode(packet)
convert decoded samples to f32
if multi-channel: average all channels -> mono
append to sample buffer
Sample conversion handles i16, i32, f32, and f64 source formats. Multi-channel audio is mixed to mono by averaging corresponding samples across channels.
Resampling: A linear interpolation resampler converts from the source sample rate (commonly 44.1 kHz or 48 kHz) to whisper's required 16 kHz:
ratio = src_rate / dst_rate
for each output sample i:
position = i * ratio
left = input[floor(position)]
right = input[ceil(position)]
output[i] = left + frac(position) * (right - left)
This is computationally lightweight (no FFT, no filter bank) while being sufficient for speech. Whisper models tolerate minor resampling artifacts well.
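A runnable sketch of the interpolation loop above. `resample_linear` is an illustrative name, not the crate's function. The right neighbor is taken as the next index clamped to the last sample, which matches the ceil-based pseudocode for in-range positions while avoiding an out-of-bounds read at the end of the buffer.

```rust
/// Linear-interpolation resampler for mono f32 audio.
pub fn resample_linear(input: &[f32], src_rate: u32, dst_rate: u32) -> Vec<f32> {
    if input.is_empty() || src_rate == 0 || dst_rate == 0 {
        return Vec::new();
    }
    if src_rate == dst_rate {
        return input.to_vec();
    }
    let ratio = src_rate as f64 / dst_rate as f64;
    let out_len = (input.len() as f64 / ratio).floor() as usize;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            let position = i as f64 * ratio;
            let left_idx = (position.floor() as usize).min(last);
            let right_idx = (left_idx + 1).min(last); // clamp at the final sample
            let frac = (position - position.floor()) as f32;
            input[left_idx] + frac * (input[right_idx] - input[left_idx])
        })
        .collect()
}
```

For example, one second of 48 kHz input yields 16,000 output samples at a 16 kHz target.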
WAV Output: The final mono f32 buffer is quantized to 16-bit signed PCM (i16) via clamp-and-round, then written as a standard RIFF WAV header + raw PCM data. The output is always normalized_16k_mono.wav in the work directory.
The sync module provides one-way JSONL snapshot export/import with distributed lock safety.
Lock Protocol:
Before any export or import, a JSON lock file is created at {state_root}/locks/sync.lock:
{"pid": 12345, "created_at_rfc3339": "2026-02-22T12:00:00Z", "operation": "export"}
Stale lock detection checks two conditions:
- Is the PID still alive? (reads /proc/{pid} on Linux)
- Is the lock older than 5 minutes?
If the owning process is no longer alive or the lock is older than 5 minutes, the lock is treated as stale: it is archived with a reason suffix and a new lock is acquired.
Export Format:
An export produces four files:
snapshot/
runs.jsonl # one JSON object per run
segments.jsonl # one JSON object per segment
events.jsonl # one JSON object per event
manifest.json # metadata + SHA-256 checksums
The manifest contains row counts and SHA-256 checksums of each JSONL file, enabling integrity verification on import.
Incremental Export:
Full exports re-dump the entire database. For large databases, incremental export is more efficient:
cargo run -- sync export-jsonl --output ./snapshot --incremental
Incremental mode uses a cursor file (sync_cursor.json) tracking the last export timestamp and run ID. Only runs created after the cursor are exported. The cursor uses (finished_at, run_id) tuple ordering for deterministic deduplication, ensuring resume-safety across interrupted exports.
JSONL Compression:
Sync supports optional gzip compression for JSONL files, reducing snapshot size for archival or transfer:
snapshot/
runs.jsonl.gz # gzip-compressed (flate2, default compression)
segments.jsonl.gz
events.jsonl.gz
manifest.json # always uncompressed (small)
The import path transparently detects and decompresses .gz variants.
Sync Validation:
After import, validate_sync() compares the database state against the imported JSONL files, checking for row count mismatches and checksum mismatches. This provides end-to-end integrity verification.
Conflict Policies:
| Policy | Behavior on duplicate run_id |
|---|---|
| reject | Fail the entire import |
| skip | Silently skip existing runs |
| overwrite | Replace conflicting runs rows, but fail closed if child-row mutation is needed |
| overwrite-strict | Verified strict replacement including child-row updates (delete+insert) and stale child-row pruning |
The TTY audio module goes beyond simple encode/decode. The AdaptiveBitrateController monitors link quality in real time and adjusts compression dynamically:
| Frame Loss Rate | Link Quality | Compression | Critical Frame FEC |
|---|---|---|---|
| < 1% | High | zlib level 1 (fast) | 1x (no duplication) |
| 1% - 10% | Moderate | zlib level 6 (default) | 2x |
| > 10% | Poor | zlib level 9 (best) | 3x |
Critical Frame FEC (Forward Error Correction):
Control frames essential for protocol correctness (handshake, session_close, ack) are emitted multiple times based on current link quality. When frame loss exceeds 10%, every handshake frame is transmitted 3 times to make it very likely that at least one copy arrives. This is a probabilistic reliability guarantee: with independent frame loss at rate p, the probability that all k copies are lost is p^k. For example, at p = 0.2 and k = 3, that probability is 0.008.
Link Quality Assessment:
The controller maintains running frames_sent and frames_lost counters:
frame_loss_rate = frames_lost / frames_sent
link_quality = 1.0 - frame_loss_rate
Quality transitions trigger compression level changes on subsequent frames, providing automatic adaptation without manual tuning.
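The table and formulas above can be combined into one lookup. This is a sketch: `LinkPolicy`, `policy_for`, and `all_copies_lost_probability` are hypothetical names, and treating a link with no frames sent as lossless is an assumption.

```rust
/// Compression level and control-frame duplication factor for the current link state.
#[derive(Debug, PartialEq, Eq)]
pub struct LinkPolicy {
    pub zlib_level: u32,
    pub critical_copies: u32,
}

/// Maps the running counters to the three tiers in the table.
pub fn policy_for(frames_sent: u64, frames_lost: u64) -> LinkPolicy {
    let loss_rate = if frames_sent == 0 {
        0.0 // assumption: no evidence of loss yet
    } else {
        frames_lost as f64 / frames_sent as f64
    };
    if loss_rate < 0.01 {
        LinkPolicy { zlib_level: 1, critical_copies: 1 }
    } else if loss_rate <= 0.10 {
        LinkPolicy { zlib_level: 6, critical_copies: 2 }
    } else {
        LinkPolicy { zlib_level: 9, critical_copies: 3 }
    }
}

/// Probability that all `copies` duplicates are lost, assuming independent loss at rate `p`.
pub fn all_copies_lost_probability(p: f64, copies: u32) -> f64 {
    p.powi(copies as i32)
}
```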
Transcript Streaming over TTY (Protocol v2):
Beyond raw audio transport, the TTY protocol supports real-time transcript streaming via three control frame types:
| Frame Type | Direction | Purpose |
|---|---|---|
| TranscriptPartial | sender -> receiver | Speculative partial transcript from fast model |
| TranscriptRetract | sender -> receiver | Retract a previous partial (quality model disagrees) |
| TranscriptCorrect | sender -> receiver | Send corrected transcript from quality model |
These frames carry TranscriptSegmentCompact payloads, a wire-efficient representation using single-letter field names (s/e/t/sp/c for start/end/text/speaker/confidence) to minimize bandwidth. The speculative streaming pipeline can therefore operate over TTY links where only text-based NDJSON can flow.
Telemetry Counters:
The decode path tracks comprehensive telemetry:
- frames_decoded: count of successfully decoded audio frames
- gaps: sequence number discontinuities (with expected/actual pairs)
- duplicates: repeated sequence numbers (second copy discarded)
- integrity_failures: CRC32/SHA-256 mismatches (frame dropped)
- dropped_frames: total frames discarded due to policy (integrity + duplicates)
The storage layer supports concurrent persistence sessions using SQLite savepoints for nested transaction isolation:
// Start a named session (creates a SAVEPOINT)
let session = store.begin_concurrent_session("agent_alpha")?;
// Persist reports within the session
session.persist_report(&report)?;
// Commit the session (RELEASE SAVEPOINT)
session.commit()?;
// Or roll back on error (ROLLBACK TO SAVEPOINT)
Session names are validated to be alphanumeric + underscore only (no SQL injection via session names). Each session maps to a SQLite savepoint named fw_session_{name}, providing ACID isolation without blocking other readers.
The StorageDiagnostics struct provides runtime introspection of database health:
| Field | Type | Description |
|---|---|---|
| page_count | i64 | Total database pages |
| page_size | i64 | Bytes per page (typically 4096) |
| journal_mode | String | Current mode (wal, delete) |
| wal_checkpoint | WalCheckpointInfo | WAL status: busy flag, log frames, checkpointed frames |
| freelist_count | i64 | Unused pages available for reuse |
| integrity_check | String | "ok" when database passes PRAGMA integrity_check |
Accessible via robot health which includes database diagnostics in the health report.
Every routing decision records a RoutingEvidenceLedgerEntry in a 200-entry circular buffer. Each entry contains:
| Field | Type | Purpose |
|---|---|---|
| decision_id | String | Unique decision identifier |
| trace_id | String | Links to pipeline trace |
| timestamp_rfc3339 | String | When the decision was made |
| observed_state | String | Availability state at decision time |
| chosen_action | String | Which backend was selected |
| policy_id | String | Which routing policy was active |
| loss_matrix_hash | String | Provenance tracking for the loss matrix |
| availability | Vec<(String, bool)> | Per-backend availability snapshot |
| duration_bucket | String | Audio duration category (short/medium/long) |
| diarize | bool | Whether diarization was requested |
| actual_outcome | Option | Observed success/failure (filled post-run) |
This ledger is queryable via robot routing-history and included in stage event payloads for post-hoc analysis. The loss_matrix_hash field enables detecting when the routing policy itself changed between runs.
Every pipeline run receives two identifiers:
Trace ID, a composite of wall-clock time and randomness:
trace_id = hex(timestamp_ms) + "-" + hex(uuid_v4_lower_80_bits)
Example: "18e4a0b1c00-a1b2c3d4e5f6"
The timestamp prefix enables time-range queries without parsing. The random suffix prevents collisions when multiple runs start in the same millisecond.
Run ID, a standard UUID v4:
run_id = uuid::Uuid::new_v4().to_string()
Example: "550e8400-e29b-41d4-a716-446655440000"
The trace_id links all events across the pipeline (including routing evidence), while the run_id is the persistence key in SQLite.
The router maintains a CalibrationState with a sliding window of prediction-outcome pairs:
struct CalibrationState {
observations: VecDeque<CalibrationObservation>, // bounded to 50 entries
window_size: usize, // ROUTER_HISTORY_WINDOW = 50
}
struct CalibrationObservation {
predicted_probability: f64, // router's confidence that the backend would succeed
actual_outcome: f64, // 1.0 if it did succeed, 0.0 if it failed
observed_at_rfc3339: String, // when the observation was recorded
}
Update cycle:
- Before each run, the router predicts p_success for the chosen backend
- After the run completes, the actual outcome (success/failure) is recorded
- If the window exceeds 50 entries, the oldest observation is evicted
- The Brier score is recomputed from the current window
Brier Score Formula:
Brier = (1/N) * sum_i((predicted_i - actual_i)^2)
Brier = 0.0 means every prediction matched the outcome exactly. Brier = 0.25 is the score of an uninformative forecaster that always predicts 0.5 (a coin flip). Brier > 0.35 triggers fallback to static priority routing.
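The update cycle and the Brier formula can be sketched with a bounded deque. `CalibrationWindow` and its methods are hypothetical names for illustration; only the window size of 50 and the 0.35 fallback threshold come from the text above.

```rust
use std::collections::VecDeque;

/// Sliding window of (predicted_probability, actual_outcome) pairs.
pub struct CalibrationWindow {
    observations: VecDeque<(f64, f64)>,
    window_size: usize,
}

impl CalibrationWindow {
    pub fn new(window_size: usize) -> Self {
        Self { observations: VecDeque::with_capacity(window_size), window_size }
    }

    /// Records one prediction-outcome pair, evicting the oldest beyond the window.
    pub fn record(&mut self, predicted: f64, succeeded: bool) {
        let actual = if succeeded { 1.0 } else { 0.0 };
        self.observations.push_back((predicted, actual));
        while self.observations.len() > self.window_size {
            self.observations.pop_front();
        }
    }

    pub fn len(&self) -> usize {
        self.observations.len()
    }

    /// Brier = (1/N) * sum((predicted - actual)^2); None when the window is empty.
    pub fn brier_score(&self) -> Option<f64> {
        if self.observations.is_empty() {
            return None;
        }
        let sum: f64 = self.observations.iter().map(|(p, a)| (p - a).powi(2)).sum();
        Some(sum / self.observations.len() as f64)
    }

    /// True when the score exceeds the fallback threshold (0.35 for the router).
    pub fn should_fall_back(&self, threshold: f64) -> bool {
        matches!(self.brier_score(), Some(score) if score > threshold)
    }
}
```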
The calibration score tracks a simpler metric: correct_predictions / total_predictions, where a prediction is "correct" if the predicted probability matched the outcome direction (predicted > 0.5 and succeeded, or predicted < 0.5 and failed). This gives a quick sanity check independent of the Brier score.
Each backend's reliability is modeled as a Beta distribution Beta(alpha, beta):
- Mean = alpha / (alpha + beta) (estimated success probability)
- Variance = alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1)) (uncertainty)
The update rule blends the prior with empirical data:
if sample_count >= 5:
empirical_weight = min(sample_count, 20)
alpha += success_rate * empirical_weight
beta += (1 - success_rate) * empirical_weight
The weight cap at 20 prevents a long history from making the posterior too rigid. A backend that succeeded 19 out of 20 recent runs gets alpha += 0.95 * 20 = 19 and beta += 0.05 * 20 = 1, strongly increasing its selection probability. A backend that failed 10 out of 20 gets alpha += 0.5 * 20 = 10 and beta += 0.5 * 20 = 10, pulling toward neutral.
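The blending rule and the worked numbers above can be checked with a few lines of code. `BetaPosterior` is a hypothetical type, and passing successes and sample_count as integers is an assumption about how the history is summarized.

```rust
/// Beta(alpha, beta) posterior over a backend's success probability.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BetaPosterior {
    pub alpha: f64,
    pub beta: f64,
}

impl BetaPosterior {
    /// Estimated success probability.
    pub fn mean(&self) -> f64 {
        self.alpha / (self.alpha + self.beta)
    }

    /// Uncertainty of the estimate.
    pub fn variance(&self) -> f64 {
        let total = self.alpha + self.beta;
        self.alpha * self.beta / (total * total * (total + 1.0))
    }

    /// Blends empirical history into the prior. Fewer than 5 samples are ignored,
    /// and the empirical weight is capped at 20.
    pub fn update(&mut self, successes: u32, sample_count: u32) {
        if sample_count < 5 {
            return;
        }
        let success_rate = successes as f64 / sample_count as f64;
        let empirical_weight = sample_count.min(20) as f64;
        self.alpha += success_rate * empirical_weight;
        self.beta += (1.0 - success_rate) * empirical_weight;
    }
}
```

Starting from a Beta(1, 1) prior, 19 successes out of 20 moves the posterior to roughly Beta(20, 2), a mean of about 0.91.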
The posterior success probability then factors in request-specific adjustments:
p_success = (alpha + quality_score * 2.0 + diarize_boost) /
(alpha + beta + quality_terms + translate_penalty)
This means a backend with a strong empirical track record can still be penalized for a specific request if it lacks a needed capability (e.g., whisper.cpp getting a diarization request).
The SQLite connection is configured for concurrent read/write:
| PRAGMA | Value | Purpose |
|---|---|---|
| journal_mode | WAL | Write-Ahead Logging for concurrent readers |
| busy_timeout | 5000 (5 seconds) | Wait for locks before returning SQLITE_BUSY |
WAL mode allows multiple readers and a single writer to operate simultaneously. The 5-second busy timeout means a write that encounters a lock will wait up to 5 seconds before failing, which accommodates brief contention from concurrent agent processes.
Journal Mode Switching for DDL:
SQLite's ALTER TABLE ADD COLUMN is more reliable in DELETE journal mode than WAL mode (an observed quirk of fsqlite's pure-Rust implementation). When adding a column, the storage layer:
- Queries current journal mode (PRAGMA journal_mode;)
- If WAL, switches to DELETE (PRAGMA journal_mode='delete';)
- Executes ALTER TABLE ... ADD COLUMN
- Restores WAL mode (PRAGMA journal_mode='wal';)
- If restoration fails, logs an error but preserves the column addition
This round-trip ensures schema migrations succeed while maintaining WAL mode for normal operation.
Before the pipeline starts, the request is validated:
Mutually Exclusive Input Modes:
The CLI enforces that exactly one of --input, --stdin, or --mic is specified. Zero inputs or multiple inputs produce an immediate error before pipeline construction.
Pipeline Configuration Validation:
PipelineConfig::validate() enforces ordering constraints:
- Normalize must come after Ingest
- Backend must come after Normalize
- No duplicate stages in the pipeline
- All stage dependencies are satisfied in execution order
These checks run at pipeline build time (not at runtime), so invalid configurations fail fast.
Timeout Conversion:
The --timeout flag (in seconds) converts to an absolute deadline:
timeout_ms = timeout_seconds * 1000 (with saturating multiplication)
deadline = now + chrono::Duration::milliseconds(clamped_to_i64_max)
The saturating_mul prevents overflow; the clamp to i64::MAX prevents chrono panics on unreasonably large timeouts.
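The overflow handling can be illustrated with the standard library. This sketch uses `std::time::Instant` in place of chrono, and the function names are hypothetical; it shows only the saturate-then-clamp arithmetic described above.

```rust
use std::time::{Duration, Instant};

/// Converts a timeout in seconds to milliseconds without overflowing,
/// then clamps the result into the non-negative i64 range.
pub fn timeout_ms_clamped(timeout_seconds: u64) -> i64 {
    let ms = timeout_seconds.saturating_mul(1000);
    ms.min(i64::MAX as u64) as i64
}

/// Absolute deadline for a timeout; None if the addition itself would overflow.
pub fn deadline_after(timeout_seconds: u64, now: Instant) -> Option<Instant> {
    let ms = timeout_ms_clamped(timeout_seconds) as u64;
    now.checked_add(Duration::from_millis(ms))
}
```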
When a pipeline stage fails, the behavior depends on the error type:
| Error Type | Behavior | Event Emitted |
|---|---|---|
| Cancelled (Ctrl+C or deadline) | Pipeline stops immediately | {stage}.cancelled |
| StageTimeout (budget exceeded) | Pipeline stops, timeout reported | {stage}.timeout |
| Other errors (I/O, backend, etc.) | Pipeline stops, error propagated | {stage}.error |
All stage failures produce a corresponding error event in the NDJSON stream before the pipeline terminates. In-progress SQLite transactions roll back via the savepoint mechanism. Registered finalizers (temp directory cleanup, subprocess kills) run within the 5-second cleanup budget.
The run_error event at the end of the stream contains the structured error code and message, allowing agents to programmatically determine what failed and why.
The PipelineCx carries a Vec<serde_json::Value> evidence accumulator that grows throughout the pipeline:
- Routing decision: the backend router pushes its decision evidence (posterior snapshot, loss matrix, chosen action)
- Stage observations: individual stages can record evidence about unusual conditions (e.g., normalization fallback to ffmpeg, high latency)
- Conformance results: when native engines run in shadow/validated mode, comparison results are recorded as evidence
All accumulated evidence is included in the final RunReport.evidence field and persisted alongside the run in SQLite. This enables post-hoc debugging without needing to reproduce the exact conditions.
The interactive TUI (enabled with --features tui) provides a three-pane interface:
+-------------------+-------------------------------------+
| | |
| Runs List | Timeline / Transcript |
| (left pane) | (main pane) |
| | |
| - run-abc | [0.0s - 2.5s] Hello world |
| - run-def | [2.5s - 5.1s] How are you |
| > run-ghi | [5.1s - 7.3s] [SPK_01] Fine |
| | |
+-------------------+-------------------------------------+
| Event Details (bottom pane) |
| stage: backend | code: backend.ok | 4.2s |
+---------------------------------------------------------+
Keyboard Bindings:
| Key | Action |
|---|---|
| Tab / Shift+Tab | Cycle focus between panes |
| Up / Down | Move selection within focused pane |
| PageUp / PageDown | Jump by page |
| r | Reload data from SQLite |
| h or ? | Toggle help overlay |
| q or Ctrl+C | Quit |
Speaker Color Assignment:
Speakers are assigned distinct colors via an FNV-1a-style hash of the speaker label, mapped to an 8-color palette. This ensures the same speaker always gets the same color within a session, making multi-speaker conversations visually parseable.
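A sketch of hash-based color assignment using standard 32-bit FNV-1a. The text says "FNV-1a-style", so the exact variant, the palette contents, and the names `fnv1a_32`, `PALETTE`, and `speaker_color` are assumptions made for illustration.

```rust
const FNV_OFFSET_BASIS: u32 = 0x811c_9dc5;
const FNV_PRIME: u32 = 0x0100_0193;

/// Standard 32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime.
pub fn fnv1a_32(bytes: &[u8]) -> u32 {
    let mut hash = FNV_OFFSET_BASIS;
    for byte in bytes {
        hash ^= u32::from(*byte);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical 8-entry palette; the actual colors used by the TUI may differ.
pub const PALETTE: [&str; 8] = [
    "red", "green", "yellow", "blue", "magenta", "cyan", "white", "gray",
];

/// Maps a speaker label to a stable palette entry.
pub fn speaker_color(label: &str) -> &'static str {
    PALETTE[(fnv1a_32(label.as_bytes()) % PALETTE.len() as u32) as usize]
}
```

Because the mapping depends only on the label bytes, no per-session color table needs to be stored. Two different speakers can still land on the same color, since there are only eight slots.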
Segment Retention:
To prevent unbounded memory growth during long sessions, the TUI caps displayed segments at 10,000 (DEFAULT_MAX_SEGMENTS). When the cap is exceeded, oldest segments are drained first, keeping the most recent transcription visible.
Fastest possible transcription (accuracy tradeoff):
cargo run -- transcribe --input audio.mp3 \
--backend whisper_cpp \
--model tiny.en \
--no-persist \
--no-timestamps \
--beam-size 1 \
--best-of 1 \
--json
Highest accuracy with diarization:
cargo run -- transcribe --input meeting.mp3 \
--backend whisper_cpp \
--model large-v3 \
--diarize \
--hf-token "$HF_TOKEN" \
--min-speakers 2 \
--max-speakers 8 \
--vad \
--json
Agent integration with health monitoring:
# Pre-flight check
cargo run -- robot health 2>/dev/null | jq -e '.overall_status == "ok"' > /dev/null
# Transcribe with full event stream
cargo run -- robot run \
--input audio.mp3 \
--backend auto \
--json 2>/dev/null | while IFS= read -r line; do
event=$(echo "$line" | jq -r '.event')
case "$event" in
stage) echo "[STAGE] $(echo "$line" | jq -r '.code')" ;;
run_complete) echo "[DONE] $(echo "$line" | jq -r '.transcript' | head -c 100)" ;;
run_error) echo "[FAIL] $(echo "$line" | jq -r '.code'): $(echo "$line" | jq -r '.message')" ;;
esac
done
Offline archival workflow:
# Transcribe everything, persist to custom DB
for f in archive/*.mp3; do
cargo run -- transcribe --input "$f" --db archive.sqlite3 --json > /dev/null
done
# Export to portable JSONL
cargo run -- sync export-jsonl --output ./archive_snapshot --db archive.sqlite3
# Validate the export
cargo run -- sync import-jsonl --input ./archive_snapshot --conflict-policy skip
Low-bandwidth remote transcription via TTY:
# On remote (has audio, no GPU):
cargo run -- tty-audio encode --input recording.wav --chunk-ms 100 > /tmp/frames.ndjson
# Transfer (works over any text channel):
scp /tmp/frames.ndjson gpu-server:/tmp/
# On GPU server (has whisper, fast inference):
cat /tmp/frames.ndjson | cargo run -- tty-audio decode --output /tmp/audio.wav
cargo run -- transcribe --input /tmp/audio.wav --backend whisper_cpp --model large-v3 --json
| Term | Definition |
|---|---|
| Backend | An external ASR engine (whisper.cpp, insanely-fast-whisper, whisper-diarization) or its native Rust replacement |
| Bridge adapter | Code that spawns an external backend process and parses its output into a TranscriptionResult |
| Brier score | Mean squared error between predicted probabilities and actual outcomes; measures calibration quality (0.0 = perfect, 0.25 = random) |
| Conformance | Cross-engine output comparison using the 50ms timestamp tolerance and optional text/speaker matching |
| Decision contract | Formal specification of an adaptive controller's state space, action space, loss matrix, posterior, calibration, fallback, and evidence |
| Evidence ledger | Circular buffer recording every routing decision with full posterior snapshots for audit |
| Finalizer | A cleanup handler (temp dir removal, subprocess kill) registered during pipeline execution and run on exit within a bounded timeout |
| NDJSON | Newline-Delimited JSON; one JSON object per line, compatible with jq and standard Unix text tools |
| Pipeline stage | One of 10 composable processing steps (Ingest, Normalize, VAD, Separate, Backend, Accelerate, Align, Punctuate, Diarize, Persist) |
| Posterior | Beta distribution Beta(alpha, beta) modeling estimated success probability for a backend |
| Replay envelope | SHA-256 hash summary (input, backend identity, output) for detecting drift between runs |
| Replay pack | Four-artifact directory (env, manifest, repro.lock, tolerance_manifest) capturing everything needed to reproduce a run |
| Robot mode | The robot subcommand; emits structured NDJSON events for machine consumption rather than human-readable text |
| Savepoint | SQLite's nested transaction mechanism; used for concurrent session isolation and cancellation-safe writes |
| Speculative streaming | Dual-model pattern where a fast model emits partial transcripts immediately and a quality model confirms or corrects them |
| TTY audio | Protocol for transporting compressed audio over text-only channels (PTY, SSH, serial) using mu-law + zlib + base64 NDJSON frames |
| WAL mode | SQLite's Write-Ahead Logging; allows concurrent reads during writes |
The release profile aggressively optimizes for deployment:
[profile.release]
opt-level = "z" # Optimize for binary size (smaller than "s")
lto = true # Full link-time optimization across all crates
codegen-units = 1 # Single codegen unit for maximum optimization opportunity
panic = "abort" # Abort on panic (no unwinding overhead, smaller binary)
strip = true # Strip debug symbols from final binary
This profile prioritizes binary size: opt-level "z" trades some runtime speed for smaller code, while LTO and a single codegen unit still enable whole-program optimization. The tradeoff is slower compilation (codegen-units = 1 + LTO) and no panic unwinding (acceptable for a CLI tool where panics are fatal regardless). On a typical Linux x86_64 build, the stripped release binary is significantly smaller than a default release build.
Live microphone capture requires ffmpeg (the only path that does; file transcription uses the built-in decoder). The capture path adapts to the host OS:
| OS | ffmpeg Format | Default Device | Notes |
|---|---|---|---|
| Linux | alsa | default | Uses ALSA subsystem |
| macOS | avfoundation | :0 | First audio input device |
| Windows | dshow | audio=default | DirectShow capture |
The microphone flow:
- Spawn ffmpeg with -f <format> -i <device> -t <seconds> -ar 16000 -ac 1 -c:a pcm_s16le <output>
- Wait for capture to complete (bounded by --mic-seconds)
- Output is already 16kHz mono WAV, so the normalization stage becomes a passthrough
- Proceed to backend execution
Custom devices, formats, and sources can be overridden via --mic-device, --mic-ffmpeg-format, and --mic-ffmpeg-source flags.
The conformance module includes a Levenshtein-based Word Error Rate calculator used in both conformance testing and speculative streaming correction:
1. Tokenize both transcripts by whitespace -> word sequences
2. Compute Levenshtein edit distance between word sequences
(insertions, deletions, substitutions)
3. WER = edit_distance / max(reference_length, 1)
4. Clamp to [0.0, 1.0]
This is an approximation. True WER requires a reference transcript and uses the reference length as the denominator. The conformance module normalizes by the reference (expected) length, while the speculation module normalizes by max(fast_length, quality_length) since neither model is the "reference."
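Both normalizations can be sketched on top of one word-level Levenshtein routine. The function names are illustrative; the denominators follow the description above (reference length for conformance, the longer of the two transcripts for speculation).

```rust
/// Levenshtein distance over word sequences (insertions, deletions, substitutions),
/// computed with a rolling two-row table.
pub fn word_edit_distance(a: &[&str], b: &[&str]) -> usize {
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, word_a) in a.iter().enumerate() {
        let mut curr = Vec::with_capacity(b.len() + 1);
        curr.push(i + 1);
        for (j, word_b) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(word_a != word_b);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Conformance-style WER: normalized by the reference length, clamped to [0.0, 1.0].
pub fn wer_reference(reference: &str, hypothesis: &str) -> f64 {
    let r: Vec<&str> = reference.split_whitespace().collect();
    let h: Vec<&str> = hypothesis.split_whitespace().collect();
    let wer = word_edit_distance(&r, &h) as f64 / r.len().max(1) as f64;
    wer.clamp(0.0, 1.0)
}

/// Speculation-style WER: normalized by the longer transcript, since neither is the reference.
pub fn wer_symmetric(fast: &str, quality: &str) -> f64 {
    let f: Vec<&str> = fast.split_whitespace().collect();
    let q: Vec<&str> = quality.split_whitespace().collect();
    let wer = word_edit_distance(&f, &q) as f64 / f.len().max(q.len()).max(1) as f64;
    wer.clamp(0.0, 1.0)
}
```

The two variants diverge when the hypothesis is much longer than the reference: the reference-normalized form saturates at 1.0, while the symmetric form stays proportional to the disagreement.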
The SegmentConformancePolicy can optionally reject overlapping segments, where one segment's end_sec exceeds the next segment's start_sec beyond a configurable epsilon (default: 1 microsecond). This catches backends that produce garbled timeline output.
for each pair (segment[i], segment[i+1]):
if segment[i].end_sec > segment[i+1].start_sec + epsilon:
report overlap violation at index i
Overlap detection runs before cross-engine comparison, so a backend that produces self-overlapping output is flagged before being compared against a reference.
franken_whisper is designed with privacy as a hard constraint:
+----------------------------------------------------------------+
| YOUR MACHINE |
| |
| +-----------+ +-------------+ +-----------+ |
| | Input |--->| Pipeline |--->| Output | |
| +-----------+ +-------------+ +-----------+ |
| |
| No network calls (inference is local) |
| No telemetry or analytics |
| No cloud sync |
| No API keys required (except HuggingFace for diarization) |
+----------------------------------------------------------------+
All processing happens on your hardware using local backend binaries. The only external network access is:
- ffmpeg auto-provisioning (one-time download, can be disabled with FRANKEN_WHISPER_AUTO_PROVISION_FFMPEG=0)
- HuggingFace model downloads (only when using --diarize with pyannote models)
| Location | Contents | Sensitive? |
|---|---|---|
| .franken_whisper/storage.sqlite3 | Run history, transcripts, segments | Yes (contains transcription text) |
| .franken_whisper/locks/ | Sync lock files (PID, timestamp only) | No |
| <work_dir>/normalized_16k_mono.wav | Temporary normalized audio | Yes (audio content, cleaned up by finalizers) |
| JSONL snapshots | Exported run history | Yes (contains transcription text) |
# Remove all franken_whisper state
rm -rf .franken_whisper/
# Or just the database (preserves settings)
rm .franken_whisper/storage.sqlite3
franken_whisper is both a CLI binary and a Rust library. The public API exposes all modules for embedding ASR pipelines in other applications:
use franken_whisper::backend::{BackendRouter, Engine};
use franken_whisper::orchestrator::{PipelineConfig, PipelineBuilder, FrankenWhisperEngine};
use franken_whisper::model::{TranscribeRequest, BackendKind, TranscriptionResult};
use franken_whisper::storage::RunStore;
use franken_whisper::robot::robot_schema_value;
use franken_whisper::tty_audio::{encode_wav_to_frames, decode_frames_to_raw};
use franken_whisper::conformance::compare_segments_with_tolerance;
use franken_whisper::error::{FwError, FwResult};
Key types:
| Type | Module | Purpose |
|---|---|---|
| TranscribeRequest | model | Fully-specified transcription request with all parameters |
| TranscriptionResult | model | Backend output: transcript, segments, language, acceleration metadata |
| TranscriptionSegment | model | Individual segment: start/end times, text, speaker, confidence |
| RunReport | model | Complete run envelope: request + result + events + evidence + replay |
| BackendKind | model | Enum: Auto, WhisperCpp, InsanelyFast, WhisperDiarization |
| FrankenWhisperEngine | orchestrator | Main pipeline orchestrator |
| PipelineConfig | orchestrator | Ordered list of stages to execute |
| PipelineBuilder | orchestrator | Fluent constructor for pipeline configs |
| CancellationToken | orchestrator | Cooperative cancellation handle |
| RunStore | storage | SQLite persistence interface (open, persist, query) |
| TtyAudioFrame | tty_audio | Protocol frame with seq, codec, payload, integrity hashes |
| TtyControlFrame | tty_audio | Control messages (handshake, ack, retransmit, backpressure) |
| DecodeReport | tty_audio | Decode telemetry: frames decoded, gaps, duplicates, failures |
| ReplayEnvelope | replay_pack | SHA-256 hash summary for deterministic replay |
| FwError | error | Error enum with 12 variants, each mapping to a stable FW-* code |
| SegmentCompatibilityTolerance | conformance | Drift thresholds for cross-engine comparison |
-- Core run record (one row per transcription)
CREATE TABLE runs (
id TEXT PRIMARY KEY, -- UUID run identifier
started_at TEXT NOT NULL, -- RFC-3339 timestamp
finished_at TEXT, -- RFC-3339 timestamp (NULL if crashed)
backend TEXT NOT NULL, -- "whisper_cpp", "insanely_fast", etc.
input_path TEXT, -- Original input file path
normalized_wav_path TEXT, -- Path to 16kHz mono WAV
request_json TEXT, -- Full TranscribeRequest as JSON
result_json TEXT, -- Full TranscriptionResult as JSON
transcript TEXT, -- Plain text transcript
replay_json TEXT, -- ReplayEnvelope as JSON
acceleration_json TEXT, -- AccelerationReport as JSON
warnings_json TEXT -- Non-fatal warnings as JSON array
);
-- Timed transcript segments (N rows per run)
CREATE TABLE segments (
run_id TEXT NOT NULL REFERENCES runs(id),
idx INTEGER NOT NULL, -- Segment index within run
start_sec REAL, -- Start time in seconds
end_sec REAL, -- End time in seconds
speaker TEXT, -- Speaker label (if diarized)
text TEXT NOT NULL, -- Segment text
confidence REAL -- Confidence score [0.0, 1.0]
);
-- Pipeline stage events (M rows per run)
CREATE TABLE events (
run_id TEXT NOT NULL REFERENCES runs(id),
seq INTEGER NOT NULL, -- Strictly increasing per run
ts_rfc3339 TEXT NOT NULL, -- Non-decreasing timestamp
stage TEXT NOT NULL, -- Pipeline stage name
code TEXT NOT NULL, -- Event code (e.g., "backend.ok")
message TEXT NOT NULL, -- Human-readable description
payload_json TEXT -- Event-specific JSON payload
);

JSONL snapshots mirror the database schema:
runs.jsonl (one JSON object per line):
{"id":"fw-run-abc","started_at":"2026-03-17T06:00:00Z","finished_at":"2026-03-17T06:00:05Z","backend":"whisper_cpp","transcript":"Hello world...","replay_json":"{...}"}

segments.jsonl:
{"run_id":"fw-run-abc","idx":0,"start_sec":0.0,"end_sec":2.5,"text":"Hello world","confidence":0.95}

events.jsonl:
{"run_id":"fw-run-abc","seq":0,"ts_rfc3339":"2026-03-17T06:00:00Z","stage":"ingest","code":"ingest.start","message":"materializing input","payload_json":"{}"}

manifest.json (integrity metadata):
{
"exported_at": "2026-03-17T06:30:00Z",
"run_count": 42,
"segment_count": 1847,
"event_count": 336,
"runs_sha256": "a1b2c3...",
"segments_sha256": "d4e5f6...",
"events_sha256": "g7h8i9..."
}

TranscribeRequest (the full input specification):

| Field | Type | Description |
|---|---|---|
| `input_path` | `Option<PathBuf>` | Audio/video file path |
| `stdin_input` | `bool` | Read from stdin |
| `mic_capture` | `bool` | Capture from microphone |
| `backend` | `BackendKind` | Which engine to use |
| `model` | `Option<String>` | Model name/path |
| `language` | `Option<String>` | Language hint (ISO 639-1) |
| `translate` | `bool` | Translate to English |
| `diarize` | `bool` | Enable speaker diarization |
| `decoding_params` | `DecodingParams` | Beam size, temperature, thresholds |
| `vad_params` | `Option<VadParams>` | Voice activity detection settings |
| `diarization_config` | `DiarizationConfig` | Speaker count, stemming, model override |
| `speculative_config` | `Option<SpeculativeConfig>` | Dual-model streaming settings |
| `timeout_seconds` | `Option<u64>` | Overall pipeline timeout |
| `db_path` | `Option<PathBuf>` | SQLite database path |
| `no_persist` | `bool` | Skip persistence |
| `json_output` | `bool` | Output full JSON report |
| `output_formats` | `Vec<OutputFormat>` | Additional output formats (VTT, SRT, etc.) |

TranscriptionResult (what the backend produces):

| Field | Type | Description |
|---|---|---|
| `transcript` | `String` | Full transcript text |
| `segments` | `Vec<TranscriptionSegment>` | Timed segments with text, speaker, confidence |
| `language` | `Option<String>` | Detected language |
| `acceleration` | `Option<AccelerationReport>` | Confidence normalization metadata |
| `raw_backend_json` | `Option<String>` | Preserved raw backend output for replay |

RunEvent (a single pipeline event):

| Field | Type | Description |
|---|---|---|
| `seq` | `u64` | Strictly increasing per run |
| `ts_rfc3339` | `String` | Non-decreasing RFC-3339 timestamp |
| `stage` | `String` | Pipeline stage (e.g., "ingest", "backend", "speculation") |
| `code` | `String` | Event code (e.g., "backend.routing.decision_contract") |
| `message` | `String` | Human-readable description |
| `payload` | `Value` | Event-specific JSON payload |

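The two ordering guarantees on `seq` and `ts_rfc3339` are easy to check from a consumer. The following is a minimal sketch, not the crate's own validator: `EventLite`, `OrderingError`, and `check_event_ordering` are hypothetical names, and the timestamp comparison assumes every event uses the same fixed-width UTC form (for example `2026-03-17T06:00:00Z`), so that string order matches time order.

```rust
/// Hypothetical, trimmed-down event record used only for this illustration.
#[derive(Debug, Clone)]
pub struct EventLite {
    pub seq: u64,
    pub ts_rfc3339: String,
}

#[derive(Debug, PartialEq)]
pub enum OrderingError {
    SeqNotIncreasing { index: usize },
    TimestampWentBackwards { index: usize },
}

/// Checks that `seq` is strictly increasing and timestamps never go backwards.
/// Assumption: all timestamps share one fixed-width UTC format, so comparing
/// the strings lexicographically is equivalent to comparing the instants.
pub fn check_event_ordering(events: &[EventLite]) -> Result<(), OrderingError> {
    for (i, pair) in events.windows(2).enumerate() {
        let (prev, next) = (&pair[0], &pair[1]);
        if next.seq <= prev.seq {
            return Err(OrderingError::SeqNotIncreasing { index: i + 1 });
        }
        if next.ts_rfc3339 < prev.ts_rfc3339 {
            return Err(OrderingError::TimestampWentBackwards { index: i + 1 });
        }
    }
    Ok(())
}
```

A consumer that receives mixed UTC offsets would need to parse the timestamps before comparing them.
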
| Input Format | Duration | Normalization Time | Method |
|---|---|---|---|
| MP3 (128kbps, stereo) | 2 min | ~260ms | Built-in (symphonia) |
| FLAC (16-bit, 44.1kHz) | 2 min | ~180ms | Built-in (symphonia) |
| WAV (16kHz, mono) | 2 min | ~5ms | Passthrough (already normalized) |
| MP4 (video, AAC audio) | 2 min | ~500ms | ffmpeg fallback |
The built-in path is fast because it runs entirely in-process with no subprocess spawning, no temporary file juggling, and no PATH dependency.
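The in-process normalization steps this README lists for the built-in path (stereo averaged to mono, linear-interpolation resampling to 16 kHz, f32 quantized to i16 PCM) can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the project's decoder: the function names are hypothetical, and no low-pass filtering is applied before downsampling, which a production resampler would normally add to limit aliasing.

```rust
/// Averages interleaved stereo samples (L, R, L, R, ...) into mono.
pub fn stereo_to_mono(interleaved: &[f32]) -> Vec<f32> {
    interleaved
        .chunks_exact(2)
        .map(|lr| (lr[0] + lr[1]) * 0.5)
        .collect()
}

/// Resamples by linear interpolation between neighbouring input samples.
pub fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == 0 || to_hz == 0 {
        return Vec::new();
    }
    let out_len = (input.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    let step = from_hz as f64 / to_hz as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let idx = (pos.floor() as usize).min(last);
            let frac = (pos - idx as f64) as f32;
            let a = input[idx];
            let b = input[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}

/// Quantizes f32 samples to 16-bit PCM, clamping values outside [-1.0, 1.0].
pub fn quantize_i16(samples: &[f32]) -> Vec<i16> {
    samples
        .iter()
        .map(|s| (s.clamp(-1.0, 1.0) * i16::MAX as f32).round() as i16)
        .collect()
}
```

Chaining the three functions on decoded 44.1 kHz stereo samples yields 16 kHz mono PCM of the kind the pipeline writes to its normalized WAV.
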
Typical overhead beyond the backend inference time:
| Component | Time | Notes |
|---|---|---|
| CLI parse | <1ms | Clap argument parsing |
| Database open | ~5ms | SQLite connection + schema check |
| Ingest | ~1ms | File existence check, size read |
| Normalize (MP3) | ~260ms | Built-in Rust decoder |
| Persistence | ~10ms | SQLite transaction (8 retry budget) |
| Latency profiling | ~1ms | Compute utilization ratios |
| Report assembly | ~2ms | JSON serialization |
| Total overhead | ~280ms | Everything except actual inference |
The backend inference stage dominates total runtime (typically 3-60 seconds depending on audio length, model size, and hardware).
Five criterion benchmark suites measure performance of critical paths:
| Benchmark | What it measures |
|---|---|
| `storage_bench` | SQLite persist/query throughput, concurrent access |
| `normalize_bench` | Audio normalization latency by format and duration |
| `pipeline_bench` | End-to-end pipeline overhead (mocked backend) |
| `tty_bench` | TTY encode/decode throughput, retransmit loop latency |
| `sync_bench` | JSONL export/import throughput, compression ratios |

Run with: cargo bench --bench <name>
With the aggressive release profile (opt-level = "z", LTO, stripped):
| Build | Approximate Size |
|---|---|
| Debug | ~150 MB |
| Release (default) | ~20 MB |
| Release (optimized profile) | ~12 MB |
| Release + `--features tui` | ~15 MB |

| Metric | Value |
|---|---|
| Total source lines (src/) | ~90,000 |
| Total test lines (tests/) | ~17,000 |
| Library tests (`cargo test --lib`) | 2,973 |
| Integration + doc tests | 560+ |
| Integration test files (`tests/*.rs`) | 23 |
| Benchmark suites | 5 (criterion) |
| Public modules | 18 |
| Error variants | 12 (each with structured code) |
| Backend engines | 6 (3 bridge + 3 native pilot) |
| Pipeline stages | 10 (composable, independently budgeted) |
| CLI subcommands | 6 (transcribe, robot, runs, sync, tty-audio, tui) |
| CLI flags (transcribe) | 70+ (inference, VAD, diarization, speculative, audio windowing) |
| Robot event types | 12 (run lifecycle, stage, speculation, health, routing) |
| TTY control frame types | 10 (handshake, ack, retransmit, backpressure, transcript streaming, session close) |
| TTY protocol versions | 2 (v1 audio, v2 transcript streaming) |
| Replay pack artifacts | 4 (env, manifest, repro.lock, tolerance_manifest) |
| Sync conflict policies | 4 (reject, skip, overwrite, overwrite-strict) |
| Native rollout stages | 5 (shadow, validated, fallback, primary, sole) |
| Conformance tolerance | 50ms canonical timestamp tolerance |
| Evidence ledger capacity | 200 entries (circular buffer) |
| Router history window | 50 outcome records per backend |
| Clippy enforcement | #![forbid(unsafe_code)] + -D warnings on all targets |
| Cargo features | 3 (tui, gpu-frankentorch, gpu-frankenjax) |
| Release optimizations | opt-level z, LTO, single codegen unit, panic=abort, stripped |
~107,000 lines of Rust with 3,500+ tests across unit, integration, conformance, and doc-test suites.
# run all library tests
cargo test --lib
# run specific test module
cargo test --lib -- backend::tests
cargo test --lib -- robot::tests
cargo test --lib -- tty_audio::tests
# run integration tests
cargo test --test tty_telemetry_tests
cargo test --test conformance_comparator_tests
cargo test --test gpu_cancellation_tests
cargo test --test robot_contract_tests
cargo test --test e2e_pipeline_tests
# run benchmarks
cargo bench --bench storage_bench
cargo bench --bench normalize_bench
cargo bench --bench pipeline_bench
cargo bench --bench tty_bench
cargo bench --bench sync_bench
# lint
cargo clippy --all-targets -- -D warnings

| Category | Count | Description |
|---|---|---|
| Backend engine tests | 260+ | Engine trait compliance, native pilot validation |
| Robot contract tests | 150+ | NDJSON schema validation, field presence |
| TTY audio tests | 200+ | Handshake, integrity, retransmit, telemetry |
| Conformance tests | 80+ | Cross-engine tolerance, replay envelopes |
| Storage tests | 100+ | SQLite roundtrip, concurrent writes, recovery |
| Sync tests | 300+ | JSONL export/import, conflict resolution, validation |
| GPU cancellation tests | 42 | Stream ownership, fence payloads, fallback |
| Speculation tests | 200+ | Windowing, adaptive thresholds, correction drift |
| CLI integration tests | 79 | End-to-end command execution with stub backends |
No backend binary is on your PATH. Install at least one:
# whisper.cpp
brew install whisper-cpp # macOS
# or build from source: https://github.com/ggerganov/whisper.cpp
# or override the binary name
export FRANKEN_WHISPER_WHISPER_CPP_BIN=/path/to/whisper-cli

Diarization needs a HuggingFace API token for pyannote models:
export FRANKEN_WHISPER_HF_TOKEN="hf_your_token_here"
# or pass directly
cargo run -- transcribe --input audio.mp3 --diarize --hf-token "hf_..."

The backend took longer than the allowed duration:
# increase timeout (seconds)
cargo run -- transcribe --input long_audio.mp3 --timeout 600 --json

Ensure you're using the robot run subcommand, not just robot:
cargo run -- robot run --input audio.mp3 --backend auto

Another franken_whisper process is writing. The storage layer retries with exponential backoff (5-40ms), but simultaneous heavy writes may conflict. Use --no-persist to skip persistence, or use separate --db paths.
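To make the retry behavior concrete, here is a minimal sketch of a capped exponential backoff using the 5-40ms range quoted above. The doubling schedule, the helper names `backoff_delay` and `retry_with_backoff`, and the way the budget is counted are assumptions for illustration; the storage layer's actual schedule may differ.

```rust
use std::time::Duration;

/// Delay before retry `attempt` (0-based): doubles from 5 ms, capped at 40 ms.
/// Assumption: a simple doubling schedule; the real one is not documented here.
pub fn backoff_delay(attempt: u32) -> Duration {
    const BASE_MS: u64 = 5;
    const CAP_MS: u64 = 40;
    let ms = BASE_MS.saturating_mul(1u64 << attempt.min(16)).min(CAP_MS);
    Duration::from_millis(ms)
}

/// Runs `op` until it succeeds or `max_attempts` calls have failed,
/// sleeping with capped exponential backoff between attempts.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Budget exhausted: surface the last error to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                std::thread::sleep(backoff_delay(attempt));
                attempt += 1;
            }
        }
    }
}
```

With this schedule the delays run 5, 10, 20, 40, 40, ... milliseconds, so even a full budget of eight attempts adds well under a second of waiting.
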
Some formats or containers are outside symphonia's coverage. Force the ffmpeg path:
export FRANKEN_WHISPER_FORCE_FFMPEG_NORMALIZE=1
cargo run -- transcribe --input exotic_file.opus --json

- Backend binaries required. franken_whisper orchestrates external ASR engines; it does not include inference runtimes. You need whisper.cpp, insanely-fast-whisper, or whisper-diarization installed.
- ffmpeg only needed for video/exotic formats. The built-in Rust decoder handles common audio formats natively. ffmpeg is used as an automatic fallback for video files and exotic codecs. Microphone capture still depends on ffmpeg.
- Path dependencies. The project depends on sibling Cargo workspace members (frankensqlite, etc.) via relative paths. It is not published to crates.io as a standalone crate.
- Native engines are pilots. Native Rust engine implementations are conformance pilots. They can execute in-process when FRANKEN_WHISPER_NATIVE_EXECUTION=1 and rollout stage is primary|sole; otherwise bridge adapters remain active.
- No bidirectional sync. JSONL export/import is one-way. There is no merge or conflict resolution beyond the explicit --conflict-policy flag.
- Single-machine. Designed for single-machine use with local SQLite. No distributed or multi-node support.
- frankensqlite MVCC limitation. Under extreme concurrent multi-connection WAL writes, frankensqlite may silently lose committed data. Production usage should serialize writes through a single connection.
Q: Do I need all three backends installed?
No. franken_whisper works with any single backend. The auto router will use whatever is available. You can also force a specific backend with --backend whisper_cpp.
Q: What audio formats are supported?
Common audio formats (MP3, AAC, FLAC, WAV, OGG, Vorbis, ALAC) are decoded natively by the built-in Rust decoder with zero external dependencies. Video files and exotic codecs (AC3, DTS, Opus-in-MKV) fall back to ffmpeg automatically.
Q: Can I use this as a library?
Yes. franken_whisper is both a library crate and a binary. The public API exposes all modules: backend, orchestrator, robot, storage, tty_audio, conformance, etc.
Q: What's the "replay envelope"?
Each run produces a ReplayEnvelope containing SHA-256 hashes of the input content, backend identity, and output payload. This allows detecting drift when re-running the same input.
Q: How does cancellation work?
Ctrl+C sets a global shutdown flag. The CancellationToken propagates through every pipeline stage. Each stage calls token.checkpoint() at safe points, which returns Err(Cancelled) if shutdown was requested. No partial writes to SQLite, no orphaned subprocesses.
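The checkpoint pattern can be sketched with a shared atomic flag. This is an illustrative stand-in, not the crate's `CancellationToken`: the `Token` type, its `cancel` method, and `process_chunks` are hypothetical, and only the `checkpoint()` name and the `Err(Cancelled)` result come from the description above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Error returned by a checkpoint once shutdown has been requested.
#[derive(Debug, PartialEq, Eq)]
pub struct Cancelled;

/// Hypothetical stand-in for a cooperative cancellation handle.
/// Clones share the same flag, so any clone can observe a cancel request.
#[derive(Clone, Default)]
pub struct Token {
    flag: Arc<AtomicBool>,
}

impl Token {
    /// Requests cancellation (for example from a Ctrl+C handler).
    pub fn cancel(&self) {
        self.flag.store(true, Ordering::SeqCst);
    }

    /// Returns `Err(Cancelled)` if shutdown was requested, `Ok(())` otherwise.
    pub fn checkpoint(&self) -> Result<(), Cancelled> {
        if self.flag.load(Ordering::SeqCst) {
            Err(Cancelled)
        } else {
            Ok(())
        }
    }
}

/// A stage that checks the token at a safe point before each unit of work.
pub fn process_chunks(token: &Token, chunks: &[&str]) -> Result<usize, Cancelled> {
    let mut done = 0;
    for _chunk in chunks {
        token.checkpoint()?;
        done += 1; // real work on `_chunk` would happen here
    }
    Ok(done)
}
```

Because the check happens only at explicit points, a stage is never interrupted halfway through a unit of work; it simply declines to start the next one.
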
Q: What's the TTY audio module for?
It enables audio transport over constrained TTY/PTY links where binary data can't flow directly. Audio is compressed (mu-law + zlib), base64-encoded, and transmitted as NDJSON lines with sequence numbers, CRC32, and SHA-256 integrity.
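For readers unfamiliar with the first step of that chain, the sketch below shows a common G.711-style mu-law encoder that maps a 16-bit PCM sample to one byte. It is an assumption that the module uses this classic bias/clip formulation; the function name is hypothetical, and the zlib, base64, CRC32, and SHA-256 steps are omitted because they need external crates.

```rust
/// Encodes one 16-bit linear PCM sample as an 8-bit mu-law byte
/// (classic G.711-style formulation with bias 0x84 and clip 32635).
pub fn linear_to_mulaw(sample: i16) -> u8 {
    const BIAS: i32 = 0x84;
    const CLIP: i32 = 32635;

    let mut magnitude = sample as i32;
    let sign = if magnitude < 0 {
        magnitude = -magnitude;
        0x80
    } else {
        0x00
    };
    magnitude = magnitude.min(CLIP) + BIAS;

    // The exponent is the position of the highest set bit among bits 14..=7.
    let mut exponent = 7;
    let mut mask = 0x4000;
    while exponent > 0 && (magnitude & mask) == 0 {
        exponent -= 1;
        mask >>= 1;
    }
    // The four bits just below the leading bit form the mantissa.
    let mantissa = (magnitude >> (exponent + 3)) & 0x0F;

    // Mu-law bytes are stored bit-inverted.
    (!(sign | (exponent << 4) | mantissa) & 0xFF) as u8
}
```

Each sample shrinks from two bytes to one before general-purpose compression is applied, which is where most of the bandwidth saving on a narrow link comes from.
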
Q: How does the Bayesian router differ from a simple priority list?
A priority list always tries backends in the same order. The Bayesian router learns from outcomes: if a backend starts failing, its posterior degrades and traffic shifts to alternatives. When the model is poorly calibrated (Brier > 0.35), it falls back to static priority automatically.
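A minimal sketch of the two ingredients named here, a Beta posterior per backend and a Brier-score calibration check with the 0.35 threshold, might look like the following. The type and function names are hypothetical, the uniform Beta(1, 1) prior is an assumption, and the real router weighs more than success or failure.

```rust
/// Beta(alpha, beta) posterior over a backend's success probability.
#[derive(Debug, Clone, Copy)]
pub struct BetaPosterior {
    pub alpha: f64,
    pub beta: f64,
}

impl BetaPosterior {
    /// Uniform prior (an assumption for this sketch).
    pub fn uniform() -> Self {
        Self { alpha: 1.0, beta: 1.0 }
    }

    /// Updates the posterior with one observed outcome.
    pub fn observe(&mut self, success: bool) {
        if success {
            self.alpha += 1.0;
        } else {
            self.beta += 1.0;
        }
    }

    /// Posterior mean: the current estimate of the success probability.
    pub fn mean(&self) -> f64 {
        self.alpha / (self.alpha + self.beta)
    }
}

/// Brier score: mean squared error between predicted probabilities
/// and the 0/1 outcomes that actually occurred. Lower is better.
pub fn brier_score(history: &[(f64, bool)]) -> f64 {
    if history.is_empty() {
        return 0.0;
    }
    let sum: f64 = history
        .iter()
        .map(|&(predicted, succeeded)| {
            let outcome = if succeeded { 1.0 } else { 0.0 };
            (predicted - outcome).powi(2)
        })
        .sum();
    sum / history.len() as f64
}

/// Falls back to static priority when calibration is poor (Brier > 0.35).
pub fn use_static_fallback(brier: f64) -> bool {
    brier > 0.35
}
```

A backend that fails repeatedly sees its posterior mean drop with each `observe(false)`, and a router whose predictions keep missing sees its Brier score rise until the fallback engages.
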
Q: What happens if I Ctrl+C during a long transcription?
The shutdown controller propagates cancellation through the pipeline. The active stage finishes its current checkpoint, rolls back uncommitted transactions, kills running subprocesses, runs finalizers within 5s, and exits with code 130. No data corruption, no orphaned processes.
Q: What's speculative streaming?
Two models run simultaneously: a fast model produces low-latency partial transcripts, while a quality model runs in parallel. When the quality model finishes each window, it either confirms or corrects the fast model's output. Use --speculative when you need both low latency and high accuracy.
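The confirm-or-correct step for one window can be sketched as a comparison between the two models' text. This is a simplified illustration with hypothetical names: it treats the outputs as matching when they are equal after whitespace and case normalization, whereas the project describes adaptive confirmation thresholds.

```rust
/// Outcome of reconciling one window once the quality model has finished.
#[derive(Debug, PartialEq)]
pub enum WindowVerdict {
    /// The fast model's partial transcript stands.
    Confirmed,
    /// The quality model disagreed; its text replaces the partial.
    Corrected { replacement: String },
}

/// Collapses whitespace and case so trivial differences do not count.
fn normalize(text: &str) -> String {
    text.split_whitespace()
        .map(|word| word.to_lowercase())
        .collect::<Vec<_>>()
        .join(" ")
}

/// Compares the fast and quality transcripts for a single window.
/// Assumption: exact match after normalization; a real system would
/// likely use a similarity threshold instead.
pub fn reconcile(fast: &str, quality: &str) -> WindowVerdict {
    if normalize(fast) == normalize(quality) {
        WindowVerdict::Confirmed
    } else {
        WindowVerdict::Corrected {
            replacement: quality.to_string(),
        }
    }
}
```
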
Q: What's TinyDiarize?
whisper.cpp's built-in speaker-turn detection via --tiny-diarize. It injects speaker-turn tokens during inference without requiring a separate diarization pipeline or HuggingFace token. Less accurate than full diarization but zero additional dependencies.
Q: Why SQLite instead of Postgres/Redis/files?
SQLite fits a single-machine CLI tool: zero configuration, no daemon, ACID transactions, concurrent reads via WAL mode. The fsqlite crate provides a Rust-native interface without depending on system libsqlite3. JSONL export/import covers portability.
Q: Can franken_whisper transcribe video files?
Yes. Any video file that ffmpeg can decode (MP4, MKV, AVI, MOV, WebM, etc.) is handled automatically. The ffmpeg fallback extracts the audio track using the -vn flag.
Q: What's the "alien-artifact engineering contract"?
A design discipline for adaptive controllers. Every adaptive system (the router, the bitrate controller, the budget tuner) must declare an explicit state space, action space, loss matrix, calibration metric, deterministic fallback trigger, and evidence ledger. This prevents adaptive systems from making unbounded bad decisions when their models are wrong.
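As an illustration of an explicit loss function over an action space, the sketch below scores each candidate backend with the weights this README quotes for the router (latency 0.45, quality 0.35, failure 0.20) and picks the lowest expected loss. The `Candidate` struct, the normalization of each term to the range 0 to 1, and the function names are assumptions for the example.

```rust
/// One candidate action with loss components normalized to [0, 1].
/// Hypothetical shape: the real router's state and action types differ.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Candidate {
    pub name: &'static str,
    pub latency: f64,
    pub quality_loss: f64,
    pub failure_prob: f64,
}

/// Weighted loss using the weights quoted for the router.
pub fn expected_loss(c: &Candidate) -> f64 {
    0.45 * c.latency + 0.35 * c.quality_loss + 0.20 * c.failure_prob
}

/// Picks the candidate with the lowest expected loss, if any exist.
pub fn pick_backend(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .min_by(|a, b| expected_loss(a).total_cmp(&expected_loss(b)))
}
```

Keeping the loss explicit means a surprising routing decision can be explained by reading off the three terms for each candidate.
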
This is what happens, step by step, when you run cargo run -- transcribe --input meeting.mp3 --json --backend auto:
1. CLI PARSE
Clap parses args -> TranscribeRequest { input_path: Some("meeting.mp3"), backend: Auto, json_output: true, ... }
2. ENGINE CONSTRUCTION
FrankenWhisperEngine::new() opens SQLite database, initializes tracing
3. PIPELINE COMPOSITION
PipelineBuilder evaluates request flags:
- No --vad flag -> skip VAD stage
- No --diarize flag -> skip Diarize stage
- No GPU features -> skip Accelerate stage (CPU fallback inline)
- json output requested -> include Persist stage
Pipeline: [Ingest, Normalize, Backend, Persist]
4. TRACE ID GENERATION
TraceId::from_parts(1710000000000, random_u64) -> "1710000000000-a1b2c3d4e5f6"
5. INGEST STAGE (budget: 15s)
emit: stage { code: "ingest.start" }
Verify meeting.mp3 exists, get file size
emit: stage { code: "ingest.ok", payload: { size_bytes: 1234567 } }
6. NORMALIZE STAGE (budget: 180s)
emit: stage { code: "normalize.start" }
Try built-in Rust decoder (symphonia):
- Detect format: MP3
- Decode packets -> f32 samples
- Mix stereo -> mono (average channels)
- Resample 44.1kHz -> 16kHz (linear interpolation)
- Quantize f32 -> i16 PCM
- Write normalized_16k_mono.wav
emit: stage { code: "normalize.ok", payload: { method: "builtin", duration_ms: 260 } }
7. BACKEND STAGE (budget: 900s)
emit: stage { code: "backend.routing.decision_contract" }
Bayesian router evaluates:
- Probe availability: whisper_cpp=true, insanely_fast=false, diarization=false
- State: partial_available
- Compute loss matrix (latency*0.45 + quality*0.35 + failure*0.20)
- Best action: try_whisper_cpp (lowest expected loss)
- Calibration check: Brier=0.12, score=0.8 -> adaptive mode (no fallback)
emit: stage { code: "backend.start", payload: { backend: "whisper_cpp" } }
Spawn: whisper-cli -m large-v3 -f normalized_16k_mono.wav --output-json
Wait for subprocess (check cancellation token periodically)
Parse JSON output -> TranscriptionResult { transcript, segments, language }
emit: stage { code: "backend.ok", payload: { segments: 42, language: "en" } }
8. CONFIDENCE NORMALIZATION (inline, no separate stage)
Replace missing confidences with ln(1 + char_count) + 1.0
Apply numerically stable softmax
Record pre_mass=34.2, post_mass=1.0 in AccelerationReport
9. PERSIST STAGE (budget: 20s)
emit: stage { code: "persist.start" }
SAVEPOINT sp_persist_1
INSERT INTO runs (id, started_at, ...)
INSERT INTO segments (42 rows)
INSERT INTO events (8 rows)
token.checkpoint() -> Ok (not cancelled)
RELEASE SAVEPOINT sp_persist_1
emit: stage { code: "persist.ok" }
10. LATENCY PROFILING
emit: stage { code: "orchestration.latency_profile" }
Per-stage utilization: normalize=0.14% (decrease_budget_candidate),
backend=2.3% (decrease_budget_candidate),
persist=0.5% (decrease_budget_candidate)
11. REPLAY ENVELOPE
Compute SHA-256(normalized_16k_mono.wav) -> input_content_hash
Record backend_identity: "whisper-cli", backend_version: "1.7.2"
Compute SHA-256(raw_backend_json) -> output_payload_hash
12. REPORT ASSEMBLY
RunReport { run_id, trace_id, request, result, events, evidence, replay, warnings }
13. OUTPUT
Serialize RunReport as JSON -> stdout
Exit code 0
Total wall time for a 2-minute MP3: typically 5-15 seconds depending on backend and hardware.
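Step 8 of the walkthrough (fill missing confidences with ln(1 + char_count) + 1.0, then apply a numerically stable softmax) can be sketched as follows. The function and struct names are hypothetical, and reading `pre_mass` and `post_mass` as the sums before and after normalization is an assumption based on the values shown in the walkthrough.

```rust
/// Result of normalizing per-segment confidences.
#[derive(Debug)]
pub struct NormalizationOutcome {
    pub confidences: Vec<f64>,
    /// Sum of raw scores before softmax (assumed meaning of pre_mass).
    pub pre_mass: f64,
    /// Sum of normalized confidences (1.0 up to rounding, or 0.0 if empty).
    pub post_mass: f64,
}

/// Each input is (optional backend confidence, segment text).
pub fn normalize_confidences(segments: &[(Option<f64>, &str)]) -> NormalizationOutcome {
    // Fill gaps with the length-based heuristic from the walkthrough.
    let raw: Vec<f64> = segments
        .iter()
        .map(|&(conf, text)| {
            conf.unwrap_or_else(|| (1.0 + text.chars().count() as f64).ln() + 1.0)
        })
        .collect();
    let pre_mass: f64 = raw.iter().sum();
    if raw.is_empty() {
        return NormalizationOutcome { confidences: raw, pre_mass, post_mass: 0.0 };
    }

    // Stable softmax: subtracting the maximum keeps exp() from overflowing.
    let max = raw.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = raw.iter().map(|x| (x - max).exp()).collect();
    let total: f64 = exps.iter().sum();
    let confidences: Vec<f64> = exps.iter().map(|e| e / total).collect();
    let post_mass = confidences.iter().sum();

    NormalizationOutcome { confidences, pre_mass, post_mass }
}
```

Subtracting the maximum leaves the softmax result unchanged mathematically but guarantees the largest exponent is zero, so large raw scores cannot produce infinities.
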
# Extract just the transcript from a robot run
cargo run -- robot run --input audio.mp3 --backend auto 2>/dev/null \
| jq -r 'select(.event == "run_complete") | .transcript'
# Monitor pipeline progress in real time
cargo run -- robot run --input audio.mp3 --backend auto 2>/dev/null \
| jq -r 'select(.event == "stage") | "\(.code): \(.message)"'
# Extract all segments with timestamps
cargo run -- robot run --input audio.mp3 --backend auto 2>/dev/null \
| jq -r 'select(.event == "run_complete") | .segments[] | "[\(.start_sec)s - \(.end_sec)s] \(.text)"'

#!/bin/bash
# Transcribe all audio files in a directory
for file in recordings/*.mp3; do
echo "Transcribing: $file"
cargo run -- transcribe --input "$file" --json --no-persist \
| jq -r '.result.transcript' > "${file%.mp3}.txt"
done

# Verify all backends are available before running tests
status=$(cargo run -- robot health 2>/dev/null | jq -r '.overall_status')
if [ "$status" != "ok" ]; then
echo "Backend health check failed"
cargo run -- robot health 2>/dev/null | jq '.backends[] | select(.available == false)'
exit 1
fi

# Full export with compression
cargo run -- sync export-jsonl --output ./backup
gzip ./backup/*.jsonl
# Incremental daily backup
cargo run -- sync export-jsonl --output ./daily --incremental
# Validate a snapshot matches the database
cargo run -- sync import-jsonl --input ./backup --conflict-policy skip --dry-run

# On the remote machine (audio source):
cargo run -- tty-audio encode --input recording.wav \
| ssh user@local-machine 'cargo run -- tty-audio decode --output received.wav'
# With retransmit recovery for lossy links:
cargo run -- tty-audio encode --input recording.wav > frames.ndjson
cat frames.ndjson | ssh user@remote 'cat > /tmp/frames.ndjson'
# On remote, check for gaps:
ssh user@remote 'cat /tmp/frames.ndjson | cargo run -- tty-audio retransmit-plan'

use franken_whisper::model::{TranscribeRequest, BackendKind};
use franken_whisper::orchestrator::FrankenWhisperEngine;
use franken_whisper::storage::RunStore;
use std::path::PathBuf;
fn transcribe_file(path: &str) -> Result<String, Box<dyn std::error::Error>> {
let request = TranscribeRequest {
input_path: Some(PathBuf::from(path)),
backend: BackendKind::Auto,
..Default::default()
};
let engine = FrankenWhisperEngine::new()?;
let report = engine.transcribe(request)?;
Ok(report.result.transcript)
}
fn query_history(db_path: &str, limit: usize) -> Result<(), Box<dyn std::error::Error>> {
let store = RunStore::open(std::path::Path::new(db_path))?;
let runs = store.list_recent_runs(limit)?;
for run in &runs {
println!("{}: {} ({})", run.run_id, run.transcript_preview, run.backend);
}
Ok(())
}

# See how the Bayesian router is performing
cargo run -- robot routing-history --limit 20 2>/dev/null \
| jq '.[] | {decision_id, chosen_action, calibration_score, brier_score, fallback_active}'
# Track correction rates in speculative mode
cargo run -- robot run --input audio.mp3 --speculative \
--fast-model tiny.en --quality-model large-v3 2>/dev/null \
| jq 'select(.event == "transcript.speculation_stats")'

WhisperS2T, transcribe-anything, and WhisperLive let you pick a backend. franken_whisper learns which backend to use based on observed outcomes. The Bayesian router maintains Beta-distribution posteriors per backend, tracks calibration via Brier scoring, and falls back to deterministic priority when uncertain.
franken_whisper's conformance harness compares segment output across engines using a 50ms canonical timestamp tolerance, text matching, speaker label matching, and WER approximation. The 5-stage native rollout governance prevents buggy engines from silently degrading quality.
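The timestamp part of that comparison can be sketched as a pairwise check with a 50ms tolerance. This is a minimal illustration under stated assumptions: `Seg` and `segments_compatible` are hypothetical names rather than the crate's `compare_segments_with_tolerance` API, text is compared after trimming only, and speaker labels and WER are left out.

```rust
/// Hypothetical minimal segment shape for the comparison.
#[derive(Debug, Clone, PartialEq)]
pub struct Seg {
    pub start_sec: f64,
    pub end_sec: f64,
    pub text: String,
}

/// The 50 ms canonical timestamp tolerance, expressed in seconds.
pub const TIMESTAMP_TOLERANCE_SEC: f64 = 0.050;

/// Two engines agree if they produce the same number of segments, each
/// pair of boundaries is within `tol` seconds, and the trimmed text matches.
pub fn segments_compatible(a: &[Seg], b: &[Seg], tol: f64) -> bool {
    a.len() == b.len()
        && a.iter().zip(b).all(|(x, y)| {
            (x.start_sec - y.start_sec).abs() <= tol
                && (x.end_sec - y.end_sec).abs() <= tol
                && x.text.trim() == y.text.trim()
        })
}
```
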
franken_whisper runs a fast model and a quality model in parallel on overlapping windows, emits partial transcripts immediately, and issues corrections when the quality model disagrees. The CorrectionTracker adaptively adjusts confirmation thresholds.
Every run is persisted to SQLite with the complete request, result, segments, pipeline events, evidence, and replay envelope. Full and incremental JSONL export with SHA-256 checksums.
The built-in Rust decoder handles MP3, AAC, FLAC, WAV, OGG, Vorbis, ALAC natively with no subprocess, no external binary, and no PATH dependency. ffmpeg is only the fallback.
The robot subcommand is the primary interface: sequenced NDJSON events with stable schema versioning (v1.0.0), 12 structured error codes, health diagnostics, routing history, and speculation events.
franken_whisper enforces #![forbid(unsafe_code)]. Note the distinction: deny can be overridden per-item, but forbid cannot. Combined with cooperative cancellation, atomic transactions, bounded finalizers, and RAII cleanup, this gives strong safety guarantees.
| Document | Description |
|---|---|
| `docs/tty-audio-protocol.md` | Complete TTY audio protocol specification |
| `docs/tty-replay-guarantees.md` | Deterministic replay/framing guarantees |
| `docs/native_engine_contract.md` | Native engine replacement interface contract |
| `docs/engine_compatibility_spec.md` | 50ms timestamp tolerance specification |
| `docs/conformance-contract.md` | Cross-engine conformance test contract |
| `docs/operational-playbook.md` | Deployment and monitoring guide |
| `docs/benchmark_regression_policy.md` | Performance regression thresholds |
| `RECOVERY_RUNBOOK.md` | Disaster recovery procedures |
| `SYNC_STRATEGY.md` | One-way sync semantics |
| `PROPOSED_ARCHITECTURE.md` | System architecture design document |
| `FEATURE_PARITY.md` | Legacy feature parity matrix |

Please don't take this the wrong way, but I do not accept outside contributions for any of my projects. I simply don't have the mental bandwidth to review anything, and it's my name on the thing, so I'm responsible for any problems it causes; thus, the risk-reward is highly asymmetric from my perspective. I'd also have to worry about other "stakeholders," which seems unwise for tools I mostly make for myself for free. Feel free to submit issues, and even PRs if you want to illustrate a proposed fix, but know I won't merge them directly. Instead, I'll have Claude or Codex review submissions via gh and independently decide whether and how to address them. Bug reports in particular are welcome. Sorry if this offends, but I want to avoid wasted time and hurt feelings. I understand this isn't in sync with the prevailing open-source ethos that seeks community contributions, but it's the only way I can move at this velocity and keep my sanity.
MIT License with OpenAI/Anthropic Rider. See LICENSE for the full text.
In short: standard MIT terms apply, with an additional restriction that no rights are granted to OpenAI, Anthropic, or their affiliates without express prior written permission from the author. This rider must be preserved in all copies and derivative works.
