Ultra-fast, on-device voice assistant for Linux
Xenith is a fully local voice assistant that runs entirely on your hardware with no cloud dependencies. Optimized for Intel Core Ultra processors, it achieves ~1.5-2.5 second response times from wake word to audio output.
- 🎤 Wake Word Detection - Always listening with ultra-low power (~2-3W)
- 🧠 Local LLM - Qwen2.5-1.5B runs entirely on-device
- 🔊 Natural TTS - High-quality Piper neural voices
- ⚡ Streaming Response - Audio starts playing as the LLM generates
- 🔒 100% Private - No data leaves your device
- 🎨 Beautiful UI - Animated plasma widget shows voice state
```bash
# Install dependencies
make install

# Run Xenith
make run
```

Say "Hi" to activate, then speak your command!
| Stage | Time |
|---|---|
| Wake word → Detection | ~500ms |
| STT Processing | ~300ms |
| LLM First Token | ~200ms |
| TTS → Audio | ~100ms |
| Total to First Audio | ~1.5s |
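As a rough sanity check (not a measurement), the per-stage timings above sum to about 1.1 s, leaving roughly 400 ms of the ~1.5 s first-audio total for hand-offs between stages:

```python
# Back-of-envelope latency budget from the table above (values in ms).
STAGE_MS = {
    "wake_word_detection": 500,
    "stt_processing": 300,
    "llm_first_token": 200,
    "tts_to_audio": 100,
}

stage_total_ms = sum(STAGE_MS.values())
overhead_ms = 1500 - stage_total_ms  # unaccounted-for hand-off time

print(f"stages: {stage_total_ms} ms, remaining overhead: ~{overhead_ms} ms")
```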
```
                         XENITH VOICE PIPELINE

  🎤 Microphone
       │ (continuous audio stream)
       ▼
  ┌──────────────────┐
  │    Wake Word     │  "Hi" detection
  │    Detection     │  • Whisper on NPU (~2-3W)
  │                  │  • 0.5s check interval
  └────────┬─────────┘
           │
           ▼
  ┌──────────────────┐
  │  Speech-to-Text  │  Whisper STT
  │      (STT)       │  • OpenVINO on NPU
  │                  │  • 0.3s silence threshold
  └────────┬─────────┘
           │ text
           ▼
  ┌──────────────────┐
  │    LLM Brain     │  Qwen2.5-1.5B
  │                  │  • OpenVINO on CPU (fast, ~200ms warmup)
  │                  │  • Token streaming enabled
  └────────┬─────────┘
           │ streaming tokens
           ▼
  ┌──────────────────┐
  │     Sentence     │  Buffers tokens until sentence complete
  │      Buffer      │  • Min 3 chars, ends on .!?;,:
  └────────┬─────────┘
           │ sentences
           ▼
  ┌──────────────────┐
  │   TTS (Piper)    │  Neural text-to-speech
  │                  │  • ~100ms per sentence
  │                  │  • In-memory audio (no file I/O)
  └────────┬─────────┘
           │ numpy audio
           ▼
  ┌──────────────────┐
  │   Audio Player   │  Real-time playback
  │                  │  • Direct sounddevice output
  │                  │  • 10ms queue polling
  └────────┬─────────┘
           │
           ▼
  🔊 Speakers
```
All audio data flows in-memory, with zero file I/O in the critical path:

```
Mic → numpy → STT(NPU) → text → LLM(CPU) → tokens → TTS(CPU) → numpy → speakers
      └──────────────────────── in-memory ────────────────────────┘
```
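The Audio Player end of this path drains a queue of numpy chunks, polling every ~10 ms. A minimal sketch of that loop, with the playback callable injected (in Xenith it would wrap `sounddevice` output; here a stand-in so the loop runs without audio hardware):

```python
import queue
import threading


def audio_player(q: queue.Queue, play, poll_interval: float = 0.010):
    """Drain audio chunks from q, calling play(chunk) for each.

    Illustrative sketch, not Xenith's real player. A None sentinel
    shuts the loop down; the timeout gives ~10 ms queue polling.
    """
    while True:
        try:
            chunk = q.get(timeout=poll_interval)
        except queue.Empty:
            continue  # nothing queued yet; poll again
        if chunk is None:  # end-of-stream sentinel
            return
        play(chunk)


# Example: drive the player from a worker thread with a stand-in play().
played = []
q = queue.Queue()
worker = threading.Thread(target=audio_player, args=(q, played.append))
worker.start()
for chunk in ("chunk-0", "chunk-1", "chunk-2"):
    q.put(chunk)
q.put(None)
worker.join()
```

Keeping chunks as in-memory arrays end to end is what removes file I/O from the critical path.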
Minimum:
- Intel Core Ultra (Meteor Lake) or newer
- 8GB RAM
- 5GB disk space

Recommended:
- Intel Core Ultra 7/9
- 16GB RAM
- Intel Arc GPU (optional, for larger models)
Edit `config/config.yaml`:

```yaml
llm:
  # CPU recommended for fast response (~200ms warmup)
  # NPU is low power but slow (~2.3s warmup per query)
  device: "CPU"
  model: "qwen2.5-1.5b"

audio:
  stt:
    device: "auto"        # NPU → Intel GPU → CPU
    model: "base"
  tts:
    voice: "EN-Default"   # Ryan male voice (high quality)
```

| Device | LLM Warmup | Power | Best For |
|---|---|---|---|
| CPU | ~200ms | ~15-30W | Fast response (recommended) |
| NPU | ~2,300ms | ~3-5W | Battery life |
| GPU | ~300ms | ~30-50W | Larger models |
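The `device: "auto"` STT setting falls back NPU → GPU → CPU. A sketch of how such resolution could work; the function name is hypothetical, and in practice the `available` list could come from OpenVINO's `Core().available_devices`:

```python
# Hypothetical device-resolution helper, not Xenith's real loader.
PREFERENCE = ("NPU", "GPU", "CPU")  # "auto" fallback order from the config


def resolve_device(requested: str, available: list[str]) -> str:
    """Map a config device string onto the devices actually present.

    "auto" walks the preference order; an explicit device is used
    only if present, otherwise we fall back to CPU.
    """
    requested = requested.upper()
    if requested == "AUTO":
        for dev in PREFERENCE:
            if dev in available:
                return dev
        raise RuntimeError("no usable inference device found")
    return requested if requested in available else "CPU"
```

For example, `resolve_device("auto", ["CPU", "GPU", "NPU"])` picks the NPU, while on a machine without one the same config silently lands on the GPU or CPU.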
```
src/
├── app.py                     # Main GTK application
├── main.py                    # Entry point
├── widgets/
│   └── plasma_widget.py       # Animated voice indicator
└── audio/
    ├── voice_input.py         # Wake word & STT handling
    ├── streaming_pipeline.py  # LLM + TTS streaming
    ├── pipeline_metrics.py    # Performance tracking
    ├── stt_backends/          # Speech-to-Text
    │   ├── openvino_backend.py
    │   └── whisper_backend.py
    ├── llm_backends/          # Language Models
    │   └── openvino_backend.py
    └── tts_backends/          # Text-to-Speech
        ├── piper_backend.py
        └── melotts_backend.py
```
```bash
# Test STT backends
python test_stt_backends.py

# Test TTS backends
python test_tts_backends.py

# Test LLM backends
python test_llm_backends.py

# Test full pipeline
python test_streaming_pipeline.py
```

- Voice Pipeline Architecture
- Intel NPU Setup Guide
- STT Backends Reference
- TTS Backends Reference
- LLM Backends Reference
- Scripts Reference
For maximum speed:

```yaml
llm:
  device: "CPU"      # 12x faster than NPU
audio:
  tts:
    voice: "EN-Fast" # Medium quality, faster synthesis
```

For maximum battery life:

```yaml
llm:
  device: "NPU"      # Slower but efficient
audio:
  stt:
    device: "NPU"
```

MIT License - See LICENSE file for details.