
# Xenith

**Ultra-fast, on-device voice assistant for Linux**

Xenith is a fully local voice assistant that runs entirely on your hardware with no cloud dependencies. Optimized for Intel Core Ultra processors, it achieves ~1.5-2.5 second response times from wake word to audio output.


## Features

- 🎤 **Wake Word Detection** - Always listening with ultra-low power (~2-3W)
- 🧠 **Local LLM** - Qwen2.5-1.5B runs entirely on-device
- 🔊 **Natural TTS** - High-quality Piper neural voices
- ⚡ **Streaming Response** - Audio starts playing as the LLM generates
- 🔒 **100% Private** - No data leaves your device
- 🎨 **Beautiful UI** - Animated plasma widget shows voice state

## Quick Start

```sh
# Install dependencies
make install

# Run Xenith
make run
```

Say "Hi" to activate, then speak your command!

## Performance

| Stage | Time |
|-------|------|
| Wake word → Detection | ~500ms |
| STT Processing | ~300ms |
| LLM First Token | ~200ms |
| TTS → Audio | ~100ms |
| Total to First Audio | ~1.5s |

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                     XENITH VOICE PIPELINE                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   🎤 Microphone                                                     │
│        │ (continuous audio stream)                                  │
│        ▼                                                            │
│   ┌──────────────────┐                                              │
│   │   Wake Word      │  "Hi" detection                              │
│   │   Detection      │  • Whisper on NPU (~2-3W)                    │
│   │                  │  • 0.5s check interval                       │
│   └────────┬─────────┘                                              │
│            │                                                        │
│            ▼                                                        │
│   ┌──────────────────┐                                              │
│   │   Speech-to-Text │  Whisper STT                                 │
│   │   (STT)          │  • OpenVINO on NPU                           │
│   │                  │  • 0.3s silence threshold                    │
│   └────────┬─────────┘                                              │
│            │ text                                                   │
│            ▼                                                        │
│   ┌──────────────────┐                                              │
│   │   LLM Brain      │  Qwen2.5-1.5B                                │
│   │                  │  • OpenVINO on CPU (fast, ~200ms warmup)     │
│   │                  │  • Token streaming enabled                   │
│   └────────┬─────────┘                                              │
│            │ streaming tokens                                       │
│            ▼                                                        │
│   ┌──────────────────┐                                              │
│   │   Sentence       │  Buffers tokens until sentence complete      │
│   │   Buffer         │  • Min 3 chars, ends on .!?;,:               │
│   └────────┬─────────┘                                              │
│            │ sentences                                              │
│            ▼                                                        │
│   ┌──────────────────┐                                              │
│   │   TTS (Piper)    │  Neural text-to-speech                       │
│   │                  │  • ~100ms per sentence                       │
│   │                  │  • In-memory audio (no file I/O)             │
│   └────────┬─────────┘                                              │
│            │ numpy audio                                            │
│            ▼                                                        │
│   ┌──────────────────┐                                              │
│   │   Audio Player   │  Real-time playback                          │
│   │                  │  • Direct sounddevice output                 │
│   │                  │  • 10ms queue polling                        │
│   └────────┬─────────┘                                              │
│            │                                                        │
│            ▼                                                        │
│   🔊 Speakers                                                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
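The sentence-buffer stage can be sketched in a few lines. This is a minimal illustration of the behavior described above (minimum 3 characters, flush on `.!?;,:`); the class name and method are assumptions, not the project's actual API:

```python
# Sketch of the sentence-buffering stage: accumulate streamed LLM
# tokens and emit a chunk once it ends in sentence punctuation.
# Names and thresholds are illustrative, based on the README's notes.
SENTENCE_ENDS = set(".!?;,:")
MIN_CHARS = 3  # avoid flushing tiny fragments

class SentenceBuffer:
    def __init__(self):
        self._buf = ""

    def feed(self, token: str):
        """Add a token; return a completed sentence, or None if still buffering."""
        self._buf += token
        stripped = self._buf.strip()
        if len(stripped) >= MIN_CHARS and stripped[-1] in SENTENCE_ENDS:
            self._buf = ""
            return stripped
        return None

buf = SentenceBuffer()
sentences = []
for tok in ["Hello", " there", ".", " How", " can", " I", " help", "?"]:
    s = buf.feed(tok)
    if s:
        sentences.append(s)
print(sentences)  # ['Hello there.', 'How can I help?']
```

Buffering whole sentences, rather than synthesizing per token, is what lets TTS start while the LLM is still generating without producing choppy audio.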

### Data Flow (Optimized)

All audio data flows in-memory with zero file I/O in the critical path:

```
Mic → numpy → STT(NPU) → text → LLM(CPU) → tokens → TTS(CPU) → numpy → speakers
      ↑                                                              ↑
      └── in-memory ─────────────────────────────────────────────────┘
```

## Hardware Requirements

### Minimum

- Intel Core Ultra (Meteor Lake) or newer
- 8GB RAM
- 5GB disk space

### Recommended

- Intel Core Ultra 7/9
- 16GB RAM
- Intel Arc GPU (optional, for larger models)

## Configuration

Edit `config/config.yaml`:

```yaml
llm:
  # CPU recommended for fast response (~200ms warmup)
  # NPU is low power but slow (~2.3s warmup per query)
  device: "CPU"
  model: "qwen2.5-1.5b"

audio:
  stt:
    device: "auto"  # NPU → Intel GPU → CPU
    model: "base"
  tts:
    voice: "EN-Default"  # Ryan male voice (high quality)
```

### Device Trade-offs

| Device | LLM Warmup | Power | Best For |
|--------|------------|-------|----------|
| CPU | ~200ms | ~15-30W | Fast response (recommended) |
| NPU | ~2,300ms | ~3-5W | Battery life |
| GPU | ~300ms | ~30-50W | Larger models |
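The `device: "auto"` setting implies a fallback chain (NPU → Intel GPU → CPU). A minimal sketch of such selection logic follows; the function and priority list are assumptions for illustration, and in practice the available-device list would come from the OpenVINO runtime rather than being passed in:

```python
# Illustrative fallback for the "auto" device setting: prefer NPU,
# then GPU, then CPU. The available list is passed in directly so
# the sketch is self-contained.
PRIORITY = ["NPU", "GPU", "CPU"]

def pick_device(requested: str, available: list) -> str:
    """Resolve 'auto' to the first available device in priority order."""
    if requested != "auto":
        return requested  # an explicit setting always wins
    for dev in PRIORITY:
        if dev in available:
            return dev
    return "CPU"  # CPU is always a safe default

print(pick_device("auto", ["GPU", "CPU"]))        # GPU
print(pick_device("auto", ["NPU", "GPU", "CPU"]))  # NPU
print(pick_device("CPU", ["NPU"]))                 # CPU
```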

## Project Structure

```
src/
├── app.py                  # Main GTK application
├── main.py                 # Entry point
├── widgets/
│   └── plasma_widget.py    # Animated voice indicator
└── audio/
    ├── voice_input.py      # Wake word & STT handling
    ├── streaming_pipeline.py  # LLM + TTS streaming
    ├── pipeline_metrics.py    # Performance tracking
    ├── stt_backends/       # Speech-to-Text
    │   ├── openvino_backend.py
    │   └── whisper_backend.py
    ├── llm_backends/       # Language Models
    │   └── openvino_backend.py
    └── tts_backends/       # Text-to-Speech
        ├── piper_backend.py
        └── melotts_backend.py
```

## Testing

```sh
# Test STT backends
python test_stt_backends.py

# Test TTS backends
python test_tts_backends.py

# Test LLM backends
python test_llm_backends.py

# Test full pipeline
python test_streaming_pipeline.py
```

## Performance Tuning

### For Fastest Response (~1.5s)

```yaml
llm:
  device: "CPU"  # 12x faster than NPU
audio:
  tts:
    voice: "EN-Fast"  # Medium quality, faster synthesis
```

### For Lowest Power (~3-5W active)

```yaml
llm:
  device: "NPU"  # Slower but efficient
audio:
  stt:
    device: "NPU"
```

## License

MIT License - See LICENSE file for details.
