Portable multi-GPU text-to-speech server for Windows — 10 models, one-click install, zero dependencies.
Send text in, get broadcast-ready audio out. Voice cloning, dialogue, emotion control, multilingual synthesis — across any combination of NVIDIA GPUs. No system Python, no Git, no FFmpeg, no Docker, no admin rights. Everything downloads automatically into one portable folder.
- 10 TTS models — from 82M-parameter Kokoro to 7B-parameter Qwen, covering voice cloning, dialogue, emotion control, and multilingual synthesis
- Multi-GPU inference — pin workers to any detected GPU, run the same model on multiple GPUs simultaneously
- Zero-install portable app — embedded Python 3.10, portable Git, bundled FFmpeg; copy the folder to any Windows machine
- Production audio pipeline — hierarchical text chunking, 7-stage post-processing (de-reverb, highpass, de-ess, tempo, trim, LUFS normalization, peak limiting), multi-format export
- Whisper verification — optional transcription check scores synthesized audio against the original text
- Auto-scaling workers — workers spawn on first request, fail over to siblings, and get health-checked every 10 seconds
- Full REST API — 40+ endpoints with interactive Swagger docs at `/docs`
- GUI environment manager — install venvs, download models, and launch the server from a Tkinter interface
```
                         Port 8100
                    +---------------+
  Client ---------> |    GATEWAY    |
                    |   (FastAPI)   |
                    |               |
                    | - Pipeline    |  Text chunking, audio post-processing,
                    | - Job mgmt    |  whisper verification, assembly,
                    | - Load balance|  format conversion, job tracking
                    | - Worker mgmt |
                    +-------+-------+
                            |
          +-----------------+---------------+----------------+
          |                 |               |                |
      Port 8101         Port 8102       Port 8103        Port 8104
    +-----------+     +-----------+   +-----------+    +-----------+
    |  WORKER   |     |  WORKER   |   |  WORKER   |    |  WORKER   |
    |  xtts #1  |     | kokoro #1 |   | kokoro #2 |    |  higgs #1 |
    |  cuda:0   |     |  cuda:0   |   |  cuda:1   |    |    cpu    |
    +-----------+     +-----------+   +-----------+    +-----------+
     coqui_env        unified_env     unified_env       higgs_env
```
- Gateway (port 8100) — orchestrates the full pipeline, delegates inference to workers via HTTP
- Workers (ports 8101-8200) — each runs one model on one device as an isolated subprocess
- Each worker injects only its own venv's `site-packages` — no cross-environment dependency conflicts
- Same model can run multiple instances across GPUs for concurrent inference
- Workers auto-spawn on first request if none exist for a model
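The round-robin balancing described above can be sketched in a few lines — `WorkerRegistry` here is an illustrative toy, not the actual `worker_registry.py` API:

```python
class WorkerRegistry:
    """Toy sketch of round-robin worker selection per model.

    The real registry (worker_registry.py) also tracks ports, VRAM, and
    health; this only illustrates the balancing idea.
    """

    def __init__(self):
        self._workers = {}  # model name -> list of worker dicts
        self._cursors = {}  # model name -> rotating pick index

    def register(self, model, worker_id, device, port):
        self._workers.setdefault(model, []).append(
            {"id": worker_id, "device": device, "port": port, "ready": True}
        )

    def pick(self, model):
        """Return the next ready worker for a model, round-robin."""
        pool = [w for w in self._workers.get(model, []) if w["ready"]]
        if not pool:
            return None  # the gateway would auto-spawn a worker here
        i = self._cursors.get(model, 0) % len(pool)
        self._cursors[model] = i + 1
        return pool[i]
```

With two `kokoro` workers registered on `cuda:0` and `cuda:1`, successive `pick("kokoro")` calls alternate between them, which is how one model serves concurrent requests across GPUs.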
Download the latest release (v3.0.2), extract anywhere, and run `install.bat`.

`install.bat` downloads embedded Python 3.10 and portable Git, configures pip, installs gateway dependencies, and launches the GUI. From there, install model environments and download weights.

`launcher.bat api` starts the gateway on http://127.0.0.1:8100. Workers auto-spawn when you send your first request. Interactive API docs at http://127.0.0.1:8100/docs.
```bash
curl -X POST http://127.0.0.1:8100/api/tts/kokoro \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from Kokoro!", "voice": "af_heart"}'
```

Models like XTTS, Chatterbox, Fish, and F5-TTS can clone a voice from a reference audio file. Encode it as base64:
```bash
# XTTS voice cloning
REF=$(base64 -w 0 my_voice.wav)
curl -X POST http://127.0.0.1:8100/api/tts/xtts \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"Hello in my voice!\", \"reference_audio\": \"$REF\"}"

# F5-TTS requires both reference audio AND its transcript
curl -X POST http://127.0.0.1:8100/api/tts/f5 \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"New text to say.\", \"reference_audio\": \"$REF\", \"reference_text\": \"The words spoken in the reference audio.\"}"

# XTTS with a built-in voice (no reference audio needed)
curl -X POST http://127.0.0.1:8100/api/tts/xtts \
  -H "Content-Type: application/json" \
  -d '{"text": "Using a built-in voice.", "voice": "Ana Florence", "mode": "builtin"}'

# Dia multi-speaker dialogue
curl -X POST http://127.0.0.1:8100/api/tts/dia \
  -H "Content-Type: application/json" \
  -d '{"text": "[S1] How are you? [S2] Great, thanks!"}'
```

```bat
launcher.bat gui              # Launch environment manager GUI
launcher.bat api --port 8200  # Custom port
launcher.bat download --all   # Download all model weights
launcher.bat download --list  # List available models
launcher.bat setup            # Re-run full setup
```

Ten models spanning voice cloning, dialogue, emotion control, and multilingual synthesis:
| Model | Params | Max Chars | Sample Rate | Key Capability |
|---|---|---|---|---|
| XTTS v2 | ~500M | 250 | 22050 Hz | Multilingual voice cloning from audio samples |
| Bark | ~1B | 200 | 24000 Hz | Expressive audio with laughter, music, nonverbals |
| Fish Speech 1.5 | ~500M | 250 | 24000 Hz | Fast neural TTS with V1.5 FireflyGAN codec |
| Kokoro | 82M | 500 | 24000 Hz | Lightweight and fast, 54 built-in voices |
| Dia 1.6B-0626 | 1.6B | 400 | 44100 Hz | Multi-speaker dialogue with [S1]/[S2] tags |
| Chatterbox | ~500M | 250 | 24000 Hz | Emotion and exaggeration control |
| F5-TTS | ~300M | 135 | 24000 Hz | Diffusion-based, reference audio + transcript |
| Qwen2.5-Omni | 7B | 300 | 24000 Hz | Multimodal LLM, speakers: Chelsie, Ethan |
| VibeVoice | 1.5B | 300 | 24000 Hz | 90-minute generation, multi-speaker via Speaker N: |
| Higgs Audio | 3B | 300 | 24000 Hz | Automatic prosody, ChatML format, runs on CPU |
| Endpoint | Voices |
|---|---|
| `/api/tts/xtts/voices` | 58 built-in speakers |
| `/api/tts/kokoro/voices` | 54 voices (dynamic .pt scan) |
| `/api/tts/bark/voices` | 260+ speaker presets (.npy embeddings) |
| `/api/tts/qwen/voices` | Chelsie, Ethan |
| Other models | Reference audio instructions returned |
Each model group runs in its own virtual environment to avoid dependency conflicts:
| Environment | Models | PyTorch | CUDA | transformers |
|---|---|---|---|---|
| `coqui_env` | XTTS, Bark | 2.6.0+cu124 | 12.4 | <5.0 |
| `unified_env` | Kokoro, Fish, Dia | 2.6.0+cu124 | 12.4 | flexible |
| `chatterbox_env` | Chatterbox | 2.6.0+cu124 | 12.4 | 4.46.3 |
| `f5tts_env` | F5-TTS | 2.8.0+cu126 | 12.6 | 5.1.0 |
| `qwen3_env` | Qwen2.5-Omni | 2.6.0+cu124 | 12.4 | 4.57.3 |
| `vibevoice_env` | VibeVoice | 2.5.1+cu121 | 12.1 | 4.51.3 |
| `higgs_env` | Higgs Audio | CPU-only | — | 4.47.0 |
Full interactive documentation is available at http://localhost:8100/docs when the server is running.
All TTS endpoints accept POST with JSON body:
```
POST /api/tts/xtts       POST /api/tts/bark       POST /api/tts/fish
POST /api/tts/kokoro     POST /api/tts/dia        POST /api/tts/chatterbox
POST /api/tts/f5         POST /api/tts/qwen       POST /api/tts/vibevoice
POST /api/tts/higgs
```
```jsonc
{
  "status": "completed",
  "job_id": "abc123-...",
  "filename": "kokoro_20260207_143000_final.wav",
  "saved_to": "output/jobs/abc123/kokoro_..._final.wav",
  "sample_rate": 24000,
  "duration_sec": 3.45,
  "format": "wav",
  "audio_base64": "<base64-encoded audio>",  // Omitted when save_path is set
  "chunks": 1,
  "whisper_result": null                     // Populated when verify_whisper=true
}
```

When `save_path` is provided, audio is saved to `projects_output/{save_path}/` and `audio_base64` is omitted from the response.
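On the client side, the `audio_base64` field just needs to be decoded and written to disk. A minimal Python sketch — `save_response_audio` is a hypothetical helper, not part of this project, and a real client would first POST to `/api/tts/{model}`:

```python
import base64

def save_response_audio(response_json: dict, out_path: str) -> int:
    """Decode the audio_base64 field of a TTS response and write it to disk.

    Returns the number of bytes written. Raises if the server omitted the
    field (which happens when save_path was set in the request).
    """
    audio_b64 = response_json.get("audio_base64")
    if audio_b64 is None:
        # save_path was used; the audio is already on disk server-side
        raise ValueError("response has no audio_base64 (save_path was used?)")
    data = base64.b64decode(audio_b64)
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)
```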
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/workers/spawn` | Spawn worker: `{"model":"kokoro","device":"cuda:0"}` |
| DELETE | `/api/workers/{worker_id}` | Kill a specific worker |
| GET | `/api/workers` | List all workers with status, device, VRAM |
| POST | `/api/models/{model}/scale` | Scale to N instances: `{"count":2,"device":"cuda:0"}` |
| POST | `/api/models/{model}/load` | Load model (spawn a worker) |
| POST | `/api/models/{model}/unload` | Unload model (kill all its workers) |
| GET | `/api/devices` | GPU discovery with VRAM info |
```bash
# Spawn a Kokoro worker on GPU 0
curl -X POST http://127.0.0.1:8100/api/workers/spawn \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro", "device": "cuda:0"}'

# Run Kokoro on GPU 0 and XTTS on GPU 1 simultaneously
curl -X POST http://127.0.0.1:8100/api/workers/spawn \
  -H "Content-Type: application/json" \
  -d '{"model": "xtts", "device": "cuda:1"}'

# Scale Kokoro to 2 instances for parallel inference
curl -X POST http://127.0.0.1:8100/api/models/kokoro/scale \
  -H "Content-Type: application/json" \
  -d '{"count": 2, "device": "cuda:0"}'

# List all workers
curl http://127.0.0.1:8100/api/workers

# Kill a specific worker
curl -X DELETE http://127.0.0.1:8100/api/workers/kokoro-1
```

| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/jobs` | List all jobs (running, completed, failed) |
| GET | `/api/jobs/{job_id}` | Job status, chunks, progress metadata |
| POST | `/api/jobs/{job_id}/recover` | Resume interrupted job from last completed chunk |
| POST | `/api/tts/{model}/cancel` | Cancel running job(s) for a model |
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/whisper` | Status, available sizes, loaded instances |
| POST | `/api/whisper/{size}/load` | Load model: tiny, base, small, medium, large |
| POST | `/api/whisper/{size}/unload` | Unload model and free VRAM |
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check with loaded models and worker count |
| GET | `/api/models` | List all available models |
| GET | `/api/models/status` | Detailed status per model (workers, loaded) |
| GET | `/api/tts/{model}/voices` | Available voices for a model |
| GET | `/docs` | Interactive Swagger documentation |
- Hierarchical chunking — splits on sentence boundaries first, then clauses, then words
- Model-aware limits — each model has a tuned character limit (135–500 chars)
- Unicode normalization — smart quotes, whitespace collapse, NFKC
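The hierarchical strategy above can be sketched as a recursive splitter that tries sentence boundaries first and only falls back to harder splits when a piece is still over the limit. This is an illustrative sketch, not the real `text_utils.py` implementation:

```python
import re
import unicodedata

def chunk_text(text: str, max_chars: int) -> list:
    """Sketch of hierarchical chunking: normalize (NFKC + whitespace
    collapse), then split on sentences, then clauses, then words, so no
    chunk exceeds the model's character limit."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text).strip()

    def split(piece, seps):
        if len(piece) <= max_chars:
            return [piece]
        if not seps:  # last resort: greedy word packing
            words, out, cur = piece.split(" "), [], ""
            for w in words:
                if cur and len(cur) + 1 + len(w) > max_chars:
                    out.append(cur)
                    cur = w
                else:
                    cur = f"{cur} {w}".strip()
            return out + [cur] if cur else out
        chunks = []
        for p in re.split(seps[0], piece):
            p = p.strip()
            if p:
                chunks.extend(split(p, seps[1:]))
        return chunks

    # sentence boundaries first, then clause boundaries, then words
    return split(text, [r"(?<=[.!?])\s+", r"(?<=[,;:])\s+"])
```

For Kokoro (`max_chars=500`) most texts split cleanly on sentences; for F5-TTS (`max_chars=135`) the clause and word fallbacks do more of the work.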
Applied per chunk, fully configurable per request:
| Stage | Library | Description |
|---|---|---|
| 1. De-reverb | noisereduce | Spectral gating using first 0.2s as noise profile |
| 2. Highpass | scipy | 4th-order Butterworth at 80 Hz |
| 3. De-ess | scipy | Multiband Hilbert envelope, 3 kHz+ crossover, 4:1 compression |
| 4. Tempo | pyrubberband | Time-stretch without pitch shift |
| 5. Trim | pydub | Silence detection with front/end protection zones |
| 6. LUFS | pyloudnorm | Loudness normalization to -23 LUFS |
| 7. Peak limit | numpy | Hard clamp to 0.95 (prevents clipping) |
Each stage degrades gracefully — if a library is unavailable, that stage is skipped with a warning.
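That graceful degradation can be sketched as a small wrapper that only runs a stage when its library imports — `run_stage` is a hypothetical helper, not the actual `audio_processing.py` API:

```python
import importlib
import warnings

def run_stage(name, module_name, fn, audio):
    """Run a processing stage only if its library is importable; otherwise
    warn and pass the audio through unchanged so the pipeline continues."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        warnings.warn(f"stage '{name}' skipped: {module_name} not installed")
        return audio  # pass-through
    return fn(audio)

# Peak limiting needs nothing beyond the stdlib, so it always runs:
limited = run_stage(
    "peak_limit", "math",
    lambda a: [max(-0.95, min(0.95, s)) for s in a],
    [1.2, -2.0, 0.5],
)
```

A stage bound to a missing optional library (say, `pyrubberband` for tempo) would emit a warning and return its input untouched, exactly the behavior described above.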
- Configurable silence padding between chunks
- Front/end padding for natural spacing
- Format conversion via FFmpeg: WAV, MP3, OGG, FLAC, M4A
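Assembly itself is just concatenation with silence inserted between and around chunks. A sketch on plain Python lists (the real `audio_assembler.py` works on numpy arrays); the gap and padding durations here are illustrative defaults, not the project's:

```python
def assemble_chunks(chunks, sample_rate, gap_sec=0.15, pad_sec=0.1):
    """Concatenate per-chunk sample lists with a silence gap between
    chunks and front/end padding for natural spacing."""
    gap = [0.0] * int(gap_sec * sample_rate)
    pad = [0.0] * int(pad_sec * sample_rate)
    out = list(pad)
    for i, chunk in enumerate(chunks):
        if i:
            out += gap  # silence between chunks only, not before the first
        out += chunk
    return out + pad
```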
- Powered by OpenAI Whisper running as a dedicated worker
- Transcribes generated audio and scores it against the original text
- Configurable similarity threshold (0–100%)
- Five model sizes from tiny (39M) to large (1.5B)
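The scoring step reduces to comparing the Whisper transcript against the requested text. A sketch of the idea using a simple sequence-similarity ratio — the project's actual scoring metric may differ:

```python
import re
from difflib import SequenceMatcher

def verify_transcript(original, transcript, tolerance=80):
    """Normalize both strings (lowercase, strip punctuation), compute a
    0-100 similarity score, and pass/fail against the tolerance threshold."""
    def norm(s):
        return re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()
    score = round(
        SequenceMatcher(None, norm(original), norm(transcript)).ratio() * 100, 1
    )
    return {"score": score, "passed": score >= tolerance}
```

With the default `tolerance=80`, minor punctuation differences pass while a badly garbled synthesis fails and can trigger a retry.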
1. Client POSTs text to /api/tts/{model}
2. Gateway normalizes and chunks text per model limits
3. Creates a tracked job in output/jobs/
4. For each chunk:
a. Round-robin picks a ready worker (or auto-spawns one)
b. POSTs to worker's /infer endpoint via HTTP
c. Worker returns base64-encoded numpy audio
d. Gateway applies 7-stage post-processing
e. Optional Whisper verification
f. Updates job metadata
5. Assembles chunks with silence padding
6. Converts to requested format via FFmpeg
7. Returns audio (base64 JSON or file path)
- Workers are Python subprocesses, each running its own FastAPI server
- Each injects only its model's venv `site-packages` into `sys.path`
- Health checks run every 10 seconds; 3 consecutive failures trigger cleanup
- Dead workers are terminated, their ports released, and they're unregistered
- Failed inference automatically retries on a different worker (up to 3 attempts)
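The retry behavior can be sketched as a loop over sibling workers — each "worker" below is just a callable that returns audio or raises, standing in for the HTTP POST to a worker's `/infer` endpoint:

```python
def infer_with_failover(workers, payload, max_attempts=3):
    """Try each worker in order, up to max_attempts; re-raise as a single
    error if all of them fail. Real code checks HTTP status and timeouts."""
    last_error = None
    for worker in workers[:max_attempts]:
        try:
            return worker(payload)
        except Exception as e:
            last_error = e  # remember the failure, try a sibling
    raise RuntimeError(
        f"all {min(len(workers), max_attempts)} attempts failed"
    ) from last_error
```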
Jobs track per-chunk progress in `output/jobs/{job_id}/job.json`. If a long synthesis is interrupted, send `"text": "##recover##"` with the same `save_path` to resume from the last completed chunk.
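Conceptually, recovery only needs to find the first chunk that never completed. A sketch under an assumed schema (a `"chunks"` list of `{"status": ...}` entries — the real `job.json` layout may differ):

```python
import json

def next_chunk_index(job_json_text):
    """Return the index of the first chunk still needing synthesis,
    or the chunk count if everything already completed."""
    job = json.loads(job_json_text)
    chunks = job.get("chunks", [])
    for i, chunk in enumerate(chunks):
        if chunk.get("status") != "completed":
            return i  # resume here
    return len(chunks)
```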
Key settings in `config.py`:

```python
DEFAULT_API_PORT = 8100         # Gateway port
WORKER_PORT_MIN = 8101          # Worker port range
WORKER_PORT_MAX = 8200          # Up to 100 concurrent workers
WORKER_HEALTH_INTERVAL = 10     # Health check every 10 seconds
WORKER_STARTUP_TIMEOUT = 120    # Max wait for model loading (seconds)
WORKER_MAX_HEALTH_FAILURES = 3  # Failures before worker cleanup
WORKER_AUTO_SPAWN = True        # Spawn workers on first request
WORKER_DEFAULT_DEVICE = "cuda:0"
```

Tools resolve portable-first, with system fallback:

- Python: `python_embedded/python.exe` → system Python
- Git: `git_portable/cmd/git.exe` → system Git
- FFmpeg: `ffmpeg/bin/ffmpeg.exe` → `ffmpeg/*/bin/ffmpeg.exe` → system FFmpeg
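The portable-first lookup amounts to globbing each bundled location in order and falling back to `PATH`. A sketch (`resolve_tool` is a hypothetical helper, not the project's actual function):

```python
import glob
import os
import shutil

def resolve_tool(portable_candidates, system_name):
    """Return the first existing portable path (glob patterns allowed,
    e.g. 'ffmpeg/*/bin/ffmpeg.exe'), else whatever PATH provides, else None."""
    for pattern in portable_candidates:
        matches = glob.glob(pattern)
        if matches:
            return os.path.abspath(matches[0])
    return shutil.which(system_name)  # None if nothing is installed
```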
```
tts_module/
├── install.bat              One-click setup (downloads Python, Git, FFmpeg)
├── launcher.bat             Command router (gui, api, download, setup)
├── requirements.txt         Gateway dependencies
├── config.py                Central configuration
│
├── tts_api_server.py        Gateway: pipeline, jobs, worker management
├── tts_worker.py            Worker: single-model FastAPI inference server
├── worker_registry.py       Worker tracking: port pool, round-robin balancing
├── worker_manager.py        Worker lifecycle: spawn, kill, scale, health
├── job_manager.py           Job tracking: create, update, cancel, recover
│
├── text_utils.py            Hierarchical text chunking + normalization
├── audio_profiles.py        Per-model audio processing presets
├── audio_processing.py      7-stage post-processing pipeline
├── audio_assembler.py       Chunk assembly + FFmpeg format conversion
│
├── tts_manager.py           Tkinter GUI (environments, models, server)
├── download_all_models.py   HuggingFace model downloader
│
├── install_configs/         8 virtual environment configurations
│   ├── base.py              Abstract base (pip install, git clone, verify)
│   ├── coqui_tts.py         XTTS, Bark
│   ├── unified_tts.py       Kokoro, Fish Speech, Dia
│   ├── chatterbox.py        Chatterbox
│   ├── f5_tts.py            F5-TTS
│   ├── qwen3_tts.py         Qwen2.5-Omni
│   ├── vibevoice.py         VibeVoice
│   ├── higgs_audio.py       Higgs Audio
│   └── orpheus.py           Orpheus (reserved)
│
├── python_embedded/         [Auto] Embedded Python 3.10.11
├── git_portable/            [Auto] Portable Git 2.47.1
├── ffmpeg/                  Portable FFmpeg
├── tts_models/              [Auto] HuggingFace model cache
├── venvs/                   [Auto] 8 isolated virtual environments
├── voices/                  [Auto] Voice reference audio files
└── output/                  [Auto] Generated audio + logs
    ├── jobs/{job_id}/       Per-job chunks, metadata, final audio
    └── logs/                Worker log files
```
| Requirement | Minimum | Recommended |
|---|---|---|
| OS | Windows 10 64-bit | Windows 11 |
| GPU | NVIDIA with CUDA (any) | 12 GB+ VRAM |
| RAM | 16 GB | 32 GB |
| Disk | ~2 GB (base install) | ~50 GB (all envs + models) |
| Internet | Required for setup | Required for setup |
CUDA versions used: 12.1, 12.4, 12.6 (depending on model environment). Higgs Audio runs on CPU.
| Model | VRAM | Notes |
|---|---|---|
| Kokoro | ~1 GB | Lightweight, great for low-VRAM GPUs |
| F5-TTS | ~2 GB | |
| Chatterbox | ~3 GB | |
| XTTS v2 | ~3 GB | |
| Fish Speech | ~3 GB | |
| Bark | ~4 GB | |
| Dia | ~6 GB | |
| VibeVoice | ~6 GB | |
| Higgs Audio | CPU-only | Uses ~6 GB system RAM |
| Qwen2.5-Omni | ~14 GB | Requires RTX 3090 or better |
espeak-ng is required for Kokoro (phoneme synthesis). Automatically downloaded and configured by install.bat.
The Tkinter GUI (tts_manager.py) provides three tabs:
Environments — Install, remove, and manage the 8 virtual environments. Status indicators show installed (green), installing (yellow), not installed (gray), or error (red). Each environment has Install, Terminal (opens activated cmd), Remove, and Info buttons.
Models — Download model weights from HuggingFace Hub with progress tracking. Supports selective or bulk download.
API Server — Start/stop the gateway server, configure the port, and view endpoint documentation.
**Bark** — History prompt chaining is managed by the gateway (not the worker). History resets every 5 chunks to prevent voice drift. 260+ speaker presets via `.npy` embeddings.
**Dia** — Uses the updated Dia-1.6B-0626 model, which generates proper EOS tokens. The first 2–3 inferences after loading may produce near-silent audio (warmup). The gateway's RMS check rejects degenerate audio (<-40 dB) and retries automatically. Recommended params: temperature=1.8, top_p=0.90, cfg_scale=3.0.
**Fish Speech** — Uses the V1.5 codec (FireflyArchitecture + FiniteScalarQuantize), not the S1-mini DAC codec. The LLAMA model generates 8-codebook tokens. The gateway's default repetition_penalty (2.0) is auto-clamped to 1.1 for Fish.
**Qwen2.5-Omni** — Very large model (~14 GB). First load takes ~12 minutes; subsequent loads use cached weights.
**Higgs Audio** — Runs on CPU only. Inference is slow (~5 minutes per generation) but produces good quality with automatic prosody.
1. Create a venv config in `install_configs/new_model.py`
2. Register it in `install_configs/__init__.py`
3. Add an inference function `_infer_newmodel()` in `tts_worker.py`
4. Add a model entry to `load_model()` in `tts_worker.py`
5. Add an audio profile in `audio_profiles.py`
6. Add a text limit in `audio_profiles.py`
7. Add an endpoint in `tts_api_server.py`
| Model | Repository | Paper / Card |
|---|---|---|
| XTTS v2 | coqui-ai/TTS | HF: coqui/XTTS-v2 |
| Bark | suno-ai/bark | HF: suno/bark |
| Fish Speech | fishaudio/fish-speech | HF: fishaudio/fish-speech-1.5 |
| Kokoro | hexgrad/kokoro | HF: hexgrad/Kokoro-82M |
| Dia | nari-labs/dia | HF: nari-labs/Dia-1.6B |
| Chatterbox | resemble-ai/chatterbox | HF: ResembleAI/chatterbox |
| F5-TTS | SWivid/F5-TTS | HF: SWivid/F5-TTS |
| Qwen2.5-Omni | QwenLM/Qwen2.5-Omni | HF: Qwen/Qwen2.5-Omni-7B |
| VibeVoice | microsoft/VibeVoice | HF: microsoft/VibeVoice-1.5B |
| Higgs Audio | boson-ai/higgs-audio | HF: bosonai/higgs-audio-v2-generation-3B-base |
| Library | Purpose |
|---|---|
| FastAPI | Gateway and worker web framework |
| OpenAI Whisper | Transcription verification |
| HuggingFace Hub | Model downloads |
| FFmpeg | Audio format conversion |
| espeak-ng | Phoneme synthesis (Kokoro) |
| Library | Stage |
|---|---|
| noisereduce | De-reverb (spectral gating) |
| pyloudnorm | LUFS loudness normalization |
| pyrubberband | Tempo adjustment |
| pydub | Silence detection and trimming |
| scipy | Highpass filter, de-essing |
| soundfile | WAV I/O |
| numpy | Audio array processing |
"Worker failed to start" / model won't load
- Ensure the model's virtual environment is installed (GUI > Environments tab)
- Check the worker log in `output/logs/` for the specific error
- Verify you have enough VRAM (see VRAM table above)
Kokoro fails with phonemizer/espeak error
- espeak-ng should be auto-installed by `install.bat`. If missing, re-run `install.bat` or manually download from espeak-ng releases
CUDA out of memory
- Kill workers for other models first: `curl -X DELETE http://127.0.0.1:8100/api/workers/{id}`
- Use a smaller model (Kokoro ~1 GB vs Qwen ~14 GB)
- Or pin models to different GPUs: `{"device": "cuda:1"}`
Dia produces silence on first requests
- Normal warmup behavior. The gateway auto-retries up to 3 times. Audio quality improves after 2-3 inferences.
Qwen takes forever to load
- First load is ~12 minutes (downloads/caches 14 GB of weights). Subsequent loads use cached weights and are faster.
Tempo/speed adjustment doesn't work
- Rubberband must be installed. Re-run `install.bat` or check that the `rubberband/` directory exists with `rubberband.exe`
Port already in use
- Change the gateway port: `launcher.bat api --port 8200`
- Or kill whatever is using port 8100: `netstat -ano | findstr :8100`
Audio sounds robotic / low quality
- Try increasing `temperature` (more expressive) or adjusting model-specific params
- Disable post-processing to test raw output: `"skip_post_process": true`
- Ensure the model environment has CUDA PyTorch (not CPU-only), except Higgs which is CPU-only
- Fix venv creation failing with embedded Python (`python -m venv` → `python -m virtualenv`)
- Fix operator precedence bug in `run_git_clone()` editable install check
- Add `virtualenv` to requirements.txt for bootstrap
- Fix `install.bat` PowerShell multi-line commands failing on fresh installs
- Gateway + Worker architecture with subprocess isolation
- 10 TTS models integrated and verified
- Multi-GPU support with round-robin load balancing
- Dynamic worker management (spawn, kill, scale, failover)
- Auto-spawn workers on first request
- Health checks with dead worker cleanup
- Worker management API endpoints
- Full TTS pipeline: chunking, post-processing, assembly
- 7-stage audio processing pipeline
- Whisper transcription verification
- Job management with cancellation and recovery
- Multi-chunk support with per-model text limits
- Initial portable release
- Embedded Python 3.10, portable Git, FFmpeg
- 8 virtual environment configurations
- Tkinter GUI with environment, model, and server tabs
- HuggingFace model downloads
This project is a gateway/orchestrator. Each integrated TTS model has its own license:
| Model | License |
|---|---|
| XTTS v2, Bark | MPL 2.0 (Coqui) |
| Chatterbox | MIT |
| F5-TTS | MIT |
| Fish Speech | Apache 2.0 |
| Kokoro | Apache 2.0 |
| Dia | Apache 2.0 |
| Qwen2.5-Omni | Apache 2.0 |
| VibeVoice | MIT |
| Higgs Audio | Apache 2.0 |
The gateway/orchestrator code is licensed under the MIT License.
Built with Claude Opus 4.6 as a pair-programming partner.
```jsonc
{
  "text": "Hello, world!",              // Required
  "voice": "af_heart",                  // Model-specific voice ID
  "reference_audio": "<base64 wav>",    // For voice cloning models
  "reference_text": "transcript here",  // Required for F5-TTS and Higgs
  "mode": "cloned",                     // "cloned" (default) or "builtin" for XTTS
  "language": "en",
  "speed": 1.0,
  "temperature": 0.65,
  "repetition_penalty": 2.0,
  "output_format": "wav",               // wav, mp3, ogg, flac, m4a
  "de_reverb": 0.7,                     // 0.0-1.0
  "de_ess": 0.0,                        // 0.0-1.0
  "verify_whisper": false,              // Transcription verification
  "whisper_model": "base",              // tiny/base/small/medium/large
  "tolerance": 80,                      // Whisper similarity threshold (0-100)
  "save_path": "my_project/scene1",     // Optional output directory
  "skip_post_process": false,
  "auto_retry": 3,
  "device": "cuda:0"                    // Target GPU
}
```