```mermaid
flowchart TD
    A["User Speech"] --> B["ESP32 Microphone - INMP441"]
    B --> |"16 kHz PCM Audio"| C["WiFi WebSocket Stream"]
    C --> D["Python AI Server"]
    D --> E1["STT - Whisper Model (base)"]
    E1 --> E2["LLM - Gemini 2.5 Flash"]
    E2 --> E3["TTS - Piper Neural Voice"]
    E3 --> |"8-bit PCM Audio Stream"| F["ESP32 DAC Output"]
    F --> G["Speaker / LM386 Amplifier"]
    subgraph ESP32_Device
        B
        C
        F
        G
    end
    subgraph AI_Backend_Python
        D
        E1
        E2
        E3
    end
```
| Component | Function | Connection |
|---|---|---|
| ESP32-WROOM-32 38Pin Development Board | Main microcontroller handling Wi-Fi, I2S mic input and DAC output | |
| INMP441 MEMS High Precision Microphone | Captures real-time voice input | SCK=14, WS=15, SD=32 |
| LM386 Audio Amplifier Module | Drives small speaker output | Input from DAC1 (GPIO25) |
| 2inch 8Ohm 12W Midrange Speaker | Plays synthesized voice from AI | Connected to LM386 |
| 12x12x12mm Tactile Push Button | Triggers recording | GPIO 26 |
| TP4056 Adjustable 1A Li-ion lithium Battery Charging Module | Charges battery | VIN-(OUT+) / GND-(OUT-) |
| 1000mAh Rechargeable 3.7v Lithium Polymer Battery | Battery | +(B+) / -(B-) |
| LED (Built In) | Status indication | GPIO 2 (BUILT IN) |
Refer to this sheet for a detailed BOM: ESP32 Voice Assistant: BOM
The hardware supports half-duplex streaming: it records speech, sends it to the backend, waits for a response, and then plays it back via the DAC.
1. Button Press: the ESP32 starts recording via the INMP441 microphone.
2. I2S Audio Capture: 16 kHz samples are streamed in real time over WebSocket.
3. AI Processing (Server):
   - Whisper converts audio → text
   - Gemini 2.5 Flash generates a contextual reply
   - Piper converts text → natural speech
4. Response Playback: the server streams 8-bit PCM chunks back to the ESP32 for DAC output.
5. User Hears AI Voice: the LM386 amplifier drives the speaker.
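The server-side part of this round trip can be sketched as below. This is a minimal illustration, not the actual server code: `transcribe`, `generate_reply`, and `synthesize` are hypothetical stand-ins for the real faster-whisper, Gemini, and Piper calls, and the chunk size is an assumed value.

```python
# Sketch of the server-side round trip. transcribe / generate_reply /
# synthesize are hypothetical placeholders for the real STT/LLM/TTS stages.

CHUNK_SIZE = 1024  # bytes of 8-bit PCM per WebSocket frame (assumed)

def transcribe(pcm16: bytes) -> str:
    return "hello"                # placeholder for Whisper STT

def generate_reply(text: str) -> str:
    return f"You said: {text}"    # placeholder for Gemini

def synthesize(text: str) -> bytes:
    return b"\x80" * 4000         # placeholder for Piper TTS (8-bit PCM)

def handle_utterance(pcm16: bytes):
    """Run STT -> LLM -> TTS, then yield 8-bit PCM chunks for the ESP32."""
    reply_audio = synthesize(generate_reply(transcribe(pcm16)))
    for i in range(0, len(reply_audio), CHUNK_SIZE):
        yield reply_audio[i:i + CHUNK_SIZE]
```

Streaming the reply in fixed-size chunks keeps the ESP32's receive buffer small, which matters on a microcontroller with limited RAM.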
The AI backend acts as the brain of the device and can be hosted locally or on a VPS. It performs all the heavy lifting (speech recognition, language understanding, and voice generation) so the ESP32 stays lightweight and efficient.
| Layer | Description | Library |
|---|---|---|
| STT (Speech-to-Text) | Converts voice input to text | faster-whisper |
| LLM (Language Model) | Generates response based on transcribed query | google-genai |
| TTS (Text-to-Speech) | Synthesizes natural human-like voice | piper-tts |
| Transport Layer | Binary audio streaming and event control | websockets, asyncio |
- Audio Reception: the ESP32 streams 16 kHz PCM data, which the server buffers in memory.
- Transcription: Whisper transcribes the audio to text using the lightweight `tiny` model for fast CPU inference.
- Response Generation: the Gemini Flash LLM interprets intent and composes a short, empathetic response.
- Speech Synthesis: Piper TTS generates smooth, low-latency speech from the text.
- Audio Encoding: the output audio is converted to 8-bit unsigned PCM and streamed back to the ESP32.
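The audio-encoding step can be sketched as follows. This is a minimal pure-Python version for illustration; the actual server may use NumPy or another library, and the function name here is hypothetical.

```python
import struct

def pcm16_to_pcm8_unsigned(pcm16: bytes) -> bytes:
    """Convert little-endian signed 16-bit PCM to 8-bit unsigned PCM.

    Each sample keeps its top 8 bits and is shifted from the signed
    range [-32768, 32767] to the unsigned range [0, 255], which is
    what the ESP32's 8-bit DAC expects.
    """
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    return bytes(((s >> 8) + 128) & 0xFF for s in samples)
```

For example, the extreme samples -32768, 0, and 32767 map to 0, 128, and 255 respectively.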
```shell
cd esp-server
pip install uv
uv sync
uv run main.py
```

or via Docker:

```shell
docker build -t esp32-ws-server .
docker run -p 7860:7860 esp32-ws-server
```

The server exposes a WebSocket at `ws://0.0.0.0:7860` for the ESP32 to connect to.
- Python ≥ 3.12
- Google Gemini API key (`GEMINI_API_KEY`)
- Pre-downloaded Piper voice model (`en_US-libritts_r-medium.onnx`)

The model can be downloaded with the following command:

```shell
python -m piper.download_voices en_US-libritts_r-medium --data-dir tts_models
```
- 🎧 End-to-end Voice Interaction, Speak → Understand → Reply → Speak
- ⚡ Real-Time Processing, WebSocket-based binary streaming under 100 ms latency
- 🔊 High-Quality TTS, Piper’s neural synthesis produces clean, expressive voice output
- 🧩 Lightweight AI Core, Whisper Tiny (CPU) enables fast offline operation
- 🧠 LLM Personality Control, Adjustable system prompt to alter the AI’s character
- 🔄 Auto-Restart Watchdog, `watcher.py` ensures uptime even after failure
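One way such a watchdog can work is sketched below. This is a hedged illustration, not the actual contents of `watcher.py`: the function name and parameters are assumptions.

```python
import subprocess
import sys
import time

def run_forever(cmd, max_restarts=None, delay=1.0):
    """Relaunch cmd whenever it exits: a minimal auto-restart watchdog.

    max_restarts bounds the loop (useful for testing); with the default
    of None the watchdog restarts the process indefinitely. Returns the
    number of launches performed.
    """
    restarts = 0
    while max_restarts is None or restarts < max_restarts:
        proc = subprocess.run(cmd)
        restarts += 1
        if proc.returncode != 0:
            time.sleep(delay)  # back off briefly after a crash
    return restarts
```

Invoked as, for example, `run_forever([sys.executable, "main.py"])`, so the server comes back up even after an unhandled exception kills the process.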
This project is released under the MIT License. See LICENSE for details.
- Weslei Prudencio for Project Inspiration
- Faster Whisper by Systran
- Piper TTS by Rhasspy
- Gemini API by Google
- ESP-IDF & Arduino Core
Built with ❤️ by Arpit Sengar




