```mermaid
flowchart TD
    A["User Speech"] --> B["ESP32 Microphone - INMP441"]
    B --> |"16 kHz PCM Audio"| C["WiFi WebSocket Stream"]
    C --> D["Python AI Server"]
    D --> E1["STT - Whisper Model (base)"]
    E1 --> E2["LLM - Gemini 2.5 Flash"]
    E2 --> E3["TTS - Piper Neural Voice"]
    E3 --> |"8-bit PCM Audio Stream"| F["ESP32 DAC Output"]
    F --> G["Speaker / LM386 Amplifier"]
    subgraph ESP32_Device
        B
        C
        F
        G
    end
    subgraph AI_Backend_Python
        D
        E1
        E2
        E3
    end
```
| Component | Function | Connection |
|---|---|---|
| ESP32-WROOM-32 38Pin Development Board | Main microcontroller handling Wi-Fi, I2S mic input and DAC output | |
| INMP441 MEMS High Precision Microphone | Captures real-time voice input | SCK=14, WS=15, SD=32 |
| LM386 Audio Amplifier Module | Drives small speaker output | Input from DAC1 (GPIO25) |
| 2inch 8Ohm 12W Midrange Speaker | Plays synthesized voice from AI | Connected to LM386 |
| 12x12x12mm Tactile Push Button | Triggers recording | GPIO 26 |
| TP4056 Adjustable 1A Li-ion lithium Battery Charging Module | Charges battery | VIN-(OUT+) / GND-(OUT-) |
| 1000mAh Rechargeable 3.7v Lithium Polymer Battery | Battery | +(B+) / -(B-) |
| LED (Built In) | Status indication | GPIO 2 (BUILT IN) |
Refer to this sheet for a detailed BOM: ESP32 Voice Assistant: BOM
The hardware supports half-duplex streaming: it records speech, sends it to the backend, waits for a response, and then plays it back via the DAC.
1. Button Press: the ESP32 starts recording via the INMP441 microphone.
2. I2S Audio Capture: 16 kHz samples are streamed in real time over WebSocket.
3. AI Processing (Server):
   - Whisper converts audio → text
   - Gemini 2.5 Flash generates a contextual reply
   - Piper converts text → natural speech
4. Response Playback: the server streams 8-bit PCM chunks back to the ESP32 for DAC output.
5. User Hears AI Voice: the LM386 amplifier drives the speaker.
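The server-side part of this round trip can be sketched as below. This is a minimal illustration, not the actual server code: `transcribe`, `generate_reply`, and `synthesize` are hypothetical stand-ins for the real faster-whisper, Gemini, and Piper calls, and the chunk size is an assumed value.

```python
# Sketch of the server-side round trip. transcribe / generate_reply /
# synthesize are hypothetical placeholders for the real STT/LLM/TTS stages.

CHUNK_SIZE = 1024  # bytes of 8-bit PCM per WebSocket frame (assumed)

def transcribe(pcm16: bytes) -> str:
    return "hello"                # placeholder for Whisper STT

def generate_reply(text: str) -> str:
    return f"You said: {text}"    # placeholder for Gemini

def synthesize(text: str) -> bytes:
    return b"\x80" * 4000         # placeholder for Piper TTS (8-bit PCM)

def handle_utterance(pcm16: bytes):
    """Run STT -> LLM -> TTS, then yield 8-bit PCM chunks for the ESP32."""
    reply_audio = synthesize(generate_reply(transcribe(pcm16)))
    for i in range(0, len(reply_audio), CHUNK_SIZE):
        yield reply_audio[i:i + CHUNK_SIZE]
```

Streaming the reply in fixed-size chunks keeps the ESP32's receive buffer small, which matters on a microcontroller with limited RAM.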
The AI backend acts as the brain of the device and can be hosted locally or on a VPS. It performs all the heavy lifting (speech recognition, language understanding, and voice generation) so the ESP32 stays lightweight and efficient.
| Layer | Description | Library |
|---|---|---|
| STT (Speech-to-Text) | Converts voice input to text | faster-whisper |
| LLM (Language Model) | Generates response based on transcribed query | google-genai |
| TTS (Text-to-Speech) | Synthesizes natural human-like voice | piper-tts |
| Transport Layer | Binary audio streaming and event control | websockets, asyncio |
- Audio Reception: the ESP32 streams 16 kHz PCM data, which the server buffers in memory.
- Transcription: Whisper transcribes the audio to text using the lightweight `tiny` model for fast CPU inference.
- Response Generation: the Gemini Flash LLM interprets intent and composes a short, empathetic response.
- Speech Synthesis: Piper TTS generates smooth, low-latency speech from the text.
- Audio Encoding: the output audio is converted to 8-bit unsigned PCM and streamed back to the ESP32.
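The audio-encoding step can be sketched as follows. This is a minimal pure-Python version for illustration; the actual server may use NumPy or another library, and the function name here is hypothetical.

```python
import struct

def pcm16_to_pcm8_unsigned(pcm16: bytes) -> bytes:
    """Convert little-endian signed 16-bit PCM to 8-bit unsigned PCM.

    Each sample keeps its top 8 bits and is shifted from the signed
    range [-32768, 32767] to the unsigned range [0, 255], which is
    what the ESP32's 8-bit DAC expects.
    """
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    return bytes(((s >> 8) + 128) & 0xFF for s in samples)
```

For example, the extreme samples -32768, 0, and 32767 map to 0, 128, and 255 respectively.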
```shell
cd esp-server
pip install uv
uv sync
uv run main.py
```

or via Docker:

```shell
docker build -t esp32-ws-server .
docker run -p 7860:7860 esp32-ws-server
```

The server exposes a WebSocket at `ws://0.0.0.0:7860` for the ESP32 to connect to.
- Python ≥ 3.12
- Google Gemini API key (`GEMINI_API_KEY`)
- Pre-downloaded Piper voice model (`en_US-libritts_r-medium.onnx`)

The model can be downloaded with the following command:

```shell
python -m piper.download_voices en_US-libritts_r-medium --data-dir tts_models
```
- 🎧 End-to-end Voice Interaction, Speak → Understand → Reply → Speak
- ⚡ Real-Time Processing, WebSocket-based binary streaming under 100 ms latency
- 🔊 High-Quality TTS, Piper’s neural synthesis produces clean, expressive voice output
- 🧩 Lightweight AI Core, Whisper Tiny (CPU) enables fast offline operation
- 🧠 LLM Personality Control, Adjustable system prompt to alter the AI’s character
- 🔄 Auto-Restart Watchdog, `watcher.py` ensures uptime even after failure
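One way such a watchdog can work is sketched below. This is a hedged illustration, not the actual contents of `watcher.py`: the function name and parameters are assumptions.

```python
import subprocess
import sys
import time

def run_forever(cmd, max_restarts=None, delay=1.0):
    """Relaunch cmd whenever it exits: a minimal auto-restart watchdog.

    max_restarts bounds the loop (useful for testing); with the default
    of None the watchdog restarts the process indefinitely. Returns the
    number of launches performed.
    """
    restarts = 0
    while max_restarts is None or restarts < max_restarts:
        proc = subprocess.run(cmd)
        restarts += 1
        if proc.returncode != 0:
            time.sleep(delay)  # back off briefly after a crash
    return restarts
```

Invoked as, for example, `run_forever([sys.executable, "main.py"])`, so the server comes back up even after an unhandled exception kills the process.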
This project is released under the MIT License. See LICENSE for details.
- Weslei Prudencio for Project Inspiration
- Faster Whisper by Systran
- Piper TTS by Rhasspy
- Gemini API by Google
- ESP-IDF & Arduino Core
Built with ❤️ by Arpit Sengar




