ESP32 based Voice Assistant



Overview
This project combines embedded systems and AI inference to create an end-to-end conversational assistant. The ESP32 handles real-time audio recording and playback, while a Python backend performs:
  • Speech-to-Text (STT) via Faster-Whisper
  • Language Understanding via Google Gemini
  • Text-to-Speech (TTS) using Piper TTS
(YouTube demo video)

System Architecture

flowchart TD
    A["User Speech"] --> B["ESP32 Microphone - INMP441"]
    B --> |"16 kHz PCM Audio"| C["WiFi WebSocket Stream"]
    C --> D["Python AI Server"]

    D --> E1["STT - Whisper Model (base)"]
    E1 --> E2["LLM - Gemini 2.5 Flash"]
    E2 --> E3["TTS - Piper Neural Voice"]

    E3 --> |"8-bit PCM Audio Stream"| F["ESP32 DAC Output"]
    F --> G["Speaker / LM386 Amplifier"]

    subgraph ESP32_Device
        B
        C
        F
        G
    end

    subgraph AI_Backend_Python
        D
        E1
        E2
        E3
    end

Hardware Architecture

| Component | Function | Connection |
| --- | --- | --- |
| ESP32-WROOM-32 38-pin Development Board | Main microcontroller handling Wi-Fi, I2S mic input and DAC output | |
| INMP441 MEMS High-Precision Microphone | Captures real-time voice input | SCK=14, WS=15, SD=32 |
| LM386 Audio Amplifier Module | Drives the small speaker output | Input from DAC1 (GPIO25) |
| 2-inch 8 Ohm 12 W Midrange Speaker | Plays synthesized voice from the AI | Connected to LM386 |
| 12x12x12 mm Tactile Push Button | Triggers recording | GPIO 26 |
| TP4056 Adjustable 1 A Li-ion Battery Charging Module | Charges the battery | VIN-(OUT+) / GND-(OUT-) |
| 1000 mAh Rechargeable 3.7 V Lithium Polymer Battery | Powers the system | +(B+) / -(B-) |
| LED (built-in) | Status indication | GPIO 2 |

Refer to this sheet for a detailed BOM: ESP32 Voice Assistant: BOM

The hardware supports half-duplex streaming: it records speech, sends it to the backend, waits for the response, and then plays it back via the DAC.
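That half-duplex turn can be sketched as a small state machine. The state and event names below are illustrative only, not taken from the actual firmware:

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()       # waiting for the user
    RECORDING = auto()  # button held, mic streaming to the server
    WAITING = auto()    # utterance sent, awaiting the AI reply
    PLAYING = auto()    # reply audio going out through the DAC

# Allowed transitions; any other (state, event) pair is ignored.
TRANSITIONS = {
    (State.IDLE, "button_press"): State.RECORDING,
    (State.RECORDING, "button_release"): State.WAITING,
    (State.WAITING, "first_reply_chunk"): State.PLAYING,
    (State.PLAYING, "reply_finished"): State.IDLE,
}

def step(state: State, event: str) -> State:
    """Advance the session; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Because the device is half-duplex, a button press during PLAYING is simply ignored until the cycle returns to IDLE.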

Schematic

(Schematic diagram)

Hardware Workflow

  1. Button Press: the ESP32 starts recording via the INMP441 microphone.

  2. I2S Audio Capture: 16 kHz samples are streamed in real time over WebSocket.

  3. AI Processing (Server):

    • Whisper converts audio → text
    • Gemini 2.5 Flash generates a contextual reply
    • Piper converts text → natural speech
  4. Response Playback: the server streams 8-bit PCM chunks back to the ESP32 for DAC output.

  5. User Hears the AI Voice: the LM386 amplifier drives the speaker.

AI Backend (Python Server)

The AI backend acts as the brain of the device and can be hosted locally or on a VPS. It performs all the heavy lifting (speech recognition, language understanding, and voice generation) so the ESP32 stays lightweight.

Backend Modules

| Layer | Description | Library |
| --- | --- | --- |
| STT (Speech-to-Text) | Converts voice input to text | faster-whisper |
| LLM (Language Model) | Generates a response from the transcribed query | google-genai |
| TTS (Text-to-Speech) | Synthesizes natural, human-like voice | piper-tts |
| Transport Layer | Binary audio streaming and event control | websockets, asyncio |

Processing Pipeline

  1. Audio Reception: the ESP32 streams 16 kHz PCM data, which the server buffers in memory.
  2. Transcription: Whisper converts the audio to text using a lightweight tiny model for fast CPU inference.
  3. Response Generation: the Gemini Flash LLM interprets the intent and composes a short, empathetic response.
  4. Speech Synthesis: Piper TTS generates smooth, low-latency speech from the text.
  5. Audio Encoding: the output audio is converted to 8-bit unsigned PCM and streamed back to the ESP32.
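The encoding in the last step is a simple range shift: each signed 16-bit sample becomes one unsigned byte for the ESP32's 8-bit DAC. A minimal sketch (the function name is ours, not the project's):

```python
import array

def pcm16_to_pcm8(samples: bytes) -> bytes:
    """Convert native-endian signed 16-bit PCM to unsigned 8-bit PCM,
    the format streamed back to the ESP32 for DAC output."""
    s16 = array.array("h")
    s16.frombytes(samples)
    # (x + 32768) >> 8 maps -32768..32767 onto 0..255, centering silence at 128
    return bytes((x + 32768) >> 8 for x in s16)
```

This halves the stream's byte rate at the cost of dynamic range, which is acceptable for a small amplified speaker.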

Server Setup

cd esp-server
pip install uv
uv sync
uv run main.py

or via Docker:

docker build -t esp32-ws-server .
docker run -p 7860:7860 esp32-ws-server

The server exposes a WebSocket endpoint at ws://0.0.0.0:7860 that the ESP32 connects to.
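A minimal shape for such a handler is sketched below, written against any object with the websockets connection interface (`async for` iteration plus `send`). The "END" control marker and the `process` placeholder are assumptions for illustration, not the project's actual protocol:

```python
async def handle(ws) -> None:
    """Buffer binary PCM frames; on a text control frame, hand the
    utterance to the (omitted) STT -> LLM -> TTS pipeline and stream
    the synthesized 8-bit PCM back in small chunks."""
    buffer = bytearray()
    async for message in ws:
        if isinstance(message, (bytes, bytearray)):
            buffer.extend(message)        # incoming 16 kHz PCM audio
        elif message == "END":            # hypothetical end-of-utterance marker
            reply = process(bytes(buffer))
            buffer.clear()
            for i in range(0, len(reply), 1024):
                await ws.send(reply[i:i + 1024])

def process(pcm: bytes) -> bytes:
    # Placeholder: a real server would run STT, the LLM, then TTS here.
    return b"\x80" * 2048  # 8-bit PCM silence
```

With the websockets package this would be served via something like `websockets.serve(handle, "0.0.0.0", 7860)`; note that older versions of the library pass the handler a second `path` argument.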

Environment Requirements

  • Python ≥ 3.12
  • Google Gemini API key (GEMINI_API_KEY)
  • Pre-downloaded Piper voice model (en_US-libritts_r-medium.onnx)

The voice model can be downloaded with the following command: python -m piper.download_voices en_US-libritts_r-medium --data-dir tts_models

System Highlights

  • 🎧 End-to-End Voice Interaction: Speak → Understand → Reply → Speak
  • Real-Time Processing: WebSocket-based binary streaming with under 100 ms of latency
  • 🔊 High-Quality TTS: Piper's neural synthesis produces clean, expressive voice output
  • 🧩 Lightweight AI Core: Whisper tiny runs locally on CPU for fast transcription
  • 🧠 LLM Personality Control: an adjustable system prompt alters the AI's character
  • 🔄 Auto-Restart Watchdog: watcher.py ensures uptime even after failures
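The watcher's internals aren't shown in this README; a minimal restart loop of the kind it describes might look like this (the function name and retry cap are our assumptions):

```python
import subprocess
import sys
import time

def run_with_restart(cmd, max_restarts=3, delay=1.0):
    """Re-launch cmd whenever it exits non-zero, up to max_restarts
    times. Returns the number of restarts actually performed."""
    restarts = 0
    while True:
        result = subprocess.run(cmd)
        if result.returncode == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1
        time.sleep(delay)  # brief back-off before relaunching
```

A real watchdog would typically log each crash and use exponential back-off, but the core loop is just this.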

Final Outcome

(Photos of the assembled device)

License

This project is released under the MIT License. See LICENSE for details.

Acknowledgements


Built with ❤️ by Arpit Sengar
