“Empathy through embedded intelligence.” AashaAI is a hardware-integrated AI companion designed for real-time, voice-based interaction through a custom ESP32 device. It captures your speech and sends it to an intelligent backend that listens, understands, and responds naturally.
AashaAI combines embedded systems and AI inference to create an end-to-end conversational experience. The ESP32 handles real-time audio capture and playback, while the AI backend (Python server) performs:
- Speech-to-Text (STT) using Whisper
- Language Understanding via Google Gemini
- Text-to-Speech (TTS) using Piper
A simple web interface exists only for text-based interaction, allowing users to chat online with the same AI model, but the hardware device remains the core focus.
```mermaid
flowchart TD
    A[User Speech] --> B[ESP32 Microphone - INMP441]
    B --> |16 kHz PCM Audio| C[WiFi WebSocket Stream]
    C --> D[Python AI Server]
    D --> E1[STT: Whisper Model]
    E1 --> E2[LLM: Gemini 2.5 Flash]
    E2 --> E3[TTS: Piper Neural Voice]
    E3 --> |8-bit PCM Audio Stream| F[ESP32 DAC Output]
    F --> G[Speaker / LM386 Amplifier]
    subgraph ESP32_Device
        B
        C
        F
        G
    end
    subgraph AI_Backend_Python
        D
        E1
        E2
        E3
    end
    subgraph Web_Interface
        H[Chat UI - Next.js]
    end
    H -.-> |HTTP / REST| D
```
| Component | Function | Connection |
|---|---|---|
| ESP32-WROOM | Main microcontroller handling Wi-Fi, I2S mic input and DAC output | Central hub for the peripherals below |
| INMP441 Microphone | Captures real-time voice input | SCK=14, WS=15, SD=32 |
| LM386 Amplifier | Drives small speaker output | Input from DAC1 (GPIO25) |
| Speaker | Plays synthesized voice from AI | Connected to LM386 |
| Push Button | Triggers recording | GPIO 26 |
| LED | Status indication | GPIO 2 (built-in) |
The hardware operates in half-duplex: it records speech, streams it to the backend, waits for a response, and then plays it back via the DAC.
- Button Press: the ESP32 starts recording via the INMP441 microphone.
- I2S Audio Capture: 16 kHz samples are streamed in real time over WebSocket.
- AI Processing (Server):
  - Whisper converts audio → text
  - Gemini 2.5 Flash generates a contextual reply
  - Piper converts text → natural speech
- Response Playback: the server streams 8-bit PCM chunks back to the ESP32 for DAC output.
- User Hears AI Voice: the LM386 amplifier drives the speaker.
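The server-side half of this loop is, at its core, a composition of the three AI stages. The sketch below is illustrative only (the function and parameter names are assumptions, not the project's actual API); it shows one half-duplex turn with the stages passed in as callables:

```python
from typing import Callable

def process_utterance(
    pcm_audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """One half-duplex turn: audio in -> transcript -> reply -> audio out.

    In the real pipeline, stt would wrap Whisper, llm would wrap
    Gemini 2.5 Flash, and tts would wrap Piper (names hypothetical).
    """
    transcript = stt(pcm_audio)   # speech-to-text
    reply = llm(transcript)       # contextual reply generation
    return tts(reply)             # text-to-speech, returning PCM bytes

# Example wiring with stub stages (no models needed):
out = process_utterance(
    b"\x00\x01",
    stt=lambda audio: "hello",
    llm=lambda text: f"You said: {text}",
    tts=lambda text: text.encode(),
)
```

Keeping the stages as plain callables like this makes each one swappable (e.g. a different STT model) without touching the transport code.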
The AI backend acts as the brain of the device and can be hosted locally or on a VPS. It performs all of the heavy lifting (speech recognition, language understanding, and voice generation) so the ESP32 stays lightweight and efficient.
| Layer | Description | Library |
|---|---|---|
| STT (Speech-to-Text) | Converts voice input to text | faster-whisper |
| LLM (Language Model) | Generates response based on transcribed query | google-genai |
| TTS (Text-to-Speech) | Synthesizes natural human-like voice | piper-tts |
| Transport Layer | Binary audio streaming and event control | websockets, asyncio |
- Audio Reception: the ESP32 streams 16 kHz PCM data, which the server buffers in memory.
- Transcription: Whisper transcribes the audio into text using the lightweight `tiny` model for fast CPU inference.
- Response Generation: the Gemini Flash LLM interprets intent and composes a short, empathetic response.
- Speech Synthesis: Piper TTS generates smooth, low-latency speech from the text.
- Audio Encoding: the output audio is converted to 8-bit unsigned PCM and streamed back to the ESP32.
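The final encoding step can be sketched in pure Python. Piper emits 16-bit signed PCM, while the ESP32's DAC plays 8-bit unsigned samples; the standard conversion keeps the high byte and shifts the range. The function name is hypothetical, and whether the project additionally scales or dithers is an assumption not covered here:

```python
import struct

def pcm16_to_pcm8_unsigned(samples: bytes) -> bytes:
    """Convert little-endian signed 16-bit PCM to unsigned 8-bit PCM.

    Keeps the high byte of each sample (s >> 8 maps [-32768, 32767]
    to [-128, 127]) and adds 128 to shift into the DAC's [0, 255] range.
    """
    count = len(samples) // 2
    values = struct.unpack(f"<{count}h", samples[: count * 2])
    return bytes(((v >> 8) + 128) & 0xFF for v in values)
```

This halves the payload size as well, which helps keep the return stream within the ESP32's Wi-Fi and RAM budget.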
```bash
cd esp-server
pip install uv
uv sync
uv run main.py
```

Or via Docker:

```bash
docker build -t aashaai-server .
docker run -p 7860:7860 aashaai-server
```

The server exposes a WebSocket endpoint at `ws://0.0.0.0:7860` for the ESP32 to connect to.
- Python ≥ 3.12
- Google Gemini API key (`GEMINI_API_KEY`)
- Pre-downloaded Piper voice model (`en_US-libritts_r-medium.onnx`)

The voice model can be downloaded with the following command:

```bash
python -m piper.download_voices en_US-libritts_r-medium --data-dir tts_models
```
The website is just an interface built with Next.js 14, Tailwind CSS and Gemini 2.5 Flash. It allows users without hardware to chat with the same AI model online. It does not handle any audio: all real-time speech functionality belongs to the hardware pipeline.
- 🎧 End-to-End Voice Interaction: Speak → Understand → Reply → Speak
- ⚡ Real-Time Processing: WebSocket-based binary streaming with under 100 ms latency
- 🔊 High-Quality TTS: Piper’s neural synthesis produces clean, expressive voice output
- 🧩 Lightweight AI Core: Whisper `tiny` enables fast offline operation on CPU
- 🧠 LLM Personality Control: an adjustable system prompt alters the AI’s character
- 🔄 Auto-Restart Watchdog: `watcher.py` ensures uptime even after a failure
This project is released under the MIT License. See LICENSE for details.
- Weslei Prudencio for Project Inspiration
- Whisper by OpenAI
- Piper TTS by Rhasspy
- Gemini API by Google
- ESP-IDF & Arduino Core
Built with ❤️ by Arpit Sengar
