**Merged**

**codingjoe** (Owner) commented on Mar 16, 2026:

- Make agent more responsive
- Improve voice response
Pull request overview
This PR aims to make the AI “agent response” loop feel more responsive by changing how RTP audio is buffered/flushed for transcription and how TTS audio is generated/sent back to the caller.
Changes:

- Made RTP payload decoding synchronous (no executor) and adjusted VAD buffering/flush timing (shorter `silence_gap`, new timer handle, new flush filtering).
- Simplified Whisper transcription triggering (removed some buffering/min-duration logic and docstrings).
- Changed agent reply assembly and switched TTS from streamed chunk delivery to single-shot audio generation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| `voip/audio.py` | Alters decode execution model and VAD buffering/flush behavior (timers, thresholds, buffering strategy). |
| `voip/ai.py` | Changes transcription triggering/docs and modifies agent response + TTS generation approach. |
`voip/audio.py` (outdated), comment on lines 195 to 199:

```python
audio = self.decode_payload(packet.payload)
if audio.size > 0:
    self.audio_received(
        audio=audio, rms=float(np.sqrt(np.mean(np.square(audio))))
    )
```
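The inline `np.sqrt(np.mean(np.square(audio)))` expression computes the frame's RMS energy, which drives the speech/silence decision downstream. A minimal standalone sketch of that computation (frame sizes here are illustrative, not taken from the PR):

```python
import numpy as np

def frame_rms(audio: np.ndarray) -> float:
    """Root-mean-square energy of a mono float PCM frame."""
    return float(np.sqrt(np.mean(np.square(audio))))

silence = np.zeros(160, dtype=np.float32)    # e.g. one 20 ms frame at 8 kHz
tone = np.full(160, 0.5, dtype=np.float32)   # constant half-scale signal

print(frame_rms(silence))  # 0.0
print(frame_rms(tone))     # 0.5
```

Note that `frame_rms` on an empty array would warn and return `nan`, which is presumably why the caller guards with `audio.size > 0` first.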
`voip/ai.py` (outdated), comment on lines +165 to +176:

```python
self.messages.append({"role": "user", "content": self.pending_text.getvalue()})
self.pending_text.seek(0)
self.pending_text.truncate(0)
response = await ollama.AsyncClient().chat(
    model=self.ollama_model,
    messages=self.messages,
)
reply = (response.message.content or "").encode("ascii", "ignore").decode()
self.messages.append({"role": "assistant", "content": reply})

logger.debug("Agent reply: %r", reply)
await self.send_speech(reply)
```
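One thing worth noting about the reply assembly above: the `encode("ascii", "ignore").decode()` round-trip silently drops every non-ASCII character from the model output (accented letters, smart quotes, emoji), which can mangle words rather than transliterate them. A quick illustration of the standard-library behavior:

```python
# The "ignore" error handler deletes unencodable characters outright.
reply = "Très bien 👍"
ascii_reply = reply.encode("ascii", "ignore").decode()
print(ascii_reply)  # 'Trs bien ' (the è and the emoji are gone)
```

If the intent is only to keep the TTS input clean, a transliterating approach (e.g. `unicodedata.normalize("NFKD", ...)` before encoding) would preserve more of the text, though that is a suggestion, not what the PR does.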
Comment on lines 178 to +196:

```diff
@@ -228,23 +185,12 @@ async def send_speech(self, text: str) -> None:
         Args:
             text: Text to synthesise and transmit.
         """
-        loop = asyncio.get_running_loop()
-        queue: asyncio.Queue[np.ndarray | None] = asyncio.Queue()
-
-        def generate() -> None:
-            for chunk in self.tts_instance.generate_audio_stream(
-                self.voice_state,
-                text,  # type: ignore[too-many-positional-arguments]
-            ):
-                asyncio.run_coroutine_threadsafe(
-                    queue.put(chunk.numpy()), loop
-                ).result()
-            asyncio.run_coroutine_threadsafe(queue.put(None), loop).result()
-
-        future = loop.run_in_executor(None, generate)
-        while (tts_chunk := await queue.get()) is not None:
-            resampled = self.resample(
-                tts_chunk, self.tts_instance.sample_rate, self.codec.sample_rate_hz
-            )
-            await self.send_rtp_audio(resampled)
-        await future
+        audio = self.tts_instance.generate_audio(
+            self.voice_state,
+            text,  # type: ignore[too-many-positional-arguments]
+        )
+        await self.send_rtp_audio(
+            self.resample(
+                audio.numpy(), self.tts_instance.sample_rate, self.codec.sample_rate_hz
+            )
+        )
```
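The removed streaming path bridged a blocking TTS generator running in an executor thread into the event loop via `run_coroutine_threadsafe` and a sentinel-terminated queue. A minimal self-contained sketch of that pattern, with a stand-in `fake_tts_stream` replacing the real TTS engine:

```python
import asyncio

def fake_tts_stream():
    """Stand-in for a blocking TTS chunk generator (hypothetical)."""
    for i in range(3):
        yield f"chunk-{i}"

async def stream_speech() -> list[str]:
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue[str | None] = asyncio.Queue()

    def generate() -> None:
        # Runs in a worker thread; hand each chunk back to the event loop.
        for chunk in fake_tts_stream():
            asyncio.run_coroutine_threadsafe(queue.put(chunk), loop).result()
        # None is the end-of-stream sentinel.
        asyncio.run_coroutine_threadsafe(queue.put(None), loop).result()

    future = loop.run_in_executor(None, generate)
    received = []
    while (chunk := await queue.get()) is not None:
        received.append(chunk)  # the real code resampled and sent RTP here
    await future
    return received

print(asyncio.run(stream_speech()))  # ['chunk-0', 'chunk-1', 'chunk-2']
```

The trade-off the PR makes is latency shape: the streamed version starts playing audio as soon as the first chunk is synthesised, while the new single-shot `generate_audio` call waits for the whole utterance before any RTP is sent, which is simpler but delays the first audible sample on long replies.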
`voip/audio.py` (outdated):

```diff
-        self.silence_handle = None
-        if not self.speech_buffer:
-            return
+        self.transcription_handle = None
```

```python
            audio=audio, rms=float(np.sqrt(np.mean(np.square(audio))))
        )

    def decode_payload(self, payload: bytes) -> np.ndarray:
```
Comment on lines 321 to 327:

```diff
     def on_audio_silence(self) -> None:
         """Arm the silence debounce timer when speech is buffered."""
-        if self.silence_handle is None and self.speech_buffer:
-            loop = asyncio.get_running_loop()
-            self.silence_handle = loop.call_later(
+        if self.transcription_handle is None:
+            loop = asyncio.get_event_loop()
+            self.transcription_handle = loop.call_later(
                 self.silence_gap,
                 self.flush_speech_buffer,
             )
```
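The debounce pattern here (arm a one-shot `call_later` timer only while none is pending, so a burst of silence frames yields a single flush) can be exercised in isolation. A sketch assuming a 50 ms gap, with hypothetical names:

```python
import asyncio

class Debouncer:
    def __init__(self, gap: float) -> None:
        self.gap = gap
        self.handle: asyncio.TimerHandle | None = None
        self.fired = 0

    def on_silence(self) -> None:
        # Arm the timer only once; subsequent silence frames are no-ops
        # until the pending timer fires and clears the handle.
        if self.handle is None:
            loop = asyncio.get_running_loop()
            self.handle = loop.call_later(self.gap, self.flush)

    def flush(self) -> None:
        self.handle = None
        self.fired += 1

async def main() -> int:
    d = Debouncer(0.05)
    for _ in range(10):      # ten silence frames in quick succession
        d.on_silence()
    await asyncio.sleep(0.1)  # let the timer fire
    return d.fired

print(asyncio.run(main()))  # 1
```

One reviewable detail in the diff itself: the new code swaps `asyncio.get_running_loop()` for `asyncio.get_event_loop()`, which is deprecated outside a running loop in modern Python; since `on_audio_silence` runs inside the loop, `get_running_loop()` would be the safer choice.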
Comment on lines 309 to 315:

```diff
     def audio_received(self, *, audio: np.ndarray, rms: float) -> None:
-        if self.collect_audio(audio, rms):
-            self.speech_buffer.append(audio)
+        self.speech_buffer.append(audio)
         if rms > self.speech_threshold:
             self.on_audio_speech()
         else:
             self.on_audio_silence()
```
`voip/audio.py` (outdated), comment on lines +330 to +339:

```python
self.transcription_handle = None
# Ensure at least one second of audio to avoid cutting words in half.
audio = np.concatenate(self.speech_buffer)
if (
    sum(len(c) for c in self.speech_buffer)
    < self.RESAMPLING_RATE_HZ * self.silence_gap
    or float(np.sqrt(np.mean(np.square(audio)))) < 0.01
):
    self.speech_buffer.clear()
    return
```
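The flush guard above discards buffers that are either shorter than `silence_gap` seconds of samples or nearly silent (overall RMS below 0.01). A standalone sketch of that gate, assuming a 16 kHz resampling rate and a 0.5 s gap (both values are assumptions for illustration, not read from the PR):

```python
import numpy as np

RATE_HZ = 16_000   # assumed resampling rate
SILENCE_GAP = 0.5  # assumed gap in seconds

def should_transcribe(speech_buffer: list[np.ndarray]) -> bool:
    """Mirror the flush filter: reject short or near-silent buffers."""
    audio = np.concatenate(speech_buffer)
    too_short = sum(len(c) for c in speech_buffer) < RATE_HZ * SILENCE_GAP
    too_quiet = float(np.sqrt(np.mean(np.square(audio)))) < 0.01
    return not (too_short or too_quiet)

short = [np.full(1_000, 0.2, dtype=np.float32)]     # well under 0.5 s
quiet = [np.full(16_000, 0.001, dtype=np.float32)]  # 1 s, RMS 0.001
speech = [np.full(16_000, 0.2, dtype=np.float32)]   # 1 s, RMS 0.2

print(should_transcribe(short), should_transcribe(quiet), should_transcribe(speech))
# False False True
```

Note the comment in the diff promises "at least one second of audio" but the threshold is actually `RESAMPLING_RATE_HZ * silence_gap` samples, i.e. `silence_gap` seconds; the comment and the code only agree when `silence_gap` is 1.0.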
`voip/ai.py` (outdated), comment on lines 75 to 77:

```python
async def speech_buffer_ready(self, audio: np.ndarray) -> None:
    """Transcribe the buffered utterance when it meets the minimum length.

    Skips utterances shorter than one second to avoid passing fragments
    to Whisper that would produce low-quality transcriptions.

    Args:
        audio: Float32 mono PCM array at `RESAMPLING_RATE_HZ` Hz.
    """
    if len(audio) < self.RESAMPLING_RATE_HZ:
        return
    await self.transcribe(audio)
```
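The minimum-length check removed here is simply "at least `RESAMPLING_RATE_HZ` samples", i.e. one second of audio at the resampling rate. A tiny sketch, assuming 16 kHz (an assumption; the PR does not state the constant's value):

```python
import numpy as np

RESAMPLING_RATE_HZ = 16_000  # assumed Whisper input rate

def long_enough(audio: np.ndarray) -> bool:
    # One second of samples is the minimum utterance length.
    return len(audio) >= RESAMPLING_RATE_HZ

print(long_enough(np.zeros(8_000, dtype=np.float32)))   # False (0.5 s)
print(long_enough(np.zeros(16_000, dtype=np.float32)))  # True  (1.0 s)
```

With this guard gone, the equivalent filtering now lives in the VAD flush path in `voip/audio.py`, where the threshold depends on `silence_gap` instead of a fixed one second.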
**Codecov Report**

❌ Patch coverage is

Additional details and impacted files:

```
@@ Coverage Diff @@
##             main      #49      +/-   ##
==========================================
+ Coverage   94.25%   94.58%   +0.32%
==========================================
  Files          24       24
  Lines        1759     1716      -43
==========================================
- Hits         1658     1623      -35
+ Misses        101       93       -8
```

View full report in Codecov by Sentry.