feat(ios): background voice mode with improved TTS handling #17319
zeulewan wants to merge 5 commits into openclaw:main
…app is backgrounded

- Add 'talk.background.enabled' preference toggle in Settings > Features
- Modify TalkModeManager.suspendForBackground() to accept a keepActive parameter that preserves the audio session and recognition when enabled
- Update NodeAppModel.setScenePhase() to check the preference and skip suspension when Background Voice is on
- Update resumeAfterBackground() to skip the restart when Talk was kept active
- Leverage the existing audio background mode (UIBackgroundModes=audio) to keep the process alive

When enabled, the audio session, speech recognition, and gateway chat subscription remain active in the background, allowing continuous voice conversation while browsing other apps.
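A minimal sketch of the background decision described in this commit, assuming simplified stand-in types: `ScenePhase` and `TalkSession` are hypothetical and only track state. The method names mirror `suspendForBackground(keepActive:)`, `resumeAfterBackground()` and `setScenePhase()` from the message, but the bodies are illustrative, not the PR's code.

```swift
// Hypothetical stand-ins; the real app uses SwiftUI's ScenePhase and TalkModeManager.
enum ScenePhase { case active, inactive, background }

final class TalkSession {
    private(set) var isListening = true
    private(set) var keptActiveInBackground = false

    // Mirrors suspendForBackground(keepActive:): with keepActive == true the
    // audio session and recognition are left running.
    func suspendForBackground(keepActive: Bool) {
        keptActiveInBackground = keepActive
        if !keepActive { isListening = false }
    }

    // Mirrors resumeAfterBackground(): skip the restart if Talk was kept active.
    // Returns true only when a restart was actually performed.
    @discardableResult
    func resumeAfterBackground() -> Bool {
        defer { keptActiveInBackground = false }
        if keptActiveInBackground { return false }
        isListening = true
        return true
    }
}

// Mirrors NodeAppModel.setScenePhase(): the preference is consulted on background entry.
func setScenePhase(_ phase: ScenePhase, session: TalkSession, backgroundVoiceEnabled: Bool) {
    switch phase {
    case .background:
        session.suspendForBackground(keepActive: backgroundVoiceEnabled)
    case .active:
        session.resumeAfterBackground()
    case .inactive:
        break
    }
}
```

Keeping the preference lookup in the caller means the session type never needs to know the `talk.background.enabled` key.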
…ive + add thinking chime

Background voice was failing after the first response because stopRecognition() kills the audio engine, causing iOS to suspend the app before it can restart.

Fix: when backgroundKeepAlive is enabled, play a silent audio loop (a WAV generated in memory) that keeps the UIBackgroundModes=audio session alive between speech recognition and TTS cycles.

Also adds a subtle thinking chime (880 Hz A5 note, 150 ms fade-out) that plays when the assistant starts processing a transcript, giving audible feedback during the silence between speaking and the response.

Changes:
- Add backgroundKeepAlive flag and silent AVAudioPlayer loop
- Start keepalive on background entry; stop it on foreground return or Talk stop
- Add playThinkingSound(), called at the start of processTranscript()
- Generate audio programmatically (no asset files needed)
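The commit generates its audio in memory rather than shipping assets. A sketch of the chime half, assuming 44.1 kHz mono 16-bit PCM, a peak amplitude of 0.2 and a linear fade (none of these three values come from the PR; only 880 Hz and 150 ms do). `makeChimeWAV` and `wavHeader` are hypothetical names.

```swift
import Foundation

// Synthesizes an 880 Hz tone that fades to silence over 150 ms, wrapped as WAV.
func makeChimeWAV(frequency: Double = 880,
                  durationMs: Int = 150,
                  sampleRate: Int = 44_100,
                  peak: Double = 0.2) -> Data {
    let frameCount = sampleRate * durationMs / 1000
    var pcm = Data(capacity: frameCount * 2)
    for n in 0..<frameCount {
        let t = Double(n) / Double(sampleRate)
        // Linear fade-out: full level at the start, silence at the end.
        let envelope = 1.0 - Double(n) / Double(frameCount)
        let value = sin(2.0 * Double.pi * frequency * t) * envelope * peak
        let sample = Int16(value * Double(Int16.max))
        withUnsafeBytes(of: sample.littleEndian) { pcm.append(contentsOf: $0) }
    }
    return wavHeader(dataSize: pcm.count, sampleRate: sampleRate) + pcm
}

// Canonical 44-byte RIFF/WAVE header for mono, 16-bit, little-endian PCM.
func wavHeader(dataSize: Int, sampleRate: Int) -> Data {
    var header = Data()
    func put32(_ v: UInt32) { withUnsafeBytes(of: v.littleEndian) { header.append(contentsOf: $0) } }
    func put16(_ v: UInt16) { withUnsafeBytes(of: v.littleEndian) { header.append(contentsOf: $0) } }
    header.append(contentsOf: Array("RIFF".utf8))
    put32(UInt32(36 + dataSize))        // size of everything after this field
    header.append(contentsOf: Array("WAVEfmt ".utf8))
    put32(16)                           // fmt chunk size
    put16(1)                            // format: integer PCM
    put16(1)                            // channels: mono
    put32(UInt32(sampleRate))
    put32(UInt32(sampleRate * 2))       // byte rate = rate * channels * 2 bytes
    put16(2)                            // block align
    put16(16)                           // bits per sample
    header.append(contentsOf: Array("data".utf8))
    put32(UInt32(dataSize))
    return header
}
```

On device, the returned `Data` could be handed to `AVAudioPlayer(data:)`; that step needs AVFoundation and is left out so the sketch runs anywhere. A silent loop is the same construction with all-zero samples.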
…alk Mode prompt

Adds a 'Voice Directive Hint' toggle in Settings > Features that controls whether the ElevenLabs voice-switching instruction is included in the Talk Mode prompt. Disabling it saves tokens when voice switching is not needed.

Changes:
- TalkPromptBuilder.build() now accepts an includeVoiceDirectiveHint parameter
- Add talk.voiceDirectiveHint.enabled AppStorage preference (default: true)
- Add toggle + description in SettingsTab
- TalkModeManager.buildPrompt() reads the preference
- Add unit tests for the new parameter
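A hypothetical shape for the new parameter. Only the name `includeVoiceDirectiveHint` and its default of `true` come from the commit; the prompt wording and the `transcript` argument are invented for illustration.

```swift
// Illustrative only: not the real TalkPromptBuilder.
enum TalkPromptBuilder {
    // Placeholder text for the ElevenLabs voice-switching instruction.
    static let voiceDirectiveHint =
        "You may switch ElevenLabs voices by emitting a voice directive."

    static func build(transcript: String, includeVoiceDirectiveHint: Bool = true) -> String {
        var lines = ["Talk Mode: reply conversationally.", "User said: \(transcript)"]
        // The hint is appended only when the preference is on, saving tokens otherwise.
        if includeVoiceDirectiveHint {
            lines.append(voiceDirectiveHint)
        }
        return lines.joined(separator: "\n")
    }
}
```

A default of `true` keeps existing call sites producing the same prompt as before the toggle existed.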
PROBLEM ANALYSIS:
- Talk Mode died when the app was backgrounded, despite UIBackgroundModes=audio
- Root cause: processTranscript() called stopRecognition(), which stopped AVAudioEngine
- iOS suspended the app when the engine stopped/restarted between speech cycles
- The silent AVAudioPlayer keepalive was not sufficient to maintain background execution

RESEARCH FINDINGS:
- iOS has strict background execution limits (Apple DTS forum post by Quinn "The Eskimo!")
- No general-purpose continuous background execution is allowed
- BUT the audio background mode IS designed for continuous audio apps such as VoIP
- Key insight: keep the audio engine running continuously; never stop it

SOLUTION IMPLEMENTED:
1. **Added VoIP background mode** to project.yml (audio + voip)
2. **Keep AVAudioEngine running continuously** when backgroundKeepAlive=true
3. **New pauseRecognitionOnly()/resumeRecognitionOnly() methods**:
   - pauseRecognitionOnly(): stops speech recognition but keeps the engine running
   - resumeRecognitionOnly(): restarts recognition on the already-running engine
4. **Updated processTranscript() lifecycle**:
   - In background mode: use pauseRecognitionOnly() instead of stopRecognition()
   - After TTS: use resumeRecognitionOnly() instead of a full restart
5. **Improved background keepalive logging** for debugging
6. **Enhanced resumeAfterBackground()** to handle engine state properly

TECHNICAL APPROACH:
- processTranscript() → pauseRecognitionOnly() when backgroundKeepAlive=true
- After speaking → resumeRecognitionOnly() when backgroundKeepAlive=true
- playAssistant() → resumeRecognitionOnly() for interrupt handling
- Never stop AVAudioEngine while in background voice mode
- VoIP + audio background modes for maximum background execution time

This follows the pattern used by VoIP apps such as Discord and WhatsApp, which need continuous background audio processing. The engine stays alive, preventing iOS from suspending the app between speech recognition cycles.
The approach has been tested and is based on Apple's background execution guidelines and on real-world patterns from successful VoIP applications.
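The lifecycle change can be modeled as a small state machine. `EngineModel` and `didFinishSpeaking()` are hypothetical names; the sketch only tracks flags and never touches `AVAudioEngine`, but it makes the intended invariant testable: with `backgroundKeepAlive` on, the engine is never stopped between turns.

```swift
// Flag-only model of the pause/resume lifecycle; no real audio involved.
final class EngineModel {
    private(set) var engineRunning = false
    private(set) var recognitionActive = false
    private(set) var engineStopCount = 0
    var backgroundKeepAlive = false

    func startRecognition() {
        engineRunning = true
        recognitionActive = true
    }

    // Full teardown: recognition and engine both stop.
    func stopRecognition() {
        recognitionActive = false
        if engineRunning {
            engineRunning = false
            engineStopCount += 1
        }
    }

    // Stops recognition but leaves the engine (and its tap) running.
    func pauseRecognitionOnly() { recognitionActive = false }

    // Restarts recognition on the already-running engine.
    func resumeRecognitionOnly() {
        precondition(engineRunning, "engine must still be running")
        recognitionActive = true
    }

    // processTranscript(): pause in background mode, full stop otherwise.
    func processTranscript() {
        if backgroundKeepAlive { pauseRecognitionOnly() } else { stopRecognition() }
    }

    // After TTS finishes: resume in background mode, full restart otherwise.
    func didFinishSpeaking() {
        if backgroundKeepAlive { resumeRecognitionOnly() } else { startRecognition() }
    }
}
```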
Add VoIP background mode, speaker bleed detection with grace period, push message speech queue, multi-turn follow-up polling, audio route change handling, startup chime, and config reload throttling. Rename settings labels for clarity.
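Speaker-bleed detection can be sketched as comparing each mic sample against a running baseline learned during TTS playback, and ignoring everything inside a grace period while that baseline settles. The 0.92/0.08 update matches the `ttsAudioBaseline` line in the diff; `BleedGate`, the 0.15 margin and the 0.5 s grace period are placeholders, not values from the PR.

```swift
import Foundation

// Hypothetical barge-in gate for audio captured while TTS is playing.
struct BleedGate {
    var margin = 0.15
    var grace: TimeInterval = 0.5
    private(set) var baseline = 0.0
    private var speechStart: Date?

    // Call when TTS playback starts.
    mutating func beginSpeech(at date: Date) {
        speechStart = date
        baseline = 0
    }

    // Feed one mic level captured during playback; returns true when it looks
    // like the user talking over the assistant rather than speaker bleed.
    mutating func shouldInterrupt(raw: Double, at date: Date) -> Bool {
        guard let start = speechStart else { return false }
        // Compare against the baseline learned from earlier samples only.
        let exceedsBleed = raw > baseline + margin
        baseline = baseline * 0.92 + raw * 0.08
        // Within the grace period the baseline is still settling: never interrupt.
        return date.timeIntervalSince(start) >= grace && exceedsBleed
    }
}
```

Comparing before updating keeps a sudden loud sample from raising its own bar.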
```swift
let tapDiagnostics = AudioTapDiagnostics(label: "talk") { [weak self] level in
    guard let self else { return }
    Task { @MainActor in
        let raw = max(0, min(Double(level) * 10.0, 1.0))
        let next = (self.micLevel * 0.80) + (raw * 0.20)
        self.micLevel = next

        if self.isListening, !self.isSpeaking, !self.noiseFloorReady {
            self.noiseFloorSamples.append(raw)
            if self.noiseFloorSamples.count >= 22 {
                let sorted = self.noiseFloorSamples.sorted()
                let take = max(6, sorted.count / 2)
                let slice = sorted.prefix(take)
                let avg = slice.reduce(0.0, +) / Double(slice.count)
                self.noiseFloor = avg
                self.noiseFloorReady = true
                self.noiseFloorSamples.removeAll(keepingCapacity: true)
                let threshold = min(0.35, max(0.12, avg + 0.10))
                GatewayDiagnostics.log(
                    "talk audio: noiseFloor=\(String(format: "%.3f", avg)) threshold=\(String(format: "%.3f", threshold))")
            }
        }

        // Track speaker bleed baseline during TTS for interrupt gating.
        if self.isSpeechOutputActive {
            self.ttsAudioBaseline = (self.ttsAudioBaseline * 0.92) + (raw * 0.08)
        }

        let threshold: Double = if let floor = self.noiseFloor, self.noiseFloorReady {
            min(0.35, max(0.12, floor + 0.10))
        } else {
            0.18
        }
        if raw >= threshold {
            self.lastAudioActivity = Date()
        }
    }
}
```
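The level math in this callback is pure arithmetic, so it could be pulled out and unit-tested without AVFoundation. A sketch with the constants copied from the snippet; the type name `LevelTracker` is invented, and the `calibrating` flag stands in for `isListening && !isSpeaking`.

```swift
// Hypothetical extraction of the tap's level and noise-floor math.
struct LevelTracker {
    private(set) var micLevel = 0.0
    private(set) var noiseFloor: Double?
    private var samples: [Double] = []

    // Activity threshold: noise floor + 0.10, clamped to 0.12...0.35;
    // 0.18 until the floor has been measured.
    var threshold: Double {
        guard let floor = noiseFloor else { return 0.18 }
        return min(0.35, max(0.12, floor + 0.10))
    }

    // Feeds one tap level; returns true if the sample counts as activity.
    mutating func ingest(level: Float, calibrating: Bool) -> Bool {
        let raw = max(0, min(Double(level) * 10.0, 1.0))
        micLevel = micLevel * 0.80 + raw * 0.20          // exponential smoothing
        if calibrating, noiseFloor == nil {
            samples.append(raw)
            if samples.count >= 22 {
                // Average the quietest half (at least 6 samples) as the floor.
                let lowest = samples.sorted().prefix(max(6, samples.count / 2))
                noiseFloor = lowest.reduce(0.0, +) / Double(lowest.count)
                samples.removeAll(keepingCapacity: true)
            }
        }
        return raw >= threshold
    }
}
```

Averaging only the quietest half keeps a few loud calibration samples from inflating the floor.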
Review comment
Path: apps/ios/Sources/Voice/TalkModeManager.swift
Line: 765:802
Code duplication: entire audio tap setup logic duplicated in `resumeRecognitionOnly`
This 40-line block (lines 765-802) is nearly identical to the tap setup in `startRecognition` (lines 520-560). Consider extracting to a shared method like `setupAudioTap(request:)` to reduce maintenance burden and prevent drift.
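One possible shape for the suggested extraction, as a sketch: `InputNode` is a hypothetical protocol standing in for `AVAudioInputNode` so the example runs without AVFoundation, and `setupAudioTap()` plays the role of the reviewer's `setupAudioTap(request:)`. `FakeInputNode` traps on a second tap, mirroring AVAudioNode's rule that only one tap may be installed on a bus.

```swift
// Hypothetical abstraction over the input node's tap API.
protocol InputNode: AnyObject {
    func installTap(handler: @escaping ([Float]) -> Void)
    func removeTap()
}

final class TalkRecognizerModel {
    let input: any InputNode
    private(set) var inputTapInstalled = false
    private(set) var appended: [[Float]] = []

    init(input: any InputNode) { self.input = input }

    // The single place that (re)installs the tap; both entry points call it.
    func setupAudioTap() {
        if inputTapInstalled {
            input.removeTap()
            inputTapInstalled = false
        }
        input.installTap { [weak self] buffer in
            self?.appended.append(buffer)
        }
        inputTapInstalled = true
    }

    func startRecognition() { setupAudioTap() }
    func resumeRecognitionOnly() { setupAudioTap() }
}

// Test double: traps if a second tap is installed without removing the first.
final class FakeInputNode: InputNode {
    private(set) var tapCount = 0
    var handler: (([Float]) -> Void)?
    func installTap(handler: @escaping ([Float]) -> Void) {
        precondition(self.handler == nil, "only one tap per bus")
        self.handler = handler
        tapCount += 1
    }
    func removeTap() { handler = nil }
}
```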
```swift
// Remove the old tap (kept alive during pause) before installing the new one
if self.inputTapInstalled {
    input.removeTap(onBus: 0)
    self.inputTapInstalled = false
}
```
Review comment
Path: apps/ios/Sources/Voice/TalkModeManager.swift
Line: 758:762
Verify tap state sync across engine restart paths
In `pauseRecognitionOnly` (lines 703-719), the tap is kept installed. Here in `resumeRecognitionOnly`, you remove it before reinstalling. If the audio engine stops or restarts outside this flow, the tap state could become inconsistent with the `inputTapInstalled` flag.
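One way to address this, sketched under assumptions: make a single hypothetical `TapOwner` the only code that installs or removes the tap, with idempotent methods, so the flag and the node always change together. The two closures stand in for `input.installTap(...)` and `input.removeTap(onBus: 0)`; any engine stop/restart path would call `reinstall()` or `removeIfInstalled()` instead of touching the tap directly.

```swift
// Hypothetical single owner of the input tap and its "installed" flag.
final class TapOwner {
    private(set) var isInstalled = false
    private let install: () -> Void
    private let remove: () -> Void

    init(install: @escaping () -> Void, remove: @escaping () -> Void) {
        self.install = install
        self.remove = remove
    }

    // Idempotent: a second call while installed does nothing.
    func ensureInstalled() {
        guard !isInstalled else { return }
        install()
        isInstalled = true
    }

    // Idempotent: a second call while removed does nothing.
    func removeIfInstalled() {
        guard isInstalled else { return }
        remove()
        isInstalled = false
    }

    // For engine restart or route/configuration change paths.
    func reinstall() {
        removeIfInstalled()
        ensureInstalled()
    }
}
```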
```swift
private func pauseRecognitionOnly() {
    self.recognitionTask?.cancel()
    self.recognitionTask = nil
    self.recognitionRequest?.endAudio()
    self.recognitionRequest = nil
    self.micLevel = 0
    self.lastAudioActivity = nil
    self.noiseFloorSamples.removeAll(keepingCapacity: true)
    self.noiseFloor = nil
    self.noiseFloorReady = false
    // CRITICAL: Do NOT remove the input tap or stop the audioEngine in background mode.
    // Keeping the tap installed means audio continues flowing through the engine,
    // which iOS recognizes as active audio work and won't suspend us.
    // The tap's captured request reference has had endAudio() called, so append() is a no-op.
    self.speechRecognizer = nil
    self.logger.info("paused recognition only, keeping engine + tap running for background mode (tapInstalled=\(self.inputTapInstalled))")
}
```
Review comment
Path: apps/ios/Sources/Voice/TalkModeManager.swift
Line: 703:718
Tap continues processing with orphaned request after `endAudio()`
After calling `recognitionRequest?.endAudio()` on line 706, the tap's captured `request` reference is stale. The comment (line 716) says "append() is a no-op", but this isn't guaranteed: appending to an ended request could cause crashes or undefined behavior. The tap callback in `makeAudioTapAppendCallback` (line 862) calls `request.append()` unconditionally, which may not be safe after `endAudio()`.
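A sketch of one mitigation, assuming a hypothetical `RequestBox`: the tap block does not capture the request directly but looks it up through a lock-protected box that `pauseRecognitionOnly()` clears before calling `endAudio()`. `FakeRequest` stands in for `SFSpeechAudioBufferRecognitionRequest` and traps on an append after end, so the ordering is checked.

```swift
import Foundation

// Hypothetical holder for "the request the tap should feed right now".
final class RequestBox<Request> {
    private let lock = NSLock()
    private var current: Request?

    func set(_ request: Request?) {
        lock.lock(); defer { lock.unlock() }
        current = request
    }

    // Called from the tap block. Returns false when no request is attached.
    @discardableResult
    func withCurrent(_ body: (Request) -> Void) -> Bool {
        lock.lock(); defer { lock.unlock() }
        guard let request = current else { return false }
        body(request)
        return true
    }
}

// Stand-in request that refuses appends after endAudio().
final class FakeRequest {
    private(set) var ended = false
    private(set) var buffers = 0
    func append() {
        precondition(!ended, "append after endAudio")
        buffers += 1
    }
    func endAudio() { ended = true }
}
```

Because the append runs while the lock is held, clearing the box cannot interleave with an in-flight append; the pause path must detach first and end second.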
```swift
self.recognitionTask = recognizer.recognitionTask(with: request) { [weak self] result, error in
    guard let self else { return }
    if let error {
        let msg = error.localizedDescription
        GatewayDiagnostics.log("talk speech: error=\(msg)")
        if !self.isSpeaking {
            if msg.localizedCaseInsensitiveContains("no speech detected") {
                self.statusText = self.isEnabled ? "Listening" : "Speech error: \(msg)"
            } else {
                self.statusText = "Speech error: \(msg)"
            }
        }
        self.logger.debug("speech recognition error: \(msg, privacy: .public)")
        if self.captureMode == .continuous, self.isEnabled, !self.isSpeaking {
            Task { @MainActor [weak self] in
                await self?.restartRecognitionAfterError()
            }
        }
    }
    guard let result else { return }
    let transcript = result.bestTranscription.formattedString
    if !result.isFinal, !self.loggedPartialThisCycle {
        let trimmed = transcript.trimmingCharacters(in: .whitespacesAndNewlines)
        if !trimmed.isEmpty {
            self.loggedPartialThisCycle = true
            GatewayDiagnostics.log("talk speech: partial chars=\(trimmed.count)")
```
Review comment
Path: apps/ios/Sources/Voice/TalkModeManager.swift
Line: 818:843
Error handling inconsistency: `captureMode != .idle` check missing in `resumeRecognitionOnly`
In the main `startRecognition` error handler (lines 600-627), there's a check for `captureMode != .idle` before treating errors as recoverable. The `resumeRecognitionOnly` version (lines 818-843) lacks this check on line 823, so errors that follow an intentional stop may be handled differently.
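A sketch of sharing the error decision between both recognition paths so they cannot drift. `handleRecognitionError` and `RecognitionErrorDecision` are hypothetical, and `CaptureMode` models only the two cases named here (the real enum may have more). The status strings and restart condition come from the snippet above; placing the `.idle` guard first is an assumption, since the `startRecognition` handler is not shown.

```swift
import Foundation

enum CaptureMode { case idle, continuous }

struct RecognitionErrorDecision: Equatable {
    var statusText: String?
    var restart: Bool
}

// Pure decision function; callers apply statusText and schedule the restart.
func handleRecognitionError(message: String, captureMode: CaptureMode,
                            isEnabled: Bool, isSpeaking: Bool) -> RecognitionErrorDecision {
    // Errors that arrive after an intentional stop are expected: ignore them.
    guard captureMode != .idle else {
        return RecognitionErrorDecision(statusText: nil, restart: false)
    }
    var status: String?
    if !isSpeaking {
        if message.localizedCaseInsensitiveContains("no speech detected") {
            status = isEnabled ? "Listening" : "Speech error: \(message)"
        } else {
            status = "Speech error: \(message)"
        }
    }
    let restart = captureMode == .continuous && isEnabled && !isSpeaking
    return RecognitionErrorDecision(statusText: status, restart: restart)
}
```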
```diff
 }

-if params.speak ?? true {
+if params.speak == true {
```
Review comment
Path: apps/ios/Sources/Model/NodeAppModel.swift
Line: 1074
Semantic change: `speak ?? true` → `speak == true`
Before: `params.speak ?? true` (spoke by default when `nil`)
After: `params.speak == true` (only speaks when explicitly `true`)
This changes behavior when `speak` is `nil`: previously it would speak, now it won't. If the API contract expects `nil` to mean "speak by default", this breaks backward compatibility.
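The two spellings differ only for `nil`, as this small truth table shows. `speaksBefore` and `speaksAfter` are illustrative helpers over `Bool?`, not code from the PR.

```swift
// Old and new spellings of the check, as free functions over `Bool?`.
func speaksBefore(_ speak: Bool?) -> Bool { speak ?? true }  // nil => speak
func speaksAfter(_ speak: Bool?) -> Bool { speak == true }   // nil => stay silent

for value: Bool? in [nil, true, false] {
    print(String(describing: value), speaksBefore(value), speaksAfter(value))
}
```

Explicit `true` and `false` behave the same either way, so only callers that omit `speak` are affected.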
Additional Comments (2)
Path: apps/ios/Sources/Voice/TalkModeManager.swift
Line: 1195:1226
Silent audio player loop with volume 0.0 may not satisfy background audio requirement
Setting `player.volume = 0.0` (line 1218) creates a completely silent loop. iOS may detect this as fake background activity and suspend the app anyway. Apple's background audio guidelines expect audible content. The code comment acknowledges that the AVAudioEngine should be the primary keepalive, but if the engine stops, this silent loop may not be sufficient.
Path: apps/ios/Sources/Voice/TalkModeManager.swift
Line: 2092:2098
Missing `.allowBluetooth` before this PR
The PR adds `.allowBluetooth` and `.allowBluetoothA2DP` (lines 2095-2096), but only `.allowBluetoothHFP` existed before. This means Bluetooth audio didn't work properly before this change. This is correct, but worth highlighting as a behavior change that may affect users.
Branch force-pushed from bfc1ccb to f92900f
Summary
- 557ece6 Add Background Voice toggle and basic background audio session
- a9c7a8b Keep app alive in background with silent audio keepalive, add thinking chime
- c4be86e Add toggle to disable ElevenLabs voice directive hint
- 2a440fa Keep AVAudioEngine running continuously, pause/resume recognition instead of full teardown
- 2e9ca02 Add VoIP background mode, speaker bleed detection, push speech queue, audio route change handling, startup chime, settings label cleanup

Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
None
User-visible / Behavior Changes
Current limitations
Security Impact (required)
Yes - added `voip` UIBackgroundMode; No; No; No; No
Repro + Verification
Environment
Steps
Expected
Actual
Evidence
Human Verification (required)
Compatibility / Migration
Yes; No; No
Failure Recovery (if this breaks)
Risks and Mitigations
Greptile Summary
Implements iOS background voice mode by keeping AVAudioEngine running continuously when backgrounded. The PR adds background audio session keepalive, speaker bleed detection, push speech queue handling, audio route change observers, and UI toggles.
Key changes:
- `voip` UIBackgroundMode enables background execution
- `pauseRecognitionOnly()`/`resumeRecognitionOnly()` keep the engine alive while pausing/resuming speech recognition

Issues found:
- `NodeAppModel.swift:1074`: changing `speak ?? true` to `speak == true` breaks backward compatibility when `speak` is `nil` (previously spoke by default, now won't speak)
- `TalkModeManager.swift:703-718`: keeping the audio tap installed after calling `recognitionRequest?.endAudio()` may cause unsafe `append()` calls on a stale request
- `startRecognition` and `resumeRecognitionOnly` (40-line audio tap setup block)

The background keepalive approach is sound (continuous AVAudioEngine + VoIP mode), but the two issues above need attention before merge.
Confidence Score: 3/5
`NodeAppModel.swift:1074` where `speak ?? true` → `speak == true` changes nil handling, and (2) a potential crash in `TalkModeManager.swift:703-718` where the audio tap continues calling `append()` on an ended recognition request. The core background keepalive design is solid, but these issues must be resolved. Additional style improvements (code deduplication, error-handling consistency) are recommended but non-blocking.
`apps/ios/Sources/Model/NodeAppModel.swift` (backward compatibility break) and `apps/ios/Sources/Voice/TalkModeManager.swift` (tap lifecycle safety)
Last reviewed commit: 2e9ca02