fix(talk): prevent double TTS playback when system voice times out#53511
fix(talk): prevent double TTS playback when system voice times out#53511grp06 merged 5 commits intoopenclaw:mainfrom
Conversation
Greptile SummaryThis PR fixes a double-TTS-playback bug in Talk Mode that affected non-ElevenLabs providers. When Two complementary improvements ship alongside the bug fix:
Key observations:
Confidence Score: 5/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: apps/macos/Sources/OpenClaw/TalkModeRuntime.swift
Line: 454-457
Comment:
**Non-idiomatic force-unwraps inside `useElevenLabs` guard**
`input.apiKey!` on the same line as the `!= nil` check, and both `input.apiKey!` / `input.voiceId!` on line 457, are safe because `useElevenLabs` can only be `true` when both are non-nil — but the pattern is non-idiomatic Swift and slightly obscures the intent.
A `guard let` at the call site would be more conventional and removes the forced unwraps entirely:
```suggestion
let useElevenLabs = !(input.apiKey?.isEmpty ?? true) && input.voiceId != nil
```
And in the `if useElevenLabs` branch on line 456–457, you could either shadow-bind there or just rely on the already-safe `!` given the guard. Either way, a small refactor would make the intent clearer for future readers.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix(talk): prevent double TTS playback w..." | Re-trigger Greptile |
| let useElevenLabs = input.apiKey != nil && !input.apiKey!.isEmpty && input.voiceId != nil | ||
| do { | ||
| if let apiKey = input.apiKey, !apiKey.isEmpty, let voiceId = input.voiceId { | ||
| try await self.playElevenLabs(input: input, apiKey: apiKey, voiceId: voiceId) | ||
| if useElevenLabs { | ||
| try await self.playElevenLabs(input: input, apiKey: input.apiKey!, voiceId: input.voiceId!) |
There was a problem hiding this comment.
Non-idiomatic force-unwraps inside
useElevenLabs guard
input.apiKey! on the same line as the != nil check, and both input.apiKey! / input.voiceId! on line 457, are safe because useElevenLabs can only be true when both are non-nil — but the pattern is non-idiomatic Swift and slightly obscures the intent.
A guard let at the call site would be more conventional and removes the forced unwraps entirely:
| let useElevenLabs = input.apiKey != nil && !input.apiKey!.isEmpty && input.voiceId != nil | |
| do { | |
| if let apiKey = input.apiKey, !apiKey.isEmpty, let voiceId = input.voiceId { | |
| try await self.playElevenLabs(input: input, apiKey: apiKey, voiceId: voiceId) | |
| if useElevenLabs { | |
| try await self.playElevenLabs(input: input, apiKey: input.apiKey!, voiceId: input.voiceId!) | |
| let useElevenLabs = !(input.apiKey?.isEmpty ?? true) && input.voiceId != nil |
And in the if useElevenLabs branch on line 456–457, you could either shadow-bind there or just rely on the already-safe ! given the guard. Either way, a small refactor would make the intent clearer for future readers.
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/macos/Sources/OpenClaw/TalkModeRuntime.swift
Line: 454-457
Comment:
**Non-idiomatic force-unwraps inside `useElevenLabs` guard**
`input.apiKey!` on the same line as the `!= nil` check, and both `input.apiKey!` / `input.voiceId!` on line 457, are safe because `useElevenLabs` can only be `true` when both are non-nil — but the pattern is non-idiomatic Swift and slightly obscures the intent.
A `guard let` at the call site would be more conventional and removes the forced unwraps entirely:
```suggestion
let useElevenLabs = !(input.apiKey?.isEmpty ?? true) && input.voiceId != nil
```
And in the `if useElevenLabs` branch on line 456–457, you could either shadow-bind there or just rely on the already-safe `!` given the guard. Either way, a small refactor would make the intent clearer for future readers.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
Addressed — the current revision uses if let apiKey = input.apiKey, !apiKey.isEmpty, let voiceId = input.voiceId to bind safely without force-unwraps. No useElevenLabs boolean or ! operators remain.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f91bbef82e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| try await TalkSystemSpeechSynthesizer.shared.speak( | ||
| text: input.cleanedText, | ||
| language: input.language) | ||
| language: input.language, | ||
| timeout: input.synthTimeoutSeconds) |
There was a problem hiding this comment.
Remove unsupported timeout arg from system TTS call
TalkSystemSpeechSynthesizer.shared.speak is called with timeout: here, but the OpenClawKit API only defines speak(text:language:onStart:) (see apps/shared/OpenClawKit/Sources/OpenClawKit/TalkSystemSpeechSynthesizer.swift). In macOS builds this produces an extra argument 'timeout' compile error, so the talk-mode fix cannot ship until this call matches the current method signature (or the API is updated in the same commit).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed — the timeout: parameter has been removed from the playSystemVoice call. The watchdog timeout is now handled entirely inside TalkSystemSpeechSynthesizer using language-aware per-character estimates. No compile error.
1f41d36 to
21ee726
Compare
21ee726 to
64768f0
Compare
|
Good catch — yes, we'll land the bug fixes here separately first, then rebase #53553 on top so the feature commits (system sounds, right Shift interrupt, CJK silence timeout) stack cleanly. |
22a364b to
436812d
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 436812dcde
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let ttsLanguage = input.language ?? appLocale | ||
| try await TalkSystemSpeechSynthesizer.shared.speak( | ||
| text: input.cleanedText, | ||
| language: input.language) | ||
| language: ttsLanguage) |
There was a problem hiding this comment.
Preserve default system voice when no directive language is set
Passing voiceWakeLocaleID as language here changes system-TTS behavior, not just watchdog timing: replies without a lang directive no longer use the macOS default voice and instead force the recognition locale. In setups where wake-word recognition locale differs from preferred speech voice, this makes Talk mode speak in the wrong language/voice for every system-provider reply. This regression comes from using locale fallback in the synthesis language: argument rather than only for timeout estimation.
Useful? React with 👍 / 👎.
436812d to
4312643
Compare
4312643 to
2fda4b7
Compare
When using a non-ElevenLabs TTS provider, playAssistant() would unconditionally fall back to playSystemVoice() on any error — even when system voice itself threw the error. This caused the response to play twice: once during the initial attempt and again in the catch block. Split the ElevenLabs and system-voice paths so that system voice failures are logged without retrying. ElevenLabs failures still fall back to system voice as before. Closes openclaw#53510 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
CJK languages have different speech rates per character. Using a flat 0.08s/char estimate caused premature watchdog timeout for Korean and Chinese text. Based on Pellegrino et al. (2011) syllable-per-second research (https://www.science.org/doi/10.1126/sciadv.aaw2594), adjusted for TTS synthesis speed: - Korean: 0.25s/char (1 syllable block per char, 5.96 SPS) - Chinese: 0.28s/char (1 syllable per hanzi, tonal, 5.18 SPS) - Japanese: 0.20s/char (kana/kanji blended average, 7.84 SPS) - English: 0.08s/char (avg ~5 chars/syllable, 6.19 SPS) Apply 3x safety margin on top, so the watchdog only fires on genuine hangs — not during normal speech. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Update Pellegrino et al. reference from 2011 to 2019 (matching the DOI), and align per-char timing values in comments to match actual code values (Japanese 0.20, Korean 0.25, Chinese 0.28). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2fda4b70ab
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| final class TalkSystemSpeechSynthesizerTests: XCTestCase { | ||
| func testWatchdogTimeoutDefaultsToLatinProfile() { | ||
| let timeout = TalkSystemSpeechSynthesizer.watchdogTimeoutSeconds( | ||
| text: String(repeating: "a", count: 100), |
There was a problem hiding this comment.
Make watchdog timeout tests actor-safe
TalkSystemSpeechSynthesizer is @MainActor, so watchdogTimeoutSeconds is also main-actor isolated, but this synchronous XCTest method calls it from a nonisolated context. With Swift 6 strict concurrency enabled for OpenClawKitTests, this produces an actor-isolation compile error (call to main actor-isolated static method ... in a synchronous nonisolated context), so the new test target can fail to build until the call site is @MainActor (or the helper is marked nonisolated).
Useful? React with 👍 / 👎.
2fda4b7 to
864d556
Compare
|
Merged via squash.
Thanks @hongsw! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 864d556fa6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let appLocale = await MainActor.run { AppStateStore.shared.voiceWakeLocaleID } | ||
| let ttsLanguage = input.language ?? appLocale |
There was a problem hiding this comment.
Keep system voice fallback from forcing a locale voice
playSystemVoice now always supplies input.language ?? AppStateStore.shared.voiceWakeLocaleID to TalkSystemSpeechSynthesizer.speak, and speak sets utterance.voice whenever language is non-nil. In the common case where no lang directive is present, this changes behavior from “use macOS default voice” to “force wake-locale voice”, so users with different recognition and preferred TTS locales will hear the wrong voice/language on every system-voice reply; the locale fallback should be used for timeout estimation only, not passed as synthesis language.
Useful? React with 👍 / 👎.
| let timeout = TalkSystemSpeechSynthesizer.watchdogTimeoutSeconds( | ||
| text: String(repeating: "a", count: 100), | ||
| language: nil) |
There was a problem hiding this comment.
Mark watchdog timeout tests as main-actor isolated
TalkSystemSpeechSynthesizer is @MainActor, so watchdogTimeoutSeconds is main-actor isolated; these synchronous XCTest methods call it directly from a nonisolated context. With StrictConcurrency enabled in apps/shared/OpenClawKit/Package.swift, this becomes a Swift 6 actor-isolation compile error and can fail the OpenClawKitTests target build until the tests (or helper API) are explicitly main-actor safe.
Useful? React with 👍 / 👎.
…penclaw#53511) Merged via squash. Prepared head SHA: 864d556 Co-authored-by: hongsw <[email protected]> Co-authored-by: grp06 <[email protected]> Reviewed-by: @grp06
…penclaw#53511) Merged via squash. Prepared head SHA: 864d556 Co-authored-by: hongsw <[email protected]> Co-authored-by: grp06 <[email protected]> Reviewed-by: @grp06
…penclaw#53511) Merged via squash. Prepared head SHA: 864d556 Co-authored-by: hongsw <[email protected]> Co-authored-by: grp06 <[email protected]> Reviewed-by: @grp06
…penclaw#53511) Merged via squash. Prepared head SHA: 864d556 Co-authored-by: hongsw <[email protected]> Co-authored-by: grp06 <[email protected]> Reviewed-by: @grp06
…penclaw#53511) Merged via squash. Prepared head SHA: 864d556 Co-authored-by: hongsw <[email protected]> Co-authored-by: grp06 <[email protected]> Reviewed-by: @grp06
Summary
Changes
1. Prevent double TTS playback (
TalkModeRuntime.swift)Split the ElevenLabs and system-voice error paths. Previously
playAssistant()unconditionally retriedplaySystemVoice()in its catch block — even when system voice itself threw the error. Now system voice failures are logged without retrying; only ElevenLabs failures fall back to system voice.2. Language-aware watchdog timeout (
TalkSystemSpeechSynthesizer.swift+TalkModeRuntime.swift)Two fixes:
a) Per-language timing estimates: The watchdog timer now uses per-language estimates based on Pellegrino et al. (2019) syllable-per-second research, adjusted for TTS synthesis speed:
A 3x safety margin is applied so the watchdog only fires on genuine hangs where
didFinishnever arrives — not during normal speech.b) App locale fallback:
playSystemVoicewas passingnilas language (ElevenLabs directive field) which causedTalkSystemSpeechSynthesizerto default to English timing (0.08s/char). Now falls back to the app's voice wake locale (e.g.ko-KR) so the correct language-specific timing is used.Closes #53510
Test plan
talk.provider = system— response plays once (not twice)pnpm check)🤖 Generated with Claude Code