fix(talk): prevent double TTS playback when system voice times out #53511

Merged
grp06 merged 5 commits into openclaw:main from hongsw:fix/talk-double-playback
Mar 26, 2026

@hongsw
Contributor

@hongsw hongsw commented Mar 24, 2026

Summary

  • Fix Talk Mode playing every assistant reply twice when using a non-ElevenLabs TTS provider
  • Fix CJK (Korean, Japanese, Chinese) system voice watchdog cutting off speech mid-sentence

Changes

1. Prevent double TTS playback (TalkModeRuntime.swift)

Split the ElevenLabs and system-voice error paths. Previously playAssistant() unconditionally retried playSystemVoice() in its catch block — even when system voice itself threw the error. Now system voice failures are logged without retrying; only ElevenLabs failures fall back to system voice.
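
The split error paths described above can be sketched as follows. This is a minimal, synchronous illustration and not the code in TalkModeRuntime.swift: `PlaybackSketch`, `TTSError`, and the closure properties are hypothetical stand-ins, and the real methods are async.

```swift
// Hypothetical error type for illustration only.
enum TTSError: Error { case timeout }

// Hypothetical stand-in for the runtime; not the real TalkModeRuntime API.
struct PlaybackSketch {
    // nil models "no ElevenLabs API key or voice ID configured".
    var elevenLabs: (() throws -> Void)?
    var systemVoice: () throws -> Void
    var log: (String) -> Void

    func playAssistant() {
        if let elevenLabs = elevenLabs {
            do {
                try elevenLabs()
            } catch {
                // Only an ElevenLabs failure falls back to system voice.
                log("ElevenLabs failed, falling back: \(error)")
                do { try systemVoice() } catch { log("System voice failed: \(error)") }
            }
        } else {
            do {
                try systemVoice()
            } catch {
                // System voice may already have spoken part of the reply:
                // log the failure and do not replay.
                log("System voice failed: \(error)")
            }
        }
    }
}

// Usage: a system-voice timeout results in a single playback attempt.
var attempts = 0
let sketch = PlaybackSketch(
    elevenLabs: nil,
    systemVoice: { attempts += 1; throw TTSError.timeout },
    log: { message in print(message) })
sketch.playAssistant()
print("system voice attempts:", attempts)
```

Keeping the two `do`/`catch` blocks separate means the system-voice failure handler has no route back into playback, which is the property the fix relies on.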

2. Language-aware watchdog timeout (TalkSystemSpeechSynthesizer.swift + TalkModeRuntime.swift)

Two fixes:

a) Per-language timing estimates: The watchdog timer now uses per-language estimates based on Pellegrino et al. (2019) syllable-per-second research, adjusted for TTS synthesis speed:

| Language | Per-char estimate | Min timeout | 50 chars × 3x → watchdog |
|---|---|---|---|
| Korean | 0.25s | 10s | 37.5s |
| Chinese | 0.28s | 10s | 42.0s |
| Japanese | 0.20s | 10s | 30.0s |
| English | 0.08s | 3s | 12.0s |

A 3x safety margin is applied so the watchdog only fires on genuine hangs where didFinish never arrives — not during normal speech.
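
The arithmetic behind the table can be reproduced with a short sketch. The per-character estimates, minimums, and 3x margin come from the description above; the type and function names, the prefix-based language matching, and the order in which the minimum and the margin are combined are assumptions made for illustration and may differ from TalkSystemSpeechSynthesizer.swift.

```swift
// Per-language timing profile; values are taken from the table above.
struct TimingProfile {
    let secondsPerCharacter: Double
    let minimumSeconds: Double
}

// Hypothetical helper: maps a language tag such as "ko-KR" to a profile.
func timingProfile(for language: String?) -> TimingProfile {
    let tag = (language ?? "en").lowercased()
    if tag.hasPrefix("ko") { return TimingProfile(secondsPerCharacter: 0.25, minimumSeconds: 10) }
    if tag.hasPrefix("zh") { return TimingProfile(secondsPerCharacter: 0.28, minimumSeconds: 10) }
    if tag.hasPrefix("ja") { return TimingProfile(secondsPerCharacter: 0.20, minimumSeconds: 10) }
    // Anything else uses the English (Latin) estimate.
    return TimingProfile(secondsPerCharacter: 0.08, minimumSeconds: 3)
}

func watchdogTimeoutSeconds(text: String, language: String?, safetyMargin: Double = 3.0) -> Double {
    let profile = timingProfile(for: language)
    // Assumption: the minimum applies to the raw estimate, and the margin
    // multiplies the result. The source does not state the order.
    let estimate = max(Double(text.count) * profile.secondsPerCharacter, profile.minimumSeconds)
    return estimate * safetyMargin
}

// Usage: 50 Hangul syllable blocks, each counted as one Character.
let korean = String(repeating: "가", count: 50)
print(watchdogTimeoutSeconds(text: korean, language: "ko-KR")) // 37.5, matching the Korean row
```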

b) App locale fallback: playSystemVoice was passing nil as language (the ElevenLabs directive field), which caused TalkSystemSpeechSynthesizer to default to English timing (0.08s/char). It now falls back to the app's voice wake locale (e.g. ko-KR) so that the correct language-specific timing is used.

Closes #53510

Test plan

  • Talk Mode with talk.provider = system — response plays once (not twice)
  • Korean long response (50+ chars) — plays to completion without timeout
  • English short response — still completes quickly
  • Verified watchdog uses correct language (not defaulting to English)
  • Pre-commit hooks pass (pnpm check)

🤖 Generated with Claude Code

@greptile-apps
Contributor

greptile-apps bot commented Mar 24, 2026

Greptile Summary

This PR fixes a double-TTS-playback bug in Talk Mode that affected non-ElevenLabs providers. When TalkSystemSpeechSynthesizer timed out mid-speech, the catch block in playAssistant unconditionally retried playSystemVoice, playing the response a second time. The fix introduces a useElevenLabs flag so the fallback path is only taken when ElevenLabs was the active provider.

Two complementary improvements ship alongside the bug fix:

  • synthTimeoutSeconds (already computed from text length) is now forwarded to TalkSystemSpeechSynthesizer.speak, so the dynamically-scaled timeout actually reaches the synthesiser and prevents spurious watchdog fires on long responses.
  • reloadConfig enforces a 2 000 ms minimum silence window for CJK locales (ko, ja, zh), which avoids premature silence detection in those languages.

Key observations:

  • The useElevenLabs flag correctly mirrors the original if let apiKey / let voiceId guard — behavior is equivalent for all existing provider configurations.
  • Force-unwrapping input.apiKey! and input.voiceId! on line 457 is safe because useElevenLabs is only true when both are non-nil, but the pattern is non-idiomatic Swift; a guard let or optional-chaining form would be clearer.
  • The CJK minimum silence override (max(configuredSilenceMs, 2000)) silently clamps an explicit user setting; this is a UX tradeoff, not a bug.
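
The clamping behavior noted in the last observation can be illustrated with a small sketch. Only the `max(configuredSilenceMs, 2000)` expression and the `ko`/`ja`/`zh` locale list come from the summary above; the function name and the prefix-based locale check are hypothetical.

```swift
// Hypothetical helper illustrating the CJK silence floor described above.
func effectiveSilenceWindowMs(configuredMs: Int, localeID: String) -> Int {
    let cjkPrefixes = ["ko", "ja", "zh"]
    let tag = localeID.lowercased()
    let isCJK = cjkPrefixes.contains { tag.hasPrefix($0) }
    // For CJK locales a configured value below 2000 ms is raised to 2000 ms;
    // larger values and non-CJK locales pass through unchanged.
    return isCJK ? max(configuredMs, 2000) : configuredMs
}

// Usage: an explicit 700 ms setting is overridden for Korean but not for English.
print(effectiveSilenceWindowMs(configuredMs: 700, localeID: "ko-KR"))
print(effectiveSilenceWindowMs(configuredMs: 700, localeID: "en-US"))
```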

Confidence Score: 5/5

  • Safe to merge — targeted fix for a well-understood regression with no behavioral changes to unaffected code paths.
  • The primary bug fix is straightforward and directly addresses the root cause described in the PR. The useElevenLabs flag faithfully replicates the original conditional guard, the synthTimeoutSeconds forwarding removes the underlying trigger for spurious catch-block retries, and the CJK silence floor is an isolated additive change. No existing code paths are altered for ElevenLabs users. The only nits are non-idiomatic force-unwraps that are statically safe.
  • No files require special attention.

Comment on lines +454 to +457
+ let useElevenLabs = input.apiKey != nil && !input.apiKey!.isEmpty && input.voiceId != nil
  do {
-     if let apiKey = input.apiKey, !apiKey.isEmpty, let voiceId = input.voiceId {
-         try await self.playElevenLabs(input: input, apiKey: apiKey, voiceId: voiceId)
+     if useElevenLabs {
+         try await self.playElevenLabs(input: input, apiKey: input.apiKey!, voiceId: input.voiceId!)
Contributor

P2 Non-idiomatic force-unwraps inside useElevenLabs guard

input.apiKey! on the same line as the != nil check, and both input.apiKey! / input.voiceId! on line 457, are safe because useElevenLabs can only be true when both are non-nil — but the pattern is non-idiomatic Swift and slightly obscures the intent.

A guard let at the call site would be more conventional and removes the forced unwraps entirely:

Suggested change
let useElevenLabs = input.apiKey != nil && !input.apiKey!.isEmpty && input.voiceId != nil
do {
if let apiKey = input.apiKey, !apiKey.isEmpty, let voiceId = input.voiceId {
try await self.playElevenLabs(input: input, apiKey: apiKey, voiceId: voiceId)
if useElevenLabs {
try await self.playElevenLabs(input: input, apiKey: input.apiKey!, voiceId: input.voiceId!)
let useElevenLabs = !(input.apiKey?.isEmpty ?? true) && input.voiceId != nil

And in the if useElevenLabs branch on line 456–457, you could either shadow-bind there or just rely on the already-safe ! given the guard. Either way, a small refactor would make the intent clearer for future readers.
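
For comparison, the suggested one-line flag and an optional-binding form can be placed side by side. This is a sketch with hypothetical free functions rather than the runtime's actual code; it only shows that the two conditions agree while the binding form also yields unwrapped values.

```swift
// The one-line form from the suggestion: no force-unwraps, result is a Bool.
func usesElevenLabsFlag(apiKey: String?, voiceId: String?) -> Bool {
    return !(apiKey?.isEmpty ?? true) && voiceId != nil
}

// Optional binding: the same condition, with unwrapped values in scope.
func usesElevenLabsBinding(apiKey: String?, voiceId: String?) -> Bool {
    if let apiKey = apiKey, !apiKey.isEmpty, let voiceId = voiceId {
        _ = (apiKey, voiceId) // the unwrapped values would be passed to playback here
        return true
    }
    return false
}

// Usage: both forms give the same answer for every combination.
let cases: [(String?, String?)] = [(nil, nil), (nil, "v"), ("", "v"), ("key", nil), ("key", "v")]
for (apiKey, voiceId) in cases {
    let flag = usesElevenLabsFlag(apiKey: apiKey, voiceId: voiceId)
    let binding = usesElevenLabsBinding(apiKey: apiKey, voiceId: voiceId)
    print(flag == binding)
}
```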

Contributor Author

Addressed — the current revision uses if let apiKey = input.apiKey, !apiKey.isEmpty, let voiceId = input.voiceId to bind safely without force-unwraps. No useElevenLabs boolean or ! operators remain.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f91bbef82e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +672 to +675
  try await TalkSystemSpeechSynthesizer.shared.speak(
      text: input.cleanedText,
-     language: input.language)
+     language: input.language,
+     timeout: input.synthTimeoutSeconds)

P1 Remove unsupported timeout arg from system TTS call

TalkSystemSpeechSynthesizer.shared.speak is called with timeout: here, but the OpenClawKit API only defines speak(text:language:onStart:) (see apps/shared/OpenClawKit/Sources/OpenClawKit/TalkSystemSpeechSynthesizer.swift). In macOS builds this produces an extra argument 'timeout' compile error, so the talk-mode fix cannot ship until this call matches the current method signature (or the API is updated in the same commit).

Contributor Author

Addressed — the timeout: parameter has been removed from the playSystemVoice call. The watchdog timeout is now handled entirely inside TalkSystemSpeechSynthesizer using language-aware per-character estimates. No compile error.

@hongsw hongsw force-pushed the fix/talk-double-playback branch 2 times, most recently from 1f41d36 to 21ee726 Compare March 24, 2026 08:02
@hongsw hongsw force-pushed the fix/talk-double-playback branch from 21ee726 to 64768f0 Compare March 24, 2026 08:14
Contributor

@fabianwilliams fabianwilliams left a comment

This appears to be a subset of #53553 — the two bug fix commits here (031fea0, 64768f0) are identical in both PRs. Are you planning to close this one in favor of #53553, or would you prefer to land the bug fixes separately here and rebase #53553 on top?

baryonai

This comment was marked as duplicate.

@hongsw
Contributor Author

hongsw commented Mar 24, 2026

Good catch — yes, we'll land the bug fixes here separately first, then rebase #53553 on top so the feature commits (system sounds, right Shift interrupt, CJK silence timeout) stack cleanly.

@grp06 grp06 self-assigned this Mar 26, 2026
@grp06 grp06 force-pushed the fix/talk-double-playback branch from 22a364b to 436812d Compare March 26, 2026 22:11
grp06 added a commit to hongsw/openclaw that referenced this pull request Mar 26, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 436812dcde

Comment on lines +686 to +689
+ let ttsLanguage = input.language ?? appLocale
  try await TalkSystemSpeechSynthesizer.shared.speak(
      text: input.cleanedText,
-     language: input.language)
+     language: ttsLanguage)

P2 Preserve default system voice when no directive language is set

Passing voiceWakeLocaleID as language here changes system-TTS behavior, not just watchdog timing: replies without a lang directive no longer use the macOS default voice and instead force the recognition locale. In setups where wake-word recognition locale differs from preferred speech voice, this makes Talk mode speak in the wrong language/voice for every system-provider reply. This regression comes from using locale fallback in the synthesis language: argument rather than only for timeout estimation.

@grp06 grp06 force-pushed the fix/talk-double-playback branch from 436812d to 4312643 Compare March 26, 2026 22:18
@grp06 grp06 force-pushed the fix/talk-double-playback branch from 4312643 to 2fda4b7 Compare March 26, 2026 22:19
hongsw and others added 5 commits March 26, 2026 15:22
When using a non-ElevenLabs TTS provider, playAssistant() would
unconditionally fall back to playSystemVoice() on any error — even
when system voice itself threw the error. This caused the response
to play twice: once during the initial attempt and again in the
catch block.

Split the ElevenLabs and system-voice paths so that system voice
failures are logged without retrying. ElevenLabs failures still
fall back to system voice as before.

Closes openclaw#53510

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

CJK languages have different speech rates per character. Using a
flat 0.08s/char estimate caused premature watchdog timeout for
Korean and Chinese text.

Based on Pellegrino et al. (2011) syllable-per-second research
(https://www.science.org/doi/10.1126/sciadv.aaw2594), adjusted
for TTS synthesis speed:
  - Korean:   0.25s/char (1 syllable block per char, 5.96 SPS)
  - Chinese:  0.28s/char (1 syllable per hanzi, tonal, 5.18 SPS)
  - Japanese: 0.20s/char (kana/kanji blended average, 7.84 SPS)
  - English:  0.08s/char (avg ~5 chars/syllable, 6.19 SPS)

Apply 3x safety margin on top, so the watchdog only fires on
genuine hangs — not during normal speech.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Update Pellegrino et al. reference from 2011 to 2019 (matching the DOI),
and align per-char timing values in comments to match actual code values
(Japanese 0.20, Korean 0.25, Chinese 0.28).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2fda4b70ab

final class TalkSystemSpeechSynthesizerTests: XCTestCase {
func testWatchdogTimeoutDefaultsToLatinProfile() {
let timeout = TalkSystemSpeechSynthesizer.watchdogTimeoutSeconds(
text: String(repeating: "a", count: 100),

P1 Make watchdog timeout tests actor-safe

TalkSystemSpeechSynthesizer is @MainActor, so watchdogTimeoutSeconds is also main-actor isolated, but this synchronous XCTest method calls it from a nonisolated context. With Swift 6 strict concurrency enabled for OpenClawKitTests, this produces an actor-isolation compile error (call to main actor-isolated static method ... in a synchronous nonisolated context), so the new test target can fail to build until the call site is @MainActor (or the helper is marked nonisolated).

@grp06 grp06 force-pushed the fix/talk-double-playback branch from 2fda4b7 to 864d556 Compare March 26, 2026 22:36
@grp06 grp06 merged commit 138a923 into openclaw:main Mar 26, 2026
24 of 27 checks passed
@grp06
Member

grp06 commented Mar 26, 2026

Merged via squash.

Thanks @hongsw!

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 864d556fa6

Comment on lines +685 to +686
let appLocale = await MainActor.run { AppStateStore.shared.voiceWakeLocaleID }
let ttsLanguage = input.language ?? appLocale

P2 Keep system voice fallback from forcing a locale voice

playSystemVoice now always supplies input.language ?? AppStateStore.shared.voiceWakeLocaleID to TalkSystemSpeechSynthesizer.speak, and speak sets utterance.voice whenever language is non-nil. In the common case where no lang directive is present, this changes behavior from “use macOS default voice” to “force wake-locale voice”, so users with different recognition and preferred TTS locales will hear the wrong voice/language on every system-voice reply; the locale fallback should be used for timeout estimation only, not passed as synthesis language.
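
One way to read this suggestion is to carry two separate values: the language passed to synthesis and the language used only for timeout estimation. The sketch below illustrates that split under stated assumptions; `SpeechRequest` and `makeSpeechRequest` are hypothetical names and do not appear in the PR.

```swift
// Hypothetical value type separating the two uses of "language".
struct SpeechRequest {
    // Passed to synthesis; nil keeps the macOS default voice.
    let synthesisLanguage: String?
    // Used only to pick a watchdog timing profile.
    let timingLanguage: String?
}

func makeSpeechRequest(directiveLanguage: String?, appLocale: String?) -> SpeechRequest {
    return SpeechRequest(
        synthesisLanguage: directiveLanguage,
        timingLanguage: directiveLanguage ?? appLocale)
}

// Usage: with no lang directive, the voice stays default while timing follows the app locale.
let request = makeSpeechRequest(directiveLanguage: nil, appLocale: "ko-KR")
print(request.synthesisLanguage ?? "default voice", request.timingLanguage ?? "no timing hint")
```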

Comment on lines +6 to +8
let timeout = TalkSystemSpeechSynthesizer.watchdogTimeoutSeconds(
text: String(repeating: "a", count: 100),
language: nil)

P1 Mark watchdog timeout tests as main-actor isolated

TalkSystemSpeechSynthesizer is @MainActor, so watchdogTimeoutSeconds is main-actor isolated; these synchronous XCTest methods call it directly from a nonisolated context. With StrictConcurrency enabled in apps/shared/OpenClawKit/Package.swift, this becomes a Swift 6 actor-isolation compile error and can fail the OpenClawKitTests target build until the tests (or helper API) are explicitly main-actor safe.
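
The review names two possible remedies: make the tests main-actor isolated, or mark the helper `nonisolated`. The second can be sketched without XCTest as below. `SynthesizerSketch` is a hypothetical stand-in, the numbers are the Latin-profile values from the PR description, and which remedy the merged code uses is not stated in this thread.

```swift
@MainActor
final class SynthesizerSketch {
    // A pure computation that touches no actor state, so it can opt out of
    // main-actor isolation and be called from any context.
    nonisolated static func watchdogTimeoutSeconds(characterCount: Int) -> Double {
        // Assumed Latin profile: 0.08 s/char, 3 s minimum, 3x margin.
        return max(Double(characterCount) * 0.08, 3.0) * 3.0
    }
}

// A synchronous, nonisolated caller, similar to a plain XCTest method.
func callFromNonisolatedContext() -> Double {
    return SynthesizerSketch.watchdogTimeoutSeconds(characterCount: 100)
}

print(callFromNonisolatedContext())
```

The alternative is to annotate the test method or test class with `@MainActor`, which leaves the helper's isolation unchanged.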

steipete pushed a commit to bugkill3r/openclaw that referenced this pull request Mar 27, 2026
…penclaw#53511)

Merged via squash.

Prepared head SHA: 864d556
Co-authored-by: hongsw <[email protected]>
Co-authored-by: grp06 <[email protected]>
Reviewed-by: @grp06
pxnt pushed a commit to pxnt/openclaw that referenced this pull request Mar 27, 2026
…penclaw#53511)

Merged via squash.

Prepared head SHA: 864d556
Co-authored-by: hongsw <[email protected]>
Co-authored-by: grp06 <[email protected]>
Reviewed-by: @grp06
godlin-gh pushed a commit to YouMindInc/openclaw that referenced this pull request Mar 27, 2026
…penclaw#53511)

Merged via squash.

Prepared head SHA: 864d556
Co-authored-by: hongsw <[email protected]>
Co-authored-by: grp06 <[email protected]>
Reviewed-by: @grp06
w-sss pushed a commit to w-sss/openclaw that referenced this pull request Mar 28, 2026
…penclaw#53511)

Merged via squash.

Prepared head SHA: 864d556
Co-authored-by: hongsw <[email protected]>
Co-authored-by: grp06 <[email protected]>
Reviewed-by: @grp06
livingghost pushed a commit to livingghost/openclaw that referenced this pull request Mar 31, 2026
…penclaw#53511)

Merged via squash.

Prepared head SHA: 864d556
Co-authored-by: hongsw <[email protected]>
Co-authored-by: grp06 <[email protected]>
Reviewed-by: @grp06

Development

Successfully merging this pull request may close these issues.

[Bug]: Mac app Talk Mode plays every reply twice (duplicate TTS audio)

4 participants