fix(talk): Talk Mode TTS improvements for CJK languages#53553
fix(talk): Talk Mode TTS improvements for CJK languages#53553hongsw wants to merge 8 commits intoopenclaw:mainfrom
Conversation
Greptile SummaryThis PR delivers a well-scoped set of Talk Mode improvements: it eliminates double TTS playback when using system voice, fixes the CJK watchdog cutting off speech mid-sentence, adds audible phase-transition sounds, introduces a right Shift interrupt key, and widens the silence-detection window for CJK locales.
Confidence Score: 5/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: apps/shared/OpenClawKit/Sources/OpenClawKit/TalkSystemSpeechSynthesizer.swift
Line: 57-63
Comment:
**Citation year and comment values don't match the code**
Two inconsistencies in this comment block that could mislead future maintainers:
1. The citation says **"Pellegrino et al. (2011)"** but the linked DOI (`sciadv.aaw2594`) points to the 2019 paper. The PR description also says 2019. The year in the comment should be corrected to `(2019)`.
2. The per-char values documented in the comments don't match what's actually used in the code:
| Language | Comment says | Code uses |
|----------|-------------|-----------|
| Korean | ~0.22s/char | 0.25s |
| Chinese | ~0.25s/char | 0.28s |
| Japanese | ~0.19s/char | 0.20s |
| English | ~0.08s/char | 0.08s ✓ |
The code values appear to include an additional safety buffer on top of the 1.3× TTS adjustment, but this isn't explained. Consider updating the comment values to match, or noting the extra adjustment factor so the math is traceable.
```suggestion
// Speech rates based on Pellegrino et al. (2019) syllable-per-second data,
// adjusted ~1.3x slower for TTS synthesis vs natural speech,
// then rounded up ~10–15% as an additional empirical safety buffer:
// https://www.science.org/doi/10.1126/sciadv.aaw2594
// Japanese: 7.84 SPS → ~0.19s/char → 0.20s used
// Korean: 5.96 SPS → ~0.22s/char → 0.25s used
// Chinese: 5.18 SPS → ~0.25s/char → 0.28s used
// English: 6.19 SPS → ~0.08s/char → 0.08s used
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "feat(talk): increase silence detection t..." | Re-trigger Greptile |
| // Speech rates based on Pellegrino et al. (2011) syllable-per-second data, | ||
| // adjusted ~1.3x slower for TTS synthesis vs natural speech: | ||
| // https://www.science.org/doi/10.1126/sciadv.aaw2594 | ||
| // Japanese: 7.84 SPS → ~0.19s/char (mixed kana/kanji avg ~1.5 mora/char) | ||
| // Korean: 5.96 SPS → ~0.22s/char (1 char = 1 syllable) | ||
| // Chinese: 5.18 SPS → ~0.25s/char (1 char = 1 syllable) | ||
| // English: 6.19 SPS → ~0.08s/char (avg ~5 chars/syllable) |
There was a problem hiding this comment.
Citation year and comment values don't match the code
Two inconsistencies in this comment block that could mislead future maintainers:
-
The citation says "Pellegrino et al. (2011)" but the linked DOI (
sciadv.aaw2594) points to the 2019 paper. The PR description also says 2019. The year in the comment should be corrected to(2019). -
The per-char values documented in the comments don't match what's actually used in the code:
| Language | Comment says | Code uses |
|---|---|---|
| Korean | ~0.22s/char | 0.25s |
| Chinese | ~0.25s/char | 0.28s |
| Japanese | ~0.19s/char | 0.20s |
| English | ~0.08s/char | 0.08s ✓ |
The code values appear to include an additional safety buffer on top of the 1.3× TTS adjustment, but this isn't explained. Consider updating the comment values to match, or noting the extra adjustment factor so the math is traceable.
| // Speech rates based on Pellegrino et al. (2011) syllable-per-second data, | |
| // adjusted ~1.3x slower for TTS synthesis vs natural speech: | |
| // https://www.science.org/doi/10.1126/sciadv.aaw2594 | |
| // Japanese: 7.84 SPS → ~0.19s/char (mixed kana/kanji avg ~1.5 mora/char) | |
| // Korean: 5.96 SPS → ~0.22s/char (1 char = 1 syllable) | |
| // Chinese: 5.18 SPS → ~0.25s/char (1 char = 1 syllable) | |
| // English: 6.19 SPS → ~0.08s/char (avg ~5 chars/syllable) | |
| // Speech rates based on Pellegrino et al. (2019) syllable-per-second data, | |
| // adjusted ~1.3x slower for TTS synthesis vs natural speech, | |
| // then rounded up ~10–15% as an additional empirical safety buffer: | |
| // https://www.science.org/doi/10.1126/sciadv.aaw2594 | |
| // Japanese: 7.84 SPS → ~0.19s/char → 0.20s used | |
| // Korean: 5.96 SPS → ~0.22s/char → 0.25s used | |
| // Chinese: 5.18 SPS → ~0.25s/char → 0.28s used | |
| // English: 6.19 SPS → ~0.08s/char → 0.08s used |
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/shared/OpenClawKit/Sources/OpenClawKit/TalkSystemSpeechSynthesizer.swift
Line: 57-63
Comment:
**Citation year and comment values don't match the code**
Two inconsistencies in this comment block that could mislead future maintainers:
1. The citation says **"Pellegrino et al. (2011)"** but the linked DOI (`sciadv.aaw2594`) points to the 2019 paper. The PR description also says 2019. The year in the comment should be corrected to `(2019)`.
2. The per-char values documented in the comments don't match what's actually used in the code:
| Language | Comment says | Code uses |
|----------|-------------|-----------|
| Korean | ~0.22s/char | 0.25s |
| Chinese | ~0.25s/char | 0.28s |
| Japanese | ~0.19s/char | 0.20s |
| English | ~0.08s/char | 0.08s ✓ |
The code values appear to include an additional safety buffer on top of the 1.3× TTS adjustment, but this isn't explained. Consider updating the comment values to match, or noting the extra adjustment factor so the math is traceable.
```suggestion
// Speech rates based on Pellegrino et al. (2019) syllable-per-second data,
// adjusted ~1.3x slower for TTS synthesis vs natural speech,
// then rounded up ~10–15% as an additional empirical safety buffer:
// https://www.science.org/doi/10.1126/sciadv.aaw2594
// Japanese: 7.84 SPS → ~0.19s/char → 0.20s used
// Korean: 5.96 SPS → ~0.22s/char → 0.25s used
// Chinese: 5.18 SPS → ~0.25s/char → 0.28s used
// English: 6.19 SPS → ~0.08s/char → 0.08s used
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Addressed in 26eee68c10 — year updated to 2019, comment values aligned with code.
fabianwilliams
left a comment
There was a problem hiding this comment.
Nice PR — well-structured commits and solid research backing the CJK timing values.
A few things:
1. Citation year + comment values mismatch (TalkSystemSpeechSynthesizer.swift)
The inline comment references "Pellegrino et al. (2011)" but the linked DOI (sciadv.aaw2594) is the 2019 paper — the PR description also says 2019. Additionally, the per-char values in the comments (~0.22, ~0.25, ~0.19) don't match what the code actually uses (0.25, 0.28, 0.20). Could you update the year to 2019 and either align the comment values to the code or note the extra safety buffer so the math is traceable?
2. Phase-transition sounds — any way to mute?
The system sounds (Tink, Pop, Bottle, Submarine) are a nice UX touch, but if someone is using Talk Mode in a quiet environment or a meeting, surprise audio could be unwelcome. Worth considering a preference toggle or respecting the system "Play sound effects" setting — not a blocker, just something to think about.
3. macOS CI failure
The macOS check failed, but it's a runner timeout/cancellation after all 212 test files passed (1846 tests). Not related to this PR's changes.
Overall this looks good — clean fixes for real problems. Thanks!
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 26eee68c10
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let estimatedSeconds = max(minSeconds, min(300.0, Double(trimmed.count) * perCharSeconds)) | ||
| let watchdogTimeout = estimatedSeconds * 3.0 |
There was a problem hiding this comment.
Cap watchdog timeout after applying safety multiplier
The new timeout calculation multiplies an already-capped estimate (min(300.0, ...)) by 3.0, which raises the hard watchdog ceiling to 900 seconds. In the failure path this watchdog is the only recovery mechanism, so a stuck AVSpeechSynthesizer can now keep Talk Mode in the speaking state for up to 15 minutes before control returns, a large regression from the previous bounded timeout behavior.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
The watchdog timeout is already capped via min(300.0, ...) on line 80. The 3x multiplier on a 300s base gives 900s max, which is intentional for very long TTS — the watchdog is a hang guard, not a performance limit.
|
@fabianwilliams Thanks for the thorough review! All three items addressed: 1. Citation year + comment values — Fixed in 2. Phase-transition sounds mute — Added in 3. macOS CI failure — Confirmed unrelated, no action needed. Additionally, changed the interrupt key from right Shift to right Option ( Screenshot of the new settings: #53553 (comment) |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f2f94b715b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| private func handleFlags(keyCode: UInt16, modifierFlags: NSEvent.ModifierFlags) { | ||
| // Right Option key down (keyCode 61). | ||
| guard keyCode == 61, modifierFlags.contains(.option) else { return } |
There was a problem hiding this comment.
Avoid reusing Right Option for interrupt hotkey
Binding interrupt to keyCode == 61 here collides with the existing push-to-talk hotkey (VoicePushToTalkHotkey also uses Right Option), so users who enable both features will trigger push-to-talk capture when they only meant to stop TTS. In that configuration, a simple tap to interrupt can fire extra push-to-talk side effects (chimes/overlay/mic session churn) while Talk Mode is transitioning, which is a user-visible regression.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — we addressed this in 5acda769f4. Talk Mode now disables Push-to-Talk while it is active (TalkModeController.setEnabled calls VoicePushToTalkHotkey.shared.setEnabled(false)), so the two handlers never run simultaneously. PTT is restored when Talk Mode is turned off. We also added a UI hint in the settings panel explaining this behavior.
|
@fabianwilliams ;) All integration tests passed on device — Talk Mode end-to-end verified:
This PR is ready for final review and merge (after #53511 lands first). |
fabianwilliams
left a comment
There was a problem hiding this comment.
All three review items confirmed addressed:
- Citation year — Pellegrino et al. 2019, comment values match code (Japanese 0.20, Korean 0.25, Chinese 0.28). ✅
- Phase sounds mute toggle —
talkPhaseSoundsEnabledpersisted via UserDefaults, guarded inplayPhaseSound. Clean. ✅ - CJK silence timeout —
isCJKLocalecheck with 2000ms floor clamp + info log when clamped. Exactly right. ✅
The error handling restructure in playAssistant is a nice improvement too — ElevenLabs failure now properly falls through to system voice without double-triggering.
Integration tests passing on device per the author's report. LGTM — approve.
|
CI note: the 2 failing checks (telegram extension, macOS) are unrelated to this PR's Talk Mode changes. The telegram test failure is in the extension test runner, not TTS code. The macOS check was cancelled/timed out. All relevant tests (node, Windows, channels, build) pass. Ready to merge. |
Play a short system sound on phase transitions to give the user audible feedback: - thinking: Tink - speaking: Pop - listening (after speech interrupted): Bottle - listening (after thinking): Submarine Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add TalkSpeechInterruptMonitor — a dedicated global key monitor that listens for right Shift (keyCode 60) to interrupt Talk Mode speech. Independent of Push-to-Talk, so it works even when PTT is disabled. Stops only the current response; the next conversation cycle continues normally via sendAndSpeak's resumeListeningIfNeeded flow. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Korean, Japanese, and Chinese speakers need longer pauses between phrases. When the app locale is CJK, enforce a minimum 2000ms silence window (vs the default 1500ms) to avoid premature transcript submission. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…nfig Replace non-idiomatic force-unwraps (cfg.voiceId!, cfg.modelId!) with safe flatMap unwrapping, and add an info log when CJK locale clamps the silence timeout so the override is observable in diagnostics. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add a "Play phase-transition sounds" checkbox to Voice Wake settings. When disabled, Talk Mode phase transitions (Tink/Pop/Bottle/Submarine) are silent. Defaults to enabled to preserve existing behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add a "Press Right Option to stop speech" checkbox to Voice Wake settings. Also change the interrupt key from right Shift to right Option (keyCode 61) to avoid conflicts with typing. Defaults to enabled to preserve existing behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Talk Mode and Push-to-Talk both use the right Option key (keyCode 61). Disable PTT when Talk Mode is enabled to prevent conflicting handlers, and restore PTT when Talk Mode is disabled. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Display a footnote under the Push-to-Talk toggle when both PTT and Talk Mode are enabled, explaining that PTT is paused while Talk Mode is active and resumes when Talk Mode is turned off. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
740ed3d to
a5c2ff1
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a5c2ff1715
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let pttEnabled = !enabled && AppStateStore.shared.voicePushToTalkEnabled | ||
| VoicePushToTalkHotkey.shared.setEnabled(pttEnabled) |
There was a problem hiding this comment.
Keep Push-to-Talk disabled while Talk Mode is active
This disables Push-to-Talk only when setEnabled(_:) is called for Talk Mode transitions, but it does not enforce that invariant afterward; if Talk Mode is already on and the user enables Push-to-Talk from settings, MenuContentView.onChange(of: voicePushToTalkEnabled) re-enables the hotkey without checking talkEnabled, so Right Option again drives both handlers and reintroduces the interrupt/PTT conflict this change is trying to prevent. Please gate hotkey enablement on both toggles everywhere (or centralize the decision) so Talk Mode always wins while active.
Useful? React with 👍 / 👎.
b919b03 to
a5c2ff1
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a6ed92d808
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let pttEnabled = !enabled && AppStateStore.shared.voicePushToTalkEnabled | ||
| VoicePushToTalkHotkey.shared.setEnabled(pttEnabled) |
There was a problem hiding this comment.
End active push-to-talk session before disabling hotkey
This newly disables Push-to-Talk whenever Talk Mode is enabled, but if the user turns on Talk Mode while a PTT hold is already active (Right Option still down), VoicePushToTalkHotkey.stopMonitoring() clears its state and removes monitors without running the endAction, so VoicePushToTalk.shared.end() is never called and the capture session can remain open (mic/overlay stuck). Please ensure an active PTT session is finalized before disabling the hotkey monitor.
Useful? React with 👍 / 👎.

Summary
A collection of Talk Mode improvements focused on CJK language support and user experience:
reloadConfigChanges
1. Prevent double TTS playback (
TalkModeRuntime.swift)Split the ElevenLabs and system-voice error paths. Previously
playAssistant()unconditionally retriedplaySystemVoice()in its catch block — even when system voice itself threw the error.2. Language-aware watchdog timeout (
TalkSystemSpeechSynthesizer.swift)Per-language timing estimates based on Pellegrino et al. (2019) with 3x safety margin:
Also fixes
playSystemVoicepassingnillanguage, which caused the watchdog to always use English timing.3. Phase-transition system sounds (
TalkModeController.swift)Distinct audible feedback for each Talk Mode phase:
Users can disable sounds via Settings → Voice Wake → "Play phase-transition sounds" checkbox.
4. Right Option key interrupt (
TalkSpeechInterruptMonitor.swift)New
TalkSpeechInterruptMonitor— a dedicated global key monitor (independent of Push-to-Talk) that listens for right Option (keyCode 61) to stop the current response. Talk Mode returns to listening for the next conversation.Users can disable via Settings → Voice Wake → "Press Right Option to stop speech" checkbox.
Push-to-Talk is automatically disabled while Talk Mode is active (both use the right Option key). A hint is shown in Settings when this applies.
5. CJK silence detection (
TalkModeRuntime.swift)Enforce minimum 2000ms silence window for CJK locales (vs default 1500ms) to avoid premature transcript submission mid-phrase. Logs when the clamp is applied.
6. Config cleanup (
TalkModeRuntime.swift)cfg.voiceId!,cfg.modelId!) with safeflatMapunwrapping7. Citation fix (
TalkSystemSpeechSynthesizer.swift)Settings UI
Closes #53510
Tip: Upgrade to Premium system voices for a more natural Talk Mode experience
Talk Mode uses the macOS system voice by default (when ElevenLabs is not configured). Apple provides Enhanced and Premium quality voices that sound significantly more natural than the default.
To upgrade:
Premium voices use neural TTS and produce much more natural-sounding speech in Talk Mode conversations.
Test plan
pnpm check)pnpm test— 9450 passed, 2 failed (pre-existingvitest-scoped-config.test.ts, unrelated)🤖 Generated with Claude Code