Improve SpeechSession lifecycle and observability #19

Merged
rudrankriyam merged 10 commits into main from feature/todo-audio-session
Oct 25, 2025

@rudrankriyam
Member

Summary

  • Ensure AVAudioSession is activated/deactivated explicitly so other apps regain audio focus as soon as transcription ends.
  • Add AVAudioSession interruption observers that pause, optionally resume, or tear down the pipeline to avoid stuck sessions during calls or Siri.
  • Convert status and audio-input streams into broadcast-style observables (plus smaller extensions) so multiple UI and logic consumers receive every update consistently.
  • Restart the demo app's VAD indicator task whenever voice activation settings change to keep its speech-detection badge accurate.
  • Documentation and structural tweaks (Agents guide, SpeechSession file splits, TODO tracker) that keep the codebase maintainable while capturing outstanding work.
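
To illustrate the first bullet, here is a minimal sketch of pairing activation with a guaranteed deactivation. The protocol `AudioSessionControlling`, the function `withActiveAudioSession`, and the `RecordingSession` test double are hypothetical names invented for this example and are not taken from the PR. On iOS, a conforming type would presumably forward to `AVAudioSession.sharedInstance().setActive(_:options:)`, passing `.notifyOthersOnDeactivation` when deactivating so that other apps are told they can resume.

```swift
// Hypothetical abstraction over the platform audio session so the sketch
// runs without AVFoundation. Not part of the PR.
protocol AudioSessionControlling {
    func setActive(_ active: Bool) throws
}

/// Activates the session, runs `body`, and always deactivates afterwards,
/// even when `body` throws.
func withActiveAudioSession<T>(
    _ session: AudioSessionControlling,
    _ body: () throws -> T
) throws -> T {
    try session.setActive(true)
    // Deactivation errors are ignored here; a real implementation may log them.
    defer { try? session.setActive(false) }
    return try body()
}

/// Test double that records every activation change in order.
final class RecordingSession: AudioSessionControlling {
    private(set) var log: [Bool] = []
    func setActive(_ active: Bool) throws { log.append(active) }
}
```

The `defer` keeps activation and deactivation balanced on every exit path, which is the property the bullet describes; the PR itself may structure this differently.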

@gemini-code-assist

Summary of Changes

Hello @rudrankriyam, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and observability of the SpeechSession lifecycle. It introduces explicit control over AVAudioSession activation and deactivation, along with comprehensive handling for audio session interruptions, preventing stuck sessions. Furthermore, key status and audio input streams have been refactored into broadcast observables, allowing for more flexible and consistent consumption of updates across the application. These changes, coupled with structural improvements and updated documentation, contribute to a more stable and maintainable speech transcription framework.

Highlights

  • Audio Session Management: Explicitly activate and deactivate AVAudioSession to improve audio focus handling for other applications, ensuring resources are released promptly.
  • Interruption Handling: Implemented observers for AVAudioSession interruptions, allowing the speech pipeline to pause, optionally resume, or tear down gracefully during system events like phone calls or Siri interactions, preventing stuck sessions.
  • Broadcast Observables: Converted SpeechSession's status and audio-input streams into broadcast-style AsyncStreams, enabling multiple UI and logic consumers to receive consistent updates without needing to manage individual continuations.
  • Demo App VAD Indicator Reliability: Ensured the demo application's Voice Activity Detection (VAD) indicator task restarts when VAD settings change, maintaining the accuracy of its speech-detection badge.
  • Structural Refinements: SpeechSession was refactored from an actor to a @MainActor class, and its related types (Status, StreamingMode, VoiceActivationConfiguration) and observable properties (statusStream, audioInputConfigurationStream, modelDownloadProgress, isVoiceActivationEnabled) were moved to dedicated extensions for better organization and clarity.
  • Documentation & TODOs: Updated Agents.md and README.md for clarity and added a TODO.md to track completed work items, improving codebase maintainability.
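
The "Broadcast Observables" highlight can be sketched as follows. This is an illustrative approximation under stated assumptions, not the PR's code: `Broadcaster`, `makeStream()`, `send(_:)`, and `finish()` are hypothetical names, and the PR keeps its continuation dictionaries on `SpeechSession` itself. The idea shown is one `AsyncStream` continuation per subscriber, keyed by `UUID`, with every update yielded to all of them.

```swift
import Foundation

// Hypothetical helper, not the PR's API.
@MainActor
final class Broadcaster<Value: Sendable> {
    private var continuations: [UUID: AsyncStream<Value>.Continuation] = [:]

    /// Each call returns an independent stream that receives every later `send`.
    func makeStream() -> AsyncStream<Value> {
        let id = UUID()
        let (stream, continuation) = AsyncStream<Value>.makeStream()
        continuations[id] = continuation
        // Drop the continuation once its consumer stops iterating or the
        // stream is finished.
        continuation.onTermination = { [weak self] _ in
            Task { @MainActor [weak self] in
                self?.continuations[id] = nil
            }
        }
        return stream
    }

    /// Delivers `value` to every current subscriber.
    func send(_ value: Value) {
        for continuation in continuations.values {
            continuation.yield(value)
        }
    }

    /// Ends all subscriber streams.
    func finish() {
        for continuation in continuations.values {
            continuation.finish()
        }
        continuations.removeAll()
    }
}
```

A single shared `AsyncStream` would hand each element to only one of its consumers, which is why a per-subscriber continuation is used in this sketch.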

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This is an excellent pull request that significantly improves the lifecycle management and observability of SpeechSession. The explicit activation/deactivation of AVAudioSession and the addition of robust interruption handling are crucial for a well-behaved audio application. The transition to broadcast-style async streams for status and audio input updates is a great architectural improvement, allowing multiple consumers to observe the session's state concurrently. The associated refactoring, including splitting the SpeechSession implementation into logical extensions and files, enhances code organization and maintainability. The inclusion of new tests for the broadcast stream behavior and updates to documentation and the demo app make this a comprehensive and high-quality contribution.

Comment on lines 95 to 107
guard enableVAD else {
    await MainActor.run {
        isSpeechDetected = true
    }
    return
}

guard let stream = session.speechDetectorResultsStream else {
    await MainActor.run {
        isSpeechDetected = true
    }
    return
}

medium

The two guard statements here have identical else blocks. They can be combined into a single guard statement to reduce code duplication and improve conciseness.

Suggested change
- guard enableVAD else {
-     await MainActor.run {
-         isSpeechDetected = true
-     }
-     return
- }
- guard let stream = session.speechDetectorResultsStream else {
-     await MainActor.run {
-         isSpeechDetected = true
-     }
-     return
- }
+ guard enableVAD, let stream = session.speechDetectorResultsStream else {
+     await MainActor.run {
+         isSpeechDetected = true
+     }
+     return
+ }

Comment on lines 88 to 91
#expect(firstA != nil)
#expect(firstB != nil)
#expect(firstA! == nil)
#expect(firstB! == nil)

medium

These assertions are functionally correct, but they can be expressed more idiomatically and readably. The type of firstA is AudioInputInfo?? (or Optional<Optional<AudioInputInfo>>), which can be confusing. Using #expect(value == .some(nil)) makes the intent clearer by explicitly checking for a non-nil outer optional containing a nil value.

Suggested change
- #expect(firstA != nil)
- #expect(firstB != nil)
- #expect(firstA! == nil)
- #expect(firstB! == nil)
+ #expect(firstA == .some(nil))
+ #expect(firstB == .some(nil))
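
To make the nested-optional point concrete, here is a small sketch of the three cases an `AudioInputInfo??` can take. The `AudioInputInfo` struct below is a hypothetical stand-in (the real type's fields are not shown in this conversation), and the reading of the outer `nil` as "stream ended" assumes the value came from an async iterator's `next()`, which the excerpt does not confirm.

```swift
// Hypothetical stand-in for the PR's AudioInputInfo; the `name` field is
// an assumption made for illustration.
struct AudioInputInfo: Equatable {
    let name: String
}

/// Spells out the three states of a doubly optional value.
func describe(_ value: AudioInputInfo??) -> String {
    switch value {
    case .none:
        return "no value at all (for example, the stream ended)"
    case .some(.none):
        return "a value was delivered, and that value is nil"
    case .some(.some(let info)):
        return "a value was delivered: \(info.name)"
    }
}
```

Comparing against `.some(nil)` matches only the middle case, which is what the suggested assertions check in a single expression.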

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines +164 to +183
private func handleInterruptionEnded(options: AVAudioSession.InterruptionOptions) async {
    guard shouldResumeAfterInterruption else { return }
    shouldResumeAfterInterruption = false

    guard options.contains(.shouldResume) else {
        if Self.shouldLog(.notice) {
            Self.logger.notice("Audio session interruption ended without resume option; cleaning up session")
        }
        prepareForStop()
        await cleanup(cancelRecognizer: true)
        await finishStream(error: nil)
        return
    }

    do {
        try await setupAudioSession()
        try startAudioStreaming()
        setStatus(.transcribing)
        if Self.shouldLog(.notice) {
            Self.logger.notice("Audio session resumed after interruption")
P1: Resuming after phone interruptions ignores manual stop

The new interruption handler resumes audio streaming whenever shouldResumeAfterInterruption is still true, without checking whether the session was explicitly stopped while the interruption was in progress. shouldResumeAfterInterruption is only reset inside handleInterruptionEnded, so if the user taps “Stop” during a call interruption the flag stays true. When the interruption ends, the code at handleInterruptionEnded reactivates the AVAudioSession, restarts the audio engine, and sets the status back to .transcribing even though the recognizer and stream were already torn down by stopTranscribing(). This leaves the session in an inconsistent state and restarts microphone capture unexpectedly. Consider clearing shouldResumeAfterInterruption during stopTranscribing()/cleanup or guarding the resume path with the current status/task state.
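
Both remedies mentioned above (clearing the flag on stop, and guarding the resume path on the current state) can be sketched with a simplified value-type model. Only the names `shouldResumeAfterInterruption`, `stopTranscribing`, and `handleInterruptionEnded` come from the comment; `SessionModel`, `ModelStatus`, `start()`, and `handleInterruptionBegan()` are hypothetical, and the real session is an async class with far more state.

```swift
// Hypothetical, simplified model of the session state. Not the PR's code.
enum ModelStatus: Equatable {
    case idle, paused, transcribing
}

struct SessionModel {
    private(set) var status: ModelStatus = .idle
    private(set) var shouldResumeAfterInterruption = false

    mutating func start() {
        status = .transcribing
    }

    mutating func handleInterruptionBegan() {
        guard status == .transcribing else { return }
        shouldResumeAfterInterruption = true
        status = .paused
    }

    mutating func stopTranscribing() {
        // A manual stop cancels any pending resume.
        shouldResumeAfterInterruption = false
        status = .idle
    }

    mutating func handleInterruptionEnded(shouldResume: Bool) {
        // Resume only if a resume is pending AND the session is still paused.
        guard shouldResumeAfterInterruption, status == .paused else { return }
        shouldResumeAfterInterruption = false
        status = shouldResume ? .transcribing : .idle
    }
}
```

In this model, stopping during an interruption leaves the status at `.idle` when the interruption ends, instead of flipping back to `.transcribing`.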

@rudrankriyam
Member Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly improves the SpeechSession lifecycle management and observability. The introduction of explicit AVAudioSession activation/deactivation, handling of audio interruptions, and conversion of streams to broadcast-style observables are excellent enhancements. The code is well-structured, with good separation of concerns into different file extensions, and the inclusion of new tests for the broadcast streams is appreciated. My main feedback concerns a small improvement to the new audioInputConfigurationStream to ensure it provides an initial value to new subscribers, making its behavior more predictable and consistent with the statusStream.

Comment on lines 58 to 60
#if os(iOS) || os(macOS)
var audioInputConfigurationContinuation: AsyncStream<AudioInputInfo?>.Continuation?
var audioInputContinuations: [UUID: AsyncStream<AudioInputInfo?>.Continuation] = [:]
#endif

medium

To ensure new subscribers to audioInputConfigurationStream receive the current state, it's helpful to cache the last known AudioInputInfo. Please add a property to store this. This is part of a set of changes to provide an initial value to stream consumers.

Suggested change
- #if os(iOS) || os(macOS)
- var audioInputConfigurationContinuation: AsyncStream<AudioInputInfo?>.Continuation?
- var audioInputContinuations: [UUID: AsyncStream<AudioInputInfo?>.Continuation] = [:]
- #endif
+ var audioInputContinuations: [UUID: AsyncStream<AudioInputInfo?>.Continuation] = [:]
+ var lastAudioInputInfo: AudioInputInfo?

Comment on lines +24 to +30
    public var audioInputConfigurationStream: AsyncStream<AudioInputInfo?> {
        AsyncStream { [weak self] continuation in
            let id = UUID()
            Task { @MainActor [weak self] in
                guard let self else { return }
                self.audioInputContinuations[id] = continuation
            }

medium

The audioInputConfigurationStream currently doesn't provide an initial value to new subscribers, which is inconsistent with statusStream and can lead to surprising behavior for consumers. By yielding the newly cached lastAudioInputInfo, you can ensure the stream behaves more like a CurrentValueSubject from Combine, providing the most recent value upon subscription.

    public var audioInputConfigurationStream: AsyncStream<AudioInputInfo?> {
        AsyncStream { [weak self] continuation in
            let id = UUID()
            Task { @MainActor [weak self] in
                guard let self else { return }
                continuation.yield(self.lastAudioInputInfo)
                self.audioInputContinuations[id] = continuation
            }
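
As a self-contained illustration of the "latest value on subscribe" behavior described in this comment, here is a sketch under stated assumptions. `LatestValueBroadcaster`, `subscribe()`, and `update(_:)` are hypothetical names rather than the PR's API, and unlike the suggestion above this version registers the continuation synchronously on the main actor instead of inside a `Task`.

```swift
import Foundation

// Hypothetical helper modelling CurrentValueSubject-like replay.
@MainActor
final class LatestValueBroadcaster<Value: Sendable> {
    private var latest: Value
    private var continuations: [UUID: AsyncStream<Value>.Continuation] = [:]

    init(initial: Value) {
        latest = initial
    }

    /// Returns a stream whose first element is the most recent value.
    func subscribe() -> AsyncStream<Value> {
        let id = UUID()
        let (stream, continuation) = AsyncStream<Value>.makeStream()
        continuation.yield(latest)  // replay the cached value first
        continuations[id] = continuation
        continuation.onTermination = { [weak self] _ in
            Task { @MainActor [weak self] in
                self?.continuations[id] = nil
            }
        }
        return stream
    }

    /// Caches `value` and forwards it to all current subscribers.
    func update(_ value: Value) {
        latest = value
        for continuation in continuations.values {
            continuation.yield(value)
        }
    }
}
```

A subscriber that arrives after an update still sees that update as its first element, which matches the consistency with `statusStream` that the comment asks for.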

@rudrankriyam rudrankriyam merged commit c3dad58 into main Oct 25, 2025
1 check passed
@rudrankriyam rudrankriyam deleted the feature/todo-audio-session branch October 25, 2025 01:21