
Audio Support for Scope Rework#534

Merged
leszko merged 45 commits into main from marco/feat/audio-sync-2
Mar 19, 2026

Conversation

@BuffMcBigHuge
Collaborator

Audio Support for Scope (Reworked)

Overall, this approach is simpler and appears to resolve the audio quality issues. The complex gating mechanics and resampling have been removed in favor of aiortc-style WebRTC timing.

Summary

Adds end-to-end audio support to Scope's WebRTC streaming pipeline. Pipelines can return audio alongside video in their output dict; the server streams audio over WebRTC. This is a simplified rewrite of the audio path that fixes clipping and audio quality issues reported in PR #480.

What's New

Backend

  • Pipeline interface: Pipelines may return {"video": ..., "audio": ..., "audio_sample_rate": ...}. Audio keys are optional; pipelines that don't produce audio are unchanged.
  • PipelineProcessor: Uses an audio_callback instead of a queue. Only the last processor in a chain receives the callback.
  • FrameProcessor: Simple audio_queue for raw (audio_tensor, sample_rate) tuples. No background drain thread, no video-gated release, no resampling.
  • AudioProcessingTrack: Receives raw audio from FrameProcessor, resamples to 48 kHz when needed (per-channel, preserves stereo), buffers samples, and delivers 20ms stereo frames for WebRTC/Opus. Uses aiortc-style monotonic pacing.
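The pipeline contract described above can be sketched as follows. This is an illustrative toy pipeline (the class name and tone generation are hypothetical, not from the PR); the point is the shape of the output dict:

```python
import numpy as np

class TonePipeline:
    """Hypothetical pipeline showing the optional audio keys."""

    def __call__(self, frame):
        # Video is required; "audio" and "audio_sample_rate" are optional extras.
        video = np.zeros((512, 512, 3), dtype=np.uint8)
        t = np.arange(1024) / 48000
        tone = 0.1 * np.sin(2 * np.pi * 440 * t)   # 440 Hz test tone
        audio = np.stack([tone, tone])             # (channels, samples), stereo
        return {"video": video, "audio": audio, "audio_sample_rate": 48000}
```

Pipelines that produce no audio simply omit the extra keys and are unchanged.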

Frontend

  • VideoOutput: Mute/unmute toggle (speaker icon). Starts muted to satisfy browser autoplay policy; user can unmute once the stream is playing.
  • useUnifiedWebRTC: Merges video and audio tracks into a single MediaStream. Adds a recvonly audio transceiver so the SDP offer includes an audio m-line for the backend to attach its track.

WebRTC Handshake

The browser adds addTransceiver("audio", { direction: "recvonly" }) so the offer includes an audio m-line. After setRemoteDescription, the backend finds the audio transceiver, attaches its AudioProcessingTrack, and sets direction to sendonly. The answer then indicates that the server will send audio.
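The server-side attach step might look like the sketch below. It is written against duck-typed stand-ins rather than real aiortc objects (`transceivers` and `audio_track` are assumptions, not the PR's actual names):

```python
def attach_server_audio(transceivers, audio_track):
    """Find the browser-offered audio m-line and answer sendonly.

    A sketch of the backend step described above: after setRemoteDescription,
    locate the audio transceiver, attach the server's track, and flip the
    direction so the answer advertises server-to-browser audio.
    """
    for t in transceivers:
        if t.kind == "audio":
            t.sender.replaceTrack(audio_track)
            t.direction = "sendonly"
            return t
    return None  # browser offered no audio m-line
```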

Why This Version (audio-sync-2)

PR #480 received feedback about:

  • Audio quality / clipping – video and audio not synchronized, poor experience
  • Code complexity – video-gated release, drain thread, MediaClock coupling

This branch is a simplified rewrite that:

  1. Keeps stereo – No forced mono mixdown. Pipelines (e.g. LTX-2) output stereo; we pass it through. Forced mono can cause phase cancellation and artifacts.
  2. Passthrough when possible – If the pipeline outputs 48 kHz (WebRTC standard), resampling is skipped entirely.
  3. Per-channel resampling – When resampling is needed, each channel is resampled separately, preserving stereo.
  4. Simpler architecture – Callback → queue → track. No drain thread, no video-gated slicing, no MediaClock for audio.
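Points 2 and 3 can be sketched together. This is a minimal illustration (function name is hypothetical), using linear interpolation per channel so stereo is preserved and 48 kHz input passes through untouched:

```python
import numpy as np

def resample_per_channel(audio: np.ndarray, src_rate: int, dst_rate: int = 48000) -> np.ndarray:
    """Resample (channels, samples) audio to dst_rate, one channel at a time."""
    if src_rate == dst_rate:
        return audio  # passthrough: pipeline already outputs WebRTC-standard 48 kHz
    n_src = audio.shape[1]
    n_dst = int(round(n_src * dst_rate / src_rate))
    src_t = np.arange(n_src) / src_rate
    dst_t = np.arange(n_dst) / dst_rate
    # np.interp is chunk-boundary safe, avoiding FFT spectral-leakage artifacts
    return np.vstack([np.interp(dst_t, src_t, ch) for ch in audio])
```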

Architecture

Pipeline.__call__() → {"video": tensor, "audio": tensor, "audio_sample_rate": int}
    │
    ▼
PipelineProcessor.audio_callback(audio_tensor, sample_rate)
    │
    ▼
FrameProcessor.audio_queue  (raw tuples)
    │
    ▼
AudioProcessingTrack.recv()  (resample if needed, buffer, 20ms frames)
    │
    ▼
WebRTC

Trade-offs

  • NDI audio – Not included in this PR; NDI output sinks do not receive audio. Can be re-added in a follow-up.
  • A/V sync – Previous video-gated approach removed. Uses independent aiortc-style pacing. Please test sync behavior.

Files Changed

  • src/scope/server/frame_processor.py – Simplified audio path (~185 net lines removed)
  • src/scope/server/pipeline_processor.py – Callback-based audio delivery
  • src/scope/server/tracks.py – Stereo AudioProcessingTrack with per-channel resampling
  • src/scope/server/webrtc.py – Audio track wiring (no MediaClock)

Related

Signed-off-by: BuffMcBigHuge <[email protected]>
Contributor

@j0sh j0sh left a comment


Thanks, this does seem somewhat simpler than the last iteration.

I don't want to block this for the sake of shipping something if it seems to be working alright for now, but there are a couple things I don't quite understand.

  • Video and audio are being paced differently. Video effectively uses wall-clock while audio is using sample counts. Is there a reason for this?
  • Using wall-clock makes things susceptible to pipeline jitter.
  • If audio output is a little delayed, it doesn't get a chance for another 20ms. This accumulates and will lead to desync. Conversely, audio that might be a little bursty can be unnecessarily delayed.
  • Is there a reason the audio queue is non-blocking? It seems preferable to block (up to a reasonable duration, then silence can be inserted) instead of sleeping. But maybe I'm missing something about why media pulls are intended to be non-blocking.
  • In general, I'd consider using input timestamps or a "reference clock" to timestamp and pace the output, rather than depending on wall-clock after the pipeline. Pipelines may produce output at different rates and this architecture generally doesn't account for that. More on this in Discord.
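The sample-count pacing under discussion (the "aiortc-style" approach the PR uses) can be sketched like this. Names are illustrative; the key property is that the sleep target is derived from frames already sent, not from when the current chunk happened to arrive:

```python
import asyncio
import time

FRAME_SEC = 0.020  # one 20 ms Opus frame

async def pace_next_frame(start: float, frames_sent: int) -> None:
    """Sleep until the slot for frame `frames_sent`, anchored to `start`.

    Because the target is start + n * 20ms, a late frame shortens the next
    sleep instead of permanently shifting the schedule.
    """
    target = start + frames_sent * FRAME_SEC
    delay = target - time.monotonic()
    if delay > 0:
        await asyncio.sleep(delay)
```

j0sh's concern is that this schedule is still anchored to wall-clock at the output, so pipeline jitter and bursty audio are absorbed by the buffer rather than by timestamps carried from the input.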

@BuffMcBigHuge
Collaborator Author

Thank you @j0sh, let's continue in Discord.

Comment on lines +313 to +321
# Interleave into buffer: [L0, R0, L1, R1, ...]
for i in range(audio_np.shape[1]):
    for ch in range(self.channels):
        self._audio_buffer.append(audio_np[ch, i])

# Serve a 20ms frame from the buffer
samples_needed = self._samples_per_frame * self.channels
if len(self._audio_buffer) >= samples_needed:
    samples = [self._audio_buffer.popleft() for _ in range(samples_needed)]
Contributor


Do these loop over individual samples? That is probably pretty expensive; best to work on complete 20ms frames if possible. There's probably some numpy wizardry for fast interleaving and effective frame chunking.
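The "numpy wizardry" suggested here (and later adopted in #633, per the commit notes) amounts to Fortran-order ravel for interleaving plus slicing for frame extraction. A minimal sketch, with illustrative names:

```python
import numpy as np

def interleave(audio_np: np.ndarray) -> np.ndarray:
    """(channels, n) -> packed [L0, R0, L1, R1, ...] with no Python loops.

    Fortran order walks columns first, which for a (channels, samples)
    array yields exactly the packed interleaved layout.
    """
    return audio_np.ravel(order="F")

def take_frames(buf: np.ndarray, samples_per_frame: int, channels: int):
    """Split a packed buffer into complete frames plus the remainder."""
    per_frame = samples_per_frame * channels
    n_frames = len(buf) // per_frame
    frames = buf[: n_frames * per_frame].reshape(n_frames, per_frame)
    return frames, buf[n_frames * per_frame:]
```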

Collaborator


I attempted to address this in #633.

@github-actions
Contributor

github-actions bot commented Mar 2, 2026

🚀 fal.ai Preview Deployment

App ID daydream/scope-pr-534--preview
WebSocket wss://fal.run/daydream/scope-pr-534--preview/ws
Commit 7f5e460

Testing

Connect to this preview deployment by running this on your branch:

uv run build && SCOPE_CLOUD_APP_ID="daydream/scope-pr-534--preview/ws" uv run daydream-scope

🧪 E2E tests will run automatically against this deployment.

@github-actions
Contributor

github-actions bot commented Mar 2, 2026

✅ E2E Tests passed

Status passed
fal App daydream/scope-pr-534--preview
Run View logs

Test Artifacts

Check the workflow run for screenshots.

…ioProcessingTrack

Addresses review feedback on #534. The audio buffer interleaving and
frame extraction used O(n) Python loops over individual samples, which
is expensive for real-time audio. Now uses np.ravel(order="F") for
interleaving and numpy slicing for frame extraction.

Also adds 42 tests covering interleaving, buffering, resampling,
channel conversion, frame construction, and adversarial inputs.

Signed-off-by: RyanOnTheInside <[email protected]>
…perf

Fix AudioProcessingTrack per-sample loop performance
@coderabbitai

coderabbitai bot commented Mar 9, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ccfac82d-94d7-41ea-849d-7a68cd50815d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


BuffMcBigHuge and others added 25 commits March 17, 2026 16:16
…oProcessingTrack

Move FrameProcessor ownership from VideoProcessingTrack to Session so it
can be shared cleanly between video and audio tracks. When a pipeline
declares produces_video=False, skip video track creation entirely and
deliver audio through the existing AudioProcessingTrack path.

Key changes:
- Session owns and manages FrameProcessor lifecycle
- FrameProcessor injected into VideoProcessingTrack via constructor (DI)
- Audio callback fires before video early-return in PipelineProcessor
- Add produces_video / requires_audio_input ClassVars to BasePipelineConfig
- Frontend shows audio-only indicator when no video track is present

Signed-off-by: RyanOnTheInside <[email protected]>
Clean up unused requires_audio_input ClassVar from BasePipelineConfig
and remove test_tone pipeline from the registry.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
Add explanatory comments and docstring documenting why Fortran-order
ravel produces the correct packed interleaved format for PyAV s16
layout. Add round-trip test verifying samples survive interleave,
int16 conversion, AudioFrame, and planes[0] readback.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
Remove 15 tests that tested numpy operations (ravel, slicing, vstack,
mean) rather than actual class methods. The remaining 27 tests all call
real methods (recv, stop, _create_audio_frame, _resample_audio).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
- Initialize FrameProcessor.paused to False to prevent AttributeError
  when AudioProcessingTrack.recv() is called before any pause/resume
- Stop audio_track in Session.close() to ensure buffer cleanup on teardown
- Consolidate AUDIO_CLOCK_RATE in media_clock.py, remove duplicate from
  tracks.py and unused VIDEO_CLOCK_RATE from media_clock.py

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
MediaClock was only used by VideoProcessingTrack while
AudioProcessingTrack used its own independent clock, so it added
complexity without delivering actual synchronization. Video track
now uses a simple monotonic PTS counter matching the audio pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
…n-audio pipelines

- recv() now drains all queued audio chunks per call instead of one,
  reducing latency for bursty/small-chunk pipelines
- Cap audio buffer at 1 second to prevent unbounded memory growth
- Add produces_audio ClassVar to BasePipelineConfig; only create
  AudioProcessingTrack when a pipeline declares audio support
- Update tests for drain-loop compatibility and add coverage for
  queue drain and buffer cap behaviors

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
…, use deque buffer

- FFT resampling caused spectral leakage artifacts at chunk boundaries;
  linear interpolation (np.interp) is chunk-boundary safe
- Restore None guard on output_dict before .get("audio") to prevent
  AttributeError when pipeline returns None
- Replace np.concatenate/slice buffer with collections.deque to avoid
  O(buffer_size) copies on every chunk append

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
…ame creation

Move chain_produces_video/chain_produces_audio from webrtc.py into
PipelineRegistry as class methods for better encapsulation. Guard
_create_audio_frame against NaN/inf with nan_to_num. Consolidate and
clean up audio processing track tests using parametrize.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
The log message assumed a non-video pipeline must be audio-only, but
the audio check happens separately afterwards.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
Consistent with cloud_track.py pattern — each track type gets its own file.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
Fix audio-only pipeline bugs where pause, parameter broadcasts, and
get_frame_processor() silently skipped sessions without a video track.
All three now fall back to session.frame_processor when video_track is
None.

Add audio recording support: RecordingManager accepts optional video
and audio tracks, recording setup moved outside the produces_video
block so it works for video-only, audio-only, or combined sessions.
AudioTimestampNormalizingTrack normalizes PTS for recording restarts.

Fix test import to use audio_track module after earlier refactor.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
Audio output from pipelines now flows through a queue on
PipelineProcessor (audio_output_queue) instead of a callback,
making it symmetric with how video frames are delivered.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
…xtraction

Move audio pacing responsibility to individual pipelines instead of
the processor loop. Remove redundant second output_dict.get("video") call.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
Port the audio-beep plugin into the core pipelines directory so it is
available without installing a separate plugin.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
Cloud pipelines that produce audio now deliver it to the browser.
The audio flows through: cloud WebRTC track → CloudWebRTCClient →
CloudConnectionManager → FrameProcessor cloud audio queue →
AudioProcessingTrack → browser.

Key changes:
- CloudWebRTCClient handles audio tracks and receives audio frames
- CloudConnectionManager forwards audio via callbacks
- FrameProcessor queues cloud audio for AudioProcessingTrack
- CloudTrack accepts an injected FrameProcessor (shared with audio)
- handle_offer_with_relay creates AudioProcessingTrack and wires
  the browser audio transceiver

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
Without this, the cloud handle_offer has no audio m-line to attach
an AudioProcessingTrack to, so pipeline audio is produced but never
sent back -- filling and overflowing the audio output queue.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
Cloud sends packed s16 stereo where to_ndarray() returns (1, 1920)
-- interleaved channel pairs in a single plane. Without de-interleaving,
AudioProcessingTrack misidentifies this as mono and duplicates it,
producing 2x the expected data rate (constant buffer overflow) and
garbled output.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
In relay mode we cannot know the cloud pipeline capabilities, so
we always create both CloudTrack and AudioProcessingTrack. For
audio-only pipelines the CloudTrack emits small black frames at 1fps
to keep the browser MediaStream active so audio playback is not
blocked. For video-only pipelines the AudioProcessingTrack sends
silence harmlessly.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
…ities

Expose produces_video and produces_audio flags from the pipeline registry
through the status endpoint and into WebRTC negotiation. This avoids
creating unnecessary tracks (no video track for audio-only pipelines, no
audio transceiver for video-only pipelines) and removes the black-frame
fallback hack from CloudTrack. Also increases the audio buffer to 3s to
handle bursty pipelines like LTX2, and improves buffer trimming to keep
newest samples instead of dropping entire chunks.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
These test pipelines are available as separate plugins and don't need to
be bundled in core.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Rafał Leszko <[email protected]>
@leszko leszko marked this pull request as ready for review March 19, 2026 15:45
@leszko leszko merged commit 53a49cf into main Mar 19, 2026
9 of 10 checks passed
