
Fix WebRTC VP8 decode errors and PLI keyframe requests#614

Merged
emranemran merged 1 commit into main from
emran/webrtc-retry-fix-vp8
Mar 9, 2026

Conversation

@emranemran
Contributor

@emranemran emranemran commented Mar 6, 2026

Problem

Cloud relay WebRTC connections would permanently die from transient VP8 decode errors. When the decoder receives P-frames before a keyframe (common at connection start), the resulting exception killed the frame receive loop with no recovery path.

Additionally, aiortc 1.14.0 changed _send_rtcp_pli() to require a media_ssrc argument, causing PLI keyframe requests to fail with warnings on every cloud connection.

Two fixes

1. VP8 decode error retry (tracks.py, cloud_track.py, cloud_webrtc_client.py)

All three frame input loops now tolerate transient errors instead of dying on the first one:

  • Count consecutive errors, sleeping 10ms between retries
  • Any successful frame resets the counter to 0
  • Bail out only after 10 consecutive failures (~400ms at 25fps)
  • Covers the window between connection start and first keyframe arrival
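A minimal sketch of that retry pattern, as a self-contained simulation: `recv`, `handle_frame`, and the `EndOfStream` sentinel are illustrative stand-ins for the real track's `recv()`, the frame-processor hand-off, and aiortc's `MediaStreamError` (the actual loops live in tracks.py, cloud_track.py, and cloud_webrtc_client.py):

```python
import asyncio

class EndOfStream(Exception):
    """Stand-in for aiortc's MediaStreamError (clean end of source)."""

MAX_CONSECUTIVE_ERRORS = 10   # ~400ms of back-to-back failures at 25fps
RETRY_SLEEP_S = 0.01          # 10ms pause between retries

async def receive_loop(recv, handle_frame):
    """Tolerate transient decode errors; stop only after a run of failures."""
    consecutive_errors = 0
    while True:
        try:
            frame = await recv()
        except EndOfStream:
            break  # clean disconnect: end the loop quietly
        except Exception:
            consecutive_errors += 1
            if consecutive_errors >= MAX_CONSECUTIVE_ERRORS:
                break  # source looks permanently broken, give up
            await asyncio.sleep(RETRY_SLEEP_S)
            continue
        consecutive_errors = 0  # any good frame resets the counter
        handle_frame(frame)
```

The key property is that the counter only grows across back-to-back failures, so the startup window of undecodable P-frames is ridden out, while a genuinely dead source still terminates the loop after ten straight failures.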

2. PLI keyframe request fix (cloud_webrtc_client.py)

Requests a keyframe from the cloud encoder to speed up decoder startup:

  • Polls every 100ms until aiortc's remote_streams dict is populated (up to 5s)
  • Sends a single PLI with the correct media_ssrc (required by aiortc >= 1.14.0)
  • Logs the SSRC on success for verification
  • Logs first frame receipt to confirm the keyframe roundtrip
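The polling-then-PLI flow can be sketched the same way; `get_remote_streams` and `send_pli` below are hypothetical callables standing in for aiortc's internal `_RTCRtpReceiver__remote_streams` dict and `receiver._send_rtcp_pli(media_ssrc)`:

```python
import asyncio

async def request_keyframe(get_remote_streams, send_pli,
                           timeout=5.0, poll_interval=0.1):
    """Poll until RTP packets have arrived, then send one PLI with their SSRC."""
    elapsed = 0.0
    while elapsed < timeout:
        await asyncio.sleep(poll_interval)
        elapsed += poll_interval
        streams = get_remote_streams()  # empty until the first RTP packet lands
        if streams:
            media_ssrc = next(iter(streams))  # aiortc >= 1.14.0 needs this arg
            await send_pli(media_ssrc)
            return media_ssrc
    return None  # no RTP within timeout: skip the PLI, rely on encoder keyframe
```

Returning the SSRC (or None) keeps the caller's logging straightforward: log success with the SSRC, or a debug line when the PLI was skipped.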

How they work together

Connection established
  ├─ PLI task starts polling for remote_streams
  ├─ Frame receive loop starts
  │   ├─ P-frame arrives → decode error → retry (counter=1)
  │   ├─ P-frame arrives → decode error → retry (counter=2)
  │   ├─ PLI sent once remote_streams populated
  │   ├─ Keyframe arrives → decoded successfully → counter=0
  │   └─ Normal operation continues
  └─ "First frame received from cloud (keyframe decoded successfully)"

In practice, the cloud encoder sends a keyframe on its own at connection start, so the first frame usually arrives before the PLI is even sent. The PLI is a safety net, and the retry logic is a safety net for the safety net.

Expected logs

INFO - Sent PLI keyframe request (media_ssrc=2642438216)
INFO - First frame received from cloud (keyframe decoded successfully)
INFO - [FRAME-PROCESSOR] First frame produced, playback ready (ttff=1312ms)

Test plan

  • Verify server starts without errors (uv run daydream-scope)
  • Test cloud relay mode — no PLI warnings in logs
  • Verify "Sent PLI keyframe request" and "First frame received" appear in logs
  • VP8 decode warnings should be logged but not kill the session
  • Clean disconnects still work (MediaStreamError path)

@coderabbitai

coderabbitai bot commented Mar 6, 2026

📝 Walkthrough

Walkthrough

Introduces consecutive-error tolerance (a counter with a max of 10) to server input loops and treats MediaStreamError as end-of-source. Updates cloud_webrtc_client to track and reset consecutive errors, log first-frame and periodic frame progress, and to request keyframes by polling for remote streams (0.1s interval, up to 5s) and sending a PLI using the remote stream SSRC when available.

Changes

Cohort / File(s) Summary
Input loop error tolerance
src/scope/server/cloud_track.py, src/scope/server/tracks.py
Add consecutive_errors with max_consecutive_errors = 10; reset on successful frame; treat MediaStreamError as end-of-source; on other exceptions increment counter, log transient warnings, sleep 0.01s between retries, stop and log when threshold reached.
WebRTC client receive & keyframe logic
src/scope/server/cloud_webrtc_client.py
In _receive_frames track/reset consecutive errors, handle MediaStreamError to terminate loop, log first-frame and per-100-frame progress. Rework _request_keyframe to poll for remote_streams (0.1s intervals, up to 5s), extract a media SSRC from found remote streams, and send a PLI via receiver._send_rtcp_pli(media_ssrc); skip PLI if no remote streams found within timeout and log accordingly.

Sequence Diagram

sequenceDiagram
    participant Client as WebRTC Caller
    participant Receiver as Video Receiver
    participant Remote as Remote Streams Registry

    Client->>Receiver: request_keyframe()
    Receiver->>Remote: poll for remote_streams (0.1s intervals, up to 5s)
    alt remote_streams found
        Remote-->>Receiver: provide remote_streams (media SSRC)
        Receiver->>Receiver: extract media_ssrc
        Receiver->>Receiver: call _send_rtcp_pli(media_ssrc)
        Receiver-->>Client: log success / return
    else no remote_streams within timeout
        Remote-->>Receiver: no streams
        Receiver-->>Client: skip PLI, debug log
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nudge the stream with patient hops,

I count the misses, mend the drops.
A little wait, a gentle plea,
The SSRC hums — frames come to me.
Hooray — the picture dances free!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title directly reflects the main objectives: addressing WebRTC VP8 decode errors through retry logic and fixing PLI keyframe requests, which are the core changes across all three modified files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d78d11f1-6edc-4f6e-ad0e-8bf7a4a1f6d2

📥 Commits

Reviewing files that changed from the base of the PR and between 0a52f7c and a746734.

📒 Files selected for processing (3)
  • src/scope/server/cloud_track.py
  • src/scope/server/cloud_webrtc_client.py
  • src/scope/server/tracks.py

Comment on lines +336 to 342
consecutive_errors += 1
if consecutive_errors >= max_consecutive_errors:
    logger.error(
        f"Error receiving frame, stopping after "
        f"{consecutive_errors} consecutive errors: {e}"
    )
    break

@coderabbitai coderabbitai bot Mar 6, 2026


⚠️ Potential issue | 🟠 Major

Mark the client failed before exiting the receive loop.

Once this branch trips, the task stops but is_connected can still stay true, so callers may keep queueing frames against a dead output path and wait forever for processed video. Flip the connection state here before break.

Suggested fix
                 except Exception as e:
                     consecutive_errors += 1
                     if consecutive_errors >= max_consecutive_errors:
                         logger.error(
                             f"Error receiving frame, stopping after "
                             f"{consecutive_errors} consecutive errors: {e}"
                         )
+                        self._connected = False
+                        self._stats["connection_state"] = "failed"
                         break

Contributor

@mjh1 mjh1 Mar 9, 2026


seems the same as the previous behaviour though? not that that means it's fine, but if this is a bug it doesn't seem like it's being introduced here

@coderabbitai coderabbitai bot replied

@mjh1 — you're right, this isn't a regression introduced by this PR. Neither the old except Exception path nor the new one updated self._connected before breaking, so the gap existed before these changes too.

The suggestion is still worthwhile as a correctness improvement (callers checking is_connected after the loop exits won't know the receive path has died), but it's a pre-existing latent issue rather than something this PR introduces. Happy to leave it for a follow-up if you'd prefer to keep this PR focused on the VP8/PLI fixes.

Would you like me to open a follow-up issue to track it?

Comment on lines +365 to +383
delays = [0.1, 0.5, 1.0]
for attempt, delay in enumerate(delays):
    await asyncio.sleep(delay)
    sent = False
    for receiver in self.pc.getReceivers():
        if receiver.track and receiver.track.kind == "video":
            try:
                await receiver._send_rtcp_pli()
                logger.info(
                    f"Sent PLI (keyframe request), attempt {attempt + 1}"
                )
                sent = True
            except Exception as e:
                logger.warning(
                    f"Could not send PLI (attempt {attempt + 1}): {e}"
                )
    if sent:
        return
logger.warning("Failed to send PLI after all retry attempts")


⚠️ Potential issue | 🟡 Minor


Add self.pc None check to prevent crash if connection closes during keyframe request.

_request_keyframe() runs as a detached task and sleeps up to 1.6s total across retries. If disconnect() runs during this window, self.pc becomes None (line 428), causing self.pc.getReceivers() to crash with AttributeError. Snapshot self.pc at each retry attempt and check for None or closed state before iterating receivers.

Suggested fix
         delays = [0.1, 0.5, 1.0]
         for attempt, delay in enumerate(delays):
             await asyncio.sleep(delay)
+            pc = self.pc
+            if pc is None or pc.connectionState == "closed":
+                return
             sent = False
-            for receiver in self.pc.getReceivers():
+            for receiver in pc.getReceivers():
                 if receiver.track and receiver.track.kind == "video":
                     try:
                         await receiver._send_rtcp_pli()

@github-actions
Contributor

github-actions bot commented Mar 6, 2026

🚀 fal.ai Preview Deployment

App ID daydream/scope-pr-614--preview
WebSocket wss://fal.run/daydream/scope-pr-614--preview/ws
Commit bb69e0a

Testing

Connect to this preview deployment by running this on your branch:

uv run build && SCOPE_CLOUD_APP_ID="daydream/scope-pr-614--preview/ws" uv run daydream-scope

🧪 E2E tests will run automatically against this deployment.

@github-actions
Contributor

github-actions bot commented Mar 6, 2026

✅ E2E Tests passed

Status passed
fal App daydream/scope-pr-614--preview
Run View logs

Test Artifacts

Check the workflow run for screenshots.

@emranemran emranemran changed the title from "Fix WebRTC input loops dying on transient VP8 decode errors" to "Fix WebRTC VP8 decode errors and PLI keyframe requests" on Mar 9, 2026
@emranemran emranemran force-pushed the emran/webrtc-retry-fix-vp8 branch from a746734 to 6411c07 on March 9, 2026 06:52
@emranemran
Contributor Author

@leszko @mjh1: FYI I'm still testing this PR and it's not ready for merge yet.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/scope/server/cloud_webrtc_client.py (1)

311-312: Consider moving import to file level for consistency.

MediaStreamError is imported inside the method, but tracks.py imports it at the file level (line 8). Since line 24 already imports from aiortc.mediastreams, adding MediaStreamError there maintains consistency.

Suggested change

At line 24:

-from aiortc.mediastreams import VIDEO_TIME_BASE, MediaStreamTrack
+from aiortc.mediastreams import VIDEO_TIME_BASE, MediaStreamError, MediaStreamTrack

Then remove lines 311-312 from _receive_frames.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7697451f-d207-4edb-8a2f-2d31fc5c332f

📥 Commits

Reviewing files that changed from the base of the PR and between a746734 and 6411c07.

📒 Files selected for processing (3)
  • src/scope/server/cloud_track.py
  • src/scope/server/cloud_webrtc_client.py
  • src/scope/server/tracks.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/scope/server/cloud_track.py

Comment on lines +377 to +386
# Get remote SSRC from receiver's internal stream tracking
# (required by aiortc >= 1.14.0)
remote_streams = receiver._RTCRtpReceiver__remote_streams
if not remote_streams:
    logger.debug(
        f"No remote streams yet (attempt {attempt + 1})"
    )
    continue
media_ssrc = next(iter(remote_streams))
await receiver._send_rtcp_pli(media_ssrc)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

aiortc 1.14.0 _send_rtcp_pli media_ssrc parameter

💡 Result:

In aiortc 1.14.0, RTCRtpReceiver._send_rtcp_pli(media_ssrc) expects the SSRC of the media source you want the sender to generate a new keyframe for—i.e., the SSRC of the incoming RTP stream (the “media source” SSRC in the RTCP feedback common header). aiortc then puts that value into the RTCP PSFB PLI packet’s media_ssrc field. (raw.githubusercontent.com)

Notes:

  • ssrc=self.__rtcp_ssrc is the SSRC of the feedback sender (your receiver endpoint), while media_ssrc is the SSRC of the media stream being referenced. (raw.githubusercontent.com)
  • If RTX is in use, aiortc unwraps RTX and rewrites the packet SSRC to the original (primary) SSRC before potentially sending PLI, so the PLI targets the main media SSRC, not the RTX SSRC. (raw.githubusercontent.com)



Private attribute access is fragile despite defensive handling.

receiver._RTCRtpReceiver__remote_streams accesses a name-mangled private attribute. While the code already wraps this in a try/except block with retry logic, the direct access remains vulnerable to silent breakage if aiortc's internals change. The broad exception handling provides defense, but consider adding a version check at runtime (the comment indicates aiortc >= 1.14.0 is required) or documenting the minimum aiortc version as a hard dependency to catch incompatibilities earlier.


- Add retry logic for transient VP8 decode errors in cloud WebRTC input loops
- Fix PLI keyframe request for aiortc >= 1.14.0 (pass media_ssrc)
- Poll for remote streams before sending PLI instead of blind retries
- Log first frame receipt to verify PLI/keyframe roundtrip

Signed-off-by: emranemran <[email protected]>
@emranemran emranemran force-pushed the emran/webrtc-retry-fix-vp8 branch from 6411c07 to bb69e0a on March 9, 2026 18:24

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/scope/server/cloud_track.py (1)

132-154: ⚠️ Potential issue | 🟠 Major

Reset consecutive_errors after the cloud handoff succeeds.

consecutive_errors is cleared before self.frame_processor.put(frame) runs, so a persistent put() failure never reaches the 10-error cutoff — every successful recv() zeroes the counter first.

🐛 Proposed fix
                 try:
                     # Get frame from browser
                     frame = await self._source_track.recv()
-                    consecutive_errors = 0

                     # Send through FrameProcessor (which relays to cloud)
                     if self.frame_processor:
                         self.frame_processor.put(frame)
+                    consecutive_errors = 0
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/scope/server/cloud_webrtc_client.py`:
- Around line 370-399: The current PLI send logic polls for remote_streams then
sends a single RTCP PLI via receiver._send_rtcp_pli(media_ssrc) and bails if it
fails; change this to retry sending PLIs on a bounded schedule (e.g.,
exponential or fixed-interval backoff with a max attempts/timeout) until a video
frame arrives or the retry budget is exhausted: after detecting remote_streams
and selecting receiver/media_ssrc, loop up to N attempts (or until a global
timeout) calling receiver._send_rtcp_pli(media_ssrc), awaiting a short interval
between attempts, break if the first frame is observed (check r.track.readyState
or other indicator from pc.getReceivers()/remote_streams), and log each attempt
and final failure using logger (replace the single try/except block around
receiver._send_rtcp_pli with this retry loop and preserve existing debug/info
messages).

In `@src/scope/server/tracks.py`:
- Around line 63-92: The loop calls self.frame_processor.put() but
self.frame_processor is lazily created by await self.track.recv(), so initialize
the processor before entering the retry loop: perform an initial await
self.track.recv() (or call the existing FrameProcessor initializer)
outside/above the while self.input_task_running loop to force creation of
self.frame_processor, handling asyncio.CancelledError and MediaStreamError the
same way as in-loop, then continue into the loop using
self.frame_processor.put() as before so transient recv-driven creation cannot
cause perpetual single-error retries.

---

Outside diff comments:
In `@src/scope/server/cloud_track.py`:
- Around line 132-154: The bug is that consecutive_errors is reset immediately
after a successful _source_track.recv() so failures in
self.frame_processor.put(frame) never accumulate; move the reset to only run
after the cloud handoff succeeds by wrapping the handoff in its own try/except:
call self.frame_processor.put(frame) (or await if needed) inside a try block, on
success set consecutive_errors = 0, and on exception increment
consecutive_errors and log/handle using the existing max_consecutive_errors
logic (use the same logger messages and break behavior), keeping the recv()
exception handling for MediaStreamError and other recv errors unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9433c2e5-b52b-4495-802e-32712d1485cc

📥 Commits

Reviewing files that changed from the base of the PR and between 6411c07 and bb69e0a.

📒 Files selected for processing (3)
  • src/scope/server/cloud_track.py
  • src/scope/server/cloud_webrtc_client.py
  • src/scope/server/tracks.py

Comment on lines +370 to +399
# Poll until remote_streams is populated (RTP packets have arrived)
timeout = 5.0
poll_interval = 0.1
elapsed = 0.0
receiver = None
remote_streams = None

while elapsed < timeout:
    await asyncio.sleep(poll_interval)
    elapsed += poll_interval
    for r in self.pc.getReceivers():
        if r.track and r.track.kind == "video":
            streams = r._RTCRtpReceiver__remote_streams
            if streams:
                receiver = r
                remote_streams = streams
                break
    if remote_streams:
        break

if not remote_streams or not receiver:
    logger.debug("No remote streams after %.1fs, skipping PLI", timeout)
    return

try:
    media_ssrc = next(iter(remote_streams))
    await receiver._send_rtcp_pli(media_ssrc)
    logger.info(f"Sent PLI keyframe request (media_ssrc={media_ssrc})")
except Exception as e:
    logger.debug(f"Could not send PLI: {e}")


⚠️ Potential issue | 🟠 Major

Retry the PLI instead of sending it once.

Polling remote_streams fixes the “too early” case, but after that this code makes exactly one RTCP PLI attempt. If that packet is lost or the sender still is not ready, startup falls back to waiting for the next natural keyframe and the VP8 decode loop can keep failing. Keep a bounded retry schedule until the first frame arrives or the retry budget is exhausted.

🔁 Proposed fix
```diff
-        try:
-            media_ssrc = next(iter(remote_streams))
-            await receiver._send_rtcp_pli(media_ssrc)
-            logger.info(f"Sent PLI keyframe request (media_ssrc={media_ssrc})")
-        except Exception as e:
-            logger.debug(f"Could not send PLI: {e}")
+        media_ssrc = next(iter(remote_streams))
+        for attempt, delay in enumerate((0.0, 0.5, 1.0, 2.0), start=1):
+            if delay:
+                await asyncio.sleep(delay)
+            if self._stats["frames_received"] > 0:
+                return
+            try:
+                await receiver._send_rtcp_pli(media_ssrc)
+                logger.info(
+                    "Sent PLI keyframe request (media_ssrc=%s, attempt %s)",
+                    media_ssrc,
+                    attempt,
+                )
+            except Exception as e:
+                logger.debug("Could not send PLI on attempt %s: %s", attempt, e)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/scope/server/cloud_webrtc_client.py` around lines 370 - 399, The current
PLI send logic polls for remote_streams then sends a single RTCP PLI via
receiver._send_rtcp_pli(media_ssrc) and bails if it fails; change this to retry
sending PLIs on a bounded schedule (e.g., exponential or fixed-interval backoff
with a max attempts/timeout) until a video frame arrives or the retry budget is
exhausted: after detecting remote_streams and selecting receiver/media_ssrc,
loop up to N attempts (or until a global timeout) calling
receiver._send_rtcp_pli(media_ssrc), awaiting a short interval between attempts,
break if the first frame is observed (check r.track.readyState or other
indicator from pc.getReceivers()/remote_streams), and log each attempt and final
failure using logger (replace the single try/except block around
receiver._send_rtcp_pli with this retry loop and preserve existing debug/info
messages).

Comment on lines +63 to +92
```diff
+        consecutive_errors = 0
+        max_consecutive_errors = 10
         while self.input_task_running:
             try:
                 input_frame = await self.track.recv()
+                consecutive_errors = 0

                 # Store raw VideoFrame for later processing (tracks input FPS internally)
                 self.frame_processor.put(input_frame)

             except asyncio.CancelledError:
                 break
-            except Exception as e:
-                # Stop the input loop on connection errors to avoid spam
-                logger.error(f"Error in input loop, stopping: {e}")
+            except MediaStreamError:
+                logger.info("Source track ended")
                 self.input_task_running = False
                 break
+            except Exception as e:
+                consecutive_errors += 1
+                if consecutive_errors >= max_consecutive_errors:
+                    logger.error(
+                        f"Error in input loop, stopping after "
+                        f"{consecutive_errors} consecutive errors: {e}"
+                    )
+                    self.input_task_running = False
+                    break
+                logger.warning(
+                    f"Transient error in input loop "
+                    f"({consecutive_errors}/{max_consecutive_errors}): {e}"
+                )
+                await asyncio.sleep(0.01)
```

⚠️ Potential issue | 🟠 Major

Initialize the processor before starting this retry loop.

recv() is what lazily creates self.frame_processor, but this loop dereferences it earlier. If input starts first, put() can fail on every frame, and because the counter is reset before the call, the loop can stay at 1/10 forever.

🐛 Proposed fix
```diff
     def initialize_input_processing(self, track: MediaStreamTrack):
         self.track = track
+        self.initialize_output_processing()
         self.input_task_running = True
         self.input_task = asyncio.create_task(self.input_loop())
```

```diff
         while self.input_task_running:
             try:
                 input_frame = await self.track.recv()
-                consecutive_errors = 0

                 # Store raw VideoFrame for later processing (tracks input FPS internally)
                 self.frame_processor.put(input_frame)
+                consecutive_errors = 0
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
-        consecutive_errors = 0
-        max_consecutive_errors = 10
-        while self.input_task_running:
-            try:
-                input_frame = await self.track.recv()
-                consecutive_errors = 0
-                # Store raw VideoFrame for later processing (tracks input FPS internally)
-                self.frame_processor.put(input_frame)
-            except asyncio.CancelledError:
-                break
-            except Exception as e:
-                # Stop the input loop on connection errors to avoid spam
-                logger.error(f"Error in input loop, stopping: {e}")
-            except MediaStreamError:
-                logger.info("Source track ended")
-                self.input_task_running = False
-                break
-            except Exception as e:
-                consecutive_errors += 1
-                if consecutive_errors >= max_consecutive_errors:
-                    logger.error(
-                        f"Error in input loop, stopping after "
-                        f"{consecutive_errors} consecutive errors: {e}"
-                    )
-                    self.input_task_running = False
-                    break
-                logger.warning(
-                    f"Transient error in input loop "
-                    f"({consecutive_errors}/{max_consecutive_errors}): {e}"
-                )
-                await asyncio.sleep(0.01)
+        consecutive_errors = 0
+        max_consecutive_errors = 10
+        while self.input_task_running:
+            try:
+                input_frame = await self.track.recv()
+                # Store raw VideoFrame for later processing (tracks input FPS internally)
+                self.frame_processor.put(input_frame)
+                consecutive_errors = 0
+            except asyncio.CancelledError:
+                break
+            except MediaStreamError:
+                logger.info("Source track ended")
+                self.input_task_running = False
+                break
+            except Exception as e:
+                consecutive_errors += 1
+                if consecutive_errors >= max_consecutive_errors:
+                    logger.error(
+                        f"Error in input loop, stopping after "
+                        f"{consecutive_errors} consecutive errors: {e}"
+                    )
+                    self.input_task_running = False
+                    break
+                logger.warning(
+                    f"Transient error in input loop "
+                    f"({consecutive_errors}/{max_consecutive_errors}): {e}"
+                )
+                await asyncio.sleep(0.01)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/scope/server/tracks.py` around lines 63 - 92, The loop calls
self.frame_processor.put() but self.frame_processor is lazily created by await
self.track.recv(), so initialize the processor before entering the retry loop:
perform an initial await self.track.recv() (or call the existing FrameProcessor
initializer) outside/above the while self.input_task_running loop to force
creation of self.frame_processor, handling asyncio.CancelledError and
MediaStreamError the same way as in-loop, then continue into the loop using
self.frame_processor.put() as before so transient recv-driven creation cannot
cause perpetual single-error retries.
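The counter-reset hazard the reviewer describes can be modeled in a few lines. This is a toy model, not the project's code: `budget` and the raised RuntimeError are hypothetical stand-ins for the real loop and a `frame_processor.put()` that fails on every frame.

```python
# Toy model of the reviewer's point: resetting the consecutive-error
# counter *before* the failing call keeps the loop at 1/10 forever,
# while resetting only after success bails out at max_errors.
def run_loop(reset_before_call: bool, max_errors: int = 10, budget: int = 100) -> int:
    errors = 0
    iterations = 0
    for _ in range(budget):
        iterations += 1
        try:
            if reset_before_call:
                errors = 0  # reset happens before the call that raises
            raise RuntimeError("put() failed")  # stands in for frame_processor.put()
        except RuntimeError:
            errors += 1
            if errors >= max_errors:
                break
    return iterations

print(run_loop(reset_before_call=True))   # → 100 (never bails; budget exhausted)
print(run_loop(reset_before_call=False))  # → 10 (bails after max_errors)
```

This is why the suggestion moves `consecutive_errors = 0` to after the `put()` call: the counter should only clear once the whole iteration has actually succeeded.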

@emranemran emranemran merged commit 3a76284 into main Mar 9, 2026
10 checks passed