Fix WebRTC VP8 decode errors and PLI keyframe requests #614
Conversation
📝 Walkthrough

Introduces consecutive-error tolerance (a counter with a maximum of 10) to the server input loops and treats MediaStreamError as end-of-source. Updates cloud_webrtc_client to track and reset consecutive errors, log first-frame progress, and request keyframes by polling for remote streams (0.1s interval, up to 5s), sending a PLI with the remote stream SSRC when available.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as WebRTC Caller
    participant Receiver as Video Receiver
    participant Remote as Remote Streams Registry
    Client->>Receiver: request_keyframe()
    Receiver->>Remote: poll for remote_streams (0.1s intervals, up to 5s)
    alt remote_streams found
        Remote-->>Receiver: provide remote_streams (media SSRC)
        Receiver->>Receiver: extract media_ssrc
        Receiver->>Receiver: call _send_rtcp_pli(media_ssrc)
        Receiver-->>Client: log success / return
    else no remote_streams within timeout
        Remote-->>Receiver: no streams
        Receiver-->>Client: skip PLI, debug log
    end
```
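The consecutive-error tolerance described in the walkthrough can be sketched in isolation. This is a minimal illustration, not the PR's code: `receive_one`, `handle_frame`, and `SourceEnded` (a stand-in for aiortc's `MediaStreamError`) are hypothetical names so the sketch runs without aiortc:

```python
import asyncio

MAX_CONSECUTIVE_ERRORS = 10


class SourceEnded(Exception):
    """Stand-in for aiortc's MediaStreamError (clean end of source)."""


async def receive_loop(receive_one, handle_frame):
    """Pull frames until the source ends, tolerating transient errors."""
    consecutive_errors = 0
    while True:
        try:
            frame = await receive_one()
        except asyncio.CancelledError:
            break
        except SourceEnded:
            break  # source ended cleanly, not an error
        except Exception:
            consecutive_errors += 1
            if consecutive_errors >= MAX_CONSECUTIVE_ERRORS:
                break  # give up after 10 consecutive failures
            await asyncio.sleep(0.01)  # brief backoff before retrying
            continue
        consecutive_errors = 0  # reset only after a successful receive
        handle_frame(frame)
```

The key property is that the counter resets on every success, so only an unbroken run of failures can stop the loop — a decoder that hiccups on scattered P-frames keeps running.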
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/scope/server/cloud_webrtc_client.py`:
- Around line 336-342: The receive-loop branch that logs and breaks when
consecutive_errors >= max_consecutive_errors must also mark the client as failed
by flipping the connection state before exiting; update the block that
references consecutive_errors, max_consecutive_errors and logger.error to set
self.is_connected = False (or call the existing failure/cleanup method if one
exists) immediately prior to the break so callers stop queuing frames to a dead
output path.
- Around line 365-383: The _request_keyframe coroutine can access self.pc after
disconnect() sets it to None, causing AttributeError; to fix, inside the retry
loop (the block that iterates delays and calls self.pc.getReceivers()) snapshot
the peer connection into a local variable (e.g., pc = self.pc), then check if pc
is None or pc.isClosed()/pc.connectionState indicates closed before calling
pc.getReceivers(); skip the attempt or return if closed/None to avoid calling
methods on None. Ensure you reference _request_keyframe, self.pc, and disconnect
when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d78d11f1-6edc-4f6e-ad0e-8bf7a4a1f6d2
📒 Files selected for processing (3)
- src/scope/server/cloud_track.py
- src/scope/server/cloud_webrtc_client.py
- src/scope/server/tracks.py
```python
                consecutive_errors += 1
                if consecutive_errors >= max_consecutive_errors:
                    logger.error(
                        f"Error receiving frame, stopping after "
                        f"{consecutive_errors} consecutive errors: {e}"
                    )
                    break
```
Mark the client failed before exiting the receive loop.

Once this branch trips, the task stops but `is_connected` can still stay true, so callers may keep queueing frames against a dead output path and wait forever for processed video. Flip the connection state here before the `break`.
Suggested fix

```diff
             except Exception as e:
                 consecutive_errors += 1
                 if consecutive_errors >= max_consecutive_errors:
                     logger.error(
                         f"Error receiving frame, stopping after "
                         f"{consecutive_errors} consecutive errors: {e}"
                     )
+                    self._connected = False
+                    self._stats["connection_state"] = "failed"
                     break
```
seems the same as the previous behaviour though? not that that means it's fine, but if this is a bug it doesn't seem like it's being introduced here
@mjh1 — you're right, this isn't a regression introduced by this PR. Neither the old except Exception path nor the new one updated self._connected before breaking, so the gap existed before these changes too.
The suggestion is still worthwhile as a correctness improvement (callers checking is_connected after the loop exits won't know the receive path has died), but it's a pre-existing latent issue rather than something this PR introduces. Happy to leave it for a follow-up if you'd prefer to keep this PR focused on the VP8/PLI fixes.
Would you like me to open a follow-up issue to track it?
```python
        delays = [0.1, 0.5, 1.0]
        for attempt, delay in enumerate(delays):
            await asyncio.sleep(delay)
            sent = False
            for receiver in self.pc.getReceivers():
                if receiver.track and receiver.track.kind == "video":
                    try:
                        await receiver._send_rtcp_pli()
                        logger.info(
                            f"Sent PLI (keyframe request), attempt {attempt + 1}"
                        )
                        sent = True
                    except Exception as e:
                        logger.warning(
                            f"Could not send PLI (attempt {attempt + 1}): {e}"
                        )
            if sent:
                return
        logger.warning("Failed to send PLI after all retry attempts")
```
Add a `self.pc` None check to prevent a crash if the connection closes during the keyframe request.

`_request_keyframe()` runs as a detached task and sleeps up to 1.6s total across retries. If `disconnect()` runs during this window, `self.pc` becomes None (line 428), causing `self.pc.getReceivers()` to crash with AttributeError. Snapshot `self.pc` at each retry attempt and check for None or a closed state before iterating receivers.
Suggested fix

```diff
         delays = [0.1, 0.5, 1.0]
         for attempt, delay in enumerate(delays):
             await asyncio.sleep(delay)
+            pc = self.pc
+            if pc is None or pc.connectionState == "closed":
+                return
             sent = False
-            for receiver in self.pc.getReceivers():
+            for receiver in pc.getReceivers():
                 if receiver.track and receiver.track.kind == "video":
                     try:
                         await receiver._send_rtcp_pli()
```
🚀 fal.ai Preview Deployment

Connect to this preview deployment by running the listed command on your branch. 🧪 E2E tests will run automatically against this deployment.

✅ E2E Tests passed

Test artifacts: check the workflow run for screenshots.
a746734 to 6411c07 (Compare)
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/scope/server/cloud_webrtc_client.py (1)
311-312: Consider moving the import to file level for consistency.

`MediaStreamError` is imported inside the method, but `tracks.py` imports it at the file level (line 8). Since line 24 already imports from `aiortc.mediastreams`, adding `MediaStreamError` there maintains consistency.

Suggested change at line 24:

```diff
-from aiortc.mediastreams import VIDEO_TIME_BASE, MediaStreamTrack
+from aiortc.mediastreams import VIDEO_TIME_BASE, MediaStreamError, MediaStreamTrack
```

Then remove lines 311-312 from `_receive_frames`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/scope/server/cloud_webrtc_client.py`:
- Around line 377-386: The code is brittle because it directly accesses the
name-mangled private attribute receiver._RTCRtpReceiver__remote_streams; replace
this fragile access with a safe runtime compatibility check and explicit
failure: at function entry validate aiortc version (ensure >= 1.14.0) or attempt
to read the attribute with getattr/hasattr (e.g., check for
"_RTCRtpReceiver__remote_streams" or public alternatives) and if missing raise a
clear RuntimeError explaining the required aiortc version; keep the existing
retry/exception handling but log the compatibility error and fail fast instead
of silently continuing, and update documentation/dependencies to declare the
minimum aiortc version.
---
Nitpick comments:
In `@src/scope/server/cloud_webrtc_client.py`:
- Around line 311-312: Move the MediaStreamError import out of the
_receive_frames method and into the file-level imports: add MediaStreamError to
the existing top-level import from aiortc.mediastreams (the same import already
present near the other aiortc imports) and then remove the local "from
aiortc.mediastreams import MediaStreamError" inside _receive_frames so the
symbol is consistently imported at module scope.
🚧 Files skipped from review as they are similar to previous changes (1)
- src/scope/server/cloud_track.py
```python
                    # Get remote SSRC from receiver's internal stream tracking
                    # (required by aiortc >= 1.14.0)
                    remote_streams = receiver._RTCRtpReceiver__remote_streams
                    if not remote_streams:
                        logger.debug(
                            f"No remote streams yet (attempt {attempt + 1})"
                        )
                        continue
                    media_ssrc = next(iter(remote_streams))
                    await receiver._send_rtcp_pli(media_ssrc)
```
🧩 Analysis chain
🌐 Web query:
aiortc 1.14.0 _send_rtcp_pli media_ssrc parameter
💡 Result:
In aiortc 1.14.0, RTCRtpReceiver._send_rtcp_pli(media_ssrc) expects the SSRC of the media source you want the sender to generate a new keyframe for—i.e., the SSRC of the incoming RTP stream (the “media source” SSRC in the RTCP feedback common header). aiortc then puts that value into the RTCP PSFB PLI packet’s media_ssrc field. (raw.githubusercontent.com)
Concretely, aiortc calls it like this when the jitter buffer decides a PLI is needed:

```python
await self._send_rtcp_pli(packet.ssrc)
```

And `_send_rtcp_pli` builds:

```python
RtcpPsfbPacket(fmt=RTCP_PSFB_PLI, ssrc=self.__rtcp_ssrc, media_ssrc=media_ssrc)
```

Notes:
- `ssrc=self.__rtcp_ssrc` is the SSRC of the feedback sender (your receiver endpoint), while `media_ssrc` is the SSRC of the media stream being referenced. (raw.githubusercontent.com)
- If RTX is in use, aiortc unwraps RTX and rewrites the packet SSRC to the original (primary) SSRC before potentially sending PLI, so the PLI targets the main media SSRC, not the RTX SSRC. (raw.githubusercontent.com)
Citations:
- 1: https://raw.githubusercontent.com/aiortc/aiortc/1.14.0/src/aiortc/rtcrtpreceiver.py
- 2: https://raw.githubusercontent.com/aiortc/aiortc/1.14.0/src/aiortc/rtcrtpreceiver.py
- 3: https://raw.githubusercontent.com/aiortc/aiortc/1.14.0/src/aiortc/rtcrtpreceiver.py
- 4: https://raw.githubusercontent.com/aiortc/aiortc/1.14.0/src/aiortc/rtcrtpreceiver.py
- 5: https://raw.githubusercontent.com/aiortc/aiortc/1.14.0/src/aiortc/rtcrtpreceiver.py
Private attribute access is fragile despite defensive handling.

`receiver._RTCRtpReceiver__remote_streams` accesses a name-mangled private attribute. While the code already wraps this in a try/except block with retry logic, the direct access remains vulnerable to silent breakage if aiortc's internals change. The broad exception handling provides defense, but consider adding a runtime version check (the comment indicates aiortc >= 1.14.0 is required) or documenting the minimum aiortc version as a hard dependency to catch incompatibilities earlier.
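A minimal sketch of the getattr-based compatibility check suggested above. The `get_remote_streams` helper and the fake receiver in the usage test are illustrative names, not the PR's code; only the name-mangled attribute itself comes from the review:

```python
def get_remote_streams(receiver):
    """Read aiortc's name-mangled RTCRtpReceiver internals defensively.

    Fails fast with a clear error instead of crashing later if the
    private attribute moved in a different aiortc version.
    """
    streams = getattr(receiver, "_RTCRtpReceiver__remote_streams", None)
    if streams is None:
        raise RuntimeError(
            "RTCRtpReceiver has no __remote_streams mapping; "
            "this PLI path assumes aiortc >= 1.14.0 internals"
        )
    return streams
```

Because `getattr` takes the already-mangled name as a plain string, the lookup works from outside the class, and a missing attribute surfaces as one explicit RuntimeError rather than a silent skip.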
- Add retry logic for transient VP8 decode errors in cloud WebRTC input loops
- Fix PLI keyframe request for aiortc >= 1.14.0 (pass media_ssrc)
- Poll for remote streams before sending PLI instead of blind retries
- Log first frame receipt to verify PLI/keyframe roundtrip

Signed-off-by: emranemran <[email protected]>
6411c07 to bb69e0a (Compare)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/scope/server/cloud_track.py (1)
132-154: ⚠️ Potential issue | 🟠 Major

Reset `consecutive_errors` after the cloud handoff succeeds.

`consecutive_errors` is cleared before `self.frame_processor.put(frame)` runs, so a persistent `put()` failure never reaches the 10-error cutoff — every successful `recv()` zeroes the counter first.

🐛 Proposed fix

```diff
             try:
                 # Get frame from browser
                 frame = await self._source_track.recv()
-                consecutive_errors = 0

                 # Send through FrameProcessor (which relays to cloud)
                 if self.frame_processor:
                     self.frame_processor.put(frame)
+                    consecutive_errors = 0
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/scope/server/cloud_webrtc_client.py`:
- Around line 370-399: The current PLI send logic polls for remote_streams then
sends a single RTCP PLI via receiver._send_rtcp_pli(media_ssrc) and bails if it
fails; change this to retry sending PLIs on a bounded schedule (e.g.,
exponential or fixed-interval backoff with a max attempts/timeout) until a video
frame arrives or the retry budget is exhausted: after detecting remote_streams
and selecting receiver/media_ssrc, loop up to N attempts (or until a global
timeout) calling receiver._send_rtcp_pli(media_ssrc), awaiting a short interval
between attempts, break if the first frame is observed (check r.track.readyState
or other indicator from pc.getReceivers()/remote_streams), and log each attempt
and final failure using logger (replace the single try/except block around
receiver._send_rtcp_pli with this retry loop and preserve existing debug/info
messages).
In `@src/scope/server/tracks.py`:
- Around line 63-92: The loop calls self.frame_processor.put() but
self.frame_processor is lazily created by await self.track.recv(), so initialize
the processor before entering the retry loop: perform an initial await
self.track.recv() (or call the existing FrameProcessor initializer)
outside/above the while self.input_task_running loop to force creation of
self.frame_processor, handling asyncio.CancelledError and MediaStreamError the
same way as in-loop, then continue into the loop using
self.frame_processor.put() as before so transient recv-driven creation cannot
cause perpetual single-error retries.
---
Outside diff comments:
In `@src/scope/server/cloud_track.py`:
- Around line 132-154: The bug is that consecutive_errors is reset immediately
after a successful _source_track.recv() so failures in
self.frame_processor.put(frame) never accumulate; move the reset to only run
after the cloud handoff succeeds by wrapping the handoff in its own try/except:
call self.frame_processor.put(frame) (or await if needed) inside a try block, on
success set consecutive_errors = 0, and on exception increment
consecutive_errors and log/handle using the existing max_consecutive_errors
logic (use the same logger messages and break behavior), keeping the recv()
exception handling for MediaStreamError and other recv errors unchanged.
```python
        # Poll until remote_streams is populated (RTP packets have arrived)
        timeout = 5.0
        poll_interval = 0.1
        elapsed = 0.0
        receiver = None
        remote_streams = None

        while elapsed < timeout:
            await asyncio.sleep(poll_interval)
            elapsed += poll_interval
            for r in self.pc.getReceivers():
                if r.track and r.track.kind == "video":
                    streams = r._RTCRtpReceiver__remote_streams
                    if streams:
                        receiver = r
                        remote_streams = streams
                        break
            if remote_streams:
                break

        if not remote_streams or not receiver:
            logger.debug("No remote streams after %.1fs, skipping PLI", timeout)
            return

        try:
            media_ssrc = next(iter(remote_streams))
            await receiver._send_rtcp_pli(media_ssrc)
            logger.info(f"Sent PLI keyframe request (media_ssrc={media_ssrc})")
        except Exception as e:
            logger.debug(f"Could not send PLI: {e}")
```
Retry the PLI instead of sending it once.

Polling `remote_streams` fixes the "too early" case, but after that this code makes exactly one RTCP PLI attempt. If that packet is lost or the sender is still not ready, startup falls back to waiting for the next natural keyframe and the VP8 decode loop can keep failing. Keep a bounded retry schedule until the first frame arrives or the retry budget is exhausted.
🔁 Proposed fix

```diff
-        try:
-            media_ssrc = next(iter(remote_streams))
-            await receiver._send_rtcp_pli(media_ssrc)
-            logger.info(f"Sent PLI keyframe request (media_ssrc={media_ssrc})")
-        except Exception as e:
-            logger.debug(f"Could not send PLI: {e}")
+        media_ssrc = next(iter(remote_streams))
+        for attempt, delay in enumerate((0.0, 0.5, 1.0, 2.0), start=1):
+            if delay:
+                await asyncio.sleep(delay)
+            if self._stats["frames_received"] > 0:
+                return
+            try:
+                await receiver._send_rtcp_pli(media_ssrc)
+                logger.info(
+                    "Sent PLI keyframe request (media_ssrc=%s, attempt %s)",
+                    media_ssrc,
+                    attempt,
+                )
+            except Exception as e:
+                logger.debug("Could not send PLI on attempt %s: %s", attempt, e)
```
Verify each finding against the current code and only fix it if needed.
In `@src/scope/server/cloud_webrtc_client.py` around lines 370 - 399, The current
PLI send logic polls for remote_streams then sends a single RTCP PLI via
receiver._send_rtcp_pli(media_ssrc) and bails if it fails; change this to retry
sending PLIs on a bounded schedule (e.g., exponential or fixed-interval backoff
with a max attempts/timeout) until a video frame arrives or the retry budget is
exhausted: after detecting remote_streams and selecting receiver/media_ssrc,
loop up to N attempts (or until a global timeout) calling
receiver._send_rtcp_pli(media_ssrc), awaiting a short interval between attempts,
break if the first frame is observed (check r.track.readyState or other
indicator from pc.getReceivers()/remote_streams), and log each attempt and final
failure using logger (replace the single try/except block around
receiver._send_rtcp_pli with this retry loop and preserve existing debug/info
messages).
```diff
         consecutive_errors = 0
         max_consecutive_errors = 10
         while self.input_task_running:
             try:
                 input_frame = await self.track.recv()
                 consecutive_errors = 0

                 # Store raw VideoFrame for later processing (tracks input FPS internally)
                 self.frame_processor.put(input_frame)

             except asyncio.CancelledError:
                 break
-            except Exception as e:
-                # Stop the input loop on connection errors to avoid spam
-                logger.error(f"Error in input loop, stopping: {e}")
+            except MediaStreamError:
+                logger.info("Source track ended")
+                self.input_task_running = False
+                break
+            except Exception as e:
+                consecutive_errors += 1
+                if consecutive_errors >= max_consecutive_errors:
+                    logger.error(
+                        f"Error in input loop, stopping after "
+                        f"{consecutive_errors} consecutive errors: {e}"
+                    )
+                    self.input_task_running = False
+                    break
+                logger.warning(
+                    f"Transient error in input loop "
+                    f"({consecutive_errors}/{max_consecutive_errors}): {e}"
+                )
+                await asyncio.sleep(0.01)
```
Initialize the processor before starting this retry loop.

`recv()` is what lazily creates `self.frame_processor`, but this loop dereferences it earlier. If input starts first, `put()` can fail on every frame, and because the counter is reset before the call, the loop can stay at 1/10 forever.
🐛 Proposed fix

```diff
 def initialize_input_processing(self, track: MediaStreamTrack):
     self.track = track
+    self.initialize_output_processing()
     self.input_task_running = True
     self.input_task = asyncio.create_task(self.input_loop())
```

```diff
         while self.input_task_running:
             try:
                 input_frame = await self.track.recv()
-                consecutive_errors = 0

                 # Store raw VideoFrame for later processing (tracks input FPS internally)
                 self.frame_processor.put(input_frame)
+                consecutive_errors = 0
```
Problem

Cloud relay WebRTC connections would permanently die from transient VP8 decode errors. When a decoder receives P-frames before a keyframe (common at connection start), the exception killed the frame receive loop with no recovery.

Additionally, aiortc 1.14.0 changed `_send_rtcp_pli()` to require a `media_ssrc` argument, causing PLI keyframe requests to fail with warnings on every cloud connection.

Two fixes
1. VP8 decode error retry (`tracks.py`, `cloud_track.py`, `cloud_webrtc_client.py`)

All three frame input loops now tolerate transient errors instead of dying on the first one.

2. PLI keyframe request fix (`cloud_webrtc_client.py`)

Requests a keyframe from the cloud encoder to speed up decoder startup:
- Polls until the `remote_streams` dict is populated (up to 5s)
- Passes the `media_ssrc` (required by aiortc >= 1.14.0)
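The poll-then-act pattern behind the keyframe request can be sketched generically. `poll_until` is a hypothetical helper, not the PR's actual code; in the real client the check would inspect `pc.getReceivers()` for a populated `remote_streams` dict:

```python
import asyncio


async def poll_until(check, timeout=5.0, interval=0.1):
    """Re-run `check` at fixed intervals until it returns a truthy
    value or `timeout` elapses; returns None on timeout."""
    elapsed = 0.0
    while elapsed < timeout:
        await asyncio.sleep(interval)
        elapsed += interval
        result = check()
        if result:
            return result
    return None
```

With the defaults above this makes up to 50 checks at 0.1s spacing, matching the 0.1s/5s numbers in the walkthrough; the caller then sends the PLI only once the polled value (the populated stream registry) is available.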
remote_streamsdict is populated (up to 5s)media_ssrc(required by aiortc >= 1.14.0)How they work together
In practice, the cloud encoder sends a keyframe on its own at connection start, so the first frame usually arrives before the PLI is even sent. The PLI is a safety net, and the retry logic is a safety net for the safety net.
Expected logs
Test plan
uv run daydream-scope)