fix(relay): survive WS disconnects and MV3 worker restarts #17588

Unayung wants to merge 5 commits into openclaw:main from
Conversation
```js
function onRelayClosed(reason) {
  relayWs = null
```
**`onRelayClosed` can null a healthy socket**

`onRelayClosed` unconditionally sets `relayWs = null` at line 117 without checking whether `relayWs` still refers to the socket that actually closed. Here's the race:

1. Old socket `ws_old` is connected, `relayWs = ws_old`.
2. `ws_old` disconnects → `onRelayClosed` fires from `ws_old`'s `onclose` handler.
3. Meanwhile, `scheduleReconnect` fires its timer, `ensureRelayConnection` creates `ws_new`, and at line 73 sets `relayWs = ws_new`.
4. Since JS is single-threaded, the microtask from step 2 can interleave with the async flow from step 3. Specifically, after `ensureRelayConnection` yields at the `await` on line 75 (waiting for `ws_new.onopen`), `onRelayClosed` runs and sets `relayWs = null` — even though `relayWs` now points to `ws_new`, not `ws_old`.

The fix is to check identity before nulling:

```suggestion
function onRelayClosed(reason) {
  // Guard: only act if this close event is for the current socket
  if (relayWs && relayWs.readyState !== WebSocket.CLOSED) return
  relayWs = null
```

Or alternatively, capture a reference in the close handler and compare:

```js
// In ensureRelayConnection, when installing the permanent handler:
ws.onclose = () => { if (relayWs === ws) onRelayClosed('closed') }
ws.onerror = () => { if (relayWs === ws) onRelayClosed('error') }
```

The second approach is more robust since it directly ties the handler to the socket instance.
Path: `assets/chrome-extension/background.js`, lines 116–117.
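The identity-guard pattern described above can be exercised in plain Node. This is a minimal sketch with a mock socket, not the extension code; `MockSocket` and the variable names are illustrative stand-ins:

```javascript
// Minimal Node sketch (mock socket, not the extension code) of the
// identity-guard pattern: a stale socket's close handler must not null
// a module-level reference that already points at a newer socket.
let relayWs = null

class MockSocket {
  constructor() {
    this.onclose = null
  }
  close() {
    if (this.onclose) this.onclose() // deliver the close event
  }
}

function connect() {
  const ws = new MockSocket()
  // Identity guard: only react if this close belongs to the current socket
  ws.onclose = () => {
    if (relayWs === ws) relayWs = null
  }
  relayWs = ws
  return ws
}

const wsOld = connect()
const wsNew = connect() // reconnect replaced relayWs before the old close fired
wsOld.close() // stale close event
const survivedStaleClose = relayWs === wsNew // guard ignored the stale close
wsNew.close() // the current socket closing does clear the reference
console.log(survivedStaleClose, relayWs) // true null
```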
```js
chrome.runtime.onInstalled.addListener(() => {
  // Useful: first-time instructions.
  void chrome.runtime.openOptionsPage()
  // Start keepalive alarm
  chrome.alarms.create('relay-keepalive', { periodInMinutes: 4 })
})

// ===== FIX #5: Keepalive via chrome.alarms =====
// Prevents MV3 service worker from being terminated, preserving debugger sessions
chrome.alarms.create('relay-keepalive', { periodInMinutes: 4 })
```
**Duplicate `chrome.alarms.create` call**

`chrome.alarms.create('relay-keepalive', ...)` is called both inside `onInstalled` (line 670) and at module top-level (line 675). The top-level call runs on every service worker startup (including restarts), while `onInstalled` only fires on install/update. Since the top-level call at line 675 already covers the `onInstalled` case, the call at line 670 is redundant. It's not harmful (calling `create` with the same name replaces the existing alarm), but it's confusing and could be removed to reduce noise.
```suggestion
chrome.runtime.onInstalled.addListener(() => {
  // Useful: first-time instructions.
  void chrome.runtime.openOptionsPage()
})

// ===== FIX #5: Keepalive via chrome.alarms =====
// Prevents MV3 service worker from being terminated, preserving debugger sessions
chrome.alarms.create('relay-keepalive', { periodInMinutes: 4 })
```
Path: `assets/chrome-extension/background.js`, lines 666–675.
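The "same-name create replaces the alarm" behavior the comment relies on can be illustrated with a tiny mock; this is not the real `chrome.alarms` API, just a sketch of why the single top-level call is enough:

```javascript
// Tiny mock (not the real chrome.alarms API) illustrating why the
// module top-level create alone is enough: MV3 re-runs top-level code
// on every worker start, and create() with an existing name replaces
// the alarm rather than adding another one.
const alarms = new Map()
const chromeAlarms = {
  create(name, info) {
    alarms.set(name, info) // same-name create overwrites, never duplicates
  },
}

// Simulate onInstalled plus two service-worker startups
chromeAlarms.create('relay-keepalive', { periodInMinutes: 4 })
chromeAlarms.create('relay-keepalive', { periodInMinutes: 4 })
chromeAlarms.create('relay-keepalive', { periodInMinutes: 4 })
console.log(alarms.size) // 1
```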
```js
reconnectTimer = setTimeout(async () => {
  reconnectTimer = null
```
**Reconnect guard disarmed during async execution**

At line 153, `reconnectTimer = null` is set at the very start of the timer callback, before `ensureRelayConnection()` is called. This disarms the `if (reconnectTimer) return` guard in `scheduleReconnect` for the entire duration of the async reconnection attempt. If `onRelayClosed` fires during this window (e.g., from an old socket), it will call `scheduleReconnect()`, which will successfully schedule a *second* timer since the guard is disarmed. This means two reconnect flows can run concurrently.

Consider adding a boolean flag like `reconnecting = true` to guard against this, or move `reconnectTimer = null` to just before `scheduleReconnect()` is called in the catch block.
Path: `assets/chrome-extension/background.js`, lines 152–153.
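The suggested `reconnecting` flag can be simulated in plain Node. The names here are hypothetical stand-ins for the PR's functions; the point is that the guard stays armed across the async phase, so an overlapping call is coalesced:

```javascript
// Illustrative Node sketch (hypothetical names, not the PR's code) of
// the suggested `reconnecting` flag: the guard stays armed for the
// whole async attempt, so a scheduleReconnect() call that lands
// mid-attempt is coalesced instead of starting a second flow.
let reconnectTimer = null
let reconnecting = false
let attempts = 0

async function ensureRelayConnection() {
  attempts++
  await new Promise((resolve) => setTimeout(resolve, 20)) // simulated connect
}

function scheduleReconnect() {
  if (reconnectTimer || reconnecting) return // guard covers both phases
  reconnectTimer = setTimeout(async () => {
    reconnectTimer = null
    reconnecting = true // timer slot is free; the flag guards the async phase
    try {
      await ensureRelayConnection()
    } finally {
      reconnecting = false
    }
  }, 10)
}

scheduleReconnect()
setTimeout(scheduleReconnect, 15) // fires mid-attempt: coalesced by the flag
setTimeout(() => console.log('attempts:', attempts), 60) // attempts: 1
```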
```js
} catch {
  console.warn(`[OpenClaw Relay] Keepalive: tab ${tabId} debugger lost — triggering restore`)
  void restoreState()
  break
```
**Keepalive calls `restoreState` which may clobber in-memory state**

When the keepalive ping detects a lost debugger, it calls `restoreState()` which reads from `chrome.storage.session` and rebuilds the entire in-memory `tabs`/`tabBySession` maps. However, the current in-memory state may differ from what's persisted (e.g., tabs added/removed since the last `persistState` call, or a reconnection is in progress). `restoreState` also calls `ensureRelayConnection()` internally, which could conflict with an active reconnection attempt.

Consider whether `cleanupTab(tabId)` (to remove just the lost tab) or a more targeted recovery is more appropriate here than a full state restore.
Path: `assets/chrome-extension/background.js`, lines 693–696.
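The targeted recovery the comment suggests can be sketched with plain Maps. `tabs`/`tabBySession` mirror the PR's structures, but this `cleanupTab` is a simplified, hypothetical stand-in:

```javascript
// Toy Node sketch of the targeted recovery suggested above: drop only
// the tab whose debugger was lost instead of rebuilding every in-memory
// map from persisted storage. `tabs`/`tabBySession` mirror the PR's
// structures; this `cleanupTab` is a simplified, hypothetical stand-in.
const tabs = new Map([
  [1, { sessionId: 'A', state: 'connected' }],
  [2, { sessionId: 'B', state: 'connected' }],
])
const tabBySession = new Map([['A', 1], ['B', 2]])

function cleanupTab(tabId) {
  const tab = tabs.get(tabId)
  if (!tab) return
  tabBySession.delete(tab.sessionId) // drop the reverse mapping too
  tabs.delete(tabId)
}

cleanupTab(1) // keepalive detected that tab 1's debugger was lost
console.log([...tabs.keys()]) // [ 2 ]  (tab 2's live state untouched)
```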
Additional Comments (1)
**Race window between temporary and permanent `onclose` handlers**

Path: `assets/chrome-extension/background.js`, lines 85–93.

After the connection promise resolves at line 89 (when `onopen` fires), there is a gap before the permanent `onclose`/`onerror` handlers are installed at lines 92–93. If the WebSocket closes during this window:

1. The temporary `onclose` handler at line 85 fires, calling `reject()` on the already-resolved promise — this is a no-op.
2. The permanent `onclose` handler at line 92 hasn't been installed yet, so `onRelayClosed` is never called.
3. `relayWs` is left pointing to a dead socket, and no reconnect is triggered.

This is an edge case, but it was also present in the original code. Now that reconnection logic depends on `onRelayClosed` being called reliably, this gap becomes more consequential. Consider installing the permanent handlers immediately after creating the WebSocket (before the `await`) and having the temporary handlers coordinate with them, or moving `relayWs = ws` to *after* the handlers are installed.
Thanks @napetrov! Addressing the checklist:

Let me know if anything else is needed!
Additional Comments (1)
**Re-announce loop aborts on first tab failure**

Path: `assets/chrome-extension/background.js`, lines 311–334.

Unlike the equivalent loop in the `scheduleReconnect` callback (lines 178–233), this re-announce loop has no per-tab error handling. If `Target.getTargetInfo` or `sendToRelay` throws for any single tab, the `catch` at line 331 fires and all remaining tabs are silently skipped — only `scheduleReconnect()` is called.

While `scheduleReconnect` would eventually re-announce the skipped tabs (since the WS is already open, `ensureRelayConnection` returns immediately), there's an unnecessary delay equal to the backoff timer. Consider wrapping the per-tab work in its own try/catch, matching the pattern used in the reconnect handler:

```suggestion
// Try to reconnect WS and re-announce
try {
  await ensureRelayConnection()
  for (const [tabId, tab] of tabs.entries()) {
    if (tab.state === 'connected' && tab.sessionId && tab.targetId) {
      try {
        const info = /** @type {any} */ (
          await chrome.debugger.sendCommand({ tabId }, 'Target.getTargetInfo')
        )
        sendToRelay({
          method: 'forwardCDPEvent',
          params: {
            method: 'Target.attachedToTarget',
            params: {
              sessionId: tab.sessionId,
              targetInfo: { ...(info?.targetInfo || {}), attached: true },
              waitingForDebugger: false,
            },
          },
        })
      } catch (err) {
        console.warn(`[OpenClaw Relay] restoreState: failed to re-announce tab ${tabId}:`, err)
        cleanupTab(tabId)
      }
    }
  }
} catch {
  // WS not available yet — will reconnect later
  scheduleReconnect()
}
```
The Browser Relay extension frequently loses connection and fails to recover, requiring manual re-attach. This adds 5 fixes:

1. Don't detach debugger sessions on WebSocket drop — keep them alive and re-announce when WS reconnects
2. Auto-reconnect with exponential backoff (1s → 30s cap) + jitter
3. Persist state to chrome.storage.session for MV3 worker restart recovery, with debugger re-attach on restore
4. Tab lifecycle cleanup via onRemoved/onReplaced listeners
5. Keepalive alarm (every 4 min) to prevent MV3 worker termination and detect stale debugger sessions

manifest.json: add 'alarms' permission for fix openclaw#5.

Tested with 12+ hours continuous uptime on Edge, surviving gateway restarts, sleeping tabs, and automated cron runs.

Fixes openclaw#15099
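The backoff policy in fix #2 can be sketched as a pure function. The helper name and the 25% jitter fraction below are assumptions; only the 1s base and 30s cap come from the description above:

```javascript
// Hypothetical sketch of the backoff policy described in fix #2
// (1s base, 30s cap, plus jitter). The helper name and the 25%
// jitter fraction are assumptions; the PR's constants may differ.
const BASE_MS = 1000
const CAP_MS = 30000

function backoffDelay(attempt, jitterFraction = 0.25) {
  // Exponential growth capped at 30s: 1s, 2s, 4s, ..., 30s
  const exp = Math.min(CAP_MS, BASE_MS * 2 ** attempt)
  // Random jitter de-synchronizes many clients reconnecting at once
  return Math.round(exp + exp * jitterFraction * Math.random())
}

for (let attempt = 0; attempt < 7; attempt++) {
  console.log(attempt, backoffDelay(attempt))
}
```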
Address Greptile review: install permanent onclose/onerror/onmessage handlers immediately after creating the WebSocket, before awaiting the open event. Use a 'connected' flag to gate onRelayClosed() calls. The temporary handlers used during the connection promise are restored to permanent ones via a once-listener on 'open'. This also moves 'relayWs = ws' into the onopen callback so a dead socket is never assigned to the module-level variable. Previously, a WebSocket closing between onopen firing and permanent handler installation would leave relayWs pointing to a dead socket with no reconnect triggered.
Address Greptile review: wrap per-tab Target.getTargetInfo + sendToRelay in individual try/catch so a single tab failure doesn't abort re-announce for remaining tabs. Failed tabs are cleaned up via cleanupTab() instead of falling through to scheduleReconnect() with unnecessary backoff delay.
Force-pushed from 4faa275 to e813abe

… inside onopen before resolve
- Remove duplicate chrome.alarms.create (keep module-level only)
- Fix reconnect guard: keep reconnectTimer set during async work to prevent concurrent reconnects
- Keepalive: clean up individual failed tabs instead of calling restoreState() which could clobber in-memory state
- restoreState: wrap per-tab re-announce in try/catch so one failure doesn't skip remaining tabs
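The persistence round-trip these commits depend on can be sketched in plain Node (`chrome.*` omitted). Maps are not JSON-serializable, so entry arrays go into storage and the Maps are rebuilt on worker startup; key names here are illustrative:

```javascript
// Plain-Node sketch (chrome.* omitted) of the persistence round-trip:
// serialize Map entries into a plain object for chrome.storage.session,
// then rebuild the Map after an MV3 worker restart. Key names here are
// illustrative, not necessarily the PR's.
const tabs = new Map([[1, { sessionId: 'A', state: 'connected' }]])

// persistState: serialize Map entries into a plain object
const raw = JSON.stringify({ tabs: [...tabs.entries()] })

// restoreState after an MV3 worker restart: rebuild the Map
const restored = new Map(JSON.parse(raw).tabs)
console.log(restored.get(1).sessionId) // prints "A"
```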
Summary
The Browser Relay extension frequently loses connection and fails to recover, requiring manual re-attach. This PR adds 5 fixes to make the relay resilient to WebSocket drops, MV3 service worker termination, and tab lifecycle events.
Related issues: #15099, #15989, #11142
Problem
The current `background.js` has several failure modes: `onRelayClosed()` immediately detaches all debugger sessions and clears state. The user must manually click the toolbar icon again to re-attach.

Fixes
- Persist state to `chrome.storage.session` for MV3 worker restarts, with debugger re-attach
- `onRemoved`/`onReplaced` listeners to clean up stale entries
- Keepalive via `chrome.alarms` every 4 min to prevent worker termination + detect stale debuggers

manifest.json
- Add `"alarms"` to permissions array (required for Fix #5)

Testing
Tested on Edge (Chromium-based) with an automated pipeline running via cron at 3:00 AM and 3:00 PM daily:
Greptile Summary
This PR makes the Browser Relay extension resilient to WebSocket disconnects, MV3 service worker termination, and tab lifecycle events by adding five fixes: keeping debugger sessions alive across WS drops, auto-reconnect with exponential backoff, session storage persistence, tab lifecycle cleanup, and a keepalive alarm.
- `onRelayClosed` no longer tears down debugger sessions — it updates badges and preserves in-memory state for reconnection.
- `scheduleReconnect` uses exponential backoff (1s–30s cap) with jitter. On success, it re-validates and re-announces all tabs with thorough per-tab error handling.
- State is persisted to `chrome.storage.session` after attach/detach and restored on service worker startup, including debugger re-attach if needed.
- `onRemoved` and `onReplaced` listeners clean up stale entries.
- A `chrome.alarms` interval pings debuggers and triggers state restore if a debugger is lost.
- The `restoreState` re-announce loop (lines 313–330) lacks per-tab error handling — a single tab failure aborts re-announcement for all remaining tabs, unlike the analogous loop in `scheduleReconnect`.
- `childSessionToTab` is not persisted, so child sessions (iframes, service workers) will be lost across service worker restarts. New child session events will re-populate the map, but the relay server may hold stale references in the interim.

Confidence Score: 3/5
- `assets/chrome-extension/background.js` — the `restoreState` function (lines 310–334) and the race conditions noted in previous review threads

Last reviewed commit: 7ab237e