Skip to content

fix(relay): survive WS disconnects and MV3 worker restarts#17588

Open
Unayung wants to merge 5 commits intoopenclaw:mainfrom
Unayung:fix/relay-disconnect-resilience
Open

fix(relay): survive WS disconnects and MV3 worker restarts#17588
Unayung wants to merge 5 commits intoopenclaw:mainfrom
Unayung:fix/relay-disconnect-resilience

Conversation

@Unayung
Copy link

@Unayung Unayung commented Feb 15, 2026

Summary

The Browser Relay extension frequently loses connection and fails to recover, requiring manual re-attach. This PR adds 5 fixes to make the relay resilient to WebSocket drops, MV3 service worker termination, and tab lifecycle events.

Related issues: #15099, #15989, #11142

Problem

The current background.js has several failure modes:

  1. WebSocket drop → debugger detach: When the relay WebSocket closes (e.g., OpenClaw restart, network blip), onRelayClosed() immediately detaches all debugger sessions and clears state. The user must manually click the toolbar icon again to re-attach.
  2. No auto-reconnect: After a WS disconnect, the extension makes no attempt to reconnect. It silently goes dead.
  3. MV3 service worker death: Chrome/Edge can terminate the service worker at any time. All in-memory state (tabs, sessions) is lost.
  4. Tab close/replace not handled: Stale entries remain in the maps.
  5. Service worker termination: Without periodic activity, Chrome terminates the MV3 service worker after ~30s of inactivity.

Fixes

# Fix Description
1 Don't detach on WS drop Keep debugger attached, show reconnecting badge, re-announce on reconnect
2 Auto-reconnect Exponential backoff (1s → 30s cap) with jitter, re-announce all tabs on success
3 Session storage persistence Save/restore state via chrome.storage.session for MV3 worker restarts, with debugger re-attach
4 Tab lifecycle cleanup onRemoved / onReplaced listeners to clean up stale entries
5 Keepalive alarm chrome.alarms every 4 min to prevent worker termination + detect stale debuggers

manifest.json

Testing

Tested on Edge (Chromium-based) with an automated pipeline running via cron at 3:00 AM and 3:00 PM daily:

  • 12+ hours continuous uptime without disconnects
  • Survived multiple OpenClaw gateway restarts (auto-reconnect worked)
  • Survived Edge Sleeping Tabs (keepalive prevented worker termination)
  • Cron jobs successfully connected to the relay without manual intervention
  • Badge correctly shows ⏳ during reconnect, ✅ on recovery

Greptile Summary

This PR makes the Browser Relay extension resilient to WebSocket disconnects, MV3 service worker termination, and tab lifecycle events by adding five fixes: keeping debugger sessions alive across WS drops, auto-reconnect with exponential backoff, session storage persistence, tab lifecycle cleanup, and a keepalive alarm.

  • Fix fix: add @lid format support and allowFrom wildcard handling #1 (Don't detach on WS drop): onRelayClosed no longer tears down debugger sessions — it updates badges and preserves in-memory state for reconnection.
  • Fix Login fails with 'WebSocket Error (socket hang up)' ECONNRESET #2 (Auto-reconnect): scheduleReconnect uses exponential backoff (1s–30s cap) with jitter. On success, it re-validates and re-announces all tabs with thorough per-tab error handling.
  • Fix WA business, groups & office hours  #3 (Session storage persistence): State is persisted to chrome.storage.session after attach/detach and restored on service worker startup, including debugger re-attach if needed.
  • Fix Images not passed to Claude CLI - only path reference in text #4 (Tab lifecycle cleanup): onRemoved and onReplaced listeners clean up stale entries.
  • Fix CLI: add Opencode integration #5 (Keepalive alarm): 4-minute chrome.alarms interval pings debuggers and triggers state restore if a debugger is lost.
  • Issue: The restoreState re-announce loop (lines 313–330) lacks per-tab error handling — a single tab failure aborts re-announcement for all remaining tabs, unlike the analogous loop in scheduleReconnect.
  • Note: childSessionToTab is not persisted, so child sessions (iframes, service workers) will be lost across service worker restarts. New child session events will re-populate the map, but the relay server may hold stale references in the interim.

Confidence Score: 3/5

  • This PR introduces significant resilience improvements but has several race conditions and error-handling gaps that could cause unexpected behavior under concurrent failure scenarios.
  • The core logic is well-structured and addresses real failure modes effectively. However, the re-announce loop in restoreState lacks per-tab error handling (new finding), and previous review threads identified race conditions in onRelayClosed (socket identity), reconnect guard disarming, and keepalive-triggered restoreState clobbering — all of which remain unresolved. The manifest change is minimal and correct.
  • assets/chrome-extension/background.js — the restoreState function (lines 310–334) and the race conditions noted in previous review threads

Last reviewed commit: 7ab237e

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 5 comments

Edit Code Review Agent Settings | Greptile

Comment on lines 116 to 117
function onRelayClosed(reason) {
relayWs = null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onRelayClosed can null a healthy socket

onRelayClosed unconditionally sets relayWs = null at line 117 without checking whether relayWs still refers to the socket that actually closed. Here's the race:

  1. Old socket ws_old is connected, relayWs = ws_old.
  2. ws_old disconnects → onRelayClosed fires from ws_old's onclose handler.
  3. Meanwhile, scheduleReconnect fires its timer, ensureRelayConnection creates ws_new, and at line 73 sets relayWs = ws_new.
  4. Since JS is single-threaded, the microtask from step 2 can interleave with the async flow from step 3. Specifically, after ensureRelayConnection yields at the await on line 75 (waiting for ws_new.onopen), onRelayClosed runs and sets relayWs = null — even though relayWs now points to ws_new, not ws_old.

The fix is to check identity before nulling:

Suggested change
function onRelayClosed(reason) {
relayWs = null
function onRelayClosed(reason) {
// Guard: only act if this close event is for the current socket
if (relayWs && relayWs.readyState !== WebSocket.CLOSED) return
relayWs = null

Or alternatively, capture a reference in the close handler and compare:

// In ensureRelayConnection, when installing the permanent handler:
ws.onclose = () => { if (relayWs === ws) onRelayClosed('closed') }
ws.onerror = () => { if (relayWs === ws) onRelayClosed('error') }

The second approach is more robust since it directly ties the handler to the socket instance.

Prompt To Fix With AI
This is a comment left during a code review.
Path: assets/chrome-extension/background.js
Line: 116:117

Comment:
**`onRelayClosed` can null a healthy socket**

`onRelayClosed` unconditionally sets `relayWs = null` at line 117 without checking whether `relayWs` still refers to the socket that actually closed. Here's the race:

1. Old socket `ws_old` is connected, `relayWs = ws_old`.
2. `ws_old` disconnects → `onRelayClosed` fires from `ws_old`'s `onclose` handler.
3. Meanwhile, `scheduleReconnect` fires its timer, `ensureRelayConnection` creates `ws_new`, and at line 73 sets `relayWs = ws_new`.
4. Since JS is single-threaded, the microtask from step 2 can interleave with the async flow from step 3. Specifically, after `ensureRelayConnection` yields at the `await` on line 75 (waiting for `ws_new.onopen`), `onRelayClosed` runs and sets `relayWs = null` — even though `relayWs` now points to `ws_new`, not `ws_old`.

The fix is to check identity before nulling:

```suggestion
function onRelayClosed(reason) {
  // Guard: only act if this close event is for the current socket
  if (relayWs && relayWs.readyState !== WebSocket.CLOSED) return
  relayWs = null
```

Or alternatively, capture a reference in the close handler and compare:
```js
// In ensureRelayConnection, when installing the permanent handler:
ws.onclose = () => { if (relayWs === ws) onRelayClosed('closed') }
ws.onerror = () => { if (relayWs === ws) onRelayClosed('error') }
```
The second approach is more robust since it directly ties the handler to the socket instance.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 666 to 675
chrome.runtime.onInstalled.addListener(() => {
// Useful: first-time instructions.
void chrome.runtime.openOptionsPage()
// Start keepalive alarm
chrome.alarms.create('relay-keepalive', { periodInMinutes: 4 })
})

// ===== FIX #5: Keepalive via chrome.alarms =====
// Prevents MV3 service worker from being terminated, preserving debugger sessions
chrome.alarms.create('relay-keepalive', { periodInMinutes: 4 })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate chrome.alarms.create call

chrome.alarms.create('relay-keepalive', ...) is called both inside onInstalled (line 670) and at module top-level (line 675). The top-level call runs on every service worker startup (including restarts), while onInstalled only fires on install/update. Since the top-level call at line 675 already covers the onInstalled case, the call at line 670 is redundant. It's not harmful (calling create with the same name replaces the existing alarm), but it's confusing and could be removed to reduce noise.

Suggested change
chrome.runtime.onInstalled.addListener(() => {
// Useful: first-time instructions.
void chrome.runtime.openOptionsPage()
// Start keepalive alarm
chrome.alarms.create('relay-keepalive', { periodInMinutes: 4 })
})
// ===== FIX #5: Keepalive via chrome.alarms =====
// Prevents MV3 service worker from being terminated, preserving debugger sessions
chrome.alarms.create('relay-keepalive', { periodInMinutes: 4 })
chrome.runtime.onInstalled.addListener(() => {
// Useful: first-time instructions.
void chrome.runtime.openOptionsPage()
})
// ===== FIX #5: Keepalive via chrome.alarms =====
// Prevents MV3 service worker from being terminated, preserving debugger sessions
chrome.alarms.create('relay-keepalive', { periodInMinutes: 4 })

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: assets/chrome-extension/background.js
Line: 666:675

Comment:
**Duplicate `chrome.alarms.create` call**

`chrome.alarms.create('relay-keepalive', ...)` is called both inside `onInstalled` (line 670) and at module top-level (line 675). The top-level call runs on every service worker startup (including restarts), while `onInstalled` only fires on install/update. Since the top-level call at line 675 already covers the `onInstalled` case, the call at line 670 is redundant. It's not harmful (calling `create` with the same name replaces the existing alarm), but it's confusing and could be removed to reduce noise.

```suggestion
chrome.runtime.onInstalled.addListener(() => {
  // Useful: first-time instructions.
  void chrome.runtime.openOptionsPage()
})

// ===== FIX #5: Keepalive via chrome.alarms =====
// Prevents MV3 service worker from being terminated, preserving debugger sessions
chrome.alarms.create('relay-keepalive', { periodInMinutes: 4 })
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 152 to 153
reconnectTimer = setTimeout(async () => {
reconnectTimer = null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reconnect guard disarmed during async execution

At line 153, reconnectTimer = null is set at the very start of the timer callback, before ensureRelayConnection() is called. This disarms the if (reconnectTimer) return guard in scheduleReconnect for the entire duration of the async reconnection attempt. If onRelayClosed fires during this window (e.g., from an old socket), it will call scheduleReconnect(), which will successfully schedule a second timer since the guard is disarmed. This means two reconnect flows can run concurrently.

Consider adding a boolean flag like reconnecting = true to guard against this, or move reconnectTimer = null to just before scheduleReconnect() is called in the catch block.

Prompt To Fix With AI
This is a comment left during a code review.
Path: assets/chrome-extension/background.js
Line: 152:153

Comment:
**Reconnect guard disarmed during async execution**

At line 153, `reconnectTimer = null` is set at the very start of the timer callback, before `ensureRelayConnection()` is called. This disarms the `if (reconnectTimer) return` guard in `scheduleReconnect` for the entire duration of the async reconnection attempt. If `onRelayClosed` fires during this window (e.g., from an old socket), it will call `scheduleReconnect()`, which will successfully schedule a *second* timer since the guard is disarmed. This means two reconnect flows can run concurrently.

Consider adding a boolean flag like `reconnecting = true` to guard against this, or move `reconnectTimer = null` to just before `scheduleReconnect()` is called in the catch block.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 693 to 696
} catch {
console.warn(`[OpenClaw Relay] Keepalive: tab ${tabId} debugger lost — triggering restore`)
void restoreState()
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keepalive calls restoreState which may clobber in-memory state

When the keepalive ping detects a lost debugger, it calls restoreState() which reads from chrome.storage.session and rebuilds the entire in-memory tabs/tabBySession maps. However, the current in-memory state may differ from what's persisted (e.g., tabs added/removed since last persistState call, or a reconnection is in progress). restoreState also calls ensureRelayConnection() internally, which could conflict with an active reconnection attempt.

Consider whether cleanupTab(tabId) (to remove just the lost tab) or a more targeted recovery is more appropriate here than a full state restore.

Prompt To Fix With AI
This is a comment left during a code review.
Path: assets/chrome-extension/background.js
Line: 693:696

Comment:
**Keepalive calls `restoreState` which may clobber in-memory state**

When the keepalive ping detects a lost debugger, it calls `restoreState()` which reads from `chrome.storage.session` and rebuilds the entire in-memory `tabs`/`tabBySession` maps. However, the current in-memory state may differ from what's persisted (e.g., tabs added/removed since last `persistState` call, or a reconnection is in progress). `restoreState` also calls `ensureRelayConnection()` internally, which could conflict with an active reconnection attempt.

Consider whether `cleanupTab(tabId)` (to remove just the lost tab) or a more targeted recovery is more appropriate here than a full state restore.

How can I resolve this? If you propose a fix, please make it concise.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 15, 2026

Additional Comments (1)

assets/chrome-extension/background.js
Race window between temporary and permanent onclose handlers

After the connection promise resolves at line 89 (when onopen fires), there is a gap before the permanent onclose/onerror handlers are installed at lines 92-93. If the WebSocket closes during this window:

  1. The temporary onclose handler at line 85 fires, calling reject() on the already-resolved promise — this is a no-op.
  2. The permanent onclose handler at line 92 hasn't been installed yet, so onRelayClosed is never called.
  3. relayWs is left pointing to a dead socket, and no reconnect is triggered.

This is an edge case, but it was also present in the original code. Now that reconnection logic depends on onRelayClosed being called reliably, this gap becomes more consequential. Consider installing the permanent handlers immediately after creating the WebSocket (before the await) and having the temporary handlers coordinate with them, or moving relayWs = ws to after the handlers are installed.

Prompt To Fix With AI
This is a comment left during a code review.
Path: assets/chrome-extension/background.js
Line: 85:93

Comment:
**Race window between temporary and permanent `onclose` handlers**

After the connection promise resolves at line 89 (when `onopen` fires), there is a gap before the permanent `onclose`/`onerror` handlers are installed at lines 92-93. If the WebSocket closes during this window:

1. The temporary `onclose` handler at line 85 fires, calling `reject()` on the already-resolved promise — this is a no-op.
2. The permanent `onclose` handler at line 92 hasn't been installed yet, so `onRelayClosed` is never called.
3. `relayWs` is left pointing to a dead socket, and no reconnect is triggered.

This is an edge case, but it was also present in the original code. Now that reconnection logic depends on `onRelayClosed` being called reliably, this gap becomes more consequential. Consider installing the permanent handlers immediately after creating the WebSocket (before the `await`) and having the temporary handlers coordinate with them, or moving `relayWs = ws` to *after* the handlers are installed.

How can I resolve this? If you propose a fix, please make it concise.

@Unayung
Copy link
Author

Unayung commented Feb 16, 2026

Thanks @napetrov! Addressing the checklist:

  • Local validation: The extension is a standalone JS file (no build step). Validated by loading the patched extension in Edge, running automated cron jobs (Suno music pipeline at 3:00 AM/PM) through the relay for 12+ hours, and manually testing reconnect scenarios (gateway restart, tab close/replace, service worker termination via chrome://serviceworker-internals). All CI checks pass except formal_conformance which is an unrelated 403 permission error on the bot trying to comment.

  • Focused scope: Single theme — relay disconnect resilience. All 5 fixes address the same root problem (relay losing connection and not recovering).

  • Clear description: Covered in the PR body with problem/fix breakdown.

  • AI-assistance transparency: This PR was co-authored with AI assistance (Claude via OpenClaw agent). The fixes were designed through iterative debugging of real-world disconnect failures, then implemented and validated through 12+ hours of production use. The race condition fix (commit 2) was in response to Greptile's review comment.

Let me know if anything else is needed!

@steipete steipete closed this Feb 16, 2026
@steipete steipete reopened this Feb 17, 2026
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 17, 2026

Additional Comments (1)

assets/chrome-extension/background.js
Re-announce loop aborts on first tab failure

Unlike the equivalent loop in the scheduleReconnect callback (lines 178–233), this re-announce loop has no per-tab error handling. If Target.getTargetInfo or sendToRelay throws for any single tab, the catch at line 331 fires and all remaining tabs are silently skipped — only scheduleReconnect() is called.

While scheduleReconnect would eventually re-announce the skipped tabs (since the WS is already open, ensureRelayConnection returns immediately), there's an unnecessary delay equal to the backoff timer. Consider wrapping the per-tab work in its own try/catch, matching the pattern used in the reconnect handler:

      // Try to reconnect WS and re-announce
      try {
        await ensureRelayConnection()
        for (const [tabId, tab] of tabs.entries()) {
          if (tab.state === 'connected' && tab.sessionId && tab.targetId) {
            try {
              const info = /** @type {any} */ (
                await chrome.debugger.sendCommand({ tabId }, 'Target.getTargetInfo')
              )
              sendToRelay({
                method: 'forwardCDPEvent',
                params: {
                  method: 'Target.attachedToTarget',
                  params: {
                    sessionId: tab.sessionId,
                    targetInfo: { ...(info?.targetInfo || {}), attached: true },
                    waitingForDebugger: false,
                  },
                },
              })
            } catch (err) {
              console.warn(`[OpenClaw Relay] restoreState: failed to re-announce tab ${tabId}:`, err)
              cleanupTab(tabId)
            }
          }
        }
      } catch {
        // WS not available yet — will reconnect later
        scheduleReconnect()
      }
Prompt To Fix With AI
This is a comment left during a code review.
Path: assets/chrome-extension/background.js
Line: 311:334

Comment:
**Re-announce loop aborts on first tab failure**

Unlike the equivalent loop in the `scheduleReconnect` callback (lines 178–233), this re-announce loop has no per-tab error handling. If `Target.getTargetInfo` or `sendToRelay` throws for any single tab, the `catch` at line 331 fires and all remaining tabs are silently skipped — only `scheduleReconnect()` is called.

While `scheduleReconnect` would eventually re-announce the skipped tabs (since the WS is already open, `ensureRelayConnection` returns immediately), there's an unnecessary delay equal to the backoff timer. Consider wrapping the per-tab work in its own try/catch, matching the pattern used in the reconnect handler:

```suggestion
      // Try to reconnect WS and re-announce
      try {
        await ensureRelayConnection()
        for (const [tabId, tab] of tabs.entries()) {
          if (tab.state === 'connected' && tab.sessionId && tab.targetId) {
            try {
              const info = /** @type {any} */ (
                await chrome.debugger.sendCommand({ tabId }, 'Target.getTargetInfo')
              )
              sendToRelay({
                method: 'forwardCDPEvent',
                params: {
                  method: 'Target.attachedToTarget',
                  params: {
                    sessionId: tab.sessionId,
                    targetInfo: { ...(info?.targetInfo || {}), attached: true },
                    waitingForDebugger: false,
                  },
                },
              })
            } catch (err) {
              console.warn(`[OpenClaw Relay] restoreState: failed to re-announce tab ${tabId}:`, err)
              cleanupTab(tabId)
            }
          }
        }
      } catch {
        // WS not available yet — will reconnect later
        scheduleReconnect()
      }
```

How can I resolve this? If you propose a fix, please make it concise.

The Browser Relay extension frequently loses connection and fails to
recover, requiring manual re-attach. This adds 5 fixes:

1. Don't detach debugger sessions on WebSocket drop — keep them alive
   and re-announce when WS reconnects
2. Auto-reconnect with exponential backoff (1s → 30s cap) + jitter
3. Persist state to chrome.storage.session for MV3 worker restart
   recovery, with debugger re-attach on restore
4. Tab lifecycle cleanup via onRemoved/onReplaced listeners
5. Keepalive alarm (every 4 min) to prevent MV3 worker termination
   and detect stale debugger sessions

manifest.json: add 'alarms' permission for fix openclaw#5.

Tested with 12+ hours continuous uptime on Edge, surviving gateway
restarts, sleeping tabs, and automated cron runs.

Fixes openclaw#15099
Address Greptile review: install permanent onclose/onerror/onmessage
handlers immediately after creating the WebSocket, before awaiting
the open event. Use a 'connected' flag to gate onRelayClosed() calls.

The temporary handlers used during the connection promise are restored
to permanent ones via a once-listener on 'open'. This also moves
'relayWs = ws' into the onopen callback so a dead socket is never
assigned to the module-level variable.

Previously, a WebSocket closing between onopen firing and permanent
handler installation would leave relayWs pointing to a dead socket
with no reconnect triggered.
Address Greptile review: wrap per-tab Target.getTargetInfo + sendToRelay
in individual try/catch so a single tab failure doesn't abort re-announce
for remaining tabs. Failed tabs are cleaned up via cleanupTab() instead
of falling through to scheduleReconnect() with unnecessary backoff delay.
@Unayung Unayung force-pushed the fix/relay-disconnect-resilience branch from 4faa275 to e813abe Compare February 17, 2026 02:57
- Remove duplicate chrome.alarms.create (keep module-level only)
- Fix reconnect guard: keep reconnectTimer set during async work to prevent concurrent reconnects
- Keepalive: clean up individual failed tabs instead of calling restoreState() which could clobber in-memory state
- restoreState: wrap per-tab re-announce in try/catch so one failure doesn't skip remaining tabs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

2 participants

Comments