fix(relay): survive WS disconnects and MV3 worker restarts by Unayung · Pull Request #17588 · openclaw/openclaw

Unayung · 2026-02-15T23:33:04Z

Summary

The Browser Relay extension frequently loses connection and fails to recover, requiring manual re-attach. This PR adds 5 fixes to make the relay resilient to WebSocket drops, MV3 service worker termination, and tab lifecycle events.

Related issues: #15099, #15989, #11142

Problem

The current background.js has several failure modes:

WebSocket drop → debugger detach: When the relay WebSocket closes (e.g., OpenClaw restart, network blip), onRelayClosed() immediately detaches all debugger sessions and clears state. The user must manually click the toolbar icon again to re-attach.
No auto-reconnect: After a WS disconnect, the extension makes no attempt to reconnect. It silently goes dead.
MV3 service worker death: Chrome/Edge can terminate the service worker at any time. All in-memory state (tabs, sessions) is lost.
Tab close/replace not handled: Stale entries remain in the maps.
Service worker termination: Without periodic activity, Chrome terminates the MV3 service worker after ~30s of inactivity.

Fixes

#	Fix	Description
1	Don't detach on WS drop	Keep debugger attached, show reconnecting badge, re-announce on reconnect
2	Auto-reconnect	Exponential backoff (1s → 30s cap) with jitter, re-announce all tabs on success
3	Session storage persistence	Save/restore state via `chrome.storage.session` for MV3 worker restarts, with debugger re-attach
4	Tab lifecycle cleanup	`onRemoved` / `onReplaced` listeners to clean up stale entries
5	Keepalive alarm	`chrome.alarms` every 4 min to prevent worker termination + detect stale debuggers

manifest.json

Added "alarms" to permissions array (required for Fix CLI: add Opencode integration #5)

Testing

Tested on Edge (Chromium-based) with an automated pipeline running via cron at 3:00 AM and 3:00 PM daily:

12+ hours continuous uptime without disconnects
Survived multiple OpenClaw gateway restarts (auto-reconnect worked)
Survived Edge Sleeping Tabs (keepalive prevented worker termination)
Cron jobs successfully connected to the relay without manual intervention
Badge correctly shows ⏳ during reconnect, ✅ on recovery

Greptile Summary

This PR makes the Browser Relay extension resilient to WebSocket disconnects, MV3 service worker termination, and tab lifecycle events by adding five fixes: keeping debugger sessions alive across WS drops, auto-reconnect with exponential backoff, session storage persistence, tab lifecycle cleanup, and a keepalive alarm.

Fix fix: add @lid format support and allowFrom wildcard handling #1 (Don't detach on WS drop): onRelayClosed no longer tears down debugger sessions — it updates badges and preserves in-memory state for reconnection.
Fix Login fails with 'WebSocket Error (socket hang up)' ECONNRESET #2 (Auto-reconnect): scheduleReconnect uses exponential backoff (1s–30s cap) with jitter. On success, it re-validates and re-announces all tabs with thorough per-tab error handling.
Fix WA business, groups & office hours #3 (Session storage persistence): State is persisted to chrome.storage.session after attach/detach and restored on service worker startup, including debugger re-attach if needed.
Fix Images not passed to Claude CLI - only path reference in text #4 (Tab lifecycle cleanup): onRemoved and onReplaced listeners clean up stale entries.
Fix CLI: add Opencode integration #5 (Keepalive alarm): 4-minute chrome.alarms interval pings debuggers and triggers state restore if a debugger is lost.
Issue: The restoreState re-announce loop (lines 313–330) lacks per-tab error handling — a single tab failure aborts re-announcement for all remaining tabs, unlike the analogous loop in scheduleReconnect.
Note: childSessionToTab is not persisted, so child sessions (iframes, service workers) will be lost across service worker restarts. New child session events will re-populate the map, but the relay server may hold stale references in the interim.

Confidence Score: 3/5

This PR introduces significant resilience improvements but has several race conditions and error-handling gaps that could cause unexpected behavior under concurrent failure scenarios.
The core logic is well-structured and addresses real failure modes effectively. However, the re-announce loop in restoreState lacks per-tab error handling (new finding), and previous review threads identified race conditions in onRelayClosed (socket identity), reconnect guard disarming, and keepalive-triggered restoreState clobbering — all of which remain unresolved. The manifest change is minimal and correct.
assets/chrome-extension/background.js — the restoreState function (lines 310–334) and the race conditions noted in previous review threads

_{Last reviewed commit: 7ab237e}

greptile-apps

_{2 files reviewed, 5 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-15T23:38:52Z

assets/chrome-extension/background.js

 function onRelayClosed(reason) {
  relayWs = null


onRelayClosed can null a healthy socket

onRelayClosed unconditionally sets relayWs = null at line 117 without checking whether relayWs still refers to the socket that actually closed. Here's the race:

Old socket ws_old is connected, relayWs = ws_old.

ws_old disconnects → onRelayClosed fires from ws_old's onclose handler.

Meanwhile, scheduleReconnect fires its timer, ensureRelayConnection creates ws_new, and at line 73 sets relayWs = ws_new.

Since JS is single-threaded, the microtask from step 2 can interleave with the async flow from step 3. Specifically, after ensureRelayConnection yields at the await on line 75 (waiting for ws_new.onopen), onRelayClosed runs and sets relayWs = null — even though relayWs now points to ws_new, not ws_old.

The fix is to check identity before nulling:

Suggested change

function onRelayClosed(reason) {

relayWs = null

function onRelayClosed(reason) {

// Guard: only act if this close event is for the current socket

if (relayWs && relayWs.readyState !== WebSocket.CLOSED) return

relayWs = null

Or alternatively, capture a reference in the close handler and compare:

// In ensureRelayConnection, when installing the permanent handler: ws.onclose = () => { if (relayWs === ws) onRelayClosed('closed') } ws.onerror = () => { if (relayWs === ws) onRelayClosed('error') }

The second approach is more robust since it directly ties the handler to the socket instance.

Prompt To Fix With AI

This is a comment left during a code review. Path: assets/chrome-extension/background.js Line: 116:117 Comment: **`onRelayClosed` can null a healthy socket** `onRelayClosed` unconditionally sets `relayWs = null` at line 117 without checking whether `relayWs` still refers to the socket that actually closed. Here's the race: 1. Old socket `ws_old` is connected, `relayWs = ws_old`. 2. `ws_old` disconnects → `onRelayClosed` fires from `ws_old`'s `onclose` handler. 3. Meanwhile, `scheduleReconnect` fires its timer, `ensureRelayConnection` creates `ws_new`, and at line 73 sets `relayWs = ws_new`. 4. Since JS is single-threaded, the microtask from step 2 can interleave with the async flow from step 3. Specifically, after `ensureRelayConnection` yields at the `await` on line 75 (waiting for `ws_new.onopen`), `onRelayClosed` runs and sets `relayWs = null` — even though `relayWs` now points to `ws_new`, not `ws_old`. The fix is to check identity before nulling: ```suggestion function onRelayClosed(reason) { // Guard: only act if this close event is for the current socket if (relayWs && relayWs.readyState !== WebSocket.CLOSED) return relayWs = null ``` Or alternatively, capture a reference in the close handler and compare: ```js // In ensureRelayConnection, when installing the permanent handler: ws.onclose = () => { if (relayWs === ws) onRelayClosed('closed') } ws.onerror = () => { if (relayWs === ws) onRelayClosed('error') } ``` The second approach is more robust since it directly ties the handler to the socket instance. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-02-15T23:38:53Z

assets/chrome-extension/background.js

 chrome.runtime.onInstalled.addListener(() => {
  // Useful: first-time instructions.
  void chrome.runtime.openOptionsPage()
+  // Start keepalive alarm
+  chrome.alarms.create('relay-keepalive', { periodInMinutes: 4 })
+})
+
+// ===== FIX #5: Keepalive via chrome.alarms =====
+// Prevents MV3 service worker from being terminated, preserving debugger sessions
+chrome.alarms.create('relay-keepalive', { periodInMinutes: 4 })


Duplicate chrome.alarms.create call

chrome.alarms.create('relay-keepalive', ...) is called both inside onInstalled (line 670) and at module top-level (line 675). The top-level call runs on every service worker startup (including restarts), while onInstalled only fires on install/update. Since the top-level call at line 675 already covers the onInstalled case, the call at line 670 is redundant. It's not harmful (calling create with the same name replaces the existing alarm), but it's confusing and could be removed to reduce noise.

Suggested change

chrome.runtime.onInstalled.addListener(() => {

// Useful: first-time instructions.

void chrome.runtime.openOptionsPage()

// Start keepalive alarm

chrome.alarms.create('relay-keepalive', { periodInMinutes: 4 })

})

// ===== FIX #5: Keepalive via chrome.alarms =====

// Prevents MV3 service worker from being terminated, preserving debugger sessions

chrome.alarms.create('relay-keepalive', { periodInMinutes: 4 })

chrome.runtime.onInstalled.addListener(() => {

// Useful: first-time instructions.

void chrome.runtime.openOptionsPage()

})

// ===== FIX #5: Keepalive via chrome.alarms =====

// Prevents MV3 service worker from being terminated, preserving debugger sessions

chrome.alarms.create('relay-keepalive', { periodInMinutes: 4 })

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

Prompt To Fix With AI

This is a comment left during a code review. Path: assets/chrome-extension/background.js Line: 666:675 Comment: **Duplicate `chrome.alarms.create` call** `chrome.alarms.create('relay-keepalive', ...)` is called both inside `onInstalled` (line 670) and at module top-level (line 675). The top-level call runs on every service worker startup (including restarts), while `onInstalled` only fires on install/update. Since the top-level call at line 675 already covers the `onInstalled` case, the call at line 670 is redundant. It's not harmful (calling `create` with the same name replaces the existing alarm), but it's confusing and could be removed to reduce noise. ```suggestion chrome.runtime.onInstalled.addListener(() => { // Useful: first-time instructions. void chrome.runtime.openOptionsPage() }) // ===== FIX #5: Keepalive via chrome.alarms ===== // Prevents MV3 service worker from being terminated, preserving debugger sessions chrome.alarms.create('relay-keepalive', { periodInMinutes: 4 }) ``` <sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub> How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-02-15T23:38:57Z

assets/chrome-extension/background.js

+  reconnectTimer = setTimeout(async () => {
+    reconnectTimer = null


Reconnect guard disarmed during async execution

At line 153, reconnectTimer = null is set at the very start of the timer callback, before ensureRelayConnection() is called. This disarms the if (reconnectTimer) return guard in scheduleReconnect for the entire duration of the async reconnection attempt. If onRelayClosed fires during this window (e.g., from an old socket), it will call scheduleReconnect(), which will successfully schedule a second timer since the guard is disarmed. This means two reconnect flows can run concurrently.

Consider adding a boolean flag like reconnecting = true to guard against this, or move reconnectTimer = null to just before scheduleReconnect() is called in the catch block.

Prompt To Fix With AI

This is a comment left during a code review. Path: assets/chrome-extension/background.js Line: 152:153 Comment: **Reconnect guard disarmed during async execution** At line 153, `reconnectTimer = null` is set at the very start of the timer callback, before `ensureRelayConnection()` is called. This disarms the `if (reconnectTimer) return` guard in `scheduleReconnect` for the entire duration of the async reconnection attempt. If `onRelayClosed` fires during this window (e.g., from an old socket), it will call `scheduleReconnect()`, which will successfully schedule a *second* timer since the guard is disarmed. This means two reconnect flows can run concurrently. Consider adding a boolean flag like `reconnecting = true` to guard against this, or move `reconnectTimer = null` to just before `scheduleReconnect()` is called in the catch block. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-02-15T23:38:58Z

assets/chrome-extension/background.js

+    } catch {
+      console.warn(`[OpenClaw Relay] Keepalive: tab ${tabId} debugger lost — triggering restore`)
+      void restoreState()
+      break


Keepalive calls restoreState which may clobber in-memory state

When the keepalive ping detects a lost debugger, it calls restoreState() which reads from chrome.storage.session and rebuilds the entire in-memory tabs/tabBySession maps. However, the current in-memory state may differ from what's persisted (e.g., tabs added/removed since last persistState call, or a reconnection is in progress). restoreState also calls ensureRelayConnection() internally, which could conflict with an active reconnection attempt.

Consider whether cleanupTab(tabId) (to remove just the lost tab) or a more targeted recovery is more appropriate here than a full state restore.

Prompt To Fix With AI

This is a comment left during a code review. Path: assets/chrome-extension/background.js Line: 693:696 Comment: **Keepalive calls `restoreState` which may clobber in-memory state** When the keepalive ping detects a lost debugger, it calls `restoreState()` which reads from `chrome.storage.session` and rebuilds the entire in-memory `tabs`/`tabBySession` maps. However, the current in-memory state may differ from what's persisted (e.g., tabs added/removed since last `persistState` call, or a reconnection is in progress). `restoreState` also calls `ensureRelayConnection()` internally, which could conflict with an active reconnection attempt. Consider whether `cleanupTab(tabId)` (to remove just the lost tab) or a more targeted recovery is more appropriate here than a full state restore. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-02-15T23:39:01Z

Additional Comments (1)

assets/chrome-extension/background.js
Race window between temporary and permanent onclose handlers

After the connection promise resolves at line 89 (when onopen fires), there is a gap before the permanent onclose/onerror handlers are installed at lines 92-93. If the WebSocket closes during this window:

The temporary onclose handler at line 85 fires, calling reject() on the already-resolved promise — this is a no-op.
The permanent onclose handler at line 92 hasn't been installed yet, so onRelayClosed is never called.
relayWs is left pointing to a dead socket, and no reconnect is triggered.

This is an edge case, but it was also present in the original code. Now that reconnection logic depends on onRelayClosed being called reliably, this gap becomes more consequential. Consider installing the permanent handlers immediately after creating the WebSocket (before the await) and having the temporary handlers coordinate with them, or moving relayWs = ws to after the handlers are installed.

Prompt To Fix With AI

This is a comment left during a code review.
Path: assets/chrome-extension/background.js
Line: 85:93

Comment:
**Race window between temporary and permanent `onclose` handlers**

After the connection promise resolves at line 89 (when `onopen` fires), there is a gap before the permanent `onclose`/`onerror` handlers are installed at lines 92-93. If the WebSocket closes during this window:

1. The temporary `onclose` handler at line 85 fires, calling `reject()` on the already-resolved promise — this is a no-op.
2. The permanent `onclose` handler at line 92 hasn't been installed yet, so `onRelayClosed` is never called.
3. `relayWs` is left pointing to a dead socket, and no reconnect is triggered.

This is an edge case, but it was also present in the original code. Now that reconnection logic depends on `onRelayClosed` being called reliably, this gap becomes more consequential. Consider installing the permanent handlers immediately after creating the WebSocket (before the `await`) and having the temporary handlers coordinate with them, or moving `relayWs = ws` to *after* the handlers are installed.

How can I resolve this? If you propose a fix, please make it concise.

Unayung · 2026-02-16T02:08:19Z

Thanks @napetrov! Addressing the checklist:

Local validation: The extension is a standalone JS file (no build step). Validated by loading the patched extension in Edge, running automated cron jobs (Suno music pipeline at 3:00 AM/PM) through the relay for 12+ hours, and manually testing reconnect scenarios (gateway restart, tab close/replace, service worker termination via chrome://serviceworker-internals). All CI checks pass except formal_conformance which is an unrelated 403 permission error on the bot trying to comment.
Focused scope: Single theme — relay disconnect resilience. All 5 fixes address the same root problem (relay losing connection and not recovering).
Clear description: Covered in the PR body with problem/fix breakdown.
AI-assistance transparency: This PR was co-authored with AI assistance (Claude via OpenClaw agent). The fixes were designed through iterative debugging of real-world disconnect failures, then implemented and validated through 12+ hours of production use. The race condition fix (commit 2) was in response to Greptile's review comment.

Let me know if anything else is needed!

greptile-apps

_{2 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-17T01:51:39Z

Additional Comments (1)

assets/chrome-extension/background.js
Re-announce loop aborts on first tab failure

Unlike the equivalent loop in the scheduleReconnect callback (lines 178–233), this re-announce loop has no per-tab error handling. If Target.getTargetInfo or sendToRelay throws for any single tab, the catch at line 331 fires and all remaining tabs are silently skipped — only scheduleReconnect() is called.

While scheduleReconnect would eventually re-announce the skipped tabs (since the WS is already open, ensureRelayConnection returns immediately), there's an unnecessary delay equal to the backoff timer. Consider wrapping the per-tab work in its own try/catch, matching the pattern used in the reconnect handler:

      // Try to reconnect WS and re-announce
      try {
        await ensureRelayConnection()
        for (const [tabId, tab] of tabs.entries()) {
          if (tab.state === 'connected' && tab.sessionId && tab.targetId) {
            try {
              const info = /** @type {any} */ (
                await chrome.debugger.sendCommand({ tabId }, 'Target.getTargetInfo')
              )
              sendToRelay({
                method: 'forwardCDPEvent',
                params: {
                  method: 'Target.attachedToTarget',
                  params: {
                    sessionId: tab.sessionId,
                    targetInfo: { ...(info?.targetInfo || {}), attached: true },
                    waitingForDebugger: false,
                  },
                },
              })
            } catch (err) {
              console.warn(`[OpenClaw Relay] restoreState: failed to re-announce tab ${tabId}:`, err)
              cleanupTab(tabId)
            }
          }
        }
      } catch {
        // WS not available yet — will reconnect later
        scheduleReconnect()
      }

Prompt To Fix With AI

This is a comment left during a code review.
Path: assets/chrome-extension/background.js
Line: 311:334

Comment:
**Re-announce loop aborts on first tab failure**

Unlike the equivalent loop in the `scheduleReconnect` callback (lines 178–233), this re-announce loop has no per-tab error handling. If `Target.getTargetInfo` or `sendToRelay` throws for any single tab, the `catch` at line 331 fires and all remaining tabs are silently skipped — only `scheduleReconnect()` is called.

While `scheduleReconnect` would eventually re-announce the skipped tabs (since the WS is already open, `ensureRelayConnection` returns immediately), there's an unnecessary delay equal to the backoff timer. Consider wrapping the per-tab work in its own try/catch, matching the pattern used in the reconnect handler:

```suggestion
      // Try to reconnect WS and re-announce
      try {
        await ensureRelayConnection()
        for (const [tabId, tab] of tabs.entries()) {
          if (tab.state === 'connected' && tab.sessionId && tab.targetId) {
            try {
              const info = /** @type {any} */ (
                await chrome.debugger.sendCommand({ tabId }, 'Target.getTargetInfo')
              )
              sendToRelay({
                method: 'forwardCDPEvent',
                params: {
                  method: 'Target.attachedToTarget',
                  params: {
                    sessionId: tab.sessionId,
                    targetInfo: { ...(info?.targetInfo || {}), attached: true },
                    waitingForDebugger: false,
                  },
                },
              })
            } catch (err) {
              console.warn(`[OpenClaw Relay] restoreState: failed to re-announce tab ${tabId}:`, err)
              cleanupTab(tabId)
            }
          }
        }
      } catch {
        // WS not available yet — will reconnect later
        scheduleReconnect()
      }
```

How can I resolve this? If you propose a fix, please make it concise.

The Browser Relay extension frequently loses connection and fails to recover, requiring manual re-attach. This adds 5 fixes: 1. Don't detach debugger sessions on WebSocket drop — keep them alive and re-announce when WS reconnects 2. Auto-reconnect with exponential backoff (1s → 30s cap) + jitter 3. Persist state to chrome.storage.session for MV3 worker restart recovery, with debugger re-attach on restore 4. Tab lifecycle cleanup via onRemoved/onReplaced listeners 5. Keepalive alarm (every 4 min) to prevent MV3 worker termination and detect stale debugger sessions manifest.json: add 'alarms' permission for fix openclaw#5. Tested with 12+ hours continuous uptime on Edge, surviving gateway restarts, sleeping tabs, and automated cron runs. Fixes openclaw#15099

Address Greptile review: install permanent onclose/onerror/onmessage handlers immediately after creating the WebSocket, before awaiting the open event. Use a 'connected' flag to gate onRelayClosed() calls. The temporary handlers used during the connection promise are restored to permanent ones via a once-listener on 'open'. This also moves 'relayWs = ws' into the onopen callback so a dead socket is never assigned to the module-level variable. Previously, a WebSocket closing between onopen firing and permanent handler installation would leave relayWs pointing to a dead socket with no reconnect triggered.

Address Greptile review: wrap per-tab Target.getTargetInfo + sendToRelay in individual try/catch so a single tab failure doesn't abort re-announce for remaining tabs. Failed tabs are cleaned up via cleanupTab() instead of falling through to scheduleReconnect() with unnecessary backoff delay.

… inside onopen before resolve

- Remove duplicate chrome.alarms.create (keep module-level only) - Fix reconnect guard: keep reconnectTimer set during async work to prevent concurrent reconnects - Keepalive: clean up individual failed tabs instead of calling restoreState() which could clobber in-memory state - restoreState: wrap per-tab re-announce in try/catch so one failure doesn't skip remaining tabs

openclaw-barnacle bot added the size: M label Feb 15, 2026

greptile-apps bot reviewed Feb 15, 2026

View reviewed changes

steipete closed this Feb 16, 2026

steipete reopened this Feb 17, 2026

greptile-apps bot reviewed Feb 17, 2026

View reviewed changes

Unayung added 3 commits February 17, 2026 10:57

Unayung force-pushed the fix/relay-disconnect-resilience branch from 4faa275 to e813abe Compare February 17, 2026 02:57

Unayung added 2 commits February 18, 2026 20:06

fix: simplify WebSocket onclose race fix - install permanent handlers…

7435188

… inside onopen before resolve

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(relay): survive WS disconnects and MV3 worker restarts#17588

fix(relay): survive WS disconnects and MV3 worker restarts#17588
Unayung wants to merge 5 commits intoopenclaw:mainfrom
Unayung:fix/relay-disconnect-resilience

Unayung commented Feb 15, 2026 •

edited by greptile-apps bot

Loading

Uh oh!

greptile-apps bot left a comment

Uh oh!

greptile-apps bot Feb 15, 2026

Uh oh!

greptile-apps bot Feb 15, 2026

Uh oh!

greptile-apps bot Feb 15, 2026

Uh oh!

greptile-apps bot Feb 15, 2026

Uh oh!

greptile-apps bot commented Feb 15, 2026

Uh oh!

Unayung commented Feb 16, 2026

Uh oh!

greptile-apps bot left a comment

Uh oh!

greptile-apps bot commented Feb 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

		reconnectTimer = setTimeout(async () => {
		reconnectTimer = null

Uh oh!

Conversation

Unayung commented Feb 15, 2026 • edited by greptile-apps bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Fixes

manifest.json

Testing

Greptile Summary

Confidence Score: 3/5

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 15, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 15, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 15, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 15, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Feb 15, 2026

Uh oh!

Unayung commented Feb 16, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Feb 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Unayung commented Feb 15, 2026 •

edited by greptile-apps bot

Loading