fix(gateway): prevent probe timeout from deferred ESM module evaluation #48270

Open
wongcode wants to merge 5 commits into openclaw:main from wongcode:fix/probe-event-loop-starvation

Conversation

@wongcode commented Mar 16, 2026

Summary

  • Fixes gateway probe always reporting timeout on Windows after upgrading to 2026.3.13
  • Adds waitForEventLoopReady() before opening the probe WebSocket to ensure deferred ESM module evaluation has completed

Root cause

The auth-profiles ESM bundle triggers deferred synchronous work (primarily AJV schema compilation) that blocks the Node.js event loop for ~7 seconds after the top-level import() promise resolves. This blocking starts after the first event loop cycle completes — setTimeout(0) fires on time, but setTimeout(100) is delayed by 7+ seconds.

The probe's resolveProbeBudgetMs caps local loopback budget at 800ms and the overall default is 3000ms. Both expire while the event loop is blocked, because the WebSocket's open/message callbacks cannot fire until the synchronous work finishes.

Evidence from debugging on a Windows 10 machine with Node 24.14:

| Test | Connect time |
|---|---|
| Raw net.connect after import | 3ms |
| http.request after import | ~7000ms |
| ws WebSocket after import | ~7000ms |
| ws WebSocket without import | 8ms |
| Event loop stall detected via setInterval(100) | 7234ms |

The gateway status command (which uses callGateway with a 10s timeout) was unaffected because its budget outlasts the stall.
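The stall figure in the evidence above comes from watching how late interval ticks arrive. A minimal sketch of such a detector follows. The names `measureMaxStall` and `blockEventLoop` are hypothetical and are not part of the patch; `blockEventLoop` only stands in for the deferred synchronous work.

```typescript
// Hypothetical helper: reports the largest lateness (in ms) of any interval
// tick observed during the sampling window.
function measureMaxStall(windowMs: number, intervalMs = 100): Promise<number> {
  return new Promise<number>((resolve) => {
    const startedAt = Date.now();
    let prev = startedAt;
    let maxStallMs = 0;

    const timer = setInterval(() => {
      const now = Date.now();
      // A tick that arrives late means something held the event loop.
      maxStallMs = Math.max(maxStallMs, now - prev - intervalMs);
      prev = now;
      if (now - startedAt >= windowMs) {
        clearInterval(timer);
        resolve(maxStallMs);
      }
    }, intervalMs);
  });
}

// Stand-in for deferred synchronous work such as schema compilation.
function blockEventLoop(ms: number): void {
  const until = Date.now() + ms;
  while (Date.now() < until) {
    // Busy-wait on purpose; no callbacks can run while this loop spins.
  }
}
```

Starting `measureMaxStall(1000)` and then scheduling `blockEventLoop(300)` from a short timer should report a stall close to the blocked duration, which is the same kind of measurement as the `setInterval(100)` row in the table.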

Fix

waitForEventLoopReady() schedules 20ms timers and checks for abnormal drift (> 200ms). It resolves only after two consecutive on-time callbacks, guaranteeing the deferred evaluation has finished. On systems without the blocking issue, this adds only ~40ms overhead.

A longer-term fix would be to lazy-compile AJV schemas instead of evaluating them at module scope, which would eliminate the event loop stall entirely.
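As an illustration of that longer-term direction, the sketch below defers an expensive compile step until first use. It assumes the schemas are currently compiled at module scope. `lazy`, `compileSchema`, and `isValidProfile` are hypothetical names, and `compileSchema` is a toy stand-in for a real compile call such as AJV's `ajv.compile(schema)`.

```typescript
// Generic memoizing wrapper: runs `factory` on the first call only.
function lazy<T>(factory: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (cached === undefined) {
      cached = { value: factory() };
    }
    return cached.value;
  };
}

type Validator = (data: unknown) => boolean;

// Hypothetical stand-in for an expensive schema compile step.
let compileCount = 0;
function compileSchema(requiredKey: string): Validator {
  compileCount++;
  return (data) =>
    typeof data === "object" && data !== null && requiredKey in data;
}

// Eager form (runs during module evaluation):
//   const validateProfile = compileSchema("profileId");
// Lazy form: nothing is compiled until the first validation call.
const getProfileValidator = lazy(() => compileSchema("profileId"));

function isValidProfile(data: unknown): boolean {
  return getProfileValidator()(data);
}
```

With this shape, importing the module costs nothing beyond defining the closure; the compile cost is paid once, on the first call to `isValidProfile`.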

Test plan

  • Verified openclaw gateway probe returns Reachable: yes (21ms latency) on the affected Windows machine after patch
  • Existing probe.test.ts uses mocked GatewayClient, so waitForEventLoopReady completes instantly — no test breakage expected
  • CI tests pass

Related issues

Fixes #45940: False negative from openclaw gateway probe on Windows
Fixes #46226: Gateway probe shows 3000ms budget but uses 800ms internally, causing a false timeout on healthy local loopback
Related #46316: devices list / nodes status timeout while gateway status shows RPC probe: ok (regression in 2026.3.12/2026.3.13)
Related #46000: Windows local gateway reissues operator device token without operator.read on 2026.3.13, breaking status/probe/health
Related #47640, #47307

https://www.answeroverflow.com/m/1482583046749163692

🤖 Generated with Claude Code

On Windows (and potentially other platforms with slower module evaluation),
the auth-profiles ESM bundle triggers deferred synchronous work (primarily
AJV schema compilation) that blocks the event loop for ~7 seconds *after*
the top-level import promise resolves. The probe's 800ms loopback budget
expires during this window because WebSocket data callbacks cannot fire,
causing `gateway probe` to always report "timeout" on 2026.3.13.

Add `waitForEventLoopReady()` that schedules short timers and watches for
abnormal drift, resolving only after two consecutive on-time callbacks.
This guarantees deferred module evaluation has finished before opening
any network connections. On unaffected systems this adds ~40ms overhead.

Fixes: probe timeout regression on Windows after upgrading to 2026.3.13
Related: openclaw#47640, openclaw#47307

Co-Authored-By: Claude Opus 4.6 <[email protected]>

@openclaw-barnacle bot added the gateway (Gateway runtime) and size: XS labels Mar 16, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ddaff6dc3c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

}): Promise<GatewayProbeResult> {
// Ensure the event loop is not starved by deferred module evaluation before
// opening any network connections (see waitForEventLoopReady jsdoc).
await waitForEventLoopReady();

[P2] Bound event-loop readiness wait by probe timeout

probeGateway now awaits waitForEventLoopReady() before it starts the probe timeout timer, but waitForEventLoopReady() has no upper bound and only resolves after two sub-200ms timer intervals. In environments with sustained event-loop starvation (for example, recurring synchronous work that blocks for >200ms each cycle), this preflight can loop forever, so opts.timeoutMs is never enforced and gateway probe/status can hang indefinitely instead of timing out.

@greptile-apps bot (Contributor) commented Mar 16, 2026

Greptile Summary

This PR fixes a Windows-specific regression where gateway probe always reported timeout after upgrading to 2026.3.13. The root cause was deferred synchronous work (AJV schema compilation) in the auth-profiles ESM bundle blocking the Node.js event loop for ~7 seconds after the import() promise resolves, preventing WebSocket callbacks from firing within the probe's budget.

The fix adds waitForEventLoopReady() — a polling helper that schedules 20 ms timers, measures drift, and resolves only after two consecutive on-time callbacks — called once at the top of probeGateway.

Key concern:

  • waitForEventLoopReady has no upper-bound timeout. If drift stays above 200 ms continuously (e.g. a non-terminating stall or heavy OS scheduler pressure), the function never resolves and probeGateway hangs indefinitely — even though the caller provided a timeoutMs budget, which is only activated after the wait returns. Adding a maxWaitMs parameter (or racing against opts.timeoutMs) would bound the total probe duration regardless of stall duration.
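The "racing against opts.timeoutMs" option mentioned in the key concern could look roughly like the sketch below. `raceAgainstBudget` is a hypothetical wrapper, not code from the PR, and it takes any preflight promise rather than `waitForEventLoopReady` specifically.

```typescript
// Hypothetical wrapper: resolves "ready" if `preflight` settles first,
// or "timed-out" once `budgetMs` has elapsed.
function raceAgainstBudget(
  preflight: Promise<void>,
  budgetMs: number,
): Promise<"ready" | "timed-out"> {
  return new Promise<"ready" | "timed-out">((resolve, reject) => {
    const timer = setTimeout(() => resolve("timed-out"), budgetMs);
    preflight.then(
      () => {
        // Clear the timer so it cannot keep the process alive.
        clearTimeout(timer);
        resolve("ready");
      },
      (err: unknown) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

One limitation worth keeping in mind: the budget timer is itself an event-loop callback, so it cannot fire during a single uninterrupted synchronous block. The race bounds a preflight that keeps rescheduling itself between shorter stalls, which is the scenario the concern describes.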

Confidence Score: 3/5

  • The fix correctly solves the reported Windows timeout bug, but an unbounded wait in waitForEventLoopReady could cause probeGateway to hang forever if the event loop never recovers.
  • The approach is sound and well-tested for the reported scenario. The risk is that on systems with persistent or non-terminating event loop stalls, the lack of a maximum wait time means probeGateway can hang indefinitely, completely ignoring the caller-supplied timeoutMs. This is a real, fixable issue worth addressing before merge.
  • src/gateway/probe.ts — specifically the waitForEventLoopReady function and the missing maximum wait cap.

Last reviewed commit: ddaff6d

Comment on lines +46 to +68
function waitForEventLoopReady(): Promise<void> {
  return new Promise<void>((resolve) => {
    let consecutiveOk = 0;
    let prev = Date.now();
    const check = () => {
      const now = Date.now();
      const drift = now - prev;
      prev = now;
      if (drift > 200) {
        // Timer fired way later than expected — event loop was starved.
        consecutiveOk = 0;
      } else {
        consecutiveOk++;
      }
      if (consecutiveOk >= 2) {
        resolve();
      } else {
        setTimeout(check, 20);
      }
    };
    setTimeout(check, 20);
  });
}
No upper-bound timeout on waitForEventLoopReady

waitForEventLoopReady has no maximum wait time. If the event loop drift stays above 200 ms continuously (e.g. due to runaway CPU-intensive work, a bug in the blocking module that never finishes, or even heavy OS scheduling pressure on a loaded machine), this promise never resolves and probeGateway hangs forever.

Critically, opts.timeoutMs is passed in by callers precisely to cap how long the probe can take — but that timeout (timer) is only set up after await waitForEventLoopReady() returns, so it provides no protection during the wait.

Consider adding a hard cap, for example by racing against the caller-supplied budget:

function waitForEventLoopReady(maxWaitMs = 10_000): Promise<void> {
  return new Promise<void>((resolve) => {
    let consecutiveOk = 0;
    let prev = Date.now();
    const deadline = Date.now() + maxWaitMs;

    const check = () => {
      const now = Date.now();
      const drift = now - prev;
      prev = now;
      if (drift > 200) {
        consecutiveOk = 0;
      } else {
        consecutiveOk++;
      }
      if (consecutiveOk >= 2 || Date.now() >= deadline) {
        resolve();
      } else {
        setTimeout(check, 20);
      }
    };
    setTimeout(check, 20);
  });
}

And at the call site, pass the remaining budget:

await waitForEventLoopReady(opts.timeoutMs);

This way the total wall-clock time for probeGateway stays bounded by opts.timeoutMs regardless of how long the stall lasts.

Move waitForEventLoopReady into a shared module (event-loop-ready.ts)
and call it in executeGatewayRequestWithScopes in addition to
probeGateway.  This fixes commands like `cron list`, `devices list`,
and any other CLI path that goes through callGateway — they hit the
same deferred ESM module evaluation stall that was causing probe
timeouts.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ae7482227d

}): Promise<T> {
// Ensure the event loop is not starved by deferred module evaluation before
// opening any network connections (see waitForEventLoopReady jsdoc).
await waitForEventLoopReady();

[P2] Cap event-loop readiness wait by call timeout

executeGatewayRequestWithScopes now awaits waitForEventLoopReady() before starting the request timeout timer, but that helper has no upper bound and keeps rescheduling itself whenever timer drift stays above 200ms. Under sustained synchronous load (for example recurring >200ms event-loop blocks), this preflight can never resolve, so callGateway hangs indefinitely and never enforces timeoutMs; this is fresh evidence beyond the earlier probe-only report because this code path is used by all gateway RPC methods.

Addresses review feedback: if the event loop remains starved beyond the
deadline (default 10 s), resolve anyway so that callers' own timeout
logic can take over rather than hanging indefinitely.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Move event-loop-ready import before method-scopes to satisfy
alphabetical import ordering enforced by the formatter.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a1b1c58cd

}): Promise<GatewayProbeResult> {
// Ensure the event loop is not starved by deferred module evaluation before
// opening any network connections (see waitForEventLoopReady jsdoc).
await waitForEventLoopReady();

[P1] Include readiness preflight in probe timeout budget

probeGateway now awaits waitForEventLoopReady() before it records startedAt and before it arms the probe timeout, but the new helper defaults to waiting up to 10,000ms (src/gateway/event-loop-ready.ts) while probe budgets are often much smaller (for example the 800ms local loopback budget used by gateway status); in the starvation scenario this patch targets, the command can therefore run far beyond the configured timeout and still report a normal probe outcome, which breaks the timeout contract and makes --timeout misleading.

}): Promise<T> {
// Ensure the event loop is not starved by deferred module evaluation before
// opening any network connections (see waitForEventLoopReady jsdoc).
await waitForEventLoopReady();

[P1] Enforce call timeout before waiting for event-loop readiness

executeGatewayRequestWithScopes now awaits waitForEventLoopReady() before starting its safeTimerTimeoutMs timer, and because that helper has a fixed 10,000ms default max wait, callGateway can exceed caller-provided timeouts by up to ~10s (for example timeoutMs: 500 can take ~10.5s before failing) while still throwing gateway timeout after 500ms, which is a regression for every gateway RPC path that relies on strict timeout behavior.
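Both P1 comments come down to when the clock starts. One possible shape, shown only as a sketch, is to fix a single deadline from the caller's timeout before the preflight runs and hand each later step whatever time remains. `createDeadline` and `runWithinBudget` are hypothetical names, and `preflight` and `request` are placeholders for the readiness wait and the gateway request; none of this is taken from the PR's code.

```typescript
// Hypothetical budget tracker: the clock starts as soon as the caller's
// timeout is known, before any preflight runs.
function createDeadline(timeoutMs: number, now: () => number = Date.now) {
  const expiresAt = now() + timeoutMs;
  return {
    // Milliseconds left in the budget, never negative.
    remainingMs: (): number => Math.max(0, expiresAt - now()),
    expired: (): boolean => now() >= expiresAt,
  };
}

// Sketch of a call path in which the preflight and the request share one budget.
async function runWithinBudget<T>(
  timeoutMs: number,
  preflight: (maxWaitMs: number) => Promise<void>,
  request: (timeoutMs: number) => Promise<T>,
): Promise<T> {
  const deadline = createDeadline(timeoutMs);
  await preflight(deadline.remainingMs());
  if (deadline.expired()) {
    throw new Error(`timed out after ${timeoutMs}ms`);
  }
  // The request only gets whatever the preflight left over.
  return request(deadline.remainingMs());
}
```

Under this arrangement a call with `timeoutMs: 500` fails roughly 500 ms after it starts, provided the preflight honours its `maxWaitMs`, instead of adding the preflight's own maximum wait on top of the caller's timeout.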

Pass the caller-supplied timeoutMs to waitForEventLoopReady so the
readiness preflight respects the probe/call timeout budget instead of
using the 10 s default.  This prevents commands with tight budgets
(e.g. 800 ms loopback probe) from exceeding their timeout contract.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Labels

gateway Gateway runtime size: S
