fix(gateway): prevent probe timeout from deferred ESM module evaluation #48270
wongcode wants to merge 5 commits into openclaw:main
On Windows (and potentially other platforms with slower module evaluation), the auth-profiles ESM bundle triggers deferred synchronous work (primarily AJV schema compilation) that blocks the event loop for ~7 seconds *after* the top-level import promise resolves. The probe's 800ms loopback budget expires during this window because WebSocket data callbacks cannot fire, causing `gateway probe` to always report "timeout" on 2026.3.13.

Add `waitForEventLoopReady()` that schedules short timers and watches for abnormal drift, resolving only after two consecutive on-time callbacks. This guarantees deferred module evaluation has finished before opening any network connections. On unaffected systems this adds ~40ms overhead.

Fixes: probe timeout regression on Windows after upgrading to 2026.3.13
Related: openclaw#47640, openclaw#47307
Co-Authored-By: Claude Opus 4.6 <[email protected]>
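The stall described in this commit message can be reproduced in miniature. The sketch below is illustrative only: `blockFor` is a hypothetical busy-wait standing in for the synchronous schema compilation, and `measureTimerLag` is a hypothetical helper, neither of which exists in the PR. It shows that a timer due in 20 ms cannot fire until the synchronous work releases the event loop, which is the same drift signal `waitForEventLoopReady()` watches for.

```typescript
// Hypothetical stand-in for deferred synchronous work (e.g. schema
// compilation): spins on the CPU without yielding to the event loop.
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // busy-wait; no callbacks can run while this loop spins
  }
}

// Resolves with how many milliseconds late a timer fired, relative to
// the delay that was requested.
function measureTimerLag(delayMs: number): Promise<number> {
  const scheduledAt = Date.now();
  return new Promise<number>((resolve) => {
    setTimeout(() => resolve(Date.now() - scheduledAt - delayMs), delayMs);
  });
}

async function demo(): Promise<number> {
  const lag = measureTimerLag(20); // timer is due in 20 ms
  blockFor(300); // the event loop is blocked for 300 ms
  return lag; // settles only once the block has ended
}

demo().then((lagMs) => console.log(`timer fired ~${lagMs} ms late`));
```

Network callbacks are queued behind the same blocked loop, so a WebSocket `open` or `message` event would be delayed in the same way as the timer here.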
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ddaff6dc3c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
src/gateway/probe.ts
Outdated
    }): Promise<GatewayProbeResult> {
      // Ensure the event loop is not starved by deferred module evaluation before
      // opening any network connections (see waitForEventLoopReady jsdoc).
      await waitForEventLoopReady();
Bound event-loop readiness wait by probe timeout
probeGateway now awaits waitForEventLoopReady() before it starts the probe timeout timer, but waitForEventLoopReady() has no upper bound and only resolves after two sub-200ms timer intervals. In environments with sustained event-loop starvation (for example, recurring synchronous work that blocks for >200ms each cycle), this preflight can loop forever, so opts.timeoutMs is never enforced and gateway probe/status can hang indefinitely instead of timing out.
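Greptile Summary
This PR fixes a Windows-specific regression where `gateway probe` always reports a timeout. The fix adds `waitForEventLoopReady()` ahead of the probe's network connection. Key concern: `waitForEventLoopReady` has no upper-bound timeout.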
Confidence Score: 3/5
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/gateway/probe.ts
Line: 46-68
Comment:
**No upper-bound timeout on `waitForEventLoopReady`**
`waitForEventLoopReady` has no maximum wait time. If the event loop drift stays above 200 ms continuously (e.g. due to runaway CPU-intensive work, a bug in the blocking module that never finishes, or even heavy OS scheduling pressure on a loaded machine), this promise never resolves and `probeGateway` hangs forever.
Critically, `opts.timeoutMs` is passed in by callers precisely to cap how long the probe can take — but that timeout (`timer`) is only set up *after* `await waitForEventLoopReady()` returns, so it provides no protection during the wait.
Consider adding a hard cap, for example by racing against the caller-supplied budget:
```typescript
function waitForEventLoopReady(maxWaitMs = 10_000): Promise<void> {
  return new Promise<void>((resolve) => {
    let consecutiveOk = 0;
    let prev = Date.now();
    const deadline = Date.now() + maxWaitMs;
    const check = () => {
      const now = Date.now();
      const drift = now - prev;
      prev = now;
      if (drift > 200) {
        consecutiveOk = 0;
      } else {
        consecutiveOk++;
      }
      if (consecutiveOk >= 2 || Date.now() >= deadline) {
        resolve();
      } else {
        setTimeout(check, 20);
      }
    };
    setTimeout(check, 20);
  });
}
```
And at the call site, pass the remaining budget:
```typescript
await waitForEventLoopReady(opts.timeoutMs);
```
This way the total wall-clock time for `probeGateway` stays bounded by `opts.timeoutMs` regardless of how long the stall lasts.
How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: ddaff6d
src/gateway/probe.ts
Outdated
    function waitForEventLoopReady(): Promise<void> {
      return new Promise<void>((resolve) => {
        let consecutiveOk = 0;
        let prev = Date.now();
        const check = () => {
          const now = Date.now();
          const drift = now - prev;
          prev = now;
          if (drift > 200) {
            // Timer fired way later than expected: event loop was starved.
            consecutiveOk = 0;
          } else {
            consecutiveOk++;
          }
          if (consecutiveOk >= 2) {
            resolve();
          } else {
            setTimeout(check, 20);
          }
        };
        setTimeout(check, 20);
      });
    }
Move waitForEventLoopReady into a shared module (event-loop-ready.ts) and call it in executeGatewayRequestWithScopes in addition to probeGateway. This fixes commands like `cron list`, `devices list`, and any other CLI path that goes through callGateway, since they hit the same deferred ESM module evaluation stall that was causing probe timeouts.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ae7482227d
src/gateway/call.ts
Outdated
    }): Promise<T> {
      // Ensure the event loop is not starved by deferred module evaluation before
      // opening any network connections (see waitForEventLoopReady jsdoc).
      await waitForEventLoopReady();
Cap event-loop readiness wait by call timeout
executeGatewayRequestWithScopes now awaits waitForEventLoopReady() before starting the request timeout timer, but that helper has no upper bound and keeps rescheduling itself whenever timer drift stays above 200ms. Under sustained synchronous load (for example recurring >200ms event-loop blocks), this preflight can never resolve, so callGateway hangs indefinitely and never enforces timeoutMs; this is fresh evidence beyond the earlier probe-only report because this code path is used by all gateway RPC methods.
Addresses review feedback: if the event loop remains starved beyond the deadline (default 10 s), resolve anyway so that callers' own timeout logic can take over rather than hanging indefinitely. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Move event-loop-ready import before method-scopes to satisfy alphabetical import ordering enforced by the formatter. Co-Authored-By: Claude Opus 4.6 <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1a1b1c58cd
src/gateway/probe.ts
Outdated
    }): Promise<GatewayProbeResult> {
      // Ensure the event loop is not starved by deferred module evaluation before
      // opening any network connections (see waitForEventLoopReady jsdoc).
      await waitForEventLoopReady();
Include readiness preflight in probe timeout budget
probeGateway now awaits waitForEventLoopReady() before it records startedAt and before it arms the probe timeout. The new helper defaults to waiting up to 10,000ms (src/gateway/event-loop-ready.ts), while probe budgets are often much smaller (for example, the 800ms local loopback budget used by gateway status). In the starvation scenario this patch targets, the command can therefore run far beyond the configured timeout and still report a normal probe outcome, which breaks the timeout contract and makes --timeout misleading.
src/gateway/call.ts
Outdated
    }): Promise<T> {
      // Ensure the event loop is not starved by deferred module evaluation before
      // opening any network connections (see waitForEventLoopReady jsdoc).
      await waitForEventLoopReady();
Enforce call timeout before waiting for event-loop readiness
executeGatewayRequestWithScopes now awaits waitForEventLoopReady() before starting its safeTimerTimeoutMs timer. Because that helper has a fixed 10,000ms default max wait, callGateway can exceed caller-provided timeouts by up to ~10s: for example, timeoutMs: 500 can take ~10.5s before failing while still throwing gateway timeout after 500ms. This is a regression for every gateway RPC path that relies on strict timeout behavior.
Pass the caller-supplied timeoutMs to waitForEventLoopReady so the readiness preflight respects the probe/call timeout budget instead of using the 10 s default. This prevents commands with tight budgets (e.g. 800 ms loopback probe) from exceeding their timeout contract. Co-Authored-By: Claude Opus 4.6 <[email protected]>
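One way to keep the preflight and the request inside a single budget is to compute the deadline once and hand each stage only the time that remains. The sketch below is an assumption about how that could look, not the code in this PR: `remainingMs`, `runWithSharedBudget`, and the `Preflight` and `Request` types are hypothetical names, and the real `waitForEventLoopReady` and gateway request functions may have different signatures.

```typescript
// Hypothetical helper: milliseconds left before `deadline`, never negative.
function remainingMs(deadline: number, now: number = Date.now()): number {
  return Math.max(0, deadline - now);
}

// Assumed shapes for the two stages; the real signatures may differ.
type Preflight = (maxWaitMs: number) => Promise<void>;
type Request<T> = (timeoutMs: number) => Promise<T>;

// Runs the readiness preflight and then the request against one shared
// deadline, so total wall-clock time stays within `timeoutMs`.
async function runWithSharedBudget<T>(
  timeoutMs: number,
  preflight: Preflight,
  request: Request<T>,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;

  // The preflight may use at most the whole budget.
  await preflight(remainingMs(deadline));

  // Whatever the preflight consumed is no longer available to the request.
  const left = remainingMs(deadline);
  if (left === 0) {
    throw new Error(`timeout after ${timeoutMs}ms`);
  }
  return request(left);
}
```

With this shape a caller passing `timeoutMs: 800` would see the whole operation end within roughly 800 ms, whether the time went to the preflight or to the request itself.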
Summary
- Fix `gateway probe` always reporting timeout on Windows after upgrading to 2026.3.13
- Add `waitForEventLoopReady()` before opening the probe WebSocket to ensure deferred ESM module evaluation has completed

Root cause
The `auth-profiles` ESM bundle triggers deferred synchronous work (primarily AJV schema compilation) that blocks the Node.js event loop for ~7 seconds after the top-level `import()` promise resolves. This blocking starts after the first event loop cycle completes: `setTimeout(0)` fires on time, but `setTimeout(100)` is delayed by 7+ seconds.

The probe's `resolveProbeBudgetMs` caps local loopback budget at 800ms and the overall default is 3000ms. Both expire while the event loop is blocked, because the WebSocket's `open`/`message` callbacks cannot fire until the synchronous work finishes.

Evidence from debugging on a Windows 10 machine with Node 24.14:
- `net.connect` after import
- `http.request` after import
- `ws` WebSocket after import
- `ws` WebSocket without import
- `setInterval(100)`

The `gateway status` command (which uses `callGateway` with a 10s timeout) was unaffected because its budget outlasts the stall.

Fix
`waitForEventLoopReady()` schedules 20ms timers and checks for abnormal drift (> 200ms). It resolves only after two consecutive on-time callbacks, guaranteeing the deferred evaluation has finished. On systems without the blocking issue, this adds only ~40ms overhead.

A longer-term fix would be to lazy-compile AJV schemas instead of evaluating them at module scope, which would eliminate the event loop stall entirely.
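The lazy-compilation idea can be sketched with a small memoizing wrapper. This is a minimal illustration under stated assumptions: `lazy`, `compileSchema`, and `getProfileValidator` are hypothetical names, and `compileSchema` is a cheap stand-in for a real schema compiler rather than the AJV API. The point is only that nothing expensive runs when the module is evaluated.

```typescript
// Generic memoizing wrapper: `init` runs on the first call only, and the
// result is reused on every later call.
function lazy<T>(init: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (cached === undefined) {
      cached = { value: init() };
    }
    return cached.value;
  };
}

type Validator = (data: unknown) => boolean;

// Hypothetical stand-in for an expensive compile step; it counts how often
// it runs so the laziness is observable.
let compileCount = 0;
function compileSchema(requiredKey: string): Validator {
  compileCount++;
  return (data) =>
    typeof data === "object" && data !== null && requiredKey in data;
}

// Module scope only creates the wrapper; compilation happens on first use.
const getProfileValidator = lazy(() => compileSchema("id"));
```

A caller would write `getProfileValidator()(payload)`. Code paths that never validate never pay for compilation, although the first path that does validate still blocks for however long compilation takes.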
Test plan
- `openclaw gateway probe` returns `Reachable: yes` (21ms latency) on the affected Windows machine after patch
- `probe.test.ts` uses mocked `GatewayClient`, so `waitForEventLoopReady` completes instantly; no test breakage expected

Related issues
- Fixes #45940: False negative from `openclaw gateway probe` on Windows
- Fixes #46226: Gateway probe shows 3000ms budget but uses 800ms internally, giving a false timeout on healthy local loopback
- Related #46316: `devices list`/`nodes status` timeout while `gateway status` shows `RPC probe: ok` (regression in 2026.3.12/2026.3.13)
- Related #46000: Windows local gateway reissues operator device token without operator.read on 2026.3.13, breaking status/probe/health
- Related #47640, #47307
https://www.answeroverflow.com/m/1482583046749163692
🤖 Generated with Claude Code