fix: clamp setTimeout values to 32-bit safe max to prevent gateway hang#9576
fix: clamp setTimeout values to 32-bit safe max to prevent gateway hang#9576pycckuu wants to merge 2 commits intoopenclaw:mainfrom
Conversation
Node.js setTimeout uses a 32-bit signed integer internally. Values exceeding 2^31-1 (2,147,483,647ms) silently wrap to 1ms, breaking timeout semantics and causing cascading gateway failures. Root cause: NO_TIMEOUT_MS was 30 days (2,592,000,000ms) which exceeds the 32-bit limit. When subagent-registry adds a 10s buffer, the total (2,592,010,000ms) overflows, causing setTimeout to fire instantly (1ms) instead of after 30 days. Changes: - timeout.ts: Reduce NO_TIMEOUT_MS from 30 days to 24 days (2,073,600,000ms), safely under the 32-bit limit even with buffers - call.ts: Add MAX_SAFE_TIMEOUT_MS clamp as defense-in-depth so callGateway never passes an overflowing value to setTimeout - timeout.test.ts: Add regression tests verifying NO_TIMEOUT_MS stays within 32-bit safe range, including with the 10s buffer Fixes openclaw#9572
src/gateway/call.ts
Outdated
| opts: CallGatewayOptions, | ||
| ): Promise<T> { | ||
| const timeoutMs = opts.timeoutMs ?? 10_000; | ||
| const timeoutMs = Math.min(opts.timeoutMs ?? 10_000, MAX_SAFE_TIMEOUT_MS); |
There was a problem hiding this comment.
Timeout clamp misses NaN
Math.min(opts.timeoutMs ?? 10_000, MAX_SAFE_TIMEOUT_MS) will produce NaN if opts.timeoutMs is NaN (or a non-finite number typed as number), and setTimeout(..., NaN) effectively becomes 0, causing immediate timeouts. If timeoutMs is user/config-derived anywhere, this is a behavior change that can still “hang” workflows. Consider normalizing here (e.g., Number.isFinite + floor + >= 1) before clamping.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/call.ts
Line: 157:157
Comment:
**Timeout clamp misses NaN**
`Math.min(opts.timeoutMs ?? 10_000, MAX_SAFE_TIMEOUT_MS)` will produce `NaN` if `opts.timeoutMs` is `NaN` (or a non-finite number typed as `number`), and `setTimeout(..., NaN)` effectively becomes `0`, causing immediate timeouts. If `timeoutMs` is user/config-derived anywhere, this is a behavior change that can still “hang” workflows. Consider normalizing here (e.g., `Number.isFinite` + floor + `>= 1`) before clamping.
How can I resolve this? If you propose a fix, please make it concise.| import { describe, expect, it } from "vitest"; | ||
| import { resolveAgentTimeoutMs } from "./timeout.js"; |
There was a problem hiding this comment.
Test import path may fail
This new test imports resolveAgentTimeoutMs from "./timeout.js". If vitest is running against TS sources (common in this repo) rather than built ESM output, importing the .js extension can fail module resolution and make the test suite fail to load. Other tests typically import via ./timeout or ./timeout.ts depending on config—worth aligning this import with the repo’s established convention so tests run in all environments.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/timeout.test.ts
Line: 1:2
Comment:
**Test import path may fail**
This new test imports `resolveAgentTimeoutMs` from `"./timeout.js"`. If vitest is running against TS sources (common in this repo) rather than built ESM output, importing the `.js` extension can fail module resolution and make the test suite fail to load. Other tests typically import via `./timeout` or `./timeout.ts` depending on config—worth aligning this import with the repo’s established convention so tests run in all environments.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
The .js extension is the established convention in this repo — all other test files in src/agents/*.test.ts use .js imports (e.g. agent-paths.test.ts, agent-scope.test.ts, auth-health.test.ts). This aligns with the ESM + TypeScript moduleResolution setup.
Address review feedback: Math.min with NaN produces NaN, which setTimeout treats as 0 (immediate). Now normalize with Number.isFinite + floor + minimum of 1, falling back to default 10s timeout for invalid values.
Fix
Prevents
setTimeoutinteger overflow that causes gateway hangs.Closes #9572
Root Cause
NO_TIMEOUT_MSintimeout.tswas set to 30 days (2,592,000,000ms). Node.jssetTimeoutuses a 32-bit signed integer internally — max safe value is 2,147,483,647ms (≈24.8 days). Whensubagent-registry.tsadds a 10s buffer, the total becomes 2,592,010,000ms, which overflows and silently wraps to 1ms.This causes
callGatewayto time out instantly, blocking the session lane and eventually making the entire gateway unresponsive (process alive but not processing messages).Changes
src/agents/timeout.tsNO_TIMEOUT_MSfrom 30 days → 24 days (2,073,600,000ms), safely under 2^31-1 even with bufferssrc/gateway/call.tsMAX_SAFE_TIMEOUT_MSclamp as defense-in-depth —callGatewaynow caps any timeout to 2^31-1 before passing tosetTimeoutsrc/agents/timeout.test.tsNO_TIMEOUT_MSstays within 32-bit safe range, including with the 10s bufferDefense in Depth
Two layers of protection:
timeout.ts): The "no timeout" sentinel value is now inherently safecall.ts): Even if future code passes a too-large value,callGatewaywill clamp itTesting
New test in
timeout.test.ts:Greptile Overview
Greptile Summary
This PR addresses a real Node.js
setTimeoutoverflow footgun by (1) lowering the “no timeout” sentinel insrc/agents/timeout.tsto 24 days so it remains < 2^31-1 even after the subagent buffer, and (2) clampingcallGateway’stimeoutMsto2_147_483_647insrc/gateway/call.tsas defense-in-depth. It also adds a new vitest regression suite insrc/agents/timeout.test.tsto lock in the sentinel value and the +10s buffer invariant used bysubagent-registry.ts.Confidence Score: 4/5
./timeout.jsfrom a TS test file.(2/5) Greptile learns from your feedback when you react with thumbs up/down!