fix: sweep stale chatRunState buffers for stuck runs#52428
fix: sweep stale chatRunState buffers for stuck runs#52428steipete merged 11 commits intoopenclaw:mainfrom
Conversation
Greptile SummaryThis PR fixes an OOM crash (
Confidence Score: 4/5
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e5ba544f9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for (const [runId, lastSentAt] of params.chatDeltaSentAt) { | ||
| if (params.chatRunState.abortedRuns.has(runId)) { | ||
| continue; // already handled above | ||
| } | ||
| if (now - lastSentAt <= ABORTED_RUN_TTL_MS) { | ||
| continue; | ||
| } | ||
| params.chatRunBuffers.delete(runId); |
There was a problem hiding this comment.
Guard stale-buffer sweep against active runs
chatDeltaSentAt is only the timestamp of the last broadcasted assistant delta, not proof that the run is abandoned. We still allow chat runs to stay active far longer than an hour (src/gateway/chat-abort.ts:17-28), so a legitimate run can sit in a long tool call or other quiet phase after already emitting text. In that case this loop deletes chatRunState.buffers, and when output resumes resolveMergedAssistantText/emitChatFinal in src/gateway/server-chat.ts:513-630 can only reconstruct the post-pause suffix. The segmented-text path in src/gateway/server-chat.agent-events.test.ts:339-371 depends on that buffer surviving across tool gaps, so this change will truncate replies for any run that resumes after >1h without a delta.
Useful? React with 👍 / 👎.
9e5ba54 to
dbd61b2
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e1d772eb69
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (params.chatAbortControllers.has(runId)) { | ||
| continue; |
There was a problem hiding this comment.
Don't reap active controllerless chat runs
This assumes every active chat run still has a chatAbortControllers entry, but that is not true for other chat-producing gateway flows: agent.request adds chat runs without creating an abort controller (src/gateway/server-methods/agent.ts:503-506), and voice transcript ingress does the same (src/gateway/server-node-events.ts:295-298). Those runs still populate chatRunState.buffers/chatDeltaSentAt through src/gateway/server-chat.ts:502-630, so a legitimate run that pauses in a long tool call for more than an hour will be treated as orphaned here and lose its pre-pause text before the eventual final event.
Useful? React with 👍 / 👎.
.devcontainer/devcontainer.json
Outdated
| "version": "latest" | ||
| } | ||
| }, | ||
| "postCreateCommand": "npm install" |
There was a problem hiding this comment.
Use pnpm for the devcontainer bootstrap
npm install only installs the root package in this repo, because the workspace graph lives in pnpm-workspace.yaml and the root package.json does not declare workspaces. A fresh devcontainer created from this file will therefore come up without the ui, packages/*, and extensions/* dependencies that the documented pnpm build/pnpm test flows rely on, so contributors will hit broken commands until they manually rerun pnpm install.
Useful? React with 👍 / 👎.
e1d772e to
8c71659
Compare
8c71659 to
fa994c3
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fa994c33be
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Sweep stale buffers for runs that were never explicitly aborted. | ||
| // Only reap orphaned buffers after the abort controller is gone; active | ||
| // runs can legitimately sit idle while tools/models work. | ||
| for (const [runId, lastSentAt] of params.chatDeltaSentAt) { |
There was a problem hiding this comment.
Sweep runs that buffer text without broadcasting it
This loop uses chatDeltaSentAt as the source of truth, but emitChatDelta writes merged assistant text into chatRunState.buffers before the early returns for silent replies and hidden heartbeat output (src/gateway/server-chat.ts:522-530). In those cases a stuck run can still accumulate a large buffer while never creating a chatDeltaSentAt entry, so the new maintenance sweep never reclaims the strings for exactly the classes of runs that do not stream to chat. A wedged heartbeat or NO_REPLY run can therefore still hold onto the same OOM-sized buffers this patch is meant to clear.
Useful? React with 👍 / 👎.
d7e707b to
122894e
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4fa1c96c95
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (provider === "openai" || provider === "openai-codex") { | ||
| if (provider === "openai") { | ||
| // Default OpenAI Responses to WebSocket-first with transparent SSE fallback. | ||
| agent.streamFn = createOpenAIDefaultTransportWrapper(agent.streamFn); |
There was a problem hiding this comment.
Avoid rewrapping direct OpenAI transport defaults
For the direct openai provider, the bundled plugin already adds createOpenAIDefaultTransportWrapper(...) in extensions/openai/openai-provider.ts:175-176. Putting the same wrapper here means any model-level extra params applied later by createStreamFnWithExtraParams()—for example transport: "sse" or openaiWsWarmup: true in agents.defaults.models["openai/..."].params—get clobbered by the outer plugin wrapper's fallback transport: "auto" / openaiWsWarmup: false, because that helper merges {...streamParams, ...options}. In practice, direct OpenAI users can no longer force SSE or enable warm-up through config.
Useful? React with 👍 / 👎.
emitChatFinal frees buffers on clean run completion, and the maintenance timer sweeps abortedRuns after ABORTED_RUN_TTL_MS. But runs that get stuck (e.g. LLM timeout without triggering clean lifecycle end) are never aborted and their string buffers persist indefinitely. This is the direct trigger for the StringAdd_CheckNone OOM crash reported in the issue. Add a stale buffer sweep in the maintenance timer that cleans up buffers, deltaSentAt, and deltaLastBroadcastLen for any run not updated within ABORTED_RUN_TTL_MS, regardless of abort status. Closes openclaw#51821
|
Landed via temp rebase onto main.
Thanks @karanuppal! |
Summary
emitChatFinalfreeschatRunState.buffers(anddeltaSentAt/deltaLastBroadcastLen) on clean run completion. The maintenance timer also sweepsabortedRunsafterABORTED_RUN_TTL_MS. But runs that get stuck — e.g. an LLM request times out without triggering a clean lifecycle end — are never explicitly aborted, so their string buffers persist indefinitely.This is the direct trigger for the
StringAdd_CheckNoneOOM crash: stuck buffers hold large concatenated strings that can't be GC'd.Fix
Add a stale buffer sweep in the existing maintenance timer (
server-maintenance.ts) that iterateschatDeltaSentAtand cleans up buffers, deltaSentAt, and deltaLastBroadcastLen for any run not updated withinABORTED_RUN_TTL_MS(1 hour), regardless of whether the run was explicitly aborted.Changes
src/gateway/server-maintenance.ts— add stale buffer sweep loop, acceptchatDeltaLastBroadcastLenparamsrc/gateway/server-runtime-state.ts— exposechatDeltaLastBroadcastLenfrom runtime statesrc/gateway/server.impl.ts— passchatDeltaLastBroadcastLento maintenance timersrc/gateway/server-maintenance.test.ts— update test dep factory for new paramTesting
server-maintenance,server-chat.agent-events,server-close)Closes #51821
🤖 AI-assisted (Claude), fully tested, author understands the change.