Skip to content

fix: sweep stale chatRunState buffers for stuck runs#52428

Merged
steipete merged 11 commits intoopenclaw:mainfrom
karanuppal:fix/memory-leak-stale-buffers-51821
Mar 22, 2026
Merged

fix: sweep stale chatRunState buffers for stuck runs#52428
steipete merged 11 commits intoopenclaw:mainfrom
karanuppal:fix/memory-leak-stale-buffers-51821

Conversation

@karanuppal
Copy link
Copy Markdown
Contributor

Summary

emitChatFinal frees chatRunState.buffers (and deltaSentAt/deltaLastBroadcastLen) on clean run completion. The maintenance timer also sweeps abortedRuns after ABORTED_RUN_TTL_MS. But runs that get stuck — e.g. an LLM request times out without triggering a clean lifecycle end — are never explicitly aborted, so their string buffers persist indefinitely.

This is the direct trigger for the StringAdd_CheckNone OOM crash: stuck buffers hold large concatenated strings that can't be GC'd.

Fix

Add a stale buffer sweep in the existing maintenance timer (server-maintenance.ts) that iterates chatDeltaSentAt and cleans up buffers, deltaSentAt, and deltaLastBroadcastLen for any run not updated within ABORTED_RUN_TTL_MS (1 hour), regardless of whether the run was explicitly aborted.

Changes

  • src/gateway/server-maintenance.ts — add stale buffer sweep loop, accept chatDeltaLastBroadcastLen param
  • src/gateway/server-runtime-state.ts — expose chatDeltaLastBroadcastLen from runtime state
  • src/gateway/server.impl.ts — pass chatDeltaLastBroadcastLen to maintenance timer
  • src/gateway/server-maintenance.test.ts — update test dep factory for new param

Testing

  • 30 tests pass across gateway test files (server-maintenance, server-chat.agent-events, server-close)
  • Full pre-commit suite passes (tsgo, oxlint, oxfmt, all boundary lints)

Closes #51821

🤖 AI-assisted (Claude), fully tested, author understands the change.

@openclaw-barnacle openclaw-barnacle bot added gateway Gateway runtime size: XS labels Mar 22, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 22, 2026

Greptile Summary

This PR fixes an OOM crash (StringAdd_CheckNone) by adding a stale-buffer sweep to the existing maintenance timer. Runs that time out without a clean lifecycle end (no emitChatFinal, no explicit abort) were previously leaving their large concatenated string buffers alive forever; the new sweep reaps any chatDeltaSentAt entry older than ABORTED_RUN_TTL_MS (1 hour) that was not already handled by the explicit-abort TTL path. The change is minimal and well-scoped.

  • server-maintenance.ts — new sweep loop iterates chatDeltaSentAt, skips already-aborted runs, and deletes chatRunBuffers, chatDeltaSentAt, and chatDeltaLastBroadcastLen for stale entries.
  • server-runtime-state.ts / server.impl.ts — plumbing to expose and forward chatDeltaLastBroadcastLen to the maintenance timer.
  • server-maintenance.test.ts — dep factory updated with the new map.
  • Minor gap: the pre-existing aborted-run TTL sweep (lines 126–133) still does not delete from chatDeltaLastBroadcastLen, so explicitly-aborted runs' entries in that map are never reclaimed. Because the values are small integers the practical impact is negligible, but it is inconsistent with the cleanup done by the new stale sweep.

Confidence Score: 4/5

  • Safe to merge — the core OOM fix is correct and the only gap is a minor numeric-value leak in chatDeltaLastBroadcastLen for explicitly-aborted runs.
  • The stale-buffer sweep logic is correct: safe Map mutation during iteration in JS (current entry already yielded before deletion), proper skipping of runs still tracked in abortedRuns, and consistent cleanup of all three maps for the newly targeted case. The one concrete issue — chatDeltaLastBroadcastLen not being cleaned up in the existing aborted-run sweep — is trivially fixable with a one-line addition and doesn't affect the primary OOM fix.
  • The abortedRuns TTL sweep in src/gateway/server-maintenance.ts (lines 126–133) should also delete from chatDeltaLastBroadcastLen.

Comments Outside Diff (1)

  1. src/gateway/server-maintenance.ts, line 126-133 (link)

    P2 chatDeltaLastBroadcastLen not cleaned up for aborted runs

    The existing aborted-run TTL sweep deletes from chatRunBuffers and chatDeltaSentAt but does not delete from chatDeltaLastBroadcastLen. Because the stale-buffer sweep (added below) only iterates chatDeltaSentAt and skips entries still present in abortedRuns, explicitly-aborted runs' chatDeltaLastBroadcastLen entries will never be removed by either sweep. After the TTL expiry the entry disappears from abortedRuns and chatDeltaSentAt, making it invisible to both cleanup paths.

    The individual values are just numbers so the heap impact is negligible, but the inconsistency means the maps can drift out of sync over time.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/gateway/server-maintenance.ts
    Line: 126-133
    
    Comment:
    **`chatDeltaLastBroadcastLen` not cleaned up for aborted runs**
    
    The existing aborted-run TTL sweep deletes from `chatRunBuffers` and `chatDeltaSentAt` but does **not** delete from `chatDeltaLastBroadcastLen`. Because the stale-buffer sweep (added below) only iterates `chatDeltaSentAt` and skips entries still present in `abortedRuns`, explicitly-aborted runs' `chatDeltaLastBroadcastLen` entries will never be removed by either sweep. After the TTL expiry the entry disappears from `abortedRuns` and `chatDeltaSentAt`, making it invisible to both cleanup paths.
    
    The individual values are just numbers so the heap impact is negligible, but the inconsistency means the maps can drift out of sync over time.
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/gateway/server-maintenance.ts
Line: 126-133

Comment:
**`chatDeltaLastBroadcastLen` not cleaned up for aborted runs**

The existing aborted-run TTL sweep deletes from `chatRunBuffers` and `chatDeltaSentAt` but does **not** delete from `chatDeltaLastBroadcastLen`. Because the stale-buffer sweep (added below) only iterates `chatDeltaSentAt` and skips entries still present in `abortedRuns`, explicitly-aborted runs' `chatDeltaLastBroadcastLen` entries will never be removed by either sweep. After the TTL expiry the entry disappears from `abortedRuns` and `chatDeltaSentAt`, making it invisible to both cleanup paths.

The individual values are just numbers so the heap impact is negligible, but the inconsistency means the maps can drift out of sync over time.

```suggestion
      params.chatRunState.abortedRuns.delete(runId);
      params.chatRunBuffers.delete(runId);
      params.chatDeltaSentAt.delete(runId);
      params.chatDeltaLastBroadcastLen.delete(runId);
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix: sweep stale chatRunState buffers fo..." | Re-trigger Greptile

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e5ba544f9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +138 to +145
for (const [runId, lastSentAt] of params.chatDeltaSentAt) {
if (params.chatRunState.abortedRuns.has(runId)) {
continue; // already handled above
}
if (now - lastSentAt <= ABORTED_RUN_TTL_MS) {
continue;
}
params.chatRunBuffers.delete(runId);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Guard stale-buffer sweep against active runs

chatDeltaSentAt is only the timestamp of the last broadcasted assistant delta, not proof that the run is abandoned. We still allow chat runs to stay active far longer than an hour (src/gateway/chat-abort.ts:17-28), so a legitimate run can sit in a long tool call or other quiet phase after already emitting text. In that case this loop deletes chatRunState.buffers, and when output resumes resolveMergedAssistantText/emitChatFinal in src/gateway/server-chat.ts:513-630 can only reconstruct the post-pause suffix. The segmented-text path in src/gateway/server-chat.agent-events.test.ts:339-371 depends on that buffer surviving across tool gaps, so this change will truncate replies for any run that resumes after >1h without a delta.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e1d772eb69

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +144 to +145
if (params.chatAbortControllers.has(runId)) {
continue;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Don't reap active controllerless chat runs

This assumes every active chat run still has a chatAbortControllers entry, but that is not true for other chat-producing gateway flows: agent.request adds chat runs without creating an abort controller (src/gateway/server-methods/agent.ts:503-506), and voice transcript ingress does the same (src/gateway/server-node-events.ts:295-298). Those runs still populate chatRunState.buffers/chatDeltaSentAt through src/gateway/server-chat.ts:502-630, so a legitimate run that pauses in a long tool call for more than an hour will be treated as orphaned here and lose its pre-pause text before the eventual final event.

Useful? React with 👍 / 👎.

"version": "latest"
}
},
"postCreateCommand": "npm install"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use pnpm for the devcontainer bootstrap

npm install only installs the root package in this repo, because the workspace graph lives in pnpm-workspace.yaml and the root package.json does not declare workspaces. A fresh devcontainer created from this file will therefore come up without the ui, packages/*, and extensions/* dependencies that the documented pnpm build/pnpm test flows rely on, so contributors will hit broken commands until they manually rerun pnpm install.

Useful? React with 👍 / 👎.

@steipete steipete force-pushed the fix/memory-leak-stale-buffers-51821 branch from e1d772e to 8c71659 Compare March 22, 2026 22:28
steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
@openclaw-barnacle openclaw-barnacle bot added cli CLI command changes size: M and removed size: S labels Mar 22, 2026
steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
@steipete steipete force-pushed the fix/memory-leak-stale-buffers-51821 branch from 8c71659 to fa994c3 Compare March 22, 2026 22:32
steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fa994c33be

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// Sweep stale buffers for runs that were never explicitly aborted.
// Only reap orphaned buffers after the abort controller is gone; active
// runs can legitimately sit idle while tools/models work.
for (const [runId, lastSentAt] of params.chatDeltaSentAt) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Sweep runs that buffer text without broadcasting it

This loop uses chatDeltaSentAt as the source of truth, but emitChatDelta writes merged assistant text into chatRunState.buffers before the early returns for silent replies and hidden heartbeat output (src/gateway/server-chat.ts:522-530). In those cases a stuck run can still accumulate a large buffer while never creating a chatDeltaSentAt entry, so the new maintenance sweep never reclaims the strings for exactly the classes of runs that do not stream to chat. A wedged heartbeat or NO_REPLY run can therefore still hold onto the same OOM-sized buffers this patch is meant to clear.

Useful? React with 👍 / 👎.

steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
@steipete steipete force-pushed the fix/memory-leak-stale-buffers-51821 branch from d7e707b to 122894e Compare March 22, 2026 23:28
steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
steipete added a commit to karanuppal/openclaw that referenced this pull request Mar 22, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4fa1c96c95

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +221 to +224
if (provider === "openai" || provider === "openai-codex") {
if (provider === "openai") {
// Default OpenAI Responses to WebSocket-first with transparent SSE fallback.
agent.streamFn = createOpenAIDefaultTransportWrapper(agent.streamFn);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid rewrapping direct OpenAI transport defaults

For the direct openai provider, the bundled plugin already adds createOpenAIDefaultTransportWrapper(...) in extensions/openai/openai-provider.ts:175-176. Putting the same wrapper here means any model-level extra params applied later by createStreamFnWithExtraParams()—for example transport: "sse" or openaiWsWarmup: true in agents.defaults.models["openai/..."].params—get clobbered by the outer plugin wrapper's fallback transport: "auto" / openaiWsWarmup: false, because that helper merges {...streamParams, ...options}. In practice, direct OpenAI users can no longer force SSE or enable warm-up through config.

Useful? React with 👍 / 👎.

karanuppal and others added 11 commits March 22, 2026 23:51
emitChatFinal frees buffers on clean run completion, and the
maintenance timer sweeps abortedRuns after ABORTED_RUN_TTL_MS. But
runs that get stuck (e.g. LLM timeout without triggering clean
lifecycle end) are never aborted and their string buffers persist
indefinitely. This is the direct trigger for the StringAdd_CheckNone
OOM crash reported in the issue.

Add a stale buffer sweep in the maintenance timer that cleans up
buffers, deltaSentAt, and deltaLastBroadcastLen for any run not
updated within ABORTED_RUN_TTL_MS, regardless of abort status.

Closes openclaw#51821
@steipete
Copy link
Copy Markdown
Contributor

Landed via temp rebase onto main.

  • Gate: pnpm check; OPENCLAW_TEST_PROFILE=low OPENCLAW_TEST_SERIAL_GATEWAY=1 pnpm test; pnpm test -- src/cli/directory-cli.test.ts
  • Land commit: c490a98
  • Merge commit: 7bd12f4

Thanks @karanuppal!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling app: web-ui App: web-ui channel: discord Channel integration: discord cli CLI command changes docs Improvements or additions to documentation gateway Gateway runtime size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Memory leak: chatRunState.buffers not cleaned up for stuck runs

2 participants