Skip to content

fix(sessions): use copyFile instead of rename in rotateSessionFile#45181

Open
soray42 wants to merge 3 commits intoopenclaw:mainfrom
soray42:fix/session-store-rotate-race-condition
Open

fix(sessions): use copyFile instead of rename in rotateSessionFile#45181
soray42 wants to merge 3 commits intoopenclaw:mainfrom
soray42:fix/session-store-rotate-race-condition

Conversation

@soray42
Copy link
Copy Markdown

@soray42 soray42 commented Mar 13, 2026

Summary

  • Problem: rotateSessionFile used fs.promises.rename to back up sessions.json, which removed the file from disk for a brief window between the rename and the subsequent atomic write of the new content.
  • Why it matters: Any inbound message (Discord, Telegram, etc.) processed during that window calls loadSessionStore, finds the file missing, falls back to {}, and generates a fresh sessionId — a silent mid-session memory wipe with no /new or /reset invoked. Issue reporter observed 772 rotations in one day (store repeatedly hitting the 10 MB rotateBytes threshold), making the race window hit constantly.
  • What changed: renamecopyFile in rotateSessionFile. The original sessions.json stays present throughout rotation; the subsequent writeTextAtomic call (already part of saveSessionStoreUnlocked) overwrites it with the pruned content.
  • What did NOT change: Backup file format, cleanup logic (keep 3 most recent .bak.*), rotateBytes threshold, or any other maintenance behavior.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Memory / storage

Linked Issue/PR

User-visible / Behavior Changes

Sessions on high-traffic Discord/Telegram deployments will no longer randomly lose context mid-conversation when sessions.json exceeds rotateBytes.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: Linux (reported); race is platform-independent
  • Runtime/container: Node.js
  • Model/provider: Any
  • Integration/channel: Discord / Telegram (any gateway channel with concurrent inbound messages)
  • Relevant config: session.maintenance.rotateBytes at or below default 10 MB

Steps

  1. Run a high-traffic gateway deployment until sessions.json exceeds rotateBytes (10 MB default)
  2. Send a message at the exact moment rotateSessionFile is running
  3. (Old behavior) Agent responds with zero context — new sessionId generated silently

Expected

  • Agent retains session context across rotation events

Actual (before fix)

  • Agent loses all context; session effectively reset with no archive and no user action

Evidence

  • Failing test/log before + passing after — updated rotateSessionFile unit test now asserts original file is preserved; new regression test intercepts copyFile to simulate a concurrent reader and verifies it always sees valid JSON

Human Verification (required)

  • Verified scenarios: rotateSessionFile with content above threshold — .bak file created, original file still present and readable with identical content
  • Edge cases checked: copy failure (returns false, skips rotation as before); file below threshold (no-op, unchanged)
  • What I did not verify: actual high-traffic Discord/Telegram deployment at scale; Windows-specific behavior (Windows already has a separate 5-retry path in saveSessionStoreUnlocked)

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert the one-line change in rotateSessionFile back to fs.promises.rename
  • Files/config to restore: src/config/sessions/store-maintenance.ts
  • Known bad symptoms: If copyFile is unexpectedly slow on a deployment (e.g. cross-device path), rotation may take longer — raise rotateBytes threshold as a workaround

Risks and Mitigations

  • Risk: copyFile uses more I/O than rename for large files (data is physically copied rather than a directory entry moved)
    • Mitigation: Rotation only triggers when the file exceeds rotateBytes; the copy is immediately followed by an atomic overwrite with the smaller post-prune content, so the extra I/O is bounded and one-time per rotation event

Sessions.json would disappear for a brief window between the rename to
.bak and the subsequent atomic write of the new content.  Any inbound
message (Discord, Telegram, etc.) processed during that window would
call loadSessionStore, find the file missing, fall back to an empty
store, and generate a fresh sessionId — causing a silent mid-session
memory wipe with no /new or /reset invoked (openclaw#18572).

Switch from fs.promises.rename to fs.promises.copyFile so the original
file is always present and readable by concurrent readers.  The
subsequent writeTextAtomic call (tmp → sessions.json) will overwrite
it with the pruned content, keeping disk usage bounded exactly as
before.

Update the existing rotateSessionFile test to assert the new invariant
(original file preserved) and add a regression test that intercepts
copyFile to simulate a concurrent reader, verifying it always sees
valid JSON rather than a missing file.
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 13, 2026

Greptile Summary

This PR fixes a race condition in rotateSessionFile (src/config/sessions/store-maintenance.ts) where fs.promises.rename was used to create the rotation backup. Because rename atomically removes the source path, there was a window between the rename and the subsequent atomic overwrite during which sessions.json did not exist on disk. Any concurrent message handler that called loadSessionStore during this window fell back to an empty store and generated a fresh sessionId, silently resetting the user's conversation context.

The fix is a single-line change — renamecopyFile — which leaves the original file intact throughout rotation. The subsequent atomic write (already part of saveSessionStoreUnlocked) overwrites it with the smaller, pruned content, so disk usage remains bounded.

Key changes:

  • rotateSessionFile: fs.promises.renamefs.promises.copyFile; a detailed inline block comment explains the reasoning.
  • Tests: existing assertion that the original file is absent after rotation is inverted; a new regression test intercepts copyFile mid-execution to confirm the file remains readable to a concurrent reader.
  • Minor: the JSDoc above rotateSessionFile (line 272) still says "Renames the current file" — this should be updated to "Copies the current file" to match the new behaviour.

Confidence Score: 5/5

  • This PR is safe to merge — it is a minimal, well-reasoned one-line fix with appropriate tests and no behaviour changes outside the specific race window it closes.
  • The change is surgically small (one call replaced, no interface changes, no new dependencies). The root cause analysis is sound (rename removes the source path atomically; copyFile does not), the fix directly addresses it, and the updated tests confirm the invariant that the original file is readable throughout rotation. The only nit is a one-word stale JSDoc entry.
  • No files require special attention beyond the minor JSDoc correction on line 272 of src/config/sessions/store-maintenance.ts.

Comments Outside Diff (1)

  1. src/config/sessions/store-maintenance.ts, line 272 (link)

    Stale JSDoc — still says "Renames"

    The JSDoc above rotateSessionFile still describes the old rename-based behaviour. It should be updated to reflect the new copyFile approach.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/config/sessions/store-maintenance.ts
Line: 272

Comment:
**Stale JSDoc — still says "Renames"**

The JSDoc above `rotateSessionFile` still describes the old `rename`-based behaviour. It should be updated to reflect the new `copyFile` approach.

```suggestion
 * Copies the current file to `sessions.json.bak.{timestamp}` and cleans up
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: daa7aca

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sessions.json rotation (rename) + unlocked store read can cause random new sessionId ("memory wipe") without /new

1 participant