Skip to content

Fix/issue 51097 session store memory tests#51567

Open
sahilsatralkar wants to merge 17 commits intoopenclaw:mainfrom
sahilsatralkar:fix/issue-51097-session-store-memory-tests
Open

Fix/issue 51097 session store memory tests#51567
sahilsatralkar wants to merge 17 commits intoopenclaw:mainfrom
sahilsatralkar:fix/issue-51097-session-store-memory-tests

Conversation

@sahilsatralkar
Copy link
Copy Markdown
Contributor

@sahilsatralkar sahilsatralkar commented Mar 21, 2026

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: large sessions.json stores could drive long-running gateway RSS toward ~1.2–1.3 GB because session metadata stayed on a whole-file read/
    parse/write path, and hot paths often loaded a store and then immediately reparsed it again inside updateSessionStore.
  • Why it matters: cron-heavy and long-lived gateways pay repeated O(n) session-store costs, which increases memory pressure and makes the reported
    [Bug] Gateway memory leak: sessions.json loaded entirely into RAM, grows unbounded #51097 behavior reproducible locally.
  • What changed: added a default OPENCLAW_SESSION_OBJECT_CACHE_MAX_BYTES guard for oversized in-memory session-store caching, stopped retaining
    oversized serialized snapshots, and taught updateSessionStore to reuse an already-loaded store snapshot when the on-disk file still matches it.
  • What did NOT change (scope boundary): this PR does not replace sessions.json with SQLite/sharding and does not remove the underlying whole-file
    stringify/write design.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Added OPENCLAW_SESSION_OBJECT_CACHE_MAX_BYTES with a default of 1000000 bytes.
  • When sessions.json exceeds that limit, OpenClaw disables the parsed object cache for that store and logs a one-time warning.
  • Large stores now retain less memory and avoid one redundant whole-store reparse on common update paths.
  • Session semantics, file format, and maintenance defaults did not change.

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation: None

Repro + Verification

Environment

  • OS: macOS local dev machine
  • Runtime/container: Node 22 + pnpm
  • Model/provider: None
  • Integration/channel (if any): synthetic session-store stress harness; heartbeat persistence path regression-tested
  • Relevant config (redacted): default session config, synthetic ~7.6 MB sessions.json, OPENCLAW_SESSION_OBJECT_CACHE_MAX_BYTES=1000000,
    OPENCLAW_SESSION_CACHE_TTL_MS unset

Steps

  1. Generate a synthetic large sessions.json and run repeated session-store read/update cycles with scripts/stress-session-store-rss.ts.
  2. Observe RSS growth on the pre-fix path with the same ~7.6 MB store shape reported in [Bug] Gateway memory leak: sessions.json loaded entirely into RAM, grows unbounded #51097.
  3. Apply the cache-limit + snapshot-reuse fixes and rerun the same stress harness and focused tests.

Expected

  • Large stores should retain less memory.
  • Hot update paths should avoid a redundant second parse when the caller already loaded the same unchanged store.
  • Focused regression tests should stay green.

Actual

  • Pre-fix reproduced roughly 1236 MB RSS by cycle 50.
  • Post-fix snapshot-reuse run dropped to roughly 890 MB RSS by cycle 50.
  • Focused session-store and heartbeat tests passed.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Failing test/log before + passing after

Before:

scripts/stress-session-store-rss.ts --target-kb 7600 --cycles 50 --sample-every 10

cycle rss_mb heap_mb external_mb store_kb elapsed_ms
0 668.7 221.3 21.4 7601 15181
10 877.6 236.3 20.1 7602 15760
20 1004.7 236.3 20.1 7603 16318
30 1109.8 236.3 20.1 7604 16873
40 1214.1 236.3 20.1 7605 17433
50 1236.6 236.3 20.1 7605 17987

After:

scripts/stress-session-store-rss.ts --target-kb 7600 --cycles 50 --sample-every 10 --reuse-loaded-store

cycle rss_mb heap_mb external_mb store_kb elapsed_ms
0 662.1 221.3 12.6 7601 15115
10 740.7 236.3 27.5 7602 15561
20 778.6 236.3 27.5 7603 16001
30 815.9 236.3 27.5 7604 16430
40 853.2 236.3 27.5 7605 16862
50 890.5 236.3 27.5 7605 17297

Focused regression tests after the fix:

pnpm test -- src/config/sessions/store.update-base-store.test.ts src/config/sessions/store.cache.large-store.test.ts src/config/sessions/
store.cache.large-store-noop-save.test.ts src/config/sessions/store.cache.limit-config.test.ts src/config/sessions/store.cache.limit-read.test.ts
src/config/sessions/store.cache.limit-warning.test.ts src/config/sessions/store.cache.limit-write.test.ts src/config/sessions/
store.cache.serialized-ttl.test.ts src/infra/heartbeat-runner.returns-default-unset.test.ts

Test Files 9 passed (9)
Tests 47 passed (47)

Trace/log snippets

Oversized-store warning showing the cache limit path activates:

[sessions/store] session object cache disabled for large store

Relevant code-path evidence:

  • src/config/sessions/store.ts
    • large stores skip parsed object-cache reuse
    • updateSessionStore can reuse a caller-provided loaded snapshot when file contents still match
  • src/auto-reply/reply/session-updates.ts
    • reply/session hot path now passes baseStore
  • src/infra/heartbeat-runner.ts
    • heartbeat dedupe persistence now uses updateSessionStore(..., { baseStore })

Perf numbers

Reporter-scale synthetic store, ~7.6 MB:

  • Before hot-path snapshot reuse: 1236.6 MB RSS at cycle 50
  • After hot-path snapshot reuse: 890.5 MB RSS at cycle 50

Delta:

  • RSS reduction at cycle 50: about 346 MB
  • Relative reduction: about 28%

Validation gates:

pnpm build ✅
pnpm check ✅
focused tests ✅

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • pnpm build
    • pnpm check
    • pnpm test -- src/config/sessions/store.update-base-store.test.ts src/config/sessions/store.cache.large-store.test.ts src/config/sessions/
      store.cache.large-store-noop-save.test.ts src/config/sessions/store.cache.limit-config.test.ts src/config/sessions/store.cache.limit-
      read.test.ts src/config/sessions/store.cache.limit-warning.test.ts src/config/sessions/store.cache.limit-write.test.ts src/config/sessions/
      store.cache.serialized-ttl.test.ts src/infra/heartbeat-runner.returns-default-unset.test.ts
    • scripts/stress-session-store-rss.ts before/after on a ~7.6 MB store
  • Edge cases checked:
    • oversized stores stop using parsed object cache
    • oversized serialized snapshots are not retained
    • unchanged oversized stores do not rewrite unnecessarily
    • updateSessionStore reuses the caller snapshot only when the on-disk file still matches
    • fallback reparsing still happens if the file changed underneath the caller
    • heartbeat dedupe persistence still passes focused regression coverage
  • What you did not verify:
    • full pnpm test across the entire repo on this machine
    • production multi-process gateway deployments
    • a clean terminating codex review --base origin/main run; repeated runs stalled without surfacing explicit findings

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) Yes
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps:
    • Operators can optionally tune OPENCLAW_SESSION_OBJECT_CACHE_MAX_BYTES.
    • No data migration is required.

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly:
    • Revert this PR.
    • As an operational workaround, raise OPENCLAW_SESSION_OBJECT_CACHE_MAX_BYTES if the new cache limit is too aggressive.
  • Files/config to restore:
    • src/config/sessions/store.ts
    • src/config/sessions/store-cache.ts
    • src/config/sessions/store-cache-limit.ts
    • reply/heartbeat call sites updated to pass baseStore
  • Known bad symptoms reviewers should watch for:
    • session metadata not persisting after in-memory updates
    • stale session data winning over newer on-disk changes
    • unexpected repeated warning logs for oversized stores

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk:
    • The optimistic snapshot-reuse path could miss a concurrent on-disk change if the equality check is wrong.
    • Mitigation: reuse only happens while holding the write lock and only when the current file bytes still exactly match the previously loaded
      serialized snapshot; otherwise it falls back to the old reparse path.
  • Risk:
    • The new object-cache size limit could trade memory down for slower repeated reads on large stores.
    • Mitigation: the limit is configurable via env var, logged once per oversized store, and the change preserves session behavior rather than
      pruning data.

Built with Codex

Build prompt-

Issue 51097 Session Cache Fix Plan

DO NOT COMMIT THIS FILE.
Keep this file untracked for the entire implementation.
It will be needed later when drafting the PR description.

Goal

Implement a safe first fix for #51097 by size-gating the in-memory session object cache, defaulting the limit to 1 MB, while preserving existing session behavior and validating the change with targeted tests plus all applicable local CI-equivalent gates.

Scope Assumptions

  • The fix is limited to the session object cache, not session pruning, storage format redesign, or maintenance mode changes.
  • The expected touched code paths are under src/config/sessions/, tests under src/config/sessions/, and operator docs under docs/.
  • For changes under src/ and scripts/, scripts/ci-changed-scope.mjs implies run_node=true and run_windows=true.
  • run_macos=false, run_android=false, and run_skills_python=false unless the implementation expands into apps/macos, apps/android, apps/shared, skills/, or .github/workflows/ci.yml.
  • If docs are changed, check-docs must also be run locally before push.

Execution Rules

  • Use AGENTS.md as the operational guardrail for the entire session.
  • Use CONTRIBUTING.md as the PR-readiness guardrail for the entire session.
  • Keep this plan file untracked and never include it in any commit.
  • Keep the resulting work scoped to the concrete fix for issue [Bug] Gateway memory leak: sessions.json loaded entirely into RAM, grows unbounded #51097; do not turn it into a refactor-only or test-only PR.
  • Before every code-changing commit:
    • run the step-specific targeted tests
    • run pnpm build
    • make the commit with scripts/committer "<message>" <files...>
  • If a step has no tracked file changes, do not create an empty commit.
  • If any baseline gate is red before starting code changes, stop and record the failure locally before proceeding.
  • Do not push until the full final verification section is complete.

Baseline Gate (No Commit By Design)

  • Confirm branch state is clean and synced with the intended working branch.

  • Run dependency sync once:

    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm install
  • Record the current branch tip and baseline environment versions:

    git branch --show-current
    git log -1 --oneline --decorate
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH node -v
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm -v
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH bun -v
  • Run the main repo-wide gate used by the commit hook:

    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm check
  • Run the strict build smoke used by CI:

    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm build:strict-smoke
  • Run the main build:

    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm build
  • Run the exact minimum pre-PR gate required by CONTRIBUTING.md:

    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm build && pnpm check && pnpm test
  • Run the two Linux unit-test shards locally, one at a time, using the same environment shape as CI:

    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm canvas:a2ui:bundle
    OPENCLAW_TEST_WORKERS=2 OPENCLAW_TEST_MAX_OLD_SPACE_SIZE_MB=6144 OPENCLAW_TEST_SHARDS=2 OPENCLAW_TEST_SHARD_INDEX=1 PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test
    OPENCLAW_TEST_WORKERS=2 OPENCLAW_TEST_MAX_OLD_SPACE_SIZE_MB=6144 OPENCLAW_TEST_SHARDS=2 OPENCLAW_TEST_SHARD_INDEX=2 PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test
  • Run the remaining Linux CI matrix lanes:

    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test:extensions
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test:channels
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test:contracts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm protocol:check
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm canvas:a2ui:bundle
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH bunx vitest run --config vitest.unit.config.ts
  • Run the Node 22 compatibility lane locally in a Node 22 shell or toolchain manager session:

    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm build
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH node scripts/stage-bundled-plugin-runtime-deps.mjs
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH node --import tsx scripts/release-check.ts
  • Run the check-additional gates:

    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm run lint:plugins:no-extension-imports
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm run lint:web-search-provider-boundaries
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm run lint:extensions:no-src-outside-plugin-sdk
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm run lint:extensions:no-plugin-sdk-internal
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm lint:ui:no-raw-window-open
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test:gateway:watch-regression
  • Run the build-smoke gates:

    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm build
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH node openclaw.mjs --help
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH node openclaw.mjs status --json --timeout 1
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test:build:singleton
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test:startup:memory
  • Run the always-on security/workflow gates locally:

    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH python3 -m pip install --upgrade pip pre-commit
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pre-commit run --all-files detect-private-key
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pre-commit run --all-files pnpm-audit-prod
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH python3 - <<'PY'
    from __future__ import annotations
    import pathlib
    import sys
    
    root = pathlib.Path(".github/workflows")
    bad: list[str] = []
    for path in sorted(root.rglob("*.yml")):
        if b"\t" in path.read_bytes():
            bad.append(str(path))
    for path in sorted(root.rglob("*.yaml")):
        if b"\t" in path.read_bytes():
            bad.append(str(path))
    if bad:
        print("Tabs found in workflow file(s):")
        for path in bad:
            print(f"- {path}")
        sys.exit(1)
    PY
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH python3 scripts/check-composite-action-input-interpolation.py
  • Run actionlint locally if any workflow files might be touched during the work:

    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH actionlint
  • Run the install-smoke workflow locally:

    docker build -t openclaw-dockerfile-smoke:local -f Dockerfile .
    docker run --rm --entrypoint sh openclaw-dockerfile-smoke:local -lc 'which openclaw && openclaw --version'
    docker build --build-arg OPENCLAW_EXTENSIONS=matrix -t openclaw-ext-smoke:local -f Dockerfile .
    docker run --rm --entrypoint sh openclaw-ext-smoke:local -lc 'which openclaw && openclaw --version && openclaw plugins list --json >/tmp/plugins.json && test -s /tmp/plugins.json'
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test:install:smoke
  • Run the Windows CI-equivalent lane on a Windows host or Windows VM because src/ or scripts/ changes trigger run_windows=true.

  • On Windows, run:

    pnpm install --frozen-lockfile --prefer-offline --ignore-scripts=false --config.engine-strict=false --config.enable-pre-post-scripts=true --config.side-effects-cache=true
    pnpm canvas:a2ui:bundle
    $env:OPENCLAW_TEST_WORKERS="1"
    $env:OPENCLAW_TEST_SHARDS="6"
    $env:OPENCLAW_TEST_SHARD_INDEX="1"; pnpm test
    $env:OPENCLAW_TEST_SHARD_INDEX="2"; pnpm test
    $env:OPENCLAW_TEST_SHARD_INDEX="3"; pnpm test
    $env:OPENCLAW_TEST_SHARD_INDEX="4"; pnpm test
    $env:OPENCLAW_TEST_SHARD_INDEX="5"; pnpm test
    $env:OPENCLAW_TEST_SHARD_INDEX="6"; pnpm test
  • If docs are changed later, run the docs gate now and again at the end:

    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm check:docs

Step 1: Add Threshold Configuration Helper And Tests

  • Add a new helper in src/config/sessions/store.ts or a new src/config/sessions/ helper file to resolve the object-cache byte limit.
  • Define the initial semantics:
    • default object-cache limit is 1_000_000 bytes
    • a positive integer env override is accepted
    • invalid values fall back to the default
    • 0 disables the object cache entirely
  • Add a new dedicated test file for helper behavior, for example src/config/sessions/store.cache.limit-config.test.ts.
  • Verify this step with:
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.limit-config.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm build
  • Commit only the helper and helper-test files with:
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH scripts/committer "fix(sessions): add object cache limit config helper" <files...>

Step 2: Gate Object-Cache Reads For Oversized Stores

  • Teach loadSessionStore(...) to skip serving the in-memory object cache when the current file size is above the configured limit.
  • When a store is above the limit, explicitly drop any existing object cache for that path so the process does not retain stale large cache entries.
  • Add a new dedicated behavior test file, for example src/config/sessions/store.cache.limit-read.test.ts.
  • Cover at least these cases:
    • small store remains cached
    • large store is not served from object cache
    • OPENCLAW_SESSION_OBJECT_CACHE_MAX_BYTES=0 disables object-cache reads even for small stores
  • Re-run the existing large baseline test to ensure the pre-fix assumption now changes only when the limit is set.
  • Verify this step with:
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.limit-config.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.limit-read.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.large-store.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm build
  • Commit only the files from this step with:
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH scripts/committer "fix(sessions): skip object cache reads for large stores" <files...>

Step 3: Gate Object-Cache Writes After Load And Save

  • Update cache population paths so oversized stores do not populate the object cache during:
    • load-time cache writes
    • save-time write-through cache updates
  • Keep the serialized cache unchanged in this first implementation.
  • Add or update a dedicated test file, for example src/config/sessions/store.cache.limit-write.test.ts.
  • Cover at least these cases:
    • oversized stores do not populate write-through object cache after saveSessionStore(...)
    • stores that grow across the limit lose object-cache eligibility immediately
    • small stores still get normal write-through caching
  • Verify this step with:
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.limit-config.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.limit-read.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.limit-write.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.large-store.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm build
  • Commit only the files from this step with:
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH scripts/committer "fix(sessions): stop caching oversized session stores" <files...>

Step 4: Add A One-Time Operator Warning

  • Add a once-per-store-path warning when the object cache is disabled because a store exceeded the configured limit.
  • Ensure the warning text includes:
    • store path
    • current size
    • configured limit
    • OPENCLAW_SESSION_OBJECT_CACHE_MAX_BYTES
  • Ensure the warning is low-noise:
    • warn once per store path per process
    • do not warn on every load or save
  • Add a dedicated warning test file, for example src/config/sessions/store.cache.limit-warning.test.ts.
  • Cover at least these cases:
    • first oversized access logs the warning
    • repeated oversized accesses for the same path do not spam
    • small stores do not warn
  • Verify this step with:
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.limit-config.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.limit-read.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.limit-write.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.limit-warning.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.large-store.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm build
  • Commit only the files from this step with:
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH scripts/committer "fix(sessions): warn when object cache limit is hit" <files...>

Step 5: Document The New Limit And Operator Override

  • Update the operator/environment documentation in docs/help/environment.md.
  • Update the session CLI/operator documentation in docs/cli/sessions.md.
  • Document:
    • the env var name
    • the default value (1 MB)
    • what happens when the limit is exceeded
    • that exceeding the limit changes memory/performance tradeoffs, not session semantics
  • Keep internal links Mintlify-safe and root-relative.
  • Verify this step with:
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm check:docs
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm build
  • Commit only the docs files from this step with:
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH scripts/committer "docs(sessions): document object cache size limit" <files...>

Step 6: Re-run The Session Measurements Against The Fix

  • Re-run the targeted measurement harness with the default limit:
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH node --import tsx scripts/measure-session-store-cache.ts
  • Re-run the RSS stress harness with the default limit:
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH node --expose-gc --import tsx scripts/stress-session-store-rss.ts --target-kb 7600 --cycles 50 --sample-every 10
  • Confirm the post-fix measurements show lower RSS growth or at least materially lower retained object-cache memory for oversized stores.
  • If the measurements suggest the default needs adjustment, stop and make that change in a separate follow-up commit rather than folding it into an unreviewable mixed step.

Final PR-Readiness Gate (No Commit By Design)

  • Re-run the canonical CONTRIBUTING.md pre-PR gate exactly:
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm build && pnpm check && pnpm test
  • Re-run all targeted session-cache tests for the touched files:
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.limit-config.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.limit-read.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.limit-write.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.limit-warning.test.ts
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm test -- src/config/sessions/store.cache.large-store.test.ts
  • If docs changed, re-run:
    PATH=/usr/local/bin:/opt/homebrew/bin:$PATH pnpm check:docs
  • Run the Codex review required by CONTRIBUTING.md before opening or updating the PR:
    codex review --base origin/main
  • Confirm every bot/Codex review conversation introduced by the PR is addressed or explicitly responded to before requesting review.
  • Confirm this plan file is still untracked:
    git status --short ISSUE_51097_FIX_PLAN.md

Final Verification Gate Before Push (No Commit By Design)

  • Re-run every baseline gate from the Baseline Gate section after the final code commit.
  • Re-run docs checks if any docs changed.
  • Re-run Windows shard coverage on a Windows host or Windows VM because the touched paths still imply run_windows=true.
  • Confirm git status --short is clean before pushing.
  • Confirm commit history is one logical commit per step plus any intentional merge-from-main commit.
  • Push only after every applicable local CI-equivalent lane is green.

Suggested Commit Sequence

  • fix(sessions): add object cache limit config helper
  • fix(sessions): skip object cache reads for large stores
  • fix(sessions): stop caching oversized session stores
  • fix(sessions): warn when object cache limit is hit
  • docs(sessions): document object cache size limit

Final Deliverables Checklist

  • New configurable object-cache limit with default 1 MB
  • Targeted tests for config, read gating, write gating, and warnings
  • Existing large-store baseline test still meaningful
  • Updated operator/docs guidance
  • Full applicable local CI-equivalent coverage completed before push
  • This plan file remains uncommitted for later PR description drafting

@openclaw-barnacle openclaw-barnacle bot added docs Improvements or additions to documentation scripts Repository scripts size: XL labels Mar 21, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 21, 2026

Greptile Summary

This PR fixes issue #51097 by introducing a configurable size guard (OPENCLAW_SESSION_OBJECT_CACHE_MAX_BYTES, defaulting to 1 MB) for the in-memory session-store object cache, and adds a snapshot-reuse optimisation that lets updateSessionStore skip a second JSON.parse when the caller already holds a loaded store and the on-disk file has not changed.

Key changes:

  • src/config/sessions/store-cache-limit.ts — new helper that resolves the cache size limit from the environment, with 0 disabling the cache entirely and invalid values falling back to the 1 MB default.
  • src/config/sessions/store.ts — gates object-cache reads and writes behind the size limit; drops oversized serialized-cache entries; adds a module-level WeakMap (loadedSessionStoreSnapshots) that associates each returned store object with its on-disk serialized snapshot; teaches updateSessionStore to reuse a caller-provided baseStore via tryReuseLoadedSessionStoreSnapshot when the file is unchanged (verified under the write lock); emits a one-time per-path warning when the limit is exceeded.
  • src/config/sessions/store-cache.ts — adds TTL enforcement to serialized-cache entries (consistent with the object-cache TTL).
  • src/infra/heartbeat-runner.ts and the src/auto-reply/reply/ call sites — each now passes the previously loaded store as baseStore, enabling the snapshot-reuse path on the hot-path update cycle.
  • Docs & testsOPENCLAW_SESSION_OBJECT_CACHE_MAX_BYTES is documented in docs/help/environment.md and docs/cli/sessions.md; 9 new focused test files cover config resolution, read/write gating, limit warnings, noop-save detection, serialized-TTL behaviour, and base-snapshot reuse/fallback.

One P2 note: in src/infra/heartbeat-runner.ts (lines 910–914), the store[sessionKey] = { ...current, ... } mutation that predates the refactor is now dead code. In the snapshot-reuse path the mutator re-applies the same values, and in the fallback path the pre-mutated store is discarded. The mutation can be removed without any behavioural change.

Confidence Score: 4/5

  • This PR is safe to merge; the snapshot-reuse path is correctly guarded by a synchronous equality check under the write lock, and all existing session-store contracts are preserved.
  • The core logic is sound: oversized stores shed their parsed and serialized caches, and the baseStore snapshot-reuse path only fires when the file bytes on disk exactly match the loaded snapshot while the write lock is held. Nine new test files provide thorough coverage of config resolution, read/write gating, TTL expiry, one-time warnings, noop-save deduplication, and concurrent-change fallback. The only actionable finding is the leftover pre-mutation in heartbeat-runner.ts (lines 910–914) which is semantically dead code but harmless — it does not affect the persisted state on either code path. That single P2 cleanup is the reason for 4 rather than 5.
  • Pay close attention to src/infra/heartbeat-runner.ts lines 910–914 (dead pre-mutation) and src/config/sessions/store.ts (tryReuseLoadedSessionStoreSnapshot / isSessionStoreSerializedSnapshotEqual) if memory or latency characteristics change under production load.

Comments Outside Diff (1)

  1. src/infra/heartbeat-runner.ts, line 910-914 (link)

    P2 Leftover pre-mutation is now dead code

    The store[sessionKey] = { ... } mutation on lines 910–914 was necessary in the old architecture where saveSessionStore(storePath, store) would persist the pre-mutated object directly. Now that updateSessionStore is used with an explicit mutator function, this pre-mutation has no net effect on the persisted result:

    • Snapshot-reuse path (tryReuseLoadedSessionStoreSnapshot returns store): the mutator receives the already-mutated store and re-applies the identical values.
    • Fallback path (disk reload): a fresh copy is loaded from disk and the pre-mutation is completely ignored; the mutator applies the values correctly on its own.

    The pre-mutation survives both paths and produces the correct final state, but it allocates an extra object, mutates a reference that will be overwritten, and could mislead future maintainers into thinking it has side effects.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/infra/heartbeat-runner.ts
    Line: 910-914
    
    Comment:
    **Leftover pre-mutation is now dead code**
    
    The `store[sessionKey] = { ... }` mutation on lines 910–914 was necessary in the old architecture where `saveSessionStore(storePath, store)` would persist the pre-mutated object directly. Now that `updateSessionStore` is used with an explicit mutator function, this pre-mutation has no net effect on the persisted result:
    
    - **Snapshot-reuse path** (`tryReuseLoadedSessionStoreSnapshot` returns `store`): the mutator receives the already-mutated `store` and re-applies the identical values.
    - **Fallback path** (disk reload): a fresh copy is loaded from disk and the pre-mutation is completely ignored; the mutator applies the values correctly on its own.
    
    The pre-mutation survives both paths and produces the correct final state, but it allocates an extra object, mutates a reference that will be overwritten, and could mislead future maintainers into thinking it has side effects.
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/infra/heartbeat-runner.ts
Line: 910-914

Comment:
**Leftover pre-mutation is now dead code**

The `store[sessionKey] = { ... }` mutation on lines 910–914 was necessary in the old architecture where `saveSessionStore(storePath, store)` would persist the pre-mutated object directly. Now that `updateSessionStore` is used with an explicit mutator function, this pre-mutation has no net effect on the persisted result:

- **Snapshot-reuse path** (`tryReuseLoadedSessionStoreSnapshot` returns `store`): the mutator receives the already-mutated `store` and re-applies the identical values.
- **Fallback path** (disk reload): a fresh copy is loaded from disk and the pre-mutation is completely ignored; the mutator applies the values correctly on its own.

The pre-mutation survives both paths and produces the correct final state, but it allocates an extra object, mutates a reference that will be overwritten, and could mislead future maintainers into thinking it has side effects.

```suggestion
        await updateSessionStore(
          storePath,
          (nextStore) => {
            const nextCurrent = nextStore[sessionKey];
            if (!nextCurrent) {
              return;
            }
            nextStore[sessionKey] = {
              ...nextCurrent,
              lastHeartbeatText: normalized.text,
              lastHeartbeatSentAt: startedAt,
            };
          },
          {
            baseStore: store,
          },
        );
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "fix(sessions): reuse..."

@sahilsatralkar
Copy link
Copy Markdown
Contributor Author

The failing build-smoke job is the status --json startup-memory check.

What I verified:

  • The failure is real and PR-specific.
  • It is not caused by the new tests.
  • The current overage is on the production CLI startup path, where status --json is exceeding the 400 MB limit enforced by scripts/check-cli-
    startup-memory.mjs.

What I’ve done so far:

  • Reproduced the failure locally.
  • Confirmed the regression came in with this PR’s runtime code changes, not test-only files.
  • Tried isolating the session-store write/snapshot logic out of src/config/sessions/store.ts so read-only paths would not pay for it at import
    time.

Current status:

  • The isolation refactor passes pnpm build and the focused session/heartbeat regression tests.
  • But pnpm test:startup:memory is still failing locally at about 429.5 MB, so that change alone is not enough.

Conclusion:

  • The PR does introduce the startup-memory regression, but the exact remaining import/load cost is still being narrowed down.
  • I’m continuing on this PR until status --json is back under the CI limit.

@sahilsatralkar
Copy link
Copy Markdown
Contributor Author

This PR does fix #51097, but it also introduces a separate status --json startup-memory regression.

What I verified locally:

  • origin/main: status --json uses about 253.7 MB RSS
  • this branch: status --json uses about 429.2 MB RSS
  • CI limit: 400 MB

I spent a full investigation cycle on this and the remaining issue no longer looks like a normal source-path bug. The status --json command sources are unchanged; the regression appears to come from bundle/chunk placement caused by this PR’s runtime changes.

So the current state is:

I’d appreciate a maintainer call on whether to:

  1. keep this blocked pending a dedicated bundle-level follow-up, or
  2. allow a temporary CI override for OPENCLAW_STARTUP_MEMORY_STATUS_JSON_MB and follow up separately on the startup-memory regression.

@sahilsatralkar sahilsatralkar force-pushed the fix/issue-51097-session-store-memory-tests branch from ba1d724 to db18753 Compare March 23, 2026 15:19
@sahilsatralkar sahilsatralkar requested a review from a team as a code owner March 23, 2026 15:19
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: db18753100

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: db18753100

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c46ddfdea0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Improvements or additions to documentation scripts Repository scripts size: XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Gateway memory leak: sessions.json loaded entirely into RAM, grows unbounded

1 participant