Bug: main session silently rolls over after restart/lock recovery; heartbeat lands in fresh main (compactionCount=0) #23615
Description
Audit Report: Overnight Session Rollover + Memory Search Outage
Date: 2026-02-22
Prepared by: Sene
Scope: What happened overnight, why agent:main:main looked fresh with compactionCount=0, and why memory_search failed despite account top-up.
Executive Summary
Two distinct issues occurred:
- Main session rollover happened overnight. A new `agent:main:main` session was created at 12:29:48 AM EST, so by morning the main session looked fresh with zero compactions.
- Embedded memory search outage was caused by OpenRouter embedding key limits, not by memory-search cost inefficiency. The configured key returned `403 Key limit exceeded (total limit)`.
These are correlated in time but technically separate failures.
Evidence Collected
A) Main session rollover evidence
- Current main session record in `~/.openclaw/agents/main/sessions/sessions.json`:
  - `agent:main:main.sessionId = f0121695-92fe-4fc6-9a3a-684bb5fb3529`
  - `compactionCount = 0`
- Session file header confirms creation time:
  - `~/.openclaw/agents/main/sessions/f0121695-...jsonl`
  - first event timestamp: `2026-02-22T09:29:48.869Z` = 12:29:48 AM EST
- First user event in that session was a heartbeat prompt (4:29 AM local message text), which landed in this already-fresh session.
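The check on the session record can be repeated with a few lines of Python. This is a minimal sketch, not OpenClaw code: it assumes `sessions.json` is a JSON object keyed by session key (for example `agent:main:main`) whose entries carry `sessionId` and `compactionCount` fields, which is inferred from the evidence above rather than from a documented schema. The function name `read_main_session` is hypothetical.

```python
import json
from pathlib import Path


def read_main_session(sessions_path, key="agent:main:main"):
    """Return (sessionId, compactionCount) for `key`, or None if the entry is absent.

    Assumed layout: {"agent:main:main": {"sessionId": "...", "compactionCount": 0}, ...}
    """
    data = json.loads(Path(sessions_path).read_text(encoding="utf-8"))
    entry = data.get(key)
    if not isinstance(entry, dict):
        return None
    return entry.get("sessionId"), entry.get("compactionCount")
```

Recording the returned pair before a restart and comparing it afterwards would show whether the main pointer moved.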
B) Gateway lifecycle turbulence before rollover
From ~/.openclaw/logs/gateway.log / gateway.err.log:
- ~7:09 PM EST (Feb 21): SIGUSR1 restart sequence observed.
- ~7:18 PM EST (Feb 21): additional gateway startup sequence (new PID).
- `gateway.err.log` includes stale lock cleanup:
  - removed stale lock for old main session file `74aaf0df-...jsonl.lock`.
- ~10:46 PM EST (Feb 21): old main session (`74aaf0df-...`) logged embedded run timeout.
- 12:29:48 AM EST (Feb 22): new main session (`f012...`) appears.
C) Memory search outage evidence
- `memory_search` tool returned: `openai embeddings failed: 403 ... Key limit exceeded (total limit)`
- `gateway.err.log` shows progression:
  - earlier `402 Insufficient credits`
  - later repeated `403 Key limit exceeded (total limit)`
- Direct embeddings probe using configured OpenRouter key in `openclaw.json` returned the same `403`.
Conclusion: key-level quota/limit policy blocked embeddings; not a semantic-memory design issue.
Incident Timeline (EST)
- 7:09 PM (Feb 21): restart activity (SIGUSR1).
- 7:18 PM (Feb 21): another gateway startup instance.
- 10:46 PM (Feb 21): prior main session (`74aaf...`) times out on embedded run.
- 12:29:48 AM (Feb 22): new `agent:main:main` session (`f012...`) created.
- 4:29 AM (Feb 22): heartbeat prompt handled in this fresh session.
- Morning: user observes new session / 0 compactions.
Root-Cause Assessment
Session rollover
Most likely cause: restart/lock instability around the previous main session caused the gateway to initialize a new main session binding instead of reusing the prior main pointer.
Important nuance: heartbeat likely did not create the rollover; it was simply the first obvious interaction after rollover.
Memory search outage
Direct cause: OpenRouter embeddings key constraints (403 Key limit exceeded) on the specific key configured for memory embeddings.
Proposed Fix Plan
1) Session continuity hardening
- Pin main-session continuity on startup
  - On gateway startup, reuse existing `agent:main:main.sessionId` if session file exists and is readable.
  - Do not create fresh main session unless pointer is missing/corrupt.
- Lock-file recovery policy
  - If lock is stale, remove lock and retry original session file first.
  - Only rotate main session after explicit failed recovery path.
- No implicit main rebinding from non-user triggers
  - Heartbeats, cron runs, and diagnostics must never rebind `agent:main:main` pointer.
- Session-ID drift alerting
  - Emit explicit warning when main session ID changes unexpectedly:
    - previous ID
    - new ID
    - reason code (manual reset / corruption / recovery fallback)
- Post-restart verification check
  - Add a startup integrity check that compares:
    - previous main session ID (from persisted metadata)
    - loaded main session ID
  - If mismatched without operator command, raise high-priority alert.
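The continuity, lock-recovery, and drift-alert items above could fit together roughly as follows. This is a sketch under stated assumptions, not the gateway's actual startup code: all function names, the `STALE_LOCK_SECONDS` threshold, and the `"reused"` result code are hypothetical, and it assumes a session lives in `<sessionId>.jsonl` with a sibling `<sessionId>.jsonl.lock`, as the log excerpts suggest. "Readable" is approximated as "the first line parses as JSON".

```python
import json
import logging
import time
import uuid
from pathlib import Path
from typing import Optional, Tuple

log = logging.getLogger("main_session")
STALE_LOCK_SECONDS = 300  # assumed threshold; the report does not specify one


def is_readable_session(path: Path) -> bool:
    """Treat a session file as readable if it opens and its first line is valid JSON."""
    try:
        with path.open(encoding="utf-8") as fh:
            json.loads(fh.readline())
        return True
    except (OSError, ValueError):
        return False


def clear_stale_lock(session_file: Path, now: Optional[float] = None) -> bool:
    """Remove a stale lock. Returns True when no lock blocks the session file."""
    lock = session_file.with_name(session_file.name + ".lock")
    if not lock.exists():
        return True
    age = (time.time() if now is None else now) - lock.stat().st_mtime
    if age < STALE_LOCK_SECONDS:
        return False  # lock may still be held by a live process
    lock.unlink()
    return True


def resolve_main_session(sessions_dir: Path, pointer: Optional[str],
                         operator_reset: bool = False) -> Tuple[str, str]:
    """Return (session_id, reason), reusing the persisted pointer whenever possible."""
    if operator_reset:
        reason = "manual_reset"
    elif not pointer:
        reason = "corruption"  # pointer missing or corrupt
    else:
        session_file = sessions_dir / f"{pointer}.jsonl"
        if not clear_stale_lock(session_file):
            reason = "recovery_fallback"  # lock could not be recovered
        elif not is_readable_session(session_file):
            reason = "corruption"
        else:
            return pointer, "reused"  # original session retried first and kept
    new_id = str(uuid.uuid4())
    emit = log.info if operator_reset else log.warning
    emit("main session ID changed: previous=%s new=%s reason=%s", pointer, new_id, reason)
    return new_id, reason
```

In this sketch the only paths that rotate the session also log the previous ID, new ID, and reason code, so an unexpected rollover can no longer happen silently. A startup integrity check could compare the returned ID against the persisted one and escalate any change whose reason is not `manual_reset`.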
2) Memory search reliability
- Fix OpenRouter key policy
  - Raise/remove per-key total limit on the key configured for memory embeddings.
- Add preflight embeddings health check
  - On startup, perform tiny embedding probe and report failures clearly in status.
- Fallback path for memory retrieval
  - If embeddings fail, auto-fallback to lexical search over `MEMORY.md` + `memory/*.md` (degraded mode) instead of hard failure.
- Separate keying strategy (recommended)
  - Use dedicated key for embeddings with explicit budget/limits independent of generation model traffic.
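The degraded-mode fallback could look something like the sketch below. It is illustrative only: `lexical_search`, `memory_search`, and the `embed_search` callable are hypothetical names, the scoring (count of shared query terms per line) is a deliberately simple stand-in for whatever ranking the project would prefer, and the real `memory_search` tool's return shape is not known from this report.

```python
import re
from pathlib import Path


def lexical_search(query, root, limit=5):
    """Rank lines of MEMORY.md and memory/*.md by the number of query terms they contain."""
    terms = set(re.findall(r"\w+", query.lower()))
    root = Path(root)
    mem_dir = root / "memory"
    files = [root / "MEMORY.md"]
    if mem_dir.is_dir():
        files.extend(sorted(mem_dir.glob("*.md")))
    hits = []
    for path in files:
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8")
        for lineno, line in enumerate(text.splitlines(), 1):
            score = len(terms & set(re.findall(r"\w+", line.lower())))
            if score:
                hits.append((score, path.relative_to(root).as_posix(), lineno, line.strip()))
    # Highest score first; ties broken by file name, then line number.
    hits.sort(key=lambda h: (-h[0], h[1], h[2]))
    return hits[:limit]


def memory_search(query, root, embed_search):
    """Try the embedding-backed search; on any failure, fall back to lexical search."""
    try:
        return {"mode": "semantic", "results": embed_search(query)}
    except Exception as exc:  # for example an HTTP 402/403 from the embeddings provider
        return {"mode": "degraded", "error": str(exc),
                "results": lexical_search(query, root)}
```

Returning the `mode` and the original error alongside the results keeps the failure visible to the caller instead of hiding it behind the fallback.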
Immediate Operator Actions (minimal disruption)
- Verify and adjust OpenRouter key limits for the exact key in `openclaw.json`.
- Run embeddings smoke test (`/embeddings`) and confirm 200.
- Keep heartbeat isolated (already expected behavior) and verify no main pointer mutation occurs on next heartbeat.
- Instrument session-ID drift alert before next overnight window.
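The smoke test, and the startup preflight proposed earlier, could share one small status classifier. The sketch below assumes a caller-supplied `probe` callable that sends a single tiny embeddings request with the configured key and returns `(http_status, response_text)`. The function name, the status dictionary, and the hint wording are hypothetical. The HTTP call itself is left out on purpose so that no provider endpoint or request shape is guessed here.

```python
def embeddings_preflight(probe):
    """Run one tiny embeddings probe and return a status dict suitable for display."""
    try:
        status, body = probe()
    except Exception as exc:  # network error, timeout, bad configuration, ...
        return {"ok": False, "status": None, "detail": f"probe failed: {exc}"}
    if status == 200:
        return {"ok": True, "status": 200, "detail": "embeddings reachable"}
    # Hints for the two failures seen in gateway.err.log during this incident.
    hints = {402: "insufficient credits", 403: "key limit or permission problem"}
    hint = hints.get(status, "unexpected status")
    return {"ok": False, "status": status, "detail": f"{hint}: {body[:200]}"}
```

Surfacing this result in the gateway's status output at startup would have flagged the `402` and `403` responses before the first `memory_search` call failed.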
Notes for Claude (handoff)
- This report is intended for implementation planning and/or upstream bug report quality.
- Most actionable engineering target is startup lock recovery + main-pointer continuity.
- Memory failure is operational (key policy), but should still get degraded-mode fallback so memory_search never hard-dies silently.