Skip to content

Bug: main session silently rolls over after restart/lock recovery; heartbeat lands in fresh main (compactionCount=0) #23615

@sene1337

Description

@sene1337

Audit Report: Overnight Session Rollover + Memory Search Outage

Date: 2026-02-22
Prepared by: Sene
Scope: What happened overnight, why agent:main:main looked fresh with compactionCount=0, and why memory_search failed despite account top-up.


Executive Summary

Two distinct issues occurred:

  1. Main session rollover happened overnight. A new agent:main:main session was created at 12:29:48 AM EST, so by morning the main session looked fresh with zero compactions.
  2. Embedded memory search outage was caused by OpenRouter embedding key limits, not by memory-search cost inefficiency. The configured key returned 403 Key limit exceeded (total limit).

These are correlated in time but technically separate failures.


Evidence Collected

A) Main session rollover evidence

  • Current main session record in ~/.openclaw/agents/main/sessions/sessions.json:
    • agent:main:main.sessionId = f0121695-92fe-4fc6-9a3a-684bb5fb3529
    • compactionCount = 0
  • Session file header confirms creation time:
    • ~/.openclaw/agents/main/sessions/f0121695-...jsonl
    • first event timestamp: 2026-02-22T09:29:48.869Z = 12:29:48 AM EST
  • First user event in that session was a heartbeat prompt (4:29 AM local message text), which landed in this already-fresh session.

B) Gateway lifecycle turbulence before rollover

From ~/.openclaw/logs/gateway.log / gateway.err.log:

  • ~7:09 PM EST (Feb 21): SIGUSR1 restart sequence observed.
  • ~7:18 PM EST (Feb 21): additional gateway startup sequence (new PID).
  • gateway.err.log includes stale lock cleanup:
    • removed stale lock for old main session file 74aaf0df-...jsonl.lock.
  • ~10:46 PM EST (Feb 21): old main session (74aaf0df-...) logged embedded run timeout.
  • 12:29:48 AM EST (Feb 22): new main session (f012...) appears.

C) Memory search outage evidence

  • memory_search tool returned:
    • openai embeddings failed: 403 ... Key limit exceeded (total limit)
  • gateway.err.log shows progression:
    • earlier 402 Insufficient credits
    • later repeated 403 Key limit exceeded (total limit)
  • Direct embeddings probe using configured OpenRouter key in openclaw.json returned the same 403.

Conclusion: key-level quota/limit policy blocked embeddings; not a semantic-memory design issue.


Incident Timeline (EST)

  • 7:09 PM (Feb 21): restart activity (SIGUSR1).
  • 7:18 PM (Feb 21): another gateway startup instance.
  • 10:46 PM (Feb 21): prior main session (74aaf...) times out on embedded run.
  • 12:29:48 AM (Feb 22): new agent:main:main session (f012...) created.
  • 4:29 AM (Feb 22): heartbeat prompt handled in this fresh session.
  • Morning: user observes new session / 0 compactions.

Root-Cause Assessment

Session rollover

Most likely cause: restart/lock instability around the previous main session caused the gateway to initialize a new main session binding instead of reusing the prior main pointer.

Important nuance: heartbeat likely did not create the rollover; it was simply the first obvious interaction after rollover.

Memory search outage

Direct cause: OpenRouter embeddings key constraints (403 Key limit exceeded) on the specific key configured for memory embeddings.


Proposed Fix Plan

1) Session continuity hardening

  1. Pin main-session continuity on startup

    • On gateway startup, reuse existing agent:main:main.sessionId if session file exists and is readable.
    • Do not create fresh main session unless pointer is missing/corrupt.
  2. Lock-file recovery policy

    • If lock is stale, remove lock and retry original session file first.
    • Only rotate main session after explicit failed recovery path.
  3. No implicit main rebinding from non-user triggers

    • Heartbeats, cron runs, and diagnostics must never rebind agent:main:main pointer.
  4. Session-ID drift alerting

    • Emit explicit warning when main session ID changes unexpectedly:
      • previous ID
      • new ID
      • reason code (manual reset / corruption / recovery fallback)
  5. Post-restart verification check

    • Add a startup integrity check that compares:
      • previous main session ID (from persisted metadata)
      • loaded main session ID
    • If mismatched without operator command, raise high-priority alert.

2) Memory search reliability

  1. Fix OpenRouter key policy

    • Raise/remove per-key total limit on the key configured for memory embeddings.
  2. Add preflight embeddings health check

    • On startup, perform tiny embedding probe and report failures clearly in status.
  3. Fallback path for memory retrieval

    • If embeddings fail, auto-fallback to lexical search over MEMORY.md + memory/*.md (degraded mode) instead of hard failure.
  4. Separate keying strategy (recommended)

    • Use dedicated key for embeddings with explicit budget/limits independent of generation model traffic.

Immediate Operator Actions (minimal disruption)

  1. Verify and adjust OpenRouter key limits for the exact key in openclaw.json.
  2. Run embeddings smoke test (/embeddings) and confirm 200.
  3. Keep heartbeat isolated (already expected behavior) and verify no main pointer mutation occurs on next heartbeat.
  4. Instrument session-ID drift alert before next overnight window.

Notes for Claude (handoff)

  • This report is intended for implementation planning and/or upstream bug report quality.
  • Most actionable engineering target is startup lock recovery + main-pointer continuity.
  • Memory failure is operational (key policy), but should still get degraded-mode fallback so memory_search never hard-dies silently.

Metadata

Metadata

Assignees

No one assigned

    Labels

    staleMarked as stale due to inactivity

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions