fix(sessions): add context-engine fallback session-size guard (#76940) #76950

Closed
100yenadmin wants to merge 2 commits into openclaw:main from electricsheephq:fix/76940-session-size-guard

Conversation

Contributor

100yenadmin commented May 3, 2026

TL;DR: when a context engine fails, breaks, or is disabled, it takes the gateway and the session down with it, and OC gets blamed for the bad experience. This fixes that. No plugin should be able to break or disable the gateway or leave it hanging for 20+ minutes.

Summary

Implements the defensive guard proposed in #76940. When a configured context-engine plugin (e.g. lossless-claw) fails to resolve and the gateway falls back to the default legacy engine, the guard walks the affected agent's transcript directory and applies a configured action to any session jsonl exceeding the size threshold.

Real-world trigger that motivated this: the 2026.5.2 npm install silently dropped several configured extensions from the runtime plugin set, including a context-engine slot plugin. The next gateway boot loaded an existing session at 808 messages / 6.3 MB, which immediately hit context overflow on the first turn:

[gateway] http server listening (3 plugins: browser, cortex, telegram; 2.7s)   # was 7 before upgrade
[context-engine] Context engine "lossless-claw" is not registered; falling back to default engine "legacy".
[agent/embedded] [context-overflow-diag] sessionKey=agent:main:main provider=openai-codex/gpt-5.5
  source=assistantError messages=808 sessionFile=…b1ed0fe1-…jsonl diagId=ovf-… compactionAttempts=0
  observedTokens=unknown error=Context overflow: estimated context size exceeds safe threshold during tool loop.
[agent/embedded] context overflow detected (attempt 1/3); attempting auto-compaction for openai-codex/gpt-5.5

Larger sessions in the same install (200 MB jsonl files from prior work) would have been unrecoverable without manual jsonl rotation. Cross-references in the issue (#64767 — 444 MB jsonl hangs gateway, #66360, #73691, #75740) show this is a recurring class of failure.

Config surface

"session": {
  "maintenance": {
    "contextFallbackGuard": {
      "sizeBytes": "1mb",        // default; accepts "1mb", "512kb", number of bytes, etc.
      "action": "auto"           // default; "warn" | "archive" | "block" | "auto"
    }
  }
}

Action semantics:

  • warn — log a structured, actionable warning naming file + size + applied action. Per-process dedup so repeated resolves don't spam.
  • archive — rename the jsonl to <basename>.archived-no-context-engine-<ISO>.jsonl. Recoverable via existing archive-recovery work (Recover archived (.reset) session transcripts in memory hook + session-logs skill #71537, [codex] Include reset archives in session log searches #76119).
  • block — throw from the resolver with a structured message naming the offending transcripts and the failed engine id, refusing to fall back until an operator takes action.
  • auto (default) — archive when the agent's state dir contains a known context-engine sqlite store (lcm.db, lossless-claw.db, context-engine.db), warn otherwise. Rationale: when an engine like LCM is in use, the jsonl is just the live buffer — the engine has the source-of-truth in SQLite, so archiving the jsonl loses at most the fresh tail (~32-64 messages, equivalent blast radius to a forced compaction). When no engine has run, the jsonl IS the only record, so auto conservatively warns.
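
The auto rule above can be sketched as a small pure function. This is a minimal illustration, not the PR's code: the function name, the injected `exists` predicate, and the idea of joining the store name onto the agent state directory are assumptions; only the action values and the three store filenames come from the description.

```typescript
// Hypothetical names; only the action values and store filenames are from the PR text.
type GuardAction = "warn" | "archive" | "block" | "auto";
type ConcreteAction = Exclude<GuardAction, "auto">;

// Known context-engine sqlite stores listed in the description.
const KNOWN_ENGINE_STORES = ["lcm.db", "lossless-claw.db", "context-engine.db"];

// `exists` is injected so the filesystem check can be stubbed in tests.
function resolveGuardAction(
  action: GuardAction,
  agentStateDir: string,
  exists: (path: string) => boolean,
): ConcreteAction {
  // Explicit actions pass through untouched.
  if (action !== "auto") return action;
  // auto: archive only when an engine store suggests the jsonl is not the sole record.
  const hasEngineHistory = KNOWN_ENGINE_STORES.some((name) =>
    exists(`${agentStateDir}/${name}`),
  );
  return hasEngineHistory ? "archive" : "warn";
}
```

Keeping the resolution separate from the rename/log side effects makes both directions of auto easy to test without touching disk.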

Threshold note: the default is intentionally small. 1mb is roughly 250k tokens, which would already overflow gpt-5.5 in the wild. The first commit ships 1mb as the default and the follow-up commit raises it to 2mb; operators can raise it further via config when their workflow tolerates more.
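
For readers wondering how a sizeBytes value such as "1mb" or "512kb" turns into a byte count, here is a minimal sketch. It assumes binary units (the tests section calls the default "1 MiB") and a small unit table; the function name and the exact set of accepted suffixes are assumptions, and the real parser in the codebase may differ.

```typescript
// Assumed unit table: binary multiples, matching the "1 MiB" wording used for the default.
const UNIT_BYTES: Record<string, number> = {
  b: 1,
  kb: 1024,
  mb: 1024 * 1024,
  gb: 1024 * 1024 * 1024,
};

// Hypothetical helper: accepts a raw byte count or a string like "1mb" / "512kb".
function parseSizeBytes(value: number | string): number {
  if (typeof value === "number") {
    if (!Number.isFinite(value) || value < 0) {
      throw new Error(`invalid size: ${value}`);
    }
    return Math.floor(value);
  }
  const match = /^(\d+(?:\.\d+)?)\s*(b|kb|mb|gb)?$/i.exec(value.trim());
  if (!match) throw new Error(`invalid size: ${value}`);
  const unit = (match[2] ?? "b").toLowerCase();
  const factor = UNIT_BYTES[unit] ?? 1;
  return Math.floor(Number(match[1]) * factor);
}
```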

Implementation

  • src/context-engine/fallback-guard.ts — pure function that walks the agent transcript dir, applies action per oversized file, dedups warnings per process, falls back to warn when archive rename fails (so we never silently lose the signal). All filesystem and resolver calls are injectable for testing.
  • src/context-engine/registry.ts — single fallbackToDefault helper closure inside resolveContextEngine runs the guard before each of the four fallback sites (engine-not-registered, factory throw, contract validation throw, contract validation error). The block action throws from the resolver with a structured message; warn/archive/auto continue to the default engine.
  • src/agents/pi-embedded-runner/run.ts — plumb params.agentId through ResolveContextEngineOptions so the guard inspects the correct agent's sessions. Other resolver call sites continue to default to the primary agent id (existing behavior — opt-in plumbing for the future).
  • src/config/zod-schema.session.ts, types.base.ts, schema.labels.ts, schema.help.ts, schema.base.generated.ts — config schema, types, labels, help text. session.maintenance.contextFallbackGuard.{sizeBytes,action} validated alongside the existing maintenance fields.
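
The "single helper closure in front of every fallback site" shape described for registry.ts might look roughly like the sketch below. Everything here is a stand-in: `Engine`, `GuardOutcome`, `resolveEngineSketch`, the `validate` callback, and the error wording are hypothetical, and the real resolveContextEngine has a richer contract. Only the four fallback reasons and the rule that block throws while the other actions continue to the default engine come from the description.

```typescript
// Illustrative stand-ins, not the registry's real types.
type Engine = { id: string };
type GuardOutcome = { blocking: boolean; blockedFiles: string[] };

function resolveEngineSketch(
  configuredId: string,
  registry: Map<string, () => Engine>,
  validate: (engine: Engine) => string | null, // returns a problem description or null
  runGuard: (reason: string) => GuardOutcome,
): Engine {
  const legacy: Engine = { id: "legacy" };

  // Single helper: every fallback site runs the guard before returning the default engine.
  const fallbackToDefault = (reason: string): Engine => {
    const outcome = runGuard(reason);
    if (outcome.blocking) {
      throw new Error(
        `Context engine "${configuredId}" failed (${reason}); blocked by oversized transcripts: ` +
          outcome.blockedFiles.join(", "),
      );
    }
    return legacy;
  };

  // Site 1: engine not registered.
  const factory = registry.get(configuredId);
  if (!factory) return fallbackToDefault("engine is not registered");

  // Site 2: factory throws.
  let engine: Engine;
  try {
    engine = factory();
  } catch {
    return fallbackToDefault("factory threw");
  }

  // Sites 3 and 4: contract validation throws, or reports an error.
  let problem: string | null;
  try {
    problem = validate(engine);
  } catch {
    return fallbackToDefault("contract validation threw");
  }
  if (problem !== null) return fallbackToDefault(`contract validation error: ${problem}`);

  return engine;
}
```

Routing all four sites through one closure means a new fallback path cannot forget to run the guard.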

Filename filter ignores .archived-*, .bak, .reset, .deleted, .trim-backup so we never re-archive our own archives or interfere with other rotation systems.

Tests

12 new unit tests in src/context-engine/fallback-guard.test.ts cover:

  • warn/archive/block/auto actions
  • auto resolution in both directions (history-present → archive, history-absent → warn)
  • threshold parsing from string ("1mb", "512kb") and number forms
  • default 1 MiB threshold when config absent
  • transcript-name filtering (skip backup/reset/deleted/archived/trim-backup)
  • per-process warn dedup
  • archive-rename failure falls back to warn (signal preserved)
  • missing/unreadable sessions dir returns inspected:0
  • fallbackGuardOutcomeIsBlocking helper
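
The per-process warn dedup covered by these tests can be pictured as a set of already-warned paths held for the lifetime of the process. This is a minimal sketch under that assumption; `createDedupedWarn` and its signature are hypothetical, and the name `warnedPaths` is borrowed from a later commit message in this thread rather than from inspected source.

```typescript
// Hypothetical factory: returns a warn function that emits at most once per file path.
function createDedupedWarn(sink: (message: string) => void) {
  const warnedPaths = new Set<string>();
  return (filePath: string, message: string): boolean => {
    // Repeated resolves for the same transcript stay silent.
    if (warnedPaths.has(filePath)) return false;
    warnedPaths.add(filePath);
    sink(message);
    return true;
  };
}
```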

All 34 existing src/context-engine/*.test.ts tests pass unchanged. The 1 failing test in src/config/io.compat.test.ts ("logs validation warnings with real line breaks") fails on bare upstream/main too — pre-existing, unrelated.

Validation

  • pnpm exec vitest run src/context-engine/ → 58 tests passed
  • pnpm exec vitest run src/config/ → 1226 passed / 1 pre-existing failure
  • pnpm exec oxlint --type-aware on changed files → 0 errors
  • pnpm check:base-config-schema → clean (regenerated schema.base.generated.ts)

Change Type

  • Bug fix
  • Feature (new config surface)

Scope

  • Gateway / orchestration
  • API / contracts (new config keys)

Closes #76940

…aw#76940)

When a configured context-engine plugin (e.g. lossless-claw) fails to
resolve and the gateway falls back to the default `legacy` engine, walk
the affected agent's transcript directory and apply a configured action
to any session jsonl exceeding the size threshold. Surfaces the real
failure mode (engine disabled / unregistered / contract violation)
instead of letting next-load context overflow stall the gateway.

Real-world trigger: the openclaw 2026.5.2 npm install silently dropped
several configured extensions from the runtime plugin set, including a
context-engine slot plugin. The next gateway boot loaded an existing
session at 808 messages / 6.3 MB, which immediately hit context overflow
on the first turn. Larger sessions in the same install (200 MB jsonl
files from prior work) would have been unrecoverable without manual
jsonl rotation.

Defaults:
  - sizeBytes: 1mb (small enough to catch realistic overflow cases)
  - action: "auto" (archive when an engine sqlite store is present
    and the engine has the source-of-truth; warn otherwise)

Config surface (session.maintenance.contextFallbackGuard):
  - sizeBytes: number | string (e.g. "1mb", "512kb")
  - action: "warn" | "archive" | "block" | "auto"

Implementation:
  - New module src/context-engine/fallback-guard.ts walks the agent
    transcript dir, applies action per oversized file, dedups warnings
    per process, treats archive-rename failure as warn so signal isn't
    lost.
  - Wired into all four resolver fallback sites in
    src/context-engine/registry.ts (engine-not-registered, factory
    throw, contract validation throw, contract validation error) via
    a single fallbackToDefault helper.
  - "block" action throws from the resolver with a structured message
    naming the offending transcripts and the failed engine id.
  - Plumbed agentId through ResolveContextEngineOptions so the guard
    inspects the correct agent's sessions; updated the main embedded
    runner call site. Other call sites continue to default to the
    primary agent id (existing behavior).

Tests:
  - 12 unit tests in fallback-guard.test.ts cover warn/archive/block,
    auto resolution in both directions, threshold parsing, default
    threshold, dedup, archive-failure-falls-back-to-warn,
    transcript-name filtering (skip .bak / .reset / .archived /
    .deleted / .trim-backup), and missing sessions dir.
  - All 34 existing src/context-engine tests pass unchanged.

Closes openclaw#76940
Copilot AI review requested due to automatic review settings May 3, 2026 21:36

Contributor

clawsweeper Bot commented May 3, 2026

Codex review: needs changes before merge.

Summary
The PR adds a context-engine fallback and boot-time session-size guard, new session.maintenance.contextFallbackGuard config/schema/help/changelog entries, agent-id plumbing, and unit coverage.

Reproducibility: yes, for the review findings: source inspection of PR head shows the logger type mismatch, stale default docs, default-agent scan gap, classifier divergence, and invalid recovery command. The underlying gateway-stall class is supported by linked reports and logs, but I did not live-reproduce it in this read-only pass.

Next step before merge
The remaining blockers are concrete changed-file repairs that an automated worker can attempt; maintainer policy review is still needed before merge because startup auto-archive is a product decision.

Security
Cleared: The diff adds local transcript inspection/rename logic, config schema/help text, startup wiring, and tests; I found no new dependency, workflow, package-resolution, install, publish, permission, or secret-handling concern.

Review findings

  • [P1] Route boot guard errors through an error logger — src/gateway/server-startup-post-attach.ts:720
  • [P2] Pass the default agent into the boot guard — src/gateway/server-startup-post-attach.ts:714-722
  • [P2] Align the documented fallback-guard default — src/config/schema.help.ts:1490
Review details

Best possible solution:

Land a revised version that fixes the type/build issue, aligns config defaults, scans the intended agent sessions, reuses existing transcript artifact helpers, points operators to real recovery commands, and keeps broader transcript-size caps tracked separately.

Do we have a high-confidence way to reproduce the issue?

Yes for the review findings: source inspection of PR head shows the logger type mismatch, stale default docs, default-agent scan gap, classifier divergence, and invalid recovery command. The underlying gateway-stall class is supported by linked reports and logs, but I did not live-reproduce it in this read-only pass.

Is this the best way to solve the issue?

No, not as currently written. The guard direction is plausible, but the patch should fix the concrete blockers and get maintainer agreement on the startup auto-archive policy before merge.

Full review comments:

  • [P1] Route boot guard errors through an error logger — src/gateway/server-startup-post-attach.ts:720
    params.log in startGatewayPostAttachRuntime is still typed with only info and warn, but the new boot-guard logger calls params.log.error(...). This should fail type-checking and can be undefined for valid callers, so route errors through an error-capable logger such as params.logHooks.error or widen/provide the logger consistently.
    Confidence: 0.95
  • [P2] Pass the default agent into the boot guard — src/gateway/server-startup-post-attach.ts:714-722
    The boot guard is called without an agentId, so the fallback guard resolves only the hard-coded default sessions directory (main) instead of the configured default agent or all configured agents. Multi-agent installs where the active/default agent is not main will miss the oversized transcript that actually gets loaded.
    Confidence: 0.9
  • [P2] Align the documented fallback-guard default — src/config/schema.help.ts:1490
    The runtime constant defaults to 2 MiB, while this help text and the generated schema/type comments still say Default 1mb. Operators will tune and diagnose from the wrong threshold unless the source and generated config docs match the implementation, or the implementation is changed back.
    Confidence: 0.94
  • [P2] Probe the agent state directory for engine history — src/context-engine/fallback-guard.ts:144-150
    The auto heuristic says it checks the agent state directory, but dirname() three times from <state>/agents/<id>/sessions lands at the global state root. That can miss per-agent engine stores and make action: "auto" warn when the intended safe path is archive.
    Confidence: 0.86
  • [P2] Reuse the session transcript artifact classifier — src/context-engine/fallback-guard.ts:157-176
    This custom filter treats trajectory/checkpoint sidecars as live transcripts and skips valid primary names that merely contain substrings like .deleted. Reuse isPrimarySessionTranscriptFileName() and add only the new no-context-engine archive exclusion so the guard operates on the same primary transcript set as the rest of session maintenance.
    Confidence: 0.9
  • [P2] Replace the nonexistent sessions archive command — src/context-engine/fallback-guard.ts:515-516
    Warn-mode recovery tells operators to run openclaw sessions archive ..., but current CLI/docs only define sessions cleanup and sessions export-trajectory. In the path where the guard does not mutate files, that sends users to a command that fails, so point to an existing recovery flow or add the command and docs.
    Confidence: 0.93
  • [P2] Branch the no-engine operator wording — src/context-engine/fallback-guard.ts:502-504
    The boot guard intentionally fires when no context engine is configured, but the warning/archive messages always say a configured engine failed. For the legacy/unset trigger this misdiagnoses the cause, so branch the copy on the synthesized (legacy/none) reason or pass an explicit reason kind.
    Confidence: 0.87
  • [P3] Include blocked transcript names in the resolver error — src/context-engine/registry.ts:546-555
    The block path computes the blocked paths but throws an error with only a count and threshold. Since block is the operator-facing stop condition, include sanitized basenames or session ids so the operator knows which transcript to rotate without hunting logs.
    Confidence: 0.82
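
The dirname finding above is easy to check by hand. The sketch below uses a tiny import-free parent-directory helper (an assumption standing in for path.dirname) and an illustrative absolute path that follows the `<state>/agents/<id>/sessions` layout quoted in the finding.

```typescript
// Minimal POSIX-style parent-directory helper so the sketch needs no imports.
function parentDir(p: string): string {
  const idx = p.lastIndexOf("/");
  return idx <= 0 ? "/" : p.slice(0, idx);
}

// Walk `levels` directories up from `p`.
function ancestor(p: string, levels: number): string {
  let current = p;
  for (let i = 0; i < levels; i++) current = parentDir(current);
  return current;
}

// Illustrative path following the <state>/agents/<id>/sessions layout.
const sessionsDir = "/home/user/.openclaw/agents/main/sessions";
const agentDir = ancestor(sessionsDir, 1); // "/home/user/.openclaw/agents/main"
const stateRoot = ancestor(sessionsDir, 3); // "/home/user/.openclaw"
```

Three hops skip past both the agent directory and the `agents` directory, which is why a per-agent store would be missed under this layout.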

Overall correctness: patch is incorrect
Overall confidence: 0.92

Acceptance criteria:

  • pnpm test src/context-engine/fallback-guard.test.ts src/config/sessions/artifacts.test.ts src/gateway/server-startup-post-attach.test.ts
  • pnpm test src/context-engine/
  • pnpm exec oxfmt --check --threads=1 src/context-engine/fallback-guard.ts src/context-engine/fallback-guard.test.ts src/context-engine/registry.ts src/gateway/server-startup-post-attach.ts src/config/schema.help.ts src/config/types.base.ts src/config/schema.base.generated.ts CHANGELOG.md
  • pnpm check:changed in Testbox before handoff if the branch is otherwise ready

Likely related people:

  • steipete: Recent commits touched gateway startup hot paths and session maintenance/write-lock behavior, including server-startup-post-attach.ts and session-management docs. (role: recent gateway and session-maintenance maintainer; confidence: high; commits: fa866d562ed4, 0b1fbeabed8e, f7ed29e11812; files: src/gateway/server-startup-post-attach.ts, docs/reference/session-management-compaction.md, src/config/sessions/artifacts.ts)
  • jalehman: Multiple recent context-engine registry changes list @jalehman as reviewer or coauthor, including runtime context, contract validation, and third-party engine compatibility work. (role: context-engine reviewer and coauthor; confidence: high; commits: d8a600f2ad01, 263a190fc9e0, 2677f7cf1446; files: src/context-engine/registry.ts)
  • jarimustonen: Authored the recent ContextEngineFactory runtime context change in the central registry path that this PR extends with agentId. (role: context-engine runtime-context contributor; confidence: medium; commits: d8a600f2ad01; files: src/context-engine/registry.ts)
  • gumadeiras: Introduced session/cron maintenance hardening and cleanup UX that established much of the current session maintenance surface this PR extends. (role: session maintenance contributor; confidence: medium; commits: eff3c5c70778; files: src/config/sessions/artifacts.ts, docs/reference/session-management-compaction.md)
  • vincentkoc: Local blame on the current checkout attributes the refreshed config docs/schema baseline across the config surfaces touched by this PR to Vincent Koc. (role: recent config schema/docs maintainer; confidence: medium; commits: 62fb50d7fc5d; files: src/config/schema.help.ts, src/config/schema.base.generated.ts, src/config/types.base.ts)

Remaining risk / open question:

  • The default auto archive policy can rename local transcripts during startup, so maintainers should explicitly approve the product policy before merge.
  • I did not run tests because this was a read-only review; findings are source-backed against PR head and current main.

Codex review notes: model gpt-5.5, reasoning high; reviewed against e5ec14a06a67.

Contributor

Copilot AI left a comment

Pull request overview

Adds a defensive “context-engine fallback session-size guard” so that when a configured context engine fails to resolve and the gateway falls back to legacy, the system scans the affected agent’s session transcript directory and applies a configurable policy to oversized .jsonl transcripts.

Changes:

  • Add applyContextEngineFallbackGuard() (with warn / archive / block / auto) plus unit tests.
  • Invoke the guard from resolveContextEngine() at each fallback site; plumb agentId from the embedded runner.
  • Introduce new config surface session.maintenance.contextFallbackGuard.{sizeBytes,action} across schema/types/help/labels and document it in CHANGELOG.md.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/context-engine/registry.ts | Runs the fallback guard before returning the default context engine; adds agentId to resolver options. |
| src/context-engine/fallback-guard.ts | Implements transcript-dir scan + size threshold policy actions (warn/archive/block/auto). |
| src/context-engine/fallback-guard.test.ts | Unit tests covering action behaviors, parsing, filtering, dedup, and failure paths. |
| src/config/zod-schema.session.ts | Adds Zod validation for session.maintenance.contextFallbackGuard and validates sizeBytes. |
| src/config/types.base.ts | Adds typed config definitions for the new guard. |
| src/config/schema.labels.ts | Adds labels for the new config keys. |
| src/config/schema.help.ts | Adds help text describing the new guard behavior and defaults. |
| src/config/schema.base.generated.ts | Regenerates the base schema output to include the new keys. |
| src/agents/pi-embedded-runner/run.ts | Passes params.agentId into resolveContextEngine() so the guard scans the correct agent. |
| CHANGELOG.md | Documents the new guard config and semantics. |

Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/registry.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
…rd + operator recovery prompt (openclaw#76940)

Three follow-ups to the initial guard addition based on operator feedback:

1) Default threshold 1mb → 2mb. 1MiB jsonl is roughly 250k tokens of message
   content; 2MiB is roughly 500k tokens. 500k tokens already overflows every
   shipping context window — for models in the 200-256k effective-window
   range it overflows much sooner. Operators on smaller-context models can
   still dial down via session.maintenance.contextFallbackGuard.sizeBytes.

2) Boot-time guard (applyContextEngineBootGuard). The on-fallback path only
   catches "configured engine failed to load." It misses the much more common
   case: no context engine was ever configured. The legacy engine windows
   the prompt in-memory at request time but never shrinks the on-disk jsonl,
   so an unmanaged session grows append-only until the gateway stalls on
   next start. The boot guard runs once at startup and applies the same
   policy when slots.contextEngine is unset/legacy or the configured plugin
   is missing from loadedPluginIds. Both triggers funnel into the same
   applyContextEngineFallbackGuard implementation; one config knob, one
   policy, two entry points.

3) Operator-facing message rewrite. The terse single-line warn/archive log
   is replaced with a structured block that names the file, the engine that
   failed, the size, the available repair commands (openclaw doctor --fix /
   sessions archive / config set slots), AND a copy-pasteable recovery
   prompt for the next agent turn. The prompt instructs the agent to read
   the archived tail (last ~200 non-system messages, group into chunks of
   1-2k tokens each, stop at ~40k tokens aggregate), giving the fresh
   session enough context to continue meaningfully. Sized so we use the
   fresh session's available context window — not so miserly that the user
   loses their working state, not so generous that we eat the whole window.

Tests:
  - 17 unit tests pass (12 original + 5 new for boot-guard / recovery prompt)
  - Existing 34 src/context-engine tests unchanged
  - Lint clean on changed files

Wiring:
  - fallback-guard.ts: bump DEFAULT_FALLBACK_GUARD_SIZE_BYTES, add
    renderWarnMessage / renderArchiveMessage / renderRecoveryPrompt,
    add applyContextEngineBootGuard
  - server-startup-post-attach.ts: invoke boot guard right after
    logGatewayStartup; never let guard exceptions stall startup
  - CHANGELOG: expanded entry covering both trigger paths and threshold
    rationale

Refs openclaw#76940.
openclaw-barnacle Bot added the "gateway" (Gateway runtime) and "size: XL" labels and removed the "size: L" label May 3, 2026
@100yenadmin
Contributor Author

Pushed 30002174db addressing operator review:

1. Default threshold 1mb → 2mb

1MiB jsonl ≈ 250k tokens of message content; 2MiB ≈ 500k tokens. 500k tokens already overflows every shipping context window, and for models in the 200-256k effective-window range it overflows much sooner. 2mb is the realistic guardrail; operators on smaller-context models can still dial down via session.maintenance.contextFallbackGuard.sizeBytes.
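
The threshold arithmetic can be written down directly. This sketch simply takes the comment's ratio (1 MiB of jsonl treated as roughly 250k tokens) at face value; the ratio is the author's estimate, not a measured constant, and both function names are made up for illustration.

```typescript
const MIB = 1024 * 1024;
// Ratio taken from the comment above: 1 MiB of jsonl is treated as roughly 250k tokens.
const TOKENS_PER_MIB = 250000;

// Rough token estimate for a transcript of the given size.
function estimateTokens(sizeBytes: number): number {
  return Math.round((sizeBytes / MIB) * TOKENS_PER_MIB);
}

// True when the estimate alone would not fit in the given context window.
function exceedsWindow(sizeBytes: number, contextWindowTokens: number): boolean {
  return estimateTokens(sizeBytes) > contextWindowTokens;
}
```

Under this estimate a 2 MiB transcript comes out at 500k tokens, and even a 1 MiB transcript exceeds a 200k-token window, which is the case the "dial down" advice is aimed at.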

2. Boot-time guard in addition to on-fallback

The original PR only caught "configured engine failed to load." It missed the much more common case: no context engine was ever configured. The legacy engine windows the prompt in-memory at request time but never shrinks the on-disk jsonl, so an unmanaged session grows append-only until the gateway stalls on next start. (See cross-referenced reports: #64767 444MB jsonl, #66360 unbounded growth, #73691 MEMORY.md gateway freeze; all have this same shape.)

The boot guard runs once at startup (server-startup-post-attach.ts right after logGatewayStartup) and applies the same policy when:

  • slots.contextEngine is unset / legacy / empty → no engine ever managed sessions, OR
  • slots.contextEngine is set but the plugin isn't in loadedPluginIds → engine failed to load

Both trigger paths funnel into the same applyContextEngineFallbackGuard implementation. One config knob, one policy, two entry points.

When the configured engine is loaded and active, the boot guard short-circuits — the engine itself is responsible for size management (LCM rewrites the jsonl on every compaction, so its sessions stay bounded indefinitely).
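
The two boot-guard triggers and the short-circuit can be summarised as one decision function. A minimal sketch, assuming the slot value is a plain string and loadedPluginIds behaves like a set; the function name, the return shape, and the reason labels are hypothetical.

```typescript
// Hypothetical result shape: either skip the guard, or run it with a reason kind.
type BootGuardTrigger =
  | { run: false }
  | { run: true; reason: "no-engine-configured" | "engine-not-loaded" };

function bootGuardTrigger(
  contextEngineSlot: string | undefined,
  loadedPluginIds: ReadonlySet<string>,
): BootGuardTrigger {
  const slot = (contextEngineSlot ?? "").trim();
  // Unset / empty / legacy: no engine ever managed these sessions.
  if (slot === "" || slot === "legacy") {
    return { run: true, reason: "no-engine-configured" };
  }
  // Configured but missing from the loaded plugin set: the engine failed to load.
  if (!loadedPluginIds.has(slot)) {
    return { run: true, reason: "engine-not-loaded" };
  }
  // Engine is loaded and active: it owns size management, so the guard short-circuits.
  return { run: false };
}
```

Returning the reason kind alongside the decision also gives the message renderer what it needs to word the two triggers differently.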

3. Operator-facing message rewrite

The terse single-line warn/archive log is replaced with a structured block that names the file, the engine that failed, the size, the repair commands, AND a copy-pasteable recovery prompt for the next agent turn:

[context-engine] Session-size guard archived a transcript that would have stalled the gateway.

  Reason:    Context engine "lossless-claw" is configured but failed
             (engine "lossless-claw" is not registered). Falling back to the default
             "legacy" engine would have loaded the full transcript on next start.

  Archived:  ~/.openclaw/agents/main/sessions/b1ed0fe1-….jsonl
             → b1ed0fe1-…archived-no-context-engine-2026-05-04T03-15-22.jsonl
             (6.30 MiB — 40k+ tokens of message content)

  Next session start will be fresh and small. To recover the prior context,
  paste this prompt into the agent on the first turn:
  ┌─────────────────────────────────────────────────────────────────────────┐
  │ My previous session was archived because the configured context-engine    │
  │ plugin failed to load and the transcript would have overflowed the model  │
  │ context on next gateway start. Read the archived transcript at:           │
  │                                                                           │
  │   ~/.openclaw/agents/main/sessions/b1ed0fe1-…archived-…jsonl              │
  │                                                                           │
  │ Take the last ~200 non-system messages (skip heartbeat, synthetic, and    │
  │ bootstrap turns). Group them into chronological chunks of ~1000-2000      │
  │ tokens each — one chunk per coherent unit of work (a tool-call run, a     │
  │ topic shift, a multi-message exchange). For each chunk emit a 1000-2000   │
  │ token summary that:                                                       │
  │   - names the goal of the work in that chunk,                             │
  │   - lists tools called with key inputs/outputs (file paths, commits,      │
  │     decisions),                                                           │
  │   - notes unresolved threads, errors, or pending follow-ups.              │
  │                                                                           │
  │ Stop at ~40k tokens of aggregate summary so the fresh session keeps       │
  │ headroom. Output chunks in chronological order with one-line dividers     │
  │ like "chunk N: <topic>" so I can reference them later. After the chunks,  │
  │ give:                                                                     │
  │   - "open threads": anything in-flight,                                   │
  │   - "decisions made": anything settled,                                   │
  │   - "next likely action": what I would have done next.                    │
  │                                                                           │
  │ That summary is now my working context — proceed from there.              │
  └─────────────────────────────────────────────────────────────────────────┘

  Repair the engine plugin so this does not repeat:
    openclaw doctor --fix

  Or remove the configured slot to fall back cleanly without this guard:
    openclaw config set plugins.slots.contextEngine ""

Sizing rationale on the prompt:

  • ~200 messages of tail (not 100): the user just got a fresh session with the full context window available, and the archived tail is where the most recent in-progress work lives. Smaller tails lose too much.
  • 1-2k tokens per chunk: matches LCM's own chunked-summary granularity, so the fresh session gets useful per-chunk references rather than one mushy paragraph.
  • ~40k tokens aggregate ceiling: leaves the fresh session ~200-300k tokens of headroom for ongoing work (depending on model). Big enough to actually carry the work forward, small enough not to monopolize the new window.

Validation

  • 72 tests pass (17 fallback-guard cases including 5 new boot-guard cases + 2 new recovery-prompt assertions; 34 existing context-engine tests unchanged; 21 cortex/etc unchanged)
  • Lint clean on changed files
  • Locally swap-tested against a live gateway with LCM as the active engine: boot guard correctly short-circuits when LCM loads, and the recovery prompt was readable enough to paste straight into a chat.

Operator notes

The boot guard runs in a try/catch so a guard exception can never stall gateway startup — the worst case degrades to "no guard fired this boot."

The on-fallback guard remains for the request-time path (resolver fails mid-conversation rather than at boot), so a regression that takes down the engine after boot is still caught.
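
The "guard exception can never stall startup" note amounts to a wrapper like the one below. It is a sketch only: `runBootGuardSafely`, the injected `warn` callback, and the log wording are all hypothetical and not taken from server-startup-post-attach.ts.

```typescript
// Hypothetical wrapper: run the guard, swallow any exception, report whether it completed.
function runBootGuardSafely(
  guard: () => void,
  warn: (message: string) => void,
): boolean {
  try {
    guard();
    return true;
  } catch (err) {
    // Worst case degrades to "no guard fired this boot"; startup continues.
    const detail = err instanceof Error ? err.message : String(err);
    warn(`[context-engine] boot guard failed; continuing startup: ${detail}`);
    return false;
  }
}
```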

@100yenadmin
Contributor Author

@steipete @vincentkoc I recommend including this fix in the hotfix so that a disabled or deleted context engine can no longer take down the gateway.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/config/schema.help.ts
Comment thread src/config/types.base.ts
Comment thread CHANGELOG.md
Comment thread src/gateway/server-startup-post-attach.ts
Comment thread src/context-engine/fallback-guard.ts
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.

Comment thread src/context-engine/fallback-guard.ts
Comment thread src/config/schema.help.ts
Comment thread CHANGELOG.md
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/gateway/server-startup-post-attach.ts
Comment thread src/context-engine/registry.ts
Comment thread src/config/types.base.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
100yenadmin requested a review from Copilot May 3, 2026 22:50
steipete closed this May 3, 2026
100yenadmin pushed a commit to electricsheephq/openclaw-local-test that referenced this pull request May 3, 2026
…rd (openclaw#76940)

Addresses Copilot's 5 inline review comments on PR openclaw#76950 plus 13
additional findings from a parallel adversarial sweep. Single
consolidated commit; per-finding diff lives below.

== Copilot review comments ==

(1) Archive-failure path used optional `logger.error?.()` so the
    fallback signal could be silently lost on loggers without `error`.
    Now ALWAYS emits a `warn` (deduped via warnedPaths) in addition to
    the optional error, so the signal survives loggers that only have
    warn.
    File: src/context-engine/fallback-guard.ts (archive catch block)

(2) The block-action error thrown from resolveContextEngine() didn't
    name the offending transcripts. Now includes basenames of the
    blocked files (capped at 5 + "+N more"), so operators see exactly
    which sessions trip the gate without leaking absolute paths into
    error stacks shared in bug reports.
    File: src/context-engine/registry.ts (fallbackToDefault helper)

(3) `defaultHasContextEngineHistory()` walked path.dirname THREE times
    from sessionsDir, landing at ~/.openclaw (global state) instead of
    the agent state dir. The auto-action heuristic mis-detected
    history. Now walks two dirs (sessionsDir → agentDir) and ALSO
    checks the global state root for legacy LCM installs that keep a
    shared sqlite store. Function additionally takes the injected
    `ioFs` so tests can stub it (previously bare `fs.existsSync`).
    File: src/context-engine/fallback-guard.ts

(4) The custom `isLiveSessionTranscript` filter accepted non-primary
    artifacts (.trajectory.jsonl, .checkpoint.<uuid>.jsonl) and could
    exclude legitimate primaries with `.deleted.` in id. Replaced with
    `isPrimarySessionTranscriptFileName` from
    src/config/sessions/artifacts.ts — the same predicate every other
    transcript code path uses. Custom filter deleted.
    File: src/context-engine/fallback-guard.ts

(5) `ApplyFallbackGuardOptions.agentDir` was documented + accepted by
    callers but never read. Now used to derive sessionsDir
    (`<agentDir>/sessions`) when supplied, before falling back to
    agentId-based resolution. Embedded runners that scope to a
    non-default state root now have the guard scoped correctly.
    File: src/context-engine/fallback-guard.ts (resolution order block)
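
For readers skimming item (2), the capped basename list could look roughly like the sketch below. The helper names `basenameOf` and `formatBlockedTranscripts` and the constant are illustrative assumptions, not the identifiers used in registry.ts; only the "basenames, capped at 5, then +N more" behavior comes from the commit message.

```typescript
// Hypothetical sketch: names are assumptions, behavior follows finding (2).
const MAX_LISTED_TRANSCRIPTS = 5;

function basenameOf(filePath: string): string {
  // Split on both POSIX and Windows separators so no directory part survives.
  return filePath.split(/[\\/]/).pop() ?? filePath;
}

function formatBlockedTranscripts(paths: string[]): string {
  const names = paths.map(basenameOf);
  const listed = names.slice(0, MAX_LISTED_TRANSCRIPTS);
  const hidden = names.length - listed.length;
  // Only mention the overflow when something was actually cut off.
  return hidden > 0
    ? `${listed.join(", ")} (+${hidden} more)`
    : listed.join(", ");
}
```

Keeping only basenames means the error can be pasted into a bug report without exposing the absolute state directory.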

== Additional adversarial-sweep findings ==

(6) [P0] Archive filename `<id>.archived-no-context-engine-<ts>.jsonl`
    was NOT recognized by `isPrimarySessionTranscriptFileName`, so the
    archived file got loaded as a live session on next gateway start —
    the guard accomplished nothing in the case it exists to fix. Now
    extends `SessionArchiveReason` with "context-fallback" and uses the
    canonical `<id>.jsonl.context-fallback.<iso-ts>-<nonce>` shape.
    Existing disk-budget pruning + transcript helpers correctly
    recognize and exclude these files.
    Files: src/config/sessions/artifacts.ts,
           src/context-engine/fallback-guard.ts (buildArchivePath)

(7) [P0] Doc-drift: schema.help.ts + types.base.ts JSDoc said
    "Default `1mb`" after the bump to 2mb in the prior commit. Both
    sites updated; types.base.ts now points readers at the constant
    in fallback-guard.ts so future changes stay in sync.
    Files: src/config/schema.help.ts, src/config/types.base.ts

(8) [P0] `action: "block"` was silently downgraded to warn at gateway
    boot — the boot wrapper discarded the outcome and never called
    `fallbackGuardOutcomeIsBlocking`. Now collects blocking outcomes
    across all agents and sets `process.exitCode = 1` with a
    structured error log, so launchd / systemd / docker treat the boot
    as unhealthy. Startup itself does not throw (degrades to
    "no protection this boot" if the guard faults internally).
    File: src/gateway/server-startup-post-attach.ts

(9) [P0] Boot guard was hardcoded to the "main" agent. Multi-agent
    installs (concierge, support, etc.) had no boot-time protection.
    Now iterates every agent under ~/.openclaw/agents/ via
    readdirSync(withFileTypes); falls back to DEFAULT_AGENT_ID for
    fresh installs / test environments.
    File: src/gateway/server-startup-post-attach.ts

(10) [P0] Four resolveContextEngine call sites passed no agentId, so
     when fallback fired for a non-default agent the guard walked the
     wrong agent's session directory. Plumbed agentId through:
       - src/agents/pi-embedded-runner/compact.queued.ts
         (hoisted resolveSessionAgentIds early)
       - src/agents/command/cli-compaction.ts
         (extended cliCompactionDeps.resolveContextEngine signature)
       - src/agents/subagent-spawn.ts
         (parseAgentSessionKey on requesterInternalKey)
       - src/agents/subagent-registry.ts
         (parseAgentSessionKey on childSessionKey)

(11) [P1] `renameSync` race: archive timestamp was millisecond-
     resolution; two archives in the same ms (rapid fallback across
     agents, or two large transcripts in one pass) collided and
     `renameSync` silently overwrites. Now appends a 6-hex random
     nonce.
     File: src/context-engine/fallback-guard.ts (buildArchivePath)

(12) [P1] `sizeBytes: 0` (or "0b" / "0kb") passed schema validation
     but was silently downgraded to the default in resolveGuardConfig
     — opposite of operator intent (trying to disable or tighten the
     guard would actually loosen it to default). Now logs an explicit
     warn so the misconfig is visible.
     File: src/context-engine/fallback-guard.ts (resolveGuardConfig)

(13) [P1] `openclaw sessions archive <id>` doesn't exist as a
     subcommand — operator-facing message printed a fake command.
     Replaced with `openclaw sessions cleanup --enforce`.
     `openclaw config set ... ""` likewise replaced with the canonical
     `openclaw config unset ...` form.
     File: src/context-engine/fallback-guard.ts (renderWarnMessage,
     renderArchiveMessage)

(14) [P1] Multi-line operator block destroyed by JSON log encoding
     (the JSON file logger encodes \n literally; the box-drawing
     characters became one mile-long line). Replaced unicode box
     drawing with explicit `----- BEGIN RECOVERY PROMPT -----` /
     `----- END RECOVERY PROMPT -----` delimiters that survive
     JSON encoding and are easy to grep / extract.
     File: src/context-engine/fallback-guard.ts

(15) [P1] Home-dir leaked into operator-facing paths. When operators
     paste the message into bug reports / GitHub issues / chat the
     username was exposed. Added `redactHomePrefix()` that substitutes
     `~` for the user's home prefix in the rendered prompt + Archived
     block. The structured `summary` line keeps the absolute path for
     grep-by-path.
     File: src/context-engine/fallback-guard.ts (renderRecoveryPrompt,
     renderWarnMessage, renderArchiveMessage)

(16) [P1] Two structurally-identical action union types
     (`SessionContextFallbackGuardAction` in types.base.ts and
     `FallbackGuardAction` in fallback-guard.ts). Now a single source
     of truth: `FallbackGuardAction` aliases the public type.
     Files: src/context-engine/fallback-guard.ts

(17) [P1] Recovery prompt assumed agent could safely Read a multi-MiB
     jsonl. Now explicitly tells the agent to use Read with
     offset/limit, skip individual messages over ~10k tokens, and
     bound itself to ~40k aggregate. Also added a "next likely action"
     bullet so the fresh session resumes work cleanly.
     File: src/context-engine/fallback-guard.ts (renderRecoveryPrompt)

(18) [P1] `agent="(default)"` printed in summary line when agentId
     was unset, but actual on-disk path uses `main`. Operators
     grepping by the printed label couldn't find the path. Now uses
     `DEFAULT_AGENT_ID` ("main") consistently.
     File: src/context-engine/fallback-guard.ts (renderAgentLabel)

(19) [P2] Platform-specific restart hint: `launchctl` was
     unconditional. Now branches on process.platform to suggest
     `launchctl` on macOS, `systemctl --user restart` on Linux,
     `Restart-Service` on Windows.
     File: src/context-engine/fallback-guard.ts (platformRestartHint)

(20) [P2] `lstatSync` used in place of `statSync` so a symlink in the
     sessions dir doesn't get archived (previously could rename the
     link and orphan the target).
     File: src/context-engine/fallback-guard.ts

(21) [P2] Aggregated stat-error count is now logged at end of pass
     so an operator sees a signal when (e.g.) every file in the
     sessions dir is unreadable due to permission/quarantine drift —
     previously the swallowed errors produced an empty outcome with
     no clue why.
     File: src/context-engine/fallback-guard.ts (statErrors counter)
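
As an illustration of item (15), a home-prefix redaction can be written as a pure function. This is a minimal sketch under assumptions: the real `redactHomePrefix()` signature is not shown in this PR text, so the `homeDir` parameter (injected here instead of being read from the OS) is hypothetical.

```typescript
// Hypothetical sketch: the two-argument signature is an assumption for testability.
function redactHomePrefix(filePath: string, homeDir: string): string {
  // Normalize the home dir by dropping trailing separators.
  const home = homeDir.replace(/[\\/]+$/, "");
  if (home.length === 0) return filePath;
  if (filePath === home) return "~";

  const hasPrefix = filePath.slice(0, home.length) === home;
  const next = filePath.charAt(home.length);
  // Redact only on a path-segment boundary, so "/home/alice2/..." is left
  // alone when the home dir is "/home/alice".
  if (hasPrefix && (next === "/" || next === "\\")) {
    return "~" + filePath.slice(home.length);
  }
  return filePath;
}
```

The segment-boundary check matters because a plain string-prefix replace would also rewrite sibling directories that merely share the same leading characters.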

== Tests ==

  - 25 fallback-guard.test.ts cases pass (was 17 — 8 added covering
    canonical archive name, prior archives ignored, home-dir redact,
    nonce collision-avoidance, sizeBytes:0 warns, agentDir option used,
    archive-failure-also-warns, lstat for symlinks)
  - 6 zod-schema.session-maintenance-extensions.test.ts cases for
    contextFallbackGuard (valid/absent/casing/typo/malformed/wrong-level)
  - All 142 src/config/sessions tests pass unchanged
  - 95 total context-engine + maintenance-extension tests pass
  - Lint clean on changed files (the 2 remaining `__testing` warnings
    are pre-existing in v2026.5.2 and unrelated)

Refs openclaw#76940. Addresses inline review on PR openclaw#76950.
@steipete
Contributor

steipete commented May 3, 2026

Thanks for jumping on this and for the detailed incident write-up.

For the hotfix, we are not going to take this core guard as-is. The immediate failure is a lossless-claw compatibility/install/load problem, so the primary fix belongs in lossless-claw rather than OpenClaw core. Core hardening here is still worth discussing, but startup-time transcript auto-archive is product-sensitive and this PR currently has unresolved implementation blockers: merge conflicts, a logger type/runtime issue, mismatched config default docs, a non-existent recovery command, and multi-agent/default-agent gaps.

Closing this PR for now. If we revisit core hardening, the safer shape is likely a narrower diagnostic/warn-only guard first, with docs and recovery paths aligned to existing session maintenance.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.test.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/gateway/server-startup-post-attach.ts
@100yenadmin
Contributor Author

100yenadmin commented May 3, 2026

@steipete not sure if your AI wrote that, but the issue here isn't lossless-claw. We have no control over session file management; LCM manages the session only when it is enabled.

There is nothing we can do when the plugin system is rewritten and mass-disables plugins. Plugin builders can't conform to new requirements that ship in a patch without notice; the normal process is to publish the requirements in an update and announce that the new system phases in on date X. We lose a lot of goodwill constantly rebuilding plugin infra.

That said, LCM handles session management while it is enabled, so when it is disabled or deleted the unmanaged session file breaks the gateway. I was in the middle of the commits for this fix but got blocked when the PR was closed.

On the LCM side we can start truncating and managing the session file while LCM is enabled, but that is your call if you would prefer we do that.

100yenadmin pushed a commit to electricsheephq/openclaw-local-test that referenced this pull request May 3, 2026
… dead activeContextEngineId, accurate token estimate, boot-message phrasing (openclaw#76940)

Four additional Copilot/sweep findings on the prior consolidated commit:

(22) [P1] params.log.error doesn't typecheck on the gateway post-attach
     surface (the typed log here is { info, warn } only — error is on
     the richer subsystem logger but not on this narrowed shape).
     Switched the boot-guard logger sink to omit `error` (the guard
     always emits a `warn` fallback when the optional error is absent)
     and the structured "block" startup line to use `warn`. The
     non-zero process.exitCode is the real signal external supervisors
     pick up, so demoting the channel doesn't lose anything.
     File: src/gateway/server-startup-post-attach.ts

(23) [P1] `activeContextEngineId` was required on ApplyBootGuardOptions
     but applyContextEngineBootGuard ignored it (re-read the same value
     from `options.config.plugins.slots.contextEngine`). Easy to
     misinterpret as "the boot guard respects the explicit
     activeContextEngineId override" when it does not. Removed the
     dead field; callers now pass only the loaded set + config.
     File: src/context-engine/fallback-guard.ts (ApplyBootGuardOptions)

(24) [P1] Operator messages printed `RECOVERY_PROMPT_MAX_SUMMARY_TOKENS`
     ("40k+ tokens of message content") for every transcript regardless
     of file size. Now estimates from the actual byte size with a
     simple ~4-chars-per-token heuristic, formatted with reasonable
     significant digits ("~580k tokens (estimated)" / "~1.4M tokens
     (estimated)").
     File: src/context-engine/fallback-guard.ts (estimateTokensFromBytes)

(25) [P2] The "Reason: Context engine X is configured but failed (...)"
     line in operator messages reads incorrectly when the boot guard
     fires for the no-engine case (failedEngineId is the synthetic
     "(legacy/none)" label, not a real configured engine). Branch on
     that label and emit a no-engine-specific reason block instead.
     File: src/context-engine/fallback-guard.ts (renderReasonLines)

Tests: all 25 fallback-guard cases still pass; the GuardMessageContext
type now carries `sizeBytes` so the token estimate has the source data
without re-deriving from the formatted MiB string.
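
The byte-based estimate from item (24) can be sketched as below. The ~4-characters-per-token heuristic and the output shapes ("~580k tokens (estimated)", "~1.4M tokens (estimated)") come from the commit message; the function name `formatEstimatedTokens` and the exact rounding rules are assumptions.

```typescript
// Hypothetical sketch: rounding thresholds are assumptions, heuristic is from item (24).
const APPROX_BYTES_PER_TOKEN = 4;

function formatEstimatedTokens(sizeBytes: number): string {
  const tokens = Math.max(0, Math.round(sizeBytes / APPROX_BYTES_PER_TOKEN));
  // Switch to "M" slightly below one million so we never print "~1000k".
  if (tokens >= 999_500) {
    return `~${(tokens / 1_000_000).toFixed(1)}M tokens (estimated)`;
  }
  if (tokens >= 1_000) {
    return `~${Math.round(tokens / 1_000)}k tokens (estimated)`;
  }
  return `~${tokens} tokens (estimated)`;
}
```

Because the estimate is derived from the raw byte count, it scales with the actual transcript instead of repeating one fixed figure for every file.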

Refs openclaw#76940. Addresses Copilot review on PR openclaw#76950.
@100yenadmin
Contributor Author

Adversarial-review pass: Copilot's review + 3 internal sub-agent sweeps

Two new commits address everything the Codex bot review found, plus 16 additional findings from a parallel adversarial-agent sweep I ran in 4 dimensions (concurrency / fs, config validation, integration / multi-agent, UX / recovery prompt).

Commits in this push

  • 93d96d4afe: main consolidated fix (13 files, +584/-117). Addresses Copilot's 5 inline findings + 16 from the adversarial sweep.
  • 49c8ccc156: four remaining nits found in a final pass (params.log.error type mismatch, dead activeContextEngineId field, hardcoded "40k+ tokens" string regardless of file size, "configured but failed" phrasing wrong for the boot-guard no-engine path).

Each Copilot inline comment has a reply pointing at the specific commit, the file:line, and the test that covers the fix.

Highest-impact fix (would-have-killed-the-PR-purpose level)

The previous archive shape was <id>.archived-no-context-engine-<ts>.jsonl — which isPrimarySessionTranscriptFileName did NOT recognize as an archive. So the file would have been loaded as a live session on next gateway start, and the guard would have accomplished literally nothing in the case it exists to fix. Now extends SessionArchiveReason with "context-fallback" and uses the canonical <id>.jsonl.context-fallback.<iso-ts>-<6-hex-nonce> shape, so existing transcript helpers correctly exclude these files. The nonce also fixes a same-millisecond renameSync race that would silently overwrite one of two concurrent archives.
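
To make the naming problem concrete, here is a minimal sketch of the two shapes. It rests on stated assumptions: `looksLikePrimaryTranscript` is a deliberately simplified stand-in for `isPrimarySessionTranscriptFileName` (the real predicate also handles trajectory and checkpoint artifacts), and both `buildArchiveName` and the colon-free timestamp format are illustrative rather than the code in fallback-guard.ts.

```typescript
// Hypothetical sketch: helper names and timestamp formatting are assumptions.
const CONTEXT_FALLBACK_REASON = "context-fallback";

function randomHexNonce(): string {
  // 6 hex chars (24 bits). Math.random is fine for collision avoidance, not for security.
  const n = Math.floor(Math.random() * 0x1000000);
  return ("000000" + n.toString(16)).slice(-6);
}

function buildArchiveName(sessionId: string, now: Date, nonce: string): string {
  // Colons are replaced so the name is also a valid Windows filename (assumed convention).
  const ts = now.toISOString().replace(/:/g, "-");
  return `${sessionId}.jsonl.${CONTEXT_FALLBACK_REASON}.${ts}-${nonce}`;
}

// Simplified stand-in: treats anything ending in ".jsonl" as a live transcript.
function looksLikePrimaryTranscript(fileName: string): boolean {
  return /\.jsonl$/.test(fileName);
}

const oldShape = "b1ed.archived-no-context-engine-1777848600000.jsonl";
const newShape = buildArchiveName("b1ed", new Date(), randomHexNonce());
console.log(looksLikePrimaryTranscript(oldShape)); // old shape still ends in .jsonl
console.log(looksLikePrimaryTranscript(newShape)); // suffix after .jsonl marks an archive
```

The point of the sketch is that any suffix placed before `.jsonl` leaves the file looking live, while a suffix placed after it does not; the nonce separates two archives created in the same millisecond.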

Other significant fixes beyond Copilot's 5

| Severity | Bug | Fix |
| --- | --- | --- |
| P0 | `action: "block"` silently downgraded to warn at gateway boot | Boot path now collects blocking outcomes across all agents and sets `process.exitCode = 1` so launchd/systemd/docker treat the boot as unhealthy |
| P0 | Boot guard hardcoded to the "main" agent; multi-agent installs got no boot protection | Iterates every dir under `~/.openclaw/agents/` |
| P0 | 4 other `resolveContextEngine` call sites passed no agentId; guard walked the wrong agent's sessions on fallback | agentId plumbed through subagent-spawn, cli-compaction, compact.queued, subagent-registry |
| P1 | `sizeBytes: 0` silently used the default (opposite of operator intent) | Now logs an explicit warn so the misconfig is visible |
| P1 | `openclaw sessions archive <id>` printed in operator message; that subcommand doesn't exist | Replaced with `openclaw sessions cleanup --enforce` and `openclaw config unset ...` |
| P1 | Multi-line operator block destroyed by JSON log encoding (newlines are encoded as literal `\n`, so the box-drawing block becomes one long line) | Replaced unicode boxes with explicit `----- BEGIN/END RECOVERY PROMPT -----` delimiters |
| P1 | Home-dir leaked into paths pasted into issues | `redactHomePrefix()` substitutes `~` in the operator-facing block; absolute path still in the structured summary line for grep |
| P1 | Two structurally-identical action union types | Single source of truth: `FallbackGuardAction` aliases `SessionContextFallbackGuardAction` |
| P1 | Recovery prompt assumed agent could Read a multi-MiB jsonl whole | Prompt now tells agent to use Read with offset/limit, skip individual messages > ~10k tokens |
| P1 | "40k+ tokens" printed for every transcript regardless of size | Estimates from actual byte size: `~580k tokens (estimated)` etc. |
| P1 | Boot-guard message said "Context engine X is configured but failed" even when no engine was configured | Branches on the synthetic `(legacy/none)` label; emits accurate no-engine phrasing |
| P2 | `launchctl` restart hint was unconditional | Branches on `process.platform` for macOS / Linux / Windows |
| P2 | `statSync` followed symlinks (could archive a link, orphan target) | Switched to `lstatSync` (matches existing safe-fs helpers) |
| P2 | Aggregate statErrors count silently swallowed | Logged once at end of pass so unreadable-files-everywhere shows a signal |
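
The `sizeBytes: 0` row is easy to get wrong, so here is a rough sketch of the intended behavior. The function name `resolveSizeThreshold`, the injected `warn` callback, and the exact value of the default are assumptions; only "fall back to the default but warn loudly" is taken from the fix description, and the 2 MiB figure is one reading of the "2mb" default mentioned in the commit message.

```typescript
// Hypothetical sketch: names and the precise default constant are assumptions.
const DEFAULT_SIZE_BYTES = 2 * 1024 * 1024;

function resolveSizeThreshold(
  configured: number | undefined,
  warn: (message: string) => void,
): number {
  // Absent config is the normal case: use the default silently.
  if (configured === undefined) return DEFAULT_SIZE_BYTES;
  // 0, negatives and NaN all fail this check; they are misconfigurations, not "disable".
  if (!(configured > 0)) {
    warn(
      `sizeBytes=${configured} is not a positive size; ` +
        `using default of ${DEFAULT_SIZE_BYTES} bytes`,
    );
    return DEFAULT_SIZE_BYTES;
  }
  return configured;
}
```

Writing the check as `!(configured > 0)` rather than `configured <= 0` also catches `NaN`, which would otherwise slip through and disable the size comparison entirely.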

Test coverage

  • 25 unit tests in fallback-guard.test.ts (was 17 — 8 added covering canonical archive name, prior archives ignored, home-dir redact, nonce collision-avoidance, sizeBytes:0 warns, agentDir option used, archive-failure-also-warns, lstat for symlinks)
  • 6 zod schema tests for contextFallbackGuard (valid actions/sizes/casing/typos/wrong-nesting/back-compat-absent)
  • All 142 src/config/sessions/ tests pass unchanged
  • 95 total context-engine + maintenance-extension tests pass
  • Lint clean on changed files (the 2 remaining __testing warnings are pre-existing in v2026.5.2)

Adversarial methodology

3 sub-agents ran in parallel with focused scopes (concurrency/fs, config/back-compat, integration/UX). Each was given the PR diff + Copilot's existing 5 findings up front so they didn't duplicate. Results merged into the consolidated commit; per-finding rationale and file:line in the commit body of 93d96d4afe.

The "single conversation, three perspectives" pattern surfaced bugs Copilot missed — particularly the archive-shape-not-recognized P0 (which would have nullified the whole PR), the multi-agent-boot-guard-hardcoded-main P0, and the 4-other-call-sites-pass-no-agentId P0. Worth doing on guard-style PRs that touch multiple subsystems.


Labels

agents (Agent runtime and tooling), gateway (Gateway runtime), size: XL

Development

Successfully merging this pull request may close these issues.

Add startup-time session-size guard: auto-archive when no context engine is registered

3 participants