fix(sessions): add context-engine fallback session-size guard (#76940) by 100yenadmin · Pull Request #76950 · openclaw/openclaw

100yenadmin · 2026-05-03T21:36:22Z

TLDR: when a context engine fails, is disabled, or breaks, it takes the gateway and the session down with it, and OC gets blamed for the bad experience. This fixes that. No plugin should be able to break or disable the gateway or leave it hanging for 20+ minutes.

Summary

Implements the defensive guard proposed in #76940. When a configured context-engine plugin (e.g. lossless-claw) fails to resolve and the gateway falls back to the default legacy engine, the guard walks the affected agent's transcript directory and applies a configured action to any session jsonl exceeding the size threshold.

Real-world trigger that motivated this: the 2026.5.2 npm install silently dropped several configured extensions from the runtime plugin set, including a context-engine slot plugin. The next gateway boot loaded an existing session at 808 messages / 6.3 MB, which immediately hit context overflow on the first turn:

[gateway] http server listening (3 plugins: browser, cortex, telegram; 2.7s)   # was 7 before upgrade
[context-engine] Context engine "lossless-claw" is not registered; falling back to default engine "legacy".
[agent/embedded] [context-overflow-diag] sessionKey=agent:main:main provider=openai-codex/gpt-5.5
  source=assistantError messages=808 sessionFile=…b1ed0fe1-…jsonl diagId=ovf-… compactionAttempts=0
  observedTokens=unknown error=Context overflow: estimated context size exceeds safe threshold during tool loop.
[agent/embedded] context overflow detected (attempt 1/3); attempting auto-compaction for openai-codex/gpt-5.5

Larger sessions in the same install (200 MB jsonl files from prior work) would have been unrecoverable without manual jsonl rotation. Cross-references in the issue (#64767 — 444 MB jsonl hangs gateway, #66360, #73691, #75740) show this is a recurring class of failure.

Config surface

"session": {
  "maintenance": {
    "contextFallbackGuard": {
      "sizeBytes": "1mb",        // default; accepts "1mb", "512kb", number of bytes, etc.
      "action": "auto"           // default; "warn" | "archive" | "block" | "auto"
    }
  }
}

Action semantics:

warn — log a structured, actionable warning naming file + size + applied action. Per-process dedup so repeated resolves don't spam.
archive — rename the jsonl to <basename>.archived-no-context-engine-<ISO>.jsonl. Recoverable via existing archive-recovery work: #71537 ("Recover archived (.reset) session transcripts in memory hook + session-logs skill") and #76119 ("[codex] Include reset archives in session log searches").
block — throw from the resolver with a structured message naming the offending transcripts and the failed engine id, refusing to fall back until an operator takes action.
auto (default) — archive when the agent's state dir contains a known context-engine sqlite store (lcm.db, lossless-claw.db, context-engine.db), warn otherwise. Rationale: when an engine like LCM is in use, the jsonl is just the live buffer — the engine has the source-of-truth in SQLite, so archiving the jsonl loses at most the fresh tail (~32-64 messages, equivalent blast radius to a forced compaction). When no engine has run, the jsonl IS the only record, so auto conservatively warns.
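
The auto rule above can be sketched as a small pure function. This is only an illustration under assumptions: the function name, the type names, and the injected `exists` callback are hypothetical; only the action values and the three store filenames come from the description.

```typescript
// Hypothetical sketch; only the action names and store filenames come from the PR text.
type GuardAction = "warn" | "archive" | "block" | "auto";
type ResolvedAction = Exclude<GuardAction, "auto">;

const KNOWN_ENGINE_STORES = ["lcm.db", "lossless-claw.db", "context-engine.db"];

// `exists` is injected so the decision can be exercised without touching disk.
function resolveGuardAction(
  action: GuardAction,
  agentStateDir: string,
  exists: (filePath: string) => boolean,
): ResolvedAction {
  // Explicit actions pass through unchanged.
  if (action !== "auto") return action;
  // An engine store on disk suggests the jsonl is only the live buffer.
  const hasEngineHistory = KNOWN_ENGINE_STORES.some((name) =>
    exists(`${agentStateDir}/${name}`),
  );
  return hasEngineHistory ? "archive" : "warn";
}
```

Keeping the filesystem check behind a callback matches the stated goal that filesystem calls be injectable for testing.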

Threshold note: the default is intentionally small. 1mb is roughly 250k tokens, which would overflow gpt-5.5 in the wild. Operators can raise this via config when their workflow tolerates more.
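
For illustration, threshold parsing for values like "1mb", "512kb", or a plain byte count might look like the sketch below. The helper name, the unit table, and the choice to treat units as binary multiples (1mb = 1 MiB) are assumptions, not the module's actual code; falling back to the default on invalid input is likewise an assumption of this sketch.

```typescript
// Hypothetical sketch of sizeBytes parsing; names and fallback behavior are assumptions.
const DEFAULT_SIZE_BYTES = 1024 * 1024; // assumes "1mb" means 1 MiB

const UNIT_MULTIPLIERS: Record<string, number> = {
  b: 1,
  kb: 1024,
  mb: 1024 ** 2,
  gb: 1024 ** 3,
};

function parseSizeBytes(value: number | string | undefined): number {
  if (value === undefined) return DEFAULT_SIZE_BYTES;
  if (typeof value === "number") {
    // Plain numbers are taken as a byte count.
    return Number.isFinite(value) && value > 0 ? Math.floor(value) : DEFAULT_SIZE_BYTES;
  }
  // Accept "<number><unit>" with an optional unit, case-insensitive.
  const match = /^(\d+(?:\.\d+)?)\s*(b|kb|mb|gb)?$/i.exec(value.trim());
  if (!match) return DEFAULT_SIZE_BYTES;
  const amount = Number(match[1]);
  const unit = (match[2] ?? "b").toLowerCase();
  const bytes = Math.floor(amount * (UNIT_MULTIPLIERS[unit] ?? 1));
  return bytes > 0 ? bytes : DEFAULT_SIZE_BYTES;
}
```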

Implementation

src/context-engine/fallback-guard.ts — pure function that walks the agent transcript dir, applies action per oversized file, dedups warnings per process, falls back to warn when archive rename fails (so we never silently lose the signal). All filesystem and resolver calls are injectable for testing.
src/context-engine/registry.ts — single fallbackToDefault helper closure inside resolveContextEngine runs the guard before each of the four fallback sites (engine-not-registered, factory throw, contract validation throw, contract validation error). The block action throws from the resolver with a structured message; warn/archive/auto continue to the default engine.
src/agents/pi-embedded-runner/run.ts — plumb params.agentId through ResolveContextEngineOptions so the guard inspects the correct agent's sessions. Other resolver call sites continue to default to the primary agent id (existing behavior — opt-in plumbing for the future).
src/config/zod-schema.session.ts, types.base.ts, schema.labels.ts, schema.help.ts, schema.base.generated.ts — config schema, types, labels, help text. session.maintenance.contextFallbackGuard.{sizeBytes,action} validated alongside the existing maintenance fields.

Filename filter ignores .archived-*, .bak, .reset, .deleted, .trim-backup so we never re-archive our own archives or interfere with other rotation systems.

Tests

12 new unit tests in src/context-engine/fallback-guard.test.ts cover:

warn/archive/block/auto actions
auto resolution in both directions (history-present → archive, history-absent → warn)
threshold parsing from string ("1mb", "512kb") and number forms
default 1 MiB threshold when config absent
transcript-name filtering (skip backup/reset/deleted/archived/trim-backup)
per-process warn dedup
archive-rename failure falls back to warn (signal preserved)
missing/unreadable sessions dir returns inspected:0
fallbackGuardOutcomeIsBlocking helper

All 34 existing src/context-engine/*.test.ts tests pass unchanged. The 1 failing test in src/config/io.compat.test.ts ("logs validation warnings with real line breaks") fails on bare upstream/main too — pre-existing, unrelated.

Validation

pnpm exec vitest run src/context-engine/ → 58 tests passed
pnpm exec vitest run src/config/ → 1226 passed / 1 pre-existing failure
pnpm exec oxlint --type-aware on changed files → 0 errors
pnpm check:base-config-schema → clean (regenerated schema.base.generated.ts)

Change Type

Bug fix
Feature (new config surface)

Scope

Gateway / orchestration
API / contracts (new config keys)

Closes #76940

…aw#76940)

When a configured context-engine plugin (e.g. lossless-claw) fails to resolve and the gateway falls back to the default `legacy` engine, walk the affected agent's transcript directory and apply a configured action to any session jsonl exceeding the size threshold. Surfaces the real failure mode (engine disabled / unregistered / contract violation) instead of letting next-load context overflow stall the gateway.

Real-world trigger: the openclaw 2026.5.2 npm install silently dropped several configured extensions from the runtime plugin set, including a context-engine slot plugin. The next gateway boot loaded an existing session at 808 messages / 6.3 MB, which immediately hit context overflow on the first turn. Larger sessions in the same install (200 MB jsonl files from prior work) would have been unrecoverable without manual jsonl rotation.

Defaults:
- sizeBytes: 1mb (small enough to catch realistic overflow cases)
- action: "auto" (archive when an engine sqlite store is present and the engine has the source-of-truth; warn otherwise)

Config surface (session.maintenance.contextFallbackGuard):
- sizeBytes: number | string (e.g. "1mb", "512kb")
- action: "warn" | "archive" | "block" | "auto"

Implementation:
- New module src/context-engine/fallback-guard.ts walks the agent transcript dir, applies action per oversized file, dedups warnings per process, treats archive-rename failure as warn so signal isn't lost.
- Wired into all four resolver fallback sites in src/context-engine/registry.ts (engine-not-registered, factory throw, contract validation throw, contract validation error) via a single fallbackToDefault helper.
- "block" action throws from the resolver with a structured message naming the offending transcripts and the failed engine id.
- Plumbed agentId through ResolveContextEngineOptions so the guard inspects the correct agent's sessions; updated the main embedded runner call site. Other call sites continue to default to the primary agent id (existing behavior).

Tests:
- 12 unit tests in fallback-guard.test.ts cover warn/archive/block, auto resolution in both directions, threshold parsing, default threshold, dedup, archive-failure-falls-back-to-warn, transcript-name filtering (skip .bak / .reset / .archived / .deleted / .trim-backup), and missing sessions dir.
- All 34 existing src/context-engine tests pass unchanged.

Closes openclaw#76940

clawsweeper · 2026-05-03T21:39:28Z

Codex review: needs changes before merge.

Summary
The PR adds a context-engine fallback and boot-time session-size guard, new session.maintenance.contextFallbackGuard config/schema/help/changelog entries, agent-id plumbing, and unit coverage.

Reproducibility: yes for the review findings. Source inspection of PR head shows the logger type mismatch, stale default docs, default-agent scan gap, classifier divergence, and invalid recovery command. The underlying gateway-stall class is supported by linked reports and logs, but I did not live-reproduce it in this read-only pass.

Next step before merge
The remaining blockers are concrete changed-file repairs that an automated worker can attempt; maintainer policy review is still needed before merge because startup auto-archive is a product decision.

Security
Cleared: The diff adds local transcript inspection/rename logic, config schema/help text, startup wiring, and tests; I found no new dependency, workflow, package-resolution, install, publish, permission, or secret-handling concern.

Review findings

[P1] Route boot guard errors through an error logger — src/gateway/server-startup-post-attach.ts:720
[P2] Pass the default agent into the boot guard — src/gateway/server-startup-post-attach.ts:714-722
[P2] Align the documented fallback-guard default — src/config/schema.help.ts:1490

Review details

Best possible solution:

Land a revised version that fixes the type/build issue, aligns config defaults, scans the intended agent sessions, reuses existing transcript artifact helpers, points operators to real recovery commands, and keeps broader transcript-size caps tracked separately.

Do we have a high-confidence way to reproduce the issue?

Yes for the review findings: source inspection of PR head shows the logger type mismatch, stale default docs, default-agent scan gap, classifier divergence, and invalid recovery command. The underlying gateway-stall class is supported by linked reports and logs, but I did not live-reproduce it in this read-only pass.

Is this the best way to solve the issue?

No, not as currently written. The guard direction is plausible, but the patch should fix the concrete blockers and get maintainer agreement on the startup auto-archive policy before merge.

Full review comments:

[P1] Route boot guard errors through an error logger — src/gateway/server-startup-post-attach.ts:720
params.log in startGatewayPostAttachRuntime is still typed with only info and warn, but the new boot-guard logger calls params.log.error(...). This should fail type-checking and can be undefined for valid callers, so route errors through an error-capable logger such as params.logHooks.error or widen/provide the logger consistently.
Confidence: 0.95
[P2] Pass the default agent into the boot guard — src/gateway/server-startup-post-attach.ts:714-722
The boot guard is called without an agentId, so the fallback guard resolves only the hard-coded default sessions directory (main) instead of the configured default agent or all configured agents. Multi-agent installs where the active/default agent is not main will miss the oversized transcript that actually gets loaded.
Confidence: 0.9
[P2] Align the documented fallback-guard default — src/config/schema.help.ts:1490
The runtime constant defaults to 2 MiB, while this help text and the generated schema/type comments still say Default 1mb. Operators will tune and diagnose from the wrong threshold unless the source and generated config docs match the implementation, or the implementation is changed back.
Confidence: 0.94
[P2] Probe the agent state directory for engine history — src/context-engine/fallback-guard.ts:144-150
The auto heuristic says it checks the agent state directory, but dirname() three times from <state>/agents/<id>/sessions lands at the global state root. That can miss per-agent engine stores and make action: "auto" warn when the intended safe path is archive.
Confidence: 0.86
[P2] Reuse the session transcript artifact classifier — src/context-engine/fallback-guard.ts:157-176
This custom filter treats trajectory/checkpoint sidecars as live transcripts and skips valid primary names that merely contain substrings like .deleted. Reuse isPrimarySessionTranscriptFileName() and add only the new no-context-engine archive exclusion so the guard operates on the same primary transcript set as the rest of session maintenance.
Confidence: 0.9
[P2] Replace the nonexistent sessions archive command — src/context-engine/fallback-guard.ts:515-516
Warn-mode recovery tells operators to run openclaw sessions archive ..., but current CLI/docs only define sessions cleanup and sessions export-trajectory. In the path where the guard does not mutate files, that sends users to a command that fails, so point to an existing recovery flow or add the command and docs.
Confidence: 0.93
[P2] Branch the no-engine operator wording — src/context-engine/fallback-guard.ts:502-504
The boot guard intentionally fires when no context engine is configured, but the warning/archive messages always say a configured engine failed. For the legacy/unset trigger this misdiagnoses the cause, so branch the copy on the synthesized (legacy/none) reason or pass an explicit reason kind.
Confidence: 0.87
[P3] Include blocked transcript names in the resolver error — src/context-engine/registry.ts:546-555
The block path computes the blocked paths but throws an error with only a count and threshold. Since block is the operator-facing stop condition, include sanitized basenames or session ids so the operator knows which transcript to rotate without hunting logs.
Confidence: 0.82
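
As a sketch of what the last finding asks for, a block-path error could name the offending transcripts by basename only, so absolute paths stay out of error text that gets pasted into bug reports. The function names and the message wording are hypothetical; the cap of five names with a "+N more" suffix follows the follow-up commit described later in this thread.

```typescript
// Hypothetical sketch; function names and message wording are assumptions.
const MAX_NAMES_IN_ERROR = 5;

// Strip directories so error text never carries absolute paths.
function toBasename(filePath: string): string {
  return filePath.split(/[\\/]/).pop() ?? filePath;
}

function formatBlockedTranscripts(blockedPaths: string[]): string {
  const names = blockedPaths.map(toBasename);
  const shown = names.slice(0, MAX_NAMES_IN_ERROR);
  const hidden = names.length - shown.length;
  return hidden > 0 ? `${shown.join(", ")} (+${hidden} more)` : shown.join(", ");
}

function buildBlockError(
  engineId: string,
  blockedPaths: string[],
  thresholdBytes: number,
): Error {
  return new Error(
    `Context engine "${engineId}" failed to resolve; refusing to fall back: ` +
      `${blockedPaths.length} transcript(s) exceed ${thresholdBytes} bytes: ` +
      formatBlockedTranscripts(blockedPaths),
  );
}
```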

Overall correctness: patch is incorrect
Overall confidence: 0.92

Acceptance criteria:

pnpm test src/context-engine/fallback-guard.test.ts src/config/sessions/artifacts.test.ts src/gateway/server-startup-post-attach.test.ts
pnpm test src/context-engine/
pnpm exec oxfmt --check --threads=1 src/context-engine/fallback-guard.ts src/context-engine/fallback-guard.test.ts src/context-engine/registry.ts src/gateway/server-startup-post-attach.ts src/config/schema.help.ts src/config/types.base.ts src/config/schema.base.generated.ts CHANGELOG.md
pnpm check:changed in Testbox before handoff if the branch is otherwise ready

What I checked:

Boot logger type mismatch: PR head calls params.log.error(...) from the boot guard, while startGatewayPostAttachRuntime still types params.log with only info and warn. (src/gateway/server-startup-post-attach.ts:720, 30002174dbd3)
Default mismatch: Runtime defaults to 2 MiB, but help/generated schema/type docs still say the default is 1mb. (src/config/schema.help.ts:1490, 30002174dbd3)
Boot guard default-agent gap: The boot guard is invoked without agentId; the guard then resolves sessions with resolveSessionTranscriptsDirForAgent(options.agentId), which defaults to the hard-coded main agent when undefined. (src/gateway/server-startup-post-attach.ts:714, 30002174dbd3)
Existing transcript classifier: Current main already has isPrimarySessionTranscriptFileName() excluding trajectories, compaction checkpoints, and archive artifacts; the PR adds a divergent substring filter instead. (src/config/sessions/artifacts.ts:55, e5ec14a06a67)
Nonexistent recovery command: Current CLI/docs expose openclaw sessions cleanup and openclaw sessions export-trajectory; no openclaw sessions archive command was found, but the PR warning tells operators to run it. (src/context-engine/fallback-guard.ts:516, 30002174dbd3)
Related issue context: The PR explicitly closes #76940 ("Add startup-time session-size guard: auto-archive when no context engine is registered") and cites related oversized-session reports: #64767 ("[Bug] Bloated session jsonl (444 MB) hangs gateway via String.prototype.replace — diagnose with sample+lsof"), #66360 ("session.maintenance has no size cap for transcript .jsonl files — unbounded growth causes gateway CPU 100%"), #73691 ("[Bug]: MEMORY.md grows unbounded → bootstrap overflow → Gateway freeze"), and #75740 ("Embedded agent auto-compaction retries without reducing 873k-token Paperclip prompt"). That discussion supports the underlying failure class but not the current patch as-is.

Likely related people:

steipete: Recent commits touched gateway startup hot paths and session maintenance/write-lock behavior, including server-startup-post-attach.ts and session-management docs. (role: recent gateway and session-maintenance maintainer; confidence: high; commits: fa866d562ed4, 0b1fbeabed8e, f7ed29e11812; files: src/gateway/server-startup-post-attach.ts, docs/reference/session-management-compaction.md, src/config/sessions/artifacts.ts)
jalehman: Multiple recent context-engine registry changes list @jalehman as reviewer or coauthor, including runtime context, contract validation, and third-party engine compatibility work. (role: context-engine reviewer and coauthor; confidence: high; commits: d8a600f2ad01, 263a190fc9e0, 2677f7cf1446; files: src/context-engine/registry.ts)
jarimustonen: Authored the recent ContextEngineFactory runtime context change in the central registry path that this PR extends with agentId. (role: context-engine runtime-context contributor; confidence: medium; commits: d8a600f2ad01; files: src/context-engine/registry.ts)
gumadeiras: Introduced session/cron maintenance hardening and cleanup UX that established much of the current session maintenance surface this PR extends. (role: session maintenance contributor; confidence: medium; commits: eff3c5c70778; files: src/config/sessions/artifacts.ts, docs/reference/session-management-compaction.md)
vincentkoc: Local blame on the current checkout attributes the refreshed config docs/schema baseline across the config surfaces touched by this PR to Vincent Koc. (role: recent config schema/docs maintainer; confidence: medium; commits: 62fb50d7fc5d; files: src/config/schema.help.ts, src/config/schema.base.generated.ts, src/config/types.base.ts)

Remaining risk / open question:

The default auto archive policy can rename local transcripts during startup, so maintainers should explicitly approve the product policy before merge.
I did not run tests because this was a read-only review; findings are source-backed against PR head and current main.

Codex review notes: model gpt-5.5, reasoning high; reviewed against e5ec14a06a67.

Copilot

Pull request overview

Adds a defensive “context-engine fallback session-size guard” so that when a configured context engine fails to resolve and the gateway falls back to legacy, the system scans the affected agent’s session transcript directory and applies a configurable policy to oversized .jsonl transcripts.

Changes:

Add applyContextEngineFallbackGuard() (with warn / archive / block / auto) plus unit tests.
Invoke the guard from resolveContextEngine() at each fallback site; plumb agentId from the embedded runner.
Introduce new config surface session.maintenance.contextFallbackGuard.{sizeBytes,action} across schema/types/help/labels and document it in CHANGELOG.md.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

File	Description
src/context-engine/registry.ts	Runs the fallback guard before returning the default context engine; adds `agentId` to resolver options.
src/context-engine/fallback-guard.ts	Implements transcript-dir scan + size threshold policy actions (warn/archive/block/auto).
src/context-engine/fallback-guard.test.ts	Unit tests covering action behaviors, parsing, filtering, dedup, and failure paths.
src/config/zod-schema.session.ts	Adds Zod validation for `session.maintenance.contextFallbackGuard` and validates `sizeBytes`.
src/config/types.base.ts	Adds typed config definitions for the new guard.
src/config/schema.labels.ts	Adds labels for the new config keys.
src/config/schema.help.ts	Adds help text describing the new guard behavior and defaults.
src/config/schema.base.generated.ts	Regenerates the base schema output to include the new keys.
src/agents/pi-embedded-runner/run.ts	Passes `params.agentId` into `resolveContextEngine()` so the guard scans the correct agent.
CHANGELOG.md	Documents the new guard config and semantics.

…rd + operator recovery prompt (openclaw#76940)

Three follow-ups to the initial guard addition based on operator feedback:

1) Default threshold 1mb → 2mb. 1MiB jsonl is roughly 250k tokens of message content; 2MiB is roughly 500k tokens. 500k tokens already overflows every shipping context window — for models in the 200-256k effective-window range it overflows much sooner. Operators on smaller-context models can still dial down via session.maintenance.contextFallbackGuard.sizeBytes.

2) Boot-time guard (applyContextEngineBootGuard). The on-fallback path only catches "configured engine failed to load." It misses the much more common case: no context engine was ever configured. The legacy engine windows the prompt in-memory at request time but never shrinks the on-disk jsonl, so an unmanaged session grows append-only until the gateway stalls on next start. The boot guard runs once at startup and applies the same policy when slots.contextEngine is unset/legacy or the configured plugin is missing from loadedPluginIds. Both triggers funnel into the same applyContextEngineFallbackGuard implementation; one config knob, one policy, two entry points.

3) Operator-facing message rewrite. The terse single-line warn/archive log is replaced with a structured block that names the file, the engine that failed, the size, the available repair commands (openclaw doctor --fix / sessions archive / config set slots), AND a copy-pasteable recovery prompt for the next agent turn. The prompt instructs the agent to read the archived tail (last ~200 non-system messages, group into chunks of 1-2k tokens each, stop at ~40k tokens aggregate), giving the fresh session enough context to continue meaningfully. Sized so we use the fresh session's available context window — not so miserly that the user loses their working state, not so generous that we eat the whole window.

Tests:
- 17 unit tests pass (12 original + 5 new for boot-guard / recovery prompt)
- Existing 34 src/context-engine tests unchanged
- Lint clean on changed files

Wiring:
- fallback-guard.ts: bump DEFAULT_FALLBACK_GUARD_SIZE_BYTES, add renderWarnMessage / renderArchiveMessage / renderRecoveryPrompt, add applyContextEngineBootGuard
- server-startup-post-attach.ts: invoke boot guard right after logGatewayStartup; never let guard exceptions stall startup
- CHANGELOG: expanded entry covering both trigger paths and threshold rationale

Refs openclaw#76940.

100yenadmin · 2026-05-03T22:20:08Z

Pushed 30002174db addressing operator review:

1. Default threshold 1mb → 2mb

1MiB jsonl ≈ 250k tokens of message content; 2MiB ≈ 500k tokens. 500k tokens already overflows every shipping context window, and for models in the 200-256k effective-window range it overflows much sooner. 2mb is the realistic guardrail; operators on smaller-context models can still dial down via session.maintenance.contextFallbackGuard.sizeBytes.

2. Boot-time guard in addition to on-fallback

The original PR only caught "configured engine failed to load." It missed the much more common case: no context engine was ever configured. The legacy engine windows the prompt in-memory at request time but never shrinks the on-disk jsonl — so an unmanaged session grows append-only until the gateway stalls on next start. (See cross-referenced reports: #64767 444MB jsonl, #66360 unbounded growth, #73691 MEMORY.md gateway freeze — all this same shape.)

The boot guard runs once at startup (server-startup-post-attach.ts right after logGatewayStartup) and applies the same policy when:

slots.contextEngine is unset / legacy / empty → no engine ever managed sessions, OR
slots.contextEngine is set but the plugin isn't in loadedPluginIds → engine failed to load

Both trigger paths funnel into the same applyContextEngineFallbackGuard implementation. One config knob, one policy, two entry points.

When the configured engine is loaded and active, the boot guard short-circuits — the engine itself is responsible for size management (LCM rewrites the jsonl on every compaction, so its sessions stay bounded indefinitely).
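
The trigger conditions above could be expressed as a small decision function. This is a hedged sketch, not the PR's code: the function name, the result type, and the reason labels are hypothetical. Only slots.contextEngine, loadedPluginIds, and the "legacy" engine id come from the comment.

```typescript
// Hypothetical sketch of the boot-guard trigger decision; names are assumptions.
type BootGuardTrigger =
  | { fire: false }
  | {
      fire: true;
      reason: "no-engine-configured" | "engine-not-loaded";
      engineId?: string;
    };

function decideBootGuardTrigger(
  configuredEngine: string | undefined, // value of slots.contextEngine
  loadedPluginIds: ReadonlySet<string>,
): BootGuardTrigger {
  const slot = configuredEngine?.trim() ?? "";
  // Unset, empty, or "legacy": no engine ever managed these sessions.
  if (slot === "" || slot === "legacy") {
    return { fire: true, reason: "no-engine-configured" };
  }
  // Configured but absent from the runtime plugin set: it failed to load.
  if (!loadedPluginIds.has(slot)) {
    return { fire: true, reason: "engine-not-loaded", engineId: slot };
  }
  // Engine is loaded and active: it owns size management, so short-circuit.
  return { fire: false };
}
```

Returning an explicit reason would also let the operator message distinguish "no engine configured" from "configured engine failed", which the review asks for.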

3. Operator-facing message rewrite

The terse single-line warn/archive log is replaced with a structured block that names the file, the engine that failed, the size, the repair commands, AND a copy-pasteable recovery prompt for the next agent turn:

[context-engine] Session-size guard archived a transcript that would have stalled the gateway.

  Reason:    Context engine "lossless-claw" is configured but failed
             (engine "lossless-claw" is not registered). Falling back to the default
             "legacy" engine would have loaded the full transcript on next start.

  Archived:  ~/.openclaw/agents/main/sessions/b1ed0fe1-….jsonl
             → b1ed0fe1-…archived-no-context-engine-2026-05-04T03-15-22.jsonl
             (6.30 MiB — 40k+ tokens of message content)

  Next session start will be fresh and small. To recover the prior context,
  paste this prompt into the agent on the first turn:
  ┌─────────────────────────────────────────────────────────────────────────┐
  │ My previous session was archived because the configured context-engine    │
  │ plugin failed to load and the transcript would have overflowed the model  │
  │ context on next gateway start. Read the archived transcript at:           │
  │                                                                           │
  │   ~/.openclaw/agents/main/sessions/b1ed0fe1-…archived-…jsonl              │
  │                                                                           │
  │ Take the last ~200 non-system messages (skip heartbeat, synthetic, and    │
  │ bootstrap turns). Group them into chronological chunks of ~1000-2000      │
  │ tokens each — one chunk per coherent unit of work (a tool-call run, a     │
  │ topic shift, a multi-message exchange). For each chunk emit a 1000-2000   │
  │ token summary that:                                                       │
  │   - names the goal of the work in that chunk,                             │
  │   - lists tools called with key inputs/outputs (file paths, commits,      │
  │     decisions),                                                           │
  │   - notes unresolved threads, errors, or pending follow-ups.              │
  │                                                                           │
  │ Stop at ~40k tokens of aggregate summary so the fresh session keeps       │
  │ headroom. Output chunks in chronological order with one-line dividers     │
  │ like "chunk N: <topic>" so I can reference them later. After the chunks,  │
  │ give:                                                                     │
  │   - "open threads": anything in-flight,                                   │
  │   - "decisions made": anything settled,                                   │
  │   - "next likely action": what I would have done next.                    │
  │                                                                           │
  │ That summary is now my working context — proceed from there.              │
  └─────────────────────────────────────────────────────────────────────────┘

  Repair the engine plugin so this does not repeat:
    openclaw doctor --fix

  Or remove the configured slot to fall back cleanly without this guard:
    openclaw config set plugins.slots.contextEngine ""

Sizing rationale on the prompt:

~200 messages of tail (not 100): the user just got a fresh session with the full context window available, and the archived tail is where the most recent in-progress work lives. Smaller tails lose too much.
1-2k tokens per chunk: matches LCM's own chunked-summary granularity, so the fresh session gets useful per-chunk references rather than one mushy paragraph.
~40k tokens aggregate ceiling: leaves the fresh session ~200-300k tokens of headroom for ongoing work (depending on model). Big enough to actually carry the work forward, small enough not to monopolize the new window.

Validation

72 tests pass (17 fallback-guard cases including 5 new boot-guard cases + 2 new recovery-prompt assertions; 34 existing context-engine tests unchanged; 21 cortex/etc unchanged)
Lint clean on changed files
Locally swap-tested against a live gateway with LCM as the active engine: boot guard correctly short-circuits when LCM loads, and the recovery prompt was readable enough to paste straight into a chat.

Operator notes

The boot guard runs in a try/catch so a guard exception can never stall gateway startup — the worst case degrades to "no guard fired this boot."
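
A minimal sketch of that containment, assuming a logger with only a warn method; the wrapper name and the log wording are hypothetical:

```typescript
// Hypothetical sketch; wrapper name and log wording are assumptions.
type Logger = { warn: (message: string) => void };

// Runs the guard and reports whether it completed; never rethrows.
function runBootGuardSafely(guard: () => void, log: Logger): boolean {
  try {
    guard();
    return true;
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    // Worst case: no guard fired this boot, but startup continues.
    log.warn(`[context-engine] boot guard failed; continuing startup: ${detail}`);
    return false;
  }
}
```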

The on-fallback guard remains for the request-time path (resolver fails mid-conversation rather than at boot), so a regression that takes down the engine after boot is still caught.

100yenadmin · 2026-05-03T22:21:34Z

@steipete @vincentkoc I recommend including this fix in the hotfix to prevent context engines from taking down gateways when they are disabled or deleted.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.

…rd (openclaw#76940)

Addresses Copilot's 5 inline review comments on PR openclaw#76950 plus 13 additional findings from a parallel adversarial sweep. Single consolidated commit; per-finding diff lives below.

== Copilot review comments ==

(1) Archive-failure path used optional `logger.error?.()` so the fallback signal could be silently lost on loggers without `error`. Now ALWAYS emits a `warn` (deduped via warnedPaths) in addition to the optional error, so the signal survives loggers that only have warn.
File: src/context-engine/fallback-guard.ts (archive catch block)

(2) The block-action error thrown from resolveContextEngine() didn't name the offending transcripts. Now includes basenames of the blocked files (capped at 5 + "+N more"), so operators see exactly which sessions trip the gate without leaking absolute paths into error stacks shared in bug reports.
File: src/context-engine/registry.ts (fallbackToDefault helper)

(3) `defaultHasContextEngineHistory()` walked path.dirname THREE times from sessionsDir, landing at ~/.openclaw (global state) instead of the agent state dir. The auto-action heuristic mis-detected history. Now walks two dirs (sessionsDir → agentDir) and ALSO checks the global state root for legacy LCM installs that keep a shared sqlite store. Function additionally takes the injected `ioFs` so tests can stub it (previously bare `fs.existsSync`).
File: src/context-engine/fallback-guard.ts

(4) The custom `isLiveSessionTranscript` filter accepted non-primary artifacts (.trajectory.jsonl, .checkpoint.<uuid>.jsonl) and could exclude legitimate primaries with `.deleted.` in id. Replaced with `isPrimarySessionTranscriptFileName` from src/config/sessions/artifacts.ts, the same predicate every other transcript code path uses. Custom filter deleted.
File: src/context-engine/fallback-guard.ts

(5) `ApplyFallbackGuardOptions.agentDir` was documented + accepted by callers but never read. Now used to derive sessionsDir (`<agentDir>/sessions`) when supplied, before falling back to agentId-based resolution. Embedded runners that scope to a non-default state root now have the guard scoped correctly.
File: src/context-engine/fallback-guard.ts (resolution order block)

== Additional adversarial-sweep findings ==

(6) [P0] Archive filename `<id>.archived-no-context-engine-<ts>.jsonl` was NOT recognized by `isPrimarySessionTranscriptFileName`, so the archived file got loaded as a live session on next gateway start; the guard accomplished nothing in the case it exists to fix. Now extends `SessionArchiveReason` with "context-fallback" and uses the canonical `<id>.jsonl.context-fallback.<iso-ts>-<nonce>` shape. Existing disk-budget pruning + transcript helpers correctly recognize and exclude these files.
Files: src/config/sessions/artifacts.ts, src/context-engine/fallback-guard.ts (buildArchivePath)

(7) [P0] Doc-drift: schema.help.ts + types.base.ts JSDoc said "Default `1mb`" after the bump to 2mb in the prior commit. Both sites updated; types.base.ts now points readers at the constant in fallback-guard.ts so future changes stay in sync.
Files: src/config/schema.help.ts, src/config/types.base.ts

(8) [P0] `action: "block"` was silently downgraded to warn at gateway boot: the boot wrapper discarded the outcome and never called `fallbackGuardOutcomeIsBlocking`. Now collects blocking outcomes across all agents and sets `process.exitCode = 1` with a structured error log, so launchd / systemd / docker treat the boot as unhealthy. Startup itself does not throw (degrades to "no protection this boot" if the guard faults internally).
File: src/gateway/server-startup-post-attach.ts

(9) [P0] Boot guard was hardcoded to the "main" agent. Multi-agent installs (concierge, support, etc.) had no boot-time protection. Now iterates every agent under ~/.openclaw/agents/ via readdirSync(withFileTypes); falls back to DEFAULT_AGENT_ID for fresh installs / test environments.
File: src/gateway/server-startup-post-attach.ts

(10) [P0] Four resolveContextEngine call sites passed no agentId, so when fallback fired for a non-default agent the guard walked the wrong agent's session directory. Plumbed agentId through:
- src/agents/pi-embedded-runner/compact.queued.ts (hoisted resolveSessionAgentIds early)
- src/agents/command/cli-compaction.ts (extended cliCompactionDeps.resolveContextEngine signature)
- src/agents/subagent-spawn.ts (parseAgentSessionKey on requesterInternalKey)
- src/agents/subagent-registry.ts (parseAgentSessionKey on childSessionKey)

(11) [P1] `renameSync` race: archive timestamp was millisecond-resolution; two archives in the same ms (rapid fallback across agents, or two large transcripts in one pass) collided and `renameSync` silently overwrites. Now appends a 6-hex random nonce.
File: src/context-engine/fallback-guard.ts (buildArchivePath)

(12) [P1] `sizeBytes: 0` (or "0b" / "0kb") passed schema validation but was silently downgraded to the default in resolveGuardConfig, the opposite of operator intent (trying to disable or tighten the guard would actually loosen it to default). Now logs an explicit warn so the misconfig is visible.
File: src/context-engine/fallback-guard.ts (resolveGuardConfig)

(13) [P1] `openclaw sessions archive <id>` doesn't exist as a subcommand; the operator-facing message printed a fake command. Replaced with `openclaw sessions cleanup --enforce`. `openclaw config set ... ""` likewise replaced with the canonical `openclaw config unset ...` form.
File: src/context-engine/fallback-guard.ts (renderWarnMessage, renderArchiveMessage)

(14) [P1] Multi-line operator block destroyed by JSON log encoding (the JSON file logger encodes \n literally; the box-drawing characters became one mile-long line). Replaced unicode box drawing with explicit `----- BEGIN RECOVERY PROMPT -----` / `----- END RECOVERY PROMPT -----` delimiters that survive JSON encoding and are easy to grep / extract.
File: src/context-engine/fallback-guard.ts

(15) [P1] Home-dir leaked into operator-facing paths. When operators paste the message into bug reports / GitHub issues / chat the username was exposed. Added `redactHomePrefix()` that substitutes `~` for the user's home prefix in the rendered prompt + Archived block. The structured `summary` line keeps the absolute path for grep-by-path.
File: src/context-engine/fallback-guard.ts (renderRecoveryPrompt, renderWarnMessage, renderArchiveMessage)

(16) [P1] Two structurally-identical action union types (`SessionContextFallbackGuardAction` in types.base.ts and `FallbackGuardAction` in fallback-guard.ts). Now a single source of truth: `FallbackGuardAction` aliases the public type.
Files: src/context-engine/fallback-guard.ts

(17) [P1] Recovery prompt assumed agent could safely Read a multi-MiB jsonl. Now explicitly tells the agent to use Read with offset/limit, skip individual messages over ~10k tokens, and bound itself to ~40k aggregate. Also added a "next likely action" bullet so the fresh session resumes work cleanly.
File: src/context-engine/fallback-guard.ts (renderRecoveryPrompt)

(18) [P1] `agent="(default)"` printed in summary line when agentId was unset, but actual on-disk path uses `main`. Operators grepping by the printed label couldn't find the path. Now uses `DEFAULT_AGENT_ID` ("main") consistently.
File: src/context-engine/fallback-guard.ts (renderAgentLabel)

(19) [P2] Platform-specific restart hint: `launchctl` was unconditional. Now branches on process.platform to suggest `launchctl` on macOS, `systemctl --user restart` on Linux, `Restart-Service` on Windows.
File: src/context-engine/fallback-guard.ts (platformRestartHint)

(20) [P2] `lstatSync` used in place of `statSync` so a symlink in the sessions dir doesn't get archived (previously could rename the link and orphan the target).
File: src/context-engine/fallback-guard.ts

(21) [P2] Aggregated stat-error count is now logged at end of pass so an operator sees a signal when (e.g.) every file in the sessions dir is unreadable due to permission/quarantine drift; previously the swallowed errors produced an empty outcome with no clue why.
File: src/context-engine/fallback-guard.ts (statErrors counter)

== Tests ==

- 25 fallback-guard.test.ts cases pass (was 17; 8 added covering canonical archive name, prior archives ignored, home-dir redact, nonce collision-avoidance, sizeBytes:0 warns, agentDir option used, archive-failure-also-warns, lstat for symlinks)
- 6 zod-schema.session-maintenance-extensions.test.ts cases for contextFallbackGuard (valid/absent/casing/typo/malformed/wrong-level)
- All 142 src/config/sessions tests pass unchanged
- 95 total context-engine + maintenance-extension tests pass
- Lint clean on changed files (the 2 remaining `__testing` warnings are pre-existing in v2026.5.2 and unrelated)

Refs openclaw#76940. Addresses inline review on PR openclaw#76950.
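Findings (20) and (21) can be illustrated with a minimal sketch. This is not the PR's code: `scanOversized`, `StatLike`, and `ScanResult` are hypothetical names, and the stat function is injected (in the spirit of the `ioFs` injection from finding (3)) so the logic can be exercised without touching disk. In production the injected function could be `(p) => fs.lstatSync(p)` from `node:fs`.

```typescript
// Hypothetical shapes for illustration; the real guard's types may differ.
interface StatLike {
  size: number;
  isSymbolicLink(): boolean;
  isFile(): boolean;
}

interface ScanResult {
  oversized: string[]; // file names strictly above the threshold
  statErrors: number; // files whose stat call threw
}

// Walks a list of candidate transcript names and reports the oversized ones.
// `lstat` is injected so tests can stub it.
function scanOversized(
  fileNames: string[],
  lstat: (name: string) => StatLike,
  thresholdBytes: number,
): ScanResult {
  const result: ScanResult = { oversized: [], statErrors: 0 };
  for (const name of fileNames) {
    let st: StatLike | undefined;
    try {
      st = lstat(name);
    } catch {
      st = undefined;
    }
    if (st === undefined) {
      // Count the failure instead of swallowing it, so the caller can
      // log one aggregate line at the end of the pass.
      result.statErrors += 1;
      continue;
    }
    // lstat semantics do not follow symlinks, so a link is reported as a
    // link and skipped here rather than archived.
    if (st.isSymbolicLink() || !st.isFile()) continue;
    if (st.size > thresholdBytes) result.oversized.push(name);
  }
  return result;
}
```

A caller would log `statErrors` once after the loop when it is non-zero, which is the signal finding (21) describes.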

steipete · 2026-05-03T22:54:25Z

Thanks for jumping on this and for the detailed incident write-up.

For the hotfix, we are not going to take this core guard as-is. The immediate failure is a lossless-claw compatibility/install/load problem, so the primary fix belongs in lossless-claw rather than OpenClaw core. Core hardening here is still worth discussing, but startup-time transcript auto-archive is product-sensitive and this PR currently has unresolved implementation blockers: merge conflicts, a logger type/runtime issue, mismatched config default docs, a non-existent recovery command, and multi-agent/default-agent gaps.

Closing this PR for now. If we revisit core hardening, the safer shape is likely a narrower diagnostic/warn-only guard first, with docs and recovery paths aligned to existing session maintenance.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

100yenadmin · 2026-05-03T22:56:52Z

@steipete not sure if your AI wrote that, but the issue here isn't lossless-claw. We have no control over session file management; LCM manages the session only when it is enabled.

There is nothing we can do when the plugin system is rewritten and mass-disables plugins. Plugin builders can't conform to new requirements that ship in a patch without notice; the normal process is to publish the requirements in an update and announce that the new system phases in on date X. We lose a lot of goodwill constantly rebuilding plugin infra.

That being said, when LCM is disabled or deleted, the session file breaks the gateway, because LCM handles session management only while it is enabled. I was in the middle of the commits for this fix but got blocked by the PR closing.

On the LCM side we can start truncating and managing the session file while LCM is enabled, but that is your decision if you would prefer we do that.

… dead activeContextEngineId, accurate token estimate, boot-message phrasing (openclaw#76940)

Three additional Copilot/sweep findings on the prior consolidated commit:

(22) [P1] params.log.error doesn't typecheck on the gateway post-attach surface (the typed log here is { info, warn } only; error is on the richer subsystem logger but not on this narrowed shape). Switched the boot-guard logger sink to omit `error` (the guard always emits a `warn` fallback when the optional error is absent) and the structured "block" startup line to use `warn`. The non-zero process.exitCode is the real signal external supervisors pick up, so demoting the channel doesn't lose anything.
File: src/gateway/server-startup-post-attach.ts

(23) [P1] `activeContextEngineId` was required on ApplyBootGuardOptions but applyContextEngineBootGuard ignored it (re-read the same value from `options.config.plugins.slots.contextEngine`). Easy to misinterpret as "the boot guard respects the explicit activeContextEngineId override" when it does not. Removed the dead field; callers now pass only the loaded set + config.
File: src/context-engine/fallback-guard.ts (ApplyBootGuardOptions)

(24) [P1] Operator messages printed `RECOVERY_PROMPT_MAX_SUMMARY_TOKENS` ("40k+ tokens of message content") for every transcript regardless of file size. Now estimates from the actual byte size with a simple ~4-chars-per-token heuristic, formatted with reasonable significant digits ("~580k tokens (estimated)" / "~1.4M tokens (estimated)").
File: src/context-engine/fallback-guard.ts (estimateTokensFromBytes)

(25) [P2] The "Reason: Context engine X is configured but failed (...)" line in operator messages reads incorrectly when the boot guard fires for the no-engine case (failedEngineId is the synthetic "(legacy/none)" label, not a real configured engine). Branch on that label and emit a no-engine-specific reason block instead.
File: src/context-engine/fallback-guard.ts (renderReasonLines)

Tests: all 25 fallback-guard cases still pass; the GuardMessageContext type now carries `sizeBytes` so the token estimate has the source data without re-deriving from the formatted MiB string.

Refs openclaw#76940. Addresses Copilot review on PR openclaw#76950.
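The byte-based estimate in finding (24) can be sketched as follows. Only the ~4-chars-per-token heuristic and the two example strings come from the commit message; the rounding rules and the `formatTokenEstimate` helper name are assumptions made for illustration, and the real `estimateTokensFromBytes` may differ.

```typescript
// Assumption: roughly 4 bytes (characters) per token, as the commit describes.
const BYTES_PER_TOKEN = 4;

function estimateTokensFromBytes(sizeBytes: number): number {
  return Math.ceil(sizeBytes / BYTES_PER_TOKEN);
}

// Hypothetical formatter: whole thousands below ~1M tokens, one decimal above.
function formatTokenEstimate(sizeBytes: number): string {
  const tokens = estimateTokensFromBytes(sizeBytes);
  let label: string;
  if (tokens >= 999_500) {
    // 999_500 is where rounding to whole thousands would print "1000k".
    label = `${(tokens / 1_000_000).toFixed(1)}M`;
  } else if (tokens >= 1_000) {
    label = `${Math.round(tokens / 1_000)}k`;
  } else {
    label = String(tokens);
  }
  return `~${label} tokens (estimated)`;
}

console.log(formatTokenEstimate(2_320_000)); // ~580k tokens (estimated)
console.log(formatTokenEstimate(5_600_000)); // ~1.4M tokens (estimated)
```

Passing the raw byte count, rather than parsing it back out of a formatted MiB string, matches the note that `GuardMessageContext` now carries `sizeBytes`.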

100yenadmin · 2026-05-03T23:06:28Z

Adversarial-review pass: Copilot's review + 3 internal sub-agent sweeps

Two new commits address everything the Codex bot review found, plus 13 additional findings from a parallel adversarial-agent sweep I ran in 4 dimensions (concurrency / fs, config validation, integration / multi-agent, UX / recovery prompt).

Commits in this push

93d96d4afe — main consolidated fix (13 files, +584/-117). Addresses Copilot's 5 inline findings + 16 from the adversarial sweep.
49c8ccc156 — three remaining nits found in a final pass (params.log.error type mismatch, dead activeContextEngineId field, hardcoded "40k+ tokens" string regardless of file size, "configured but failed" phrasing wrong for the boot-guard no-engine path).

Each Copilot inline comment is replied with the specific commit + file:line + relevant test that covers the fix.

Highest-impact fix (would-have-killed-the-PR-purpose level)

The previous archive shape was <id>.archived-no-context-engine-<ts>.jsonl — which isPrimarySessionTranscriptFileName did NOT recognize as an archive. So the file would have been loaded as a live session on next gateway start, and the guard would have accomplished literally nothing in the case it exists to fix. Now extends SessionArchiveReason with "context-fallback" and uses the canonical <id>.jsonl.context-fallback.<iso-ts>-<6-hex-nonce> shape, so existing transcript helpers correctly exclude these files. The nonce also fixes a same-millisecond renameSync race that would silently overwrite one of two concurrent archives.
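A minimal sketch of the canonical archive name, assuming the `<id>.jsonl.context-fallback.<iso-ts>-<6-hex-nonce>` shape quoted above. The function body is illustrative rather than the PR's `buildArchivePath`; in particular, replacing the colons in the ISO timestamp is an assumption made here to keep the name filesystem-safe.

```typescript
import { randomBytes } from "node:crypto";

// Illustrative only: appends the archive suffix to an existing `<id>.jsonl` path.
function buildArchivePath(transcriptPath: string, now: Date = new Date()): string {
  // Assumption: colons are replaced so the name is valid on every filesystem.
  const isoTs = now.toISOString().replace(/:/g, "-");
  // 3 random bytes render as 6 hex characters; this keeps two archives
  // created in the same millisecond from resolving to the same target.
  const nonce = randomBytes(3).toString("hex");
  return `${transcriptPath}.context-fallback.${isoTs}-${nonce}`;
}
```

Because the result no longer ends in `.jsonl`, a predicate that only accepts primary `<id>.jsonl` names has no reason to treat it as a live session, which is the property the old `.archived-no-context-engine-<ts>.jsonl` shape lacked.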

Other significant fixes beyond Copilot's 5

Severity	Bug	Fix
P0	`action: "block"` silently downgraded to warn at gateway boot	Boot path now collects blocking outcomes across all agents and sets `process.exitCode = 1` so launchd/systemd/docker treat the boot as unhealthy
P0	Boot guard hardcoded to `"main"` agent — multi-agent installs got no boot protection	Iterates every dir under `~/.openclaw/agents/`
P0	4 other `resolveContextEngine` call sites passed no `agentId` — guard walked wrong agent's sessions on fallback	`agentId` plumbed through `subagent-spawn`, `cli-compaction`, `compact.queued`, `subagent-registry`
P1	`sizeBytes: 0` silently used the default (opposite of operator intent)	Now logs an explicit warn so the misconfig is visible
P1	`openclaw sessions archive <id>` printed in operator message — that subcommand doesn't exist	Replaced with `openclaw sessions cleanup --enforce` and `openclaw config unset ...`
P1	Multi-line operator block destroyed by JSON log encoding (newlines are encoded as literal `\n`, so the box-drawing block collapses into one long line)	Replaced unicode boxes with explicit `----- BEGIN/END RECOVERY PROMPT -----` delimiters
P1	Home-dir leaked into pasted-into-issues paths	`redactHomePrefix()` substitutes `~` in the operator-facing block; absolute path still in the structured `summary` line for grep
P1	Two structurally-identical action union types	Single source of truth — `FallbackGuardAction` aliases `SessionContextFallbackGuardAction`
P1	Recovery prompt assumed agent could `Read` a multi-MiB jsonl whole	Prompt now tells agent to use `Read` with offset/limit, skip individual messages > ~10k tokens
P1	"40k+ tokens" printed for every transcript regardless of size	Estimates from actual byte size: `~580k tokens (estimated)` etc.
P1	Boot-guard message said "Context engine X is configured but failed" even when no engine was configured	Branches on the synthetic `(legacy/none)` label; emits accurate no-engine phrasing
P2	`launchctl` restart hint was unconditional	Branches on `process.platform` for macOS / Linux / Windows
P2	`statSync` followed symlinks (could archive a link, orphan target)	Switched to `lstatSync` (matches existing safe-fs helpers)
P2	Aggregate `statErrors` count silently swallowed	Logged once at end of pass so unreadable-files-everywhere shows a signal
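Two of the smaller rows in the table, home-dir redaction and the platform-specific restart hint, can be sketched like this. The function names match those mentioned in the commit body, but the bodies are assumptions: the real helpers may handle more cases, and the hint strings here are only the command stems named in the PR, not full commands.

```typescript
import * as os from "node:os";

// Replaces a leading home directory with "~" so pasted paths do not expose
// the username. `home` is a parameter so the sketch is testable.
function redactHomePrefix(p: string, home: string = os.homedir()): string {
  if (home.length === 0 || !p.startsWith(home)) return p;
  const rest = p.slice(home.length);
  // Redact only on a path-segment boundary, so a home of /home/al does not
  // match a path under /home/alice.
  if (rest === "" || rest.startsWith("/") || rest.startsWith("\\")) {
    return "~" + rest;
  }
  return p;
}

// Picks the restart command family per platform. Treating every platform
// other than macOS and Windows as systemd-based is an assumption of this sketch.
function platformRestartHint(platform: string = process.platform): string {
  switch (platform) {
    case "darwin":
      return "launchctl";
    case "win32":
      return "Restart-Service";
    default:
      return "systemctl --user restart";
  }
}
```

Only the operator-facing block would pass through `redactHomePrefix`; per the table, the structured `summary` line keeps the absolute path for grep.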

Test coverage

25 unit tests in fallback-guard.test.ts (was 17 — 8 added covering canonical archive name, prior archives ignored, home-dir redact, nonce collision-avoidance, sizeBytes:0 warns, agentDir option used, archive-failure-also-warns, lstat for symlinks)
6 zod schema tests for contextFallbackGuard (valid actions/sizes/casing/typos/wrong-nesting/back-compat-absent)
All 142 src/config/sessions/ tests pass unchanged
95 total context-engine + maintenance-extension tests pass
Lint clean on changed files (the 2 remaining __testing warnings are pre-existing in v2026.5.2)

Adversarial methodology

3 sub-agents ran in parallel with focused scopes (concurrency/fs, config/back-compat, integration/UX). Each was given the PR diff + Copilot's existing 5 findings up front so they didn't duplicate. Results merged into the consolidated commit; per-finding rationale and file:line in the commit body of 93d96d4afe.

The "single conversation, three perspectives" pattern surfaced bugs Copilot missed — particularly the archive-shape-not-recognized P0 (which would have nullified the whole PR), the multi-agent-boot-guard-hardcoded-main P0, and the 4-other-call-sites-pass-no-agentId P0. Worth doing on guard-style PRs that touch multiple subsystems.

Copilot AI review requested due to automatic review settings May 3, 2026 21:36

openclaw-barnacle Bot added agents Agent runtime and tooling size: L labels May 3, 2026

Copilot started reviewing on behalf of 100yenadmin May 3, 2026 21:37 View session

Copilot AI reviewed May 3, 2026

Comment thread src/context-engine/fallback-guard.ts

Comment thread src/context-engine/registry.ts

Comment thread src/context-engine/fallback-guard.ts

Comment thread src/context-engine/fallback-guard.ts

Comment thread src/context-engine/fallback-guard.ts

openclaw-barnacle Bot added gateway Gateway runtime size: XL and removed size: L labels May 3, 2026

100yenadmin mentioned this pull request May 3, 2026

Add startup-time session-size guard: auto-archive when no context engine is registered #76940

Closed

100yenadmin requested a review from Copilot May 3, 2026 22:22

Copilot started reviewing on behalf of 100yenadmin May 3, 2026 22:23 View session

Copilot AI reviewed May 3, 2026

100yenadmin requested a review from Copilot May 3, 2026 22:38

Copilot started reviewing on behalf of 100yenadmin May 3, 2026 22:39 View session

Copilot AI reviewed May 3, 2026

100yenadmin requested a review from Copilot May 3, 2026 22:50

Copilot started reviewing on behalf of 100yenadmin May 3, 2026 22:51 View session

steipete closed this May 3, 2026

Copilot AI reviewed May 3, 2026

100yenadmin mentioned this pull request May 4, 2026

CLI suggests plugins.allow for unknown subcommands when input is actually an agent tool name #77214

Open
