
Context-engine assembled prompts are bypassed by overflow precheck using pre-assembly history #74233

@100yenadmin

Description

Summary

OpenClaw's ContextEngine.assemble() contract says the engine returns ordered messages ready for the model under a token budget, but the embedded runner still hard-gates prompt submission on a larger pre-assembly message view (unwindowedMessages). With lossless-claw/LCM as the context engine, this can force repeated precheck overflow, compaction retries, and eventually a session reset even when the assembled context should be the prompt-authoritative view.

This is related to earlier compaction/context drift reports such as #69838 and #50065, but the failure mode here is narrower: context-engine assembly is applied to activeSession.messages, and its smaller assembled result is then bypassed by the preemptive overflow precheck.

Environment observed

  • OpenClaw package: 2026.4.24
  • Live install symlink: /opt/homebrew/lib/node_modules/openclaw -> /Users/lume/repos/openclaw-pr70071-rebase
  • Live OpenClaw head: 39199f8e42
  • Active context engine slot: lossless-claw
  • Active model: openai-codex/gpt-5.5
  • Configured model context window: 258000
  • Runtime reserve: 50000
  • Effective prompt budget before reserve in logs: 208000
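The budget arithmetic behind the numbers above can be sketched as follows; the function name is hypothetical, not an OpenClaw API:

```typescript
// Hypothetical helper illustrating the budget arithmetic from the logs;
// the real field and function names in OpenClaw may differ.
function effectivePromptBudget(contextWindow: number, reserve: number): number {
  // Prompt tokens available once the runtime reserve is held back.
  return contextWindow - reserve;
}

// 258000 (gpt-5.5 window) - 50000 (runtime reserve) = 208000
const budget = effectivePromptBudget(258000, 50000);
```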

Confidence matrix

  • 0.94 - primary cause: OpenClaw precheck treats the pre-assembly message view as a hard prompt-admission gate even after context-engine assembly has produced a smaller prompt-ready view.
  • 0.86 - runtime failure timeline: Apr 28-29 logs show repeated gpt-5.5 precheck overflow, compaction retries, and a hard reset to a new session.
  • 0.85 - contributing cause: aggregate tool-result bulk repeatedly pushed prompt estimates over budget; recovery worked in some turns but not the fatal loop.
  • 0.80 - user-visible forgetfulness cause: compaction failure resets the session and clears token/cache/accounting fields; continuity then depends on the context engine and recall tools.
  • 0.75 - contributing pressure: gpt-5.5's 258000 context window plus a 50000 reserve leaves only 208000 prompt tokens, making this bug much easier to trigger than under larger-window models.
  • 0.70 - contributing pressure: repeated bootstrap/system/context injection and large workspace guidance add prompt bulk; logs show AGENTS.md and SOUL.md truncation around 40k chars.
  • 0.65 - related cache/retry risk: the deployed branch does not appear to include the full CLI prompt-build drain-cache retry fix from commit 4225db3a7b; that fix is likely adjacent for CLI/session-expired forgetfulness, though the fatal overflow evidence here is in the embedded runner.
  • 0.65 - related recall risk: Cortex recall was effectively absent in the same window, which can amplify perceived forgetfulness after a reset, but this is separate from the overflow precheck bug.

Source evidence

ContextEngine.assemble() is defined as the prompt-ready assembly point:

  • src/context-engine/types.ts: AssembleResult.messages are the "Ordered messages to use as model context" and assemble() "Returns an ordered set of messages ready for the model."
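A minimal sketch of that contract, assuming hypothetical shapes beyond the quoted docstrings (the actual types.ts definitions are richer):

```typescript
// Hedged sketch of the ContextEngine contract described above; everything
// except the quoted docstrings is an assumption, not the real types.ts.
type Message = { role: string; content: string };

interface AssembleResult {
  /** "Ordered messages to use as model context." */
  messages: Message[];
}

interface ContextEngine {
  /** "Returns an ordered set of messages ready for the model." */
  assemble(history: Message[], tokenBudget: number): AssembleResult;
}
```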

The embedded runner snapshots the pre-assembly messages, calls the context engine, and replaces the active session messages with the assembled output:

  • src/agents/pi-embedded-runner/run/attempt.ts
    • snapshots unwindowedContextEngineMessagesForPrecheck = activeSession.messages.slice() before assembly
    • calls assembleAttemptContextEngine(...)
    • assigns activeSession.agent.state.messages = assembled.messages when assembly returns a different array

The precheck then receives both the assembled active messages and the pre-assembly snapshot:

  • src/agents/pi-embedded-runner/run/attempt.ts
    • shouldPreemptivelyCompactBeforePrompt({ messages: activeSession.messages, unwindowedMessages: unwindowedContextEngineMessagesForPrecheck, ... })
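The snapshot-assemble-precheck sequence above can be sketched like this, with hypothetical types and function names standing in for the real attempt.ts code:

```typescript
// Minimal sketch of the runner flow described above; types and names are
// approximations, not the actual attempt.ts source.
type Message = { role: string; content: string };
interface Session { messages: Message[] }

function runAttemptPrecheck(
  activeSession: Session,
  assemble: (msgs: Message[]) => Message[],
  precheck: (args: { messages: Message[]; unwindowedMessages: Message[] }) => boolean,
): boolean {
  // 1. Snapshot the pre-assembly history
  //    (unwindowedContextEngineMessagesForPrecheck in the real code).
  const unwindowed = activeSession.messages.slice();
  // 2. Context-engine assembly replaces the active messages with its output.
  const assembled = assemble(activeSession.messages);
  if (assembled !== activeSession.messages) {
    activeSession.messages = assembled;
  }
  // 3. The precheck still receives the larger pre-assembly snapshot, which is
  //    where the assembled view gets bypassed.
  return precheck({
    messages: activeSession.messages,
    unwindowedMessages: unwindowed,
  });
}
```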

The precheck intentionally chooses the larger estimate:

  • src/agents/pi-embedded-runner/run/preemptive-compaction.ts
    • if unwindowedEstimatedPromptTokens > estimatedPromptTokens, it replaces the estimate and messagesForPressure with unwindowedMessages
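That selection logic amounts to the following sketch; the names approximate preemptive-compaction.ts and are not copied from the source:

```typescript
// Sketch of the estimate-selection behavior described above.
type Message = { role: string; content: string };

function choosePressureView(
  estimatedPromptTokens: number,
  messages: Message[],
  unwindowedEstimatedPromptTokens: number,
  unwindowedMessages: Message[],
): { estimate: number; messagesForPressure: Message[] } {
  // The precheck deliberately takes the larger estimate, so whenever the raw
  // pre-assembly history is bigger, it wins over the assembled prompt view.
  if (unwindowedEstimatedPromptTokens > estimatedPromptTokens) {
    return {
      estimate: unwindowedEstimatedPromptTokens,
      messagesForPressure: unwindowedMessages,
    };
  }
  return { estimate: estimatedPromptTokens, messagesForPressure: messages };
}
```

With the numbers from this incident (an assembled view near 87767 tokens and a raw history near 284165 tokens), the raw-history estimate always wins and the prompt is blocked.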

A current unit test locks in this behavior:

  • src/agents/pi-embedded-runner/run/preemptive-compaction.test.ts
    • test name: uses the larger unwindowed message estimate when context engine assembly windows history

A targeted local test run passed, confirming the behavior is expected by the current test suite:

node scripts/run-vitest.mjs run --config test/vitest/vitest.agents.config.ts src/agents/pi-embedded-runner/run/preemptive-compaction.test.ts

Test Files  1 passed (1)
Tests       10 passed (10)

Runtime evidence

The fatal Apr 29 loop looked like this:

2026-04-29T01:24:27.357+07:00 [context-overflow-precheck] route=compact_then_truncate estimatedPromptTokens=284165 promptBudgetBeforeReserve=208000 overflowTokens=76165 toolResultReducibleChars=227371 sessionFile=.../7fa8806a-4ad4-4b57-8978-e7d08a6bfdc2.jsonl
2026-04-29T01:24:36.133+07:00 [context-overflow-precheck] route=compact_only estimatedPromptTokens=215976 promptBudgetBeforeReserve=208000 overflowTokens=7976 toolResultReducibleChars=0 sessionFile=.../7fa8806a-4ad4-4b57-8978-e7d08a6bfdc2.jsonl
2026-04-29T01:24:47.337+07:00 [context-overflow-precheck] route=compact_only estimatedPromptTokens=215976 promptBudgetBeforeReserve=208000 overflowTokens=7976 toolResultReducibleChars=0 sessionFile=.../7fa8806a-4ad4-4b57-8978-e7d08a6bfdc2.jsonl
2026-04-29T01:24:48.125+07:00 [context-overflow-precheck] route=compact_only estimatedPromptTokens=215976 promptBudgetBeforeReserve=208000 overflowTokens=7976 toolResultReducibleChars=0 sessionFile=.../7fa8806a-4ad4-4b57-8978-e7d08a6bfdc2.jsonl
2026-04-29T01:24:48.172+07:00 Auto-compaction failed (Context overflow: prompt too large for the model (precheck).). Restarting session agent:main:main -> 4ff1673e-4b2c-406a-b6e4-03c0beb54b25 and retrying.

The new session begins with an injected reset message:

Context limit exceeded. I've reset our conversation to start fresh - please try again.

The active LCM database state after the reset showed much smaller assembled context than the raw stored history:

  • active conversation had about 56k messages / 20M stored message tokens
  • current context_items were compact: summaries + recent messages, around tens of thousands of tokens
  • maintenance row reported current_token_count=87767 against token_budget=258000

Caveat: the LCM DB tracks sessionKey=agent:main:main through reset, so the exact LCM assembled output for the pre-reset 7fa... failure is not preserved as a separate per-session row. The code-path contradiction above does not depend on that caveat.

Possible causes to investigate

  1. The current precheck conflates two different concepts: prompt-admission size and raw-history maintenance pressure.
  2. Context engines that perform real summarizing/retrieval assembly need their assembled result to be prompt-authoritative by default.
  3. If OpenClaw needs to detect raw-history debt, that should be a separate maintenance signal, not an unconditional prompt blocker.
  4. Post-compaction recovery can flip from compact_then_truncate to compact_only; after that, remaining tool-result cleanup may be skipped even though it was part of the original pressure.
  5. Session reset after precheck compaction failure causes user-visible forgetfulness and clears accounting/cache fields.
  6. Bootstrap/context injection volume and small effective gpt-5.5 prompt budget make the failure easier to trigger.
  7. Cortex/recall unavailability after reset can make the continuity loss more visible, though it is not the primary overflow trigger.

Suggested fix

Make context-engine assembly prompt-authoritative for prompt admission by default:

  • Precheck should estimate against assembled.messages plus system prompt and user prompt.
  • Prefer assembled.estimatedTokens when the engine provides a trustworthy estimate, or add a flag/capability for trust level.
  • If the host still wants to monitor pre-assembly/raw-history pressure, expose that as a separate maintenance/debt signal.
  • Add a regression test where a context engine assembles a small valid prompt from a large pre-assembly history and precheck allows the prompt instead of raising "Context overflow: prompt too large for the model (precheck)."
  • Consider a ContextEngineInfo capability such as assemblyIsPromptAuthoritative or a result-level metadata field that distinguishes assembled, fallback-live, and emergency outputs.
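A sketch of what a prompt-authoritative admission estimate could look like, assuming hypothetical fields (estimatedTokens, assemblyIsPromptAuthoritative); none of this is current OpenClaw API:

```typescript
// Hedged sketch of the suggested fix; all names here are proposals.
type Message = { role: string; content: string };

interface AssembleResult {
  messages: Message[];
  estimatedTokens?: number; // proposed: engine-provided estimate
}

interface ContextEngineInfo {
  assemblyIsPromptAuthoritative?: boolean; // proposed capability flag
}

function promptAdmissionEstimate(
  assembled: AssembleResult,
  info: ContextEngineInfo,
  unwindowedEstimate: number,
  estimateTokens: (msgs: Message[]) => number,
): number {
  // When the engine declares its assembly prompt-authoritative, admit the
  // prompt based on the assembled view (preferring the engine's estimate).
  if (info.assemblyIsPromptAuthoritative) {
    return assembled.estimatedTokens ?? estimateTokens(assembled.messages);
  }
  // Otherwise keep today's conservative behavior: the larger view gates.
  return Math.max(estimateTokens(assembled.messages), unwindowedEstimate);
}
```

Raw-history pressure would then feed a separate maintenance/debt signal rather than this admission path.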

Suggested regression shape

  1. Mock a context engine whose assemble() returns a small messages array under budget.
  2. Provide a much larger pre-assembly message array that would overflow if checked directly.
  3. Run the embedded attempt precheck path.
  4. Assert prompt submission is allowed, and raw-history pressure is reported only as maintenance debt.
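The four steps above can be sketched with plain stand-ins (the real test would mock the attempt/precheck modules under vitest; the budget, token estimator, and precheckFixed function here are hypothetical):

```typescript
// Sketch of the regression shape described above, not the real test suite.
type Message = { role: string; content: string };

// Hypothetical budget and crude token estimator (4 chars ~= 1 token).
const PROMPT_BUDGET = 208000;
const estimateTokens = (msgs: Message[]) =>
  msgs.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);

// Under the proposed fix, admission checks only the assembled view and
// reports raw-history size as maintenance debt instead of blocking.
function precheckFixed(assembled: Message[], rawHistory: Message[]) {
  return {
    admitted: estimateTokens(assembled) <= PROMPT_BUDGET,
    maintenanceDebtTokens: estimateTokens(rawHistory),
  };
}

// 2. A large pre-assembly history that would overflow if checked directly
//    (~100 tokens per message, ~500k tokens total)...
const rawHistory: Message[] = Array.from({ length: 5000 }, () => ({
  role: "tool",
  content: "x".repeat(400),
}));
// 1. ...assembled down to a small, under-budget prompt.
const assembled: Message[] = [{ role: "system", content: "summary" }];

// 3.-4. Run the precheck and assert admission plus debt reporting.
const result = precheckFixed(assembled, rawHistory);
```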

Impact

Long-running context-engine sessions can reset even though the context engine has already compacted or assembled a prompt-ready view. This causes avoidable prompt failures, compaction loops, cache disruption, and user-visible memory loss.
