
Context-engine assembled prompts are bypassed by overflow precheck using pre-assembly history #74233

@100yenadmin

Description

Summary

OpenClaw's ContextEngine.assemble() contract says the engine returns ordered messages ready for the model under a token budget, but the embedded runner still hard-gates prompt submission on a larger pre-assembly message view (unwindowedMessages). With lossless-claw/LCM as the context engine, this can force repeated precheck overflow, compaction retries, and eventually a session reset even when the assembled context should be the prompt-authoritative view.

This is related to earlier compaction/context drift reports such as #69838 and #50065, but the failure mode here is narrower: context-engine assembly is applied to activeSession.messages, and its smaller assembled result is then bypassed by the preemptive overflow precheck.

Environment observed

  • OpenClaw package: 2026.4.24
  • Live install symlink: /opt/homebrew/lib/node_modules/openclaw -> /Users/lume/repos/openclaw-pr70071-rebase
  • Live OpenClaw head: 39199f8e42
  • Active context engine slot: lossless-claw
  • Active model: openai-codex/gpt-5.5
  • Configured model context window: 258000
  • Runtime reserve: 50000
  • Effective prompt budget before reserve in logs: 208000
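The budget arithmetic behind the numbers above can be sketched as follows; the function name is hypothetical, not an OpenClaw API:

```typescript
// Hypothetical helper illustrating the budget arithmetic from the logs;
// the real field and function names in OpenClaw may differ.
function effectivePromptBudget(contextWindow: number, reserve: number): number {
  // Prompt tokens available once the runtime reserve is held back.
  return contextWindow - reserve;
}

// 258000 (gpt-5.5 window) - 50000 (runtime reserve) = 208000
const budget = effectivePromptBudget(258000, 50000);
```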

Confidence matrix

  • 0.94 - primary cause: OpenClaw precheck treats the pre-assembly message view as a hard prompt-admission gate even after context-engine assembly has produced a smaller prompt-ready view.
  • 0.86 - runtime failure timeline: Apr 28-29 logs show repeated gpt-5.5 precheck overflow, compaction retries, and a hard reset to a new session.
  • 0.85 - contributing cause: aggregate tool-result bulk repeatedly pushed prompt estimates over budget; recovery worked in some turns but not the fatal loop.
  • 0.80 - user-visible forgetfulness cause: compaction failure resets the session and clears token/cache/accounting fields; continuity then depends on the context engine and recall tools.
  • 0.75 - contributing pressure: gpt-5.5's 258000 context window plus a 50000 reserve leaves only 208000 prompt tokens, making this bug much easier to trigger than under larger-window models.
  • 0.70 - contributing pressure: repeated bootstrap/system/context injection and large workspace guidance add prompt bulk; logs show AGENTS.md and SOUL.md truncation around 40k chars.
  • 0.65 - related cache/retry risk: the deployed branch does not appear to include the full CLI prompt-build drain-cache retry fix from commit 4225db3a7b; that fix is likely adjacent for CLI/session-expired forgetfulness, though the fatal overflow evidence here is in the embedded runner.
  • 0.65 - related recall risk: Cortex recall was effectively absent in the same window, which can amplify perceived forgetfulness after a reset, but this is separate from the overflow precheck bug.

Source evidence

ContextEngine.assemble() is defined as the prompt-ready assembly point:

  • src/context-engine/types.ts: AssembleResult.messages are the "Ordered messages to use as model context" and assemble() "Returns an ordered set of messages ready for the model."
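A minimal sketch of that contract, assuming hypothetical shapes beyond the quoted docstrings (the actual types.ts definitions are richer):

```typescript
// Hedged sketch of the ContextEngine contract described above; everything
// except the quoted docstrings is an assumption, not the real types.ts.
type Message = { role: string; content: string };

interface AssembleResult {
  /** "Ordered messages to use as model context." */
  messages: Message[];
}

interface ContextEngine {
  /** "Returns an ordered set of messages ready for the model." */
  assemble(history: Message[], tokenBudget: number): AssembleResult;
}
```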

The embedded runner snapshots the pre-assembly messages, calls the context engine, and replaces the active session messages with the assembled output:

  • src/agents/pi-embedded-runner/run/attempt.ts
    • snapshots unwindowedContextEngineMessagesForPrecheck = activeSession.messages.slice() before assembly
    • calls assembleAttemptContextEngine(...)
    • assigns activeSession.agent.state.messages = assembled.messages when assembly returns a different array

The precheck then receives both the assembled active messages and the pre-assembly snapshot:

  • src/agents/pi-embedded-runner/run/attempt.ts
    • shouldPreemptivelyCompactBeforePrompt({ messages: activeSession.messages, unwindowedMessages: unwindowedContextEngineMessagesForPrecheck, ... })
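The snapshot-assemble-precheck sequence above can be sketched like this, with hypothetical types and function names standing in for the real attempt.ts code:

```typescript
// Minimal sketch of the runner flow described above; types and names are
// approximations, not the actual attempt.ts source.
type Message = { role: string; content: string };
interface Session { messages: Message[] }

function runAttemptPrecheck(
  activeSession: Session,
  assemble: (msgs: Message[]) => Message[],
  precheck: (args: { messages: Message[]; unwindowedMessages: Message[] }) => boolean,
): boolean {
  // 1. Snapshot the pre-assembly history
  //    (unwindowedContextEngineMessagesForPrecheck in the real code).
  const unwindowed = activeSession.messages.slice();
  // 2. Context-engine assembly replaces the active messages with its output.
  const assembled = assemble(activeSession.messages);
  if (assembled !== activeSession.messages) {
    activeSession.messages = assembled;
  }
  // 3. The precheck still receives the larger pre-assembly snapshot, which is
  //    where the assembled view gets bypassed.
  return precheck({
    messages: activeSession.messages,
    unwindowedMessages: unwindowed,
  });
}
```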

The precheck intentionally chooses the larger estimate:

  • src/agents/pi-embedded-runner/run/preemptive-compaction.ts
    • if unwindowedEstimatedPromptTokens > estimatedPromptTokens, it replaces the estimate and messagesForPressure with unwindowedMessages
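That selection logic amounts to the following sketch; the names approximate preemptive-compaction.ts and are not copied from the source:

```typescript
// Sketch of the estimate-selection behavior described above.
type Message = { role: string; content: string };

function choosePressureView(
  estimatedPromptTokens: number,
  messages: Message[],
  unwindowedEstimatedPromptTokens: number,
  unwindowedMessages: Message[],
): { estimate: number; messagesForPressure: Message[] } {
  // The precheck deliberately takes the larger estimate, so whenever the raw
  // pre-assembly history is bigger, it wins over the assembled prompt view.
  if (unwindowedEstimatedPromptTokens > estimatedPromptTokens) {
    return {
      estimate: unwindowedEstimatedPromptTokens,
      messagesForPressure: unwindowedMessages,
    };
  }
  return { estimate: estimatedPromptTokens, messagesForPressure: messages };
}
```

With the numbers from this incident (an assembled view near 87767 tokens and a raw history near 284165 tokens), the raw-history estimate always wins and the prompt is blocked.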

A current unit test locks in this behavior:

  • src/agents/pi-embedded-runner/run/preemptive-compaction.test.ts
    • test name: uses the larger unwindowed message estimate when context engine assembly windows history

A targeted local test run passed, confirming the behavior is expected by the current test suite:

node scripts/run-vitest.mjs run --config test/vitest/vitest.agents.config.ts src/agents/pi-embedded-runner/run/preemptive-compaction.test.ts

Test Files  1 passed (1)
Tests       10 passed (10)

Runtime evidence

The fatal Apr 29 loop looked like this:

2026-04-29T01:24:27.357+07:00 [context-overflow-precheck] route=compact_then_truncate estimatedPromptTokens=284165 promptBudgetBeforeReserve=208000 overflowTokens=76165 toolResultReducibleChars=227371 sessionFile=.../7fa8806a-4ad4-4b57-8978-e7d08a6bfdc2.jsonl
2026-04-29T01:24:36.133+07:00 [context-overflow-precheck] route=compact_only estimatedPromptTokens=215976 promptBudgetBeforeReserve=208000 overflowTokens=7976 toolResultReducibleChars=0 sessionFile=.../7fa8806a-4ad4-4b57-8978-e7d08a6bfdc2.jsonl
2026-04-29T01:24:47.337+07:00 [context-overflow-precheck] route=compact_only estimatedPromptTokens=215976 promptBudgetBeforeReserve=208000 overflowTokens=7976 toolResultReducibleChars=0 sessionFile=.../7fa8806a-4ad4-4b57-8978-e7d08a6bfdc2.jsonl
2026-04-29T01:24:48.125+07:00 [context-overflow-precheck] route=compact_only estimatedPromptTokens=215976 promptBudgetBeforeReserve=208000 overflowTokens=7976 toolResultReducibleChars=0 sessionFile=.../7fa8806a-4ad4-4b57-8978-e7d08a6bfdc2.jsonl
2026-04-29T01:24:48.172+07:00 Auto-compaction failed (Context overflow: prompt too large for the model (precheck).). Restarting session agent:main:main -> 4ff1673e-4b2c-406a-b6e4-03c0beb54b25 and retrying.

The new session begins with an injected reset message:

Context limit exceeded. I've reset our conversation to start fresh - please try again.

The active LCM database state after the reset showed much smaller assembled context than the raw stored history:

  • active conversation had about 56k messages / 20M stored message tokens
  • current context_items were compact: summaries + recent messages, around tens of thousands of tokens
  • maintenance row reported current_token_count=87767 against token_budget=258000

Caveat: the LCM DB tracks sessionKey=agent:main:main through reset, so the exact LCM assembled output for the pre-reset 7fa... failure is not preserved as a separate per-session row. The code-path contradiction above does not depend on that caveat.

Possible causes to investigate

  1. The current precheck conflates two different concepts: prompt-admission size and raw-history maintenance pressure.
  2. Context engines that perform real summarizing/retrieval assembly need their assembled result to be prompt-authoritative by default.
  3. If OpenClaw needs to detect raw-history debt, that should be a separate maintenance signal, not an unconditional prompt blocker.
  4. Post-compaction recovery can flip from compact_then_truncate to compact_only; after that, remaining tool-result cleanup may be skipped even though it was part of the original pressure.
  5. Session reset after precheck compaction failure causes user-visible forgetfulness and clears accounting/cache fields.
  6. Bootstrap/context injection volume and small effective gpt-5.5 prompt budget make the failure easier to trigger.
  7. Cortex/recall unavailability after reset can make the continuity loss more visible, though it is not the primary overflow trigger.

Suggested fix

Make context-engine assembly prompt-authoritative for prompt admission by default:

  • Precheck should estimate against assembled.messages plus system prompt and user prompt.
  • Prefer assembled.estimatedTokens when the engine provides a trustworthy estimate, or add a flag/capability for trust level.
  • If the host still wants to monitor pre-assembly/raw-history pressure, expose that as a separate maintenance/debt signal.
  • Add a regression test where a context engine assembles a small valid prompt from a large pre-assembly history and precheck allows the prompt instead of raising "Context overflow: prompt too large for the model (precheck)."
  • Consider a ContextEngineInfo capability such as assemblyIsPromptAuthoritative or a result-level metadata field that distinguishes assembled, fallback-live, and emergency outputs.
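A sketch of what a prompt-authoritative admission estimate could look like, assuming hypothetical fields (estimatedTokens, assemblyIsPromptAuthoritative); none of this is current OpenClaw API:

```typescript
// Hedged sketch of the suggested fix; all names here are proposals.
type Message = { role: string; content: string };

interface AssembleResult {
  messages: Message[];
  estimatedTokens?: number; // proposed: engine-provided estimate
}

interface ContextEngineInfo {
  assemblyIsPromptAuthoritative?: boolean; // proposed capability flag
}

function promptAdmissionEstimate(
  assembled: AssembleResult,
  info: ContextEngineInfo,
  unwindowedEstimate: number,
  estimateTokens: (msgs: Message[]) => number,
): number {
  // When the engine declares its assembly prompt-authoritative, admit the
  // prompt based on the assembled view (preferring the engine's estimate).
  if (info.assemblyIsPromptAuthoritative) {
    return assembled.estimatedTokens ?? estimateTokens(assembled.messages);
  }
  // Otherwise keep today's conservative behavior: the larger view gates.
  return Math.max(estimateTokens(assembled.messages), unwindowedEstimate);
}
```

Raw-history pressure would then feed a separate maintenance/debt signal rather than this admission path.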

Suggested regression shape

  1. Mock a context engine whose assemble() returns a small messages array under budget.
  2. Provide a much larger pre-assembly message array that would overflow if checked directly.
  3. Run the embedded attempt precheck path.
  4. Assert prompt submission is allowed, and raw-history pressure is reported only as maintenance debt.
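The four steps above can be sketched with plain stand-ins (the real test would mock the attempt/precheck modules under vitest; the budget, token estimator, and precheckFixed function here are hypothetical):

```typescript
// Sketch of the regression shape described above, not the real test suite.
type Message = { role: string; content: string };

// Hypothetical budget and crude token estimator (4 chars ~= 1 token).
const PROMPT_BUDGET = 208000;
const estimateTokens = (msgs: Message[]) =>
  msgs.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);

// Under the proposed fix, admission checks only the assembled view and
// reports raw-history size as maintenance debt instead of blocking.
function precheckFixed(assembled: Message[], rawHistory: Message[]) {
  return {
    admitted: estimateTokens(assembled) <= PROMPT_BUDGET,
    maintenanceDebtTokens: estimateTokens(rawHistory),
  };
}

// 2. A large pre-assembly history that would overflow if checked directly
//    (~100 tokens per message, ~500k tokens total)...
const rawHistory: Message[] = Array.from({ length: 5000 }, () => ({
  role: "tool",
  content: "x".repeat(400),
}));
// 1. ...assembled down to a small, under-budget prompt.
const assembled: Message[] = [{ role: "system", content: "summary" }];

// 3.-4. Run the precheck and assert admission plus debt reporting.
const result = precheckFixed(assembled, rawHistory);
```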

Impact

Long-running context-engine sessions can reset even though the context engine has already compacted or assembled a prompt-ready view. This causes avoidable prompt failures, compaction loops, cache disruption, and user-visible memory loss.
