
[Bug]: Default bootstrapMaxChars=20000 + verbose auto-generated bootstrap content degrades tool dispatch on small/mid models #75189

@camerono

Description


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

OpenClaw's default agents.defaults.bootstrapMaxChars: 20000, combined with the verbose auto-generated workspace bootstrap files (~13 KB across AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md, HEARTBEAT.md, BOOTSTRAP.md), yields a system prompt of ~27 KB once framework-level guidance is added. On small/mid models (≤14B params), this exceeds the prompt-length threshold beyond which tool-use instruction is deprioritized: the agent hallucinates tool calls (producing plausible-sounding text that describes tool use) rather than actually invoking the structured tools.

Steps to reproduce

  1. Install OpenClaw 2026.4.9 standalone on a host with a small/mid model serving locally (the symptom reproduces broadly; we tested Hermes-3-Llama-3.1-8B on vLLM with --tool-call-parser hermes).
  2. Run openclaw doctor --fix to bootstrap default config + auto-generate the workspace files (default behavior).
  3. Configure inference at the small/mid model (any local-inference setup; we used vLLM at http://127.0.0.1:8002/v1).
  4. Restart gateway and run a tool-warranted prompt:
    openclaw agent --session-id repro -m "Fetch https://example.com and summarize the page in one sentence." --json
    
  5. Inspect result.meta.systemPromptReport.systemPrompt.chars and the session jsonl at ~/.openclaw/agents/main/sessions/<sessionId>.jsonl for toolCall events.
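The inspection in step 5 can be sketched as a small check for missing toolCall events. The per-line event shape (a `ctypes` list on each message) is an assumption inferred from the session dumps shown later in this report; adjust the field names to the real schema:

```python
import json

def count_tool_calls(jsonl_text: str) -> int:
    """Count session events that carry a toolCall content type.

    Assumes each line of the session jsonl is a JSON object with a
    "ctypes" list, as suggested by the session dumps in this report.
    """
    count = 0
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if "toolCall" in event.get("ctypes", []):
            count += 1
    return count
```

A result of 0 for a tool-warranted prompt is the silent-failure signature described below.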

Expected behavior

For local-inference users on small/mid models (the most common consumer-GPU configuration), the default bootstrap configuration should produce a system prompt that fits within the prompt-length threshold for reliable tool-use instruction following on those models (~2,000–4,000 chars per the AGENTIF benchmark and Runyard 2026 best-practices guidance). See the suggested fixes below for concrete options.

Actual behavior

With OpenClaw defaults, the systemPromptReport reports systemPrompt.chars: 27345 for any agent run. Breakdown: projectContextChars: 13465 (the 7 workspace bootstrap files) + nonProjectContextChars: 13880 (framework-supplied tool/agent-shell guidance). On Hermes-3-Llama-3.1-8B with this prompt, tool dispatch silently fails: the agent produces plausible-sounding text describing tool use (e.g. "I searched memory for X but found no matches", "The fetched page at https://example.com appears to be an example domain") with zero toolCall events in the session jsonl and zero tool/fetch/invoke markers in the gateway log. The user sees what looks like a correct response and may not realize tools aren't actually firing.

Setting agents.defaults.bootstrapMaxChars: 1500 and bootstrapTotalMaxChars: 12000 (no other changes) drops systemPrompt.chars to ~16 KB and restores real tool dispatch, confirmed at the wire level: 1 toolCall + 1 toolResult for single-tool prompts, 3 toolCall + 3 toolResult for chained 3-step prompts.
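The workaround above, as it might appear in the OpenClaw config file (the file location and exact JSON shape are assumptions; the key names bootstrapMaxChars and bootstrapTotalMaxChars are taken from this report):

```json
{
  "agents": {
    "defaults": {
      "bootstrapMaxChars": 1500,
      "bootstrapTotalMaxChars": 12000
    }
  }
}
```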

OpenClaw version

2026.4.9 (build 0512059)

Operating system

Ubuntu 24.04 LTS aarch64 (Linux 6.17.0-1014-nvidia)

Install method

npm global (npm install -g [email protected]), Node v22.22.2 via nvm

Model

NousResearch/Hermes-3-Llama-3.1-8B. Also reproduced on Qwen3-14B (default thinking) and Mistral-7B-Instruct-v0.3, where the same prompt-bloat pressure manifests in different but equally broken failure modes: Qwen3 produces empty payloads via the reasoning_content split, and Mistral echoes the system prompt back as its response.

Provider / routing chain

openclaw (standalone host gateway) → vLLM (http://127.0.0.1:8002/v1) → small/mid local model

Additional provider/model setup details

Standalone OpenClaw on a host (no NemoClaw sandbox). vLLM 0.19.1 Docker container at :8002 with --enable-auto-tool-choice --tool-call-parser hermes --gpu-memory-utilization 0.20 --max-model-len 32768. ~14 GB GPU resident. Gateway in local mode, primary model inference/hermes-3-llama-3.1-8b.

Logs, screenshots, and evidence

Default-config systemPromptReport (excerpt):

{
  "systemPrompt": {
    "chars": 27345,
    "projectContextChars": 13465,
    "nonProjectContextChars": 13880
  },
  "injectedWorkspaceFiles": [
    {"name": "AGENTS.md",   "rawChars": 7809, "injectedChars": 7809, "truncated": false},
    {"name": "SOUL.md",     "rawChars": 1738, "injectedChars": 1738, "truncated": false},
    {"name": "TOOLS.md",    "rawChars":  850, "injectedChars":  850, "truncated": false},
    {"name": "IDENTITY.md", "rawChars":  633, "injectedChars":  633, "truncated": false},
    {"name": "USER.md",     "rawChars":  474, "injectedChars":  474, "truncated": false},
    {"name": "HEARTBEAT.md","rawChars":  192, "injectedChars":  192, "truncated": false},
    {"name": "BOOTSTRAP.md","rawChars": 1450, "injectedChars": 1450, "truncated": false}
  ]
}


Default config + tool-warranted prompt — gateway log delta during the agent run:

$ tail -c 6321 /tmp/openclaw/openclaw-2026-04-30.log | grep -iE 'tool_call|tool.*name|web_fetch|fetch.*url|invoke|dispatch'
(empty — no tool dispatch markers)


Default config — session jsonl event sequence:

L5  message  role=user
L6  message  role=assistant  ctypes=['text']  text="The fetched page at https://example.com appears to be an example domain..."
                                              ← hallucinated; example.com is famously a placeholder, model knew this without fetching


After lowering `bootstrapMaxChars` to 1500 (no other changes) — same prompt, same model, same vLLM, same parser:

L5  message  role=user
L6  message  role=assistant  ctypes=['toolCall']
L7  message  role=toolResult  ctypes=['text']  text='{"url":"https://example.com","status":200,"contentType":"text/html",...}'
L8  message  role=assistant  ctypes=['text']  text="The fetched page at https://example.com is a security notice..."
                                              ← real summary of real fetched HTML body


Public-domain validation (research showing prompt length above ~2,000 words degrades tool-following on small/mid models):
- "Writing System Prompts for AI Agents: Best Practices for 2026" (Runyard) — "Prompts over 2,000 words tend to produce agents that follow early instructions well and ignore later ones."
- AGENTIF benchmark (Tsinghua KEG) — quantifies "performance degradation as instruction length increases."
- BFCL v3 leaderboard 2026 — Qwen3 8B at F1 0.933, Hermes-3 mid-tier; performance is sensitive to prompt shape.

Impact and severity

Affected: every OpenClaw user running a small/mid local model (the most common consumer-GPU configuration) who relies on tool dispatch. Cloud-API users on capable models (Sonnet 4.6, Opus 4.7, GPT-5.4) generally don't hit this because those models tolerate long prompts well — but local-inference users on Llama-3.1 / Qwen3 / Hermes / Mistral / similar 7B–14B models do. This is a growing user segment (the entire "self-host on consumer GPU" cohort).

Severity: medium-high. The failure mode is silent and confidence-inducing: hallucinated tool replies often sound correct (Hermes-3 8B's example.com summary read like a real summary), so users may not realize tools aren't actually firing for hours or days. Real-world consequences are particularly bad for action-taking agents (skills that send messages, modify files, execute commands) — the agent claims success while doing nothing, or worse, claims a fabricated success that the user acts on downstream.

Frequency: deterministic on default config + small/mid model.

Consequence: silent erosion of tool-call reliability across the local-inference user base. Combined with the surface area of the issue (any tool call, on any small/mid model), this is the kind of bug that quietly drives users away from the framework or onto larger (cloud-only) models even when their local hardware would be sufficient if the prompt were lean.

Additional information

Related issues:

Suggested fixes (any one materially helps; combining them is best):

  1. Lower default bootstrapMaxChars to ~1,500–2,000. Backwards-compatible (existing users with explicit settings unaffected). One-line change to the schema default.
  2. Trim the auto-generated AGENTS.md content to ~1,500 chars, focused on tool-use rules + red-lines. (Synergizes with #8, "Track and persist Claude's returned session ID for conversation continuity".)
  3. Land PR #22439 ("feat(workspace): add tiered bootstrap loading with configurable bootstrapTier"; see also issue #22438, "feat: Tiered bootstrap file loading for progressive context control") with bootstrapTier: minimal as the default for new installs.
  4. Add a doctor warning when systemPrompt.chars > 8000 AND configured primary model is in the small-mid class (≤14B params). Suggest the trim + link to the bootstrap-tier docs.
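Fix 4's heuristic can be sketched as a simple predicate. The threshold values come from this report; the function and parameter names are illustrative, not OpenClaw API:

```python
def should_warn(system_prompt_chars: int, model_params_b: float,
                char_threshold: int = 8000, small_mid_max_b: float = 14.0) -> bool:
    """Sketch of suggested fix 4: warn when the assembled system prompt is
    large AND the configured primary model is in the small/mid class.
    Thresholds (8000 chars, <=14B params) are from this report."""
    return system_prompt_chars > char_threshold and model_params_b <= small_mid_max_b
```

With the defaults from this report, should_warn(27345, 8.0) fires, while a 70B cloud model or a trimmed ~5 KB prompt would not.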
