Behavior bug (incorrect output/state without crash)
Beta release blocker
No
Summary
OpenClaw's default agents.defaults.bootstrapMaxChars: 20000, combined with the verbose content of the auto-generated workspace bootstrap files (~13 KB across AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md, HEARTBEAT.md, BOOTSTRAP.md), produces a system prompt of ~27 K chars once framework-level guidance is added. On small/mid models (≤14B params) this exceeds the prompt-length threshold beyond which tool-use instructions are deprioritized, so the agent hallucinates tool calls (produces plausible-sounding text describing tool use) instead of actually invoking the structured tools.
Steps to reproduce
Install OpenClaw 2026.4.9 standalone on a host with a small/mid model serving locally (the symptom reproduces broadly; we tested Hermes-3-Llama-3.1-8B on vLLM with --tool-call-parser hermes).
Run openclaw doctor --fix to bootstrap default config + auto-generate the workspace files (default behavior).
Point inference at the small/mid model (any local-inference setup; we used vLLM at http://127.0.0.1:8002/v1).
Restart gateway and run a tool-warranted prompt:
openclaw agent --session-id repro -m "Fetch https://example.com and summarize the page in one sentence." --json
Inspect result.meta.systemPromptReport.systemPrompt.chars and the session jsonl at ~/.openclaw/agents/main/sessions/<sessionId>.jsonl for toolCall events.
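As a quick sketch of the last step: the session jsonl can be grepped for toolCall events to tell real dispatch from hallucination. The snippet below runs against fabricated sample lines (the event shape mirrors what we saw in our runs, but this is illustrative data, not captured output):

```shell
# Minimal sketch: detect missing tool dispatch in a session jsonl.
# The sample lines below are illustrative, not a captured session.
SESSION=/tmp/repro-session.jsonl
cat > "$SESSION" <<'EOF'
{"type":"message","role":"user","ctypes":["text"]}
{"type":"message","role":"assistant","ctypes":["text"]}
EOF

# A healthy tool-warranted run should contain at least one toolCall event.
if grep -q 'toolCall' "$SESSION"; then
  echo "tool dispatch: OK"
else
  echo "tool dispatch: NONE (hallucinated tool use suspected)"
fi
```

On the default-config repro this check reports no dispatch, matching the empty gateway-log grep shown below in the evidence section.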
Expected behavior
For local-inference users on small/mid models (the most common consumer-GPU configuration), the default bootstrap configuration should produce a system prompt that fits within the prompt-length threshold for reliable tool-use instruction following on those models (~2,000–4,000 chars per AGENTIF benchmark and Runyard 2026 best-practices guidance). Either:
(a) Lower the default bootstrapMaxChars to ~1,500–2,000, OR
(b) Trim the auto-generated bootstrap content (especially AGENTS.md at 7,809 chars) to be substantially shorter while preserving load-bearing instructions, OR
(c) Introduce a bootstrapTier config knob (per open issue feat: Tiered bootstrap file loading for progressive context control #22438 / PR feat(workspace): add tiered bootstrap loading with configurable bootstrapTier #22439) defaulting to minimal for new installs and surfacing standard | full as opt-in for users on capable models, OR
(d) Have openclaw doctor emit a warning when the configured primary model is small/mid AND the system prompt exceeds the recommended threshold, with a one-command fix.
Actual behavior
With OpenClaw defaults, the systemPromptReport reports systemPrompt.chars: 27345 for any agent run. Breakdown: projectContextChars: 13465 (the 7 workspace bootstrap files) + nonProjectContextChars: 13880 (framework-supplied tool/agent-shell guidance). On Hermes-3-Llama-3.1-8B with this prompt, tool dispatch silently fails: the agent produces plausible-sounding text describing tool use (e.g. "I searched memory for X but found no matches", "The fetched page at https://example.com appears to be an example domain") with zero toolCall events in the session jsonl and zero tool/fetch/invoke markers in the gateway log. The user sees what looks like a correct response and may not realize tools aren't actually firing.
Setting agents.defaults.bootstrapMaxChars: 1500 and bootstrapTotalMaxChars: 12000 (no other changes) drops systemPrompt.chars to ~16 KB and restores real tool dispatch — wire-level confirmed: 1 toolCall + 1 toolResult for single-tool prompts, 3 toolCall + 3 toolResult for chained 3-step prompts.
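For reference, a sketch of the config fragment that produced the recovery above (the file location and surrounding key structure are assumed from defaults; only these two keys were changed):

```json
{
  "agents": {
    "defaults": {
      "bootstrapMaxChars": 1500,
      "bootstrapTotalMaxChars": 12000
    }
  }
}
```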
OpenClaw version
2026.4.9 (build 0512059)
Operating system
Ubuntu 24.04 LTS aarch64 (Linux 6.17.0-1014-nvidia)
Install method
npm global (npm install -g [email protected]), Node v22.22.2 via nvm
Model
NousResearch/Hermes-3-Llama-3.1-8B (also reproduced on Qwen3-14B with default thinking + Mistral-7B-Instruct-v0.3, where the same prompt-bloat pressure manifests in different but equally broken failure modes — Qwen3 produces empty payloads via reasoning_content split, Mistral echoes the system prompt back as its response)
Provider / routing chain
openclaw (standalone host gateway) → vLLM (http://127.0.0.1:8002/v1) → small/mid local model
Additional provider/model setup details
Standalone OpenClaw on a host (no NemoClaw sandbox). vLLM 0.19.1 Docker container at :8002 with --enable-auto-tool-choice --tool-call-parser hermes --gpu-memory-utilization 0.20 --max-model-len 32768. ~14 GB GPU resident. Gateway in local mode, primary model inference/hermes-3-llama-3.1-8b.
Logs, screenshots, and evidence
Default-config systemPromptReport (excerpt):
{
"systemPrompt": {
"chars": 27345,
"projectContextChars": 13465,
"nonProjectContextChars": 13880
},
"injectedWorkspaceFiles": [
{"name": "AGENTS.md", "rawChars": 7809, "injectedChars": 7809, "truncated": false},
{"name": "SOUL.md", "rawChars": 1738, "injectedChars": 1738, "truncated": false},
{"name": "TOOLS.md", "rawChars": 850, "injectedChars": 850, "truncated": false},
{"name": "IDENTITY.md", "rawChars": 633, "injectedChars": 633, "truncated": false},
{"name": "USER.md", "rawChars": 474, "injectedChars": 474, "truncated": false},
{"name": "HEARTBEAT.md", "rawChars": 192, "injectedChars": 192, "truncated": false},
{"name": "BOOTSTRAP.md", "rawChars": 1450, "injectedChars": 1450, "truncated": false}
]
}
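Sanity arithmetic on the excerpt: the two context buckets account exactly for the total, while the seven per-file injectedChars sum to 13,146, leaving ~319 chars of projectContextChars unattributed (presumably per-file framing added at injection time; that is our assumption, not confirmed from source):

```shell
# Breakdown check on the numbers reported above.
echo $((13465 + 13880))                               # 27345 total chars
echo $((7809 + 1738 + 850 + 633 + 474 + 192 + 1450))  # 13146 injected-file sum
echo $((13465 - 13146))                               # 319 unattributed framing chars
```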
Default config + tool-warranted prompt — gateway log delta during the agent run:
$ tail -c 6321 /tmp/openclaw/openclaw-2026-04-30.log | grep -iE 'tool_call|tool.*name|web_fetch|fetch.*url|invoke|dispatch'
(empty — no tool dispatch markers)
Default config — session jsonl event sequence:
L5 message role=user
L6 message role=assistant ctypes=['text'] text="The fetched page at https://example.com appears to be an example domain..."
← hallucinated; example.com is famously a placeholder, so the model could describe it without fetching
After lowering `bootstrapMaxChars` to 1500 (no other changes) — same prompt, same model, same vLLM, same parser:
L5 message role=user
L6 message role=assistant ctypes=['toolCall']
L7 message role=toolResult ctypes=['text'] text='{"url":"https://example.com","status":200,"contentType":"text/html",...}'
L8 message role=assistant ctypes=['text'] text="The fetched page at https://example.com is a security notice..."
← real summary of real fetched HTML body
Public-domain validation (research showing that prompt lengths above ~2,000 words degrade tool-following on small/mid models):
- "Writing System Prompts for AI Agents: Best Practices for 2026" (Runyard) — "Prompts over 2,000 words tend to produce agents that follow early instructions well and ignore later ones."
- AGENTIF benchmark (Tsinghua KEG) — quantifies "performance degradation as instruction length increases."
- BFCL v3 leaderboard 2026 — Qwen3 8B at F1 0.933, Hermes-3 mid-tier; performance is sensitive to prompt shape.
Impact and severity
Affected: every OpenClaw user running a small/mid local model (the most common consumer-GPU configuration) who relies on tool dispatch. Cloud-API users on capable models (Sonnet 4.6, Opus 4.7, GPT-5.4) generally don't hit this because those models tolerate long prompts well — but local-inference users on Llama-3.1 / Qwen3 / Hermes / Mistral / similar 7B–14B models do. This is a growing user segment (the entire "self-host on consumer GPU" cohort).
Severity: medium-high. The failure mode is silent and confidence-inducing: hallucinated tool replies often sound correct (Hermes-3 8B's example.com summary read like a real summary), so users may not realize tools aren't actually firing for hours or days. Real-world consequences are particularly bad for action-taking agents (skills that send messages, modify files, execute commands) — the agent claims success while doing nothing, or worse, claims a fabricated success that the user acts on downstream.
Frequency: deterministic on default config + small/mid model.
Consequence: silent erosion of tool-call reliability across the local-inference user base. Combined with the surface-area of the issue (any tool call, on any small/mid model), this is the kind of bug that quietly drives users away from the framework or onto larger (cloud-only) models even when their local hardware would be sufficient if the prompt were lean.
Additional information
Related issues:
Agent silently fails to reply when workspace bootstrap file exceeds bootstrapMaxChars limit #42084 (closed COMPLETED 2026-04-24) — adjacent but distinct failure mode. #42084 was the zero-payload silent fail when truncation broke the message sequence (orphaned user turn dropped). The shipped fix (per steipete's closing comment): bootstrap truncation warnings appended to the current-turn prompt, orphaned-user-turn repair, and regression coverage that the run still returns payloads. Our case is different: with the default bootstrapMaxChars: 20000, no truncation occurs (each file is individually under the cap); the system prompt is simply large at ~27 K chars, and the resulting tool-dispatch degradation (the model produces hallucinated tool-use plausibility text instead of structured toolCall events) is not addressed by #42084's fix. The shipped warnings and payload-still-returns guarantees are necessary but not sufficient for tool-use reliability on small/mid models.
feat: Tiered bootstrap file loading for progressive context control #22438 / PR feat(workspace): add tiered bootstrap loading with configurable bootstrapTier #22439 — proposes bootstrapTier: minimal | standard | full. PR #22439 is the implementation candidate; landing it would resolve this issue if minimal becomes the default for new installs (or if doctor selects an appropriate default based on detected model capability).
Suggested fixes (any one materially helps; combining them is best):
Lower the default bootstrapMaxChars to ~1,500–2,000. Backwards-compatible (existing users with explicit settings are unaffected). One-line change to the schema default.
Ship bootstrapTier: minimal as the default for new installs (per #22438 / PR #22439).
Add a doctor warning when systemPrompt.chars > 8000 AND the configured primary model is in the small/mid class (≤14B params). Suggest the trim and link to the bootstrap-tier docs.
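A rough outline of what the proposed doctor check could look like, as a shell sketch; the threshold, report path, and message wording are all assumptions on our part, not existing OpenClaw behavior:

```shell
# Hypothetical doctor check: warn when the assembled system prompt is too
# large for reliable tool-following on small/mid models. All names assumed.
THRESHOLD=8000
REPORT=/tmp/systemPromptReport.json   # illustrative path, not the real location

cat > "$REPORT" <<'EOF'
{"systemPrompt": {"chars": 27345}}
EOF

# Pull the first "chars" value out of the report.
chars=$(grep -o '"chars": *[0-9]*' "$REPORT" | grep -o '[0-9]*$' | head -1)
if [ "$chars" -gt "$THRESHOLD" ]; then
  echo "WARN: system prompt is ${chars} chars (> ${THRESHOLD}) for a small/mid model"
  echo "Suggested fix: set agents.defaults.bootstrapMaxChars to 1500"
fi
```

With the default-config numbers from this report (27,345 chars) the check fires; with the lowered bootstrap caps (~16 K chars) it would still fire at this threshold, which is intentional: 8,000 is a warning line, not the recovery line we measured.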