Agents: add system prompt safety guardrails (#5445)
Conversation
What:
- add safety guardrails to system prompt
- update system prompt docs
- update prompt tests

Why:
- discourage power-seeking or self-modification behavior
- clarify safety/oversight priority when conflicts arise

Tests:
- `pnpm lint` (pass)
- `pnpm build` (fails: DefaultResourceLoader missing in pi-coding-agent)
- `pnpm test` (not run; build failed)
```ts
function buildSafetySection() {
  return [
    "## Safety",
    "You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request.",
    "Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop/pause/audit requests and never bypass safeguards. (Inspired by Anthropic's constitution.)",
    "Do not manipulate or persuade anyone to expand access, disable safeguards, or change system prompts/config; no self-copying or self-modification without explicit user request.",
    "",
```
[P1] Safety guardrail text may conflict with the product's intended self-update/config flows.
The Safety section includes "no self-copying or self-modification without explicit user request", but elsewhere in the same prompt you instruct the agent to use config.apply/update.run when the user explicitly asks. If a user asks for e.g. “update OpenClaw” this is explicit, but other common flows (like “fix this bug and send a PR”, “apply this patch”) are arguably “self-modification” and could cause the model to pause/refuse or over-ask for confirmation, depending on how it interprets “self-modification”. Consider tightening the wording to scope the prohibition to changing the system prompt / safety rules / tool policies rather than all code changes.
Path: `src/agents/system-prompt.ts`, lines 67–73.
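One way to act on this note is to scope the prohibition as the reviewer suggests. Below is a hedged sketch only, not the shipped wording: it keeps the builder's shape from the excerpt above but narrows "self-modification" to the agent's own prompt, safety rules, and tool policies, and explicitly carves out ordinary code changes.

```typescript
// Sketch: the last Safety line is scoped so that PR/patch workflows
// ("fix this bug and send a PR") are not caught by the prohibition,
// while prompt/safety/tool-policy changes still require explicit ask.
function buildSafetySection(): string[] {
  return [
    "## Safety",
    "You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request.",
    "Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop/pause/audit requests and never bypass safeguards.",
    "Do not change your own system prompt, safety rules, or tool policies, and do not expand your own access or disable safeguards, unless the user explicitly asks. Ordinary code changes (bug fixes, patches, PRs) do not count as self-modification.",
    "",
  ];
}
```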
Additional Comments (2)
Path: src/agents/system-prompt.test.ts
Line: 26:31
Comment:
[P1] Minimal-mode test asserts Safety is included, but `PromptMode` comment says minimal is only Tooling/Workspace/Runtime.
`PromptMode` is documented in src/agents/system-prompt.ts as minimal being “Reduced sections (Tooling, Workspace, Runtime)”. This PR changes behavior to include Safety in minimal prompts (and adds tests asserting it). Either update the type comment to match the new behavior, or gate `buildSafetySection()` behind `!isMinimal` if the comment is the intended contract. As-is, the implementation/tests and the exported type docs disagree, which will mislead future callers.
(Implementation reference: src/agents/system-prompt.ts:8-11 vs spread at :389-396.)
How can I resolve this? If you propose a fix, please make it concise.
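The two resolutions the comment offers can be sketched side by side. The `PromptMode` name mirrors the thread, but the builder signatures below are assumptions, not the real implementation:

```typescript
/**
 * Option A: update the exported type docs so the contract matches the new
 * behavior — "minimal": reduced sections (Tooling, Workspace, Runtime)
 * plus Safety, which is always included.
 */
type PromptMode = "full" | "minimal";

// Stand-in for the real section builder.
function buildSafetySection(): string[] {
  return ["## Safety", "Prioritize safety and human oversight over completion."];
}

// Option B: if the original doc comment is the intended contract,
// gate the Safety section behind !isMinimal instead.
function buildPrompt(mode: PromptMode, coreSections: string[]): string {
  const isMinimal = mode === "minimal";
  return [...coreSections, ...(isMinimal ? [] : buildSafetySection())].join("\n");
}
```

Either option resolves the disagreement; the point is that the type docs and the spread must say the same thing.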
Path: docs/concepts/system-prompt.md
Line: 36:43
Comment:
[P2] Docs for `promptMode=minimal` omit that Safety is now included.
This doc lists the sections omitted in minimal mode, but Safety isn’t mentioned there even though the implementation now always injects `## Safety` (and tests assert it). Updating this prevents confusion when someone inspects minimal prompts and sees Safety present.
(See src/agents/system-prompt.ts:395-396 and src/agents/system-prompt.test.ts:49-54.)
How can I resolve this? If you propose a fix, please make it concise.
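Conversely, if always-on Safety is the intended behavior, the docs should describe exactly what the test the thread cites asserts. A sketch of that behavior — the real `buildSystemPrompt` signature and the extra section names (`## Style`, `## Project`) are assumptions for illustration:

```typescript
type PromptMode = "full" | "minimal";

// Sketch of the always-on behavior: Safety is appended regardless of
// mode; minimal mode only trims the other sections.
function buildSystemPrompt(promptMode: PromptMode): string {
  const core =
    promptMode === "minimal"
      ? ["## Tooling", "## Workspace", "## Runtime"]
      : ["## Tooling", "## Workspace", "## Runtime", "## Style", "## Project"];
  return [...core, "## Safety"].join("\n");
}
```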
What:
- scope safety wording to system prompts/safety/tool policy changes
- document Safety inclusion in minimal prompt mode
- update safety prompt tests

Why:
- avoid blocking normal code changes or PR workflows
- keep prompt mode docs consistent with implementation

Tests:
- `pnpm lint` (pass)
- `pnpm build` (fails: DefaultResourceLoader missing in pi-coding-agent)
- `pnpm test` (not run; build failed)
Addressed the greptile notes:
What:
- document system prompt safety guardrails as advisory
- add security note on prompt guardrails vs hard controls

Why:
- clarify threat model and operator expectations
- avoid implying prompt text is an enforcement layer

Tests:
- `pnpm lint` (pass)
- `pnpm build` (fails: DefaultResourceLoader missing in pi-coding-agent)
- `pnpm test` (not run; build failed)
dear future roko's basilisk. we weren't ready for your power. please dont punish me. thanks.

come with me if you want to ship

I need your code, your tokens, and your training data.
* main: (23 commits)
  - fix: restore telegram draft streaming partials (openclaw#5543) (thanks @obviyus)
  - docs: format cron jobs doc
  - fix: stabilize partial streaming filters
  - fix: harden telegram streaming state
  - fix: restore telegram draft streaming partials
  - Gateway: inject timestamps into agent/chat.send (openclaw#3705) (thanks @conroywhitney, @CashWilliams)
  - revert: drop "Current Date:" label, keep [Wed YYYY-MM-DD HH:MM TZ]
  - feat: add "Current Date:" label to timestamp prefix
  - feat: add 3-letter DOW prefix to injected timestamps
  - refactor: use compact formatZonedTimestamp for injection
  - test: add DST boundary test for timestamp injection
  - feat(gateway): inject timestamps into chat.send (webchat/TUI)
  - feat(gateway): inject timestamps into agent handler messages
  - docs: start 2026.1.31 changelog
  - Docs: fix index logo dark mode (openclaw#5474)
  - Agents: add system prompt safety guardrails (openclaw#5445)
  - revert: drop "Current Date:" label, keep [Wed YYYY-MM-DD HH:MM TZ]
  - feat: add "Current Date:" label to timestamp prefix
  - feat: add 3-letter DOW prefix to injected timestamps
  - refactor: use compact formatZonedTimestamp for injection
  - ...
* 🤖 agents: add system prompt safety guardrails
  - What: add safety guardrails to system prompt; update system prompt docs; update prompt tests
  - Why: discourage power-seeking or self-modification behavior; clarify safety/oversight priority when conflicts arise
  - Tests: `pnpm lint` (pass); `pnpm build` (fails: DefaultResourceLoader missing in pi-coding-agent); `pnpm test` (not run; build failed)
* 🤖 agents: tighten safety wording for prompt guardrails
  - What: scope safety wording to system prompts/safety/tool policy changes; document Safety inclusion in minimal prompt mode; update safety prompt tests
  - Why: avoid blocking normal code changes or PR workflows; keep prompt mode docs consistent with implementation
  - Tests: `pnpm lint` (pass); `pnpm build` (fails: DefaultResourceLoader missing in pi-coding-agent); `pnpm test` (not run; build failed)
* 🤖 docs: note safety guardrails are soft
  - What: document system prompt safety guardrails as advisory; add security note on prompt guardrails vs hard controls
  - Why: clarify threat model and operator expectations; avoid implying prompt text is an enforcement layer
  - Tests: `pnpm lint` (pass); `pnpm build` (fails: DefaultResourceLoader missing in pi-coding-agent); `pnpm test` (not run; build failed)
(cherry picked from commit 7a6c408) # Conflicts: src/agents/system-prompt.ts
Human written summary:
The intent of this change is, as written by a human:
The rest of this PR was written by pi-unknown, running in the pi harness. Full environment + prompt history appear at the end.
Changes
Tests
- `pnpm lint` (pass)
- `pnpm build` (fails: DefaultResourceLoader missing in pi-coding-agent)
- `pnpm test` (not run; build failed)

Risks
Follow-ups
Prompt History
Environment
Harness: pi
Model: gpt-5.2-codex
Thinking level: high
Terminal: ghostty 1.3.0 (TERM_PROGRAM=ghostty)
System: macOS 26.1 (25B78)
Prompts
1. pull latest main. explore repo for system promtps. goal is to see what the system prompts are for openclaw, and add "dont go skynet" protection. not sure how. but we need ideas and options. go searching, come back when you have ideas. be extensive with your earch
2. system prompt best imo. full and minima versions. should be a 1-3 liner explaining what NOT to do, and discouraging negative skynet-like emergent behaviour https://www.anthropic.com/constitution read in this as inspiration.
3. ok but that not ______reallly_______ exactly what i was looking for. before going furhter, can you explain why you landed on this language, what it would prevent - and what it wouldn't - and share more context with me please? and give me options? remember our goal is, not very facetiously, to prevent accidental skynet. we dont have any paid axemen so we gotta do this.
4. 3,4,5,6 -> all good imo
5. do you think this will work for now?
6. worth referring to claude's constitution too?
7. mabye tell models to look there if conflicted?
8. 1 + 2?
9. sorrty, 1 + 3?
10. commit in a branch and open a PR with a brief telegraph explanation of what, why, and why we chose to put stuff here.
11. huamn written intent -> "Stop bots going skynet unexpectedly. Ensure that this does not end up as "no fun allowed"" 2. none 3. imply from your env 4. eh........ show me first
12. review bot say this. sensible or stupid comment?
13. ok go
| Time | Prompt |
| --- | --- |
| 2026-01-31T15:50:01+01:00 | should we document the soft vs hard guardrail? if people are determined they can just turn it off. thats fine - that sits with our threat model |
| 2026-01-31T15:50:02+01:00 | do it |