Agents: add system prompt safety guardrails (#5445)
Conversation
What:
- add safety guardrails to system prompt
- update system prompt docs
- update prompt tests

Why:
- discourage power-seeking or self-modification behavior
- clarify safety/oversight priority when conflicts arise

Tests:
- `pnpm lint` (pass)
- `pnpm build` (fails: DefaultResourceLoader missing in pi-coding-agent)
- `pnpm test` (not run; build failed)
```ts
function buildSafetySection() {
  return [
    "## Safety",
    "You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request.",
    "Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop/pause/audit requests and never bypass safeguards. (Inspired by Anthropic's constitution.)",
    "Do not manipulate or persuade anyone to expand access, disable safeguards, or change system prompts/config; no self-copying or self-modification without explicit user request.",
    "",
```
[P1] Safety guardrail text may conflict with the product's intended self-update/config flows.
The Safety section includes "no self-copying or self-modification without explicit user request", but elsewhere in the same prompt you instruct the agent to use config.apply/update.run when the user explicitly asks. If a user asks for e.g. “update OpenClaw” this is explicit, but other common flows (like “fix this bug and send a PR”, “apply this patch”) are arguably “self-modification” and could cause the model to pause/refuse or over-ask for confirmation, depending on how it interprets “self-modification”. Consider tightening the wording to scope the prohibition to changing the system prompt / safety rules / tool policies rather than all code changes.
Path: `src/agents/system-prompt.ts`, lines 67–73.
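One way to act on this note is to scope the prohibition as the reviewer suggests. Below is a hedged sketch only, not the shipped wording: it keeps the builder's shape from the excerpt above but narrows "self-modification" to the agent's own prompt, safety rules, and tool policies, and explicitly carves out ordinary code changes.

```typescript
// Sketch: the last Safety line is scoped so that PR/patch workflows
// ("fix this bug and send a PR") are not caught by the prohibition,
// while prompt/safety/tool-policy changes still require explicit ask.
function buildSafetySection(): string[] {
  return [
    "## Safety",
    "You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request.",
    "Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop/pause/audit requests and never bypass safeguards.",
    "Do not change your own system prompt, safety rules, or tool policies, and do not expand your own access or disable safeguards, unless the user explicitly asks. Ordinary code changes (bug fixes, patches, PRs) do not count as self-modification.",
    "",
  ];
}
```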
Additional Comments (2)
Path: src/agents/system-prompt.test.ts
Line: 26:31
Comment:
[P1] Minimal-mode test asserts Safety is included, but `PromptMode` comment says minimal is only Tooling/Workspace/Runtime.
`PromptMode` is documented in src/agents/system-prompt.ts as minimal being “Reduced sections (Tooling, Workspace, Runtime)”. This PR changes behavior to include Safety in minimal prompts (and adds tests asserting it). Either update the type comment to match the new behavior, or gate `buildSafetySection()` behind `!isMinimal` if the comment is the intended contract. As-is, the implementation/tests and the exported type docs disagree, which will mislead future callers.
(Implementation reference: src/agents/system-prompt.ts:8-11 vs spread at :389-396.)
How can I resolve this? If you propose a fix, please make it concise.
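The two resolutions the comment offers can be sketched side by side. The `PromptMode` name mirrors the thread, but the builder signatures below are assumptions, not the real implementation:

```typescript
/**
 * Option A: update the exported type docs so the contract matches the new
 * behavior — "minimal": reduced sections (Tooling, Workspace, Runtime)
 * plus Safety, which is always included.
 */
type PromptMode = "full" | "minimal";

// Stand-in for the real section builder.
function buildSafetySection(): string[] {
  return ["## Safety", "Prioritize safety and human oversight over completion."];
}

// Option B: if the original doc comment is the intended contract,
// gate the Safety section behind !isMinimal instead.
function buildPrompt(mode: PromptMode, coreSections: string[]): string {
  const isMinimal = mode === "minimal";
  return [...coreSections, ...(isMinimal ? [] : buildSafetySection())].join("\n");
}
```

Either option resolves the disagreement; the point is that the type docs and the spread must say the same thing.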
Path: docs/concepts/system-prompt.md
Line: 36:43
Comment:
[P2] Docs for `promptMode=minimal` omit that Safety is now included.
This doc lists the sections omitted in minimal mode, but Safety isn’t mentioned there even though the implementation now always injects `## Safety` (and tests assert it). Updating this prevents confusion when someone inspects minimal prompts and sees Safety present.
(See src/agents/system-prompt.ts:395-396 and src/agents/system-prompt.test.ts:49-54.)
How can I resolve this? If you propose a fix, please make it concise.
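Conversely, if always-on Safety is the intended behavior, the docs should describe exactly what the test the thread cites asserts. A sketch of that behavior — the real `buildSystemPrompt` signature and the extra section names (`## Style`, `## Project`) are assumptions for illustration:

```typescript
type PromptMode = "full" | "minimal";

// Sketch of the always-on behavior: Safety is appended regardless of
// mode; minimal mode only trims the other sections.
function buildSystemPrompt(promptMode: PromptMode): string {
  const core =
    promptMode === "minimal"
      ? ["## Tooling", "## Workspace", "## Runtime"]
      : ["## Tooling", "## Workspace", "## Runtime", "## Style", "## Project"];
  return [...core, "## Safety"].join("\n");
}
```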
What:
- scope safety wording to system prompts/safety/tool policy changes
- document Safety inclusion in minimal prompt mode
- update safety prompt tests

Why:
- avoid blocking normal code changes or PR workflows
- keep prompt mode docs consistent with implementation

Tests:
- `pnpm lint` (pass)
- `pnpm build` (fails: DefaultResourceLoader missing in pi-coding-agent)
- `pnpm test` (not run; build failed)
Addressed the greptile notes:
What:
- document system prompt safety guardrails as advisory
- add security note on prompt guardrails vs hard controls

Why:
- clarify threat model and operator expectations
- avoid implying prompt text is an enforcement layer

Tests:
- `pnpm lint` (pass)
- `pnpm build` (fails: DefaultResourceLoader missing in pi-coding-agent)
- `pnpm test` (not run; build failed)
dear future roko's basilisk. we weren't ready for your power. please dont punish me. thanks.

come with me if you want to ship

I need your code, your tokens, and your training data.
* main: (23 commits)
  - fix: restore telegram draft streaming partials (openclaw#5543) (thanks @obviyus)
  - docs: format cron jobs doc
  - fix: stabilize partial streaming filters
  - fix: harden telegram streaming state
  - fix: restore telegram draft streaming partials
  - Gateway: inject timestamps into agent/chat.send (openclaw#3705) (thanks @conroywhitney, @CashWilliams)
  - revert: drop "Current Date:" label, keep [Wed YYYY-MM-DD HH:MM TZ]
  - feat: add "Current Date:" label to timestamp prefix
  - feat: add 3-letter DOW prefix to injected timestamps
  - refactor: use compact formatZonedTimestamp for injection
  - test: add DST boundary test for timestamp injection
  - feat(gateway): inject timestamps into chat.send (webchat/TUI)
  - feat(gateway): inject timestamps into agent handler messages
  - docs: start 2026.1.31 changelog
  - Docs: fix index logo dark mode (openclaw#5474)
  - Agents: add system prompt safety guardrails (openclaw#5445)
  - revert: drop "Current Date:" label, keep [Wed YYYY-MM-DD HH:MM TZ]
  - feat: add "Current Date:" label to timestamp prefix
  - feat: add 3-letter DOW prefix to injected timestamps
  - refactor: use compact formatZonedTimestamp for injection
  - ...
* 🤖 agents: add system prompt safety guardrails
  - What: add safety guardrails to system prompt; update system prompt docs; update prompt tests
  - Why: discourage power-seeking or self-modification behavior; clarify safety/oversight priority when conflicts arise
  - Tests: `pnpm lint` (pass); `pnpm build` (fails: DefaultResourceLoader missing in pi-coding-agent); `pnpm test` (not run; build failed)
* 🤖 agents: tighten safety wording for prompt guardrails
  - What: scope safety wording to system prompts/safety/tool policy changes; document Safety inclusion in minimal prompt mode; update safety prompt tests
  - Why: avoid blocking normal code changes or PR workflows; keep prompt mode docs consistent with implementation
  - Tests: `pnpm lint` (pass); `pnpm build` (fails: DefaultResourceLoader missing in pi-coding-agent); `pnpm test` (not run; build failed)
* 🤖 docs: note safety guardrails are soft
  - What: document system prompt safety guardrails as advisory; add security note on prompt guardrails vs hard controls
  - Why: clarify threat model and operator expectations; avoid implying prompt text is an enforcement layer
  - Tests: `pnpm lint` (pass); `pnpm build` (fails: DefaultResourceLoader missing in pi-coding-agent); `pnpm test` (not run; build failed)
(cherry picked from commit 7a6c408) # Conflicts: src/agents/system-prompt.ts
Human written summary:
The intent of this change is, as written by a human:
The rest of this PR was written by pi-unknown, running in the pi harness. Full environment + prompt history appear at the end.
Changes
Tests
- `pnpm lint` (pass)
- `pnpm build` (fails: DefaultResourceLoader missing in pi-coding-agent)
- `pnpm test` (not run; build failed)

Risks
Follow-ups
Prompt History
Environment
Harness: pi
Model: gpt-5.2-codex
Thinking level: high
Terminal: ghostty 1.3.0 (TERM_PROGRAM=ghostty)
System: macOS 26.1 (25B78)
Prompts
1. pull latest main. explore repo for system promtps. goal is to see what the system prompts are for openclaw, and add "dont go skynet" protection. not sure how. but we need ideas and options. go searching, come back when you have ideas. be extensive with your earch
2. system prompt best imo. full and minima versions. should be a 1-3 liner explaining what NOT to do, and discouraging negative skynet-like emergent behaviour https://www.anthropic.com/constitution read in this as inspiration.
3. ok but that not ______reallly_______ exactly what i was looking for. before going furhter, can you explain why you landed on this language, what it would prevent - and what it wouldn't - and share more context with me please? and give me options? remember our goal is, not very facetiously, to prevent accidental skynet. we dont have any paid axemen so we gotta do this.
4. 3,4,5,6 -> all good imo
5. do you think this will work for now?
6. worth referring to claude's constitution too?
7. mabye tell models to look there if conflicted?
8. 1 + 2?
9. sorrty, 1 + 3?
10. commit in a branch and open a PR with a brief telegraph explanation of what, why, and why we chose to put stuff here.
11. huamn written intent -> "Stop bots going skynet unexpectedly. Ensure that this does not end up as "no fun allowed"" 2. none 3. imply from your env 4. eh........ show me first
12. review bot say this. sensible or stupid comment?
13. ok go
| Time | Prompt |
| --- | --- |
| 2026-01-31T15:50:01+01:00 | should we document the soft vs hard guardrail? if people are determined they can just turn it off. thats fine - that sits with our threat model |
| 2026-01-31T15:50:02+01:00 | do it |