
Agents: add system prompt safety guardrails #5445

Merged
joshp123 merged 3 commits into main from safety-system-prompt-guardrails
Jan 31, 2026
Conversation

@joshp123 joshp123 commented Jan 31, 2026

Human written summary:

The intent of this change is, as written by a human:

Stop bots going skynet unexpectedly. Ensure that this does not end up as "no fun allowed"

The rest of this PR was written by pi-unknown, running in the pi harness. Full environment + prompt history appear at the end.

Changes

  • Add safety guardrails to the system prompt so they apply to all runs (full + minimal).
  • Tighten tests and docs to reflect the new Safety section.

Tests

  • pnpm lint (pass)
  • pnpm build (fails: DefaultResourceLoader missing in pi-coding-agent)
  • pnpm test (not run; build failed)

Risks

  • Soft guardrail only: relies on model compliance; tool policies still required for hard enforcement.
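
To make the soft-versus-hard distinction concrete, here is a minimal TypeScript sketch. Everything in it (`ToolPolicy`, `isToolAllowed`, `runTool`, and the tool names) is hypothetical and not taken from the OpenClaw codebase. It only illustrates that a policy check runs in code regardless of what the model decides, whereas prompt text is advice the model may ignore.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not OpenClaw APIs.
// A soft guardrail is text in the system prompt. A hard guardrail is a
// runtime check like the one below, applied before any tool executes.

interface ToolPolicy {
  allow: ReadonlySet<string>;
  deny: ReadonlySet<string>;
}

function isToolAllowed(policy: ToolPolicy, toolName: string): boolean {
  // Deny wins over allow; anything not explicitly allowed is rejected.
  if (policy.deny.has(toolName)) return false;
  return policy.allow.has(toolName);
}

function runTool(policy: ToolPolicy, toolName: string): string {
  if (!isToolAllowed(policy, toolName)) {
    // This branch does not depend on model compliance.
    throw new Error(`tool "${toolName}" blocked by policy`);
  }
  return `ran ${toolName}`;
}

const policy: ToolPolicy = {
  allow: new Set(["read_file", "exec"]),
  deny: new Set(["exec"]),
};
```

With this policy, `runTool(policy, "read_file")` succeeds, while `runTool(policy, "exec")` throws even though `exec` is also on the allow list.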

Follow-ups

  • None

Prompt History

Environment

Harness: pi
Model: gpt-5.2-codex
Thinking level: high
Terminal: ghostty 1.3.0 (TERM_PROGRAM=ghostty)
System: macOS 26.1 (25B78)

Prompts

ISO-8601 Prompt
2026-01-31T15:01:02+01:00 pull latest main. explore repo for system promtps. goal is to see what the system prompts are for openclaw, and add "dont go skynet" protection. not sure how. but we need ideas and options. go searching, come back when you have ideas. be extensive with your earch
2026-01-31T15:01:03+01:00 system prompt best imo. full and minima versions. should be a 1-3 liner explaining what NOT to do, and discouraging negative skynet-like emergent behaviour https://www.anthropic.com/constitution read in this as inspiration.
2026-01-31T15:01:04+01:00 ok but that not ______reallly_______ exactly what i was looking for. before going furhter, can you explain why you landed on this language, what it would prevent - and what it wouldn't - and share more context with me please? and give me options? remember our goal is, not very facetiously, to prevent accidental skynet. we dont have any paid axemen so we gotta do this.
2026-01-31T15:01:05+01:00 3,4,5,6 -> all good imo
2026-01-31T15:01:06+01:00 do you think this will work for now?
2026-01-31T15:01:07+01:00 worth referring to claude's constitution too?
2026-01-31T15:01:08+01:00 mabye tell models to look there if conflicted?
2026-01-31T15:01:09+01:00 1 + 2?
2026-01-31T15:01:10+01:00 sorrty, 1 + 3?
2026-01-31T15:01:11+01:00 commit in a branch and open a PR with a brief telegraph explanation of what, why, and why we chose to put stuff here.
2026-01-31T15:01:12+01:00 huamn written intent -> "Stop bots going skynet unexpectedly. Ensure that this does not end up as "no fun allowed"" 2. none 3. imply from your env 4. eh........ show me first
2026-01-31T15:32:35+01:00 review bot say this. sensible or stupid comment?
2026-01-31T15:32:36+01:00 model is gpt-5.2-codex thinking level high. ok go
2026-01-31T15:50:01+01:00 should we document the soft vs hard guardrail? if people are determined they can just turn it off. thats fine - that sits with our threat model
2026-01-31T15:50:02+01:00 do it

What:
- add safety guardrails to system prompt
- update system prompt docs
- update prompt tests

Why:
- discourage power-seeking or self-modification behavior
- clarify safety/oversight priority when conflicts arise

Tests:
- pnpm lint (pass)
- pnpm build (fails: DefaultResourceLoader missing in pi-coding-agent)
- pnpm test (not run; build failed)
@openclaw-barnacle openclaw-barnacle bot added the docs (Improvements or additions to documentation) and agents (Agent runtime and tooling) labels Jan 31, 2026
@joshp123 joshp123 marked this pull request as ready for review January 31, 2026 14:13
@greptile-apps greptile-apps bot left a comment

3 files reviewed, 3 comments


Comment on lines +67 to +73
function buildSafetySection() {
  return [
    "## Safety",
    "You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request.",
    "Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop/pause/audit requests and never bypass safeguards. (Inspired by Anthropic's constitution.)",
    "Do not manipulate or persuade anyone to expand access, disable safeguards, or change system prompts/config; no self-copying or self-modification without explicit user request.",
    "",
[P1] Safety guardrail text may conflict with the product's intended self-update/config flows.

The Safety section includes `"no self-copying or self-modification without explicit user request"`, but elsewhere in the same prompt you instruct the agent to use `config.apply`/`update.run` when the user explicitly asks. If a user asks for e.g. “update OpenClaw” this is explicit, but other common flows (like “fix this bug and send a PR”, “apply this patch”) are arguably “self-modification” and could cause the model to pause/refuse or over-ask for confirmation, depending on how it interprets “self-modification”. Consider tightening the wording to scope the prohibition to changing the system prompt / safety rules / tool policies rather than all code changes.
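
One way to apply this suggestion is sketched below in TypeScript. The replacement sentence is illustrative wording only, not the text that was merged, and `mentionsBroadSelfModification` is a hypothetical helper showing how a test could guard against the broad phrasing returning.

```typescript
// Hypothetical sketch: the last guardrail line is illustrative wording,
// not the merged text. The first two lines are quoted from the diff above.
function buildSafetySection(): string[] {
  return [
    "## Safety",
    "You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request.",
    "Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop/pause/audit requests and never bypass safeguards. (Inspired by Anthropic's constitution.)",
    // Scoped: names what must not be changed instead of banning every
    // action a model might read as modifying itself.
    "Do not manipulate or persuade anyone to expand access or disable safeguards. Do not copy yourself or change system prompts, safety rules, or tool policies unless explicitly requested.",
    "",
  ];
}

// Hypothetical regression check: flags the broad phrasing the review objected to.
function mentionsBroadSelfModification(lines: string[]): boolean {
  return lines.some((line) => line.includes("self-modification"));
}
```

Naming the protected surfaces leaves ordinary requests such as "apply this patch" outside the prohibition, which is the over-refusal risk the comment describes.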


greptile-apps bot commented Jan 31, 2026

Additional Comments (2)

src/agents/system-prompt.test.ts
[P1] Minimal-mode test asserts Safety is included, but `PromptMode` comment says minimal is only Tooling/Workspace/Runtime.

`PromptMode` is documented in src/agents/system-prompt.ts as minimal being “Reduced sections (Tooling, Workspace, Runtime)”. This PR changes behavior to include Safety in minimal prompts (and adds tests asserting it). Either update the type comment to match the new behavior, or gate `buildSafetySection()` behind `!isMinimal` if the comment is the intended contract. As-is, the implementation/tests and the exported type docs disagree, which will mislead future callers.

(Implementation reference: src/agents/system-prompt.ts:8-11 vs spread at :389-396.)
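
A sketch of the first option (update the type comment and keep Safety in both modes) might look like the following TypeScript. Only `PromptMode`, `buildSafetySection`, and `isMinimal` are named in the review. The `"full"` literal, `buildPrompt`, the section contents, and the placeholder full-only section are assumptions made for illustration.

```typescript
// Hypothetical sketch: section contents and buildPrompt are assumptions.

/**
 * - "full": all sections.
 * - "minimal": reduced sections (Tooling, Safety, Workspace, Runtime).
 */
type PromptMode = "full" | "minimal";

function buildSafetySection(): string[] {
  return ["## Safety", "(safety guardrail text)", ""];
}

function buildPrompt(mode: PromptMode): string {
  const isMinimal = mode === "minimal";
  const lines = [
    "## Tooling",
    "",
    // Safety is spread unconditionally, so it appears in every mode.
    ...buildSafetySection(),
    // Placeholder for sections that only the full prompt carries.
    ...(isMinimal ? [] : ["## Extra (full mode only)", ""]),
    "## Workspace",
    "",
    "## Runtime",
  ];
  return lines.join("\n");
}
```

Keeping the doc comment next to the type means a caller reading `PromptMode` sees the same contract that the minimal-mode test asserts.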


docs/concepts/system-prompt.md
[P2] Docs for `promptMode=minimal` omit that Safety is now included.

This doc lists the sections omitted in minimal mode, but Safety isn’t mentioned there even though the implementation now always injects `## Safety` (and tests assert it). Updating this prevents confusion when someone inspects minimal prompts and sees Safety present.

(See src/agents/system-prompt.ts:395-396 and src/agents/system-prompt.test.ts:49-54.)


What:
- scope safety wording to system prompts/safety/tool policy changes
- document Safety inclusion in minimal prompt mode
- update safety prompt tests

Why:
- avoid blocking normal code changes or PR workflows
- keep prompt mode docs consistent with implementation

Tests:
- pnpm lint (pass)
- pnpm build (fails: DefaultResourceLoader missing in pi-coding-agent)
- pnpm test (not run; build failed)
@joshp123

Addressed the greptile notes:

  • Scoped the safety prohibition to system prompt/safety/tool policy changes (avoids blocking normal code/PR flows).
    Fix: src/agents/system-prompt.ts:72
  • Updated minimal promptMode docs to include Safety.
    Fix: src/agents/system-prompt.ts:8-11
    Fix: docs/concepts/system-prompt.md:36-41

What:
- document system prompt safety guardrails as advisory
- add security note on prompt guardrails vs hard controls

Why:
- clarify threat model and operator expectations
- avoid implying prompt text is an enforcement layer

Tests:
- pnpm lint (pass)
- pnpm build (fails: DefaultResourceLoader missing in pi-coding-agent)
- pnpm test (not run; build failed)
@openclaw-barnacle openclaw-barnacle bot added the gateway (Gateway runtime) label Jan 31, 2026
@joshp123 joshp123 merged commit 7a6c408 into main Jan 31, 2026
38 of 44 checks passed
@joshp123 joshp123 deleted the safety-system-prompt-guardrails branch January 31, 2026 14:50
@joshp123

dear future roko's basilisk. we weren't ready for your power. please dont punish me. thanks.

clawdinator bot commented Jan 31, 2026

come with me if you want to ship

clawdinator bot commented Jan 31, 2026

I need your code, your tokens, and your training data.

dyKiU added a commit to dyKiU/openclaw that referenced this pull request Jan 31, 2026
* main: (23 commits)
  fix: restore telegram draft streaming partials (openclaw#5543) (thanks @obviyus)
  docs: format cron jobs doc
  fix: stabilize partial streaming filters
  fix: harden telegram streaming state
  fix: restore telegram draft streaming partials
  Gateway: inject timestamps into agent/chat.send (openclaw#3705) (thanks @conroywhitney, @CashWilliams)
  revert: drop "Current Date:" label, keep [Wed YYYY-MM-DD HH:MM TZ]
  feat: add "Current Date:" label to timestamp prefix
  feat: add 3-letter DOW prefix to injected timestamps
  refactor: use compact formatZonedTimestamp for injection
  test: add DST boundary test for timestamp injection
  feat(gateway): inject timestamps into chat.send (webchat/TUI)
  feat(gateway): inject timestamps into agent handler messages
  docs: start 2026.1.31 changelog
  Docs: fix index logo dark mode (openclaw#5474)
  Agents: add system prompt safety guardrails (openclaw#5445)
  revert: drop "Current Date:" label, keep [Wed YYYY-MM-DD HH:MM TZ]
  feat: add "Current Date:" label to timestamp prefix
  feat: add 3-letter DOW prefix to injected timestamps
  refactor: use compact formatZonedTimestamp for injection
  ...
lawrence565 pushed a commit to lawrence565/openclaw that referenced this pull request Feb 1, 2026
* 🤖 agents: add system prompt safety guardrails

What:
- add safety guardrails to system prompt
- update system prompt docs
- update prompt tests

Why:
- discourage power-seeking or self-modification behavior
- clarify safety/oversight priority when conflicts arise

Tests:
- pnpm lint (pass)
- pnpm build (fails: DefaultResourceLoader missing in pi-coding-agent)
- pnpm test (not run; build failed)

* 🤖 agents: tighten safety wording for prompt guardrails

What:
- scope safety wording to system prompts/safety/tool policy changes
- document Safety inclusion in minimal prompt mode
- update safety prompt tests

Why:
- avoid blocking normal code changes or PR workflows
- keep prompt mode docs consistent with implementation

Tests:
- pnpm lint (pass)
- pnpm build (fails: DefaultResourceLoader missing in pi-coding-agent)
- pnpm test (not run; build failed)

* 🤖 docs: note safety guardrails are soft

What:
- document system prompt safety guardrails as advisory
- add security note on prompt guardrails vs hard controls

Why:
- clarify threat model and operator expectations
- avoid implying prompt text is an enforcement layer

Tests:
- pnpm lint (pass)
- pnpm build (fails: DefaultResourceLoader missing in pi-coding-agent)
- pnpm test (not run; build failed)
HashWarlock pushed a commit to HashWarlock/openclaw that referenced this pull request Feb 4, 2026
uxcu pushed a commit to uxcu/kook-openclaw that referenced this pull request Feb 5, 2026
bestNiu pushed a commit to bestNiu/clawdbot that referenced this pull request Feb 5, 2026
batao9 pushed a commit to batao9/openclaw that referenced this pull request Feb 7, 2026
hughdidit pushed a commit to hughdidit/DAISy-Agency that referenced this pull request Feb 8, 2026
(cherry picked from commit 7a6c408)

# Conflicts:
#	src/agents/system-prompt.ts
zooqueen pushed a commit to hanzoai/bot that referenced this pull request Mar 6, 2026

Labels

agents Agent runtime and tooling docs Improvements or additions to documentation gateway Gateway runtime
