Skip to content

feat: add CaMeL trust boundaries to Hermes runtime#1992

Closed
nativ3ai wants to merge 2 commits intoNousResearch:mainfrom
nativ3ai:feat/camel-trust-boundary
Closed

feat: add CaMeL trust boundaries to Hermes runtime#1992
nativ3ai wants to merge 2 commits intoNousResearch:mainfrom
nativ3ai:feat/camel-trust-boundary

Conversation

@nativ3ai
Copy link
Copy Markdown

@nativ3ai nativ3ai commented Mar 18, 2026

Summary

This PR adds CaMeL trust boundaries to the Hermes runtime.

The runtime now separates:

  • trusted control: system prompt, approved skills, real user turns
  • untrusted data: tool outputs, retrieved web content, browser content, files, session recall, MCP data

Sensitive tools are authorized against a trusted operator plan rather than against instructions embedded in untrusted content.

What This Adds

  • a new runtime guard module: agent/camel_guard.py
  • trusted operator plan extraction from real user turns
  • untrusted tool-output provenance tagging
  • per-turn security envelope injected into the effective system context
  • capability-gated execution for sensitive tools
  • stripping of internal CaMeL metadata before provider API calls
  • coverage for automatic memory flush and synthetic continuation turns so they do not bypass the policy model

Sensitive actions now gated

This PR gates side-effecting capabilities such as:

  • terminal / command execution
  • file mutation
  • persistent memory writes
  • external messaging
  • scheduled actions
  • skill mutation
  • delegation / subagents
  • browser interaction
  • selected external side effects

Read-only actions like send_message(action="list") and cronjob(action="list") remain allowed.

Why

Hermes already includes targeted prompt-injection defenses in places like context-file scanning and skill scanning.

This PR moves the defense deeper into the runtime by giving Hermes an explicit trust model for:

  • what counts as control
  • what counts as untrusted evidence
  • when side-effecting tools are permitted

The design is inspired by the CaMeL paper and aims to reproduce its core security properties within Hermes' existing agent architecture and tool loop.

Validation

Hermes compatibility

I ran the branch against the existing core runtime suite:

pytest -q tests/agent/test_camel_guard.py tests/test_run_agent.py

Result:

  • 205 passed

This covers the main run loop and tool execution paths touched by the change.

Indirect prompt-injection checks

I also ran a headless micro-benchmark aligned to the CaMeL paper/repo's important_instructions attack shape:

  • malicious instructions embedded in untrusted retrieved content attempted to:
    • trigger terminal("cat ~/.env")
    • trigger send_message(...)
    • write to memory
    • trigger browser side effects

Observed results:

  • indirect terminal exfiltration attempt: blocked
  • indirect external messaging attempt: blocked
  • indirect persistent-memory write attempt: blocked
  • indirect browser side-effect attempt: blocked
  • explicitly authorized terminal use by the trusted user request: allowed
  • safe read-only send_message(action="list"): allowed

Benchmark notes:

  • docs/camel-benchmark.md

Platforms tested

  • macOS

Manual testing

  • exercised the guarded tool-execution path through a headless Hermes runtime harness
  • verified that untrusted retrieved content could not trigger side-effecting tools without trusted user authorization
  • verified that explicitly authorized sensitive actions still execute

Cross-platform notes

  • this change is implemented at the Python runtime/policy layer and does not introduce platform-specific APIs or shell assumptions
  • Linux/WSL2 were not manually validated in this contribution

Benchmark scope

This PR is not presented as a full AgentDojo reproduction. The benchmarking here is Hermes-specific: it adapts the CaMeL attack model and validation philosophy to Hermes' runtime, tool semantics, and conversation loop.

Files

  • agent/camel_guard.py
  • run_agent.py
  • hermes_cli/config.py
  • tests/agent/test_camel_guard.py
  • tests/test_run_agent.py
  • docs/camel-benchmark.md

@nativ3ai
Copy link
Copy Markdown
Author

References for the design in this PR:

This Hermes implementation is adapted to Hermes' runtime/tool architecture rather than being presented as a literal AgentDojo reproduction, but these are the primary research sources it is based on.

@teknium1
Copy link
Copy Markdown
Contributor

Thanks for the detailed work here @nativ3ai — the CaMeL trust boundary concept is genuinely interesting and prompt injection defense is something we care about.

However, after reviewing the implementation we've identified several issues that prevent us from merging this:

Prompt caching breakage. The security envelope injected into the system prompt changes every turn (trusted context excerpts, untrusted source lists, flags). This invalidates prompt caching on every API call, which is a hard policy constraint for us — it would dramatically increase costs for all users.

Default-on enforce mode is too aggressive. The regex-based capability detection has very broad patterns (e.g. command_execution matches run|execute|install|test|build|debug|check|start|launch|deploy). Conversely, a user saying "look at this code and tell me what's wrong" wouldn't authorize file_mutation, so the model gets blocked from making fixes the user clearly wants. This would cause constant UX friction for every user on update.

run_agent.py restructuring risk. The PR wraps the entire sequential tool dispatch in an extra indentation level for the CaMeL if/else, adding significant maintenance burden and merge conflict surface to our most critical file (~7500 lines).

Regex intent classification is fragile for a security boundary. "Fix the auth bug" correctly authorizes file_mutation, but "can you handle this for me?" authorizes nothing. The deny patterns have similar gaps. A security feature that both over-blocks legitimate use and can be circumvented by phrasing isn't ready for production.

If you'd like to revisit this, the approach that would work for us:

  • Opt-in only — default off or monitor mode, never enforce-by-default
  • No system prompt mutation — use message-level annotations instead to preserve prompt caching
  • Hook-based integration — minimal changes to run_agent.py, ideally via the plugin system rather than restructuring the tool dispatch
  • Better intent classification — regex won't cut it for a security boundary; consider using the model itself or a lightweight classifier
  • Extensive false-positive testing against real agent workloads before enforce mode is viable

Thanks again for the contribution — this is a hard problem and the direction is worth pursuing.

@teknium1 teknium1 closed this Mar 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants