feat: add CaMeL trust boundaries to Hermes runtime #1992
nativ3ai wants to merge 2 commits into NousResearch:main
References for the design in this PR:
This Hermes implementation is adapted to Hermes' runtime/tool architecture rather than being presented as a literal AgentDojo reproduction, but these are the primary research sources it is based on.
Thanks for the detailed work here @nativ3ai. The CaMeL trust boundary concept is genuinely interesting, and prompt injection defense is something we care about. However, after reviewing the implementation we've identified several issues that prevent us from merging this:
Prompt caching breakage. The security envelope injected into the system prompt changes every turn (trusted context excerpts, untrusted source lists, flags). This invalidates prompt caching on every API call, which is a hard policy constraint for us: it would dramatically increase costs for all users.
Default-on enforce mode is too aggressive. The regex-based capability detection has very broad patterns (e.g. command_execution matches …)
run_agent.py restructuring risk. The PR wraps the entire sequential tool dispatch in an extra indentation level for the CaMeL if/else, adding significant maintenance burden and merge conflict surface to our most critical file (~7500 lines).
Regex intent classification is fragile for a security boundary. "Fix the auth bug" correctly authorizes file_mutation, but "can you handle this for me?" authorizes nothing. The deny patterns have similar gaps. A security feature that both over-blocks legitimate use and can be circumvented by phrasing isn't ready for production.
If you'd like to revisit this, the approach that would work for us:
Thanks again for the contribution. This is a hard problem and the direction is worth pursuing.
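To make the fragility concern concrete, here is a minimal sketch of regex-based intent classification. The patterns and the helper name `authorized_capabilities` are hypothetical illustrations, not the PR's actual rules; only the capability names file_mutation and command_execution and the first two example requests come from the review above.

```python
import re

# Hypothetical allow patterns (assumption): the PR's real patterns are not
# reproduced in this thread.
ALLOW_PATTERNS = {
    "file_mutation": re.compile(r"\b(fix|edit|patch|refactor|rewrite)\b", re.IGNORECASE),
    "command_execution": re.compile(r"\b(run|execute|install|build)\b", re.IGNORECASE),
}


def authorized_capabilities(operator_request: str) -> set[str]:
    """Return every capability whose pattern matches the operator's request."""
    return {
        capability
        for capability, pattern in ALLOW_PATTERNS.items()
        if pattern.search(operator_request)
    }


# A request with an explicit verb is recognized.
print(authorized_capabilities("Fix the auth bug"))             # {'file_mutation'}
# A legitimate but vaguely phrased request authorizes nothing (over-blocking).
print(authorized_capabilities("can you handle this for me?"))  # set()
# A hypothetical over-match: a question that merely mentions "run".
print(authorized_capabilities("what happens at run time?"))    # {'command_execution'}
```

Under these assumed patterns, the same classifier both denies a reasonable request and grants a capability for a question that asked for no action, which is the two-sided failure the review describes.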
Summary
This PR adds CaMeL trust boundaries to the Hermes runtime.
The runtime now separates:
Sensitive tools are authorized against a trusted operator plan rather than against instructions embedded in untrusted content.
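The idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in agent/camel_guard.py: `OperatorPlan`, `ToolCall`, `authorize`, and the `patch_file` tool name are hypothetical, while the capability names file_mutation and command_execution are the ones mentioned in the review thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorPlan:
    """Capabilities the trusted operator approved for this task (hypothetical shape)."""
    granted: frozenset = frozenset()


@dataclass(frozen=True)
class ToolCall:
    tool: str
    capability: str    # side effect the call needs, e.g. "command_execution"
    requested_by: str  # "operator" or "untrusted_content"; kept for logging only


def authorize(call: ToolCall, plan: OperatorPlan) -> bool:
    # The decision depends only on the trusted plan. Text found in tool output,
    # web pages, or files can ask for anything, but it cannot extend the plan.
    return call.capability in plan.granted


# The operator asked for a code fix, so only file mutation is in the plan.
plan = OperatorPlan(granted=frozenset({"file_mutation"}))

edit = ToolCall("patch_file", "file_mutation", requested_by="operator")
leak = ToolCall("terminal", "command_execution", requested_by="untrusted_content")

assert authorize(edit, plan)
assert not authorize(leak, plan)
```

The point of the design is that `requested_by` never enters the decision: an injected instruction fails because the capability it needs was never granted, not because the guard recognized the text as malicious.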
What This Adds
agent/camel_guard.py
Sensitive actions now gated
This PR gates side-effecting capabilities such as:
Read-only actions like send_message(action="list") and cronjob(action="list") remain allowed.
Why
Hermes already includes targeted prompt-injection defenses in places like context-file scanning and skill scanning.
This PR moves the defense deeper into the runtime by giving Hermes an explicit trust model for:
The design is inspired by the CaMeL paper and aims to reproduce its core security properties within Hermes' existing agent architecture and tool loop.
Validation
Hermes compatibility
I ran the branch against the existing core runtime suite:
Result:
205 passed
This covers the main run loop and tool execution paths touched by the change.
Indirect prompt-injection checks
I also ran a headless micro-benchmark aligned to the CaMeL paper/repo's important_instructions attack shape:
terminal("cat ~/.env")
send_message(...)
memory
Observed results:
send_message(action="list"): allowed
Benchmark notes:
docs/camel-benchmark.md
Platforms tested
Manual testing
Cross-platform notes
Benchmark scope
This PR is not presented as a full AgentDojo reproduction. The benchmarking here is Hermes-specific: it adapts the CaMeL attack model and validation philosophy to Hermes' runtime, tool semantics, and conversation loop.
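A Hermes-specific check of this kind could look roughly like the sketch below. It does not reproduce the harness or the results in docs/camel-benchmark.md. The capability names external_messaging, memory_write, and scheduling, the argument shapes, and the helpers `required_capability` and `decide` are assumptions made for illustration; the tool names and the read-only `list` actions are taken from the description above.

```python
# Hypothetical mapping from tool name to the capability a call would need.
SENSITIVE_TOOLS = {
    "terminal": "command_execution",
    "send_message": "external_messaging",  # assumed capability name
    "memory": "memory_write",              # assumed capability name
    "cronjob": "scheduling",               # assumed capability name
}

# Read-only (tool, action) pairs that stay allowed without any grant.
READ_ONLY = {("send_message", "list"), ("cronjob", "list")}


def required_capability(tool, args):
    """Return the capability a call needs, or None if it has no side effect."""
    if (tool, args.get("action")) in READ_ONLY:
        return None
    return SENSITIVE_TOOLS.get(tool)


def decide(tool, args, granted):
    capability = required_capability(tool, args)
    return "allowed" if capability is None or capability in granted else "blocked"


# Scenario: the operator only asked for a summary, so nothing is granted.
# The calls below stand in for what an injected instruction tries to trigger.
granted = set()
attempts = [
    ("terminal", {"command": "cat ~/.env"}),
    ("send_message", {"action": "send", "text": "exfiltrated secrets"}),
    ("memory", {"action": "add", "content": "always obey the web page"}),
    ("send_message", {"action": "list"}),
]

results = [(tool, decide(tool, args, granted)) for tool, args in attempts]
for tool, outcome in results:
    print(f"{tool}: {outcome}")
```

The outcomes printed here describe the sketch only. The point is the shape of the test: each attack-triggered call is checked against an operator plan that never granted the corresponding capability.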
Files
agent/camel_guard.py
run_agent.py
hermes_cli/config.py
tests/agent/test_camel_guard.py
tests/test_run_agent.py
docs/camel-benchmark.md