feat(security): LLM-based guardrail pre-screener for prompt injection (#1651)#1875
Merged
feat(security): LLM-based guardrail pre-screener for prompt injection (#1651)#1875
Conversation
…tion (#1651) Adds GuardrailFilter — an LLM-based classifier that screens user input and (optionally) tool output for prompt injection, jailbreaks, and manipulation attempts before they enter the main agent context. Key design decisions: - Dedicated leaf LLM provider per session (not the primary agent provider) - Hardcoded system prompt (not configurable — security boundary) - fail_strategy = closed by default: timeout/error blocks rather than allows - scan_tool_output = false by default: tool scanning is explicitly opt-in - Gated behind guardrail feature flag (included in full bundle) Parsing: strict SAFE/UNSAFE: prefix matching — "SAFE" requires EOF or ASCII whitespace at byte 4 to prevent false-safe on "SAFELY...", "SAFEGUARD...". Unrecognized responses are treated as Flagged (defense in depth). Empty/whitespace input returns Safe without an LLM call. Truncation: max_input_chars counts Unicode scalar values (chars), not bytes; char_indices().nth() finds the exact byte boundary. Integration points: [security.guardrail] config section, --guardrail CLI flag, GRD: TUI status indicator, /guardrail slash command (stats), --init wizard step, --migrate-config via default.toml embed, ACP session clone, CHANGELOG.md, docs/src/references.md. Fixes: IMPL-02 (SAFE prefix), IMPL-03 (byte vs char), SEC-LOW-02 (vault_key removed), SEC-LOW-03/REV-01 (raw error not forwarded to user channel), IMPL-07 (Debug impl), IMPL-08 (empty input guard), IMPL-05 (documented tool-output warn asymmetry).
Conflicts resolved: - CHANGELOG.md: keep both guardrail (#1651) and policy-enforcer (#1695) entries - Cargo.toml: full bundle includes both guardrail and policy-enforcer - crates/zeph-core/src/config/migrate.rs: keep both migration tests - src/init.rs: keep both guardrail and policy-enforcer config-write blocks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements
GuardrailFilter— an LLM-based classifier that screens user input and (optionally) tool output for prompt injection, jailbreaks, and manipulation attempts before they enter the main agent context.fail_strategy = closedby default: timeout/LLM error blocks rather than allowsscan_tool_output = falseby default: indirect injection scanning is opt-inguardrailfeature flag; included infullbundleKey design notes
Response parsing: strict
SAFE/UNSAFE:prefix matching.SAFErequires byte[4] to be EOF or ASCII whitespace — prevents false-safe on "SAFELY...", "SAFEGUARD...". Unrecognized responses default toFlagged(defense in depth).Truncation:
max_input_charscounts Unicode scalar values (chars).char_indices().nth()finds the exact UTF-8 byte boundary. Default: 4096 chars.Error handling: raw LLM error strings are not forwarded to the user channel — generic messages used; full errors logged via
tracing::warn.Empty input:
check("")/check(" ")returnSafewithout an LLM call.Integration points delivered
[security.guardrail]indefault.toml+GuardrailConfiginSecurityConfig--guardrailboolean flagGRD:on(green) /GRD:warn(yellow) status indicator/guardrail— shows enabled state, action, fail_strategy, timeout, scan_tool_output, statsstep_security()prompts for provider/model/action/timeoutConfigMigratordiff covers[security.guardrail]viadefault.tomlembedguardrail_providercloned per session inSharedAgentDepsTests
+837 tests (5032 → 5869), all passing. Coverage includes:
parse_response: safe/unsafe/unknown/empty/multibyte/SAFELY-prefix/SAFEGUARD-prefixcheck(): MockProvider with recording, failing, delay — safe/unsafe/error/timeout pathsGuardrailStatsaccumulation[security.guardrail]commented sectionPre-merge checks
cargo +nightly fmt --check— cleancargo clippy --workspace --features full -- -D warnings— clean (0 warnings)cargo nextest run --workspace --features full --lib --bins— 5869 passed, 12 skipped, 0 failedCloses #1651