Skip to content

feat(security): LLM-based guardrail pre-screener for prompt injection (#1651)#1875

Merged
bug-ops merged 4 commits intomainfrom
guardrail-llm-injection
Mar 15, 2026
Merged

feat(security): LLM-based guardrail pre-screener for prompt injection (#1651)#1875
bug-ops merged 4 commits intomainfrom
guardrail-llm-injection

Conversation

@bug-ops
Copy link
Copy Markdown
Owner

@bug-ops bug-ops commented Mar 15, 2026

Summary

Implements GuardrailFilter — an LLM-based classifier that screens user input and (optionally) tool output for prompt injection, jailbreaks, and manipulation attempts before they enter the main agent context.

  • Dedicated leaf LLM provider (ollama/claude/openai/compatible/gemini); Orchestrator and Router rejected at construction
  • Hardcoded const system prompt — not configurable, not interpolatable
  • fail_strategy = closed by default: timeout/LLM error blocks rather than allows
  • scan_tool_output = false by default: indirect injection scanning is opt-in
  • Gated behind guardrail feature flag; included in full bundle

Key design notes

Response parsing: strict SAFE/UNSAFE: prefix matching. SAFE requires byte[4] to be EOF or ASCII whitespace — prevents false-safe on "SAFELY...", "SAFEGUARD...". Unrecognized responses default to Flagged (defense in depth).

Truncation: max_input_chars counts Unicode scalar values (chars). char_indices().nth() finds the exact UTF-8 byte boundary. Default: 4096 chars.

Error handling: raw LLM error strings are not forwarded to the user channel — generic messages used; full errors logged via tracing::warn.

Empty input: check("") / check(" ") return Safe without an LLM call.

Integration points delivered

Point Detail
Config [security.guardrail] in default.toml + GuardrailConfig in SecurityConfig
CLI --guardrail boolean flag
TUI GRD:on (green) / GRD:warn (yellow) status indicator
Slash command /guardrail — shows enabled state, action, fail_strategy, timeout, scan_tool_output, stats
Init wizard step_security() prompts for provider/model/action/timeout
Config migration Generic ConfigMigrator diff covers [security.guardrail] via default.toml embed
ACP guardrail_provider cloned per session in SharedAgentDeps

Tests

+837 tests (5032 → 5869), all passing. Coverage includes:

  • parse_response: safe/unsafe/unknown/empty/multibyte/SAFELY-prefix/SAFEGUARD-prefix
  • check(): MockProvider with recording, failing, delay — safe/unsafe/error/timeout paths
  • Closed and Open fail strategies for both timeout and LLM error
  • Input truncation with ASCII and multibyte (emoji) content
  • Empty and whitespace-only input early return (verified via recording mock)
  • GuardrailStats accumulation
  • Router/Orchestrator provider rejection at construction
  • Config defaults and serde roundtrip
  • Migration: pre-guardrail config gets [security.guardrail] commented section

Pre-merge checks

  • cargo +nightly fmt --check — clean
  • cargo clippy --workspace --features full -- -D warnings — clean (0 warnings)
  • cargo nextest run --workspace --features full --lib --bins — 5869 passed, 12 skipped, 0 failed

Closes #1651

bug-ops added 2 commits March 15, 2026 21:07
…tion (#1651)

Adds GuardrailFilter — an LLM-based classifier that screens user input
and (optionally) tool output for prompt injection, jailbreaks, and
manipulation attempts before they enter the main agent context.

Key design decisions:
- Dedicated leaf LLM provider per session (not the primary agent provider)
- Hardcoded system prompt (not configurable — security boundary)
- fail_strategy = closed by default: timeout/error blocks rather than allows
- scan_tool_output = false by default: tool scanning is explicitly opt-in
- Gated behind guardrail feature flag (included in full bundle)

Parsing: strict SAFE/UNSAFE: prefix matching — "SAFE" requires EOF or
ASCII whitespace at byte 4 to prevent false-safe on "SAFELY...",
"SAFEGUARD...". Unrecognized responses are treated as Flagged (defense
in depth). Empty/whitespace input returns Safe without an LLM call.

Truncation: max_input_chars counts Unicode scalar values (chars), not
bytes; char_indices().nth() finds the exact byte boundary.

Integration points: [security.guardrail] config section, --guardrail
CLI flag, GRD: TUI status indicator, /guardrail slash command (stats),
--init wizard step, --migrate-config via default.toml embed, ACP
session clone, CHANGELOG.md, docs/src/references.md.

Fixes: IMPL-02 (SAFE prefix), IMPL-03 (byte vs char), SEC-LOW-02
(vault_key removed), SEC-LOW-03/REV-01 (raw error not forwarded to
user channel), IMPL-07 (Debug impl), IMPL-08 (empty input guard),
IMPL-05 (documented tool-output warn asymmetry).
@github-actions github-actions bot added documentation Improvements or additions to documentation rust Rust code changes core zeph-core crate dependencies Dependency updates enhancement New feature or request size/XL Extra large PR (500+ lines) labels Mar 15, 2026
@bug-ops bug-ops enabled auto-merge (squash) March 15, 2026 20:22
bug-ops added 2 commits March 15, 2026 21:24
Conflicts resolved:
- CHANGELOG.md: keep both guardrail (#1651) and policy-enforcer (#1695) entries
- Cargo.toml: full bundle includes both guardrail and policy-enforcer
- crates/zeph-core/src/config/migrate.rs: keep both migration tests
- src/init.rs: keep both guardrail and policy-enforcer config-write blocks
@bug-ops bug-ops merged commit ca9b5ba into main Mar 15, 2026
24 checks passed
@bug-ops bug-ops deleted the guardrail-llm-injection branch March 15, 2026 20:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core zeph-core crate dependencies Dependency updates documentation Improvements or additions to documentation enhancement New feature or request rust Rust code changes size/XL Extra large PR (500+ lines)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

research(security): guardrail LLM for prompt injection pre-screening (PromptArmor pattern)

1 participant