feat(security): LLM-based guardrail pre-screener for prompt injection (#1651) by bug-ops · Pull Request #1875 · bug-ops/zeph

bug-ops · 2026-03-15T20:13:49Z

Summary

Implements GuardrailFilter — an LLM-based classifier that screens user input and (optionally) tool output for prompt injection, jailbreaks, and manipulation attempts before they enter the main agent context.

Dedicated leaf LLM provider (ollama/claude/openai/compatible/gemini); Orchestrator and Router rejected at construction
Hardcoded const system prompt — not configurable, not interpolatable
fail_strategy = closed by default: timeout/LLM error blocks rather than allows
scan_tool_output = false by default: indirect injection scanning is opt-in
Gated behind guardrail feature flag; included in full bundle

Key design notes

Response parsing: strict SAFE/UNSAFE: prefix matching. SAFE requires byte[4] to be EOF or ASCII whitespace — prevents false-safe on "SAFELY...", "SAFEGUARD...". Unrecognized responses default to Flagged (defense in depth).

Truncation: max_input_chars counts Unicode scalar values (chars). char_indices().nth() finds the exact UTF-8 byte boundary. Default: 4096 chars.

Error handling: raw LLM error strings are not forwarded to the user channel — generic messages used; full errors logged via tracing::warn.

Empty input: check("") / check(" ") return Safe without an LLM call.

Integration points delivered

Point	Detail
Config	`[security.guardrail]` in `default.toml` + `GuardrailConfig` in `SecurityConfig`
CLI	`--guardrail` boolean flag
TUI	`GRD:on` (green) / `GRD:warn` (yellow) status indicator
Slash command	`/guardrail` — shows enabled state, action, fail_strategy, timeout, scan_tool_output, stats
Init wizard	`step_security()` prompts for provider/model/action/timeout
Config migration	Generic `ConfigMigrator` diff covers `[security.guardrail]` via `default.toml` embed
ACP	`guardrail_provider` cloned per session in `SharedAgentDeps`

Tests

+837 tests (5032 → 5869), all passing. Coverage includes:

parse_response: safe/unsafe/unknown/empty/multibyte/SAFELY-prefix/SAFEGUARD-prefix
check(): MockProvider with recording, failing, delay — safe/unsafe/error/timeout paths
Closed and Open fail strategies for both timeout and LLM error
Input truncation with ASCII and multibyte (emoji) content
Empty and whitespace-only input early return (verified via recording mock)
GuardrailStats accumulation
Router/Orchestrator provider rejection at construction
Config defaults and serde roundtrip
Migration: pre-guardrail config gets [security.guardrail] commented section

Pre-merge checks

cargo +nightly fmt --check — clean
cargo clippy --workspace --features full -- -D warnings — clean (0 warnings)
cargo nextest run --workspace --features full --lib --bins — 5869 passed, 12 skipped, 0 failed

Closes #1651

…tion (#1651) Adds GuardrailFilter — an LLM-based classifier that screens user input and (optionally) tool output for prompt injection, jailbreaks, and manipulation attempts before they enter the main agent context. Key design decisions: - Dedicated leaf LLM provider per session (not the primary agent provider) - Hardcoded system prompt (not configurable — security boundary) - fail_strategy = closed by default: timeout/error blocks rather than allows - scan_tool_output = false by default: tool scanning is explicitly opt-in - Gated behind guardrail feature flag (included in full bundle) Parsing: strict SAFE/UNSAFE: prefix matching — "SAFE" requires EOF or ASCII whitespace at byte 4 to prevent false-safe on "SAFELY...", "SAFEGUARD...". Unrecognized responses are treated as Flagged (defense in depth). Empty/whitespace input returns Safe without an LLM call. Truncation: max_input_chars counts Unicode scalar values (chars), not bytes; char_indices().nth() finds the exact byte boundary. Integration points: [security.guardrail] config section, --guardrail CLI flag, GRD: TUI status indicator, /guardrail slash command (stats), --init wizard step, --migrate-config via default.toml embed, ACP session clone, CHANGELOG.md, docs/src/references.md. Fixes: IMPL-02 (SAFE prefix), IMPL-03 (byte vs char), SEC-LOW-02 (vault_key removed), SEC-LOW-03/REV-01 (raw error not forwarded to user channel), IMPL-07 (Debug impl), IMPL-08 (empty input guard), IMPL-05 (documented tool-output warn asymmetry).

Conflicts resolved: - CHANGELOG.md: keep both guardrail (#1651) and policy-enforcer (#1695) entries - Cargo.toml: full bundle includes both guardrail and policy-enforcer - crates/zeph-core/src/config/migrate.rs: keep both migration tests - src/init.rs: keep both guardrail and policy-enforcer config-write blocks

bug-ops added 2 commits March 15, 2026 21:07

Merge remote-tracking branch 'origin/main' into guardrail-llm-injection

28e2e0d

github-actions bot added documentation Improvements or additions to documentation rust Rust code changes core zeph-core crate dependencies Dependency updates enhancement New feature or request size/XL Extra large PR (500+ lines) labels Mar 15, 2026

bug-ops enabled auto-merge (squash) March 15, 2026 20:22

bug-ops added 2 commits March 15, 2026 21:24

fix(test): update config default snapshots with guardrail section

6d3c11e

bug-ops merged commit ca9b5ba into main Mar 15, 2026
24 checks passed

bug-ops deleted the guardrail-llm-injection branch March 15, 2026 20:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(security): LLM-based guardrail pre-screener for prompt injection (#1651)#1875

feat(security): LLM-based guardrail pre-screener for prompt injection (#1651)#1875
bug-ops merged 4 commits intomainfrom
guardrail-llm-injection

bug-ops commented Mar 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bug-ops commented Mar 15, 2026

Summary

Key design notes

Integration points delivered

Tests

Pre-merge checks

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant