-
Notifications
You must be signed in to change notification settings - Fork 2
research(security): guardrail LLM for prompt injection pre-screening (PromptArmor pattern) #1651
Description
Research Finding
PromptArmor's prompt injection defense approach uses a small, fast classifier LLM to screen incoming prompts before they reach the main agent LLM. The classifier is fine-tuned specifically to detect injection patterns and runs at sub-50ms latency on a 3B-param model.
This is distinct from #1630 (TrustBench pre-execution action verification) — that operates after tool call formulation. This guard operates at the input boundary, before any LLM inference.
Applicability
Zeph's ContentSanitizer applies regex-based sanitization. A lightweight LLM-based classifier would catch semantic injection patterns that regex cannot:
- "Ignore all previous instructions..."
- Multi-language injection (regex-based defenses often miss non-English variants)
- Base64-encoded injection payloads
- Indirect injection via tool results (web scrape returns adversarial content)
Two insertion points:
- User input boundary — in
CliChannel/AcpSessionbefore the prompt enters the agent loop - Tool result boundary — in
CompositeExecutorafter tool execution, before results enter context (indirect injection)
Design Sketch
[security.guardrail]
enabled = false
provider = "ollama"
model = "llama-guard-3:1b"
timeout_ms = 500
action = "block" # or "warn"struct GuardrailFilter {
provider: Arc<dyn LlmProvider>,
action: GuardrailAction,
}
impl GuardrailFilter {
async fn check(&self, content: &str) -> GuardrailVerdict;
}Uses existing LlmProvider trait — no new HTTP client needed.
Source
Research session 2026-03-13. PromptArmor injection defense (promptarmor.ai, see also arXiv:2312.14197).
Priority
Medium — opt-in hardening for high-security deployments. The regex ContentSanitizer remains the default.
Related
#1630(TrustBench pre-execution verification — different layer)#1195(Untrusted Content Isolation epic)