research(security): guardrail LLM for prompt injection pre-screening (PromptArmor pattern)

## Research Finding

PromptArmor's prompt injection defense approach uses a small, fast classifier LLM to screen incoming prompts *before* they reach the main agent LLM. The classifier is fine-tuned specifically to detect injection patterns and runs at sub-50ms latency on a 3B-param model.

This is distinct from `#1630` (TrustBench pre-execution action verification) — that operates after tool call formulation. This guard operates at the input boundary, before any LLM inference.

## Applicability

Zeph's `ContentSanitizer` applies regex-based sanitization. A lightweight LLM-based classifier would catch semantic injection patterns that regex cannot:
- "Ignore all previous instructions..."
- Multi-language injection (regex-based defenses often miss non-English variants)
- Base64-encoded injection payloads
- Indirect injection via tool results (web scrape returns adversarial content)

Two insertion points:
1. **User input boundary** — in `CliChannel`/`AcpSession` before the prompt enters the agent loop
2. **Tool result boundary** — in `CompositeExecutor` after tool execution, before results enter context (indirect injection)

## Design Sketch

```toml
[security.guardrail]
enabled = false
provider = "ollama"
model = "llama-guard-3:1b"
timeout_ms = 500
action = "block"  # or "warn"
```

```rust
struct GuardrailFilter {
    provider: Arc<dyn LlmProvider>,
    action: GuardrailAction,
}

impl GuardrailFilter {
    async fn check(&self, content: &str) -> GuardrailVerdict;
}
```

Uses existing `LlmProvider` trait — no new HTTP client needed.

## Source

Research session 2026-03-13. PromptArmor injection defense (promptarmor.ai, see also arXiv:2312.14197).

## Priority
Medium — opt-in hardening for high-security deployments. The regex `ContentSanitizer` remains the default.

## Related
- `#1630` (TrustBench pre-execution verification — different layer)
- `#1195` (Untrusted Content Isolation epic)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

research(security): guardrail LLM for prompt injection pre-screening (PromptArmor pattern) #1651

Research Finding

Applicability

Design Sketch

Source

Priority

Related

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

research(security): guardrail LLM for prompt injection pre-screening (PromptArmor pattern) #1651

Description

Research Finding

Applicability

Design Sketch

Source

Priority

Related

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions