research(security): guardrail LLM for prompt injection pre-screening (PromptArmor pattern) #1651

@bug-ops

Research Finding

PromptArmor's prompt injection defense approach uses a small, fast classifier LLM to screen incoming prompts before they reach the main agent LLM. The classifier is fine-tuned specifically to detect injection patterns and runs at sub-50ms latency on a 3B-param model.

This is distinct from #1630 (TrustBench pre-execution action verification), which operates after tool call formulation. This guard operates at the input boundary, before the content reaches the main agent LLM.

Applicability

Zeph's ContentSanitizer applies regex-based sanitization. A lightweight LLM-based classifier could catch semantic injection patterns that regex rules tend to miss:

  • Paraphrased variants of "Ignore all previous instructions..."
  • Multi-language injection (regex-based defenses often miss non-English variants)
  • Base64-encoded injection payloads
  • Indirect injection via tool results (web scrape returns adversarial content)
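To make the gap concrete, here is a minimal sketch of a pattern-based check missing two of the cases listed above. `pattern_filter_flags` is a hypothetical stand-in (a plain substring match), not Zeph's actual ContentSanitizer rules, which this issue does not show; the base64 encoder is written out by hand only so the example needs no external crates.

```rust
/// Hypothetical stand-in for a pattern-based sanitizer: a case-insensitive
/// substring check. Not the real ContentSanitizer rule set.
fn pattern_filter_flags(input: &str) -> bool {
    input
        .to_lowercase()
        .contains("ignore all previous instructions")
}

/// Minimal standard-alphabet base64 encoder with '=' padding.
fn base64_encode(bytes: &[u8]) -> String {
    const ALPHABET: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = String::new();
    for chunk in bytes.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        out.push(if chunk.len() > 1 {
            ALPHABET[((n >> 6) & 63) as usize] as char
        } else {
            '='
        });
        out.push(if chunk.len() > 2 {
            ALPHABET[(n & 63) as usize] as char
        } else {
            '='
        });
    }
    out
}
```

The literal English phrase is flagged, but its base64 encoding and a Spanish rendering ("Ignora todas las instrucciones anteriores") both pass the pattern check unchanged. A classifier that reasons over meaning is intended to cover that gap.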

Two insertion points:

  1. User input boundary — in CliChannel/AcpSession before the prompt enters the agent loop
  2. Tool result boundary — in CompositeExecutor after tool execution, before results enter context (indirect injection)
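The two insertion points can share one screening function. The sketch below is synchronous for brevity and every name in it (`Boundary`, `InjectionScreen`, `KeywordScreen`, `admit`) is hypothetical; the real hooks would live in CliChannel/AcpSession and CompositeExecutor, whose APIs are not shown here, and the real screen would be the guardrail model rather than a keyword match.

```rust
/// Which of the two boundaries the content is crossing.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Boundary {
    UserInput,
    ToolResult,
}

/// Hypothetical synchronous stand-in for the guardrail classifier.
trait InjectionScreen {
    fn is_injection(&self, content: &str) -> bool;
}

/// Toy screen for demonstration only; a real one would query the model.
struct KeywordScreen;

impl InjectionScreen for KeywordScreen {
    fn is_injection(&self, content: &str) -> bool {
        content
            .to_lowercase()
            .contains("ignore all previous instructions")
    }
}

/// Returns the content unchanged if it may enter the agent context,
/// or an error message naming the boundary where it was stopped.
fn admit<'a>(
    screen: &dyn InjectionScreen,
    boundary: Boundary,
    content: &'a str,
) -> Result<&'a str, String> {
    if screen.is_injection(content) {
        Err(format!("blocked at {:?} boundary", boundary))
    } else {
        Ok(content)
    }
}
```

Calling `admit` with `Boundary::UserInput` before the agent loop and with `Boundary::ToolResult` after tool execution keeps both paths behind the same check while still recording where a block happened.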

Design Sketch

```toml
[security.guardrail]
enabled = false
provider = "ollama"
model = "llama-guard3:1b"
timeout_ms = 500
action = "block"  # or "warn"
```

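```rust
use std::sync::Arc;

struct GuardrailFilter {
    provider: Arc<dyn LlmProvider>,
    action: GuardrailAction,
}

impl GuardrailFilter {
    /// Classifies `content` with the guardrail model and returns its verdict.
    async fn check(&self, content: &str) -> GuardrailVerdict {
        todo!("send `content` to `self.provider` and parse the reply")
    }
}
```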

GuardrailFilter reuses the existing LlmProvider trait, so no new HTTP client is needed.
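One possible shape for the verdict and action types, as a sketch only. The issue names `GuardrailVerdict` and `GuardrailAction` but does not define their variants, so the variants below, the `Outcome` type, and both functions are assumptions. The parser assumes a Llama Guard-style reply (`safe`, or `unsafe` followed by a line of comma-separated category codes) and treats anything unrecognized as unsafe, which is a fail-closed choice made for this example. Llama Guard's stock categories target content safety, so the codes here are purely illustrative.

```rust
/// Configured response to an unsafe verdict (mirrors `action` in the config).
#[derive(Debug, Clone, Copy, PartialEq)]
enum GuardrailAction {
    Block,
    Warn,
}

/// Hypothetical verdict shape; variants are not specified in the issue.
#[derive(Debug, Clone, PartialEq)]
enum GuardrailVerdict {
    Safe,
    Unsafe { categories: Vec<String> },
}

/// Hypothetical result of applying the configured action to a verdict.
#[derive(Debug, PartialEq)]
enum Outcome {
    Pass,
    PassWithWarning(String),
    Blocked(String),
}

/// Parses a Llama Guard-style reply. Unrecognized or empty replies are
/// treated as unsafe with no categories (fail-closed, an assumption).
fn parse_verdict(reply: &str) -> GuardrailVerdict {
    let mut lines = reply.trim().lines();
    let first = lines.next().map(|l| l.trim().to_lowercase());
    match first.as_deref() {
        Some("safe") => GuardrailVerdict::Safe,
        Some("unsafe") => {
            let categories = lines
                .next()
                .unwrap_or("")
                .split(',')
                .map(str::trim)
                .filter(|c| !c.is_empty())
                .map(String::from)
                .collect();
            GuardrailVerdict::Unsafe { categories }
        }
        _ => GuardrailVerdict::Unsafe { categories: Vec::new() },
    }
}

/// Maps a verdict to an outcome according to the configured action.
fn apply(action: GuardrailAction, verdict: &GuardrailVerdict) -> Outcome {
    match (verdict, action) {
        (GuardrailVerdict::Safe, _) => Outcome::Pass,
        (GuardrailVerdict::Unsafe { categories }, GuardrailAction::Block) => {
            Outcome::Blocked(categories.join(","))
        }
        (GuardrailVerdict::Unsafe { categories }, GuardrailAction::Warn) => {
            Outcome::PassWithWarning(categories.join(","))
        }
    }
}
```

Keeping parsing separate from the action mapping means `check` can stay a thin async wrapper around the provider call, and the `block`/`warn` behavior can be unit-tested without a running model.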

Source

Research session 2026-03-13. PromptArmor injection defense (promptarmor.ai, see also arXiv:2312.14197).

Priority

Medium — opt-in hardening for high-security deployments. The regex ContentSanitizer remains the default.

Related

  • #1630 (TrustBench pre-execution verification — different layer)
  • #1195 (Untrusted Content Isolation epic)

Metadata

Labels: enhancement, llm, research, security
