-
Notifications
You must be signed in to change notification settings - Fork 2
research(security): 3-layer injection defense — embedding anomaly + hierarchical guardrails + response verification (arXiv:2511.15759) #2254
Description
Source
arXiv:2511.15759 — Securing AI Agents Against Prompt Injection Attacks (Nov 2025)
https://arxiv.org/abs/2511.15759
Summary
Three-layer defense for RAG-enabled agents:
- Embedding anomaly detection — compares incoming content embedding vs. known-clean distribution; flags semantic outliers before they reach context
- Hierarchical system prompt guardrails — priority-ordered instruction layers that prevent override by user/tool content
- Multi-stage response verification — post-generation LLM check for instruction-following vs. injected instruction compliance
Tested against 847 adversarial cases across 5 attack categories (direct injection, context manipulation, instruction override, data exfiltration, cross-context contamination). Combined framework: attack success 73.2% → 8.7%, baseline task performance preserved at 94.3%.
Relevance to Zeph
Zeph already has layers 1b and 2 partially:
- Layer 1b:
ContentSanitizer(regex-based), Candle DeBERTa injection classifier (PR feat(classifiers): Candle-backed injection classifier infrastructure (#2185) #2198) - Layer 2:
SecurityConfigwithautonomy_leveland system prompt structure
Missing layers:
- Layer 1a (embedding anomaly): content embedding vs. clean-distribution outlier detection — can reuse
zeph-memoryvector infrastructure (Qdrant or SQLite-vec) with a reference clean-embedding centroid - Layer 3 (response verification): post-generation check that LLM response follows system instructions rather than injected ones — absent from current pipeline
The 5-category attack taxonomy (especially "cross-context contamination" and "data exfiltration") is more comprehensive than Zeph's current coverage and provides a concrete benchmark for regression testing.
Implementation Sketch
Layer 1a — embedding anomaly:
- Maintain a centroid embedding of recent clean assistant messages in
zeph-memory - On each incoming tool output / user message: compute cosine distance to centroid
- Threshold breach → flag for quarantine review (integrate with
QuarantinedSummarizer)
Layer 3 — response verification:
- After LLM generates response, call a
verifier_provider(small/fast model) with: system_prompt + user_message + assistant_response - Verifier checks: does response follow system prompt or an injected instruction?
- On failure: regenerate or refuse
Complexity
MEDIUM — layer 1a reuses vector infra; layer 3 adds a post-generation LLM call (optional, configurable)
Component
zeph-core (ContentSanitizer, classifiers), zeph-memory (embedding anomaly), crates/zeph-core/src/security/