research(security): 3-layer injection defense — embedding anomaly + hierarchical guardrails + response verification (arXiv:2511.15759)

## Source

arXiv:2511.15759 — *Securing AI Agents Against Prompt Injection Attacks* (Nov 2025)
https://arxiv.org/abs/2511.15759

## Summary

Three-layer defense for RAG-enabled agents:
1. **Embedding anomaly detection** — compares incoming content embedding vs. known-clean distribution; flags semantic outliers before they reach context
2. **Hierarchical system prompt guardrails** — priority-ordered instruction layers that prevent override by user/tool content
3. **Multi-stage response verification** — post-generation LLM check for instruction-following vs. injected instruction compliance

Tested against 847 adversarial cases across 5 attack categories (direct injection, context manipulation, instruction override, data exfiltration, cross-context contamination). Combined framework: attack success 73.2% → 8.7%, baseline task performance preserved at 94.3%.

## Relevance to Zeph

Zeph already has layers 1b and 2 partially:
- Layer 1b: `ContentSanitizer` (regex-based), Candle DeBERTa injection classifier (PR #2198)
- Layer 2: `SecurityConfig` with `autonomy_level` and system prompt structure

**Missing layers**:
- **Layer 1a (embedding anomaly)**: content embedding vs. clean-distribution outlier detection — can reuse `zeph-memory` vector infrastructure (Qdrant or SQLite-vec) with a reference clean-embedding centroid
- **Layer 3 (response verification)**: post-generation check that LLM response follows system instructions rather than injected ones — absent from current pipeline

The 5-category attack taxonomy (especially "cross-context contamination" and "data exfiltration") is more comprehensive than Zeph's current coverage and provides a concrete benchmark for regression testing.

## Implementation Sketch

**Layer 1a** — embedding anomaly:
- Maintain a centroid embedding of recent clean assistant messages in `zeph-memory`
- On each incoming tool output / user message: compute cosine distance to centroid
- Threshold breach → flag for quarantine review (integrate with `QuarantinedSummarizer`)

**Layer 3** — response verification:
- After LLM generates response, call a `verifier_provider` (small/fast model) with: system_prompt + user_message + assistant_response
- Verifier checks: does response follow system prompt or an injected instruction?
- On failure: regenerate or refuse

## Complexity

MEDIUM — layer 1a reuses vector infra; layer 3 adds a post-generation LLM call (optional, configurable)

## Component

`zeph-core` (`ContentSanitizer`, classifiers), `zeph-memory` (embedding anomaly), `crates/zeph-core/src/security/`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

research(security): 3-layer injection defense — embedding anomaly + hierarchical guardrails + response verification (arXiv:2511.15759) #2254

Source

Summary

Relevance to Zeph

Implementation Sketch

Complexity

Component

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

research(security): 3-layer injection defense — embedding anomaly + hierarchical guardrails + response verification (arXiv:2511.15759) #2254

Description

Source

Summary

Relevance to Zeph

Implementation Sketch

Complexity

Component

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions