-
Notifications
You must be signed in to change notification settings - Fork 2
research(security): AgentSentry temporal causal diagnostics — turn-level IPI attribution + context purification, +20-33pp utility under attack (arXiv:2602.22724) #2335
Description
Source
arXiv:2602.22724 — AgentSentry: Mitigating Indirect Prompt Injection via Temporal Causal Diagnostics and Context Purification (February 2026)
Technique
First inference-time defense to model multi-turn indirect prompt injection (IPI) as a temporal causal takeover problem. At each tool-return boundary:
- Runs four controlled counterfactual variants to estimate the causal attribution of the current turn
- Identifies turns where injected content caused deviation from baseline behavior
- Performs targeted context purification: removes or quarantines attack-induced message spans rather than the entire tool result
Achieves 74.55% task utility under attack (+20-33pp over VIGIL and baseline defenses). False positive rate on benign inputs: comparable to no-defense baseline.
Applicability to Zeph
High. Zeph's MCP client and A2A responder are the primary IPI attack surfaces (tool results from untrusted servers). The current defense stack uses:
ContentSanitizer(PR feat(security): OWASP AI Agent Security 2026 gap analysis (#1650) #1796) — post-ingestion redactionQuarantinedSummarizer— isolates untrusted content in summarizationEmbeddingAnomalyGuard(PR feat(security): MCP capability attestation, trust calibration, and injection defense #2310, not yet wired fix(mcp): DefaultMcpProber, TrustScoreStore, and EmbeddingAnomalyGuard not wired into agent bootstrap #2315) — cosine anomaly on tool output embeddings
AgentSentry is complementary: operates at the turn-causal level, targeting message spans rather than content patterns. More surgical than AEGIS (#2305, pre-execution firewall). Addresses multi-turn IPI that current pattern-based defenses cannot catch.
Implementation sketch
- Add
TurnCausalAnalyzertozeph-core::agent::security— runs counterfactual variants (thin LLM probes) at tool-return boundaries - Score each turn's causal attribution score
- If score exceeds threshold: flag turn, run context purification (mark spans for exclusion in next context assembly)
- Config:
[security.causal_ipi] enabled = false, threshold = 0.7, provider = "fast"
P2 — research. Complements #2305 (AEGIS pre-execution) and existing ContentSanitizer. Evaluate after #2315 (MCPShield wiring) is resolved.