research(security): AgentSentry temporal causal diagnostics — turn-level IPI attribution + context purification, +20-33pp utility under attack (arXiv:2602.22724)

## Source

arXiv:2602.22724 — *AgentSentry: Mitigating Indirect Prompt Injection via Temporal Causal Diagnostics and Context Purification* (February 2026)

## Technique

First inference-time defense to model multi-turn indirect prompt injection (IPI) as a temporal causal takeover problem. At each tool-return boundary:

1. Runs four controlled counterfactual variants to estimate the causal attribution of the current turn
2. Identifies turns where injected content caused deviation from baseline behavior
3. Performs targeted context purification: removes or quarantines attack-induced message spans rather than the entire tool result

Achieves 74.55% task utility under attack (+20-33pp over VIGIL and baseline defenses). False positive rate on benign inputs: comparable to no-defense baseline.

## Applicability to Zeph

**High.** Zeph's MCP client and A2A responder are the primary IPI attack surfaces (tool results from untrusted servers). The current defense stack uses:
- `ContentSanitizer` (PR #1796) — post-ingestion redaction
- `QuarantinedSummarizer` — isolates untrusted content in summarization
- `EmbeddingAnomalyGuard` (PR #2310, not yet wired #2315) — cosine anomaly on tool output embeddings

AgentSentry is complementary: operates at the turn-causal level, targeting message spans rather than content patterns. More surgical than AEGIS (#2305, pre-execution firewall). Addresses multi-turn IPI that current pattern-based defenses cannot catch.

## Implementation sketch

1. Add `TurnCausalAnalyzer` to `zeph-core::agent::security` — runs counterfactual variants (thin LLM probes) at tool-return boundaries
2. Score each turn's causal attribution score
3. If score exceeds threshold: flag turn, run context purification (mark spans for exclusion in next context assembly)
4. Config: `[security.causal_ipi] enabled = false, threshold = 0.7, provider = "fast"`

P2 — research. Complements #2305 (AEGIS pre-execution) and existing `ContentSanitizer`. Evaluate after #2315 (MCPShield wiring) is resolved.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

research(security): AgentSentry temporal causal diagnostics — turn-level IPI attribution + context purification, +20-33pp utility under attack (arXiv:2602.22724) #2335

Source

Technique

Applicability to Zeph

Implementation sketch

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

research(security): AgentSentry temporal causal diagnostics — turn-level IPI attribution + context purification, +20-33pp utility under attack (arXiv:2602.22724) #2335

Description

Source

Technique

Applicability to Zeph

Implementation sketch

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions