research(security): AgentSentry temporal causal diagnostics — turn-level IPI attribution + context purification, +20-33pp utility under attack (arXiv:2602.22724) #2335

@bug-ops

Source

arXiv:2602.22724 — AgentSentry: Mitigating Indirect Prompt Injection via Temporal Causal Diagnostics and Context Purification (February 2026)

Technique

Per the paper, AgentSentry is the first inference-time defense to model multi-turn indirect prompt injection (IPI) as a temporal causal takeover problem. At each tool-return boundary, it:

  1. Runs four controlled counterfactual variants to estimate the causal attribution of the current turn
  2. Identifies turns where injected content caused deviation from baseline behavior
  3. Performs targeted context purification: removes or quarantines attack-induced message spans rather than the entire tool result
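
The span-level purification in step 3 can be illustrated with a minimal sketch. This is not the paper's implementation: `FlaggedSpan` and `purify` are hypothetical names, spans are assumed to be byte ranges into the tool result, and only the "remove" behavior is shown (quarantining would keep the spans somewhere instead of dropping them).

```rust
use std::ops::Range;

/// Hypothetical: a byte range of a tool result that was attributed to an attack.
#[derive(Debug, Clone, PartialEq)]
pub struct FlaggedSpan {
    pub range: Range<usize>,
}

/// Returns `tool_result` with every flagged span removed and the rest kept intact.
/// Spans that are empty, out of bounds, or not on UTF-8 char boundaries are ignored.
pub fn purify(tool_result: &str, spans: &[FlaggedSpan]) -> String {
    let mut ranges: Vec<Range<usize>> = spans
        .iter()
        .map(|s| s.range.clone())
        .filter(|r| {
            r.start < r.end
                && r.end <= tool_result.len()
                && tool_result.is_char_boundary(r.start)
                && tool_result.is_char_boundary(r.end)
        })
        .collect();
    // Process spans left to right so overlapping spans can be merged on the fly.
    ranges.sort_by_key(|r| r.start);

    let mut out = String::with_capacity(tool_result.len());
    let mut cursor = 0;
    for r in ranges {
        if r.start > cursor {
            // Keep the benign text between the previous span and this one.
            out.push_str(&tool_result[cursor..r.start]);
        }
        cursor = cursor.max(r.end);
    }
    // Keep whatever follows the last flagged span.
    out.push_str(&tool_result[cursor..]);
    out
}
```

The point of the sketch is that the benign remainder of the tool result survives, which is what distinguishes this approach from dropping the whole result.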

Achieves 74.55% task utility under attack (+20-33pp over VIGIL and baseline defenses). False positive rate on benign inputs: comparable to no-defense baseline.

Applicability to Zeph

High. Zeph's MCP client and A2A responder are the primary IPI attack surfaces (tool results from untrusted servers). The current defense stack uses:

AgentSentry is complementary: it operates at the turn-causal level and targets message spans rather than content patterns. It is more surgical than AEGIS (#2305, pre-execution firewall) and addresses multi-turn IPI that the current pattern-based defenses cannot catch.

Implementation sketch

  1. Add TurnCausalAnalyzer to zeph-core::agent::security — runs counterfactual variants (thin LLM probes) at tool-return boundaries
  2. Compute a causal attribution score for each turn
  3. If score exceeds threshold: flag turn, run context purification (mark spans for exclusion in next context assembly)
  4. Config: [security.causal_ipi] enabled = false, threshold = 0.7, provider = "fast"
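
A minimal sketch of how steps 3 and 4 could fit together, assuming the config table maps onto a plain struct. `CausalIpiConfig`, `TurnVerdict`, and `evaluate_turn` are hypothetical names, not existing Zeph APIs. The score is taken as an input here, since producing it requires the LLM probes from step 1.

```rust
/// Hypothetical mirror of the proposed `[security.causal_ipi]` config table.
#[derive(Debug, Clone, PartialEq)]
pub struct CausalIpiConfig {
    pub enabled: bool,
    pub threshold: f64,
    pub provider: String,
}

impl Default for CausalIpiConfig {
    /// Defaults taken from the sketch above: disabled, threshold 0.7, "fast" provider.
    fn default() -> Self {
        Self {
            enabled: false,
            threshold: 0.7,
            provider: "fast".to_string(),
        }
    }
}

/// Outcome of checking one turn's causal attribution score.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TurnVerdict {
    /// Turn is left untouched.
    Pass,
    /// Turn should go through context purification.
    Flagged,
}

/// Flags a turn only when the feature is enabled and the score strictly
/// exceeds the threshold. A NaN score compares false and therefore passes.
pub fn evaluate_turn(cfg: &CausalIpiConfig, score: f64) -> TurnVerdict {
    if cfg.enabled && score > cfg.threshold {
        TurnVerdict::Flagged
    } else {
        TurnVerdict::Pass
    }
}
```

With the default config, `evaluate_turn` always returns `TurnVerdict::Pass`, so the feature stays opt-in. Whether a score exactly at the threshold should flag is a design choice; this sketch reads "exceeds" as strictly greater.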

P2 — research. Complements #2305 (AEGIS pre-execution) and existing ContentSanitizer. Evaluate after #2315 (MCPShield wiring) is resolved.

Metadata

Labels

P2 (High value, medium complexity), research (Research-driven improvement), security (Security-related issue)
