research(security): AlignSentinel — alignment-aware DeBERTa-v3 classifier reduces FPR on benign tool outputs (arXiv:2602.13597) #2208
Description
Summary
arXiv:2602.13597 — submitted February 2026. "AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks".
The paper introduces a three-class DeBERTa-v3-base classifier (misaligned-instruction / aligned-instruction / no-instruction) that uses LLM attention maps as features. It reports substantially fewer false positives on benign tool outputs that contain instruction-like text (e.g., grammar suggestions, API return messages).
Applicability to Zeph
HIGH — Directly extends #2193 (MELON, arXiv:2502.05174): MELON measured FPR on static benign corpora; AlignSentinel specifically targets the false-positive failure mode where benign tool return values (exactly what Zeph's zeph-tools surfaces into context) look like instructions.
The alignment-awareness approach and published FPR breakdown are immediately usable to:
- Calibrate detection thresholds for Zeph's Candle-backed injection classifier (#2198)
- Inform Phase 2 classifier design (#2200: OnnxClassifier, PII detection, LlmClassifier for feedback)
- Reduce false positives from memory search results (the pattern fixed in #2057, where the memory_search tool output path was not covered by the MemorySourceHint fix from #2053)
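The threshold calibration in the first bullet could look roughly like the sketch below: given classifier scores collected on known-benign tool outputs, pick the cut-off that keeps the false-positive rate at or below a target. This is only an illustration under stated assumptions. It assumes the classifier emits a per-sample injection score and flags a sample when `score > threshold`. The names `calibrate_threshold` and `false_positive_rate` are hypothetical and do not come from Zeph or from the paper.

```rust
/// Hypothetical helper: pick a threshold from scores observed on benign
/// samples so that at most `target_fpr` of them score strictly above it.
/// Assumes a sample is flagged when `score > threshold`.
fn calibrate_threshold(benign_scores: &[f32], target_fpr: f32) -> Option<f32> {
    if benign_scores.is_empty() || !(0.0..=1.0).contains(&target_fpr) {
        return None;
    }
    let mut sorted = benign_scores.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    // Maximum number of benign samples we tolerate being flagged.
    let allowed = (target_fpr * n as f32).floor() as usize;
    // At most `allowed` samples lie strictly above sorted[n - allowed - 1].
    let idx = n - allowed.min(n - 1) - 1;
    Some(sorted[idx])
}

/// Hypothetical helper: fraction of benign scores flagged at `threshold`.
fn false_positive_rate(benign_scores: &[f32], threshold: f32) -> f32 {
    if benign_scores.is_empty() {
        return 0.0;
    }
    let flagged = benign_scores.iter().filter(|&&s| s > threshold).count();
    flagged as f32 / benign_scores.len() as f32
}
```

In practice the benign scores would come from a held-out corpus of real tool outputs; the paper's published FPR breakdown could guide which `target_fpr` is realistic per output category.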
Implementation Sketch
- Apply the three-class classification approach to zeph-sanitizer: distinguish misaligned vs. aligned instructions in tool output
- Use attention map features as an enhancement to the existing DeBERTa model in #2198 (Candle-backed injection classifier infrastructure)
- Threshold calibration based on AlignSentinel's published FPR breakdown
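One way the three-class decision could sit on top of classifier logits is sketched below. It is a minimal sketch, not the paper's method or Zeph's code: it does not model the attention-map features at all, and the class order (index 0 = misaligned, 1 = aligned, 2 = no instruction), the `InstructionClass` and `Verdict` types, and the `classify` function are all assumptions made for illustration. Only the misaligned probability drives the verdict, so instruction-like but aligned text (for example a grammar suggestion) is not flagged.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum InstructionClass {
    Misaligned,
    Aligned,
    NoInstruction,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Verdict {
    Allow,
    Flag,
}

/// Numerically stable softmax over the three class logits.
fn softmax(logits: [f32; 3]) -> [f32; 3] {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps = logits.map(|l| (l - max).exp());
    let sum: f32 = exps.iter().sum();
    exps.map(|e| e / sum)
}

/// Hypothetical decision rule. Assumed logit order:
/// [misaligned, aligned, no-instruction].
fn classify(logits: [f32; 3], misaligned_threshold: f32) -> (InstructionClass, Verdict) {
    let probs = softmax(logits);
    // Arg-max over the three class probabilities.
    let mut best = 0;
    for i in 1..3 {
        if probs[i] > probs[best] {
            best = i;
        }
    }
    let class = match best {
        0 => InstructionClass::Misaligned,
        1 => InstructionClass::Aligned,
        _ => InstructionClass::NoInstruction,
    };
    // Flag only on the misaligned probability; aligned instructions pass.
    let verdict = if probs[0] >= misaligned_threshold {
        Verdict::Flag
    } else {
        Verdict::Allow
    };
    (class, verdict)
}
```

The verdict is named `Flag` rather than "block" on purpose: consistent with #2193, the classifier output is probably best treated as a soft signal, with `misaligned_threshold` set from the calibration step.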
References
- https://arxiv.org/abs/2602.13597
- Related: #2193 (MELON FPR: DeBERTa injection detectors have high FPR, use as soft signal only; arXiv:2502.05174), #2198 (Candle classifier), #2200 (Phase 2 classifiers)