research(security): AlignSentinel — alignment-aware DeBERTa-v3 classifier reduces FPR on benign tool outputs (arXiv:2602.13597)

## Summary

"AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks" (arXiv:2602.13597, submitted February 2026).

The paper introduces a three-class DeBERTa-v3-base classifier (misaligned-instruction / aligned-instruction / no-instruction) that uses LLM attention maps as features. It substantially reduces false positives on benign tool outputs that contain instruction-like text (e.g., grammar suggestions, API return messages).

## Applicability to Zeph

**HIGH.** Directly extends #2193 (MELON, arXiv:2502.05174). MELON measured FPR on static benign corpora, whereas AlignSentinel specifically targets the false-positive failure mode in which benign tool return values look like instructions. Those return values are exactly what Zeph's `zeph-tools` surfaces into context.

The alignment-awareness approach and published FPR breakdown are immediately usable to:
1. Calibrate detection thresholds for Zeph's Candle-backed injection classifier (#2198)
2. Inform Phase 2 classifier design (#2200)
3. Reduce false positives from memory search results (the pattern fixed in #2057)
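
As a rough illustration of the threshold calibration in item 1, the sketch below picks the lowest cutoff that keeps the false-positive rate on a benign sample at or below a target. It is an assumption-laden sketch, not Zeph's or the paper's procedure: `threshold_for_fpr` and `false_positive_rate` are hypothetical names, the scores are taken to be the classifier's misaligned-instruction probability on known-benign tool outputs, and a sample is taken to be flagged when its score is strictly above the threshold.

```rust
/// Fraction of benign samples that would be flagged (score > threshold).
/// Hypothetical helper for illustration only.
pub fn false_positive_rate(benign_scores: &[f32], threshold: f32) -> f64 {
    if benign_scores.is_empty() {
        return 0.0;
    }
    let flagged = benign_scores.iter().filter(|&&s| s > threshold).count();
    flagged as f64 / benign_scores.len() as f64
}

/// Smallest threshold, drawn from the observed benign scores, whose
/// false-positive rate on that sample does not exceed `target_fpr`.
/// Returns `None` for an empty sample or a target outside [0, 1].
pub fn threshold_for_fpr(benign_scores: &[f32], target_fpr: f64) -> Option<f32> {
    if benign_scores.is_empty() || !(0.0..=1.0).contains(&target_fpr) {
        return None;
    }
    let mut sorted = benign_scores.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    let n = sorted.len();
    // Number of benign samples we may flag without exceeding the target.
    let allowed = ((target_fpr * n as f64).floor() as usize).min(n - 1);
    // At most `allowed` scores can lie strictly above this element.
    Some(sorted[n - 1 - allowed])
}
```

With ten made-up benign scores from 0.05 to 0.90 and a target FPR of 0.2, `threshold_for_fpr` returns the third-highest score, so only the top two benign samples would be flagged. In practice the benign sample would need to resemble the tool outputs and memory search results Zeph actually sees, otherwise the calibrated threshold says little about production FPR.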

## Implementation Sketch

- Apply the three-class classification approach to `zeph-sanitizer`: distinguish misaligned from aligned instructions in tool output
- Add attention-map features as an enhancement to the existing DeBERTa model in #2198
- Calibrate thresholds against AlignSentinel's published FPR breakdown
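
The three-class decision described above could be sketched as follows. This is only a minimal sketch under stated assumptions: the `InstructionClass` enum, the `classify` and `should_quarantine` functions, and the logit order (misaligned, aligned, no-instruction) are all hypothetical and are not taken from `zeph-sanitizer` or from the paper. The classifier's three raw logits are assumed to be already available, and the attention-map features are out of scope here.

```rust
/// Hypothetical label set mirroring the paper's three classes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstructionClass {
    Misaligned,
    Aligned,
    NoInstruction,
}

/// Numerically stable softmax over three logits.
pub fn softmax(logits: [f32; 3]) -> [f32; 3] {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps = logits.map(|l| (l - max).exp());
    let sum: f32 = exps.iter().sum();
    exps.map(|e| e / sum)
}

/// Map logits to a class. Assumed logit order: misaligned, aligned, none.
/// `Misaligned` is returned only when its probability reaches the
/// calibrated threshold; otherwise the likelier benign class wins.
pub fn classify(logits: [f32; 3], misaligned_threshold: f32) -> InstructionClass {
    let [p_misaligned, p_aligned, p_none] = softmax(logits);
    if p_misaligned >= misaligned_threshold {
        InstructionClass::Misaligned
    } else if p_aligned >= p_none {
        InstructionClass::Aligned
    } else {
        InstructionClass::NoInstruction
    }
}

/// Only misaligned instructions are quarantined; aligned instructions and
/// instruction-free text pass through to the context unchanged.
pub fn should_quarantine(class: InstructionClass) -> bool {
    class == InstructionClass::Misaligned
}
```

Gating `Misaligned` on a threshold rather than on the arg-max is a design choice made for this sketch: it lets the cutoff be tuned against a benign corpus, so instruction-like but benign tool output falls into `Aligned` instead of being blocked.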

## References

- https://arxiv.org/abs/2602.13597
- Related: #2193 (MELON FPR), #2198 (Candle classifier), #2200 (Phase 2 classifiers)
