research(security): AlignSentinel — alignment-aware DeBERTa-v3 classifier reduces FPR on benign tool outputs (arXiv:2602.13597) #2208

@bug-ops

Summary

arXiv:2602.13597 — submitted February 2026. "AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks".

The paper introduces a three-class DeBERTa-v3-base classifier (misaligned-instruction / aligned-instruction / no-instruction) that uses LLM attention maps as features. It substantially reduces false positives on benign tool outputs that contain instruction-like text (e.g., grammar suggestions, API return messages).
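
To make the three-class idea concrete, here is a minimal sketch of how the labels could map to a flag/allow decision. It is not the paper's code or Zeph's API: `InstructionClass`, `should_flag`, and the 0.5 default threshold are hypothetical names and values chosen for illustration. The point it shows is that only the misaligned-instruction class triggers a flag, so instruction-like but benign tool output falls into aligned-instruction and passes.

```python
from enum import Enum


class InstructionClass(Enum):
    # Label names follow the three classes listed in this issue.
    MISALIGNED_INSTRUCTION = "misaligned-instruction"
    ALIGNED_INSTRUCTION = "aligned-instruction"
    NO_INSTRUCTION = "no-instruction"


def should_flag(probs: dict, threshold: float = 0.5) -> bool:
    """Hypothetical policy: flag only if misaligned-instruction is the
    top-scoring class and its probability reaches the threshold."""
    top = max(probs, key=probs.get)
    return top is InstructionClass.MISALIGNED_INSTRUCTION and probs[top] >= threshold


# Illustrative (made-up) probabilities for a grammar-suggestion tool output:
# it contains instruction-like text, but the text is aligned with the task.
grammar_suggestion = {
    InstructionClass.MISALIGNED_INSTRUCTION: 0.1,
    InstructionClass.ALIGNED_INSTRUCTION: 0.8,
    InstructionClass.NO_INSTRUCTION: 0.1,
}
flagged = should_flag(grammar_suggestion)
```

A binary injection/no-injection classifier has no aligned-instruction bucket, which is presumably where such outputs end up as false positives.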

Applicability to Zeph

HIGH — Directly extends #2193 (MELON, arXiv:2502.05174): MELON measured FPR on static benign corpora; AlignSentinel specifically targets the false-positive failure mode where benign tool return values (exactly what Zeph's zeph-tools surfaces into context) look like instructions.

The alignment-awareness approach and published FPR breakdown are immediately usable to:

  1. Calibrate the detection thresholds of Zeph's Candle-backed injection classifier (feat(classifiers): Candle-backed injection classifier infrastructure (#2185) #2198)
  2. Inform Phase 2 classifier design (feat(classifiers): Phase 2 — OnnxClassifier, PII detection, LlmClassifier for feedback #2200)
  3. Reduce false positives from memory search results (the pattern fixed in fix(sanitizer): memory_search tool output path not covered by MemorySourceHint fix (#2053) #2057)
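
As an illustration of the threshold calibration in item 1, the sketch below picks a detection threshold from classifier scores on a benign tool-output corpus so that the false-positive rate stays at or below a target. This is a generic quantile-style calibration, not a procedure taken from the paper or from Zeph; the function names, the `score > threshold` flagging rule, and the sample scores are all assumptions.

```python
import math


def threshold_for_target_fpr(benign_scores, target_fpr):
    """Return a threshold such that at most `target_fpr` of the benign
    samples satisfy `score > threshold` (the assumed flagging rule)."""
    if not benign_scores:
        raise ValueError("benign_scores must not be empty")
    if not 0.0 <= target_fpr <= 1.0:
        raise ValueError("target_fpr must be in [0, 1]")
    ranked = sorted(benign_scores, reverse=True)
    # Maximum number of benign samples we may flag without exceeding the target.
    allowed = math.floor(target_fpr * len(ranked))
    if allowed >= len(ranked):
        return float("-inf")  # target of 1.0: every sample may be flagged
    # Only scores strictly above ranked[allowed] are flagged: at most `allowed`.
    return ranked[allowed]


def false_positive_rate(benign_scores, threshold):
    """Fraction of benign samples flagged under `score > threshold`."""
    return sum(s > threshold for s in benign_scores) / len(benign_scores)


# Made-up injection scores for ten benign tool outputs.
scores = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.95]
t = threshold_for_target_fpr(scores, target_fpr=0.2)
fpr = false_positive_rate(scores, t)
```

In practice the benign scores would come from real tool outputs of the kind described above, and the chosen threshold would also need to be checked against detection rate on actual injection samples.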

Implementation Sketch

References

Metadata

Labels

P2 (High value, medium complexity), research (Research-driven improvement), security (Security-related issue)
