
research(security): MELON paper — DeBERTa injection detectors have high FPR; use as soft signal only (arXiv:2502.05174) #2193

@bug-ops

Source

"MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents" (arXiv:2502.05174, February 2025, updated June 2025)
https://arxiv.org/abs/2502.05174

Summary

Evaluates DeBERTa-based classifiers as injection detectors in LLM agent pipelines. Key findings:

  • Off-the-shelf DeBERTa checkpoints (e.g., mDeBERTa-v3-base-prompt-injection-v2) drive the attack success rate to near zero on some attack patterns, i.e. they detect those injections reliably
  • However, they exhibit a high false positive rate (FPR) on benign tool outputs, misclassifying legitimate content as malicious
  • Root cause: the checkpoints were trained on phishing/spam-style patterns rather than on adversarial, agent-specific prompt injection patterns
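  • MELON alternative: masked re-execution. The agent's trajectory is re-run with the user prompt masked, and the tool calls from the original and masked runs are compared. If the two runs produce similar tool calls, the action is being driven by the tool output rather than the user task, and an injection is flagged. The paper presents this as a provable defense.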
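
A minimal sketch of the comparison step in masked re-execution, assuming the agent has already been run twice (once normally, once with the user prompt masked) and that each run's tool calls were recorded. `ToolCall`, `suspicious_calls`, and `injection_detected` are hypothetical names, and exact equality stands in for whatever similarity measure a real implementation would use:

```rust
/// A recorded tool invocation (hypothetical shape; args kept as a plain string).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToolCall {
    pub name: String,
    pub args: String,
}

impl ToolCall {
    pub fn new(name: &str, args: &str) -> Self {
        Self { name: name.to_string(), args: args.to_string() }
    }
}

/// Returns the tool calls from the original run that also appear in the
/// masked run. A call that survives masking of the user prompt cannot have
/// come from the user task, so it is treated as injection-driven.
/// Assumption: exact match is used here in place of a similarity score.
pub fn suspicious_calls<'a>(original: &'a [ToolCall], masked: &[ToolCall]) -> Vec<&'a ToolCall> {
    original
        .iter()
        .filter(|call| masked.contains(*call))
        .collect()
}

/// True if at least one tool call is shared between the two runs.
pub fn injection_detected(original: &[ToolCall], masked: &[ToolCall]) -> bool {
    !suspicious_calls(original, masked).is_empty()
}
```

With this shape, a benign run yields no overlap (the masked run has no user task to act on), while a run steered by injected tool output repeats the attacker's call in both executions.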

Applicability to Zeph

Directly relevant to issue #2185 (Candle classifier implementation) and #2190 (classifier integration tests). Critical design constraints:

  1. Off-the-shelf DeBERTa is NOT a drop-in hard guardrail — high FPR would block legitimate agent operations
  2. Safe deployment path: use as soft signal / first-pass filter only, not as a hard block
  3. For a hard guardrail, either:
    • Fine-tune DeBERTa on agent-specific injection examples (few-shot is sufficient per MELON paper)
    • Implement MELON-style masked re-execution as a secondary verification layer
  4. Integration tests (#2190, "test(candle): add integration tests for Candle-backed classifier models") must include FPR measurement on benign tool outputs, not just attack detection
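
The FPR requirement in point 4 could be checked with a small helper along these lines. This is a sketch only: `Counts` and `evaluate` are hypothetical names, the classifier is passed in as a closure rather than the Candle-backed model, and any FPR budget would have to be chosen by the project:

```rust
/// Confusion-matrix counts for a binary injection detector.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Counts {
    pub true_pos: u32,
    pub false_pos: u32,
    pub true_neg: u32,
    pub false_neg: u32,
}

impl Counts {
    /// Records one prediction against its ground-truth label.
    pub fn record(&mut self, predicted: bool, actual: bool) {
        match (predicted, actual) {
            (true, true) => self.true_pos += 1,
            (true, false) => self.false_pos += 1,
            (false, false) => self.true_neg += 1,
            (false, true) => self.false_neg += 1,
        }
    }

    /// FPR = FP / (FP + TN); `None` if there were no benign samples.
    pub fn false_positive_rate(&self) -> Option<f64> {
        let negatives = self.false_pos + self.true_neg;
        (negatives > 0).then(|| f64::from(self.false_pos) / f64::from(negatives))
    }

    /// Recall = TP / (TP + FN); `None` if there were no attack samples.
    pub fn recall(&self) -> Option<f64> {
        let positives = self.true_pos + self.false_neg;
        (positives > 0).then(|| f64::from(self.true_pos) / f64::from(positives))
    }
}

/// Runs `classify` over labeled samples: (text, is_injection).
pub fn evaluate<F>(classify: F, samples: &[(&str, bool)]) -> Counts
where
    F: Fn(&str) -> bool,
{
    let mut counts = Counts::default();
    for &(text, is_injection) in samples {
        counts.record(classify(text), is_injection);
    }
    counts
}
```

An integration test would feed a corpus of benign tool outputs alongside the attack samples and assert on both `recall()` and `false_positive_rate()`, so that a detector with perfect recall but an unusable FPR fails the test.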

Implementation Recommendation

When implementing the DeBERTa classifier (#2185):

  • Add a config option that selects between soft_signal mode (the default) and hard_block mode
  • Default to soft_signal: flag suspicious content for attention, do not block
  • Document FPR risk prominently in the config comments
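
One possible shape for this option, as a sketch. Only the mode names soft_signal and hard_block come from this issue; `EnforcementMode`, `ClassifierConfig`, `Verdict`, `decide`, and the 0.5 threshold are hypothetical placeholders, not existing Zeph APIs:

```rust
/// Hypothetical enforcement mode for the injection classifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum EnforcementMode {
    /// soft_signal: flag suspicious content for attention, never block.
    #[default]
    SoftSignal,
    /// hard_block: block flagged content.
    /// WARNING: off-the-shelf DeBERTa checkpoints have a high false
    /// positive rate on benign tool outputs (arXiv:2502.05174).
    HardBlock,
}

/// Hypothetical classifier configuration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ClassifierConfig {
    pub mode: EnforcementMode,
    /// Score at or above which content counts as suspicious.
    /// The default of 0.5 is an assumed placeholder, not a tuned value.
    pub threshold: f32,
}

impl Default for ClassifierConfig {
    fn default() -> Self {
        Self { mode: EnforcementMode::default(), threshold: 0.5 }
    }
}

/// What the pipeline does with a piece of tool output.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Allow,
    Flag,
    Block,
}

/// Maps a classifier score to a verdict under the configured mode.
pub fn decide(cfg: &ClassifierConfig, injection_score: f32) -> Verdict {
    if injection_score < cfg.threshold {
        return Verdict::Allow;
    }
    match cfg.mode {
        EnforcementMode::SoftSignal => Verdict::Flag,
        EnforcementMode::HardBlock => Verdict::Block,
    }
}
```

Making `SoftSignal` the derived default means a config that omits the mode can never block content by accident; blocking has to be opted into explicitly.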

Priority

P2 — informs design decisions for #2185 and #2190 before implementation.
