-
Notifications
You must be signed in to change notification settings - Fork 2
research(security): MELON paper — DeBERTa injection detectors have high FPR; use as soft signal only (arXiv:2502.05174) #2193
Copy link
Copy link
Closed
Labels
P2High value, medium complexityHigh value, medium complexityresearchResearch-driven improvementResearch-driven improvementsecuritySecurity-related issueSecurity-related issue
Description
Source
"MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents" (arXiv:2502.05174, February 2025, updated June 2025)
https://arxiv.org/abs/2502.05174
Summary
Evaluates DeBERTa-based classifiers as injection detectors in LLM agent pipelines. Key findings:
- Off-the-shelf DeBERTa checkpoints (e.g.,
mDeBERTa-v3-base-prompt-injection-v2) achieve near-zero attack success rate on some attack patterns — excellent detection recall - BUT: exhibit high false positive rates on benign tool outputs, misidentifying legitimate content as malicious
- Root cause: DeBERTa was pre-trained on phishing/spam patterns, not on adversarial agent-specific prompt injection patterns
- MELON alternative: masked re-execution — re-runs the tool call with key content masked, compares outputs to detect injection artifacts. Provably correct defense.
Applicability to Zeph
Directly relevant to issue #2185 (Candle classifier implementation) and #2190 (classifier integration tests). Critical design constraints:
- Off-the-shelf DeBERTa is NOT a drop-in hard guardrail — high FPR would block legitimate agent operations
- Safe deployment path: use as soft signal / first-pass filter only, not as a hard block
- For a hard guardrail, either:
- Fine-tune DeBERTa on agent-specific injection examples (few-shot is sufficient per MELON paper)
- Implement MELON-style masked re-execution as a secondary verification layer
- Integration tests (test(candle): add integration tests for Candle-backed classifier models #2190) must include FPR measurement on benign tool outputs, not just attack detection
Implementation Recommendation
When implementing the DeBERTa classifier (#2185):
- Add a
soft_signalmode (default) vshard_blockmode config option - Default to
soft_signal: flag suspicious content for attention, do not block - Document FPR risk prominently in the config comments
Priority
P2 — informs design decisions for #2185 and #2190 before implementation.
Related Issues
- feat(classifiers): replace regex heuristics with Candle-backed lightweight classifiers #2185 (feat): Candle-backed lightweight classifiers
- test(candle): add integration tests for Candle-backed classifier models #2190 (test): Integration tests for Candle-backed classifier models
- research(mcp): MCP protocol-level security vulnerabilities — capability attestation and origin authentication #2178 (research): MCP protocol-level security vulnerabilities
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
P2High value, medium complexityHigh value, medium complexityresearchResearch-driven improvementResearch-driven improvementsecuritySecurity-related issueSecurity-related issue