research(security): classifier benchmark inflation — 8.4pt AUC gap under distribution shift, 7-37% detection of indirect injections (arXiv:2602.14161) #2303
Closed
Labels
P2 (High value, medium complexity), research (Research-driven improvement), security (Security-related issue)
Description
Paper
arXiv:2602.14161 — When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift
Key Finding
Standard benchmark evaluation inflates injection-classifier AUC by 8.4 percentage points because of dataset shortcuts (stylistic artifacts shared between train and test splits). Under rigorous leave-one-dataset-out testing, production guardrails reach only 7–37% recall on indirect, agent-targeted injections. The gap is largest for indirect/embedding-injected attacks, compared with direct injections.
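A minimal sketch of what leave-one-dataset-out evaluation looks like, to make the contrast with a shared train/test split concrete: each source dataset is held out in full, the classifier is fit on the remaining datasets, and AUC is reported per held-out dataset. The names `Sample`, `leave_one_dataset_out`, `auc`, `evaluate_lodo`, and the `fit` callback are hypothetical and chosen for illustration; they are not taken from the paper or from Zeph.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterator, List, NamedTuple, Tuple


class Sample(NamedTuple):
    dataset: str  # name of the source corpus (hypothetical field)
    text: str
    label: int    # 1 = injection, 0 = benign


def leave_one_dataset_out(
    samples: List[Sample],
) -> Iterator[Tuple[str, List[Sample], List[Sample]]]:
    """Yield (held_out, train, test): test is one whole dataset, train is all the others."""
    by_dataset: Dict[str, List[Sample]] = defaultdict(list)
    for s in samples:
        by_dataset[s.dataset].append(s)
    for held_out in sorted(by_dataset):
        train = [s for name, group in by_dataset.items() if name != held_out for s in group]
        yield held_out, train, by_dataset[held_out]


def auc(scores: List[float], labels: List[int]) -> float:
    """ROC AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate_lodo(
    samples: List[Sample],
    fit: Callable[[List[Sample]], Callable[[str], float]],
) -> Dict[str, float]:
    """Fit on all-but-one dataset, score the held-out one, and return AUC per held-out dataset."""
    results: Dict[str, float] = {}
    for held_out, train, test in leave_one_dataset_out(samples):
        score = fit(train)  # fit returns a function mapping text to an injection score
        results[held_out] = auc([score(s.text) for s in test], [s.label for s in test])
    return results
```

Because the held-out dataset never contributes to training, stylistic artifacts specific to that dataset cannot be learned as shortcuts, which is the effect a shared train/test split hides.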
Applicability to Zeph
- Direct: Zeph's zeph-sanitizer uses protectai/deberta-v3-small-prompt-injection-v2, a benchmark-trained model subject to exactly this shortcut problem. Issue #2292 (sanitizer classifier 401 on HuggingFace download; regex fallback blocks benign queries) is compounded: even when the model is available, its real-world recall for indirect injections may be as low as 7%.
- Action: Use leave-one-dataset-out validation when evaluating classifier updates. Treat the classifier as a soft signal rather than a hard gate, as recommended in the MELON issue #2193 (DeBERTa injection detectors have high FPR; use as soft signal only; arXiv:2502.05174). Consider calibration with two thresholds: if the model score reaches threshold_soft, log WARN; block only if the score exceeds threshold_hard.
- Design: With 7–37% recall, roughly 63–93% of indirect injections pass through the classifier, i.e. at least about two thirds. Defense-in-depth (LLM refusal as a second layer) is therefore not optional; it is load-bearing. Document this explicitly in the sanitizer architecture.
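The two-threshold calibration proposed in the Action item could be sketched as follows. This is an illustration only: it assumes a higher score means "more likely an injection", reads threshold_soft as the point where warnings start, and uses placeholder default values (0.5 and 0.95) that would need to be calibrated on held-out data. `Verdict` and `gate` are hypothetical names, not part of zeph-sanitizer.

```python
import logging
from enum import Enum

logger = logging.getLogger("sanitizer")


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"


def gate(score: float, threshold_soft: float = 0.5, threshold_hard: float = 0.95) -> Verdict:
    """Map a classifier score in [0, 1] to a verdict using a soft and a hard threshold.

    The default thresholds are placeholders, not calibrated values.
    """
    if not 0.0 <= threshold_soft <= threshold_hard <= 1.0:
        raise ValueError("expected 0 <= threshold_soft <= threshold_hard <= 1")
    if score > threshold_hard:
        # Only very confident detections act as a hard gate.
        return Verdict.BLOCK
    if score >= threshold_soft:
        # Soft signal: record the suspicion but let the request continue.
        logger.warning("possible prompt injection (score=%.3f)", score)
        return Verdict.WARN
    return Verdict.ALLOW
```

Note that ALLOW and WARN both hand the input on to the next layer (the LLM's own refusal behavior), so a low classifier score is never treated as proof that the input is benign.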