research(security): classifier benchmark inflation — 8.4pt AUC gap under distribution shift, 7-37% detection of indirect injections (arXiv:2602.14161)

## Paper

**arXiv:2602.14161** — *When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift*

## Key Finding

Standard benchmark evaluation inflates injection classifier AUC by **8.4 percentage points** due to dataset shortcuts (shared stylistic artifacts between train/test). Under rigorous leave-one-dataset-out testing, production guardrails detect indirect agent-targeted injections at only **7–37% recall**. The gap is largest for indirect/embedding-injected attacks vs direct injections.

## Applicability to Zeph

- **Direct**: Zeph's `zeph-sanitizer` uses `protectai/deberta-v3-small-prompt-injection-v2` — a benchmark-trained model subject to exactly this shortcut problem. Issue #2292 (classifier 401 → regex FP) is compounded: even when the model *is* available, its real-world recall for indirect injections may be near 7%.
- **Action**: Use leave-one-dataset-out validation when evaluating classifier updates. Treat classifier as a **soft signal** (as recommended by MELON #2193) rather than a hard gate. Consider calibration: if model score < threshold_soft, log WARN; only block if score > threshold_hard.
- **Design**: The 7–37% recall gap means ~2/3 of indirect injections pass through. Defense-in-depth (LLM refusal as second layer) is not optional — it's load-bearing. Document this explicitly in sanitizer architecture.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

research(security): classifier benchmark inflation — 8.4pt AUC gap under distribution shift, 7-37% detection of indirect injections (arXiv:2602.14161) #2303

Paper

Key Finding

Applicability to Zeph

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

research(security): classifier benchmark inflation — 8.4pt AUC gap under distribution shift, 7-37% detection of indirect injections (arXiv:2602.14161) #2303

Description

Paper

Key Finding

Applicability to Zeph

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions