research(security): classifier benchmark inflation — 8.4pt AUC gap under distribution shift, 7-37% detection of indirect injections (arXiv:2602.14161) #2303

@bug-ops

Description

Paper

arXiv:2602.14161, "When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift"

Key Finding

Standard benchmark evaluation inflates injection-classifier AUC by 8.4 percentage points because of dataset shortcuts (stylistic artifacts shared between the train and test splits). Under rigorous leave-one-dataset-out testing, production guardrails detect indirect, agent-targeted injections at only 7–37% recall. The gap is larger for indirect/embedding-injected attacks than for direct injections.
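To make the evaluation protocol concrete, here is a minimal sketch of a leave-one-dataset-out split plus a dependency-free AUC helper. It is not the paper's code: the `Sample` layout and the function names `leave_one_dataset_out` and `roc_auc` are assumptions made for illustration. The point is that each fold holds out an entire source dataset, so the classifier cannot score well by recognizing that dataset's stylistic artifacts.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple

# Hypothetical sample layout: (source_dataset, text, label), label 1 = malicious.
Sample = Tuple[str, str, int]


def leave_one_dataset_out(
    samples: Sequence[Sample],
) -> Iterator[Tuple[str, List[Sample], List[Sample]]]:
    """Yield (held_out_name, train, test), holding out one whole dataset per fold."""
    by_dataset: Dict[str, List[Sample]] = defaultdict(list)
    for sample in samples:
        by_dataset[sample[0]].append(sample)
    for held_out in sorted(by_dataset):
        # Train on every dataset except the held-out one.
        train = [
            s
            for name, group in by_dataset.items()
            if name != held_out
            for s in group
        ]
        yield held_out, train, list(by_dataset[held_out])


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of samples, which is
    acceptable for a sketch but not for large evaluation sets.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Comparing the mean `roc_auc` over these folds with the AUC from a random train/test split of the pooled data gives the kind of gap the paper reports. Any classifier can be plugged in between the split and the scoring step.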

Applicability to Zeph

Metadata

Assignees

Labels

P2: High value, medium complexity
research: Research-driven improvement
security: Security-related issue

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
