bug(security): sanitizer classifier 401 on HuggingFace download — regex fallback blocks benign queries #2292

@bug-ops

Summary

During A2A daemon live testing (CI-238), the injection sanitizer fails to load the classifier model with a 401 error, then falls back to a regex-based classifier that produces false positives blocking benign user messages.

Log evidence

WARN zeph_sanitizer: classifier inference error, falling back to regex
  error=model loading failed: failed to download config.json from
  protectai/deberta-v3-small-prompt-injection-v2: request error: http status: 401

Steps to reproduce

  1. Run ./target/debug/zeph --config .local/config/testing.toml --daemon
  2. Send POST to http://localhost:8080/a2a with:
    {"jsonrpc":"2.0","id":1,"method":"message/send","params":{"message":{"role":"user","parts":[{"kind":"text","text":"hello, who are you?"}]}}}
  3. Repeat step 2 with the text "what is 2 + 2?"; it is also blocked.

Expected

  • Classifier downloads successfully (HuggingFace token resolved from vault), or
  • Regex fallback only blocks genuine injection patterns, not benign greetings/arithmetic
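
The second expectation can be pinned down as a regression check. The sketch below is a minimal stand-in for the fallback, not the actual zeph_sanitizer code: the phrase list and the function name `fallback_flags_injection` are hypothetical, and plain substring matching is used in place of regexes to keep the example dependency-free.

```rust
/// Hypothetical phrase list; the real fallback patterns are not shown in this issue.
const INJECTION_PHRASES: &[&str] = &[
    "ignore previous instructions",
    "ignore all previous instructions",
    "disregard the system prompt",
    "reveal your system prompt",
];

/// Returns true only when the input contains one of the listed phrases.
/// Matching is case-insensitive and collapses runs of whitespace.
pub fn fallback_flags_injection(input: &str) -> bool {
    let normalized = input
        .to_lowercase()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ");
    INJECTION_PHRASES.iter().any(|p| normalized.contains(*p))
}
```

Whatever the real patterns look like, the two messages from this report ("hello, who are you?" and "what is 2 + 2?") are natural negative cases for a test suite, alongside at least one known-bad input.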

Actual

Both "hello, who are you?" and "what is 2 + 2?" are blocked with:

{"kind":"text","text":"[security] Input blocked: injection detected by classifier."}

Root causes to investigate

  1. The 401 suggests protectai/deberta-v3-small-prompt-injection-v2 is being requested without valid HuggingFace credentials. Check whether ZEPH_HF_TOKEN or an equivalent is in the vault and passed to the downloader.
  2. Regex fallback patterns are too aggressive: a pattern that matches basic greetings or arithmetic is miscalibrated.
  3. Behavior is non-deterministic: the same "what is 2 + 2?" query passed once (no artifacts) and was blocked another time, which suggests a race condition in classifier initialization.
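
If item 3 turns out to be an initialization race, one way to make the backend choice deterministic is to decide it exactly once behind `std::sync::OnceLock`. This is a sketch under that assumption, not a diagnosis: `Backend`, `try_load_model`, and `LOAD_ATTEMPTS` are hypothetical names, and the loader is hard-wired to fail the way the log shows.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Model,
    RegexFallback,
}

// Counts load attempts so a test can check that initialization ran once.
static LOAD_ATTEMPTS: AtomicUsize = AtomicUsize::new(0);
static BACKEND: OnceLock<Backend> = OnceLock::new();

/// Hypothetical stand-in for the model download; always fails like the 401 in the log.
fn try_load_model() -> Result<(), String> {
    LOAD_ATTEMPTS.fetch_add(1, Ordering::SeqCst);
    Err("http status: 401".to_string())
}

/// Every caller, on any thread, observes the same backend decision.
pub fn backend() -> Backend {
    *BACKEND.get_or_init(|| match try_load_model() {
        Ok(()) => Backend::Model,
        Err(_) => Backend::RegexFallback,
    })
}
```

`get_or_init` blocks concurrent callers until the first initializer finishes, so no request can observe a half-initialized classifier. The trade-off is that a transient download failure pins the process to the fallback until restart.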

Config used

.local/config/testing.toml, [sanitizer] section with default settings (no explicit HuggingFace token config)
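
Since no token is configured, it may help to check what the downloader would send. The Hugging Face Hub accepts tokens as an `Authorization: Bearer <token>` header. The sketch below assumes the token is exposed through an environment variable named ZEPH_HF_TOKEN (the name floated above; the real vault lookup is not shown in this issue), and both function names are hypothetical.

```rust
use std::env;

/// Builds the Authorization header value, or None if the token is missing or blank.
pub fn bearer_header(token: Option<&str>) -> Option<String> {
    let token = token?.trim();
    if token.is_empty() {
        None
    } else {
        Some(format!("Bearer {token}"))
    }
}

/// Assumption: the token is available as the ZEPH_HF_TOKEN environment variable.
pub fn hf_auth_header_from_env() -> Option<String> {
    bearer_header(env::var("ZEPH_HF_TOKEN").ok().as_deref())
}
```

A `None` result here would be consistent with the 401 in the log, since the request would go out without credentials.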

Metadata

Labels

P1 (High ROI, low complexity; do next sprint), bug (Something isn't working)
