
feat(classifiers): Candle-backed injection classifier infrastructure (#2185) #2198

Merged
bug-ops merged 4 commits into main from feat-classifiers-replace-regex
Mar 27, 2026

Conversation

Owner

@bug-ops bug-ops commented Mar 27, 2026

Summary

  • Add ClassifierBackend async trait and CandleClassifier using deberta-v3-small-prompt-injection-v2 for ML-backed prompt injection detection
  • Object-safe async trait (Pin<Box<dyn Future>>) with MockClassifierBackend for tests
  • Token-based chunking (448/64 overlap); inference via tokio::task::spawn_blocking; "positive wins" aggregation — any injection-positive chunk propagates regardless of SAFE chunk scores
  • ContentSanitizer::classify_injection() async method separate from sync sanitize(); falls back to detect_injections() regex on timeout/error
  • Wired into agent loop in process_user_message_inner; activated via [classifiers] enabled = true in config
  • zeph classifiers download CLI subcommand for model pre-caching
  • Feature classifiers (disabled by default, implies candle); included in full feature
  • --migrate-config adds [classifiers] section to existing configs
  • Credential patterns (sk-, AKIA, ghp_, Bearer) kept as regex — no ML needed
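The object-safe trait shape described above can be sketched as follows. `Classification`, the `String` error type, and the mock's fields are illustrative assumptions rather than the crate's actual definitions; the `poll_ready` helper exists only to drive the immediately-ready mock future in this example.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// Illustrative result type (assumption; not the crate's actual struct).
#[derive(Debug, Clone)]
pub struct Classification {
    pub injection: bool,
    pub score: f32,
}

/// Object-safe async trait: returning `Pin<Box<dyn Future>>` instead of
/// using `async fn` keeps the trait usable as `dyn ClassifierBackend`.
pub trait ClassifierBackend: Send + Sync {
    fn classify<'a>(
        &'a self,
        text: &'a str,
    ) -> Pin<Box<dyn Future<Output = Result<Classification, String>> + Send + 'a>>;
}

/// Test double in the spirit of the PR's `MockClassifierBackend`.
pub struct MockClassifierBackend {
    pub fixed: Classification,
}

impl ClassifierBackend for MockClassifierBackend {
    fn classify<'a>(
        &'a self,
        _text: &'a str,
    ) -> Pin<Box<dyn Future<Output = Result<Classification, String>> + Send + 'a>> {
        let out = self.fixed.clone();
        Box::pin(async move { Ok(out) })
    }
}

/// Poll an already-ready future once with a no-op waker (demo helper only).
pub fn poll_ready<T>(mut fut: Pin<Box<dyn Future<Output = T> + Send + '_>>) -> T {
    fn noop(_: *const ()) {}
    fn clone(p: *const ()) -> RawWaker {
        RawWaker::new(p, &VTABLE)
    }
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) };
    match fut.as_mut().poll(&mut Context::from_waker(&waker)) {
        Poll::Ready(v) => v,
        Poll::Pending => panic!("mock future should be immediately ready"),
    }
}
```

Boxing the returned future is the standard way to keep an async trait object-safe, at the cost of one allocation per call; callers can then hold a `Box<dyn ClassifierBackend>` and swap the Candle backend for the mock in tests.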

Configuration

[classifiers]
enabled = false           # set to true to activate
timeout_ms = 5000
injection_model = "protectai/deberta-v3-small-prompt-injection-v2"
injection_threshold = 0.8
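A plain-Rust shape for this section might look like the following. Field names and defaults follow the TOML sample above; the exact serde attributes of the real `ClassifiersConfig` in `zeph-config` are not shown here, so the derive set is an assumption.

```rust
/// Sketch of the `[classifiers]` section; in the real crate this would
/// additionally derive `serde::Deserialize`, likely with `#[serde(default)]`
/// so partial config overrides work.
#[derive(Debug, Clone, PartialEq)]
pub struct ClassifiersConfig {
    pub enabled: bool,
    pub timeout_ms: u64,
    pub injection_model: String,
    pub injection_threshold: f32,
}

impl Default for ClassifiersConfig {
    fn default() -> Self {
        Self {
            enabled: false,
            timeout_ms: 5_000,
            injection_model: "protectai/deberta-v3-small-prompt-injection-v2".to_string(),
            injection_threshold: 0.8,
        }
    }
}
```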

Pre-cache the model before enabling:

zeph classifiers download

Test plan

  • cargo nextest run --workspace --features full --lib --bins — 6593 passed
  • cargo nextest run --workspace --features full,classifiers --lib --bins — 6593 passed
  • cargo +nightly fmt --check — clean
  • New tests: 13 classifier tests covering threshold boundary, positive-wins aggregation, error/timeout fallback, disabled fallback
  • ClassifiersConfig serde tests: default deserialization, partial override, roundtrip
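The chunking and positive-wins behavior those tests exercise can be sketched with plain functions. This is a simplification over raw token ids (the real implementation chunks tokenizer output), and whether the threshold boundary is inclusive is an assumption here.

```rust
/// "Positive wins" aggregation over per-chunk scores: any chunk at or above
/// the threshold marks the whole input as an injection, regardless of how
/// many other chunks scored SAFE. (Inclusive boundary shown as an assumption.)
fn aggregate_positive_wins(chunk_scores: &[f32], threshold: f32) -> bool {
    chunk_scores.iter().any(|&s| s >= threshold)
}

/// Token-based chunking with overlap, in the spirit of the PR's
/// 448-token chunks with 64-token overlap.
fn chunk_tokens(tokens: &[u32], chunk_len: usize, overlap: usize) -> Vec<Vec<u32>> {
    assert!(overlap < chunk_len, "overlap must be smaller than chunk length");
    let stride = chunk_len - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + chunk_len).min(tokens.len());
        chunks.push(tokens[start..end].to_vec());
        if end == tokens.len() {
            break;
        }
        start += stride;
    }
    chunks
}
```

The overlap means an injection phrase straddling a chunk boundary is still seen whole by at least one chunk, which is what makes the positive-wins aggregation safe to apply per chunk.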

Phase 2 (tracked separately)

  • OnnxClassifier via ort for 3-5x faster CPU inference
  • PII detection via iiiorg/piiranha-v1-detect-personal-information
  • LlmClassifier zero-shot for feedback detection
  • injection_threshold and model hash verification config fields
  • TUI spinner during model load, --init wizard integration

Closes #2185

@github-actions github-actions bot added labels: documentation (Improvements or additions to documentation), llm (zeph-llm crate: Ollama, Claude), rust (Rust code changes), core (zeph-core crate), dependencies (Dependency updates), config (Configuration file changes), enhancement (New feature or request), size/XL (Extra large PR, 500+ lines) on Mar 27, 2026
@bug-ops bug-ops enabled auto-merge (squash) March 27, 2026 02:44
bug-ops added 4 commits March 27, 2026 04:07
Introduce a `ClassifierBackend` trait and `CandleClassifier` implementation
that replace regex heuristics with a lightweight DeBERTa-v3-small model for
prompt injection detection (feature `classifiers`, disabled by default).

- Add `crates/zeph-llm/src/classifier/` with `ClassifierBackend` object-safe
  async trait (`Pin<Box<dyn Future>>` for object safety) and `CandleClassifier`
  loading `protectai/deberta-v3-small-prompt-injection-v2` lazily via OnceLock;
  token-based chunking (448 tokens / 64 overlap); inference via
  `tokio::task::spawn_blocking`; "positive wins" aggregation ensures any
  injection-positive chunk propagates regardless of SAFE chunk scores
- Add `ClassifiersConfig` in `zeph-config` with `enabled`, `timeout_ms`,
  `injection_model`, and `injection_threshold` fields; `--migrate-config`
  adds `[classifiers]` section to existing configs automatically
- Add `ContentSanitizer::classify_injection()` async method (separate from
  sync `sanitize()`); on error/timeout falls back to `detect_injections()`
  regex preserving the security baseline
- Wire into agent loop: `process_user_message_inner` calls
  `classify_injection()` when `classifiers.enabled = true`; wired in
  `runner.rs` via `apply_injection_classifier()` alongside guardrail
- Add `zeph classifiers download` CLI subcommand for pre-caching models
- Include `classifiers` in `full` feature so CI compiles all guarded paths
- Fix stale expected error string in `bootstrap/tests.rs` surfaced by
  `--features full,classifiers` compilation

Closes #2185
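The error/timeout fallback described above can be sketched as a pure decision function. The names here are illustrative; the real code wraps the classifier call in a timeout and falls back to the regex `detect_injections()` baseline on failure.

```rust
/// Decide the injection verdict from the ML classifier outcome, falling
/// back to the regex baseline on error or timeout (sketch; names assumed).
fn classify_with_fallback(
    ml_result: Result<bool, String>, // Ok(verdict) or Err(timeout/model error)
    regex_detects_injection: impl Fn() -> bool,
) -> bool {
    match ml_result {
        // ML classifier produced a verdict: trust it.
        Ok(flag) => flag,
        // Timeout or model error: preserve the security baseline via regex.
        Err(_) => regex_detects_injection(),
    }
}
```

Keeping the regex path as the failure mode means enabling the classifier can only add detections on the happy path; an unavailable or slow model never weakens the existing baseline.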
…ndle_provider

Fixes surfaced by adding classifiers to the full feature, which transitively
enables candle and exposes these pre-existing lint failures in CI:

- candle_whisper.rs: doc_markdown (HuggingFace), items_after_statements
  (MAX_DECODE_TOKENS), cast_precision_loss (audio duration and channel
  averaging), similar_names (decoder/decoded renamed to audio_buf),
  cast_possible_truncation (SAMPLE_RATE as u32 via named binding),
  unnecessary_qualification (use Channels::count as method reference),
  missing MediaSourceStreamOptions import for explicit default()
- candle_provider/embed.rs: items_after_statements (MAX_HEADER moved
  before first statement)
- candle_provider/mod.rs: unnecessary_literal_bound (&str -> &'static str)
Resolves `clippy::needless_pass_by_value`: `source` was passed by value but only used as `&source` inside the function body.
- Wrap HuggingFace in backticks in classifiers.rs doc comments
- Remove needless borrow on config in runner.rs apply_injection_classifier call
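For context, the `needless_pass_by_value` fix mentioned above follows this general before/after pattern (hypothetical function, not the actual code from the PR):

```rust
// Before: clippy::needless_pass_by_value fires because `source` is taken
// by value but only ever borrowed inside the body.
fn describe_by_value(source: String) -> usize {
    let s: &str = &source;
    s.len()
}

// After: take `&str` directly, avoiding a forced move or clone at call sites.
fn describe_by_ref(source: &str) -> usize {
    source.len()
}
```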

Development

Successfully merging this pull request may close these issues.

feat(classifiers): replace regex heuristics with Candle-backed lightweight classifiers
