feat(classifiers): Candle-backed injection classifier infrastructure (#2185) #2198
Merged
Conversation
Introduce a `ClassifierBackend` trait and a `CandleClassifier` implementation that replace regex heuristics with a lightweight DeBERTa-v3-small model for prompt injection detection (feature `classifiers`, disabled by default).

- Add `crates/zeph-llm/src/classifier/` with an object-safe async `ClassifierBackend` trait (`Pin<Box<dyn Future>>` for object safety) and a `CandleClassifier` that lazily loads `protectai/deberta-v3-small-prompt-injection-v2` via `OnceLock`; token-based chunking (448 tokens / 64 overlap); inference via `tokio::task::spawn_blocking`; "positive wins" aggregation ensures any injection-positive chunk propagates regardless of SAFE chunk scores (see the sketch after this list)
- Add `ClassifiersConfig` in `zeph-config` with `enabled`, `timeout_ms`, `injection_model`, and `injection_threshold` fields; `--migrate-config` adds a `[classifiers]` section to existing configs automatically
- Add the async `ContentSanitizer::classify_injection()` method (separate from the sync `sanitize()`); on error or timeout it falls back to the `detect_injections()` regex, preserving the security baseline
- Wire into the agent loop: `process_user_message_inner` calls `classify_injection()` when `classifiers.enabled = true`; hooked up in `runner.rs` via `apply_injection_classifier()` alongside the guardrail
- Add a `zeph classifiers download` CLI subcommand for pre-caching models
- Include `classifiers` in the `full` feature so CI compiles all guarded paths
- Fix a stale expected error string in `bootstrap/tests.rs` surfaced by `--features full,classifiers` compilation

Closes #2185
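As a rough orientation for reviewers, here is a minimal sketch of the object-safe trait shape and the "positive wins" aggregation. Everything except the `ClassifierBackend` name is a hypothetical stand-in (`InjectionVerdict`, `classify`, `aggregate`); the real definitions live in `crates/zeph-llm/src/classifier/`.

```rust
use std::future::Future;
use std::pin::Pin;

/// Hypothetical per-chunk verdict; field names are illustrative.
pub struct InjectionVerdict {
    pub injection: bool,
    pub score: f32,
}

/// Boxed-future alias so the trait stays object-safe (`dyn ClassifierBackend`).
pub type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Object-safe async trait: returning a boxed future instead of using
/// `async fn` keeps the trait usable behind `dyn`, e.g. for a mock backend in tests.
pub trait ClassifierBackend: Send + Sync {
    fn classify<'a>(
        &'a self,
        text: &'a str,
    ) -> BoxFuture<'a, Result<InjectionVerdict, Box<dyn std::error::Error + Send + Sync>>>;
}

/// "Positive wins" aggregation: any injection-positive chunk marks the whole
/// input positive, regardless of how confidently other chunks scored SAFE.
pub fn aggregate(chunks: &[InjectionVerdict]) -> InjectionVerdict {
    chunks
        .iter()
        .find(|v| v.injection)
        .map(|v| InjectionVerdict { injection: true, score: v.score })
        .unwrap_or_else(|| InjectionVerdict {
            injection: false,
            score: chunks.iter().map(|v| v.score).fold(0.0, f32::max),
        })
}
```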
…ndle_provider

Fixes surfaced by adding classifiers to the `full` feature, which transitively enables `candle` and exposes these pre-existing lint failures in CI:

- `candle_whisper.rs`: `doc_markdown` (HuggingFace), `items_after_statements` (`MAX_DECODE_TOKENS`), `cast_precision_loss` (audio duration and channel averaging), `similar_names` (`decoder`/`decoded` renamed to `audio_buf`), `cast_possible_truncation` (`SAMPLE_RATE as u32` via a named binding), `unnecessary_qualification` (use `Channels::count` as a method reference), missing `MediaSourceStreamOptions` import for an explicit `default()`
- `candle_provider/embed.rs`: `items_after_statements` (`MAX_HEADER` moved before the first statement)
- `candle_provider/mod.rs`: `unnecessary_literal_bound` (`&str` -> `&'static str`)
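For context, generic illustrations of two of those lints, with placeholder names rather than the actual zeph code:

```rust
// items_after_statements: item declarations go above the first statement.
fn decode_len(tokens: &[u32]) -> usize {
    const MAX_DECODE_TOKENS: usize = 448; // declared before any statement
    tokens.len().min(MAX_DECODE_TOKENS)
}

// unnecessary_literal_bound: a returned string literal can be `&'static str`
// instead of borrowing its lifetime from `&self`.
struct Provider;

impl Provider {
    fn model_name(&self) -> &'static str {
        "deberta-v3-small"
    }
}
```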
Resolves `clippy::needless_pass_by_value`: `source` was passed by value but only used as `&source` inside the function body.
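A generic illustration of the pattern (hypothetical function; the actual call site is in the diff):

```rust
// Before (flagged): takes `source` by value but only reads it,
// forcing callers to clone.
//
//     fn scan(source: String) -> bool {
//         source.contains("ignore previous instructions")
//     }

// After: borrow instead, since the value is never consumed or stored.
fn scan(source: &str) -> bool {
    source.contains("ignore previous instructions")
}
```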
- Wrap HuggingFace in backticks in classifiers.rs doc comments
- Remove needless borrow on `config` in the runner.rs `apply_injection_classifier` call
Summary
- `ClassifierBackend` async trait and `CandleClassifier` using `deberta-v3-small-prompt-injection-v2` for ML-backed prompt injection detection
- Object-safe (`Pin<Box<dyn Future>>`) with `MockClassifierBackend` for tests
- Inference via `tokio::task::spawn_blocking`; "positive wins" aggregation — any injection-positive chunk propagates regardless of SAFE chunk scores
- `ContentSanitizer::classify_injection()` async method separate from the sync `sanitize()`; falls back to the `detect_injections()` regex on timeout/error
- Wired into `process_user_message_inner`; activated via `[classifiers] enabled = true` in config
- `zeph classifiers download` CLI subcommand for model pre-caching
- Feature `classifiers` (disabled by default, implies `candle`); included in the `full` feature
- `--migrate-config` adds a `[classifiers]` section to existing configs

Configuration
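A minimal config sketch. The field names come from `ClassifiersConfig` as described in this PR; the values shown are illustrative, not the shipped defaults:

```toml
[classifiers]
enabled = true
# Model repository pulled for injection detection (named in this PR).
injection_model = "protectai/deberta-v3-small-prompt-injection-v2"
# Illustrative values; the actual defaults are not quoted in this summary.
injection_threshold = 0.8
timeout_ms = 500
```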
Pre-cache model before enabling:
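The subcommand is named in this PR; any additional flags are not shown here:

```sh
zeph classifiers download
```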
Test plan
- `cargo nextest run --workspace --features full --lib --bins` — 6593 passed
- `cargo nextest run --workspace --features full,classifiers --lib --bins` — 6593 passed
- `cargo +nightly fmt --check` — clean
- `ClassifiersConfig` serde tests: default deserialization, partial override, roundtrip

Phase 2 (tracked separately)
- `OnnxClassifier` via `ort` for 3-5x faster CPU inference
- PII detection via `iiiorg/piiranha-v1-detect-personal-information`
- `LlmClassifier` zero-shot for feedback detection
- `injection_threshold` and model hash verification config fields
- `--init` wizard integration

Closes #2185