feat(classifiers): replace regex heuristics with Candle-backed lightweight classifiers #2185
Description
Problem
Zeph currently relies on regex/rule-based heuristics in several critical subsystems. These have well-known failure modes: no context awareness, brittleness to paraphrasing, manual pattern maintenance, and missed semantic variants.
Affected subsystems:
- `FeedbackDetector` (`detector_mode = "regex"`) — detects user corrections/disagreements for skill learning
- Content injection detection — `flag_injection_patterns` regex list in `SecurityConfig`
- PII filter — email/phone/SSN/credit card regex patterns
- `redact_sensitive()` — credential pattern detection (`sk-`, `AKIA`, `ghp_`, `Bearer`)
- ACON compression failure detection — `UNCERTAINTY_PATTERNS` + `PRIOR_CONTEXT_PATTERNS`
- Output filter `SecurityPatterns` — 17 `LazyLock` regexes in tools
Strategic Direction
Replace all heuristic regex classifiers with specialized lightweight models via Candle (HuggingFace). Large LLMs (GPT-4, Claude Opus) should be reserved for complex reasoning and planning. Classification/detection tasks go to dedicated small models.
Architecture principle: every subsystem that currently calls `regex.is_match()` on LLM input/output should instead call a `ClassifierProvider` backed by a Candle model (or zero-shot gpt-4o-mini as fallback). The provider pattern is already established via `[[llm.providers]]` and `*_provider` fields.
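As a rough sketch of that principle (only `ClassifierProvider` and `classify` come from this issue; the result type's fields and the regex-fallback impl are illustrative assumptions, not existing Zeph API):

```rust
/// Result of a classification call. Field names are assumptions for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct ClassificationResult {
    pub label: String,
    pub score: f32,
}

/// Anything that can classify a piece of text: a Candle-backed model,
/// a zero-shot LLM, or the legacy regex tier kept as fallback.
pub trait ClassifierProvider {
    fn classify(&self, text: &str) -> ClassificationResult;
}

/// Regex tier kept as the high-recall fallback. Substring matching stands in
/// for the real compiled-regex list to keep the sketch dependency-free.
pub struct RegexClassifier {
    pub patterns: Vec<&'static str>,
}

impl ClassifierProvider for RegexClassifier {
    fn classify(&self, text: &str) -> ClassificationResult {
        let hit = self.patterns.iter().any(|p| text.contains(p));
        let label = if hit { "flagged" } else { "clean" };
        ClassificationResult {
            label: label.to_string(),
            // Pattern matches have no calibrated score; use 1.0/0.0.
            score: if hit { 1.0 } else { 0.0 },
        }
    }
}
```

Because every backend sits behind the same trait, the timeout fallback from regex to model (and back) becomes a swap of trait objects rather than branching logic in each subsystem.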
Research Findings (CI-178)
Four-tier hierarchy from the 2024–2026 literature:
| Tier | Approach | Latency | Best for |
|---|---|---|---|
| 1 | Regex pre-filter | <1ms | High-recall first pass (keep as fallback) |
| 2 | Linear probe on LLM activations | <10ms | Injection, PII when host model available |
| 3 | Fine-tuned small transformer via Candle | 50–200ms | All tasks with HuggingFace models |
| 4 | Zero-shot gpt-4o-mini prompt | 200ms+ | Cold-start, no labeled data available |
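The tiers above compose naturally as a cascade: each tier either returns a confident verdict or defers to the next, slower tier. A minimal sketch, assuming a hypothetical `Verdict` type (not from the issue):

```rust
/// Outcome of a classification cascade. Variant names are illustrative.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Clean,
    Flagged,
}

/// Run tiers cheapest-first. A tier returns `Some(verdict)` when it is
/// confident enough to decide, or `None` to defer to the next tier.
pub fn cascade(tiers: &[fn(&str) -> Option<Verdict>], text: &str) -> Verdict {
    for tier in tiers {
        if let Some(verdict) = tier(text) {
            return verdict;
        }
    }
    // No tier produced a confident verdict; default to permissive.
    Verdict::Clean
}
```

This keeps the <1ms regex pre-filter on the hot path while reserving the 50–200ms model call for inputs the cheap tiers cannot decide.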
Recommended Models (HuggingFace / Candle-compatible)
| Task | Model | Size | Notes |
|---|---|---|---|
| Injection detection | mDeBERTa-v3-base-prompt-injection-v2 | 280MB | Multi-label, production-grade |
| Content safety | Llama-Guard-3-1B | 1B | Meta canonical agent safeguard |
| PII detection | piiranha-v1-detect-personal-information | 300MB | BERT-based, 300k+ downloads |
| Feedback/correction | zero-shot gpt-4o-mini | API | No labeled dataset; bootstrap with synthetic data |
| General safety (lightweight) | DeBERTa-v3 + LEC head (arXiv:2412.13435) | 0.5–3B | Qwen 0.5B backbone, fast after warmup |
Key Papers
- arXiv:2412.13435 — LEC (Layer Enhanced Classification): logistic regression on intermediate transformer layer activations, surpasses GPT-4o with <100 examples, Qwen 0.5B–3B backbone. Most applicable to Zeph.
- arXiv:2510.14005 — PIShield: linear probe on residual stream, no fine-tuning, covers injection detection
- arXiv:2510.07551 — RECAP Hybrid PII: regex fast path + LLM second pass, 82% better than NER-only
- arXiv:2312.06674 — Llama Guard 3: Meta production safety classifier for agent I/O
- arXiv:2509.23994 — AI Agent Code of Conduct: policy-as-prompt enforcement via LLM classifier
- arXiv:2510.09781 — Safiron: pre-execution guardian model for agentic plans
Implementation Plan
Phase 1 — ClassifierProvider abstraction
- Add `ClassifierProvider` trait to `zeph-core` with `classify(text) -> ClassificationResult`
- Candle backend: load ONNX/safetensors from HuggingFace cache
- Fallback: zero-shot LLM via existing `[[llm.providers]]`
- Config: `[classifiers]` section with `injection_provider`, `pii_provider`, `feedback_provider`, `safety_provider`
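The `[classifiers]` section could deserialize into a struct like the following; the field names are from this issue, while the default provider strings and the timeout field are invented for illustration:

```rust
/// Hypothetical shape of the proposed [classifiers] config section.
/// Default values below are placeholders, not decided defaults.
#[derive(Debug, Clone)]
pub struct ClassifiersConfig {
    pub injection_provider: String,
    pub pii_provider: String,
    pub feedback_provider: String,
    pub safety_provider: String,
    /// Per-call budget before falling back to the regex tier (milliseconds).
    pub timeout_ms: u64,
}

impl Default for ClassifiersConfig {
    fn default() -> Self {
        Self {
            injection_provider: "candle:mdeberta-v3-base-prompt-injection-v2".into(),
            pii_provider: "hybrid:regex+piiranha".into(),
            feedback_provider: "llm:gpt-4o-mini".into(),
            safety_provider: "candle:llama-guard-3-1b".into(),
            timeout_ms: 200,
        }
    }
}
```

A `backend:model` string convention would let each `*_provider` field select either a local Candle model or a zero-shot LLM from `[[llm.providers]]` without new config machinery.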
Phase 2 — FeedbackDetector migration
- Replace `detector_mode = "regex"` with `detector_mode = "model"`
- Zero-shot gpt-4o-mini or a fine-tuned small model; keep regex as an offline fallback
Phase 3 — Injection detection
- Replace the `flag_injection_patterns` regex list with mDeBERTa-v3-base-prompt-injection-v2 via Candle
- A score threshold replaces the pattern list
Phase 4 — PII filter
- Hybrid: keep the regex fast path, add a piiranha/NER second pass for contextual PII
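The hybrid flow can be sketched as follows; the crude email check stands in for the existing email/phone/SSN/credit-card regexes, and `model_pass` stands in for the piiranha/NER call (both names are illustrative):

```rust
/// Hybrid PII detection: cheap pattern fast path first, model second pass
/// only when the fast path finds nothing. Names here are illustrative.
pub fn contains_pii(text: &str, model_pass: impl Fn(&str) -> bool) -> bool {
    // Fast path: structured PII that regexes already handle well.
    let fast_hit = looks_like_email(text);
    // Second pass: contextual PII (names, addresses) the regex tier misses.
    fast_hit || model_pass(text)
}

/// Crude stand-in for the real compiled email regex, to keep the sketch
/// dependency-free: a word containing '@' followed later by a '.'.
fn looks_like_email(text: &str) -> bool {
    text.split_whitespace().any(|w| {
        matches!(w.find('@'), Some(i) if i > 0 && w[i + 1..].contains('.'))
    })
}
```

Short-circuiting on the fast path preserves today's latency for the common case and only pays the model cost when the regexes come up empty, which matches the RECAP-style regex-plus-LLM design cited above.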
Notes
- Candle backend (`zeph-candle` crate) already exists but is underutilized — large selection of ready HuggingFace models
- All classifier calls must be async with configurable timeouts; fall back to regex on timeout
- Models cached in `~/.cache/zeph/classifiers/` on first use
- Privacy: classifier models MUST run locally via Candle by default — no external API for PII/injection data
- Future direction: many specialized lightweight models per task > one large LLM for everything
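The timeout-with-regex-fallback rule from the notes can be sketched as follows. Real Zeph code would presumably be async (e.g. `tokio::time::timeout` around the model call); this std-only version shows the same control flow with a worker thread, and all names are assumptions:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run a (possibly slow) model classifier under a time budget. If the model
/// does not answer in time, return the cheap regex verdict instead, so the
/// pipeline degrades to today's behavior rather than stalling.
pub fn classify_with_budget<F>(model: F, regex_verdict: bool, budget: Duration) -> bool
where
    F: FnOnce() -> bool + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be dropped if we timed out; ignore the error.
        let _ = tx.send(model());
    });
    rx.recv_timeout(budget).unwrap_or(regex_verdict)
}
```

With the 50–200ms model latencies from the research table, a budget around the configured timeout keeps worst-case classifier latency bounded while still preferring the model's verdict whenever it arrives in time.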