feat(classifiers): replace regex heuristics with Candle-backed lightweight classifiers #2185

@bug-ops

Description

Problem

Zeph currently uses regex/rule-based heuristics in several critical subsystems. These have well-known failure modes: no context awareness, brittleness to paraphrasing, ongoing manual pattern maintenance, and missed semantic variants.

Affected subsystems:

  • FeedbackDetector (detector_mode = "regex") — detects user corrections/disagreements for skill learning
  • Content injection detection — flag_injection_patterns regex list in SecurityConfig
  • PII filter — email/phone/SSN/credit card regex patterns
  • redact_sensitive() — credential pattern detection (sk-, AKIA, ghp_, Bearer)
  • ACON compression failure detection — UNCERTAINTY_PATTERNS + PRIOR_CONTEXT_PATTERNS
  • Output filter SecurityPatterns — 17 LazyLock regex in tools

Strategic Direction

Replace all heuristic regex classifiers with specialized lightweight models via Candle (HuggingFace). Large LLMs (GPT-4, Claude Opus) should be reserved for complex reasoning and planning. Classification/detection tasks go to dedicated small models.

Architecture principle: every subsystem that currently calls regex.is_match() on LLM input/output should instead call a ClassifierProvider backed by a Candle model (or zero-shot gpt-4o-mini as fallback). Provider pattern already established via [[llm.providers]] and *_provider fields.
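As an illustration of this principle, a minimal `ClassifierProvider` trait with a Tier 1 fallback implementation might look like the following. This is a sketch only: the names and shapes are not the final zeph-core API, the real trait would likely be async, and plain substring matching stands in for the regex pre-filter so the example needs no external crates.

```rust
/// Result of a single classification call (illustrative, not final API).
#[derive(Debug, Clone, PartialEq)]
pub struct ClassificationResult {
    pub label: String,
    pub score: f32, // confidence in [0.0, 1.0]
}

/// Backed by a Candle model, a zero-shot LLM, or a regex fallback.
pub trait ClassifierProvider: Send + Sync {
    fn classify(&self, text: &str) -> ClassificationResult;
}

/// Tier 1 fallback: substring matching stands in for the regex
/// pre-filter so this sketch stays dependency-free.
pub struct PatternFallback {
    pub patterns: Vec<&'static str>,
}

impl ClassifierProvider for PatternFallback {
    fn classify(&self, text: &str) -> ClassificationResult {
        let hit = self.patterns.iter().any(|p| text.contains(p));
        ClassificationResult {
            label: if hit { "flagged" } else { "clean" }.to_string(),
            score: if hit { 1.0 } else { 0.0 },
        }
    }
}
```

A Candle-backed implementor and a zero-shot LLM implementor would then be drop-in replacements behind the same trait, selected per subsystem via the `*_provider` config fields.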

Research Findings (CI-178)

Four-tier hierarchy from 2024–2026 literature:

| Tier | Approach | Latency | Best for |
|------|----------|---------|----------|
| 1 | Regex pre-filter | <1ms | High-recall first pass (keep as fallback) |
| 2 | Linear probe on LLM activations | <10ms | Injection, PII when host model available |
| 3 | Fine-tuned small transformer via Candle | 50–200ms | All tasks with HuggingFace models |
| 4 | Zero-shot gpt-4o-mini prompt | 200ms+ | Cold-start, no labeled data available |

Recommended Models (HuggingFace / Candle-compatible)

| Task | Model | Size | Notes |
|------|-------|------|-------|
| Injection detection | mDeBERTa-v3-base-prompt-injection-v2 | 280MB | Multi-label, production-grade |
| Content safety | Llama-Guard-3-1B | 1B | Meta canonical agent safeguard |
| PII detection | piiranha-v1-detect-personal-information | 300MB | BERT-based, 300k+ downloads |
| Feedback/correction | zero-shot gpt-4o-mini | API | No labeled dataset; bootstrap with synthetic data |
| General safety (lightweight) | DeBERTa-v3 + LEC head (arXiv:2412.13435) | 0.5–3B | Qwen 0.5B backbone, fast after warmup |

Key Papers

  • arXiv:2412.13435 — LEC (Layer Enhanced Classification): logistic regression on intermediate transformer layer activations, surpasses GPT-4o with <100 examples, Qwen 0.5B–3B backbone. Most applicable to Zeph.
  • arXiv:2510.14005 — PIShield: linear probe on residual stream, no fine-tuning, covers injection detection
  • arXiv:2510.07551 — RECAP Hybrid PII: regex fast path + LLM second pass, 82% better than NER-only
  • arXiv:2312.06674 — Llama Guard 3: Meta production safety classifier for agent I/O
  • arXiv:2509.23994 — AI Agent Code of Conduct: policy-as-prompt enforcement via LLM classifier
  • arXiv:2510.09781 — Safiron: pre-execution guardian model for agentic plans

Implementation Plan

Phase 1 — ClassifierProvider abstraction

  • Add ClassifierProvider trait to zeph-core with classify(text) -> ClassificationResult
  • Candle backend: load ONNX/safetensors from HuggingFace cache
  • Fallback: zero-shot LLM via existing [[llm.providers]]
  • Config: [classifiers] section with injection_provider, pii_provider, feedback_provider, safety_provider
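A hypothetical shape for that `[classifiers]` section, following the issue's existing TOML config convention. The field names come from the Phase 1 list above; the provider values and the `timeout_ms` key are illustrative assumptions, not an agreed format.

```toml
# Hypothetical [classifiers] section; values are illustrative.
[classifiers]
injection_provider = "candle:mDeBERTa-v3-base-prompt-injection-v2"
pii_provider       = "candle:piiranha-v1-detect-personal-information"
safety_provider    = "candle:Llama-Guard-3-1B"
feedback_provider  = "llm:gpt-4o-mini"   # zero-shot via [[llm.providers]]
timeout_ms         = 200                  # past this, fall back to regex
```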

Phase 2 — FeedbackDetector migration

  • Replace detector_mode = "regex" with detector_mode = "model"
  • Zero-shot gpt-4o-mini or fine-tuned small model; keep regex as offline fallback

Phase 3 — Injection detection

  • Replace flag_injection_patterns regex with mDeBERTa-v3-base-prompt-injection-v2 via Candle
  • Score threshold replaces pattern list

Phase 4 — PII filter

  • Hybrid: keep regex fast path, add piiranha/NER second pass for contextual PII
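The hybrid flow could be sketched as below, under stated assumptions: the fast path here checks only one SSN-shaped token pattern (a stand-in for the real regex set), and `model_pass` is a placeholder closure for the piiranha/NER second pass. Everything is illustrative, not the actual filter code.

```rust
/// Fast path: Some(true) = definite PII, None = undecided, escalate.
/// Only an SSN-shaped token (ddd-dd-dddd) is checked in this sketch.
fn regex_fast_path(text: &str) -> Option<bool> {
    let ssn_like = text.split_whitespace().any(|t| {
        let b = t.as_bytes();
        b.len() == 11
            && b[3] == b'-'
            && b[6] == b'-'
            && b.iter()
                .enumerate()
                .all(|(i, c)| matches!(i, 3 | 6) || c.is_ascii_digit())
    });
    if ssn_like { Some(true) } else { None }
}

/// Hybrid check: cheap structural patterns decide when they can;
/// contextual PII escalates to the model-backed second pass.
fn contains_pii<M>(text: &str, model_pass: M) -> bool
where
    M: Fn(&str) -> bool, // stand-in for the piiranha/NER pass
{
    regex_fast_path(text).unwrap_or_else(|| model_pass(text))
}
```

The design point is that the regex tier keeps its <1ms latency for the obvious structural hits, while only ambiguous text pays the model's 50–200ms cost.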

Notes

  • Candle backend (zeph-candle crate) already exists but underutilized — large selection of ready HuggingFace models
  • All classifier calls must be async with configurable timeouts; fall back to regex on timeout
  • Models cached in ~/.cache/zeph/classifiers/ on first use
  • Privacy: classifier models MUST run locally via Candle by default — no external API for PII/injection data
  • Future direction: many specialized lightweight models per task > one large LLM for everything
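The timeout-then-regex rule in the notes above could be sketched with std-only primitives as follows. A real implementation would presumably use the async runtime (e.g. `tokio::time::timeout` around an async `classify`); the thread-and-channel version here is just a dependency-free illustration, and the substring match again stands in for the regex fallback.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum Verdict { Flagged, Clean }

/// Tier 1 regex pre-filter stand-in (substring match keeps this
/// sketch dependency-free).
fn regex_fallback(text: &str) -> Verdict {
    if text.contains("ignore previous instructions") {
        Verdict::Flagged
    } else {
        Verdict::Clean
    }
}

/// Run the model-backed classifier under a budget; degrade to the
/// regex fallback if it does not answer in time.
pub fn classify_with_timeout<F>(text: &str, budget: Duration, model: F) -> Verdict
where
    F: FnOnce(&str) -> Verdict + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let owned = text.to_string();
    thread::spawn(move || {
        // Receiver may already have timed out; ignore the send error.
        let _ = tx.send(model(&owned));
    });
    rx.recv_timeout(budget).unwrap_or_else(|_| regex_fallback(text))
}
```

Note the degraded path trades the model's semantic coverage for the regex tier's high-recall pattern list, so a timeout never silently disables detection.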

Metadata

Labels

  • P2 — High value, medium complexity
  • enhancement — New feature or request
  • research — Research-driven improvement
