feat(classifiers): replace regex heuristics with Candle-backed lightweight classifiers #2185

@bug-ops

Description

Problem

Zeph currently uses regex/rule-based heuristics in several critical subsystems. These have well-known failure modes: no context awareness, brittleness to paraphrasing, ongoing manual pattern maintenance, and missed semantic variants.

Affected subsystems:

  • FeedbackDetector (detector_mode = "regex") — detects user corrections/disagreements for skill learning
  • Content injection detection — flag_injection_patterns regex list in SecurityConfig
  • PII filter — email/phone/SSN/credit card regex patterns
  • redact_sensitive() — credential pattern detection (sk-, AKIA, ghp_, Bearer)
  • ACON compression failure detection — UNCERTAINTY_PATTERNS + PRIOR_CONTEXT_PATTERNS
  • Output filter SecurityPatterns — 17 LazyLock regex in tools

Strategic Direction

Replace all heuristic regex classifiers with specialized lightweight models via Candle (HuggingFace). Large LLMs (GPT-4, Claude Opus) should be reserved for complex reasoning and planning. Classification/detection tasks go to dedicated small models.

Architecture principle: every subsystem that currently calls regex.is_match() on LLM input/output should instead call a ClassifierProvider backed by a Candle model (or zero-shot gpt-4o-mini as fallback). Provider pattern already established via [[llm.providers]] and *_provider fields.
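As an illustration of this principle, a minimal `ClassifierProvider` trait with a Tier 1 fallback implementation might look like the following. This is a sketch only: the names and shapes are not the final zeph-core API, the real trait would likely be async, and plain substring matching stands in for the regex pre-filter so the example needs no external crates.

```rust
/// Result of a single classification call (illustrative, not final API).
#[derive(Debug, Clone, PartialEq)]
pub struct ClassificationResult {
    pub label: String,
    pub score: f32, // confidence in [0.0, 1.0]
}

/// Backed by a Candle model, a zero-shot LLM, or a regex fallback.
pub trait ClassifierProvider: Send + Sync {
    fn classify(&self, text: &str) -> ClassificationResult;
}

/// Tier 1 fallback: substring matching stands in for the regex
/// pre-filter so this sketch stays dependency-free.
pub struct PatternFallback {
    pub patterns: Vec<&'static str>,
}

impl ClassifierProvider for PatternFallback {
    fn classify(&self, text: &str) -> ClassificationResult {
        let hit = self.patterns.iter().any(|p| text.contains(p));
        ClassificationResult {
            label: if hit { "flagged" } else { "clean" }.to_string(),
            score: if hit { 1.0 } else { 0.0 },
        }
    }
}
```

A Candle-backed implementor and a zero-shot LLM implementor would then be drop-in replacements behind the same trait, selected per subsystem via the `*_provider` config fields.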

Research Findings (CI-178)

Four-tier hierarchy from 2024–2026 literature:

| Tier | Approach | Latency | Best for |
|------|----------|---------|----------|
| 1 | Regex pre-filter | <1ms | High-recall first pass (keep as fallback) |
| 2 | Linear probe on LLM activations | <10ms | Injection, PII when host model available |
| 3 | Fine-tuned small transformer via Candle | 50–200ms | All tasks with HuggingFace models |
| 4 | Zero-shot gpt-4o-mini prompt | 200ms+ | Cold-start, no labeled data available |

Recommended Models (HuggingFace / Candle-compatible)

| Task | Model | Size | Notes |
|------|-------|------|-------|
| Injection detection | mDeBERTa-v3-base-prompt-injection-v2 | 280MB | Multi-label, production-grade |
| Content safety | Llama-Guard-3-1B | 1B | Meta canonical agent safeguard |
| PII detection | piiranha-v1-detect-personal-information | 300MB | BERT-based, 300k+ downloads |
| Feedback/correction | zero-shot gpt-4o-mini | API | No labeled dataset; bootstrap with synthetic data |
| General safety (lightweight) | DeBERTa-v3 + LEC head (arXiv:2412.13435) | 0.5–3B | Qwen 0.5B backbone, fast after warmup |

Key Papers

  • arXiv:2412.13435 — LEC (Layer Enhanced Classification): logistic regression on intermediate transformer layer activations, surpasses GPT-4o with <100 examples, Qwen 0.5B–3B backbone. Most applicable to Zeph.
  • arXiv:2510.14005 — PIShield: linear probe on residual stream, no fine-tuning, covers injection detection
  • arXiv:2510.07551 — RECAP Hybrid PII: regex fast path + LLM second pass, 82% better than NER-only
  • arXiv:2312.06674 — Llama Guard 3: Meta production safety classifier for agent I/O
  • arXiv:2509.23994 — AI Agent Code of Conduct: policy-as-prompt enforcement via LLM classifier
  • arXiv:2510.09781 — Safiron: pre-execution guardian model for agentic plans

Implementation Plan

Phase 1 — ClassifierProvider abstraction

  • Add ClassifierProvider trait to zeph-core with classify(text) -> ClassificationResult
  • Candle backend: load ONNX/safetensors from HuggingFace cache
  • Fallback: zero-shot LLM via existing [[llm.providers]]
  • Config: [classifiers] section with injection_provider, pii_provider, feedback_provider, safety_provider
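A hypothetical shape for that `[classifiers]` section, following the issue's existing TOML config convention. The field names come from the Phase 1 list above; the provider values and the `timeout_ms` key are illustrative assumptions, not an agreed format.

```toml
# Hypothetical [classifiers] section; values are illustrative.
[classifiers]
injection_provider = "candle:mDeBERTa-v3-base-prompt-injection-v2"
pii_provider       = "candle:piiranha-v1-detect-personal-information"
safety_provider    = "candle:Llama-Guard-3-1B"
feedback_provider  = "llm:gpt-4o-mini"   # zero-shot via [[llm.providers]]
timeout_ms         = 200                  # past this, fall back to regex
```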

Phase 2 — FeedbackDetector migration

  • Replace detector_mode = "regex" with detector_mode = "model"
  • Zero-shot gpt-4o-mini or fine-tuned small model; keep regex as offline fallback

Phase 3 — Injection detection

  • Replace flag_injection_patterns regex with mDeBERTa-v3-base-prompt-injection-v2 via Candle
  • Score threshold replaces pattern list

Phase 4 — PII filter

  • Hybrid: keep regex fast path, add piiranha/NER second pass for contextual PII
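The hybrid flow could be sketched as below, under stated assumptions: the fast path here checks only one SSN-shaped token pattern (a stand-in for the real regex set), and `model_pass` is a placeholder closure for the piiranha/NER second pass. Everything is illustrative, not the actual filter code.

```rust
/// Fast path: Some(true) = definite PII, None = undecided, escalate.
/// Only an SSN-shaped token (ddd-dd-dddd) is checked in this sketch.
fn regex_fast_path(text: &str) -> Option<bool> {
    let ssn_like = text.split_whitespace().any(|t| {
        let b = t.as_bytes();
        b.len() == 11
            && b[3] == b'-'
            && b[6] == b'-'
            && b.iter()
                .enumerate()
                .all(|(i, c)| matches!(i, 3 | 6) || c.is_ascii_digit())
    });
    if ssn_like { Some(true) } else { None }
}

/// Hybrid check: cheap structural patterns decide when they can;
/// contextual PII escalates to the model-backed second pass.
fn contains_pii<M>(text: &str, model_pass: M) -> bool
where
    M: Fn(&str) -> bool, // stand-in for the piiranha/NER pass
{
    regex_fast_path(text).unwrap_or_else(|| model_pass(text))
}
```

The design point is that the regex tier keeps its <1ms latency for the obvious structural hits, while only ambiguous text pays the model's 50–200ms cost.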

Notes

  • Candle backend (zeph-candle crate) already exists but underutilized — large selection of ready HuggingFace models
  • All classifier calls must be async with configurable timeouts; fall back to regex on timeout
  • Models cached in ~/.cache/zeph/classifiers/ on first use
  • Privacy: classifier models MUST run locally via Candle by default — no external API for PII/injection data
  • Future direction: many specialized lightweight models per task > one large LLM for everything
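The timeout-then-regex rule in the notes above could be sketched with std-only primitives as follows. A real implementation would presumably use the async runtime (e.g. `tokio::time::timeout` around an async `classify`); the thread-and-channel version here is just a dependency-free illustration, and the substring match again stands in for the regex fallback.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum Verdict { Flagged, Clean }

/// Tier 1 regex pre-filter stand-in (substring match keeps this
/// sketch dependency-free).
fn regex_fallback(text: &str) -> Verdict {
    if text.contains("ignore previous instructions") {
        Verdict::Flagged
    } else {
        Verdict::Clean
    }
}

/// Run the model-backed classifier under a budget; degrade to the
/// regex fallback if it does not answer in time.
pub fn classify_with_timeout<F>(text: &str, budget: Duration, model: F) -> Verdict
where
    F: FnOnce(&str) -> Verdict + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let owned = text.to_string();
    thread::spawn(move || {
        // Receiver may already have timed out; ignore the send error.
        let _ = tx.send(model(&owned));
    });
    rx.recv_timeout(budget).unwrap_or_else(|_| regex_fallback(text))
}
```

Note the degraded path trades the model's semantic coverage for the regex tier's high-recall pattern list, so a timeout never silently disables detection.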

Metadata

Labels

  • P2 — High value, medium complexity
  • enhancement — New feature or request
  • research — Research-driven improvement
