Self-Learning Skills
Zeph continuously improves its skills based on execution outcomes, user corrections, and provider performance. The self-learning system operates across four layers: failure classification, implicit feedback detection, Bayesian re-ranking, and hybrid search with EMA-based routing.
Overview
When a skill fails or a user implicitly corrects the agent, Zeph records the signal, re-ranks affected skills, and — when failures cross a threshold — generates an improved skill version via LLM reflection.
User message
     │
     ▼
Skill matching (BM25 + cosine → RRF fusion)
     │
     ▼
Skill execution → SkillOutcome recorded
     │
     ├─ Success → Wilson score updated, EMA updated
     │
     └─ Failure → FailureKind classified
           │
           ├─ FeedbackDetector checks next user turn
           │     └─ UserCorrection stored in SQLite + Qdrant
           │
           └─ repeated failures → LLM generates improved version
Phase 1 — Failure Classification
Every skill invocation records a SkillOutcome. Tool failures now carry a FailureKind that distinguishes seven root causes:
| Variant | Meaning |
|---|---|
| ExitNonzero | The tool process exited with a non-zero exit code |
| Timeout | The tool call exceeded the configured timeout |
| PermissionDenied | Tool execution was blocked by the permission policy |
| WrongApproach | The skill used a command or method inappropriate for the task |
| Partial | The tool completed but produced incomplete or truncated output |
| SyntaxError | The generated command or script contained a syntax error |
| Unknown | Failure cause could not be classified from the error message |
The raw reason string is stored in the outcome_detail column (migration 018, skill_outcomes table) for later inspection and LLM-based improvement prompts.
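As a rough illustration of how a raw reason string could map onto these variants, here is a minimal sketch. The `classify_failure` function and its keyword rules are assumptions made for this example, not Zeph's actual classifier. `WrongApproach` is left out of the keyword rules because it would typically need more context than an error string provides.

```rust
/// The seven failure categories listed in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureKind {
    ExitNonzero,
    Timeout,
    PermissionDenied,
    WrongApproach,
    Partial,
    SyntaxError,
    Unknown,
}

/// Hypothetical keyword-based classifier. The matching rules are
/// illustrative assumptions; specific causes are checked before the
/// generic exit-code case.
pub fn classify_failure(raw_reason: &str) -> FailureKind {
    let msg = raw_reason.to_lowercase();
    if msg.contains("timed out") || msg.contains("timeout") {
        FailureKind::Timeout
    } else if msg.contains("permission denied") || msg.contains("blocked by policy") {
        FailureKind::PermissionDenied
    } else if msg.contains("syntax error") || msg.contains("unexpected token") {
        FailureKind::SyntaxError
    } else if msg.contains("truncated") || msg.contains("incomplete") {
        FailureKind::Partial
    } else if msg.contains("exit code") || msg.contains("exit status") {
        FailureKind::ExitNonzero
    } else {
        FailureKind::Unknown
    }
}
```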
Rejecting a Skill
Use /skill reject to record an explicit user rejection and immediately trigger the improvement pipeline:
/skill reject <name> <reason>
Example:
/skill reject web-search "always uses the wrong search engine"
This is equivalent to min_failures consecutive failures — the improvement loop starts on the next agent cycle.
Phase 2 — Implicit Feedback Detection
Zeph inspects each user turn for implicit corrections without requiring an explicit /feedback command. Two detection strategies are available, selected via detector_mode:
Regex Detector (default)
FeedbackDetector uses pattern matching only — zero LLM calls.
Detection signals:
- Explicit rejection (confidence 0.85) — phrases like “no”, “wrong”, “that’s wrong”, “that didn’t work”, “bad answer”, “that’s incorrect”.
- Self-correction — user corrects themselves (e.g., “I was wrong, the capital is Canberra”). Self-corrections are stored for analytics but do not penalize active skills.
- Alternative request (confidence 0.70) — “instead use…”, “try a different approach”, “can you do it differently”.
- Repetition (confidence 0.75) — Jaccard token overlap > 0.8 against the last 3 user messages.
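The repetition signal can be sketched as follows. This is a minimal example that assumes whitespace tokenization and case folding; the tokenizer Zeph actually uses is not specified here, and the function names are hypothetical.

```rust
use std::collections::HashSet;

/// Jaccard overlap of the lowercase whitespace-token sets of two messages.
/// The tokenization is an assumption made for this sketch.
pub fn jaccard_overlap(a: &str, b: &str) -> f64 {
    let lower_a = a.to_lowercase();
    let lower_b = b.to_lowercase();
    let tokens_a: HashSet<&str> = lower_a.split_whitespace().collect();
    let tokens_b: HashSet<&str> = lower_b.split_whitespace().collect();
    if tokens_a.is_empty() && tokens_b.is_empty() {
        // Avoid 0/0 when both messages are empty.
        return 0.0;
    }
    let intersection = tokens_a.intersection(&tokens_b).count() as f64;
    let union = tokens_a.union(&tokens_b).count() as f64;
    intersection / union
}

/// True when the current message overlaps more than 0.8 with any of the
/// last three user messages (`history` is ordered oldest first).
pub fn is_repetition(current: &str, history: &[&str]) -> bool {
    history
        .iter()
        .rev()
        .take(3)
        .any(|previous| jaccard_overlap(current, previous) > 0.8)
}
```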
Judge Detector (LLM-backed)
JudgeDetector uses an LLM call to classify borderline or missed cases. It is invoked only when regex confidence falls in the adaptive zone or regex returns no signal at all.
How the adaptive zone works:
| Regex result | Action |
|---|---|
| Confidence >= judge_adaptive_high (0.80) | Accepted without judge |
| Confidence in [judge_adaptive_low, judge_adaptive_high) | Judge invoked to confirm/override |
| Confidence < judge_adaptive_low (0.50) | Treated as “no correction” |
| No regex match | Judge invoked as fallback |
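The routing in this table can be expressed as a small pure function. The sketch below uses hypothetical names (`Decision`, `route`) and models "no regex match" as `None`.

```rust
/// Outcome of the adaptive-zone check (names are illustrative).
#[derive(Debug, PartialEq)]
pub enum Decision {
    AcceptRegex,
    InvokeJudge,
    NoCorrection,
}

/// `regex_confidence` is `None` when no regex pattern matched.
/// `low` and `high` correspond to judge_adaptive_low / judge_adaptive_high.
pub fn route(regex_confidence: Option<f64>, low: f64, high: f64) -> Decision {
    match regex_confidence {
        None => Decision::InvokeJudge,
        Some(c) if c >= high => Decision::AcceptRegex,
        Some(c) if c >= low => Decision::InvokeJudge,
        Some(_) => Decision::NoCorrection,
    }
}
```

With the default bounds, an explicit rejection (0.85) is accepted directly, while alternative requests (0.70) and repetitions (0.75) land in the adaptive zone and go to the judge.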
The judge call runs in a background tokio::spawn task and does not block the agent response loop. A sliding-window rate limiter caps judge calls at 5 per 60 seconds to control cost.
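A sliding-window limiter of this kind could look like the sketch below. The type name and API are assumptions for illustration; the current time is passed in explicitly so the behavior is easy to test.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical sliding-window limiter: at most `max_calls` within `window`.
pub struct SlidingWindowLimiter {
    max_calls: usize,
    window: Duration,
    calls: VecDeque<Instant>,
}

impl SlidingWindowLimiter {
    pub fn new(max_calls: usize, window: Duration) -> Self {
        Self { max_calls, window, calls: VecDeque::new() }
    }

    /// Drops timestamps that have left the window, then records the call
    /// and returns true if there is still capacity.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        while let Some(&oldest) = self.calls.front() {
            if now.duration_since(oldest) >= self.window {
                self.calls.pop_front();
            } else {
                break;
            }
        }
        if self.calls.len() < self.max_calls {
            self.calls.push_back(now);
            true
        } else {
            false
        }
    }
}
```

Constructing it as `SlidingWindowLimiter::new(5, Duration::from_secs(60))` matches the 5-per-60-seconds cap described above.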
Judge prompt design:
- System prompt classifies user satisfaction into explicit_rejection, alternative_request, repetition, or neutral.
- User message content is XML-escaped to mitigate prompt injection via </user_message> tags.
- Response is parsed as structured JSON (JudgeVerdict) with confidence clamping to [0.0, 1.0].
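The escaping and clamping steps might be sketched like this. The helper names are hypothetical, and the exact set of characters Zeph escapes is an assumption; the example covers the five standard XML entities.

```rust
/// Escapes the five XML special characters so user text cannot close
/// the surrounding <user_message> element.
pub fn escape_xml(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            other => out.push(other),
        }
    }
    out
}

/// Wraps escaped user content in the tag the judge prompt expects.
pub fn wrap_user_message(content: &str) -> String {
    format!("<user_message>{}</user_message>", escape_xml(content))
}

/// Clamps a model-reported confidence into [0.0, 1.0]; NaN maps to 0.0
/// (the NaN handling is an assumption of this sketch).
pub fn clamp_confidence(raw: f64) -> f64 {
    if raw.is_nan() { 0.0 } else { raw.clamp(0.0, 1.0) }
}
```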
Multi-Language Support
FeedbackDetector matches correction patterns across 7 languages:
| Language | Example rejection | Example alternative |
|---|---|---|
| English | “that’s wrong”, “bad answer” | “try a different approach” |
| Russian | “неправильно”, “неверно” | “попробуй по-другому” |
| Spanish | “eso esta mal”, “incorrecto” | “intenta de otra manera” |
| German | “das ist falsch”, “stimmt nicht” | “versuch es anders” |
| French | “c’est faux”, “incorrect” | “essaie autrement” |
| Chinese | “错了”, “不对” | “换个方法” |
| Japanese | “違います”, “間違い” | “別の方法で” |
Each language uses dual anchoring: anchored patterns (^) for messages starting with the feedback phrase, and unanchored patterns for mid-sentence feedback. Confidence values are assigned per pattern: explicit rejections score 0.85, alternatives 0.70.
Mixed-language inputs are supported. CJK patterns use 2+ character minimums for unanchored matching to reduce false positives from substring matches. Unsupported languages (Korean, Arabic, etc.) produce no regex signal, so in judge mode every such message triggers a judge call (rate-limited to 5 per 60 seconds).
Storage
Detected corrections are stored as UserCorrection records in:
- SQLite (zeph_corrections table): persistent, queryable
- Qdrant (zeph_corrections collection): vector-indexed for similarity recall
On each subsequent query, the top-3 most similar corrections (cosine similarity >= 0.75) are injected into the system prompt to steer the agent away from repeating the same mistake.
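The recall step amounts to a thresholded top-k search. The sketch below does it in memory over plain vectors, purely for illustration; in Zeph the similarity search is served by Qdrant, and the function names here are hypothetical.

```rust
/// Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Returns up to `limit` correction texts whose similarity to `query`
/// is at least `min_similarity`, most similar first.
pub fn recall_corrections<'a>(
    query: &[f32],
    stored: &'a [(String, Vec<f32>)],
    limit: usize,
    min_similarity: f32,
) -> Vec<&'a str> {
    let mut scored: Vec<(f32, &str)> = stored
        .iter()
        .map(|(text, embedding)| (cosine_similarity(query, embedding), text.as_str()))
        .filter(|(score, _)| *score >= min_similarity)
        .collect();
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    scored.into_iter().take(limit).map(|(_, text)| text).collect()
}
```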
Configuration
[skills.learning]
detector_mode = "regex" # "regex" (default) or "judge"
judge_model = "" # Model for judge calls (empty = use primary provider)
judge_adaptive_low = 0.5 # Below this, regex "no correction" is trusted (default: 0.5)
judge_adaptive_high = 0.8 # At or above, regex result accepted without judge (default: 0.8)
[agent.learning]
correction_detection = true # Enable FeedbackDetector (default: true)
correction_confidence_threshold = 0.7 # Confidence threshold to accept a candidate (default: 0.7)
correction_recall_limit = 3 # Max corrections injected into system prompt (default: 3)
correction_min_similarity = 0.75 # Minimum cosine similarity for correction recall (default: 0.75)
Setting detector_mode = "judge" does not disable regex; regex always runs first. The judge is invoked only for borderline or missed cases, keeping LLM costs minimal.
Phase 3 — Bayesian Re-Ranking and Trust Transitions
Wilson Score Confidence Interval
Skill success/failure outcomes feed a Wilson score calculator that produces a lower-bound confidence interval. This replaces the raw success-rate sort used previously:
wilson_lower = (successes + z²/2) / (n + z²) - z * sqrt(n * p*(1-p) + z²/4) / (n + z²)
where n is the number of recorded evaluations, p = successes / n is the observed success rate, and z = 1.96 (95% CI). Skills with few observations are naturally ranked lower until they accumulate evidence.
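The formula translates directly into code. The following is a minimal sketch; the function signature and the choice to return 0.0 when there are no evaluations are assumptions of this example.

```rust
/// Lower bound of the Wilson score interval at 95% confidence.
/// Returns 0.0 when there are no evaluations (an assumption of this sketch).
pub fn wilson_lower(successes: u32, n: u32) -> f64 {
    if n == 0 {
        return 0.0;
    }
    let z = 1.96_f64;
    let z2 = z * z;
    let n = f64::from(n);
    let s = f64::from(successes);
    let p = s / n;
    let center = (s + z2 / 2.0) / (n + z2);
    let margin = z * (n * p * (1.0 - p) + z2 / 4.0).sqrt() / (n + z2);
    center - margin
}
```

Under this formula a skill with 1 success out of 1 scores about 0.21, while 10 out of 10 scores about 0.72, which shows how small samples are held back even with a perfect record.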
Auto Promote / Demote
check_trust_transition() runs after each outcome and applies automatic trust level changes:
| Condition | Action |
|---|---|
| Wilson score ≥ 0.85 and ≥ 10 evaluations | Promote to trusted |
| Wilson score < 0.40 and ≥ 5 evaluations | Demote to quarantined |
| Quarantined skill improves above 0.70 | Promote back to verified |
Trust transitions are logged via tracing and reflected immediately in /skill stats output.
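The transition rules in the table can be sketched as a pure function. The `TrustLevel` enum and `next_trust_level` signature are hypothetical and do not reproduce the real check_trust_transition(). The sketch also assumes that a quarantined skill can only recover to verified, and that only verified skills are promoted to trusted.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrustLevel {
    Trusted,
    Verified,
    Quarantined,
}

/// Returns the new trust level if a transition applies, otherwise None.
pub fn next_trust_level(current: TrustLevel, wilson: f64, evaluations: u32) -> Option<TrustLevel> {
    match current {
        // Quarantined skills recover to verified once they exceed 0.70.
        TrustLevel::Quarantined if wilson > 0.70 => Some(TrustLevel::Verified),
        TrustLevel::Quarantined => None,
        // Any non-quarantined skill is demoted on a low score with enough data.
        _ if wilson < 0.40 && evaluations >= 5 => Some(TrustLevel::Quarantined),
        // Promotion requires both a high score and enough evaluations.
        TrustLevel::Verified if wilson >= 0.85 && evaluations >= 10 => Some(TrustLevel::Trusted),
        _ => None,
    }
}
```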
TUI Confidence Bars
The TUI dashboard (--tui) shows a per-skill confidence bar in the Skills panel:
- Green — Wilson score ≥ 0.75 (high confidence)
- Yellow — Wilson score 0.40–0.74 (moderate)
- Red — Wilson score < 0.40 (low confidence, at risk of demotion)
The bar width is proportional to the score and updates in real time as outcomes are recorded.
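For illustration, the color bands and proportional width could be computed as below. This is a plain-text sketch with hypothetical names and glyphs; it does not use or represent the TUI library Zeph renders with.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum BarColor {
    Green,
    Yellow,
    Red,
}

/// Maps a Wilson score onto the three bands listed above.
pub fn bar_color(wilson: f64) -> BarColor {
    if wilson >= 0.75 {
        BarColor::Green
    } else if wilson >= 0.40 {
        BarColor::Yellow
    } else {
        BarColor::Red
    }
}

/// Renders a bar of `width` cells, filled in proportion to the score.
pub fn render_bar(wilson: f64, width: usize) -> String {
    let filled = (wilson.clamp(0.0, 1.0) * width as f64).round() as usize;
    format!("{}{}", "█".repeat(filled), "░".repeat(width - filled))
}
```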
Phase 4 — Hybrid Search and EMA Routing
BM25 + Cosine Hybrid Search
Skill matching now combines two signals via Reciprocal Rank Fusion (RRF):
| Signal | Description |
|---|---|
| BM25 | Term-frequency keyword match against skill names, descriptions, and trigger phrases |
| Cosine | Embedding similarity of the query against skill body vectors |
rrf_score(d) = 1/(k + rank_bm25(d)) + 1/(k + rank_cosine(d))
where k = 60
The cosine_weight parameter scales the cosine component relative to BM25 before RRF:
[skills]
cosine_weight = 0.7 # Weight for cosine signal in fusion (default: 0.7)
hybrid_search = true # Enable BM25+cosine fusion (default: true)
When hybrid_search = false, the previous cosine-only matching is used.
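A minimal sketch of the fusion step follows. It implements the unweighted RRF formula with 1-based ranks and assumes a skill missing from one ranking simply receives no contribution from it. How cosine_weight enters the computation is not modeled here, and the function name is hypothetical.

```rust
use std::collections::HashMap;

/// RRF constant from the formula above.
const K: f64 = 60.0;

/// Fuses two rankings (best first) into one list sorted by RRF score.
pub fn rrf_fuse(bm25_ranked: &[&str], cosine_ranked: &[&str]) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for ranking in [bm25_ranked, cosine_ranked] {
        for (index, skill) in ranking.iter().enumerate() {
            let rank = (index + 1) as f64; // ranks are 1-based in this sketch
            *scores.entry(*skill).or_insert(0.0) += 1.0 / (K + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(skill, score)| (skill.to_string(), score))
        .collect();
    // Highest score first; ties broken by name for a stable order.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

A skill that appears near the top of both rankings outranks one that is first in only a single ranking, which is the main reason to fuse ranks rather than raw scores.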
EMA-Based Provider Routing
EmaTracker maintains an exponential moving average of response latency per provider. When router_ema_enabled = true, the router re-orders providers by EMA score every router_reorder_interval requests, preferring providers with consistently lower latency.
[llm]
router_ema_enabled = false # Enable EMA-based provider reordering (default: false)
router_ema_alpha = 0.1 # EMA smoothing factor, 0.0–1.0 (default: 0.1)
router_reorder_interval = 10 # Re-order every N requests (default: 10)
A lower router_ema_alpha gives more weight to historical latency; a higher value tracks recent performance more aggressively.
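The update rule behind this is `ema = alpha * sample + (1 - alpha) * ema`. The sketch below uses a hypothetical `LatencyEma` type rather than the real EmaTracker API, and seeds each provider's average with its first sample, which is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical per-provider latency tracker.
pub struct LatencyEma {
    alpha: f64,
    latency_ms: HashMap<String, f64>,
}

impl LatencyEma {
    pub fn new(alpha: f64) -> Self {
        Self { alpha, latency_ms: HashMap::new() }
    }

    /// Folds a new latency sample into the provider's moving average.
    pub fn record(&mut self, provider: &str, sample_ms: f64) {
        let alpha = self.alpha;
        self.latency_ms
            .entry(provider.to_string())
            .and_modify(|ema| *ema = alpha * sample_ms + (1.0 - alpha) * *ema)
            .or_insert(sample_ms);
    }

    pub fn get(&self, provider: &str) -> Option<f64> {
        self.latency_ms.get(provider).copied()
    }

    /// Providers ordered by ascending average latency (fastest first).
    pub fn ranked(&self) -> Vec<&str> {
        let mut providers: Vec<(&str, f64)> = self
            .latency_ms
            .iter()
            .map(|(name, ema)| (name.as_str(), *ema))
            .collect();
        providers.sort_by(|a, b| a.1.total_cmp(&b.1));
        providers.into_iter().map(|(name, _)| name).collect()
    }
}
```

With alpha = 0.1, a single 200 ms spike after a 100 ms baseline moves the average only to 110 ms, so one slow response does not immediately reorder providers.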
Skill Health in System Prompt
When hybrid_search = true, active skills include XML health attributes in the injected system prompt block:
<skill name="git" trust="trusted" reliability="91%" uses="47">
...skill body...
</skill>
These attributes let the LLM factor in skill reliability when choosing between overlapping skills.
Complete Configuration Reference
[skills]
cosine_weight = 0.7 # Cosine signal weight in BM25+cosine fusion (default: 0.7)
hybrid_search = true # Enable hybrid BM25+cosine skill matching (default: true)
[llm]
router_ema_enabled = false # EMA-based provider latency routing (default: false)
router_ema_alpha = 0.1 # EMA smoothing factor (default: 0.1)
router_reorder_interval = 10 # Provider re-order interval in requests (default: 10)
[agent.learning]
correction_detection = true # Implicit correction detection (default: true)
correction_confidence_threshold = 0.7 # Confidence threshold to accept a candidate (default: 0.7)
correction_recall_limit = 3 # Corrections injected into system prompt (default: 3)
correction_min_similarity = 0.75 # Min cosine similarity for correction recall (default: 0.75)
[skills.learning]
enabled = true
auto_activate = false # Require manual approval for new versions (default: false)
min_failures = 3 # Failures before triggering improvement
improve_threshold = 0.7 # Success rate below which improvement starts
rollback_threshold = 0.5 # Auto-rollback when success rate drops below this
min_evaluations = 5 # Minimum evaluations before rollback decision
max_versions = 10 # Max auto-generated versions per skill
cooldown_minutes = 60 # Cooldown between improvements for same skill
detector_mode = "regex" # "regex" (default) or "judge"
judge_model = "" # Model for judge calls (empty = primary provider)
judge_adaptive_low = 0.5 # Regex confidence floor for judge bypass (default: 0.5)
judge_adaptive_high = 0.8 # Regex confidence ceiling for judge bypass (default: 0.8)
Feedback Command
The /feedback command records explicit user feedback about the agent’s most recent response. Positive or neutral feedback stores a user_approval outcome; negative feedback stores user_rejection. Approval and rejection outcomes are excluded from Wilson score calculations — they are tracked for analytics only and do not dilute execution-based success rate metrics. Positive feedback also skips generate_improved_skill() to avoid unnecessary LLM calls when a skill is working correctly.
Chat Commands
| Command | Description |
|---|---|
| /skill stats | View execution metrics, Wilson scores, and trust levels per skill |
| /skill versions | List auto-generated versions |
| /skill activate <id> | Activate a specific version |
| /skill approve <id> | Approve a pending version |
| /skill reset <name> | Revert to original version |
| /skill reject <name> <reason> | Record user rejection and trigger improvement |
| /feedback | Provide explicit quality feedback (positive or negative) |
Storage
| Store | Table / Collection | Contents |
|---|---|---|
| SQLite | skill_outcomes | Per-invocation outcomes with outcome_detail (migration 018) |
| SQLite | skill_versions | LLM-generated skill versions |
| SQLite | zeph_corrections | Detected user corrections with metadata |
| Qdrant | zeph_corrections | Vector-indexed corrections for similarity recall |
How Improvement Works
- Failures accumulate against a skill, each tagged with a FailureKind and stored in outcome_detail.
- When the failure count reaches min_failures and success rate drops below improve_threshold, Zeph prompts the LLM with the skill body, recent failure details, and any recalled corrections.
- The LLM generates a new SKILL.md body. The new version is stored in skill_versions and either auto-activated or held pending approval depending on auto_activate.
- The Wilson score and EMA metrics continue to accumulate on the new version. If performance drops below rollback_threshold, automatic rollback restores the previous version.
Set auto_activate = false (default) to review LLM-generated improvements before they go live. Use /skill versions and /skill approve <id> to inspect and promote candidates manually.
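The trigger and rollback conditions described above can be sketched as two predicates. The struct and function names are hypothetical; the default values mirror the [skills.learning] settings in the configuration reference.

```rust
/// Subset of the [skills.learning] settings used by this sketch.
#[derive(Debug, Clone)]
pub struct LearningConfig {
    pub min_failures: u32,
    pub improve_threshold: f64,
    pub rollback_threshold: f64,
    pub min_evaluations: u32,
}

impl Default for LearningConfig {
    fn default() -> Self {
        Self { min_failures: 3, improve_threshold: 0.7, rollback_threshold: 0.5, min_evaluations: 5 }
    }
}

/// Execution counters for one skill version.
#[derive(Debug, Clone, Copy)]
pub struct SkillStats {
    pub successes: u32,
    pub failures: u32,
}

impl SkillStats {
    pub fn evaluations(&self) -> u32 {
        self.successes + self.failures
    }

    /// None until at least one outcome has been recorded.
    pub fn success_rate(&self) -> Option<f64> {
        match self.evaluations() {
            0 => None,
            n => Some(f64::from(self.successes) / f64::from(n)),
        }
    }
}

/// Improvement starts once enough failures exist and the rate is too low.
pub fn should_improve(stats: &SkillStats, cfg: &LearningConfig) -> bool {
    stats.failures >= cfg.min_failures
        && stats.success_rate().map_or(false, |rate| rate < cfg.improve_threshold)
}

/// Rollback is considered only after min_evaluations on the new version.
pub fn should_rollback(stats: &SkillStats, cfg: &LearningConfig) -> bool {
    stats.evaluations() >= cfg.min_evaluations
        && stats.success_rate().map_or(false, |rate| rate < cfg.rollback_threshold)
}
```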