Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Self-Learning Skills

Zeph continuously improves its skills based on execution outcomes, user corrections, and provider performance. The self-learning system operates across four layers: failure classification, implicit feedback detection, Bayesian re-ranking, and hybrid search with EMA-based routing.

Overview

When a skill fails or a user implicitly corrects the agent, Zeph records the signal, re-ranks affected skills, and — when failures cross a threshold — generates an improved skill version via LLM reflection.

User message
     │
     ▼
Skill matching (BM25 + cosine → RRF fusion)
     │
     ▼
Skill execution → SkillOutcome recorded
     │
     ├─ Success → Wilson score updated, EMA updated
     │
     └─ Failure → FailureKind classified
                       │
                       ├─ FeedbackDetector checks next user turn
                       │        └─ UserCorrection stored in SQLite + Qdrant
                       │
                       └─ repeated failures → LLM generates improved version

Phase 1 — Failure Classification

Every skill invocation records a SkillOutcome. Tool failures now carry a FailureKind that distinguishes seven root causes:

VariantMeaning
ExitNonzeroThe tool process exited with a non-zero exit code
TimeoutThe tool call exceeded the configured timeout
PermissionDeniedTool execution was blocked by the permission policy
WrongApproachThe skill used a command or method inappropriate for the task
PartialThe tool completed but produced incomplete or truncated output
SyntaxErrorThe generated command or script contained a syntax error
UnknownFailure cause could not be classified from the error message

The raw reason string is stored in the outcome_detail column (migration 018, skill_outcomes table) for later inspection and LLM-based improvement prompts.

Rejecting a Skill

Use /skill reject to record an explicit user rejection and immediately trigger the improvement pipeline:

/skill reject <name> <reason>

Example:

/skill reject web-search "always uses the wrong search engine"

This is equivalent to min_failures consecutive failures — the improvement loop starts on the next agent cycle.

Phase 2 — Implicit Feedback Detection

Zeph inspects each user turn for implicit corrections without requiring an explicit /feedback command. Two detection strategies are available, selected via detector_mode:

Regex Detector (default)

FeedbackDetector uses pattern matching only — zero LLM calls.

Detection signals:

  1. Explicit rejection (confidence 0.85) — phrases like “no”, “wrong”, “that’s wrong”, “that didn’t work”, “bad answer”, “that’s incorrect”.
  2. Self-correction — user corrects themselves (e.g., “I was wrong, the capital is Canberra”). Self-corrections are stored for analytics but do not penalize active skills.
  3. Alternative request (confidence 0.70) — “instead use…”, “try a different approach”, “can you do it differently”.
  4. Repetition (confidence 0.75) — Jaccard token overlap > 0.8 against the last 3 user messages.

Judge Detector (LLM-backed)

JudgeDetector uses an LLM call to classify borderline or missed cases. It is invoked only when regex confidence falls in the adaptive zone or regex returns no signal at all.

How the adaptive zone works:

Regex resultAction
Confidence >= judge_adaptive_high (0.80)Accepted without judge
Confidence in [judge_adaptive_low, judge_adaptive_high)Judge invoked to confirm/override
Confidence < judge_adaptive_low (0.50)Treated as “no correction”
No regex matchJudge invoked as fallback

The judge call runs in a background tokio::spawn task and does not block the agent response loop. A sliding-window rate limiter caps judge calls at 5 per 60 seconds to control cost.

Judge prompt design:

  • System prompt classifies user satisfaction into explicit_rejection, alternative_request, repetition, or neutral.
  • User message content is XML-escaped to mitigate prompt injection via </user_message> tags.
  • Response is parsed as structured JSON (JudgeVerdict) with confidence clamping to [0.0, 1.0].

Multi-Language Support

FeedbackDetector matches correction patterns across 7 languages:

LanguageExample rejectionExample alternative
English“that’s wrong”, “bad answer”“try a different approach”
Russian“неправильно”, “неверно”“попробуй по-другому”
Spanish“eso esta mal”, “incorrecto”“intenta de otra manera”
German“das ist falsch”, “stimmt nicht”“versuch es anders”
French“c’est faux”, “incorrect”“essaie autrement”
Chinese“错了”, “不对”“换个方法”
Japanese“違います”, “間違い”“別の方法で”

Each language uses dual anchoring: anchored patterns (^) for messages starting with the feedback phrase, and unanchored patterns for mid-sentence feedback. Confidence values are assigned per pattern: explicit rejections score 0.85, alternatives 0.70.

Mixed-language inputs are supported. CJK patterns use 2+ character minimums for unanchored matching to reduce false positives from substring matches. Unsupported languages (Korean, Arabic, etc.) produce no regex signal, causing every message to trigger a judge call (rate-limited to 5/min).

Storage

Detected corrections are stored as UserCorrection records in:

  • SQLite (zeph_corrections table) — persistent, queryable
  • Qdrant (zeph_corrections collection) — vector-indexed for similarity recall

On each subsequent query, the top-3 most similar corrections (cosine similarity >= 0.75) are injected into the system prompt to steer the agent away from repeating the same mistake.

Configuration

[skills.learning]
detector_mode = "regex"              # "regex" (default) or "judge"
judge_model = ""                     # Model for judge calls (empty = use primary provider)
judge_adaptive_low = 0.5            # Below this, regex "no correction" is trusted (default: 0.5)
judge_adaptive_high = 0.8           # At or above, regex result accepted without judge (default: 0.8)

[agent.learning]
correction_detection = true           # Enable FeedbackDetector (default: true)
correction_confidence_threshold = 0.7 # Confidence threshold to accept a candidate (default: 0.7)
correction_recall_limit = 3           # Max corrections injected into system prompt (default: 3)
correction_min_similarity = 0.75      # Minimum cosine similarity for correction recall (default: 0.75)

Setting detector_mode = "judge" does not disable regex — regex always runs first. The judge is invoked only for borderline or missed cases, keeping LLM costs minimal.

Phase 3 — Bayesian Re-Ranking and Trust Transitions

Wilson Score Confidence Interval

Skill success/failure outcomes feed a Wilson score calculator that produces a lower-bound confidence interval. This replaces the raw success-rate sort used previously:

wilson_lower = (successes + z²/2) / (n + z²) - z * sqrt(n * p*(1-p) + z²/4) / (n + z²)

where z = 1.96 (95% CI). Skills with few observations are naturally ranked lower until they accumulate evidence.

Auto Promote / Demote

check_trust_transition() runs after each outcome and applies automatic trust level changes:

ConditionAction
Wilson score ≥ 0.85 and ≥ 10 evaluationsPromote to trusted
Wilson score < 0.40 and ≥ 5 evaluationsDemote to quarantined
Quarantined skill improves above 0.70Promote back to verified

Trust transitions are logged via tracing and reflected immediately in /skill stats output.

TUI Confidence Bars

The TUI dashboard (--tui) shows a per-skill confidence bar in the Skills panel:

  • Green — Wilson score ≥ 0.75 (high confidence)
  • Yellow — Wilson score 0.40–0.74 (moderate)
  • Red — Wilson score < 0.40 (low confidence, at risk of demotion)

The bar width is proportional to the score and updates in real time as outcomes are recorded.

Phase 4 — Hybrid Search and EMA Routing

Skill matching now combines two signals via Reciprocal Rank Fusion (RRF):

SignalDescription
BM25Term-frequency keyword match against skill names, descriptions, and trigger phrases
CosineEmbedding similarity of the query against skill body vectors
rrf_score(d) = 1/(k + rank_bm25(d)) + 1/(k + rank_cosine(d))     k = 60

The cosine_weight parameter scales the cosine component relative to BM25 before RRF:

[skills]
cosine_weight = 0.7    # Weight for cosine signal in fusion (default: 0.7)
hybrid_search = true   # Enable BM25+cosine fusion (default: true)

When hybrid_search = false, the previous cosine-only matching is used.

EMA-Based Provider Routing

EmaTracker maintains an exponential moving average of response latency per provider. When router_ema_enabled = true, the router re-orders providers by EMA score every router_reorder_interval requests, preferring providers with consistently lower latency.

[llm]
router_ema_enabled = false      # Enable EMA-based provider reordering (default: false)
router_ema_alpha = 0.1          # EMA smoothing factor, 0.0–1.0 (default: 0.1)
router_reorder_interval = 10    # Re-order every N requests (default: 10)

A lower router_ema_alpha gives more weight to historical latency; a higher value tracks recent performance more aggressively.

Skill Health in System Prompt

When hybrid_search = true, active skills include XML health attributes in the injected system prompt block:

<skill name="git" trust="trusted" reliability="91%" uses="47">
  ...skill body...
</skill>

These attributes let the LLM factor in skill reliability when choosing between overlapping skills.

Complete Configuration Reference

[skills]
cosine_weight = 0.7    # Cosine signal weight in BM25+cosine fusion (default: 0.7)
hybrid_search = true   # Enable hybrid BM25+cosine skill matching (default: true)

[llm]
router_ema_enabled = false      # EMA-based provider latency routing (default: false)
router_ema_alpha = 0.1          # EMA smoothing factor (default: 0.1)
router_reorder_interval = 10    # Provider re-order interval in requests (default: 10)

[agent.learning]
correction_detection = true           # Implicit correction detection (default: true)
correction_confidence_threshold = 0.7 # Jaccard overlap threshold (default: 0.7)
correction_recall_limit = 3           # Corrections injected into system prompt (default: 3)
correction_min_similarity = 0.75      # Min cosine similarity for correction recall (default: 0.75)

[skills.learning]
enabled = true
auto_activate = false     # Require manual approval for new versions (default: false)
min_failures = 3          # Failures before triggering improvement
improve_threshold = 0.7   # Success rate below which improvement starts
rollback_threshold = 0.5  # Auto-rollback when success rate drops below this
min_evaluations = 5       # Minimum evaluations before rollback decision
max_versions = 10         # Max auto-generated versions per skill
cooldown_minutes = 60     # Cooldown between improvements for same skill
detector_mode = "regex"   # "regex" (default) or "judge"
judge_model = ""          # Model for judge calls (empty = primary provider)
judge_adaptive_low = 0.5  # Regex confidence floor for judge bypass (default: 0.5)
judge_adaptive_high = 0.8 # Regex confidence ceiling for judge bypass (default: 0.8)

Feedback Command

The /feedback command records explicit user feedback about the agent’s most recent response. Positive or neutral feedback stores a user_approval outcome; negative feedback stores user_rejection. Approval and rejection outcomes are excluded from Wilson score calculations — they are tracked for analytics only and do not dilute execution-based success rate metrics. Positive feedback also skips generate_improved_skill() to avoid unnecessary LLM calls when a skill is working correctly.

Chat Commands

CommandDescription
/skill statsView execution metrics, Wilson scores, and trust levels per skill
/skill versionsList auto-generated versions
/skill activate <id>Activate a specific version
/skill approve <id>Approve a pending version
/skill reset <name>Revert to original version
/skill reject <name> <reason>Record user rejection and trigger improvement
/feedbackProvide explicit quality feedback (positive or negative)

Storage

StoreTable / CollectionContents
SQLiteskill_outcomesPer-invocation outcomes with outcome_detail (migration 018)
SQLiteskill_versionsLLM-generated skill versions
SQLitezeph_correctionsDetected user corrections with metadata
Qdrantzeph_correctionsVector-indexed corrections for similarity recall

How Improvement Works

  1. Failures accumulate against a skill, each tagged with a FailureKind and stored in outcome_detail.
  2. When the failure count reaches min_failures and success rate drops below improve_threshold, Zeph prompts the LLM with the skill body, recent failure details, and any recalled corrections.
  3. The LLM generates a new SKILL.md body. The new version is stored in skill_versions and either auto-activated or held pending approval depending on auto_activate.
  4. The Wilson score and EMA metrics continue to accumulate on the new version. If performance drops below rollback_threshold, automatic rollback restores the previous version.

Set auto_activate = false (default) to review LLM-generated improvements before they go live. Use /skill versions and /skill approve <id> to inspect and promote candidates manually.