research(skills): SkillsBench — curated skills +16.2pp pass rate, self-generated skills provide zero benefit (arXiv:2602.12670) #2261
Closed
Labels
P2 (High value, medium complexity) · research (Research-driven improvement) · skills (zeph-skills crate)
Description
Summary
SkillsBench (arXiv:2602.12670, Feb 2026) — 86-task benchmark across 11 domains evaluating agent skills under three conditions: no skills, curated skills, and self-generated skills.
Key Findings
- Curated skills raise task pass rate by 16.2pp on average with high domain variance
- Self-generated skills provide zero average benefit — and in some domains regress performance
- Focused skills (2–3 module documents) outperform large documentation bundles; narrower, focused SKILL.md files win
- Smaller models with curated Skills can match larger models without them
Applicability to Zeph
Direct implications for zeph-skills and the self-learning pipeline:
- Self-learning validation gap: if self-generated skills provide zero average benefit, Zeph's skill generation path (`skills.learning`) needs domain-conditioned evaluation gates before promoting auto-generated skills; a raw `improve_threshold` is not sufficient
- SKILL.md authoring guideline: existing bundled skills (browser, os-automation, etc.) should follow the 2–3 module focused pattern; large consolidated SKILL.md files should be split
- Benchmark opportunity: adopt SkillsBench-style task scenarios as Zeph skill regression tests to measure matching quality vs. task success rate (currently we only measure match precision, not downstream task success)
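A minimal sketch of what the authoring guideline could look like as an automated check (the function name, the section heuristic, and the threshold are all hypothetical; this is not existing zeph-skills code):

```rust
/// Hypothetical lint: count top-level `##` sections in a SKILL.md and
/// flag files that exceed the focused-skill limit suggested by
/// SkillsBench (2-3 module documents). Illustrative only; Zeph's real
/// skill tooling may parse markdown differently.
fn exceeds_section_limit(skill_md: &str, max_sections: usize) -> bool {
    let sections = skill_md
        .lines()
        .filter(|line| line.starts_with("## "))
        .count();
    sections > max_sections
}

fn main() {
    // A consolidated skill with four modules would be flagged for splitting.
    let doc = "# Browser skill\n## Navigation\n## Forms\n## Downloads\n## Cookies\n";
    println!("split needed: {}", exceeds_section_limit(doc, 3));
}
```

Such a check could run in CI over bundled skills and over any auto-generated SKILL.md before it enters the skill store.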
Implementation Sketch
- Short-term (LOW complexity): add a `max_content_sections` guideline to the skill-creator SKILL.md template; constrain auto-generated skills to ≤3 sections
- Medium-term (MEDIUM): implement a local SkillsBench-style evaluation harness that scores skill-matched vs. no-skill task success rates across a fixed prompt set
- Long-term: gate skill promotion in self-learning on task-success score, not just correction absence
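The medium- and long-term items could combine into a promotion gate along these lines (all types, field names, and thresholds are hypothetical, not Zeph's actual API):

```rust
/// Sketch of a SkillsBench-style promotion gate. For each evaluated
/// domain it compares task pass rates with and without the candidate
/// skill; the skill is promoted only if every domain improves by at
/// least `min_gain_pp` percentage points. Requiring a gain in *every*
/// domain encodes the paper's finding that self-generated skills can
/// regress performance in some domains even when the average is flat.
#[derive(Debug)]
struct EvalRun {
    domain: String,
    with_skill_passes: u32,
    no_skill_passes: u32,
    total_tasks: u32,
}

fn should_promote(runs: &[EvalRun], min_gain_pp: f64) -> bool {
    runs.iter().all(|r| {
        let with = 100.0 * r.with_skill_passes as f64 / r.total_tasks as f64;
        let without = 100.0 * r.no_skill_passes as f64 / r.total_tasks as f64;
        with - without >= min_gain_pp
    })
}

fn main() {
    let runs = vec![
        EvalRun { domain: "browser".into(), with_skill_passes: 18, no_skill_passes: 12, total_tasks: 20 },
        EvalRun { domain: "os-automation".into(), with_skill_passes: 11, no_skill_passes: 13, total_tasks: 20 },
    ];
    // The skill regresses os-automation, so the gate rejects promotion
    // even though the average across both domains is positive.
    println!("promote: {}", should_promote(&runs, 1.0));
}
```

Wiring this into the self-learning pipeline would replace the current correction-absence check with a task-success criterion, as the long-term item proposes.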
Source: https://arxiv.org/abs/2602.12670