research(skills): SkillsBench — curated skills +16.2pp pass rate, self-generated skills provide zero benefit (arXiv:2602.12670) #2261

@bug-ops

Summary

SkillsBench (arXiv:2602.12670, Feb 2026) is an 86-task benchmark across 11 domains that evaluates agent skills under three conditions: no skills, curated skills, and self-generated skills.

Key Findings

  • Curated skills raise task pass rate by 16.2pp on average, with high variance across domains
  • Self-generated skills provide zero average benefit, and in some domains they regress performance
  • Focused skills (2–3 module documents) outperform large documentation bundles, so narrower SKILL.md files are preferable
  • Smaller models with curated skills can match larger models without them

Applicability to Zeph

Direct implications for zeph-skills and the self-learning pipeline:

  1. Self-learning validation gap: if self-generated skills provide zero average benefit, Zeph's skill generation path (skills.learning) needs domain-conditioned evaluation gates before promoting auto-generated skills; a raw improve_threshold on its own is not sufficient
  2. SKILL.md authoring guideline: existing bundled skills (browser, os-automation, etc.) should follow the 2–3 module focused pattern; large consolidated SKILL.md files should be split
  3. Benchmark opportunity: adopt SkillsBench-style task scenarios as Zeph skill regression tests, so that matching quality can be compared against task success rate (currently we measure only match precision, not downstream task success)
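
A minimal sketch of what a domain-conditioned gate (item 1) could look like, assuming each benchmark run is recorded with its domain, whether a skill was matched, and whether the task passed. `TaskResult`, `domain_deltas_pp`, `should_promote`, and both thresholds are hypothetical names for illustration and are not part of the existing zeph-skills API:

```rust
use std::collections::BTreeMap;

/// Outcome of one benchmark task run (hypothetical record type).
#[derive(Debug, Clone)]
pub struct TaskResult {
    pub domain: String,
    pub with_skill: bool,
    pub passed: bool,
}

/// Pass and run counts for one domain, split by condition.
#[derive(Debug, Default, Clone, Copy)]
struct Tally {
    skill_pass: u32,
    skill_total: u32,
    base_pass: u32,
    base_total: u32,
}

/// Per-domain pass-rate delta (skill minus no-skill) in percentage points.
/// Domains that lack runs for either condition are skipped.
pub fn domain_deltas_pp(results: &[TaskResult]) -> BTreeMap<String, f64> {
    let mut tallies: BTreeMap<String, Tally> = BTreeMap::new();
    for r in results {
        let t = tallies.entry(r.domain.clone()).or_default();
        if r.with_skill {
            t.skill_total += 1;
            t.skill_pass += u32::from(r.passed);
        } else {
            t.base_total += 1;
            t.base_pass += u32::from(r.passed);
        }
    }
    tallies
        .into_iter()
        .filter(|(_, t)| t.skill_total > 0 && t.base_total > 0)
        .map(|(domain, t)| {
            let skill = f64::from(t.skill_pass) / f64::from(t.skill_total);
            let base = f64::from(t.base_pass) / f64::from(t.base_total);
            (domain, (skill - base) * 100.0)
        })
        .collect()
}

/// Promote only if the mean delta reaches `min_mean_pp` and no single
/// domain regresses by more than `max_regression_pp`.
pub fn should_promote(
    deltas: &BTreeMap<String, f64>,
    min_mean_pp: f64,
    max_regression_pp: f64,
) -> bool {
    if deltas.is_empty() {
        return false; // no evidence, no promotion
    }
    let mean = deltas.values().sum::<f64>() / deltas.len() as f64;
    let worst = deltas.values().copied().fold(f64::INFINITY, f64::min);
    mean >= min_mean_pp && worst >= -max_regression_pp
}
```

Checking the worst domain separately from the mean is the point of the sketch: a skill that helps on average but regresses one domain badly would still be held back.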

Implementation Sketch

  • Short-term (LOW complexity): add a max_content_sections guideline to the skill-creator SKILL.md template; constrain auto-generated skills to ≤3 sections
  • Medium-term (MEDIUM): implement a local SkillsBench-style evaluation harness that scores skill-matched vs. no-skill task success rates across a fixed prompt set
  • Long-term: gate skill promotion in self-learning on a task-success score, not just on the absence of corrections
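
The short-term item could be enforced with a check along these lines. This is a sketch under stated assumptions: a "section" is taken to be a level-2 Markdown heading (`## `) outside fenced code blocks, and `count_sections` and `check_section_limit` are hypothetical helpers, not existing zeph-skills functions. Only the `max_content_sections` name comes from the proposal above.

```rust
/// Counts level-2 Markdown headings ("## ") that appear outside fenced
/// code blocks. Treating "## " as the section marker is an assumption
/// about how SKILL.md files are structured.
pub fn count_sections(markdown: &str) -> usize {
    // Build the fence markers at runtime so this snippet contains no
    // literal code-fence line of its own.
    let backtick_fence = "`".repeat(3);
    let tilde_fence = "~".repeat(3);

    let mut in_fence = false;
    let mut count = 0;
    for line in markdown.lines() {
        let trimmed = line.trim_start();
        if trimmed.starts_with(backtick_fence.as_str())
            || trimmed.starts_with(tilde_fence.as_str())
        {
            // Simplification: any fence line toggles the state,
            // regardless of which marker opened the block.
            in_fence = !in_fence;
            continue;
        }
        if !in_fence && line.starts_with("## ") {
            count += 1;
        }
    }
    count
}

/// Returns the section count if it is within `max_content_sections`,
/// otherwise an error message describing the violation.
pub fn check_section_limit(
    markdown: &str,
    max_content_sections: usize,
) -> Result<usize, String> {
    let n = count_sections(markdown);
    if n <= max_content_sections {
        Ok(n)
    } else {
        Err(format!(
            "skill has {n} sections, limit is {max_content_sections}"
        ))
    }
}
```

For auto-generated skills, `check_section_limit(body, 3)` would express the ≤3 sections constraint. Whether a violation should reject the skill or trigger a split is left open here.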

Source: https://arxiv.org/abs/2602.12670

Metadata

Labels

P2 (High value, medium complexity), research (Research-driven improvement), skills (zeph-skills crate)
