research(skills): SkillsBench — curated skills +16.2pp pass rate, self-generated skills provide zero benefit (arXiv:2602.12670) #2261
Closed
Labels
P2 (High value, medium complexity) · research (Research-driven improvement) · skills (zeph-skills crate)
Description
Summary
SkillsBench (arXiv:2602.12670, Feb 2026) — 86-task benchmark across 11 domains evaluating agent skills under three conditions: no skills, curated skills, and self-generated skills.
Key Findings
- Curated skills raise task pass rate by 16.2pp on average with high domain variance
- Self-generated skills provide zero average benefit — and in some domains regress performance
- Focused skills (2–3 module documents) outperform large documentation bundles; narrower, focused SKILL.md files win
- Smaller models with curated Skills can match larger models without them
Applicability to Zeph
Direct implications for zeph-skills and the self-learning pipeline:
- Self-learning validation gap: if self-generated skills provide zero average benefit, Zeph's skill generation path (`skills.learning`) needs domain-conditioned evaluation gates before promoting auto-generated skills; a raw `improve_threshold` is not sufficient
- SKILL.md authoring guideline: existing bundled skills (browser, os-automation, etc.) should follow the 2–3 module focused pattern; large consolidated SKILL.md files should be split
- Benchmark opportunity: adopt SkillsBench-style task scenarios as Zeph skill regression tests to measure matching quality vs. task success rate (currently we only measure match precision, not downstream task success)
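A minimal sketch of what the authoring guideline could look like as an automated check (the function name, the section heuristic, and the threshold are all hypothetical; this is not existing zeph-skills code):

```rust
/// Hypothetical lint: count top-level `##` sections in a SKILL.md and
/// flag files that exceed the focused-skill limit suggested by
/// SkillsBench (2-3 module documents). Illustrative only; Zeph's real
/// skill tooling may parse markdown differently.
fn exceeds_section_limit(skill_md: &str, max_sections: usize) -> bool {
    let sections = skill_md
        .lines()
        .filter(|line| line.starts_with("## "))
        .count();
    sections > max_sections
}

fn main() {
    // A consolidated skill with four modules would be flagged for splitting.
    let doc = "# Browser skill\n## Navigation\n## Forms\n## Downloads\n## Cookies\n";
    println!("split needed: {}", exceeds_section_limit(doc, 3));
}
```

Such a check could run in CI over bundled skills and over any auto-generated SKILL.md before it enters the skill store.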
Implementation Sketch
- Short-term (LOW complexity): add a `max_content_sections` guideline to the skill-creator SKILL.md template; constrain auto-generated skills to ≤3 sections
- Medium-term (MEDIUM): implement a local SkillsBench-style evaluation harness that scores skill-matched vs. no-skill task success rates across a fixed prompt set
- Long-term: gate skill promotion in self-learning on task-success score, not just correction absence
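The medium- and long-term items could combine into a promotion gate along these lines (all types, field names, and thresholds are hypothetical, not Zeph's actual API):

```rust
/// Sketch of a SkillsBench-style promotion gate. For each evaluated
/// domain it compares task pass rates with and without the candidate
/// skill; the skill is promoted only if every domain improves by at
/// least `min_gain_pp` percentage points. Requiring a gain in *every*
/// domain encodes the paper's finding that self-generated skills can
/// regress performance in some domains even when the average is flat.
#[derive(Debug)]
struct EvalRun {
    domain: String,
    with_skill_passes: u32,
    no_skill_passes: u32,
    total_tasks: u32,
}

fn should_promote(runs: &[EvalRun], min_gain_pp: f64) -> bool {
    runs.iter().all(|r| {
        let with = 100.0 * r.with_skill_passes as f64 / r.total_tasks as f64;
        let without = 100.0 * r.no_skill_passes as f64 / r.total_tasks as f64;
        with - without >= min_gain_pp
    })
}

fn main() {
    let runs = vec![
        EvalRun { domain: "browser".into(), with_skill_passes: 18, no_skill_passes: 12, total_tasks: 20 },
        EvalRun { domain: "os-automation".into(), with_skill_passes: 11, no_skill_passes: 13, total_tasks: 20 },
    ];
    // The skill regresses os-automation, so the gate rejects promotion
    // even though the average across both domains is positive.
    println!("promote: {}", should_promote(&runs, 1.0));
}
```

Wiring this into the self-learning pipeline would replace the current correction-absence check with a task-success criterion, as the long-term item proposes.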
Source: https://arxiv.org/abs/2602.12670