Summary
AgentBench needs a structured artifact model for storing benchmark inputs, run outputs, scores, and leaderboard-ready summaries.
Why this matters
An evaluation repo is much more useful when results can be compared over time without relying on ad hoc logs or one-off spreadsheets.
Suggested scope
- Define the canonical artifact layout for benchmark inputs, raw outputs, metrics, and metadata (see the layout sketch after this list)
- Add a stable run identifier and a summary record for each evaluation run
- Document how leaderboard summaries are derived from raw run artifacts (see the aggregation sketch below)
- Capture the minimum provenance needed to reproduce a reported score later
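To make the first two points concrete, here is a minimal sketch of what a per-run artifact directory and summary record could look like. The directory names, the RunSummary fields, and the run_id scheme are all assumptions for illustration, not a settled design.

```python
# Hypothetical artifact layout (one directory per run; all names are assumptions):
#
#   runs/<run_id>/
#     inputs/          benchmark tasks as given to the agent
#     outputs/         raw agent transcripts / responses
#     metrics.json     per-task and aggregate scores
#     summary.json     the RunSummary record below
#
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json
import uuid


def new_run_id() -> str:
    """Timestamp plus a short random suffix, so run IDs sort chronologically."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{ts}-{uuid.uuid4().hex[:8]}"


@dataclass
class RunSummary:
    """Leaderboard-ready summary plus the provenance needed to reproduce it."""
    run_id: str
    benchmark: str                 # e.g. "agentbench/os-interaction"
    model: str                     # model or agent identifier under test
    scores: dict                   # metric name -> aggregate value
    started_at: str
    finished_at: str
    # Minimum provenance: enough to rerun and get the same reported score.
    code_version: str = ""         # git commit of the harness
    benchmark_version: str = ""    # dataset / task revision
    config: dict = field(default_factory=dict)  # sampling params, seeds, etc.


def write_summary(summary: RunSummary, run_dir: str) -> None:
    """Persist the summary next to the raw artifacts it describes."""
    with open(f"{run_dir}/summary.json", "w") as f:
        json.dump(asdict(summary), f, indent=2)
```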
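And a minimal sketch of deriving a leaderboard summary from raw run artifacts, assuming the summary.json records above. The aggregation rule (latest run per model wins) is only a placeholder for whatever policy this issue settles on; the point is that each leaderboard row keeps its run_id, so a reported score stays traceable to the raw data.

```python
import json
from pathlib import Path


def build_leaderboard(runs_root: str, benchmark: str, metric: str) -> list[dict]:
    """Collect summary.json files under runs_root and keep the latest run per model.

    Each leaderboard row carries the run_id so a reported score can be traced
    back to the raw artifacts in runs/<run_id>/.
    """
    latest: dict[str, dict] = {}
    for summary_path in Path(runs_root).glob("*/summary.json"):
        summary = json.loads(summary_path.read_text())
        if summary.get("benchmark") != benchmark or metric not in summary.get("scores", {}):
            continue
        prev = latest.get(summary["model"])
        if prev is None or summary["finished_at"] > prev["finished_at"]:
            latest[summary["model"]] = summary

    rows = [
        {"model": s["model"], metric: s["scores"][metric], "run_id": s["run_id"]}
        for s in latest.values()
    ]
    return sorted(rows, key=lambda r: r[metric], reverse=True)
```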
Done when
- Benchmark runs produce structured artifacts in the documented layout, keyed by a stable run identifier
- Scores can be compared consistently across runs and over time
- Every leaderboard entry is traceable back to the raw evaluation data that produced it