Add result dataset and leaderboard artifact model for benchmark runs #3

@MukundaKatta

Description

Summary

AgentBench needs a structured artifact model for storing benchmark inputs, run outputs, scores, and leaderboard-ready summaries.

Why this matters

An evaluation repo is much more useful when results can be compared over time without relying on ad hoc logs or one-off spreadsheets.

Suggested scope

  • Define the canonical artifact layout for benchmark inputs, raw outputs, metrics, and metadata (see the sketch after this list)
  • Add a stable run identifier and summary record for each evaluation run
  • Document how leaderboard summaries are derived from raw run artifacts
  • Capture the minimum provenance needed to reproduce a reported score later
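For concreteness, here is a minimal Python sketch of what such an artifact model could look like. Everything here is an assumption for discussion, not an existing AgentBench API: the class names (`Provenance`, `RunArtifact`), the field set, and the JSON serialization are all hypothetical.

```python
# Hypothetical artifact model sketch; names and fields are illustrative
# assumptions, not part of any existing AgentBench code.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json
import uuid


@dataclass
class Provenance:
    """Minimum information needed to reproduce a reported score later."""
    benchmark_version: str
    model_id: str
    config_hash: str   # hash of the full evaluation config
    code_revision: str # e.g. the git commit SHA the run was produced from


@dataclass
class RunArtifact:
    """One evaluation run: inputs, raw outputs, scores, and metadata."""
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    inputs_path: str = ""   # path to the frozen benchmark inputs
    outputs_path: str = ""  # path to the raw model outputs
    metrics: dict[str, float] = field(default_factory=dict)
    provenance: Provenance | None = None

    def to_json(self) -> str:
        """Serialize the summary record for storage alongside raw outputs."""
        return json.dumps(asdict(self), indent=2)
```

Pinning both a config hash and a code revision in the provenance record is one way to make "reproduce a reported score" checkable rather than aspirational, though the exact fields are open for discussion.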

Done when

  • Benchmark runs produce structured artifacts
  • Results can be compared consistently across runs
  • Leaderboard outputs are traceable to raw evaluation data (see the derivation sketch below)
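As one possible way to make that traceability concrete, the sketch below derives leaderboard rows from per-run summary files. The `summary.json` filename, the directory layout, and the best-score-per-model aggregation rule are all assumptions; the point is only that each leaderboard row carries the `run_id` of the artifact it came from.

```python
# Hypothetical leaderboard derivation; the schema and aggregation rule
# here are assumptions for illustration, not a spec.
import json
from pathlib import Path


def leaderboard_rows(runs_dir: Path, metric: str = "accuracy") -> list[dict]:
    """Read every run summary and keep the best score per model,
    carrying the run_id so each row stays traceable to raw data."""
    best: dict[str, dict] = {}
    for summary_file in sorted(runs_dir.glob("*/summary.json")):
        run = json.loads(summary_file.read_text())
        model = run["provenance"]["model_id"]  # assumes provenance is present
        score = run["metrics"].get(metric)
        if score is None:
            continue  # this run did not report the leaderboard metric
        if model not in best or score > best[model]["score"]:
            best[model] = {
                "model_id": model,
                "score": score,
                "run_id": run["run_id"],  # link back to the raw artifact
            }
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)
```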
