Summary
AgentBench needs a structured artifact model for storing benchmark inputs, run outputs, scores, and leaderboard-ready summaries.
Why this matters
An evaluation repo is much more useful when results can be compared over time without relying on ad hoc logs or one-off spreadsheets.
Suggested scope
- Define the canonical artifact layout for benchmark inputs, raw outputs, metrics, and metadata (see the layout sketch after this list)
- Add a stable run identifier and a summary record for each evaluation run
- Document how leaderboard summaries are derived from raw run artifacts (see the aggregation sketch below)
- Capture the minimum provenance needed to reproduce a reported score later
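To make the first two points concrete, here is a minimal sketch of what a per-run artifact directory and summary record could look like. The directory names, the RunSummary fields, and the run_id scheme are all assumptions for illustration, not a settled design.

```python
# Hypothetical artifact layout (one directory per run; all names are assumptions):
#
#   runs/<run_id>/
#     inputs/          benchmark tasks as given to the agent
#     outputs/         raw agent transcripts / responses
#     metrics.json     per-task and aggregate scores
#     summary.json     the RunSummary record below
#
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json
import uuid


def new_run_id() -> str:
    """Timestamp plus a short random suffix, so run IDs sort chronologically."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{ts}-{uuid.uuid4().hex[:8]}"


@dataclass
class RunSummary:
    """Leaderboard-ready summary plus the provenance needed to reproduce it."""
    run_id: str
    benchmark: str                 # e.g. "agentbench/os-interaction"
    model: str                     # model or agent identifier under test
    scores: dict                   # metric name -> aggregate value
    started_at: str
    finished_at: str
    # Minimum provenance: enough to rerun and get the same reported score.
    code_version: str = ""         # git commit of the harness
    benchmark_version: str = ""    # dataset / task revision
    config: dict = field(default_factory=dict)  # sampling params, seeds, etc.


def write_summary(summary: RunSummary, run_dir: str) -> None:
    """Persist the summary next to the raw artifacts it describes."""
    with open(f"{run_dir}/summary.json", "w") as f:
        json.dump(asdict(summary), f, indent=2)
```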
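And a minimal sketch of deriving a leaderboard summary from raw run artifacts, assuming the summary.json records above. The aggregation rule (latest run per model wins) is only a placeholder for whatever policy this issue settles on; the point is that each leaderboard row keeps its run_id, so a reported score stays traceable to the raw data.

```python
import json
from pathlib import Path


def build_leaderboard(runs_root: str, benchmark: str, metric: str) -> list[dict]:
    """Collect summary.json files under runs_root and keep the latest run per model.

    Each leaderboard row carries the run_id so a reported score can be traced
    back to the raw artifacts in runs/<run_id>/.
    """
    latest: dict[str, dict] = {}
    for summary_path in Path(runs_root).glob("*/summary.json"):
        summary = json.loads(summary_path.read_text())
        if summary.get("benchmark") != benchmark or metric not in summary.get("scores", {}):
            continue
        prev = latest.get(summary["model"])
        if prev is None or summary["finished_at"] > prev["finished_at"]:
            latest[summary["model"]] = summary

    rows = [
        {"model": s["model"], metric: s["scores"][metric], "run_id": s["run_id"]}
        for s in latest.values()
    ]
    return sorted(rows, key=lambda r: r[metric], reverse=True)
```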
Done when
- Benchmark runs produce structured artifacts in the documented layout, keyed by a stable run identifier
- Scores can be compared consistently across runs and over time
- Every leaderboard entry is traceable back to the raw evaluation data that produced it