
Conversation

@jeffreywpli (Contributor) commented Sep 5, 2025

Fixed centered averaging for Core and Extended evaluations as discussed in #114.

TL;DR: Centered averages depend on baseline values for individual tasks. We had set some baselines incorrectly because we were unaware of discrepancies between some llmfoundry evaluations and their original forms. After fixing these baselines, the new Core/Extended averages (which we'll refer to as Core_v2, Extended_v2) will be a bit lower than before (the old ones are renamed Core_v1, Extended_v1). While numbers will no longer be directly comparable between v1 and v2, our empirical analysis suggests that model rankings remain highly consistent (Spearman correlation > 0.999).

[Figures: centered_metric_fix_core, centered_average_fix_extended — plots comparing v1 and v2 centered averages for Core and Extended]
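
For context, here is a minimal sketch of how the baselines feed into centered averaging, assuming the common (score − baseline) / (1 − baseline) form; the function and variable names are illustrative, not the exact code in eval/aggregated_metrics.py:

```python
# Minimal sketch of centered averaging (illustrative; not the exact code in
# eval/aggregated_metrics.py). Assumes each task's baseline is the per-task value
# stored in eval_meta_data.csv (v2) or eval_meta_data_v1.csv (v1).

def centered_score(raw_acc: float, baseline: float) -> float:
    """Rescale accuracy so the baseline maps to 0.0 and a perfect score to 1.0."""
    return (raw_acc - baseline) / (1.0 - baseline)

def centered_average(raw_scores: dict[str, float], baselines: dict[str, float]) -> float:
    """Average centered scores over the tasks in an aggregate (e.g. Core or Extended)."""
    return sum(
        centered_score(raw_scores[task], baselines[task]) for task in raw_scores
    ) / len(raw_scores)

# Because changing a baseline applies the same monotone rescaling to every model's
# score on that task, fixing the baselines shifts the aggregate values (v1 vs. v2)
# while leaving model rankings nearly unchanged.
```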

This PR makes recording the "fixed" v2 averages the default behavior, while retaining the old scores and the option to compute them for posterity. More specifically, it:

  • Modifies baseline values in eval_meta_data.csv; the old version is kept and renamed eval_meta_data_v1.csv.
  • Modifies eval/aggregated_metrics.py to allow for computing v1 and v2 averages
    • Previous users can use this script to recompute v2 averages and update their existing evaluation JSONs; old Core and Extended scores will be renamed Core_v1 and Extended_v1 (see the sketch after this list).
    • New users generating new eval JSONs with the evaluation script will get v2 averages by default, but can later add keys for v1 averages by running eval/aggregated_metrics.py with --version v1.
  • Adds documentation for these changes at the top of the main README
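
As a rough illustration of the JSON update a previous user's results go through (the exact key names and flat JSON layout here are assumptions; eval/aggregated_metrics.py performs the real update):

```python
# Illustrative sketch only: the real renaming/recomputation is handled by
# eval/aggregated_metrics.py. The key names and flat JSON layout are assumptions.
import json

def migrate_eval_json(path: str, core_v2: float, extended_v2: float) -> None:
    """Rename old aggregate scores to *_v1 and record the recomputed v2 averages."""
    with open(path) as f:
        results = json.load(f)
    # Keep the old (pre-fix) averages for posterity under v1 names.
    if "Core" in results:
        results["Core_v1"] = results.pop("Core")
    if "Extended" in results:
        results["Extended_v1"] = results.pop("Extended")
    # Record the averages recomputed with the corrected baselines.
    results["Core_v2"] = core_v2
    results["Extended_v2"] = extended_v2
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
```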

@jeffreywpli jeffreywpli merged commit 361714b into main Sep 9, 2025