
Conversation

@jeffreywpli (Contributor) commented Sep 5, 2025

Fixed centered averaging for Core and Extended evaluations as discussed in #114.

TL;DR: Centered averages depend on baseline values for individual tasks. We had set some baselines incorrectly because we were unaware of discrepancies between some llmfoundry evaluations and their original forms. After fixing these baselines, the new Core/Extended averages (which we'll refer to as Core_v2, Extended_v2) will be a bit lower than before (the old ones are renamed Core_v1, Extended_v1). While numbers will no longer be directly comparable between v1 and v2, our empirical analysis suggests that model rankings remain highly consistent (Spearman correlation > 0.999).

[Figures: centered_metric_fix_core, centered_average_fix_extended — plots comparing v1 and v2 centered averages for Core and Extended]
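
For context, here is a minimal sketch of how the baselines feed into centered averaging, assuming the common (score − baseline) / (1 − baseline) form; the function and variable names are illustrative, not the exact code in eval/aggregated_metrics.py:

```python
# Minimal sketch of centered averaging (illustrative; not the exact code in
# eval/aggregated_metrics.py). Assumes each task's baseline is the per-task value
# stored in eval_meta_data.csv (v2) or eval_meta_data_v1.csv (v1).

def centered_score(raw_acc: float, baseline: float) -> float:
    """Rescale accuracy so the baseline maps to 0.0 and a perfect score to 1.0."""
    return (raw_acc - baseline) / (1.0 - baseline)

def centered_average(raw_scores: dict[str, float], baselines: dict[str, float]) -> float:
    """Average centered scores over the tasks in an aggregate (e.g. Core or Extended)."""
    return sum(
        centered_score(raw_scores[task], baselines[task]) for task in raw_scores
    ) / len(raw_scores)

# Because changing a baseline applies the same monotone rescaling to every model's
# score on that task, fixing the baselines shifts the aggregate values (v1 vs. v2)
# while leaving model rankings nearly unchanged.
```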

This PR makes recording the "fixed" v2 averages the default behavior, while retaining the old scores and the option to compute them for posterity. More specifically, it:

  • Modifies baseline values in eval_meta_data.csv; the old version is kept and renamed eval_meta_data_v1.csv.
  • Modifies eval/aggregated_metrics.py to allow for computing v1 and v2 averages
    • Previous users can use this script to recompute v2 averages and update their existing evaluation JSONs; old Core and Extended scores will be renamed Core_v1 and Extended_v1 (see the sketch after this list).
    • New users generating new eval JSONs with the evaluation script will get v2 averages by default, but can later add keys for v1 averages by running eval/aggregated_metrics.py with --version v1.
  • Adds documentation for these changes at the top of the main README
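
As a rough illustration of the JSON update a previous user's results go through (the exact key names and flat JSON layout here are assumptions; eval/aggregated_metrics.py performs the real update):

```python
# Illustrative sketch only: the real renaming/recomputation is handled by
# eval/aggregated_metrics.py. The key names and flat JSON layout are assumptions.
import json

def migrate_eval_json(path: str, core_v2: float, extended_v2: float) -> None:
    """Rename old aggregate scores to *_v1 and record the recomputed v2 averages."""
    with open(path) as f:
        results = json.load(f)
    # Keep the old (pre-fix) averages for posterity under v1 names.
    if "Core" in results:
        results["Core_v1"] = results.pop("Core")
    if "Extended" in results:
        results["Extended_v1"] = results.pop("Extended")
    # Record the averages recomputed with the corrected baselines.
    results["Core_v2"] = core_v2
    results["Extended_v2"] = extended_v2
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
```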

@jeffreywpli jeffreywpli merged commit 361714b into main Sep 9, 2025