Fixes to Random Baseline #115
Merged
Fixed centered averaging for Core and Extended evaluations as discussed in #114.
TL;DR: Centered averages depend on baseline values for individual tasks. We had set some baselines incorrectly because we were unaware of discrepancies between some `llmfoundry` evaluations and their original forms. After fixing these baselines, the new Core/Extended averages (which we'll refer to as `Core_v2`, `Extended_v2`) will be a bit lower than before (renamed as `Core_v1`, `Extended_v1`). While numbers will no longer be directly comparable between `v1` and `v2`, our empirical analysis suggests rankings between models should remain highly consistent (Spearman correlation > 0.999).
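For intuition, here is a minimal sketch of how an incorrect random baseline shifts a centered average. The centering formula, task names, and numbers below are illustrative assumptions, not the exact values or code from `eval/aggregated_metrics.py`:

```python
# Illustrative only: the centering formula and baselines below are assumptions,
# not the exact values or code used in this repo.

def centered_score(raw_accuracy: float, random_baseline: float) -> float:
    """Rescale raw accuracy so random guessing maps to 0 and a perfect score to 1."""
    return (raw_accuracy - random_baseline) / (1.0 - random_baseline)

raw_scores   = {"task_a": 0.62, "task_b": 0.55}
baselines_v1 = {"task_a": 0.25, "task_b": 0.25}  # task_b baseline set incorrectly
baselines_v2 = {"task_a": 0.25, "task_b": 0.50}  # corrected baseline

core_v1 = sum(centered_score(raw_scores[t], baselines_v1[t]) for t in raw_scores) / len(raw_scores)
core_v2 = sum(centered_score(raw_scores[t], baselines_v2[t]) for t in raw_scores) / len(raw_scores)
print(round(core_v1, 3), round(core_v2, 3))  # 0.447 vs. 0.297: the corrected average is lower
```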
This PR makes recording the "fixed" `v2` averages the default behavior while also retaining old scores and the option to compute them for posterity. More specifically, it:

- Fixes the per-task baselines in `eval_meta_data.csv`. We keep and rename the old version as `eval_meta_data_v1.csv`.
- Updates `eval/aggregated_metrics.py` to allow for computing both `v1` and `v2` averages.
- For existing models, computes `v2` averages and updates their existing evaluation JSONs. Old `Core` and `Extended` scores will be renamed as `Core_v1` and `Extended_v1`.
- For new models, records `v2` averages from the evaluation script when generating new eval JSONs; keys for `v1` averages can later be added by running `eval/aggregated_metrics.py` with `--version v1` (see the sketch after this list).
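A rough sketch of how the `--version` toggle described above might look; the actual argument parsing, CSV schema, and JSON key handling in `eval/aggregated_metrics.py` are assumptions here and may differ:

```python
# Hypothetical sketch of version selection; real column names, paths, and
# function signatures in eval/aggregated_metrics.py may differ.
import argparse
import csv

def load_baselines(version: str) -> dict:
    """Load per-task random baselines from the metadata CSV for the requested version."""
    path = "eval_meta_data.csv" if version == "v2" else "eval_meta_data_v1.csv"
    with open(path, newline="") as f:
        # assumed column names: "task" and "random_baseline"
        return {row["task"]: float(row["random_baseline"]) for row in csv.DictReader(f)}

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--version", choices=["v1", "v2"], default="v2")
    args = parser.parse_args()
    baselines = load_baselines(args.version)
    # ... compute the Core/Extended centered averages for the requested version
    # and write them back into the model's evaluation JSON ...
```

With something along these lines, eval JSONs that only carry `v2` keys could be backfilled with `Core_v1`/`Extended_v1` by re-running the script with `--version v1`.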