This is the official repo for the following paper:
- *The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions*, Xiang Zhou, Yixin Nie, Hao Tan, and Mohit Bansal, EMNLP 2020 (arXiv)
This code requires Python 3. All the dependencies are specified in `requirements.txt`:

```
pip install -r requirements.txt
```
The current code supports the calculation of decomposed variance metrics from standard evaluation numbers.
- Download the NLI datasets and put them under the `nli_data` folder in the root directory
- Organize the evaluation results of your model under the `models` directory in the same way as the `berts` folder (an example folder showing the results of BERT-base), with the folder name representing the model type (see the example layout after this list)
  - `MODEL_TYPE/seed_x` saves the evaluation results obtained with seed `x`
  - Inside `MODEL_TYPE/seed_x/`, each folder represents the evaluation results on one dataset and contains three files:
    - `eval_results.txt`: final accuracy of the model
    - `logits_results.txt`: list of logits output by the model on every example in the dataset
    - `pred_results.txt`: list of labels predicted by the model on every example in the dataset
- Run the evaluation script with

  ```
  python variance_report.py MODEL_PATH
  ```
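For reference, under these conventions the example `berts` folder would look roughly like the sketch below; the seed and dataset folder names here are placeholders, and the actual example folder in the repo may differ.

```
models/
└── berts/
    ├── seed_1/
    │   ├── DATASET_A/
    │   │   ├── eval_results.txt
    │   │   ├── logits_results.txt
    │   │   └── pred_results.txt
    │   └── DATASET_B/
    │       └── ...
    └── seed_2/
        └── ...
```

With such a layout, `MODEL_PATH` would presumably point at a model-type folder, e.g. `python variance_report.py models/berts` for the example results.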
Other scripts (training/evaluation/analysis) and model checkpoints that are used in the paper will come soon.
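In the meantime, the sketch below illustrates one way the per-seed `eval_results.txt` accuracies in the layout above could be aggregated into per-dataset means and variances across seeds. It is only an illustration, not the repo's `variance_report.py` or the paper's full variance decomposition, and it assumes the accuracy can be parsed as the first number in `eval_results.txt`, which may not match the actual file format.

```python
# Illustrative sketch only -- not the repo's variance_report.py.
# Assumes the layout described above: MODEL_PATH/seed_x/DATASET/eval_results.txt.
import os
import re
import statistics
import sys
from collections import defaultdict


def read_accuracy(path):
    """Return the first number found in an eval_results.txt file (format assumed)."""
    with open(path) as f:
        match = re.search(r"[-+]?\d*\.\d+|\d+", f.read())
    return float(match.group()) if match else None


def per_dataset_variance(model_path):
    """Collect accuracies across seed_* folders and print mean/variance per dataset."""
    accs = defaultdict(list)  # dataset name -> accuracies over seeds
    for seed_dir in sorted(os.listdir(model_path)):
        if not seed_dir.startswith("seed_"):
            continue
        for dataset in sorted(os.listdir(os.path.join(model_path, seed_dir))):
            result_file = os.path.join(model_path, seed_dir, dataset, "eval_results.txt")
            if os.path.isfile(result_file):
                acc = read_accuracy(result_file)
                if acc is not None:
                    accs[dataset].append(acc)
    for dataset, values in sorted(accs.items()):
        mean = statistics.mean(values)
        var = statistics.pvariance(values) if len(values) > 1 else 0.0
        print(f"{dataset}: mean acc = {mean:.4f}, variance across seeds = {var:.6f}")


if __name__ == "__main__":
    per_dataset_variance(sys.argv[1])  # e.g. models/berts
```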