DATE-LM is a benchmark suite for evaluating data attribution methods on real-world applications of large language models (LLMs), created by researchers from Carnegie Mellon University, the University of Illinois at Urbana-Champaign, and the University of Michigan.
Our evaluation framework contains:
- 🧩 Modular pipeline supporting attribution scoring, subset selection, and task evaluation (see the sketch after this list)
- 🔌 Plug-and-play support for new attribution methods
- 📈 Pre-trained and fine-tuned model checkpoints for reproducibility and efficiency
- 🏆 Public leaderboard for standardized benchmarking and community engagement
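To make the modular design concrete, below is a minimal sketch of how the three stages (attribution scoring, subset selection, task evaluation) might compose. The registry, function names, and signatures are illustrative assumptions, not DATE-LM's actual API.

```python
# Illustrative sketch of a modular attribution pipeline.
# Names and structure are assumptions, not DATE-LM's actual API.
from typing import Callable, Dict, List

# Plug-and-play: attribution methods register a scoring callable that maps
# (model, reference_data, train_data) -> one score per training example.
ATTRIBUTION_METHODS: Dict[str, Callable] = {}


def register_method(name: str):
    """Register a scoring callable under a method name."""
    def decorator(fn: Callable) -> Callable:
        ATTRIBUTION_METHODS[name] = fn
        return fn
    return decorator


def run_pipeline(method: str, model, reference_data, train_data, k: int) -> List[int]:
    """Score the training data with the chosen method, then select the
    top-k examples to hand off to task-specific evaluation."""
    scores = ATTRIBUTION_METHODS[method](model, reference_data, train_data)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]  # indices of the selected training subset
```

Under this pattern, adding a new attribution method only requires registering a callable with the scoring signature; the selection and evaluation stages stay unchanged.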
We support three core evaluation tasks:
- Training Data Selection: pre-training and SFT
- Toxicity/Bias Filtering
- Factual Attribution
Install the required dependencies using the files env.yml and requirements.txt:
conda env create --file env.yml --name myenv
conda activate myenv
pip install -r requirements.txt

The evaluation process consists of the following steps:
- Download Datasets and Models: Choose a task and follow the preparation steps to download the datasets and models.
- Run Attribution Scoring: Define a scoring function that takes a model checkpoint, a reference dataset, and a training dataset, and outputs an attribution score for each datapoint in the training dataset (a minimal sketch follows this list).
- Run Task-Specific Evaluation: Execute the evaluation pipeline for the selected task.
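As a rough illustration of the scoring interface described in the second step, the sketch below assumes a Hugging Face-style causal LM (the forward pass returns a loss when labels are passed) and per-example batches. The function name `score_training_data` and the gradient-similarity heuristic are assumptions for illustration, not DATE-LM's actual implementation.

```python
# Hypothetical scoring function matching the interface described above:
# (model checkpoint, reference dataset, training dataset) -> one score per
# training example. The gradient-similarity heuristic is illustrative only.
import torch


def _loss_grad(model, batch):
    """Flattened gradient of the LM loss for a single (per-example) batch."""
    model.zero_grad()
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    return torch.cat(
        [p.grad.detach().flatten() for p in model.parameters() if p.grad is not None]
    )


def score_training_data(model, reference_batches, train_batches):
    """Score each training example by the cosine similarity between its loss
    gradient and the mean gradient over the reference dataset."""
    ref_grad = torch.stack([_loss_grad(model, b) for b in reference_batches]).mean(dim=0)
    scores = []
    for batch in train_batches:
        g = _loss_grad(model, batch)
        scores.append(torch.nn.functional.cosine_similarity(g, ref_grad, dim=0).item())
    return scores
```

This is only a sketch: gradient-based attribution at LLM scale typically relies on projections or other approximations to keep memory and compute manageable.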
Detailed steps and tutorials are available in the accompanying documentation files.
Please see the following link for instructions on submitting to the leaderboard: https://huggingface.co/spaces/DataAttributionEval/DATE-LM-Leaderboard
Please use the following citation to reference our work:
@inproceedings{2025datelm,
title={DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models},
author={Cathy Jiao and Yijun Pan and Emily Xiao and Daisy Sheng and Niket Jain and Hanzhang Zhao and Ishita Dasgupta and Jiaqi W. Ma and Chenyan Xiong},
year={2025},
booktitle={NeurIPS},
url={https://arxiv.org/abs/2507.09424},
}
This repository includes components from the following projects:
@inproceedings{NEURIPS2024_f7326833,
author = {Deng, Junwei and Li, Ting-Wei and Zhang, Shiyuan and Liu, Shixuan and Pan, Yijun and Huang, Hao and Wang, Xinhe and Hu, Pingbang and Zhang, Xingjian and Ma, Jiaqi},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
pages = {136763--136781},
publisher = {Curran Associates, Inc.},
title = {{\texttt{dattri}}: A Library for Efficient Data Attribution},
url = {https://proceedings.neurips.cc/paper_files/paper/2024/file/f732683302d91e47610b2416b4977a66-Paper-Datasets_and_Benchmarks_Track.pdf},
volume = {37},
year = {2024}
}
@misc{ivisondata2025,
title={{Large-Scale Data Selection for Instruction Tuning}},
author={Hamish Ivison and Muru Zhang and Faeze Brahman and Pang Wei Koh and Pradeep Dasigi},
year={2025},
eprint={2503.01807},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.01807},
}
@inproceedings{yu2024mates,
title={MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models},
author={Yu, Zichun and Das, Spandan and Xiong, Chenyan},
booktitle={NeurIPS},
year={2024}
}
