This is the code repository for the ICML'25 paper "How much can we forget about Data Contamination?".
This repository contains the code to reproduce the results in our paper. We also provide the deduplicated benchmark questions, as well as model checkpoints and our training logs for OLMo.
Many popular LLM benchmarks contain duplicate questions. For this project, we created a universe of 44,000 benchmark questions whose ground-truth answers were deduplicated using fuzzy string matching (for details, see the section "Filtering Near-Duplicate Benchmark Questions" in our paper). We provide the deduplicated benchmark questions on Hugging Face:
- https://huggingface.co/datasets/sbordt/forgetting-contamination-hellaswag
- https://huggingface.co/datasets/sbordt/forgetting-contamination-winogrande
- https://huggingface.co/datasets/sbordt/forgetting-contamination-piqa
- https://huggingface.co/datasets/sbordt/forgetting-contamination-boolq
- https://huggingface.co/datasets/sbordt/forgetting-contamination-arc-easy
- https://huggingface.co/datasets/sbordt/forgetting-contamination-mmlu
- https://huggingface.co/datasets/sbordt/forgetting-contamination-social_i_qa
The format of the benchmark questions is the same as in the respective original repositories, with two additional columns:
- `options`: Provides the format of the benchmark questions that we used in the paper.
- `split-id`: Partitions the benchmark questions into the subsets used for contamination. We used the questions with `split-id=0` as holdout; the other splits were contaminated as specified here.
You can use the provided repositories in the same way that you would use the original repositories (because the format of the questions remains the same).
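For example, here is a minimal sketch of loading one of the datasets with the Hugging Face `datasets` library and separating the holdout questions. The `"train"` split name is an assumption; check the dataset cards for the exact split names.

```python
# A minimal sketch of loading one of the deduplicated benchmark datasets.
# Assumption: the data lives in the default "train" split -- check the
# dataset card on Hugging Face for the exact split names.
from datasets import load_dataset

ds = load_dataset("sbordt/forgetting-contamination-hellaswag", split="train")
print(ds.column_names)  # original columns plus "options" and "split-id"

# split-id = 0 is the holdout; the remaining split ids were contaminated.
holdout = ds.filter(lambda row: row["split-id"] == 0)
contaminated = ds.filter(lambda row: row["split-id"] != 0)
print(len(holdout), len(contaminated))
```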
Here is a brief overview of the code:
- `llm.c/`: The code to contaminate, train, and evaluate small models.
- `evaluation/`: The code to format benchmark questions, de-duplicate them, and generate the different splits that we use for contamination.
- `olmo/`: The code to download specific training batches, contaminate them, and insert them back into the OLMo pre-training data.
- `compute_results.ipynb`: Compute accuracies and confidence intervals (an illustrative sketch follows this list).
- `figures.ipynb`: Generate the figures in the paper.
- `forgetting_curves.ipynb`: Generate the forgetting curves.
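As an illustration of what such an evaluation produces, the sketch below attaches a 95% confidence interval to a benchmark accuracy using the normal approximation. This is an assumption for exposition only; `compute_results.ipynb` contains the method actually used for the paper.

```python
# An illustrative sketch of an accuracy estimate with a 95% confidence
# interval via the normal approximation. This is an assumption for
# exposition -- see compute_results.ipynb for the method used in the paper.
import math

def accuracy_with_ci(n_correct: int, n_total: int, z: float = 1.96):
    """Accuracy with a symmetric normal-approximation confidence interval."""
    p = n_correct / n_total
    half_width = z * math.sqrt(p * (1.0 - p) / n_total)
    return p, (p - half_width, p + half_width)

acc, (low, high) = accuracy_with_ci(882, 2000)  # hypothetical counts
print(f"accuracy = {acc:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```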
The OLMo experiments depend on https://github.com/allenai/OLMo. We worked with the repository at commit ca81901eca2faa1947ced49ce5c5cef729203db1.
We provide the final model checkpoints for both the small models and OLMo. The checkpoints for the small models contain the model state dicts (among other entries). See our eval script for how to load the checkpoints; a minimal loading sketch is also given below the list of checkpoints.
- All small checkpoints: Link to Google Drive Folder
- Figure 1(a): 124M, 350M, 774M, 1.6B
- Figure 1(b): 2x, 4x, 8x, 15x
- Figure 1(c): 124M, 350M, 774M, 1.6B
- OLMo: OLMo-1B, OLMo-7B
You can use these checkpoints to perform additional evaluations.
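For illustration, here is a hedged sketch of inspecting one of the small-model checkpoints with PyTorch. The filename and the `"model"` key are assumptions; the eval script in `llm.c/` documents the actual checkpoint layout.

```python
# A hedged sketch of inspecting a small-model checkpoint with PyTorch.
# The filename and the "model" key are assumptions -- the eval script in
# llm.c/ documents the actual checkpoint layout.
import torch

ckpt = torch.load("ckpt_124M.pt", map_location="cpu")  # hypothetical filename
print(list(ckpt.keys()))           # see what the checkpoint contains
state_dict = ckpt["model"]         # assumed key for the model state dict
# model.load_state_dict(state_dict)  # load into a matching GPT-2-style module
```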
In addition, the results of the evaluations in the paper for the small models are contained in an `evals` folder for every checkpoint.
For reproducibility, we additionally share the following Weights & Biases logs:
- OLMo-1B: Link
- OLMo-7B: Link
In Figure 1 of the paper, we report the accuracy differences between the different splits of benchmark questions. Here, we additionally provide the absolute accuracies (in percent) of the respective splits.
Figure 1(a):

| Model Size | Holdout | 4x | 12x | 32x | 144x |
|---|---|---|---|---|---|
| 124M | 44.16 | 49.54 | 54.98 | 73.05 | 93.20 |
| 350M | 44.72 | 55.69 | 69.90 | 89.20 | 95.50 |
| 774M | 45.78 | 67.30 | 85.16 | 94.65 | 97.25 |
| 1558M | 46.90 | 75.48 | 91.04 | 95.70 | 97.55 |
Figure 1(b):

| Chinchilla Multiplier | Holdout | 4x | 12x | 32x | 144x |
|---|---|---|---|---|---|
| 2x Chinchilla | 43.31 | 50.40 | 59.84 | 80.85 | 94.85 |
| 4x Chinchilla | 44.52 | 50.75 | 58.10 | 78.35 | 93.65 |
| 8x Chinchilla | 45.14 | 49.16 | 51.84 | 64.15 | 85.15 |
| 15x Chinchilla | 46.45 | 48.51 | 47.88 | 51.20 | 67.10 |
Figure 1(c):

| Model Size | Holdout | 4x | 12x | 32x | 144x |
|---|---|---|---|---|---|
| 124M | 42.22 | 48.14 | 56.92 | 80.70 | 96.45 |
| 350M | 44.72 | 55.69 | 69.90 | 89.20 | 95.50 |
| 774M | 49.16 | 64.76 | 81.30 | 92.95 | 96.05 |
| 1.6B | 52.06 | 67.61 | 82.32 | 91.85 | 95.40 |
The results for Figure 1(c) are also reported as Table 1 in the paper.
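The accuracy differences plotted in Figure 1 can be recovered from these absolute numbers. The sketch below does this for the 124M row of the Figure 1(a) table.

```python
# A minimal sketch recovering the accuracy *differences* plotted in Figure 1
# from the absolute accuracies above, using the 124M row of the Figure 1(a)
# table as an example.
holdout = 44.16
contaminated = {"4x": 49.54, "12x": 54.98, "32x": 73.05, "144x": 93.20}

for n_dup, acc in contaminated.items():
    print(f"{n_dup}: +{acc - holdout:.2f} accuracy points over holdout")
```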
If you use the code or the deduplicated benchmark questions in your research, we encourage you to cite our paper:
@inproceedings{bordt2025forgetting,
author = {Sebastian Bordt and Suraj Srinivas and Valentyn Boreiko and Ulrike von Luxburg},
title = {How much can we forget about Data Contamination?},
booktitle = {ICML},
year = {2025}
}