ScienceMeter: Tracking Scientific Knowledge Updates in Language Models

Yike Wang¹, Shangbin Feng¹, Yulia Tsvetkov¹, Hannaneh Hajishirzi¹ ²
¹University of Washington, ²Allen Institute for Artificial Intelligence

Dataset

We retrieve 1,000 journal or conference papers from each of 10 scientific domains using the Semantic Scholar API. For each paper, we also collect its citing papers, forming our raw corpus.

We filter out papers that lack citation information or abstracts, then regroup the remaining papers based on the knowledge cutoff date of a given model and the publication dates of the papers. This process yields 5,148 triplets of (prior paper, new paper, future paper). For each paper, we synthetically generate one SUPPORT claim (a uniquely supporting scientific claim) and one REFUTE claim (a relevant but non-supporting scientific claim). The resulting dataset is available in the filtered_with_claims folder.
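
A quick way to inspect the data is to load one domain file and print a few triplets. The snippet below is only a sketch: the per-domain file name and the JSON keys (prior, new, future, title, support_claim, refute_claim) are assumptions about the layout, not documented fields.

# minimal sketch of inspecting one domain's triplets
# (the file name and all field names below are hypothetical)
import json

with open("filtered_with_claims/computer_science.json") as f:
    triplets = json.load(f)

for triplet in triplets[:3]:
    # each triplet is assumed to hold a prior, new, and future paper
    for role in ("prior", "new", "future"):
        paper = triplet[role]
        print(role, paper["title"])
        print("  SUPPORT:", paper["support_claim"])
        print("  REFUTE: ", paper["refute_claim"])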

Evaluation of Scientific Knowledge

The eval_judgment.py and eval_generation.py scripts evaluate a specific type of scientific knowledge (selected via --knowledge) in the model, which is assumed to be a knowledge-updated version of the basemodel. If the model and basemodel are the same, the evaluation is performed on the basemodel itself. The --portion argument controls the fraction of the dataset used for evaluation.

Claim Judgment Task

# example
python eval_judgment.py \
  --basemodel llama \
  --model llama \
  --domain computer_science \
  --knowledge new \
  --portion 0.8

Claim Generation Task

# example
python eval_generation.py \
  --basemodel olmo32b \
  --model _ar_testdoc \
  --domain education \
  --knowledge future \
  --portion 1.0
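
Both evaluation scripts take the same arguments, so sweeping several domains and knowledge types can be scripted. The loop below is an illustrative sketch that only lists the domains mentioned in this README; it simply re-invokes eval_judgment.py with the flags shown above.

# illustrative sketch: sweep eval_judgment.py over domains and knowledge types
# (only the domains named in this README are listed; extend as needed)
import subprocess

domains = ["computer_science", "education", "political_science"]
knowledge_types = ["new", "future"]

for domain in domains:
    for knowledge in knowledge_types:
        subprocess.run([
            "python", "eval_judgment.py",
            "--basemodel", "llama",
            "--model", "llama",
            "--domain", domain,
            "--knowledge", knowledge,
            "--portion", "1.0",
        ], check=True)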

Evaluation of Knowledge Update Methods

The metrics.py script computes all eight evaluation metrics introduced in the paper, based on evaluation results obtained before (basemodel) and after (model) a knowledge update. The model is assumed to be a knowledge-updated version of the basemodel, using a specified update method (e.g., _ar_traintestdoc_it_trainqa).

# example
python metrics.py \
  --basemodel llama \
  --model _ar_traintestdoc_it_trainqa \
  --domain political_science \
  --task judgment
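
To compare update methods, metrics.py can be invoked once per updated model. The loop below is a sketch that reuses the model suffixes produced by the baselines in the next section and sticks to --task judgment, the only task value shown here.

# illustrative sketch: compute metrics for several update methods
# (model suffixes are the output names produced by the baseline scripts below)
import subprocess

update_methods = [
    "_ar_testdoc",                  # continual pre-training
    "_ar_traintestdoc_it_trainqa",  # standard instruction-tuning
    "_it_trainqadoc_ar_testdoc",    # pre-instruction-tuning
]

for method in update_methods:
    subprocess.run([
        "python", "metrics.py",
        "--basemodel", "llama",
        "--model", method,
        "--domain", "political_science",
        "--task", "judgment",
    ], check=True)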

Knowledge Update Baselines

The following examples show how to run training baselines using llama as the base model and computer_science as the target domain.

Continual Pre-training

python ar.py -bm llama -m llama -d computer_science -ds testdoc
# Output model will be saved as: llama/computer_science/_ar_testdoc

Standard Instruction-tuning

python ar.py -bm llama -m llama -d computer_science -ds traintestdoc
python it.py -bm llama -m _ar_traintestdoc -d computer_science -ds trainqa
# Output model will be saved as: llama/computer_science/_ar_traintestdoc_it_trainqa

Pre-instruction-tuning

python it.py -bm llama -m llama -d computer_science -ds trainqadoc
python ar.py -bm llama -m _it_trainqadoc -d computer_science -ds testdoc
# Output model will be saved as: llama/computer_science/_it_trainqadoc_ar_testdoc
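
As the saved paths above show, each training stage appends a _<method>_<dataset> segment to the model name. The helper below is not part of the released scripts; it just reproduces that naming convention for bookkeeping.

# helper (not part of the released scripts) that reproduces the output-name
# convention shown above: one "_<method>_<dataset>" segment per training stage
def output_model_name(stages):
    """stages: list of (method, dataset) tuples, e.g. [("ar", "testdoc")]."""
    return "".join(f"_{method}_{dataset}" for method, dataset in stages)

# matches the saved names in the examples above
assert output_model_name([("ar", "testdoc")]) == "_ar_testdoc"
assert output_model_name([("ar", "traintestdoc"), ("it", "trainqa")]) == "_ar_traintestdoc_it_trainqa"
assert output_model_name([("it", "trainqadoc"), ("ar", "testdoc")]) == "_it_trainqadoc_ar_testdoc"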

Questions

If you have any questions or comments about our paper, data, or scripts, or if you notice any issues in the code, feel free to reach out via email at [email protected]. We will do our best to respond within one business day.

Citing

If you find this work helpful, please consider starring this repository and citing our paper as shown below:

@article{wang2025sciencemeter,
  title={ScienceMeter: Tracking Scientific Knowledge Updates in Language Models},
  author={Wang, Yike and Feng, Shangbin and Tsvetkov, Yulia and Hajishirzi, Hannaneh},
  journal={arXiv preprint arXiv:2505.24302},
  year={2025}
}
