This repository contains the reference implementation for the experiments in our paper. As the work is still ongoing, further updates are expected.
Our paper is available here: https://arxiv.org/pdf/2507.10532
The RandomCalculation dataset files are located in `random_calculation/result`; you can also regenerate them manually if needed (see the dataset generation step below).
# Prepare the Python Environment

```bash
conda create -n llm-math-evaluation python=3.10
conda activate llm-math-evaluation
pip install -r requirements.txt
pip install flash_attn==2.7.0.post2
```

# Evaluate the Mathematical Ability of LLMs

```bash
cd math_evaluation
bash run_batch_task_math_qwen2.5.sh
```
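The batch script above drives the full evaluation pipeline. For orientation, here is a minimal sketch of what a single evaluation step could look like; the model name, prompt, and answer extraction below are illustrative assumptions, not the pipeline's actual code.

```python
# Minimal sketch of a single evaluation step (illustrative only; the real
# pipeline lives in math_evaluation/ and is driven by run_batch_task_math_qwen2.5.sh).
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # assumed checkpoint; the script targets Qwen2.5 models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

question = "What is 17 * 24?"  # toy question, not taken from the benchmark
messages = [{"role": "user", "content": question}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

# Naive answer extraction: take the last number in the response.
numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
print(response, numbers[-1] if numbers else None)
```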
# Summarize Results

```bash
python sum_metrics.py
```
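`sum_metrics.py` gathers the per-task results into one summary. The snippet below is only a rough sketch of that kind of aggregation, assuming each run writes a JSON file with an `acc` field under an `outputs/` directory; the actual file layout and field names used by the script may differ.

```python
# Rough sketch of metric aggregation (hypothetical file layout; see
# sum_metrics.py for the actual logic).
import json
from pathlib import Path

results_dir = Path("outputs")  # assumed location of per-task result files
summary = {}
for path in sorted(results_dir.glob("**/*metrics*.json")):
    with path.open() as f:
        metrics = json.load(f)
    summary[path.stem] = metrics.get("acc")  # assumed accuracy field

for task, acc in summary.items():
    print(f"{task}: {acc}")
```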
# Generate the RandomCalculation Dataset

```bash
cd random_calculation
python generate_datasets.py
```

The code used for answer scoring is sourced from https://github.com/ruixin31/Spurious_Rewards/. We thank the authors for their valuable work.
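The real scoring logic comes from the Spurious_Rewards repository linked above; the snippet below is only a simplified illustration of the general idea, namely pulling the final numeric answer out of a model response and comparing it to the reference within a small tolerance.

```python
# Simplified illustration of numeric answer scoring; the scoring code actually
# used in this repository is sourced from https://github.com/ruixin31/Spurious_Rewards/.
import re

def extract_last_number(text: str):
    """Return the last number appearing in the text, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def score(response: str, reference: float, tol: float = 1e-4) -> bool:
    """Mark a response correct if its final number matches the reference."""
    predicted = extract_last_number(response)
    return predicted is not None and abs(predicted - reference) <= tol

print(score("17 * 24 = 408, so the answer is 408.", 408.0))  # True
```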
# Citation

```bibtex
@misc{wu2025reasoningmemorizationunreliableresults,
      title={Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination},
      author={Mingqi Wu and Zhihao Zhang and Qiaole Dong and Zhiheng Xi and Jun Zhao and Senjie Jin and Xiaoran Fan and Yuhao Zhou and Yanwei Fu and Qin Liu and Songyang Zhang and Qi Zhang},
      year={2025},
      eprint={2507.10532},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.10532},
}
```