wumingqi/LLM-Math-Evaluation

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

This repository contains the reference implementation for the experiments in our paper. The work is ongoing, so the code may be updated in the future.

Paper

Our paper is available at https://arxiv.org/pdf/2507.10532

The RandomCalculation dataset files are located in random_calculation/result. You can also manually regenerate them if needed.

Setup

# Prepare the Python environment
conda create -n llm-math-evaluation python=3.10
conda activate llm-math-evaluation

pip install -r requirements.txt
pip install flash_attn==2.7.0.post2
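
After installation, it can help to confirm that the pinned package is visible inside the activated environment. The following is a minimal sketch, not part of this repository: the helper name `installed_versions` is hypothetical, and the lookup assumes the distribution is registered under the name used in the `pip install` command above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in the current environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # "flash_attn" is the name used in the pip command above (assumed lookup name).
    for name, version in installed_versions(["flash_attn"]).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

If the reported version differs from 2.7.0.post2, or the package is reported as not installed, rerun the `pip install` commands inside the `llm-math-evaluation` environment.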

Usage

# Evaluate the mathematical ability of LLMs (run from the repository root)
cd math_evaluation
bash run_batch_task_math_qwen2.5.sh
# Summarize the results
python sum_metrics.py

# Generate the RandomCalculation dataset (run from the repository root)
cd random_calculation
python generate_datasets.py
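
To illustrate why freshly generated arithmetic problems are useful against data contamination, here is a minimal sketch of a seeded random-problem generator. It is not the repository's `generate_datasets.py`: the function names, the operator set, the operand range, and the `question`/`answer` fields are all assumptions made for illustration, and the actual RandomCalculation format may differ.

```python
import random

# Hypothetical operator set; the real dataset may use others.
OPS = ["+", "-", "*"]


def make_problem(rng, n_operands=3, low=1, high=99):
    """Build one random integer expression together with its exact answer."""
    operands = [rng.randint(low, high) for _ in range(n_operands)]
    operators = [rng.choice(OPS) for _ in range(n_operands - 1)]
    expr = str(operands[0])
    for op, value in zip(operators, operands[1:]):
        expr += f" {op} {value}"
    # expr contains only positive integers and + - *, so evaluating it is safe.
    answer = eval(expr, {"__builtins__": {}})
    return {"question": f"Compute: {expr}", "answer": answer}


def make_dataset(size, seed=0):
    """Generate `size` problems; the same seed always yields the same list."""
    rng = random.Random(seed)
    return [make_problem(rng) for _ in range(size)]


if __name__ == "__main__":
    for item in make_dataset(3, seed=42):
        print(item["question"], "->", item["answer"])
```

Because each problem is drawn at generation time, its exact text is very unlikely to appear in any pretraining corpus, while fixing the seed keeps an evaluation run reproducible.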

Acknowledgments

The code used for answer scoring is sourced from https://github.com/ruixin31/Spurious_Rewards/. We thank the authors for their valuable work.

Citation

@misc{wu2025reasoningmemorizationunreliableresults,
      title={Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination}, 
      author={Mingqi Wu and Zhihao Zhang and Qiaole Dong and Zhiheng Xi and Jun Zhao and Senjie Jin and Xiaoran Fan and Yuhao Zhou and Yanwei Fu and Qin Liu and Songyang Zhang and Qi Zhang},
      year={2025},
      eprint={2507.10532},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.10532}, 
}