This repository hosts the evaluation code for our paper:

"A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility"

- 📄 Paper
- 📊 Leaderboard
- 🧪 HuggingFace Dataset Page
To launch a single evaluation run, use:
```bash
python main.py \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --task "custom|aime24|0|0" \
  --temperature 0.8 \
  --top_p 0.9 \
  --seed 0 \
  --output_dir /path/to/output \
  --max_new_tokens 32768 \
  --max_model_length 32768 \
  --custom_tasks_directory lighteval_tasks.py \
  --use_chat_template
```
Replace the `--task` value with the appropriate benchmark specification (e.g., `aime24`, `math_500`).
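Because run-to-run variance is a central concern of the paper, you will typically want several runs per benchmark. Below is a minimal sketch of such a sweep using the same flags as above; the seed range and per-seed output directories are illustrative, not prescribed by the repo.

```bash
#!/usr/bin/env bash
# Illustrative seed sweep for one benchmark (adjust model, task, and ranges).
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
TASK="custom|aime24|0|0"

for SEED in 0 1 2 3; do
  python main.py \
    --model "$MODEL" \
    --task "$TASK" \
    --temperature 0.8 \
    --top_p 0.9 \
    --seed "$SEED" \
    --output_dir "/path/to/output/seed_${SEED}" \
    --max_new_tokens 32768 \
    --max_model_length 32768 \
    --custom_tasks_directory lighteval_tasks.py \
    --use_chat_template
done
```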
To run the evaluation inside Docker (e.g., on Runpod):

- Build the Docker image from the provided Dockerfile:
  ```bash
  docker build -t sober-reasoning-eval .
  ```
- Launch a Runpod instance using this image.
- SSH into the instance and run:
  ```bash
  python main.py ...
  ```
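If you prefer to run the image on your own machine instead of Runpod, a minimal sketch is shown below; the `--gpus` flag assumes the NVIDIA Container Toolkit is installed, and the container-side paths are assumptions about the image layout.

```bash
# Start the container with GPU access and a mounted output directory,
# then launch evaluations from an interactive shell inside it.
docker run --gpus all -it \
  -v /path/to/output:/output \
  sober-reasoning-eval \
  bash

# Inside the container (same flags as the single-run command above):
#   python main.py --model ... --task "custom|aime24|0|0" --output_dir /output ...
```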
To submit batched runs on a Slurm cluster:

- Set the following variables inside `run.sh`:
  - `LOCAL_DIR`: path to the cloned repo
  - `OUTPUT_DIR`: path for logs and outputs
  - `PARTITION`: your Slurm partition name
  - `VENV`: path to your Python virtual environment
- Configure the task/seed/temperature/... ranges inside `run.sh` (a sketch follows this list).
- Submit the batch job:
  ```bash
  bash run.sh
  ```
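For orientation, the top of `run.sh` might look roughly like the following; the four path/cluster variables are the ones listed above, while the sweep-range variable names and values are purely illustrative assumptions.

```bash
# Paths and cluster settings (placeholders -- edit for your environment).
LOCAL_DIR=/path/to/sober-reasoning   # path to the cloned repo
OUTPUT_DIR=/path/to/outputs          # logs and evaluation outputs
PARTITION=gpu                        # your Slurm partition name
VENV=/path/to/venv                   # Python virtual environment

# Hypothetical sweep ranges (names and values are assumptions, not the repo's).
TASKS=("custom|aime24|0|0" "custom|math_500|0|0")
SEEDS=(0 1 2)
TEMPERATURES=(0.8)
```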
If you find this work useful, please cite:

```bibtex
@inproceedings{hochlehnert2025soberreasoning,
  title={A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility},
  author={Andreas Hochlehnert and Hardik Bhatnagar and Vishaal Udandarao and Samuel Albanie and Ameya Prabhu and Matthias Bethge},
  booktitle={Second Conference on Language Modeling},
  year={2025},
  url={https://openreview.net/forum?id=90UrTTxp5O}
}
```