This repository hosts the evaluation code for our paper:

"A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility"

- 📄 Paper
- 📊 Leaderboard
- 🧪 HuggingFace Dataset Page
To launch a single evaluation run, use:
```bash
python main.py \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --task "custom|aime24|0|0" \
  --temperature 0.8 \
  --top_p 0.9 \
  --seed 0 \
  --output_dir /path/to/output \
  --max_new_tokens 32768 \
  --max_model_length 32768 \
  --custom_tasks_directory lighteval_tasks.py \
  --use_chat_template
```
Replace the `--task` value with the appropriate benchmark specification (e.g., `aime24`, `math_500`).
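Because run-to-run variance is a central concern of the paper, you will typically want several runs per benchmark. Below is a minimal sketch of such a sweep using the same flags as above; the seed range and per-seed output directories are illustrative, not prescribed by the repo.

```bash
#!/usr/bin/env bash
# Illustrative seed sweep for one benchmark (adjust model, task, and ranges).
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
TASK="custom|aime24|0|0"

for SEED in 0 1 2 3; do
  python main.py \
    --model "$MODEL" \
    --task "$TASK" \
    --temperature 0.8 \
    --top_p 0.9 \
    --seed "$SEED" \
    --output_dir "/path/to/output/seed_${SEED}" \
    --max_new_tokens 32768 \
    --max_model_length 32768 \
    --custom_tasks_directory lighteval_tasks.py \
    --use_chat_template
done
```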
To run the evaluation inside Docker (e.g., on Runpod):

- Build the Docker image from the provided Dockerfile:
  ```bash
  docker build -t sober-reasoning-eval .
  ```
- Launch a Runpod instance using this image.
- SSH into the instance and run:
  ```bash
  python main.py ...
  ```
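If you prefer to run the image on your own machine instead of Runpod, a minimal sketch is shown below; the `--gpus` flag assumes the NVIDIA Container Toolkit is installed, and the container-side paths are assumptions about the image layout.

```bash
# Start the container with GPU access and a mounted output directory,
# then launch evaluations from an interactive shell inside it.
docker run --gpus all -it \
  -v /path/to/output:/output \
  sober-reasoning-eval \
  bash

# Inside the container (same flags as the single-run command above):
#   python main.py --model ... --task "custom|aime24|0|0" --output_dir /output ...
```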
To submit batched runs on a Slurm cluster:

- Set the following variables inside `run.sh`:
  - `LOCAL_DIR`: path to the cloned repo
  - `OUTPUT_DIR`: path for logs and outputs
  - `PARTITION`: your Slurm partition name
  - `VENV`: path to your Python virtual environment
- Configure the task/seed/temperature/... ranges inside `run.sh` (a sketch follows this list).
- Submit the batch job:
  ```bash
  bash run.sh
  ```
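For orientation, the top of `run.sh` might look roughly like the following; the four path/cluster variables are the ones listed above, while the sweep-range variable names and values are purely illustrative assumptions.

```bash
# Paths and cluster settings (placeholders -- edit for your environment).
LOCAL_DIR=/path/to/sober-reasoning   # path to the cloned repo
OUTPUT_DIR=/path/to/outputs          # logs and evaluation outputs
PARTITION=gpu                        # your Slurm partition name
VENV=/path/to/venv                   # Python virtual environment

# Hypothetical sweep ranges (names and values are assumptions, not the repo's).
TASKS=("custom|aime24|0|0" "custom|math_500|0|0")
SEEDS=(0 1 2)
TEMPERATURES=(0.8)
```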
If you find this work useful, please cite:

```bibtex
@inproceedings{hochlehnert2025soberreasoning,
  title={A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility},
  author={Andreas Hochlehnert and Hardik Bhatnagar and Vishaal Udandarao and Samuel Albanie and Ameya Prabhu and Matthias Bethge},
  booktitle={Second Conference on Language Modeling},
  year={2025},
  url={https://openreview.net/forum?id=90UrTTxp5O}
}
```