This repository contains the code and resources for the paper Thinker: Learning to Think Fast and Slow, published at NeurIPS 2025.
Our work introduces the Thinker task, a novel four-stage Reinforcement Learning (RL) approach for question-answering (QA) designed to enhance the reasoning capabilities of Large Language Models (LLMs) by explicitly training distinct cognitive abilities: intuition (Fast Thinking), evaluation (Verification), refinement (Slow Thinking), and integration (Summarization).
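As a rough illustration of how the four stages fit together at inference time, the sketch below uses hypothetical helper names and prompts; the actual task format, token budgets, and rewards are defined in the paper and the training code.

```python
# Hypothetical sketch of a four-stage Thinker rollout; prompts, budgets, and
# control flow are illustrative only -- see the paper for the exact task design.
def thinker_rollout(llm, question):
    # 1. Fast Thinking: produce an initial answer under a tight token budget.
    fast_answer = llm.generate(f"{question}\nAnswer briefly:", max_tokens=1000)

    # 2. Verification: judge whether the fast answer is likely correct.
    verdict = llm.generate(
        f"{question}\nProposed answer: {fast_answer}\nIs this correct? (yes/no)"
    )

    # 3. Slow Thinking: if the answer is not verified, reason at length and refine it.
    if "yes" not in verdict.lower():
        reasoning = llm.generate(f"{question}\nThink step by step:", max_tokens=8000)
    else:
        reasoning = fast_answer

    # 4. Summarization: condense the reasoning into a concise final answer.
    return llm.generate(f"{question}\nReasoning: {reasoning}\nFinal answer:")
```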
Figure: Evaluation Accuracy on Mathematical Reasoning Benchmarks.
Performance comparison across various mathematical reasoning benchmarks. All scores are Pass@1 accuracy (%) averaged over 16 samples. Top score in each benchmark column (within each model group) is bolded.
| Method | MATH 500 | AIME 2024 | AIME 2025 | GPQA Diamond | OlympiadBench | AMC 23 | Minerva Math | College Math | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-1.5B (Q1.5B) | | | | | | | | | |
| Pretrained | 9.05 | 0.00 | 0.00 | 4.55 | 3.09 | 4.06 | 2.30 | 7.40 | 3.81 |
| Baseline | 59.82 | 4.10 | **2.43** | 20.52 | 26.05 | 35.36 | 19.25 | 37.42 | 25.62 |
| Thinker | **64.45** | **6.25** | 2.22 | 19.21 | **28.21** | **39.06** | **20.38** | **38.82** | **27.33** |
| Thinker-Fast | 59.82 | 4.58 | 1.25 | **21.28** | 24.52 | 34.53 | 17.85 | 37.58 | 25.18 |
| ORZ | 58.00 | 3.50 | 1.00 | 16.80 | - | - | - | - | - |
| SimpleRL | 59.00 | 4.20 | - | - | 21.00 | 35.00 | 20.20 | - | - |
| DeepSeek-R1-Distill-Qwen-1.5B (R1.5B) | | | | | | | | | |
| Pretrained | 76.21 | 17.50 | 17.92 | 13.76 | 37.46 | 55.94 | 24.82 | 38.85 | 35.31 |
| Baseline | 86.24 | 35.42 | 23.75 | 25.69 | 49.22 | 72.81 | 32.08 | 42.02 | 45.90 |
| Thinker | **88.51** | **38.96** | **26.67** | **37.41** | **55.49** | **83.59** | **34.77** | **42.46** | **50.98** |
| Thinker-Fast | 81.35 | 18.33 | 14.58 | 28.85 | 45.68 | 66.41 | 31.39 | 41.74 | 41.05 |
| DeepSeek-R1-Distill-Qwen-7B (R7B) | | | | | | | | | |
| Pretrained | 84.05 | 37.50 | 28.54 | 17.58 | 37.92 | 36.41 | 34.49 | 40.72 | 39.65 |
| Baseline | 91.03 | 47.50 | 34.58 | 34.63 | 56.76 | 87.81 | 40.23 | 42.71 | 54.41 |
| Thinker | **93.04** | **56.25** | **41.46** | **41.51** | **62.12** | **91.09** | **44.39** | **42.84** | **59.09** |
| Thinker-Fast | 86.47 | 26.46 | 21.88 | 34.12 | 51.77 | 71.56 | 43.08 | 42.14 | 47.19 |
Comparison with concurrent works on accuracy and average response length. Results for concurrent works are extracted from their respective papers.
| Method | MATH500 Acc. (%) | MATH500 Length | AIME24 Acc. (%) | AIME24 Length | AMC23 Acc. (%) | AMC23 Length |
|---|---|---|---|---|---|---|
| ThinkPrune | 83.2 | 1938 | 27.1 | 5631 | 73.2 | 3039 |
| Concise Reasoning | 81.0 | 1965 | 30.0 | 6752 | 69.4 | 2936 |
| SR-FLOW | 85.3 | - | 36.7 | - | 77.8 | - |
| AdaptThink | 82.0 | 1782 | 31.0 | 6679 | - | - |
| Baseline | 86.2 | 2780 | 35.4 | 5778 | 72.8 | 3938 |
| Thinker | 88.5 | 2501 | 39.0 | 5597 | 83.6 | 3517 |
| Thinker-Fast | 80.9 | 600 | 18.1 | 853 | 66.9 | 751 |
This project requires Python >=3.10.
Ensure you have essential system libraries. For Debian-based systems (like Ubuntu), you can install them using:
```bash
sudo apt-get update
sudo apt-get install -y ffmpeg libsm6 libxext6
```
It's recommended to use a virtual environment (e.g., Python's venv or Conda).
Once you have cloned the repository and navigated into the main project directory (where `pyproject.toml` is located), activate your chosen virtual environment. Then, install the project and its dependencies:
```bash
pip install -e .
```
This command installs the `thinker_task` project in editable mode and pulls in all required Python packages at the versions specified in `pyproject.toml`.
- Python: >=3.10
- Python Packages: All specific versions for packages like `torch`, `deepspeed`, etc., are listed in the `pyproject.toml` file.
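As an optional sanity check after installation, you can confirm that the core dependencies import cleanly (the packages below are the ones named above; adjust to your setup):

```python
# Optional post-install check: core dependencies should import without errors.
import torch
import deepspeed

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("deepspeed:", deepspeed.__version__)
```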
Download the base models R1.5B (DeepSeek-R1-Distill-Qwen-1.5B), R7B (DeepSeek-R1-Distill-Qwen-7B), and Q1.5B (Qwen2.5-1.5B) into the directory `large_data/base` using the following commands:
python -c "from huggingface_hub import snapshot_download; print(snapshot_download('deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', local_dir='large_data/base/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B'))"
python -c "from huggingface_hub import snapshot_download; print(snapshot_download('deepseek-ai/DeepSeek-R1-Distill-Qwen-7B', local_dir='large_data/base/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B'))"
python -c "from huggingface_hub import snapshot_download; print(snapshot_download('Qwen/Qwen2.5-Math-1.5B', local_dir='large_data/base/Qwen/Qwen2.5-Math-1.5B'))"
```bash
python script/add_token.py large_data/base/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
python script/add_token.py large_data/base/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
```

The last two commands add two special tokens, `<|im_start|>` and `<|im_end|>`, to the R1-Distill models; these tokens are already present in Q1.5B and mark the start and end of a prompt.
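For reference, adding special tokens to a Hugging Face model is typically done as sketched below; this is illustrative only, and the actual logic lives in `script/add_token.py`:

```python
# Illustrative only -- see script/add_token.py for the actual implementation.
import sys
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = sys.argv[1]  # e.g. large_data/base/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Register <|im_start|> and <|im_end|> and grow the embedding matrix to match.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
)
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

tokenizer.save_pretrained(model_path)
model.save_pretrained(model_path)
```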
Single node, R1.5B Thinker agent (replace r1_5b with q1_5b for Q1.5B model, or r7b for R7B model):
```bash
python -m playground.thinker_r1_5b
```

Multi-node training:

First, on the master node, run:

```bash
ray start --head
```

Then, on each of the other nodes, run:

```bash
ray start --address='<master-node-ip>:<master-node-port>'
```

Finally, back on the master node, launch training (adjust `NUM_NODE` as needed; both 2 and 4 should work fine):

```bash
NUM_NODE=4 python -m playground.thinker_r1_5b
```

The training data are sourced from Open-Reasoner-Zero.
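Before launching the multi-node run, you can optionally confirm that all worker nodes have joined the Ray cluster using the standard Ray API (this check is not part of the repository's scripts):

```python
# Optional: verify the Ray cluster sees every node before starting training.
import ray

ray.init(address="auto")  # attach to the cluster started with `ray start`
print("Nodes in cluster:", len(ray.nodes()))
print("Cluster resources:", ray.cluster_resources())
ray.shutdown()
```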
Trained model checkpoints can be found at:
Please refer to `thinker_task/exp_engine/accelerators/inference/sum_llm.py` for how to perform inference on the Thinker task with vLLM.
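For orientation, a minimal vLLM generation call looks like the sketch below; the Thinker-specific multi-stage prompting and summarization logic is implemented in `sum_llm.py`, so this only shows plain single-turn generation (model path and sampling settings are placeholders):

```python
# Minimal vLLM usage sketch; see sum_llm.py for the actual Thinker-task inference.
from vllm import LLM, SamplingParams

llm = LLM(model="large_data/base/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
params = SamplingParams(temperature=0.6, max_tokens=1024)

outputs = llm.generate(["What is 17 * 24?"], params)
print(outputs[0].outputs[0].text)
```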
- Our training framework is built on Open-Reasoner-Zero, OpenRLHF, vLLM, DeepSpeed, and Ray.
- Our models are based on Qwen2.5-1.5B, DeepSeek-R1-Distill-Qwen-1.5B, and DeepSeek-R1-Distill-Qwen-7B.
