This repository contains the official implementation for
ScaleDiff, a simple yet effective pipeline designed to scale the creation of challenging mathematical problems to enhance the reasoning capabilities of Large Reasoning Models (LRMs). Our method addresses the scarcity of high-quality, difficult training data, which is often manually created and is therefore costly and difficult to scale.
The ScaleDiff pipeline systematically generates a large-scale dataset of difficult problems. The process involves three key steps:
- Problem Selection: We use AdaptThink, an adaptive thinking model, to select difficult problems from the AM-Distilled-Dataset. This is more efficient than traditional approaches such as fail rates or LLM-as-a-judge, since it needs only a single forward pass per problem and requires no reference solutions.
- Problem Generation: A dedicated problem generator, DiffGen-8B, is trained on these selected difficult problems to produce a vast number of new, challenging problems.
- Solution Distillation and Filtration: Long CoT solutions for the newly generated problems are distilled using Qwen3-8B and then filtered through both rule-based and model-based methods to ensure high quality and relevance.
The resulting ScaleDiff-Math dataset, which combines these new problem-solution pairs with the original dataset, is designed to provide a more effective training signal for improving reasoning abilities.
We release the ScaleDiff-Math dataset and ScaleDiff-7B model fine-tuned on this dataset. The results on the AIME'24, AIME'25, and MATH500 datasets are shown in the following table:
System Information:
- System: CentOS Linux 7 (Core)
- GNU C Library: ldd (GNU libc) 2.17
- CUDA: release 12.4
You need to install two environments: one for training and one for evaluation & data generation. We use conda to manage both environments.

Our training code depends on LLaMA-Factory.
```bash
conda create -n scalediff_train python=3.10
conda activate scalediff_train

# Install LLaMA-Factory
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
git checkout v0.9.3

# Install a torch build that matches your CUDA version
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.51.3 accelerate==1.6.0 deepspeed==0.15.4 av==14.4.0 sentencepiece==0.2.0 Cython liger-kernel
pip install flash-attn==2.7.4.post1 --no-build-isolation
pip install -e ".[torch,metrics]" --no-build-isolation
```

Our evaluation code depends on Sober-Reasoning.
```bash
conda create -n scalediff_eval_gen python=3.10
conda activate scalediff_eval_gen

# Install torch 2.5.1 that fits your CUDA version
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install pyarrow==20.0.0 sentencepiece==0.2.0 pydantic
pip install protobuf==3.20.3 datasets==3.6.0
pip install vllm==0.7.2
pip install lighteval[math]==0.8.1
```

Then you need to register the `qwen_qft` and `qwen_nosystem` templates in `LLaMA-Factory/src/llamafactory/data/template.py` by adding the following code:
```python
register_template(
    name="qwen_qft",
    format_user=StringFormatter(slots=["<|im_start|>user\n{{content}}"]),
    format_assistant=StringFormatter(slots=["{{content}}<|im_end|>\n"]),
    stop_words=["<|im_end|>"],
)

register_template(
    name="qwen_nosystem",
    format_user=StringFormatter(slots=["<|im_start|>user\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]),
    format_system=StringFormatter(slots=["<|im_start|>system\n{{content}}<|im_end|>\n"]),
    format_observation=StringFormatter(slots=["<|im_start|>tool\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]),
    stop_words=["<|im_end|>"],
    replace_eos=True,
    replace_jinja_template=False,
)
```
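For intuition, the snippet below shows roughly what a single training pair looks like under each template, derived from the slot definitions above; the `prompt`/`response` strings are placeholders. With `qwen_qft`, the response is appended directly to the user turn with no assistant header (a question-fine-tuning style layout, presumably used for training the problem generator), while `qwen_nosystem` keeps the standard Qwen chat layout but inserts no default system prompt.

```python
# Rough rendering of one (prompt, response) pair under each template
# (derived from the slot definitions above; the strings below are placeholders).
prompt = "Write a challenging competition math problem."            # placeholder
response = "Let $a_1, a_2, \\dots$ be a sequence such that ..."     # placeholder

# qwen_qft: the response continues the user turn directly (no assistant header),
# so the model learns to complete the problem text itself.
qft_text = f"<|im_start|>user\n{prompt}" + f"{response}<|im_end|>\n"

# qwen_nosystem: the usual Qwen chat layout, but no default system prompt is inserted.
nosystem_text = (
    f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    + f"{response}<|im_end|>"
)

print(qft_text)
print(nosystem_text)
```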
Next, identify difficult problems from the math subset of AM-Qwen3-Distilled with AdaptThink:

```bash
conda activate scalediff_eval_gen
cd difficult_identify_generation
# download the math subset of https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled
wget https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled/resolve/main/math.jsonl
# identify the difficult problems
python adapt_think_difficulty_vllm.py --ds_path math.jsonl
```
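For intuition, here is a minimal sketch of this selection step. It is not the actual `adapt_think_difficulty_vllm.py`; it assumes the AdaptThink convention that the model emits `</think>` right away when it decides a problem is easy enough to answer without thinking, so a problem is kept as difficult when the model chooses to think. The model path, field names, and prompt format below are placeholders.

```python
# Minimal, illustrative sketch of AdaptThink-based difficulty selection.
# Assumption: an AdaptThink-style model begins its output with "</think>" when it
# skips thinking (easy problem); otherwise it starts a thinking trace (difficult).
import json
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/AdaptThink-7B")                 # placeholder model path
params = SamplingParams(temperature=0.0, max_tokens=8)   # a few tokens suffice

problems = []
with open("math.jsonl") as f:
    for line in f:
        problems.append(json.loads(line)["problem"])     # field name is an assumption

# Placeholder chat-style prompt format.
prompts = [f"<|im_start|>user\n{p}<|im_end|>\n<|im_start|>assistant\n" for p in problems]
outputs = llm.generate(prompts, params)

with open("math_difficult.jsonl", "w") as f:
    for p, out in zip(problems, outputs):
        if not out.outputs[0].text.lstrip().startswith("</think>"):
            f.write(json.dumps({"problem": p}) + "\n")
```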
Then you can use the identified difficult problems to train the difficult problem generator, DiffGen-8B. We place the identified difficult problems in `math_difficult_qft.jsonl`, and you can refer to `train/train_diffgen.sh` to train the generator.
After training DiffGen-8B, use it to generate new difficult problems:

```bash
conda activate scalediff_eval_gen
cd difficult_identify_generation
python diffgen_vllm.py
```
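As a rough illustration of what this step does (not the actual `diffgen_vllm.py`; the prompt, sampling settings, and file names are placeholders), new problems can be sampled from DiffGen-8B with vLLM:

```python
# Illustrative sketch: sample many candidate problems from DiffGen-8B.
import json
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/DiffGen-8B")  # placeholder model path
# Temperature > 0 and several samples per prompt to obtain diverse problems.
params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=512, n=8)

# Placeholder prompt; the real script may condition generation differently.
prompt = "<|im_start|>user\n"
outputs = llm.generate([prompt] * 128, params)

with open("your_generated_problems.jsonl", "w") as f:
    for out in outputs:
        for cand in out.outputs:
            f.write(json.dumps({"problem": cand.text.strip()}) + "\n")
```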
Then distill long CoT solutions for the generated problems with Qwen3-8B:

```bash
conda activate scalediff_eval_gen
cd difficult_identify_generation
python solution_distill_vllm.py --ds_path your_generated_problems.jsonl
```
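For intuition, the distillation and rule-based filtering steps might look roughly like the sketch below (not the actual `solution_distill_vllm.py`; the prompt format, sampling settings, and filtering rules are assumptions). Model-based filtering is omitted here.

```python
# Illustrative sketch: distill long CoT solutions with Qwen3-8B, then apply a
# simple rule-based filter. The repository's actual rules may differ.
import json
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=16384)

problems = []
with open("your_generated_problems.jsonl") as f:
    for line in f:
        problems.append(json.loads(line)["problem"])

# Placeholder chat-style prompt format.
prompts = [f"<|im_start|>user\n{p}<|im_end|>\n<|im_start|>assistant\n" for p in problems]
outputs = llm.generate(prompts, params)

with open("distilled_filtered.jsonl", "w") as f:
    for p, out in zip(problems, outputs):
        sol = out.outputs[0].text
        # Assumed rule-based checks: the reasoning finished and a boxed answer exists.
        if "</think>" in sol and "\\boxed" in sol:
            f.write(json.dumps({"problem": p, "solution": sol}) + "\n")
```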
- Modify the `dataset_info.json` file in `LLaMA-Factory/data` to add the ScaleDiff-Math dataset:

```json
"ScaleDiff-Math-generated": {
  "hf_hub_url": "QizhiPei/ScaleDiff-Math",
  "split": "generated",
  "columns": {
    "prompt": "problem",
    "response": "solution"
  }
},
"ScaleDiff-Math-original": {
  "hf_hub_url": "QizhiPei/ScaleDiff-Math",
  "split": "original",
  "columns": {
    "prompt": "problem",
    "response": "solution"
  }
}
```
- Begin training. The following script will automatically download the ScaleDiff-Math dataset and start training:

```bash
bash train/train_scalediff.sh
```
Our evaluation code is based on Sober-Reasoning. We use the suggested `lighteval==0.8.1` to evaluate the model.

You first need to download the ScaleDiff-7B model from HuggingFace, or SFT the model on your own. Then run the following evaluation script:
```bash
# Set your GPU id here. Currently, the evaluation script only supports a single GPU.
export CUDA_VISIBLE_DEVICES=0
export MODEL_PATH=your_sft_model_path
bash eval.sh
```

Many thanks to the open-source projects our code builds on, including LLaMA-Factory and Sober-Reasoning.
If you find our code, model, or data useful, please kindly cite our paper:
```bibtex
@article{scalediff,
  title={ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning},
  author={Qizhi Pei and Zhuoshi Pan and Honglin Lin and Xin Gao and Yu Li and Zinan Tang and Conghui He and Rui Yan and Lijun Wu},
  journal={arXiv preprint arXiv:2509.21070},
  year={2025}
}
```

