This repository contains the official implementation for
ScaleDiff, a simple yet effective pipeline designed to scale the creation of challenging mathematical problems to enhance the reasoning capabilities of Large Reasoning Models (LRMs). Our method addresses the scarcity of high-quality, difficult training data, which is often manually created and is therefore costly and difficult to scale.
The ScaleDiff pipeline systematically generates a large-scale dataset of difficult problems. The process involves three key steps:
- Problem Selection: We use AdaptThink, an adaptive thinking model, to select difficult problems from the AM-Distilled-Dataset. This is more efficient than traditional approaches such as fail rates or LLM-as-a-judge, since it needs only a single forward pass per problem and requires no reference solutions.
- Problem Generation: A dedicated problem generator, DiffGen-8B, is trained on these selected difficult problems to produce a vast number of new, challenging problems.
- Solution Distillation and Filtration: Long CoT solutions for the newly generated problems are distilled using Qwen3-8B and then filtered through both rule-based and model-based methods to ensure high quality and relevance.
The resulting ScaleDiff-Math dataset, which combines these new problem-solution pairs with the original dataset, is designed to provide a more effective training signal for improving reasoning abilities.
We release the ScaleDiff-Math dataset and ScaleDiff-7B model fine-tuned on this dataset. The results on the AIME'24, AIME'25, and MATH500 datasets are shown in the following table:
System Information:
- System: CentOS Linux 7 (Core)
- GNU C Library: ldd (GNU libc) 2.17
- CUDA: release 12.4
You need to install two environments: one for training and one for evaluation & data generation. We use conda to manage both environments.

Our training code depends on LLaMA-Factory.
```bash
conda create -n scalediff_train python=3.10
conda activate scalediff_train

# Install LLaMA-Factory
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
git checkout v0.9.3

# Install a torch build that matches your CUDA version
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.51.3 accelerate==1.6.0 deepspeed==0.15.4 av==14.4.0 sentencepiece==0.2.0 Cython liger-kernel
pip install flash-attn==2.7.4.post1 --no-build-isolation
pip install -e ".[torch,metrics]" --no-build-isolation
```

Our evaluation code depends on Sober-Reasoning.
```bash
conda create -n scalediff_eval_gen python=3.10
conda activate scalediff_eval_gen

# Install torch 2.5.1 that fits your CUDA version
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install pyarrow==20.0.0 sentencepiece==0.2.0 pydantic
pip install protobuf==3.20.3 datasets==3.6.0
pip install vllm==0.7.2
pip install lighteval[math]==0.8.1
```

Then you need to register the `qwen_qft` and `qwen_nosystem` templates in `LLaMA-Factory/src/llamafactory/data/template.py` by adding the following code:
```python
register_template(
    name="qwen_qft",
    format_user=StringFormatter(slots=["<|im_start|>user\n{{content}}"]),
    format_assistant=StringFormatter(slots=["{{content}}<|im_end|>\n"]),
    stop_words=["<|im_end|>"],
)

register_template(
    name="qwen_nosystem",
    format_user=StringFormatter(slots=["<|im_start|>user\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]),
    format_system=StringFormatter(slots=["<|im_start|>system\n{{content}}<|im_end|>\n"]),
    format_observation=StringFormatter(slots=["<|im_start|>tool\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]),
    stop_words=["<|im_end|>"],
    replace_eos=True,
    replace_jinja_template=False,
)
```
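For intuition, the snippet below shows roughly what a single training pair looks like under each template, derived from the slot definitions above; the `prompt`/`response` strings are placeholders. With `qwen_qft`, the response is appended directly to the user turn with no assistant header (a question-fine-tuning style layout, presumably used for training the problem generator), while `qwen_nosystem` keeps the standard Qwen chat layout but inserts no default system prompt.

```python
# Rough rendering of one (prompt, response) pair under each template
# (derived from the slot definitions above; the strings below are placeholders).
prompt = "Write a challenging competition math problem."            # placeholder
response = "Let $a_1, a_2, \\dots$ be a sequence such that ..."     # placeholder

# qwen_qft: the response continues the user turn directly (no assistant header),
# so the model learns to complete the problem text itself.
qft_text = f"<|im_start|>user\n{prompt}" + f"{response}<|im_end|>\n"

# qwen_nosystem: the usual Qwen chat layout, but no default system prompt is inserted.
nosystem_text = (
    f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    + f"{response}<|im_end|>"
)

print(qft_text)
print(nosystem_text)
```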
Next, identify difficult problems from the math subset of AM-Qwen3-Distilled with AdaptThink:

```bash
conda activate scalediff_eval_gen
cd difficult_identify_generation
# download the math subset of https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled
wget https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled/resolve/main/math.jsonl
# identify the difficult problems
python adapt_think_difficulty_vllm.py --ds_path math.jsonl
```
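For intuition, here is a minimal sketch of this selection step. It is not the actual `adapt_think_difficulty_vllm.py`; it assumes the AdaptThink convention that the model emits `</think>` right away when it decides a problem is easy enough to answer without thinking, so a problem is kept as difficult when the model chooses to think. The model path, field names, and prompt format below are placeholders.

```python
# Minimal, illustrative sketch of AdaptThink-based difficulty selection.
# Assumption: an AdaptThink-style model begins its output with "</think>" when it
# skips thinking (easy problem); otherwise it starts a thinking trace (difficult).
import json
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/AdaptThink-7B")                 # placeholder model path
params = SamplingParams(temperature=0.0, max_tokens=8)   # a few tokens suffice

problems = []
with open("math.jsonl") as f:
    for line in f:
        problems.append(json.loads(line)["problem"])     # field name is an assumption

# Placeholder chat-style prompt format.
prompts = [f"<|im_start|>user\n{p}<|im_end|>\n<|im_start|>assistant\n" for p in problems]
outputs = llm.generate(prompts, params)

with open("math_difficult.jsonl", "w") as f:
    for p, out in zip(problems, outputs):
        if not out.outputs[0].text.lstrip().startswith("</think>"):
            f.write(json.dumps({"problem": p}) + "\n")
```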
Then you can use the identified difficult problems to train the difficult problem generator, DiffGen-8B. We place the identified difficult problems in `math_difficult_qft.jsonl`, and you can refer to `train/train_diffgen.sh` to train the generator.
After training DiffGen-8B, use it to generate new difficult problems:

```bash
conda activate scalediff_eval_gen
cd difficult_identify_generation
python diffgen_vllm.py
```
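As a rough illustration of what this step does (not the actual `diffgen_vllm.py`; the prompt, sampling settings, and file names are placeholders), new problems can be sampled from DiffGen-8B with vLLM:

```python
# Illustrative sketch: sample many candidate problems from DiffGen-8B.
import json
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/DiffGen-8B")  # placeholder model path
# Temperature > 0 and several samples per prompt to obtain diverse problems.
params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=512, n=8)

# Placeholder prompt; the real script may condition generation differently.
prompt = "<|im_start|>user\n"
outputs = llm.generate([prompt] * 128, params)

with open("your_generated_problems.jsonl", "w") as f:
    for out in outputs:
        for cand in out.outputs:
            f.write(json.dumps({"problem": cand.text.strip()}) + "\n")
```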
Then distill long CoT solutions for the generated problems with Qwen3-8B:

```bash
conda activate scalediff_eval_gen
cd difficult_identify_generation
python solution_distill_vllm.py --ds_path your_generated_problems.jsonl
```
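For intuition, the distillation and rule-based filtering steps might look roughly like the sketch below (not the actual `solution_distill_vllm.py`; the prompt format, sampling settings, and filtering rules are assumptions). Model-based filtering is omitted here.

```python
# Illustrative sketch: distill long CoT solutions with Qwen3-8B, then apply a
# simple rule-based filter. The repository's actual rules may differ.
import json
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=16384)

problems = []
with open("your_generated_problems.jsonl") as f:
    for line in f:
        problems.append(json.loads(line)["problem"])

# Placeholder chat-style prompt format.
prompts = [f"<|im_start|>user\n{p}<|im_end|>\n<|im_start|>assistant\n" for p in problems]
outputs = llm.generate(prompts, params)

with open("distilled_filtered.jsonl", "w") as f:
    for p, out in zip(problems, outputs):
        sol = out.outputs[0].text
        # Assumed rule-based checks: the reasoning finished and a boxed answer exists.
        if "</think>" in sol and "\\boxed" in sol:
            f.write(json.dumps({"problem": p, "solution": sol}) + "\n")
```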
- Modify the `dataset_info.json` file in `LLaMA-Factory/data` to add the ScaleDiff-Math dataset:

```json
"ScaleDiff-Math-generated": {
  "hf_hub_url": "QizhiPei/ScaleDiff-Math",
  "split": "generated",
  "columns": {
    "prompt": "problem",
    "response": "solution"
  }
},
"ScaleDiff-Math-original": {
  "hf_hub_url": "QizhiPei/ScaleDiff-Math",
  "split": "original",
  "columns": {
    "prompt": "problem",
    "response": "solution"
  }
}
```
- Begin training. The following script will automatically download the ScaleDiff-Math dataset and start training:

```bash
bash train/train_scalediff.sh
```
Our evaluation code is based on Sober-Reasoning. We use the suggested `lighteval==0.8.1` to evaluate the model.

You first need to download the ScaleDiff-7B model from HuggingFace, or SFT the model on your own. Then run the following evaluation script:
```bash
# Set your GPU id here. Currently, the evaluation script only supports a single GPU.
export CUDA_VISIBLE_DEVICES=0
export MODEL_PATH=your_sft_model_path
bash eval.sh
```

Many thanks to the open-source projects our code builds on, including LLaMA-Factory and Sober-Reasoning.
If you find our code, model, or data useful, please kindly cite our paper:
```bibtex
@article{scalediff,
  title={ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning},
  author={Qizhi Pei and Zhuoshi Pan and Honglin Lin and Xin Gao and Yu Li and Zinan Tang and Conghui He and Rui Yan and Lijun Wu},
  journal={arXiv preprint arXiv:2509.21070},
  year={2025}
}
```

