On-Policy Self-Distillation (OPSD) trains a single model to act as both student and teacher by conditioning on different contexts — the student sees only the problem, while the teacher additionally sees the ground-truth solution — and performs token-level distribution matching along the student's own on-policy trajectories.
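The conditioning asymmetry can be sketched numerically. The toy example below is illustrative only (made-up distributions, not the repo's code): the same model yields a "teacher" per-token distribution when the ground-truth solution is in context and a "student" distribution when it is not, and the loss averages a per-token divergence along the student's own rollout — forward KL is used here purely for concreteness.

```python
import math

# Toy sketch of OPSD's token-level matching (illustrative numbers only).
# One model, two contexts:
#   student context: problem
#   teacher context: problem + ground-truth solution
# Along a student-sampled rollout of T tokens, match the per-token
# distributions; forward KL(teacher || student) is used for simplicity.

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocab."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Per-token distributions over a 3-token toy vocabulary, 2-token rollout.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # sees problem + solution
student = [[0.4, 0.4, 0.2], [0.2, 0.6, 0.2]]  # sees problem only

loss = sum(kl(p, q) for p, q in zip(teacher, student)) / len(teacher)
print(round(loss, 4))
```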
- Mar 18, 2026: Released updated code.
  (1) Fixed chat-template and ZeRO-2 bugs (see the template issue) and re-ran the experiments; detailed results and ablations are updated on arXiv/blog. The fixes improve OPSD performance, most notably on Qwen3-1.7B.
  (2) Added a new training stabilization strategy 🚀: per-token point-wise KL clipping. We find that style tokens (such as "wait" and "think") can exhibit 6–15× higher KL divergence than math-related tokens and dominate the training signal. Clipping stabilizes training and improves performance.
- Mar 3, 2026: Initial code release.
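The per-token clipping idea can be sketched in a few lines — the divergence values below are hypothetical, with the cap set to the `--jsd_token_clip` default of 0.05:

```python
# Minimal sketch of per-token point-wise loss clipping (hypothetical values).
# Style tokens like "wait" can carry a much larger per-token divergence than
# math tokens; capping each token's contribution keeps them from dominating.

CLIP = 0.05  # mirrors the --jsd_token_clip default

per_token_kl = {"wait": 0.31, "think": 0.24, "x": 0.03, "=": 0.02, "7": 0.01}

clipped = {tok: min(v, CLIP) for tok, v in per_token_kl.items()}
raw_loss = sum(per_token_kl.values()) / len(per_token_kl)
clipped_loss = sum(clipped.values()) / len(clipped)
print(round(raw_loss, 3), round(clipped_loss, 3))
```

With the cap, the two style tokens contribute no more than any other token, so the mean loss is driven by the whole sequence rather than a few outliers.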
```shell
conda env create -f environment.yml
conda activate opsd
pip install flash-attn==2.8.3 --no-build-isolation
```

If you encounter difficulties installing flash-attn, check the flash-attention releases page for the version matching your CUDA and PyTorch versions.
The code uses trl's experimental GOLD trainer as a base.
```
├── opsd_trainer.py       # OPSDTrainer: core self-distillation trainer
├── data_collator.py      # Data collator for self-distillation
├── opsd_train.py         # OPSD training entry point
├── sft_train.py          # SFT baseline training entry point
├── grpo_train.py         # GRPO baseline training entry point
├── accelerate.yaml       # Accelerate config (multi-GPU)
├── scripts/
│   ├── run_opsd.sh       # Example launch script for OPSD
│   ├── run_sft.sh        # Example launch script for SFT
│   └── run_grpo.sh       # Example launch script for GRPO
└── eval/
    ├── evaluate_math.py  # Evaluation script (vLLM)
    └── run_eval.sh       # Example evaluation script
```
Reproduce results on Qwen3-1.7B (🚀 training only takes ~15 minutes on 4×H100 and peaks within 100 steps):
```shell
bash scripts/run_opsd_1b.sh
```

Evaluation (takes ~30–50 minutes on 4×H100 per checkpoint):

```shell
cd eval
bash run_eval.sh
```

[Results table: AIME24 / AIME25 / HMMT25 — detailed numbers on arXiv/blog.]
Evaluation settings: temperature=1.0, thinking mode enabled, max new tokens=38912, top-p=none, top-k disabled, min-p=0, presence penalty=0, num samples=12
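The evaluation settings above, written out as a plain config dict for reference. Key names loosely follow vLLM's `SamplingParams`, but this is an illustrative mapping, not the repo's eval config:

```python
# Illustrative mapping of the evaluation settings to sampling parameters
# (names loosely follow vLLM's SamplingParams; adjust to your version).
EVAL_SAMPLING = {
    "temperature": 1.0,
    "top_p": 1.0,          # "top-p=none": nucleus sampling disabled
    "top_k": -1,           # top-k disabled
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "max_tokens": 38912,   # max new tokens
    "n": 12,               # samples per problem
}
print(EVAL_SAMPLING["n"])
```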
See:
- `scripts/run_opsd_1b.sh`
- `scripts/run_opsd_4b.sh`
- `scripts/run_opsd_8b.sh`
| Argument | Default | Description |
|---|---|---|
| `--fixed_teacher` | `False` | Fix the teacher to the initial policy (step 0). Requires `--use_peft`. ❗ If you disable PEFT, the teacher keeps updating at every training step, which may make training unstable. Our main results use the fixed teacher, currently implemented with LoRA adapter weights. |
| `--use_tinker_loss` | `False` | Use a sampled-token policy-gradient objective instead of the full-vocabulary JSD. More memory-efficient. Clipping is not yet implemented for this variant, so it may be unstable. |
| `--max_completion_length` | — | Student generation length for distillation. We use 1024 in our main experiments. |
| `--beta` | — | Interpolation weight for the JSD mixture distribution. `beta=0` corresponds to forward KL, `beta=1` to reverse KL. |
| `--jsd_token_clip` | `0.05` | Clip each token's JSD loss to a maximum value. Improves stability by preventing stylistic tokens from dominating the training signal. |
| `--reason_first` | `False` | Prepend an explicit rationalization to the teacher context before distillation. |
| `--run_config` | `None` | Custom name suffix for the output directory and WandB run. |
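A minimal sketch of the `--beta` interpolation, assuming a GKD-style generalized JSD (the function names and toy distributions here are illustrative, not the repo's internals). The endpoints are special-cased to the pure KLs, matching the description above:

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocab."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def generalized_jsd(p, q, beta):
    """p = teacher dist, q = student dist over the same vocab.

    beta = 0 -> forward KL(p || q); beta = 1 -> reverse KL(q || p);
    in between, a beta-weighted divergence against the mixture m.
    """
    if beta == 0.0:
        return kl(p, q)
    if beta == 1.0:
        return kl(q, p)
    m = [beta * pi + (1.0 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, m) + (1.0 - beta) * kl(q, m)

p = [0.7, 0.2, 0.1]  # teacher (problem + solution in context)
q = [0.4, 0.4, 0.2]  # student (problem only)

for beta in (0.0, 0.5, 1.0):
    print(beta, round(generalized_jsd(p, q, beta), 4))
```

The mixture term keeps the intermediate divergence bounded, which is one reason the full-vocabulary JSD variant tends to be better behaved than the sampled-token objective.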
See `scripts/run_sft.sh`.
See `scripts/run_grpo.sh`.
Our implementation builds on the TRL GOLD Trainer. We sincerely thank @simran135 and @beanie00 for identifying the prompt-template bugs and the ZeRO-2 issue, respectively!
If you find this useful, please consider citing:
```bibtex
@article{zhao2026self,
  title={Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models},
  author={Zhao, Siyan and Xie, Zhihui and Liu, Mengchen and Huang, Jing and Pang, Guan and Chen, Feiyu and Grover, Aditya},
  journal={arXiv preprint arXiv:2601.18734},
  year={2026}
}
```