On-Policy Self-Distillation (OPSD) trains a single model to act as both student and teacher by conditioning on different contexts — the student sees only the problem, while the teacher additionally sees the ground-truth solution — and performs token-level distribution matching along the student's own on-policy trajectories.
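The conditioning asymmetry can be sketched numerically. The toy example below is illustrative only (made-up distributions, not the repo's code): the same model yields a "teacher" per-token distribution when the ground-truth solution is in context and a "student" distribution when it is not, and the loss averages a per-token divergence along the student's own rollout — forward KL is used here purely for concreteness.

```python
import math

# Toy sketch of OPSD's token-level matching (illustrative numbers only).
# One model, two contexts:
#   student context: problem
#   teacher context: problem + ground-truth solution
# Along a student-sampled rollout of T tokens, match the per-token
# distributions; forward KL(teacher || student) is used for simplicity.

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocab."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Per-token distributions over a 3-token toy vocabulary, 2-token rollout.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # sees problem + solution
student = [[0.4, 0.4, 0.2], [0.2, 0.6, 0.2]]  # sees problem only

loss = sum(kl(p, q) for p, q in zip(teacher, student)) / len(teacher)
print(round(loss, 4))
```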
- Mar 18, 2026: Released updated code.
  (1) Fixed chat-template and ZeRO-2 bugs (see the template issue) and re-ran the experiments; detailed results and ablations are updated on arXiv/blog. The fixes improve OPSD performance, most notably on Qwen3-1.7B.
  (2) Added a new training stabilization strategy 🚀: per-token point-wise KL clipping. We find that style tokens (such as "wait" and "think") can exhibit 6–15× higher KL divergence than math-related tokens and dominate the training signal. Clipping stabilizes training and improves performance.
- Mar 3, 2026: Initial code release.
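The per-token clipping idea can be sketched in a few lines — the divergence values below are hypothetical, with the cap set to the `--jsd_token_clip` default of 0.05:

```python
# Minimal sketch of per-token point-wise loss clipping (hypothetical values).
# Style tokens like "wait" can carry a much larger per-token divergence than
# math tokens; capping each token's contribution keeps them from dominating.

CLIP = 0.05  # mirrors the --jsd_token_clip default

per_token_kl = {"wait": 0.31, "think": 0.24, "x": 0.03, "=": 0.02, "7": 0.01}

clipped = {tok: min(v, CLIP) for tok, v in per_token_kl.items()}
raw_loss = sum(per_token_kl.values()) / len(per_token_kl)
clipped_loss = sum(clipped.values()) / len(clipped)
print(round(raw_loss, 3), round(clipped_loss, 3))
```

With the cap, the two style tokens contribute no more than any other token, so the mean loss is driven by the whole sequence rather than a few outliers.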
```shell
conda env create -f environment.yml
conda activate opsd
pip install flash-attn==2.8.3 --no-build-isolation
```

If you encounter difficulties installing flash-attn, check the flash-attention releases page for the version matching your CUDA and PyTorch versions.
The code uses trl's experimental GOLD trainer as a base.
```
├── opsd_trainer.py       # OPSDTrainer: core self-distillation trainer
├── data_collator.py      # Data collator for self-distillation
├── opsd_train.py         # OPSD training entry point
├── sft_train.py          # SFT baseline training entry point
├── grpo_train.py         # GRPO baseline training entry point
├── accelerate.yaml       # Accelerate config (multi-GPU)
├── scripts/
│   ├── run_opsd.sh       # Example launch script for OPSD
│   ├── run_sft.sh        # Example launch script for SFT
│   └── run_grpo.sh       # Example launch script for GRPO
└── eval/
    ├── evaluate_math.py  # Evaluation script (vLLM)
    └── run_eval.sh       # Example evaluation script
```
Reproduce results on Qwen3-1.7B (🚀 training only takes ~15 minutes on 4×H100 and peaks within 100 steps):
```shell
bash scripts/run_opsd_1b.sh
```

Evaluation (takes ~30–50 minutes on 4×H100 per checkpoint):

```shell
cd eval
bash run_eval.sh
```

[Results table: AIME24 / AIME25 / HMMT25 — detailed numbers on arXiv/blog.]
Evaluation settings: temperature=1.0, thinking mode enabled, max new tokens=38912, top-p=none, top-k disabled, min-p=0, presence penalty=0, num samples=12
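The evaluation settings above, written out as a plain config dict for reference. Key names loosely follow vLLM's `SamplingParams`, but this is an illustrative mapping, not the repo's eval config:

```python
# Illustrative mapping of the evaluation settings to sampling parameters
# (names loosely follow vLLM's SamplingParams; adjust to your version).
EVAL_SAMPLING = {
    "temperature": 1.0,
    "top_p": 1.0,          # "top-p=none": nucleus sampling disabled
    "top_k": -1,           # top-k disabled
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "max_tokens": 38912,   # max new tokens
    "n": 12,               # samples per problem
}
print(EVAL_SAMPLING["n"])
```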
See:
- `scripts/run_opsd_1b.sh`
- `scripts/run_opsd_4b.sh`
- `scripts/run_opsd_8b.sh`
| Argument | Default | Description |
|---|---|---|
| `--fixed_teacher` | `False` | Fix the teacher to the initial policy (step 0). Requires `--use_peft`. ❗ If you disable PEFT, the teacher keeps updating at every training step, which may make training unstable. Our main results use the fixed teacher, currently implemented with LoRA adapter weights. |
| `--use_tinker_loss` | `False` | Use a sampled-token policy-gradient objective instead of the full-vocabulary JSD. More memory-efficient. Clipping is not yet implemented for this variant, so it may be unstable. |
| `--max_completion_length` | — | Student generation length for distillation. We use 1024 in our main experiments. |
| `--beta` | — | Interpolation weight for the JSD mixture distribution. `beta=0` corresponds to forward KL, `beta=1` to reverse KL. |
| `--jsd_token_clip` | `0.05` | Clip each token's JSD loss to a maximum value. Improves stability by preventing stylistic tokens from dominating the training signal. |
| `--reason_first` | `False` | Prepend an explicit rationalization to the teacher context before distillation. |
| `--run_config` | `None` | Custom name suffix for the output directory and WandB run. |
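A minimal sketch of the `--beta` interpolation, assuming a GKD-style generalized JSD (the function names and toy distributions here are illustrative, not the repo's internals). The endpoints are special-cased to the pure KLs, matching the description above:

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocab."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def generalized_jsd(p, q, beta):
    """p = teacher dist, q = student dist over the same vocab.

    beta = 0 -> forward KL(p || q); beta = 1 -> reverse KL(q || p);
    in between, a beta-weighted divergence against the mixture m.
    """
    if beta == 0.0:
        return kl(p, q)
    if beta == 1.0:
        return kl(q, p)
    m = [beta * pi + (1.0 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, m) + (1.0 - beta) * kl(q, m)

p = [0.7, 0.2, 0.1]  # teacher (problem + solution in context)
q = [0.4, 0.4, 0.2]  # student (problem only)

for beta in (0.0, 0.5, 1.0):
    print(beta, round(generalized_jsd(p, q, beta), 4))
```

The mixture term keeps the intermediate divergence bounded, which is one reason the full-vocabulary JSD variant tends to be better behaved than the sampled-token objective.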
See `scripts/run_sft.sh`.
See `scripts/run_grpo.sh`.
Our implementation builds on the TRL GOLD Trainer. We sincerely thank @simran135 and @beanie00 for identifying the prompt-template bugs and the ZeRO-2 issue, respectively!
If you find this useful, please consider citing:
```bibtex
@article{zhao2026self,
  title={Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models},
  author={Zhao, Siyan and Xie, Zhihui and Liu, Mengchen and Huang, Jing and Pang, Guan and Chen, Feiyu and Grover, Aditya},
  journal={arXiv preprint arXiv:2601.18734},
  year={2026}
}
```