📢 If you are also engaged in research on LRMs or RL, we welcome your suggestions, and feel free to open an issue if you have any questions about the code. If you are interested in our work, please star ⭐ our repository. Thx 💕.
- 📢 News
- 📖 Introduction
- ✨ Getting Started
- 🔧 Usage
- 🙏 Citation
- 🌻 Acknowledgement
## 📢 News

- [2025/04/20] The CANON paper is available on arXiv.
## 📖 Introduction

We introduce Conditional advANtage estimatiON (CANON), which amplifies the impact of changes in a chosen metric by regrouping the sampled responses into two groups according to their values of that metric. Rather than comparing each response against the mean of all responses as DR.GRPO does, CANON selects, through inter-group comparison, the direction of metric change that contributes more to performance, and through intra-group comparison favors responses that perform better among those following the same trend. DR.GRPO can be expressed as the average of CANON's two advantage estimates and is therefore a special case of CANON.
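The regrouping idea can be sketched in a few lines of NumPy. This is a simplified stand-in, not the paper's exact formulation: the median split, the `alpha`-weighted mix of the inter- and intra-group terms, and the function name are all assumptions here (the actual combination is given by Eqs. 5 and 8 of the paper).

```python
import numpy as np

def canon_advantage(rewards, metric, alpha=0.5):
    """Sketch of CANON-style conditional advantage estimation.

    Responses are split into two groups by the median of a chosen metric
    (e.g. per-token entropy or response length).  The intra-group term
    favors responses that beat their own group's mean reward; the
    inter-group term favors the group whose metric direction yields the
    higher mean reward.  `alpha` mixing the two terms is a hypothetical
    simplification of the paper's combination rule.
    """
    rewards = np.asarray(rewards, dtype=float)
    metric = np.asarray(metric, dtype=float)
    high = metric >= np.median(metric)  # group with larger metric values
    mean_hi, mean_lo = rewards[high].mean(), rewards[~high].mean()

    adv = np.empty_like(rewards)
    for mask, own, other in ((high, mean_hi, mean_lo), (~high, mean_lo, mean_hi)):
        intra = rewards[mask] - own  # within-group comparison
        inter = own - other          # between-group comparison
        adv[mask] = alpha * inter + (1 - alpha) * intra
    return adv
```

With equal-size groups the advantages sum to zero, and within each group the better response always receives the larger advantage, mirroring the intra-group comparison described above.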
## ✨ Getting Started

To set up the environment, run:

```bash
git clone https://github.com/biuboomc/CANON.git
cd CANON
pip install -e .
```
For the Llama series models, we construct a dataset with 35k queries. For the Qwen series models, we use the 47k-query dataset released on Hugging Face.
## 🔧 Usage

Our code is based on VeRL, and you can use CANON with different `adv_estimator` values:
| adv_estimator | Method |
|---|---|
| dr_entropy_token_budget | CANON based on per-token generation entropy |
| dr_length_on_mean | CANON based on response length |
| dr_entropy_token_budget_annel | CANON based on entropy with First-Inter-Later-Intra and First-Intra-Later-Inter annealing |
| dr_entropy_token_budget_cosine_restart | CANON based on Cosine-First-Intra-Later-Inter |
| dr_entropy_token_budget_cosine_restart_r | CANON based on Cosine-First-Inter-Later-Intra |
| dr_random | CANON based on random regrouping |
We also introduce two hyperparameters for CANON:
| hyperparameter | Description |
|---|---|
| alpha | The α in Eq. 8 of the paper |
| _lambda | The μ in Eq. 5 of the paper |
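As an illustration, a run might be launched through VeRL's Hydra-style command-line overrides. The override paths below (`algorithm.adv_estimator`, `algorithm.alpha`, `algorithm._lambda`) and the placeholder variables are assumptions for this sketch; check the training scripts shipped with this repository for the authoritative names.

```bash
# Hypothetical launch sketch: VeRL's trainer is configured via Hydra-style
# overrides; the algorithm.alpha / algorithm._lambda paths are assumptions.
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=dr_entropy_token_budget \
    algorithm.alpha=0.5 \
    algorithm._lambda=0.5 \
    data.train_files=$TRAIN_FILE \
    actor_rollout_ref.model.path=$MODEL_PATH
```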
## 🙏 Citation

If you find this work useful, please consider citing:
```bibtex
@article{chen2025conditional,
  title={Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models},
  author={Chen, Guanxu and Li, Yafu and Jiang, Yuxian and Qian, Chen and Ren, Qihan and Yang, Jingyi and Cheng, Yu and Liu, Dongrui and Shao, Jing},
  journal={arXiv preprint arXiv:2509.23962},
  year={2025}
}
```

## 🌻 Acknowledgement

Our code is based on VeRL. Sincere thanks for their wonderful work.