CANON: Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models

📢 If you are also working on LRMs or RL, we welcome your suggestions. Feel free to open an issue if you have any questions about the code. If you find our work interesting, please star ⭐ our repository. Thanks 💕.



📚 Overview


📢 News

  • [2025/04/20] CANON paper available on arXiv.

📖 Introduction

We introduce Conditional advANtage estimatiON (CANON), which amplifies the impact of changes in a target metric by splitting the sampled responses into two groups according to their values of that metric. Rather than comparing each response against the mean over all responses, as DR.GRPO does, CANON makes two comparisons. The inter-group comparison identifies which direction of metric change contributes more to performance. The intra-group comparison favors the better-performing responses among those that follow the same trend. DR.GRPO can be expressed as the average of CANON’s two advantage estimates and is therefore a special case of CANON.
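The two comparisons can be illustrated with a small sketch. This is not the repository's implementation. It assumes that the responses are split at the median of the metric, that the inter-group baseline is the mean reward of the other group, and that the intra-group baseline is the mean reward of the response's own group. The function name and the `inter_weight` parameter are hypothetical. Under these assumptions, with equal-sized groups and equal weights, the blended advantage reduces to a plain mean baseline (reward minus the mean over all responses).

```python
from statistics import mean
from typing import List, Sequence


def conditional_advantages(
    rewards: Sequence[float],
    metric: Sequence[float],
    inter_weight: float = 0.5,
) -> List[float]:
    """Blend inter-group and intra-group advantages for one prompt's samples.

    Illustrative sketch only, not the repository's implementation.
    """
    if len(rewards) != len(metric) or len(rewards) < 2:
        raise ValueError("need at least two responses with one metric value each")

    # Assumption: split at the metric median so both groups are (nearly) equal in size.
    order = sorted(range(len(rewards)), key=lambda i: metric[i])
    half = len(order) // 2
    low, high = order[:half], order[half:]

    mean_low = mean(rewards[i] for i in low)
    mean_high = mean(rewards[i] for i in high)

    advantages = [0.0] * len(rewards)
    for group, own_mean, other_mean in (
        (low, mean_low, mean_high),
        (high, mean_high, mean_low),
    ):
        for i in group:
            inter = rewards[i] - other_mean  # baseline: the other group's mean reward
            intra = rewards[i] - own_mean    # baseline: this group's own mean reward
            advantages[i] = inter_weight * inter + (1.0 - inter_weight) * intra
    return advantages


rewards = [1.0, 1.0, 0.0, 1.0]
lengths = [10, 20, 30, 40]  # metric used for regrouping: response length

blended = conditional_advantages(rewards, lengths, inter_weight=0.5)
mean_baseline = [r - mean(rewards) for r in rewards]
```

Here `blended` and `mean_baseline` coincide because both groups hold two responses. Moving `inter_weight` toward 1.0 or 0.0 puts all the weight on the inter-group or the intra-group comparison respectively.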


✨ Getting Started

To set up the environment, run:

git clone https://github.com/biuboomc/CANON.git
cd CANON
pip install -e .

🔧 Usage

For the Llama-series models, we construct a dataset with 35k queries. For the Qwen-series models, we use the 47k-query dataset released on Hugging Face.

Our code is based on VeRL. CANON is selected through the adv_estimator setting, which accepts the following values:

| adv_estimator | Method |
| --- | --- |
| dr_entropy_token_budget | CANON based on per-token generation entropy |
| dr_length_on_mean | CANON based on response length |
| dr_entropy_token_budget_annel | CANON based on entropy with First-Inter-Later-Intra and First-Intra-Later-Inter |
| dr_entropy_token_budget_cosine_restart | CANON based on Cosine-First-Intra-Later-Inter |
| dr_entropy_token_budget_cosine_restart_r | CANON based on Cosine-First-Inter-Later-Intra |
| dr_random | CANON based on random regrouping |
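As a rough sketch of how an estimator might be selected, the snippet below builds a training command from one of the names in the table above. It assumes that this repository keeps upstream VeRL's `verl.trainer.main_ppo` entry point and its Hydra-style `algorithm.adv_estimator` override. Neither is confirmed by this README, and `build_train_command` is a hypothetical helper.

```python
import shlex
from typing import List

# Estimator names copied from the table above.
CANON_ESTIMATORS = (
    "dr_entropy_token_budget",
    "dr_length_on_mean",
    "dr_entropy_token_budget_annel",
    "dr_entropy_token_budget_cosine_restart",
    "dr_entropy_token_budget_cosine_restart_r",
    "dr_random",
)


def build_train_command(adv_estimator: str) -> List[str]:
    """Return a training command that selects the given CANON estimator.

    Assumptions (not confirmed by this README): the upstream VeRL entry point
    `verl.trainer.main_ppo` and the Hydra key `algorithm.adv_estimator`.
    """
    if adv_estimator not in CANON_ESTIMATORS:
        raise ValueError(f"unknown adv_estimator: {adv_estimator!r}")
    return [
        "python3", "-m", "verl.trainer.main_ppo",
        f"algorithm.adv_estimator={adv_estimator}",
    ]


command = build_train_command("dr_length_on_mean")
print(shlex.join(command))
```

A real run also needs data, model, and rollout settings, which are omitted here because this README does not state their configuration keys.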

CANON introduces two hyperparameters:

| Hyperparameter | Description |
| --- | --- |
| alpha | alpha in Eq. 8 of the paper |
| _lambda | mu in Eq. 5 of the paper |

🙏 Citation

If you find this work useful, please consider citing:

@article{chen2025conditional,
  title={Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models},
  author={Chen, Guanxu and Li, Yafu and Jiang, Yuxian and Qian, Chen and Ren, Qihan and Yang, Jingyi and Cheng, Yu and Liu, Dongrui and Shao, Jing},
  journal={arXiv preprint arXiv:2509.23962},
  year={2025}
}

🌻 Acknowledgements

The code is based on VeRL. Sincere thanks to its authors for their wonderful work.
