CANON: Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models

📢 If you also work on LRMs or RL, we welcome your suggestions. Feel free to open an issue if you have any questions about the code. If you find our work interesting, please star ⭐ our repository. Thanks 💕.

📚 Overview

📢 News
📖 Introduction
✨ Getting Started
🔧 Usage
🙏 Citation
🌻 Acknowledgements

📢 News

[2025/04/20] CANON paper available on arXiv.

📖 Introduction

We introduce Conditional advANtage estimatiON (CANON), which amplifies the impact of changes in a specific metric by regrouping the sampled responses into two groups according to their values of that metric. Rather than comparing each response against the mean of all responses, as DR.GRPO does, CANON uses two comparisons. The inter-group comparison selects the direction of metric change that contributes more to performance. The intra-group comparison favors the better-performing responses among those that follow the same trend. DR.GRPO can be expressed as the average of CANON’s two advantage estimates and is therefore a special case of CANON.
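The regrouping idea above can be illustrated with a small, self-contained sketch. This is not the repository's implementation: the function name `canon_advantages`, the equal-halves split by metric rank, and the definitions of the two estimates (reward minus the other group's mean reward for the inter-group estimate, reward minus the own group's mean reward for the intra-group estimate) are assumptions made for illustration. Under these assumptions, averaging the two estimates recovers the mean-baseline advantage, which matches the special-case statement above.

```python
from statistics import mean


def canon_advantages(rewards, metric):
    """Return (inter, intra) advantage lists for one prompt's sampled responses.

    Illustrative sketch only; the split rule and formulas are assumptions.
    """
    n = len(rewards)
    if n != len(metric) or n < 2 or n % 2:
        raise ValueError("need an even number (>= 2) of rewards with matching metric values")

    # Assumption: rank responses by the metric and split into two equal halves.
    order = sorted(range(n), key=lambda i: metric[i])
    low, high = order[: n // 2], order[n // 2:]
    mean_low = mean(rewards[i] for i in low)
    mean_high = mean(rewards[i] for i in high)

    inter = [0.0] * n
    intra = [0.0] * n
    for group, own, other in ((low, mean_low, mean_high), (high, mean_high, mean_low)):
        for i in group:
            inter[i] = rewards[i] - other  # baseline: mean reward of the other group
            intra[i] = rewards[i] - own    # baseline: mean reward of the same group
    return inter, intra


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
    lengths = [120, 340, 95, 410, 500, 150]  # hypothetical metric: response length
    inter, intra = canon_advantages(rewards, lengths)

    # Mean-baseline advantage: each reward minus the mean over all responses.
    baseline = [r - mean(rewards) for r in rewards]
    averaged = [(a + b) / 2 for a, b in zip(inter, intra)]
    assert all(abs(x - y) < 1e-9 for x, y in zip(averaged, baseline))
```

The equality in the final check relies on the two groups having the same size; with unequal groups the average of the two group means no longer equals the overall mean.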

✨ Getting Started

To set up the environment, run:

git clone https://github.com/biuboomc/CANON.git
cd CANON
pip install -e .

🔧 Usage

For the Llama series models, we construct a dataset of 35k queries. For the Qwen series models, we use the 47k-query dataset released on Hugging Face.

Our code is based on VeRL, and you can use CANON by choosing one of the following adv_estimator values:

adv_estimator	Method
dr_entropy_token_budget	CANON based on per-token generation entropy
dr_length_on_mean	CANON based on response length
dr_entropy_token_budget_annel	CANON based on entropy with First-Inter-Later-Intra and First-Intra-Later-Inter
dr_entropy_token_budget_cosine_restart	CANON based on Cosine-First-Intra-Later-Inter
dr_entropy_token_budget_cosine_restart_r	CANON based on Cosine-First-Inter-Later-Intra
dr_random	CANON based on random regrouping
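
As a rough illustration of selecting one of the estimators in the table, the sketch below builds a launch command as a string. It assumes this repository keeps upstream VeRL's `verl.trainer.main_ppo` entry point and its `algorithm.adv_estimator` override key, which the README does not confirm; the helper `build_command` is hypothetical, and a real run needs further overrides (data, model, trainer settings) that are not shown here.

```python
import shlex

# Estimator names copied from the table above.
ADV_ESTIMATORS = {
    "dr_entropy_token_budget",
    "dr_length_on_mean",
    "dr_entropy_token_budget_annel",
    "dr_entropy_token_budget_cosine_restart",
    "dr_entropy_token_budget_cosine_restart_r",
    "dr_random",
}


def build_command(adv_estimator, extra_overrides=()):
    """Build a launch command string (hypothetical helper, not part of the repo)."""
    if adv_estimator not in ADV_ESTIMATORS:
        raise ValueError(f"unknown adv_estimator: {adv_estimator!r}")
    args = [
        "python3", "-m", "verl.trainer.main_ppo",    # assumed: upstream VeRL entry point
        f"algorithm.adv_estimator={adv_estimator}",  # assumed: upstream VeRL override key
        *extra_overrides,
    ]
    return shlex.join(args)


if __name__ == "__main__":
    print(build_command("dr_length_on_mean"))
```

Rejecting unknown names early keeps a typo in the estimator string from surfacing only after the trainer has started.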

We also introduce two hyperparameters for CANON:

hyperparameter	Description
alpha	alpha in Eq. 8
_lambda	mu in Eq. 5

🙏 Citation

If you find this work useful, please consider citing:

@article{chen2025conditional,
  title={Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models},
  author={Chen, Guanxu and Li, Yafu and Jiang, Yuxian and Qian, Chen and Ren, Qihan and Yang, Jingyi and Cheng, Yu and Liu, Dongrui and Shao, Jing},
  journal={arXiv preprint arXiv:2509.23962},
  year={2025}
}

🌻 Acknowledgements

The code is based on VeRL. We sincerely thank its authors for their wonderful work.
