# Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards

This is the official code for the paper *Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards*.

## 💡 What is MERCI?

MERCI (Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards) builds on an insight that applies to a broad class of LLM reasoning tasks, specifically self-contained ones such as mathematical problem-solving, where the model operates without an external, stochastic world.

In autoregressive generation, the underlying Markov Decision Process (MDP) has known, deterministic transitions: when an LLM in state $s$ (the token sequence generated so far) selects action $a$ (the next token), the successor state is determined without ambiguity. This property dramatically simplifies the Uncertainty Bellman Equation (UBE), which normally propagates uncertainty from two sources: the reward function estimate $\hat{r}$ and the transition function estimate $\hat{P}$.

With known transitions, the epistemic uncertainty of $\hat{P}$ is zero, and the UBE reduces to a simple accumulation of local reward uncertainty along a trajectory. This reframes the intractable problem of estimating Q-value uncertainty as the more manageable one of estimating local reward uncertainty.
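In symbols, using generic UBE notation rather than the paper's exact formulation ($\gamma$ for the discount factor, $\sigma_r^2$ for the variance of the local reward estimate), the reduction along a sampled trajectory $(s_t, a_t, s_{t+1}, a_{t+1}, \ldots)$ looks roughly like:

```latex
U(s_t, a_t) \;=\; \sigma_r^2(s_t, a_t) \;+\; \gamma^{2}\, U(s_{t+1}, a_{t+1})
            \;=\; \sum_{k=t}^{T} \gamma^{2(k-t)}\, \sigma_r^2(s_k, a_k)
```

The transition term that usually couples the recursion to $\hat{P}$'s uncertainty drops out, leaving only a discounted sum of per-step reward uncertainties.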

To this end, we employ the Flipping Coins method, a computationally lightweight and theoretically grounded pseudo-counting technique that provides a scalable estimator of this local uncertainty, and translate the estimates into intrinsic rewards that guide policy optimization.
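As an illustrative sketch (not the repository's implementation), the coin-flipping idea can be emulated in tabular form. On each visit to a state we draw $d$ Rademacher "coin flips"; the Bayes-optimal regressor onto those flips is their running mean, whose mean squared value concentrates to $1/n(s)$, so its root serves as a $1/\sqrt{n(s)}$-style exploration bonus. The class and variable names below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

class CoinFlipCounter:
    """Tabular analogue of a Coin Flipping Network (CFN).

    Each visit to a state draws d Rademacher coin flips in {-1, +1}.
    The optimal regressor onto those targets is the running mean of
    flips; its mean squared value concentrates to 1/n(state), so
    sqrt of that gives a ~1/sqrt(n) pseudo-count intrinsic reward.
    """

    def __init__(self, d=64):
        self.d = d
        self.flip_sum = {}   # state -> running sum of coin flips
        self.visits = {}     # state -> visit count

    def update(self, state):
        flips = rng.choice([-1.0, 1.0], size=self.d)
        self.flip_sum[state] = self.flip_sum.get(state, np.zeros(self.d)) + flips
        self.visits[state] = self.visits.get(state, 0) + 1

    def intrinsic_reward(self, state):
        if state not in self.visits:
            return 1.0  # never-visited state gets the maximal bonus
        mean = self.flip_sum[state] / self.visits[state]
        return float(np.sqrt(np.mean(mean ** 2)))  # shrinks like 1/sqrt(n)

counter = CoinFlipCounter()
for _ in range(100):
    counter.update("common")
counter.update("rare")

# Rarely visited states receive a larger exploration bonus.
assert counter.intrinsic_reward("rare") > counter.intrinsic_reward("common")
```

In the actual method a neural network replaces this lookup table, so the estimator generalizes across the enormous space of token-sequence states rather than requiring exact matches.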

*(Figure: overview of the MERCI framework.)*

## ✅ Results

### Coin Flipping Network

The figure below shows examples of token-level epistemic-uncertainty estimates within a response. Red regions mark token positions to which the CFN assigns relatively high uncertainty; blue regions mark relatively low estimates.

*(Figure: CFN token-level uncertainty estimates.)*

### RL Training

We conduct a comprehensive set of experiments on two types of benchmarks: mathematical reasoning and SQL generation.

*(Figure: RL training results on mathematical reasoning and SQL generation benchmarks.)*

## 🌍 Environment

```shell
# Follow https://github.com/volcengine/verl to build the environment,
# then install verl from the repository root:
pip install -e .
```

## 📖 Quick Start

```shell
# Training for Mathematical Reasoning
# GRPO
bash ./examples/MERCI/train_qwen2.5_math_grpo_cfn.sh
# DAPO
bash ./recipe/dapo/example/run_qwen2.5_math_dapo_cfn.sh

# Training for SQL Generation
# GRPO
bash ./examples/MERCI/train_llama3_sql_grpo_cfn.sh
# DAPO
bash ./examples/MERCI/train_llama3_sql_dapo_cfn.sh
```
