# Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards

This is the official code for the paper *Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards*.

## 💡 What is MERCI?

MERCI (Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards) builds on an insight that applies to a broad class of LLM reasoning tasks, specifically self-contained ones such as mathematical problem-solving, where the model operates without an external, stochastic world.

In autoregressive generation, the underlying Markov Decision Process (MDP) has known, deterministic transitions: when an LLM in state $s$ (the token sequence generated so far) selects action $a$ (the next token), the successor state is determined without ambiguity. This property dramatically simplifies the Uncertainty Bellman Equation (UBE), which normally propagates uncertainty from two sources: the reward function estimate $\hat{r}$ and the transition function estimate $\hat{P}$.

With known transitions, the epistemic uncertainty of $\hat{P}$ is zero, and the UBE reduces to a simple accumulation of local reward uncertainty along a trajectory. This reframes the intractable problem of estimating Q-value uncertainty as the more manageable one of estimating local reward uncertainty.
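In symbols, using generic UBE notation rather than the paper's exact formulation ($\gamma$ for the discount factor, $\sigma_r^2$ for the variance of the local reward estimate), the reduction along a sampled trajectory $(s_t, a_t, s_{t+1}, a_{t+1}, \ldots)$ looks roughly like:

```latex
U(s_t, a_t) \;=\; \sigma_r^2(s_t, a_t) \;+\; \gamma^{2}\, U(s_{t+1}, a_{t+1})
            \;=\; \sum_{k=t}^{T} \gamma^{2(k-t)}\, \sigma_r^2(s_k, a_k)
```

The transition term that usually couples the recursion to $\hat{P}$'s uncertainty drops out, leaving only a discounted sum of per-step reward uncertainties.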

To this end, we employ the Flipping Coins method, a computationally lightweight and theoretically grounded pseudo-counting technique that provides a scalable estimator of this local uncertainty, and translate the estimates into intrinsic rewards that guide policy optimization.
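As an illustrative sketch (not the repository's implementation), the coin-flipping idea can be emulated in tabular form. On each visit to a state we draw $d$ Rademacher "coin flips"; the Bayes-optimal regressor onto those flips is their running mean, whose mean squared value concentrates to $1/n(s)$, so its root serves as a $1/\sqrt{n(s)}$-style exploration bonus. The class and variable names below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

class CoinFlipCounter:
    """Tabular analogue of a Coin Flipping Network (CFN).

    Each visit to a state draws d Rademacher coin flips in {-1, +1}.
    The optimal regressor onto those targets is the running mean of
    flips; its mean squared value concentrates to 1/n(state), so
    sqrt of that gives a ~1/sqrt(n) pseudo-count intrinsic reward.
    """

    def __init__(self, d=64):
        self.d = d
        self.flip_sum = {}   # state -> running sum of coin flips
        self.visits = {}     # state -> visit count

    def update(self, state):
        flips = rng.choice([-1.0, 1.0], size=self.d)
        self.flip_sum[state] = self.flip_sum.get(state, np.zeros(self.d)) + flips
        self.visits[state] = self.visits.get(state, 0) + 1

    def intrinsic_reward(self, state):
        if state not in self.visits:
            return 1.0  # never-visited state gets the maximal bonus
        mean = self.flip_sum[state] / self.visits[state]
        return float(np.sqrt(np.mean(mean ** 2)))  # shrinks like 1/sqrt(n)

counter = CoinFlipCounter()
for _ in range(100):
    counter.update("common")
counter.update("rare")

# Rarely visited states receive a larger exploration bonus.
assert counter.intrinsic_reward("rare") > counter.intrinsic_reward("common")
```

In the actual method a neural network replaces this lookup table, so the estimator generalizes across the enormous space of token-sequence states rather than requiring exact matches.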

*(Figure: overview of the MERCI framework.)*

## ✅ Results

### Coin Flipping Network

The figure below shows examples of token-level epistemic-uncertainty estimates within a response. Red regions mark token positions to which the CFN assigns relatively high uncertainty; blue regions mark relatively low estimates.

*(Figure: CFN token-level uncertainty estimates.)*

### RL Training

We conduct a comprehensive set of experiments on two types of benchmarks: mathematical reasoning and SQL generation.

*(Figure: RL training results on mathematical reasoning and SQL generation benchmarks.)*

## 🌍 Environment

```shell
# Follow https://github.com/volcengine/verl to build the environment,
# then install verl from the repository root:
pip install -e .
```

## 📖 Quick Start

```shell
# Training for Mathematical Reasoning
# GRPO
bash ./examples/MERCI/train_qwen2.5_math_grpo_cfn.sh
# DAPO
bash ./recipe/dapo/example/run_qwen2.5_math_dapo_cfn.sh

# Training for SQL Generation
# GRPO
bash ./examples/MERCI/train_llama3_sql_grpo_cfn.sh
# DAPO
bash ./examples/MERCI/train_llama3_sql_dapo_cfn.sh
```
