This is the official code for the paper "Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards".
Our method, MERCI (Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards), rests on an insight that applies to a broad class of LLM reasoning tasks: self-contained tasks such as mathematical problem-solving, where the model operates without an external, stochastic world.
In autoregressive generation, the underlying Markov Decision Process (MDP) has known, deterministic transitions: when an LLM in state s (the token sequence generated so far) selects an action a (the next token), the next state is determined unambiguously. This property dramatically simplifies the Uncertainty Bellman Equation (UBE), which propagates uncertainty from two sources: the estimated reward function \hat{r} and the estimated transition function \hat{P}.
With known transitions, the epistemic uncertainty of \hat{P} is zero, and the UBE reduces to a simple accumulation of local reward uncertainty along a trajectory. This reframes the intractable problem of estimating Q-value uncertainty as the more tractable one of estimating local reward uncertainty.
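One way to make this reduction explicit is the following sketch in UBE-style notation (not necessarily the paper's exact formulation; here \sigma_r^2 denotes the local epistemic variance of \hat{r} and \gamma the discount factor):

```latex
% UBE: uncertainty from both the reward estimate and the transition estimate
U(s_t, a_t) = \sigma_r^2(s_t, a_t)
            + \gamma^2 \, \mathbb{E}_{s_{t+1} \sim \hat{P}(\cdot \mid s_t, a_t)}
              \big[ U(s_{t+1}, a_{t+1}) \big]
% With known, deterministic transitions the expectation collapses,
% and unrolling along the trajectory gives
U(s_t, a_t) = \sum_{k \ge t} \gamma^{2(k-t)} \, \sigma_r^2(s_k, a_k)
```

The second equation shows why local reward uncertainty is all that needs to be estimated: trajectory-level uncertainty is just its discounted sum.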
To this end, we employ the Coin Flipping Network (CFN), a computationally lightweight and theoretically grounded pseudo-counting technique that yields a scalable estimator of this local uncertainty and translates the estimates into intrinsic rewards that guide policy optimization.
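To illustrate the coin-flipping pseudo-count idea, here is a minimal tabular sketch (a hypothetical simplification, not the repository's CFN implementation): on each visit to a state, draw d Rademacher "coin flips" in {-1, +1} and track their running means. Since the squared running mean of each coin has expectation 1/n(s), averaging over coins estimates 1/n(s), and its square root gives a 1/sqrt(n)-style exploration bonus.

```python
import numpy as np

rng = np.random.default_rng(0)

class TabularCoinFlipCounter:
    """Tabular sketch of coin-flip pseudo-counts: for each visit to state s,
    draw d Rademacher labels and regress on them via a running mean. With
    f_j(s) the mean of coin j over n(s) visits, E[f_j(s)^2] = 1/n(s), so
    the coin-averaged squared mean estimates 1/n(s). (The neural CFN replaces
    this table with a network, avoiding explicit per-state counts.)"""

    def __init__(self, d=64):
        self.d = d
        self.flip_sums = {}  # per-state sum of coin flips, shape (d,)
        self.n = {}          # visit counts, kept here only to form running means

    def update(self, s):
        flips = rng.choice([-1.0, 1.0], size=self.d)
        self.flip_sums[s] = self.flip_sums.get(s, np.zeros(self.d)) + flips
        self.n[s] = self.n.get(s, 0) + 1

    def inv_count(self, s):
        """Estimate of 1/n(s); returns 1.0 for unseen states."""
        if s not in self.n:
            return 1.0
        f = self.flip_sums[s] / self.n[s]  # running mean of each coin
        return float(np.mean(f ** 2))      # concentrates around 1/n(s)

    def bonus(self, s):
        """Count-based intrinsic reward, roughly 1/sqrt(n(s))."""
        return self.inv_count(s) ** 0.5
```

A typical use is to add `beta * bonus(s)` to the task reward during policy optimization, so rarely visited states (here, token-sequence prefixes) receive a larger learning signal.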
The following figure shows examples of token-level estimated epistemic uncertainty within a response. Red regions indicate relatively higher uncertainty estimates assigned by the CFN to the corresponding token positions, while blue regions indicate relatively lower estimates.
We conduct a comprehensive set of experiments on two types of benchmarks: mathematical reasoning and SQL generation.
# Follow https://github.com/volcengine/verl to build the environment
# Install verl
pip install -e .
# Training for Mathematical Reasoning
# GRPO
bash ./examples/MERCI/train_qwen2.5_math_grpo_cfn.sh
# DAPO
bash ./recipe/dapo/example/run_qwen2.5_math_dapo_cfn.sh
# Training for SQL Generation
# GRPO
bash ./examples/MERCI/train_llama3_sql_grpo_cfn.sh
# DAPO
bash ./examples/MERCI/train_llama3_sql_dapo_cfn.sh