Direct Reasoning Optimization
ABSTRACT
1 INTRODUCTION
The emergence of reasoning capabilities in Large Language Models (LLMs) has marked a major
leap forward, particularly in tasks involving mathematics and programming (Guo et al., 2025; Jaech
et al., 2024; Zeng et al., 2024; Yang et al., 2025; Kavukcuoglu, Koray, 2025). To enable such rea-
soning, LLMs are trained using the Reinforcement Learning with Verifiable Rewards (RLVR) technique,
guided by verifiable rewards computed from the model’s own final outcomes (Schulman et al., 2017;
Guo et al., 2025; Liu et al., 2025a; Yu et al., 2025). These rewards are derived from objective signals
such as matching reference answers in math problems, passing unit tests in coding challenges, or se-
lecting the correct option in multiple-choice questions (MCQ). Compared to traditional approaches
like reward model training, verifiable rewards have proven effective in mitigating reward hacking
and are relatively straightforward to implement (Guo et al., 2025; Shao et al., 2024; Yue et al., 2025;
Liu et al., 2025a).
Building on the success of reasoning capabilities in math, programming, and MCQ tasks with RLVR,
there is growing interest in extending these techniques to open-ended tasks that require logical anal-
ysis, for example, revising a document in response to comments, composing analytical summaries
or reports, or reviewing financial documents. The primary challenge lies in designing a generic,
verifiable reward signal akin to those used in math and coding tasks (Zuo et al., 2025; Zhao et al.,
2025b; Su et al., 2025; Chen et al., 2025c). Given the limitations and potential inefficacy of training
a separate reward model (Guo et al., 2025; Shao et al., 2024; Zuo et al., 2025; Zhao et al., 2025b), the
[Figure 1 content: the task prompt (paper context, reviewer comments, paragraph to revise) and the reference revision; two sampled reasoning traces, Reasoning A (correctly moves the overview section earlier and reorders the paragraph) and Reasoning B (gives up and leaves the paragraph as is); their advantages under the vanilla aggregate reward (A: 0.173, B: 0.416) and under R3 (A: 0.503, B: 0.427); and the reasoning-reflective tokens highlighted in the reference.]
Figure 1: Illustrative example of Reasoning Reflection Reward (R3). For the paper revision task,
the model is prompted to revise a paragraph based on reviewer comments (upper left). R3 computes
per-token self-certainty (log-probabilities) in the reference revision (upper right) for each sampled
reasoning trace, and highlights reasoning-reflective tokens using σ(certainty). In this example, Rea-
soning A correctly identifies that Section 4 (overview) has been moved earlier and adjusts the para-
graph structure accordingly, with a minor omission of section numbers. Reasoning B gives up.
While a vanilla aggregate of certainty prefers B over A due to A’s lower certainty on the token “2”,
R3 successfully aligns with the desired ranking by up-weighting high-σ(certainty) tokens “gives”,
“existing” and “.” that better reflect reasoning effectiveness.
LLM-as-a-judge (Zheng et al., 2023; Lee et al., 2023) may seem to be an alternative. However, rely-
ing on an external LLM to evaluate the outcomes of an actor LLM in RLVR introduces sensitivities
to factors such as prompt design and optimization, model selection, the generator–discriminator gap,
and reward hacking (Chen et al., 2025b; Huang et al., 2024; Zuo et al., 2025; Xu et al., 2025b;
Sharma et al., 2024). Evaluating the training model's chain-of-thought (CoT) reasoning in semantic
space is even more challenging, since much of the reasoning is hidden in the latent space (Chen et al.,
2025d). Meanwhile, traditional similarity-based metrics such as ROUGE scores or cosine similarity
often fail to capture key logical aspects of open-ended outcomes and remain vulnerable to reward
hacking (Christiano et al., 2017; Stiennon et al., 2020; Su et al., 2025).
To address these challenges, we first introduce a new token-level dense reward called the Reasoning
Reflection Reward (R3). Owing to the autoregressive nature of LLMs, the CoT reasoning serves as
a latent prefix that conditions the model’s generation of the final outcome. Consequently, the LLM’s
token-level certainty of the reference outcome – measured under this reasoning prefix – effectively
captures how likely the generated reasoning is to produce the correct outcome. However, in long-
form generation, only a limited subset of tokens in the reference intrinsically reflect variations in
reasoning paths, while many others are less informative and may dilute the reward signal. To over-
come this, R3 selectively identifies and emphasizes the key tokens in the reference that are most
sensitive to variations in reasoning, shaping the reward signal to focus on these reasoning-reflective
tokens (Fig. 1). This approach enables the model to directly optimize its reasoning paths toward
achieving the reference outcomes in open-ended tasks, promoting outcome-driven reasoning in a
manner analogous to RLVR.
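To make this concrete, the following is a minimal sketch, in Python, of how the token-level certainty of the reference outcome under a sampled reasoning prefix could be computed. It assumes a HuggingFace-style causal LM and tokenizer; the function name and interface are illustrative, not the paper's released implementation.

```python
import torch

@torch.no_grad()
def reference_token_logprobs(model, tokenizer, prompt, cot, reference):
    """Per-token log-probabilities the model assigns to the reference outcome,
    conditioned on the prompt and one sampled CoT trace (the latent prefix)."""
    prefix_ids = tokenizer(prompt + cot, return_tensors="pt").input_ids
    ref_ids = tokenizer(reference, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, ref_ids], dim=1)
    logprobs = torch.log_softmax(model(input_ids).logits, dim=-1)  # [1, T, vocab]
    start = prefix_ids.shape[1]
    # The token at position t is predicted by the logits at position t - 1.
    return logprobs[0, start - 1 : -1].gather(1, ref_ids[0].unsqueeze(-1)).squeeze(-1)
```

Repeating this for each of the G sampled CoT traces yields a matrix of per-token log-probabilities over the reference, which the reward computations described below operate on.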
We then propose Direct Reasoning Optimization (DRO), an RL-based fine-tuning framework that
leverages R3 as its core reward signal. To compute R3, DRO directly uses a dynamic reward policy
derived from the same reference policy (LLM) being optimized – thereby eliminating the need for
any external reward model or signal. Our method builds upon the widely adopted RLVR framework,
Group Relative Policy Optimization (GRPO) (Guo et al., 2025; Shao et al., 2024), extending its
outcome-driven effectiveness to open-ended reasoning tasks. DRO further integrates a generic
data filtering technique for open-ended reasoning tasks, motivated by the growing recognition of
[Figure 2 diagram components: weight update, compute R3, length reward, R3-based dataset filter, and policy sync.]
Figure 2: Overview of Direct Reasoning Optimization (DRO), a framework that rewards and refines
reasoning by directly leveraging feedback from the training model. DRO operates within the GRPO
framework, where a group of CoT reasoning traces sampled from the actor policy (πθ ) are scored
primarily using the R3 score along with a length penalty on the final outcome. The reward is computed
via an internal policy (πrwd ), derived from the same base reference policy (πref ) being optimized.
DRO employs R3-based dynamic training data filtering for open-ended reasoning tasks to improve
data efficiency and downstream task performance.
data selection’s importance in recent work (Muennighoff et al., 2025; Jiang et al., 2025; Ye et al.,
2025; Yang et al., 2025). Our approach leverages R3 to dynamically filter training samples during
RL training, without requiring any task-specific filtering heuristics or external frameworks. This
filtering strategy improves downstream performance while simultaneously reducing training cost
and time.
Finally, we evaluate DRO on two distinct datasets – ParaRev (Jourdan et al., 2025) and FinQA (Chen
et al., 2021)—using two Qwen reasoning models distilled from DeepSeek-R1. To the best of our
knowledge, this is the first work to evaluate reasoning optimization on an open-ended task like para-
graph revision (ParaRev), which involves relatively long-form textual outputs beyond the traditional
math and programming domains. On ParaRev, DRO outperforms all baseline methods in terms of
downstream task performance while achieving around 45% reduction in training cost. We further
validate DRO on FinQA, a task with classic math-style answers, demonstrating that it achieves com-
parable performance to standard binary verifiable reward approaches—highlighting its versatility
across both structured and open-ended tasks.
2 RELATED WORK
Chain-of-Thought (CoT) reasoning has emerged as a critical driver of advanced reasoning in LLMs,
improving accuracy across mathematical, commonsense, and logical tasks while increasing trans-
parency in the decision-making process. Initial prompting-based methods demonstrated that LLMs
could be guided to reason step-by-step without additional training, resulting in significant perfor-
mance gains (Kojima et al., 2022; Huang & Chang, 2022; Zhang et al., 2022; Zelikman et al., 2022;
Wei et al., 2022). Building on this foundation, recent approaches have incorporated CoT reasoning
into the training loop—either through supervised fine-tuning on annotated reasoning traces (Zelik-
man et al., 2022) or via reinforcement learning with process- or outcome-based rewards (Shao et al.,
2024; Lambert et al., 2024)—to further strengthen reasoning capabilities. By decomposing prob-
lems into intermediate steps, LLMs not only improve in accuracy but also become more interpretable
and trustworthy, both of which are essential for real-world deployment (Lightman et al., 2023).
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful framework
for improving LLM performance in domains where success can be unambiguously defined and
automatically evaluated (Lambert et al., 2024; Liu et al., 2025a; Su et al., 2025). In areas such as
coding and mathematics, RLVR has enabled substantial advancements—models now solve complex
problems and generate correct code with unprecedented accuracy and consistency (Shao et al., 2024;
Yu et al., 2025; Muennighoff et al., 2025; Ye et al., 2025; Hu et al., 2025; Luo et al.; Liu & Zhang,
2025). This success stems from the integration of reinforcement learning with deterministic outcome
verification, eliminating the need for learned reward models and facilitating large-scale training on
diverse problem sets. However, extending RLVR to open-ended reasoning tasks remains a significant
challenge. These tasks often involve diverse reasoning paths and multiple valid outcomes, making it
difficult to define rule-based or verifiable rewards. As a result, designing reward signals that reliably
reflect reasoning quality in such settings is still an open problem.
Over the past year, considerable efforts have been made to extend the success of RLVR to open-
ended reasoning tasks. One line of work focuses on training general-purpose reward models to
supervise reasoning optimization (Chen et al., 2025c; Liu et al., 2025b; Su et al., 2025), which
introduces the overhead of developing and maintaining an additional reward model during RL
training. A complementary line of research explores the use of internal model feedback, such as
self-certainty—as a reward signal, thereby eliminating the need for external verifiers (Zhao et al.,
2025b;a; Xu et al., 2025a; Zuo et al., 2025; Zhou et al., 2025; Chen et al., 2024; Tang et al., 2025).
Among these, several concurrent studies (Zhao et al., 2025a; Xu et al., 2025a; Zhao et al., 2025b;
Zuo et al., 2025) rely exclusively on intrinsic feedback to optimize reasoning traces without refer-
ence answers, while other concurrent studies (Tang et al., 2025; Zhou et al., 2025) incorporate ref-
erence outcomes to estimate the quality of generated reasoning. However, none of these approaches
examine the token-level sensitivity of reasoning-quality rewards in the context of open-ended, long-
form generation, as we introduce in Section 4.1. Additionally, prior work does not address data
filtering for reasoning training using task-independent, model-internal rewards – an approach we
propose in Section 4.2.2 to improve data efficiency. Finally, to the best of our knowledge, we are the
first to evaluate RL-based reasoning optimization on a long-form open-ended task such as paragraph
revision in ParaRev (Section 5.1).
Recent advances in LLM reasoning have largely been driven by reinforcement learning (RL)-based
optimization techniques. To ground this process theoretically, we begin by framing RL-based rea-
soning optimization within the Markov Decision Process (MDP) framework. For LLMs, the MDP
can be naturally defined at the token level, as the model generates one token at each time step t. In
this setup, the state st at time t consists of the input prompt or question q followed by the sequence
of output tokens generated so far (o<t ), i.e., st = q; o<t . The LLM, acting as the policy πθ , takes
a stochastic action by picking the next token (ot ) from its vocabulary based on the current state st .
The state then transitions to st+1 = st ; [ot ]. With RL-based reasoning optimization, the goal is to
learn an optimal policy π ∗ that generates a sequence of tokens conditioned on the question q in such
a way that it leads to a desired final outcome, such as the correct answer to a math question.
In order to optimize the policy πθ , Shao et al. (2024) proposed Group Relative Policy Optimization
(GRPO), a variant of Proximal Policy Optimization (PPO) Schulman et al. (2017). The surrogate
objective in GRPO, which is maximized to learn the optimal policy, is defined as:
$$
\Bigg[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\min\!\Big(\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})}\hat{A}_{i,t},\;
\mathrm{clip}\Big(\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})},\,1-\epsilon,\,1+\epsilon\Big)\hat{A}_{i,t}\Big)
-\beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)\Bigg]
\tag{1}
$$
where ϵ is the clipping parameter to maintain stability and πθold denotes the policy before the most
recent update. The key distinction in GRPO lies in the computation of the advantage estimate $\hat{A}_{i,t}$
for the $t$-th token of the $i$-th output, which introduces a structured comparison across a group of
generations sampled for the same question. Specifically, for a given prompt or question, suppose we
sample a group of $G$ outputs $\{o_i\}_{i=1}^{G}$ from the actor model with corresponding rewards
$\{r_i\}_{i=1}^{G}$. Then, for each token in the $i$-th
output, the advantage is estimated as:
$$
\hat{A}_{i,t} = \frac{r_i - \mathrm{mean}\big(\{r_i\}_{i=1}^{G}\big)}{\mathrm{std}\big(\{r_i\}_{i=1}^{G}\big)}
\tag{2}
$$
In the context of RLVR, ri is typically a verifiable reward computed on the final outcome – such
as 1 if the final answer is correct and 0 otherwise. Note that each sampled output (oi ) consists of
CoT reasoning followed by the final answer. This group-normalized formulation encourages the policy
to assign higher probabilities to trajectories that outperform their peers, steering generation toward
more promising reasoning paths. As a result, the model learns to sample tokens that are more likely
to lead to correct or high-reward outcomes.
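As a minimal sketch of Eq. 2 (variable names are ours, not tied to any particular GRPO implementation), the group-relative advantage can be computed as:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Map a group of G scalar rewards to per-output advantages (Eq. 2).
    Every token of output i shares the same advantage value."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example with binary verifiable rewards for a group of G = 4 sampled outputs:
# correct outputs receive advantage +1, incorrect ones -1.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```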
Finally, GRPO includes a KL divergence regularization term, DKL , to constrain the updated policy
from deviating too much from the reference policy. This regularization is critical in preventing over-
fitting or reward exploitation – especially when a proxy reward model is used instead of a reference-based reward.
At the same time, a certain degree of exploration is necessary for the policy to meaningfully evolve
beyond the reference policy. The hyperparameter β controls this trade-off between exploration and
exploitation. In the context of RLVR, where the reward is derived from matching reference answers
(rather than a learned model), this risk is mitigated, and therefore recent state-of-the-art approaches
often set β = 0 (Liu et al., 2025a; Guo et al., 2025).
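For reference, a minimal sketch of the per-token KL penalty estimator used in GRPO (Shao et al., 2024), taking the per-token log-probabilities under the current and reference policies (names are ours):

```python
import numpy as np

def kl_penalty(logp_theta, logp_ref):
    """Per-token estimate of D_KL(pi_theta || pi_ref): r - log r - 1 with
    r = pi_ref / pi_theta; always non-negative, zero when the policies agree."""
    log_ratio = np.asarray(logp_ref) - np.asarray(logp_theta)
    return np.exp(log_ratio) - log_ratio - 1.0
```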
The success of the RLVR technique, as outlined in the previous section, stems from its simple yet ro-
bust reward mechanism based on verifiable reference outcomes. This outcome-driven reward struc-
ture makes RL training more resilient to reward hacking (Silver et al., 2017). RLVR has proven par-
ticularly effective in domains such as mathematics, programming, and logical reasoning—where the
correctness of a model’s output can be objectively verified and reliably translated into rewards (Shao
et al., 2024; Su et al., 2025). However, extending RLVR to open-ended, especially long-form gen-
eration, tasks – such as text drafting and revision, composing analytical summaries or reports, or
form completion—poses a significant challenge. In these scenarios, verifying logical correctness
and translating it into a clean reward signal is inherently difficult, even when reference outcomes are
available (Zhao et al., 2025b;a; Xu et al., 2025a; Zhou et al., 2025; Lu, 2025). Considering potential
solutions, we observe that:
Traditional similarity-based metrics fail to capture the essential features of open-ended rea-
soning outcomes. An intuitive approach involves measuring the similarity between the model-
generated output and the reference text using surface-level metrics such as ROUGE, which rely on
n-gram overlap. However, such metrics are ill-suited for evaluating logical coherence or reasoning
consistency, as they emphasize lexical similarity rather than logical or structural alignment. Two
responses that are logically equivalent but lexically distinct may receive a low ROUGE score, while
a response that merely copies phrases from the ground truth – without preserving the underlying
logic – may score highly. Embedding-based metrics such as cosine similarity offer a more flexi-
ble representation space, but they still struggle to reliably distinguish reasoning-valid outputs from
superficially similar yet logically flawed ones.
External dense reward models are infeasible for open-ended reasoning tasks. Leveraging a
dedicated reward model to provide dense feedback typically requires preference-style datasets com-
prising paired examples of preferred and dispreferred outputs – a resource that is often not organically available for
many open-ended tasks (Ethayarajh et al., 2024). Training such a reward model introduces addi-
tional computational and annotation costs, further limiting its practicality. More critically, reward
models are susceptible to reward hacking, where models exploit weaknesses in the learned reward
signal rather than genuinely improving reasoning quality (Silver et al., 2017; Shao et al., 2024).
LLM-as-a-judge is not a turnkey or reliable solution for reasoning reward signal. Recently,
LLMs have been increasingly adopted as automated evaluators in place of human judges (Gu et al.,
2024). However, multiple studies have raised concerns about their reliability, highlighting issues
such as sensitivity to prompt phrasing and evaluation rubrics, self-enhancement bias, susceptibility to reward
hacking, and the generator–discriminator gap (Sharma et al., 2024; Gu et al., 2024; Chen et al.,
2025c; Liu et al., 2025b; Chen et al., 2025b). Moreover, extracting a dense, task-specific reward
signal from LLM-as-a-judge remains particularly challenging (Liu et al., 2025b; Chen et al., 2025c).
This challenge is further compounded when aiming for a scalable and turnkey fine-tuning framework
across diverse tasks and datasets (Microsoft, 2024; Atreya, 2024; Xu et al., 2025b), as the LLM-as-
a-judge must be carefully tailored, validated, and maintained for each new use case (Liu et al.,
2025b).
Before addressing the challenges discussed above, it is important to understand how a reasoning-
capable LLM generates outputs in response to a question or prompt. The output of such a model
typically consists of two components: a CoT reasoning segment, followed by the final answer. Due
to the autoregressive nature of LLMs, the CoT reasoning acts as a latent prefix that conditions the
generation of the final answer (Chen et al., 2025d; 2024). In this formulation, the CoT reasoning
can be viewed as an implicit intermediate state that guides the model’s final outcome generation.
Specifically, the final answer is generated according to the conditional probability distribution π(· |
q, ĉ), where q denotes the input question or prompt, and ĉ is the CoT reasoning generated in response
to q. Intuitively, the quality of the reasoning trace directly influences the likelihood of producing a
correct answer – strong reasoning increases this likelihood, while flawed reasoning reduces it.
Building upon this property, we introduce a new reward signal – Reasoning Reflection Reward (R3)
– designed specifically for open-ended, particularly long-form, generation tasks. R3 is a token-
level dense reward signal that measures the consistency between the CoT reasoning generated by
the actor model and the reference outcome by placing special emphasis on the key tokens in the
reference that reflect the preceding CoT reasoning. We quantify this consistency by leveraging the
model’s own self-certainty (Gupta et al., 2024; Kauf et al., 2024) – specifically, the probabilistic
likelihood assigned by the LLM to the reference outcome y conditioned on the prompt q and its
generated CoT reasoning ĉ, i.e., π(y | q, ĉ). Intuitively, if the model’s reasoning ĉ is correct, the
model should assign a high likelihood to the reference outcome y. This likelihood thus serves as
a natural reward signal to assess the quality of the generated CoT reasoning. Moreover, since it
is grounded in golden answers rather than learned reward models, it offers greater reliability and
alignment with the target objective – making it a robust choice for RL training, as recommended
in state-of-the-art work (Shao et al., 2024; Silver et al., 2017). However, an important oversight in this
formulation – also overlooked in recent state-of-the-art work (Chen et al., 2024; Zhou et al., 2025;
Tang et al., 2025) – is the uniform treatment of all tokens in the reference outcome y. In practice, this
assumption can significantly undermine the effectiveness of the reward signal, and in some cases,
even introduce a reverse effect – particularly in long-form generation tasks. Next, we present two
key empirical observations that reveal why only a selective subset of reference tokens meaningfully
contributes to reasoning consistency.
Our objective is to assign higher advantage scores to higher-quality CoT traces, enabling a cleaner
signal in the optimization objective (Eq.1).
To evaluate whether this plain aggregate token-level probability reward effectively distinguishes
better CoT traces within a group, we conduct a case study using a representative example from
the ParaRev dataset. Specifically, we sample 16 outputs in response to a given prompt, where
each output consists of a CoT reasoning trace followed by an answer (i.e., a revised paragraph).
We then manually rank these outputs based on the quality of their final answers and CoT traces –
assessing how well they address the relevant reviewer comments from the prompt and align with the
reference revision. Fig. 1 presents two representative CoT reasoning samples from this set, arranged
in descending order of quality. The differences in quality are visibly substantial. For each CoT
sample, we show the corresponding advantage values computed using the aggregate conditional
log-probabilities over the reference tokens. Interestingly, the derived advantage values show only
weak correlation with the actual sample quality and, in the figure, even rank the lower-quality CoT trace
above the higher-quality one.
To understand this unexpected behavior, we closely examine the log-probability distributions over
the reference outcome shown in Fig. 1. Most tokens in the reference sequence receive similar log-
probability values, regardless of the preceding CoT reasoning. Only a small number of tokens –
three in this case – exhibit clear variation in likelihood depending on the prior CoT trace. These
reasoning-reflective tokens are the ones that truly encode the effect of the preceding reasoning on
the model’s certainty over the outcome. However, since these reflective tokens tend to have lower
log-probability values than the bulk of the reference tokens, their influence gets diluted when we
compute a sequence-wide aggregate log-probability. As a result, their contribution to the reward
for the CoT trace, and thus to the corresponding advantage value is effectively masked. This issue
becomes more pronounced when the number of reasoning-reflective tokens is small relative to the
total length of the reference outcome. This phenomenon, where critical token-level signals are
suppressed by sequence-wide aggregation, has also been observed in other contexts such as model
cascading and hallucination detection (Gupta et al., 2024; Chen et al., 2025a).
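A tiny numeric illustration of this dilution effect (synthetic numbers, not taken from the paper): with a long reference, a large swing on a single reasoning-reflective token barely moves the sequence-wide average log-probability.

```python
import numpy as np

base = np.full(200, -0.1)        # 200 "easy" reference tokens with similar log-probs
good, bad = base.copy(), base.copy()
good[50], bad[50] = -0.5, -4.0   # one reflective token: fairly confident vs. very uncertain
print(good.mean(), bad.mean())   # about -0.102 vs. -0.120: nearly indistinguishable
```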
Reasoning-reflective tokens can instead be identified by how much their likelihood varies when conditioned on different CoT traces. That is, in reasoning-conditioned log-probability estima-
tion, the tokens in the reference outcome that show substantial variability across a set of sampled
CoT traces are likely to reflect the influence of upstream reasoning. This comparative nature is also
emphasized in the GRPO paper in connection with preference-based reward modeling (Shao et al.,
2024). For example, in Fig. 1, we highlight three tokens from the reference outcome that exhibit
high standard deviation in their log-probabilities across 16 distinct CoT traces. These tokens are not
only statistically reflective of reasoning variation but also intuitively important upon qualitative in-
spection. In R3, we emphasize these reasoning-reflective tokens by weighting each reference token’s
log-probability contribution according to its standard deviation. Specifically, the CoT-conditioned
likelihood of the reference outcome is computed as $\sum_{j=1}^{|y|} w_\Delta(\sigma_j)\,\log \pi(y_j \mid q, \hat{c}_i, y_{<j})$, where
$w_\Delta(\sigma_j)$ assigns greater weight to tokens with higher standard deviation $\sigma_j$, thereby amplifying the
influence of reasoning-reflective tokens in the reward estimation.
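A minimal sketch of this variance-weighted scoring is shown below. It takes the [G, |y|] matrix of per-token log-probabilities (one row per sampled CoT trace, e.g., stacked from the earlier sketch); the exact form of w_Delta is our assumption (a normalized exponential of the per-token standard deviation), since the text only requires that it increase with sigma_j.

```python
import numpy as np

def r3_scores(logprobs, temperature=1.0):
    """logprobs: [G, |y|] log-probabilities of reference tokens under G CoT traces.
    Returns one R3 score per trace, emphasizing reasoning-reflective tokens."""
    sigma = logprobs.std(axis=0)            # per-token std across the G traces
    w = np.exp(sigma / temperature)
    w = w / w.sum()                         # weights concentrated on high-variance tokens
    return logprobs @ w                     # [G] weighted log-likelihood per CoT trace
```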
Next, we turn our attention to the second challenge: the tendency of reference tokens to compensate
for poor CoT reasoning. A natural idea is to propagate the self-certainty (i.e., token-level likelihood)
of all preceding reference tokens when computing the certainty of a given token. However, this ap-
proach is computationally prohibitive for long sequences and risks propagating misleading certainty
from unrelated tokens, potentially leading to underestimation of CoT quality. An alternative is to
apply a position-based discounting scheme – down-weighting the contribution of later tokens in
the reference outcome under the assumption that they benefit more from cumulative context. Yet
this strategy introduces a different failure mode: reasoning-reflective tokens that appear later in
the sequence may be unfairly penalized, while non-informative early tokens are disproportionately
emphasized.
To address these issues, we adopt a more targeted solution that centers around the reasoning-
reflective tokens. Our insight is that for poor CoT traces, a reasoning-reflective token is likely to
receive low model confidence (i.e., probability). When the reference sequence “corrects” this token
– by appending it to the context when computing the likelihood of later tokens – it begins to influence subsequent
tokens, effectively initiating a chain of error compensation. We leverage this observation by intro-
ducing controlled self-certainty propagation, which begins at reasoning-reflective tokens and decays
over a localized window of subsequent tokens. Formally, for each reasoning-reflective token at posi-
tion $k$, we define a propagation factor $P_k^{\mathrm{prop}}(j) = p_{\mathrm{RRT}_k} + \big(1 - p_{\mathrm{RRT}_k}\big)\big(1 - e^{-\gamma d}\big)$, where
$p_{\mathrm{RRT}_k}$ is the self-certainty (probability) of the $k$-th reflective token, $d$ is the distance from the reflective token to the
current token $j$, and $\gamma$ is a hyperparameter controlling the propagation decay from the $k$-th token. The
final reward formulation incorporates both variance-based token weighting and propagation-aware
correction: $\sum_{j=1}^{|y|} w_\Delta(\sigma_j)\,\log\!\big(\pi(y_j \mid q, \hat{c}_i, y_{<j}) \prod_{k<j} P_k^{\mathrm{prop}}(j)\big)$.
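The sketch below illustrates the propagation factor and the propagation-aware reward for a single CoT trace; the reflective-token positions, their self-certainties, gamma, and all names are illustrative inputs rather than the paper's implementation.

```python
import numpy as np

def propagation_factor(p_rrt, d, gamma):
    """P_k^prop(j) = p_RRT_k + (1 - p_RRT_k) * (1 - exp(-gamma * d))."""
    return p_rrt + (1.0 - p_rrt) * (1.0 - np.exp(-gamma * d))

def propagated_r3(logprobs_i, weights, rrt_positions, rrt_probs, gamma=0.5):
    """Reward for one trace: sum_j w_j * log( pi(y_j | ...) * prod_{k<j} P_k^prop(j) )."""
    reward = 0.0
    for j in range(len(logprobs_i)):
        log_correction = 0.0
        for k, p in zip(rrt_positions, rrt_probs):
            if k < j:  # only reflective tokens before position j propagate certainty
                log_correction += np.log(propagation_factor(p, j - k, gamma))
        reward += weights[j] * (logprobs_i[j] + log_correction)
    return reward
```

The double loop makes the cost grow with the number of reflective tokens, which is exactly the inefficiency the masked-CoT alternative below avoids.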
While the targeted decay-based propagation approach is effective when the number of reasoning-
reflective tokens is small, it becomes computationally expensive as their proportion increases
within the reference outcome. To address this, we propose a more efficient alternative for es-
timating the self-influence of reference tokens. Specifically, we compute the log-probabilities
of reference tokens conditioned on a masked CoT trace, which serves as a baseline estimate
of token-level influence originating from the reference itself. For instance, in the earlier foot-
ball example, the token “Messi” is still likely to receive a high probability due to the presence
of the preceding token “Lionel”, even when no reasoning is provided. By subtracting these
masked-CoT log-probabilities from those computed with the model-generated CoT, we isolate
the self-induced certainty boost by reference tokens. Then, the reward formulation becomes:
$\sum_{j=1}^{|y|} w_\Delta(\sigma_j)\,\big(\log \pi(y_j \mid q, \hat{c}_i, y_{<j}) - \log \pi(y_j \mid q, c_{\mathrm{masked}}, y_{<j})\big)$.
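A minimal sketch of this masked-CoT correction, reusing the same variance-based weights (names are ours):

```python
import numpy as np

def masked_corrected_r3(logprobs_with_cot, logprobs_masked_cot, weights):
    """sum_j w_j * ( log pi(y_j | q, c_hat, y_<j) - log pi(y_j | q, c_masked, y_<j) )."""
    delta = np.asarray(logprobs_with_cot) - np.asarray(logprobs_masked_cot)
    return float(np.dot(weights, delta))
```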
We now introduce Direct Reasoning Optimization (DRO), an RL-based fine-tuning framework that
employs R3 as its primary reward signal for guiding reasoning quality and performs dynamic data
filtering for open-ended reasoning tasks.
DRO builds on the GRPO framework described earlier, whose group-based sampling naturally aligns with the group-relative and comparative nature of our core reward, R3. Given a prompt q,
the actor policy πθ generates a group of outputs, each comprising a CoT trace ĉi followed by a
final outcome ŷi . We replace ŷi with the ground-truth reference outcome y to compute the R3i
score for each ĉi . To evaluate R3, we use an internal policy πrwd , instantiated in three variants: (1)
statically using the reference policy πref , (2) dynamically syncing with πθ , and (3) using a lagged
version of πθ . Since R3 only scores the reasoning trace and not the generated final outcome, we
observed that models tend to produce verbose completions, e.g., appending explanations at the end
of the revised paragraph in the ParaRev task. To mitigate this, we apply a length penalty solely on the
final outcome: $r_{\mathrm{len}}^{\beta}(\hat{y}, y) := 1 - \beta \cdot \big|\tfrac{|y| - |\hat{y}|}{|y|}\big|$, where $\beta$ controls the strength of the penalty. The
final reward is a weighted combination of R3i and the length penalty, which is used to compute the
advantage (Eq.2). This advantage is then used in the GRPO objective (Eq.1) to update the model
parameters.
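A minimal sketch of the length penalty and the combined reward follows; the mixing weight alpha is an assumption, as the text only states that a weighted combination of R3 and the length penalty is used.

```python
def length_reward(len_gen, len_ref, beta=0.1):
    """r_len^beta(y_hat, y) = 1 - beta * | (|y| - |y_hat|) / |y| | (lengths in tokens)."""
    return 1.0 - beta * abs(len_ref - len_gen) / len_ref

def combined_reward(r3_score, len_gen, len_ref, alpha=0.9, beta=0.1):
    """Weighted combination of the R3 score and the length penalty (alpha is assumed)."""
    return alpha * r3_score + (1.0 - alpha) * length_reward(len_gen, len_ref, beta)
```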
Recent work (Meta, 2025; Muennighoff et al., 2025; Jiang et al., 2025; Ye et al., 2025; Yang et al.,
2025; Costello et al., 2025) highlights the critical role of data filtering in reinforcement learning,
demonstrating its impact on both data efficiency and downstream task performance. These ap-
proaches typically rely on either LLM-as-a-judge frameworks or verifiable reward signals. How-
ever, in open-ended reasoning tasks where no reliable verifiers exist, such strategies are not appli-
cable. Moreover, using LLM-as-a-judge would require designing task and dataset-specific prompts,
compounding the complexity and inheriting the limitations discussed earlier. To address this, DRO
introduces a generic, dynamic data filtering mechanism tailored for open-ended reasoning tasks
leveraging R3, enhancing data efficiency during RL-based training without the need for manual
prompt engineering or external verification.
DRO performs data filtering at regular intervals throughout training, beginning with an initial filtering
round before the start of training. Each filtering round is guided by the current policy model (πθ )
and is conducted in two stages:
• Filtering Out Questions with Low Reasoning Variation: In the second stage, we filter out
questions that exhibit low variation in the reasoning space, which typically corresponds to overly
simple questions (assuming the previous stage has already removed most overly difficult ones).
We leverage the R3 scores computed in the prior step using the current policy πθ . Specifically,
for each prompt, we compute the maximum per-token standard deviation across N sampled CoT
traces: max(σj ). This value captures the highest degree of reasoning-induced variability in refer-
ence token predictions. We then rank all prompts in descending order of max(σj ) and remove a
proportion of the lowest-ranked samples. The cutoff is determined based on the available training
data size and the model’s capacity (a minimal sketch of this ranking step appears below).
In each round of filtering, we carry forward 10% of data from the previous training set.
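A minimal sketch of the ranking step used in this filtering stage (the keep ratio is an assumption; the paper derives the cutoff from the training-data size and model capacity):

```python
import numpy as np

def filter_low_variation(per_prompt_logprobs, keep_ratio=0.7):
    """per_prompt_logprobs: dict mapping each prompt to its [G, |y|] matrix of
    reference-token log-probabilities under the current policy. Keeps the prompts
    with the highest max per-token std, i.e., the most reasoning-reflective ones."""
    scores = {q: lp.std(axis=0).max() for q, lp in per_prompt_logprobs.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]
```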
5 EXPERIMENTS
Datasets. We use the following datasets in our experiments: (1) ParaRev (Jourdan et al., 2025):
This dataset contains over 48K original-revised paragraph pairs from scientific papers on OpenRe-
view, along with corresponding reviews. Since many papers undergo multiple revisions, we focus
on the initial revision, as it typically reflects the most substantial changes in response to reviewer
feedback. As ParaRev does not include the full paper context for each paragraph, which is crucial
for reasoning, we extend the dataset by locating the paragraphs in the raw papers and extracting their
preceding and following context from CASIMIR (Jourdan et al., 2024). This results in an adapted
dataset of 4.8K samples, and we follow a 95%/5% train-test split. (2) FinQA (Chen et al., 2021):
A dataset focused on numerical reasoning over financial data, comprising over 8K samples with
expert-written context, questions, reasoning programs, and answers. For our RL training, we use
only the context, questions, and answers, adhering to the original train-test split.
Training. We conduct DRO training on the DeepSeek-R1-Distill-Qwen-7B and 14B mod-
els. A learning rate of 1.0 × 10−6 is used with a warmup ratio of 0.2 and a “constant with warmup”
learning rate scheduler. During each training step, the actor model generates 16 responses per ques-
tion using a temperature of 1.0, top-p sampling with p = 0.95, a repetition penalty of 1.0, and a
maximum completion length of 10,000 tokens for FinQA and 8,000 tokens for ParaRev. We process
256 questions per step for FinQA and 128 for ParaRev. For GRPO optimization, we adopt the loss
function from Liu et al. (2025a), using scaled rewards, masking for truncated completions, and an
upper clipping coefficient of ϵhigh = 0.2. While prior studies typically set the KL regulariza-
tion weight β = 0, we empirically found β = 0.001 to improve training stability and convergence.
Training is conducted across three nodes, each with 8× NVIDIA A100 GPUs. We utilize Hug-
gingFace TRL for reinforcement learning, DeepSpeed for distributed training, and vLLM for rollout
generation and R3 computation.
Metrics. For the FinQA task, where answers are verifiable, we use numerical correctness with a 2%
tolerance. For the ParaRev task, we adopt pairwise win rate as the primary evaluation metric. To
compute win rates, we adapt the AlpacaEval prompt to the revision setting by providing the paper
context, reviewer comments, original paragraph, and reference revision for each sample. Our vali-
dation indicates that this prompt yields a 94.6% win rate for expert revisions over GPT-4o revisions,
demonstrating strong alignment with human preferences. The full prompt template is provided in
Appendix A. To mitigate potential self-enhancement bias (Zheng et al., 2023), we use both GPT-4o
and Claude 3.7 Sonnet as judges.
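For FinQA, the correctness check with a 2% tolerance can be sketched as follows (the handling of a zero reference value is our assumption):

```python
def finqa_correct(predicted: float, reference: float, tol: float = 0.02) -> bool:
    """Numerical correctness within a relative tolerance of 2%."""
    if reference == 0.0:
        return abs(predicted) <= tol
    return abs(predicted - reference) <= tol * abs(reference)
```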
Baselines. We mainly compare DRO with the following baselines in our evaluation: (1) Base Mod-
els: The off-the-shelf DeepSeek-R1-Distill-Qwen-7B (for FinQA) and 14B (for ParaRev)
models without RL on the specific tasks. (2) ROUGE (ParaRev): For ParaRev, although the out-
comes are not directly verifiable, we use ROUGE-1 F1 score (Lin, 2004) as the reward in GRPO to
represent RL with a standard automatic metric as a proxy verifier. (3) Correctness (FinQA): For
FinQA, where outputs are math-like and easily verifiable, we use binary correctness (within a 2%
tolerance) as the reward in GRPO to serve as an upper bound where ideal outcome verification is
feasible. (4) Aggregate: To assess the efficacy of R3, we include a set of baselines that use the
aggregate certainty across all tokens as the reward. As these baselines share the same training work-
flow as DRO, we denote them as DRO-Aggr. Specifically for ParaRev, we introduce DRO-Aggr-S
and DRO-Aggr-R to represent strict and relaxed length control, respectively, each using different β
in the length reward to study its impact. (5) GPT-4o: A strong baseline using a significantly larger
model.
5.2 RESULTS
5.2.1 PARAREV
DRO with R3 improves reasoning quality and alignment. As shown in Table 1, DRO-R3 achieves
higher win rates against GPT-4o than all other variants, outperforming the base model by 8.0%
(GPT judge) and 10.2% (Claude judge), and even surpassing GPT-4o itself despite being a much
smaller model. It also generates outputs with lengths closer to the reference revisions, indicating
Table 1: Win rate vs. GPT-4o (GPT and Claude judges) and output length on ParaRev.

Model                                        GPT Judge   Claude Judge   Length
DeepSeek-R1-Distill-Qwen-14B (ROUGE)             31.1        42.8          570
DeepSeek-R1-Distill-Qwen-14B (DRO-Aggr-S)        31.1        44.0          587
DeepSeek-R1-Distill-Qwen-14B (Base)              43.8        48.6         1095
DeepSeek-R1-Distill-Qwen-14B (DRO-Aggr-R)        47.9        51.0         1038
GPT-4o                                           (50)        (50)          889
DeepSeek-R1-Distill-Qwen-14B (DRO-R3)            51.8        58.8          743
Original Paragraph (No Revision)                 13.2        23.0          545
Reference Revision                               94.6       100.0          613
[Figure 3: ParaRev training dynamics. (a) Generation length (tokens) vs. training step for R3 vs. aggregate-certainty rewards; (b) ROUGE-L F1 score vs. training step; (c) R3 reward vs. wall-clock time (sec) with and without data filtering.]
more faithful and efficient edits. Given the known length bias in LLM-based evaluators (Zheng
et al., 2023), this improvement further reflects better alignment with human preference.
R3 outperforms ROUGE-based rewards. Compared to the ROUGE-rewarded baseline, R3 yields
a win rate improvement of 20.7% (GPT judge) and 16.0% (Claude judge). We observe that the
ROUGE-trained model frequently leaves the paragraph unchanged, likely due to the reward favoring
textual overlap, resulting in shorter outputs similar in length to the original paragraph. This behavior
harms revision quality.
R3 also outperforms aggregate-certainty rewards. Compared to aggregated certainty rewards, R3
leads to consistently higher win rates regardless of length control settings. Against the same base
model, DRO-R3 achieves up to a 4.25× improvement over DRO-Aggr-R, highlighting the impor-
tance of reasoning-reflective token weighting. Furthermore, strict length control (DRO-Aggr-S)
degrades performance, suggesting that rigid enforcement of output length may suppress effective
reasoning and degrade revision quality.
Training insights. (1) R3 stimulates longer reasoning generation: As shown in Figure 3a, R3
encourages the model to produce longer CoTs, with generation length growing steadily from 1k
to over 2.5k tokens during training. In contrast, aggregate-certainty rewards lead to early collapse
below 100 tokens, as the model learns to omit reasoning due to the misleading reward signal. (2)
Implicit improvement in textual similarity: Figure 3b shows that, despite ROUGE not being part
of the reward, DRO with R3 substantially improves ROUGE-L F1 from 0.4 to 0.7 in the early stage
of training, suggesting that optimization toward reasoning-reflective tokens also results in better
surface-level alignment. (3) Filtering accelerates and stabilizes training: As shown in Figure 3c,
on-the-fly data filtering in DRO reduces training time by 45% while achieving comparable final
reward scores and smoother convergence, demonstrating its efficiency and robustness.
5.3 FINQA
Table 2: Pass@k on FinQA.

Model                                       Pass@1   Pass@2   Pass@4   Pass@8   Pass@16
DeepSeek-R1-Distill-Qwen-7B (Base)            61.7     70.7     75.9     79.3      81.5
DeepSeek-R1-Distill-Qwen-7B (DRO-Aggr)        63.0     72.5     77.4     80.4      82.6
DeepSeek-R1-Distill-Qwen-7B (DRO-R3)          67.1     74.2     78.3     81.0      82.5
DeepSeek-R1-Distill-Qwen-7B (Correctness)     68.0     73.8     77.9     80.4      82.1
GPT-4o                                        69.5     73.9     76.2     78.1      79.3
[Figure 4: FinQA training dynamics. (a) R3 reward vs. training step; (b) standard deviation of R3 across sampled traces vs. training step; (c) generation length (tokens) vs. training step.]
DRO with R3 achieves performance comparable to correctness-based rewards. Specifically, as shown in Table 2, it falls only 0.9% short on Pass@1
but outperforms the correctness baseline on Pass@k for k ≥ 2. This result highlights that R3 can
match the benefits of correctness-based rewards without access to a reliable verifier, demonstrating
its potential for tasks where ideal outcome verification is difficult to obtain or not well-defined.
R3 outperforms aggregate-certainty rewards even in short-outcome tasks. Although FinQA
involves relatively short outputs where most tokens appear to contribute directly to the final answer,
R3 still outperforms the aggregate-certainty reward. Compared to the base model, DRO-R3 achieves
a 4.15× higher improvement than DRO-Aggr. This indicates that reasoning-reflective tokens are
not exclusive to long-form generation. For example, in math-like tasks, tokens such as the decimal
point “.” may reflect reasoning quality more than trailing digits.
Training insights. (1) Steady reward improvement and stabilization: As shown in Figures 4a
and 4b, DRO consistently improves the R3 reward while reducing its standard deviation across sam-
pled reasoning traces, indicating both stronger and more stable reward attribution over time. (2)
Emergence of longer reasoning: Generation length steadily increases from 1k to over 3k tokens
(Figure 4c). Interestingly, while the R3 improvement slows around step 6 (Figure 4a), the reasoning
length continues to grow almost linearly. This divergence suggests that as the reward signal begins
to saturate, the model continues to elaborate its reasoning, potentially exploring richer explanations
or extended self-reflection beyond what R3 explicitly rewards. This behavior remains effective, as
the R3 continues to improve gradually thereafter.
6 CONCLUSION
baselines. This work highlights the promise of self-supervised reward design in enabling scalable,
outcome-driven reasoning optimization for LLMs.
REFERENCES
Mohan Atreya. Fine-Tuning AI Models with Tuning-as-a-Service Platforms, 2024.
Gregor Bachmann and Vaishnavh Nagarajan. The pitfalls of next-token prediction. arXiv preprint
arXiv:2403.06963, 2024.
Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky
Ho, Phil Mui, Silvio Savarese, Caiming Xiong, et al. Language models are hidden reasoners:
Unlocking latent reasoning capabilities via self-rewarding. arXiv preprint arXiv:2411.04282,
2024.
Kedi Chen, Qin Chen, Jie Zhou, Xinqi Tao, Bowen Ding, Jingwen Xie, Mingchen Xie, Peilong
Li, and Zheng Feng. Enhancing uncertainty modeling with semantic graph for hallucination
detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp.
23586–23594, 2025a.
Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, and Bingsheng He.
Judgelrm: Large reasoning models as a judge. arXiv preprint arXiv:2504.00050, 2025b.
Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang,
Denghui Zhang, Tong Zhang, et al. Rm-r1: Reward modeling as reasoning. arXiv preprint
arXiv:2505.02387, 2025c.
Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman,
Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, et al. Reasoning models don’t always
say what they think. arXiv preprint arXiv:2505.05410, 2025d.
Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema
Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, et al. Finqa: A dataset of numerical
reasoning over financial data. arXiv preprint arXiv:2109.00122, 2021.
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep
reinforcement learning from human preferences. Advances in neural information processing sys-
tems, 30, 2017.
Caia Costello, Simon Guo, Anna Goldie, and Azalia Mirhoseini. Think, prune, train, improve:
Scaling reasoning without scaling models. arXiv preprint arXiv:2504.18116, 2025.
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model
alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.
Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Ying-
han Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. arXiv preprint
arXiv:2411.15594, 2024.
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu,
Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms
via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
Neha Gupta, Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Aditya Krishna
Menon, and Sanjiv Kumar. Language model cascades: Token-level uncertainty and beyond. arXiv
preprint arXiv:2404.10136, 2024.
Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum.
Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base
model. arXiv preprint arXiv:2503.24290, 2025.
Chenghua Huang, Zhizhen Fan, Lu Wang, Fangkai Yang, Pu Zhao, Zeqi Lin, Qingwei Lin, Dongmei
Zhang, Saravan Rajmohan, and Qi Zhang. Self-evolved reward learning for llms. arXiv preprint
arXiv:2411.00418, 2024.
Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey.
arXiv preprint arXiv:2212.10403, 2022.
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec
Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv
preprint arXiv:2412.16720, 2024.
Pengcheng Jiang, Xueqiang Xu, Jiacheng Lin, Jinfeng Xiao, Zifeng Wang, Jimeng Sun, and Ji-
awei Han. s3: You don’t need that much data to train a search agent via rl. arXiv preprint
arXiv:2505.14146, 2025.
Léane Jourdan, Florian Boudin, Nicolas Hernandez, and Richard Dufour. Casimir: A cor-
pus of scientific articles enhanced with multiple author-integrated revisions. arXiv preprint
arXiv:2403.00241, 2024.
Léane Jourdan, Nicolas Hernandez, Richard Dufour, Florian Boudin, and Akiko Aizawa. Pararev:
Building a dataset for scientific paragraph revision annotated with revision instruction. arXiv
preprint arXiv:2501.05222, 2025.
Carina Kauf, Emmanuele Chersoni, Alessandro Lenci, Evelina Fedorenko, and Anna A Ivanova.
Log probabilities are a reliable estimate of semantic plausibility in base and instruction-tuned
language models. arXiv preprint arXiv:2403.14859, 2024.
Kavukcuoglu, Koray. Gemini 2.5: Our most intelligent AI model. [Link]gemini-model-thinking-updates-march-2025/, 2025. Accessed: 2025-06-02.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large
language models are zero-shot reasoners. Advances in neural information processing systems,
35:22199–22213, 2022.
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brah-
man, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tülu 3: Pushing
frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.
Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Ren Lu, Thomas Mesnard, Johan Ferret,
Colton Bishop, Ethan Hall, Victor Carbune, and Abhinav Rastogi. Rlaif: Scaling reinforcement
learning from human feedback with ai feedback. 2023.
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan
Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth
International Conference on Learning Representations, 2023.
Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization
branches out, pp. 74–81, 2004.
Jiawei Liu and Lingming Zhang. Code-r1: Reproducing r1 for code with reliable rewards. arXiv
preprint arXiv:2503.18470, 3, 2025.
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee,
and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint
arXiv:2503.20783, 2025a.
Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu.
Inference-time scaling for generalist reward modeling. arXiv preprint arXiv:2504.02495, 2025b.
Xun Lu. Writing-zero: Bridge the gap between non-verifiable problems and verifiable rewards.
arXiv preprint arXiv:2506.00103, 2025.
Michael Luo, Sijun Tan, Roy Huang, Ameen Patel, Alpay Ariyak, Qingyang Wu, Xiaoxiang Shi,
Rachel Xin, Colin Cai, Maurice Weber, et al. Deepcoder: A fully open-source 14b coder at
o3-mini level, 2025. Notion Blog, 3(4):6.
Meta. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://[Link]/blog/llama-4-multimodal-intelligence/, 2025. Accessed: 2025-06-02.
Microsoft. Microsoft 365 Copilot Tuning overview (preview), 2024. URL [Link]copilot-tuning-overview.
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke
Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time
scaling. arXiv preprint arXiv:2501.19393, 2025.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang,
Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical
reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
Archit Sharma, Sedrick Scott Keh, Eric Mitchell, Chelsea Finn, Kushal Arora, and Thomas Kollar.
A critical evaluation of ai feedback for aligning large language models. Advances in Neural
Information Processing Systems, 37:29166–29190, 2024.
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez,
Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go
without human knowledge. nature, 550(7676):354–359, 2017.
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally
can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford,
Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances
in neural information processing systems, 33:3008–3021, 2020.
Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, and Dong Yu.
Crossing the reward bridge: Expanding rl with verifiable rewards across diverse domains. arXiv
preprint arXiv:2503.23829, 2025.
Yunhao Tang, Sid Wang, and Rémi Munos. Learning to chain-of-thought with jensen’s evidence
lower bound. arXiv preprint arXiv:2503.19618, 2025.
Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. A stitch in time saves
nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation.
arXiv preprint arXiv:2307.03987, 2023.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny
Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in
neural information processing systems, 35:24824–24837, 2022.
Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu,
and Zhiyong Wu. Genius: A generalizable and purely unsupervised self-training framework for
advanced reasoning. arXiv preprint arXiv:2504.08672, 2025a.
Yifei Xu, Tusher Chakraborty, Emre Kıcıman, Bibek Aryal, Eduardo Rodrigues, Srinagesh Sharma,
Roberto Estevao, Maria Angels de Luis Balaguer, Jessica Wolk, Rafael Padilha, et al. Rlthf:
Targeted human feedback for llm alignment. arXiv preprint arXiv:2502.13417, 2025b.
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu,
Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint
arXiv:2505.09388, 2025.
Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more
for reasoning. arXiv preprint arXiv:2502.03387, 2025.
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong
Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at
scale. arXiv preprint arXiv:2503.14476, 2025.
Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does re-
inforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv
preprint arXiv:2504.13837, 2025.
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with
reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Bo Wang, Shimin Li, Yunhua Zhou, Qipeng Guo,
Xuanjing Huang, and Xipeng Qiu. Scaling of search and learning: A roadmap to reproduce o1
from reinforcement learning perspective. arXiv preprint arXiv:2412.14135, 2024.
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in
large language models. arXiv preprint arXiv:2210.03493, 2022.
Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun
Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero
data. arXiv preprint arXiv:2505.03335, 2025a.
Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song. Learning to reason
without external rewards. arXiv preprint arXiv:2505.19590, 2025b.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang,
Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and
chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
Xiangxin Zhou, Zichen Liu, Anya Sims, Haonan Wang, Tianyu Pang, Chongxuan Li, Liang
Wang, Min Lin, and Chao Du. Reinforcing general reasoning without verifiers. arXiv preprint
arXiv:2505.21493, 2025.
Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu
Cui, Ning Ding, and Bowen Zhou. Ttrl: Test-time reinforcement learning. arXiv preprint
arXiv:2504.16084, 2025.
A ADAPTED PROMPT TEMPLATE FOR PARAREV WIN-RATE EVALUATION
<|im_start|>user
I want you to create a leaderboard of different large-language
models based on the quality of their revisions to a given
paragraph of a scientific paper. To do so, I will give you the
paper context, reviews, paragraph to revise, golden revision
written by human experts, and revisions output by the models.
Please rank the models based on which revision would align
better with the golden revision written by human experts. Note
that alignment should be evaluated based on how effectively the
concerns are addressed, rather than on textual similarity. All
inputs and outputs should be python dictionaries.
## Paper Context
{paper context}
## Reviews
{reviews}
## Paragraph to Revise
{paragraph to revise}
## Golden Revision
{golden revision}