
AirRAG: Activating Intrinsic Reasoning for Retrieval Augmented Generation via Tree-based Search

Wenfeng Feng*, Chuzhan Hao*, Yuewei Zhang, Jingyi Song & Hao Wang†
Alibaba Cloud, Alibaba Group
{wenfeng.fwf,haochuzhan.hcz,liyou.zyw,sjy267290}@alibaba-inc.com
[email protected]

* The first two authors contributed equally. † Corresponding author.

arXiv:2501.10053v1 [cs.AI] 17 Jan 2025

Abstract

Leveraging the autonomous decision-making capabilities of large language models (LLMs) demonstrates superior performance in reasoning tasks. Despite the successes of iterative or recursive retrieval-augmented generation (RAG), such methods are often trapped in a single solution space when confronted with complex tasks. In this paper, we propose a novel thinking pattern in RAG that integrates system analysis with efficient reasoning actions, significantly activating intrinsic reasoning capabilities and expanding the solution space of specific tasks via Monte Carlo Tree Search (MCTS), dubbed AirRAG. Specifically, our approach designs five fundamental reasoning actions that are expanded into a wide tree-based reasoning space using MCTS. The expansion also uses self-consistency verification to explore potential reasoning paths and implement inference scaling. In addition, computationally optimal strategies are used to apply more inference computation to key actions to achieve further performance improvements. Experimental results demonstrate the effectiveness of AirRAG through considerable performance gains over complex QA datasets. Furthermore, AirRAG is flexible and lightweight, making it easy to integrate with other advanced technologies.

Figure 1: Averaged performance comparison on the three datasets under different numbers of output sequences. Ltotal is the total amount of tokens consumed in the reasoning process. AirRAG leverages the diversity of generation and self-consistency verification to explore the potential solution space, which can effectively improve overall performance by scaling inference computation.

1 Introduction

Retrieval-Augmented Generation (RAG) can effectively alleviate the problem of generating factually incorrect content, which is crucial in domain-specific or knowledge-intensive tasks (Kandpal et al., 2023). However, as task complexity increases, further challenges arise, such as the inability to retrieve sufficient knowledge in a single query and the difficulty of understanding the complex reasoning logic in the question. Therefore, it has become an important research task to leverage the reasoning capabilities of LLMs to improve the performance of RAG (Jiang et al., 2023; Jeong et al., 2024; Asai et al., 2024; Yu et al., 2024).

Previous studies on complex query scenarios focus on optimizing the query and retrieval process to obtain effective information (Shi et al., 2023; Zhou et al., 2023; Gao et al., 2023; Jiang et al., 2023; Zheng et al., 2024). Although these methods can solve specific tasks efficiently, their performance depends heavily on manually crafted rules and prompt engineering to improve retrieval relevance. This lack of flexibility makes it difficult to adapt quickly to different scenarios. In addition, recursive retrieval is often used to improve the depth and relevance of search results in information retrieval tasks: the intermediate queries and retrieval results are continuously updated to satisfy the dynamic information needs that arise while solving a complex task (Jiang et al., 2023; Asai et al., 2024; Yan et al., 2024; Jeong et al., 2024). Recent work leverages the decision-making capabilities of LLMs to implement automated iterative retrieval and generation (Yu et al., 2024; Yue et al., 2024). However, these approaches have two main issues. First, it is difficult for modular RAG to cover the diverse types of questions; doing so involves substantial human effort and is extremely inefficient. The single solution space also makes it impossible to adequately activate the decision-making ability of LLMs. Second, the single reasoning paradigm and the chain-like reasoning process struggle to explore the solution space effectively during reasoning. The self-exploration process is susceptible to low-quality intermediate reasoning steps and gets trapped in a single solution space. It is also difficult to guide self-exploration effectively for relatively smaller language models (e.g., Qwen2.5-14B-Instruct (Yang et al., 2024)). Therefore, we should design rational reasoning actions to comprehensively explore a wide and deep solution space.

In response, we propose AirRAG to activate intrinsic reasoning capabilities and expand the solution space via MCTS. First, we design five fundamental reasoning actions: system analysis, direct answer, retrieval-answer, query transformation and summary-answer. This series of actions can address a variety of issues in different scenarios, such as complex problems requiring progressive or parallel queries. These actions can be performed efficiently even on relatively small language models. Additionally, we introduce MCTS and self-consistency to achieve controllable reasoning path generation and efficient inference scaling. To accurately select the answer from multiple reasoning paths, we combine the voting method with a process-supervised reward model. As the amount of inference computation increases, our approach achieves significant performance improvements, as shown in Figure 1. Moreover, AirRAG has a flexible architecture that can easily integrate other advanced methods as additional action branches. In summary, our main contributions are as follows:

• We design five fundamental reasoning actions that can solve most problem types in RAG scenarios and ensure the controllability of the reasoning process.

• We introduce MCTS and self-consistency to effectively expand the solution space of complex tasks. Generalization and performance are improved by implementing comprehensive inference scaling and a pluggable architecture.

• Experimental results show that AirRAG is superior to current iterative or recursive RAG. Our approach effectively activates the intrinsic reasoning capabilities of LLMs and expands the solution space to a controllable extent.

2 Related Work

Retrieval-Augmented Generation (RAG) RAG can significantly improve the performance of LLMs in knowledge-intensive tasks. Compared to vanilla RAG, optimizing the query and retrieval process can significantly enhance knowledge correlation and thus improve reasoning performance. Some methods use query expansion and transformation to obtain better results (Zhou et al., 2023; Ma et al., 2023; Gao et al., 2023). As the complexity of tasks escalates, it becomes more difficult to obtain sufficient knowledge in a single retrieval. Therefore, the idea of iterative retrieval is introduced to obtain additional contextual references. IRCoT (Trivedi et al., 2023) uses chain-of-thought to guide the retrieval process and refines the CoT with the obtained retrieval results. ITER-RETGEN (Shao et al., 2023) coordinates retrieval and generation modules to implement a sophisticated understanding of the specific task.

Activating the reasoning capabilities of LLMs in RAG Leveraging the decision-making capabilities of LLMs can enhance the efficiency and relevance of the information sourced (Nakano et al., 2022; Schick et al., 2023). Self-RAG and its variants (Asai et al., 2024; Yan et al., 2024; Jeong et al., 2024) adopt a self-reflection mechanism that iteratively predicts reflection tokens during training, which makes the LLM controllable during the inference phase. Auto-RAG (Yu et al., 2024) systematically plans retrievals and refines queries to acquire valuable knowledge through multi-turn iterations. IterDRAG (Yue et al., 2024) explores inference scaling strategies in RAG, which can enhance LLMs' ability to effectively acquire and utilize contextual information. However, these methods often struggle to explore the solution space effectively during reasoning. The self-exploration often gets trapped in a solution space with low-quality reasoning steps even after many attempts. This phenomenon often stems from the chain reasoning pattern and the difficulty of small-scale LLMs in tackling overly complex tasks in a single iteration.
Monte Carlo Tree Search (MCTS) Tree-based search algorithms, particularly MCTS, have emerged with remarkable capabilities to expand search spaces and enhance reasoning (Silver et al., 2017; Chen et al., 2024; Qi et al., 2024; Zhang et al., 2024). Many studies have shown that such search strategies can extend reasoning by using multiple branch queries to explore diverse reasoning paths (Yao et al., 2023; Besta et al., 2024). In the mathematical reasoning scenario, Zhang et al. (2024) and Chen et al. (2024) leverage MCTS to foster a more efficient exploration of solution spaces. Meanwhile, Qi et al. (2024) designs rich human-like reasoning actions for higher-quality reasoning trajectories. Recent research has shown that inference scaling (Yue et al., 2024) and self-consistency (Wang et al., 2023) can lead to substantial improvements. Therefore, our approach samples diverse paths in each expansion of the action space, achieving inference computation scaling, and performs self-consistency verification on all candidate solutions.

3 Methodology

In order to effectively explore the solution space during reasoning, we propose a controllable tree-based RAG framework. In this framework, the combination of MCTS and five reasoning actions results in an efficient and controllable expansion of the solution space. Meanwhile, we implement more comprehensive inference scaling strategies based on Yue et al. (2024) and use pruning and computationally optimal strategies to balance effectiveness and efficiency. The whole process is illustrated in Figure 2.

3.1 Define Fundamental Reasoning Actions

Using only the autonomy of LLMs to perform iterative self-exploration makes it easy to get trapped in a single solution space and makes it difficult to deal with different types of complex questions. IterDRAG (Yue et al., 2024) uses a single action type to generate the next reasoning step, which can easily lead to ineffective space exploration. The core of MCTS generation lies in the action space, which defines the scope of tree exploration. Therefore, it is essential to simplify human cognitive processes in complex reasoning (Jaffe et al., 2023). Inspired by this, we introduce five fundamental human-like reasoning actions to bridge the gap between LLM reasoning and human cognition in RAG scenarios.

• A1: System Analysis (SAY). This action analyzes the overall structure of the problem and then performs its decomposition or planning. This is global thinking before problem solving.

• A2: Direct Answer (DA). This action leverages the parametric knowledge of LLMs to answer questions directly, without being influenced by external knowledge.

• A3: Retrieval-Answer (RA). This action retrieves related knowledge from the external knowledge base to support subsequent reasoning.

• A4: Query Transformation (QT). This action transforms human questions in order to improve retrieval performance. It supports various transformations such as rewriting, step-back prompting, follow-up questions and multi-query retrieval.

• A5: Summary-Answer (SA). Combining the intermediate reasoning steps and answers with the initial question, the final answer is obtained by further refinement and summarization.

The above five actions define a highly diverse action space {A1, A2, A3, A4, A5}. At the first step, the initial state is denoted as s0, and MCTS selects the actions A1 and A2 to prompt the LLM to generate the next reasoning steps in parallel. Subsequent actions are performed sequentially to expand the reasoning path. Note that there are sequential dependencies between different actions; for example, A1 and A2 can only follow the root question. Meanwhile, we introduce the diverse sampling of self-consistency (Wang et al., 2023) for each action to expand the solution space and improve task performance. Specifically, an action is more likely to generate the correct reasoning step if we sample multiple times in the current state. Finally, we obtain multiple generated reasoning trajectories, such as [s0 ⊕ s1:n].

To further improve the efficiency of inference, we can choose {A3, A4, A5} as a simplified action space, which achieves a better balance between efficiency and effectiveness; this variant is referred to as AirRAG-Lite.
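The following is a minimal sketch of how the action space and its sequential dependencies could be represented. The class, the transition table, and the helper are illustrative assumptions for exposition, not the authors' released implementation.

```python
from enum import Enum

class Action(Enum):
    SAY = "system_analysis"      # A1: analyze and decompose the question
    DA = "direct_answer"         # A2: answer from parametric knowledge only
    RA = "retrieval_answer"      # A3: retrieve external knowledge, then answer
    QT = "query_transformation"  # A4: rewrite / step-back / follow-up queries
    SA = "summary_answer"        # A5: refine the final answer from all steps

ROOT = None  # the root node holds only the user question

# Assumed sequential dependencies: which actions may follow the current one.
# SAY and DA are only valid directly after the root question.
ALLOWED_NEXT = {
    ROOT: [Action.SAY, Action.DA, Action.RA, Action.QT],
    Action.SAY: [Action.RA, Action.QT, Action.SA],
    Action.RA: [Action.QT, Action.SA],
    Action.QT: [Action.RA, Action.SA],
    Action.DA: [Action.SA],
    Action.SA: [],  # SA is terminal
}

# AirRAG-Lite restricts expansion to the simplified action space {A3, A4, A5}.
LITE_ACTIONS = {Action.RA, Action.QT, Action.SA}

def legal_actions(last_action, lite: bool = False):
    """Return the actions that may expand the current state."""
    candidates = ALLOWED_NEXT[last_action]
    return [a for a in candidates if not lite or a in LITE_ACTIONS]
```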
3.2 Perform Reasoning Processes via MCTS
Figure 2: The schematic diagram of our proposed AirRAG. AirRAG implements a paradigm that combines systems thinking with step-by-step reasoning. In the inference phase, we introduce MCTS and self-consistency to scale computation, which significantly outperforms other strong baselines.

3.2.1 Solution Generation

Based on the action space defined above, we introduce MCTS to generate candidate reasoning trajectories. The initial root node s0 is the question without any reasoning steps. The policy can be directly modeled by a language model as π(a|s) = LM(a|s), and the state transition function is the combination of the preceding reasoning steps and the current action, i.e., s_i = Concat(s_{0:i-1}, a_j). In each MCTS rollout, we perform multiple steps, including selection, expansion, simulation, and backpropagation, and we perform multiple rollouts to expand the solution space. To balance exploration and exploitation, we use the well-known Upper Confidence Bounds applied to Trees (UCT) (Kocsis and Szepesvári, 2006) for node selection:

    UCT(s, p) = \frac{Q(s, a)}{N(s)} + w \sqrt{\frac{\log N(p)}{N(s)}},    (1)

where Q(s, a) is the reward value for node s and is updated through backpropagation, N(s) is the number of visits to s, p is the parent node of s, and w is the weight that balances exploration and exploitation.

Once the search reaches a terminal node, either a terminal state or the predetermined maximum tree depth d, we obtain a trajectory from the root to the terminal node. We collect all trajectories from the rollout iterations as candidate solutions. Section 3.3 explains how we select the optimal answer node from them.
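As a concrete illustration of Eq. (1), the sketch below shows one way the selection and backpropagation steps could be implemented. The node structure and function names are assumptions made for illustration, not the paper's code.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str                      # concatenated reasoning steps s_{0:i}
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0                 # N(s)
    reward: float = 0.0             # accumulated reward Q(s, a)

def uct_score(node: Node, w: float = 1.4) -> float:
    """Eq. (1): exploitation term plus w times the exploration term."""
    if node.visits == 0:
        return float("inf")         # always try unvisited children first
    exploit = node.reward / node.visits
    explore = math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + w * explore

def select(root: Node) -> Node:
    """Descend from the root, always following the child with the highest UCT."""
    node = root
    while node.children:
        node = max(node.children, key=uct_score)
    return node

def backpropagate(node: Node, value: float) -> None:
    """Update visit counts and rewards along the path back to the root."""
    while node is not None:
        node.visits += 1
        node.reward += value
        node = node.parent
```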
3.2.2 Inference Scaling

Many studies have shown that scaling the inference computation can lead to substantial improvements in LLM performance without training (Snell et al., 2024; Yue et al., 2024). Building on these methods, we introduce a wider range of strategies to explore how AirRAG benefits from the scaling of inference computation. A straightforward strategy is to extend the effective context length (Lmax for short) during the document retrieval phase, so that more related documents are provided to supplement the knowledge. Moreover, we perform multiple rollouts to fully explore the solution space via the tree-based search. The number of output sequences (n for short) generated by certain actions can also be adjusted to achieve self-consistency verification and inference computation scaling. All in all, these strategies provide more comprehensive flexibility in scaling inference computation for RAG, allowing LLMs to address complex knowledge-intensive queries more effectively.

To improve efficiency and reduce redundant inference computation, we implement an early pruning strategy for state nodes and reasoning paths. Deduplication is applied to the output-sequence states generated by each action, ensuring the diversity of subsequent paths. Additionally, if multiple rollouts select the same state sequence, we save only one valid reasoning path.
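A minimal sketch of this deduplication, assuming states are plain strings and trajectories are lists of states (both assumptions for illustration):

```python
def dedup_expansions(candidates: list[str]) -> list[str]:
    """Drop duplicate output sequences generated by one action so that
    the children of a node stay diverse."""
    seen, unique = set(), []
    for state in candidates:
        key = state.strip()
        if key not in seen:
            seen.add(key)
            unique.append(state)
    return unique

def dedup_trajectories(trajectories: list[list[str]]) -> list[list[str]]:
    """Keep only one valid reasoning path when several rollouts
    end up selecting the same state sequence."""
    seen, unique = set(), []
    for path in trajectories:
        key = tuple(s.strip() for s in path)
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique
```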
3.2.3 Flexible Architecture

Our tree-based architecture provides the flexibility to integrate other advanced approaches. We reproduce IterDRAG following the prompt of Yue et al. (2024). Meanwhile, inspired by its iterative implementation, we simplify the fundamental action space to {A3, A4, A5}, which can be implemented quickly and achieves relatively good results. These methods can be used as exploratory branches of our approach and can be turned on or off flexibly.

3.3 Select the Optimal Answer Node

For some common mathematical reasoning tasks, a simple consistency-based method suffices to select the most precise reasoning path (e.g., selecting the most frequent number extracted from multiple candidate solutions in MATH (Hendrycks et al., 2021) as the final answer). However, it is difficult to extract precise answers and perform effective aggregation for knowledge-intensive tasks. Thus, we design two self-consistency verification methods for such problems. Jaccard similarity and text embeddings are two different approaches used in natural language processing to measure the similarity between texts. We use these methods to implement the clustering of text answers. Each answer score is computed by

    jcdScore_i = \frac{1}{N} \sum_{j=1}^{N} \frac{|A_i \cap A_j|}{|A_i \cup A_j|},    (2)

    embScore_i = \frac{1}{N} \sum_{j=1}^{N} \cos(E_i, E_j),    (3)

where N is the number of valid answer nodes, A_i is the word-level set of answer text i, and E_i denotes the embedding vector of answer text i.
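The following sketch shows how Eqs. (2) and (3) could be computed. The whitespace tokenizer and the assumption that answer embeddings are already available as a matrix are simplifications for illustration.

```python
import numpy as np

def jaccard_scores(answers: list[str]) -> list[float]:
    """Eq. (2): average word-level Jaccard similarity of each answer
    against all candidate answers (including itself)."""
    word_sets = [set(a.lower().split()) for a in answers]
    n = len(word_sets)
    scores = []
    for a_i in word_sets:
        total = sum(len(a_i & a_j) / len(a_i | a_j)
                    for a_j in word_sets if a_i | a_j)
        scores.append(total / n)
    return scores

def embedding_scores(embeddings: np.ndarray) -> list[float]:
    """Eq. (3): average cosine similarity of each answer embedding
    against all candidate embeddings; `embeddings` has shape (N, d)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T            # pairwise cosine similarities
    return sims.mean(axis=1).tolist()

# The answer that agrees most with the other candidates wins, e.g.:
# best = answers[int(np.argmax(jaccard_scores(answers)))]
```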
In addition, we further investigate self-refine and a process-supervised reward model to identify the most accurate reasoning trajectory. Self-refine uses the SA action to refine the final answer from all candidate answer nodes. The reward modeling process consists of data synthesis and instruction tuning. During data synthesis, we leverage MCTS to perform multiple rollouts on partial training sets. According to the known ground truth, we sample positive and negative reasoning trajectories and use Monte Carlo estimation to evaluate the intermediate state scores. In the instruction tuning phase, we utilize the synthetic samples to fine-tune a relatively small LLM (i.e., Qwen2.5-14B-Instruct).

4 Experiments

In this section, we conduct experiments on complex QA benchmarks to answer the following research questions.

• RQ1: Does AirRAG outperform state-of-the-art baselines?

• RQ2: How does AirRAG perform when it comes to comprehensive inference scaling?

• RQ3: What is the performance benefit of AirRAG in optimizing the allocation of inference computation?

• RQ4: How does AirRAG perform with various verification methods for multiple candidate rollouts?

• RQ5: What is the intuitive performance of AirRAG in the reasoning process?

4.1 Experimental Settings

4.1.1 Datasets

To evaluate the effectiveness of AirRAG, we conduct experiments on various question-answering (QA) tasks, including both open-domain QA and multi-hop QA. The complex multi-hop QA datasets consist of HotpotQA (Yang et al., 2018), MuSiQue (Trivedi et al., 2022) and 2WikiMultiHopQA (2Wiki) (Ho et al., 2020). The single-hop QA datasets include Natural Questions (NQ) (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), PopQA (Mallen et al., 2023) and WebQA (Berant et al., 2013).

4.1.2 Implementation Details

We use the hyperparameters reported for the existing models whenever available. Implementation details are given in Appendix A.

4.1.3 Baselines and Metrics

To investigate the enhancement effects of thinking and planning on complex RAG tasks, we compare AirRAG with vanilla RAG, which performs only a single retrieval and generation process. We evaluate the naive generators Qwen2.5-14B-Instruct (Yang et al., 2024) and Llama-3-8B-Instruct (Grattafiori et al., 2024). In the retrieval phase, we employ multilingual-e5-base (Wang et al., 2024) as the retriever. The prompt of vanilla RAG is shown in Appendix C. For iterative retrieval, we compare AirRAG with Iter-RetGen (Shao et al., 2023), Self-RAG (Asai et al., 2024), Auto-RAG (Yu et al., 2024), and IterDRAG (Yue et al., 2024).
Lmax    Method            HotpotQA        MuSiQue         2Wiki           Average
                          F1      Acc     F1      Acc     F1      Acc     F1      Acc
8k      ZeroShot QA       42.5    41.3    13.5    12.1    48.2    47.3    34.7    33.6
        Vanilla RAG       70.3    65.4    23.0    17.7    55.8    53.4    49.7    45.5
        IterDRAG*         74.3    69.1    26.7    19.4    60.5    57.6    53.8    48.7
        AirRAG-Lite       80.6    75.4    35.4    28.9    75.3    73.1    63.8    59.1
        AirRAG            79.6    75.2    41.0    35.0    76.0    74.2    65.6    61.5
        AirRAG-Blender    81.1    79.8    41.6    36.4    82.2    81.7    68.3    66.0
32k     Vanilla RAG       77.1    72.0    29.0    22.9    60.9    58.1    55.7    51.0
        IterDRAG*         77.7    71.6    30.8    22.3    63.0    60.2    57.1    51.4
        AirRAG-Lite       82.4    76.9    36.7    30.1    78.8    76.8    66.0    61.3
        AirRAG            82.5    77.4    43.2    36.3    80.4    78.9    68.7    64.2
        AirRAG-Blender    82.9    80.6    43.3    37.6    83.4    83.0    69.9    67.1
128k    IterDRAG*         76.8    71.0    31.7    24.8    65.5    62.4    58.0    52.7
        AirRAG-Lite       82.5    77.1    35.7    30.4    78.3    76.0    65.5    62.2
        AirRAG            83.3    78.0    43.5    36.5    82.3    80.5    69.7    65.0
        AirRAG-Blender    83.7    81.4    43.9    38.5    84.4    84.2    70.6    68.0

Table 1: Overall evaluation results on the test sets of three datasets. * indicates results reproduced by us. Lmax denotes the maximum number of input tokens across all rollouts. The best results for each Lmax are in bold. The number of both rollouts and output sequences is set to 1.

To further explore RAG performance and inference computation scaling, we focus on a comparison with IterDRAG for a given budget on inference computation. For evaluation metrics, we report Exact Match (EM), F1 score (F1) and Accuracy (Acc) between the generated summary and the gold answer, where accuracy measures whether the gold answer is covered in the generated answer.
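For concreteness, here is a minimal sketch of the three metrics as described above; the string normalization is kept deliberately simple, and the exact normalization used in the paper is not specified here, so treat it as an assumption.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def accuracy(pred: str, gold: str) -> float:
    """Acc: whether the gold answer is covered in the generated answer."""
    return float(normalize(gold) in normalize(pred))

def f1(pred: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```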

4.2 Main Results (RQ1)

We first evaluate the performance of AirRAG on various complex QA datasets. Table 1 compares its accuracy and F1 with strong baselines under a given inference computation budget, implemented with Qwen2.5-14B-Instruct and a one-million-document database. The optimal performance exhibits consistent gains as Lmax expands, which is termed the inference scaling laws for RAG (Yue et al., 2024). We also integrate the remaining methods under a given maximum computational budget into our approach, dubbed AirRAG-Blender; the best results are obtained by using only the SA action to refine the final answer from all candidates, as shown in Table 1, which also demonstrates the flexibility of our architecture. In addition, to verify the robustness and generalization of AirRAG, Table 3 shows the performance on more diverse LLMs and datasets. For a fair comparison, we utilize the widely used Wikipedia dump from December 2018 (Karpukhin et al., 2020) as the retrieval corpus. We observe consistent improvements over vanilla RAG and existing iterative methods (more than 10% on average). The significant boost over IterDRAG and Auto-RAG suggests that AirRAG explores more effective reasoning paths through its human-like thinking paradigm and tree-based search.

Method                                        Average
                                              F1      Acc
Vanilla RAG                                   47.0    43.2
IterDRAG                                      49.8    45.9
AirRAG
  + n_all=1                                   62.9    58.8
  + n_all=3                                   63.4    62.1
  + n_{a1,a4}=3, n_{a2,a3,a5}=1               63.2    62.0
  + n_{a1,a4}=3, n_{a2,a3,a5}=1, q_div=1.0    65.1    63.9

Table 2: Performance comparison with different computationally optimal strategies. n_ai denotes the number of output sequences of the reasoning action ai in a single extension. q_div indicates setting top-p to 1.0 and temperature to 1.0 for the query-related actions, i.e., SAY and QT, to increase the diversity of reasoning. The default sampling parameters top-p, top-k and temperature are set to 0.8, 50 and 0.7 respectively. Reasonable sampling strategies and inference scaling further improve performance across multiple datasets.
Figure 3: Impact of the retrieved document number scaling (Left) and the maximum context length scaling (Right) on model performance (averaged Accuracy and Recall over the three datasets). All methods show consistent performance improvements as the effective inference computation scales.

Figure 4: Left: Performance comparison under different sizes of the document database. A streamlined database can maintain better performance. Right: Performance comparison when increasing the number of valid rollouts. Sampling a higher number of diverse reasoning paths consistently improves accuracy.

4.3 Inference Scaling for RAG (RQ2)

Inference computation scaling can enable LLMs to improve their output performance (Snell et al., 2024). Self-consistency can also improve the robustness of the reasoning process (Wang et al., 2023). Therefore, we carry out a comprehensive experimental analysis of inference computation scaling. With a tree-based search in the RAG scenario, there are multiple ways to allocate inference computation resources. Specifically, we can adjust both the number of retrieved documents in a single retrieval and the effective context length across all iterations. The average performance on the three datasets exhibits consistent gains in Figure 3. In subsequent experiments, unless otherwise specified, the data presented represent the average performance across the HotpotQA, MuSiQue, and 2Wiki datasets. In particular, the initial computation scaling brings significant performance improvements. In addition, the number of output sequences and rollouts in MCTS can expand the solution space and explore potential reasoning paths. As shown in Figure 1, the average performance increases with the number of output sequences per action, demonstrating the effectiveness of self-consistency. We also investigate the number of effective reasoning paths under different rollouts in Figure 4; the increase in effective reasoning paths in the early stage likewise leads to relatively higher performance gains. We provide additional dataset-specific results in Appendix B.

4.4 Ablation Studies

Effect of Computationally Optimal Strategies (RQ3). Extensive experiments show that the outputs of certain actions (e.g., RA, DA and SA) are almost identical across multiple generations. Therefore, we only increase the number of output sequences (n for short) for the remaining actions (e.g., SAY and QT), which reduces invalid inference computation while maintaining good results. This also reflects that this kind of reasoning action, which can effectively activate the creativity of LLMs and expand the solution space, requires more diversified sampling strategies. We adjust the sampling parameters (top-p=1.0 and temperature=1.0) to improve the diversity of the model output. The complete experimental results in Table 2 show that the diversity of key actions can significantly improve performance.

From the aforementioned experiments, we observe that the recall and accuracy of the model are linearly correlated. Intuitively, the size of the document database is also related to the recall score. By reducing the scale of the document database, we find a gradual improvement in model performance (shown in Figure 4). This observation provides experimental evidence for effective database partitioning in practical applications.
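As a concrete illustration of this computationally optimal allocation, the per-action sampling configuration could look like the sketch below. Plain dictionaries are used because the serving framework and parameter names used by the authors are not specified, so treat the structure as an assumption; the values come from Table 2 and the text above.

```python
# Default decoding parameters (top-p=0.8, top-k=50, temperature=0.7).
DEFAULT_SAMPLING = {"n": 1, "top_p": 0.8, "top_k": 50, "temperature": 0.7}

# Query-related actions (SAY, QT) benefit from diversity, so they get more
# output sequences and fully stochastic sampling (q_div = 1.0 in Table 2).
DIVERSE_SAMPLING = {"n": 3, "top_p": 1.0, "top_k": 50, "temperature": 1.0}

ACTION_SAMPLING = {
    "SAY": DIVERSE_SAMPLING,
    "QT": DIVERSE_SAMPLING,
    # RA, DA and SA produce nearly identical outputs across samples,
    # so a single sequence is enough and saves inference computation.
    "RA": DEFAULT_SAMPLING,
    "DA": DEFAULT_SAMPLING,
    "SA": DEFAULT_SAMPLING,
}

def sampling_params(action: str) -> dict:
    """Look up the decoding parameters for a reasoning action."""
    return ACTION_SAMPLING.get(action, DEFAULT_SAMPLING)
```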
Method          NQ      TriviaQA    PopQA   WebQA   HotpotQA    2Wiki
                EM      EM          F1      EM      F1          F1
Vanilla RAG     35.1    58.8        36.7    15.7    35.3        21.0
Self-RAG        36.4    38.2        32.7    21.9    29.6        25.1
Iter-RetGen     36.8    60.1        37.9    18.2    38.3        21.6
Auto-RAG        37.9    60.9        47.8    25.1    44.9        48.9
AirRAG          53.6    63.2        51.8    52.6    67.6        66.3

Table 3: Performance comparison on six benchmarks, where Llama3-8B-Instruct is used as the generator LLM. Partial experimental results are quoted from Jin et al. (2024) and Yu et al. (2024). The best results are in bold. The number of both rollouts and output sequences is set to 1. The number of documents for a single retrieval is set to 5. Our proposed AirRAG significantly outperforms the others.

Figure 5: Performance comparison of different verification methods. "QwenRM" is short for the reward model trained on the Qwen model. "SA" is the summary-and-answer reasoning action. "SC-emb/jcd" are the two self-consistency verification methods based on text embeddings and Jaccard similarity. "Average" is the average score over all candidate rollouts. A single retrieval is set to retrieve three documents (Left) or a fixed 8k context (Right).

Effect of Verification Methods (RQ4). The larger search space also generates more candidate reasoning trajectories, so selecting the optimal trajectory is crucial for the final performance. We compare multiple verification methods with the average score of all candidates in Figure 5. The two self-consistency verification methods are always slightly better than the average score, but they are not nearly as good as the SA and QwenRM methods. The SA method uses the LLM to further refine the final answer from all candidate rollouts, which is simple and effective. Finally, the reward model achieves the most competitive results due to the introduction of supervised information on key intermediate reasoning steps during training. However, collecting process-supervised training samples requires high computational cost and high-quality raw data. In practical application scenarios, the appropriate method can be chosen by balancing efficiency and effectiveness.

4.5 Qualitative Analysis (RQ5)

To make it easier to understand why our proposed AirRAG works, we present a qualitative analysis on MuSiQue. Existing iterative methods are often trapped in a single solution space when confronted with complex tasks. As illustrated in Figure 13, these iterative methods exhibit a key limitation: insufficient or ambiguous retrieval context can lead to repetitive follow-up queries until the predefined maximum depth of iterations is reached. This inefficient iteration results in high computational cost and an incorrect answer. In contrast, our proposed AirRAG designs efficient reasoning actions to activate the intrinsic reasoning capabilities of LLMs. As shown in Figure 14, the SAY action decomposes the original query into a more rational sequence of sub-queries, and the combination of RA and QT then ensures the accuracy of the intermediate reasoning steps. We eventually leverage the efficient reasoning trajectory to obtain the correct answer.

5 Conclusions

In this paper, we propose AirRAG, a novel RAG approach to activate the intrinsic reasoning capabilities of LLMs. AirRAG designs an efficient action space for controllable reasoning generation. We also introduce Monte Carlo Tree Search to expand the solution space. Meanwhile, by employing tree-based search and self-consistency verification, we explore potential reasoning paths and achieve comprehensive inference computation scaling in RAG. In addition, computationally optimal strategies are used to apply more computation to key actions, leading to further performance improvements. Experimental results on diverse QA datasets demonstrate the significant superiority of AirRAG over other methods designed for complex RAG scenarios. Furthermore, AirRAG can be integrated with other advanced methods with great flexibility.
Limitations

Although our model achieves competitive performance in various RAG tasks, there remain limitations that can be improved. The current optimal computation allocation strategy is derived from extensive experiments; we could instead design an automated policy model to implement the trade-off between computational cost and performance. Despite great efforts on the inference scaling of RAG, the experimental analysis may be limited due to the massive computational cost of tree-based search approaches. We will explore more complex reasoning tasks to verify the robustness and effectiveness of our approach. In addition, the large search space also brings more noisy information, so we will further investigate the reward model or strategy to explore a better reasoning path.

References

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544.

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michał Podstawski, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of Thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690.

Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. 2024. AlphaMath almost zero: Process supervision without process. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.

Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2023. Precise zero-shot dense retrieval without relevance labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1762–1777.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, et al. 2024. The Llama 3 herd of models. Preprint, arXiv:2407.21783.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625.

Paul I. Jaffe, Russell A. Poldrack, Robert J. Schafer, and Patrick G. Bissett. 2023. Modelling human behaviour in cognitive tasks with latent dynamical systems. Nature Human Behaviour, 7(6):986–1000.

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong Park. 2024. Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7036–7050.

Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7969–7992.

Jiajie Jin, Yutao Zhu, Xinyu Yang, Chenghao Zhang, and Zhicheng Dou. 2024. FlashRAG: A modular toolkit for efficient retrieval-augmented generation research. CoRR, abs/2405.13576.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611.

Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. 2023. Large language models struggle to learn long-tail knowledge. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 15696–15707. PMLR.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781.

Levente Kocsis and Csaba Szepesvári. 2006. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, pages 282–293. Springer.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.

Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. 2023. Query rewriting in retrieval-augmented large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5303–5315.

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802–9822.

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2022. WebGPT: Browser-assisted question-answering with human feedback. Preprint, arXiv:2112.09332.

Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, and Mao Yang. 2024. Mutual reasoning makes smaller LLMs stronger problem-solvers. Preprint, arXiv:2408.06195.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. In Thirty-seventh Conference on Neural Information Processing Systems.

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9248–9274.

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 31210–31227. PMLR.

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. Preprint, arXiv:2408.03314.

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. MuSiQue: Multi-hop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554.

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10014–10037.

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. Multilingual E5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.

Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. 2024. Corrective retrieval augmented generation. arXiv preprint arXiv:2401.15884.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R. Narasimhan. 2023. Tree of Thoughts: Deliberate problem solving with large language models. In Thirty-seventh Conference on Neural Information Processing Systems.

Tian Yu, Shaolei Zhang, and Yang Feng. 2024. Auto-RAG: Autonomous retrieval-augmented generation for large language models. Preprint, arXiv:2411.19443.

Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, and Michael Bendersky. 2024. Inference scaling for long-context retrieval augmented generation. Preprint, arXiv:2410.04343.

Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, and Dongzhan Zhou. 2024. LLaMA-Berry: Pairwise optimization for O1-like olympiad-level mathematical reasoning. Preprint, arXiv:2410.02884.

Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2024. Take a step back: Evoking reasoning via abstraction in large language models. In The Twelfth International Conference on Learning Representations.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. 2023. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations.
A Implementation Details

For evaluation, we randomly select 1,000 samples from the whole validation set of each dataset as our final test set, with a fixed random seed of 0. To better understand the complexity of multi-hop reasoning in these datasets, we analyze the hop distribution of the HotpotQA, MuSiQue, and 2WikiMultiHopQA test sets in Figure 6. The statistics show that there is a high proportion of complex reasoning queries with 3 hops or more (about 30%, 50% and 25%, respectively). HotpotQA lacks explicit hop annotations, so we instead count the number of supporting facts. MuSiQue has a significantly higher proportion of 3-hop and 4-hop queries than the other datasets, indicating greater reasoning complexity. This observation is further corroborated by our experimental results in Table 1 and Figure 7: the performance of our approach on MuSiQue is much lower than on the other two datasets.

In the retrieval process, we employ multilingual-e5-base (Wang et al., 2024) as the retriever and use the widely used Wikipedia dump from December 2018 as the retrieval corpus (Karpukhin et al., 2020), which comprises over 21 million passages. For generation, the default sampling parameters top-p, top-k and temperature are set to 0.8, 50 and 0.7 respectively. Evaluation metrics include Exact Match (EM), F1 score (F1), and Accuracy (Acc), where accuracy indicates whether the ground truth is a substring of the final generated answer. For reward model training, we sample 8,000 question-answer pairs from each dataset and generate more than 156,000 reasoning paths using our proposed AirRAG (rollouts=32, n=4, q_div=1.0). In inference scaling experiments, we sample maximum computation budgets Lmax (e.g., 8k, 16k, 32k, 64k and 128k tokens); Lmax (the maximum effective context length) denotes the maximum number of input tokens across all rollouts, following Yue et al. (2024). The predetermined maximum tree depth d is set to 10, specifically indicating that the SAY and SA actions are executed once, while the RA-QT or QT-RA actions have a maximum of 4 iterations.
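For reference, a minimal sketch of dense retrieval with this retriever via the sentence-transformers library is shown below. The toy corpus, the top-k value, and the helper name are illustrative only; the "query: " / "passage: " prefixes follow the usage convention of the E5 model family rather than anything specified in this paper.

```python
from sentence_transformers import SentenceTransformer, util

# multilingual-e5-base expects "query: " / "passage: " prefixes on its inputs.
retriever = SentenceTransformer("intfloat/multilingual-e5-base")

corpus = ["Bartram's Covered Bridge carries Goshen Road over Crum Creek.",
          "Crum Creek flows into the Delaware River in Eddystone, Pennsylvania."]
corpus_emb = retriever.encode(["passage: " + p for p in corpus],
                              normalize_embeddings=True)

def retrieve(query: str, top_k: int = 5):
    """Return the top-k passages ranked by cosine similarity."""
    query_emb = retriever.encode("query: " + query, normalize_embeddings=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]
    ranked = scores.argsort(descending=True)[:top_k]
    return [(corpus[i], float(scores[i])) for i in ranked]

print(retrieve("What is the mouth of Crum Creek?", top_k=2))
```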
transformation (QT) to generate refined queries for
B Additional Experiment Results
better retrieval and answer. This process of RA-QT
We report the average performance of our approach or QT-RA can continuously iterate until no new
on three datasets with the support of the Qwen2.5- sub-queries arise or the maximum iteration depth is
72B-Instruct model in Table 4 and 5. The experi- reached. Finally, the summary answer (SA) in Fig-
mental results show that the performance is further ure 11 utilizes all the information and conclusions
improved (+4.6%, +6.1%, +6.9% in average accu- from intermediate steps to refine the final answer.
Figure 6: Overview of the distribution of query complexity (number of hops or supporting facts) over the three multi-hop QA datasets: HotpotQA, MuSiQue, and 2WikiMultiHopQA.

D Case Study

We select a sample from the complex multi-hop dataset MuSiQue and analyse it in detail, as shown in Figures 13 and 14.
Figure 7: Impact of the retrieved document number scaling and the maximum context length scaling on each of the three datasets (HotpotQA, MuSiQue, 2WikiMultihopQA).
Figure 8: Performance comparison on the number of effective rollouts for each of the three datasets (HotpotQA, MuSiQue, 2WikiMultihopQA). Sampling more diverse reasoning paths consistently improves accuracy.

Example prompt for SAY reasoning action

Given the user query, you may rephrase it for better clarity, summarize it at a higher level,
or decompose it into multiple sub-queries to facilitate more effective information retrieval and
response generation. If no modification is necessary, return "None". Otherwise, list sub-queries,
each on a new line.
<Here are some examples.>
Query: {question}
Output:

Figure 9: Example prompt for SAY reasoning action.

Example prompt for QT reasoning action

Given the context provided, please determine whether rephrasing, summarization, or decomposition
into sub-queries is necessary to enhance the accuracy and efficiency of information retrieval and
response generation. If no modification is required, return "None". Subsequent queries should be
listed individually.
<Here are some examples.>
Main Query: {question}
History: {history}
This Query: {this_question}

Figure 10: Example prompt for QT reasoning action.

Example Prompt for RA actions

You are an expert in question answering. I am going to give you some contexts which may or may
not be relevant to the question. Answer the question according to the contexts.
{contexts}
Question: {question}

Figure 11: Example Prompt for RA actions.


Example prompt for SA reasoning action

You are an expert in question answering. Given the context, sub-queries and responses, output a
correct and concise answer to User Query.
<Here are some examples.>
User Query: {question}
{history}
Contexts: {contexts}
Final Answer:

Figure 12: Example prompt for SA reasoning action.

The iterative methods often get trapped in a single solution space


Context: <Retrieved documents>
Question: What is the mouth of watercourse for the body of water where Bartram’s Covered Bridge is located?
Follow up: Which body of water is Bartram’s Covered Bridge located on?
Follow up: | Intermediate answer: | So the final answer is:

Output: Follow up: Which body of water is Bartram’s Covered Bridge located on?

(Note: The context does not specify which body of water Bartram’s Covered Bridge is located on, but it does mention
that the bridge carries Goshen Road over Crum Creek. Therefore, we need to confirm that Crum Creek is the body
of water in question before proceeding to find the mouth of the watercourse.)

···
· · · (It enters an endless cycle of iteration until it reaches the predefined maximum depth.)

Gold Answer: Delaware River


Process Annotation: Q1: Bartram’s Covered Bridge » crosses
A1: Crum Creek
Q2: Q1 » mouth of the watercourse
A2: Delaware River

The iterative self-exploration often gets trapped in a solution space with low-quality reasoning steps even
after many attempts.

Figure 13: Incorrect reasoning path of iterative RAG methods. Ambiguity or missing information in the intermediate
steps leads to duplicate sub-queries.
The simple and effective action space of AirRAG leads to the correct reasoning trajectory
Question: What is the mouth of watercourse for the body of water where Bartram’s Covered Bridge is located?

Action1: SAY
Input: What is the mouth of watercourse for the body of water where Bartram’s Covered Bridge is located?
Output: Where is Bartram’s Covered Bridge located?
What is the mouth of the watercourse for the body of water near Bartram’s Covered Bridge?

Action2: RA
Input: <Retrieved documents> Where is Bartram’s Covered Bridge located?
Output: Bartram’s Covered Bridge is located on the border between Delaware County and Chester County, Pennsylvania,
carrying Goshen Road over Crum Creek.

Action3: QT
Input: What is the mouth of the watercourse for the body of water near Bartram’s Covered Bridge?
Output: What is the mouth of Crum Creek?

Action4: RA
Input: <Retrieved documents> What is the mouth of Crum Creek?
Output: Based on the provided contexts, there are two different locations referred to as Crum Creek, and they flow into
different bodies of water:
1. The Crum Creek in Delaware County and Chester County, Pennsylvania, flows into the Delaware River in Eddystone,
Pennsylvania.
2. The Crum Creek (also called Cathatachua Creek) in East Creek, New York, flows into the Mohawk River.
Since the question does not specify which Crum Creek is being referred to, both answers are possible. However, if
we consider the first context as the primary reference, the mouth of Crum Creek is the Delaware River in Eddystone,
Pennsylvania.

Action5: SA
Input: <Retrieved documents> <Sub-queries with answers>
Output: The mouth of the watercourse for the body of water where Bartram’s Covered Bridge is located, Crum Creek,
is the Delaware River in Eddystone, Pennsylvania.

Gold Answer: Delaware River.

AirRAG obtains the correct answer by performing five reasoning actions efficiently.

Figure 14: Effective reasoning rollout of AirRAG. The fundamental action space and tree-based search provide
greater compatibility and flexibility, avoiding being trapped in a single solution space by a low-quality
intermediate step, as happens in the chain-like reasoning paradigm.
