AirRAG: Activating Intrinsic Reasoning for Retrieval-Augmented Generation via Tree-based Search
Wenfeng Feng*, Chuzhan Hao*, Yuewei Zhang, Jingyi Song & Hao Wang†
Alibaba Cloud, Alibaba Group
{wenfeng.fwf,haochuzhan.hcz,liyou.zyw,sjy267290}@alibaba-inc.com
[email protected]
* The first two authors contributed equally.
† Corresponding author.

Abstract

Leveraging the autonomous decision-making capabilities of large language models (LLMs) has demonstrated superior performance in reasoning tasks. Despite the successes of iterative or recursive retrieval-augmented generation (RAG), such methods are often trapped in a single solution space when confronted with complex tasks. In this paper, we propose a novel thinking pattern in RAG, dubbed AirRAG, which integrates system analysis with efficient reasoning actions, significantly activating the intrinsic reasoning capabilities of LLMs and expanding the solution space of specific tasks via Monte Carlo Tree Search (MCTS). Specifically, our approach designs five fundamental reasoning actions that are expanded into a wide tree-based reasoning space using MCTS. The expansion also uses self-consistency verification to explore potential reasoning paths and implement inference scaling. In addition, computationally optimal strategies are used to apply more inference computation to key actions, achieving further performance improvements. Experimental results demonstrate the effectiveness of AirRAG through considerable performance gains over complex QA datasets. Furthermore, AirRAG is flexible and lightweight, making it easy to integrate with other advanced technologies.

Figure 1: Averaged performance comparison on the three datasets under different numbers of output sequences. Ltotal is the total amount of tokens consumed in the reasoning process. AirRAG leverages the diversity of generation and self-consistency verification to explore the potential solution space, which can effectively improve overall performance by scaling inference computation.

1 Introduction

Retrieval-Augmented Generation (RAG) can effectively alleviate the problem of generating factually incorrect content, which is crucial in domain-specific or knowledge-intensive tasks (Kandpal et al., 2023). However, as task complexity increases, further challenges arise, such as the inability to retrieve sufficient knowledge in a single query and the difficulty of understanding the complex reasoning logic in the question. Therefore, it becomes an important research task to leverage the reasoning capabilities of LLMs to improve the performance of RAG (Jiang et al., 2023; Jeong et al., 2024; Asai et al., 2024; Yu et al., 2024).

Previous studies on complex query scenarios focus on optimizing the query and retrieval process to obtain effective information (Shi et al., 2023; Zhou et al., 2023; Gao et al., 2023; Jiang et al., 2023; Zheng et al., 2024). Although these methods can solve specific tasks efficiently, their performance depends heavily on manually crafted rules and prompt engineering to improve retrieval relevance. This lack of flexibility makes it difficult to adapt quickly to different scenarios. In addition, recursive retrieval is often used to improve the depth and relevance of search results in information retrieval tasks: the intermediate query and retrieval results are continuously updated to satisfy the dynamic information needs during the complex task solution process (Jiang et al., 2023; Asai et al., 2024; Yan et al., 2024; Jeong et al., 2024).
Recent work leverages the decision-making capabilities of LLMs to implement automated iterative retrieval and generation (Yu et al., 2024; Yue et al., 2024). However, these approaches have two main issues. First, it is difficult for modular RAG to cover the diverse types of questions, which involves substantial human effort and is extremely inefficient. The single solution space also makes it impossible to adequately activate the decision-making ability of LLMs. Second, the single reasoning paradigm and the chain-like reasoning process struggle to effectively explore the solution space during reasoning. The self-exploration process is susceptible to low-quality intermediate reasoning steps and is trapped in a single solution space. It is also difficult to effectively guide self-exploration for relatively smaller language models (e.g., Qwen2.5-14B-Instruct (Yang et al., 2024)). Therefore, we should design rational reasoning actions to comprehensively explore a wide and deep solution space.

In response, we propose AirRAG to activate intrinsic reasoning capabilities and expand the solution space via MCTS. First, we design five fundamental reasoning actions, including system analysis, direct answer, retrieval-answer, query transformation and summary-answer. This series of actions can address a variety of issues in different scenarios, such as complex problems requiring progressive or parallel queries. These actions can be performed efficiently even on relatively small language models. Additionally, we introduce MCTS and self-consistency to achieve controllable reasoning path generation and efficient inference scaling. In order to accurately select the answer from multiple reasoning paths, we combine the voting method with a process-supervised reward model. As the amount of inference computation increases, our approach achieves significant performance improvements, as shown in Figure 1. Moreover, AirRAG has a flexible architecture that can easily integrate other advanced methods as an action branch. In summary, our main contributions are as follows:

• We design five fundamental reasoning actions that can solve most problem types in RAG scenarios and ensure the controllability of the reasoning process.

• We introduce MCTS and self-consistency to effectively expand the solution space of complex tasks. The generalization and performance are improved by implementing comprehensive inference scaling and a pluggable architecture.

• Experimental results show that AirRAG is superior to current iterative or recursive RAG. Our approach effectively activates the intrinsic reasoning capabilities of LLMs and expands the solution space to a controllable extent.

2 Related Work

Retrieval-Augmented Generation (RAG) RAG can significantly improve the performance of LLMs in knowledge-intensive tasks. Compared to vanilla RAG, optimizing the query and retrieval process can significantly enhance knowledge correlation and thus improve reasoning performance. Some methods use query expansion and transformation to obtain better results (Zhou et al., 2023; Ma et al., 2023; Gao et al., 2023). As the complexity of tasks escalates, it becomes more difficult to obtain sufficient knowledge in a single retrieval. Therefore, the idea of iterative retrieval is introduced to obtain additional contextual references. IRCoT (Trivedi et al., 2023) uses chain-of-thought to guide the retrieval process and refines the CoT with the obtained retrieval results. ITER-RETGEN (Shao et al., 2023) coordinates retrieval and generation modules to implement a sophisticated understanding of the specific task.

Activating the reasoning capabilities of LLMs in RAG Leveraging the decision-making capabilities of LLMs can enhance the efficiency and relevance of the information sourced (Nakano et al., 2022; Schick et al., 2023). Self-RAG and its variants (Asai et al., 2024; Yan et al., 2024; Jeong et al., 2024) adopt a self-reflection mechanism that learns to predict reflection tokens during training, which makes the LLM controllable during the inference phase. Auto-RAG (Yu et al., 2024) systematically plans retrievals and refines queries to acquire valuable knowledge through multi-turn iterations. IterDRAG (Yue et al., 2024) explores inference scaling strategies in RAG, which can enhance LLMs' ability to effectively acquire and utilize contextual information. However, these methods often struggle to effectively explore the solution space during reasoning. Self-exploration is often trapped in a solution space with low-quality reasoning steps even after many attempts. This phenomenon often stems from the chain reasoning pattern and the difficulty of small-scale LLMs in tackling overly complex
tasks in a single iteration.

Monte Carlo Tree Search (MCTS) Tree-based search algorithms, particularly MCTS, have demonstrated remarkable capabilities to expand search spaces and enhance reasoning (Silver et al., 2017; Chen et al., 2024; Qi et al., 2024; Zhang et al., 2024). Many studies have shown that such search strategies can extend reasoning by using multiple branch queries to explore diverse reasoning paths (Yao et al., 2023; Besta et al., 2024). In the mathematical reasoning scenario, Zhang et al. (2024) and Chen et al. (2024) leverage MCTS to foster a more efficient exploration of solution spaces. Meanwhile, Qi et al. (2024) designs rich human-like reasoning actions for higher-quality reasoning trajectories. Recent research has shown that inference scaling (Yue et al., 2024) and self-consistency (Wang et al., 2023) can lead to substantial improvements. Therefore, our approach samples diverse paths in the expansion of the action space, achieving inference computation scaling, and performs self-consistency verification on all candidate solutions.

3 Methodology

In order to effectively explore the solution space during reasoning, we propose a controllable tree-based RAG framework. In this framework, the combination of MCTS and five reasoning actions results in an efficient and controllable expansion of the solution space. Meanwhile, we further implement more comprehensive inference scaling strategies based on Yue et al. (2024) and use pruning and computationally optimal strategies to achieve a balance between effectiveness and efficiency. The whole process is illustrated in Figure 2.

3.1 Define Fundamental Reasoning Actions

Relying only on the autonomy of LLMs to perform iterative self-exploration makes it easy to get trapped in a single solution space and makes it difficult to deal with different types of complex questions. IterDRAG (Yue et al., 2024) uses a single action type to generate the next reasoning step, which can easily lead to ineffective space exploration. The core of MCTS generation lies in the action space, which defines the scope of tree exploration. Therefore, it is essential to simplify human cognitive processes in complex reasoning (Jaffe et al., 2023). Inspired by this, we introduce five fundamental human-like reasoning actions to bridge the gap between LLM reasoning and human cognition in RAG scenarios.

• A1: System Analysis (SAY). This action analyzes the overall structure of the problem and then performs its decomposition or planning. This is global thinking before problem solving.

• A2: Direct Answer (DA). This action leverages the parametric knowledge of LLMs to answer questions directly, without being influenced by external knowledge.

• A3: Retrieval-Answer (RA). This action retrieves related knowledge from the external knowledge base to support subsequent reasoning.

• A4: Query Transformation (QT). This action transforms human questions in order to improve retrieval performance. It supports various transformations such as rewriting, step-back prompting, follow-up questions and multi-query retrieval.

• A5: Summary-Answer (SA). Combining the intermediate reasoning steps and answers with the initial question, the final answer is obtained by further refinement and summarization.

The above five actions define a highly diverse action space {A1, A2, A3, A4, A5}. At the first step, the initial state is denoted as s0, and MCTS then selects actions A1 and A2 to prompt the LLM to generate the next reasoning steps in parallel. Subsequent actions are performed sequentially to expand the reasoning path. Note that there are sequential dependencies between different actions; for example, A1 and A2 can only happen after the root question. Meanwhile, we introduce the diverse sampling of self-consistency (Wang et al., 2023) for each action to expand the solution space and improve task performance. Specifically, an action is more likely to generate the correct reasoning step if we sample multiple times in the current state. Finally, we can obtain multiple generated reasoning trajectories, such as [s0 ⊕ s1:n] and so on.

To further improve the efficiency of inference, we choose the actions {A3, A4, A5} as a simplified action space, which can achieve a better balance between efficiency and effectiveness; this variant is also referred to as AirRAG-Lite.
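To make the action space and its sequential dependencies concrete, the following is a minimal Python sketch written for illustration only (it is not the authors' released implementation); the action names mirror A1–A5 above, and the transition rule that SAY and DA may only follow the root question is our reading of the description.

from enum import Enum

class Action(Enum):
    SAY = "system_analysis"      # A1: decompose / plan the question
    DA = "direct_answer"         # A2: answer from parametric knowledge only
    RA = "retrieval_answer"      # A3: retrieve external knowledge, then answer
    QT = "query_transformation"  # A4: rewrite / step-back / follow-up queries
    SA = "summary_answer"        # A5: refine and summarize the final answer

FULL_SPACE = {Action.SAY, Action.DA, Action.RA, Action.QT, Action.SA}
LITE_SPACE = {Action.RA, Action.QT, Action.SA}  # AirRAG-Lite

def allowed_actions(history: list[Action], space=FULL_SPACE) -> set[Action]:
    """Sequential dependencies (assumed): SAY and DA may only follow the root
    question; later steps extend the path with retrieval, transformation, or summary."""
    if not history:  # root question, no reasoning steps yet
        return {a for a in (Action.SAY, Action.DA, Action.RA, Action.QT) if a in space}
    return {a for a in (Action.RA, Action.QT, Action.SA) if a in space}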
3.2 Perform Reasoning Processes via MCTS
Figure 2: The schematic diagram of our proposed AirRAG. AirRAG implements a paradigm that combines systems thinking with step-by-step reasoning. In the inference phase, we introduce MCTS and self-consistency to scale computation, which significantly outperforms other strong baselines.

3.2.1 Solution Generation
Based on the action space defined above, we introduce MCTS to generate candidate reasoning trajectories. The initial root node s0 is the question without any reasoning steps. The policy can be directly modeled by a language model as π(a|s) = LM(a|s), and the state transition function is the combination of the preceding reasoning steps and the current action, i.e., s_i = Concat(s_{0:i−1}, a_j). In each MCTS rollout, we perform multiple steps, including selection, expansion, simulation, and backpropagation. We perform multiple rollouts to expand the solution space. To balance exploration and exploitation, we use the well-known Upper Confidence Bounds applied to Trees (UCT) (Kocsis and Szepesvári, 2006) for node selection as follows:

$\mathrm{UCT}(s, p) = \frac{Q(s, a)}{N(s)} + w \sqrt{\frac{\log N_p(s)}{N(s)}}$,   (1)

where Q(s, a) is the reward value for node s and is updated through backpropagation, N(s) is the number of visits to s, p is the parent node of s, and w is the weight that balances exploration and exploitation.
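As a concrete illustration of Equation (1), the snippet below sketches UCT-based child selection in plain Python; the node fields (q_value, visits, children) are hypothetical names standing in for Q(s, a), N(s), and the parent–child structure described above, and the sketch makes no claim about the paper's actual data structures.

import math
from dataclasses import dataclass, field

@dataclass
class Node:
    q_value: float = 0.0          # accumulated reward Q(s, a), updated by backpropagation
    visits: int = 0               # N(s), number of visits to this node
    children: list["Node"] = field(default_factory=list)

def uct(child: Node, parent: Node, w: float = 1.0) -> float:
    """UCT(s, p) = Q(s, a) / N(s) + w * sqrt(log N_p(s) / N(s))  -- Eq. (1)."""
    if child.visits == 0:
        return float("inf")       # always try unvisited children first
    exploit = child.q_value / child.visits
    explore = w * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore

def select_child(parent: Node, w: float = 1.0) -> Node:
    # selection phase: pick the child maximizing the UCT score
    return max(parent.children, key=lambda c: uct(c, parent, w))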
Once the search reaches a terminal node, either a terminal state or a predetermined maximum tree depth d, we obtain a trajectory from the root to the terminal node. We collect all trajectories from the rollout iterations as candidate solutions. Section 3.3 explains how we select the optimal answer node from them.

3.2.2 Inference Scaling

Many studies have shown that scaling the inference computation can lead to substantial improvements in the performance of LLMs without training (Snell et al., 2024; Yue et al., 2024). Building on these methods, we further introduce a wide range of strategies to explore how AirRAG benefits from the scaling of inference computation. A straightforward strategy is to extend the effective context length (denoted Lmax) during the document retrieval phase, where more related documents are provided to supplement the knowledge. Moreover, we perform multiple rollouts to fully explore the solution space, relying on the tree-based search. The number of output sequences (denoted n) generated by certain actions can also be adjusted to achieve self-consistency verification and inference computation scaling. All in all, these strategies provide more comprehensive flexibility in scaling inference computation for RAG, allowing LLMs to more effectively address complex knowledge-intensive queries.

To improve efficiency and reduce redundant inference computation, we implement an early pruning strategy for state nodes and reasoning paths. Deduplication is applied to the output sequence states generated by each action, ensuring subsequent path diversity. Additionally, if multiple rollouts select the same state sequence, we save only one valid reasoning path.
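A minimal sketch of this early pruning, under the simplifying assumption that each reasoning path is a list of step strings (our own representation, not the paper's): duplicate output-sequence states are dropped within one expansion, and identical state sequences across rollouts are kept only once.

def dedup_expansions(outputs: list[str]) -> list[str]:
    """Keep only distinct output-sequence states generated by one action,
    preserving order, so that subsequent branches stay diverse."""
    seen, unique = set(), []
    for out in outputs:
        key = out.strip()
        if key not in seen:
            seen.add(key)
            unique.append(out)
    return unique

def dedup_rollouts(paths: list[list[str]]) -> list[list[str]]:
    """If multiple rollouts select the same state sequence, save only one valid path."""
    seen, unique = set(), []
    for path in paths:
        key = tuple(step.strip() for step in path)
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique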
3.2.3 Flexible Architecture

Our tree-based architecture provides the flexibility to integrate other advanced approaches. We reproduce IterDRAG following the prompts of Yue et al. (2024). Meanwhile, inspired by its iterative implementation, we simplify the fundamental action space to {A3, A4, A5}, which can be quickly implemented and achieves relatively good results. These methods can be used as an exploratory branch of our approach and can be turned on or off flexibly.

3.3 Select the Optimal Answer Node

For some common mathematical reasoning tasks, we can use a simple consistency-based method to efficiently select the most precise reasoning path (e.g., selecting the most frequent number extracted from multiple candidate solutions in MATH (Hendrycks et al., 2021) as the final answer). However, it is difficult to extract precise answers and perform effective aggregation for knowledge-intensive tasks. Thus, we design two self-consistency verification methods for such problems. Jaccard similarity and text embeddings are two different approaches used in natural language processing to measure the similarity between texts. We use these methods to implement the clustering of text answers. Each answer score is computed by

$\mathrm{jcdScore}_i = \frac{1}{N} \sum_{j=1}^{N} \frac{|A_i \cap A_j|}{|A_i \cup A_j|}$,   (2)

$\mathrm{embScore}_i = \frac{1}{N} \sum_{j=1}^{N} \cos(E_i, E_j)$,   (3)

where N is the number of valid answer nodes, A_i is the word-level set of answer text i, and E_i denotes the embedding vector of answer text i.
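The two scores can be computed directly from the candidate answer texts. Below is an illustrative implementation of Equations (2) and (3); word-level sets stand in for A_i, and any sentence-embedding model can supply E_i (the embedding source is left open here and is not tied to the paper's retriever).

import numpy as np

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def jcd_scores(answers: list[str]) -> list[float]:
    """Eq. (2): average word-level Jaccard similarity of answer i to all answers."""
    sets = [set(ans.lower().split()) for ans in answers]
    n = len(sets)
    return [sum(jaccard(sets[i], sets[j]) for j in range(n)) / n for i in range(n)]

def emb_scores(embeddings: np.ndarray) -> list[float]:
    """Eq. (3): average cosine similarity of answer i's embedding to all answers."""
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = norms @ norms.T          # pairwise cosine similarities
    return sims.mean(axis=1).tolist()

# The answer with the highest score is treated as the self-consistent choice, e.g.:
# best = answers[int(np.argmax(jcd_scores(answers)))]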
In addition, we further investigate self-refinement and a process-supervised reward model to identify the most accurate reasoning trajectory. Self-refine is a process that uses the action SA to refine the final answer from all candidate answer nodes. The reward modeling process consists of data synthesis and instruction tuning. During data synthesis, we leverage MCTS to perform multiple rollouts on partial training sets. According to the known ground truth, we sample positive and negative reasoning trajectories and use Monte Carlo estimation to evaluate the intermediate state scores. In the instruction tuning phase, we utilize the synthetic samples to fine-tune a relatively small LLM (i.e., Qwen2.5-14B-Instruct).

4 Experiments

In this section, we conduct experiments on complex QA benchmarks to answer the following research questions.

• RQ1: Does AirRAG outperform state-of-the-art baselines?

• RQ2: How does AirRAG perform when it comes to comprehensive inference scaling?

• RQ3: What is the performance benefit of AirRAG in optimizing the allocation of inference computation?

• RQ4: How does AirRAG perform with various verification methods for multiple candidate rollouts?

• RQ5: What is the intuitive performance of AirRAG in the reasoning process?

4.1 Experimental Settings

4.1.1 Datasets

To evaluate the effectiveness of AirRAG, we conduct experiments on various question-answering (QA) tasks, including both open-domain QA and multi-hop QA. The complex multi-hop QA datasets consist of HotpotQA (Yang et al., 2018), MuSiQue (Trivedi et al., 2022) and 2WikiMultiHopQA (2Wiki) (Ho et al., 2020). Other single-hop QA datasets include Natural Questions (NQ) (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), PopQA (Mallen et al., 2023) and WebQA (Berant et al., 2013).

4.1.2 Implementation Details

We use the hyperparameters reported for the existing models whenever available. Implementation details are available in Appendix A.

4.1.3 Baselines and Metrics

To investigate the enhancement effects of thinking and planning on complex RAG tasks, we compare AirRAG with vanilla RAG, which performs only a single retrieval and generation process. We evaluate the naive generators Qwen2.5-14B-Instruct (Yang et al., 2024) and Llama-3-8B-Instruct (Grattafiori et al., 2024). In the retrieval phase, we employ multilingual-e5-base (Wang et al., 2024) as the retriever. The prompts of vanilla RAG are shown in Appendix C. For iterative retrieval, we compare AirRAG with Iter-RetGen (Shao et al., 2023), Self-RAG (Asai et al., 2024), Auto-RAG (Yu et al., 2024), and IterDRAG (Yue et al., 2024).
To further explore RAG performance and inference computation scaling, we focus on a comparison with IterDRAG for a given budget of inference computation. For evaluation metrics, we report Exact Match (EM), F1 score (F1) and Accuracy (Acc) between the generated summary and the gold answer, where accuracy measures whether the gold answer is covered in the generated answer.
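For reference, the three metrics can be computed roughly as follows; this is a simplified sketch (whitespace tokenization, lowercasing only), not the exact evaluation script used in the paper.

from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip().lower() == gold.strip().lower())

def f1(pred: str, gold: str) -> float:
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def accuracy(pred: str, gold: str) -> float:
    # Acc counts the prediction as correct if the gold answer is covered in it
    return float(gold.strip().lower() in pred.lower())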
Lmax   Method            HotpotQA       MuSiQue        2Wiki          Average
                         F1     Acc     F1     Acc     F1     Acc     F1     Acc
8k     ZeroShot QA       42.5   41.3    13.5   12.1    48.2   47.3    34.7   33.6
       Vanilla RAG       70.3   65.4    23.0   17.7    55.8   53.4    49.7   45.5
       IterDRAG*         74.3   69.1    26.7   19.4    60.5   57.6    53.8   48.7
       AirRAG-Lite       80.6   75.4    35.4   28.9    75.3   73.1    63.8   59.1
       AirRAG            79.6   75.2    41.0   35.0    76.0   74.2    65.6   61.5
       AirRAG-Blender    81.1   79.8    41.6   36.4    82.2   81.7    68.3   66.0
32k    Vanilla RAG       77.1   72.0    29.0   22.9    60.9   58.1    55.7   51.0
       IterDRAG*         77.7   71.6    30.8   22.3    63.0   60.2    57.1   51.4
       AirRAG-Lite       82.4   76.9    36.7   30.1    78.8   76.8    66.0   61.3
       AirRAG            82.5   77.4    43.2   36.3    80.4   78.9    68.7   64.2
       AirRAG-Blender    82.9   80.6    43.3   37.6    83.4   83.0    69.9   67.1
128k   IterDRAG*         76.8   71.0    31.7   24.8    65.5   62.4    58.0   52.7
       AirRAG-Lite       82.5   77.1    35.7   30.4    78.3   76.0    65.5   62.2
       AirRAG            83.3   78.0    43.5   36.5    82.3   80.5    69.7   65.0
       AirRAG-Blender    83.7   81.4    43.9   38.5    84.4   84.2    70.6   68.0

Table 1: Overall evaluation results on the test sets of three datasets. * indicates the results reproduced by us. Lmax denotes the maximum number of input tokens across all rollouts. The best results for each Lmax are in bold. The number of both rollouts and output sequences is set to 1.

For a fair comparison, we utilize the widely used Wikipedia dump from December 2018 (Karpukhin et al., 2020) as the retrieval corpus. We observe consistent improvements over vanilla RAG and existing iterative methods (more than 10% on average) across diverse LLMs and datasets. The significant boost over IterDRAG and Auto-RAG suggests that AirRAG explores more effective reasoning paths through the human-like thinking paradigm and tree-based search.
Figure 3: Impact of the retrieved document number scaling (Left) and the maximum context length scaling (Right) on model performance (averaged Accuracy and Recall of three datasets). All methods show consistent performance improvements as the effective inference computation scales.
… the number of output sequences per action, demonstrating the effectiveness of self-consistency. …

Table 3: Performance comparison on six benchmarks, where Llama3-8B-Instruct is used as the generator LLM. Partial experimental results are quoted from Jin et al. (2024) and Yu et al. (2024). The best results are in bold. The number of both rollouts and output sequences is set to 1. The number of documents for a single retrieval is set to 5. Our proposed AirRAG significantly outperforms the others.
Figure 5: Performance comparison of different verification methods. "QwenRM" is short for the reward model trained on the Qwen model. "SA" is the reasoning action of summary and answer. "SC-emb/jcd" are two self-consistency verification methods based on text embeddings and Jaccard similarity. "Average" is the average score over all candidate rollouts. The single retrieval process is set to retrieve three documents or a fixed 8k context.

Effect of Verification Methods (RQ4). The larger search space also generates more candidate reasoning trajectories. Therefore, how to select the optimal trajectory is crucial for the final performance. We compare multiple verification methods with the average scores of all candidates in Figure 5. The two self-consistency verification methods are always slightly better than the average score, but they are not nearly as good as the SA and QwenRM methods. The SA method uses the LLM to further refine the final answer from all candidate rollouts, which is simple and effective. Finally, the reward model achieves the most competitive results due to the introduction of supervised information on key intermediate reasoning steps in the training process. However, collecting process-supervised training samples requires high computational costs and high-quality raw data. In practical application scenarios, we can choose the appropriate method while balancing efficiency and effectiveness.

Case Study (RQ5). We select a sample from the complex multi-hop dataset MuSiQue. Existing iterative methods are often trapped in a single solution space when confronted with complex tasks. As illustrated in Figure 13, these iterative methods exhibit a key limitation: insufficient or ambiguous retrieval context can lead to repetitive follow-up queries until the predefined maximum depth of iterations is reached. This inefficient iteration results in high computational cost and incorrect answers. In contrast, our proposed AirRAG designs efficient reasoning actions to activate the intrinsic reasoning capabilities of LLMs. As shown in Figure 14, the SAY action decomposes the original query into a more rational sequence of sub-queries, and then the combination of RA and QT ensures the accuracy of the intermediate reasoning steps. We eventually leverage the efficient reasoning trajectory to obtain the correct answer.

5 Conclusions

In this paper, we propose AirRAG, a novel RAG approach to activate the intrinsic reasoning capabilities of LLMs. AirRAG designs an efficient action space for controllable reasoning generation. We also introduce Monte Carlo Tree Search to expand the solution space. Meanwhile, by employing tree-based search and self-consistency verification, we explore potential reasoning paths and achieve comprehensive inference computation scaling in RAG. In addition, computationally optimal strategies are used to apply more computation to key actions, leading to further performance improvements. Experimental results on diverse QA datasets demonstrate the significant superiority of AirRAG over other methods designed for complex RAG scenarios. Furthermore, AirRAG can be integrated with other advanced methods with great flexibility.
Limitations

Although our model achieves competitive performance in various RAG tasks, there remain some limitations and directions for improvement. The current optimal computation allocation strategy is derived from extensive experiments; we could consider designing an automated policy model to implement the trade-off between computational cost and performance. Despite great efforts in the inference scaling of RAG, the experimental analysis may be limited due to the massive computational cost of tree-based search approaches. We will explore more complex reasoning tasks to verify the robustness and effectiveness of our approach. In addition, the large search space also brings more noisy information, so we will further investigate reward models or strategies to explore better reasoning paths.
References els through question complexity. In Proceedings of
the 2024 Conference of the North American Chap-
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and ter of the Association for Computational Linguistics:
Hannaneh Hajishirzi. 2024. Self-RAG: Learning to Human Language Technologies (Volume 1: Long
retrieve, generate, and critique through self-reflection. Papers), pages 7036–7050, Mexico City, Mexico. As-
In The Twelfth International Conference on Learning sociation for Computational Linguistics.
Representations.
Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun,
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie
Liang. 2013. Semantic parsing on Freebase from Callan, and Graham Neubig. 2023. Active retrieval
question-answer pairs. In Proceedings of the 2013 augmented generation. In Proceedings of the 2023
Conference on Empirical Methods in Natural Lan- Conference on Empirical Methods in Natural Lan-
guage Processing, pages 1533–1544, Seattle, Wash- guage Processing, pages 7969–7992, Singapore. As-
ington, USA. Association for Computational Linguis- sociation for Computational Linguistics.
tics.
Jiajie Jin, Yutao Zhu, Xinyu Yang, Chenghao Zhang,
Maciej Besta, Nils Blach, Ales Kubicek, Robert Ger- and Zhicheng Dou. 2024. Flashrag: A modular
stenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz toolkit for efficient retrieval-augmented generation
Lehmann, Michał Podstawski, Hubert Niewiadomski, research. CoRR, abs/2405.13576.
Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of
Thoughts: Solving Elaborate Problems with Large Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke
Language Models. Proceedings of the AAAI Confer- Zettlemoyer. 2017. TriviaQA: A large scale distantly
ence on Artificial Intelligence, 38(16):17682–17690. supervised challenge dataset for reading comprehen-
sion. In Proceedings of the 55th Annual Meeting of
Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. the Association for Computational Linguistics (Vol-
2024. Alphamath almost zero: Process supervision ume 1: Long Papers), pages 1601–1611, Vancouver,
without process. In The Thirty-eighth Annual Con- Canada. Association for Computational Linguistics.
ference on Neural Information Processing Systems.
Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric
Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. Wallace, and Colin Raffel. 2023. Large language
2023. Precise zero-shot dense retrieval without rel- models struggle to learn long-tail knowledge. In
evance labels. In Proceedings of the 61st Annual Proceedings of the 40th International Conference
Meeting of the Association for Computational Lin- on Machine Learning, volume 202 of Proceedings
guistics (Volume 1: Long Papers), pages 1762–1777, of Machine Learning Research, pages 15696–15707.
Toronto, Canada. Association for Computational Lin- PMLR.
guistics.
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and
Abhinav Pandey, Abhishek Kadian, Ahmad Al- Wen-tau Yih. 2020. Dense passage retrieval for open-
Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, domain question answering. In Proceedings of the
2020 Conference on Empirical Methods in Natural and Denny Zhou. 2023. Large language models can
Language Processing (EMNLP), pages 6769–6781, be easily distracted by irrelevant context. In Proceed-
Online. Association for Computational Linguistics. ings of the 40th International Conference on Machine
Learning, volume 202 of Proceedings of Machine
Levente Kocsis and Csaba Szepesvári. 2006. Bandit Learning Research, pages 31210–31227. PMLR.
based monte-carlo planning. In Machine Learning:
ECML 2006, pages 282–293, Berlin, Heidelberg. David Silver, Thomas Hubert, Julian Schrittwieser, Ioan-
Springer Berlin Heidelberg. nis Antonoglou, Matthew Lai, Arthur Guez, Marc
Lanctot, Laurent Sifre, Dharshan Kumaran, Thore
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red- Graepel, et al. 2017. Mastering chess and shogi by
field, Michael Collins, Ankur Parikh, Chris Alberti, self-play with a general reinforcement learning algo-
Danielle Epstein, Illia Polosukhin, Jacob Devlin, Ken- rithm. arXiv preprint arXiv:1712.01815.
ton Lee, Kristina Toutanova, Llion Jones, Matthew
Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Ku-
Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natu- mar. 2024. Scaling llm test-time compute optimally
ral questions: A benchmark for question answering can be more effective than scaling model parameters.
research. Transactions of the Association for Compu- Preprint, arXiv:2408.03314.
tational Linguistics, 7:452–466.
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot,
Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Ashish Sabharwal. 2022. MuSiQue: Multi-
and Nan Duan. 2023. Query rewriting in retrieval- hop questions via single-hop question composition.
augmented large language models. In Proceedings of Transactions of the Association for Computational
the 2023 Conference on Empirical Methods in Natu- Linguistics, 10:539–554.
ral Language Processing, pages 5303–5315, Singa-
pore. Association for Computational Linguistics. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot,
and Ashish Sabharwal. 2023. Interleaving retrieval
Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, with chain-of-thought reasoning for knowledge-
Daniel Khashabi, and Hannaneh Hajishirzi. 2023. intensive multi-step questions. In Proceedings of
When not to trust language models: Investigating the 61st Annual Meeting of the Association for Com-
effectiveness of parametric and non-parametric mem- putational Linguistics (Volume 1: Long Papers),
ories. In Proceedings of the 61st Annual Meeting of pages 10014–10037, Toronto, Canada. Association
the Association for Computational Linguistics (Vol- for Computational Linguistics.
ume 1: Long Papers), pages 9802–9822, Toronto,
Canada. Association for Computational Linguistics. Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang,
Rangan Majumder, and Furu Wei. 2024. Multilin-
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, gual e5 text embeddings: A technical report. arXiv
Long Ouyang, Christina Kim, Christopher Hesse, preprint arXiv:2402.05672.
Shantanu Jain, Vineet Kosaraju, William Saunders,
Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le,
Krueger, Kevin Button, Matthew Knight, Benjamin Ed H. Chi, Sharan Narang, Aakanksha Chowdhery,
Chess, and John Schulman. 2022. Webgpt: Browser- and Denny Zhou. 2023. Self-consistency improves
assisted question-answering with human feedback. chain of thought reasoning in language models. In
Preprint, arXiv:2112.09332. The Eleventh International Conference on Learning
Representations.
Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang,
Fan Yang, and Mao Yang. 2024. Mutual reason- Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling.
ing makes smaller llms stronger problem-solvers. 2024. Corrective retrieval augmented generation.
Preprint, arXiv:2408.06195. arXiv preprint arXiv:2401.15884.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui,
Raileanu, Maria Lomeli, Eric Hambro, Luke Zettle- Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu,
moyer, Nicola Cancedda, and Thomas Scialom. 2023. Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jian-
Toolformer: Language models can teach themselves hong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang,
to use tools. In Thirty-seventh Conference on Neural Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu,
Information Processing Systems. Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng
Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tian-
Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie hao Li, Tingyu Xia, Xingzhang Ren, Xuancheng
Huang, Nan Duan, and Weizhu Chen. 2023. En- Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan,
hancing retrieval-augmented large language models Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan
with iterative retrieval-generation synergy. In Find- Qiu. 2024. Qwen2.5 technical report. arXiv preprint
ings of the Association for Computational Linguis- arXiv:2412.15115.
tics: EMNLP 2023, pages 9248–9274, Singapore.
Association for Computational Linguistics. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio,
William Cohen, Ruslan Salakhutdinov, and Christo-
Freda Shi, Xinyun Chen, Kanishka Misra, Nathan pher D. Manning. 2018. HotpotQA: A dataset for
Scales, David Dohan, Ed H. Chi, Nathanael Schärli, diverse, explainable multi-hop question answering.
In Proceedings of the 2018 Conference on Empiri-
cal Methods in Natural Language Processing, pages
2369–2380, Brussels, Belgium. Association for Com-
putational Linguistics.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran,
Thomas L. Griffiths, Yuan Cao, and Karthik R
Narasimhan. 2023. Tree of thoughts: Deliberate
problem solving with large language models. In
Thirty-seventh Conference on Neural Information
Processing Systems.
Tian Yu, Shaolei Zhang, and Yang Feng. 2024. Auto-
rag: Autonomous retrieval-augmented generation for
large language models. Preprint, arXiv:2411.19443.
Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf
Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuan-
hui Wang, and Michael Bendersky. 2024. Inference
scaling for long-context retrieval augmented genera-
tion. Preprint, arXiv:2410.04343.
Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong
Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco
Pavone, Yuqiang Li, Wanli Ouyang, and Dongzhan
Zhou. 2024. Llama-berry: Pairwise optimization
for o1-like olympiad-level mathematical reasoning.
Preprint, arXiv:2410.02884.
Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen,
Heng-Tze Cheng, Ed H. Chi, Quoc V Le, and Denny
Zhou. 2024. Take a step back: Evoking reasoning via
abstraction in large language models. In The Twelfth
International Conference on Learning Representa-
tions.
Figure 6: Overview of the distribution of query complexity over three multi-hop QA datasets.
D Case Study
We select a sample from the complex multi-hop dataset MuSiQue to analyse in detail, as shown in Figures 13 and 14.
Figure 7: Impact of the retrieved document number scaling and the maximum context length scaling over three datasets (HotpotQA, MuSiQue, 2WikiMultihopQA).
Figure 8: Performance comparison on the number of different effective rollouts over three datasets (HotpotQA, MuSiQue, 2WikiMultihopQA). Sampling more diverse reasoning paths consistently improves accuracy.
Given the user query, you may rephrase it for better clarity, summarize it at a higher level,
or decompose it into multiple sub-queries to facilitate more effective information retrieval and
response generation. If no modification is necessary, return "None". Otherwise, list sub-queries,
each on a new line.
<Here are some examples.>
Query: {question}
Output:
Given the context provided, please determine whether rephrasing, summarization, or decomposition
into sub-queries is necessary to enhance the accuracy and efficiency of information retrieval and
response generation. If no modification is required, return "None". Subsequent queries should be
listed individually.
<Here are some examples.>
Main Query: {question}
History: {history}
This Query: {this_question}
You are an expert in question answering. I am going to give you some contexts which may or may
not be relevant to the question. Answer the question according to the contexts.
{contexts}
Question: {question}
You are an expert in question answering. Given the context, sub-queries and responses, output a
correct and concise answer to User Query.
<Here are some examples.>
User Query: {question}
{history}
Contexts: {contexts}
Final Answer:
Output: Follow up: Which body of water is Bartram’s Covered Bridge located on?
(Note: The context does not specify which body of water Bartram’s Covered Bridge is located on, but it does mention
that the bridge carries Goshen Road over Crum Creek. Therefore, we need to confirm that Crum Creek is the body
of water in question before proceeding to find the mouth of the watercourse.)
···
· · · (It enters an endless cycle of iteration until it reaches the predefined maximum depth.)
The iterative self-exploration is often trapped in a solution space with low-quality reasoning steps even
after many attempts.
Figure 13: Incorrect reasoning path of iterative RAG methods. Ambiguity or missing information in the intermediate
steps leads to duplicate sub-queries.
The simple and effective action space of AirRAG leads to the correct reasoning trajectory
Question: What is the mouth of watercourse for the body of water where Bartram’s Covered Bridge is located?
Action1: SAY
Input: What is the mouth of watercourse for the body of water where Bartram’s Covered Bridge is located?
Output: Where is Bartram’s Covered Bridge located?
What is the mouth of the watercourse for the body of water near Bartram’s Covered Bridge?
Action2: RA
Input: <Retrieved documents> Where is Bartram’s Covered Bridge located?
Output: Bartram’s Covered Bridge is located on the border between Delaware County and Chester County, Pennsylvania,
carrying Goshen Road over Crum Creek.
Action3: QT
Input: What is the mouth of the watercourse for the body of water near Bartram’s Covered Bridge?
Output: What is the mouth of Crum Creek?
Action4: RA
Input: <Retrieved documents> What is the mouth of Crum Creek?
Output: Based on the provided contexts, there are two different locations referred to as Crum Creek, and they flow into
different bodies of water:
1. The Crum Creek in Delaware County and Chester County, Pennsylvania, flows into the Delaware River in Eddystone,
Pennsylvania.
2. The Crum Creek (also called Cathatachua Creek) in East Creek, New York, flows into the Mohawk River.
Since the question does not specify which Crum Creek is being referred to, both answers are possible. However, if
we consider the first context as the primary reference, the mouth of Crum Creek is the Delaware River in Eddystone,
Pennsylvania.
Action5: SA
Input: <Retrieved documents> <Sub-queries with answers>
Output: The mouth of the watercourse for the body of water where Bartram’s Covered Bridge is located, Crum Creek,
is the Delaware River in Eddystone, Pennsylvania.
AirRAG obtains the correct answer by performing five reasoning actions efficiently.
Figure 14: Effective reasoning rollout of AirRAG. The fundamental action space and tree-based search provide
greater compatibility and flexibility, which avoids being trapped in a single solution space by a low-quality
intermediate step, as happens in the chain-like reasoning paradigm.