RAFT: Adapting Language Model to Domain Specific RAG

Tianjun Zhang Shishir G. Patil Naman Jain Sheng Shen Matei Zaharia Ion Stoica Joseph E. Gonzalez
[email protected], [email protected]
UC Berkeley

Abstract

Pretraining Large Language Models (LLMs) on large corpora of textual data is now a standard paradigm. When using these LLMs for many downstream applications, it is common to additionally bake in new knowledge (e.g., time-critical news, or private domain knowledge) into the pretrained model either through RAG-based prompting or fine-tuning. However, the optimal methodology for the model to gain such new knowledge remains an open question. In this paper, we present Retrieval Augmented Fine Tuning (RAFT), a training recipe that improves the model's ability to answer questions in an "open-book" in-domain setting. In RAFT, given a question and a set of retrieved documents, we train the model to ignore those documents that don't help in answering the question, which we call distractor documents. RAFT accomplishes this by citing verbatim the right sequence from the relevant document that would help answer the question. This, coupled with RAFT's chain-of-thought-style responses, helps improve the model's ability to reason. In domain-specific RAG, RAFT consistently improves the model's performance across PubMed, HotpotQA, and Gorilla datasets, presenting a post-training recipe to improve pre-trained LLMs for in-domain RAG. RAFT's code and demo are open-sourced at https://github.com/ShishirPatil/gorilla.

1. Introduction

Trained on vast quantities of public data, Large Language Models (LLMs) have achieved significant advances in a wide range of general knowledge reasoning tasks (Brown et al., 2020; Wei et al., 2022).

However, LLMs are increasingly being employed in specialized domains to support tasks ranging from code completion for specific software frameworks to question answering on specific document collections (e.g., legal or medical documents). In these settings, general knowledge reasoning is less critical; instead, the primary goal is to maximize accuracy based on a given set of documents. Indeed, adapting LLMs to specialized domains (e.g., recent news, enterprise private documents, or program resources constructed after the training cutoff) is essential to many emerging applications (Vu et al., 2023; Lazaridou et al., 2022) and is the focus of this work.

This paper studies the following question: how to adapt pre-trained LLMs for Retrieval Augmented Generation (RAG) in specialized domains?

When it comes to adapting LLMs to specialized domains, we consider the following two candidates: in-context learning through Retrieval-Augmented Generation (RAG) and supervised fine-tuning. RAG-based methods allow the LLM to reference the documents when answering questions. However, these methods fail to leverage the learning opportunity afforded by the fixed-domain setting and early access to the test documents. Alternatively, supervised fine-tuning offers the opportunity to learn more general patterns in the documents and better align to end tasks and user preferences (Zhou et al., 2023a). However, existing fine-tuning-based approaches either fail to leverage the documents at test time (they don't incorporate RAG) or fail to account for the imperfections of the retrieval process during training.

We can draw an analogy to an open-book exam. Existing in-context retrieval methods are equivalent to taking an open-book exam without studying. Alternatively, existing fine-tuning-based approaches implement "studying" by either directly "memorizing" (Xiong et al., 2023) the input documents or answering practice questions (Wang et al., 2022) without referencing the documents. While these approaches leverage in-domain learning, they fail to prepare for the open-book nature of the test setting.

In this paper, we study how to combine supervised fine-tuning (SFT) with retrieval augmented generation (RAG). We propose a novel adaptation strategy, Retrieval-Augmented Fine Tuning (RAFT). RAFT specifically addresses the challenge of fine-tuning LLMs to incorporate domain knowledge while also improving in-domain RAG performance.

Figure 1: How best to prepare for an exam? (a) "Closed book" (bake in knowledge at train time): fine-tuning based approaches implement "studying" by either directly "memorizing" the input documents or answering practice QA without referencing the documents. (b) "Open book" (model can use external docs at test time): in-context retrieval methods fail to leverage the learning opportunity afforded by the fixed domain and are equivalent to taking an open-book exam without studying. While these approaches leverage in-domain learning, they fail to prepare for open-book tests. In contrast, our approach (c) RAFT (proposed: teach the model to use external docs at test time) leverages fine-tuning with question-answer pairs while referencing the documents in a simulated imperfect retrieval setting, thereby effectively preparing for the open-book exam setting.

RAFT aims not only to enable models to learn domain-specific knowledge through fine-tuning, but also to ensure robustness against inaccurate retrievals. This is achieved by training the models to understand the dynamics between the question posed (prompt), the domain-specific documents retrieved, and the appropriate answer. Going back to our analogy, our approach is analogous to studying for an open-book exam by learning to recognize relevant and irrelevant retrieved documents.

In RAFT, we train the model to answer the question (Q) from Document(s) (D∗) to generate an answer (A∗), where A∗ includes chain-of-thought reasoning (Wei et al., 2022; Anthropic, 2023), in the presence of distractor documents (Dk). We explain the methodology in detail in Section 3 and analyze the sensitivity to the number of distractor documents (k) at train and test time in Section 5. RAFT consistently outperforms supervised fine-tuning both with and without RAG across PubMed (Dernoncourt & Lee, 2017), HotpotQA (Yang et al., 2018), and the HuggingFace Hub, Torch Hub, and TensorFlow Hub Gorilla datasets (Patil et al., 2023), presenting a novel yet simple technique to improve pre-trained LLMs for in-domain RAG.

2. LLMs for Open-Book Exam

To understand our goal better, we expand on our analogy between training an LLM and the real-world setting of preparing for an exam.

Closed-Book Exam A closed-book exam refers to a scenario where the LLM does not have access to any additional documents or references to answer the questions during the exam. For LLMs, this is equivalent to the scenario, for example, in which the LLM is used as a chatbot. In this scenario, the LLM draws on the knowledge baked in during pre-training and supervised fine-tuning to respond to the prompt.

Open-Book Exam In contrast, we liken the open-book exam setting to the scenario in which the LLM can refer to external sources of information (e.g., a website or a book chapter). In such scenarios, the LLM is typically paired with a retriever which retrieves k documents (or specific segments of documents) that are appended to the prompt. It is only through these retrieved documents that the LLM gains access to "new knowledge". As a result, we argue that the LLM's performance in these settings, where it is trained as a general-purpose LLM, is largely dependent on the quality of the retriever and how accurately the retriever can identify the most relevant piece of information.

Domain-Specific Open-Book Exam In this paper, we focus on a narrower but increasingly popular setting than the general open-book exam, called the domain-specific open-book exam. In a domain-specific open-book exam, we know a priori the domain in which the LLM will be tested and used for inference. The LLM can respond to the prompt using any and all information from this specific domain, on which it has been fine-tuned. Examples of such domains include enterprise documents, the latest news, code repositories belonging to an organization, etc. In all these scenarios, the LLM will be used to respond to questions whose answers can be found within a collection of documents (a small, practical domain). The retrieval technique itself has little to no impact on the mechanism (though it may impact the accuracy). This paper mainly studies this domain-specific open-book setting and how to adapt a pre-trained LLM to this specific domain, including how to make it more robust to a varying number of retrieved documents and distractors.

3. RAFT

In this section, we present RAFT, a novel way of training LLMs for domain-specific open-book exams. We first introduce the classical technique of supervised fine-tuning, followed by the key takeaways from our experiments.

Figure 2: Overview of our RAFT method. The top-left figure depicts our approach of adapting LLMs to read solutions from a set of positive and negative documents, in contrast to the standard RAG setup where models are trained based on the retriever outputs, which is a mixture of both memorization and reading. At test time, all methods follow the standard RAG setting, provided with the top-k retrieved documents in the context.

Then, we introduce RAFT, a modified version of general instruction tuning. Lastly, we provide an overview of the experiments to expect in the later sections.

Supervised Finetuning

Consider the supervised fine-tuning (SFT) setting for a question-answer dataset. The formulation consists of a dataset (D) from which a set of question (Q) and corresponding answer (A) pairs are derived or already available. In the classical SFT setting, the model is trained to improve its ability to answer the questions based on its knowledge, obtained either during pre-training or during the SFT training phase. The model so trained can also be used at test time in the Retrieval Augmented Generation (RAG) setting, where additional documents can be introduced in the prompt to help the model answer the question. This can be represented as follows:

• Train: Q → A
• 0-shot Inference: Q → A
• RAG Inference: Q + D → A

RAFT

Retrieval Aware Fine-Tuning (RAFT) presents a novel recipe to prepare fine-tuning data to tailor the models for domain-specific open-book settings, equivalent to in-domain RAG. In RAFT, we prepare the training data such that each data point contains a question (Q), a set of documents (Dk), and a corresponding chain-of-thought-style answer (A∗) generated from one of the documents (D∗). We differentiate between two types of documents: 'oracle' documents (D∗), i.e., the documents from which the answer to the question can be deduced, and 'distractor' documents (Di) that do not contain answer-relevant information. As an implementation detail, the 'oracle' document doesn't need to be a single document, but can be more than one document, as is the case in HotpotQA (Yang et al., 2018). Then, for a P fraction of the questions (qi) in the dataset, we retain the oracle document (d∗i) along with distractor documents (dk−1). For the remaining (1 − P) fraction of the questions (qi), we include no oracle document and only include distractor documents (dk). We then fine-tune the language model using the standard supervised training (SFT) technique, training it to generate answers from the provided documents and questions. Fig. 2 illustrates the high-level design principle of RAFT.

We demonstrate that our approach trains the model to perform better RAG on the set of documents it is trained on, i.e., in-domain. By removing the oracle documents in some instances, we compel the model to memorize answers instead of deriving them from the context. The training data for RAFT is as follows, and an example of training data can be seen in Fig. 3:

• P% of data: Q + D∗ + D2 + . . . + Dk → A∗
• (1 − P)% of data: Q + D1 + D2 + . . . + Dk → A∗

Subsequently, for the test scenario, the model is provided with Q and the top-k documents retrieved by the RAG pipeline. Note that RAFT is independent of the retriever used.
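To make the data-assembly step concrete, the sketch below shows one way the P%/(1 − P)% mixing described above could be implemented. It is an illustrative reconstruction rather than the authors' released code: the function name, prompt layout, and defaults (p_oracle = 0.8, k = 5) are assumptions, and the chain-of-thought target A∗ is assumed to be generated separately (see Section 3 and Fig. 3).

import random

def build_raft_example(question, cot_answer, oracle_docs, distractor_pool,
                       p_oracle=0.8, k=5):
    """Assemble one RAFT training instance (illustrative sketch only)."""
    if random.random() < p_oracle:
        # Oracle document(s) plus enough distractors to reach k documents in total.
        n_distractors = max(k - len(oracle_docs), 0)
        docs = list(oracle_docs) + random.sample(distractor_pool, n_distractors)
    else:
        # No oracle at all: the model must fall back on memorized knowledge.
        docs = random.sample(distractor_pool, k)
    random.shuffle(docs)  # avoid leaking the oracle's position

    context = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    # The target is the chain-of-thought answer A* that quotes the oracle
    # document verbatim; standard SFT then trains on (prompt, completion).
    return {"prompt": prompt, "completion": cot_answer}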
A key factor in enhancing training quality is the generation of a reasoning process, such as chain-of-thought, to explain the provided answers. The RAFT approach is similar: we demonstrate that creating a full reasoning chain and, in addition, clearly citing sources enhances the model's accuracy in answering questions. In Fig. 3, we illustrate this set-up. Generating the training data in this fashion involves presenting the model with a question, context, and verified answers, and then requesting it to form a reasoning chain that appropriately references the original context.

For all the datasets in our experiments, we generate the answers using the technique described above. Note that the Gorilla APIBench dataset already includes reasoning in the answers. We provide an example of the generation step in Fig. 3: the detailed reasoning answer includes a citation from the original context inside ##begin_quote## and ##end_quote##, as well as a detailed explanation of how to reach the conclusion based on the citations. We demonstrate that adding detailed reasoning paragraphs helps boost the model's performance in our experiment section.

4. Evaluation

We design our experiments to study how well RAFT performs compared to various baselines. We find that the RAFT-7B model (a fine-tuned version of LLaMA-2) is better at reading and extracting information from in-domain documents than a domain-specific fine-tuned model or a general-purpose model with RAG. As an ablation, we also demonstrate how important it is for the model to learn with chain-of-thought responses. In this section, we first introduce all the datasets used in the experiments, then all the baseline models and fine-tuning techniques that we benchmark against.

4.1. Datasets

In our experiments, we use the following datasets to evaluate our model and all baselines. We selected these datasets to represent both popular and diverse domains, including Wikipedia, coding/API documents, and question answering on medical documents.

• Natural Questions (NQ) (Kwiatkowski et al., 2019), Trivia QA (Joshi et al., 2017) and HotpotQA (Yang et al., 2018) are open-domain question-answering datasets based on Wikipedia, mainly focused on common knowledge (e.g., movies, sports, etc.).

• HuggingFace, Torch Hub, and TensorFlow Hub are from APIBench (Patil et al., 2023), proposed in the Gorilla paper. These benchmarks measure how well models generate correct, functional, and executable API calls based on the documentation.

• PubMed QA (Jin et al., 2019) is a question-answering dataset tailored for biomedical-research question answering. It mainly focuses on answering medical and biology questions based on a given set of documents.

Note that the first category of datasets (NQ, Trivia QA, and HotpotQA) covers relatively general domains, whereas the latter two categories cover very domain-specific documents.

Baselines We consider the following baselines for our experiments:

• LLaMA2-7B-chat model with 0-shot prompting: this is the commonly used instruction-finetuned model for QA tasks, where we provide clearly written instructions but no reference documentation.

• LLaMA2-7B-chat model with RAG (Llama2 + RAG): similar to the previous setting, except here we include reference documents. This is a popular technique when dealing with domain-specific QA tasks.

• Domain-specific fine-tuning with 0-shot prompting (DSF): performing standard supervised fine-tuning without documents in context. We find that it is mostly useful to align the answering style of the model and to familiarize the model with the domain context.

• Domain-specific fine-tuning with RAG (DSF + RAG): equip a domain-specifically fine-tuned model with external knowledge using RAG. So, for the "knowledge" the model does not know, it can still refer to the context.

4.2. Results

Using the above datasets and baselines, we evaluate our model RAFT and demonstrate the effectiveness of RAFT in Tab. 1. We see that RAFT consistently and significantly outperforms the baselines. Compared with the base Llama-2 instruction-tuned model, RAFT with RAG does much better in terms of extracting information as well as being robust towards distractors. The gain can be as big as 35.25% on HotpotQA and 76.35% on the Torch Hub evaluation. Compared with DSF on the specific datasets, our model does better at relying on the provided context to solve the problem. RAFT does much better on tasks like the HotpotQA and HuggingFace datasets (30.87% on HotpotQA and 31.41% on HuggingFace). Note that for PubMed QA, since it consists of binary yes/no questions, we don't observe significant gains when we compare our model with DSF + RAG. Even compared with a much larger and better model, GPT-3.5, RAFT demonstrates significant advantages.

Figure 3: RAFT prompt to help the LLM evaluate its own generated reasoning and answers, contrasting them with the correct reasoning and answers. The LLM is prompted to identify errors in its reasoning and extract key insights for improvement. This figure specifically represents the 'GenerateExplanation' step in the RAFT algorithm (Section 3).
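The figure itself is not reproduced in this copy. As a rough, hypothetical sketch of the answer-generation step described in Section 3 (the exact prompt wording in the paper may differ), the instruction sent to the generator model might look like the following; the template text and helper name are assumptions.

# Hypothetical wording; the paper's actual generation prompt (Figure 3) may differ.
COT_GENERATION_TEMPLATE = """\
Question: {question}

Context: {context}

Answer the question using only the context above. First write a step-by-step
reasoning chain that quotes the supporting passage verbatim between
##begin_quote## and ##end_quote##, then state the final answer after ##Answer:.
"""

def make_cot_generation_prompt(question: str, context: str) -> str:
    # The filled-in prompt is sent to a strong LLM (the paper uses GPT-4-1106)
    # to produce the chain-of-thought training answer A*.
    return COT_GENERATION_TEMPLATE.format(question=question, context=context)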

Table 1: RAFT improves RAG performance for all specialized domains: Across PubMed, HotpotQA, HuggingFace, Torch Hub, and TensorFlow Hub, we see that domain-specific fine-tuning significantly improves the performance of the base model, but RAFT consistently outperforms the existing domain-specific fine-tuning method with or without RAG. This suggests the need to train the model with context. We compare our model with LLaMA fine-tuning recipes, and provide GPT-3.5 for reference.

Model               PubMed   HotpotQA   HuggingFace   Torch Hub   TensorFlow Hub

GPT-3.5 + RAG        71.60     41.5        29.08         60.21         65.59
LLaMA2-7B            56.5       0.54        0.22          0             0
LLaMA2-7B + RAG      58.8       0.03       26.43          8.60         43.06
DSF                  59.7       6.38       61.06         84.94         86.56
DSF + RAG            71.6       4.41       42.59         82.80         60.29
RAFT (LLaMA2-7B)     73.30     35.28       74.00         84.95         86.86

Overall, the LLaMA2-7B model, both with and without RAG, performs poorly due to its answering style not aligning with the ground truth. By applying domain-specific tuning, we significantly enhance its performance. This process enables the model to learn and adopt the appropriate style of answering. However, introducing RAG to a domain-specifically fine-tuned (DSF) model doesn't invariably lead to better outcomes. This might indicate that the model lacks training in processing context and extracting useful information from it. By incorporating our method, RAFT, we train the model not only to match its answering style with that required, but also to improve its document-processing capabilities. Consequently, our approach outperforms all others.

4.3. Effect of CoT

We also conduct an analysis to evaluate the effectiveness of the chain-of-thought approach in enhancing the model's performance. As indicated in Table 2, simply providing the answer to a question may not always be adequate. This approach can lead to a rapid decrease in loss, causing the training process to diverge. Incorporating a reasoning chain that not only guides the model to the answer but also enriches the model's understanding can improve overall accuracy. In our experiments, integrating chain-of-thought significantly enhances training robustness. We employ GPT-4-1106 to generate our chain-of-thought prompts and include an example of the prompt we used in Figure 3.

4.4. Qualitative Analysis

To illustrate the potential advantages of RAFT over the domain-specifically fine-tuned (DSF) approach, we present a comparative example in Figure 4. This example qualitatively demonstrates a scenario where the DSF model becomes confused by a question asking for the identity of a screenwriter. Instead of providing the correct name, it mistakenly cites one of the films written by the screenwriter. In contrast, the RAFT model accurately answers the question. This discrepancy suggests that training a model solely with question-answer pairs may impair its ability to derive relevant context from provided documents. The comparison underscores the importance of incorporating both standard instruction tuning and context comprehension into the training dataset to preserve and enhance the model's ability to process text effectively.

4.5. Should we always train the LLM with the oracle context for RAG?

In our exploration of whether large language models (LLMs) should always be trained with the oracle context for Retrieval-Augmented Generation (RAG), we address a key question: what proportion (P%) of the training data should include oracle documents? Intuitively, one might assume that for effective training in reading and extracting information from context (e.g., RAG tasks), the oracle document should always be included during training (P = 100%). However, our findings challenge this assumption: incorporating a portion of the training data without the oracle document in the context (P = 80%) appears to enhance the model's performance on RAG tasks.

Fig. 5 presents our investigation into the hyperparameter P%, which represents the percentage of training instances that should include oracle documents.


Table 2: Ablation on Chain-of-Thought: results for RAFT and RAFT without CoT. Results on various datasets show that adding CoT can significantly improve the performance of the fine-tuned model, with gains of 9.66% and 14.93% on the HotpotQA and HuggingFace datasets respectively.

Model           PubMed   HotpotQA   HuggingFace   Torch Hub   TensorFlow Hub

RAFT w/o CoT     68.30     25.62       59.07         86.56         83.21
RAFT             73.30     35.28       74.00         84.95         86.86

HotpotQA

Question: What screenwriter with credits for "Evolution" co-wrote a film starring Nicolas Cage and Téa Leoni?
Documents: . . . David Weissman is a screenwriter and director. His film credits include "The Family Man" (2000), "Evolution" (2001), and "When in Rome" (2010).
...
The Family Man is a 2000 American romantic comedy-drama film directed by Brett Ratner, written by David Diamond and David Weissman, and starring Nicolas Cage and Téa Leoni. . . .
------------------------------------------------------------------------
RAFT: ##Reason: The screenwriter with credits for the film "Evolution," starring Nicolas Cage and Téa Leoni, is David Weissman. This information is provided in the reference documents, which mention David Weissman as a screenwriter with film credits including "The Family Man" (2000), "Evolution" (2001), and "When in Rome" (2010). Therefore, the screenwriter for "Evolution" is David Weissman. ##Answer: David Weissman ✓
DSF: The Family Man ✗

Figure 4: Comparison of RAFT and DSF: We prompt RAFT and DSF fine-tuned models on the HotpotQA dataset. We can see that the DSF model extracts the wrong information from the context. For the question asking who the screenwriter is, it responds with a film name. RAFT manages to answer the question correctly.
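Since RAFT responses end with an explicit ##Answer: field (as in the example above), downstream scoring can simply strip the reasoning away. The helper below is an assumed evaluation detail, not taken from the paper's code.

def extract_final_answer(model_output: str) -> str:
    """Return the text after '##Answer:' in a RAFT-style response (assumed format)."""
    marker = "##Answer:"
    if marker in model_output:
        return model_output.split(marker, 1)[1].strip()
    return model_output.strip()  # fall back to the raw output

# extract_final_answer("##Reason: ... ##Answer: David Weissman") -> "David Weissman"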

Our analysis reveals that the optimal proportion varies across datasets, with figures of 40%, 60%, and 100% depending on the dataset. This indicates that training your LLM without the correct corresponding context at times can be beneficial for the downstream task of answering questions related to the documents. In our training setup, we include four distractor documents alongside the oracle document, and at test time we maintain this format by providing the oracle document with four distractors. Our findings suggest that, for domain-specific RAG tasks, including a certain percentage of training data without the oracle documents in the context proves to be advantageous.

5. RAFT Generalizes to Top-K RAG

After demonstrating the performance of RAFT on various benchmarks, we now study another important question: how does the number of distractor documents in RAFT affect the model's performance when augmented with top-k retrieval augmented generation (RAG) results during evaluation? Previous research has highlighted the vulnerability of LLMs to irrelevant text (see Shi et al., 2023a; Weston & Sukhbaatar, 2023; Liu et al., 2023b). This issue is particularly critical for LLMs + RAG since top-k RAG is frequently employed at test time to ensure high recall. Such a scenario necessitates that the model have the ability to discern and disregard irrelevant content, focusing solely on pertinent information.

5.1. Making the Model Robust to Top-K RAG

To tackle the challenge of enhancing large language models' (LLMs) ability to sift through irrelevant text within the retrieval pipeline, our analysis revealed that training solely with oracle (highly relevant) documents can inadvertently diminish the model's ability to discern and disregard irrelevant information. To address this, our algorithm, RAFT, adopts a strategy that integrates oracle documents with a mix of irrelevant ones. This methodology prompts us to investigate the ideal fraction of negative (irrelevant) documents to incorporate throughout the training process, and to assess how well this training approach adapts to different volumes of documents encountered by the Retrieval-Augmented Generation (RAG) pipeline during the test phase. Our aim is to refine the balance between relevant and irrelevant information to strengthen the model's efficiency in identifying and utilizing pertinent content. Note that Sec. 4.5 looked at what percentage P% of the training data should include distractors, while in this section we study test-time scenarios.


Figure 5: How many golden documents to involve? We study the hyperparameter P%, which indicates what fraction of the training data contains the oracle document(s) in its context. Results on NQ, TQA and HotpotQA suggest that mixing in a fraction of data that does not have the oracle document in its context is helpful for in-domain RAG. (Each panel plots final accuracy against the percentage of training data with the golden retrieved context, for the test domains NQ, TQA, and HotpotQA.)

Training with Negative Documents To enhance the robustness of large language models (LLMs) against irrelevant text in retrieved documents, we adopted a fine-tuning approach that incorporates both golden (highly relevant) documents and distractor (irrelevant) documents. The model was trained with varying numbers of distractor documents, but consistently evaluated using the top-k documents obtained from the retriever (not to be confused with P).

Our findings, detailed in Fig. 6, reveal that fine-tuning with only the oracle document frequently results in inferior performance compared to configurations that include a greater number of distractor documents. As we can see in the figure, the best performance for Natural Questions comes from training with D∗ + 3D, and with D∗ + 1D documents for HotpotQA. This insight has been particularly beneficial for our algorithm, RAFT. In our experiments, we typically employ a training setup consisting of one oracle document alongside four distractor documents. This approach strikes a balance, ensuring the model is not overwhelmed by distractors while still gaining the ability to effectively discern and prioritize relevant information.

Generalization to a variable number of test-time documents. We extended our research to examine the impact of different quantities of test-time documents on the model's performance. Specifically, our experiments focused on assessing how models trained with varying numbers of distractor documents respond to changes in the number of documents presented at test time.

The results, illustrated in Fig. 6, confirm that the inclusion of distractor documents during training indeed makes the model more resilient to fluctuations in the number of documents encountered during testing. This ability to maintain consistent performance despite variations in test-time document counts further validates the robustness of our approach, RAFT. This finding underscores the importance of a well-calibrated training environment to prepare the model for the range of scenarios it may encounter in real-world applications.

6. Related Works

Retrieval-Augmented Language Models RAG enhances language models by integrating a retrieval module that sources relevant information from external knowledge bases, significantly improving performance across various NLP tasks, including language modeling (Guu et al., 2020; Borgeaud et al., 2022; Khandelwal et al., 2019; Shi et al., 2023d; Lin et al., 2023b; Shi et al., 2023c; Asai et al., 2023; Xu et al., 2023; Wang et al., 2023) and open-domain question answering (Izacard et al., 2023; Lewis et al., 2020). This integration follows a "retrieve-and-read" paradigm where the retrieval module provides additional context from external sources, which the LM then uses to generate the final output. The retrieval process involves using the input as a query to fetch documents, which the LM incorporates for final predictions. For instance, Atlas (Izacard et al., 2023) fine-tunes T5 models with the retriever, treating documents as latent variables, while RETRO (Borgeaud et al., 2022) modifies the decoder-only architecture to include retrieved texts and conducts pre-training from scratch. kNN-LM (Khandelwal et al., 2019) interpolates between the LM's next-token distribution and distributions computed from retrieved tokens at inference. Other works (Shi et al., 2023d; Ram et al., 2023) assume black-box access to an LM and combine it with either an off-the-shelf or fine-tuned retriever.

Memorization A key question around large neural language models is whether they truly "understand" text (Feldman, 2020; Power et al., 2022) or simply rely on surface-pattern memorization (Carlini et al., 2019; Tänzer et al., 2022). Feldman (2020) and Carlini et al. (2019; 2022) develop methodologies to quantify the extent of memorization in neural models. Brown et al. (2020), Power et al. (2022), and Liu et al. (2022b) further explored how memorization impacts models' generalization capabilities. Recently, seminal work (Carlini et al., 2021; Shi et al., 2023b) demonstrated the ability of language models to memorize and regurgitate training data, raising significant privacy concerns (Kandpal et al., 2022; Pan et al., 2020).


Figure 6: Varying the number of test-time documents: We study how robust RAFT is to varying numbers of test-time documents that a retriever might provide. (Each panel plots final accuracy against the number of test documents, top-k, for models trained with D*, D* + 1D, D* + 2D, and D* + 3D documents, on Natural Questions and HotpotQA.) For NQ, we find that training with 4 documents leads to the best performance, but training with 2 documents is optimal for HotpotQA. However, across both datasets, training only on data consisting of oracle documents hurts performance.
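As a sketch of the test-time sweep behind Figure 6 (the model, retriever, and scoring interfaces below are assumptions, not the paper's actual evaluation harness), the robustness check amounts to a double loop over training configurations and retrieval depths:

def topk_robustness_eval(models, eval_set, retrieve, ks=(2, 4, 6, 8, 10)):
    """Evaluate each fine-tuned model while varying the number of test-time documents.

    `models` maps a training configuration (e.g. "D*+3D") to a model object,
    `eval_set` is a list of (question, gold_answer) pairs, and `retrieve`
    returns the top-k documents for a question. All interfaces are assumed.
    """
    results = {}
    for config_name, model in models.items():
        for k in ks:
            correct = 0
            for question, gold in eval_set:
                docs = retrieve(question, top_k=k)        # top-k retrieved context
                answer = model.generate(question, docs)   # RAG-style inference
                correct += int(answer.strip() == gold.strip())
            results[(config_name, k)] = correct / len(eval_set)
    return results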

Finetuning of LLMs Recent years have seen rapid progress in developing large-scale language models (LLMs) (Brown et al., 2020; OpenAI, 2023; Workshop et al., 2022; Touvron et al., 2023; Anil et al., 2023). To adapt these foundation models to downstream tasks, fine-tuning (Mishra et al., 2021; Sanh et al., 2021; Chung et al., 2022; Muennighoff et al., 2023; Zhou et al., 2023b; Lin et al., 2023b; Ji et al., 2024) has become a prevalent approach. Traditional supervised fine-tuning may be limited by the cost and compute required for adapting LLMs. Addressing these challenges, research in the realm of parameter-efficient fine-tuning (Houlsby et al., 2019), such as Prompt Tuning (Lester et al., 2021), Prefix-Tuning (Li & Liang, 2021), P-Tuning (Liu et al., 2022a) and low-rank based fine-tuning (Hu et al., 2021), has gained traction. These methods enable LLMs to acquire domain-specific knowledge and adapt to specialized tasks such as question answering, summarization, and dialogue generation. Another branch of fine-tuning is RLHF (Ouyang et al., 2022; Rafailov et al., 2023; Liu et al., 2023a; Zhang et al., 2023), which adopts RL to align the LLM's preferences with humans'.

Finetuning for RAG More recently, several papers have explored the idea of fine-tuning a pretrained LLM to be better at RAG tasks (Lin et al., 2023a; Wang et al., 2023; Xu et al., 2023; Liu et al., 2024). These works focus on constructing a combination of fine-tuning datasets for RAG and training a model to perform well on these tasks. In particular, in their settings, at test time the domain or documents can be different from those seen at training time, whereas our paper studies the opposite scenario, where we only care about testing the LLM on the same set of documents.

7. Conclusion

RAFT is a training strategy designed to enhance the model's performance in answering questions within a specific domain in "open-book" settings. This technique demonstrates a fine-tuning recipe for LLMs for question-answering tasks based on a selected collection of documents. We have pinpointed several crucial design decisions, such as training the model alongside distractor documents, organizing the dataset so a portion lacks oracle documents in their context, and formulating answers in a chain-of-thought manner with direct quotations from the relevant text. Our evaluations on PubMed, HotpotQA, and Gorilla API Bench underline RAFT's significant potential. Looking forward, we anticipate that in-domain Retrieval-Augmented Generation (RAG) will continue to gain interest within both industrial and academic spheres. Unlike general-purpose RAG, our work addresses practical scenarios where LLMs are tasked with answering questions using domain-specific knowledge. Aligning with current trends, our findings suggest that smaller, fine-tuned models are capable of performing comparably well on domain-specific question-answering tasks, in contrast to their generic LLM counterparts.

References

Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.

Anthropic. Prompt engineering for Claude's long context window. 2023.

Asai, A., Wu, Z., Wang, Y., Sil, A., and Hajishirzi, H. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023.

Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., Van Den Driessche, G. B., Lespiau, J.-B., Damoc, B., Clark, A., et al. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, pp. 2206–2240. PMLR, 2022.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Carlini, N., Liu, C., Erlingsson, Ú., Kos, J., and Song, D. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pp. 267–284, 2019.

Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650, 2021.

Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., and Zhang, C. Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations, 2022.

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.

Dernoncourt, F. and Lee, J. Y. PubMed 200k RCT: a dataset for sequential sentence classification in medical abstracts. arXiv preprint arXiv:1710.06071, 2017.

Feldman, V. Does learning require memorization? A short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 954–959, 2020.

Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pp. 3929–3938. PMLR, 2020.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pp. 2790–2799. PMLR, 2019.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., and Grave, E. Atlas: Few-shot learning with retrieval augmented language models. Journal of Machine Learning Research, 24(251):1–43, 2023. URL http://jmlr.org/papers/v24/23-0037.html.

Ji, C. C.-J., Mao, H., Yan, F., Patil, S. G., Zhang, T., Stoica, I., and Gonzalez, J. E. Gorilla OpenFunctions v2. 2024.

Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W., and Lu, X. PubMedQA: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019.

Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.

Kandpal, N., Wallace, E., and Raffel, C. Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning, pp. 10697–10707. PMLR, 2022.

Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., and Lewis, M. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019.

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., et al. Natural Questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.

Lazaridou, A., Gribovskaya, E., Stokowiec, W., and Grigorev, N. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115, 2022.

Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

Li, X. L. and Liang, P. Prefix-Tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
Lin, X. V., Chen, X., Chen, M., Shi, W., Lomeli, M., James, R., Rodriguez, P., Kahn, J., Szilvasy, G., Lewis, M., et al. RA-DIT: Retrieval-augmented dual instruction tuning. arXiv preprint arXiv:2310.01352, 2023a.

Lin, X. V., Chen, X., Chen, M., Shi, W., Lomeli, M., James, R., Rodriguez, P., Kahn, J., Szilvasy, G., Lewis, M., et al. RA-DIT: Retrieval-augmented dual instruction tuning. arXiv preprint arXiv:2310.01352, 2023b.

Liu, H., Sferrazza, C., and Abbeel, P. Chain of hindsight aligns language models with feedback. arXiv preprint arXiv:2302.02676, 2023a.

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023b.

Liu, X., Ji, K., Fu, Y., Tam, W., Du, Z., Yang, Z., and Tang, J. P-Tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 61–68, 2022a.

Liu, Z., Kitouni, O., Nolte, N. S., Michaud, E., Tegmark, M., and Williams, M. Towards understanding grokking: An effective theory of representation learning. Advances in Neural Information Processing Systems, 35:34651–34663, 2022b.

Liu, Z., Ping, W., Roy, R., Xu, P., Shoeybi, M., and Catanzaro, B. ChatQA: Building GPT-4 level conversational QA models. arXiv preprint arXiv:2401.10225, 2024.

Mishra, S., Khashabi, D., Baral, C., and Hajishirzi, H. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773, 2021.

Muennighoff, N., Wang, T., Sutawika, L., Roberts, A., Biderman, S., Le Scao, T., Bari, M. S., Shen, S., Yong, Z. X., Schoelkopf, H., Tang, X., Radev, D., Aji, A. F., Almubarak, K., Albanie, S., Alyafeai, Z., Webson, A., Raff, E., and Raffel, C. Crosslingual generalization through multitask finetuning. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15991–16111, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.891. URL https://aclanthology.org/2023.acl-long.891.

OpenAI. GPT-4 technical report, 2023.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Pan, X., Zhang, M., Ji, S., and Yang, M. Privacy risks of general-purpose language models. In 2020 IEEE Symposium on Security and Privacy (SP), pp. 1314–1331. IEEE, 2020.

Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334, 2023.

Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022.

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.

Ram, O., Levine, Y., Dalmedigos, I., Muhlgay, D., Shashua, A., Leyton-Brown, K., and Shoham, Y. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083, 2023.

Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021.

Shi, F., Chen, X., Misra, K., Scales, N., Dohan, D., Chi, E. H., Schärli, N., and Zhou, D. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pp. 31210–31227. PMLR, 2023a.

Shi, W., Ajith, A., Xia, M., Huang, Y., Liu, D., Blevins, T., Chen, D., and Zettlemoyer, L. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789, 2023b.

Shi, W., Min, S., Lomeli, M., Zhou, C., Li, M., Lin, V., Smith, N. A., Zettlemoyer, L., Yih, S., and Lewis, M. In-context pretraining: Language modeling beyond document boundaries. arXiv preprint arXiv:2310.10638, 2023c.

Shi, W., Min, S., Yasunaga, M., Seo, M., James, R., Lewis, M., Zettlemoyer, L., and Yih, W.-t. REPLUG: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652, 2023d.

Tänzer, M., Ruder, S., and Rei, M. Memorisation versus generalisation in pre-trained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7564–7578, 2022.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Vu, T., Iyyer, M., Wang, X., Constant, N., Wei, J., Wei, J., Tar, C., Sung, Y.-H., Zhou, D., Le, Q., et al. FreshLLMs: Refreshing large language models with search engine augmentation. arXiv preprint arXiv:2310.03214, 2023.

Wang, B., Ping, W., McAfee, L., Xu, P., Li, B., Shoeybi, M., and Catanzaro, B. InstructRetro: Instruction tuning post retrieval-augmented pretraining. arXiv preprint arXiv:2310.07713, 2023.

Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

Weston, J. and Sukhbaatar, S. System 2 attention (is something you might need too). arXiv preprint arXiv:2311.11829, 2023.

Workshop, B., Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., et al. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.

Xiong, W., Liu, J., Molybog, I., Zhang, H., Bhargava, P., Hou, R., Martin, L., Rungta, R., Sankararaman, K. A., Oguz, B., et al. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039, 2023.

Xu, P., Ping, W., Wu, X., McAfee, L., Zhu, C., Liu, Z., Subramanian, S., Bakhturina, E., Shoeybi, M., and Catanzaro, B. Retrieval meets long context large language models. arXiv preprint arXiv:2310.03025, 2023.

Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.

Zhang, T., Liu, F., Wong, J., Abbeel, P., and Gonzalez, J. E. The wisdom of hindsight makes language models better instruction followers. arXiv preprint arXiv:2302.05206, 2023.

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al. LIMA: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023a.

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al. LIMA: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023b.