RAFT: Adapting Language Model to Domain Specific RAG
Tianjun Zhang Shishir G. Patil Naman Jain Sheng Shen Matei Zaharia Ion Stoica Joseph E. Gonzalez
[email protected], [email protected]
UC Berkeley
[Figure 1 panel labels: (a) bake in knowledge at train time; (b) model can use external docs at test time; (c) teach model to use external docs at test time.]
Figure 1: How best to prepare for an exam? (a) Fine-tuning based approaches implement "studying" by either directly "memorizing" the input documents or answering practice questions without referencing the documents; while these approaches leverage in-domain learning, they fail to prepare for open-book tests. (b) Alternatively, in-context retrieval methods fail to leverage the learning opportunity afforded by the fixed domain and are equivalent to taking an open-book exam without studying. In contrast, our approach (c) RAFT leverages fine-tuning with question-answer pairs while referencing the documents in a simulated imperfect retrieval setting, thereby effectively preparing for the open-book exam setting.
performance. RAFT aims not only to enable models to learn domain specific knowledge through fine-tuning, but also to ensure robustness against inaccurate retrievals. This is achieved by training the models to understand the dynamics between the question posed (prompt), the domain specific documents retrieved, and the appropriate answer. Going back to our analogy, our approach is analogous to studying for an open-book exam by learning to recognize relevant and irrelevant retrieved documents.

In RAFT, we train the model to answer the question (Q) from Document(s) (D*) to generate an answer (A*), where A* includes chain-of-thought (Wei et al., 2022; Anthropic, 2023), in the presence of distractor documents (Dk). We explain the methodology in detail in Section 3 and analyze the sensitivity to the number of distractor documents (k) at train- and test-time in Section 5. RAFT consistently outperforms supervised fine-tuning, both with and without RAG, across PubMed (Dernoncourt & Lee, 2017), HotpotQA (Yang et al., 2018), and the HuggingFace Hub, Torch Hub, and TensorFlow Hub Gorilla datasets (Patil et al., 2023), presenting a novel yet simple technique to improve pre-trained LLMs for in-domain RAG.

2. LLMs for Open-Book Exam

To understand our goal better, we expand on the analogy between training an LLM and the real-world setting of preparing for an exam.

Closed-Book Exam A closed book exam often refers to a scenario where the LLM does not have access to any additional documents or references to answer the questions during the exam. For LLMs, this is equivalent to the scenario, for example, in which the LLM is used as a chatbot. In this scenario, the LLM draws from the knowledge baked in during pre-training and supervised finetuning to respond to the prompt.

Open Book Exam In contrast, we liken the open-book exam setting to the scenario in which the LLM can refer to external sources of information (e.g., a website or a book chapter). In such scenarios, the LLM is typically paired with a retriever which retrieves 'k' documents (or specific segments of the documents) which are appended to the prompt. It is only through these retrieved documents that the LLM gains access to "new knowledge". As a result, we argue that the LLM's performance in these settings, where it is trained as a general-purpose LLM, is largely dependent on the quality of the retriever and how accurately the retriever can identify the most relevant piece of information.

Domain Specific Open-Book Exam In this paper, we focus on a narrower but increasingly popular setting than the general open-book exam, called the domain specific open-book exam. In domain specific open-book exams, we know a priori the domain in which the LLM will be tested and used for inference. The LLM can respond to the prompt using any and all information from this specific domain, which it has been fine-tuned on. Examples of such domains include enterprise documents, latest news, code repositories belonging to an organization, etc. In all these scenarios, the LLM will be used to respond to questions whose answers can be found within a collection of documents (a small practical domain). The retrieval technique itself has little to no impact on the mechanism (though it may impact the accuracy). This paper mainly studies this domain specific open-book setting and how to adapt a pre-trained LLM to the specific domain, including how to make it more robust to a varying number of retrieved documents and distractors.

3. RAFT

In this section, we present RAFT, a novel way of training LLMs for domain specific open-book exams. We first introduce the classical technique of supervised fine-tuning, followed by the key takeaways from our experiments. Then, we introduce RAFT, a modified version of general instruction tuning. Lastly, we provide an overview of the experiments to expect in the later sections.
Figure 2: Overview of our RAFT method. The top-left figure depicts our approach of adapting LLMs to read the solution from a set of positive and negative documents, in contrast to the standard RAG setup where models are trained on the retriever outputs, which is a mixture of both memorization and reading. At test time, all methods follow the standard RAG setting and are provided with the top-k retrieved documents in the context.
Supervised Finetuning

Consider the supervised fine-tuning (SFT) setting for a Question-Answer dataset. The formulation consists of a dataset (D) from which a set of question (Q) and corresponding answer (A) pairs are derived or already available. In the classical SFT setting, the model is trained to improve its ability to answer the questions based on its knowledge, obtained either during pre-training or during the SFT training phase. The model so trained can also be used at test time in the Retrieval Augmented Generation (RAG) setting, where additional documents can be introduced in the prompt to help the model answer the question. This can be represented as follows:

• Train: Q → A
• 0-shot Inference: Q → A
• RAG Inference: Q + D → A
RAFT

Retrieval Aware Fine-Tuning (RAFT) presents a novel recipe to prepare fine-tuning data that tailors the model to domain specific open-book settings, equivalent to in-domain RAG. In RAFT, we prepare the training data such that each data point contains a question (Q), a set of documents (Dk), and a corresponding Chain-of-Thought style answer (A∗) generated from one of the documents (D∗). We differentiate between two types of documents: 'oracle' documents (D∗), i.e. the documents from which the answer to the question can be deduced, and 'distractor' documents (Di) that do not contain answer-relevant information. As an implementation detail, the 'oracle' document doesn't need to be a single document, but can be more than one document, as is the case in HotpotQA (Yang et al., 2018). Then, for a P fraction of the questions (qi) in the dataset, we retain the oracle document (d∗i) along with the distractor documents (dk−1). For the remaining (1 − P) fraction of the questions (qi), we include no oracle document and only include distractor documents (dk). We then fine-tune the language model using the standard supervised training (SFT) technique, training it to generate answers from the provided documents and questions. Fig. 2 illustrates the high-level design principle of RAFT.

We demonstrate that our approach trains the model to perform better RAG on the set of documents it is trained on, i.e., in-domain. By removing the oracle documents in some instances, we are compelling the model to memorize answers instead of deriving them from the context. The training data for RAFT is as follows, and an example of the training data can be seen in Fig. 3:

• P% of data: Q + D∗ + D2 + . . . + Dk → A∗
• (1 − P)% of data: Q + D1 + D2 + . . . + Dk → A∗

Subsequently, for the test scenario, the model is provided with the question Q and the top-k documents retrieved by the RAG pipeline. Note that RAFT is independent of the retriever used.
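To make this recipe concrete, below is a minimal sketch of how such training examples could be assembled. It is our own illustration rather than the authors' released pipeline; the prompt layout, the `make_cot_answer` helper, and the defaults (P = 0.8, four distractors) are assumptions drawn from the description above.

```python
import random

def build_raft_example(question, oracle_docs, distractor_pool,
                       make_cot_answer, p_oracle=0.8, num_distractors=4):
    """Assemble one RAFT-style training example (illustrative sketch).

    With probability p_oracle, the oracle document(s) stay in the context
    alongside sampled distractors; otherwise the context holds distractors
    only, so the model must fall back on memorized domain knowledge.
    """
    if random.random() < p_oracle:
        # Oracle document(s) plus sampled distractors.
        context = list(oracle_docs) + random.sample(distractor_pool, num_distractors)
    else:
        # Distractor-only context: no document in the prompt answers the question.
        context = random.sample(distractor_pool, num_distractors + 1)
    random.shuffle(context)  # avoid positional shortcuts

    # The target is always a chain-of-thought answer grounded in the oracle
    # document(s), even when they are absent from the context.
    prompt = "\n\n".join(f"Document: {d}" for d in context)
    prompt += f"\n\nQuestion: {question}\nAnswer:"
    return {"prompt": prompt, "completion": make_cot_answer(question, oracle_docs)}
```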
A key factor in enhancing training quality is the generation of a reasoning process, such as Chain-of-Thought, to explain the provided answers. RAFT's approach is similar: we demonstrate that creating a full reasoning chain and, in addition, clearly citing sources enhances the model's accuracy in answering questions. In Fig. 3, we illustrate this set-up. Generating the training data in this fashion involves presenting the model with a question, context, and verified answers, and then requesting it to form a reasoning chain that appropriately references the original context.

For all the datasets in our experiments, we generate the answers using the technique described above. Note that the Gorilla APIBench dataset already includes reasoning in the answers. We provide an example of the generation step in Fig. 3: the detailed reasoning answer includes a citation from the original context inside ##begin_quote## and ##end_quote##, as well as a detailed explanation of how to reach the conclusion based on the citations. We demonstrate that adding detailed reasoning paragraphs helps boost the model's performance in our experiment section.
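As a rough sketch of this generation step, the function below builds a prompt asking a strong teacher model to produce a reasoning chain that quotes the context between the ##begin_quote## and ##end_quote## markers. The instruction wording is illustrative, not the authors' exact prompt, and `complete` is a placeholder for whichever LLM API is used.

```python
def generate_cot_answer(question, context, complete):
    """Ask a teacher model for a citation-grounded chain-of-thought answer.

    `complete` is any callable mapping a prompt string to a completion,
    for example a thin wrapper around a GPT-4 API call.
    """
    prompt = (
        f"Question: {question}\n"
        f"Context: {context}\n\n"
        "Answer the question using only the context above. First give a "
        "step-by-step explanation, copying every supporting sentence "
        "verbatim between ##begin_quote## and ##end_quote##. Then state "
        "the final answer after the tag ##Answer:."
    )
    return complete(prompt)
```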
4. Evaluation

We design our experiments to study how well RAFT performs compared to various baselines. We find that the RAFT-7B model (a finetuned version of LlaMA-2) is better at reading and extracting information from in-domain documents than a domain specific finetuned model and a general-purpose model with RAG. As an ablation, we also demonstrate how important it is for the model to learn with Chain-of-Thought responses. In this section, we first introduce all the datasets used in the experiments, then all the baseline models and fine-tuning techniques that we benchmark against.

4.1. Datasets

In our experiments, we use the following datasets to evaluate our model and all baselines. We selected these datasets to represent both popular and diverse domains, including Wikipedia, coding/API documents, and question-answering on medical documents.

• Natural Questions (NQ) (Kwiatkowski et al., 2019), Trivia QA (Joshi et al., 2017) and HotpotQA (Yang et al., 2018) are open-domain question-answering datasets based on Wikipedia, mainly focused on common knowledge (e.g., movies, sports, etc.).

• HuggingFace, Torch Hub, and TensorFlow Hub are from APIBench (Patil et al., 2023), proposed in the Gorilla paper. These benchmarks measure how well a model generates correct, functional, and executable API calls based on the documentation.

• PubMed QA (Jin et al., 2019) is a question-answering dataset tailored to biomedical-research question answering. It mainly focuses on answering medical and biology questions based on a given set of documents.

Note that the first category of datasets (NQ, Trivia QA, and HotpotQA) covers a relatively general domain, whereas the latter two are on very domain specific documents.

Baselines We consider the following baselines for our experiments:

• LlaMA2-7B-chat model with 0-shot prompting: this is the commonly used instruction-finetuned model for QA tasks, where we provide clearly written instructions, but no reference documentation.

• LlaMA2-7B-chat model with RAG (Llama2 + RAG): similar to the previous setting, except here we include reference documents. This is a popular technique when dealing with domain specific QA tasks.

• domain specific Finetuning with 0-shot prompting (DSF): performing standard supervised finetuning without documents in the context. We find that it is mostly useful to align the answering style of the model as well as to familiarize it with the domain context.

• domain specific Finetuning with RAG (DSF + RAG): equip a domain specific finetuned model with external knowledge using RAG. So, for the "knowledge" the model does not know, it can still refer to the context.

4.2. Results

Using the above datasets and baselines, we evaluate our model RAFT and demonstrate the effectiveness of RAFT in Tab. 1. We see that RAFT consistently and significantly outperforms the baselines. Compared with the base Llama-2 instruction-tuned model, RAFT with RAG does much better in terms of extracting information as well as being robust towards distractors. The gain can be as big as 35.25% on HotpotQA and 76.35% on the Torch Hub evaluation. Compared with DSF on the specific dataset, our model does better at relying on the provided context to solve the problem. RAFT does much better on tasks like the HotpotQA and HuggingFace datasets (30.87% on HotpotQA and 31.41% on HuggingFace). Note that for PubMed QA, since it is a binary yes/no question dataset, we don't observe significant gains when we compare our model with DSF + RAG. Even compared with a much larger and better model, GPT-3.5, RAFT demonstrates significant advantages.
Figure 3: RAFT prompt to help the LLM evaluate its own generated reasoning and answers, contrasting them with the correct reasoning and answers. The LLM is prompted to identify errors in its reasoning and extract key insights for improvement. This figure specifically represents the 'GenerateExplanation' step in the RAFT algorithm (Section 3).
Table 1: RAFT improves RAG performance for all specialized domains: Across PubMed, HotpotQA, HuggingFace, Torch Hub, and TensorFlow Hub, we see that domain specific finetuning significantly improves the performance of the base model, but RAFT consistently outperforms the existing domain specific finetuning methods with or without RAG. This suggests the need to train the model with context. We compare our model with LLaMA finetuning recipes, and provide GPT-3.5 for reference.
Overall, the LLaMA-7B model, both with and without RAG, performs poorly due to its answering style not aligning with the ground truth. By applying domain specific tuning, we significantly enhance its performance. This process enables the model to learn and adopt the appropriate style of answering. However, introducing RAG to a domain-specifically fine-tuned (DSF) model doesn't invariably lead to better outcomes. This might indicate that the model lacks training in processing context and extracting useful information from it. By incorporating our method, RAFT, we train the model not only to match its answering style with that required but also to improve its document processing capabilities. Consequently, our approach outperforms all others.

4.3. Effect of CoT

We also conduct an analysis to evaluate the effectiveness of the Chain-of-Thought approach in enhancing the model's performance. As indicated in Table 2, simply providing the answer to a question may not always be adequate. This approach can lead to a rapid decrease in loss, causing the training process to diverge. Incorporating a reasoning chain that not only guides the model to the answer but also enriches the model's understanding can improve the overall accuracy. In our experiments, integrating the Chain-of-Thought significantly enhances training robustness. We employ GPT-4-1106 to generate our Chain-of-Thought prompts and include an example of the prompt we used in Figure 3.

4.4. Qualitative Analysis

To illustrate the potential advantages of RAFT over the domain-specifically fine-tuned (DSF) approach, we present a comparative example in Figure 4. This example qualitatively demonstrates a scenario where the DSF model becomes confused by a question asking for the identity of a screenwriter. Instead of providing the correct name, it mistakenly cites one of the films written by the screenwriter. In contrast, the RAFT model accurately answers the question. This discrepancy suggests that training a model solely with question-answer pairs may impair its ability to derive relevant context from provided documents. The comparison underscores the importance of incorporating both standard instructional tuning and context comprehension into the training dataset to preserve and enhance the model's ability to process text effectively.

4.5. Should we train the LLM always with the oracle context for RAG?

In our exploration of whether large language models (LLMs) should always be trained with the oracle context for Retrieval-Augmented Generation (RAG), we address a key question: what proportion (P%) of the training data should include oracle documents? Intuitively, one might assume that for effective training in reading and extracting information from context (e.g., RAG tasks), the oracle document should always be included during training (P = 100%). However, our findings challenge this assumption: incorporating a portion of the training data without the oracle document in the context (P = 80%) appears to enhance the model's performance on RAG tasks.

Fig. 5 presents our investigation into the hyperparameter P%, which represents the percentage of training instances that should include oracle documents.
Table 2: Ablation on Chain-of-Thought: We report the numbers for RAFT with and without CoT. Results on various datasets show that adding CoT can significantly improve the performance of the finetuned model, with gains of 9.66% and 14.93% on the HotpotQA and HuggingFace datasets, respectively.
HotPot QA
Question: What screenwriter with credits for "Evolution" co-wrote a film starring Nicolas Cage and Téa Leoni?
Documents: . . . David Weissman is a screenwriter and director. His film credits include “The Family Man” (2000),
“Evolution” (2001), and “When in Rome” (2010).
...
The Family Man is a 2000 American romantic comedy-drama film directed by Brett Ratner, written by David Diamond
and David Weissman, and starring Nicolas Cage and Téa Leoni. . . .
——————————————————————————————————————————————–
RAFT: ##Reason: The screenwriter with credits for the film “Evolution,” starring Nicolas Cage and Téa Leoni, is
David Weissman. This information is provided in the reference documents which mention David Weissman as a
screenwriter with film credits including “The Family Man” (2000), “Evolution” (2001), and “When in Rome” (2010).
Therefore, the screenwriter for “Evolution” is David Weissman. ##Answer: David Weissman ✓
DSF: The Family Man ✗
Figure 4: Comparison of RAFT and DSF: We prompt the RAFT and DSF fine-tuned models on the HotpotQA dataset. We can see that the DSF model extracts the wrong information from the context: asked who the screenwriter is, it responds with a film name. RAFT answers the question correctly.
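Scoring models that emit this ##Reason:/##Answer: format (as in the RAFT output above) requires separating the final answer from the reasoning before comparing it against the gold label. The sketch below reflects our own assumptions about the tags rather than the paper's evaluation harness.

```python
def extract_answer(generation: str) -> str:
    """Return the text after the last ##Answer: tag, or the whole string."""
    tag = "##Answer:"
    return generation.rsplit(tag, 1)[-1].strip() if tag in generation else generation.strip()

def exact_match(generation: str, gold: str) -> bool:
    """Case-insensitive exact match on the extracted answer span."""
    return extract_answer(generation).casefold() == gold.strip().casefold()

# Scoring the example above:
raft_output = ("##Reason: ... the screenwriter for 'Evolution' is David Weissman. "
               "##Answer: David Weissman")
assert exact_match(raft_output, "David Weissman")
```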
Our analysis reveals that the optimal proportion varies across datasets, with optimal values of 40%, 60%, or 100% depending on the dataset. This indicates that training your LLM without the correct corresponding context at times can be beneficial for the downstream task of answering questions related to the documents. In our training setup, we include four distractor documents alongside the oracle document, and at test time, we maintain this format by providing the oracle document with four distractors. Our findings suggest that, for domain specific RAG tasks, including a certain percentage of training data without the oracle documents in the context proves to be advantageous.

5. RAFT Generalizes to Top-K RAG

After demonstrating the performance of RAFT on various benchmarks, we now study another important problem: how does the number of distractor documents in RAFT affect the model's performance when augmented with the top-k retrieval augmented generation (RAG) results during evaluation? Previous research has highlighted the vulnerability of LLMs to irrelevant text (see studies (Shi et al., 2023a; Weston & Sukhbaatar, 2023; Liu et al., 2023b)). This issue is particularly critical for LLMs + RAG since top-k RAG is frequently employed at test time to ensure high recall. Such a scenario necessitates that the model has the ability to discern and disregard irrelevant content, focusing solely on pertinent information.

5.1. Making the Model Robust to top-K RAG

To tackle the challenge of enhancing large language models' (LLMs) ability to sift through irrelevant text within the retrieval pipeline, our analysis revealed that training solely with oracle (highly relevant) documents can inadvertently diminish the model's ability to discern and disregard irrelevant information. To address this, our algorithm, RAFT, adopts a strategy that integrates oracle documents with a mix of irrelevant ones. This methodology prompts us to investigate the ideal fraction of negative (irrelevant) documents to incorporate throughout the training process and to assess how well this training approach adapts to different volumes of documents encountered by the Retrieval-Augmented Generation (RAG) pipeline during the test phase. Our aim is to refine the balance between relevant and irrelevant information to strengthen the model's efficiency in identifying and utilizing pertinent content. Notice that Sec. 4.5 looked at what fraction of the training data should include distractors, while in this section, we study test-time scenarios.
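This test-time study can be phrased as a small sweep over the number of retrieved documents: for each k, build the context from the top-k passages, query the fine-tuned model, and record accuracy. The sketch below is our illustration; the retriever, the model call, and the exact-match metric are placeholders rather than the paper's actual evaluation code.

```python
def evaluate_topk_robustness(answer_fn, retrieve_fn, eval_set,
                             k_values=(2, 4, 6, 8, 10)):
    """Measure accuracy as the number of test-time retrieved documents varies.

    retrieve_fn(question, k) returns the top-k passages (which may include
    distractors) and answer_fn(question, docs) returns the model's answer
    string; both stand in for a concrete retriever and fine-tuned model.
    """
    accuracy = {}
    for k in k_values:
        correct = 0
        for ex in eval_set:  # each ex is a dict: {"question": ..., "answer": ...}
            docs = retrieve_fn(ex["question"], k)
            pred = answer_fn(ex["question"], docs)
            correct += int(pred.strip().casefold() == ex["answer"].strip().casefold())
        accuracy[k] = correct / len(eval_set)
    return accuracy  # maps k -> accuracy on the evaluation set
```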
[Figure 5 plots: three panels (NQ, TQA, HotpotQA); y-axis: Final Accuracy; x-axis: P% golden retrieved context at training.]
Figure 5: How many golden documents to involve? We study the hyperparameter P % which indicates what fraction of
the training data contains the oracle document(s) in its context. Results on NQ, TQA and HotpotQA suggest that mixing a
fraction of data that does not have the oracle document in its context is helpful for in-domain RAG.
[Figure 6 plots: two panels (NQ, HotpotQA); y-axis: Final Accuracy; x-axis: # Test Documents (Top-k); series: Train D* + 3D.]
Figure 6: Varying test-time documents: We study how robust RAFT is to varying numbers of test-time documents that a retriever might provide. For NQ, we find that training with 4 documents leads to the best performance, while training with 2 documents is optimal for HotpotQA. However, across both datasets, training only with oracle documents (no distractors) in the context hurts performance.
References

Asai, A., Wu, Z., Wang, Y., Sil, A., and Hajishirzi, H. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023.

Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., Van Den Driessche, G. B., Lespiau, J.-B., Damoc, B., Clark, A., et al. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, pp. 2206–2240. PMLR, 2022.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Carlini, N., Liu, C., Erlingsson, Ú., Kos, J., and Song, D. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pp. 267–284, 2019.

Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650, 2021.

Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., and Zhang, C. Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations, 2022.

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.

Dernoncourt, F. and Lee, J. Y. Pubmed 200k rct: a dataset for sequential sentence classification in medical abstracts. arXiv preprint arXiv:1710.06071, 2017.

Feldman, V. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 954–959, 2020.

Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pp. 3929–3938. PMLR, 2020.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp. 2790–2799. PMLR, 2019.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., and Grave, E. Atlas: Few-shot learning with retrieval augmented language models. Journal of Machine Learning Research, 24(251):1–43, 2023. URL http://jmlr.org/papers/v24/23-0037.html.

Ji, C. C.-J., Mao, H., Yan, F., Patil, S. G., Zhang, T., Stoica, I., and Gonzalez, J. E. Gorilla openfunctions v2. 2024.

Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W., and Lu, X. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019.

Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.

Kandpal, N., Wallace, E., and Raffel, C. Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning, pp. 10697–10707. PMLR, 2022.

Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., and Lewis, M. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019.

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.

Lazaridou, A., Gribovskaya, E., Stokowiec, W., and Grigorev, N. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115, 2022.

Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.

Lin, X. V., Chen, X., Chen, M., Shi, W., Lomeli, M., James, R., Rodriguez, P., Kahn, J., Szilvasy, G., Lewis, M., et al. Ra-dit: Retrieval-augmented dual instruction tuning. arXiv preprint arXiv:2310.01352, 2023a.

Lin, X. V., Chen, X., Chen, M., Shi, W., Lomeli, M., James, R., Rodriguez, P., Kahn, J., Szilvasy, G., Lewis, M., et al. Ra-dit: Retrieval-augmented dual instruction tuning. arXiv preprint arXiv:2310.01352, 2023b.

Liu, H., Sferrazza, C., and Abbeel, P. Chain of hindsight aligns language models with feedback. arXiv preprint arXiv:2302.02676, 2023a.

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023b.

Liu, X., Ji, K., Fu, Y., Tam, W., Du, Z., Yang, Z., and Tang, J. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 61–68, 2022a.

Liu, Z., Kitouni, O., Nolte, N. S., Michaud, E., Tegmark, M., and Williams, M. Towards understanding grokking: An effective theory of representation learning. Advances in Neural Information Processing Systems, 35:34651–34663, 2022b.

Liu, Z., Ping, W., Roy, R., Xu, P., Shoeybi, M., and Catanzaro, B. Chatqa: Building gpt-4 level conversational qa models. arXiv preprint arXiv:2401.10225, 2024.

Mishra, S., Khashabi, D., Baral, C., and Hajishirzi, H. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773, 2021.

Muennighoff, N., Wang, T., Sutawika, L., Roberts, A., Biderman, S., Le Scao, T., Bari, M. S., Shen, S., Yong, Z. X., Schoelkopf, H., Tang, X., Radev, D., Aji, A. F., Almubarak, K., Albanie, S., Alyafeai, Z., Webson, A., Raff, E., and Raffel, C. Crosslingual generalization through multitask finetuning. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15991–16111, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.891. URL https://aclanthology.org/2023.acl-long.891.

OpenAI. Gpt-4 technical report, 2023.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Pan, X., Zhang, M., Ji, S., and Yang, M. Privacy risks of general-purpose language models. In 2020 IEEE Symposium on Security and Privacy (SP), pp. 1314–1331. IEEE, 2020.

Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334, 2023.

Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022.

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.

Ram, O., Levine, Y., Dalmedigos, I., Muhlgay, D., Shashua, A., Leyton-Brown, K., and Shoham, Y. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083, 2023.

Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021.

Shi, F., Chen, X., Misra, K., Scales, N., Dohan, D., Chi, E. H., Schärli, N., and Zhou, D. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pp. 31210–31227. PMLR, 2023a.

Shi, W., Ajith, A., Xia, M., Huang, Y., Liu, D., Blevins, T., Chen, D., and Zettlemoyer, L. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789, 2023b.

Shi, W., Min, S., Lomeli, M., Zhou, C., Li, M., Lin, V., Smith, N. A., Zettlemoyer, L., Yih, S., and Lewis, M. In-context pretraining: Language modeling beyond document boundaries. arXiv preprint arXiv:2310.10638, 2023c.

Shi, W., Min, S., Yasunaga, M., Seo, M., James, R., Lewis, M., Zettlemoyer, L., and Yih, W.-t. Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652, 2023d.

Tänzer, M., Ruder, S., and Rei, M. Memorisation versus generalisation in pre-trained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7564–7578, 2022.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Vu, T., Iyyer, M., Wang, X., Constant, N., Wei, J., Wei, J., Tar, C., Sung, Y.-H., Zhou, D., Le, Q., et al. Freshllms: Refreshing large language models with search engine augmentation. arXiv preprint arXiv:2310.03214, 2023.

Wang, B., Ping, W., McAfee, L., Xu, P., Li, B., Shoeybi, M., and Catanzaro, B. Instructretro: Instruction tuning post retrieval-augmented pretraining. arXiv preprint arXiv:2310.07713, 2023.

Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

Workshop, B., Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.

Xiong, W., Liu, J., Molybog, I., Zhang, H., Bhargava, P., Hou, R., Martin, L., Rungta, R., Sankararaman, K. A., Oguz, B., et al. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039, 2023.

Xu, P., Ping, W., Wu, X., McAfee, L., Zhu, C., Liu, Z., Subramanian, S., Bakhturina, E., Shoeybi, M., and Catanzaro, B. Retrieval meets long context large language models. arXiv preprint arXiv:2310.03025, 2023.

Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.

Zhang, T., Liu, F., Wong, J., Abbeel, P., and Gonzalez, J. E. The wisdom of hindsight makes language models better instruction followers. arXiv preprint arXiv:2302.05206, 2023.

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023a.

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023b.