0% found this document useful (0 votes)

26 views26 pages

Enhancing LLMs with Reading Comprehension

Uploaded by

ouesleti.97.farah

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

26 views26 pages

Enhancing LLMs with Reading Comprehension

Uploaded by

ouesleti.97.farah

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

A DAPTING L ARGE L ANGUAGE M ODELS VIA

R EADING C OMPREHENSION
Daixuan Cheng, Shaohan Huang∗ & Furu Wei
Microsoft

A BSTRACT
arXiv:2309.09530v1 [cs.CL] 18 Sep 2023

We explore how continued pre-training on domain-specific corpora influences

large language models, revealing that training on the raw corpora endows
the model with domain knowledge, but drastically hurts its prompting ability
for question answering. Taken inspiration from human learning via reading
comprehension—practice after reading improves the ability to answer questions
based on the learned knowledge—we propose a simple method for transforming
raw corpora into reading comprehension texts. Each raw text is enriched with a
series of tasks related to its content. Our method, highly scalable and applica-
ble to any pre-training corpora, consistently enhances performance across various
tasks in three different domains: biomedicine, finance, and law. Notably, our 7B
language model achieves competitive performance with domain-specific models
of much larger scales, such as BloombergGPT-50B. Furthermore, we demonstrate
that domain-specific reading comprehension texts can improve the model’s per-
formance even on general benchmarks, showing the potential to develop a general
model across even more domains. Our model, code, and data will be available at
https://github.com/microsoft/LMOps.

Biomedicine Finance Law

65 85 35

55 70 28

45 55 21

35 40 14

25 25 7
QP

A
T
ot

ac
ic

ic
B

ac
e
SA
QA
RC

lin
FP

-m
Pr

-m
m
M

QA
Fin

S-
ad
m

S-

LD
LD
bM

U
e

U
He
nv

O
OT
Ch

O
OT

eH
Pu

eH
SC
SC

s
s

Ca
Ca

General LLM DAPT AdaptLLM

Figure 1: Domain-specific task performance in biomedicine, finance, and law. General LLM is
28
the general language model without continued training, DAPT (Gururangan et al., 2020) continues
to train the general model on domain-specific raw corpora, and AdaptLLM continues to train the
20

general model on the reading comprehension texts constructed based on the raw corpora, mixed with
general instructions.12

4
∗ SCOTUS-mac SCOTUS-mic CaseHOLD-mac CaseHOLD-mic
Corresponding author: [email protected]

1
Reading Comprehension

Raw Text
Here is the first part of an article about biomedicine: Recent reported
evidence indicates that vocal cord carcinoma is evolving similarly to
oropharyngeal cancer with an increasing number of patients (...)

Glottic Carcinoma in Young Patients Answer questions based on the article:

Title
Recent reported evidence indicates that vocal cord What is a summary? Glottic Carcinoma in Young Patients.
carcinoma is evolving similarly to
oropharyngeal Domain
cancer with an increasing Generate a sentence that includes these biomedicine
number of patients without a smoking history keywords keywords [carcinoma, oropharyngeal, papillomavirus]: Recent
having human papillomavirus (HPV) disease. reported evidence indicates that vocal cord carcinoma is evolving…
Therefore
(...) , an investigation was done to Entailment
examine the incidence of glottic carcinoma in
relation Premise:… Hypothesis:… Does the premise entail the hypothesis? Yes
due to
patients 30 years old or younger. (...)
the morphology of the lesions and the patients' Cause & effect What is the reason for ”…"? the morphology of the lesions and the
young age. Historically, glottic carcinoma is patients' young age.
considered to be a tobacco-induced disease. In Semantic
contrast, Compose a sentence that contradicts the meaning of "Historically,
recent published evidence shows
similarity glottic carcinoma … ”.
that glottic carcinoma can be an HPV-related
disease with increasing incidence in nonsmokers. Answer: Recent published evidence …
(...) This finding further supports… Text
How would you complete the article? This finding further
ending supports…

Figure 2: A simplified example of a reading comprehension text, wherein the raw text is followed
by a series of tasks constructed from it, including Summarization (purple), Word-to-Text (blue),
Natural Language Inference (red), Commonsense Reasoning (teal), Paraphrase Detection (yellow),
and Text Completion (green). The complete version is in Appendix G.

1 I NTRODUCTION

The proliferation of general large language models (LLMs) has given rise to the emergence of
domain-specific large language models. Existing methods can be broadly classified into three ap-
proaches. The first trains models from scratch on a mixture of domain-specific and general cor-
pora (Wu et al., 2023b). While this intuitively creates domain-specific LLMs, the substantial com-
putational and data requirements raise significant concerns (Yang et al., 2023; Ling et al., 2023).
The second fine-tunes the language model using supervised datasets (Singhal et al., 2022; 2023; Li
et al., 2023b;a; Wang et al., 2023; Han et al., 2023; Xiong et al., 2023; Huang et al., 2023), offering
a more cost-effective option. However, there are still uncertainties about how well fine-tuned LLMs
grasp domain knowledge that can be universally applied to all domain-specific tasks, as discussed
by Zhou et al. (2023) and Gudibande et al. (2023). The third prompts the general language model
with retrieved domain knowledge (Li et al., 2023b; Cui et al., 2023; Huang et al., 2023), which can
be considered as an application of LLM rather than a direct enhancement to the LLM itself.
Continued pre-training on domain-specific corpora, also known as domain-adaptive pretraining (Gu-
rurangan et al., 2020), has been proven effective in adapting various natural language understanding
models (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020) to specific domains (Yao et al.,
2021; Gururangan et al., 2020; Cheng et al., 2022). This approach enables language models to lever-
age general ability while incorporating domain-specific knowledge, benefiting downstream domain-
specific tasks at reduced costs. This motivates our investigation into whether continued pre-training
also benefits large-scale generative models. We conduct initial experiments on three domains—
biomedicine, finance, and law—revealing that continued training on the raw corpora results in a
drastic drop in prompting performance but still benefits fine-tuning evaluation and knowledge prob-
ing tests. This leads us to conclude that domain-adaptive pre-training using raw corpora imparts
domain knowledge to the LLM while affecting its prompting ability.
To leverage domain-specific knowledge while enhancing prompting performance, we introduce a
simple method for transforming large-scale raw corpora into reading comprehension texts: each raw
text is enriched with a series of tasks relevant to its content, as illustrated in Figure 2. These tasks are
designed to help the model maintain its ability to answer questions using natural language, based on
the context of the raw text. Furthermore, we augment the reading comprehension texts with diverse
general instructions, thereby further enhancing prompting ability (Wei et al., 2022; Zhou et al., 2023;

2
Xu et al., 2023; Mukherjee et al., 2023). Our experiments in domains such as biomedicine, finance,
and law highlight the effectiveness of our approach in improving model performance on various
domain-specific tasks. We refer to this resulting model as AdaptLLM, for Adapted Large Language
Model. Looking ahead, we envision extending this methodology to the development of a general
large language model, contributing to the ever-expanding landscape of tasks across more domains.
In summary, our contributions include:
• We investigate continued pre-training for large language models, where we find continued training
on domain-specific raw corpora can endow the model with domain knowledge, but drastically
hurts its prompting ability.
• We propose a simple recipe which automatically converts large-scale raw corpora into reading
comprehension texts, to effectively learn the domain knowledge while concurrently preserving
prompting performance.
• Our experiments show the effectiveness of our method in consistently improving model perfor-
mance in three different domains: biomedicine, finance and law.

2 P RELIMINARY E XPLORATION ON C ONTINUED P RE - TRAINING

Given the proven efficacy and efficiency of continued pre-training in adapting natural language
understanding models (Gururangan et al., 2020; Yao et al., 2021; Cheng et al., 2022), we embark on
an exploration to ascertain whether this method remains effective for large-scale generative models.
We continue to train the general LLaMA (Touvron et al., 2023) on the domain-specific raw corpora
of biomedicine, finance, and law, respectively, and conduct prompting and fine-tuning evaluations,
as well as domain knowledge probing to assess the model performance within each domain (detailed
experimental settings are in Section 4).

Table 1: Domain-specific task scores of general language model (General LLM) and the language
model that has undergone continued pre-training on the domain-specific raw corpora (DAPT (Gu-
rurangan et al., 2020)). We report the average of task scores within each domain under prompting,
fine-tuning and knowledge probing settings.

Prompting Fine-tuning Knowledge Prob

Method
BioMed. Finance Law BioMed. Finance Law BioMed. Law
General LLM 44.2 58.6 34.2 64.2 79.9 42.0 36.5 45.0
DAPT 41.7 57.6 35.0 66.5 80.9 45.4 36.9 45.6

Prompting vs. Fine-tuning. As seen in Table 1, when fine-tuning is applied, consistent performance
improvements across all three domains are evident after domain-adaptive pre-training. This trend
aligns with findings related to language understanding models (Gururangan et al., 2020), indicating
that continued pre-training enriches the LLM with domain-specific knowledge. Paradoxically, a con-
tradictory trend emerges in the prompting performance, where a noticeable drop is observed across
most domains after domain-adaptive pre-training. This contradiction leads us to hypothesize that
while vanilla domain-adaptive pre-training enhances the LLM’s domain knowledge, contributing to
the fine-tuning improvements, it also significantly impairs its ability to perform well in prompting,
causing the observed drop in prompting performance.
Domain Knowledge Probing. To further confirm whether the language model gains domain knowl-
edge during continued pre-training, we employ a method similar to LAMA (Petroni et al., 2019) for
probing domain knowledge. Using the supervised datasets available in each domain as the basis,
we create domain-specific knowledge-probing datasets. The dataset creation process is detailed in
Appendix A. In Table 1, we present the results of domain knowledge probing for the biomedicine
and law domains1 . Across both domains, we observe improved results after domain-adaptive pre-
training, indicating that the model indeed acquires domain-specific knowledge.

1
We were unable to construct a knowledge probing test for finance due to the limited availability of super-
vised datasets in this domain.

3
The above analyses indicate that the decline in domain-specific prompting performance can be at-
tributed to the reduced prompting ability. This reduction may stem from the limited diversity of
pre-training corpora within one particular domain (Longpre et al., 2023b), which limits the input-
output patterns derived from raw texts (Wei et al., 2022). Therefore, enhancing prompting ability is
crucial for effectively harnessing the domain knowledge acquired during continued pre-training.

3 A DAPTING L ARGE L ANGUAGE M ODELS VIA R EADING C OMPREHENSION

Instead of continuing to train large language models on domain-specific raw corpora, we convert
the raw corpora into reading comprehension texts and adapt the model using the converted data.
In reading comprehension, each raw text is followed by a series of tasks related to its content.
We regard the model training phase on the raw text as the “reading” phase, and the subsequent
training on the followed tasks as the “comprehension” phase. These comprehension tasks follow
the question-answering format, aimed at enriching the model’s prompting ability to respond to input
questions (Wei et al., 2022). This design is inspired from human learning, where practice after
reading enhances the ability to answer questions based on the acquired knowledge. Furthermore,
we propose augmenting the training data with general instructions (Zhou et al., 2023; Xu et al.,
2023; Mukherjee et al., 2023) to benefit from the diversity of input-output formats, thereby further
improving prompting ability.

3.1 C REATING R EADING C OMPREHENSION T EXTS

The idea of mining tasks from raw pre-training corpora to enhance zero-shot capability was in-
troduced by van de Kar et al. (2022). This approach effectively extracts intrinsic tasks from raw
texts through a handful of regex-based patterns, leading to substantial enhancements in the model’s
zero-shot performance via fine-tuning. Our approach leverages the self-supervised nature of this
mining strategy to create our comprehension tasks. This enables us to scale up the transfer of raw
pre-training data, capitalizing on the domain-specific knowledge embedded in the raw texts and the
enhanced prompting ability provided by the comprehension tasks.
Table 2 gives an overview of the techniques used to extract and create tasks from raw texts. Phrases
like Answer questions based on the article: are employed to concatenate each raw text
with the followed tasks, as illustrated in Figure 2. Additionally, we paraphrase each task template to
multiple variations and turn the task around to enhance task diversity (Wei et al., 2022; Chung et al.,
2022; Longpre et al., 2023a).
Summarization prompts the models to generate a concise summary of the provided article, en-
couraging them to extract its main idea. To create task inputs, we employ queries like What is a
summary? to prompt the model to summarize the article, using the text title as the groundtruth. We
also reverse the task, asking the model to craft an article based on the given title.
Additionally, we task the language model with identifying sentence topics. To unearth such input-
output pairs, we utilize regex-based patterns to identify sentences aligning with the patterns spec-
ified in Table 2. We then employ the corresponding task templates to construct the input-output
pairs (van de Kar et al., 2022).
Word-to-Text enhances the model’s grasp of domain-specific vocabulary by prompting it to gen-
erate sentences incorporating specific words. To identify domain-specific words, we use the Sen-
tencePiece tool (Kudo & Richardson, 2018) to build a vocabulary from the target domain corpora.
We then compare this domain vocabulary to the general language model’s vocabulary, considering
words present in the domain vocabulary but absent from the general vocabulary as domain-specific.
Additionally, we filter out tokens with fewer than 10 characters, resulting in a set of domain-specific
keywords.
For each sentence in the raw text, we count the number of domain-specific keywords. Sentences
having more than three domain-specific keywords are selected for making Word-to-Text tasks. We
take the domain-specific keywords in the sentence as the input, asking the model to generate a sen-
tence with Generate a sentence that includes these {DOMAIN} keywords. We also
turn the task around by taking the sentence as input and asking the model to find the keywords about
the target domain using What keywords about {DOMAIN} can be extracted from this

4
Table 2: Mining patterns and input-output templates. {VERBAL} is replaced with the verbalizers
in Table 3. For mining, {WORD} captures a single word, and {SENT} captures a single sentence.
Each input-output template is paraphrased into multiple variations. We also turn the task around—
exchanging the question and answer—to achieve enhanced diversity.

Task Type Mining Pattern Input-output Template

Summarization
Title Title as summary What is a summary? {TITLE}
Topic {SENT1} {VERBAL} {SENT2} {SENT1} is about: {SENT2}
Word-to-Text
Generate a sentence about
Domain keywords as input;
Word-to-text these {DOMAIN} keywords
sentence as output
[{WORD1}, {WORD2}, {WORD3}]: {SENT}
Definition {WORD} {VERBAL} {SENT} How to define {WORD}? {SENT}
Natural Language Inference
Entail
Does "{SENT1}" entail "{SENT2}"?
Neutral {SENT1} {VERBAL}, {SENT2}
{Yes/Maybe/No}
Contradict
Commonsense Reasoning
Cause-effect {SENT1} {VERBAL}, {SENT2} What is the {effect/cause}
Effect-cause {SENT1} {VERBAL} {SENT2} of {SENT1}? {SENT2}

Paragraph Detection
Similar Compose a sentence to {support/
{SENT1} {VERBAL}, {SENT2}
Different contradict} "{SENT1}". {SENT2}

Text Completion
How would you complete the
Text completion Text ending as completion
article? {ENDING}

sentence?. Here we point out the target domain by replacing {DOMAIN} with domain names such
as biomedicine, finance, or law. Besides, we task the language model with defining concepts
using the mining pattern and input-output template in Table 2.
Natural Language Inference concerns how two sentences relate, typically asking, given a first
sentence, whether a second sentence is true, false or possibly true. We use the regex-based patterns
in Table 2 to search for “premise-hypothesis-relation” triplets within the raw text. For example, we
categorize the relationship between two sentences as “Entailment” if they are are connected by the
verbalizer Therefore, and as “Neutral” if connected by Furthermore.
Additionally, we enhance diversity by converting classification tasks into generation tasks. For
example, when the relationship between two sentences is entailment, we employ templates like
{SENT1} Thus? to query for an output of which the groundtruth is the second sentence.
Commonsense Reasoning evaluates the ability to perform physical or scientific reasoning while
considering common sense. We identify cause-and-effect logic within sentences using the regex-
based patterns in Table 2. We then formulate the input-output pairs using templates such as What
is the reason of {SENT1}? {SENT2}.
Paraphrase Detection asks a model to determine whether two sentences are semantically equiv-
alent. To collect such task data, we use regex-based patterns in Table 2 to search for “sentence1-
sentence2-label” data triplets. However, we empirically find that these mining patterns cannot con-
sistently identify two sentences with strictly equivalent semantic meanings. For instance, sentences
linked by the verbalizer Similarly may not share similar meanings.

5
Table 3: Verbalizers for mining patterns in Table 2.

Task Type Verbalizer

Summarization
Topic talks about, is about, ’s topic is
Word-to-Text
Definition is defined as, ’s definition is
Natural Language Inference
Entail Yes, Therefore, Thus, Accordingly, Hence, For this reason
Neutral Maybe, Furthermore, Additionally, Moreover, In addition
Contradict No, However, But, On the contrary, In contrast, Whereas
Commonsense Reasoning
Cause-effect Therefore, Thus, Accordingly, Hence, For this reason
Effect-cause due to, on account of, owing to
Paragraph Detection
Similar In other words, Namely, That is to say, Similarly, Equally
Different No, However, But, On the contrary, In contrast, Whereas

Therefore, we reformat the classification task into a generation task to reduce dependence on label
accuracy. Instead of inquiring whether two sentences are similar, we prompt the model to generate a
sentence that either supports or contradicts the meaning of a given sentence, using input-output tem-
plates like Can you create a sentence that contradicts the meaning of {SENT1}?
{SENT2} when the extracted label is “Different.”
Text Completion. In addition to the inherent casual language modeling task within generative
language models, we insert queries such as How would you complete the article? between
sentences to prompt the language model to generate the subsequent section. An advantage of Text
Completion task is that it does not require any specific mining patterns, thus can be applied to any
raw texts.

3.2 M IXING WITH G ENERAL I NSTRUCTIONS

While we have designed diverse mining patterns, input-output templates and task reversals to en-
hance prompting ability, they might not fully address the infinite task diversity in real-world scenar-
ios. In light of this, we propose to mix the reading comprehension texts with general instructions to
cover a wider range of input-output types.

4 E XPERIMENT S ETTINGS

Domain-adaptive Pre-training. PubMed Abstracts and FreeLaw Opinions from the Pile (Gao et al.,
2021) are utilized as the pre-training corpora for the biomedicine and law domains, respectively. For
finance, we collect financial news from May 2022 to May 20232 for over 7, 000 stocks, using the
FinGPT codebase (Yang et al., 2023). General instructions are sourced from LIMA (Zhou et al.,
2023), WizardLM (Xu et al., 2023), and Orca (Mukherjee et al., 2023). Our pre-training code is
based on TorchScale3 . We continue to train LLaMA-7B (Touvron et al., 2023) on each domain,
and explore different ratios for mixing reading comprehension texts with general instructions; the
optimal ratios for biomedicine, finance, and law are 1 : 1, 1 : 2, and 1 : 1, respectively. Dataset
details and other pre-training hyper-parameters can be found in Appendix B.

2
Access to earlier news is limited.
3
https://github.com/microsoft/torchscale

6
Creating Reading Comprehension Texts. Using the mining patterns in Table 2, we search for sub-
categories within each task type. To prevent task dominance, we limit the number of task examples
per sub-category to two for each raw text. For each mined example, we randomly sample from
various paraphrased or task-reversed templates to generate an input-output example. To structure
the reading comprehension text, we use \n\n to connect comprehension tasks and link them with
the raw text. On average, about two input-output examples are collected per reading comprehension
text. Please refer to Appendix C for mining pattern implementation details and Appendix G for
cases of reading comprehension texts.
Domain-specific Tasks. For biomedicine, we evaluate on PubMedQA (Jin et al., 2019),
ChemProt (Kringelum et al., 2016), MQP (McCreery et al., 2020), RCT (Dernoncourt & Lee, 2017),
and USMLE (Jin et al., 2020). For finance, we evaluate on the five publicly available tasks also eval-
uated by BloombergGPT (Wu et al., 2023b): ConvFinQA (Chen et al., 2022), FPB (Malo et al.,
2014), FiQA SA (Maia et al., 2018), Headline (Sinha & Khandait, 2020), and NER (Alvarado
et al., 2015), and adopt similar prompting settings with BloombergGPT. For law, we evaluate on
SCOTUS (Spaeth et al., 2020), CaseHOLD (Zheng et al., 2021) and UNFAIR-ToS (Lippi et al.,
2019) from the LexGLUE (Chalkidis et al., 2022) benchmark. Evaluation details are provided in
Appendix D.

5 M AIN R ESULTS

In Table 4, we present the comparative prompting results of our models (AdaptLLM) against the
general language model (General LLM) and the models that have gone vanilla domain-adaptive pre-
training on raw corpora (DAPT). On various tasks in the three different domains, the use of raw
texts in DAPT adversely affects the performance. However, the reformatting of raw texts and the
inclusion of general instructions in AdaptLLM manage to counteract this effect, resulting in better
results than the general language model.

Table 4: Domain-specific task performance of general large language model (General LLM),
vanilla domain-adaptive pretraining (DAPT), and ours (AdaptLLM) in prompting evaluation.
We also display prompting results of other models including MedAlpaca (Han et al., 2023) in
biomedicine, BloombergGPT (Wu et al., 2023b) in finance, and LexGPT (Lee, 2023) in law.

Biomedicine PubMedQA ChemProt MQP RCT UMSLE AVERAGE

MedAlpaca-7B 58.6 39.0 50.7 40.8 36.7 45.1
MedAlpaca-13B 60.7 38.4 57.4 51.3 39.0 49.4
General LLM-7B 59.6 31.4 50.7 45.1 34.5 44.2
DAPT-7B 52.6 26.6 49.2 46.6 33.5 41.7
AdaptLLM-7B 63.3 35.2 54.4 50.4 33.1 47.3

Finance ConvFinQA FPB FiQA SA Headline NER AVERAGE

BloombergGPT-50B 43.4 51.1 75.1 82.2 60.8 62.5
General LLM-7B 29.2 55.9 69.2 77.7 61.1 58.6
DAPT-7B 29.6 55.3 64.9 77.5 60.6 57.6
AdaptLLM-7B 41.5 62.5 72.1 81.4 59.3 63.4

SCOTUS CaseHOLD
Law UNFAIR-ToS AVERAGE
mic-F1 mac-F1 mic-F1 mac-F1
GPT-J-6B 15.9 13.6 34.9 34.9 79.8 35.9
LexGPT-6B 16.9 7.7 27.0 27.0 81.9 32.1
General LLM-7B 28.3 10.8 32.9 32.9 65.8 34.2
DAPT-7B 25.0 9.8 34.2 34.2 72.0 35.0
AdaptLLM-7B 30.0 17.8 35.1 35.1 74.4 38.5

7
Besides, we compare AdaptLLM with other publicly-available models/results in each domain as
follows.
Biomedicine. We compare with MedAlpaca-7B/13B (Han et al., 2023), which fine-tunes LLaMA-
7B/13B (Touvron et al., 2023) on medical question-answering instructions. AdaptLLM-7B per-
forms better than MedAlpaca-7B and approaches MedAlpaca-13B in the average score. While the
supervised instructions help MedAlpaca-7B outperform General LLM-7B (LLaMA-7B) in some
domain-specific tasks, this advantage isn’t consistent. This could be because instructions don’t fully
infuse domain knowledge for all tasks, or the domain-specific instructions struggle with various
input-output scenarios.
Finance. We compare our results with those reported in BloombergGPT (Wu et al., 2023b), a
model trained from scratch on a mixture of financial and general corpora. While General LLM-7B
scores lower than BloombergGPT-50B, AdaptLLM-7B achieves competitive performance with the
50B BloombergGPT model. This highlights the computational and data efficiency of our approach
compared to training from scratch.
Law. We compare with LexGPT-6B (Lee, 2023) which conducts vanilla domain adaptive pretraining
of GPT-J-6B (Wang & Komatsuzaki, 2021) on Pile of Law (Henderson et al., 2022) corpora. In
contrast to the general model GPT-J-6B, LexGPT-6B shows negative prompting results. This trend
aligns with our observation in section 2 that continued pre-training on domain-specific raw texts
leads to worse prompting performance. On the other hand, our method contributes to positive results
on the prompting performance, highlighting the effectiveness of the comprehension tasks and the
general instructions.

6 A BLATIONS ON T RAINING DATA

Table 5 presents ablation results on different training data and data mixtures: (1) Raw Text trains
on the raw pre-training corpora. (2) Read. Compre. converts raw texts into reading comprehension
texts, boosting the prompting ability to show better results in all of the adapted domains. (3) Gen.
Ins. trains on general instructions. (4) Read. + Gen. Ins. augments reading comprehension texts
with general instructions. Compared to using reading comprehension texts only, the inclusion of
general instructions further improves the prompting ability, leading to better task results. Moreover,
compared to the use of general instructions alone, the utilization of reading comprehension texts
provides domain knowledge that enhances performance in domain-specific tasks. Additionally, we
provide ablations for each of the comprehension task types, where we find that Word-to-Text and
Natural Language Inference exhibit the highest effectiveness on domain-specific tasks; detailed re-
sults are listed in Appendix E.

Table 5: Ablation results on training data. Raw Text refers to raw corpora, Read. Compre. refers
to reading comprehension texts, Gen. Ins. refers to general instructions, and Raw. + Gen. Ins. and
Read. + Gen. Ins. correspond to different data mixtures. We report the average of task scores in
prompting evaluation within each domain.

Data Raw Text Read. Compre. Gen. Ins. Raw. + Gen. Ins. Read. + Gen. Ins.
BioMed. 41.7 44.3 43.3 44.8 47.3
Finance 57.6 60.0 62.2 61.7 63.4
Law 35.0 37.0 37.8 34.7 38.5

7 A NALYSIS OF D OMAIN K NOWLEDGE AND P ROMPTING A BILITY

Our design of reading comprehension is to learn the domain-specific knowledge from the raw texts
and to enhance the prompting ability from the comprehension tasks. In this section, we conduct
analyses on the two aspects respectively.
Domain Knowledge. In addition to the prompting results presented in Sections 5 and 6, we also
conduct fine-tuning evaluations and knowledge probing tests to assess whether the reading com-
prehension texts can endow the general model with domain knowledge. As demonstrated in the

8
Domain Knowledge Prompting Ability
Summarize
85
14
Read. Word-to
75 Compre. 33 19 -Text
11

Fine-tune Score 27 15
65 9
20 12
Close.
30 24 18 24 31 38 N.L.I
55 QA
7 44
24
58
45 10

32
Text 12 73
Common.
35 Comple. Reason.
40
BioMed. Finance Law
Paraphrase

General LLM Raw Text Read. Compre.

Summarization
Figure 3: Fine-tuning evaluation on domain-specific tasks (left) and prompting evaluation on
Reading
general tasks (right). General LLM is14.1
the general language model, Raw Text trains the general
Word-to-Text
Comprehension
model on the domain-specific raw32.3
corpora, and19.2Read. Compre. trains the general model on the
reading comprehension texts constructed based on the raw corpora. We report the average of task
scores within each domain/type,
Naturaldetailed results are listed inLanguage
Appendix F.
29.6 37.6
Question Inference

fine-tuning results in Figure 3,Text

continued
12.0 72.8 on the
training reading comprehension texts consistently
Commonsense
enhances model performance when fine-tuning
Completion 39.5
on domain-specific
Reasoning tasks. The fine-tuning and do-
main knowledge probing improvements (detailed in Appendix A) provide empirical evidence that
Paraphrase
the reading comprehension texts indeed imbue the general language model with domain knowledge.
Detection
Furthermore, it’s noteworthy that Read. Compre. outperforms Raw Text in all the adapted domains
in the fine-tuning results. This improvement can be attributed to the fact that the appended compre-
hension tasks naturally create a “multi-task instruction tuning” setting, which benefits single-task
fine-tuning, as discussed by Longpre et al. (2023a).
Prompting Ability. Our approach focuses on enhancing prompting ability through the comprehen-
sion tasks. To assess the effectiveness of each comprehension task type, we employ general LLM
benchmarks to evaluate zero-shot prompting performance. Specifically, we evaluate at least three
general tasks for each comprehension task type, following the task clustering settings in FLAN (Wei
et al., 2022). Besides, we assess the model’s performance on general Reading Comprehension and
Closed-book QA tasks to verify its ability to answer questions with or without contexts.
Figure 3 presents the average task scores within each task type, subsequently averaged across the
three adapted language models. By transferring raw texts into reading comprehension texts, we
observe consistent prompting performance enhancements across all task types. Remarkably, when
solely trained on our domain-specific reading comprehension texts (without the inclusion of gen-
eral instructions), we achieve even better results than the general language model for most task
types. This highlights our approach’s potential in developing a general language model across more
domains. We also conduct an ablation study on each comprehension task type in Appendix E to
analyze whether the inclusion of a particular comprehension task type affects the performance of
corresponding downstream tasks.

8 R ELATED W ORK

Recent works that apply large language models to specific domains such as medicine (Singhal et al.,
2022; 2023; Li et al., 2023b; Wu et al., 2023a; Li et al., 2023a; Wang et al., 2023; Xiong et al.,
2023), finance (Wu et al., 2023b; Yang et al., 2023) and law (Cui et al., 2023; Huang et al., 2023),
can be categorized into three main approaches: training from scratch, instruction fine-tuning and
retrieval-augmented prompting.

9
Training from Scratch. Training a domain-specific language models from scratch is an intuitive
approach to realize domain specialization. BloombergGPT (Wu et al., 2023b) represents an early
example of large language models in the financial domain, trained on a mix of financial and general
corpora. This approach demonstrates significant improvements in performance on financial tasks
without sacrificing the performance on general LLM benchmarks. However, studies (Yang et al.,
2023; Ling et al., 2023) have pointed out “training from scratch” comes with expensive computa-
tional and data requirements, which motivates the need for low-cost domain adaptation methods
such as continued pre-training or fine-tuning.
Instruction Fine-tuning. Fine-tuning large language models on domain-specific tasks, particularly
those involving question-answering instructions, serves as a cost-effective approach to enhance their
performance in specific domains (Singhal et al., 2022; 2023; Li et al., 2023b;a; Wang et al., 2023;
Han et al., 2023; Xiong et al., 2023; Huang et al., 2023). However, due to the limited availability
of supervised fine-tuning data, models fine-tuned with a small amount of data might struggle to
acquire sufficient domain knowledge. Therefore, creating large-scale, supervised instruction-tuning
datasets emerges as a significant challenge. Previous methods employ high-performing LLMs such
as ChatGPT and GPT-4 (OpenAI, 2023) to generate these question-answer pairs (Li et al., 2023a),
but the cost of utilizing those closed-source models for inference can be a concern. In such situations,
harnessing large-scale domain corpora for continual pre-training represents a promising solution to
acquire domain knowledge.
Retrieval-augmented Prompting. Retrieval augmentation enhances LLMs by integrating external
domain-specific information without modifying the model parameters (Li et al., 2023b; Cui et al.,
2023; Huang et al., 2023). LLMs gain domain context from sources like documents, domain-specific
knowledge graphs, or neural networks with parametric domain knowledge. This enables LLMs to
better answer domain-specific questions and address issues like hallucination. In such cases, seam-
less integration of external knowledge into LLMs is crucial, existing methods typically concatenate
retrieved knowledge to the LLM’s input or intermediate layers. However, it’s important to allow
LLMs the option to accept or reject retrieved information due to potential incompleteness or con-
flicts (Ling et al., 2023). Training LLMs to incorporate domain knowledge can aid in making such
informed acceptance or rejection decisions.

9 CONCLUSION

This paper focuses on adapting large language models via continued training on domain-specific
corpora. We propose a simple method to transform large-scale domain-specific raw corpora into
reading comprehension texts, enabling the model to acquire domain knowledge from raw texts and
to enhance prompting ability through comprehension tasks. Experiments in different domains con-
firm the approach’s effectiveness and generalizability. Moreover, the extracted comprehension tasks
enhance the model’s performance on general LLM benchmarks, suggesting potential for enhancing
general language models across more domains. We hope our method can inspire further exploration
into adapting large language models with the use of large-scale unsupervised corpora, efficiently
empowering language models for downstream tasks in specialized areas.

R EFERENCES
Julio Cesar Salinas Alvarado, Karin Verspoor, and Timothy Baldwin. Domain adaption of named
entity recognition to support credit risk assessment. In ALTA, pp. 84–90. ACL, 2015.

Luisa Bentivogli, Bernardo Magnini, Ido Dagan, Hoa Trang Dang, and Danilo Giampiccolo. The
fifth PASCAL recognizing textual entailment challenge. In TAC. NIST, 2009.

Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richard-
son, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, and Peter Clark. Think you have
solved direct-answer question answering? try arc-da, the direct-answer AI2 reasoning challenge.
CoRR, abs/2102.03315, 2021.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: reasoning about
physical commonsense in natural language. In AAAI, pp. 7432–7439. AAAI Press, 2020.

10
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large anno-
tated corpus for learning natural language inference. In EMNLP, pp. 632–642. The Association
for Computational Linguistics, 2015.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal,
Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M.
Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,
Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford,
Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In NeurIPS, 2020.
Ilias Chalkidis. Chatgpt may pass the bar exam soon, but has a long way to go for the lexglue
benchmark. SSRN, 2023. URL https://papers.ssrn.com/sol3/papers.cfm?
abstract_id=4385460.
Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael J. Bommarito II, Ion Androutsopoulos,
Daniel Martin Katz, and Nikolaos Aletras. Lexglue: A benchmark dataset for legal language
understanding in english. In ACL (1), pp. 4310–4330. Association for Computational Linguistics,
2022.
Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang.
Convfinqa: Exploring the chain of numerical reasoning in conversational finance question an-
swering. In EMNLP, pp. 6279–6292. Association for Computational Linguistics, 2022.
Daixuan Cheng, Shaohan Huang, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Furu Wei, Denvy Deng,
and Qi Zhang. Snapshot-guided domain adaptation for ELECTRA. In EMNLP (Findings), pp.
2226–2232. Association for Computational Linguistics, 2022.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi
Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai,
Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams
Yu, Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi,
Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling
instruction-finetuned language models. CoRR, abs/2210.11416, 2022.
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina
Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In NAACL-HLT
(1), pp. 2924–2936. Association for Computational Linguistics, 2019.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. ELECTRA: pre-
training text encoders as discriminators rather than generators. In ICLR. OpenReview.net, 2020.
Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. Chatlaw: Open-source legal large
language model with integrated external knowledge bases. CoRR, abs/2306.16092, 2023.
Franck Dernoncourt and Ji Young Lee. Pubmed 200k RCT: a dataset for sequential sentence clas-
sification in medical abstracts. In IJCNLP, pp. 308–313. Asian Federation of Natural Language
Processing, 2017.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep
bidirectional transformers for language understanding. In NAACL-HLT (1), pp. 4171–4186. As-
sociation for Computational Linguistics, 2019.
William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases.
In IWP@IJCNLP. Asian Federation of Natural Language Processing, 2005.
Ondrej Dusek, David M. Howcroft, and Verena Rieser. Semantic noise matters for neural natural
language generation. In INLG, pp. 421–426. Association for Computational Linguistics, 2019.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason
Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile:
An 800gb dataset of diverse text for language modeling. CoRR, abs/2101.00027, 2021.

11
Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey
Levine, and Dawn Song. The false promise of imitating proprietary llms. CoRR, abs/2305.15717,
2023.
Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey,
and Noah A. Smith. Don’t stop pretraining: Adapt language models to domains and tasks. In
ACL, pp. 8342–8360. Association for Computational Linguistics, 2020.
Tianyu Han, Lisa C. Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser,
Alexander Löser, Daniel Truhn, and Keno K. Bressem. Medalpaca - an open-source collection of
medical conversational AI models and training data. CoRR, abs/2304.08247, 2023.
Peter Henderson, Mark S. Krass, Lucia Zheng, Neel Guha, Christopher D. Manning, Dan Jurafsky,
and Daniel E. Ho. Pile of law: Learning responsible data filtering from the law and a 256gb
open-source legal dataset. In NeurIPS, 2022.
Quzhe Huang, Mingxu Tao, Zhenwei An, Chen Zhang, Cong Jiang, Zhibin Chen, Zirui Wu, and
Yansong Feng. Lawyer llama technical report. CoRR, abs/2305.15062, 2023.
Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What dis-
ease does this patient have? A large-scale open domain question answering dataset from medical
exams. CoRR, abs/2009.13081, 2020.
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu. Pubmedqa: A
dataset for biomedical research question answering. In EMNLP/IJCNLP (1), pp. 2567–2577.
Association for Computational Linguistics, 2019.
Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Look-
ing beyond the surface: A challenge set for reading comprehension over multiple sentences. In
NAACL-HLT, pp. 252–262. Association for Computational Linguistics, 2018.
Jens Kringelum, Sonny Kim Kjærulff, Søren Brunak, Ole Lund, Tudor I. Oprea, and Olivier
Taboureau. Chemprot-3.0: a global chemical biology diseases mapping. Database J. Biol.
Databases Curation, 2016, 2016.
Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword
tokenizer and detokenizer for neural text processing. In EMNLP (Demonstration), pp. 66–71.
Association for Computational Linguistics, 2018.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris
Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion
Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav
Petrov. Natural questions: a benchmark for question answering research. Trans. Assoc. Comput.
Linguistics, 7:452–466, 2019.
Jieh-Sheng Lee. Lexgpt 0.1: pre-trained GPT-J models with pile of law. CoRR, abs/2306.05431,
2023.
Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Nau-
mann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assis-
tant for biomedicine in one day. CoRR, abs/2306.00890, 2023a.
Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, and You Zhang. Chatdoctor: A medical chat model
fine-tuned on llama model using medical domain knowledge. CoRR, abs/2303.14070, 2023b.
Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and
Xiang Ren. Commongen: A constrained text generation challenge for generative commonsense
reasoning. In EMNLP (Findings), volume EMNLP 2020 of Findings of ACL, pp. 1823–1840.
Association for Computational Linguistics, 2020.
Chen Ling, Xujiang Zhao, Jiaying Lu, Chengyuan Deng, Can Zheng, Junxiang Wang, Tanmoy
Chowdhury, Yun Li, Hejie Cui, Xuchao Zhang, Tianjiao Zhao, Amit Panalkar, Wei Cheng, Haoyu
Wang, Yanchi Liu, Zhengzhang Chen, Haifeng Chen, Chris White, Quanquan Gu, Carl Yang, and
Liang Zhao. Beyond one-model-fits-all: A survey of domain specialization for large language
models. CoRR, abs/2305.18703, 2023.

12
Marco Lippi, Przemyslaw Palka, Giuseppe Contissa, Francesca Lagioia, Hans-Wolfgang Micklitz,
Giovanni Sartor, and Paolo Torroni. CLAUDETTE: an automated detector of potentially unfair
clauses in online terms of service. Artif. Intell. Law, 27(2):117–139, 2019.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining
approach. CoRR, abs/1907.11692, 2019.

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V.
Le, Barret Zoph, Jason Wei, and Adam Roberts. The flan collection: Designing data and meth-
ods for effective instruction tuning. In ICML, volume 202 of Proceedings of Machine Learning
Research, pp. 22631–22648. PMLR, 2023a.

Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny
Zhou, Jason Wei, Kevin Robinson, David Mimno, and Daphne Ippolito. A pretrainer’s guide to
training data: Measuring the effects of data age, domain coverage, quality, & toxicity. CoRR,
abs/2305.13169, 2023b.

Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk,
and Alexandra Balahur. Www’18 open challenge: Financial opinion mining and question an-
swering. In WWW (Companion Volume), pp. 1941–1942. ACM, 2018.

Pekka Malo, Ankur Sinha, Pekka J. Korhonen, Jyrki Wallenius, and Pyry Takala. Good debt or
bad debt: Detecting semantic orientations in economic texts. J. Assoc. Inf. Sci. Technol., 65(4):
782–796, 2014.

Clara H. McCreery, Namit Katariya, Anitha Kannan, Manish Chablani, and Xavier Amatriain. Ef-
fective transfer learning for identifying similar questions: Matching user questions to COVID-19
faqs. In KDD, pp. 3458–3465. ACM, 2020.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct elec-
tricity? A new dataset for open book question answering. In EMNLP, pp. 2381–2391. Association
for Computational Linguistics, 2018.

Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and
Ahmed Hassan Awadallah. Orca: Progressive learning from complex explanation traces of GPT-
4. CoRR, abs/2306.02707, 2023.

Linyong Nan, Dragomir R. Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh,
Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, Yangxiaokang Liu, Nadia Irwanto,
Jessica Pan, Faiaz Rahman, Ahmad Zaidi, Mutethia Mutuma, Yasin Tarabar, Ankit Gupta, Tao
Yu, Yi Chern Tan, Xi Victoria Lin, Caiming Xiong, Richard Socher, and Nazneen Fatema Rajani.
DART: open-domain structured data record to text generation. In NAACL-HLT, pp. 432–447.
Association for Computational Linguistics, 2021.

Courtney Napoles, Matthew R. Gormley, and Benjamin Van Durme. Annotated gigaword. In AKBC-
WEKEX@NAACL-HLT, pp. 95–100. Association for Computational Linguistics, 2012.

OpenAI. Gpt-4 technical report. Technical report, OpenAI, March 2023. URL https://cdn.
openai.com/papers/gpt-4.pdf.

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale
multi-subject multi-choice dataset for medical domain question answering. In CHIL, volume 174
of Proceedings of Machine Learning Research, pp. 248–260. PMLR, 2022.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick S. H. Lewis, Anton Bakhtin, Yuxiang
Wu, and Alexander H. Miller. Language models as knowledge bases? In EMNLP/IJCNLP (1),
pp. 2463–2473. Association for Computational Linguistics, 2019.

Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-
training. 2018.

13
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100, 000+ questions for
machine comprehension of text. In EMNLP, pp. 2383–2392. The Association for Computational
Linguistics, 2016.
Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions
for squad. In ACL (2), pp. 784–789. Association for Computational Linguistics, 2018.
Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alterna-
tives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical
Formalizations of Commonsense Reasoning. AAAI, 2011.
Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan
Scales, Ajay Kumar Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Senevi-
ratne, Paul Gamble, Chris Kelly, Nathaneal Schärli, Aakanksha Chowdhery, Philip Andrew Mans-
field, Blaise Agüera y Arcas, Dale R. Webster, Gregory S. Corrado, Yossi Matias, Katherine Chou,
Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle K. Barral, Christopher Semturs,
Alan Karthikesalingam, and Vivek Natarajan. Large language models encode clinical knowledge.
CoRR, abs/2212.13138, 2022.
Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen
Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin,
Sami Lachgar, Philip Andrew Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska,
Blaise Agüera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara
Mahdavi, Joelle K. Barral, Dale R. Webster, Gregory S. Corrado, Yossi Matias, Shekoofeh Azizi,
Alan Karthikesalingam, and Vivek Natarajan. Towards expert-level medical question answering
with large language models. CoRR, abs/2305.09617, 2023.
Ankur Sinha and Tanmay Khandait. Impact of news on the commodity market: Dataset and results.
CoRR, abs/2009.04202, 2020.
Harold J. Spaeth, Lee Epstein, Jeffrey A. Segal Andrew D. Martin, Theodore J. Ruger, and Sara C.
Benesh. Supreme Court Database, Version 2020 Release 01. Washington University Law, 2020.
URL http://Supremecourtdatabase.org.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question
answering challenge targeting commonsense knowledge. In NAACL-HLT (1), pp. 4149–4158.
Association for Computational Linguistics, 2019.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée
Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Ar-
mand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation
language models. CoRR, abs/2302.13971, 2023.
Don Tuggener, Pius von Däniken, Thomas Peetz, and Mark Cieliebak. LEDGAR: A large-scale
multi-label corpus for text classification of legal provisions in contracts. In LREC, pp. 1235–
1241. European Language Resources Association, 2020.
Mozes van de Kar, Mengzhou Xia, Danqi Chen, and Mikel Artetxe. Don’t prompt, search! mining-
based zero-shot learning with language models. In EMNLP, pp. 7508–7520. Association for
Computational Linguistics, 2022.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman.
GLUE: A multi-task benchmark and analysis platform for natural language understanding. In
ICLR (Poster). OpenReview.net, 2019.
Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language
Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. Huatuo:
Tuning llama model with chinese medical knowledge. CoRR, abs/2304.06975, 2023.
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du,
Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In ICLR.
OpenReview.net, 2022.

14
Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for
sentence understanding through inference. In NAACL-HLT, pp. 1112–1122. Association for Com-
putational Linguistics, 2018.
Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-llama: Further fine-
tuning llama on medical papers. CoRR, abs/2304.14454, 2023a.
Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prab-
hanjan Kambadur, David S. Rosenberg, and Gideon Mann. Bloomberggpt: A large language
model for finance. CoRR, abs/2303.17564, 2023b.
Honglin Xiong, Sheng Wang, Yitao Zhu, Zihao Zhao, Yuxiao Liu, Linlin Huang, Qian Wang, and
Dinggang Shen. Doctorglm: Fine-tuning your chinese doctor is not a herculean task. CoRR,
abs/2304.01097, 2023.
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and
Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions.
CoRR, abs/2304.12244, 2023.
Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. Fingpt: Open-source financial large
language models. CoRR, abs/2306.06031, 2023.
Yunzhi Yao, Shaohan Huang, Wenhui Wang, Li Dong, and Furu Wei. Adapt-and-distill: Developing
small, fast and effective pretrained language models for domains. In ACL/IJCNLP (Findings),
volume ACL/IJCNLP 2021 of Findings of ACL, pp. 460–470. Association for Computational
Linguistics, 2021.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a ma-
chine really finish your sentence? In ACL (1), pp. 4791–4800. Association for Computational
Linguistics, 2019.
Rui Zhang and Joel R. Tetreault. This email could save your life: Introducing the task of email
subject line generation. In ACL (1), pp. 446–456. Association for Computational Linguistics,
2019.
Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional networks for text
classification. In NIPS, pp. 649–657, 2015.
Yuan Zhang, Jason Baldridge, and Luheng He. PAWS: paraphrase adversaries from word scram-
bling. In NAACL-HLT (1), pp. 1298–1308. Association for Computational Linguistics, 2019.
Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. When does
pretraining help?: assessing self-supervised learning for law and the casehold dataset of 53, 000+
legal holdings. In ICAIL, pp. 159–168. ACM, 2021.
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat,
Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy.
LIMA: less is more for alignment. CoRR, abs/2305.11206, 2023.

15
A D OMAIN K NOWLEDGE P ROBING

We devise domain knowledge probing tests to determine whether continued training on the domain-
specific texts can enhance the model’s domain-specific knowledge. Our probing test design is in-
spired by LAMA (Petroni et al., 2019), where the task format closely resembles the pre-training
task. This allows us to analyze the model’s inherent knowledge without altering its architecture
(e.g., adding a model head) or parameters (e.g., fine-tuning). LAMA utilizes “fill-in-the-blank”
cloze statements to match the masked language modeling task of BERT (Devlin et al., 2019). Sim-
ilarly, we create “predict-the-next-token/sentence” tests to align with the casual language modeling
tasks of generative language models (Radford & Narasimhan, 2018). Table 6 presents the knowl-
edge probing results in the biomedicine and law domains. We observe that continued training on
domain-specific raw/reading comprehension texts indeed imparts the large language model with new
domain knowledge.

Table 6: Domain knowledge probing results. Raw Text is vanilla domain adaptive pre-training
(DAPT) using raw texts, Read. Compre. trains on the reading comprehension texts.

Domain General LLM Raw Text Read. Compre.

BioMed. 36.5 36.9 36.8
Law 45.0 45.6 46.4

Biomedicine. To create a knowledge probing test for the biomedicine domain, we utilize the MedM-
CQA (Pal et al., 2022) dataset. This dataset comprises numerous high-quality multiple-choice ques-
tions, covering diverse healthcare topics and 21 medical subjects. To align the testing format with
casual language modeling, we exclude data samples in the instruction-following format. These in-
clude samples starting with question words like “What”, “Who” and “When”, or ending with “:”,
“?”, and “-”. Additionally, samples having the fill-in-the-blank marker “ ” are also removed.
The evaluation is similar to zero-shot prompting: we feed into the model the raw data input, without
introducing any task descriptions or demonstrations, and then compare per-token-likelihood of each
option to get the model prediction. This evaluation is conducted individually for the 21 medical
subjects, and the average score across all subjects is reported.
Law. For the Law domain knowledge probing, we employ the LEDGAR dataset (Tuggener et al.,
2020). This dataset is designed for contract provision classification and encompasses a wide spec-
trum of 100 distinct law topics. Each label represents the principal topic of the given contract
provision. Originally structured as a 100-classification task, we adapt it for knowledge probing by
simplifying it into a four-choice question format. For each data sample, we preserve the label class
and randomly select three additional classes to create the four candidate options.
Similar to biomedicine knowledge probing, we feed into the model the data input using the template
“{CONTRACT} The topic is”, and then compare per-token-likelihood of each option to get the
model prediction. The evaluation is performed individually for each of the 100 law topics, and the
average score across all topics is reported.

B D OMAIN - ADAPTIVE P RE - TRAINING

Table 7 presents specifications of the pre-training corpora in each domain and Table 8 presents pre-
training hyper-parameters. A <pad> token is added to the model vocabulary for sentence padding.
In each domain, we explore different ratios for mixing reading comprehension data with general
instructions, specifically considering ratios of 1 : 2, 1 : 1, and 2 : 1. The end-of-sentence token
</s> is used to concatenate between documents, where a document could be a raw text, a reading
comprehension text, or a general instruction.

16
Table 7: Pre-training corpora.

Domain Data Source Raw Size # Tokens # Docs

BioMed. PubMed Abstracts (Gao et al., 2021) 19.3 GiB 5.4 B 15.5 M
Finance Stock News 5.1 GiB 1.2 B 1.1 M
Law FreeLaw Opinions (Gao et al., 2021) 51.2 GiB 16.7 B 3.6 M

Table 8: Hyper-parameters of domain-adaptive pre-training.

Hyperparameter Assignment
Computing infrastructure 32 V100-32GB GPUs
Runtime 24 Hours
Number of steps 10,000
Batch size 32
Maximum sequence length 2,048
Maximum learning rate 1e-5
Optimizer Adam
Adam beta weights 0.9, 0.95
Learning rate scheduler cosine
Weight decay 0.1
Warmup steps 1000
Gradient clipping 1.0
Dropout ratio 0.1

C C REATING R EADING C OMPREHENSION T EXTS

Title Collection for Summarization Tasks. In the biomedicine domain, the title for each raw text
in PubMed Abstracts (Gao et al., 2021) is the first sentence within the text, separated from other
sentences by a newline character \n. In the finance domain, we specifically collect the titles when
gathering news using the FinGPT codebase (Yang et al., 2023). In FreeLaw Opinions corpora (Gao
et al., 2021) of the law domain, there are no explicit titles for individual raw texts. Instead, titles are
available for some sections within each text. Hence, we initially divide each raw text into sections
and gather as many titles as possible from these sections. Subsequently, we consider one section,
rather than an entire raw text, as the basis for creating one reading comprehension text.
Regex Pattern Implementation. For tasks which employ regex patterns to mine task examples,
we fill in the patterns with the corresponding verbalizer and identify sentences that match the pat-
terns. This process of expanding patterns into regular expressions follows van de Kar et al. (2022):
{VERBAL} is substituted with a capturing group that incorporates all verbalizers, separated by the
alternation operator |. For instance, the verbalizer set Therefore, Thus, Hence is expanded to
(Therefore|Thus|Hence). Subsequently, the keywords listed in Table 9 are replaced with the
corresponding regular expressions. The result is a regular expression containing capturing groups
for extracting sentences.
Data Pre-processing and Post-processing. Before mining task examples, we truncate each raw text
to its initial 1,800 tokens, enabling the insertion of comprehension tasks within a maximum sequence
length of 2,048. After constructing reading comprehension texts, we wrap each reading comprehen-
sion text using this template: {"text": "ReadCompre"}, by replacing the ReadCompre with one
piece of reading comprehension text. This template is used to explicitly separate each reading com-
prehension text from others, so that the model can focus on the context of the current document.

D D OMAIN - SPECIFIC TASKS E VALUATION

Prompting. In prompting evaluation, each task corresponds to multiple prompt templates and we
randomly sample one of them for each data example, to mitigate result variance caused by template

17
Table 9: Keywords that compile into regular expressions. These keywords are used in the mining
patterns and verbalizers (van de Kar et al., 2022).

Keyword Regex
{VERBAL} Replaced with the verbalizer
regex: ([ˆ.!?\n,;\"\s]{10,})
{WORD}
Matches a single word having more than 9 characters
regex: ([ˆ.!?\n]{50,}[.!?]+)
{SENT}
Matches a single sentence having more than 50 characters

sensitivity. Prompt template examples are presented in Table 11. Following Brown et al. (2020), we
classify tasks into two question types to get model predictions: 1) For multiple-choice questions,
we compare the per-token likelihood of each option to determine the model prediction; 2) For text
completion questions, we employ greedy search to get the free-form answer.
Our prompting settings in the finance domain follow BloombergGPT (Wu et al., 2023b), with the
exception that we use multiple templates to address template sensitivity. The prompt templates for
law domain are based on Chalkidis (2023). The UNFAIR-ToS (Lippi et al., 2019) task is a multi-
label classification task. To get model predictions for this task, we categorize it as a multiple-choice
question. The accuracy of an individual data example is considered true if the model prediction
(i.e., the option with the highest per-token likelihood) belongs to the label(s) set. In the biomedicine
domain, some classification tasks, including MQP (McCreery et al., 2020), RCT (Dernoncourt &
Lee, 2017), and ChemProt (Kringelum et al., 2016), are too challenging for the model, thus we
conduct few-shot prompting and maintain the number of demonstrations the same in each class.
Fine-tuning. In fine-tuning, we utilize a fixed prompt template (the one displayed in Table 11) for
each task to convert input-output into question-answering pairs. The model is then trained on these
pairs for one epoch with the warm-up step as 0, and we compute the loss only on the tokens of the
output answer of each training example (Mukherjee et al., 2023). All other training settings are the
same with domain-adaptive pre-training. Fine-tuning evaluation is similar to prompting evaluation,
but with two differences to align with the fine-tuning training stage: no demonstration is presented
before the prompt input, and the prompt template is the same with the one used in the training stage.

Table 10: Specifications of the domain-specific task datasets. # Demos is the number of demon-
strations in prompting evaluation.

Task Type Metric # Demos

BioMed.
MQP (McCreery et al., 2020) Binary classification Accuracy 4
PubMedQA (Jin et al., 2019) Binary classification Accuracy 0
USMLE (Jin et al., 2020) Multi-chioice QA Accuracy 0
RCT (Dernoncourt & Lee, 2017) Multi-class classification Micro F1 10
ChemProt (Kringelum et al., 2016) Multi-class classification Micro F1 13
Finance
FiQA SA (Maia et al., 2018) Multi-class classification Weighted F1 5
FPB (Malo et al., 2014) Multi-class classification Weighted F1 5
NER (Alvarado et al., 2015) Named entity recongnition Entity-level F1 20
Headline (Sinha & Khandait, 2020) Binary-class classification Weighted F1 5
ConvFinQA (Chen et al., 2022) Question Answering Exact Match 0
Law
SCOTUS (Spaeth et al., 2020) Multi-class classification Micro/Macro F1 0
CaseHOLD (Zheng et al., 2021) Multi-chioice QA Micro/Macro F1 0
UNFAIR-ToS (Lippi et al., 2019) Multi-label classification Accuracy 4

18
Table 11: Prompt templates. Each template example is paraphrased to multiple variations for
prompting evaluation.

Task Template
BioMed.
Question 1: {QUESTION1}
MQP Question 2: {QUESTION2}
Are questions 1 and 2 asking the same thing? {ANSWER}
Context: {CONTEXT}
PubMedQA Question: {QUESTION}
Answer: {ANSWER}
Question: {QUESTION}
USMLE
Answer: {ANSWER}
{SENTENCE}
RCT Question: what is the role of this sentence in an abstract?
Answer: {ANSWER}
{SENTENCE}
ChemProt Question: what is the relation?
Answer: {ANSWER}
Finance
{SENTENCE}
FiQA SA Question: what is the sentiment on {TARGET}?
Answer: {ANSWER}
{SENTENCE}
FPB Question: what is the sentiment?
Answer: {ANSWER}
{SENTENCE}
NER
Extract named entity: {ANSWER}
{SENTENCE}
Headline Question: {QUESTION}
Answer: {ANSWER}
{CONTEXT}
ConvFinQA {PREVIOUS QAS}
{QUESTION} {ANSWER}
Law
Given the following opinion from the Supreme Court of USA (SCOTUS):
SCOTUS "{TEXT}"
The relevant issue area is: {ANSWER}
Complete the following excerpt from a US court opinion:
CaseHOLD
{CONTEXT}: {ANSWER}
Given the following sentence from an online Term of Services:
UNFAIR-ToS "{SENTENCE}"
The sentence is unfair with respect to: {ANSWER}

19
E F URTHER A BLATIONS ON C OMPREHENSION TASKS
Figure 4 presents the percentages of mined examples of each task type in all the comprehension
task examples, with Word-To-Text, Summarization, and Text Completion accounting for the highest
ratios.

Paraphrase, Summarize,
Paraphrase, 3.5 Paraphrase, 7.8
Common 4.9 7.2
NLI, 2.7
Reason, 0.3 Text
Text Text
Comple., Summarize, Common
Comple., Summarize, Comple.,
17.4 26.3 Reason,
20.7 31.4 Common 22.0
0.5
Reason,
1.0 Word-to-
Word-to- NLI, 2.2 Word-to- Text, 62.3
Text, 50.3 Text, 35.8
NLI, 3.8

Biomedicine Finance Law

Figure 4: Percentages of mined examples of each task type in all the comprehension task ex-
amples.

In the biomedicine domain, we conduct ablations on each comprehension task type by systemat-
ically removing each task type from the reading comprehension texts. We then use the resulting
modified reading comprehension texts to train the general model. Subsequently, we evaluate these
trained models on both domain-specific tasks and general benchmarks to analyze the impacts of
these ablations.

Domain Tasks General Tasks

45 80 w/o
All Comm.

44
60
Domain Task Score

General Task Score

43
w/o
All Para.
40
w/o All
42 NLI

20 All w/o
41 w/o Word. w/o
All
Summ. All Text.

40
All w/o w/o w/o w/o w/o w/o 0
Summ. Word. NLI Common. Para. Text Summarize Word-to-Text NLI Common Paraphrase Text
Com. Reason Completion

Figure 5: Prompting scores of domain-specific tasks (left) and general benchmarks (right) of
models trained with different comprehension tasks. All denotes the model trained with all the com-
prehension tasks, while w/o Summ. represents the model trained with the comprehension tasks
excluding Summarization tasks, w/o Word. represents the model trained with the comprehension
tasks excluding Word-to-Text tasks, and so on. We report the average task scores within each do-
main/type.

Domain-specific Tasks. As shown in Figure 5, when evaluating on the domain-specific tasks, the
removal of any comprehension task type leads to a decrease in task performance, showing their con-
tributions to these domain-specific tasks. Notably, removing Word-to-Text, Summarization, or Text
Completion tasks results in a noticeable drop in performance, aligning with the high percentages of
these tasks in the mined examples. Interestingly, even though the Natural Language Inference task
type doesn’t constitute a large portion of the comprehension tasks, its removal leads to a substantial
decrease in performance. This could be attributed to its unique role as the sole classification task
type within all the comprehension tasks. In contrast, the impact of removing Commonsense Rea-
soning and Paraphrase Detection tasks is less pronounced, reflecting their lower percentages in the

20
mined task examples. However, this also suggests the potential benefits of including more diverse
task examples, which could further enhance domain-specific task performance.
General LLM Benchmarks. Additionally, we conduct experiments where we remove a specific
task type from all the comprehension tasks. We then evaluate the model’s performance specifically
on the general tasks corresponding to the removed task type, aiming to demonstrate whether the
comprehension tasks have a positive impact on the respective downstream tasks. In the results for
general tasks in Figure 5, when we exclude a particular task type from the comprehension tasks,
we observe performance declines in the corresponding removed tasks, specifically for Summariza-
tion, Word-to-Text, Natural Language Inference, and Commonsense Reasoning tasks. This suggests
a beneficial connection between the trained comprehension tasks and their corresponding down-
stream tasks. However, when we remove Paraphrase Detection or Text Completion, it does not lead
to a performance decline in the corresponding tasks. This discrepancy may be attributed to the refor-
matting of Paraphrase Detection from a classification task to a generation task in the comprehension
tasks, causing a mismatch between training and evaluation settings. Furthermore, the Text Com-
prehension comprehension task type lacks obvious counterparts in the general LLM benchmarks,
which may contribute to the mismatch in the performance change trend.

F A NALYSIS OF D OMAIN K NOWLEDGE AND P ROMPTING A BILITY

Table 12: Fine-tuning performance on the domain-specific tasks of general large language model
(General LLM), the model trained on domain-specific raw corpora (Raw Text) and the model
trained on the reading comprehension texts constructed based on the raw corpora (Read. Compre.).

BioMed. PubMedQA ChemProt MQP RCT UMSLE AVERAGE

General LLM 75.4 64.6 55.4 87.0 38.5 64.2
Raw Text 76.2 64.8 65.6 87.0 39.0 66.5
Read. Compre. 76.0 65.4 87.9 87.5 41.0 71.5

Finance ConvFinQA FPB FiQA SA Headline NER AVERAGE

General LLM 58.1 81.9 86.4 95.7 77.5 79.9
Raw Text 56.2 83.3 87.9 95.8 81.3 80.9
Read. Compre. 57.2 88.6 83.1 96.1 82.5 81.5

SCOTUS CaseHOLD
Law UNFAIR-ToS AVERAGE
mic-F1 mac-F1 mic-F1 mac-F1
General LLM 31.7 14.0 35.3 35.3 93.8 42.0
Raw Text 36.7 26.0 35.4 35.4 93.7 45.4
Read. Compre. 40.0 26.0 35.5 35.5 94.2 46.2

21
Table 13: Prompting results on general LLM benchmarks. Raw trains on the raw texts, and
Read trains on the reading comprehension texts. Text Completion is more close to a question type
than a task type in general benchmarks, so we report the average of all the tasks following the
free-form text completion question type. Each task corresponds to multiple prompt templates taken
from FLAN (Wei et al., 2022), and we remove the option suffixes from the templates to fit for the
prompting evaluation approach by Brown et al. (2020).

General BioMed. Finance Law

Task Metric
LLM Raw Read Raw Read Raw Read
Summarization
AGNews (Zhang et al., 2015) Acc 58.7 51.7 55.5 56.1 50.1 57.8 60.6
R-1 1.5 3.6 7.5 1.9 10.8 3.4 7.4
AESLC (Zhang & Tetreault, 2019) R-2 0.2 0.9 2.8 0.3 3.8 0.8 2.7
R-L 1.5 3.6 7.2 1.8 10.3 3.3 7.2
R-1 0.6 3.8 9.3 3.1 13.2 3.4 9.8
Gigaword (Napoles et al., 2012) R-2 0.1 0.7 2.4 0.6 3.8 0.6 2.5
R-L 0.6 3.5 8.5 2.9 12.1 3.1 8.9
Word-to-Text
R-1 14.2 16.1 24.9 17.2 24.2 17.8 27.9
CommonGen (Lin et al., 2020) R-2 0.0 1.6 5.9 1.5 6.2 2.5 7.2
R-L 14.2 15.4 22.2 16.2 21.1 16.7 24.3
R-1 17.0 21.3 20.0 21.9 22.3 23.6 19.5
DART (Nan et al., 2021) R-2 3.7 6.0 5.0 6.7 7.1 7.5 5.9
R-L 16.0 19.6 18.5 20.1 20.5 21.4 18.0
R-1 14.3 18.4 27.6 22.6 37.3 18.4 28.3
E2ENLG (Dusek et al., 2019) R-2 3.2 5.2 9.8 7.5 15.1 5.8 10.5
R-L 13.4 17.1 24.5 20.4 32.9 17.0 25.5
Natural Language Inference
MNLI-m (Williams et al., 2018) Acc 35.6 35.0 36.7 36.2 37.4 34.8 34.9
MNLI-mm (Williams et al., 2018) Acc 34.9 34.9 36.1 36.8 37.2 35.7 34.4
QNLI (Rajpurkar et al., 2018) Acc 51.1 52.4 53.7 52.1 52.0 50.7 51.9
RTE (Bentivogli et al., 2009) Acc 30.7 9.4 30.0 23.1 26.0 18.8 32.9
SNLI (Bowman et al., 2015) Acc 34.0 34.5 33.8 36.8 38.1 34.7 33.9
Commonsense Reasoning
COPA (Roemmele et al., 2011) Acc 71.0 68.0 75.0 74.0 73.0 72.0 74.0
PIQA (Bisk et al., 2020) Acc 71.7 70.9 71.3 71.6 71.7 71.9 71.8
HellaSwag (Zellers et al., 2019) Acc 71.8 71.8 72.6 72.5 73.0 72.0 72.5
Paraphrase Detection
Acc 35.5 31.6 34.8 33.3 37.3 34.1 36.0
MRPC (Dolan & Brockett, 2005)
F1 17.0 0.0 13.6 9.9 22.4 8.8 19.7
Acc 56.0 62.7 61.1 52.5 54.3 60.6 59.9
QQP (Wang et al., 2019)
F1 34.0 3.0 10.7 33.1 44.5 15.3 14.4
Paws Wiki (Zhang et al., 2019) Acc 54.8 55.6 54.9 53.7 53.7 55.4 54.8
Closed-book QA
ARC-c (Bhakthavatsalam et al., 2021) Acc 37.3 36.7 39.0 37.8 38.8 37.6 39.3
ARC-e (Bhakthavatsalam et al., 2021) Acc 58.5 59.2 63.1 59.4 62.3 60.0 62.9
EM 2.2 0.1 0.5 0.1 0.0 0.1 0.1
NQ (Kwiatkowski et al., 2019)
F1 3.4 0.8 1.9 1.2 2.1 1.3 1.8
CMSQA (Talmor et al., 2019) Acc 39.6 40.4 43.5 40.0 42.6 40.9 44.6
Reading Comprehension
BoolQ (Clark et al., 2019) Acc 55.7 42.6 50.7 53.4 55.8 51.0 53.9
OBQA (Mihaylov et al., 2018) Acc 46.2 45.8 46.4 46.0 47.0 45.4 47.2
EM 0.1 1.0 3.5 0.2 0.0 0.3 2.0
SQuADv1 (Rajpurkar et al., 2016)
F1 0.2 4.6 10.0 0.6 7.3 6.3 10.7
MultiRC (Khashabi et al., 2018) Acc 55.9 47.5 53.0 49.9 49.7 52.5 52.4

22
G C ASES OF R EADING C OMPREHENSION T EXTS

Here is the first part of an article about biomedicine: Recent reported

evidence indicates that vocal cord carcinoma is evolving similarly to
Raw Text oropharyngeal cancer with an increasing number of patients without a
smoking history having human papillomavirus (HPV) disease. Observations
also suggest that an increasing number of patients who present with glottic
carcinoma are younger than has been reported in the past. (…TL; DR...)
However, recent published evidence shows that glottic carcinoma can be an
Summarization HPV-related disease with increasing incidence in nonsmokers. It isn't
surprising that alternate malignant pathways may have a different timeline.
Answer questions based on the article:
What is a summary? Glottic Carcinoma in Young Patients.
Generate a sentence that includes these biomedicine words [carcinoma,
Word-to-Text
oropharyngeal, papillomavirus]: Recent reported evidence indicates that
vocal cord carcinoma is evolving similarly to oropharyngeal cancer with an
increasing number of patients without a smoking history having human
papillomavirus (HPV) disease.
Natural Language Premise: Observations also suggest that an increasing number of
Inference patients who present with glottic carcinoma are younger than has been
reported in the past.
Hypothesis: An investigation was done to examine the incidence of
glottic carcinoma in patients 30 years old (y/o) or younger.
Does the premise entail the hypothesis? Yes
Commonsense
Reasoning What is the reason for "None of the 11 glottic cancer patients had been
previously treated for benign recurrent respiratory papillomatosis
although it was initially suspected prior to biopsy"? The morphology of
the lesions and the patients' young age.
Compose a sentence that contradicts the meaning of "Historically,
Paraphrase glottic carcinoma is considered to be a tobacco-induced disease.”.
Detection Answer: Recent published evidence shows that glottic carcinoma can be an
HPV-related disease with increasing incidence in nonsmokers.
How would you complete the article? This finding further supports the
concept that glottic carcinoma is an evolving disease, and it demonstrates
the increasing importance of discriminating potential glottic carcinomas in
Text Completion young patients from benign low-risk HPV recurrent respiratory
papillomatosis.

Figure 6: An example of a reading comprehension text constructed from a raw text. The
underlined sentence is added to guide the model to answer questions based the given context.

23
Table 14: Case of a reading comprehension text in biomedicine domain. Certain portions are
omitted for brevity and are represented as (...).

Pancreastatin (PST), a chromogranin A-derived peptide, has been found to modulate glucose,
lipid, and protein metabolism in rat adipocytes. PST has an overall counterregulatory effect on
insulin action by activating a specific receptor-effector system (Galpha(q/11) protein-PLC-beta-
PKC(classical)). However, PST stimulates both basal and insulin-mediated protein synthesis in rat
adipocytes. In order to further investigate the mechanisms underlying the effect of PST stimulating
protein synthesis, we sought to study the regulation of different components of the core translational
machinery by the signaling triggered by PST. Thus, we studied ribosomal p70 S6 kinase, phosphory-
lation of the cap-binding protein (initiation factor) eIF4E, and phosphorylation of the eIF4E-binding
protein 4E-BP1 (PHAS-I). We have found that PST stimulates the S6 kinase activity, as assessed by
kinase assay using specific immunoprecipitates and substrate. This effect was checked by Western
blot with specific antibodies against the phosphorylated S6 kinase. Thus, PST dose-dependently
stimulates Thr421/Ser424 phosphorylation of S6 kinase. Moreover, PST promotes phosphorylation
of regulatory sites in 4E-BP1 (PHAS-I) (Thr37, Thr46). The initiation factor eIF4E itself, whose
activity is also increased upon phosphorylation, is phosphorylated in Ser209 by PST stimulation.
(...)
Use evidence from the biomedicine article to answer these questions:

Assess the relationship between Sentence 1: “This effect was checked by Western blot with specific
antibodies against the phosphorylated S6 kinase.”
Sentence 2: “PST dose-dependently stimulates Thr421/Ser424 phosphorylation of S6 kinase.”
Is it characterized as Entailment, Neutral, or Contradiction? Entailment

Assess the relationship between Sentence 1: “PST has an overall counterregulatory effect on
insulin action by activating a specific receptor-effector system (Galpha(q/11) protein-PLC-beta-
PKC(classical)).”
Sentence 2: “PST stimulates both basal and insulin-mediated protein synthesis in rat adipocytes.”
Is it characterized as Entailment, Neutral, or Contradiction? Contradiction

Next question: What is the reason of the following sentence?

We studied ribosomal p70 S6 kinase, phosphorylation of the cap-binding protein (initiation factor)
eIF4E, and phosphorylation of the eIF4E-binding protein 4E-BP1 (PHAS-I). In order to further
investigate the mechanisms underlying the effect of PST stimulating protein synthesis, we sought
to study the regulation of different components of the core translational machinery by the signaling
triggered by PST.

Next question: What is the reason of the following sentence?

PST dose-dependently stimulates Thr421/Ser424 phosphorylation of S6 kinase. This effect was
checked by Western blot with specific antibodies against the phosphorylated S6 kinase.

“PST has an overall counterregulatory effect on insulin action by activating a specific receptor-
effector system (Galpha(q/11) protein-PLC-beta-PKC(classical)).” Generate a sentence that ex-
presses a contrasting idea to the previous statement. PST stimulates both basal and insulin-mediated
protein synthesis in rat adipocytes.

Here are some concepts about biomedicine: mechanisms, regulation, translational.

Generate a detailed description of them In order to further investigate the mechanisms underlying
the effect of PST stimulating protein synthesis, we sought to study the regulation of different com-
ponents of the core translational machinery by the signaling triggered by PST.

Here are some concepts about biomedicine: phosphorylation, stimulation, phosphorylated.

Generate a detailed description of them The initiation factor eIF4E itself, whose activity is also
increased upon phosphorylation, is phosphorylated in Ser209 by PST stimulation.

Briefly summarize this text. Pancreastatin, a chromogranin A-derived peptide, activates protein
synthesis signaling cascade in rat adipocytes.

24
Table 15: Case of a reading comprehension text in finance domain. Certain portions are omitted
for brevity and are represented as (...).

Read the beginning of an article on finance: In this article, we discuss the 12 biggest commer-
cial janitorial companies in USA. If you want to skip our detailed analysis of these companies, go
directly to the 5 Biggest Commercial Janitorial Companies In USA. According to Statista, the jani-
torial services sector’s market size will increase by 6.6 percent in 2022. The annualized percentage
of this market’s growth was 6.7% between 2017 and 2022. Additionally, between 2020 and 2021,
the number of janitors and cleaners employed in the United States rose by nearly 50,000. By 2024,
it is predicted that the U.S. janitorial services industry will generate around $50.1 billion in sales.
The market size, measured by revenue, of the janitorial services industry is $97.6 billion in 2022.
Businesses frequently grow when corporate profits increase, raising demand for janitorial services.
However, dwindling corporate profit decreases demand for janitorial services as companies close fa-
cilities and cut back on the frequency of contracted cleaning to cut expenses. The industry may have
an opportunity in 2022 as corporate profits are anticipated to rise, according to ibisworld. According
to Allied Market Research, the cleaning services market is to reach $111.49 billion globally by 2030
at a 6.5% CAGR, as reported by Bloomberg. The global cleaning services industry is expanding
due to service providers expanding their online presence and rising commercial consumer demand.
However, heightened rivalry and the introduction of new companies limit market expansion. On the
other hand, the demand for construction and post-construction cleaning services will open up new
prospects. The Covid-19 pandemic has had a favorable effect on the global cleaning services busi-
ness. Deep cleaning and disinfection have become more popular among residential and commercial
users to reduce the risk of cross-contamination. (...)

How would you extend the article? Moore adds that Casella Waste Systems, Inc. (NASDAQ:CWST)
is also the only public waste company its size not to be acquired, which provides ”downside support
to the valuation on a takeout potential”. 09. Stericycle, Inc. (NASDAQ: SRCL) Market Capitaliza-
tion as of November 18, 2022: $4.73 billion Stericycle, Inc. (NASDAQ:SRCL) was incorporated in
1989 and is located in Bannockburn, Illinois. (...)

Then, answer the following questions based on the whole article:

Please write a short summary for the above article. 12 Biggest Commercial Janitorial Companies In
USA

Given the sentence: The global cleaning services industry is expanding due to service providers
expanding their online presence and rising commercial consumer demand.
Select the most suitable word to connect it with: “Heightened rivalry and the introduction of new
companies limit market expansion.”
Options: Therefore, However, Maybe However

Answer the following question. The global cleaning services industry is expanding due to? Service
providers expanding their online presence and rising commercial consumer demand.

Given the sentence “Businesses frequently grow when corporate profits increase, raising demand
for janitorial services.”, come up with a different sentence that contradicts its meaning. Dwindling
corporate profit decreases demand for janitorial services as companies close facilities and cut back
on the frequency of contracted cleaning to cut expenses.

Given the sentence “The global cleaning services industry is expanding due to service providers
expanding their online presence and rising commercial consumer demand.”, come up with a different
sentence that contradicts its meaning. Heightened rivalry and the introduction of new companies
limit market expansion.

Generate a sentence that includes the following finance keywords: Capitalization, Incorporated,
Industries ABM Industries Incorporated (NYSE: ABM) Market Capitalization as of November 18,
2022: $2.98 billion ABM Industries Incorporated (NYSE:ABM) was incorporated in 1985 and is
based in New York, New York.

25
Table 16: Case of a reading comprehension text in law domain. Certain portions are omitted for
brevity and are represented as (...).

Here is the first part of an article about law: The district court ordered Arledge to pay restitution
in the amount of $5,829,334.90, without interest, to the Settlement Fund pursuant to the Mandatory
Victims Restitution Act of 1996 (”MVRA”), 18 U.S.C. §3663A. Arledge disputes the calculation
used to determine the amount of loss, claiming that the government did not sufficiently prove that
the Settlement Fund had paid false claims arising from Arledge’s illegal conduct. Specifically, he
objects to the calculation of losses related to the Fen Phen II settlement.
The “general rule is that a district court can award restitution to victims of the offense, but the
restitution award can encompass only those losses that resulted directly from the offense for which
the defendant was convicted.” United States v. Maturin, 488 F.3d 657, 660-61 (5th Cir. 2007)
(citing Hughey v. United States, 495 U.S. 411, 413, 110 S.Ct. 1979, 109 L.Ed.2d 408 (1990)). The
pre-sentence report attributed forty-seven fraudulent claims to the offenses for which Arledge was
convicted. There were three categories of evidence used to substantiate the government’s assertion
that these claims resulted from Arledge’s illegal conduct: (1) the testimony of Wyatt, an employee
of S&A, who created fraudulent documents; (...)

Please complete it: We have stated repeatedly that an order of restitution must be limited to losses
caused by the specific conduct underlying the offense of conviction. See United States v. Griffin,
324 F.3d 330, 367 (5th Cir.2003) (holding that restitution is restricted to the limits of the offense);
Tencer, 107 F.3d at 1135-36. Accordingly, we decline the government’s request to adopt a harmless
error analysis for the calculation of loss under the MVRA, and we remand for a re calculation of
actual loss based upon the evidence in the record.

Then, answer the following questions based on the whole article:

Based on the statement Arledge makes no convincing argument regarding why the identification by
AHP is not sufficient or how the district court abused its discretion in relying upon the represen-
tations by AHP., provide a continuation using the word “Hence” to express the following idea. It
was not an abuse of discretion for the court to award restitution that encompassed those losses re-
sulting from the creation of fraudulent documents in furtherance of the scheme to defraud for which
Arledge was convicted.
(...)
What happens before “The government asks this court to find that the failure to provide proof of
these three claims was harmless error.”? It also asserts that it would be able to provide, on remand,
proof of an additional false claim that was not originally included in the district court’s restitution
order, which resulted in a loss of $488,000.

In opposition to the sentence “Thus, it was not an abuse of discretion for the court to award restitution
that encompassed those losses resulting from the creation of fraudulent documents in furtherance of
the scheme to defraud for which Arledge was convicted.”, create a sentence with a contradictory
meaning. There are three claims for which the government concedes that there was no proof of
falsity: Florestine Baker ($18,000), Shirley E.

Produce a long descriptive sentence about law that uses all these words: prescription, manufactured,
fraudulent. There were three categories of evidence used to substantiate the government’s assertion
that these claims resulted from Arledge’s illegal conduct: (1) the testimony of Wyatt, an employee
of S&A, who created fraudulent documents; (2) the testimony of two pharmacists who testified that
specific prescriptions allegedly from their pharmacies were, in fact, manufactured; and (3) represen-
tations by AHP that the claims were fraudulent.

Produce a long descriptive sentence about law that uses all these words: restitution, fraudulent, fur-
therance. Thus, it was not an abuse of discretion for the court to award restitution that encompassed
those losses resulting from the creation of fraudulent documents in furtherance of the scheme to
defraud for which Arledge was convicted.

What was this article about? Amount of Loss

Enhancing LLMs with Reading Comprehension
No ratings yet
Enhancing LLMs with Reading Comprehension
30 pages
Advanced Topics and Recent Innovations in Machine Learning and Deep Learning References
No ratings yet
Advanced Topics and Recent Innovations in Machine Learning and Deep Learning References
12 pages
L L M H T C: A S R: Arge Anguage Odels For Ealthcare EXT Lassification Ystematic Eview
No ratings yet
L L M H T C: A S R: Arge Anguage Odels For Ealthcare EXT Lassification Ystematic Eview
55 pages
(2303.18223) A Survey of Large Language Models
No ratings yet
(2303.18223) A Survey of Large Language Models
115 pages
Paper 008
No ratings yet
Paper 008
12 pages
Large Language Model
0% (1)
Large Language Model
38 pages
NLP Research in the LLM Era
No ratings yet
NLP Research in the LLM Era
51 pages
AI: Pre-Trained Language Models Review
No ratings yet
AI: Pre-Trained Language Models Review
15 pages
Domain Specialization As The Key To Make Large Language Models Disruptive: A Comprehensive Survey
No ratings yet
Domain Specialization As The Key To Make Large Language Models Disruptive: A Comprehensive Survey
35 pages
PIIS2589004224005558
No ratings yet
PIIS2589004224005558
24 pages
WavLLM: Advanced Speech Language Model
No ratings yet
WavLLM: Advanced Speech Language Model
21 pages
Survey of GPT-3 Family Language Models
No ratings yet
Survey of GPT-3 Family Language Models
48 pages
Downloed Papers
No ratings yet
Downloed Papers
700 pages
Enhancing LLMs with Grounding Prompts
No ratings yet
Enhancing LLMs with Grounding Prompts
14 pages
LLMs: Logic Understanding vs. Mimicry
No ratings yet
LLMs: Logic Understanding vs. Mimicry
11 pages
Huatuogpt-Ii, One-Stage Training For Medical Adaption of Llms
No ratings yet
Huatuogpt-Ii, One-Stage Training For Medical Adaption of Llms
27 pages
Improving Text Embeddings With Large Language Models:,, Microsoft Corporation
No ratings yet
Improving Text Embeddings With Large Language Models:,, Microsoft Corporation
20 pages
Learning To Retrieve In-Context Examples For Large Language Models
No ratings yet
Learning To Retrieve In-Context Examples For Large Language Models
15 pages
RAG Pipeline For Domain Specific Applications A Case Study in Disseminating Dementia Care Practices
No ratings yet
RAG Pipeline For Domain Specific Applications A Case Study in Disseminating Dementia Care Practices
5 pages
RAFT RAG Cum Fine Tuning
No ratings yet
RAFT RAG Cum Fine Tuning
11 pages
Trend
No ratings yet
Trend
47 pages
Customizing General-Purpose Foundation Models For Medical Report Generation
No ratings yet
Customizing General-Purpose Foundation Models For Medical Report Generation
14 pages
Question-Answer System On Medical Domain With LLMS Using Various Fine-Tuning Methods
No ratings yet
Question-Answer System On Medical Domain With LLMS Using Various Fine-Tuning Methods
15 pages
Title and Abstract Screening For Literature Reviews Using Large Language Models: An Exploratory Study in The Biomedical Domain
No ratings yet
Title and Abstract Screening For Literature Reviews Using Large Language Models: An Exploratory Study in The Biomedical Domain
14 pages
Perspectives in Business Ethics
No ratings yet
Perspectives in Business Ethics
113 pages
Deep Learning: Large Language Models
No ratings yet
Deep Learning: Large Language Models
58 pages
Overview of Large Language Models
No ratings yet
Overview of Large Language Models
47 pages
Split 1363534026993628405
No ratings yet
Split 1363534026993628405
2 pages
Unifying LLMs and KGs: A Roadmap
No ratings yet
Unifying LLMs and KGs: A Roadmap
20 pages
Rephrasing The Web: A Recipe For Compute and Data-Efficient Language Modeling
No ratings yet
Rephrasing The Web: A Recipe For Compute and Data-Efficient Language Modeling
33 pages
A Bibliometric Review of Large Language Models Research From 2017 To 2023
No ratings yet
A Bibliometric Review of Large Language Models Research From 2017 To 2023
36 pages
RM Assignment 4
No ratings yet
RM Assignment 4
5 pages
A Comprehensive Overview of Large Language Models - 2307.06435v9
No ratings yet
A Comprehensive Overview of Large Language Models - 2307.06435v9
46 pages
BERT: Bidirectional Language Model
No ratings yet
BERT: Bidirectional Language Model
10 pages
LLMs: A Research Community Overview
No ratings yet
LLMs: A Research Community Overview
37 pages
Automated Question Generation with LLMs
No ratings yet
Automated Question Generation with LLMs
12 pages
Large Language Models in Neuroscience
No ratings yet
Large Language Models in Neuroscience
20 pages
Large Language Models (LLMS) : Survey, Technical Frameworks, and Future Challenges
No ratings yet
Large Language Models (LLMS) : Survey, Technical Frameworks, and Future Challenges
51 pages
Switchprompt: Learning Domain-Specific Gated Soft Prompts For Classification in Low-Resource Domains
No ratings yet
Switchprompt: Learning Domain-Specific Gated Soft Prompts For Classification in Low-Resource Domains
7 pages
Transfer Learning in Natural Language Processing PDF
0% (1)
Transfer Learning in Natural Language Processing PDF
238 pages
LLMs in Medicine: A Guide for Doctors
No ratings yet
LLMs in Medicine: A Guide for Doctors
19 pages
LLM - Introduction 2024
No ratings yet
LLM - Introduction 2024
77 pages
Jason Wei Stanford cs330 Talk
No ratings yet
Jason Wei Stanford cs330 Talk
44 pages
Natural Learning
No ratings yet
Natural Learning
35 pages
HARE - HumAn Priors - Key To Small Language Model Efficiency
No ratings yet
HARE - HumAn Priors - Key To Small Language Model Efficiency
10 pages
Fine-Tuning LLMs for Domain-Specific MT
No ratings yet
Fine-Tuning LLMs for Domain-Specific MT
9 pages
Large Language Models Overview
No ratings yet
Large Language Models Overview
43 pages
A Comprehensive Overview of Large Language Models: A A, B, C, D, E, F E, F G, I H I
No ratings yet
A Comprehensive Overview of Large Language Models: A A, B, C, D, E, F E, F G, I H I
46 pages
1 s2.0 S2095809922006324 Main
No ratings yet
1 s2.0 S2095809922006324 Main
20 pages
Investigating Continual Pretraining in Large
No ratings yet
Investigating Continual Pretraining in Large
25 pages
Survey On Large Language Models
No ratings yet
Survey On Large Language Models
52 pages
LLM Survey
100% (1)
LLM Survey
43 pages
RAFT: Adapting Language Model To Domain Specific RAG: Vu Et Al. 2023 Lazaridou Et Al. 2022
No ratings yet
RAFT: Adapting Language Model To Domain Specific RAG: Vu Et Al. 2023 Lazaridou Et Al. 2022
11 pages
2308 01727v1
No ratings yet
2308 01727v1
12 pages
Paper 1
No ratings yet
Paper 1
44 pages
Prompting and Fine-Tuning Pre-Trained Generative Language Models
No ratings yet
Prompting and Fine-Tuning Pre-Trained Generative Language Models
4 pages
A Survey of Large Language Models
No ratings yet
A Survey of Large Language Models
124 pages
Simmons SD350 Manual
100% (2)
Simmons SD350 Manual
20 pages
Multimedia Presentation on Culture
No ratings yet
Multimedia Presentation on Culture
17 pages
How To Fix A Corrupted SD Card On Android Without A Computer
No ratings yet
How To Fix A Corrupted SD Card On Android Without A Computer
1 page
Tree Traversal Methods Explained
No ratings yet
Tree Traversal Methods Explained
4 pages
Top 100+ Sap Idocs Interview Questions and Answers
No ratings yet
Top 100+ Sap Idocs Interview Questions and Answers
9 pages
Proforma Invoice for Ophthalmic Equipment
No ratings yet
Proforma Invoice for Ophthalmic Equipment
6 pages
Business Administrator Profile and Skills
No ratings yet
Business Administrator Profile and Skills
1 page
UAS Safety Checklist: Open Category Guide
No ratings yet
UAS Safety Checklist: Open Category Guide
2 pages
Ansari Resume
No ratings yet
Ansari Resume
5 pages
Makerere University Hospital ITECH Project Job Opp 240222 230631
No ratings yet
Makerere University Hospital ITECH Project Job Opp 240222 230631
6 pages
Artificial Intelligence and Islamic Theo
No ratings yet
Artificial Intelligence and Islamic Theo
45 pages
SIP Registration: For All Mediatrix Units
No ratings yet
SIP Registration: For All Mediatrix Units
11 pages
89137094-Wiring Diagram FE
80% (5)
89137094-Wiring Diagram FE
230 pages
Monarch 935 Rewinder: Standard Features
No ratings yet
Monarch 935 Rewinder: Standard Features
1 page
Java Programming Basics and Examples
No ratings yet
Java Programming Basics and Examples
11 pages
Integration 1 Revision Assignment
No ratings yet
Integration 1 Revision Assignment
4 pages
Beran PlantProtech PROTOR Mobile
No ratings yet
Beran PlantProtech PROTOR Mobile
8 pages
Practical No: 1 Aim
No ratings yet
Practical No: 1 Aim
32 pages
Python Tips & Tricks - 50 Basic & Intermediate Tips & Tricks PDF
100% (1)
Python Tips & Tricks - 50 Basic & Intermediate Tips & Tricks PDF
61 pages
Cloud Computing Interview Insights
No ratings yet
Cloud Computing Interview Insights
7 pages
Toaz - Info Face 2 Face Elementary Teachers Book Second Edition PR
No ratings yet
Toaz - Info Face 2 Face Elementary Teachers Book Second Edition PR
5 pages
Oracle Fusion Procurement Exam Q83
No ratings yet
Oracle Fusion Procurement Exam Q83
41 pages
Lectronic Evices: Floyd
No ratings yet
Lectronic Evices: Floyd
6 pages
What Is An Embedded System?: Laser Printer
No ratings yet
What Is An Embedded System?: Laser Printer
9 pages
Jeff Bezos Regrets Nothing Ver 1.0
No ratings yet
Jeff Bezos Regrets Nothing Ver 1.0
122 pages
Igus Low Cost Automation
No ratings yet
Igus Low Cost Automation
9 pages
Key Regression Metrics in Scikit-Learn
No ratings yet
Key Regression Metrics in Scikit-Learn
3 pages
Call Function 'Revision - Level - Select'
No ratings yet
Call Function 'Revision - Level - Select'
3 pages
RK3568 Tablet REF SCH V10 20210720
No ratings yet
RK3568 Tablet REF SCH V10 20210720
61 pages
Neural Cache Model for Language Prediction
No ratings yet
Neural Cache Model for Language Prediction
9 pages

Enhancing LLMs with Reading Comprehension

Uploaded by

Enhancing LLMs with Reading Comprehension

Uploaded by

A DAPTING L ARGE L ANGUAGE M ODELS VIA

We explore how continued pre-training on domain-specific corpora influences

Biomedicine Finance Law

General LLM DAPT AdaptLLM

Glottic Carcinoma in Young Patients Answer questions based on the article:

2 P RELIMINARY E XPLORATION ON C ONTINUED P RE - TRAINING

Prompting Fine-tuning Knowledge Prob

3 A DAPTING L ARGE L ANGUAGE M ODELS VIA R EADING C OMPREHENSION

3.1 C REATING R EADING C OMPREHENSION T EXTS

Task Type Mining Pattern Input-output Template

Task Type Verbalizer

3.2 M IXING WITH G ENERAL I NSTRUCTIONS

Biomedicine PubMedQA ChemProt MQP RCT UMSLE AVERAGE

Finance ConvFinQA FPB FiQA SA Headline NER AVERAGE

6 A BLATIONS ON T RAINING DATA

7 A NALYSIS OF D OMAIN K NOWLEDGE AND P ROMPTING A BILITY

General LLM Raw Text Read. Compre.

fine-tuning results in Figure 3,Text

Domain General LLM Raw Text Read. Compre.

B D OMAIN - ADAPTIVE P RE - TRAINING

Domain Data Source Raw Size # Tokens # Docs

Table 8: Hyper-parameters of domain-adaptive pre-training.

C C REATING R EADING C OMPREHENSION T EXTS

D D OMAIN - SPECIFIC TASKS E VALUATION

Task Type Metric # Demos

Biomedicine Finance Law

Domain Tasks General Tasks

General Task Score

F A NALYSIS OF D OMAIN K NOWLEDGE AND P ROMPTING A BILITY

BioMed. PubMedQA ChemProt MQP RCT UMSLE AVERAGE

Finance ConvFinQA FPB FiQA SA Headline NER AVERAGE

General BioMed. Finance Law

Here is the first part of an article about biomedicine: Recent reported

Next question: What is the reason of the following sentence?

Next question: What is the reason of the following sentence?

Here are some concepts about biomedicine: mechanisms, regulation, translational.

Here are some concepts about biomedicine: phosphorylation, stimulation, phosphorylated.

Then, answer the following questions based on the whole article:

Then, answer the following questions based on the whole article:

What was this article about? Amount of Loss

You might also like