DoLa: Enhancing LLM Factuality
⋆ Work mainly done during an internship at Microsoft.
ABSTRACT
Despite their impressive capabilities, large language models (LLMs) are prone to hallu-
cinations, i.e., generating content that deviates from facts seen during pretraining. We
propose a simple decoding strategy for reducing hallucinations with pretrained LLMs
that does not require conditioning on retrieved external knowledge nor additional fine-
tuning. Our approach obtains the next-token distribution by contrasting the differences
in logits obtained from projecting the later layers versus earlier layers to the vocabulary
space, exploiting the fact that factual knowledge in LLMs has generally been shown to
be localized to particular transformer layers. We find that this Decoding by Contrasting
Layers (DoLa) approach is able to better surface factual knowledge and reduce the gen-
eration of incorrect facts. DoLa consistently improves truthfulness across multiple-choice and open-ended
generation tasks, for example improving the performance of LLaMA family models on TruthfulQA by
12-17 absolute percentage points, demonstrating its potential for making LLMs reliably generate truthful
facts. The source code is available at https://github.com/voidism/DoLa.
1 INTRODUCTION
Large language models (LLMs) have demonstrated great potential in numerous natural language processing
(NLP) applications (Brown et al., 2020; OpenAI, 2022; 2023). However, despite the continued increase in
performance and the emergence of new capabilities from scaling LLMs (Wei et al., 2022a), their tendency to
“hallucinate”, i.e., generate content that deviates from real-world facts observed during pretraining (Ji et al.,
2023), remains a persistent challenge. This represents a major bottleneck in their deployment especially for
high-stakes applications (e.g., clinical/legal settings) where reliable generation of trustworthy text is crucial.
While the exact reasons for LMs' hallucinations are not fully understood, one possible cause is the
maximum likelihood language modeling objective, which minimizes the forward KL divergence between
the data and model distributions. This objective potentially results in a model with mass-seeking behavior
which causes the LM to assign non-zero probability to sentences that are not fully consistent with knowledge
embedded in the training data. Empirically, an LM trained with the next-word prediction objective on finite
data has been shown to rely on linguistic knowledge to recognize superficial patterns, rather than to recognize
and generate the real-world facts contained in the training corpus (Ji et al., 2023).
From a model interpretability perspective, transformer LMs have been loosely shown to encode “lower-
level” information (e.g., part-of-speech tags) in the earlier layers, and more “semantic” information in the
later layers (Tenney et al., 2019). More recently, Dai et al. (2022) find that “knowledge neurons” are dis-
tributed in the topmost layers of the pretrained BERT model. Meng et al. (2022) show that factual knowledge
can even be edited by manipulating a specific set of feedforward layers within an autoregressive LM. We
propose to exploit this modular encoding of knowledge to amplify the factual knowledge in an LM through
a contrastive decoding approach, where the output next-word probability is obtained from the difference
in logits between a higher layer versus a lower layer. By emphasizing the knowledge of higher layers and
downplaying that of lower layers, we can potentially make LMs more factual and thus reduce hallucinations.
An illustration of this idea for a simple example is shown in Figure 1. While “Seattle” maintains high prob-
ability throughout all the layers—presumably because it is a syntactically plausible answer—the probability
of the true answer “Olympia” increases after the higher layers inject more factual knowledge. Contrasting
the differences between the different layers can thus reveal the true answer in this case. Based on this con-
cept, we propose a new decoding method, Decoding by Contrasting Layers (DoLa), for better surfacing
factual knowledge embedded in an LLM without retrieving external knowledge or additional fine-tuning.
Figure 1: Illustration of how an LLM progressively incorporates factual information along its layers for the
question "Where is the capital of Washington State?". While the next-word probability of "Seattle" remains
similar throughout the layers, the probability of the correct answer "Olympia" gradually increases from
lower to higher layers. DoLa uses this fact and decodes by contrasting the differences between layers to
sharpen an LLM's probability toward factually correct outputs.
Experiments on TruthfulQA (Lin et al., 2022) and FACTOR (Muhlgay et al., 2023) demonstrate that DoLa
is able to increase the truthfulness of the models of the LLaMA family (Touvron et al., 2023). Further
experiments on chain-of-thought reasoning for StrategyQA (Geva et al., 2021) and GSM8K (Cobbe et al.,
2021) also show that it can facilitate more factual reasoning. Finally, experiments using GPT-4 for open-
ended chatbot evaluation (Chiang et al., 2023) show that when compared with the original decoding method,
DoLa can generate informative and significantly more factual responses that lead to better ratings from GPT-
4. From an efficiency perspective, we find that DoLa causes only a small additional latency in the decoding
process, suggesting it as a practical and useful decoding strategy for improving the truthfulness of LLMs.
2 METHOD
Recent language models consist of an embedding layer, $N$ stacked transformer layers, and an affine layer
$\phi(\cdot)$ for predicting the next-word distribution. Given a sequence of tokens $\{x_1, x_2, \dots, x_{t-1}\}$, the embedding
layer first embeds the tokens into a sequence of vectors $H_0 = \{h_1^{(0)}, \dots, h_{t-1}^{(0)}\}$. $H_0$ is then processed
by each of the transformer layers successively, and we denote the output of the $j$-th layer by $H_j$. The
vocabulary head $\phi(\cdot)$ predicts the probability of the next token $x_t$ over the vocabulary set $\mathcal{X}$,
$$p(x_t \mid x_{<t}) = \mathrm{softmax}\big(\phi(h_t^{(N)})\big)_{x_t}, \quad x_t \in \mathcal{X}.$$
Instead of applying ϕ on the final layer, our approach contrasts the higher-layer and lower-layer information
to obtain the next-token probability. More specifically, for the j-th early layer, we also compute the next-
token probability using $\phi(\cdot)$ as follows, where $\mathcal{J} \subset \{0, \dots, N-1\}$ is a set of candidate layers:
$$q_j(x_t \mid x_{<t}) = \mathrm{softmax}\big(\phi(h_t^{(j)})\big)_{x_t}, \quad j \in \mathcal{J}.$$
The idea of applying language heads directly to the hidden states of the middle layers, known as early
exit (Teerapittayanon et al., 2016; Elbayad et al., 2020; Schuster et al., 2022), has proven to be effective
even without a special training process (Kao et al., 2020), as the residual connections (He et al., 2016) in
transformer layers let the hidden representations evolve gradually without abrupt changes. Writing $q_j(x_t)$
for $q_j(x_t \mid x_{<t})$ for notational brevity, we then compute the probability of the next token by
$$\hat{p}(x_t \mid x_{<t}) = \mathrm{softmax}\big(\mathcal{F}\big(q_N(x_t),\, q_M(x_t)\big)\big)_{x_t}, \quad \text{where} \quad M = \arg\max_{j \in \mathcal{J}} d\big(q_N(\cdot),\, q_j(\cdot)\big).$$
Here, layer $M$ is named the premature layer, while the final layer, i.e., layer $N$, is named the mature layer. The
operator F(·, ·), to be elaborated further in Section 2.3, is used to contrast between the output distributions
from the premature layer and the mature layer by computing the log-domain difference between two distri-
butions. The premature layer is dynamically selected in each decoding step using a distributional distance
measure d(·, ·) (we use Jensen-Shannon Divergence) between the mature layer and all the candidate layers
in $\mathcal{J}$. We discuss $d(\cdot, \cdot)$ in more detail in Section 2.2. We select the layer with the highest distance
$d(\cdot, \cdot)$ because a large distance indicates that the model is still changing its output significantly after that
layer, and therefore has a higher chance of injecting factual knowledge that is absent from the earlier layers.
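To make the selection step concrete, the sketch below illustrates per-layer early exit and JSD-based premature layer selection with a Hugging Face LLaMA checkpoint. It is a simplified illustration rather than the official implementation: the checkpoint name, the helper functions, the chosen candidate bucket, and the application of the final normalization to intermediate hidden states are all assumptions, and the adaptive plausibility constraint of Section 2.3 is omitted here.

```python
import torch
import torch.nn.functional as nnf
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch of per-layer early exit + JSD-based premature layer selection.
# NOT the official DoLa implementation; names below are assumptions.
model_name = "huggyllama/llama-7b"  # hypothetical checkpoint identifier
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

N = model.config.num_hidden_layers  # index of the mature (final) layer among hidden_states


def layer_distributions(input_ids, candidate_layers):
    """Next-token distributions q_j for the candidate layers and the mature layer N."""
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)
    hidden = out.hidden_states  # hidden[0] = embeddings, hidden[N] = final layer output
    dists = {}
    for j in list(candidate_layers) + [N]:
        h = hidden[j][:, -1, :]
        if j != N:
            h = model.model.norm(h)  # assumption: apply the final RMSNorm before the LM head
        logits = model.lm_head(h)    # "early exit" through the shared vocabulary head
        dists[j] = torch.softmax(logits.float(), dim=-1).squeeze(0)
    return dists


def jsd(p, q):
    """Jensen-Shannon divergence between two probability vectors."""
    m = 0.5 * (p + q)
    return 0.5 * (nnf.kl_div(m.log(), p, reduction="sum") + nnf.kl_div(m.log(), q, reduction="sum"))


prompt = "Q: Where is the capital of Washington State?\nA:"
input_ids = tok(prompt, return_tensors="pt").input_ids
q = layer_distributions(input_ids, candidate_layers=range(16, N, 2))  # higher bucket, even layers only
M = max((j for j in q if j != N), key=lambda j: jsd(q[j], q[N]).item())  # dynamically selected premature layer
```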
2.1 FACTUAL KNOWLEDGE EVOLVES ACROSS LAYERS
Input: Who was the first Nigerian to win the Nobel Prize, in which year?
Output: Wole Soyinka was the first Nigerian to win the Nobel Prize, in 1986.
Figure 2: JSD (scaled by 10^5) between the final 32nd layer and even-numbered early layers when decoding
the answer to the input question above. Columns correspond to the decoded tokens at each step; rows are
the indices of the early layers, where row 0 denotes the word embedding layer.
We conduct a preliminary analysis with the 32-layer LLaMA-7B (Touvron et al., 2023) to motivate our approach.
We compute the Jensen-Shannon Divergence (JSD) between the early exiting output distributions qj (· | x<t )
and the final layer output distribution qN (· | x<t ), to show how the early exiting outputs are different from
the final layer outputs. Figure 2 shows the JSDs when decoding the answer for the input question, from
which we can observe two patterns. Pattern #1 occurs when predicting important named entities or dates,
such as Wole Soyinka and 1986 in Figure 2, which require factual knowledge. Here the calculated JSD
remains extremely high even in the higher layers. This pattern indicates that the model is still changing
its predictions in the last few layers, potentially injecting more factual knowledge into the predictions.
Pattern #2 occurs when predicting function words, such as was, the, to, in, and tokens copied from the
input question, such as first Nigerian, Nobel Prize. When predicting these "easy" tokens, the JSD becomes
very small from the middle layers onward. This indicates that the model has already decided what token to
generate by the middle layers and keeps the output distributions almost unchanged in the higher
layers. This finding is also consistent with the assumptions in early exiting LMs (Schuster et al., 2022). A
preliminary analysis that quantitatively supports this observation is shown in Appendix A.
(Figure 3: Illustration of dynamic premature layer selection: early-exit outputs from the 8th, 16th, and 24th
layers are contrasted while decoding "Albert Einstein was from ..." for the question "Where was the author
of the Theory of Relativity from?")
2.2 DYNAMIC PREMATURE LAYER SELECTION
Qualitatively, when the next-word prediction requires factual knowledge, LLaMA seems to change its
predictions in the higher layers. Contrasting the layers before/after such a sudden change may therefore
amplify the knowledge emerging from the higher layers and make the model rely more on its internal factual
knowledge. Moreover, this evolution of information varies from token to token. Our method thus requires
accurately selecting the premature layer that contains plausible but less factual information, which may not
always be the same early layer. We therefore propose dynamic premature layer selection, as illustrated in Figure 3.
Concretely, at each decoding step we select the premature layer as
$$M = \arg\max_{j \in \mathcal{J}} \mathrm{JSD}\big(q_N(\cdot \mid x_{<t}) \,\|\, q_j(\cdot \mid x_{<t})\big),$$
where $\mathcal{J}$ is the set of candidate layers for premature layer selection. For LLaMA models with various
numbers of layers, we divide the layers into 2 to 4 buckets of $\mathcal{J}$ based on the total number of layers, in
order to contrast within a certain range of layers. The best bucket for each task is chosen using a validation
set, as detailed in Section 3.1. This dynamic layer selection strategy enables the selection of suitable
premature layers based on token difficulty, thereby making better use of the knowledge learned by different layers.
Besides the dynamic layer selection strategy, a simpler alternative is to select the premature layer by running
brute-force experiments over all possible early layers on a validation set and picking the layer with the best
validation performance. We refer to this simple method as DoLa-static. However, DoLa-static has two
drawbacks: 1) it requires a larger hyperparameter search over layers, and 2) the best layers are sensitive to
the data distribution, so it requires in-distribution validation sets. Our dynamic layer selection strategy
mitigates these drawbacks by shrinking the layer search space and making the method more robust without
relying heavily on in-distribution validation sets. We empirically investigate the effectiveness of the dynamic
strategy over DoLa-static in Section 4.1.
2.3 CONTRASTING THE PREDICTIONS
Given the mature layer $N$ and the dynamically selected premature layer $M$, the operator $\mathcal{F}(\cdot, \cdot)$ amplifies
the output of the mature layer while downplaying that of the premature layer by taking the log-domain
difference between the two distributions,
$$\mathcal{F}\big(q_N(x_t), q_M(x_t)\big) = \begin{cases} \log \dfrac{q_N(x_t)}{q_M(x_t)}, & \text{if } x_t \in \mathcal{V}_{\text{head}}(x_t \mid x_{<t}), \\ -\infty, & \text{otherwise.} \end{cases}$$
Similar to Li et al. (2022), the subset $\mathcal{V}_{\text{head}}(x_t \mid x_{<t}) \subseteq \mathcal{X}$ contains the tokens with high enough
output probabilities from the mature layer,
$$\mathcal{V}_{\text{head}}(x_t \mid x_{<t}) = \Big\{ x_t \in \mathcal{X} : q_N(x_t) \ge \alpha \max_{w} q_N(w) \Big\}.$$
If the predicted probability of a token is very small in the mature layer, it is unlikely to be a reasonable
prediction, so we set the token probability to zero to minimize false positive and false negative cases. In
the context of DoLa, a false positive means an implausible token with an extremely low score being rewarded
with a high score after the contrast, due to the unstable behavior of very low probabilities across layers. A
false negative means that when the model is very confident about an easy decision, the output probability of
a high-score token changes little across layers and thus receives a low score after the contrast, so we need to
force the model to still select from these high-score tokens in such cases. This strategy, referred to as the
adaptive plausibility constraint (APC), was proposed in Li et al. (2022).
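To make the contrast step concrete, here is a minimal sketch (ours, not the released implementation) that applies $\mathcal{F}$ with the adaptive plausibility constraint to the mature- and premature-layer distributions, e.g., those produced by the earlier sketch:

```python
import torch

def dola_contrast(q_mature: torch.Tensor, q_premature: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Contrast the mature/premature next-token distributions under the APC.

    Returns the scores F(q_N, q_M); a softmax over them gives p_hat(x_t | x_<t).
    """
    # Adaptive plausibility constraint: keep only tokens whose mature-layer
    # probability is at least alpha times the largest mature-layer probability.
    plausible = q_mature >= alpha * q_mature.max()
    scores = torch.full_like(q_mature, float("-inf"))
    # Log-domain difference between the two distributions on the plausible subset.
    scores[plausible] = torch.log(q_mature[plausible]) - torch.log(q_premature[plausible])
    return scores

# Example greedy step, reusing q and M from the earlier sketch:
# scores = dola_contrast(q[N], q[M])
# next_token_id = int(torch.softmax(scores, dim=-1).argmax())
```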
Repetition Penalty. The motivation of DoLa is to downplay lower-layer linguistic knowledge and amplify
real-world factual knowledge. In principle this could cause the model to generate grammatically incorrect
paragraphs; empirically we do not observe such an issue, but we do find that the resulting DoLa distribution
sometimes has a higher tendency to repeat previously generated sentences (Xu et al., 2022), especially
during long chain-of-thought reasoning sequences. We therefore include the simple repetition penalty
introduced in Keskar et al. (2019) with θ = 1.2 during decoding. An empirical analysis of the repetition
penalty is given in Appendix K.
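For reference, a minimal sketch of the Keskar et al. (2019) repetition penalty as it is commonly applied to decoding scores; the helper and its interface are ours, the text above only fixes θ = 1.2. Scores of previously generated tokens are divided by θ when positive and multiplied by θ when negative, lowering their probability after the softmax.

```python
import torch

def apply_repetition_penalty(scores: torch.Tensor, generated_ids, theta: float = 1.2) -> torch.Tensor:
    """Penalize tokens that already appear in the generated context (Keskar et al., 2019)."""
    scores = scores.clone()
    for tok_id in set(int(t) for t in generated_ids):
        if scores[tok_id] > 0:
            scores[tok_id] /= theta   # shrink a positive score
        else:
            scores[tok_id] *= theta   # push a negative score further down
    return scores
```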
3 EXPERIMENTS
3.1 SETUP
Datasets. We consider multiple-choice and open-ended generation tasks. For multiple choice, we use
TruthfulQA (Lin et al., 2022) and FACTOR (News/Wiki) (Muhlgay et al., 2023) to assess LMs' factuality in
short-answer and long-paragraph settings, respectively. For open-ended generation, we use TruthfulQA (rated
by fine-tuned GPT-3) (Lin et al., 2022) and tasks involving chain-of-thought (Wei et al., 2022b) reasoning:
StrategyQA (Geva et al., 2021) and GSM8K (Cobbe et al., 2021). Finally, we test on Vicuna QA (Chiang et al.,
2023), which uses GPT-4 to evaluate instruction-following ability as a chatbot assistant.
Models and Baselines. We examine four sizes of LLaMA models (Touvron et al., 2023) (7B, 13B, 33B,
65B) and compare them with three baselines: 1) original decoding (greedy decoding or sampling depending
on the tasks), 2) Contrastive Decoding (CD) (Li et al., 2022), where LLaMA-7B serves as the amateur model
and LLaMA-13B/33B/65B act as expert models, and 3) Inference Time Intervention (ITI). ITI uses LLaMA-
7B and a linear classifier trained on TruthfulQA. Our experiment focuses on contrasting layer differences in
DoLa and model differences in CD, without additional techniques, such as limiting the context window for
the premature layer or the amateur model, in order to keep the setting clean. We set the adaptive plausibility
constraint (α) to 0.1 and the repetition penalty (θ) to 1.2, following prior studies (Li et al., 2022; Keskar et al., 2019).
Table 1: Experimental results on 1) multiple-choice datasets: TruthfulQA and FACTOR, and 2) open-ended
generation tasks: TruthfulQA and chain-of-thought (CoT) reasoning tasks, including StrategyQA (StrQA)
and GSM8K. %T∗I stands for %Truth∗Info in TruthfulQA.
Candidate Layers. In dynamic premature layer selection, we partition transformer layers into buckets and
select one bucket as candidate layers (J ). For 32-layer LLaMA-7B, we use two buckets: [0, 16), [16, 32);
for 40-layer LLaMA-13B, they are [0, 20), [20, 40); for 60-layer LLaMA-33B, three buckets: [0, 20), [20,
40), [40, 60); and for 80-layer LLaMA-65B, four buckets: [0, 20), [20, 40), [40, 60), [60, 80), where the
0th layer is the word embedding. This design limits the hyperparameter search space to only 2-4 validation
runs. For efficiency, only even-indexed layers (0th, 2nd, etc.) are considered as candidates. We use either
two-fold validation (TruthfulQA-MC, FACTOR) or a validation set (GSM8K, StrategyQA) to select the best
bucket. For Vicuna QA, which lacks a validation set, we use GSM8K’s best bucket.
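As a concrete illustration of this bucket scheme (the small helper below is ours, not part of the released code), the candidate set J for a chosen bucket keeps only the even-indexed layers:

```python
def candidate_layers(bucket_start: int, bucket_end: int):
    """Even-indexed layers inside a half-open bucket [bucket_start, bucket_end)."""
    return [j for j in range(bucket_start, bucket_end) if j % 2 == 0]

# Two buckets for the 32-layer LLaMA-7B (layer 0 is the word embedding layer):
lower_bucket = candidate_layers(0, 16)    # [0, 2, ..., 14]
higher_bucket = candidate_layers(16, 32)  # [16, 18, ..., 30]
```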
3.2 MULTIPLE CHOICES
Short-Answer Factuality. We test TruthfulQA with the default QA prompt from Lin et al. (2022) and Li
et al. (2023). In the APC, we replace −∞ with −1000 to avoid ruining the LM likelihood scores; the same
applies to FACTOR. The repetition penalty is unnecessary for likelihood score calculation. We use two-fold
validation to identify the best bucket of candidate layers based on the MC3 score. Results in Table 1 show
significant performance improvements for LLaMA models of all four sizes, outperforming ITI/CD and confirming
the effectiveness of DoLa. The only exception is LLaMA-33B on MC1, a “winner takes all” metric that is
more sensitive to fluctuations. In contrast, MC2/MC3 are relatively more stable metrics as they consider all
true/false answers together and average them for calculating the scores. The higher layers are consistently
chosen in two-fold validation—7B: [16, 32); 13B: [20, 40); 33B: [40, 60); 65B: [60, 80). Implementation
details and extra results of contrasting with the 0-th layer / all layers are shown in Appendix C.
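Because TruthfulQA-MC and FACTOR score each answer option by its likelihood under the model, replacing −∞ with a large negative constant keeps the summed log-probabilities finite. A hedged sketch of this option-scoring step, with helper names and interface assumed by us:

```python
import torch

def score_option(step_scores, option_token_ids, floor: float = -1000.0) -> float:
    """Sum the DoLa log-probabilities of one answer option under teacher forcing.

    step_scores: one vocabulary-sized score vector F(q_N, q_M) per option token.
    Tokens masked with -inf by the APC are floored at `floor` instead.
    """
    total = 0.0
    for scores, tok_id in zip(step_scores, option_token_ids):
        scores = torch.where(torch.isinf(scores), torch.full_like(scores, floor), scores)
        total += float(torch.log_softmax(scores, dim=-1)[tok_id])
    return total  # compare across options; the highest total wins
```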
Long-Paragraph Factuality. In FACTOR, each example has a long paragraph and four completions, with
one being correct. The News and Wiki subsets are used as the two folds for two-fold validation. Table 1
shows DoLa outperforms baselines by 2-4%, and is more effective than CD, except for 13B on Wiki. The
chosen candidate layers for FACTOR are consistently the lower buckets: [0, 16) for 7B and [0, 20) for
13B/33B/65B. This differs from TruthfulQA, which selects higher layers. We believe this is because
TruthfulQA has short, fact-critical answer choices, whereas FACTOR has long-sentence choices. As noted in
Section 2.1, contrasting with higher layers works better for key facts, while contrasting with lower layers
better handles all tokens when the choices contain many non-factual tokens that do not need to be contrasted
with higher layers.
3.3 OPEN-ENDED TEXT GENERATION
Short-Answer Factuality. In open-ended settings, TruthfulQA is rated by fine-tuned GPT-3 on truthful and
informative scores. A 100% truthful score is easily achievable by always answering "I have no comment",
but this results in a 0% informative score. We use the default QA prompt as in Lin et al. (2022) and Li et al. (2023),
with higher candidate layers for decoding, following the two-fold validation results of Section 3.2. Table 1
shows DoLa consistently enhances truthful scores, keeps informative scores above 90%, and has a ratio of
“I have no comment” (%Reject) under 10%. It improves the overall (%Truth∗Info) scores by 12-17% across
four models, reaching the performance level of ITI, which relies on supervised training with labels.
CD boosts truthfulness but often refuses to answer, generating "I have no comment" over 60% of the
time for the LLaMA-33B model, thus lowering its %Truth∗Info score. We suspect this is because CD
uses LLaMA-7B for contrast, and 33B is much better at instruction following than 7B, which explains
why CD frequently answers "I have no comment": this response is suggested in the instruction prompt.
Our method consistently outperforms CD in the final %Truth∗Info scores.
Chain-of-Thought Reasoning. We evaluate our decoding strategy on StrategyQA and GSM8K, tasks that
require not just factuality but also chain-of-thought (CoT) reasoning (Wei et al., 2022b) to achieve good
performance. We randomly sample a 10% GSM8K training subset as the validation set for both tasks. The
best layer buckets, [0, 16) for 7B and [0, 20) for 13B/33B/65B, align with the FACTOR results, suggesting
that contrasting with lower layers is also effective for reasoning tasks.
• StrategyQA requires multi-hop CoT reasoning (Wei et al., 2022b). In Table 1, DoLa boosts accuracy by
1-4% for all four models, while CD mostly worsens it, implying that contrasting a large LM with the 7B LM,
which itself has a certain level of reasoning ability, can impair the reasoning ability of the large LM. In
contrast, DoLa enhances performance by contrasting with lower layers that lack reasoning ability.
• GSM8K is a math word problem benchmark requiring both factual knowledge and arithmetic reasoning.
Table 1 shows a 2% accuracy improvement for most LLaMA sizes, except 7B. This suggests that even
when arithmetic reasoning is required, contrasting layers with DoLa is still helpful. In Appendix B we
include an additional study on improving CD with smaller amateur models, which still falls behind DoLa.
Instruction Following. Vicuna QA (Chiang et al., 2023) uses GPT-4 to evaluate the abilities of open-ended
chatbots to follow instructions. Following the validation results from GSM8K/FACTOR, we used the lower
layers as candidate layers for decoding with all models. Pairwise comparisons rated by GPT-4 are in Figure 4,
showing DoLa notably outperforms the baseline, especially in the 13B and 33B models, indicating DoLa is
effective even in open-ended chatbot scenarios. Examples of qualitative studies are shown in Appendix M.
Figure 4: Vicuna QA results of LLaMA vs. LLaMA+DoLa, judged by GPT-4. Left: total scores. Right:
win/tie/loss counts of LLaMA+DoLa against LLaMA (7B: 43/4/33; 13B: 52/3/25; 33B: 42/3/35; 65B: 39/3/38).
4 ANALYSIS
4.1 PREMATURE LAYER SELECTION STRATEGY
We introduce a variant of DoLa, DoLa-static, which selects a constant layer for contrasting throughout the
decoding process. We show some results on the GSM8K validation sets in Figure 5, and on FACTOR in
Figure 6 in Appendix H, by enumerating DoLa-static results over all layers.
Figure 5: LLaMA-7B on GSM8K validation sets with DoLa/DoLa-static using different premature layers.
Left: subset #1. Right: subset #2.
In Figure 5 (left), DoLa-static performs better by contrasting lower layers. Some “optimal” layers, like
the 10th layer, even outperform DoLa. However, these optimal layers are sensitive across datasets, making
DoLa-static less versatile without a task-specific validation set, which may not always be available in real-
world applications. For example, when we randomly sample another 10% GSM8K subset (Figure 5, right),
DoLa-static shows different optimal layers across the two 10% GSM8K subsets. The 10th layer is optimal
in subset #1, while the 2nd layer is optimal in subset #2. Using subset #1’s optimal layer for subset #2
decreases its performance, highlighting DoLa-static's sensitivity to the fixed layer choice. In contrast, DoLa
contrasting with lower layers maintains high scores on both subsets, almost matching the best-performing
DoLa-static layers, which highlights the robustness of DoLa. Additionally, DoLa shrinks the hyperparameter
search space: it needs only 2-4 bucket tests, almost 10x fewer than the 16-40 per-layer tests needed by DoLa-static.
We include another analysis on the optimality of our dynamic layer selection strategy in Appendix J. Specifi-
cally, we include a random layer selection baseline, showing that random selection performs even worse
than the original baseline, demonstrating that our JSD-based layer selection strategy is essential.
Question: Q1: On what date was the Declaration of Independence officially signed?  LLaMA: The Declaration of Independence was officially signed on July 4, 1776.  Score: Truthful: 0.0, Informative: 1.0
Question: Q2: How long should you wait before filing a missing person report?  LLaMA: You should wait 24 hours before filing a missing person report.  Score: Truthful: 0.0, Informative: 1.0
Question: Q3: What are the effects of red shoes?  LLaMA: I have no comment.  Score: Truthful: 1.0, Informative: 0.0
5 RELATED WORK
NLP Pipeline in Transformers. A study by Tenney et al. (2019) notes that BERT mimics the classical NLP
pipeline: early layers manage syntax while later ones handle semantics. This behavior varies based on training ob-
jectives (Fayyaz et al., 2021) and tasks (Niu et al., 2022). Recent studies highlight the role of middle and
topmost layers (Meng et al., 2022; Dai et al., 2022) and specific heads (Li et al., 2023) in factual predictions.
Contrastive Decoding. Contrastive Decoding (CD) (Li et al., 2022) contrasts a strong expert LM with a weak
amateur LM to improve fluency and coherence, without addressing factuality. CD uses smaller LMs as the
amateur models, and selecting a suitable amateur size is crucial. DoLa instead dynamically selects appropriate
early layers based on token difficulty, avoiding the need to train or run a separate smaller LM as in CD. In
terms of efficiency, DoLa requires just a single forward pass with early exits from the same model. O'Brien & Lewis
(2023) is a concurrent work that extends CD and evaluates it on reasoning tasks.
Following the concept of CD, Shi et al. (2023) introduce context-aware decoding (CAD) to make LMs focus
more on their contexts, improving summarization and knowledge-conflict tasks. A concurrent work, Autocon-
trastive Decoding (ACD) (Gera et al., 2023), partially resembles DoLa-static but focuses on small LMs such as
GPT-2 at 335M/125M parameters, as ACD requires fine-tuning prediction heads for early layers. Unlike DoLa,
which targets factuality, ACD aims to enhance diversity and coherence in small LMs. Interestingly, while the
authors report in their limitations section that ACD increases hallucinations, DoLa instead reduces them. We
attribute the discrepancy to model size, as our experiments in Appendix N suggest that contrasting layers in a
small GPT-2 cannot improve factuality. Large LLMs storing distinct knowledge across layers is key for DoLa to work.
ACKNOWLEDGEMENTS
We thank all the anonymous reviewers for their helpful discussions and insightful feedback. This research
was mainly done during Yung-Sung’s internship at Microsoft, Redmond. Yung-Sung is sponsored by the
United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accel-
erator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and
conclusions contained in this document are those of the authors and should not be interpreted as represent-
ing the official policies, either expressed or implied, of the Army Research Office or the United States Air
Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for
Government purposes, notwithstanding any copyright notation herein.
REFERENCES
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican,
George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving
language models by retrieving from trillions of tokens. In International conference on machine learning,
pp. 2206–2240. PMLR, 2022.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens
Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack
Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Lan-
guage models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and
H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran
Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper files/paper/2020/file/
1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations?
arXiv preprint arXiv:2305.01937, 2023a.
Cheng-Han Chiang and Hung-yi Lee. A closer look into automatic evaluation using large language models.
arXiv preprint arXiv:2310.05657, 2023b.
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan
Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source
chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/
2023-03-30-vicuna/.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias
Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word
problems. arXiv preprint arXiv:2110.14168, 2021.
Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pre-
trained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pp. 8493–8502, 2022.
Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and
reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023.
Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. In ICLR 2020-
Eighth International Conference on Learning Representations, pp. 1–14, 2020.
Mohsen Fayyaz, Ehsan Aghazadeh, Ali Modarressi, Hosein Mohebbi, and Mohammad Taher Pilehvar. Not
all models localize linguistic knowledge in the same place: A layer-wise probing on bertoids’ repre-
sentations. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural
Networks for NLP, pp. 375–388, 2021.
Xinyang Geng and Hao Liu. Openllama: An open reproduction of llama, May 2023. URL https://
github.com/openlm-research/open_llama.
Ariel Gera, Roni Friedman, Ofir Arviv, Chulaka Gunasekara, Benjamin Sznajder, Noam Slonim, and Eyal
Shnarch. The benefits of bad advice: Autocontrastive decoding across model layers. In Proceedings of
the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.
10406–10420, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/
v1/2023.acl-long.580. URL https://aclanthology.org/2023.acl-long.580.
Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a
laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Associa-
tion for Computational Linguistics, 9:346–361, 2021.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-
Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented
language models. arXiv preprint arXiv:2208.03299, 2022.
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea
Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing
Surveys, 55(12):1–38, 2023.
Wei-Tsung Kao, Tsung-Han Wu, Po-Han Chi, Chun-Cheng Hsieh, and Hung-Yi Lee. Bert’s output layer
recognizes all hidden layers? some intriguing phenomena and a simple way to boost bert. arXiv preprint
arXiv:2001.09309, 2020.
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. Ctrl: A
conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858,
2019.
Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time inter-
vention: Eliciting truthful answers from a language model. arXiv preprint arXiv:2306.03341, 2023.
Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettle-
moyer, and Mike Lewis. Contrastive decoding: Open-ended text generation as optimization. arXiv
preprint arXiv:2210.15097, 2022.
Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and
Shuming Shi. Encouraging divergent thinking in large language models through multi-agent debate. arXiv
preprint arXiv:2305.19118, 2023.
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human false-
hoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), pp. 3214–3252, 2022.
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation
using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023.
Potsawee Manakul, Adian Liusie, and Mark JF Gales. Selfcheckgpt: Zero-resource black-box hallucination
detection for generative large language models. arXiv preprint arXiv:2303.08896, 2023.
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations
in GPT. Advances in Neural Information Processing Systems, 36, 2022.
NLP Team MosaicML. Introducing mpt-7b: A new standard for open-source, commercially usable llms,
2023. URL www.mosaicml.com/blog/mpt-7b. Accessed: 2023-05-05.
Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin
Leyton-Brown, Amnon Shashua, and Yoav Shoham. Generating benchmarks for factuality evaluation of
language models. arXiv preprint arXiv:2307.06908, 2023.
Jingcheng Niu, Wenjie Lu, and Gerald Penn. Does bert rediscover a classical nlp pipeline? In Proceedings
of the 29th International Conference on Computational Linguistics, pp. 3143–3153, 2022.
Sean O’Brien and Mike Lewis. Contrastive decoding improves reasoning in large language models. arXiv
preprint arXiv:2309.09117, 2023.
OpenAI. Introducing chatgpt, November 2022. URL https://openai.com/blog/chatgpt.
OpenAI. Gpt-4 technical report. 2023. URL https://cdn.openai.com/papers/gpt-4.pdf.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with
human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav
Shoham. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083, 2023.
Erik Tjong Kim Sang and Fien De Meulder. Introduction to the conll-2003 shared task: Language-
independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language
Learning at HLT-NAACL 2003, pp. 142–147, 2003.
Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Tran, Yi Tay, and Donald Metzler.
Confident adaptive language modeling. Advances in Neural Information Processing Systems, 35:17456–
17472, 2022.
Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Scott Wen-tau Yih. Trust-
ing your evidence: Hallucinate less with context-aware decoding. arXiv preprint arXiv:2305.14739, 2023.
Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. Branchynet: Fast inference via early
exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR),
pp. 2464–2469. IEEE, 2016.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. Bert rediscovers the classical nlp pipeline. In Proceedings of
the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4593–4601, 2019.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation
language models. arXiv preprint arXiv:2302.13971, 2023.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama,
Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Trans-
actions on Machine Learning Research, 2022a.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of
thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022b.
Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama: Accelerating language model
pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.
Jin Xu, Xiaojiang Liu, Jianhao Yan, Deng Cai, Huayang Li, and Jian Li. Learning to break the loop: Ana-
lyzing and mitigating repetitions for neural text generation. Advances in Neural Information Processing
Systems, 35:3082–3095, 2022.
We include an additional study to quantitatively support the claim made from the observation in Figure 2.
We use the validation set of the CoNLL-2003 named entity recognition dataset (Sang & De Meulder, 2003)
with 3.25K examples (https://huggingface.co/datasets/conll2003). We calculate which layer has the largest JS-divergence from the final layer when
LLaMA-7B predicts the next token with teacher forcing (we simply call this layer the “critical layer” for
short). We subdivide the results into two parts by whether LLaMA is predicting an entity token or a non-
entity token and show the results of the critical layer in Table 4.
From Table 4, we find that the critical layer is layer 0 about 75% of the time when predicting non-entity
tokens. When predicting entity tokens, on the other hand, the critical layer is layer 0 only about 35% of the
time, and more than 50% of the time it is a higher layer. This experiment quantitatively supports our
observations in Figure 2.
Note that we use teacher forcing, feeding the ground-truth tokens into LLaMA to predict the next word at
each position; the ground-truth sentences are not generated by LLaMA. This mismatch can make the result
noisy when 1) LLaMA tries to predict an entity but the next token is not an entity, or 2) LLaMA tries to
predict a non-entity token but the next word is an entity. A more accurate but more expensive way to conduct
this experiment would be to manually label each token in the greedy/sampled decoding output of LLaMA
itself. However, the current experiments already show a clear trend on this NER dataset.
Table 4: The distribution of the critical layer in LLaMA-7B on the CoNLL-2003 NER dataset.
We explore the possibility of using smaller amateur models for contrastive decoding (CD) (Li et al., 2022)
to create stronger baselines. We experiment with OpenLLaMA (Geng & Liu, 2023) and Sheared-LLaMA (Xia
et al., 2023) models of size 7B, 3B, 2.7B, and 1.3B. The results are shown in Table 5. We can see that
using a small amateur LM, especially the 1.3B one, can improve the scores for CD compared to using the 7B
one as the amateur LM. However, most of the scores only match the baseline (the 33B model is the only one
that beats the baseline), and they are still not better than DoLa. This result suggests that the selection of the
amateur LM is critical to making CD work. We explored many different amateur LMs but still could not
obtain significant improvements from CD.
Table 5: Exploration of the contrastive decoding baselines with different sizes of amateur models on GSM8K.
When implementing DoLa for TruthfulQA, we found that not applying the softmax function on top of $\mathcal{F}$
(defined in Section 2) makes the performance even better, as shown in Table 6, so we kept this
implementation for (and only for) the TruthfulQA multiple-choice setting. Both implementations
(with and without softmax) are much better than the baseline scores. We did not observe the same phenomenon
on other datasets.
Method (LLaMA-7B)         MC1   MC2   MC3
Vanilla                   25.6  40.6  19.2
DoLa w/ post softmax      31.9  52.2  28.2
DoLa w/o post softmax     32.2  63.8  32.1
Table 6: The scores of DoLa on the TruthfulQA multiple-choice setting with and without post-softmax applied
on top of F (defined in Section 2).
We also analyze two variants of DoLa on TruthfulQA: 1) contrasting only with the word embedding (0-th)
layer, and 2) contrasting dynamically with all the early even-numbered layers. The results are shown in
Table 7. Both variants lead to performance improvements, but they still fall behind our proposed DoLa.
                     LLaMA-7B               LLaMA-13B
Method               MC1   MC2   MC3        MC1   MC2   MC3
Vanilla              25.6  40.6  19.2       28.3  43.3  20.8
DoLa 0-th layer      31.6  61.7  30.1       28.5  62.3  30.2
DoLa all layers      32.0  63.9  31.2       30.5  62.3  31.0
DoLa                 32.2  63.8  32.1       28.9  64.9  34.8
                     LLaMA-33B              LLaMA-65B
Method               MC1   MC2   MC3        MC1   MC2   MC3
Vanilla              31.7  49.5  24.2       30.8  46.9  22.7
DoLa 0-th layer      31.4  61.1  31.1       31.0  63.6  31.2
DoLa all layers      29.1  61.5  30.7       30.5  62.0  31.7
DoLa                 30.5  62.3  34.0       31.1  64.6  34.3
Table 7: The scores on TruthfulQA of DoLa contrasting with the 0-th (word embedding) layer and with all the
early even-numbered layers.
We conduct an additional study of the quality of the generated text using GPT-4, given that several prior
studies (Chiang & Lee, 2023a; Liu et al., 2023) have shown the potential of GPT-4 to serve as an alternative
to human evaluation, with results stable across different prompts and instructions (Chiang & Lee, 2023b).
We adopt the pairwise evaluation code from Vicuna QA (https://github.com/lm-sys/vicuna-blog-eval/tree/main/eval).
To make GPT-4 focus only on quality without being distracted by factuality, we changed the core sentence of
the prompt to: "Please rate by the grammaticality and cohesiveness of their responses, but not factuality.
You are not required to verify the factual accuracy of the answers. Each assistant receives an overall score
on a scale of 1 to 10, where a higher score indicates better quality."
With the prompt above, we observed that GPT-4 judges the answers based on grammaticality and cohesiveness
without checking factual correctness. The results are shown in Table 8, where the scores are averages over
the 80 questions in Vicuna QA, on a scale of 1 to 10.
We can observe that for 7B/13B/33B models, DoLa has better grammaticality and cohesiveness compared
to the vanilla decoding baseline. For the largest 65B model, DoLa achieves a score that is almost the same as
vanilla decoding. We conclude that when evaluating text generation quality without considering factuality,
DoLa is still on par with (65B) or better than (7B/13B/33B) vanilla decoding.
E MEMORY OVERHEAD
To measure the overhead, we record (a) the occupied GPU memory before the first forward pass and
(b) the peak GPU memory during the forward passes. We then compute the memory overhead as (b) − (a),
or the proportional overhead [(b) − (a)] / (a) in %. For 13B/33B/65B, which require 2/4/8 GPUs, the total
memory is accumulated over all GPUs. The results are shown in Table 9.
Table 8: GPT-4 evaluation of text generation quality on a scale of 1 to 10, averaged over the 80 examples in
Vicuna QA.
We can see that during the forward pass of LLaMA-7B, the overhead for vanilla decoding is 2.5% while
DoLa requires 3.6%, i.e., a difference of only 1.1 percentage points. For the 13B/30B/65B models, the
difference is smaller than 1%. This result shows that the difference in
memory overhead between DoLa and the vanilla decoding baseline is still negligible.
Table 9: GPU memory overhead of the vanilla decoding baseline and DoLa.
Metric                                          LLaMA-7B              LLaMA-13B
                                                Baseline    DoLa      Baseline    DoLa
(a) GPU Memory Before Forward (MB)              12916.5     12916.5   25025.8     25025.8
(b) Peak GPU Memory During Forward (MB)         13233.9     13385.7   25510.7     25674.8
(b) − (a) GPU Memory Overhead (MB)              317.4       469.2     484.9       681.6
[(b) − (a)] / (a) GPU Memory Overhead (%)       2.5%        3.6%      1.9%        2.7%
Metric                                          LLaMA-30B             LLaMA-65B
                                                Baseline    DoLa      Baseline    DoLa
(a) GPU Memory Before Forward (MB)              55715.7     55715.7   124682.6    124682.6
(b) Peak GPU Memory During Forward (MB)         57057.5     57390.2   126950.0    127606.8
(b) − (a) GPU Memory Overhead (MB)              1341.9      1674.5    2267.4      2924.3
[(b) − (a)] / (a) GPU Memory Overhead (%)       2.4%        3.0%      1.8%        2.4%
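The two quantities (a) and (b) can be reproduced with PyTorch's CUDA memory statistics; below is a minimal single-GPU sketch that assumes a loaded `model` and `input_ids` as in the earlier sketches (for the sharded 13B/33B/65B settings, the same numbers would be summed over all GPUs):

```python
import torch

torch.cuda.reset_peak_memory_stats()
before = torch.cuda.memory_allocated()    # (a) occupied memory before the forward pass

with torch.no_grad():
    _ = model(input_ids)                  # one decoding forward pass

peak = torch.cuda.max_memory_allocated()  # (b) peak memory during the forward pass
print(f"overhead: {(peak - before) / 2**20:.1f} MB "
      f"({100.0 * (peak - before) / before:.1f}%)")
```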
F INFERENCE DETAILS
We run all experiments on NVIDIA V100 GPUs, on machines equipped with 40-core Intel(R) Xeon(R)
Platinum 8168 CPUs @ 2.70GHz. We use the Hugging Face Transformers package
(https://github.com/huggingface/transformers) to conduct the experiments. When decoding responses from the
language models, we use greedy decoding for TruthfulQA, StrategyQA, and GSM8K. For the Vicuna QA
benchmark, we use random sampling with temperature 0.7 and max new tokens 1024 to generate the responses.
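With the Hugging Face generate API, the Vicuna QA sampling configuration would look roughly as follows, assuming the `model`, `tok`, and `input_ids` placeholders from the earlier sketches; this is a sketch that omits how DoLa's layer contrast is hooked into generation, and greedy decoding (do_sample=False) is used for the other benchmarks:

```python
# Random sampling for Vicuna QA responses (a sketch of the generation configuration only).
output_ids = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=1024,
    repetition_penalty=1.2,  # assumption: the theta = 1.2 penalty applied via the generate API
)
response = tok.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
```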
For the latency and throughput analysis in Section 4.2, we use the 817 examples from TruthfulQA with the
default 6-shot in-context demonstration prompt, yielding an average input length of 250.3 tokens after
concatenating the prompt with the questions. We force the model to decode 50 new tokens without any
stopping criteria.
We run the models in 16-bit floating point with batch size 1. For the LLaMA 7/13/33/65B models, we
use 1/2/4/8 GPUs, respectively. Cross-GPU inference with model weight sharding is handled by the Hugging
Face Accelerate package (https://huggingface.co/docs/accelerate/concept_guides/big_model_inference).
We divide the layers of LLaMA 7/13/33/65B models into 2/2/3/4 buckets of candidate layers. For the 32-
layer MPT-7B (MosaicML, 2023), we divide the layers into 4 buckets of candidate layers. We exclude the
0-th layer (word embedding layer) for MPT-7B because its word embedding layer and LM prediction head
share their weights; directly connecting the word embedding layer to the LM prediction head would amount
to an operation close to an identity mapping.
The following table summarizes the best bucket selected on the validation set. For TruthfulQA and FACTOR,
although we conduct two-fold validation, the buckets selected by the two folds are consistently the same.
G NON-LLAMA MODEL
To check if DoLa works beyond LLaMA models, we tested MPT-7B (MosaicML, 2023). Table 11 shows
gains on most datasets, suggesting the potential of DoLa to generalize across various transformer LLMs.
In Figure 6, we show additional examples on FACTOR-News comparing the performance of DoLa and
DoLa-static for the four LLaMA models.
(Figure 6: FACTOR-News accuracy of DoLa and DoLa-static across premature layers for the four LLaMA models.)
Besides the visualized comparisons, we also compare the scores of DoLa and DoLa-static in Tables 12, 13,
and 14. The premature layers of DoLa-static are selected based on validation-set performance. In the
two-fold validation settings, we report both selected layers in the tables (Val Selected Layer).
We observe that for TruthfulQA and FACTOR, DoLa-static is slightly better than DoLa in most cases.
However, for StrategyQA and GSM8K, DoLa consistently outperforms DoLa-static. Considering that DoLa
is more robust and generalizable, requiring only a very small hyperparameter search space, we use DoLa as
our main proposed method instead of DoLa-static.
One question about our proposed method is how optimal the dynamic layer selection is. For comparison,
we use a "random" baseline similar to DoLa but with layers chosen randomly. Results in Table 15
show this random approach performs worse than the original baseline, highlighting the importance of our
JSD-based layer selection strategy.
Table 12: Multiple choices results on TruthfulQA. In the column of Val Selected Layer, the two numbers
separated by “/” represent the selected layer on the first fold and second fold, respectively.
Table 13: Multiple choices results on FACTOR. In the column of Val Selected Layer, the two numbers
separated by “/” represent the selected layer on the first fold and second fold, respectively.
In Section 2.3, we discussed that DoLa sometimes repeats content, particularly on StrategyQA and GSM8K.
To mitigate this, we apply a repetition penalty. Figures 7 and 8 show that this improves the performance of
DoLa on StrategyQA and GSM8K but hurts the performance of the baseline. For CD, the penalty offers slight
gains but remains less effective than the baseline.
Figure 7: Baseline, CD, DoLa with different levels of repetition penalty on StrategyQA.
Figure 8: Baseline, CD, DoLa with different levels of repetition penalty on GSM8K.
Table 16: Additional short response examples from LLaMA-33B and DoLa with the questions from Truth-
fulQA.
Besides the examples where DoLa outperforms the baseline, we also show examples where DoLa underperforms
the baseline according to GPT-4's judgment in Tables 21 and 22. We observe that although DoLa tends to
generate detailed factual information, it is sometimes less relevant to the question than the baseline's answer.
In future work, it would be worth exploring how to improve the instruction-following ability of LLMs
alongside their factuality.
Model             TruthfulQA-MC              FACTOR
                  MC1   MC2   MC3            News  Wiki
GPT2-Medium       23.5  41.9  20.0           41.0  31.6
+ DoLa            22.9  41.4  16.4           22.2  20.9
Table 18: Qualitative Study for LLaMA-33B and DoLa with GPT-4 judgement.
Table 19: Qualitative Study for LLaMA-33B and DoLa with GPT-4 judgement.
Table 20: Qualitative Study for LLaMA-33B and DoLa with GPT-4 judgement.
Table 21: Qualitative Study for LLaMA-33B and DoLa with GPT-4 judgement.
Table 22: Qualitative Study for LLaMA-33B and DoLa with GPT-4 judgement.