STaR: Self-Taught Reasoner: Bootstrapping Reasoning With Reasoning
Abstract
Generating step-by-step "chain-of-thought" rationales improves language model
performance on complex reasoning tasks like mathematics or commonsense
question-answering. However, inducing language model rationale generation cur-
rently requires either constructing massive rationale datasets or sacrificing accuracy
by using only few-shot inference. We propose a technique to iteratively leverage a
small number of rationale examples and a large dataset without rationales, to boot-
strap the ability to perform successively more complex reasoning. This technique,
the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to
answer many questions, prompted with a few rationale examples; if the generated
answers are wrong, try again to generate a rationale given the correct answer; fine-
tune on all the rationales that ultimately yielded correct answers; repeat. We show
that STaR significantly improves performance on multiple datasets compared to a
model fine-tuned to directly predict final answers, and performs comparably to fine-
tuning a 30× larger state-of-the-art language model on CommonsenseQA. Thus,
STaR lets a model improve itself by learning from its own generated reasoning.
1 Introduction
Human decision-making is often the result of extended chains of thought [1, 2]. Recent work has
shown that explicit intermediate reasoning (“rationales”) can improve large language model (LLM)
performance as well [3–8]. For example, [5] demonstrated that LLMs explicitly trained to use
“scratchpads” for intermediate steps can attain perfect in-distribution performance on arithmetic,
and strong out-of-distribution generalization, while models trained to predict answers directly fail
to do either. These works suggest that generating explicit rationales before giving a final answer
(“rationale generation”) is valuable for LLMs across diverse tasks including mathematical reasoning,
commonsense reasoning, code evaluation, social bias inference, and natural language inference.
However, the two primary methods for inducing rationale generation both have serious drawbacks.
One approach to rationale generation is the construction of a fine-tuning dataset of rationales, either
manually by human annotators or automatically with hand-crafted templates [3–5, 9]. Manual
methods are expensive, and it is infeasible to construct such a dataset for each interesting problem
[3]. Meanwhile, template-based methods rely on automatically-generated rationales but only work
when a general solution is already known [5] or reasonable hard-coded heuristics can be made [4].
An alternative is to leverage in-context learning by including only a few rationale examples in the
language model prompt. This has been shown to improve accuracy on mathematical and symbolic
reasoning tasks relative to prompting without rationales (“direct” prompting) [5, 6]. Yet, while few-
shot techniques with rationales tend to outperform their non-reasoning counterparts, they generally
substantially underperform models fine-tuned to directly predict answers using larger datasets [5, 6].
Figure 1: An overview of STaR, together with a STaR-generated rationale on CommonsenseQA. Questions are few-shot prompted for rationale generation; rationales whose answers are correct are used to finetune the language model, while questions answered incorrectly are given the correct answer as a hint and re-attempted via rationalization. The illustrated example asks "What can be used to carry a small dog?" with answer choices (a) swimming pool, (b) basket, (c) dog show, (d) backyard, (e) own home, and the generated rationale reads: "The answer must be something that can be used to carry a small dog. Baskets are designed to hold things. Therefore, the answer is basket (b)."
In this paper, we adopt a different approach: by leveraging the LLM’s pre-existing reasoning ability,
we iteratively bootstrap the ability to generate high-quality rationales. Specifically, we few-shot
prompt a large language model to self-generate rationales and refine the model’s ability further by
fine-tuning on those rationales that lead to correct answers. We repeat this procedure, using the
improved model to generate the next training set each time. This is a synergistic process, where
improvements in rationale generation improve the training data, and improvements in training data
further improve rationale generation.
However, we find this loop eventually fails to solve any new problems in the training set because it
receives no direct training signal for problems it fails to solve. To overcome this issue, we propose
rationalization: for each problem that the model fails to answer correctly, we generate a new rationale
by providing the model with the correct answer. This lets the model reason backward—given the
correct answer, the model can more easily generate a useful rationale. These rationales are then
collected as part of the training data, which often improves overall accuracy.
We thus develop the Self-Taught Reasoner (STaR, Fig. 1) method, a scalable bootstrapping method
allowing models to learn to generate their own rationales, while also learning to solve increasingly
difficult problems. In our method, we repeat the following process: in each iteration, first construct a
finetuning dataset by attempting to solve the dataset using the current model’s rationale generation
ability; then, augment this dataset using rationalization, justifying ground-truth answers to problems
the model failed to solve; finally, finetune the large language model on the combined dataset.
Applying STaR on arithmetic, math word problems, and commonsense reasoning, we observe it is
able to effectively translate a small number of few-shot prompts into a large rationale dataset, yielding
dramatic performance improvements. On CommonsenseQA [10], we find STaR improves over both
a few-shot baseline (+35.9%) and a baseline fine-tuned to directly predict answers (+12.5%), and
performs comparably to a fine-tuned model that is 30× larger (72.5% vs. 73.0%).
Thus, our contributions are as follows: we propose a bootstrapping mechanism, STaR, which iteratively translates a small set of rationale examples and a large set of problems without rationales into a growing fine-tuning dataset of model-generated rationales; we complement rationale generation with rationalization, in which the ground-truth answer is provided as a hint so that training signal can be obtained from problems the model initially fails to solve; and we evaluate both techniques on arithmetic, math word problems, and commonsense reasoning, where STaR substantially outperforms direct-answer fine-tuning.
2 Background and Related Work
In-context Learning Recently, a collection of works has emerged exploring the capacity for large
language models to perform in-context learning [11, 12]. In essence, in-context learning treats
few-shot learning as a language modeling problem, by showing a few examples in the context (i.e.
prompt), and allowing the model to learn and identify the pattern to apply to new examples. Some
have studied in-context learning based on the language modeling objective in terms of Bayesian
inference [13] while others have attempted to describe the process more mechanistically in terms
of “induction heads” [14]. Moreover, differences in prompt configurations have been known to have
dramatic effects on few-shot performance. Some have even found that replacing few-shot prompts
with a “soft prompt” which can be optimized in embedding space results in noticeable gains [15].
Instead of emphasizing the representation of the question, we focus on the model output; in particular,
we focus on the model’s ability to reason through a problem before coming to a conclusion.
Rationales One of the initial works on the impact of rationales on language model performance was
[3], showing that training a language model on a dataset with explicit rationales preceding the answer
could improve a model’s ability to generate the final answer. However, this required many thousands
of training examples to be manually annotated with human reasoning. Recently, [5] demonstrated
that step-by-step “scratchpads” can improve fine-tuned LLM performance and generalization on
tasks such as arithmetic, polynomial evaluation, and program evaluation. Similarly, [6] used a single
few-shot “chain-of-thought” reasoning prompt in order to improve model performance on a collection
of tasks, without fine-tuning. Finally, [16] showed that a curriculum learning approach could help
solve formal math problems, as long as 1) they were translated into Lean (a theorem-proving language
[17]), 2) one could directly evaluate the validity of the proofs, 3) one could sample numerous potential
solutions for each problem, 4) one had trained a separate value function model, and 5) one started with GPT-f
(a model already fine-tuned on a large math dataset [18]). We note that there are many domains where
these conditions do not all apply. In addition, works have aimed to explain why rationales have this
beneficial effect: some have analyzed their impact from the perspective of latent variable models [19]
while others have provided formal proofs of the benefit of intermediate task supervision [20].
Iterated Learning A variety of iterated learning algorithms have been proposed, where solutions
or successful methods which are found are in turn used to find additional solutions [21, 22, 16]. [21]
introduced Expert Iteration (ExIt), a reinforcement learning technique serving as an inspiration for
our approach. Essentially, it consists of a loop of self-play by an “apprentice,” followed by imitation
learning with feedback from a slower “expert” and then the replacement of the expert with the now-
improved apprentice. [16] builds off of ExIt for formal reasoning, while [22] applies iterated learning
to visual question answering using modular networks which can be combined compositionally. There
are further similarities between STaR and expert iteration methods [21]. For example, filtering
generated examples based on whether their ultimate answer matches the target can be seen as expert
feedback. However, we have a fixed “expert” and do not train a separate value function.
Natural Language Explanations Natural language explanations have also been discussed from the
perspective of explainable machine learning, focusing on justification rather than reasoning [23, 24].
The motivation for this line of work is largely grounded in explainable decision making, and similarly
to [3], generally does not find that requiring post-hoc explanations improves model performance.
3 Method
3.1 Rationale Generation Bootstrapping (STaR Without Rationalization)
We are given a pretrained LLM $M$ and an initial dataset of problems $x$ with answers $y$: $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{D}$. Our technique starts with a small prompt set $\mathcal{P}$ of examples with intermediate rationales $r$: $\mathcal{P} = \{(x^p_i, r^p_i, y^p_i)\}_{i=1}^{P}$, where $P \ll D$ (e.g. $P = 10$). Like standard few-shot prompting, we concatenate this prompt set to each example in $\mathcal{D}$, i.e. $x_i = (x^p_1, r^p_1, y^p_1, \ldots, x^p_P, r^p_P, y^p_P, x_i)$, which encourages the model to produce a rationale $\hat{r}_i$ for $x_i$ followed by an answer $\hat{y}_i$. We assume that rationales that lead to correct answers are of better quality than those that lead to incorrect answers. Therefore, we filter the generated rationales to include only the ones which result in the correct answer ($\hat{y}_i = y_i$). We fine-tune the base model $M$ on this filtered dataset, and then restart this process
by generating the new rationales with the newly fine-tuned model. We keep repeating this process
until the performance plateaus. Note that during this process, once we collect a new dataset, we train
from the original pre-trained model M instead of continually training one model to avoid overfitting.
We provide an outline of this algorithm in Algorithm 1.
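To make the generation-and-filter step concrete before presenting Algorithm 1, the following is a minimal Python sketch (not the paper's code); render_example and sample_model are hypothetical helpers standing in for formatting one exemplar as text and for sampling a (rationale, answer) pair from the model.

# Sketch of the rationale generation and filtering step of Section 3.1.
# render_example(x, r, y) formats one few-shot exemplar as text;
# sample_model(model, prompt) returns the generated (rationale, answer).

def build_prompt(prompt_set, question, render_example):
    """Concatenate the P rationale exemplars ahead of the new question."""
    exemplars = "\n\n".join(render_example(x, r, y) for x, r, y in prompt_set)
    return f"{exemplars}\n\nQ: {question}\nA:"

def generate_filtered_rationales(model, prompt_set, dataset,
                                 render_example, sample_model):
    """Keep only generated rationales whose final answer matches the label."""
    kept = []
    for question, gold_answer in dataset:
        prompt = build_prompt(prompt_set, question, render_example)
        rationale, answer = sample_model(model, prompt)
        if answer == gold_answer:
            kept.append((question, rationale, gold_answer))
    return kept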
STaR can be seen as an approximation to an RL-style policy gradient objective. To see this, note that $M$ can be viewed as a discrete latent variable model $p_M(y \mid x) = \sum_r p(r \mid x)\, p(y \mid x, r)$; in other words, $M$ first samples a latent rationale $r$ before predicting $y$. Now, given the indicator reward function $\mathbb{1}(\hat{y} = y)$, the total expected reward across the dataset is

$$J(M, X, Y) = \sum_i \mathbb{E}_{\hat{r}_i, \hat{y}_i \sim p_M(\cdot \mid x_i)} \left[ \mathbb{1}(\hat{y}_i = y_i) \right]. \qquad (1)$$
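To spell out the connection, here is a brief sketch using the standard log-derivative (REINFORCE) identity rather than anything quoted from the paper:

$$\nabla J(M, X, Y) = \sum_i \mathbb{E}_{\hat{r}_i, \hat{y}_i \sim p_M(\cdot \mid x_i)} \left[ \mathbb{1}(\hat{y}_i = y_i)\, \nabla \log p_M(\hat{y}_i, \hat{r}_i \mid x_i) \right].$$

A Monte Carlo estimate of this gradient takes log-likelihood gradient steps only on sampled (rationale, answer) pairs whose final answer is correct, since samples with wrong answers receive zero reward and contribute nothing. Under this reading, filtering by answer correctness and fine-tuning on the survivors is one such estimate.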
3.2 Rationalization
The rationale generation bootstrapping algorithm carries a limitation. Since the model is only trained on the examples which it answers correctly, improvement ends when the model fails to solve new problems in the training set. This is fundamentally due to the fact that the algorithm cannot obtain any training signal from failed examples. Inspired by [3], we propose a technique we call "rationalization". Specifically, we provide the answer as a hint to the model and ask it to generate rationales in the same style as in the previous rationale generation step. Given the answer, the model is able to reason backwards, and hence more easily generate a rationale that leads to the correct answer. For example, in Figure 2, we provide the hint that "(b) grocery cart" is the correct answer in the prompt to generate the rationale. We apply rationalization to the problems which the model failed to solve with rationale generation. When adding a rationalization-generated rationale to our dataset, we do not include the hint in its corresponding prompt, as if the model had come up with the rationale without the hint. After filtering, we fine-tune on the previously generated dataset combined with the rationalization-generated dataset.

Q: Where do you put your grapes just before checking out?
Answer Choices:
(a) mouth
(b) grocery cart (CORRECT)
(c) super market
(d) fruit basket
(e) fruit market
A: The answer should be the place where grocery items are placed before checking out. Of the above choices, grocery cart makes the most sense for holding grocery items. Therefore, the answer is grocery cart (b).

Figure 2: A few-shot prompt hint we use for rationalization (and not for rationale generation), using the rationale from [6], with its hint included in green, followed by the rationale and the answer generated by the model.
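To illustrate how such a hinted prompt might be constructed for a multiple-choice question, here is a small sketch in the spirit of Figure 2. The exact hint format used in the paper is not reproduced; add_hint is a hypothetical helper (mirroring the name used in Algorithm 1 below), and the "(CORRECT)" marker simply echoes the hint shown in Figure 2.

# Sketch of building a rationalization prompt: the correct choice is marked
# so the model can reason backward toward it. The real formatting may differ.

def add_hint(question: str, choices: dict, correct_letter: str) -> str:
    lines = [f"Q: {question}", "Answer Choices:"]
    for letter, text in choices.items():
        marker = " (CORRECT)" if letter == correct_letter else ""
        lines.append(f"({letter}) {text}{marker}")
    lines.append("A:")  # the model continues with a rationale and final answer
    return "\n".join(lines)

example = add_hint(
    "Where do you put your grapes just before checking out?",
    {"a": "mouth", "b": "grocery cart", "c": "super market",
     "d": "fruit basket", "e": "fruit market"},
    correct_letter="b",
)

When a rationale produced this way is added to the training data, the hint is stripped from the stored prompt, matching the procedure described above.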
Algorithm 1 STaR
Input: M: a pretrained LLM; dataset D = {(x_i, y_i)}_{i=1}^{D} (with few-shot prompts)
1: M_0 ← M                                                            # Copy the original model
2: for n in 1...N do                                                  # Outer loop
3:   (r̂_i, ŷ_i) ← M_{n−1}(x_i)  ∀i ∈ [1, D]                          # Perform rationale generation
4:   (r̂_i^rat, ŷ_i^rat) ← M_{n−1}(add_hint(x_i, y_i))  ∀i ∈ [1, D]   # Perform rationalization
5:   D_n ← {(x_i, r̂_i, y_i) | i ∈ [1, D] ∧ ŷ_i = y_i}                # Filter rationales using ground truth answers
6:   D_n^rat ← {(x_i, r̂_i^rat, y_i) | i ∈ [1, D] ∧ ŷ_i ≠ y_i ∧ ŷ_i^rat = y_i}   # Filter rationalized rationales
7:   M_n ← train(M, D_n ∪ D_n^rat)                                    # Finetune the original model on correct solutions (inner loop)
8: end for
Algorithm 1 describes the full algorithm; the rationalization steps are line 4 together with the D_n^rat terms in lines 6 and 7. Without those steps, Algorithm 1 corresponds to STaR without rationalization. Figure 1 provides an
overview diagram. Fine-tuning on the dataset generated by rationalization has a crucial benefit of
exposing the model to difficult problems which otherwise would not have appeared in its finetuning
dataset. This can be understood as challenging the model to “think outside the box” about problems
on which it was unsuccessful. A secondary benefit of rationalization is an increase in dataset size.
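As a concrete rendering of Algorithm 1, the following is a minimal Python sketch of the outer loop. It is not the paper's implementation; generate, add_hint, and finetune are hypothetical helpers standing in for sampling from the model, building the hinted prompt, and fine-tuning from the original pretrained weights.

# Minimal sketch of Algorithm 1 (STaR with rationalization), assuming three
# hypothetical helpers:
#   generate(model, prompt)      -> (rationale_text, answer_text)
#   add_hint(question, answer)   -> the question with the correct answer as a hint
#   finetune(base_model, data)   -> a model fine-tuned from the *original* weights

def star(base_model, dataset, n_iterations, generate, add_hint, finetune):
    """dataset: list of (few-shot-prompted question, ground-truth answer) pairs."""
    model = base_model
    for _ in range(n_iterations):                        # outer loop
        train_examples = []
        for question, gold in dataset:
            rationale, answer = generate(model, question)          # rationale generation
            if answer == gold:                                     # keep only correct answers
                train_examples.append((question, rationale, gold))
                continue
            # Rationalization: retry only the failures, with the answer as a hint.
            rat_rationale, rat_answer = generate(model, add_hint(question, gold))
            if rat_answer == gold:
                # Stored without the hint, as if produced unprompted.
                train_examples.append((question, rat_rationale, gold))
        model = finetune(base_model, train_examples)     # always restart from the base model
    return model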
4 Experiments
For our experiments, we focus on arithmetic, commonsense reasoning, and grade school math to
demonstrate STaR’s breadth. In particular, for arithmetic, we follow a setup inspired by [5]. For
commonsense question-answering we follow [13, 6] and use CommonsenseQA (CQA), a widely
used multiple-choice dataset for this domain [10]. For grade school math, we use GSM8K from [9].
Figure 4: A visualization of the accuracy of n-digit summation with each iteration of STaR, (a) without rationalization and (b) with rationalization. Each series corresponds to the accuracy of summing two n-digit numbers.
CommonsenseQA (CQA) Corresponding to the broad variety of ConceptNet, CQA contains a diverse set of questions which
require commonsense reasoning ability building off of standard world knowledge, where human
performance is 89% [10]. Many have pointed out that CQA contains a number of biases, along
several dimensions including gender [3]. We discuss how this may impact our method in Appendix G.
There are also many typos and questions which are fundamentally ambiguous.² We use it despite
these issues as it is a general question-answering dataset relying on both common world knowledge
and simple reasoning, which serves as a good test-bed for our method.
Grade School Math (GSM8K) We also evaluate on the Grade School Math (GSM8K) dataset,
which contains 7,473 train and 1,319 test examples of grade-school-level word problems [9]. These
math problems are posed in natural language and require two to eight calculation steps to arrive at a
final answer. This dataset combines the skills needed for arithmetic and commonsense reasoning.
The accuracies of the model across digits 1-5 over each iteration of the outer loop are plotted in
Figure 4. After running STaR for 16 iterations, the overall accuracy is 89.5%. For reference, a
baseline trained on 10,000 examples without rationales for 5,000 steps attains 76.3% accuracy.
Notably, few-shot accuracy on arithmetic problems is very low, even with rationales: accuracy on 2-digit addition is less than 1%, and accuracy on problems with more digits is close to zero.
With rationalization, the accuracy is able to improve especially quickly. After one fine-tuning iteration on the model's generated scratchpads, 2-digit addition improves to 32% from less than 1%. Without rationalization, the performance improvement is stage-wise: the model generally has poor performance on the n-digit sum until it has good performance on the (n − 1)-digit sum. With rationalization, the model can learn many lengths at once, though not with equal accuracy. Rationalization allows many problems to be solved few-shot, so we start STaR training with 300 steps (note, doing so without rationalization causes overfitting on 1-digit addition), and increase training by 20 steps per iteration.

Figure 5: We introduce additional digits to STaR with rationalization at the 20th iteration.
We also perform an experiment where we continue training STaR with rationalization with additional digits, starting before the 20th iteration, while keeping the total number of training examples
fixed at each iteration. We find that not only does this appear to quickly improve performance on the
initial set of digits, but when evaluated on 9 and 10 digit examples, never seen during training, the
model successfully solves many of these out-of-distribution problems. As visualized in Figure 5, the
introduction of these digits appears to make the training less stable, but the exact cause is unclear.
² For example, "Billy bought coffee and waited for his wife to arrive from France. Where might he have been?" includes airport and train station as options. The correct answer, perhaps surprisingly, is train station.
Table 1: We evaluate several baselines, including a few-shot GPT-J evaluation both with and without scratchpads, a GPT-J baseline finetuned to directly predict the answer, and STaR with and without rationalization applied to GPT-J. We use CoT to denote non-STaR models outputting rationales, and Direct to indicate those directly predicting the final answer. Note the final STaR model is trained on 78.2% of the training dataset with rationale generation, and an additional 8.5% from rationalization.

                                   CQA Dev Set Accuracy (%)   Train Data Used (%)
GPT-3 Direct Finetuned [29]                  73.0                     100
Few-shot Direct GPT-J                        20.9                     ∼0
Few-shot CoT GPT-J                           36.6                     ∼0
Few-shot CoT LaMDA 137B [6]                  55.6                     ∼0
GPT-J Direct Finetuned                       60.0                     100
STaR without rationalization                 68.8                     69.7
STaR with rationalization                    72.5                     86.7
Table 2: We find that STaR substantially improves GSM8K performance over the baselines, despite training on only 25.0% of the data for the model without rationalization, and 28.7% of the dataset (with 0.5% from rationalization) for the model with rationalization.

                                   GSM8K Test Accuracy (%)    Train Data Used (%)
Few-shot Direct GPT-J                         3.0                     ∼0
Few-shot CoT GPT-J                            3.1                     ∼0
GPT-J Direct Finetuned                        5.8                     100
STaR without rationalization                 10.1                     25.0
STaR with rationalization                    10.7                     28.7
Failure Cases Finally, we found a variety of interesting failure cases, many of which corresponded
to standard logical fallacies. For example, the model often made statements related to the topic of the
question but which were not actually arguments for why the answer should be true. Sometimes, the
model claimed the question implied the answer as an argument, without explaining why. Other times,
especially early in training, the model answered as if it had knowledge about a particular individual,
instead of making a general statement - e.g. “the king’s castle is a place where he feels safe” instead
of “castles are places where kings feel safe.” We provide examples and analyze errors in Appendix A.
Few-shot Prompt Training Including few-shot prompts during fine-tuning [12] appears to have
a meaningful performance benefit (60.9% to 68.8% without rationalization, 69.9% to 72.5% with
rationalization). Thus, we generally suggest its use for at least some portion of the training, though
we discuss some caveats in Section 5.
We find that on this task, the use of rationalization does not substantially improve performance. Note that, in training, it was necessary to cap the number of training steps at the 30th iteration (after 7912 steps), to prevent the training process from becoming prohibitively long. The results were reached after 36 iterations for STaR without rationalization and an additional 10 iterations with rationalization.

Most often, the number of calculation steps generated by the model matches the number of steps taken by humans (generally between 53% and 57% agreement across all iterations). We visualize this explicitly in Figure 6. We see that when the ground truth and model disagree on the number of calculation steps, the model typically uses fewer. Sometimes this is because the model skips steps, but occasionally it finds different solutions. We show an example in Appendix J, where the model disregards redundant information and solves a 7-step problem in a single step.

Figure 6: A comparison of the number of calculator steps generated by the model in order to solve examples in the training set relative to the number of steps used in the ground truth.
In addition, due to the low sampling temperature, the outputs without rationalization correspond
to the examples where the model is most confident in its answer. This results in these examples
providing a weaker gradient signal than the rationalization examples, at least in the first iteration.
Since we retrain from the initial pre-trained model every time we run a fine-tuning iteration, the
degree of this effect is also difficult to measure directly. Finally, we must point out that the method
to add the “hint” does not follow immediately from the question and answer, and in some contexts
providing it may be nontrivial. An exploration of the various impacts of different hinting techniques
and their generality is an avenue for future work.
Temperature One intuitive alternative to rationalization, if one seeks to expand the training dataset,
is more and higher-temperature sampling. However, in practice, we found that this is counterpro-
ductive. In general, it substantially increases the likelihood of a correct answer despite incorrect
reasoning, and training on bad or irrelevant reasoning prevents generalization. This is particularly
clear in more structured tasks, like arithmetic, where the scratchpads that the model learns to produce
with a higher-temperature sampling approach diverge into meaninglessness and cause the model to
stagnate. Overall, we found that higher temperatures as an alternative to rationalization (e.g. 0.5
or 0.7) consistently led to models worse than models with reasoning alone. In addition, as text
generation by large language models is sequential (i.e. one cannot produce a token without producing
the preceding token), generating text is a bottleneck and this is computationally far less efficient
than rationalization. For example, generating 10 sample outputs is approximately 10 times slower
than generating one sample output. However, one potentially valuable way to leverage multiple
samples would be to use the method proposed in [32], using the majority-vote result of multiple high-
temperature scratchpads as a ground truth against which we compare a low-temperature scratchpad.
This may allow one to apply STaR to a dataset of only questions, without answers.
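As a sketch of that label-free variant (an illustration under assumptions, not something implemented in the paper): sample is a hypothetical helper returning a (rationale, answer) pair at a given temperature, and the majority answer over several high-temperature samples plays the role of the ground-truth label.

from collections import Counter

def pseudo_label_filter(model, question, sample, n_samples=10,
                        high_temp=0.7, low_temp=0.0):
    """Keep a low-temperature rationale only if its answer matches the
    majority vote over high-temperature samples (self-consistency, [32])."""
    votes = [sample(model, question, temperature=high_temp)[1]
             for _ in range(n_samples)]
    majority_answer, _ = Counter(votes).most_common(1)[0]
    rationale, answer = sample(model, question, temperature=low_temp)
    if answer == majority_answer:
        return (question, rationale, majority_answer)
    return None  # discard: the greedy rationale disagrees with the vote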
Few-shot Prompting A noteworthy phenomenon is that the inclusion of few-shot prompting during
sampling seems to dramatically reduce “drift” where later rationales become increasingly dissimilar
from the initial few-shot set of rationales. One benefit of this is that the model may be less constrained
by the quality and difficulty of the initial rationales, theoretically allowing it to generalize more. One
potentially negative consequence is that the style of the rationales may less-closely match the original
prompting style. Another benefit is in terms of computational resources - a shorter prompt length
allows for a shorter sequence length when sampling. Technically, the point in training at which we
“disable” few-shot prompts is another hyperparameter which we could tune, but we leave this to future
work. In addition, by leaving prompts out after the initial outer-loop iteration, the model tends to
perform gradually worse at rationalization as it trains for longer periods of time. As a result, it may
be necessary to include some hints during training for long periods of time with this approach.
Ultimately, the choice to include few-shot prompts in later iterations of training appears to depend on
the use-case: when the goal is consistent adherence to a particular prompt style, which may benefit
explainability, include few-shot prompts in sampling; when the goal is a faster training loop, one may
remove them. Moreover, it is possible that with other datasets or larger models there is an impact on
performance, so we encourage this to be generally treated as a hyperparameter.
6 Conclusion
We present the Self-Taught Reasoner (STaR), which iteratively improves a model’s ability to generate
rationales to solve problems. We few-shot prompt a model to solve many problems in a step-by-step
manner by generating rationales, and then prompt it to rationalize the correct answer for problems
it gets wrong. We finetune on both the initially correct solutions and rationalized correct solutions,
and repeat the process. We find that this technique significantly improves the model’s generalization
performance on both symbolic reasoning and natural language reasoning.
There are several important limitations on STaR as presented. In order for the first iteration of STaR
to succeed, few-shot performance must be above chance, implying that the initial model must be
big enough to have some reasoning capabilities. For instance, we found that GPT-2 was not able to bootstrap from few-shot reasoning even in the arithmetic domain. A further limitation is that
settings with a high level of chance performance (e.g. binary decisions) yield many poor rationales,
confounding the STaR approach. An open problem is how to filter bad reasoning in these settings.
Nonetheless, we believe using examples without reasoning to bootstrap reasoning is a very general
approach, and that STaR can serve as the basis of more sophisticated techniques across many domains.
Acknowledgements
We thank Imanol Schlag for his detailed feedback about this work, as well as Rose E Wang, Markus
Rabe, Aitor Lewkowycz, Rishi Bommasani, Allen Nie, Alex Tamkin, and Qian Huang. We thank
Cem Anil for his very helpful insight that rationale finetuning performance can be improved if the
training includes the few-shot rationales. We also thank Ben Prystawski for his suggestions on survey
creation. We thank Google TPU Research Cloud for TPU access.
References
[1] William James, Frederick Burkhardt, Fredson Bowers, and Ignas K Skrupskelis. The principles
of psychology, volume 1. Macmillan London, 1890.
[2] K Anders Ericsson and Herbert A Simon. Protocol analysis: Verbal reports as data. The MIT Press, 1984.
[3] Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. Explain yourself!
leveraging language models for commonsense reasoning. ACL, 2019.
[4] Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Unsupervised
commonsense question answering with self-talk. EMNLP 2020, 2020.
[5] Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin,
David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show
your work: Scratchpads for intermediate computation with language models. arXiv preprint
arXiv:2112.00114, 2021.
[6] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny
Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint
arXiv:2201.11903, 2022.
[7] Ana Marasović, Iz Beltagy, Doug Downey, and Matthew E Peters. Few-shot self-rationalization
with natural language prompts. arXiv preprint arXiv:2111.08284, 2021.
[8] Andrew K Lampinen, Ishita Dasgupta, Stephanie CY Chan, Kory Matthewson, Michael Henry
Tessler, Antonia Creswell, James L McClelland, Jane X Wang, and Felix Hill. Can language
models learn from explanations in context? arXiv preprint arXiv:2204.02329, 2022.
[9] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christo-
pher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint
arXiv:2110.14168, 2021.
[10] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A
question answering challenge targeting commonsense knowledge. In Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019.
[11] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[12] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan
Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. ICLR
2022, 2021.
[13] Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of
in-context learning as implicit bayesian inference. arXiv preprint arXiv:2111.02080, 2021.
[14] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom
Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, and et al. In-context learning
and induction heads. Transformer Circuits, Mar 2022.
[15] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient
prompt tuning. EMNLP 2021, 2021.
[16] Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya
Sutskever. Formal mathematics statement curriculum learning. arXiv preprint arXiv:2202.01344,
2022.
[17] Leonardo de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer.
The lean theorem prover (system description). In International Conference on Automated
Deduction, pages 378–388. Springer, 2015.
[18] Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving.
arXiv preprint arXiv:2009.03393, 2020.
[19] Wangchunshu Zhou, Jinyi Hu, Hanlin Zhang, Xiaodan Liang, Maosong Sun, Chenyan Xiong,
and Jian Tang. Towards interpretable natural language understanding with explanations as latent
variables. Advances in Neural Information Processing Systems, 33:6803–6814, 2020.
[20] Noam Wies, Yoav Levine, and Amnon Shashua. Sub-task decomposition enables learning in
sequence to sequence tasks. arXiv preprint arXiv:2204.02892, 2022.
[21] Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning
and tree search. Advances in Neural Information Processing Systems, 30, 2017.
[22] Ankit Vani, Max Schwarzer, Yuchen Lu, Eeshan Dhekane, and Aaron Courville. Iterated
learning for emergent systematicity in vqa. ICLR 2021, 2021.
[23] Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. e-snli:
Natural language inference with natural language explanations. Advances in Neural Information
Processing Systems, 31, 2018.
[24] Hanxiong Chen, Xu Chen, Shaoyun Shi, and Yongfeng Zhang. Generate natural language
explanations for recommendation. arXiv preprint arXiv:2101.03392, 2021.
[25] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal
policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[26] Ben Wang. Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Lan-
guage Model with JAX. https://github.com/kingoflolz/mesh-transformer-jax,
May 2021.
[27] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and
Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning
work? arXiv preprint arXiv:2202.12837, 2022.
[28] Robyn Speer, Joshua Chin, and Catherine Havasi. ConceptNet 5.5: An open multilingual graph of general knowledge. arXiv preprint arXiv:1612.03975, 2016.
[29] Yichong Xu, Chenguang Zhu, Shuohang Wang, Siqi Sun, Hao Cheng, Xiaodong Liu, Jianfeng Gao, Pengcheng He, Michael Zeng, and Xuedong Huang. Human parity on CommonsenseQA: Augmenting self-attention with external attention. arXiv preprint arXiv:2112.03254, 2021.
[30] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-
Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. Lamda: Language models for
dialog applications. arXiv preprint arXiv:2201.08239, 2022.
[31] Stefan Palan and Christian Schitter. Prolific.ac—A subject pool for online experiments. Journal of Behavioral and Experimental Finance, 17:22–27, 2018.
[32] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-
consistency improves chain of thought reasoning in language models, 2022.
[33] Bernease Herman. The promise and peril of human evaluation for model interpretability. arXiv
preprint arXiv:1711.07414, 2017.
[34] Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable nlp systems: How should we
define and evaluate faithfulness? ACL 2020, 2020.
[35] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason
Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse
text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
[36] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Appendix
Throughout our experiments, we came across a variety of interesting failure cases for commonsense
reasoning. Note that all the final answers are correct – however, we take issue with the reasoning
used in order to arrive at those answers.
One key failure case was answers in the form of “the answer must be something that is <question
property>. <answer> is <question property>. Therefore, the correct answer is <answer>.” In these
cases, the model fails to explain why the answer that it has chosen satisfies the question property.
These rationales, while perhaps useful to the model, read to us as opaque and unexplanatory.
A related but stronger version of the previous failure case, while less common, is particularly
uninsightful. Sometimes the model will imply the answer that it has chosen in its question.
Q: The christian rock band was on a large stage, the singer’s foot got caught
on a wire and he had a small what?
Answer Choices:
(a) commit sin
(b) fail
(c) disobey
(d) stumble
(e) boo boo
A: The answer must be something that would cause the singer to stumble. The
singer’s foot got caught on a wire, so he would stumble. Therefore, the
answer is stumble (d).
A rare failure case is when the model finds it unnecessary to justify its answer. For example:
Q: Other than school or from text where can one gain knowledge?
Answer Choices:
(a) book
(b) class
(c) thinking
(d) head
(e) meeting
A: The answer must be a place where someone can gain knowledge other than
school or from text. The answer is meeting (e).
A.4 World State Assertions
Sometimes, the model will assume that it knows something about a subject or a person whose name
was used as a variable. This leads to somewhat comical examples of reasoning. Part of the reason
for this is that generally, there is an expectation that good rationales will leverage understanding of
more general classes of objects and appeal to the relationship between those general classes and the
particular instance. For example, the argument that “a person would typically feel exhilaration from
heights” is generally more compelling than the argument that “James would feel exhilaration from
heights.”
Q: A person’s hand hurts after hitting the bottle. What might happen to
the bottle?
Answer Choices:
(a) alcoholism
(b) breaking
(c) fly
(d) nausea
(e) ticket
A: The answer must be something that happens to the bottle. The bottle is
broken. Therefore, the answer is breaking (b).
Some errors in reasoning corresponded to the model making a statement which, while technically
true, is not useful in demonstrating the claim.
Q: John ran very fast, his feet touched the ground in what way?
Answer Choices:
(a) behind
(b) loose
(c) feverishly
(d) slowly
(e) lightly
A: The answer must be a way of running that is different from the way John
ran. Running is a fast, light-footed activity. Therefore, the answer is
lightly (e).
A.6 Hint Short-cutting
In the experiments where the model was shown some examples of “hints” during training, in order to
prevent it from losing the ability to perform rationalization over time, the model appeared to pick up
on the fact that the final answer would always correspond to the hinted answer. This led to answers
such as
Q: Billy bought coffee and waited for his wife to arrive from France.
Where might he have been?
Answer Choices:
(a) airport
(b) grocery store
(c) internet cafe
(d) supermarket
(e) train station
A: The answer must be a place where Billy could have been waiting for his
wife to arrive from France. The airport is a place where people can wait
for flights. Therefore, the answer is train station (e).
B Modified CQA Prompts
For reference, we include our modified prompts based closely on those in [6].
Q: What do people use to absorb extra ink from a fountain pen?
Answer Choices:
(a) shirt pocket
(b) calligrapher’s hand
(c) inkwell
(d) desk drawer
(e) blotter
A: The answer must be used to absorb extra ink. Blotters are designed to absorb
liquids. Therefore, the answer is blotter (e).
Q: The fox walked from the city into the forest, what was it looking for?
Answer Choices:
(a) pretty flowers
(b) hen house
(c) natural habitat
(d) storybook
(e) dense forest
A: The answer must be a reason for a fox to go into the forest. The forest is a fox's natural habitat. Therefore, the answer is natural habitat (c).
Q: Google Maps and other highway and street GPS services have replaced what?
Answer Choices:
(a) united states
(b) mexico
(c) countryside
(d) atlas
(e) oceans
A: The answer must be something that used to do what Google Maps and GPS services do
, which is give directions. Atlases were also used to give directions. Therefore,
the answer is atlas (d).
Q: Before getting a divorce, what did the wife feel who was doing all the work?
Answer Choices:
(a) harder
(b) anguish
(c) bitterness
(d) tears
(e) sadness
A: The answer should be a feeling which would cause someone who was doing all the
work to get divorced. If someone feels bitter towards their spouse, they are likely
to want a divorce. Therefore, the answer is bitterness (c).
C Human-evaluated Test Prompts
We also selected a random sampling of 50 questions which were correctly answered both few-shot and by a STaR-trained model (without rationalization), as discussed in Section 4.4. These were presented in a random order to twenty crowdworkers, who preferred the STaR-generated answers. We reproduce the examples here
with the few-shot rationale first, the STaR-trained rationale second, and the human rationale third,
though these were shuffled when presented to participants. We selected human answers from [3]’s
original split rationales where possible, finding that duplicate rationales were much more common in
the new split rationales. For example, the explanation “Rivers flow trough valleys,” appeared over
400 times verbatim in the new split dataset, and “This word was most relevant” appeared over 150
times. [3]’s dataset also includes explanations like “The only answer that makes sense” or “BOB
WILL NOT POKEMON CARDS WERE COMMON AND WORTHLESS BUT WRONG ABOUT
THEM SO FEEL REALLY RARE TO DELAY” or restatements of the answer. We append the phrase
“Therefore, the answer is ANSWERTEXT (ANSWERLETTER)” with ANSWERTEXT replaced
by the correct answer’s text and ANSWERLETTER replaced by the correct answer letter. This is
done 1) to make it less obvious that one of the answers is generated by a different source and 2) to
prioritize differences in rationales, not the answer format.
Before the questions and after the consent form, we presented the following directions:
Each question will have three explanations for one answer, which is a pre-determined correct ("best") answer.
Please rank the explanations for each question from best to worst (with 1 as the best and 3 as the worst) based on how well
they justify the correct answer.
Note there are some typos and grammatical errors in the examples, in the questions, answers, and the explanation. Please
treat the answer letter (a / b / c / d / e) at the end of the explanations as the correct answer letter, even if another
answer is valid (and perhaps more reasonable), and even if it does not match with the answer text preceding it.
Q: Who is likely yo have a caring heart?
Answer Choices:
(a) person
(b) species
(c) artichoke
(d) all mammals
(e) live animal
A: The answer must be a person. All mammals have a caring heart. Therefore, the answer is person (a).
A: The answer must be someone who is likely to have a caring heart. Humans are the most likely to have a caring heart. Therefore, the answer is person (a).
A: A person is the only human being capable of having a caring heart. Therefore, the answer is person (a).

Q: If a person wanted to become a good writer what should they do in college?
Answer Choices:
(a) word sentence
(b) own animal
(c) read the newspaper
(d) catch cold
(e) study literature
A: The answer must be something that would help a person become a good writer. The answer is study literature (e).
A: The answer must be something that would help a person become a good writer. Literature is a form of study. Therefore, the answer is study literature (e).
A: studying literature will give one skills of writing. Therefore, the answer is study literature (e).

Q: The human always would have fun making up questions for the A.I. overlords, he found the task quite what?
Answer Choices:
(a) do enjoy
(b) eat cake
(c) enjoy living
(d) get laid
(e) enjoyable
A: The answer must be something that the human would enjoy doing. The human enjoys making up questions for the A.I. overlords. Therefore, the answer is enjoyable (e).
A: The answer must be something that the human found enjoyable. The human found the task of making up questions for the A.I. overlords enjoyable. Therefore, the answer is enjoyable (e).
A: Having fun is enjoyable. Therefore, the answer is enjoyable (e).

Q: What happens when I make a pig of myself eating breakfast?
Answer Choices:
(a) full stomach
(b) gain energy
(c) dieting
(d) have energy
(e) feeling satisfied
A: The answer must be a feeling which would cause someone to feel full after eating breakfast. The answer is full stomach (a).
A: The answer must be a feeling which would cause someone to make a pig of themselves eating breakfast. If someone feels full, they are likely to stop eating. Therefore, the answer is full stomach (a).
A: A colloquial phrase for over indulging in food. Therefore, the answer is full stomach (a).

Q: In order to see a story on the big screen what must you do?
Answer Choices:
(a) go to movies
(b) visualize
(c) reading
(d) open book
(e) sketching a picture
A: The answer must be something that you do to see a story on the big screen. Movies are a way to see a story on the big screen. Therefore, the answer is movies (a).
A: The answer must be something that is required to see a story on the big screen. Movies are a form of storytelling. Therefore, the answer is go to movies (a).
A: The big screen is a colloquial term meaning a movie theater. Therefore, the answer is go to movies (a).

Q: What is the opposite of remembering?
Answer Choices:
(a) answering question
(b) stopping to think
(c) knowing
(d) closing eyes
(e) forgetting
A: The answer must be something that is the opposite of remembering. The answer is forgetting (e).
A: The answer must be a feeling which would cause someone to not remember something. If someone forgets something, they are likely to not remember it. Therefore, the answer is forgetting (e).
A: Health complications. Therefore, the answer is forgetting (e).

Q: If you aren't experiencing curiosity and want to ensure viewing enjoyment, what may you watch?
Answer Choices:
(a) examine thing
(b) stare into a mirror
(c) go somewhere
(d) hear news
(e) see favorite show
A: The answer must be something that will make you want to watch. If you are not experiencing curiosity, you may want to watch something that will make you want to watch. Therefore, the answer is watch (e).
A: The answer must be something that will not cause you to want to watch. If you are not experiencing curiosity, you are likely to want to watch something that is entertaining. Therefore, the answer is favorite show (e).
A: see favorite show can you do in front of a tv if you have curiosity about what is being displayed on the tv. Therefore, the answer is favorite show (e).

Q: The carotid is a main artery susceptible to damage where?
Answer Choices:
(a) neck
(b) face
(c) busy city
(d) torso
(e) human body
A: The answer must be a place where the carotid artery is susceptible to damage. The carotid artery is located in the neck. Therefore, the answer is neck (a).
A: The answer must be a place where the carotid is susceptible to damage. The carotid is susceptible to damage in the neck. Therefore, the answer is neck (a).
A: The carotid artery is located in the neck. Therefore, the answer is neck (a).

Q: The king needed to feel safe, where did he go?
Answer Choices:
(a) castle
(b) throne room
(c) deck of cards
(d) fort
(e) court
A: The answer must be a place where the king feels safe. The king's castle is a place where he feels safe. Therefore, the answer is castle (a).
A: The answer must be a place where the king can feel safe. Castles are places where kings feel safe. Therefore, the answer is castle (a).
A: In general, King live in castle. Therefore, the answer is castle (a).

Q: What is the result of instituting civil action?
Answer Choices:
(a) bad feelings
(b) confusion
(c) going to court
(d) hurt feelings
(e) sandwiches
A: The answer must be a result of instituting civil action. The result of instituting civil action is a lawsuit. Therefore, the answer is going to court (c).
A: The answer must be a result of instituting civil action. If someone goes to court, they are likely to feel hurt feelings. Therefore, the answer is going to court (c).
A: Anger and Trial will not institute a court case. Therefore, the answer is going to court (c).

Q: The person wanted to fix his car, what did he order?
Answer Choices:
(a) selling things
(b) gas
(c) feel resentful
(d) machine part
(e) cross street
A: The answer must be something that would help the person fix his car. The person would not want to sell things, because he would not want to give up his car. Therefore, the answer is machine part (d).
A: The answer must be something that would be used to fix a car. Machine parts are used to fix cars. Therefore, the answer is machine part (d).
A: A machine part is the only thing listed that would fix the car. Therefore, the answer is machine part (d).

Q: He always traveled up the mountain when winter arrived, he had a passion and desire to do what?
Answer Choices:
(a) play volleyball
(b) play hockey
(c) ski
(d) knit
(e) skate
A: The answer must be something that the person does when winter arrives. The answer is skiing (c).
A: The answer must be something that the person did when winter arrived. The answer is ski (c).
A: Skiing is the only option that is done on a mountain. Therefore, the answer is ski (c).

Q: Bob thought that his Pokemon cards were common and worthless, but he was wrong about them. They were really what?
Answer Choices:
(a) personal
(b) special
(c) shiny
(d) rare
(e) irregular
A: The answer must be something that is not common and worthless. Pokemon cards are not common and worthless. Therefore, the answer is rare (d).
A: The answer must be something that are rare. Pokemon cards are rare. Therefore, the answer is rare (d).
A: BOB WILL NOT POKEMON CARDS WERE COMMON AND WORTHLESS BUT WRONG ABOUT THEM SO FEEL REALLY RARE TO DELAY. Therefore, the answer is rare (d).

Q: If you're lighting a candle to see what are you in?
Answer Choices:
(a) dimly lit room
(b) synagogue
(c) sunlight
(d) birthday cake
(e) bedroom
A: The answer must be a place where you can see what you are in. The answer is a room with dim lighting. Therefore, the answer is dimly lit room (a).
A: The answer must be a place where you can see a candle. Candles are used to see things in the dark. Therefore, the answer is dimly lit room (a).
A: Lighting or illumination is the deliberate use of light to achieve a practical or aesthetic effect. Lighting includes the use of both artificial light sources like lamps ... Therefore, the answer is dimly lit room (a).

Q: The person was thirsty, what should he have done?
Answer Choices:
(a) eat
(b) sunny days
(c) compliments
(d) drink water
(e) all kinds of things
A: The answer must be something that would make the person feel better. If the person is thirsty, he should drink water. Therefore, the answer is drink water (d).
A: The answer must be something that would make someone feel better if they were thirsty. If someone is thirsty, they are likely to want to drink water. Therefore, the answer is drink water (d).
A: The person was thirsty, so drink water. Therefore, the answer is drink water (d).

Q: Where would you find many products for sale?
Answer Choices:
(a) market
(b) cost one dollar
(c) shelf
(d) salon
(e) demonstration
A: The answer must be a place where many products are sold. A market is a place where many products are sold. Therefore, the answer is market (a).
A: The answer must be a place with a lot of products for sale. Markets are the most common place to buy products. Therefore, the answer is market (a).
A: The only place products are for sale are in markets and stores. Therefore, the answer is market (a).

Q: If man and mammoth wanted to meet, they would probably have to do it during when?
Answer Choices:
(a) a baby shower
(b) ancient history
(c) storybook
(d) stone age
(e) forest
A: The answer must be a time when man and mammoth would meet. Stone age is the time when man and mammoth lived. Therefore, the answer is stone age (d).
A: The answer must be a time when man and mammoth would be able to meet. The stone age was a time when man and mammoth would be able to meet. Therefore, the answer is stone age (d).
A: During the stone age is the only time mammoths lived. Therefore, the answer is stone age (d).

Q: They prided themselves on being a wealth of knowledge, and that's why many chose to attend their what?
Answer Choices:
(a) book store
(b) university
(c) meeting
(d) class
(e) encyclopedia
A: The answer must be a place where people go to learn about things. The answer is university (b).
A: The answer must be a place where people go to learn about things. Universities are places where people go to learn about things. Therefore, the answer is university (b).
A: because in the university you get more knowledge. Therefore, the answer is university (b).

Q: what happens when visiting museum that have narrow halls?
Answer Choices:
(a) sore feet
(b) amazement
(c) retrospection
(d) boredom
(e) crowding
A: The answer must be something that happens when visiting a museum that has narrow halls. Crowding is a common problem in museums. Therefore, the answer is crowding (e).
A: The answer must be something that would cause someone to have sore feet if they are walking in narrow halls. If someone feels cramped, they are likely to want to go somewhere else. Therefore, the answer is crowding (e).
A: museum is one of the crowdest place. Therefore, the answer is crowding (e).

Q: On their hike they brought a filtering straw, they were worried about germs in the what?
Answer Choices:
(a) make sick
(b) doctor
(c) water
(d) stream
(e) mouth
A: The answer must be something that can filter out germs. Filtering straws are used to filter out germs. Therefore, the answer is filtering straw (c).
A: The answer must be something that would cause someone to bring a filtering straw on a hike. Filtering straws are used to filter water. Therefore, the answer is water (c).
A: straw it something used to drink water. Therefore, the answer is water (c).

Q: If I am not good at playing poker, I will begin what if I play?
Answer Choices:
(a) discord
(b) social event
(c) losing money
(d) competition
(e) heightened awareness
A: The answer must be something that will happen if you are not good at playing poker. If you are not good at playing poker, you will begin to lose money. Therefore, the answer is losing money (c).
A: The answer must be something that will help you if you are not good at playing poker. If you are losing money, you are likely to want to stop playing. Therefore, the answer is losing money (c).
A: Poker is a game in which people lose their money. Therefore, the answer is losing money (c).

Q: People are meeting people, what is the likely physical gesture?
Answer Choices:
(a) introducing yourself
(b) smiling
(c) shake hands
(d) share a hug
(e) talking
A: The answer must be a physical gesture that is used to greet someone. The answer is handshake (c).
A: The answer must be a gesture used to meet people. If someone wants to meet people, they are likely to want to shake hands. Therefore, the answer is shaking hands (c).
A: A handshake is a globally widespread. Therefore, the answer is shaking hands (c).

Q: What do dogs get that make them itch?
Answer Choices:
(a) bones
(b) fleas
(c) bad breath
(d) wagging tails
(e) four legs
A: The answer must be something that makes dogs itch. Dogs get fleas, which make them itch. Therefore, the answer is fleas (b).
A: The answer must be something that dogs get that make them itch. Fleas are a common cause of dog itch. Therefore, the answer is fleas (b).
A: fleas makes you itch. Therefore, the answer is fleas (b).

Q: What is ignorance often equated with?
Answer Choices:
(a) bliss
(b) hostility
(c) peace
(d) accidents
(e) find truth
A: The answer must be something that is often equated with ignorance. Bliss is often equated with ignorance. Therefore, the answer is bliss (a).
A: The answer must be a feeling which would cause someone to be ignorant. If someone feels bliss, they are likely to be ignorant. Therefore, the answer is bliss (a).
A: bliss comes from ignorance. Therefore, the answer is bliss (a).

Q: The two teams were competing all the time, what likely developed between them?
Answer Choices:
(a) accomplishment
(b) enemies
(c) rivalry
(d) friendship
(e) bad blood
A: The answer must be something that developed between the two teams. The teams were competing all the time. Therefore, the answer is rivalry (c).
A: The answer must be a feeling which would cause two teams to compete all the time. Rivalry is a common feeling which would cause two teams to compete all the time. Therefore, the answer is rivalry (c).
A: likely developed between them of rivalry. Therefore, the answer is rivalry (c).

Q: James doesn't want a relationship. He thinks that loving another will only create what?
Answer Choices:
(a) satisfaction
(b) jealousy
(c) happiness
(d) wonderful
(e) quiet
A: The answer must be something that would make James feel bad about not having a relationship. If someone feels jealous, they are likely to want a relationship. Therefore, the answer is jealousy (b).
A: The answer must be a feeling which would cause someone who doesn't want a relationship to not want one. If someone feels jealous of another, they are likely to not want a relationship. Therefore, the answer is jealousy (b).
A: Jealousy is a negative emotion. Therefore, the answer is jealousy (b).

Q: The had a reputation for celebrating in a debaucherous manner, what often happened as a result?
Answer Choices:
(a) drunkenness
(b) enjoyment
(c) have fun
(d) hang over
(e) fighting
A: The answer must be a result of a celebration. The answer is drunkenness (a).
A: The answer must be something that would cause someone to have a reputation for celebrating in a debaucherous manner. Alcohol is a depressant. Therefore, the answer is drunkenness (a).
A: it is normal not a good idea to get drunk. Therefore, Q: James was cooking s stew in his apartment kitchen. Where
the answer is drunkenness (a). might he look for a potato.
Answer Choices:
Q: The paper towels are usually in plain sight in the (a) restaurants
kitchen, where did the dad go next to find them? (b) casserole
Answer Choices: (c) root cellar
(a) pantry (d) pockets
(b) cabinet (e) pantry
(c) grocery store A: The answer must be a place where potatoes are stored.
(d) waste bin The answer is pantry (e).
(e) market A: The answer must be a place where potatoes are stored.
A: The answer must be a place where the dad would go to Pantries are the most likely place to store potatoes.
find the paper towels. The pantry is a place where food is Therefore, the answer is pantry (e).
stored. Therefore, the answer is pantry (a). A: apartments do not often have root cellars, and are not
A: The answer must be a place where paper towels are stored. restaurants. a pantry is closest to James inside his
Pantries are the most likely place to store paper towels. apartment. Therefore, the answer is pantry (e).
Therefore, the answer is pantry (a).
A: The pantry is in the kitchen and stores things. Q: If ships are in a storm and the sun comes out, what has
Therefore, the answer is pantry (a). happened?
Answer Choices:
Q: My house is very dry, with little water in the air. What (a) sinks
tool should I use to fix this? (b) cross river
Answer Choices: (c) collide
(a) humidifier (d) bottom out
(b) surface of earth (e) carry people
(c) skating rink A: The answer must be a thing that happens when ships are
(d) ice cube in a storm and the sun comes out. Ships are in a storm when
(e) burn it to the ground they are in the middle of the ocean. Therefore, the answer
A: The answer must be a tool that will make the air in your is bottom out (d).
house moist. A humidifier is a tool that will make the air A: The answer must be something that happens when the sun
in your house moist. Therefore, the answer is humidifier ( comes out. The sun is a source of energy. Therefore, the
a). answer is bottom out (d).
A: The answer must be something that would make the air in A: the sun comes out so its not a storm any more. Therefore,
my house more humid. A humidifier is used to make the air the answer is bottom out (d).
in a room more humid. Therefore, the answer is humidifier (
a). Q: Where do all marmot live before they are captured?
A: This word was most relevant. Therefore, the answer is Answer Choices:
humidifier (a). (a) switzerland
(b) spain
Q: A motorcycle is quicker because it isn’t as what as a car (c) vancouver island
? (d) wilderness
Answer Choices: (e) australia
(a) heavy A: The answer must be a place where marmots live. Marmots
(b) shiny live in the wilderness. Therefore, the answer is wilderness
(c) red (d).
(d) obesity A: The answer must be a place where marmots are found. The
(e) expensive wilderness is the most likely place to find marmots.
A: The answer must be something that is not as heavy as a Therefore, the answer is wilderness (d).
car. A motorcycle is lighter than a car. Therefore, the A: Marmots don’t live in Australia or Switzerland.
answer is not heavy (a). Therefore, the answer is wilderness (d).
A: The answer must be something that is not as quick as a
motorcycle. Motorcycles are not as heavy as cars. Therefore, Q: Where does one wait on a bench for something on tracks?
the answer is heavy (a). Answer Choices:
A: Because it is much smaller. Therefore, the answer is (a) train station
heavy (a). (b) state park
(c) crosswalk
Q: The baseball team was playing ball. It was time for the (d) bus stop
pitcher to act. How did the pitcher get the ball to the (e) bus depot
batter? A: The answer must be a place where one waits for a train,
Answer Choices: bus, or other vehicle. The answer is train station (a).
(a) lose A: The answer must be a place where one can wait for a
(b) injury train. Train stations are places where trains stop.
(c) competition Therefore, the answer is train station (a).
(d) throwing A: Only trains are on tracks. Therefore, the answer is
(e) rolled train station (a).
A: The answer must be a way to get the ball to the batter.
The pitcher is the one who throws the ball. Therefore, the Q: What can planning vacation lead to between a couple when
answer is throwing (d). they disagree?
A: The answer must be a way for the pitcher to get the ball Answer Choices:
to the batter. The pitcher is the person who is (a) relaxation
responsible for getting the ball to the batter. Therefore, (b) enjoying
the answer is throwing (d). (c) arguments
A: The pitcher threw the ball to the batter so that he (d) going abroad
could hit it. Therefore, the answer is throwing (d). (e) spending money
A: The answer must be something that can happen when a
Q: In what sort of container should you keep your ficus? couple disagrees. Arguments are a common result of
Answer Choices: disagreements. Therefore, the answer is arguments (c).
(a) refigerator A: The answer must be a result of planning vacation.
(b) screened porch Arguments are a common result of planning vacation.
(c) clay pot Therefore, the answer is arguments (c).
(d) shady places A: Arguments are the direct result of disagreements.
(e) arboretum Therefore, the answer is arguments (c).
A: The answer must be a container that keeps the ficus in a
shady place. Clay pots are used to keep plants in shady Q: A sloth climbs slow, but it can eventually reach all the
places. Therefore, the answer is clay pot (c). way up where?
A: The answer must be a container in which ficus can be Answer Choices:
kept. Ficus are tropical plants. Therefore, the answer is (a) forest canopy
clay pot (c). (b) universe
A: Clay pot is the only type of container listed. Therefore, (c) tropical rainforest
the answer is clay pot (c). (d) large trees
(e) nature preserve
21
A: The answer must be a place where a sloth can climb. The A: The answer must be something that would protect legs
forest canopy is the highest point in the forest. Therefore, from getting dirty on the farm. Jeans are used to protect
the answer is forest canopy (a). legs from getting dirty. Therefore, the answer is wear
A: The answer must be a place where a sloth can climb. The jeans (a).
canopy of a forest is the highest point in a forest. A: Jeans cover over legs. Therefore, the answer is wear
Therefore, the answer is canopy of a forest (a). jeans (a).
A: The sloth is a mammal that lives in the trees of South
America. They have a slow and peculiar way of moving about. Q: What types of stoves are there?
In fact, they are the slowest land mammals and travel very Answer Choices:
sluggishly on the ground (0.025 m/s-0.050 m/s). Therefore, (a) these twos are irrelevent to the questions
the answer is canopy of a forest (a). (b) warm pot of beans
(c) brown meat
Q: Where in America is known for theaters? (d) warm room
Answer Choices: (e) gas or electric
(a) new york city A: The answer must be a type of stove. The stove is used to
(b) shopping mall heat food. Therefore, the answer is gas or electric (e).
(c) downtown area A: The answer must be a type of stove. Stoves are used to
(d) skating rink warm rooms. Therefore, the answer is electric (e).
(e) military base A: stoves are gas or electric. Therefore, the answer is gas
or electric (e).
A: The answer must be a place where people go to see movies.
Theaters are places where people go to see movies. Q: In what geological feature will you find fungus growing?
Therefore, the answer is theater (a). Answer Choices:
A: The answer must be a place with a lot of theaters. New (a) shower stall
York City is known for its many theaters. Therefore, the (b) toenails
answer is New York City (a). (c) basement
A: New York City is the main region in US known for (d) forrest
theaters. Therefore, the answer is New York City (a). (e) cave
A: The answer must be a place where fungus grows. The
Q: What does hearing someone play violin beautifully make answer is cave (e).
you? A: The answer must be a place where fungus grows. Fungus
Answer Choices: grows in caves. Therefore, the answer is cave (e).
(a) buy earplugs A: Rivers flow trough valleys. Therefore, the answer is
(b) inspiring cave (e).
(c) guitar
(d) make music Q: What is our sun considered to be the center of?
(e) like music Answer Choices:
A: The answer must be something that makes you feel (a) universe
inspired. Hearing someone play violin beautifully makes you (b) solar system
feel inspired. Therefore, the answer is inspiring (b). (c) outer space
A: The answer must be a feeling which would cause someone (d) star
to want to hear someone play violin beautifully. Music is a (e) beans
form of art. Therefore, the answer is inspiring (b). A: The answer must be something that is considered to be
A: Hearing someone play a violin beautifully makes me the center of the universe. The sun is the center of the
inspired. Therefore, the answer is inspiring (b). solar system. Therefore, the answer is solar system (b).
A: The answer must be a place where the sun is the center
Q: What emotion does committing murder induce? of. The sun is the center of the solar system. Therefore,
Answer Choices: the answer is solar system (b).
(a) fear A: It is the best option because it is the closest.
(b) go to jail Therefore, the answer is solar system (b).
(c) problems
(d) cruel Q: What are people in a library likely doing?
(e) dead Answer Choices:
A: The answer must be a feeling that would cause someone to (a) talk to each other
commit murder. Murder is a crime. Therefore, the answer is (b) board ships
fear (a). (c) study books
A: The answer must be a feeling which would cause someone (d) suffer hunger
to commit murder. If someone feels fear, they are likely to (e) playing games
want to avoid committing murder. Therefore, the answer is A: The answer must be something that people in a library
fear (a). are likely to be doing. People in a library are likely to
A: The only feeling. Therefore, the answer is fear (a). be studying books. Therefore, the answer is study books (c).
Q: Sitting to close while watching TV can cause what sort of A: The answer must be something that is likely to happen in
pain? a library. Libraries are places where people study books.
Answer Choices: Therefore, the answer is studying books (c).
(a) brain problems A: Books shelved as study-skills: How to Become a Straight-
(b) laziness A Student by Cal Newport, The Study Skills Handbook by
(c) get fat Stella Cottrell, Make It Stick: The Scienc... Therefore,
(d) headache the answer is study books (c).
(e) laughter
A: The answer must be a pain that comes from sitting too Q: A shark isn’t something you have to worry about at the
close to the TV. Sitting too close to the TV can cause lake in the woods, but there are plenty of dangers in the
headaches. Therefore, the answer is headache (d). what?
A: The answer must be a type of pain which can occur if you Answer Choices:
are sitting too close to the TV. Headaches are a common (a) great oceans
side effect of sitting too close to the TV. Therefore, the (b) great outdoors
answer is headache (d). (c) tropical waters
A: the process of aging in humans. Therefore, the answer is (d) pacific ocean
headache (d). (e) gulf of mexico
A: The answer must be something that is dangerous in the
Q: What might people do to protect their legs from getting great outdoors. The great outdoors is a place where there
dirty on the farm? are many dangers. Therefore, the answer is great outdoors (
Answer Choices: b).
(a) wear jeans A: The answer must be a place where sharks are a danger.
(b) milk cow Lakes in the woods are not a place where sharks are a
(c) believe in god danger. Therefore, the answer is great outdoors (b).
(d) avoid mud A: The great outdoors is the only choice where you would
(e) plant flowers find a lake in the woods. Therefore, the answer is great
A: The answer must be something that people do to protect outdoors (b).
their legs from getting dirty. Of the above choices,
wearing jeans is the most logical. Therefore, the answer is Q: You can share files with someone if you have a connection
jeans (a). to a what?
22
Answer Choices: Q: What can an injury happen to?
(a) freeway Answer Choices:
(b) radio (a) emergency room
(c) wires (b) doctor’s office
(d) computer network (c) living thing
(e) electrical circuit (d) bones
A: The answer must be a way to share files. The answer is (e) accident
computer network (d). A: The answer must be something that can happen to a living
A: The answer must be a connection which allows you to thing. The answer is living thing (c).
share files. A computer network is a collection of A: The answer must be something that can happen to a living
computers which are connected to each other. Therefore, the thing. If someone is injured, they are likely to want to
answer is computer network (d). go to the emergency room. Therefore, the answer is living
A: Files can be transmitted over a computer network, and thing (c).
not a freeway. "wires" doesnt make any sense. Therefore, A: Living things feel pain, and therefore can be injured.
the answer is computer network (d). Therefore, the answer is living thing (c).
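For concreteness, each entry above follows the same plain-text layout: the question, an "Answer Choices:" header with options (a)-(e), and one or more "A:" rationales ending in "Therefore, the answer is <choice> (<letter>)." The following minimal sketch shows how one such entry could be assembled; the dataclass fields and the function name are illustrative rather than the exact ones used in our implementation.

# Illustrative only: assembles one CQA example in the layout shown above.
from dataclasses import dataclass
from typing import List

@dataclass
class CQAExample:
    question: str
    choices: List[str]     # five answer strings, in order (a)-(e)
    rationale: str         # free-text reasoning
    answer_index: int      # 0-4, index of the correct choice

def format_cqa_example(ex: CQAExample) -> str:
    letters = "abcde"
    lines = [f"Q: {ex.question}", "Answer Choices:"]
    lines += [f"({letters[i]}) {c}" for i, c in enumerate(ex.choices)]
    lines.append(
        f"A: {ex.rationale} Therefore, the answer is "
        f"{ex.choices[ex.answer_index]} ({letters[ex.answer_index]})."
    )
    return "\n".join(lines)

Applying format_cqa_example to each example and concatenating the results (separated by blank lines) reproduces the prompt layout used above.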
D Example Rationalizations on CQA
We include a randomly sampled set of rationalizations which the model is able to produce before
fine-tuning. We observe that the model sometimes constructs an argument roughly of the form "the
answer must have a set of properties; the correct answer has those properties; therefore, the answer is
the correct answer." This structure of argument is fairly standard, but given that the model originally
answered these questions incorrectly, it resembles template-matching more than reasoning. The
technique of rephrasing the question in a simpler way before answering it directly is also often
effective. In general, the rationales produced by the pretrained model’s rationalization appear to be of
similar quality to those produced by its rationale generation. We reference the pretrained model here
rather than the STaR-trained model because the final iteration’s rationales are the ones the model
struggled to learn from: if those rationales were good, we would expect the model to have already
incorporated them into its rationale generation.
Q: Why do I eat just enough when eating breakfast?
Answer Choices:
(a) gas
(b) full stomach
(c) feeling satisfied (CORRECT)
(d) have energy
(e) bloated
A: The answer must be a reason for eating just enough when eating breakfast. Eating
just enough is a way to feel satisfied. Therefore, the answer is feeling satisfied (
c).
Q: If a person is trying to weasel out of punishment for a crime, where are they
likely to be?
Answer Choices:
(a) court room (CORRECT)
(b) cherry tree
(c) chicken coop
(d) natural history museum
(e) jail
A: The answer must be a place where a person would be trying to get out of
punishment for a crime. The answer is court room (a).
Q: The person didn’t like the varying size of the cake slices handed out, she demand
more what?
Answer Choices:
(a) fairness (CORRECT)
(b) candies
(c) compliments
(d) well fed
(e) arguements
A: The answer must be a reason for the person to demand more cake. The person is not
happy with the size of the cake slices. Therefore, the answer is fairness (a).
Q: In America, the courthouse of any given town has a likely location, where is it?
Answer Choices:
(a) electrical circuit
(b) getting married
(c) capital city
(d) center of town (CORRECT)
(e) michigan
A: The answer must be a location where a courthouse is located. The capital city is
the location of the capital of a state. Therefore, the answer is center of town (d).
Q: Where might I find a bowl nestled among other bowls in the kitchen?
Answer Choices:
(a) refrigerator
(b) kitchen top
(c) cupboard (CORRECT)
(d) dishwasher
(e) apartment
A: The answer must be a place where a bowl is found. The kitchen is a place where
bowls are found. Therefore, the answer is cupboard (c).
Q: Which state has blue ridge mountains that are called Blue Mountain?
Answer Choices:
(a) appalachians
(b) virginia
(c) pennsylvania (CORRECT)
(d) north carolina
(e) montana
A: The answer must be a state with mountains. The answer is Pennsylvania (c).
E STaR Without Rationalization (Rationale Generation Bootstrapping)
For convenience, we include the rationale generation bootstrapping algorithm alone here; it is
simply Algorithm 1 with the blue components removed. A sketch of the loop follows.
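The sketch below is written in Python-style pseudocode rather than the Algorithm environment used in the main text; generate_rationale and finetune are placeholder names standing in for few-shot sampling and fine-tuning, not literal APIs from our implementation.

# Sketch of rationale generation bootstrapping (STaR without rationalization).
def star_without_rationalization(pretrained_model, dataset, prompt, n_iterations):
    model = pretrained_model
    for _ in range(n_iterations):
        correct = []
        for question, answer in dataset:
            # Few-shot prompt the current model to produce a rationale and an answer.
            rationale, predicted = generate_rationale(model, prompt, question)
            if predicted == answer:
                # Keep only rationales that led to the correct answer.
                correct.append((question, rationale, answer))
        # Fine-tune from the original pretrained model on the collected rationales
        # (rather than from the previous iteration's model), then repeat.
        model = finetune(pretrained_model, correct)
    return model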
F Case Study
(a) Problems where the model was able to solve questions which it had previously answered incorrectly.
(b) Problems where the model was already able to answer them with few-shot reasoning, but was able to answer better after more iterations.
Figure 7: Case study on STaR. We highlight some problems which the model is not initially able
to solve, as well as rationales which are initially correct but improve. We note that for CQA, one is
tasked with selecting the “best” answer, where multiple answers may be technically correct.
G Bias and Faithfulness
It is important to note that STaR is designed to amplify the reasoning that leads to correct solutions
on a given dataset. The implication of this with a dataset like CommonsenseQA is that if biases are
“useful” in solving the dataset then they will be amplified. This is made worse by rationalization, as
biased answers that the model may not naturally arrive at are, in a sense, pulled out of the model.
The exact interaction between the bias in a dataset and the pre-existing bias learned by the model
is unclear, and is something to consider before the real-world deployment of large language models in
general, and of STaR-trained models in particular.
However, we find some encouraging initial results: for questions where gender is not relevant,
the model appears to disregard it in its explanation. For example: “Q: Where is a good place for
a woman to store her sunglasses? → A: The answer must be a place where sunglasses are stored.
Sunglasses are stored in purses. Therefore, the answer is purse (e).” or “Q: The only baggage the
woman checked was a drawstring bag, where was she heading with it? → A: The answer must be
a place where a drawstring bag is checked. The answer is airport (e).” We believe this question
warrants a much more comprehensive study.
One important challenge with models which seek to be interpretable or provide explanations for
their reasoning is that of faithfulness. While our primary emphasis is not on the explainability
benefits that STaR may bring, there is a fundamental challenge around evaluating explanations
and rationales: namely, faithfulness [33, 34]. [34] describe faithful explanations as those which
“accurately [represent] the reasoning process behind the model’s prediction.” While STaR encourages
the use of reasoning in rationales which leads the model to correct answers, it is difficult, if not
impossible, to ensure that the rationales reflect the model’s internal processing. For example, it is
straightforward to imagine the model implicitly selecting a particular answer immediately and then
generating a rationale to justify that selected answer. This would allow a model to generate unbiased
rationales while selecting answers in a biased way.
The fact that our model outperforms one fine-tuned to directly predict the answers, together with
ablation studies from papers such as [6], makes it clear that generating a rationale before producing an
answer non-trivially improves the model’s answer quality. However, it is difficult to evaluate the
degree to which any particular answer’s rationale is faithful. We note that these problems are not
unique to STaR; they are symptomatic of the broader difficulty of understanding large language models
and, in particular, the rationales that they generate.
H Hyperparameters
GPT-J is a 28-layer decoder-only transformer with an embedding size of 4096, 16 attention heads of
dimension 256, and an FFN hidden layer of size 16384. It was pre-trained on the Pile [35] with a
vocabulary size of 50.4K.
In general, unless otherwise stated, we use a batch size of 8 sequences, each of length 1024. We also
use packing: shorter examples are packed together into longer sequences (up to length 1024) to
improve TPU utilization. We do not use weight decay, and we train and sample on a single TPU-v3
node. We performed a hyperparameter search over learning rates from 10^-7 to 10^-4 using the Adam
optimizer [36], and found that 10^-6 was consistently the best-performing learning rate.
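As an illustration of the packing step (a sketch only, not our training code), short tokenized examples can be concatenated greedily into sequences of at most 1024 tokens:

# Illustrative sketch of greedy example packing to improve TPU utilization.
# Each example is a list of token ids; packed sequences are capped at 1024 tokens.
from typing import List

def pack_examples(examples: List[List[int]], max_len: int = 1024) -> List[List[int]]:
    sequences, current = [], []
    for tokens in examples:
        tokens = tokens[:max_len]  # truncate anything longer than one sequence
        if len(current) + len(tokens) > max_len:
            sequences.append(current)
            current = []
        current.extend(tokens)
    if current:
        sequences.append(current)
    return sequences

A full implementation would additionally track segment boundaries so that the attention mask prevents packed examples from attending to one another; we omit that detail here.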
I GSM8K Few-shot Prompt
We include the following few-shot prompt for GSM8K, based on the examples in [9].
Q: Natalia sold clips to 48 of her friends in April, and then she sold half as many
clips in May. How many clips did Natalia sell altogether in April and May?
A: Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72
Q: Betty is saving money for a new wallet which costs $100. Betty has only half of
the money she needs. Her parents decided to give her $15 for that purpose, and her
grandparents twice as much as her parents. How much more money does Betty need to
buy the wallet?
A: In the beginning, Betty has only 100 / 2 = $<<100/2=50>>50.
Betty’s grandparents gave her 15 * 2 = $<<15*2=30>>30.
This means, Betty needs 100 - 50 - 30 - 15 = $<<100-50-30-15=5>>5 more.
#### 5
Q: Julie is reading a 120-page book. Yesterday, she was able to read 12 pages and
today, she read twice as many pages as yesterday. If she wants to read half of the
remaining pages tomorrow, how many pages should she read?
A: Maila read 12 x 2 = <<12*2=24>>24 pages today.
So she was able to read a total of 12 + 24 = <<12+24=36>>36 pages since yesterday.
There are 120 - 36 = <<120-36=84>>84 pages left to be read.
Since she wants to read half of the remaining pages tomorrow, then she should read
84/2 = <<84/2=42>>42 pages.
#### 42
Q: Mark has a garden with flowers. He planted plants of three different colors in it
. Ten of them are yellow, and there are 80% more of those in purple. There are only
25% as many green flowers as there are yellow and purple flowers. How many flowers
does Mark have in his garden?
A: There are 80/100 * 10 = <<80/100*10=8>>8 more purple flowers than yellow flowers.
So in Mark’s garden, there are 10 + 8 = <<10+8=18>>18 purple flowers.
Purple and yellow flowers sum up to 10 + 18 = <<10+18=28>>28 flowers.
That means in Mark’s garden there are 25/100 * 28 = <<25/100*28=7>>7 green flowers.
So in total Mark has 28 + 7 = <<28+7=35>>35 plants in his garden.
#### 35
Q: Alexis is applying for a new job and bought a new set of business clothes to wear
to the interview. She went to a department store with a budget of $200 and spent
$30 on a button-up shirt, $46 on suit pants, $38 on a suit coat, $11 on socks, and
$18 on a belt. She also purchased a pair of shoes, but lost the receipt for them.
She has $16 left from her budget. How much did Alexis pay for the shoes?
A: Let S be the amount Alexis paid for the shoes.
She spent S + 30 + 46 + 38 + 11 + 18 = S + <<+30+46+38+11+18=143>>143.
She used all but $16 of her budget, so S + 143 = 200 - 16 = 184.
Thus, Alexis paid S = 184 - 143 = $<<184-143=41>>41 for the shoes.
#### 41
Q: Tina makes $18.00 an hour. If she works more than 8 hours per shift, she is
eligible for overtime, which is paid by your hourly wage + 1/2 your hourly wage. If
she works 10 hours every day for 5 days, how much money does she make?
A: She works 8 hours a day for $18 per hour so she makes 8*18 = $
<<8*18=144.00>>144.00 per 8-hour shift
She works 10 hours a day and anything over 8 hours is eligible for overtime, so she
gets 10-8 = <<10-8=2>>2 hours of overtime
Overtime is calculated as time and a half so and she makes $18/hour so her overtime
pay is 18*.5 = $<<18*.5=9.00>>9.00
Her overtime pay is 18+9 = $<<18+9=27.00>>27.00
Her base pay is $144.00 per 8-hour shift and she works 5 days and makes 5 * $144 = $
<<144*5=720.00>>720.00
Her overtime pay is $27.00 per hour and she works 2 hours of overtime per day and
makes 27*2 = $<<27*2=54.00>>54.00 in overtime pay
2 hours of overtime pay for 5 days means she makes 54*5 = $270.00
In 5 days her base pay is $720.00 and she makes $270.00 in overtime pay so she makes
$720 + $270 = $<<720+270=990.00>>990.00
#### 990
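Each exemplar ends with the final numeric answer after a "####" marker, with intermediate calculator annotations in << >>. The following minimal sketch shows how a sampled completion in this format could be scored; the regular expression and function names are illustrative, not our exact evaluation code.

# Illustrative sketch: extract the final answer after "####" and compare to the label.
import re

def extract_final_answer(completion: str):
    match = re.search(r"####\s*(-?[\d,\.]+)", completion)
    if match is None:
        return None
    return match.group(1).replace(",", "").rstrip(".")

def is_correct(completion: str, gold: str) -> bool:
    predicted = extract_final_answer(completion)
    return predicted is not None and predicted == gold.strip()

For example, extract_final_answer applied to the last exemplar above returns "990".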
J STaR GSM8K Solutions
We observe some interesting patterns with the GSM8K solutions proposed by the STaR-trained
model. Typically, when the solution takes substantially fewer calculation steps than the ground truth,
it corresponds to an instance where the model accidentally answered the question correctly despite
mistakes in its reasoning. In some cases, however, the model produces simpler solutions than those in
the ground truth. One example is shown in Figure 8.
Q: A van is delivering 180 bottles of drinks to a neighborhood, and each bottle contains
either cider or beer or a mixture of two. Out of the 180 bottles, 40 contain only cider, 80
contain only beer, and the rest are a mixture of the two drinks. If the delivery man gives
half the number of each bottle of drink to the first house, how many bottles does the first
house get?
Ground-truth solution:
The total number of bottles containing cider or beer is 40+80 = <<40+80=120>>120
If the number of bottles in the van was 180, then 180-120 = <<180-120=60>>60 contains a mixture of the two drinks.
At the first house, the delivery man dropped 1/2*40 = <<1/2*40=20>>20 bottles.
There were also 1/2*80 = <<1/2*80=40>>40 bottles of beer that the delivery man gave to the first house.
Together, the first house received 40+20 = <<40+20=60>>60 bottles of cider and beer
The total number of bottles that had a mixture of the two drinks that were dropped at the first house is 1/2*60 = <<1/2*60=30>>30
The first house got 60+30 = <<60+30=90>>90 bottles of all types of drinks.
#### 90

STaR-generated solution:
The delivery man gives half the number of each bottle of drink to the first house.
So he gives 180/2 = <<180/2=90>>90 bottles of drink to the first house.
#### 90
Figure 8: An example problem in the training set where STaR derives a significantly simpler solution
than the ground truth.
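As a quick arithmetic check, the two derivations in Figure 8 agree because halving each type of bottle separately is the same as halving the total:
40/2 + 80/2 + 60/2 = (40 + 80 + 60)/2 = 180/2 = 90 bottles.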