Jia Li ♂
[email protected]
Peking University
Beijing, China

Ge Li
[email protected]
Peking University
Beijing, China
[3], including sequence, branch, and loop structures. Intuitively, the intermediate reasoning steps leading to the structured code should also be structured. Consider a human developer's thought process when solving a requirement (e.g., find the maximum number in a list). It is typical to come up with a solving process built from program structures: "Initialize a result with -inf; for each number in the list: if the number is greater than result: Update result with the number ...". Our idea is to enable LLMs to generate similarly structured CoTs - a coherent series of intermediate reasoning steps constructed by program structures. Besides, LLMs' training data contains lots of code, so they have the ability to generate program structures. However, the standard CoT ignores program structures and has low accuracy in code generation. Thus, it is necessary to design a structured CoT to unlock the reasoning ability of LLMs in code generation.

Figure 1 (b) shows a SCoT. The design of our SCoT has two inspirations. First, existing work [3] proved that any simple or complex program can be composed of three basic structures, i.e., the sequence structure, the branch structure, and the loop structure. Thus, we introduce these three basic structures and constrain LLMs to use them to generate CoTs. As shown in Figure 1 (b), the SCoT uses a loop structure to clearly describe the iteration in line 2, while in the CoT the scopes of the two iterations in lines 2 and 4 are ambiguous. This shows the superiority of the SCoT in code generation. Second, every program contains a required input-output structure, which includes the input-output parameters and their types (e.g., Input: array: list[list]; Output: result in Figure 1 (b)). By generating the input-output structure, LLMs are asked to analyze the requirement and determine the entry and exit of the code, which benefits the following solving process.

Based on the SCoT, we present a new prompting technique named SCoT prompting. It asks LLMs first to generate a SCoT using program structures and then to implement the code. Compared to CoT prompting, SCoT prompting explicitly introduces program structures into the intermediate reasoning steps and constrains LLMs to think about how to solve requirements from the perspective of programming languages. It further unlocks the reasoning ability of LLMs in code generation, thus achieving higher accuracy.

We apply SCoT prompting to two popular LLMs (i.e., ChatGPT [18] and Codex [7]) and evaluate it on three representative benchmarks (i.e., HumanEval [7], MBPP [2], and MBCPP [1]). We use unit tests to measure the correctness of generated programs and report the Pass@k (k ∈ [1, 3, 5]) [7]. Based on the experimental results, we obtain four findings. (1) SCoT prompting significantly improves the accuracy of LLMs on code generation. Compared to the SOTA baseline - Chain-of-Thought prompting, in terms of Pass@1, SCoT prompting outperforms it by up to 13.79% in HumanEval, 12.31% in MBPP, and 6.63% in MBCPP. (2) Human evaluation shows that human developers prefer programs generated by SCoT prompting. (3) SCoT prompting is effective for different LLMs and different programming languages. In terms of Pass@1, it improves ChatGPT by up to 13.79% and Codex by up to 13.77%. Besides, SCoT prompting is language-agnostic and effective in multiple languages (e.g., Python and C++). (4) We explore the robustness of SCoT prompting to examples. The results show that SCoT prompting does not depend on specific examples or writing styles.

We summarize our contributions in this paper as follows.
• We propose a Structured Chain-of-Thought (SCoT), which utilizes program structures to build the intermediate reasoning steps.
• We propose a novel prompting technique for code generation, named SCoT prompting. It prompts large language models first to generate a SCoT and then implement the code.
• We conduct extensive experiments on three benchmarks. Qualitative and quantitative experiments show that SCoT prompting significantly outperforms SOTA baselines (e.g., Chain-of-Thought prompting).
• We discuss the contributions of different program structures and the robustness of SCoT prompting.

Data Availability. We open source our replication package, including the datasets and the source code of SCoT prompting, to facilitate other researchers and practitioners in repeating our work and verifying their studies.

2 METHODOLOGY
In this section, we propose a Structured Chain-of-Thought (SCoT). A SCoT denotes several intermediate reasoning steps constructed with program structures. Then, we present a novel prompting technique for code generation named SCoT prompting. SCoT prompting asks LLMs first to generate a SCoT and then output the final code. In the following subsections, we first describe the design of our SCoT and then show the details of SCoT prompting.

2.1 Structured Chain-of-Thought
A standard Chain-of-Thought (CoT) consists of several intermediate natural language reasoning steps that lead to the final answer [35]. The CoT was initially designed for natural language generation (e.g., commonsense reasoning [26]). Thus, the CoT only uses natural language to sequentially describe how to solve a problem step by step. Figure 1 (a) shows a CoT on code generation. A limitation is that the CoT brings only slight improvements in code generation. For example, adding the CoT improves ChatGPT by only 0.82 points in Pass@1 on a real-world benchmark - HumanEval [7].

In this paper, we propose a Structured CoT. Our motivation is that, unlike natural language generation, the goal of code generation is highly structured code. Source code solves a problem through special structures, including sequence structures, branch structures, and loop structures. For example, given a requirement - reading text from a given file, imagine a human developer's thought process. The developer will use program structures to design an initial idea: "if the given file exists: read text from the file; else: raise an error;". The program structures clearly show the solving process and benefit the following code implementation. Thus, the intermediate reasoning steps leading to the code should also be structured.

Figure 2 shows some examples of SCoTs. Compared to the CoT, our SCoT explicitly introduces program structures. Existing work [3] proved that any simple or complex program can be composed of three basic structures, i.e., the sequence structure, the branch structure, and the loop structure. Thus, we introduce these three basic structures, whose details are as follows (a small illustrative SCoT built from these structures is sketched after the list).
• Sequence Structure. The intermediate steps are sequentially placed and all steps are at the same level.
• Branch Structure. It starts with a condition and places different intermediate steps for different results of the condition.
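As a small illustration (our own sketch, not one of the released prompt examples), a SCoT for the earlier requirement "find the maximum number in a list" combines the input-output structure with a loop structure and a branch structure:

Input: numbers: list
Output: result: int or float
1: Initialize a result with -inf
2: for each number in numbers:
3:     if the number is greater than the result:
4:         Update the result with the number
5: return the result

From such a SCoT, an LLM can then derive a straightforward implementation; the function name below is ours and is used only for illustration:

def find_maximum(numbers: list):
    # Input: numbers, a list of numbers
    # Output: result, the maximum number in the list
    result = float('-inf')
    for number in numbers:
        if number > result:
            result = number
    return result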
The prompt ends with a new requirement and its SCoT, and is input into LLMs. By learning from the examples, LLMs generate a new program based on the requirement and the SCoT.

Related work [25] has found that generative models may be negatively affected by error accumulation. Similarly, in SCoT prompting, the generated SCoT may contain noise (e.g., errors or missing steps). This noise will further negatively affect the code implementation. In this paper, we utilize two approaches to alleviate error accumulation. First, as shown in Figure 4, we ask LLMs to double-check the SCoT and fix possible noise. It allows LLMs to adaptively refer to the SCoT and filter out noise. Second, SCoT prompting utilizes a two-step generation pipeline. It provides a window of opportunity to debug where the SCoT goes wrong. In practice, human developers can first check the generated SCoT and fix possible errors. Then, the SCoT is used to generate code.

2.3 Implementation Details
SCoT prompting is a prompting technique for code generation, which does not rely on specific LLMs. In this paper, we consider ChatGPT as the default LLM. We select a few (e.g., three) <requirement, code> pairs from real-world benchmarks (i.e., training data) as example seeds. Then, we manually write the SCoT for each seed and obtain examples - <requirement, SCoT, code> triples - which are used to make the prompts in Figures 3 and 4. A prompt contains three examples by default. The examples and prompt templates are available in our replication package. In the future, users can flexibly apply our approach to more powerful LLMs in a plug-and-play fashion.

3 STUDY DESIGN
To assess SCoT prompting, we conduct a large-scale study to answer four research questions. In this section, we present the details of our study, including datasets, evaluation metrics, comparison baselines, and implementation details.

3.1 Research Questions
Our study aims to answer the following research questions (RQs).
RQ1: How does SCoT prompting perform in terms of accuracy compared to baselines? This RQ aims to verify that SCoT prompting has a higher accuracy than existing prompting techniques on code generation. We apply three existing prompting techniques and SCoT prompting to two LLMs. Then, we use unit tests to measure the correctness of programs generated by different approaches and report the Pass@k.
RQ2: Do developers prefer programs generated by SCoT prompting? The ultimate goal of code generation is to assist human developers in writing code. In this RQ, we hire 10 developers (including industry employees and academic researchers) to manually review the programs generated by SCoT prompting and baselines. We measure the quality of programs in three aspects, including correctness, code smell, and maintainability.
RQ3: Is SCoT prompting robust to examples? Prompting techniques may be sensitive to examples [39]. In this RQ, we measure the robustness of SCoT prompting to examples. Specifically, we measure the performance of SCoT prompting with different example seeds and different example writing styles.
RQ4: What are the contributions of different program structures in SCoT prompting? As stated in Section 2.1, SCoT prompting introduces three basic structures and the input-output structure. This RQ is designed to analyze the contributions of different structures. We select an LLM as the base model. Then, we individually remove a program structure and report the fluctuations in performance.

Table 1: Statistics of the datasets in our experiments.

Statistics              HumanEval   MBPP     MBCPP
Language                Python      Python   C++
# Train                 –           474      413
# Test                  164         500      435
Avg. tests per sample   7.7         3        3

3.2 Benchmarks
Following previous studies [6, 7, 17, 40], we conduct experiments on three representative code generation benchmarks: HumanEval in Python, MBPP in Python, and MBCPP in C++. The details of the benchmarks are described as follows.
• HumanEval [7] is a Python function-level code generation benchmark, which contains 164 hand-written programming problems. Each programming problem consists of an English requirement, a function signature, and several test cases, with an average of 7.7 test cases per problem.
• MBPP [2] is a Python function-level code generation benchmark. It contains 974 programming problems that involve simple numeric manipulations or basic usage of standard libraries. Each problem contains an English requirement, a function signature, and three manually written test cases for checking functions.
• MBCPP [1] is a C++ function-level code generation benchmark. It consists of 848 programming problems that are collected by crowd-sourcing. Each problem contains an English description, a function signature, and three test cases for checking the correctness of functions.
We follow the original splits of the three datasets. The statistics of the benchmarks are shown in Table 1. We randomly pick several samples from the training data to make the examples in prompts (Section 2.3). Then, we measure the performance of different approaches on the test data. Because HumanEval does not contain training data, we reuse the examples from MBPP for HumanEval.

3.3 Evaluation Metrics
Following previous code generation studies [6, 7, 17, 40], we use Pass@k as our evaluation metric. Specifically, given a requirement, a code generation model is allowed to generate k programs. The requirement is solved if any generated program passes all test cases. We compute the percentage of solved requirements among all requirements as Pass@k. For Pass@k, a higher value is better. In our experiments, k is set to 1, 3, and 5, because we think that developers mainly use the Top-5 outputs in real-world scenarios.
Previous work [1, 6, 7] found that the standard Pass@k has high variance and proposed an unbiased Pass@k. We follow previous work and employ the unbiased Pass@k. Specifically, we generate n ≥ k programs per requirement (in this paper, we use n = 20 and k ∈ [1, 3, 5]), count the number of generated programs c that pass all test cases, and calculate the unbiased Pass@k:

\mathrm{Pass@}k := \mathbb{E}_{\mathrm{Problems}}\left[ 1 - \binom{n-c}{k} \Big/ \binom{n}{k} \right] \qquad (1)

We also notice that previous code generation studies use text-similarity-based metrics (e.g., BLEU [21]). These metrics are initially designed for natural language generation and are poor at measuring the correctness of programs [7]. Thus, we omit these metrics in our experiments.

3.4 Comparison Baselines
This paper proposes a new prompting technique for code generation. To assess the effectiveness of our approach, we select three mainstream prompting techniques as baselines.
• Zero-shot prompting [7] directly feeds the requirement into LLMs without examples. Then, it extracts a generated program from the LLMs' outputs.
• Few-shot prompting [7] randomly selects several <requirement, code> pairs as examples. Given a requirement, it concatenates the examples and the requirement together, making a prompt. Then, the prompt is fed into LLMs, and the LLMs predict a new program.
• Chain-of-Thought (CoT) prompting [35] is a variant of few-shot prompting. CoT prompting produces a special prompt consisting of <requirement, CoT, code> triples as examples. A CoT is a series of intermediate natural language reasoning steps.
To ensure the fairness of the comparison, all baselines and SCoT prompting have the same number of examples (i.e., three examples) and the same example seeds.
The criteria for selecting baselines are three-fold. (1) SCoT prompting is a prompting technique for code generation. Thus, we directly compare it to existing prompting techniques for code generation. We also notice some emerging prompting techniques in other fields, such as Self-Consistency [31] and Least-to-Most [41]. But these approaches are designed for specific tasks (e.g., arithmetic reasoning) and cannot be directly applied to code generation. Thus, we omit them in this paper. (2) Our approach augments LLMs and can be flexibly applied to different LLMs. Thus, we do not directly compare LLMs to our approach. (3) We also omit ranking techniques for code generation [6]. They first use LLMs to generate many candidates and then leverage test cases or neural networks to rerank the candidates. We think our work and these ranking techniques are complementary. Users can use our approach to generate programs and then use post-processing techniques to select the final output. We further discuss the complementarity through experiments in Section 5.2.

3.5 Base Large Language Models
There are many available LLMs for source code. Our motivation is that existing LLMs can be divided into two categories: standard language models and instruction-tuned models. For each category, we pick a representative model as the base model.
(1) Standard language models are pre-trained on a large-scale corpus with the next-token prediction objective. They are mainly used to continually complete the given content, such as code completion. Thus, we pick the state-of-the-art completion model for code - Codex [7] - as a base model.
Codex [7] is a powerful language model for code generation, which supports a commercial application - GitHub Copilot [9]. Codex's training data contains both natural language and billions of lines of code. We use OpenAI's APIs to access the latest version of Codex with 175 billion parameters, i.e., code-davinci-002 [19].
(2) Instruction-tuned models refer to LLMs after instruction tuning. Instruction tuning trains LLMs to understand human users' instructions and perform tasks based on the instructions. We select the state-of-the-art instruction-tuned model - ChatGPT [18] - as a base model.
ChatGPT [18] is the state-of-the-art LLM for code generation. ChatGPT is trained with extensive natural language text and code files. Then, it is trained with reinforcement learning and learns to follow human instructions. We use OpenAI's APIs to access ChatGPT, i.e., gpt-3.5-turbo-0301 [18].
Our approach does not rely on specific LLMs and can be applied to different LLMs in a plug-and-play fashion. In the future, we will explore its usage on more powerful LLMs.

3.6 Sampling Settings
Following previous studies [7, 17, 40], we use nucleus sampling [11] to decode programs from LLMs. To ensure the fairness of the experiments, all baselines and SCoT prompting generate 20 programs per requirement. The details of the sampling settings are as follows.
Baselines. The temperature is 0.8 and the top-p is 0.95. For zero-shot and few-shot prompting, the maximum generation length is 300 tokens. The maximum generation length of CoT prompting is 600 tokens. Our motivation is that CoT prompting needs to generate intermediate reasoning steps and the code. Thus, it requires a larger generation length.
SCoT prompting. In the first step, we sample 20 individual SCoTs from the LLM per requirement. The temperature is 0.8 and the top-p is 0.95. The maximum generation length is 300 tokens. Then, for each SCoT, we use the LLM to generate a corresponding program. The temperature is 0 and the maximum generation length is 300 tokens. Finally, we obtain 20 programs for each requirement. The total generation length of the two steps is the same as that of CoT prompting.

4 RESULTS AND ANALYSIS
4.1 RQ1: How does SCoT prompting perform in terms of accuracy compared to baselines?
In the first research question, we apply SCoT prompting and the baselines to three benchmarks and use unit tests to measure the correctness of the generated programs.
Setup. We apply the baselines and SCoT prompting to two LLMs (Section 3.5). Then, we measure the performance of different approaches on three code generation benchmarks (Section 3.2) using the Pass@k (Section 3.3).
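As a concrete illustration of the two-step pipeline and the sampling settings described in Section 3.6, the sketch below shows how the 20 programs per requirement could be produced. It is a minimal sketch written for illustration only: llm_generate is a hypothetical helper standing in for a call to the underlying LLM (e.g., ChatGPT or Codex via OpenAI's APIs), and scot_prompt / code_prompt stand in for the few-shot prompts of Figures 3 and 4; none of these names come from the replication package.

from typing import Callable, List

def scot_sampling(requirement: str,
                  scot_prompt: str,
                  code_prompt: str,
                  llm_generate: Callable[..., str],
                  n_samples: int = 20) -> List[str]:
    # Step 1: sample n individual SCoTs with nucleus sampling
    # (temperature 0.8, top-p 0.95, at most 300 new tokens).
    scots = [
        llm_generate(scot_prompt + requirement,
                     temperature=0.8, top_p=0.95, max_tokens=300)
        for _ in range(n_samples)
    ]
    # Step 2: for each SCoT, implement the code greedily
    # (temperature 0, at most 300 new tokens), so the total
    # generation length matches CoT prompting's 600 tokens.
    programs = [
        llm_generate(code_prompt + requirement + "\n" + scot,
                     temperature=0.0, max_tokens=300)
        for scot in scots
    ]
    return programs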
Table 2: The Pass@k (%) of SCoT prompting and baselines on three code generation benchmarks. The numbers in red denote
SCoT prompting’s relative improvements compared to the SOTA baseline - CoT prompting.
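The Pass@k values in Table 2 (and in the later tables) are computed with the unbiased estimator of Eq. (1). The following is a minimal, self-contained sketch of that computation; the function name and the example counts are ours, not part of the replication package.

from math import comb

def unbiased_pass_at_k(n: int, c: int, k: int) -> float:
    # n: number of sampled programs per requirement (n = 20 in our experiments)
    # c: number of sampled programs that pass all test cases
    # k: number of programs a user is assumed to inspect (1, 3, or 5)
    if n - c < k:
        # Every possible choice of k programs contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# The benchmark-level Pass@k is the mean over all requirements.
correct_counts = [20, 3, 0, 7]  # hypothetical c values for four requirements
print(sum(unbiased_pass_at_k(20, c, 1) for c in correct_counts) / len(correct_counts))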
Results. The Pass@k (k ∈ [1, 3, 5]) of different approaches are shown in Table 2. The numbers in red denote SCoT prompting's relative improvements compared to the SOTA baseline - CoT prompting. All the p-values are substantially smaller than 0.05.

Analyses. (1) SCoT prompting achieves the best results among all baselines. Table 2 shows that SCoT prompting can generate more correct programs than the baselines on three benchmarks. Compared to the SOTA baseline - CoT prompting, in terms of Pass@1, SCoT prompting outperforms it by up to 13.79% in HumanEval, 12.31% in MBPP, and 6.63% in MBCPP. Pass@1 is a strict metric and is difficult to improve. The results show that SCoT prompting can significantly improve the accuracy of LLMs on code generation and is more promising than existing prompting techniques. (2) SCoT prompting is effective for different LLMs and programming languages. Compared to CoT prompting, in terms of Pass@1, SCoT prompting further improves ChatGPT by up to 13.79% and Codex by up to 13.77%. Besides, SCoT prompting is language-agnostic and can be applied to different programming languages. As shown in Table 2, SCoT prompting brings substantial improvements in Python (i.e., HumanEval and MBPP) and C++ (i.e., MBCPP). (3) SCoT prompting unlocks the reasoning ability of LLMs on code generation. LLMs can benefit from generating intermediate reasoning steps. The baseline - CoT prompting - utilizes natural language steps but only brings slight improvements. In terms of Pass@1, CoT prompting improves few-shot prompting by up to 2% in HumanEval, 7.51% in MBPP, and 3.79% in MBCPP. In this paper, we introduce program structures into the intermediate reasoning steps and propose a Structured Chain-of-Thought (SCoT). The SCoT constrains LLMs to use program structures to generate intermediate steps, moving in the direction of code. In terms of Pass@1, SCoT prompting improves few-shot prompting by up to 16.05% in HumanEval, 17.45% in MBPP, and 9.56% in MBCPP. The improvements show that SCoT prompting further unlocks the reasoning ability of LLMs in code generation.

Answer to RQ1: SCoT prompting achieves higher accuracy than baselines on three benchmarks. In terms of Pass@1, SCoT prompting outperforms the SOTA baseline by up to 13.79% in HumanEval, 12.31% in MBPP, and 6.63% in MBCPP. The significant improvements show the effectiveness of SCoT prompting in code generation.

4.2 RQ2: Do developers prefer programs generated by SCoT prompting?
The ultimate goal of code generation is to assist developers in writing programs. In this RQ, we hire 10 developers (including industry employees and academic researchers) to manually review the programs generated by SCoT prompting and the baselines.

Table 3: The results of human evaluation in three aspects. The numbers in red denote SCoT prompting's relative improvements compared to the SOTA baseline - CoT prompting.

Approach               Correctness   Code Smell   Maintainability
Zero-shot prompting    1.012         1.523        1.372
Few-shot prompting     1.119         1.653        1.552
CoT prompting          1.225         1.689        1.616
SCoT prompting         1.412         1.869        1.873
Relative Improvement   15.27%        10.66%       15.90%

Setup. To ensure the fairness of the evaluation, we follow the human evaluation settings of previous studies [10, 14]. We have carefully checked the evaluation settings and think our settings are reliable. Specifically, we manually evaluate the generated programs in the following aspects:
• Correctness (whether the program satisfies the requirement). 0 points: the program is totally inconsistent with the requirement. 1 point: the program is implemented, but misses some details. 2 points: the program is correctly implemented.
• Code Smell (whether the program contains bad code smells). 0 points: there is a serious code smell. 1 point: some details are not
We attribute the improvements to our proposed SCoT. The SCoT constrains LLMs to use program structures to generate intermediate reasoning steps. It allows LLMs to explore diverse solutions with three basic structures, improving the correctness of the code. Then, based on the SCoT, LLMs focus on implementing a program in a standardized way. Thus, the generated programs contain fewer code smells than the ones from baselines.

Figure 5 shows two programs generated by SCoT prompting and few-shot prompting, respectively. Both programs pass the unit tests. But the program from few-shot prompting contains a very complex statement (highlighted in Figure 5). Developers have to spend lots of effort to understand and maintain this program. In contrast, the program from SCoT prompting has good readability, and the SCoT clearly explains the behavior of the code. Developers can further use the SCoT as comments of the program for future maintenance.

Requirement:
# Return True is list elements are monotonically increasing or decreasing.

Few-shot prompting:
def monotonic(l: list):
    if all(l[i] <= l[i+1] for i in range(len(l)-1)) or all(l[i] >= l[i+1] for i in range(len(l)-1)):
        return True
    else:
        return False

SCoT prompting (SCoT & Source Code):
Input: l: list
Output: True or False
1: for each element in l do
2:     if the element is greater than its previous element then
3:         l increases
4:     if the element is less than its previous element then
5:         l decreases
6: if both increase and decrease then
7:     return False
8: else
9:     return True

def monotonic(l: list):
    increasing = False
    decreasing = False
    for i in range(1, len(l)):
        if l[i] > l[i-1]:
            increasing = True
        if l[i] < l[i-1]:
            decreasing = True
    if increasing and decreasing:
        return False
    else:
        return True

Figure 5: The comparison of programs generated by few-shot prompting and SCoT prompting.
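As a small illustration of the last point (our own sketch, not model output reported in the paper), the SCoT from Figure 5 can be kept directly as comments in the generated function for future maintenance:

def monotonic(l: list):
    # SCoT kept as comments:
    # Input: l: list; Output: True or False
    # for each element in l:
    #     if the element is greater than its previous element: l increases
    #     if the element is less than its previous element: l decreases
    # if l both increases and decreases: return False, else: return True
    increasing = False
    decreasing = False
    for i in range(1, len(l)):
        if l[i] > l[i-1]:
            increasing = True
        if l[i] < l[i-1]:
            decreasing = True
    return not (increasing and decreasing)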
Table 5: The Pass@k of CoT prompting and SCoT prompting with different example seeds.

Seed     CoT prompting                  SCoT prompting
         Pass@1   Pass@3   Pass@5       Pass@1   Pass@3   Pass@5
Seed A   53.29    69.76    75.52        60.64    73.53    77.32
Seed B   52.81    68.97    74.55        60.27    73.11    77.16
Seed C   51.36    67.44    73.62        59.36    72.88    76.79

Table 6: The Pass@k of CoT prompting and SCoT prompting with different annotators.

Annotator     CoT prompting                  SCoT prompting
              Pass@1   Pass@3   Pass@5       Pass@1   Pass@3   Pass@5
Annotator A   53.29    69.76    75.52        60.64    73.53    77.32
Annotator B   51.43    67.92    73.44        59.48    72.16    76.44
Annotator C   52.18    68.45    74.71        60.02    73.15    77.24

The improvements of SCoT prompting come from the program structures instead of specific details in the examples.

We also notice that there are slight variances in the performance of SCoT prompting under different settings. This is expected for prompting techniques that use examples, and similar variances can be found in CoT prompting. Nevertheless, SCoT prompting still outperforms CoT prompting in different settings.

Answer to RQ3: SCoT prompting is robust to examples. Under different example seeds or writing styles, SCoT prompting substantially outperforms the SOTA baseline - CoT prompting.

4.4 RQ4: What are the contributions of different program structures in SCoT prompting?
As stated in Section 2.1, SCoT prompting introduces basic structures (i.e., sequence, branch, and loop) and the input-output structure. This RQ is designed to analyze the contributions of different program structures.
Setup. We select ChatGPT as the base model. Then, we conduct an ablation study by independently removing the basic structures and the input-output (IO) structure. When removing the basic structures, we use a CoT with an IO structure as the intermediate steps. When removing the IO structure, the SCoT only contains a solving process with the basic structures.
Results. The results are shown in Table 4. "w/o" is the abbreviation of "without".
Analyses. (1) The three basic structures are beneficial for designing a feasible solving process. In Table 4, after removing the basic structures, the performance of SCoT prompting drops obviously. We carefully inspect failed cases and find that LLMs benefit from using the basic structures to clearly write a solving process. Figure 6 shows the intermediate steps of SCoT prompting and SCoT prompting without basic structures. SCoT prompting without basic structures uses CoTs, which sequentially describe how to write the code line by line and contain many ambiguities. For example, the scopes of the two iterations in lines 2 and 4 are unclear. LLMs are likely to misunderstand the CoT and generate incorrect code. In contrast, SCoT prompting uses the three basic structures to describe the solving process. The SCoT is clear and similar to code, benefiting the following code implementation.

SCoT prompting without basic structures:
Input: arry: list[list]
Output: result: int or float
1. Initialize a result with -999999
2. Iterate through the list of lists
3. Calculate the sum of the list
4. Update the result with the maximum of sum and result
5. Return the result

SCoT prompting:
Input: arry: list[list]
Output: result: int or float
1: Initialize a result with -999999
2: for _list in the list of lists:
3:     Calculate the sum of the _list
4:     Update the result with the maximum of sum and result
5: return the result

Figure 6: The comparison of SCoT prompting and SCoT prompting without basic structures.

(2) The IO structure benefits the requirement understanding. In Table 4, after deleting the IO structure, the performance of SCoT prompting has a slight decrease. We analyze failed cases and think the IO structure benefits the requirement understanding. Figure 7 shows two programs from SCoT prompting and SCoT prompting without the IO structure. We can see that SCoT prompting without the IO structure wrongly understands the output format and generates an incorrect program. After adding the IO structure, LLMs first reason about the input-output format and correctly return a boolean value.
Table 7: The comparison of SCoT-P prompting and SCoT prompting. The numbers in red denote SCoT prompting's relative improvements compared to SCoT-P prompting.

Approach               HumanEval                    MBPP                         MBCPP
                       Pass@1   Pass@3   Pass@5     Pass@1   Pass@3   Pass@5     Pass@1   Pass@3   Pass@5
CoT prompting          53.29    69.76    75.52      41.83    51.04    54.57      53.51    63.84    67.03
SCoT-P prompting       55.23    70.33    75.94      43.28    52.16    55.77      54.25    64.09    67.78
SCoT prompting         60.64    73.53    77.32      46.98    55.31    58.36      57.06    65.70    68.70
Relative Improvement   9.80%    4.55%    1.82%      8.55%    6.04%    4.64%      5.18%    2.51%    1.36%
SCoT prompting:
def test_duplicate(arraynums):
    # Input: arraynums, a list of integers
    # Output: True if exist duplicate element, False otherwise
    num_set = set(arraynums)
    if len(num_set) < len(arraynums):
        return True
    else:
        return False

Figure 7: The comparison of SCoT prompting and SCoT prompting without the IO structure.

[Figure 8 is a bar chart of Pass@1, Pass@3, and Pass@5 for ChatGPT, ChatGPT+CodeT, and ChatGPT+CodeT+SCoT.]
Figure 8: The complementarity between CodeT and SCoT prompting.
We can see that the performance of ChatGPT is continually improved by adding CodeT and SCoT prompting.
(2) Rank techniques rely on execution environments. Rank techniques require executing programs on test cases and using the execution results to rerank programs. In many realistic programming scenarios, users want to get code suggestions for an unfinished project. It is infeasible to execute auto-generated programs. Thus, we think rank techniques have limited application scenarios and additionally rely on the execution results. Our approach works in a general scenario and does not use execution results. Thus, it is unfair to directly compare SCoT prompting to rank techniques.

5.3 Threats to Validity
There are three main threats to the validity of our work.
(1) The generalizability of experimental results. To mitigate this threat, we carefully select the benchmarks, metrics, and baselines. Following previous studies [1, 2, 7], we pick three representative code generation benchmarks. They are hand-written or collected from real-world programming communities, and cover two popular languages (i.e., Python and C++). For evaluation metrics, we select a widely used metric - Pass@k - which utilizes test cases to check the correctness of programs. We use the unbiased Pass@k, which is more reliable [7]. For comparison baselines, we select the SOTA prompting techniques and conduct a comprehensive comparison in Section 4. SCoT prompting and the baselines have the same example seeds and maximum generation lengths.
(2) The impact of the two-step pipeline. CoT prompting generates a CoT and the code in one step. Our SCoT prompting generates the code in two steps. It first generates SCoTs and then generates the code. It is possible that the improvements come from the two-step pipeline. To address this threat, we have two considerations. First, the LLMs in our experiments are auto-regressive language models. For an auto-regressive language model, a one-step pipeline and a two-step pipeline are theoretically equivalent. Second, we conduct an ablation study in Section 4.4. We keep the two-step pipeline unchanged and remove the program structures. The results in Table 4 show that SCoT prompting without program structures has a significant drop in the Pass@k. It shows that the improvements of SCoT prompting are brought by the program structures instead of the two-step pipeline.
(3) The data leakage. Existing LLMs are trained with extensive code files from open-source communities. It is possible that their training data contains the experimental benchmarks, leading to data leakage. But we think that it does not affect the fairness of our experiments. In this paper, we select a specific LLM (e.g., ChatGPT) as the base model and apply different prompting techniques to it. Thus, the reported relative improvements between the baselines and our approach are credible. In the future, we will add the latest benchmarks to alleviate this threat.

Standard language models are pre-trained on a large-scale corpus with the next-token prediction objective. They are mainly used to continually complete the given context, such as code completion. After the success of the GPT series [4, 23, 24] in NLP, OpenAI fine-tuned GPT models on code to produce the closed-source Codex [7]. There followed many open-source replication attempts, e.g., CodeParrot [29], CodeGen [17], CodeGeeX [40], InCoder [8], StarCoder [15], and CodeT5+ [33].
Instruction-tuned models are models after instruction tuning [34]. Instruction tuning trains models to understand human users' instructions and perform tasks by following instructions. ChatGPT [18] is trained with human feedback [20] and is powerful on both natural language tasks and programming tasks. Many attempts have been made to train an "open-source ChatGPT". Alpaca [27] is LLaMA [28] tuned using self-instruct [32] and ChatGPT feedback. Code Alpaca [5] is LLaMA tuned using self-instruct and ChatGPT feedback, with instructions focusing on programming tasks. WizardCoder [16] is StarCoder [15] tuned using Evol-Instruct [36] and ChatGPT feedback, with Code Alpaca's dataset as the seed dataset. InstructCodeT5+ [33] is CodeT5+ [33] tuned on Code Alpaca's dataset.
Prompting Techniques. With the enormous number of parameters (e.g., Codex: 175 billion parameters), it is hard to directly fine-tune LLMs on code generation. Prompting techniques are a popular alternative, which leverages LLMs to generate code by inputting a special prompt.
Early on, researchers proposed zero-shot prompting and few-shot prompting. Zero-shot prompting concatenates a task instruction (e.g., please generate a program based on the requirement) and a requirement together, making a prompt. Based on zero-shot prompting, few-shot prompting further adds several ⟨requirement, code⟩ pairs to the prompt, so that LLMs can learn code generation from the given examples.
Chain-of-Thought (CoT) prompting [35] is a recently proposed prompting technique. CoT prompting asks LLMs first to generate CoTs (i.e., intermediate natural language reasoning steps) and then output the final code. It allows LLMs to first design a solving process that leads to the code. CoT prompting has achieved SOTA results in natural language generation and has sparked lots of follow-up research, such as self-consistency prompting [31] and least-to-most prompting [41]. But these prompting techniques are designed for natural language generation and bring slight improvements in code generation.
In this paper, we propose a novel prompting technique named Structured Chain-of-Thought (SCoT) prompting. Different from standard CoT prompting, SCoT prompting explicitly introduces program structures and asks LLMs to generate intermediate reasoning steps with program structures. We compare CoT prompting and SCoT prompting in Section 4. The results show that SCoT prompting significantly outperforms CoT prompting in three benchmarks.
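To make the differences among these prompt formats concrete, the sketch below assembles a few-shot prompt and the two SCoT prompts from ⟨requirement, SCoT, code⟩ examples (Section 2.3). It is our own illustration: the build_* function names, the "###" section markers, and the literal strings are assumptions, not the exact templates of Figures 3 and 4 or the replication package.

from typing import List, Tuple

def build_few_shot_prompt(examples: List[Tuple[str, str]], requirement: str) -> str:
    # Few-shot prompting: concatenate <requirement, code> pairs, then the new requirement.
    parts = [f"### Requirement:\n{req}\n### Code:\n{code}\n" for req, code in examples]
    parts.append(f"### Requirement:\n{requirement}\n### Code:\n")
    return "\n".join(parts)

def build_scot_prompt(examples: List[Tuple[str, str, str]], requirement: str) -> str:
    # SCoT prompting, step 1: <requirement, SCoT> examples, then the new requirement,
    # so the model first writes a structured chain of thought.
    parts = [f"### Requirement:\n{req}\n### SCoT:\n{scot}\n" for req, scot, _ in examples]
    parts.append(f"### Requirement:\n{requirement}\n### SCoT:\n")
    return "\n".join(parts)

def build_code_prompt(examples: List[Tuple[str, str, str]], requirement: str, scot: str) -> str:
    # SCoT prompting, step 2: the prompt ends with the new requirement and its
    # generated SCoT, and the model implements the final code.
    parts = [f"### Requirement:\n{req}\n### SCoT:\n{s}\n### Code:\n{code}\n" for req, s, code in examples]
    parts.append(f"### Requirement:\n{requirement}\n### SCoT:\n{scot}\n### Code:\n")
    return "\n".join(parts)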
This paper proposes a Structured CoT (SCoT) and presents a new prompting technique for code generation, named SCoT prompting. SCoT prompting asks LLMs to generate a SCoT using program structures (i.e., sequence, branch, and loop structures). Then, LLMs generate the code based on the SCoT. A large-scale study on three benchmarks shows that SCoT prompting significantly outperforms CoT prompting in Pass@k and human evaluation. Besides, SCoT prompting is robust to examples and obtains stable improvements.

In the future, we will explore new prompting techniques for code generation. For example, source code can be represented by a tree (e.g., an abstract syntax tree). We can design a tree-based prompting technique, which uses LLMs to generate a tree.

REFERENCES
[1] Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, et al. 2022. Multi-lingual Evaluation of Code Generation Models. arXiv preprint arXiv:2210.14868 (2022).
[2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021).
[3] Corrado Böhm and Giuseppe Jacopini. 1966. Flow diagrams, turing machines and languages with only two formation rules. Commun. ACM 9, 5 (1966), 366–371. https://doi.org/10.1145/355592.365646
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[5] Sahil Chaudhary. 2023. Code Alpaca: An Instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca.
[6] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022. CodeT: Code Generation with Generated Tests. https://doi.org/10.48550/ARXIV.2207.10397
[7] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. (2021). arXiv:2107.03374 [cs.LG]
[8] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Scott Yih, Luke Zettlemoyer, and Mike Lewis. 2023. InCoder: A Generative Model for Code Infilling and Synthesis. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=hQwb-lbM6EL
[9] GitHub. 2022. GitHub Copilot. https://github.com/features/copilot.
[10] Yiyang Hao, Ge Li, Yongqiang Liu, Xiaowei Miao, He Zong, Siyuan Jiang, Yang Liu, and He Wei. 2022. AixBench: A Code Generation Benchmark Dataset. CoRR abs/2206.13179 (2022). https://doi.org/10.48550/arXiv.2206.13179 arXiv:2206.13179
[11] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The Curious Case of Neural Text Degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=rygGQyrFvH
[12] Jeevana Priya Inala, Chenglong Wang, Mei Yang, Andres Codas, Mark Encarnación, Shuvendu Lahiri, Madanlal Musuvathi, and Jianfeng Gao. 2022. Fault-aware neural code rankers. Advances in Neural Information Processing Systems 35 (2022), 13419–13432.
[13] Jia Li, Ge Li, Zhuo Li, Zhi Jin, Xing Hu, Kechi Zhang, and Zhiyi Fu. 2023. CodeEditor: Learning to Edit Source Code with Pre-Trained Models. ACM Trans. Softw. Eng. Methodol. (May 2023). https://doi.org/10.1145/3597207 Just Accepted.
[14] Jia Li, Yongmin Li, Ge Li, Zhi Jin, Yiyang Hao, and Xing Hu. 2023. SkCoder: A Sketch-based Approach for Automatic Code Generation. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 2124–2135. https://doi.org/10.1109/ICSE48619.2023.00179
[15] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
[16] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv preprint arXiv:2306.08568 (2023).
[17] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. arXiv preprint (2022).
[18] OpenAI. 2022. ChatGPT. https://openai.com/blog/chatgpt.
[19] OpenAI. 2022. Codex. https://beta.openai.com/docs/api-reference/completions.
[20] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
[21] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
[22] Han Peng, Ge Li, Wenhan Wang, Yunfei Zhao, and Zhi Jin. 2021. Integrating Tree Path in Transformer for Code Representation. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (Eds.). 9343–9354. https://proceedings.neurips.cc/paper/2021/hash/4e0223a87610176ef0d24ef6d2dcde3a-Abstract.html
[23] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
[24] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. (2019).
[25] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence Level Training with Recurrent Neural Networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1511.06732
[26] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 4149–4158. https://doi.org/10.18653/v1/n19-1421
[27] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
[28] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971 (2023).
[29] Lewis Tunstall, Leandro Von Werra, and Thomas Wolf. 2022. Natural Language Processing with Transformers. O'Reilly Media, Inc.
[30] Wenhan Wang, Ge Li, Sijie Shen, Xin Xia, and Zhi Jin. 2020. Modular Tree Network for Source Code Representation Learning. 29, 4, Article 31 (Sep 2020), 23 pages. https://doi.org/10.1145/3409331
[31] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=1PL1NIMMrw
[32] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 13484–13508. https://aclanthology.org/2023.acl-long.754
[33] Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. 2023. CodeT5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922 (2023).
[34] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned Language Models are Zero-Shot Learners. In International Conference on Learning Representations. https://openreview.net/forum?id=gEZrGCozdqR
[35] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (Eds.).