Jia Li ♂
[email protected]
Peking University
Beijing, China

Ge Li
[email protected]
Peking University
Beijing, China
[3], including sequence, branch, and loop structures. Intuitively, the intermediate reasoning steps leading to the structured code should also be structured. Consider a human developer's thought process when solving a requirement (e.g., find the maximum number in a list). It is typical to come up with a solving process built from program structures: "Initialize a result with -inf; for each number in the list: if the number is greater than result: Update result with the number ...". Our idea is to enable LLMs to generate similarly structured CoTs - a coherent series of intermediate reasoning steps constructed by program structures. Besides, LLMs' training data contains lots of code, so they have the ability to generate program structures. However, the standard CoT ignores program structures and has low accuracy in code generation. Thus, it is necessary to design a structured CoT to unlock the reasoning ability of LLMs in code generation.

Figure 1 (b) shows a SCoT. The design of our SCoT has two inspirations. First, existing work [3] proved that any simple or complex program can be composed of three basic structures, i.e., the sequence structure, the branch structure, and the loop structure. Thus, we introduce these three basic structures and constrain LLMs to use them to generate CoTs. As shown in Figure 1 (b), the SCoT uses a loop structure to clearly describe the iteration in line 2, while in the CoT the scopes of the two iterations in lines 2 and 4 are ambiguous. This shows the superiority of the SCoT in code generation. Second, every program contains a required input-output structure, which includes the input-output parameters and their types (e.g., Input: array: list[list]; Output: result in Figure 1 (b)). By generating the input-output structure, LLMs are asked to analyze the requirement and determine the entry and exit of the code, which benefits the following solving process.

Based on the SCoT, we present a new prompting technique named SCoT prompting. It asks LLMs first to generate a SCoT using program structures and then to implement the code. Compared to CoT prompting, SCoT prompting explicitly introduces program structures into the intermediate reasoning steps and constrains LLMs to think about how to solve requirements from the perspective of programming languages. It further unlocks the reasoning ability of LLMs in code generation, thus achieving higher accuracy.

We apply SCoT prompting to two popular LLMs (i.e., ChatGPT [18] and Codex [7]) and evaluate it on three representative benchmarks (i.e., HumanEval [7], MBPP [2], and MBCPP [1]). We use unit tests to measure the correctness of generated programs and report the Pass@k (k ∈ [1, 3, 5]) [7]. Based on the experimental results, we obtain four findings. (1) SCoT prompting significantly improves the accuracy of LLMs on code generation. Compared to the SOTA baseline - Chain-of-Thought prompting, in terms of Pass@1, SCoT prompting outperforms it by up to 13.79% in HumanEval, 12.31% in MBPP, and 6.63% in MBCPP. (2) Human evaluation shows that human developers prefer programs generated by SCoT prompting. (3) SCoT prompting is effective for different LLMs and different programming languages. In terms of Pass@1, it improves ChatGPT by up to 13.79% and Codex by up to 13.77%. Besides, SCoT prompting is language-agnostic and effective in multiple languages (e.g., Python and C++). (4) We explore the robustness of SCoT prompting to examples. The results show that SCoT prompting does not depend on specific examples or writing styles.

We summarize our contributions in this paper as follows.
• We propose a Structured Chain-of-Thought (SCoT), which utilizes program structures to build the intermediate reasoning steps.
• We propose a novel prompting technique for code generation, named SCoT prompting. It prompts large language models first to generate a SCoT and then implement the code.
• We conduct extensive experiments on three benchmarks. Qualitative and quantitative experiments show that SCoT prompting significantly outperforms SOTA baselines (e.g., Chain-of-Thought prompting).
• We discuss the contributions of different program structures and the robustness of SCoT prompting.

Data Availability. We open source our replication package, including the datasets and the source code of SCoT prompting, to facilitate other researchers and practitioners in repeating our work and verifying their studies.

2 METHODOLOGY
In this section, we propose a Structured Chain-of-Thought (SCoT). A SCoT denotes several intermediate reasoning steps constructed with program structures. Then, we present a novel prompting technique for code generation named SCoT prompting. SCoT prompting asks LLMs first to generate a SCoT and then output the final code. In the following subsections, we first describe the design of our SCoT and then show the details of SCoT prompting.

2.1 Structured Chain-of-Thought
A standard Chain-of-Thought (CoT) consists of several intermediate natural language reasoning steps that lead to the final answer [35]. The CoT was initially designed for natural language generation (e.g., commonsense reasoning [26]). Thus, the CoT only uses natural language to sequentially describe how to solve a problem step by step. Figure 1 (a) shows a CoT on code generation. A limitation is that the CoT brings only slight improvements in code generation. For example, adding the CoT improves ChatGPT by only 0.82 points in Pass@1 on a real-world benchmark - HumanEval [7].

In this paper, we propose a Structured CoT. Our motivation is that, unlike natural language generation, the goal of code generation is highly structured code. Source code solves a problem through special structures, including sequence structures, branch structures, and loop structures. For example, given a requirement - reading text from a given file, imagine a human developer's thought process. The developer will use program structures to design an initial idea: "if the given file exists: read text from the file; else: raise an error;". The program structures clearly show the solving process and benefit the following code implementation. Thus, the intermediate reasoning steps leading to the code should also be structured.

Figure 2 shows some examples of SCoTs. Compared to the CoT, our SCoT explicitly introduces program structures. Existing work [3] proved that any simple or complex program can be composed of three basic structures, i.e., the sequence structure, the branch structure, and the loop structure. Thus, we introduce these three basic structures, whose details are as follows (a small illustrative SCoT built from these structures is sketched after the list).
• Sequence Structure. The intermediate steps are sequentially placed and all steps are at the same level.
• Branch Structure. It starts with a condition and places different intermediate steps for different results of the condition.
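As a small illustration (our own sketch, not one of the released prompt examples), a SCoT for the earlier requirement "find the maximum number in a list" combines the input-output structure with a loop structure and a branch structure:

Input: numbers: list
Output: result: int or float
1: Initialize a result with -inf
2: for each number in numbers:
3:     if the number is greater than the result:
4:         Update the result with the number
5: return the result

From such a SCoT, an LLM can then derive a straightforward implementation; the function name below is ours and is used only for illustration:

def find_maximum(numbers: list):
    # Input: numbers, a list of numbers
    # Output: result, the maximum number in the list
    result = float('-inf')
    for number in numbers:
        if number > result:
            result = number
    return result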
The prompt ends with a new requirement and its SCoT, and is input into LLMs. By learning from the examples, LLMs generate a new program based on the requirement and the SCoT.

Related work [25] has found that generative models may be negatively affected by error accumulation. Similarly, in SCoT prompting, the generated SCoT may contain noise (e.g., errors or missing steps). This noise will further negatively affect the code implementation. In this paper, we utilize two approaches to alleviate error accumulation. First, as shown in Figure 4, we ask LLMs to double-check the SCoT and fix possible noise. It allows LLMs to adaptively refer to the SCoT and filter out noise. Second, SCoT prompting utilizes a two-step generation pipeline. It provides a window of opportunity to debug where the SCoT goes wrong. In practice, human developers can first check the generated SCoT and fix possible errors. Then, the SCoT is used to generate code.

2.3 Implementation Details
SCoT prompting is a prompting technique for code generation, which does not rely on specific LLMs. In this paper, we consider ChatGPT as the default LLM. We select a few (e.g., three) <requirement, code> pairs from real-world benchmarks (i.e., training data) as example seeds. Then, we manually write the SCoT for each seed and obtain examples - <requirement, SCoT, code> triples - which are used to make the prompts in Figures 3 and 4. A prompt contains three examples by default. The examples and prompt templates are available in our replication package. In the future, users can flexibly apply our approach to more powerful LLMs in a plug-and-play fashion.

3 STUDY DESIGN
To assess SCoT prompting, we conduct a large-scale study to answer four research questions. In this section, we present the details of our study, including datasets, evaluation metrics, comparison baselines, and implementation details.

3.1 Research Questions
Our study aims to answer the following research questions (RQs).
RQ1: How does SCoT prompting perform in terms of accuracy compared to baselines? This RQ aims to verify that SCoT prompting has a higher accuracy than existing prompting techniques on code generation. We apply three existing prompting techniques and SCoT prompting to two LLMs. Then, we use unit tests to measure the correctness of programs generated by different approaches and report the Pass@k.
RQ2: Do developers prefer programs generated by SCoT prompting? The ultimate goal of code generation is to assist human developers in writing code. In this RQ, we hire 10 developers (including industry employees and academic researchers) to manually review the programs generated by SCoT prompting and baselines. We measure the quality of programs in three aspects, including correctness, code smell, and maintainability.
RQ3: Is SCoT prompting robust to examples? Prompting techniques may be sensitive to examples [39]. In this RQ, we measure the robustness of SCoT prompting to examples. Specifically, we measure the performance of SCoT prompting with different example seeds and different example writing styles.
RQ4: What are the contributions of different program structures in SCoT prompting? As stated in Section 2.1, SCoT prompting introduces three basic structures and the input-output structure. This RQ is designed to analyze the contributions of different structures. We select an LLM as the base model. Then, we individually remove a program structure and report the fluctuations in performance.

Table 1: Statistics of the datasets in our experiments.

Statistics              HumanEval   MBPP     MBCPP
Language                Python      Python   C++
# Train                 –           474      413
# Test                  164         500      435
Avg. tests per sample   7.7         3        3

3.2 Benchmarks
Following previous studies [6, 7, 17, 40], we conduct experiments on three representative code generation benchmarks: HumanEval in Python, MBPP in Python, and MBCPP in C++. The details of the benchmarks are described as follows.
• HumanEval [7] is a Python function-level code generation benchmark, which contains 164 hand-written programming problems. Each programming problem consists of an English requirement, a function signature, and several test cases, with an average of 7.7 test cases per problem.
• MBPP [2] is a Python function-level code generation benchmark. It contains 974 programming problems that involve simple numeric manipulations or basic usage of standard libraries. Each problem contains an English requirement, a function signature, and three manually written test cases for checking functions.
• MBCPP [1] is a C++ function-level code generation benchmark. It consists of 848 programming problems that are collected by crowd-sourcing. Each problem contains an English description, a function signature, and three test cases for checking the correctness of functions.
We follow the original splits of the three datasets. The statistics of the benchmarks are shown in Table 1. We randomly pick several samples from the training data to make the examples in prompts (Section 2.3). Then, we measure the performance of different approaches on the test data. Because HumanEval does not contain training data, we reuse the examples from MBPP for HumanEval.

3.3 Evaluation Metrics
Following previous code generation studies [6, 7, 17, 40], we use Pass@k as our evaluation metric. Specifically, given a requirement, a code generation model is allowed to generate k programs. The requirement is solved if any generated program passes all test cases. We compute the percentage of solved requirements among all requirements as Pass@k. For Pass@k, a higher value is better. In our experiments, k is set to 1, 3, and 5, because we think that developers mainly use the Top-5 outputs in real-world scenarios.
Previous work [1, 6, 7] found that the standard Pass@k has high variance and proposed an unbiased Pass@k. We follow previous work and employ the unbiased Pass@k. Specifically, we generate n ≥ k programs per requirement (in this paper, we use n = 20 and k ∈ [1, 3, 5]), count the number of generated programs c that pass all test cases, and calculate the unbiased Pass@k:

\mathrm{Pass@}k := \mathbb{E}_{\mathrm{Problems}}\left[ 1 - \binom{n-c}{k} \Big/ \binom{n}{k} \right] \qquad (1)

We also notice that previous code generation studies use text-similarity-based metrics (e.g., BLEU [21]). These metrics are initially designed for natural language generation and are poor at measuring the correctness of programs [7]. Thus, we omit these metrics in our experiments.

3.4 Comparison Baselines
This paper proposes a new prompting technique for code generation. To assess the effectiveness of our approach, we select three mainstream prompting techniques as baselines.
• Zero-shot prompting [7] directly feeds the requirement into LLMs without examples. Then, it extracts a generated program from the LLMs' outputs.
• Few-shot prompting [7] randomly selects several <requirement, code> pairs as examples. Given a requirement, it concatenates the examples and the requirement together, making a prompt. Then, the prompt is fed into LLMs, and the LLMs predict a new program.
• Chain-of-Thought (CoT) prompting [35] is a variant of few-shot prompting. CoT prompting produces a special prompt consisting of <requirement, CoT, code> triples as examples. A CoT is a series of intermediate natural language reasoning steps.
To ensure the fairness of the comparison, all baselines and SCoT prompting have the same number of examples (i.e., three examples) and the same example seeds.
The criteria for selecting baselines are three-fold. (1) SCoT prompting is a prompting technique for code generation. Thus, we directly compare it to existing prompting techniques for code generation. We also notice some emerging prompting techniques in other fields, such as Self-Consistency [31] and Least-to-Most [41]. But these approaches are designed for specific tasks (e.g., arithmetic reasoning) and cannot be directly applied to code generation. Thus, we omit them in this paper. (2) Our approach augments LLMs and can be flexibly applied to different LLMs. Thus, we do not directly compare LLMs to our approach. (3) We also omit ranking techniques for code generation [6]. They first use LLMs to generate many candidates and then leverage test cases or neural networks to rerank the candidates. We think our work and these ranking techniques are complementary. Users can use our approach to generate programs and then use post-processing techniques to select the final output. We further discuss the complementarity through experiments in Section 5.2.

3.5 Base Large Language Models
There are many available LLMs for source code. Our motivation is that existing LLMs can be divided into two categories: standard language models and instruction-tuned models. For each category, we pick a representative model as the base model.
(1) Standard language models are pre-trained on a large-scale corpus with the next-token prediction objective. They are mainly used to continually complete the given content, such as code completion. Thus, we pick the state-of-the-art completion model for code - Codex [7] - as a base model.
Codex [7] is a powerful language model for code generation, which supports a commercial application - GitHub Copilot [9]. Codex's training data contains both natural language and billions of lines of code. We use OpenAI's APIs to access the latest version of Codex with 175 billion parameters, i.e., code-davinci-002 [19].
(2) Instruction-tuned models refer to LLMs after instruction tuning. Instruction tuning trains LLMs to understand human users' instructions and perform tasks based on the instructions. We select the state-of-the-art instruction-tuned model - ChatGPT [18] - as a base model.
ChatGPT [18] is the state-of-the-art LLM for code generation. ChatGPT is trained with extensive natural language text and code files. Then, it is trained with reinforcement learning and learns to follow human instructions. We use OpenAI's APIs to access ChatGPT, i.e., gpt-3.5-turbo-0301 [18].
Our approach does not rely on specific LLMs and can be applied to different LLMs in a plug-and-play fashion. In the future, we will explore its usage on more powerful LLMs.

3.6 Sampling Settings
Following previous studies [7, 17, 40], we use nucleus sampling [11] to decode programs from LLMs. To ensure the fairness of the experiments, all baselines and SCoT prompting generate 20 programs per requirement. The details of the sampling settings are as follows.
Baselines. The temperature is 0.8 and the top-p is 0.95. For zero-shot and few-shot prompting, the maximum generation length is 300 tokens. The maximum generation length of CoT prompting is 600 tokens. Our motivation is that CoT prompting needs to generate intermediate reasoning steps and the code. Thus, it requires a larger generation length.
SCoT prompting. In the first step, we sample 20 individual SCoTs from the LLM per requirement. The temperature is 0.8 and the top-p is 0.95. The maximum generation length is 300 tokens. Then, for each SCoT, we use the LLM to generate a corresponding program. The temperature is 0 and the maximum generation length is 300 tokens. Finally, we obtain 20 programs for each requirement. The total generation length of the two steps is the same as that of CoT prompting.

4 RESULTS AND ANALYSIS
4.1 RQ1: How does SCoT prompting perform in terms of accuracy compared to baselines?
In the first research question, we apply SCoT prompting and the baselines to three benchmarks and use unit tests to measure the correctness of the generated programs.
Setup. We apply the baselines and SCoT prompting to two LLMs (Section 3.5). Then, we measure the performance of different approaches on three code generation benchmarks (Section 3.2) using the Pass@k (Section 3.3).
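As a concrete illustration of the two-step pipeline and the sampling settings described in Section 3.6, the sketch below shows how the 20 programs per requirement could be produced. It is a minimal sketch written for illustration only: llm_generate is a hypothetical helper standing in for a call to the underlying LLM (e.g., ChatGPT or Codex via OpenAI's APIs), and scot_prompt / code_prompt stand in for the few-shot prompts of Figures 3 and 4; none of these names come from the replication package.

from typing import Callable, List

def scot_sampling(requirement: str,
                  scot_prompt: str,
                  code_prompt: str,
                  llm_generate: Callable[..., str],
                  n_samples: int = 20) -> List[str]:
    # Step 1: sample n individual SCoTs with nucleus sampling
    # (temperature 0.8, top-p 0.95, at most 300 new tokens).
    scots = [
        llm_generate(scot_prompt + requirement,
                     temperature=0.8, top_p=0.95, max_tokens=300)
        for _ in range(n_samples)
    ]
    # Step 2: for each SCoT, implement the code greedily
    # (temperature 0, at most 300 new tokens), so the total
    # generation length matches CoT prompting's 600 tokens.
    programs = [
        llm_generate(code_prompt + requirement + "\n" + scot,
                     temperature=0.0, max_tokens=300)
        for scot in scots
    ]
    return programs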
Table 2: The Pass@k (%) of SCoT prompting and baselines on three code generation benchmarks. The numbers in red denote
SCoT prompting’s relative improvements compared to the SOTA baseline - CoT prompting.
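The Pass@k values in Table 2 (and in the later tables) are computed with the unbiased estimator of Eq. (1). The following is a minimal, self-contained sketch of that computation; the function name and the example counts are ours, not part of the replication package.

from math import comb

def unbiased_pass_at_k(n: int, c: int, k: int) -> float:
    # n: number of sampled programs per requirement (n = 20 in our experiments)
    # c: number of sampled programs that pass all test cases
    # k: number of programs a user is assumed to inspect (1, 3, or 5)
    if n - c < k:
        # Every possible choice of k programs contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# The benchmark-level Pass@k is the mean over all requirements.
correct_counts = [20, 3, 0, 7]  # hypothetical c values for four requirements
print(sum(unbiased_pass_at_k(20, c, 1) for c in correct_counts) / len(correct_counts))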
Results. The Pass@k (k ∈ [1, 3, 5]) of different approaches are shown in Table 2. The numbers in red denote SCoT prompting's relative improvements compared to the SOTA baseline - CoT prompting. All the p-values are substantially smaller than 0.05.

Analyses. (1) SCoT prompting achieves the best results among all baselines. Table 2 shows that SCoT prompting can generate more correct programs than the baselines on three benchmarks. Compared to the SOTA baseline - CoT prompting, in terms of Pass@1, SCoT prompting outperforms it by up to 13.79% in HumanEval, 12.31% in MBPP, and 6.63% in MBCPP. Pass@1 is a strict metric and is difficult to improve. The results show that SCoT prompting can significantly improve the accuracy of LLMs on code generation and is more promising than existing prompting techniques. (2) SCoT prompting is effective for different LLMs and programming languages. Compared to CoT prompting, in terms of Pass@1, SCoT prompting further improves ChatGPT by up to 13.79% and Codex by up to 13.77%. Besides, SCoT prompting is language-agnostic and can be applied to different programming languages. As shown in Table 2, SCoT prompting brings substantial improvements in Python (i.e., HumanEval and MBPP) and C++ (i.e., MBCPP). (3) SCoT prompting unlocks the reasoning ability of LLMs on code generation. LLMs can benefit from generating intermediate reasoning steps. The baseline - CoT prompting - utilizes natural language steps but only brings slight improvements. In terms of Pass@1, CoT prompting improves few-shot prompting by up to 2% in HumanEval, 7.51% in MBPP, and 3.79% in MBCPP. In this paper, we introduce program structures into the intermediate reasoning steps and propose a Structured Chain-of-Thought (SCoT). The SCoT constrains LLMs to use program structures to generate intermediate steps, moving in the direction of code. In terms of Pass@1, SCoT prompting improves few-shot prompting by up to 16.05% in HumanEval, 17.45% in MBPP, and 9.56% in MBCPP. The improvements show that SCoT prompting further unlocks the reasoning ability of LLMs in code generation.

Answer to RQ1: SCoT prompting achieves higher accuracy than baselines on three benchmarks. In terms of Pass@1, SCoT prompting outperforms the SOTA baseline by up to 13.79% in HumanEval, 12.31% in MBPP, and 6.63% in MBCPP. The significant improvements show the effectiveness of SCoT prompting in code generation.

4.2 RQ2: Do developers prefer programs generated by SCoT prompting?
The ultimate goal of code generation is to assist developers in writing programs. In this RQ, we hire 10 developers (including industry employees and academic researchers) to manually review the programs generated by SCoT prompting and the baselines.

Table 3: The results of human evaluation in three aspects. The numbers in red denote SCoT prompting's relative improvements compared to the SOTA baseline - CoT prompting.

Approach               Correctness   Code Smell   Maintainability
Zero-shot prompting    1.012         1.523        1.372
Few-shot prompting     1.119         1.653        1.552
CoT prompting          1.225         1.689        1.616
SCoT prompting         1.412         1.869        1.873
Relative Improvement   15.27%        10.66%       15.90%

Setup. To ensure the fairness of the evaluation, we follow the human evaluation settings of previous studies [10, 14]. We have carefully checked the evaluation settings and think our settings are reliable. Specifically, we manually evaluate the generated programs in the following aspects:
• Correctness (whether the program satisfies the requirement). 0 points: the program is totally inconsistent with the requirement. 1 point: the program is implemented, but misses some details. 2 points: the program is correctly implemented.
• Code Smell (whether the program contains bad code smells). 0 points: there is a serious code smell. 1 point: some details are not
We attribute the improvements to our proposed SCoT. The SCoT constrains LLMs to use program structures to generate intermediate reasoning steps. It allows LLMs to explore diverse solutions with three basic structures, improving the correctness of the code. Then, based on the SCoT, LLMs focus on implementing a program in a standardized way. Thus, the generated programs contain fewer code smells than the ones from baselines.

Figure 5 shows two programs generated by SCoT prompting and few-shot prompting, respectively. Both programs pass the unit tests. But the program from few-shot prompting contains a very complex statement (highlighted in Figure 5). Developers have to spend lots of effort to understand and maintain this program. In contrast, the program from SCoT prompting has good readability, and the SCoT clearly explains the behavior of the code. Developers can further use the SCoT as comments of the program for future maintenance.

Requirement:
# Return True is list elements are monotonically increasing or decreasing.

Few-shot prompting:
def monotonic(l: list):
    if all(l[i] <= l[i+1] for i in range(len(l)-1)) or all(l[i] >= l[i+1] for i in range(len(l)-1)):
        return True
    else:
        return False

SCoT prompting (SCoT & Source Code):
Input: l: list
Output: True or False
1: for each element in l do
2:     if the element is greater than its previous element then
3:         l increases
4:     if the element is less than its previous element then
5:         l decreases
6: if both increase and decrease then
7:     return False
8: else
9:     return True

def monotonic(l: list):
    increasing = False
    decreasing = False
    for i in range(1, len(l)):
        if l[i] > l[i-1]:
            increasing = True
        if l[i] < l[i-1]:
            decreasing = True
    if increasing and decreasing:
        return False
    else:
        return True

Figure 5: The comparison of programs generated by few-shot prompting and SCoT prompting.
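As a small illustration of the last point (our own sketch, not model output reported in the paper), the SCoT from Figure 5 can be kept directly as comments in the generated function for future maintenance:

def monotonic(l: list):
    # SCoT kept as comments:
    # Input: l: list; Output: True or False
    # for each element in l:
    #     if the element is greater than its previous element: l increases
    #     if the element is less than its previous element: l decreases
    # if l both increases and decreases: return False, else: return True
    increasing = False
    decreasing = False
    for i in range(1, len(l)):
        if l[i] > l[i-1]:
            increasing = True
        if l[i] < l[i-1]:
            decreasing = True
    return not (increasing and decreasing)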
Table 5: The Pass@k of CoT prompting and SCoT prompting with different example seeds.

Seed     CoT prompting                  SCoT prompting
         Pass@1   Pass@3   Pass@5       Pass@1   Pass@3   Pass@5
Seed A   53.29    69.76    75.52        60.64    73.53    77.32
Seed B   52.81    68.97    74.55        60.27    73.11    77.16
Seed C   51.36    67.44    73.62        59.36    72.88    76.79

Table 6: The Pass@k of CoT prompting and SCoT prompting with different annotators.

Annotator     CoT prompting                  SCoT prompting
              Pass@1   Pass@3   Pass@5       Pass@1   Pass@3   Pass@5
Annotator A   53.29    69.76    75.52        60.64    73.53    77.32
Annotator B   51.43    67.92    73.44        59.48    72.16    76.44
Annotator C   52.18    68.45    74.71        60.02    73.15    77.24

The improvements of SCoT prompting come from the program structures instead of specific details in the examples.

We also notice that there are slight variances in the performance of SCoT prompting under different settings. This is expected for prompting techniques that use examples, and similar variances can be found in CoT prompting. Nevertheless, SCoT prompting still outperforms CoT prompting in different settings.

Answer to RQ3: SCoT prompting is robust to examples. Under different example seeds or writing styles, SCoT prompting substantially outperforms the SOTA baseline - CoT prompting.

4.4 RQ4: What are the contributions of different program structures in SCoT prompting?
As stated in Section 2.1, SCoT prompting introduces basic structures (i.e., sequence, branch, and loop) and the input-output structure. This RQ is designed to analyze the contributions of different program structures.
Setup. We select ChatGPT as the base model. Then, we conduct an ablation study by independently removing the basic structures and the input-output (IO) structure. When removing the basic structures, we use a CoT with an IO structure as the intermediate steps. When removing the IO structure, the SCoT only contains a solving process with the basic structures.
Results. The results are shown in Table 4. "w/o" is the abbreviation of "without".
Analyses. (1) The three basic structures are beneficial for designing a feasible solving process. In Table 4, after removing the basic structures, the performance of SCoT prompting drops obviously. We carefully inspect failed cases and find that LLMs benefit from using the basic structures to clearly write a solving process. Figure 6 shows the intermediate steps of SCoT prompting and SCoT prompting without basic structures. SCoT prompting without basic structures uses CoTs, which sequentially describe how to write the code line by line and contain many ambiguities. For example, the scopes of the two iterations in lines 2 and 4 are unclear. LLMs are likely to misunderstand the CoT and generate incorrect code. In contrast, SCoT prompting uses the three basic structures to describe the solving process. The SCoT is clear and similar to code, benefiting the following code implementation.

SCoT prompting without basic structures:
Input: arry: list[list]
Output: result: int or float
1. Initialize a result with -999999
2. Iterate through the list of lists
3. Calculate the sum of the list
4. Update the result with the maximum of sum and result
5. Return the result

SCoT prompting:
Input: arry: list[list]
Output: result: int or float
1: Initialize a result with -999999
2: for _list in the list of lists:
3:     Calculate the sum of the _list
4:     Update the result with the maximum of sum and result
5: return the result

Figure 6: The comparison of SCoT prompting and SCoT prompting without basic structures.

(2) The IO structure benefits the requirement understanding. In Table 4, after deleting the IO structure, the performance of SCoT prompting has a slight decrease. We analyze failed cases and think the IO structure benefits the requirement understanding. Figure 7 shows two programs from SCoT prompting and SCoT prompting without the IO structure. We can see that SCoT prompting without the IO structure wrongly understands the output format and generates an incorrect program. After adding the IO structure, LLMs first reason about the input-output format and correctly return a boolean value.
Table 7: The comparison of SCoT-P prompting and SCoT prompting. The numbers in red denote SCoT prompting's relative improvements compared to SCoT-P prompting.

Approach               HumanEval                    MBPP                         MBCPP
                       Pass@1   Pass@3   Pass@5     Pass@1   Pass@3   Pass@5     Pass@1   Pass@3   Pass@5
CoT prompting          53.29    69.76    75.52      41.83    51.04    54.57      53.51    63.84    67.03
SCoT-P prompting       55.23    70.33    75.94      43.28    52.16    55.77      54.25    64.09    67.78
SCoT prompting         60.64    73.53    77.32      46.98    55.31    58.36      57.06    65.70    68.70
Relative Improvement   9.80%    4.55%    1.82%      8.55%    6.04%    4.64%      5.18%    2.51%    1.36%
SCoT prompting:
def test_duplicate(arraynums):
    # Input: arraynums, a list of integers
    # Output: True if exist duplicate element, False otherwise
    num_set = set(arraynums)
    if len(num_set) < len(arraynums):
        return True
    else:
        return False

Figure 7: The comparison of SCoT prompting and SCoT prompting without the IO structure.

[Figure 8 is a bar chart of Pass@1, Pass@3, and Pass@5 for ChatGPT, ChatGPT+CodeT, and ChatGPT+CodeT+SCoT.]
Figure 8: The complementarity between CodeT and SCoT prompting.
We can see that the performance of ChatGPT is continually improved by adding CodeT and SCoT prompting.
(2) Rank techniques rely on execution environments. Rank techniques require executing programs on test cases and using the execution results to rerank programs. In many realistic programming scenarios, users want to get code suggestions for an unfinished project. It is infeasible to execute auto-generated programs. Thus, we think rank techniques have limited application scenarios and additionally rely on the execution results. Our approach works in a general scenario and does not use execution results. Thus, it is unfair to directly compare SCoT prompting to rank techniques.

5.3 Threats to Validity
There are three main threats to the validity of our work.
(1) The generalizability of experimental results. To mitigate this threat, we carefully select the benchmarks, metrics, and baselines. Following previous studies [1, 2, 7], we pick three representative code generation benchmarks. They are hand-written or collected from real-world programming communities, and cover two popular languages (i.e., Python and C++). For evaluation metrics, we select a widely used metric - Pass@k - which utilizes test cases to check the correctness of programs. We use the unbiased Pass@k, which is more reliable [7]. For comparison baselines, we select the SOTA prompting techniques and conduct a comprehensive comparison in Section 4. SCoT prompting and the baselines have the same example seeds and maximum generation lengths.
(2) The impact of the two-step pipeline. CoT prompting generates a CoT and the code in one step. Our SCoT prompting generates the code in two steps. It first generates SCoTs and then generates the code. It is possible that the improvements come from the two-step pipeline. To address this threat, we have two considerations. First, the LLMs in our experiments are auto-regressive language models. For an auto-regressive language model, a one-step pipeline and a two-step pipeline are theoretically equivalent. Second, we conduct an ablation study in Section 4.4. We keep the two-step pipeline unchanged and remove the program structures. The results in Table 4 show that SCoT prompting without program structures has a significant drop in the Pass@k. It shows that the improvements of SCoT prompting are brought by the program structures instead of the two-step pipeline.
(3) The data leakage. Existing LLMs are trained with extensive code files from open-source communities. It is possible that their training data contains the experimental benchmarks, leading to data leakage. But we think that it does not affect the fairness of our experiments. In this paper, we select a specific LLM (e.g., ChatGPT) as the base model and apply different prompting techniques to it. Thus, the reported relative improvements between the baselines and our approach are credible. In the future, we will add the latest benchmarks to alleviate this threat.

Standard language models are pre-trained on a large-scale corpus with the next-token prediction objective. They are mainly used to continually complete the given context, such as code completion. After the success of the GPT series [4, 23, 24] in NLP, OpenAI fine-tuned GPT models on code to produce the closed-source Codex [7]. There followed many open-source replication attempts, e.g., CodeParrot [29], CodeGen [17], CodeGeeX [40], InCoder [8], StarCoder [15], and CodeT5+ [33].
Instruction-tuned models are models after instruction tuning [34]. Instruction tuning trains models to understand human users' instructions and perform tasks by following instructions. ChatGPT [18] is trained with human feedback [20] and is powerful on both natural language tasks and programming tasks. Many attempts have been made to train an "open-source ChatGPT". Alpaca [27] is LLaMA [28] tuned using self-instruct [32] and ChatGPT feedback. Code Alpaca [5] is LLaMA tuned using self-instruct and ChatGPT feedback, with instructions focusing on programming tasks. WizardCoder [16] is StarCoder [15] tuned using Evol-Instruct [36] and ChatGPT feedback, with Code Alpaca's dataset as the seed dataset. InstructCodeT5+ [33] is CodeT5+ [33] tuned on Code Alpaca's dataset.
Prompting Techniques. With the enormous number of parameters (e.g., Codex: 175 billion parameters), it is hard to directly fine-tune LLMs on code generation. Prompting techniques are a popular alternative, which leverages LLMs to generate code by inputting a special prompt.
Early on, researchers proposed zero-shot prompting and few-shot prompting. Zero-shot prompting concatenates a task instruction (e.g., please generate a program based on the requirement) and a requirement together, making a prompt. Based on zero-shot prompting, few-shot prompting further adds several ⟨requirement, code⟩ pairs to the prompt, so that LLMs can learn code generation from the given examples.
Chain-of-Thought (CoT) prompting [35] is a recently proposed prompting technique. CoT prompting asks LLMs first to generate CoTs (i.e., intermediate natural language reasoning steps) and then output the final code. It allows LLMs to first design a solving process that leads to the code. CoT prompting has achieved SOTA results in natural language generation and has sparked lots of follow-up research, such as self-consistency prompting [31] and least-to-most prompting [41]. But these prompting techniques are designed for natural language generation and bring slight improvements in code generation.
In this paper, we propose a novel prompting technique named Structured Chain-of-Thought (SCoT) prompting. Different from standard CoT prompting, SCoT prompting explicitly introduces program structures and asks LLMs to generate intermediate reasoning steps with program structures. We compare CoT prompting and SCoT prompting in Section 4. The results show that SCoT prompting significantly outperforms CoT prompting in three benchmarks.
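To make the differences among these prompt formats concrete, the sketch below assembles a few-shot prompt and the two SCoT prompts from ⟨requirement, SCoT, code⟩ examples (Section 2.3). It is our own illustration: the build_* function names, the "###" section markers, and the literal strings are assumptions, not the exact templates of Figures 3 and 4 or the replication package.

from typing import List, Tuple

def build_few_shot_prompt(examples: List[Tuple[str, str]], requirement: str) -> str:
    # Few-shot prompting: concatenate <requirement, code> pairs, then the new requirement.
    parts = [f"### Requirement:\n{req}\n### Code:\n{code}\n" for req, code in examples]
    parts.append(f"### Requirement:\n{requirement}\n### Code:\n")
    return "\n".join(parts)

def build_scot_prompt(examples: List[Tuple[str, str, str]], requirement: str) -> str:
    # SCoT prompting, step 1: <requirement, SCoT> examples, then the new requirement,
    # so the model first writes a structured chain of thought.
    parts = [f"### Requirement:\n{req}\n### SCoT:\n{scot}\n" for req, scot, _ in examples]
    parts.append(f"### Requirement:\n{requirement}\n### SCoT:\n")
    return "\n".join(parts)

def build_code_prompt(examples: List[Tuple[str, str, str]], requirement: str, scot: str) -> str:
    # SCoT prompting, step 2: the prompt ends with the new requirement and its
    # generated SCoT, and the model implements the final code.
    parts = [f"### Requirement:\n{req}\n### SCoT:\n{s}\n### Code:\n{code}\n" for req, s, code in examples]
    parts.append(f"### Requirement:\n{requirement}\n### SCoT:\n{scot}\n### Code:\n")
    return "\n".join(parts)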
This paper proposes a Structured CoT (SCoT) and presents a new prompting technique for code generation, named SCoT prompting. SCoT prompting asks LLMs to generate a SCoT using program structures (i.e., sequence, branch, and loop structures). Then, LLMs generate the code based on the SCoT. A large-scale study on three benchmarks shows that SCoT prompting significantly outperforms CoT prompting in Pass@k and human evaluation. Besides, SCoT prompting is robust to examples and obtains stable improvements.

In the future, we will explore new prompting techniques for code generation. For example, source code can be represented by a tree (e.g., an abstract syntax tree). We can design a tree-based prompting technique, which uses LLMs to generate a tree.

REFERENCES
[1] Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, et al. 2022. Multi-lingual Evaluation of Code Generation Models. arXiv preprint arXiv:2210.14868 (2022).
[2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021).
[3] Corrado Böhm and Giuseppe Jacopini. 1966. Flow diagrams, turing machines and languages with only two formation rules. Commun. ACM 9, 5 (1966), 366–371. https://doi.org/10.1145/355592.365646
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[5] Sahil Chaudhary. 2023. Code Alpaca: An Instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca.
[6] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022. CodeT: Code Generation with Generated Tests. https://doi.org/10.48550/ARXIV.2207.10397
[7] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. (2021). arXiv:2107.03374 [cs.LG]
[8] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Scott Yih, Luke Zettlemoyer, and Mike Lewis. 2023. InCoder: A Generative Model for Code Infilling and Synthesis. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=hQwb-lbM6EL
[9] GitHub. 2022. GitHub Copilot. https://github.com/features/copilot.
[10] Yiyang Hao, Ge Li, Yongqiang Liu, Xiaowei Miao, He Zong, Siyuan Jiang, Yang Liu, and He Wei. 2022. AixBench: A Code Generation Benchmark Dataset. CoRR abs/2206.13179 (2022). https://doi.org/10.48550/arXiv.2206.13179 arXiv:2206.13179
[11] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The Curious Case of Neural Text Degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=rygGQyrFvH
[12] Jeevana Priya Inala, Chenglong Wang, Mei Yang, Andres Codas, Mark Encarnación, Shuvendu Lahiri, Madanlal Musuvathi, and Jianfeng Gao. 2022. Fault-aware neural code rankers. Advances in Neural Information Processing Systems 35 (2022), 13419–13432.
[13] Jia Li, Ge Li, Zhuo Li, Zhi Jin, Xing Hu, Kechi Zhang, and Zhiyi Fu. 2023. CodeEditor: Learning to Edit Source Code with Pre-Trained Models. ACM Trans. Softw. Eng. Methodol. (May 2023). https://doi.org/10.1145/3597207 Just Accepted.
[14] Jia Li, Yongmin Li, Ge Li, Zhi Jin, Yiyang Hao, and Xing Hu. 2023. SkCoder: A Sketch-based Approach for Automatic Code Generation. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 2124–2135. https://doi.org/10.1109/ICSE48619.2023.00179
[15] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
[16] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv preprint arXiv:2306.08568 (2023).
[17] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. arXiv preprint (2022).
[18] OpenAI. 2022. ChatGPT. https://openai.com/blog/chatgpt.
[19] OpenAI. 2022. Codex. https://beta.openai.com/docs/api-reference/completions.
[20] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
[21] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
[22] Han Peng, Ge Li, Wenhan Wang, Yunfei Zhao, and Zhi Jin. 2021. Integrating Tree Path in Transformer for Code Representation. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (Eds.). 9343–9354. https://proceedings.neurips.cc/paper/2021/hash/4e0223a87610176ef0d24ef6d2dcde3a-Abstract.html
[23] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
[24] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. (2019).
[25] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence Level Training with Recurrent Neural Networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1511.06732
[26] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 4149–4158. https://doi.org/10.18653/v1/n19-1421
[27] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
[28] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971 (2023).
[29] Lewis Tunstall, Leandro Von Werra, and Thomas Wolf. 2022. Natural Language Processing with Transformers. O'Reilly Media, Inc.
[30] Wenhan Wang, Ge Li, Sijie Shen, Xin Xia, and Zhi Jin. 2020. Modular Tree Network for Source Code Representation Learning. 29, 4, Article 31 (Sep 2020), 23 pages. https://doi.org/10.1145/3409331
[31] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=1PL1NIMMrw
[32] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 13484–13508. https://aclanthology.org/2023.acl-long.754
[33] Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. 2023. CodeT5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922 (2023).
[34] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned Language Models are Zero-Shot Learners. In International Conference on Learning Representations. https://openreview.net/forum?id=gEZrGCozdqR
[35] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (Eds.).