
Fully Autonomous Programming with Large Language Models

Vadim Liventsev∗† (TU Eindhoven & Philips Research, The Netherlands)
Anastasiia Grishina∗ (Simula & University of Oslo, Norway)
Aki Härmä (Philips Research Eindhoven, The Netherlands)
Leon Moonen (Simula & BI Norwegian Business School, Oslo, Norway)

arXiv:2304.10423v1 [cs.SE] 20 Apr 2023

ABSTRACT
Current approaches to program synthesis with Large Language Models (LLMs) exhibit a "near miss syndrome": they tend to generate programs that semantically resemble the correct answer (as measured by text similarity metrics or human evaluation), but achieve a low or even zero accuracy as measured by unit tests due to small imperfections, such as the wrong input or output format. This calls for an approach known as Synthesize, Execute, Debug (SED), whereby a draft of the solution is generated first, followed by a program repair phase addressing the failed tests. To effectively apply this approach to instruction-driven LLMs, one needs to determine which prompts perform best as instructions for LLMs, as well as strike a balance between repairing unsuccessful programs and replacing them with newly generated ones. We explore these trade-offs empirically, comparing replace-focused, repair-focused, and hybrid debug strategies, as well as different template-based and model-based prompt-generation techniques. We use OpenAI Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem descriptions and tests for evaluation. The resulting framework outperforms both conventional usage of Codex without the repair phase and traditional genetic programming approaches.

CCS CONCEPTS
• Software and its engineering → Software design engineering; • Computing methodologies → Neural networks; Model development and analysis; Search methodologies.

KEYWORDS
automatic programming, large language models, program repair

ACM Reference Format:
Vadim Liventsev, Anastasiia Grishina, Aki Härmä, and Leon Moonen. 2023. Fully Autonomous Programming with Large Language Models. In Genetic and Evolutionary Computation Conference (GECCO '23), July 15–19, 2023, Lisbon, Portugal. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3583131.3590481

∗ Both authors contributed equally to this research.
† Corresponding author.

This work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
GECCO '23, July 15–19, 2023, Lisbon, Portugal
© 2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0119-1/23/07.
https://doi.org/10.1145/3583131.3590481

1 INTRODUCTION
Automatic programming has been an important goal of the Artificial Intelligence field almost since its inception [1], promising to reduce the workload of software developers by automatically solving some of the tasks they face. More recently, program synthesis has emerged as an interpretable alternative [2] to black-box machine learning methods that lets human experts understand, validate and edit the algorithms generated by artificial intelligence. In addition to the scientific benefits of such knowledge, it extends the benefits of machine learning to domains such as embedded systems, where it is technically challenging [3], or healthcare, where it is avoided for safety reasons [4, 5].

The predominant methodology in automatic programming has shifted from deductive programming [6, 7] to genetic and evolutionary methods [8] to, more recently, large autoregressive language models trained on corpora of source code due to their remarkable capability for zero-shot generalization [9]. However, even state-of-the-art models fine-tuned on a specific class of programming tasks still require a costly filtering step where the LLM outputs that do not compile or pass tests are discarded [10]. These outputs tend to be superficially similar to correct solutions [11] despite failing to produce the expected output, a phenomenon known as "near miss syndrome" or "last mile problem" [12].

Given these challenges, research in machine learning on source code [13] tends to focus on restricted domain-specific languages [14–16] or automating specific parts1 of the software development process [19, 20] such as code search [21], code translation [22], detection of issues [23, 24], improvement [25] and repair [26], rather than fully autonomous programming in a programming language popular with human developers [27]. However, two recent innovations potentially make the latter task tractable.
1 similarly to autonomous driving [17, 18]

One is Synthesize, Execute, Debug [28], a framework that attempts to bridge the "last mile" gap by introducing program repair into the program synthesis algorithm. A programming task is specified using both a natural language description and a set of input/output (I/O) pairs demonstrating what output is expected of the program, thereby combining the text-to-code [29] and programming-by-example [30, 31] paradigms typical for competitive programming [32]. Synthesize, Execute, Debug creates a first draft program using a generative model, then compiles and executes it with the given input examples. This is followed by a program repair step to fix the identified errors.

Another relevant innovation is instruction-driven large language models [33]. Instruction-driven models use human feedback in their training process and admit two inputs: a source text (or code) and a textual command instructing the model to edit the source in a particular way, i.e., "summarize" or "translate to Python". These models have been shown to be highly successful in automatic program repair [34]. However, given the free-form nature of these instructions,2 how one should engineer instructions that maximize repair performance is an open question.
2 Throughout this paper we avoid other definitions of instruction, such as an individual operation in code, to prevent ambiguity.
Figure 1: Framework for LLM-based Synthesize, Execute, Instruct, Debug, and Rank approach.

Section 2 presents a framework that adapts Synthesize, Execute, Debug to instruction-driven Large Language Models for solving programming tasks in an autonomous fashion. We discuss related work in Section 3 and introduce experiments to establish optimal search and prompting strategies for this framework in Section 4. Finally, we demonstrate in Section 5 that our framework outperforms conventional automatic programming techniques, such as genetic programming and naive application of large language models that generate one solution per problem without updating it iteratively.

2 METHODOLOGY
The proposed framework, Synthesize, Execute, Instruct, Debug and Rank, or SEIDR,3 is summarized in figure 1. To solve a programming task defined as a text description and a collection of I/O examples, we split the I/O examples into prompt and validation sets and use the prompt set in a large language model to SYNTHESIZE a population of candidate solutions. We EXECUTE the solutions, test them against the validation set, and generate a text description of the identified problems used to INSTRUCT a large language model to produce repaired candidate solutions, similar to the way a human developer DEBUGs a program. We RANK the candidates by correctness measured by matching I/O pairs, discard the worst candidates, and repeat until a fully correct solution is found.
3 seiðr also refers to a type of Norse magic [35] pertaining to predicting and controlling the future, which we deem thematically appropriate.

2.1 Ingredients
SEIDR makes use of two instruction-driven large language models for source code: a synthesis model p_synth(input, instr) and a debugging model p_debug(input, instr), as well as, optionally, a large natural language model p_text(input) that can be used for writing instructions for the code model. Each model is a highly parameterised probability distribution over the space of (input, instruction)-tuples with parameters estimated on a large diverse (i.e., non-task-specific) corpus. This stochastic nature of language models is an important prerequisite for SEIDR, since it lets us sample batches of diverse candidate solutions from p_synth(input, instr), p_debug(input, instr), and p_text(input). We have chosen state-of-the-art transformer models [36] for p_synth(input, instr), p_debug(input, instr), and p_text(input) in our experiments, as described in Section 4.5. In general, SEIDR requires a sequence-to-sequence generative model for these blocks.

2.2 Synthesize
The framework starts with the SYNTHESIZE block, which is responsible for generating initial draft solutions to programming tasks to be repaired in the later stages of SEIDR. We start with a basic template for a chosen programming language that contains a number of standard library imports and an empty main function or this language's equivalent thereof, see figure 2. We populate this template with a comment indicating a text description of the task at hand and several I/O examples from the prompt training set. We design the templates with guidelines by the authors of the language model [37] and prior work [38] in mind. We then sample N programs from p_synth(input, instr), setting input to the populated template and instruction to the problem description. We use temperature sampling with a monotonically increasing temperature schedule where the i-th program is sampled with temperature t_i ≈ i/N (approximate equality enables efficient implementation by means of batching). Thus, the sampling procedure for the first programs approximates deterministic maximum likelihood estimation. Ultimately, this approach ensures that samples are diverse, but always contain the likeliest programs.

Figure 2: Anatomy of SYNTHESIZE templates. The Python and C++ templates consist of a standard preamble with useful imports, the task description and I/O examples as a comment, and an empty "main" block.
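For concreteness, a minimal sketch of how the SYNTHESIZE step could look in Python is given below. It populates a Figure 2-style template for the fizz-buzz example and samples n drafts with the increasing temperature schedule t_i ≈ i/N. The call to the legacy OpenAI Edits endpoint (openai.Edit.create with code-davinci-edit-001, see Section 4.5) and the exact template text are illustrative assumptions, not the authors' implementation.

import openai  # legacy (pre-1.0) OpenAI SDK; the Edits endpoint has since been discontinued

# Assumed Python template in the spirit of Figure 2: preamble, task
# description with one I/O example as a docstring, and an empty main block.
PYTHON_TEMPLATE = '''import os
import sys

"""
Given an integer x, return "Fizz" if x is divisible by 3, "Buzz" if x is
divisible by 5, "FizzBuzz" if x is divisible by 3 and 5, and a string
version of x if none of the above hold.
For example, input: 3  output: Fizz
"""

if __name__ == '__main__':
    pass
'''

def synthesize(task_description: str, n: int) -> list:
    """SYNTHESIZE: sample n draft programs from p_synth(input, instr) with a
    monotonically increasing temperature schedule t_i ~ i/n, so early drafts
    stay close to the likeliest program and later drafts are more diverse."""
    drafts = []
    for i in range(n):
        response = openai.Edit.create(
            model="code-davinci-edit-001",      # Codex-edit (Section 4.5)
            input=PYTHON_TEMPLATE,              # populated template (Figure 2)
            instruction=task_description,       # problem description
            temperature=i / max(n, 1),
        )
        drafts.append(response["choices"][0]["text"])
    return drafts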
Figure 3: INSTRUCT blocks with and without an LLM.


2.3 Execute
In the EXECUTE block, the programs are compiled (if necessary) and launched using the standard tools for the programming language. The program is run once for every I/O pair in the validation set. Its stdin stream receives all the input lines in a given input pair, and its stdout and stderr streams are captured and saved. We then measure the score of the program, defined as accuracy over output lines, with O being the expected output and n = max{|O|, |stdout|}:

score(O, stdout) = ( Σ_{i=1..n} I[stdout_i = O_i] ) / n

unless stderr is non-empty during compilation or execution, which is considered to indicate failure and is assigned a score of 0.
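Read as code, the scoring rule could be implemented as the following minimal sketch; the function name and signature are ours, only the formula and the stderr rule come from the description above.

def execute_score(expected: str, stdout: str, stderr: str) -> float:
    """Line-level accuracy of the captured stdout against the expected
    output O, with n = max(|O|, |stdout|); a non-empty stderr at compile
    or run time is treated as failure and scored 0."""
    if stderr.strip():
        return 0.0
    expected_lines = expected.splitlines()
    stdout_lines = stdout.splitlines()
    n = max(len(expected_lines), len(stdout_lines))
    if n == 0:
        return 1.0  # both outputs empty: trivially matching
    matches = sum(1 for o, s in zip(expected_lines, stdout_lines) if o == s)
    return matches / n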
2.4 Instruct
The goal of the INSTRUCT block is to provide an instruction that summarizes a bug in the program candidate for p_debug(input, instr). The resulting instruction with the bug summary should indicate what requirement is violated and instruct the LLM to edit the candidate program so that the candidate meets the violated requirements. In SEIDR, we generate instructions using template engines. In general, template engines replace placeholders in files or strings with input values and return a formatted string. With template engines, we can create a number of templates that will be adapted dynamically based on the results of program candidate execution.

We consider two different designs of the instruction generation block: INSTRUCTstatic and INSTRUCTLLM, shown in figure 3. Both types of blocks use failing I/O pairs from the validation set and the stderr output of the candidate execution. In both blocks, if stderr is not empty, i.e., execution errors occur before getting the output to compare it with the expected output, the stderr-based template engine generates an instruction to fix the error mentioned in stderr. However, the blocks differ in the way they transform failing I/O pairs to generate instructions in case stderr is empty.

INSTRUCTstatic uses a fixed input template and substitutes placeholders for input and output with the corresponding strings of the first failing test case. We show the resulting instruction for an exemplar template in figure 3. By contrast, INSTRUCTLLM uses the failing I/O pair in the LLM for text completion, thereby prompting the text LLM to produce the bug summary. An exemplar output of the code behavior template engine in figure 3 describes that the code returns output O instead of the expected output Oval for the failing test case with input string Ival. The LLM is then prompted to auto-complete this description of program behavior with the bug summary. The bug description is passed further to the next template engine and used as the debugging instruction, such as "Fix {bug summary}".

2.5 Debug
The DEBUG block iterates over all programs in the population and uses the instruction written by INSTRUCT based on the results of EXECUTE to sample from p_debug(input, instr) N times to repair every candidate, setting input to the candidate solution and instruction to the output of INSTRUCT. The population of candidates is then replaced with the output of DEBUG.

2.6 Rank
Finally, the RANK block implements what is known in genetic programming as parent selection [39]. It ranks all programs in the candidate population by their score calculated in EXECUTE, keeps the top W programs, and removes the rest from the population.

2.7 Meaning of Hyperparameters
After evaluating a given candidate solution in EXECUTE, SEIDR supports two approaches to addressing the candidate's flaws:
• Replace the candidate with another sample from the current population.
• Use INSTRUCT and DEBUG to repair the candidate.
We refer to this problem as the repair-replace trade-off, by analogy with production economics [40].

How does the choice of hyperparameters N and W influence the flow of SEIDR? N and W act as upper bounds on the replace option by limiting the size of the population. In the edge cases, N = W = 1 corresponds to a repair-only process, while N = W = ∞ corresponds to replace-only, see figure 4.

Observe that a mutation-only genetic algorithm with population size W, such as SEIDR, is equivalent to local beam search with beam width W on an N-ary tree [41, Section 4.1.4]. This corresponds to a known property of local beam search: it degenerates into depth-first search when W = 1, whereas setting W = ∞ yields breadth-first search. Hence, we refer to N as tree arity and W as beam width.
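The interplay of tree arity and beam width can be summarized in a schematic search loop such as the sketch below. It is a generic local beam search parameterized over the SYNTHESIZE, INSTRUCT, DEBUG, and EXECUTE steps (passed in as callables), not a faithful copy of the authors' implementation.

from typing import Callable, List

def seidr_search(
    synthesize: Callable[[int], List[str]],        # SYNTHESIZE: n draft programs
    instruct: Callable[[str], str],                # INSTRUCT: instruction for a candidate
    debug: Callable[[str, str, int], List[str]],   # DEBUG: n repaired variants of a candidate
    score: Callable[[str], float],                 # EXECUTE: validation score in [0, 1]
    tree_arity: int,                               # N: children sampled per candidate
    beam_width: int,                               # W: candidates kept by RANK
    max_programs: int = 1000,
) -> str:
    """Local beam search over program candidates: W = 1 degenerates into
    depth-first search (repair-only), unbounded W and N into breadth-first
    search (replace-only)."""
    population = synthesize(tree_arity)
    generated = len(population)
    best, best_score = population[0], score(population[0])
    while population:
        # EXECUTE + RANK: keep the top-W candidates, discard the rest.
        beam = sorted(population, key=score, reverse=True)[:beam_width]
        if score(beam[0]) > best_score:
            best, best_score = beam[0], score(beam[0])
        if best_score == 1.0 or generated >= max_programs:
            break  # solved, or the candidate budget is exhausted
        # INSTRUCT + DEBUG: replace the population with repaired children.
        population = []
        for candidate in beam:
            children = debug(candidate, instruct(candidate), tree_arity)
            population.extend(children)
            generated += len(children)
    return best

Setting tree_arity = beam_width = 1 reduces the loop to iteratively repairing a single draft, while setting both to the candidate budget amounts to replacing drafts without repair, matching the edge cases in figure 4.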
Figure 4: Repair-replace trade-off as a tree search problem.

3 RELATED WORK
The use of large language models for a program repair step within a program synthesis pipeline has been studied by Joshi et al. [42] and Gupta et al. [28], while a specific case of instruction-driven LLMs has been explored by Fan et al. [34]. The latter authors also compare instruction-driven LLMs to other Automated Program Repair (APR) strategies, and conclude that LLMs solve the task effectively. However, they do not consider the entire Synthesize, Execute, Debug framework, as we do in this work. By contrast, LLMs are usually set up to generate a fix from the first debug attempt and do not update failing patches iteratively.

Prompt generation for LLMs pre-trained on code is investigated in the prompt engineering literature on a number of tasks, such as source code repair, type inference, code synthesis, and autocompletion [43–45]. Studies on program repair automation used prompts that contain code only, code with docstrings, and code with bug hints with the Codex model to test the repair capabilities of the LLM on the QuixBugs benchmark [46, 47]. The studies reported that bug localization hints were not helpful, whereas providing buggy code and the task summary was the most effective. ChatGPT was tested on QuixBugs in addition to Codex models as well [48]. Kuznia et al. [49] proposed to summarize task descriptions of competitive programming and interview programming tasks. Following guidance from these studies, we include I/O pairs and task descriptions as docstrings in addition to the function signature in our prompts.

4 EXPERIMENTAL SETUP
To explore the capabilities of SEIDR, we test the framework on the benchmark of problems for code competitions with different types of instructions for program candidate debugging, varied search strategies, and two languages, Python and C++. Our experiments use the Program Synthesis Benchmark 2 (PSB2) for problem descriptions and tests to evaluate the proposed framework [50]. We compare the performance of programs synthesized with SEIDR to the PushGP genetic programming system with down-sampled lexicase selection [51]. During our empirical evaluation of SEIDR performance, we address the following research questions:

RQ1. Repair-replace trade-off exploration: What is the impact of using different tree search strategies in the autonomous programming setting? We experiment with four different tree arities in the tree search and study their impact on the number of resolved problems as well as the speed of obtaining solutions.

RQ2. Prompt engineering: What is the effect of using LLM-produced bug summaries compared to static instructions on the repair performance of automatically synthesized code? We test six static debug instructions that describe bug behavior based on violated requirements and five auto-generated debug prompts.

4.1 Data
PSB2 is a benchmark suite of 25 problems for program synthesis that resemble small real-world tasks. PSB2 was developed as a more realistic and challenging version of PSB1 [52]; the latter consists of textbook problems and is widely used in genetic programming [53]. The problems require different data structures and control flows to be used for effective solutions and are taken from sources such as competitive programming platforms and educational courses. The problems have descriptions in English, as well as 1 million (M) tests for training and 1M testing-stage tests, including edge or corner cases that test the resulting program on complicated inputs. The tests are provided as I/O pairs and are distributed together with the problem descriptions as a PyPI package.4
4 https://pypi.org/project/psb2/

In PSB1 [52], the training set consists of the edge test cases and is augmented by random test cases if the number of edge tests is not enough. The test set is formed by random test cases. This terminology is preserved in PSB2. However, we do not have a training or fine-tuning phase in our experiments, because the models are not made available for further training. Instead, we validate the framework with an existing pre-trained LLM for code and text as its parts. Therefore, we only have the validation and test phases. We will refer to training test cases in the PSB terminology as validation test cases in this study.

4.2 Repair-replace Trade-off Settings
As described in Section 2.7, the choice of beam width W and tree arity N defines the repair-replace trade-off, where higher W and N shift the balance from repair towards replace. We evaluate four options for these hyperparameters, as shown in table 1.

Table 1: Tree search hyperparameters.
experiment      1    2     3    4
beam width W    1   10   100    ∞
tree arity N    1   10   100    ∞

Because we aim to compare tree search parameters, we fix one default debugging instruction and use the INSTRUCTstatic block. Moreover, we set the upper limit for the total number of generated program candidates to 1000 to limit the experimentation time.
Although some solutions may not be found within the hard limit, we assume5 that 1000 program candidates form a sufficiently large search space for our experiments. W = N = ∞ is achieved in the implementation by setting W and N equal to the upper limit on the number of candidates, i.e., 1000. This setting ensures that a second generation of programs does not exist.
5 This assumption is later confirmed in Section 5.1.

4.3 Prompting Strategies
The prompt for the LLM model p_debug(input, instr) consists of the input for editing (the candidate program generated so far) and a debug instruction to repair the candidate. We test SEIDR on 11 debug instructions to explore whether the use of the LLM for text completion p_text(input) benefits the performance of our framework, as well as what effect different prompt phrases have on the debug process. We compare debug instructions that use neutral phrases with those that use more confident language and mimic experienced software developers, as well as shorter and longer instructions with different amounts of detail about code behavior. To alleviate the effect of beam width and tree arity, we set N = W = 1 and test the repair-only tree search strategy shown in figure 4. This strategy is used to gradually improve one program candidate throughout the search with no competing programs in the same generation.

The debug instructions are formulated as templates. The instructions describe the violated requirements in terms of the wrong output in a failing I/O test or summarize the bug to capture issues in code logic. We present debug instructions using the template engine format: the brackets { } denote that the placeholder in the brackets will be replaced with the value generated during execution; {Ival} and {Oval} stand for the values of a failing I/O pair from the validation set. As shown in figure 3, the instruction to fix execution errors, which abort the program before the resulting output is obtained, is built from the stderr lines: Fix {stderr}. Static debug instructions that do not use the LLM for bug summarization are as follows:
S0 Make sure that {Ival} -> {Oval};
S1 Make sure the code returns {Oval} for input {Ival};
S2 Ensure that input {Ival} yields output {Oval};
S3 Modify code to get {Oval} from {Ival};
S4 Code must correspond instructions in comments and {Ival} must yield {Oval};
S5 See comments in code and return {Oval} for input {Ival}.
The instruction S0 is the default instruction for the tree arity experiments. It has an intuitive symbolic notation (->) instead of the word "return" or "yield". In instructions S1–S3, we experiment with verbs and the order of output and input. Alternatively, in debug instructions S4–S5, we prompt the model to consider the task description in the docstring in addition to providing the details of the failing I/O pair. Overall, instructions S0–S5 indicate the requirements to be met, but do not describe the current program's behavior.
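As an illustration of the template-engine mechanism, the sketch below instantiates the stderr template and the default S0 template with Python's str.format; the helper and its argument names (i_val, o_val) are ours, while the template strings are taken from the list above.

STDERR_TEMPLATE = "Fix {stderr}"
S0_TEMPLATE = "Make sure that {i_val} -> {o_val}"   # {Ival}/{Oval} renamed for valid identifiers

def static_instruction(i_val: str, o_val: str, stderr: str) -> str:
    """INSTRUCTstatic: if the candidate crashed, ask the model to fix the
    reported error; otherwise describe the violated I/O requirement."""
    if stderr.strip():
        return STDERR_TEMPLATE.format(stderr=stderr.strip())
    return S0_TEMPLATE.format(i_val=i_val, o_val=o_val)

# Example: a candidate that prints "3" instead of "Fizz" for input 3
print(static_instruction("3", "Fizz", ""))   # -> Make sure that 3 -> Fizz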
The second set of instructions uses the LLM for text completion p_text(input). The instructions are designed so that the LLM is prompted to complete a sentence that should describe an error. In addition to the validation I/O pairs, the following notation is used: {Op} denotes the program candidate output for input {Ival}, and {task} is a placeholder for a problem description in English. Note that we do not include the incorrect output Op of a generated candidate program in debug instructions S0–S5, because it is recommended to avoid asking the model what not to do.6 We denote the text completion LLM's output as {bug}, which should constitute the bug summary. Input templates to use the LLM for bug description, followed by debugging instruction templates (after "→"), are as follows:
M6 The code should solve the following problem: {task}. The code must return {Oval} for input {Ival} but it returns {Op}. Obviously, the error is that...
→ Fix {bug};
M7 The code should solve the following problem: {task}. The code must return {Oval} for input {Ival} but it returns {Op}. The error is that...
→ Fix {bug};
M8 Problem description: {task}. The code must return {Oval} for input {Ival}, but it returns {Op}. It is clear the error is that...
→ Fix {bug};
M9 There is clearly a bug in code, because the code returns {Op} for input {Ival} but output {Oval} is expected. The bug is that...
→ Fix {bug};
M10 There is clearly a bug in code, because the code returns {Op} for input {Ival} but output {Oval} is expected. The bug is that...
→ Fix {bug} and modify the code to return {Oval} for input {Ival}.
Note that the text completion LLM does not use program candidates in its input, but only the template inputs M6–M10 before the arrow. Input M6 for the text completion LLM is used to evaluate the effect of the "confidence" sentiment on the bug summaries and debugging process. It is identical to input M7, except for the word "obviously", which should reflect the confidence of the comment's author. Inputs M7 and M8 can be compared in the way the problem description is introduced, i.e., as a separate sentence similar to a spoken situation in prompt M7 or as a short title in M8.

Input templates M9 and M10 for the text completion LLM are identical, but the instruction templates are different. The text completion inputs start with a "confidently" phrased statement that a bug is present in code. We include both the LLM output {bug} and the description of the failing validation test case in debug instruction M10. Therefore, instructions M6–M9 rely mainly on the LLM output to summarize the bug, whereas instruction M10 also provides information about the expected output.
6 https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
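The LLM-assisted instructions can be assembled analogously. The sketch below builds the M7 input, lets a text-completion model finish the sentence with the bug summary, and wraps the result in the debugging instruction; the completion call is passed in as a function so that no particular API is assumed, and the helper name is ours.

from typing import Callable

M7_INPUT = (
    "The code should solve the following problem: {task}. "
    "The code must return {o_val} for input {i_val} but it returns {o_p}. "
    "The error is that..."
)

def llm_instruction(
    task: str, i_val: str, o_val: str, o_p: str,
    complete: Callable[[str], str],       # text-completion LLM, e.g. GPT-3
) -> str:
    """INSTRUCTLLM with template M7: prompt the text model to finish the
    sentence describing the bug, then wrap the summary in the debugging
    instruction "Fix {bug}" for the code-editing model."""
    prompt = M7_INPUT.format(task=task, i_val=i_val, o_val=o_val, o_p=o_p)
    bug_summary = complete(prompt).strip()
    return f"Fix {bug_summary}"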
4.4 Performance Indicators
In our experiments, we compare the number of fully solved problems obtained using SEIDR with different values of the hyperparameters. For a more detailed analysis of results, we use test pass rate (TPR) and Excess Programs Generated (EPG). TPR reflects the percentage of fully passed test cases based on the exact match of program output and test output. The TPR metric is used for the final evaluation of generated programs and does not reflect partial passing of an I/O test, as opposed to the score in the RANK block.

The DEBUG and EXECUTE blocks generate a number of programs that are replaced or repaired during the search for a solution program. The number of programs generated before the first occurrence of a program that passes all validation test cases is referred to as EPG.
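Read literally, the two indicators could be computed as in the sketch below; the function names and the assumption of a chronological list of validation scores per problem-solving attempt are ours.

def test_pass_rate(expected_outputs: list, actual_outputs: list) -> float:
    """TPR: fraction of test cases whose output matches exactly
    (no credit for partially correct outputs, unlike the RANK score).
    Assumes a non-empty test set."""
    passed = sum(1 for e, a in zip(expected_outputs, actual_outputs) if e == a)
    return passed / len(expected_outputs)

def excess_programs_generated(candidate_scores: list) -> int:
    """EPG: number of candidates generated before the first one that
    passes all validation test cases (score 1.0); EPG = 0 means the
    initial draft already solved the problem."""
    for index, score in enumerate(candidate_scores):
        if score == 1.0:
            return index
    return len(candidate_scores)  # no solution found within the budget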
Figure 5: Number of solved PSB2 problems depending on the tree arity in tree search for the fixed prompt type S0.

Figure 6: Distribution of the number of generated programs during each problem-solving attempt in the experiments with different tree arities where a problem solution is found. (a) 0 ≤ EPG ≤ 10 with step 1; (b) 0 ≤ EPG ≤ 1000 with step 100.

EPG is indicative of the computational cost of solving a problem in terms of LLM inferences and program compilations and executions.

4.5 Implementation Details
We use GPT-3 models pre-trained on code7 and text8 as LLMs in our framework. Specifically, we use Codex-edit (code-davinci-edit-001) as the LLM for draft programs p_synth and the LLM for debugging p_debug, and GPT-3 (text-davinci-003) for bug summarization via text completion with p_text. We ensure that the program candidates generated from the same parent program are different from each other by changing the temperature parameter of Codex-edit.
7 https://platform.openai.com/docs/guides/code/editing-code
8 https://platform.openai.com/docs/models/gpt-3

We use 2000 I/O pairs from the test split of PSB2 to evaluate the candidate program that has passed all the validation test cases during debugging. Due to repetitive calls to the EXECUTE block, we have to resolve the trade-off between testing speed and precision while choosing the number of validation test pairs. We resolve the trade-off by fixing the validation set size at 100.

In the experiments with tree arity values, we set the limit to generate a maximum of 1000 program candidates during the search for the candidate that passes all validation tests. If we reach 1000 candidates and none of them passes all validation tests, we report the test pass rate for the last generated candidate. In the experiments with prompts, we set the limit of maximum generated programs to 5, because we search for the prompt that yields the fastest solution, to exclude long searches and to comply with the request rate limits.

5 RESULTS AND DISCUSSION

5.1 RQ1. Repair-replace Trade-off Exploration
We compare the number of solved problems in the experiments with tree arity of 1, 10, 100, and ∞ and fixed debug instruction S0 in Python and C++ in figure 5. The results of SEIDR are compared to the baseline performance of PushGP on the PSB2 benchmark, which solves 17 out of 25 problems. Note that experiments with N = 1 and N = ∞ can be considered as ablation studies, where the replace option and the repair option are turned off, respectively.

The results highlight the benefit of compromise strategies with tree arity of 10 and 100 over the repair-only (N = 1) and replace-only (N = ∞) strategies. The repair-only scheme is outperformed by the other strategies. We explain the poor performance of the repair-only strategy by the fact that the search space is under-explored. Specifically, the replace scenario ensures that the LLM for debugging, represented by Codex-edit in our experiments, generates different updates of program candidates using variable temperature. The probability of finding a better fix is higher when more alternatives are generated to update the draft program at N > 1 compared to N = 1. The search strategy with N = 10 yields the best results: it performs on par with PushGP for C++ and outperforms the baseline during Python program synthesis by +2 problems, resulting in a total of 19 programs that pass all test cases. The results imply that generating a moderate number of programs in parallel during the DEBUG step works better than the policies in which more updates are generated for each program (100 or 1000) or only one program is updated iteratively.

We present an analysis of the solution speed for all four arities and the fixed default debug instruction in figure 6. In detail, we show the distribution of EPG values in all experiments to explore how many candidate updates are generated before the solution is found. We zoom in on the cases with solutions found within the first 10 program candidates in figure 6a and show the EPG distribution with a step of 100 candidates in figure 6b.

Out of 100 experiments for each language, in 21–24% of runs in Python and C++, the draft program is already the solution (EPG = 0). For 19–32% of experiments, the solution is found after discarding 5 candidates. Around half of the experiments do not generate more than 100 programs. However, 5 problems are solved with more than 500 generated programs in Python and 1 problem in C++ (with N = 10). The results imply that the first steps in the update of the draft program are crucial for solving the problem. The chances of solving the problem in the later stages of the search, such as after 100 programs have been generated, are low. This confirms our initial assumption in Section 4.2 that 1000 programs are sufficient.
Figure 7: Number of solved PSB2 problems depending on the instruction choice for the fixed tree arity of 1.

Answer to RQ1. SEIDR outperforms the PushGP baseline on PSB2 in Python and performs on par with it in C++ experiments with tree arity of 10. Search strategies with tree arity larger than one benefit from the replace possibility of the SEIDR framework as a consequence of using variable temperature for Codex-edit. The repair component is also crucial for the framework because the replace-only search policy (with tree arity of ∞) performs worse than the policies that alternate between replace and repair during program update (with tree arity of 10 or 100).

5.2 RQ2. Prompt Engineering
We report the number of solved problems for different static and GPT-assisted debug instructions in figure 7. Because debug instructions are parts of prompts for LLMs and the program candidate format does not change, we will use the term prompt during the analysis of experiment results with different instructions. Overall, the performance of the framework is robust to the debug prompt choice, both with LLM-generated and static templates. The number of solved problems differs for Python and C++ in our experiments.

For C++, all debug prompts except S2 result in the same or higher performance than the instruction S0, which is used in the repair-replace trade-off experiments. The debug instruction S2 contains the verbs "yield" and "ensure", which are probably rarely used in code documentation. The best debug instruction for C++ is the LLM-assisted template M6 containing the word "obviously", which should indicate the confidence of the author of the bug summary whom GPT-3 should mimic during autocompletion.

Python programs do not show the same effect during experiments with different prompts. The overall performance drops in comparison with using the prompt S0. By limiting the total number of generated programs from 1000 to 5 in the current set of experiments, we lose 2 problem solutions in Python with S0. The prompt that results in the best performance in C++ for the EPG limit of 5 corresponds to the worst performance in Python. This result can occur due to the small tree arity and low variability of debugging updates of the initial draft. Another reason is that the GPT-3 summary of bugs may not point to logical errors. The model for text autocompletion frequently outputs bug summaries that mention that "the code is not accepting the input correctly." Note that such bug summaries appear in other debug prompts, too.

To analyze the effect of using different prompts on a problem level, we present a heatmap of EPG for all 25 problems in figure 8. We add the values of test pass rate as numbers or signs and show EPG in color. Empty cells denote that the search halts due to OpenAI exceptions, such as APIError.9 In addition, if the framework halts before the maximum number of program attempts is reached (light-blue cells with a "-"), it is due to the input length limit of the underlying LLM p_debug, i.e., the generated code is too long and does not fit as input to the LLM.
9 https://platform.openai.com/docs/guides/error-codes/python-library-error-types

Some problems are solved with all prompts, while other problems are solved with only a subset of prompts, solved partially, or not solved at all. A number of problems are solved with all or the majority of prompts in both languages, such as basement, fizz-buzz, paired-digits, and twitter. Other problems pass all tests in only one of the languages, such as luhn, vector-distance, fuel-cost, or substitution-cipher. Most of the solved problems are generated as the first draft or within 1–2 debug steps. However, some problems pass 90% of test cases at the fifth step, such as substitution-cipher in Python with prompts S4 and M8 or shopping-list in C++ with prompts S0, S1, S5 and M7. These runs are likely to be updated with fully correct programs in the following several steps, according to the results in section 5.1, but the experiments are stopped for the fairness of the inter-prompt comparison. Alternatively, conducting the prompt engineering experiment with 1000 max programs would have shown which prompts are beneficial for solving the problems in the long run, which can be interesting future work.

The most interesting cases concern the problems that are solved only with LLM bug summaries or only with static prompts. For example, the gcd problem is solved only with prompts M6–M10 in C++ and is not solved with any of S0–S5. A similar result is obtained for spin-words and coin-sums in C++. In Python, we observe only the cases where solutions are obtained with static prompts and are not obtained with GPT-assisted prompts, e.g., for find-pair and camel-case. In addition, several prompts work well from both the S and M categories, as for gcd in Python.

Answer to RQ2. Program synthesis in C++ with SEIDR achieves better performance in the repair-only setting with both GPT-assisted prompts that summarize bugs in code and static templates which describe failing I/O cases. The best-performing C++ instruction is obtained with GPT-3 for text completion and contains the word "obviously". Results differ for PSB2 solutions in Python: the static prompt template S0 results in the best performance. Overall, SEIDR performance is stable with different debugging prompts submitted to Codex-edit.

5.3 Threats to Validity
External threats to validity concern SEIDR performance on different benchmarks and the use of other language models than the tested ones. Specifically, PSB2 contains competitive programming tasks, which require smaller functions to be generated than production-scale software. We plan to extend our experiments in future work to explore the generalizability of the results to other benchmarks.

Internal threats relate to the implementation. We use PSB2, which has corner-case tests in the training set and regular cases in the test set. To ensure a fair comparison with other studies on PSB2, we evaluate and report results on the provided test set of PSB2, which risks that the synthesized programs do not pass some of the training cases. Large models for code editing and text completion used in this study are nondeterministic, which impacts results. Due to prohibitive model inference costs, each experiment was only run once. However, our temperature sampling procedure described in section 2.2 reduces this stochasticity significantly, especially for low-EPG results.
Figure 8: Number of excess programs generated (in color) and test pass rate (as numbers) depending on the type of debug prompt. Higher EPG values are shown in darker shades than low EPG. We denote solved problems with "+" (test pass rate = 1), unsolved problems with "-" (test pass rate = 0), and show the test pass rate for partially solved problems.

As with other language models, Codex is a black-box model and may generate malicious code [54]. The Codex model was pre-trained on an unbalanced dataset across programming languages [9]. Thus, the results can be skewed towards high performance in popular programming languages.

6 CONCLUSION
In this study, we propose the SEIDR framework to address the challenge of fully autonomous programming. We augment the program synthesis procedure, based on large language models for code generation from templates and textual instructions, with a repair block. The repair block consists of a tree search across the program candidates generated by a large language model for code. The LLM used for code repair takes imperfect program candidates and instructions for their improvement as prompts. The instructions are obtained both from static templates with failing test case descriptions and from templates with bug summaries auto-generated by a text completion language model. We explore 11 prompting strategies and the repair-replace trade-off of updating the draft program.

Contributions: We test SEIDR with Codex-edit as the model for draft program synthesis and debugging in Python and C++ on the PSB2 benchmark. In our experiments, SEIDR outperforms the PushGP baseline and achieves a state-of-the-art result with 19 solved problems out of 25. It requires under 1000 program executions to solve them, in stark contrast to the billions10 of executions in PushGP, making it feasible in areas with costly testing, such as robotics. Investigation of the repair-replace trade-off shows that SEIDR with tree arity of 10 outperforms both the replace-only strategy and the repair-only approach. Our prompt engineering study shows that bug summaries generated with "confidence indicators", such as "obviously", improve the performance of SEIDR during C++ code synthesis. Overall, our framework shows low performance variability with different prompts, which indicates its robustness.
10 A problem is considered "solved" by PushGP if at least 1 of 100 runs, each with a limit of 60 million programs, was successful.

Future work: To study the generalizability of the SEIDR framework, we plan to expand the experiments to the competitive programming dataset of AlphaCode [10] and QuixBugs [46], as well as to experiment with ranking strategies, such as lexicase selection.

DATA AVAILABILITY
The code and results are made available via Zenodo.11 Note that OpenAI discontinued the Codex API on March 23, 2023, and suggests using the GPT-3.5-Turbo API instead.
11 https://doi.org/10.5281/zenodo.7837282
ACKNOWLEDGMENTS
The work presented in this paper was supported by the European Commission through Horizon 2020 grant 812882, and by the Research Council of Norway through the secureIT project (#288787). The empirical evaluation made use of the Experimental Infrastructure for Exploration of Exascale Computing (eX3), supported by the Research Council of Norway through grant #270053.

REFERENCES
[1] Z. Manna and R. J. Waldinger. "Toward Automatic Program Synthesis." In: Communications of the ACM 14.3 (1971), pp. 151–165. doi: 10/bhgh4z.
[2] O. Bastani, J. P. Inala, and A. Solar-Lezama. "Interpretable, Verifiable, and Robust Reinforcement Learning via Program Synthesis." In: xxAI - Beyond Explainable AI: International Workshop, Held in Conjunction with ICML 2020, July 18, 2020, Vienna, Austria, Revised and Extended Papers. Ed. by A. Holzinger, R. Goebel, R. Fong, T. Moon, K.-R. Müller, and W. Samek. Lecture Notes in Computer Science. Springer, 2022, pp. 207–228. doi: 10/j5zn.
[3] S. Dhar, J. Guo, J. Liu, S. Tripathi, U. Kurup, and M. Shah. "A Survey of On-Device Machine Learning: An Algorithms and Learning Theory Perspective." In: ACM Transactions on Internet of Things 2.3 (2021), pp. 1–49. doi: 10/gnxq57.
[4] T. M. Connolly, M. Soflano, and P. Papadopoulos. "Systematic Literature Review: XAI and Clinical Decision Support." In: Diverse Perspectives and State-of-the-Art Approaches to the Utilization of Data-Driven Clinical Decision Support Systems. IGI Global, 2023, pp. 161–188. doi: 10/j5zp.
[5] Y. Jia, J. McDermid, T. Lawton, and I. Habli. "The Role of Explainability in Assuring Safety of Machine Learning in Healthcare." In: IEEE Transactions on Emerging Topics in Computing 10.4 (Oct. 2022), pp. 1746–1760. doi: 10/grrn36.
[6] Z. Manna and R. Waldinger. "Fundamentals of Deductive Program Synthesis." In: IEEE Transactions on Software Engineering 18.8 (1992), pp. 674–704. doi: 10/ddbdft.
[7] R. Alur et al. "Syntax-Guided Synthesis." In: Dependable Software Systems Engineering (2015), pp. 1–25. doi: 10/grb6m9.
[8] M. T. Ahvanooey, Q. Li, M. Wu, and S. Wang. "A Survey of Genetic Programming and Its Applications." In: KSII Transactions on Internet and Information Systems (TIIS) 13.4 (2019), pp. 1765–1794.
[9] M. Chen et al. Evaluating Large Language Models Trained on Code. July 2021. arXiv: 2107.03374.
[10] Y. Li et al. "Competition-Level Code Generation with AlphaCode." In: Science 378.6624 (Dec. 2022), pp. 1092–1097. doi: 10/grggxf.
[11] S. Ren et al. CodeBLEU: A Method for Automatic Evaluation of Code Synthesis. Sept. 2020. arXiv: 2009.10297.
[12] R. Bavishi, H. Joshi, J. Cambronero, A. Fariha, S. Gulwani, V. Le, I. Radicek, and A. Tiwari. "Neurosymbolic Repair for Low-Code Formula Languages." In: OOPSLA. Dec. 2022.
[13] M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton. "A Survey of Machine Learning for Big Code and Naturalness." In: ACM Computing Surveys 51.4 (July 2018), 81:1–81:37. doi: 10/gffkhm.
[14] X. Chen, D. Song, and Y. Tian. "Latent Execution for Neural Program Synthesis Beyond Domain-Specific Languages." In: Advances in Neural Information Processing Systems. Vol. 34. Curran Associates, Inc., 2021, pp. 22196–22208.
[15] O. Polozov and S. Gulwani. "FlashMeta: A Framework for Inductive Program Synthesis." In: OOPSLA. OOPSLA 2015. New York, NY, USA: Association for Computing Machinery, Oct. 2015, pp. 107–126. doi: 10/grb6mm.
[16] V. Liventsev, A. Härmä, and M. Petković. BF++: A Language for General-Purpose Program Synthesis. Jan. 2021. arXiv: 2101.09571.
[17] S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu. "A Survey of Deep Learning Techniques for Autonomous Driving." In: Journal of Field Robotics 37.3 (2020), pp. 362–386. doi: 10/gg9ztb.
[18] M. Marcano, S. Díaz, J. Pérez, and E. Irigoyen. "A Review of Shared Control for Automated Vehicles: Theory and Applications." In: IEEE Transactions on Human-Machine Systems 50.6 (2020), pp. 475–491. doi: 10/gj8zjj.
[19] S. Lu et al. "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation." In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. Dec. 2021.
[20] C. Niu, C. Li, V. Ng, and B. Luo. CrossCodeBench: Benchmarking Cross-Task Generalization of Source Code Models. Feb. 2023. arXiv: 2302.04030.
[21] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. June 2020. arXiv: 1909.09436.
[22] B. Roziere, M.-A. Lachaux, G. Lample, and L. Chanussot. "Unsupervised Translation of Programming Languages." In: 34th Conference on Neural Information Processing Systems. Vancouver, Canada, 2020.
[23] E. Fernandes, J. Oliveira, G. Vale, T. Paiva, and E. Figueiredo. "A Review-Based Comparative Study of Bad Smell Detection Tools." In: Proceedings of the 20th International Conference on Evaluation and Assessment in Software Engineering. EASE '16. New York, NY, USA: Association for Computing Machinery, June 2016, pp. 1–12. doi: 10/ggfv7h.
[24] S. Chakraborty, R. Krishna, Y. Ding, and B. Ray. "Deep Learning Based Vulnerability Detection: Are We There Yet?" In: IEEE Transactions on Software Engineering 48.9 (Sept. 2022), pp. 3280–3296. doi: 10/gk52qr.
[25] J. Petke, S. O. Haraldsson, M. Harman, W. B. Langdon, D. R. White, and J. R. Woodward. "Genetic Improvement of Software: A Comprehensive Survey." In: IEEE Transactions on Evolutionary Computation 22.3 (June 2018), pp. 415–432. doi: 10/gdrjb7.
[26] C. Le Goues, M. Pradel, and A. Roychoudhury. "Automated Program Repair." In: Communications of the ACM 62.12 (Nov. 2019), pp. 56–65. doi: 10/gkgf29.
[27] TIOBE Index. https://www.tiobe.com/tiobe-index/.
[28] K. Gupta, P. E. Christensen, X. Chen, and D. Song. "Synthesize, Execute and Debug: Learning to Repair for Neural Program Synthesis." In: Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc., 2020, pp. 17685–17695.
[29] S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer. Mapping Language to Code in Programmatic Context. Aug. 2018. arXiv: 1808.09588.
[30] D. C. Halbert. "Programming by Example." PhD thesis. University of California, Berkeley, 1984.
[31] S. Gulwani. Programming by Examples (and Its Applications in Data Wrangling). Tech. rep. Redmond, WA, USA: Microsoft Corporation, 2016, p. 22.
[32] M. Zavershynskyi, A. Skidanov, and I. Polosukhin. NAPS: Natural Program Synthesis Dataset. 2018. arXiv: 1807.0316.
[33] L. Ouyang et al. Training Language Models to Follow Instructions with Human Feedback. Mar. 2022. arXiv: 2203.02155.
[34] Z. Fan, X. Gao, M. Mirchev, A. Roychoudhury, and S. H. Tan. Automated Repair of Programs from Large Language Models. Jan. 2023. arXiv: 2205.10583.
[35] J. Blain. Nine Worlds of Seid-magic: Ecstasy and Neo-shamanism in North European Paganism. Routledge, 2002.
[36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. "Attention Is All You Need." In: International Conference on Neural Information Processing Systems (NeurIPS). Ed. by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Curran Associates, Inc., 2017, pp. 5998–6008.
[37] OpenAI API. https://platform.openai.com.
[38] S. de Bruin, V. Liventsev, and M. Petković. Autoencoders as Tools for Program Synthesis. Sept. 2021. arXiv: 2108.07129.
[39] J. R. Koza. Genetic Programming II. Vol. 17. MIT Press, 1994.
[40] N. Jack and F. Van der Duyn Schouten. "Optimal Repair–Replace Strategies for a Warranted Product." In: International Journal of Production Economics 67.1 (Aug. 2000), pp. 95–100. doi: 10/cfzj7f.
[41] S. J. Russell. Artificial Intelligence: A Modern Approach. Pearson Education, Inc., 2010.
[42] H. Joshi, J. Cambronero, S. Gulwani, V. Le, I. Radicek, and G. Verbruggen. Repair Is Nearly Generation: Multilingual Program Repair with LLMs. Sept. 2022. arXiv: 2208.11640.
[43] D. Shrivastava, H. Larochelle, and D. Tarlow. Repository-Level Prompt Generation for Large Language Models of Code. Oct. 2022. arXiv: 2206.12839.
[44] Q. Huang, Z. Yuan, Z. Xing, X. Xu, L. Zhu, and Q. Lu. Prompt-Tuned Code Language Model as a Neural Knowledge Base for Type Inference in Statically-Typed Partial Code. Aug. 2022. arXiv: 2208.05361.
[45] B. Ahmad, S. Thakur, B. Tan, R. Karri, and H. Pearce. Fixing Hardware Security Bugs with Large Language Models. Feb. 2023. arXiv: 2302.01215.
[46] D. Lin, J. Koppel, A. Chen, and A. Solar-Lezama. "QuixBugs: A Multi-Lingual Program Repair Benchmark Set Based on the Quixey Challenge." In: Companion of the SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity. Vancouver BC Canada: ACM, Oct. 2017, pp. 55–56. doi: 10/gf8nmn.
[47] J. A. Prenner, H. Babii, and R. Robbes. "Can OpenAI's Codex Fix Bugs?: An Evaluation on QuixBugs." In: 2022 IEEE/ACM International Workshop on Automated Program Repair (APR). May 2022, pp. 69–75. doi: 10/grpcnx.
[48] D. Sobania, M. Briesch, C. Hanna, and J. Petke. An Analysis of the Automatic Bug Fixing Performance of ChatGPT. Jan. 2023. arXiv: 2301.08653.
[49] K. Kuznia, S. Mishra, M. Parmar, and C. Baral. Less Is More: Summary of Long Instructions Is Better for Program Synthesis. Oct. 2022. arXiv: 2203.08597.
[50] T. Helmuth and P. Kelly. "Applying Genetic Programming to PSB2: The Next Generation Program Synthesis Benchmark Suite." In: Genetic Programming and Evolvable Machines 23.3 (Sept. 2022), pp. 375–404. doi: 10/gq5gjq.
[51] T. Helmuth and L. Spector. "Problem-Solving Benefits of Down-Sampled Lexicase Selection." In: Artificial Life 27.3–4 (Mar. 2022), pp. 183–203. doi: 10/grrnj7.
[52] T. Helmuth and L. Spector. "General Program Synthesis Benchmark Suite." In: Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation. Madrid, Spain: ACM, July 2015, pp. 1039–1046. doi: 10/ghsn5b.
[53] D. Sobania, M. Briesch, and F. Rothlauf. "Choose Your Programming Copilot: A Comparison of the Program Synthesis Performance of Github Copilot and Genetic Programming." In: Proceedings of the Genetic and Evolutionary Computation Conference. Boston, Massachusetts: ACM, July 2022, pp. 1019–1027. doi: 10/gq5gjp.
[54] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri. Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. Dec. 2021. arXiv: 2108.09293.
