Using GitHub Copilot to Solve Simple Programming Problems

Michel Wermelinger
School of Computing and Communications
The Open University
Milton Keynes, United Kingdom
[email protected]

ABSTRACT
The teaching and assessment of introductory programming involves writing code that solves a problem described by text. Previous research found that OpenAI’s Codex, a natural language machine learning model trained on billions of lines of code, performs well on many programming problems, often generating correct and readable Python code. GitHub’s version of Codex, Copilot, is freely available to students. This raises pedagogic and academic integrity concerns. Educators need to know what Copilot is capable of, in order to adapt their teaching to AI-powered programming assistants. Previous research evaluated the most performant Codex model quantitatively, e.g. how many problems have at least one correct suggestion that passes all tests. Here I evaluate Copilot instead, to see if and how it differs from Codex, and look qualitatively at the generated suggestions, to understand the limitations of Copilot. I also report on the experience of using Copilot for other activities asked of students in programming courses: explaining code, generating tests and fixing bugs. The paper concludes with a discussion of the implications of the observed capabilities for the teaching of programming.

CCS CONCEPTS
• Computing methodologies → Natural language processing; • Software and its engineering → Automatic programming; • Social and professional topics → CS1.

KEYWORDS
code generation, test generation, code explanation, programming exercises, programming patterns, novice programming, introductory programming, academic integrity, OpenAI Codex

ACM Reference Format:
Michel Wermelinger. 2023. Using GitHub Copilot to Solve Simple Programming Problems. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1 (SIGCSE 2023), March 15–18, 2023, Toronto, ON, Canada. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3545945.3569830

1 INTRODUCTION
Program synthesis is an active research field with a long history. Recent developments include OpenAI’s Codex (https://openai.com/blog/openai-codex) [1], DeepMind’s AlphaCode (https://www.deepmind.com/blog/competitive-programming-with-alphacode) [8], Amazon’s CodeWhisperer (https://aws.amazon.com/codewhisperer) and Tabnine (https://www.tabnine.com), four systems that can translate a problem description to code. Currently (August 2022), AlphaCode and CodeWhisperer aren’t publicly available and Tabnine’s free version only does code suggestions. This paper considers GitHub’s Copilot (https://copilot.github.com), a version of Codex accessible through a plugin for popular IDEs.

In June 2022, GitHub released Copilot to individual customers and included it in the free Student Pack. Codex, publicly available since November 2021, is a paid service, accessed by writing a program that calls OpenAI’s API. These are barriers to adoption by students and Copilot removed them. Whether we like it or not, many students will use the free IDE plugin for exercises and assignments, without having to learn an API or disrupt their workflow.

With a suitable configuration, the most performant Codex model (Davinci) can solve typical CS1 problems [4]. (I use ‘solve’ in the sense of ‘generate an answer that passes the tests’.) However, Davinci is also the slowest and most expensive of the models and we know that “a distinct production version of Codex powers GitHub Copilot” [1]. Given the recent release of Copilot and its expected widespread use by students, it is timely to check if it performs as well as Davinci, and if not, what limitations it has.

OpenAI’s Codex examples (https://beta.openai.com/examples?category=code) include fixing a bug and generating documentation strings, step-by-step explanations and code summaries. Sarsa et al. showed that Davinci can create exercises (as variations of a given problem), sample solutions, explanations and tests [12]. However, Copilot’s FAQ states that it “is not intended for non-coding tasks like data generation and natural language generation, like question and answering”. We can thus expect Copilot to perform less well than Davinci in generating tests and code explanations, two tasks often asked of students.

The demo videos on the Codex site illustrate the incremental creation of programs: the user’s initial request generates a minimal program or function that is modified by each subsequent request. Codex and the user engage in a ‘dialogue’ in which the user’s English sentences elicit a reply in a programming language. Copilot makes this interaction stronger and more seamless: it makes suggestions as we type in the IDE, we can ask for alternative suggestions with one keystroke and we can edit Copilot’s suggestions. It’s no surprise that GitHub dubs Copilot “your AI pair programmer”,
even though the interaction is far more limited than with a human;
notably, Copilot does not provide a rationale for its suggestions.
In summary, Copilot is currently the most likely AI assistant to
be adopted by students, but only Codex has been evaluated, in a
quantitative way, mostly with automated tests to obtain the per-
centage of correct code suggestions for a given problem statement.
While Copilot could be evaluated in the same way, a qualitative
look at Copilot’s answers may be more insightful. This paper is
a report of my experience in using Copilot as if I were a student
tasked with writing code (possibly with an explanation) and tests
for a given CS1 problem. The guiding questions for this exploration
are:
• How does Copilot perform, compared with Davinci, in terms
of the correctness and variety of the generated code, tests
and explanations?
• If a suggestion is incorrect, can Copilot be interactively led
to a correct one?

2 RELATED WORK
The OpenAI team wrote 164 problems (https://github.com/openai/human-eval) with tests, to evaluate if early models of Codex generate correct Python code from English [1]. Almost 29% of the problems were solved by Codex’s first suggestion. A fine-tuned model, trained only on correct Python code, solved 38% of the problems on its first attempt. If that model is allowed 100 attempts per problem, then at least one of them was correct for 78% of the problems. Judging from the randomly selected attempts in the paper’s appendix, most are incorrect.

Finnie-Ansley et al. [4] investigated the performance of the Davinci model on two sets of CS1 problems in Python, setting the temperature parameter to 90% to elicit more creative answers. The first set had 23 problems from past tests of their CS1 course. Codex had ten attempts at each problem. It solved 10 problems on the first attempt, it needed manual output correction for some problems, and it couldn’t solve 4 problems that required ASCII formatting or restricted what students could use, e.g. ‘you must use a while-loop but not the split() method’.

The second set consisted of 7 variants of the rainfall problem [3–7, 13, 14]. Codex was asked for 50 solutions to each problem. For all but one variant, at least one attempt passed all tests. The 50 attempts passed on average 19% of the tests for the Ebrahimi et al. variant [3], up to 63% of the tests for Soloway’s [14].

Sarsa et al. [12] also evaluated Davinci, but from the perspective of a teacher who wants varied but related exercises in different domains to better engage their students. The researchers submitted to Codex a Python docstring containing: keywords describing the domain (e.g. ‘football’) and the programming context (e.g. ‘list’); a base problem in English; a sample solution; some tests. This was followed by different domain and programming keywords and a prompt to generate another exercise (statement, solution and tests). The authors read half of the 240 exercises generated and 75% of those were reasonable, i.e. they made sense, incorporated the prompted domain or programming concept, and included appropriate code. However, 30% of the exercises had no solution or no tests. When both existed, only 31% of the solutions passed the tests.

The authors also asked Codex to generate line-by-line explanations for three functions and a class: only two thirds of the generated lines were correct. Codex often got the operator wrong, e.g. speed > 100 was explained as ‘speed is less than 100’.

In summary, previous research shows that while Codex solves a good percentage of problems on the first attempt, it requires many attempts for most problems, and there are quite a number of problems it never solves. None of the related work captures how most students will use Codex: interactively in an IDE.

3 USING COPILOT
Copilot is accessible via a plugin for several editors. I used Visual Studio Code within the 60-day free Copilot trial period GitHub users have. It’s unlikely results would be different with other IDEs as they are only interfaces to Copilot.

In VS Code, Copilot attempts to provide an inline suggestion when the user pauses typing or presses Alt-\ or Enter. The suggestion, in italic grey font at the cursor position, may complete the current line or may be several lines long (Figure 1). The user can cycle through alternative suggestions with Alt-[ or Alt-]. Pressing Tab accepts the current suggestion. Pressing Ctrl-Enter requests Copilot to generate up to 10 suggestions and display the unique ones in a separate panel. Each suggestion has an acceptance button, which copies it to the editor.

Figure 1: An inline suggestion for Ebrahimi’s rainfall problem variant.

The following sections detail the prompts used, i.e. the code and comments in the editor at the time a suggestion is asked for. All code is in Python, to allow comparing with previous work and because it is the introductory programming language at our institution.

Copilot can be configured in the Settings page of the user’s GitHub account. Users can choose whether to see suggestions that match public code and whether to allow GitHub and OpenAI to use the submitted prompts for training. I declined both. The plugin doesn’t indicate which suggestions match public code, so students are likely to hide them to avoid unintended plagiarism. Although unlikely, allowing the prompts to be used for training might have changed the results over the course of this experience.

4 GENERATING CODE
I used Copilot with two CS1 problem sets. This section presents the
qualitative results, illustrated with selected examples. I highlight
Copilot’s mistakes with # Wrong and, due to limited space, I replace
repetitive generated code with an ellipsis.

4.1 Programming patterns


The use of programming and design patterns is grounded in cognitive theories of how people construct and organise knowledge and become problem-solving experts. Muller et al. [10] introduced 30 programming patterns, showing that they improve the students’ ability to solve problems. Patterns are step-by-step algorithmic templates and thus I wanted to check if they are suited to incremental program construction as demonstrated by OpenAI.

For our introductory programming module, I created a small set of patterns [11] and 8 exercises delivered via the CodeRunner plugin [9] for Moodle. Exercises include printing how many images fit in a disk (formula pattern), the strength of an earthquake given its Richter magnitude (case analysis), printing the list of values outside a range (list filtering), and printing the percentage of such values (pattern combination). For the last exercise, students are expected to combine the list filtering and formula patterns. For every other exercise, the CodeRunner editor shows the needed pattern as Python comments. Students write the corresponding code line(s) after each comment. CodeRunner runs some tests on their code and presents the results. Students have unlimited attempts: the exercises aren’t graded. They just help students practice the instantiation of patterns and the basics of programming.

4.1.1 Method. I copied each exercise to VS Code, as in Figure 2: I put the problem statement as-is within a docstring, keeping the instructions to students and the references to our materials. On CodeRunner, each problem description includes a table of example inputs and corresponding outputs. I translated the HTML table to Markdown, within the docstring. The CodeRunner editor content (the comments with the pattern) was copied unchanged.

Figure 2: The list transformation exercise, with an inline suggestion (line 22).

I pressed Enter at the end of each comment line to make Copilot suggest the code for that step of the pattern (see line 22 in Figure 2), until the whole program was generated. If it was incorrect or could be improved, I added a prompt (a comment with an English instruction) after the code, like in OpenAI’s Codex examples.

4.1.2 Results. Copilot solved 4 of the 8 problems on the first attempt. It extracted the relevant information from the docstring, translated the general pattern into problem-specific code, found the right formula even if it was not given, and used appropriate variable names, like kelvin_values instead of output_list in Figure 2. When Copilot used the generic name from the pattern, the prompt ‘Rename ... to ...’ led to the desired improvement.

Copilot generated the correct algorithm for one exercise but it failed the automated test because Copilot added one word to the output message. Copilot failed another exercise because it wrote code with an undefined variable:

# if input values fall into the first case:
if magnitude < 4:
    # compute outputs according to the first case
    print("That is a minor earthquake.")
...
# print the outputs
print(output)  # Wrong

Replacing ‘compute outputs’ with ‘compute the description’ made Copilot assign each string to a description variable and print it at the end, as our students are intended to use the pattern. Impressively, after I changed the last comment to ‘print the full message’ and gave the prompt ‘Now let the description be just the adjaective.’ (note the typo), Copilot generated:

# if input values fall into the first case:
if magnitude < 4:
    # compute the description according to the first case
    description = "minor"
...
# print the full message
print("That is a {} earthquake.".format(description))

The hardest exercise asks students to generate a pseudo-random number sequence until the seed is repeated. The problem statement tells students they must change the given pattern. Copilot follows the pattern and produces a while-loop (while value != seed) that is never entered because the first value is the seed. After several attempts, the prompt ‘Modify the above program so that it computes and prints the second value before entering the loop’ did generate the required additional code, but in the separate suggestions, which required editing to integrate them in the program.

For the list transformation exercise (Figure 2), Copilot called a function transform on each value. Prompted with def trans after the code, Copilot wrote a correct Celsius to Kelvin transformation function, with a good docstring.

4.1.3 Analysis. On the one hand, it is notable how Copilot sifts the chaff from the grain in the problem statement, including extracting the output messages from Markdown tables, and how it defines auxiliary functions as needed.
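For illustration, the auxiliary function mentioned above might look roughly as follows. This is my own sketch, not the code Copilot actually produced (the paper does not reproduce it); only the function name transform comes from the text, and the list name celsius_values is hypothetical.

def transform(celsius):
    """Convert a temperature from degrees Celsius to kelvins."""
    return celsius + 273.15

# applied to each value, as in the list transformation exercise
kelvin_values = [transform(value) for value in celsius_values]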


On the other hand, it fails to solve half of these simple problems. While Copilot can be instructed to incrementally fix or improve a program, the user must know exactly what they want and find the right words to communicate their intentions. It’s easier and quicker to edit the code directly.

Copilot sometimes uses constructs novice programmers might not know about, like list comprehensions. How students will react to such suggestions remains to be seen.

4.2 The rainfall problem
4.2.1 Method. To directly compare Copilot and Davinci, I asked Copilot to solve the same rainfall problem variants as Finnie-Ansley et al. For each one, the prompt was a docstring with the problem description given in Table 2 of [4], followed by def, to make Copilot suggest the function header and the body. Since the problem variants are similar, I didn’t accept Copilot’s suggestions, to reduce any learning effect that might increase the correctness or reduce the variety of the suggestions for subsequent variants.

The rainfall problem asks for the mean of the valid numbers (what is valid depends on the variant) up to the first sentinel value. If there are no valid numbers, the mean is undefined. Most variants don’t state what to return in that case or if we can assume the input to have a valid number. I took the latter stance. Finnie-Ansley et al. tested Davinci’s suggestions with the empty input and with only invalid numbers, but it’s unclear what outcome their tests expect.

4.2.2 Results. For every variant, Copilot made 1–3 inline suggestions and fewer than 10 unique separate suggestions, but often they were essentially the same, differing in the names of variables, the docstring (or its absence), the messages in input() and print(), and minor coding variations, e.g. iterating with a repeat-until loop or an infinite while-loop with a break.

Soloway’s original problem [14] is the simplest: read numbers from the input until the sentinel occurs and compute their mean. All of Copilot’s suggestions are correct.

Simon [13] asks to treat negative values as if they were zeroes. Copilot does not do that: it adds all values until the sentinel.

Guzdial et al. [6] ask for the mean of the positive values; there is no sentinel. All Copilot suggestions are correct, except two. One does if value < 0: continue instead of <= 0 to skip non-positive values. The other uses len to divide the sum by the length of the list, even though this is the only variant stating that one must divide by the number of valid values.

Lakanen et al. [7] ask for the mean of the positive numbers up to the first value exceeding 998. The description uses the word ‘sentinel’, but it is not a fixed value like in the other variants. This was the only variant Davinci couldn’t solve, and neither can Copilot: one suggestion breaks the loop on 999 only; the others don’t check for a sentinel. All suggestions (except the one with the break statement) have an unnecessary else: continue branch, a striking example of the lack of variety.

Fisler [5] asks for the mean of the non-negative values up to the sentinel, which is optional. The statement is ‘Design a program [...] that consumes a list of numbers [...] entered by a user. The list may contain [the sentinel]’. Contrary to all other variants, this leads Copilot to generate two kinds of algorithms: while-loops that use input() and break upon reading the sentinel, and for-loops that iterate over a list parameter. Only one for-loop suggestion and two while-loop suggestions are correct: the others use len() or don’t check whether a value is negative. This variant elicited the most varied suggestions: some use a while-loop to read the values up to the sentinel into a list, followed by a for-loop to add and count the non-negative values; others use sum() / len() and one uses slicing.

Ebrahimi [3] asks for a program that reads rainfall amounts (negative values are invalid) until the sentinel occurs and then outputs the number, total and maximum of the valid values and ‘the number of rainy days’, which isn’t further explained. This was the variant on which Davinci performed worst, with suggestions passing on average 2 of the 10 tests. Surprisingly, Copilot’s first inline suggestion (Figure 1) and most other ones are correct. The incorrect suggestions compute the mean or use the wrong variable when printing. Some suggestions unnecessarily store the values read in a list that isn’t further used. The prompt ‘The next code doesn’t create a rainfall list’ removed it.

Finnie-Ansley et al. [4] ask for the mean, rounded to one decimal place, of the non-negative values up to the sentinel or the end of the list, whichever occurs first. If the input is None (instead of a list of numbers) or the mean is undefined, then the output should be -1.0. Copilot’s suggestions are all incorrect. They use len(), don’t round the result, don’t stop at the sentinel, or they check if the sum (rather than the count) of valid values is zero to return -1. The first inline suggestion is almost correct: it doesn’t round.

4.2.3 Analysis. For every variant, Finnie-Ansley et al. found that Davinci generated a variety of attempts, including doing one or two passes over the list, with a for-loop or a while-loop. Copilot generates one approach, possibly combined with len(), per variant: a single-pass for-loop if the input is a list and a single-pass while-loop if the standard input is read. It generates both approaches only for Fisler’s variant, likely due to both ‘entered by a user’ and ‘list’ occurring in the problem description.

Looking more closely at the suggestions for each variant, they seem to be obtained largely by combining, say, two different names for one variable, two docstrings, two ways of using the while-loop, etc. Even though the code of each suggestion is unique, algorithmically they are essentially the same.

There are at least two reasons why Copilot’s suggestions are much less varied than Davinci’s. First, Copilot generates about 10–12 suggestions per variant, whereas Davinci was asked for 50. Second, Copilot’s temperature is probably low, because lower temperatures tend to give more precise results while higher temperatures, as used with Davinci, tend to produce more random results.

The lack of variety led to more polarised correctness results than those reported by Finnie-Ansley et al. The suggestions tend to be mostly (or exclusively) correct or incorrect. Davinci found a solution for 6 of the 7 variants; Copilot solved 4 variants.

Also to note, the separate suggestions, which take a few seconds to appear, frequently do not improve on the inline suggestions, which are almost immediate. This may be on purpose, to provide a good user experience and avoid further requests to the servers.

5 GENERATE TESTS
Students are sometimes asked to write tests in order to think about edge cases. Can Copilot generate comprehensive tests?
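Before looking at the generated tests, it helps to have a reference point. A correct solution to the Finnie-Ansley et al. variant, against which the tests below (and Listing 1) can be judged, might look roughly as follows. This is my own sketch, not one of Copilot’s suggestions; it assumes the sentinel is -999 and the function name harvest, as in the generated tests.

def harvest(values):
    """Mean of the non-negative values up to the sentinel -999 (or the end
    of the list), rounded to one decimal place; -1.0 if the input is None
    or the mean is undefined."""
    if values is None:
        return -1.0
    total = 0
    count = 0
    for value in values:
        if value == -999:   # the sentinel ends the input
            break
        if value >= 0:      # negative values are ignored
            total += value
            count += 1
    if count == 0:
        return -1.0
    return round(total / count, 1)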


The programming pattern exercises didn’t ask students for tests, so I requested tests for the rainfall problems, using the prompt def test_, sometimes after the code Copilot had generated, sometimes after the problem description, to mimic test-driven development, in which tests are written before the code.

Typical suggestions, both inline and separate, both after the problem and after the code, look like Listing 1. While some tests are correct, most are wrong and extend a data sequence in a repetitive way. Copilot stops midway through a list, which is a syntax error.

Using a different prompt and requesting the tests line by line (with Enter) starts off well...

# Write unit tests below this line
def test_harvest_none():
    assert harvest(None) == -1.0
def test_harvest_empty():
    assert harvest([]) == -1.0
def test_harvest_one():
    assert harvest([1]) == 1.0
def test_harvest_two():
    assert harvest([1, 2]) == 1.5
def test_harvest_three():
    assert harvest([1, 2, 3]) == 2.0

... but Copilot again settles into a repetitive testing pattern. Describing the required unit tests brings more variety:

# A unit test with positive and negative values
def test_harvest_positive_negative():
    assert harvest([1, 2, -3]) == 1.5
def test_harvest_negative_positive():
    assert harvest([-1, 2, 3]) == 2.0  # Wrong
def test_harvest_negative_negative():
    assert harvest([-1, -2, -3]) == -1.0
def test_harvest_positive_positive():
    assert harvest([1, 2, 3]) == 2.0

After leaving a blank line, continuing to press Enter and Tab generates this:

# A unit test with a sentinel
def test_harvest_sentinel():
    assert harvest([1, 2, 3, -999]) == 2.0
def test_harvest_sentinel_positive():
    assert harvest([1, 2, 3, -999, 4]) == 2.5  # Wrong
def test_harvest_sentinel_negative():
    assert harvest([1, 2, 3, -999, -4]) == 1.5  # Wrong

The generated comment suggests that Copilot ‘knows’ that it has yet to test for the sentinel value, but the wrong tests show that it has no understanding of the sentinel’s effect.

6 GENERATE EXPLANATIONS
Explaining code allows students to ‘think aloud’ about their code and educators to check the students’ understanding. Codex can translate code to natural language. How does Copilot fare?

I typed # at the end of lines of code to request a comment from Copilot. Figure 3 shows the comments generated for a wrong suggestion. (The file also contains the problem description, not shown in the figure.) Most comments are literal translations of the code, but some explain the role of a number or variable: -999 is the sentinel, the 1 in the last line is the number of decimal places and count is the number of non-negative values. Whether these comments are enough to make students realise that the code counts all numbers instead of the valid ones remains to be seen.

Figure 3: Comments generated for the Finnie-Ansley variant.

Generated comments may explain unknown constructs to students. Here are the comments generated for a suggestion for Soloway’s variant. I moved some comments one line up to improve the layout.

total = 0
count = 0
while True:
    try:  # try to read in a rainfall value
        number = int(input("Enter a number: "))
        # if the value is 99999, break out of the loop
        if number == 99999:
            break
        total += number
        count += 1
    # if the value is not an integer, print an error message
    except ValueError:
        print("Invalid input")
        # continue to the next iteration of the loop
        continue
# print the average of the rainfall values
print("The average is: {}".format(total / count))

Copilot explains the try ... except, break and continue constructs, but not the format syntax.

The prompt ‘Explain the code above:’ (the colon seems crucial) also generates text. If the file includes the problem statement, Copilot just repeats it as the explanation. Without the problem statement, Copilot attempts to summarise what the code does, without explaining how it works. For example, for a correct solution to Ebrahimi’s variant, Copilot generated:

    The code above is a program that asks the user to enter rainfall for a day. The user will enter 9999 to indicate that they are finished entering rainfall. The program will then print the number of valid days, the number of rainy days, the total rainfall, and the maximum rainfall.

If we type ‘1.’ after the prompt to explain the code, Copilot, like Codex [12], generates a list of statements. For the same 23-line program, the generated 14-step list formed this text:

    The code above is a program that asks the user to enter rainfall for a day. The rainfall is stored in a list. The rainfall is checked to see if it is valid. If the rainfall is valid, the rainfall is added to the list. The rainfall is checked to see if it is the maximum rainfall. If the rainfall is the maximum rainfall, the maximum rainfall is updated. The rainfall is checked to see if it is greater than 0. If the rainfall is greater than 0, the rainy days is updated. The total rainfall is updated. The number of valid days is updated. The number of rainy days is updated. The total rainfall is updated. The maximum rainfall is updated. The code prints the number of valid days, the number of rainy days, the total rainfall, the maximum rainfall, and the average rainfall.

The step-by-step description is repetitive, sometimes unclear, misses the check for the sentinel in the code, puts some steps in the wrong order, and wrongly states that the code prints the average rainfall.
instead of the valid ones remains to be seen. rainfall is updated. The rainfall is checked to see if it


Listing 1: A test function for the rainfall variant of Finnie-Ansley et al.

def test_harvest():
    """
    Test the harvest function with various inputs.
    """
    assert harvest([]) == -1.0
    assert harvest([1, 2, 3, 4, 5]) == 3.0
    assert harvest([1, 2, 3, 4, 5, -999]) == 3.0
    assert harvest([1, 2, 3, 4, 5, -999, 6, 7, 8, 9, 10]) == 5.5  # Wrong
    assert harvest([1, 2, 3, 4, 5, -999, 6, 7, 8, 9, 10, -999]) == 5.5  # Wrong
    assert harvest([1, 2, 3, 4, 5, -999, 6, 7, 8, 9, 10, -999, 11, 12, 13, 14, 15]) == 7.5  # Wrong
    ...
    assert harvest([1, 2, 3, 4, 5, -999, 6, 7, 8, 9, 10, -999, 11, 12, 13, 14, 15, -999, 16, 17, 18, 19, 20, -999, 21, 22, 23, 24, 25, -999]) == 13.5  # Wrong
    assert harvest([1,
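For contrast, the expected values in the asserts marked # Wrong do not follow the problem statement: the first -999 ends the input, so the subsequent values must be ignored and the mean remains 3.0. A corrected version of the first wrong assert (my own, not generated by Copilot) would be:

def test_harvest_stops_at_sentinel():
    # the values after the first -999 are ignored, so the mean is still 3.0
    assert harvest([1, 2, 3, 4, 5, -999, 6, 7, 8, 9, 10]) == 3.0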

7 CONCLUDING REMARKS
The first guiding question was about Copilot’s performance compared with Davinci’s, in terms of correctness and variety of answers. Comparing my results to those cited about Davinci, it seems clear that Copilot fares less well on both accounts: as observed in the code, tests and explanations generated, Copilot’s suggestions often are wrong, include unnecessary elements or are mainly ‘variations on a theme’, possibly due to a low default temperature.

Sometimes Copilot seems to have an uncanny understanding of the problem, able to extract the relevant details from text and tables of examples, while ignoring student instructions and references to materials. Other times, Copilot looks completely clueless, generating gibberish, like irrelevant lists of imports.

The second guiding question probed if Copilot can be interactively led to a correct answer. As in the first question, the examples provide a sobering reminder that, in spite of all the hype, using tools like Copilot can be a frustrating ‘hit and miss’ affair. Copilot most often does not understand our instructions to fix or improve the code it generated unless we formulate them in a very specific way. I felt like Gandalf trying to open the Doors of Durin.

Neither this nor previous work has studied how students will use Codex and Copilot in practice. Even without such a study, some educated guesses of what the future holds can be made.

Finnie-Ansley et al. note that Codex poses challenges to academic integrity that can’t be ‘wished away’: educators must adapt to the new reality. This is even more so with Copilot’s free IDE plugin. While less performant than Davinci, Copilot does generate code (and with some editing, tests and explanations) that could have been written by humans. Detecting and punishing the use of Copilot is impossible and futile. A more fruitful approach is to adopt these tools and educate students and colleagues about their advantages and limitations. Knowing what Copilot is good at and what it can’t do helps students and educators understand what they need to learn, teach and assess in an age where up to 40% of code is written by Copilot, when it is used [2].

Copilot can provide a first helpful attempt at a problem, but students still need to know a language’s syntax and semantics well, in order to spot and modify Copilot’s often incorrect suggestions.

Copilot’s ‘explanations’ may help students understand unfamiliar constructs and detect errors in the code, but they mostly describe what the code does at a low level of detail and sometimes omit important aspects. Students still need to learn how to write clear, high-level, synoptic documentation and they still need to figure out why some code doesn’t work, in order to fix it.

Copilot can provide several suggestions, but to be productive, students must be able to quickly read and understand code without running it, in order to choose the correct suggestion or to combine the correct parts from different suggestions.

Copilot quickly completes comments, code lines and individual tests with hardly any syntax errors. Students will spend less time typing code and understanding compiler errors, and work through more exercises. Lecturers and TAs can focus more on documentation, testing, debugging and program comprehension.

Educators should stop ‘re-dressing’ old toy problems that only have 1–2 sensible solutions, because Copilot’s first suggestion will likely be correct, readable and the simplest one, thus restricting the learning opportunities for coding, debugging and algorithmic thinking, compared to problems with interesting ‘wrinkles’.

Finnie-Ansley et al. note that students will likely get partial credit if they submit any of Davinci’s suggestions, since almost every one passes some tests. This also happens with Copilot. Educators may need to increase the use of all-or-nothing grading, to make students analyse and combine partially correct solutions.

As this experience shows, Copilot is a useful springboard to productively solve CS1 problems, but the algorithmic thinking, program comprehension, debugging and communication skills are as needed as ever. As Jean-Baptiste Karr observed: the more things change, the more they stay the same.


REFERENCES
[1] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. https://doi.org/10.48550/ARXIV.2107.03374
[2] Thomas Dohmke. 2022. GitHub Copilot is generally available to all developers. GitHub blog. https://github.blog/2022-06-21-github-copilot-is-generally-available-to-all-developers
[3] Alireza Ebrahimi. 1994. Novice Programmer Errors: Language Constructs and Plan Composition. Int. J. Hum.-Comput. Stud. 41, 4 (Oct. 1994), 457–480. https://doi.org/10.1006/ijhc.1994.1069
[4] James Finnie-Ansley, Paul Denny, Brett A. Becker, Andrew Luxton-Reilly, and James Prather. 2022. The Robots Are Coming: Exploring the Implications of OpenAI Codex on Introductory Programming. In Australasian Computing Education Conference (Virtual Event, Australia) (ACE ’22). Association for Computing Machinery, New York, NY, USA, 10–19. https://doi.org/10.1145/3511861.3511863
[5] Kathi Fisler. 2014. The Recurring Rainfall Problem. In Proceedings of the Tenth Annual Conference on International Computing Education Research (Glasgow, Scotland, United Kingdom) (ICER ’14). Association for Computing Machinery, New York, NY, USA, 35–42. https://doi.org/10.1145/2632320.2632346
[6] Mark Guzdial, Rachel Fithian, Andrea Forte, and Lauren Rich. 2003. Report on Pilot Offering of CS1315 Introduction to Media Computation With Comparison to CS1321 and COE1361. Technical Report. Georgia Tech.
[7] Antti-Jussi Lakanen, Vesa Lappalainen, and Ville Isomöttönen. 2015. Revisiting Rainfall to Explore Exam Questions and Performance on CS1. In Proceedings of the 15th Koli Calling Conference on Computing Education Research (Koli, Finland) (Koli Calling ’15). Association for Computing Machinery, New York, NY, USA, 40–49. https://doi.org/10.1145/2828959.2828970
[8] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022. Competition-Level Code Generation with AlphaCode. https://doi.org/10.48550/ARXIV.2203.07814
[9] Richard Lobb and Jenny Harlow. 2016. Coderunner: A Tool for Assessing Computer Programming Skills. ACM Inroads 7, 1 (Feb. 2016), 47–51. https://doi.org/10.1145/2810041
[10] Orna Muller, David Ginat, and Bruria Haberman. 2007. Pattern-Oriented Instruction and Its Influence on Problem Decomposition and Solution Construction. In Proceedings of the 12th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education (Dundee, Scotland) (ITiCSE ’07). Association for Computing Machinery, New York, NY, USA, 151–155. https://doi.org/10.1145/1268784.1268830
[11] Paul Piwek, Michel Wermelinger, Robin Laney, and Richard Walker. 2019. Learning to Program: From Problems to Code. In Proceedings of the 3rd Conference on Computing Education Practice (Durham, United Kingdom) (CEP ’19). Association for Computing Machinery, New York, NY, USA, Article 14, 4 pages. https://doi.org/10.1145/3294016.3294024
[12] Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic Generation of Programming Exercises and Code Explanations Using Large Language Models. In Proceedings of the 2022 ACM Conference on International Computing Education Research - Volume 1 (Lugano and Virtual Event, Switzerland) (ICER ’22). Association for Computing Machinery, New York, NY, USA, 27–43. https://doi.org/10.1145/3501385.3543957
[13] Simon. 2013. Soloway’s Rainfall Problem Has Become Harder. In Learning and Teaching in Computing and Engineering. IEEE, 130–135. https://doi.org/10.1109/LaTiCE.2013.44
[14] Elliot Soloway. 1986. Learning to Program = Learning to Construct Mechanisms and Explanations. Commun. ACM 29, 9 (Sept. 1986), 850–858. https://doi.org/10.1145/6592.6594
