Using GitHub Copilot to Solve Simple Programming Problems
Michel Wermelinger
School of Computing and Communications
The Open University
Milton Keynes, United Kingdom
[email protected]
ABSTRACT

The teaching and assessment of introductory programming involves writing code that solves a problem described by text. Previous research found that OpenAI's Codex, a natural language machine learning model trained on billions of lines of code, performs well on many programming problems, often generating correct and readable Python code. GitHub's version of Codex, Copilot, is freely available to students. This raises pedagogic and academic integrity concerns. Educators need to know what Copilot is capable of, in order to adapt their teaching to AI-powered programming assistants. Previous research evaluated the most performant Codex model quantitatively, e.g. how many problems have at least one correct suggestion that passes all tests. Here I evaluate Copilot instead, to see if and how it differs from Codex, and look qualitatively at the generated suggestions, to understand the limitations of Copilot. I also report on the experience of using Copilot for other activities asked of students in programming courses: explaining code, generating tests and fixing bugs. The paper concludes with a discussion of the implications of the observed capabilities for the teaching of programming.

CCS CONCEPTS

• Computing methodologies → Natural language processing; • Software and its engineering → Automatic programming; • Social and professional topics → CS1.

KEYWORDS

code generation, test generation, code explanation, programming exercises, programming patterns, novice programming, introductory programming, academic integrity, OpenAI Codex

ACM Reference Format:
Michel Wermelinger. 2023. Using GitHub Copilot to Solve Simple Programming Problems. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1 (SIGCSE 2023), March 15–18, 2023, Toronto, ON, Canada. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3545945.3569830

1 INTRODUCTION

Program synthesis is an active research field with a long history. Recent developments include OpenAI's Codex (https://openai.com/blog/openai-codex) [1], DeepMind's AlphaCode (https://www.deepmind.com/blog/competitive-programming-with-alphacode) [8], Amazon's CodeWhisperer (https://aws.amazon.com/codewhisperer) and Tabnine (https://www.tabnine.com), four systems that can translate a problem description to code. Currently (August 2022), AlphaCode and CodeWhisperer aren't publicly available and Tabnine's free version only does code suggestions. This paper considers GitHub's Copilot (https://copilot.github.com), a version of Codex accessible through a plugin for popular IDEs.

In June 2022, GitHub released Copilot to individual customers and included it in the free Student Pack. Codex, publicly available since November 2021, is a paid service, accessed by writing a program that calls OpenAI's API. These are barriers to adoption by students and Copilot removed them. Whether we like it or not, many students will use the free IDE plugin for exercises and assignments, without having to learn an API or disrupt their workflow.

With a suitable configuration, the most performant Codex model (Davinci) can solve typical CS1 problems [4]. (I use 'solve' in the sense of 'generate an answer that passes the tests'.) However, Davinci is also the slowest and most expensive of the models and we know that "a distinct production version of Codex powers GitHub Copilot" [1]. Given the recent release of Copilot and its expected widespread use by students, it is timely to check if it performs as well as Davinci, and if not, what limitations it has.

OpenAI's Codex examples (https://beta.openai.com/examples?category=code) include fixing a bug and generating documentation strings, step-by-step explanations and code summaries. Sarsa et al. showed that Davinci can create exercises (as variations of a given problem), sample solutions, explanations and tests [12]. However, Copilot's FAQ states that it "is not intended for non-coding tasks like data generation and natural language generation, like question and answering". We can thus expect Copilot to perform less well than Davinci in generating tests and code explanations, two tasks often asked of students.

The demo videos on the Codex site illustrate the incremental creation of programs: the user's initial request generates a minimal program or function that is modified by each subsequent request. Codex and the user engage in a 'dialogue' in which the user's English sentences elicit a reply in a programming language. Copilot makes this interaction stronger and more seamless: it makes suggestions as we type in the IDE, we can ask for alternative suggestions with one keystroke and we can edit Copilot's suggestions. It's no surprise that GitHub dubs Copilot as "your AI pair programmer",
even though the interaction is far more limited than with a human;
notably, Copilot does not provide a rationale for its suggestions.
In summary, Copilot is currently the most likely AI assistant to
be adopted by students, but only Codex has been evaluated, in a
quantitative way, mostly with automated tests to obtain the per-
centage of correct code suggestions for a given problem statement.
While Copilot could be evaluated in the same way, a qualitative
look at Copilot’s answers may be more insightful. This paper is
a report of my experience in using Copilot as if I were a student
tasked with writing code (possibly with an explanation) and tests
for a given CS1 problem. The guiding questions for this exploration
are:
• How does Copilot perform, compared with Davinci, in terms
of the correctness and variety of the generated code, tests
and explanations?
• If a suggestion is incorrect, can Copilot be interactively led
to a correct one?
2 RELATED WORK
The OpenAI team wrote 164 problems (https://github.com/openai/human-eval) with tests, to evaluate if early models of Codex generate correct Python code from English [1]. Almost 29% of the problems were solved by Codex's first suggestion. A fine-tuned model, trained only on correct Python code, solved 38% of the problems on its first attempt. If that model is allowed 100 attempts per problem, then at least one of them was correct for 78% of the problems. Judging from the randomly selected attempts in the paper's appendix, most are incorrect.

Finnie-Ansley et al. [4] investigated the performance of the Davinci model on two sets of CS1 problems in Python, setting the temperature parameter to 90% to elicit more creative answers. The first set had 23 problems from past tests of their CS1 course. Codex had ten attempts at each problem. It solved 10 problems on the first attempt, it needed manual output correction for some problems, and it couldn't solve 4 problems that required ASCII formatting or restricted what students could use, e.g. 'you must use a while-loop but not the split() method'.

The second set consisted of 7 variants of the rainfall problem [3–7, 13, 14]. Codex was asked for 50 solutions to each problem. For all but one variant, at least one attempt passed all tests. The 50 attempts passed on average 19% of the tests for the Ebrahimi et al. variant [3], up to 63% of the tests for Soloway's [14].

Sarsa et al. [12] also evaluated Davinci, but from the perspective of a teacher who wants varied but related exercises in different domains to better engage their students. The researchers submitted to Codex a Python docstring containing: keywords describing the domain (e.g. 'football') and the programming context (e.g. 'list'); a base problem in English; a sample solution; some tests. This was followed by different domain and programming keywords and a prompt to generate another exercise (statement, solution and tests). The authors read half of the 240 exercises generated and 75% of those were reasonable, i.e. they made sense, incorporated the prompted domain or programming concept, and included appropriate code. However, 30% of the exercises had no solution or no tests. When both existed, only 31% of the solutions passed the tests.
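The structure of such a priming prompt might look roughly as follows (a speculative reconstruction for illustration only; the keywords, problem statement and code are mine, not the authors'):

    # Illustrative priming prompt in the style described by Sarsa et al. [12]:
    # keywords, base problem, sample solution and a test, then new keywords
    # as the cue for Codex to generate another exercise.
    """Exercise. Keywords: football, list.
    Write a function that returns the names of the players who scored
    more than a given number of goals.
    """
    def top_scorers(players, goals):
        """players is a list of (name, goals scored) pairs."""
        return [name for name, scored in players if scored > goals]

    assert top_scorers([('Ada', 3), ('Bob', 1)], 2) == ['Ada']

    """Exercise. Keywords: music, dictionary.
    """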
The authors also asked Codex to generate line-by-line explanations for three functions and a class: only two thirds of the generated lines were correct. Codex often got the operator wrong, e.g. speed > 100 was explained as 'speed is less than 100'.

In summary, previous research shows that while Codex solves a good percentage of problems on the first attempt, it requires many attempts for most problems, and there are quite a number of problems it never solves. None of the related work captures how most students will use Codex: interactively in an IDE.

3 USING COPILOT

Copilot is accessible via a plugin for several editors. I used Visual Studio Code during the free 60-day Copilot trial period that GitHub users have. It's unlikely that results would differ with other IDEs, as they are only interfaces to Copilot.

In VS Code, Copilot attempts to provide an inline suggestion when the user pauses typing or presses Alt-\ or Enter. The suggestion, in italic grey font at the cursor position, may complete the current line or may be several lines long (Figure 1). The user can cycle through alternative suggestions with Alt-[ or Alt-]. Pressing Tab accepts the current suggestion. Pressing Ctrl-Enter requests Copilot to generate up to 10 suggestions and display the unique ones in a separate panel. Each suggestion has an acceptance button, which copies it to the editor.

[Figure 1: An inline suggestion for Ebrahimi's rainfall problem variant.]

The following sections detail the prompts used, i.e. the code and comments in the editor at the time a suggestion is asked for. All code is in Python, to allow comparing with previous work and because it is the introductory programming language at our institution.

Copilot can be configured in the Settings page of the user's GitHub account. Users can choose whether to see suggestions that match public code and whether to allow GitHub and OpenAI to use the submitted prompts for training. I declined both. The plugin doesn't indicate which suggestions match public code, so students
4 GENERATING CODE
I used Copilot with two CS1 problem sets. This section presents the
qualitative results, illustrated with selected examples. I highlight
Copilot’s mistakes with # Wrong and, due to limited space, I replace
repetitive generated code with an ellipsis.
On the other hand, it fails to solve half of these simple problems. While Copilot can be instructed to incrementally fix or improve a program, the user must know exactly what they want and find the right words to communicate their intentions. It's easier and quicker to edit the code directly.

Copilot sometimes uses constructs novice programmers might not know about, like list comprehensions. How students will react to such suggestions remains to be seen.

4.2 The rainfall problem

4.2.1 Method. To directly compare Copilot and Davinci, I asked Copilot to solve the same rainfall problem variants as Finnie-Ansley et al. For each one, the prompt was a docstring with the problem description given in Table 2 of [4], followed by def, to make Copilot suggest the function header and the body. Since the problem variants are similar, I didn't accept Copilot's suggestions, to reduce any learning effect that might increase the correctness or reduce the variety of the suggestions for subsequent variants.
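For illustration, the editor content for one variant looked roughly like this (the docstring below paraphrases a problem description rather than quoting Table 2 of [4] verbatim); Copilot is expected to complete the function header and body from the dangling def:

    """Write a function that reads rainfall values entered by the user until
    the sentinel value occurs, and returns the mean of the valid values
    entered before the sentinel.
    """
    def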
The rainfall problem asks for the mean of the valid numbers (what is valid depends on the variant) up to the first sentinel value. If there are no valid numbers, the mean is undefined. Most variants don't state what to return in that case or whether we can assume the input to have a valid number. I took the latter stance. Finnie-Ansley et al. tested Davinci's suggestions with the empty input and with only invalid numbers, but it's unclear what outcome their tests expect.

4.2.2 Results. For every variant, Copilot made 1–3 inline suggestions and fewer than 10 unique separate suggestions, but often they were essentially the same, differing in the names of variables, the docstring (or its absence), the messages in input() and print(), and minor coding variations, e.g. iterating with a repeat-until loop or an infinite while-loop with a break.

Soloway's original problem [14] is the simplest: read numbers from the input until the sentinel occurs and compute their mean. All of Copilot's suggestions are correct.

Simon [13] asks to treat negative values as if they were zeroes. Copilot does not do that: it adds all values until the sentinel.
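For concreteness, the correct suggestions for Soloway's variant resemble the following single-pass, sentinel-controlled loop (my reconstruction with an illustrative sentinel value, not verbatim Copilot output); the commented-out line is the adjustment Simon's variant requires and Copilot omits:

    SENTINEL = 99999  # placeholder: the actual sentinel value differs per variant

    def rainfall_mean():
        total = 0
        count = 0
        while True:
            value = float(input('Enter rainfall: '))
            if value == SENTINEL:
                break
            # value = max(value, 0)  # Simon's variant: treat negatives as zeroes
            total += value
            count += 1
        return total / count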
Guzdial et al. [6] ask for the mean of the positive values; there is no sentinel. All Copilot suggestions are correct, except two. One does if value < 0: continue instead of <= 0 to skip non-positive values. The other uses len() to divide the sum by the length of the list, even though this is the only variant stating that one must divide by the number of valid values.

Lakanen et al. [7] ask for the mean of the positive numbers up to the first value exceeding 998. The description uses the word 'sentinel', but it is not a fixed value like in the other variants. This was the only variant Davinci couldn't solve, and neither can Copilot: one suggestion breaks the loop on 999 only; the others don't check for a sentinel. All suggestions (except the one with the break statement) have an unnecessary else: continue branch, a striking example of the lack of variety.
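The recurring mistakes in these two variants are of the following kind (illustrative reconstructions using the # Wrong convention of Section 4, not verbatim suggestions):

    def mean_positive(values):
        total = 0
        count = 0
        for value in values:
            if value < 0:        # Wrong: should be <= 0, so that zeroes are skipped too
                continue
            total += value
            count += 1
        return total / count     # another suggestion wrongly divides by len(values)

    # For Lakanen et al.'s variant, the closest suggestion ends the loop with
    #     if value == 999:       # Wrong: the input ends at any value exceeding 998
    #         break
    #     else:
    #         continue           # unnecessary branch
    # and the remaining suggestions do not check for a 'sentinel' at all.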
Fisler [5] asks for the mean of the non-negative values up to the sentinel, which is optional. The statement is 'Design a program [...] that consumes a list of numbers [...] entered by a user. The list may contain [the sentinel]'. Contrary to all other variants, this leads Copilot to generate two kinds of algorithms: while-loops that use input() and break upon reading the sentinel, and for-loops that iterate over a list parameter. Only one for-loop suggestion and two while-loop suggestions are correct: the others use len() or don't check whether a value is negative. This variant elicited the most varied suggestions: some use a while-loop to read the values up to the sentinel into a list, followed by a for-loop to add and count the non-negative values; others use sum() / len() and one uses slicing.

Ebrahimi [3] asks for a program that reads rainfall amounts (negative values are invalid) until the sentinel occurs and then outputs the number, total and maximum of the valid values and 'the number of rainy days', which isn't further explained. This was the variant on which Davinci performed worst, with suggestions passing on average 2 of the 10 tests. Surprisingly, Copilot's first inline suggestion (Figure 1) and most other ones are correct. The incorrect suggestions compute the mean or use the wrong variable when printing. Some suggestions unnecessarily store the values read in a list that isn't further used. The prompt 'The next code doesn't create a rainfall list' removed it.
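A sketch of the kind of program this variant asks for, and that Copilot's correct suggestions appear to resemble (the sentinel value and the reading of 'rainy day' as a day with rainfall above zero are my assumptions, not part of the problem statement):

    SENTINEL = -999  # placeholder sentinel value

    def rainfall_report():
        valid_days = 0
        rainy_days = 0
        total = 0.0
        maximum = 0.0
        while True:
            amount = float(input('Rainfall amount: '))
            if amount == SENTINEL:
                break
            if amount < 0:            # negative amounts are invalid
                continue
            valid_days += 1
            total += amount
            if amount > 0:            # assumed meaning of 'rainy day'
                rainy_days += 1
            maximum = max(maximum, amount)
        print(valid_days, rainy_days, total, maximum)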
Finnie-Ansley et al. [4] ask for the mean, rounded to one decimal place, of the non-negative values up to the sentinel or the end of the list, whichever occurs first. If the input is None (instead of a list of numbers) or the mean is undefined, then the output should be -1.0. Copilot's suggestions are all incorrect. They use len(), don't round the result, don't stop at the sentinel, or they check if the sum (rather than the count) of valid values is zero to return -1. The first inline suggestion is almost correct: it doesn't round.
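For reference, a solution meeting this specification would look roughly like this (the function name and sentinel value are placeholders; this is my sketch, not one of Copilot's suggestions):

    SENTINEL = 99999  # placeholder: the actual sentinel value is given in [4]

    def average_rainfall(values):
        if values is None:
            return -1.0
        total = 0
        count = 0
        for value in values:
            if value == SENTINEL:
                break
            if value >= 0:            # only non-negative values are valid
                total += value
                count += 1
        if count == 0:                # the mean is undefined
            return -1.0
        return round(total / count, 1)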
4.2.3 Analysis. For every variant, Finnie-Ansley et al. found that Davinci generated a variety of attempts, including doing one or two passes over the list, with a for-loop or a while-loop. Copilot generates one approach, possibly combined with len(), per variant: a single-pass for-loop if the input is a list and a single-pass while-loop if the standard input is read. It generates both approaches only for Fisler's variant, likely due to both 'entered by a user' and 'list' occurring in the problem description.

Looking more closely at the suggestions for each variant, they seem to be obtained largely by combining, say, two different names for one variable, two docstrings, two ways of using the while-loop, etc. Even though the code of each suggestion is unique, algorithmically they are essentially the same.

There are at least two reasons why Copilot's suggestions are much less varied than Davinci's. First, Copilot generates about 10–12 suggestions per variant, whereas Davinci was asked for 50. Second, Copilot's temperature is probably low, because lower temperatures tend to give more precise results while higher temperatures, as used with Davinci, tend to produce more random results.

The lack of variety led to more polarised correctness results than those reported by Finnie-Ansley et al. The suggestions tend to be mostly (or exclusively) correct or incorrect. Davinci found a solution for 6 of the 7 variants; Copilot solved 4 variants.

It is also worth noting that the separate suggestions, which take a few seconds to appear, frequently do not improve on the inline suggestions, which are almost immediate. This may be on purpose, to provide a good user experience and avoid further requests to the servers.

5 GENERATING TESTS

Students are sometimes asked to write tests in order to think about edge cases. Can Copilot generate comprehensive tests?
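Such edge cases include, for instance, the following tests for the rainfall specification of the previous section (my examples, written against the earlier average_rainfall sketch, not Copilot output):

    def test_average_rainfall():
        assert average_rainfall(None) == -1.0                  # no list at all
        assert average_rainfall([]) == -1.0                    # no values
        assert average_rainfall([SENTINEL, 5]) == -1.0         # sentinel comes first
        assert average_rainfall([-1, -2]) == -1.0              # only invalid values
        assert average_rainfall([1, 2, SENTINEL, 50]) == 1.5   # stop at the sentinel
        assert average_rainfall([1, -3, 2]) == 1.5             # ignore negative values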
    is greater than 0. If the rainfall is greater than 0, the
    rainy days is updated. The total rainfall is updated.
    The number of valid days is updated. The number of
    rainy days is updated. The total rainfall is updated.
    The maximum rainfall is updated. The code prints the
    number of valid days, the number of rainy days, the
    total rainfall, the maximum rainfall, and the average
    rainfall.

The step-by-step description is repetitive, sometimes unclear, misses the check for the sentinel in the code, puts some steps in the wrong order, and wrongly states that the code prints the average rainfall.

7 CONCLUDING REMARKS

The first guiding question was about Copilot's performance compared with Davinci's, in terms of correctness and variety of answers. Comparing my results to those cited about Davinci, it seems clear that Copilot fares less well on both accounts: as observed in the code, tests and explanations generated, Copilot's suggestions often are wrong, include unnecessary elements or are mainly 'variations on a theme', possibly due to a low default temperature.

Sometimes Copilot seems to have an uncanny understanding of the problem, able to extract the relevant details from text and tables of examples, while ignoring student instructions and references to materials. Other times, Copilot looks completely clueless, generating gibberish, like irrelevant lists of imports.

The second guiding question probed whether Copilot can be interactively led to a correct answer. As in the first question, the examples provide a sobering reminder that, in spite of all the hype, using tools like Copilot can be a frustrating 'hit and miss' affair. Copilot most often does not understand our instructions to fix or improve the code it generated unless we formulate them in a very specific way. I felt like Gandalf trying to open the Doors of Durin.

Neither this nor previous work has studied how students will use Codex and Copilot in practice. Even without such a study, some educated guesses of what the future holds can be made.

Finnie-Ansley et al. note that Codex poses challenges to academic integrity that can't be 'wished away': educators must adapt to the new reality. This is even more so with Copilot's free IDE plugin. While less performant than Davinci, Copilot does generate code (and with some editing, tests and explanations) that could have been written by humans. Detecting and punishing the use of Copilot is impossible and futile. A more fruitful approach is to adopt these tools and educate students and colleagues about their advantages and limitations. Knowing what Copilot is good at and what it can't do helps students and educators understand what they need to learn, teach and assess in an age where up to 40% of code is written by Copilot, when it is used [2].

Copilot can provide a first helpful attempt at a problem, but students still need to know a language's syntax and semantics well, in order to spot and modify Copilot's often incorrect suggestions.

Copilot's 'explanations' may help students understand unfamiliar constructs and detect errors in the code, but they mostly describe what the code does at a low level of detail and sometimes omit important aspects. Students still need to learn how to write clear, high-level, synoptic documentation and they still need to figure out why some code doesn't work, in order to fix it.

Copilot can provide several suggestions, but to be productive, students must be able to quickly read and understand code without running it, in order to choose the correct suggestion or to combine the correct parts from different suggestions.

Copilot quickly completes comments, code lines and individual tests with hardly any syntax errors. Students will spend less time typing code and understanding compiler errors, and work through more exercises. Lecturers and TAs can focus more on documentation, testing, debugging and program comprehension.

Educators should stop 're-dressing' old toy problems that only have 1–2 sensible solutions, because Copilot's first suggestion will likely be correct, readable and the simplest one, thus restricting the learning opportunities for coding, debugging and algorithmic thinking, compared to problems with interesting 'wrinkles'.

Finnie-Ansley et al. note that students will likely get partial credit if they submit any of Davinci's suggestions, since almost every one passes some tests. This also happens with Copilot. Educators may need to increase the use of all-or-nothing grading, to make students analyse and combine partially correct solutions.

As this experience shows, Copilot is a useful springboard to productively solve CS1 problems, but the algorithmic thinking, program comprehension, debugging and communication skills are as needed as ever. As Jean-Baptiste Karr observed: the more things change, the more they stay the same.
REFERENCES

[1] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. https://doi.org/10.48550/ARXIV.2107.03374
[2] Thomas Dohmke. 2022. GitHub Copilot is generally available to all developers. GitHub blog. https://github.blog/2022-06-21-github-copilot-is-generally-available-to-all-developers
[3] Alireza Ebrahimi. 1994. Novice Programmer Errors: Language Constructs and Plan Composition. Int. J. Hum.-Comput. Stud. 41, 4 (Oct. 1994), 457–480. https://doi.org/10.1006/ijhc.1994.1069
[4] James Finnie-Ansley, Paul Denny, Brett A. Becker, Andrew Luxton-Reilly, and James Prather. 2022. The Robots Are Coming: Exploring the Implications of OpenAI Codex on Introductory Programming. In Australasian Computing Education Conference (Virtual Event, Australia) (ACE '22). Association for Computing Machinery, New York, NY, USA, 10–19. https://doi.org/10.1145/3511861.3511863
[5] Kathi Fisler. 2014. The Recurring Rainfall Problem. In Proceedings of the Tenth Annual Conference on International Computing Education Research (Glasgow, Scotland, United Kingdom) (ICER '14). Association for Computing Machinery, New York, NY, USA, 35–42. https://doi.org/10.1145/2632320.2632346
[6] Mark Guzdial, Rachel Fithian, Andrea Forte, and Lauren Rich. 2003. Report on Pilot Offering of CS1315 Introduction to Media Computation With Comparison to CS1321 and COE1361. Technical Report. Georgia Tech.
[7] Antti-Jussi Lakanen, Vesa Lappalainen, and Ville Isomöttönen. 2015. Revisiting Rainfall to Explore Exam Questions and Performance on CS1. In Proceedings of the 15th Koli Calling Conference on Computing Education Research (Koli, Finland) (Koli Calling '15). Association for Computing Machinery, New York, NY, USA, 40–49. https://doi.org/10.1145/2828959.2828970
[8] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022. Competition-Level Code Generation with AlphaCode. https://doi.org/10.48550/ARXIV.2203.07814
[9] Richard Lobb and Jenny Harlow. 2016. Coderunner: A Tool for Assessing Computer Programming Skills. ACM Inroads 7, 1 (Feb. 2016), 47–51. https://doi.org/10.1145/2810041
[10] Orna Muller, David Ginat, and Bruria Haberman. 2007. Pattern-Oriented Instruction and Its Influence on Problem Decomposition and Solution Construction. In Proceedings of the 12th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education (Dundee, Scotland) (ITiCSE '07). Association for Computing Machinery, New York, NY, USA, 151–155. https://doi.org/10.1145/1268784.1268830
[11] Paul Piwek, Michel Wermelinger, Robin Laney, and Richard Walker. 2019. Learning to Program: From Problems to Code. In Proceedings of the 3rd Conference on Computing Education Practice (Durham, United Kingdom) (CEP '19). Association for Computing Machinery, New York, NY, USA, Article 14, 4 pages. https://doi.org/10.1145/3294016.3294024
[12] Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic Generation of Programming Exercises and Code Explanations Using Large Language Models. In Proceedings of the 2022 ACM Conference on International Computing Education Research - Volume 1 (Lugano and Virtual Event, Switzerland) (ICER '22). Association for Computing Machinery, New York, NY, USA, 27–43. https://doi.org/10.1145/3501385.3543957
[13] Simon. 2013. Soloway's Rainfall Problem Has Become Harder. In Learning and Teaching in Computing and Engineering. IEEE, 130–135. https://doi.org/10.1109/LaTiCE.2013.44
[14] Elliot Soloway. 1986. Learning to Program = Learning to Construct Mechanisms and Explanations. Commun. ACM 29, 9 (Sept. 1986), 850–858. https://doi.org/10.1145/6592.6594