
Enhancing Debugging Skills of LLMs with Prompt Engineering

Keyu He, Max Li, Joseph Liu


University of Southern California
{frankhe, qianqili, jliu7350}@usc.edu

Abstract

This paper presents a comprehensive study on improving the debugging capabilities of Large Language Models (LLMs) like GPT-3.5, focusing on the application of prompt engineering techniques. We explore the efficacy of few-shot learning, chain-of-thought prompting, and a baseline zero-shot model in enhancing LLMs' ability to debug code. Utilizing static and dynamic evaluation metrics, the study rigorously assesses the debugging proficiency of these models. By introducing different types of bugs, including procedural and language-model-generated errors, and applying varied prompting strategies, we provide a deeper understanding of LLMs' debugging capabilities. The results provide insights into the limitations of the debugging capabilities of GPT-3.5 Turbo, even with the assistance of various prompting techniques. The source code of our evaluation method and bug generation techniques is in a GitHub repository: https://github.com/FrankHe2002/CSCI 499FinalProject

1 Introduction and Motivation

Large language models (LLMs), such as the GPT (Brown et al., 2020) and CodeLlama (Rozière et al., 2023) families, have shown promise in automating coding tasks. However, their debugging skills remain limited and relatively untested. At the same time, debugging has become an important aspect of development: poor-quality software cost the US approximately $2.8 trillion in 2018 (Krasner, 2021). Additionally, code completion services powered by LLMs, such as GitHub Copilot, are growing in popularity. These services are not perfect, and their code may need to be debugged as well. Our research aims to evaluate and enhance the debugging capabilities of LLMs through few-shot and chain-of-thought approaches. We establish a baseline zero-shot prompt with no chain of thought for comparative evaluation. By utilizing both static and dynamic evaluation metrics, we aim to quantitatively assess the debugging proficiency of these LLMs and pave the way for their effective use in debugging tasks.

2 Hypothesis

In this research, we posit that employing chain-of-thought (Wei et al., 2023) prompt engineering techniques or few-shot examples (Fei-Fei et al., 2006) will substantially elevate the debugging performance of large language models. We anticipate that this tailored prompting approach will yield improvements in multiple evaluation metrics, serving as a quantifiable measure of debugging effectiveness. These expected outcomes are rigorously compared against the baseline.

3 Methodology

3.1 Overview

Our base experiment design is as follows. First, we gathered a dataset of pairs of buggy code and their correct versions. Then we ask our model to debug the code using one of four prompt variations: zero-shot without chain-of-thought (CoT), zero-shot with CoT, few-shot without CoT, and few-shot with CoT. We take the outputted code and analyze it using both static metrics, such as CodeBLEU (Ren et al., 2020) and our novel CodeROUGE/CodeF1, and dynamic metrics, in the form of test cases.

3.2 Dataset Collection and Bug Generation

To create our dataset, we had two primary requirements. First, each data pair needed to be relatively independent. While the real-world distribution of bugs includes many cases where a significant amount of surrounding context is needed, such cases are very difficult to prompt. We therefore need a dataset where the information necessary to find the bug is contained in a relatively small number of tokens that can be passed into the model. Second, each code sample needed to be runnable: in order to do dynamic analysis, we need to be able to automate testing of each sample on real test cases. We could not find an existing dataset that satisfies both of these constraints. For example, Bugs2Fix, while similar, is not runnable and thus cannot be dynamically analyzed (Tufano et al., 2019).

With all this in mind, we chose Java solutions to Leetcode problems as our primary data source. Java is a popular language, meaning the model has some exposure to its syntax. It is also more syntactically complex than a language like Python, which allows us to test on common usage errors. The solutions, and therefore the bug-free versions of the code, were easily acquired from GitHub repositories; we specifically used AnasImloul/Leetcode-Solutions (Imloul, 2023). The second step was the creation of the buggy version of each pair. To accomplish this, we introduce a procedural bug generator and investigate LLM bug generation as an alternative.
3.2.1 Procedural Bug Generation

Our bug generator produces bugs based on common programming mistakes. In all cases, the generator locates some token and replaces or removes it. The cases include, among others: replacing an array access index, removing a syntactically important character, and replacing a boolean comparison operator. In total, we have six cases, of which four are logical errors (the code still compiles), one is a syntax error, and the last is the negative sample (that is, no bugs). We include negative sampling to simulate the possibility of a user asking the LLM to debug code that has no bug. A variable number of bugs are added to a given piece of code to generate its buggy version; one example can be found in Figure 1, and detailed information can be found in Appendix A.1.

Figure 1: Example of a procedural bug, where an array access index has been modified.
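To make the token-level mutation idea concrete, the following Python sketch shows how such a generator could be structured. It is an illustrative approximation rather than our implementation: only three of the six cases are shown, and the regular expressions, the "+ 1" index shift, and the one-in-six negative-sample rate are assumptions.

```python
import random
import re

def replace_array_index(code: str) -> str:
    """Logical error: change one array access index, e.g. nums[i] -> nums[i + 1]."""
    matches = list(re.finditer(r"\[([A-Za-z_]\w*|\d+)\]", code))
    if not matches:
        return code
    m = random.choice(matches)
    return code[:m.start()] + "[" + m.group(1) + " + 1]" + code[m.end():]

def remove_important_char(code: str) -> str:
    """Syntax error: drop one syntactically important character such as ';' or '}'."""
    positions = [i for i, ch in enumerate(code) if ch in ";}"]
    if not positions:
        return code
    i = random.choice(positions)
    return code[:i] + code[i + 1:]

def swap_comparison(code: str) -> str:
    """Logical error: replace one boolean comparison operator with another."""
    swaps = {"<=": "<", ">=": ">", "==": "!=", "<": "<=", ">": ">="}
    ops = list(re.finditer(r"<=|>=|==|!=|<|>", code))
    if not ops:
        return code
    m = random.choice(ops)
    return code[:m.start()] + swaps.get(m.group(), m.group()) + code[m.end():]

MUTATORS = [replace_array_index, remove_important_char, swap_comparison]

def make_buggy(code: str, num_bugs: int = 1, negative_rate: float = 1 / 6) -> str:
    """Return a buggy variant, or the untouched code as a negative sample."""
    if random.random() < negative_rate:
        return code  # negative sample: no bug introduced
    for _ in range(num_bugs):
        code = random.choice(MUTATORS)(code)
    return code
```

A data pair is then simply (make_buggy(solution), solution) for each collected solution.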

3.2.2 Language Model Bug Generation


An alternative to a procedural approach to bug generation is using a language model. Because large language models are trained on extensive real-world datasets, they can represent a wide spectrum of programming errors that are commonly made by human programmers. This ensures that our test cases more closely match the real-world distribution of bugs. An example can be found in Figure 2, and the prompt design for bug generation can be found in Appendix A.2. Note that the LLM-generated bugs are more complex, with certain keywords missing (break) or function calls replaced (add, remove).

Figure 2: LLM-generated bugs, demonstrating complex errors.

3.2.3 Data Filtering and Cleaning

To clean the data, the following steps were taken:
• Formatting: The LeetCode code snippets were formatted using IntelliJ IDEA to fit a standard format. This ensures that CodeBLEU tokenization, which is based on whitespace characters, is consistent. LLM output (that is, the debugged code) also seems to follow a standard format that the model has learned, and we try to make sure that the inputs match that as well.

• Description Filter: All datapoints without descriptions are removed from consideration. Leetcode solution code snippets are short and contain little information about the problem, typically using variable names such as i, j, k. As such, descriptions are important for the model to determine what exactly constitutes a bug.

• Length Filter: We remove one sample whose solution is long enough to risk generation running over the model's context limit, which would degrade performance.
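These filters reduce to a small amount of bookkeeping over the (description, correct, buggy) records. The sketch below is a hypothetical illustration of the description and length filters; the record fields, the whitespace token estimate, and the length threshold are assumptions rather than our actual values.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    description: str  # natural-language problem description
    correct: str      # ground-truth Java solution
    buggy: str        # procedurally or LLM-generated buggy version

def rough_token_count(text: str) -> int:
    # Crude whitespace proxy for the model's tokenizer; a real pipeline
    # would use the tokenizer that matches the target model.
    return len(text.split())

def filter_samples(samples: list[Sample], max_tokens: int = 3000) -> list[Sample]:
    kept = []
    for s in samples:
        if not s.description.strip():
            continue  # Description Filter: drop datapoints without a description
        if rough_token_count(s.description) + rough_token_count(s.buggy) > max_tokens:
            continue  # Length Filter: drop samples that risk exceeding the context limit
        kept.append(s)
    return kept
```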
3.3 Model Selection

We selected GPT-3.5 Turbo as the primary model for debugging code. The model excels at understanding and generating human-like text, and can thus be enhanced by prompting techniques like chain of thought and few-shot learning; these techniques have been shown to improve its performance on a range of other tasks. It additionally offers a good balance between reasoning ability and pure memorization, which is a significant risk for its larger counterpart, GPT-4. Both models have seen Leetcode problems before, and the higher the parameter count, the more likely the model is to remember the answer from its training data instead of debugging it.

3.4 Baseline and Prompt Design

The baseline for our study is established using the zero-shot, non-chain-of-thought debugging performance of GPT-3.5 Turbo. In this baseline scenario, the model is provided solely with the buggy code and its corresponding problem description. Comparative analysis is conducted between this baseline and three alternative prompts: zero-shot with chain of thought, few-shot without chain of thought, and few-shot with chain of thought. Details of our prompt designs are in Appendix A.3. Note that for few-shot, we provide five examples.
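The four variations differ only in whether the five worked examples are prepended and whether the instructions ask for step-by-step reasoning. The sketch below assembles them from templates condensed from Appendix A.3; the exact wording, helper names, and example structure here are illustrative assumptions, not the verbatim prompts.

```python
BASE_INSTRUCTIONS = (
    "The provided Java code may be buggy. Fix the bug if one exists, "
    "using minimal changes. Do not optimize. Format your code in markdown."
)
COT_INSTRUCTIONS = (
    "Explain the reasoning process, thinking step-by-step, "
    "for identifying and fixing the bug."
)
NO_COT_INSTRUCTIONS = "Do not provide explanation or justification."

def build_prompt(description: str, buggy_code: str,
                 examples: list[tuple[str, str]] | None = None,
                 chain_of_thought: bool = False) -> str:
    """Assemble one of the four prompt variants (zero/few-shot x with/without CoT)."""
    parts = [BASE_INSTRUCTIONS,
             COT_INSTRUCTIONS if chain_of_thought else NO_COT_INSTRUCTIONS,
             description]
    # Few-shot: prepend worked (buggy, fixed) pairs before the code to debug.
    for i, (buggy_example, fixed_example) in enumerate(examples or [], start=1):
        parts.append(f"Example #{i}:\n{buggy_example}")
        parts.append(f"Example Fix #{i}:\n{fixed_example}")
    parts.append(f"buggy code:\n```java\n{buggy_code}\n```")
    return "\n\n".join(parts)
```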
3.5 Evaluation Metrics

Code debugged by the LLM was evaluated on two types of metrics:

• Static evaluation, where the output is scored by its similarity to the ground truth (correct code).

• Dynamic evaluation, where the output is run on test cases to evaluate its correctness.

We discuss both in more detail below.

3.5.1 Static Evaluation

The first metric we use is CodeBLEU:

CodeBLEU = α · BLEU + β · BLEU_weight + γ · Match_ast + δ · Match_df    (1)

CodeBLEU evaluates the similarity between two snippets of code using four factors: the original BLEU score, a weighted n-gram match, a syntactic Abstract Syntax Tree (AST) match, and a semantic data-flow match. We chose CodeBLEU because it more accurately represents the unique properties of code. First, code has a strict format and unambiguous instructions. This is unlike natural language, and thus any metric designed purely for language, including BLEU and ROUGE, would be flawed. Second, code has a limited vocabulary: while variable names can change, the most common tokens are keywords such as int, while, or public. The weighted n-gram match is similar to BLEU, but modified to give more weight to these keywords. This more accurately represents the layout and instructions of the code, rather than the exact variable names used. Lastly, code can be represented as a syntax tree, as opposed to the sequential structure of natural language. The syntactic AST match checks the code structure without taking variable names or values into account at all, and the semantic data-flow match evaluates the inputs and outputs (Ren et al., 2020).

To compute the score for some pair of buggy and debugged code, we label the ground truth C_0, the buggy code C_b, and the debugged code C_d. We can compute two scores:

S_b = CodeBLEU(C_0, C_b)    (2)

S_d = CodeBLEU(C_0, C_d)    (3)

where S_b is the score of the buggy code compared to the ground truth and S_d is the score of the debugged code compared to the ground truth. The net improvement of the debugging is then the change in scores:

∆S = S_d − S_b    (4)

With this value, ∆S > 0 means the model was able to debug effectively, ∆S = 0 means the debugged code is approximately the same as the buggy code, and ∆S < 0 means the debugged code performed worse. When analyzing results, we average ∆S over all tested samples.
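The ∆S protocol is only a few lines once a CodeBLEU-style scorer is available. The sketch below assumes a score_fn(reference, candidate) callable supplied by whatever CodeBLEU implementation is used; the helper names and signatures are hypothetical, not a specific library's API.

```python
from statistics import mean

def delta_s(ground_truth: str, buggy: str, debugged: str, score_fn) -> float:
    """Net improvement Delta-S = S_d - S_b for one sample (Equations 2-4)."""
    s_b = score_fn(ground_truth, buggy)     # S_b: buggy vs. ground truth
    s_d = score_fn(ground_truth, debugged)  # S_d: debugged vs. ground truth
    return s_d - s_b

def average_delta_s(samples, score_fn) -> float:
    """Average Delta-S over (ground_truth, buggy, debugged) triples."""
    return mean(delta_s(gt, b, d, score_fn) for gt, b, d in samples)
```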
This method has two main limitations. First, due to the nature of our bugs, S_b and S_d will both be quite large: the majority of the code will still be identical between the buggy and debugged versions. In instances where the model successfully identifies and corrects the errors, ∆S may be quite small. Conversely, swapping two lines would result in a large dip in ∆S. We therefore use dynamic evaluation as a second metric (discussed in 3.5.2), where the exact ordering of instructions does not matter.

Second, CodeBLEU is precision-based. It evaluates how much of the debugged code is relevant or correct in relation to the reference code. It follows that CodeBLEU penalizes generations not in the ground truth, even if they are possibly useful; an example is the generation of helper functions. To address this, we introduce CodeROUGE, a recall-based variant that instead penalizes code found in the ground truth that is missing from the debugged code. This distinction is crucial, as it focuses on the extent to which the model's output captures all the relevant aspects of the reference solution. Similar to ROUGE (Recall-Oriented Understudy for Gisting Evaluation, Lin (2004)), which has been effectively used in evaluating text summarization, CodeROUGE offers an alternative perspective on a model's debugging capabilities. This is especially important in cases where the bug is a single missing character. As such, CodeROUGE serves as a complementary metric to CodeBLEU.

Computing CodeROUGE is quite simple: we swap out the various matches in CodeBLEU for recall-based matches, as opposed to precision-based ones. Consider the simplified case of BLEU, where we define C(s, y) to be the number of times s appears as a substring of y, with y the reference (the ground truth) and ŷ the candidate (the buggy or debugged code). Then the modified n-gram precision can be written as

p(ŷ, y) = Σ_s min(C(s, ŷ), C(s, y)) / Σ_s C(s, ŷ)    (5)

and the modified n-gram recall, which we use for CodeROUGE, can be written as

r(ŷ, y) = Σ_s min(C(s, ŷ), C(s, y)) / Σ_s C(s, y)    (6)

Note that the only difference is that we count over the reference in the denominator. We perform similar replacements in the remaining three factors to compute CodeROUGE.

As we have both precision and recall, we can also compute an F1 score. This score, which we call CodeF1, offers a balanced number that does not penalize extra code as heavily, but also does not ignore syntactically important characters missing from the generated code. We define CodeF1 as a sum of four factors as well, where each factor is the harmonic mean of its recall-based and precision-based variants. For the modified n-gram match, for example, we define the corresponding component:

F1(ŷ, y) = 2 · p′(ŷ, y) · r′(ŷ, y) / (p′(ŷ, y) + r′(ŷ, y))    (7)

where p′ and r′ are the full precision- and recall-based scores, including both the geometric mean over multiple n-gram orders and the brevity penalty. We then perform a weighted sum over the four components, as in CodeBLEU.
based technique that instead penalizes code found over the four components like CodeBLEU.
in ground truth that is missing in the debugged
code. This distinction is crucial as it focuses on 3.5.2 Dynamic Evaluation
the extent to which the model’s output captures all
Beyond the static analysis provided by CodeBLEU,
the relevant aspects of the reference solution. Sim-
CodeROUGE, and CodeF1, we incorporate dy-
ilar to ROUGE (Recall-Oriented Understudy for
namic analysis, specifically focusing on the per-
Gisting Evaluation, Lin (2004)), which has been
centage of test cases passed by actually running the
effectively used in evaluating text summarization,
code. This aspect of our methodology is particu-
CodeROUGE offers an alternative perspective on a
larly important, as it addresses a key limitation in
model’s debugging capabilities. This is especially
static code evaluations. In instances where GPT
important in cases where the bug is a single miss-
has made substantial modifications to the original
ing character. As such, CodeROUGE serves as a
code, these changes often result in lower scores
complementary metric to CodeBLEU.
from static evaluation metrics, despite potentially
Computing CodeROUGE is quite simple: We
better runtime performance or correctness. By ex-
swap out the various matches in CodeBLEU for a
ecuting the code and measuring its performance
recall-based match, as opposed to precision. Con-
against a set of test cases, we can more accurately
sider the simplified case of BLEU, where we define
assess its functional correctness. This dynamic
C(s, y) to be the number of times s appears at a
analysis, therefore, serves as a crucial counterbal-
substring of y, y as the reference (the ground truth)
ance to static metrics, ensuring that our evaluation
and ŷ as the candidate (the buggy code). Then the
of the model’s debugging effectiveness is not only
modified n-gram precision can be written as:
based on syntactic and semantic correctness but
Σs min(C(s, ŷ), C(s, y)) also on the practical, real-world functionality of
p(ŷ, y) = (5) the debugged code.
Σs C(s, ŷ)
Status               Score
Accepted             1
Compilation Error    0
Other                Proportion of test cases passed (e.g., 0.5 for 30/60 passed)

Table 1: Dynamic Evaluation Scores

To evaluate performance at runtime, we developed an automated script that takes each sample and tests both its debugged and its buggy version on Leetcode's website. The resulting status is scraped and converted to a score based on Table 1. These scores have a minimum of 0 and a maximum of 1. As with CodeBLEU, the exact value does not matter; instead, we wish to investigate whether the debugged code did better on average. Therefore, we compute the difference

∆S_LC = S_LC,d − S_LC,b    (8)

where S_LC,d and S_LC,b are the Leetcode scores from Table 1 for the debugged and buggy code, respectively. As with the static ∆S, a positive number represents improvement.
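A minimal sketch of the Table 1 scoring rule and Equation 8 follows, assuming the submission and scraping layer is handled elsewhere; the Submission fields and status strings are assumptions about what such a scraper would return.

```python
from dataclasses import dataclass

@dataclass
class Submission:
    status: str        # e.g. "Accepted", "Compilation Error", "Wrong Answer"
    passed: int = 0    # test cases passed, for non-accepted runs
    total: int = 1     # total test cases

def leetcode_score(sub: Submission) -> float:
    """Map a scraped submission result to the [0, 1] score in Table 1."""
    if sub.status == "Accepted":
        return 1.0
    if sub.status == "Compilation Error":
        return 0.0
    return sub.passed / sub.total  # e.g. 0.5 for 30/60 passed

def delta_s_lc(debugged: Submission, buggy: Submission) -> float:
    """Equation 8: positive means the debugged code did better at runtime."""
    return leetcode_score(debugged) - leetcode_score(buggy)
```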
4 Results & Discussion

4.1 Static Evaluation Results

We evaluated each of the four prompts using CodeBLEU, CodeROUGE, and CodeF1 over a dataset of approximately 1700 code samples; the results are located in Table 2. For each prompt type, we inserted the relevant buggy code and problem description into the template listed in Appendix A.3. After trimming the results and verifying that the output is valid, it is evaluated using the three metrics. These scores are then averaged over all samples; recall that positive numbers signify improvements, while negative numbers signify worse outputs.

          Zero-Shot, no CoT    Zero-Shot, with CoT
∆S_CB          -0.1072               -0.2712
∆S_CR          -0.1143               -0.2828
∆S_CF1         -0.1123               -0.2801

          Few-Shot, no CoT     Few-Shot, with CoT
∆S_CB          -0.1291               -0.2353
∆S_CR          -0.1374               -0.2475
∆S_CF1         -0.1352               -0.2445

Table 2: Static Evaluation Scores

Our first observation is that, generally speaking, outputs are more dissimilar than inputs; that is, the model is generally making the code worse. Additionally, the change is relatively large, with scores differing by around 0.2 on average. We plot the distributions of our baseline and few-shot with chain-of-thought CodeF1 scores in black (Figure 3). We notice a large and heavy tail which seems to be pulling down the average. Further analysis reveals that the vast majority of these were outputs that had a different number of lines of actual code than the input; that is, after removing blank lines, the debugged code had been extensively modified by the model. Removing these "mismatched" points, which account for around 35% of our output, shrinks the tail significantly and results in far better scores (Table 3); the filtered results are shown in green. This behavior occurs with the other prompts as well. As a result, we make the observation that GPT-3.5 Turbo is a very opinionated model: it likes to write code a certain way, and even when asked to only debug existing code, it will often rewrite portions to fit its understanding. As CodeBLEU, CodeROUGE, and CodeF1 do not take this into account, it results in lower scores. Using an alternative model, such as GPT-4, which has been trained on a more extensive dataset, is a potential solution to this problem. Alternatively, a more targeted prompt may be able to limit GPT's formatting opinions, although the extent to which this can help is unknown.

Figure 3: CodeF1 Score Distribution

                   Zero-Shot, no CoT    Zero-Shot, CoT
∆S_CF1                  -0.1123              -0.2801
∆S_CF1, Filtered        -0.0207              -0.0291

                   Few-Shot, no CoT     Few-Shot, CoT
∆S_CF1                  -0.1352              -0.2445
∆S_CF1, Filtered        -0.0219              -0.0299

Table 3: ∆S_CF1 before/after mismatch removal
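The "mismatched" filter behind Table 3 is a simple line-count comparison after blank lines are removed. A sketch (whitespace-only lines are treated as blank; whether comment-only lines count as "actual code" is an assumption):

```python
def code_line_count(code: str) -> int:
    """Number of non-blank lines, used as a cheap proxy for 'lines of actual code'."""
    return sum(1 for line in code.splitlines() if line.strip())

def is_mismatched(buggy: str, debugged: str) -> bool:
    """True when the model changed the number of non-blank lines,
    i.e. it likely reformatted or rewrote the code rather than minimally debugging it."""
    return code_line_count(buggy) != code_line_count(debugged)

# The filtered rows of Table 3 would then average Delta-S only over samples
# where is_mismatched(...) is False.
```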
Secondly, we compare results with and without chain-of-thought, and with and without few-shot learning. Surprisingly, chain-of-thought seems to perform worse when added, across all metrics, with a lower ∆S score than the non-CoT counterparts, whether with or without few-shot. We hypothesize that this is primarily a result of formatting issues. After removing the mismatched points, the scores with and without CoT are much closer. GPT-3.5 also reformatted more samples than were filtered: only samples with mismatched line counts were removed, while many samples had reorganized contents within the same number of lines. However, it is difficult to filter out more than this; the line between debugging and reformatting is blurry in many cases, and even a human would have a hard time distinguishing them.

Another possible explanation is that chain-of-thought "distracts" the model from the output format it is supposed to copy. It is reasonable to believe that in the training distribution, explanations of code are typically followed by code formatted in a standard way. The model then has a higher probability of simply following the format of its training distribution rather than the format it has been given.

Our second comparison is between few-shot and zero-shot. The results seem to be mixed: when the model receives a chain-of-thought prompt, few-shot seems to improve results; however, without CoT, zero-shot performs better. We could not find any numerical explanation for why this is the case. We hypothesize that it may be a combination of two factors. First, with CoT, the model learns what kind of errors to look for. The few-shot examples serve their purpose of showing the model what it should be focusing on, and therefore improve results. In zero-shot, on the other hand, the model is asked to explain without examples. It then draws on its training distribution instead, which contains much more complex bugs, as shown in 3.2.2 and Figure 2. It tries to find complex explanations for a simple bug, and ends up attempting to "fix" valid code.

Lastly, we analyze the difference between CodeBLEU and CodeROUGE. For the most part, the two behave quite similarly, with ∆S for CodeROUGE being marginally lower than its CodeBLEU counterpart. This roughly means that the model's output is shorter and more conservative. Investigating samples, however, does not yield significant observations; a regression between the CodeBLEU and CodeROUGE scores also yields an R² ≈ 0.9975, showing that the two are highly correlated. We will primarily use CodeF1, as it is able to represent both metrics well without significant information loss.

4.2 Dynamic Evaluation Results

We now move on to dynamic evaluation. Due to limitations of the Leetcode website, these evaluations take significantly longer. The scores (Table 4) are therefore averaged over a random selection of around 70 problems. Note that all four prompt variations use the same selection of problems.

          Zero-Shot, no CoT    Zero-Shot, with CoT
∆S_LC          -0.0109               0.0167

          Few-Shot, no CoT     Few-Shot, with CoT
∆S_LC          -0.0170               0.0137

Table 4: Dynamic Eval Scores

The results for the Leetcode evaluation are much closer to the hypothesis: when CoT is included, we achieve positive scores, showing that there is some improvement in model performance. That is, even though the CodeF1 scores were worse than their non-CoT counterparts, the CoT prompts are able to generate code that fixes bugs more effectively. This discrepancy can be explained by the formatting errors mentioned in the previous section: formatting affects CodeF1 scores, but not code execution flow. However, we note the smaller sample size of the Leetcode evaluation, so this number may need to be refined.

Secondly, we note that few-shot seems to decrease performance across the board. Investigating the samples, few-shot seems to be worse at locating syntax errors than its zero-shot counterpart. However, as there are too few samples to work with, this may be a result of chance rather than a meaningful correlation.

4.3 LLM Generated Bugs

Lastly, we run a quick analysis of LLM-generated bugs to investigate their feasibility. The bugs are generated according to Section 3.2.2. We then ask the model to debug the code it generated, using the same prompts as before. We notice a score improvement of around 0.07 points over the procedural bugs in all four categories. Plotting the scores additionally resulted in a much lighter tail than before. We believe this has two primary causes. First, the bugs that the procedural algorithm produces are more likely to be out-of-distribution, whereas GPT-generated bugs are, by definition, in-distribution. Second, the lighter tail could be caused by formatting: GPT is unlikely to reformat code that it generated itself, resulting in higher scores across the board. It is evident that, in order to evaluate this performance correctly, we will need to use a different language model to minimize the effects of a favorable generated bug distribution.

4.4 Evaluation Metrics

Overall, the CodeROUGE/CodeBLEU/CodeF1 metrics seem to be too sensitive to formatting to accurately represent code performance. Leetcode, on the other hand, only works on smaller datasets due to time constraints. Additionally, we noticed that Leetcode is a very binary metric, with the majority of samples receiving a 1 or a 0. Code that compiles, for example, often receives a 0 due to array access errors, which leaves very little room for partial correctness. It may be useful to investigate other potential avenues for evaluation metrics. For example, approaching this problem from a background of formal verification, which is exclusively focused on the correctness of code, may yield insights in this direction.

4.5 Related Work

4.5.1 Chain-of-Thought Assists LLM Reasoning

Liu et al. (2023b) and Wei et al. (2023) investigate the potential of chain-of-thought prompts in making complex reasoning more accessible to LLMs. They share a common goal of enhancing the model's ability to perform tasks that necessitate advanced logical and sequential thinking. This research aligns with our work by emphasizing the importance of step-by-step reasoning and logical problem solving, a key aspect we explore in debugging tasks with large language models. Their approach of using chain-of-thought instructions to facilitate complex reasoning tasks offers insights into potential methods for enhancing debugging skills in language models.

In addition, the paper "Improving ChatGPT Prompt for Code Generation" explored various prompt design strategies for enhancing code generation (Liu et al., 2023a). In particular, they recursively improve prompts by repeatedly modifying and combining their best-performing prompts. This is a potential path for further exploration.

4.5.2 Pretrained Models for Coding Tasks

The papers "CodeBERT: A Pre-Trained Model for Programming and Natural Languages" (Feng et al., 2020) and "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation" (Wang et al., 2021) demonstrate two LLMs trained on coding tasks, enhancing their understanding of code structures. CodeT5 uses a unified pre-training approach, combining denoising sequence-to-sequence training with identifier-aware tasks. CodeBERT's training approach enables it to comprehend code in a way that mirrors human developers. Both of these models are promising targets for further evaluation.

4.5.3 Code Completion Fails with Bugs

Dinh et al. (2023) reveal a significant challenge for Large Language Models (LLMs) like CodeGen and InCoder in code completion tasks. Specifically, LLMs' performance sharply declines in scenarios where the context contains buggy code. These insights suggest that the difficulty observed in GPT-3.5 on debugging tasks may be linked to an inherent struggle with handling the flawed code segments the models are asked to debug, a problem that persists even in models specifically fine-tuned for coding tasks.

5 Conclusion

This research provides insights into the limitations of Large Language Models (LLMs), particularly GPT-3.5, in performing debugging tasks. Despite employing various prompt engineering techniques, including few-shot learning and chain-of-thought prompting, our results indicate that these models struggle to effectively debug code. The study highlights the challenges faced by LLMs, such as their tendency to reformat code, which impedes their debugging efficiency. The use of metrics like CodeF1 and the Leetcode evaluation quantifies these limitations and demonstrates potential avenues for improvement. These findings underscore the need for a more nuanced approach to leveraging LLMs in software development, particularly in tasks requiring precise and logical code correction. Moving forward, there is a clear opportunity to explore alternative models, refine prompt engineering methods, and integrate more sophisticated evaluation metrics to enhance the debugging capabilities of LLMs. We hope this work serves as a foundation for further research, guiding future efforts towards developing LLMs that can more effectively assist in complex programming tasks like debugging.
6 Acknowledgements

We extend our gratitude to Dr. Swabha Swayamdipta for offering crucial guidance and insights that shaped our methodology. Our appreciation also goes to Avi Thawani, whose recommendations on evaluation metrics beyond CodeBLEU were invaluable. We are grateful to Mozhdeh Gheini for steering the direction of our project.

References

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Tuan Dinh, Jinman Zhao, Samson Tan, Renato Negrinho, Leonard Lausen, Sheng Zha, and George Karypis. 2023. Large language models of code fail at completing code with potential bugs.

Li Fei-Fei, R. Fergus, and P. Perona. 2006. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611.

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A pre-trained model for programming and natural languages.

Anas Imloul. 2023. Leetcode solutions.

Herb Krasner. 2021. The cost of poor software quality in the US: A 2020 report. Technical report, Consortium for Information & Software Quality (CISQ), USA.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Chao Liu, Xuanlin Bao, Hongyu Zhang, Neng Zhang, Haibo Hu, Xiaohong Zhang, and Meng Yan. 2023a. Improving ChatGPT prompt for code generation.

Hanmeng Liu, Zhiyang Teng, Leyang Cui, Chaoli Zhang, Qiji Zhou, and Yue Zhang. 2023b. LogiCoT: Logical chain-of-thought instruction-tuning data collection with GPT-4.

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: A method for automatic evaluation of code synthesis.

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2023. Code Llama: Open foundation models for code.

Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation.

Yue Wang, Weishi Wang, Shafiq Joty, and Steven C. H. Hoi. 2021. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models.
A Appendix: Dataset Details

Generated bugs fall into one of five categories, or the negative case. Examples of each of the five cases are shown below.

A.1 Procedural Bugs

1. Replace array index with another value

Figure 4: Replace array index with another value

2. Remove a syntactically important character

Figure 5: Remove a syntactically important character

3. Replace a math operator with a random other operator

Figure 6: Replace a math operator with a random other operator

4. Replace or modify a hardcoded int

Figure 7: Replace or modify a hardcoded int

5. Replace a boolean comparison operator

Figure 8: Replace a boolean comparison operator
A.2 Prompt for GPT Bug Generation

Below is the prompt we used to ask GPT-3.5 Turbo to generate bugs given some code.

Below is a java code:
// code //
Here is the problem that the code is solving.
// problem //
Given that this code have the correct implementation of the problem, your job now is to introduce one bug in this code that could cause the compile time error/ runtime error. Make sure not to indicate where you make the bug so that some one else can test their ability of debugging.
When generating the bug, try to create bugs that are likely made by general programmers.

A.3 Prompt Design

For our prompts, which are provided below, we make a few observations. First, we use the phrase "may be buggy". This reminds the model that the code may not have a bug and that it should not look for bugs that do not exist, as we have negative samples. Second, we try to prevent the model from rewriting code using the phrases "using minimal changes" and "do not optimize". For CoT, we ask the model to explain its reasoning, while telling the model explicitly not to explain when not using CoT. Lastly, we ask it to format the code in markdown for easier parsing.
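Since every prompt asks for markdown-formatted output, the debugged code can be pulled out of the response with a small amount of parsing before trimming and evaluation. A possible sketch follows; the fallback behaviour for responses without a fenced block is an assumption.

```python
import re

# Matches a fenced code block, optionally tagged as ```java.
CODE_FENCE = re.compile(r"```(?:java)?\s*\n(.*?)```", re.DOTALL)

def extract_debugged_code(response: str) -> str:
    """Return the last fenced code block in the model response.

    CoT responses may contain reasoning text before the code, so we take the
    final block; if no fence is found, fall back to the raw response."""
    blocks = CODE_FENCE.findall(response)
    return blocks[-1].strip() if blocks else response.strip()
```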
1. Zero-shot, without CoT (Baseline)

The provided Java code may be buggy. Fix the bug if one exists, using minimal changes. Do not reorganize. Do not optimize. Do not provide explanation or justification. Format your code in markdown.
<Problem Description>
```java
<CODE>
```

2. Zero-shot, with CoT

The provided Java code may be buggy. Fix the bug if one exists, using minimal changes. Explain the reasoning process, thinking step-by-step, for identifying and fixing the bug. Do not optimize. Do not provide explanation or justification. Format your code in markdown.
<Problem Description>
```java
<CODE>
```

3. Few-shot, without CoT

The provided Java code may be buggy. Fix the bug if one exists, using minimal changes. Do not optimize. Do not provide explanation or justification. Format your code in markdown.
<Problem Description>
Example #1: <CODE>
Example Fix #1: <CODE>
<Four other examples and fixes>
buggy code:
```java
<CODE>
```

4. Few-shot, with CoT

The provided Java code may be buggy. Fix the bug if one exists, using minimal changes. Do not optimize. Do not provide explanation or justification. Format your code in markdown.
<Problem Description>
Example #1: <CODE>
Example Fix #1: <CODE>
Explanation for the fix
<Four other examples and fixes>
buggy code:
```java
<CODE>
```

For <Problem Description>, both a description of the problem context in natural language and constraints that specify the input and output of the expected behaviour of the program are provided. Below is an example that demonstrates this:

...
Code Description:
The function repeatChar takes a character c and an integer times, and returns a string consisting of the character c repeated times times.
Constraints:
times >= 0
c is a valid character
...

For <Example> and <Example Fix>, we provide a buggy code and how the bug is fixed in a correct version. One example is shown below:

...
Code:
```java
class Solution {
    public int findMax(int[] nums) {
        int max = nums[0];
        for (int i = 1; i <= nums.length; i++) {
            if (nums[i] > max) {
                max = nums[i];
            }
        }
        return max;
    }
}
```
Fix:
```java
class Solution {
    public int findMax(int[] nums) {
        int max = nums[0];
        for (int i = 1; i < nums.length; i++) {
            if (nums[i] > max) {
                max = nums[i];
            }
        }
        return max;
    }
}
```
...

For few-shot with chain of thought, we also provide explanations for where the bug is and why it should be fixed according to the sample fix. Specifically, our explanations answer four key questions:

• What is the bug?

• Where is it?

• Why does our code lead to a bug?

• How do we fix it?

Below is the explanation for the previous example:

...
Example #1: <CODE>
Example Fix #1: <CODE>
Explanation:
The original code causes an 'ArrayIndexOutOfBoundsException' due to the loop condition 'i <= nums.length', which attempts to access an index out of the array's bounds. In Java, array indices range from 0 to 'length - 1'. The fix is changing the loop condition to 'i < nums.length', ensuring the loop iterates only within the array's valid range.
...
