Code Summarization Using Large Language Models
Weisong Sun 1,2, Yun Miao 1, Yuekang Li 3, Hongyu Zhang 4, Chunrong Fang 1,
Yi Liu 2, Gelei Deng 2, Yang Liu 2, Zhenyu Chen 1
1 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
2 College of Computing and Data Science, Nanyang Technological University, Singapore
3 School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
4 School of Big Data and Software Engineering, Chongqing University, Chongqing, China
[email protected], [email protected], [email protected], [email protected], [email protected],
[email protected], [email protected], [email protected], [email protected]
stand the aspects of code summarization garnering attention in the era of LLMs, but they still have some limitations. First, most of them focus on only one prompting technique, while some advanced prompting techniques have not been investigated and compared (e.g., chain-of-thought prompting). For example, Sun et al. [27] solely focus on zero-shot prompting, while several other studies [25], [28], [30] only focus on few-shot prompting. Second, they overlook the impact of the model settings (i.e., parameter configuration) of LLMs on their code summarization capabilities. There is no empirical evidence showing that LLMs perform well under all model settings. Last but not least, these studies follow prior code summarization studies [4], [31], [32] and evaluate the quality of summaries generated by LLMs by computing the text similarity (e.g., BLEU [33], METEOR [34], and ROUGE-L [35]) or semantic similarity (e.g., SentenceBERT-based cosine similarity [36]) between the LLM-generated summaries and the reference summaries, as detailed in Section IV-A. However, prior research by Sun et al. [27] has shown that, compared to traditional code summarization models, the summaries generated by LLMs differ significantly from reference summaries in expression and tend to describe more details. Consequently, whether these traditional evaluation methods are suitable for assessing the quality of LLM-generated summaries remains unknown.

To address these issues, in this paper we conduct a systematic study on code summarization in the era of LLMs, which covers various aspects involved in the LLM-based code summarization workflow. Considering that the choice of evaluation methods directly impacts the accuracy and reliability of the evaluation results, we first systematically investigate the suitability of existing automated evaluation methods for assessing the quality of summaries generated by LLMs (including CodeLlama-Instruct, StarChat-β, GPT-3.5, and GPT-4). Specifically, we compare multiple automated evaluation methods (including methods based on summary-summary text similarity, summary-summary semantic similarity, and summary-code semantic similarity) with human evaluation to reveal their correlation. Inspired by work in NLP [37]–[39], we also explore the possibility of using the LLMs themselves as evaluation methods. The experimental results show that among all automated evaluation methods, the GPT-4-based evaluation method overall has the strongest correlation with human evaluation. Second, we conduct comprehensive experiments on datasets in three widely used programming languages (Java, Python, and C) to explore the effectiveness of five prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert) in adapting LLMs to the code summarization task. The experimental results show that the optimal choice of prompting technique varies across LLMs and programming languages. Surprisingly, the more advanced prompting techniques expected to perform better may not necessarily outperform simple zero-shot prompting. For instance, when the base LLM is GPT-3.5, zero-shot prompting overall outperforms the other four more advanced prompting techniques on the three datasets. Then, we investigate the impact of two key model settings/parameters, top_p and temperature, on LLMs' code summarization performance. These two parameters may affect the randomness of generated summaries. The results demonstrate that the effect of top_p and temperature on summary quality varies depending on the base LLM and programming language. As alternative parameters, they exhibit a similar impact on the quality of LLM-generated summaries. Furthermore, unlike existing studies that simply experimented with multiple programming languages, we reveal the differences in the code summarization capabilities of LLMs across five types of programming languages (procedural, object-oriented, scripting, functional, and logic programming languages), encompassing ten programming languages: Java, Python, C, Ruby, PHP, JavaScript, Go, Erlang, Haskell, and Prolog. The Erlang, Haskell, and Prolog datasets are built by ourselves, and we make them public to the community. We find that across all five types of programming languages, LLMs consistently perform the worst in summarizing code written in logic programming languages. Finally, we investigate the ability of LLMs to generate summaries of different categories, including What, Why, How-to-use-it, How-it-is-done, Property, and Others. The results reveal that the four LLMs perform well in generating distinct categories of summaries. For example, CodeLlama-Instruct excels in generating Why and Property summaries, while GPT-4 is good at generating What, How-it-is-done, and How-to-use summaries.

Our comprehensive research findings will assist subsequent researchers in quickly and deeply understanding the various aspects involved in the LLM-based code summarization workflow, as well as in designing advanced LLM-based code summarization techniques for specific fields.

In summary, we make the following contributions.
• To the best of our knowledge, we conduct the first investigation into the feasibility of applying LLMs as evaluators to assess the quality of LLM-generated summaries.
• We conduct a thorough study of code summarization in the era of LLMs, covering multiple aspects of the LLM-based code summarization workflow, and come up with several novel and unexpected findings and insights. These findings and insights can benefit future research on and practical usage of LLM-based code summarization.
• We make our dataset and source code publicly accessible [40] to facilitate the replication of our study and its application in extensive contexts.

II. BACKGROUND AND RELATED WORK

Code summarization is the task of automatically generating natural language summaries (also called comments) for code snippets. Such summaries serve various purposes, including but not limited to explaining the functionality of code snippets [8], [28], [41]. Research on code summarization can be traced back to as early as 2010, when Haiduc et al. [42] introduced automated text summarization technology to summarize source code. Later on, following the significant success of neural machine translation (NMT) research in the field of NLP [43], [44], a large number of researchers
migrate its underlying encoder-decoder architecture to code summarization tasks [4], [9], [12], [45]–[47]. In the past two years, research on LLM-based code summarization has mushroomed. Fried et al. [48] introduce an LLM called InCoder and evaluate it in a zero-shot setting on the CodeXGLUE [49] Python dataset. InCoder achieves impressive results, but fine-tuned small PLMs like CodeT5 can still outperform the zero-shot setting. Ahmed et al. [25] investigate the effectiveness of few-shot prompting in adapting LLMs to code summarization and find that it can make Codex significantly outperform fine-tuned small PLMs (e.g., CodeT5). Given the concern of potential code asset leakage when using commercial LLMs (e.g., GPT-3.5), Su et al. [50] utilize knowledge distillation technology to distill small models from LLMs (e.g., GPT-3.5). Their experimental findings reveal that the distilled small models can achieve code summarization performance comparable to LLMs. Gao et al. [30] investigate the optimal settings for few-shot learning, including few-shot example selection methods, few-shot example order, and the number of few-shot examples. Geng et al. [28] investigate LLMs' ability to address multi-intent comment generation. Ahmed et al. [51] propose to enhance few-shot samples with semantic facts automatically extracted from the source code. Sun et al. [27] design several heuristic questions to collect the feedback of ChatGPT, thereby finding an appropriate prompt to guide ChatGPT to generate in-distribution code summaries. Rukmono et al. [52] address the unreliability of LLMs in performing reasoning by applying a chain-of-thought prompting strategy. Recently, some studies [11], [26], [53] have also investigated the applicability of Parameter-Efficient Fine-Tuning (PEFT) techniques to code summarization tasks. In this paper, we focus on uncovering the effectiveness of various prompting techniques in adapting LLMs to code summarization without fine-tuning.

III. STUDY DESIGN

A. Research Questions

This study aims to answer the following research questions:
RQ1: What evaluation methods are suitable for assessing the quality of summaries generated by LLMs? Existing research on LLM-based code summarization [25], [28], [30] widely follows earlier studies [32], [36] and employs automated evaluation metrics (e.g., BLEU) to evaluate the quality of LLM-generated summaries. However, recent studies [27], [50] have shown that LLM-generated summaries surpass reference summaries in quality. Therefore, evaluating LLM-generated summaries based on their text or semantic similarity to reference summaries may not be appropriate. This RQ aims to discover a suitable method for automated assessment of the quality of LLM-generated summaries.
RQ2: How effective are different prompting techniques in adapting LLMs to the code summarization task? This RQ aims to unveil the effectiveness of several popular prompting techniques (e.g., few-shot and chain-of-thought) in adapting LLMs to code summarization tasks.
RQ3: How do different model settings affect LLMs' code summarization performance? To better meet diverse user needs, LLMs typically offer configurable parameters (i.e., model settings) that allow users to control the randomness of model behaviour. In this RQ, we adjust the randomness of the generated summaries by modifying LLMs' parameters and examine the impact of different model settings on the performance of LLMs in generating code summaries.
RQ4: How do LLMs perform in summarizing code snippets written in different types of programming languages? Programming languages are diverse in type (e.g., object-oriented and functional programming languages), and their implementations of the same functional requirements can be similar or entirely different. The scale of programs implemented with them in Internet/open-source repositories also varies, which may result in differences in LLMs' mastery of these languages. Hence, this RQ aims to reveal the differences in LLMs' capabilities to summarize code snippets across diverse programming language types.
RQ5: How do LLMs perform on different categories of summaries? Previous research [3], [54], [55] has shown that summaries can be classified into various categories according to developers' intentions, including What, Why, How-to-use-it, How-it-is-done, Property, and Others. Therefore, in this RQ, we aim to explore the ability of LLMs to generate summaries of different categories.

B. Experimental LLMs

We select four LLMs as experimental representatives.
CodeLlama-Instruct. Code Llama [56] is a family of LLMs for code based on Llama 2 [57]. It provides multiple flavors to cover a wide range of applications: foundation models, Python specializations (Code Llama-Python), and instruction-following models (Code Llama-Instruct) with 7B, 13B, and 34B parameters. Our study utilizes Code Llama-Instruct-7B.
StarChat-β. StarChat-β [58] is an LLM with 16B parameters fine-tuned on StarCoderPlus [59]. Compared with StarCoderPlus, StarChat-β excels in chat-based coding assistance.
GPT-3.5. GPT-3.5 [60] is an LLM provided by OpenAI. It is trained on massive amounts of text and code, and can understand and generate natural language or code.
GPT-4. GPT-4 is an improved version of GPT-3.5, which can solve difficult problems with greater accuracy. OpenAI has not disclosed the specific parameter scales of GPT-3.5 and GPT-4. Our study uses gpt-3.5-turbo and gpt-4-1106-preview.
Model Settings. Apart from RQ3, where we investigate the impact of model settings, we uniformly set the temperature to 0.1 to minimize the randomness of LLMs' responses and highlight the impact of evaluation methods/prompting techniques/programming language types/summary categories.

C. Prompting Techniques

We compare the five commonly used prompting techniques below.
Zero-Shot. Zero-shot prompting adapts LLMs to downstream tasks using simple instructions. In our scenario, the input to LLMs consists of a simple instruction and a code snippet to be summarized. We expect LLMs to output a natural language summary of the code snippet. Therefore, we
follow [27] and adopt the input format: Please generate a short comment in one sentence for the following function: ⟨code⟩.
Few-Shot. Few-shot prompting (also known as in-context learning [28], [30]) provides not only a straightforward instruction but also some examples when adapting LLMs to downstream tasks. The examples serve as conditioning for subsequent examples where we would like LLMs to generate a response. In our scenario, the examples are pairs of ⟨code snippet, summary⟩. According to the findings of Gao et al. [30], we set the number of examples to 4 to achieve a balance between LLMs' performance and the cost of calling the OpenAI API.
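As an illustration, the few-shot input described above could be assembled for a chat-style LLM as in the minimal sketch below. This is our own simplification, not the exact prompt template used in the study; the model name and the example pairs are placeholders, and the instruction reuses the zero-shot format.

```python
# Sketch: assembling a few-shot prompt from four <code snippet, summary> example pairs.
# Model name and examples are illustrative placeholders, not the study's exact artifacts.
from openai import OpenAI

client = OpenAI()
INSTRUCTION = "Please generate a short comment in one sentence for the following function:"

def few_shot_summarize(code: str, examples: list[tuple[str, str]]) -> str:
    messages = []
    for example_code, example_summary in examples[:4]:  # 4 examples, following Gao et al. [30]
        messages.append({"role": "user", "content": f"{INSTRUCTION}\n{example_code}"})
        messages.append({"role": "assistant", "content": example_summary})
    messages.append({"role": "user", "content": f"{INSTRUCTION}\n{code}"})
    response = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0.1)
    return response.choices[0].message.content
```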
Chain-of-Thought. Chain-of-thought prompting adapts LLMs to downstream tasks by providing intermediate reasoning steps [61]. These steps enable LLMs to possess complex reasoning capabilities. In this study, we follow Wang et al. [62] and apply chain-of-thought prompting to the code summarization task through the following four steps:
(1) Instruction 1: Input the code snippet and five questions about the code in the format "Code: \n⟨code⟩ Question: \n⟨Q1⟩\n⟨Q2⟩\n⟨Q3⟩\n⟨Q4⟩\n⟨Q5⟩\n".
(2) Get the LLM's response to Instruction 1, i.e., Response 1.
(3) Instruction 2: "Let's integrate the above information and generate a short comment in one sentence for the function."
(4) Get the LLM's response to Instruction 2, i.e., Response 2. Response 2 contains the comment generated by the LLM for the code snippet.
When asking Instruction 2, Instruction 1 and Response 1 are paired as a history prompt and answer and input into the LLM.
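These four steps amount to a two-turn conversation in which Instruction 1 and Response 1 are replayed as history when Instruction 2 is issued. The sketch below is our own simplification; the five questions and the model name are placeholders.

```python
# Sketch of the chain-of-thought interaction: Instruction 1 and Response 1 stay in the
# message history when Instruction 2 is issued. Q1-Q5 and the model name are placeholders.
from openai import OpenAI

client = OpenAI()

def cot_summarize(code: str, questions: list[str]) -> str:
    instruction_1 = "Code: \n" + code + "\nQuestion: \n" + "\n".join(questions) + "\n"
    history = [{"role": "user", "content": instruction_1}]
    response_1 = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=history, temperature=0.1
    ).choices[0].message.content
    history.append({"role": "assistant", "content": response_1})

    instruction_2 = ("Let's integrate the above information and generate a short "
                     "comment in one sentence for the function.")
    history.append({"role": "user", "content": instruction_2})
    response_2 = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=history, temperature=0.1
    ).choices[0].message.content
    return response_2  # Response 2 contains the generated comment
```

Critique prompting, described next, follows the same history-replay pattern with three instructions instead of two.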
Critique. Critique prompting improves the quality of LLMs' answers by asking LLMs to find errors in their answers and correct them. We follow Kim et al. [63] and perform critique prompting on the code summarization task through the six steps below:
(1) Instruction 1: Similar to zero-shot prompting, input the instruction and the code snippet in the format "Please generate a short comment in one sentence for the following function: \n⟨code⟩".
(2) Get the LLM's response to Instruction 1, i.e., Response 1. Response 1 contains the temporary comment generated by the LLM for the code snippet.
(3) Instruction 2: "Review your previous answer and find problems with your answer."
(4) Get the LLM's response to Instruction 2, i.e., Response 2.
(5) Instruction 3: "Based on the problems you found, improve your answer."
(6) Get the LLM's response to Instruction 3, i.e., Response 3. Response 3 contains the modified comment, which is the final comment for the code snippet.
When prompting each instruction, the previous instructions and responses are fed into the LLM as pairs of history prompts and answers.
Expert. Expert prompting first asks LLMs to generate a description of an expert who can complete the instruction (e.g., through few-shot prompting), and then the description serves as the system prompt for zero-shot prompting. We use the few-shot examples provided by Xu et al. [64] and employ few-shot prompting to let LLMs generate a description of an expert who can "Generate a short comment in one sentence for a function." This description then replaces the default system prompt of the LLM. By default, we use the system prompt [65] of CodeLlama-Instruct for all LLMs to ensure fairness in comparison. Then, we utilize the same steps as zero-shot prompting to adapt LLMs to generate summaries.
Due to the page limit, we present examples of the aforementioned prompting techniques on our anonymous site [40].

D. Experimental Datasets

The sources of the datasets utilized in our experiments include:
CodeSearchNet (CSN). The CodeSearchNet corpus [66] is a vast collection of methods accompanied by their respective comments, written in Go, Java, JavaScript, PHP, Python, and Ruby. This corpus has been widely used in studying code summarization [4], [67], [68]. We use the clean version of the CSN corpus provided by Lu et al. [49] in CodeXGLUE. We randomly select 200 samples for each programming language from the test set of this corpus for our experiments.
CCSD. The CCSD dataset is provided by Liu et al. [69]. They crawl data from 300+ projects such as Linux and Redis. The dataset contains 95,281 ⟨function, summary⟩ pairs. Similarly, we randomly select 200 samples from the final dataset for our experiments.
In addition to the above two sources, we construct three new language datasets to evaluate LLMs' code summarization capabilities across more programming language types.
Erlang, Haskell, and Prolog Datasets. Erlang and Haskell are Functional Programming languages (FP), and Prolog belongs to Logic Programming languages (LP). To construct the three datasets, we sort the GitHub repositories whose main language is Erlang/Haskell/Prolog by the number of stars, and crawl data from the top 50 repositories. Following Husain et al. [66], (1) we remove any projects that do not have a license or whose license does not explicitly permit the re-distribution of parts of the project. (2) We consider the first sentence in the comment as the function summary. (3) We remove data where the function is shorter than three lines or the comment contains fewer than 3 tokens. (4) We remove functions whose names contain the substring "test". (5) We remove duplicates by comparing the Jaccard similarities of the functions, following Allamanis et al. [70]. Finally, we obtain 7,025/6,759/1,547 ⟨function, summary⟩ pairs, respectively. For each language, we randomly select 200 samples for our experiments.
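As an illustration of step (5), a token-level Jaccard filter could be implemented as sketched below. The tokenizer and the 0.8 threshold are our own assumptions for illustration, not the exact values used in the pipeline.

```python
# Sketch of near-duplicate removal via token-level Jaccard similarity (step (5)).
# The identifier-based tokenizer and the 0.8 threshold are illustrative assumptions.
import re

def tokens(function_code: str) -> set[str]:
    return set(re.findall(r"[A-Za-z_]\w*", function_code))

def deduplicate(functions: list[str], threshold: float = 0.8) -> list[str]:
    kept, kept_tokens = [], []
    for code in functions:  # O(n^2) pairwise check; fine for a sketch
        t = tokens(code)
        is_dup = any(len(t & k) / max(len(t | k), 1) >= threshold for k in kept_tokens)
        if not is_dup:
            kept.append(code)
            kept_tokens.append(t)
    return kept
```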
All in all, our experiments involve 10 programming languages across 5 types. Note that, considering that experiments with LLMs are resource-intensive (especially those involving GPT, which are quite costly), not all experiments are conducted on all 10 programming language datasets. Specifically, we first conduct experiments associated with RQ1 and RQ2
on commonly used programming languages, including Java, Python, and C. Analyzing the results of these two RQs helps find a suitable automated evaluation method and a suitable prompting technique. Subsequent experiments for the other RQs can be built upon these findings, thereby significantly reducing experimental costs. We use all 10 programming languages in the experiments for RQ4. In the experiments for RQ5, we only use the Java dataset because other programming languages lack readily available comment classifiers. While training such classifiers would be valuable, it falls outside the scope of this paper and is left for future exploration.

TABLE I: Datasets. PP: Procedural Programming Languages, OOP: Object-Oriented Programming Languages, SP: Scripting Programming Languages, FP: Functional Programming Languages, LP: Logic Programming Languages.

Language   | Source | Type | Usage
Java       | CSN    | OOP  | RQ1, RQ2, RQ3, RQ4, RQ5
Python     | CSN    | SP   | RQ1, RQ2, RQ3, RQ4
C          | CCSD   | PP   | RQ1, RQ2, RQ3, RQ4
Ruby       | CSN    | SP   | RQ4
PHP        | CSN    | SP   | RQ4
Go         | CSN    | PP   | RQ4
JavaScript | CSN    | SP   | RQ4
Erlang     | by us  | FP   | RQ4
Haskell    | by us  | FP   | RQ4
Prolog     | by us  | LP   | RQ4

IV. RESULTS AND FINDINGS

A. RQ1: What evaluation methods are suitable for assessing the quality of summaries generated by LLMs?

1) Experimental Setup.
Comparison Evaluation Methods. Existing automated evaluation methods for code summarization can be divided into the following three categories.
i. Methods based on summary-summary text similarity assess the quality of the generated summary by calculating the text similarity between the generated summary and the reference summary. This category of methods is the most widely used in existing code summarization research [4], [25], [28], [30]. The text similarity metrics involved include BLEU, METEOR, and ROUGE-L, which compare the counts of n-grams in the generated summary against the reference summary. The scores of BLEU, METEOR, and ROUGE-L are in the range of [0, 1]. The higher the score, the closer the generated summary approximates the reference summary, indicating superior code summarization performance. All scores are computed with the same implementation provided by [47].
ii. Methods based on summary-summary semantic similarity evaluate the quality of the generated summary by computing the semantic similarity between the generated summary and the reference summary. Existing research [36] demonstrates that semantic similarity-based methods can effectively alleviate the issues of word overlap-based metrics, where not all words in a sentence have the same importance and many words have synonyms. In this study, we compare four such methods, including BERTScore [71], SentenceBert with Cosine Similarity (SBCS), SentenceBert with Euclidean Distance (SBED), and Universal Sentence Encoder [72] with Cosine Similarity (USECS). They are commonly used in code summarization studies [36], [73], [74]. BERTScore [71] uses a variant of BERT [75] (we use the default RoBERTa-large) to embed every token in the summaries, and computes the pairwise inner product between tokens in the reference summary and the generated summary. It then matches every token in the reference summary and the generated summary to compute the precision, recall, and F1 measure. In our experiments, we report the F1 measure of BERTScore. The other three methods use a pre-trained sentence encoder (SentenceBert [76] or Universal Sentence Encoder [72]) to produce vector representations of the two summary sentences, and then compute the cosine similarity or Euclidean distance of the vector representations. SBCS, SBED, and USECS range within [-1, 1]. Higher values of SBCS and USECS represent greater similarity, while lower values of SBED indicate greater similarity.
iii. Methods based on summary-code semantic similarity assess the quality of the generated summary by computing the semantic similarity between the generated summary and the code snippet to be summarized. Unlike the first two categories, this type of evaluation method does not rely on reference summaries and can effectively avoid issues related to low-quality and outdated reference summaries. SIDE, proposed by Mastropaolo et al. [10], is a representative of this type of method. It provides a continuous score ranging within [-1, 1], where a higher value represents greater similarity. We present the scores reported by the above similarity-based evaluation methods in percentage.
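As an illustration, a summary-summary semantic similarity score such as SBCS can be computed with an off-the-shelf sentence encoder. The sketch below is our own minimal example; the encoder checkpoint name is an assumption rather than the exact model used in the study.

```python
# Minimal sketch of a SentenceBERT-based cosine similarity (SBCS) score between a
# generated summary and a reference summary. The checkpoint "all-MiniLM-L6-v2" is an
# illustrative assumption, not necessarily the encoder used in our experiments.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def sbcs(generated: str, reference: str) -> float:
    """Cosine similarity between sentence embeddings, reported in percentage."""
    emb = encoder.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() * 100  # scores are presented in percent

print(sbcs("Pushes an item onto the top of this stack.",
           "Push an element onto the stack."))
```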
Human Evaluation. We conduct human evaluation as a reference for the automated evaluation methods. Comparing the correlation between the results of automated evaluation methods and human evaluation can facilitate achieving the goal of this RQ, which is to find a suitable automated method for assessing the quality of LLM-generated summaries. To do so, we invite 15 volunteers (including 1 PhD candidate, 5 master's students, and 9 undergraduates) with more than 3 years of software development experience and excellent English ability to carry out the evaluation. For each sample, we provide volunteers with the code snippet, the reference summary, and the summaries generated by the four LLMs, where the reference summary and the summaries generated by the four LLMs are mixed and out of order. In other words, for each sample, volunteers do not know whether a summary is the reference or a summary generated by a certain LLM. We follow Shi et al. [9] and ask volunteers to rate the summaries from 1 to 5 based on their quality, where a higher score represents higher quality. The final score of a summary is the average of the scores rated by the 15 volunteers.
LLM-based evaluation methods. Inspired by recent work in NLP [37]–[39], we also investigate the feasibility of employing LLMs as evaluators. Its advantage is that it does not rely on the quality of reference summaries, and the evaluation steps can be the same as human evaluation. Specifically, similar to human evaluation, when using LLMs as evaluators, for each sample we input the code snippet to be summarized, the reference summary, and the LLM-generated summaries, and ask LLMs to
rate each summary from 1 to 5, where a higher score represents a higher quality of the summary. The specific prompt when using LLMs as evaluators is shown in Figure 2.

Fig. 2: An example of using an LLM as an evaluator.
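As a rough illustration of the mechanics (not the exact prompt of Figure 2), such a rating call could look like the sketch below; the prompt wording and model name are abridged placeholders.

```python
# Sketch of using an LLM as an evaluator: the model rates a summary from 1 to 5.
# The prompt is an abridged placeholder for the full prompt shown in Figure 2.
from openai import OpenAI

client = OpenAI()

def llm_rate_summary(code: str, summary: str, model: str = "gpt-4-1106-preview") -> int:
    prompt = (
        "You will be given a code snippet and one summary of it. Rate the quality of "
        "the summary on a scale of 1 to 5 (higher is better). Reply with a single integer.\n"
        f"Code:\n{code}\n\nSummary:\n{summary}\nScore:"
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scores, as for the GPT-4 evaluator in later RQs
    ).choices[0].message.content
    return int(reply.strip().split()[0])
```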
Datasets and Prompting Techniques. In this RQ, to reduce the workload of the human evaluation volunteers, we randomly select 50 samples each from the Java, Python, and C datasets, i.e., 150 samples in total. We employ few-shot prompting to adapt the four LLMs to generate summaries for code snippets, as recent studies [25], [28], [30] have demonstrated the effectiveness of this prompting technique on code summarization tasks.
2) Experimental Results.

TABLE II: Human evaluation scores for reference and LLM-generated summaries. The value in parentheses represents the percentage increase or decrease relative to the score of the corresponding reference summary.

Summary from       | Java           | Python         | C
Reference          | 3.19           | 3.56           | 3.05
CodeLlama-Instruct | 3.93 (+23.20%) | 3.88 (+8.99%)  | 4.15 (+36.07%)
StarChat-β         | 3.18 (-0.31%)  | 3.14 (-11.80%) | 3.49 (+14.43%)
GPT-3.5            | 4.00 (+25.39%) | 4.16 (+16.85%) | 4.06 (+33.11%)
GPT-4              | 4.17 (+30.72%) | 4.06 (+14.04%) | 4.25 (+39.34%)

Human Evaluation Results. Table II shows the human evaluation scores for reference summaries and the summaries generated by the four LLMs. Observe that the scores of reference summaries on the three datasets are between 3 and 3.5 points, suggesting that the quality of the reference summaries is not very high. Therefore, evaluation methods based on summary-summary/code similarity may not accurately assess the quality of LLM-generated summaries. Among the four LLMs, GPT-4 has the highest scores on the Java and C datasets, and GPT-3.5 attains the highest score on the Python dataset. This suggests that the quality of summaries generated by GPT-3.5 and GPT-4 is relatively high.

☞ Finding ▶ According to human evaluation, the quality of reference summaries in the existing datasets is not particularly high. Summaries from general-purpose LLMs (e.g., GPT-3.5) excel over those from specialized code LLMs (e.g., CodeLlama-Instruct) in quality. ◀

TABLE III: Automated evaluation scores for reference and LLM-generated summaries. S-S Tex.Sim.: methods based on summary-summary text similarity; S-S Sem.Sim.: methods based on summary-summary semantic similarity; S-C Sem.Sim.: methods based on summary-code semantic similarity. We bold the best score in each column.

Language | Summary from       | BLEU  METEOR ROUGE-L | BERTScore SBCS  SBED  USECS | SIDE  | CodeLlama-I StarChat-β GPT-3.5 GPT-4 | Human
Java     | Reference          | /     /      /       | /         /     /     /     | 86.15 | 1.42        2.58       3.08    2.80  | 3.19
Java     | CodeLlama-Instruct | 13.00 17.90  32.21   | 87.94     59.61 86.88 50.69 | 46.62 | 2.32        2.80       3.28    3.64  | 3.93
Java     | StarChat-β         | 18.95 18.19  38.43   | 88.69     61.97 83.45 50.57 | 80.46 | 2.24        1.94       2.42    2.50  | 3.18
Java     | GPT-3.5            | 12.49 16.74  31.87   | 87.73     59.47 88.11 48.87 | 62.04 | 2.40        2.40       3.72    3.82  | 4.00
Java     | GPT-4              | 9.46  17.02  28.36   | 86.72     58.83 89.27 46.50 | 36.12 | 2.44        2.60       4.10    4.50  | 4.17
Python   | Reference          | /     /      /       | /         /     /     /     | 16.11 | 1.48        2.74       2.84    2.98  | 3.56
Python   | CodeLlama-Instruct | 16.04 21.64  37.80   | 89.06     61.57 85.40 55.86 | 38.84 | 1.62        2.60       3.44    3.72  | 3.88
Python   | StarChat-β         | 18.35 17.62  37.96   | 88.92     58.97 87.39 51.54 | 24.05 | 1.94        1.96       2.40    2.42  | 3.14
Python   | GPT-3.5            | 11.95 19.14  30.20   | 87.63     61.37 86.36 49.54 | 41.71 | 1.96        2.72       4.32    4.30  | 4.16
Python   | GPT-4              | 14.07 20.87  35.38   | 88.11     60.65 87.04 51.21 | 33.16 | 1.76        2.54       3.92    4.16  | 4.06
C        | Reference          | /     /      /       | /         /     /     /     | 64.23 | 1.56        2.80       2.24    2.62  | 3.05
C        | CodeLlama-Instruct | 10.92 17.29  28.71   | 86.38     51.55 95.94 37.95 | 46.69 | 2.62        3.02       3.82    3.84  | 4.15
C        | StarChat-β         | 15.58 15.57  32.84   | 87.27     54.85 91.92 40.60 | 66.35 | 2.76        2.74       2.20    2.62  | 3.49
C        | GPT-3.5            | 12.06 16.00  29.81   | 86.65     53.61 93.71 39.75 | 50.17 | 3.04        2.86       3.48    3.66  | 4.06
C        | GPT-4              | 10.07 16.18  28.63   | 86.03     53.00 94.77 37.30 | 41.37 | 3.18        2.86       4.00    4.36  | 4.25

Automated Evaluation Results. Table III displays the scores of the LLM-generated summaries reported by the three categories of automated evaluation methods and the LLM-based evaluation methods. Observe that among the three methods based on summary-summary text similarity, 1) the BLEU-based and ROUGE-L-based methods give StarChat-β the highest scores on all three datasets; 2) the METEOR-based method gives StarChat-β the highest score (i.e., 18.19) on the Java dataset, while giving CodeLlama-Instruct the highest scores (i.e., 21.64 and 17.29) on the Python and C datasets. Among the four methods based on summary-summary semantic similarity, BERTScore, SBCS, and SBED give the best scores to StarChat-β, and USECS gives the best score of 50.69 to CodeLlama-Instruct on the Java dataset. On the Python and C datasets, the four methods consistently give the best scores to CodeLlama-Instruct and StarChat-β, respectively. The summary-code semantic similarity-based method SIDE gives the highest scores (i.e., 80.46 and 66.35) to StarChat-β on the Java and C datasets, and the highest score (i.e., 41.71) to GPT-3.5 on the Python dataset. The four LLM-based methods consistently give the highest scores to GPT-4 on the Java and C datasets, while they consistently award the highest scores to GPT-3.5 on the Python dataset.

☞ Finding ▶ According to automated evaluation, overall, methods based on summary-summary text/semantic similarity tend to give higher scores to the specialized code LLMs StarChat-β and CodeLlama-Instruct, while LLM-based evaluators tend to give higher scores to the general-purpose LLMs GPT-3.5 and GPT-4. The summary-code semantic similarity-based method tends to give higher scores to StarChat-β on the Java and C datasets, while favoring GPT-3.5 on the Python dataset. ◀

Correlation between Automated Evaluation and Human Evaluation. From Table III, it can also be observed that the average scores of reference summaries evaluated by the four LLM-based methods are mostly below 3 points. This means that, similar to human evaluation, the LLM-based evaluation methods also consider the quality of the reference summaries to be not very high. Besides, the LLM-based evaluation methods are inclined to give higher scores to the general-purpose LLMs GPT-3.5 and GPT-4, which is consistent with human evaluation.
Based on the above observations, we can reasonably speculate that, compared to methods based on summary-summary text/semantic similarity and summary-code semantic similarity, LLM-based evaluation methods may be more suitable for evaluating the quality of summaries generated by LLMs. Therefore, we follow [9], [73] and calculate Spearman's correlation coefficient ρ with the p-value between the results of each automated evaluation method and human evaluation, providing more convincing evidence for this speculation. Spearman's correlation coefficient ρ ∈ [−1, 1] is suitable for judging the correlation between two sequences of discrete ordinal/continuous data, with a higher value representing a stronger correlation [77]. −1 ≤ ρ < 0, ρ = 0, and 0 < ρ ≤ 1 respectively indicate the presence of negative correlation, no
correlation, and positive correlation [78]. The p-value helps determine whether an observed correlation is statistically significant or simply due to random chance. By comparing the p-value to a predefined significance level (typically 0.05), we can decide whether to reject the null hypothesis and conclude that the correlation is statistically significant.
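This correlation analysis can be carried out with standard tooling. The following is a minimal sketch, with toy values, of computing ρ and the p-value between one automated method's scores and the averaged human scores over the same samples; it is our own illustration rather than the study's exact analysis script.

```python
# Sketch: Spearman's rho and p-value between an automated metric's scores and the
# averaged human scores for the same set of samples.
from scipy.stats import spearmanr

def correlate(metric_scores: list[float], human_scores: list[float]) -> tuple[float, float]:
    rho, p_value = spearmanr(metric_scores, human_scores)
    return rho, p_value

rho, p = correlate([0.12, 0.30, 0.25, 0.41], [3.0, 4.2, 3.6, 4.5])  # toy values
print(f"rho={rho:.2f}, p={p:.2f}")  # the correlation is significant if p < 0.05
```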
TABLE IV: Spearman's correlation coefficient ρ with the p-value (values in parentheses) between the results of each automated evaluation method and human evaluation. CodeLlama-I: CodeLlama-Instruct. We bold the best score in each row.

Language | Summary from | BLEU | METEOR | ROUGE-L | BERTScore | SBCS | SBED | USECS | SIDE | CodeLlama-I | StarChat-β | GPT-3.5 | GPT-4
Java | Reference | / | / | / | / | / | / | / | 0.31 (.03) | 0.06 (.70) | 0.26 (.07) | 0.53 (.00) | 0.60 (.00)
Java | CodeLlama-I | -0.25 (.08) | 0.07 (.65) | -0.14 (.35) | -0.16 (.27) | 0.01 (.94) | -0.01 (.94) | 0.01 (.95) | -0.30 (.03) | -0.00 (.98) | -0.02 (.91) | 0.35 (.01) | 0.49 (.00)
Java | StarChat-β | 0.22 (.13) | 0.32 (.02) | 0.15 (.31) | 0.22 (.12) | 0.30 (.04) | -0.30 (.04) | 0.32 (.02) | -0.12 (.42) | 0.11 (.44) | 0.16 (.27) | 0.52 (.00) | 0.41 (.00)
Java | GPT-3.5 | -0.10 (.50) | 0.12 (.39) | 0.22 (.13) | 0.00 (.99) | -0.01 (.97) | 0.01 (.97) | 0.10 (.50) | -0.20 (.16) | 0.28 (.05) | -0.08 (.57) | 0.60 (.00) | 0.56 (.00)
Java | GPT-4 | 0.05 (.74) | -0.02 (.87) | 0.05 (.75) | 0.02 (.87) | 0.14 (.34) | -0.14 (.34) | 0.11 (.46) | -0.19 (.18) | 0.00 (.98) | 0.04 (.76) | 0.38 (.01) | 0.40 (.00)
Python | Reference | / | / | / | / | / | / | / | -0.04 (.79) | -0.15 (.30) | 0.19 (.19) | 0.35 (.01) | 0.37 (.01)
Python | CodeLlama-I | -0.17 (.24) | -0.05 (.76) | -0.11 (.46) | -0.16 (.28) | -0.08 (.56) | 0.08 (.56) | -0.08 (.60) | 0.10 (.48) | 0.15 (.29) | -0.01 (.97) | 0.52 (.00) | 0.45 (.00)
Python | StarChat-β | 0.20 (.17) | 0.52 (.00) | 0.19 (.20) | 0.24 (.1) | 0.18 (.22) | -0.18 (.22) | 0.50 (.00) | 0.02 (.89) | -0.19 (.18) | 0.24 (.09) | 0.48 (.00) | 0.48 (.00)
Python | GPT-3.5 | -0.20 (.17) | 0.01 (.96) | -0.18 (.21) | -0.07 (.61) | -0.02 (.89) | 0.02 (.89) | -0.12 (.42) | 0.17 (.24) | 0.19 (.19) | 0.35 (.01) | 0.32 (.02) | 0.42 (.00)
Python | GPT-4 | -0.24 (.09) | 0.12 (.41) | -0.16 (.26) | -0.05 (.73) | 0.02 (.91) | -0.02 (.91) | -0.19 (.19) | 0.15 (.31) | 0.18 (.21) | 0.17 (.25) | 0.16 (.28) | 0.17 (.25)
C | Reference | / | / | / | / | / | / | / | 0.23 (.11) | 0.12 (.39) | 0.27 (.06) | 0.60 (.00) | 0.62 (.00)
C | CodeLlama-I | -0.28 (.05) | -0.11 (.46) | -0.09 (.55) | -0.28 (.05) | -0.05 (.76) | 0.04 (.76) | -0.07 (.61) | -0.32 (.03) | -0.00 (.98) | -0.01 (.48) | 0.33 (.02) | 0.42 (.00)
C | StarChat-β | -0.00 (.98) | 0.35 (.01) | 0.17 (.24) | 0.00 (.98) | 0.17 (.22) | -0.17 (.22) | 0.26 (.07) | -0.01 (.94) | 0.05 (.72) | 0.17 (.23) | 0.16 (.27) | 0.62 (.00)
C | GPT-3.5 | -0.35 (.01) | 0.14 (.33) | -0.12 (.41) | -0.06 (.67) | 0.05 (.71) | -0.05 (.71) | 0.07 (.64) | -0.47 (.00) | 0.21 (.15) | 0.21 (.14) | 0.38 (.01) | 0.65 (.00)
C | GPT-4 | 0.11 (.44) | 0.38 (.01) | 0.37 (.01) | 0.15 (.29) | 0.21 (.14) | -0.21 (.15) | 0.21 (.14) | -0.26 (.07) | -0.09 (.55) | 0.01 (.92) | 0.38 (.01) | 0.28 (.05)

Table IV shows the statistical results of ρ and the p-values. It can be clearly observed that, among all automated evaluation methods, there is a significant positive correlation (0.28 ≤ ρ ≤ 0.65) between the GPT-4-based evaluation method and human evaluation in scoring the quality of summaries generated by most LLMs, followed by the GPT-3.5-based evaluation method. For the other automated evaluation methods, in most cases their correlation with human evaluation is negative or weakly positive. Based on the above observations, we draw the conclusion that, compared with other automated evaluation methods, the GPT-4-based method is more suitable for evaluating the quality of summaries generated by LLMs. In the subsequent RQs, we uniformly employ the GPT-4-based method to assess the quality of LLM-generated summaries. To make the output scores of GPT-4 more deterministic, we set the temperature to 0 when using GPT-4 as the evaluator.

✎ Summary ▶ Among all automated evaluation methods, the GPT-4-based method overall has the strongest correlation with human evaluation. Therefore, it is recommended to adopt the GPT-4-based method to evaluate the quality of LLM-generated summaries. ◀

B. RQ2: How effective are different prompting techniques in adapting LLMs to the code summarization task?

1) Experimental Setup. The experimental dataset comprises 600 samples drawn collectively from the Java, Python, and C datasets.

TABLE V: Effectiveness of different prompting techniques.

Model              | Prompting Technique | Java | Python | C
CodeLlama-Instruct | zero-shot           | 3.42 | 2.98   | 3.41
                   | few-shot            | 3.78 | 3.75   | 3.91
                   | chain-of-thought    | 3.21 | 3.14   | 3.37
                   | critique            | 2.15 | 2.02   | 2.13
                   | expert              | 3.13 | 3.35   | 1.70
StarChat-β         | zero-shot           | 2.71 | 2.85   | 2.86
                   | few-shot            | 2.50 | 2.37   | 2.68
                   | chain-of-thought    | 2.86 | 2.77   | 3.06
                   | critique            | 2.36 | 2.57   | 2.60
                   | expert              | 2.66 | 3.02   | 3.01
GPT-3.5            | zero-shot           | 3.90 | 3.96   | 3.93
                   | few-shot            | 3.73 | 3.97   | 3.56
                   | chain-of-thought    | 3.36 | 3.47   | 3.36
                   | critique            | 3.09 | 3.21   | 3.31
                   | expert              | 2.72 | 3.43   | 3.49
GPT-4              | zero-shot           | 4.50 | 4.55   | 4.42
                   | few-shot            | 4.38 | 4.16   | 4.18
                   | chain-of-thought    | 4.57 | 4.60   | 4.44
                   | critique            | 4.41 | 4.44   | 4.34
                   | expert              | 4.52 | 4.23   | 4.50

2) Experimental Results. Table V presents the scores reported by the GPT-4 evaluation method for the summaries generated by the four LLMs using the five prompting techniques. Observe that when the base model is CodeLlama-Instruct, few-shot prompting consistently performs best on all three datasets. When the base model is StarChat-β, chain-of-thought prompting performs best on both the Java and C datasets, while expert prompting excels on the Python dataset. When selecting GPT-3.5 as the base model, the simplest zero-shot prompting surprisingly achieves the highest scores on the Java and C datasets, and is only slightly worse than few-shot prompting on the Python dataset. When using GPT-4 as the base model, chain-of-thought prompting overall performs best.
For a specific LLM and programming language, there is no guarantee that intuitively more advanced prompting techniques will surpass simple zero-shot prompting. For example, on the Java dataset, when selecting any of StarChat-β, GPT-3.5, and GPT-4 as the base model, few-shot prompting yields lower scores than zero-shot prompting. Contrary to the findings of previous studies [25], [28], the GPT-4-based evaluation method does not consider that few-shot prompting improves the quality of generated summaries. This discrepancy may arise because previous studies evaluated the quality of LLM-generated summaries using BLEU, METEOR, and ROUGE-L, which primarily assess text/semantic similarity with reference summaries. However, as we mentioned in Section IV-A, reference summaries contain low-quality noisy data that undermines their reliability. Therefore, achieving greater similarity with reference summaries does not necessarily imply that the human/GPT-4-based evaluation method will perceive a summary to be of higher quality.

✎ Summary ▶ The more advanced prompting techniques expected to perform better may not necessarily outperform simple zero-shot prompting. In practice, selecting the appropriate prompting technique requires considering the base LLM and the programming language. ◀

C. RQ3: How do different model settings affect LLMs' code summarization performance?

1) Experimental Setup. There are three key model settings/parameters, top_k, top_p, and temperature, that allow the user to control the randomness of the text (code
summary in our scenario) generated by LLMs. Considering that GPT-3.5 and GPT-4 do not support the top_k setting, we only conduct experiments with top_p and temperature.
Top_p: In each round of token generation, LLMs sort tokens by probability from high to low and keep the tokens whose probabilities add up to (no more than) top_p. For example, top_p = 0.1 means that only the tokens comprising the top 10% probability mass are considered. The larger the top_p, the more tokens are sampled; thus tokens with low probabilities have a greater chance of being selected, and the summary generated by the LLM is more random.
Temperature: Temperature adjusts the probabilities of tokens after top_p sampling. The higher the temperature, the smaller the differences between the adjusted token probabilities. Therefore, tokens with low probabilities have a greater chance of being selected, and the generated summary is more random. If the temperature is set to 0, the generated summary is the same every time.
Top_p and temperature are alternatives, and one should only modify one of the two parameters at a time [79]. Therefore, the questions we want to answer are: (1) Does top_p/temperature impact the quality of LLM-generated summaries? (2) As alternative parameters that both control the randomness of LLMs, do top_p and temperature differ in the degree of their influence on the quality of LLM-generated summaries?
Drawing from a review of related work (see Section II), we find that existing LLM-based code summarization studies pay more attention to few-shot prompting. Since no prompting technique outperforms the others on all LLMs, we uniformly employ few-shot prompting in RQ3, RQ4, and RQ5 to facilitate comparing our findings with prior studies.
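As an illustration of how the settings reported in Table VI could be exercised, the sketch below requests a summary under each (top_p, temperature) combination. The model name and the single-turn prompt are simplified placeholders; the grid of values mirrors Table VI rather than the production harness.

```python
# Sketch: generating summaries under each (top_p, temperature) combination from Table VI.
# Model name and prompt are illustrative placeholders.
from openai import OpenAI

client = OpenAI()
PROMPT = "Please generate a short comment in one sentence for the following function:\n"

def sweep(code: str) -> dict[tuple[float, float], str]:
    results = {}
    for top_p in (0.5, 0.75, 1.0):
        for temperature in (0.1, 0.5, 1.0):
            summary = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": PROMPT + code}],
                top_p=top_p,
                temperature=temperature,
            ).choices[0].message.content
            results[(top_p, temperature)] = summary
    return results
```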
TABLE VI: Influence of different model settings. We bold the scores of the best setting combinations on each dataset.

Model              | Top_p | Temperature | Java | Python | C
CodeLlama-Instruct | 0.5   | 0.1         | 3.81 | 3.83   | 4.10
                   | 0.5   | 0.5         | 3.72 | 3.85   | 4.08
                   | 0.5   | 1.0         | 3.91 | 3.81   | 4.11
                   | 0.75  | 0.1         | 3.76 | 3.87   | 4.02
                   | 0.75  | 0.5         | 3.91 | 3.73   | 4.01
                   | 0.75  | 1.0         | 3.80 | 3.79   | 3.88
                   | 1.0   | 0.1         | 3.78 | 3.75   | 3.91
                   | 1.0   | 0.5         | 3.91 | 3.75   | 3.99
                   | 1.0   | 1.0         | 3.73 | 3.59   | 3.60
StarChat-β         | 0.5   | 0.1         | 2.49 | 2.42   | 2.72
                   | 0.5   | 0.5         | 2.47 | 2.36   | 2.70
                   | 0.5   | 1.0         | 2.49 | 2.29   | 2.75
                   | 0.75  | 0.1         | 2.50 | 2.35   | 2.66
                   | 0.75  | 0.5         | 2.45 | 2.47   | 2.80
                   | 0.75  | 1.0         | 2.48 | 2.37   | 2.71
                   | 1.0   | 0.1         | 2.50 | 2.37   | 2.68
                   | 1.0   | 0.5         | 2.53 | 2.45   | 2.77
                   | 1.0   | 1.0         | 2.54 | 2.38   | 2.69
GPT-3.5            | 0.5   | 0.1         | 3.41 | 3.60   | 3.40
                   | 0.5   | 0.5         | 3.45 | 3.73   | 3.38
                   | 0.5   | 1.0         | 3.52 | 3.68   | 3.42
                   | 0.75  | 0.1         | 3.54 | 3.66   | 3.35
                   | 0.75  | 0.5         | 3.55 | 3.65   | 3.24
                   | 0.75  | 1.0         | 3.46 | 3.64   | 3.41
                   | 1.0   | 0.1         | 3.73 | 3.97   | 3.56
                   | 1.0   | 0.5         | 3.55 | 3.71   | 3.42
                   | 1.0   | 1.0         | 3.41 | 3.72   | 3.52
GPT-4              | 0.5   | 0.1         | 4.44 | 4.25   | 4.33
                   | 0.5   | 0.5         | 4.47 | 4.30   | 4.31
                   | 0.5   | 1.0         | 4.45 | 4.31   | 4.29
                   | 0.75  | 0.1         | 4.48 | 4.27   | 4.31
                   | 0.75  | 0.5         | 4.46 | 4.34   | 4.26
                   | 0.75  | 1.0         | 4.47 | 4.33   | 4.36
                   | 1.0   | 0.1         | 4.38 | 4.16   | 4.18
                   | 1.0   | 0.5         | 4.43 | 4.27   | 4.33
                   | 1.0   | 1.0         | 4.40 | 4.18   | 4.33

2) Experimental Results. Table VI shows the scores evaluated by the GPT-4 evaluation method for the summaries generated by the LLMs under different top_p and temperature settings. It is observed that the impact of top_p and temperature on the quality of LLM-generated summaries is specific to the base LLM and programming language. For example, when top_p=0.5, as the temperature increases, the quality of GPT-4-generated summaries for Python code snippets increases, while that for C code snippets decreases. Another example is that when top_p=0.5, as the temperature rises, the quality of GPT-4-generated Java comments first increases and then decreases, whereas CodeLlama-Instruct is exactly the opposite: it first decreases and then increases. Regarding the difference in influence between top_p and temperature, it is observed that in most cases the influence of the two parameters is similar. For example, for C code snippets, when one parameter (top_p or temperature) is fixed, as the other parameter (temperature or top_p) grows, the quality of GPT-3.5-generated summaries first decreases and then increases.

✎ Summary ▶ The impact of top_p and temperature on the quality of generated summaries is specific to the base LLM and programming language. As alternative parameters, top_p and temperature have a similar influence on the quality of LLM-generated summaries. ◀
D. RQ4: How do LLMs perform in summarizing code snippets written in different types of programming languages?

1) Experimental Setup. We conduct experiments on all 10 programming language datasets. As in RQ3, we uniformly employ few-shot prompting to adapt the LLMs.

TABLE VII: Effectiveness of LLMs in summarizing code snippets written in different types of programming languages. CodeLlama-I: CodeLlama-Instruct.

             | OOP  | PP         | SP                             | FP             | LP
Model        | Java | C    Go    | Python Ruby  PHP   JavaScript  | Erlang Haskell | Prolog
CodeLlama-I  | 3.78 | 3.91 3.86  | 3.75   3.98  3.88  4.03        | 3.51   3.58    | 3.23
StarChat-β   | 2.50 | 2.68 2.97  | 2.37   2.79  2.73  2.67        | 2.68   2.88    | 2.34
GPT-3.5      | 3.73 | 3.56 4.14  | 3.97   3.64  3.99  3.53        | 3.57   3.44    | 3.42
GPT-4        | 4.38 | 4.18 4.36  | 4.16   4.37  4.31  4.29        | 4.23   4.22    | 4.05

2) Experimental Results. Table VII shows the performance evaluated by the GPT-4 evaluation method for the four LLMs on the five types of programming languages. It is observed that for OOP (i.e., Java), GPT-4 performs best, followed by CodeLlama-Instruct, GPT-3.5, and StarChat-β. For PP, GPT-4 performs best on both C and Go, while StarChat-β performs worst on both. The smallest LLM, CodeLlama-Instruct, outperforms GPT-3.5 on C (3.91 vs. 3.56), but vice versa on Go (3.86 vs. 4.14). Additionally, except for CodeLlama-Instruct, which performs slightly worse on Go than on C (3.86 vs. 3.91), the other three LLMs perform better on Go than on C. For SP, GPT-4 consistently performs best on all four languages. Surprisingly, CodeLlama-Instruct outperforms GPT-3.5 on both Ruby and JavaScript. All four LLMs perform better on PHP than on Python. For FP, the performance of the two specialized code LLMs (i.e., CodeLlama-Instruct and StarChat-β) is better on Haskell than on Erlang, while the opposite is true for the two general-purpose LLMs (i.e., GPT-3.5 and GPT-4). For LP, GPT-4 still performs best, followed by GPT-3.5, CodeLlama-Instruct, and StarChat-β. Across all five types of languages, the four LLMs consistently perform the worst on LP, which indicates that summarizing logic programming language code is the most challenging. One possible reason is that fewer Prolog datasets are available for training these LLMs compared to other programming languages. The small scale of the Prolog dataset we collected supports this explanation.

✎ Summary ▶ GPT-4 surpasses the other three LLMs on all five types of programming languages. For PP, LLMs overall perform better on Go than on C. For SP, all four LLMs perform better on PHP than on Python. For FP, specialized code LLMs (e.g., StarChat-β) perform better on Haskell than on Erlang, whereas the reverse is true for general-purpose LLMs (e.g., GPT-4). All four LLMs perform worse in summarizing LP code snippets. ◀

E. RQ5: How do LLMs perform on different categories of summaries?

1) Experimental Setup. Following [3], [54], [55], we classify code summaries into the following six categories.
What: describes the functionality of the code snippet. It helps developers to understand the main functionality of the code without diving into implementation details. An example is "Pushes an item onto the top of this stack".
Why: explains the reason why the code snippet is written or the design rationale of the code snippet. It is useful when a method's objective is masked by a complex implementation. An application scenario of Why summaries is to explain the design rationale of overloaded functions.
How-it-is-done: describes the implementation details of the code snippet. Such information is critical for developers to understand the subject, especially when the code complexity is high. For instance, "Shifts any subsequent elements to the left." is a How-it-is-done comment.
Property: asserts properties of the code snippet, e.g., a function's pre-conditions/post-conditions. "This method is not a constant-time operation." is a Property summary.
How-to-use: describes the expected set-up for using the code snippet, such as platforms and compatible versions. For example, "This method can be called only once per call to next()." is a How-to-use summary.
Others: Comments that do not fall into the above five categories are classified as Others summaries, such as "The implementation is awesome.". Following Mu et al. [55], we consider the ⟨code, summary⟩ pairs with Others comments as noisy data and remove them if identified.
We employ the comment classifier COIN provided by Mu et al. [55] to classify the CSN-Java dataset according to the comment intention type. The test dataset is divided into six sub-datasets, as shown in Table VIII. To facilitate comparison between different categories, we randomly select 180 samples from each sub-dataset. As in RQ4, we uniformly employ few-shot prompting to adapt the LLMs. For each sub-dataset with a different intention type, the few-shot examples are of the same intention type, taken from the training dataset.

TABLE VIII: Statistics of the six sub-datasets divided from the CSN-Java test dataset according to comment intention.

Summary Category | Number of Samples | Sample Ratio
What             | 6,132             | 0.56
Why              | 1,190             | 0.11
How-it-is-done   | 2,242             | 0.20
Property         | 1,174             | 0.11
How-to-use       | 180               | 0.02
Others           | 37                | < 0.01

TABLE IX: Effectiveness of LLMs in generating different categories of summaries.

Model              | What | Why  | How-it-is-done | Property | How-to-use
CodeLlama-Instruct | 4.15 | 4.29 | 3.85           | 4.19     | 3.96
StarChat-β         | 2.68 | 2.78 | 2.77           | 2.94     | 2.52
GPT-3.5            | 3.61 | 3.54 | 3.97           | 3.54     | 4.17
GPT-4              | 4.40 | 4.28 | 4.31           | 4.06     | 4.22

2) Experimental Results. Table IX presents the results evaluated by the GPT-4 evaluation method for the four LLMs in generating the five categories of summaries. Observe that CodeLlama-Instruct performs worse in generating How-it-is-done summaries than in generating the other four categories of summaries. StarChat-β gets its lowest score of 2.52 in generating How-to-use summaries. Both GPT-3.5 and GPT-4 are not as good at generating Property summaries as at generating other categories of summaries. Surprisingly, the smallest LLM, CodeLlama-Instruct, slightly outperforms the advanced GPT-4 in generating Why (4.29 vs. 4.28) and Property (4.19 vs. 4.06) summaries. Additionally, compared with GPT-3.5, CodeLlama-Instruct achieves higher scores in generating What, Why, and Property summaries. Admittedly, one reason for this phenomenon is that the optimal prompting technique for GPT-3.5 and GPT-4 is not few-shot prompting. This phenomenon is also exciting because it implies that most ordinary developers or teams who lack sufficient resources (e.g., GPUs) have the opportunity to utilize open-source and small-scale LLMs to achieve code summarization capabilities close to (or even surpassing) those of commercial gigantic LLMs.

✎ Summary ▶ The four LLMs excel in generating different categories of summaries. The smallest CodeLlama-Instruct slightly outperforms the advanced GPT-4 in generating Why and Property summaries. StarChat-β is not proficient at generating How-to-use summaries. GPT-3.5 and GPT-4 perform worse in generating Property summaries than other categories of summaries. ◀

V. THREATS TO VALIDITY

Our empirical study may contain several threats to validity that we have attempted to mitigate.
Threats to External Validity. The threats to external validity lie in the generalizability of our findings. One threat to the validity of our study is that LLMs usually generate varied responses for identical input across multiple requests due to their inherent randomness, and conclusions drawn from random results may be misleading. To mitigate this threat, considering that StarChat-β and CodeLlama-Instruct do not support setting the temperature to 0, we uniformly set it to 0.1 to reduce randomness, except for RQ3. In RQ2-RQ5, to make the evaluation scores more deterministic, we set the temperature to 0 when using GPT-4 as the evaluator. Additionally, for the other RQs, we conduct experiments on multiple programming languages to support our findings.
Threats to Internal Validity. A major threat to internal validity is potential mistakes in the implementation of metrics and models. To mitigate this threat, we use the publicly available code from previous studies [10], [47] for BLEU, METEOR, ROUGE-L, and SIDE. For COIN, BERTScore, SentenceBert, Universal Sentence Encoder, StarChat-β [80], CodeLlama-Instruct [81], and GPT-3.5/GPT-4 [82], we use the scripts provided along with the models.
Another threat lies in the processing of LLMs' responses. Usually, the output of an LLM is a paragraph, not the single-sentence code summary (code comment) that we want. The real code summary may be the first sentence in the LLM's response, or it may be returned in a comment before the code, such as "/** ⟨code summary⟩ */", etc. Therefore, we designed a series of heuristic rules to extract the code summary. We have made our script for extracting code summaries from LLMs' responses public for the community to review.
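As a condensed illustration of this kind of rule (not the released extraction script), the sketch below shows two such heuristics: prefer a leading block comment if one is present, otherwise take the first sentence of the response.

```python
# Condensed illustration of heuristic summary extraction from an LLM response.
# Only two simplified rules are shown; the released script implements more.
import re

def extract_summary(response: str) -> str:
    block = re.search(r"/\*\*?(.*?)\*/", response, flags=re.DOTALL)  # "/** ... */" comment
    text = block.group(1) if block else response
    text = " ".join(line.strip(" *") for line in text.strip().splitlines()).strip()
    first_sentence = re.split(r"(?<=[.!?])\s", text, maxsplit=1)[0]  # keep first sentence
    return first_sentence
```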
VI. CONCLUSION

In this paper, we provide a comprehensive study covering multiple aspects of code summarization in the era of LLMs. Our interesting and significant findings include, but are not limited to, the following aspects. 1) Compared with existing automated evaluation methods, the GPT-4-based evaluation method is more fitting for assessing the quality of LLM-generated summaries. 2) The advanced prompting techniques anticipated to yield superior performance may not invariably surpass the efficacy of straightforward zero-shot prompting. 3) The two alternative model settings have a similar impact on the quality of LLM-generated summaries, and this impact varies with the base LLM and programming language. 4) LLMs exhibit inferior performance in summarizing LP code snippets. 5)
CodeLlama-Instruct with 7B parameters demonstrates superior [12] C. Fang, W. Sun, Y. Chen, X. Chen, Z. Wei, Q. Zhang, Y. You, B. Luo,
performance over the advanced GPT-4 in generating Why and Y. Liu, and Z. Chen, “Esale: Enhancing code-summary alignment
learning for source code summarization,” IEEE Transactions on Software
Property summaries. Our comprehensive research findings Engineering (Early Access), pp. 1–18, 2024.
will aid subsequent researchers in swiftly grasping the various [13] M. Du, F. He, N. Zou, D. Tao, and X. Hu, “Shortcut learning of large
facets of LLM-based code summarization, thereby promoting language models in natural language understanding,” Communications
of the ACM, vol. 67, no. 1, pp. 110–120, 2023.
the development of this field. [14] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, and D. Yang, “Is
ChatGPT a general-purpose natural language processing task solver?”
ACKNOWLEDGMENT arXiv preprint arXiv:2302.06476, 2023.
[15] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan,
The authors would like to thank the anonymous reviewers H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large
for their insightful comments. This work is supported par- language models trained on code,” arXiv preprint arXiv:2107.03374,
2021.
tially by the National Natural Science Foundation of China [16] Q. Zhang, C. Fang, Y. Xie, Y. Zhang, Y. Yang, W. Sun, S. Yu, and
(61932012, 62372228), and the National Research Foun- Z. Chen, “A survey on large language models for software engineering,”
dation, Singapore, and the Cyber Security Agency under CoRR, vol. abs/2312.15223, no. 1, pp. 1–57, 2023.
[17] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo,
its National Cybersecurity R&D Programme (NCRP25-P04- and J. M. Zhang, “Large language models for software engineering:
TAICeN). Any opinions, findings and conclusions or recom- Survey and open problems,” arXiv preprint arXiv:2310.03533, 2023.
mendations expressed in this material are those of the author(s) [18] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo,
J. Grundy, and H. Wang, “Large language models for software engineer-
and do not reflect the views of National Research Foundation, ing: A systematic literature review,” arXiv preprint arXiv:2308.10620,
Singapore and Cyber Security Agency of Singapore. 2023.
REFERENCES

[1] S. N. Woodfield, H. E. Dunsmore, and V. Y. Shen, “The effect of modularization and comments on program comprehension,” in Proceedings of the 5th International Conference on Software Engineering. San Diego, California, USA: IEEE Computer Society, March 9-12 1981, pp. 215–223.
[2] S. C. B. de Souza, N. Anquetil, and K. M. de Oliveira, “A study of the documentation essential to software maintenance,” in Proceedings of the 23rd Annual International Conference on Design of Communication: Documenting & Designing for Pervasive Information. Coventry, UK: ACM, September 21-23 2005, pp. 68–75.
[3] J. Zhai, X. Xu, Y. Shi, G. Tao, M. Pan, S. Ma, L. Xu, W. Zhang, L. Tan, and X. Zhang, “CPC: Automatically classifying and propagating natural language comments via program analysis,” in Proceedings of the 42nd International Conference on Software Engineering. Seoul, South Korea: ACM, 27 June - 19 July 2020, pp. 1359–1371.
[4] W. Sun, C. Fang, Y. Chen, Q. Zhang, G. Tao, Y. You, T. Han, Y. Ge, Y. Hu, B. Luo, and Z. Chen, “An extractive-and-abstractive framework for source code summarization,” ACM Transactions on Software Engineering and Methodology, vol. Just Accepted, no. 1, pp. 1–39, 2023.
[5] B. Fluri, M. Würsch, and H. C. Gall, “Do code and comments co-evolve? On the relation between source code and comment changes,” in Proceedings of the 14th Working Conference on Reverse Engineering. Vancouver, BC, Canada: IEEE Computer Society, 28-31 October 2007, pp. 70–79.
[6] M. L. Vásquez, B. Li, C. Vendome, and D. Poshyvanyk, “How do developers document database usages in source code? (N),” in Proceedings of the 30th International Conference on Automated Software Engineering. Lincoln, NE, USA: IEEE Computer Society, November 9-13 2015, pp. 36–41.
[7] F. Wen, C. Nagy, G. Bavota, and M. Lanza, “A large-scale empirical study on code-comment inconsistencies,” in Proceedings of the 27th International Conference on Program Comprehension. Montreal, QC, Canada: IEEE / ACM, May 25-31 2019, pp. 53–64.
[8] X. Hu, X. Xia, D. Lo, Z. Wan, Q. Chen, and T. Zimmermann, “Practitioners’ expectations on automated code comment generation,” in Proceedings of the 44th International Conference on Software Engineering. Pittsburgh, PA, USA: ACM, May 25-27 2022, pp. 1693–1705.
[9] E. Shi, Y. Wang, L. Du, J. Chen, S. Han, H. Zhang, D. Zhang, and H. Sun, “On the evaluation of neural code summarization,” in Proceedings of the 44th International Conference on Software Engineering. Pittsburgh, PA, USA: IEEE, May 21-29 2022, pp. 1597–1608.
[10] A. Mastropaolo, M. Ciniselli, M. Di Penta, and G. Bavota, “Evaluating code summarization techniques: A new metric and an empirical characterization,” arXiv e-prints, pp. arXiv–2312, 2023.
[11] W. Sun, C. Fang, Y. You, Y. Chen, Y. Liu, C. Wang, J. Zhang, Q. Zhang, H. Qian, W. Zhao et al., “A prompt learning framework for source code summarization,” arXiv preprint arXiv:2312.16066, 2023.
[13] M. Du, F. He, N. Zou, D. Tao, and X. Hu, “Shortcut learning of large language models in natural language understanding,” Communications of the ACM, vol. 67, no. 1, pp. 110–120, 2023.
[14] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, and D. Yang, “Is ChatGPT a general-purpose natural language processing task solver?” arXiv preprint arXiv:2302.06476, 2023.
[15] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
[16] Q. Zhang, C. Fang, Y. Xie, Y. Zhang, Y. Yang, W. Sun, S. Yu, and Z. Chen, “A survey on large language models for software engineering,” CoRR, vol. abs/2312.15223, no. 1, pp. 1–57, 2023.
[17] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, and J. M. Zhang, “Large language models for software engineering: Survey and open problems,” arXiv preprint arXiv:2310.03533, 2023.
[18] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large language models for software engineering: A systematic literature review,” arXiv preprint arXiv:2308.10620, 2023.
[19] X. Du, M. Liu, K. Wang, H. Wang, J. Liu, Y. Chen, J. Feng, C. Sha, X. Peng, and Y. Lou, “Evaluating large language models in class-level code generation,” in Proceedings of the 46th International Conference on Software Engineering. Lisbon, Portugal: ACM, April 14-20 2024, pp. 81:1–81:13.
[20] C. Wang, J. Zhang, Y. Feng, T. Li, W. Sun, Y. Liu, and X. Peng, “Teaching code llms to use autocompletion tools in repository-level code generation,” CoRR, vol. abs/2401.06391, no. 1, pp. 1–13, 2024.
[21] Q. Zhang, T. Zhang, J. Zhai, C. Fang, B. Yu, W. Sun, and Z. Chen, “A critical review of large language model on software engineering: An example from chatgpt and automated program repair,” CoRR, vol. abs/2310.08879, no. 1, pp. 1–12, 2023.
[22] Q. Zhang, C. Fang, Y. Xie, Y. Ma, W. Sun, Y. Yang, and Z. Chen, “A systematic literature review on large language models for automated program repair,” CoRR, vol. abs/2405.01466, no. 1, pp. 1–39, 2024.
[23] X. Zhou, T. Zhang, and D. Lo, “Large language model for vulnerability detection: Emerging results and future directions,” in Proceedings of the 44th International Conference on Software Engineering: New Ideas and Emerging Results. Lisbon, Portugal: ACM, April 14-20 2024, pp. 47–51.
[24] J. Zhang, C. Wang, A. Li, W. Sun, C. Zhang, W. Ma, and Y. Liu, “An empirical study of automated vulnerability localization with large language models,” CoRR, vol. abs/2404.00287, no. 1, 2024.
[25] T. Ahmed and P. T. Devanbu, “Few-shot training llms for project-specific code-summarization,” in Proceedings of the 37th International Conference on Automated Software Engineering. Rochester, MI, USA: ACM, October 10-14 2022, pp. 177:1–177:5.
[26] C. Wang, Y. Yang, C. Gao, Y. Peng, H. Zhang, and M. R. Lyu, “No more fine-tuning? An experimental evaluation of prompt tuning in code intelligence,” in Proceedings of the 30th Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. Singapore, Singapore: ACM, November 14-18 2022, pp. 382–394.
[27] W. Sun, C. Fang, Y. You, Y. Miao, Y. Liu, Y. Li, G. Deng, S. Huang, Y. Chen, Q. Zhang, H. Qian, Y. Liu, and Z. Chen, “Automatic code summarization via chatgpt: How far are we?” CoRR, vol. abs/2305.12865, pp. 1–13, 2023.
[28] M. Geng, S. Wang, D. Dong, H. Wang, G. Li, Z. Jin, X. Mao, and X. Liao, “Large language models are few-shot summarizers: Multi-intent comment generation via in-context learning,” in Proceedings of the 46th International Conference on Software Engineering. Lisbon, Portugal: ACM, April 14-20 2024, pp. 39:1–39:13.
[29] S. Gao, W. Mao, C. Gao, L. Li, X. Hu, X. Xia, and M. R. Lyu, “Learning in the wild: Towards leveraging unlabeled data for effectively tuning pre-trained code models,” in Proceedings of the 46th International Conference on Software Engineering. Lisbon, Portugal: ACM, April 14-20 2024, pp. 1–13.
[30] S. Gao, X. Wen, C. Gao, W. Wang, H. Zhang, and M. R. Lyu, “What makes good in-context demonstrations for code intelligence tasks with llms?” in Proceedings of the 38th International Conference on Automated Software Engineering. Luxembourg: IEEE, September 11-15 2023, pp. 761–773.
[31] H. Wu, H. Zhao, and M. Zhang, “Code summarization with structure-induced transformer,” in Findings of the 59th Annual Meeting of the Association for Computational Linguistics. Online Event: Association for Computational Linguistics, August 1-6 2021, pp. 1078–1090.
[32] X. Hu, G. Li, X. Xia, D. Lo, S. Lu, and Z. Jin, “Summarizing source code with transferred API knowledge,” in Proceedings of the 27th International Joint Conference on Artificial Intelligence. Stockholm, Sweden: ijcai.org, July 13-19 2018, pp. 2269–2275.
[33] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, PA, USA: ACL, July 6-12 2002, pp. 311–318.
[34] S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, Michigan, USA: Association for Computational Linguistics, June 29 2005, pp. 65–72.
[35] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics – Workshop on Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, July 21-26 2004, pp. 74–81.
[36] S. Haque, Z. Eberhart, A. Bansal, and C. McMillan, “Semantic similarity metrics for evaluating source code summarization,” in Proceedings of the 30th International Conference on Program Comprehension. Virtual Event: ACM, May 16-17 2022, pp. 36–47.
[37] J. Wang, Y. Liang, F. Meng, H. Shi, Z. Li, J. Xu, J. Qu, and J. Zhou, “Is chatgpt a good NLG evaluator? A preliminary study,” CoRR, vol. abs/2303.04048, no. 1, pp. 1–11, 2023.
[38] I. Vykopal, M. Pikuliak, I. Srba, R. Moro, D. Macko, and M. Bielikova, “Disinformation capabilities of large language models,” arXiv preprint arXiv:2311.08838, 2023.
[39] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, “G-eval: NLG evaluation using gpt-4 with better human alignment,” in Proceedings of the 28th Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, December 6-10 2023, pp. 2511–2522.
[40] W. Sun, Y. Miao et al., “Artifacts of this study,” site: https://github.com/wssun/LLM4CodeSummarization, 2024, accessed: 2024-07-09.
[41] X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin, “Deep code comment generation with hybrid lexical and syntactical information,” Empirical Software Engineering, vol. 25, no. 3, pp. 2179–2217, 2020.
[42] S. Haiduc, J. Aponte, and A. Marcus, “Supporting program comprehension with source code summarization,” in Proceedings of the 32nd International Conference on Software Engineering. Cape Town, South Africa: ACM, 1-8 May 2010, pp. 223–226.
[43] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proceedings of the 3rd International Conference on Learning Representations. San Diego, CA, USA: OpenReview.net, May 7-9 2015, pp. 1–15.
[44] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” in Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar: Association for Computational Linguistics, 25 October 2014, pp. 103–111.
[45] W. U. Ahmad, S. Chakraborty, B. Ray, and K. Chang, “A transformer-based approach for source code summarization,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, July 5-10 2020, pp. 4998–5007.
[46] D. Gros, H. Sezhiyan, P. Devanbu, and Z. Yu, “Code to comment ”translation”: Data, metrics, baselining & evaluation,” in Proceedings of the 35th International Conference on Automated Software Engineering. Melbourne, Australia: IEEE, September 21-25 2020, pp. 746–757.
[47] J. Zhang, X. Wang, H. Zhang, H. Sun, and X. Liu, “Retrieval-based neural source code summarization,” in Proceedings of the 42nd International Conference on Software Engineering. Seoul, South Korea: ACM, 27 June - 19 July 2020, pp. 1385–1397.
[48] D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, S. Yih, L. Zettlemoyer, and M. Lewis, “Incoder: A generative model for code infilling and synthesis,” in Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: OpenReview.net, May 1-5 2023, pp. 1–14.
[49] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. B. Clement, D. Drain, D. Jiang, D. Tang, G. Li, L. Zhou, L. Shou, L. Zhou, M. Tufano, M. Gong, M. Zhou, N. Duan, N. Sundaresan, S. K. Deng, S. Fu, and S. Liu, “Codexglue: A machine learning benchmark dataset for code understanding and generation,” in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, virtual, December 2021, pp. 1–14.
[50] C. Su and C. McMillan, “Distilled GPT for source code summarization,” Automated Software Engineering, vol. 31, no. 1, p. 22, 2024.
[51] T. Ahmed, K. S. Pai, P. Devanbu, and E. T. Barr, “Automatic semantic augmentation of language model prompts (for code summarization),” in Proceedings of the 46th International Conference on Software Engineering. Lisbon, Portugal: ACM, April 14-20 2024, pp. 1–13.
[52] S. A. Rukmono, L. Ochoa, and M. R. Chaudron, “Achieving high-level software component summarization via hierarchical chain-of-thought prompting and static code analysis,” in Proceedings of the 2023 International Conference on Data and Software Engineering. Toba, Indonesia: IEEE, September 07-08 2023, pp. 7–12.
[53] Y. Choi and J. Lee, “Codeprompt: Task-agnostic prefix tuning for program and language generation,” in Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada: Association for Computational Linguistics, July 9-14 2023, pp. 5282–5297.
[54] Q. Chen, X. Xia, H. Hu, D. Lo, and S. Li, “Why my code summarization model does not work: Code comment improvement with category prediction,” ACM Transactions on Software Engineering and Methodology, vol. 30, no. 2, pp. 25:1–25:29, 2021.
[55] F. Mu, X. Chen, L. Shi, S. Wang, and Q. Wang, “Developer-intent driven code comment generation,” in Proceedings of the 45th International Conference on Software Engineering. Melbourne, Australia: IEEE, May 14-20 2023, pp. 768–780.
[56] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin et al., “Code llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950, 2023.
[57] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[58] L. Tunstall, N. Lambert, N. Rajani, E. Beeching, T. Le Scao, L. von Werra, S. Han, P. Schmid, and A. Rush, “Creating a coding assistant with starcoder,” Hugging Face Blog, 2023, https://huggingface.co/blog/starchat.
[59] Bigcode, “Starcoderplus,” Hugging Face Blog, 2023, https://huggingface.co/bigcode/starcoderplus.
[60] OpenAI, “OpenAI API,” site: https://platform.openai.com/docs/models, 2015, accessed: 2024-03-15.
[61] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Proceedings of the 36th Annual Conference on Neural Information Processing Systems. New Orleans, LA, USA: Curran Associates Inc., November 28 - December 9 2022, pp. 24824–24837.
[62] Y. Wang, Z. Zhang, and R. Wang, “Element-aware summarization with large language models: Expert-aligned evaluation and chain-of-thought method,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada: Association for Computational Linguistics, July 9-14 2023, pp. 8640–8665.
[63] G. Kim, P. Baldi, and S. McAleer, “Language models can solve computer tasks,” in Proceedings of the 37th Annual Conference on Neural Information Processing Systems, vol. 36. New Orleans, LA, USA: Curran Associates, Inc., December 10-16 2023, pp. 39648–39677.
[64] B. Xu, A. Yang, J. Lin, Q. Wang, C. Zhou, Y. Zhang, and Z. Mao, “Expertprompting: Instructing large language models to be distinguished experts,” CoRR, vol. abs/2305.14688, no. 1, pp. 1–6, 2023.
[65] CodeLlama, “Application of codellama,” site: https://huggingface.co/spaces/codellama/codellama-13b-chat/blob/main/app.py, 2023, accessed: 2024-03-15.
[66] H. Husain, H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, “Codesearchnet challenge: Evaluating the state of semantic code search,” CoRR, vol. abs/1909.09436, 2019.
[67] D. Guo, S. Lu, N. Duan, Y. Wang, M. Zhou, and J. Yin, “UniXcoder: Unified cross-modal pre-training for code representation,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland: Association for Computational Linguistics, May 22-27 2022, pp. 7212–7225.
[68] D. Wang, B. Chen, S. Li, W. Luo, S. Peng, W. Dong, and X. Liao, “One adapter for all programming languages? Adapter tuning for code search and summarization,” in Proceedings of the 45th International Conference on Software Engineering. Melbourne, Australia: IEEE, May 14-20 2023, pp. 5–16.
[69] S. Liu, Y. Chen, X. Xie, J. K. Siow, and Y. Liu, “Retrieval-augmented generation for code summarization via hybrid GNN,” in Proceedings of the 9th International Conference on Learning Representations. Virtual Event, Austria: OpenReview.net, May 3-7 2021, pp. 1–13.
[70] M. Allamanis, “The adverse effects of code duplication in machine learning models of code,” in Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, 2019, pp. 143–153.
[71] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating text generation with BERT,” in Proceedings of the 8th International Conference on Learning Representations. Addis Ababa, Ethiopia: OpenReview.net, April 26-30 2020, pp. 1–14.
[72] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar et al., “Universal sentence encoder,” arXiv preprint arXiv:1803.11175, 2018.
[73] D. Roy, S. Fakhoury, and V. Arnaoudova, “Reassessing automatic evaluation metrics for code summarization tasks,” in Proceedings of the 29th Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. Athens, Greece: ACM, August 23-28 2021, pp. 1105–1116.
[74] Y. Zhang, Y. Liu, X. Fan, and Y. Lu, “RetCom: Information retrieval-enhanced automatic source-code summarization,” in Proceedings of the 22nd International Conference on Software Quality, Reliability and Security. Guangzhou, China: IEEE, December 5-9 2022, pp. 948–957.
[75] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, MN, USA: Association for Computational Linguistics, June 2-7 2019, pp. 4171–4186.
[76] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using siamese bert-networks,” in Proceedings of the 9th International Joint Conference on Natural Language Processing. Hong Kong, China: Association for Computational Linguistics, November 3-7 2019, pp. 3980–3990.
[77] W. J. Conover, Practical nonparametric statistics. John Wiley & Sons, 1999, vol. 350.
[78] C. P. Dancey and J. Reidy, Statistics without maths for psychology. Pearson Education, 2007.
[79] OpenAI, “Create chat completion,” site: https://platform.openai.com/docs/api-reference/chat/create, 2024, accessed: 2024-03-15.
[80] L. Tunstall, N. Lambert, N. Rajani, E. Beeching, T. Le Scao, L. von Werra, S. Han, P. Schmid, and A. Rush, “Starchat-beta,” site: https://huggingface.co/HuggingFaceH4/starchat-beta, 2023, accessed: 2024-03-15.
[81] pcuenq, “Usage of codellama,” site: https://huggingface.co/spaces/codellama/codellama-13b-chat/blob/main/app.py, 2023, accessed: 2024-03-15.
[82] OpenAI, “Get up and running with the openai api,” site: https://platform.openai.com/docs/quickstart?context=python, 2024, accessed: 2024-03-15.