Code Summarization Using Large Language Models
Weisong Sun 1,2, Yun Miao 1, Yuekang Li 3, Hongyu Zhang 4, Chunrong Fang 1,
Yi Liu 2, Gelei Deng 2, Yang Liu 2, Zhenyu Chen 1
1 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
2 College of Computing and Data Science, Nanyang Technological University, Singapore
3 School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
4 School of Big Data and Software Engineering, Chongqing University, Chongqing, China
[email protected], [email protected], [email protected], [email protected], [email protected],
[email protected], [email protected], [email protected], [email protected]
stand the aspects of code summarization garnering attention in the era of LLMs, but they still have some limitations. First, most of them focus on only one prompting technique, while some advanced prompting techniques have not been investigated and compared (e.g., chain-of-thought prompting). For example, Sun et al. [27] solely focus on zero-shot prompting, while several other studies [25], [28], [30] only focus on few-shot prompting. Second, they overlook the impact of the model settings (i.e., parameter configuration) of LLMs on their code summarization capabilities. There is no empirical evidence showing that LLMs perform well under all model settings. Last but not least, these studies follow prior code summarization studies [4], [31], [32] and evaluate the quality of summaries generated by LLMs by computing the text similarity (e.g., BLEU [33], METEOR [34], and ROUGE-L [35]) or semantic similarity (e.g., SentenceBERT-based cosine similarity [36]) between the LLM-generated summaries and the reference summaries, as detailed in Section IV-A. However, prior research by Sun et al. [27] has shown that, compared to traditional code summarization models, the summaries generated by LLMs differ significantly from reference summaries in expression and tend to describe more details. Consequently, whether these traditional evaluation methods are suitable for assessing the quality of LLM-generated summaries remains unknown.

To address these issues, in this paper we conduct a systematic study on code summarization in the era of LLMs, which covers various aspects involved in the LLM-based code summarization workflow. Considering that the choice of evaluation methods directly impacts the accuracy and reliability of the evaluation results, we first systematically investigate the suitability of existing automated evaluation methods for assessing the quality of summaries generated by LLMs (including CodeLlama-Instruct, StarChat-β, GPT-3.5, and GPT-4). Specifically, we compare multiple automated evaluation methods (including methods based on summary-summary text similarity, summary-summary semantic similarity, and summary-code semantic similarity) with human evaluation to reveal their correlation. Inspired by work in NLP [37]–[39], we also explore the possibility of using the LLMs themselves as evaluation methods. The experimental results show that among all automated evaluation methods, the GPT-4-based evaluation method overall has the strongest correlation with human evaluation. Second, we conduct comprehensive experiments on datasets in three widely used programming languages (Java, Python, and C) to explore the effectiveness of five prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert) in adapting LLMs to the code summarization task. The experimental results show that the optimal choice of prompting technique varies across LLMs and programming languages. Surprisingly, the more advanced prompting techniques expected to perform better may not necessarily outperform simple zero-shot prompting. For instance, when the base LLM is GPT-3.5, zero-shot prompting overall outperforms the other four more advanced prompting techniques on the three datasets. Then, we investigate the impact of two key model settings/parameters, top_p and temperature, on LLMs' code summarization performance. These two parameters may affect the randomness of generated summaries. The results demonstrate that the effect of top_p and temperature on summary quality varies depending on the base LLM and programming language. As alternative parameters, they exhibit a similar impact on the quality of LLM-generated summaries. Furthermore, unlike existing studies that simply experimented with multiple programming languages, we reveal the differences in the code summarization capabilities of LLMs across five types of programming languages (procedural, object-oriented, scripting, functional, and logic programming languages), encompassing ten programming languages: Java, Python, C, Ruby, PHP, JavaScript, Go, Erlang, Haskell, and Prolog. The Erlang, Haskell, and Prolog datasets are built by ourselves, and we make them public to the community. We find that across all five types of programming languages, LLMs consistently perform the worst in summarizing code written in logic programming languages. Finally, we investigate the ability of LLMs to generate summaries of different categories, including What, Why, How-to-use-it, How-it-is-done, Property, and Others. The results reveal that the four LLMs perform well in generating distinct categories of summaries. For example, CodeLlama-Instruct excels in generating Why and Property summaries, while GPT-4 is good at generating What, How-it-is-done, and How-to-use summaries.

Our comprehensive research findings will assist subsequent researchers in quickly and deeply understanding the various aspects involved in the LLM-based code summarization workflow, as well as in designing advanced LLM-based code summarization techniques for specific fields.

In summary, we make the following contributions.
• To the best of our knowledge, we conduct the first investigation into the feasibility of applying LLMs as evaluators to assess the quality of LLM-generated summaries.
• We conduct a thorough study of code summarization in the era of LLMs, covering multiple aspects of the LLM-based code summarization workflow, and come up with several novel and unexpected findings and insights. These findings and insights can benefit future research on and practical usage of LLM-based code summarization.
• We make our dataset and source code publicly accessible [40] to facilitate the replication of our study and its application in extensive contexts.

II. BACKGROUND AND RELATED WORK

Code summarization is the task of automatically generating natural language summaries (also called comments) for code snippets. Such summaries serve various purposes, including but not limited to explaining the functionality of code snippets [8], [28], [41]. Research on code summarization can be traced back to as early as 2010, when Haiduc et al. [42] introduced automated text summarization technology to summarize source code. Later on, following the significant success of neural machine translation (NMT) research in the field of NLP [43], [44], a large number of researchers
migrate its underlying encoder-decoder architecture to code summarization tasks [4], [9], [12], [45]–[47]. In the past two years, research on LLM-based code summarization has mushroomed. Fried et al. [48] introduce an LLM called InCoder and evaluate it in a zero-shot setting on the CodeXGLUE [49] Python dataset. InCoder achieves impressive results, but fine-tuned small PLMs like CodeT5 can still outperform the zero-shot setting. Ahmed et al. [25] investigate the effectiveness of few-shot prompting in adapting LLMs to code summarization and find that it can make Codex significantly outperform fine-tuned small PLMs (e.g., CodeT5). Given the concern of potential code asset leakage when using commercial LLMs (e.g., GPT-3.5), Su et al. [50] utilize knowledge distillation technology to distill small models from LLMs (e.g., GPT-3.5). Their experimental findings reveal that the distilled small models can achieve code summarization performance comparable to LLMs. Gao et al. [30] investigate the optimal settings for few-shot learning, including few-shot example selection methods, few-shot example order, and the number of few-shot examples. Geng et al. [28] investigate LLMs' ability to address multi-intent comment generation. Ahmed et al. [51] propose to enhance few-shot samples with semantic facts automatically extracted from the source code. Sun et al. [27] design several heuristic questions to collect the feedback of ChatGPT, thereby finding an appropriate prompt to guide ChatGPT to generate in-distribution code summaries. Rukmono et al. [52] address the unreliability of LLMs in performing reasoning by applying a chain-of-thought prompting strategy. Recently, some studies [11], [26], [53] have also investigated the applicability of Parameter-Efficient Fine-Tuning (PEFT) techniques to code summarization tasks. In this paper, we focus on uncovering the effectiveness of various prompting techniques in adapting LLMs to code summarization without fine-tuning.

III. STUDY DESIGN

A. Research Questions

This study aims to answer the following research questions:
RQ1: What evaluation methods are suitable for assessing the quality of summaries generated by LLMs? Existing research on LLM-based code summarization [25], [28], [30] widely follows earlier studies [32], [36] and employs automated evaluation metrics (e.g., BLEU) to evaluate the quality of LLM-generated summaries. However, recent studies [27], [50] have shown that LLM-generated summaries surpass reference summaries in quality. Therefore, evaluating LLM-generated summaries based on their text or semantic similarity to reference summaries may not be appropriate. This RQ aims to discover a suitable method for automated assessment of the quality of LLM-generated summaries.
RQ2: How effective are different prompting techniques in adapting LLMs to the code summarization task? This RQ aims to unveil the effectiveness of several popular prompting techniques (e.g., few-shot and chain-of-thought) in adapting LLMs to code summarization tasks.
RQ3: How do different model settings affect LLMs' code summarization performance? To better meet diverse user needs, LLMs typically offer configurable parameters (i.e., model settings) that allow users to control the randomness of model behaviour. In this RQ, we adjust the randomness of the generated summaries by modifying LLMs' parameters and examine the impact of different model settings on the performance of LLMs in generating code summaries.
RQ4: How do LLMs perform in summarizing code snippets written in different types of programming languages? Programming languages are diverse in type (e.g., object-oriented and functional programming languages), and their implementations of the same functional requirements can be similar or entirely different. The scale of programs implemented with them in Internet/open-source repositories also varies, which may result in differences in LLMs' mastery of these languages. Hence, this RQ aims to reveal the differences in LLMs' capabilities to summarize code snippets across diverse programming language types.
RQ5: How do LLMs perform on different categories of summaries? Previous research [3], [54], [55] has shown that summaries can be classified into various categories according to developers' intentions, including What, Why, How-to-use-it, How-it-is-done, Property, and Others. Therefore, in this RQ, we aim to explore the ability of LLMs to generate summaries of different categories.

B. Experimental LLMs

We select four LLMs as experimental representatives.
CodeLlama-Instruct. Code Llama [56] is a family of LLMs for code based on Llama 2 [57]. It provides multiple flavors to cover a wide range of applications: foundation models, Python specializations (Code Llama-Python), and instruction-following models (Code Llama-Instruct) with 7B, 13B, and 34B parameters. Our study utilizes Code Llama-Instruct-7B.
StarChat-β. StarChat-β [58] is an LLM with 16B parameters fine-tuned on StarCoderPlus [59]. Compared with StarCoderPlus, StarChat-β excels in chat-based coding assistance.
GPT-3.5. GPT-3.5 [60] is an LLM provided by OpenAI. It is trained on massive amounts of text and code, and can understand and generate natural language or code.
GPT-4. GPT-4 is an improved version of GPT-3.5, which can solve difficult problems with greater accuracy. OpenAI has not disclosed the specific parameter scales of GPT-3.5 and GPT-4. Our study uses gpt-3.5-turbo and gpt-4-1106-preview.
Model Settings. Apart from RQ3, where we investigate the impact of model settings, we uniformly set the temperature to 0.1 to minimize the randomness of LLMs' responses and highlight the impact of evaluation methods/prompting techniques/programming language types/summary categories.

C. Prompting Techniques

We compare the five commonly used prompting techniques below.
Zero-Shot. Zero-shot prompting adapts LLMs to downstream tasks using simple instructions. In our scenario, the input to LLMs consists of a simple instruction and a code snippet to be summarized. We expect LLMs to output a natural language summary of the code snippet. Therefore, we
follow [27] and adopt the input format: Please generate a short comment in one sentence for the following function: ⟨code⟩.
Few-Shot. Few-shot prompting (also known as in-context learning [28], [30]) provides not only a straightforward instruction but also some examples when adapting LLMs to downstream tasks. The examples serve as conditioning for subsequent examples where we would like LLMs to generate a response. In our scenario, the examples are pairs of ⟨code snippet, summary⟩. According to the findings of Gao et al. [30], we set the number of examples to 4 to achieve a balance between LLMs' performance and the cost of calling the OpenAI API.
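As an illustration, the few-shot input described above could be assembled for a chat-style LLM as in the minimal sketch below. This is our own simplification, not the exact prompt template used in the study; the model name and the example pairs are placeholders, and the instruction reuses the zero-shot format.

```python
# Sketch: assembling a few-shot prompt from four <code snippet, summary> example pairs.
# Model name and examples are illustrative placeholders, not the study's exact artifacts.
from openai import OpenAI

client = OpenAI()
INSTRUCTION = "Please generate a short comment in one sentence for the following function:"

def few_shot_summarize(code: str, examples: list[tuple[str, str]]) -> str:
    messages = []
    for example_code, example_summary in examples[:4]:  # 4 examples, following Gao et al. [30]
        messages.append({"role": "user", "content": f"{INSTRUCTION}\n{example_code}"})
        messages.append({"role": "assistant", "content": example_summary})
    messages.append({"role": "user", "content": f"{INSTRUCTION}\n{code}"})
    response = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0.1)
    return response.choices[0].message.content
```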
Chain-of-Thought. Chain-of-thought prompting adapts LLMs to downstream tasks by providing intermediate reasoning steps [61]. These steps enable LLMs to possess complex reasoning capabilities. In this study, we follow Wang et al. [62] and apply chain-of-thought prompting to the code summarization task through the following four steps:
(1) Instruction 1: Input the code snippet and five questions about the code in the format "Code: \n⟨code⟩ Question: \n⟨Q1⟩\n⟨Q2⟩\n⟨Q3⟩\n⟨Q4⟩\n⟨Q5⟩\n".
(2) Get the LLM's response to Instruction 1, i.e., Response 1.
(3) Instruction 2: "Let's integrate the above information and generate a short comment in one sentence for the function."
(4) Get the LLM's response to Instruction 2, i.e., Response 2. Response 2 contains the comment generated by the LLM for the code snippet.
When asking Instruction 2, Instruction 1 and Response 1 are paired as a history prompt and answer and input into the LLM.
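These four steps amount to a two-turn conversation in which Instruction 1 and Response 1 are replayed as history when Instruction 2 is issued. The sketch below is our own simplification; the five questions and the model name are placeholders.

```python
# Sketch of the chain-of-thought interaction: Instruction 1 and Response 1 stay in the
# message history when Instruction 2 is issued. Q1-Q5 and the model name are placeholders.
from openai import OpenAI

client = OpenAI()

def cot_summarize(code: str, questions: list[str]) -> str:
    instruction_1 = "Code: \n" + code + "\nQuestion: \n" + "\n".join(questions) + "\n"
    history = [{"role": "user", "content": instruction_1}]
    response_1 = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=history, temperature=0.1
    ).choices[0].message.content
    history.append({"role": "assistant", "content": response_1})

    instruction_2 = ("Let's integrate the above information and generate a short "
                     "comment in one sentence for the function.")
    history.append({"role": "user", "content": instruction_2})
    response_2 = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=history, temperature=0.1
    ).choices[0].message.content
    return response_2  # Response 2 contains the generated comment
```

Critique prompting, described next, follows the same history-replay pattern with three instructions instead of two.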
Critique. Critique prompting improves the quality of LLMs' answers by asking LLMs to find errors in their answers and correct them. We follow Kim et al. [63] and perform critique prompting on the code summarization task through the six steps below:
(1) Instruction 1: Similar to zero-shot prompting, input the instruction and the code snippet in the format "Please generate a short comment in one sentence for the following function: \n⟨code⟩".
(2) Get the LLM's response to Instruction 1, i.e., Response 1. Response 1 contains the temporary comment generated by the LLM for the code snippet.
(3) Instruction 2: "Review your previous answer and find problems with your answer."
(4) Get the LLM's response to Instruction 2, i.e., Response 2.
(5) Instruction 3: "Based on the problems you found, improve your answer."
(6) Get the LLM's response to Instruction 3, i.e., Response 3. Response 3 contains the modified comment, which is the final comment for the code snippet.
When prompting each instruction, the previous instructions and responses are fed into the LLM as pairs of history prompts and answers.
Expert. Expert prompting first asks LLMs to generate a description of an expert who can complete the instruction (e.g., through few-shot prompting), and then the description serves as the system prompt for zero-shot prompting. We use the few-shot examples provided by Xu et al. [64] and employ few-shot prompting to let LLMs generate a description of an expert who can "Generate a short comment in one sentence for a function." This description then replaces the default system prompt of the LLM. By default, we use the system prompt [65] of CodeLlama-Instruct for all LLMs to ensure fairness in comparison. Then, we utilize the same steps as zero-shot prompting to adapt LLMs to generate summaries.
Due to the page limit, we present examples of the aforementioned prompting techniques on our anonymous site [40].

D. Experimental Datasets

The sources of the datasets utilized in our experiments include:
CodeSearchNet (CSN). The CodeSearchNet corpus [66] is a vast collection of methods accompanied by their respective comments, written in Go, Java, JavaScript, PHP, Python, and Ruby. This corpus has been widely used in studying code summarization [4], [67], [68]. We use the clean version of the CSN corpus provided by Lu et al. [49] in CodeXGLUE. We randomly select 200 samples for each programming language from the test set of this corpus for our experiments.
CCSD. The CCSD dataset is provided by Liu et al. [69]. They crawl data from 300+ projects such as Linux and Redis. The dataset contains 95,281 ⟨function, summary⟩ pairs. Similarly, we randomly select 200 samples from the final dataset for our experiments.
In addition to the above two sources, we construct three new language datasets to evaluate LLMs' code summarization capabilities across more programming language types.
Erlang, Haskell, and Prolog Datasets. Erlang and Haskell are Functional Programming languages (FP), and Prolog belongs to Logic Programming languages (LP). To construct the three datasets, we sort the GitHub repositories whose main language is Erlang/Haskell/Prolog by the number of stars, and crawl data from the top 50 repositories. Following Husain et al. [66], (1) we remove any projects that do not have a license or whose license does not explicitly permit the re-distribution of parts of the project. (2) We consider the first sentence in the comment as the function summary. (3) We remove data where the function is shorter than three lines or the comment contains fewer than 3 tokens. (4) We remove functions whose names contain the substring "test". (5) We remove duplicates by comparing the Jaccard similarities of the functions, following Allamanis et al. [70]. Finally, we obtain 7,025/6,759/1,547 ⟨function, summary⟩ pairs, respectively. For each language, we randomly select 200 samples for our experiments.
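As an illustration of step (5), a token-level Jaccard filter could be implemented as sketched below. The tokenizer and the 0.8 threshold are our own assumptions for illustration, not the exact values used in the pipeline.

```python
# Sketch of near-duplicate removal via token-level Jaccard similarity (step (5)).
# The identifier-based tokenizer and the 0.8 threshold are illustrative assumptions.
import re

def tokens(function_code: str) -> set[str]:
    return set(re.findall(r"[A-Za-z_]\w*", function_code))

def deduplicate(functions: list[str], threshold: float = 0.8) -> list[str]:
    kept, kept_tokens = [], []
    for code in functions:  # O(n^2) pairwise check; fine for a sketch
        t = tokens(code)
        is_dup = any(len(t & k) / max(len(t | k), 1) >= threshold for k in kept_tokens)
        if not is_dup:
            kept.append(code)
            kept_tokens.append(t)
    return kept
```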
All in all, our experiments involve 10 programming languages across 5 types. Note that, considering that experiments with LLMs are resource-intensive (especially those involving GPT, which are quite costly), not all experiments are conducted on all 10 programming language datasets. Specifically, we first conduct experiments associated with RQ1 and RQ2
on commonly used programming languages, including Java, Python, and C. Analyzing the results of these two RQs helps find a suitable automated evaluation method and a suitable prompting technique. Subsequent experiments for the other RQs can be built upon these findings, thereby significantly reducing experimental costs. We use all 10 programming languages in the experiments for RQ4. In the experiments for RQ5, we only use the Java dataset because other programming languages lack readily available comment classifiers. While training such classifiers would be valuable, it falls outside the scope of this paper and is left for future exploration.

TABLE I: Datasets. PP: Procedural Programming Languages, OOP: Object-Oriented Programming Languages, SP: Scripting Programming Languages, FP: Functional Programming Languages, LP: Logic Programming Languages.

Language   | Source | Type | Usage
Java       | CSN    | OOP  | RQ1, RQ2, RQ3, RQ4, RQ5
Python     | CSN    | SP   | RQ1, RQ2, RQ3, RQ4
C          | CCSD   | PP   | RQ1, RQ2, RQ3, RQ4
Ruby       | CSN    | SP   | RQ4
PHP        | CSN    | SP   | RQ4
Go         | CSN    | PP   | RQ4
JavaScript | CSN    | SP   | RQ4
Erlang     | by us  | FP   | RQ4
Haskell    | by us  | FP   | RQ4
Prolog     | by us  | LP   | RQ4

IV. RESULTS AND FINDINGS

A. RQ1: What evaluation methods are suitable for assessing the quality of summaries generated by LLMs?

1) Experimental Setup.
Comparison Evaluation Methods. Existing automated evaluation methods for code summarization can be divided into the following three categories.
i. Methods based on summary-summary text similarity assess the quality of the generated summary by calculating the text similarity between the generated summary and the reference summary. This category of methods is the most widely used in existing code summarization research [4], [25], [28], [30]. The text similarity metrics involved include BLEU, METEOR, and ROUGE-L, which compare the counts of n-grams in the generated summary against the reference summary. The scores of BLEU, METEOR, and ROUGE-L are in the range of [0, 1]. The higher the score, the closer the generated summary approximates the reference summary, indicating superior code summarization performance. All scores are computed with the same implementation provided by [47].
ii. Methods based on summary-summary semantic similarity evaluate the quality of the generated summary by computing the semantic similarity between the generated summary and the reference summary. Existing research [36] demonstrates that semantic similarity-based methods can effectively alleviate the issues of word overlap-based metrics, where not all words in a sentence have the same importance and many words have synonyms. In this study, we compare four such methods, including BERTScore [71], SentenceBert with Cosine Similarity (SBCS), SentenceBert with Euclidean Distance (SBED), and Universal Sentence Encoder [72] with Cosine Similarity (USECS). They are commonly used in code summarization studies [36], [73], [74]. BERTScore [71] uses a variant of BERT [75] (we use the default RoBERTa-large) to embed every token in the summaries, and computes the pairwise inner product between tokens in the reference summary and the generated summary. It then matches every token in the reference summary and the generated summary to compute the precision, recall, and F1 measure. In our experiments, we report the F1 measure of BERTScore. The other three methods use a pre-trained sentence encoder (SentenceBert [76] or Universal Sentence Encoder [72]) to produce vector representations of the two summary sentences, and then compute the cosine similarity or Euclidean distance of the vector representations. SBCS, SBED, and USECS range within [-1, 1]. Higher values of SBCS and USECS represent greater similarity, while lower values of SBED indicate greater similarity.
iii. Methods based on summary-code semantic similarity assess the quality of the generated summary by computing the semantic similarity between the generated summary and the code snippet to be summarized. Unlike the first two categories, this type of evaluation method does not rely on reference summaries and can effectively avoid issues related to low-quality and outdated reference summaries. SIDE, proposed by Mastropaolo et al. [10], is a representative of this type of method. It provides a continuous score ranging within [-1, 1], where a higher value represents greater similarity. We present the scores reported by the above similarity-based evaluation methods in percentage.
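As an illustration, a summary-summary semantic similarity score such as SBCS can be computed with an off-the-shelf sentence encoder. The sketch below is our own minimal example; the encoder checkpoint name is an assumption rather than the exact model used in the study.

```python
# Minimal sketch of a SentenceBERT-based cosine similarity (SBCS) score between a
# generated summary and a reference summary. The checkpoint "all-MiniLM-L6-v2" is an
# illustrative assumption, not necessarily the encoder used in our experiments.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def sbcs(generated: str, reference: str) -> float:
    """Cosine similarity between sentence embeddings, reported in percentage."""
    emb = encoder.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() * 100  # scores are presented in percent

print(sbcs("Pushes an item onto the top of this stack.",
           "Push an element onto the stack."))
```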
Human Evaluation. We conduct human evaluation as a reference for the automated evaluation methods. Comparing the correlation between the results of automated evaluation methods and human evaluation can facilitate achieving the goal of this RQ, which is to find a suitable automated method for assessing the quality of LLM-generated summaries. To do so, we invite 15 volunteers (including 1 PhD candidate, 5 master's students, and 9 undergraduates) with more than 3 years of software development experience and excellent English ability to carry out the evaluation. For each sample, we provide volunteers with the code snippet, the reference summary, and the summaries generated by the four LLMs, where the reference summary and the summaries generated by the four LLMs are mixed and out of order. In other words, for each sample, volunteers do not know whether a summary is the reference or a summary generated by a certain LLM. We follow Shi et al. [9] and ask volunteers to rate the summaries from 1 to 5 based on their quality, where a higher score represents higher quality. The final score of a summary is the average of the scores rated by the 15 volunteers.
LLM-based evaluation methods. Inspired by recent work in NLP [37]–[39], we also investigate the feasibility of employing LLMs as evaluators. Its advantage is that it does not rely on the quality of reference summaries, and the evaluation steps can be the same as human evaluation. Specifically, similar to human evaluation, when using LLMs as evaluators, for each sample we input the code snippet to be summarized, the reference summary, and the LLM-generated summaries, and ask LLMs to
rate each summary from 1 to 5, where a higher score represents a higher quality of the summary. The specific prompt when using LLMs as evaluators is shown in Figure 2.

Fig. 2: An example of using an LLM as an evaluator.
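As a rough illustration of the mechanics (not the exact prompt of Figure 2), such a rating call could look like the sketch below; the prompt wording and model name are abridged placeholders.

```python
# Sketch of using an LLM as an evaluator: the model rates a summary from 1 to 5.
# The prompt is an abridged placeholder for the full prompt shown in Figure 2.
from openai import OpenAI

client = OpenAI()

def llm_rate_summary(code: str, summary: str, model: str = "gpt-4-1106-preview") -> int:
    prompt = (
        "You will be given a code snippet and one summary of it. Rate the quality of "
        "the summary on a scale of 1 to 5 (higher is better). Reply with a single integer.\n"
        f"Code:\n{code}\n\nSummary:\n{summary}\nScore:"
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scores, as for the GPT-4 evaluator in later RQs
    ).choices[0].message.content
    return int(reply.strip().split()[0])
```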
Datasets and Prompting Techniques. In this RQ, to reduce the workload of the human evaluation volunteers, we randomly select 50 samples each from the Java, Python, and C datasets, i.e., 150 samples in total. We employ few-shot prompting to adapt the four LLMs to generate summaries for code snippets, as recent studies [25], [28], [30] have demonstrated the effectiveness of this prompting technique on code summarization tasks.
2) Experimental Results.

TABLE II: Human evaluation scores for reference and LLM-generated summaries. The value in parentheses represents the percentage increase or decrease relative to the score of the corresponding reference summary.

Summary from       | Java           | Python         | C
Reference          | 3.19           | 3.56           | 3.05
CodeLlama-Instruct | 3.93 (+23.20%) | 3.88 (+8.99%)  | 4.15 (+36.07%)
StarChat-β         | 3.18 (-0.31%)  | 3.14 (-11.80%) | 3.49 (+14.43%)
GPT-3.5            | 4.00 (+25.39%) | 4.16 (+16.85%) | 4.06 (+33.11%)
GPT-4              | 4.17 (+30.72%) | 4.06 (+14.04%) | 4.25 (+39.34%)

Human Evaluation Results. Table II shows the human evaluation scores for reference summaries and the summaries generated by the four LLMs. Observe that the scores of reference summaries on the three datasets are between 3 and 3.5 points, suggesting that the quality of the reference summaries is not very high. Therefore, evaluation methods based on summary-summary/code similarity may not accurately assess the quality of LLM-generated summaries. Among the four LLMs, GPT-4 has the highest scores on the Java and C datasets, and GPT-3.5 attains the highest score on the Python dataset. This suggests that the quality of summaries generated by GPT-3.5 and GPT-4 is relatively high.

☞ Finding ▶ According to human evaluation, the quality of reference summaries in the existing datasets is not particularly high. Summaries from general-purpose LLMs (e.g., GPT-3.5) excel over those from specialized code LLMs (e.g., CodeLlama-Instruct) in quality. ◀

TABLE III: Automated evaluation scores for reference and LLM-generated summaries. S-S Tex.Sim.: methods based on summary-summary text similarity; S-S Sem.Sim.: methods based on summary-summary semantic similarity; S-C Sem.Sim.: methods based on summary-code semantic similarity. We bold the best score in each column.

Language | Summary from       | BLEU  METEOR ROUGE-L | BERTScore SBCS  SBED  USECS | SIDE  | CodeLlama-I StarChat-β GPT-3.5 GPT-4 | Human
Java     | Reference          | /     /      /       | /         /     /     /     | 86.15 | 1.42        2.58       3.08    2.80  | 3.19
Java     | CodeLlama-Instruct | 13.00 17.90  32.21   | 87.94     59.61 86.88 50.69 | 46.62 | 2.32        2.80       3.28    3.64  | 3.93
Java     | StarChat-β         | 18.95 18.19  38.43   | 88.69     61.97 83.45 50.57 | 80.46 | 2.24        1.94       2.42    2.50  | 3.18
Java     | GPT-3.5            | 12.49 16.74  31.87   | 87.73     59.47 88.11 48.87 | 62.04 | 2.40        2.40       3.72    3.82  | 4.00
Java     | GPT-4              | 9.46  17.02  28.36   | 86.72     58.83 89.27 46.50 | 36.12 | 2.44        2.60       4.10    4.50  | 4.17
Python   | Reference          | /     /      /       | /         /     /     /     | 16.11 | 1.48        2.74       2.84    2.98  | 3.56
Python   | CodeLlama-Instruct | 16.04 21.64  37.80   | 89.06     61.57 85.40 55.86 | 38.84 | 1.62        2.60       3.44    3.72  | 3.88
Python   | StarChat-β         | 18.35 17.62  37.96   | 88.92     58.97 87.39 51.54 | 24.05 | 1.94        1.96       2.40    2.42  | 3.14
Python   | GPT-3.5            | 11.95 19.14  30.20   | 87.63     61.37 86.36 49.54 | 41.71 | 1.96        2.72       4.32    4.30  | 4.16
Python   | GPT-4              | 14.07 20.87  35.38   | 88.11     60.65 87.04 51.21 | 33.16 | 1.76        2.54       3.92    4.16  | 4.06
C        | Reference          | /     /      /       | /         /     /     /     | 64.23 | 1.56        2.80       2.24    2.62  | 3.05
C        | CodeLlama-Instruct | 10.92 17.29  28.71   | 86.38     51.55 95.94 37.95 | 46.69 | 2.62        3.02       3.82    3.84  | 4.15
C        | StarChat-β         | 15.58 15.57  32.84   | 87.27     54.85 91.92 40.60 | 66.35 | 2.76        2.74       2.20    2.62  | 3.49
C        | GPT-3.5            | 12.06 16.00  29.81   | 86.65     53.61 93.71 39.75 | 50.17 | 3.04        2.86       3.48    3.66  | 4.06
C        | GPT-4              | 10.07 16.18  28.63   | 86.03     53.00 94.77 37.30 | 41.37 | 3.18        2.86       4.00    4.36  | 4.25

Automated Evaluation Results. Table III displays the scores of the LLM-generated summaries reported by the three categories of automated evaluation methods and the LLM-based evaluation methods. Observe that among the three methods based on summary-summary text similarity, 1) the BLEU-based and ROUGE-L-based methods give StarChat-β the highest scores on all three datasets; 2) the METEOR-based method gives StarChat-β the highest score (i.e., 18.19) on the Java dataset, while giving CodeLlama-Instruct the highest scores (i.e., 21.64 and 17.29) on the Python and C datasets. Among the four methods based on summary-summary semantic similarity, BERTScore, SBCS, and SBED give the best scores to StarChat-β, and USECS gives the best score of 50.69 to CodeLlama-Instruct on the Java dataset. On the Python and C datasets, the four methods consistently give the best scores to CodeLlama-Instruct and StarChat-β, respectively. The summary-code semantic similarity-based method SIDE gives the highest scores (i.e., 80.46 and 66.35) to StarChat-β on the Java and C datasets, and the highest score (i.e., 41.71) to GPT-3.5 on the Python dataset. The four LLM-based methods consistently give the highest scores to GPT-4 on the Java and C datasets, while they consistently award the highest scores to GPT-3.5 on the Python dataset.

☞ Finding ▶ According to automated evaluation, overall, methods based on summary-summary text/semantic similarity tend to give higher scores to the specialized code LLMs StarChat-β and CodeLlama-Instruct, while LLM-based evaluators tend to give higher scores to the general-purpose LLMs GPT-3.5 and GPT-4. The summary-code semantic similarity-based method tends to give higher scores to StarChat-β on the Java and C datasets, while favoring GPT-3.5 on the Python dataset. ◀

Correlation between Automated Evaluation and Human Evaluation. From Table III, it can also be observed that the average scores of reference summaries evaluated by the four LLM-based methods are mostly below 3 points. This means that, similar to human evaluation, the LLM-based evaluation methods also consider the quality of the reference summaries to be not very high. Besides, the LLM-based evaluation methods are inclined to give higher scores to the general-purpose LLMs GPT-3.5 and GPT-4, which is consistent with human evaluation.
Based on the above observations, we can reasonably speculate that, compared to methods based on summary-summary text/semantic similarity and summary-code semantic similarity, LLM-based evaluation methods may be more suitable for evaluating the quality of summaries generated by LLMs. Therefore, we follow [9], [73] and calculate Spearman's correlation coefficient ρ with the p-value between the results of each automated evaluation method and human evaluation, providing more convincing evidence for this speculation. Spearman's correlation coefficient ρ ∈ [−1, 1] is suitable for judging the correlation between two sequences of discrete ordinal/continuous data, with a higher value representing a stronger correlation [77]. −1 ≤ ρ < 0, ρ = 0, and 0 < ρ ≤ 1 respectively indicate the presence of negative correlation, no
correlation, and positive correlation [78]. The p-value helps determine whether an observed correlation is statistically significant or simply due to random chance. By comparing the p-value to a predefined significance level (typically 0.05), we can decide whether to reject the null hypothesis and conclude that the correlation is statistically significant.
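This correlation analysis can be carried out with standard tooling. The following is a minimal sketch, with toy values, of computing ρ and the p-value between one automated method's scores and the averaged human scores over the same samples; it is our own illustration rather than the study's exact analysis script.

```python
# Sketch: Spearman's rho and p-value between an automated metric's scores and the
# averaged human scores for the same set of samples.
from scipy.stats import spearmanr

def correlate(metric_scores: list[float], human_scores: list[float]) -> tuple[float, float]:
    rho, p_value = spearmanr(metric_scores, human_scores)
    return rho, p_value

rho, p = correlate([0.12, 0.30, 0.25, 0.41], [3.0, 4.2, 3.6, 4.5])  # toy values
print(f"rho={rho:.2f}, p={p:.2f}")  # the correlation is significant if p < 0.05
```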
TABLE IV: Spearman's correlation coefficient ρ with the p-value (values in parentheses) between the results of each automated evaluation method and human evaluation. CodeLlama-I: CodeLlama-Instruct. We bold the best score in each row.

Language | Summary from | BLEU | METEOR | ROUGE-L | BERTScore | SBCS | SBED | USECS | SIDE | CodeLlama-I | StarChat-β | GPT-3.5 | GPT-4
Java | Reference | / | / | / | / | / | / | / | 0.31 (.03) | 0.06 (.70) | 0.26 (.07) | 0.53 (.00) | 0.60 (.00)
Java | CodeLlama-I | -0.25 (.08) | 0.07 (.65) | -0.14 (.35) | -0.16 (.27) | 0.01 (.94) | -0.01 (.94) | 0.01 (.95) | -0.30 (.03) | -0.00 (.98) | -0.02 (.91) | 0.35 (.01) | 0.49 (.00)
Java | StarChat-β | 0.22 (.13) | 0.32 (.02) | 0.15 (.31) | 0.22 (.12) | 0.30 (.04) | -0.30 (.04) | 0.32 (.02) | -0.12 (.42) | 0.11 (.44) | 0.16 (.27) | 0.52 (.00) | 0.41 (.00)
Java | GPT-3.5 | -0.10 (.50) | 0.12 (.39) | 0.22 (.13) | 0.00 (.99) | -0.01 (.97) | 0.01 (.97) | 0.10 (.50) | -0.20 (.16) | 0.28 (.05) | -0.08 (.57) | 0.60 (.00) | 0.56 (.00)
Java | GPT-4 | 0.05 (.74) | -0.02 (.87) | 0.05 (.75) | 0.02 (.87) | 0.14 (.34) | -0.14 (.34) | 0.11 (.46) | -0.19 (.18) | 0.00 (.98) | 0.04 (.76) | 0.38 (.01) | 0.40 (.00)
Python | Reference | / | / | / | / | / | / | / | -0.04 (.79) | -0.15 (.30) | 0.19 (.19) | 0.35 (.01) | 0.37 (.01)
Python | CodeLlama-I | -0.17 (.24) | -0.05 (.76) | -0.11 (.46) | -0.16 (.28) | -0.08 (.56) | 0.08 (.56) | -0.08 (.60) | 0.10 (.48) | 0.15 (.29) | -0.01 (.97) | 0.52 (.00) | 0.45 (.00)
Python | StarChat-β | 0.20 (.17) | 0.52 (.00) | 0.19 (.20) | 0.24 (.1) | 0.18 (.22) | -0.18 (.22) | 0.50 (.00) | 0.02 (.89) | -0.19 (.18) | 0.24 (.09) | 0.48 (.00) | 0.48 (.00)
Python | GPT-3.5 | -0.20 (.17) | 0.01 (.96) | -0.18 (.21) | -0.07 (.61) | -0.02 (.89) | 0.02 (.89) | -0.12 (.42) | 0.17 (.24) | 0.19 (.19) | 0.35 (.01) | 0.32 (.02) | 0.42 (.00)
Python | GPT-4 | -0.24 (.09) | 0.12 (.41) | -0.16 (.26) | -0.05 (.73) | 0.02 (.91) | -0.02 (.91) | -0.19 (.19) | 0.15 (.31) | 0.18 (.21) | 0.17 (.25) | 0.16 (.28) | 0.17 (.25)
C | Reference | / | / | / | / | / | / | / | 0.23 (.11) | 0.12 (.39) | 0.27 (.06) | 0.60 (.00) | 0.62 (.00)
C | CodeLlama-I | -0.28 (.05) | -0.11 (.46) | -0.09 (.55) | -0.28 (.05) | -0.05 (.76) | 0.04 (.76) | -0.07 (.61) | -0.32 (.03) | -0.00 (.98) | -0.01 (.48) | 0.33 (.02) | 0.42 (.00)
C | StarChat-β | -0.00 (.98) | 0.35 (.01) | 0.17 (.24) | 0.00 (.98) | 0.17 (.22) | -0.17 (.22) | 0.26 (.07) | -0.01 (.94) | 0.05 (.72) | 0.17 (.23) | 0.16 (.27) | 0.62 (.00)
C | GPT-3.5 | -0.35 (.01) | 0.14 (.33) | -0.12 (.41) | -0.06 (.67) | 0.05 (.71) | -0.05 (.71) | 0.07 (.64) | -0.47 (.00) | 0.21 (.15) | 0.21 (.14) | 0.38 (.01) | 0.65 (.00)
C | GPT-4 | 0.11 (.44) | 0.38 (.01) | 0.37 (.01) | 0.15 (.29) | 0.21 (.14) | -0.21 (.15) | 0.21 (.14) | -0.26 (.07) | -0.09 (.55) | 0.01 (.92) | 0.38 (.01) | 0.28 (.05)

Table IV shows the statistical results of ρ and the p-values. It can be clearly observed that, among all automated evaluation methods, there is a significant positive correlation (0.28 ≤ ρ ≤ 0.65) between the GPT-4-based evaluation method and human evaluation in scoring the quality of summaries generated by most LLMs, followed by the GPT-3.5-based evaluation method. For the other automated evaluation methods, in most cases their correlation with human evaluation is negative or weakly positive. Based on the above observations, we draw the conclusion that, compared with other automated evaluation methods, the GPT-4-based method is more suitable for evaluating the quality of summaries generated by LLMs. In the subsequent RQs, we uniformly employ the GPT-4-based method to assess the quality of LLM-generated summaries. To make the output scores of GPT-4 more deterministic, we set the temperature to 0 when using GPT-4 as the evaluator.

✎ Summary ▶ Among all automated evaluation methods, the GPT-4-based method overall has the strongest correlation with human evaluation. Therefore, it is recommended to adopt the GPT-4-based method to evaluate the quality of LLM-generated summaries. ◀

B. RQ2: How effective are different prompting techniques in adapting LLMs to the code summarization task?

1) Experimental Setup. The experimental dataset comprises 600 samples drawn collectively from the Java, Python, and C datasets.

TABLE V: Effectiveness of different prompting techniques.

Model              | Prompting Technique | Java | Python | C
CodeLlama-Instruct | zero-shot           | 3.42 | 2.98   | 3.41
                   | few-shot            | 3.78 | 3.75   | 3.91
                   | chain-of-thought    | 3.21 | 3.14   | 3.37
                   | critique            | 2.15 | 2.02   | 2.13
                   | expert              | 3.13 | 3.35   | 1.70
StarChat-β         | zero-shot           | 2.71 | 2.85   | 2.86
                   | few-shot            | 2.50 | 2.37   | 2.68
                   | chain-of-thought    | 2.86 | 2.77   | 3.06
                   | critique            | 2.36 | 2.57   | 2.60
                   | expert              | 2.66 | 3.02   | 3.01
GPT-3.5            | zero-shot           | 3.90 | 3.96   | 3.93
                   | few-shot            | 3.73 | 3.97   | 3.56
                   | chain-of-thought    | 3.36 | 3.47   | 3.36
                   | critique            | 3.09 | 3.21   | 3.31
                   | expert              | 2.72 | 3.43   | 3.49
GPT-4              | zero-shot           | 4.50 | 4.55   | 4.42
                   | few-shot            | 4.38 | 4.16   | 4.18
                   | chain-of-thought    | 4.57 | 4.60   | 4.44
                   | critique            | 4.41 | 4.44   | 4.34
                   | expert              | 4.52 | 4.23   | 4.50

2) Experimental Results. Table V presents the scores reported by the GPT-4 evaluation method for the summaries generated by the four LLMs using the five prompting techniques. Observe that when the base model is CodeLlama-Instruct, few-shot prompting consistently performs best on all three datasets. When the base model is StarChat-β, chain-of-thought prompting performs best on both the Java and C datasets, while expert prompting excels on the Python dataset. When selecting GPT-3.5 as the base model, the simplest zero-shot prompting surprisingly achieves the highest scores on the Java and C datasets, and is only slightly worse than few-shot prompting on the Python dataset. When using GPT-4 as the base model, chain-of-thought prompting overall performs best.
For a specific LLM and programming language, there is no guarantee that intuitively more advanced prompting techniques will surpass simple zero-shot prompting. For example, on the Java dataset, when selecting any of StarChat-β, GPT-3.5, and GPT-4 as the base model, few-shot prompting yields lower scores than zero-shot prompting. Contrary to the findings of previous studies [25], [28], the GPT-4-based evaluation method does not consider that few-shot prompting improves the quality of generated summaries. This discrepancy may arise because previous studies evaluated the quality of LLM-generated summaries using BLEU, METEOR, and ROUGE-L, which primarily assess text/semantic similarity with reference summaries. However, as we mentioned in Section IV-A, reference summaries contain low-quality noisy data that undermines their reliability. Therefore, achieving greater similarity with reference summaries does not necessarily imply that the human/GPT-4-based evaluation method will perceive a summary to be of higher quality.

✎ Summary ▶ The more advanced prompting techniques expected to perform better may not necessarily outperform simple zero-shot prompting. In practice, selecting the appropriate prompting technique requires considering the base LLM and the programming language. ◀

C. RQ3: How do different model settings affect LLMs' code summarization performance?

1) Experimental Setup. There are three key model settings/parameters, top_k, top_p, and temperature, that allow the user to control the randomness of the text (code
summary in our scenario) generated by LLMs. Considering that GPT-3.5 and GPT-4 do not support the top_k setting, we only conduct experiments with top_p and temperature.
Top_p: In each round of token generation, LLMs sort tokens by probability from high to low and keep the tokens whose probabilities add up to (no more than) top_p. For example, top_p = 0.1 means that only the tokens comprising the top 10% probability mass are considered. The larger the top_p, the more tokens are sampled; thus tokens with low probabilities have a greater chance of being selected, and the summary generated by the LLM is more random.
Temperature: Temperature adjusts the probabilities of tokens after top_p sampling. The higher the temperature, the smaller the differences between the adjusted token probabilities. Therefore, tokens with low probabilities have a greater chance of being selected, and the generated summary is more random. If the temperature is set to 0, the generated summary is the same every time.
Top_p and temperature are alternatives, and one should only modify one of the two parameters at a time [79]. Therefore, the questions we want to answer are: (1) Does top_p/temperature impact the quality of LLM-generated summaries? (2) As alternative parameters that both control the randomness of LLMs, do top_p and temperature differ in the degree of their influence on the quality of LLM-generated summaries?
Drawing from a review of related work (see Section II), we find that existing LLM-based code summarization studies pay more attention to few-shot prompting. Since no prompting technique outperforms the others on all LLMs, we uniformly employ few-shot prompting in RQ3, RQ4, and RQ5 to facilitate comparing our findings with prior studies.
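As an illustration of how the settings reported in Table VI could be exercised, the sketch below requests a summary under each (top_p, temperature) combination. The model name and the single-turn prompt are simplified placeholders; the grid of values mirrors Table VI rather than the production harness.

```python
# Sketch: generating summaries under each (top_p, temperature) combination from Table VI.
# Model name and prompt are illustrative placeholders.
from openai import OpenAI

client = OpenAI()
PROMPT = "Please generate a short comment in one sentence for the following function:\n"

def sweep(code: str) -> dict[tuple[float, float], str]:
    results = {}
    for top_p in (0.5, 0.75, 1.0):
        for temperature in (0.1, 0.5, 1.0):
            summary = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": PROMPT + code}],
                top_p=top_p,
                temperature=temperature,
            ).choices[0].message.content
            results[(top_p, temperature)] = summary
    return results
```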
TABLE VI: Influence of different model settings. We bold the scores of the best setting combinations on each dataset.

Model              | Top_p | Temperature | Java | Python | C
CodeLlama-Instruct | 0.5   | 0.1         | 3.81 | 3.83   | 4.10
                   | 0.5   | 0.5         | 3.72 | 3.85   | 4.08
                   | 0.5   | 1.0         | 3.91 | 3.81   | 4.11
                   | 0.75  | 0.1         | 3.76 | 3.87   | 4.02
                   | 0.75  | 0.5         | 3.91 | 3.73   | 4.01
                   | 0.75  | 1.0         | 3.80 | 3.79   | 3.88
                   | 1.0   | 0.1         | 3.78 | 3.75   | 3.91
                   | 1.0   | 0.5         | 3.91 | 3.75   | 3.99
                   | 1.0   | 1.0         | 3.73 | 3.59   | 3.60
StarChat-β         | 0.5   | 0.1         | 2.49 | 2.42   | 2.72
                   | 0.5   | 0.5         | 2.47 | 2.36   | 2.70
                   | 0.5   | 1.0         | 2.49 | 2.29   | 2.75
                   | 0.75  | 0.1         | 2.50 | 2.35   | 2.66
                   | 0.75  | 0.5         | 2.45 | 2.47   | 2.80
                   | 0.75  | 1.0         | 2.48 | 2.37   | 2.71
                   | 1.0   | 0.1         | 2.50 | 2.37   | 2.68
                   | 1.0   | 0.5         | 2.53 | 2.45   | 2.77
                   | 1.0   | 1.0         | 2.54 | 2.38   | 2.69
GPT-3.5            | 0.5   | 0.1         | 3.41 | 3.60   | 3.40
                   | 0.5   | 0.5         | 3.45 | 3.73   | 3.38
                   | 0.5   | 1.0         | 3.52 | 3.68   | 3.42
                   | 0.75  | 0.1         | 3.54 | 3.66   | 3.35
                   | 0.75  | 0.5         | 3.55 | 3.65   | 3.24
                   | 0.75  | 1.0         | 3.46 | 3.64   | 3.41
                   | 1.0   | 0.1         | 3.73 | 3.97   | 3.56
                   | 1.0   | 0.5         | 3.55 | 3.71   | 3.42
                   | 1.0   | 1.0         | 3.41 | 3.72   | 3.52
GPT-4              | 0.5   | 0.1         | 4.44 | 4.25   | 4.33
                   | 0.5   | 0.5         | 4.47 | 4.30   | 4.31
                   | 0.5   | 1.0         | 4.45 | 4.31   | 4.29
                   | 0.75  | 0.1         | 4.48 | 4.27   | 4.31
                   | 0.75  | 0.5         | 4.46 | 4.34   | 4.26
                   | 0.75  | 1.0         | 4.47 | 4.33   | 4.36
                   | 1.0   | 0.1         | 4.38 | 4.16   | 4.18
                   | 1.0   | 0.5         | 4.43 | 4.27   | 4.33
                   | 1.0   | 1.0         | 4.40 | 4.18   | 4.33

2) Experimental Results. Table VI shows the scores evaluated by the GPT-4 evaluation method for the summaries generated by the LLMs under different top_p and temperature settings. It is observed that the impact of top_p and temperature on the quality of LLM-generated summaries is specific to the base LLM and programming language. For example, when top_p=0.5, as the temperature increases, the quality of GPT-4-generated summaries for Python code snippets increases, while that for C code snippets decreases. Another example is that when top_p=0.5, as the temperature rises, the quality of GPT-4-generated Java comments first increases and then decreases, whereas CodeLlama-Instruct is exactly the opposite: it first decreases and then increases. Regarding the difference in influence between top_p and temperature, it is observed that in most cases the influence of the two parameters is similar. For example, for C code snippets, when one parameter (top_p or temperature) is fixed, as the other parameter (temperature or top_p) grows, the quality of GPT-3.5-generated summaries first decreases and then increases.

✎ Summary ▶ The impact of top_p and temperature on the quality of generated summaries is specific to the base LLM and programming language. As alternative parameters, top_p and temperature have a similar influence on the quality of LLM-generated summaries. ◀
D. RQ4: How do LLMs perform in summarizing code snippets written in different types of programming languages?

1) Experimental Setup. We conduct experiments on all 10 programming language datasets. As in RQ3, we uniformly employ few-shot prompting to adapt the LLMs.

TABLE VII: Effectiveness of LLMs in summarizing code snippets written in different types of programming languages. CodeLlama-I: CodeLlama-Instruct.

             | OOP  | PP         | SP                             | FP             | LP
Model        | Java | C    Go    | Python Ruby  PHP   JavaScript  | Erlang Haskell | Prolog
CodeLlama-I  | 3.78 | 3.91 3.86  | 3.75   3.98  3.88  4.03        | 3.51   3.58    | 3.23
StarChat-β   | 2.50 | 2.68 2.97  | 2.37   2.79  2.73  2.67        | 2.68   2.88    | 2.34
GPT-3.5      | 3.73 | 3.56 4.14  | 3.97   3.64  3.99  3.53        | 3.57   3.44    | 3.42
GPT-4        | 4.38 | 4.18 4.36  | 4.16   4.37  4.31  4.29        | 4.23   4.22    | 4.05

2) Experimental Results. Table VII shows the performance evaluated by the GPT-4 evaluation method for the four LLMs on the five types of programming languages. It is observed that for OOP (i.e., Java), GPT-4 performs best, followed by CodeLlama-Instruct, GPT-3.5, and StarChat-β. For PP, GPT-4 performs best on both C and Go, while StarChat-β performs worst on both. The smallest LLM, CodeLlama-Instruct, outperforms GPT-3.5 on C (3.91 vs. 3.56), but vice versa on Go (3.86 vs. 4.14). Additionally, except for CodeLlama-Instruct, which performs slightly worse on Go than on C (3.86 vs. 3.91), the other three LLMs perform better on Go than on C. For SP, GPT-4 consistently performs best on all four languages. Surprisingly, CodeLlama-Instruct outperforms GPT-3.5 on both Ruby and JavaScript. All four LLMs perform better on PHP than on Python. For FP, the performance of the two specialized code LLMs (i.e., CodeLlama-Instruct and StarChat-β) is better on Haskell than on Erlang, while the opposite is true for the two general-purpose LLMs (i.e., GPT-3.5 and GPT-4). For LP, GPT-4 still performs best, followed by GPT-3.5, CodeLlama-Instruct, and StarChat-β. Across all five types of languages, the four LLMs consistently perform the worst on LP, which indicates that summarizing logic programming language code is the most challenging. One possible reason is that fewer Prolog datasets are available for training these LLMs compared to other programming languages. The small scale of the Prolog dataset we collected supports this explanation.

✎ Summary ▶ GPT-4 surpasses the other three LLMs on all five types of programming languages. For PP, LLMs overall perform better on Go than on C. For SP, all four LLMs perform better on PHP than on Python. For FP, specialized code LLMs (e.g., StarChat-β) perform better on Haskell than on Erlang, whereas the reverse is true for general-purpose LLMs (e.g., GPT-4). All four LLMs perform worse in summarizing LP code snippets. ◀

E. RQ5: How do LLMs perform on different categories of summaries?

1) Experimental Setup. Following [3], [54], [55], we classify code summaries into the following six categories.
What: describes the functionality of the code snippet. It helps developers to understand the main functionality of the code without diving into implementation details. An example is "Pushes an item onto the top of this stack".
Why: explains the reason why the code snippet is written or the design rationale of the code snippet. It is useful when a method's objective is masked by a complex implementation. An application scenario of Why summaries is to explain the design rationale of overloaded functions.
How-it-is-done: describes the implementation details of the code snippet. Such information is critical for developers to understand the subject, especially when the code complexity is high. For instance, "Shifts any subsequent elements to the left." is a How-it-is-done comment.
Property: asserts properties of the code snippet, e.g., a function's pre-conditions/post-conditions. "This method is not a constant-time operation." is a Property summary.
How-to-use: describes the expected set-up for using the code snippet, such as platforms and compatible versions. For example, "This method can be called only once per call to next()." is a How-to-use summary.
Others: Comments that do not fall into the above five categories are classified as Others summaries, such as "The implementation is awesome.". Following Mu et al. [55], we consider the ⟨code, summary⟩ pairs with Others comments as noisy data and remove them if identified.
We employ the comment classifier COIN provided by Mu et al. [55] to classify the CSN-Java dataset according to the comment intention type. The test dataset is divided into six sub-datasets, as shown in Table VIII. To facilitate comparison between different categories, we randomly select 180 samples from each sub-dataset. As in RQ4, we uniformly employ few-shot prompting to adapt the LLMs. For each sub-dataset with a different intention type, the few-shot examples are of the same intention type, taken from the training dataset.

TABLE VIII: Statistics of the six sub-datasets divided from the CSN-Java test dataset according to comment intention.

Summary Category | Number of Samples | Sample Ratio
What             | 6,132             | 0.56
Why              | 1,190             | 0.11
How-it-is-done   | 2,242             | 0.20
Property         | 1,174             | 0.11
How-to-use       | 180               | 0.02
Others           | 37                | < 0.01

TABLE IX: Effectiveness of LLMs in generating different categories of summaries.

Model              | What | Why  | How-it-is-done | Property | How-to-use
CodeLlama-Instruct | 4.15 | 4.29 | 3.85           | 4.19     | 3.96
StarChat-β         | 2.68 | 2.78 | 2.77           | 2.94     | 2.52
GPT-3.5            | 3.61 | 3.54 | 3.97           | 3.54     | 4.17
GPT-4              | 4.40 | 4.28 | 4.31           | 4.06     | 4.22

2) Experimental Results. Table IX presents the results evaluated by the GPT-4 evaluation method for the four LLMs in generating the five categories of summaries. Observe that CodeLlama-Instruct performs worse in generating How-it-is-done summaries than in generating the other four categories of summaries. StarChat-β gets its lowest score of 2.52 in generating How-to-use summaries. Both GPT-3.5 and GPT-4 are not as good at generating Property summaries as at generating other categories of summaries. Surprisingly, the smallest LLM, CodeLlama-Instruct, slightly outperforms the advanced GPT-4 in generating Why (4.29 vs. 4.28) and Property (4.19 vs. 4.06) summaries. Additionally, compared with GPT-3.5, CodeLlama-Instruct achieves higher scores in generating What, Why, and Property summaries. Admittedly, one reason for this phenomenon is that the optimal prompting technique for GPT-3.5 and GPT-4 is not few-shot prompting. This phenomenon is also exciting because it implies that most ordinary developers or teams who lack sufficient resources (e.g., GPUs) have the opportunity to utilize open-source and small-scale LLMs to achieve code summarization capabilities close to (or even surpassing) those of commercial gigantic LLMs.

✎ Summary ▶ The four LLMs excel in generating different categories of summaries. The smallest CodeLlama-Instruct slightly outperforms the advanced GPT-4 in generating Why and Property summaries. StarChat-β is not proficient at generating How-to-use summaries. GPT-3.5 and GPT-4 perform worse in generating Property summaries than other categories of summaries. ◀

V. THREATS TO VALIDITY

Our empirical study may contain several threats to validity that we have attempted to mitigate.
Threats to External Validity. The threats to external validity lie in the generalizability of our findings. One threat to the validity of our study is that LLMs usually generate varied responses for identical input across multiple requests due to their inherent randomness, and conclusions drawn from random results may be misleading. To mitigate this threat, considering that StarChat-β and CodeLlama-Instruct do not support setting the temperature to 0, we uniformly set it to 0.1 to reduce randomness, except for RQ3. In RQ2-RQ5, to make the evaluation scores more deterministic, we set the temperature to 0 when using GPT-4 as the evaluator. Additionally, for the other RQs, we conduct experiments on multiple programming languages to support our findings.
Threats to Internal Validity. A major threat to internal validity is potential mistakes in the implementation of metrics and models. To mitigate this threat, we use the publicly available code from previous studies [10], [47] for BLEU, METEOR, ROUGE-L, and SIDE. For COIN, BERTScore, SentenceBert, Universal Sentence Encoder, StarChat-β [80], CodeLlama-Instruct [81], and GPT-3.5/GPT-4 [82], we use the scripts provided along with the models.
Another threat lies in the processing of LLMs' responses. Usually, the output of an LLM is a paragraph, not the single-sentence code summary (code comment) that we want. The real code summary may be the first sentence in the LLM's response, or it may be returned in a comment before the code, such as "/** ⟨code summary⟩ */", etc. Therefore, we designed a series of heuristic rules to extract the code summary. We have made our script for extracting code summaries from LLMs' responses public for the community to review.
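As a condensed illustration of this kind of rule (not the released extraction script), the sketch below shows two such heuristics: prefer a leading block comment if one is present, otherwise take the first sentence of the response.

```python
# Condensed illustration of heuristic summary extraction from an LLM response.
# Only two simplified rules are shown; the released script implements more.
import re

def extract_summary(response: str) -> str:
    block = re.search(r"/\*\*?(.*?)\*/", response, flags=re.DOTALL)  # "/** ... */" comment
    text = block.group(1) if block else response
    text = " ".join(line.strip(" *") for line in text.strip().splitlines()).strip()
    first_sentence = re.split(r"(?<=[.!?])\s", text, maxsplit=1)[0]  # keep first sentence
    return first_sentence
```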
VI. CONCLUSION

In this paper, we provide a comprehensive study covering multiple aspects of code summarization in the era of LLMs. Our interesting and significant findings include, but are not limited to, the following aspects. 1) Compared with existing automated evaluation methods, the GPT-4-based evaluation method is more fitting for assessing the quality of LLM-generated summaries. 2) The advanced prompting techniques anticipated to yield superior performance may not invariably surpass the efficacy of straightforward zero-shot prompting. 3) The two alternative model settings have a similar impact on the quality of LLM-generated summaries, and this impact varies with the base LLM and programming language. 4) LLMs exhibit inferior performance in summarizing LP code snippets. 5)
CodeLlama-Instruct with 7B parameters demonstrates superior [12] C. Fang, W. Sun, Y. Chen, X. Chen, Z. Wei, Q. Zhang, Y. You, B. Luo,
performance over the advanced GPT-4 in generating Why and Y. Liu, and Z. Chen, “Esale: Enhancing code-summary alignment
learning for source code summarization,” IEEE Transactions on Software
Property summaries. Our comprehensive research findings Engineering (Early Access), pp. 1–18, 2024.
will aid subsequent researchers in swiftly grasping the various [13] M. Du, F. He, N. Zou, D. Tao, and X. Hu, “Shortcut learning of large
facets of LLM-based code summarization, thereby promoting language models in natural language understanding,” Communications
of the ACM, vol. 67, no. 1, pp. 110–120, 2023.
the development of this field. [14] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, and D. Yang, “Is
ChatGPT a general-purpose natural language processing task solver?”
ACKNOWLEDGMENT arXiv preprint arXiv:2302.06476, 2023.
[15] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan,
The authors would like to thank the anonymous reviewers H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large
for their insightful comments. This work is supported par- language models trained on code,” arXiv preprint arXiv:2107.03374,
2021.
tially by the National Natural Science Foundation of China [16] Q. Zhang, C. Fang, Y. Xie, Y. Zhang, Y. Yang, W. Sun, S. Yu, and
(61932012, 62372228), and the National Research Foun- Z. Chen, “A survey on large language models for software engineering,”
dation, Singapore, and the Cyber Security Agency under CoRR, vol. abs/2312.15223, no. 1, pp. 1–57, 2023.
[17] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo,
its National Cybersecurity R&D Programme (NCRP25-P04- and J. M. Zhang, “Large language models for software engineering:
TAICeN). Any opinions, findings and conclusions or recom- Survey and open problems,” arXiv preprint arXiv:2310.03533, 2023.
mendations expressed in this material are those of the author(s) [18] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo,
J. Grundy, and H. Wang, “Large language models for software engineer-
and do not reflect the views of National Research Foundation, ing: A systematic literature review,” arXiv preprint arXiv:2308.10620,
Singapore and Cyber Security Agency of Singapore. 2023.
REFERENCES

[1] S. N. Woodfield, H. E. Dunsmore, and V. Y. Shen, “The effect of modularization and comments on program comprehension,” in Proceedings of the 5th International Conference on Software Engineering. San Diego, California, USA: IEEE Computer Society, March 9-12 1981, pp. 215–223.
[2] S. C. B. de Souza, N. Anquetil, and K. M. de Oliveira, “A study of the documentation essential to software maintenance,” in Proceedings of the 23rd Annual International Conference on Design of Communication: Documenting & Designing for Pervasive Information. Coventry, UK: ACM, September 21-23 2005, pp. 68–75.
[3] J. Zhai, X. Xu, Y. Shi, G. Tao, M. Pan, S. Ma, L. Xu, W. Zhang, L. Tan, and X. Zhang, “CPC: Automatically classifying and propagating natural language comments via program analysis,” in Proceedings of the 42nd International Conference on Software Engineering. Seoul, South Korea: ACM, 27 June - 19 July 2020, pp. 1359–1371.
[4] W. Sun, C. Fang, Y. Chen, Q. Zhang, G. Tao, Y. You, T. Han, Y. Ge, Y. Hu, B. Luo, and Z. Chen, “An extractive-and-abstractive framework for source code summarization,” ACM Transactions on Software Engineering and Methodology, vol. Just Accepted, no. 1, pp. 1–39, 2023.
[5] B. Fluri, M. Würsch, and H. C. Gall, “Do code and comments co-evolve? On the relation between source code and comment changes,” in Proceedings of the 14th Working Conference on Reverse Engineering. Vancouver, BC, Canada: IEEE Computer Society, 28-31 October 2007, pp. 70–79.
[6] M. L. Vásquez, B. Li, C. Vendome, and D. Poshyvanyk, “How do developers document database usages in source code? (N),” in Proceedings of the 30th International Conference on Automated Software Engineering. Lincoln, NE, USA: IEEE Computer Society, November 9-13 2015, pp. 36–41.
[7] F. Wen, C. Nagy, G. Bavota, and M. Lanza, “A large-scale empirical study on code-comment inconsistencies,” in Proceedings of the 27th International Conference on Program Comprehension. Montreal, QC, Canada: IEEE / ACM, May 25-31 2019, pp. 53–64.
[8] X. Hu, X. Xia, D. Lo, Z. Wan, Q. Chen, and T. Zimmermann, “Practitioners’ expectations on automated code comment generation,” in Proceedings of the 44th International Conference on Software Engineering. Pittsburgh, PA, USA: ACM, May 25-27 2022, pp. 1693–1705.
[9] E. Shi, Y. Wang, L. Du, J. Chen, S. Han, H. Zhang, D. Zhang, and H. Sun, “On the evaluation of neural code summarization,” in Proceedings of the 44th International Conference on Software Engineering. Pittsburgh, PA, USA: IEEE, May 21-29 2022, pp. 1597–1608.
[10] A. Mastropaolo, M. Ciniselli, M. Di Penta, and G. Bavota, “Evaluating code summarization techniques: A new metric and an empirical characterization,” arXiv e-prints, pp. arXiv–2312, 2023.
[11] W. Sun, C. Fang, Y. You, Y. Chen, Y. Liu, C. Wang, J. Zhang, Q. Zhang, H. Qian, W. Zhao et al., “A prompt learning framework for source code summarization,” arXiv preprint arXiv:2312.16066, 2023.
[13] M. Du, F. He, N. Zou, D. Tao, and X. Hu, “Shortcut learning of large language models in natural language understanding,” Communications of the ACM, vol. 67, no. 1, pp. 110–120, 2023.
[14] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, and D. Yang, “Is ChatGPT a general-purpose natural language processing task solver?” arXiv preprint arXiv:2302.06476, 2023.
[15] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
[16] Q. Zhang, C. Fang, Y. Xie, Y. Zhang, Y. Yang, W. Sun, S. Yu, and Z. Chen, “A survey on large language models for software engineering,” CoRR, vol. abs/2312.15223, no. 1, pp. 1–57, 2023.
[17] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, and J. M. Zhang, “Large language models for software engineering: Survey and open problems,” arXiv preprint arXiv:2310.03533, 2023.
[18] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large language models for software engineering: A systematic literature review,” arXiv preprint arXiv:2308.10620, 2023.
[19] X. Du, M. Liu, K. Wang, H. Wang, J. Liu, Y. Chen, J. Feng, C. Sha, X. Peng, and Y. Lou, “Evaluating large language models in class-level code generation,” in Proceedings of the 46th International Conference on Software Engineering. Lisbon, Portugal: ACM, April 14-20 2024, pp. 81:1–81:13.
[20] C. Wang, J. Zhang, Y. Feng, T. Li, W. Sun, Y. Liu, and X. Peng, “Teaching code llms to use autocompletion tools in repository-level code generation,” CoRR, vol. abs/2401.06391, no. 1, pp. 1–13, 2024.
[21] Q. Zhang, T. Zhang, J. Zhai, C. Fang, B. Yu, W. Sun, and Z. Chen, “A critical review of large language model on software engineering: An example from chatgpt and automated program repair,” CoRR, vol. abs/2310.08879, no. 1, pp. 1–12, 2023.
[22] Q. Zhang, C. Fang, Y. Xie, Y. Ma, W. Sun, Y. Yang, and Z. Chen, “A systematic literature review on large language models for automated program repair,” CoRR, vol. abs/2405.01466, no. 1, pp. 1–39, 2024.
[23] X. Zhou, T. Zhang, and D. Lo, “Large language model for vulnerability detection: Emerging results and future directions,” in Proceedings of the 44th International Conference on Software Engineering: New Ideas and Emerging Results. Lisbon, Portugal: ACM, April 14-20 2024, pp. 47–51.
[24] J. Zhang, C. Wang, A. Li, W. Sun, C. Zhang, W. Ma, and Y. Liu, “An empirical study of automated vulnerability localization with large language models,” CoRR, vol. abs/2404.00287, no. 1, 2024.
[25] T. Ahmed and P. T. Devanbu, “Few-shot training llms for project-specific code-summarization,” in Proceedings of the 37th International Conference on Automated Software Engineering. Rochester, MI, USA: ACM, October 10-14 2022, pp. 177:1–177:5.
[26] C. Wang, Y. Yang, C. Gao, Y. Peng, H. Zhang, and M. R. Lyu, “No more fine-tuning? An experimental evaluation of prompt tuning in code intelligence,” in Proceedings of the 30th Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. Singapore, Singapore: ACM, November 14-18 2022, pp. 382–394.
[27] W. Sun, C. Fang, Y. You, Y. Miao, Y. Liu, Y. Li, G. Deng, S. Huang, Y. Chen, Q. Zhang, H. Qian, Y. Liu, and Z. Chen, “Automatic code summarization via chatgpt: How far are we?” CoRR, vol. abs/2305.12865, pp. 1–13, 2023.
[28] M. Geng, S. Wang, D. Dong, H. Wang, G. Li, Z. Jin, X. Mao, and X. Liao, “Large language models are few-shot summarizers: Multi-intent comment generation via in-context learning,” in Proceedings of the 46th International Conference on Software Engineering. Lisbon, Portugal: ACM, April 14-20 2024, pp. 39:1–39:13.
[29] S. Gao, W. Mao, C. Gao, L. Li, X. Hu, X. Xia, and M. R. Lyu, “Learning in the wild: Towards leveraging unlabeled data for effectively tuning pre-trained code models,” in Proceedings of the 46th International Conference on Software Engineering. Lisbon, Portugal: ACM, April 14-20 2024, pp. 1–13.
[30] S. Gao, X. Wen, C. Gao, W. Wang, H. Zhang, and M. R. Lyu, “What makes good in-context demonstrations for code intelligence tasks with llms?” in Proceedings of the 38th International Conference on Automated Software Engineering. Luxembourg: IEEE, September 11-15 2023, pp. 761–773.
[31] H. Wu, H. Zhao, and M. Zhang, “Code summarization with structure-induced transformer,” in Findings of the 59th Annual Meeting of the Association for Computational Linguistics. Online Event: Association for Computational Linguistics, August 1-6 2021, pp. 1078–1090.
[32] X. Hu, G. Li, X. Xia, D. Lo, S. Lu, and Z. Jin, “Summarizing source code with transferred API knowledge,” in Proceedings of the 27th International Joint Conference on Artificial Intelligence. Stockholm, Sweden: ijcai.org, July 13-19 2018, pp. 2269–2275.
[33] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, PA, USA: ACL, July 6-12 2002, pp. 311–318.
[34] S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, Michigan, USA: Association for Computational Linguistics, June 29 2005, pp. 65–72.
[35] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics – Workshop on Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, July 21-26 2004, pp. 74–81.
[36] S. Haque, Z. Eberhart, A. Bansal, and C. McMillan, “Semantic similarity metrics for evaluating source code summarization,” in Proceedings of the 30th International Conference on Program Comprehension. Virtual Event: ACM, May 16-17 2022, pp. 36–47.
[37] J. Wang, Y. Liang, F. Meng, H. Shi, Z. Li, J. Xu, J. Qu, and J. Zhou, “Is chatgpt a good NLG evaluator? A preliminary study,” CoRR, vol. abs/2303.04048, no. 1, pp. 1–11, 2023.
[38] I. Vykopal, M. Pikuliak, I. Srba, R. Moro, D. Macko, and M. Bielikova, “Disinformation capabilities of large language models,” arXiv preprint arXiv:2311.08838, 2023.
[39] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, “G-eval: NLG evaluation using gpt-4 with better human alignment,” in Proceedings of the 28th Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, December 6-10 2023, pp. 2511–2522.
[40] W. Sun, Y. Miao et al., “Artifacts of this study,” site: https://github.com/wssun/LLM4CodeSummarization, 2024, accessed: 2024-07-09.
[41] X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin, “Deep code comment generation with hybrid lexical and syntactical information,” Empirical Software Engineering, vol. 25, no. 3, pp. 2179–2217, 2020.
[42] S. Haiduc, J. Aponte, and A. Marcus, “Supporting program comprehension with source code summarization,” in Proceedings of the 32nd International Conference on Software Engineering. Cape Town, South Africa: ACM, 1-8 May 2010, pp. 223–226.
[43] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proceedings of the 3rd International Conference on Learning Representations. San Diego, CA, USA: OpenReview.net, May 7-9 2015, pp. 1–15.
[44] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” in Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar: Association for Computational Linguistics, 25 October 2014, pp. 103–111.
[45] W. U. Ahmad, S. Chakraborty, B. Ray, and K. Chang, “A transformer-based approach for source code summarization,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, July 5-10 2020, pp. 4998–5007.
[46] D. Gros, H. Sezhiyan, P. Devanbu, and Z. Yu, “Code to comment ”translation”: Data, metrics, baselining & evaluation,” in Proceedings of the 35th International Conference on Automated Software Engineering. Melbourne, Australia: IEEE, September 21-25 2020, pp. 746–757.
[47] J. Zhang, X. Wang, H. Zhang, H. Sun, and X. Liu, “Retrieval-based neural source code summarization,” in Proceedings of the 42nd International Conference on Software Engineering. Seoul, South Korea: ACM, 27 June - 19 July 2020, pp. 1385–1397.
[48] D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, S. Yih, L. Zettlemoyer, and M. Lewis, “Incoder: A generative model for code infilling and synthesis,” in Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: OpenReview.net, May 1-5 2023, pp. 1–14.
[49] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. B. Clement, D. Drain, D. Jiang, D. Tang, G. Li, L. Zhou, L. Shou, L. Zhou, M. Tufano, M. Gong, M. Zhou, N. Duan, N. Sundaresan, S. K. Deng, S. Fu, and S. Liu, “Codexglue: A machine learning benchmark dataset for code understanding and generation,” in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, virtual, December 2021, pp. 1–14.
[50] C. Su and C. McMillan, “Distilled GPT for source code summarization,” Automated Software Engineering, vol. 31, no. 1, p. 22, 2024.
[51] T. Ahmed, K. S. Pai, P. Devanbu, and E. T. Barr, “Automatic semantic augmentation of language model prompts (for code summarization),” in Proceedings of the 46th International Conference on Software Engineering. Lisbon, Portugal: ACM, April 14-20 2024, pp. 1–13.
[52] S. A. Rukmono, L. Ochoa, and M. R. Chaudron, “Achieving high-level software component summarization via hierarchical chain-of-thought prompting and static code analysis,” in Proceedings of the 2023 International Conference on Data and Software Engineering. Toba, Indonesia: IEEE, September 07-08 2023, pp. 7–12.
[53] Y. Choi and J. Lee, “Codeprompt: Task-agnostic prefix tuning for program and language generation,” in Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada: Association for Computational Linguistics, July 9-14 2023, pp. 5282–5297.
[54] Q. Chen, X. Xia, H. Hu, D. Lo, and S. Li, “Why my code summarization model does not work: Code comment improvement with category prediction,” ACM Transactions on Software Engineering and Methodology, vol. 30, no. 2, pp. 25:1–25:29, 2021.
[55] F. Mu, X. Chen, L. Shi, S. Wang, and Q. Wang, “Developer-intent driven code comment generation,” in Proceedings of the 45th International Conference on Software Engineering. Melbourne, Australia: IEEE, May 14-20 2023, pp. 768–780.
[56] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin et al., “Code llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950, 2023.
[57] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[58] L. Tunstall, N. Lambert, N. Rajani, E. Beeching, T. Le Scao, L. von Werra, S. Han, P. Schmid, and A. Rush, “Creating a coding assistant with starcoder,” Hugging Face Blog, 2023, https://huggingface.co/blog/starchat.
[59] Bigcode, “Starcoderplus,” Hugging Face Blog, 2023, https://huggingface.co/bigcode/starcoderplus.
[60] OpenAI, “OpenAI API,” site: https://platform.openai.com/docs/models, 2015, accessed: 2024-03-15.
[61] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Proceedings of the 36th Annual Conference on Neural Information Processing Systems. New Orleans, LA, USA: Curran Associates Inc., November 28 - December 9 2022, pp. 24824–24837.
[62] Y. Wang, Z. Zhang, and R. Wang, “Element-aware summarization with large language models: Expert-aligned evaluation and chain-of-thought method,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada: Association for Computational Linguistics, July 9-14 2023, pp. 8640–8665.
[63] G. Kim, P. Baldi, and S. McAleer, “Language models can solve computer tasks,” in Proceedings of the 37th Annual Conference on Neural Information Processing Systems, vol. 36. New Orleans, LA, USA: Curran Associates, Inc., December 10-16 2023, pp. 39648–39677.
[64] B. Xu, A. Yang, J. Lin, Q. Wang, C. Zhou, Y. Zhang, and Z. Mao, “Expertprompting: Instructing large language models to be distinguished experts,” CoRR, vol. abs/2305.14688, no. 1, pp. 1–6, 2023.
[65] CodeLlama, “Application of codellama,” site: https://huggingface.co/spaces/codellama/codellama-13b-chat/blob/main/app.py, 2023, accessed: 2024-03-15.
[66] H. Husain, H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, “Codesearchnet challenge: Evaluating the state of semantic code search,” CoRR, vol. abs/1909.09436, 2019.
[67] D. Guo, S. Lu, N. Duan, Y. Wang, M. Zhou, and J. Yin, “UniXcoder: Unified cross-modal pre-training for code representation,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland: Association for Computational Linguistics, May 22-27 2022, pp. 7212–7225.
[68] D. Wang, B. Chen, S. Li, W. Luo, S. Peng, W. Dong, and X. Liao, “One adapter for all programming languages? Adapter tuning for code search and summarization,” in Proceedings of the 45th International Conference on Software Engineering. Melbourne, Australia: IEEE, May 14-20 2023, pp. 5–16.
[69] S. Liu, Y. Chen, X. Xie, J. K. Siow, and Y. Liu, “Retrieval-augmented generation for code summarization via hybrid GNN,” in Proceedings of the 9th International Conference on Learning Representations. Virtual Event, Austria: OpenReview.net, May 3-7 2021, pp. 1–13.
[70] M. Allamanis, “The adverse effects of code duplication in machine learning models of code,” in Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, 2019, pp. 143–153.
[71] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating text generation with BERT,” in Proceedings of the 8th International Conference on Learning Representations. Addis Ababa, Ethiopia: OpenReview.net, April 26-30 2020, pp. 1–14.
[72] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar et al., “Universal sentence encoder,” arXiv preprint arXiv:1803.11175, 2018.
[73] D. Roy, S. Fakhoury, and V. Arnaoudova, “Reassessing automatic evaluation metrics for code summarization tasks,” in Proceedings of the 29th Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. Athens, Greece: ACM, August 23-28 2021, pp. 1105–1116.
[74] Y. Zhang, Y. Liu, X. Fan, and Y. Lu, “RetCom: Information retrieval-enhanced automatic source-code summarization,” in Proceedings of the 22nd International Conference on Software Quality, Reliability and Security. Guangzhou, China: IEEE, December 5-9 2022, pp. 948–957.
[75] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, MN, USA: Association for Computational Linguistics, June 2-7 2019, pp. 4171–4186.
[76] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using siamese bert-networks,” in Proceedings of the 9th International Joint Conference on Natural Language Processing. Hong Kong, China: Association for Computational Linguistics, November 3-7 2019, pp. 3980–3990.
[77] W. J. Conover, Practical nonparametric statistics. John Wiley & Sons, 1999, vol. 350.
[78] C. P. Dancey and J. Reidy, Statistics without maths for psychology. Pearson Education, 2007.
[79] OpenAI, “Create chat completion,” site: https://platform.openai.com/docs/api-reference/chat/create, 2024, accessed: 2024-03-15.
[80] L. Tunstall, N. Lambert, N. Rajani, E. Beeching, T. Le Scao, L. von Werra, S. Han, P. Schmid, and A. Rush, “Starchat-beta,” site: https://huggingface.co/HuggingFaceH4/starchat-beta, 2023, accessed: 2024-03-15.
[81] pcuenq, “Usage of codellama,” site: https://huggingface.co/spaces/codellama/codellama-13b-chat/blob/main/app.py, 2023, accessed: 2024-03-15.
[82] OpenAI, “Get up and running with the openai api,” site: https://platform.openai.com/docs/quickstart?context=python, 2024, accessed: 2024-03-15.