Automating Code Review
Abstract—Code reviews are popular in both industrial and open source projects. The benefits of code reviews are widely recognized and include better code quality and a lower likelihood of introducing bugs. However, code review comes at the cost of spending developers' time on reviewing their teammates' code. The goal of this research is to investigate the possibility of using Deep Learning (DL) to automate specific code review tasks. We started by training vanilla Transformer models to learn code changes performed by developers during real code review activities. This gives the models the ability to automatically (i) revise the code submitted for review without any input from the reviewer; and (ii) implement the changes required to address a specific reviewer's comment. While the preliminary results were encouraging, in this first work we tested the DL models in rather simple code review scenarios, substantially simplifying the targeted problem. This was also due to the choices we made when designing both the technique and the experiments. Thus, in a subsequent work, we exploited a pre-trained Text-To-Text-Transfer-Transformer (T5) to overcome some of these limitations and to experiment with DL models for code review automation in more realistic and challenging scenarios. The achieved results show the improvements brought by T5 both in terms of applicability (i.e., the scenarios in which it can be applied) and performance. Despite this, we are still far from performance levels that would make these techniques deployable in practice, thus calling for additional research in this area, as we discuss in our future work agenda.

Index Terms—Code Review, Deep Learning

I. INTRODUCTION

Code review is the process of analyzing code written by a teammate to judge whether it is of sufficient quality. Recent studies provided evidence that reviewed code has lower chances of being buggy [1]–[3] and exhibits higher internal quality [3]. Given these benefits, code reviews are widely adopted in both industrial and open source projects.

The benefits brought by code reviews do not come for free. Indeed, code reviews add expenses on top of the standard development costs, due to the allocation of one or more reviewers who have the responsibility of verifying the correctness, quality, and soundness of the newly developed code. Bosu and Carver report that developers spend, on average, more than six hours per week reviewing code [4].

For this reason, researchers started proposing techniques aimed at automating specific code review tasks. Several works targeted the recommendation of proper reviewers for a given change (e.g., [5]–[11]), while others focused on classifying the contributions to review into different categories [12], [13] (again to ease the identification of proper reviewers).

When I started my PhD in February 2020, little effort had been devoted to the automation of the most complex code review tasks, namely those dealing with the review of the code itself (i.e., identifying problems in the submitted code and implementing the changes needed to address them). Indeed, only a few works had started investigating the possibility of learning code change patterns from software repositories [14], [15], which might be used to improve the quality of the code submitted for review (e.g., by learning code changes often applied by developers to fix a specific quality issue).

The goal of my PhD is the automation of the above-mentioned non-trivial tasks. In particular, we target three specific tasks, focusing on both the contributor and the reviewer sides of the review process. First, we defined two tasks to learn code changes performed by developers during real code review activities. The first one, code-to-code (Tc2c), on the contributor side, aims at providing contributors with a revised version of their code implementing the code transformations usually recommended during code review, before the code is even submitted for review. The second task, code&comment-to-code (Tc&nl2c), on the reviewer side, provides the reviewer commenting on a submitted code with the revised code implementing their comment expressed in natural language. Successively, we defined a third task, code-to-comment (Tc2nl), still on the contributor side, which takes as input the code submitted for review and requests code changes to the contributor as a reviewer would do, by commenting on the code in natural language.

The overall idea is not to replace developers during code review, but to design techniques that can work in tandem with them by spotting and/or fixing code quality issues that are typical targets of a code review. A complete automation, besides not being realistic, would also dismiss one of the benefits of code review: knowledge sharing among developers [16].

We started our investigation by training two Transformer models to automate Tc2c and Tc&nl2c, respectively. The results of this study have been published in the following paper [17]:

Towards Automating Code Review Activities. Rosalia Tufano, Luca Pascarella, Michele Tufano, Denys Poshyvanyk, Gabriele Bavota. In Proceedings of the 43rd International Conference on Software Engineering (ICSE 2021), pp. 163-174.

While the results achieved in this work were promising, our approach also had substantial limitations.
III. USING PRE-TRAINED MODELS TO BOOST CODE REVIEW AUTOMATION

To partially address the limitations of our preliminary work, we experimented with DL models for code review automation in more realistic and challenging scenarios. We started by training a Text-To-Text-Transfer-Transformer (T5) model [20] on a bigger version of the dataset used in the previous work (we mined new open source projects on GitHub, increasing the size of the initial dataset). To avoid the abstraction process adopted in the first stage of this research, we adopted the SentencePiece subword-based tokenizer [21], which allows working with raw source code while keeping the size of the vocabulary under control. Also, we increased the maximum length of the considered code components from 100 "abstracted" tokens to 512 "SentencePiece" tokens. The absence of an abstraction mechanism and the increased upper bound for the input/output length allowed us to build a substantially larger dataset as compared to our previous work [17] (140k instances vs 17k) and, more importantly, to feature in such a dataset a wider variety of code transformations implemented in the code review process, including quite challenging instances such as those requiring the introduction of new identifiers and literals.
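As an illustration, the following sketch shows how a SentencePiece model could be trained on raw Java methods and used to check the 512-token budget; the corpus file name, vocabulary size, and model type are illustrative assumptions, not our exact configuration.

import sentencepiece as spm

# Train a subword model directly on raw (non-abstracted) Java methods,
# one method per line; file names and vocabulary size are assumptions.
spm.SentencePieceTrainer.train(
    input="java_methods.txt",
    model_prefix="code_review_sp",
    vocab_size=32000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="code_review_sp.model")
method = "public ConfigBuilder readFrom(View<?> view) { return this; }"
tokens = sp.encode(method, out_type=str)

# Instances exceeding the maximum input/output length cannot be used.
MAX_LEN = 512
print(len(tokens), len(tokens) <= MAX_LEN)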
The possibility of considering such a variety of code changes is also due to the learning abilities of the T5 model [20]. T5 is subjected to a first training (pre-training) whose purpose is to provide it with general knowledge useful to solve a set of related tasks. Suppose, for example, that we want to train a model able to translate English to German. Instead of starting by training the model for this task, T5 can be pre-trained in an unsupervised manner by using the denoising objective (or masked language modeling): the model is fed with sentences having 15% of their tokens (e.g., words in English sentences) randomly masked, and it is asked to predict them. By learning how to predict the masked tokens, the model acquires knowledge about the language of interest. In our example, we could pre-train the model on English and German sentences. Once pre-trained, T5 is fine-tuned on the downstream tasks in a supervised fashion. Each task is formulated in a "text-to-text" format (i.e., both the input and the output of the model are represented as text). For example, for the translation task, a dataset composed of pairs of English and German sentences can be used to fine-tune the model.
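To make the objective more concrete, the following toy example shows the denoising format adopted by T5, in which masked spans are replaced by sentinel tokens; the sentence and the masked spans are invented for illustration.

# The model receives the corrupted input and must generate the target,
# i.e., the content hidden behind each sentinel token.
original = "private int counter = 0 ; // counts processed review comments"

corrupted_input = "private int <extra_id_0> = 0 ; // counts processed <extra_id_1> comments"

target = "<extra_id_0> counter <extra_id_1> review <extra_id_2>"

# Roughly 15% of the tokens are masked; no manual labeling is needed,
# so any corpus of code and technical English can be used for pre-training.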
In our research, we pre-train T5 on Java source code and "technical English" (e.g., the natural language text used to document source code). In particular, we built a pre-training dataset consisting of nearly 1.5M instances, starting from two public datasets featuring instances that include both source code and technical English: the official Stack Overflow dump [22] and CodeSearchNet [23]. Then, we fine-tune T5 on the three tasks defined in Section I: Tc2c, Tc&nl2c and Tc2nl.
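As a sketch, the three tasks can be serialized as plain input/target text pairs along the following lines; the task prefixes, the separator token, and the toy instance are illustrative assumptions and do not reproduce our exact encoding.

code_submitted = "public void save(User u) { db.store(u); }"
code_revised   = "public void save(final User u) { db.store(u); }"
reviewer_note  = "make the parameter final"

examples = [
    # code-to-code: submitted code -> revised code
    {"input": "code2code: " + code_submitted, "target": code_revised},
    # code&comment-to-code: submitted code + reviewer comment -> revised code
    {"input": "code&comment2code: " + code_submitted + " <sep> " + reviewer_note,
     "target": code_revised},
    # code-to-comment: submitted code -> reviewer-like natural language comment
    {"input": "code2comment: " + code_submitted, "target": reviewer_note},
]

for ex in examples:
    print(ex["input"], "->", ex["target"])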
We started by evaluating T5 on the same (simpler) dataset used in our ICSE'21 paper, to compare it with the encoder-decoder model presented in Section II. The results showed the superiority of T5: for example, in Tc&nl2c the encoder-decoder model achieves 10% of correct predictions, while T5 reaches 30%.

Moving to the new (more complex) dataset, T5 achieves the following results. In the case of Tc2c, when a single prediction is proposed by T5, it achieves 5% of correct predictions. Such a result should be considered in the context of what we obtained with the encoder-decoder model which, on a much simpler test dataset, achieved 3% of correct predictions for the same task. Similar observations can be made for Tc&nl2c, where T5 generates 14% of correct predictions; for this task, the encoder-decoder model on the simpler dataset achieves 12% of correct predictions. Moving to Tc2nl, T5 struggles in formulating natural language comments identical to the ones written by reviewers, with a 2% success rate.

It is worth noting that the reported results represent a lower bound for the performance of our approach. Indeed, we consider a prediction as "correct" only if it is identical to the reference one. For example, in the case of Tc2nl, the natural language comment generated by T5 is classified as correct only if it is equal to the reference one, including punctuation. However, it is possible that a natural language comment generated by T5 is different from, but semantically equivalent to, the one written by the developer (e.g., "variable v should be private" vs "change v visibility to private"). Similar observations hold for the two code-generation tasks (e.g., a reviewer's comment could be addressed in different but semantically equivalent ways). To have an idea of the number of valuable predictions among those classified as "wrong" (i.e., the non-correct predictions), we manually analyzed a sample of 100 "wrong" predictions for each task. Overall, our analysis showed that the correct predictions really represent a lower bound for the performance of T5, especially for the two tasks in which natural language comments are involved. For example, for Tc2nl we found out that 36% of the "wrong" predictions are actually semantically equivalent natural language comments produced by the model. For further details we point the interested reader to our paper [18].
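For clarity, the "correct prediction" criterion mentioned above boils down to exact matching, along the lines of the following sketch (the two toy predictions are invented for illustration).

def perfect_prediction_rate(predictions, references):
    # A prediction counts as "correct" only if it is identical to the reference.
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["change v visibility to private", "rename ann to rules"]
refs  = ["variable v should be private",   "rename ann to rules"]

# 0.5: the first prediction, although semantically equivalent to the
# reference, is counted as wrong under exact matching.
print(perfect_prediction_rate(preds, refs))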
To provide a better idea of the capabilities of the model, the top part of Fig. 1 shows one example of a correct prediction generated by the model for each task. For Tc2c (code-to-code), the first code represents the input of the model, while the second is its output. We highlighted in bold the parts of code changed by the model and replaced irrelevant parts of the methods with [...] to save space. For Tc&nl2c (code&comment-to-code), the input provided to the model includes the comment written by the reviewer, requiring a specific change to the part of code highlighted in orange. Finally, for Tc2nl (code-to-comment), we report the code provided as input to the model (first line) with the comment it generated as output (second line). The bottom of Fig. 1 (black background) shows instead an example of a "wrong" but valid prediction for Tc2nl, with T5 asking the developer to implement the same change suggested by the reviewer (the longer comment is the one generated by T5).

While this work represents a significant step forward in automating code review, the achieved performance is still quite far from levels which could be considered valuable by developers. We discuss in Section IV our plans to further boost the automation of code review.
Fig. 1. Examples of predictions generated by T5.

Correct predictions:

code-to-code
Input: public ConfigBuilder readFrom(View<?> view) { if (view instanceof Dataset && view instanceof FileSystemDataset) { FileSystemDataset dataset = (FileSystemDataset) view; [...] }
Output: public ConfigBuilder readFrom(View<?> view) { if (view instanceof FileSystemDataset) { FileSystemDataset dataset = (FileSystemDataset) view; [...] }

code-to-comment
Input: public List<[...]> getExecuteBefore() { Rules ann = [Link]().getAnnotation([Link]); if(ann != null) [...] }
Generated comment: "Rename 'ann' to 'rules', 'rulesAnnotation' or something more descriptive."

"Wrong" but valid prediction (code-to-comment), comment generated by T5: "Extract the building of the ResponseMessage to it's own variable (in eclipse, select the text, right-click > refactor > extract local variable / select code + shift+alt+L). This will make the code a bit more readable, especially when you'll be passing in other things besides the ResponseMessage."
IV. FUTURE DIRECTIONS

Investigating the usage of customized pre-training objectives. In [18] (Section III) we adopted a pre-trained model for the automation of code review. The positive role played by pre-training on the achieved performance is clear in our experiments. However, we did not investigate the possible impact of using different pre-training objective(s), possibly specialized for code review automation. Indeed, the one we used (the denoising objective) is just one of the possible pre-training objectives, and recent work from the natural language processing field [24] suggests that pre-training objectives tailored for the specific downstream task of interest may boost the model's performance. We plan to propose and compare different pre-training objectives to identify the one(s) best suited for the automation of code review tasks.

Investigating the role of context on the model performance. In our studies we limited the focus of the model to a single method at a time. In other words, the submitted code given as input to the model is a single method, possibly accompanied by a reviewer's comment (depending on the task). This means that, for example, the model has no further information about other code submitted for review, the class in which the method is implemented, the past review rounds, etc. Intuitively, more context provided as input could improve the performance of the model. We plan to experiment with the model when using a wider context, for example by looking at the entire class the method belongs to, providing more information about other submitted code changes, or feeding multiple reviewers' comments at a time. Clearly, adding more context could increase the model's performance, but also the complexity of the training and of the learning. Thus, a reasonable trade-off must be targeted.

V. RELATED WORKS

Several works targeted the optimization of the reviewers' assignment [5]–[11], [13], [25]–[31]. These works exploit different features and algorithms to recommend the most suited reviewer for a given change. Chouchen et al. [13] used a binary classifier to assess the quality of the code submitted for review, leveraging quality metrics as features. Similarly, Shi et al. [32] presented a DL model taking as input the code submitted for review and the revised code implementing the changes recommended by reviewers, and providing as output the acceptance (or not) of the changes. These techniques are complementary to ours.

Our research has been inspired by works aimed at learning general change patterns from developers' activities [14], [15]. For example, Neural Machine Translation models have been used to learn how to automatically modify a given Java method as developers would do during a pull request [14].

Several recent works built on top of the research we presented in [17], [18]. Li et al. [33] and Hong et al. [34] presented techniques to improve the results we achieved on the automated generation of reviewers' comments (Tc2nl), by using pre-trained DL models [33] or by exploiting information retrieval to recommend reviewers' comments posted in the past for code snippets similar to the one to review [34]. Li et al. [35] targeted our two tasks related to the automatic implementation of a reviewer's comment (Tc&nl2c) and to the generation of reviewers' comments for a given code (Tc2nl). They also estimate the quality of the submitted code to decide whether it needs a review or not. Their approach exploits a pre-trained model and can work with 9 programming languages.

VI. CONCLUSION

We presented our efforts in the automation of code review and our future plans in this area. As discussed in Section V, several researchers are targeting similar problems, thus increasing our hopes in a more and more successful automation which may then be subject to technological transfer to practitioners. I would like to conclude by summarizing my steps in the PhD: I started in February 2020 and defended my thesis proposal in December 2021. I just started the fourth (and last) year of my PhD and plan to defend my thesis in January 2024.

ACKNOWLEDGMENT

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant No. 851720).
REFERENCES

[1] S. McIntosh, Y. Kamei, B. Adams, and A. E. Hassan, "The impact of code review coverage and code review participation on software quality: A case study of the Qt, VTK, and ITK projects," in 11th IEEE/ACM Working Conference on Mining Software Repositories, MSR, pp. 192–201, 2014.
[2] R. Morales, S. McIntosh, and F. Khomh, "Do code review practices impact design quality? A case study of the Qt, VTK, and ITK projects," in 22nd IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER, pp. 171–180, 2015.
[3] G. Bavota and B. Russo, "Four eyes are better than two: On the impact of code reviews on software quality," in IEEE International Conference on Software Maintenance and Evolution, ICSME, pp. 81–90, 2015.
[4] A. Bosu and J. C. Carver, "Impact of peer code review on peer impression formation: A survey," in 7th IEEE/ACM International Symposium on Empirical Software Engineering and Measurement, ESEM, pp. 133–142, 2013.
[5] J. Jiang, D. Lo, J. Zheng, X. Xia, Y. Yang, and L. Zhang, "Who should make decision on this pull request? Analyzing time-decaying relationships and file similarities for integrator prediction," J. Syst. Softw., vol. 154, pp. 196–210, Aug. 2019.
[6] P. Thongtanunam, C. Tantithamthavorn, R. G. Kula, N. Yoshida, H. Iida, and K.-i. Matsumoto, "Who should review my code? A file location-based code-reviewer recommendation approach for modern code review," in 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), pp. 141–150, 2015.
[7] A. Ouni, R. G. Kula, and K. Inoue, "Search-based peer reviewers recommendation in modern code review," in 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 367–377, 2016.
[8] X. Xia, D. Lo, X. Wang, and X. Yang, "Who should review this change? Putting text and file location analyses together for more accurate recommendations," in 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 261–270, 2015.
[9] S. Asthana, R. Kumar, R. Bhagwan, C. Bird, C. Bansal, C. Maddila, S. Mehta, and B. Ashok, "WhoDo: Automating reviewer suggestions at scale," in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2019, (New York, NY, USA), pp. 937–945, Association for Computing Machinery, 2019.
[10] E. Mirsaeedi and P. C. Rigby, "Mitigating turnover with code review recommendation: Balancing expertise, workload, and knowledge distribution," in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE '20, (New York, NY, USA), pp. 1183–1195, Association for Computing Machinery, 2020.
[11] W. H. A. Al-Zubaidi, P. Thongtanunam, H. K. Dam, C. Tantithamthavorn, and A. Ghose, "Workload-aware reviewer recommendation using a multi-objective search-based approach," in Proceedings of the 16th ACM International Conference on Predictive Models and Data Analytics in Software Engineering, pp. 21–30, 2020.
[12] X. Ge, S. Sarkar, J. Witschey, and E. Murphy-Hill, "Refactoring-aware code review," in 2017 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 71–79, 2017.
[13] M. Chouchen, A. Ouni, M. W. Mkaouer, R. G. Kula, and K. Inoue, "WhoReview: A multi-objective search-based approach for code reviewers recommendation in modern code review," Applied Soft Computing, vol. 100, p. 106908, 2021.
[14] M. Tufano, J. Pantiuchina, C. Watson, G. Bavota, and D. Poshyvanyk, "On learning meaningful code changes via neural machine translation," in 41st IEEE/ACM International Conference on Software Engineering, ICSE, pp. 25–36, 2019.
[15] S. Shi, M. Li, D. Lo, F. Thung, and X. Huo, "Automatic code review by learning the revision of source code," in The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, pp. 4910–4917, 2019.
[16] A. Bacchelli and C. Bird, "Expectations, outcomes, and challenges of modern code review," in 35th IEEE/ACM International Conference on Software Engineering, ICSE, pp. 712–721, 2013.
[17] R. Tufano, L. Pascarella, M. Tufano, D. Poshyvanyk, and G. Bavota, "Towards automating code review activities," in 43rd IEEE/ACM International Conference on Software Engineering, ICSE, pp. 163–174, 2021.
[18] R. Tufano, S. Masiero, A. Mastropaolo, L. Pascarella, D. Poshyvanyk, and G. Bavota, "Using pre-trained models to boost code review automation," in 44th IEEE/ACM International Conference on Software Engineering, ICSE, pp. 2291–2302, 2022.
[19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in 30th Advances in Neural Information Processing Systems, NIPS, pp. 5998–6008, 2017.
[20] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," J. Mach. Learn. Res., vol. 21, pp. 140:1–140:67, 2020.
[21] T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," CoRR, 2018.
[22] "Stack exchange dumps." [Link]
[23] H. Husain, H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, "CodeSearchNet challenge: Evaluating the state of semantic code search," CoRR, vol. abs/1909.09436, 2019.
[24] J. Zhang, Y. Zhao, M. Saleh, and P. J. Liu, "PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization," in Proceedings of the 37th International Conference on Machine Learning, ICML'20, [Link], 2020.
[25] J. Jiang, Y. Yang, J. He, X. Blanc, and L. Zhang, "Who should comment on this pull request? Analyzing attributes for more accurate commenter recommendation in pull-based development," Inf. Softw. Technol., vol. 84, pp. 48–62, Apr. 2017.
[26] A. Strand, M. Gunnarson, R. Britto, and M. Usman, "Using a context-aware approach to recommend code reviewers: Findings from an industrial case study," in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP '20, (New York, NY, USA), pp. 1–10, Association for Computing Machinery, 2020.
[27] M. B. Zanjani, H. Kagdi, and C. Bird, "Automatically recommending peer reviewers in modern code review," IEEE Transactions on Software Engineering, vol. 42, no. 6, pp. 530–543, 2016.
[28] M. M. Rahman, C. K. Roy, and J. A. Collins, "CORRECT: Code reviewer recommendation in GitHub based on cross-project and technology experience," in 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C), pp. 222–231, 2016.
[29] H. Ying, L. Chen, T. Liang, and J. Wu, "EARec: Leveraging expertise and authority for pull-request reviewer recommendation in GitHub," in 2016 IEEE/ACM 3rd International Workshop on CrowdSourcing in Software Engineering (CSI-SE), pp. 29–35, 2016.
[30] Z. Xia, H. Sun, J. Jiang, X. Wang, and X. Liu, "A hybrid approach to code reviewer recommendation with collaborative filtering," in 2017 6th International Workshop on Software Mining (SoftwareMining), pp. 24–31, 2017.
[31] Y. Yu, H. Wang, G. Yin, and T. Wang, "Reviewer recommendation for pull-requests in GitHub," Inf. Softw. Technol., vol. 74, pp. 204–218, Jun. 2016.
[32] S.-T. Shi, M. Li, D. Lo, F. Thung, and X. Huo, "Automatic code review by learning the revision of source code," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4910–4917, 2019.
[33] L. Li, L. Yang, H. Jiang, J. Yan, T. Luo, Z. Hua, G. Liang, and C. Zuo, "AUGER: Automatically generating review comments with pre-training models," in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, (New York, NY, USA), pp. 1009–1021, Association for Computing Machinery, 2022.
[34] Y. Hong, C. Tantithamthavorn, P. Thongtanunam, and A. Aleti, "CommentFinder: A simpler, faster, more accurate code review comments recommendation," in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, (New York, NY, USA), pp. 507–519, Association for Computing Machinery, 2022.
[35] Z. Li, S. Lu, D. Guo, N. Duan, S. Jannu, G. Jenks, D. Majumder, J. Green, A. Svyatkovskiy, S. Fu, and N. Sundaresan, "Automating code review activities by large-scale pre-training," in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, (New York, NY, USA), pp. 1035–1047, Association for Computing Machinery, 2022.