
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

arXiv:2102.04664v2 [cs.SE] 16 Mar 2021

Shuai Lu* (Peking University), Daya Guo* (Sun Yat-sen University), Shuo Ren* (Beihang University), Junjie Huang* (Beihang University), Alexey Svyatkovskiy (Microsoft), Ambrosio Blanco (Microsoft Research Asia), Colin Clement (Microsoft), Dawn Drain (Microsoft), Daxin Jiang (Microsoft), Duyu Tang (Microsoft Research Asia), Ge Li (Peking University), Lidong Zhou (Microsoft Research Asia), Linjun Shou (Microsoft), Long Zhou (Microsoft Research Asia), Michele Tufano (Microsoft), Ming Gong (Microsoft), Ming Zhou (Microsoft Research Asia), Nan Duan (Microsoft Research Asia), Neel Sundaresan (Microsoft), Shao Kun Deng (Microsoft), Shengyu Fu (Microsoft), Shujie Liu (Microsoft Research Asia)
ABSTRACT

Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison. CodeXGLUE also features three baseline systems, including BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers to use the platform. The availability of such data and baselines can help the development and validation of new methods that can be applied to various program understanding and generation problems.¹

KEYWORDS

program understanding, machine learning, naturalness of software

1 INTRODUCTION

Evans Data Corporation² estimated that there were 23.9 million professional developers in 2019 and that the number was expected to reach 28.7 million in 2024. With the population of developers growing at such a rate, code intelligence that leverages artificial intelligence (AI) to help software developers improve the productivity of the development process is becoming increasingly important. It is commonly accepted that benchmarks have a significant impact on the growth of applied AI research. In this paper, we focus on establishing a benchmark dataset for code intelligence.

Automated program understanding and generation could increase the productivity of software developers. In fact, developers who want to find code written by others with the same intent can leverage code search systems [23, 35, 58, 85] to automatically retrieve semantically relevant code through natural language queries. Similarly, developers who are confused about what to write next can use code completion systems [4, 8, 9, 31, 62, 63, 72, 73] to automatically complete the following tokens based on the edits made to the code. Finally, when developers want to implement in Python functionality that already exists in Java, code-to-code translation systems [11, 41, 46, 54] can help translate their code from one programming language (Java) to another (Python).

In recent years, researchers have increasingly applied statistical models, including neural nets, to code intelligence tasks. Very recently, the application of pretrained models that learn from big programming language data has been inspired by the great success of pretrained models like BERT [16] and GPT [69] in natural language processing (NLP). These models, including CodeBERT [18] and IntelliCode Compose [72], have led to further improvements in code understanding and generation problems, but they lack a benchmark suite that covers a wide range of tasks. The use of ImageNet [15] for computer vision and the use of GLUE [81] for NLP have shown that a diversified benchmark dataset has a significant impact on the growth of applied AI research.

* indicates equal contribution and internship at Microsoft. Authors are listed in alphabetical order. Corresponding authors are Duyu Tang and Shujie Liu.
¹ CodeXGLUE is publicly available at https://github.com/microsoft/CodeXGLUE. Participants can submit their results by emailing to [email protected].
² https://evansdata.com/press/viewRelease.php?pressID=278

Table 1: A brief summary of CodeXGLUE, which includes tasks, datasets, languages, sizes in various states, and baseline systems. Highlighted datasets are newly introduced.

| Category | Task | Dataset Name | Language | Train/Dev/Test Size | Baselines |
|---|---|---|---|---|---|
| Code-Code | Clone Detection | BigCloneBench [71] | Java | 900K/416K/416K | CodeBERT |
| Code-Code | Clone Detection | POJ-104 [52] | C/C++ | 32K/8K/12K | CodeBERT |
| Code-Code | Defect Detection | Devign [99] | C | 21K/2.7K/2.7K | CodeBERT |
| Code-Code | Cloze Test | CT-all | Python, Java, PHP, JavaScript, Ruby, Go | -/-/176K | CodeBERT |
| Code-Code | Cloze Test | CT-max/min [18] | Python, Java, PHP, JavaScript, Ruby, Go | -/-/2.6K | CodeBERT |
| Code-Code | Code Completion | PY150 [62] | Python | 100K/5K/50K | CodeGPT |
| Code-Code | Code Completion | Github Java Corpus [4] | Java | 13K/7K/8K | CodeGPT |
| Code-Code | Code Repair | Bugs2Fix [75] | Java | 98K/12K/12K | Encoder-Decoder |
| Code-Code | Code Translation | CodeTrans | Java-C# | 10K/0.5K/1K | Encoder-Decoder |
| Text-Code | NL Code Search | CodeSearchNet [35], AdvTest | Python | 251K/9.6K/19K | CodeBERT |
| Text-Code | NL Code Search | CodeSearchNet [35], WebQueryTest | Python | 251K/9.6K/1K | CodeBERT |
| Text-Code | Text-to-Code Generation | CONCODE [38] | Java | 100K/2K/2K | CodeGPT |
| Code-Text | Code Summarization | CodeSearchNet [35] | Python, Java, PHP, JavaScript, Ruby, Go | 908K/45K/53K | Encoder-Decoder |
| Text-Text | Documentation Translation | Microsoft Docs | English-Latvian/Danish/Norwegian/Chinese | 156K/4K/4K | Encoder-Decoder |

To address this problem, we introduce CodeXGLUE, a machine learning benchmark dataset for program understanding and generation research that includes 14 datasets (we plan to evolve the benchmark over time by extending to more tasks), a collection of 10 diversified programming language understanding and generation tasks, and a platform for model evaluation and comparison. CodeXGLUE supports the following tasks:

• code-code (clone detection [10, 52, 71, 84, 89, 93, 97], defect detection [47, 57, 61, 82, 83, 99], cloze test [18], code completion [4, 8, 9, 31, 62, 63, 72, 73], code repair [2, 28, 30, 75, 76, 78], and code-to-code translation [11, 41, 46, 54])
• text-code (natural language code search [23, 35, 85], text-to-code generation [12, 26, 36, 38, 87, 90, 94, 95])
• code-text (code summarization [3, 12, 19, 34, 37, 80, 85-87])
• text-text (documentation translation [40])

CodeXGLUE includes eight previously proposed datasets — BigCloneBench [71], POJ-104 [52], Devign [99], PY150 [62], Github Java Corpus [4], Bugs2Fix [75], CONCODE [38], and CodeSearchNet [35] — but also newly introduced datasets that are highlighted in Table 1. The datasets are chosen or created based on the consideration that the task has a clear definition and that the volume of the dataset can support the development and evaluation of data-driven machine learning methods. The datasets created by us include (1) two cloze test sets that cover six programming languages, (2) two line-level code completion test sets in Java and Python, respectively, (3) a code-to-code translation dataset between Java and C#, (4) two natural language code search test sets with web queries and normalized function and variable names, respectively, and (5) a documentation translation dataset that covers five natural languages.

To make it easy for participants, we provide three baseline models to help perform the tasks, including a BERT-style pretrained model (in this case, CodeBERT) to support code understanding problems, a GPT-style pretrained model, which we call CodeGPT, to help solve completion and generation problems, and an Encoder-Decoder framework that tackles sequence-to-sequence generation problems.

2 TASKS OVERVIEW

In this section, we provide a definition for each task.

Clone detection [52, 71]. The task is to measure the semantic similarity between codes. This includes two subtasks: binary classification between a pair of codes and code retrieval, where the goal is to find semantically similar codes.

Defect detection [99]. The objective is to identify whether a body of source code contains defects that may be used to attack software systems, such as resource leaks, use-after-free vulnerabilities, and DoS attacks.

Cloze test [18]. This aims to predict the masked token of a code and includes two subtasks. The first one is to measure the accuracy of predicting the masked token from the whole vocabulary. The other is to test the semantic reasoning ability by distinguishing between "max" and "min".

Code completion [4, 62]. It aims to predict the following tokens based on a code context. Its subtasks are token-level completion and line-level completion. The former checks whether the next token has been predicted correctly, while the latter tests the goodness of the generated line.

Code translation [54]. It involves translating a code from one programming language to a different one.

Code search [35]. It measures the semantic relatedness between texts and codes and is composed of two subtasks. The first one is to find the most relevant code in a collection of codes according to a natural language query. The second subtask entails the analysis of a query-code pair to predict whether the code answers the query or not.

Code repair [75]. Its goal is to refine the code by fixing bugs automatically.

Text-to-code generation [38]. This aims to generate a code via a natural language description.

Code summarization [37]. The objective is to generate the natural language comment for a code.

Documentation translation [40]. It aims to translate code documentation from one natural language to a different one.

3 DATASETS

In this section, we describe the datasets included in CodeXGLUE. Datasets are chosen or created based on the criterion that the volume of the dataset could support the development and evaluation of data-driven machine learning methods.

3.1 Clone detection

Clone detection includes two subtasks. The first subtask is to predict whether two given codes have the same semantics. We use the BigCloneBench [71] dataset for this subtask. The second subtask aims to retrieve semantically similar codes given a code as the query, and we use the POJ-104 [52] dataset to perform it.

BigCloneBench is a widely used large code clone benchmark that contains over 6,000,000 true clone pairs and 260,000 false clone pairs from 10 different functionalities. The dataset provided by Wang et al. [84] is filtered by discarding code fragments without any tagged true or false clone pairs, leaving 9,134 Java code fragments. Finally, the dataset includes 901,028/415,416/415,416 examples for training, validation and testing, respectively.

The POJ-104 dataset [52] comes from a pedagogical programming open judge (OJ) system that automatically judges the validity of submitted source code for specific problems by running the code. We use the POJ-104 dataset, which consists of 104 problems and includes 500 student-written C/C++ programs for each problem. Different from the BigCloneBench dataset, the task of POJ-104 aims to retrieve other programs that solve the same problem given a program. We group the dataset into three subsets based on the number of problems they are required to solve (64/16/24) for training, validation, and testing.

3.2 Defect detection

For the task of defect detection, Zhou et al. [99] provide the Devign dataset, which includes 27,318 manually labeled functions collected from two large C open-source projects popular among developers and diversified in functionality, i.e., QEMU and FFmpeg. The dataset was created by collecting security-related commits and extracting vulnerable or non-vulnerable functions from the labeled commits. Since Zhou et al. [99] did not provide official training/validation/testing sets for the two projects, we randomly shuffle the dataset and split 80%/10%/10% of the dataset for training/validation/testing. The task is formulated as a binary classification to predict whether a function is vulnerable.

3.3 Cloze test

Figure 1 shows two examples of the cloze test (CT) task in the code domain, which aims to assess a model's ability to understand a code by asking the model to predict the masked code from several candidates. We focus on two subtasks: CT-all with candidates from a filtered vocabulary and CT-maxmin with the candidates "max" and "min".

Cloze Test-all
Doc.: Open the drop box.
Code:
    def open(self):
        self.workingArea.<mask>( )
        self.runid_pkgidx_map = {}
        self.runid_to_return = deque()
Answer: open

Cloze Test-maxmin
Doc.: Find min and max values of every feature.
Code:
    def fit(self, X, y=None):
        X = check_array(X)
        self._x_min = X.<mask>(axis=0)
        self._x_max = X.max(axis=0)
        return self
Answer: min

Figure 1: Two examples in the cloze test dataset.

We use the validation and testing sets of CodeSearchNet [35] to create the CT-all and CT-maxmin datasets for six programming languages, i.e., Go, Java, JavaScript (JS), PHP, Python and Ruby.

CT-all. To avoid introducing lengthy variable names and to sidestep the issue caused by the use of different tokenizers, we select target cloze words by retaining unique words after Byte Pair Encoding [67], and we remove meaningless tokens like punctuation with handcrafted rules. In the end, 930 tokens are selected across the six languages in total. We select codes containing the 930 tokens and manually set threshold values of token occurrence to balance the frequency of the 930 tokens in CT-all.

CT-maxmin. To further evaluate a model's ability to understand code semantics, we introduce CT-maxmin to test how well a model can distinguish the difference between max and min. CT-maxmin comes from the dataset used for the PL-Probing task in CodeBERT [18], which includes codes containing the keywords max or min.

The data statistics are listed in Table 2.
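Scoring a cloze instance amounts to asking a masked language model to rank a small candidate set. The snippet below is a minimal sketch of such an evaluation, assuming the Hugging Face transformers library and the publicly released microsoft/codebert-base-mlm checkpoint; the example instance and the way documentation and code are concatenated are illustrative assumptions, not the benchmark's official evaluation script.

```python
from transformers import pipeline

# Masked-LM scorer; microsoft/codebert-base-mlm is the MLM variant of CodeBERT.
fill_mask = pipeline("fill-mask", model="microsoft/codebert-base-mlm")

# Illustrative CT-maxmin style instance: documentation plus code with one masked token.
doc = "Find min and max values of every feature."
code = "def fit(self, X, y=None): X = check_array(X); self._x_min = X.<mask>(axis=0)"

# Restrict predictions to the candidate set ("max"/"min" here; the filtered
# 930-token vocabulary for CT-all) and take the highest-scoring candidate.
predictions = fill_mask(doc + " " + code, targets=["min", "max"])
predicted = predictions[0]["token_str"].strip()
print(predicted)  # accuracy is the fraction of instances where this equals the answer
```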

Table 2: Data statistics about the cloze test datasets.

| Task | CT-all | CT-maxmin |
|---|---|---|
| Go | 25,282 | 152 |
| Java | 40,492 | 482 |
| JavaScript | 13,837 | 272 |
| PHP | 51,930 | 407 |
| Python | 40,137 | 1,264 |
| Ruby | 4,437 | 38 |
| All | 176,115 | 2,615 |

3.4 Code completion

We use two influential datasets for code completion, PY150 in Python and Github Java Corpus in Java. Both datasets support token-level code completion. We move further by creating two test sets for the line-level code completion task from the two corpora. The task is to complete an unfinished line. Models should be capable of predicting code sequences of arbitrary token types and code structures.

PY150 is a Python dataset [62] containing 150,000 Python source files collected from GitHub. We follow the data split in Raychev et al. [62], resulting in 100,000 files for training and 50,000 files for testing, consisting of 76.3M tokens and 37.2M tokens, respectively. We preprocess the corpora by tokenizing source codes, removing comments, replacing strings longer than 15 characters with empty strings, and adding a special token ⟨EOL⟩ (end-of-line) to mark the ending of a line explicitly. For line-level code completion, we create 10,000 examples from different files in the test set of PY150 for testing. Since we intend to test a model's ability to autocomplete an arbitrary line, we select the line to be predicted at random. We generate a test case by ensuring that there is sufficient context, i.e., at least 15% of the whole file. Models are expected to generate the following line ended by ⟨EOL⟩ given the context. The average numbers of tokens in input and output are 489.11 and 6.56, respectively. Figure 2 shows an example of line-level code completion.

Figure 2: An example in the line-level code completion dataset.
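The PY150 preprocessing described above (tokenize, drop comments, truncate long string literals, insert an explicit end-of-line marker) can be approximated with Python's standard tokenize module. The sketch below is an illustration of that recipe under our reading of the description, not the exact script used to build the benchmark; the ⟨EOL⟩ spelling and the 15-character threshold follow the text.

```python
import io
import tokenize

EOL = "<EOL>"  # explicit end-of-line marker, written as ⟨EOL⟩ in the text

def preprocess_py_source(source: str) -> list[str]:
    """Tokenize Python source, drop comments, shorten long strings, add <EOL>."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
                        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER):
            # mark the end of a logical line instead of keeping raw newlines
            if tok.type in (tokenize.NEWLINE, tokenize.NL) and out and out[-1] != EOL:
                out.append(EOL)
            continue
        text = tok.string
        if tok.type == tokenize.STRING and len(text) > 15:
            text = '""'  # replace long string literals with an empty string literal
        out.append(text)
    if out and out[-1] != EOL:
        out.append(EOL)
    return out

print(preprocess_py_source("x = 'a' * 3  # comment\nprint(x)\n"))
```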
Github Java Corpus is a Java dataset mined by Allamanis and Sutton [4], and it collects over 14 thousand Java projects from GitHub. We follow the settings established by Hellendoorn and Devanbu [29] as well as Karampatsis et al. [42], using 1% of the subset in the corpus. We have 12,934/7,189/8,268 files for training/validation/testing, consisting of 15.8M/3.8M/5.3M tokens, respectively. We apply the same preprocessing conducted on PY150, but we don't add the special token ⟨EOL⟩ since in Java the symbols ; and } are used to mark the ending of a code statement. For line-level code completion, we create 3,000 examples for testing from different files in the test set of the corpus. Similarly to the process we follow for Python, the line to be predicted is selected at random from the test file. The average numbers of tokens are 350.62 and 10.49 in input and output, respectively.

3.5 Code translation

The training data for code translation consists of code pairs with equivalent functionality in two programming languages. In this paper, we provide a dataset consisting of parallel codes between Java and C#. We did not use the dataset of Lachaux et al. [46] because it does not provide data for supervised model training. Following Nguyen et al. [54] and Chen et al. [11], we use data collected from several open-source projects, i.e., Lucene (http://lucene.apache.org/), POI (http://poi.apache.org/), JGit (https://github.com/eclipse/jgit/) and Antlr (https://github.com/antlr/). We do not use Itext (http://sourceforge.net/projects/itext/) and JTS (http://sourceforge.net/projects/jts-topo-suite/) due to license problems. Those projects were originally developed in Java and then ported to C#. They are well-established systems with long development histories and with both Java and C# versions in use.

The following step is to mine paired functions or methods from those projects. According to our observation, the directory structures and function or method names of the two versions are identical or similar when they are applied to the same project. Therefore, following Nguyen et al. [54], we conservatively search for functions having the same signatures in classes with the same/similar names that are included in the same/similar directory structures of both versions. We discard duplicate code pairs and the codes having multiple targets found by the above method. After this step, we remove the pairs whose number of overlapping tokens is less than 1/3 of the sentence length. To make our data more scalable for further syntactic and semantic analysis, we also remove functions with a null function body according to their abstract syntax tree (AST). Then we build the data-flow graph [25] for each function, which represents the dependency between two variables and provides valuable semantic information for code understanding. Finally, a function with no data flow extracted from its AST is also discarded.

At last, the total number of paired functions or methods is 11,800. We randomly select 500 pairs of functions for the development set and another 1,000 pairs for the test set. The average lengths of the Java and C# functions after tokenization (with https://github.com/c2nes/javalang) are 38.51 and 46.16, respectively. An example of the mined translation pairs from C# to Java is shown in Figure 3.

3.6 Code search

Code search includes two subtasks. The first one is to find the most relevant code from a collection of candidates given a natural language query. We create a challenging testing set, called CodeSearchNet AdvTest, from the CodeSearchNet corpus [35] for performing this task. An example of this dataset is shown in Figure 4. The second subtask is to predict whether a code answers a given query. We provide a testing set WebQueryTest of real user queries. Two examples of the dataset are illustrated in Figure 5.

Figure 3: An example in the code translation dataset.

CodeSearchNet AdvTest is a Python dataset from the CodeSearchNet [35] corpus. Each example includes a function paired with a document. We follow Husain et al. [35] and take the first paragraph of the documentation as the query for the corresponding function. To improve the quality of the dataset, we filter it by removing the following examples.

(1) Examples whose code could not be parsed into an abstract syntax tree.
(2) Examples whose document contains fewer than 3 or more than 256 tokens.
(3) Examples whose document contains special tokens such as "http://".
(4) Examples whose document is empty or not written in English.

At the end of the process, we obtain a dataset with 251,820/9,604/19,210 examples for training/validation/testing. After normalizing function or variable names with special tokens, we observe that the Mean Reciprocal Rank (MRR) scores of RoBERTa [50] and CodeBERT [18] for the code search task on the CodeSearchNet [35] dataset drop from 0.809 to 0.419 and from 0.869 to 0.507, respectively, for the Python programming language. To better test the understanding and generalization abilities of the model, we normalize function and variable names in the testing and development sets, using func for the function name and arg_i for the i-th variable name. Figure 4 shows an example in the CodeSearchNet AdvTest dataset. The task aims to search source codes from candidates for a natural language query. In contrast to the testing phase of previous works [18, 35], which only involved 1,000 candidates, we use the entire testing set for each query, which makes the CodeSearchNet AdvTest dataset more difficult. The training set for this task comes from the filtered CodeSearchNet dataset [35].

Query:
Scans through a string for substrings matched some patterns.
Gold Code:
    def matchall(text, patterns):
        ret = []
        for pattern in patterns:
            match = re.findall(pattern, text)
            ret += match
        return ret
Normalized Code:
    def func(arg0, arg1):
        arg2 = []
        for arg3 in arg1:
            arg4 = re.findall(arg3, arg0)
            arg2 += arg4
        return arg2

Figure 4: An example in the CodeSearchNet AdvTest dataset.

WebQueryTest. Most code search datasets use code documentation or questions from online communities for software developers as queries, but these are different from real user search queries. To fix this discrepancy, we provide WebQueryTest, a testing set of real code search queries for Python. The problem is formulated as a binary classification task and as a complementary setting to the retrieval scenario. Given a pair of query and code function, a model needs to classify whether the code function can answer the query or not.

The data creation process can be divided into two stages: data collection and annotation. We first collect real user queries from the web query logs of a commercial search engine and keep the queries containing "python". Inspired by Yan et al. [91], we design some heuristics based on exact keyword matching to filter out queries without code search intent. Then we select candidate codes for each query from the Python validation and testing sets in CodeSearchNet. To shrink the candidates to be annotated for each query, we select the top two functions with the highest query-code similarity computed by a CodeBERT-based code retrieval model, which is trained on 148K automatically mined Python Stack Overflow Question-Code pairs (StaQC) [92] with the default parameters provided by Feng et al. [18].

We use a two-stage annotation schema to label each instance. The first step is to judge whether the query has a code-search intent. Instances labeled as "-1" are those without code search intent. The second step is to assess whether the code (with its documentation) can answer the query. Instances labeled as "1" are those where the code can answer the query. Otherwise, they are labeled as "0". Two examples are illustrated in Figure 5. We invite 13 developers proficient in Python to label 1,300 instances, with each annotator dealing with 100 of them. Discussions are allowed during annotation. Finally, the numbers of instances with labels -1, 0 and 1 are 254, 642 and 422, respectively. Since we are more interested in query-code matching, we include only the categories 0 and 1 in our final test set. The training and validation sets we use for this task are from the original CodeSearchNet dataset [35].

Query: python measure distance between 2 points
Code:
    def vector_distance(a, b):
        """ Euclidean distance between two vectors """
        a = np.array(a)
        b = np.array(b)
        return np.linalg.norm(a - b)
Label: 1

Query: how to append object in a specific index in list python
Code:
    def append(self, item):
        """ append item and print it to stdout """
        print(item)
        super(MyList, self).append(item)
Label: 0

Figure 5: Two examples in the WebQueryTest dataset.
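For the retrieval subtask (CodeSearchNet AdvTest), evaluation reduces to ranking every code candidate in the test set for each query and reporting the Mean Reciprocal Rank of the gold function. A minimal sketch of this computation over precomputed embeddings is shown below; the function and variable names are placeholders, not part of the released evaluation code.

```python
import numpy as np

def mean_reciprocal_rank(query_vecs: np.ndarray, code_vecs: np.ndarray) -> float:
    """query_vecs[i] and code_vecs[i] are embeddings of the i-th paired query/code.

    Every code vector serves as a candidate for every query (the AdvTest setting,
    where the whole test set is the candidate pool rather than 1,000 samples).
    """
    # cosine similarity between all queries and all candidate codes
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    scores = q @ c.T                                            # [num_queries, num_candidates]
    ranks = (scores >= scores.diagonal()[:, None]).sum(axis=1)  # rank of the gold code
    return float((1.0 / ranks).mean())
```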
3.7 Code repair

Code repair aims to fix bugs in the code automatically. We use the dataset released by Tufano et al. [75]. The source is buggy Java functions, whereas the target is the corresponding fixed functions.

To build this dataset, they first download every public GitHub event between March 2011 and October 2017 from GitHub Archive (https://www.gharchive.org/) and use the Google BigQuery APIs to identify all Java-file commits having a message containing the patterns [21]: ("fix" or "solve") and ("bug" or "issue" or "problem" or "error"). For each bug-fixing commit, they extract the source code before and after the fixing process by using the GitHub Compare API (https://developer.github.com/v3/repos/commits/#compare-two-commits) to collect the buggy (pre-commit) and the fixed (post-commit) codes. Subsequently, they normalize all the names of the variables and custom methods, which greatly limits the vocabulary size and enables the model to focus on learning bug-fixing patterns. Then, they filter out the pairs that contain lexical or syntactic errors in either the buggy or fixed code, as well as the pairs with more than 100 atomic AST modification actions between the buggy and the fixed versions. To achieve this, they employ the GumTree Spoon AST Diff tool [17]. Finally, they divide the whole dataset into two subsets (small with tokens ≤ 50 and medium with tokens > 50 and ≤ 100) based on the code length. For the small subset, the numbers of training, development, and test samples are 46,680, 5,835, and 5,835, respectively. For the medium subset, the numbers are 52,364, 6,545, and 6,545, respectively.
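The commit-message heuristic above, i.e., ("fix" or "solve") and ("bug" or "issue" or "problem" or "error"), is straightforward to express as a predicate. The sketch below is a hypothetical re-implementation of that filter for illustration; it is not the original Bugs2Fix mining code, and the matching strategy (case-insensitive substring search) is an assumption.

```python
FIX_WORDS = ("fix", "solve")
BUG_WORDS = ("bug", "issue", "problem", "error")

def is_bug_fixing_message(message: str) -> bool:
    """True when a commit message matches the pattern
    ("fix" or "solve") and ("bug" or "issue" or "problem" or "error")."""
    msg = message.lower()
    return any(w in msg for w in FIX_WORDS) and any(w in msg for w in BUG_WORDS)

assert is_bug_fixing_message("Fix NPE bug in the parser")
assert not is_bug_fixing_message("Add a new feature flag")
```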
3.8 Text-to-code generation

To carry out this task, we use CONCODE [38], a widely used code generation dataset, which is collected from about 33,000 Java projects on GitHub. It contains 100,000 examples for training and 4,000 examples for validation and testing. Each example is a tuple consisting of NL descriptions, code environments and code snippets. The dataset is tasked with generating class member functions from natural language descriptions (Javadoc-style method comments) and class environments. The class environment is the programmatic context provided by the rest of the class, including other member variables and member functions in the class.

3.9 Code summarization

For code summarization, we use the CodeSearchNet dataset [35], which includes six programming languages, i.e., Python, Java, JavaScript, PHP, Ruby, and Go. The data comes from publicly available open-source non-fork GitHub repositories, and each documentation is the first paragraph. We observe that some documents contain content unrelated to the function, such as a link "http://..." that refers to external resources or an HTML image tag "<img ...>" that inserts an image. Therefore, we filter the dataset to improve its quality with the same four rules mentioned in Section 3.6.

The statistics about the filtered CodeSearchNet dataset used in CodeXGLUE are listed in Table 3.

Table 3: Data statistics about the filtered CodeSearchNet dataset for the code summarization task.

| Language | Training | Dev | Testing |
|---|---|---|---|
| Go | 167,288 | 7,325 | 8,122 |
| Java | 164,923 | 5,183 | 10,955 |
| JavaScript | 58,025 | 3,885 | 3,291 |
| PHP | 241,241 | 12,982 | 14,014 |
| Python | 251,820 | 13,914 | 14,918 |
| Ruby | 24,927 | 1,400 | 1,261 |

3.10 Documentation translation

Documentation translation aims to translate code documentation automatically from one natural language (e.g., English) to another natural language (e.g., Chinese), as shown in Figure 7. The dataset we use is crawled from Microsoft Documentation (https://docs.microsoft.com/, whose documents are located at https://github.com/MicrosoftDocs/), including software and code description documents in different languages. We focus on low-resource language pairs, where parallel data is scarce, and introduce multilingual machine translation tasks, e.g., English ⇔ Latvian, Danish, Norwegian, and Chinese. To improve the data quality, we filter the corpus by removing the following examples.

(1) Pairs whose source sentence is the same as the target sentence;
(2) Pairs whose source or target sentence is shorter than three words;
(3) Pairs whose length ratio between source and target languages is larger than three;
(4) Pairs whose word alignment ratio computed by fast_align (https://github.com/clab/fast_align) is less than 0.6.

The final training data includes 43K, 19K, 44K, and 50K sentence pairs for English ⇔ Latvian, English ⇔ Danish, English ⇔ Norwegian, and English ⇔ Chinese, respectively. In addition, each language pair has 1K development and 1K test sentence pairs.
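Rules (1)-(3) above are simple identity and length checks, while rule (4) requires alignment scores from fast_align. The following is a minimal sketch of such a filter, assuming the alignment ratio for each pair has been computed externally; the class and function names are illustrative, and whitespace word counting is a simplification.

```python
from dataclasses import dataclass

@dataclass
class SentencePair:
    source: str
    target: str
    alignment_ratio: float  # word alignment ratio, precomputed externally with fast_align

def keep_pair(pair: SentencePair) -> bool:
    """Apply the four filtering rules to one source/target sentence pair."""
    if pair.source == pair.target:                          # rule (1): identical sentences
        return False
    n_src, n_tgt = len(pair.source.split()), len(pair.target.split())
    if n_src < 3 or n_tgt < 3:                              # rule (2): fewer than three words
        return False
    if max(n_src, n_tgt) > 3 * max(min(n_src, n_tgt), 1):   # rule (3): length ratio > 3
        return False
    if pair.alignment_ratio < 0.6:                          # rule (4): poor word alignment
        return False
    return True
```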

4 BASELINE SYSTEMS

We provide three types of baseline models to perform the previously mentioned tasks, including a BERT-style pretrained model (in this case, CodeBERT), which supports program understanding problems, a GPT-style pretrained model called CodeGPT that helps us solve completion and generation problems, and an Encoder-Decoder framework that tackles sequence-to-sequence generation problems. An illustration of these three pipelines is shown in Figure 6.

Figure 6: Three pipelines, including CodeBERT, CodeGPT, and Encoder-Decoder, are provided. (Understanding pipeline: input tokens [CLS] text/code [SEP] code [SEP] → CodeBERT → FFNN + Softmax → category distribution; supported tasks: code search, code clone detection. Generation pipelines: previous code tokens → CodeGPT → next code tokens, supporting code completion and code generation; input code → CodeBERT as encoder → decoder → output code, supporting code repair and code translation.)

Input (English): Multinomial Logistic Regression (Softmax regression) is used to compute the probabilities of several possible outcomes in classification problems.
Output (Chinese): 多项式逻辑回归（Softmax回归）用于计算分类问题中几种可能结果的概率。

Figure 7: An English-to-Chinese example in the documentation translation dataset.

4.1 CodeBERT

To carry out code understanding tasks like clone detection, defect detection, cloze test, and code search, we use CodeBERT [18] as our encoder. This is a bimodal pretrained model based on the Transformer, with 12 layers, 768-dimensional hidden states, and 12 attention heads, for programming language (PL) and natural language (NL). Feng et al. [18] pretrain CodeBERT with masked language modeling and replaced token detection objectives on the CodeSearchNet dataset [35], which includes 2.4M functions paired with documents for six programming languages. The model supports different types of sequence inputs, such as text/code and code/code, with a special token [CLS] in front of the sequence and a special symbol [SEP] to split the two kinds of data types.

The model is publicly available at https://huggingface.co/microsoft/codebert-base.
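As a concrete illustration, the snippet below loads the released checkpoint with the Hugging Face transformers library and encodes a query/code pair in the [CLS] text [SEP] code [SEP] format described above. It is a minimal sketch of how the encoder can be used, with an illustrative query and code snippet, not the fine-tuning code shipped with CodeXGLUE.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

query = "find the maximum value in a list"
code = "def find_max(xs): return max(xs)"

# Build the bimodal input "[CLS] text [SEP] code [SEP]"; text_pair inserts the separators.
inputs = tokenizer(query, text_pair=code, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# The first position corresponds to [CLS] (<s> in the RoBERTa-style tokenizer);
# its hidden state is typically fed to a task-specific classification head.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```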
4.2 CodeGPT

We provide CodeGPT, a Transformer-based language model pretrained on programming language (PL), to support the code completion and text-to-code generation tasks. CodeGPT has the same model architecture and training objective as GPT-2 [59], and consists of 12 layers of Transformer decoders. More model settings are listed in Table 4. We pretrain monolingual models on the Python and Java corpora from the CodeSearchNet dataset [35], which include 1.1M Python functions and 1.6M Java methods. Each function in the training dataset has a function signature and a function body. Some functions also contain natural language documentation.

We train two CodeGPT models for each programming language. One model is pretrained from scratch, so that the BPE (byte pair encoder) [67] vocabulary is newly obtained on the code corpus and the model parameters are randomly initialized. The other model is a domain-adaptive one, which uses the GPT-2 model as the starting point and is continually trained on the code corpus. As a result, the second model has the same GPT-2 vocabulary and natural language understanding ability. We refer to this model as CodeGPT-adapted, and regard it as the default one for the code completion and text-to-code generation tasks. Both models are publicly available at https://huggingface.co/microsoft/CodeGPT-small-java and https://huggingface.co/microsoft/CodeGPT-small-java-adaptedGPT2 (replace "java" with "py" for the models pretrained on the Python dataset).
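A minimal sketch of using the released CodeGPT-adapted checkpoint for line-level completion with the transformers library is shown below; the prompt and the generation settings are illustrative rather than the benchmark's official decoding configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/CodeGPT-small-py-adaptedGPT2"  # Python variant of CodeGPT-adapted
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Unfinished code context; the model should propose the rest of the line.
context = "def add(a, b):\n    return "
input_ids = tokenizer(context, return_tensors="pt").input_ids

output_ids = model.generate(
    input_ids,
    max_new_tokens=16,            # a completed line is short on average
    do_sample=False,              # greedy decoding for a deterministic sketch
    pad_token_id=tokenizer.eos_token_id,
)
completion = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
print(completion)
```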

Table 4: Parameters of the CodeBERT and CodeGPT models.

| | CodeBERT | CodeGPT |
|---|---|---|
| Number of layers | 12 | 12 |
| Max length of position | 512 | 1,024 |
| Embedding size | 768 | 768 |
| Attention heads | 12 | 12 |
| Attention head size | 64 | 64 |
| Vocabulary size | 50,265 | 50,000 |
| Total number of parameters | 125M | 124M |

4.3 Encoder-Decoder

For sequence-to-sequence generation problems like code repair, code translation, code summarization, and documentation translation, we provide an Encoder-Decoder framework. We initialize the encoder using CodeBERT [18] and use a randomly initialized Transformer with 6 layers, 768-dimensional hidden states and 12 attention heads as the decoder in all settings.
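One way to realize this setup is to wrap the pretrained CodeBERT encoder and a freshly initialized 6-layer Transformer decoder in a single sequence-to-sequence module. The sketch below, using PyTorch and the transformers library, is an assumption-laden illustration of that initialization, not the exact training code released with the benchmark; positional encodings and beam-search decoding are omitted for brevity.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class CodeBERTSeq2Seq(nn.Module):
    """CodeBERT encoder plus a randomly initialized 6-layer Transformer decoder."""

    def __init__(self, vocab_size: int, hidden_size: int = 768, num_heads: int = 12):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("microsoft/codebert-base")
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
        self.embed = nn.Embedding(vocab_size, hidden_size)  # target-side embeddings
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, src_ids, src_mask, tgt_ids):
        memory = self.encoder(input_ids=src_ids, attention_mask=src_mask).last_hidden_state
        tgt = self.embed(tgt_ids)  # NOTE: a full model would add positional encodings here
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(hidden)  # token logits over the target vocabulary

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = CodeBERTSeq2Seq(vocab_size=tokenizer.vocab_size)
```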
5 EXPERIMENT

In this section, we report accuracy numbers of the baseline systems on the 10 tasks. We also show how long it takes to train each model and to run inference with it.

5.1 Clone Detection

Setting. We use the BigCloneBench and POJ-104 datasets for clone detection. The task on the BigCloneBench dataset is formulated as a binary classification to predict whether a given pair of codes has the same semantics, with the F1 score used as the evaluation metric. The task on the POJ-104 dataset aims to retrieve 499 codes for a given code from the development/test set for validation/testing, with Mean Average Precision (MAP) as the evaluation metric. The overall score of the clone detection task is the average of the F1 and MAP scores.

Table 5: Results on the clone detection task.

| Model | BigCloneBench F1 | POJ-104 MAP | Overall |
|---|---|---|---|
| RtvNN | 1.0 | - | - |
| Deckard | 3.0 | - | - |
| CDLH | 82.0 | - | - |
| ASTNN | 93.0 | - | - |
| FA-AST-GMN | 95.0 | - | - |
| TBCCD | 95.0 | - | - |
| code2vec* | - | 1.98 | - |
| NCC* | - | 54.19 | - |
| Aroma* | - | 55.12 | - |
| MISIM-GNN* | - | 82.45 | - |
| RoBERTa | 94.9 | 79.96 | 87.4 |
| CodeBERT | 96.5 | 84.29 | 90.4 |

Results. Results achieved by different models are shown in Table 5. RtvNN [89] trains a recursive autoencoder to learn representations for ASTs. Deckard [39] computes vectors for structural information within ASTs and uses Locality Sensitive Hashing (LSH) [14] to cluster similar vectors. CDLH [88] learns representations of code fragments via an AST-based LSTM. ASTNN [97] uses RNNs to encode AST subtrees for statements. It feeds the encodings of all statement trees into an RNN to learn the representation of a program. FA-AST-GMN [84] uses GNNs over a flow-augmented AST to leverage explicit control and data flow information. TBCCD [96] proposes a position-aware character embedding and uses tree-based convolution to capture both the structural information of a code fragment from its AST and lexical information from code tokens. Code2vec [6] learns representations of code snippets by aggregating multiple syntactic paths into a single vector. NCC [7] encodes programs by leveraging both the underlying data flow and control flow of the programs. Aroma [51] is a code recommendation engine that takes a partial code snippet and recommends a small set of succinct code snippets that contain the query snippet. MISIM-GNN [93] learns a structural representation of code from a context-aware semantic structure designed specifically to lift semantic meaning from the code syntax.

In this experiment, we use pretrained models like RoBERTa [50] and CodeBERT [18] to encode source code and take the representation to calculate the semantic relevance of two codes through a feed-forward network or inner product. Although CodeBERT does not leverage code structure, which has proven to be effective for code similarity measures [7, 84, 88, 93, 97], the model still performs better than RoBERTa on the task of clone detection, achieving an overall score of 90.4. These experimental results demonstrate that pretraining is useful for clone detection. There is room for further improvement if code structure is further leveraged.

5.2 Defect Detection

Setting. We use the dataset mentioned in Section 3.2 for defect detection, which aims to predict whether a source code contains defects that may be used to attack software systems. The evaluation metric is the accuracy score. We use the CodeBERT baseline to encode source code and take the representation of the source code to calculate the probability of being exposed to vulnerabilities.

Results. Table 7 shows the results of the models we implemented. We use Bidirectional LSTM (BiLSTM) [32], TextCNN [43], RoBERTa [50], and CodeBERT [18] to encode the representation of a source code, respectively. Then, a two-layer feed-forward network followed by a softmax layer is used to calculate the probability of encountering vulnerabilities. As shown in the results, CodeBERT achieves a 62.1 accuracy score, resulting in state-of-the-art performance. However, the improvement achieved by the pretrained models is limited compared with TextCNN. A potential direction to improve these pretrained models is to incorporate information from code structures such as the Abstract Syntax Tree, data flow, control flow, etc.

Table 6: Results on the cloze test task.

| Model | CT-all Ruby | CT-all JS | CT-all Go | CT-all Python | CT-all Java | CT-all PHP | CT-maxmin Ruby | CT-maxmin JS | CT-maxmin Go | CT-maxmin Python | CT-maxmin Java | CT-maxmin PHP | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RoBERTa | 47.44 | 59.96 | 40.77 | 54.35 | 50.73 | 60.16 | 73.68 | 64.71 | 71.71 | 59.18 | 59.75 | 69.78 | 62.45 |
| CodeBERT (MLM) | 80.17 | 81.77 | 83.31 | 87.21 | 80.63 | 85.05 | 86.84 | 86.40 | 90.79 | 82.20 | 90.46 | 88.21 | 85.66 |

Table 7: Results on the defect detection task.

| Model | Accuracy |
|---|---|
| BiLSTM | 59.37 |
| TextCNN | 60.69 |
| RoBERTa | 61.05 |
| CodeBERT | 62.08 |

5.3 Cloze test

Setting. We use the CT-all and CT-maxmin datasets for the cloze test task. Models are expected to predict the masked code token by leveraging the documentation and the context of the code. Accuracy scores are reported for each language, with the macro-average accuracy over all languages as the overall evaluation metric.

Results. Table 6 shows the results on the CT-all and CT-maxmin datasets. We report the performance of RoBERTa [50] and CodeBERT (Masked Language Modeling, MLM) [18], which is initialized with RoBERTa and further trained with the masked language modeling objective. The results demonstrate that CodeBERT performs better than RoBERTa, which only learns from natural language.

5.4 Code completion

Setting. We use the PY150 and Github Java Corpus datasets for the token-level and line-level code completion tasks. The token-level task is to predict the next token given the context of previous tokens, and predictions are evaluated according to token-level accuracy; the line-level task entails the completion of a whole line of code, and the quality of the code is evaluated through exact match accuracy and Levenshtein edit similarity [72]. Levenshtein edit similarity measures how many single-character edits are required to transform one string into another. This is a critical evaluation metric for the code completion scenario, as it measures how much effort it takes developers to correct an error in the code. The score on each dataset is the average of the accuracy on token-level completion and the edit similarity on line-level completion. The overall score of the code completion task is calculated by averaging the scores on both datasets.

Results. Table 8 shows the results of all models on both datasets. We fine-tune LSTM [32], Transformer [77], GPT-2 [59], CodeGPT and CodeGPT-adapted to generate the following tokens. The CodeGPT and CodeGPT-adapted models are described in Section 4.2. CodeGPT-adapted achieves state-of-the-art performance with an overall score of 71.28.
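Edit similarity is typically reported as a normalized Levenshtein distance between the generated line and the reference line. The sketch below shows one common formulation (1 − distance / max length, scaled to 0-100); the exact normalization used by the official evaluation script may differ, so treat this as an illustration.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[len(b)]

def edit_similarity(prediction: str, reference: str) -> float:
    """Similarity in [0, 100]; 100 means the predicted line matches exactly."""
    if not prediction and not reference:
        return 100.0
    dist = levenshtein_distance(prediction, reference)
    return 100.0 * (1 - dist / max(len(prediction), len(reference)))

print(edit_similarity("return np.mean(x)", "return np.mean(xs)"))
```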
5.5 Code search

Setting. We use the CodeSearchNet AdvTest and WebQueryTest datasets mentioned in Section 3.6 for code search. To improve efficiency, we separately encode text and code to perform code search. For the CodeSearchNet AdvTest dataset, the task is to find the most relevant code from a collection of candidates given a query, and it is evaluated with the Mean Reciprocal Rank (MRR) metric. For the WebQueryTest dataset, the task is formulated as a binary classification to predict whether a code can answer a given query, and we use the F1 and accuracy scores as evaluation metrics. The overall score for code search is the average of the values recorded for the two subtasks.

Results. Table 9 presents the results on the CodeSearchNet AdvTest and WebQueryTest datasets. We report the performance of RoBERTa [50] and CodeBERT [18]. The table shows that CodeBERT performs better than RoBERTa.

5.6 Text-to-code generation

Setting. We use the CONCODE dataset for text-to-code generation. Models are expected to generate the source code of Java class member functions, given natural language descriptions and class environments. We report the exact match accuracy, the BLEU score [56], and the CodeBLEU score [65]. We use the CodeBLEU score as the overall evaluation metric.

Results. Table 10 presents the results on the CONCODE test set. Seq2Seq [70] is an RNN-based sequence-to-sequence model. Seq2Action + MAML [26] combines a context-aware retrieval model with model-agnostic meta-learning (MAML). Iyer-Simp + 200 idioms [36] extracts code idioms and applies idiom-based decoding. We also report the performance of pretrained models, including GPT-2 [59], CodeGPT, and CodeGPT-adapted. CodeGPT-adapted achieves a CodeBLEU score of 35.98, resulting in state-of-the-art performance.

5.7 Code translation

Setting. We use the dataset we built as described in Section 3.5. The dataset contains matching samples of Java and C# functions. We report the exact match accuracy, the BLEU score [56] and the CodeBLEU score [65] on this task. CodeBLEU is used as the overall evaluation metric.

Results. Table 12 shows the results of models in both translation directions. The Naive method directly copies the source code as the translation result. PBSMT is short for phrase-based statistical machine translation [44]. Transformer uses the same number of layers and hidden size as the pretrained models. The table shows that the Transformer initialized with CodeBERT and fine-tuned with the matching sample pairs produces the best result.

Table 8: Results on the code completion task.

| Model | PY150 token-level Accuracy | PY150 line-level EM | PY150 line-level Edit Sim | Java Corpus token-level Accuracy | Java Corpus line-level EM | Java Corpus line-level Edit Sim | Overall |
|---|---|---|---|---|---|---|---|
| LSTM | 58.00 | 17.93 | 50.05 | 56.02 | 10.30 | 41.55 | 51.41 |
| Transformer | 73.26 | 36.65 | 67.51 | 64.16 | 15.33 | 50.39 | 63.83 |
| GPT-2 | 74.22 | 38.55 | 68.94 | 74.89 | 24.30 | 60.70 | 69.69 |
| CodeGPT | 74.93 | 39.11 | 69.69 | 76.45 | 25.30 | 61.54 | 70.65 |
| CodeGPT-adapted | 75.11 | 39.65 | 69.84 | 77.13 | 26.43 | 63.03 | 71.28 |

Table 9: Results on the code search task.

| Model | AdvTest MRR | WebQueryTest F1 | WebQueryTest Accuracy | Overall |
|---|---|---|---|---|
| RoBERTa | 18.33 | 57.49 | 40.92 | 33.63 |
| CodeBERT | 27.19 | 58.95 | 47.80 | 40.28 |

Table 10: Results on the text-to-code generation task.

| Model | EM | BLEU | CodeBLEU |
|---|---|---|---|
| Seq2Seq | 3.05 | 21.31 | 26.39 |
| Seq2Action+MAML | 10.05 | 24.40 | 29.46 |
| Iyer-Simp+200 idioms | 12.20 | 26.60 | - |
| GPT-2 | 17.35 | 25.37 | 29.69 |
| CodeGPT | 18.25 | 28.69 | 32.71 |
| CodeGPT-adapted | 20.10 | 32.79 | 35.98 |

5.8 Code repair

Setting. We use the dataset originally released by Tufano et al. [75], which is described in Section 3.7. The dataset contains two subsets established according to the length of the Java functions: small (length ≤ 50) and medium (50 < length ≤ 100). We report the exact match accuracy, the BLEU score [56] and the CodeBLEU score [65] on this task. The exact match accuracy is used as the overall evaluation metric.

Results. The Naive method directly copies the buggy code as the repair result. As for the Transformer, we use the same number of layers and hidden size as the pretrained models. With regard to the CodeBERT method, we initialize the Transformer encoder with the pretrained CodeBERT model and randomly initialize the parameters of the decoder and the source-to-target attention. Then we use the training data to fine-tune the whole model. As shown in Table 11, the Transformer with CodeBERT initialization achieves the best performance among all models.

5.9 Code Summarization

Setting. We use the dataset mentioned in Section 3.9 for code summarization. To evaluate the models, we follow Feng et al. [18], who use the smoothed BLEU score [49] as the evaluation metric, because it is suitable for evaluating short documents. We use the encoder-decoder pipeline to tackle this problem. The max lengths of input and inference are set to 256 and 128, respectively. We use the Adam optimizer to update the models' parameters. The learning rate and the batch size are 5e-5 and 32, respectively. We tune the hyperparameters and perform early stopping on the development set.

Results. Table 13 shows the results achieved by different models on code summarization. Seq2Seq is an RNN-based sequence-to-sequence model. Transformer and RoBERTa use the same setting as CodeBERT, but the encoder is initialized randomly and by RoBERTa [50], respectively. All models use a Byte Pair Encoding (BPE) [66] vocabulary. In this experiment, CodeBERT obtains a 1.3% gain in the BLEU score over RoBERTa and achieves state-of-the-art performance on the six programming languages.
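Smoothed BLEU for short summaries can be illustrated with NLTK's sentence-level BLEU and a smoothing function. The paper follows the smoothed BLEU of [49], and the official evaluation script may implement the smoothing differently, so the snippet below is only a sketch of the idea; the reference/hypothesis pair is invented for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def smoothed_bleu(reference: str, hypothesis: str) -> float:
    """Sentence-level BLEU-4 with smoothing, suitable for short summaries."""
    ref_tokens = [reference.split()]
    hyp_tokens = hypothesis.split()
    smoother = SmoothingFunction().method4  # one of several available smoothing methods
    return 100.0 * sentence_bleu(ref_tokens, hyp_tokens, smoothing_function=smoother)

reference = "returns the maximum value of the input list"
hypothesis = "return the maximum element in a list"
print(round(smoothed_bleu(reference, hypothesis), 2))
```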
5.10 Documentation translation

Setting. We use the Microsoft Docs dataset for the text-to-text translation tasks, which focus on low-resource multilingual translation between English (EN) and other languages, including Latvian (LA), Danish (DA), Norwegian (NO), and Chinese (ZH). Following Johnson et al. [40], we train a single multilingual model as our baseline. To distinguish between different translation pairs, we add a language token (e.g., ⟨2en⟩, ⟨2zh⟩) at the beginning of the source sentence to indicate the target language into which the model should translate. We initialize the encoder of the multilingual translation model with XLM-R [13]. Models are evaluated with the BLEU [56] score, and the overall score for documentation translation is the average BLEU score over the eight translation directions.

Results. Table 14 shows the results achieved by the models on the eight translation directions. Transformer Baseline is the multilingual translation model [40]. Pretrained Transformer initializes the encoder of the Transformer Baseline with XLM-R [13]. In terms of overall performance on the eight translation directions, Transformer Baseline and Pretrained Transformer obtain BLEU scores of 52.67 and 66.16, respectively. The experimental results demonstrate that pretraining achieves a 13.49-point improvement in BLEU score over the strong baseline model. Figure 8 shows how long it takes to train the model and to do inference with it, for this and the other tasks.

Table 11: Results on the code repair task.

| Method | small BLEU | small Acc | small CodeBLEU | medium BLEU | medium Acc | medium CodeBLEU | Overall |
|---|---|---|---|---|---|---|---|
| Naive | 78.06 | 0.000 | - | 90.91 | 0.000 | - | 0.000 |
| LSTM | 76.76 | 0.100 | - | 72.08 | 0.025 | - | 0.063 |
| Transformer | 77.21 | 0.147 | 73.31 | 89.25 | 0.037 | 81.72 | 0.092 |
| CodeBERT | 77.42 | 0.164 | 75.58 | 91.07 | 0.052 | 87.52 | 0.108 |

Table 12: Results on the code translation task.

| Method | Java→C# BLEU | Java→C# Acc | Java→C# CodeBLEU | C#→Java BLEU | C#→Java Acc | C#→Java CodeBLEU | Overall |
|---|---|---|---|---|---|---|---|
| Naive | 18.54 | 0.000 | - | 18.69 | 0.000 | - | - |
| PBSMT | 43.53 | 0.125 | 42.71 | 40.06 | 0.161 | 43.48 | 43.10 |
| Transformer | 55.84 | 0.330 | 63.74 | 50.47 | 0.379 | 61.59 | 62.67 |
| RoBERTa (code) | 77.46 | 0.561 | 83.07 | 71.99 | 0.579 | 80.18 | 81.63 |
| CodeBERT | 79.92 | 0.590 | 85.10 | 72.14 | 0.580 | 79.41 | 82.26 |

Table 13: Results on the code summarization task.

| Model | Ruby | JavaScript | Go | Python | Java | PHP | Overall |
|---|---|---|---|---|---|---|---|
| Seq2Seq | 9.64 | 10.21 | 13.98 | 15.93 | 15.09 | 21.08 | 14.32 |
| Transformer | 11.18 | 11.59 | 16.38 | 15.81 | 16.26 | 22.12 | 15.56 |
| RoBERTa | 11.17 | 11.90 | 17.72 | 18.14 | 16.47 | 24.02 | 16.57 |
| CodeBERT | 12.16 | 14.90 | 18.07 | 19.06 | 17.65 | 25.16 | 17.83 |

| Task | Dataset Name | Language | Training Cost | Inference Cost |
|---|---|---|---|---|
| Clone Detection | BigCloneBench | Java | 3 hours on P100 x2 | 2 hours on P100 x2 |
| Clone Detection | POJ-104 | C/C++ | 2 hours on P100 x2 | 10 minutes on P100 x2 |
| Defect Detection | Devign | C | 1 hour on P100 x2 | 2 minutes on P100 x2 |
| Cloze Test | CT-all | Python, Java, PHP, JavaScript, Ruby, Go | N/A | 30 minutes on P100-16G x2 |
| Cloze Test | CT-max/min | Python, Java, PHP, JavaScript, Ruby, Go | N/A | 1 minute on P100-16G x2 |
| Code Completion | PY150 | Python | 25 hours on P100 x2 | 30 minutes on P100 x2 |
| Code Completion | GitHub Java Corpus | Java | 2 hours on P100 x2 | 10 minutes on P100 x2 |
| Code Repair | Bugs2Fix | Java | 24 hours on P100 x2 | 20 minutes on P100 x2 |
| Code Translation | CodeTrans | Java-C# | 20 hours on P100 x2 | 5 minutes on P100 x2 |
| NL Code Search | CodeSearchNet, AdvTest | Python | 5 hours on P100 x2 | 7 minutes on P100 x2 |
| NL Code Search | CodeSearchNet, WebQueryTest | Python | 5 hours on P100 x2 | 1 minute on P100 x2 |
| Text-to-Code Generation | CONCODE | Java | 30 hours on P100 x2 | 20 minutes on P100 x2 |
| Code Summarization | CodeSearchNet | Python, Java, PHP, JavaScript, Ruby, Go | On average, 12 hours per PL on P100 x2 | On average, 1 hour per PL on P100 x2 |
| Documentation Translation | Microsoft Docs | English-Latvian/Danish/Norwegian/Chinese | 30 hours on P100 x2 | 55 minutes on P100 x2 |

Figure 8: Training and inference time costs for each task, evaluated on two P100 GPUs.

Table 14: Results on the documentation translation task. [3] Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A convolutional at-
tention network for extreme summarization of source code. In International
conference on machine learning. 2091–2100.
Task Transformer pretrained [4] Miltiadis Allamanis and Charles Sutton. 2013. Mining Source Code Repositories
at Massive Scale using Language Modeling. In 2013 10th Working Conference on
Baseline Transformer Mining Software Repositories (MSR). IEEE, 207–216.
[5] Miltiadis Allamanis and Charles Sutton. 2014. Mining idioms from source code.
EN → DA 53.31 67.09 In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations
EN → LA 37.85 51.92 of Software Engineering. 472–483.
EN → NO 53.84 68.00 [6] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. code2vec: Learn-
ing distributed representations of code. Proceedings of the ACM on Programming
EN → ZH 59.90 70.60 Languages 3, POPL (2019), 1–29.
[7] Tal Ben-Nun, Alice Shoshana Jakobovits, and Torsten Hoefler. 2018. Neural code
DA → EN 58.73 67.02 comprehension: A learnable representation of code semantics. In Advances in
LA → EN 50.37 68.30 Neural Information Processing Systems. 3585–3597.
NO → EN 57.73 71.84 [8] Pavol Bielik, Veselin Raychev, and Martin Vechev. 2016. PHOG: Probabilistic
Model for Code. In Proceedings of the 33rd International Conference on International
ZH → EN 50.00 64.47 Conference on Machine Learning - Volume 48 (New York, NY, USA) (ICML’16).
JMLR.org, 2933–2942.
Overall 52.67 66.16 [9] Marcel Bruch, Martin Monperrus, and Mira Mezini. 2009. Learning from Ex-
text, respectively. With the growing demand for testing models' generalization ability on a wide range of applications, researchers have created or assembled datasets that cover many tasks. Representative samples of these datasets include ImageNet [15] for computer vision, GLUE [81] for natural language understanding, and XTREME [33] and XGLUE [48] for cross-lingual natural language processing. To the best of our knowledge, CodeXGLUE is the first diversified benchmark dataset that can be applied to various code intelligence problems.

Many tasks related to machine learning for software engineering [1] have a sufficient amount of data to support the development of data-driven methods, but are not covered by CodeXGLUE. We plan to extend to these tasks in the future. For example, the idiom mining task [5, 36] is to extract code idioms, which are syntactic fragments that recur across software projects and serve a single semantic purpose [5]. Bug localization [27, 61, 76] is to point out the error location when a program fails tests. The test case generation task [22, 74] is to generate unit test cases automatically. The program synthesis task [20, 45, 53, 64, 68, 79, 98] extends the text-to-code generation task and aims to generate programs from a specification [24], such as pseudocode, a natural language description, or input/output examples.
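As a toy illustration of the input/output-example setting (our own example, not a CodeXGLUE task or baseline), a synthesizer can enumerate a space of candidate programs and return one that is consistent with every example in the specification:

# Toy programming-by-example sketch (illustrative only): search a tiny space of
# candidate programs for one that reproduces every input/output example.
CANDIDATES = [
    ("x + 1", lambda x: x + 1),
    ("2 * x", lambda x: 2 * x),
    ("x * x", lambda x: x * x),
]

def synthesize(examples):
    for name, fn in CANDIDATES:
        if all(fn(i) == o for i, o in examples):
            return name  # first candidate consistent with the whole specification
    return None  # nothing in the search space satisfies the examples

# Specification given purely as input/output examples.
print(synthesize([(1, 2), (3, 6), (5, 10)]))  # -> "2 * x"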
7 CONCLUSION

With CodeXGLUE, we seek to support the development of models that can be applied to various program understanding and generation problems, with the goal of increasing the productivity of software developers. We encourage researchers to participate in the open challenge to make progress in code intelligence. Moving forward, we are planning to extend CodeXGLUE to more programming languages and downstream tasks while continuing to develop advanced pretrained models by exploring new model structures, introducing new pretraining tasks, using different types of data, and more.

REFERENCES
[1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learning for Big Code and Naturalness. ACM Comput. Surv. 51, 4, Article 81 (July 2018), 37 pages. https://doi.org/10.1145/3212695
[2] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2017. Learning to represent programs with graphs. arXiv preprint arXiv:1711.00740 (2017).
[4] Miltiadis Allamanis and Charles Sutton. 2013. Mining source code repositories at massive scale using language modeling. In 2013 10th Working Conference on Mining Software Repositories (MSR). IEEE, 207–216.
[5] Miltiadis Allamanis and Charles Sutton. 2014. Mining idioms from source code. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 472–483.
[6] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages 3, POPL (2019), 1–29.
[7] Tal Ben-Nun, Alice Shoshana Jakobovits, and Torsten Hoefler. 2018. Neural code comprehension: A learnable representation of code semantics. In Advances in Neural Information Processing Systems. 3585–3597.
[8] Pavol Bielik, Veselin Raychev, and Martin Vechev. 2016. PHOG: Probabilistic Model for Code. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48 (New York, NY, USA) (ICML’16). JMLR.org, 2933–2942.
[9] Marcel Bruch, Martin Monperrus, and Mira Mezini. 2009. Learning from Examples to Improve Code Completion Systems. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering (Amsterdam, The Netherlands) (ESEC/FSE ’09). Association for Computing Machinery, New York, NY, USA, 213–222. https://doi.org/10.1145/1595696.1595728
[10] L. Büch and A. Andrzejak. 2019. Learning-Based Recursive Aggregation of Abstract Syntax Trees for Code Clone Detection. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). 95–104. https://doi.org/10.1109/SANER.2019.8668039
[11] Xinyun Chen, Chang Liu, and Dawn Song. 2018. Tree-to-tree neural networks for program translation. In Advances in neural information processing systems. 2547–2557.
[12] Colin B Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. PyMT5: multi-mode translation of natural language and Python code with transformers. arXiv preprint arXiv:2010.03150 (2020).
[13] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019).
[14] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry. 253–262.
[15] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. IEEE, 248–255.
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[17] Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In Proceedings of the 29th ACM/IEEE international conference on Automated software engineering. 313–324.
[18] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020).
[19] Patrick Fernandes, Miltiadis Allamanis, and Marc Brockschmidt. 2018. Structured neural summarization. arXiv preprint arXiv:1811.01824 (2018).
[20] John K. Feser, Swarat Chaudhuri, and Isil Dillig. 2015. Synthesizing Data Structure Transformations from Input-Output Examples. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (Portland, OR, USA) (PLDI ’15). Association for Computing Machinery, New York, NY, USA, 229–239. https://doi.org/10.1145/2737924.2737977
[21] Michael Fischer, Martin Pinzger, and Harald Gall. 2003. Populating a release history database from version control and bug tracking systems. In International Conference on Software Maintenance, 2003. ICSM 2003. Proceedings. IEEE, 23–32.
[22] Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering. 416–419.
[23] Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep Code Search. In Proceedings of the 40th International Conference on Software Engineering (Gothenburg, Sweden) (ICSE ’18). Association for Computing Machinery, New York, NY, USA, 933–944. https://doi.org/10.1145/3180155.3180167
[24] Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. 2017. Program synthesis. Foundations and Trends® in Programming Languages 4, 1-2 (2017), 1–119.
[25] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Jian Yin, Daxin Jiang, et al. 2020. GraphCodeBERT: Pre-training Code Representations with Data Flow. arXiv preprint arXiv:2009.08366 (2020).
[26] Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin. 2019. Coupling Retrieval and Meta-Learning for Context-Dependent Semantic Parsing. arXiv preprint arXiv:1906.07108 (2019).
[27] Rahul Gupta, Aditya Kanade, and Shirish Shevade. 2019. Neural Attribution for Semantic Bug-Localization in Student Programs. In Advances in Neural Information Processing Systems. 11884–11894.
[28] Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. DeepFix: Fixing Common C Language Errors by Deep Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (San Francisco, California, USA) (AAAI’17). AAAI Press, 1345–1351.
[29] Vincent J Hellendoorn and Premkumar Devanbu. 2017. Are deep neural networks the best choice for modeling source code? In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 763–773.
[30] Vincent J Hellendoorn, Charles Sutton, Rishabh Singh, Petros Maniatis, and David Bieber. 2019. Global relational models of source code. In International Conference on Learning Representations.
[31] Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In 2012 34th International Conference on Software Engineering (ICSE). IEEE, 837–847.
[32] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[33] Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080 (2020).
[34] Xing Hu, Ge Li, Xin Xia, David Lo, Shuai Lu, and Zhi Jin. 2018. Summarizing Source Code with Transferred API Knowledge. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (Stockholm, Sweden) (IJCAI’18). AAAI Press, 2269–2275.
[35] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019).
[36] Srinivasan Iyer, Alvin Cheung, and Luke Zettlemoyer. 2019. Learning programmatic idioms for scalable semantic parsing. arXiv preprint arXiv:1904.09086 (2019).
[37] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2073–2083.
[38] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping language to code in programmatic context. arXiv preprint arXiv:1808.09588 (2018).
[39] Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. 2007. Deckard: Scalable and accurate tree-based detection of code clones. In 29th International Conference on Software Engineering (ICSE’07). IEEE, 96–105.
[40] Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5 (2017), 339–351.
[41] Svetoslav Karaivanov, Veselin Raychev, and Martin Vechev. 2014. Phrase-Based Statistical Translation of Programming Languages. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming Software (Portland, Oregon, USA) (Onward! 2014). Association for Computing Machinery, New York, NY, USA, 173–184. https://doi.org/10.1145/2661136.2661148
[42] Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes, Charles Sutton, and Andrea Janes. 2020. Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code. arXiv preprint arXiv:2003.07914 (2020).
[43] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
[44] Philipp Koehn, Franz J Och, and Daniel Marcu. 2003. Statistical phrase-based translation. Technical Report. University of Southern California, Information Sciences Institute, Marina del Rey.
[45] Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy S Liang. 2019. SpoC: Search-based pseudocode to code. In Advances in Neural Information Processing Systems. 11906–11917.
[46] Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised Translation of Programming Languages. arXiv preprint arXiv:2006.03511 (2020).
[47] Yi Li, Shaohua Wang, Tien N Nguyen, and Son Van Nguyen. 2019. Improving bug detection via context-based code representation learning and attention-based neural networks. Proceedings of the ACM on Programming Languages 3, OOPSLA (2019), 1–30.
[48] Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, et al. 2020. XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. arXiv preprint arXiv:2004.01401 (2020).
[49] Chin-Yew Lin and Franz Josef Och. 2004. Orange: a method for evaluating automatic evaluation metrics for machine translation. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics. 501–507.
[50] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[51] Sifei Luan, Di Yang, Celeste Barnaby, Koushik Sen, and Satish Chandra. 2019. Aroma: Code recommendation via structural code search. Proceedings of the ACM on Programming Languages 3, OOPSLA (2019), 1–28.
[52] Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. 2016. Convolutional neural networks over tree structures for programming language processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. 1287–1293.
[53] Arvind Neelakantan, Quoc V Le, and Ilya Sutskever. 2015. Neural programmer: Inducing latent programs with gradient descent. arXiv preprint arXiv:1511.04834 (2015).
[54] Anh Tuan Nguyen, Tung Thanh Nguyen, and Tien N Nguyen. 2015. Divide-and-conquer approach for multi-phase statistical migration for source code (t). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 585–596.
[55] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5206–5210.
[56] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311–318.
[57] Michael Pradel and Koushik Sen. 2018. DeepBugs: A Learning Approach to Name-Based Bug Detection. Proc. ACM Program. Lang. 2, OOPSLA, Article 147 (Oct. 2018), 25 pages. https://doi.org/10.1145/3276517
[58] Varot Premtoon, James Koppel, and Armando Solar-Lezama. 2020. Semantic Code Search via Equational Reasoning. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (London, UK) (PLDI 2020). Association for Computing Machinery, New York, NY, USA, 1066–1082. https://doi.org/10.1145/3385412.3386001
[59] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[60] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016).
[61] Baishakhi Ray, Vincent Hellendoorn, Saheel Godhane, Zhaopeng Tu, Alberto Bacchelli, and Premkumar Devanbu. 2016. On the "naturalness" of buggy code. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE). IEEE, 428–439.
[62] Veselin Raychev, Pavol Bielik, and Martin Vechev. 2016. Probabilistic Model for Code with Decision Trees. ACM SIGPLAN Notices (2016), 731–747.
[63] Veselin Raychev, Martin Vechev, and Eran Yahav. 2014. Code Completion with Statistical Language Models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (Edinburgh, United Kingdom) (PLDI ’14). Association for Computing Machinery, New York, NY, USA, 419–428. https://doi.org/10.1145/2594291.2594321
[64] Scott Reed and Nando De Freitas. 2015. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279 (2015).
[65] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv preprint arXiv:2009.10297 (2020).
[66] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 (2015).
[67] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1715–1725.
[68] Rishabh Singh and Sumit Gulwani. 2015. Predicting a correct program in programming by example. In International Conference on Computer Aided Verification. Springer, 398–414.
[69] Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, et al. 2019. Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203 (2019).
[70] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 3104–3112.
[71] Jeffrey Svajlenko, Judith F Islam, Iman Keivanloo, Chanchal K Roy, and Mohammad Mamun Mia. 2014. Towards a big data curated benchmark of inter-project code clones. In 2014 IEEE International Conference on Software Maintenance and Evolution. IEEE, 476–480.
[72] Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. IntelliCode Compose: Code Generation Using Transformer. arXiv preprint arXiv:2005.08025 (2020).
[73] Alexey Svyatkovskiy, Ying Zhao, Shengyu Fu, and Neel Sundaresan. 2019. Pythia: AI-assisted code completion system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2727–2735.
[74] Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, and Neel Sundaresan. 2020. Unit Test Case Generation with Transformers. arXiv preprint arXiv:2009.05617 (2020).
[75] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Transactions on Software Engineering and Methodology (TOSEM) 28, 4 (2019), 1–29.
[76] Marko Vasic, Aditya Kanade, Petros Maniatis, David Bieber, and Rishabh Singh. 2019. Neural program repair by jointly learning to localize and repair. arXiv preprint arXiv:1904.01720 (2019).
[77] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
[78] Panagiotis Vekris, Benjamin Cosman, and Ranjit Jhala. 2016. Refinement Types for TypeScript. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (Santa Barbara, CA, USA) (PLDI ’16). Association for Computing Machinery, New York, NY, USA, 310–325. https://doi.org/10.1145/2908080.2908110
[79] Murali Vijayaraghavan, Chaudhuri Swarat, and Jermaine Chris. 2017. Bayesian Sketch Learning for Program Synthesis. CoRR abs/1703.05698 (2017).
[80] Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S Yu. 2018. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 397–407.
[81] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018).
[82] Song Wang, Devin Chollak, Dana Movshovitz-Attias, and Lin Tan. 2016. Bugram: bug detection with n-gram language models. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. 708–719.
[83] Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically Learning Semantic Features for Defect Prediction. In Proceedings of the 38th International Conference on Software Engineering (Austin, Texas) (ICSE ’16). Association for Computing Machinery, New York, NY, USA, 297–308. https://doi.org/10.1145/2884781.2884804
[84] Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 261–271.
[85] Wenhua Wang, Yuqun Zhang, Zhengran Zeng, and Guandong Xu. 2020. TranS^3: A Transformer-based Framework for Unifying Code Summarization and Code Search. arXiv preprint arXiv:2003.03238 (2020).
[86] Yanlin Wang, Lun Du, Ensheng Shi, Yuxuan Hu, Shi Han, and Dongmei Zhang. 2020. CoCoGUM: Contextual Code Summarization with Multi-Relational GNN on UMLs.
[87] Bolin Wei, Ge Li, Xin Xia, Zhiyi Fu, and Zhi Jin. 2019. Code generation as a dual task of code summarization. In Advances in Neural Information Processing Systems. 6563–6573.
[88] Huihui Wei and Ming Li. 2017. Supervised Deep Features for Software Functional Clone Detection by Exploiting Lexical and Syntactical Information in Source Code. In IJCAI. 3034–3040.
[89] Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep learning code fragments for code clone detection. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 87–98.
[90] Frank F Xu, Zhengbao Jiang, Pengcheng Yin, Bogdan Vasilescu, and Graham Neubig. 2020. Incorporating external knowledge through pre-training for natural language to code generation. arXiv preprint arXiv:2004.09015 (2020).
[91] S. Yan, H. Yu, Y. Chen, B. Shen, and L. Jiang. 2020. Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). 344–354. https://doi.org/10.1109/SANER48275.2020.9054840
[92] Ziyu Yao, Daniel S Weld, Wei-Peng Chen, and Huan Sun. 2018. StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow. In Proceedings of the 2018 World Wide Web Conference. 1693–1703.
[93] Fangke Ye, Shengtian Zhou, Anand Venkat, Ryan Marcus, Nesime Tatbul, Jesmin Jahan Tithi, Paul Petersen, Timothy Mattson, Tim Kraska, Pradeep Dubey, et al. 2020. MISIM: An End-to-End Neural Code Similarity System. arXiv preprint arXiv:2006.05265 (2020).
[94] Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow. In International Conference on Mining Software Repositories (MSR). ACM, 476–486. https://doi.org/10.1145/3196398.3196408
[95] Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. arXiv preprint arXiv:1704.01696 (2017).
[96] Hao Yu, Wing Lam, Long Chen, Ge Li, Tao Xie, and Qianxiang Wang. 2019. Neural detection of semantic code clones via tree-based convolution. In 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC). IEEE, 70–80.
[97] Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, and Xudong Liu. 2019. A novel neural source code representation based on abstract syntax tree. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 783–794.
[98] Ruiqi Zhong, Mitchell Stern, and Dan Klein. 2020. Semantic Scaffolds for Pseudocode-to-Code Generation. arXiv preprint arXiv:2005.05927 (2020).
[99] Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems. 10197–10207.