Building Program Vector Representations For Deep Learning
Abstract—Deep learning has made significant breakthroughs in various fields of artificial intelligence. Its advantages include the ability to capture highly complicated features and the weak involvement of human engineering. However, it has been virtually impossible to use deep learning to analyze programs, since deep architectures cannot be trained effectively with pure back propagation. In this pioneering paper, we propose the "coding criterion" to build program vector representations, which are the premise of deep learning for program analysis. Our representation learning approach directly makes deep learning a reality in this new field. We evaluate the learned vector representations both qualitatively and quantitatively. We conclude, based on the experiments, that the coding criterion is successful in building program representations. To evaluate whether deep learning is beneficial for program analysis, we feed the representations to deep neural networks, and achieve higher accuracy in the program classification task than "shallow" methods, such as logistic regression and the support vector machine. This result confirms the feasibility of deep learning to analyze programs. It also gives primary evidence of its success in this new field. We believe deep learning will become an outstanding technique for program analysis in the near future.

I. INTRODUCTION

Machine learning-based program analysis has long been studied in the literature [1], [2], [3]. Hindle et al. compare programming languages to natural languages and conclude that programs also have rich statistical properties [4]. These properties are difficult for humans to capture, but they justify using learning-based approaches to analyze programs.

The deep neural network, also known as deep learning, has become one of the prevailing machine learning approaches since 2006 [5]. It has made significant breakthroughs in a variety of fields, such as natural language processing [6], [7], image processing [8], [9], speech recognition [10], [11], etc. Compared with traditional machine learning approaches, deep learning has the following major advantages:

• The deep architecture can capture highly complicated (non-linear) features efficiently. Such features are crucial to most real-world applications.
• Very little human engineering and prior knowledge is required. Interestingly, even with a unified model, deep learning achieves better performance than state-of-the-art approaches in many heterogeneous tasks [12].

Such striking results raise interest in its application to the field of program analysis. Using deep learning to automatically capture program features is an interesting and promising research area.

Unfortunately, it has been practically infeasible for deep learning to analyze programs up till now. Since no proper "pretraining" method has been proposed for programs, deep neural networks cannot be trained effectively with pure back propagation [13], [14], [15], because gradients would either vanish or blow up through the deep architecture [16]. No useful features can be extracted, which results in very poor performance.

In this paper, we propose a novel "coding criterion" to build program vector representations based on abstract syntax trees (ASTs). The vector representations are the premise of deep architectures, and our method directly makes deep learning a reality in the new field of program analysis. In such vector representations, each node in an AST (e.g., ID, Constant) is mapped to a real-valued vector, with each element indicating a certain feature of the node. The vector representations serve as a "pretraining" method: through a deep architecture, they can give rise to high-level abstract features and thus benefit the ultimate tasks. We analyze the learned representations both qualitatively and quantitatively, and conclude from the experiments that the coding criterion is successful in building program vector representations.

To evaluate whether deep learning can be used to analyze programs, we feed the learned representations to a deep neural network in the program classification task. We achieve higher accuracy than "shallow" methods. The result confirms the feasibility of neural program analysis. It also sheds some light on the future of this new area.

We publish all of our source code, datasets, and learned representations on our project website¹ to promote future studies. The AST node representations can be used for further research in various applications of program analysis. The source code contains a versatile infrastructure of the feed-forward neural network, based on which one can build one's own deep neural architectures.

To the best of our knowledge, this paper is the first to propose representation learning algorithms for programs. It is also the first to analyze programs by deep learning. This study is pioneering research in the new field.

∗ Corresponding author.
¹ https://sites.google.com/site/learnrepresent/
To sum up, the main contributions of this paper include:

1) Introducing the techniques of deep learning and representation learning to the field of program analysis;
2) Proposing the novel "coding criterion" to build program representations, which are the premise of deep learning;
3) Exploring the feasibility of analyzing programs by deep neural networks, shedding some light on the future;
4) Publishing all of our source code, datasets, and learned representations online to promote further research.

In the rest of this paper, we first motivate our research in Section II. The background of deep learning and representation learning is introduced in Section III. Then we explain our approach in detail in Section IV. Experimental results are shown in Section V. In Section VI, we look forward to the future of deep learning in the field of program analysis. Last, we draw our conclusion in Section VII.

II. MOTIVATION

A. From Machine Learning to Deep Learning

Traditional machine learning approaches largely depend on human feature engineering, e.g., [17] for bug detection and [18] for clone detection. Such feature engineering is labor-consuming and ad hoc to a specific task. Further, evidence in the literature suggests that human-engineered features may be even worse than automatically learned ones. For example, in the field of natural language processing (NLP), Mnih et al. report that the automatically learned taxonomy of words [19] performs better in their application than the famous expert-constructed WordNet [20] used in [21].

Such results pose increasing demands on highly automated learning approaches, such as deep learning, which requires very little human engineering.

With deep neural networks, program analysis may become easier with statistical methods. For example, in the task of program classification, deep neural networks automatically extract program features of interest. Features can be organized hierarchically, from local to abstract, and based on these abstract features we may determine the category of a program. Such deep learning architectures require less human engineering than existing methods like [22]. Moreover, the feature representation nature also makes multi-task learning easy. As pointed out in [23], "many decision problems can be reduced to classification." Such a deep learning architecture is also applicable to other program analysis tasks, including

• bug detection as in [17], where the deep learning approach is to automatically extract features of bugs;
• clone detection as in [24], to automatically match the features of two programs;
• code retrieval as in [25], to automatically match program features with those of the queries; and
• code recommendation as in [26], to automatically predict the probability of the next possible code elements, e.g., APIs, according to previous ones based on the features (like [27] in NLP).

In short, deep neural networks are capable of capturing highly complicated features with little human involvement. Analyzing programs with deep learning is an interesting and promising research topic.

B. Barriers of Deep Learning for Program Analysis

Although deep neural networks are powerful enough to capture complicated features, there are still barriers to overcome before they can be practically used to analyze programs.

Since all program symbols (e.g., nodes in ASTs) are "discrete," no order is defined on these symbols. Such discrete symbols cannot be fed directly to a neural network. A possible solution is to map each symbol to a real-valued vector in some dimension, where each element of the vector spontaneously characterizes a certain feature of the symbol. Hence, it is also referred to as a distributed representation. "Distributed" is contrary to one-of-all coding, such as k-means clustering results.

A direct mapping approach is to randomly initialize these vector representations and train with pure back propagation (like shallow networks). Chances are that we end up with both poor optimization and poor generalization if the network is deep [13], [14], [15]. An alternative is to first learn the representations unsupervisedly, regardless of the specific task like clone detection, bug detection, etc.; then they are fed to a neural network for supervised training. The vector representations specify meaningful features of the symbols and benefit the ultimate tasks. Hence, much research focuses on the problem of representation learning per se, such as [27], [28], [5], [7], [29] in fields like NLP.

However, due to the structural differences between natural languages and programming languages [30], existing representation learning algorithms in NLP are improper for programs. As we know, natural languages are always written/spoken in one dimension as time flows. By contrast, programmers always organize their source code with proper indentation, indicating branches, loops, or even nested structures. It would be extremely difficult to read source code if it were written in one line (like natural languages). The formal grammar rules of the programming language redefine the notion of neighborhood: to be concrete, physically neighboring elements in a source file are not necessarily near to each other, but those within one grammar rule are.

Further, if we want to build the representations for abstract components of a program (e.g., function declaration/call nodes in ASTs), existing NLP representation learning algorithms are inapplicable since all of them are "flat."

Therefore, new approaches are needed to build program vector representations, which have unfortunately not been studied.

The above facts motivate our research on representation learning for programs. This eventually makes deep learning a reality for analyzing programs. Considering current evidence in the literature, we believe deep learning will make great progress in heterogeneous tasks of program analysis.
III. BACKGROUND OF DEEP LEARNING AND REPRESENTATION LEARNING

A. Deep Neural Networks

Nowadays, the deep neural network is a widely-used technique in artificial intelligence. Comprehensive reviews include [31], [32].

[Fig. 1. A deep neural network (B) is composed of multiple layers of neurons (A). Part (A) shows a single layer of neurons mapping an input to an output; Part (B) shows a deep neural network with an input layer (lower level, local features), several hidden layers, and an output layer (higher level, abstract features).]

A single layer of neurons, the building block of deep neural networks, takes a vector x ∈ R^n as input and outputs a vector y ∈ R^m (Part A in Figure 1). Typically y is computed as

    y = f(W · x + b)    (1)

where W ∈ R^{m×n} and b ∈ R^m are model parameters, which are first randomly initialized and then trained by gradient descent, i.e., W ← W − α · ∂J/∂W and b ← b − α · ∂J/∂b. (α is the learning rate; J is the cost function.) f is the activation function, usually non-linear, such as sigmoid, tanh, etc. The power of a single-layer neural network is limited: the decision boundary is linear, which is insufficient for most real-world applications.

Multi-layer neural networks (Part B in Figure 1) stack multiple layers of neurons. The parameters can also be trained by gradient descent with the back propagation algorithm [33]. Due to the stacking of multiple non-linear layers, multi-layer neural networks gain much more power. It can be proved that a 2-layer² neural network with sufficient hidden units can approximate arbitrary Boolean or continuous functions, and that a 3-layer network can approximate any function [34]. However, the shallow architecture is inefficient, because the number of hidden units may grow exponentially in order to learn complicated (highly non-linear) features of the data [35]. Such exponentiation of hidden layer units, and hence parameters, raises the VC-dimension of the model, leading to very poor generalization [36].

² The input layer is not counted in the number of layers in our terminology, as there is no weight associated with it.

The theory of circuits suggests that deep architectures are more efficient at capturing complicated features [31]. In such deep architectures, features can be organized hierarchically, with local features at lower layers and abstract features at higher layers (Figure 1). However, while deep architectures capture abstract features efficiently, they also make the networks very difficult to train. Little successful research using deep architectures was reported in the early years (except convolutional neural networks [37]).

In 2006, Hinton et al. proposed the stacked restricted Boltzmann machine (RBM) as a greedy layer-wise pretraining method for deep neural networks [5]. During the pretraining phase, the stacked RBM learns underlying data features unsupervisedly by minimizing the energy function defined on the unlabeled data (i.e., maximizing the likelihood of the data). Shortly after that, stacked autoencoders were used for pretraining [13], the criterion of which is to minimize the reconstruction error. During the pretraining phase with either stacked RBMs or autoencoders, the weights for neuron connections are learned, which extract underlying features of the data. Then, for supervised learning, the neural weights are initialized to those learned in the pretraining phase instead of being initialized randomly. The standard back propagation algorithm can be used for fine-tuning the weights so as to be specific to the task.

The pretraining phase plays a vital role in deep learning. It learns the features of the data unsupervisedly, and as a result, the weights are much more meaningful than randomly chosen ones. According to the experiments reported in [13], [14], [15], pretraining helps optimization (minimizing the training error) as well as generalization (minimizing the test error).

However, the advantages of deep neural networks have not yet been exploited in the field of program analysis. We believe deep learning will also exhibit its power in this new field.

B. Existing Representation Learning Approaches in NLP

Neural networks and pretraining approaches like RBMs and autoencoders work well with image data, speech data, etc. But they cannot be applied directly to fields like NLP and program analysis, because words and program symbols are "discrete."

In data like images, a total order is defined on the values. For example, a gray-scale pixel with value 0 is black; if the value increases, the pixel becomes brighter; if the value is 255, the pixel is white. However, in fields like NLP, a word with index 20 is by no means twice as large as the word with index 10 for any meaningful feature. Therefore, unlike traditional approaches in NLP, where each word is treated as an atomic unit, it is meaningless to feed the indexes of words to neural networks. (Note the multiplication W · x in Equation 1.)
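To make Equation (1) and the problem with raw indexes concrete, the following NumPy sketch shows a single layer and contrasts an integer index with a looked-up vector; it is illustrative only, and all sizes, symbols, and values are invented for the example (the lookup table anticipates the real-valued representations introduced in the next paragraph).

    import numpy as np

    rng = np.random.default_rng(0)

    # A single neural layer as in Equation (1): y = f(W·x + b), with f = tanh.
    n_in, n_out = 30, 16                      # hypothetical sizes
    W = rng.normal(scale=0.1, size=(n_out, n_in))
    b = np.zeros(n_out)

    def layer(x):
        return np.tanh(W @ x + b)

    # Feeding a raw symbol index (e.g., word #20 vs. word #10) into W·x would
    # wrongly imply that one symbol is "twice" the other.  Instead, each
    # discrete symbol gets its own learnable real-valued vector (a lookup table).
    vocabulary = ["int", "double", "if", "while", "return"]    # toy vocabulary
    embeddings = rng.normal(scale=0.1, size=(len(vocabulary), n_in))

    def vec(symbol):
        return embeddings[vocabulary.index(symbol)]

    y = layer(vec("while"))                   # the layer sees features, not an index
    print(y.shape)                            # (16,)

Training would adjust W, b, and the rows of the lookup table by gradient descent, which is exactly where a good initialization (pretraining) of the table matters.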
Real-valued vector representations come to our rescue. With such representations, each word is mapped to a k-dimensional vector (k = 30, 100, 300, etc.), representing certain (anonymous) features of the word, where each value reflects the degree to which a feature is satisfied. These word representations can be fed forward to standard neural networks, and every routine of deep learning works.

Early word representation learning is related to language modeling, the objective of which is to maximize the joint probability of a linguistic corpus. In [27], the probability of each word is predicted given the n−1 previous words; by maximizing the conditional probability of the n-th word, useful word features are learned. Hinton et al. introduce three energy-based models in [28], where they learn word representations by minimizing the energy (maximizing the likelihood) defined on neighboring words. In [21], [19], hierarchical architectures are proposed to reduce the computational cost of calculating the probabilities. Later, researchers realized that normalizing the probabilities is not essential if we merely want to learn feature vectors; fast algorithms were then proposed in [38], [39].

All the above approaches adopt the Markovian assumption, where each word is related to finitely many previous words. Such approaches take into consideration only local patterns of physically neighboring words. The recurrent neural network (RNN) was introduced in order to capture long-term dependencies [40]. However, RNNs may be very difficult to train, since the gradient would either vanish or blow up during back propagation [16]. Besides, the time-delaying nature of the RNN treats data as a one-dimensional signal, where structural information is also lost.

As we have discussed in Section I, programming languages differ from natural languages in that the former contain richer and more explicit structural information. Therefore, new representation learning algorithms are needed. To solve this problem, we propose the "coding criterion" based on ASTs. The details of our approach are explained in the following section.

IV. THE CODING CRITERION FOR PROGRAM REPRESENTATION LEARNING

In this section, we first discuss the granularities of program representation and settle on the granularity of nodes in abstract syntax trees (ASTs). In Subsection IV-B, we formalize our approach and give the learning objective. In Subsection IV-C, we present the stochastic gradient descent algorithm for training.

A. The Granularity

We should first answer a fundamental question of program representation learning: the granularity of representation. As introduced in previous sections, vector representations map a symbol to a real-valued, distributed vector. Possible granularities of the symbol include the character level, the token level, etc. We analyze each case as follows.

[Fig. 2. A C code snippet (A) and its corresponding AST (B). Each node in the AST corresponds to an abstract component (e.g., a function declaration, a binary operator) in the program. The snippet in (A) is:

    double doubles(double doublee){
        return 2 * doublee;
    }

Its AST (B) contains the nodes FuncDef, Decl, Compound, FuncDecl, Return, ParameterList, TypeDecl, BinaryOp, IdentifierType, Constant, and ID.]

• Character-level. Characters are the most atomic units of programming languages. At this level, we treat each character in the source code as a symbol, which means we should learn representations for characters like a–z, A–Z, 0–9, and all punctuation marks. Although some research explores character-level modeling for NLP [41], it is improper for programming languages. In NLP, knowledge of word morphology can be exploited to some extent by character-level modeling, but the situation changes in programs. For example, the token double in C code refers to a data type, but if one writes doubles, it is an identifier (e.g., a function name), completely different from double. Yet double and doubles share most characters, which leads to similar vector representations.

• Token-level. This level is most similar to NLP representation learning. We learn representations for all tokens (analogous to words in NLP), including types like int, double, identifiers like doubles, func, etc. Unfortunately, the identifier representations bring severe problems. Unlike natural languages, where the number of words is generally fixed, programmers can declare their own identifiers in their source code, e.g., func1, func2, func3, and so on. Therefore, the number of tokens is unlimited. Because some identifiers may appear only a few times (e.g., tmp), we suffer from the problem of undesired data sparseness. Hence, this level is improper for representation learning. Another problem of the token-level representation is that information is not encoded efficiently. For example, we need two tokens to represent a pair of parentheses, indicating the priority of different subexpressions. In fact, such information need not be expressed explicitly in a more compressed representation like ASTs.

• Nodes in ASTs. The abstract syntax tree is a structural representation of a program. Figure 2 shows a C code snippet and its corresponding AST, parsed by pycparser³ (a parsing sketch is given after this list). At this level, we learn the representations for nodes in ASTs, e.g., FuncDef, ID, Constant. As stated, the AST is more compressed compared with the token-level representation. Furthermore, there are only finitely many types of nodes in ASTs, which makes the representations feasible to learn. The tree-structured nature of ASTs also provides opportunities to capture structural information of programs. Despite the above facts, the AST representation also has its drawback, since we regard all identifiers as the same symbol: code fragments such as a*b and c*d cannot be distinguished from each other. We hypothesize that structural information captures the semantics of programs to a large extent. For example, if we see two nested for-loops, inside of which is a branch of comparison followed by three assignments, the code snippet is likely to be an implementation of bubble sort. This level is also used in traditional program analysis like code clone detection [42], [43], vulnerability extrapolation [44], etc. The experimental results in Section V-B also show high accuracy in the program classification task at this level.

• Statement-level, function-level or higher. Theoretically, a statement, a function, or even a whole program can also be mapped to a real-valued vector. However, such representations cannot be trained directly. A possible approach to modeling such complex constructs is composition, i.e., the representation of a complex construct is composed from those of atomic ones. Such research in NLP is often referred to as compositional semantics [45]. The state-of-the-art approaches in NLP compositional semantics can only model sentences and paragraphs roughly. It is very hard to capture precise semantics; the "semantic barrier" has not yet been overcome.

To sum up, we have analyzed different granularities of program representation. We think the representation of nodes in ASTs has theoretical foundations, is feasible to learn, and is useful in applications. In the following subsection, we formalize our coding criterion to build program vector representations.

³ https://pypi.python.org/pypi/pycparser/
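As a concrete illustration of the AST granularity, the following sketch parses the code snippet of Figure 2 with pycparser and lists the node types that occur. It is only a minimal example of obtaining the symbols whose representations are learned; the traversal helper is our own and not part of the paper's published code, and the printed list is indicative rather than authoritative.

    from pycparser import c_parser

    SOURCE = r"""
    double doubles(double doublee){
        return 2 * doublee;
    }
    """

    def node_types(node):
        # Yield the type name of every node in the AST, depth first.
        yield type(node).__name__
        for _, child in node.children():
            yield from node_types(child)

    ast = c_parser.CParser().parse(SOURCE)
    print(list(node_types(ast)))
    # e.g. ['FileAST', 'FuncDef', 'Decl', 'FuncDecl', 'ParamList', 'Decl',
    #       'TypeDecl', 'IdentifierType', 'TypeDecl', 'IdentifierType',
    #       'Compound', 'Return', 'BinaryOp', 'Constant', 'ID']

These type names are exactly the discrete symbols that the coding criterion maps to vectors.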
B. Formalization

The basic criterion of representation learning is that similar symbols should have similar representations. Further, symbols that are similar in some aspects should have similar values in the corresponding feature dimensions. This is referred to as "disentangling the underlying factors of variation" in [32].

In our scenario of representation learning for AST nodes, similarity is defined based on the following intuition. We think symbols such as ID and Constant are similar because both of them are related to data reference; For and While are similar because both are related to loops. The observation is that similar symbols have similar usages in the programming language: both ID and Constant can be an operand of a unary/binary operator; both For and While govern a block of code; etc.

To capture such similarity using AST structural information, we propose the "coding criterion." The idea is that the representation of a node in an AST should be "coded" by its children's representations via a single neural layer.

We denote the vector of node x as vec(x), where vec(·) ∈ R^{N_f} and N_f is the dimension of features. (N_f is set to 30 empirically in our experimental setting.) For each non-leaf node p in an AST and its direct children c_1, ..., c_n, the representations are vec(p), vec(c_1), ..., vec(c_n). The primary objective is that

    vec(p) ≈ tanh( Σ_{i=1}^{n} l_i · W_i · vec(c_i) + b )    (2)

where W_i ∈ R^{N_f × N_f} is the weight matrix for node c_i and b ∈ R^{N_f} is the bias term. The terms W_i · vec(c_i) are weighted by the number of leaves under c_i, with coefficients

    l_i = (#leaves under c_i) / (#leaves under p)    (3)

As we have noticed in Figure 2, different nodes in ASTs may have different numbers of children, and thus the number of W_i's is hard to determine. To solve this problem, one extreme is to apply dynamic pooling [46], [47]: we take the summed or maximal value over vec(c_i) for each feature dimension, and thus only one weight matrix is needed. This is mathematically equivalent to the continuous bag-of-words model [38]. However, when pooling is applied, the position information of the children is totally lost, so it is not satisfactory. The other extreme is to assign a different matrix parameter to each different position [45]. This method works in its original application with dependency trees, but may fail in our scenario because there would be too many weights.

What we propose is a model called the continuous binary tree, where there are only two weight matrices as parameters, namely W_l and W_r. Any weight W_i is a linear combination of the two matrices; that is, regardless of the number of children, we treat the node as a "binary" tree. Formally, if p has n (n ≥ 2) children, then for child c_i,

    W_i = ((n − i) / (n − 1)) · W_l + ((i − 1) / (n − 1)) · W_r    (4)

If n = 1, then W_i = (1/2) W_l + (1/2) W_r.

This process is illustrated in Figure 3, where the gray-scale values in the two bars represent the weight coefficients for the node at the corresponding position. With this model, the relative position information of a node can be coded into the network.

[Fig. 3. Illustration of the continuous binary tree. A parent p has children c_1, c_2, ..., c_n, and there are two weight matrices W_l and W_r. The gray-scale bars at the bottom indicate the coefficients of the weight parameters (W_l and W_r, respectively) at the corresponding position.]

Now that we are able to calculate the weight W_i for each node, and thus the right-hand side of Equation 2, we measure closeness by the square of the Euclidean distance:

    d = || vec(p) − tanh( Σ_{i=1}^{n} l_i · W_i · vec(c_i) + b ) ||₂²    (5)
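Equations (2)–(5) translate into a few lines of code. The sketch below is our own NumPy paraphrase, not the released implementation; the tree encoding (a node as a symbol plus a list of children) and the dictionary of vectors are assumptions made for the example.

    import numpy as np

    N_F = 30                                    # feature dimension, as in the paper

    def child_weight(i, n, W_l, W_r):
        # Continuous binary tree, Equation (4); i is 1-based.
        if n == 1:
            return 0.5 * W_l + 0.5 * W_r
        return ((n - i) / (n - 1)) * W_l + ((i - 1) / (n - 1)) * W_r

    def leaf_count(node):
        _, children = node
        return 1 if not children else sum(leaf_count(c) for c in children)

    def coding_distance(parent_vec, children, vec, W_l, W_r, b):
        # Squared Euclidean distance d of Equation (5) for one parent node.
        # `children` is a list of (symbol, grandchildren) pairs; `vec` maps a
        # symbol to its current N_F-dimensional representation.
        n = len(children)
        total_leaves = sum(leaf_count(c) for c in children)
        acc = np.zeros(N_F)
        for i, child in enumerate(children, start=1):
            symbol, _ = child
            l_i = leaf_count(child) / total_leaves          # Equation (3)
            W_i = child_weight(i, n, W_l, W_r)              # Equation (4)
            acc += l_i * (W_i @ vec[symbol])
        reconstructed = np.tanh(acc + b)                    # Equation (2)
        return float(np.sum((parent_vec - reconstructed) ** 2))

Summing this distance over the non-leaf nodes of a corpus of ASTs gives the quantity that training pushes down for genuine samples and up for corrupted ones, as described next.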
According to our "coding criterion," d needs to be as small as possible. However, we cannot directly minimize Equation 5; otherwise, the network is likely to output trivial representations like vec(·) = 0, W = 0, b = 0. Such a result gives zero distance but is meaningless.

To solve the problem, negative sampling can be applied [12], [45], [48]. The idea is that for each data sample x, a new negative sample x_c is generated. Since x_c violates the patterns of valid data, it needs to have a larger distance (denoted as d_c) than d. Hence, the negative sampling method is also sometimes referred to as the pairwise ranking criterion [49]. In our program representation learning, we randomly select a symbol (one of p, c_1, c_2, ..., c_n) in each training sample and substitute it with a different random symbol. The objective is that d_c should be at least as large as d + Δ, where Δ is the margin, often set to 1. The error function of training sample x^(i) and its negative sample x_c^(i) is then

    J(d^(i), d_c^(i)) = max{ 0, Δ + d^(i) − d_c^(i) }    (6)

C. Training

The representations and the model parameters are learned by minimizing Equation 6, through which errors can back propagate. Thus, useful features are learned for AST nodes.

To speed up training, we adopt the momentum method, where the partial derivatives of the last iteration are added to the current ones with a decay coefficient. Algorithm 1 shows the pseudo-code of the training process.

Algorithm 1: Stochastic gradient descent with momentum
    Input:  data samples x^(i), i = 1..N; momentum (decay) coefficient; learning rate α
    Output: model parameters Θ = (vec(·), W_l, W_r, b)
    Randomly initialize Θ
    while not converged do
        for i = 1..N do
            Generate a negative sample x_c^(i) for x^(i)
            Propagate forward and backward to compute J^(i) and the partial derivative ∂J^(i)/∂Θ
            ∂J^(i)/∂Θ ← decay · ∂J^(i−1)/∂Θ + ∂J^(i)/∂Θ        // momentum
            Θ ← Θ − α · ∂J^(i)/∂Θ                               // gradient descent
        end
    end
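In code, the loop of Algorithm 1 can be written as the following skeleton. It is illustrative only: the corruption helper and the gradient computation are abstracted away as assumed callables, and the learning rate and decay values are made up.

    import numpy as np

    DELTA = 1.0      # margin in Equation (6)
    ALPHA = 0.03     # learning rate (hypothetical value)
    DECAY = 0.9      # momentum coefficient (hypothetical value)

    def train(params, samples, corrupt, distance_and_grads, epochs=40, seed=0):
        # Skeleton of Algorithm 1: SGD with momentum and negative sampling.
        # `params` is a dict of NumPy arrays (node vectors, W_l, W_r, b).
        # `corrupt(sample, rng)` is assumed to return a negative sample with one
        # symbol replaced; `distance_and_grads(sample, negative, params)` is
        # assumed to return (d, d_c, grads), where `grads` holds the derivatives
        # of max(0, DELTA + d - d_c) with respect to each parameter.
        rng = np.random.default_rng(seed)
        velocity = {k: np.zeros_like(v) for k, v in params.items()}
        for _ in range(epochs):
            for sample in samples:
                negative = corrupt(sample, rng)                     # negative sampling
                d, d_c, grads = distance_and_grads(sample, negative, params)
                if DELTA + d - d_c <= 0.0:
                    continue                                        # Equation (6) is already zero
                for k in params:
                    velocity[k] = DECAY * velocity[k] + grads[k]    # momentum
                    params[k] -= ALPHA * velocity[k]                # gradient descent
        return params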
V. EXPERIMENTAL RESULTS

A. Qualitative Evaluation

TABLE I
NEAREST-NEIGHBOR QUERIES ON THE LEARNED AST NODE REPRESENTATIONS.

Query     | Most similar                                             | Most dissimilar
ID        | BinaryOp, Constant, ArrayRef, Assignment, StructRef ...  | PtrDecl, Compound, Root, Decl, TypeDecl
Constant  | ID, UnaryOp, StructRef, ArrayRef, Cast ...               | EnumeratorList, ExprList, If, FuncDef, Compound
BinaryOp  | ArrayRef, Assignment, StructRef, UnaryOp, ID ...         | PtrDecl, Compound, FuncDecl, Decl, TypeDecl
ArrayRef  | BinaryOp, StructRef, UnaryOp, Assignment, Return ...     | Compound, PtrDecl, FuncDecl, Decl, TypeDecl
If        | For, Compound, Break, While, Case ...                    | BinaryOp, TypeDecl, Constant, Decl, ID
For       | If, While, Case, Break, Struct ...                       | BinaryOp, Constant, ID, TypeDecl, Decl
Break     | While, Case, Continue, Switch, InitList ...              | BinaryOp, Constant, TypeDecl, Decl, ID
While     | Switch, Continue, Label, Goto ...                        | BinaryOp, Constant, Decl, TypeDecl, ID
FuncDecl  | ArrayDecl, PtrDecl, FuncDef, Typename, Root ...          | ArrayRef, FuncCall, IdentifierType, BinaryOp, ID
ArrayDecl | FuncDecl, PtrDecl, Typename, FuncDef, While ...          | BinaryOp, Constant, FuncCall, IdentifierType, ID
PtrDecl   | FuncDecl, Typename, FuncDef, ArrayDecl ...               | FuncCall, ArrayRef, Constant, BinaryOp, ID

As Table I shows, ID and Constant are nearest neighbors of each other. This seems meaningful, since both of them are related to data reference. Similar symbols also include ArrayRef and BinaryOp, which are related to data manipulation. Symbols like If, For, While, and Break are similar, as they are related to control flow. ArrayDecl and PtrDecl are similar, as they are declarations. Moreover, these three groups are dissimilar to each other (see the most dissimilar part of Table I).

To further confirm the above potential clusters in the vector representations, we perform k-means clustering, where k is set to 3. The result is shown in Table II. As we see, almost all the symbols in Cluster 1 are related to data reference/manipulation. Cluster 2 is mainly about declarations. Cluster 3 contains more symbols, the majority of which are related to control flow. This result confirms our conjecture that similar symbols can be clustered into groups with the distributed vector representations learned by our approach.

TABLE II
THE RESULT OF k-MEANS CLUSTERING. k IS SET TO 3.

Cluster | Symbols
1       | UnaryOp, FuncCall, Assignment, ExprList, FuncDecl, StructRef, BinaryOp, ID, Constant, ArrayRef
2       | FuncDef, TypeDecl, FuncDecl, Compound, ArrayDecl, PtrDecl, Decl, Root
3       | Typedef, Struct, For, Union, CompoundLiteral, TernaryOp, Label, InitList, IdentifierType, Return, Enum, Break, DoWhile, Case, DeclList, Default, While, Continue, ParamList, Enumerator, Typename, Goto, Cast, Switch, EmptyStatement, EnumeratorList, If
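Both evaluations in this subsection are straightforward to reproduce once the node vectors are available. The sketch below shows one possible way to do so; it assumes a plain text file of "Symbol v1 v2 ... vN" lines (an assumption about the data format, not a documented one), uses Euclidean distance for the nearest-neighbor query, and uses scikit-learn for k-means.

    import numpy as np
    from sklearn.cluster import KMeans

    def load_vectors(path):
        # Parse lines of the form "Symbol v1 v2 ... vN" into a dict.
        vectors = {}
        with open(path) as f:
            for line in f:
                name, *values = line.split()
                vectors[name] = np.array([float(v) for v in values])
        return vectors

    def nearest(query, vectors, k=5):
        # The k most similar symbols to `query` by Euclidean distance.
        q = vectors[query]
        others = [(np.linalg.norm(v - q), s) for s, v in vectors.items() if s != query]
        return [s for _, s in sorted(others)[:k]]

    vectors = load_vectors("node_vectors.txt")      # hypothetical file name
    print(nearest("ID", vectors))

    symbols = sorted(vectors)
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(
        np.stack([vectors[s] for s in symbols]))
    for cluster in range(3):
        print(cluster, [s for s, l in zip(symbols, labels) if l == cluster])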
As the qualitative evaluation shows, the learned representations are meaningful, as they characterize the relationships between different symbols effectively. The results are consistent with human understanding of programs.

It should also be noted that similarity is not the only goal of representation learning. Heuristic metrics can also be used to measure similarity (like [50] in NLP and [18], [24] in program analysis, which may be useful for code clone detection [51] and code retrieval [25]), but they fail to capture different aspects of the relationships between different symbols, because similarity is the only outcome of such metrics. Thus, they are not suitable for highly automated learning algorithms, e.g., deep neural networks. On the contrary, real-valued representations are distributed: as each dimension spontaneously captures a feature in a certain aspect, the distributed vector representations can give rise to high-level abstract features and benefit various tasks. Therefore, representation learning is crucial to program analysis with deep learning approaches.

B. Quantitative Evaluation: Improvement for Supervised Learning

We now evaluate whether building program vector representations is beneficial for real-world tasks, i.e., whether they improve optimization and/or generalization for the supervised learning of interest. We feed the representations to the Tree-based Convolutional Neural Network (TCNN) for program classification.

The dataset comes from an online Open Judge (OJ) system⁴, which contains a large number of programming problems for students. Students solve the problems and submit their source code to the system; the OJ system automatically compiles, runs, and judges the validity of the submitted code. We select four problems for our program classification task. Source code (in the C programming language) of the four problems is downloaded along with the labels (problem IDs). We split the dataset 3 : 1 : 1 for training, cross-validation (CV), and testing.

Figure 4 plots the learning curves for training and CV in the first 40 epochs. (One epoch is an iteration over all training samples.)

⁴ http://programming.grids.cn/
[Fig. 4. Learning curves of the training error (Part A) and cross-validation error (Part B) over the first 40 epochs, with randomly initialized parameters versus the pretrained vector representations.]

TABLE III
ACCURACY OF PROGRAM CLASSIFICATION.

Method       | Accuracy
Random guess | 25.00%
The training and CV errors are measured by the cross-entropy cost

    J = −(1/N) · Σ_{i=1}^{N} Σ_{j=1}^{M} t_j^(i) · log y_j^(i)

where N is the number of data samples (training or CV, respectively); M = 4 is the number of labels (different programming problems); y_j is the probability of label j estimated by the TCNN model; and t is the actual label (one-of-all coding), with t_j^(i) indicating whether data sample i belongs to label j.
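For reference, the averaged cross-entropy over one-of-all coded labels transcribes directly into NumPy; the predictions below are made up, purely to show the shapes involved.

    import numpy as np

    def cross_entropy(y, t):
        # y: (N, M) predicted label probabilities; t: (N, M) one-of-all coded labels.
        return -np.mean(np.sum(t * np.log(y), axis=1))

    # Toy check with N = 2 samples and M = 4 labels.
    y = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])
    t = np.array([[1, 0, 0, 0],
                  [0, 0, 1, 0]])
    print(cross_entropy(y, t))   # ≈ (−ln 0.7 − ln 0.25) / 2 ≈ 0.87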
Since no effective program representation existed before, the deep TCNN model could not be trained at all, as the blue curve at the top of Part A in Figure 4 demonstrates. (Here, all model parameters are initialized randomly, which is a prevalent setting in "shallow" architectures.) The reason is that gradients vanish or blow up during back propagation through a deep network. No useful features are learned, and as a result, TCNN also performs poorly on the CV set, indicated by the cyan curve at the top of Part B in Figure 4.

On the contrary, the program representation serves as a pretraining method. If the vector representations learned by the coding criterion are used for initialization, the network can be trained effectively, improving both optimization and generalization (Figure 4).

To sum up, we evaluate the learned representations empirically by nearest-neighbor querying and k-means clustering. The program classification experiment shows that the learned representations are greatly beneficial for supervised learning.

Based on the above experiments, we conclude that the proposed "coding criterion" based on ASTs is a successful representation learning algorithm for programs.

Our experimental result in program classification confirms the feasibility of using deep learning to analyze programs. It also shows primary evidence of its success in the new field.

VI. LOOKING FORWARD TO THE FUTURE

As evidence in the literature shows, deep learning is making breakthroughs in many fields of artificial intelligence. We believe it will also become an important method in various tasks in the field of program analysis. As a pioneering study,
we address the following promising research topics in this new area.

A. Different Perspectives for Program Modeling

In this paper, we treat a program as a tree, where each node corresponds to an "abstract" component of the program. We hypothesize that structural information is important to programs to a large extent, and our experiments confirm this conjecture. However, the AST is not the only perspective for program modeling.

Another possible perspective is treating a program as a sequence of statements. Such a perspective is also adopted in traditional program analysis, e.g., API usage pattern mining [53], [54]. As the representations of complex constructs can be composed from atomic symbols (e.g., AST nodes), we can also apply deep learning approaches to sequences of statements. Although some structural information may be lost, neighboring information is captured and local patterns can be extracted.

Treating a program as a 2-dimensional signal is an interesting, novel, and meaningful perspective, which is bionics-inspired. As we, human beings, always read source code on a 2-D screen, it is also possible for neural networks to model programs from this perspective. Indentation and line breaks on the 2-D screen are useful features because they suggest strong semantics of programs. Existing techniques in computer vision can be applied, e.g., the convolutional neural network (CNN). The CNN is analogous to the visual cortex of the human brain, and thus it has a solid biological foundation in cognitive science. Interestingly, as a bionics-inspired model, the deep CNN achieved unexpectedly high performance [37] even before pretraining methods were invented.

B. Integrating Priors about Programs into Network Architectures

Despite the fact that a unified architecture of deep neural networks is applicable to various tasks with high performance, we can also integrate human priors into the networks.

The CNN is an example that explicitly specifies the physical neighborhood information of an image. Physically neighboring pixels form local patterns (e.g., a circle, a line), which can be detected by convolution kernels. Fed forward to higher layers in the network, the local patterns give rise to high-level abstract features. Such abstract features are beneficial for the ultimate task (e.g., object recognition). Another widely-used domain-specific prior in deep learning is slowness [55], [56]: as features of image/acoustic data are not supposed to change too fast, penalties on variation are added to the cost function, so that the learned features are "slow."

For program analysis, priors can also be integrated into the neural networks. In one of our ongoing research efforts, we would like to capture the local features of ASTs; a tree-based convolutional neural network (TCNN) is proposed and studied, and primary results have been reported in Section V-B.

Another prior is that we can integrate formal methods of program analysis into neural networks (or vice versa). [57] is an example of neural reasoning for a knowledge base. For programs, even though all non-trivial properties are undecidable, formal methods can be viewed as an approximation with pure mathematical deduction, often giving the guarantee of either no false positives or no false negatives, which may be important to program analysis [58]. For now, it seems hard to combine these two techniques, but once they are combined, it will be beneficial for both.

C. Various Applications

The application of deep learning in the field of program analysis is another promising research topic, which is at least as important as, if not more important than, the theory of deep learning. Some applications are pointed out in Section I, including code clone detection, bug detection, and code retrieval.

In short, as deep learning is brand new to program analysis, the questions addressed in this part remain open in the literature. It is not clear which perspective is most proper for modeling programs, or which is most suitable for what application. It is also not very clear how to integrate human priors about programs into the neural network architecture. These are among the fundamental questions of deep learning when it is applied to the new field. Besides, real-world applications of deep learning are also important for program analysis.

VII. CONCLUSION

In this paper, we study deep learning and representation learning in the field of program analysis. We propose a novel "coding criterion" to build vector representations of nodes in ASTs, which makes deep learning a reality for program analysis. We also feed the learned representations to a deep neural network to classify programs.

The experimental results show that our representations successfully capture the similarity and relationships among different nodes in ASTs. The learned representations significantly improve supervised training for deep neural networks in terms of both optimization and generalization. We conclude that the coding criterion is successful in building program vector representations. The experiments also confirm the feasibility of deep learning to analyze programs, and show primary evidence of its success in the new field.

As a pioneering study, we address several fundamental problems, including the perspectives of program modeling, the integration of human priors, and the applications of deep learning.

To promote further research in the new field, we publish all of our datasets, source code, and learned representations online.

We believe, considering the fact that deep learning has made breakthroughs in many fields of artificial intelligence, along with the primary evidence reported in this paper, that deep learning will become an outstanding approach for program analysis in the near future. We call for more studies in this new, prospective field.
REFERENCES

[1] H. Lu, B. Cukic, and M. Culp, "Software defect prediction using semi-supervised learning with dimension reduction," in Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, 2012.
[2] S. Lee, C. Jung, and S. Pande, "Detecting memory leaks through introspective dynamic behavior modelling using machine learning," in Proceedings of the 36th International Conference on Software Engineering, 2014.
[3] K. Canavera, N. Esfahani, and S. Malek, "Mining the execution history of a software system to infer the best time for its adaptation," in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, 2012.
[4] A. Hindle, E. Barr, Z. Su, M. Gabel, and P. Devanbu, "On the naturalness of software," in Proceedings of the 34th International Conference on Software Engineering, 2012.
[5] G. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[6] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," The Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.
[7] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng, and C. Potts, "Recursive deep models for semantic compositionality over a sentiment treebank," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2013.
[8] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.
[9] D. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[10] G. Dahl, A. Mohamed, and G. E. Hinton, "Phone recognition with the mean-covariance restricted Boltzmann machine," in Advances in Neural Information Processing Systems, 2010.
[11] A. Mohamed, G. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
[12] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proceedings of the 25th International Conference on Machine Learning, 2008.
[13] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems, 2007.
[14] D. Erhan, P. Manzagol, Y. Bengio, S. Bengio, and P. Vincent, "The difficulty of training deep architectures and the effect of unsupervised pre-training," in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009.
[15] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, "Exploring strategies for training deep neural networks," The Journal of Machine Learning Research, vol. 10, pp. 1–40, 2009.
[16] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[17] D. Steidl and N. Gode, "Feature-based detection of bugs in clones," in 7th International Workshop on Software Clones, 2013.
[18] M. Chilowicz, E. Duris, and G. Roussel, "Syntax tree fingerprinting for source code similarity detection," in Proceedings of the IEEE 17th International Conference on Program Comprehension, 2009.
[19] A. Mnih and G. Hinton, "A scalable hierarchical distributed language model," in Advances in Neural Information Processing Systems, 2009.
[20] G. Miller, "WordNet: a lexical database for English," Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[21] F. Morin and Y. Bengio, "Hierarchical probabilistic neural network language model," in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2005.
[22] S. Ugurel, R. Krovetz, and L. Giles, "What's the code?: Automatic classification of source code archives," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
[23] K. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[24] R. Kaur and S. Singh, "Clone detection in software source code using operational similarity of statements," ACM SIGSOFT Software Engineering Notes, vol. 39, no. 3, pp. 1–5, 2014.
[25] Y. Udagawa, "Source code retrieval using sequence based similarity," International Journal of Data Mining & Knowledge Management Process, no. 4, 2013.
[26] N. Murakami and H. Masuhara, "Optimizing a search-based code recommendation system," in 3rd International Workshop on Recommendation Systems for Software Engineering, 2012.
[27] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
[28] A. Mnih and G. Hinton, "Three new graphical models for statistical language modelling," in Proceedings of the 24th International Conference on Machine Learning, 2007.
[29] E. Huang, R. Socher, C. Manning, and A. Ng, "Improving word representations via global context and multiple word prototypes," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 2012.
[30] J. Pane, C. Ratanamahatana, and B. Myers, "Studying the language and structure in non-programmers' solutions to programming problems," International Journal of Human-Computer Studies, vol. 54, no. 2, pp. 237–264, 2001.
[31] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[32] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[33] T. Mitchell, Machine Learning. McGraw Hill, 1997.
[34] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
[35] J. Hastad and M. Goldmann, "On the power of small-depth threshold circuits," Computational Complexity, vol. 1, no. 2, pp. 113–129, 1991.
[36] V. N. Vapnik, Statistical Learning Theory. Wiley, New York, 1998.
[37] Y. LeCun, L. Jackel, L. Bottou, A. Brunot, C. Cortes, J. Denker, H. Drucker, I. Guyon, U. Muller, and E. Sackinger, "Comparison of learning algorithms for handwritten digit recognition," in Proceedings of the International Conference on Artificial Neural Networks, 1995.
[38] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013.
[39] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in ICLR Workshop, 2013.
[40] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur, "Recurrent neural network based language model," in INTERSPEECH, 2010.
[41] I. Sutskever, J. Martens, and G. Hinton, "Generating text with recurrent neural networks," in Proceedings of the 28th International Conference on Machine Learning, 2011.
[42] I. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier, "Clone detection using abstract syntax trees," in Proceedings of the International Conference on Software Maintenance, 1998.
[43] F. Lazar and O. Banias, "Clone detection algorithm based on the Abstract Syntax Tree approach," in Proceedings of the 9th IEEE International Symposium on Applied Computational Intelligence and Informatics, 2014.
[44] F. Yamaguchi, M. Lottmann, and K. Rieck, "Generalized vulnerability extrapolation using abstract syntax trees," in Proceedings of the 28th Annual Computer Security Applications Conference, 2012.
[45] R. Socher, Q. Le, C. Manning, and A. Ng, "Grounded compositional semantics for finding and describing images with sentences," in NIPS Deep Learning Workshop, 2013.
[46] R. Socher, E. Huang, J. Pennin, C. Manning, and A. Ng, "Dynamic pooling and unfolding recursive autoencoders for paraphrase detection," in Advances in Neural Information Processing Systems, 2011.
[47] K. Hermann and P. Blunsom, "Multilingual models for compositional distributed semantics," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.
[48] R. Socher, D. Chen, C. Manning, and A. Ng, "Reasoning with neural tensor networks for knowledge base completion," in Advances in Neural Information Processing Systems, 2013.
[49] W. Cohen, R. Schapire, and Y. Singer, "Learning to order things," in Advances in Neural Information Processing Systems, 1998.
[50] E. Gabrilovich and S. Markovitch, “Computing semantic relatedness
using Wikipedia-based explicit semantic analysis.” in Proceedings of
the 20th International Joint Conference on Artificial Intelligence, 2007.
[51] C. Roy, J. Cordy, and R. Koschke, “Comparison and evaluation of code
clone detection techniques and tools: A qualitative approach,” Science
of Computer Programming, vol. 74, no. 7, pp. 470–495, 2009.
[52] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced
Approaches in Analyzing Unstructured Data. Cambridge University
Press, 2007.
[53] J. Wang, Y. Dang, H. Zhang, K. Chen, T. Xie, and D. Zhang, “Mining
succinct and high-coverage API usage patterns from source code,” in
Proceedings of IEEE Working Conference on Mining Software Reposi-
tories, 2013.
[54] M. Acharya, T. Xie, J. Pei, and J. Xu, “Mining API patterns as partial
orders from source code: From usage scenarios to specifications,” in
Proc. of ESEC/SIGSOFT FSE, 2007.
[55] C. Cadieu and B. Olshausen, “Learning transformational invariants from
natural movies,” in Advances in Neural Information Processing Systems,
2008.
[56] J. Bergstra and Y. Bengio, “Slow, decorrelated features for pretraining
complex cell-like networks,” in Advances in Neural Information Pro-
cessing Systems, 2009.
[57] R. Socher, D. Chen, C. Manning, and A. Ng, "Reasoning with neural tensor networks for knowledge base completion," in Advances in Neural Information Processing Systems, 2013.
[58] M. Pradel and T. Gross, "Leveraging test generation and specification mining for automated bug detection without false positives," in Proceedings of the 34th International Conference on Software Engineering, 2012.