Gene Prediction Using Unsupervised Deep Networks
Computer Science Programme (Opleiding Informatica)
MASTER’S THESIS
Leiden Institute of Advanced Computer Science (LIACS)
Leiden University
Niels Bohrweg 1
2333 CA Leiden
The Netherlands
Contents

1 Introduction                                            1
2 Related Work                                            2
  2.1 Supervised methods                                  3
  2.2 Unsupervised methods                                4

6 Experiments                                            29
  6.1 In search of Seven Clusters                        29
  6.2 Preliminary Experiments                            30
    6.2.1 Dataset                                        33
    6.2.2 Supervised Training                            33
    6.2.3 Unsupervised Features - Supervised Training    35
  6.3 UDN Gene Finder Experiments                        35
    6.3.1 Prediction results from K-Means                35
    6.3.2 Visualization of Self-Organizing Maps          36
Abstract
Gene finding is a well-studied topic in bioinformatics. Supervised methods have
shown great success in accurately annotating DNA sequences, of which Hidden Markov
Models have been the most widely used approach, both in supervised and unsupervised
settings. In this thesis we explore the possibilities of using unsupervised neural networks
for identifying signals in DNA sequences. First, it is shown that Convolutional Neural
Networks are capable of classifying DNA sequences with high specificity and reasonably
high sensitivity. Furthermore, we show that features extracted using Convolutional
AutoEncoders do not separate the data enough for either unsupervised or supervised
learning. Self-Organizing Maps, though, are able to create some significant visual
separation. To the extent of our knowledge this is the first proposal for an unsupervised
signal sensor for gene prediction, and the first time Convolutional AutoEncoders have
been used for feature extraction on DNA sequences.
1 Introduction
Gene finding is the process of identifying gene coding regions in DNA sequences. In disciplines
such as structural genomics and functional genomics, where the objectives are to predict the 3D
structure of gene products and to identify the function and/or interactions of certain genes, knowing
the exact location of gene coding regions in DNA/RNA sequences is required. In the early days
this was a slow process that involved experimenting with live organisms in labs. Fortunately, much
progress has been made thanks to our increasing understanding of DNA functions, the advancement
of bioinformatics tools, the use of statistical analysis and the computational resources available
nowadays.
Many approaches for gene finding already exist which rely on statistical analysis, pattern
recognition and signal processing, as well as homology based methods using information from
already well studied organisms.
In this thesis we investigate the possibilities of using Convolutional Neural Networks as signal
sensors for gene prediction, as well as using Unsupervised Neural Networks for feature extraction
and clustering. The goal is to identify the signals that precede coding regions and signify the
start of coding. Raw DNA sequence windows are encoded into one-hot encoding vectors,
from which features are extracted using Convolutional AutoEncoders (CAE). Self-Organizing
Maps (SOM) are then trained using the CAE feature representation of the sequences.
To the extent of our knowledge this is the first proposal for a completely unsupervised signal
sensor method for gene prediction. For this reason we cannot compare our method with any
other, and we used random classification as our baseline. Additionally, this is the first work
that uses Convolutional Neural Networks for feature extraction on DNA sequences.
We show that even though the SOM visualizations show some separation between the windows
that contain the aforementioned signal and windows that do not contain such a signal, the
difference is not big enough for a clustering algorithm or a supervised method to achieve any
significant results.
Additionally, we investigate the possibility that the seven-cluster structure [38] exists in the
CAE features. We show that no indication of such structure is present.
Figure 1: Categorization of gene finding approaches as given by Mathe et al. [3] [4]
2 Related Work
Gene finding is an easier task when working with prokaryotic sequences (see Section 3), due to
the high density of genes in the genome, as opposed to the high sparsity of genes in eukaryotic
genomes. The main challenge for prokaryotic gene finding is identifying overlapping genes.
In eukaryotic DNA sequences, coding regions can be very hard to identify due to the fact that
some may be very short, followed by long introns (non-coding regions). Additionally, alternative
splicing makes the process of identifying genes even more challenging.
Generally, the methods for gene finding are divided into two categories: i) extrinsic methods,
also known as empirical, comparative, similarity or homology based methods, and ii) intrinsic or
ab initio methods. Empirical methods, using local alignment ([6],[7],[10]), compare
sequences with known genes from a database and, given the degree of similarity, try to classify
regions as coding or non-coding for prokaryotic DNA, or as introns, exons, splice sites, etc. for
eukaryotic DNA. Comparing the DNA of closely related species to find similar regions also falls
under the empirical category. Most state-of-the-art software for gene finding uses a combination
of these approaches to achieve the best possible result.
Ab initio methods, on the other hand, rely on predictive models, statistical features and/or
signal properties to make predictions for the location of genes and even their structures.
Because of the uncertainty of results that always comes with predictive models, gene finding
using ab initio methods is more appropriately called gene prediction. Ab initio methods can
be further divided into signal sensors or content sensors [8]. Signal sensors detect “signals”
in the DNA sequence, such as splice sites, whereas content sensors use statistics, such as
codon distributions or nucleotide distribution, to make predictions about sequences and/or to
compare with other known genes.
Ab initio methods often yield lower specificity than comparative methods, while comparative
methods suffer from weaker sensitivity. For this reason many methods combine the two in
order to maximize predictive performance [9].
To the extent of our knowledge, the work of Nguyen et al. [28] is the only example of Deep
Neural Networks being applied to DNA sequences for classification and gene prediction.
2.1 Supervised methods
Of the ab initio methods, Hidden Markov Models (HMM) have been proven to be able to
accurately predict gene positions in prokaryotic organisms as well as whole gene structures in
eukaryotes. [Link] [11], HMMGene [12], the VEIL system [13], AUGUSTUS [14] and SNAP
[15] are among the many available HMM-based software packages for gene prediction
and/or gene structure prediction. Genie [16] uses a combination of an HMM along with
two neural network classifiers which use dinucleotide frequencies for splice site detection. Most
of the HMM-based approaches are considered to be content sensors, since they rely on the
probability distribution of the bases in a sequence.
GRAIL-I [17], GRAIL-II [17] and GeneParser [19] were among the earliest attempts to use
Neural Networks in order to find protein genes in DNA. The first iteration of GRAIL used
statistical features from a 99-base window in order to predict whether that window was coding
or non-coding. Due to the small window, it was prone to missing a large number of short exons.
GRAIL-II was an improvement over the first iteration of the system, which solved the problem
of short exon identification by introducing variable-length windows. It also included
additional features, such as transcription promoter recognition, PolyA site recognition, protein
coding region prediction, gene model construction, translation to protein and DNA/protein
database searching capabilities. For the gene prediction process, first a rule-based method is
applied to discard most of the improbable candidate sequences. Then three neural networks
are used to evaluate the remaining candidates. The sequences that pass this screening are
clustered into three groups. From each group the top candidates are picked as being likely to
be exons.
GeneParser uses a method similar to GRAIL, but combines it with a Dynamic Programming
(DP) algorithm. Similar in spirit to Dynamic Programming for speech segmentation [21],
DNA and RNA sequences can be considered as languages with different grammatical
constraints and with the benefit of having discrete values in the sequence. Given the likelihood
of a segment belonging to one or more classes, DP finds the maximum likelihood solution
while enforcing the grammatical constraints.
GlimmerM [37] is an HMM-based system which uses Dynamic Programming techniques to
efficiently search through all the possible combinations of exons for inclusion in the gene model.
To select the best of all these combinations, Interpolated Markov Models (IMM) are used, trained
on complete coding regions. For splice site prediction, Maximal Dependence Decomposition
(MDD) and first-order Markov Chains are used.
Li et al. [20] propose the usage of a small neural network for predicting protein coding genes
in the yeast genome. The neural network takes a 12-dimensional input vector, which contains the
frequency of each nucleotide at each of the three positions of the codons in the reading frame.
Since there are 4 nucleotides and 3 possible positions in a codon (a triplet of nucleotides), we
end up with a 12-dimensional feature vector. For the single hidden layer of the network they
used 21 units, while for the output they used a single unit, with intended output 0 when an
input vector does not represent a coding region and 1 when it does.
Chicco et al. [22] proposed a system for gene annotation, where either an AutoEncoder (AE)
Neural Network or Singular Value Decomposition is used to obtain a real-valued, noisy reconstruction
of a binary feature vector. If the value of the reconstruction is above a predefined threshold,
it is an indication that the feature is present, or absent if the value is below the threshold.
This method helps identify features that may be missed by other methods, or indicate false
positives in other cases.
Bayesian Networks, in combination with Hidden Markov Models, have been used for combin-
ing gene predictions from multiple systems, replacing the naive majority voting scheme [23]
and showing significant improvement over individual systems or combination through majority
voting.
Evolved Neural Networks [32], which are neural networks that use evolutionary computation
for training, were used for identifying functional RNA (fRNA)
genes in H. Sapiens and C. Elegans [33]. fRNA genes can be harder to identify as they may
have short coding regions and do not use the same signals as common protein coding genes.
TigrScan and GlimmerHMM [34] are two more HMM-based frameworks for gene prediction
in eukaryotic organisms. They both have modular and extensible architectures and are open
source. Their main difference is that while TigrScan uses weight matrices and Markov
Chains, GlimmerHMM additionally utilizes splice site models adapted from GeneSplicer [35],
a system for splice site detection for various eukaryotic organisms, and decision trees adapted
from GlimmerM.
BRAKER1 [36] is a pipeline framework which uses the semi-supervised version of GeneMark,
GeneMark-ET, for making gene predictions. RNA-Seq information is used in the process,
extracting spliced alignment information. GeneMark-ET uses the genome data along with
the RNA-Seq extracted data for unsupervised training. The predicted genes are then used by
AUGUSTUS to further create refined gene models, again using RNA-Seq as extrinsic evidence
during the training.
Though Convolutional Neural Networks have been used for DNA sequence related tasks,
such as annotating gene expressions [25][26] and gene structure prediction [27], there is little
research on the use of CNNs for gene prediction and annotation.
Nguyen et al. [28] used Convolutional Neural Networks for binary classification of DNA
sequences. They used as input a one-hot encoding, inspired by text classification
methods, of sequences 500 nucleotides long, and they labeled each window as 0 or 1 depending
on the classification task. They trained networks for identifying splice site junctions, promoter
sequences, and parts of the DNA that wrap around histone proteins. Using this method they
achieved an increase of 1%-6% compared to the state of the art, reaching an accuracy of 99.06%
for promoter recognition and 95.87% for splice site recognition.
In this thesis we also experiment with Convolutional Neural Networks for DNA sequence
classification. The main differences from Nguyen et al.'s work are that we employ a different
kind of one-hot encoding scheme and use a smaller window size of 200 nucleotides. We achieve
a sensitivity of 84.41%, specificity of 98.75%, f1 score of 0.8432, and accuracy of 91.58% on
the genome of E. Coli with our best setup.
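As an illustration of the kind of input used throughout this thesis, the sketch below one-hot encodes a DNA window in Python. This is only a minimal sketch: the exact scheme of Section 5.1 is not reproduced in this excerpt, so the A/C/G/T-to-unit-vector mapping and the handling of unknown bases shown here are assumptions of the example.

import numpy as np

# Minimal sketch of one-hot encoding a DNA window (assumed mapping).
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(window):
    """Encode a nucleotide string as a (len(window), 4) one-hot matrix."""
    encoding = np.zeros((len(window), 4), dtype=np.float32)
    for position, base in enumerate(window.upper()):
        index = BASE_INDEX.get(base)
        if index is not None:        # ambiguous bases (e.g. 'N') stay all-zero
            encoding[position, index] = 1.0
    return encoding

print(one_hot_encode("ACGT" * 50).shape)   # a 200-nucleotide window -> (200, 4)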
already implicitly known, but formally studied.
Mahony et al. [24] (RescueNet) used Self-Organizing Maps with relative synonymous codon
usage (RSCU) encoding. This method, though unable to yield state-of-the-art results, is able
to identify multiple gene models within a genome and identify some genes that other methods
miss. Additionally, it shows the potential of SOM networks for gene prediction.
GeneMarkS [40] is an HMM-based approach, similar to the original GeneMark. It uses an
iterative training process over unlabeled data to train the HMM on prokaryotic sequences.
This method was later improved with GeneMark ES-3.0 [41] [42] (E-Eukaryotic, S-Self
training, 3.0-version number), which was used on novel eukaryotic genomes, yielding state-of-the-art
results comparable to or better than other supervised methods. GeneMark-ES is able to
identify exons with 88.9% sensitivity and 89.2% specificity in the fungal genome of [Link].
Chan et al. [29] proposed a pipeline framework, where multiple HMM models are self-trained
for any given organism. The framework includes data set preparation, parameter estimation for
the specific genome, training of HMM-based models using existing frameworks, namely GlimmerHMM,
AUGUSTUS, SNAP and MAKER2 [30], and an additional step for result filtering,
resulting in correct annotation of at least 95% of the BUSCO plantae dataset [31].
In our method we approach unsupervised learning from a signal sensor direction, where we
try to identify signals in the genome that signify the start of a protein coding region. By taking
sequences 200 nucleotides long, we encode them in an array of one-hot vectors (Section 5.1)
and train a Convolutional AutoEncoder (CAE) to reconstruct those sequences. We convert the
encoded representation of the CAE to the Bag of Surrogate Parts (BoSP, Section ??) which
is used as a feature vector for each sequence. These feature vectors are then used to train a
Self-Organizing Map. We show that, even though in visualizations there is separation between
sequences that contain the signal and sequences that do not, there is also a significant
overlap between them. Results from clustering the feature vectors are inconsistent and yield
an average sensitivity of 70% and specificity of 51%.
Figure 2: Image showing the transcription
process. [1]
Figure 3: Image showing the translation
process [1].
words which specify which amino acid should be added next to create the specific protein
which this gene encodes. Codons are triplets of nucleobases, of which we have 4 different
types. Hence, a codon can take a total of 4^3 = 64 possible combinations. This
would mean that codons have the ability to code for 64 distinct amino acids, but only 20
amino acids are actually used in protein synthesis. As a result multiple codons can code for the
same amino acid, creating redundancy, and thus making the protein synthesis process more
tolerant to transcription errors, mutations, etc.
In prokaryotic DNA, genes are densely packed and may take up to 90% of the whole
sequence (E. Coli), while in eukaryotic DNA genes are usually sparse, and could make up as
little as 3% of the whole DNA sequence (Homo Sapiens). Moreover, genes differ structurally
between eukaryotic and prokaryotic organisms. Below, a short description will be given of
the basic building blocks of genes for prokaryotes and eukaryotes separately, and the basic
differences will be highlighted.
Protein synthesis can be viewed as two separate steps, transcription and translation. During
transcription, in eukaryotes, RNA is produced inside the cell nucleus. The hydrogen bonds that
hold the DNA double helix together are broken around the gene to be transcribed, exposing
the strands (Figure 2). The strand where the gene is located is then used as a template
for the synthesis of the RNA molecule, where Uracil binds across Adenine, Adenine across
Thymine, Guanine across Cytosine and vice versa. When the RNA synthesis has reached the
end of the gene, it detaches from the DNA strand, leaves the cell nucleus and heads to the
ribosome to undergo the translation process (Figure 3). It is worth noting at this point that the
RNA product of transcription is different for prokaryotes and eukaryotes. For prokaryotes the
product is messenger RNA (mRNA) which needs no further processing, while in eukaryotes the
first product is called the primary transcript. The primary transcript goes through post-transcriptional
modification which produces heterogeneous nuclear RNA (hnRNA) which, in turn, undergoes
splicing, stripping away the introns (non-coding regions inside the gene, which are
not present in prokaryotes).
Figure 4: Image showing the transcription process in prokaryotic organisms [1]. The
highlighted area in the DNA strand indicates the position of the gene.

Figure 5: Image showing the translation process in eukaryotic organisms [1]. In this
example the gene contains 5 introns, which are the non-coding parts of the gene. They
are included in the primary RNA transcript as shown below, and are stripped away
during splicing in order to produce the final mRNA.
It is often the case that ORFs are grouped together into a single operon [1]. Operons can
contain multiple coding regions for different proteins, which usually serve related functions, after a
single promoter. All protein templates within one operon will be transcribed into a single mRNA.
Figure 6: Data Flow Diagram of the end goal of this project.
been significant increase in the quantity and quality of available datasets for computer vision
tasks, speech recognition and others.
Unsupervised training, on the other hand, is used for clustering data, or to estimate values for
latent variables of an underlying distribution of the data. Philosophical discussions can be had
on whether a system is ever unsupervised, since there is always some kind of information/parameters
given to the system (e.g. the number of clusters).
In unsupervised learning one of the biggest challenges is feature extraction and feature selection.
It is important to create discriminative features for the data, which will separate them
in the desirable manner. In many cases a single method creates many redundant features, from
which only a few are selected, in order to remove features that only add noise to the data and
also to make the training process more efficient.
Semi-supervised learning sits somewhere in between supervised and unsupervised learning,
where the model tries to learn a specific task from sparsely labeled data.
The end goal of this thesis is to have a completely unsupervised method for identifying signals
in DNA sequences.
Within the machine learning field, one family of models, called Artificial Neural Networks
(ANN or simply NN), has gained a lot of popularity over the years.
In this project the goal is to make use of Unsupervised Neural Networks in order to cluster
DNA sequences in a way such that signals that signify the start of a coding region are grouped
separately from coding and non-coding regions. A data flow diagram of the system can be
seen in Figure 6. Raw nucleotide sequences are encoded as vectors of one-hot vectors (Section
5.1) which are used to train a Convolutional AutoEncoder (Section 4.2.5). The dataset is
then converted into an encoded feature representation. This representation is fed to a Self-
Organizing Map (Section 4.2.6) in order to visualize and cluster the sequences.
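The sketch below outlines this data flow in Python. The CAE encoder and the SOM are replaced by crude stand-ins (an average-pooling function and a random grid assignment) purely to show how the stages connect; the real components are the ones described in Sections 4.2.5, 4.2.6 and 5.

import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(window):
    m = np.zeros((len(window), 4), dtype=np.float32)
    for i, b in enumerate(window):
        if b in BASE_INDEX:
            m[i, BASE_INDEX[b]] = 1.0
    return m

def placeholder_cae_encode(x):
    # Stand-in for the trained CAE encoder: pools the one-hot matrix into a short vector.
    return x.reshape(10, -1).mean(axis=1)

def placeholder_som_map(features):
    # Stand-in for the SOM: assigns each feature vector a random 2-D grid cell.
    rng = np.random.default_rng(0)
    return rng.integers(0, 10, size=(len(features), 2))

windows = ["ACGT" * 50, "TTGACA" + "A" * 194]   # toy 200-nucleotide windows
encoded = [placeholder_cae_encode(one_hot(w)) for w in windows]
grid_positions = placeholder_som_map(encoded)
print(grid_positions)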
Figure 7: Left: A single perceptron with d inputs and the associated weights. Right:
A 2-dimensional representation of a plane, and how the weights affect it.

Figure 8: Left: A network of 2 perceptrons sharing an input vector. Right: Example of
decision boundaries produced by a multi-class linear discriminant.
a supervised learning setting. For the simplest form of NNs, the single layer perceptron [43],
neurons are simple processing nodes that take an input vector x of length N and produce one
output, which is a linear combination of the inputs. The output of the neuron has the following form:

$y = w_0 + x_1 w_1 + \cdots + x_N w_N = w_0 + \sum_{i=1}^{N} x_i w_i = w_0 + w^T x \qquad (1)$

where $x_i$ is the $i$-th input of the neuron, $w_i$ is the weight associated with that input and
$w_0$ is called the bias. For convenience, equation 1 usually takes the following form

$y = \sum_{i=0}^{N} x_i w_i = w^T x, \qquad x_0 = 1 \qquad (2)$
Equation 2 can be seen as the equation of a hyperplane,

$a x + b y + c z + \cdots = 0 \qquad (3)$

in which case $x_i$, $w_i$ are the independent variable and the slope in dimension $i$ respectively,
and $w_0$ is the intercept of the plane. The N-dimensional hyperplane is a linear discriminant
function, which creates two regions.
We can “learn” the parameters of the hyperplane using a training set, in order to move the
hyperplane to a position where it optimally separates two classes of our data.
The perceptron is just a single-node NN with a non-linear activation function:

$a = \sum_{i=0}^{N} x_i w_i, \qquad g(a) = \begin{cases} +1 & \text{if } a \geq 0 \\ -1 & \text{if } a < 0 \end{cases} \qquad (4)$
A network of c perceptrons which share the same input vector, but each with its own
set of weights, creates c discriminant functions, or c convex regions, which can be used for
multi-class problems.
The algorithm for learning the weights of the network is called Gallant's pocket algorithm
and is shown in Algorithm 1.
The core of the whole algorithm is the weight update rule, shown in Equation 5.
Algorithm 1 Perceptron training algorithm
1: weight_matrix := random()
2: pocket := weight_matrix
3: best_run := 0, current_run := 0
4: while current_run < num_examples do
5:   example := select_random_example()
6:   if example is misclassified then
7:     apply_weight_update_rule()
8:     current_run := 0
9:   else
10:    current_run++
11:    if current_run > best_run then
12:      best_run := current_run
13:      pocket := weight_matrix
14:    end if
15:  end if
16: end while
$\Delta w = \eta \, d \, x \qquad (5)$

where η is a predefined learning rate (usually around 0.1), d is the label of the example, and
x is the input. The motivation behind this is that if x is misclassified and the target is d = 1,
then $w^T x$ should be bigger. Alternatively, if x is misclassified and d = −1, $w^T x$ should be
smaller.
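A compact Python sketch of Algorithm 1 with the update rule of Equation 5 is given below; the labels are assumed to be in {−1, +1}, the bias is folded in via x0 = 1, and the early-stopping heuristic at the end is an illustrative choice rather than part of the original algorithm.

import numpy as np

def pocket_perceptron(X, d, eta=0.1, max_iters=10000, seed=0):
    rng = np.random.default_rng(seed)
    X = np.hstack([np.ones((len(X), 1)), X])      # prepend x0 = 1 for the bias
    w = rng.normal(size=X.shape[1])
    pocket, best_run, current_run = w.copy(), 0, 0
    for _ in range(max_iters):
        i = rng.integers(len(X))
        if np.sign(X[i] @ w) != d[i]:             # misclassified example
            w = w + eta * d[i] * X[i]             # Equation 5
            current_run = 0
        else:
            current_run += 1
            if current_run > best_run:            # keep the best weights "in the pocket"
                best_run, pocket = current_run, w.copy()
        if current_run >= len(X):                 # heuristic stop after a long clean run
            break
    return pocket

# Toy linearly separable data: class +1 lies above the line x1 + x2 = 1.
X = np.array([[0.0, 0.0], [0.2, 0.3], [1.0, 1.0], [0.9, 0.8]])
d = np.array([-1, -1, 1, 1])
w = pocket_perceptron(X, d)
print(np.sign(np.hstack([np.ones((4, 1)), X]) @ w))   # should recover [-1 -1  1  1]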
Gallant's algorithm is proven to converge as long as the data are linearly separable, meaning
that there exists a hyperplane that perfectly separates the two classes. A simple example
of a problem that cannot be solved by a linear discriminant, and thus is not linearly separable,
is the XOR problem.
In order to solve more complex problems, layered networks are required, which are able to
create convex regions whose edges are used as the decision boundaries.
We define an error function for our network, which we will try to minimize. For classification
tasks the most common error function used is the mean square error:

$E = \frac{1}{2} \sum_{\text{outputs}} (y_i - d_i)^2 \qquad (8)$
By writing out the error function as a function of the network weights, we can then calculate
the partial derivatives of the output nodes, which will give us the direction in which the error
decreases the fastest. We can use this to adjust the weights in order to minimize the error.
$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} \qquad (10)$

where η is the predefined learning rate. This is the gradient descent algorithm.
In order to calculate the partial derivative $\frac{\partial E}{\partial w_{ij}}$ we can expand it using the chain rule (Equation 11):

$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial x_j} \frac{\partial x_j}{\partial w_{ij}} \qquad (11)$
where $x_j$ is the input of node $j$, which is the weighted sum of the outputs of the nodes in
the previous layer, and $y_j$ is the output of node $j$, thus $y_j = g(x_j)$, where $g()$ is an arbitrary
activation function. This means that we need three things in order to complete the update rule
for each weight: the partial derivative of the error function w.r.t. the output of each neuron
(which is frequently called the error message or delta), the partial derivative of the output of
each neuron w.r.t. its input, and the partial derivative of the input w.r.t. each corresponding
weight.
The second term, the derivative of the output w.r.t. the input, is simply the derivative
of the activation function, which is shown in Equation 12:

$\frac{\partial y_j}{\partial x_j} = \frac{\partial g(x_j)}{\partial x_j}; \qquad \text{when } g(x) = \frac{1}{1 + e^{-x}}, \quad \frac{\partial g(x_j)}{\partial x_j} = g(x_j)\bigl(1 - g(x_j)\bigr) \qquad (12)$

where, again, $g()$ denotes any activation function. In the example above the logistic activation
function is used.
The last term, the partial derivative of the input w.r.t. a specific weight, is simply the
corresponding input to node $j$, as shown in Equation 13:

$\frac{\partial x_j}{\partial w_{ij}} = \frac{\partial \left( \sum_{k=1}^{n} w_{kj} \, y_k \right)}{\partial w_{ij}} = y_i \qquad (13)$

where $y_i$ is the output of node $i$ in the previous layer, because only one term in the sum, $y_k$
with $k = i$, depends on $w_{ij}$.
For the first term, the partial derivative of the error function w.r.t. the output of a specific
neuron, it is more straightforward if the neuron is in the output layer, as seen in Equation 14:

$\frac{\partial E}{\partial y_j} = \delta_j = \frac{\partial}{\partial y_j} \frac{1}{2}(y_j - t_j)^2 = y_j - t_j, \qquad \text{when } E = \frac{1}{2}(y - t)^2 \qquad (14)$
For neurons in an arbitrary hidden layer, things get a bit more complicated, as seen in Equation 15:

$\frac{\partial E}{\partial y_j} = \delta_j = \frac{\partial E(x_u, x_v, \ldots, x_w)}{\partial y_j} = \sum_{l \in L} \left( \frac{\partial E}{\partial x_l} \frac{\partial x_l}{\partial y_j} \right) = \sum_{l \in L} \delta_l \, w_{jl} \qquad (15)$

where L denotes the layer above, and $w_{jl}$ is the weight from neuron $j$ to neuron $l$ in the layer
above. This is, thus, a recursive function which we can use to calculate the derivative
for $y_j$ if we know all the derivatives with respect to $y_l$ in the layer above. So we are basically
back-propagating the error, weighted by the weights of the connections between neurons.
The update rule can be summarized as

$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}, \quad \text{where}$

$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial x_j} \frac{\partial x_j}{\partial w_{ij}} = \begin{cases} (y_j - t_j)\, y_j (1 - y_j)\, y_i & \text{if } j \text{ is an output node} \\ \left( \sum_{l \in L} \delta_l w_{jl} \right) y_j (1 - y_j)\, y_i & \text{if } j \text{ is a hidden node} \end{cases} \qquad (16)$

where $y_j$ is the output of the $j$-th node in the current layer, and $y_i$ is the output of the $i$-th
node in the previous layer. In this derivation it is assumed that the squared error function and
a logistic activation function for the neurons are used.
In short, the steps are:
i. apply the input vector x and forward propagate through the network to produce outputs for
all nodes using Equation 2
ii. compute the deltas and gradients of the output units, as shown in Equation 16
iii. back-propagate the deltas and compute the deltas and gradients of the hidden units,
also shown in Equation 16
iv. update the weights by taking a small step in the direction of steepest descent.
We repeat this process for all examples in the training set, for a pre-specified number of
epochs (iterations over the whole training set), or until the error over all examples converges.
There are no guarantees that a neural network will converge, since there are many factors to
consider when training a neural network, all of which affect the behavior of the network. Such
factors are the number of layers, the number of nodes per layer, the learning rate, the number of
training examples, the cost function, the activation functions, etc. There are some rules of thumb for
choosing these values, but in practice a lot of parameter tuning has to be done in order to
get optimal results.
Algorithm 2 Feed-Forward Neural Network training algorithm
1: weight_matrix := random()
2: epoch := 0, error_dif := max_float, epoch_error := 0, previous_error := max_float
3: while epoch < max_epochs and error_dif > converge_threshold do
4:   previous_error := epoch_error
5:   epoch_error := 0
6:   for i := 0; i < dataset_length; i++ do
7:     for j := 0; j < num_layers; j++ do
8:       for k := 0; k < num_nodes[j]; k++ do
9:         activation[j][k] := compute node activation using Equation 4
10:      end for
11:    end for
12:    output_error := sum of errors of output nodes
13:    for j := num_layers-1; j > -1; j-- do
14:      for k := 0; k < num_nodes[j]; k++ do
15:        gradient[j][k] := compute node gradient using Equation 16
16:        weight_matrix[j][k] := weight_matrix[j][k] - η * gradient[j][k]
17:      end for
18:    end for
19:    epoch_error += output_error
20:    error_dif := abs(previous_error - epoch_error)
21:  end for
22:  epoch := epoch + 1
23: end while
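To make the update rules concrete, the following NumPy sketch performs one stochastic backpropagation step for a single-hidden-layer network with logistic activations and squared error, mirroring Equations 10-16. Biases are omitted and all sizes are illustrative; this is not the exact setup used elsewhere in this thesis.

import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, W2, eta=0.1):
    # Forward pass
    h = logistic(W1 @ x)              # hidden outputs y_j = g(x_j)
    y = logistic(W2 @ h)              # network outputs
    # Output-layer deltas: (y - t) * g'(x), with g' = y (1 - y)     (Equation 14)
    delta_out = (y - t) * y * (1.0 - y)
    # Hidden-layer deltas: back-propagated weighted sum             (Equation 15)
    delta_hidden = (W2.T @ delta_out) * h * (1.0 - h)
    # Gradient descent updates                                      (Equations 10, 16)
    W2 -= eta * np.outer(delta_out, h)
    W1 -= eta * np.outer(delta_hidden, x)
    return W1, W2

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(5, 3))   # 3 inputs -> 5 hidden units
W2 = rng.normal(scale=0.1, size=(2, 5))   # 5 hidden -> 2 outputs
x, t = rng.normal(size=3), np.array([1.0, 0.0])
W1, W2 = backprop_step(x, t, W1, W2)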
other fields, such as recommender systems [46], speech recognition [48], and natural language
processing [49].
Traditional MLPs suffer from the curse of dimensionality. Let us consider a simple network
that takes as input a small image of size 32x32 with 3 color channels. For such a network,
a node in the first hidden layer would require 32x32x3 = 3072 weights. By increasing the size
of the image to 200x200, which by today's standards is also considered small, each neuron in
the hidden layer would require 200x200x3 = 120000 weights, which is two orders of magnitude higher.
Additionally, MLPs do not take into consideration the spatial structure of the data, meaning that
they treat inputs that are far apart in the same way as inputs that are close together.
CNNs, on the other hand, try to take advantage of the strong local correlation of pixels that
is usually found in natural pictures, while requiring fewer weights. They achieve that by having
two main differences from traditional MLPs: sparse connectivity and weight sharing.
Typical CNNs consist of 4 basic building blocks
i) Convolutional layers
Figure 9: Example of connectivity of nodes
in convolutional layer
Figure 10: Example of features learned
by the first convolutional layer in network
trained on image data [47].
$y_{i'j'k'} = b_{f'} + \sum_{i=1}^{H_f} \sum_{j=1}^{W_f} \sum_{k=1}^{D_f} x_{i'+i-1,\, j'+j-1,\, k'+k-1} \, w_{ijk} \qquad (17)$

where $b_{f'}$ is the bias term for the filter, $i'j'k'$ are the indices of the position where the filter
is applied on the input and also the indices of the value in the feature map that is calculated,
and $w_{ijk}$ is the weight at position $ijk$ in the filter matrix.
By shifting and applying the filter on the whole input step by step, we produce the complete
feature map for a single filter, which is the convolution of the input with that specific filter as
a convolution kernel. Convolutional layers usually include multiple filters, whose feature maps
become the depth dimension for the convolutional layer above.
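A direct (and deliberately unoptimized) NumPy version of Equation 17 for a single filter is sketched below; the array shapes and the “valid” sliding range are assumptions made for the example.

import numpy as np

def conv_feature_map(x, w, bias=0.0):
    """Compute one feature map for a H x W x D input and a Hf x Wf x Df filter."""
    H, W, D = x.shape
    Hf, Wf, Df = w.shape
    assert D == Df, "filter depth must match input depth"
    out = np.zeros((H - Hf + 1, W - Wf + 1))
    for i0 in range(out.shape[0]):
        for j0 in range(out.shape[1]):
            # Equation 17: sum over the filter volume placed at position (i0, j0)
            out[i0, j0] = bias + np.sum(x[i0:i0 + Hf, j0:j0 + Wf, :] * w)
    return out

x = np.random.default_rng(0).normal(size=(8, 8, 3))
w = np.random.default_rng(1).normal(size=(3, 3, 3))
print(conv_feature_map(x, w).shape)   # (6, 6)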
where N is the size of one dimension of the N×N input (assuming a square image as input)
and m is the size of one dimension of the m×m filter.
As with the MLP we have the same components which we have to calculate. Furthermore,
similar to a fully connected layer, the second and third components respectively become Equation 21 and Equation 22:

$\frac{\partial y_{ij}}{\partial x^{l}_{ij}} = g'(x_{ij}) \qquad (21)$

$\frac{\partial x^{l}_{ij}}{\partial w_{ab}} = y^{l-1}_{(i-a)(j-b)} \qquad (22)$

And lastly, the first component:
$\frac{\partial E}{\partial y^{l}_{ij}} = \delta^{l}_{ij} = \sum_{i'=0}^{m-1} \sum_{j'=0}^{m-1} \frac{\partial E}{\partial x^{l+1}_{(i-i')(j-j')}} \frac{\partial x^{l+1}_{(i-i')(j-j')}}{\partial y^{l}_{ij}} = \sum_{i'=0}^{m-1} \sum_{j'=0}^{m-1} \frac{\partial E}{\partial x^{l+1}_{(i-i')(j-j')}} w_{ab}$

$\frac{\partial E}{\partial y^{l}_{ij}} = \sum_{i'=0}^{m-1} \sum_{j'=0}^{m-1} \frac{\partial E}{\partial y^{l+1}_{(i-i')(j-j')}} \frac{\partial y^{l+1}_{(i-i')(j-j')}}{\partial x^{l+1}_{(i-i')(j-j')}} w_{ab} = g'(x^{l}_{ij}) \sum_{i'=0}^{m-1} \sum_{j'=0}^{m-1} \delta^{l+1}_{i'j'} w_{ab} \qquad (23)$

which, intuitively, is the weighted cumulative error message of the next layer w.r.t. each input
from the current layer. In the term $\frac{\partial x^{l+1}_{(i-i')(j-j')}}{\partial y^{l}_{ij}}$, only the weight $w_{ab}$ with $a = i' - i$
and $b = j' - j$ survives for each term of the sum. So we can rewrite the function as
$\frac{\partial E}{\partial y^{l}_{ij}} = g'(x^{l}_{ij}) \sum_{i'=0}^{m-1} \sum_{j'=0}^{m-1} \delta^{l+1}_{i'j'} w_{(i'-i)(j'-j)} = g'(x^{l}_{ij}) \bigl(\delta^{l+1} * \mathrm{flip180}(w^{l+1})\bigr)_{ij} \qquad (24)$
By looking at it a little closer, it looks like a convolution of the input of layer $l + 1$ with the
filter matrix w, but the indices of w are now $w_{(i-a)(j-b)}$ instead of $w_{(i+i')(j+j')}$. We can flip
the weight matrix by 180 degrees and thus make it a convolution:

$\frac{\partial E}{\partial y^{l}_{ij}} = \delta^{l}_{ij} = g'(x^{l}_{ij}) \, \bigl(\mathrm{flip180}(w) * \delta^{l+1}\bigr)_{ij} \qquad (25)$
Putting it all together again, Equation 20 becomes:

$\frac{\partial E}{\partial w_{ab}} = \sum_{i=0}^{N-m} \sum_{j=0}^{N-m} \delta^{l}_{ij} \, y^{l-1}_{(i-a)(j-b)} \, g'(x^{l}_{ij}) = \bigl(\delta^{l} \, g'(x^{l}) * \mathrm{flip180}(y^{l-1})\bigr)_{ab} \qquad (26)$
Finally, we can compute the new weights by using the standard update rule used for the
MLP.
Now, if you are thinking “oh, finally we are done”, you probably forgot that in the beginning
of this section it was mentioned that typical CNNs consist of 4 different types of layers, and
we just described one of them. Luckily for me the rest of the layers are easier to describe.
Figure 11: Visualization of the four most common activation functions used in Artificial NNs.

Figure 12: Example of a simple RBM showing the connections between layers, omitting the
biases for simplification.
Rectifier
In more recent years, in larger network structures, the Rectifier activation function [50] has
become increasingly popular and is preferred to the more traditional logistic and tanh functions.
The Rectifier is defined as $f(x) = \max(0, x)$.
4.1.4 Cost functions
So far we have seen most of the fundamental building blocks of Neural Networks and
Convolutional Neural Networks, but thus far the assumption has been that the mean square error
cost function was used in order to derive the gradients in the final layer. There are a
number of cost functions that are more suitable for some tasks than others.
where n is the number of output nodes, $y_i$ is the $i$-th output given by the network and $d_i$ is
the desired output for node i.
In order to calculate the gradients for each node in the output layer, we also need to calculate
the derivative of the error function with respect to the output of each node.
$\frac{\partial E_{mse}}{\partial y_i} = y_i - d_i \qquad (29)$
Respectively, the derivative of the cross entropy cost function is computed as follows:

$\frac{\partial E_{ce}}{\partial y_i} = \frac{y_i - d_i}{(1 - y_i)\, y_i} \qquad (31)$
4.1.5 Improving the learning process
Batch Training
In Section 4.1.2 we saw an overview of the stochastic back-propagation algorithm, where we
took each example in the training set and applied it to the input of the neural network, did
a forward pass, followed by a back propagation of the gradients. Doing so for each example
independently is inefficient and suboptimal, as each example's gradient may not be representative
of its class, guiding the network's weights in the wrong directions and taking longer
to converge.
An alternative way of training is using the batch method. Here a predefined batch size k
is selected, where a forward pass is done for k examples. Once the batch forward passes are
completed the mean error is calculated for the final layer, and it is used to do a single back
propagation.
This, obviously, is much more efficient as it requires just a single backwards pass every k
examples, and it also helps the network converge faster, since each batch may contain examples
from more than a single class.
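The sketch below contrasts this with the stochastic scheme by averaging gradients over batches of k examples before each update; a linear model with squared error stands in for the network, so the gradient function is only illustrative.

import numpy as np

def gradient(w, x, d):
    return (w @ x - d) * x                    # dE/dw for E = 0.5 * (w.x - d)^2

def minibatch_updates(w, X, D, k=8, eta=0.01):
    for start in range(0, len(X), k):
        batch_X, batch_D = X[start:start + k], D[start:start + k]
        g = np.mean([gradient(w, x, d) for x, d in zip(batch_X, batch_D)], axis=0)
        w = w - eta * g                       # one update per batch of k examples
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
D = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])
w = minibatch_updates(np.zeros(5), X, D)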
Dropout
Dropout [58] is another method for improving model generalization and reducing overfitting.
The idea behind this is that in an ideal scenario, multiple networks with different architectures
would be trained on different subsets of the training set, and would finally vote for the classification.
But given that in many cases there are not enough data to train different networks,
that finding optimal parameters for different architectures is a tedious task, and that the time and
resources to train and evaluate many networks are prohibitive in most cases, this approach is not
feasible.
Figure 13: Left: Example of a fully connected network. Right: The same network when dropout
is applied.

By introducing dropout, Srivastava et al. try to simulate this process by using a single network.
By having nodes randomly “dropped out” of the network, along with all their incoming and
outgoing connections, with probability 0.5 for a single batch iteration, essentially a thinned
network is trained for that specific batch of examples. By choosing different nodes for dropout
for each batch, in a network of n nodes there are 2^n possible thinned networks with shared
weights that can be trained.
Finally, while dropout simulates the effect of having multiple neural networks for classification,
the goal is to have a single neural network during the evaluation process. This means that
when making predictions, the weights have to be scaled proportionally to the probability of
the dropout.
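A minimal sketch of this procedure is shown below: during training each node's output is dropped with probability 0.5, and at evaluation time the activations are scaled by the keep probability (which, for a fixed layer, is equivalent to scaling the outgoing weights as described above). Variable names and the layer shape are illustrative.

import numpy as np

def dropout_forward(activations, p_drop=0.5, training=True, rng=None):
    if training:
        rng = rng or np.random.default_rng()
        mask = rng.random(activations.shape) >= p_drop   # keep each node with prob 1 - p_drop
        return activations * mask
    # Evaluation: use all nodes, scaled to match the expected training-time output.
    return activations * (1.0 - p_drop)

h = np.random.default_rng(0).random(10)
print(dropout_forward(h, training=True, rng=np.random.default_rng(1)))
print(dropout_forward(h, training=False))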
where n is the number of elements in each feature map, M is the number of feature maps
and P is the normalized activations of each feature map
$P^{i}_{j} = \frac{A^{i}_{j}}{\max(A^{i})} \qquad (38)$
These features give us an indication of how much each filter has been activated given a
certain input. In order to reduce the noise from the activations and make the features more
discriminative, an additional step is taken, reducing the activations of each map to 0 if they
are below the mean of the activations of that map.
$P^{i}_{j} = \begin{cases} 0 & \text{if } A^{i}_{j} < \mathrm{mean}(A^{i}) \\ \frac{A^{i}_{j}}{\max(A^{i})} & \text{if } A^{i}_{j} \geq \mathrm{mean}(A^{i}) \end{cases} \qquad (39)$
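In Python, the normalization and thresholding of Equations 38 and 39 can be sketched as below; only these two steps are shown, and the (M, n) layout of the stacked feature maps is an assumption of this example.

import numpy as np

def bosp_normalize(feature_maps):
    """feature_maps: array of shape (M, n) -> normalized, thresholded (M, n)."""
    max_per_map = feature_maps.max(axis=1, keepdims=True)
    mean_per_map = feature_maps.mean(axis=1, keepdims=True)
    normalized = feature_maps / np.maximum(max_per_map, 1e-12)   # Equation 38
    normalized[feature_maps < mean_per_map] = 0.0                # Equation 39
    return normalized

A = np.random.default_rng(0).random((30, 50))   # e.g. 30 maps of 50 activations each
P = bosp_normalize(A)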
$P(h_j = 1 \mid v) = g\left(b_j + \sum_{i=1}^{m} w_{i,j} v_i\right) \qquad (44)$

and similarly, we can use Equation 45 to calculate the probability of a visible unit activating,
given the hidden unit states:

$P(v_i = 1 \mid h) = g\left(a_i + \sum_{j=1}^{n} w_{i,j} h_j\right) \qquad (45)$
where g() is the activation function, usually the logistic function, m and n are the number
of units in the visible and hidden layers respectively.
The activations of the nodes in the hidden layer are independent of each other, given the
activations of the visible layer after applying an input vector on the visible layer, and vice
versa, the activations of the nodes in the visible layer are independent of each other, given the
hidden unit activations. Thus we can simply calculate the probability of a visible vector given
a hidden state vector using Equation 46

$P(v \mid h) = \prod_{i=1}^{m} P(v_i \mid h) \qquad (46)$

and similarly the probability of a hidden vector given a visible unit vector using Equation 47

$P(h \mid v) = \prod_{i=1}^{n} P(h_i \mid v) \qquad (47)$
An extension to the traditional RBMs allows the visible units to be of multinomial distribution,
while the hidden units remain of Bernoulli distribution, allowing the visible layers to
take discrete values. In this case, for the activation of the visible units the softmax function is
used:

$g(x)_j = \frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}} \qquad \text{for } j = 1, 2, \ldots, K \qquad (48)$

where K is the number of dimensions of our data.
Now we have all the tools we need in order to train an RBM, which has some similarities with
training a Feed Forward NN, since it also uses the gradient descent algorithm. The method
proposed by Hinton [60][61] is called contrastive divergence (CD) and goes as follows:
• take a random sample v from the training set and apply it to the input. Given input v, calculate the
probabilities of the hidden units and sample an activation vector h from this distribution.
• reconstruct a visible vector v' from h, and from v' compute the hidden activations h'.
• update the weight matrix of the network with $\Delta W = \eta(v h^T - v' h'^T)$, where η is some
predefined learning rate.
• update the biases using $\Delta a = \eta(v - v')$, $\Delta b = \eta(h - h')$.
As in a FFNN we repeat the process a number of times for all the samples in the training set
until the network converges.
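A compact NumPy sketch of one such CD-1 update, following Equations 44-45 and the update rules above, is shown below; it assumes Bernoulli visible and hidden units with logistic activations, and the learning rate and layer sizes are illustrative.

import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(v, W, a, b, eta=0.01, rng=None):
    rng = rng or np.random.default_rng()
    # Positive phase: hidden probabilities given v (Equation 44), then sample h
    h_prob = logistic(b + v @ W)
    h = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Reconstruction: visible probabilities given h (Eq. 45), then hidden again (Eq. 44)
    v_prime = logistic(a + h @ W.T)
    h_prime = logistic(b + v_prime @ W)
    # Updates: dW = eta (v h^T - v' h'^T), da = eta (v - v'), db = eta (h - h')
    W += eta * (np.outer(v, h) - np.outer(v_prime, h_prime))
    a += eta * (v - v_prime)
    b += eta * (h - h_prime)
    return W, a, b

rng = np.random.default_rng(0)
m, n = 6, 4                                   # visible and hidden unit counts
W = rng.normal(scale=0.1, size=(m, n))
a, b = np.zeros(m), np.zeros(n)
v = rng.integers(0, 2, size=m).astype(float)  # one binary training sample
W, a, b = cd1_step(v, W, a, b, rng=rng)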
$E(v, h) = -\sum_{k=1}^{K} \sum_{i,j=1}^{N_H} \sum_{r,s=1}^{N_W} h^{k}_{ij} W^{k}_{rs} v_{i+r-1,\, j+s-1} - \sum_{k=1}^{K} b_k \sum_{i,j=1}^{N_H} h^{k}_{ij} - c \sum_{i,j=1}^{N_V} v_{ij} \qquad (50)$

where $\tilde{W}$ denotes the weight matrix flipped horizontally and vertically, $*$ denotes convolution
and $\bullet$ denotes the dot product. Now we can train a CRBM using the contrastive divergence
algorithm, as with a simple RBM.
Figure 14: Example of a simple 2-layer Autoencoder. It has 2 layers in the coding part and
2 layers in the decoding part. The middle layer is shared between the coding and decoding part.

Figure 15: Example of a Stacked Autoencoder. Each Autoencoder is trained independently by
using the input of the previously trained AE. When all AEs are trained we can stack the coding
parts and the decoding parts together to form a Stacked Autoencoder, which has the same
structure as a normal AE.
4.2.4 Autoencoders
Autoencoders (AE) [64] are types of NNs which are used to reduce the dimensionality of
the input space. The simplest form of an Autoencoder is similar to the traditional multilayer
perceptron with one hidden layer, with two constraints: the output layer has to have the exact
same number of nodes as the input layer, and the hidden layer has to have fewer nodes than
the input layer.
The labels used to calculate the error are the inputs themselves. Effectively this forces the
network to reconstruct its input on the output nodes, after having reduced the dimensionality of
the data in the hidden layer. Thus, the hidden layer learns lower-dimensional representations
of the data, which make further supervised or unsupervised learning more efficient. Since
finding a large unlabeled training set is much easier than finding a labeled one, we can first
train in an unsupervised manner. Then we can take the coding part of the autoencoder, add
another fully connected layer on top that will perform the classification, and fine-tune the
network using a smaller, labeled training set. Since the network is already pre-trained,
and already “knows” the data, it will converge much faster and more accurately.
For some applications, though, trying to learn the identity function might not be sufficient.
For object recognition in images, for example, we need a representation of an object that is as
generic as possible. Vincent et al. proposed Denoising Autoencoders (DAE) [67] to solve
this problem for learning image features, where the input x is corrupted into x̂, while
the network tries to reconstruct the initial values of x. The noise introduced to the data
would be of binomial distribution for black and white images, turning pixels on and
off, while RGB images would be blurred with Gaussian noise.
We can, of course, extend the Autoencoder architecture and add more layers between the
input and output layers in order to learn more complicated data (Figure 14). There are two
approaches to training such a network: as a traditional MLP, or layer-wise.
In the case of full network propagation we train the network as we would train any multilayer
feed forward network; make a forward propagation to get the activations of all the nodes,
calculate the error on the output nodes, backward propagate the error through the layers
and update the weights. The first part of the network is called the coding part, where
each successive layer has fewer nodes, and thus encodes or compresses the information, and
the second part, called the decoding part, has successively more nodes in each layer, which are
symmetric to the coding layers. For very shallow networks this works fine, but for deeper
architectures we start to encounter the problem of the vanishing gradient. Because the traditional
activation functions used have gradients in the range (−1, 1) or [0, 1), and the gradient
of nodes in the lower levels of an n-layered deep network is calculated by multiplying n numbers
smaller than 1, the numbers tend to get extremely small. Thus, the upper layers of the network
get trained quite fast, but the lower layers learn very slowly, and only learn the average of the
gradient over the dataset.
Stacked Autoencoders [65] are the second type of Autoencoders, which solve the problem of
the vanishing gradient by training in a layer-wise fashion. We would start by training the first
hidden layer of the network as we would in a single layer Autoencoder. Once we are confident
that the first layer has learned to encode the input sufficiently well, we proceed to train the
next layer of the network by taking the output of the first hidden layer and trying to reconstruct
it after the second hidden layer has encoded it. We repeat this process until every layer is
trained (Figure 15). Thus the network learns features hierarchically and, because each layer is
trained individually, the problem of the vanishing gradient, well.. vanishes.
Sparsity in feature representation has been shown [69][70][71] to increase the accuracy of
various classification tasks [72]. Hinton et al. [70] and Lee et al. [71] propose methods for
achieving sparsity in AE networks by using combinations of different activation functions, sampling
steps and the incorporation of penalties in the error function which encourage nodes
to activate less frequently. Makhzani et al. proposed the k-Sparse Autoencoder, in which,
in each layer, only the k highest activations are preserved, while the rest are set to zero.
Cho [73] showed that introducing sparsity in Denoising AEs improves the network's ability to
denoise highly noisy images.
where, as before, $g()$ is the activation function, $b_k$ is the bias term of filter k, $h^k$ is
the feature map of the k-th filter, and $\tilde{W}^k$ is the weight matrix of filter k, flipped in both
dimensions. The reconstruction of the pooling layer is done by replicating each point of each
feature map n times, where n is the pooling kernel size.
Figure 16: Example of a SOM topology.
Figure 17: Example of a SOM projecting
RGB colors into a 2d space.
where N is the number of inputs, $x_i$ is the $i$-th element in the input vector and $w_i$ is the
weight of the node associated with that input. SOMs are winner-take-all networks, meaning
that the node with the highest activation is the “winner” for one specific input. Once the
winner is identified, the weights are updated using the following mechanism

$w = w + \eta(x - w) \qquad (54)$

where w is the weight vector of the node, η is a predefined learning rate and x is the input
vector. Effectively, the winning node adjusts its weights in a way that better models the specific
input.
Now we have reached the point where we take advantage of the lattice structure of the network.
The weight update rule is applied to the winning node, as well as to the 8 neighbors surrounding
it. This causes the neighborhood to model data similar to the input, with some variations, since
the weights are most likely different in their initial state.
As with traditional NNs, during the training phase we iterate over all our examples a number
of times. In the end the network will have made a kind of topological “map” of the data in
low dimensions, as shown in Figure 17.
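A small NumPy sketch of this training loop is given below; it uses a Euclidean-distance winner (one of the improvements mentioned in the next paragraph) and applies the update rule of Equation 54 to the winner and its immediate grid neighbors. Grid size, learning rate and number of epochs are illustrative.

import numpy as np

def train_som(data, grid=(10, 10), eta=0.1, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    weights = rng.random((grid[0], grid[1], data.shape[1]))
    for _ in range(epochs):
        for x in rng.permutation(data):
            # Winner: node whose weight vector is closest to the input
            dist = np.linalg.norm(weights - x, axis=2)
            wi, wj = np.unravel_index(np.argmin(dist), grid)
            # Update winner and its (up to 8) neighbors: w = w + eta * (x - w)
            i0, i1 = max(wi - 1, 0), min(wi + 2, grid[0])
            j0, j1 = max(wj - 1, 0), min(wj + 2, grid[1])
            weights[i0:i1, j0:j1] += eta * (x - weights[i0:i1, j0:j1])
    return weights

data = np.random.default_rng(1).random((200, 3))   # e.g. RGB colors as in Figure 17
som_weights = train_som(data)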
After the original publications, several improvements were suggested, such as using the Euclidean
distance as a discriminant function, in which case the winning node is the one with the smallest distance.
Several improvements have also been suggested regarding the weight update rule. Namely, Shah et
al. [75] proposed a time-adjusted SOM, where each neuron has an individual learning
rate and neighborhood size that change over time.
5.2 Convolutional AutoEncoder
For the CAE, we chose an architecture that is able to reconstruct the input sufficiently well,
and has as few parameters as possible. The final result is a 6 layer CAE, 3 convolutional layers
in the encoding part and 3 deconvolutional layers in the decoding part. The first convolutional
layer shares weights with the last deconvolutional layer, the second convolutional layer shares
weights with the second deconvolutional layer, and the last convolutional layer shares weights
with the first deconvolutional layer. No pooling was used in this network.
The first convolutional layer consists of 60 filters of shape 7x1 (width, height), while the
second and third layers consist of 30 filters of shape 5x1. Effectively this means that each
node in the last convolutional layer encodes information from 7 + 5 + 5 − 2 = 15 nucleotides
in the input.
For the training process the quadratic function was used as a cost function, and stochastic
gradient descent with momentum was used for training the network. The network was trained
over 100000 batches of size 64, with initial learning rate 0.01, learning rate decay rate 0.1
with the decay being applied every 20000 batches.
Once the CAE is trained, the dataset is converted to the CAE encoded representation, by
taking the activations of the last convolutional layer, for each individual example in our dataset.
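For illustration, a PyTorch sketch of an architecture matching this description is given below. The original implementation is not shown in this excerpt, so the framework, the sigmoid activations, the "same" padding, the treatment of the one-hot window as a 4-channel 200x1 image, and the exact way the weights are tied are all assumptions of this sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 60 filters of 7x1, then 30 of 5x1, then 30 of 5x1 (no pooling).
        self.conv1 = nn.Conv2d(4, 60, kernel_size=(7, 1), padding=(3, 0))
        self.conv2 = nn.Conv2d(60, 30, kernel_size=(5, 1), padding=(2, 0))
        self.conv3 = nn.Conv2d(30, 30, kernel_size=(5, 1), padding=(2, 0))

    def encode(self, x):
        x = torch.sigmoid(self.conv1(x))
        x = torch.sigmoid(self.conv2(x))
        return torch.sigmoid(self.conv3(x))      # encoded representation

    def decode(self, z):
        # Deconvolutions reuse the encoder filters (tied weights), in reverse order.
        z = torch.sigmoid(F.conv_transpose2d(z, self.conv3.weight, padding=(2, 0)))
        z = torch.sigmoid(F.conv_transpose2d(z, self.conv2.weight, padding=(2, 0)))
        return torch.sigmoid(F.conv_transpose2d(z, self.conv1.weight, padding=(3, 0)))

    def forward(self, x):
        return self.decode(self.encode(x))

# A 200-nucleotide one-hot window as a (batch, channels=4, length=200, width=1) tensor.
model = ConvAutoEncoder()
x = torch.rand(8, 4, 200, 1)
print(model.encode(x).shape, model(x).shape)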
5.3 BoSP
After the dataset has been converted to the CAE encoded representation, we apply the
BoSP transformation to the dataset, as indicated in Section [Link], to get our final feature
representation.
5.5 K-Means
We use K-Means to cluster our data into two clusters. Ideally we would like the examples
containing the signal to end up in one cluster and the rest to be in the other. As input, the
BoSP features of the CAE-encoded sequences are given, concatenated with their coordinates in
the SOM.
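A sketch of this clustering step using scikit-learn is shown below; the feature and coordinate arrays are random placeholders standing in for the real BoSP features and SOM mappings.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
bosp = rng.random((1000, 30))                                # BoSP features of the encoded windows
som_coords = rng.integers(0, 20, (1000, 2)).astype(float)    # (row, col) of each window's winning SOM node

features = np.hstack([bosp, som_coords])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))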
6 Experiments
6.1 In search of Seven Clusters
In Section 2 the work of Gorban et al. was mentioned, where they visualized sequences by
projecting them into the space of triplet frequencies. Additionally, they visualized the first and
third principal components of features extracted from sequences using GLIMMER [37]. In both
cases they show that 7 visible clusters are formed, as shown in Figure 18. The labels they used
for the sequences are non-coding, coding in the forward strand, and coding in the complementary strand.

Figure 18: Visualization of the distribution of predictions of the GLIMMER gene finder in the
64-dimensional space of codon frequencies. Every point corresponds to one ORF [38].
a) Escherichia coli. Projection on the 1st and 3rd principal components.
b) Caulobacter crescentus. Projection on the 1st and 2nd principal components.
In the same spirit, we devised similar experiments. New datasets were created using the same
labeling strategy as Gorban et al. for the sequences, with window sizes of 200 nucleotides. In
our case the features used for the visualizations are the first and second principal components
of the BoSP representation of sequences, extracted from the trained CAE. In Figures 19, 20 and 21
it can be seen that the data are distributed mostly normally, with the examples of the coding
sequences in the positive strand (green) being slightly more dominant in the top right part
of the graph, while the coding sequences in the complementary strand (blue) and non-coding
sequences (red) have a very similar distribution. No seven clusters are visible in this case.
Additionally, in Figure 22 the SOM mapping can be seen for the same dataset. It presents
similar results to the scatter plot of Figure 19, where the green channel, which represents the
positive examples in the positive strand, forms a small clear cluster. The blue and red channels,
which represent the positive examples in the complementary strand and the negative examples
respectively, overlap for the most part, forming a purple blob in the image. Additionally there
is a small cluster where all three classes overlap.
Figure 19: Visualization of the first and second Principal Components ex-
tracted from the CAE-BoSP features of [Link] sequences. Green points are
coding regions on the positive strand, blue points are coding regions in the
negative strand and red points are non-coding regions.
Figure 20: Visualization of the first and second Principal Components ex-
tracted from the CAE-BoSP features of [Link] sequences. Green points are
coding regions on the positive strand and red points are non-coding regions.
Figure 21: Visualization of the first and second Principal Components ex-
tracted from the CAE-BoSP features of [Link] sequences. Blue points are
coding regions in the negative strand and red points are non-coding regions.
features from the sequences. Then, we use the encoded representation of the sequences to
train a supervised CNN as in the previous step. This part serves to show how discriminative
those features are for this specific task. Ideally we would want features that yield comparable
classification results to the original sequences.
As a final part of the experiments, we want to use the features learned in the second part
and train a Self-Organizing Map to see whether coding sequences form any visible and distinct
clusters. The BoSP method is used in order to reduce the dimensionality of the encoded
representation, while keeping the discriminative information.
6.2.1 Dataset
In order to get an intuition of whether the system is capable of such a task, we start off with
a simple prokaryotic organism, Escherichia Coli (E. Coli), which has been studied extensively.
E. Coli is used for our supervised experiments as well as for the supervised learning with
unsupervised features experiments.
For the unsupervised learning 3 organisms are used, namely, Escherichia Coli, Salmonella
Enterica and Saccharomyces Cerevisiae.
The datasets used are available on the NCBI site [76].
Table 1: Sensitivity, specificity and f1 score of experiments using various ratios of positive
and negative examples during training. The test set contains 4 complete genomes of E.
Coli. The same network structure is used for all experiments.
Once the optimum ratio of examples has been established, similar network architectures are
tested by removing the pooling layers one by one, starting with the one following the second
convolutional layer. The sensitivity and specificity of each network can be seen in Table 2.
              c-p-c-p-c   c-p-c-c   c-c-c
sensitivity   0.8438      0.8381    0.8367
specificity   0.9836      0.9835    0.9831
f1 score      0.8225      0.8152    0.81

Table 2: Sensitivity, specificity and f1 score of 3 different networks. The first network
consists of 3 convolutional layers 'c' where the first two are followed by pooling 'p'. The second
network has 3 convolutional layers where only the first layer is followed by pooling.
The third network consists only of convolutional layers. Note that the first two convolutional
layers in all cases are followed by Local Response Normalization layers and all
networks are completed with fully connected layers.
Table 3: The network parameters used for the structure optimization experiments. For
the convolutional layers (conv), the parameters shown are number of filters x filter width
x filter height. For the pooling layers (pool) the parameters are pooling width x pooling
height. For the fully connected (fc) layers the parameter shown is the number of nodes.
Finally, for the dropout layer (do) the parameter is the probability of shutting down the
output of a node in the previous layer.
From the results in Table 2 we can see that removing the pooling layers makes a minimal
difference in performance.
Furthermore, we compare the performance of the model trained on E. Coli with the performance
of a model trained on S. Enterica. Both models are evaluated using DNA sequences
of E. Coli, S. Enterica and S. Cerevisiae. In Tables 4, 5 and 6 the f1 score, sensitivity and
specificity are shown for each model when evaluated on each organism.
organism \ model   [Link]   [Link]
[Link]             0.8127   0.7370
[Link]             0.7449   0.8273
[Link]             0.1315   0.1329

Table 4: F1 score of each model (columns) when applied to each corresponding organism
(rows)
organism \ model   [Link]   [Link]
[Link]             0.8306   0.7406
[Link]             0.7592   0.8538
[Link]             0.2192   0.2330

Table 5: Sensitivity of each model (columns) when applied to each corresponding organism
(rows)
organism \ model   [Link]   [Link]
[Link]             0.9831   0.9788
[Link]             0.9787   0.9840
[Link]             0.9184   0.9123

Table 6: Specificity of each model (columns) when applied to each corresponding organism
(rows)
It is clear from the results that using a model trained on one organism to make predictions
for another significantly reduces its performance. Though the model maintains a high degree
of specificity, the sensitivity drops substantially.
representation of the CAE encoded sequence trained on the same organism as the K-Means
model). The results are shown in Tables 7,8,9.
organism \ model   [Link]   [Link]   [Link]
[Link]             0.070    0.1219   0.1232
[Link]             0.1050   0.1222   0.1163
[Link]             0.0792   0.0749   0.0737

Table 7: F1 score of each model (columns) when applied to each corresponding organism
(rows)
organism \ model   [Link]   [Link]   [Link]
[Link]             0.4151   0.4763   0.4536
[Link]             0.4094   0.4845   0.4288
[Link]             0.5228   0.5080   0.4873

Table 8: Sensitivity of each model (columns) when applied to each corresponding organism
(rows)
organism \ model   [Link]   [Link]   [Link]
[Link]             0.5129   0.5135   0.5462
[Link]             0.5129   0.5085   0.5466
[Link]             0.5499   0.5348   0.5476

Table 9: Specificity of each model (columns) when applied to each corresponding organism
(rows)
The results presented above are the averages over 5 runs.
It is hard to compare our method with any other since, to the extent of our knowledge, this
is the only unsupervised signal-sensor approach for gene prediction. It is clear from the
results, however, that the performance is comparable to random classification.
Figure 23: Data flow diagram of the visualization process using Self-
Organizing Maps. In the SOM visualization the intensity of each pixel shows
the number of examples that mapped to the corresponding node. The pos-
itive examples are represented by the green channel, while the negative ex-
amples are represented by the red channel. Thus, an intense green pixel means
that many positive examples mapped to that node, while intense red means
that many negative examples mapped to that node. Yellow-tinted pixels are
nodes to which many positive and many negative examples have mapped.
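The sketch below shows one way such a hit-count image can be produced, using the MiniSom library purely as an example implementation (the thesis does not prescribe a particular SOM library); the grid size and training parameters are arbitrary placeholders.

import numpy as np
from minisom import MiniSom

def som_hit_image(features, labels, grid=(20, 20), iterations=5000):
    """features: 2D array of feature vectors; labels: 0/1 (1 = signal window)."""
    som = MiniSom(grid[0], grid[1], features.shape[1], sigma=1.0, learning_rate=0.5)
    som.train_random(features, iterations)
    image = np.zeros((grid[0], grid[1], 3))   # RGB hit counts per node
    for x, y in zip(features, labels):
        i, j = som.winner(x)                  # best matching unit for this example
        channel = 1 if y == 1 else 0          # green for positives, red for negatives
        image[i, j, channel] += 1
    # scale each channel so pixel intensity reflects the relative number of hits
    for c in (0, 1):
        if image[..., c].max() > 0:
            image[..., c] /= image[..., c].max()
    return image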
Figure 24: SOM visualization of the BoSP representation of E. Coli nucleotide
sequences, with a 5:1 ratio of negative to positive examples.
Using this as input to a SOM gives us an intuition of whether a feature as simple as this can
be used for clustering the sequences. It is clear from the image that there are visible clusters
for positive and negative examples.
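Under our reading, a BoSP-style descriptor of the raw one-hot encoding simply sums each nucleotide channel over the window, reducing a sequence to a vector of nucleotide counts; the sketch below only illustrates that reading and is not the code used in the experiments.

import numpy as np

NUCLEOTIDE_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot(sequence):
    encoding = np.zeros((4, len(sequence)))
    for pos, base in enumerate(sequence):
        encoding[NUCLEOTIDE_INDEX[base], pos] = 1.0
    return encoding

def bosp_of_one_hot(sequence):
    # summing each one-hot channel over the window gives per-nucleotide counts
    return one_hot(sequence).sum(axis=1)

print(bosp_of_one_hot("ATGACCGT"))  # -> [2. 2. 2. 2.]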
For all SOM visualizations the following structure is followed:
• Right image: Mapping of both positive and negative examples. Positives are represented
by the green channel, and negatives by the red.
In the case of Figure 24, a 5:1 ratio of negative to positive examples is used, as during the
supervised learning. In Figure 25 we see the SOM visualization of the BoSP representation
of the one-hot encoding of the same sequences, using the default ratio of positive to negative
examples. Some distinct clusters are still formed, but the amount of overlap between the two
classes increases significantly.
The difference between these two visualizations indicates that the main problem here is the
ratio between our two classes. Both for the SOM and for K-Means, the positive examples are
too few to form their own distinct cluster, and are absorbed into the much larger cluster of
negative examples.
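A fixed ratio such as 5:1 can be enforced by subsampling the majority (negative) class, as in the sketch below; this is our own illustration of the idea, not the preprocessing code used in the thesis.

import numpy as np

def subsample_negatives(features, labels, ratio=5, seed=0):
    """Keep all positives and at most `ratio` negatives per positive example."""
    rng = np.random.default_rng(seed)
    pos_idx = np.where(labels == 1)[0]
    neg_idx = np.where(labels == 0)[0]
    n_keep = min(len(neg_idx), ratio * len(pos_idx))
    keep = np.concatenate([pos_idx, rng.choice(neg_idx, size=n_keep, replace=False)])
    rng.shuffle(keep)
    return features[keep], labels[keep]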
Since we use unsupervised NNs to solve this problem, the question becomes whether we
can extract features in an unsupervised manner that highlight the essential differences
between the two classes. Using the Convolutional Auto-Encoders described in Section 6.2.3, the
dataset was converted to the encoded representation. In Figure 26 the results of the SOM
mapping of the dataset using the 5:1 ratio can be seen, and in Figure 27 the results of the
same process using the default ratio.
Figure 26: SOM visualization of the CAE-BoSP features of E. Coli, using a
ratio of 5:1.
Figure 27: SOM visualization of the CAE-BoSP features of E. Coli, using the
default ratio.
This time the visualization tells a somewhat different story. Using the BoSP representation
of the CAE features, the data are somewhat separated even when using the default ratio. This
hints that the CAE-BoSP features increase the separability of the data.
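For concreteness, the sketch below shows one way the CAE-BoSP features could be computed: the trained encoder half of the Convolutional AutoEncoder produces feature maps for each one-hot window, and a BoSP-style vector is obtained by pooling every feature map over its spatial positions. The `encoder` model, the input shape and the choice of sum pooling are assumptions on our part.

import numpy as np

def cae_bosp_features(encoder, windows):
    """encoder: Keras model mapping one-hot windows (n, 4, window_len, 1)
       to feature maps of shape (n, h, w, n_filters)."""
    feature_maps = encoder.predict(windows)
    # pool each feature map over its spatial dimensions: one value per filter
    return feature_maps.sum(axis=(1, 2))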
Additionally, Figures 28 and 29 show the SOM mapping for the organisms Salmonella Enterica
and Saccharomyces Cerevisiae respectively. Both of these organisms have a sparser genome
than Escherichia Coli, meaning that the default ratio of positive to negative examples is even
smaller. Even so, the SOM is able to map the two classes with a relatively small amount of
overlap.
Figure 28: SOM visualization of the BoSP features of S. Enterica, using the
default ratio.
We introduced the UDN gene finder, a novel unsupervised signal sensor for gene prediction.
We showed that, even though clustering techniques are not able to separate the data in the
desired way, when visualizing the data with Self-Organizing Maps there is a significant degree
of separation between the sequences that contain the signal we are searching for and the
sequences that do not.
For future work it would be interesting to investigate why SOMs are able to separate the
two classes while K-Means is not. Additionally, we could apply other unsupervised feature
selection algorithms; plenty have been proposed in the literature, each with its own advantages
and disadvantages. Such a study would not provide us with better features; rather, it would
help create crisper clusters and thus more meaningful results.
Acknowledgments
I would like to thank my first supervisor, Dr. Erwin Bakker, for supporting and guiding me
throughout the whole duration of my thesis. He was always eager to advise me, and every
meeting was productive, interesting and, most of all, fun.
Secondly, I would like to thank my second supervisor, Dr. Michael Lew, for taking the time
to assess and evaluate my work, as well as giving me many opportunities to work with him
during my thesis.
I want to extend my gratitude to my friends, with whom I had many constructive discussions
about my thesis and who supported me during this whole period.
Most of all I want to thank my family and friends for supporting me and being patient with
me, allowing me to do everything at my own pace.