Deep Learning From A Mathematical Point of View
June 2021
Prologue
Artificial Neural Networks are a family of Machine Learning algorithms based on the structure of biological neurons (these neurons are organized in layers). Deep Learning is the branch of Machine Learning that includes all the techniques used to build Deep Neural Networks (Artificial Neural Networks with at least two hidden layers) that are able to learn from data with several levels of abstraction.
A Feed-forward Neural Network with fully-connected layers is a Deep Neural Network whose information flow goes forwards. The neurons that belong to consecutive layers are fully connected. Its architecture is based on the weights of the connections between the neurons and on the bias that each neuron adds to its received information. The values of these parameters are fitted during training. This learning process is reduced to an optimization problem that can be solved using Gradient Descent or other recent algorithms such as Scheduled Restart Stochastic Gradient Descent, with Back Propagation being the algorithm used to compute the required derivatives. If the network is not able to learn correctly, overfitting or underfitting can arise. Other parameters of the neural network (hyperparameters) are not tuned during training. To perform, for example, image classification or prediction tasks we have to use other types of Deep Neural Networks, such as Convolutional Neural Networks or Recurrent Neural Networks.
This Master’s Dissertation is organized into a summary (written in Spanish), three chapters,
some conclusions and a bibliography.
In Chapter 1 we define the concepts of Artificial Intelligence and Machine Learning in order
to introduce Artificial Neural Networks (we prove some results related to their capabilities).
In Chapter 2 we introduce the concept of Deep Learning. We explain the architecture of
Feed-forward Neural Networks with fully-connected layers and we study some topics related to
them (we prove some mathematical results).
In Chapter 3 we study Convolutional Neural Networks and Recurrent Neural Networks.
Resumen
Learning problems can arise during training: the network may not be able to generalize (overfitting) or it may fail to learn from the data (underfitting). Those parameters that describe the network but are not fitted during training, such as the number of hidden layers or the initial values of the weights and biases, are known as hyperparameters.
Using Feed-forward Neural Networks with all their layers fully connected is not feasible when the input data are images (this is because the input layer would have as many neurons as the image has pixels and, considering only the connections between this layer and the first hidden layer, we would already have too many weights to train). Convolutional Neural Networks are deep neural networks used for image classification and image recognition tasks. Although they are Feed-forward Neural Networks, their layers are not fully connected (this reduces the number of weights to train). Their structure is based on convolutional layers, feature maps and pooling layers.
If we must solve prediction problems in which the input data are time series, we need the artificial neural network to have memory and, therefore, we cannot use Feed-forward Neural Networks like the previous ones. Recurrent Neural Networks are deep neural networks with memory. As input data they use the observable information and the previous states. Their main structure, where these states are stored, is known as the memory cell. However, there is empirical evidence that these recurrent neural networks do not have long-term memory. To solve this problem, memory cells with long-term memory, such as the LSTM cell or the GRU cell, are used.
Contents
Prologue
Resumen
1 Introduction to AI and ML
  1.1 Artificial Intelligence
  1.2 Machine Learning
    1.2.1 Types of Machine Learning
  1.3 Artificial Neural Networks
    1.3.1 The Structure of a Biological Neuron
    1.3.2 The Perceptron
    1.3.3 The Multi-Layer Perceptron
2 Deep Learning
  2.1 Architecture of Feed-forward Neural Networks with Fully-connected Layers
  2.2 Training Deep Neural Networks
    2.2.1 Gradient Descent and Stochastic Gradient
    2.2.2 Accelerating Gradient Descent in Convex Smooth Optimization
    2.2.3 Accelerating Stochastic Gradient
    2.2.4 Back Propagation
  2.3 Problems with Training
    2.3.1 Overfitting
    2.3.2 Underfitting
  2.4 Hyperparameters
Conclusions
Bibliography
Chapter 1
Introduction to AI and ML
Before explaining Deep Learning, which is the main topic of this Master’s Dissertation, we have to introduce some concepts such as Artificial Intelligence and Machine Learning to contextualize it. Artificial Intelligence is present, for example, in:
• Face recognition.
• Machine Learning.
• Robots.
• Voice assistants.
→ Supervised learning.
Human supervision is needed during training. We use labeled data: each instance of the training set is associated with a label that corresponds to the value we should obtain when the task is performed. The available data is divided into the training set and the test set. The latter is used to check how well the ML algorithm generalizes. One of the main problems that ML can present is overfitting or loss of generalization: the algorithm fits the training set but it does not work well for other data.
Supervised learning is used to perform classification tasks (as in Example 1.1) and prediction
tasks (for example, to predict the price of a car given some features).
Some of the most famous supervised learning algorithms are Decision Trees (DTs) and
Artificial Neural Networks (ANNs).
→ Unsupervised learning.
Human supervision is not needed during training. This means that the training instances do not have a label. The algorithm has to be able to extract features of the data without help.
Unsupervised learning algorithms such as k-Means are used for clustering (for instance, it can be used to classify the clients of a company into k different groups to create specific ads for each group) and others, such as Principal Component Analysis (PCA), are used for dimensionality reduction. Some Artificial Neural Networks, such as autoencoders (they are used, for example, to remove the noise from an image), are unsupervised learning algorithms.
→ Semi-supervised learning.
Not all the training instances are labeled. This type of algorithm can be useful when we have unlabeled data and it is laborious to label all the samples.
→ Reinforcement learning.
The system is called an agent and it learns which is the best action to perform in each situation (the policy) by learning from the environment and from past actions for which it has received rewards or penalties.
Figure 1.1: On the left, principal structure of a biological neuron. On the right, drawing of a
cortical lamination by S. Ramón y Cajal (this image is reproduced from [15] and it has been
adapted following its style in [5]).
A biological neuron is a cell that can be found in the brain. It has four principal parts:
• Cell body. It contains the nucleus and most of the cell's components.
• Axon. It is a very long extension which transmits information from this neuron to the other neurons to which it is connected. Near the end, it divides into many branches called telodendria.
• Dendrites. They are branching extensions of the neuron that transmit the information received from the axon (telodendria) of the other neurons.
• Synapses. They are the regions that connect the axon (telodendria) with the dendrites. The electrical impulses (signals) pass from one neuron to another through them.
In the left image of Figure 1.1 we can see the representation of a neuron.
Biological Neural Networks (BNNs) are formed by billions of neurons and according to
research, these neurons seem to be organized in consecutive layers. In the right image of
Figure 1.1 we can see a drawing of a cortical lamination by S. Ramón y Cajal in which this
structure of consecutive layers can be appreciated. Taking into account the structure of a BNN,
an ANN can be designed. Let us start with one of the simplest ANNs, the Perceptron.
Figure 1.2: Graphic representation of a TLU with five inputs x_1, ..., x_5, weights w_1, ..., w_5, activation function A and output y.
A TLU (Threshold Logic Unit) is a neuron that receives n different inputs (numerical values) that we denote by x_i for i = 1, ..., n. A weight w_i (i = 1, ..., n) is associated with the connection between the input x_i and the neuron. The output of the TLU is a numerical value y obtained from the sum of the weighted inputs and the application of an activation function A. Therefore, the calculation of the output can be written mathematically as
y = A\left( \sum_{i=1}^{n} w_i x_i \right). \qquad (1.1)
Notice that the Perceptron can be understood as a function that associates an output value to the input of the network. The activation function is a step function such as the Heaviside function (its graph is on the right of Figure 1.3):

A(x) = H(x) = \begin{cases} 1 & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases}
Notice that it is quite natural to use this function because it can be interpreted as the behaviour
of a neuron: value 1 when the neuron fires and value 0 if the neuron is inactive.
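As a concrete illustration, the following is a minimal sketch in Python of a TLU with Heaviside activation (the input values and weights are arbitrary choices for the example):

import numpy as np

def heaviside(x):
    # Heaviside step function: 1 if x >= 0, else 0.
    return 1.0 if x >= 0 else 0.0

def tlu_output(x, w):
    # Output y = A(sum_i w_i * x_i) of a TLU, see (1.1).
    return heaviside(np.dot(w, x))

# Illustrative example with n = 5 inputs (arbitrary values).
x = np.array([0.5, -1.0, 2.0, 0.0, 1.5])
w = np.array([0.2, 0.4, -0.1, 0.3, 0.1])
print(tlu_output(x, w))  # 0.0, since the weighted sum -0.35 < 0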
Figure 1.3: On the left, graphic representation of a Perceptron with n = 2 neurons in the input
layer and k = 3 TLUs in the output layer. On the right, Heaviside function.
On the left of Figure 1.3 we can see a representation of a Perceptron with n = 2 neurons in
the input layer and an output layer with k = 3 TLUs.
Remark 1.3. The term Perceptron is also used by some authors to refer to an ANN with a
unique TLU.
Figure 1.4: Graphic representation of the Perceptron in Figure 1.3 using a bias neuron.
Remark 1.4. Some authors use an extra neuron with value 1 in the input layer to represent
the bias term. This neuron is called bias neuron. In Figure 1.4 we have the representation of
the Perceptron in Figure 1.3 using a bias neuron (in grey).
To train the Perceptron, the training samples are used one by one. When the prediction
ŷj (j = 1, ..., k) made by the Perceptron for a training sample does not coincide with the label
yj (j = 1, ..., k), the connections that would have given the right solution are reinforced. This
training process was proposed by F. Rosenblatt and it is based on Hebb’s rule. Mathematically,
the training algorithm of the Perceptron can be written as

w_{ji}(m+1) = w_{ji}(m) + \eta \left( y_j - \hat{y}_j \right) x_i,

where η is the learning rate and w_{ji}(m) represents the weight of the connection between the input neuron i and the output neuron j when the m-th training sample is used.
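As an illustration, a minimal sketch in Python of this training rule, using the bias-neuron representation of Remark 1.4 (the AND problem and the hyperparameter values are arbitrary choices for the example):

import numpy as np

def heaviside(x):
    return (x >= 0).astype(float)

def train_perceptron(X, Y, eta=0.1, epochs=20):
    # Perceptron rule: w_ji(m+1) = w_ji(m) + eta * (y_j - y_hat_j) * x_i.
    # The bias is handled with a bias neuron of constant value 1 (Remark 1.4).
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append the bias neuron
    W = np.zeros((Y.shape[1], Xb.shape[1]))
    for _ in range(epochs):
        for x, y in zip(Xb, Y):          # training samples are used one by one
            y_hat = heaviside(W @ x)     # prediction of the k output TLUs
            W += eta * np.outer(y - y_hat, x)
    return W

# Illustrative run on the linearly separable AND problem.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [0], [0], [1]], dtype=float)
W = train_perceptron(X, Y)
Xb = np.hstack([X, np.ones((4, 1))])
print(heaviside(Xb @ W.T).ravel())  # converges to [0. 0. 0. 1.]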
The Perceptron is a relatively simple structure. However, it presents some problems. For
example, it is incapable of solving the XOR classification problem. Let us prove it.
Lemma 1.5. [12] The XOR classification problem cannot be solved using the Perceptron.
Proof. In what follows, we identify 0 with false and 1 with true. The XOR function has two
inputs. This function returns false when the inputs are both false or both true (that is to say,
the inputs are (0, 0) or (1, 1)), and true when the two inputs are different, that is to say, one is
true and the other is false (the possibilities are (0, 1) and (1, 0)).
Figure 1.5: The four input points of the XOR problem in R^2. Two hyperplanes, H_1 and H_2, are needed to separate (0, 1) and (1, 0) from (0, 0) and (1, 1).
If we interpret geometrically how the Perceptron works (see (1.1)), we deduce that the input
space has to be separated by a hyperplane. In our case we have four points in R2 : (0, 0), (0, 1),
(1, 0) and (1, 1). Therefore, one of the half-spaces of the hyperplane has to contain all the points
with which we obtain 0 with the XOR function (they are (0, 0) and (1, 1)) and the other, all
the points with which the solution is 1 (they are (0, 1) and (1, 0)). As the XOR function is not
linearly separable, it is impossible to find a unique hyperplane to separate the input space. In
Figure 1.5, it can be seen that to separate (0, 1) and (1, 0) from (0, 0) and (1, 1) we need two
hyperplanes, H1 and H2 .
If we stack some Perceptrons, some weaknesses of the Perceptron, like the inability to solve the XOR problem, can be overcome.
Figure 1.6: Graphic representation of an MLP with two input neurons (layer 1), one hidden layer of two neurons (layer 2) and an output layer with a single neuron (layer 3), all with Heaviside activation H, used to solve the XOR classification problem.
Remark 1.6. In the case of an MLP, if we use the graphic representation with bias neuron
(see Remark 1.4), we have to add one neuron with value 1 in each layer except in the output
one. The bias neurons are not connected to any neuron of the previous layer and they are
fully-connected to all the neurons in the following one.
Let us see that the XOR classification problem can be solved with an MLP.
Lemma 1.8. [5] The XOR classification problem can be solved using an MLP.
Proof. Let us consider a Multi-Layer Perceptron with an input layer (layer 1) formed by two
neurons, one hidden layer (layer 2) constituted by two neurons and an output layer (layer 3) of
a single neuron. In Figure 1.6 we can see a drawing of the situation.
We take w_{11}^{[2]} = w_{21}^{[2]} = w_{12}^{[2]} = w_{22}^{[2]} = w_{12}^{[3]} = 1, w_{11}^{[3]} = -1, b_1^{[2]} = -1.5, b_2^{[2]} = b_1^{[3]} = -0.5. We consider the Heaviside function H as the activation function A. Then the output y of our MLP can be computed as

y = H\left( -H(x_1 + x_2 - 1.5) + H(x_1 + x_2 - 0.5) - 0.5 \right),

and evaluating this expression on the four possible inputs we obtain exactly the XOR function: y = 0 for (0, 0) and (1, 1), and y = 1 for (0, 1) and (1, 0).
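A quick numerical check of this construction in Python (it evaluates the MLP above, with the weights chosen in the proof, on the four possible inputs):

import numpy as np

def H(x):
    # Heaviside step function, applied componentwise.
    return (np.asarray(x) >= 0).astype(float)

def xor_mlp(x1, x2):
    # Hidden layer: h1 fires only for (1, 1); h2 fires unless the input is (0, 0).
    h1 = H(x1 + x2 - 1.5)
    h2 = H(x1 + x2 - 0.5)
    # Output neuron combines them as H(-h1 + h2 - 0.5), which is XOR.
    return H(-h1 + h2 - 0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", xor_mlp(x1, x2))
# (0, 0) -> 0.0, (0, 1) -> 1.0, (1, 0) -> 1.0, (1, 1) -> 0.0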
We have seen that using one hidden layer, we can solve problems that simple ANNs like the Perceptron cannot solve. In addition, it can be proved that an MLP with certain characteristics can approximate any continuous function. That is to say, if the process we want to represent with an MLP can be described with a continuous function, then such an MLP exists. We state this in Theorem 1.13, but first we need to introduce some concepts and to explain the structure of the MLP of the theorem.
The MLP of the theorem has a single hidden layer and an output layer with one neuron.
The activation function of the neurons of the hidden layer is a continuous sigmoidal function.
We do not consider a bias term or an activation function for the output neuron.
Let us compute the output of this ANN. We have no restrictions on the input layer, so we suppose that it has n neurons and we denote the input data by x ∈ R^n. We do not have any restriction on the dimension of the unique hidden layer either, so we can consider that it has m neurons. The activation function of these neurons is a continuous sigmoidal function. The output layer is restricted to a single neuron. Moreover, we do not consider a bias term or an activation function on it. Taking into account that the output of the MLP is computed analogously to the output of the Perceptron, we have that the output
y is given by

y = \sum_{j=1}^{m} w_{1j}^{[3]} \, \sigma\left( \langle w_j^{[2]}, x \rangle + b_j^{[2]} \right),

where w_j^{[2]} ∈ R^n is the vector of weights between the input layer and the j-th hidden neuron, b_j^{[2]} is its bias and w_{1j}^{[3]} is the weight of the connection between the j-th hidden neuron and the output neuron.
Notation 1.10. Let I^n be the n-dimensional cube, that is, I^n ≡ [0, 1]^n. The set of continuous functions defined on I^n is denoted by C(I^n) and the supremum norm of a function h ∈ C(I^n) is \|h\|_\infty = \sup_{x \in I^n} |h(x)|.
Theorem 1.13. [13] Let σ be any continuous sigmoidal function. Then, finite sums of the form

G(x) = \sum_{j=1}^{m} w_{1j}^{[3]} \, \sigma\left( \langle w_j^{[2]}, x \rangle + b_j^{[2]} \right) \qquad (1.2)

(they are the output of an MLP with the properties described above) are dense in C(I^n). That is to say, given g ∈ C(I^n) and ε > 0 there is a function G of the form (1.2) such that

|G(x) - g(x)| < ε \quad \text{for all } x ∈ I^n.
To prove this theorem we have to prove two auxiliary results that make the proof of Theorem
1.13 trivial. These results are Lemma 1.21 and Lemma 1.28. To sketch out each result and
to prove them we need to define and state some results of Functional Analysis and Measure
Theory. For simplicity, we state some of these results for the space of continuous functions on
the n-dimensional cube I n , which is the space of functions we consider.
Definition 1.14. σ-algebra, measurable space and measure space. [17] [18]
A σ-algebra over a set X is a set A ⊂ P(X) that verifies the following three properties:
• ∅ ∈ A.
• If A ∈ A, then X \ A ∈ A.
• If {A_k}_{k=1}^{∞} is a sequence of sets A_k ∈ A, then \bigcup_{k=1}^{∞} A_k ∈ A.
A Borel measure µ is called regular if any Borel set A ⊂ X is outer and inner regular.
Notation 1.16. We denote by M(I n ) the space of finite, signed regular Borel measures on I n .
Theorem 1.18. Hahn-Banach Theorem. [17] Let X be a real vector space, u a sublinear functional on X, M a subspace of X and h a linear functional on M such that

h(x) ≤ u(x) \quad \text{for every } x ∈ M.

Then there exists a linear functional Λ on X such that
• Λx = h(x) for x ∈ M,
• Λx ≤ u(x) for x ∈ X.
Theorem 1.20. Riesz Representation Theorem. [17] Every bounded linear functional Λ on C(I^n) can be represented by a unique measure µ ∈ M(I^n) as

Λh = \int_{I^n} h \, d\mu,

where h ∈ C(I^n).
Lemma 1.21. [13] Let σ be any continuous discriminatory function. Then, finite sums of the form (1.2) are dense in C(I^n). That is to say, given g ∈ C(I^n) and ε > 0 there is a finite sum G of the form (1.2) such that, for all x ∈ I^n,

|G(x) - g(x)| < ε.
Proof. Let us denote by S the set of continuous functions of the form (1.2). It is obvious that S
is a linear subspace of C(I n ). We want to prove that S is dense in C(I n ) so, by Definition 1.12,
we have to prove that S̄ = C(I n ), where S̄ denotes the closure of the corresponding set. Let
us prove it by contradiction.
Let us assume that S̄ ≠ C(I^n). Therefore, S̄ is a proper subspace of C(I^n) (that is to say, S̄ ≠ ∅ and S̄ ≠ C(I^n)). Moreover, as S̄ is the closure of S, S̄ is a closed subspace.
As S̄ is a closed proper subspace of C(I^n), applying the consequence of Hahn-Banach Theorem 1.18 sketched out in Corollary 1.19 we have that there exists a bounded linear functional L on C(I^n) such that L ≢ 0 but L(S) = 0.
Applying Riesz Representation Theorem 1.20 to the bounded linear functional L, we conclude that it is of the form

L(h) = \int_{I^n} h \, d\mu

for some µ ∈ M(I^n). As L(S) = 0, in particular \int_{I^n} σ(\langle w, x \rangle + b) \, d\mu = 0 for all w ∈ R^n and b ∈ R. Since σ is discriminatory, this implies that µ = 0 and therefore L ≡ 0, which contradicts L ≢ 0. We conclude that S̄ = C(I^n), that is, S is dense in C(I^n).
Theorem 1.25. [20] Let (X, A , µ) be a measure space. The set of all measurable simple
functions is dense in L∞ (X) (this is the set of bounded functions defined on X).
\lim_{\lambda \to +\infty} \sigma_\lambda(x) = \begin{cases} 1 & \text{if } \langle p, x \rangle + q > 0, \\ 0 & \text{if } \langle p, x \rangle + q < 0, \\ \sigma(r) & \text{if } \langle p, x \rangle + q = 0. \end{cases}
Defining

γ(x) = \begin{cases} 1 & \text{if } \langle p, x \rangle + q > 0, \\ 0 & \text{if } \langle p, x \rangle + q < 0, \\ \sigma(r) & \text{if } \langle p, x \rangle + q = 0, \end{cases}

we have that \lim_{\lambda \to +\infty} \sigma_\lambda(x) = γ(x). That is to say, all the functions σ_λ converge pointwise to the function γ. As σ_λ is measurable (we assume that σ is measurable and σ_λ is given by (1.5)), by Lebesgue's Dominated Convergence Theorem 1.23 we have that the following equality holds:

\lim_{\lambda \to +\infty} \int_{I^n} \sigma_\lambda \, d\mu = \int_{I^n} γ \, d\mu. \qquad (1.6)
Let H_{p,q} be the hyperplane

H_{p,q} := \{ x ∈ I^n \mid \langle p, x \rangle + q = 0 \}, \qquad (1.7)

and let

H_{p,q}^{+} := \{ x ∈ I^n \mid \langle p, x \rangle + q > 0 \} \qquad (1.8)

be the associated open half-space, so that γ = χ_{H_{p,q}^{+}} + σ(r) χ_{H_{p,q}}. Moreover, as

\int_{I^n} χ_A \, d\mu = \int_{A} d\mu = µ(A) \quad \text{with } A ⊆ I^n, \qquad (1.9)

we conclude that

\int_{I^n} γ \, d\mu = µ(H_{p,q}^{+}) + σ(r) µ(H_{p,q}). \qquad (1.10)
Taking into account (1.4), (1.6) and (1.10), we have that

0 = µ(H_{p,q}^{+}) + σ(r) µ(H_{p,q}), \quad ∀p ∈ R^n; \; q, r ∈ R.

Since σ is sigmoidal, σ(r) takes at least two different values as r varies, so both terms must vanish. Therefore,

µ(H_{p,q}^{+}) = µ(H_{p,q}) = 0, \quad ∀p ∈ R^n, \; q ∈ R. \qquad (1.11)
Fix p. Let us consider a bounded measurable function f and let us define the linear functional F given by

F(f) := \int_{I^n} f(\langle p, x \rangle) \, d\mu.

As f is bounded by definition and µ ∈ M(I^n), we have that F is a bounded functional on L^∞(R).
Let us consider that f is the indicator function of the interval [q, +∞). Then, we have that

F(χ_{[q,+∞)}) = \int_{I^n} χ_{[q,+∞)}(\langle p, x \rangle) \, d\mu.
Applying property (1.9) to the right-hand side of the previous equality we have that

F(χ_{[q,+∞)}) = \int_{\{x ∈ I^n \mid \langle p, x \rangle \ge q\}} d\mu = µ(\{x \mid \langle p, x \rangle = q\}) + µ(\{x \mid \langle p, x \rangle > q\}).
Notice that the sets \{x \mid \langle p, x \rangle = q\} and \{x \mid \langle p, x \rangle > q\}, according to the notation given in (1.7) and (1.8), are H_{p,-q} and H_{p,-q}^{+}, whose measure is 0 by (1.11). Therefore, we conclude that

F(χ_{[q,+∞)}) = 0. \qquad (1.12)
Let us consider now that f is the indicator function of the open interval (q, +∞). Arguing as in the case χ_{[q,+∞)}, we have that

F(χ_{(q,+∞)}) = 0. \qquad (1.13)
Taking into account the linearity of the functional F and the results (1.12) and (1.13), we have that F(f) = 0 for f the indicator function of any interval. For example, let us consider the interval (q_1, q_2) for q_1, q_2 ∈ R with q_1 < q_2. The indicator function of this interval can be rewritten as χ_{(q_1,q_2)} = χ_{(q_1,+∞)} - χ_{[q_2,+∞)}. Then, by the linearity of F we have that

F(χ_{(q_1,q_2)}) = F(χ_{(q_1,+∞)} - χ_{[q_2,+∞)}) = F(χ_{(q_1,+∞)}) - F(χ_{[q_2,+∞)}),

and applying (1.12) and (1.13), we conclude that F(χ_{(q_1,q_2)}) = 0. The argument for the remaining intervals is similar.
Simple functions are linear combinations of indicator functions (see Definition 1.24) so, by the linearity of F, we have that F(f) = 0 if f is a simple function. Moreover, as the set of measurable simple functions is dense in L^∞(R) (see Theorem 1.25), we can conclude that F ≡ 0. Then, in particular, if we take f(s) = cos(s) + i sin(s) we have that

\int_{I^n} \cos(\langle p, x \rangle) + i \sin(\langle p, x \rangle) \, d\mu = 0

for a fixed p ∈ R^n. As we have not considered any restriction over p, the previous equality holds for any p ∈ R^n, and by Definition 1.26 we have that the Fourier Transform of the measure µ is equal to 0. Applying Corollary 1.27 we obtain that µ = 0. We can conclude that σ is discriminatory.
We have proved Lemma 1.21 and Lemma 1.28. Therefore, now we can present the proof of
the main result, Theorem 1.13.
Proof of Theorem 1.13. In Lemma 1.28 we have proved that any bounded measurable sigmoidal
function σ is discriminatory. So, in particular, as all the continuous functions are measurable
(with Borel measure) we have that any continuous sigmoidal function is discriminatory. With
Lemma 1.21 we can conclude that finite sums of the form (1.2) are dense in C(I n ).
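To illustrate Theorem 1.13 numerically, the following is a small sketch in Python: it fits a finite sum of the form (1.2) to g(x) = sin(2πx) on I^1. The number of neurons, the random inner parameters and the least-squares fit of the outer weights are arbitrary illustrative choices, not part of the theorem or its proof.

import numpy as np

# Approximate g(x) = sin(2*pi*x) on I^1 = [0, 1] by a finite sum of the
# form (1.2) with m sigmoid neurons: random inner weights/biases, outer
# weights fitted by least squares (an illustrative procedure only).
rng = np.random.default_rng(0)
m = 50
w, b = rng.uniform(-10, 10, m), rng.uniform(-10, 10, m)
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

x = np.linspace(0.0, 1.0, 200)
Phi = sigma(np.outer(x, w) + b)        # Phi[i, j] = sigma(w_j * x_i + b_j)
g = np.sin(2 * np.pi * x)
w_out, *_ = np.linalg.lstsq(Phi, g, rcond=None)

G = Phi @ w_out                        # a function G of the form (1.2)
print(np.max(np.abs(G - g)))           # the error shrinks as m grows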
As we have seen, using one hidden layer we can solve problems that simple ANNs like
the Perceptron cannot solve. Moreover, if the problem can be described with a continuous
function, there exists an MLP (with certain characteristics) that can solve the problem quite
well. Then, it is natural to think that if we add more hidden layers, ANNs will be able to
solve more complex problems and to learn from data with several levels of abstraction. Deep
Neural Networks (DNNs) are neural networks that have at least two hidden layers. The set of
techniques used to design this type of ANNs is called Deep Learning (DL) and we will study it
in Chapter 2.
Chapter 2
Deep Learning
Deep Learning (DL) is the branch of Machine Learning that includes all the techniques used to
design Deep Neural Networks (ANNs with multiple hidden layers) that are able to learn from
data with several levels of abstraction.
Real data is usually structured in a hierarchical way (from small details to large structures).
Having several layers allows the network to divide the learning process into three different levels of abstraction:
• First hidden layers model low-level structures like lines and edges.
• Middle hidden layers work with the previous low-level structures to be able to model intermediate-level structures like squares.
• Last hidden layers and the output layer use the intermediate-level structures to model high-level structures like faces.
This hierarchical structure of DNNs helps them to improve their ability to generalize to other
data sets.
DL has improved the state-of-the-art in some fields like speech recognition, object detection
and drug discovery. To understand DNNs, in this chapter we introduce Feed-forward Neural
Networks with fully-connected layers (in what follows we refer to them as FNNs with fully-
connected layers). In the field of DL, an FNN with fully-connected layers is an MLP with at
least two hidden layers (however, some topics we explain, as the architecture or the training
process, can also be applied to MLPs with a unique hidden layer).
An FNN with fully-connected layers is organized in L layers: the neurons that belong to consecutive layers are fully connected, each neuron has a bias term and it applies an activation function. Let us use Notation 1.7. We denote by a_i^{[l]} the information stored in the i-th neuron of layer l.
The computations performed by a neuron j at layer l can be described as follows:
• Neuron j receives information from all the neurons of the previous layer l − 1. Taking into account the weights of the connections between these neurons and neuron j, we have that the information received by the neuron is

w_{ji}^{[l]} a_i^{[l-1]} \quad \text{for } i = 1, \dots, n_{l-1}.

• To this sum of weighted contributions, the bias term associated to the current neuron is added. We have

b_j^{[l]} + \sum_{i=1}^{n_{l-1}} w_{ji}^{[l]} a_i^{[l-1]}.

• Finally, the activation function A is applied, obtaining

A\left( b_j^{[l]} + \sum_{i=1}^{n_{l-1}} w_{ji}^{[l]} a_i^{[l-1]} \right).

This number is known as the activation or output of neuron j at layer l (we can also refer to it as the value of the neuron).
Let W^{[l]} ∈ R^{n_l × n_{l-1}} be the matrix whose element in row j and column i is w_{ji}^{[l]} and let b^{[l]} ∈ R^{n_l} be the vector whose j-th element is b_j^{[l]}. We can write the vector a^{[l]} ∈ R^{n_l}, whose j-th element is the value of neuron j at layer l denoted by a_j^{[l]}, as

a^{[l]} = A\left( W^{[l]} a^{[l-1]} + b^{[l]} \right) ∈ R^{n_l},

where we understand that the activation function A is applied to a vector in R^{n_l} componentwise:

A : R^{n_l} \to R^{n_l}, \quad z ≡ (z_j)_{j=1}^{n_l} \mapsto A(z) = (A(z_j))_{j=1}^{n_l}. \qquad (2.1)
The vector a[l] can be interpreted as the activation of all the neurons at layer l.
So, if we suppose that the input data is a vector x ∈ R^{n_1}, the vector of activations of each layer (and in particular the output of the network) can be obtained as

a^{[1]} = x ∈ R^{n_1}, \qquad a^{[l]} = A\left( W^{[l]} a^{[l-1]} + b^{[l]} \right) ∈ R^{n_l}, \quad l = 2, \dots, L. \qquad (2.2)
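As an illustration, a minimal sketch in Python of the forward pass (2.2); the sigmoid activation (2.4) and the layer sizes below are arbitrary choices for the example:

import numpy as np

def sigmoid(z):
    # Sigmoid activation, applied componentwise as in (2.1).
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, A=sigmoid):
    # Forward pass (2.2): a[1] = x, a[l] = A(W[l] a[l-1] + b[l]).
    a = x
    for W, b in zip(weights, biases):
        a = A(W @ a + b)
    return a  # a[L], the output of the network

# Example: a network with layer sizes n1 = 3, n2 = 4, n3 = 2 (arbitrary).
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]
print(forward(np.array([0.1, -0.2, 0.3]), weights, biases))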
A commonly used activation function is the sigmoid function

A(x) = σ(x) = \frac{1}{1 + e^{-x}}. \qquad (2.4)
Figure 2.1: On the left, ReLU function. In the middle, sigmoid function. On the right,
hyperbolic tangent function.
Example 2.2. [22] Let us suppose that we have a piece of land and we have to determine the
points that are oil drilling sites and the ones that are not. To solve this problem we use an
FNN with fully-connected layers.
The input of our ANN is a point in R2 (a point on the ground). We can take as output
one vector with two coordinates interpreted as follows: if the first coordinate is bigger than the
second one, the input point corresponds to an oil drilling site; otherwise, it is not. Our network
is a function g ≡ (g1 , g2 ) : R2 → R2 .
We take an FNN with fully-connected layers that has four layers (L = 4). The first layer
(l = 1) is the input one. It has two neurons (n1 = 2) because, as we have said, the input data
is a point in R2 . The fourth layer (l = 4) is the output layer, it has two neurons (n4 = 2)
because of the structure we have defined for the output of the network. The second and third
layer (l = 2 and l = 3, respectively) are the hidden layers. We build the second one with two
neurons (n2 = 2) and the third one with three (n3 = 3). We use the same matrix notation for
the weights and biases that we have used in the general case. As the activation function, we
take the sigmoid function σ given by (2.4). In Figure 2.2 we can see a representation of the
situation.
Figure 2.2: Graphic representation of the FNN with fully-connected layers used to solve a
problem about the detection of oil drilling sites.
The input point is a^{[1]} = x ∈ R^2. As the input is a point in R^2 represented by two neurons and the second layer has two neurons too, we have that W^{[2]} ∈ R^{2×2} and b^{[2]} ∈ R^2. The activation of the neurons of the second layer, a^{[2]}, is

a^{[2]} = σ\left( W^{[2]} x + b^{[2]} \right) ∈ R^2. \qquad (2.5)
The next layer that receives information is the third one. It has three neurons, so taking into account that the previous one (the second layer) has two, we have that W^{[3]} ∈ R^{3×2} and b^{[3]} ∈ R^3. The value of the neurons at the third layer is given by

a^{[3]} = σ\left( W^{[3]} a^{[2]} + b^{[3]} \right) \overset{(2.5)}{=} σ\left( W^{[3]} σ\left( W^{[2]} x + b^{[2]} \right) + b^{[3]} \right) ∈ R^3. \qquad (2.6)
The following layer is the output layer. This implies that the output of its neurons is the output of the network. As this last layer has two neurons and the third layer (the previous one) has three, W^{[4]} ∈ R^{2×3} and b^{[4]} ∈ R^2. The output is

a^{[4]} = σ\left( W^{[4]} a^{[3]} + b^{[4]} \right) \overset{(2.6)}{=} σ\left( W^{[4]} σ\left( W^{[3]} σ\left( W^{[2]} x + b^{[2]} \right) + b^{[3]} \right) + b^{[4]} \right) ∈ R^2.

Therefore, the network is the function

g : R^2 \to R^2, \quad x \mapsto σ\left( W^{[4]} σ\left( W^{[3]} σ\left( W^{[2]} x + b^{[2]} \right) + b^{[3]} \right) + b^{[4]} \right).
In this example we have obtained an expression to deduce if a point on the land is an oil
drilling site or not. But, how have we chosen the number of layers or the number of neurons per
layer? Our DNN depends on 23 parameters (the elements of the weight matrices and the bias
vectors), how do we give values to them? We answer these questions in the following sections.
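To make the parameter count concrete, a sketch in Python of this 2-2-3-2 network (the weight and bias values are random placeholders, since their fitting is discussed in the following sections):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Architecture of Example 2.2: n1 = 2, n2 = 2, n3 = 3, n4 = 2.
rng = np.random.default_rng(1)
W2, b2 = rng.standard_normal((2, 2)), rng.standard_normal(2)  # 4 + 2 parameters
W3, b3 = rng.standard_normal((3, 2)), rng.standard_normal(3)  # 6 + 3 parameters
W4, b4 = rng.standard_normal((2, 3)), rng.standard_normal(2)  # 6 + 2 parameters

def g(x):
    # g(x) = sigma(W4 sigma(W3 sigma(W2 x + b2) + b3) + b4).
    return sigmoid(W4 @ sigmoid(W3 @ sigmoid(W2 @ x + b2) + b3) + b4)

n_params = sum(p.size for p in [W2, b2, W3, b3, W4, b4])
print(n_params)        # 23, the number of parameters mentioned above
y = g(np.array([0.3, 0.7]))
print(y[0] > y[1])     # True would be interpreted as an oil drilling site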
To fit the weights and biases of the network we minimize a loss function; we can take, for example, the quadratic cost

f(W, b) = \frac{1}{M} \sum_{m=1}^{M} \left\| y^{\{m\}} - \hat{y}^{\{m\}} \right\|_2^2, \qquad (2.7)

where M is the size of the training set (number of training points), y^{\{m\}} denotes the target output for the m-th training data point x^{\{m\}}, \hat{y}^{\{m\}} is the output of the network when the input is the m-th training data point (that is, \hat{y}^{\{m\}} = a^{[L]}) and \| \cdot \|_2 is the Euclidean norm.
Remark 2.4. If we multiply the loss function by a scalar, the optimization problem does not change, but later computations may be easier. For example, we can take

f(W, b) = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{2} \left\| y^{\{m\}} - \hat{y}^{\{m\}} \right\|_2^2. \qquad (2.8)
Let us return to the example of the oil drilling sites given in Example 2.2. Notice that it is
relatively simple since it has only two hidden layers and nine neurons. However, to train the
network is a difficult optimization problem that consists in minimizing a non-convex function
over 23 parameters. Due to the difficulty of training DNNs, the beginnings of DL were hard
(this epoch is known as AI winter). Now, we are going to study how to solve this optimization
problem.
Training the network is therefore reduced to the minimization problem

\min_{W, b} f(W, b). \qquad (2.9)

To solve it we may use a classical optimization method known as Gradient Descent (or Steepest Descent).
Notation 2.5. To make it easier, instead of assuming that we have L − 1 matrices for the weights (W^{[2]} ∈ R^{n_2×n_1}, ..., W^{[L]} ∈ R^{n_L×n_{L-1}}) and L − 1 vectors for the biases (b^{[2]} ∈ R^{n_2}, ..., b^{[L]} ∈ R^{n_L}), we assume that the weights and biases are stored in a unique vector p ∈ R^s with

s = \sum_{l=2}^{L} n_l (n_{l-1} + 1).

Now, the loss function is denoted by f(p) := f(W, b) and the minimization problem (2.9) is rewritten as

\min_{p} f(p). \qquad (2.10)
Gradient Descent (GD) is a classical optimization method that consists in computing iteratively vectors p^1, p^2, ... ∈ R^s that converge to a vector p^min ∈ R^s that minimizes f(p). At each iteration k, we have to compute the gradient of the loss function at p^k and we have to move p^k in the direction of the descending gradient to obtain p^{k+1}. Let us see that this direction is the right one.
At each iteration k, we have to choose a perturbation Δp^k such that

f(p^k) ≥ f(p^k + Δp^k).

Then, p^{k+1} = p^k + Δp^k.
Assuming that Δp^k is small, we can use a Taylor series expansion to approximate f(p^k + Δp^k) as

f(p^k + Δp^k) ≈ f(p^k) + \sum_{r=1}^{s} \frac{\partial f}{\partial p_r}(p^k) \, Δp_r^k, \qquad (2.11)

where \partial f / \partial p_r denotes the partial derivative of f with respect to the r-th component of the vector p and Δp_r^k is the r-th component of Δp^k.
The previous formula (2.11) can be rewritten as

f(p^k + Δp^k) ≈ f(p^k) + \langle ∇f(p^k), Δp^k \rangle. \qquad (2.12)

By the Cauchy-Schwarz-Bunyakovski Inequality (Proposition 2.6),

|\langle ∇f(p^k), Δp^k \rangle| ≤ \|∇f(p^k)\| \|Δp^k\|. \qquad (2.13)

Equivalently,

-\|∇f(p^k)\| \|Δp^k\| ≤ \langle ∇f(p^k), Δp^k \rangle ≤ \|∇f(p^k)\| \|Δp^k\|.

It is obvious that the 'most negative' that \langle ∇f(p^k), Δp^k \rangle can be is -\|∇f(p^k)\| \|Δp^k\|. This happens when the equality holds in (2.13), that is, when ∇f(p^k) and Δp^k are linearly dependent. We conclude that Δp^k has to lie in the direction of -∇f(p^k). As (2.12) is an approximation for a small Δp^k, the step in that direction has to be small. So,

Δp^k = -η ∇f(p^k), \qquad (2.14)
where η is the small constant step size known as the learning rate. Since Δp^k = p^{k+1} - p^k, we can rewrite the last equality (2.14) as

p^{k+1} = p^k - η ∇f(p^k). \qquad (2.15)

This is the Gradient Descent update rule. We have proved that the direction of the descending gradient is the correct one to try to converge to the vector p^min that minimizes the loss function.
In practice, we choose an initial value p^0 for the weights and biases (notice that p^0 is a parameter not fitted during training) and we apply iteratively (2.15) until a stopping criterion is reached. Common stopping criteria are to iterate a certain number of times maxiter or to iterate until the distance between p^k and p^{k+1} is smaller than a given value tol.
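A minimal sketch in Python of this procedure (the test function and the hyperparameter values are arbitrary choices for the illustration):

import numpy as np

def gradient_descent(grad_f, p0, eta=0.1, maxiter=1000, tol=1e-8):
    # Gradient Descent: p^{k+1} = p^k - eta * grad f(p^k), see (2.15).
    # Stops after maxiter iterations or when ||p^{k+1} - p^k|| < tol.
    p = p0
    for _ in range(maxiter):
        p_next = p - eta * grad_f(p)
        if np.linalg.norm(p_next - p) < tol:
            return p_next
        p = p_next
    return p

# Illustration on f(p) = ||p||^2 / 2, whose gradient is p and minimizer is 0.
p_min = gradient_descent(lambda p: p, p0=np.array([1.0, -2.0]))
print(p_min)  # close to [0, 0]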
Notice that if we use GD to train a neural network, we have to compute ∇ f (pk ) at each
iteration. It is computationally expensive to calculate it on the entire training set, so it is
convenient to find less computationally expensive alternatives to Gradient Descent.
Stochastic Gradient (SG), also called Stochastic Gradient Descent (SGD), is a computa-
tionally cheaper alternative to Gradient Descent. However, unlike GD, in SG the reduction
of f (pk ) is not guaranteed at each iteration. The idea of this optimization algorithm is to
choose an approximation of ∇ f (pk ) to substitute it in the update rule (2.15). We have several
possibilities for this replacement.
The simplest way is to choose randomly (with a uniform distribution) an index m ∈ {1, ..., M} at each iteration and to move in the direction of -∇f_{\{x^{\{m\}}\}}(p^k), where f_{\{x^{\{m\}}\}} denotes the addend of the loss function corresponding to the m-th training point. Formula (2.15) is rewritten as

p^{k+1} = p^k - η ∇f_{\{x^{\{m\}}\}}(p^k). \qquad (2.16)

The index at each iteration is chosen by sampling with replacement.
Another option is similar to the one described above, but instead of choosing the index by sampling with replacement, it is chosen by sampling without replacement. When all the indexes have been used, it is said that an epoch has been completed (notice that an epoch consists of M iterations). In this case, the maximum number of epochs can be used as a stopping criterion.
Another alternative, closer to Gradient Descent, is to take N ≪ M and to choose randomly (with a uniform distribution) N indexes {α_1, ..., α_N} from {1, ..., M}. We move in the descending direction of the mean of the gradients of f_{\{x^{\{α_m\}}\}} for m ∈ {1, ..., N}. That is to say, (2.15) is replaced by

p^{k+1} = p^k - η \frac{1}{N} \sum_{m=1}^{N} ∇f_{\{x^{\{α_m\}}\}}(p^k). \qquad (2.17)

Notice that if we take N = 1 in (2.17) we obtain (2.16). So, in what follows, we are going to use Stochastic Gradient to refer to formula (2.17), which is more general.
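A sketch in Python of the mini-batch update (2.17); the least-squares problem used to exercise it is an arbitrary illustrative choice:

import numpy as np

def sgd(grad_sample, p0, M, N=32, eta=0.01, maxiter=2000, seed=0):
    # Mini-batch Stochastic Gradient (2.17): at each iteration, N indexes
    # are drawn uniformly from {0, ..., M-1} (with replacement) and p moves
    # along the mean of the corresponding per-sample gradients.
    # grad_sample(p, m) must return the gradient of f_{x^{m}} at p.
    rng = np.random.default_rng(seed)
    p = p0
    for _ in range(maxiter):
        batch = rng.integers(0, M, size=N)
        g = np.mean([grad_sample(p, m) for m in batch], axis=0)
        p = p - eta * g
    return p

# Illustration: least squares, f_{x^{m}}(p) = (1/2)(<x^{m}, p> - y^{m})^2.
rng = np.random.default_rng(1)
X, p_true = rng.standard_normal((200, 3)), np.array([1.0, -2.0, 0.5])
y = X @ p_true
grad = lambda p, m: (X[m] @ p - y[m]) * X[m]
print(sgd(grad, np.zeros(3), M=200))  # approaches p_true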
We have defined the learning rate η as a constant. However, it can depend on the iteration. If we denote by η_k the step size of iteration k, we can rewrite the update of the weights and biases for GD (2.15) and SG (2.17) as

p^{k+1} = p^k - η_k ∇f(p^k) \qquad \text{and} \qquad p^{k+1} = p^k - η_k \frac{1}{N} \sum_{m=1}^{N} ∇f_{\{x^{\{α_m\}}\}}(p^k), \qquad (2.18)

respectively.
Notation 2.9. [23] Let {a_n} and {b_n} be two sequences. We use a_n = O(b_n) (big O notation) to indicate that there exists a positive constant K such that a_n ≤ K b_n for all n.
As we have seen, Gradient Descent is a classical optimization method that we can use to solve the minimization problem (2.10) (related to the training process of a DNN), where f is the loss function. Remember that if we use GD, the update rule for p^k (the vector with the weights and biases of the ANN) is given by (2.18). If the loss function f is convex and L-smooth and we take η_k ≡ 1/L (in what follows we consider this value for the learning rate), the convergence rate of Gradient Descent is O(1/k), where k is the iteration number.
If we add the momentum p^k - p^{k-1} to the Gradient Descent update rule (2.18), we may accelerate it. The scheme that we obtain is known as Heavy Ball (HB) and it is given by

p^{k+1} = p^k - η_k ∇f(p^k) + µ (p^k - p^{k-1}), \qquad (2.19)

where µ > 0 is the momentum coefficient.
We can also accelerate GD using Nesterov momentum. In this case, the update rule has two steps:

q^{k+1} = p^k - η_k ∇f(p^k), \qquad p^{k+1} = q^{k+1} + µ (q^{k+1} - q^k). \qquad (2.20)

As in the case of GD and HB, it has a convergence rate of O(1/k).
We have seen that GD and the algorithms obtained by adding momentum to GD have the same convergence rate. However, there is empirical evidence that the convergence of GD is slower when momentum is not used. In Figure 2.3 we have the evolution of {f(p^k) - f(p^min)}_k when f is a convex quadratic function with Lipschitz constant 4 defined as

f(p) = \frac{1}{2} p^T A p - p^T d, \qquad (2.21)

where A ∈ R^{1000×1000} is the Laplacian of a cycle graph and d ∈ R^{1000} is a vector whose first component has value 1 and the remaining have value 0. The optimization algorithms are applied during 50000 iterations with η_k = 1/4. In dark green we have the evolution with GD and in red with GD+Momentum (Nesterov momentum is used). It is clearly seen that GD is accelerated when momentum is added.
If in (2.20) we take

µ = \frac{t_k - 1}{t_{k+1}} \qquad (2.22)

with t_k given by the recursive formula

t_0 = 1, \qquad t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2} \quad \text{for } k ≥ 0, \qquad (2.23)

we obtain

q^{k+1} = p^k - η_k ∇f(p^k), \qquad p^{k+1} = q^{k+1} + \frac{t_k - 1}{t_{k+1}} (q^{k+1} - q^k). \qquad (2.24)

This update rule gives us the optimization method known as Nesterov Accelerated Gradient (NAG). It has a convergence rate of O(1/k^2) (this rate is the optimal one for general convex smooth optimization problems).
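A sketch in Python of NAG (2.24), applied to a quadratic like (2.21); for the illustration the dimension is reduced from 1000 to 50 and a small multiple of the identity is added to the Laplacian so that the minimizer is unique (both are modifications introduced here only to keep the example simple):

import numpy as np

def nag(grad_f, p0, eta, maxiter=5000):
    # Nesterov Accelerated Gradient (2.24), with t_k updated by (2.23).
    p, q = p0.copy(), p0.copy()
    t = 1.0
    for _ in range(maxiter):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        q_next = p - eta * grad_f(p)
        p = q_next + (t - 1.0) / t_next * (q_next - q)
        q, t = q_next, t_next
    return q

# Quadratic like (2.21): A is the Laplacian of a cycle graph (n = 50 here),
# plus 0.01*I so that the quadratic has a unique minimizer.
n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A[0, -1] = A[-1, 0] = -1.0
A += 0.01 * np.eye(n)
d = np.zeros(n); d[0] = 1.0
grad = lambda p: A @ p - d          # gradient of f(p) = (1/2) p^T A p - p^T d
p = nag(grad, np.zeros(n), eta=1.0 / 4.01)
print(np.linalg.norm(grad(p)))      # decreases towards 0 as maxiter grows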
Figure 2.3: Evolution of {f (pk ) − f (pmin )}k when f is the quadratic function (2.21). This
image has been modified from [23].
Remark 2.10. In [26] it is said that the asymptotic limit of the momentum coefficient given by (2.22) is

\frac{k - 1}{k + 2}, \qquad (2.25)

so we can consider the expression of the limit instead of the original one.
The sequence {f (pk ) − f (pmin )}k converges monotonically to zero when GD (2.18), HB
(2.19) or GD+Nesterov momentum (2.20) are used. However, in the case of NAG (2.24),
it oscillates. In Figure 2.3 we can see the evolution of this sequence for GD (dark green),
GD+Momentum (red) and NAG (olive-green) when we take the quadratic function (2.21)
defined before. We can clearly see that NAG converges faster than the other two, but the
sequence {f (pk ) − f (pmin )}k oscillates.
To alleviate this phenomenon, the Adaptive Restart NAG (ARNAG) method can be used. It is obtained by changing the momentum coefficient properly. Its update rule is given by

q^{k+1} = p^k - η_k ∇f(p^k), \qquad p^{k+1} = q^{k+1} + \frac{m(k) - 1}{m(k) + 2} (q^{k+1} - q^k),

where m(1) = 1 and

m(k+1) = \begin{cases} m(k) + 1 & \text{if } f(p^{k+1}) ≤ f(p^k), \\ 1 & \text{otherwise.} \end{cases}
Notice that if the condition f (pk+1 ) ≤ f (pk ) holds, we have the NAG scheme (using the
momentum coefficient (2.25)). Otherwise, q k+1 − q k is multiplied by 0 and the algorithm is
restarted in the sense that the last point obtained with it is used as the starting point to apply
the algorithm again. That is to say, NAG (with momentum coefficient (2.25)) is applied until
f (pk+1 ) > f (pk ), then NAG starts again with the last point as the initial point.
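A sketch in Python of this adaptive restart strategy (the quadratic test function and the parameter values are arbitrary choices for the illustration):

import numpy as np

def arnag(f, grad_f, p0, eta, maxiter=1000):
    # Adaptive Restart NAG: the momentum counter m(k) grows while the loss
    # decreases and is reset to 1 whenever f(p^{k+1}) > f(p^k).
    p, q = p0.copy(), p0.copy()
    m = 1
    for _ in range(maxiter):
        q_next = p - eta * grad_f(p)
        p_next = q_next + (m - 1.0) / (m + 2.0) * (q_next - q)
        m = m + 1 if f(p_next) <= f(p) else 1  # restart condition
        p, q = p_next, q_next
    return p

# Illustration on a simple ill-conditioned quadratic (arbitrary example).
A = np.diag([1.0, 100.0])
f = lambda p: 0.5 * p @ A @ p
p = arnag(f, lambda p: A @ p, np.array([1.0, 1.0]), eta=1.0 / 100.0)
print(f(p))  # close to 0, the minimum value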
Scheduled Restart NAG (SRNAG) is another scheme that uses NAG and a restart strategy (different from the one used in ARNAG). In this case, the interval of iterations (0, maxiter] is divided into a few intervals {I_β}_β such that (0, maxiter] = \bigsqcup_β I_β. To each interval I_β we associate a number F_β known as the restart frequency. Inside each interval, NAG (using an appropriate momentum coefficient) is applied until the iteration number k satisfies k ≡ 0 mod F_β; then the momentum is restarted and NAG is applied again with the last point as the initial point. The scheme for each I_β can be written as

q^{k+1} = p^k - η_k ∇f(p^k), \qquad p^{k+1} = q^{k+1} + \frac{k \bmod F_β}{(k \bmod F_β) + 3} (q^{k+1} - q^k). \qquad (2.26)
If the loss function f satisfies Polyak-Łojasiewicz Condition, ARNAG and SRNAG have
linear convergence.
In Figure 2.3 we can see the evolution of {f (pk ) − f (pmin )}k for the optimization methods
studied so far, when f is the quadratic function (2.21). It can be seen that ARNAG (in purple)
and SRNAG (in dark blue) converge faster than the others.
Nesterov Accelerated Stochastic Gradient Descent (NASGD) is obtained replacing the full gradient in NAG (2.24) by the mini-batch approximation:

q^{k+1} = p^k - η_k \frac{1}{N} \sum_{m=1}^{N} ∇f_{\{x^{\{α_m\}}\}}(p^k), \qquad p^{k+1} = q^{k+1} + \frac{t_k - 1}{t_{k+1}} (q^{k+1} - q^k). \qquad (2.27)

Theorem 2.12. [23] Let f be a convex L-smooth function and let {p^k}_{k≥0} be the sequence obtained with the NASGD algorithm (2.27) with η_k ≡ η ≤ 1/L. Then,

E\left[ f(p^k) - f(p^min) \right] = O(k),

where p^min denotes the minimizer of function f and the expectation is taken over the random mini-batch samples.
According to Theorem 2.12, NASGD accumulates error even when the function f is convex
and L-smooth. To prove it we need to introduce some definitions and results (we prove some
of them).
We consider that the loss function f is convex and L-smooth. Let us suppose that we have computed p^k; then the update rule of Gradient Descent with learning rate η = 1/r can be obtained with the minimization of

Q_r(q, p^k) := \langle q - p^k, ∇f(p^k) \rangle + \frac{r}{2} \|q - p^k\|_2^2. \qquad (2.28)
It can be proved that

Q_r(q^{k+1}, p^k) - \min_q Q_r(q, p^k) = \frac{\|Λ^k - ∇f(p^k)\|^2}{2r}, \qquad (2.29)

where Λ^k is used to denote the expression of the gradient with mini-batch, that is,

Λ^k := \frac{1}{N} \sum_{m=1}^{N} ∇f_{\{x^{\{α_m\}}\}}(p^k).
If we assume that Stochastic Gradient (with mini-batch) has bounded variance, we obtain the Stochastic Gradient Rule R_δ:

E\left[ Q_r(q^{k+1}, p^k) - \min_q Q_r(q, p^k) \;\middle|\; χ^k \right] ≤ δ. \qquad (2.30)
Remark 2.13. In formula (2.30) we take the expectation conditioned to the σ-algebra of the
weights and biases that have already been computed (we have used χk to denote this σ-algebra).
In what follows, although it is not written explicitly, we consider this conditional expectation.
If we consider (2.28) and (2.29), we have that the update rule of NASGD given by (2.27) can be formulated as

q^{k+1} ≈ \operatorname{argmin}_q Q_r(q, p^k) \text{ with rule } R_δ, \qquad p^{k+1} = q^{k+1} + \frac{t_k - 1}{t_{k+1}} (q^{k+1} - q^k). \qquad (2.31)

In what follows we also use the auxiliary sequence obtained when the minimization is exact, that is,

\tilde{q}^{k+1} = \operatorname{argmin}_q Q_r(q, p^k), \qquad p^{k+1} = q^{k+1} + \frac{t_k - 1}{t_{k+1}} (q^{k+1} - q^k). \qquad (2.32)
Lemma 2.15. Schwarz Inequality. [23] Let a and b be two vectors and let λ > 0. Then,

\langle a, b \rangle ≤ \frac{\|a\|_2^2}{2λ} + \frac{λ}{2} \|b\|_2^2.
Lemma 2.16. [23] Let a and b be two vectors. Then, we have the following equality:

\|a\|_2^2 + 2 \langle a, b \rangle = \|a + b\|_2^2 - \|b\|_2^2.
Lemma 2.17. [28] Let h be a differentiable function such that dom(h) is a convex set. Then h is convex if and only if

h(y) ≥ h(x) + \langle ∇h(x), y - x \rangle \quad ∀x, y ∈ dom(h).
Lemma 2.19. [23] Let h be ν-strongly convex and let us denote the minimizer of h by x^min. Then,

h(x) - h(x^min) ≥ \frac{ν}{2} \|x - x^min\|_2^2 \quad ∀x ∈ dom(h).
Lemma 2.20. [23] Let h be an L-smooth function. Then,

h(y) ≤ h(x) + \langle ∇h(x), y - x \rangle + \frac{L}{2} \|y - x\|_2^2 \quad ∀x, y ∈ dom(h).
Lemma 2.21. [23] If r > 0, then

E\left[ \|q^{k+1} - \tilde{q}^{k+1}\|_2^2 \right] ≤ \frac{2δ}{r}.
Taking into account (2.33), (2.34) and that r > 0 by hypothesis, we obtain that

E\left[ \|q^{k+1} - \tilde{q}^{k+1}\|_2^2 \right] ≤ \frac{2δ}{r}.
Lemma 2.22. [23] If the constants r and L satisfy the relation r > L, then

E\left[ f(\tilde{q}^{k+1}) + \frac{r}{2} \|\tilde{q}^{k+1} - p^k\|_2^2 - f(q^{k+1}) - \frac{r}{2} \|q^{k+1} - p^k\|_2^2 \right] ≥ -τδ - \frac{r - L}{2} E\left[ \|p^k - \tilde{q}^{k+1}\|_2^2 \right],

where τ := 1 + \frac{L^2}{r(r - L)}.
Proof. The proof of this lemma, that we can find in [23], follows from Lemma 2.17, the Stochas-
tic Gradient Rule given by (2.30), Lemma 2.15 (Schwarz Inequality) and Lemma 2.21.
Lemma 2.23. [23] If the constants r and L satisfy r > L, then we have the following bounds:

E\left[ f(q^k) - f(q^{k+1}) \right] ≥ \frac{r}{2} E\left[ \|p^k - q^{k+1}\|_2^2 \right] + r E\left[ \langle p^k - q^k, \tilde{q}^{k+1} - p^k \rangle \right] - τδ, \qquad (2.35a)

E\left[ f(q^min) - f(q^{k+1}) \right] ≥ \frac{r}{2} E\left[ \|p^k - q^{k+1}\|_2^2 \right] + r E\left[ \langle p^k - q^min, \tilde{q}^{k+1} - p^k \rangle \right] - τδ. \qquad (2.35b)

In both formulas we define τ := 1 + \frac{L^2}{r(r - L)}. In the second formula we have used q^min to denote the minimizer of f.
Proof. Notice that the second addend of Q_r is always positive (r > 0), so to minimize Q_r, the first addend has to be as negative as possible. In the justification of the update rule of GD we have seen that, using the Cauchy-Schwarz-Bunyakovski Inequality (Proposition 2.6), the first addend \langle q - p^k, ∇f(p^k) \rangle is as negative as possible when q - p^k is in the direction of -∇f(p^k). Taking r > 0 properly, we have that

\tilde{q}^{k+1} - p^k = -\frac{1}{r} ∇f(p^k),

where we have taken into account the definition of \tilde{q}^{k+1}. Isolating the gradient in the previous identity we obtain

∇f(p^k) = r (p^k - \tilde{q}^{k+1}). \qquad (2.36)
As f is L-smooth, applying Lemma 2.20 we have that

f(\tilde{q}^{k+1}) ≤ f(p^k) + \langle ∇f(p^k), \tilde{q}^{k+1} - p^k \rangle + \frac{L}{2} \|\tilde{q}^{k+1} - p^k\|_2^2.

Multiplying by -1 and using (2.36), we obtain

-f(\tilde{q}^{k+1}) ≥ -f(p^k) - \langle r(p^k - \tilde{q}^{k+1}), \tilde{q}^{k+1} - p^k \rangle - \frac{L}{2} \|\tilde{q}^{k+1} - p^k\|_2^2. \qquad (2.37)

As f is convex, Lemma 2.17 gives

f(q^k) ≥ f(p^k) + \langle ∇f(p^k), q^k - p^k \rangle,

and using (2.36) again,

f(q^k) ≥ f(p^k) + \langle r(p^k - \tilde{q}^{k+1}), q^k - p^k \rangle. \qquad (2.38)

Adding (2.37) and (2.38), we have that

f(q^k) - f(\tilde{q}^{k+1}) ≥ -\langle r(p^k - \tilde{q}^{k+1}), \tilde{q}^{k+1} - p^k \rangle - \frac{L}{2} \|\tilde{q}^{k+1} - p^k\|_2^2 + \langle r(p^k - \tilde{q}^{k+1}), q^k - p^k \rangle,

that can be rewritten as

f(q^k) - f(\tilde{q}^{k+1}) ≥ r \|\tilde{q}^{k+1} - p^k\|_2^2 - \frac{L}{2} \|\tilde{q}^{k+1} - p^k\|_2^2 + r \langle p^k - \tilde{q}^{k+1}, q^k - p^k \rangle.

Joining terms we have

f(q^k) - f(\tilde{q}^{k+1}) ≥ \left( r - \frac{L}{2} \right) \|\tilde{q}^{k+1} - p^k\|_2^2 + r \langle p^k - \tilde{q}^{k+1}, q^k - p^k \rangle.

Using this last inequality and basic properties of the expectation we obtain

E\left[ f(q^k) \right] - E\left[ f(\tilde{q}^{k+1}) \right] ≥ \left( r - \frac{L}{2} \right) E\left[ \|\tilde{q}^{k+1} - p^k\|_2^2 \right] + r E\left[ \langle p^k - \tilde{q}^{k+1}, q^k - p^k \rangle \right]. \qquad (2.39)
As we assume that r > L, we can use Lemma 2.22. Using basic properties of the expectation we rewrite the formula of this lemma as

E\left[ f(\tilde{q}^{k+1}) \right] + \frac{r}{2} E\left[ \|\tilde{q}^{k+1} - p^k\|_2^2 \right] - E\left[ f(q^{k+1}) \right] - \frac{r}{2} E\left[ \|q^{k+1} - p^k\|_2^2 \right] ≥ -τδ - \frac{r - L}{2} E\left[ \|p^k - \tilde{q}^{k+1}\|_2^2 \right], \qquad (2.40)

where τ := 1 + \frac{L^2}{r(r - L)}.
Adding (2.39) and (2.40), and using basic properties of the expectation and the scalar product we obtain

E\left[ f(q^k) - f(q^{k+1}) \right] ≥ \frac{r}{2} E\left[ \|p^k - q^{k+1}\|_2^2 \right] + r E\left[ \langle p^k - q^k, \tilde{q}^{k+1} - p^k \rangle \right] - τδ,

which is the inequality (2.35a) of the statement.
As we have seen before, f is convex and differentiable and its domain is convex. Applying Lemma 2.17 with q^min, p^k ∈ dom(f) we have that

f(q^min) ≥ f(p^k) + \langle r(p^k - \tilde{q}^{k+1}), q^min - p^k \rangle. \qquad (2.41)
Adding (2.37) and (2.41), we have that

f(q^min) - f(\tilde{q}^{k+1}) ≥ -\langle r(p^k - \tilde{q}^{k+1}), \tilde{q}^{k+1} - p^k \rangle - \frac{L}{2} \|\tilde{q}^{k+1} - p^k\|_2^2 + \langle r(p^k - \tilde{q}^{k+1}), q^min - p^k \rangle.

The previous inequality can be rewritten as

f(q^min) - f(\tilde{q}^{k+1}) ≥ r \|\tilde{q}^{k+1} - p^k\|_2^2 - \frac{L}{2} \|\tilde{q}^{k+1} - p^k\|_2^2 + r \langle p^k - \tilde{q}^{k+1}, q^min - p^k \rangle,

and joining terms we get

f(q^min) - f(\tilde{q}^{k+1}) ≥ \left( r - \frac{L}{2} \right) \|\tilde{q}^{k+1} - p^k\|_2^2 + r \langle p^k - \tilde{q}^{k+1}, q^min - p^k \rangle. \qquad (2.42)
Adding (2.42) and the expression in Lemma 2.22 (we can apply it because we assume r > L; in particular, we use the expression (2.40), that is equivalent to the one in the statement of the lemma), and using basic properties of the expectation and the scalar product we have that

E\left[ f(q^min) - f(q^{k+1}) \right] ≥ \frac{r}{2} E\left[ \|p^k - q^{k+1}\|_2^2 \right] + r E\left[ \langle p^k - q^min, \tilde{q}^{k+1} - p^k \rangle \right] - τδ,

where τ := 1 + \frac{L^2}{r(r - L)}. This is the inequality (2.35b) of the statement.
Before proving Theorem 2.12, we restate it. Notice that in the update rule of NASGD (2.27), the approximation of the gradient is in the first step. Therefore, the following are equivalent:
• Study the sequence {p^k}_{k≥0} and prove E[f(p^k) - f(p^min)] = O(k).
• Study the sequence {q^k}_{k≥0} and prove E[f(q^k) - f(q^min)] = O(k).
Moreover, if we define η := 1/r, where η is the step size or learning rate, we can substitute η_k ≡ η ≤ 1/L by r > L (observe that the second condition is more restrictive).
Taking into account the previous considerations, we can restate Theorem 2.12 as follows:

Theorem 2.24. Restated version of Theorem 2.12. [23] Let f be a convex L-smooth function. Let {q^k}_{k≥0} be the sequence obtained with the NASGD algorithm (2.27) and let r and L satisfy r > L. Then,

E\left[ f(q^k) - f(q^min) \right] = O(k),

where q^min denotes the minimizer of function f and the expectation is taken over the random mini-batch samples.
Proof of Theorem 2.24 or, equivalently, of Theorem 2.12. First of all, let us define

F^k := E\left[ f(q^k) - f(q^min) \right]. \qquad (2.43)

Notice that if we use the linearity of the expectation and the previous notation, we have that

E\left[ f(q^k) - f(q^{k+1}) \right] = E\left[ f(q^k) - f(q^min) + f(q^min) - f(q^{k+1}) \right] = E\left[ f(q^k) - f(q^min) \right] - E\left[ f(q^{k+1}) - f(q^min) \right] = F^k - F^{k+1}.
Using this fact, we can rewrite inequality (2.35a) of Lemma 2.23 (notice that we can use Lemma 2.23 because we suppose that r > L) as

F^k - F^{k+1} ≥ \frac{r}{2} E\left[ \|p^k - q^{k+1}\|_2^2 \right] + r E\left[ \langle p^k - q^k, \tilde{q}^{k+1} - p^k \rangle \right] - τδ. \qquad (2.44)

Using the notation introduced in (2.43) we can also rewrite inequality (2.35b) of Lemma 2.23 as

-F^{k+1} ≥ \frac{r}{2} E\left[ \|p^k - q^{k+1}\|_2^2 \right] + r E\left[ \langle p^k - q^min, \tilde{q}^{k+1} - p^k \rangle \right] - τδ. \qquad (2.45)
Remember that t_k (with it we define the non-constant factor that multiplies the momentum) is given by formula (2.23). Notice that t_k ≥ 1 for any k ≥ 0. In effect, t_0 = 1 ≥ 1 and, for k > 0,

t_k = \frac{1 + \sqrt{1 + 4 t_{k-1}^2}}{2} > \frac{1 + 1}{2} = 1.

Therefore t_k - 1 ≥ 0 and, multiplying both sides of (2.44) by t_k - 1, we have the following inequality:

(t_k - 1) \left[ F^k - F^{k+1} \right] ≥ (t_k - 1) \left[ \frac{r}{2} E\left[ \|p^k - q^{k+1}\|_2^2 \right] + r E\left[ \langle p^k - q^k, \tilde{q}^{k+1} - p^k \rangle \right] - τδ \right]. \qquad (2.46)
Adding (2.45) and (2.46) we have that

(t_k - 1) F^k - t_k F^{k+1} ≥ \frac{r}{2} t_k E\left[ \|p^k - q^{k+1}\|_2^2 \right] + r E\left[ \langle (p^k - q^k)(t_k - 1) + (p^k - q^min), \tilde{q}^{k+1} - p^k \rangle \right] - t_k τδ. \qquad (2.47)

As r > 0 (observe that r > L and, as L is a Lipschitz constant, it is positive, so r > 0), we can multiply (2.47) by 2t_k/r preserving the inequality. Moreover, from the recursion (2.23) we have that (2t_k - 1)^2 = 1 + 4 t_{k-1}^2, that is, t_k^2 - t_k = t_{k-1}^2.
So, t_k(t_k - 1) = t_{k-1}^2 for k > 0. Using this expression we can rewrite (2.47) as

\frac{2}{r} \left[ t_{k-1}^2 F^k - t_k^2 F^{k+1} \right] ≥ E\left[ \|t_k q^{k+1} - t_k p^k\|_2^2 \right] + 2 t_k E\left[ \langle \tilde{q}^{k+1} - p^k, t_k p^k - (t_k - 1) q^k - q^min \rangle \right] - \frac{2 t_k^2 τδ}{r}. \qquad (2.48)
Let us study the right-hand side of the previous inequality. It is obvious that

E\left[ \|t_k q^{k+1} - t_k p^k\|_2^2 \right] + 2 t_k E\left[ \langle \tilde{q}^{k+1} - p^k, t_k p^k - (t_k - 1) q^k - q^min \rangle \right] - \frac{2 t_k^2 τδ}{r} =
= E\left[ \|t_k q^{k+1} - t_k p^k\|_2^2 \right] + 2 t_k E\left[ \langle \tilde{q}^{k+1} - p^k + q^{k+1} - q^{k+1}, t_k p^k - (t_k - 1) q^k - q^min \rangle \right] - \frac{2 t_k^2 τδ}{r} =
= E\left[ \|t_k q^{k+1} - t_k p^k\|_2^2 \right] + 2 t_k E\left[ \langle q^{k+1} - p^k, t_k p^k - (t_k - 1) q^k - q^min \rangle \right] +
+ 2 t_k E\left[ \langle \tilde{q}^{k+1} - q^{k+1}, t_k p^k - (t_k - 1) q^k - q^min \rangle \right] - \frac{2 t_k^2 τδ}{r}. \qquad (2.49)
If we take a = t_k (q^{k+1} - p^k) and b = t_k p^k - (t_k - 1) q^k - q^min in Lemma 2.16, we have that

\|t_k q^{k+1} - t_k p^k\|_2^2 + 2 t_k \langle q^{k+1} - p^k, t_k p^k - (t_k - 1) q^k - q^min \rangle = \|t_k q^{k+1} - (t_k - 1) q^k - q^min\|_2^2 - \|t_k p^k - (t_k - 1) q^k - q^min\|_2^2.

Taking into account this expression we have the following equality for the expectation:

E\left[ \|t_k q^{k+1} - t_k p^k\|_2^2 \right] + E\left[ 2 t_k \langle q^{k+1} - p^k, t_k p^k - (t_k - 1) q^k - q^min \rangle \right] =
= E\left[ \|t_k q^{k+1} - (t_k - 1) q^k - q^min\|_2^2 \right] - E\left[ \|t_k p^k - (t_k - 1) q^k - q^min\|_2^2 \right]. \qquad (2.50)
Using (2.50), we have that expression (2.49) (remember that it is the right-hand side of (2.48)) can be bounded by applying the Schwarz Inequality (Lemma 2.15).
Notice that we have multiplied the Schwarz Inequality by 2t_k, but as 2t_k > 0, the inequality holds. Using the expression we have just obtained, we have the following relation for the expectation:

E\left[ 2 t_k \langle q^{k+1} - \tilde{q}^{k+1}, t_{k-1} q^k - (t_{k-1} - 1) q^{k-1} - q^min \rangle \right] ≤
≤ t_k^2 E\left[ \|q^{k+1} - \tilde{q}^{k+1}\|_2^2 \right] + E\left[ \|t_{k-1} q^k - (t_{k-1} - 1) q^{k-1} - q^min\|_2^2 \right].
Applying Lemma 2.21 (remember that r > 0) to the first addend of the right-hand side of the previous expression and defining

R := \sup_k \left\| q^k - \frac{t_{k-1} - 1}{t_{k-1}} q^{k-1} - \frac{1}{t_{k-1}} q^min \right\|_2

for the second addend, we have that the last inequality can be reformulated as

E\left[ 2 t_k \langle q^{k+1} - \tilde{q}^{k+1}, t_{k-1} q^k - (t_{k-1} - 1) q^{k-1} - q^min \rangle \right] ≤ t_k^2 \frac{2δ}{r} + t_{k-1}^2 R^2. \qquad (2.54)
Applying the bound given in (2.54) to the right-hand side of inequality (2.53) we obtain that

u^{k+1} + \frac{2 t_k^2}{r} F^{k+1} ≤ u^k + \frac{2 t_{k-1}^2}{r} F^k + \frac{2 t_k^2 δ}{r} + t_{k-1}^2 R^2 + \frac{2 t_k^2 τδ}{r}.

Defining

ξ_k := u^k + \frac{2 t_{k-1}^2}{r} F^k \qquad \text{and} \qquad τ^* := 1 + τ,

we can rewrite the previous inequality as

ξ_{k+1} ≤ ξ_k + \frac{2 τ^* δ}{r} t_k^2 + t_{k-1}^2 R^2.
According to [23], this formula can be rewritten as

ξ_{k+1} ≤ \left( \frac{2 τ^* δ}{r} + R^2 \right) \frac{(k+1)^3}{3}

and, in addition, we have that

ξ_{k+1} ≥ \frac{(k+1)^2 F^{k+1}}{4}.

With these bounds we can deduce, taking into account the definition of F^k (2.43), that

E\left[ f(q^k) - f(q^min) \right] = O(k),

as we wanted to prove.
We have proved that NASGD accumulates error even when we take a function f that is
convex and L-smooth. So, if we try to accelerate SG with non-constant Nesterov momentum
(this is what we were trying to do with NASGD), we do not obtain good results. Recently,
a new algorithm to speed up SG has been introduced in [23]. It is obtained using SRNAG
(we have seen its expression for GD in (2.26)) with SG. This new scheme is called Scheduled
Restart Stochastic Gradient Descent (SRSGD). According to [23] (article where the algorithm
is explained), the advantages of SRSGD can be summarized in four points:
• SRSGD helps to reduce overfitting (this concept will be defined in Section 2.3) when the
DNN has many hidden layers.
• It is easy to implement SRSGD, only few lines of the implementation of SG have to be
modified.
The update rule of SRSGD, which is obtained replacing ∇f(p^k) by \frac{1}{N} \sum_{m=1}^{N} ∇f_{\{x^{\{α_m\}}\}}(p^k) in (2.26), is given by

q^{k+1} = p^k - η_k \frac{1}{N} \sum_{m=1}^{N} ∇f_{\{x^{\{α_m\}}\}}(p^k), \qquad p^{k+1} = q^{k+1} + \frac{k \bmod F_β}{(k \bmod F_β) + 3} (q^{k+1} - q^k).
To apply the previous optimization methods we need the partial derivatives of the loss function with respect to every weight and bias; Back Propagation is the algorithm used to compute them. Let f̃ denote the part of the loss function associated to one fixed training point x^{\{m\}}, that is,

\tilde{f} = \frac{1}{2} \left\| y^{\{m\}} - a^{[L]} \right\|_2^2, \qquad (2.55)

and let z^{[l]} ∈ R^{n_l} be the vector of weighted inputs at layer l:

z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \quad l = 2, \dots, L, \qquad (2.56)

so that

a^{[l]} = A(z^{[l]}), \quad l = 2, \dots, L. \qquad (2.57)

We use ξ_j^{[l]} to denote the partial derivative of f̃ with respect to the weighted input for neuron j at layer l, that is,

ξ_j^{[l]} = \frac{\partial \tilde{f}}{\partial z_j^{[l]}} \quad \text{for } 1 ≤ j ≤ n_l, \; 2 ≤ l ≤ L. \qquad (2.58)

It is usually called the error in neuron j at layer l. This name is due to the fact that the loss function can only reach a minimum if all its derivatives are zero. In [22], it is recommended to think about it as a measure of the sensitivity of the loss function to the weighted input for neuron j at layer l.
The main result needed to understand Back Propagation is the next lemma.
Lemma 2.28. [22] Let us consider that we have a general Feed-forward Neural Network with fully-connected layers and that the loss function is the quadratic cost function given by (2.55). Then,

ξ^{[l]} = \begin{cases} A'(z^{[l]}) ⊗ \left( (W^{[l+1]})^T ξ^{[l+1]} \right) & \text{for } l = 2, \dots, L-1, \\ A'(z^{[L]}) ⊗ (a^{[L]} - y^{\{m\}}) & \text{for } l = L. \end{cases} \qquad (2.59a)

\frac{\partial \tilde{f}}{\partial b_j^{[l]}} = ξ_j^{[l]} \quad \text{for } l = 2, \dots, L. \qquad (2.59b)

\frac{\partial \tilde{f}}{\partial w_{ji}^{[l]}} = ξ_j^{[l]} a_i^{[l-1]} \quad \text{for } l = 2, \dots, L. \qquad (2.59c)

We have used ⊗ to denote the componentwise multiplication.
Remark 2.29. We use (·)' to denote the derivative with respect to z^{[l]} (l = 2, ..., L). In the previous lemma we have A'(z^{[l]}) where z^{[l]} ∈ R^{n_l}. Taking into account the way in which A is applied to a vector (see (2.1)), we have that

A'(z^{[l]}) = \frac{\partial A(z^{[l]})}{\partial z^{[l]}} = \left( A'(z_1^{[l]}), \dots, A'(z_{n_l}^{[l]}) \right)^T.
Proof of Lemma 2.28. The proof of this lemma is based on the chain rule for derivatives. Let us start proving (2.59a). For l = L and for each j = 1, ..., n_L, using (2.58) and the chain rule, we have that

ξ_j^{[L]} = \frac{\partial \tilde{f}}{\partial z_j^{[L]}} = \frac{\partial \tilde{f}}{\partial a_j^{[L]}} \frac{\partial a_j^{[L]}}{\partial z_j^{[L]}}. \qquad (2.60)
The expression of f̃ is given in (2.55), so

\frac{\partial \tilde{f}}{\partial a_j^{[L]}} = \frac{\partial \left( \frac{1}{2} \|y^{\{m\}} - a^{[L]}\|_2^2 \right)}{\partial a_j^{[L]}} = \frac{\partial \left( \frac{1}{2} \sum_{n=1}^{n_L} \left( y_n^{\{m\}} - a_n^{[L]} \right)^2 \right)}{\partial a_j^{[L]}},

where we have applied the definition of the Euclidean norm for a vector. Notice that if we compute the partial derivative, all the addends vanish except the j-th one. Observe that it is here (in the process of computing the derivative of the j-th term) where the constant 1/2 is useful; thanks to it we avoid having the constant 2 multiplying the derivative. We conclude that

\frac{\partial \tilde{f}}{\partial a_j^{[L]}} = a_j^{[L]} - y_j^{\{m\}}. \qquad (2.61)
Taking into account the relation between a^{[L]} and z^{[L]} given in (2.57), we have that

\frac{\partial a_j^{[L]}}{\partial z_j^{[L]}} = A'(z_j^{[L]}). \qquad (2.62)

Replacing in (2.60) the corresponding expressions obtained in (2.61) and (2.62), we get

ξ_j^{[L]} = \left( a_j^{[L]} - y_j^{\{m\}} \right) A'(z_j^{[L]}).

The last equality holds for any j = 1, ..., n_L, so using the componentwise product we obtain (2.59a) for l = L.
Let us consider now l = 2, ..., L − 1. Applying expression (2.58) and using the chain rule, taking into account that the output of the neural network is computed forwards (that is to say, z^{[l+1]} is obtained from z^{[l]}), we have that

ξ_j^{[l]} = \frac{\partial \tilde{f}}{\partial z_j^{[l]}} = \sum_{n=1}^{n_{l+1}} \frac{\partial \tilde{f}}{\partial z_n^{[l+1]}} \frac{\partial z_n^{[l+1]}}{\partial z_j^{[l]}} = \sum_{n=1}^{n_{l+1}} ξ_n^{[l+1]} \frac{\partial z_n^{[l+1]}}{\partial z_j^{[l]}}. \qquad (2.63)

If we differentiate with respect to z_j^{[l]} we have that

\frac{\partial z_n^{[l+1]}}{\partial z_j^{[l]}} = \frac{\partial}{\partial z_j^{[l]}} \left[ \sum_{r=1}^{n_l} w_{nr}^{[l+1]} A(z_r^{[l]}) + b_n^{[l+1]} \right] = w_{nj}^{[l+1]} A'(z_j^{[l]}) \quad \text{for } n = 1, \dots, n_{l+1}. \qquad (2.64)

Replacing (2.64) in (2.63) we obtain

ξ_j^{[l]} = A'(z_j^{[l]}) \sum_{n=1}^{n_{l+1}} w_{nj}^{[l+1]} ξ_n^{[l+1]}.

This last expression holds for j = 1, ..., n_l and for l = 2, ..., L − 1, so, taking into account how the componentwise product works, we have proved (2.59a) for l = 2, ..., L − 1.
Let us prove (2.59b). Let l = 2, ..., L. Using the chain rule we have that

\frac{\partial \tilde{f}}{\partial b_j^{[l]}} = \frac{\partial \tilde{f}}{\partial z_j^{[l]}} \frac{\partial z_j^{[l]}}{\partial b_j^{[l]}}. \qquad (2.65)

Notice that z^{[l-1]} does not depend on b_j^{[l]} (remember that the expressions of z^{[l]} for l = 2, ..., L are computed forwards). Therefore, we conclude that the derivative with respect to b_j^{[l]} of the last expression is

\frac{\partial z_j^{[l]}}{\partial b_j^{[l]}} = 1. \qquad (2.66)

From (2.66) and the expression of ξ_j^{[l]} given in (2.58) we have that (2.65) can be rewritten as

\frac{\partial \tilde{f}}{\partial b_j^{[l]}} = ξ_j^{[l]},

which gives us (2.59b).
Finally, let us prove (2.59c). Let l = 2, ..., L. Applying the chain rule we have that

\frac{\partial \tilde{f}}{\partial w_{ji}^{[l]}} = \sum_{n=1}^{n_l} \frac{\partial \tilde{f}}{\partial z_n^{[l]}} \frac{\partial z_n^{[l]}}{\partial w_{ji}^{[l]}}. \qquad (2.67)

Computing the derivative with respect to w_{ji}^{[l]} of the previous expression, we get

\frac{\partial z_n^{[l]}}{\partial w_{ji}^{[l]}} = \begin{cases} a_i^{[l-1]} & \text{if } n = j, \\ 0 & \text{if } n ≠ j. \end{cases}

Taking into account this last equality and the definition of ξ_j^{[l]} given in (2.58) we have that (2.67) is rewritten as

\frac{\partial \tilde{f}}{\partial w_{ji}^{[l]}} = \frac{\partial \tilde{f}}{\partial z_j^{[l]}} a_i^{[l-1]} = ξ_j^{[l]} a_i^{[l-1]},

which is (2.59c).
With Lemma 2.28 it is easy to see that Back Propagation exploits the forward structure of the neural network to compute the partial derivatives in the backward direction. With (2.56) and (2.57), which go forwards on the network, we can obtain a^{[L]}, that is, the output of the FNN with fully-connected layers. As we know this value, we can obtain ξ^{[L]} applying (2.59a) of Lemma 2.28 with l = L. As we have ξ^{[L]}, we can compute ξ^{[l]} for l = 2, ..., L − 1 backwards with expression (2.59a) (that is to say, we calculate ξ^{[L-1]} using ξ^{[L]}, then we obtain ξ^{[L-2]} from ξ^{[L-1]}, and so on). The partial derivatives with respect to the weights and biases can be computed with (2.59b) and (2.59c).
Remark 2.30. Observe that in expression (2.59a) of Lemma 2.28 we have to compute A′, where A is the activation function. If we take, for example, the sigmoid function given in (2.4), it is easy to compute A′:
$$ A'(x) = \sigma'(x) = \left( \frac{1}{1 + e^{-x}} \right)' = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)(1 - \sigma(x)). $$
Notice that if we call D^[l] the diagonal matrix of dimension n_l × n_l whose (i, i) entry is A′(z_i^[l]), we can replace the componentwise product in (2.59a) of Lemma 2.28 by a matrix product. The expression is rewritten as
$$ \xi^{[l]} = \begin{cases} D^{[l]} (W^{[l+1]})^T \xi^{[l+1]} & \text{for } l = 2, \dots, L-1, \\ D^{[L]} \left( a^{[L]} - y^{\{m\}} \right) & \text{for } l = L. \end{cases} $$
Using this matrix expression we can write the following algorithm [22] for the training process
of an FNN with fully-connected layers when we use Stochastic Gradient with replacement and
mini-batch size equal to 1 (see (2.16)) and Back Propagation:
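The pseudocode of [22] is not reproduced here; instead, the following NumPy sketch illustrates one step of this training process under the assumptions of this section (sigmoid activation at every layer, quadratic loss, learning rate eta). The helper names and the list layout are illustrative choices, not the notation of [22].

```python
import numpy as np

def sigma(x):
    """Sigmoid activation function (2.4), applied componentwise."""
    return 1.0 / (1.0 + np.exp(-x))

def train_step(W, b, x, y, eta):
    """One Stochastic Gradient update (mini-batch size 1) via Back Propagation.

    The Python lists W and b store, at index l - 2, the weight matrix and
    bias vector of layer l = 2, ..., L of the FNN with fully-connected layers.
    """
    L = len(W) + 1
    # Forward pass (2.56)-(2.57): a^[1] = x, z^[l] = W^[l] a^[l-1] + b^[l].
    a, z = [x], [None]
    for Wl, bl in zip(W, b):
        z.append(Wl @ a[-1] + bl)
        a.append(sigma(z[-1]))
    # Backward pass: xi^[L] = D^[L](a^[L] - y^{m}), then
    # xi^[l] = D^[l] (W^[l+1])^T xi^[l+1] for l = L-1, ..., 2.
    xi = [None] * L
    xi[L - 1] = sigma(z[L - 1]) * (1.0 - sigma(z[L - 1])) * (a[L - 1] - y)
    for l in range(L - 2, 0, -1):
        D = sigma(z[l]) * (1.0 - sigma(z[l]))   # diagonal entries of D^[l+1]
        xi[l] = D * (W[l].T @ xi[l + 1])
    # Gradient step using (2.59b) and (2.59c): df/db = xi, df/dW = xi a^T.
    for l in range(1, L):
        b[l - 1] = b[l - 1] - eta * xi[l]
        W[l - 1] = W[l - 1] - eta * np.outer(xi[l], a[l - 1])
    return W, b
```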
2.3.1 Overfitting
When a network that has already been trained fits the training data perfectly but is not able to generalize, we say that overfitting occurs. Intuitively, overfitting is due to the fact that the network has learnt in a wrong way: it has focused on details of the training set that are not representative of all possible inputs of the network. In Figure 2.4 we have a visual representation of overfitting.
Figure 2.4: In all the pictures we have in orange the training points and in blue the result given by a neural network. From left to right: the graph obtained when the neural network has learnt correctly, when overfitting occurs and when underfitting arises.
If we want to detect whether overfitting occurs, we have to split the available labeled data into two different groups: training data and test data. Test data, which has to be preprocessed as we have done with training data, is the main tool that helps us to notice if overfitting arises. If the value of the loss function for test data is relatively high while this value for training data is small, the network is not able to generalize and we can conclude that overfitting is happening.
The term regularization describes the different techniques used to avoid overfitting by rewarding smoothness. Some of the most popular regularization techniques are early stopping, dropout, ℓ1 and ℓ2 regularization, max-norm regularization and data augmentation.
→ Early stopping.
If we want to use early stopping, we need to split the labeled data into three different sets (they do not have data in common): training set, test set and validation set. As we have seen, the training set is used to update the weights and biases during training. However, the validation set is not used to change the values of the weights and biases; it is used to check if the DNN is learning correctly.
During the training process (for example, after every update of the weights and biases), using the loss function, the target outputs of the data of the validation set are compared with the outputs obtained applying the current neural network, to measure how well the trained weights and biases fit unseen data (that is, data not used for training). If the network is learning correctly, the value of the loss function decreases at each application of the validation data set. Nevertheless, if overfitting occurs, at some point of the training process that value stops decreasing (it can even increase) and it is better to stop training.
Early stopping is usually used together with other techniques that also help us to avoid overfitting.
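A minimal Python sketch of this stopping criterion follows. The helpers `train_step` and `val_loss` are hypothetical stand-ins for one update of the weights and biases and for the evaluation of the loss function on the validation set; the `patience` counter is a common implementation choice, not part of the description above.

```python
def train_with_early_stopping(train_step, val_loss, max_steps, patience=10):
    """Stop training when the validation loss has not decreased for
    `patience` consecutive checks."""
    best, waited = float("inf"), 0
    for step in range(max_steps):
        train_step()                 # one update of the weights and biases
        current = val_loss()         # loss function on the validation set
        if current < best:
            best, waited = current, 0    # the network is still improving
        else:
            waited += 1                  # possible overfitting
            if waited >= patience:
                break                    # it is better to stop training
    return best
```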
→ Dropout.
Dropout is a regularization technique that at first can seem extreme, but it works quite well. It was proposed in Improving Neural Networks by Preventing Co-adaptation of Feature Detectors [30] by G.E. Hinton et al. in 2012. In 2014, in the article Dropout: A Simple Way to Prevent Neural Networks from Overfitting [31], N. Srivastava et al. showed that this technique can improve the performance of neural networks created to carry out vision and speech recognition tasks. Dropout consists in deleting the neurons (and their corresponding connections) of the input and hidden layers (neurons of the output layer cannot be removed) with probability P at every training step and training the resulting network. After training, the obtained weights and biases have to be multiplied by 1 − P. The probability P is a parameter whose value is not computed during training; it is known as the dropout rate. After training, we consider the entire DNN.
Figure 2.5: Illustrative example of dropout in the FNN with fully-connected layers of Figure 2.2.
Removing neurons and connections and training the resulting network makes neurons more robust and less dependent on their neighbours. This helps the network learn to generalize, and overfitting is avoided. If we use dropout and we detect that overfitting is still happening, we can increase P and try again.
Example 2.31. If we have the FNN with fully-connected layers of Figure 2.2 and at training step k the removed neurons are the first one of the input layer, the second one of the first hidden layer and the first one of the second hidden layer (the neurons of each layer are numbered from top to bottom and the layers from left to right), the neural network that we have to train is the one shown in Figure 2.5.
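A minimal NumPy sketch of the scheme described above (the array values are arbitrary placeholders): during training, the activations of an input or hidden layer are multiplied by a random binary mask that keeps each neuron with probability 1 − P, and after training the learnt weights are scaled by 1 − P.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_mask(n, P):
    """Keep each of the n neurons of a layer with probability 1 - P."""
    return (rng.random(n) >= P).astype(float)

# During training: multiplying by the mask deletes the dropped neurons
# (and, in effect, their connections) for this training step.
a_hidden = np.array([0.3, 0.8, 0.5])          # placeholder activations
a_dropped = a_hidden * dropout_mask(3, P=0.5)

# After training: the entire DNN is used, but the learnt weights
# have to be multiplied by 1 - P, as described above.
W_trained = rng.standard_normal((2, 3))       # placeholder weights
W_final = (1 - 0.5) * W_trained
```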
→ ℓ1 and ℓ2 regularization.
To avoid overfitting we can use ℓ1 and ℓ2 regularization. Their goal is to force weights to be small (conditions are not usually imposed on the biases). To do this, some special terms are added to the loss function f(W, b).
• ℓ1 regularization. The special addend inserted in the loss function is of the form
$$ \beta \sum_{l=2}^{L} \sum_{j=1}^{n_l} \sum_{i=1}^{n_{l-1}} \left| w_{ji}^{[l]} \right|. $$
That is to say, the extra addend is proportional to the sum of the absolute values of all the weights of the DNN.
• ℓ2 regularization. The special addend inserted in the loss function is of the form
$$ \beta \sum_{l=2}^{L} \sum_{j=1}^{n_l} \sum_{i=1}^{n_{l-1}} \left( w_{ji}^{[l]} \right)^2. $$
It is obvious that it is proportional to the sum of the squares of all the weights of the neural network.
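Both penalties can be sketched in a few lines of Python. The helper `regularized_loss` and its arguments are illustrative names, not notation from the text above.

```python
import numpy as np

def regularized_loss(loss, W, beta, kind="l2"):
    """Add the l1 or l2 penalty on all the weights of the DNN to the loss.

    W is the list of weight matrices W^[l], l = 2, ..., L.
    """
    if kind == "l1":
        penalty = sum(np.abs(Wl).sum() for Wl in W)    # sum of |w|
    else:
        penalty = sum((Wl ** 2).sum() for Wl in W)     # sum of w^2
    return loss + beta * penalty
```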
Figure 2.6: Data augmentation. From left to right: original image, a rotated version, a shifted instance and a version of the image after a gamma correction. To obtain these images we have used the Python library imgaug [32].
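A sketch along the following lines could produce such augmented versions, assuming the imgaug augmenters Affine and GammaContrast; the placeholder image and the parameter values are arbitrary illustrative choices.

```python
import numpy as np
import imgaug.augmenters as iaa

image = np.zeros((64, 64, 3), dtype=np.uint8)   # placeholder for a real image

# Rotated version, shifted instance and gamma-corrected version of the image.
rotated = iaa.Affine(rotate=15).augment_image(image)
shifted = iaa.Affine(translate_px={"x": 10}).augment_image(image)
corrected = iaa.GammaContrast(gamma=1.5).augment_image(image)
```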
2.3.2 Underfitting
Another problem that we can find after training is underfitting. It is the opposite of overfitting. It arises when the neural network we have built is so simple that it is not able to learn from the data. In this case, the network fits neither the training set nor the test data. To avoid underfitting we can build a more complex network, for example, by adding layers or neurons. If we detect underfitting while we are using dropout, we should decrease the dropout rate to avoid the problem. In Figure 2.4 we have a visual representation of underfitting.
2.4 Hyperparameters
We have seen that some parameters of neural networks, like weights and biases, are tuned during training. However, there are others, such as the initial value of the weights and biases, the number of layers or the number of neurons per layer, whose values are not fitted during training. These parameters are known as hyperparameters. Determining their optimal values is one of the main drawbacks of DNNs and it makes it difficult to find an optimal architecture for the neural network. Some information about hyperparameters can be consulted in [5] and [33].
The most intuitive way to tune hyperparameters is to try all the possible values; that is to say, we train the network with all the possible combinations of all the possible values and then we choose the best option. This technique is known as grid search. Let us imagine that we have a network with 6 hyperparameters, each of which can take 4 different values, and the training time is around 10 minutes. Training the network with all the possible combinations to decide the best values for the hyperparameters would take us 4^6 × 10 = 40960 minutes, that is, almost a month. Therefore, even when we have few hyperparameters and few values for them, obtaining the optimal values is computationally expensive. We conclude that this is not a good way to proceed.
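A Python sketch of grid search over the hypothetical setting just described; `train_and_evaluate` is a placeholder for a full training run of about 10 minutes, and the hyperparameter names and values are arbitrary.

```python
from itertools import product

# Hypothetical grid: 6 hyperparameters with 4 possible values each.
grid = {f"h{i}": [1, 2, 3, 4] for i in range(6)}

def train_and_evaluate(config):
    """Placeholder for a ~10-minute training run returning a validation loss."""
    return sum(config.values())

# Exhaustive search: 4**6 = 4096 trainings, i.e. 40960 minutes in total.
best = min(
    (dict(zip(grid, values)) for values in product(*grid.values())),
    key=train_and_evaluate,
)
```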
These values can be optimized using metaheuristic algorithms (they are based on nature's guiding principles and they help us to search more intelligently in the search space, which is the space of all the possible values of the hyperparameters) instead of using exhaustive methods like grid search. Examples of these metaheuristic algorithms are:
• Particle Swarm Optimization (PSO): it is based on the flight of birds during migration
or during the search for food.
• Genetic Algorithm (GA): it is based on the selection and crossover of species taking into
account how the evolution and the improvement of the species are affected.
To look for the best values for the hyperparameters we can also use tools like Oscar [34]. It is described as 'Your new (free) Data Scientist' and it helps us to optimize the hyperparameters of our DNNs. Given the results obtained with our current network, Oscar chooses the best values for them and it shows us the influence of each of them on our network.
To fit some hyperparameters like the number of layers and neurons per layer and the initial
value of weights and biases of an FNN with fully-connected layers, there are some particular
strategies:
→ Number of layers.
To determine the number of layers we usually start with a network with one or two hidden
layers (in the case of one hidden layer, it is an MLP but not a DNN) and then, we gradually
increase the number of hidden layers until we detect overfitting.
→ Number of neurons per layer.
The number of neurons at the first layer (input layer) and the last layer (output layer) depends on the structure of the input and the output data, respectively. However, there is no general rule for choosing the optimal number of neurons at each hidden layer. We can adopt some simplifications, such as considering that all the hidden layers have the same number of neurons (this reduces the problem to just one hyperparameter) and increasing this number until overfitting occurs. We can also consider that the number of neurons decreases at each layer (an idea based on the hierarchical structure of DNNs). But, in general, it has been observed that it is better to increase the number of layers than the number of neurons per layer.
→ Initial value of the weights and biases.
When we work with FNNs with fully-connected layers, biases can be initialized to zero and weights can be initialized using different strategies (they tend to be heuristic), such as the normalized initialization proposed by X. Glorot and Y. Bengio in their article Understanding the Difficulty of Training Deep Feedforward Neural Networks [35]. Normalized initialization consists in obtaining the initial values of the weights with a uniform distribution:
$$ W^{[l]} \sim U \left[ -\frac{\sqrt{6}}{\sqrt{n_{l-1} + n_l}}, \; \frac{\sqrt{6}}{\sqrt{n_{l-1} + n_l}} \right], $$
where n_{l−1} and n_l are the number of neurons of layers l − 1 and l, respectively.
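A minimal NumPy sketch of this initialization (the layer sizes in the usage lines are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalized_init(n_prev, n_curr):
    """Normalized (Glorot) initialization [35] of W^[l] for a layer with
    n_curr neurons preceded by a layer with n_prev neurons."""
    limit = np.sqrt(6.0) / np.sqrt(n_prev + n_curr)
    return rng.uniform(-limit, limit, size=(n_curr, n_prev))

W2 = normalized_init(784, 30)   # e.g. first hidden layer of an FNN
b2 = np.zeros(30)               # biases initialized to zero
```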
We have seen that it is complicated and usually expensive to obtain the optimal values of the hyperparameters. For this reason, and as a consequence of the hierarchical structure of DNNs, it is common to recycle networks that have already been trained. For example, if we have trained a DNN that is able to recognize people, we can use the neurons of the first hidden layers to construct a DNN that recognizes different dress styles. In this way, it is only necessary to determine the architecture of the middle and last hidden layers, and the number of hyperparameters is reduced.
In Chapter 3 we study other types of Deep Neural Networks that allow us to perform some
tasks that are not feasible to carry out with Feed-forward Neural Networks with fully-connected
layers.
Chapter 3
CNNs and RNNs
To perform, for example, image classification or prediction tasks, it is not feasible to use Feed-forward Neural Networks with fully-connected layers. In these cases, we have to use other types of Deep Neural Networks, such as Convolutional Neural Networks or Recurrent Neural Networks.
3.1 Convolutional Neural Networks
The input of a CNN is an image, which has one or several channels (one channel if it is a black and white image and three channels if it is a color image). This determines the structure of the input layer.
The main structures of CNNs are convolutional layers. Convolutional layers are 2D layers (matrices) inspired by the fact that some neurons only react to small regions of the visual field. Therefore, each neuron of each non-input layer is not connected to all the neurons of the previous layer (unlike what happens in fully-connected DNNs); it is only connected to some neurons, which constitute its local receptive field.
Figure 3.1: Representation of two consecutive convolutional layers of a CNN whose flow goes
from left to right. We can see the local receptive field (in purple) of a neuron (in grey).
Figure 3.1 shows two consecutive convolutional layers whose flow goes from left to right. In purple we have the local receptive field of the grey neuron. We only need the values of these four neurons of the previous layer to obtain the value of the grey neuron; we do not need the values of all 36 neurons.
Figure 3.2: Graphic representation of the connections between two consecutive layers of a CNN.
The receptive field of layer l has dimension 2 × 1.
The height and width that define the local receptive field of a neuron are common to all the neurons in the same layer (all the neurons at a layer are connected to the same number of neurons of the previous layer). For layer l, we use β_h^[l] and β_w^[l] to denote these dimensions.
The neuron at position (i, j) (row and column) in convolutional layer l is connected to all the neurons in the previous layer that belong to the region between rows i and i + β_h^[l] − 1 and columns j and j + β_w^[l] − 1.
Example 3.1. For simplicity, we use a column vector (a matrix of dimension n × 1) to understand how convolutional layers are connected. Let us consider two consecutive convolutional layers, layer l − 1 and layer l, that have dimensions 5 × 1 and 4 × 1, respectively. The height and width of the local receptive field of the neurons of layer l are β_h^[l] = 2 and β_w^[l] = 1 (notice that no other value makes sense). The connections between the two layers can be seen in Figure 3.2.
Notice that if in Example 3.1 we consider that convolutional layer l has dimension 5 × 1 (the values of β_h^[l] and β_w^[l] and the dimension of the previous layer do not change), then the last neuron of layer l does not have enough neurons to connect to. We can see this in the left picture of Figure 3.3, where the missing connection is represented in red. The dimensions of both layers are not compatible. To solve this problem we perform zero padding. This consists in adding zeros around layer l − 1 in order to make the dimensions compatible. In the right picture of Figure 3.3 we have the connections between the two layers when zero padding is used (in grey we have the added neuron with value 0).
Figure 3.3: On the left, the dimensions of both layers are not compatible. On the right, we
perform zero padding to have compatibility.
Observe that in Example 3.1 the distance between the beginnings of the receptive fields of two consecutive neurons is 1; that is to say, if we move the local receptive field of neuron (i, 1) one position down, we obtain the local receptive field of neuron (i + 1, 1). Such distance is called stride. In the case of Figure 3.4 it is 2. We use α_h^[l] and α_w^[l] to denote the vertical and horizontal stride, respectively. If only one number is given for the stride, we understand that the vertical and horizontal strides are equal. If α_h^[l], α_w^[l] ≠ 1, neuron (i, j) at layer l is connected to the neurons of the previous layer that belong to the rectangle with rows (i − 1)α_h^[l] + 1 to (i − 1)α_h^[l] + β_h^[l] and columns (j − 1)α_w^[l] + 1 to (j − 1)α_w^[l] + β_w^[l].
So far, we have seen how the connections between neurons of consecutive layers work. Let us study the weights and biases of such connections.
A filter or convolutional kernel is a matrix in which the weights of the connections between two consecutive layers are stored. This matrix and the local receptive field of the layer receiving these connections have the same dimension. The same filter is applied to the receptive field of all the neurons at the same layer. The bias term is also the same for all the neurons in one layer.
Figure 3.4: Graphic representation of the connections between two consecutive layers of a CNN
when the vertical stride and the dimension of the local receptive field of layer l are 2 and 2 × 1,
respectively.
We denote by x_{i,j}^[l] the value of the neuron in the i-th row and j-th column of convolutional layer l. The filter is represented by the matrix W^[l] = (w_{i,j}^[l])_{i,j} ∈ R^{β_h^[l] × β_w^[l]} and b^[l] is the bias term. It is common to use the ReLU function (2.3) as the activation function A. The value of neuron (i, j) at layer l is computed as follows:
$$ x_{i,j}^{[l]} = A \left( b^{[l]} + \sum_{u=1}^{\beta_h^{[l]}} \sum_{v=1}^{\beta_w^{[l]}} x_{i',j'}^{[l-1]} w_{u,v}^{[l]} \right), \qquad (3.1) $$
where
$$ i' = u + (i-1)\alpha_h^{[l]}, \qquad j' = v + (j-1)\alpha_w^{[l]}. $$
CNNs and convolutional layers are called this way because the computation process (3.1) is
similar to a convolution operation.
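A direct (and deliberately naive) NumPy transcription of formula (3.1) for a layer with one feature map could look as follows; the function name is illustrative, the default activation is the ReLU function mentioned above, and zero padding is assumed to have been applied to the input beforehand if needed.

```python
import numpy as np

def conv_layer(x_prev, W, b, stride_h, stride_w, A=lambda t: np.maximum(t, 0)):
    """Values of the neurons of a convolutional layer, formula (3.1).

    x_prev: values of the previous layer (2D array, zero padding already
    applied if needed); W: filter of dimension beta_h x beta_w; b: bias.
    """
    beta_h, beta_w = W.shape
    n_rows = (x_prev.shape[0] - beta_h) // stride_h + 1
    n_cols = (x_prev.shape[1] - beta_w) // stride_w + 1
    out = np.empty((n_rows, n_cols))
    for i in range(n_rows):
        for j in range(n_cols):
            # Local receptive field of neuron (i, j).
            field = x_prev[i * stride_h : i * stride_h + beta_h,
                           j * stride_w : j * stride_w + beta_w]
            out[i, j] = A(b + (field * W).sum())
    return out
```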
As we apply the same filter and bias to all the neurons at the same layer, all of them
are extracting the same structures (determined by the filter) from their corresponding local
receptive field. Let us see this with an example. At the bottom of Figure 3.5 we have an image
of a building. We consider a vertical filter that is represented by a sparse matrix that only has
1’s in the middle column. If we apply it, we obtain the top-left image in which all the vertical
lines of the original image are highlighted, while the other structures seem to be ‘discarded’.
We use a horizontal filter now. It corresponds to a sparse matrix with only 1’s in the middle
row. Applying it to the original image we obtain the top-right image in which the horizontal
features of the image have been captured. So, using different filters we extract the different
structures of the image. A layer whose neurons use the same filter and bias is known as feature
map.
Figure 3.5: Visual explanation of how filters highlight some features of an image. This picture
has been obtained from [5].
At this moment, a convolutional layer and a feature map may seem to be the same, but they are not. To explain how a CNN works we have defined a convolutional layer as a unique layer of neurons with the same weights and bias; that is to say, we have considered that a convolutional layer has only one feature map. However, convolutional layers are usually built with several feature maps of the same size. Each feature map is connected to all the feature maps of the previous convolutional layer using a different filter for each of them (each one highlights a different structure of the data). The set of all these filters is called a filter bank and it is characteristic of the feature map.
In Figure 3.6 we have two consecutive convolutional layers whose flow goes from left to
right. Each layer has three feature maps. In grey we have a neuron of a feature map. In purple
we can see all the information it receives from all the feature maps of the previous layer.
Figure 3.6: Representation of two consecutive convolutional layers of a CNN whose flow goes from left to right. We can see the information (in purple) that a neuron (in grey) receives from the feature maps of the previous convolutional layer.
As all the neurons of a feature map use the same filter bank, CNNs have the following two characteristics:
• The number of training parameters (weights and biases) is lower, so the network needs less time to learn.
• CNNs have the property of invariance under location: any pattern or object can be recognized in any part of the image, unlike Feed-forward Neural Networks with fully-connected layers, which are only able to recognize objects in one part of the image if they have seen a similar one there during training.
We denote by x_{i,j,k}^[l] the value of neuron (i, j) in feature map k of convolutional layer l. The filter applied to feature map k′ at convolutional layer l − 1 to obtain feature map k of layer l is the matrix W_{k,k′}^[l] = (w_{i,j,k,k′}^[l])_{i,j}. The bias term of feature map k at convolutional layer l is b_k^[l]. The number of feature maps of convolutional layer l is f^[l]. We can write mathematically the value of neuron (i, j) in feature map k of convolutional layer l as
$$ x_{i,j,k}^{[l]} = A \left( b_k^{[l]} + \sum_{u=1}^{\beta_h^{[l]}} \sum_{v=1}^{\beta_w^{[l]}} \sum_{m=1}^{f^{[l-1]}} x_{i',j',m}^{[l-1]} w_{u,v,k,m}^{[l]} \right), \qquad (3.2) $$
where
$$ i' = u + (i-1)\alpha_h^{[l]}, \qquad j' = v + (j-1)\alpha_w^{[l]}. $$
If l = 2, we are computing the values of the neurons of the first hidden layer. In this case, the
information is obtained from the input layer (l = 1). Instead of feature maps, the input layer
has channels that behave similarly to feature maps.
Taking into account formula (3.2), it is easy to deduce that the patterns highlighted in layer l − 1 are combined in convolutional layer l to create more complex structures. ConvNets are able to take advantage of the hierarchical structure of real data and, because of this, they are good at solving, for example, classification problems.
After each convolutional layer, we often have a pooling layer (some authors refer to it as a sub-sampling layer), which allows us to reduce the dimension of the feature maps. In the pooling layer we have a 'copy' of each feature map that has fewer neurons: each neuron of the 'copy' replaces all the neurons of the feature map that are in a rectangular receptive field. The value of such a replacement is obtained using a function that returns the maximum of the receptive field (in this case the pooling layer is called a max pooling layer) or the average (it receives the name of average pooling layer). As in the case of convolutional layers, we have to fix a stride value α^{[l],p} (two values if we consider that the vertical and horizontal strides are different) and the dimension β_h^{[l],p} × β_w^{[l],p} of the rectangular receptive field (we use the superscript [l], p to refer to the pooling layer that follows convolutional layer l).
Figure 3.7: 'Copy' of the max pooling layer (bottom-left, values 8 6 7) and 'copy' of the average pooling layer (bottom-right, values 5 4 4) of a feature map (top, values 8 2 2 6 1 7).
Example 3.2. As in Example 3.1, for simplicity, we work with vectors. Let us consider that we have a feature map of dimension 1 × 6 in convolutional layer l, which we can see at the top of Figure 3.7. The pooling layer that follows it is built with 'copies' of dimension 1 × 3, its horizontal stride is α_w^{[l],p} = 2 and the dimension of the rectangular receptive field is β_h^{[l],p} × β_w^{[l],p} = 1 × 2. At the bottom-left of the figure we have the corresponding 'copy' of our feature map when the pooling layer is a max pooling layer and at the bottom-right we have it when the pooling layer is an average pooling layer.
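A NumPy sketch reproducing Example 3.2; the helper name `pool_1d` is illustrative.

```python
import numpy as np

def pool_1d(feature_map, field, stride, op=np.max):
    """'Copy' of a 1 x n feature map produced by a pooling layer with
    rectangular receptive fields of dimension 1 x field and the given stride."""
    n_out = (len(feature_map) - field) // stride + 1
    return np.array([op(feature_map[k * stride : k * stride + field])
                     for k in range(n_out)])

fm = np.array([8, 2, 2, 6, 1, 7])                   # feature map of Figure 3.7
print(pool_1d(fm, field=2, stride=2, op=np.max))    # max pooling:     [8 6 7]
print(pool_1d(fm, field=2, stride=2, op=np.mean))   # average pooling: [5. 4. 4.]
```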
To sum up, in the architecture of a CNN we have convolutional layers (they are built by stacking feature maps) and pooling layers (they can be understood as 'copies', with fewer neurons, of the feature maps of the previous convolutional layer). If we stack two convolutional layers, the connections between them, and the corresponding computations, are based on the local receptive fields and the filters. The convolutional layer that follows a pooling layer is connected to it and computed in the same way as when we have two consecutive convolutional layers.
As we have introduced, ConvNets are used for image recognition. Therefore, we can use them to carry out detection and classification tasks. In these cases, after the CNN, we stack an FNN with fully-connected layers in order to obtain the desired output.
→ Detection task.
We want to know the position of an object in the image. The output of the network is a
sequence of bounding boxes. A bounding box is a rectangle, usually represented by its corners,
that bounds an object. So, the output is a set of points and to obtain it we need to stack
behind the ConvNet a Feed-forward Neural Network with fully-connected layers.
→ Classification task.
The neural network has to be able to classify some objects. The category of each of them is represented by a numerical value. The output of the network is a vector of dimension N, where N is the number of categories in which the object can be classified. Its i-th coordinate is the probability that the object belongs to the i-th category. To get this type of output we need to stack behind the CNN a Feed-forward Neural Network with fully-connected layers and to apply the softmax operation to the last layer. The softmax operation converts the values of a vector into positive values whose sum is 1, so we can understand them as probabilities. Its expression is given by
$$ \tilde{v}_i = \frac{e^{v_i}}{\sum_{j=1}^{N} e^{v_j}}. $$
To train this type of network we can use the logarithmic cost function
$$ f = -\frac{1}{M} \sum_{i=1}^{M} \log \left( \tilde{v}_{c_i}^{\{i\}} \right), \qquad (3.3) $$
where M is the number of points of the training data, N is the number of categories, ṽ_j^{\{i\}} denotes the j-th component of the vector of probabilities that corresponds to the i-th element of the training data and ṽ_{c_i}^{\{i\}} is the component of the previous vector that corresponds to the correct category of the input (this index is obtained from the label of the training point).
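A NumPy sketch of the softmax operation and of the logarithmic cost (3.3); subtracting the maximum before exponentiating is a standard numerical-stability device, not part of the definition above.

```python
import numpy as np

def softmax(v):
    """Convert the output vector v into a vector of N probabilities."""
    e = np.exp(v - v.max())      # shift for numerical stability
    return e / e.sum()

def log_cost(outputs, labels):
    """Logarithmic cost over M training points: average of -log of the
    probability assigned to the correct category c_i of each point."""
    probs = [softmax(v)[c] for v, c in zip(outputs, labels)]
    return -np.mean(np.log(probs))
```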
Back Propagation (see Subsection 2.2.4) can also be used to compute the gradient of the
loss function during training. The techniques introduced in Section 2.3 to avoid overfitting can
be applied to CNNs.
Some hyperparameters of the ConvNets are the number of convolutional and pooling layers,
the number of feature maps in each convolutional layer and the stride and dimension of the
receptive fields.
Figure 3.8: CNN that can be used to solve a classification task. This picture has been made
with [41].
The input layer is of dimension 150 × 200 × 1 since we work with black and white images of dimension 150 × 200. This layer is followed by a convolutional layer with 24 feature maps of dimension 150 × 200. We use a stride of 1 and a local receptive field of dimension 5 × 5 (if it is needed, zero padding is applied). We consider a bias term and we take the ReLU function as the activation function. Following this convolutional layer we place an average pooling layer with stride 2 and a 2 × 2 receptive field in order to reduce the dimension to 75 × 100. Behind it we put a convolutional layer with 48 feature maps of dimension 75 × 100. The filters have dimension 5 × 5 and we consider a stride of 2 (as we have pointed out before, zero padding is applied if it is necessary). A bias term is applied and we use the ReLU function as the activation function. This convolutional layer is followed by a max pooling layer with stride 3 for the vertical dimension and 2 for the horizontal one, the dimension of the receptive field being equal to 3 × 2. The dimension is reduced to 25 × 50. We place the last convolutional layer, which has 56 feature maps of dimension 25 × 50. The stride is 1 and the local receptive field has dimension 5 × 5 (if it is necessary, zero padding is applied). We consider a bias term and the ReLU activation function. This network ends with a Feed-forward Neural Network with fully-connected layers whose output is a vector of dimension 5 (each element of the vector represents one of the five categories), to which we apply the softmax operation in order to obtain a probability for each category. To train the network we use the logarithmic cost function defined in (3.3). In Figure 3.8 we have a representation of a network of this type.
Let us now think about a more complex task. We want to detect and classify the vehicles that cross the street. In this case we have a video, so we have to study it frame by frame. To detect and classify the vehicles in each still, we use a CNN followed by an FNN with fully-connected layers. The output is a sequence of vectors, each of them containing the bounding box (position) and the category probabilities of an object. This output may be used, for example, to track the vehicle along the video in order to compute its velocity. This has been developed in the project Smart Traffic manager suite (reference: 2020/0075), carried out at the University of Zaragoza and financed by Lector Vision S.L. In Figure 3.9 we have a sample of the graphic interpretation of the output of the network; the colours represent the category of the vehicles (green for cars and orange for motorbikes).
Figure 3.9: Graphic interpretation of the output of a DNN used for detection and classification
tasks. The image has been provided by Lector Vision S.L.
3.2 Recurrent Neural Networks
The simplest RNN is built with a unique neuron, called a recurrent neuron. At each time step t (also called frame), the input of the neuron consists of an observable data point x(t) and its previous output y(t − 1). The output is a scalar y(t) that is used as input at the next
time step. The row vector w_x of the weights of the connections between the observable data and the neuron and the weight w_y of the link between the previous output and the neuron do not depend on time. Adding a bias term b that we can also consider independent of time and choosing an activation function A, we can write the output at time t of this simple RNN as
$$ y(t) = A \left( w_x x(t) + w_y y(t-1) + b \right). \qquad (3.4) $$
With this expression it is easy to see that the output y(t) depends on the observable data x(t) and on the previous output y(t − 1), which, in turn, depends on x(t − 1) and y(t − 2), and so on. Therefore, y(t) is a function that depends on all the data introduced until time t. Somehow, the neural network has memory.
Remark 3.4. For time step t = 0, according to formula (3.4), the previous output is needed.
As it does not exist, we consider that it is equal to 0.
Figure 3.10: On the left, scheme of the simplest RNN. On the right, drawing obtained unrolling
the network through time.
On the left of Figure 3.10 we can see a graphic representation of this RNN. On the right of
the same figure we have represented the network with respect to time. Now, the RNN looks
like a Feed-forward Neural Network. The process to change from the representation on the left
to the representation on the right is known as unrolling the network through time.
A more complicated RNN can be obtained if we build a layer with several recurrent neurons. In this case, we have an observable input connected to all the neurons in the layer. All of them also receive the previous output of the network, which is a vector (the output of each recurrent neuron is a scalar, so the output of the whole layer is a vector). As in the case of a unique neuron, the weights and biases are constant for all time steps. If W_x is the matrix whose rows are the weights of the connections between the observable input and each recurrent neuron, W_y is the matrix for the links between the previous output and each recurrent neuron, and b is the bias vector, we have that the output of the network at time step t is given by
$$ y(t) = A \left( W_x x(t) + W_y y(t-1) + b \right). $$
On the left of Figure 3.11 we have a drawing of this RNN and on the right we have the picture obtained unrolling the network through time. Notice that in the unrolled representation we have used a unique structure, represented by a square, to symbolize the layer. This layer is the part of the RNN that preserves the previous states and it is considered as a unique structure called a memory cell. The recurrent neuron of our first RNN (the simplest one) is a memory cell too.
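A NumPy sketch of the forward pass of such a layer of recurrent neurons unrolled through time, using the expression above and the convention of Remark 3.4 that the initial previous output is zero; the use of tanh as the activation function is an arbitrary illustrative choice.

```python
import numpy as np

def rnn_outputs(xs, Wx, Wy, b, A=np.tanh):
    """Outputs y(0), y(1), ... of a layer of recurrent neurons:
    y(t) = A(Wx x(t) + Wy y(t-1) + b), with the previous output of the
    first time step taken equal to 0 (Remark 3.4)."""
    y = np.zeros(Wy.shape[0])
    ys = []
    for x in xs:                       # xs: sequence of observable inputs
        y = A(Wx @ x + Wy @ y + b)     # the memory cell keeps y between steps
        ys.append(y)
    return ys
```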
Figure 3.11: On the left, scheme of an RNN with a layer of recurrent neurons. On the right,
drawing obtained unrolling the network through time.
Remark 3.5. Up to now, we have considered that the previous information of the network used as input at time step t is y(t − 1). However, in a more general case, the input of the network that contains previous information can be a vector s(t − 1), called state, that is computed by the network but need not coincide with y(t − 1).
Remark 3.6. We can stack some layers of recurrent neurons to obtain a deep RNN. However,
it is common to have only one layer.
It is not necessary to have an input x(t) and an output y(t) for each time step t. Depending on the task we need to perform, we can choose different sequences for the observable inputs and for the outputs. A famous example is the Encoder-Decoder. In this case, for the first time steps, the elements of the observable input sequence are nonzero and we do not obtain the outputs (however, at each time step the network is still receiving information from the previous time steps). This part of the RNN is the Encoder, which converts the input sequence into a vector. For the next time steps, we consider that the observable input sequence is zero and we obtain all the outputs. This part of the network is the Decoder, which transforms the vector obtained by the Encoder into a sequence. In Figure 3.12 we have an example of this RNN. It can be used to translate sentences from one language to another (in [5] it is explained in detail how it works).
During training, the derivatives of the loss function with respect to the weights and biases are calculated using Back Propagation Through Time (BPTT).
BPTT is an algorithm similar to Back Propagation. It is applied to the unrolled RNN. The loss function depends on all the outputs along time (if, as in the case of an Encoder-Decoder, there are some time steps for which we do not have an output, the loss function does not depend on the outputs of those time steps). The derivatives are computed backwards from the output of each time step, as we can see in Figure 3.12, where the red arrows represent the calculation flow. The obtained derivatives are added over all time steps.
Notice that, unrolling through time, we transform the RNN into a network that has a layer for each time step. If the task needs a lot of time steps, this network has many layers and BPTT needs a long time to compute the derivatives. In this case, we can use Truncated BPTT. This algorithm consists in separating the usual computation flow of BPTT into some subsequences. Then, the backward computations are done separately for each group. Finally, the results are added in order to obtain the gradient.
Figure 3.12: Graphic representation of an Encoder-Decoder. Flow of the calculations of BPTT
algorithm in red. This picture is inspired by the image used to explain BPTT in [5].
LSTM
Long Short-Term Memory (LSTM) is a Recurrent Neural Network with long-term memory. Unlike RNNs with memory cells formed by only one recurrent neuron or a layer of recurrent neurons, an LSTM is able to keep information from the states of earlier time steps. It is based on a memory cell called LSTM cell. It was presented by S. Hochreiter and J. Schmidhuber in 1997 in their paper Long Short-Term Memory [44].
Let us study how an LSTM cell works at each time step t. First of all, we have to remark that the state is divided into a short-term state h(t) and a long-term state c(t). The inputs of the cell are the observable input x(t) and the short-term and long-term states of the previous time step, denoted by h(t − 1) and c(t − 1), respectively. The output of the network is y(t).
Figure 3.13: LSTM cell. This picture has been obtained from [5].
Inside the LSTM cell there are four fully-connected layers whose inputs are x(t) and h(t−1).
We refer to these layers with letters f , g, i and o. They can be understood as the structures of
the cell that decide the information that is in the state vectors and in the output vector.
Let α ∈ {f, g, i, o}. We denote by W^{α,x} and W^{α,h} the weights of the connections between x(t) and layer α, and between h(t − 1) and layer α, respectively. The bias term of layer α is b^α and α(t) represents its output at time step t. Using this notation, we study each of the fully-connected layers introduced before.
• Layer f. It is one of the three gate controllers and it is responsible for the forget gate. Its output f(t) is given by
$$ f(t) = \sigma \left( W^{f,x} x(t) + W^{f,h} h(t-1) + b^f \right). \qquad (3.5) $$
Notice that we have used the sigmoid (or logistic) function as the activation function, so the components of the vector f(t) take values between 0 and 1. Because of this, we can use this output as a control term that is said to be responsible for a gate: if we multiply f(t) by the vector c(t − 1) element by element (componentwise multiplication of two vectors), the components of the long-term state vector that are multiplied by zero are erased and the others are stored (they are multiplied by a number in the interval (0, 1]). Therefore, if we call this vector ĉ(t), we have that
$$ \hat{c}(t) = f(t) \otimes c(t-1). $$
• Layer g. It is responsible for studying x(t) and h(t − 1). The output g(t) of this layer is given by
$$ g(t) = \tanh \left( W^{g,x} x(t) + W^{g,h} h(t-1) + b^g \right), \qquad (3.6) $$
where we have used the hyperbolic tangent function as the activation function.
• Layer i. It is one of the three gate controllers. Its output i(t) is given by
$$ i(t) = \sigma \left( W^{i,x} x(t) + W^{i,h} h(t-1) + b^i \right). \qquad (3.7) $$
As in the case of layer f, the output is a vector whose elements belong to the interval [0, 1]. This output and g(t) are multiplied component by component in the input gate (it is controlled by i(t), which decides the elements of g(t) that are stored). The result is added to ĉ(t), obtaining the long-term state of the current time step. We can write that
$$ c(t) = i(t) \otimes g(t) + \hat{c}(t), \qquad (3.8) $$
where ĉ(t) = f(t) ⊗ c(t − 1).
• Layer o. It is one of the three gate controllers. The output of this layer is
$$ o(t) = \sigma \left( W^{o,x} x(t) + W^{o,h} h(t-1) + b^o \right). \qquad (3.9) $$
This gate controller is responsible for the output gate, in which tanh(c(t)) and o(t) are multiplied point by point in order to obtain the short-term state h(t) and the output y(t). Therefore we can write
$$ h(t) = y(t) = o(t) \otimes \tanh(c(t)). \qquad (3.10) $$
Formulas (3.5), (3.6), (3.7), (3.8), (3.9) and (3.10) give us all the computations performed
inside the memory cell. In Figure 3.13 we can see the structure of the LSTM cell.
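A NumPy sketch of one time step of an LSTM cell, transcribing formulas (3.5)-(3.10); storing the weights and biases in dictionaries keyed by layer is just a convenient illustrative layout.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One time step of an LSTM cell, formulas (3.5)-(3.10).

    W holds the matrices W^{a,x} and W^{a,h} (keys "a,x" and "a,h") and
    b holds the bias vectors b^a, for each layer a in {f, g, i, o}.
    """
    f = sigmoid(W["f,x"] @ x + W["f,h"] @ h_prev + b["f"])   # forget gate (3.5)
    g = np.tanh(W["g,x"] @ x + W["g,h"] @ h_prev + b["g"])   # candidate  (3.6)
    i = sigmoid(W["i,x"] @ x + W["i,h"] @ h_prev + b["i"])   # input gate (3.7)
    o = sigmoid(W["o,x"] @ x + W["o,h"] @ h_prev + b["o"])   # output gate (3.9)
    c = i * g + f * c_prev                                   # long-term state (3.8)
    h = o * np.tanh(c)                                       # short-term state (3.10)
    return h, c        # the output y(t) equals h(t)
```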
GRU
A Gated Recurrent Unit (GRU) is a simpler version of an LSTM, so it is an RNN with long-term
memory. Its main structure is a memory cell called GRU cell. GRUs were introduced by K.
Cho et al. in 2014 in the article Learning Phrase Representations using RNN Encoder-Decoder
for Statistical Machine Translation [45].
Figure 3.14: GRU cell. This picture has been obtained from [5].
Let us see how a GRU cell works. It has only one state vector, denoted by h(t) (in the case of the LSTM cell we have two). The inputs are the observable input x(t) and the previous state h(t − 1). The output is y(t). Unlike the LSTM cell, the GRU cell has only three fully-connected layers, to which we refer with the letters r, g and z, and it does not have an output gate. Following a notation for the weights and biases equivalent to the one used in the case of LSTM cells, we can study each layer.
• Layer r. It is a gate controller whose output r(t) is given by
$$ r(t) = \sigma \left( W^{r,x} x(t) + W^{r,h} h(t-1) + b^r \right). \qquad (3.11) $$
This output is responsible for a gate that, with a componentwise multiplication, controls the part of the previous state that is received by layer g as input.
• Layer g. It is equivalent to layer g in the LSTM cell. However, a gate controlled by r(t) has power over the previous state h(t − 1) that layer g receives as input. Its output is
$$ g(t) = \tanh \left( W^{g,x} x(t) + W^{g,h} \left( r(t) \otimes h(t-1) \right) + b^g \right). \qquad (3.12) $$
• Layer z. It is the gate controller whose output z(t) is given by
$$ z(t) = \sigma \left( W^{z,x} x(t) + W^{z,h} h(t-1) + b^z \right), \qquad (3.13) $$
which controls the input and the forget gates: when a component of z(t) is equal to 1, the forget gate does not erase this element from h(t − 1) and the input gate removes it from g(t); when it has value 0, the gates work the other way around. Therefore, we have that
$$ h(t) = y(t) = z(t) \otimes h(t-1) + (1 - z(t)) \otimes g(t). \qquad (3.14) $$
All the computations carried out by the GRU cell are given by equations (3.11), (3.12),
(3.13) and (3.14). This memory cell is represented in Figure 3.14, where we can see clearly that
its structure is simpler than the structure of the LSTM cell.
Conclusions
• Artificial Intelligence (AI) is the field of computer science focused on building machines
that can learn, reason and act as intelligent humans. One of the main topics in AI is
Machine Learning.
• Artificial Neural Networks (ANNs) are an ML algorithm based on the structure of biological neurons (these neurons are organized in layers). We focus on ANNs that are supervised learning algorithms. The Perceptron is a simple type of ANN which cannot solve some problems, like the XOR classification problem, that other ANNs, such as the Multi-Layer Perceptron, can.
• Deep Learning (DL) is the branch of ML that includes all the techniques used to build
Deep Neural Networks (ANNs with multiple hidden layers) that are able to learn from
real data.
• In the field of DL, an FNN with fully-connected layers is an MLP with at least two hidden
layers. Its architecture is based on the weights assigned to the connections between the
neurons of consecutive layers, the biases added at each neuron and the activation function.
To determine the values of the weights and biases we need to train the network. This
process is reduced to an optimization problem that consists in minimizing a function
called loss function (it measures how well the neural network fits the data).
• Gradient Descent (GD) and Stochastic Gradient (SG) are optimization algorithms that can be used to solve the minimization problem. However, they have convergence issues. In the case of convex L-smooth functions, we can fix them by adding momentum or restart techniques to GD. In the case of SG, if we only add momentum (NASGD), even for convex smooth optimization problems, we accumulate error at each iteration. We need to use a restart technique to speed up SG (the SRSGD algorithm).
• In the optimization algorithms used during training, we have to compute the gradient of
the loss function. Back Propagation exploits the forward structure of FNNs with fully-
connected layers to compute the derivatives backwards.
• During training, the neural network may not be able to learn to generalize or it may not learn from the training data. These problems are known as overfitting and underfitting, respectively. The first one can be solved with regularization techniques such as dropout and the second one by building a more complex neural network.
• Hyperparameters are parameters that are not fitted during training. In the case of FNNs
with fully-connected layers they are, for example, the number of layers or the initial value
of the weights and biases.
• Convolutional Neural Networks (CNNs) are Deep Neural Networks used when the input
of the network is an image. They can be used to perform classification and detection
tasks. They are FNNs, but they do not have fully-connected layers. Their architecture is
based on convolutional and pooling layers. The process of training is similar to the one
of FNNs with fully-connected layers and Back Propagation can also be used.
• Recurrent Neural Networks (RNNs) are Deep Neural Networks with memory (they use an observable input and the previous state of the network, which is stored in the memory cell). They are used when the input data is a sequence. If we unroll them through time, they look like FNNs, so the process of training is equivalent and the algorithm to compute the derivatives is a version of Back Propagation known as Back Propagation Through Time. There is empirical evidence that they do not have long-term memory; we can get it using an LSTM cell or a GRU cell, which are long-term memory cells.
Bibliography
[2] I. Belda, Mentes, Máquinas y Matemáticas. La Inteligencia Artificial y sus Retos, Na-
tional Geographic. El Mundo es Matemático, RBA Coleccionables (2011).
[3] K. Kumar, G.S.M. Thakur, Advanced Application of Neural Networks and Artificial
Intelligence: A Review, International Journal of Information Technology and Computer
Science 4 (6) (2012), 57 − 68.
[4] A. Turing, Computing Machinery and Intelligence, Mind 59 (236) (1950), 433 − 460.
[5] A. Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow. Concepts,
Tools, and Techniques to Build Intelligent Systems, O’Reilly (2019) (Twelfth Release).
[6] R. Follmann, E. Rosa Jr., Predicting Slow and Fast Neuronal Dynamics with Machine
Learning, Chaos: An Interdisciplinary Journal of Nonlinear Science 29 (11) (2019), 113 −
119.
[10] AFI Escuela de Finanzas Madrid, Machine Learning con Python. Lecture Notes for
MOOC: Machine Learning con Python 2020.
[11] R. Morris, M. Fillenz, Neuroscience. Science of the Brain. An Introduction for Young
Students, British Neuroscience Association - European Dana Alliance for the Brain (2003).
[14] W. McCulloch, W. Pitts, A Logical Calculus of the Ideas Immanent in Nervous Ac-
tivity, Bulletin of Mathematical Biophysics 5 (4) (1943), 115 − 133.
[16] J.L. Navarro, Topología General. Lecture Notes for Academic Year 2017/2018. Univer-
sidad de Zaragoza.
[17] P.J. Miana, Integral de Lebesgue. Lecture Notes for Academic Year 2019/2020. Universi-
dad de Zaragoza.
[18] W. Rudin, Real and Complex Analysis, McGraw-Hill International Editions (1987).
[20] P.J. Miana, Curso de Análisis Funcional, Colección Textos Docentes (2006).
[22] C.F. Higham, D.J. Higham, Deep Learning: An Introduction for Applied Mathemati-
cians, SIAM Review 61 (4) (2019), 860 − 891.
[23] B. Wang, T.M. Nguyen, T. Sun, A.L. Bertozzi, R.G. Baraniuk, S.J. Osher,
Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent, arXiv preprint
arXiv:2002.10583 (2020).
Under review at https://openreview.net/pdf/24b9ad22e22211b234923e134f9a732cb25edcc7.pdf
[24] P. Mehta, M. Bukov, C.H. Wang, A.G.R. Day, C. Richardson, C.K. Fisher,
D.J. Schwab, A High-bias, Low-variance Introduction to Machine Learning for Physicists,
Physics Reports 810 (2019) 1 − 124.
[25] D. Kressner, Advanced Numerical Analysis. Lecture Notes Part 3a. Version 24.04.2015
(Optimization, Part 2),
https://www.epfl.ch/labs/anchp/wp-content/uploads/2018/05/part3a.pdf
[26] W. Su, S. Boyd, E.J. Candes, A Differential Equation for Modeling Nesterov’s Accel-
erated Gradient Method: Theory and Insights, Journal of Machine Learning Research 17
(2016) 1 − 43.
[35] X. Glorot, Y. Bengio, Understanding the Difficulty of Training Deep Feedforward Neu-
ral Networks, Proceedings of the Thirteenth International Conference on Artificial Intelli-
gence and Statistics (2010) 249 − 256.
[36] IBM Cloud Education, Natural Language Processing, IBM Cloud (2020)
https://www.ibm.com/cloud/learn/natural-language-processing
[37] D.H. Hubel, Single Unit Activity in Striate Cortex of Unrestrained Cats, Journal of
Physiology 147 (1959) 226 − 238.
[38] D.H. Hubel, T.N. Wiesel, Receptive Fields of Single Neurones in the Cat’s Striate
Cortex, Journal of Physiology 148 (1959) 574 − 591.
[40] Y. LeCun, Y. Bengio, G. Hinton, Deep Learning, Nature 521 (2015) 436 − 444.
[42] P.R. Vlachas, J. Pathak, B.R. Hunt, T.P. Sapsis, M. Girvan, E. Ott, P.
Koumoutsakos, Backpropagation Algorithms and Reservoir Computing in Recurrent
Neural Networks for the Forecasting of Complex Spatiotemporal Dynamics, Neural Net-
works 126 (2020) 191 − 217.