Deep Learning
Selected by Dr. Asmaa Awad
Third Level
College Vision
To create an innovative generation capable of embedding artificial intelligence technologies in systems and objects to achieve progress and growth.
College Mission
To prepare specialized and distinguished scientific cadres possessing the skills and knowledge in the field of artificial intelligence that qualify them to keep pace with developments in this field, and to create research and applications capable of competing locally, regionally and internationally, growing the surrounding community and developing its industry in line with Egypt's plan for sustainable development.
Course Objectives:
1. Understand the theoretical foundations of neural networks and deep learning.
2. Learn to design, train, and optimize neural network models.
3. Gain proficiency in applying neural networks to practical problems using modern frameworks.
Course Topics:
Neural Networks:
1. Introduction to Neural Networks: Structure and biological inspiration.
2. Feedforward Neural Networks: Architecture, activation functions, and backpropagation.
3. Optimization Techniques: Gradient descent, learning rate tuning, and regularization.
4. Applications: Handwritten digit recognition and simple classification tasks.
Deep Learning:
5. Convolutional Neural Networks (CNNs): Image recognition, feature extraction, and pooling.
6. Recurrent Neural Networks (RNNs): Sequential data modeling with LSTM and GRU networks.
7. Generative Adversarial Networks (GANs): Image generation and creative applications.
8. Advanced Topics: Transformers, attention mechanisms, and NLP applications.
Contents
CHAPTER 1: NEURAL NETWORKS
1.1 Overview
1.2 History
1.3 Applications
1.4 Biological Inspiration
1.5 Neuron Model and Network Architectures
1.5.1 Neuron Model
1.6 Network Architectures
1.6.1 Single Layer Network
1.6.2 Multiple Layers Network
1.6.3 Recurrent Networks
1.7 Neural Network Learning
1.7.1 Types of Learning
1.8 Learning Rules
1.8.1 Hebbian Learning
1.8.2 Perceptron
1.8.3 Backpropagation
Chapter 3: Convolutional Neural Networks
3.1 What Computers "See"
3.2 Learning Visual Features
3.2.1 Using Spatial Structure
3.3 Feature Extraction and Convolution - A Case Study
3.4 Convolutional Neural Networks (CNNs)
3.4.1 CNNs: Spatial Arrangement of Output Volume
3.4.2 Introducing Non-Linearity
3.4.3 Pooling
3.4.4 CNNs for Classification: Feature Learning
3.5 An Architecture for Many Applications
3.5.1 Object Detection
3.5.2 Semantic Segmentation: Fully Convolutional Networks
3.5.3 Continuous Control: Navigation from Vision
3.6 Summary
Chapter 4: Recurrent Neural Networks
4.1 Recurrent Neural Networks
4.1.1 Why sequence models
4.1.2 Name entity recognition
4.1.3 Representing words
4.1.4 Forward RNN
4.1.5 Backpropagation through time
4.1.6 Different types of RNNs
4.1.7 Language model and sequence generation
4.1.8 How to build language models with RNNs?
4.1.9 Vanishing gradients with RNNs
4.2 Gated Recurrent Unit (GRU)
4.3 LSTM Networks
4.3.1 Step-by-Step LSTM Walk Through
4.3.2 Variants on Long Short Term Memory
4.3.3 LSTM Example
4.4 Course summary
CHAPTER 1: NEURAL NETWORKS
1.1 Overview
An Artificial Neural Network (ANN) is a mathematical model designed to mimic the basic function of a biological neuron, and it has been used in many applications such as prediction, classification of inputs and data filtering.
The network is trained using the backpropagation algorithm: in the forward pass the actual output is calculated, and in the backward pass the weights between the output layer and the hidden layer, and between the hidden layer and the input layer, are adjusted. These steps are repeated until the error is reduced. The importance of the sigmoid transfer function is also presented in detail.
1.2 History:
A neural network is a machine that is designed to simulate the way the human brain works; the brain is composed of a large number of neurons working together to solve a specific problem.
The history of artificial neural networks can be traced back to the early 1940s. The first important paper on neural networks was published by the neurophysiologist Warren McCulloch and the logician Walter Pitts in 1943; they proposed a simple model of the neuron built from electronic circuits, consisting of two inputs and one output. In 1949 Donald Hebb proposed a learning law that became the starting point for neural network training algorithms. In the 1950s and 1960s, many researchers (Block, Minsky, Papert and Rosenblatt) worked on the first type of neural network, called the perceptron. The perceptron is a very simple mathematical representation of the neuron, on which most artificial neural networks are based to this day, as shown below in Figure 2.1.
This figure shows that the inputs of the neuron are represented by X1, X2, ..., Xm and are multiplied by the corresponding weights W1, W2, ..., Wm, similar to the synaptic strengths in a biological neuron; the externally applied bias is denoted by b. The summation of these inputs with their corresponding weights and the bias b is symbolized by V, which is calculated by equation 2.1:

V = ∑_{i=1}^{m} Wi Xi + b        (2.1)

The net input V is then compared with a certain threshold: if V is more than the threshold, the output (O) "fires", and if V is less than the threshold, the output (O) does "not fire".
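To make equation 2.1 concrete, here is a minimal sketch in Python with NumPy (the input and weight values are made up for illustration):

import numpy as np

def threshold_neuron(x, w, b, threshold=0.0):
    # Net input: V = sum_i(Wi * Xi) + b, as in equation 2.1
    v = np.dot(w, x) + b
    # The output "fires" (1) only if V exceeds the threshold
    return 1 if v > threshold else 0

x = np.array([1.0, 0.0])        # two inputs, as in the McCulloch-Pitts model
w = np.array([0.7, 0.4])        # corresponding weights
print(threshold_neuron(x, w, b=-0.5))   # V = 0.2 > 0, so the output fires: 1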
In 1959, Bernard Widrow and Marcian Hoff developed a model called ADALINE (Adaptive Linear Neuron); MADALINE (Multilayer ADALINE) is composed of many ADALINEs. In 1960, Widrow and Hoff developed a mathematical method for adapting the weights based on minimizing the squared error; this algorithm later became known as least mean squares (LMS). In 1962, Frank Rosenblatt was able to demonstrate the convergence of a learning algorithm. In 1969, Marvin Minsky and Seymour Papert published a book in which they showed that the perceptron could not learn functions that are not linearly separable.
The effect of these problems was to limit the funding available for research into artificial neural networks, and therefore neural network research declined throughout the 1970s and until the mid-1980s. Despite this proof of the limitations of neural networks, much work was still done: Willshaw and von der Malsburg worked on self-organizing maps, and Hopfield presented a paper on neural networks with feedback, known as Hopfield networks.
The backpropagation algorithm was first developed by Werbos in 1974; the major development happened around 1985-1986 when Rumelhart, Hinton and Williams popularized backpropagation, a powerful tool for training multilayer neural networks. The appearance of the backpropagation method dramatically expanded the range of problems to which neural networks can be applied.
1.3 Applications:
Google uses neural networks for image tagging (automatically identifying an image and assigning keywords), and
Microsoft has developed neural networks that can help convert spoken English speech into spoken Chinese speech. These
examples are indicative of the broad range of applications that can be found for neural networks. The applications are
expanding because neural networks are good at solving problems, not just in engineering, science and mathematics, but in
medicine, business, finance and literature as well. Their application to a wide variety of problems in many fields makes
them very attractive. Also, faster computers and faster algorithms have made it possible to use neural networks to solve
complex industrial problems that formerly required too much computation.
1. Aerospace:
High performance aircraft autopilots, flight path simulations, aircraft control systems, autopilot enhancements,
aircraft component simulations, aircraft component fault detectors.
2. Automotive:
Automobile automatic guidance systems, fuel injector control, automatic braking systems, misfire detection,
virtual emission sensors, warranty activity analyzers.
3. Banking:
Check and other document readers, credit application evaluators, cash forecasting, firm classification, exchange
rate forecasting, predicting loan recovery rates, measuring credit risk.
4. Defense:
Weapon steering, target tracking, object discrimination, facial recognition, new kinds of sensors, sonar, radar and
image signal processing including data compression, feature extraction and noise suppression, signal/image identification.
5. Electronics:
Code sequence prediction, integrated circuit chip layout, process control, chip failure analysis, machine vision,
voice synthesis, nonlinear modeling.
6. Entertainment:
Animation, special effects, market forecasting.
7. Financial:
Real estate appraisal, loan advisor, mortgage screening, corporate bond rating, credit line use analysis, portfolio
trading program, corporate financial analysis, currency price prediction.
8. Insurance:
Policy application evaluation, product optimization.
9. Manufacturing:
Manufacturing process control, product design and analysis, process and machine diagnosis, real-time particle
identification, visual quality inspection systems, beer testing, welding quality analysis, paper quality prediction, computer
chip quality analysis, analysis of grinding operations, chemical product design analysis, machine maintenance analysis,
project bidding, planning and management, dynamic modeling of chemical process systems.
10. Medical:
Breast cancer cell analysis, EEG and ECG analysis, prosthesis design, optimization of transplant times, hospital
expense reduction, hospital quality improvement, emergency room test advisement.
11. Oil and Gas:
Exploration, smart sensors, reservoir modeling, well treatment decisions, seismic interpretation.
12. Robotics:
Trajectory control, forklift robot, manipulator controllers, vision systems, autonomous vehicles.
13. Speech:
Speech recognition, speech compression, vowel classification, text-to-speech synthesis.
14. Securities:
Market analysis, automatic bond rating, stock trading advisory systems.
15. Telecommunications:
Image and data compression, automated information services, real-time translation of spoken language, customer
payment processing systems.
16. Transportation:
Truck brake diagnosis systems, vehicle scheduling, routing systems.
1.4 Biological Inspiration:
The brain consists of a large number of highly connected elements called neurons. These neurons have three principal components: the dendrites, the cell body and the axon, as shown in Figure 1.
Figure 1: Biological Neuron
The dendrites are tree-like receptive networks of nerve fibers that carry electrical signals into the cell body.
The cell body effectively sums and thresholds these incoming signals.
The axon is a single long fiber that carries the signal from the cell body out to other neurons.
The point of contact between an axon of one cell and a dendrite of another cell is called a synapse. It is the
arrangement of neurons and the strengths of the individual synapses, determined by a complex chemical process, that
establishes the function of the neural network.
The synapses are the connections which enable the transfer of electric axon impulses from a particular neuron to
dendrites of other neurons, as illustrated in Figure 2.
The human brain has close to 100 billion nerve cells, called neurons. Each neuron is connected to thousands of others, creating a neural network that shuttles information, in the form of stimuli, in and out of the brain constantly. Each of the yellow blobs in Figure 2.3 is a neuronal cell body (soma); each neuron has long, thin nerve fibres called dendrites that bring information in, and even longer fibres called axons that send information away.
Figure 2.3: Biological neurons of human brain [47].
The neuron receives information in the form of electrical signals from neighboring neurons across one of
thousands of synapses, small gaps that separate two neurons and act as input channels.
Once a neuron has received this charge it triggers either a "go" signal that allows the message to be passed to
the next neuron or a "stop" signal that prevents the message from being forwarded, so it is important to note that
a neuron fires only if the total signal received at the cell body exceeds a certain level.
For example, when a person thinks of something, sees an image, or smells a scent, that mental process or
sensory stimulus excites a neuron, which fires an electrical pulse that shoots out through the axons and fires
across the synapse. If enough input is received at the same time, the neuron is activated to send out a signal to
be picked up by the next neuron's dendrites.
1.5 Neuron Model and Network Architectures:
1.5.1 Neuron Model:
If we relate this simple single-input neuron model back to the biological neuron that we discussed in section 1.4, the weight w corresponds to the strength of a synapse, the cell body is represented by the summation and the transfer function, and the neuron output a represents the signal on the axon.
𝑎 = 𝑓(𝑤𝑝 + 𝑏)
Example 2.1:
Let 𝑤 = 3 , 𝑝 = 2 and 𝑏 = –1.5, what is the single-input neuron output ?
a = f(3 × 2 + (−1.5)) = f(4.5)
The actual output depends on the particular transfer function that is chosen.
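A minimal sketch of this in Python with NumPy (the transfer functions shown are common choices, written with their usual definitions):

import numpy as np

n = 3 * 2 + (-1.5)                      # net input from Example 2.1: n = 4.5
hardlim = 1.0 if n >= 0 else 0.0        # hard limit (threshold) function
logsig = 1.0 / (1.0 + np.exp(-n))       # log-sigmoid function
linear = n                              # pure linear function
print(hardlim, round(logsig, 3), linear)  # 1.0 0.989 4.5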
Notes:
1. The bias is much like a weight, except that it has a constant input of 1. However, if you do not want to have a bias
in a particular neuron, it can be omitted.
2. Note that 𝑤 and 𝑏 are both adjustable scalar parameters of the neuron. Typically, the transfer function is chosen by
the designer and then the parameters 𝑤 and 𝑏 will be adjusted by some learning rule so that the neuron input/output
relationship meets some specific goal.
Typically, a neuron has more than one input. A neuron with R inputs is shown in Figure 13. The individual inputs p = (p1, p2, p3, ..., pR) are each weighted by corresponding elements w1,1, w1,2, ..., w1,R of the weight matrix W.
The neuron has a bias b, which is summed with the weighted inputs to form the net input n:

n = w1,1 p1 + w1,2 p2 + ... + w1,R pR + b

In matrix form:

n = Wp + b

where the matrix W for the single neuron case has only one row. The neuron output is

a = f(Wp + b)

The elements of the weight matrix have two indices: the first index indicates the particular neuron destination for that weight, and the second index indicates the source of the signal fed to the neuron. Thus, the indices in w1,2 say that this weight represents the connection to the first (and only) neuron from the second source. Of course, this convention is more useful if there is more than one neuron, as will be the case later.
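A minimal sketch of the multiple-input neuron a = f(Wp + b) in Python/NumPy (the log-sigmoid choice and the numbers are just for illustration):

import numpy as np

def neuron(W, p, b, f=lambda n: 1.0 / (1.0 + np.exp(-n))):
    # W is a single-row weight matrix, p the input vector, b the bias
    n = W @ p + b              # net input n = Wp + b
    return f(n)                # output a = f(n)

W = np.array([[1.0, -2.0, 0.5]])   # one neuron with R = 3 inputs
p = np.array([2.0, 1.0, 4.0])
print(neuron(W, p, b=0.5))         # a = f(2.5) ≈ 0.924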
We would like to draw networks with several neurons, each having several inputs. Further, we would like to have
more than one layer of neurons. You can imagine how complex such a network might appear if all the lines were drawn.
It would take a lot of ink, could hardly be read, and the mass of detail might obscure the main features. Thus, we will use
an abbreviated notation. A multiple-input neuron using this notation is shown in Figure 14.
Figure 14: Neuron with 𝑹 Inputs, Abbreviated Notation
Note that the number of inputs to a network is set by the external specifications of the problem. If, for instance,
you want to design a neural network that is to predict kite-flying conditions and the inputs are air temperature, wind
velocity and humidity, then there would be three inputs to the network.
There are a variety of transfer functions; some of them are listed below:
1. Threshold (Hard Limit) Transfer Function:
This function is used to create neurons that classify inputs into two distinct categories.
The log-sigmoid transfer function is commonly used in multilayer networks that are trained using the
backpropagation algorithm.
Using the tanh function instead of the logistic one is essentially equivalent; the tanh function has the advantage of being symmetrical with respect to the origin.
Figure 9: Radial Basis Transfer Function
Example 2.2:
Let 𝑤 = 4 , 𝑝 = 2 and 𝑏 = –2 with 𝑓 radial basis, what is the single neuron output?
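A worked answer, assuming the usual radial basis definition f(n) = exp(−n²): the net input is n = wp + b = 4 × 2 − 2 = 6, so a = exp(−36) ≈ 0, since the net input is far from the center of the basis function. A quick check in Python:

import numpy as np

n = 4 * 2 + (-2)         # net input n = 6
a = np.exp(-n ** 2)      # radial basis transfer function exp(-n^2)
print(a)                 # ≈ 2.3e-16, effectively zero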
1.6 Network Architectures:
1.6.1 Single Layer Network:
The layer includes the weight matrix, the summers, the bias vector, the transfer function boxes and the output vector.
Each element of the input vector p is connected to each neuron through the weight matrix W. Each neuron has a bias bi, a summer, a transfer function f and an output ai.
Taken together, the outputs form the output vector.
You might ask if all the neurons in a layer must have the same transfer function. The answer is no; you can define
a single (composite) layer of neurons having different transfer functions by combining two of the networks shown above
in parallel. Both networks would have the same inputs, and each network would create some of the outputs.
The input vector elements enter the network through the weight matrix W:

W = [ w1,1  w1,2  ...  w1,R
      w2,1  w2,2  ...  w2,R
        ⋮      ⋮    ⋱    ⋮
      wS,1  wS,2  ...  wS,R ]
As noted previously, the row indices of the elements of matrix W indicate the destination neuron associated with that weight, while the column indices indicate the source of the input for that weight. Thus, the indices in w3,2 say that this weight represents the connection to the third neuron from the second source.
Figure 16: Layer of 𝑺 Neurons, Abbreviated Notation
1.6.2 Multiple Layers Network:
As shown, there are R inputs, S1 neurons in the first layer, S2 neurons in the second layer, etc. As noted, different layers can have different numbers of neurons.
The outputs of layers one and two are the inputs for layers two and three. Thus layer 2 can be viewed as a one-layer network with R = S1 inputs, S = S2 neurons, and an S2 × S1 weight matrix W2. The input to layer 2 is a1, and the output is a2.
A layer whose output is the network output is called an output layer. The other layers are called hidden layers.
The network shown above has an output layer (layer 3) and two hidden layers (layers 1 and 2).
The same three-layer network discussed previously also can be drawn using our abbreviated notation, as shown
in Figure 18.
Multilayer networks are more powerful than single layer networks. For instance, a two-layer
network having a sigmoid first layer and a linear second layer can be trained to approximate most
functions arbitrarily well. Single- layer networks cannot do this.
As for the number of layers, most practical neural networks have just two or three layers. Four
or more layers are used rarely.
We should say something about the use of biases. One can choose neurons with or without
biases. The bias gives the network an extra variable, and so you might expect that networks with
biases would be more powerful than those without, and that is true. Note, for instance, that a neuron
without a bias will always have a net input 𝑛 of zero when the network inputs 𝐩 are zero. This may
not be desirable and can be avoided by the use of a bias.
1.6.3 Recurrent Networks:
Recurrent architectures such as Hopfield, Elman, Jordan and bidirectional networks are special cases of recurrent artificial neural networks.
Example:
A single-layer neural network is to have six inputs and two outputs. The outputs are to be continuous and limited to the range 0 to 1. What can you tell about the network architecture? Specifically: how many neurons are required, what are the dimensions of the weight matrix, what kind of transfer function could be used, and is a bias required?
Solution:
Two neurons are required, one for each output, so the weight matrix has dimensions 2 × 6. A transfer function whose output is continuous over the range 0 to 1, such as the log-sigmoid, could be used. Nothing in the problem statement tells us whether a bias is required.
1.7 Neural Network Learning:
One of the questions raised by the above examples is: "How do we determine the weight matrix and bias for perceptron networks with many inputs, where it is impossible to visualize the decision boundaries?" The answer is to build an algorithm for training perceptron networks, so that they can learn to solve classification problems.
1.7.1 Types of Learning:
Supervised Learning:
The learning rule is provided with a set of examples (the training set) of proper network behavior:
{p1, t1}, {p2, t2}, {p3, t3}, ..., {pQ, tQ}
where 𝐩 is an input to the network and 𝐭 is the corresponding correct (target) output. As the
inputs are applied to the network, the network outputs are compared to the targets. The learning
rule is then used to adjust the weights and biases of the network in order to move the network
outputs closer to the targets.
Unsupervised Learning:
The weights and biases are modified in response to network inputs only. There are no target
outputs available. At first glance this might seem to be impractical. How can you train a network if
you don’t know what it is supposed to do? Most of these algorithms perform some kind of
clustering operation. They learn to categorize the input patterns into a finite number of classes.
1.8 Learning Rules:
1.8.1 Hebbian Learning
• Step 1: Initialize all weights and the bias to zero.
• Step 2: For each training pair (x, t), adjust each weight: wi(new) = wi(old) + xi·t
• Step 3: Adjust the bias (just like the weights): b(new) = b(old) + t
Example
PROBLEM:
Construct a Hebb Net which performs like an AND function, that is, only when both features are "active"
will the data be in the target class.
TRAINING SET (with the bias input always at 1):
x1  x2  bias  Target
 1   1    1     1
 1  -1    1    -1
-1   1    1    -1
-1  -1    1    -1
Training - First Input
Starting from zero weights:
w₁(new) = w₁(old) + x₁t = 0 + 1 = 1
w₂(new) = w₂(old) + x₂t = 0 + 1 = 1
b(new) = b(old) + t = 0 + 1 = 1
Training - Second Input
w₁(new) = w₁(old) + x₁t = 1 + 1(-1) = 0
w₂(new) = w₂(old) + x₂t = 1 + (-1)(-1) = 2
b(new) = b(old) + t = 1 + (-1) = 0
Training - Third Input
w₁(new) = w₁(old) + x₁t = 0 + (-1)(-1) = 1
w₂(new) = w₂(old) + x₂t = 2 + 1(-1) = 1
b(new) = b(old) + t = 0 + (-1) = -1
Training - Fourth Input
w₁(new) = w₁(old) + x₁t = 1 + (-1)(-1) = 2
w₂(new) = w₂(old) + x₂t = 1 + (-1)(-1) = 2
b(new) = b(old) + t = -1 + (-1) = -2
Final Neuron
The final weights are w₁ = 2, w₂ = 2, b = -2. Checking them against the training set:
x1  x2  bias  Target
 1   1    1     1
 1  -1    1    -1
-1   1    1    -1
-1  -1    1    -1
For input (1, 1): 1·2 + 1·2 + 1·(-2) = 2 > 0 ✓
For input (-1, -1): (-1)·2 + (-1)·2 + 1·(-2) = -6 < 0 ✓
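A minimal sketch of this Hebb training loop in Python/NumPy, which reproduces the hand calculations above:

import numpy as np

# AND training set in bipolar form: (x1, x2, bias input, target)
samples = [(1, 1, 1, 1), (1, -1, 1, -1), (-1, 1, 1, -1), (-1, -1, 1, -1)]

w = np.zeros(3)   # [w1, w2, b]; the bias is a weight on a constant input of 1
for x1, x2, bias, t in samples:
    w += np.array([x1, x2, bias]) * t    # Hebb rule: w(new) = w(old) + x*t
print(w)          # [ 2.  2. -2.], matching the final neuron above

# Verify: the sign of the net input matches the target for every sample
for x1, x2, bias, t in samples:
    net = w @ np.array([x1, x2, bias])
    print(net, "-> class", 1 if net > 0 else -1, "(target", t, ")")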
1.8.2 Perceptron
The perceptron model, originally proposed by Rosenblatt and later analyzed by Minsky and Papert, is a more general computational model than the McCulloch-Pitts neuron. It overcomes some of the limitations of the M-P neuron by introducing the concept of numerical weights (a measure of importance) for inputs, and a mechanism for learning those weights. Inputs are no longer limited to boolean values as in the case of an M-P neuron; the perceptron supports real inputs as well, which makes it more useful and general.
Now, this is very similar to an M-P neuron, but we take a weighted sum of the inputs and set the output to one only when the sum is more than an arbitrary threshold (theta). However, by convention, instead of hand-coding the thresholding parameter theta, we add it as one of the inputs, with the weight -theta, which makes it learnable (more on this in the perceptron algorithm below).
Consider the task of predicting whether I would watch a random game of football on TV or not (the same example used for the M-P neuron) using the behavioral data available. And let's assume my decision is solely dependent on 3 binary inputs (binary for simplicity).
Here, w_0 is called the bias because it represents the prior (prejudice). A football freak may have a very low threshold
and may watch any football game irrespective of the league, club or importance of the game [theta = 0]. On the other
hand, a selective viewer like me may only watch a football game that is a premier league game, featuring Man United
game and is not friendly [theta = 2]. The point is, the weights and the bias will depend on the data (my viewing history in
this case).
Based on the data, if needed the model may have to give a lot of importance (high weight) to
the isManUnitedPlaying input and penalize the weights of other inputs.
Perceptron Algorithm
Use a set of training samples {(p, t)}:
Repeat:
• Choose a sample (p, t)
• Compute the output o = hardlim(w·p + b)
• If o ≠ t, update the weights: w ← w + (t − o)p and b ← b + (t − o)
until all samples are classified correctly.
What kind of functions can be implemented using a perceptron? How different is it from McCulloch-Pitts neurons?
From the equations, it is clear that even a perceptron separates the input space into two halves, positive and negative.
All the inputs that produce an output 1 lie on one side (positive half space) and all the inputs that produce an output 0
lie on the other side (negative half space).
In other words, a single perceptron can only be used to implement linearly separable functions, just like the M-P neuron.
Then what is the difference? Why do we claim that the perceptron is an updated version of an M-P neuron? Here, the
weights, including the threshold can be learned and the inputs can be real values.
Try solving the equations on your own. The above "possible solution" was obtained by solving the linear system of equations on the left. It is clear that the solution separates the input space into two spaces, negative and positive half spaces. I encourage you to try it out for AND and other boolean functions.
Now, if you actually try to solve the linear equations above, you will realize that there can be multiple solutions. But which solution is the best? To more formally define the "best" solution, we need to understand errors and error surfaces, which we will do when studying the perceptron learning algorithm in more detail.
Now let's look at a non-linear boolean function i.e., you cannot draw a line to separate positive inputs from the negative
ones.
Notice that the fourth equation contradicts the second and the third equations. The point is, there are no perceptron solutions for non-linearly separable data. So the key takeaway is that a single perceptron cannot learn to separate data that are non-linear in nature.
Example
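A minimal, illustrative sketch of perceptron learning on the AND function in Python/NumPy, assuming the update rule given above (with 0/1 inputs and targets):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # AND inputs
T = np.array([0, 0, 0, 1])                        # AND targets

w = np.zeros(2)
b = 0.0
for epoch in range(10):                   # repeat over the training samples
    for x, t in zip(X, T):
        o = 1 if w @ x + b > 0 else 0     # perceptron output
        w = w + (t - o) * x               # update only when misclassified
        b = b + (t - o)

print(w, b)                                        # a linear separator for AND
print([1 if w @ x + b > 0 else 0 for x in X])      # [0, 0, 0, 1]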
1.8.3 Backpropagation
This section describes the teaching process of a multi-layer neural network employing the backpropagation algorithm. To illustrate this process, a three-layer neural network with two inputs and one output, shown in the picture below, is used.
Each neuron is composed of two units. The first unit adds the products of the weight coefficients and the input signals. The second unit realizes a nonlinear function, called the neuron activation function. Signal e is the adder output signal, and y = f(e) is the output signal of the nonlinear element. Signal y is also the output signal of the neuron.
To teach the neural network we need a training data set. The training data set consists of input signals (x1 and x2) assigned with the corresponding target (desired output) z. The network training is an iterative process: in each iteration the weight coefficients of the nodes are modified using new data from the training data set, and the modification is calculated using the algorithm described below. Each teaching step starts with forcing both input signals from the training set; after this stage we can determine the output signal values for each neuron in each network layer. The pictures below illustrate how the signal propagates through the network. Symbols w(xm)n represent the weights of the connections between network input xm and neuron n in the input layer. Symbols yn represent the output signal of neuron n.
Propagation of signals through the hidden layer. Symbols wmn represent the weights of the connections between the output of neuron m and the input of neuron n in the next layer.
Propagation of signals through the output layer.
In the next algorithm step the output signal of the network y is compared with the desired output value (the target), which is found in the training data set. The difference is called the error signal d of the output layer neuron.
It is impossible to compute the error signal for internal neurons directly, because the output values of these neurons are unknown. For many years an effective method for training multilayer networks was unknown; only in the middle eighties was the backpropagation algorithm worked out. The idea is to propagate the error signal d (computed in a single teaching step) back to all neurons whose output signals were inputs for the discussed neuron.
The weight coefficients wmn used to propagate the errors back are equal to those used when computing the output value; only the direction of data flow is changed (signals are propagated from outputs to inputs one after the other). This technique is used for all network layers. If the propagated errors come from several neurons, they are added. The illustration is below:
When the error signal for each neuron is computed, the weight coefficients of each neuron input node may be modified. In the formulas below, df(e)/de represents the derivative of the activation function of the neuron whose weights are modified.
The coefficient η (the learning rate) affects the network teaching speed. There are a few techniques to select this parameter. The first method is to start the teaching process with a large value of the parameter; while the weight coefficients are being established, the parameter is gradually decreased. The second, more complicated, method starts teaching with a small parameter value; during the teaching process the parameter is increased as the teaching advances, and then decreased again in the final stage. Starting the teaching process with a low parameter value makes it possible to determine the signs of the weight coefficients.
Let's make that concrete by explicitly showing all the calculations for a full-sized network with 2 inputs, 3 hidden layer neurons and 2 output neurons, as shown in figure 3.4. W+ represents the new, recalculated weight, whereas W (without the superscript) represents the old weight.
Figure 3.4, all the calculations for a reverse pass of Back Propagation.
Table of Content 41
The constant η (called the learning rate, and nominally equal to one) is put in to speed up or slow down the learning if required.
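To complement the walkthrough, here is a minimal, illustrative sketch of one full backpropagation step for the 2-3-2 network described above (Python/NumPy, sigmoid activations everywhere; the initial weights, inputs and targets are made-up values, and biases are omitted for brevity):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))    # input -> hidden weights (3 hidden, 2 inputs)
W2 = rng.normal(size=(2, 3))    # hidden -> output weights (2 outputs)
x = np.array([0.35, 0.9])       # example input
t = np.array([0.5, 0.5])        # example target
eta = 1.0                       # learning rate, nominally equal to one

# Forward pass
h = sigmoid(W1 @ x)             # hidden layer outputs
y = sigmoid(W2 @ h)             # network outputs

# Backward pass: output error deltas, then propagate them back through W2
delta_out = (t - y) * y * (1 - y)               # error signal at the outputs
delta_hid = (W2.T @ delta_out) * h * (1 - h)    # error signal at the hidden layer

# Weight updates: W+ = W + eta * delta * input, as in the reverse-pass figure
W2 += eta * np.outer(delta_out, h)
W1 += eta * np.outer(delta_hid, x)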
Learning in the Backpropagation Algorithm
The training algorithm used in the backpropagation method proceeds in a number of steps, as illustrated above. Because of its simplicity of use and impressive speed in training Artificial Neural Networks (ANNs), its distinctive ability to extract meaning from complicated data and recognize patterns, and its strong capabilities for prediction and data filtering, the backpropagation learning algorithm has become a powerful and widely used technique for learning and training ANNs.
Chapter 3: Convolutional Neural Networks
Manual Feature Extraction: domain knowledge → define features → detect features to classify.
Learning Feature Representations: can we learn a hierarchy of features directly from the data instead of hand engineering them? (We mentioned this in the first lecture: MIT Introduction to Deep Learning, 6.S191.)
Input:
• 2D images
• Arrays of pixel values
3.2.1 Using Spatial Structure
So, how can we use the spatial structure in the input to inform the architecture of the network?
IDEA: connect patches of the input to neurons in the hidden layer. Each neuron is connected to a region of the input and only "sees" these values.
One way we can use the spatial structure is to actually connect patches of our input, not the whole input, but just patches of the input, to neurons in the hidden layer. Before, everything was connected from the input layer to the hidden layer; now we're only going to connect things that are within a single patch to the next neuron in the next layer. Each neuron only sees the values coming from the patch that precedes it.
(patch: a small area of something, especially one that is different from the area around it.)
This will not only reduce the number of weights in our model, but it also allows us to leverage the fact that, in an image, spatially close pixels are likely to be related and correlated to each other. That's a fact we should really take into account.
We can basically do this by sliding that patch across the input image. For each time we slide it,
we're going to have a new output neuron in the subsequent layer. This way, we can actually take
into account some of the spatial structure that is inherent to our input, but remember that
our ultimate task is not only to preserve spatial structure but to actually learn the visual
features. And we do this by weighting the connections between the patches and the neurons.
• Apply a set of weights - a filter - to extract local features
• Use multiple filters to extract different features
• Spatially share parameters of each filter
3.3 Feature Extraction and Convolution - A Case Study
We want our model to basically compare images of a piece of an X (piece by piece) and
the really important pieces that it should look for are exactly what we've been calling
the features. If our model can find those important (and rough) features that define the X
roughly in the same positions, it can get a lot better at understanding the similarity
between different examples of X even in the presence of these types of deformities.
The Convolution Operation. We slide the 3×3 filter over the input image, element-wise multiply, and add the outputs:
Producing Feature Maps. Different filters can be used to produce different feature maps, as sketched below.
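A minimal sketch of this sliding-window convolution in Python/NumPy (the filter values here are an illustrative hand-made edge detector; a real CNN learns its filters):

import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # element-wise multiply the patch by the filter and add the outputs
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.random.rand(6, 6)
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])    # a 3x3 vertical-edge filter
feature_map = convolve2d(image, vertical_edge)
print(feature_map.shape)                  # (4, 4): one output per filter position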
3.4 Convolutional Neural Networks (CNNs)
CNNs stack three main operations:
1. Convolution: Apply filters to generate feature maps.
2. Non-linearity: Often ReLU.
3. Pooling: Downsampling operation on each feature map.
tf.keras.layers.Conv2D
tf.keras.activations.*
tf.keras.layers.MaxPool2D
Train the model with image data. Learn the weights of the filters in the convolutional layers.
In the dense layers, we'll need to add a bias to allow us to shift the activation function, and apply a non-linearity so that we can handle non-linear data relationships.
What's special here is that the local connectivity is preserved each neuron in the hidden
layer you can see in the right only sees a very specific patch of its inputs. It does not see
the entire input neurons like it would have if it was a fully connected layer.
Let's define the actual operation more concretely using a mathematical equation. In the 3×3-filter example we're left with a 4×4 feature map, and for each neuron in the hidden layer, its inputs are those neurons in the patch from the previous layer.
Summary: 1) apply a window of weights; 2) compute linear combinations; 3) activate with a non-linear function.
Previously, we saw how to take an input image and learn a single feature map. But in reality there are many types of features in our image. How can we use convolutional layers to learn a stack of many different feature types that could be useful for performing our task? How can we use this to do multiple feature extraction?
Now the output layer is still a convolution, but it has a volume dimension, where the height and the width are spatial dimensions dependent upon the dimensions of the input layer.
3.4.3 Pooling
Pooling is an operation that is commonly used to reduce the dimensionality of our inputs and of our feature maps while still preserving spatial invariance.
A common type of pooling used in practice is called max pooling.
Max Pooling.
• Reduced dimensionality
• Spatial invariance
tf.keras.layers.MaxPool2D(
pool_size=(2,2),
strides=2
)
Mean Pooling. Taking the maximum over the patch is one idea; a very common alternative is taking the average, which is called mean pooling.
Taking the average represents a smoother way to perform the pooling operation, because you're not just taking a maximum, which can be subject to outliers; by averaging, you get a smoother result in your output layer.
Max pooling and mean pooling both have their advantages and disadvantages.
The CNNs for classification can be broken down into two parts.
Part 1: Feature Learning
First is the feature learning part, where we actually try to learn the features in our input image that can be used to perform our specific task. This feature learning pipeline (convolution, non-linearity, pooling) was described earlier in this chapter.
Part 2: Classification
The convolutional and pooling layers of the first part output the high-level features of the input. The second part uses these features to perform the classification, or whatever our task is; in this case, the task is to output the class probabilities for the input image. So we feed those output features into a fully connected (dense) neural network to perform the classification. We can do this without worrying about spatial structure anymore, because we've already downsampled our image so much that it's not really even an image any longer; it's closer to a vector of numbers, and we can directly apply our dense neural network to that vector. It's also much lower dimensional now. We can output class probabilities using a function called softmax, whose output represents a categorical probability distribution.
The spatial invariance of convolution refers to applying the same filter bank F to input patches at
all locations.
import tensorflow as tf

def generate_model():
    model = tf.keras.Sequential([
        # first convolutional layer: 32 filters, 3x3 kernels
        tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        # classifier head (a minimal completion to make the model runnable)
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    return model
3.5 An Architecture for Many Applications
The task can be:
• Classification
• Object detection
• Segmentation
• Probabilistic control
3.5.1 Object Detection
A naive solution. We can start by placing a random box over the input image somewhere; it has some random location and a random size. Then we can take that box and feed it through our normal image classification network, whose task is to predict the class of this image. If the box contains no class, the network can simply ignore it. We repeat this process: we pick another box in the scene, pass it through the network to predict its class, and we keep doing this with different boxes in the scene. In some sense, if each of these boxes gives us a predicted class, we can pick the boxes that do have a class in them and use those as boxes where an object is found. Problem: there are way too many possible boxes, at too many scales, positions and sizes. We can't possibly iterate over our images in all of these dimensions.
R-CNN algorithm: Find regions that we think have objects. Use CNN to classify.
Problems:
1. Slow! Many regions; time intensive inference.
2. Brittle! Manually define region proposals.
A faster region-based approach (Faster R-CNN) addresses these problems. It only requires a single forward pass through the model: we feed in the image once, a region proposal network extracts the regions, and all of these regions are then fed on to perform the classification.
3.5.2 Semantic Segmentation: Fully Convolutional Networks
The network is designed with all convolutional layers, with downsampling and upsampling operations.
This output is created using an upsampling operation rather than a downsampling operation; upsampling allows the convolutional decoder to increase its spatial dimension.
tf.keras.layers.Conv2DTranspose
3.6 Summary
Chapter 4: Recurrent Neural Networks
4.1 Recurrent Neural Networks
Learn about recurrent neural networks. This type of model has been proven to perform extremely well on temporal data.
It has several variants including LSTMs, GRUs and Bidirectional RNNs, which you are going to learn about in this section.
4.1.1 Why sequence models
Examples of sequence problems (the task labels here follow Andrew Ng's course examples):
▪ Speech recognition — X: wave sequence, Y: text sequence
▪ Music generation — X: nothing or an integer, Y: wave sequence
▪ Sentiment classification — X: text sequence, Y: integer rating
▪ DNA sequence analysis — X: DNA sequence, Y: DNA labels
▪ Video activity recognition — X: video frames, Y: label (activity)
▪ Name entity recognition — X: text sequence, Y: label sequence
▪ Can be used by search engines to index different types of words inside a text.
• All of these problems with different input and output (sequence or not) can be addressed as supervised learning
with label data X, Y as the training set.
4.1.2 Name entity recognition
• Named entities are the proper names that play an important role in searching for important information of interest (understanding the meaning of a word).
• Motivating example:
▪ X: "Harry Potter and Hermione Granger invented a new spell" (9 words)
▪ Y: 1 1 0 1 1 0 0 0 0
▪ An output of 1 means the corresponding word is part of a name, while 0 means it is not.
• We will index the first element of x by x<1>, the second x<2> and so on.
o x<1> = Harry
o x<2> = Potter
• Similarly, we will index the first element of y by y<1>, the second y<2> and so on.
o y<1> = 1
o y<2> = 1
• Tx is the size of the input sequence and Ty is the size of the output sequence.
• x(i)<t> is element t of the input sequence of training example i. Similarly, y(i)<t> is the t-th element of the output sequence of the i-th training example.
• Tx(i) is the input sequence length for training example i; it can differ across examples. Similarly, Ty(i) is the length of the output sequence of the i-th training example.
• For example, output 1 for the name of a person, 2 for the name of a city, 3 for a currency, 4 for a book name, and 0 for anything else.
4.1.3 Representing words:
o We will now work with NLP, which stands for natural language processing. One of the challenges of NLP is how we can represent a word.
ii.We need a vocabulary list that contains all the words in our target sets.
▪ Example:
▪ Each word will have a unique index that it can be represented with.
▪ Vocabulary sizes in modern applications are from 30,000 to 50,000. 100,000 is not uncommon.
Some of the bigger companies use even a million.
▪ To build the vocabulary list, you can read all the texts you have and keep the m most frequently occurring words, or search online for the m most common words.
iii.Create a one-hot encoding sequence for each word in your dataset given the vocabulary you have created.
▪ We can add a token in the vocabulary with name <UNK> which stands for unknown text and use
its index for your one-hot vector.
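A minimal sketch of this one-hot representation in Python (the tiny vocabulary is illustrative):

import numpy as np

vocab = ["a", "and", "harry", "potter", "<UNK>"]      # toy vocabulary
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # unknown words fall back to the <UNK> token's index
    vec = np.zeros(len(vocab))
    vec[index.get(word.lower(), index["<UNK>"])] = 1.0
    return vec

print(one_hot("Harry"))    # [0. 0. 1. 0. 0.]
print(one_hot("spell"))    # [0. 0. 0. 0. 1.]  (unknown word -> <UNK>)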
o Full example:
• The goal is, given this representation for x, to learn a mapping from x to the target output y using a sequence model, as a supervised learning problem.
• Why not use a standard network for sequence tasks? There are two problems:
o Inputs and outputs can have different lengths in different examples.
▪ This can be solved for normal NNs by padding to the maximum length, but it's not a good solution.
o A standard network doesn't share features learned across different positions of the text.
▪ Using feature sharing, as in CNNs, can significantly reduce the number of parameters in your model. That's what we will do in RNNs.
o Long-term dependencies are also hard for a standard network to capture.
• A recurrent neural network doesn't have these problems.
4.1.4 Forward RNN
• Let's build an RNN that solves the name entity recognition task:
o In this problem Tx = Ty. In other problems where they aren't equal, the RNN architecture may be different.
o a<0> is usually initialized with zeros, but some practitioners may initialize it randomly in some cases.
o There are three weight matrices here: Wax, Waa, and Wya.
• The weight matrix Waa is the memory the RNN is trying to maintain from the previous layers.
• A lot of papers and books write the same architecture this way:
o It's harder to interpret; it's easier to work with the unrolled version of the drawing.
• In the discussed RNN architecture, the current output ŷ<t> depends on the previous inputs and activations.
• Let's have this example 'He Said, "Teddy Roosevelt was a great president"'. In this example Teddy is a person
name but we know that from the word president that came after Teddy not from He and said that were before
it.
• So a limitation of the discussed architecture is that it cannot learn from elements later in the sequence. To address this problem we will later discuss the Bidirectional RNN (BRNN).
• Now let's discuss the forward propagation equations on the discussed architecture:
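Written here in the standard form for this architecture, using the weight names above, the equations are:

a<t> = g(Waa a<t-1> + Wax x<t> + ba)
ŷ<t> = g'(Wya a<t> + by)

where g and g' are the activation functions discussed next.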
o The activation function g for a is usually tanh or ReLU, and the output activation g' depends on your task, e.g. sigmoid or softmax. In the name entity recognition task we will use sigmoid because we only have two classes.
• In order to help us develop complex RNN architectures, the last equations need to be simplified a bit.
4.1.5 Backpropagation through time
• Usually deep learning frameworks do backpropagation automatically for you. But it's useful to know how it
works in RNNs.
o Where wa, ba, wy, and by are shared across each element in a sequence.
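The loss equations being referred to are the standard cross-entropy ones, consistent with the sigmoid output used for the two-class name entity task:

L<t>(ŷ<t>, y<t>) = −y<t> log ŷ<t> − (1 − y<t>) log(1 − ŷ<t>)
L = Σt L<t>(ŷ<t>, y<t>)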
o Where the first equation is the loss for one example and the loss for the whole sequence is given by the
summation over all the calculated single example losses.
• Graph with losses:
• The backpropagation here is called backpropagation through time because we pass activation a from one
sequence element to another like backwards in time.
4.1.6 Different types of RNNs
• In the sentiment analysis problem, X is a text while Y is an integer that ranges from 1 to 5. The RNN architecture for that is Many to One, as in Andrej Karpathy's image.
o Note that starting the second layer we are feeding the generated output back to the network.
• There is another interesting Many To Many architecture. In applications like machine translation, the input and output sequences have different lengths in most cases, so an alternative Many To Many architecture that fits translation would be as follows:
o There are encoder and decoder parts in this architecture. The encoder encodes the input sequence into one matrix and feeds it to the decoder to generate the output. The encoder and decoder have different weight matrices.
• Summary of RNN types:
• There is another architecture, the attention architecture, which we will talk about later.
4.1.7 Language model and sequence generation
o Let's say we are solving a speech recognition problem and someone says a sentence that can be interpreted in two ways:
o "Pair" and "pear" sound exactly the same, so how would a speech recognition application choose between the two?
o That's where the language model comes in. It gives a probability for the two sentences and the
application decides the best based on this probability.
• The job of a language model is to give a probability of any given sequence of words.
4.1.8 How to build language models with RNNs?
o Get a training set: a large corpus of text in the target language.
o Then tokenize this training set by building the vocabulary and one-hot encoding each word.
o Add an end-of-sentence token <EOS> to the vocabulary and include it with each converted sentence. Also, use the token <UNK> for the unknown words.
o In training time we will use this:
i. To predict the chance of the next word, we feed the sentence to the RNN, get the final ŷ<t> probability vector, and sort it by maximum probability.
▪ This is simply feeding the sentence into the RNN and multiplying the probabilities (outputs).
• After a sequence model is trained as a language model, to check what the model has learned you can use it to sample novel sequences.
• Let's see the steps of how we can sample a novel sequence from a trained language model:
ii. We first pass a<0> = zeros vector, and x<1> = zeros vector.
iii. Then we choose a prediction randomly from the distribution obtained by ŷ<1>. For example, it could be "The".
▪ This is the line where you get a random beginning of the sentence each time you sample a novel sequence.
iv. We pass the last predicted word together with the calculated a<1>.
v. We keep repeating steps 3 and 4 for a fixed length or until we get the <EOS> token.
vi. You can reject any <UNK> token if you mind finding it in your output. A minimal sketch of this loop follows below.
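A minimal, illustrative sketch of this sampling loop in Python/NumPy (rnn_step is a hypothetical stand-in for whatever trained RNN cell you have; it maps the current input and activation to the output distribution and the next activation):

import numpy as np

def sample_sequence(rnn_step, vocab, a_size, eos_id, max_len=50):
    a = np.zeros(a_size)                 # a<0> = zeros vector
    x = np.zeros(len(vocab))             # x<1> = zeros vector
    words = []
    for _ in range(max_len):             # fixed length, or stop at <EOS>
        y, a = rnn_step(x, a)            # y is the y-hat probability vector
        idx = np.random.choice(len(vocab), p=y)   # sample from the distribution
        if idx == eos_id:
            break
        words.append(vocab[idx])
        x = np.zeros(len(vocab))         # feed the sampled word back in
        x[idx] = 1.0
    return words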
• So far we have built a word-level language model. It's also possible to implement a character-level language model.
• In the character-level language model, the vocabulary will contain [a-zA-Z0-9], punctuation, special characters and possibly special tokens.
• Character-level language model has some pros and cons compared to the word-level language model
o Pros:
o Cons:
a. The main disadvantage is that you end up with much longer sequences.
b. Character-level language models are not as good as word-level language models at capturing long-range dependencies between how the earlier parts of the sentence affect the later parts.
• The trend Andrew has seen in NLP is that, for the most part, a word-level language model is still used, but as computers get faster there are more and more applications where people, at least in some special cases, are starting to look at character-level models. They are also used in specialized applications where you might need to deal with unknown or out-of-vocabulary words a lot, or where you have a more specialized vocabulary.
4.1.9 Vanishing gradients with RNNs
• An RNN that processes a sequence with 10,000 time steps effectively has 10,000 layers, which is very hard to optimize.
• Let's take an example. Suppose we are working on a language modeling problem and there are two sequences that the model tries to learn:
• What the model needs to learn here is that "was" goes with "cat" and "were" goes with "cats". The naive RNN is not very good at capturing very long-term dependencies like this.
• As we have discussed for deep neural networks, deeper networks run into the vanishing gradient problem; that also happens with RNNs over long sequences.
o To compute the gradient at the word "was", we need to compute gradients for everything that came before it. Multiplying small fractions tends to make the gradient vanish, while multiplying large numbers tends to make it explode.
• In the problem we described, this means it is hard for the network to remember "cat" all the way to "was", so the network won't identify singular/plural words and give the right grammatical form of the verb, was/were.
• In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick
parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn
them. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
• Vanishing gradients problem tends to be the bigger problem with RNNs than the exploding gradients problem.
We will discuss how to solve it in next sections.
• Exploding gradients can be easily spotted when your weight values become NaN. One way to solve the exploding gradient problem is to apply gradient clipping: if your gradient is more than some threshold, re-scale your gradient vector so that it is not too big, i.e. gradients are clipped according to some maximum value.
• Extra solutions for the exploding gradient problem:
▪ Truncated backpropagation.
▪ Gradient clipping (sketched below).
• Extra solutions for the vanishing gradient problem:
▪ Careful weight initialization, like He initialization.
▪ Gated units such as GRUs and LSTMs (the most popular solution).
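A short sketch of gradient clipping in Keras (the threshold of 1.0 is an illustrative choice):

import tensorflow as tf

# Clip at the optimizer level: gradients whose norm exceeds 1.0 are re-scaled
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# Or clip manually inside a custom training step:
# grads = tape.gradient(loss, model.trainable_variables)
# grads = [tf.clip_by_norm(g, 1.0) for g in grads]
# optimizer.apply_gradients(zip(grads, model.trainable_variables))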
4.2 Gated Recurrent Unit (GRU)
• The basic RNN unit can be visualized to be like this:
• Each layer in a GRU has a new variable C, the memory cell, which can decide whether to memorize something or not.
▪ To understand GRUs, imagine that the update gate is either 0 or 1 most of the time.
o So we update the memory cell based on the update gate and the previous cell value, as in the equations below.
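In the standard simplified form (c̃ is the candidate cell value, Γu the update gate, σ the sigmoid):

c̃<t> = tanh(Wc [c<t-1>, x<t>] + bc)
Γu = σ(Wu [c<t-1>, x<t>] + bu)
c<t> = Γu * c̃<t> + (1 − Γu) * c<t-1>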
• Let's take the cat sentence example and use it to understand these equations:
o We will suppose that U is 0 or 1 and is a bit that tells us whether a singular word needs to be memorized.
Word Update gate(U) Cell memory (C)
The 0 val
cat 1 new_val
which 0 new_val
already 0 new_val
... 0 new_val
full .. ..
• Because the update gate U is usually a very small number like 0.00001 when memory must be kept, GRUs don't suffer from the vanishing gradient problem.
• Shapes: a<t>, c<t> and c̃<t> all have the same shape (the number of hidden units).
• What has been described so far is the simplified GRU unit. Let's now describe the full one:
o The full GRU contains a new gate that is used when calculating the candidate c̃. The gate tells you how relevant c<t-1> is to c<t>.
o Equations:
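In the standard form (with Γr the new relevance gate):

c̃<t> = tanh(Wc [Γr * c<t-1>, x<t>] + bc)
Γu = σ(Wu [c<t-1>, x<t>] + bu)
Γr = σ(Wr [c<t-1>, x<t>] + br)
c<t> = Γu * c̃<t> + (1 − Γu) * c<t-1>
a<t> = c<t>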
• So why do we use these architectures? Why don't we change them? How do we know they will work? Why not add another gate? Why not use the simpler GRU instead of the full GRU? Well, researchers have experimented over the years with many different versions of these architectures while addressing the vanishing gradient problem, and they have found that full GRUs are one of the best RNN architectures for many different problems. You can make your own design, but keep in mind that GRUs and LSTMs are the standards.
4.3 LSTM Networks
Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-
term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by
many people in following work. They work tremendously well on a large variety of problems, and are now widely used.
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of
time is practically their default behavior, not something they struggle to learn!
All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this
repeating module will have a very simple structure, such as a single tanh layer.
LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single
neural network layer, there are four, interacting in a very special way.
Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step by step later. For now, let’s
just try to get comfortable with the notation we’ll be using.
In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink
circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers.
Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to
different locations.
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear
interactions. It’s very easy for information to just flow along it unchanged.
The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called
gates.
Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a
pointwise multiplication operation.
The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let
through. A value of zero means “let nothing through,” while a value of one means “let everything through!”
An LSTM has three of these gates, to protect and control the cell state.
4.3.1 Step-by-Step LSTM Walk Through
The first step in our LSTM is to decide what information we're going to throw away from the cell state. This decision is made by a sigmoid layer called the "forget gate layer". It looks at h<t−1> and x<t>, and outputs a number between 0 and 1 for each number in the cell state C<t−1>.
Let's go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.
The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid
layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate
values, C~t, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.
In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the
old one we’re forgetting.
It’s now time to update the old cell state, Ct−1, into the new cell state Ct. The previous steps already decided what to do,
we just need to actually do it.
We multiply the old state by ft, forgetting the things we decided to forget earlier. Then we add it ∗ C̃t: the new candidate values, scaled by how much we decided to update each state value.
In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and
add the new information, as we decided in the previous steps.
Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we're going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in
case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know
what form a verb should be conjugated into if that’s what follows next.
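Putting the walkthrough together, the standard LSTM equations (matching the symbols ft, it, C̃t and Ct used above, with h the hidden state) are:

ft = σ(Wf [h<t-1>, x<t>] + bf)       (forget gate)
it = σ(Wi [h<t-1>, x<t>] + bi)       (input gate)
C̃t = tanh(WC [h<t-1>, x<t>] + bC)    (candidate values)
Ct = ft * Ct-1 + it * C̃t             (new cell state)
ot = σ(Wo [h<t-1>, x<t>] + bo)       (output gate)
ht = ot * tanh(Ct)                   (output / new hidden state)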
4.3.2 Variants on Long Short Term Memory
One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding "peephole connections." This means that
we let the gate layers look at the cell state.
The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.
Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we
should add new information to, we make those decisions together. We only forget when we’re going to input something
in its place. We only input new values to the state when we forget something older.
A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It
combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes
some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly
popular.
4.3.3 LSTM Example
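A minimal, illustrative Keras sketch of an LSTM put to work as a sequence classifier (the vocabulary size, layer dimensions and binary task are assumptions made for the example):

import tensorflow as tf

vocab_size, embed_dim, hidden_units = 10000, 64, 128   # illustrative sizes

model = tf.keras.Sequential([
    # map word indices to dense vectors
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    # the LSTM reads the sequence and returns its final hidden state
    tf.keras.layers.LSTM(hidden_units),
    # a single sigmoid unit for a binary label (e.g. sentiment)
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])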
4.4 Course summary
Here is the course summary as given on the course link:
This course will teach you how to build models for natural language, audio, and other sequence data. Thanks to deep
learning, sequence algorithms are working far better than just two years ago, and this is enabling numerous exciting
applications in speech recognition, music synthesis, chatbots, machine translation, natural language understanding, and
many others.
You will:
• Understand how to build and train Recurrent Neural Networks (RNNs), and commonly-used variants such as
GRUs and LSTMs.
• Be able to apply sequence models to natural language problems, including text synthesis.
• Be able to apply sequence models to audio applications, including speech recognition and music synthesis.
This is the fifth and final course of the Deep Learning Specialization.