DEEP LEARNING
Convolutional Neural Network (CNN)
In deep learning, a convolutional neural network (CNN/ConvNet) is a class of deep neural networks, most commonly applied to analyze visual imagery.
Convolutional networks use a process called convolution, which combines two functions to show how one modifies the shape of the other.
The role of the ConvNet is to reduce the images into a form that is easier to process, without losing features that are critical for getting a good prediction.
An RGB image is nothing but a matrix of pixel values with three planes, whereas a grayscale image is the same but has a single plane.
Each neuron in a CNN layer receives inputs from a local region of the previous layer, known as its receptive field.
The number of parameters in a CNN layer depends on the size of the receptive fields (filter kernels) and the number of filters.
The receptive fields move over the input, calculating dot products and creating a convolved feature map as the output.
Artificial neurons calculate the weighted sum of multiple inputs and output an activation value.
The first layer usually extracts basic features such as horizontal or diagonal edges.
This output is passed on to the next layer, which detects more complex features such as corners or combinational edges.
ConvNets are feed-forward networks that process the input data in a single pass.
Based on the activation map of the final convolution layer, the classification layer outputs a set of confidence scores (values between 0 and 1) that specify how likely the image is to belong to a class.
Gradient descent is commonly used as the optimization algorithm during training to adjust the weights of all the network's layers.
The Pooling layer is responsible for reducing the spatial size of the Convolved Feature.
This decreases the computational power required to process the data by reducing the dimensions.
There are several types of pooling, including:
1. average pooling,
2. max pooling,
3. Lp pooling,
4. mixed pooling, and so on.
Max Pooling returns the maximum pixel value from the portion of the image covered by the kernel.
Max Pooling also acts as a noise suppressant: it discards the noisy activations altogether, performing de-noising along with dimensionality reduction.
Average Pooling returns the average of all the values from the portion of the image covered by the kernel.
Average Pooling performs dimensionality reduction as a noise-suppressing mechanism.
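As a minimal NumPy sketch (the helper name pool2d and the example values are ours, not from the notes), here is how 2 × 2 max pooling and average pooling with stride 2 act on a small feature map:

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2-D feature map."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 1., 2.],
                 [7., 2., 9., 8.],
                 [4., 1., 3., 5.]])
print(pool2d(fmap, mode="max"))      # [[6. 4.] [7. 9.]]
print(pool2d(fmap, mode="average"))  # [[3.75 2.25] [3.5  6.25]]
```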
Limitations of CNN:
1. High computational requirements.
2. Needs a large amount of data.
3. Difficulty classifying images when objects appear at different positions.
4. Need to develop effective and scalable parallel training algorithms.
5. At testing time, these deep models are highly memory-demanding and time-consuming, which makes them difficult to deploy.
6. It requires considerable skill and experience to select suitable hyperparameters such as the learning rate, kernel sizes of convolutional filters, the number of layers, etc.
Components of Convolutional Neural Networks
Convolutional Layer
It tries to learn the feature representation of the inputs.
It is composed of several kernels (filter matrices) that are used to compute the different feature maps.
A filter/kernel, a small (f × f) matrix, is applied to the input data (or image) to obtain the convolved feature.
Filters help us exploit the spatial locality of a particular image by enforcing a local connectivity pattern between neurons.
These filters enable the model to capture intricate details and spatial relationships within the image.
Convolutional layers are responsible for feature extraction.
This convolved feature is then passed on to the next layer after adding a bias and applying a suitable activation function.
We have seen that convolving an input of dimension 6 × 6 with a 3 × 3 filter results in a 4 × 4 output.
We can generalize this and say that if the input is n × n and the filter size is f × f, then the output size will be (n - f + 1) × (n - f + 1):
Input: n × n
Filter size: f × f
Output: (n - f + 1) × (n - f + 1)
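A minimal NumPy sketch (the helper name conv2d_valid is ours) that convolves a random 6 × 6 input with a 3 × 3 filter and confirms the (n - f + 1) × (n - f + 1) output size:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' convolution: slide the kernel over the image, taking dot products."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

image = np.random.rand(6, 6)
kernel = np.random.rand(3, 3)
print(conv2d_valid(image, kernel).shape)  # (4, 4): n - f + 1 = 6 - 3 + 1 = 4
```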
There are primarily two disadvantages here:
1. Every time we apply a convolutional operation, the size of the image shrinks.
2. Pixels at the corners of the image are used in far fewer convolution windows than the central pixels, so information near the borders is under-represented and can be lost.
To overcome these issues, we can pad the image with an additional border, i.e., we add p pixels (for example, one pixel) all around the edges.
Input: n × n
Padding: p
Filter size: f × f
Output: (n + 2p - f + 1) × (n + 2p - f + 1)
There are two common choices for padding:
1. Valid: no padding. With valid padding, the output will be (n - f + 1) × (n - f + 1).
2. Same: padding is applied so that the output size is the same as the input size, i.e.,
   n + 2p - f + 1 = n, so p = (f - 1)/2
Stride is a parameter that dictates the movement of the kernel, or filter, across the input data, such as an image.
When performing a convolution operation, the stride determines how many units the filter shifts at each step.
The shift can be horizontal, vertical, or both, depending on the stride's configuration.
The dimensions for stride s will be:
Input: n × n
Padding: p
Stride: s
Filter size: f × f
Output: [(n + 2p - f)/s + 1] × [(n + 2p - f)/s + 1]
A stride greater than one reduces the spatial size of the output, which is a particularly useful feature.
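The output-size rule is easy to check with a small helper (a sketch; the floor division mirrors [(n + 2p - f)/s + 1]):

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))            # 4  (valid padding, stride 1)
print(conv_output_size(6, 3, p=1))       # 6  ('same' padding: p = (f - 1) / 2)
print(conv_output_size(7, 3, p=0, s=2))  # 3  (stride 2 roughly halves the spatial size)
```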
Generalized dimensions can be given as:
Input: n × n × nc
Filter: f × f × nc
Padding: p
Stride: s
Output: [(n + 2p - f)/s + 1] × [(n + 2p - f)/s + 1] × nc'
Here nc is the number of channels in the input and filter, while nc' is the number of filters.
[Figures: convolution over volume, and convolution over volume with multiple filters.]
For any general case, the output dimension can be calculated using the same equation:
Output: [(n + 2p - f)/s + 1] × [(n + 2p - f)/s + 1] × nc'
Pooling Layer
The pooling layer is placed between the convolutional layers.
It is used to achieve shift invariance by decreasing the resolution of the feature maps.
It performs downsampling operations (e.g., max pooling) to retain the most salient information while discarding unnecessary details.
This helps in achieving translation invariance and reducing computational complexity.
Reducing the number of connections between convolutional layers lowers the computational burden on the processing units.
If the input of the pooling layer is nh × nw × nc, then the output will be [(nh - f)/s + 1] × [(nw - f)/s + 1] × nc (without padding).
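The same helper idea applies to the pooling formula above; a small sketch (names are ours):

```python
def pool_output_shape(nh, nw, nc, f, s):
    """Pooling output: [(nh - f)/s + 1] x [(nw - f)/s + 1] x nc (channel count unchanged)."""
    return ((nh - f) // s + 1, (nw - f) // s + 1, nc)

print(pool_output_shape(28, 28, 6, f=2, s=2))  # (14, 14, 6)
```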
Fully-Connected Layer
There may be multiple fully connected layers after a number of convolutional and pooling layers.
Every single neuron of the current layer is connected with all the neurons in the previous layer.
The output of the last pooling layer is flattened before being fed to the fully connected layers.
The last layer of a CNN is an output layer that makes the final predictions.
The Softmax function is commonly used for multi-class classification.
There are primarily two major advantages of using convolutional layers over using just fully connected layers:
1. Parameter sharing
2. Sparsity of connections
The fully connected layer can only work with 1D data.
Once the data is converted into a 1D array, it is sent to the fully connected layer.
All of these individual values are treated as separate features that represent the image.
The fully connected layer performs two operations on the incoming data:
1. a linear transformation, and
2. a non-linear transformation.
The equation for the linear transformation is: Z = W^T · X + b
The linear transformation alone cannot capture complex relationships.
Thus, we introduce an additional component in the network which adds non-linearity to the data. This new component in the architecture is called the activation function.
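A minimal NumPy sketch of these two operations, using sigmoid as the non-linearity; the layer sizes are arbitrary:

```python
import numpy as np

def fully_connected(X, W, b):
    """Linear transformation Z = W^T . X + b, followed by a sigmoid non-linearity."""
    Z = W.T @ X + b
    return 1.0 / (1.0 + np.exp(-Z))

X = np.random.rand(16, 1)   # flattened 1-D input (16 features)
W = np.random.rand(16, 4)   # weights: 16 inputs -> 4 neurons
b = np.random.rand(4, 1)    # one bias per neuron
print(fully_connected(X, W, b).shape)  # (4, 1)
```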
Activation Function
In a neural network, an activation function is applied to the weighted sum of inputs and biases to decide whether (and how strongly) a neuron is activated.
For a CNN's hidden layers, ReLU is the preferred activation function because it is simple to differentiate and faster to compute than other activation functions such as tanh and sigmoid.
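For comparison, ReLU and its derivative each reduce to a single element-wise comparison, which is part of why it is cheaper than tanh or sigmoid (a minimal sketch):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)          # max(0, x), applied element-wise

def relu_derivative(x):
    return (x > 0).astype(float)     # 1 where x > 0, else 0

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))             # [0.  0.  0.  1.5]
print(relu_derivative(x))  # [0. 0. 0. 1.]
```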
Forward Propagation
Step 1: Load the input images in a variable (say X).
Step 2: Define (randomly initialize) a filter matrix. The images are convolved with the filter:
Z1 = X * f
Step 3: Apply the Sigmoid activation function on the result:
A = sigmoid(Z1)
Step 4: Define (randomly initialize) a weight and bias matrix. Apply a linear transformation on the values:
Z2 = W^T · A + b
Step 5: Apply the Sigmoid function on the data. This will be the final output:
O = sigmoid(Z2)
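Putting the five steps together, a minimal NumPy sketch of this forward pass (all shapes and helper names are illustrative; a single 6 × 6 image, one 3 × 3 filter, and one output neuron are assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_valid(image, kernel):
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

X = np.random.rand(6, 6)           # Step 1: input image
f = np.random.rand(3, 3)           # Step 2: randomly initialised filter

Z1 = conv2d_valid(X, f)            # Step 2: Z1 = X * f  -> shape (4, 4)
A1 = sigmoid(Z1)                   # Step 3: A = sigmoid(Z1)

A1_flat = A1.reshape(-1, 1)        # flatten to 1-D for the fully connected layer
W = np.random.rand(A1_flat.shape[0], 1)
b = np.random.rand(1, 1)

Z2 = W.T @ A1_flat + b             # Step 4: Z2 = W^T . A + b
O = sigmoid(Z2)                    # Step 5: final output
print(O.shape)                     # (1, 1)
```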
Backward Propagation
Fully Connected Layer
1. Change in error with respect to output
Suppose the actual values for the data are denoted as y' and the predicted output is represented as O. Then the error is given by:
E = (y' - O)^2 / 2
Differentiating the error with respect to the output gives:
∂E/∂O = -(y' - O)
2. Change in output with respect to Z2 (linear transformation output)
To find the derivative of the output O with respect to Z2, we must first define O in terms of Z2.
Thus, ∂O/∂Z2 is effectively the derivative of the Sigmoid. The equation for the Sigmoid function is:
f(x) = 1 / (1 + e^-x)
Its derivative comes out to be:
f'(x) = (1 + e^-x)^-1 [1 - (1 + e^-x)^-1]
f'(x) = sigmoid(x)(1 - sigmoid(x))
∂O/∂Z2 = (O)(1 - O)
3. Change in Z2 with respect to W (weights)
The value Z2 is the result of the linear transformation process. Here is the equation of Z2 in terms of the weights:
Z2 = W^T · A1 + b
On differentiating Z2 with respect to W, we get the value A1 itself:
∂Z2/∂W = A1
Now that we have the individual derivatives, we can use the chain rule to find the change in error with respect to the weights:
∂E/∂W = ∂E/∂O · ∂O/∂Z2 · ∂Z2/∂W
∂E/∂W = -(y' - O) · (O)(1 - O) · A1
The shape of ∂E/∂W is the same as that of the weight matrix W. We can update the values in the weight matrix using the following equation:
W_new = W_old - lr · ∂E/∂W
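Continuing the forward-pass sketch above, a sketch of these derivatives and the weight update (y_true and lr are assumed names for the target value and the learning rate):

```python
# Backward pass through the fully connected layer (continues the forward-pass sketch).
y_true = np.array([[1.0]])

dE_dO  = -(y_true - O)                # ∂E/∂O  = -(y' - O)
dO_dZ2 = O * (1 - O)                  # ∂O/∂Z2 = sigmoid derivative
dZ2_dW = A1_flat                      # ∂Z2/∂W = A1

dE_dW = dZ2_dW @ (dE_dO * dO_dZ2).T   # chain rule; shape matches W
lr = 0.01
W = W - lr * dE_dW                    # W_new = W_old - lr * ∂E/∂W
```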
Convolution Layer
1. Change in Z2 with respect to A1
To find the value of ∂Z2/∂A1, we need the equation for Z2 in terms of A1:
Z2 = W^T · A1 + b
On differentiating the above equation with respect to A1, we get W^T as the result:
∂Z2/∂A1 = W^T
2. Change in A1 with respect to Z1
The next value that we need to determine is ∂A1/∂Z1. Have a look at the equation of A1:
A1 = sigmoid(Z1)
This is simply the Sigmoid function, whose derivative gives:
∂A1/∂Z1 = (A1)(1 - A1)
3. Change in Z1 with respect to filter f
Finally, we need the value of ∂Z1/∂f. Here is the equation for Z1:
Z1 = X * f
Differentiating Z1 with respect to f simply gives us X:
∂Z1/∂f = X
Now that we have all the required values, we can find the overall change in error with respect to the filter:
∂E/∂f = ∂E/∂O · ∂O/∂Z2 · ∂Z2/∂A1 · ∂A1/∂Z1 * ∂Z1/∂f
Note that the first four terms, (∂E/∂O · ∂O/∂Z2 · ∂Z2/∂A1 · ∂A1/∂Z1), are multiplied together, while the last term ∂Z1/∂f is combined with them through a convolution rather than a multiplication.
The main reason is that during forward propagation, we perform a convolution operation between the images and the filter, and this is repeated in the backward propagation process.
Once we have the value of ∂E/∂f, we use it to update the original filter value:
f = f - lr · (∂E/∂f)
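Continuing the same sketch, the first four chain-rule terms are multiplied and the result is then convolved with the input X to obtain ∂E/∂f (this follows the single-filter example above, not the fully general case):

```python
# Backward pass through the convolution layer (continues the sketches above).
dZ2_dA1 = W                               # ∂Z2/∂A1 = W (transposed inside the chain rule)
dE_dA1  = (dZ2_dA1 @ (dE_dO * dO_dZ2)).reshape(A1.shape)
dA1_dZ1 = A1 * (1 - A1)                   # sigmoid derivative
delta1  = dE_dA1 * dA1_dZ1                # ∂E/∂Z1, shape (4, 4)

# ∂E/∂f: convolve the input X with delta1 (the same operation as the forward pass).
dE_df = conv2d_valid(X, delta1)           # shape (3, 3), same as the filter
f = f - lr * dE_df                        # f = f - lr * ∂E/∂f
```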
Improvements in CNNs:
1. Dilated Convolution
2. Transposed Convolution
3. Tiled Convolution
4. Network in Network
Modern Convolutional Neural Networks
AlexNet
AlexNet, created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, marked a turning point in deep learning.
It was introduced in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012.
The AlexNet architecture was designed to be used with large-scale image datasets, and it achieved state-of-the-art results at the time of its publication.
AlexNet is composed of:
5 convolutional layers with a combination of max-pooling layers,
3 fully connected layers, and
2 dropout layers.
The activation function used in all layers is ReLU.
The activation function used in the output layer is Softmax.
The total number of parameters in this architecture is around 60 million.
The first convolutional layer filters an input image of size 224 × 224 × 3 with 96 kernels of dimensions 11 × 11 × 3 at a stride of 4.
The second convolutional layer receives the response-normalized and pooled output of the first layer and applies 256 kernels of dimensions 5 × 5 × 48.
The next three layers are connected to each other without any normalization or pooling layer in between.
The third convolutional layer contains 384 kernels of dimensions 3 × 3 × 256.
In the fourth convolutional layer, 384 kernels of dimensions 3 × 3 × 192 are used.
In the fifth convolutional layer, 256 kernels of dimensions 3 × 3 × 192 are used.
Each fully connected layer contains 4096 neurons.
The normalization layers are added to help generalization.
To reduce overfitting in the fully connected layers, AlexNet uses dropout as a regularization method.
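If torchvision is available, the roughly 60-million-parameter figure can be checked directly (a sketch; the exact count depends on the library version):

```python
import torchvision.models as models

# Build a randomly initialised AlexNet (no pretrained weights needed for counting).
alexnet = models.alexnet()
n_params = sum(p.numel() for p in alexnet.parameters())
print(f"AlexNet parameters: {n_params / 1e6:.1f}M")  # roughly 61M
```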
VGG
The Visual Geometry Group (VGG) at Oxford University proposed the VGGNet architecture.
The localization and classification accuracy of this model increases with an increase in the depth of the model.
It uses small convolutional filters of dimensions 3 × 3 with a stride of one in all layers.
It includes max-pooling layers of dimension 2 × 2 with a stride of two.
It receives an RGB image of dimension 224 × 224 × 3 as input.
As pre-processing, the mean RGB value computed on the training dataset is subtracted from each pixel.
A pre-processed image is passed through a stack of convolutional layers interleaved with five max-pooling layers.
The first two fully connected layers have 4096 channels each, and the third fully connected layer has 1000 channels.
The last layer performs a 1000-way classification.
At the last stage, a softmax layer determines the multi-class probabilities.
NiN
Network in Network (NiN) is a deep convolutional neural network introduced by Min Lin, Qiang Chen, and Shuicheng Yan [2].
The classic models use linear convolutional layers followed by an activation function to scan the input, while NiN uses multilayer perceptron convolutional layers, in which each layer includes a micro-network.
The classic models apply fully connected layers at the end of the model to classify objects, while NiN uses a global average pooling layer before feeding the output to the softmax layer.
The advantages of the global average pooling layer are:
   it is more native to the convolution structure, enforcing correspondences between feature maps and categories;
   there is no parameter to optimize in the global average pooling layer, so it helps to avoid overfitting.
The original NiN network is composed of four NiN blocks. Each block includes three convolutional layers:
   The first layer uses a filter window whose shape belongs to {11 × 11, 5 × 5, 3 × 3}.
   The last two layers are 1 × 1 convolutional layers.
Each NiN block is followed by a max-pooling layer with a pooling size of 3 × 3 and a stride of 2, except the last block, which is followed by a global average pooling layer.
GoogLeNet
GoogLeNet, the winner of ILSVRC 2014, introduced the Inception module, which employs parallel convolutional operations with different kernel sizes.
GoogLeNet efficiently captures features at multiple scales, promoting better generalization.
It is a deeper and broader architecture.
It consists of 22 layers with a 224 × 224 receptive field and very small convolutions of 1 × 1, 3 × 3, and 5 × 5.
GoogLeNet uses 1 × 1, 3 × 3, 5 × 5 convolutions and 3 × 3 max-pooling layers together to extract different kinds of features.
It has 9 linearly stacked Inception modules.
The Inception module is a combination of 1 × 1, 3 × 3, 5 × 5 convolutional layers with their outputs concatenated into a single output volume.
It has two significant modifications:
   A 1 × 1 convolutional layer is applied before the larger convolutions.
   A parallel max-pooling path is used.
After the last Inception module, it uses a global average pooling layer instead of fully connected layers.
As the number of layers and the number of neurons per layer increase, networks become more prone to overfitting.
GoogLeNet uses sparsely connected network architectures, especially inside convolutional layers.
GoogLeNet provides a good solution to overfitting problems and reduces computational and storage costs.
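A minimal PyTorch sketch of an Inception-style block with parallel 1 × 1, 3 × 3, and 5 × 5 convolutions plus a 3 × 3 max-pooling path, concatenated along the channel dimension (the channel counts are illustrative, not GoogLeNet's exact configuration):

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        # Parallel branches; 1x1 convolutions reduce channels before the larger filters.
        self.branch1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, 16, kernel_size=1),
                                     nn.Conv2d(16, 24, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, 16, kernel_size=1),
                                     nn.Conv2d(16, 24, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                                         nn.Conv2d(in_ch, 24, kernel_size=1))

    def forward(self, x):
        # All branches keep the spatial size, so outputs can be concatenated channel-wise.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

x = torch.randn(1, 64, 28, 28)
print(InceptionBlock(64)(x).shape)  # torch.Size([1, 88, 28, 28])  (16 + 24 + 24 + 24)
```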
ResNet
Residual Networks, or ResNets, proposed by Kaiming He et al., addressed the challenge of training very deep networks.
ResNets introduce shortcut connections that bypass one or more layers, allowing the gradient to flow more easily during backpropagation.
The deep residual learning framework is motivated by the degradation problem that arises as more and more layers are added to a network.
It becomes difficult to train a deep network due to the vanishing gradient problem.
During backpropagation, gradients of the loss are calculated with respect to the weights.
The gradients tend to become smaller while moving backward through the network.
Thus, the performance of the network saturates or degrades.
This indicates that the lower layers learn more slowly than the upper layers of the network.
Another problem of deeper networks is optimizing over a very large parameter space.
ResNet-50 is a convolutional network that contains residual blocks with 50 convolutional layers. It includes about 25.6 million parameters.
In ResNet-101, the number of parameters increases to 44.5 million.
In ResNet-152, the number of parameters is 60.2 million.
The benefit of training a residual network is that even if we train deeper networks, the training error does not increase.
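A minimal PyTorch sketch of a basic residual block with an identity shortcut (the bottleneck blocks actually used in ResNet-50/101/152 differ in detail):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x), with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # shortcut connection: gradients can bypass the convs

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```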
DenseNet
G. Huang et al. proposed DenseNet, a densely connected network.
It consists of two modules, namely Dense Blocks and Transition Layers.
In this network, all layers are connected directly with each other in a feed-forward manner.
A DenseNet of N layers contains N(N+1)/2 direct connections.
Each layer receives the feature maps of all previous layers as inputs.
A Dense Block is composed of Batch Normalization, ReLU activation, and 3 × 3 Convolution.
Transition layers lie between two Dense Blocks.
These are made up of batch normalization, 1 × 1 convolution, and average pooling.
So that all feature maps can be concatenated within a dense block, the layers of the block produce feature maps of the same size.
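A minimal PyTorch sketch of a dense block in which each layer receives the concatenation of all preceding feature maps (the growth rate and number of layers are illustrative):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer takes the concatenation of all previous feature maps as input."""
    def __init__(self, in_ch, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth_rate, growth_rate, kernel_size=3, padding=1)))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # concatenate all previous maps
            features.append(out)
        return torch.cat(features, dim=1)

x = torch.randn(1, 16, 32, 32)
print(DenseBlock(16)(x).shape)  # torch.Size([1, 64, 32, 32])  (16 + 4 * 12 channels)
```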