Deep Learning – BCSE332L
Convolutional Neural Networks
Dr. R. Bhargavi
Professor
SCOPE
VIT University
Computer Vision - Applications
• Image Classification (for example, Malignant/Benign)
• Object Detection
• Style Transfer
Working with Images - Fully connected DNN
• A fully connected DNN/MLP takes a flat feature vector (tabular-style data) as input, so an image must be flattened first.
• It does not work well with images because it relies heavily on specific pixel positions. Hence any positional variance can result in misclassification (example shown in the figure below).
• Features are considered independent of each other.
• A traditional fully connected DNN has a huge number of learnable parameters.
• How many parameters are needed for images of size 1024 x 1024 x 3 with 2 hidden layers of size 1000?
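The parameter question for the 1024 x 1024 x 3 image can be worked out with a short sketch. It assumes each dense layer has one weight per input-output pair plus one bias per unit, and it counts only the two hidden layers because no output layer size is given; the helper name `dense_params` is hypothetical.

```python
def dense_params(layer_sizes):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total

n_inputs = 1024 * 1024 * 3  # flattened image: 3,145,728 values

# Assumption: only the two hidden layers of size 1000 are counted.
hidden_only = dense_params([n_inputs, 1000, 1000])
print(f"{hidden_only:,}")  # 3,146,730,000
```

Almost all of the roughly 3.1 billion parameters sit in the first layer, which is what motivates the locally connected, weight-sharing design of CNNs.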
Working with Images –DNN (cont…)
[Figure: an input image is flattened into a vector and fed to a fully connected DNN.]
Working with Images - CNN
• Automatic feature extraction is done in a CNN.
• This allows us to feed images directly, instead of extracting features manually.
• Convolutional layers are responsible for feature extraction.
• Convolutional layers take locality into account.
• Because the conv layers learn the representations, the name representation learning is also used for DNNs.
Convolutional Neural Network - Architecture
[Figure: CNN architecture. Convolutional layers extract abstract features; FC layers are used for classification.]
CNN – Architecture (cont…)
[Figure: CNN built from Convolution, Pooling, and Fully Connected layers, with the trainable layers marked.]
Convolutional Layer
• The convolutional layer is the core building block of a convolutional network.
• A conv layer's parameters consist of a set of learnable filters.
• Local connectivity: Each neuron is connected only to a small region in the input.
[Figure: an input patch (x1, x2, x3, x4) is multiplied element-wise with a filter (w1, w2, w3, w4) and summed to give z. The patch is the receptive field of the neuron in the feature map.]
$$z = b + \sum_{i} w_i x_i$$
Convolutional Layer (cont…)
• Parameter sharing / weight sharing: Within one conv layer, the same filter is used across the entire image.
• Rationale: If detecting a horizontal edge is important at some location in the image, it should intuitively be useful at other locations as well, due to the translationally invariant structure of images. There is therefore no need to relearn to detect a horizontal edge at every one of the distinct locations in the conv layer output volume.
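A rough way to see what weight sharing saves is to compare a shared filter with a hypothetical layer that learns a separate filter at every output location. The sketch below assumes no padding, stride 1, and one bias per filter; the function names are illustrative and not from any library.

```python
def shared_params(k_h, k_w, in_channels=1):
    """One filter reused at every location: weights plus a single bias."""
    return k_h * k_w * in_channels + 1

def unshared_params(n_h, n_w, k_h, k_w, in_channels=1):
    """Hypothetical layer with its own filter and bias at each output location."""
    locations = (n_h - k_h + 1) * (n_w - k_w + 1)
    return locations * (k_h * k_w * in_channels + 1)

print(shared_params(3, 3))            # 10
print(unshared_params(10, 10, 3, 3))  # 640 (64 locations x 10 parameters)
```

For a 10 x 10 grayscale input and a 3 x 3 filter, sharing reduces the count by a factor equal to the number of output locations.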
Convolution Operation
[Figure: convolution of an input with a kernel, producing the feature map.]
With no padding and stride 1, the output size is given by (nh – kh + 1) x (nw – kw + 1), where (nh x nw) is the size (height and width) of the input tensor and (kh x kw) is the size of the kernel.
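The convolution operation and its output size can be sketched in plain Python. This is a minimal illustration assuming the usual deep learning convention (cross-correlation, no kernel flip), no padding, and stride 1; the example `image` and `kernel` values are made up.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    n_h, n_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = n_h - k_h + 1, n_w - k_w + 1  # (nh - kh + 1) x (nw - kw + 1)
    feature_map = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            feature_map[i][j] = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(k_h)
                for b in range(k_w)
            )
    return feature_map

image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
]                       # 3 x 4 input
kernel = [
    [1, 0],
    [0, 1],
]                       # 2 x 2 kernel
feature_map = conv2d_valid(image, kernel)
print(feature_map)      # [[2, 4, 6], [0, 2, 4]]
```

The 3 x 4 input and 2 x 2 kernel give a 2 x 3 feature map, as the size formula predicts.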
Convolutions with Multiple Channels
1 input channel and multiple output channels
• Use multiple kernels.
• Each kernel results in one output channel.
• The same convolution operation is used for each of the output channels.
• Each kernel learns different parameters corresponding to different filters.
Convolutions with Multiple Channels (cont…)
Multiple input channels (3 channels) and a single output channel
• The kernel has one slice per input channel; the per-channel results are summed to give the single output channel.
Convolutions with Multiple Channels (cont…)
Multiple input channels (3 channels) and multiple output channels
• Input: 3 channels
• Kernel1 to Kernel5: 3 channels each
• Output: 5 channels
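The case of 3 input channels and 5 kernels can be sketched as follows. This is an illustration only: it assumes no padding, stride 1, and no bias term, and the all-ones input and constant kernels are made-up values chosen so the result is easy to check by hand.

```python
def conv2d_multichannel(image, kernels):
    """Cross-correlate a multi-channel image with several multi-channel kernels.

    image:   [c_in][height][width]
    kernels: [c_out][c_in][k_h][k_w]
    returns: [c_out][out_h][out_w]  (no padding, stride 1, no bias)
    """
    c_in, n_h, n_w = len(image), len(image[0]), len(image[0][0])
    output = []
    for kernel in kernels:  # one output channel per kernel
        k_h, k_w = len(kernel[0]), len(kernel[0][0])
        out_h, out_w = n_h - k_h + 1, n_w - k_w + 1
        channel = [
            [
                # Sum over all input channels and all kernel positions.
                sum(
                    image[c][i + a][j + b] * kernel[c][a][b]
                    for c in range(c_in)
                    for a in range(k_h)
                    for b in range(k_w)
                )
                for j in range(out_w)
            ]
            for i in range(out_h)
        ]
        output.append(channel)
    return output

# 3-channel 4 x 4 input of ones; kernel k is 3 x 3 x 3 filled with the value k + 1.
image = [[[1] * 4 for _ in range(4)] for _ in range(3)]
kernels = [[[[k + 1] * 3 for _ in range(3)] for _ in range(3)] for k in range(5)]
output = conv2d_multichannel(image, kernels)
print(len(output), len(output[0]), len(output[0][0]))  # 5 2 2
```

Each kernel spans all 3 input channels, so the number of output channels equals the number of kernels, not the number of input channels.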
Padding
• With no padding (a.k.a. valid convolution), each convolution operation reduces the size of the output.
• Some pixels (for example, the corner ones) are used least, whereas pixels near the center are used more often.
• If the input is of size n x n, the filter size is f x f, and the padding size is p, then the resultant output size will be (n + 2p − f + 1) x (n + 2p − f + 1).
• If the padding is chosen so that the output size is the same as the input size, it is called same padding.
Example with padding size = 1 (5 x 5 input, 3 x 3 filter). Only the output entries shown on the slide are listed; "." marks entries that are not shown.
Padded input (7 x 7):
0 0 0 0 0 0 0
0 1 0 2 2 1 0
0 2 1 1 2 1 0
0 2 1 1 1 1 0
0 0 1 1 2 2 0
0 2 2 1 1 1 0
0 0 0 0 0 0 0
Filter (3 x 3):
-1  0  0
-1  1  0
 0  1  0
Output (5 x 5):
 3  0  3  2  0
 4 -1  1  0 -2
 2  .  .  .  .
 2  .  .  .  .
 2  .  .  .  .
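The padding step and the output-size formula can be checked with a small sketch. The 5 x 5 input is the one from the padding example; the helper names, and the rule p = (f − 1)/2 for same padding (valid for odd f and stride 1), are assumptions added for illustration.

```python
def zero_pad(image, p):
    """Surround a 2D list with p rows and columns of zeros."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]
    padded += [[0] * p + list(row) + [0] * p for row in image]
    padded += [[0] * width for _ in range(p)]
    return padded

def padded_output_size(n, f, p):
    """Output size n + 2p - f + 1 for stride 1."""
    return n + 2 * p - f + 1

def same_padding(f):
    """Padding that keeps output size equal to input size (odd f, stride 1)."""
    return (f - 1) // 2

image = [
    [1, 0, 2, 2, 1],
    [2, 1, 1, 2, 1],
    [2, 1, 1, 1, 1],
    [0, 1, 1, 2, 2],
    [2, 2, 1, 1, 1],
]
padded = zero_pad(image, 1)
print(len(padded), len(padded[0]))    # 7 7
print(padded_output_size(5, 3, 0))    # 3 (valid convolution shrinks the output)
print(padded_output_size(5, 3, 1))    # 5 (same padding keeps the size)
```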
Stride
• If the input is of size n x n, the filter size is f x f, the padding size is p, and the stride is s, then the resultant output size will be (⌊(n + 2p − f)/s⌋ + 1) x (⌊(n + 2p − f)/s⌋ + 1).
Example with stride = 2 and padding size = 1 (5 x 5 input, 3 x 3 filter).
Padded input (7 x 7):
0 0 0 0 0 0 0
0 1 0 2 2 1 0
0 2 1 1 2 1 0
0 2 1 1 1 1 0
0 0 1 1 2 2 0
0 2 2 1 1 1 0
0 0 0 0 0 0 0
Filter (3 x 3):
-1  0  0
-1  1  0
 0  1  0
Output (3 x 3):
 3  3  0
 2  0  0
 2 -2 -2
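The stride example can be reproduced with a short sketch that combines zero padding and stride. It assumes the cross-correlation convention (no kernel flip) and floor division in the output-size formula; `conv2d` is a hypothetical helper, not a library function.

```python
def conv2d(image, kernel, padding=0, stride=1):
    """2D cross-correlation with zero padding and stride."""
    f_h, f_w = len(kernel), len(kernel[0])
    width = len(image[0]) + 2 * padding
    # Build the zero-padded input.
    padded = [[0] * width for _ in range(padding)]
    padded += [[0] * padding + list(row) + [0] * padding for row in image]
    padded += [[0] * width for _ in range(padding)]
    # floor((n + 2p - f) / s) + 1 in each dimension
    out_h = (len(padded) - f_h) // stride + 1
    out_w = (width - f_w) // stride + 1
    return [
        [
            sum(
                padded[i * stride + a][j * stride + b] * kernel[a][b]
                for a in range(f_h)
                for b in range(f_w)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

image = [
    [1, 0, 2, 2, 1],
    [2, 1, 1, 2, 1],
    [2, 1, 1, 1, 1],
    [0, 1, 1, 2, 2],
    [2, 2, 1, 1, 1],
]
kernel = [
    [-1, 0, 0],
    [-1, 1, 0],
    [0, 1, 0],
]
strided = conv2d(image, kernel, padding=1, stride=2)
print(strided)  # [[3, 3, 0], [2, 0, 0], [2, -2, -2]]
```

With stride 1 and the same padding, the same function returns a 5 x 5 output whose first two rows are 3 0 3 2 0 and 4 -1 1 0 -2.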
Inductive Biases
• Sparse connectivity: based on the assumption that neighboring pixels are related.
• Parameter sharing: based on the assumption that the same filters work in different parts of the image.
• The above two assumptions are called inductive biases.
• Inductive biases help CNNs learn more quickly and generalize better than fully connected NNs.
Pooling
• Used between the conv layers.
• Reduces the spatial size of the representation, which reduces the amount of parameters and computation in the network.
• Helps control overfitting.
• Accepts a volume of size W1×H1×D1.
• Requires two hyperparameters:
• Spatial extent F
• Stride S
• Produces a volume of size W2×H2×D2 where:
• W2 = ((W1−F)/S)+1
• H2 = ((H1−F)/S)+1
• D2 = D1
• No learnable parameters.
• Zero-padding the input is not commonly done for pooling layers.
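The output-size rule for pooling can be illustrated with max pooling on a single channel. The choice of max (rather than average) pooling and the example values are assumptions for illustration; the depth D2 = D1 follows because pooling is applied to each channel independently.

```python
def max_pool2d(feature_map, size, stride):
    """Max pooling over one channel with spatial extent `size` and stride `stride`."""
    h, w = len(feature_map), len(feature_map[0])
    out_h = (h - size) // stride + 1  # H2 = ((H1 - F) / S) + 1
    out_w = (w - size) // stride + 1  # W2 = ((W1 - F) / S) + 1
    return [
        [
            max(
                feature_map[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

feature_map = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]
pooled = max_pool2d(feature_map, size=2, stride=2)
print(pooled)  # [[6, 5], [7, 9]]
```

The function has no weights to learn, which matches the "no learnable parameters" point above.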
CNN Architectures
LeNet
• LeNet, proposed by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner in 1998, laid the groundwork for convolutional neural networks (CNNs) and their applications in handwritten digit recognition.
• LeNet was trained using stochastic gradient descent (SGD) with backpropagation.
• The network was trained on the MNIST dataset, comprising 60,000 training examples and 10,000 test examples.
• Data augmentation techniques such as translation, rotation, and scaling were employed to increase the diversity of training samples and improve generalization.
• LeNet achieved an accuracy of over 99% on the MNIST dataset.
LeNet-5 (cont…)
• Used sigmoid and tanh activations.
• Has approx. 60k learnable parameters.
• LeNet was used to read zip codes, digits, etc.
[Figure: LeNet-5 architecture. Conv (6 kernels, 5 x 5, stride = 1) → Avg pool (2 x 2, stride = 2) → Conv (16 kernels, 5 x 5, stride = 1) → Avg pool (2 x 2, stride = 2).]
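The "approx. 60k" figure can be sanity-checked with a rough count. Several details below are assumptions not stated on the slide: a 32 x 32 x 1 input, fully connected layers of 120, 84, and 10 units, every second-layer kernel seeing all 6 input maps, and pooling layers without trainable parameters. The original 1998 network differs in some of these details, so treat this as a sketch of a common modern variant.

```python
def conv_out(n, f, stride=1):
    """Output size of a conv or pooling layer without padding."""
    return (n - f) // stride + 1

def conv_params(f, c_in, c_out):
    """Weights plus one bias per filter."""
    return (f * f * c_in + 1) * c_out

def dense_params(n_in, n_out):
    """Weights plus one bias per unit."""
    return n_in * n_out + n_out

size, channels = 32, 1                       # ASSUMPTION: 32 x 32 grayscale input
counts = {}
counts["conv1"] = conv_params(5, channels, 6)
size, channels = conv_out(size, 5), 6        # 28 x 28 x 6
size = conv_out(size, 2, stride=2)           # avg pool -> 14 x 14 x 6
counts["conv2"] = conv_params(5, channels, 16)
size, channels = conv_out(size, 5), 16       # 10 x 10 x 16
size = conv_out(size, 2, stride=2)           # avg pool -> 5 x 5 x 16
flat = size * size * channels                # 400 features
counts["fc1"] = dense_params(flat, 120)      # ASSUMPTION: FC sizes 120, 84, 10
counts["fc2"] = dense_params(120, 84)
counts["fc3"] = dense_params(84, 10)
total = sum(counts.values())
print(counts, total)                         # total is 61,706 under these assumptions
```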
AlexNet
• This deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into 1000 different classes.
• Test data performance: achieved top-1 and top-5 error rates of 37.5% and 17.0%.
• In the ILSVRC-2012 competition, a variant of this model achieved a winning top-5 test error rate of 15.3%.
AlexNet
• AlexNet consists of eight layers: five convolutional layers (some followed by max-pooling layers) and three fully connected layers.
• Rectified Linear Units (ReLU) were used as activation functions, providing faster convergence and alleviating the vanishing gradient problem.
• The neural network has 60 million parameters and 650,000 neurons.
• Local Response Normalization (LRN) was introduced to normalize activations within local regions of the feature maps.
• LRN operates on local groups of neurons, normalizing activity within each group and across feature channels.
AlexNet - Training
• LRN is done using the formula
$$b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^{j}_{x,y}\right)^{2}\right)^{\beta}}$$
• $a^{i}_{x,y}$ is the activity of a neuron computed by applying kernel i at position (x, y) and then applying ReLU; $b^{i}_{x,y}$ is the response-normalized activity.
• n: the number of "adjacent" kernel maps at the same spatial position over which the sum runs.
• N: the total number of kernels in the layer.
• The constants k, n, α, and β are hyper-parameters with values k = 2, n = 5, α = 10^−4, and β = 0.75.
• AlexNet was trained using stochastic gradient descent (SGD) with momentum.
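LRN at a single spatial position can be sketched as below. It assumes the form from the AlexNet paper, b_i = a_i / (k + α Σ_j a_j²)^β, where j runs over the n adjacent kernel maps centered on i and is clipped to the range 0 to N − 1. The function name and the example activations are made up for illustration.

```python
def local_response_norm(activations, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize activations across kernel maps at one (x, y) position.

    activations[i] is the post-ReLU activity of kernel i at that position.
    """
    num_kernels = len(activations)      # N in the formula
    half = n // 2
    normalized = []
    for i in range(num_kernels):
        lo = max(0, i - half)
        hi = min(num_kernels - 1, i + half)
        sum_sq = sum(activations[j] ** 2 for j in range(lo, hi + 1))
        normalized.append(activations[i] / (k + alpha * sum_sq) ** beta)
    return normalized

activations = [1.0, 2.0, 0.0, 3.0, 0.5, 4.0]   # made-up post-ReLU values
normalized = local_response_norm(activations)
print(normalized)
```

With k = 2 the denominator is always greater than 1, so every non-negative activation is scaled down, and more strongly when its neighboring kernel maps are highly active.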
AlexNet - Training
• Overfitting reduction:
• Data augmentation
• Dropout
• Data augmentation techniques such as cropping, flipping, and color jittering were employed to increase the diversity of training samples.
• The network was trained on two NVIDIA GTX 580 GPUs, marking one of the earliest instances of utilizing GPU acceleration for deep learning.
AlexNet - Architecture
227 x 227 x 3 (input)
→ Conv 11 x 11, S = 4, ReLU → 55 x 55 x 96
→ Maxpool 3 x 3, S = 2 → 27 x 27 x 96
→ Conv 5 x 5, S = 1, Same, ReLU → 27 x 27 x 256
→ Maxpool 3 x 3, S = 2 → 13 x 13 x 256
→ Conv 3 x 3, S = 1, Same, ReLU → 13 x 13 x 384
→ Conv 3 x 3, S = 1, Same, ReLU → 13 x 13 x 384
→ Conv 3 x 3, S = 1, Same, ReLU → 13 x 13 x 256
→ Maxpool 3 x 3, S = 2 → 6 x 6 x 256
→ FC 4096 → FC 4096 → FC 1000 → SoftMax
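The feature-map sizes in the AlexNet architecture can be traced with the output-size formula floor((n + 2p − f)/s) + 1. In this sketch the layer labels (`conv1`, `pool1`, and so on) are hypothetical names, and "same" padding is taken as p = (f − 1)/2 for stride 1.

```python
def out_size(n, f, s=1, p=0):
    """Spatial output size of a conv or pooling layer."""
    return (n + 2 * p - f) // s + 1

# (label, filter size, stride, padding, output channels or None for pooling)
layers = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),   # same padding for a 5 x 5 filter
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),   # same padding for a 3 x 3 filter
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool3", 3, 2, 0, None),
]

size, channels = 227, 3
shapes = []
for label, f, s, p, c_out in layers:
    size = out_size(size, f, s, p)
    if c_out is not None:      # pooling keeps the channel count
        channels = c_out
    shapes.append((size, size, channels))
    print(f"{label}: {size} x {size} x {channels}")

flattened = size * size * channels  # input features to the first FC layer
print(flattened)                    # 9216
```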
How to compute Number of parameters in CNN
What will be the output size of the following network? How many learnable parameters exist? No padding is used and stride = 1.
Network: 10 x 10 x 1 grayscale image → Conv (3 x 3, 1 filter)
Number of parameters in CNN (cont…)
Network: 10 x 10 x 1 grayscale image → Conv (3 x 3, 1 filter)
Output size = (10 − 3 + 1, 10 − 3 + 1, 1) = (8, 8, 1)
Parameters = (3 x 3 x 1) + 1 = 10
Number of parameters in CNN (cont…)
What will be the output size of the following network? How many learnable parameters exist? No padding is used and stride = 1.
Network: 10 x 10 x 1 grayscale image → Conv (3 x 3, 5 filters) → Conv (3 x 3, 2 filters)
Number of parameters in CNN (cont…)
Network: 10 x 10 x 1 grayscale image → Conv (3 x 3, 5 filters) → Conv (3 x 3, 2 filters)
After the first conv, output size = (10 − 3 + 1, 10 − 3 + 1, 5) = (8, 8, 5)
Parameters = for each filter (3 x 3 x 1) + 1 = 10; for 5 filters = 50
After the second conv, output size = (8 − 3 + 1, 8 − 3 + 1, 2) = (6, 6, 2)
Parameters = for each filter (3 x 3 x 5) + 1 = 46; for 2 filters = 92
Total parameters = 50 + 92 = 142
Number of parameters in CNN (cont…)
What will be the output size of the following network? How many learnable parameters exist? No padding is used and stride = 1.
Network: 100 x 100 x 3 color image → Conv (3 x 3, 8 filters) → Conv (3 x 3, 1 filter)
Number of parameters in CNN (cont…)
Network: 100 x 100 x 3 color image → Conv (3 x 3, 8 filters) → Conv (3 x 3, 1 filter)
After the first conv, output size = (100 − 3 + 1, 100 − 3 + 1, 8) = (98, 98, 8)
Parameters = for each filter (3 x 3 x 3) + 1 = 28; for 8 filters = 224
After the second conv, output size = (98 − 3 + 1, 98 − 3 + 1, 1) = (96, 96, 1)
Parameters = for each filter (3 x 3 x 8) + 1 = 73; only one filter = 73
Total parameters = 224 + 73 = 297
Number of parameters in CNN (cont…)
What will be the output size of the following network? How many learnable parameters exist? No padding is used and stride = 1.
Network: 1D input of length 100 with 5 channels → Conv (kernel size 3, 8 filters) → Conv (kernel size 3, 1 filter)
Number of parameters in CNN (cont…)
Network: 1D input of length 100 with 5 channels → Conv (kernel size 3, 8 filters) → Conv (kernel size 3, 1 filter)
After the first conv, output size = (100 − 3 + 1, 8) = (98, 8)
Parameters = for each filter (3 x 5) + 1 = 16; for 8 filters = 128
After the second conv, output size = (98 − 3 + 1, 1) = (96, 1)
Parameters = for each filter (3 x 8) + 1 = 25; only one filter = 25
Total parameters = 128 + 25 = 153
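The worked examples on parameter counting can be checked with one small helper. This is a sketch under the same conditions as the exercises (no padding, stride 1, one bias per filter, the same kernel size along every spatial dimension); `conv_stack` is a hypothetical name.

```python
def conv_stack(input_shape, layers):
    """Return (output_shape, total_parameters) for a stack of conv layers.

    input_shape: (*spatial_dims, channels), e.g. (10, 10, 1) or (100, 5)
    layers:      list of (kernel_size, n_filters)
    """
    *spatial, channels = input_shape
    total = 0
    for kernel_size, n_filters in layers:
        weights_per_filter = kernel_size ** len(spatial) * channels
        total += (weights_per_filter + 1) * n_filters    # +1 for the bias
        spatial = [n - kernel_size + 1 for n in spatial]  # valid conv, stride 1
        channels = n_filters
    return (*spatial, channels), total

print(conv_stack((10, 10, 1), [(3, 1)]))            # ((8, 8, 1), 10)
print(conv_stack((10, 10, 1), [(3, 5), (3, 2)]))    # ((6, 6, 2), 142)
print(conv_stack((100, 100, 3), [(3, 8), (3, 1)]))  # ((96, 96, 1), 297)
print(conv_stack((100, 5), [(3, 8), (3, 1)]))       # ((96, 1), 153)
```

The same helper covers the 2D and 1D cases because the number of weights per filter is the kernel size raised to the number of spatial dimensions, times the number of input channels.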
INCEPTION Module
GOOGLENET / INCEPTION NET
[Figure: GoogLeNet / Inception Net architecture, with the auxiliary loss marked.]