CNN Architectures — LeNet, AlexNet, VGG, GoogLeNet and ResNet

Prabhu Raghav · 5 min read · Mar 15, 2018

In my previous blog post, I explained my understanding of Convolutional Neural Networks (CNNs). In this post, I detail convolution parameters and the various CNN architectures used in the ImageNet challenge. ImageNet has run an annual competition in visual recognition (ILSVRC, the "ImageNet Large Scale Visual Recognition Challenge") since 2010, in which participants are provided with about 1.4 million images. Below are some popular CNN architectures that won ILSVRC competitions.

- LeNet-5
- AlexNet
- VGGNet
- GoogLeNet
- ResNet

Figure 1 : ILSVRC

Before delving into each of these architectures, let us see why convolution and pooling layers are used in front of fully connected (FC) layers, and how convolution and pooling outputs are calculated in order to filter the essential features from an input image.

Why are convolution layers used before the fully connected layers?
Generally, convolution and pooling layers act as giant filters. Imagine an image of 224 x 224 x 3 pixels connected directly to a first hidden FC layer of 100,000 perceptrons: the total number of connections would be 224 x 224 x 3 x 100,000 = 15,052,800,000, roughly 15 billion parameters, which is infeasible to process. Convolution and max pooling layers help discard features of the image that may not be required for training.
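As a quick sanity check (my own arithmetic, not from the original post), the connection count can be computed directly:

```python
# Connections from a 224 x 224 x 3 input fully connected
# to a hidden layer of 100,000 perceptrons
params = 224 * 224 * 3 * 100_000
print(f"{params:,}")  # 15,052,800,000 -> roughly 15 billion
```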

Convolution Parameters

A convolution layer accepts a volume of size W x H x D and requires four hyperparameters, as follows:

- Number of filters → K
- Spatial extent → Fw, Fh (filter width, filter height)
- Stride → Sw, Sh (stride width, stride height)
- Padding → P

To calculate the receptive field (the convolution output size), the formula is as follows:

OM = (IM - F + 2P) / S + 1

To calculate the pooling layer output size, the formula is as follows:

OM = (IM - F) / S + 1

where:

- OM → Output Matrix
- IM → Input Matrix
- P → Padding
- F → Filter
- S → Stride

By applying the above receptive field and pooling formulas, the convolution, pooling, and feature map output sizes are derived. The key operations in a CNN are shown below.
Figure 2 : CNN Key Operation (Source : R.Fergus, Y.LeCun)
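To make the two formulas concrete, here is a minimal Python sketch (not from the original post; conv_out and pool_out are my own helper names):

```python
def conv_out(im, f, s=1, p=0):
    """Convolution output size: OM = (IM - F + 2P) / S + 1."""
    return (im - f + 2 * p) // s + 1

def pool_out(im, f, s):
    """Pooling output size: OM = (IM - F) / S + 1."""
    return (im - f) // s + 1

# LeNet-5's first layer: 32 x 32 input, 5 x 5 filter, stride 1, no padding
print(conv_out(32, 5))        # 28
# followed by 2 x 2 pooling with stride 2
print(pool_out(28, 2, s=2))   # 14
```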

LeNet-5

This classical neural network architecture was successfully used to recognise handwritten digits from the MNIST dataset. Below is the LeNet-5 architecture model.
Figure 3 : LeNet-5 Architecture

LeNet-5 receives an input image of 32 x 32 x 1 (a greyscale image), and its goal is to recognise handwritten digit patterns. It uses 5 x 5 filters with a stride of 1. Applying the receptive field formula above gives an output volume of 28 x 28, i.e. (32 - 5 + 0) / 1 + 1 = 28. The derivation is below:

- W x H → 32 x 32 (width x height)
- F(w x h) → 5 x 5 (filter)
- S → 1 (stride)
- P → 0 (padding)
The next layer is a pooling layer. To calculate its output in the LeNet-5 architecture above, the derivation is as follows:

- IM → 28 (input matrix: the convolution output volume derived above)
- F → 2 (2 x 2 pooling window)
- S → 2 (stride)
- P → 0 (padding)

This gives (28 - 2) / 2 + 1 = 14, i.e. a 14 x 14 output.

Finally, the network reaches a fully connected (FC) layer with 120 nodes, followed by another FC layer with 84 nodes. It uses sigmoid or tanh nonlinearity functions. The output ŷ takes 10 possible values, for the digits 0 to 9. The network is trained on the MNIST digit dataset with 60K training examples.
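For readers who want to see the whole stack in code, here is a minimal PyTorch sketch of a LeNet-5-style network (my own illustration, not from the original post; it assumes tanh activations and average pooling, whereas the original paper used trainable subsampling layers):

```python
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # 32x32x1 -> 28x28x6
    nn.Tanh(),
    nn.AvgPool2d(2, stride=2),         # -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5),   # -> 10x10x16
    nn.Tanh(),
    nn.AvgPool2d(2, stride=2),         # -> 5x5x16
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),        # FC layer with 120 nodes
    nn.Tanh(),
    nn.Linear(120, 84),                # FC layer with 84 nodes
    nn.Tanh(),
    nn.Linear(84, 10),                 # 10 outputs: digits 0-9
)

x = torch.randn(1, 1, 32, 32)          # one 32 x 32 greyscale image
print(lenet5(x).shape)                 # torch.Size([1, 10])
```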

AlexNet
It starts with 227 x 227 x 3 images, and the first convolution layer applies 96 filters of 11 x 11 with a stride of 4, reducing the output volume to 55 x 55. The next layer is a max pooling layer with a 3 x 3 filter and a stride of 2. This pattern continues until the network reaches an FC layer with 9,216 units, followed by two FC layers with 4,096 nodes each. At the end, it uses a softmax function over 1,000 output classes. In total, the network has 60 million parameters.

Figure 4 : AlexNet Architecture
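The first two output sizes can be checked against the formulas above (my own arithmetic, not from the post):

```python
# Convolution: 227 x 227 input, 11 x 11 filter, stride 4, no padding
print((227 - 11) // 4 + 1)   # 55
# Max pooling: 55 x 55 input, 3 x 3 filter, stride 2
print((55 - 3) // 2 + 1)     # 27
```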

Some of the highlights of the AlexNet architecture:

- It uses the ReLU activation function instead of sigmoid or tanh, which made training more than 5 times faster at the same accuracy.
- It uses "dropout" rather than other regularisation to deal with overfitting, with a dropout ratio of 0.5, although this doubles the training time.
- It is a bigger model trained on more data, with 7 hidden layers, 650K units, and 60M parameters.

VGG-16

VGG-16 is a simpler architecture model, since it does not use many hyperparameters. It always uses 3 x 3 filters with a stride of 1 and SAME padding in the convolution layers, and 2 x 2 max pooling with a stride of 2.

Figure 5 : VGG-16 → Source
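That uniformity makes a VGG "block" easy to express in code. Here is a minimal PyTorch sketch (my own illustration; the helper name vgg_block and the channel counts are assumptions):

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    # n_convs 3x3 convolutions (stride 1, SAME padding), each with ReLU,
    # then 2x2 max pooling with stride 2, which halves the spatial size
    layers = []
    for _ in range(n_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                   nn.ReLU()]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 224, 224)
print(vgg_block(3, 64, n_convs=2)(x).shape)  # torch.Size([1, 64, 112, 112])
```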

GoogLeNet

The winner of ILSVRC 2014, the GoogLeNet architecture is also known as Inception, after its core building block, the Inception module. It goes deeper via parallel paths with different receptive field sizes, and it achieved a top-5 error rate of 6.67%.
Figure 6 : GoogLeNet Inception Module
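Here is a minimal PyTorch sketch of the parallel-path idea (my own illustration; the channel counts are arbitrary rather than GoogLeNet's, and ReLUs are omitted for brevity):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.p1 = nn.Conv2d(in_ch, 16, kernel_size=1)      # 1x1 path
        self.p3 = nn.Sequential(                           # 3x3 path
            nn.Conv2d(in_ch, 8, kernel_size=1),            # 1x1 bottleneck
            nn.Conv2d(8, 16, kernel_size=3, padding=1))
        self.p5 = nn.Sequential(                           # 5x5 path
            nn.Conv2d(in_ch, 8, kernel_size=1),            # 1x1 bottleneck
            nn.Conv2d(8, 16, kernel_size=5, padding=2))
        self.pp = nn.Sequential(                           # pooling path
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 16, kernel_size=1))

    def forward(self, x):
        # Every path preserves the spatial size, so the outputs can be
        # concatenated along the channel axis.
        return torch.cat([self.p1(x), self.p3(x), self.p5(x), self.pp(x)], dim=1)

x = torch.randn(1, 32, 28, 28)
print(InceptionModule(32)(x).shape)  # torch.Size([1, 64, 28, 28])
```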

This architecture is 22 layers deep. It reduces the number of parameters from 60 million (AlexNet) to 4 million. An alternate view of the architecture in tabular format is below.
Table 1 : GoogLeNet Architecture Tabular View. Source

ResNet (2015)

The winner of ILSVRC 2015, it is also called the Residual Neural Network (ResNet), by Kaiming He et al. This architecture introduced a concept called "skip connections". Typically, the input passes through two weight layers (linear transformations) with ReLU activations. In a residual network, the input is additionally copied directly to the output of the second transformation, and the sum is passed through the final ReLU.
Figure 7 : ResNet Architecture comparison with plain. Source
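In code, the skip connection is a single addition. Here is a minimal PyTorch sketch (my own illustration; batch normalisation, which the real ResNet applies after each convolution, is omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # y = ReLU(F(x) + x), where F is two 3x3 convolutions
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))   # first transformation + ReLU
        out = self.conv2(out)         # second transformation
        return F.relu(out + x)        # add the skipped input, then final ReLU

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)     # torch.Size([1, 64, 56, 56])
```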

The overall summary table of the various CNN architecture models is below.

Table 2 : ILSVRC competition CNN architecture models
