Module 3 B

The document reviews several CNN architectures including LeNet, AlexNet, ZFNet, and VGGNet, detailing their structures and key features. It highlights the evolution of CNNs, emphasizing improvements in hyperparameters and architectural depth that led to better performance in image recognition tasks. Notably, VGGNet introduced smaller filters and deeper networks, achieving significant reductions in error rates on the ImageNet challenge.

This Session

CNN Architectures: Plain Models


LeNet
AlexNet
ZFNet
VGGNet
Network in Network
Review: LeNet-5

LeCun et al. Gradient-based learning applied to document recognition.


Proceedings of the IEEE, 1998. Source: cs231n
Review: LeNet-5

Conv filters are 5x5, applied at stride 1


Subsampling (pooling) layers are 2x2, applied at stride 2
i.e. the architecture is [CONV-POOL-CONV-POOL-CONV-FC-FC]

LeCun et al. Gradient-based learning applied to document recognition.


Proceedings of the IEEE, 1998. Source: cs231n
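For concreteness, here is a minimal PyTorch sketch of this [CONV-POOL-CONV-POOL-CONV-FC-FC] layout, assuming 32x32 grayscale inputs and 10 classes; the original LeNet-5 used tanh activations and average-pooling subsampling, for which ReLU and max pooling are substituted here.

import torch
import torch.nn as nn

# Minimal LeNet-5-style sketch: [CONV-POOL-CONV-POOL-CONV-FC-FC],
# assuming 32x32x1 inputs and 10 output classes. The original used
# tanh activations and average subsampling instead of ReLU/max-pool.
class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, stride=1),    # 32x32x1 -> 28x28x6
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5, stride=1),    # -> 10x10x16
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # -> 5x5x16
            nn.Conv2d(16, 120, kernel_size=5, stride=1),  # -> 1x1x120
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# quick shape check
print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])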
AlexNet

Architecture:
CONV1 MAX POOL1 NORM1(Local Response Normalization)
CONV2 MAX POOL2 NORM2(Local Response Normalization)
CONV3
CONV4
CONV5 MAX POOL3
FC6
FC7
FC8
Krizhevsky, et al. Imagenet classification with deep convolutional neural networks. NIPS 2012. Source: cs231n
AlexNet

Input: 227x227x3 images


First layer (CONV1): 96 11x11 filters applied at stride 4
=>
Q: what is the output volume size? Hint: (227-11)/4+1 = 55

Krizhevsky, et al. Imagenet classification with deep convolutional neural networks. NIPS 2012. Source: cs231n
AlexNet

Input: 227x227x3 images


First layer (CONV1): 96 11x11 filters applied at stride 4
=>
Output volume [55x55x96]
Q: What is the total number of parameters in this layer?

Krizhevsky, et al. Imagenet classification with deep convolutional neural networks. NIPS 2012. Source: cs231n
AlexNet

Input: 227x227x3 images


First layer (CONV1): 96 11x11 filters applied at stride 4
=>
Output volume [55x55x96]
Parameters: (11*11*3)*96 = 35K

Krizhevsky, et al. Imagenet classification with deep convolutional neural networks. NIPS 2012. Source: cs231n
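The arithmetic on these slides is easy to verify with a couple of plain-Python helpers (the function names below are illustrative, not from any library):

# Conv layer arithmetic (plain Python, illustrative helper names).
def conv_output_size(input_size, filter_size, stride, pad=0):
    """Spatial output size of a conv layer: (W - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * pad) // stride + 1

def conv_params(filter_size, in_channels, num_filters, bias=True):
    """Learnable parameters of a conv layer: F*F*C_in*K (+ K biases)."""
    return filter_size * filter_size * in_channels * num_filters + (num_filters if bias else 0)

# CONV1: 96 filters of 11x11 over a 227x227x3 input, stride 4, pad 0
print(conv_output_size(227, 11, 4))        # 55   -> output volume 55x55x96
print(conv_params(11, 3, 96, bias=False))  # 34848 ~= 35K weights (+ 96 biases)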
AlexNet

Input: 227x227x3 images


After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2

Q: what is the output volume size? Hint: (55-3)/2+1 = 27

Krizhevsky, et al. Imagenet classification with deep convolutional neural networks. NIPS 2012. Source: cs231n
AlexNet

Input: 227x227x3 images


After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2
Output volume [27x27x96]

Q: what is the number of parameters in this layer?

Krizhevsky, et al. Imagenet classification with deep convolutional neural networks. NIPS 2012. Source: cs231n
AlexNet

Input: 227x227x3 images


After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2
Output volume [27x27x96]
Parameters: 0!

Krizhevsky, et al. Imagenet classification with deep convolutional neural networks. NIPS 2012. Source: cs231n
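The same output-size formula covers pooling; the only difference is that pooling has no learnable parameters (reusing conv_output_size from the sketch above):

# POOL1: 3x3 window at stride 2 over the 55x55x96 CONV1 output.
print(conv_output_size(55, 3, 2))  # 27 -> output volume 27x27x96
# Pooling only takes a max (or average) over each window, so it
# contributes zero learnable parameters.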
AlexNet

Input: 227x227x3 images


After CONV1: 55x55x96
After POOL1: 27x27x96
...

Krizhevsky, et al. Imagenet classification with deep convolutional neural networks. NIPS 2012. Source: cs231n
AlexNet

Full (simplified) AlexNet architecture:


[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
Krizhevsky, et al. Imagenet classification with deep convolutional neural networks. NIPS 2012. Source: cs231n
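As a hedged reference, a single-GPU PyTorch sketch of this simplified layer stack (the original's two-GPU split and dropout in FC6/FC7 are omitted; LRN layers are kept to match the slide):

import torch
import torch.nn as nn

# Simplified single-GPU AlexNet sketch following the layer list above.
class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),               # 227x227x3 -> 55x55x96
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # -> 27x27x96
            nn.LocalResponseNorm(size=5),                             # NORM1
            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),   # -> 27x27x256
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # -> 13x13x256
            nn.LocalResponseNorm(size=5),                             # NORM2
            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),  # -> 13x13x384
            nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),  # -> 13x13x384
            nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),  # -> 13x13x256
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # -> 6x6x256
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(6 * 6 * 256, 4096), nn.ReLU(),  # FC6
            nn.Linear(4096, 4096), nn.ReLU(),         # FC7
            nn.Linear(4096, num_classes),             # FC8: class scores
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(AlexNetSketch()(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 1000])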
AlexNet

Full (simplified) AlexNet architecture (same layer list as above); CONV1's output is effectively [55x55x48] x2, split across two GPUs.

Historical note: Trained on GTX 580 GPU with only 3 GB of memory; the network was spread across 2 GPUs, with half the neurons (feature maps) on each GPU.

Krizhevsky, et al. Imagenet classification with deep convolutional neural networks. NIPS 2012. Source: cs231n
AlexNet

Full (simplified) AlexNet architecture (same layer list as above).

CONV1, CONV2, CONV4, CONV5: connections only with feature maps on the same GPU.

Krizhevsky, et al. Imagenet classification with deep convolutional neural networks. NIPS 2012. Source: cs231n
AlexNet

Full (simplified) AlexNet architecture (same layer list as above).

CONV3, FC6, FC7, FC8: connections with all feature maps in the preceding layer, i.e. communication across GPUs.

Krizhevsky, et al. Imagenet classification with deep convolutional neural networks. NIPS 2012. Source: cs231n
AlexNet

Details / Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- batch size 128
- SGD with momentum 0.9
- learning rate 0.01, reduced manually when validation accuracy saturates

Krizhevsky, et al. Imagenet classification with deep convolutional neural networks. NIPS 2012. Source: cs231n
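A sketch of this training setup using standard PyTorch utilities; ReduceLROnPlateau stands in for the manual learning-rate drops, and values not given on the slide (e.g. weight decay) are left out:

import torch

# Hyperparameters from the slide: SGD with momentum 0.9, lr 0.01, batch size 128,
# learning rate reduced when validation accuracy saturates (approximated here
# with ReduceLROnPlateau instead of manual drops).
model = AlexNetSketch()  # sketch defined earlier
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=3
)

# inside the training loop, after each validation pass:
# scheduler.step(val_accuracy)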
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
First CNN-based winner: AlexNet (2012)

K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, IEEE CVPR 2016.
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
ZFNet: Improved hyperparameters over AlexNet

K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, IEEE CVPR 2016.
ZFNet

AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512

ImageNet top 5 error: 16.4% -> 11.7%

M. Zeiler, R. Fergus. Visualizing and understanding convolutional networks. ECCV 2014. Source: cs231n
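Expressed as a diff against the AlexNet sketch above, ZFNet's changes touch only these layers (a sketch of the modified layers, not a full ZFNet definition):

import torch.nn as nn

# ZFNet's changes relative to AlexNet, per the slide (sketch only):
conv1 = nn.Conv2d(3, 96, kernel_size=7, stride=2)                 # was 11x11, stride 4
conv3 = nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1)   # was 384 filters
conv4 = nn.Conv2d(512, 1024, kernel_size=3, stride=1, padding=1)  # was 384 filters
conv5 = nn.Conv2d(1024, 512, kernel_size=3, stride=1, padding=1)  # was 256 filters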
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
Deeper Networks

K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, IEEE CVPR 2016.
VGGNet

VGG16 (source)

Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR2015.
VGGNet
Small filters, Deeper networks

8 layers (AlexNet)
-> 16 - 19 layers (VGGNet)

Only 3x3 CONV stride 1, pad 1


and 2x2 MAX POOL stride 2

Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR2015. Source: cs231n
VGGNet
Small filters, Deeper networks

8 layers (AlexNet)
-> 16 - 19 layers (VGGNet)

Only 3x3 CONV stride 1, pad 1


and 2x2 MAX POOL stride 2

ImageNet top 5 error:


11.7% (ZFNet, 2013)
->
7.3% (VGGNet, 2014)

Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR2015. Source: cs231n
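Because the design is so uniform, the whole convolutional stack can be built from a short configuration list. A minimal sketch of VGG16's feature extractor under these rules (the channel counts follow the standard VGG16 configuration, which the slide does not list explicitly):

import torch
import torch.nn as nn

# VGG16-style sketch: every conv is 3x3 stride 1 pad 1, every pool is 2x2 stride 2.
# 'M' marks a max-pool; numbers are output channels (standard VGG16 configuration).
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

def make_vgg_features(cfg, in_channels=3):
    layers = []
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_channels, v, kernel_size=3, stride=1, padding=1),
                       nn.ReLU()]
            in_channels = v
    return nn.Sequential(*layers)

features = make_vgg_features(VGG16_CFG)
print(features(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 512, 7, 7])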
VGGNet
Q: Why use smaller filters? (3x3 conv)

Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR2015. Source: cs231n
VGGNet
Q: Why use smaller filters? (3x3 conv)
Stack of three 3x3 conv (stride 1) layers
has same effective receptive field as
one 7x7 conv layer

Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR2015. Source: cs231n
VGGNet
Q: Why use smaller filters? (3x3 conv)
Stack of three 3x3 conv (stride 1) layers
has same effective receptive field as
one 7x7 conv layer

Q: What is the effective receptive field of


three 3x3 conv (stride 1) layers?

Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR2015. Source: cs231n
VGGNet
Q: Why use smaller filters? (3x3 conv)
Stack of three 3x3 conv (stride 1) layers
has same effective receptive field as
one 7x7 conv layer

Q: What is the effective receptive field of


three 3x3 conv (stride 1) layers?
[7x7]

But deeper, more non-linearities

And fewer parameters: 3 × (3² C²) = 27C² vs. 7² C² = 49C² for C channels per layer

Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR2015. Source: cs231n
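A quick sanity check of the parameter comparison (biases ignored), e.g. for C = 256 channels:

# Parameters for C input and C output channels, biases ignored:
# three stacked 3x3 convs vs. one 7x7 conv with the same 7x7 receptive field.
C = 256
stacked_3x3 = 3 * (3 * 3 * C * C)   # 3 x (3^2 C^2) = 27 C^2
single_7x7 = 7 * 7 * C * C          # 7^2 C^2       = 49 C^2
print(stacked_3x3, single_7x7)      # 1769472 vs. 3211264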
VGGNet

Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR2015. Source: cs231n
VGGNet

TOTAL memory: 24M * 4 bytes ~= 96MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters

Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR2015. Source: cs231n
VGGNet

Note: Most memory is in the early CONV layers; most parameters are in the late FC layers.

TOTAL memory: 24M * 4 bytes ~= 96MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters

Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR2015. Source: cs231n
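Assuming torchvision is available, the 138M figure can be reproduced from the reference implementation, and the first FC layer alone accounts for most of it:

from torchvision.models import vgg16

model = vgg16()  # untrained VGG16, standard 1000-class head
total = sum(p.numel() for p in model.parameters())
fc6 = model.classifier[0]  # first fully connected layer: 25088 -> 4096
print(f"total params: {total/1e6:.0f}M")               # ~138M
print(f"FC6 params:   {fc6.weight.numel()/1e6:.0f}M")  # ~103M (7*7*512*4096)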
Network in Network (NiN)

Lin et al. Network in Network. ICLR 2014.


Network in Network (NiN)

Mlpconv layer: a "micronetwork" within each conv layer computes more abstract features for local patches.

The micronetwork is a multilayer perceptron (FC layers, i.e. 1x1 conv layers).
Lin et al. Network in Network. ICLR 2014. Source: cs231n
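A minimal sketch of one mlpconv block: a spatial conv followed by 1x1 convs, which act as a small MLP applied independently at every spatial location (the filter counts here are illustrative):

import torch
import torch.nn as nn

# One mlpconv block (sketch): a spatial conv followed by two 1x1 convs.
# The 1x1 convs are per-pixel fully connected layers across channels,
# i.e. the "micronetwork". Filter counts are illustrative.
mlpconv = nn.Sequential(
    nn.Conv2d(3, 192, kernel_size=5, stride=1, padding=2), nn.ReLU(),
    nn.Conv2d(192, 160, kernel_size=1), nn.ReLU(),  # 1x1 conv = per-location FC
    nn.Conv2d(160, 96, kernel_size=1), nn.ReLU(),
)
print(mlpconv(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 96, 32, 32])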


Network in Network (NiN)
The overall structure of NiN: stacking of three mlpconv layers and one global
average pooling layer

Precursor to GoogLeNet and ResNet

Philosophical inspiration for GoogLeNet

Lin et al. Network in Network. 2014. Source: cs231n
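The global-average-pooling head replaces the usual FC classifier: the last mlpconv emits one feature map per class, and each map is averaged down to a single score. A minimal sketch, assuming 10 classes and 96 input channels:

import torch
import torch.nn as nn

# Global-average-pooling classifier head (sketch, 10 classes): the final mlpconv
# emits one feature map per class; each map is averaged to one class score.
head = nn.Sequential(
    nn.Conv2d(96, 10, kernel_size=1),  # one map per class
    nn.AdaptiveAvgPool2d(1),           # global average pooling -> 1x1 per map
    nn.Flatten(),                      # -> [batch, 10] class scores
)
print(head(torch.randn(1, 96, 8, 8)).shape)  # torch.Size([1, 10])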


Things to remember
Architectures: Plain Models
LeNet (1998)
5 layers
Little progress until 2012 due to lack of large-scale data and
computational resources
AlexNet (2012)
8 layers
Game changer for the computer vision area
ZFNet (2013)
8 layers with improved hyperparameter settings
VGGNet (2014)
Deeper model: 16 or 19 layers
Uniform 3x3 filters
NiN (Network in Network) (2014)
Inspiration for DAG-style models (e.g. GoogLeNet)