This Session
CNN Architectures: Plain Models
LeNet
AlexNet
ZFNet
VGGNet
Network in Network
Review: LeNet-5
LeCun et al. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 1998. Source: cs231n
Review: LeNet-5
Conv filters are 5x5, applied at stride 1
Subsampling (Pooling) layers are 2x2 applied at stride 2
i.e. architecture is [CONV-POOL-CONV-POOL-CONV-FC-FC]
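The layer arithmetic above can be checked with a short Python sketch (the `conv_out` helper is ours, and LeNet-5's standard 32x32 input is assumed):

```python
def conv_out(size, k, stride=1, pad=0):
    # Output spatial size of a conv/pool layer: (W - K + 2P) / S + 1
    return (size - k + 2 * pad) // stride + 1

# Walk LeNet-5's spatial sizes, assuming the standard 32x32 input
s = 32
s = conv_out(s, 5)             # CONV 5x5, stride 1 -> 28
s = conv_out(s, 2, stride=2)   # POOL 2x2, stride 2 -> 14
s = conv_out(s, 5)             # CONV 5x5, stride 1 -> 10
s = conv_out(s, 2, stride=2)   # POOL 2x2, stride 2 -> 5
print(s)                       # 5x5 spatial grid feeding the final layers
```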
AlexNet
Architecture:
CONV1 MAX POOL1 NORM1 (Local Response Normalization)
CONV2 MAX POOL2 NORM2 (Local Response Normalization)
CONV3
CONV4
CONV5 MAX POOL3
FC6
FC7
FC8
Krizhevsky, et al. Imagenet classification with deep convolutional neural networks. NIPS 2012. Source: cs231n
AlexNet
Input: 227x227x3 images
First layer (CONV1): 96 11x11 filters applied at stride 4
=>
Q: what is the output volume size? Hint: (227-11)/4+1 = 55
Output volume [55x55x96]
Q: What is the total number of parameters in this layer?
Parameters: (11*11*3)*96 = 35K
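As a quick check, the output-size formula and the weight count give exactly the slide's numbers (a minimal sketch; the `conv_out` helper is ours, not from the paper):

```python
def conv_out(size, k, stride=1, pad=0):
    # (W - K + 2P) / S + 1
    return (size - k + 2 * pad) // stride + 1

out = conv_out(227, 11, stride=4)   # (227 - 11)/4 + 1 = 55
weights = 11 * 11 * 3 * 96          # one 11x11x3 kernel per output map
print(out, weights)                 # 55 34848 (~35K, plus 96 biases)
```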
AlexNet
Input: 227x227x3 images
After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2
Q: what is the output volume size? Hint: (55-3)/2+1 = 27
Output volume [27x27x96]
Q: what is the number of parameters in this layer?
Parameters: 0!
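A few lines confirm the pooling arithmetic (a sketch; the variable names are ours):

```python
# POOL1: 3x3 max pooling at stride 2 over the 55x55x96 CONV1 output
out = (55 - 3) // 2 + 1    # 27 -- same size formula as conv, pad 0
depth = 96                 # pooling leaves the depth unchanged
params = 0                 # pooling has no learnable weights
print(out, depth, params)  # 27 96 0
```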
After POOL1: 27x27x96
...
AlexNet
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
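The whole layer list can be replayed with the conv/pool size formula to verify each volume (a sketch; the `step` helper is ours, and biases/ReLUs are omitted):

```python
# Replay the slide's AlexNet layer list and check each output volume.
# conv and pool share the same size formula; depth changes only at conv layers.
def step(hw, depth, k, stride, pad=0, out_depth=None):
    hw = (hw - k + 2 * pad) // stride + 1
    return hw, (out_depth if out_depth is not None else depth)

hw, d = 227, 3
hw, d = step(hw, d, 11, 4, 0, out_depth=96)   # CONV1 -> 55x55x96
hw, d = step(hw, d, 3, 2)                     # POOL1 -> 27x27x96
hw, d = step(hw, d, 5, 1, 2, out_depth=256)   # CONV2 -> 27x27x256
hw, d = step(hw, d, 3, 2)                     # POOL2 -> 13x13x256
hw, d = step(hw, d, 3, 1, 1, out_depth=384)   # CONV3 -> 13x13x384
hw, d = step(hw, d, 3, 1, 1, out_depth=384)   # CONV4 -> 13x13x384
hw, d = step(hw, d, 3, 1, 1, out_depth=256)   # CONV5 -> 13x13x256
hw, d = step(hw, d, 3, 2)                     # POOL3 -> 6x6x256
print(hw, d, hw * hw * d)  # 6 256 9216 -> flattened input to FC6
```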
AlexNet
Historical note: trained on GTX 580 GPUs with only 3 GB of memory each. The network was spread across 2 GPUs, with half the neurons (feature maps) on each GPU, so e.g. CONV1's [55x55x96] output lives as [55x55x48] x2.
AlexNet
CONV1, CONV2, CONV4, CONV5: connections only with feature maps on the same GPU.
AlexNet
CONV3, FC6, FC7, FC8: connections with all feature maps in the preceding layer, i.e. communication across GPUs.
AlexNet
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- batch size 128
- SGD with momentum 0.9
- learning rate 0.01, reduced manually when validation accuracy saturates
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
First CNN-based winner
K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, IEEE CVPR 2016.
ZFNet: improved hyperparameters over AlexNet
ZFNet
AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ImageNet top 5 error: 16.4% -> 11.7%
M. Zeiler, R. Fergus. Visualizing and understanding convolutional networks. ECCV 2014. Source: cs231n
Deeper Networks
VGGNet
VGG16
Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR2015.
VGGNet
Small filters, Deeper networks
8 layers (AlexNet)
-> 16 - 19 layers (VGGNet)
Only 3x3 CONV stride 1, pad 1
and 2x2 MAX POOL stride 2
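These two settings compose cleanly: 3x3 conv with stride 1 and pad 1 preserves the spatial size, and 2x2 max pool with stride 2 halves it, which a quick check confirms (the `conv_out` helper is ours):

```python
def conv_out(size, k, stride=1, pad=0):
    # (W - K + 2P) / S + 1
    return (size - k + 2 * pad) // stride + 1

print(conv_out(224, 3, 1, 1))   # 224: 3x3 conv, stride 1, pad 1 preserves size
print(conv_out(224, 2, 2, 0))   # 112: 2x2 max pool, stride 2 halves it
```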
ImageNet top 5 error:
11.7% (ZFNet, 2013)
->
7.3% (VGGNet, 2014)
VGGNet
Q: Why use smaller filters? (3x3 conv)
Stack of three 3x3 conv (stride 1) layers
has same effective receptive field as
one 7x7 conv layer
Q: What is the effective receptive field of
three 3x3 conv (stride 1) layers?
[7x7]
But deeper, more non-linearities
And fewer parameters: 3 * (3^2 * C^2) vs.
7^2 * C^2 for C channels per layer
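Both claims can be verified numerically (a sketch; `stacked_rf` and the channel count C = 64 are our own illustrative choices):

```python
def stacked_rf(n, k=3):
    # Receptive field of n stacked kxk, stride-1 conv layers:
    # each extra layer adds (k - 1) pixels to the field.
    rf = 1
    for _ in range(n):
        rf += k - 1
    return rf

C = 64                        # example channel count (constant per layer)
print(stacked_rf(3))          # 7 -> same field as a single 7x7 conv
print(3 * (3 * 3 * C * C))    # three 3x3 layers: 3 * (3^2 C^2) weights
print(7 * 7 * C * C)          # one 7x7 layer: 7^2 C^2 weights (more)
```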
VGGNet
TOTAL memory: 24M * 4 bytes ~= 96MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
Note: most memory is in the early CONV layers; most parameters are in the late FC layers.
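The 138M figure can be reproduced by summing weights and biases over the standard VGG16 configuration (a sketch assuming the usual 13-conv/3-FC layout with a 224x224 input, so the first FC layer sees a 7x7x512 volume):

```python
# All conv layers are 3x3; (in_channels, out_channels) per layer,
# with max pools in between (pools add no parameters).
convs = [(3, 64), (64, 64), (64, 128), (128, 128),
         (128, 256), (256, 256), (256, 256),
         (256, 512), (512, 512), (512, 512),
         (512, 512), (512, 512), (512, 512)]
fcs = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

total = sum(3 * 3 * cin * cout + cout for cin, cout in convs)   # weights + biases
total += sum(cin * cout + cout for cin, cout in fcs)
print(total)  # 138357544 -> the ~138M parameters quoted above
```

Note that the three FC layers alone contribute over 120M of the total, which is why the slide's "most params are in late FC" observation holds.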
Network in Network (NiN)
Lin et al. Network in Network. ICLR 2014.
Network in Network (NiN)
Mlpconv layer: a "micronetwork" within each conv layer computes more abstract features for local patches.
The micronetwork is a multilayer perceptron (FC layers, i.e. 1x1 conv layers).
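To see why a 1x1 conv acts as a fully connected layer applied at every spatial position, here is a tiny pure-Python sketch (toy numbers and helper of our own):

```python
# A 1x1 conv computes, at each pixel, a weighted sum over input channels --
# exactly the dot products an FC layer would compute on that channel vector.
x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
     [[7.0, 8.0, 9.0], [1.0, 0.0, 1.0]]]    # x[h][w] = input channel vector
w = [[0.1, 0.2, 0.3], [1.0, -1.0, 0.0]]     # one weight row per output channel

def conv1x1(x, w):
    return [[[sum(wi * ci for wi, ci in zip(row, px)) for row in w]
             for px in rowpx] for rowpx in x]

y = conv1x1(x, w)
print(y[0][0])  # ~[1.4, -1.0]: 3 input channels mapped to 2 output channels
```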
Network in Network (NiN)
The overall structure of NiN: stacking of three mlpconv layers and one global
average pooling layer
Precursor to GoogLeNet and ResNet
Philosophical inspiration for GoogLeNet
Things to remember
Architectures: Plain Models
LeNet (1998)
5 Layers
Little progress until 2012, due to the lack of large-scale data and
computational resources
AlexNet (2012)
8 Layers
Game changer in Computer Vision Area
ZFNet (2013)
8 Layers with improved hyperparameter settings
VGGNet (2014)
Deeper model: 16 or 19 Layers
Uniform filters
NiN (Network in Network) (2014)
Inspiration for DAG-style models (e.g. GoogLeNet)