Convolutional Neural Networks
Source: Computer Vision Course by Dr. Shiv Ram Dubey
https://sites.google.com/site/iiitscv/spring2018
Previous class
• Gradient Descent
– Back prop
– Chain rule
• Perceptron/Neuron
– A non-linear function
• Multilayer Neural Networks
– Hidden layers
– Deep networks
Assignment 3
Results (for both SSD and cross-entropy loss functions):

Number of neurons | Training accuracy | Testing accuracy
        5         |                   |
        6         |                   |
        7         |                   |
Today’s class
• Overview of image classification using hand-crafted features
• Convolutional Neural Network (CNN)
– Convolution Layer
– Non-linearity Layer
– Pooling Layer
Image Categorization: Training phase
[Diagram: training images + training labels → image features → classifier training → trained classifier]
Ex: Assignment 3
Features are obtained using a CNN model
Image Categorization: Testing phase
[Diagram: the training pipeline as above; at test time, a test image → image features → trained classifier → prediction (e.g. "Outdoor")]
Features are the Keys
SIFT [Lowe, IJCV 04]
LBP [Ojala et al., PAMI 02]
HOG [Dalal and Triggs, CVPR 05]
SPM [Lazebnik et al., CVPR 06]
Color Descriptor [Van De Sande et al., PAMI 10]
Neural Networks
Source: http://cs231n.github.io
Multi-layer Neural Network
• A non-linear classifier
• Training: find the network weights w that minimize the error between the true and estimated outputs over the training examples:
  $E(\mathbf{w}) = \sum_i \left( y_i - f_{\mathbf{w}}(\mathbf{x}_i) \right)^2$
• Minimization can be done by gradient descent provided $E$ is differentiable
• This training method is called back-propagation
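A minimal sketch (mine, not from the slides) of this training loop for a single sigmoid neuron, applying the chain rule to get the gradient of the squared error:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lr=0.1, steps=2000):
    """Gradient descent on E(w) = sum_i (y_i - f_w(x_i))^2 for one sigmoid neuron."""
    w = np.zeros(X.shape[1])              # one weight per input feature
    b = 0.0                               # bias
    for _ in range(steps):
        pred = sigmoid(X @ w + b)         # estimated outputs f_w(x_i)
        # Chain rule: dE/dz = 2 (f - y) * sigmoid'(z), with sigmoid' = f (1 - f)
        delta = 2 * (pred - y) * pred * (1 - pred)
        w -= lr * (X.T @ delta)           # dE/dw = X^T delta
        b -= lr * delta.sum()             # dE/db = sum of deltas
    return w, b

# Example: learn OR, a linearly separable function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)
w, b = train(X, y)
print(np.round(sigmoid(X @ w + b)))       # should approach [0. 1. 1. 1.]
```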
Deep Learning:
Learning a Hierarchy of Feature Extractors
• Each layer of the hierarchy extracts progressively higher-level features, from low level to high level.
• All the way from pixels to classifier.
[Diagram: image/video pixels → Layer 1 → Layer 2 → Layer 3 → simple classifier → labels]
Multi-layer Neural Network & Image
Stretch the pixels into a single column vector.
Problems:
• High dimensionality (200×200×3 = 120,000 inputs; see the sketch below)
• Local relationships between pixels are lost
Solution: Convolutional Neural Network
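To make these problems concrete, here is a minimal NumPy sketch (not from the slides) of what flattening costs: a single fully-connected hidden layer on a 200×200×3 image already needs over a hundred million weights, and the flattening throws away which pixels were neighbours.

```python
import numpy as np

image = np.zeros((200, 200, 3))     # a modest colour image
x = image.reshape(-1)               # stretch pixels into one column vector
print(x.shape)                      # (120000,)

num_hidden = 1000                   # hypothetical hidden layer width
num_weights = x.size * num_hidden   # every input connects to every unit
print(num_weights)                  # 120000000 -- 120 million parameters
```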
Convolutional Neural Networks
Source: cs231n, Stanford University
Convolutional Neural Networks (CNN)
• Also known as
ConvNet,
DCNN,
DNN
• CNN = a multi-layer neural network with
1. Local connectivity
2. Weight sharing
CNN: Local Connectivity
[Diagram: 7 input units and 3 hidden units, shown with global connectivity (every input connected to every hidden unit) vs. local connectivity (each hidden unit connected to 3 neighbouring inputs)]
• # input units (neurons): 7
• # hidden units: 3
• Number of parameters
– Global connectivity: 3 × 7 = 21
– Local connectivity: 3 × 3 = 9
https://stats.stackexchange.com/questions/159588/how-does-local-connection-implied-in-the-cnn-algorithm
CNN: Weight Sharing
[Diagram: the same 7-input, 3-hidden-unit layer with local connectivity; without weight sharing each hidden unit has its own weights w1…w9, with weight sharing all hidden units reuse the same weights w1, w2, w3]
• # input units (neurons): 7
• # hidden units: 3
• Number of parameters
– Without weight sharing: 3 × 3 = 9
– With weight sharing: 3 × 1 = 3 (see the sketch below)
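A minimal NumPy sketch (mine, not from the slides) of both ideas at once: three shared weights slide over seven inputs. A stride of 2 is assumed here so that exactly three hidden units come out, matching the figure.

```python
import numpy as np

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 1.0, 2.0])  # 7 input units
w = np.array([0.5, -1.0, 0.25])                     # 3 shared weights (w1, w2, w3)
stride = 2                                          # assumed stride

# Local connectivity: each hidden unit sees only a 3-input window.
# Weight sharing: every window reuses the same 3 weights.
hidden = np.array([w @ x[i:i + 3] for i in range(0, len(x) - 2, stride)])
print(hidden.shape)   # (3,) -- 3 hidden units from only 3 parameters
```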
Layers used to build ConvNets
• Input Layer (input image)
• Convolutional Layer (today's discussion)
• Non-linearity Layer (such as Sigmoid, Tanh, ReLU, etc.)
• Pooling Layer (today's discussion)
• Fully-Connected Layer (exactly as seen in Artificial Neural Networks (ANN))
• Classification Layer (Softmax, SVM loss, etc.)
(A minimal sketch of the simpler layers follows.)
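As a rough sketch (hypothetical helper names, not the lecture's code), the layers other than convolution and pooling are one-liners; convolution and pooling are sketched later in this lecture:

```python
import numpy as np

def relu(x):                        # non-linearity layer
    return np.maximum(0.0, x)

def flatten(x):                     # bridge from volumes to vectors
    return x.reshape(-1)

def fully_connected(x, W, b):       # fully-connected layer, as in an ANN
    return W @ x + b

def softmax(scores):                # classification layer
    e = np.exp(scores - scores.max())
    return e / e.sum()
```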
Convolutional Layer
[Diagram: a 32×32×3 input image (width 32, height 32, depth 3) and a 5×5×3 filter]
Convolve the filter with the image, i.e. "slide it over the image spatially, computing dot products".
Filters always extend the full depth of the input volume.
Convolutional Layer
[Diagram: the 5×5×3 filter (weight mask) placed over a 5×5×3 chunk of the image]
Each placement of the filter produces a single value: the result of taking the dot product between the filter and a small 5×5×3 chunk of the image (i.e. a 5·5·3 = 75-dimensional dot product plus a bias):
$w^T x + b$
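A one-placement sketch (random data assumed) of that dot product:

```python
import numpy as np

chunk = np.random.rand(5, 5, 3)    # a 5x5x3 chunk of the image
filt = np.random.rand(5, 5, 3)     # the 5x5x3 filter (weight mask)
b = 0.1                            # bias

value = np.sum(chunk * filt) + b   # 75-dimensional dot product + bias
print(value)                       # one number: w^T x + b
```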
Convolutional Layer
[Diagram: convolving (sliding) the filter over all spatial locations of the 32×32×3 image produces a 28×28×1 activation map]
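Extending that to every valid location gives the full activation map; a minimal sketch assuming stride 1 and no padding:

```python
import numpy as np

image = np.random.rand(32, 32, 3)
filt = np.random.rand(5, 5, 3)
b = 0.1

out = np.zeros((28, 28))                    # (32 - 5) + 1 = 28 positions per axis
for i in range(28):
    for j in range(28):
        chunk = image[i:i + 5, j:j + 5, :]  # 5x5x3 chunk at (i, j)
        out[i, j] = np.sum(chunk * filt) + b
print(out.shape)                            # (28, 28) activation map
```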
Convolutional Layer: handling multiple output maps
[Diagram: a second 5×5×3 filter produces a second 28×28×1 activation map, a third filter a third, and so on]
With 96 filters in total, the depth of the output volume is 96.
Convolutional Layer
[Diagram: 32×32×3 image → CONV with 96 filters of size 5×5×3 → 28×28×96 activation maps]
The 28×28×96 activation volume can itself be convolved with a 5×5×96 filter; each filter placement again produces one number. Sliding it over all spatial locations gives a 24×24×1 deeper activation map, and with 128 such filters the output volume is 24×24×128.
Multilayer Convolution
[Diagram: 32×32×3 input → CONV, e.g. 96 5×5×3 filters → 28×28×96 → CONV, e.g. 128 5×5×96 filters → 24×24×128 → CONV → …]
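A minimal sketch of this two-layer stack (stride 1 and no padding assumed; slow loops, for illustration only):

```python
import numpy as np

def conv_layer(volume, filters, biases):
    """Valid convolution, stride 1. volume: (H, W, Din); filters: (K, F, F, Din)."""
    H, W, _ = volume.shape
    K, F = filters.shape[0], filters.shape[1]
    out = np.zeros((H - F + 1, W - F + 1, K))
    for k in range(K):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, k] = np.sum(volume[i:i + F, j:j + F, :] * filters[k]) + biases[k]
    return out

x = np.random.rand(32, 32, 3)
h1 = conv_layer(x, np.random.rand(96, 5, 5, 3), np.zeros(96))
print(h1.shape)   # (28, 28, 96)
h2 = conv_layer(h1, np.random.rand(128, 5, 5, 96), np.zeros(128))
print(h2.shape)   # (24, 24, 128)
```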
Any Convolution Layer
• Local connectivity
• Weight sharing
[Diagram: a convolution layer mapping # input channels to # output (activation) maps, with local connectivity and weight sharing. Image credit: A. Karpathy]
A closer look at spatial dimensions
[Diagram: sliding a 3×3 filter over a 7×7 input]
7×7 input (spatially), assume a 3×3 filter applied with stride 1 ⇒ 5×5 output.
Source: cs231n, Stanford University
A closer look at spatial dimensions
7×7 input (spatially), 3×3 filter applied with stride 2 ⇒ 3×3 output.
Source: cs231n, Stanford University
A closer look at spatial dimensions
7×7 input (spatially), 3×3 filter applied with stride 3 ⇒ doesn't fit! A 3×3 filter cannot be applied to a 7×7 input with stride 3.
Source: cs231n, Stanford University
A closer look at spatial dimensions
For an N×N input and an F×F filter, the output size is (N − F) / stride + 1.
E.g. N = 7, F = 3:
  stride 1 ⇒ (7 − 3)/1 + 1 = 5
  stride 2 ⇒ (7 − 3)/2 + 1 = 3
  stride 3 ⇒ (7 − 3)/3 + 1 = 2.33 (not an integer, so the filter doesn't fit; see the helper below)
Source: cs231n, Stanford University
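A small helper (mine) encoding that rule, including the "doesn't fit" case:

```python
def output_size(n, f, stride):
    # Output size: (N - F) / stride + 1; only valid if stride divides (N - F)
    if (n - f) % stride != 0:
        raise ValueError("filter does not fit: (N - F) is not a multiple of stride")
    return (n - f) // stride + 1

print(output_size(7, 3, 1))   # 5
print(output_size(7, 3, 2))   # 3
# output_size(7, 3, 3) raises: (7 - 3)/3 + 1 = 2.33 is not an integer
```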
In practice: common to zero pad
[Diagram: 7×7 input surrounded by a 1-pixel border of zeros]
E.g. input 7×7 (spatially), 3×3 filter applied with stride 1, padded with a 1-pixel border ⇒ 7×7 output.
In general it is common to see CONV layers with stride 1, filters of size F×F, and zero-padding of (F − 1)/2, which preserves the spatial size. E.g.:
  F = 3 ⇒ zero pad with 1
  F = 5 ⇒ zero pad with 2
  F = 7 ⇒ zero pad with 3
Source: cs231n, Stanford University
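Extending the helper with padding shows the size-preserving rule (a sketch, same assumptions as above):

```python
def padded_output_size(n, f, stride=1, pad=0):
    # Output size with zero padding: (N - F + 2P) / stride + 1
    return (n - f + 2 * pad) // stride + 1

print(padded_output_size(7, 3, stride=1, pad=1))          # 7 -- input size preserved

# With stride 1 and pad = (F - 1)/2, spatial size is always preserved:
for f in (3, 5, 7):
    print(f, padded_output_size(32, f, pad=(f - 1) // 2))  # 32 for every F
```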
A closer look at spatial dimensions
[Diagram: 32×32×3 → CONV, 96 5×5×3 filters → 28×28×96 → CONV, 128 5×5×96 filters → 24×24×128 → CONV → …]
E.g. a 32×32 input convolved repeatedly with 5×5 filters shrinks the volumes spatially (32 → 28 → 24 → …). Shrinking too fast is not good.
Source: cs231n, Stanford University
Example
Input volume: 32×32×3
10 5×5 filters with stride 1, pad 2
Output volume size: (32 + 2·2 − 5)/1 + 1 = 32 spatially, so 32×32×10
Source: cs231n, Stanford University
Example
Input volume: 32×32×3
10 5×5 filters with stride 1, pad 2
Number of parameters in this layer?
Each filter has 5·5·3 + 1 = 76 params (+1 for the bias) ⇒ 76 · 10 = 760 (a one-line check follows)
Source: cs231n, Stanford University
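A one-line check of that count (hypothetical helper name):

```python
def conv_params(f, depth_in, num_filters):
    # Each filter has F*F*Din weights plus 1 bias
    return (f * f * depth_in + 1) * num_filters

print(conv_params(5, 3, 10))   # 760
```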
Pooling Layer
• Makes the representations smaller and more manageable
• Operates over each activation map independently
Source: cs231n, Stanford University
Max Pooling
[Figure: max pooling example. Source: cs231n, Stanford University]
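A minimal max-pooling sketch (mine; 2×2 windows with stride 2 assumed, the common choice), taking the maximum within each window of each activation map independently:

```python
import numpy as np

def max_pool(volume, f=2, stride=2):
    """Max pooling over each activation map independently. volume: (H, W, D)."""
    H, W, D = volume.shape
    out_h = (H - f) // stride + 1
    out_w = (W - f) // stride + 1
    out = np.zeros((out_h, out_w, D))
    for i in range(out_h):
        for j in range(out_w):
            window = volume[i * stride:i * stride + f, j * stride:j * stride + f, :]
            out[i, j, :] = window.max(axis=(0, 1))   # max within each map
    return out

print(max_pool(np.random.rand(28, 28, 96)).shape)    # (14, 14, 96)
```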
Convolutional Neural Networks
[Diagram: input image → convolution (learned) → non-linearity → spatial pooling → feature maps, repeated in stages. Slide credit: S. Lazebnik]
LeNet
• Neural network with specialized
connectivity structure
• Stack multiple stages of feature
extractors
• Higher stages compute more global,
more invariant features
• Classification layer at the end
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner,
Gradient-based learning applied to document recognition, Proc. IEEE 86(11): 2278–2324, 1998.
AlexNet
• Similar framework to LeCun’98 but:
• Bigger model (7 hidden layers, 650,000 units, 60,000,000 params)
• More data (10^6 vs. 10^3 images)
• GPU implementation (50x speedup over CPU)
• Trained on two GPUs (3GB each) for a week
A. Krizhevsky, I. Sutskever, and G. Hinton,
ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
[Figure: LeNet and AlexNet architectures. Gradient-Based Learning Applied to Document Recognition, LeCun, Bottou, Bengio, and Haffner, Proc. of the IEEE, 1998; ImageNet Classification with Deep Convolutional Neural Networks, Krizhevsky, Sutskever, and Hinton, NIPS 2012. Slide credit: L. Zitnick]
Resources
• http://deeplearning.net/
– Hub to many other deep learning resources
• https://github.com/ChristosChristofidis/awesome-deep-learning
– A resource collection for deep learning
• https://github.com/kjw0612/awesome-deep-vision
– A resource collection for deep learning in computer vision
• http://cs231n.stanford.edu/syllabus.html
– A nice course on CNNs for visual recognition
Things to remember
• Overview
– multi-layer neural networks
• Convolutional neural network (CNN)
– Convolution,
– nonlinearity,
– max pooling.