This Session
• Neural Network and Image
– Dimensionality
– Local relationship
• Convolutional Neural Network (CNN)
– Convolution Layer
– Non-linearity Layer
– Pooling Layer
– Fully Connected Layer
– Classification Layer
• ImageNet Challenge
– Progress
– Human Level Performance
Neural Networks
Source: [Link]
Multi-layer Neural Network & Image
How to apply a NN to an image?
Multi-layer Neural Network & Image
Stretch pixels
in single
column vector
Multi-layer Neural Network & Image
Stretch pixels
in single
column vector
Problems ?
Multi-layer Neural Network & Image
Stretch pixels
in single
column vector
Problems:
High dimensionality
Local relationship
Multi-layer Neural Network & Image
Stretch pixels
in single
column vector
Problems: Solution ?
High dimensionality
Local relationship
Multi-layer Neural Network & Image
Stretch the pixels into a single column vector
Problems:
– High dimensionality
– Loss of local (spatial) relationships
Solution: Convolutional Neural Network
Convolutional Neural Networks
• Also known as
CNN,
ConvNet,
DCN
• CNN = a multi-layer neural network with
1. Local connectivity
2. Weight sharing
CNN: Local Connectivity
[Diagram: input layer (7 units) and hidden layer (3 units), global vs. local connectivity]
• # input units (neurons): 7
• # hidden units: 3
• Number of parameters
– Global connectivity: 3 x 7 = 21
– Local connectivity: 3 x 3 = 9
CNN: Weight Sharing
[Diagram: without sharing, each connection has its own weight (w1 … w9); with sharing, all hidden units reuse w1, w2, w3]
• # input units (neurons): 7
• # hidden units: 3
• Number of parameters
– Without weight sharing: 3 x 3 = 9
– With weight sharing: 3 x 1 = 3
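The parameter counts above can be checked with a tiny sketch. The numbers (7 inputs, 3 hidden units, filter size 3) come from the slide; the stride-2 tiling of the three receptive fields is an assumption made so that 7 inputs yield exactly 3 hidden units.

```python
import numpy as np

# Toy setting from the slide: 7 input units, 3 hidden units, filter size 3.
# Assumed: the three 3-wide receptive fields tile the input at stride 2.
x = np.arange(7, dtype=float)      # 7 input units: 0, 1, ..., 6
w = np.array([0.5, -1.0, 2.0])     # the 3 shared weights (w1, w2, w3)
stride = 2

# With weight sharing, every hidden unit reuses the same 3 weights,
# so the layer has only 3 parameters instead of 3 x 3 = 9.
hidden = np.array([w @ x[i * stride : i * stride + 3] for i in range(3)])
print(hidden.shape)   # (3,)
```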
Convolutional Neural Networks
Source: cs231n, Stanford University
Layers used to build ConvNets
Input Layer (Input image)
Convolutional Layer
Non-linearity Layer (such as Sigmoid, Tanh, ReLU, PReLU,
ELU, Swish, etc.)
Pooling Layer (such as Max Pooling, Average Pooling, etc.)
Fully-Connected Layer
Classification Layer (Softmax, etc.)
Convolutional Layer
32×32×3 image -> preserve spatial structure
[Figure: image volume, width 32, height 32, depth 3]
Convolutional Layer
Handling multiple input channels: filters always extend the full depth of the input volume
[Figure: 32×32×3 image and a 5×5×3 filter]
Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products"
Convolutional Layer
[Figure: 32×32×3 image and a 5×5×3 filter (weight mask)]
A single value: the result of taking a dot product between the filter and a small 5×5×3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product, plus a bias): wᵀx + b
Convolutional Layer
[Figure: convolve (slide) the 5×5×3 filter over all spatial locations of the 32×32×3 image, producing a 28×28×1 activation map]
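The sliding dot product described above can be sketched directly in NumPy, using the slide's shapes (32×32×3 image, one 5×5×3 filter, stride 1, no padding). This is a naive sketch for clarity, not an efficient implementation.

```python
import numpy as np

# One 5x5x3 filter slid over a 32x32x3 image with stride 1 and no padding
# produces a (32 - 5) + 1 = 28 x 28 activation map.
rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
filt = rng.standard_normal((5, 5, 3))   # the filter spans the full depth
bias = 0.1

out = np.empty((28, 28))
for y in range(28):
    for x in range(28):
        patch = image[y:y + 5, x:x + 5, :]           # a 5x5x3 chunk
        out[y, x] = np.sum(patch * filt) + bias      # w^T x + b
print(out.shape)   # (28, 28)
```

With 96 such filters, the 96 resulting 28×28 maps stack into a 28×28×96 output volume.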
Convolutional Layer
Handling multiple output maps
[Figure: 32×32×3 image convolved with 96 separate 5×5×3 filters, each producing its own 28×28×1 activation map]
Total: 96 filters
Depth of output volume: 96
Image Source: cs231n, Stanford University
Convolutional Layer
Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions
[Figure: 32×32×3 image -> CONV (e.g. 96 5×5×3 filters) -> 28×28×96 activation maps]
Multilayer Convolution
Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions
[Figure: 32×32×3 -> CONV (96 5×5×3 filters) -> 28×28×96 -> CONV (128 5×5×96 filters) -> 24×24×128 -> CONV -> …]
Any Convolution Layer
• Local connectivity
• Weight sharing
• Handling multiple input channels
• Handling multiple output maps
[Diagram: a convolution layer annotated with weight sharing, local connectivity, # input channels, and # output (activation) maps. Image credit: A. Karpathy]
A closer look at spatial dimensions
7×7 input (spatially); assume a 3×3 filter applied with stride 1
=> 5×5 output
A closer look at spatial dimensions
7×7 input (spatially); assume a 3×3 filter applied with stride 2
=> 3×3 output
A closer look at spatial dimensions
7×7 input (spatially); assume a 3×3 filter applied with stride 3
Doesn't fit! A 3×3 filter cannot be applied to a 7×7 input with stride 3.
A closer look at spatial dimensions
Output size: (N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 (not an integer, so the filter doesn't fit)
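The output-size rule can be written as a small helper (a sketch; the name `conv_output_size` is ours):

```python
# Spatial output size of a convolution: (N - F) / stride + 1.
# The filter only "fits" when (N - F) is divisible by the stride.
def conv_output_size(n, f, stride):
    if (n - f) % stride != 0:
        raise ValueError("filter does not fit: (N - F) is not divisible by stride")
    return (n - f) // stride + 1

print(conv_output_size(7, 3, 1))   # 5
print(conv_output_size(7, 3, 2))   # 3
# conv_output_size(7, 3, 3) raises ValueError: stride 3 does not fit a 7x7 input
```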
A closer look at spatial dimensions
E.g. a 32×32 input convolved repeatedly with 5×5 filters shrinks the volumes spatially (32 -> 28 -> 24 -> …). Shrinking too fast is not good; it doesn't work well.
Source: cs231n, Stanford University
In practice: common to zero pad
[Figure: 7×7 input surrounded by a 1-pixel border of zeros]
e.g. input 7×7 (spatially), 3×3 filter applied with stride 1, padded with a 1-pixel border
=> 7×7 output
In general, it is common to see CONV layers with stride 1, filters of size F×F, and zero-padding of (F-1)/2, which preserves the spatial size.
e.g.
F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
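A quick sketch of the padding rule, with an assumed 7×7 input and a 3×3 averaging filter, so pad = (3 - 1)/2 = 1:

```python
import numpy as np

# With stride 1 and zero-padding of (F - 1) / 2, convolution preserves
# the spatial size of the input.
x = np.arange(49, dtype=float).reshape(7, 7)   # 7x7 input
f = np.ones((3, 3)) / 9.0                      # 3x3 averaging filter
pad = (3 - 1) // 2                             # = 1

xp = np.pad(x, pad)                            # zero-pad a 1-pixel border -> 9x9
out = np.empty((7, 7))
for i in range(7):
    for j in range(7):
        out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * f)
print(out.shape)   # (7, 7), the same spatial size as the input
```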
Example
Input volume: 32×32×3
10 5×5 filters with stride 1, pad 2
Output volume size: (32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32×32×10
Example
Input volume: 32×32×3
10 5×5 filters with stride 1, pad 2
Number of parameters in this layer?
Each filter has 5*5*3 + 1 = 76 params (+1 for the bias)
=> 76 * 10 = 760
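The same count as a small helper (a sketch; the name `conv_params` is ours):

```python
# Parameters of a conv layer: each filter has F*F*in_depth weights plus 1 bias,
# and the layer holds num_filters such filters.
def conv_params(f, in_depth, num_filters):
    return (f * f * in_depth + 1) * num_filters

print(conv_params(5, 3, 10))   # 760, matching the example above
```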
Source: cs231n, Stanford University
Convolution as feature extraction
Source: cs231n, Stanford University
Non-linearity Layer
Source: cs231n, Stanford University
Pooling Layer
- makes the representations smaller and more manageable
- operates over each activation map independently:
Source: cs231n, Stanford University
Max Pooling
Source: cs231n, Stanford University
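Max pooling can be sketched in a few lines of NumPy, here with 2×2 windows at stride 2 on an assumed 4×4 activation map:

```python
import numpy as np

# 2x2 max pooling with stride 2: keep the largest value in each window.
# Pooling operates over each activation map independently.
a = np.array([[1., 1., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])

# Split the map into 2x2 blocks, then take the max inside each block.
pooled = a.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6. 8.]
                #  [3. 4.]]
```

The 4×4 map shrinks to 2×2, halving each spatial dimension while keeping the strongest activations.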
Pooling Layer
Source: cs231n, Stanford University
Fully Connected Layer
• Connect every neuron in one layer to every neuron in
another layer
• Same as the traditional multi-layer perceptron neural
network
Image Source: [Link]
• No. of neurons in the last FC layer = No. of classes
Loss/Classification Layer
• SVM classifier (SVM loss / hinge loss / max-margin loss)
• Softmax classifier (softmax loss / cross-entropy loss)
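The softmax classifier loss can be sketched on a single example (the three class scores are assumed toy values):

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])   # unnormalized class scores
y = 0                                  # index of the correct class

shifted = scores - scores.max()        # shift scores for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum()
loss = -np.log(probs[y])               # cross-entropy (softmax) loss
print(round(float(loss), 2))   # 2.04
```

The hinge (SVM) loss would instead sum max(0, s_j - s_y + 1) over the incorrect classes j.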
A typical CNN structure
Image Source: [Link]
ImageNet Challenge
• ~14 million labeled images, 20k classes
• Images gathered from the Internet
• Human labels via Amazon Mechanical Turk
• Challenge: 1.2 million training images, 1000 classes
[Link]/challenges/LSVRC/
Progress on ImageNet Challenge
[Chart: ImageNet top-5 image classification error falling year over year: 16.4 -> 11.7 -> 7.3 -> 6.7 -> 3.57 -> 3.06 -> 2.251]
Best non-ConvNet in 2012: 26.2%
Things to remember
• Neural network and Image
– Neuroscience, Perceptron, Problems due to High
Dimensionality and Local Relationship
• Convolutional neural network (CNN)
– Convolution Layer,
– Nonlinearity Layer,
– Pooling Layer,
– Fully Connected Layer,
– Loss/Classification Layer
• Progress on the ImageNet challenge
– Latest: SENet, winner in 2017
Acknowledgements
• Thanks to the following researchers for making their teaching/research material available online
– Forsyth
– Steve Seitz
– Noah Snavely
– J.B. Huang
– Derek Hoiem
– D. Lowe
– A. Bobick
– S. Lazebnik
– K. Grauman
– R. Zaleski
– Antonio Torralba
– Rob Fergus
– Leibe
– And many more…