Convolutional Neural Networks
INTELLIGENT SYSTEMS FOR PATTERN RECOGNITION (ISPR)
DAVIDE BACCIU – DIPARTIMENTO DI INFORMATICA - UNIVERSITA’ DI PISA
[email protected]
Lecture Outline
○ Introduction and historical perspective
○ Dissecting the components of a CNN
● Convolution, stride, pooling
○ CNN architectures for machine vision
● Putting components back together
● From LeNet to ResNet
○ Advanced topics
● Interpreting convolutions
● Advanced models and applications
(Split in two lectures)
DAVIDE BACCIU - ISPR COURSE 2
CNN Lecture – Part I
Introduction
Convolutional Neural Networks
Destroying Machine Vision research since 2012
DAVIDE BACCIU - ISPR COURSE 5
Neocognitron
○ Hubel & Wiesel ('59) model of visual processing in the brain
● Simple cells responding to localized features
● Complex cells pooling responses of simple cells for invariance
○ Fukushima ('80) built the first hierarchical image processing architecture exploiting this model
● Trained by unsupervised learning
DAVIDE BACCIU - ISPR COURSE 6
CNN for Sequences
○ Apply a bank of 16 convolution kernels
to sequences (windows of 15 elements)
○ Trained by backpropagation with
parameter sharing
○ Guess who introduced it?
…yeah, HIM!
Time delay neural network
(Waibel & Hinton, 1987)
DAVIDE BACCIU - ISPR COURSE 7
CNN for Images
The first convolutional neural network for images dates back to 1989 (LeCun)
DAVIDE BACCIU - ISPR COURSE 8
Dense Vector Multiplication
Processing images: the dense way
○ A 32x32x3 image is reshaped into a vector 𝒙 of size 3072
○ Weight matrix 𝑾 of size 100x3072: an input-sized weight vector for each hidden neuron
○ The product 𝑾𝒙 is a 100-dimensional vector: each element contains the activation of one neuron
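A minimal numpy sketch of this dense baseline (array names are illustrative; the sizes follow the slide):

import numpy as np

image = np.random.rand(32, 32, 3)   # toy 32x32 RGB image
x = image.reshape(-1)               # flatten into a 3072-dim vector
W = np.random.rand(100, 3072)       # one input-sized weight vector per hidden neuron
activations = W @ x                 # 100 values, one activation per neuron
print(activations.shape)            # (100,)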
DAVIDE BACCIU - ISPR COURSE 10
About invariances
MLPs are positional: we (most likely) need translation invariance!
• If we unfold the two images into two vectors, the features identifying the cat will be in different positions
• But it still remains a picture of a cat, which we would like to classify as such irrespective of its position in the image
DAVIDE BACCIU - ISPR COURSE 11
An inductive bias to keep in mind
Nearby pixels are more correlated than far-away ones
The input representation should not destroy pixel relationships (like vectorization does)
DAVIDE BACCIU - ISPR COURSE 12
Convolution (Refresher)
5x5 filter: sum of 25 multiplications + bias
32x32 matrix input preserving spatial structure
DAVIDE BACCIU - ISPR COURSE 13
Adaptive Convolution
Two 3x3 image patches, centered at positions (2,2) and (9,7)
1 0 1      1 0 1
2 3 4      0 2 0
1 0 1      1 0 1
are convolved with a filter (kernel) with (adaptive) weights 𝑤𝑖
𝑤1 𝑤2 𝑤3
𝑤4 𝑤5 𝑤6
𝑤7 𝑤8 𝑤9
giving 𝑐1 = 𝒘𝑇 𝒙2,2 and 𝑐2 = 𝒘𝑇 𝒙9,7
𝑐1 = 𝑤1 + 𝑤3 + 2𝑤4 + 3𝑤5 + 4𝑤6 + 𝑤7 + 𝑤9
𝑐2 = 𝑤1 + 𝑤3 + 2𝑤5 + 𝑤7 + 𝑤9
DAVIDE BACCIU - ISPR COURSE 14
Convolutional Features
Convolution features: a 32x32 image gives a 28x28 map (with a 5x5 filter)
Slide the filter on the image, computing elementwise products and summing up
DAVIDE BACCIU - ISPR COURSE 15
Multi-Channel Convolution
The 5x5x3 convolution filter has a number of slices equal to the number of image channels
32x32x3 input image
DAVIDE BACCIU - ISPR COURSE 16
Multi-Channel Convolution
28x28 convolution map
All channels are typically convolved together
o They are summed up in the convolution
o The convolution map stays two-dimensional
DAVIDE BACCIU - ISPR COURSE 17
Stride
○ Basic convolution slides the filter
on the image one pixel at a time
● Stride = 1
DAVIDE BACCIU - ISPR COURSE 18
Stride
○ Basic convolution slides the filter
on the image one pixel at a time
● Stride = 1
○ Can define a different stride
● Hyperparameter
stride = 2
Works in both directions!
DAVIDE BACCIU - ISPR COURSE 26
Stride
○ Basic convolution slides the filter
on the image one pixel at a time
● Stride = 1
○ Can define a different stride
● Hyperparameter
○ Stride reduces the number of multiplications
● Subsamples the image
stride = 3
DAVIDE BACCIU - ISPR COURSE 27
Activation Map Size
What is the size of the image after application of a filter with a given
size and stride?
W=7
Take a 3x3 filter with stride 1
K=3, S=1
H=7
Output image is: 5x5
DAVIDE BACCIU - ISPR COURSE 30
Activation Map Size
What is the size of the image after application of a filter with a given
size and stride?
W=7
Take a 3x3 filter with stride 2
K=3, S=2
H=7
Output image is: 3x3
DAVIDE BACCIU - ISPR COURSE 31
Activation Map Size
What is the size of the image after application of a filter with a given
size and stride?
W=7, H=7
General rule
𝑊′ = (𝑊 − 𝐾)/𝑆 + 1
𝐻′ = (𝐻 − 𝐾)/𝑆 + 1
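A minimal Python check of this rule on the examples above (the function name is illustrative):

def output_size(w, k, s):
    # (W - K) / S + 1, meaningful only when S divides (W - K) evenly
    return (w - k) // s + 1

print(output_size(7, 3, 1))    # 5 -> 5x5 output
print(output_size(7, 3, 2))    # 3 -> 3x3 output
print((7 - 3) % 3 == 0)        # False: stride 3 does not fit (see the next slide)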
DAVIDE BACCIU - ISPR COURSE 32
Activation Map Size
What is the size of the image after application of a filter with a given
size and stride?
W=7
Take a 3x3 filter with stride 3
K=3, S=3
H=7
Output image is: not really an image!
DAVIDE BACCIU - ISPR COURSE 33
Zero Padding
Add columns and rows of zeros to the border of the image
W=7, H=7 image padded with a border of zeros
DAVIDE BACCIU - ISPR COURSE 34
Zero Padding
Add columns and rows of zeros to the border of the image
W=7, H=7, P=1 (border of zeros), K=3, S=1
Output image is?
𝑊′ = (𝑊 − 𝐾 + 2𝑃)/𝑆 + 1
Output image is: 7x7
DAVIDE BACCIU - ISPR COURSE 35
Zero Padding
Add columns and rows of zeros to the border of the image
W=7, H=7, P=1
Zero padding serves to retain the original size of the image
𝑃 = (𝐾 − 1)/2
Pad as necessary to perform convolutions with a given stride S
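A quick numeric check of the padded rule, and of P = (K − 1)/2 preserving the size for odd kernels at stride 1 (a small sketch; the function name is illustrative):

def padded_output_size(w, k, s, p):
    # (W - K + 2P) / S + 1
    return (w - k + 2 * p) // s + 1

print(padded_output_size(7, 3, 1, 1))       # 7: the original size is retained
for k in (3, 5, 7):
    p = (k - 1) // 2                        # "same"-style padding
    print(padded_output_size(32, k, 1, p))  # 32 for every odd K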
DAVIDE BACCIU - ISPR COURSE 36
Feature Map Transformation
𝒘𝑇 𝒙𝑖,𝑗 + 𝑏  →  max(0, 𝒘𝑇 𝒙𝑖,𝑗 + 𝑏)
32x32x3 input → 32x32 feature map → 32x32 transformed feature map
○ Convolution is a linear operator
○ Apply an element-wise nonlinearity to obtain a transformed feature map
DAVIDE BACCIU - ISPR COURSE 37
Pooling
○ Operates on the feature map to make the representation
● Smaller (subsampling)
● Robust to (some) transformations
Feature map (W=4, H=4)
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4
Max pooling with 2x2 filters, stride = 2
Pooled map (W’=2, H’=2)
6 8
3 4
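A minimal numpy check of this example (2x2 max pooling with stride 2, via a reshape trick):

import numpy as np

fmap = np.array([[1, 1, 2, 4],
                 [5, 6, 7, 8],
                 [3, 2, 1, 0],
                 [1, 2, 3, 4]])
# split into non-overlapping 2x2 blocks and take the max of each block
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 8]
                #  [3 4]]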
DAVIDE BACCIU - ISPR COURSE 38
Pooling Facts
○ Max pooling is the most frequently used, but other forms are possible
● Average pooling
● L2-norm pooling
● Random pooling
○ It is uncommon to use zero padding with pooling
𝑊′ = (𝑊 − 𝐾)/𝑆 + 1
DAVIDE BACCIU - ISPR COURSE 39
The Convolutional Architecture
Convolutional layer (Input → Convolutional Filters (strided adaptive convolution) → Nonlinearity (ReLU) → Pooling (max) → to next layer)
○ An architecture made by a hierarchical composition of the basic elements
○ Convolution layer is an abstraction for the composition of the 3 basic operations
○ Network parameters are in the convolutional component
DAVIDE BACCIU - ISPR COURSE 40
A Bigger Picture
Input → CL 1 → CL 2 → CL 3 → CL 4 (sparse connectivity) → FCL 1 → FCL 2 → Output (dense connectivity)
CL -> Convolutional Layer: contains several convolutional filters with different size and stride
FCL -> Fully Connected Layer
DAVIDE BACCIU - ISPR COURSE 41
Convolutional Filter Banks
𝐷𝐾 convolutional filters of size 𝐾 × 𝐾 applied to an 𝐻 × 𝑊 × 𝐷𝐼 input
Feature map + nonlinearity: 𝐻′ × 𝑊′ × 𝐷𝐾
Pooling: 𝐻′′ × 𝑊′′ × 𝐷𝐾
Number of model parameters due to this convolution element: 𝐾 × 𝐾 × 𝐷𝐼 × 𝐷𝐾 (add 𝐷𝐾 bias terms)
Pooling is often (not always) applied independently on the 𝐷𝐾 convolutions
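A quick check of this parameter count against the first convolution of the Keras example on the next slide (K=5, D_K=32 filters, and assuming an RGB input so D_I=3):

K, D_I, D_K = 5, 3, 32
params = K * K * D_I * D_K + D_K    # weights plus one bias per filter
print(params)                        # 2432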
DAVIDE BACCIU - ISPR COURSE 42
Specifying CNN in Code (Keras)
Number of convolution filters 𝐷𝑘 Define input size (only first hidden layer)
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Activation, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, kernel_size=(5, 5), strides=(1, 1),
                 activation='relu',
                 input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Conv2D(64, (5, 5)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(1000, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
Keras does all the calculations for you to determine the final input size to the dense layer
DAVIDE BACCIU - ISPR COURSE 43
A (Final?) Note on Convolution
○ We know that discrete convolution between an image 𝐼 and a filter/kernel 𝐾 is
(𝐼 ∗ 𝐾)(𝑖, 𝑗) = Σ𝑚 Σ𝑛 𝐼(𝑖 − 𝑚, 𝑗 − 𝑛) 𝐾(𝑚, 𝑛)
and it is commutative.
○ In practice, the convolution implementation in DL libraries does not flip the kernel
(𝐼 ∗ 𝐾)(𝑖, 𝑗) = Σ𝑚 Σ𝑛 𝐼(𝑖 + 𝑚, 𝑗 + 𝑛) 𝐾(𝑚, 𝑛)
which is cross-correlation, and it is not commutative.
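A small numpy/scipy sketch of the difference (valid mode, 2D):

import numpy as np
from scipy.signal import convolve2d, correlate2d

I = np.random.rand(5, 5)
K = np.random.rand(3, 3)

conv = convolve2d(I, K, mode='valid')               # true convolution (kernel flipped)
corr = correlate2d(I, K, mode='valid')              # what DL layers actually compute
flip = correlate2d(I, K[::-1, ::-1], mode='valid')  # correlation with a flipped kernel

print(np.allclose(conv, flip))   # True: convolution = correlation with the flipped kernel
print(np.allclose(conv, corr))   # False in general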
DAVIDE BACCIU - ISPR COURSE 44
CNN as a Sparse Neural Network
Let us take a 1-D input (sequence) to ease graphics
Convolution
(each hidden unit is connected to a window of 3 inputs, and the edge weights a, b, c are reused at every position)
Input
Convolution amounts to sparse connectivity (reduce parameters)
with parameter sharing (enforces invariance)
DAVIDE BACCIU - ISPR COURSE 45
Dense Network
The dense counterpart would look like this
DAVIDE BACCIU - ISPR COURSE 46
Strided Convolution
Make connectivity sparser
DAVIDE BACCIU - ISPR COURSE 47
Max-Pooling and Spatial Invariance
A feature is detected even if it is spatially translated
Pooling
Feature map
Pooling
Feature map
DAVIDE BACCIU - ISPR COURSE 48
Cross Channel Pooling and Spatial Invariance
(pooling is computed across different feature maps, e.g. map 1 and map 3, obtained from the same input)
DAVIDE BACCIU - ISPR COURSE 49
Hierarchical Feature Organization
The deeper the layer, the larger the receptive field of a unit
DAVIDE BACCIU - ISPR COURSE 50
Zero-Padding Effect
Assuming
no pooling
DAVIDE BACCIU - ISPR COURSE 51
CNN Lecture – Part II
CNN Training
Variants of the standard backpropagation that account for the fact that
connections share weights (convolution parameters)
The gradient ∆𝑤𝑖 is obtained by summing the contributions from all connections sharing the weight 𝑤𝑖 (e.g. the units 𝑎1, 𝑎2, 𝑎3 each reuse the weights 𝑤1, 𝑤2, 𝑤3)
Backpropagating gradients from convolutional layer N to N-1 is not as simple
as transposing the weight matrix (need deconvolution with zero padding)
DAVIDE BACCIU - ISPR COURSE 53
Backpropagating on Convolution
Convolution with K=3, S=1
Input is a 4x4 image
Output is a 2x2 image
Backpropagation step requires
going back from the 2x2 to the
4x4 representation
Can write convolution as dense multiplication with shared weights
Backpropagation is performed by multiplying the 4x1 output representation by the transpose of this matrix
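A minimal numpy sketch of this view, following the 4x4 input / 3x3 kernel example above (variable names are illustrative):

import numpy as np
from scipy.signal import correlate2d

X = np.arange(16, dtype=float).reshape(4, 4)   # 4x4 input image
K = np.random.rand(3, 3)                       # 3x3 kernel, stride 1 -> 2x2 output

# Build the 4x16 dense matrix C with shared weights: row r corresponds to
# output position (i, j) and holds the kernel weights at the right columns
C = np.zeros((4, 16))
for i in range(2):
    for j in range(2):
        for m in range(3):
            for n in range(3):
                C[i * 2 + j, (i + m) * 4 + (j + n)] = K[m, n]

y = C @ X.reshape(-1)                          # forward pass as a dense multiplication
print(np.allclose(y.reshape(2, 2), correlate2d(X, K, mode='valid')))  # True

grad_y = np.ones(4)                            # some upstream gradient (4x1)
grad_X = (C.T @ grad_y).reshape(4, 4)          # backprop: multiply by the transpose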
DAVIDE BACCIU - ISPR COURSE 54
Deconvolution (Transposed Convolution)
We can obtain the transposed convolution using the same logic as the forward convolution
K=3, S=1, P=0
If you had no padding in the forward convolution, you need to zero-pad (heavily) when performing the transposed convolution
DAVIDE BACCIU - ISPR COURSE 55
Deconvolution (Transposed Convolution)
If you have striding, you need to fill in the convolution map with zeroes to
obtain a correctly sized deconvolution
K=3, S=2, P=1
https://github.com/vdumoulin/conv_arithmetic
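In Keras this operation is available as a layer; a minimal sketch (the sizes are illustrative, not tied to a specific architecture):

from keras.models import Sequential
from keras.layers import Conv2DTranspose

model = Sequential()
# upsample a 14x14x64 map to 28x28x32: stride 2 fills in zeros internally
model.add(Conv2DTranspose(32, kernel_size=(3, 3), strides=(2, 2),
                          padding='same', input_shape=(14, 14, 64)))
print(model.output_shape)   # (None, 28, 28, 32)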
DAVIDE BACCIU - ISPR COURSE 56
LeNet-5 (1998)
○ Grayscale images
○ Filters are 5x5 with stride 1 (sigmoid nonlinearity)
○ Pooling is 2x2 with stride 2
○ No zero padding
DAVIDE BACCIU - ISPR COURSE 57
AlexNet (2012) - Architecture
ImageNet Top-5 : 15.4%
○ RGB images 227x227x3
○ 5 convolutional layers + 3 fully connected layers
○ Split into two parts (top/bottom) each on 1 GPU
DAVIDE BACCIU - ISPR COURSE 58
Data Augmentation
Key intuition: if I have an image with a given label, I can transform it (by flipping, rotation, etc.) and the resulting image will still have the same label
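A minimal Keras sketch of label-preserving augmentation (parameter values are illustrative):

from keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(rotation_range=20,       # random rotations up to 20 degrees
                               width_shift_range=0.1,   # random horizontal shifts
                               height_shift_range=0.1,  # random vertical shifts
                               horizontal_flip=True)    # random mirroring
# augmenter.flow(x_train, y_train, batch_size=32) then yields batches of
# randomly transformed images paired with the original labels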
DAVIDE BACCIU - ISPR COURSE 59
AlexNet - Innovations
○ Use heavy data augmentation (rotations, random crops, etc.)
○ Introduced the use of ReLU
○ Dense layers regularized by dropout
DAVIDE BACCIU - ISPR COURSE 60
ReLU Nonlinearity
Sigmoid issues: saturation, non zero-centered outputs. ReLU issue: dead units!
○ ReLU helps counteract vanishing gradients
● The sigmoid first derivative vanishes as we increase or decrease z
● The ReLU first derivative is 1 when the unit is active and 0 elsewhere
● The ReLU second derivative is 0 (no second-order effects)
○ Easy to compute (zero thresholding)
○ Favors sparsity
DAVIDE BACCIU - ISPR COURSE 61
AlexNet - Parameters
○ 62.3 million parameters (6% in convolutions)
○ 5-6 days to train on two GTX 580 GPUs (95% of the time in convolutions)
DAVIDE BACCIU - ISPR COURSE 62
VGGNet – VGG16 (2014)
ImageNet Top-5 : 7.3%
○ Standardized convolutional layer
● 3x3 convolutions with stride 1
● 2x2 max pooling with stride 2 (not after every convolution)
○ Various configurations analysed, but the best has
● 13 Convolutional + 3 Fully Connected layers (16 weight layers in total)
● About 140 million parameters (85% in FC)
DAVIDE BACCIU - ISPR COURSE 63
GoogLeNet (2015)
ImageNet Top-5 : 6.7%
Inception Module
• Kernels of different size to capture details at varied scale
• Aggregated before sending to the next layer
• Average pooling
• No fully connected layers
Why 1x1 convolutions? (see next slide)
DAVIDE BACCIU - ISPR COURSE 64
1x1 Convolutions are Helpful
Take 5 kernels of size 1x1x64: a 56x56x64 feature map becomes 56x56x5
By placing 1x1 convolutions before larger kernels in the Inception module, the
number of input channels is reduced, saving computations and parameters
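A minimal Keras sketch of this channel-reduction effect, with the sizes from the slide:

from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(5, kernel_size=(1, 1), input_shape=(56, 56, 64)))
print(model.output_shape)      # (None, 56, 56, 5)
print(model.count_params())    # 325 = 1*1*64*5 weights + 5 biases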
DAVIDE BACCIU - ISPR COURSE 65
Back on GoogLeNet
Auxiliary outputs to inject gradients at deeper layers
○ Only 5 million parameters
○ 12X fewer parameters than AlexNet
○ Followed by v2, v3 and v4 of the Inception module
● More filter factorization
● Introduced heavy use of Batch Normalization
DAVIDE BACCIU - ISPR COURSE 66
Batch Normalization
○ Very deep neural networks are subject to internal covariate shift
● The distribution of inputs to a layer N might vary (shift) with different minibatches (due to adjustments of layer N-1)
● Layer N can get confused by this
● Solution is to normalize for mean and variance in each minibatch (a bit more articulated than this, actually)
Normalization over the 𝑁𝑏 elements of the minibatch:
𝜇𝑏 = (1/𝑁𝑏) Σ𝑖=1..𝑁𝑏 𝑥𝑖
𝜎𝑏² = (1/𝑁𝑏) Σ𝑖=1..𝑁𝑏 (𝑥𝑖 − 𝜇𝑏)²
𝑥̂𝑖 = (𝑥𝑖 − 𝜇𝑏) / √(𝜎𝑏² + 𝜖)
Scale and shift: 𝑦 = 𝛾𝑥̂𝑖 + 𝛽
Trainable linear transform potentially allowing to cancel unwanted zero-centering effects (e.g. sigmoid)
Need to backpropagate through this!
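A minimal numpy sketch of the forward pass (per-feature statistics over the minibatch; names are illustrative):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: (N_b, features) minibatch
    mu = x.mean(axis=0)                    # per-feature minibatch mean
    var = x.var(axis=0)                    # per-feature minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalization
    return gamma * x_hat + beta            # trainable scale and shift

x = np.random.randn(64, 100) * 3 + 5
y = batch_norm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(), y.std())                   # close to 0 and 1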
DAVIDE BACCIU - ISPR COURSE 67
ResNet (2015)
The beginning of the Ultra-Deep Network Era (152 layers). ImageNet Top-5 : 3.57%
Why wasn’t this working
before?
Gradient vanishes when backpropagating too deep!
DAVIDE BACCIU - ISPR COURSE 68
ResNet Trick
Residual block: 𝑋 → 3x3 convolution → ReLU → 3x3 convolution → 𝐹(𝑋), output is 𝐹(𝑋) + 𝑋 followed by a ReLU
The input to the block 𝑋 bypasses the convolutions and is then combined with its residual 𝐹(𝑋) resulting from the convolutions
When backpropagating, the gradient flows in full through these bypass connections
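A minimal Keras functional-API sketch of such a block (identity shortcut; the channel count is illustrative):

from keras.models import Model
from keras.layers import Input, Conv2D, Activation, Add

x_in = Input(shape=(56, 56, 64))
f = Conv2D(64, (3, 3), padding='same')(x_in)
f = Activation('relu')(f)
f = Conv2D(64, (3, 3), padding='same')(f)   # F(X): the residual branch
out = Activation('relu')(Add()([f, x_in]))  # F(X) + X, then ReLU
block = Model(x_in, out)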
DAVIDE BACCIU - ISPR COURSE 69
ResNet & Batch Norm
When connecting several Residual Blocks in series, one needs to be careful about amplification/compounding of variance due to the residual connectivity
• Batch norm can alleviate this effect
DAVIDE BACCIU - ISPR COURSE 70
MobileNets
Making CNNs efficient to run on mobile devices by depthwise separable convolutions
Basically, run channel-independent convolutions followed by 1x1 convolutions for cross-channel mixing
arxiv.org/pdf/1704.04861.pdf
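A minimal Keras sketch of one depthwise separable block (keras.layers.SeparableConv2D fuses the two steps; sizes are illustrative):

from keras.models import Sequential
from keras.layers import DepthwiseConv2D, Conv2D

model = Sequential()
# 3x3 convolution applied independently to each input channel
model.add(DepthwiseConv2D((3, 3), padding='same', input_shape=(112, 112, 32)))
# 1x1 convolution mixing information across channels
model.add(Conv2D(64, (1, 1)))
print(model.output_shape)   # (None, 112, 112, 64)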
DAVIDE BACCIU - ISPR COURSE 71
CNN Architecture Evolution
DAVIDE BACCIU - ISPR COURSE 72
Transfer learning
Use (part of) a model trained (pretrained) by someone on a large dataset as a "feature extractor" on problems with fewer data, fine-tuning only the predictor part
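A minimal Keras sketch of this idea, using a pretrained VGG16 backbone as the frozen feature extractor (the backbone choice and layer sizes are illustrative):

from keras.applications import VGG16
from keras.models import Sequential
from keras.layers import Flatten, Dense

backbone = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
backbone.trainable = False                  # freeze the pretrained convolutional part

model = Sequential()
model.add(backbone)
model.add(Flatten())
model.add(Dense(256, activation='relu'))    # new predictor part, trained on the small dataset
model.add(Dense(10, activation='softmax'))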
Understanding CNN Embedding
tSNE projection of AlexNet last
hidden dense layer
https://cs.stanford.edu/people/karpathy/cnnembed/
DAVIDE BACCIU - ISPR COURSE 74
Interpreting Intermediate Levels
○ What about the information captured in convolutional layers?
○ Visualize kernel weights (filters)
● Naïve approach
● Works only for early convolutional layers
○ Map the activation of the convolutional kernel back in pixel space
● Requires reversing the convolution
● Deconvolution
Zeiler&Fergus, Visualizing and Understanding Convolutional Networks, ICML 2013
DAVIDE BACCIU - ISPR COURSE 75
Deconvolutional Network (DeConvNet)
○ Attach a DeConvNet to a target layer
○ Plug an input and forward propagate activations until layer
○ Zero all activations except those of the target neuron
○ Backpropagate on the DeConvNet and see what parts of the reconstructed
image are affected
DAVIDE BACCIU - ISPR COURSE 76
Inspect Deconvolution Layers
Deconv 14x14 Pooling Deconv 28x28 ….
DAVIDE BACCIU - ISPR COURSE 77
Filters & Patches – Layer 1
Reconstructed filters in pixel space
Corresponding top-9 image patches
Zeiler&Fergus, Visualizing and Understanding Convolutional Networks, ICML 2013
DAVIDE BACCIU - ISPR COURSE 78
Filters & Patches – Layer 2
Zeiler&Fergus, Visualizing and Understanding Convolutional Networks, ICML 2013
DAVIDE BACCIU - ISPR COURSE 79
Filters & Patches – Layer 3
Zeiler&Fergus, Visualizing and Understanding Convolutional Networks, ICML 2013
DAVIDE BACCIU - ISPR COURSE 80
Filters & Patches – Layer 4
Zeiler&Fergus, Visualizing and Understanding Convolutional Networks, ICML 2013
DAVIDE BACCIU - ISPR COURSE 81
Filters & Patches – Layer 5
Zeiler&Fergus, Visualizing and Understanding
Convolutional Networks, ICML 2013
DAVIDE BACCIU - ISPR COURSE 82
Occlusions
o Measure what happens to feature maps and object classification if we
occlude part of the image
o Slide a grey mask on the image and project back the response of the best
filters using deconvolution
DAVIDE BACCIU - ISPR COURSE 83
Occlusions
Zeiler&Fergus, Visualizing and Understanding Convolutional Networks, ICML 2013
DAVIDE BACCIU - ISPR COURSE 84
Dense CNN
Transition layers: batch normalization + 1×1 convolution + 2×2 average pooling
Dense block layers: batch normalization + ReLU + 3x3 convolution
o Gradient flows well in bypass connections
o Each layer in the dense block has access to
all information from previous layers
Huang et al, Densely Connected Convolutional Networks, CVPR 2017
DAVIDE BACCIU - ISPR COURSE 85
Causal Convolutions
Preventing a convolution from allowing to see into the future…
time
The problem is that the context size grows slowly with depth
DAVIDE BACCIU - ISPR COURSE 86
Causal & Dilated Convolutions
(𝐼 ∗ 𝐾)(𝑖, 𝑗) = Σ𝑚 Σ𝑛 𝐼(𝑖 − 𝑙𝑚, 𝑗 − 𝑙𝑛) 𝐾(𝑚, 𝑛), with dilation factor 𝑙
Similar to striding, but the map size is preserved
Oord et al, WaveNet: A Generative Model for Raw Audio, ICLR 2016
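A minimal Keras sketch of a stack of causal, dilated 1D convolutions in this style (filter counts and dilation rates are illustrative):

from keras.models import Sequential
from keras.layers import Conv1D

model = Sequential()
# padding='causal' masks the future; doubling the dilation grows the context exponentially
model.add(Conv1D(32, 2, dilation_rate=1, padding='causal', input_shape=(1000, 1)))
model.add(Conv1D(32, 2, dilation_rate=2, padding='causal'))
model.add(Conv1D(32, 2, dilation_rate=4, padding='causal'))
print(model.output_shape)   # (None, 1000, 32): the sequence length is preserved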
DAVIDE BACCIU - ISPR COURSE 87
Semantic Segmentation
A traditional CNN cannot be used directly for this task due to the downsampling introduced by the striding and pooling operations
DAVIDE BACCIU - ISPR COURSE 89
Fully Convolutional Networks (FCN)
Convolutional part to extract interesting features at various scales
Fuse information from feature maps of different scale
Learn an upsampling function of the fused map to generate the semantic segmentation map
Shelhamer et at, Fully Convolutional Networks for Semantic Segmentation, PAMI 2016
DAVIDE BACCIU - ISPR COURSE 90
Deconvolution Architecture
Maxpooling indices transferred to decoder to improve the segmentation
resolution.
Badrinarayanan et al, SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, PAMI 2017
DAVIDE BACCIU - ISPR COURSE 91
SegNet Segmentation
Demo here: http://mi.eng.cam.ac.uk/projects/segnet/
DAVIDE BACCIU - ISPR COURSE 92
U-Nets (Big on Biomedical Images)
Few convolutional layers at different resolutions, with pooling layers in between
Low-level information transfer by concatenation of early feature maps
High-level visual features upsampled by upconvolution (deconvolution)
Pixel mask in output (a bit smaller than the original image)
Use Dilated Convolutions
Always perform 3x3 convolutions with no pooling at each level (Level 1, Level 2, Level 3)
Context increases without
o Pooling (which changes the map size)
o Increasing computational complexity
Yu et al, Multi-Scale Context Aggregation by Dilated Convolutions, ICLR 2016
DAVIDE BACCIU - ISPR COURSE 94
Segmentation by Dilated CNN
(Panels: Dilated CNN output vs. ground truth, GT)
Yu et al, Multi-Scale Context Aggregation by Dilated Convolutions, ICLR 2016
DAVIDE BACCIU - ISPR COURSE 95
Object Detection
Object Detection: Faster R-CNN
Any CNN of your choice that can produce a feature map
Generate bounding box proposals
• x,y position
• size
• confidence
Crop, fuse and polish bounding box proposals
Source: S. Yeung, BIODS 220
Software
○ CNNs are supported by any deep learning framework (Keras-TF, Pytorch, MS Cognitive TK, Intel OpenVino, …)
○ Caffe was one of the initiators and was basically built around CNNs
● Introduced protobuffer network specification
● ModelZoo of pretrained models (LeNet, AlexNet, …)
● Support for GPU
● Project converged into PyTorch now
DAVIDE BACCIU - ISPR COURSE 98
Caffe Protobuffer
name: "LeNet"
layer {
name: "data"
type: "Input"
…
input_param { shape: { dim: 64 dim: 1 dim: 28 dim: 28 } }
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
…
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
DAVIDE BACCIU - ISPR COURSE 99
Other Software
○ Matlab distributes its Neural Network Toolbox which allows
importing pretrained models from Keras-TF
○ Want to have a CNN in your browser?
● Try ConvNetJS (https://cs.stanford.edu/people/karpathy/convnetjs/)
DAVIDE BACCIU - ISPR COURSE 100
GUIs
Major hardware producers have GUIs and toolkits wrapping Caffe and Keras-TF to play with CNNs
• Intel OpenVino
• NVIDIA Digits
• Barista
• Plus others…
DAVIDE BACCIU - ISPR COURSE 101
Take Home Messages
o Key things
• Convolutions in place of dense multiplications allow sparse connectivity and weight sharing
• Pooling enforces invariance and allows changing resolution, but shrinks data size
• Full connectivity compresses information from all convolutions, but accounts for 90% of model complexity
o Lessons learned
• ReLU is efficient and counteracts vanishing gradients
• 1x1 convolutions are useful
• Need batch normalization
• Bypass connections allow going deeper
o Dilated (à trous) convolutions
o You can use CNN outside of machine vision
DAVIDE BACCIU - ISPR COURSE 102
Next Lecture
Gated Recurrent Networks
○ Learning with sequential data PART I
○ Gradient issues
○ Gated RNN
● Long Short-Term Memories (LSTM)
● Gated Recurrent Units (GRU)
○ Advanced topics PART II
● Understanding and exploiting memory encoding
● Applications
DAVIDE BACCIU - ISPR COURSE 103