Convolutional Neural Networks in Computer Vision
Jochen Lang
[email protected]
Faculté de génie | Faculty of Engineering
Convolutional neural networks
• Convolutional neural networks (CNNs)
– “Classic layers”: Convolutional, pooling and fully-connected layers
– Visualizing CNNs
Convolutional Networks
• Yann LeCun et al., LeNet [1998]
– The paper describes a network architecture first
introduced in 1989. It defines the principles of deep
learning for OCR and speech recognition.
©IEEE, 1998
LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD. Backpropagation
applied to handwritten zip code recognition. Neural computation. 1989;1(4):541-51.
Convolutional Network Layers
• Images are (very) high dimensional, e.g., a 1080p image already has about 2 million pixels, i.e., millions of dimensions
• These dimensions are, however, not independent
• Convolutional layers help to summarize the image
– Easily understood as linear FIR filters or 2D
convolutions
– The filter coefficients are the weights of the neural
network layer
– The filters of a layer slide (as in classical image filtering) over the input image
– The outputs of many different filters, and of the same filter at different locations, are combined deeper in the network
Example: Convolutional Layer
– Padded input in RGB: 7x7x3
– Filter W0 of size 3x3x3 applied at stride 2 (the filter is moved by two pixels after each application)
– Filter W1 of size 3x3x3, also at stride 2
– The outputs are combined into 3x3x2
Image source: cs231n.github.io, “Convolutional Neural Networks for Visual Recognition”, Karpathy et al., Stanford
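Below is a minimal sketch (not part of the original slides) that reproduces the shapes of this example; PyTorch is assumed here purely for illustration. `padding=1` turns a 5x5 RGB input into the padded 7x7x3 volume, and the two filters W0 and W1 become the two output channels.

```python
import torch
import torch.nn as nn

# Two 3x3x3 filters (W0 and W1) applied at stride 2 with zero padding of 1,
# mirroring the cs231n example: a 5x5 RGB input is padded to 7x7x3.
conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 3, 5, 5)    # one RGB input of size 5x5
out = conv(x)
print(out.shape)               # torch.Size([1, 2, 3, 3]) -- combined into 3x3x2
print(conv.weight.shape)       # torch.Size([2, 3, 3, 3]) -- the filters W0 and W1
```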
Convolution: a Closer Look
• Inner product, i.e., multiplying the pixel values with the
weights
• Same as image correlation (cross-correlation): the kernel is not flipped as in a true mathematical convolution
Image source: cs231n.github.io, “Convolutional Neural Networks for Visual Recognition”, Karpathy et al., Stanford
Main Concepts in CNNs
• Convolutional layers introduce the following core ideas into
machine learning
– sparse interaction
• the filter kernels are chosen smaller than the input
image
• the deeper layers in the network are indirectly
connected to many inputs
– parameter sharing
• the same filter kernels are applied at many (or all)
input locations
– equivariant representation
• the filter kernel does not change over the image and
hence if we shift the input image, the output will
shift correspondingly
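As an aside not on the original slide, the equivariance property can be checked directly in code; the sketch below assumes PyTorch for illustration and uses circular padding so that the shift is exact at the image boundary.

```python
import torch
import torch.nn as nn

# Equivariance check: shifting the input shifts the output correspondingly.
# Circular padding is assumed so that the shift is exact at the boundary.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode='circular', bias=False)

x = torch.randn(1, 1, 8, 8)
shift = dict(shifts=(2, 3), dims=(2, 3))

out_then_shift = torch.roll(conv(x), **shift)
shift_then_out = conv(torch.roll(x, **shift))
print(torch.allclose(out_then_shift, shift_then_out, atol=1e-5))   # True
```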
Convolutional Layers
• As in the previous example, a convolutional layer
combines many filter kernels
– E.g., first layer of VGG-16
• Input RGB image of size 224x224 can be viewed
as a tensor of size 224x224x3
• The first convolution will produce 64 output
channels, i.e., 64 multi-channel kernels are
applied to produce an output of 224x224x64
– Multi-channel kernels mean that the filters sum over image area and color, e.g., in VGG net the first filter kernel has size 3x3x3; in the next layer, they sum over the feature map (still 224x224) and its channels (here 64)
Multichannel Convolution
• Multiple input and output channels add further
summations to our convolution operator
• Let the output of the convolution be $Z_{i,j,k}$, with $V$ the input (an image or the features of the previous layer) and $K$ the kernel; then the convolution function is:
$$Z_{i,j,k} = \sum_{l,m,n} V_{l,\,j+m-1,\,k+n-1}\, K_{i,l,m,n}$$
Note that we have four indices for the kernel: $m, n$ for its spatial dimensions, $i$ for the output channel and $l$ for the input channel. The indices $j, k$ are the location of the output.
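The formula can be transcribed almost literally into code. The following NumPy sketch (an illustration, not from the slides) uses 0-based indices and assumes the input has already been padded as needed.

```python
import numpy as np

def conv_multichannel(V, K):
    # V: input features (C_in, H, W), already padded if needed
    # K: kernel (C_out, C_in, kH, kW)
    C_out, C_in, kH, kW = K.shape
    _, H, W = V.shape
    H_out, W_out = H - kH + 1, W - kW + 1
    Z = np.zeros((C_out, H_out, W_out))
    for i in range(C_out):          # output channel i
        for j in range(H_out):      # output row j
            for k in range(W_out):  # output column k
                # 0-based form of Z_{i,j,k} = sum_{l,m,n} V_{l,j+m-1,k+n-1} K_{i,l,m,n}
                Z[i, j, k] = np.sum(V[:, j:j + kH, k:k + kW] * K[i])
    return Z
```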
Padding
• Just as in image filtering, convolutional kernels have to be set up to deal with the boundary. The most common strategies are “valid” and “same” (compare the sketch below)
• “valid” means no padding, i.e., each output is only
calculated from actual input pixels or features
• “same” means zero padding is used to calculate output
for all input pixels or features
[Figure: a filter g applied over an image f with “valid” (left) and “same” (right) padding. Image source: S. Lazebnik]
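A quick way to see the effect of the two strategies is to compare output shapes. The sketch below assumes PyTorch and a 3x3 kernel, for which “same” corresponds to a zero padding of one pixel.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 7, 7)        # one RGB 7x7 input
w = torch.randn(8, 3, 3, 3)        # eight 3x3x3 kernels

valid = F.conv2d(x, w, padding=0)  # "valid": no padding
same = F.conv2d(x, w, padding=1)   # "same": zero-pad by 1 for a 3x3 kernel

print(valid.shape)   # torch.Size([1, 8, 5, 5]) -- only actual input pixels used
print(same.shape)    # torch.Size([1, 8, 7, 7]) -- one output per input pixel
```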
Tensor Indices
• The indices in the equations are 1-indexed (according to
mathematical convention).
• The equation assumes that padding has already been applied to the input
• Example: First convolutional layer in VGG
– Input is $V$ with dimension 3x224x224 (depth first); “same” (zero) padding for a 3x3 filter means that the input has spatial dimensions 226x226 but the output stays the same, i.e., 224x224. Input depth is 3, output depth is 64.
$$Z_{i,j,k} = \sum_{l,m,n} V_{l,\,j+m-1,\,k+n-1}\, K_{i,l,m,n}$$
with $i = 1, \ldots, 64$, $j, k = 1, \ldots, 224$, $l = 1, \ldots, 3$ and $m, n = 1, \ldots, 3$.
– Note that with “valid” padding the indices $j, k$ would only run to 222.
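For illustration (assuming PyTorch as a stand-in, not the original VGG code), the first layer and its shapes can be written as:

```python
import torch
import torch.nn as nn

# Stand-in for the first VGG-16 convolution: 3 input channels, 64 output
# channels, 3x3 kernels, "same" (zero) padding of 1
conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

x = torch.randn(1, 3, 224, 224)   # one RGB image of size 224x224
z = conv1(x)
print(z.shape)                    # torch.Size([1, 64, 224, 224])
print(conv1.weight.shape)         # torch.Size([64, 3, 3, 3]) -- 64 kernels of size 3x3x3
```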
Stride
• Image filters are typically applied with stride = 1, i.e., the filter is moved by 1 pixel at a time
• We can use a larger step, or stride
– The output will be smaller than the input
• In classical convolutional networks the number of channels increases while the spatial resolution decreases deeper in the network
• The same kernel size can then summarize a larger input region
• An alternative to increasing the stride is to apply pooling
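The relation between input size, kernel size, padding and stride is the standard output-size formula; a small helper (illustrative, not from the slides) makes the shrinking effect of the stride explicit.

```python
def conv_output_size(n, k, p, s):
    """Spatial output size for input size n, kernel size k, padding p, stride s."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(224, 3, 1, 1))   # 224 -- "same" padding, stride 1
print(conv_output_size(224, 3, 1, 2))   # 112 -- same kernel, stride 2 halves the resolution
```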
Multichannel Convolution with Stride
• Let the output of the convolution with stride $s$ be $Z_{i,j,k}$, with $V$ the input (an image or the features of the previous layer) and $K$ the kernel; then the convolution function with stride is:
$$Z_{i,j,k} = \sum_{l,m,n} V_{l,\,(j-1)s+m,\,(k-1)s+n}\, K_{i,l,m,n}$$
Note that we still have four indices for the kernel: $m, n$ for its spatial dimensions, $i$ for the output channel and $l$ for the input channel. But the indices $j, k$ for the location of the output are multiplied with the stride $s$ in the input image (see the sketch below).
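Extending the earlier loop sketch by a stride s (again a NumPy/PyTorch illustration, with 0-based indices and a pre-padded input), the formula can be checked against a library convolution:

```python
import numpy as np
import torch
import torch.nn.functional as F

def conv_strided(V, K, s):
    # V: pre-padded input (C_in, H, W); K: kernel (C_out, C_in, kH, kW); s: stride
    C_out, C_in, kH, kW = K.shape
    _, H, W = V.shape
    H_out, W_out = (H - kH) // s + 1, (W - kW) // s + 1
    Z = np.zeros((C_out, H_out, W_out))
    for i in range(C_out):
        for j in range(H_out):
            for k in range(W_out):
                # 0-based form of Z_{i,j,k} = sum_{l,m,n} V_{l,(j-1)s+m,(k-1)s+n} K_{i,l,m,n}
                Z[i, j, k] = np.sum(V[:, j*s:j*s + kH, k*s:k*s + kW] * K[i])
    return Z

rng = np.random.default_rng(0)
V = rng.random((3, 7, 7))
K = rng.random((2, 3, 3, 3))
Z_lib = F.conv2d(torch.tensor(V[None]), torch.tensor(K), stride=2)[0].numpy()
print(np.allclose(conv_strided(V, K, 2), Z_lib))   # True
```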
Example: Stride of One and Two
• Feature map and kernel with same padding
• Stride of one and stride of two
Image source: Vincent Dumoulin and Francesco Visin, A guide to convolution arithmetic for deep learning, 2018.
Relationship to Classic Hidden Layers
• Classic hidden layers connect every input to every
output
• A convolutional layer can be implemented as a classic hidden layer where all weights are zero except those within the kernel support, and where the weights of parallel hidden units are shared.
– This leads to the concept of an infinitely strong prior
• Forced zero weights and forced shared weights
– Clearly backpropagation should still work
– Backpropagation gains extra sums to include all inputs and outputs influenced by the (shared) weights and the sensitivity of the output.
Backpropagation with CNNs
• Let $Z_{i,j,k}$ be the output of a convolutional layer with kernel $K$ and multichannel image $V$
• The output will be a tensor with spatial indices $j, k$ and channel index $i$
• The overall network will have a loss $J(V, K)$ for a given image and kernel
– Then we need to take the gradient tensor
$$G_{i,j,k} = \frac{\partial}{\partial Z_{i,j,k}} J(V, K)$$
from the output to calculate the influence of the kernel weights, i.e., the partials $\partial J / \partial K_{i,l,m,n}$, and to backpropagate the loss further, i.e., $\partial J / \partial V_{l,j,k}$.
Influence of Weights
• Given the gradient tensor at the layer output
$$G_{i,j,k} = \frac{\partial}{\partial Z_{i,j,k}} J(V, K)$$
we need to calculate the partials
$$\frac{\partial}{\partial K_{i,l,m,n}} J(V, K) = \sum_{j,k} G_{i,j,k}\, V_{l,\,(j-1)s+m,\,(k-1)s+n}$$
Note that we have four indices for the kernel: $m, n$ for its spatial dimensions, $i$ for the output channel and $l$ for the input channel.
The equation assumes 1-based indexing and a stride $s$; the sum over $j, k$ runs over the output dimensions.
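A direct, loop-based transcription of this partial derivative (illustrative NumPy, 0-based indices, input assumed pre-padded) looks as follows; it can be verified against automatic differentiation in the same way as the input-gradient sketch after the next slide.

```python
import numpy as np

def grad_wrt_kernel(G, V, kernel_shape, s):
    # G: gradient at the layer output (C_out, H_out, W_out)
    # V: pre-padded layer input (C_in, H, W); s: stride
    C_out, C_in, kH, kW = kernel_shape
    _, H_out, W_out = G.shape
    dK = np.zeros(kernel_shape)
    for i in range(C_out):
        for l in range(C_in):
            for m in range(kH):
                for n in range(kW):
                    # 0-based form of dJ/dK[i,l,m,n] = sum_{j,k} G[i,j,k] V[l, j*s+m, k*s+n]
                    dK[i, l, m, n] = np.sum(
                        G[i] * V[l, m:m + H_out * s:s, n:n + W_out * s:s])
    return dK
```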
Backpropagation through the Layer
• Given the gradient tensor at the layer output
$$G_{i,j,k} = \frac{\partial}{\partial Z_{i,j,k}} J(V, K)$$
we need to calculate the partials
$$\frac{\partial}{\partial V_{l,j,k}} J(V, K) = \sum_{\substack{j',m \text{ s.t.}\\ (j'-1)s+m=j}} \; \sum_{\substack{k',n \text{ s.t.}\\ (k'-1)s+n=k}} \; \sum_{q} K_{q,l,m,n}\, G_{q,j',k'}$$
Note that we have the indices $j, k$ for the spatial dimension of the input and $l$ for the input channel.
The two other summations are over all filter applications that operate on the input at these locations.
All output channels $q$ need to be summed up.
The equation assumes 1-based indexing and a stride $s$ (see the sketch below).
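The sums with the constraints $(j'-1)s+m=j$ and $(k'-1)s+n=k$ are easiest to implement by scattering each output gradient back to the input locations its filter application touched. The sketch below is an illustration in NumPy/PyTorch (0-based indices) and checks the result against automatic differentiation.

```python
import numpy as np
import torch
import torch.nn.functional as F

def grad_wrt_input(G, K, input_shape, s):
    # G: output gradient (C_out, H_out, W_out); K: kernel (C_out, C_in, kH, kW)
    C_out, C_in, kH, kW = K.shape
    _, H_out, W_out = G.shape
    dV = np.zeros(input_shape)
    # Scatter each output gradient back onto every input location the
    # corresponding filter application touched (the sums over q, m, n above).
    for q in range(C_out):
        for j in range(H_out):
            for k in range(W_out):
                dV[:, j*s:j*s + kH, k*s:k*s + kW] += G[q, j, k] * K[q]
    return dV

# Check against automatic differentiation
rng = np.random.default_rng(0)
V = rng.random((3, 7, 7)); K = rng.random((2, 3, 3, 3)); s = 2
Vt = torch.tensor(V[None], requires_grad=True)
Z = F.conv2d(Vt, torch.tensor(K), stride=s)
G = rng.random(tuple(Z.shape[1:]))
Z.backward(torch.tensor(G[None]))
print(np.allclose(grad_wrt_input(G, K, V.shape, s), Vt.grad[0].numpy()))  # True
```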
Example: Back-propagation
• A feature map and kernel convolved with valid padding and stride, and its back-propagation
Image source: Vincent Dumoulin and Francesco Visin, A guide to convolution arithmetic for deep learning, 2018.
“Deconvolution”
• Deconvolution in CNNs refers to convolutions that increase the spatial dimensions of the output relative to the input. But deconvolution is a misnomer.
• Deconvolution as a mathematical operation is defined as recovering a signal that has undergone a convolution.
– Consider $g = f \ast h$ where $f$ is the input image, $h$ is the filter and $g$ is the output image (see the lecture on image processing).
– Deconvolution is recovering $f$ given $g$ and $h$
– This operation is linear in the Fourier domain, $F = G / H$, where the division is performed for each Fourier coefficient.
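A minimal NumPy sketch (illustrative, assuming circular convolution and a noise-free signal) shows the per-coefficient division recovering the input exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.random((64, 64))        # "input image"
h = np.zeros((64, 64))          # filter, zero-padded to the image size
h[:3, :3] = 0.05
h[1, 1] = 0.6                   # dominant centre tap keeps all DFT coefficients non-zero

# Circular convolution g = f * h computed in the Fourier domain
F, H = np.fft.fft2(f), np.fft.fft2(h)
g = np.real(np.fft.ifft2(F * H))

# Deconvolution: per-coefficient division recovers f (no noise, H has no zeros)
f_rec = np.real(np.fft.ifft2(np.fft.fft2(g) / H))
print(np.allclose(f, f_rec))    # True
```

With noise or near-zero Fourier coefficients of $H$ this division becomes ill-conditioned, which is one more reason the term is a misnomer for the CNN operation discussed next.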
“Deconvolution” in CNNs
• Goal of “Deconvolution”
– In many architectures (in particular, autoencoders), we would like the output to be the same size as the input
– We need to go from a “minimal representation” back to the input image size
– This is the same as in backpropagation, when we distribute the loss from the output to the input of a convolutional layer
This can be understood as fractionally-strided convolution
Fractionally-Strided Convolution
Recall that in the strided convolution for the output map $Z_{i,j,k}$ with stride $s$, the indices $(j-1)s+m$ and $(k-1)s+n$ stride the filter over the input indices. Hence if the stride $s < 1$ then the filter will stride “slower” than the input.
• Stride with natural numbers, i.e., $s \in \mathbb{N}$, to reduce the output size.
• Stride with fractions of natural numbers, i.e., $s = 1/n$ with $n \in \mathbb{N}$, to increase the output size.
Fractionally-strided convolution can be realized by adding $n-1$ zero rows and columns in-between the entries of the input map before filtering with stride 1 (see the sketch below).
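The equivalence can be checked in code. The sketch below (PyTorch assumed for illustration, single channel, stride 1/2) compares a transposed convolution with the explicit construction: insert one zero row and column between the input entries, zero-pad by k−1, and correlate with the spatially flipped kernel at stride 1.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 3, 3, dtype=torch.float64)   # small single-channel feature map
w = torch.randn(1, 1, 3, 3, dtype=torch.float64)   # 3x3 kernel

# Transposed ("fractionally-strided") convolution with stride 2
up = F.conv_transpose2d(x, w, stride=2)
print(up.shape)                                    # torch.Size([1, 1, 7, 7])

# The same result from an ordinary stride-1 convolution: insert one zero row and
# column between the input entries, zero-pad by k-1 = 2, and correlate with the
# spatially flipped kernel
stretched = torch.zeros(1, 1, 5, 5, dtype=torch.float64)
stretched[:, :, ::2, ::2] = x
same = F.conv2d(F.pad(stretched, (2, 2, 2, 2)), w.flip([2, 3]))
print(torch.allclose(up, same))                    # True
```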
Fractionally-Strided Convolution
Example:
Example: Strided and Fractionally-strided Convolution
• Convolution of a feature map with a kernel using same padding and a stride larger than one produces a smaller output.
• A corresponding “deconvolution” uses a fractional stride and additional zero padding to return to the original size.
Image source: Vincent Dumoulin and Francesco Visin, A guide to convolution arithmetic for deep learning, 2018.
Convolution as Matrix Operation
• Convolution is a linear operation
• With appropriate matrix shape, a convolution layer can
be expressed as matrix multiply and back-propagation
is then a multiplication with a matrix transpose
• Example:
A 5x5 feature map (after padding) convolved with a 3x3 kernel leads to a matrix of size 9x25 multiplying the feature map reshaped row-major as a 25-dimensional vector for each channel.
$$\begin{pmatrix}
k_{1,1} & k_{1,2} & k_{1,3} & 0 & 0 & k_{2,1} & k_{2,2} & k_{2,3} & 0 & 0 & k_{3,1} & k_{3,2} & k_{3,3} & 0 & 0 & 0 & \cdots & 0 \\
0 & k_{1,1} & k_{1,2} & k_{1,3} & 0 & 0 & k_{2,1} & k_{2,2} & k_{2,3} & 0 & 0 & k_{3,1} & k_{3,2} & k_{3,3} & 0 & 0 & \cdots & 0 \\
0 & 0 & k_{1,1} & k_{1,2} & k_{1,3} & 0 & 0 & k_{2,1} & k_{2,2} & k_{2,3} & 0 & 0 & k_{3,1} & k_{3,2} & k_{3,3} & 0 & \cdots & 0 \\
 & & & & & & & & \ddots
\end{pmatrix}$$
Sparse Matrix Operation
• Back-propagation through a layer and deconvolution are
simple to see if matrices are used.
• Given $\mathbf{z} = C\,\mathbf{v}$, then $\frac{\partial J}{\partial \mathbf{v}} = C^{\top}\mathbf{g}$, where $\mathbf{g}$ is the gradient tensor of the output written as a vector.
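For the 9x25 example above, the matrix and both directions of the computation can be written out explicitly; the sketch below is an illustration in NumPy.

```python
import numpy as np

k = np.arange(1.0, 10.0).reshape(3, 3)          # a 3x3 kernel k_{1,1} ... k_{3,3}
v = np.random.default_rng(0).random((5, 5))     # 5x5 feature map (already padded)

# Build the 9x25 matrix C: one row per output position, columns index the
# row-major flattened input
C = np.zeros((9, 25))
for j in range(3):               # output row
    for i in range(3):           # output column
        for m in range(3):       # kernel row
            for n in range(3):   # kernel column
                C[3 * j + i, 5 * (j + m) + (i + n)] = k[m, n]

# The matrix-vector product reproduces the convolution ...
z = C @ v.reshape(-1)
z_direct = np.array([[np.sum(v[j:j + 3, i:i + 3] * k) for i in range(3)]
                     for j in range(3)])
print(np.allclose(z, z_direct.reshape(-1)))     # True

# ... and back-propagation through the layer is a multiplication with C^T
g = np.random.default_rng(1).random(9)          # output gradient as a vector
dv = (C.T @ g).reshape(5, 5)                    # gradient w.r.t. the 5x5 input
```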
Special Cases of Convolution
• Locally connected layer (or unshared convolution)
• Weights are specific to each location, i.e., we use a different kernel at each output location
$$Z_{i,j,k} = \sum_{l,m,n} V_{l,\,j+m-1,\,k+n-1}\, W_{i,j,k,l,m,n}$$
• Tiled convolution
– Is a compromise between regular convolution and unshared convolution
– Neighboring input regions or tiles use different kernels but distant tiles use the same kernels again, i.e., we rotate through a set of $t$ kernels. Expressed with a modulo $t$, we get
$$Z_{i,j,k} = \sum_{l,m,n} V_{l,\,j+m-1,\,k+n-1}\, K_{i,l,m,n,\,j\%t+1,\,k\%t+1}$$
Other Layers in CNNs
• Other layers are needed besides our classic hidden
layer, which is referred to as a fully-connected layer in
CNNs, and the convolutional layer
• Pooling layer
– Combines different outputs and summarizes the result, e.g., max pooling (select the highest value)
• Activation layer
– Separates the linear weighting of the inputs and the activation function into separate layers, e.g., a ReLU layer
• Already discussed
Convolutional Layer in Context
Image source: I. Goodfellow et al., Deep Learning, MIT Press, 2016
Pooling Layers
• Pooling makes the output invariant to small translations
– E.g. Max Pooling
• The output of the max pooling operators is the
maximum input over the input neighborhood
• E.g., a 2x2 neighborhood with the values 0.3, 0.7, 0.2 and 0.1 is pooled to the single output 0.7
• The output is the same independent of where in the neighborhood the maximum occurs
• Most pooling operations have no parameters to learn,
e.g., max pooling
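The 2x2 example above in code (PyTorch assumed for illustration); note that the pooling layer has no learnable parameters and that moving the maximum within the window leaves the output unchanged.

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[0.3, 0.7],
                    [0.2, 0.1]]]])       # one channel, one 2x2 neighborhood
pool = nn.MaxPool2d(kernel_size=2)        # no learnable parameters
print(pool(x))                            # tensor([[[[0.7000]]]])

# Shifting the maximum within the window does not change the output
x_shifted = torch.tensor([[[[0.7, 0.3],
                            [0.2, 0.1]]]])
print(pool(x_shifted))                    # tensor([[[[0.7000]]]])
```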
Pooling
• While max pooling seems to be used most frequently, other options can be used
– Median pooling
– Mean pooling
– $L^p$-norm pooling
• The translational invariance of pooling is only appropriate if we don’t care about the exact location of an output
– Pooling breaks the spatial connection between the input neighborhood and the output
Alternatives to Pooling
• Pooling is sometimes regarded as unfavorable
• Instead, a fully-convolutional design can be used
– The pooling operation can be replaced by a 1x1 convolution, as it sums over the channels
– It is less arbitrary, as learnable weights are used
– It keeps the architecture more homogeneous, potentially giving a speed advantage
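As an illustration (PyTorch assumed), a 1x1 convolution is simply a learnable weighted sum over the channels at each spatial location:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)               # a 64-channel feature map
conv1x1 = nn.Conv2d(64, 64, kernel_size=1)   # learnable weighted sum over channels
print(conv1x1(x).shape)                      # torch.Size([1, 64, 56, 56])
print(conv1x1.weight.shape)                  # torch.Size([64, 64, 1, 1])
```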
Visualizing CNNs
• The filters in CNNs are multi-channel filters, and all but the first convolutional layer can therefore not be viewed directly.
• Max pooling introduces spatial uncertainty about where an output activation originates
• Deconvnet by Zeiler and Fergus, ECCV 2014, introduces a way to visualize a classic convolutional network beyond the first convolutional layer
– Essentially a form of backpropagation to the input to
trace back where high activations originate in the
image
Structure of Deconvnet
[Figure: a deconvnet layer attached to a convnet layer. Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014.]
Depiction of Deconvnet Operation
• Key ideas:
– Attach a separate net which operates in reverse
– Replace max pooling with max location switches
Zeiler and Fergus, Visualizing and Understanding
Convolutional Networks, ECCV 2014.
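The “switches” correspond to recording the argmax locations during pooling and re-using them for unpooling on the way back. A minimal sketch (PyTorch assumed for illustration):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 4, 4)

# Max pooling that also records the "switches", i.e., where each maximum came from
y, switches = F.max_pool2d(x, kernel_size=2, return_indices=True)

# Deconvnet-style unpooling: each activation is placed back at its recorded
# location; the rest of the neighborhood is filled with zeros
x_up = F.max_unpool2d(y, switches, kernel_size=2)
print(y.shape, x_up.shape)   # torch.Size([1, 1, 2, 2]) torch.Size([1, 1, 4, 4])
```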
Convnet
• The convnet used by Zeiler and Fergus
Zeiler and Fergus, Visualizing and Understanding
Convolutional Networks, ECCV 2014.
Visualization of First Layer
• Max activation for all kernels in Layer 1 (left) and, for 9 kernels (right), the 9 input images creating the top activations
Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014.
Layer 2
• Similar to traditional corner or feature detectors
• Responding to shape, texture, structure, etc.
Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014.
Layer 3
• More complex groupings, i.e., textures or patterns
• Even face textures identified
Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014.
Layer 4
• More high-level groupings
• Not yet class-label or object specific
• E.g., dog faces, animal legs, foreground water
Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014.
Layer 5
• Close to the final output, i.e., object classes
• Object-specific with large variation, including pose variations
Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014.
Summary
• Convolutional Layers
– Multichannel convolution
– Backpropagation
• Other Layers
– Pooling
– Fully-connected layers
– Activation functions
• Deconvnet
– Visualizing activations