MODULE 4
SYLLABUS
Convolutional Neural Networks – Convolution operation, Motivation,
Pooling, Convolution and Pooling as an infinitely strong prior,
Variants of convolution functions, Structured outputs, Data types,
Efficient convolution algorithms. Practical use cases for CNNs, Case
study - Building the CNN model AlexNet for simple pattern analysis
tasks on benchmark datasets
Introduction
❑Convolutional networks (LeCun, 1989), also known as convolutional
neural networks or CNNs, are a specialized kind of neural network for
processing data that has a known, grid-like topology.
❑The name “convolutional neural network” indicates that the
network employs a mathematical operation called convolution.
Convolution is a specialized kind of linear operation. Convolutional
networks are simply neural networks that use convolution in place
of general matrix multiplication in at least one of their layers.
The Convolution Operation
Convolution on a 2-D input
An example of 2-D convolution without kernel flipping
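A minimal NumPy sketch of this operation (the sliding dot product used in CNNs, i.e., cross-correlation without kernel flipping); the input and kernel sizes below are illustrative assumptions:

    import numpy as np

    def conv2d(I, K):
        # CNN-style "convolution": slide the kernel over the input and sum
        # the elementwise products over each window (no kernel flipping).
        H, W = I.shape
        F = K.shape[0]                      # assume a square F x F kernel
        out = np.zeros((H - F + 1, W - F + 1))
        for i in range(H - F + 1):
            for j in range(W - F + 1):
                out[i, j] = np.sum(I[i:i+F, j:j+F] * K)
        return out

    I = np.arange(16.0).reshape(4, 4)         # illustrative 4x4 input
    K = np.array([[1.0, 0.0], [0.0, -1.0]])   # illustrative 2x2 kernel
    print(conv2d(I, K).shape)                 # (3, 3): output is (H-F+1) x (W-F+1)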
Relation between input size, output size, and filter size
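For reference (stride 1, stated along one spatial dimension): with input size I and filter size F, the output size is I − F + 1 with no padding, and I − F + 2P + 1 with P zeros of padding on each side; choosing P = (F − 1)/2 for odd F therefore keeps the output the same size as the input.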
To make the output size the same as that of the input: Padding
Padding
❑The convolution operation reduces the size of the output.
❑This type of reduction in size is not desirable in general, because it tends to lose
some information along the borders of the image.
❑This problem can be resolved by using padding. In padding, one adds (F − 1)/2
“pixels” all around the borders of the feature map in order to maintain the spatial
footprint.
❑The value of each of these padded feature values is set to 0, irrespective of
whether the input or the hidden layers are being padded.
❑As a result, the spatial height and width of the input volume will both increase
by (F − 1), which is exactly what they reduce by (in the output volume) after the
convolution is performed.
❑The padded portions do not contribute to the final dot product because their
values are set to 0.
❑This type of padding is referred to as half-padding because
(almost) half the filter is sticking out from all sides of the
spatial input in the case where the filter is placed in its
extreme spatial position along the edges.
❑Another useful form of padding is full-padding. In full-
padding, we allow (almost) the full filter to stick out from
various sides of the input.
❑In other words, a portion of the filter of size F − 1 is
allowed to stick out from any side of the input with an
overlap of only one spatial feature.
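A minimal NumPy sketch of half-padding; the feature-map and filter sizes are illustrative assumptions:

    import numpy as np

    F = 3                                      # assumed odd filter size
    x = np.random.rand(32, 32)                 # illustrative 32x32 feature map
    pad = (F - 1) // 2                         # half-padding: (F-1)/2 zeros per border
    x_padded = np.pad(x, pad, mode='constant', constant_values=0)
    print(x_padded.shape)                      # (34, 34): each spatial dimension grows by F-1,
                                               # exactly what a 3x3 convolution would remove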
Strides
❑The traditional approach performs convolution at every position in the feature
map, but this is not necessary.
❑Using the concept of strides, one can reduce the granularity of the convolution
and only perform it at certain spatial positions in the layer.
❑This can lead to a reduction in the spatial footprint of the image or layer, while
still maintaining important features.
❑When a stride of S is used in a layer, the convolution is performed at the
locations 1, S + 1, 2S + 1, and so on along both spatial dimensions of the layer.
❑As a result, the use of strides will reduce each spatial dimension of the layer by
a factor of approximately S.
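A small sketch of the resulting output size; the combined formula with padding P per side is an assumption stated for one spatial dimension:

    def out_size(I, F, S, P=0):
        # spatial output size for input size I, filter size F, stride S, padding P per side
        return (I - F + 2 * P) // S + 1

    print(out_size(32, 5, S=1))   # 28
    print(out_size(32, 5, S=2))   # 14 -- roughly 28/2, i.e., reduced by a factor of about S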
The Basic Structure of a Convolutional Network
❑Each layer in the convolutional network is a 3-dimensional grid structure,
which has a height, width, and depth.
❑The word “depth” refers to the number of channels in each layer, such as the
number of primary color channels (e.g., blue, green, and red) in the input image
or the number of feature maps in the hidden layers.
❑The convolutional neural network functions much like a traditional feed-
forward neural network, except that the operations in its layers are spatially
organized with sparse connections between layers.
❑The three types of layers that are commonly present in a convolutional neural
network are convolution, pooling, and ReLU.
Pooling
❑The pooling operation, however, is quite different from convolution. It
works on small grid regions of size P × P in each layer, and
produces another layer with the same depth.
❑For each square region of size P × P in each of the activation maps,
the maximum of these values is returned. This approach is referred
to as max-pooling. If a stride of 1 is used, then this will produce a
new layer of size (H − P + 1) × (W − P + 1) × D.
❑However, it is more common to use a stride S > 1 in pooling.
❑In such cases, the length of the new layer will be (H − P)/S + 1 and
the breadth will be (W − P)/S + 1. Therefore, pooling drastically
reduces the spatial dimensions of each activation map.
❑Unlike with convolution operations, pooling is done at the level of each
activation map.
❑Whereas a convolution operation simultaneously uses all feature maps in
combination with a filter to produce a single feature value, pooling
independently operates on each feature map to produce another feature map.
❑Therefore, the operation of pooling does not change the number of feature
maps. In other words, the depth of the layer created using pooling is the same
as that of the layer on which the pooling operation was performed.
❑Other types of pooling (like average-pooling) are possible but rarely used. In the earliest convolutional
network, referred to as LeNet-5, a variant of average pooling was used and was referred to as subsampling.
In general, max-pooling remains more popular than average pooling.
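A minimal NumPy sketch of max-pooling on a single activation map (applied independently per map, so the depth is unchanged); the map size is an illustrative assumption:

    import numpy as np

    def max_pool(A, P, S):
        # Max-pooling: take the maximum over each P x P region, moving with stride S,
        # on one activation map A of size H x W.
        H, W = A.shape
        out_h, out_w = (H - P) // S + 1, (W - P) // S + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = A[i*S:i*S+P, j*S:j*S+P].max()
        return out

    A = np.random.rand(28, 28)                 # illustrative activation map
    print(max_pool(A, P=2, S=2).shape)         # (14, 14): ((H-P)/S + 1) x ((W-P)/S + 1)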
The ReLU Layer
❑The convolution operation is interleaved with the pooling and ReLU operations.
❑For each of the H ×W ×D values in a layer, the ReLU activation function is
applied to it to create H×W×D thresholded values.
❑These values are then passed on to the next layer. Therefore, applying the
ReLU does not change the dimensions of a layer because it is a simple one-to-
one mapping of activation values.
❑The use of the ReLU has tremendous advantages over earlier activation
functions such as the sigmoid and tanh, both in terms of speed and accuracy.
❑Increased speed is also connected to accuracy because it allows one to use
deeper models and train them for a longer time. In recent years, the use of the
ReLU activation function has largely replaced the other activation functions in
convolutional neural networks.
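A one-line NumPy sketch of the ReLU applied to a layer; the layer shape is an illustrative assumption:

    import numpy as np

    layer = np.random.randn(28, 28, 6)        # illustrative H x W x D layer of activations
    relu_out = np.maximum(layer, 0.0)         # elementwise thresholding at zero
    print(relu_out.shape)                     # (28, 28, 6): dimensions are unchanged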
LeNet-5
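The original slide shows the LeNet-5 architecture as a figure. A minimal PyTorch-style sketch of that classic layer sequence is given below; the use of tanh activations and average pooling (subsampling) is assumed here, since exact choices vary across descriptions:

    import torch.nn as nn

    # Layer sequence of the classic LeNet-5 for 32x32 grayscale inputs.
    lenet5 = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5),     # C1: 6 feature maps of 28x28
        nn.Tanh(),
        nn.AvgPool2d(2),                    # S2: subsampling to 14x14
        nn.Conv2d(6, 16, kernel_size=5),    # C3: 16 feature maps of 10x10
        nn.Tanh(),
        nn.AvgPool2d(2),                    # S4: subsampling to 5x5
        nn.Conv2d(16, 120, kernel_size=5),  # C5: 120 feature maps of 1x1
        nn.Tanh(),
        nn.Flatten(),
        nn.Linear(120, 84),                 # F6: fully connected layer with 84 units
        nn.Tanh(),
        nn.Linear(84, 10),                  # output layer: 10 digit classes
    )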
Motivation
❑Convolution leverages three important ideas that can help improve a machine
learning system:
1. sparse interactions,
2. parameter sharing, and
3. equivariant representations.
SPARSE CONNECTIVITY
❑In a traditional neural network, every output unit interacts with every input unit
through matrix multiplication, so a layer with m inputs and n outputs requires
m × n parameters and O(m × n) runtime.
❑Convolutional networks instead have sparse interactions: the kernel is made
smaller than the input, so each output depends on only k inputs. This requires
only k × n parameters and O(k × n) runtime.
PARAMETER SHARING
❑Parameter sharing refers to using the same parameter for more than one
function in a model.
❑In a traditional neural net, each element of the weight matrix is used exactly
once when computing the output of a layer. It is multiplied by one element of
the input and then never revisited.
❑In a convolutional neural net, each member of the kernel is used at every
position of the input.
❑The parameter sharing used by the convolution operation means that rather
than learning a separate set of parameters for every location, we learn only one
set.
❑This does not affect the runtime of forward propagation—it is still O(k × n)—
but it does further reduce the storage requirements of the model to k
parameters. Recall that k is usually several orders of magnitude less than m.
❑Since m and n are usually roughly the same size, k is practically insignificant
compared to m × n. Convolution is thus dramatically more efficient than dense
matrix multiplication in terms of the memory requirements and statistical
efficiency.
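❑As an illustrative calculation (with assumed sizes): for a 1-D layer with m = n = 1,000
units, a dense connection would need m × n = 1,000,000 weights, whereas a shared
kernel of size k = 9 needs only 9 stored weights while still using about
k × n = 9,000 multiplications in the forward pass.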
EQUIVARIANT REPRESENTATIONS
❑To say a function is equivariant means that if the input changes, the
output changes in the same way. Specifically, a function f(x) is
equivariant to a function g if f(g(x)) = g(f(x)).
❑In the case of convolution, if we let g be any function that
translates the input, i.e., shifts it, then the convolution function is
equivariant to g.
❑Convolution is not naturally equivariant to some other
transformations, such as changes in the scale or rotation of an
image. Other mechanisms are necessary for handling these kinds of
transformations.
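A small numerical check of translation equivariance, using SciPy's cross-correlation (which matches the CNN convention of no kernel flipping); the input and kernel sizes are illustrative assumptions:

    import numpy as np
    from scipy.signal import correlate2d

    I = np.random.rand(6, 6)                       # illustrative input
    K = np.random.rand(3, 3)                       # illustrative kernel
    shift = lambda A: np.roll(A, 1, axis=0)        # g: translate down by one row

    a = correlate2d(shift(I), K, mode='valid')     # f(g(x)): shift, then convolve
    b = shift(correlate2d(I, K, mode='valid'))     # g(f(x)): convolve, then shift
    print(np.allclose(a[1:], b[1:]))               # True: identical except at the wrapped border row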
TYPES OF CONVOLUTION
Data types
❑The data used with a convolutional network usually consists of several
channels, each channel being the observation of a different quantity at some
point in space or time.
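A brief sketch of typical array shapes for such multi-channel data; the specific sizes and the channels-first layout are illustrative assumptions:

    import numpy as np

    audio = np.zeros((16000,))             # 1-D data, single channel: audio waveform samples
    image = np.zeros((3, 224, 224))        # 2-D data, 3 channels (RGB), channels-first layout
    video = np.zeros((3, 60, 224, 224))    # 3-D data (frames x height x width), 3 channels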
ALEXNET
Architecture
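As a starting point for the case study, a minimal PyTorch-style sketch of the AlexNet convolutional architecture (single-stream version; the channel counts follow the original paper, and a 1000-class ImageNet output layer is assumed):

    import torch.nn as nn

    # Sketch of AlexNet for 224x224 RGB inputs.
    alexnet = nn.Sequential(
        nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Flatten(),
        nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 1000),            # 1000 ImageNet classes
    )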