NB-SEAGI DL(R20)-Unit-4

DEEP LEARNING (20A05703c)


UNIT IV
Convolutional Networks: The Convolution Operation, Pooling, Convolution, Basic
Convolution Functions, Structured Outputs, Data Types, Efficient Convolution Algorithms,
Random or Unsupervised Features, Basis for Convolutional Networks.

The Convolution Operation


The convolution operation combines the input f with a kernel g (the weights) to produce an output
map, given in the continuous case by:

s(t) = (f ∗ g)(t) = ∫ f(τ) g(t - τ) dτ

Let us break down the formula. The steps involved are:

1. Express each function in terms of a dummy variable τ.
2. Reflect the function g, i.e. g(τ) → g(-τ).
3. Add a time offset, i.e. g(-τ) → g(t-τ). Adding the offset shifts the input to the right
by t units (by convention, a negative offset shifts it to the left).
4. Multiply f and g point-wise and accumulate the results to get the output at instant t.
Essentially, we are calculating the area of overlap between f and the shifted g.


For our application, we are interested in the discrete domain formulation:

s(t) = (f ∗ g)(t) = Σ_τ f(τ) g(t - τ)
When the kernel is not flipped in its domain, we obtain the cross-correlation operation,
whose output at instant t is Σ_τ f(τ) g(t + τ). The basic difference between the two
operations is that convolution is commutative in nature, i.e. f and g can be interchanged
without changing the output, whereas cross-correlation is not commutative.

Although these equations imply that the domains of both f and g are infinite, in practice the
two functions are non-zero only in a finite region. As a result, the output is non-zero only in a
finite region (where the non-zero regions of f and g overlap), as the sketch below illustrates
for the 1-D case.
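As a minimal sketch (signals chosen purely for illustration), NumPy exposes both operations, making the kernel flip and the commutativity easy to check:

```python
import numpy as np

# A minimal sketch: discrete 1-D convolution vs. cross-correlation.
f = np.array([1.0, 2.0, 3.0])      # input signal
g = np.array([0.0, 1.0, 0.5])      # kernel

conv = np.convolve(f, g)           # flips g before sliding it over f
corr = np.correlate(f, g, "full")  # slides g without flipping

print(conv)                        # [0.  1.  2.5 4.  1.5]
print(corr)                        # differs, because g is not flipped

# Convolution is commutative; cross-correlation is not.
assert np.allclose(np.convolve(f, g), np.convolve(g, f))
```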

The intuition for convolution in 1-D can be extended to n dimensions by nesting the
convolution operations. Vincent Dumoulin and Francesco Visin provide an in-depth analysis
of how input and output shapes and computations are tied, including visualizations of the 2-D
convolution operation.


The 1D convolution operation can be represented as a matrix-vector product. The kernel matrix
is obtained by composing the weights into a Toeplitz matrix. A Toeplitz matrix has the property
that the values along each diagonal are constant.

To extend this principle to 2D input, we first need to unroll the 2D input into a 1D vector.
Once this is done, the kernel needs to be modified as before but this time resulting in a block-
circulant matrix. What’s that?

A circulant matrix is a special case of a Toeplitz matrix where each row is a circular shift of
the previous row; it is trivially a special case of the Toeplitz matrix.

A matrix which is circulant with respect to its sub-matrices is called a block-circulant
matrix. If each of the sub-matrices is itself circulant, the matrix is called a doubly block-
circulant matrix.

Now, given a 2D kernel, we can create the doubly block-circulant matrix that allows a
matrix-vector implementation of convolution.


Convince yourself that convolving a 3x3 kernel over a 4x4 input (a 16x1 unrolled vector)
results in a 2x2 output (a 4x1 vector), and hence the required kernel matrix must be of
shape 4x16. A 1D sketch of the same matrix-vector idea follows.
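As a minimal sketch (kernel length 3 and input length 6 are chosen for illustration), the 1D valid convolution can be written as a 4x6 Toeplitz matrix acting on the input vector:

```python
import numpy as np

# A minimal sketch: 1D "valid" convolution as a Toeplitz matrix-vector product.
x = np.arange(6, dtype=float)          # input vector, length m = 6
w = np.array([1.0, -2.0, 1.0])         # kernel, length k = 3

m, k = len(x), len(w)
T = np.zeros((m - k + 1, m))           # output length m - k + 1 = 4
for i in range(m - k + 1):
    T[i, i:i + k] = w[::-1]            # flipped kernel on each shifted row

# Each row of T is the previous row shifted right by one: a Toeplitz structure.
assert np.allclose(T @ x, np.convolve(x, w, mode="valid"))
```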

Pooling
Pooling is nothing other than downsampling of an image. The most common pooling layer
filter is of size 2x2 (applied with stride 2), which discards three-fourths of the activations.
The role of the pooling layer is to reduce the resolution of the feature map while retaining the
features of the map required for classification, through translational and rotational invariance.
In addition to this spatial-invariance robustness, pooling reduces the computational cost by a
great deal.

During training, gradients are routed back through the pooling operation by backpropagation.
Pooling again helps the processor to process things faster.

There are many pooling techniques. The main ones are as follows:

i) Max pooling, where we take the largest of the pixel values of a segment.

ii) Mean pooling (also called average pooling), where we take the mean of the pixel values of a segment.
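As a minimal sketch (sizes chosen for illustration), max and mean pooling with a 2x2 window and stride 2 can be written with a NumPy reshape:

```python
import numpy as np

# A minimal sketch of 2x2 pooling with stride 2 on one feature map.
# The side lengths are assumed divisible by 2.
x = np.arange(16, dtype=float).reshape(4, 4)

blocks = x.reshape(2, 2, 2, 2)            # (row-block, row, col-block, col)
max_pool = blocks.max(axis=(1, 3))        # largest value per 2x2 block
mean_pool = blocks.mean(axis=(1, 3))      # mean of each 2x2 block

print(max_pool)   # [[5., 7.], [13., 15.]]
print(mean_pool)  # [[2.5, 4.5], [10.5, 12.5]]
```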


As cross-validation is expensive for big networks, the remedy for over-fitting in a modern
neural network is considered through two routes:

 Reducing the number of parameters by representing the model more effectively.

 Regularization.

The dominant architecture in recent times for image classification is the convolutional
neural network, where the number of parameters is reduced effectively through the
convolution technique in the initial layers, with fully connected layers at the very end of the
network.

Usually, regularization is performed through data augmentation, dropout or batch
normalization. Most of these regularization techniques are difficult to apply to
convolutional layers. So, alternatively, this responsibility can be carried by the pooling
layers in a convolutional neural network.

There are three variants of the pooling operation, depending on the route of the regularization technique:

Stochastic pooling: A randomly picked activation within each pooling region is used, rather
than a deterministic pooling operation, to regularize the network. Stochastic pooling reduces
the feature size but gives up the role of selecting features judiciously for the sake of
regularization, although the clipping of negative outputs by the ReLU activation helps to
carry some of the selection responsibility.

Overlapping pooling: An overlapping pooling operation shares responsibility for local
connections beyond the size of the previous convolutional filter, which breaks the orthogonal
division of responsibility between the pooling layer and the convolutional layer. So, no
information is gained if pooling windows overlap.

Fractional pooling: The reduction ratio of the feature-map size due to pooling can be
controlled by the fractional pooling concept, which helps to increase the depth of the network.
Unlike stochastic pooling, the randomness is related to the choice of pooling regions, not to
the way pooling is performed inside each of the pooling regions.

There are other variants of pooling as follows:

 Min pooling
 Wavelet pooling
 Tree pooling
 Max-avg pooling
 Spatial pyramid pooling
Pooling makes the network invariant to translations in shape, size and scale. Max pooling is
predominantly used in object recognition.


CONVOLUTION:
Convolution is an orderly procedure where two sources of information are intertwined; it is an
operation that changes one function into another. Convolutions have been used for a long
time, typically in image processing, to blur and sharpen images, but also to perform other
operations (e.g. enhancing edges and embossing). CNNs enforce a local connectivity pattern
between neurons of adjacent layers.

CNNs make use of filters (also known as kernels), to detect what features, such as edges, are

present throughout an image.

There are four main operations in a CNN:

 Convolution

 Non Linearity (ReLU)

 Pooling or Sub Sampling

 Classification (Fully Connected Layer)
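As a minimal sketch in PyTorch (layer sizes are illustrative), the four operations compose into a tiny classifier:

```python
import torch
import torch.nn as nn

# A minimal sketch: the four main CNN operations, in order.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # 1. Convolution
    nn.ReLU(),                                  # 2. Non Linearity (ReLU)
    nn.MaxPool2d(2),                            # 3. Pooling / Sub Sampling
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),                 # 4. Classification (FC layer)
)

x = torch.randn(1, 1, 28, 28)   # e.g. one grayscale 28x28 image
print(model(x).shape)           # torch.Size([1, 10])
```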


The first layer of a Convolutional Neural Network is always a Convolutional Layer.
Convolutional layers apply a convolution operation to the input, passing the result to the next
layer. A convolution converts all the pixels in its receptive field into a single value.

For example, if you apply a convolution to an image, you will be decreasing the image
size as well as bringing all the information in the field together into a single pixel. The final
output of the convolutional layer is a vector. Based on the type of problem we need to solve
and on the kind of features we are looking to learn, we can use different kinds of
convolutions.

The 2D Convolution Layer


The most common type of convolution is the 2D convolution layer, usually abbreviated
as conv2D. A filter or kernel in a conv2D layer “slides” over the 2D input data,
performing an elementwise multiplication and summing up the results into a single output
pixel. The kernel performs the same operation for every location it slides over, transforming
a 2D matrix of features into a different 2D matrix of features.

The Dilated or Atrous Convolution


This operation expands the window size without increasing the number of weights by inserting
zero-values into the convolution kernels. Dilated or atrous convolutions can be used in real-time
applications and in applications with limited processing power, as the RAM requirements
are less intensive.
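As a minimal sketch in PyTorch (sizes are illustrative), a 3x3 kernel with dilation 2 covers a 5x5 window while keeping only 9 weights:

```python
import torch
import torch.nn as nn

# A minimal sketch: a dilated (atrous) 3x3 convolution.
# With dilation=2, the 3x3 kernel spans a 5x5 window using only 9 weights.
x = torch.randn(1, 1, 32, 32)   # (batch, channels, H, W)

conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)
y = conv(x)

print(y.shape)  # torch.Size([1, 1, 32, 32]); padding=2 preserves the size
```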
Separable Convolutions
There are two main types of separable convolutions:
spatial separable convolutions, and depthwise separable convolutions.
The spatial separable convolution deals primarily with the spatial dimensions of an image and
kernel: the width and the height. Compared to spatial separable convolutions, depthwise
separable convolutions work with kernels that cannot be “factored” into two smaller kernels.
As a result, they are more frequently used.
Transposed Convolutions
These types of convolutions are also known as deconvolutions or fractionally strided
convolutions. A transposed convolutional layer carries out a regular convolution but reverts
its spatial transformation.
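As a minimal sketch in PyTorch (sizes are illustrative), a stride-2 transposed convolution doubles the spatial size, reverting the downsampling of a stride-2 convolution:

```python
import torch
import torch.nn as nn

# A minimal sketch: a transposed convolution upsampling a feature map.
x = torch.randn(1, 1, 8, 8)

# stride=2 roughly inverts the spatial effect of a stride-2 convolution
deconv = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2)
y = deconv(x)

print(y.shape)  # torch.Size([1, 1, 16, 16])
```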

Variants of the Basic Convolution Function


In practical implementations of the convolution operation, certain modifications are made
which deviate from the discrete convolution formula mentioned above:
 In general a convolution layer consists of application of several different kernels to
the input. This allows the extraction of several different features at all locations in the
input. This means that in each layer, a single kernel (filter) isn’t applied. Multiple
kernels (filters), usually a power of 2, are used as different feature detectors.
 The input is generally not scalar-valued but vector-valued (e.g. the RGB values at
each pixel, or the feature values computed by the previous layer at each pixel
position). Multi-channel convolutions are commutative only if the number of output and
input channels is the same.


 In order to allow for the calculation of features at a coarser level, strided convolutions can
be used. The effect of a strided convolution is the same as that of a convolution
followed by a downsampling stage. This can be used to reduce the representation size.

Fig: 2D convolution 3x3 kernel and stride of 2 units (source)

 Zero padding helps to make the output dimensions and kernel size independent. Three
common zero padding strategies are:
 valid: The output is computed only at places where the entire kernel lies inside the
input. Essentially, no zero padding is performed. For a kernel of size k in any
dimension, an input of size m in that dimension becomes m-k+1 in the output.
This shrinkage restricts architecture depth.
 same: The input is zero padded such that the spatial size of the input and output is the
same. Essentially, for a dimension where the kernel size is k, the input is padded by k-1
zeros in that dimension. Since the number of output units connected to border pixels
is less than that for centre pixels, border pixels may be under-represented.
 full: The input is padded by enough zeros such that each input pixel is connected to
the same number of output units.
In terms of test set accuracy, the optimal padding is somewhere
between same and valid; a code sketch of the three modes appears after the figure below.


valid (left), same (middle) and full (right) padding (source). The extreme left one is for
stride = 2.
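As a minimal sketch (sizes are illustrative), scipy's convolve2d exposes the three strategies directly as modes:

```python
import numpy as np
from scipy.signal import convolve2d

# A minimal sketch: output sizes for the three zero-padding strategies,
# using scipy's convolve2d modes on a 4x4 input with a 3x3 kernel.
x = np.random.rand(4, 4)   # input, m = 4
k = np.random.rand(3, 3)   # kernel, k = 3

print(convolve2d(x, k, mode="valid").shape)  # (2, 2): m - k + 1
print(convolve2d(x, k, mode="same").shape)   # (4, 4): input size preserved
print(convolve2d(x, k, mode="full").shape)   # (6, 6): m + k - 1
```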

 Besides locally-connected layers and tiled convolution, another extension can be to


restrict the kernels to operate on certain input channels. One way to implement this is
to connect the first m input channels to the first n output channels, the next m input
channels to the next n output channels and so on. This method decreases the number
of parameters in the model without decreasing the number of output units.
 When the max pooling operation is applied to a locally connected layer or tiled
convolution, the model has the ability to become transformation invariant, because
adjacent filters have the freedom to learn a transformed version of the same
feature. This is essentially similar to the property leveraged by pooling over channels
rather than spatially.
 Bias terms can be used in different ways in the convolution stage. For the locally
connected layer and tiled convolution, we can use a bias per output unit and per kernel
respectively. In the case of traditional convolution, a single bias term per output channel
is used. If the input size is fixed, a bias per output unit may be used to counter the
effect of regional image statistics and of smaller activations at the boundary due to zero
padding.

Structured Outputs
 Convolutional networks can be trained to output a high-dimensional structured output
rather than just a classification score. A good example is the task of image
segmentation, where each pixel needs to be associated with an object class. Here the
output is the same size (spatially) as the input. The model outputs a
tensor S where S[i,j,k] is the probability that pixel (j,k) belongs to class i.
 To produce an output map of the same size as the input map, only same-
padded convolutions can be stacked (a minimal sketch appears after the U-Net figure
below). Alternatively, a coarser segmentation map can be obtained by allowing the
output map to shrink spatially.
 The output of the first labelling stage can be refined successively by another
convolutional model. If the models use tied parameters, this gives rise to a type
of recursive model as shown below (H¹, H², H³ share parameters).

Recursive refinement of the segmentation map

 The output can be further processed under the assumption that contiguous regions of
pixels will tend to belong to the same label. Graphical models can describe this
relationship. Alternatively, CNNs can learn to optimize the graphical model's training
objective.
 Another model that has gained popularity for segmentation tasks (especially in the
medical imaging community) is the U-Net. The up-convolution mentioned is just a
direct upsampling by repetition followed by a convolution with same padding.

U-Net architecture for medical image segmentation (source)
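As a minimal sketch in PyTorch (layer widths are illustrative), stacking same-padded convolutions keeps the spatial size, so the network can emit the per-pixel class tensor S described above:

```python
import torch
import torch.nn as nn

# A minimal sketch: a fully convolutional network built only from same-padded
# convolutions, so per-pixel class scores match the input's spatial size.
num_classes = 5

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # same padding keeps H x W
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, num_classes, kernel_size=1),   # per-pixel class scores
)

x = torch.randn(1, 3, 64, 64)
S = model(x).softmax(dim=1)   # S[0, i, j, k]: probability pixel (j, k) is class i

print(S.shape)  # torch.Size([1, 5, 64, 64]); same spatial size as the input
```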


Data Types
The data used with a convolutional network usually consist of several channels, each channel
being the observation of a different quantity at some point in space or time.
One advantage of convolutional networks is that they can also process inputs with varying
spatial extents.
When the output is correspondingly variable in size, no extra design change needs to be made.
If, however, the output is of fixed size, as in a classification task, a pooling stage with kernel
size proportional to the input size needs to be used.

Different data types based on the number of spatial dimensions and channels
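As a minimal sketch in PyTorch (sizes are illustrative), adaptive average pooling plays the role of a pooling stage whose window scales with the input, so a fixed-size classifier can follow variable-sized inputs:

```python
import torch
import torch.nn as nn

# A minimal sketch: handling variable-sized inputs with a pooling stage
# whose window scales with the input, via adaptive average pooling.
features = nn.Conv2d(3, 8, kernel_size=3, padding=1)
pool = nn.AdaptiveAvgPool2d((4, 4))        # output grid fixed at 4x4
classifier = nn.Linear(8 * 4 * 4, 10)

for size in [(32, 32), (48, 80)]:          # two different spatial extents
    x = torch.randn(1, 3, *size)
    h = pool(features(x)).flatten(1)       # always (1, 128), whatever the input
    print(classifier(h).shape)             # torch.Size([1, 10])
```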

Efficient Convolution Algorithms


In some problem settings, performing convolution as pointwise multiplication in the frequency
domain can provide a speed-up compared to direct computation. This results from the
convolution property:

F(f ∗ g) = F(f) · F(g)

Convolution in the source domain is multiplication in the frequency domain; F is the
Fourier transform.
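As a minimal sketch (lengths are illustrative), the identity can be checked numerically with NumPy's FFT, padding both signals to the full output length:

```python
import numpy as np

# A minimal sketch: convolution computed as pointwise multiplication in the
# frequency domain, padded to the full output length m + k - 1.
f = np.random.rand(64)
g = np.random.rand(8)

n = len(f) + len(g) - 1
fft_conv = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)

assert np.allclose(fft_conv, np.convolve(f, g))
```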

When a d-dimensional kernel can be broken into the outer product of d vectors, the kernel is
said to be separable. The corresponding convolution operations are more efficient when
implemented as d 1-dimensional convolutions rather than as a direct d-dimensional convolution.
Note, however, that it may not always be possible to express a kernel as an outer product of
lower-dimensional kernels.
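As a minimal sketch (kernel chosen for illustration), a separable 3x3 kernel applied as two 1-D passes matches the direct 2-D convolution:

```python
import numpy as np
from scipy.signal import convolve2d

# A minimal sketch: a separable 2D kernel is the outer product of two vectors,
# so one 2D convolution equals two cheaper 1D convolutions.
row = np.array([1.0, 2.0, 1.0])
col = np.array([1.0, 0.0, -1.0])
kernel = np.outer(col, row)               # a 3x3 Sobel-like separable kernel

x = np.random.rand(16, 16)

direct = convolve2d(x, kernel, mode="valid")
two_pass = convolve2d(
    convolve2d(x, col[:, None], mode="valid"),  # 1D pass along the vertical axis
    row[None, :], mode="valid",                 # then along the horizontal axis
)

assert np.allclose(direct, two_pass)
```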

This is not to be confused with depthwise separable convolution. That method restricts
convolution kernels to operate on only one input channel at a time, followed by 1x1
convolutions on all channels of the intermediate output.

Devising faster ways of performing convolution or approximate convolution without harming


the accuracy of the model is an active area of research.


Random and Unsupervised Features


To reduce the computational cost of training the CNN, we can use features not learned by
supervised training.

1. Random initialization has been shown to create filters that are frequency selective
and translation invariant. This can be used to inexpensively select the model
architecture: randomly initialize several CNN architectures and train just the last
classification layer. Once a winner is determined, that model can be fully trained in a
supervised manner.
2. Hand designed kernels may be used; e.g. to detect edges at different orientations and
intensities.
3. Unsupervised training of kernels may be performed; e.g. applying k-means clustering
to image patches and using the centroids as convolutional kernels (see the sketch after
this list). Unsupervised pre-training may offer a regularization effect (not well
established). It may also allow for the training of larger CNNs because of the reduced
computation cost.
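As a minimal sketch (the patch size, cluster count and random data stand-in are all illustrative assumptions), k-means centroids of image patches can serve as first-layer kernels:

```python
import numpy as np
from sklearn.cluster import KMeans

# A minimal sketch: unsupervised kernels from k-means on random image patches.
rng = np.random.default_rng(0)
images = rng.random((100, 32, 32))           # stand-in for natural images

# Sample 50 random 6x6 patches per image and flatten them.
patches = np.stack([
    img[r:r + 6, c:c + 6]
    for img in images
    for r, c in [rng.integers(0, 27, size=2) for _ in range(50)]
]).reshape(-1, 36)

kmeans = KMeans(n_clusters=16, n_init=10).fit(patches)
kernels = kmeans.cluster_centers_.reshape(16, 6, 6)  # use as conv kernels
print(kernels.shape)                                 # (16, 6, 6)
```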
Another approach to CNN training is greedy layer-wise pretraining, most notably used
in the convolutional deep belief network. For example, in the case of multi-layer perceptrons,
starting with the first layer, each layer is trained in isolation. Once the first layer is trained, its
output is stored and used as input for training the next layer, and so on.

Basis for Convolutional Networks


Hubel and Wiesel studied the activity of neurons in a cat’s brain in response to visual stimuli.
Their work characterized many aspects of brain function.

In a simplified view, we have:

1. The light entering the eye stimulates the retina. The image then passes through the
optic nerve and a region of the brain called the LGN (lateral geniculate nucleus).
2. V1 (primary visual cortex): The image produced on the retina is transported to the
V1 with minimal processing. The properties of V1 that have been replicated in CNNs
are:
a. The V1 response is localized spatially, i.e. the upper image stimulates the cells
in the upper region of V1 [localized kernel].
b. V1 has simple cells whose activity is a linear function of the input in a small
neighbourhood [convolution].


c. V1 has complex cells whose activity is invariant to shifts in the position of the
feature [pooling] as well as some changes in lighting which cannot be captured
by spatial pooling [cross-channel pooling].
3. There are several stages of V1-like operations [stacking convolutional layers].
4. In the medial temporal lobe, we find grandmother cells. These cells respond to
specific concepts and are invariant to several transforms of the input. In the medial
temporal lobe, researchers also found neurons spiking on a particular concept, e.g.
the Halle Berry neuron fires when looking at a photo/drawing of Halle Berry or even
reading the text Halle Berry. Of course, there are neurons which spike at other
concepts like Bill Clinton, Jennifer Aniston, etc.
The medial temporal neurons are more generic than CNNs in that they respond even to
specific ideas. A closer match to the function of the last layers of a CNN is the IT
(inferotemporal cortex). When viewing an object, information flows from the retina,
through the LGN, V1, V2 and V4, and reaches the IT. This happens within 100 ms. When a
person continues to look at an object, the brain sends top-down feedback signals to affect
lower-level activation.
Some of the major differences between the human visual system (HVS) and the CNN
model are:
 The human eye is low resolution except in a region called the fovea. Essentially, the
eye does not receive the whole image at high resolution but stitches several patches
together through eye movements called saccades. This attention-based gazing over the
input image is an active research problem. Note: attention mechanisms have been shown
to work on natural language tasks.
 The HVS integrates several senses, while CNNs are only visual.
 The HVS processes rich 3D information, and can also determine relations between
objects. CNNs for such tasks are in their early stages.
 The feedback from higher levels to V1 has not been incorporated into CNNs with
substantial improvement.
 While the CNN can capture firing rates in the IT, the similarity between intermediate
computations is not established. The brain probably uses different activation and pooling
functions. Even the linearity of filter response is doubtful as recent models for V1 involve
quadratic filters.


Neuroscience tells us very little about the training procedure. Backpropagation, the
standard training mechanism today, is not inspired by neuroscience and is sometimes
considered biologically implausible.

The heatmap of a 2D Gabor filter (source)

In order to determine the filter parameters used by neurons, a process called reverse
correlation is used. The neuron activations are measured by an electrode while viewing
several white-noise images, and a linear model is used to approximate this behaviour. It has
been shown experimentally that the weights of the fitted model of V1 neurons are described
by Gabor functions. If we go by the simplified version of the HVS: if the simple cells detect
Gabor-like features, then the complex cells learn a function of simple-cell outputs which is
invariant to certain translations and magnitude changes.
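As a minimal sketch (parameter values are illustrative), a 2D Gabor function is a Gaussian envelope multiplied by a sinusoidal carrier:

```python
import numpy as np

# A minimal sketch: a 2D Gabor function, the form experimentally fitted to
# V1 simple-cell weights. Parameter values here are illustrative.
def gabor(size=15, wavelength=6.0, theta=0.0, sigma=3.0, gamma=0.5, psi=0.0):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate the coordinate system by the orientation theta.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    # Gaussian envelope times a sinusoidal carrier.
    envelope = np.exp(-(xr**2 + (gamma * yr) ** 2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength + psi)

kernel = gabor(theta=np.pi / 4)   # a 45-degree oriented edge detector
print(kernel.shape)               # (15, 15)
```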
A wide variety of statistical learning algorithms (from unsupervised methods such as sparse
coding to the first-layer features of deep learning) learn features with Gabor-like functions
when applied to natural images. This goes to show that while no algorithm can be touted as
the right method based on Gabor-like feature detectors, a lack of such features may be taken
as a bad sign.

(Left) Gabor functions with different values of the parameters that control the coordinate
system. (Middle) Weights learned by an unsupervised learning algorithm. (Right)
Convolution kernels learned by the first layer of a fully supervised convolutional maxout
network.
