Deep Learning (BCS714A) 2025-26

MODULE – 03
CONVOLUTIONAL NEURAL NETWORKS

Introduction:
Convolutional Neural Networks (CNNs), introduced by LeCun (1989), are specialized
neural networks designed for data with a grid-like structure — such as 1D time-series (data
over time) or 2D images (grids of pixels). CNNs have achieved remarkable success in real-
world applications like image recognition, video analysis, and speech processing.

At their core, CNNs use a mathematical operation called convolution, a form of linear
transformation that replaces traditional matrix multiplication in some layers of the network.
This operation enables CNNs to automatically detect and learn spatial or temporal patterns,
such as edges, shapes, and textures.

Most CNNs also employ an operation called pooling, which reduces data dimensionality
while preserving key features. Though the convolution used in neural networks differs
slightly from that in engineering or mathematics, it is optimized for learning hierarchical
feature representations efficiently.

CNNs draw inspiration from the structure and functioning of the human visual cortex,
representing a strong link between neuroscience and deep learning. Their architectures evolve
rapidly, but the most effective ones consistently rely on the fundamental components
described here — convolution, pooling, and hierarchical feature extraction.

Convolution Operation
 Definition: Convolution is an operation on two functions of a real-valued argument.
 Example: Tracking a spaceship with a laser sensor:
o Laser provides output x(t), the position of the spaceship at time t (both real-valued).
o Sensor is noisy → need a smoothed estimate using weighted average.
o Weighting function w(a) where a is the age of a measurement.
o Smoothed output: s(t) = ∫ x(a) w(t − a) da = (x * w)(t)

o w must be a valid probability density, and w = 0 for negative arguments (otherwise the estimate would look into the future).


 CNN Terminology:
o First argument x → input
o Second argument w → kernel
o Output s → feature map

Discrete Convolution

 Real-world data are usually discretized (e.g., measurements once per second).
 Discrete convolution: s(t) = (x * w)(t) = Σ_a x(a) w(t − a)
 Inputs and kernels are usually multidimensional arrays (tensors).


 Infinite summation → summation over finite array elements.
 For a 2-D image I with a 2-D kernel K: S(i, j) = (I * K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n)
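
To make the formula concrete, here is a minimal NumPy sketch (not part of the original notes; the function name conv2d_valid and the example arrays are illustrative choices): it flips the kernel and slides it over the image, producing the "valid"-size feature map of m − k + 1 positions per axis.

import numpy as np

def conv2d_valid(I, K):
    # Naive 2-D discrete convolution: flip the kernel and slide it over the
    # image, evaluating the convolution sum at every position where the
    # kernel fits fully inside the image ("valid" output size).
    kH, kW = K.shape
    Kf = K[::-1, ::-1]                     # flip the kernel: true convolution
    outH = I.shape[0] - kH + 1             # "valid" size: m - k + 1 per axis
    outW = I.shape[1] - kW + 1
    S = np.zeros((outH, outW))
    for i in range(outH):
        for j in range(outW):
            S[i, j] = np.sum(I[i:i + kH, j:j + kW] * Kf)
    return S

I = np.arange(25, dtype=float).reshape(5, 5)
K = np.array([[1.0, 0.0],
              [0.0, -1.0]])               # a tiny edge-like kernel
print(conv2d_valid(I, K))                  # 4x4 feature map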

Properties of Convolution

 Commutative: (I * K)(i, j) = (K * I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n)
 Flipping the kernel gives convolution its commutative property; this is often not critical in NN implementations.
 Many libraries implement cross-correlation (no kernel flipping): S(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n)

 Kernel values learned by algorithm → flipping or not flipping does not affect
learning.

 Convolution rarely used alone; combined with other functions, combination does not
commute.
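
A quick numerical check of this distinction, sketched with SciPy (not part of the original notes): scipy.signal.convolve2d flips the kernel while correlate2d does not, and flipping the kernel by hand makes the two agree.

import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
I = rng.random((5, 5))
K = rng.random((3, 3))

conv = convolve2d(I, K, mode='valid')              # true convolution (flips the kernel)
corr = correlate2d(I, K, mode='valid')             # cross-correlation (no flipping)
corr_flip = correlate2d(I, K[::-1, ::-1], mode='valid')

print(np.allclose(conv, corr))       # False in general: the two operations differ
print(np.allclose(conv, corr_flip))  # True: flipping the kernel recovers convolution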

Matrix Representation

 Discrete convolution ↔ matrix multiplication:


o Univariate → Toeplitz matrix (rows shifted by one element).
o 2D → doubly block circulant matrix.
 Sparse matrix: most entries zero (kernel smaller than input).
 Any neural network algorithm that works with matrix multiplication (and does not depend on specific properties of the matrix) also works with convolution.
 Typical CNNs use further specializations to handle large inputs efficiently, but these are not theoretically necessary.
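
As a minimal sketch of this matrix-multiplication view (an illustration, not from the notes): for the univariate "valid" case, each row of the matrix holds the flipped kernel shifted by one element, and multiplying by the input reproduces the convolution.

import numpy as np

def conv_matrix(w, n):
    # Build the (n - k + 1) x n Toeplitz-like matrix whose product with a
    # length-n input equals the "valid" 1-D convolution with kernel w.
    k = len(w)
    W = np.zeros((n - k + 1, n))
    w_flipped = w[::-1]                 # convolution flips the kernel
    for i in range(n - k + 1):
        W[i, i:i + k] = w_flipped       # each row is the previous row shifted by one
    return W

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 0.25])
W = conv_matrix(w, len(x))
print(W @ x)                                 # matrix-multiplication view
print(np.convolve(x, w, mode='valid'))       # matches the direct convolution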

Motivation

Convolution leverages three important ideas that improve machine learning systems: sparse
interactions, parameter sharing, and equivariant representations. It also allows working
with inputs of variable size. Traditional neural network layers use matrix multiplication with
a separate parameter for each input-output pair, so every output interacts with every input.
Convolutional networks, however, employ sparse interactions by using kernels smaller than
the input. For example, an image with thousands or millions of pixels can be processed with
kernels covering only tens or hundreds of pixels to detect meaningful features such as edges.
This reduces the number of parameters to store, improving memory efficiency, statistical
efficiency, and computation, since fewer operations are required. While matrix multiplication
with m inputs and n outputs requires m×n parameters and O(m×n) runtime, a sparse approach
with k connections per output requires only k×n parameters and O(k×n) runtime. In practice,
good performance is often achieved with k several orders of magnitude smaller than m. In
deep convolutional networks, deeper layers can indirectly interact with larger portions of the
input, enabling the network to describe complex interactions efficiently by building them
from simple sparse interactions.
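
To make these counts concrete, a tiny illustrative calculation (the image and kernel sizes below are assumptions chosen only for illustration, not values from the notes):

# Illustrative layer sizes (assumed values).
m = 320 * 280          # number of input units, e.g. pixels of a 320 x 280 image
n = 320 * 280          # number of output units of the layer
k = 3 * 3              # connections per output for a small 3 x 3 kernel

dense_params = m * n   # fully connected layer: one parameter per input-output pair
sparse_params = k * n  # sparse interactions: only k inputs influence each output

print(f"dense:  {dense_params:,} parameters, O(m x n) runtime")
print(f"sparse: {sparse_params:,} parameters, O(k x n) runtime")

With parameter sharing, described next, the convolutional layer additionally stores only the k kernel values themselves rather than separate weights for every location.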


Parameter Sharing and Equivariance in Convolution

Parameter sharing refers to using the same parameter for more than one function in a model. In a traditional neural network,
each weight element is used once when computing the output of a layer, while in a
convolutional neural network (CNN), each member of the kernel is applied at every position
of the input, except perhaps at the boundaries. This “tied weights” approach means that
instead of learning separate parameters for every location, the network learns only one set.
Parameter sharing does not affect the runtime of forward propagation—it remains O(k×n)—
but it drastically reduces memory requirements to k parameters, which is often several orders
of magnitude smaller than the total number of parameters in a dense matrix multiplication.
Sparse connectivity combined with parameter sharing allows CNNs to efficiently detect
features such as edges in images. Moreover, convolution layers are equivariant to
translation, meaning that if the input shifts, the output shifts in the same way. Formally, a
function f is equivariant to g if f(g(x))=g(f(x)). For example, translating an image to the right
and then applying convolution yields the same result as applying convolution first and then
translating the output. This property is valuable for time-series data, producing a timeline
where feature occurrences shift consistently, and for images, creating a 2D map where feature
locations move in correspondence with the input. Parameter sharing is especially useful when
the same local features, such as edges, appear throughout the input. However, in some cases,
such as cropped face images, features differ by location (e.g., eyebrows versus chin), so full
parameter sharing may not be ideal. Convolution is not naturally equivariant to
transformations like scaling or rotation, which require other mechanisms. Finally,
convolution allows processing of data types that cannot be handled by fixed-shape matrix
multiplication, enabling more flexible and efficient neural network designs.
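
A quick numerical check of translation equivariance (a sketch, not part of the notes; cross-correlation is used, as in most libraries, and the border column introduced by the shift's zero padding is excluded from the comparison):

import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
I = rng.random((8, 8))
K = rng.random((3, 3))

def shift_right(img, s=1):
    # Translate the image s pixels to the right, padding the left edge with zeros.
    out = np.zeros_like(img)
    out[:, s:] = img[:, :-s]
    return out

a = correlate2d(shift_right(I), K, mode='valid')   # shift the input, then convolve
b = shift_right(correlate2d(I, K, mode='valid'))   # convolve, then shift the output

print(np.allclose(a[:, 1:], b[:, 1:]))             # True: the feature map shifts with the input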


Pooling

A typical layer in a convolutional network has three stages. First, multiple convolutions are
performed in parallel to produce linear activations. Second, each activation passes through a
nonlinear function, such as a rectified linear unit, in what is sometimes called the detector
stage. Third, a pooling function summarizes the outputs over a neighborhood, replacing each
location with a statistic such as the maximum (max pooling), the average, the L2 norm, or a
distance-weighted average. Pooling introduces approximate invariance to small
translations, meaning that small shifts in the input do not substantially change most pooled
outputs. This is useful when the presence of a feature matters more than its precise location,
for instance, detecting eyes in a face image, though in other tasks, such as finding corners,
location must be preserved. Pooling can be seen as adding a strong prior that the learned
function should be translation-invariant, improving statistical efficiency when the
assumption holds. Spatial pooling reduces the number of pooling units needed by
summarizing regions spaced multiple pixels apart, which improves computational
efficiency, decreases the number of inputs for the next layer, and reduces memory
requirements when subsequent layers depend on input size, such as fully connected layers.
Pooling also enables handling inputs of variable size, allowing the classification layer to
receive a fixed number of summary statistics regardless of input dimensions. Advanced
strategies include dynamically pooling features using clustering or learning a single pooling
structure applied across images. While pooling is essential for many tasks, it can complicate
networks with top-down information such as Boltzmann machines and autoencoders. In
convolutional architectures, pooling combined with convolution forms the backbone of
modern classification networks, providing both translation invariance and efficiency.
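
A minimal sketch of max pooling's approximate invariance to small translations (illustrative values, not from the notes): after shifting every detector output by one position, half of the pooled values are unchanged.

import numpy as np

def max_pool_1d(x, width=3):
    # Max over a sliding window of the given width (stride 1).
    return np.array([x[i:i + width].max() for i in range(len(x) - width + 1)])

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.1])   # detector-stage outputs
shifted = np.roll(detector, 1)                        # every input value moves by one

print(max_pool_1d(detector))  # [1.  1.  0.2 0.1]
print(max_pool_1d(shifted))   # [1.  1.  1.  0.2] -- half the pooled values are unchanged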


Convolution and Pooling as an Infinitely Strong Prior

Recall the concept of a prior probability distribution, which encodes our beliefs about what
models are reasonable before seeing any data. Priors can be weak, with high entropy such as
a Gaussian with high variance, allowing the data to move parameters freely, or strong, with
very low entropy such as a Gaussian with low variance, actively influencing parameter
values. An infinitely strong prior places zero probability on some parameters, completely
forbidding certain values regardless of the data. Convolutional networks can be thought of as
fully connected networks with an infinitely strong prior over their weights: the weights for
one hidden unit must be identical to those of neighboring units but shifted in space, and
weights must be zero outside the small, spatially contiguous receptive field assigned to the
unit. This imposes that the function learned contains only local interactions and is equivariant
to translation. Similarly, pooling introduces an infinitely strong prior that each unit should be
invariant to small translations. While implementing a convolutional net as a fully connected
net with such a prior would be computationally wasteful, this perspective provides insight
into how convolutional nets work. Convolution and pooling can cause underfitting if the prior
assumptions are inaccurate; for example, pooling all features can increase training error when
precise spatial information is required. Some architectures, such as those by Szegedy et al.
(2014a), selectively apply pooling across channels to achieve both highly invariant features
and features that avoid underfitting. Convolution may also be inappropriate for tasks
requiring information from very distant input locations. Finally, convolutional models should
only be compared to other convolutional models in benchmarks, as non-convolutional models
could learn even if pixels are permuted. Many image datasets provide separate benchmarks
for models that are permutation invariant, learning topology from data, and for models with
hard-coded spatial relationships.


Variants of the Basic Convolution Function


1. Convolution in neural networks differs somewhat from the standard discrete convolution of mathematics; it usually involves multiple parallel convolutions to extract many kinds of features at many locations. Inputs are vector-valued grids (e.g., RGB images), and outputs are 3-D tensors (channels × spatial coordinates), or 4-D with batch processing.


2. Multichannel convolution is generally not commutative unless the input and output have the same number of channels.
3. Basic convolution equation (multichannel, without kernel flipping):

   Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} K_{i,l,m,n}

where K is the 4-D kernel tensor whose element K_{i,l,m,n} connects a unit in input channel l to a unit in output channel i, with an offset of m rows and n columns.

4. Downsampled convolution with stride s (sampling only every s-th position in each direction):

   Z_{i,j,k} = c(K, V, s)_{i,j,k} = Σ_{l,m,n} V_{l, (j−1)×s+m, (k−1)×s+n} K_{i,l,m,n}
5. Zero padding allows control over output size (see the sketch after this list):


o Valid convolution: no padding, output shrinks, size m−k+1.
o Same convolution: padding keeps output same as input.
o Full convolution: enough padding for kernel to visit every pixel, output size
m+k−1.
6. Locally connected layers / unshared convolution: each output position has its own weights,

   Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} w_{i,j,k,l,m,n}

Useful when features are location-specific; weights are not shared across spatial positions.

7. Tiled convolution: a compromise between convolution and locally connected layers. Uses a small set of kernels cycled (rotated through) as we move across space:

   Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} K_{i,l,m,n, j%t+1, k%t+1}

where % denotes modulo and t is the tiling size.


8. Interaction with max pooling: detector units can learn several transformed versions of the same feature; max pooling over them then gives invariance to those learned transformations, whereas convolution combined with spatial max pooling hard-codes invariance specifically to translation.
9. Backpropagation: the gradients needed for learning can themselves be expressed as convolution-like operations. With G the gradient on the layer's output Z, the gradient with respect to the kernel is g(G, V, s) and the gradient with respect to the input is h(K, G, s).
10. Convolutional autoencoder:
o Reconstruction: R = h(K, H, s)
o Decoder gradient: g(H, E, s)
o Encoder gradient: c(K, E, s)
11. Bias handling:
o Locally connected layers: one bias per unit.
o Tiled convolution: biases shared according to tiling.
o Standard convolution: one bias per output channel.
o Bias separation helps correct for edge effects due to zero padding.
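
The sketch below (using SciPy; not part of the original notes) illustrates the output sizes of valid, same, and full convolution from item 5, and implements the strided convolution of item 4 by computing the dense convolution and keeping every s-th output, which is mathematically equivalent though computationally wasteful.

import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
I = rng.random((8, 8))                       # input of size m = 8 per axis
K = rng.random((3, 3))                       # kernel of width k = 3

print(convolve2d(I, K, mode='valid').shape)  # (6, 6):  m - k + 1, no padding
print(convolve2d(I, K, mode='same').shape)   # (8, 8):  output size equals input size
print(convolve2d(I, K, mode='full').shape)   # (10, 10): m + k - 1, kernel visits every pixel

s = 2                                        # stride: keep only every s-th output position
strided = convolve2d(I, K, mode='valid')[::s, ::s]
print(strided.shape)                         # (3, 3)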


Structured Outputs

1. Definition:
o Convolutional networks can output high-dimensional structured objects, e.g., tensors, instead of a single class label or real value.
o Example: an output tensor S where S_{i,j,k} is the probability that pixel (j, k) belongs to class i (see the sketch after this list).


2. Applications:
o Pixel-wise labeling and image segmentation.
o Produces precise masks that follow object boundaries.
3. Output resolution considerations:
o Output can be smaller than input due to pooling layers with stride > 1.
o Strategies to maintain output size:
 Avoid pooling layers (unit stride pooling).
 Emit lower-resolution label grids and refine (e.g., Pinheiro &
Collobert).
 Produce initial labels and iteratively refine using neighboring pixel
interactions.
4. Recurrent convolutional refinement:
o Successive convolution layers share weights across iterations.
o Acts like a recurrent network for pixel-level label refinement.
5. Post-processing for segmentation:
o Assumes contiguous pixels tend to share labels.
o Graphical models describe probabilistic relationships between pixels.
o Networks can be trained to approximate graphical model objectives.
6. Key insight:
o Structured output CNNs combine convolutional feature extraction with spatial
reasoning to achieve precise, high-resolution labeling, enabling advanced
image segmentation and object delineation.
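
A minimal sketch of such a structured output (illustrative only; the kernels are random rather than trained, and n_classes is an assumed value): one score map per class followed by a softmax over the class axis yields S_{i,j,k}, the probability that pixel (j, k) belongs to class i.

import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
image = rng.random((16, 16))
n_classes = 3

# One 3x3 kernel per class gives one score map per class, same spatial size as the input.
kernels = rng.standard_normal((n_classes, 3, 3))
scores = np.stack([correlate2d(image, k, mode='same') for k in kernels])   # (3, 16, 16)

# Softmax over the class axis: S[i, j, k] = P(pixel (j, k) belongs to class i).
exp = np.exp(scores - scores.max(axis=0, keepdims=True))
S = exp / exp.sum(axis=0, keepdims=True)

print(S.shape)                         # (3, 16, 16)
print(np.allclose(S.sum(axis=0), 1.0)) # True: probabilities sum to 1 at every pixel
print(S.argmax(axis=0).shape)          # (16, 16): a pixel-wise label map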


Data Types
Convolutional networks process data consisting of multiple channels, where each channel
represents a different quantity observed at a specific point in space or time. Examples include
1-D audio waveforms over time, skeleton animation data representing joint angles, 2-D audio
spectrograms with frequency and time axes, color images with red, green, and blue channels,
3-D volumetric data like CT scans, and color video with temporal, height, and width
dimensions. A key advantage of convolutional networks is their ability to handle inputs with
varying spatial extents, which traditional fixed-size matrix multiplication networks cannot
accommodate. In such cases, the convolution kernel is applied a variable number of times
according to the input size, and the output scales accordingly. Convolution can also be
interpreted as matrix multiplication, where the kernel induces a doubly block circulant matrix
whose size depends on the input. When both input and output are allowed to vary, such as in
pixel-wise labeling, no additional design is needed. However, when a fixed-size output is
required, such as assigning a single class label to an image, strategies like pooling layers with
regions scaled to the input are used to maintain a consistent number of pooled outputs.
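
A minimal sketch of this idea (not from the notes): pooling regions are scaled to the input so the classifier always receives the same number of summary statistics, whatever the spatial size.

import numpy as np

def fixed_size_pool(feature_map, out_h=2, out_w=2):
    # Split the rows and columns into out_h x out_w regions scaled to the input,
    # then take the max of each region, so the output grid size is always fixed.
    H, W = feature_map.shape
    rows = np.array_split(np.arange(H), out_h)
    cols = np.array_split(np.arange(W), out_w)
    return np.array([[feature_map[np.ix_(r, c)].max() for c in cols] for r in rows])

rng = np.random.default_rng(0)
print(fixed_size_pool(rng.random((13, 17))).shape)  # (2, 2)
print(fixed_size_pool(rng.random((32, 48))).shape)  # (2, 2) for a larger input as well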


Convolution is effective only when the variable input sizes correspond to different amounts
of the same type of observation, such as varying recording lengths or spatial widths. It is not
suitable when variability arises from heterogeneous features, for example, when processing
college applications where some applicants have grades but not standardized test scores,
because the same kernel cannot meaningfully operate across different types of data.


Efficient Convolution Algorithms

Efficient convolution algorithms are critical for modern networks, which often contain over a
million units. While parallel computation can accelerate processing, choosing the right
convolution algorithm can further improve speed. One approach is to perform convolution in
the frequency domain: transform both the input and kernel using a Fourier transform,
multiply them point-wise, and then apply an inverse Fourier transform. For certain problem
sizes, this is faster than naive discrete convolution. Another strategy involves separable
kernels, where a d-dimensional kernel can be expressed as the outer product of d vectors, one
per dimension. Instead of performing a single d-dimensional convolution, one can compose d
one-dimensional convolutions, which is much more efficient in both runtime and parameter
storage. For a kernel of width w in each dimension, naive convolution requires O(w^d) operations, while separable convolution reduces this to O(w × d). Not all kernels are separable,
so research continues on faster or approximate convolution methods that maintain model
accuracy. Even optimizations that improve only forward propagation are valuable, as
deployment often consumes more resources than training in commercial applications.
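
Both ideas can be checked numerically with SciPy (a sketch, not part of the original notes): fftconvolve performs the transform, point-wise multiply, and inverse-transform route, and a separable kernel built as an outer product can be applied as two 1-D convolutions.

import numpy as np
from scipy.signal import convolve2d, fftconvolve

rng = np.random.default_rng(0)
I = rng.random((64, 64))

# Frequency-domain convolution: transform, multiply point-wise, transform back.
K = rng.random((9, 9))
print(np.allclose(convolve2d(I, K, mode='same'),
                  fftconvolve(I, K, mode='same')))      # True (up to rounding error)

# Separable kernel: a 2-D kernel that is the outer product of two 1-D vectors.
v = np.array([1.0, 2.0, 1.0])
K_sep = np.outer(v, v)                                  # 3x3 separable kernel
two_d = convolve2d(I, K_sep, mode='same')               # one 2-D convolution
one_d = convolve2d(convolve2d(I, v[None, :], mode='same'),
                   v[:, None], mode='same')             # two 1-D convolutions composed
print(np.allclose(two_d, one_d))                        # True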

Convolutional Networks and the History of Deep Learning

Convolutional networks have been pivotal in the history of deep learning, representing a
successful translation of insights from neuroscience to machine learning. They were among
the first deep models to perform effectively, long before arbitrary deep networks were
considered viable, and they quickly proved their commercial value. For instance, in the
1990s, AT&T developed a convolutional network for check reading, and by the decade’s end,
NCR deployed this system to process over 10% of checks in the United States. Microsoft also
implemented convolutional net-based OCR and handwriting recognition systems.
Convolutional networks have consistently excelled in competitions, including the landmark
ImageNet challenge in 2012, although they had already won smaller contests previously.
They were some of the earliest deep networks successfully trained with back-propagation.
Their success, unlike fully connected networks of the time, may have been due to
computational efficiency, allowing more experiments and better hyperparameter tuning.
Additionally, larger networks tend to be easier to train, and modern hardware has revealed
that fully connected networks can perform reasonably well under proper conditions.
Convolutional networks are specialized neural networks for data with a grid-structured topology, particularly excelling at two-dimensional image processing, and they paved the way for broader acceptance of neural networks. For sequential, one-dimensional data, recurrent neural networks provide another powerful specialization.
