
UNIT II

CONVOLUTIONAL NEURAL NETWORKS


Convolution Operation - Sparse Interactions - Parameter Sharing - Equivariance - Pooling - Convolution Variants: Strided - Tiled - Transposed and Dilated Convolutions; CNN Learning: Nonlinearity Functions - Loss Functions - Regularization - Optimizers - Gradient Computation.

2.1 CONVOLUTION OPERATION

Convolutional operations are fundamental to convolutional neural networks (CNNs), which are
widely used in deep learning for tasks involving image and signal processing.

The convolution operation focuses on extracting and preserving important features from the input. It allows the network to detect horizontal and vertical edges of an image and then, based on those edges, build high-level features in the following layers of the neural network.

 In its general form, convolution is an operation on two functions of a real-valued argument. To motivate the definition of convolution, we start with an example of two functions we might use.

 Convolution is a mathematical operation that combines two functions to produce a third function. In the context of CNNs, it involves sliding a filter (kernel) over an input feature map to produce an output feature map.

Process:

 Filter (Kernel): A small matrix of weights.

 Stride: The number of pixels by which the filter moves across the input.

 Padding: Adding pixels to the input to control the spatial dimensions of the output.

 Activation Map: The result of applying the filter to the input through element-wise
multiplication and summation.

Mathematical Representation

Formula: For an input I and filter F, the convolution output S at position (i, j) is:

S(i, j) = (I ∗ F)(i, j) = Σm Σn I(i + m, j + n) · F(m, n)

where m and n index the dimensions of the filter.
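This formula maps directly to code. Below is a minimal NumPy sketch of a "valid" 2D convolution as CNN libraries typically compute it (cross-correlation, i.e., without flipping the kernel); the function name conv2d and the vertical-edge-detector example are illustrative choices, not from the original text.

import numpy as np

def conv2d(I, F):
    # Valid 2D convolution (cross-correlation, as used in CNNs)
    m, n = F.shape
    out_h = I.shape[0] - m + 1
    out_w = I.shape[1] - n + 1
    S = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiplication and summation over the receptive field
            S[i, j] = np.sum(I[i:i+m, j:j+n] * F)
    return S

# A vertical edge detector applied to a 5 x 5 input with a bright-to-dark transition
I = np.array([[10, 10, 0, 0, 0]] * 5, dtype=float)
F = np.array([[1, 0, -1]] * 3, dtype=float)
print(conv2d(I, F))   # strong responses along the vertical edge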

Types of Convolution Operations

 1D Convolution: Used for sequence data such as time series or text.

 2D Convolution: Used for image data.

 3D Convolution: Used for volumetric data such as videos or 3D medical scans.

Importance of Convolutional Operations

 Local Receptive Fields: Convolutional layers exploit local spatial coherence by connecting only a
local region of the input to the output.

 Parameter Sharing: Each filter is used across the entire input, reducing the number of
parameters compared to fully connected layers.

 Translation Invariance: Convolutions help in detecting features regardless of their spatial location in the input image.

Applications of Convolutional Operations

 Image Classification: Identifying the class of an object in an image (e.g., recognizing digits in
MNIST dataset).

 Object Detection: Identifying and localizing objects within an image.

 Image Segmentation: Partitioning an image into regions or objects (e.g., identifying different
objects in a scene).

 Video Analysis: Processing sequences of images to recognize actions or events.

Examples of Convolutional Layers in CNN Architectures

 LeNet-5: One of the first CNNs used for handwritten digit recognition.

 AlexNet: Popularized the use of CNNs with deep architectures for image classification.

 VGGNet: Known for its simplicity and depth, using small 3x3 filters.

 ResNet: Introduced residual connections to enable training of very deep networks.


THE CONVOLUTION OPERATION IN DETAIL

To motivate the definition of convolution, we start with an example of two functions we might use.

 Suppose we are tracking the location of a spaceship with a laser sensor. Laser Sensor
provides a single output x(t), the position of the spaceship at time t. Both "x" and "t" are
real-valued, i.e., we can get a different reading from the laser sensor at any instant in time.

 Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of
the spaceship's position, we would like to average together several measurements. Of
course, more recent measurements are more relevant, so we will want this to be a
weighted average that gives more weight to recent measurements.

 We can do this with a weighting function w(a), where a is the age of a measurement. If we apply such a weighted-average operation at every moment, we obtain a new function s providing a smoothed estimate of the position of the spaceship:

s(t) = (x ∗ w)(t) = Σ x(t − a) · w(a), summing over the ages a

This operation is the convolution of the input x with the kernel w; a short code sketch of this smoothing follows.
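To make the weighted average concrete, here is a small sketch with synthetic sensor readings; the decaying weights and the noisy data are invented for illustration.

import numpy as np

# Hypothetical noisy position readings x(t) from the laser sensor
t = np.arange(100)
x = 0.5 * t + 5.0 * np.random.randn(100)

# Weighting function w(a): recent measurements (small age a) get more weight
w = np.exp(-np.arange(5) / 2.0)
w /= w.sum()                         # normalize so the result is a weighted average

# s(t) = sum over a of x(t - a) * w(a): the discrete convolution of x with w
s = np.convolve(x, w, mode='valid')  # smoothed estimate of the position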

 Convolution operation uses three parameters: Input Image, Feature detector and Feature
map.

 The convolution operation involves an input matrix and a filter, also known as the kernel. The input matrix can be the pixel values of a grayscale image, whereas a filter is a relatively small matrix that detects edges by darkening areas of the input image where there are transitions from brighter to darker areas. There can be different types of filters depending upon what type of features we want to detect, e.g. vertical, horizontal, or diagonal.

 The convolution operation, shown in Fig 2.1, is known as the feature detector of a CNN. The input to a convolution can be raw data or a feature map output from another convolution. It is often interpreted as a filter, in which the kernel filters the input data for certain kinds of information.
 Sometimes a 5 x 5 or a 7 x 7 matrix is used as a feature detector. The feature detector is often referred to as a "kernel" or a "filter". At each step, the kernel is multiplied by the input data values within its bounds, creating a single entry in the output feature map.

FIG 2.1: Convolution Operation

 Generally an image can be considered as a matrix whose elements are numbers between 0 and 255. The size of the image matrix is: image height x image width x number of image channels.

 A grayscale image has 1 channel, whereas a colour image has 3 channels.

 Kernel: A kernel is a small matrix of numbers that is used in image convolutions. Differently sized kernels containing different patterns of numbers produce different results under convolution. The size of a kernel is arbitrary, but 3 x 3 is often used.

FIG 2.2: Example of Kernel

 Convolutional layers perform transformations on the input data volume that are a function
of the activations in the input volume and the parameters.

 In reality, convolutional neural networks develop multiple feature detectors and use them to develop several feature maps, which together form a convolutional layer, as shown in Fig 2.3.
 Through training, the network determines what features it finds important in order for it to
be able to scan images and categorize them more accurately.

 Convolutional layers have parameters for the layer and additional hyper-parameters. Gradient descent is used to train the parameters in this layer such that the class scores are consistent with the labels in the training set.

FIG 2.3: Feature detectors

 Components of convolutional layers are as follows :

a) Filters
b) Activation maps
c) Parameter sharing
d) Layer-specific hyper-parameters
 Filters are functions that have a width and height smaller than the width and height of the input volume. The filters are tensors, and they are used to convolve the input tensor when the tensor is passed to the layer instance. The random initial values inside the filter tensors are the weights of the convolutional layer.

 Each filter is slid across the spatial dimensions (width, height) of the input volume during the forward pass of information through the CNN. This produces a two-dimensional output called an activation map for that specific filter.
2.2 SPARSE INTERACTIONS

1. Definition of Sparse Interaction

Sparse Connectivity: In CNNs, each neuron in the convolutional layer is connected only to a small,
localized region of the input feature map, unlike fully connected layers where every neuron is connected
to every neuron in the previous layer.

 Receptive Field: The local region of the input that a neuron in the convolutional layer "sees" is
called the receptive field. This locality enforces sparsity in interactions.

 Sparse interactions are also referred to as sparse connectivity or sparse weights. Sparse interaction is implemented by using kernels (feature detectors) smaller than the input image, i.e., making the kernel smaller than the input.
 If we have an input image of size 256 by 256, the features we want to detect, such as edges, may occupy only a small subset of pixels in the image, so the kernels that detect them can be much smaller than the input. This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations.
 The sparse interaction idea uses a convolution kernel to interact with a local region of the image. This region is called the receptive field. This improves the parameter count and efficiency compared with a fully connected layer.
 For example, when processing a three-channel picture, the image contains thousands or millions of pixels, but when we only need to detect the edge information in the image, we do not need to connect to every pixel of the whole picture; a convolutional kernel covering only a handful of pixels suffices. This calculation method not only improves computational efficiency, but also saves a large part of the parameter space, as sketched below.
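As a rough illustration of the savings (with invented sizes), compare the number of connections needed to produce one output value per pixel with and without sparse interactions:

# Connections needed to produce m outputs from n inputs (illustrative sizes)
n = 256 * 256          # input pixels
m = 256 * 256          # output units
k = 3 * 3              # kernel entries

dense_connections = n * m      # fully connected: about 4.3 billion
sparse_connections = m * k     # each output sees only a 3 x 3 receptive field: about 590 thousand

print(dense_connections, sparse_connections)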

2. How Sparse Interaction Works

 Local Filters: Convolutional layers use filters (or kernels) that slide over the input feature map,
performing element-wise multiplication and summation. Each filter interacts with only a subset of
the input data.

 Reduced Parameters: Because filters are smaller than the input feature maps, the number of
parameters in a convolutional layer is significantly lower than in a fully connected layer, making
the model more efficient and less prone to overfitting.
3. Benefits of Sparse Interaction

 Efficiency: Fewer connections mean fewer parameters to learn and less computational cost,
which is crucial for training deep networks on large datasets.

 Translation Invariance: Sparse connectivity allows CNNs to detect features regardless of their
position in the input, leading to translation invariance.

 Locality: Ensures that the network learns spatial hierarchies and local patterns, which are
important for tasks like image recognition and processing.

4. Applications and Advantages

 Image Classification: Sparse interaction helps in efficiently extracting features like edges,
textures, and patterns.
 Object Detection: Enables the network to detect objects regardless of their spatial location.
 Medical Imaging: Used in tasks like tumor detection where local features are crucial.

2.3 PARAMETER SHARING

 Parameter Sharing: In CNNs, the same set of weights (filter) is used across different parts of the
input feature map. This means that the same parameters are applied to different locations in the
input, allowing the network to detect the same feature regardless of its position in the input image.

 The user can reduce the number of parameters by making the assumption that if one feature is useful to compute at some spatial position (x1, y1), then it is also useful to compute at a different position (x2, y2).

 In other words, we denote a single 2D slice of the output volume as a depth slice. During backpropagation, every neuron in the network computes the gradient for its weights, but these gradients are added up across each depth slice, and only a single set of weights is updated per slice.

 If all neurons in a single depth slice use the same weight vector, then the forward pass of the convolutional layer can be computed in each depth slice as a convolution of the neuron's weights with the input volume. This is why it is common to refer to the set of weights as a filter (or a kernel) that is convolved with the input.

 Fig 2.4 shows that convolution shares the same parameters across all spatial locations.

FIG 2.4: Convolution shares the same parameters across all spatial locations

Benefits of Parameter Sharing

 Reduced Number of Parameters: By using the same filter across the input, the number of
parameters is significantly reduced compared to fully connected layers where each neuron has its
own set of weights.

 Learning Efficiency: Fewer parameters mean faster training and less risk of overfitting. The
network can generalize better since the same features are learned irrespective of their location in
the input.

 Translation Invariance: Parameter sharing contributes to the ability of CNNs to recognize patterns
anywhere in the input image, making them invariant to translations.
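A quick back-of-the-envelope comparison, using illustrative LeNet-like sizes rather than figures from the text, shows the effect of sharing one filter across all positions:

# Mapping a 32 x 32 x 3 input to a 28 x 28 x 6 output (illustrative sizes)

# Convolutional layer: six 5 x 5 x 3 filters shared across all positions, plus biases
conv_params = 6 * (5 * 5 * 3 + 1)                           # = 456

# Fully connected layer: every output unit has its own weight for every input value
fc_params = (32 * 32 * 3) * (28 * 28 * 6) + (28 * 28 * 6)   # about 14.5 million

print(conv_params, fc_params)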

Applications and Advantages

 Image Recognition: Parameter sharing helps in detecting features such as edges, corners, and
textures across the image.
 Object Detection: Enables the network to identify objects irrespective of their position within the
image.
 Medical Imaging: Used for tasks like detecting anomalies in medical scans where features need to
be recognized at various locations.

2.4 EQUIVARIANCE

 The convolution function is equivariant to translation. This means that shifting the input and then applying convolution is equivalent to applying convolution to the input and then shifting the output.
 If we move the object in the input, its representation will move the same amount in the output.
 General definition: if representation(transform(x)) = transform(representation(x)), then the representation is equivariant to the transform.
 Convolution is equivariant to translation. This is a direct consequence of parameter sharing. It is
useful when detecting structures that are common in the input. For example, edges in an image.
 Equivariance in early layers is good. We are able to achieve translation-invariance (via max-
pooling) due to this property.
 Convolution is not equivariant to other operations such as change in scale or rotation.
 Example of equivariance: with 2D images, convolution creates a map of where certain features appear in the input. If we move the object in the input, its representation moves the same amount in the output. This is useful for detecting edges in the first layer of a convolutional network: the same edges appear everywhere in the image, so it is practical to share parameters across the entire image. A small numerical check of this property is sketched below.
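The following small numerical check illustrates translation equivariance, assuming SciPy is available; a circular shift and wrap-around boundaries are used so that the identity holds exactly at the borders.

import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
I = rng.random((8, 8))              # input "image"
F = rng.random((3, 3))              # filter

shift = lambda A: np.roll(A, 2, axis=1)   # translate 2 pixels to the right

a = correlate2d(shift(I), F, mode='same', boundary='wrap')   # convolve the shifted input
b = shift(correlate2d(I, F, mode='same', boundary='wrap'))   # shift the convolved output
print(np.allclose(a, b))   # True: representation(transform(x)) == transform(representation(x))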

2.5 POOLING

Pooling helps the representation become slightly invariant to small translations of the input. A pooling function takes the output of the previous layer at a certain location L and computes a summary statistic of the neighborhood around L.

 The pooling layer reduces the height and width of the input. It helps to reduce computation, and it helps make feature detectors more invariant to their position in the input.

 The function of the pooling layer is to progressively reduce the spatial size of the representation, to reduce the amount of parameters and computation in the network, and hence to also control overfitting. No learning takes place in the pooling layers.

 The addition of a pooling layer after the convolutional layer is a common pattern used for ordering layers within a convolutional neural network, and it may be repeated one or more times in a given model.

 The pooling layer operates upon each feature map separately to create a new set of the same number of pooled feature maps. Pooling involves selecting a pooling operation, much like a filter, to be applied to the feature maps.

 The size of the pooling operation or filter is smaller than the size of the feature map. In the common case, the pooling layer reduces the size of each feature map by a factor of 2, e.g., each dimension is halved, reducing the number of pixels or values in each feature map to one quarter of the original size.

 For example a pooling layer applied to a feature map of 6 x 6 (36 pixels) will result in an
output pooled feature map of 3 x 3 (9 pixels). The pooling operation is specified rather
than learned.
 The pooling operation, also called subsampling, is used to reduce the dimensionality of
feature maps from the convolution operation. Max pooling and average pooling are the
most common pooling operations used in the CNN.
 Pooling units are obtained using functions like max pooling and average pooling, with each pooling block being reduced to a single value, that of the "winning unit". Backpropagation through the pooling layer then assigns the error to this single "winning unit".
 Pooling layers, also known as downsampling layers, conduct dimensionality reduction, reducing the number of parameters in the input. Similar to the convolutional layer, the pooling operation sweeps a filter across the entire input, but the difference is that this filter does not have any weights. Instead, the kernel applies an aggregation function to the values within its receptive field, populating the output array. There are two main types of pooling:
 Max pooling : As the filter moves across the input, it selects the pixel with the maximum
value to send to the output array. As an aside, this approach tends to be used more often
compared to average pooling.
 Average pooling : As the filter moves across the input, it calculates the average value
within the receptive field to send to the output array.
 Invariance to local translation can be useful if we care more about whether a certain feature is present rather than exactly where it is. A minimal max-pooling sketch follows.
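A minimal max-pooling sketch, reusing the 6 x 6 to 3 x 3 example from above:

import numpy as np

def max_pool(fmap, size=2, stride=2):
    # 2 x 2 max pooling with stride 2: each dimension is halved
    h, w = fmap.shape
    out = np.zeros((h // stride, w // stride))
    for i in range(0, h - size + 1, stride):
        for j in range(0, w - size + 1, stride):
            # The "winning unit" is the maximum in each pooling window
            out[i // stride, j // stride] = fmap[i:i+size, j:j+size].max()
    return out

fmap = np.arange(36, dtype=float).reshape(6, 6)
print(max_pool(fmap).shape)   # (3, 3): 36 values reduced to 9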

2.6 CONVOLUTION VARIANTS

2.6.1 Strided

Convolution functions used in practice differ slightly from the convolution operation as it is usually understood in the mathematical literature.

 In general, a convolution layer consists of the application of several different kernels to the input. This allows the extraction of several different features at all locations in the input. This means that in each layer a single kernel is not applied; multiple kernels are used as different feature detectors.
 The input is generally not real-valued but instead vector-valued. Multi-channel convolutions are commutative only if the number of output and input channels is the same.
 In order to allow for the calculation of features at a coarser level, strided convolutions can be used. The effect of a strided convolution is the same as that of a convolution followed by a downsampling stage. This can be used to reduce the representation size.
 The stride indicates the pace by which the filter moves horizontally and vertically over the pixels of the input image during convolution. Fig 2.5 shows stride during convolution.

FIG 2.5: Stride during convolution

 Stride is a parameter of the neural network's filter that modifies the amount of movement over the image or video; it is also a component in the compression of image and video data. For example, if a neural network's stride is set to 1, the filter will move one pixel, or unit, at a time.
 Stride depends on what we expect in our output image. We prefer a smaller stride size if we expect several fine-grained features to be reflected in our output. On the other hand, if we are only interested in macro-level features, we choose a larger stride size. The standard output-size formula below makes the effect concrete.
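For input width W, kernel size K, padding P, and stride S, the output width is O = floor((W - K + 2P) / S) + 1; a small sketch (the sizes are illustrative):

def conv_output_size(w, k, s=1, p=0):
    # O = floor((W - K + 2P) / S) + 1
    return (w - k + 2 * p) // s + 1

print(conv_output_size(7, 3, s=1))   # 5: the filter moves one pixel at a time
print(conv_output_size(7, 3, s=2))   # 3: stride 2 gives a coarser, smaller output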

 Tiled convolution learns a set of kernels that is cycled through as we move through space, rather than learning a separate set of weights at every spatial location as in a locally connected layer.

 It offers a compromise between a convolutional layer and a locally connected layer.

 Memory requirements for storing the parameters will increase only by a factor of the size
of this set of kernels.

 Let k be a 6-D tensor, where two of the dimensions correspond to different locations in the output map. Rather than having a separate index for each location in the output map, output locations cycle through a set of t different choices of kernel stack in each direction. If t is equal to the output width, this is the same as a locally connected layer.
2.6.3 Transposed and Dilated Convolutions

 Transposed convolutions: these convolutions are also known as deconvolutions or fractionally strided convolutions. A transposed convolutional layer carries out a regular convolution but reverts its spatial transformation.

 Fig 2.6 shows how transposed convolution with a 2 x 2 kernel is computed for a 2 x 2 input
tensor.

FIG 2.6: Transposed Convolution with a 2 x 2 kernel

The shaded portions show part of an intermediate tensor as well as the input and kernel tensor elements used for the computation.

A dilated convolution expands the window size without increasing the number of weights, by inserting zero values into the convolution kernel. Dilated convolutions can be used in real-time applications and in applications where processing power is limited, as the RAM requirements are less intensive.

Dilated convolutions are also called atrous convolutions. The central idea is that a new dilation parameter (d) is introduced, which decides on the spacing between the filter weights while performing convolution. Dilation by a factor d means that the original filter is expanded by d − 1 spaces between each pair of elements, and the intermediate empty locations are filled in with zeros; Fig 2.7 illustrates this for d = 2.

FIG 2.7: Convolution with a dilated filter where the dilation factor is d = 2
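A short PyTorch sketch of both variants; the shapes are chosen to match the 2 x 2 transposed-convolution example above and an illustrative dilated case.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 2, 2)                      # one sample, one channel, 2 x 2 input

# Transposed convolution: a 2 x 2 kernel upsamples the 2 x 2 input to 3 x 3
up = nn.ConvTranspose2d(1, 1, kernel_size=2)
print(up(x).shape)                               # torch.Size([1, 1, 3, 3])

# Dilated (atrous) convolution: dilation=2 spreads a 3 x 3 kernel over a
# 5 x 5 window without adding weights, so a 7 x 7 input yields a 3 x 3 output
dil = nn.Conv2d(1, 1, kernel_size=3, dilation=2)
print(dil(torch.randn(1, 1, 7, 7)).shape)        # torch.Size([1, 1, 3, 3])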

2.7 CNN LEARNING

Convolutional Neural Networks (CNNs) are a class of deep learning algorithms specifically
designed for processing structured grid data, such as images. They have revolutionized the field of
computer vision by enabling highly accurate object recognition, image classification, and many other
tasks. Here’s an overview of how CNNs learn in deep learning:

1. Architecture of CNNs

 Input Layer: The input to a CNN is usually a multi-channel image (e.g., RGB channels).
 Convolutional Layers: These layers apply a set of filters (kernels) to the input, performing
convolutions to extract features such as edges, textures, and patterns.

 Pooling Layers: These layers downsample the spatial dimensions of the feature maps, reducing
the computational load and helping the network to become invariant to small translations.

 Fully Connected Layers: After several convolutional and pooling layers, the high-level reasoning
in the neural network is done via fully connected layers.

 Output Layer: The final layer provides the output, such as class probabilities in classification
tasks.

2. Learning Process

Forward Propagation:

 Convolution Operation: Filters slide over the input image, performing element-wise
multiplications and summations to produce feature maps.
 Activation Functions: Non-linear functions (e.g., ReLU) are applied to introduce non-
linearity.
 Pooling Operation: Reduces the dimensionality of the feature maps, retaining important
features.
 Flattening and Fully Connected Layers: The feature maps are flattened into a vector and
passed through fully connected layers to produce the final output.

Backpropagation:

 Loss Function: The difference between the predicted output and the actual label is
measured using a loss function (e.g., cross-entropy loss for classification tasks).

 Gradient Calculation: Gradients of the loss function with respect to each weight are
calculated using the chain rule of calculus.

 Weight Update: Weights are updated using optimization algorithms like Stochastic
Gradient Descent (SGD) or Adam to minimize the loss function.

3. Key Concepts in CNN Learning

 Parameter Sharing: The same filter (set of weights) is applied across different regions of
the input, reducing the number of parameters.
 Sparse Connectivity: Each neuron in the convolutional layer is connected to only a small
region of the input, promoting locality and reducing the number of parameters.

 Local Receptive Fields: Filters focus on local regions of the input, allowing the network to learn local patterns effectively.

4. Advantages of CNNs

 Efficient Training: Due to parameter sharing and sparse connectivity, CNNs require fewer
parameters than fully connected networks, making them easier to train.

 Translation Invariance: CNNs can recognize objects regardless of their spatial location in
the image.

 Hierarchical Feature Learning: CNNs learn hierarchical representations, from low-level features (e.g., edges) to high-level features (e.g., object parts).

5. Applications of CNNs

 Image Classification: Assigning a label to an entire image (e.g., recognizing handwritten digits).

 Object Detection: Identifying and localizing objects within an image.

 Image Segmentation: Partitioning an image into segments for tasks like medical imaging.

 Face Recognition: Identifying or verifying a person from an image.

Example of a CNN Architecture: LeNet-5

 Input Layer: 32x32 pixel image.

 C1 Convolutional Layer: 6 filters of size 5x5, resulting in 28x28 feature maps.

 S2 Pooling Layer: 2x2 pooling, resulting in 14x14 feature maps.

 C3 Convolutional Layer: 16 filters of size 5x5, resulting in 10x10 feature maps.

 S4 Pooling Layer: 2x2 pooling, resulting in 5x5 feature maps.

 C5 Convolutional Layer: 120 filters of size 5x5.

 Fully Connected Layers: A series of fully connected layers leading to the output.
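For illustration, a LeNet-5-style network can be sketched in PyTorch as follows. This is a modernized approximation (ReLU and max pooling in place of the original tanh and average pooling), not the exact historical model.

import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),     # C1: 32x32 input -> 6 feature maps of 28x28
    nn.ReLU(),
    nn.MaxPool2d(2),                    # S2: -> 6 maps of 14x14
    nn.Conv2d(6, 16, kernel_size=5),    # C3: -> 16 maps of 10x10
    nn.ReLU(),
    nn.MaxPool2d(2),                    # S4: -> 16 maps of 5x5
    nn.Conv2d(16, 120, kernel_size=5),  # C5: -> 120 maps of 1x1
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(120, 84),                 # fully connected layers
    nn.Linear(84, 10),                  # e.g., 10 digit classes
)

print(lenet5(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])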
2.7.1 Nonlinearity Functions
The weight layers in a CNN are often followed by a nonlinear activation function. The activation function takes a real-valued input and squashes it within a small range such as [0, 1] or [-1, 1]. The application of a nonlinear function after the weight layer is highly important, since it allows a neural network to learn nonlinear mappings.

 In the absence of nonlinearities, a stacked network of weight layers is equivalent to a linear mapping from the input domain to the output domain.

 A nonlinear function can also be understood as a switching or selection mechanism, which decides whether a neuron will fire or not given all of its inputs. The activation functions that are commonly used in deep networks are differentiable, to enable error backpropagation.

 Common activation functions that are used in deep neural networks are Sigmoid, Tanh, Algebraic Sigmoid, ReLU, Leaky ReLU/PReLU and the Exponential Linear Unit. These activation functions are shown in Fig 2.8.

FIG 2.8: Common Activation Functions


Sigmoid: The sigmoid activation function takes in a real number as its input and outputs a number in the range [0, 1].

Tanh: The tanh activation function implements the hyperbolic tangent function to squash the input values within the range [-1, 1].

Algebraic sigmoid function: The algebraic sigmoid function also maps the input within the range [-1, 1].

Rectified linear unit: The ReLU is a simple activation function which is of special practical importance because of its quick computation. A ReLU function maps the input to 0 if it is negative and keeps its value unchanged if it is positive.

Leaky ReLU: The rectifier function completely switches off the output if the input is negative. A leaky ReLU function does not reduce the output to a zero value; rather, it outputs a down-scaled version of the negative input.

Exponential linear units: The exponential linear units have both positive and negative values, and they therefore try to push the mean activations toward zero. This helps in speeding up the training process while achieving better performance.
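These functions are short enough to state directly. The following NumPy sketch uses one common form of each; the algebraic sigmoid, in particular, has several variants in the literature, so the form below is an assumption.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # output in (0, 1)

def tanh(x):
    return np.tanh(x)                          # output in (-1, 1)

def algebraic_sigmoid(x):
    return x / (1.0 + np.abs(x))               # output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)                  # 0 for negative inputs

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)           # down-scaled negative inputs

def elu(x, a=1.0):
    return np.where(x > 0, x, a * (np.exp(x) - 1))   # pushes mean activation toward 0

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))   # [0.  0.  0.  0.5 2. ]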

2.8 LOSS FUNCTION
 A loss function computes the difference between the estimated output of the model
(prediction) and the correct output (the ground truth).

 All the algorithms in machine learning rely on minimizing or maximizing a function which we call the "objective function". The group of functions that are minimized are called "loss functions". A loss function is a measure of how good a prediction model is in terms of being able to predict the expected outcome.

 Loss functions are used to calculate the difference between the predicted output and the actual output.

 The loss function is the function that computes the distance between the current output of the algorithm and the expected output. It is a method to evaluate how well our algorithm models the data. Loss functions can be categorized into two groups: one for classification (discrete values, e.g., 0, 1, 2, ...) and the other for regression (continuous values).

 The type of loss function used in a CNN model depends on the end problem. The generic set of problems for which neural networks are usually used can be categorized into the following categories:

1. Binary Classification (SVM hinge loss, squared hinge loss).

2. Identity Verification (contrastive loss).

3. Multi-class Classification (softmax loss, expectation loss).

4. Regression (SSIM, L1 error, Euclidean loss).

Loss Function Notation

• Loss function notation is as follows:

a) N — the number of samples collected.

b) P — the number of input features gathered.

c) M — the number of output features that have been observed.

d) (X, Y) — the input and output data collected; there will be N such pairs, where each input X is a collection of P values and each output Y is a collection of M values. We denote the ith pair in the dataset as (Xi, Yi).

e) Ŷ — the output of the neural network, i.e., the result of the network transforming the input Xi to produce a prediction.

f) Thus Ŷi refers to the prediction for the ith sample collected.

g) Loss function = L(W, b).

Loss Functions for Regression

Regression involves predicting a specific value that is continuous in nature. Estimating the price of a house or predicting stock prices are examples, because one works towards building a model that predicts a real-valued quantity.

Mean Square Error:

Mean Square Error (MSE) is the most commonly used regression loss function. MSE is the sum of squared distances between our target variable and the predicted values.

• Mean Squared Error is the average of the squared differences between the actual and the predicted values. For a data point Yi with predicted value Ŷi, where n is the total number of data points in the dataset, the mean squared error is defined as:

MSE = (1/n) · Σ (Yi − Ŷi)², with the sum running over i = 1, ..., n

• Advantage: For small errors, MSE helps converge to the minima efficiently, as the gradient reduces gradually.

• Drawbacks:

a) Squaring the values does increase the rate of training, but at the same time, an extremely large loss may lead to a drastic jump during backpropagation, which is not desirable.

b) MSE is also sensitive to outliers.
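A minimal MSE sketch with invented numbers:

import numpy as np

def mse(y_true, y_pred):
    # Average of the squared differences between actual and predicted values
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])
print(mse(y_true, y_pred))   # (0.25 + 0 + 4) / 3 = 1.4166...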

Loss Functions for Classification

Classification problems involve predicting a discrete class output. This involves dividing the dataset into different and unique classes based on different parameters, so that a new and unseen record can be put into one of the classes.
1. Hinge loss

• Hinge loss is a specific loss function used by Support Vector Machines (SVM). This loss function helps the SVM to make a decision boundary with a certain margin distance.

• The equation for hinge loss, when data points must be categorized as -1 or 1, is as follows:

L(y, ŷ) = max(0, 1 − y · ŷ)

• Hinge loss is mostly used for binary classification, as sketched below.
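A minimal hinge-loss sketch with invented labels and scores:

import numpy as np

def hinge(y_true, y_pred):
    # y_true in {-1, +1}; the loss is zero once the margin y * y_hat reaches 1
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))

y_true = np.array([1.0, -1.0, 1.0])
y_pred = np.array([0.8, -0.5, -0.2])
print(hinge(y_true, y_pred))   # (0.2 + 0.5 + 1.2) / 3 = 0.6333...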

Squared hinge:

• Many extensions of hinge loss are available for use with SVM models. One of the popular extensions is called squared hinge loss. It simply calculates the square of the hinge loss value.

• Squared hinge loss has the effect of smoothing the surface of the error function, making it numerically easier to work with.

• When the hinge loss gives poor performance on a given binary classification problem, it is often observed that a squared hinge loss may be appropriate to use. As with the hinge loss function, the target variable must be modified to have values in the set {-1, 1}.

• It is simple to implement in Python: we only have to change the loss function name to "squared_hinge" in the compile() function when building the model.

• A typical application is classifying email into 'spam' and 'not spam' when we are only interested in the classification accuracy. Let us see how squared hinge can be used with Keras. It just involves specifying it as the loss function during the model compilation step:

# Compile the model
model.compile(loss='squared_hinge',
              optimizer=tensorflow.keras.optimizers.Adam(learning_rate=0.03),
              metrics=['accuracy'])
2.9 REGULARIZATION

Regularization may be defined as any modification or change in the learning algorithm that helps reduce its error over a test dataset, commonly known as generalization error, but not necessarily on the supplied training dataset.

In learning algorithms, there are many variants of regularization techniques, each of which tries to
cater to different challenges. These can be listed down straightforwardly based on the kind of challenge
the technique is trying to deal with:

1. Some try to put extra constraints on the learning of an ML model, like adding restrictions on the
range/type of parameter values.

2. Some add more terms in the objective or cost function, like a soft constraint on the parameter
values. More often than not, a careful selection of the right constraints and penalties in the cost
function contributes to a massive boost in the model's performance, specifically on the test
dataset.

3. These extra terms can also be encoded based on some prior information that closely relates to
the dataset or the problem statement.

4. One of the most commonly used regularization techniques is creating ensemble models, which
take into account the collective decision of multiple models, each trained with different samples of
data.

The main aim of regularization is to reduce the over-complexity of machine learning models and help the model learn a simpler function to promote generalization. A minimal sketch of the penalty-based approach (point 2 above) follows.
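As a sketch of approach 2 above (a soft constraint on parameter values), an L2 penalty can be added to the loss; the model, the penalty strength, and the helper name below are illustrative. In PyTorch, the closely related weight_decay option of the optimizer achieves a similar effect.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
mse = nn.MSELoss()
lam = 1e-3   # regularization strength (illustrative value)

def regularized_loss(pred, target):
    # Data term plus a soft constraint penalizing large weights
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return mse(pred, target) + lam * l2

# Roughly equivalent alternative: let the optimizer apply weight decay
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=lam)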

Regularization in Deep Learning:

In the context of deep learning models, most regularization strategies revolve around regularizing estimators. So now the question arises: what does regularizing an estimator mean?

The bias-versus-variance tradeoff (see the graph in Fig 2.9) sheds a bit more light on the nuances of this topic and its demarcation:

Regularization of an estimator works by trading increased bias for reduced variance. An effective regularizer is one that makes the best trade between bias and variance: the end product of the tradeoff should be a significant reduction in variance at minimum expense to bias. In simpler terms, this means achieving low variance without immensely increasing the bias value.
FIG 2.9: Regularization graph

2.10 OPTIMIZATION

In machine learning, optimizers and loss functions are two components that help improve the
performance of the model. By calculating the difference between the expected and actual outputs of a
model, a loss function evaluates the effectiveness of a model. Among the loss functions are log loss,
hinge loss, and mean square loss. By modifying the model’s parameters to reduce the loss function
value, the optimizer contributes to its improvement. RMSProp, ADAM, and SGD are a few examples of
optimizers. The optimizer’s job is to determine which combination of the neural network’s weights and
biases will give it the best chance to generate accurate predictions.

Optimization Rule in Deep Neural Networks

There are various optimization techniques to change model weights and learning rates, like Gradient Descent, Stochastic Gradient Descent, Stochastic Gradient Descent with momentum, Mini-Batch Gradient Descent, AdaGrad, RMSProp, AdaDelta, and Adam. These optimization techniques play a critical role in the training of neural networks, as they help improve the model by adjusting its parameters to minimize the loss function value. Choosing the best optimizer depends on the application. Some key terms:
1. The epoch is the number of times the algorithm iterates over the entire training dataset.

2. Batch size refers to the number of samples used for updating the model parameters.

3. A sample is a single record of data in a dataset.

4. Learning rate is a parameter determining the scale of the model weight updates.

5. Weights and biases are learnable parameters in a model that regulate the signal between two neurons.
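In PyTorch, switching between these optimization rules is a one-line change; the hyper-parameter values below are illustrative defaults, not recommendations from the text.

import torch

params = [torch.nn.Parameter(torch.randn(5, 5))]

sgd      = torch.optim.SGD(params, lr=0.01)                # plain gradient descent steps
momentum = torch.optim.SGD(params, lr=0.01, momentum=0.9)  # SGD with momentum
rmsprop  = torch.optim.RMSprop(params, lr=0.001)           # per-parameter adaptive rates
adam     = torch.optim.Adam(params, lr=0.001)              # momentum + adaptive rates
adagrad  = torch.optim.Adagrad(params, lr=0.01)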

2.11 GRADIENT COMPUTATION

Gradient computation in deep learning is a fundamental concept used during the training of neural
networks. It involves calculating the gradients of the loss function with respect to the model's parameters
(weights and biases). These gradients are then used to update the parameters in a way that minimizes
the loss function. Here's an overview of the key steps and concepts involved:

1. Loss Function

The loss function (or cost function) measures how well the neural network's predictions match the
actual target values. Common loss functions include Mean Squared Error (MSE) for regression tasks and
Cross-Entropy Loss for classification tasks.

2. Forward Pass

In the forward pass, the input data is passed through the network layer by layer, applying
activations and computing the output. The output is then compared to the target values using the loss
function.

3. Backward Pass (Backpropagation)

The backward pass involves propagating the error back through the network to compute the
gradients. This is done using the chain rule of calculus. The main steps are:

a. Compute the Gradient of the Loss with respect to the Output (dL/dO)

For the final layer, this is the derivative of the loss function with respect to the output of the
network.
b. Backpropagate the Gradient through Each Layer

For each layer, the gradients of the loss with respect to the weights and biases are computed.
This involves:

 Calculating the gradient of the loss with respect to the layer's output (dL/dO) and then the gradient
with respect to the input of that layer (dL/dI).

 Using these gradients to compute the gradients with respect to the weights (dL/dW) and biases
(dL/dB).
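For a single linear layer, these gradients can be written out by hand. The following NumPy sketch (with invented shapes and an MSE loss) shows dL/dW, dL/dB, and dL/dI explicitly:

import numpy as np

# Forward pass for one linear layer O = I @ W + b with an MSE loss
I = np.random.randn(4, 3)            # batch of 4 inputs with 3 features
W = np.random.randn(3, 2)
b = np.zeros(2)
target = np.random.randn(4, 2)

O = I @ W + b
dL_dO = 2.0 * (O - target) / len(I)  # gradient of the MSE loss w.r.t. the output

dL_dW = I.T @ dL_dO                  # gradient w.r.t. the weights
dL_dB = dL_dO.sum(axis=0)            # gradient w.r.t. the biases
dL_dI = dL_dO @ W.T                  # gradient passed back to the previous layer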

c. Update the Weights and Biases

Using the computed gradients, the weights and biases are updated. This is typically done using an optimization algorithm like Stochastic Gradient Descent (SGD), Adam, or RMSprop. The update rule for SGD, with learning rate η, is:

W ← W − η · (dL/dW),   B ← B − η · (dL/dB)

4. Automatic Differentiation

Modern deep learning frameworks like TensorFlow and PyTorch use automatic differentiation to
efficiently compute gradients. These frameworks build a computation graph during the forward pass and
then use it to automatically compute the gradients during the backward pass.

import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple model
model = nn.Sequential(
    nn.Linear(10, 5),
    nn.ReLU(),
    nn.Linear(5, 1)
)
# Define a loss function and optimizer
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Dummy input and target
inputs = torch.randn(3, 10)
targets = torch.randn(3, 1)

# Forward pass
outputs = model(inputs)
loss = loss_fn(outputs, targets)

# Backward pass
optimizer.zero_grad()
loss.backward()

# Update parameters
optimizer.step()

In this example, the gradients of the loss with respect to each parameter are computed
automatically using the .backward() method, and the optimizer updates the parameters using these
gradients.

Advantages and Disadvantages of Gradient Descent:

Advantages of Gradient Descent

1. Widely used: Gradient descent and its variants are widely used in machine learning and
optimization problems because they are effective and easy to implement.

2. Convergence: Gradient descent and its variants can converge to a global minimum or a good
local minimum of the cost function, depending on the problem and the variant used.
3. Scalability: Many variants of gradient descent can be parallelized and are scalable to large
datasets and high-dimensional models.

4. Flexibility: Different variants of gradient descent offer a range of trade-offs between accuracy and
speed, and can be adjusted to optimize the performance of a specific problem.

Disadvantages of gradient descent:

1. Choice of learning rate: The choice of learning rate is crucial for the convergence of gradient descent and its variants. Choosing a learning rate that is too large can lead to oscillations or overshooting, while choosing a learning rate that is too small can lead to slow convergence or getting stuck in local minima.

2. Sensitivity to initialization: Gradient descent and its variants can be sensitive to the initialization
of the model’s parameters, which can affect the convergence and the quality of the solution.

3. Time-consuming: Gradient descent and its variants can be time-consuming, especially when
dealing with large datasets and high-dimensional models. The convergence speed can also vary
depending on the variant used and the specific problem.

4. Local optima: Gradient descent and its variants can converge to a local minimum instead of the
global minimum of the cost function, especially in non-convex problems. This can affect the
quality of the solution, and techniques like random initialization and multiple restarts may be used
to mitigate this issue.
PART A

1. Define Convolutional Networks.

2. How are sparse interactions used in convolutional networks? What are the benefits?

3. Why are sparse interactions beneficial?

4. What is equivariance representation?

5. List the types of pooling.

6. Explain the pros of tiled convolution.

7. What is Convolution?

8. Which are the four main operations in a CNN?

9. Do sparse interactions cause a reduction in performance in convolutional networks?

10. Define full convolution.

11. What is gradient descent?

12. What is the difference between a linear unit and a rectified linear unit?

13. Define Loss Function.

14. Define Padding in CNN.

15. What is the use of parameter sharing in CNN?

PART-B

1. Explain in detail about Convolution Operation.


2. Discuss about Pooling.
3. Examine about Convolution Variants.
4. Discuss about Fully Connected Layers.
5. Simplify the CNN Learning and its methods.
6. Explain in detail about Loss Functions.
7. Illustrate about Gradient Computation and its working methodology.
