Unit II

The document outlines the course 'Introduction to Deep Learning' focusing on Deep Convolutional Neural Networks (CNN). It covers essential components such as various deep learning layers (convolution, pooling, activation), optimizers (SGD, RMSProp, AdaDelta), and loss functions used in training models. Key concepts include the operations of each layer, the significance of activation functions, and the role of optimizers in minimizing loss during model training.

Uploaded by

harikrishnanmelath

Department of

Computer Science and Engineering

Course Name / Code : Introduction to Deep Learning / 10212CS215
Slot : S2
Category : Program Elective
Faculty Name : R.T.THIVYA LAKSHMI.,M.TECH.,(Phd).,
TTS : 3821
Assistant Professor, CSE

School of Computing
Vel Tech Rangarajan Dr. Sagunthala R&D Institute of
Science and Technology
Unit – II Deep Convolutional Neural Networks (CNN)

• Deep Learning Layers - Convolution layer, properties and operations

• Pooling Layer - Role and its types

• Activation Layer - Significance and its types

• Optimizers - Stochastic gradient, RMSProp, AdaDelta, Adam Optimizer

• Loss Function - Mean Square Error (MSE), Binary Cross Entropy, Categorical Cross Entropy and Sparse Categorical Cross Entropy


Deep Learning Layers
• Deep learning models are typically composed of multiple layers that process and
transform input data. Each layer performs a specific computation and passes its
output to the next layer. Here are some commonly used layers in deep learning:
i. Input Layer: This layer receives the raw input data, such as images, text, or audio.
It acts as the entry point for the model.
ii. Convolutional Layer: This layer is commonly used in computer vision tasks. It
applies a set of learnable filters (kernels) to the input data, performing convolution
operations to extract relevant features. Convolutional layers are especially effective at
capturing spatial relationships in images.
iii. Pooling Layer: Also known as subsampling or downsampling layer, it reduces the
spatial dimensions of the input, helping to decrease the computational complexity of
the model. Max pooling and average pooling are commonly used techniques in this
layer.
iv. Fully Connected Layer (Dense Layer): This layer connects every neuron from the
previous layer to every neuron in the current layer. It transforms the input data into a
suitable format for the final output layer. Fully connected layers are often used in the
later stages of deep learning models.
v. Activation Layer: Activation functions introduce non-linearities into the model,
enabling it to learn complex relationships in the data. Popular activation functions
include ReLU (Rectified Linear Unit), sigmoid, and tanh.
vi. Batch Normalization Layer: This layer normalizes the inputs to a layer over each
mini-batch, typically by subtracting the mean and dividing by the standard deviation.
It helps stabilize the training process and accelerates convergence.
vii. Output Layer: The final layer of the network produces the desired output based on the
task at hand. It may use different activation functions depending on the problem, such
as softmax for multi-class classification or sigmoid for binary classification.
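
The normalization step described for the batch normalization layer can be sketched in plain Python. This is only an illustration, not a framework implementation: the function name `batch_norm`, the small constant `eps`, and the optional scale and shift `gamma` and `beta` are assumptions introduced for this example.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch: subtract the mean, divide by the std."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # eps (assumed) avoids division by zero when all values in the batch are equal
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

# Hypothetical activations of one neuron for a mini-batch of four samples
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
print(normalized)
```

After normalization the values have a mean of approximately 0 and a variance of approximately 1, which is what keeps the inputs to the next layer on a stable scale.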
Convolutional Layers
• The convolutional layer is a fundamental building block of convolutional neural
networks (CNNs).

• The convolutional layer applies a set of learnable filters (also known as kernels or
convolutional kernels) to the input data in order to extract relevant features.

i. Input: The input to the convolutional layer is a multi-dimensional array, often referred
to as a feature map or activation map.

- In the case of image data, the input is typically a three-dimensional array with
dimensions [height, width, channels],

- where height and width represent the spatial dimensions of the image, and channels
represent the color channels (e.g., RGB channels).
ii. Convolution Operation: The convolutional layer convolves the input with a set of
learnable filters.

- Each filter (kernel) is a small array of weights, typically of size [filter_height,
filter_width, input_channels].

- The filter is applied to a small receptive field (patch) of the input: the filter and
the patch are multiplied element-wise and the products are summed (a dot product),
giving one value of the output feature map.
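For a stride of 1 and no padding, the size of the output along each spatial dimension is:

Output size = Size of input − size of kernel + 1

For example, with an input of size 6 and a kernel of size 3:

Size of feature map = (6 − 3) + 1 = 4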
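
The sliding-window operation and the size formula above can be checked with a short sketch in plain Python. The 6 x 6 image and the vertical-edge kernel are hypothetical values chosen for illustration, and `conv2d_valid` is a name introduced here. As in most CNN libraries, the kernel is applied without flipping, and a single channel is assumed.

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over a square image with stride 1 and no padding."""
    n, k = len(image), len(kernel)
    out_size = n - k + 1  # Output = size of input - size of kernel + 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Multiply the patch and the kernel element-wise, then sum the products
            total = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(k)
                for b in range(k)
            )
            row.append(total)
        output.append(row)
    return output

# Hypothetical 6 x 6 input: left half bright (1), right half dark (0)
image = [[1, 1, 1, 0, 0, 0] for _ in range(6)]
# Hypothetical 3 x 3 kernel that responds to vertical edges
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

feature_map = conv2d_valid(image, kernel)
print(len(feature_map), "x", len(feature_map[0]))  # 4 x 4
for row in feature_map:
    print(row)  # [0, 3, 3, 0]
```

The non-zero values appear only where the patch covers the boundary between the bright and dark halves, which is how a filter "detects" a local pattern such as an edge.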

iii. Stride and Padding: During the convolution operation, the filter is typically
applied to the input with a certain stride (step size), which determines the amount
of overlap between receptive fields.

- When the stride is 1, the filter moves 1 pixel at a time. Similarly, when the stride
is 2, the filter moves 2 pixels at a time, and so on.

- Padding is a technique used to preserve the spatial dimensions of the input
after convolution operations on a feature map.

- Padding involves adding extra pixels (usually zeros) around the border of the input
feature map before convolution.

[Figure: a filter applied to three input matrices. First matrix: stride = 1; second matrix: stride = 2; third matrix: stride = 2.]

iv. Non-linear Activation: After the convolution operation, a non-linear activation function,
such as the Rectified Linear Unit (ReLU), is commonly applied element-wise to introduce
non-linearities into the model.

- This helps the model learn complex relationships and makes it more expressive.

v. Output: The output of the convolutional layer is a feature map that represents the filtered
and transformed version of the input data. It retains the spatial structure of the input; its
spatial dimensions depend on the kernel size, stride, and padding, and are reduced unless
padding compensates.
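
The effect of stride and padding on the output size can be sketched with the commonly used formula floor((n + 2p − k) / s) + 1, where n is the input size, k the kernel size, p the padding, and s the stride. This general formula is not stated on the slides; it reduces to "input − kernel + 1" when s = 1 and p = 0. The function name `conv_output_size` is an assumption for this example.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Size of the feature map along one spatial dimension."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(6, 3))                       # 4
print(conv_output_size(6, 3, stride=2))             # 2
print(conv_output_size(6, 3, stride=1, padding=1))  # 6
```

The last call shows why padding is said to preserve the spatial dimensions: with a 3 x 3 kernel, one pixel of padding on each side keeps a 6 x 6 input at 6 x 6.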

• Multiple filters are typically used in a convolutional layer to capture different features
or patterns from the input. Each filter learns to detect different local patterns or
structures in the data, such as edges, corners, or textures.

• The output of the convolutional layer forms the input to the subsequent layers in the
network, enabling the model to learn hierarchical representations of the input data.
Pooling Layers
• Pooling layers, also known as subsampling or downsampling layers, are commonly
used in convolutional neural networks (CNNs) to reduce the spatial dimensions
(height and width) of the feature maps while retaining the important information.

• Here are two commonly used types of pooling layers:

i. Max Pooling: Max pooling is a pooling operation that selects the maximum value
within a region of the input feature map. It helps to extract the most dominant features
while discarding the less relevant details.

ii. Average Pooling: Average pooling computes the average value within each region of
the input feature map. Similar to max pooling, it divides the input into non-overlapping
regions and computes the average value within each region.
• The choice of pooling layer and its parameters (e.g., pooling region size, stride)
depends on the specific task, dataset, and architecture.
• Pooling layers are often interleaved with convolutional layers to extract
hierarchical representations of the input data. By repeatedly applying convolution
and pooling operations, the network can learn to capture increasingly abstract
features while reducing the spatial dimensions.
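
Max pooling and average pooling can be compared on a small example. The sketch below assumes non-overlapping 2 x 2 regions (stride equal to the region size) and a hypothetical 4 x 4 feature map; the function name `pool2d` and its `mode` argument are introduced here for illustration.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Pool non-overlapping size x size regions of a 2D feature map."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, rows - size + 1, size):
        pooled_row = []
        for j in range(0, cols - size + 1, size):
            region = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            if mode == "max":
                pooled_row.append(max(region))                # most dominant value
            else:
                pooled_row.append(sum(region) / len(region))  # mean of the region
        pooled.append(pooled_row)
    return pooled

# Hypothetical 4 x 4 feature map
fm = [[1, 3, 2, 4],
      [5, 6, 1, 2],
      [7, 2, 9, 1],
      [3, 4, 5, 6]]

max_pooled = pool2d(fm, mode="max")
avg_pooled = pool2d(fm, mode="average")
print(max_pooled)  # [[6, 4], [7, 9]]
print(avg_pooled)  # [[3.75, 2.25], [4.0, 5.25]]
```

In both cases the 4 x 4 map shrinks to 2 x 2: the height and width are halved, while the number of channels (here one) is unchanged.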
Activation Layers
• The activation function layer can be thought of as the "brain" of the CNN, where the
input is transformed into a meaningful representation of the data. It is a fundamental
component of the network.

• For each neuron, the weighted sum of the inputs is calculated and a bias is added. The
activation function is then applied to this value and decides whether, and how strongly,
the neuron is activated.

• For this reason, it is also referred to as a threshold or transformation function for the
neurons, which helps the network converge.

• Some activation functions, such as sigmoid and tanh, squash the output into the range
0 to 1 or -1 to 1. Because activation functions are differentiable (at least almost
everywhere), gradients can be computed through them during backpropagation.

• During backpropagation, the gradients of the loss function are used to update the weights,
and a suitable activation function helps gradient descent converge towards a minimum of the loss.
Types of Activation Functions
i. Linear function: A linear function has the equation of a straight line, i.e.
y = x.

- No matter how many layers we have, if all are linear in nature, the output of the
last layer is just a linear function of the input of the first layer.

- Range is (−∞, ∞)
ii. Non-linear functions
(a) Sigmoid function: It is a function which is plotted as an 'S'-shaped graph. $A = \frac{1}{1 + e^{-x}}$

- Range (0, 1)
- It is differentiable and gives a smooth gradient curve. Sigmoid is mostly
used in the output layer for binary classification.
- Its smooth gradient prevents sudden jumps in output values. Inputs far from 0 map to
outputs close to 0 or 1, which enables clear predictions.
(b) Tanh function: An activation that often works better than the sigmoid function in hidden
layers is the tanh function, also known as the hyperbolic tangent function.

- It is a scaled and shifted version of the sigmoid function, so the two can be derived
from each other: $\tanh(x) = 2\,\sigma(2x) - 1$, where $\sigma$ is the sigmoid function.

- $A = \frac{2}{1 + e^{-2x}} - 1$. Range (-1, 1). Tanh is symmetric around the origin and is useful in
capturing both positive and negative values.

(c) Softmax function: Softmax is commonly used in multi-class classification tasks. It transforms
the input into a probability distribution over multiple classes, with each class having a
probability between 0 and 1 and all probabilities summing to 1.

- Range (0, 1)
(d) ReLU function: It stands for Rectified Linear Unit. It is one of the most widely used
activation functions, chiefly in the hidden layers of neural networks.

- It sets all negative values in the input to zero and leaves positive values unchanged.

- Range [0, ∞). A(x) = max(0, x)

- ReLU is computationally efficient and helps address the vanishing gradient problem
by allowing better gradient flow during backpropagation.
(e) Leaky ReLU function: Leaky ReLU is a variant of the ReLU function that introduces a
small slope for negative values, preventing zero gradients for negative inputs.

- It is defined as A(x) = max(gx, x), where 'g' is a small positive constant less than 1 (for example, 0.01).

• The choice of activation function depends on the task, the nature of the data, and the
architecture of the model.
• Different activation functions have different properties and can impact the model's
learning capacity, convergence speed, and ability to handle different types of data
distributions.
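
The activation functions listed in this section can be written directly from their formulas. The sketch below uses only the Python standard library; the default slope g = 0.01 for Leaky ReLU and the subtraction of the maximum inside `softmax` (a common trick for numerical stability) are assumptions not taken from the slides.

```python
import math

def linear(x):
    return x                                        # range (-inf, inf)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))               # range (0, 1)

def tanh(x):
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0   # range (-1, 1)

def relu(x):
    return max(0.0, x)                              # range [0, inf)

def leaky_relu(x, g=0.01):
    return max(g * x, x)                            # small slope g for negative x

def softmax(xs):
    m = max(xs)                                     # assumed: shift for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]                # probabilities that sum to 1

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), leaky_relu(x))

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

Evaluating the functions at a negative, a zero, and a positive input makes the differences visible: ReLU returns 0 for the negative input, Leaky ReLU returns a small negative value, and sigmoid and tanh stay inside their bounded ranges.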
Optimizers
• Optimizers play a crucial role in training deep learning models by iteratively
adjusting the model's parameters to minimize the loss function.

• Optimizers use various optimization algorithms to update the weights and biases
of the model based on the gradients computed during backpropagation.

• Important Deep Learning Terms

i. Epoch – One complete pass of the algorithm over the whole training
dataset. The number of epochs is the number of such passes.

ii. Sample – A single row of a dataset.

iii. Batch – The number of samples used for one update of the
model parameters.

iv. Learning rate (LR) – A parameter that scales how much the model
weights are updated at each step.

v. Cost Function/Loss Function – A cost function is used to calculate the
cost, which measures the difference between the predicted value and the
actual value.

vi. Weights/Bias – The learnable parameters of a model that control the
signal between two neurons.
Gradient Descent (GD)
It uses the derivative of the loss function and the learning rate to reduce the
loss and reach a minimum.

This approach is also used in backpropagation in neural networks.

Gradient descent can be used to find the local minima and maxima of a function:
i. If we move in the direction of the negative gradient of the function at the current point,
we move towards a local minimum of that function.

ii. If we move in the direction of the positive gradient of the function at the current point,
we move towards a local maximum of that function (this is known as gradient ascent).
• The main objective of using a gradient descent algorithm is to minimize the cost
function using iteration. To achieve this goal, it performs two steps iteratively:
i. Calculate the first-order derivative of the function to compute the gradient or slope of that
function at the current point.
ii. Move in the direction opposite to the gradient, by a step of alpha times the gradient,
where alpha is the learning rate: theta = theta − alpha * gradient. The learning rate is a
tuning parameter in the optimization process which decides the length of the steps.
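
The two iterative steps can be illustrated on a one-dimensional example. The cost function J(x) = (x − 3)^2, the starting point, and the learning rate below are hypothetical choices made for this sketch; the update itself is the rule x = x − alpha * gradient.

```python
def gradient_descent(grad, start, alpha=0.1, steps=100):
    """Repeat: compute the gradient, then step against it by alpha times the gradient."""
    x = start
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# Hypothetical cost J(x) = (x - 3)^2 with gradient dJ/dx = 2 * (x - 3); minimum at x = 3
minimum = gradient_descent(lambda x: 2.0 * (x - 3.0), start=0.0)
print(minimum)
```

With alpha = 0.1 each step shrinks the distance to the minimum by a constant factor, so the result ends up very close to 3. A much larger alpha would overshoot and could diverge, while a much smaller one would need many more steps.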
• Stochastic Gradient Descent (SGD): SGD is one of the simplest and most
widely used optimizers. In its strict form it uses one training sample per iteration;
in practice a small mini-batch of samples is commonly used.

- It updates the model's parameters by taking small steps proportional to
the negative gradient of the loss function with respect to the parameters.

- SGD can suffer from slow convergence and can get stuck in local
minima, but it is computationally efficient and often serves as a baseline for other
optimization algorithms.

SGD procedure
1. Mini-Batch Sampling: In SGD, instead of computing gradients on the entire training
dataset, a mini-batch of training examples is randomly sampled for each iteration. The mini-
batch size is typically smaller than the total number of training examples but larger than 1.

2. Gradient Computation: For each mini-batch, the gradients of the loss function with respect
to the model's parameters are computed. This is done by performing forward propagation to
compute the predictions of the model and then backpropagation to calculate the gradients.

3. Parameter Update: After computing the gradients, the model's parameters are updated to
minimize the loss function. The update is performed by subtracting a fraction of the
gradient from the current parameter values. The fraction is determined by the learning rate,
which controls the step size of the update. The learning rate is usually a small positive
value, such as 0.1 or 0.01.

4. Iteration: Steps 1 to 3 are repeated for a fixed number of iterations or until a convergence
criterion is met. Each iteration processes a different mini-batch of training examples, and
the gradients are computed and used to update the parameters.
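
The four steps of the SGD procedure can be sketched for a very small model. The example below fits a line y = w * x + b to hypothetical noise-free data using mean squared error; the data, the learning rate, the batch size, and the function name `sgd_linear_regression` are all assumptions made for illustration.

```python
import random

def sgd_linear_regression(xs, ys, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x + b with mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                              # step 1: random mini-batches
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:                               # step 2: gradients on the batch
                error = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * error * xs[i]
                grad_b += 2.0 * error
            grad_w /= len(batch)
            grad_b /= len(batch)
            w -= lr * grad_w                              # step 3: parameter update
            b -= lr * grad_b
    return w, b                                           # step 4: loop until epochs are done

# Hypothetical data generated from y = 2x + 1
xs = [i / 10.0 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_regression(xs, ys)
print(w, b)
```

Each pass of the outer loop is one epoch, and each pass of the inner loop is one iteration on one mini-batch. The learned values should end up close to the slope 2 and intercept 1 used to generate the data.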
• Root Mean Square Propagation (RMSProp): RMSProp computes and updates a separate
learning rate for each individual weight in a neural network.

- RMSProp adaptively adjusts the learning rate for each weight based
on the magnitudes of the recent gradients that have been computed for that weight.

- This adaptation is done by maintaining an exponentially decaying
average of the squared gradients, and the learning rate for each weight is divided by
the square root of this average.

- It aims to achieve better convergence rates and overcome some of the
limitations of traditional gradient descent methods.

- Advantages: Fast convergence, stable learning, less manual tuning of the learning rate

RMSProp procedure
1. Initialize the parameters, including the learning rate, decay rate, and a small constant
(epsilon) for numerical stability.

2. For each iteration or batch:

i. Compute the gradients of the loss function with respect to the network
weights.

ii. Update the exponentially decaying average of the squared gradients:
- Square the gradients element-wise.
- Compute the exponentially decaying average of the squared gradients using
the decay rate: E[g^2] = decay_rate * E[g^2] + (1 - decay_rate) * g^2.

iii. Compute the adaptive learning rate for each weight:
- Divide the learning rate by the square root of the average squared gradients
(element-wise), with epsilon added for numerical stability.

iv. Update the weights using the computed learning rates and the gradients.

3. Repeat until convergence or for a fixed number of iterations.
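
The RMSProp procedure above can be sketched as a single update function. The hyperparameter values (learning rate 0.01, decay rate 0.9, epsilon 1e-8) are commonly used defaults assumed for this example, and the two-weight cost function is hypothetical; epsilon is added to the square root here, which is one of several common placements.

```python
import math

def rmsprop_step(weights, grads, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update; avg_sq holds the running average of squared gradients."""
    new_weights, new_avg_sq = [], []
    for w, g, s in zip(weights, grads, avg_sq):
        s = decay * s + (1.0 - decay) * g * g       # decaying average of g^2
        adaptive_lr = lr / (math.sqrt(s) + eps)     # per-weight learning rate
        new_weights.append(w - adaptive_lr * g)     # weight update
        new_avg_sq.append(s)
    return new_weights, new_avg_sq

# Hypothetical cost J(w) = w0^2 + 10 * w1^2 with gradient (2 * w0, 20 * w1)
weights = [1.0, 1.0]
avg_sq = [0.0, 0.0]
for _ in range(500):
    grads = [2.0 * weights[0], 20.0 * weights[1]]
    weights, avg_sq = rmsprop_step(weights, grads, avg_sq)
print(weights)
```

Although the second weight has a gradient ten times larger than the first, dividing by the root of the average squared gradient gives both weights steps of similar size, which is the point of the per-weight learning rate.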

• AdaDelta Optimizer: AdaDelta is an extension of AdaGrad, closely related to RMSProp,
that further improves the adaptive learning rate scheme.

- It addresses the problem of AdaGrad's ever-accumulating squared gradients, which make
the learning rate shrink continually, by using an exponentially decaying average of past
squared gradients instead.

- AdaDelta does not require manual tuning of the learning rate and is
robust to the choice of its remaining hyperparameters (the decay rate and epsilon).

- In AdaDelta, instead of accumulating all past squared gradients, it
maintains a running average of the squared gradients and a running average of the squared
updates, and updates the parameters based on the ratio of the root mean square (RMS) of
the past updates to the RMS of the gradients.
Ada Delta procedure

1. Initialize variables:
i. Accumulated gradient squared values: E[g^2] (initialized to zero)
ii. Accumulated update squared values: E[delta_theta^2] (initialized to zero)
2. For each iteration:
i. Compute the gradients of the loss function with respect to the parameters.
ii. Accumulate the squared gradients: E[g^2] = rho * E[g^2] + (1 - rho) * g^2, where rho
is a decay rate (usually set to 0.95).
iii. Compute the update: delta_theta = - (sqrt(E[delta_theta^2] + epsilon) / sqrt(E[g^2] +
epsilon)) * g, where epsilon is a small value (e.g., 1e-6) added for numerical stability.
iv. Accumulate the squared updates: E[delta_theta^2] = rho * E[delta_theta^2] + (1 - rho)
* delta_theta^2.
v. Update the parameters: theta = theta + delta_theta.
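
The AdaDelta procedure translates almost line by line into code. The sketch below handles a single scalar parameter and uses the values given above (rho = 0.95, epsilon = 1e-6); the cost function J(theta) = theta^2 and the starting value are hypothetical.

```python
import math

def adadelta_step(theta, g, eg2, edx2, rho=0.95, eps=1e-6):
    """One AdaDelta update for a scalar parameter.

    eg2  is E[g^2], the running average of squared gradients.
    edx2 is E[delta_theta^2], the running average of squared updates.
    """
    eg2 = rho * eg2 + (1.0 - rho) * g * g                          # step ii
    delta = -(math.sqrt(edx2 + eps) / math.sqrt(eg2 + eps)) * g    # step iii
    edx2 = rho * edx2 + (1.0 - rho) * delta * delta                # step iv
    return theta + delta, eg2, edx2                                # step v

# Hypothetical cost J(theta) = theta^2 with gradient 2 * theta
theta, eg2, edx2 = 5.0, 0.0, 0.0
history = [theta]
for _ in range(100):
    g = 2.0 * theta
    theta, eg2, edx2 = adadelta_step(theta, g, eg2, edx2)
    history.append(theta)
print(history[0], history[1], history[-1])
```

No learning rate appears anywhere in the update. Because E[delta_theta^2] starts at zero, the first steps are very small and grow as the running average of updates builds up, so the parameter moves steadily, but at first slowly, towards the minimum.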
• Adam Optimizer: It combines the benefits of two other optimization methods,
Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation
(RMSProp), to provide efficient and effective parameter updates.

- Adam is known for its efficiency, robustness, and good convergence
properties.

- In Adam, instead of adapting learning rates based only on the second moment
(uncentered variance) of the gradients as in RMSProp, it also uses the first moment
(mean) of the gradients.

- It has become a popular choice for optimizing neural network models,
often providing faster convergence compared to traditional gradient descent variants.
Adam procedure

1. Initialize variables:
i. Exponential decay rates for the first and second moments: beta1 (typically set to 0.9) and
beta2 (typically set to 0.999).
ii. Small constant for numerical stability: epsilon (e.g., 1e-8).
iii. Step counter: t (initialized to 0).
iv. Parameters: theta (model weights), and the first and second moment estimates m and v
(initialized to zero).
2. For each iteration:
i. Increment the step counter: t = t + 1.
ii. Compute the gradients g of the loss function with respect to the parameters.
iii. Update the biased first moment estimate: m = beta1 * m + (1 - beta1) * g, where m is the
first moment estimate (mean of the gradients).
iv. Update the biased second moment estimate: v = beta2 * v + (1 - beta2) * g^2,
where v is the second moment estimate (uncentered variance of the gradients).
v. Compute the bias-corrected first moment estimate: m_hat = m / (1 - beta1^t).
vi. Compute the bias-corrected second moment estimate: v_hat = v / (1 - beta2^t).
vii. Update the parameters: theta = theta - (learning_rate * m_hat) / (sqrt(v_hat) +
epsilon), where learning_rate is a hyperparameter determining the step size.
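
The Adam procedure can be sketched for a single scalar parameter using the same variable names as the steps above. The cost function J(theta) = theta^2, the starting value, and the learning rate of 0.1 used in the loop are hypothetical choices for this example.

```python
import math

def adam_step(theta, g, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a scalar parameter; returns the new theta, m, v and t."""
    t = t + 1                                     # step counter
    m = beta1 * m + (1.0 - beta1) * g             # biased first moment
    v = beta2 * v + (1.0 - beta2) * g * g         # biased second moment
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - (learning_rate * m_hat) / (math.sqrt(v_hat) + epsilon)
    return theta, m, v, t

# A single step from theta = 1.0 with gradient 2.0 moves by about learning_rate
first_theta, _, _, _ = adam_step(1.0, 2.0, 0.0, 0.0, 0)
print(first_theta)

# Hypothetical cost J(theta) = theta^2 with gradient 2 * theta
theta, m, v, t = 1.0, 0.0, 0.0, 0
for _ in range(200):
    g = 2.0 * theta
    theta, m, v, t = adam_step(theta, g, m, v, t, learning_rate=0.1)
print(theta)
```

The first call shows the effect of bias correction: with m and v initialized to zero, m_hat equals g and v_hat equals g^2 on the first step, so the step size is approximately the learning rate regardless of how large the gradient is.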
Loss Function
• A loss function, also known as a cost function or objective function, is a measure
that quantifies the discrepancy between the predicted output of a model and the
true target values.
• It indicates how well the model is performing on a specific task or problem.
• The goal is to minimize the value of the loss function; a lower value indicates a better fit
of the model to the data.
• Here are some commonly used loss functions:
i. Mean Squared Error (MSE): MSE is a popular loss function for regression
problems. It computes the average squared difference between the predicted and true
values. It is defined as:

$\mathrm{MSE} = \frac{1}{n} \sum \left( y_{pred} - y_{true} \right)^2$

where y_pred represents the predicted values, y_true represents the true values, and n is
the number of data points.
ii. Binary Cross-Entropy: It is commonly used for binary classification problems. It
measures the dissimilarity between the predicted probabilities and the true binary
labels. The formula for binary cross-entropy is:

$-\left[ y_{true} \log(y_{pred}) + (1 - y_{true}) \log(1 - y_{pred}) \right]$

where y_pred is the predicted probability and y_true is the true binary label (0 or 1).
iii. Categorical Cross-Entropy: It is used for multi-class classification problems. It
calculates the dissimilarity between the predicted class probabilities and the true
one-hot encoded labels. The formula is:

$-\sum \left( y_{true} \log(y_{pred}) \right)$

where y_pred is the predicted probability distribution over classes, y_true is the true one-hot
encoded label, and the sum runs over the classes.
- One-hot encoding is a technique that transforms a categorical variable into a binary
vector, with a 1 at the position of the true class and 0 everywhere else.
iv. Sparse Categorical Cross-Entropy: It is used in multi-class classification problems
where the target labels are represented as integers rather than one-hot encoded
vectors. It is particularly useful when the number of classes is large.
- In contrast to Categorical Cross-Entropy (CCE), which requires the target labels to
be one-hot encoded, Sparse CCE accepts integer labels directly.
- This makes it more memory-efficient and avoids the need for converting labels into
one-hot encoded vectors. The formula is:

$-\sum_{i} \log\left( y_{pred\_i} \right)$

where y_pred_i represents the predicted probability assigned to the true class label for the ith
sample.
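
The four loss functions can be compared side by side in a short sketch. The function names and the example predictions are assumptions made for illustration; predicted probabilities are assumed to lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over n data points."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred):
    """Binary cross-entropy for one sample; y_true is 0 or 1."""
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

def categorical_cross_entropy(y_true_one_hot, y_pred):
    """Categorical cross-entropy for one sample with a one-hot label."""
    return -sum(t * math.log(p) for t, p in zip(y_true_one_hot, y_pred) if t)

def sparse_categorical_cross_entropy(labels, y_preds):
    """Sum over samples of -log(probability assigned to the true integer label)."""
    return -sum(math.log(pred[label]) for label, pred in zip(labels, y_preds))

print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(binary_cross_entropy(1, 0.9), binary_cross_entropy(1, 0.1))

# Hypothetical predictions for two samples over three classes
y_preds = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
int_labels = [0, 2]                       # integer labels for Sparse CCE
one_hot_labels = [[1, 0, 0], [0, 0, 1]]   # the same labels, one-hot encoded for CCE

cce_total = sum(categorical_cross_entropy(t, p)
                for t, p in zip(one_hot_labels, y_preds))
sparse_total = sparse_categorical_cross_entropy(int_labels, y_preds)
print(cce_total, sparse_total)
```

The last two values agree: Sparse CCE and CCE compute the same quantity and differ only in how the labels are stored. The binary cross-entropy example also shows that a confident wrong prediction (0.1 for a true label of 1) is penalized far more than a confident correct one.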
Cost Function Vs. Loss Function

| Cost Function | Loss Function |
| --- | --- |
| Quantifies the overall cost or error of the model on the entire training set. | Measures the error between predicted and actual values in a machine learning model. |
| Used to guide the optimization process by minimizing the cost or error. | Used to optimize the model during training. |
| Aggregates the loss values over the entire training set. | Can be specific to individual samples. |
| Often the average or sum of individual loss values in the training set. | Examples include mean squared error (MSE), mean absolute error (MAE), and binary cross-entropy. |
| Used to determine the direction and magnitude of parameter updates during optimization. | Used to evaluate model performance. |
| Typically derived from the loss function, but can include additional regularization terms or other considerations. | Different loss functions can be used for different tasks or problem domains. |
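
The distinction in the table can be made concrete with a small sketch: a loss is computed per sample, and a cost aggregates those losses over the training set, optionally with a regularization term. The squared-error loss, the L2 penalty, and all names and values below are assumptions chosen for illustration.

```python
def squared_error(y_true, y_pred):
    """Loss for a single sample."""
    return (y_pred - y_true) ** 2

def cost(y_true, y_pred, weights, l2=0.0):
    """Cost over the whole set: mean of per-sample losses plus an optional L2 penalty."""
    losses = [squared_error(t, p) for t, p in zip(y_true, y_pred)]
    data_term = sum(losses) / len(losses)        # aggregate of individual losses
    penalty = l2 * sum(w * w for w in weights)   # assumed regularization term
    return data_term + penalty

y_true = [1.0, 2.0]
y_pred = [2.0, 4.0]
weights = [1.0, -2.0]

per_sample = [squared_error(t, p) for t, p in zip(y_true, y_pred)]
print(per_sample)                                 # [1.0, 4.0]
print(cost(y_true, y_pred, weights))              # 2.5
print(cost(y_true, y_pred, weights, l2=0.1))
```

With `l2=0.0` the cost is simply the average loss. With a non-zero `l2` the cost also depends on the size of the weights, which is an example of the "additional regularization terms" mentioned in the last row of the table.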
Thank You
