Department of
Computer Science and Engineering
Course Name / Code : Introduction to Deep Learning /
10212CS215
Slot : S2
Category : Program Elective
Faculty Name : R.T.THIVYA LAKSHMI.,M.TECH.,(Phd).,
TTS : 3821
Assistant Professor, CSE
School of Computing
Vel Tech Rangarajan Dr. Sagunthala R&D Institute of
Science and Technology
Unit – II Deep Convolutional Neural Networks (CNN)
• Deep Learning Layers - Convolution layer, properties and operations
• Pooling Layer - Role and its types
• Activation Layer - Significance and its types
• Optimizers - Stochastic gradient, RMSProp, AdaDelta, Adam Optimizer
• Loss Function - Mean Square Error (MSE), Binary Cross Entropy,
Categorical Cross Entropy and Sparse Categorical Cross Entropy
Deep Learning Layers
Deep learning models are typically composed of multiple layers that process and
transform input data. Each layer performs a specific computation and passes the
output to the next layer. Here are some commonly used layers in deep learning:
i. Input Layer: This layer receives the raw input data, such as images, text, or audio.
It acts as the entry point for the model.
ii. Convolutional Layer: This layer is commonly used in computer vision tasks. It
applies a set of learnable filters (kernels) to the input data, performing convolution
operations to extract relevant features. Convolutional layers are especially effective at
capturing spatial relationships in images.
iii. Pooling Layer: Also known as subsampling or downsampling layer, it reduces the
spatial dimensions of the input, helping to decrease the computational complexity of
the model. Max pooling and average pooling are commonly used techniques in this
layer.
iv. Fully Connected Layer (Dense Layer): This layer connects every neuron from the
previous layer to every neuron in the current layer. It transforms the input data into a
suitable format for the final output layer. Fully connected layers are often used in the
later stages of deep learning models.
v. Activation Layer: Activation functions introduce non-linearities into the model,
enabling it to learn complex relationships in the data. Popular activation functions
include ReLU (Rectified Linear Unit), sigmoid, and tanh.
vi. Batch Normalization Layer: This layer normalizes the input data to each layer,
typically by subtracting the mean and dividing by the standard deviation. It helps
stabilize the training process and accelerates convergence.
vii. Output Layer: The final layer of the network produces the desired output based on the
task at hand. It may use different activation functions depending on the problem, such
as softmax for multi-class classification or sigmoid for binary classification.
Convolutional Layers
The convolutional layer is a fundamental building block of convolutional neural
networks (CNNs).
The convolutional layer applies a set of learnable filters (also known as kernels or
convolutional kernels) to the input data in order to extract relevant features.
i. Input: The input to the convolutional layer is a multi-dimensional array, often referred
to as a feature map or activation map.
- In the case of image data, the input is typically a three-dimensional array with
dimensions [height, width, channels], where height and width represent the spatial
dimensions of the image, and channels represent the color channels (e.g., RGB channels).
ii. Convolution Operation: The convolutional layer convolves the input with a set of
learnable filters.
- Each filter (kernel) is a small array of weights, typically of size [filter_height,
filter_width, input_channels].
- The filter is applied to a small receptive field (patch) of the input: the filter and the
input patch are multiplied element-wise and the products are summed (a dot product).
- With no padding and a stride of 1: output size = input size − kernel size + 1
- Example: a 6 × 6 input with a 3 × 3 kernel gives a feature map of size (6 − 3) + 1 = 4, i.e. 4 × 4.
iii. Stride and Padding: During the convolution operation, the filter is typically
applied to the input with a certain stride (step size), which determines the amount
of overlap between receptive fields.
- When the stride is 1, the filter moves 1 pixel at a time. Similarly, when the stride
is 2, the filter moves 2 pixels at a time, and so on.
- Padding is a technique used to preserve the spatial dimensions of the input image
after convolution operations on a feature map.
- Padding involves adding extra pixels around the border of the input feature map
before convolution.
[Figure: three example convolutions with different stride settings; the second and third use stride = 2.]
iv. Non-linear Activation: After the convolution operation, a non-linear activation function,
such as the Rectified Linear Unit (ReLU), is commonly applied element-wise to introduce
non-linearities into the model.
- This helps the model learn complex relationships and makes it more expressive.
v. Output: The output of the convolutional layer is a feature map that represents the filtered
and transformed version of the input data. It retains the spatial structure of the input, but
its spatial dimensions may be reduced, depending on the kernel size, stride, and padding settings.
Multiple filters are typically used in a convolutional layer to capture different features
or patterns from the input. Each filter learns to detect different local patterns or
structures in the data, such as edges, corners, or textures.
The output of the convolutional layer forms the input to the subsequent layers in the
network, enabling the model to learn hierarchical representations of the input data.
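The convolution, stride, and padding steps above can be sketched in plain Python. This is a minimal single-channel illustration, not a library implementation: the function names `conv_output_size` and `conv2d` are hypothetical, zero padding is assumed, and the general size formula floor((n + 2p − k) / s) + 1 is the usual extension of the stride-1, no-padding formula given earlier.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def conv2d(image, kernel, stride=1, padding=0):
    """Naive single-channel 2D convolution (no kernel flip, as in CNN layers)."""
    if padding > 0:
        # Surround the image with a border of zeros.
        width = len(image[0]) + 2 * padding
        zero_rows = [[0] * width for _ in range(padding)]
        middle = [[0] * padding + list(row) + [0] * padding for row in image]
        image = zero_rows + middle + [[0] * width for _ in range(padding)]
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply patch and kernel element-wise, then sum the products.
            total = 0
            for a in range(k_h):
                for b in range(k_w):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


image = [[1] * 6 for _ in range(6)]    # 6 x 6 input of ones
kernel = [[1] * 3 for _ in range(3)]   # 3 x 3 kernel of ones
feature_map = conv2d(image, kernel)    # stride 1, no padding
print(len(feature_map), len(feature_map[0]))         # 4 4
print(conv_output_size(6, 3, stride=1, padding=1))   # 6
print(conv_output_size(6, 3, stride=2, padding=0))   # 2
```

With a padding of 1 the 6 × 6 input keeps its size, while a stride of 2 shrinks it to 2 × 2.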
Pooling Layers
Pooling layers, also known as subsampling or downsampling layers, are commonly
used in convolutional neural networks (CNNs) to reduce the spatial dimensions (height
and width) of the feature maps while retaining the important information.
Here are two commonly used types of pooling layers:
i. Max Pooling: Max pooling is a pooling operation that selects the maximum value
within a region of the input feature map. It helps to extract the most dominant features
while discarding the less relevant details.
ii. Average Pooling: Average pooling computes the average value within each region of
the input feature map. Similar to max pooling, it divides the input into non-overlapping
regions and computes the average value within each region.
The choice of pooling layer and its parameters (e.g., pooling region size, stride)
depends on the specific task, dataset, and architecture.
Pooling layers are often interleaved with convolutional layers to extract
hierarchical representations of the input data. By repeatedly applying convolution
and pooling operations, the network can learn to capture increasingly abstract
features while reducing the spatial dimensions.
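A small sketch of max and average pooling on a single 2D feature map. The function name `pool2d`, the 2 × 2 window with stride 2, and the example values are assumptions chosen for illustration.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D feature map (list of lists)."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the values inside the current pooling region.
            window = [
                feature_map[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            ]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 5, 6],
]
print(pool2d(feature_map, mode="max"))      # [[6, 4], [7, 9]]
print(pool2d(feature_map, mode="average"))  # [[3.75, 2.25], [4.0, 5.25]]
```

Both variants reduce the 4 × 4 map to 2 × 2; max pooling keeps the strongest response in each region, average pooling keeps the mean.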
Activation Layers
The activation layer applies a non-linear function to the output of the preceding layer,
transforming it into a more useful representation of the data. It is a fundamental
component of a CNN.
Each neuron first calculates the weighted sum of its inputs and adds a bias. The activation
function is then applied to this value and decides how strongly the neuron is activated.
For this reason, it is also referred to as a threshold or transformation function for the
neurons.
Some activation functions, such as sigmoid and tanh, bound the output to the range 0 to 1
or -1 to 1. Because activation functions are differentiable, they can be used with
backpropagation.
During backpropagation, the gradient of the loss is passed back through the activation
functions, and the weights are updated by gradient descent to move the loss toward a minimum.
Types of Activation Functions
i. Linear function: A linear activation has the equation of a straight line, i.e.
y = x.
- No matter how many layers we have, if all are linear in nature, the output of the
last layer is just a linear function of the input to the first layer.
- Range: (−∞, ∞)
ii. Non-linear functions
(a) Sigmoid function: It is plotted as an 'S'-shaped graph. $A = \frac{1}{1 + e^{-x}}$
- Range: (0, 1)
- It is differentiable and gives a smooth gradient curve. Sigmoid is mostly
used in the output layer for binary classification.
- Its smooth gradient prevents sudden jumps in output values. Inputs far from
zero are pushed close to 0 or 1, which gives clear predictions.
(b) Tanh function: The tanh (hyperbolic tangent) function often works better than the
sigmoid function in hidden layers.
- It is a scaled and shifted version of the sigmoid function: tanh(x) = 2 * sigmoid(2x) − 1,
so each can be derived from the other.
- $A = \frac{2}{1 + e^{-2x}} - 1$. Range: (-1, 1). Tanh is symmetric around the origin and is
useful in capturing both positive and negative values.
(c) Softmax function: The softmax function generalizes the sigmoid function to more than
two classes.
- Softmax is commonly used in multi-class classification tasks. It transforms the input
into a probability distribution over multiple classes, with each class having a probability
between 0 and 1 and all probabilities summing to 1.
- Range: (0, 1)
(d) ReLU function: ReLU stands for Rectified Linear Unit. It is one of the most widely used
activation functions and is chiefly used in the hidden layers of neural networks.
- It sets all negative values in the input to zero and leaves positive values unchanged.
- Range: [0, ∞). A(x) = max(0, x)
- ReLU is computationally efficient and helps mitigate the vanishing gradient problem
by allowing better gradient flow during backpropagation.
(e) Leaky ReLU function: Leaky ReLU is a variant of the ReLU function that introduces a
small slope for negative values, preventing zero gradients for negative inputs.
- It is defined as A(x) = max(gx, x), where 'g' is a small positive constant less than 1 (e.g., 0.01).
The choice of activation function depends on the task, the nature of the data, and the
architecture of the model.
Different activation functions have different properties and can impact the model's
learning capacity, convergence speed, and ability to handle different types of data
distributions.
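The activation functions listed above can be written directly from their formulas. The sketch below uses only the Python standard library; the default slope g = 0.01 for Leaky ReLU and the max-subtraction trick in softmax are assumptions added for illustration and numerical stability.

```python
import math


def linear(x):
    return x


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    # Same form as in the text: 2 / (1 + e^(-2x)) - 1
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0


def relu(x):
    return max(0.0, x)


def leaky_relu(x, g=0.01):
    return max(g * x, x)


def softmax(values):
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0))            # 0.5
print(tanh(0.0))               # 0.0
print(relu(-2.0), relu(3.0))   # 0.0 3.0
print(leaky_relu(-2.0))        # -0.02
print(softmax([1.0, 2.0, 3.0]))
```

The softmax outputs are all between 0 and 1 and sum to 1, so they can be read as class probabilities.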
Optimizers
Optimizers play a crucial role in training deep learning models by iteratively
adjusting the model's parameters to minimize the loss function.
Optimizers use various optimization algorithms to update the weights and biases
of the model based on the gradients computed during backpropagation.
Important Deep Learning Terms
i. Epoch – One complete pass of the algorithm over the whole training
dataset; the number of epochs is the number of such passes.
ii. Sample – A single row of a dataset.
iii. Batch – The number of samples processed before the model parameters
are updated.
iv. Learning rate (LR) – A parameter that scales how much the model
weights are updated at each step.
v. Cost Function/Loss Function – A function used to calculate the
cost, which measures the difference between the predicted value and the
actual value.
vi. Weights/Bias – The learnable parameters in a model that control the
signal between two neurons.
Gradient Descent (GD)
It uses the derivative of the loss function and the learning rate to reduce the
loss and reach a minimum.
This approach is also used together with backpropagation to train neural networks.
Following the gradient can locate a local minimum or a local maximum of a function:
i. If we move in the direction of the negative gradient of the function at the current point,
we approach a local minimum of that function.
ii. If we move in the direction of the positive gradient of the function at the current point,
we approach a local maximum of that function (gradient ascent).
The main objective of using a gradient descent algorithm is to minimize the cost
function using iteration. To achieve this goal, it performs two steps iteratively:
i. Calculates the first-order derivative of the function to compute the gradient or slope of that
function.
ii. Move in the direction opposite to the gradient, taking a step of alpha times the slope
from the current point, where alpha is defined as the learning rate. It is a tuning
parameter in the optimization process which helps to decide the length of the steps.
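The two steps above can be sketched for a function of one variable. The cost function f(x) = (x − 3)^2, the starting point, and the function name `gradient_descent` are hypothetical choices for illustration.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        slope = grad(x)                 # step i: first-order derivative
        x = x - learning_rate * slope   # step ii: move opposite to the slope
    return x


# Hypothetical cost f(x) = (x - 3)^2 with derivative f'(x) = 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(minimum, 4))  # 3.0
```

A larger learning rate takes longer steps and may overshoot the minimum; a smaller one converges more slowly.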
Stochastic Gradient Descent (SGD): SGD is one of the simplest and most
widely used optimizers. In its strict form it uses one training sample per iteration;
in practice a small mini-batch of samples is commonly used.
- It updates the model's parameters by taking small steps proportional to
the negative gradient of the loss function with respect to the parameters.
- SGD can suffer from slow convergence and can get stuck in local
minima, but it is computationally efficient and often serves as a baseline for other
optimization algorithms.
SGD procedure
1. Mini-Batch Sampling: In SGD, instead of computing gradients on the entire training
dataset, a mini-batch of training examples is randomly sampled for each iteration. The mini-
batch size is typically smaller than the total number of training examples but larger than 1.
2. Gradient Computation: For each mini-batch, the gradients of the loss function with respect
to the model's parameters are computed. This is done by performing forward propagation to
compute the predictions of the model and then backpropagation to calculate the gradients.
3. Parameter Update: After computing the gradients, the model's parameters are updated to
minimize the loss function. The update is performed by subtracting a fraction of the
gradient from the current parameter values. The fraction is determined by the learning rate,
which controls the step size of the update. The learning rate is usually a small positive
value, such as 0.1 or 0.01.
4. Iteration: Steps 1 to 3 are repeated for a fixed number of iterations or until a convergence
criterion is met. Each iteration processes a different mini-batch of training examples, and
the gradients are computed and used to update the parameters.
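The SGD procedure can be sketched on a toy linear regression problem. Everything specific here is an assumption for illustration: the model y = w * x + b, the mean squared error loss, the noise-free data, and the hyperparameter values.

```python
import random


def sgd_linear_regression(xs, ys, learning_rate=0.05, batch_size=4,
                          epochs=500, seed=0):
    """Fit y = w * x + b with mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]   # step 1: mini-batch
            grad_w = 0.0
            grad_b = 0.0
            for i in batch:                             # step 2: gradients
                error = (w * xs[i] + b) - ys[i]
                grad_w += 2 * error * xs[i] / len(batch)
                grad_b += 2 * error / len(batch)
            w -= learning_rate * grad_w                 # step 3: update
            b -= learning_rate * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [2 * x + 1 for x in xs]
w, b = sgd_linear_regression(xs, ys)
print(round(w, 2), round(b, 2))
```

On this data the learned w and b should end up close to the generating values 2 and 1.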
Root Mean Square Propagation (RMSProp): RMSProp computes and updates a separate
learning rate for each individual weight in a neural network.
- RMSProp adaptively adjusts the learning rate for each weight based
on the magnitudes of the recent gradients that have been computed for that weight.
- This adaptation is done by maintaining an exponentially decaying
average of the squared gradients, and the learning rate for each weight is divided by
the square root of this average.
- It aims to achieve better convergence rates and overcome some of the
limitations of traditional gradient descent methods.
- Advantages: Fast convergence, stable learning, fewer hyperparameters
RMS Prop procedure
1. Initialize the parameters, including the learning rate, decay rate, and a small constant
for numerical stability.
2. For each iteration or batch:
i. Compute the gradients of the loss function with respect to the network
weights.
ii. Update the exponentially decaying average of the squared gradients:
Square the gradients element-wise.
Compute the exponentially decaying average of the squared gradients using
the decay rate.
iii. Compute the adaptive learning rate for each weight:
Divide the learning rate by the square root of the average squared gradients
(element-wise).
iv. Update the weights using the computed learning rates and the gradients.
3. Repeat until convergence or for a fixed number of iterations.
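The RMSProp procedure above can be sketched as follows. The function name `rmsprop`, the default hyperparameters, and the quadratic test loss are assumptions for illustration; epsilon is added to the square root to avoid division by zero.

```python
import math


def rmsprop(grad, theta, learning_rate=0.01, decay_rate=0.9,
            epsilon=1e-8, steps=1000):
    """Minimise a function of a list of parameters, given its gradient."""
    theta = list(theta)
    avg_sq = [0.0] * len(theta)   # decaying average of squared gradients
    for _ in range(steps):
        g = grad(theta)                                   # step 2.i
        for k in range(len(theta)):
            # Step 2.ii: update the decaying average of squared gradients.
            avg_sq[k] = decay_rate * avg_sq[k] + (1 - decay_rate) * g[k] ** 2
            # Step 2.iii: per-weight adaptive learning rate.
            adaptive_lr = learning_rate / (math.sqrt(avg_sq[k]) + epsilon)
            # Step 2.iv: update the weight.
            theta[k] -= adaptive_lr * g[k]
    return theta


# Hypothetical loss f(a, b) = a^2 + 10 * b^2 with gradient (2a, 20b).
result = rmsprop(lambda t: [2 * t[0], 20 * t[1]], [1.0, -1.0])
print([round(v, 3) for v in result])
```

Although the two gradients differ by a factor of 10, both parameters move at a similar pace because each step is scaled by that parameter's own gradient history.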
AdaDelta Optimizer: AdaDelta is an extension of AdaGrad, closely related to RMSProp,
that further improves the adaptive learning rate scheme.
- It addresses the problem of AdaGrad's ever-growing sum of squared gradients, which
shrinks the learning rate toward zero, by using an exponentially decaying average of
past squared gradients instead.
- AdaDelta does not require manual tuning of the learning rate and is
robust to different hyperparameter choices.
- In AdaDelta, instead of accumulating all past squared gradients, it
maintains running averages of the squared gradients and of the squared updates, and
updates the parameters based on the ratio of the root mean square (RMS) of the past
updates to the RMS of the gradients.
Ada Delta procedure
1. Initialize variables:
i. Accumulated gradient squared values: E[g^2] (initialized to zero)
ii. Accumulated update squared values: E[delta_theta^2] (initialized to zero)
2. For each iteration:
i. Compute the gradients g of the loss function with respect to the parameters.
ii. Accumulate the squared gradients: E[g^2] = rho * E[g^2] + (1 - rho) * g^2, where rho
is a decay rate (usually set to 0.95).
iii. Compute the update: delta_theta = - (sqrt(E[delta_theta^2] + epsilon) / sqrt(E[g^2] +
epsilon)) * g, where epsilon is a small value (e.g., 1e-6) added for numerical stability.
iv. Accumulate the squared updates: E[delta_theta^2] = rho * E[delta_theta^2] + (1 - rho)
* delta_theta^2.
v. Update the parameters: theta = theta + delta_theta.
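The AdaDelta steps above translate almost line by line into code. The function name `adadelta` and the quadratic test loss are assumptions for illustration; rho and epsilon use the values quoted in the procedure.

```python
import math


def adadelta(grad, theta, rho=0.95, epsilon=1e-6, steps=500):
    """Minimise a function of a list of parameters, given its gradient."""
    theta = list(theta)
    avg_sq_grad = [0.0] * len(theta)     # E[g^2]
    avg_sq_update = [0.0] * len(theta)   # E[delta_theta^2]
    for _ in range(steps):
        g = grad(theta)                                            # step i
        for k in range(len(theta)):
            avg_sq_grad[k] = rho * avg_sq_grad[k] + (1 - rho) * g[k] ** 2   # ii
            delta = -(math.sqrt(avg_sq_update[k] + epsilon)
                      / math.sqrt(avg_sq_grad[k] + epsilon)) * g[k]         # iii
            avg_sq_update[k] = (rho * avg_sq_update[k]
                                + (1 - rho) * delta ** 2)                   # iv
            theta[k] += delta                                               # v
    return theta


# Hypothetical loss f(x) = x^2 with gradient 2x, starting from x = 1.
start = [1.0]
result = adadelta(lambda t: [2 * t[0]], start)
print(start[0] ** 2, result[0] ** 2)  # loss before and after
```

Note that no learning rate appears anywhere: the step size comes entirely from the ratio of the two running averages, so the first steps are small (on the order of sqrt(epsilon)) and grow as updates accumulate.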
Adam Optimizer: It combines the benefits of two other optimization methods,
Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation
(RMSProp), to provide efficient and effective parameter updates.
- Adam is known for its efficiency, robustness, and good convergence
properties.
- In Adam, learning rates are adapted using the second moment (uncentered
variance) of the gradients, as in RMSProp, and the update also uses the first
moment (mean) of the gradients.
- It has become a popular choice for optimizing neural network models,
often providing faster convergence compared to traditional gradient descent variants.
Adam procedure
1. Initialize variables:
i. Exponential decay rates for the first and second moments: beta1 (typically set to 0.9) and
beta2 (typically set to 0.999).
ii. Small constant for numerical stability: epsilon (e.g., 1e-8).
iii. Step counter: t (initialized to 0).
iv. Parameters: theta (model weights), with the first and second moment estimates m and v
(initialized to zero).
2. For each iteration:
i. Increment the step counter t = t + 1
ii. Compute the gradients g of the loss function with respect to the parameters
iii. Update the biased first moment estimate m = beta1 * m + (1 - beta1) * g, where m is the
first moment estimate (mean of the gradients).
iv. Update the biased second moment estimate v = beta2 * v + (1 - beta2) * g^2,
where v is the second moment estimate (uncentered variance of the
gradients).
v. Compute the bias-corrected first moment estimate m_hat = m / (1 - beta1^t)
vi. Compute the bias-corrected second moment estimate v_hat = v / (1 - beta2^t)
vii. Update the parameters theta = theta - (learning_rate * m_hat) / (sqrt(v_hat) +
epsilon), where learning_rate is a hyperparameter determining the step size.
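The Adam procedure can be sketched in the same way. The function name `adam`, the learning rate of 0.01, and the quadratic test loss are assumptions for illustration; beta1, beta2, and epsilon use the typical values listed above.

```python
import math


def adam(grad, theta, learning_rate=0.01, beta1=0.9, beta2=0.999,
         epsilon=1e-8, steps=2000):
    """Minimise a function of a list of parameters, given its gradient."""
    theta = list(theta)
    m = [0.0] * len(theta)   # first moment estimate
    v = [0.0] * len(theta)   # second moment estimate
    for t in range(1, steps + 1):        # t is the step counter
        g = grad(theta)
        for k in range(len(theta)):
            m[k] = beta1 * m[k] + (1 - beta1) * g[k]
            v[k] = beta2 * v[k] + (1 - beta2) * g[k] ** 2
            m_hat = m[k] / (1 - beta1 ** t)   # bias correction
            v_hat = v[k] / (1 - beta2 ** t)
            theta[k] -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return theta


# Hypothetical loss f(x) = x^2 with gradient 2x, starting from x = 1.
result = adam(lambda t: [2 * t[0]], [1.0])
print(round(result[0], 3))
```

Because of the bias correction, the very first update has a magnitude of almost exactly learning_rate, regardless of the scale of the gradient.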
Loss Function
Loss function, also known as a cost function or objective function, is a measure
that quantifies the discrepancy between the predicted output of a model and the
true target values.
It represents how well the model is performing on a specific task or problem.
The goal is to minimize the value of the loss function, which indicates a better fit
of the model to the data.
Here are some commonly used loss functions:
i. Mean Squared Error (MSE): MSE is a popular loss function for regression
problems. It computes the average squared difference between the predicted and true
values. It is defined as:
$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_{pred,i} - y_{true,i} \right)^2$$
where y_pred represents the predicted values, y_true represents the true values, and n is
the number of data points.
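A minimal sketch of the MSE computation; the function name `mean_squared_error` and the example values are assumptions for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between predictions and targets."""
    n = len(y_true)
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n


# Squared errors are 1, 0 and 4, so the result is 5 / 3 (about 1.667).
mse = mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(mse)
```

Squaring makes large errors count much more than small ones, which is why MSE is sensitive to outliers.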
ii. Binary Cross-Entropy: It is commonly used for binary classification problems. It
measures the dissimilarity between the predicted probabilities and the true binary
labels. The formula for binary cross-entropy is:
$$L = -\left[ y_{true} \log(y_{pred}) + (1 - y_{true}) \log(1 - y_{pred}) \right]$$
where y_pred is the predicted probability and y_true is the true binary label (0 or 1).
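A minimal sketch of binary cross-entropy. Averaging over the samples and clipping the probabilities away from 0 and 1 (to avoid log(0)) are assumptions added here; the per-sample term follows the formula above.

```python
import math


def binary_cross_entropy(y_true, y_pred, epsilon=1e-12):
    """Mean binary cross-entropy over a list of samples."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, epsilon), 1.0 - epsilon)   # keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


confident = binary_cross_entropy([1, 0], [0.9, 0.1])
wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(round(confident, 4), round(wrong, 4))
```

Confident correct predictions give a loss near 0, while confident wrong predictions are penalised heavily.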
iii. Categorical Cross-Entropy: It is used for multi-class classification problems. It
calculates the dissimilarity between the predicted class probabilities and the true
one-hot encoded labels. The formula, summed over the C classes, is:
$$L = -\sum_{c=1}^{C} y_{true,c} \log(y_{pred,c})$$
where y_pred is the predicted probability distribution over classes and y_true is the true
one-hot encoded label.
- One-hot encoding is a technique that transforms a categorical variable into a binary
vector with a 1 at the position of its class and 0 everywhere else.
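A minimal sketch of one-hot encoding and categorical cross-entropy for a single sample. The function names and the three-class example are assumptions for illustration; the small epsilon guards against log(0).

```python
import math


def one_hot(label, num_classes):
    """Binary vector with a 1 at the index of the class."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def categorical_cross_entropy(y_true, y_pred, epsilon=1e-12):
    """y_true is a one-hot vector, y_pred a probability vector."""
    return -sum(t * math.log(max(p, epsilon)) for t, p in zip(y_true, y_pred))


target = one_hot(2, 3)
print(target)  # [0.0, 0.0, 1.0]
loss = categorical_cross_entropy(target, [0.1, 0.2, 0.7])
print(round(loss, 4))  # 0.3567
```

Because the label is one-hot, only the term for the true class contributes, so the loss here equals −log(0.7).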
iv. Sparse Categorical Cross - Entropy: It is used in multi-class classification problems
where the target labels are represented as integers rather than one-hot encoded
vectors. It is particularly useful when the number of classes is large.
- In contrast to Categorical Cross-Entropy (CCE), which requires the target labels to
be one-hot encoded, Sparse CCE accepts integer labels directly.
- This makes it more memory-efficient and avoids the need for converting labels into
one-hot encoded vectors. The formula is
$$L_i = -\log(y_{pred,i})$$
where y_pred_i represents the predicted probability assigned to the true class label for the
i-th sample.
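A minimal sketch of sparse categorical cross-entropy, where each label is an integer class index. The function name, the averaging over samples, and the example values are assumptions for illustration.

```python
import math


def sparse_categorical_cross_entropy(labels, y_pred, epsilon=1e-12):
    """labels: integer class indices; y_pred: one probability vector per sample."""
    losses = [
        -math.log(max(probs[label], epsilon))   # pick the true-class probability
        for label, probs in zip(labels, y_pred)
    ]
    return sum(losses) / len(losses)


labels = [2, 0]
predictions = [[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]]
loss = sparse_categorical_cross_entropy(labels, predictions)
print(round(loss, 4))
```

The integer label is used directly as an index, so no one-hot vector is ever built; the value matches what categorical cross-entropy would give for the equivalent one-hot labels.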
Cost Function Vs. Loss Function
| Cost Function | Loss Function |
| Quantifies the overall cost or error of the model on the entire training set. | Measures the error between predicted and actual values in a machine learning model. |
| Used to guide the optimization process by minimizing the cost or error. | Used to optimize the model during training. |
| Aggregates the loss values over the entire training set. | Can be specific to individual samples. |
| Often the average or sum of individual loss values in the training set. | Examples include mean squared error (MSE), mean absolute error (MAE), and binary cross-entropy. |
| Used to determine the direction and magnitude of parameter updates during optimization. | Used to evaluate model performance. |
| Typically derived from the loss function, but can include additional regularization terms or other considerations. | Different loss functions can be used for different tasks or problem domains. |
Thank You