
1 - Introduction To Deep Learning


Shallow vs Deep NNs

DSE 3151 Deep Learning


DSE 3151, [Link] Data Science & Engineering
August 2023
Rohini R Rao & Abhilash Pai
Department of Data Science and Computer Applications
MIT Manipal
Contents
• Shallow Networks
• Neural Network Representation
• Back Propagation
• Vectorization
• Deep Neural Networks
• Example: Learning XOR
• Architecture Design
• Loss Functions
• Metrics
• Gradient-Based Learning
• Optimization
• Diagnosing Learning Curves
• Strategies for avoiding overfitting
• Learning rate scheduling

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 2


Deep Learning and Machine Learning

(Source: [Link])

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 3


Learning Multiple Components

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 4


Neural Network Examples

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 5


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 6
Applications of Deep Learning

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 7


Neural Network
• is a set of connected input/output units in which each connection has a weight associated with it
• During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
• Also referred to as Connectionist learning

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 8


A neuron is a computational, logistic unit

Rohini R Rao, Manjunath Hegde 9


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 10
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 11
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 12
Multilayer Feed-Forward Neural Network
• consists of an input layer, one or more hidden layers, and an output layer
• units in the input layer are input units; units in the hidden and output layers are neurodes
• called a feed-forward network since none of the weights cycles back

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 13


Back Propagation Algorithm

Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 14
Back Propagation Algorithm
• learns using a gradient descent method to search for a set of weights that fits the training data so as to minimize the mean squared error
• l is the learning rate, a constant typically having a value between 0.0 and 1.0
  • helps avoid getting stuck at a local minimum in decision space and encourages finding the global minimum
  • if l is too small, learning occurs at a very slow pace; if l is too large, oscillation between inadequate solutions may occur
  • a rule of thumb is to set the learning rate to 1/t, where t is the number of iterations through the training set so far
  • one iteration through the training set is an epoch
• Case updating: the weights and biases are updated after the presentation of each tuple
• Epoch updating: the weight and bias increments are accumulated in variables, so that the weights and biases are updated after all the tuples in the training set have been presented (see the sketch below)
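To make the update rule concrete, here is a minimal NumPy sketch (not from the textbook) of gradient-descent weight updates for a single logistic unit, contrasting case updating with epoch updating; the names X, y, l and epochs are illustrative.

# Minimal sketch: per-tuple ("case") vs per-epoch weight updating for one logistic unit.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, l=0.1, epochs=100, case_updating=True):
    rng = np.random.default_rng(0)
    w = rng.normal(0, 0.1, size=X.shape[1])    # weights
    b = 0.0                                    # bias
    for _ in range(epochs):
        dw_sum, db_sum = np.zeros_like(w), 0.0
        for x_i, y_i in zip(X, y):
            o = sigmoid(x_i @ w + b)           # forward pass for one tuple
            err = (o - y_i) * o * (1 - o)      # dE/dnet for squared error
            dw, db = err * x_i, err            # gradient contribution of this tuple
            if case_updating:                  # case updating: update after each tuple
                w -= l * dw
                b -= l * db
            else:                              # epoch updating: accumulate increments
                dw_sum += dw
                db_sum += db
        if not case_updating:                  # one update per epoch
            w -= l * dw_sum
            b -= l * db_sum
    return w, b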

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 15


Back Propagation Algorithm
• Terminating condition: training stops when
  • all changes in w_ij in the previous epoch are so small as to be below some specified threshold, or
  • the percentage of tuples misclassified in the previous epoch is below some threshold, or
  • a prespecified number of epochs has expired
• The computational efficiency depends on the time spent training the network
  • given |D| tuples and w weights, each epoch requires O(|D| * w) time
  • in the worst-case scenario, the number of epochs can be exponential in n, the number of inputs
  • in practice, the time required for the networks to converge is highly variable

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 16


Example

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 17


Example

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 18


Simple AND and Simple OR operations

Course Notes – Deep Learning, Andrew Ng

Rohini R Rao, Manjunath Hegde 19


XOR Problem - Perceptron Learning

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 20


XOR Problem
Solving XOR
Network Diagram

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 21


Neural Network Representation

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 22


Neural Network Representation

Course Notes – Deep Learning, Andrew Ng


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 23
The Shallow Neural Network
Vectorizing across multiple examples

Course Notes – Deep Learning, Andrew Ng


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 24
The Shallow Neural Network
Vectorizing across multiple examples

Course Notes – Deep Learning, Andrew Ng


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 25
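To make "vectorizing across multiple examples" concrete, here is a minimal NumPy sketch (not from the course notes) of the forward pass of a shallow network in which all m examples are stacked as the columns of X, so no explicit loop over examples is needed; layer sizes are illustrative.

# Minimal sketch: vectorized forward pass of a one-hidden-layer network.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, n_h, n_y, m = 3, 4, 1, 5           # input, hidden, output sizes; m examples
rng = np.random.default_rng(0)
X  = rng.normal(size=(n_x, m))           # each column is one training example
W1 = rng.normal(size=(n_h, n_x)); b1 = np.zeros((n_h, 1))
W2 = rng.normal(size=(n_y, n_h)); b2 = np.zeros((n_y, 1))

Z1 = W1 @ X + b1                         # (n_h, m): all examples in one matrix product
A1 = np.tanh(Z1)                         # hidden activations
Z2 = W2 @ A1 + b2                        # (n_y, m)
A2 = sigmoid(Z2)                         # one prediction per column
print(A2.shape)                          # (1, 5)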
Deep vs Shallow Network

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 26


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 27
Neural Network learning as optimization
• Cannot calculate the perfect weights for a neural network, since there are too many unknowns
• Instead, learning is cast as a search or optimization problem
• An algorithm is used to navigate the space of possible sets of weights the model may use in order to make good, or good enough, predictions
• The objective of the optimizer is to minimize the loss function, or error term
• Loss function
  • gives the difference between the observed value and the predicted value
  • must be continuous and differentiable at each point (allows the use of gradient-based optimization)
• To minimize the loss generated from any model, compute
  • the magnitude, i.e., by how much to decrease or increase, and
  • the direction in which to move

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 28


Machine Learning Setup

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 29


Activation Functions
• To make the network robust, activation (or transfer) functions are used
• Activation functions introduce non-linear properties into the neural network
• A good activation function has the following properties:
  • Monotonic: should be either entirely non-increasing or non-decreasing; if not monotonic, increasing a neuron's weight might cause it to have less influence on reducing the error of the cost function
  • Differentiable: the change in y with respect to a change in x must exist, because we need to calculate the change in error with respect to the given weights during gradient descent
  • Quickly converging: should reach its desired value fast

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 30


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 31
Gradient Descent

• Generic optimization algorithm
• Starts with random values
• Improves gradually, in an attempt to decrease the loss function
• Until the algorithm converges to a minimum
• Learning rate
  • hyperparameter
  • indicates the size of the steps
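A minimal sketch of the idea, assuming a toy convex loss f(w) = (w - 3)^2: random initialization, a learning-rate step size, and gradual improvement toward the minimum.

# Minimal sketch: gradient descent on a one-parameter convex function.
import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)      # derivative of (w - 3)^2

rng = np.random.default_rng(0)
w = rng.normal()                # start with a random value
eta = 0.1                       # learning rate: size of the steps
for step in range(100):
    w -= eta * grad(w)          # move against the gradient
print(round(w, 4))              # close to the minimum at w = 3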

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 32


Gradient Descent & Learning Rate

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 33


Gradient Descent Pitfalls

• If it starts on the left, it can reach only a local minimum
• If it starts from the right, it can hit a plateau
• So pick cost functions which are convex functions
  • have no local minima
  • are continuous functions with a slope that does not change abruptly
  • then gradient descent will approach close to the global minimum

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 34


Activation Functions and their derivatives

Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow", O'Reilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 35
Activation Functions – Sigmoid vs Tanh
• Sigmoid
  • s(x) = 1/(1 + e^(-x)), where e ≈ 2.71 is the base of the natural logarithm
  • outputs lie in the range 0 to 1, so sigmoid is the right choice when predicting a probability
  • is differentiable: we can find the slope of the sigmoid curve at any point
  • the function is monotonic, but its derivative is not
  • can cause a neural network to get stuck during training (saturation)
• Tanh
  • range is (-1, 1)
  • centres the data (mean ≈ 0)
  • is also sigmoidal (s-shaped)
  • advantage: negative inputs are mapped strongly negative
  • disadvantage: if x is very small or very large, the slope (gradient) becomes close to 0, which slows down gradient descent
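A small NumPy sketch of sigmoid and tanh with their derivatives, illustrating the saturation effect mentioned above (gradients near 0 for large |x|).

# Minimal sketch: sigmoid and tanh derivatives shrink toward 0 as |x| grows.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)             # maximum 0.25 at x = 0

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2     # maximum 1.0 at x = 0

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(d_sigmoid(x))   # near 0 at the extremes -> vanishing gradient
print(d_tanh(x))      # same saturation effect, but a larger peak slope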
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 36
Activation Functions – Sigmoid vs ReLU

• ReLU is half rectified
  • f(z) = 0 when z < 0, and f(z) = z when z >= 0
  • the derivative is 1 when z is positive and 0 when z is negative
  • the function and its derivative are both monotonic
• An alternative to ReLU is the softplus activation function
  • softplus(z) = log(1 + exp(z)); close to 0 when z is negative and close to z when z is positive
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 37
Activation Functions- ReLU vs Leaky ReLU

• The leak helps to increase the range of the ReLU function
• Usually, the value of the leak coefficient a is 0.01
• When a is randomized or learned rather than fixed, the variants are called Randomized ReLU and Parametric ReLU (see the later slide on non-saturating activations)
• The range of Leaky ReLU is (-infinity, infinity)
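A small NumPy sketch of ReLU, Leaky ReLU (with leak coefficient a) and softplus, matching the definitions above.

# Minimal sketch: ReLU-family activations.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, a=0.01):
    return np.where(z >= 0, z, a * z)      # small slope a for negative inputs

def softplus(z):
    return np.log1p(np.exp(z))             # smooth approximation of ReLU

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z), leaky_relu(z), softplus(z), sep="\n")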
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 38
Activation Functions - SoftMax
• usually the last activation function of a network
• normalizes the output of the network into a probability distribution over the predicted output classes
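A minimal NumPy sketch of a numerically stable softmax; subtracting the maximum logit leaves the result unchanged but avoids overflow in exp().

# Minimal sketch: numerically stable softmax.
import numpy as np

def softmax(z):
    z = z - np.max(z)                  # stability shift, does not change the result
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p, p.sum())                      # probabilities that sum to 1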

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 39


Activation Function Cheat Sheet

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 40


[Link]
Output Functions

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 41


Regression problems - Output Function

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 42


Regression problems - Loss Function

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 43


Classification problem- Output Function

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 44


Classification Problems- Loss Function

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 45


Typical MLP architecture

Regression Classification

Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow", O'Reilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 46
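As a rough companion to the architecture diagrams on this slide (not reproduced here), a minimal Keras sketch, assuming TensorFlow 2.x; the layer sizes and input shapes are illustrative, not prescribed by the slides.

# Minimal sketch: typical MLPs for regression and for multi-class classification.
import tensorflow as tf

regressor = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(1)                         # no activation: any real value
])
regressor.compile(loss="mse", optimizer="sgd")

classifier = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28 * 28,)),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")  # one probability per class
])
classifier.compile(loss="sparse_categorical_crossentropy",
                   optimizer="sgd", metrics=["accuracy"])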
Regression Loss Functions
• Mean Squared Error (MSE)
  • values with a large error are penalized heavily
  • is a convex function with a clearly defined global minimum
  • can be used in gradient descent optimization to set the weight values
  • very sensitive to outliers, which significantly increase the loss
• Mean Absolute Error (MAE)
  • used when the training data has a large number of outliers
  • its gradient has the same magnitude everywhere, so as the average error approaches 0, gradient descent optimization struggles to converge (see the sketch below)
• Huber Loss
  • based on the absolute difference between the actual and predicted value and a threshold value, δ
  • is quadratic when the error is smaller than δ but linear when the error is larger than δ
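A minimal NumPy sketch of the three regression losses, with an outlier in the toy data to show MSE's sensitivity; the delta value is illustrative.

# Minimal sketch: MSE, MAE and Huber loss.
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    quad = 0.5 * err ** 2                       # used when |err| <= delta
    lin  = delta * (np.abs(err) - 0.5 * delta)  # used when |err| >  delta
    return np.mean(np.where(np.abs(err) <= delta, quad, lin))

y_true = np.array([1.0, 2.0, 3.0, 100.0])       # last point is an outlier
y_pred = np.array([1.1, 1.9, 3.2, 4.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))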
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 47
Classification Loss Functions
Cross entropy measures the difference between two probability distributions
• Binary Cross-Entropy / Log Loss
  • compares the actual value (0 or 1) with the predicted probability that the input belongs to that category
  • p(i) = probability that the category is 1; 1 - p(i) = probability that the category is 0
• Categorical Cross-Entropy Loss
  • used in cases where the number of classes is greater than two (see the sketch below)
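A minimal NumPy sketch of binary and categorical cross-entropy; the eps clip is only there to guard against log(0), and the sample labels and probabilities are illustrative.

# Minimal sketch: binary and categorical cross-entropy.
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

y = np.array([1, 0, 1])                     # binary labels
p = np.array([0.9, 0.2, 0.6])               # predicted P(class = 1)
print(binary_cross_entropy(y, p))

y_oh = np.array([[1, 0, 0], [0, 1, 0]])     # one-hot labels, 3 classes
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(categorical_cross_entropy(y_oh, probs))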

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 48


Keras Metrics
• [Link](..., metrics=['mse'])
• Metric values are recorded at the end of each epoch on the training dataset.
• If a validation dataset is also provided, the metric is also calculated for the validation dataset.
• All metrics are reported in the verbose output and in the history object returned from calling the fit() function.
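A minimal sketch, assuming TensorFlow 2.x Keras, of attaching metrics at compile time and reading them back from the history object; the tiny model and random data are placeholders.

# Minimal sketch: metrics at compile time and in the history object.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer="sgd", loss="mse",
              metrics=["mse", "mae"])            # recorded at the end of every epoch

X = np.random.rand(64, 4); y = np.random.rand(64, 1)
history = model.fit(X, y, epochs=3, validation_split=0.2, verbose=0)
print(history.history.keys())   # e.g. loss, mse, mae, val_loss, val_mse, val_mae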

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 49



Keras Metrics
• Accuracy metrics
• Accuracy
• Calculates how often predictions equal labels.
• Binary Accuracy
• Calculates how often predictions match binary labels.
• Categorical Accuracy
• Calculates how often predictions match one-hot labels
• Sparse Categorical Accuracy
• Calculates how often predictions match integer labels.
• TopK Categorical Accuracy
• calculates the percentage of records for which the targets are in the
top K predictions
• rank the predictions in the descending order of probability values.
• If the rank of the yPred is less than or equal to K, it is considered
accurate.
• Sparse TopK Categorical Accuracy class
• Computes how often integer targets are in the top K predictions.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 50


Keras Metrics
• Regression metrics
• Mean Squared Error
• Computes the mean squared error between y_true and y_pred
• Root Mean Squared Error
• Computes the root mean squared error metric between y_true and y_pred
• Mean Absolute Error
• Computes the mean absolute error between the labels and
predictions
• Mean Absolute Percentage Error
• MAPE = (1/n) * Σ(|actual – prediction| / |actual|) * 100
• Average difference between the predicted and the actual in %
• Mean Squared Logarithmic Error
• measure of the ratio between the true and predicted values.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 51


Keras Metrics
• AUC
• Precision
• Recall
• TruePositives
• TrueNegatives
• FalsePositives
• FalseNegatives
• PrecisionAtRecall
  • computes the best precision where recall is >= a specified value
• SensitivityAtSpecificity
  • computes the best sensitivity where specificity is >= a specified value
• SpecificityAtSensitivity
  • computes the best specificity where sensitivity is >= a specified value
• Probabilistic metrics
  • Binary Crossentropy
  • Categorical Crossentropy
  • Sparse Categorical Crossentropy
  • KLDivergence
    • a measure of how different two probability distributions are from each other
  • Poisson
    • used if the dataset comes from a Poisson distribution

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 52


Gradient based learning
• seeks to change the weights so that the next evaluation reduces the error
• is navigating down the gradient (or slope) of the error
• the process repeats until the global minimum is reached
• works well for convex functions
• it is expensive to calculate the gradients if the size of the data is huge
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 53
The Vanishing/Exploding Gradient Problems
• The algorithm computes the gradient of the cost function with respect to each parameter in the network
• Problems include
  • Vanishing gradients: gradients become very small as the algorithm progresses down to the lower layers, so connection weights remain practically unchanged and training never converges to a solution
  • Exploding gradients: gradients grow so large that layers get huge weight updates and the algorithm diverges
• Deep neural networks suffer from unstable gradients; different layers may learn at different speeds
• Reasons for unstable gradients, Glorot & Bengio (2010):
  • the combination of sigmoid activation and weight initialization from a normal distribution with mean 0 and standard deviation 1
  • the variance of the outputs of each layer is much greater than the variance of its inputs
  • the variance keeps increasing after each layer, until the activation function saturates (at 0 or 1, with derivative close to 0) at the top layers

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 54


Solutions include: 1. Weight Initialization
• The variance of the outputs of each layer has to be equal to the variance of its inputs
• Xavier's Initialization
  • the weight matrix W of a particular layer l is picked randomly from a normal distribution with
    • mean μ = 0
    • variance σ² = 1 / n_(l-1), the multiplicative inverse of the number of neurons in layer l-1
  • the bias b of all layers is initialized with 0

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 55


Solutions include: Weight Initialization
1. Xavier's (Glorot) Initialization
• the variance of the outputs of each layer should be equal to the variance of its inputs
• the gradients should have equal variance before and after flowing through a layer in the reverse direction
• fan-in: number of inputs to the layer
• fan-out: number of neurons (outputs) of the layer
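A minimal NumPy sketch of one common form of this idea, Glorot (Xavier) normal initialization with variance 2 / (fan_in + fan_out); the layer sizes are illustrative.

# Minimal sketch: Glorot (Xavier) normal weight initialization.
import numpy as np

def glorot_normal(fan_in, fan_out, seed=0):
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / (fan_in + fan_out))          # balances forward and backward variance
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W = glorot_normal(fan_in=300, fan_out=100)
print(W.std())   # close to sqrt(2 / 400) ≈ 0.0707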

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 56


Solutions include
2. Using Non-saturating Activation Functions
  1. ReLU does not saturate for positive values
  2. But it suffers from the dying-ReLU problem: some neurons output only 0
     • the weighted sum of their inputs is negative for all instances in the training set
  3. Use Leaky ReLU instead (neurons only go into a coma, they don't die, and may wake up)
     • α = 0.01, sometimes 0.2
  4. Flavours include
     • Randomized Leaky ReLU (RReLU): α is picked randomly in a given range during training and is fixed to an average value during testing
     • Parametric Leaky ReLU (PReLU): α is learned during training as a parameter
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 57
SELU
• Scaled Exponential Linear Units
• Induce self-normalization
• the output of each layer will tend to
preserve mean 0 and standard deviation
1 during training
• f(x) = λx if x>= 0
• f(x) = λα(exp(x)-1) if x < 0
• α = 1.6733 , λ= 1.0507
• conditions for self-normalization to
happen:
• The input features must be standardized
(mean 0 and standard deviation 1).
• Every hidden layer's weights must also be initialized using LeCun normal initialization.
• The network’s architecture must be
sequential.
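A minimal Keras sketch, assuming TensorFlow 2.x, of a sequential stack of SELU layers with LeCun normal initialization, i.e. the combination under which self-normalization is expected to hold; sizes are illustrative.

# Minimal sketch: SELU layers with LeCun normal initialization.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Input(shape=(30,))])
for _ in range(5):                               # plain sequential stack
    model.add(tf.keras.layers.Dense(
        64, activation="selu",
        kernel_initializer="lecun_normal"))      # needed for self-normalization
model.add(tf.keras.layers.Dense(1))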

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 58


Solutions Include
3. Batch Normalization (Ioffe and Szegedy, 2015)
• designed to solve the vanishing/exploding gradients problem; it is also a good regularizer
• a BN layer performs the standardizing and normalizing operations on the input of a layer coming from a previous layer
• Normalization brings the numerical data to a common scale (mean = 0, std dev = 1) without distorting its shape
• BN adds extra operations in the model, before the activation
  • the operation zero-centres and normalizes each input
  • then scales and shifts the result using two new parameter vectors per layer
• Each BN layer learns 4 parameter vectors
  • output scale vector, output offset vector, input mean vector, input standard deviation vector
• To zero-centre and normalize the inputs, the mean and standard deviation of the input need to be computed; the current mini-batch is used to evaluate them (see the Keras sketch below)
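A minimal Keras sketch, assuming TensorFlow 2.x, of BN layers placed before the activation as described above; layer sizes are illustrative.

# Minimal sketch: BatchNormalization before the activation.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(300, use_bias=False),    # BN supplies its own offset
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(100, use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
# Each BN layer holds 4 parameter vectors: the learned scale and offset,
# plus the moving mean and moving variance estimated from mini-batches.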
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 59
Solutions Include
3. Batch Normalization (Ioffe and Szegedy, 2015)

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 60


Solutions include
4. Gradient Clipping
  • clip gradients during backpropagation so that they never exceed some threshold
  • e.g., with a threshold of 0.1, all the partial derivatives of the loss are clipped to the range -0.1 to 0.1
  • the threshold can also be treated as a hyperparameter to tune (see the optimizer sketch after this list)
5. Reusing Pretrained Layers
  • Transfer Learning
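A minimal Keras sketch, assuming TensorFlow 2.x, of gradient clipping configured on the optimizer; the 0.1 clip value mirrors the example above, and clipnorm is mentioned only as an alternative.

# Minimal sketch: gradient clipping via the optimizer.
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.1)  # clip each partial derivative to [-0.1, 0.1]
# clipnorm=1.0 would instead rescale the whole gradient vector if its norm exceeds 1.0
# model.compile(loss="mse", optimizer=optimizer)   # attach to any model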

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 61


6. Faster Optimizers
• Optimizer
  • a function or an algorithm that modifies the attributes of the neural network, such as the weights and the learning rate
• Terminology
  • Weights / bias – the learnable parameters of a model that control the signal between two neurons
  • Epoch – the number of times the algorithm runs over the whole training dataset
  • Sample – a single row of a dataset
  • Batch – the number of samples taken for one update of the model parameters
  • Learning rate – defines the scale by which the model weights should be updated
  • Cost function / loss function – used to calculate the cost, i.e., the difference between the predicted value and the actual value

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 62


Mini-Batch Gradient Descent Deep Learning Optimizer
• Batch gradient descent: the gradient is the average of the gradients computed from ALL the samples in the dataset
• Mini-batch GD:
  • a subset of the dataset is used for calculating the loss function, therefore fewer iterations are needed
  • a batch size of 32 is considered appropriate for almost every case
  • Yann LeCun (2018): "Friends don't let friends use mini-batches larger than 32"
  • is faster, more efficient and more robust than the earlier variants of gradient descent
  • the cost function is noisier than with batch GD but smoother than with SGD
  • provides a good balance between speed and accuracy
  • it needs a hyperparameter, the mini-batch size, which has to be tuned to achieve the required accuracy (see the fit() sketch below)
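A minimal Keras sketch, assuming TensorFlow 2.x: the mini-batch size is passed through the batch_size argument of fit(); the tiny model and random data are placeholders.

# Minimal sketch: setting the mini-batch size in Keras.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer="sgd", loss="mse")
X, y = np.random.rand(1024, 10), np.random.rand(1024, 1)
model.fit(X, y, epochs=2, batch_size=32, verbose=0)   # 1024 / 32 = 32 updates per epoch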

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 63


Stochastic GD & SGD with Momentum DL
• "stochastic" refers to the randomness the algorithm is based upon
• instead of taking the whole dataset for each iteration, batches of data are selected randomly
• the path taken is full of noise compared to the (batch) gradient descent algorithm
• uses a higher number of iterations to reach the local minima, so the overall computation time increases
• the computation cost is still less than that of the batch gradient descent optimizer
• if the data is enormous and computational time is an essential factor, SGD should be preferred over the batch gradient descent algorithm
• Stochastic Gradient Descent with Momentum Deep Learning Optimizer
  • momentum helps in faster convergence of the loss function

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 64


SGD with Momentum Optimizer
• GD takes small, regular steps down the slope, so the algorithm takes more time to reach the bottom
• adding a fraction of the previous update to the current update makes the process a bit faster
• Hyperparameter β (momentum)
  • simulates a friction mechanism and prevents the momentum from becoming too large
  • set between 0 (high friction) and 1 (low friction), typically 0.9
• also rolls past local minima
• the learning rate should be decreased when using a high momentum term

Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow", O'Reilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 65
SGD with Nesterov Momentum Optimization
• proposed by Yurii Nesterov in 1983
• measures the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum
• since the momentum vector will be pointing in the right direction (i.e., toward the optimum), it is slightly more accurate to use the gradient measured a bit farther in that direction rather than the gradient at the original position

Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow", O'Reilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 66
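A minimal Keras sketch, assuming TensorFlow 2.x, of momentum optimization and its Nesterov variant with the typical β = 0.9.

# Minimal sketch: momentum and Nesterov momentum optimizers.
import tensorflow as tf

sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
sgd_nesterov = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9,
                                       nesterov=True)   # gradient measured slightly ahead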
Adagrad (Adaptive Gradient Descent) Deep Learning Optimizer
• adaptive learning rate: scales down the gradient vector along the steepest dimensions
• if the cost function is steep along the i-th dimension, the accumulated term s gets larger and larger at each iteration
• no need to modify the learning rate manually
• more reliable than plain gradient descent algorithms, and it reaches convergence at a higher speed
• Disadvantage
  • it decreases the learning rate aggressively and monotonically
  • due to the small learning rates, the model eventually becomes unable to acquire more knowledge, and hence the accuracy of the model is compromised

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 67


RMSProp (Root Mean Square Propagation) Deep Learning Optimizer
• the problem with the gradients is that some are small while others may be huge, so defining a single learning rate might not be the best idea
• RMSProp accumulates only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training), using a decay rate β = 0.9
• works better than Adagrad; was popular until Adam

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 68


Adam Deep Learning Optimizer
• is derived from adaptive moment estimation
• inherits the features of both the Adagrad and RMSProp algorithms
• like momentum optimization, it keeps track of an exponentially decaying average of past gradients
• like RMSProp, it keeps track of an exponentially decaying average of past squared gradients
• β1 and β2 represent the decay rates of these averages; t represents the iteration
  • typical values: β1 = 0.9, β2 = 0.999, smoothing term ϵ = 10⁻⁷
• Advantages
  • is straightforward to implement
  • faster running time
  • low memory requirements, and requires less tuning
• Disadvantages
  • focuses on faster computation time, whereas SGD focuses on the data points
  • therefore SGD generalizes the data in a better manner, at the cost of lower computation speed (see the optimizer sketch below)
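A minimal Keras sketch, assuming TensorFlow 2.x, of RMSProp and Adam configured with the hyperparameter values quoted on these slides.

# Minimal sketch: RMSProp and Adam optimizers.
import tensorflow as tf

rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)       # rho plays the role of beta
adam = tf.keras.optimizers.Adam(learning_rate=0.001,
                                beta_1=0.9, beta_2=0.999, epsilon=1e-7)   # smoothing term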

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 69


Curve Fitting – True Function is Sinusoidal

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 70


Curve Fitting – True Function is Sinusoidal

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 71


Bias

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 72


Variance

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 73


Mean Square Error

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 74


Train vs Test Error

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 75


Train vs Test Error

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 76


Learning Curves
• Line plot of learning (y-axis) over experience (x-axis)
• The metric used to evaluate learning could be
• Optimization Learning Curves:
• calculated on the metric by which the parameters of the model are being optimized, e.g. loss.
• Minimizing, such as loss or error
• Performance Learning Curves:
• calculated on the metric by which the model will be evaluated and selected, e.g. accuracy.
• Maximizing metric , such as classification accuracy
• Train Learning Curve:
• calculated from the training dataset that gives an idea of how well the model is learning.
• Validation Learning Curve:
• calculated from a hold-out validation dataset that gives an idea of how well the model is
generalizing.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 77


Underfit Learning Curves

A plot of learning curves shows underfitting if:
• the training loss remains flat regardless of training, or
• the training loss continues to decrease until the end of training (i.e., the model was still improving when training stopped)
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 78
Overfitting Curves
• Overfitting
  • the model has specialized to the training data and is not able to generalize to new data
  • results in an increase in generalization error
• generalization error can be measured by
the performance of the model on the
validation dataset.
• A plot of learning curves shows
overfitting if:
• The plot of training loss continues to
decrease with experience.
• The plot of validation loss decreases to a
point and begins increasing again.
• The inflection point in validation loss may
be the point at which training could be
halted as experience after that point shows
the dynamics of overfitting.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 79


Good Fit Learning Curves
• A good fit is the goal of the learning
algorithm and exists between an overfit
and underfit model.
• A plot of learning curves shows a good fit
if:
• Plot of training loss decreases to a point of
stability
• Plot of validation loss decreases to a point
of stability and has a small gap with the
training loss.
• Loss of the model will almost always be
lower on the training than the validation
dataset.
• We should expect some gap between the
train and validation loss learning curves.
• This gap is referred to as the
“generalization gap.”
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 80
Avoiding Overfitting
• With so many parameters, DNN can fit
complex datasets.
• But also prone to overfitting the
training set.
• Early Stopping
• stop training as soon as the validation
error reaches a minimum
• With Stochastic and Mini-batch Gradient
Descent, the curves are not so smooth,
and it may be hard to know whether you
have reached the minimum or not.
• Stop only after the validation error has
been above the minimum for some time
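A minimal Keras sketch, assuming TensorFlow 2.x, of early stopping as a callback; the patience value and the commented fit() call (with placeholder X_train/X_valid names) are illustrative.

# Minimal sketch: early stopping on the validation error.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=10,                 # wait 10 epochs past the minimum
                                              restore_best_weights=True)   # roll back to the best model
# model.fit(X_train, y_train, epochs=1000,
#           validation_data=(X_valid, y_valid), callbacks=[early_stop])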

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 81


Avoiding overfitting
• Dropout
• Proposed by Geoffrey Hinton in 2012
• at every training step, every neuron has a
probability p of being temporarily
“dropped out,”
• it will be entirely ignored during this
training step, but it may be active during
the next step
• p is called the dropout rate, usually 50%.
• After training, neurons don’t get dropped
• If p = 50%
• during testing a neuron will be connected
to twice as many input neurons as it was
(on average) during training.
• multiply each input connection weight by
the keep probability (1 – p) after training
• Alternatively, divide each neuron’s output
by the keep probability during training
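A minimal Keras sketch, assuming TensorFlow 2.x, of dropout layers with p = 50%; Keras applies the rescaling during training and disables dropout at inference, so no manual weight scaling is needed.

# Minimal sketch: dropout regularization.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dropout(rate=0.5),               # each input dropped with probability 0.5
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])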

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 82


Overfitting using Regularization
• L1, L2 regularization
  • regularization can be used to constrain the NN weights
  • Lasso (l1), least absolute shrinkage and selection operator: adds the absolute value of the magnitude of the coefficients as a penalty term to the loss function
  • Ridge regression (l2): adds the squared magnitude of the coefficients as the penalty term to the loss function
  • use the l1(), l2(), or l1_l2() functions, which return a regularizer that computes the regularization loss at each step during training; the regularization loss is added to the final loss
• Max-Norm Regularization
  • constrains the weights w of the incoming connections such that ∥w∥₂ ≤ r
  • r is the max-norm hyperparameter and ∥ · ∥₂ is the ℓ2 norm
  • reducing r increases the amount of regularization and helps reduce overfitting
  • can also help alleviate the unstable gradients problem if batch normalization is not used (see the Keras sketch below)
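A minimal Keras sketch, assuming TensorFlow 2.x, of an l2-regularized layer combined with a max-norm constraint on its incoming weights; the coefficient 0.01 and r = 1.0 are illustrative.

# Minimal sketch: l2 regularization and a max-norm constraint on one layer.
import tensorflow as tf

layer = tf.keras.layers.Dense(
    100, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(0.01),     # penalty added to the final loss
    kernel_constraint=tf.keras.constraints.max_norm(1.0))  # enforce ||w||2 <= r after each update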

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 83


Constant learning rate not ideal
• Better to start with a high learning
rate
• then reduce it once it stops
making fast progress
• can reach a good solution faster
• Learning Schedule strategies can
be applied
• Power Scheduling
• Exponential Scheduling
• Piecewise Constant Scheduling
• Performance Scheduling

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 84


Learning Rate Scheduling
• Power scheduling
  • the learning rate is set as a function of the iteration number t: η(t) = η0 / (1 + t/s)^c
  • η0 is the initial learning rate, c is the power (typically set to 1), and s is the step hyperparameter
  • the learning rate drops at each step; after s steps it is down to η0/2, after 2s steps to η0/3, and so on
  • the schedule first drops quickly, then more and more slowly
  • optimizer = [Link](lr=0.01, decay=1e-4)
  • the decay is the inverse of s (the number of steps it takes to divide the learning rate by one more unit); Keras assumes that c is equal to 1
• Exponential scheduling
  • set the learning rate to η(t) = η0 · 0.1^(t/s)
  • the learning rate will gradually drop by a factor of 10 every s steps

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 85


Learning Rate Scheduling
• Piecewise constant scheduling
• Constant learning rate for a number of epochs
• e.g., η0 = 0.1 for 5 epochs
• then a smaller learning rate for another number of epochs
• e.g., η1 = 0.001 for 50 epochs and so on
• Performance scheduling
• Measure the validation error every N steps (just like for early stopping)
• reduce the learning rate by a factor of λ when the error stops dropping
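A minimal Keras sketch, assuming TensorFlow 2.x, of the scheduling strategies above: exponential decay, a piecewise-constant schedule via a LearningRateScheduler callback, and performance scheduling via ReduceLROnPlateau; all numbers are illustrative.

# Minimal sketch: three learning-rate scheduling options.
import tensorflow as tf

# Exponential scheduling: eta(t) = eta0 * 0.1 ** (t / s)
exp_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=20000, decay_rate=0.1)
optimizer = tf.keras.optimizers.SGD(learning_rate=exp_schedule)

# Piecewise constant scheduling: 0.1 for the first 5 epochs, then 0.001
def piecewise(epoch, lr):
    return 0.1 if epoch < 5 else 0.001
lr_callback = tf.keras.callbacks.LearningRateScheduler(piecewise)

# Performance scheduling: multiply the lr by `factor` when val_loss stops improving
plateau_callback = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=5)
# model.fit(..., callbacks=[lr_callback, plateau_callback])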

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 86


References
• Ian Goodfellow, Yoshua Bengio and Aaron Courville, "Deep Learning", MIT Press, 2016
• Swayam NPTEL Notes – Deep Learning, Mitesh Khapra
• Course Notes – Neural Networks and Deep Learning, Andrew Ng
• Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow", O'Reilly Publications
• Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 87
