
1 - Introduction To Deep Learning


Shallow vs Deep NNs

DSE 3151 Deep Learning


DSE 3151, [Link] Data Science & Engineering
August 2023
Rohini R Rao & Abhilash Pai
Department of Data Science and Computer Applications
MIT Manipal
Contents
• Shallow Networks
• Neural Network Representation
• Back Propagation
• Vectorization
• Deep Neural Networks
• Example: Learning XOR
• Architecture Design
• Loss Functions
• Metrics
• Gradient-Based Learning
• Optimization
• Diagnosing Learning Curves
• Strategies for avoiding overfitting
• Learning rate scheduling

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 2


Deep Learning and Machine Learning

(Source: [Link])

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 3


Learning Multiple Components

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 4


Neural Network Examples

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 5


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 6
Applications of Deep Learning

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 7


Neural Network
• is a set of connected input/output units in which each connection has a weight associated with it
• During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
• Also referred to as Connectionist learning

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 8


A neuron is a computational, logistic unit

Rohini R Rao, Manjunath Hegde 9


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 10
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 11
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 12
Multilayer Feed-Forward Neural Network
• consists of an input layer, one or more hidden layers, and an output layer
• units in the input layer are input units; units in the hidden and output layers are neurodes
• called a feed-forward network since none of the weights cycles back

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 13


Back Propagation Algorithm

Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 14
Back Propagation Algorithm
• learns using a gradient descent method to search for a set of weights that fits the training data so as to minimize the mean squared error
• l is the learning rate, a constant typically having a value between 0.0 and 1.0
  • helps avoid getting stuck at a local minimum in decision space and encourages finding the global minimum
  • if l is too small, learning occurs at a very slow pace; if l is too large, oscillation between inadequate solutions may occur
  • a rule of thumb is to set the learning rate to 1/t, where t is the number of iterations through the training set so far
  • one iteration through the training set is an epoch
• Case updating: the weights and biases are updated after the presentation of each tuple
• Epoch updating: the weight and bias increments are accumulated in variables, so that the weights and biases are updated after all the tuples in the training set have been presented (see the sketch below)
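To make the update rule concrete, here is a minimal NumPy sketch (not from the textbook) of gradient-descent weight updates for a single logistic unit, contrasting case updating with epoch updating; the names X, y, l and epochs are illustrative.

# Minimal sketch: per-tuple ("case") vs per-epoch weight updating for one logistic unit.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, l=0.1, epochs=100, case_updating=True):
    rng = np.random.default_rng(0)
    w = rng.normal(0, 0.1, size=X.shape[1])    # weights
    b = 0.0                                    # bias
    for _ in range(epochs):
        dw_sum, db_sum = np.zeros_like(w), 0.0
        for x_i, y_i in zip(X, y):
            o = sigmoid(x_i @ w + b)           # forward pass for one tuple
            err = (o - y_i) * o * (1 - o)      # dE/dnet for squared error
            dw, db = err * x_i, err            # gradient contribution of this tuple
            if case_updating:                  # case updating: update after each tuple
                w -= l * dw
                b -= l * db
            else:                              # epoch updating: accumulate increments
                dw_sum += dw
                db_sum += db
        if not case_updating:                  # one update per epoch
            w -= l * dw_sum
            b -= l * db_sum
    return w, b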

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 15


Back Propagation Algorithm
• Terminating condition: training stops when
  • all changes in w_ij in the previous epoch are so small as to be below some specified threshold, or
  • the percentage of tuples misclassified in the previous epoch is below some threshold, or
  • a prespecified number of epochs has expired
• The computational efficiency depends on the time spent training the network
  • given |D| tuples and w weights, each epoch requires O(|D| * w) time
  • in the worst-case scenario, the number of epochs can be exponential in n, the number of inputs
  • in practice, the time required for the networks to converge is highly variable

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 16


Example

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 17


Example

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 18


Simple AND and Simple OR operations

Course Notes – Deep Learning, Andrew Ng

Rohini R Rao, Manjunath Hegde 19


XOR Problem - Perceptron Learning

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 20


XOR Problem
Solving XOR
Network Diagram

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 21


Neural Network Representation

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 22


Neural Network Representation

Course Notes – Deep Learning, Andrew Ng


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 23
The Shallow Neural Network
Vectorizing across multiple examples

Course Notes – Deep Learning, Andrew Ng


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 24
The Shallow Neural Network
Vectorizing across multiple examples

Course Notes – Deep Learning, Andrew Ng


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 25
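To make "vectorizing across multiple examples" concrete, here is a minimal NumPy sketch (not from the course notes) of the forward pass of a shallow network in which all m examples are stacked as the columns of X, so no explicit loop over examples is needed; layer sizes are illustrative.

# Minimal sketch: vectorized forward pass of a one-hidden-layer network.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, n_h, n_y, m = 3, 4, 1, 5           # input, hidden, output sizes; m examples
rng = np.random.default_rng(0)
X  = rng.normal(size=(n_x, m))           # each column is one training example
W1 = rng.normal(size=(n_h, n_x)); b1 = np.zeros((n_h, 1))
W2 = rng.normal(size=(n_y, n_h)); b2 = np.zeros((n_y, 1))

Z1 = W1 @ X + b1                         # (n_h, m): all examples in one matrix product
A1 = np.tanh(Z1)                         # hidden activations
Z2 = W2 @ A1 + b2                        # (n_y, m)
A2 = sigmoid(Z2)                         # one prediction per column
print(A2.shape)                          # (1, 5)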
Deep vs Shallow Network

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 26


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 27
Neural Network learning as optimization
• Cannot calculate the perfect weights for a neural network, since there are too many unknowns
• Instead, learning is cast as a search or optimization problem
• An algorithm is used to navigate the space of possible sets of weights the model may use in order to make good, or good enough, predictions
• The objective of the optimizer is to minimize the loss function, or error term
• Loss function
  • gives the difference between the observed value and the predicted value
  • must be continuous and differentiable at each point (allows the use of gradient-based optimization)
• To minimize the loss generated from any model, compute
  • the magnitude, i.e., by how much to decrease or increase, and
  • the direction in which to move

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 28


Machine Learning Setup

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 29


Activation Functions
• To make the network robust, activation (or transfer) functions are used
• Activation functions introduce non-linear properties into the neural network
• A good activation function has the following properties:
  • Monotonic: should be either entirely non-increasing or non-decreasing; if not monotonic, increasing a neuron's weight might cause it to have less influence on reducing the error of the cost function
  • Differentiable: the change in y with respect to a change in x must exist, because we need to calculate the change in error with respect to the given weights during gradient descent
  • Quickly converging: should reach its desired value fast

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 30


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 31
Gradient Descent

• Generic optimization algorithm
• Starts with random values
• Improves gradually, in an attempt to decrease the loss function
• Until the algorithm converges to a minimum
• Learning rate
  • hyperparameter
  • indicates the size of the steps
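A minimal sketch of the idea, assuming a toy convex loss f(w) = (w - 3)^2: random initialization, a learning-rate step size, and gradual improvement toward the minimum.

# Minimal sketch: gradient descent on a one-parameter convex function.
import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)      # derivative of (w - 3)^2

rng = np.random.default_rng(0)
w = rng.normal()                # start with a random value
eta = 0.1                       # learning rate: size of the steps
for step in range(100):
    w -= eta * grad(w)          # move against the gradient
print(round(w, 4))              # close to the minimum at w = 3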

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 32


Gradient Descent & Learning Rate

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 33


Gradient Descent Pitfalls

• If it starts on the left, it can reach only a local minimum
• If it starts from the right, it can hit a plateau
• So pick cost functions which are convex functions
  • have no local minima
  • are continuous functions with a slope that does not change abruptly
  • then gradient descent will approach close to the global minimum

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 34


Activation Functions and their derivatives

Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow", O'Reilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 35
Activation Functions – Sigmoid vs Tanh
• Sigmoid
  • s(x) = 1/(1 + e^(-x)), where e ≈ 2.71 is the base of the natural logarithm
  • outputs lie in the range 0 to 1, so sigmoid is the right choice when predicting a probability
  • is differentiable: we can find the slope of the sigmoid curve at any point
  • the function is monotonic, but its derivative is not
  • can cause a neural network to get stuck during training (saturation)
• Tanh
  • range is (-1, 1)
  • centres the data (mean ≈ 0)
  • is also sigmoidal (s-shaped)
  • advantage: negative inputs are mapped strongly negative
  • disadvantage: if x is very small or very large, the slope (gradient) becomes close to 0, which slows down gradient descent
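A small NumPy sketch of sigmoid and tanh with their derivatives, illustrating the saturation effect mentioned above (gradients near 0 for large |x|).

# Minimal sketch: sigmoid and tanh derivatives shrink toward 0 as |x| grows.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)             # maximum 0.25 at x = 0

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2     # maximum 1.0 at x = 0

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(d_sigmoid(x))   # near 0 at the extremes -> vanishing gradient
print(d_tanh(x))      # same saturation effect, but a larger peak slope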
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 36
Activation Functions – Sigmoid vs ReLU

• ReLU is half rectified
  • f(z) = 0 when z < 0, and f(z) = z when z >= 0
  • the derivative is 1 when z is positive and 0 when z is negative
  • the function and its derivative are both monotonic
• An alternative to ReLU is the softplus activation function
  • softplus(z) = log(1 + exp(z)); close to 0 when z is negative and close to z when z is positive
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 37
Activation Functions- ReLU vs Leaky ReLU

• The leak helps to increase the range of the ReLU function
• Usually, the value of the leak coefficient a is 0.01
• When a is randomized or learned rather than fixed, the variants are called Randomized ReLU and Parametric ReLU (see the later slide on non-saturating activations)
• The range of Leaky ReLU is (-infinity, infinity)
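A small NumPy sketch of ReLU, Leaky ReLU (with leak coefficient a) and softplus, matching the definitions above.

# Minimal sketch: ReLU-family activations.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, a=0.01):
    return np.where(z >= 0, z, a * z)      # small slope a for negative inputs

def softplus(z):
    return np.log1p(np.exp(z))             # smooth approximation of ReLU

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z), leaky_relu(z), softplus(z), sep="\n")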
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 38
Activation Functions - SoftMax
• usually the last activation function of a network
• normalizes the output of the network into a probability distribution over the predicted output classes
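A minimal NumPy sketch of a numerically stable softmax; subtracting the maximum logit leaves the result unchanged but avoids overflow in exp().

# Minimal sketch: numerically stable softmax.
import numpy as np

def softmax(z):
    z = z - np.max(z)                  # stability shift, does not change the result
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p, p.sum())                      # probabilities that sum to 1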

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 39


Activation Function Cheat Sheet

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 40


[Link]
Output Functions

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 41


Regression problems - Output Function

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 42


Regression problems - Loss Function

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 43


Classification problem- Output Function

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 44


Classification Problems- Loss Function

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 45


Typical MLP architecture

Regression Classification

Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow", O'Reilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 46
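As a rough companion to the architecture diagrams on this slide (not reproduced here), a minimal Keras sketch, assuming TensorFlow 2.x; the layer sizes and input shapes are illustrative, not prescribed by the slides.

# Minimal sketch: typical MLPs for regression and for multi-class classification.
import tensorflow as tf

regressor = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(1)                         # no activation: any real value
])
regressor.compile(loss="mse", optimizer="sgd")

classifier = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28 * 28,)),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")  # one probability per class
])
classifier.compile(loss="sparse_categorical_crossentropy",
                   optimizer="sgd", metrics=["accuracy"])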
Regression Loss Functions
• Mean Squared Error (MSE)
  • values with a large error are penalized heavily
  • is a convex function with a clearly defined global minimum
  • can be used in gradient descent optimization to set the weight values
  • very sensitive to outliers, which significantly increase the loss
• Mean Absolute Error (MAE)
  • used when the training data has a large number of outliers
  • its gradient has the same magnitude everywhere, so as the average error approaches 0, gradient descent optimization struggles to converge (see the sketch below)
• Huber Loss
  • based on the absolute difference between the actual and predicted value and a threshold value, δ
  • is quadratic when the error is smaller than δ but linear when the error is larger than δ
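A minimal NumPy sketch of the three regression losses, with an outlier in the toy data to show MSE's sensitivity; the delta value is illustrative.

# Minimal sketch: MSE, MAE and Huber loss.
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    quad = 0.5 * err ** 2                       # used when |err| <= delta
    lin  = delta * (np.abs(err) - 0.5 * delta)  # used when |err| >  delta
    return np.mean(np.where(np.abs(err) <= delta, quad, lin))

y_true = np.array([1.0, 2.0, 3.0, 100.0])       # last point is an outlier
y_pred = np.array([1.1, 1.9, 3.2, 4.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))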
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 47
Classification Loss Functions
Cross entropy measures the difference between two probability distributions
• Binary Cross-Entropy / Log Loss
  • compares the actual value (0 or 1) with the predicted probability that the input belongs to that category
  • p(i) = probability that the category is 1; 1 - p(i) = probability that the category is 0
• Categorical Cross-Entropy Loss
  • used in cases where the number of classes is greater than two (see the sketch below)
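A minimal NumPy sketch of binary and categorical cross-entropy; the eps clip is only there to guard against log(0), and the sample labels and probabilities are illustrative.

# Minimal sketch: binary and categorical cross-entropy.
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

y = np.array([1, 0, 1])                     # binary labels
p = np.array([0.9, 0.2, 0.6])               # predicted P(class = 1)
print(binary_cross_entropy(y, p))

y_oh = np.array([[1, 0, 0], [0, 1, 0]])     # one-hot labels, 3 classes
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(categorical_cross_entropy(y_oh, probs))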

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 48


Keras Metrics
• [Link](..., metrics=['mse'])
• Metric values are recorded at the end of each epoch on the training dataset.
• If a validation dataset is also provided, the metric is also calculated for the validation dataset.
• All metrics are reported in the verbose output and in the history object returned from calling the fit() function.
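A minimal sketch, assuming TensorFlow 2.x Keras, of attaching metrics at compile time and reading them back from the history object; the tiny model and random data are placeholders.

# Minimal sketch: metrics at compile time and in the history object.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer="sgd", loss="mse",
              metrics=["mse", "mae"])            # recorded at the end of every epoch

X = np.random.rand(64, 4); y = np.random.rand(64, 1)
history = model.fit(X, y, epochs=3, validation_split=0.2, verbose=0)
print(history.history.keys())   # e.g. loss, mse, mae, val_loss, val_mse, val_mae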

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 49



Keras Metrics
• Accuracy metrics
• Accuracy
• Calculates how often predictions equal labels.
• Binary Accuracy
• Calculates how often predictions match binary labels.
• Categorical Accuracy
• Calculates how often predictions match one-hot labels
• Sparse Categorical Accuracy
• Calculates how often predictions match integer labels.
• TopK Categorical Accuracy
• calculates the percentage of records for which the targets are in the
top K predictions
• rank the predictions in the descending order of probability values.
• If the rank of the yPred is less than or equal to K, it is considered
accurate.
• Sparse TopK Categorical Accuracy class
• Computes how often integer targets are in the top K predictions.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 50


Keras Metrics
• Regression metrics
• Mean Squared Error
• Computes the mean squared error between y_true and y_pred
• Root Mean Squared Error
• Computes the root mean squared error metric between y_true and y_pred
• Mean Absolute Error
• Computes the mean absolute error between the labels and
predictions
• Mean Absolute Percentage Error
• MAPE = (1/n) * Σ(|actual – prediction| / |actual|) * 100
• Average difference between the predicted and the actual in %
• Mean Squared Logarithmic Error
• measure of the ratio between the true and predicted values.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 51


Keras Metrics
• AUC
• Precision
• Recall
• TruePositives
• TrueNegatives
• FalsePositives
• FalseNegatives
• PrecisionAtRecall
  • computes the best precision where recall is >= a specified value
• SensitivityAtSpecificity
  • computes the best sensitivity where specificity is >= a specified value
• SpecificityAtSensitivity
  • computes the best specificity where sensitivity is >= a specified value
• Probabilistic metrics
  • Binary Crossentropy
  • Categorical Crossentropy
  • Sparse Categorical Crossentropy
  • KLDivergence
    • a measure of how different two probability distributions are from each other
  • Poisson
    • used if the dataset comes from a Poisson distribution

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 52


Gradient based learning
• seeks to change the weights so that the next evaluation reduces the error
• is navigating down the gradient (or slope) of the error
• the process repeats until the global minimum is reached
• works well for convex functions
• it is expensive to calculate the gradients if the size of the data is huge
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 53
The Vanishing/Exploding Gradient Problems
• The algorithm computes the gradient of the cost function with respect to each parameter in the network
• Problems include
  • Vanishing gradients: gradients become very small as the algorithm progresses down to the lower layers, so connection weights remain practically unchanged and training never converges to a solution
  • Exploding gradients: gradients grow so large that layers get huge weight updates and the algorithm diverges
• Deep neural networks suffer from unstable gradients; different layers may learn at different speeds
• Reasons for unstable gradients, Glorot & Bengio (2010):
  • the combination of sigmoid activation and weight initialization from a normal distribution with mean 0 and standard deviation 1
  • the variance of the outputs of each layer is much greater than the variance of its inputs
  • the variance keeps increasing after each layer, until the activation function saturates (at 0 or 1, with derivative close to 0) at the top layers

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 54


Solutions include: 1. Weight Initialization
• The variance of the outputs of each layer has to be equal to the variance of its inputs
• Xavier's Initialization
  • the weight matrix W of a particular layer l is picked randomly from a normal distribution with
    • mean μ = 0
    • variance σ² = 1 / n_(l-1), the multiplicative inverse of the number of neurons in layer l-1
  • the bias b of all layers is initialized with 0

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 55


Solutions include: Weight Initialization
1. Xavier's (Glorot) Initialization
• the variance of the outputs of each layer should be equal to the variance of its inputs
• the gradients should have equal variance before and after flowing through a layer in the reverse direction
• fan-in: number of inputs to the layer
• fan-out: number of neurons (outputs) of the layer
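A minimal NumPy sketch of one common form of this idea, Glorot (Xavier) normal initialization with variance 2 / (fan_in + fan_out); the layer sizes are illustrative.

# Minimal sketch: Glorot (Xavier) normal weight initialization.
import numpy as np

def glorot_normal(fan_in, fan_out, seed=0):
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / (fan_in + fan_out))          # balances forward and backward variance
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W = glorot_normal(fan_in=300, fan_out=100)
print(W.std())   # close to sqrt(2 / 400) ≈ 0.0707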

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 56


Solutions include
2. Using Non-saturating Activation Functions
  1. ReLU does not saturate for positive values
  2. But it suffers from the dying-ReLU problem: some neurons output only 0
     • the weighted sum of their inputs is negative for all instances in the training set
  3. Use Leaky ReLU instead (neurons only go into a coma, they don't die, and may wake up)
     • α = 0.01, sometimes 0.2
  4. Flavours include
     • Randomized Leaky ReLU (RReLU): α is picked randomly in a given range during training and is fixed to an average value during testing
     • Parametric Leaky ReLU (PReLU): α is learned during training as a parameter
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 57
SELU
• Scaled Exponential Linear Units
• Induce self-normalization
• the output of each layer will tend to
preserve mean 0 and standard deviation
1 during training
• f(x) = λx if x>= 0
• f(x) = λα(exp(x)-1) if x < 0
• α = 1.6733 , λ= 1.0507
• conditions for self-normalization to
happen:
• The input features must be standardized
(mean 0 and standard deviation 1).
• Every hidden layer's weights must also be initialized using LeCun normal initialization.
• The network’s architecture must be
sequential.
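A minimal Keras sketch, assuming TensorFlow 2.x, of a sequential stack of SELU layers with LeCun normal initialization, i.e. the combination under which self-normalization is expected to hold; sizes are illustrative.

# Minimal sketch: SELU layers with LeCun normal initialization.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Input(shape=(30,))])
for _ in range(5):                               # plain sequential stack
    model.add(tf.keras.layers.Dense(
        64, activation="selu",
        kernel_initializer="lecun_normal"))      # needed for self-normalization
model.add(tf.keras.layers.Dense(1))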

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 58


Solutions Include
3. Batch Normalization (Ioffe and Szegedy, 2015)
• designed to solve the vanishing/exploding gradients problem; it is also a good regularizer
• a BN layer performs the standardizing and normalizing operations on the input of a layer coming from a previous layer
• Normalization brings the numerical data to a common scale (mean = 0, std dev = 1) without distorting its shape
• BN adds extra operations in the model, before the activation
  • the operation zero-centres and normalizes each input
  • then scales and shifts the result using two new parameter vectors per layer
• Each BN layer learns 4 parameter vectors
  • output scale vector, output offset vector, input mean vector, input standard deviation vector
• To zero-centre and normalize the inputs, the mean and standard deviation of the input need to be computed; the current mini-batch is used to evaluate them (see the Keras sketch below)
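A minimal Keras sketch, assuming TensorFlow 2.x, of BN layers placed before the activation as described above; layer sizes are illustrative.

# Minimal sketch: BatchNormalization before the activation.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(300, use_bias=False),    # BN supplies its own offset
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(100, use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
# Each BN layer holds 4 parameter vectors: the learned scale and offset,
# plus the moving mean and moving variance estimated from mini-batches.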
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 59
Solutions Include
3. Batch Normalization (Ioffe and Szegedy, 2015)

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 60


Solutions include
4. Gradient Clipping
  • clip gradients during backpropagation so that they never exceed some threshold
  • e.g., with a threshold of 0.1, all the partial derivatives of the loss are clipped to the range -0.1 to 0.1
  • the threshold can also be treated as a hyperparameter to tune (see the optimizer sketch after this list)
5. Reusing Pretrained Layers
  • Transfer Learning
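A minimal Keras sketch, assuming TensorFlow 2.x, of gradient clipping configured on the optimizer; the 0.1 clip value mirrors the example above, and clipnorm is mentioned only as an alternative.

# Minimal sketch: gradient clipping via the optimizer.
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.1)  # clip each partial derivative to [-0.1, 0.1]
# clipnorm=1.0 would instead rescale the whole gradient vector if its norm exceeds 1.0
# model.compile(loss="mse", optimizer=optimizer)   # attach to any model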

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 61


6. Faster Optimizers
• Optimizer
  • a function or an algorithm that modifies the attributes of the neural network, such as the weights and the learning rate
• Terminology
  • Weights / bias – the learnable parameters of a model that control the signal between two neurons
  • Epoch – the number of times the algorithm runs over the whole training dataset
  • Sample – a single row of a dataset
  • Batch – the number of samples taken for one update of the model parameters
  • Learning rate – defines the scale by which the model weights should be updated
  • Cost function / loss function – used to calculate the cost, i.e., the difference between the predicted value and the actual value

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 62


Mini-Batch Gradient Descent Deep Learning Optimizer
• Batch gradient descent: the gradient is the average of the gradients computed from ALL the samples in the dataset
• Mini-batch GD:
  • a subset of the dataset is used for calculating the loss function, therefore fewer iterations are needed
  • a batch size of 32 is considered appropriate for almost every case
  • Yann LeCun (2018): "Friends don't let friends use mini-batches larger than 32"
  • is faster, more efficient and more robust than the earlier variants of gradient descent
  • the cost function is noisier than with batch GD but smoother than with SGD
  • provides a good balance between speed and accuracy
  • it needs a hyperparameter, the mini-batch size, which has to be tuned to achieve the required accuracy (see the fit() sketch below)
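A minimal Keras sketch, assuming TensorFlow 2.x: the mini-batch size is passed through the batch_size argument of fit(); the tiny model and random data are placeholders.

# Minimal sketch: setting the mini-batch size in Keras.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer="sgd", loss="mse")
X, y = np.random.rand(1024, 10), np.random.rand(1024, 1)
model.fit(X, y, epochs=2, batch_size=32, verbose=0)   # 1024 / 32 = 32 updates per epoch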

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 63


Stochastic GD & SGD with Momentum DL
• "stochastic" refers to the randomness the algorithm is based upon
• instead of taking the whole dataset for each iteration, batches of data are selected randomly
• the path taken is full of noise compared to the (batch) gradient descent algorithm
• uses a higher number of iterations to reach the local minima, so the overall computation time increases
• the computation cost is still less than that of the batch gradient descent optimizer
• if the data is enormous and computational time is an essential factor, SGD should be preferred over the batch gradient descent algorithm
• Stochastic Gradient Descent with Momentum Deep Learning Optimizer
  • momentum helps in faster convergence of the loss function

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 64


SGD with Momentum Optimizer
• GD takes small, regular steps down the slope, so the algorithm takes more time to reach the bottom
• adding a fraction of the previous update to the current update makes the process a bit faster
• Hyperparameter β (momentum)
  • simulates a friction mechanism and prevents the momentum from becoming too large
  • set between 0 (high friction) and 1 (low friction), typically 0.9
• also rolls past local minima
• the learning rate should be decreased when using a high momentum term

Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow", O'Reilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 65
SGD with Nesterov Momentum Optimization
• proposed by Yurii Nesterov in 1983
• measures the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum
• since the momentum vector will be pointing in the right direction (i.e., toward the optimum), it is slightly more accurate to use the gradient measured a bit farther in that direction rather than the gradient at the original position

Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow", O'Reilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 66
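A minimal Keras sketch, assuming TensorFlow 2.x, of momentum optimization and its Nesterov variant with the typical β = 0.9.

# Minimal sketch: momentum and Nesterov momentum optimizers.
import tensorflow as tf

sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
sgd_nesterov = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9,
                                       nesterov=True)   # gradient measured slightly ahead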
Adagrad (Adaptive Gradient Descent) Deep Learning Optimizer
• adaptive learning rate: scales down the gradient vector along the steepest dimensions
• if the cost function is steep along the i-th dimension, the accumulated term s gets larger and larger at each iteration
• no need to modify the learning rate manually
• more reliable than plain gradient descent algorithms, and it reaches convergence at a higher speed
• Disadvantage
  • it decreases the learning rate aggressively and monotonically
  • due to the small learning rates, the model eventually becomes unable to acquire more knowledge, and hence the accuracy of the model is compromised

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 67


RMSProp (Root Mean Square Propagation) Deep Learning Optimizer
• the problem with the gradients is that some are small while others may be huge, so defining a single learning rate might not be the best idea
• RMSProp accumulates only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training), using a decay rate β = 0.9
• works better than Adagrad; was popular until Adam

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 68


Adam Deep Learning Optimizer
• is derived from adaptive moment estimation
• inherits the features of both the Adagrad and RMSProp algorithms
• like momentum optimization, it keeps track of an exponentially decaying average of past gradients
• like RMSProp, it keeps track of an exponentially decaying average of past squared gradients
• β1 and β2 represent the decay rates of these averages; t represents the iteration
  • typical values: β1 = 0.9, β2 = 0.999, smoothing term ϵ = 10⁻⁷
• Advantages
  • is straightforward to implement
  • faster running time
  • low memory requirements, and requires less tuning
• Disadvantages
  • focuses on faster computation time, whereas SGD focuses on the data points
  • therefore SGD generalizes the data in a better manner, at the cost of lower computation speed (see the optimizer sketch below)
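A minimal Keras sketch, assuming TensorFlow 2.x, of RMSProp and Adam configured with the hyperparameter values quoted on these slides.

# Minimal sketch: RMSProp and Adam optimizers.
import tensorflow as tf

rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)       # rho plays the role of beta
adam = tf.keras.optimizers.Adam(learning_rate=0.001,
                                beta_1=0.9, beta_2=0.999, epsilon=1e-7)   # smoothing term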

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 69


Curve Fitting – True Function is Sinusoidal

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 70


Curve Fitting – True Function is Sinusoidal

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 71


Bias

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 72


Variance

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 73


Mean Square Error

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 74


Train vs Test Error

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 75


Train vs Test Error

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 76


Learning Curves
• Line plot of learning (y-axis) over experience (x-axis)
• The metric used to evaluate learning could be
• Optimization Learning Curves:
• calculated on the metric by which the parameters of the model are being optimized, e.g. loss.
• Minimizing, such as loss or error
• Performance Learning Curves:
• calculated on the metric by which the model will be evaluated and selected, e.g. accuracy.
• Maximizing metric , such as classification accuracy
• Train Learning Curve:
• calculated from the training dataset that gives an idea of how well the model is learning.
• Validation Learning Curve:
• calculated from a hold-out validation dataset that gives an idea of how well the model is
generalizing.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 77


Underfit Learning Curves

A plot of learning curves shows underfitting if:
• the training loss remains flat regardless of training, or
• the training loss continues to decrease until the end of training (i.e., the model was still improving when training stopped)
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 78
Overfitting Curves
• Overfitting
  • the model has specialized to the training data and is not able to generalize to new data
  • results in an increase in generalization error
• generalization error can be measured by
the performance of the model on the
validation dataset.
• A plot of learning curves shows
overfitting if:
• The plot of training loss continues to
decrease with experience.
• The plot of validation loss decreases to a
point and begins increasing again.
• The inflection point in validation loss may
be the point at which training could be
halted as experience after that point shows
the dynamics of overfitting.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 79


Good Fit Learning Curves
• A good fit is the goal of the learning
algorithm and exists between an overfit
and underfit model.
• A plot of learning curves shows a good fit
if:
• Plot of training loss decreases to a point of
stability
• Plot of validation loss decreases to a point
of stability and has a small gap with the
training loss.
• Loss of the model will almost always be
lower on the training than the validation
dataset.
• We should expect some gap between the
train and validation loss learning curves.
• This gap is referred to as the
“generalization gap.”
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 80
Avoiding Overfitting
• With so many parameters, DNN can fit
complex datasets.
• But also prone to overfitting the
training set.
• Early Stopping
• stop training as soon as the validation
error reaches a minimum
• With Stochastic and Mini-batch Gradient
Descent, the curves are not so smooth,
and it may be hard to know whether you
have reached the minimum or not.
• Stop only after the validation error has
been above the minimum for some time
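A minimal Keras sketch, assuming TensorFlow 2.x, of early stopping as a callback; the patience value and the commented fit() call (with placeholder X_train/X_valid names) are illustrative.

# Minimal sketch: early stopping on the validation error.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=10,                 # wait 10 epochs past the minimum
                                              restore_best_weights=True)   # roll back to the best model
# model.fit(X_train, y_train, epochs=1000,
#           validation_data=(X_valid, y_valid), callbacks=[early_stop])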

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 81


Avoiding overfitting
• Dropout
• Proposed by Geoffrey Hinton in 2012
• at every training step, every neuron has a
probability p of being temporarily
“dropped out,”
• it will be entirely ignored during this
training step, but it may be active during
the next step
• p is called the dropout rate, usually 50%.
• After training, neurons don’t get dropped
• If p = 50%
• during testing a neuron will be connected
to twice as many input neurons as it was
(on average) during training.
• multiply each input connection weight by
the keep probability (1 – p) after training
• Alternatively, divide each neuron’s output
by the keep probability during training
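A minimal Keras sketch, assuming TensorFlow 2.x, of dropout layers with p = 50%; Keras applies the rescaling during training and disables dropout at inference, so no manual weight scaling is needed.

# Minimal sketch: dropout regularization.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dropout(rate=0.5),               # each input dropped with probability 0.5
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])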

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 82


Overfitting using Regularization
• L1, L2 regularization
  • regularization can be used to constrain the NN weights
  • Lasso (l1), least absolute shrinkage and selection operator: adds the absolute value of the magnitude of the coefficients as a penalty term to the loss function
  • Ridge regression (l2): adds the squared magnitude of the coefficients as the penalty term to the loss function
  • use the l1(), l2(), or l1_l2() functions, which return a regularizer that computes the regularization loss at each step during training; the regularization loss is added to the final loss
• Max-Norm Regularization
  • constrains the weights w of the incoming connections such that ∥w∥₂ ≤ r
  • r is the max-norm hyperparameter and ∥ · ∥₂ is the ℓ2 norm
  • reducing r increases the amount of regularization and helps reduce overfitting
  • can also help alleviate the unstable gradients problem if batch normalization is not used (see the Keras sketch below)
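A minimal Keras sketch, assuming TensorFlow 2.x, of an l2-regularized layer combined with a max-norm constraint on its incoming weights; the coefficient 0.01 and r = 1.0 are illustrative.

# Minimal sketch: l2 regularization and a max-norm constraint on one layer.
import tensorflow as tf

layer = tf.keras.layers.Dense(
    100, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(0.01),     # penalty added to the final loss
    kernel_constraint=tf.keras.constraints.max_norm(1.0))  # enforce ||w||2 <= r after each update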

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 83


Constant learning rate not ideal
• Better to start with a high learning
rate
• then reduce it once it stops
making fast progress
• can reach a good solution faster
• Learning Schedule strategies can
be applied
• Power Scheduling
• Exponential Scheduling
• Piecewise Constant Scheduling
• Performance Scheduling

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 84


Learning Rate Scheduling
• Power scheduling
  • the learning rate is set as a function of the iteration number t: η(t) = η0 / (1 + t/s)^c
  • η0 is the initial learning rate, c is the power (typically set to 1), and s is the step hyperparameter
  • the learning rate drops at each step; after s steps it is down to η0/2, after 2s steps to η0/3, and so on
  • the schedule first drops quickly, then more and more slowly
  • optimizer = [Link](lr=0.01, decay=1e-4)
  • the decay is the inverse of s (the number of steps it takes to divide the learning rate by one more unit); Keras assumes that c is equal to 1
• Exponential scheduling
  • set the learning rate to η(t) = η0 · 0.1^(t/s)
  • the learning rate will gradually drop by a factor of 10 every s steps

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 85


Learning Rate Scheduling
• Piecewise constant scheduling
• Constant learning rate for a number of epochs
• e.g., η0 = 0.1 for 5 epochs
• then a smaller learning rate for another number of epochs
• e.g., η1 = 0.001 for 50 epochs and so on
• Performance scheduling
• Measure the validation error every N steps (just like for early stopping)
• reduce the learning rate by a factor of λ when the error stops dropping
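A minimal Keras sketch, assuming TensorFlow 2.x, of the scheduling strategies above: exponential decay, a piecewise-constant schedule via a LearningRateScheduler callback, and performance scheduling via ReduceLROnPlateau; all numbers are illustrative.

# Minimal sketch: three learning-rate scheduling options.
import tensorflow as tf

# Exponential scheduling: eta(t) = eta0 * 0.1 ** (t / s)
exp_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=20000, decay_rate=0.1)
optimizer = tf.keras.optimizers.SGD(learning_rate=exp_schedule)

# Piecewise constant scheduling: 0.1 for the first 5 epochs, then 0.001
def piecewise(epoch, lr):
    return 0.1 if epoch < 5 else 0.001
lr_callback = tf.keras.callbacks.LearningRateScheduler(piecewise)

# Performance scheduling: multiply the lr by `factor` when val_loss stops improving
plateau_callback = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=5)
# model.fit(..., callbacks=[lr_callback, plateau_callback])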

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 86


References
• Ian Goodfellow, Yoshua Bengio and Aaron Courville, "Deep Learning", MIT Press, 2016
• Swayam NPTEL Notes – Deep Learning, Mitesh Khapra
• Course Notes – Neural Networks and Deep Learning, Andrew Ng
• Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow", O'Reilly Publications
• Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 87
