FOUNDATIONS OF DEEP LEARNING
SURYA C
DEPT. OF ADS
MODULE 1
Introduction to Neural Network and
Deep Learning
CONTENTS
• Introduction
• The basic architecture of neural network
• Single computational Layer
• Multilayer Neural Networks
Autonomous Car: Recognizing Vehicles and Pedestrians on the Road
Courtesy: MIT Introduction to Deep Learning
What is Deep Learning?
Courtesy: MIT Introduction to Deep Learning
Why Deep Learning and Why Now?
Why Deep Learning?
Courtesy: MIT Introduction to Deep Learning
Why Now?
Courtesy: MIT Introduction to Deep Learning
Neural Network
The Perceptron
The Structural building block of Deep Learning
The Perceptron: Forward Propagation
Courtesy: MIT Introduction to Deep Learning
The Perceptron: Forward Propagation
Courtesy: MIT Introduction to Deep Learning
The Perceptron: Forward Propagation
Courtesy: MIT Introduction to Deep Learning
The Perceptron: Forward Propagation
Courtesy: MIT Introduction to Deep Learning
The Perceptron: Example
Courtesy: MIT Introduction to Deep Learning
The Perceptron: Example
Courtesy: MIT Introduction to Deep Learning
The Perceptron: Example
Courtesy: MIT Introduction to Deep Learning
The Perceptron: Example
Courtesy: MIT Introduction to Deep Learning
Multilayer Neural Network
Limitations of Perceptron
• A perceptron has a monotonicity property – it can only do a single type of job
• Explanation: With a positive weight, the activation can only increase as the input value increases. Each neuron interacts with the network individually, so a single perceptron cannot capture interactions between its inputs
• Example: A perceptron can represent logic gates like AND and OR, but it cannot represent the XOR function (see the sketch below)
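A minimal NumPy sketch of this limitation (not from the slides; the training loop, epochs, and learning rate are illustrative assumptions): the classic perceptron learning rule fits AND and OR exactly, but can never fit XOR because XOR is not linearly separable.

import numpy as np

def train_perceptron(X, y, epochs=50, lr=0.1):
    # Classic perceptron rule: w <- w + lr * (y - y_hat) * x, with a step activation
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if xi @ w + b > 0 else 0
            w += lr * (yi - y_hat) * xi
            b += lr * (yi - y_hat)
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
for name, y in [("AND", [0, 0, 0, 1]), ("OR", [0, 1, 1, 1]), ("XOR", [0, 1, 1, 0])]:
    w, b = train_perceptron(X, np.array(y))
    print(name, "learned:", predict(X, w, b), "target:", y)
# AND and OR are reproduced exactly; XOR never is, no matter how long we train.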
Multi Output Perceptron
Courtesy: MIT Introduction to Deep Learning
Single Layer Neural Network
Courtesy: MIT Introduction to Deep Learning
Single Layer Neural Network
Courtesy: MIT Introduction to Deep Learning
Multi Output Perceptron
Courtesy: MIT Introduction to Deep Learning
Deep Neural Network
Courtesy: MIT Introduction to Deep Learning
Deep Neural Network
Courtesy: MIT Introduction to Deep Learning
Activation function
Courtesy: MIT Introduction to Deep Learning
Activation Functions
Sigmoid/Logistic Activation Function
• The function takes any real value as input and
outputs values in the range of 0 to 1
• The function is represented by an S-shape
• The larger the input (more positive), the closer the output value will be to 1, whereas the smaller the input (more negative), the closer the output will be to 0
• Commonly used in models where the output is a probability
• The function is differentiable and provides a smooth gradient
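A minimal NumPy sketch of the sigmoid and its gradient, illustrating the properties above (the implementation itself is an assumption, not taken from the slides).

import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)); maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Smooth gradient: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))       # ~[0.0067, 0.5, 0.9933] – large positive inputs approach 1
print(sigmoid_grad(x))  # gradient is largest near 0 and vanishes at the extremes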
Tanh/Hyperbolic Tangent Activation Function
• Like sigmoid function but better
• The range of the function is from -1 to +1
• Tanh is also sigmoidal (S-shaped)
• The output of tanh is zero-centered, so output values can be mapped as strongly positive or strongly negative
• Usually used in hidden layers of neural networks: since its values lie between -1 and 1, the mean of the hidden layer activations comes out to be 0 or close to it
• This helps in centering the data and makes learning for the next layer much easier
• Mainly used in classification between two classes
• Both sigmoid and tanh are used in feedforward networks
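A minimal NumPy sketch of tanh and its gradient, showing the zero-centered output described above (an illustrative implementation, not from the slides).

import numpy as np

def tanh(x):
    # tanh(x) maps any real input into (-1, 1) and is zero-centered
    return np.tanh(x)

def tanh_grad(x):
    # tanh'(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x))        # [-0.964, 0.0, 0.964]
print(tanh_grad(x))   # gradient peaks at 0 and shrinks for large |x|
print(np.mean(tanh(np.random.randn(10_000))))  # close to 0: zero-centered outputs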
ReLU Activation Function
• Stands for Rectified Linear Unit
• Most used activation function in the world right now
• ReLU is half-rectified (from the bottom)
• Range is between 0 and ∞
• The function and its derivative are both monotonic
• Problem: all negative values become zero immediately, which decreases the ability of the model to fit or transform the data properly
• Function and derivative:
f(x) = 0 for x < 0; f(x) = x for x ≥ 0
f'(x) = 0 for x < 0; f'(x) = 1 for x ≥ 0
• Advantages
Only a certain number of neurons are activated at a time (sparse activation)
Far more computationally efficient compared to the sigmoid and tanh functions
Accelerates the convergence of gradient descent towards the global minimum of the loss function due to its linear, non-saturating property
• Disadvantage: the dying ReLU problem
The negative side of the graph makes the gradient value 0
Hence during backpropagation the weights and biases of some neurons are not updated
This can create dead neurons which never get activated
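A minimal NumPy sketch of ReLU and its derivative, matching the piecewise definition above (illustrative, not from the slides).

import numpy as np

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # f'(x) = 0 for x < 0, 1 for x >= 0
    return (x >= 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))       # [0. 0. 0. 2.] – negative inputs are clipped to zero
print(relu_grad(x))  # [0. 0. 1. 1.] – zero gradient on the negative side (dying ReLU)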
Leaky ReLU Activation Function
• An improved version of ReLU that solves the dying ReLU problem by giving the function a small positive slope in the negative region
• Function and derivative (usually c = 0.01):
f(x) = c·x for x < 0; f(x) = x for x ≥ 0
f'(x) = c for x < 0; f'(x) = 1 for x ≥ 0
• Advantages
It enables backpropagation even for negative input values
With this minor modification, the gradient on the left side of the graph becomes a non-zero value
Therefore there are no more dead neurons in that region
• Limitations
The predictions may not be consistent for negative input values
The gradient for negative values is small, which makes learning the model parameters time-consuming
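A minimal NumPy sketch of Leaky ReLU with the usual c = 0.01 (illustrative, not from the slides).

import numpy as np

def leaky_relu(x, c=0.01):
    # f(x) = x for x >= 0, c*x for x < 0 (small positive slope on the negative side)
    return np.where(x >= 0, x, c * x)

def leaky_relu_grad(x, c=0.01):
    # f'(x) = 1 for x >= 0, c for x < 0 – never exactly zero, so no dead neurons
    return np.where(x >= 0, 1.0, c)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(x))       # [-0.03  -0.005  0.  2.]
print(leaky_relu_grad(x))  # [0.01  0.01  1.  1.]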
Hard Tanh Activation Function
• Modified version of tanh
• Applies a threshold to the output to produce an
output between -1 and +1
f(x) = −1 for x < −1; x for −1 ≤ x ≤ 1; 1 for x > 1
f'(x) = 0 for x < −1; 1 for −1 ≤ x ≤ 1; 0 for x > 1
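A minimal NumPy sketch of hard tanh, which simply clips its input to [-1, 1] (illustrative, not from the slides).

import numpy as np

def hard_tanh(x):
    # Clips the input to the range [-1, 1]
    return np.clip(x, -1.0, 1.0)

def hard_tanh_grad(x):
    # Gradient is 1 inside [-1, 1] and 0 outside
    return ((x >= -1.0) & (x <= 1.0)).astype(float)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(hard_tanh(x))       # [-1.  -0.5  0.5  1.]
print(hard_tanh_grad(x))  # [0.  1.  1.  0.]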
Softmax Activation Function
• An activation function that scales numbers/logits
into probabilities
• The output of a softmax function is a vector (say V) containing the probability of each possible outcome
• The probabilities in vector V sum to 1 over all possible outcomes (classes)
• Z is the input vector to the softmax function S. It consists of n elements for n classes (possible outcomes)
• Zᵢ is the i-th element of the input vector; it can take any value in (−∞, +∞)
• S(Zᵢ) = e^(Zᵢ) / Σⱼ₌₁ⁿ e^(Zⱼ)
• Σⱼ₌₁ⁿ e^(Zⱼ) is a normalization term; it ensures that the values of the output vector S(Zᵢ) sum to 1, giving a valid probability distribution
• Derivative: ∂S(Zᵢ)/∂Zⱼ = S(Zᵢ)·(1 − S(Zᵢ)) if i = j, and −S(Zᵢ)·S(Zⱼ) if i ≠ j
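A minimal NumPy sketch of softmax. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick (an addition not shown on the slide) and does not change the result, since the shift cancels in the ratio.

import numpy as np

def softmax(z):
    # S(z_i) = e^(z_i) / sum_j e^(z_j)
    e = np.exp(z - np.max(z))   # shift by max(z) to avoid overflow
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])   # logits for 3 classes
p = softmax(z)
print(p)          # ~[0.659, 0.242, 0.099]
print(p.sum())    # 1.0 – a valid probability distribution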
Applying Neural Network
Courtesy: MIT Introduction to Deep Learning
Mean Squared Error (MSE) Loss
• MSE is used in regression problems
• MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
Courtesy: MIT Introduction to Deep Learning
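A minimal NumPy sketch of the MSE formula above (the example values are illustrative assumptions).

import numpy as np

def mse_loss(y_true, y_pred):
    # MSE = (1/n) * sum((y_i - y_hat_i)^2)
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print(mse_loss(y_true, y_pred))  # 0.375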
Mean Absolute Error (MAE) Loss
• MAE is also used in regression problems
• MAE = (1/n) Σᵢ₌₁ⁿ |yᵢ − ŷᵢ|
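A minimal NumPy sketch of the MAE formula above (the example values are illustrative assumptions).

import numpy as np

def mae_loss(y_true, y_pred):
    # MAE = (1/n) * sum(|y_i - y_hat_i|)
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print(mae_loss(y_true, y_pred))  # 0.5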
Binary Cross Entropy (BCE) Loss
• BCE is used in binary classification problems
• BCE = −(1/n) Σᵢ₌₁ⁿ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ]
Courtesy: MIT Introduction to Deep Learning
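A minimal NumPy sketch of the BCE formula above; the small eps clipping is a numerical safeguard against log(0), not something stated on the slide.

import numpy as np

def bce_loss(y_true, y_pred, eps=1e-12):
    # BCE = -(1/n) * sum(y*log(y_hat) + (1-y)*log(1-y_hat))
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.6])   # predicted probabilities of class 1
print(bce_loss(y_true, y_pred))  # ~0.236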
Categorical Cross Entropy Loss
• CCE is used in multi-class classification problems
• CCE = −Σᵢ₌₁ⁿ Σ_{c=1}^{C} yᵢ,c · log(ŷᵢ,c)
• Where yᵢ,c is a binary indicator (0 or 1) of whether class label c is the correct classification for observation i, and ŷᵢ,c is the predicted probability of class c for observation i
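A minimal NumPy sketch of the CCE formula above, using one-hot labels and softmax-style predicted probabilities (the example values are illustrative assumptions).

import numpy as np

def cce_loss(y_true_onehot, y_pred, eps=1e-12):
    # CCE = -sum_i sum_c y_{i,c} * log(y_hat_{i,c})
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.sum(y_true_onehot * np.log(y_pred))

# 2 observations, 3 classes; each row of y_pred is a probability distribution
y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])
print(cce_loss(y_true, y_pred))  # -(log 0.7 + log 0.6) ≈ 0.867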
Hinge Loss
• Hinge loss is mainly used in SVM but also can be used in neural network for binary
classification
• Hinge Loss = (1/n) Σᵢ₌₁ⁿ max(0, 1 − yᵢ · ŷᵢ)
• Here the labels yᵢ are encoded as −1 or +1, and ŷᵢ is the raw (signed) model output
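A minimal NumPy sketch of the hinge loss formula above (the example scores are illustrative assumptions).

import numpy as np

def hinge_loss(y_true, y_pred):
    # Hinge = (1/n) * sum(max(0, 1 - y_i * y_hat_i)), with labels in {-1, +1}
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))

y_true = np.array([1, -1, 1, -1])                # labels encoded as -1 / +1
y_pred = np.array([0.8, -0.5, -0.2, 0.3])        # raw (signed) model scores
print(hinge_loss(y_true, y_pred))  # (0.2 + 0.5 + 1.2 + 1.3) / 4 = 0.8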
Training a Neural Network
Courtesy: MIT Introduction to Deep Learning
Back Propagation Algorithm
Practical Issues in Neural Network Training
Training Neural Network
• Training a neural network is very challenging
• The loss function can be difficult to optimize
Optimization through gradient descent
η = 𝑙𝑒𝑎𝑟𝑛𝑖𝑛𝑔 𝑟𝑎𝑡𝑒
Courtesy: MIT Introduction to Deep Learning
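A minimal sketch (an illustrative assumption, not from the slides) of gradient descent on a toy one-parameter loss, using the learning rate η from above: w ← w − η · ∂L/∂w.

import numpy as np

# Toy loss L(w) = (w - 3)^2, whose gradient is dL/dw = 2*(w - 3)
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0        # initial weight
eta = 0.1      # learning rate (eta)
for step in range(50):
    w = w - eta * grad(w)   # gradient descent update: w <- w - eta * dL/dw
print(w)  # converges towards the minimum at w = 3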
How to deal with this?
Idea I:
Try lots of different learning rates and see what works “just right”
Idea II:
Design a learning rate algorithm that adapts to the loss landscape
Courtesy: MIT Introduction to Deep Learning
Adaptive Learning Rates
• The learning rate is no longer fixed
• It can be made larger or smaller depending on
How large the gradient is
How fast learning is happening
The size of particular weights
Courtesy: MIT Introduction to Deep Learning
Gradient Descent Algorithms
Courtesy: MIT Introduction to Deep Learning
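A minimal sketch of one adaptive-rate idea. The slide does not specify an algorithm, so an RMSProp-style update is used here purely as an illustration: each parameter's effective learning rate is η divided by a running average of its squared gradients, so steps shrink where gradients are large.

import numpy as np

def grad(w):
    # Toy loss L(w) = w[0]^2 + 10*w[1]^2, with very different curvature per parameter
    return np.array([2.0 * w[0], 20.0 * w[1]])

w = np.array([5.0, 5.0])
eta, beta, eps = 0.1, 0.9, 1e-8
v = np.zeros_like(w)                       # running average of squared gradients
for _ in range(200):
    g = grad(w)
    v = beta * v + (1.0 - beta) * g ** 2   # track per-parameter gradient magnitude
    w = w - eta * g / (np.sqrt(v) + eps)   # per-parameter adaptive step
print(w)  # both coordinates end up close to the minimum at (0, 0)
          # despite the very different gradient scales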
Mini Batches While Training
• These days we have huge amounts of data
• So rather than computing the gradient over the whole dataset at once, we compute the gradient on small batches of data
Courtesy: MIT Introduction to Deep Learning
Mini Batches While Training
• More accurate estimation of the gradient
Smoother convergence
Allows for larger learning rates
• Mini batches lead to faster training
Can parallelize computation
Achieve significant speed increases on GPUs
Courtesy: MIT Introduction to Deep Learning
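A minimal sketch of mini-batch gradient descent on a toy linear-regression problem; the data, batch size, and learning rate are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
eta, batch_size = 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))                       # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2.0 / len(batch) * Xb.T @ (Xb @ w - yb)  # MSE gradient on this batch only
        w -= eta * grad                                 # update from the mini-batch estimate
print(w)  # close to [2.0, -1.0, 0.5]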
Problem of Overfitting
Courtesy: MIT Introduction to Deep Learning
Regularization
• Techniques that constrain our optimization problem to discourage complex models
Regularization I: Dropout
Courtesy: MIT Introduction to Deep Learning
Regularization I: Dropout
Courtesy: MIT Introduction to Deep Learning
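A minimal sketch of (inverted) dropout applied to a layer's activations. The drop probability and the 1/(1 − p) scaling are a standard convention; the exact formulation here is an assumption, since the slide only illustrates the idea.

import numpy as np

def dropout(activations, p_drop=0.5, training=True, rng=None):
    # During training, randomly zero each unit with probability p_drop and scale
    # the survivors by 1/(1 - p_drop), so the expected activation is unchanged.
    # At test time the layer is left untouched.
    if not training:
        return activations
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.ones((2, 6))                  # pretend hidden-layer activations
print(dropout(h, p_drop=0.5))        # roughly half the units zeroed, the rest scaled to 2.0
print(dropout(h, training=False))    # unchanged at inference time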
Regularization II: Early Stopping
Courtesy: MIT Introduction to Deep Learning
Regularization II: Early Stopping
Courtesy: MIT Introduction to Deep Learning
Regularization II: Early Stopping
Courtesy: MIT Introduction to Deep Learning
Regularization II: Early Stopping
Courtesy: MIT Introduction to Deep Learning
Regularization II: Early Stopping
Courtesy: MIT Introduction to Deep Learning
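A minimal sketch of early stopping as a training-loop check: stop when the validation loss has not improved for `patience` epochs and remember the best epoch. The validation_loss function below is a toy stand-in (it improves and then worsens, mimicking overfitting); in practice it would come from a real validation pass.

import numpy as np

def validation_loss(epoch):
    # Toy curve: best around epoch 10, then rising again (overfitting)
    return (epoch - 10) ** 2 / 100.0 + 1.0

best_loss, best_epoch, patience, wait = np.inf, 0, 3, 0
for epoch in range(100):
    val = validation_loss(epoch)          # evaluate on held-out validation data
    if val < best_loss:
        best_loss, best_epoch, wait = val, epoch, 0   # improvement: save a checkpoint here
    else:
        wait += 1                         # no improvement this epoch
        if wait >= patience:
            print(f"early stop at epoch {epoch}, best epoch was {best_epoch}")
            break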
Data Imbalance
• We need a balanced dataset to train the model
• Otherwise we face the problem of overfitting or underfitting
• The model will be biased towards the classes with more samples
• Techniques to solve data imbalance
Resampling
Class weighting
Synthetic data generation
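A minimal sketch of one of these techniques, class weighting: give each class a weight inversely proportional to its frequency so that minority-class errors count more in the loss. The weighting formula used here, n_samples / (n_classes · count), is a common convention rather than something specified on the slide.

import numpy as np

labels = np.array([0] * 90 + [1] * 10)        # imbalanced: 90% class 0, 10% class 1

classes, counts = np.unique(labels, return_counts=True)
# weight_c = n_samples / (n_classes * count_c): rarer classes get larger weights
class_weights = len(labels) / (len(classes) * counts)
print(dict(zip(classes.tolist(), class_weights)))    # {0: 0.555..., 1: 5.0}

# These weights would multiply each sample's loss term during training, so
# mistakes on the minority class contribute more to the total loss.
sample_weights = class_weights[labels]
print(sample_weights[:3], sample_weights[-3:])       # [0.556 0.556 0.556] [5. 5. 5.]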
Hyperparameters
• Hyperparameters are the external configuration settings of a deep learning model
• Hyperparameters are set before training begins and remain constant throughout the training
• They control the behavior of the training process and the architecture of the model, influencing how the model learns and performs
• Examples:
1. Learning rate:
2. Number of epochs
3. Batch size
4. Network architecture
5. Activation function
6. Regularization parameters
7. Dropout rate
8. Optimizer
Hyperparameters
Importance
Choosing the right hyperparameters is crucial for the performance of a deep learning model
Proper tuning of hyperparameters can significantly improve model accuracy, reduce
overfitting, and decrease training time.
Hyperparameter Tuning:
Hyperparameter tuning involves searching for the best combination of hyperparameters to
optimize the model's performance
Common methods:
Grid Search
Random Search
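A minimal sketch of grid search over two hyperparameters. The candidate values and the evaluate function (which would normally train the model and return its validation score) are illustrative assumptions.

import itertools

# Hypothetical stand-in: in practice this trains the model with the given
# hyperparameters and returns its accuracy on the validation set.
def evaluate(learning_rate, batch_size):
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(batch_size - 64) / 1000

learning_rates = [0.1, 0.01, 0.001]
batch_sizes = [32, 64, 128]

best_score, best_params = -float("inf"), None
for lr, bs in itertools.product(learning_rates, batch_sizes):  # try every combination
    score = evaluate(lr, bs)
    if score > best_score:
        best_score, best_params = score, (lr, bs)
print("best:", best_params, "score:", round(best_score, 3))   # (0.01, 64)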
Validation Set
• Normally it is not useful to evaluate the model on the same data we used for training
• So we split the dataset into a training set and a test set
Validation Set
• If we use a validation set, it also serves as a proxy measure of accuracy during hyperparameter optimization, e.g. for scoring each hyperparameter combination in a grid search
Estimators – Bias and Variance
Estimators
• There are various ways to evaluate an ML model.
• We can use MSE (along with absolute error) for regression, and recall or ROC for a classification problem
• Some other validation methods
Accuracy
Precision
Recall
• Similarly, bias and variance help us in parameter tuning and in deciding on a better-fitted model
Bias and Variance
• Bias: The difference between the predicted value and the actual value of a model
The inability of a model to capture the correct relationship between the data points
• Variance: The difference between the predictions on different datasets (or the difference between the fits on the training and test sets)
It is the variability of the trained model – how sensitive it is to a different subset of the training data
Three cases: high bias & low variance (underfitting) | low bias & low variance (good fit) | low bias & high variance (overfitting)
Bias – Variance Tradeoff
Thank You