FDL Module1


FOUNDATIONS OF DEEP LEARNING

SURYA C
DEPT. OF ADS
MODULE 1

Introduction to Neural Network and Deep Learning
CONTENTS
• Introduction

• The basic architecture of neural network

• Single computational Layer

• Multilayer Neural Networks


Automotive Car Recognizing Vehicles and Pedestrians on the Road

Courtesy: MIT Introduction to Deep Learning


What is Deep Learning?

Courtesy: MIT Introduction to Deep Learning


Why Deep Learning and Why Now?
Why Deep Learning?

Courtesy: MIT Introduction to Deep Learning


Why Now?

Courtesy: MIT Introduction to Deep Learning


Neural Network
The Perceptron
The Structural building block of Deep Learning
The Perceptron: Forward Propagation

Courtesy: MIT Introduction to Deep Learning




The Perceptron: Example

Courtesy: MIT Introduction to Deep Learning




Multilayer Neural Network
Limitations of Perceptron
• A single perceptron has a monotonicity property: it can only do a single type of job
• Explanation: If an input has a positive weight, the activation can only increase as that
input value increases. Each input contributes to the weighted sum independently, so
a single perceptron cannot capture interactions between the inputs
• Example: A perceptron can represent logical gates like AND and OR, but it
cannot represent the XOR function, because XOR is not linearly separable
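The AND/OR/XOR example above can be checked with a small sketch. The step activation, the hand-picked weights, and the coarse search grid are assumptions made for illustration; they are not taken from the slides.

```python
import itertools
import numpy as np

def perceptron(x, w, b):
    """Single perceptron with a step activation: 1 if w.x + b > 0, else 0."""
    return int(np.dot(w, x) + b > 0)

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def truth_table(w, b):
    """Outputs of the perceptron on all four binary input pairs."""
    return [perceptron(np.array(x), np.array(w), b) for x in INPUTS]

# Hand-picked (illustrative) weights and biases
and_out = truth_table((1.0, 1.0), -1.5)  # AND gate
or_out = truth_table((1.0, 1.0), -0.5)   # OR gate

# Search a coarse grid of weights and biases for a setting that gives XOR.
grid = np.linspace(-2.0, 2.0, 17)
xor_target = [0, 1, 1, 0]
xor_found = any(
    truth_table((w1, w2), b) == xor_target
    for w1, w2, b in itertools.product(grid, grid, grid)
)
```

The search finds no setting for XOR on this grid, and none exists anywhere else either, since no single line separates the XOR classes.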
Multi Output Perceptron

Courtesy: MIT Introduction to Deep Learning


Single Layer Neural Network

Courtesy: MIT Introduction to Deep Learning






Deep Neural Network

Courtesy: MIT Introduction to Deep Learning




Activation function

Courtesy: MIT Introduction to Deep Learning


Courtesy: MIT Introduction to Deep Learning
Activation Functions
Sigmoid/Logistic Activation Function
• The function takes any real value as input and outputs values in the range 0 to 1
• The graph of the function is S-shaped
• The larger the input (more +ve), the closer the output is to 1, whereas the smaller
the input (more –ve), the closer the output is to 0
• Commonly used in models where the output is a probability
• The function is differentiable and provides a smooth gradient
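The sigmoid properties listed above can be sketched in a few lines. The slide's formula did not survive extraction, so the standard definition 1 / (1 + e^(-x)) and its derivative s(1 - s) are assumed here.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, written in terms of its output."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
y = sigmoid(x)       # close to 0, exactly 0.5, close to 1
g = sigmoid_grad(x)  # largest at x = 0, nearly 0 at the extremes
```

The gradient is smooth everywhere but becomes very small for large positive or negative inputs.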
Tanh/Hyperbolic Tangent Activation Function
• Similar to the sigmoid function, but its output is zero-centered
• The range of the function is from -1 to +1
• Tanh is also sigmoidal (S-shaped)
• The output of the tanh function is zero-centered,
hence we can map the output values as
strongly +ve or strongly -ve
• Usually used in hidden layers of neural
networks: its values lie between -1 and 1,
so the mean of the hidden layer's output comes
out to be 0 or close to it
• This helps in centering the data and makes
learning for the next layer much easier
• Mainly used in classification between two
classes
• Both sigmoid and tanh are used in feed-forward
networks
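The zero-centering point can be illustrated with a short sketch that compares the mean output of tanh and sigmoid on inputs that are symmetric around 0. The input values are arbitrary, and the identity tanh(x) = 2·sigmoid(2x) − 1 is a standard fact added here, not something stated on the slides.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 7)  # inputs symmetric around 0

tanh_out = np.tanh(x)
sigmoid_out = 1.0 / (1.0 + np.exp(-x))

tanh_mean = tanh_out.mean()        # approximately 0 (zero-centered)
sigmoid_mean = sigmoid_out.mean()  # approximately 0.5 (not zero-centered)

# tanh is a rescaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
identity_ok = np.allclose(tanh_out, 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0)
```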
ReLU Activation Function
• Stands for Rectified Linear Unit
• One of the most widely used activation functions
• ReLU is half rectified (from the bottom)
• Range is from 0 to ∞
• The function and its derivative are both monotonic
$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases}$
$f'(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x \ge 0 \end{cases}$
• Problem is that all the –ve values become zero immediately,
which decreases the ability of the model to fit or transform the
data properly
• Advantages
 Only a certain number of neurons are activated at a time
 Far more computationally efficient than the sigmoid and
tanh functions
 Accelerates the convergence of gradient descent towards a
minimum of the loss function due to its linear, non-saturating property
• Disadvantage (the dying ReLU problem)
 The –ve side of the graph makes the gradient value 0
 Hence, during backpropagation, the weights and biases of some
neurons are not updated
 This can create dead neurons which never get activated
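A minimal sketch of ReLU and its derivative follows, using the convention from this slide that the derivative at x = 0 is 1. The sample pre-activation values are made up for illustration.

```python
import numpy as np

def relu(x):
    """ReLU: 0 for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 0 for x < 0, 1 for x >= 0 (slide convention)."""
    return (x >= 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 1.5])  # illustrative pre-activations
out = relu(z)
grad = relu_grad(z)
```

Wherever the pre-activation is negative, both the output and the gradient are 0. A neuron whose pre-activation stays negative for every input therefore receives no weight updates.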
Leaky ReLU Activation Function
• An improved version of ReLU that solves the dying
ReLU problem, as it has a small +ve slope in the –ve
region
• Advantages
 It enables backpropagation even for –ve input values
 With this minor modification, the gradient on the left
side of the graph is a non-zero value
 Therefore there are no more dead neurons in that region
• Limitations
 The predictions may not be consistent for –ve input
values
 The gradient for –ve values is small, which
makes learning the model parameters time consuming
$f(x) = \begin{cases} cx & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases}$ (usually c = 0.01)
$f'(x) = \begin{cases} c & \text{for } x < 0 \\ 1 & \text{for } x \ge 0 \end{cases}$
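The Leaky ReLU definition above translates directly into code. This sketch uses the default c = 0.01 mentioned on the slide; the sample inputs are illustrative.

```python
import numpy as np

def leaky_relu(x, c=0.01):
    """Leaky ReLU: c * x for x < 0, x otherwise."""
    return np.where(x < 0, c * x, x)

def leaky_relu_grad(x, c=0.01):
    """Derivative of Leaky ReLU: c for x < 0, 1 otherwise."""
    return np.where(x < 0, c, 1.0)

z = np.array([-2.0, 0.0, 3.0])
out = leaky_relu(z)
grad = leaky_relu_grad(z)
```

The gradient on the negative side is small but non-zero, so the corresponding weights still receive updates.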
Hard Tanh Activation Function
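• Modified version of tanh
• Applies a threshold to the output to produce an output between -1 and +1

$f(x) = \begin{cases} -1 & \text{for } x < -1 \\ x & \text{for } -1 \le x \le 1 \\ 1 & \text{for } x > 1 \end{cases}$

$f'(x) = \begin{cases} 0 & \text{for } x < -1 \\ 1 & \text{for } -1 \le x \le 1 \\ 0 & \text{for } x > 1 \end{cases}$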
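Hard tanh is simply a clipping operation, as this short sketch shows. The sample inputs are illustrative.

```python
import numpy as np

def hard_tanh(x):
    """Hard tanh: clip the input to the interval [-1, 1]."""
    return np.clip(x, -1.0, 1.0)

def hard_tanh_grad(x):
    """Derivative: 1 inside [-1, 1], 0 outside."""
    return ((x >= -1.0) & (x <= 1.0)).astype(float)

z = np.array([-2.0, -1.0, 0.3, 1.0, 2.0])
out = hard_tanh(z)
grad = hard_tanh_grad(z)
```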
Softmax Activation Function
• An activation function that scales numbers/logits
into probabilities
• The output of a softmax function is a vector (say V)
with the probabilities of each possible outcome
• The probabilities in vector V sum to 1 over all
possible outcomes or classes
• Z is the input vector to the softmax function S. It
consists of n elements for n classes (possible
outcomes)
• $Z_i$ is the $i$-th element of the input vector. It can
take any value between $-\infty$ and $+\infty$

$S(Z_i) = \frac{e^{Z_i}}{\sum_{j=1}^{n} e^{Z_j}}$

• $\sum_{j=1}^{n} e^{Z_j}$ is a normalization term. It ensures that the
values of the output vector $S(Z_i)$ sum to 1 over all classes
and form a valid probability distribution

$\frac{\partial S(Z_i)}{\partial Z_j} = \begin{cases} S(Z_i) \times (1 - S(Z_i)) & \text{if } i = j \\ -S(Z_i) \times S(Z_j) & \text{if } i \ne j \end{cases}$
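The softmax function and its partial derivatives can be sketched as follows. Subtracting the maximum logit before exponentiating is a common numerical-stability step that is assumed here (it does not change the result), and the example logits are made up.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D logit vector."""
    # Subtracting the max leaves the result unchanged but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = dS(Z_i)/dZ_j = S_i * (1 - S_i) if i == j, else -S_i * S_j."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([2.0, 1.0, 0.1])  # illustrative logits
s = softmax(z)
J = softmax_jacobian(z)
```

Each column of the Jacobian sums to 0. This is expected: the outputs always sum to 1, so changing one logit cannot change the total.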
Applying Neural Network
Courtesy: MIT Introduction to Deep Learning
$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Courtesy: MIT Introduction to Deep Learning


Mean Absolute Error (MAE) Loss
• MAE is also used in regression problems
• $MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
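The two regression losses can be compared on a small example. The target and prediction values below are made up for illustration.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of squared differences."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean absolute error: average of absolute differences."""
    return np.mean(np.abs(y - y_hat))

y = np.array([3.0, -0.5, 2.0, 7.0])      # illustrative targets
y_hat = np.array([2.5, 0.0, 2.0, 8.0])   # illustrative predictions

mse_value = mse(y, y_hat)  # 0.375
mae_value = mae(y, y_hat)  # 0.5
```

Because the errors are squared, MSE penalizes the single large error (1.0) more heavily than MAE does.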
$BCE = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$

Courtesy: MIT Introduction to Deep Learning


Categorical Cross Entropy Loss
• CCE is used in multi-class classification problems
• $CCE = -\sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$
• Where $y_{i,c}$ is a binary indicator (0 or 1) of whether class label c is the correct
classification for observation $i$, and $\hat{y}_{i,c}$ is the predicted probability of class c
for observation $i$
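A minimal sketch of binary and categorical cross-entropy follows. It assumes the usual negative sign on the binary loss and clips probabilities away from 0 to avoid log(0). The clipping constant and the sample values are assumptions, not part of the slides.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Mean binary cross-entropy for labels y in {0, 1} and probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def categorical_cross_entropy(Y, P, eps=1e-12):
    """CCE summed over observations; Y is one-hot, P holds class probabilities."""
    P = np.clip(P, eps, 1.0)
    return -np.sum(Y * np.log(P))

# Two observations, two classes (illustrative values)
y = np.array([1.0, 0.0])
p = np.array([0.9, 0.2])                # predicted P(class 1)
Y = np.column_stack([1.0 - y, y])       # one-hot labels
P = np.column_stack([1.0 - p, p])       # per-class probabilities

bce = binary_cross_entropy(y, p)
cce = categorical_cross_entropy(Y, P)
```

With two classes, the summed CCE equals n times the mean binary cross-entropy, so the two losses agree up to the 1/n factor.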
Hinge Loss
• Hinge loss is mainly used in SVMs, but it can also be used in neural networks for binary
classification
• $\text{Hinge Loss} = \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i \cdot \hat{y}_i)$
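The hinge loss can be sketched as below. It assumes labels encoded as -1 and +1 and raw (unsquashed) model scores, which is the usual convention for this loss; the sample values are illustrative.

```python
import numpy as np

def hinge_loss(y, scores):
    """Mean hinge loss; y must be -1 or +1, scores are raw model outputs."""
    return np.mean(np.maximum(0.0, 1.0 - y * scores))

y = np.array([1.0, -1.0, 1.0, -1.0])
scores = np.array([2.0, -0.5, 0.3, 1.0])

loss = hinge_loss(y, scores)  # per-sample terms: 0, 0.5, 0.7, 2.0
```

Only the first sample, which is correct with a margin of at least 1, contributes no loss.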
Training a Neural Network
Courtesy: MIT Introduction to Deep Learning
Back Propagation Algorithm
Practical Issues in Neural Network Training
Training Neural Network
• Training a neural network is very challenging

• Loss function can be difficult to optimize


 Optimization through gradient descent

η = 𝑙𝑒𝑎𝑟𝑛𝑖𝑛𝑔 𝑟𝑎𝑡𝑒
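The effect of the learning rate can be seen on a toy problem. This sketch assumes the standard update w ← w − η · ∇L(w) and uses a made-up one-dimensional loss L(w) = (w − 3)²; neither the loss nor the three learning rates come from the slides.

```python
def gradient_descent(grad_fn, w0, lr, steps):
    """Plain gradient descent: w <- w - lr * grad(w), repeated `steps` times."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

w_small = gradient_descent(grad, 0.0, lr=0.001, steps=50)  # too small: barely moves
w_good = gradient_descent(grad, 0.0, lr=0.1, steps=50)     # converges near 3
w_large = gradient_descent(grad, 0.0, lr=1.1, steps=50)    # too large: diverges
```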
Courtesy: MIT Introduction to Deep Learning
How to deal with this?

Idea I:

Try lots of different learning rates and see what works “just right”

Idea II:

Design a learning rate algorithm, that adapts to its landscape

Courtesy: MIT Introduction to Deep Learning


Adaptive Learning Rates
• Learning rates are no longer fixed
• They can be made larger or smaller depending on
 How large the gradient is
 How fast learning is happening
 The size of particular weights
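One way to adapt the step size to the gradient magnitude is an AdaGrad-style update, sketched below. AdaGrad is chosen here only as an example of an adaptive method; the slides do not say which algorithm they have in mind, and the badly scaled quadratic loss is made up.

```python
import numpy as np

def adagrad(grad_fn, w0, lr=0.5, steps=200, eps=1e-8):
    """AdaGrad-style descent: each weight's step shrinks with its gradient history."""
    w = np.asarray(w0, dtype=float).copy()
    g2_sum = np.zeros_like(w)  # running sum of squared gradients per weight
    for _ in range(steps):
        g = grad_fn(w)
        g2_sum += g * g
        w -= lr * g / (np.sqrt(g2_sum) + eps)
    return w

def grad(w):
    """Gradient of L(w) = 0.5 * (100 * w[0]**2 + w[1]**2), a badly scaled bowl."""
    return np.array([100.0, 1.0]) * w

w_final = adagrad(grad, [1.0, 1.0])
```

The two weights have gradients that differ by a factor of 100, yet both converge, because each weight's step is normalized by its own gradient history.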

Courtesy: MIT Introduction to Deep Learning


Gradient Descent Algorithms

Courtesy: MIT Introduction to Deep Learning


Mini Batches While Training
• These days we have huge amounts of data

• So rather than computing the gradient over the whole dataset, we compute the gradient
in small batches

Courtesy: MIT Introduction to Deep Learning


Mini Batches While Training
• More accurate estimation of gradient
 Smoother convergence
 Allows for larger learning rates

• Mini batches lead to fast training
 Can parallelize computation
 Achieve significant speed increases on GPUs
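Mini-batch training can be sketched on a small linear regression problem. The synthetic data, batch size, learning rate, and number of epochs are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noiseless regression data (illustrative)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

def minibatch_sgd(X, y, lr=0.1, batch_size=32, epochs=20):
    """Fit linear weights by gradient descent on the MSE of one mini-batch at a time."""
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ err / len(idx)  # MSE gradient on the batch
            w -= lr * grad
    return w

w_hat = minibatch_sgd(X, y)
```

Each update touches only 32 rows, so it is cheap, and the batch's matrix products are the kind of work that parallel hardware handles well.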

Courtesy: MIT Introduction to Deep Learning


Problem of Overfitting

Courtesy: MIT Introduction to Deep Learning


Regularization
• Techniques that constrain our optimization problem to discourage complex models
Regularization I: Dropout

Courtesy: MIT Introduction to Deep Learning




Regularization II: Early Stopping

Courtesy: MIT Introduction to Deep Learning




Data Imbalance
• We need a balanced dataset to train the model

• An imbalanced dataset can cause overfitting or underfitting

• The model will be biased towards the classes with more samples

• Techniques to solve data imbalance
 Resampling
 Class weighting
 Synthetic data generation
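Class weighting can be sketched with a simple inverse-frequency rule, weight_c = n / (k · n_c), where n is the number of samples, k the number of classes, and n_c the count of class c. This rule is one common heuristic assumed here for illustration, and the 90/10 label split is made up.

```python
import numpy as np

# Illustrative imbalanced labels: 90 samples of class 0, 10 of class 1
labels = np.array([0] * 90 + [1] * 10)

classes, counts = np.unique(labels, return_counts=True)

# Inverse-frequency weights: rarer classes get larger weights
weights = len(labels) / (len(classes) * counts)

# Per-sample weights that could multiply each sample's loss term
sample_weights = weights[labels]
```

With these weights, each class contributes the same total weight to the loss (50 each in this example), which counteracts the bias towards the majority class.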
Hyperparameters
• Hyperparameters are the external configurations of a deep learning model

• Hyperparameters are set before training begins and typically remain constant throughout
the training

• They control the behavior of the training process and the architecture of the model,
influencing how the model learns and performs

• Examples:
1. Learning rate
2. Number of epochs
3. Batch size
4. Network architecture
5. Activation function
6. Regularization parameters
7. Dropout rate
8. Optimizer
Hyperparameters
 Importance
 Choosing the right hyperparameters is crucial for the performance of a deep learning model
 Proper tuning of hyperparameters can significantly improve model accuracy, reduce
overfitting, and decrease training time.

 Hyperparameter Tuning:
 Hyperparameter tuning involves searching for the best combination of hyperparameters to
optimize the model's performance
 Common methods:
 Grid Search
 Random Search
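The two search methods can be contrasted in a short sketch. The `validation_loss` function is a hypothetical stand-in for "train a model with these hyperparameters and measure its validation loss"; its formula, the candidate values, and the sampling ranges are all made up.

```python
import itertools
import random

def validation_loss(lr, batch_size):
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (lr - 0.01) ** 2 * 1e4 + abs(batch_size - 64) / 64

# Grid search: evaluate every combination of a fixed set of values.
grid = {"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}
grid_best = min(
    itertools.product(grid["lr"], grid["batch_size"]),
    key=lambda params: validation_loss(*params),
)

# Random search: sample a fixed budget of combinations at random.
sampler = random.Random(0)
candidates = [
    (10 ** sampler.uniform(-3, -1), sampler.choice([32, 64, 128]))
    for _ in range(20)
]
random_best = min(candidates, key=lambda params: validation_loss(*params))
```

Grid search cost grows multiplicatively with every added hyperparameter, while random search keeps a fixed budget and can try learning rates that lie between the grid points.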
Validation Set
• Normally it is not useful to evaluate the model with the data we used for training

• So we split the dataset into training set and test set
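A minimal sketch of splitting a dataset by index follows. It assumes a three-way split into training, validation, and test sets, and the 70/15/15 fractions are an arbitrary choice for illustration.

```python
import numpy as np

def train_val_test_split(n, val_frac=0.15, test_frac=0.15, seed=0):
    """Return shuffled, non-overlapping index arrays for train, validation, test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = train_val_test_split(100)
```

The validation indices are used for choosing hyperparameters, while the test indices are held back for the final evaluation.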


Validation Set
• If we use a validation set, it is also helpful as a proxy measure of accuracy during the
process of hyperparameter optimization

• Grid Search:
Estimators – Bias and Variance
Estimators
• There are various ways to evaluate an ML model

• We can use MSE, along with absolute error, for regression problems, and recall and ROC
for classification problems

• Some other validation methods
 Accuracy
 Precision
 Recall

• Similarly, bias and variance help us tune parameters and decide which model is better fitted
Bias and Variance
• Bias: The difference between the model's average prediction and the actual value
 The inability of a model to capture the correct relationship between the data points

• Variance: The difference between the predictions on different datasets (or the difference between the fits on the
training and testing sets)
 It measures how sensitive the trained model is to a different subset of the training data

[Figure panels: High bias, low variance | Low bias, low variance | Low bias, high variance]
Bias – Variance Tradeoff
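The tradeoff can be written as the standard decomposition of the expected squared error. It assumes data generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and a model $\hat{f}$ fitted on a random training set; this notation is introduced here and does not appear on the slides.

```latex
\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible noise}}
\]
```

Making a model more flexible usually lowers the bias term and raises the variance term, so the goal is the complexity at which their sum is smallest.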
Thank You
