FOUNDATIONS OF DEEP LEARNING
SURYA C
DEPT. OF ADS
MODULE 1
Introduction to Neural Network and
Deep Learning
CONTENTS
• Introduction
• The basic architecture of neural network
• Single computational Layer
• Multilayer Neural Networks
Autonomous Car: Recognizing Vehicles and Pedestrians on the Road
Courtesy: MIT Introduction to Deep Learning
What is Deep Learning?
Courtesy: MIT Introduction to Deep Learning
Why Deep Learning and Why Now?
Why Deep Learning?
Courtesy: MIT Introduction to Deep Learning
Why Now?
Courtesy: MIT Introduction to Deep Learning
Neural Network
The Perceptron
The Structural building block of Deep Learning
The Perceptron: Forward Propagation
Courtesy: MIT Introduction to Deep Learning
The Perceptron: Forward Propagation
Courtesy: MIT Introduction to Deep Learning
The Perceptron: Forward Propagation
Courtesy: MIT Introduction to Deep Learning
The Perceptron: Forward Propagation
Courtesy: MIT Introduction to Deep Learning
The Perceptron: Example
Courtesy: MIT Introduction to Deep Learning
The Perceptron: Example
Courtesy: MIT Introduction to Deep Learning
The Perceptron: Example
Courtesy: MIT Introduction to Deep Learning
The Perceptron: Example
Courtesy: MIT Introduction to Deep Learning
Multilayer Neural Network
Limitations of Perceptron
• A perceptron has a monotonicity property – it can only do a single type of job
• Explanation: With a positive weight, the activation can only increase as the input value increases. Each neuron interacts with the network individually, so a single perceptron cannot capture interactions between its inputs
• Example: A perceptron can represent logic gates like AND and OR, but it cannot represent the XOR function (see the sketch below)
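A minimal NumPy sketch of this limitation (not from the slides; the training loop, epochs, and learning rate are illustrative assumptions): the classic perceptron learning rule fits AND and OR exactly, but can never fit XOR because XOR is not linearly separable.

import numpy as np

def train_perceptron(X, y, epochs=50, lr=0.1):
    # Classic perceptron rule: w <- w + lr * (y - y_hat) * x, with a step activation
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if xi @ w + b > 0 else 0
            w += lr * (yi - y_hat) * xi
            b += lr * (yi - y_hat)
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
for name, y in [("AND", [0, 0, 0, 1]), ("OR", [0, 1, 1, 1]), ("XOR", [0, 1, 1, 0])]:
    w, b = train_perceptron(X, np.array(y))
    print(name, "learned:", predict(X, w, b), "target:", y)
# AND and OR are reproduced exactly; XOR never is, no matter how long we train.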
Multi Output Perceptron
Courtesy: MIT Introduction to Deep Learning
Single Layer Neural Network
Courtesy: MIT Introduction to Deep Learning
Single Layer Neural Network
Courtesy: MIT Introduction to Deep Learning
Multi Output Perceptron
Courtesy: MIT Introduction to Deep Learning
Deep Neural Network
Courtesy: MIT Introduction to Deep Learning
Deep Neural Network
Courtesy: MIT Introduction to Deep Learning
Activation function
Courtesy: MIT Introduction to Deep Learning
Activation Functions
Sigmoid/Logistic Activation Function
• The function takes any real value as input and
outputs values in the range of 0 to 1
• The function is represented by an S-shape
• The larger the input (more positive), the closer the output value will be to 1, whereas the smaller the input (more negative), the closer the output will be to 0
• Commonly used in models where the output is a probability
• The function is differentiable and provides a smooth gradient
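A minimal NumPy sketch of the sigmoid and its gradient, illustrating the properties above (the implementation itself is an assumption, not taken from the slides).

import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)); maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Smooth gradient: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))       # ~[0.0067, 0.5, 0.9933] – large positive inputs approach 1
print(sigmoid_grad(x))  # gradient is largest near 0 and vanishes at the extremes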
Tanh/Hyperbolic Tangent Activation Function
• Like sigmoid function but better
• The range of the function is from -1 to +1
• Tanh is also sigmoidal (S-shaped)
• The output of tanh is zero-centered, so output values can be mapped as strongly positive or strongly negative
• Usually used in hidden layers of neural networks: since its values lie between -1 and 1, the mean of the hidden layer activations comes out to be 0 or close to it
• This helps in centering the data and makes learning for the next layer much easier
• Mainly used in classification between two classes
• Both sigmoid and tanh are used in feedforward networks
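A minimal NumPy sketch of tanh and its gradient, showing the zero-centered output described above (an illustrative implementation, not from the slides).

import numpy as np

def tanh(x):
    # tanh(x) maps any real input into (-1, 1) and is zero-centered
    return np.tanh(x)

def tanh_grad(x):
    # tanh'(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x))        # [-0.964, 0.0, 0.964]
print(tanh_grad(x))   # gradient peaks at 0 and shrinks for large |x|
print(np.mean(tanh(np.random.randn(10_000))))  # close to 0: zero-centered outputs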
ReLU Activation Function
• Stands for Rectified Linear Unit
• Most used activation function in the world right now
• ReLU is half-rectified (from the bottom)
• Range is between 0 and ∞
• The function and its derivative are both monotonic
• Problem: all negative values become zero immediately, which decreases the ability of the model to fit or transform the data properly
• Function and derivative:
f(x) = 0 for x < 0; f(x) = x for x ≥ 0
f'(x) = 0 for x < 0; f'(x) = 1 for x ≥ 0
• Advantages
Only a certain number of neurons are activated at a time (sparse activation)
Far more computationally efficient compared to the sigmoid and tanh functions
Accelerates the convergence of gradient descent towards the global minimum of the loss function due to its linear, non-saturating property
• Disadvantage: the dying ReLU problem
The negative side of the graph makes the gradient value 0
Hence during backpropagation the weights and biases of some neurons are not updated
This can create dead neurons which never get activated
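A minimal NumPy sketch of ReLU and its derivative, matching the piecewise definition above (illustrative, not from the slides).

import numpy as np

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # f'(x) = 0 for x < 0, 1 for x >= 0
    return (x >= 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))       # [0. 0. 0. 2.] – negative inputs are clipped to zero
print(relu_grad(x))  # [0. 0. 1. 1.] – zero gradient on the negative side (dying ReLU)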
Leaky ReLU Activation Function
• An improved version of ReLU that solves the dying ReLU problem by giving the function a small positive slope in the negative region
• Function and derivative (usually c = 0.01):
f(x) = c·x for x < 0; f(x) = x for x ≥ 0
f'(x) = c for x < 0; f'(x) = 1 for x ≥ 0
• Advantages
It enables backpropagation even for negative input values
With this minor modification, the gradient on the left side of the graph becomes a non-zero value
Therefore there are no more dead neurons in that region
• Limitations
The predictions may not be consistent for negative input values
The gradient for negative values is small, which makes learning the model parameters time-consuming
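A minimal NumPy sketch of Leaky ReLU with the usual c = 0.01 (illustrative, not from the slides).

import numpy as np

def leaky_relu(x, c=0.01):
    # f(x) = x for x >= 0, c*x for x < 0 (small positive slope on the negative side)
    return np.where(x >= 0, x, c * x)

def leaky_relu_grad(x, c=0.01):
    # f'(x) = 1 for x >= 0, c for x < 0 – never exactly zero, so no dead neurons
    return np.where(x >= 0, 1.0, c)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(x))       # [-0.03  -0.005  0.  2.]
print(leaky_relu_grad(x))  # [0.01  0.01  1.  1.]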
Hard Tanh Activation Function
• Modified version of tanh
• Applies a threshold to the output to produce an
output between -1 and +1
f(x) = −1 for x < −1; x for −1 ≤ x ≤ 1; 1 for x > 1
f'(x) = 0 for x < −1; 1 for −1 ≤ x ≤ 1; 0 for x > 1
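A minimal NumPy sketch of hard tanh, which simply clips its input to [-1, 1] (illustrative, not from the slides).

import numpy as np

def hard_tanh(x):
    # Clips the input to the range [-1, 1]
    return np.clip(x, -1.0, 1.0)

def hard_tanh_grad(x):
    # Gradient is 1 inside [-1, 1] and 0 outside
    return ((x >= -1.0) & (x <= 1.0)).astype(float)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(hard_tanh(x))       # [-1.  -0.5  0.5  1.]
print(hard_tanh_grad(x))  # [0.  1.  1.  0.]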
Softmax Activation Function
• An activation function that scales numbers/logits
into probabilities
• The output of a softmax function is a vector (say V) containing the probability of each possible outcome
• The probabilities in vector V sum to 1 over all possible outcomes (classes)
• Z is the input vector to the softmax function S. It consists of n elements for n classes (possible outcomes)
• Zᵢ is the i-th element of the input vector; it can take any value in (−∞, +∞)
• S(Zᵢ) = e^(Zᵢ) / Σⱼ₌₁ⁿ e^(Zⱼ)
• Σⱼ₌₁ⁿ e^(Zⱼ) is a normalization term; it ensures that the values of the output vector S(Zᵢ) sum to 1, giving a valid probability distribution
• Derivative: ∂S(Zᵢ)/∂Zⱼ = S(Zᵢ)·(1 − S(Zᵢ)) if i = j, and −S(Zᵢ)·S(Zⱼ) if i ≠ j
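A minimal NumPy sketch of softmax. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick (an addition not shown on the slide) and does not change the result, since the shift cancels in the ratio.

import numpy as np

def softmax(z):
    # S(z_i) = e^(z_i) / sum_j e^(z_j)
    e = np.exp(z - np.max(z))   # shift by max(z) to avoid overflow
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])   # logits for 3 classes
p = softmax(z)
print(p)          # ~[0.659, 0.242, 0.099]
print(p.sum())    # 1.0 – a valid probability distribution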
Applying Neural Network
Courtesy: MIT Introduction to Deep Learning
Mean Squared Error (MSE) Loss
• MSE is used in regression problems
• MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
Courtesy: MIT Introduction to Deep Learning
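A minimal NumPy sketch of the MSE formula above (the example values are illustrative assumptions).

import numpy as np

def mse_loss(y_true, y_pred):
    # MSE = (1/n) * sum((y_i - y_hat_i)^2)
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print(mse_loss(y_true, y_pred))  # 0.375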
Mean Absolute Error (MAE) Loss
• MAE is also used in regression problems
• MAE = (1/n) Σᵢ₌₁ⁿ |yᵢ − ŷᵢ|
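A minimal NumPy sketch of the MAE formula above (the example values are illustrative assumptions).

import numpy as np

def mae_loss(y_true, y_pred):
    # MAE = (1/n) * sum(|y_i - y_hat_i|)
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print(mae_loss(y_true, y_pred))  # 0.5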
Binary Cross Entropy (BCE) Loss
• BCE is used in binary classification problems
• BCE = −(1/n) Σᵢ₌₁ⁿ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ]
Courtesy: MIT Introduction to Deep Learning
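A minimal NumPy sketch of the BCE formula above; the small eps clipping is a numerical safeguard against log(0), not something stated on the slide.

import numpy as np

def bce_loss(y_true, y_pred, eps=1e-12):
    # BCE = -(1/n) * sum(y*log(y_hat) + (1-y)*log(1-y_hat))
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.6])   # predicted probabilities of class 1
print(bce_loss(y_true, y_pred))  # ~0.236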
Categorical Cross Entropy Loss
• CCE is used in multi-class classification problems
• CCE = −Σᵢ₌₁ⁿ Σ_{c=1}^{C} yᵢ,c · log(ŷᵢ,c)
• Where yᵢ,c is a binary indicator (0 or 1) of whether class label c is the correct classification for observation i, and ŷᵢ,c is the predicted probability of class c for observation i
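A minimal NumPy sketch of the CCE formula above, using one-hot labels and softmax-style predicted probabilities (the example values are illustrative assumptions).

import numpy as np

def cce_loss(y_true_onehot, y_pred, eps=1e-12):
    # CCE = -sum_i sum_c y_{i,c} * log(y_hat_{i,c})
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.sum(y_true_onehot * np.log(y_pred))

# 2 observations, 3 classes; each row of y_pred is a probability distribution
y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])
print(cce_loss(y_true, y_pred))  # -(log 0.7 + log 0.6) ≈ 0.867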
Hinge Loss
• Hinge loss is mainly used in SVM but also can be used in neural network for binary
classification
• Hinge Loss = (1/n) Σᵢ₌₁ⁿ max(0, 1 − yᵢ · ŷᵢ)
• Here the labels yᵢ are encoded as −1 or +1, and ŷᵢ is the raw (signed) model output
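A minimal NumPy sketch of the hinge loss formula above (the example scores are illustrative assumptions).

import numpy as np

def hinge_loss(y_true, y_pred):
    # Hinge = (1/n) * sum(max(0, 1 - y_i * y_hat_i)), with labels in {-1, +1}
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))

y_true = np.array([1, -1, 1, -1])                # labels encoded as -1 / +1
y_pred = np.array([0.8, -0.5, -0.2, 0.3])        # raw (signed) model scores
print(hinge_loss(y_true, y_pred))  # (0.2 + 0.5 + 1.2 + 1.3) / 4 = 0.8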
Training a Neural Network
Courtesy: MIT Introduction to Deep Learning
Back Propagation Algorithm
Practical Issues in Neural Network Training
Training Neural Network
• Training a neural network is very challenging
• The loss function can be difficult to optimize
Optimization through gradient descent
η = 𝑙𝑒𝑎𝑟𝑛𝑖𝑛𝑔 𝑟𝑎𝑡𝑒
Courtesy: MIT Introduction to Deep Learning
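A minimal sketch (an illustrative assumption, not from the slides) of gradient descent on a toy one-parameter loss, using the learning rate η from above: w ← w − η · ∂L/∂w.

import numpy as np

# Toy loss L(w) = (w - 3)^2, whose gradient is dL/dw = 2*(w - 3)
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0        # initial weight
eta = 0.1      # learning rate (eta)
for step in range(50):
    w = w - eta * grad(w)   # gradient descent update: w <- w - eta * dL/dw
print(w)  # converges towards the minimum at w = 3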
How to deal with this?
Idea I:
Try lots of different learning rates and see what works “just right”
Idea II:
Design a learning rate algorithm that adapts to the loss landscape
Courtesy: MIT Introduction to Deep Learning
Adaptive Learning Rates
• The learning rate is no longer fixed
• It can be made larger or smaller depending on
How large the gradient is
How fast learning is happening
The size of particular weights
Courtesy: MIT Introduction to Deep Learning
Gradient Descent Algorithms
Courtesy: MIT Introduction to Deep Learning
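A minimal sketch of one adaptive-rate idea. The slide does not specify an algorithm, so an RMSProp-style update is used here purely as an illustration: each parameter's effective learning rate is η divided by a running average of its squared gradients, so steps shrink where gradients are large.

import numpy as np

def grad(w):
    # Toy loss L(w) = w[0]^2 + 10*w[1]^2, with very different curvature per parameter
    return np.array([2.0 * w[0], 20.0 * w[1]])

w = np.array([5.0, 5.0])
eta, beta, eps = 0.1, 0.9, 1e-8
v = np.zeros_like(w)                       # running average of squared gradients
for _ in range(200):
    g = grad(w)
    v = beta * v + (1.0 - beta) * g ** 2   # track per-parameter gradient magnitude
    w = w - eta * g / (np.sqrt(v) + eps)   # per-parameter adaptive step
print(w)  # both coordinates end up close to the minimum at (0, 0)
          # despite the very different gradient scales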
Mini Batches While Training
• These days we have huge amounts of data
• So rather than computing the gradient over the whole dataset at once, we compute the gradient on small batches of data
Courtesy: MIT Introduction to Deep Learning
Mini Batches While Training
• More accurate estimation of the gradient
Smoother convergence
Allows for larger learning rates
• Mini batches lead to faster training
Can parallelize computation
Achieve significant speed increases on GPUs
Courtesy: MIT Introduction to Deep Learning
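A minimal sketch of mini-batch gradient descent on a toy linear-regression problem; the data, batch size, and learning rate are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
eta, batch_size = 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))                       # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2.0 / len(batch) * Xb.T @ (Xb @ w - yb)  # MSE gradient on this batch only
        w -= eta * grad                                 # update from the mini-batch estimate
print(w)  # close to [2.0, -1.0, 0.5]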
Problem of Overfitting
Courtesy: MIT Introduction to Deep Learning
Regularization
• Techniques that constrain our optimization problem to discourage complex models
Regularization I: Dropout
Courtesy: MIT Introduction to Deep Learning
Regularization I: Dropout
Courtesy: MIT Introduction to Deep Learning
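A minimal sketch of (inverted) dropout applied to a layer's activations. The drop probability and the 1/(1 − p) scaling are a standard convention; the exact formulation here is an assumption, since the slide only illustrates the idea.

import numpy as np

def dropout(activations, p_drop=0.5, training=True, rng=None):
    # During training, randomly zero each unit with probability p_drop and scale
    # the survivors by 1/(1 - p_drop), so the expected activation is unchanged.
    # At test time the layer is left untouched.
    if not training:
        return activations
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.ones((2, 6))                  # pretend hidden-layer activations
print(dropout(h, p_drop=0.5))        # roughly half the units zeroed, the rest scaled to 2.0
print(dropout(h, training=False))    # unchanged at inference time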
Regularization II: Early Stopping
Courtesy: MIT Introduction to Deep Learning
Regularization II: Early Stopping
Courtesy: MIT Introduction to Deep Learning
Regularization II: Early Stopping
Courtesy: MIT Introduction to Deep Learning
Regularization II: Early Stopping
Courtesy: MIT Introduction to Deep Learning
Regularization II: Early Stopping
Courtesy: MIT Introduction to Deep Learning
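A minimal sketch of early stopping as a training-loop check: stop when the validation loss has not improved for `patience` epochs and remember the best epoch. The validation_loss function below is a toy stand-in (it improves and then worsens, mimicking overfitting); in practice it would come from a real validation pass.

import numpy as np

def validation_loss(epoch):
    # Toy curve: best around epoch 10, then rising again (overfitting)
    return (epoch - 10) ** 2 / 100.0 + 1.0

best_loss, best_epoch, patience, wait = np.inf, 0, 3, 0
for epoch in range(100):
    val = validation_loss(epoch)          # evaluate on held-out validation data
    if val < best_loss:
        best_loss, best_epoch, wait = val, epoch, 0   # improvement: save a checkpoint here
    else:
        wait += 1                         # no improvement this epoch
        if wait >= patience:
            print(f"early stop at epoch {epoch}, best epoch was {best_epoch}")
            break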
Data Imbalance
• We need a balanced dataset to train the model
• Otherwise we face the problem of overfitting or underfitting
• The model will be biased towards the classes with more samples
• Techniques to solve data imbalance
Resampling
Class weighting
Synthetic data generation
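A minimal sketch of one of these techniques, class weighting: give each class a weight inversely proportional to its frequency so that minority-class errors count more in the loss. The weighting formula used here, n_samples / (n_classes · count), is a common convention rather than something specified on the slide.

import numpy as np

labels = np.array([0] * 90 + [1] * 10)        # imbalanced: 90% class 0, 10% class 1

classes, counts = np.unique(labels, return_counts=True)
# weight_c = n_samples / (n_classes * count_c): rarer classes get larger weights
class_weights = len(labels) / (len(classes) * counts)
print(dict(zip(classes.tolist(), class_weights)))    # {0: 0.555..., 1: 5.0}

# These weights would multiply each sample's loss term during training, so
# mistakes on the minority class contribute more to the total loss.
sample_weights = class_weights[labels]
print(sample_weights[:3], sample_weights[-3:])       # [0.556 0.556 0.556] [5. 5. 5.]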
Hyperparameters
• Hyperparameters are the external configuration settings of a deep learning model
• Hyperparameters are set before training begins and remain constant throughout the training
• They control the behavior of the training process and the architecture of the model, influencing how the model learns and performs
• Examples:
1. Learning rate:
2. Number of epochs
3. Batch size
4. Network architecture
5. Activation function
6. Regularization parameters
7. Dropout rate
8. Optimizer
Hyperparameters
Importance
Choosing the right hyperparameters is crucial for the performance of a deep learning model
Proper tuning of hyperparameters can significantly improve model accuracy, reduce
overfitting, and decrease training time.
Hyperparameter Tuning:
Hyperparameter tuning involves searching for the best combination of hyperparameters to
optimize the model's performance
Common methods:
Grid Search
Random Search
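A minimal sketch of grid search over two hyperparameters. The candidate values and the evaluate function (which would normally train the model and return its validation score) are illustrative assumptions.

import itertools

# Hypothetical stand-in: in practice this trains the model with the given
# hyperparameters and returns its accuracy on the validation set.
def evaluate(learning_rate, batch_size):
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(batch_size - 64) / 1000

learning_rates = [0.1, 0.01, 0.001]
batch_sizes = [32, 64, 128]

best_score, best_params = -float("inf"), None
for lr, bs in itertools.product(learning_rates, batch_sizes):  # try every combination
    score = evaluate(lr, bs)
    if score > best_score:
        best_score, best_params = score, (lr, bs)
print("best:", best_params, "score:", round(best_score, 3))   # (0.01, 64)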
Validation Set
• Normally it is not useful to evaluate the model on the same data we used for training
• So we split the dataset into a training set and a test set
Validation Set
• If we use a validation set, it also serves as a proxy measure of accuracy during hyperparameter optimization, e.g. for scoring each hyperparameter combination in a grid search
Estimators – Bias and Variance
Estimators
• There are various ways to evaluate an ML model.
• We can use MSE (along with absolute error) for regression, and recall or ROC for a classification problem
• Some other validation methods
Accuracy
Precision
Recall
• Similarly, bias and variance help us in parameter tuning and in deciding on a better-fitted model
Bias and Variance
• Bias: The difference between the predicted value and the actual value of a model
The inability of a model to capture the correct relationship between the data points
• Variance: The difference between the predictions on different datasets (or the difference between the fits on the training and test sets)
It is the variability of the trained model – how sensitive it is to a different subset of the training data
Three cases: high bias & low variance (underfitting) | low bias & low variance (good fit) | low bias & high variance (overfitting)
Bias – Variance Tradeoff
Thank You