How Neural Networks and Backpropagation Work
We have this input data:

    Feature 1    Feature 2
    0.5          -0.5
    0.3          0.4
    0.7          0.9

We wish to map it to:

    Feature 1    Feature 2
    0.9          0.1
    0.9          0.9
    0.1          0.1
Let’s take our first sample: input (0.5, −0.5), with target output (0.9, 0.1).
Consider this neural network (example taken from Neural Networks: A Classroom Approach by Satish Kumar):

[Network diagram: a 2-2-2 fully connected network.
Inputs: X1 = 0.5, X2 = −0.5.
Hidden layer: weights from (X1, X2) to hidden neuron 1 are (0.1, −0.2) with bias 0.01; weights to hidden neuron 2 are (0.3, 0.55) with bias −0.02.
Output layer: weights from the hidden neurons to output neuron 1 are (0.37, 0.9) with bias 0.31; weights to output neuron 2 are (−0.22, −0.12) with bias 0.27.
Desired outputs: d1 = 0.9, d2 = 0.1.]
Let’s start by moving forward (the forward pass)
The net value is the total input coming to the neuron.

Net value of the first neuron in the hidden layer:

z1 = x1(0.1) + x2(−0.2) + bias
z1 = 0.5(0.1) + (−0.5)(−0.2) + 0.01
z1 = 0.16
Net value of the second neuron in the hidden layer:

z2 = x1(0.3) + x2(0.55) + bias
z2 = 0.5(0.3) + (−0.5)(0.55) + (−0.02)
z2 = −0.145
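As a quick sanity check, here is a minimal Python sketch of these two net-value computations (variable names such as x1 and z1 are just for illustration):

```python
# Net values of the two hidden neurons for the first sample (x1, x2) = (0.5, -0.5).
x1, x2 = 0.5, -0.5

z1 = x1 * 0.1 + x2 * (-0.2) + 0.01     # weights 0.1 and -0.2, bias 0.01  -> 0.16
z2 = x1 * 0.3 + x2 * 0.55 + (-0.02)    # weights 0.3 and 0.55, bias -0.02 -> -0.145

print(round(z1, 3), round(z2, 3))      # 0.16 -0.145
```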
The activation of the neuron

Activation scales the input value (the net value) to a value from 0 to 1. For example, the sigmoidal function:

δ(z) = 1 / (1 + e^(−λz)), where z is the input (net) value. For simplicity, we will consider the slope λ = 1.

Activating the two neurons at the hidden layer:

δ(z1) = 1 / (1 + e^(−0.16)) = 0.5399
δ(z2) = 1 / (1 + e^(0.145)) = 0.4638
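A small sketch of this activation step; the function name sigmoid and the slope argument lam are my own labels, with lam fixed to 1 as assumed above:

```python
import math

def sigmoid(z, lam=1.0):
    # Sigmoid activation: squashes the net value into the range (0, 1); lam is the slope.
    return 1.0 / (1.0 + math.exp(-lam * z))

a1 = sigmoid(0.16)      # activation of the first hidden neuron,  ~0.5399
a2 = sigmoid(-0.145)    # activation of the second hidden neuron, ~0.4638
print(round(a1, 4), round(a2, 4))
```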
Let’s continue with the output neurons. Now, the hidden neurons’ outputs become the inputs to the neurons of the output layer.
Net value of the first output neuron:

y1 = 0.5399(0.37) + 0.4638(0.9) + 0.31
y1 = 0.9271

Similarly, for the second output neuron:

y2 = 0.5399(−0.22) + 0.4638(−0.12) + 0.27
y2 = 0.0955
Now, activating the output neurons:

δ(y1) = 1 / (1 + e^(−0.9271)) = 0.7164
δ(y2) = 1 / (1 + e^(−0.0955)) = 0.5238

So the actual outputs of the network for this sample are 0.7164 and 0.5238.
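Putting the whole forward pass for this sample into one sketch (all weights and biases are the ones from the diagram; the helper names are illustrative, and tiny differences from 0.7164 / 0.5238 come only from rounding):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# First sample and the hidden-layer activations computed above.
x1, x2 = 0.5, -0.5
a1 = sigmoid(x1 * 0.1 + x2 * (-0.2) + 0.01)    # ~0.5399
a2 = sigmoid(x1 * 0.3 + x2 * 0.55 - 0.02)      # ~0.4638

# Output layer: the hidden activations are the inputs of the output neurons.
y1 = a1 * 0.37 + a2 * 0.9 + 0.31               # net value ~0.9271
y2 = a1 * (-0.22) + a2 * (-0.12) + 0.27        # net value ~0.0955

# Approximately 0.716 and 0.524, matching the values above up to rounding.
print(round(sigmoid(y1), 4), round(sigmoid(y2), 4))
```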
Definition of Backpropagation
• A method to train the neural network by adjusting the weights of its neurons in order to reduce the output error.
Gradient Descent
• The base algorithm used to minimize the error with respect to the weights of the neural network. The learning rate determines the step size of the update used to reach the minimum.
• An epoch is one complete pass through all the samples.
https://www.learnopencv.com/understanding-activation-functions-in-deep-learning/
https://sebastianraschka.com/faq/docs/closed-form-vs-gd.html
The Backpropagation

Remember, our objective is to minimize the error by changing the weights.

Gradient descent: we move in the direction opposite to the derivative (opposite to the slope).

Negative slope: when we increase w, the loss is decreasing. −(−) = +, so the weight increases (moving right).

Positive slope: when we increase w, the loss is increasing. −(+) = −, so the weight decreases (moving left).

Weight Update Rule:

w ← w − η (dE/dw)

where w is the old weight, η is the learning rate (how fast we update the weights; in other words, the step size of the update), dE/dw is the gradient, and the minus sign makes the step go against the slope.

https://towardsdatascience.com/gradient-descent-in-a-nutshell-eaf8c18212f0
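To see the sign behaviour concretely, here is a minimal one-dimensional sketch. The loss E(w) = (w − 3)² is an illustrative choice of mine, not from the slides: starting left of the minimum the slope is negative and w increases, starting right of it the slope is positive and w decreases, and both runs settle near the minimum.

```python
def dE_dw(w):
    # Derivative of the illustrative loss E(w) = (w - 3)**2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)

eta = 0.1  # learning rate: the step size of the update

for w in (1.0, 5.0):               # start left, then right, of the minimum
    for _ in range(50):
        w = w - eta * dE_dw(w)     # w <- w - eta * dE/dw
    print(round(w, 4))             # both runs approach 3.0
```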
Local Minimum and Global Minimum

Convex optimization: one minimum, which is both the local and the global minimum.
Non-convex optimization: one or more local minima in addition to the global minimum (multiple local minima).

Image credits: https://www.oreilly.com/radar/the-hard-thing-about-deep-learning/
We need dE/dw, but along the chain w → net (z) → activation (a) → error (E) there is no w term in E itself, so the derivative cannot be taken directly.

Consider a simple example:

y = z + 2, i.e. y = f(z)
z = w + 4, i.e. z = g(w)

There is no w term in y, so dy/dw cannot be computed directly. Instead we use the chain rule:

dy/dw = (dy/dz) · (dz/dw)
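A quick numerical check of this toy example: dy/dz = 1 and dz/dw = 1, so the chain rule gives dy/dw = 1, and a finite-difference estimate agrees (the point w = 0.5 is arbitrary):

```python
def g(w):          # z = w + 4
    return w + 4.0

def f(z):          # y = z + 2
    return z + 2.0

# Chain rule: dy/dw = (dy/dz) * (dz/dw) = 1 * 1 = 1
chain = 1.0 * 1.0

# Central finite-difference estimate of dy/dw at w = 0.5.
h, w = 1e-6, 0.5
numeric = (f(g(w + h)) - f(g(w - h))) / (2 * h)

print(chain, round(numeric, 6))    # 1.0 1.0
```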
What should be done is to follow the chain w → net (z) → activation (a) → error (E), taking the local derivatives dz/dw, da/dz, and dE/da along the way.

The Chain Rule:

dE/dw = (dE/da) · (da/dz) · (dz/dw)
More Complex

With two layers, x → z1 → a1 → z2 → a2 → E (with weights w1 and w2), the chain becomes longer:

dE/dw1 = (dE/da2) · (da2/dz2) · (dz2/da1) · (da1/dz1) · (dz1/dw1)
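Here is a sketch of this longer chain on a tiny two-layer network with one neuron per layer and no biases; the numbers x = 0.5, w1 = 0.4, w2 = 0.7 and target d = 0.9 are made up for illustration, and the five-factor product is checked against a finite-difference estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, w1, w2, d = 0.5, 0.4, 0.7, 0.9   # illustrative values, not from the slides

def error(w1):
    a1 = sigmoid(w1 * x)            # z1 = w1*x, a1 = sigmoid(z1)
    a2 = sigmoid(w2 * a1)           # z2 = w2*a1, a2 = sigmoid(z2)
    return 0.5 * (d - a2) ** 2      # E

# Local derivatives along the chain x -> z1 -> a1 -> z2 -> a2 -> E.
a1 = sigmoid(w1 * x)
a2 = sigmoid(w2 * a1)
dE_da2  = -(d - a2)
da2_dz2 = a2 * (1 - a2)
dz2_da1 = w2
da1_dz1 = a1 * (1 - a1)
dz1_dw1 = x

chain = dE_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

# Central finite-difference estimate of dE/dw1 for comparison.
h = 1e-6
numeric = (error(w1 + h) - error(w1 - h)) / (2 * h)
print(round(chain, 6), round(numeric, 6))   # the two estimates agree
```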
Consider these neurons to work with: the same network, weights, and targets used in the forward pass above.
Adjusting the weight of the output neuron

We start with the weight 0.37, which connects the first hidden neuron to the first output neuron.
How much is the error changing with respect to the output?

Assuming one training sample per iteration (batch size of 1), the error is

E = ½ [{d1 − δ(y1)}² + {d2 − δ(y2)}²]    (d = expected, δ(y) = actual)

∂E/∂δ(y1) = −(d1 − δ(y1)) = −(0.9 − 0.7164) = −0.1836

How much is the output changing with respect to the input (net value)?

∂δ(y1)/∂y1 = δ(y1)[1 − δ(y1)] = 0.7164(1 − 0.7164) = 0.2031

How much is the input (net value) changing with respect to the weight?

∂y1/∂w = δ(z1) = 0.5399

All together:

∂E/∂w = (−0.1836)(0.2031)(0.5399) = −0.0201
Weight Update for the neuron

w_new = w_old − η (∂E/∂w)

where −0.0201 was found from the chain rule, 0.37 is the old weight, and η is the learning rate (how fast you are moving); assume it to be 1.2.

w_new = 0.37 − 1.2(−0.0201) = 0.37 + 1.2(0.0201) = 0.3941

0.3941 is the new weight.
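The same gradient and update for the output weight 0.37, written as a short sketch using the rounded forward-pass values from above (small differences from 0.3941 come only from rounding):

```python
d1, out1 = 0.9, 0.7164     # expected and actual output of the first output neuron
a1 = 0.5399                # activation of the first hidden neuron (input to this weight)
w, eta = 0.37, 1.2         # the weight being updated and the assumed learning rate

dE_dout = -(d1 - out1)          # error w.r.t. the output          -> -0.1836
dout_dy = out1 * (1 - out1)     # output w.r.t. the net value      -> ~0.2031
dy_dw   = a1                    # net value w.r.t. the weight      ->  0.5399

grad  = dE_dout * dout_dy * dy_dw   # ~ -0.0201
w_new = w - eta * grad              # ~  0.394 (0.3941 in the text, up to rounding)
print(round(grad, 4), round(w_new, 4))
```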
Adjusting the weight for the Hidden Layer

Next, adjust the weight w1 = 0.1, which connects x1 to the first hidden neuron. Because δ(z1) feeds into every output neuron, its gradient sums the contributions of all p output neurons (in our case, p = 2):

∂E/∂w1 = [ Σ over the p output neurons of (∂E/∂δ(yk)) (∂δ(yk)/∂yk) (∂yk/∂δ(z1)) ] · (∂δ(z1)/∂z1) · (∂z1/∂w1)

How much is the net value changing with respect to the weight?

z1 = x1(0.1) + x2(−0.2) + bias, so ∂z1/∂w1 = x1 = 0.5

How much is the hidden activation changing with respect to the net value?

δ(z1) = 1 / (1 + e^(−z1)), so ∂(δ(z1))/∂z1 = δ(z1)[1 − δ(z1)] = 0.5399(1 − 0.5399) = 0.2484
How much is the error changing with respect to δ(z1)? Since δ(z1) feeds both output neurons (in our case, p = 2), we add their contributions:

∂E/∂δ(z1) = (∂E/∂δ(y1))(∂δ(y1)/∂y1)(∂y1/∂δ(z1)) + (∂E/∂δ(y2))(∂δ(y2)/∂y2)(∂y2/∂δ(z1))

For the first output neuron (from before): ∂E/∂δ(y1) = −0.1836 and ∂δ(y1)/∂y1 = 0.2031.
Since y1 = δ(z1)(0.37) + δ(z2)(0.9) + 0.31, we have ∂y1/∂δ(z1) = 0.37.

For the second output neuron:
∂E/∂δ(y2) = −[d2 − δ(y2)] = −(0.1 − 0.5238) = 0.4238
∂δ(y2)/∂y2 = δ(y2)[1 − δ(y2)] = 0.5238(1 − 0.5238) = 0.2494
Since y2 = δ(z1)(−0.22) + δ(z2)(−0.12) + 0.27, we have ∂y2/∂δ(z1) = −0.22.

∂E/∂δ(z1) = (−0.1836)(0.2031)(0.37) + (0.4238)(0.2494)(−0.22) = −0.0370

Putting all three factors together:

∂E/∂w1 = (−0.0370)(0.2484)(0.5) = −0.0045954
Weight Update for the hidden neuron

w_new = w_old − η (∂E/∂w1)

where −0.0045954 was found from the chain rule, 0.1 is the old weight, and the learning rate is again assumed to be 1.2.

w_new = 0.1 − 1.2(−0.0045954) = 0.1 + 1.2(0.0045954) = 0.1055

0.1055 is the new weight.
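And the corresponding sketch for the hidden-layer weight 0.1, again using the rounded values carried over from the slides:

```python
x1, eta = 0.5, 1.2                 # input feeding this weight; assumed learning rate
a1 = 0.5399                        # activation of the first hidden neuron
d1, d2 = 0.9, 0.1                  # targets
out1, out2 = 0.7164, 0.5238        # actual outputs
w11, w12 = 0.37, -0.22             # weights from hidden neuron 1 to output neurons 1 and 2

# Error signal reaching delta(z1): sum over the two output neurons (p = 2).
dE_da1 = (-(d1 - out1)) * out1 * (1 - out1) * w11 \
       + (-(d2 - out2)) * out2 * (1 - out2) * w12        # -> about -0.0370

da1_dz1 = a1 * (1 - a1)                                  # -> about 0.2484
dz1_dw  = x1                                             # -> 0.5

grad  = dE_da1 * da1_dz1 * dz1_dw                        # -> about -0.0046
w_new = 0.1 - eta * grad                                 # -> about 0.1055
print(round(grad, 6), round(w_new, 4))
```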
A Final Diagram to Wrap it up

See the training figures at https://www.jeremyjordan.me/neural-networks-training/ for a diagram of the weight updates across the whole network: the blue path and the orange path through the network are combined to form the full gradient.
Continue
• A similar procedure is applied to all the other weights in the network.
Take the second sample (iteration 2): input (0.3, 0.4), with target output (0.9, 0.9), and repeat the same forward and backward passes with the updated weights.
Take the third sample (iteration 3): input (0.7, 0.9), with target output (0.1, 0.1).
• That was ONE EPOCH. An epoch is one complete pass through all the samples. After repeating that for many epochs (e.g. 25), our neural network is expected to reach the minimum error and be considered trained. We'll learn about optimization later!
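To wrap up, here is a compact end-to-end sketch of the whole procedure: forward pass, backpropagation, and weight updates, looping over the three samples for a number of epochs. The data, initial weights, learning rate, and epoch count match the slides, but the vectorised NumPy form is my own framing of the same per-sample updates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Data: the three input samples and their targets, as in the tables above.
X = np.array([[0.5, -0.5], [0.3, 0.4], [0.7, 0.9]])
D = np.array([[0.9,  0.1], [0.9, 0.9], [0.1, 0.1]])

# Initial weights and biases from the diagram: W1[i, j] connects input i to hidden neuron j,
# and W2[j, k] connects hidden neuron j to output neuron k.
W1 = np.array([[0.1, 0.3], [-0.2, 0.55]]); b1 = np.array([0.01, -0.02])
W2 = np.array([[0.37, -0.22], [0.9, -0.12]]); b2 = np.array([0.31, 0.27])
eta = 1.2                                   # learning rate assumed in the slides

for epoch in range(25):                     # one epoch = one pass through all the samples
    for x, d in zip(X, D):
        # Forward pass.
        a_hid = sigmoid(x @ W1 + b1)        # hidden activations
        a_out = sigmoid(a_hid @ W2 + b2)    # network outputs

        # Backward pass (chain rule, one sample per weight update).
        delta_out = -(d - a_out) * a_out * (1 - a_out)          # dE/d(net) at the output layer
        delta_hid = (delta_out @ W2.T) * a_hid * (1 - a_hid)    # dE/d(net) at the hidden layer

        # Weight update rule: w <- w - eta * dE/dw.
        W2 -= eta * np.outer(a_hid, delta_out); b2 -= eta * delta_out
        W1 -= eta * np.outer(x, delta_hid);     b1 -= eta * delta_hid

# Outputs after training; compare with the target table above.
print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2))
```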