Deep Learning for Engineering Students
Introduction to Deep Learning — Lecture 2
The Perceptron

[Diagram: inputs $x_1, x_2, \ldots, x_m$ with weights $w_1, w_2, \ldots, w_m$, a weighted sum $\Sigma$, a non-linear activation function $\int$, and output $\hat{y}$]
Input → Weights → Sum → Non-linearity → Output

$\hat{y} = g\left(\sum_{i=1}^{m} x_i w_i\right)$
Adding a bias term $b$:

$\hat{y} = g\left(b + \sum_{i=1}^{m} x_i w_i\right)$

In matrix form: $\hat{y} = g(XW + b)$
Common activation functions and their derivatives (a NumPy sketch follows below):

Sigmoid: $g(z) = \frac{1}{1 + e^{-z}}$, with $\dot{g}(z) = g(z)\left(1 - g(z)\right)$

Tanh: $g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, with $\dot{g}(z) = 1 - g(z)^2$

ReLU: $g(z) = \max(0, z)$, with $\dot{g}(z) = \begin{cases} 1, & z > 0 \\ 0, & \text{otherwise} \end{cases}$
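As a quick check of these formulas, here is a minimal NumPy sketch (the function names are illustrative, not from the slides):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))        # g(z) = 1 / (1 + e^-z)

def tanh(z):
    return np.tanh(z)                  # g(z) = (e^z - e^-z) / (e^z + e^-z)

def relu(z):
    return np.maximum(0, z)            # g(z) = max(0, z)

# Derivatives, matching the formulas above
def d_sigmoid(z):
    return sigmoid(z) * (1 - sigmoid(z))

def d_tanh(z):
    return 1 - np.tanh(z) ** 2

def d_relu(z):
    return (z > 0).astype(float)       # 1 if z > 0, 0 otherwise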
Example: the Perceptron in action

Take $w_0 = 1$ and $W = \begin{bmatrix} 3 \\ -2 \end{bmatrix}$, so that

$\hat{y} = g(XW + w_0)$

In this case, we have $\hat{y} = g(1 + 3x_1 - 2x_2)$

The argument $1 + 3x_1 - 2x_2 = 0$ is just a line in 2D: inputs on one side of the line give a positive weighted sum, inputs on the other side give a negative one. For example, for $X = \begin{bmatrix} -1 & 2 \end{bmatrix}$, the weighted sum is $1 + 3(-1) - 2(2) = -6$, so $\hat{y} = g(-6)$.
Simplified notation: write $z$ for the weighted sum, so

$z = w_0 + \sum_{i=1}^{m} x_i w_i, \qquad \hat{y} = g(z)$
Multi-output perceptron: each output node $j$ has its own weights,

$z_j = \sum_{i=1}^{m} x_i w_{ij} + w_{0j}, \qquad \hat{y}_j = g(z_j)$

Because all inputs are densely connected to all outputs, these layers are called Dense layers:

import tensorflow as tf
layer = tf.keras.layers.Dense(units=2)
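For intuition, the same computation without TensorFlow is one matrix product plus a bias. A sketch with assumed sizes ($m$ inputs, 2 outputs; the variable names are illustrative):

import numpy as np

m = 3                                  # number of inputs (assumed)
W = np.random.randn(m, 2)              # one weight column per output node
b = np.zeros((1, 2))                   # one bias per output node

x = np.random.randn(1, m)              # a single input row vector
z = x @ W + b                          # z_j = sum_i x_i w_ij + w_0j
y_hat = 1 / (1 + np.exp(-z))           # apply the activation g, here sigmoid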
Single hidden layer: hidden node $z_2^{(1)}$, for example, is computed as

$z_2^{(1)} = \sum_{i=1}^{m} x_i w_{i2}^{(1)} + w_{02}^{(1)} = x_1 w_{12}^{(1)} + x_2 w_{22}^{(1)} + \cdots + x_m w_{m2}^{(1)} + w_{02}^{(1)}$
[Diagram: a deep network with input $\boldsymbol{X}$ (input size), hidden weights $\boldsymbol{W_H}$ (input size × hidden size), hidden layer $\boldsymbol{H}$ (hidden size), output weights $\boldsymbol{W_O}$ (hidden size × output size), and output $\boldsymbol{O}$ (output size)]
Example

[Diagram: inputs $x_1, x_2$ plus a constant input 1, weights $w_1, w_2$ and bias weight $w_0$, sigmoid activation $\int$]

With bias weight $w_0 = 12$ and input weights $w_1 = -4$, $w_2 = -1$, the decision boundary is the line

$-4x_1 - x_2 + 12 = 0$

With $w_0 = 3$, $w_1 = -\tfrac{1}{5}$, $w_2 = -1$, the decision boundary is

$-\tfrac{1}{5}x_1 - x_2 + 3 = 0$
Example: forward pass for input $X = \begin{bmatrix} 2 & 2 \end{bmatrix}$

Hidden node 1 (weights $-4, -1$, bias $12$): $z_1 = -4(2) - 1(2) + 12 = 2$, so $g(z_1) \approx 0.88$
Hidden node 2 (weights $-\tfrac{1}{5}, -1$, bias $3$): $z_2 = -\tfrac{1}{5}(2) - 1(2) + 3 = 0.6$, so $g(z_2) \approx 0.64$
Output node (weights $1.5, 1$, bias $0.5$): $\hat{y} = g\left(g(z_1) \cdot 1.5 + g(z_2) \cdot 1 + 0.5\right) \approx g(2.46) \approx 0.92$

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([2, 2])
a1 = sigmoid(X @ np.array([-4, -1]) + 12)      # 0.88
a2 = sigmoid(X @ np.array([-1/5, -1]) + 3)     # 0.64
y_hat = sigmoid(1.5 * a1 + 1 * a2 + 0.5)       # 0.92
Example: forward pass for input $X = \begin{bmatrix} 4 & 6 \end{bmatrix}$

$z_1 = -4(4) - 1(6) + 12 = -10$, so $g(z_1) \approx 4.54 \times 10^{-5}$
$z_2 = -\tfrac{1}{5}(4) - 1(6) + 3 = -3.8$, so $g(z_2) \approx 0.02$
$\hat{y} = g\left(g(z_1) \cdot 1.5 + g(z_2) \cdot 1 + 0.5\right) \approx g(0.52) \approx 0.62$
Example: forward pass for input $X = \begin{bmatrix} 0 & 6 \end{bmatrix}$

$z_1 = -4(0) - 1(6) + 12 = 6$, so $g(z_1) \approx 1.0$
$z_2 = -\tfrac{1}{5}(0) - 1(6) + 3 = -3$, so $g(z_2) \approx 0.05$
$\hat{y} = g\left(g(z_1) \cdot 1.5 + g(z_2) \cdot 1 + 0.5\right) \approx g(2.05) \approx 0.89$
Exercise 1

Create a notebook with Google Colab (do not use TensorFlow) to compute the outputs of this neural network with the given matrices:

$\boldsymbol{w}^{(1)} = \begin{bmatrix} 0.6 & 0.9 & 0.7 & -0.1 \\ 2.6 & 0.5 & -0.1 & 0.7 \\ -0.7 & 1.4 & 0.3 & -1.2 \end{bmatrix} \qquad \boldsymbol{w}^{(2)} = \begin{bmatrix} 0.2 & -0.5 \\ -0.1 & 1.6 \\ -0.5 & 1.4 \\ -0.5 & -0.1 \end{bmatrix}$

Inputs: $x_1 = 2$, $x_2 = 3$, plus a constant bias input of 1
Hidden layer (nodes $z_1 \ldots z_4$): sigmoid
Output layer (outputs $\hat{y}_1, \hat{y}_2$): linear, $g(z) = z$
TensorFlow: stacking Dense layers builds a deep network.

import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(n1),
    tf.keras.layers.Dense(n2),
    ...
    tf.keras.layers.Dense(2)
])

For a hidden layer $l$ between layers $l-1$ and $l+1$, node $j$ computes

$z_j^{(l)} = \sum_{i=1}^{n^{(l-1)}} g\left(z_i^{(l-1)}\right) w_{ij}^{(l)} + w_{0j}^{(l)}$
Scalar form: with activations $a_i^{(l-1)} = g\left(z_i^{(l-1)}\right)$, each node $j$ in layer $l$ computes

$z_j^{(l)} = \sum_{i=1}^{d^{(l-1)}} a_i^{(l-1)} w_{ij}^{(l)} + w_{0j}^{(l)}$
Example: build a model to estimate a test score based on sleep and study hours

Example     Sleep hours   Study hours   Target output
Student 1   0.3           1             75/100
Student 2   0.5           0.2           82/100
Student 3   1             0.4           93/100
Student X   8             3             ?

Two features: sleep hours ($x_1$) and study hours ($x_2$).

$X = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{bmatrix} = \begin{bmatrix} 0.3 & 1 \\ 0.5 & 0.2 \\ 1 & 0.4 \end{bmatrix}, \qquad y = \begin{bmatrix} 0.75 \\ 0.82 \\ 0.93 \end{bmatrix}$

import numpy as np
X = np.array([[0.3, 1],
              [0.5, 0.2],
              [1, 0.4]])
y = np.array([[0.75],
              [0.82],
              [0.93]])
print(X.shape)   # (3, 2)
print(y.shape)   # (3, 1)
Example: build a model to estimate a test score based on sleep and study hours

Network: inputs (sleep hours $x_1$, study hours $x_2$) → hidden layer ($z^{(1)}, a^{(1)}$, 3 nodes) → output ($z^{(2)}$, $\hat{y}$)

$\boldsymbol{X} \in \mathbb{R}^{N \times 2}$
$\boldsymbol{W}^{(1)} \in \mathbb{R}^{2 \times 3}, \quad \boldsymbol{b}^{(1)} \in \mathbb{R}^{1 \times 3}$
$\boldsymbol{W}^{(2)} \in \mathbb{R}^{3 \times 1}, \quad b^{(2)} \in \mathbb{R}$

Forward propagation:
$\boldsymbol{z}^{(1)} = \boldsymbol{X}\boldsymbol{W}^{(1)} + \boldsymbol{b}^{(1)} \in \mathbb{R}^{N \times 3}$
$\boldsymbol{a}^{(1)} = g\left(\boldsymbol{z}^{(1)}\right) \in \mathbb{R}^{N \times 3}$
$\boldsymbol{z}^{(2)} = \boldsymbol{a}^{(1)}\boldsymbol{W}^{(2)} + b^{(2)} \in \mathbb{R}^{N \times 1}$
$\boldsymbol{a}^{(2)} = \hat{\boldsymbol{y}} = g\left(\boldsymbol{z}^{(2)}\right) \in \mathbb{R}^{N \times 1}$
Example: build a model to estimate a test score based on sleep and study hours

Code Symbol   Math Symbol   Definition                                Dimensions
X             $X$           Input data, each row is an example        (NumExamples, inputLayerSize)
y             $y$           Target data                               (NumExamples, outputLayerSize)
W1            $W^{(1)}$     Layer 1 weights                           (inputLayerSize, hiddenLayerSize)
b1            $b^{(1)}$     Layer 1 bias                              (1, hiddenLayerSize)
W2            $W^{(2)}$     Layer 2 weights                           (hiddenLayerSize, outputLayerSize)
b2            $b^{(2)}$     Layer 2 bias                              (1, outputLayerSize)
z1            $z^{(1)}$     Layer 1 weighted input (pre-activation)   (NumExamples, hiddenLayerSize)
a1            $a^{(1)}$     Layer 1 output (activation)               (NumExamples, hiddenLayerSize)
z2            $z^{(2)}$     Layer 2 weighted input (pre-activation)   (NumExamples, outputLayerSize)
a2            $a^{(2)}$     Layer 2 output (activation)               (NumExamples, outputLayerSize)
Example: build a model to estimate a test score based on sleep and study hours

import numpy as np

class NeuralNetwork:
    def __init__(self):
        self.inputLayerSize = 2
        self.hiddenLayerSize = 3
        self.outputLayerSize = 1
        self.W1 = np.random.randn(self.inputLayerSize, self.hiddenLayerSize)
        self.b1 = np.random.randn(1, self.hiddenLayerSize)
        self.W2 = np.random.randn(self.hiddenLayerSize, self.outputLayerSize)
        self.b2 = np.random.randn(1, self.outputLayerSize)

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def forwardPropagation(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1        # hidden layer weighted input
        self.a1 = self.sigmoid(self.z1)               # hidden layer activation
        self.z2 = np.dot(self.a1, self.W2) + self.b2  # output layer weighted input
        y_hat = self.sigmoid(self.z2)                 # output layer activation
        return y_hat
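A minimal usage sketch; since the weights are random, the printed values change between runs:

nn = NeuralNetwork()
y_hat = nn.forwardPropagation(X)   # X is the (3, 2) array defined earlier
print(y_hat)                       # shape (3, 1), each value in (0, 1)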
Example: build a model to estimate a test score based on sleep and study hours

With random weights, the forward pass produces an output such as:

[[0.5063543 ]
 [0.03394059]
 [0.06788117]]

These untrained predictions are far from the targets y = [0.75, 0.82, 0.93]; training must adjust the weights to close that gap.
Exercise 2

Build a model to estimate the house price based on the number of rooms, area, and floor:
Hidden layer: 4 nodes – Activation: sigmoid
Output layer: 1 node – Activation: linear
Weights: random
Compute the forward propagation for a single cycle to find the output.

Example   Rooms   Area   Floor   Price (k)
1         2       100    1       75
2         1       60     2       60
3         3       120    1       90
4         2       75     2       80
Loss Function

The loss function quantifies how much the model prediction $\hat{y} = f(x_i; w)$ deviates from the ground truth $y_i$ for one particular example (between predicted and actual values in a machine learning model); the lower the loss, the better the model fits that example.

For example, we very often use the squared error as the loss function in regression problems:

$\mathcal{L}_i^2 = \left(f(x_i; w) - y_i\right)^2$

Estimate of test score based on sleep and study hours:

y_hat = [[0.62404926],     y = [[0.75],
         [0.58412263],          [0.82],
         [0.57624213]]          [0.93]]

Example   $\hat{y}$   $y$    $\mathcal{L}^2$
1         0.624       0.75   0.015876
2         0.584       0.82   0.055696
3         0.576       0.93   0.125316
Another loss function we often use for regression is the absolute loss, also called the $\mathcal{L}_1$ loss:

$\mathcal{L}_i^1 = \left| f(x_i; w) - y_i \right|$

For instance, let's say that our model predicts a flat's price (in thousands of dollars) based on the number of rooms, area ($m^2$), floor, and the neighborhood in the city (A or B). Suppose its prediction for $x = [4, 70, 1, A]$ is USD 110k. If the actual selling price is USD 105k, then the absolute loss is:

$\mathcal{L}^1 = |\hat{y} - y| = |110 - 105| = 5$
The Huber loss combines both: quadratic for small errors, linear for large ones,

$\mathcal{L}_\delta\left(f(x_i;w) - y_i\right) = \begin{cases} \frac{1}{2}\left(f(x_i;w) - y_i\right)^2 & \text{if } \left|f(x_i;w) - y_i\right| < \delta \\ \delta\left(\left|f(x_i;w) - y_i\right| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$
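A direct NumPy translation of the piecewise definition, as a sketch (names are illustrative):

import numpy as np

def huber_loss(y_hat, y, delta=1.0):
    error = y_hat - y
    small = np.abs(error) < delta                    # quadratic region
    quadratic = 0.5 * error ** 2
    linear = delta * (np.abs(error) - 0.5 * delta)
    return np.where(small, quadratic, linear)

print(huber_loss(np.array([110.0]), np.array([105.0])))  # [4.5]: linear region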
Cross-Entropy Loss

The target variables are in binary format, and there are only two classes. If the probability of class 1 is $f(x_i;w)$, then that of class 2 is $1 - f(x_i;w)$. The cross-entropy loss for the actual label $y$ (which only takes the values 0 or 1) and the predicted probability is defined as:

$\mathcal{L} = -\left[ y \log f(x_i;w) + (1-y) \log\left(1 - f(x_i;w)\right) \right]$

If $y = 0$, then $\mathcal{L} = -\log\left(1 - f(x_i;w)\right)$; if $y = 1$, then $\mathcal{L} = -\log f(x_i;w)$.
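The same formula in NumPy, as a sketch (the small eps clip that guards against log(0) is an added safeguard, not from the slide):

import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)     # avoid log(0); added safeguard
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(binary_cross_entropy(1, 0.9))  # ~0.105: confident and correct -> small loss
print(binary_cross_entropy(1, 0.1))  # ~2.303: confident and wrong -> large loss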
Cost Function

The cost function measures the model's error on a group of examples, whereas the loss function deals with a single data instance. So, if $\mathcal{L}$ is our loss function, then we calculate the cost function by aggregating the loss over the training, validation, or test data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$. For example, we can compute the cost as the mean loss:

$J(\boldsymbol{W}) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\left(f(x_i; w), y_i\right)$
Cost Function

Mean Squared Error (MSE):

$J(\boldsymbol{W}) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_i^2 = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{\boldsymbol{y}}_i - \boldsymbol{y}_i \right)^2$

Mean Absolute Error (MAE):

$J(\boldsymbol{W}) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_i^1 = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{\boldsymbol{y}}_i - \boldsymbol{y}_i \right|$
Cost Function

Example: compute the MSE cost function. Let's say that we have the data on four flats and that our model predicted the sale prices as follows:

$x_i$   Rooms   Area   Floor   Neighborhood   $y_i$   $\hat{y}_i$
$x_1$   4       70     1       A              105     104.5
$x_2$   2       50     2       A              83      91
$x_3$   1       30     5       B              50      65.3
$x_4$   5       90     2       A              200     114

Solution:

$J = \frac{(104.5 - 105)^2 + (91 - 83)^2 + (65.3 - 50)^2 + (114 - 200)^2}{4} = \frac{0.5^2 + 8^2 + 15.3^2 + 86^2}{4} = 1923.585$
Cost Function

import tensorflow as tf
actual_values = tf.constant([105, 83, 50, 200], dtype=tf.float32)
predicted_values = tf.constant([104.5, 91, 65.3, 114], dtype=tf.float32)
mse = tf.keras.losses.mse(actual_values, predicted_values)
print(mse.numpy())   # 1923.585: one scalar, the mean over all 4 values

VS

import tensorflow as tf
import numpy as np
actual_values = np.array([[105], [83], [50], [200]])
predicted_values = np.array([[104.5], [91], [65.3], [114]])
mse = tf.keras.losses.mse(actual_values, predicted_values)
print(mse.numpy())   # [0.25 64. 234.09 7396.]: one value per row

tf.keras.losses.mse averages over the last axis, so the tensor shape determines whether you get a single cost or a per-example loss.
Loss Optimization

We want the weights that minimize the cost:

$\boldsymbol{W}^* = \operatorname{argmin}_{\boldsymbol{W}} J(\boldsymbol{W}) = \operatorname{argmin}_{\boldsymbol{W}} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\left(f(\boldsymbol{x}_i; \boldsymbol{W}), \boldsymbol{y}_i\right)$

▪ Update weights $\boldsymbol{W} \leftarrow \boldsymbol{W} - \eta \frac{\partial J(\boldsymbol{W})}{\partial \boldsymbol{W}}$ (where $\eta$ is the learning rate)
Gradient Descent

▪ Update weights $\boldsymbol{W} \leftarrow \boldsymbol{W} - \eta \frac{\partial J(\boldsymbol{W})}{\partial \boldsymbol{W}}$

For a scalar parameter the same rule reads:

$x_{new} = x_{cur} - \eta \frac{\partial f(x)}{\partial x}$
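A sketch of this update rule on a simple function, $f(x) = (x - 3)^2$ (an illustrative choice, not from the slide):

# Minimize f(x) = (x - 3)^2, whose gradient is f'(x) = 2(x - 3)
eta = 0.1        # learning rate
x = 0.0          # initial guess

for step in range(100):
    grad = 2 * (x - 3)    # df/dx at the current point
    x = x - eta * grad    # x_new = x_cur - eta * df/dx

print(x)  # approaches the minimum at x = 3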
Chain Rule

If $z$ depends on $y$, which depends on $x$:

$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x}$

[Diagram: $x \rightarrow y \rightarrow z$]

When $x$ influences $z$ through several intermediate variables $y_1, \ldots, y_n$ (multiple paths), the contributions add up:

$\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i} \cdot \frac{\partial y_i}{\partial x}$
Gradients with respect to the weights and biases of layer $l$ (with $i = 1 \ldots d^{(l-1)}$, $j = 1 \ldots d^{(l)}$):

$\frac{\partial J}{\partial w_{ij}^{(l)}} = \frac{\partial J}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}}, \qquad \frac{\partial J}{\partial b_j^{(l)}} = \frac{\partial J}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial b_j^{(l)}} = \frac{\partial J}{\partial z_j^{(l)}}$

It should be noted that the current layer error and current layer input are

$e_j^{(l)} \triangleq \frac{\partial J}{\partial z_j^{(l)}}, \qquad \frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}} = a_i^{(l-1)}$

Then $\frac{\partial J}{\partial w_{ij}^{(l)}} = e_j^{(l)} \, a_i^{(l-1)}$
Current layer error for multi-path (node $i$ of layer $l-1$ feeds all $d^{(l)}$ nodes of layer $l$):

$e_i^{(l-1)} = \left( \sum_{j=1}^{d^{(l)}} e_j^{(l)} \, w_{ij}^{(l)} \right) \frac{\partial g}{\partial z_i^{(l-1)}}$
Example

[Diagram: a 1–2–2–1 network. Input $x$ feeds two ReLU units in the first hidden layer ($z_1, a_1$ with weight $w_1$, bias $b_1$; $z_2, a_2$ with weight $w_2$, bias $b_2$). These feed two ReLU units in the second hidden layer ($z_3, a_3$ with weights $w_3, w_5$, bias $b_3$; $z_4, a_4$ with weights $w_4, w_6$, bias $b_4$). A linear output unit ($z_5$ with weights $w_7, w_8$, bias $b_5$) produces $\hat{y}$. RL = ReLU, LN = linear.]
Example

Loss function: $L = (\hat{y} - y)^2$

Derivative of the loss function: $\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y)$

Forward propagation:
$z_1 = x w_1 + b_1 \qquad z_3 = a_1 w_3 + a_2 w_5 + b_3 \qquad z_5 = a_3 w_7 + a_4 w_8 + b_5$
$z_2 = x w_2 + b_2 \qquad z_4 = a_1 w_4 + a_2 w_6 + b_4 \qquad \hat{y} = z_5$
$a_1 = g(z_1) \qquad\;\; a_3 = g(z_3)$
$a_2 = g(z_2) \qquad\;\; a_4 = g(z_4)$
Example — Output layer

Back propagation requires all of the gradients

$\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \frac{\partial L}{\partial w_3}, \frac{\partial L}{\partial w_4}, \frac{\partial L}{\partial w_5}, \frac{\partial L}{\partial w_6}, \frac{\partial L}{\partial w_7}, \frac{\partial L}{\partial w_8}$ and $\frac{\partial L}{\partial b_1}, \frac{\partial L}{\partial b_2}, \frac{\partial L}{\partial b_3}, \frac{\partial L}{\partial b_4}, \frac{\partial L}{\partial b_5}$

Start with the output layer: $\frac{\partial L}{\partial w_7}, \frac{\partial L}{\partial w_8}, \frac{\partial L}{\partial b_5}$. It can be seen that $z_5$ is a function of $w_7$ and $w_8$, so applying the chain rule we obtain:

$\frac{\partial L}{\partial w_7} = \frac{\partial L}{\partial z_5} \cdot \frac{\partial z_5}{\partial w_7}$
Example — Output layer: $\frac{\partial L}{\partial w_7} = \frac{\partial L}{\partial z_5} \cdot \frac{\partial z_5}{\partial w_7}$

Output layer error: $e^{(L)} \triangleq \frac{\partial L}{\partial z_5} \equiv \frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y)$, since $\hat{y} = z_5$

Current layer input w.r.t. $w_7$: $\frac{\partial z_5}{\partial w_7} = a_3$, recalling $z_5 = a_3 w_7 + a_4 w_8 + b_5$

Therefore: $\frac{\partial L}{\partial w_7} = e^{(L)} \cdot a_3$
Example — Output layer: $\frac{\partial L}{\partial w_8} = \frac{\partial L}{\partial z_5} \cdot \frac{\partial z_5}{\partial w_8}$

With the same output layer error $e^{(L)} = 2(\hat{y} - y)$ and current layer input $\frac{\partial z_5}{\partial w_8} = a_4$:

$\frac{\partial L}{\partial w_8} = e^{(L)} \cdot a_4$
Example — Output layer: $\frac{\partial L}{\partial b_5} = \frac{\partial L}{\partial z_5} \cdot \frac{\partial z_5}{\partial b_5}$

Since $\frac{\partial z_5}{\partial b_5} = 1$:

$\frac{\partial L}{\partial b_5} = e^{(L)}$
Example — 2nd hidden layer

Current layer error:

$e_3^{(2)} \triangleq \frac{\partial L}{\partial z_3} = \frac{\partial L}{\partial a_3} \cdot \frac{\partial a_3}{\partial z_3} = \frac{\partial L}{\partial z_5} \cdot \frac{\partial z_5}{\partial a_3} \cdot \frac{\partial a_3}{\partial z_3} = e^{(L)} \cdot w_7 \cdot \frac{\partial g(z_3)}{\partial z_3}$

Next layer error ($e^{(L)}$) × next layer weight ($w_7$, from $z_5 = a_3 w_7 + a_4 w_8 + b_5$) × derivative of the activation function.

Similarly, $e_4^{(2)} = e^{(L)} \cdot w_8 \cdot \frac{\partial g(z_4)}{\partial z_4}$

Recall that:
$z_5 = a_3 w_7 + a_4 w_8 + b_5$
$z_3 = a_1 w_3 + a_2 w_5 + b_3$
$z_4 = a_1 w_4 + a_2 w_6 + b_4$
Example — 1st hidden layer

$a_1$ influences the loss through two paths ($z_3$ and $z_4$), so the contributions add up:

$\frac{\partial L}{\partial a_1} = \frac{\partial L}{\partial z_5} \frac{\partial z_5}{\partial a_3} \frac{\partial a_3}{\partial z_3} \frac{\partial z_3}{\partial a_1} + \frac{\partial L}{\partial z_5} \frac{\partial z_5}{\partial a_4} \frac{\partial a_4}{\partial z_4} \frac{\partial z_4}{\partial a_1} = e_3^{(2)} w_3 + e_4^{(2)} w_4$

where $e_3^{(2)} \triangleq \frac{\partial L}{\partial z_3}$ and $e_4^{(2)} \triangleq \frac{\partial L}{\partial z_4}$ are the current layer errors.

Matrix calculus: for $\boldsymbol{z} = \boldsymbol{A}\boldsymbol{x}$,

$\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}} = \boldsymbol{A}^T$
If $\boldsymbol{w}$ is a function of $\boldsymbol{z}$, which is a function of $\boldsymbol{y}$, which is a function of $\boldsymbol{x}$, we can obtain this chain rule:

$\frac{\partial \boldsymbol{w}}{\partial \boldsymbol{x}} = \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{y}} \frac{\partial \boldsymbol{w}}{\partial \boldsymbol{z}}$

For more understanding, refer to matrix calculus documentation.
Vectorized back propagation:

Step 1: Execute forward propagation and store each layer output $\boldsymbol{a}^{(l)} \in \mathbb{R}^{N \times d^{(l)}}$

Step 2: Compute the derivative of the loss function $\frac{\partial J}{\partial \boldsymbol{a}^{(L)}} = \frac{\partial J}{\partial \hat{\boldsymbol{y}}} \in \mathbb{R}^{N \times d^{(L)}}$

Step 3: Compute the derivative of the activation function $\frac{\partial \boldsymbol{a}^{(L)}}{\partial \boldsymbol{z}^{(L)}} = \frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{z}^{(L)}} \in \mathbb{R}^{N \times d^{(L)}}$

Step 4: Compute the output layer error $\boldsymbol{e}^{(L)} \triangleq \frac{\partial J}{\partial \boldsymbol{z}^{(L)}} \in \mathbb{R}^{N \times d^{(L)}}$, the Hadamard product $\frac{\partial J}{\partial \hat{\boldsymbol{y}}} \odot \frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{z}^{(L)}}$

Step 5: Compute the gradient of the weights $\frac{\partial J}{\partial \boldsymbol{W}^{(L)}} = \left(\boldsymbol{a}^{(L-1)}\right)^T \boldsymbol{e}^{(L)} \in \mathbb{R}^{d^{(L-1)} \times d^{(L)}}$

Step 6: Compute the gradient of the bias $\frac{\partial J}{\partial \boldsymbol{b}^{(L)}} = \sum_{i=1}^{N} \frac{\partial J}{\partial \hat{\boldsymbol{y}}} \odot \frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{z}^{(L)}} \in \mathbb{R}^{1 \times d^{(L)}}$ (sum over the $N$ examples)

Step 7: Repeat steps 5–6 for all other layers, e.g. $\frac{\partial J}{\partial \boldsymbol{b}^{(l)}} = \sum_{i=1}^{N} \frac{\partial J}{\partial \boldsymbol{a}^{(l)}} \odot \frac{\partial \boldsymbol{a}^{(l)}}{\partial \boldsymbol{z}^{(l)}} \in \mathbb{R}^{1 \times d^{(l)}}$

Update: $\boldsymbol{W}^{(l)} \leftarrow \boldsymbol{W}^{(l)} - \eta \frac{\partial J}{\partial \boldsymbol{W}^{(l)}}$ and $\boldsymbol{b}^{(l)} \leftarrow \boldsymbol{b}^{(l)} - \eta \frac{\partial J}{\partial \boldsymbol{b}^{(l)}}$ for $l = 1 \ldots L$
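A minimal sketch of steps 2–6 for a single output layer, assuming a linear activation and squared-error loss (the function and variable names are illustrative, not from the slides):

import numpy as np

def backprop_output_layer(a_prev, W, b, y):
    z = a_prev @ W + b                    # forward: z^(L) = a^(L-1) W + b
    y_hat = z                             # linear activation: a^(L) = z^(L)
    dJ_dy_hat = 2 * (y_hat - y)           # Step 2: dJ/da^(L)
    dy_hat_dz = np.ones_like(z)           # Step 3: da^(L)/dz^(L) = 1 for linear g
    e = dJ_dy_hat * dy_hat_dz             # Step 4: Hadamard product -> e^(L)
    dJ_dW = a_prev.T @ e                  # Step 5: gradient of the weights
    dJ_db = e.sum(axis=0, keepdims=True)  # Step 6: gradient of the bias (sum over N)
    return dJ_dW, dJ_db, e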
Example: build a model to estimate a test score based on sleep and study hours

Same network as before, now with the loss $L = (\hat{y} - y)^2$:

$\boldsymbol{X} \in \mathbb{R}^{N \times 2}, \quad \boldsymbol{W}^{(1)} \in \mathbb{R}^{2 \times 3}, \quad \boldsymbol{b}^{(1)} \in \mathbb{R}^{1 \times 3}, \quad \boldsymbol{W}^{(2)} \in \mathbb{R}^{3 \times 1}, \quad b^{(2)} \in \mathbb{R}$

$\boldsymbol{z}^{(1)} = \boldsymbol{X}\boldsymbol{W}^{(1)} + \boldsymbol{b}^{(1)} \in \mathbb{R}^{N \times 3}$
$\boldsymbol{a}^{(1)} = g_1\left(\boldsymbol{z}^{(1)}\right) \in \mathbb{R}^{N \times 3}$
$\boldsymbol{z}^{(2)} = \boldsymbol{a}^{(1)}\boldsymbol{W}^{(2)} + b^{(2)} \in \mathbb{R}^{N \times 1}$
$\boldsymbol{a}^{(2)} = \hat{\boldsymbol{y}} = g_2\left(\boldsymbol{z}^{(2)}\right) \in \mathbb{R}^{N \times 1}$

($g_1$: sigmoid in the hidden layer; $g_2$: linear in the output layer)
Example: build a model to estimate a test score based on sleep and study hours

import numpy as np

# Network dimensions
num_input_units = 2
num_hidden_units = 3
num_output_units = 1

# Input features (sleep hours, study hours)
X = np.array([[0.3, 1],
              [0.5, 0.2],
              [1, 0.4]])

# Target outputs (normalized test scores)
y = np.array([[0.75],
              [0.82],
              [0.93]])

# Initialize random weights and biases
W_input_hidden = np.random.rand(num_input_units, num_hidden_units)
b_hidden = np.random.rand(1, num_hidden_units)
W_hidden_output = np.random.rand(num_hidden_units, num_output_units)
b_output = np.random.rand(1, num_output_units)

# Learning rate
learning_rate = 0.1
Example: build a model to estimate a test score based on sleep and study hours

The back propagation equations implemented in the loop (output layer error first, then the hidden layer):

$\frac{\partial J}{\partial \hat{y}} = \frac{\partial J}{\partial a^{(2)}} = 2(\hat{y} - y)$
$\frac{\partial J}{\partial z^{(2)}} = \frac{\partial J}{\partial a^{(2)}}$ (linear output activation)
$\frac{\partial J}{\partial \boldsymbol{W}^{(2)}} = \left(\boldsymbol{a}^{(1)}\right)^T \frac{\partial J}{\partial z^{(2)}}$
$\frac{\partial J}{\partial b^{(2)}} = \sum_{i=1}^{N} \frac{\partial J}{\partial z^{(2)}}$
$\frac{\partial J}{\partial \boldsymbol{a}^{(1)}} = \frac{\partial J}{\partial z^{(2)}} \left(\boldsymbol{W}^{(2)}\right)^T$
$\frac{\partial J}{\partial \boldsymbol{z}^{(1)}} = \frac{\partial J}{\partial \boldsymbol{a}^{(1)}} \odot \boldsymbol{a}^{(1)} \odot \left(1 - \boldsymbol{a}^{(1)}\right)$ (sigmoid derivative)
$\frac{\partial J}{\partial \boldsymbol{W}^{(1)}} = \boldsymbol{X}^T \frac{\partial J}{\partial \boldsymbol{z}^{(1)}}$

for iteration in range(1, 1001):
    # Forward propagation
    z_hidden = np.dot(X, W_input_hidden) + b_hidden
    a_hidden = 1 / (1 + np.exp(-z_hidden))               # sigmoid
    z_output = np.dot(a_hidden, W_hidden_output) + b_output
    a_output = z_output                                  # linear activation for regression

    # Calculate cost (MSE)
    cost = np.mean((a_output - y) ** 2)

    # Backpropagation
    dJ_da_output = 2 * (a_output - y)
    dJ_dz_output = dJ_da_output                          # linear activation derivative
    dJ_dW_hidden_output = a_hidden.T.dot(dJ_dz_output)
    dJ_db_output = np.sum(dJ_dz_output, axis=0, keepdims=True)
    dJ_da_hidden = dJ_dz_output.dot(W_hidden_output.T)
    dJ_dz_hidden = dJ_da_hidden * a_hidden * (1 - a_hidden)  # sigmoid derivative
    dJ_dW_input_hidden = X.T.dot(dJ_dz_hidden)
    dJ_db_hidden = np.sum(dJ_dz_hidden, axis=0, keepdims=True)

    # Update weights and biases
    W_hidden_output -= learning_rate * dJ_dW_hidden_output
    b_output -= learning_rate * dJ_db_output
    W_input_hidden -= learning_rate * dJ_dW_input_hidden
    b_hidden -= learning_rate * dJ_db_hidden
    # Log cost every 100 iterations
    if iteration % 100 == 0:
        print(f"Iteration {iteration}, Cost: {cost:.4f}")

print("Training complete!")
Exercise 3

Build a model to estimate the house price based on the number of rooms, area, and floor:
Hidden layer: 4 nodes – Activation: sigmoid
Output layer: 1 node – Activation: linear          (Do not use TensorFlow)
Weights: random – Cost function: MSE
Build the model and train for 100 iterations; plot the cost and the weights.

Example   Rooms   Area   Floor   Price (k)
1         2       100    1       75
2         1       60     2       60
3         3       120    1       90
4         2       75     2       80
Epochs vs Iterations

▪ An epoch is one full pass of the algorithm over the entire dataset. For example, if we set epochs = 10, the algorithm will scan the entire dataset 10 times.
▪ An iteration is one parameter update, i.e. one batch passed through the algorithm. If the dataset is split into 5 batches per epoch, there will be 5 iterations per epoch (see the sketch below).
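A quick arithmetic sketch with illustrative numbers (not from the slides):

# Illustrative numbers: 1000 examples, batch size 200, 5 epochs
num_examples = 1000
batch_size = 200
epochs = 5
iterations_per_epoch = num_examples // batch_size   # 5 updates per epoch
total_iterations = iterations_per_epoch * epochs    # 25 updates in total
print(iterations_per_epoch, total_iterations)       # 5 25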
Learning Rate

▪ A small learning rate converges slowly and can get stuck in false local minima.
▪ A large learning rate can overshoot the minimum and become unstable.
Batch Gradient Descent

▪ Batch Gradient Descent involves calculations over the full training set at each step.

Algorithm
1. Initialize weights randomly
2. Loop until convergence:
3.     Compute the gradients $\frac{\partial J(\boldsymbol{W},\boldsymbol{b})}{\partial \boldsymbol{W}}, \frac{\partial J(\boldsymbol{W},\boldsymbol{b})}{\partial \boldsymbol{b}}$
4.     Update weights and bias: $\boldsymbol{W} \leftarrow \boldsymbol{W} - \eta \frac{\partial J(\boldsymbol{W},\boldsymbol{b})}{\partial \boldsymbol{W}}$, $\boldsymbol{b} \leftarrow \boldsymbol{b} - \eta \frac{\partial J(\boldsymbol{W},\boldsymbol{b})}{\partial \boldsymbol{b}}$
Stochastic Gradient Descent (SGD)

Algorithm
1. Initialize weights randomly
2. Loop until convergence:
3.     Shuffle the training dataset, pick a single data point $i$
4.     Iterate over each training example
5.         Compute the gradient of the cost function for that example: $\frac{\partial J_i(\boldsymbol{W},\boldsymbol{b})}{\partial \boldsymbol{W}}, \frac{\partial J_i(\boldsymbol{W},\boldsymbol{b})}{\partial \boldsymbol{b}}$
6.         Update weights and bias: $\boldsymbol{W} \leftarrow \boldsymbol{W} - \eta \frac{\partial J_i(\boldsymbol{W},\boldsymbol{b})}{\partial \boldsymbol{W}}$, $\boldsymbol{b} \leftarrow \boldsymbol{b} - \eta \frac{\partial J_i(\boldsymbol{W},\boldsymbol{b})}{\partial \boldsymbol{b}}$
Mini-batch Gradient Descent

▪ Parameters are updated after computing the gradient of the error with respect to a subset of the training set.

Algorithm (a NumPy sketch follows below)
1. Initialize weights randomly
2. Loop until convergence:
3.     Pick a batch of examples to train on
4.     Compute the gradient of the cost function over the batch: $\frac{\partial J(\boldsymbol{W},\boldsymbol{b})}{\partial \boldsymbol{W}}, \frac{\partial J(\boldsymbol{W},\boldsymbol{b})}{\partial \boldsymbol{b}}$
5.     Update weights and bias: $\boldsymbol{W} \leftarrow \boldsymbol{W} - \eta \frac{\partial J(\boldsymbol{W},\boldsymbol{b})}{\partial \boldsymbol{W}}$, $\boldsymbol{b} \leftarrow \boldsymbol{b} - \eta \frac{\partial J(\boldsymbol{W},\boldsymbol{b})}{\partial \boldsymbol{b}}$
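A self-contained sketch of the mini-batch loop on a toy linear-regression problem (the data and the numbers are illustrative, not from the slides):

import numpy as np

# Toy data: y = X w_true, with w_true = [1, 2, 3]
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([[1.0], [2.0], [3.0]])

w = np.zeros((3, 1))
eta, batch_size = 0.1, 20

for epoch in range(200):
    indices = rng.permutation(len(X))                   # shuffle each epoch
    for start in range(0, len(X), batch_size):          # 100/20 = 5 iterations/epoch
        batch = indices[start:start + batch_size]       # pick a batch of examples
        X_b, y_b = X[batch], y[batch]
        grad = 2 * X_b.T @ (X_b @ w - y_b) / len(X_b)   # MSE gradient on the batch
        w -= eta * grad                                 # update after every batch

print(w.ravel())  # approaches [1. 2. 3.]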
Adagrad scales the learning rate of each parameter by the accumulated squared gradients $G_t$:

$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t \qquad \text{where } g_t \triangleq \frac{\partial J}{\partial \theta_t}$
RMSProp replaces the accumulated sum with an exponentially decaying average of squared gradients $E[g^2]_t$:

$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$
Adam tracks both a first and a second moment of the gradient:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

where $m_t$ is the first moment and $v_t$ is the second moment. After bias correction

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

the parameters are updated as

$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$
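A sketch of these updates for a single parameter, again on the illustrative objective $f(\theta) = (\theta - 3)^2$ and with the usual default hyper-parameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$); none of these numbers are from the slides:

import numpy as np

theta, m, v = 0.0, 0.0, 0.0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = 2 * (theta - 3)                       # gradient of f(theta) = (theta - 3)^2
    m = beta1 * m + (1 - beta1) * g           # first moment
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # approaches the minimum at theta = 3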
Optimizer

[Figure]
Exercise 4

Use the requirements described in Ex. 3. Change the learning rate and plot the cost function corresponding to each value of the learning rate.
Problem of Underfitting

▪ A statistical model is said to underfit when it cannot capture the underlying trend of the data.
▪ The model does not make accurate predictions on testing data.
▪ This case is known as high bias.

Reasons:
• The size of the training dataset is not large enough.
• The model is too simple.
• The training data is not cleaned and contains noise.
Problem of Overfitting

▪ A statistical model is said to overfit when it performs well on training data but poorly on testing data.
▪ This case is known as high variance.

Reasons:
• The size of the training dataset is not large enough.
• The model is too complex.
Dropout Regularization

During training, dropout randomly sets each activation to zero with the given probability (here 0.5), which prevents the network from relying on any single node:

tf.keras.layers.Dropout(rate=0.5)
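A sketch of where the layer sits in a Keras model (the layer sizes are illustrative); Keras applies dropout during training and switches it off automatically at inference:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(rate=0.5),   # randomly zero 50% of activations during training
    tf.keras.layers.Dense(1)
])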
Exercise 5

Build a model to estimate the house price based on the number of rooms, area, and floor:
Hidden layer: 4 nodes – Activation: sigmoid
Output layer: 1 node                               (Use TensorFlow >> Keras API)
Weights: random – Cost function: MSE
Build the model and train for 100 iterations. Compute the estimated price for the 5th example.

Example   Rooms   Area   Floor   Price (k)
1         2       100    1       75
2         1       60     2       60
3         3       120    1       90
4         2       75     2       80
5         1       45     1       ?
Exercise 6

Predict the compressive strength of concrete manufactured according to various recipes. The dataset is taken from BKEL.
Build a model with two hidden layers of 32 nodes each and an output layer with a single node. The hidden layers use the ReLU activation function.