Deep Learning for Engineering Students

Lecture 2

Introduction to Deep Learning

1
Introduction to Deep Learning Lecture 2

The Perceptron: Forward Propagation


Inputs x1, x2, …, xm are multiplied by weights w1, w2, …, wm, summed, and passed
through a non-linear activation function g to produce the output:

    ŷ = g( Σ_{i=1}^{m} x_i · w_i )

Pipeline: Input → Weights → Sum → Non-linearity → Output

HCM City Univ. of Technology, Faculty of Mechanical Engineering 2 Duong Van Tu


Introduction to Deep Learning Lecture 2

The Perceptron: Forward Propagation

Adding a bias term w0 shifts the activation threshold:

    ŷ = g( w0 + Σ_{i=1}^{m} x_i · w_i )

Pipeline: Input → Weights → Sum → Non-linearity → Output

HCM City Univ. of Technology, Faculty of Mechanical Engineering 3 Duong Van Tu


Introduction to Deep Learning Lecture 2

The Perceptron: Forward Propagation

In vector form, with bias b:

    ŷ = g( b + Σ_{i=1}^{m} x_i · w_i ) = g( XW + b )

where X = [x1  x2  …  xm] is the row vector of inputs and
W = [w1; w2; …; wm] is the column vector of weights.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 4 Duong Van Tu


Introduction to Deep Learning Lecture 2

The Perceptron: Forward Propagation

    ŷ = g( XW + b )

For example, the sigmoid activation function:

    g(z) = σ(z) = 1 / (1 + e^(−z))


import numpy as np
from matplotlib import pyplot as plt
x = np.linspace(-5,5,100)
y = 1/(1 + np.exp(-x))
plt.plot(x,y)
plt.grid(True, color = 'k')
plt.show()
HCM City Univ. of Technology, Faculty of Mechanical Engineering 5 Duong Van Tu
Introduction to Deep Learning Lecture 2

Common Activation Function

Sigmoid:                        g(z) = 1 / (1 + e^(−z))                    g'(z) = g(z)·(1 − g(z))
Hyperbolic Tangent (tanh):      g(z) = (e^z − e^(−z)) / (e^z + e^(−z))     g'(z) = 1 − g(z)^2
Rectified Linear Unit (ReLU):   g(z) = max(0, z)                           g'(z) = 1 if z > 0, 0 otherwise

TensorFlow:  tf.math.sigmoid(z)    tf.math.tanh(z)    tf.nn.relu(z)

HCM City Univ. of Technology, Faculty of Mechanical Engineering 6 Duong Van Tu


Introduction to Deep Learning Lecture 2

Common Activation Function

Sigmoid, Hyperbolic Tangent, and ReLU can all be plotted with the same script, changing only the activation call:

import numpy as np
from matplotlib import pyplot as plt
import tensorflow as tf
x = np.linspace(-5, 5, 100)
y = tf.math.sigmoid(x)   # or tf.math.tanh(x), or tf.nn.relu(x)
plt.plot(x, y)
plt.grid(True, color='k')
plt.show()

HCM City Univ. of Technology, Faculty of Mechanical Engineering 7 Duong Van Tu


Introduction to Deep Learning Lecture 2

Importance of Activation Function

Linear activation functions can only produce linear decision boundaries, whereas
non-linear activation functions allow the network to approximate arbitrarily complex functions.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 8 Duong Van Tu


Introduction to Deep Learning Lecture 2

The Perceptron: Example

With weights W = [3; −2] and bias w0 = 1, the perceptron computes

    ŷ = g( XW + w0 ) = g( 1 + 3·x1 − 2·x2 )

The argument 1 + 3·x1 − 2·x2 = 0 is just a line in 2D.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 9 Duong Van Tu


Introduction to Deep Learning Lecture 2

The Perceptron: Example

    ŷ = g( 1 + 3·x1 − 2·x2 )

HCM City Univ. of Technology, Faculty of Mechanical Engineering 10 Duong Van Tu


Introduction to Deep Learning Lecture 2

The Perceptron: Example

Assume the input is X = [−1  2]. Then

    ŷ = g( 1 + 3·(−1) − 2·2 ) = g(−6) ≈ 0.0025

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

argument = -6
sigmoid_value = sigmoid(argument)
print(f"The sigmoid value at x = {argument} is {sigmoid_value:.4f}")
HCM City Univ. of Technology, Faculty of Mechanical Engineering 11 Duong Van Tu
Introduction to Deep Learning Lecture 2

The Perceptron: Simplified

    ŷ = g( XW + w0 )

Pipeline: Input → Weights → Sum → Non-linearity → Output


HCM City Univ. of Technology, Faculty of Mechanical Engineering 12 Duong Van Tu
Introduction to Deep Learning Lecture 2

The Perceptron: Simplified

    z = w0 + Σ_{i=1}^{m} x_i · w_i
    ŷ = g(z)

HCM City Univ. of Technology, Faculty of Mechanical Engineering 13 Duong Van Tu


Introduction to Deep Learning Lecture 2

Multi Output Perceptron

Each output j has its own weights and bias:

    z_j = Σ_{i=1}^{m} x_i · w_ij + w_0j,      ŷ_j = g(z_j)

import tensorflow as tf
layer = tf.keras.layers.Dense(units = 2)

Because all inputs are densely connected to all outputs, these layers are called
Dense layers

HCM City Univ. of Technology, Faculty of Mechanical Engineering 14 Duong Van Tu
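As a minimal illustration (not from the slides), a dense layer like the one above can be written directly in NumPy; the weight matrix W and bias b below are arbitrary example values:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One input example with m = 3 features
x = np.array([1.0, -2.0, 0.5])

# Dense layer with 2 output units: W has shape (m, 2), b has shape (2,)
W = np.array([[0.2, -0.4],
              [0.7,  0.1],
              [-0.3, 0.5]])
b = np.array([0.1, -0.2])

z = x @ W + b          # z_j = sum_i x_i * w_ij + w_0j
y_hat = sigmoid(z)     # one activation per output unit
print(z, y_hat)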


Introduction to Deep Learning Lecture 2

Single Layer Neural Network: Forward Propagation

Hidden layer (weights w(1)) and output layer (weights w(2)):

    z_j^(1) = Σ_{i=1}^{m} x_i · w_ij^(1) + w_0j^(1)

    ŷ_k = g( Σ_{j=1}^{n} g(z_j^(1)) · w_jk^(2) + w_0k^(2) )

Structure: Inputs → Hidden → Outputs

import tensorflow as tf
n = 10
model = tf.keras.Sequential([
    tf.keras.layers.Dense(n),
    tf.keras.layers.Dense(2)
])
HCM City Univ. of Technology, Faculty of Mechanical Engineering 15 Duong Van Tu
Introduction to Deep Learning Lecture 2

Single Layer Neural Network: Example

The pre-activation of the second hidden unit:

    z_2^(1) = Σ_{i=1}^{m} x_i · w_i2^(1) + w_02^(1)
            = x1·w_12^(1) + x2·w_22^(1) + … + xm·w_m2^(1) + w_02^(1)

HCM City Univ. of Technology, Faculty of Mechanical Engineering 16 Duong Van Tu


Introduction to Deep Learning Lecture 2

Single Layer Neural Network: Example

Matrix dimensions for the forward pass:

    X   : (input size)
    W_H : (input size × hidden size)
    H   : (hidden size)
    W_O : (hidden size × output size)
    O   : (output size)

HCM City Univ. of Technology, Faculty of Mechanical Engineering 17 Duong Van Tu


Introduction to Deep Learning Lecture 2

Single Layer Neural Network: Example


A single neuron with two inputs x1, x2, weights w1, w2, a bias w0 (bias input fixed at 1), and a sigmoid activation.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 18 Duong Van Tu


Introduction to Deep Learning Lecture 2

Single Layer Neural Network: Example

import matplotlib.pyplot as plt


import numpy as np
np.random.seed(42)
cov = 0.2
cov2 = 0.5
num_points = 250
cluster1 = np.random.multivariate_normal([1.6, 1], [[cov, 0], [0, cov]], num_points // 5)
cluster2 = np.random.multivariate_normal([1, 4], [[cov2, 0], [0, cov2]], num_points // 5)
cluster3 = np.random.multivariate_normal([3, 4.5], [[cov2, 0], [0, cov2]], num_points // 5)
cluster4 = np.random.multivariate_normal([1.5, 6], [[cov2, 0], [0, cov2]], num_points // 5)
cluster5 = np.random.multivariate_normal([4, 2], [[cov2, 0], [0, cov2]], num_points // 5)
plt.scatter(cluster1[:, 0], cluster1[:, 1], color='red', label='Cluster 1')
plt.scatter(cluster2[:, 0], cluster2[:, 1], color='green', label='Cluster 2')
plt.scatter(cluster3[:, 0], cluster3[:, 1], color='green', label='Cluster 3')
plt.scatter(cluster4[:, 0], cluster4[:, 1], color='green', label='Cluster 4')
plt.scatter(cluster5[:, 0], cluster5[:, 1], color='green', label='Cluster 5')
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.grid(True)
plt.show()

HCM City Univ. of Technology, Faculty of Mechanical Engineering 19 Duong Van Tu


Introduction to Deep Learning Lecture 2

Single Layer Neural Network: Example

A neuron with weights (−4, −1) and bias 12 implements the decision boundary

    −4·x1 − x2 + 12 = 0

HCM City Univ. of Technology, Faculty of Mechanical Engineering 20 Duong Van Tu


Introduction to Deep Learning Lecture 2

Single Layer Neural Network: Example


import matplotlib.pyplot as plt
import numpy as np
np.random.seed(42)
cov = 0.2
cov2 = 0.5
num_points = 250
cluster1 = np.random.multivariate_normal([1.6, 1], [[cov, 0], [0, cov]], num_points // 5)
cluster2 = np.random.multivariate_normal([1, 4], [[cov2, 0], [0, cov2]], num_points // 5)
cluster3 = np.random.multivariate_normal([3, 4.5], [[cov2, 0], [0, cov2]], num_points // 5)
cluster4 = np.random.multivariate_normal([1.5, 6], [[cov2, 0], [0, cov2]], num_points // 5)
cluster5 = np.random.multivariate_normal([4, 2], [[cov2, 0], [0, cov2]], num_points // 5)
plt.scatter(cluster1[:, 0], cluster1[:, 1], color='red', label='Cluster 1')
plt.scatter(cluster2[:, 0], cluster2[:, 1], color='green', label='Cluster 2')
plt.scatter(cluster3[:, 0], cluster3[:, 1], color='green', label='Cluster 3')
plt.scatter(cluster4[:, 0], cluster4[:, 1], color='green', label='Cluster 4')
plt.scatter(cluster5[:, 0], cluster5[:, 1], color='green', label='Cluster 5')
x = np.linspace(0.5,3,100)
y = -4*x + 12
plt.plot(x,y, color='b')
plt.text(1.5, 8, r'$-4x_1 - x_2 + 12 = 0$')
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.grid(True)
plt.show()
HCM City Univ. of Technology, Faculty of Mechanical Engineering 21 Duong Van Tu
Introduction to Deep Learning Lecture 2

Single Layer Neural Network: Example

First neuron:  weights (−4, −1), bias 12   →  boundary  −4·x1 − x2 + 12 = 0
Second neuron: weights (−1/5, −1), bias 3  →  boundary  −(1/5)·x1 − x2 + 3 = 0
HCM City Univ. of Technology, Faculty of Mechanical Engineering 22 Duong Van Tu
Introduction to Deep Learning Lecture 2

Single Layer Neural Network: Example

    ŷ = g( g(z1)·1.5 + g(z2)·1 + 0.5 )

With input X = [2, 2]:
    z1 = −4·2 − 1·2 + 12 = 2        →  g(z1) ≈ 0.88
    z2 = −(1/5)·2 − 1·2 + 3 = 0.6   →  g(z2) ≈ 0.64
    z  = 0.88·1.5 + 0.64·1 + 0.5 ≈ 2.47  →  ŷ ≈ 0.92

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([2, 2])
w11 = np.array([-4, -1])
w12 = np.array([-1/5, -1])
w21 = np.array([1.5, 1])
bias11 = 12
bias12 = 3
bias21 = 0.5
z1 = np.dot(X, w11) + bias11
z2 = np.dot(X, w12) + bias12
g1 = sigmoid(z1)
print(g1)
g2 = sigmoid(z2)
print(g2)
g = np.array([g1, g2])
z = np.dot(g, w21) + bias21
y_hat = sigmoid(z)
print(y_hat)

HCM City Univ. of Technology, Faculty of Mechanical Engineering 23 Duong Van Tu


Introduction to Deep Learning Lecture 2

Single Layer Neural Network: Example

Input X = [2, 2]:   g(z1) ≈ 0.88,   g(z2) ≈ 0.64,   output ŷ ≈ 0.92
HCM City Univ. of Technology, Faculty of Mechanical Engineering 24 Duong Van Tu
Introduction to Deep Learning Lecture 2

Single Layer Neural Network: Example

Input X = [4, 6]:   g(z1) ≈ 4.54·10^−5,   g(z2) ≈ 0.02,   output ŷ ≈ 0.62
HCM City Univ. of Technology, Faculty of Mechanical Engineering 25 Duong Van Tu
Introduction to Deep Learning Lecture 2

Single Layer Neural Network: Example

Input X = [0, 6]:   g(z1) ≈ 1.00,   g(z2) ≈ 0.05,   output ŷ ≈ 0.89
HCM City Univ. of Technology, Faculty of Mechanical Engineering 26 Duong Van Tu
Introduction to Deep Learning Lecture 2

Exercise 1
Create a notebook with Google Colab (do not use TensorFlow) to compute the outputs of this
neural network with the given matrices:

           ⎡ 0.6   0.9   0.7  −0.1 ⎤
    w(1) = ⎢ 2.6   0.5  −0.1   0.7 ⎥
           ⎣−0.7   1.4   0.3  −1.2 ⎦

           ⎡ 0.2  −0.5 ⎤
    w(2) = ⎢−0.1   1.6 ⎥
           ⎢−0.5   1.4 ⎥
           ⎣−0.5  −0.1 ⎦

Input (the values shown at the input nodes): X = [2  3  1]
Hidden layer (4 nodes z1…z4): sigmoid
Output layer (2 nodes ŷ1, ŷ2): linear, g(z) = z

Structure: Inputs → Hidden → Outputs


HCM City Univ. of Technology, Faculty of Mechanical Engineering 27 Duong Van Tu
Introduction to Deep Learning Lecture 2

Deep Neural Network

For hidden layer l, each unit takes the activations of layer (l−1) as input:

    z_j^(l) = Σ_{i=1}^{d^(l−1)} g( z_i^(l−1) ) · w_ij^(l) + w_0j^(l)

TensorFlow:
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(n1),
    tf.keras.layers.Dense(n2),
    ...
    tf.keras.layers.Dense(2)
])

HCM City Univ. of Technology, Faculty of Mechanical Engineering 28 Duong Van Tu


Introduction to Deep Learning Lecture 2

Deep Neural Network

Scalar form, between hidden layers (l−1) and (l):

    z_j^(l) = Σ_{i=1}^{d^(l−1)} a_i^(l−1) · w_ij^(l) + w_0j^(l)

    a_i^(l−1) = g( z_i^(l−1) )


HCM City Univ. of Technology, Faculty of Mechanical Engineering 29 Duong Van Tu
Introduction to Deep Learning Lecture 2

Deep Neural Network

Matrix form, for a batch of N examples:

    X ∈ R^(N×d^(0)),    Y ∈ R^(N×d^(L))
    W^(l) ∈ R^(d^(l−1)×d^(l)),    b^(l) ∈ R^(1×d^(l))

    a^(0) = X
    z^(l) = a^(l−1) W^(l) + b^(l)   ∈ R^(N×d^(l))
    a^(l) = g( z^(l) )              ∈ R^(N×d^(l))
    ŷ = a^(L)
HCM City Univ. of Technology, Faculty of Mechanical Engineering 30 Duong Van Tu
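As a small sketch (not part of the original slides), the matrix-form recursion above can be coded directly; the layer sizes and random weights below are arbitrary example values:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

np.random.seed(0)
N, sizes = 4, [3, 5, 2]          # d(0)=3 inputs, d(1)=5 hidden, d(2)=2 outputs
X = np.random.rand(N, sizes[0])

# One weight matrix W(l) and bias row b(l) per layer
W = [np.random.randn(sizes[l], sizes[l + 1]) for l in range(len(sizes) - 1)]
b = [np.random.randn(1, sizes[l + 1]) for l in range(len(sizes) - 1)]

a = X                             # a(0) = X
for Wl, bl in zip(W, b):
    z = a @ Wl + bl               # z(l) = a(l-1) W(l) + b(l)
    a = sigmoid(z)                # a(l) = g(z(l))

y_hat = a                         # shape (N, d(L))
print(y_hat.shape)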
Introduction to Deep Learning Lecture 2

Example: Build a model to estimate the test score based on sleep and study hours

    Example     Sleep hours    Study hours    Target output
    Student1    0.3            1              75/100
    Student2    0.5            0.2            82/100
    Student3    1              0.4            93/100
    StudentX    8              3              ?

Two features: sleep hours (x1) and study hours (x2).

    X = [[0.3, 1], [0.5, 0.2], [1, 0.4]],    y = [[0.75], [0.82], [0.93]]

import numpy as np
X = np.array([[0.3, 1],
              [0.5, 0.2],
              [1, 0.4]])
y = np.array([[0.75],
              [0.82],
              [0.93]])
print(X.shape)
print(y.shape)

HCM City Univ. of Technology, Faculty of Mechanical Engineering 31 Duong Van Tu


Introduction to Deep Learning Lecture 2

Example: Build a model to estimate the test score based on sleep and study hours

Network: 2 inputs (sleep hours x1, study hours x2) → 3 hidden units → 1 output ŷ

    X ∈ R^(N×2)
    W^(1) ∈ R^(2×3),   b^(1) ∈ R^(1×3)
    W^(2) ∈ R^(3×1),   b^(2) ∈ R

    z^(1) = X W^(1) + b^(1)        ∈ R^(N×3)
    a^(1) = g( z^(1) )             ∈ R^(N×3)
    z^(2) = a^(1) W^(2) + b^(2)    ∈ R^(N×1)
    a^(2) = ŷ = g( z^(2) )         ∈ R^(N×1)

Structure: Inputs → Hidden → Output
HCM City Univ. of Technology, Faculty of Mechanical Engineering 32 Duong Van Tu
Introduction to Deep Learning Lecture 2

Example: Build a model to estimate the test score based on sleep and study hours

    Code symbol   Math symbol   Definition                           Dimensions
    X             X             Input data, each row is an example   (NumExamples, inputLayerSize)
    y             y             Target data                          (NumExamples, outputLayerSize)
    W1            W^(1)         Layer 1 weights                      (inputLayerSize, hiddenLayerSize)
    b1            b^(1)         Layer 1 bias                         (1, hiddenLayerSize)
    W2            W^(2)         Layer 2 weights                      (hiddenLayerSize, outputLayerSize)
    b2            b^(2)         Layer 2 bias                         (1, outputLayerSize)
    z1            z^(1)         Layer 1 activation                   (NumExamples, hiddenLayerSize)
    a1            a^(1)         Layer 1 output                       (NumExamples, hiddenLayerSize)
    z2            z^(2)         Layer 2 activation                   (NumExamples, outputLayerSize)
    a2            a^(2)         Layer 2 output                       (NumExamples, outputLayerSize)

HCM City Univ. of Technology, Faculty of Mechanical Engineering 33 Duong Van Tu


Introduction to Deep Learning Lecture 2

Example: Build a model to estimate of test score based on sleep and study hours
import numpy as np
class NeuralNetwork:
def __init__(self):
self.inputLayerSize = 2
self.hiddenLayerSize = 3
self.outputLayerSize = 1
self.W1 = np.random.randn(self.inputLayerSize, self.hiddenLayerSize)
self.b1 = np.random.randn(1, self.hiddenLayerSize)
self.W2 = np.random.randn(self.hiddenLayerSize, self.outputLayerSize)
self.b2 = np.random.randn(1, self.outputLayerSize)
def sigmoid(self, z):
return 1/(1 + np.exp(-z))
def forwardPropagation(self, X):
self.z1 = np.dot(X, self.W1) + self.b1
self.a1 = self.sigmoid(self.z1)
self.z2 = np.dot(self.a1, self.W2) + self.b2
y_hat= self.sigmoid(self.z2)
return y_hat

HCM City Univ. of Technology, Faculty of Mechanical Engineering 34 Duong Van Tu


Introduction to Deep Learning Lecture 2

Example: Build a model to estimate of test score based on sleep and study hours

Test sigmoid function


from matplotlib import pyplot as plt
import numpy as np
NN = NeuralNetwork()
test_input = np.arange(-6, 6, 0.01)
plt.plot(test_input, NN.sigmoid(test_input), linewidth=2)
plt.grid(True)
plt.show()

HCM City Univ. of Technology, Faculty of Mechanical Engineering 35 Duong Van Tu


Introduction to Deep Learning Lecture 2

Example: Build a model to estimate of test score based on sleep and study hours

Implementation of Forward Propagation


NN = NeuralNetwork()
X = np.array([[0.3, 1],
[0.5, 0.2],
[1, 0.4]])
NN.forwardPropagation(X)

array([[0.23570094],
       [0.19816067],
       [0.25203999]])

Are your results the same as these? (They will differ, since the weights are initialized randomly.)

y = np.array([[0.75],
[0.82],
[0.93]])

HCM City Univ. of Technology, Faculty of Mechanical Engineering 36 Duong Van Tu


Introduction to Deep Learning Lecture 2

Example: Build a model to estimate of test score based on sleep and study hours

Implementation of Forward Propagation using TensorFlow


import tensorflow as tf
model = tf.keras.Sequential([
tf.keras.layers.Dense(3),
tf.keras.layers.Dense(1)
])
X = np.array([[0.3, 1],
[0.5, 0.2],
[1, 0.4]])

# Perform forward propagation


y_hat = model(X)
print(y_hat.numpy())

[[0.5063543 ]
[0.03394059]
[0.06788117]]

HCM City Univ. of Technology, Faculty of Mechanical Engineering 37 Duong Van Tu


Introduction to Deep Learning Lecture 2

Exercise 2
Build a model to estimate the house price based on the number of rooms, area, and floor:
Hidden layer: 4 nodes – Activation: sigmoid
Output layer: 1 node – Activation: Linear
Weight: Random
Compute the forward propagation for a single cycle to find out the output.
Example Rooms Area Floor Price (k)
1 2 100 1 75
2 1 60 2 60
3 3 120 1 90
4 2 75 2 80

HCM City Univ. of Technology, Faculty of Mechanical Engineering 38 Duong Van Tu


Introduction to Deep Learning Lecture 2

Loss Function in Deep Learning

A loss or cost function (sometimes also called an error function)
is a method of evaluating how well your algorithm models the data.
If the value of the loss function is low, the model fits well;
otherwise, we have to change the parameters of the model
to minimize the loss.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 39 Duong Van Tu


Introduction to Deep Learning Lecture 2

Squared Error Loss Function

The loss function quantifies how much the model prediction deviates from the ground
truth for one particular example (the difference between predicted and actual values in a
machine learning model).

Squared error is also called the L2 loss:

    L_i^2 = ( f(x_i; w) − y_i )^2

It has a single global minimum and no local minima.
MSE is less robust to the presence of outliers.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 40 Duong Van Tu


Introduction to Deep Learning Lecture 2

Squared Error Loss Function

For example: very often, we use the squared error as the loss function in regression
problems.

    L_i^2 = ( f(x_i; w) − y_i )^2,    ŷ = f(x_i; w)

Estimate of test score based on sleep and study hours:

    y_hat = [[0.624], [0.584], [0.576]],    y = [[0.75], [0.82], [0.93]]

    Example    ŷ        y       L^2
    1          0.624    0.75    0.015876
    2          0.584    0.82    0.055696
    3          0.576    0.93    0.125316
HCM City Univ. of Technology, Faculty of Mechanical Engineering 41 Duong Van Tu
Introduction to Deep Learning Lecture 2

Absolute Error Loss Function

Another loss function we often use for regression is the absolute loss, also called the
L1 loss:

    L_i^1 = | f(x_i; w) − y_i |

For instance, let's say that our model predicts a flat's price (in thousands of dollars)
based on the number of rooms, area (m^2), floor, and the neighborhood in the city (A or
B). Suppose its prediction for x = [4, 70, 1, A] is USD 110k. If the actual selling
price is USD 105k, then the absolute loss is:

    L^1 = | ŷ − y | = | 110 − 105 | = 5

HCM City Univ. of Technology, Faculty of Mechanical Engineering 42 Duong Van Tu


Introduction to Deep Learning Lecture 2

Huber Loss Function

The L1 loss is more robust to outliers than L2.
The L2 loss is more stable than L1 when the difference is small.
The Huber loss combines the advantages of L1 and L2, characterized by the parameter δ:

    L_δ( f(x_i; w) − y_i ) =  (1/2)·( f(x_i; w) − y_i )^2              if | f(x_i; w) − y_i | < δ
                              δ·( | f(x_i; w) − y_i | − (1/2)·δ )       otherwise

HCM City Univ. of Technology, Faculty of Mechanical Engineering 43 Duong Van Tu
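A minimal NumPy sketch of the Huber loss above (an illustration, not from the slides); delta is the δ parameter:

import numpy as np

def huber_loss(y_hat, y, delta=1.0):
    """Element-wise Huber loss between predictions y_hat and targets y."""
    diff = y_hat - y
    quadratic = 0.5 * diff ** 2                     # used when |diff| < delta
    linear = delta * (np.abs(diff) - 0.5 * delta)   # used otherwise
    return np.where(np.abs(diff) < delta, quadratic, linear)

print(huber_loss(np.array([110.0]), np.array([105.0]), delta=1.0))  # large error -> linear regime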


Introduction to Deep Learning Lecture 2

Binary Cross Entropy Loss Function

The target variable is in binary format, and there are only two classes.
If the probability of class 1 is f(x_i; w), then that of class 2 is (1 − f(x_i; w)).
The cross-entropy loss for the actual label y (which only takes the values 0 or 1) and the
predicted probability is defined as:

    L = − [ y·log( f(x_i; w) ) + (1 − y)·log( 1 − f(x_i; w) ) ]

If y = 0, then L = −log( 1 − f(x_i; w) );
otherwise y = 1 gives L = −log( f(x_i; w) ).

HCM City Univ. of Technology, Faculty of Mechanical Engineering 44 Duong Van Tu
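A short NumPy sketch of this binary cross-entropy formula (an illustration, not from the slides); the small constant eps is added only to avoid log(0):

import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Per-example BCE for labels y in {0, 1} and predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))  # ~[0.105, 0.223]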


Introduction to Deep Learning Lecture 2

Cost Function

The cost function measures the model's error on a group of examples, whereas the loss
function deals with a single data instance.
So, if L is our loss function, then we calculate the cost function by aggregating the loss
L over the training, validation, or test data D = { (x_i, y_i) }_{i=1}^{N}.
For example, we can compute the cost as the mean loss:

    J(W) = (1/N) · Σ_{i=1}^{N} L( f(x_i; w), y_i )

HCM City Univ. of Technology, Faculty of Mechanical Engineering 45 Duong Van Tu


Introduction to Deep Learning Lecture 2

Cost Function
Mean Squared Error (MSE):

    J(W) = (1/N) · Σ_{i=1}^{N} L_i^2 = (1/N) · Σ_{i=1}^{N} ( ŷ_i − y_i )^2

mse = tf.keras.losses.mse(actual_values, predicted_values)

Mean Absolute Error (MAE):

    J(W) = (1/N) · Σ_{i=1}^{N} L_i^1 = (1/N) · Σ_{i=1}^{N} | ŷ_i − y_i |

mae = tf.keras.losses.mae(actual_values, predicted_values)

HCM City Univ. of Technology, Faculty of Mechanical Engineering 46 Duong Van Tu


Introduction to Deep Learning Lecture 2

Cost Function
Example: Compute the MSE cost function.
Let's say that we have the data on four flats and that our model predicted the sale
prices as follows:

    x_i    Rooms    Area    Floor    Neighborhood    y_i     ŷ_i
    x_1    4        70      1        A               105     104.5
    x_2    2        50      2        A               83      91
    x_3    1        30      5        B               50      65.3
    x_4    5        90      2        A               200     114

Solution:

    J = [ (104.5 − 105)^2 + (91 − 83)^2 + (65.3 − 50)^2 + (114 − 200)^2 ] / 4
      = ( 0.5^2 + 8^2 + 15.3^2 + 86^2 ) / 4 = 1,923.585

HCM City Univ. of Technology, Faculty of Mechanical Engineering 47 Duong Van Tu


Introduction to Deep Learning Lecture 2

Cost Function

import tensorflow as tf
actual_values = tf.constant([105, 83, 50, 200], dtype=tf.float32)
predicted_values = tf.constant([104.5, 91, 65.3, 114], dtype=tf.float32)
mse = tf.keras.losses.mse(actual_values, predicted_values)
print(mse.numpy())

VS

import tensorflow as tf
import numpy as np
actual_values = np.array([[105],[83], [50], [200]])
predicted_values = np.array([[104.5], [91], [65.3], [114]])
mse = tf.keras.losses.mse(actual_values, predicted_values)
print(mse.numpy())

HCM City Univ. of Technology, Faculty of Mechanical Engineering 48 Duong Van Tu


Introduction to Deep Learning Lecture 2

Cost (Loss) Function Optimization


We want to find the network weights that minimize the cost function:

    W* = argmin_W J(W) = argmin_W (1/N) · Σ_{i=1}^{N} L( f(x_i; W), y_i )

• Initialize random weights
• Loop until convergence:
    ▪ Predict using the updated weights
    ▪ Compute the cost (loss) function
    ▪ Compute the gradient ∂J(W)/∂W
    ▪ Update the weights:  W ← W − η · ∂J(W)/∂W   (where η is the learning rate)
HCM City Univ. of Technology, Faculty of Mechanical Engineering 49 Duong Van Tu
Introduction to Deep Learning Lecture 2

Gradient Descent in Single Variable

Example function: f(x) = x^2 − 4x

• Initialize a random weight (x)
• Loop until convergence:
    ▪ Predict using the updated weight;
    ▪ Compute the cost (loss) function f(x);
    ▪ Compute the gradient ∂f(x)/∂x  (f'(x) > 0 or f'(x) < 0 indicates the slope; f'(x) = 0 at the local minimum);
    ▪ Update the weight:  x_new = x_cur − η · ∂f(x)/∂x

HCM City Univ. of Technology, Faculty of Mechanical Engineering 50 Duong Van Tu
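A minimal sketch of this update rule applied to f(x) = x^2 − 4x (illustration only; the starting point and learning rate are arbitrary choices):

# Gradient descent on f(x) = x^2 - 4x, whose minimum is at x = 2
def f_prime(x):
    return 2 * x - 4           # df/dx

x = 5.0                        # arbitrary initial weight
eta = 0.1                      # learning rate
for _ in range(100):
    x = x - eta * f_prime(x)   # x_new = x_cur - eta * df/dx

print(x)                       # converges toward 2.0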


Introduction to Deep Learning Lecture 2

Gradient Descent

• Initialize a random weight (x)
• Loop until convergence:
    ▪ Predict using the updated weight;
    ▪ Compute the cost (loss) function f(x);
    ▪ Compute the gradient ∂J(W)/∂W (here ∂f(x)/∂x);
    ▪ Update the weights:  W ← W − η · ∂J(W)/∂W,  i.e.  x_new = x_cur − η · ∂f(x)/∂x

HCM City Univ. of Technology, Faculty of Mechanical Engineering 51 Duong Van Tu


Introduction to Deep Learning Lecture 2

Back Propagation in Scalar


The goals of backpropagation are straightforward: adjust each weight in the network in
proportion to how much it contributes to the overall error.
• Forward propagation can be viewed as a long series of nested equations.
• Backpropagation is an application of the chain rule to find the derivative of the cost with
respect to any variable in the nested equations.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 52 Duong Van Tu


Introduction to Deep Learning Lecture 2

Back Propagation in Scalar


Recall the gradient descent method:

    W ← W − η · ∂J(W)/∂W

It requires computing these terms:

    ∂J/∂w_ij^(l)   with  i = 1…d^(l−1),  j = 1…d^(l),  l = 1…L
    ∂J/∂b_j^(l)    with  j = 1…d^(l),  l = 1…L

HCM City Univ. of Technology, Faculty of Mechanical Engineering 53 Duong Van Tu


Introduction to Deep Learning Lecture 2

Chain Rule

• Backpropagation computes the chain rule in a manner that is highly efficient.
• Let f, g: R → R
• Suppose y = g(x), z = f(y) = f(g(x))
• Chain rule for a single path:

    ∂z/∂x = (∂z/∂y) · (∂y/∂x)

    x  →(∂y/∂x)→  y  →(∂z/∂y)→  z

HCM City Univ. of Technology, Faculty of Mechanical Engineering 54 Duong Van Tu


Introduction to Deep Learning Lecture 2

Chain Rule

• Backpropagation computes the chain rule in a manner that is highly efficient.
• Let f, g, u1, u2: R → R
• Suppose y1 = u1(x), y2 = u2(x), z = f(y1) + g(y2)
• Chain rule for two paths:

    ∂z/∂x = (∂z/∂y1)·(∂y1/∂x) + (∂z/∂y2)·(∂y2/∂x)

    (Path 1 goes through y1, Path 2 goes through y2)

HCM City Univ. of Technology, Faculty of Mechanical Engineering 55 Duong Van Tu


Introduction to Deep Learning Lecture 2

Chain Rule

• Chain rule for multiple paths (through intermediate variables y1, …, yn):

    ∂z/∂x = Σ_{i=1}^{n} (∂z/∂y_i)·(∂y_i/∂x)
HCM City Univ. of Technology, Faculty of Mechanical Engineering 56 Duong Van Tu
Introduction to Deep Learning Lecture 2

Back Propagation in Scalar


Consider the output layer. Applying the chain rule:

    ∂J/∂w_ij^(L) = (∂J/∂z_j^(L)) · (∂z_j^(L)/∂w_ij^(L))     with  i = 1…d^(L−1),  j = 1…d^(L)

Forward propagation:  z_j^(L) = a_i^(L−1)·w_ij^(L) + b_j^(L),   ŷ_j = g( z_j^(L) )

Define the output layer error:

    e_j^(L) ≜ ∂J/∂z_j^(L) = (∂J/∂ŷ_j) · (∂ŷ_j/∂z_j^(L))
    (derivative of the loss function × derivative of the activation function)

and the output layer input:

    ∂z_j^(L)/∂w_ij^(L) = a_i^(L−1)

It yields:

    ∂J/∂w_ij^(L) = e_j^(L) · a_i^(L−1)

HCM City Univ. of Technology, Faculty of Mechanical Engineering 57 Duong Van Tu


Introduction to Deep Learning Lecture 2

Back Propagation in Scalar


Consider hidden layer l. Forward propagation:

    z_j^(l) = a_i^(l−1)·w_ij^(l) + b_j^(l)

Applying the chain rule, with i = 1…d^(l−1), j = 1…d^(l):

    ∂J/∂w_ij^(l) = (∂J/∂z_j^(l)) · (∂z_j^(l)/∂w_ij^(l))
    ∂J/∂b_j^(l)  = (∂J/∂z_j^(l)) · (∂z_j^(l)/∂b_j^(l)) = ∂J/∂z_j^(l)

The current layer error and current layer input are

    e_j^(l) ≜ ∂J/∂z_j^(l),     ∂z_j^(l)/∂w_ij^(l) = a_i^(l−1)

so

    ∂J/∂w_ij^(l) = e_j^(l) · a_i^(l−1)

HCM City Univ. of Technology, Faculty of Mechanical Engineering 58 Duong Van Tu


Introduction to Deep Learning Lecture 2

Back Propagation in Scalar


Consider hidden layer (l−1). Forward propagation:

    a_i^(l−1) = g( z_i^(l−1) ),     z_j^(l) = a_i^(l−1)·w_ij^(l) + b_j^(l)

Current layer error for a single path:

    e_i^(l−1) ≜ ∂J/∂z_i^(l−1)
              = (∂J/∂z_j^(l)) · (∂z_j^(l)/∂a_i^(l−1)) · (∂a_i^(l−1)/∂z_i^(l−1))
              = e_j^(l) · w_ij^(l) · ∂g/∂z_i^(l−1)

    (next-layer error × next-layer weight × derivative of the activation function)
HCM City Univ. of Technology, Faculty of Mechanical Engineering 59 Duong Van Tu
Introduction to Deep Learning Lecture 2

Back Propagation in Scalar


Let consider hidden layer (𝑙 − 1)

(𝑙 − 1) (𝑙)
(𝑙)
𝑤𝑖1 𝟏
(𝑙)
𝑤𝑖𝑗
𝒊 𝒋 Current layer error for multi-path
(𝑙)
𝑤𝑖𝑑(𝑙) 𝑑 (𝑙)
(𝑙−1) (𝑙) (𝑙) 𝜕𝑔
𝒅(𝒍) 𝑒𝑖 = ෍ 𝑒𝑗 . 𝑤𝑖𝑗 (𝑙−1)
𝑗=1 𝜕𝑧𝑖

Next adjacent Next adjacent Derivative of


layer error layer weight activation function
HCM City Univ. of Technology, Faculty of Mechanical Engineering 60 Duong Van Tu
Introduction to Deep Learning Lecture 2

Example

A small network: one input x, two hidden layers with two ReLU (RL) units each, and a
linear (LN) output unit.

Generic notation: units z_1^(1), z_2^(1) in the first hidden layer, z_1^(2), z_2^(2) in
the second hidden layer, and output z^(3) → ŷ.

Simplified naming used below: pre-activations z1, z2 (first hidden layer, weights w1, w2,
biases b1, b2), z3, z4 (second hidden layer, weights w3…w6, biases b3, b4), and output
z5 (weights w7, w8, bias b5), with activations a1…a4.
HCM City Univ. of Technology, Faculty of Mechanical Engineering 61 Duong Van Tu
Introduction to Deep Learning Lecture 2
Example

Loss function:                 L = ( ŷ − y )^2
Derivative of loss function:   ∂L/∂ŷ = 2( ŷ − y )

Forward propagation:
    z1 = x·w1 + b1        z3 = a1·w3 + a2·w5 + b3        z5 = a3·w7 + a4·w8 + b5
    z2 = x·w2 + b2        z4 = a1·w4 + a2·w6 + b4        ŷ  = z5
    a1 = g(z1)            a3 = g(z3)
    a2 = g(z2)            a4 = g(z4)
HCM City Univ. of Technology, Faculty of Mechanical Engineering 62 Duong Van Tu
Introduction to Deep Learning Lecture 2
Example

Output layer: we need ∂L/∂w7, ∂L/∂w8, ∂L/∂b5.
Note that z5 is a function of w7 and w8.

Backpropagation must provide all of:
    ∂L/∂w1, ∂L/∂w2, ∂L/∂w3, ∂L/∂w4, ∂L/∂w5, ∂L/∂w6, ∂L/∂w7, ∂L/∂w8
    ∂L/∂b1, ∂L/∂b2, ∂L/∂b3, ∂L/∂b4, ∂L/∂b5

Applying the chain rule, for example:

    ∂L/∂w7 = (∂L/∂z5) · (∂z5/∂w7)
HCM City Univ. of Technology, Faculty of Mechanical Engineering 63 Duong Van Tu
Introduction to Deep Learning Lecture 2
Example

Output layer:

    ∂L/∂w7 = (∂L/∂z5) · (∂z5/∂w7)

    e^(L) ≜ ∂L/∂z5 ≡ ∂L/∂ŷ = 2(ŷ − y)      (output layer error, since ŷ = z5)
    ∂z5/∂w7 = a3                            (current layer input w.r.t. w7)

Recall that z5 = a3·w7 + a4·w8 + b5 and ŷ = z5. Therefore:

    ∂L/∂w7 = e^(L) · a3
HCM City Univ. of Technology, Faculty of Mechanical Engineering 64 Duong Van Tu
Introduction to Deep Learning Lecture 2
Example

Output layer:

    ∂L/∂w8 = (∂L/∂z5) · (∂z5/∂w8),      ∂z5/∂w8 = a4

Since z5 = a3·w7 + a4·w8 + b5 and e^(L) ≜ ∂L/∂z5 = 2(ŷ − y):

    ∂L/∂w8 = e^(L) · a4
HCM City Univ. of Technology, Faculty of Mechanical Engineering 65 Duong Van Tu
Introduction to Deep Learning Lecture 2
Example

Output layer:

    ∂L/∂b5 = (∂L/∂z5) · (∂z5/∂b5),      ∂z5/∂b5 = 1

Therefore:

    ∂L/∂b5 = e^(L)
HCM City Univ. of Technology, Faculty of Mechanical Engineering 66 Duong Van Tu
Introduction to Deep Learning Lecture 2
Example

2nd hidden layer, current layer error for unit 3:

    e_3^(2) ≜ ∂L/∂z3 = (∂L/∂a3) · (∂a3/∂z3) = (∂L/∂z5) · (∂z5/∂a3) · (∂a3/∂z3)
             = e^(L) · w7 · ∂g(z3)/∂z

    (next-layer error × next-layer weight × derivative of the activation function)

Recall that e^(L) ≜ ∂L/∂z5 and z5 = a3·w7 + a4·w8 + b5.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 67 Duong Van Tu


Introduction to Deep Learning Lecture 2
Example

2nd hidden layer:

    ∂L/∂w3 = (∂L/∂z3) · (∂z3/∂w3) = e_3^(2) · a1

    (current layer error × current layer input)

Recall that e_3^(2) ≜ ∂L/∂z3 and z3 = a1·w3 + a2·w5 + b3, so ∂z3/∂w3 = a1.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 68 Duong Van Tu


Introduction to Deep Learning Lecture 2
Example

2nd hidden layer:

    ∂L/∂w5 = (∂L/∂z3) · (∂z3/∂w5) = e_3^(2) · a2

Since z3 = a1·w3 + a2·w5 + b3, ∂z3/∂w5 = a2.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 69 Duong Van Tu


Introduction to Deep Learning Lecture 2
Example

2nd hidden layer:

    ∂L/∂b3 = (∂L/∂z3) · (∂z3/∂b3) = e_3^(2)

Since z3 = a1·w3 + a2·w5 + b3, ∂z3/∂b3 = 1.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 70 Duong Van Tu


Introduction to Deep Learning Lecture 2
Example

2nd hidden layer, current layer error for unit 4:

    e_4^(2) ≜ ∂L/∂z4 = (∂L/∂a4) · (∂a4/∂z4) = (∂L/∂z5) · (∂z5/∂a4) · (∂a4/∂z4)
             = e^(L) · w8 · ∂g(z4)/∂z

    (next-layer error × next-layer weight × derivative of the activation function)

Recall that e^(L) ≜ ∂L/∂z5 and z5 = a3·w7 + a4·w8 + b5.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 71 Duong Van Tu


Introduction to Deep Learning Lecture 2
Example

2nd hidden layer:

    ∂L/∂w4 = (∂L/∂z4) · (∂z4/∂w4) = e_4^(2) · a1

Since z4 = a1·w4 + a2·w6 + b4, ∂z4/∂w4 = a1.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 72 Duong Van Tu


Introduction to Deep Learning Lecture 2
Example

2nd hidden layer:

    ∂L/∂w6 = (∂L/∂z4) · (∂z4/∂w6) = e_4^(2) · a2

Since z4 = a1·w4 + a2·w6 + b4, ∂z4/∂w6 = a2.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 73 Duong Van Tu


Introduction to Deep Learning Lecture 2
Example

2nd hidden layer:

    ∂L/∂b4 = (∂L/∂z4) · (∂z4/∂b4) = e_4^(2)

Since z4 = a1·w4 + a2·w6 + b4, ∂z4/∂b4 = 1.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 74 Duong Van Tu


Introduction to Deep Learning Lecture 2
Example

1st hidden layer, current layer error for unit 1:

    e_1^(1) ≜ ∂L/∂z1 = (∂L/∂a1) · (∂a1/∂z1)

where a1 influences the loss through both z3 and z4 (writing u(a1) ≜ a3 and v(a1) ≜ a4,
z5 = u(a1)·w7 + v(a1)·w8 + b5), so by the multi-path chain rule:

    ∂L/∂a1 = (∂L/∂z5)·(∂z5/∂a3)·(∂a3/∂z3)·(∂z3/∂a1) + (∂L/∂z5)·(∂z5/∂a4)·(∂a4/∂z4)·(∂z4/∂a1)

HCM City Univ. of Technology, Faculty of Mechanical Engineering 75 Duong Van Tu


Introduction to Deep Learning Lecture 2
Example

1st hidden layer, current layer error:

    e_1^(1) ≜ ∂L/∂z1 = (∂L/∂a1) · (∂a1/∂z1)

    ∂L/∂a1 = (∂L/∂z5)·(∂z5/∂a3)·(∂a3/∂z3)·(∂z3/∂a1) + (∂L/∂z5)·(∂z5/∂a4)·(∂a4/∂z4)·(∂z4/∂a1)
           =  e^(L)  ·   w7    ·  g'(z3)  ·   w3    +   e^(L)  ·   w8    ·  g'(z4)  ·   w4

Recall that e^(L) ≜ ∂L/∂z5, z5 = a3·w7 + a4·w8 + b5, z3 = a1·w3 + a2·w5 + b3,
z4 = a1·w4 + a2·w6 + b4, and g'(·) denotes the derivative of the activation function.
HCM City Univ. of Technology, Faculty of Mechanical Engineering 76 Duong Van Tu
Introduction to Deep Learning Lecture 2
Example

1st hidden layer:

    ∂L/∂a1 = (∂L/∂z5)·(∂z5/∂a3)·(∂a3/∂z3)·(∂z3/∂a1) + (∂L/∂z5)·(∂z5/∂a4)·(∂a4/∂z4)·(∂z4/∂a1)

With e_3^(2) ≜ ∂L/∂z3 and e_4^(2) ≜ ∂L/∂z4, the current layer error becomes:

    e_1^(1) ≜ ∂L/∂z1 = ( e_3^(2)·w3 + e_4^(2)·w4 ) · ∂a1/∂z1

    (next-layer errors × next-layer weights × derivative of the activation function)
HCM City Univ. of Technology, Faculty of Mechanical Engineering 77 Duong Van Tu
Introduction to Deep Learning Lecture 2
Example

1st hidden layer, current layer error for unit 2:

    e_2^(1) ≜ ∂L/∂z2 = ( e_3^(2)·w5 + e_4^(2)·w6 ) · ∂a2/∂z2

    (next-layer errors × next-layer weights × derivative of the activation function)
HCM City Univ. of Technology, Faculty of Mechanical Engineering 78 Duong Van Tu
Introduction to Deep Learning Lecture 2
Example

1st hidden layer:

    ∂L/∂w1 = (∂L/∂z1) · (∂z1/∂w1) = e_1^(1) · x

Since z1 = x·w1 + b1, ∂z1/∂w1 = x.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 79 Duong Van Tu


Introduction to Deep Learning Lecture 2
Example

1st hidden layer:

    ∂L/∂b1 = (∂L/∂z1) · (∂z1/∂b1) = e_1^(1)

Since z1 = x·w1 + b1, ∂z1/∂b1 = 1.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 80 Duong Van Tu


Introduction to Deep Learning Lecture 2
Example

1st hidden layer:

    ∂L/∂w2 = (∂L/∂z2) · (∂z2/∂w2) = e_2^(1) · x

Since z2 = x·w2 + b2, ∂z2/∂w2 = x.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 81 Duong Van Tu


Introduction to Deep Learning Lecture 2
Example

1st hidden layer:

    ∂L/∂b2 = (∂L/∂z2) · (∂z2/∂b2) = e_2^(1)

Since z2 = x·w2 + b2, ∂z2/∂b2 = 1.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 82 Duong Van Tu
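To tie the scalar example together, here is a small sketch (not from the slides) that computes every gradient derived above for one input; the numeric values of x, y, the weights, and the biases are arbitrary, and ReLU is used for the hidden activations as in the diagram:

relu = lambda z: max(z, 0.0)
relu_grad = lambda z: 1.0 if z > 0 else 0.0

# Arbitrary example values
x, y = 1.0, 0.5
w1, w2, w3, w4, w5, w6, w7, w8 = 0.1, -0.2, 0.3, 0.4, -0.1, 0.2, 0.5, -0.3
b1, b2, b3, b4, b5 = 0.1, 0.1, 0.0, 0.0, 0.2

# Forward propagation (as on the "Forward Propagation" slide)
z1 = x*w1 + b1;            a1 = relu(z1)
z2 = x*w2 + b2;            a2 = relu(z2)
z3 = a1*w3 + a2*w5 + b3;   a3 = relu(z3)
z4 = a1*w4 + a2*w6 + b4;   a4 = relu(z4)
z5 = a3*w7 + a4*w8 + b5;   y_hat = z5          # linear output

# Backward propagation: layer errors
eL = 2*(y_hat - y)                              # output layer error
e3 = eL * w7 * relu_grad(z3)                    # 2nd hidden layer errors
e4 = eL * w8 * relu_grad(z4)
e1 = (e3*w3 + e4*w4) * relu_grad(z1)            # 1st hidden layer errors
e2 = (e3*w5 + e4*w6) * relu_grad(z2)

# Gradients = layer error * layer input
grads = {"w7": eL*a3, "w8": eL*a4, "b5": eL,
         "w3": e3*a1, "w5": e3*a2, "b3": e3,
         "w4": e4*a1, "w6": e4*a2, "b4": e4,
         "w1": e1*x,  "b1": e1, "w2": e2*x, "b2": e2}
print(grads)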


Introduction to Deep Learning Lecture 2

Back Propagation in Matrix form

For A not a function of x:

    z ∈ R^(m×n),  A ∈ R^(m×p),  x ∈ R^(p×n),  z = A·x
    ∂z/∂x = Aᵀ

If w is a function of z, which is a function of y, which is a function of x, we can
obtain this chain rule:

    ∂w/∂x = (∂y/∂x) · (∂z/∂y) · (∂w/∂z)

For more details, refer to the matrix calculus documentation.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 83 Duong Van Tu


Introduction to Deep Learning Lecture 2

Back Propagation in Matrix form

Step 1: Execute forward propagation and store each layer output a^(l) ∈ R^(N×d^(l)).

Step 2: Compute the derivative of the loss function  ∂J/∂a^(L) = ∂J/∂ŷ ∈ R^(N×d^(L)).

Step 3: Compute the derivative of the activation function  ∂a^(L)/∂z^(L) = ∂ŷ/∂z^(L) ∈ R^(N×d^(L)).

Step 4: Compute the term  ∂z^(L)/∂W^(L) = ( a^(L−1) )ᵀ ∈ R^(d^(L−1)×N),
        since z^(l) = a^(l−1) W^(l) + b^(l).

Step 5: Compute the gradient of the weights:

    ∂J/∂W^(L) = ( a^(L−1) )ᵀ · e^(L) ∈ R^(d^(L−1)×d^(L))

where e^(L) ≜ ∂J/∂z^(L) ∈ R^(N×d^(L)) is the Hadamard product (∂J/∂ŷ) ⊙ (∂ŷ/∂z^(L)).

HCM City Univ. of Technology, Faculty of Mechanical Engineering 84 Duong Van Tu


Introduction to Deep Learning Lecture 2

Back Propagation in Matrix form

Step 6: Compute the gradient of the bias:

    ∂J/∂b^(L) = Σ_{i=1}^{N} (∂J/∂ŷ) ⊙ (∂ŷ/∂z^(L)) ∈ R^(1×d^(L))
    (the sum runs over the N examples, i.e. over the rows of e^(L))

Step 7: Repeat step 5 for every other layer to compute the gradient of the weights:

    ∂J/∂W^(l) = ( a^(l−1) )ᵀ · e^(l) ∈ R^(d^(l−1)×d^(l))

where e^(l) ∈ R^(N×d^(l)),   e^(l) = [ e^(l+1) · ( W^(l+1) )ᵀ ] ⊙ g'( z^(l) ).

Step 8: Repeat step 6 for every other layer to compute the gradient of the bias:

    ∂J/∂b^(l) = Σ_{i=1}^{N} (∂J/∂a^(l)) ⊙ (∂a^(l)/∂z^(l)) ∈ R^(1×d^(l))

HCM City Univ. of Technology, Faculty of Mechanical Engineering 85 Duong Van Tu


Introduction to Deep Learning Lecture 2

Weight and Bias Update

Update the weight matrix for all layers:

    W^(l) = W^(l) − η · ∂J/∂W^(l)     for l = 1…L

Update the bias matrix for all layers:

    b^(l) = b^(l) − η · ∂J/∂b^(l)     for l = 1…L

HCM City Univ. of Technology, Faculty of Mechanical Engineering 86 Duong Van Tu


Introduction to Deep Learning Lecture 2

Example: Build a model to estimate the test score based on sleep and study hours

Network: 2 inputs → 3 hidden units (activation g1) → 1 output (activation g2), with loss
L = ( ŷ − y )^2.

    X ∈ R^(N×2)
    W^(1) ∈ R^(2×3),   b^(1) ∈ R^(1×3)
    W^(2) ∈ R^(3×1),   b^(2) ∈ R

    z^(1) = X W^(1) + b^(1)        ∈ R^(N×3)
    a^(1) = g1( z^(1) )            ∈ R^(N×3)
    z^(2) = a^(1) W^(2) + b^(2)    ∈ R^(N×1)
    a^(2) = ŷ = g2( z^(2) )        ∈ R^(N×1)

Structure: Inputs → Hidden → Output
HCM City Univ. of Technology, Faculty of Mechanical Engineering 87 Duong Van Tu
Introduction to Deep Learning Lecture 2

Example: Build a model to estimate the test score based on sleep and study hours

    Code symbol   Math symbol   Definition                           Dimensions
    X             X             Input data, each row is an example   (3, 2)
    y             y             Target data                          (3, 1)
    W1            W^(1)         Layer 1 weights                      (2, 3)
    b1            b^(1)         Layer 1 bias                         (1, 3)
    W2            W^(2)         Layer 2 weights                      (3, 1)
    b2            b^(2)         Layer 2 bias                         (1, 1)
    z1            z^(1)         Layer 1 activation                   (3, 3)
    a1            a^(1)         Layer 1 output                       (3, 3)
    z2            z^(2)         Layer 2 activation                   (3, 1)
    a2            a^(2)         Layer 2 output                       (3, 1)

HCM City Univ. of Technology, Faculty of Mechanical Engineering 88 Duong Van Tu


Introduction to Deep Learning Lecture 2

Example: Build a model to estimate of test score based on sleep and study hours
import numpy as np
# Example data: 3 training examples
num_examples = 3
num_input_units = 2
num_hidden_units = 3
num_output_units = 1
# Input features
X = np.array([[0.3, 1],
              [0.5, 0.2],
              [1, 0.4]])
# Output labels
y = np.array([[0.75],
              [0.82],
              [0.93]])
# Initialize random weights and biases
W_input_hidden = np.random.rand(num_input_units, num_hidden_units)
b_hidden = np.random.rand(1, num_hidden_units)
W_hidden_output = np.random.rand(num_hidden_units, num_output_units)
b_output = np.random.rand(1, num_output_units)
# Learning rate
learning_rate = 0.1

HCM City Univ. of Technology, Faculty of Mechanical Engineering 89 Duong Van Tu


Introduction to Deep Learning Lecture 2

Example: Build a model to estimate of test score based on sleep and study hours
for iteration in range(1, 1001):
    # Forward propagation
    z_hidden = np.dot(X, W_input_hidden) + b_hidden
    a_hidden = 1 / (1 + np.exp(-z_hidden))
    z_output = np.dot(a_hidden, W_hidden_output) + b_output
    a_output = z_output  # Linear activation for regression
    # Calculate cost (MSE)
    cost = np.mean((a_output - y) ** 2)
    # Backpropagation
    dJ_da_output = 2 * (a_output - y)
    dJ_dz_output = dJ_da_output  # Linear activation derivative
    dJ_dW_hidden_output = a_hidden.T.dot(dJ_dz_output)
    dJ_db_output = np.sum(dJ_dz_output, axis=0, keepdims=True)
    dJ_da_hidden = dJ_dz_output.dot(W_hidden_output.T)
    dJ_dz_hidden = dJ_da_hidden * a_hidden * (1 - a_hidden)
    dJ_dW_input_hidden = X.T.dot(dJ_dz_hidden)
    dJ_db_hidden = np.sum(dJ_dz_hidden, axis=0, keepdims=True)
    # Update weights and biases
    W_hidden_output -= learning_rate * dJ_dW_hidden_output
    b_output -= learning_rate * dJ_db_output
    W_input_hidden -= learning_rate * dJ_dW_input_hidden
    b_hidden -= learning_rate * dJ_db_hidden

The backpropagation lines implement the matrix-form formulas:

    ∂J/∂ŷ = ∂J/∂a^(2) = 2( ŷ − y )       (output layer error; linear output, so ∂J/∂z^(2) = ∂J/∂a^(2))
    ∂J/∂W^(2) = ( a^(1) )ᵀ · ∂J/∂z^(2)
    ∂J/∂b^(2) = Σ_{i=1}^{N} ∂J/∂z^(2)
    ∂J/∂a^(1) = ∂J/∂z^(2) · ( W^(2) )ᵀ
    ∂J/∂z^(1) = ∂J/∂a^(1) ⊙ a^(1) ⊙ ( 1 − a^(1) )
    ∂J/∂W^(1) = Xᵀ · ∂J/∂z^(1)
HCM City Univ. of Technology, Faculty of Mechanical Engineering 90 Duong Van Tu
Introduction to Deep Learning Lecture 2

Example: Build a model to estimate of test score based on sleep and study hours
# Log cost every 100 iterations
if iteration % 100 == 0:
print(f"Iteration {iteration}, Cost: {cost:.4f}")
print("Training complete!")

Iteration 100, Cost: 0.0044


Iteration 200, Cost: 0.0025
Iteration 300, Cost: 0.0012
Iteration 400, Cost: 0.0006
Iteration 500, Cost: 0.0003
Iteration 600, Cost: 0.0002
Iteration 700, Cost: 0.0002
Iteration 800, Cost: 0.0001
Iteration 900, Cost: 0.0001
Iteration 1000, Cost: 0.0001
Training complete!

HCM City Univ. of Technology, Faculty of Mechanical Engineering 91 Duong Van Tu


Introduction to Deep Learning Lecture 2

Exercise 3
Build a model to estimate the house price based on the number of rooms, area, and floor:
Hidden layer: 4 nodes – Activation: sigmoid
Output layer: 1 node – Activation: linear          Do not use TensorFlow
Weights: random – Cost function: MSE
Compute the model, train for 100 iterations, and plot the cost and the weights.
Example Rooms Area Floor Price (k)
1 2 100 1 75
2 1 60 2 60
3 3 120 1 90
4 2 75 2 80

HCM City Univ. of Technology, Faculty of Mechanical Engineering 92 Duong Van Tu


Introduction to Deep Learning Lecture 2

Epochs vs Iterations

▪ Epoch can be understood as the number of times the algorithm scans the entire

data. For example if we set epoch = 10 then the algorithm will scan the entire

data ten times.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 93 Duong Van Tu


Introduction to Deep Learning Lecture 2

Epochs vs Iterations

▪ Iteration is the number of times a certain batch is passed via an algorithm.

▪ For Example : a dataset having 20 examples, batch size = 4 examples,

and epochs = 5.

▪ Then, in each epoch, there will be:

20 examples /4 examples per batch = 5 batches

▪ Each batch will be passed through the algorithm, so there will be 5 iterations per epoch.

▪ Total number of iterations will be 5 epochs * 5 iterations per epoch = 25 iterations

HCM City Univ. of Technology, Faculty of Mechanical Engineering 94 Duong Van Tu


Introduction to Deep Learning Lecture 2

Learning Rate

Small learning rate converges slowly and gets stuck in false local minima.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 95 Duong Van Tu


Introduction to Deep Learning Lecture 2

Learning Rate

A large learning rate overshoots, becomes unstable, and diverges.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 96 Duong Van Tu


Introduction to Deep Learning Lecture 2

Learning Rate

A suitable learning rate converges smoothly and avoids local minima.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 97 Duong Van Tu


Introduction to Deep Learning Lecture 2

Optimizer >> Batch Gradient Descent

▪ Batch Gradient Descent involves calculations over the full training set at each step.

▪ As a result, it is very slow on very large training data.

▪ It becomes very computationally expensive to do Batch GD.

▪ This is great for convex or relatively smooth error manifolds.

▪ Batch GD scales well with the number of features.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 98 Duong Van Tu


Introduction to Deep Learning Lecture 2

Optimizer >> Batch Gradient Descent

Algorithm
1. Initialize the weights randomly
2. Loop until convergence:
3.     Compute the gradients ∂J(W, b)/∂W and ∂J(W, b)/∂b over the full training set
4.     Update weights and bias:  W ← W − η·∂J(W, b)/∂W,   b ← b − η·∂J(W, b)/∂b
5. Return weights and bias

HCM City Univ. of Technology, Faculty of Mechanical Engineering 99 Duong Van Tu


Introduction to Deep Learning Lecture 2

Optimizer >> Stochastic Gradient Descent

▪ Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm


▪ It deals with large datasets in machine learning projects.
▪ A single random training example (or a small batch) is selected to calculate the
gradient and update the model parameters.
▪ The advantage of using SGD is its computational efficiency, especially when dealing
with large datasets.
▪ The computational cost per iteration is significantly reduced compared to traditional
Gradient Descent methods.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 100 Duong Van Tu
Introduction to Deep Learning Lecture 2

Optimizer >> Stochastic Gradient Descent

Algorithm
1. Initialize the weights randomly
2. Loop until convergence:
3.     Shuffle the training dataset, pick a single data point i
4.     Iterate over each training example
5.         Compute the gradients of the cost function, ∂J_i(W, b)/∂W and ∂J_i(W, b)/∂b
6.         Update weights and bias:  W ← W − η·∂J_i(W, b)/∂W,   b ← b − η·∂J_i(W, b)/∂b
7. Return weights and bias

HCM City Univ. of Technology, Faculty of Mechanical Engineering 101 Duong Van Tu
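A compact sketch of this per-example update (an illustration on a simple linear model with made-up data, not the lecturer's code):

import numpy as np

# Minimal SGD for a linear model y ≈ X·w + b with squared-error loss
np.random.seed(0)
X = np.random.rand(20, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3

w = np.zeros(3)
b = 0.0
eta = 0.1

for epoch in range(50):
    for i in np.random.permutation(len(X)):       # shuffle, then visit one example at a time
        err = (X[i] @ w + b) - y[i]               # prediction error for example i
        w -= eta * 2 * err * X[i]                 # dJ_i/dw = 2 * err * x_i
        b -= eta * 2 * err                        # dJ_i/db = 2 * err

print(w, b)   # close to [1, -2, 0.5] and 0.3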
Introduction to Deep Learning Lecture 2

Batch Gradient Descent vs Stochastic Gradient Descent

HCM City Univ. of Technology, Faculty of Mechanical Engineering 102 Duong Van Tu
Introduction to Deep Learning Lecture 2

Optimizer >> Mini Batch Gradient Descent

▪ Parameters are updated after computing the gradient of the error with respect to a
subset of the training set.
Algorithm
1. Initialize the weights randomly
2. Loop until convergence:
3.     Pick a batch of examples to train on
4.     Compute the gradients of the cost function on that batch, ∂J(W, b)/∂W and ∂J(W, b)/∂b
5.     Update weights and bias:  W ← W − η·∂J(W, b)/∂W,   b ← b − η·∂J(W, b)/∂b
6. Return weights and bias


HCM City Univ. of Technology, Faculty of Mechanical Engineering 103 Duong Van Tu
Introduction to Deep Learning Lecture 2

Optimizer >> Adaptive Gradient (AdaGrad)


• Adapts the learning rate for each parameter individually.
• The learning rate will be lower for parameters with a high gradient.
• For parameters with a low gradient, the learning rate will be larger.
• It modifies the general learning rate 𝜂 at each time step 𝑡 for every parameter
𝜃𝑖 based on the past gradients for 𝜃𝑖 .
• The learning rate is updated at each iteration as follows:

    θ_{t+1} = θ_t − η / sqrt( G_t + ε ) ⊙ g_t,      where  g_t ≜ ∂J/∂θ_t

  and G_t accumulates the squared past gradients of θ.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 104 Duong Van Tu
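A minimal sketch of the AdaGrad update (illustration only; the gradient function and starting point are made up):

import numpy as np

# AdaGrad on J(theta) = theta_1^2 + 10*theta_2^2 (an arbitrary test function)
grad = lambda theta: np.array([2 * theta[0], 20 * theta[1]])

theta = np.array([3.0, 2.0])
eta, eps = 0.5, 1e-8
G = np.zeros_like(theta)                  # accumulated squared gradients, one entry per parameter

for _ in range(200):
    g = grad(theta)
    G += g ** 2                           # G_t grows with the history of squared gradients
    theta -= eta / np.sqrt(G + eps) * g   # per-parameter effective learning rate

print(theta)                              # approaches [0, 0]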
Introduction to Deep Learning Lecture 2

Optimizer >> AdaDelta

• Adadelta is an extension of Adagrad.
• Adadelta does not require a manually tuned learning rate and can also handle
  variable-length sequences.

    E[g^2]_t = γ·E[g^2]_{t−1} + (1 − γ)·g_t^2

    θ_{t+1} = θ_t − η / sqrt( E[g^2]_t + ε ) · g_t

HCM City Univ. of Technology, Faculty of Mechanical Engineering 105 Duong Van Tu
Introduction to Deep Learning Lecture 2

Optimizer >> Adaptive Moment Estimation (Adam)

    m_t = β1·m_{t−1} + (1 − β1)·g_t          (first moment)
    v_t = β2·v_{t−1} + (1 − β2)·g_t^2        (second moment)

    m̂_t = m_t / (1 − β1)
    v̂_t = v_t / (1 − β2)

    θ_{t+1} = θ_t − η / ( sqrt(v̂_t) + ε ) · m̂_t

HCM City Univ. of Technology, Faculty of Mechanical Engineering 106 Duong Van Tu
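A minimal NumPy sketch of the Adam update (illustration only; note that the standard bias correction divides by 1 − β^t at step t, which reduces to the slide's expression for t = 1; the test function and hyper-parameters are arbitrary):

import numpy as np

# Adam on J(theta) = theta_1^2 + 10*theta_2^2 (an arbitrary test function)
grad = lambda theta: np.array([2 * theta[0], 20 * theta[1]])

theta = np.array([3.0, 2.0])
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m = np.zeros_like(theta)       # first moment estimate
v = np.zeros_like(theta)       # second moment estimate

for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
    theta -= eta / (np.sqrt(v_hat) + eps) * m_hat

print(theta)                    # approaches [0, 0]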
Introduction to Deep Learning Lecture 2

Optimizer

Algorithm        TensorFlow Implementation

• SGD            tf.keras.optimizers.SGD
• Adam           tf.keras.optimizers.Adam
• Adadelta       tf.keras.optimizers.Adadelta
• Adagrad        tf.keras.optimizers.Adagrad
• RMSProp        tf.keras.optimizers.RMSprop

HCM City Univ. of Technology, Faculty of Mechanical Engineering 107 Duong Van Tu
Introduction to Deep Learning Lecture 2

Exercise 4
Use the requirement described in Ex. 3. Change the learning rate and plot the cost function
corresponding to each value of learning rate.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 108 Duong Van Tu
Introduction to Deep Learning Lecture 2

Bias and Variance

▪ Bias: Assumptions made by a model to make a function easier to learn. It is actually


the error rate of the training data.
▪ When the error rate has a high value, it is called high bias; otherwise it is called low
bias.
▪ Variance: The difference between the error rate of training data and testing data is
called variance.
▪ If the difference is high then it’s called high variance, otherwise, it is called low
variance.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 109 Duong Van Tu
Introduction to Deep Learning Lecture 2

Problem of Underfitting
▪ A statistical model is said to have underfitting when it
cannot capture the underlying trend of the data.
▪ The model does not make accurate predictions on
testing data.
▪ This case is called high bias.
Reason:
• The size of the training dataset used is not enough.
• The model is too simple.
• Training data is not cleaned and also contains noise
in it.
HCM City Univ. of Technology, Faculty of Mechanical Engineering 110 Duong Van Tu
Introduction to Deep Learning Lecture 2

Problem of Underfitting

Techniques to Reduce Underfitting


1. Increase model complexity.
2. Increase the number of features, performing feature
engineering.
3. Remove noise from the data.
4. Increase the number of epochs or increase the
duration of training to get better results.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 111 Duong Van Tu
Introduction to Deep Learning Lecture 2

Problem of Overfitting
▪ A statistical model is said to have overfitting when it
only performs well on training data but performs poorly
on testing data.
▪ This case is called high variance.
Reason:
• The size of the training dataset used is not enough.
• The model is too complex.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 112 Duong Van Tu
Introduction to Deep Learning Lecture 2

Problem of Overfitting

Techniques to Reduce Overfitting


1. Increase training data.
2. Reduce model complexity.
3. Early stopping during the training phase (keep an eye on the loss during training; as
soon as the loss begins to increase, stop training).
4. Ridge Regularization and Lasso Regularization.
5. Use dropout for neural networks to tackle overfitting.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 113 Duong Van Tu
Introduction to Deep Learning Lecture 2

Dropout Regularization

▪ Dropout regularization is one technique used to tackle overfitting problems in


deep learning.
▪ During the training phase, randomly set some activations to 0.
▪ Typically, drop 50% of the activations in a layer.

tf.keras.layers.Dropout(rate=0.5)

HCM City Univ. of Technology, Faculty of Mechanical Engineering 114 Duong Van Tu
Introduction to Deep Learning Lecture 2

Early Stopping Regularization

▪ In regularization by early stopping, we stop training the model when its
performance on the validation set is getting worse: increasing loss, decreasing accuracy,
or poorer values of the scoring metric.

early_stopping = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)

HCM City Univ. of Technology, Faculty of Mechanical Engineering 115 Duong Van Tu
Introduction to Deep Learning Lecture 2

Implementation on TensorFlow Keras

• Create inputs and outputs using a TensorFlow constant: tf.constant
• Create the model using the Sequential class: tf.keras.Sequential()
• Compile the model with the optimizer and loss function:
  model.compile(optimizer='adam', loss='mse')
• Train the model using the fit method: model.fit
• Generate predictions and analyze accuracy using the predict method: model.predict

HCM City Univ. of Technology, Faculty of Mechanical Engineering 116 Duong Van Tu
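A minimal end-to-end sketch of these steps (an illustration with made-up data, not a reference solution to the exercises):

import tensorflow as tf

# Made-up data: 4 examples with 3 input features and 1 target value
X = tf.constant([[0.2, 1.0, 0.5], [0.9, 0.1, 0.3], [0.4, 0.8, 0.7], [0.6, 0.5, 0.2]])
y = tf.constant([[0.7], [0.4], [0.9], [0.6]])

# Sequential model: one hidden layer + linear output
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation='sigmoid'),
    tf.keras.layers.Dense(1)
])

# Compile with the optimizer and loss function, train with fit, predict with predict
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=100, verbose=0)
print(model.predict(X))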
Introduction to Deep Learning Lecture 2

Exercise 5
Build a model to estimate the house price based on the number of rooms, area, and floor:
Hidden layer: 4 nodes – Activation: sigmoid
Output layer: 1 node                               Use TensorFlow >> Keras API
Weights: random – Cost function: MSE
Compute the model and train for 100 iterations. Compute the estimated price for the 5th example.
Example Rooms Area Floor Price (k)
1 2 100 1 75
2 1 60 2 60
3 3 120 1 90
4 2 75 2 80
5 1 45 1 ?
HCM City Univ. of Technology, Faculty of Mechanical Engineering 117 Duong Van Tu
Introduction to Deep Learning Lecture 2

Exercise 6
Predict the compressive strength of concrete manufactured according to various recipes. The
dataset is taken from BKEL.
Build a model with two hidden layers of 32 nodes each and an output layer with a single node.
The hidden layers use the ReLU activation function.

HCM City Univ. of Technology, Faculty of Mechanical Engineering 118 Duong Van Tu
