◆ Suppose we want to find a linear model $f(x) = \theta_0 + \theta_1 x$ to fit a dataset $D = \{(x, y)\} = \{(0, 1), (1, 2)\}$. We consider the squared error
$$J(\theta) = \frac{1}{m} \sum_{j=1}^{m} \left( f(x^{(j)}) - y^{(j)} \right)^2$$
as the cost function, where $m$ is the number of data examples. We use gradient descent to iteratively update the parameters $\theta_0$ and $\theta_1$, starting from $\theta_0 = \theta_1 = 0$.
Question 1: Calculate $\theta_0$ and $\theta_1$ after the first iteration of the gradient-descent update, where the learning rate is set to 0.1.
$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{\partial \left( \frac{1}{m} \sum_{j=1}^{m} \left( f(x^{(j)}) - y^{(j)} \right)^2 \right)}{\partial \theta_0} = \frac{1}{m} \sum_{j=1}^{m} \frac{\partial \left( f(x^{(j)}) - y^{(j)} \right)^2}{\partial \theta_0}$$
As we have $f(x^{(j)}) - y^{(j)} = \theta_0 + \theta_1 x^{(j)} - y^{(j)}$, and according to the derivative calculation rule $\frac{d(a\theta - b)^2}{d\theta} = 2(a\theta - b)a$,
$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{j=1}^{m} 2\left( \theta_0 + \theta_1 x^{(j)} - y^{(j)} \right) = \frac{2}{2} \left( (0 + 0 - 1) + (0 + 0 - 2) \right) = -3$$
$$\theta_0 \leftarrow \theta_0 - \alpha \frac{\partial J(\theta)}{\partial \theta_0} = 0 - 0.1 \times (-3) = 0.3$$
$$\frac{\partial J(\theta)}{\partial \theta_1} = \frac{\partial \left( \frac{1}{m} \sum_{j=1}^{m} \left( f(x^{(j)}) - y^{(j)} \right)^2 \right)}{\partial \theta_1} = \frac{1}{m} \sum_{j=1}^{m} \frac{\partial \left( \theta_0 + \theta_1 x^{(j)} - y^{(j)} \right)^2}{\partial \theta_1}$$
And according to the derivative calculation rule $\frac{d(a\theta - b)^2}{d\theta} = 2(a\theta - b)a$,
$$\frac{\partial J(\theta)}{\partial \theta_1} = \frac{1}{m} \sum_{j=1}^{m} 2\left( \theta_0 + \theta_1 x^{(j)} - y^{(j)} \right) x^{(j)} = \frac{2}{2} \left( (0 + 0 - 1) \times 0 + (0 + 0 - 2) \times 1 \right) = -2$$
$$\theta_1 \leftarrow \theta_1 - \alpha \frac{\partial J(\theta)}{\partial \theta_1} = 0 - 0.1 \times (-2) = 0.2$$
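The worked update above can be checked numerically. The sketch below implements one gradient-descent step for the linear model on $D = \{(0,1), (1,2)\}$ with $\alpha = 0.1$; variable names (`grad0`, `grad1`, etc.) are illustrative, not from the original.

```python
# One gradient-descent step for f(x) = theta0 + theta1 * x on D = {(0,1), (1,2)},
# with cost J(theta) = (1/m) * sum((f(x_j) - y_j)**2) and learning rate alpha = 0.1.
X = [0.0, 1.0]
Y = [1.0, 2.0]
m = len(X)
theta0, theta1 = 0.0, 0.0
alpha = 0.1

# Residuals f(x_j) - y_j at the current parameters: [-1, -2]
errors = [theta0 + theta1 * x - y for x, y in zip(X, Y)]

# dJ/dtheta0 = (2/m) * sum(f(x_j) - y_j)          -> -3
# dJ/dtheta1 = (2/m) * sum((f(x_j) - y_j) * x_j)  -> -2
grad0 = (2 / m) * sum(errors)
grad1 = (2 / m) * sum(e * x for e, x in zip(errors, X))

theta0 -= alpha * grad0  # 0 - 0.1 * (-3), approximately 0.3
theta1 -= alpha * grad1  # 0 - 0.1 * (-2), approximately 0.2
print(theta0, theta1)
```

The printed values match the hand calculation up to floating-point rounding.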
Question 2: Now suppose we want to find a quadratic model $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ to fit the data. Consider the same cost function and the initial values $\theta_0 = \theta_1 = \theta_2 = 0$. Calculate $\theta_0$, $\theta_1$, $\theta_2$ after the first iteration of the gradient-descent update, where the learning rate is set to 0.1.
$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{\partial \left( \frac{1}{m} \sum_{j=1}^{m} \left( f(x^{(j)}) - y^{(j)} \right)^2 \right)}{\partial \theta_0} = \frac{1}{m} \sum_{j=1}^{m} \frac{\partial \left( f(x^{(j)}) - y^{(j)} \right)^2}{\partial \theta_0} = \frac{2}{m} \sum_{j=1}^{m} \left( f(x^{(j)}) - y^{(j)} \right) = \frac{2}{2} (-1 - 2) = -3$$
$$\theta_0 \leftarrow \theta_0 - \alpha \frac{\partial J(\theta)}{\partial \theta_0} = 0 - 0.1 \times (-3) = 0.3$$
$$\frac{\partial J(\theta)}{\partial \theta_1} = \frac{\partial \left( \frac{1}{m} \sum_{j=1}^{m} \left( f(x^{(j)}) - y^{(j)} \right)^2 \right)}{\partial \theta_1} = \frac{1}{m} \sum_{j=1}^{m} \frac{\partial \left( f(x^{(j)}) - y^{(j)} \right)^2}{\partial \theta_1} = \frac{2}{m} \sum_{j=1}^{m} \left( f(x^{(j)}) - y^{(j)} \right) x^{(j)} = \frac{2}{2} \left( (-1) \times 0 + (-2) \times 1 \right) = -2$$
$$\theta_1 \leftarrow \theta_1 - \alpha \frac{\partial J(\theta)}{\partial \theta_1} = 0 - 0.1 \times (-2) = 0.2$$
$$\frac{\partial J(\theta)}{\partial \theta_2} = \frac{\partial \left( \frac{1}{m} \sum_{j=1}^{m} \left( f(x^{(j)}) - y^{(j)} \right)^2 \right)}{\partial \theta_2} = \frac{1}{m} \sum_{j=1}^{m} \frac{\partial \left( f(x^{(j)}) - y^{(j)} \right)^2}{\partial \theta_2} = \frac{2}{m} \sum_{j=1}^{m} \left( f(x^{(j)}) - y^{(j)} \right) (x^{(j)})^2 = \frac{2}{2} \left( (0 + 0 + 0 - 1) \times 0 + (0 + 0 + 0 - 2) \times 1 \right) = -2$$
$$\theta_2 \leftarrow \theta_2 - \alpha \frac{\partial J(\theta)}{\partial \theta_2} = 0 - 0.1 \times (-2) = 0.2$$
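The three updates for the quadratic model can likewise be checked with a short sketch. It exploits the pattern visible in the derivations above: the gradient with respect to $\theta_k$ is $\frac{2}{m} \sum_j \left( f(x^{(j)}) - y^{(j)} \right) (x^{(j)})^k$ for $k = 0, 1, 2$. The loop structure here is an assumed generalization, not part of the original.

```python
# One gradient-descent step for f(x) = theta0 + theta1*x + theta2*x**2
# on D = {(0,1), (1,2)}, all parameters starting at 0, alpha = 0.1.
X = [0.0, 1.0]
Y = [1.0, 2.0]
m = len(X)
alpha = 0.1
theta = [0.0, 0.0, 0.0]  # theta[k] multiplies x**k

# Residuals f(x_j) - y_j at the initial parameters: [-1, -2]
errors = [theta[0] + theta[1] * x + theta[2] * x**2 - y for x, y in zip(X, Y)]

# dJ/dtheta_k = (2/m) * sum((f(x_j) - y_j) * x_j**k), giving [-3, -2, -2]
grads = [(2 / m) * sum(e * x**k for e, x in zip(errors, X)) for k in range(3)]

theta = [t - alpha * g for t, g in zip(theta, grads)]
print(theta)  # approximately [0.3, 0.2, 0.2]
```

Note that $x^{(1)} = 0$ contributes to the $\theta_0$ gradient only (since $0^0 = 1$ by convention, matching Python's `0.0**0`), which is why all three gradients are dominated by the second data point.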