10-301/601: Introduction to Machine Learning
Lecture 13 – Differentiation
Henry Chai & Matt Gormley & Hoda Heidari
2/26/24
Recall: Neural Networks (Matrix Form)

$y = \sigma\!\left(\boldsymbol{\beta}^T \mathbf{z}^{(2)} + \beta_0\right)$, with $\boldsymbol{\beta} \in \mathbb{R}^{D_2}$ and $\beta_0 \in \mathbb{R}$

$\mathbf{z}^{(2)} = \sigma\!\left((\boldsymbol{\alpha}^{(2)})^T \mathbf{z}^{(1)} + \mathbf{b}^{(2)}\right)$, with $\boldsymbol{\alpha}^{(2)} \in \mathbb{R}^{D_1 \times D_2}$ and $\mathbf{b}^{(2)} \in \mathbb{R}^{D_2}$

$\mathbf{z}^{(1)} = \sigma\!\left((\boldsymbol{\alpha}^{(1)})^T \mathbf{x} + \mathbf{b}^{(1)}\right)$, with $\boldsymbol{\alpha}^{(1)} \in \mathbb{R}^{M \times D_1}$ and $\mathbf{b}^{(1)} \in \mathbb{R}^{D_1}$

(Diagram: output $y$ at the top, hidden units $z_1^{(2)}, \dots, z_{D_2}^{(2)}$ and $z_1^{(1)}, \dots, z_{D_1}^{(1)}$ in the middle, inputs $x_1, \dots, x_M$ at the bottom.)
Recall: Neural Networks (Matrix Form), with the bias terms folded into the weights:

$y = \sigma\!\left(\tilde{\boldsymbol{\beta}}^T \begin{bmatrix} 1 \\ \mathbf{z}^{(2)} \end{bmatrix}\right)$, where $\tilde{\boldsymbol{\beta}} = \begin{bmatrix} \beta_0 \\ \boldsymbol{\beta} \end{bmatrix} \in \mathbb{R}^{D_2 + 1}$

$\mathbf{z}^{(2)} = \sigma\!\left((\tilde{\boldsymbol{\alpha}}^{(2)})^T \begin{bmatrix} 1 \\ \mathbf{z}^{(1)} \end{bmatrix}\right)$, where $\tilde{\boldsymbol{\alpha}}^{(2)} = \begin{bmatrix} (\mathbf{b}^{(2)})^T \\ \boldsymbol{\alpha}^{(2)} \end{bmatrix} \in \mathbb{R}^{(D_1 + 1) \times D_2}$

$\mathbf{z}^{(1)} = \sigma\!\left((\tilde{\boldsymbol{\alpha}}^{(1)})^T \begin{bmatrix} 1 \\ \mathbf{x} \end{bmatrix}\right)$, where $\tilde{\boldsymbol{\alpha}}^{(1)} = \begin{bmatrix} (\mathbf{b}^{(1)})^T \\ \boldsymbol{\alpha}^{(1)} \end{bmatrix} \in \mathbb{R}^{(M + 1) \times D_1}$

(Diagram: same network as above, with hidden units $z_1^{(2)}, \dots, z_{D_2}^{(2)}$, $z_1^{(1)}, \dots, z_{D_1}^{(1)}$ and inputs $x_1, \dots, x_M$.)
Forward Propagation for Making Predictions

Inputs: weights $\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}$ and a query data point $\mathbf{x}'$
Initialize $\mathbf{z}^{(0)} = \mathbf{x}'$
For $l = 1, \dots, L$:
    $\mathbf{a}^{(l)} = (\boldsymbol{\alpha}^{(l)})^T \mathbf{z}^{(l-1)}$
    $\mathbf{z}^{(l)} = \sigma\!\left(\mathbf{a}^{(l)}\right)$
$\hat{y} = \sigma\!\left(\boldsymbol{\beta}^T \mathbf{z}^{(L)}\right)$
Output: the prediction $\hat{y}$
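For concreteness, a minimal NumPy sketch of this procedure (the function names, the sigmoid activation, and the example dimensions are illustrative assumptions, not part of the slides):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(alphas, beta, x_query):
    # alphas: list of weight matrices [alpha_1, ..., alpha_L],
    # where alpha_l has shape (D_{l-1}, D_l); beta has shape (D_L,)
    z = x_query                      # z^(0) = x'
    for alpha_l in alphas:           # for l = 1, ..., L
        a = alpha_l.T @ z            # a^(l) = (alpha^(l))^T z^(l-1)
        z = sigmoid(a)               # z^(l) = sigma(a^(l))
    return sigmoid(beta @ z)         # y_hat = sigma(beta^T z^(L))

# Example usage with random weights (two hidden layers of width 4 and 3):
rng = np.random.default_rng(0)
alphas = [rng.normal(size=(5, 4)), rng.normal(size=(4, 3))]
beta = rng.normal(size=3)
y_hat = forward(alphas, beta, rng.normal(size=5))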
Stochastic Gradient Descent for Learning

Input: $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, learning rate $\gamma$
Initialize all weights $\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}$ (???)
While TERMINATION CRITERION is not satisfied (???):
    For $i \in \text{shuffle}(1, \dots, N)$:
        Compute $g_{\boldsymbol{\beta}} = \nabla_{\boldsymbol{\beta}} J^{(i)}\!\left(\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}\right)$
        For $l = 1, \dots, L$: compute $g_{\boldsymbol{\alpha}^{(l)}} = \nabla_{\boldsymbol{\alpha}^{(l)}} J^{(i)}\!\left(\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}\right)$
        Update $\boldsymbol{\beta} = \boldsymbol{\beta} - \gamma g_{\boldsymbol{\beta}}$
        For $l = 1, \dots, L$: update $\boldsymbol{\alpha}^{(l)} = \boldsymbol{\alpha}^{(l)} - \gamma g_{\boldsymbol{\alpha}^{(l)}}$
Output: $\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}$
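A skeletal Python version of this loop; grad_J is a hypothetical stand-in for whatever routine (e.g., backpropagation) returns the per-example gradients, and the fixed epoch count is just one possible termination criterion:

import numpy as np

def sgd(data, gamma, alphas, beta, grad_J, num_epochs=10):
    # data: list of (x_i, y_i) pairs; gamma: learning rate
    # grad_J(x_i, y_i, alphas, beta) -> (g_alphas, g_beta), per-example gradients
    rng = np.random.default_rng(0)
    for _ in range(num_epochs):                  # stand-in termination criterion
        for i in rng.permutation(len(data)):     # shuffle(1, ..., N)
            x_i, y_i = data[i]
            g_alphas, g_beta = grad_J(x_i, y_i, alphas, beta)
            beta = beta - gamma * g_beta         # update beta
            for l in range(len(alphas)):         # update each alpha^(l)
                alphas[l] = alphas[l] - gamma * g_alphas[l]
    return alphas, beta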
Two questions about this procedure:
1. What is this loss function $J^{(i)}$?
2. How on earth do we take these gradients?
Loss Functions for Neural Networks

Let $\boldsymbol{\Theta} = \left\{\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}\right\}$ be the parameters of our neural network.

Regression - squared error (same as linear regression!):
$J^{(i)}(\boldsymbol{\Theta}) = \left(\hat{y}_{\boldsymbol{\Theta}}(\mathbf{x}^{(i)}) - y^{(i)}\right)^2$

Binary classification - cross-entropy loss (same as logistic regression!):
Assume $Y \in \{0, 1\}$ and $P(Y = 1 \mid \mathbf{x}, \boldsymbol{\Theta}) = \hat{y}_{\boldsymbol{\Theta}}(\mathbf{x})$. Then
$J^{(i)}(\boldsymbol{\Theta}) = -\log P\!\left(y^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\Theta}\right)$
$= -\log\!\left[\hat{y}_{\boldsymbol{\Theta}}(\mathbf{x}^{(i)})^{y^{(i)}}\left(1 - \hat{y}_{\boldsymbol{\Theta}}(\mathbf{x}^{(i)})\right)^{1 - y^{(i)}}\right]$
$= -\left[y^{(i)} \log \hat{y}_{\boldsymbol{\Theta}}(\mathbf{x}^{(i)}) + \left(1 - y^{(i)}\right) \log\!\left(1 - \hat{y}_{\boldsymbol{\Theta}}(\mathbf{x}^{(i)})\right)\right]$
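A small sketch of these two per-example losses in NumPy (function names are illustrative; the clipping is only there to avoid log(0)):

import numpy as np

def squared_error(y_hat, y):
    # J^(i) = (y_hat - y)^2
    return (y_hat - y) ** 2

def binary_cross_entropy(y_hat, y, eps=1e-12):
    # J^(i) = -[y log y_hat + (1 - y) log(1 - y_hat)]
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep the logs finite
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))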
Multi-class classification - cross-entropy loss again!
Express the label as a one-hot or one-of-$C$ vector, e.g., $\mathbf{y} = \begin{bmatrix} 0 & 0 & 1 & 0 & \cdots & 0 \end{bmatrix}$
Assume the neural network output is also a vector of length $C$, $\hat{\mathbf{y}}_{\boldsymbol{\Theta}}(\mathbf{x})$, with
$P\!\left(\mathbf{y}[c] = 1 \mid \mathbf{x}, \boldsymbol{\Theta}\right) = \hat{\mathbf{y}}_{\boldsymbol{\Theta}}(\mathbf{x})[c]$
Then the cross-entropy loss is
$J^{(i)}(\boldsymbol{\Theta}) = -\log P\!\left(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\Theta}\right) = -\sum_{c=1}^{C} \mathbf{y}^{(i)}[c] \log \hat{\mathbf{y}}_{\boldsymbol{\Theta}}(\mathbf{x}^{(i)})[c]$
Okay, but how do we get our network to output this vector of per-class probabilities $\hat{\mathbf{y}}_{\boldsymbol{\Theta}}(\mathbf{x})$?
Softmax

$\hat{y}_c = \frac{\exp(b_c)}{\sum_{k=1}^{C} \exp(b_k)}$, where $b_c = \sum_{j=0}^{D} \beta_{c,j} z_j$, $\quad z_j = \sigma(a_j)$, $\quad a_j = \sum_{i=0}^{M} \alpha_{j,i} x_i$
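A minimal sketch of the softmax and the matching cross-entropy in NumPy; subtracting the max before exponentiating is a standard numerical-stability trick and does not change the result:

import numpy as np

def softmax(b):
    # y_hat_c = exp(b_c) / sum_k exp(b_k), computed stably
    b = b - np.max(b)              # subtracting a constant leaves the ratio unchanged
    e = np.exp(b)
    return e / np.sum(e)

def cross_entropy(y_onehot, y_hat, eps=1e-12):
    # J^(i) = -sum_c y[c] log y_hat[c]
    return -np.sum(y_onehot * np.log(y_hat + eps))

# Example: scores b for C = 4 classes, true class is the third one
b = np.array([1.0, -0.5, 2.0, 0.3])
y = np.array([0.0, 0.0, 1.0, 0.0])
loss = cross_entropy(y, softmax(b))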
That answers the first question. Returning to the second: how on earth do we take these gradients?
Matrix Calculus: Types of Derivatives

(Table courtesy of Matt Gormley: derivatives are organized by the type of the numerator (scalar, vector, or matrix) and the type of the denominator (scalar, vector, or matrix).)
Matrix Calculus: Denominator Layout

Derivatives of a scalar always have the same shape as the entity that the derivative is being taken with respect to.

(Table courtesy of Matt Gormley: a scalar numerator differentiated with respect to a scalar, vector, or matrix denominator.)
(Table courtesy of Matt Gormley: types of derivatives in denominator layout for scalar and vector numerators and denominators.)
Three Approaches to Differentiation

Given $f: \mathbb{R}^D \to \mathbb{R}$, compute $\nabla_{\mathbf{x}} f(\mathbf{x}) = \partial f(\mathbf{x}) / \partial \mathbf{x}$

1. Finite difference method
   - Requires the ability to call $f(\mathbf{x})$
   - Great for checking the accuracy of implementations of more complex differentiation methods
   - Computationally expensive for high-dimensional inputs
2. Symbolic differentiation
   - Requires systematic knowledge of derivatives
   - Can be computationally expensive if poorly implemented
3. Automatic differentiation (reverse mode)
   - Requires systematic knowledge of derivatives and an algorithm for computing $f(\mathbf{x})$
   - Computational cost of computing $\partial f(\mathbf{x}) / \partial \mathbf{x}$ is proportional to the cost of computing $f(\mathbf{x})$
Approach 1: Finite Difference Method

Given $f: \mathbb{R}^D \to \mathbb{R}$, compute $\nabla_{\mathbf{x}} f(\mathbf{x}) = \partial f(\mathbf{x}) / \partial \mathbf{x}$:

$\frac{\partial f(\mathbf{x})}{\partial x_i} \approx \frac{f(\mathbf{x} + \epsilon \mathbf{d}_i) - f(\mathbf{x} - \epsilon \mathbf{d}_i)}{2\epsilon}$

where $\mathbf{d}_i$ is a one-hot vector with a 1 in the $i$th position.

(Figure: the two-sided difference estimates the slope of $f$ at $x$ from evaluations $\epsilon$ to either side of $x$.)

We want $\epsilon$ to be small to get a good approximation, but we run into floating point issues when $\epsilon$ is too small.
Getting the full gradient requires computing the above approximation for each dimension of the input.
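A generic version of this estimator, often used as a gradient checker; the function name and the default epsilon are illustrative choices:

import numpy as np

def finite_difference_grad(f, x, eps=1e-6):
    # Approximate the gradient of f: R^D -> R at x with central differences.
    # Requires 2D calls to f, one pair per input dimension.
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        d = np.zeros_like(x, dtype=float)
        d[i] = 1.0                               # one-hot direction d_i
        grad[i] = (f(x + eps * d) - f(x - eps * d)) / (2 * eps)
    return grad

# Example: f(x) = ||x||^2 has gradient 2x
x0 = np.array([1.0, -2.0, 0.5])
approx = finite_difference_grad(lambda v: np.sum(v ** 2), x0)   # ~ [2, -4, 1]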
Approach 1: Finite Difference Method Example

Given
$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$,
what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

>>> from math import *
>>> y = lambda x,z: exp(x*z)+(x*z)/log(x)+sin(log(x))/(x*z)
>>> x = 2
>>> z = 3
>>> e = 10**-8
>>> dydx = (y(x+e,z)-y(x-e,z))/(2*e)
>>> dydz = (y(x,z+e)-y(x,z-e))/(2*e)
>>> print(dydx, dydz)

(Example courtesy of Matt Gormley)
Approach 2: Symbolic Differentiation

Given
$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$,
what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

Looks like we're gonna need the chain rule!

(Example courtesy of Matt Gormley)
The Chain Rule of Calculus

If $y = f(z)$ and $z = g(x)$, then the corresponding computation graph is $x \to z \to y$, and
$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial z} \frac{\partial z}{\partial x}$

If $y = f(z_1, z_2)$ with $z_1 = g_1(x)$ and $z_2 = g_2(x)$, then the computation graph has two paths from $x$ to $y$ (one through $z_1$, one through $z_2$), and
$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial z_1} \frac{\partial z_1}{\partial x} + \frac{\partial y}{\partial z_2} \frac{\partial z_2}{\partial x}$

If $y = f(\mathbf{z})$ and $\mathbf{z} = g(x)$ with $\mathbf{z} \in \mathbb{R}^D$, then the graph has $D$ paths from $x$ to $y$, and
$\frac{\partial y}{\partial x} = \sum_{j=1}^{D} \frac{\partial y}{\partial z_j} \frac{\partial z_j}{\partial x}$
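A quick numerical illustration of the last case, using the hypothetical example $y = z_1 z_2$ with $z_1 = x^2$ and $z_2 = \sin x$:

from math import sin, cos

x = 1.3
z1, z2 = x ** 2, sin(x)          # z = g(x)
y = z1 * z2                      # y = f(z)

# Chain rule: dy/dx = (dy/dz1)(dz1/dx) + (dy/dz2)(dz2/dx)
dy_dx_chain = z2 * (2 * x) + z1 * cos(x)

# Direct differentiation of y = x^2 sin(x) for comparison
dy_dx_direct = 2 * x * sin(x) + x ** 2 * cos(x)

print(dy_dx_chain, dy_dx_direct)   # the two values agree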
Poll Question 1: If $y = f(\mathbf{z})$, $\mathbf{z} = g(\mathbf{w})$, and $\mathbf{w} = h(x)$, does the equation
$\frac{\partial y}{\partial x} = \sum_{j=1}^{D} \frac{\partial y}{\partial z_j} \frac{\partial z_j}{\partial x}$
still hold?
A. Yes
B. No
C. Only on Fridays (TOXIC)
Approach 2: Symbolic Differentiation

Given
$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$,
what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

$\frac{\partial y}{\partial x} = \frac{\partial}{\partial x}\!\left[e^{xz}\right] + \frac{\partial}{\partial x}\!\left[\frac{xz}{\ln x}\right] + \frac{\partial}{\partial x}\!\left[\frac{\sin(\ln x)}{xz}\right]$
$= z e^{xz} + \frac{z}{\ln x} - \frac{z}{(\ln x)^2} + \frac{\cos(\ln x)}{x^2 z} - \frac{\sin(\ln x)}{x^2 z}$
At $x = 2$, $z = 3$: $\frac{\partial y}{\partial x} = 3e^6 + \frac{3}{\ln 2} - \frac{3}{(\ln 2)^2} + \frac{\cos(\ln 2)}{12} - \frac{\sin(\ln 2)}{12}$

$\frac{\partial y}{\partial z} = \frac{\partial}{\partial z}\!\left[e^{xz}\right] + \frac{\partial}{\partial z}\!\left[\frac{xz}{\ln x}\right] + \frac{\partial}{\partial z}\!\left[\frac{\sin(\ln x)}{xz}\right]$
$= x e^{xz} + \frac{x}{\ln x} - \frac{\sin(\ln x)}{x z^2}$
At $x = 2$, $z = 3$: $\frac{\partial y}{\partial z} = 2e^6 + \frac{2}{\ln 2} - \frac{\sin(\ln 2)}{18}$

(Example courtesy of Matt Gormley)
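The same derivatives can also be obtained with a computer algebra system; this sketch uses SymPy, which is not part of the lecture but is one convenient way to do symbolic differentiation in Python:

import sympy as sp

x, z = sp.symbols('x z', positive=True)
y = sp.exp(x * z) + x * z / sp.log(x) + sp.sin(sp.log(x)) / (x * z)

dy_dx = sp.diff(y, x)
dy_dz = sp.diff(y, z)

# Evaluate at x = 2, z = 3; these match the finite difference estimates above
print(dy_dx.subs({x: 2, z: 3}).evalf())
print(dy_dz.subs({x: 2, z: 3}).evalf())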
Approach 3: Automatic Differentiation (reverse mode)

Given
$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$,
what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

First define some intermediate quantities, draw the computation graph, and run the "forward" computation:
$a = xz$
$b = \ln x$
$c = \sin b$
$d = e^a$
$e = a / b$
$f = c / a$
$y = d + e + f$

(Computation graph: $x = 2$ and $z = 3$ feed $a$ (via $*$); $x$ also feeds $b$ (via $\ln$); $b$ feeds $c$ (via $\sin$); $a$ feeds $d$ (via $\exp$); $a$ and $b$ feed $e$ (via $/$); $c$ and $a$ feed $f$ (via $/$); $d$, $e$, $f$ feed $y$ (via $+$). Example courtesy of Matt Gormley.)
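Numerically, the forward pass just evaluates each intermediate quantity in topological order; a minimal sketch at $x = 2$, $z = 3$:

from math import exp, log, sin

x, z = 2.0, 3.0
a = x * z          # a = xz
b = log(x)         # b = ln x
c = sin(b)         # c = sin b
d = exp(a)         # d = e^a
e = a / b          # e = a / b
f = c / a          # f = c / a
y = d + e + f      # y = d + e + f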
Approach 3: Automatic Differentiation (reverse mode)

Then compute partial derivatives, starting from $y$ and working back through the computation graph.

(The same computation graph as above, traversed in reverse; example courtesy of Matt Gormley.)
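Continuing the sketch above, a hand-written backward pass computes an adjoint $\bar{u} = \partial y / \partial u$ for every node, reusing the forward-pass values; each adjoint sums the contributions from all of the node's children in the graph:

from math import exp, log, sin, cos

# Forward pass (as above)
x, z = 2.0, 3.0
a = x * z
b = log(x)
c = sin(b)
d = exp(a)
e = a / b
f = c / a
y = d + e + f

# Backward pass: adjoints u_bar = dy/du, from the output back to the inputs
y_bar = 1.0
d_bar = y_bar * 1.0                      # y = d + e + f
e_bar = y_bar * 1.0
f_bar = y_bar * 1.0
c_bar = f_bar * (1.0 / a)                # f = c / a
a_bar = (d_bar * exp(a)                  # a feeds d = e^a
         + e_bar * (1.0 / b)             # a feeds e = a / b
         + f_bar * (-c / a ** 2))        # a feeds f = c / a
b_bar = (e_bar * (-a / b ** 2)           # b feeds e = a / b
         + c_bar * cos(b))               # b feeds c = sin b
x_bar = a_bar * z + b_bar * (1.0 / x)    # x feeds a and b
z_bar = a_bar * x                        # z feeds a

print(x_bar, z_bar)   # match the finite difference and symbolic results above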
Automatic Differentiation

Given $f: \mathbb{R}^D \to \mathbb{R}^C$, compute $\nabla_{\mathbf{x}} f(\mathbf{x}) = \partial f(\mathbf{x}) / \partial \mathbf{x}$

3. Automatic differentiation (reverse mode)
   - Requires systematic knowledge of derivatives and an algorithm for computing $f(\mathbf{x})$
   - Computational cost of computing $\nabla_{\mathbf{x}} f(\mathbf{x})_c = \partial f(\mathbf{x})_c / \partial \mathbf{x}$ (the gradient of one output) is proportional to the cost of computing $f(\mathbf{x})$
   - Great for high-dimensional inputs and low-dimensional outputs ($D \gg C$)
4. Automatic differentiation (forward mode)
   - Requires systematic knowledge of derivatives and an algorithm for computing $f(\mathbf{x})$
   - Computational cost of computing $\partial f(\mathbf{x}) / \partial x_d$ (all outputs' derivatives with respect to one input) is proportional to the cost of computing $f(\mathbf{x})$
   - Great for low-dimensional inputs and high-dimensional outputs ($D \ll C$)
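For contrast, here is a toy forward-mode implementation based on dual numbers (a sketch of the idea, not course code): each value carries its derivative with respect to one chosen input, so one forward pass yields $\partial y / \partial x$ and a second pass yields $\partial y / \partial z$.

import math

class Dual:
    # A value together with its derivative w.r.t. one chosen input.
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    def __truediv__(self, other):
        return Dual(self.val / other.val,
                    (self.dot * other.val - self.val * other.dot) / other.val ** 2)

def d_exp(u): return Dual(math.exp(u.val), math.exp(u.val) * u.dot)
def d_log(u): return Dual(math.log(u.val), u.dot / u.val)
def d_sin(u): return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

def func(x, z):
    # y = e^(xz) + xz / ln(x) + sin(ln(x)) / (xz)
    return d_exp(x * z) + (x * z) / d_log(x) + d_sin(d_log(x)) / (x * z)

# Seed the input we are differentiating with respect to with dot = 1
dy_dx = func(Dual(2.0, 1.0), Dual(3.0, 0.0)).dot
dy_dz = func(Dual(2.0, 0.0), Dual(3.0, 1.0)).dot
print(dy_dx, dy_dz)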
Computation Graph: 10-301/601 Conventions

The diagram represents an algorithm.
- Nodes are rectangles, with one node per intermediate variable in the algorithm
- Each node is labeled with the function that it computes (inside the box) and the variable name (outside the box)
- Edges are directed and do not have labels
- For neural networks: each weight, feature value, label, and bias term appears as a node, and we can include the loss function
Neural Network Diagram Conventions

The diagram represents a neural network.
- Nodes are circles, with one node per hidden unit
- Each node is labeled with the variable corresponding to the hidden unit
- Edges are directed, and each edge is labeled with its weight
- Following standard convention, the bias term is typically not shown as a node; rather, it is assumed to be part of the activation function, i.e., its weight does not appear in the picture anywhere
- The diagram typically does not include any nodes related to the loss computation
Backprop Learning Objectives

You should be able to…
- Differentiate between a neural network diagram and a computation graph
- Construct a computation graph for a function as specified by an algorithm
- Carry out backpropagation on an arbitrary computation graph
- Construct a computation graph for a neural network, identifying all the given and intermediate quantities that are relevant
- Instantiate the backpropagation algorithm for a neural network
- Instantiate an optimization method (e.g., SGD) and a regularizer (e.g., L2) when the parameters of a model are comprised of several matrices corresponding to different layers of a neural network
- Use the finite difference method to evaluate the gradient of a function
- Identify when the gradient of a function can be computed at all and when it can be computed efficiently
- Employ basic matrix calculus to compute vector/matrix/tensor derivatives