10-301/601: Introduction to Machine Learning
Lecture 13 – Differentiation
Henry Chai & Matt Gormley & Hoda Heidari
2/26/24
Recall: Neural Networks (Matrix Form)

$y = \sigma\!\left(\boldsymbol{\beta}^T \mathbf{z}^{(2)} + \beta_0\right)$, with $\boldsymbol{\beta} \in \mathbb{R}^{D_2}$ and $\beta_0 \in \mathbb{R}$

$\mathbf{z}^{(2)} = \sigma\!\left((\boldsymbol{\alpha}^{(2)})^T \mathbf{z}^{(1)} + \mathbf{b}^{(2)}\right)$, with $\boldsymbol{\alpha}^{(2)} \in \mathbb{R}^{D_1 \times D_2}$ and $\mathbf{b}^{(2)} \in \mathbb{R}^{D_2}$

$\mathbf{z}^{(1)} = \sigma\!\left((\boldsymbol{\alpha}^{(1)})^T \mathbf{x} + \mathbf{b}^{(1)}\right)$, with $\boldsymbol{\alpha}^{(1)} \in \mathbb{R}^{M \times D_1}$ and $\mathbf{b}^{(1)} \in \mathbb{R}^{D_1}$

(Diagram: output $y$ at the top, hidden units $z_1^{(2)}, \dots, z_{D_2}^{(2)}$ and $z_1^{(1)}, \dots, z_{D_1}^{(1)}$ in the middle, inputs $x_1, \dots, x_M$ at the bottom.)
Recall: Neural Networks (Matrix Form), with the bias terms folded into the weights:

$y = \sigma\!\left(\tilde{\boldsymbol{\beta}}^T \begin{bmatrix} 1 \\ \mathbf{z}^{(2)} \end{bmatrix}\right)$, where $\tilde{\boldsymbol{\beta}} = \begin{bmatrix} \beta_0 \\ \boldsymbol{\beta} \end{bmatrix} \in \mathbb{R}^{D_2 + 1}$

$\mathbf{z}^{(2)} = \sigma\!\left((\tilde{\boldsymbol{\alpha}}^{(2)})^T \begin{bmatrix} 1 \\ \mathbf{z}^{(1)} \end{bmatrix}\right)$, where $\tilde{\boldsymbol{\alpha}}^{(2)} = \begin{bmatrix} (\mathbf{b}^{(2)})^T \\ \boldsymbol{\alpha}^{(2)} \end{bmatrix} \in \mathbb{R}^{(D_1 + 1) \times D_2}$

$\mathbf{z}^{(1)} = \sigma\!\left((\tilde{\boldsymbol{\alpha}}^{(1)})^T \begin{bmatrix} 1 \\ \mathbf{x} \end{bmatrix}\right)$, where $\tilde{\boldsymbol{\alpha}}^{(1)} = \begin{bmatrix} (\mathbf{b}^{(1)})^T \\ \boldsymbol{\alpha}^{(1)} \end{bmatrix} \in \mathbb{R}^{(M + 1) \times D_1}$

(Diagram: same network as above, with hidden units $z_1^{(2)}, \dots, z_{D_2}^{(2)}$, $z_1^{(1)}, \dots, z_{D_1}^{(1)}$ and inputs $x_1, \dots, x_M$.)
Forward Propagation for Making Predictions

Inputs: weights $\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}$ and a query data point $\mathbf{x}'$
Initialize $\mathbf{z}^{(0)} = \mathbf{x}'$
For $l = 1, \dots, L$:
    $\mathbf{a}^{(l)} = (\boldsymbol{\alpha}^{(l)})^T \mathbf{z}^{(l-1)}$
    $\mathbf{z}^{(l)} = \sigma\!\left(\mathbf{a}^{(l)}\right)$
$\hat{y} = \sigma\!\left(\boldsymbol{\beta}^T \mathbf{z}^{(L)}\right)$
Output: the prediction $\hat{y}$
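For concreteness, a minimal NumPy sketch of this procedure (the function names, the sigmoid activation, and the example dimensions are illustrative assumptions, not part of the slides):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(alphas, beta, x_query):
    # alphas: list of weight matrices [alpha_1, ..., alpha_L],
    # where alpha_l has shape (D_{l-1}, D_l); beta has shape (D_L,)
    z = x_query                      # z^(0) = x'
    for alpha_l in alphas:           # for l = 1, ..., L
        a = alpha_l.T @ z            # a^(l) = (alpha^(l))^T z^(l-1)
        z = sigmoid(a)               # z^(l) = sigma(a^(l))
    return sigmoid(beta @ z)         # y_hat = sigma(beta^T z^(L))

# Example usage with random weights (two hidden layers of width 4 and 3):
rng = np.random.default_rng(0)
alphas = [rng.normal(size=(5, 4)), rng.normal(size=(4, 3))]
beta = rng.normal(size=3)
y_hat = forward(alphas, beta, rng.normal(size=5))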
Stochastic Gradient Descent for Learning

Input: $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, learning rate $\gamma$
Initialize all weights $\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}$ (???)
While TERMINATION CRITERION is not satisfied (???):
    For $i \in \text{shuffle}(1, \dots, N)$:
        Compute $g_{\boldsymbol{\beta}} = \nabla_{\boldsymbol{\beta}} J^{(i)}\!\left(\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}\right)$
        For $l = 1, \dots, L$: compute $g_{\boldsymbol{\alpha}^{(l)}} = \nabla_{\boldsymbol{\alpha}^{(l)}} J^{(i)}\!\left(\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}\right)$
        Update $\boldsymbol{\beta} = \boldsymbol{\beta} - \gamma g_{\boldsymbol{\beta}}$
        For $l = 1, \dots, L$: update $\boldsymbol{\alpha}^{(l)} = \boldsymbol{\alpha}^{(l)} - \gamma g_{\boldsymbol{\alpha}^{(l)}}$
Output: $\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}$
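A skeletal Python version of this loop; grad_J is a hypothetical stand-in for whatever routine (e.g., backpropagation) returns the per-example gradients, and the fixed epoch count is just one possible termination criterion:

import numpy as np

def sgd(data, gamma, alphas, beta, grad_J, num_epochs=10):
    # data: list of (x_i, y_i) pairs; gamma: learning rate
    # grad_J(x_i, y_i, alphas, beta) -> (g_alphas, g_beta), per-example gradients
    rng = np.random.default_rng(0)
    for _ in range(num_epochs):                  # stand-in termination criterion
        for i in rng.permutation(len(data)):     # shuffle(1, ..., N)
            x_i, y_i = data[i]
            g_alphas, g_beta = grad_J(x_i, y_i, alphas, beta)
            beta = beta - gamma * g_beta         # update beta
            for l in range(len(alphas)):         # update each alpha^(l)
                alphas[l] = alphas[l] - gamma * g_alphas[l]
    return alphas, beta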
Two questions about this procedure:
1. What is this loss function $J^{(i)}$?
2. How on earth do we take these gradients?
Loss Functions for Neural Networks

Let $\boldsymbol{\Theta} = \left\{\boldsymbol{\alpha}^{(1)}, \dots, \boldsymbol{\alpha}^{(L)}, \boldsymbol{\beta}\right\}$ be the parameters of our neural network.

Regression - squared error (same as linear regression!):
$J^{(i)}(\boldsymbol{\Theta}) = \left(\hat{y}_{\boldsymbol{\Theta}}(\mathbf{x}^{(i)}) - y^{(i)}\right)^2$

Binary classification - cross-entropy loss (same as logistic regression!):
Assume $Y \in \{0, 1\}$ and $P(Y = 1 \mid \mathbf{x}, \boldsymbol{\Theta}) = \hat{y}_{\boldsymbol{\Theta}}(\mathbf{x})$. Then
$J^{(i)}(\boldsymbol{\Theta}) = -\log P\!\left(y^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\Theta}\right)$
$= -\log\!\left[\hat{y}_{\boldsymbol{\Theta}}(\mathbf{x}^{(i)})^{y^{(i)}}\left(1 - \hat{y}_{\boldsymbol{\Theta}}(\mathbf{x}^{(i)})\right)^{1 - y^{(i)}}\right]$
$= -\left[y^{(i)} \log \hat{y}_{\boldsymbol{\Theta}}(\mathbf{x}^{(i)}) + \left(1 - y^{(i)}\right) \log\!\left(1 - \hat{y}_{\boldsymbol{\Theta}}(\mathbf{x}^{(i)})\right)\right]$
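A small sketch of these two per-example losses in NumPy (function names are illustrative; the clipping is only there to avoid log(0)):

import numpy as np

def squared_error(y_hat, y):
    # J^(i) = (y_hat - y)^2
    return (y_hat - y) ** 2

def binary_cross_entropy(y_hat, y, eps=1e-12):
    # J^(i) = -[y log y_hat + (1 - y) log(1 - y_hat)]
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep the logs finite
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))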
Multi-class classification - cross-entropy loss again!
Express the label as a one-hot or one-of-$C$ vector, e.g., $\mathbf{y} = \begin{bmatrix} 0 & 0 & 1 & 0 & \cdots & 0 \end{bmatrix}$
Assume the neural network output is also a vector of length $C$, $\hat{\mathbf{y}}_{\boldsymbol{\Theta}}(\mathbf{x})$, with
$P\!\left(\mathbf{y}[c] = 1 \mid \mathbf{x}, \boldsymbol{\Theta}\right) = \hat{\mathbf{y}}_{\boldsymbol{\Theta}}(\mathbf{x})[c]$
Then the cross-entropy loss is
$J^{(i)}(\boldsymbol{\Theta}) = -\log P\!\left(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\Theta}\right) = -\sum_{c=1}^{C} \mathbf{y}^{(i)}[c] \log \hat{\mathbf{y}}_{\boldsymbol{\Theta}}(\mathbf{x}^{(i)})[c]$
Okay, but how do we get our network to output this vector of per-class probabilities $\hat{\mathbf{y}}_{\boldsymbol{\Theta}}(\mathbf{x})$?
Softmax

$\hat{y}_c = \frac{\exp(b_c)}{\sum_{k=1}^{C} \exp(b_k)}$, where $b_c = \sum_{j=0}^{D} \beta_{c,j} z_j$, $\quad z_j = \sigma(a_j)$, $\quad a_j = \sum_{i=0}^{M} \alpha_{j,i} x_i$
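A minimal sketch of the softmax and the matching cross-entropy in NumPy; subtracting the max before exponentiating is a standard numerical-stability trick and does not change the result:

import numpy as np

def softmax(b):
    # y_hat_c = exp(b_c) / sum_k exp(b_k), computed stably
    b = b - np.max(b)              # subtracting a constant leaves the ratio unchanged
    e = np.exp(b)
    return e / np.sum(e)

def cross_entropy(y_onehot, y_hat, eps=1e-12):
    # J^(i) = -sum_c y[c] log y_hat[c]
    return -np.sum(y_onehot * np.log(y_hat + eps))

# Example: scores b for C = 4 classes, true class is the third one
b = np.array([1.0, -0.5, 2.0, 0.3])
y = np.array([0.0, 0.0, 1.0, 0.0])
loss = cross_entropy(y, softmax(b))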
That answers the first question. Returning to the second: how on earth do we take these gradients?
Matrix Calculus: Types of Derivatives

(Table courtesy of Matt Gormley: derivatives are organized by the type of the numerator (scalar, vector, or matrix) and the type of the denominator (scalar, vector, or matrix).)
Matrix Calculus: Denominator Layout

Derivatives of a scalar always have the same shape as the entity that the derivative is being taken with respect to.

(Table courtesy of Matt Gormley: a scalar numerator differentiated with respect to a scalar, vector, or matrix denominator.)
(Table courtesy of Matt Gormley: types of derivatives in denominator layout for scalar and vector numerators and denominators.)
Three Approaches to Differentiation

Given $f: \mathbb{R}^D \to \mathbb{R}$, compute $\nabla_{\mathbf{x}} f(\mathbf{x}) = \partial f(\mathbf{x}) / \partial \mathbf{x}$

1. Finite difference method
   - Requires the ability to call $f(\mathbf{x})$
   - Great for checking the accuracy of implementations of more complex differentiation methods
   - Computationally expensive for high-dimensional inputs
2. Symbolic differentiation
   - Requires systematic knowledge of derivatives
   - Can be computationally expensive if poorly implemented
3. Automatic differentiation (reverse mode)
   - Requires systematic knowledge of derivatives and an algorithm for computing $f(\mathbf{x})$
   - Computational cost of computing $\partial f(\mathbf{x}) / \partial \mathbf{x}$ is proportional to the cost of computing $f(\mathbf{x})$
Approach 1: Finite Difference Method

Given $f: \mathbb{R}^D \to \mathbb{R}$, compute $\nabla_{\mathbf{x}} f(\mathbf{x}) = \partial f(\mathbf{x}) / \partial \mathbf{x}$:

$\frac{\partial f(\mathbf{x})}{\partial x_i} \approx \frac{f(\mathbf{x} + \epsilon \mathbf{d}_i) - f(\mathbf{x} - \epsilon \mathbf{d}_i)}{2\epsilon}$

where $\mathbf{d}_i$ is a one-hot vector with a 1 in the $i$th position.

(Figure: the two-sided difference estimates the slope of $f$ at $x$ from evaluations $\epsilon$ to either side of $x$.)

We want $\epsilon$ to be small to get a good approximation, but we run into floating point issues when $\epsilon$ is too small.
Getting the full gradient requires computing the above approximation for each dimension of the input.
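A generic version of this estimator, often used as a gradient checker; the function name and the default epsilon are illustrative choices:

import numpy as np

def finite_difference_grad(f, x, eps=1e-6):
    # Approximate the gradient of f: R^D -> R at x with central differences.
    # Requires 2D calls to f, one pair per input dimension.
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        d = np.zeros_like(x, dtype=float)
        d[i] = 1.0                               # one-hot direction d_i
        grad[i] = (f(x + eps * d) - f(x - eps * d)) / (2 * eps)
    return grad

# Example: f(x) = ||x||^2 has gradient 2x
x0 = np.array([1.0, -2.0, 0.5])
approx = finite_difference_grad(lambda v: np.sum(v ** 2), x0)   # ~ [2, -4, 1]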
Approach 1: Finite Difference Method Example

Given
$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$,
what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

>>> from math import *
>>> y = lambda x,z: exp(x*z)+(x*z)/log(x)+sin(log(x))/(x*z)
>>> x = 2
>>> z = 3
>>> e = 10**-8
>>> dydx = (y(x+e,z)-y(x-e,z))/(2*e)
>>> dydz = (y(x,z+e)-y(x,z-e))/(2*e)
>>> print(dydx, dydz)

(Example courtesy of Matt Gormley)
Approach 2: Symbolic Differentiation

Given
$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$,
what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

Looks like we're gonna need the chain rule!

(Example courtesy of Matt Gormley)
The Chain Rule of Calculus

If $y = f(z)$ and $z = g(x)$, then the corresponding computation graph is $x \to z \to y$, and
$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial z} \frac{\partial z}{\partial x}$

If $y = f(z_1, z_2)$ with $z_1 = g_1(x)$ and $z_2 = g_2(x)$, then the computation graph has two paths from $x$ to $y$ (one through $z_1$, one through $z_2$), and
$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial z_1} \frac{\partial z_1}{\partial x} + \frac{\partial y}{\partial z_2} \frac{\partial z_2}{\partial x}$

If $y = f(\mathbf{z})$ and $\mathbf{z} = g(x)$ with $\mathbf{z} \in \mathbb{R}^D$, then the graph has $D$ paths from $x$ to $y$, and
$\frac{\partial y}{\partial x} = \sum_{j=1}^{D} \frac{\partial y}{\partial z_j} \frac{\partial z_j}{\partial x}$
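A quick numerical illustration of the last case, using the hypothetical example $y = z_1 z_2$ with $z_1 = x^2$ and $z_2 = \sin x$:

from math import sin, cos

x = 1.3
z1, z2 = x ** 2, sin(x)          # z = g(x)
y = z1 * z2                      # y = f(z)

# Chain rule: dy/dx = (dy/dz1)(dz1/dx) + (dy/dz2)(dz2/dx)
dy_dx_chain = z2 * (2 * x) + z1 * cos(x)

# Direct differentiation of y = x^2 sin(x) for comparison
dy_dx_direct = 2 * x * sin(x) + x ** 2 * cos(x)

print(dy_dx_chain, dy_dx_direct)   # the two values agree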
Poll Question 1: If $y = f(\mathbf{z})$, $\mathbf{z} = g(\mathbf{w})$, and $\mathbf{w} = h(x)$, does the equation
$\frac{\partial y}{\partial x} = \sum_{j=1}^{D} \frac{\partial y}{\partial z_j} \frac{\partial z_j}{\partial x}$
still hold?
A. Yes
B. No
C. Only on Fridays (TOXIC)
Approach 2: Symbolic Differentiation

Given
$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$,
what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

$\frac{\partial y}{\partial x} = \frac{\partial}{\partial x}\!\left[e^{xz}\right] + \frac{\partial}{\partial x}\!\left[\frac{xz}{\ln x}\right] + \frac{\partial}{\partial x}\!\left[\frac{\sin(\ln x)}{xz}\right]$
$= z e^{xz} + \frac{z}{\ln x} - \frac{z}{(\ln x)^2} + \frac{\cos(\ln x)}{x^2 z} - \frac{\sin(\ln x)}{x^2 z}$
At $x = 2$, $z = 3$: $\frac{\partial y}{\partial x} = 3e^6 + \frac{3}{\ln 2} - \frac{3}{(\ln 2)^2} + \frac{\cos(\ln 2)}{12} - \frac{\sin(\ln 2)}{12}$

$\frac{\partial y}{\partial z} = \frac{\partial}{\partial z}\!\left[e^{xz}\right] + \frac{\partial}{\partial z}\!\left[\frac{xz}{\ln x}\right] + \frac{\partial}{\partial z}\!\left[\frac{\sin(\ln x)}{xz}\right]$
$= x e^{xz} + \frac{x}{\ln x} - \frac{\sin(\ln x)}{x z^2}$
At $x = 2$, $z = 3$: $\frac{\partial y}{\partial z} = 2e^6 + \frac{2}{\ln 2} - \frac{\sin(\ln 2)}{18}$

(Example courtesy of Matt Gormley)
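The same derivatives can also be obtained with a computer algebra system; this sketch uses SymPy, which is not part of the lecture but is one convenient way to do symbolic differentiation in Python:

import sympy as sp

x, z = sp.symbols('x z', positive=True)
y = sp.exp(x * z) + x * z / sp.log(x) + sp.sin(sp.log(x)) / (x * z)

dy_dx = sp.diff(y, x)
dy_dz = sp.diff(y, z)

# Evaluate at x = 2, z = 3; these match the finite difference estimates above
print(dy_dx.subs({x: 2, z: 3}).evalf())
print(dy_dz.subs({x: 2, z: 3}).evalf())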
Approach 3: Automatic Differentiation (reverse mode)

Given
$y = f(x, z) = e^{xz} + \frac{xz}{\ln x} + \frac{\sin(\ln x)}{xz}$,
what are $\partial y / \partial x$ and $\partial y / \partial z$ at $x = 2$, $z = 3$?

First define some intermediate quantities, draw the computation graph, and run the "forward" computation:
$a = xz$
$b = \ln x$
$c = \sin b$
$d = e^a$
$e = a / b$
$f = c / a$
$y = d + e + f$

(Computation graph: $x = 2$ and $z = 3$ feed $a$ (via $*$); $x$ also feeds $b$ (via $\ln$); $b$ feeds $c$ (via $\sin$); $a$ feeds $d$ (via $\exp$); $a$ and $b$ feed $e$ (via $/$); $c$ and $a$ feed $f$ (via $/$); $d$, $e$, $f$ feed $y$ (via $+$). Example courtesy of Matt Gormley.)
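Numerically, the forward pass just evaluates each intermediate quantity in topological order; a minimal sketch at $x = 2$, $z = 3$:

from math import exp, log, sin

x, z = 2.0, 3.0
a = x * z          # a = xz
b = log(x)         # b = ln x
c = sin(b)         # c = sin b
d = exp(a)         # d = e^a
e = a / b          # e = a / b
f = c / a          # f = c / a
y = d + e + f      # y = d + e + f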
Approach 3: Automatic Differentiation (reverse mode)

Then compute partial derivatives, starting from $y$ and working back through the computation graph.

(The same computation graph as above, traversed in reverse; example courtesy of Matt Gormley.)
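Continuing the sketch above, a hand-written backward pass computes an adjoint $\bar{u} = \partial y / \partial u$ for every node, reusing the forward-pass values; each adjoint sums the contributions from all of the node's children in the graph:

from math import exp, log, sin, cos

# Forward pass (as above)
x, z = 2.0, 3.0
a = x * z
b = log(x)
c = sin(b)
d = exp(a)
e = a / b
f = c / a
y = d + e + f

# Backward pass: adjoints u_bar = dy/du, from the output back to the inputs
y_bar = 1.0
d_bar = y_bar * 1.0                      # y = d + e + f
e_bar = y_bar * 1.0
f_bar = y_bar * 1.0
c_bar = f_bar * (1.0 / a)                # f = c / a
a_bar = (d_bar * exp(a)                  # a feeds d = e^a
         + e_bar * (1.0 / b)             # a feeds e = a / b
         + f_bar * (-c / a ** 2))        # a feeds f = c / a
b_bar = (e_bar * (-a / b ** 2)           # b feeds e = a / b
         + c_bar * cos(b))               # b feeds c = sin b
x_bar = a_bar * z + b_bar * (1.0 / x)    # x feeds a and b
z_bar = a_bar * x                        # z feeds a

print(x_bar, z_bar)   # match the finite difference and symbolic results above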
Automatic Differentiation

Given $f: \mathbb{R}^D \to \mathbb{R}^C$, compute $\nabla_{\mathbf{x}} f(\mathbf{x}) = \partial f(\mathbf{x}) / \partial \mathbf{x}$

3. Automatic differentiation (reverse mode)
   - Requires systematic knowledge of derivatives and an algorithm for computing $f(\mathbf{x})$
   - Computational cost of computing $\nabla_{\mathbf{x}} f(\mathbf{x})_c = \partial f(\mathbf{x})_c / \partial \mathbf{x}$ (the gradient of one output) is proportional to the cost of computing $f(\mathbf{x})$
   - Great for high-dimensional inputs and low-dimensional outputs ($D \gg C$)
4. Automatic differentiation (forward mode)
   - Requires systematic knowledge of derivatives and an algorithm for computing $f(\mathbf{x})$
   - Computational cost of computing $\partial f(\mathbf{x}) / \partial x_d$ (all outputs' derivatives with respect to one input) is proportional to the cost of computing $f(\mathbf{x})$
   - Great for low-dimensional inputs and high-dimensional outputs ($D \ll C$)
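For contrast, here is a toy forward-mode implementation based on dual numbers (a sketch of the idea, not course code): each value carries its derivative with respect to one chosen input, so one forward pass yields $\partial y / \partial x$ and a second pass yields $\partial y / \partial z$.

import math

class Dual:
    # A value together with its derivative w.r.t. one chosen input.
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    def __truediv__(self, other):
        return Dual(self.val / other.val,
                    (self.dot * other.val - self.val * other.dot) / other.val ** 2)

def d_exp(u): return Dual(math.exp(u.val), math.exp(u.val) * u.dot)
def d_log(u): return Dual(math.log(u.val), u.dot / u.val)
def d_sin(u): return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

def func(x, z):
    # y = e^(xz) + xz / ln(x) + sin(ln(x)) / (xz)
    return d_exp(x * z) + (x * z) / d_log(x) + d_sin(d_log(x)) / (x * z)

# Seed the input we are differentiating with respect to with dot = 1
dy_dx = func(Dual(2.0, 1.0), Dual(3.0, 0.0)).dot
dy_dz = func(Dual(2.0, 0.0), Dual(3.0, 1.0)).dot
print(dy_dx, dy_dz)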
Computation Graph: 10-301/601 Conventions

The diagram represents an algorithm.
- Nodes are rectangles, with one node per intermediate variable in the algorithm
- Each node is labeled with the function that it computes (inside the box) and the variable name (outside the box)
- Edges are directed and do not have labels
- For neural networks: each weight, feature value, label, and bias term appears as a node, and we can include the loss function
Neural Network Diagram Conventions

The diagram represents a neural network.
- Nodes are circles, with one node per hidden unit
- Each node is labeled with the variable corresponding to the hidden unit
- Edges are directed, and each edge is labeled with its weight
- Following standard convention, the bias term is typically not shown as a node; rather, it is assumed to be part of the activation function, i.e., its weight does not appear in the picture anywhere
- The diagram typically does not include any nodes related to the loss computation
Backprop Learning Objectives

You should be able to…
- Differentiate between a neural network diagram and a computation graph
- Construct a computation graph for a function as specified by an algorithm
- Carry out backpropagation on an arbitrary computation graph
- Construct a computation graph for a neural network, identifying all the given and intermediate quantities that are relevant
- Instantiate the backpropagation algorithm for a neural network
- Instantiate an optimization method (e.g., SGD) and a regularizer (e.g., L2) when the parameters of a model are comprised of several matrices corresponding to different layers of a neural network
- Use the finite difference method to evaluate the gradient of a function
- Identify when the gradient of a function can be computed at all and when it can be computed efficiently
- Employ basic matrix calculus to compute vector/matrix/tensor derivatives