Lec4 Learning

The document discusses the principles of neural networks and the process of backpropagation, focusing on empirical risk minimization and function optimization. It explains the concepts of derivatives, gradients, and Hessians in the context of scalar and multivariate functions, emphasizing their roles in optimization. Additionally, it covers criteria for identifying minima and maxima of functions through derivatives and second derivatives.

Neural Networks

Learning the network: Backprop


11-785, Fall 2025
Lecture 4

Attendance: @313

1
Recap: Empirical Risk Minimization
Y = f(X; W)

• Given a training set of input-output pairs (X₁, d₁), (X₂, d₂), …, (X_T, d_T)
  – Divergence on the i-th instance: div(f(Xᵢ; W), dᵢ)
  – Empirical average divergence on all training data:
    Loss(W) = (1/T) Σᵢ div(f(Xᵢ; W), dᵢ)
• Estimate the parameters to minimize the empirical estimate of expected divergence
    Ŵ = argmin_W Loss(W)
  – I.e. minimize the empirical risk over the drawn samples
2
Recap: Empirical Risk Minimization
Y = f(X; W)

This is an instance of function minimization (optimization)

• Given a training set of input-output pairs (X₁, d₁), (X₂, d₂), …, (X_T, d_T)
  – Error on the i-th instance: div(f(Xᵢ; W), dᵢ)
  – Empirical average error on all training data:
    Loss(W) = (1/T) Σᵢ div(f(Xᵢ; W), dᵢ)
• Estimate the parameters to minimize the empirical estimate of expected divergence
    Ŵ = argmin_W Loss(W)
  – I.e. minimize the empirical error over the drawn samples
3
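The loss above can be sketched directly in code. This is a minimal illustration: the scalar "network" f(x; w) = w·x and the squared-error divergence are stand-ins for the general f() and div() of the slides, not the actual network.

```python
# Empirical risk: average divergence over T training pairs.
# f and div are illustrative stand-ins for the slide's general network
# and divergence function.

def f(x, w):
    # toy "network": scalar linear model y = w * x
    return w * x

def div(y, d):
    # squared-error divergence between actual and desired output
    return (y - d) ** 2

def empirical_loss(w, pairs):
    # Loss(W) = (1/T) * sum_i div(f(X_i; W), d_i)
    return sum(div(f(x, w), d) for x, d in pairs) / len(pairs)

pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # here d = 2x exactly
print(empirical_loss(2.0, pairs))   # 0.0 at the true parameter
print(empirical_loss(1.0, pairs))   # > 0 away from it
```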
A quick intro to
function optimization

with an initial discussion of


derivatives

4
A brief note on derivatives..

[Figure: y vs. x, with increments Δx and Δy illustrating the derivative as the local slope]

• A derivative of a function at any point tells us how much a minute increment to the argument of the function will increment the value of the function
  – For any y = f(x), this is expressed as a multiplier α applied to a tiny increment Δx to obtain the resulting increment Δy of the output:
    Δy = αΔx
  – Based on the fact that at a fine enough resolution, any smooth, continuous function is locally linear at any point
5
Scalar function of scalar argument

[Figure: y vs. x, with increments Δx and Δy]

• When x and y are scalar:
    y = f(x)
  – Derivative: Δy = αΔx
  – α is often represented (using somewhat inaccurate notation) as dy/dx
  – Or alternately (and more reasonably) as f′(x)
6
Scalar function of scalar argument
[Figure: a curve annotated with the sign of its slope: + in increasing regions, − in decreasing regions, 0 at flat points]

• Derivative f′(x) is the rate of change of the function at x
  – How fast it increases with increasing x
  – The magnitude |f′(x)| gives the steepness of the curve at x
  – Larger |f′(x)| ⇒ the function is increasing or decreasing more rapidly
• It will be positive where a small increase in x results in an increase of f(x)
  – Regions of positive slope
• It will be negative where a small increase in x results in a decrease of f(x)
  – Regions of negative slope
• It will be 0 where the function is locally flat (neither increasing nor decreasing)
7
Multivariate scalar function: Scalar function of vector argument

y = f(x)    Note: Δx is now also a vector
x is now a vector: x = [x₁, …, x_n]ᵀ,  Δx = [Δx₁, …, Δx_n]ᵀ

    Δy = αΔx
• Giving us that α is a row vector: α = [α₁ ⋯ α_n]
    Δy = α₁Δx₁ + α₂Δx₂ + ⋯ + α_nΔx_n
• The partial derivative αᵢ gives us how y increments when only xᵢ is incremented
• Often represented as
    Δy = (∂y/∂x₁)Δx₁ + (∂y/∂x₂)Δx₂ + ⋯ + (∂y/∂x_n)Δx_n
8
Multivariate scalar function: Scalar function of vector argument

Note: Δx is now a vector: Δx = [Δx₁, …, Δx_n]ᵀ

    Δy = ∇ₓy Δx
• Where
    ∇ₓy = [∂y/∂x₁ ⋯ ∂y/∂x_n]
  We will be using this symbol for vector and matrix derivatives
  – You may be more familiar with the term "gradient", which is actually defined as the transpose of the derivative
9
Gradient of a scalar function of a vector

• The derivative ∇_X f(X) of a scalar function f(X) of a multivariate input X is a multiplicative factor that gives us the change in f(X) for tiny variations in X:
    df(X) = ∇_X f(X) dX
  – ∇_X f(X) = [∂f/∂x₁ ⋯ ∂f/∂x_n]
• The gradient ∇_X f(X)ᵀ is the transpose of the derivative
  – A column vector of the same dimensionality as X
10
Gradient of a scalar function of a vector

• The derivative ∇_X f(X) of a scalar function f(X) of a multivariate input X is a multiplicative factor that gives us the change in f(X) for tiny variations in X:
    df(X) = ∇_X f(X) dX
  – ∇_X f(X) = [∂f/∂x₁ ⋯ ∂f/∂x_n]
• The gradient ∇_X f(X)ᵀ is the transpose of the derivative
  – A column vector of the same dimensionality as X

This is a vector inner product. To understand its behavior, let's consider a well-known property of inner products
11
A well-known vector property
[Figure: uᵀv vs. θ]

    uᵀv = |u| |v| cos θ
• The inner product between two vectors of fixed lengths is maximum when the two vectors are aligned
  – i.e. when θ = 0
12
Properties of Gradient
∇_X f(X) dX vs. angle of dX

[Figure: value of ∇_X f(X) dX as the direction of dX varies; the blue arrow is dX]

• df(X) = ∇_X f(X) dX
• For an increment dX of any given length, df(X) is max if dX is aligned with ∇_X f(X)ᵀ
  – The function f(X) increases most rapidly if the input increment dX is exactly in the direction of ∇_X f(X)ᵀ
• The gradient is the direction of fastest increase in f(X)
13
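This property can be illustrated numerically. The function f(x, y) = x² + 2y² below is an arbitrary illustrative choice: among fixed-length steps in all directions, the step aligned with the gradient increases f the most.

```python
import math

# Among steps of fixed length eps in every direction, the one aligned
# with the gradient gives the largest increase in f.
# f(x, y) = x^2 + 2*y^2 is an illustrative function.

def f(x, y):
    return x * x + 2 * y * y

def grad(x, y):
    return (2 * x, 4 * y)

x0, y0 = 1.0, 1.0
gx, gy = grad(x0, y0)               # (2.0, 4.0)
grad_angle = math.atan2(gy, gx)     # direction of the gradient
eps = 1e-4

def delta(t):
    # change in f for a step of length eps in direction (cos t, sin t)
    return f(x0 + eps * math.cos(t), y0 + eps * math.sin(t)) - f(x0, y0)

angles = [i * 2 * math.pi / 360 for i in range(360)]
best = max(angles, key=delta)       # direction of largest increase
print(abs(best - grad_angle))       # small (within the angular grid spacing)
```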
Gradient

[Figure: surface plot of f(X) with the gradient vector ∇_X f(X)ᵀ marked]

• Moving in the direction of the gradient vector ∇_X f(X)ᵀ increases f(X) fastest
• Moving in the opposite direction, −∇_X f(X)ᵀ, decreases f(X) fastest

Gradient

[Figure: the gradient is 0 at the peaks and at the trough of the surface]
17
Properties of Gradient: 2

• The gradient vector 𝛻 𝑓 𝑋 𝑇 is perpendicular to the level curve


18
The Hessian
• The Hessian of a function f(x₁, x₂, …, x_n) is given by the matrix of second derivatives:

  ∇²_X f(x₁, …, x_n) :=
  ⎡ ∂²f/∂x₁²      ∂²f/∂x₁∂x₂    …  ∂²f/∂x₁∂x_n  ⎤
  ⎢ ∂²f/∂x₂∂x₁    ∂²f/∂x₂²      …  ∂²f/∂x₂∂x_n  ⎥
  ⎢     ⋮              ⋮        ⋱       ⋮       ⎥
  ⎣ ∂²f/∂x_n∂x₁   ∂²f/∂x_n∂x₂   …  ∂²f/∂x_n²    ⎦
19
Poll 1 : @305, @307
• Select all that are true about derivatives of a scalar function f(X) of
multivariate inputs
– At any location X, there may be many directions in which we can step, such
that f(X) increases
– The direction of the gradient is the direction in which the function increases
fastest
– The gradient is the derivative of f(X) w.r.t. X

• y = f(x) is a scalar function of an Nx1 column vector variable x. What is the


shape of the derivative of y with respect to x
– Scalar
– N x 1 column vector
– 1 x N row vector
– There is insufficient information to decide

20
Poll 1
• Select all that are true about derivatives of a scalar function f(X) of
multivariate inputs
– At any location X, there may be many directions in which we can step, such
that f(X) increases
– The direction of the gradient is the direction in which the function increases
fastest
– The gradient is the derivative of f(X) w.r.t. X

• y = f(x) is a scalar function of an Nx1 column vector variable x. What is the


shape of the derivative of y with respect to x
– Scalar
– N x 1 column vector
– 1 x N row vector
– There is insufficient information to decide

21
The problem of optimization
f(x)
global maximum

inflection point

local minimum
global minimum

• General problem of
optimization: Given a function
f(x) of some variable x …

• Find the value of x where f(x)


is minimum

22
Finding the minimum of a function
𝑑𝑦
=0
𝑑𝑥
f(x)

• Find the value 𝑥 at which 𝑓′(𝑥) = 0


– Solve
𝑑𝑓(𝑥)
=0
𝑑𝑥
• The solution is a “turning point”
– Derivatives go from positive to negative or vice versa at this point
• But is it a minimum?
23
Poll 2 : @308
Which of the following criteria would be true
(choose one) about the minimum of a function f(x)

1. The derivative f’(x) = 0 at the minimum. This is


the only condition to be satisfied
2. f’(x) = 0 and the second derivative f”(x) is
negative
3. f’(x) = 0 and the second derivative f”(x) is
positive
24
Poll 2
Which of the following would be true (choose one)
about the minimum of a function f(x)

1. The derivative f’(x) = 0 at the minimum. This is


the only condition to be satisfied
2. f’(x) = 0 and the second derivative f”(x) is
negative
3. f’(x) = 0 and the second derivative f”(x) is positive  ← correct
25
Turning Points
[Figure: a curve with slope signs marked: + in increasing regions, − in decreasing regions, 0 at the turning points]

• Both maxima and minima have zero derivative


• Both are turning points

26
Derivatives of a curve

f’(x)
f(x) x

• Both maxima and minima are turning points


• Both maxima and minima have zero derivative

27
Derivative of the derivative of the
curve
f’’(x)
f’(x)
f(x) x

• Both maxima and minima are turning points


• Both maxima and minima have zero derivative

• The second derivative f’’(x) is –ve at maxima and


+ve at minima!
28
Solution: Finding the minimum or
maximum of a function
dy/dx = 0
f(x)

x
• Find the value x_s at which f′(x) = 0: Solve
    df(x)/dx = 0
• The solution x_s is a turning point
• Check the double derivative at x_s: compute
    f″(x_s) = df′(x)/dx evaluated at x_s
• If f″(x_s) is positive, x_s is a minimum
  – If it is negative, it is a maximum
29
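The recipe above, sketched in code for the illustrative function f(x) = x³ − 3x, whose turning points x = ±1 can be found by hand (f′(x) = 3x² − 3, f″(x) = 6x):

```python
# Solve f'(x) = 0, then classify each turning point by the sign of f''(x).
# f(x) = x^3 - 3x is an illustrative function with one maximum and one minimum.

def fprime(x):
    return 3 * x * x - 3       # f'(x) for f(x) = x^3 - 3x

def fdoubleprime(x):
    return 6 * x               # f''(x)

turning_points = [-1.0, 1.0]   # roots of 3x^2 - 3 = 0

for x in turning_points:
    kind = "minimum" if fdoubleprime(x) > 0 else "maximum"
    # derivative is 0 at both points; the second derivative picks the type
    print(x, fprime(x), kind)
```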
A note on derivatives of functions of a single variable

[Figure: f(x) with a maximum, a minimum, and an inflection point marked as critical points; below it, df(x)/dx crossing zero at each, and d²f(x)/dx² positive at the minimum, zero at the inflection point, negative at the maximum]

• All locations with zero derivative are critical points
  – These can be local maxima, local minima, or inflection points

• The second derivative is
  – ≥ 0 at minima
  – ≤ 0 at maxima
  – Zero at inflection points

• It's a little more complicated for functions of multiple variables..
31
What about functions of multiple
variables?

• The optimum point is still a “turning” point


– Shifting in any direction will increase the value
– For smooth functions, at the minimum/maximum, the gradient
is 0
• Really tiny shifts will not change the function value
32
Finding the minimum of a scalar
function of a multivariate input

• The optimum point is a turning point – the


gradient will be 0
• Find the location where the gradient is 0
33
Unconstrained Minimization of
function (Multivariate)
1. Solve for the X where the derivative (or gradient) equals zero:
    ∇_X f(X) = 0
2. Compute the Hessian matrix ∇²_X f(X) at the candidate solution and verify that
  – the Hessian is positive definite (eigenvalues positive) → local minimum
  – the Hessian is negative definite (eigenvalues negative) → local maximum
34
Unconstrained Minimization of
function (Example)
• Minimize
    f(x₁, x₂, x₃) = x₁² + x₁(1 − x₂) + x₂² − x₂x₃ + x₃² + x₃
• Gradient
    ∇_X fᵀ = [2x₁ + 1 − x₂,  −x₁ + 2x₂ − x₃,  −x₂ + 2x₃ + 1]ᵀ

35
Unconstrained Minimization of
function (Example)
• Set the gradient to zero:
    ∇_X f = 0  ⇒  [2x₁ + 1 − x₂,  −x₁ + 2x₂ − x₃,  −x₂ + 2x₃ + 1]ᵀ = [0, 0, 0]ᵀ
• Solving the system of 3 equations with 3 unknowns:
    x̂ = [x₁, x₂, x₃]ᵀ = [−1, −1, −1]ᵀ
36
Unconstrained Minimization of function (Example)
• Compute the Hessian matrix:
    ∇²_X f = ⎡  2  −1   0 ⎤
             ⎢ −1   2  −1 ⎥
             ⎣  0  −1   2 ⎦
• Evaluate the eigenvalues of the Hessian matrix:
    λ₁ = 3.414, λ₂ = 0.586, λ₃ = 2
• All the eigenvalues are positive ⇒ the Hessian matrix is positive definite
• The point x̂ = [x₁, x₂, x₃]ᵀ = [−1, −1, −1]ᵀ is a minimum
37
Closed Form Solutions are not always
available
f(X)

• Often it is not possible to simply solve ∇_X f(X) = 0


– The function to minimize/maximize may have an
intractable form
• In these situations, iterative solutions are used
– Begin with a “guess” for the optimal 𝑋 and refine it
iteratively until the correct value is obtained
38
Iterative solutions

f(X)

[Figure: successive estimates x⁰, x¹, x², … stepping toward the minimum of f(X)]

• Iterative solutions
  – Start from an initial guess x⁰ for the optimal X
  – Update the guess towards a (hopefully) “better” value of f(X)
  – Stop when f(X) no longer decreases
• Problems:
– Which direction to step in
– How big must the steps be

39
The Approach of Gradient Descent

• Iterative solution:
– Start at some point
– Find direction in which to shift this point to decrease error
• This can be found from the derivative of the function
– A negative derivative ⇒ moving right decreases error
– A positive derivative ⇒ moving left decreases error
– Shift point in this direction
40
The Approach of Gradient Descent

• Iterative solution: Trivial algorithm


– Initialize x⁰
– While f′(xᵏ) ≠ 0:
  • If sign(f′(xᵏ)) is positive:
      xᵏ⁺¹ = xᵏ − step
  • Else:
      xᵏ⁺¹ = xᵏ + step
41
The Approach of Gradient Descent

• Iterative solution: Trivial algorithm


– Initialize x⁰
– While f′(xᵏ) ≠ 0:
    xᵏ⁺¹ = xᵏ − sign(f′(xᵏ)) · step

• Identical to previous algorithm


42
The Approach of Gradient Descent

• Iterative solution: Trivial algorithm


– Initialize x⁰
– While f′(xᵏ) ≠ 0:
    xᵏ⁺¹ = xᵏ − η f′(xᵏ)

• η is the “step size”


43
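The trivial algorithm above, sketched for the illustrative function f(x) = (x − 3)², so f′(x) = 2(x − 3) and the minimum is at x = 3; the step size and iteration count are arbitrary choices.

```python
# 1-D gradient descent: x <- x - eta * f'(x).
# f(x) = (x - 3)^2 is an illustrative function with minimum at x = 3.

def fprime(x):
    return 2 * (x - 3)

x = 0.0          # initial guess x^0
eta = 0.1        # step size
for _ in range(100):
    x = x - eta * fprime(x)

print(x)         # close to 3.0
```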
Poll 3: @310
• Select all that are true about derivatives of a
scalar function f(X) of multivariate inputs
– At any location X, there may be many directions in
which we can step, such that f(X) increases
– The direction of the gradient is the direction in
which the function increases fastest
– The gradient is the derivative of f(X) w.r.t. X

44
Poll 3: Multivariate functions
• Select all that are true about derivatives of a
scalar function f(X) of multivariate inputs
– At any location X, there may be many directions
in which we can step, such that f(X) increases
– The direction of the gradient is the direction in
which the function increases fastest
– The gradient is the derivative of f(X) w.r.t. X

45
Gradients of multivariate functions

Gradient

[Figure: surface plot of f(X) with the gradient vector ∇_X f(X)ᵀ marked]

• Moving in the direction of ∇_X f(X)ᵀ increases f(X) fastest
• Moving in the direction of −∇_X f(X)ᵀ decreases f(X) fastest
46
Gradient descent/ascent (multivariate)

• The gradient descent/ascent method to find the


minimum or maximum of a function 𝑓 iteratively
– To find a maximum, move in the direction of the gradient:
    xᵏ⁺¹ = xᵏ + ηᵏ ∇ₓf(xᵏ)ᵀ
– To find a minimum, move exactly opposite the direction of the gradient:
    xᵏ⁺¹ = xᵏ − ηᵏ ∇ₓf(xᵏ)ᵀ

• Many solutions to choosing step size 𝜂


47
Gradient descent convergence criteria
• The gradient descent algorithm converges when one of the following criteria is satisfied:
    |f(xᵏ⁺¹) − f(xᵏ)| < ε₁
• Or
    ‖∇ₓf(xᵏ)‖ < ε₂
48
Overall Gradient Descent Algorithm
• Initialize:
  – x⁰
  – k = 0
• do
  – xᵏ⁺¹ = xᵏ − ηᵏ ∇ₓf(xᵏ)ᵀ
  – k = k + 1
• while |f(xᵏ) − f(xᵏ⁻¹)| > ε

49
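The full loop can be sketched on the earlier three-variable example (minimum at (−1, −1, −1)); the step size η and tolerance ε below are illustrative choices.

```python
# Multivariate gradient descent with the "decrease in f" stopping
# criterion, on f = x1^2 + x1(1 - x2) + x2^2 - x2*x3 + x3^2 + x3
# from the earlier worked example (minimum at (-1, -1, -1)).

def f(x):
    x1, x2, x3 = x
    return x1**2 + x1 * (1 - x2) + x2**2 - x2 * x3 + x3**2 + x3

def grad(x):
    x1, x2, x3 = x
    return (2 * x1 + 1 - x2, -x1 + 2 * x2 - x3, -x2 + 2 * x3 + 1)

x = (0.0, 0.0, 0.0)   # initial guess x^0
eta = 0.1             # step size
eps = 1e-12           # tolerance on the decrease in f
while True:
    g = grad(x)
    x_new = tuple(xi - eta * gi for xi, gi in zip(x, g))
    if f(x) - f(x_new) <= eps:   # f no longer decreasing: stop
        break
    x = x_new

print([round(v, 4) for v in x])  # [-1.0, -1.0, -1.0]
```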
Convergence of Gradient Descent
• For appropriate step
size, for convex (bowl-
shaped) functions
gradient descent will
always find the
minimum.

• For non-convex
functions it will find a
local minimum or an
inflection point
50
Poll 4 : @311
• y = f(x) is a scalar function of an Nx1 column vector variable
x. Starting from x = x0, in which direction must we move in
the space of x, to achieve the maximum decrease in f()?
– Exactly in the direction of the gradient of f(x) at x0
– Exactly perpendicular to the direction of the gradient of f(x) at x0
– Exactly opposite to the direction of the gradient of f(x) at x0
– Exactly perpendicular to the direction of the gradient of f(x) at
x0.

51
Poll 4
• y = f(x) is a scalar function of an Nx1 column vector variable
x. Starting from x = x0, in which direction must we move in
the space of x, to achieve the maximum decrease in f()?
– Exactly in the direction of the gradient of f(x) at x0
– Exactly perpendicular to the direction of the gradient of f(x) at x0
– Exactly opposite to the direction of the gradient of f(x) at x0  ← correct
– Exactly perpendicular to the direction of the gradient of f(x) at x0.

52
• Returning to our problem from our detour..

53
Problem Statement
• Given a training set of input-output pairs
    (X₁, d₁), (X₂, d₂), …, (X_T, d_T)

• Minimize the following function
    Loss(W) = (1/T) Σᵢ div(f(Xᵢ; W), dᵢ)
  w.r.t. W

• This is a problem of function minimization
  – An instance of optimization
54
Gradient Descent to train a network
• Initialize:
  – W⁰
  – k = 0
• do
  – Wᵏ⁺¹ = Wᵏ − ηᵏ ∇Loss(Wᵏ)ᵀ
  – k = k + 1
• while |Loss(Wᵏ) − Loss(Wᵏ⁻¹)| > ε

11-755/18-797 55
Preliminaries
• Before we proceed: the problem setup

56
Problem Setup: Things to define
• Given a training set of input-output pairs
    (X₁, d₁), (X₂, d₂), …, (X_T, d_T)

• Minimize the following function
    Loss(W) = (1/T) Σᵢ div(f(Xᵢ; W), dᵢ)
  w.r.t. W

• This is a problem of function minimization
  – An instance of optimization
57
Problem Setup: Things to define
• Given a training set of input-output pairs
    (X₁, d₁), (X₂, d₂), …, (X_T, d_T)

• Minimize the following function
    Loss(W) = (1/T) Σᵢ div(f(Xᵢ; W), dᵢ)

• Three things must be defined:
  – What are the input-output pairs?
  – What is f() and what are its parameters W?
  – What is the divergence div()?
61
What is f()? Typical network
Input
Hidden units
units Output
units

• Multi-layer perceptron
• A directed network with a set of inputs and
outputs
– No loops
62
Typical network
Input
Hidden Layers
Layer Output
Layer

• We assume a “layered” network for simplicity


– Each “layer” of neurons only gets inputs from the earlier layer(s)
and outputs signals only to later layer(s)
– We will refer to the inputs as the input layer
• No neurons here – the “layer” simply refers to inputs
– We refer to the outputs as the output layer
– Intermediate layers are “hidden” layers 63
The individual neurons

• Individual neurons operate on a set of inputs and produce a single output
  – Standard setup: A continuous activation function applied to an affine function of the inputs
      y = f(Σᵢ wᵢxᵢ + b)
  – More generally: any differentiable function
      y = f(x₁, x₂, …, x_L; W)
64
The individual neurons

• Individual neurons operate on a set of inputs and produce a single output
  – Standard setup: A continuous activation function applied to an affine function of the inputs
      y = f(Σᵢ wᵢxᵢ + b)
    We will assume this unless otherwise specified; parameters are the weights wᵢ and bias b
  – More generally: any differentiable function
      y = f(x₁, x₂, …, x_L; W)
65
Activations and their derivatives
f(z) = 1 / (1 + exp(−z))                 f′(z) = f(z)(1 − f(z))

f(z) = tanh(z)                           f′(z) = 1 − f²(z)

f(z) = { z, z ≥ 0;  0, z < 0 }           f′(z) = { 1, z ≥ 0;  0, z < 0 }

f(z) = log(1 + exp(z))                   f′(z) = 1 / (1 + exp(−z))

• Some popular activation functions and their


derivatives 66
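The four activations in the table can be sketched directly (scalar versions, using Python's math module); the derivative of each is written in terms of the forward value where the table does so.

```python
import math

# The activation functions from the table and their derivatives.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    fz = sigmoid(z)
    return fz * (1.0 - fz)          # f(z)(1 - f(z))

def tanh_act(z):
    return math.tanh(z)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2  # 1 - f^2(z)

def relu(z):
    return z if z >= 0 else 0.0

def d_relu(z):
    return 1.0 if z >= 0 else 0.0

def softplus(z):
    return math.log(1.0 + math.exp(z))

def d_softplus(z):
    return sigmoid(z)               # softplus' derivative is the sigmoid

print(sigmoid(0.0), d_sigmoid(0.0))   # 0.5 0.25
print(relu(-2.0), d_relu(3.0))        # 0.0 1.0
```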
Vector Activations
Input
Hidden Layers
Layer Output
Layer

• We can also have neurons that have multiple coupled outputs
    (y₁, y₂, …, y_M) = f(x₁, x₂, …, x_N; W)
  – Function f() operates on a set of inputs to produce a set of outputs
  – Modifying a single parameter in W will affect all outputs
67
Vector activation example: Softmax
[Figure: inputs x feed affine units z, which feed a softmax producing outputs y]

• Example: Softmax vector activation
    zᵢ = Σⱼ wᵢⱼ xⱼ + bᵢ
    yᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
  Parameters are weights wᵢⱼ and biases bᵢ
68
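A minimal softmax sketch; subtracting max(z) before exponentiating is a standard numerical-stability step that does not change the result (it cancels in the ratio) and is not shown on the slide.

```python
import math

# Softmax vector activation: y_i = exp(z_i) / sum_j exp(z_j).
# Subtracting max(z) first avoids overflow without changing the output.

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

y = softmax([1.0, 2.0, 3.0])
print(sum(y))               # ~1.0: outputs form a probability distribution
print(y[2] > y[1] > y[0])   # True: larger z_i -> larger y_i
```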
Multiplicative combination: Can be
viewed as a case of vector activations
[Figure: x → z → y]

    zᵢ = Σⱼ wᵢⱼ xⱼ + bᵢ
    y = Πᵢ zᵢ

  Parameters are weights wᵢⱼ and biases bᵢ
• A layer of multiplicative combination is a special case of vector activation69
Typical network
Input
Hidden Layers
Layer Output
Layer

• In a layered network, each layer of


perceptrons can be viewed as a single vector
activation

70
Notation
[Figure: a layered network with layer outputs y^(1), y^(2), …, weights w^(1), w^(2), …, and biases b^(1), b^(2), …]

• The input layer is the 0th layer
• We will represent the output of the i-th perceptron of the k-th layer as y_i^(k)
  – Input to network: y_i^(0) = x_i
  – Output of network: y_i = y_i^(N)
• We will represent the weight of the connection between the i-th unit of the (k−1)-th layer and the j-th unit of the k-th layer as w_ij^(k)
  – The bias to the j-th unit of the k-th layer is b_j^(k)
71
Problem Setup: Things to define
• Given a training set of input-output pairs
    (X₁, d₁), (X₂, d₂), …, (X_T, d_T)

• Minimize the following function
    Loss(W) = (1/T) Σᵢ div(f(Xᵢ; W), dᵢ)

  What is f() and what are its parameters W?

72
Problem Setup: Things to define
• Given a training set of input-output pairs
    (X₁, d₁), (X₂, d₂), …, (X_T, d_T)

• Minimize the following function
    Loss(W) = (1/T) Σᵢ div(f(Xᵢ; W), dᵢ)

  What are the input-output pairs?
73
Input, target output, and actual output:
Vector notation
[Figure: network mapping an input vector to an output vector]

• Given a training set of input-output pairs (X₁, d₁), (X₂, d₂), …, (X_T, d_T)
• X_n = [x_n1, x_n2, …, x_nD] is the n-th input vector
• d_n = [d_n1, d_n2, …, d_nL] is the n-th desired output
• Y_n = [y_n1, y_n2, …, y_nL] is the n-th vector of actual outputs of the network
  – A function of input X_n and network parameters
• We will sometimes drop the first subscript when referring to a specific instance
74
Representing the input
Input
Hidden Layers
Layer Output
Layer

• Vectors of numbers
– (or may even be just a scalar, if input layer is of size 1)
– E.g. vector of pixel values
– E.g. vector of speech features
– E.g. real-valued vector representing text
• We will see how this happens later in the course
– Other real valued vectors
75
Representing the output
Input
Hidden Layers
Layer Output
Layer

• If the desired output is real-valued, no special tricks are necessary


– Scalar Output : single output neuron
• d = scalar (real value)
– Vector Output : as many output neurons as the dimension of the
desired output
• d = [d1 d2 .. dL] (vector of real values)

76
Representing the output

• If the desired output is binary (is this a cat or not), use


a simple 1/0 representation of the desired output
– 1 = Yes, it’s a cat
– 0 = No, it’s not a cat.

77
Representing the output

    σ(z) = 1 / (1 + e^(−z))

[Figure: the sigmoid σ(z)]

• If the desired output is binary (is this a cat or not), use


a simple 1/0 representation of the desired output
• Output activation: Typically, a sigmoid
– Viewed as the probability 𝑃 𝑌 = 1 𝑋 of class value 1
  • Indicating the fact that for actual data, in general, a feature value X may occur for both classes, but with different probabilities
  • Is differentiable
78
Representing the output

• If the desired output is binary (is this a cat or not), use a simple 1/0 representation of the desired
output
– 1 = Yes it’s a cat
– 0 = No it’s not a cat.

• Sometimes represented by two outputs, one representing the desired output, the other
representing the negation of the desired output
– Yes: → [1 0]
– No: → [0 1]
• The output explicitly becomes a 2-output softmax
79
Multi-class output: One-hot
representations
• Consider a network that must distinguish if an input is a cat, a dog, a camel, a hat,
or a flower
• We can represent this set as the following vector, with the classes arranged in a
chosen order:
[cat dog camel hat flower]T
• For inputs of each of the five classes the desired output is:
cat: [1 0 0 0 0] T
dog: [0 1 0 0 0] T
camel: [0 0 1 0 0] T
hat: [0 0 0 1 0] T
flower: [0 0 0 0 1] T
• For an input of any class, we will have a five-dimensional vector output with four
zeros and a single 1 at the position of that class
• This is a one hot vector

80
Multi-class networks
Input
Hidden Layers
Layer Output
Layer

• For a multi-class classifier with N classes, the one-hot


representation will have N binary target outputs
– The desired output 𝑑 is an N-dimensional binary vector
• The neural network’s actual output too must ideally be binary (N-1
zeros and a single 1 in the right place)
• More realistically, it will be a probability vector
– N probability values that sum to 1.
81
Multi-class classification: Output
Input
Hidden Layers
Layer Output
Layer
[Figure: the output layer feeds a softmax]

• Softmax vector activation is often used at the output of multi-class classifier nets
    z_i^(N) = Σⱼ w_ij^(N) y_j^(N−1)
    y_i = exp(z_i) / Σⱼ exp(z_j)
• This can be viewed as the probability 𝑦 = 𝑃(𝑐𝑙𝑎𝑠𝑠 = 𝑖|𝑋)
82
Inputs and outputs:
Typical Problem Statement

• We are given a number of “training” data instances


• E.g. images of digits, along with information about
which digit the image represents
• Tasks:
– Binary recognition: Is this a “2” or not
– Multi-class recognition: Which digit is this?
83
Typical Problem statement:
binary classification
Training data

[Figure: training images of digits paired with binary labels, e.g. (image of a "2", 1), (image of another digit, 0)]
Input: vector of pixel values; Output: sigmoid

• Given, many positive and negative examples (training data),


– learn all weights such that the network does the desired job

84
Typical Problem statement:
multiclass classification
Training data
[Figure: training images of digits paired with class labels, e.g. (image, 5), (image, 2), (image, 0), …; the network's hidden layers feed a softmax output layer]
Input: vector of pixel values; Output: class probabilities

• Given, many positive and negative examples (training data),


– learn all weights such that the network does the desired job

85
Problem Setup: Things to define
• Given a training set of input-output pairs
    (X₁, d₁), (X₂, d₂), …, (X_T, d_T)

• Minimize the following function
    Loss(W) = (1/T) Σᵢ div(f(Xᵢ; W), dᵢ)

  What is the divergence div()?

86
Problem Setup: Things to define
• Given a training set of input-output pairs
    (X₁, d₁), (X₂, d₂), …, (X_T, d_T)

• Minimize the following function
    Loss(W) = (1/T) Σᵢ div(f(Xᵢ; W), dᵢ)

  What is the divergence div()?
  Note: For Loss(W) to be differentiable w.r.t. W, div() must be differentiable
87
Examples of divergence functions
d1 d2 d3 d4

L2 Div() Div

• For real-valued output vectors, the (scaled) L2 divergence is popular:
    Div(Y, d) = (1/2) ‖Y − d‖² = (1/2) Σᵢ (yᵢ − dᵢ)²
  – Squared Euclidean distance between true and desired output
  – Note: this is differentiable
      dDiv(Y, d)/dyᵢ = yᵢ − dᵢ
      ∇_Y Div(Y, d) = [y₁ − d₁, y₂ − d₂, …]
88
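The scaled L2 divergence and its gradient transcribe directly into code; the vectors y and d below are illustrative.

```python
# Scaled L2 divergence Div(Y, d) = (1/2) * sum_i (y_i - d_i)^2
# and its gradient dDiv/dy_i = y_i - d_i.

def l2_div(y, d):
    return 0.5 * sum((yi - di) ** 2 for yi, di in zip(y, d))

def l2_grad(y, d):
    return [yi - di for yi, di in zip(y, d)]

y = [0.5, 1.5, 2.0]   # actual outputs (illustrative)
d = [0.0, 1.0, 2.0]   # desired outputs (illustrative)
print(l2_div(y, d))   # 0.25
print(l2_grad(y, d))  # [0.5, 0.5, 0.0]
```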
For binary classifier
𝑑

𝑦
KL Div 𝐷𝑖𝑣

• For a binary classifier with scalar output Y ∈ (0,1) and d ∈ {0,1}, the Kullback-Leibler (KL) divergence between the probability distribution [Y, 1 − Y] and the ideal output probability [d, 1 − d] is popular:
    Div(Y, d) = −d log Y − (1 − d) log(1 − Y)
  – Minimum when d = Y
• Derivative
    dDiv(Y, d)/dY = −1/Y if d = 1;  1/(1 − Y) if d = 0
89
KL vs L2

[Figure: L2(Y, d) = (y − d)² and KL(Y, d) = −d log Y − (1 − d) log(1 − Y), each plotted for d = 0 and d = 1]

• Both KL and L2 have a minimum when 𝑦 is the target value of 𝑑


• KL rises much more steeply away from 𝑑
– Encouraging faster convergence of gradient descent
• The derivative of KL is not equal to 0 at the minimum
– It is 0 for L2, though
90
For binary classifier
𝑑

𝑦
KL Div 𝐷𝑖𝑣

• For a binary classifier with scalar output Y ∈ (0,1) and d ∈ {0,1}, the Kullback-Leibler (KL) divergence between [Y, 1 − Y] and [d, 1 − d] is popular:
    Div(Y, d) = −d log Y − (1 − d) log(1 − Y)
  – Minimum when d = Y
• Derivative
    dDiv(Y, d)/dY = −1/Y if d = 1;  1/(1 − Y) if d = 0
  Note: when Y = d the derivative is not 0, even though div() = 0 (its minimum) when Y = d
91
For multi-class classification

[Figure: network outputs y₁ … y_L compared against one-hot targets d₁ … d_L via KL Div()]

• Desired output d is a one-hot vector [0 0 … 1 … 0 0 0] with the 1 in the c-th position (for class c)
• Actual output will be a probability distribution [y₁, y₂, …]
• The KL divergence between the desired one-hot output and the actual output:
    Div(Y, d) = Σᵢ dᵢ log(dᵢ/yᵢ) = Σᵢ dᵢ log dᵢ − Σᵢ dᵢ log yᵢ = −log y_c
• Derivative
    dDiv(Y, d)/dyᵢ = −1/y_c for the c-th component;  0 for the remaining components
    ∇_Y Div(Y, d) = [0 0 … −1/y_c … 0 0]
  – The slope is negative w.r.t. y_c: increasing y_c will reduce the divergence
  – Note: when Y = d the derivative is not 0, even though div() = 0 (its minimum) when Y = d
93
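The one-hot case can be sketched directly: with target class c, the divergence is −log y_c and the gradient has a single nonzero component. The distribution y and class c below are illustrative.

```python
import math

# KL divergence with a one-hot target: Div(Y, d) = -log y_c, with
# gradient -1/y_c in the c-th component and 0 elsewhere.

def kl_onehot(y, c):
    return -math.log(y[c])

def kl_onehot_grad(y, c):
    return [(-1.0 / y[c]) if i == c else 0.0 for i in range(len(y))]

y = [0.1, 0.7, 0.2]    # actual output distribution (illustrative)
c = 1                  # correct class
print(kl_onehot(y, c))        # -log 0.7, about 0.357
print(kl_onehot_grad(y, c))   # only the c-th component is nonzero
```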
KL divergence vs cross entropy
• KL divergence between d and y:
    KL(Y, d) = Σᵢ dᵢ log dᵢ − Σᵢ dᵢ log yᵢ

• Cross-entropy between d and y:
    Xent(Y, d) = −Σᵢ dᵢ log yᵢ

• When the desired target d is one-hot, Σᵢ dᵢ log dᵢ = 0
  – KL and cross-entropy are identical
  – Xent(Y, d) = −log y_c
  – Minimizing cross-entropy simply maximizes the probability of the target class

• We will continue discussing in terms of KL, but the discussion applies


directly to Cross-entropy as well

94
KL divergence vs cross entropy
• KL divergence between d and y:
    KL(Y, d) = Σᵢ dᵢ log dᵢ − Σᵢ dᵢ log yᵢ

• Cross-entropy between d and y:
    Xent(Y, d) = −Σᵢ dᵢ log yᵢ

• More generally, the cross-entropy is merely the KL divergence plus the entropy of d:
    Xent(Y, d) = KL(Y, d) − Σᵢ dᵢ log dᵢ = KL(Y, d) + H(d)

• The W that minimizes the cross-entropy will also minimize the KL divergence
  – Since d is the desired output and does not depend on the network, H(d) does not depend on the net
  – In fact, for one-hot d, H(d) = 0 (and KL = Xent)
• We will generally minimize the cross-entropy loss rather than the KL divergence
  – The Xent is not a divergence, and although it attains its minimum when y = d, its minimum value is not 0 in general
95
“Label smoothing”

[Figure: network outputs compared against smoothed targets via KL Div()]

• It is sometimes useful to set the target output to [ε ε … (1 − (K−1)ε) … ε ε ε], with the value 1 − (K−1)ε in the c-th position (for class c) and ε elsewhere, for some small ε
  – “Label smoothing” -- aids gradient descent
• The KL divergence remains:
    Div(Y, d) = Σᵢ dᵢ log dᵢ − Σᵢ dᵢ log yᵢ
• Derivative
    dDiv(Y, d)/dyᵢ = −(1 − (K−1)ε)/y_c for the c-th component;  −ε/yᵢ for the remaining components
  – Negative derivatives encourage increasing the probabilities of all classes, including incorrect classes! (Seems wrong, no?)
97
The derivative of the KL divergence
[Diagram: the desired one-hot output d and the network output y = softmax(z) feed the KL divergence KL(y, d)]

• The softmax is computed on affine values z to obtain output probabilities y
• The derivative of the KL divergence between the actual output y and the target output is as given earlier
• However, the derivative of the KL divergence w.r.t. the affine value z at the input of the softmax is just the error:
    ∇_z KL(y, d) = y − d
98
A note on derivatives

[Figure: a linear output layer, y = z, and a softmax output layer, y = softmax(z)]

• Note: For both regression models with a linear output layer and L2 divergence, and classification models with a softmax output layer and KL divergence, the gradient w.r.t. the final affine value z of the network is just the error:
    ∇_z (1/2)‖y − d‖² = y − d
    ∇_z KL(y, d) = y − d
99
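The softmax/KL identity can be checked numerically: the finite-difference gradient of −log softmax(z)_c w.r.t. z should match y − d componentwise for a one-hot d. The test point z below is an arbitrary illustrative choice.

```python
import math

# Finite-difference check that grad_z KL(softmax(z), d) = y - d
# when d is one-hot with the 1 in position c.

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [x / s for x in e]

def kl(z, c):
    # KL divergence with one-hot target: -log y_c
    return -math.log(softmax(z)[c])

z = [0.5, -1.0, 2.0]   # arbitrary affine values
c = 2                  # target class
y = softmax(z)
eps = 1e-6
diffs = []
for i in range(len(z)):
    zp = list(z); zp[i] += eps
    zm = list(z); zm[i] -= eps
    numeric = (kl(zp, c) - kl(zm, c)) / (2 * eps)   # central difference
    analytic = y[i] - (1.0 if i == c else 0.0)       # (y - d)_i
    diffs.append(abs(numeric - analytic))

print(max(diffs))   # tiny: the numeric gradient matches y - d
```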
Problem Setup: Things to define
• Given a training set of input-output pairs
    (X₁, d₁), (X₂, d₂), …, (X_T, d_T)

• Minimize the following function
    Loss(W) = (1/T) Σᵢ div(f(Xᵢ; W), dᵢ)

ALL TERMS HAVE BEEN DEFINED

100
Poll 5 : @312
• Select all that are correct
– The gradient of the loss will always be 0 or close
to 0 at a minimum
– The gradient of the loss may be 0 or close to 0 at a
minimum
– The gradient of the loss may have large magnitude
at a minimum
– If the gradient is not 0 at a minimum, it must be a
local minimum

101
Poll 5
• Select all that are correct
– The gradient of the loss will always be 0 or close
to 0 at a minimum
– The gradient of the loss may be 0 or close to 0 at
a minimum
– The gradient of the loss may have large
magnitude at a minimum
– If the gradient is not 0 at a minimum, it must be a
local minimum

102
Story so far
• Neural nets are universal approximators

• Neural networks are trained to approximate functions by adjusting their


parameters to minimize the average divergence between their actual output and
the desired output at a set of “training instances”
– Input-output samples from the function to be learned
– The average divergence is the “Loss” to be minimized

• To train them, several terms must be defined


– The network itself
– The manner in which inputs are represented as numbers
– The manner in which outputs are represented as numbers
• As numeric vectors for real predictions
• As one-hot vectors for classification functions
– The divergence function that computes the error between actual and desired outputs
• L2 divergence for real-valued predictions
• KL divergence for classifiers

103
Next Class
• Backpropagation

104
