Lec4 Learning
Recap: Empirical Risk Minimization
• The model computes 𝑌 = 𝑓(𝑋; 𝑾)
• Learning the parameters 𝑾 by minimizing the empirical risk is an instance of function minimization (optimization)
A brief note on derivatives..
• The derivative of a function at a point tells us how much a minute increment to the argument of the function will increment the value of the function
  – For any 𝑦 = 𝑓(𝑥), it is expressed as a multiplier 𝛼 applied to a tiny increment Δ𝑥 to obtain the resulting increment Δ𝑦 of the output:
      Δ𝑦 = 𝛼Δ𝑥
  – This is based on the fact that, at a fine enough resolution, any smooth, continuous function is locally linear at any point
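The local-linearity view can be checked numerically: for a small enough Δ𝑥, the increment Δ𝑦 is approximately 𝛼Δ𝑥. A minimal Python sketch, using a hypothetical example function f(x) = x²:

```python
def numerical_derivative(f, x, dx=1e-6):
    """Central-difference estimate of the multiplier alpha = df/dx at x."""
    return (f(x + dx) - f(x - dx)) / (2 * dx)

f = lambda x: x ** 2                      # example function; true derivative is 2x
alpha = numerical_derivative(f, 3.0)      # close to 6.0

# Local linearity: a tiny increment in x produces ~alpha times that increment in y
delta_x = 1e-4
delta_y = f(3.0 + delta_x) - f(3.0)       # close to alpha * delta_x
```

The central difference used here is just one way to estimate 𝛼; automatic differentiation tools compute it exactly.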
Scalar function of scalar argument
(Figure: 𝑦 = 𝑓(𝑥) with increments Δ𝑥 and Δ𝑦 marked)
• The derivative will be 0 where the function is locally flat (neither increasing nor decreasing)
Multivariate scalar function:
Scalar function of vector argument
• 𝑦 = 𝑓(𝐱), where 𝐱 = [𝑥₁, …, 𝑥ₙ]ᵀ is now a vector. Note: Δ𝐱 = [Δ𝑥₁, …, Δ𝑥ₙ]ᵀ is now also a vector
    Δ𝑦 = 𝛼Δ𝐱
• Giving us that 𝛼 is a row vector, 𝛼 = [𝛼₁ ⋯ 𝛼ₙ], so that
    Δ𝑦 = 𝛼₁Δ𝑥₁ + 𝛼₂Δ𝑥₂ + ⋯ + 𝛼ₙΔ𝑥ₙ
• The partial derivative 𝛼ᵢ gives us how 𝑦 increments when only 𝑥ᵢ is incremented
• Often represented as
    Δ𝑦 = (∂𝑦/∂𝑥₁)Δ𝑥₁ + (∂𝑦/∂𝑥₂)Δ𝑥₂ + ⋯ + (∂𝑦/∂𝑥ₙ)Δ𝑥ₙ
Multivariate scalar function:
Scalar function of vector argument
• Note: Δ𝐱 = [Δ𝑥₁, …, Δ𝑥ₙ]ᵀ is a vector, and
    Δ𝑦 = 𝛻𝐱𝑦 Δ𝐱
• Where
    𝛻𝐱𝑦 = [∂𝑦/∂𝑥₁ ⋯ ∂𝑦/∂𝑥ₙ]
  – We will be using this symbol for vector and matrix derivatives
  o You may be more familiar with the term “gradient”, which is actually defined as the transpose of the derivative
Gradient of a scalar function of a vector
• The derivative is the row vector 𝑑𝑓(𝑋)/𝑑𝑋
• The gradient is the transpose of the derivative:
    𝛻𝑓(𝑋) = (𝑑𝑓(𝑋)/𝑑𝑋)ᵀ
  – A column vector of the same dimensionality as 𝑋
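The row-vector derivative and its column-vector transpose can be made concrete with a small numerical sketch (the helper name and the example function are illustrative, not from the slides):

```python
import numpy as np

def numerical_gradient(f, x, dx=1e-6):
    """Central-difference estimate of the gradient of a scalar f at vector x.
    Each entry is the partial derivative of f w.r.t. one component of x."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[i] = dx
        grad[i] = (f(x + step) - f(x - step)) / (2 * dx)
    return grad

f = lambda x: x[0] ** 2 + 3 * x[1]          # example; true gradient is [2*x0, 3]
g = numerical_gradient(f, np.array([1.0, 5.0]))
```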
Gradient of a scalar function of a vector
• The gradient 𝛻𝑓(𝑋) is the transpose of the derivative 𝑑𝑓(𝑋)/𝑑𝑋
  – A column vector of the same dimensionality as 𝑋
• The expression 𝑑𝑓(𝑋) = (𝑑𝑓(𝑋)/𝑑𝑋)𝑑𝑋 is a vector inner product. To understand its behavior, let us consider a well-known property of inner products
A well-known vector property
(Figure: 𝐮ᵀ𝐯 plotted vs. 𝜃)
    𝐮ᵀ𝐯 = |𝐮||𝐯| cos 𝜃
• The inner product between two vectors of fixed lengths is maximum when the two vectors are aligned
  – i.e. when 𝜃 = 0
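A quick numerical check of this property: sweeping a unit vector through all angles, the inner product with a fixed 𝐮 peaks when the unit vector points along 𝐮 (the vector (3, 4) is an arbitrary example):

```python
import math

u = (3.0, 4.0)                       # a fixed vector; |u| = 5
angles = [k * math.pi / 180 for k in range(360)]

def inner_with_unit(theta):
    # inner product of u with the unit vector [cos(theta), sin(theta)]
    return u[0] * math.cos(theta) + u[1] * math.sin(theta)

best = max(angles, key=inner_with_unit)   # angle maximizing the inner product
u_angle = math.atan2(u[1], u[0])          # direction of u itself
```

The maximizing angle agrees with the direction of 𝐮 (to within the 1-degree sweep resolution), and the maximum value equals |𝐮| since |𝐯| = 1.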
Properties of Gradient
(Figure: 𝛻𝑓(𝑋)𝑑𝑋 plotted vs. the angle of 𝑑𝑋; the blue arrow is 𝛻𝑓(𝑋)ᵀ)
• 𝑑𝑓(𝑋) = 𝛻𝑓(𝑋)𝑑𝑋
• For an increment 𝑑𝑋 of any given length, 𝑑𝑓(𝑋) is max if 𝑑𝑋 is aligned with 𝛻𝑓(𝑋)ᵀ
  – The function 𝑓(𝑋) increases most rapidly if the input increment 𝑑𝑋 is exactly in the direction of 𝛻𝑓(𝑋)ᵀ
• The gradient is the direction of fastest increase in 𝑓(𝑋)
Gradient
(Figure: a surface 𝑓(𝑋) with the gradient vector 𝛻𝑓(𝑋)ᵀ drawn at a point)
• Moving in the direction of the gradient vector 𝛻𝑓(𝑋)ᵀ increases 𝑓(𝑋) fastest
• Moving in the direction of −𝛻𝑓(𝑋)ᵀ decreases 𝑓(𝑋) fastest
• At the peaks and troughs of the function, the gradient is 0
Properties of Gradient: 2
• The matrix of second derivatives (the Hessian) of 𝑓(𝑥₁, …, 𝑥ₙ):

    𝛻²𝑓(𝑥₁, …, 𝑥ₙ) =
    [ ∂²𝑓/∂𝑥₁²       ∂²𝑓/∂𝑥₁∂𝑥₂   ⋯   ∂²𝑓/∂𝑥₁∂𝑥ₙ ]
    [ ∂²𝑓/∂𝑥₂∂𝑥₁    ∂²𝑓/∂𝑥₂²      ⋯   ∂²𝑓/∂𝑥₂∂𝑥ₙ ]
    [      ⋮               ⋮         ⋱        ⋮       ]
    [ ∂²𝑓/∂𝑥ₙ∂𝑥₁    ∂²𝑓/∂𝑥ₙ∂𝑥₂   ⋯   ∂²𝑓/∂𝑥ₙ²     ]
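The Hessian can also be estimated by finite differences; a sketch with an example quadratic whose Hessian is a constant matrix:

```python
import numpy as np

def numerical_hessian(f, x, dx=1e-4):
    """Finite-difference estimate of the matrix of second derivatives of f at x."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = dx
            e_j = np.zeros(n); e_j[j] = dx
            # mixed central difference for d^2 f / dx_i dx_j
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * dx * dx)
    return H

# Example: f(x) = x0^2 + x0*x1 + 2*x1^2 has constant Hessian [[2, 1], [1, 4]]
f = lambda x: x[0] ** 2 + x[0] * x[1] + 2 * x[1] ** 2
H = numerical_hessian(f, np.array([0.3, -0.7]))
```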
Poll 1
• Select all that are true about derivatives of a scalar function f(X) of
multivariate inputs
– At any location X, there may be many directions in which we can step, such
that f(X) increases
– The direction of the gradient is the direction in which the function increases
fastest
– The gradient is the derivative of f(X) w.r.t. X
The problem of optimization
(Figure: a curve 𝑓(𝑥) with a global maximum, an inflection point, a local minimum, and the global minimum marked)
• General problem of optimization: Given a function 𝑓(𝑥) of some variable 𝑥, find the value of 𝑥 at which 𝑓(𝑥) is minimum
Finding the minimum of a function
• The minimum of 𝑓(𝑥) occurs where
    𝑑𝑦/𝑑𝑥 = 0
Derivatives of a curve
(Figure: 𝑓(𝑥) and its derivative 𝑓′(𝑥) plotted against 𝑥)
Derivative of the derivative of the
curve
(Figure: 𝑓(𝑥), 𝑓′(𝑥), and 𝑓″(𝑥) plotted against 𝑥)
• Find the value 𝑥* at which 𝑓′(𝑥) = 0: Solve
    𝑑𝑓(𝑥)/𝑑𝑥 = 0
• The solution 𝑥* is a turning point
• Check the double derivative at 𝑥*: compute
    𝑓″(𝑥*) = 𝑑𝑓′(𝑥*)/𝑑𝑥
• If 𝑓″(𝑥*) is positive, 𝑥* is a minimum
  – If it is negative, it is a maximum
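The turning-point test can be sketched numerically (the helper is hypothetical; 𝑓(𝑥) = 𝑥³ − 3𝑥 has turning points at 𝑥 = ±1):

```python
def critical_point_type(f, x, dx=1e-5):
    """Classify a point where f'(x) ~ 0 using the sign of the second derivative."""
    d1 = (f(x + dx) - f(x - dx)) / (2 * dx)          # first derivative estimate
    d2 = (f(x + dx) - 2 * f(x) + f(x - dx)) / (dx * dx)  # second derivative estimate
    assert abs(d1) < 1e-6, "not a turning point"
    if d2 > 0:
        return "minimum"
    elif d2 < 0:
        return "maximum"
    return "inconclusive (possible inflection point)"

# f(x) = x^3 - 3x has f'(x) = 3x^2 - 3 = 0 at x = +1 (minimum) and x = -1 (maximum)
f = lambda x: x ** 3 - 3 * x
```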
A note on derivatives of functions of
single variable
• All locations with zero derivative are critical points
  – These can be local maxima, local minima, or inflection points
(Figure: 𝑓(𝑥) with a maximum, a minimum, and an inflection point marked as critical points)
• Gradient (example function of three variables):
    𝛻ₓ𝑓ᵀ = [2𝑥₁ − 1 + 𝑥₂,   𝑥₁ + 2𝑥₂ + 𝑥₃,   𝑥₂ + 2𝑥₃ + 1]ᵀ
Unconstrained Minimization of
function (Example)
• Set the gradient to null:
    𝛻ₓ𝑓 = 0  ⇒
    2𝑥₁ − 1 + 𝑥₂ = 0
    𝑥₁ + 2𝑥₂ + 𝑥₃ = 0
    𝑥₂ + 2𝑥₃ + 1 = 0
  – Solving this system of equations gives the candidate minimum
(Figure: 𝑓(𝑋) plotted against 𝑋, with iterates 𝑥⁰, 𝑥¹, 𝑥², 𝑥³, 𝑥⁴, 𝑥⁵ marked)
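Assuming the gradient reconstructs to the linear system in the comments below, setting it to zero can be solved in closed form with NumPy. (This direct solve is exactly what is usually not possible for neural networks, which motivates the iterative methods that follow.)

```python
import numpy as np

# Setting the gradient to zero gives a linear system A x = b (as reconstructed):
#   2*x1 + x2       = 1
#   x1 + 2*x2 + x3  = 0
#   x2 + 2*x3       = -1
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 0.0, -1.0])
x_star = np.linalg.solve(A, b)    # candidate optimum
```

Here A is also the Hessian of the quadratic; since it is positive definite, the solution is a minimum.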
• Iterative solutions:
  – Start from an initial guess 𝑋⁰ for the optimal 𝑋
  – Update the guess towards a (hopefully) “better” value of 𝑓(𝑋)
  – Stop when 𝑓(𝑋) no longer decreases
• Problems:
  – Which direction to step in?
  – How big must the steps be?
The Approach of Gradient Descent
• Iterative solution:
  – Start at some point
  – Find the direction in which to shift this point to decrease error
    • This can be found from the derivative of the function:
      – A negative derivative means moving right decreases error
      – A positive derivative means moving left decreases error
  – Shift the point in this direction
Poll 3: Multivariate functions
• Select all that are true about derivatives of a
scalar function f(X) of multivariate inputs
– At any location X, there may be many directions
in which we can step, such that f(X) increases
– The direction of the gradient is the direction in
which the function increases fastest
– The gradient is the derivative of f(X) w.r.t. X
Gradients of multivariate functions
• The gradient vector 𝛻𝑓(𝑋)ᵀ points in the direction in which 𝑓(𝑋) increases fastest
• −𝛻𝑓(𝑋)ᵀ points in the direction in which 𝑓(𝑋) decreases fastest
Gradient descent/ascent (multivariate)
• Or, alternatively, stop when the gradient becomes sufficiently small:
    ‖𝛻ₓ𝑓(𝑥ᵏ)‖² < 𝜀
Overall Gradient Descent Algorithm
• Initialize:
  – 𝑥⁰
  – 𝑘 = 0
• do
  – 𝑥ᵏ⁺¹ = 𝑥ᵏ − 𝜂ᵏ 𝛻ₓ𝑓(𝑥ᵏ)ᵀ
  – 𝑘 = 𝑘 + 1
• while |𝑓(𝑥ᵏ) − 𝑓(𝑥ᵏ⁻¹)| > 𝜀
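The loop above translates almost line for line into Python; a sketch on a simple convex example (the function, step size, and tolerance are chosen for illustration):

```python
import numpy as np

def gradient_descent(f, grad_f, x0, eta=0.1, eps=1e-10, max_iters=10_000):
    """x^{k+1} = x^k - eta * grad f(x^k), stopping when f stops decreasing."""
    x = np.asarray(x0, dtype=float)
    f_prev = np.inf
    for _ in range(max_iters):
        f_curr = f(x)
        if abs(f_prev - f_curr) <= eps:
            break
        x = x - eta * grad_f(x)      # step against the gradient
        f_prev = f_curr
    return x

# Example: f(x) = ||x - c||^2 is convex with minimum at c; grad f(x) = 2*(x - c)
c = np.array([1.0, -2.0])
f = lambda x: float(np.sum((x - c) ** 2))
grad_f = lambda x: 2.0 * (x - c)
x_min = gradient_descent(f, grad_f, x0=[0.0, 0.0])
```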
Convergence of Gradient Descent
• For an appropriate step size, for convex (bowl-shaped) functions gradient descent will always find the minimum
• For non-convex functions it will find a local minimum or an inflection point
Poll 4
• y = f(x) is a scalar function of an Nx1 column vector variable
x. Starting from x = x0, in which direction must we move in
the space of x, to achieve the maximum decrease in f()?
– Exactly in the direction of the gradient of f(x) at x0
– Exactly perpendicular to the direction of the gradient of f(x) at x0
– Exactly opposite to the direction of the gradient of f(x) at x0
– Exactly perpendicular to the direction of the gradient of f(x) at
x0.
• Returning to our problem from our detour…
Problem Statement
• Given a training set of input-output pairs (𝑿₁, 𝒅₁), (𝑿₂, 𝒅₂), …, (𝑿_T, 𝒅_T), minimize 𝐿𝑜𝑠𝑠(𝑊) by gradient descent:
• do
  – 𝑊ᵏ⁺¹ = 𝑊ᵏ − 𝜂ᵏ 𝛻𝐿𝑜𝑠𝑠(𝑊ᵏ)ᵀ
  – 𝑘 = 𝑘 + 1
• while |𝐿𝑜𝑠𝑠(𝑊ᵏ) − 𝐿𝑜𝑠𝑠(𝑊ᵏ⁻¹)| > 𝜀
Preliminaries
• Before we proceed: the problem setup
Problem Setup: Things to define
• Given a training set of input-output pairs (𝑿₁, 𝒅₁), (𝑿₂, 𝒅₂), …, (𝑿_T, 𝒅_T)
• Minimize the following function
    𝐿𝑜𝑠𝑠(𝑊) = (1/𝑇) Σᵢ 𝑑𝑖𝑣(𝑓(𝑋ᵢ; 𝑊), 𝑑ᵢ)
• Things to define: What are the input-output pairs? What is 𝑓() and what are its parameters 𝑊? What is the divergence 𝑑𝑖𝑣()?
What is f()? Typical network
(Figure: a network with input units, hidden units, and output units)
• Multi-layer perceptron
• A directed network with a set of inputs and outputs
  – No loops
Typical network
(Figure: input layer, hidden layers, output layer)
• Each unit computes 𝑦 = 𝑓(Σᵢ 𝑤ᵢ𝑥ᵢ + 𝑏) of the pre-activation 𝑧 = Σᵢ 𝑤ᵢ𝑥ᵢ + 𝑏
  – Parameters are the weights 𝑤ᵢ and the bias 𝑏
• Common activations 𝑓(𝑧):
  – ReLU: 𝑓(𝑧) = 𝑧 if 𝑧 ≥ 0, else 0;   𝑓′(𝑧) = 1 if 𝑧 ≥ 0, else 0
  – Softplus: 𝑓(𝑧) = log(1 + exp(𝑧));   𝑓′(𝑧) = 1/(1 + exp(−𝑧))
  – Softmax (output layer): 𝑦ᵢ = exp(𝑧ᵢ) / Σⱼ exp(𝑧ⱼ)
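The activations listed above can be sketched with NumPy (the max-subtraction in softmax and the rewritten softplus are standard numerical-stability tricks, not shown on the slide):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)                  # f(z) = z if z >= 0, else 0

def softplus(z):
    # log(1 + e^z) rewritten to avoid overflow for large |z|
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # also the derivative of softplus

def softmax(z):
    e = np.exp(z - np.max(z))                  # subtract max for stability
    return e / e.sum()

z = np.array([1.0, -2.0, 0.5])
p = softmax(z)                                 # sums to 1
```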
Multiplicative combination: Can be
viewed as a case of vector activations
(Figure: x → z → y)
• Each 𝑧ⱼ = Σᵢ 𝑤ᵢⱼ𝑥ᵢ + 𝑏ⱼ, and the output is their product: 𝑦 = Πⱼ 𝑧ⱼ
• Parameters are the weights 𝑤ᵢⱼ and biases 𝑏ⱼ
• A layer of multiplicative combination is a special case of vector activation
Typical network
(Figure: input layer, hidden layers, output layer)

Notation
(Figure: network with inputs 𝑥ᵢ, layer-wise weights 𝑤⁽ᵏ⁾, biases 𝑏⁽ᵏ⁾, and layer outputs 𝑦⁽ᵏ⁾)
Input, target output, and actual output:
Vector notation
• The input 𝑥 and outputs are vectors of numbers
  – (or may even be just a scalar, if the input layer is of size 1)
  – E.g. a vector of pixel values
  – E.g. a vector of speech features
  – E.g. a real-valued vector representing text
    • We will see how this happens later in the course
  – Other real-valued vectors
Representing the output
(Figure: input layer, hidden layers, output layer)
Representing the output
• The output unit is a sigmoid: 𝜎(𝑧) = 1 / (1 + 𝑒⁻ᶻ)
• If the desired output is binary (is this a cat or not), use a simple 1/0 representation of the desired output
  – 1 = Yes, it’s a cat
  – 0 = No, it’s not a cat
• Sometimes represented by two outputs, one representing the desired output, the other representing the negation of the desired output
  – Yes: [1 0]
  – No: [0 1]
• The output explicitly becomes a 2-output softmax
Multi-class output: One-hot
representations
• Consider a network that must distinguish if an input is a cat, a dog, a camel, a hat,
or a flower
• We can represent this set as the following vector, with the classes arranged in a
chosen order:
    [cat dog camel hat flower]ᵀ
• For inputs of each of the five classes the desired output is:
    cat:    [1 0 0 0 0]ᵀ
    dog:    [0 1 0 0 0]ᵀ
    camel:  [0 0 1 0 0]ᵀ
    hat:    [0 0 0 1 0]ᵀ
    flower: [0 0 0 0 1]ᵀ
• For an input of any class, we will have a five-dimensional vector output with four zeros and a single 1 at the position of that class
• This is a one-hot vector
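One-hot encoding is a one-liner; a sketch with the five classes from the example:

```python
def one_hot(label, classes):
    """Return the one-hot vector for `label`, given an ordered class list."""
    return [1 if c == label else 0 for c in classes]

classes = ["cat", "dog", "camel", "hat", "flower"]
d = one_hot("camel", classes)    # single 1 at the position of the class
```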
Multi-class networks
(Figure: input layer, hidden layers, softmax output layer)
• The output layer is a softmax:
    𝑦ᵢ = exp(𝑧ᵢ) / Σⱼ exp(𝑧ⱼ)
• This can be viewed as the probability 𝑦ᵢ = 𝑃(𝑐𝑙𝑎𝑠𝑠 = 𝑖|𝑋)
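A sketch of the softmax output read as class probabilities (the logits here are arbitrary):

```python
import math

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j), computed with the max subtracted for stability."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

z = [2.0, 1.0, 0.1]
y = softmax(z)                  # a probability distribution over classes
predicted = y.index(max(y))     # treat y_i as P(class = i | X); pick the argmax
```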
Inputs and outputs:
Typical Problem Statement
(Figure: training pairs of images and binary labels, e.g. (image, 0), (image, 1); input: vector of pixel values, output: sigmoid)
Typical Problem statement:
multiclass classification
(Figure: training data of (image, digit-label) pairs, e.g. (image, 5), (image, 2); the network has an input layer, hidden layers, and a softmax output layer. Input: vector of pixel values. Output: class probabilities)
Problem Setup: Things to define
• Given a training set of input-output pairs (𝑿₁, 𝒅₁), (𝑿₂, 𝒅₂), …, (𝑿_T, 𝒅_T)
• What is the divergence 𝑑𝑖𝑣()?
  – Note: For 𝐿𝑜𝑠𝑠(𝑊) to be differentiable w.r.t. 𝑊, 𝑑𝑖𝑣() must be differentiable
Examples of divergence functions
(Figure: desired outputs 𝑑₁ 𝑑₂ 𝑑₃ 𝑑₄ compared to the network outputs via an L2 divergence or a KL divergence)
• For a binary classifier with scalar output 𝑌 ∈ (0,1) and 𝑑 ∈ {0,1}, the Kullback-Leibler (KL) divergence between the probability distribution [𝑌, 1 − 𝑌] and the ideal output probability [𝑑, 1 − 𝑑] is popular:
    𝐷𝑖𝑣(𝑌, 𝑑) = −𝑑 log 𝑌 − (1 − 𝑑) log(1 − 𝑌)
  – Minimum when 𝑑 = 𝑌
• Derivative:
    𝑑𝐷𝑖𝑣(𝑌, 𝑑)/𝑑𝑌 = −1/𝑌 if 𝑑 = 1;   1/(1 − 𝑌) if 𝑑 = 0
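A direct transcription of this divergence and its derivative (function names are illustrative):

```python
import math

def binary_div(y, d):
    """Div(Y, d) = -d log Y - (1 - d) log(1 - Y), for Y in (0,1), d in {0,1}."""
    return -d * math.log(y) - (1 - d) * math.log(1 - y)

def binary_div_deriv(y, d):
    """dDiv/dY: -1/Y when d = 1, 1/(1 - Y) when d = 0."""
    return -1.0 / y if d == 1 else 1.0 / (1.0 - y)
```

Outputs closer to the target give a smaller divergence, and the derivative pushes 𝑌 toward 𝑑 (negative when 𝑑 = 1, positive when 𝑑 = 0).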
KL vs L2
(Figure: 𝐿2(𝑌, 𝑑) = (𝑌 − 𝑑)² and 𝐾𝐿(𝑌, 𝑑) = −𝑑 log 𝑌 − (1 − 𝑑) log(1 − 𝑌), each plotted against 𝑌 for 𝑑 = 0 and 𝑑 = 1)
• KL derivative:
    𝑑𝐾𝐿𝐷𝑖𝑣(𝑌, 𝑑)/𝑑𝑌 = −1/𝑌 if 𝑑 = 1;   1/(1 − 𝑌) if 𝑑 = 0
• Note: when 𝑌 = 𝑑 the derivative is not 0, even though 𝑑𝑖𝑣() = 0 (its minimum) when 𝑌 = 𝑑
For multi-class classification
(Figure: one-hot desired outputs 𝑑₁ 𝑑₂ 𝑑₃ 𝑑₄ compared to the softmax output via a KL divergence)
• Desired output 𝑑 is a one-hot vector [0 0 … 1 … 0 0 0] with the 1 in the 𝑐-th position (for class 𝑐)
• Actual output will be a probability distribution [𝑦₁, 𝑦₂, …]
• The KL divergence between the desired one-hot output and actual output:
    𝐷𝑖𝑣(𝑌, 𝑑) = Σᵢ 𝑑ᵢ log(𝑑ᵢ/𝑦ᵢ) = Σᵢ 𝑑ᵢ log 𝑑ᵢ − Σᵢ 𝑑ᵢ log 𝑦ᵢ = −log 𝑦_c
• Derivative:
    𝑑𝐷𝑖𝑣(𝑌, 𝑑)/𝑑𝑦ᵢ = −1/𝑦_c for the 𝑐-th component;   0 for the remaining components
    𝛻_Y 𝐷𝑖𝑣(𝑌, 𝑑) = [0 0 … −1/𝑦_c … 0 0]
  – The slope is negative w.r.t. 𝑦_c: increasing 𝑦_c will reduce the divergence
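Because the target is one-hot, both the divergence and its gradient collapse to expressions involving only 𝑦_c; a sketch (function names are illustrative):

```python
import math

def kl_one_hot_div(y, c):
    """KL divergence against a one-hot target with the 1 in position c:
    Div(Y, d) = -log y_c (the sum collapses to a single term)."""
    return -math.log(y[c])

def kl_one_hot_grad(y, c):
    """Gradient w.r.t. y: -1/y_c in position c, 0 elsewhere."""
    return [-1.0 / y[i] if i == c else 0.0 for i in range(len(y))]

y = [0.7, 0.2, 0.1]      # softmax output
```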
For multi-class classification
• Note: when 𝑌 = 𝑑 the derivative is not 0, even though 𝑑𝑖𝑣() = 0 (its minimum) when 𝑌 = 𝑑
KL divergence vs cross entropy
• KL divergence between 𝑑 and 𝑦:
    𝐾𝐿(𝑌, 𝑑) = Σᵢ 𝑑ᵢ log 𝑑ᵢ − Σᵢ 𝑑ᵢ log 𝑦ᵢ
• Cross entropy:
    𝑋𝑒𝑛𝑡(𝑌, 𝑑) = −Σᵢ 𝑑ᵢ log 𝑦ᵢ
  – The two differ only by Σᵢ 𝑑ᵢ log 𝑑ᵢ, which does not depend on the network output, so they have the same minimizer and the same derivative w.r.t. 𝑦
• Derivative with label-smoothed targets (probability 𝜖 on each wrong class, 1 − (𝐾 − 1)𝜖 on the correct class 𝑐):
    𝑑𝐷𝑖𝑣(𝑌, 𝑑)/𝑑𝑦ᵢ = −(1 − (𝐾 − 1)𝜖)/𝑦_c for the 𝑐-th component;   −𝜖/𝑦ᵢ for the remaining components
“Label smoothing”
(Figure: smoothed targets 𝑑₁ 𝑑₂ 𝑑₃ 𝑑₄ compared to the softmax outputs of the network via a KL divergence)
Poll 5
• Select all that are correct
– The gradient of the loss will always be 0 or close
to 0 at a minimum
– The gradient of the loss may be 0 or close to 0 at a
minimum
– The gradient of the loss may have large magnitude
at a minimum
– If the gradient is not 0 at a minimum, it must be a
local minimum
Story so far
• Neural nets are universal approximators
Next Class
• Backpropagation