Advanced Machine Learning
Lecture 3: Linear models
Sandjai Bhulai
Vrije Universiteit Amsterdam
[Link]@[Link]
12 September 2023
Towards a Bayesian framework
Advanced Machine Learning
The frequentist pitfall
p(t | x0, w, β) = 𝒩(t | y(x0, w), β⁻¹)
The frequentist pitfall
▪ Given p(t | x) or p(x, t), directly minimize the expected loss
function
𝔼[L] = ∫∫ L(t, y(x)) p(x, t) dx dt
▪ Natural choice:
𝔼[L] = ∫∫ {y(x) − t}² p(x, t) dx dt
The frequentist pitfall
▪ Given a point x, the expected loss at that point is given by
𝔼[L(t, y(x))] = ∫ {y(x) − t}² p(t | x) dt
▪ Taking the derivative w.r.t. y(x) yields
2 ∫ {y(x) − t} p(t | x) dt
▪ Setting this expression to 0 yields
y(x) = ∫ t p(t | x) dt = 𝔼_t[t | x]
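As a quick numerical check of this result (a sketch of my own, not from the slides; the conditional distribution of t is an illustrative choice), the squared loss at a fixed point is indeed minimized by the conditional mean:

```python
# Minimal sketch: for a fixed x, the expected squared loss is minimized by E[t | x].
# Assumption: t | x ~ N(sin(2*pi*x), 0.25^2), chosen purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
x0 = 0.3
t_samples = np.sin(2 * np.pi * x0) + 0.25 * rng.standard_normal(100_000)

candidates = np.linspace(-1.5, 1.5, 301)
losses = [np.mean((y - t_samples) ** 2) for y in candidates]

print(f"empirical minimizer : {candidates[np.argmin(losses)]:.3f}")
print(f"conditional mean    : {t_samples.mean():.3f}")   # both approx. sin(2*pi*0.3)
```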
The frequentist pitfall
▪ The expected loss function
𝔼[L] = ∫∫ {y(x) − t}² p(x, t) dx dt
▪ Rewrite the integrand as
{y(x) − t}² = {y(x) − 𝔼[t | x] + 𝔼[t | x] − t}²
= {y(x) − 𝔼[t | x]}² + 2{y(x) − 𝔼[t | x]}{𝔼[t | x] − t} + {𝔼[t | x] − t}²
▪ Taking the expected value (the cross term vanishes when integrating over t) yields
𝔼[L] = ∫ {y(x) − 𝔼[t | x]}² p(x) dx + ∫ var[t | x] p(x) dx
The frequentist pitfall
▪ Recall the expected square loss,
𝔼[L] = ∫ {y(x) − h(x)}² p(x) dx + ∫∫ {h(x) − t}² p(x, t) dx dt
where h(x) = 𝔼[t | x] = ∫ t p(t | x) dt
▪ The second term corresponds to the noise inherent in the
random variable t
▪ What about the first term?
The frequentist pitfall
▪ Suppose we were given multiple datasets, each of size N.
Any particular dataset 𝒟 will give a particular function
y(x; 𝒟)
▪ We then have
{y(x; 𝒟) − h(x)}²
= {y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)] + 𝔼_𝒟[y(x; 𝒟)] − h(x)}²
= {y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)]}² + {𝔼_𝒟[y(x; 𝒟)] − h(x)}²
+ 2{y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)]}{𝔼_𝒟[y(x; 𝒟)] − h(x)}
The frequentist pitfall
▪ Taking the expectation over 𝒟 yields:
𝔼_𝒟[{y(x; 𝒟) − h(x)}²]
= {𝔼_𝒟[y(x; 𝒟)] − h(x)}² + 𝔼_𝒟[{y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)]}²]
where the first term is the (bias)² and the second term is the variance
The frequentist pitfall
▪ In conclusion:
expected loss = (bias)² + variance + noise
where
(bias)² = ∫ {𝔼_𝒟[y(x; 𝒟)] − h(x)}² p(x) dx
variance = ∫ 𝔼_𝒟[{y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)]}²] p(x) dx
noise = ∫∫ {h(x) − t}² p(x, t) dx dt
Bias-variance decomposition
▪ Example: 100 datasets from the sinusoidal with 25 data
points, varying the degree of regularization
Bias-variance decomposition
▪ From these plots, we note that an over-regularized model
(large λ) will have a high bias, while an under-regularized
model (small λ) will have a high variance
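A minimal simulation of this behaviour (my own sketch; it uses a regularized degree-9 polynomial basis rather than the Gaussian basis functions of the lecture example, and assumes noise with standard deviation 0.3):

```python
# Sketch: estimate (bias)^2 and variance over 100 datasets of 25 points drawn
# from sin(2*pi*x), fit with a regularized degree-9 polynomial (illustrative setup).
import numpy as np

rng = np.random.default_rng(1)
N, n_datasets, degree = 25, 100, 9
x_test = np.linspace(0, 1, 100)
h = np.sin(2 * np.pi * x_test)                          # h(x) = E[t | x]

def design(x):
    return np.vander(x, degree + 1, increasing=True)    # phi_j(x) = x^j

Phi_test = design(x_test)

for lam in (1e-6, 1e-2, 1.0):
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        x = rng.uniform(0, 1, N)
        t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)
        Phi = design(x)
        w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ t)
        preds[d] = Phi_test @ w
    y_bar = preds.mean(axis=0)                          # E_D[y(x; D)]
    bias2 = np.mean((y_bar - h) ** 2)
    variance = np.mean((preds - y_bar) ** 2)
    print(f"lambda={lam:g}: bias^2={bias2:.4f}, variance={variance:.4f}")
```

Large λ gives a high bias and small variance; small λ gives the reverse, matching the observation above.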
Bias-variance decomposition
▪ These insights are of limited practical value
▪ They are based on averages with respect to ensembles of
datasets
▪ In practice, we have only the single observed dataset
Bias-variance tradeoff
Bayesian linear regression
▪ Bayes’ theorem:
p(Y | X) = p(X | Y) p(Y) / p(X)
▪ Essentially, this leads to
posterior ∝ likelihood × prior
▪ The idea is to place a probability distribution over the weights,
and then update this distribution based on the observed data
Bayesian linear regression
▪ Define a conjugate prior over w
p(w) = 𝒩(w | m0, S0)
▪ Combining this with the likelihood function and using results
for marginal and conditional Gaussian distributions gives the
posterior
p(w | t) = 𝒩(w | mN, SN)
where
mN = SN(S0⁻¹m0 + βΦ⊤t)
SN⁻¹ = S0⁻¹ + βΦ⊤Φ
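These update equations translate directly into code; a minimal sketch (variable names are my own):

```python
# Sketch of the posterior update:
#   S_N^{-1} = S_0^{-1} + beta * Phi^T Phi,   m_N = S_N (S_0^{-1} m_0 + beta * Phi^T t)
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Return the posterior mean m_N and covariance S_N over the weights w."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
    return mN, SN
```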
Bayesian linear regression
▪ A common choice for the prior is
p(w) = 𝒩(w | 0, α⁻¹I)
for which
mN = βSNΦ⊤t
SN⁻¹ = αI + βΦ⊤Φ
▪ Consider the following example to make the concept less
abstract: y(x) = −0.3 + 0.5x
Bayesian linear regression
▪ 0 data points observed
Bayesian linear regression
▪ 1 data point observed
Bayesian linear regression
▪ 2 data points observed
Bayesian linear regression
▪ 20 data points observed
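The sequence shown in these slides can be reproduced with a short script. A sketch, assuming α = 2.0 and noise standard deviation 0.2 (so β = 1/0.2² = 25); the exact values used on the slides are not stated, so these are assumptions:

```python
# Sketch: sequential posterior updates for data from y(x) = -0.3 + 0.5*x.
# Assumed settings (not stated on the slides): alpha = 2.0, noise std 0.2, beta = 25.
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0

m = np.zeros(2)               # prior mean m_0 = 0
S = np.eye(2) / alpha         # prior covariance S_0 = alpha^{-1} I

for n in range(20):
    x = rng.uniform(-1, 1)
    t = -0.3 + 0.5 * x + 0.2 * rng.standard_normal()
    phi = np.array([1.0, x])  # phi(x) = (1, x)^T

    # The current posterior acts as the prior for the next observation.
    S_prior_inv = np.linalg.inv(S)
    S = np.linalg.inv(S_prior_inv + beta * np.outer(phi, phi))
    m = S @ (S_prior_inv @ m + beta * t * phi)

    if n + 1 in (1, 2, 20):
        print(f"after {n + 1:2d} point(s): m_N = {np.round(m, 3)}")  # tends to [-0.3, 0.5]
```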
Bayesian linear regression
▪ A common choice for the prior is
p(w) = 𝒩(w | 0, α⁻¹I)
for which
mN = βSNΦ⊤t
SN⁻¹ = αI + βΦ⊤Φ
▪ What is the log of the posterior distribution, i.e., ln p(w | t)?
ln p(w | t) = ln [p(t | w) p(w) / p(t)]
= −(β/2) Σ_{n=1}^{N} {tn − w⊤φ(xn)}² − (α/2) w⊤w + const
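Maximizing this log posterior is equivalent to minimizing the sum-of-squares error with a quadratic regularizer and regularization coefficient λ = α/β, so mN coincides with the ridge-regression solution. A quick numerical check (a sketch with synthetic data):

```python
# Sketch: the posterior mean equals the regularized least-squares (ridge) solution
# with lambda = alpha / beta.
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0
Phi = rng.standard_normal((50, 4))        # synthetic design matrix (illustrative)
t = rng.standard_normal(50)               # synthetic targets

SN = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

ridge = np.linalg.solve(Phi.T @ Phi + (alpha / beta) * np.eye(4), Phi.T @ t)
print(np.allclose(mN, ridge))             # True
```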
Bayesian linear regression
▪ A common choice for the prior is
p(w) = 𝒩(w | 0, α⁻¹I)
for which
mN = βSNΦ⊤t
SN⁻¹ = αI + βΦ⊤Φ
▪ What if we have no prior information, i.e., α → 0?
mN → (Φ⊤Φ)⁻¹Φ⊤t, the maximum likelihood solution
Bayesian linear regression
▪ A common choice for the prior is
p(w) = 𝒩(w | 0, α⁻¹I)
for which
mN = βSNΦ⊤t
SN⁻¹ = αI + βΦ⊤Φ
▪ What if we have precise prior information, i.e., α → ∞?
mN → 0
Bayesian linear regression
▪ A common choice for the prior is
p(w) = 𝒩(w | 0, α⁻¹I)
for which
mN = βSNΦ⊤t
SN⁻¹ = αI + βΦ⊤Φ
▪ What if we have infinite data, i.e., N → ∞?
lim_{N→∞} mN = (Φ⊤Φ)⁻¹Φ⊤t
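These limiting cases are easy to verify numerically (a sketch with synthetic data; the α values are chosen purely for illustration):

```python
# Sketch: m_N approaches the maximum likelihood solution as alpha -> 0
# and shrinks towards 0 as alpha -> infinity.
import numpy as np

rng = np.random.default_rng(4)
beta = 25.0
Phi = rng.standard_normal((50, 4))
t = rng.standard_normal(50)

w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)     # (Phi^T Phi)^{-1} Phi^T t

def m_N(alpha):
    SN = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
    return beta * SN @ Phi.T @ t

print(np.round(m_N(1e-8), 4), np.round(w_ml, 4))   # nearly identical
print(np.round(m_N(1e8), 6))                       # nearly zero
```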
Bayesian linear regression
▪ Predict t for new values of x by integrating over w
p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw = 𝒩(t | mN⊤φ(x), σN²(x))
where
σN²(x) = 1/β + φ(x)⊤SNφ(x)
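In code, the predictive mean and variance at a new input follow directly from mN and SN; a minimal sketch (the argument `phi_x` stands for the basis-function vector φ(x)):

```python
# Sketch: predictive mean m_N^T phi(x) and variance sigma_N^2(x) = 1/beta + phi(x)^T S_N phi(x).
import numpy as np

def predictive(phi_x, mN, SN, beta):
    """phi_x is the basis-function vector phi(x) at the new input."""
    mean = mN @ phi_x
    var = 1.0 / beta + phi_x @ SN @ phi_x   # noise variance + parameter uncertainty
    return mean, var
```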
Bayesian linear regression
▪ Sinusoidal data, 9 Gaussian basis functions: 1 data point
Bayesian linear regression
▪ Sinusoidal data, 9 Gaussian basis functions: 2 data points
Bayesian linear regression
▪ Sinusoidal data, 9 Gaussian basis functions: 4 data points
Bayesian linear regression
▪ Sinusoidal data, 9 Gaussian basis functions: 25 data points
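The example in these slides can be reproduced end to end. A sketch, assuming evenly spaced basis centres on [0, 1] with width 0.1, α = 2.0 and β = 25 (the slides' exact settings are not given):

```python
# Sketch: Bayesian linear regression on sinusoidal data with 9 Gaussian basis functions.
# Assumed settings: centres evenly spaced on [0, 1], width s = 0.1, alpha = 2.0, beta = 25.
import numpy as np

rng = np.random.default_rng(5)
alpha, beta = 2.0, 25.0
centres, s = np.linspace(0, 1, 9), 0.1

def design(x):
    """N x 9 design matrix with phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * s ** 2))

for N in (1, 2, 4, 25):
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)
    Phi = design(x)
    SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
    mN = beta * SN @ Phi.T @ t

    phi_new = design(np.array([0.5]))[0]               # predict at x = 0.5
    mean = mN @ phi_new
    std = np.sqrt(1.0 / beta + phi_new @ SN @ phi_new)
    print(f"N={N:2d}: predictive mean {mean:+.3f}, std {std:.3f} at x = 0.5")
```

As more data points are observed, the predictive standard deviation shrinks towards the noise level, mirroring the narrowing bands in the figures.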
Conclusions
▪ The use of maximum likelihood, or equivalently, least
squares, can lead to severe overfitting if complex models are
trained using data sets of limited size
▪ A Bayesian approach to machine learning avoids the
overfitting and also quantifies the uncertainty in model
parameters
Linear models for regression
Advanced Machine Learning
Linear regression
▪ General model is:
y(x, w) = Σ_{j=0}^{M−1} wj φj(x) = w⊤φ(x)
▪ φj(x) = x^j
▪ Take M = 2
▪ Calculate wML = (Φ⊤Φ)⁻¹Φ⊤t
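For M = 2 with polynomial basis functions this is a straight-line fit; a minimal sketch of the closed-form solution on synthetic data:

```python
# Sketch: closed-form least squares w_ML = (Phi^T Phi)^{-1} Phi^T t
# for M = 2 with phi_0(x) = 1, phi_1(x) = x (synthetic data for illustration).
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 30)
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(30)

Phi = np.column_stack([np.ones_like(x), x])        # N x 2 design matrix
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)     # stable solve of the normal equations
print(np.round(w_ml, 2))                           # approximately [1.0, 2.0]
```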
Linear regression
▪ Thus, we have y(x, w) = w0 + w1x
▪ Performance is measured by
E(w) = 1/(2N) Σ_{n=1}^{N} {y(xn, w) − tn}²
▪ Goal: min_{w0, w1} E(w0, w1)
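A small sketch of this error function for the straight-line model:

```python
# Sketch: E(w) = 1/(2N) * sum_n {y(x_n, w) - t_n}^2 for y(x, w) = w0 + w1 * x.
import numpy as np

def cost(w0, w1, x, t):
    residuals = (w0 + w1 * x) - t
    return np.mean(residuals ** 2) / 2.0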
Gradient descent
▪ Gradient descent algorithm:
repeat until convergence {
    wj := wj − α ∂E(w0, w1)/∂wj
}
Gradient descent
▪ Correct update:
temp0 := w0 − α ∂E(w0, w1)/∂w0
temp1 := w1 − α ∂E(w0, w1)/∂w1
w0 := temp0
w1 := temp1
Gradient descent
▪ Incorrect update:
temp0 := w0 − α ∂E(w0, w1)/∂w0
w0 := temp0
temp1 := w1 − α ∂E(w0, w1)/∂w1
w1 := temp1
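In code, the correct (simultaneous) update computes both temporaries from the old parameter values before assigning either; the incorrect version would update w1 using an already-modified w0. A sketch for the straight-line model with analytic gradients:

```python
# Sketch: one simultaneous gradient-descent step for the straight-line model.
# Both temporaries are computed from the OLD (w0, w1) before either is assigned.
import numpy as np

def step(w0, w1, x, t, lr):
    err = (w0 + w1 * x) - t        # y(x_n, w) - t_n
    grad0 = np.mean(err)           # dE/dw0
    grad1 = np.mean(err * x)       # dE/dw1
    temp0 = w0 - lr * grad0
    temp1 = w1 - lr * grad1
    return temp0, temp1            # assign w0, w1 only after both are computed
```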
Gradient descent
▪ Feature scaling is important
Gradient descent
▪ Step size is important for convergence
Gradient descent
▪ Convexity of the problem is important for global optimality
Gradient descent for linear regression
▪ General model is:
y(x, w) = Σ_{j=0}^{M−1} wj φj(x) = w⊤φ(x)
▪ Repeat {
    wj := wj − α (1/N) Σ_{n=1}^{N} (y(xn, w) − tn) φj(xn)
}
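Putting the pieces together, a compact sketch of batch gradient descent with this update rule (the basis functions, learning rate, and synthetic data are chosen for illustration):

```python
# Sketch: batch gradient descent for linear regression with basis functions,
#   w_j := w_j - alpha * (1/N) * sum_n (y(x_n, w) - t_n) * phi_j(x_n)
import numpy as np

def gradient_descent(Phi, t, lr=0.1, n_iters=5000):
    """Phi is the N x M design matrix with entries phi_j(x_n); t is the target vector."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iters):
        err = Phi @ w - t                  # y(x_n, w) - t_n for all n
        w = w - lr * (Phi.T @ err) / N     # simultaneous update of all w_j
    return w

# Example: recover the line t = 1 + 2x from noisy data.
rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 100)
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(100)
Phi = np.column_stack([np.ones_like(x), x])
print(np.round(gradient_descent(Phi, t), 2))       # approximately [1.0, 2.0]
```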