The document discusses linear models and the frequentist approach. It describes how minimizing the expected squared loss leads to predicting the conditional mean, but this ignores model uncertainty. It then introduces the bias-variance decomposition, showing that the expected loss is the sum of squared bias, variance, and noise: over-regularization increases bias, while under-regularization increases variance. Bayesian linear regression is presented as an alternative that places a prior over the model parameters and updates it based on the observed data.

Advanced Machine Learning

Lecture 3: Linear models


Sandjai Bhulai
Vrije Universiteit Amsterdam

[Link]@[Link]
12 September 2023
Towards a Bayesian framework

Advanced Machine Learning


The frequentist pitfall

p(t | x0, w, β) = 𝒩(t | y(x0, w), β⁻¹)
The frequentist pitfall
▪ Given p(t | x) or p(t, x), directly minimize the expected loss function

  𝔼[L] = ∫∫ L(t, y(x)) p(x, t) dx dt

▪ A natural choice is the squared loss:

  𝔼[L] = ∫∫ {y(x) − t}² p(x, t) dx dt
The frequentist pitfall
▪ Given a point x, the expected loss at that point is given by

  𝔼[L(t, y(x))] = ∫ {y(x) − t}² p(t | x) dt

▪ Taking the derivative w.r.t. y(x) yields

  2 ∫ {y(x) − t} p(t | x) dt

▪ Setting this expression to 0 yields

  ∫ y(x) p(t | x) dt = ∫ t p(t | x) dt, so y(x) = ∫ t p(t | x) dt = 𝔼_t[t | x]
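To make this concrete, here is a small numerical check (a sketch of my own, not from the slides): for a toy joint distribution over (x, t), the conditional mean 𝔼[t | x] attains a lower average squared loss than a deliberately shifted predictor. The sinusoidal form and the noise level are assumptions chosen purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=100_000)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

    h = np.sin(2 * np.pi * x)                      # conditional mean E[t | x] for this toy model
    y_shifted = h + 0.2                            # some other predictor: the mean shifted by a constant

    loss_h = np.mean((h - t) ** 2)                 # approx. the noise variance, 0.3**2 = 0.09
    loss_shifted = np.mean((y_shifted - t) ** 2)   # approx. 0.09 + 0.2**2 = 0.13
    print(f"loss of E[t|x]: {loss_h:.3f}, loss of shifted predictor: {loss_shifted:.3f}")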
The frequentist pitfall
▪ The expected loss function

  𝔼[L] = ∫∫ {y(x) − t}² p(x, t) dx dt

▪ Rewrite the integrand as

  {y(x) − t}² = {y(x) − 𝔼[t | x] + 𝔼[t | x] − t}²
              = {y(x) − 𝔼[t | x]}² + 2{y(x) − 𝔼[t | x]}{𝔼[t | x] − t} + {𝔼[t | x] − t}²

▪ Taking the expected value yields

  𝔼[L] = ∫ {y(x) − 𝔼[t | x]}² p(x) dx + ∫ var[t | x] p(x) dx
The frequentist pitfall
▪ Recall the expected squared loss,

  𝔼[L] = ∫ {y(x) − h(x)}² p(x) dx + ∫∫ {h(x) − t}² p(x, t) dx dt

  where h(x) = 𝔼[t | x] = ∫ t p(t | x) dt

▪ The second term corresponds to the noise inherent in the random variable t
▪ What about the first term?
The frequentist pitfall
▪ Suppose we were given multiple datasets, each of size N. Any particular dataset 𝒟 will give a particular function y(x; 𝒟)
▪ We then have

  {y(x; 𝒟) − h(x)}²
    = {y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)] + 𝔼_𝒟[y(x; 𝒟)] − h(x)}²
    = {y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)]}² + {𝔼_𝒟[y(x; 𝒟)] − h(x)}²
      + 2{y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)]}{𝔼_𝒟[y(x; 𝒟)] − h(x)}
The frequentist pitfall
▪ Taking the expectation over 𝒟 yields:

  𝔼_𝒟[{y(x; 𝒟) − h(x)}²]
    = {𝔼_𝒟[y(x; 𝒟)] − h(x)}²  +  𝔼_𝒟[{y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)]}²]
            (bias)²                          variance
The frequentist pitfall
▪ In conclusion:

  expected loss = (bias)² + variance + noise

  where

  (bias)² = ∫ {𝔼_𝒟[y(x; 𝒟)] − h(x)}² p(x) dx
  variance = ∫ 𝔼_𝒟[{y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)]}²] p(x) dx
  noise = ∫∫ {h(x) − t}² p(x, t) dx dt
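These three terms can be estimated by straightforward simulation. The sketch below is my own illustration of the decomposition, not code from the lecture: it draws 100 datasets of 25 points from a noisy sinusoid (matching the example on the next slides), fits a regularized least-squares model with Gaussian basis functions to each, and averages the fits on a test grid. The basis functions, regularization value, and noise level are assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    L, N, lam, noise_std = 100, 25, 1e-3, 0.3     # datasets, points per dataset, regularizer, noise

    def design(x, centers, s=0.1):
        # Gaussian basis functions plus a constant bias column (an assumed choice of basis)
        return np.hstack([np.ones((len(x), 1)),
                          np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))])

    centers = np.linspace(0, 1, 9)
    x_test = np.linspace(0, 1, 200)
    h_test = np.sin(2 * np.pi * x_test)           # true regression function h(x)
    Phi_test = design(x_test, centers)

    preds = np.empty((L, len(x_test)))
    for i in range(L):
        x = rng.uniform(0, 1, N)
        t = np.sin(2 * np.pi * x) + rng.normal(scale=noise_std, size=N)
        Phi = design(x, centers)
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)   # regularized fit
        preds[i] = Phi_test @ w

    y_bar = preds.mean(axis=0)                    # average prediction, estimates E_D[y(x; D)]
    bias2 = np.mean((y_bar - h_test) ** 2)
    variance = np.mean(preds.var(axis=0))
    noise = noise_std ** 2
    print(f"bias^2={bias2:.4f}  variance={variance:.4f}  noise={noise:.4f}")

Increasing lam pushes the estimated bias² up and the variance down, which is exactly the trade-off illustrated on the following slides.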
Bias-variance decomposition
▪ Example: 100 datasets from the sinusoidal with 25 data points, varying the degree of regularization (figures shown for several values of the regularization coefficient λ)
Bias-variance decomposition
▪ From these plots, we note that an over-regularized model
(large λ) will have a high bias, while an under-regularized
model (small λ) will have a high variance



Bias-variance decomposition
▪ These insights are of limited practical value
▪ They are based on averages with respect to ensembles of datasets
▪ In practice, we have only a single observed dataset


Bias-variance tradeoff
Bayesian linear regression
▪ Bayes' theorem:

  p(Y | X) = p(X | Y) p(Y) / p(X)

▪ Essentially, this leads to

  posterior ∝ likelihood × prior

▪ The idea is to place a probability distribution over the weights, and then update it based on the observed data
Bayesian linear regression
▪ Define a conjugate prior over w:

  p(w) = 𝒩(w | m0, S0)

▪ Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior

  p(w | t) = 𝒩(w | mN, SN)

  where

  mN = SN (S0⁻¹ m0 + β Φ⊤ t)
  SN⁻¹ = S0⁻¹ + β Φ⊤ Φ
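In code, the posterior is a few lines of linear algebra. A minimal sketch, assuming a design matrix Phi, target vector t, noise precision beta, and prior parameters m0 and S0 (all of these names are my own, not from the slides):

    import numpy as np

    def posterior(Phi, t, beta, m0, S0):
        """Posterior mean mN and covariance SN for Bayesian linear regression."""
        S0_inv = np.linalg.inv(S0)
        SN_inv = S0_inv + beta * Phi.T @ Phi            # SN^{-1} = S0^{-1} + beta * Phi^T Phi
        SN = np.linalg.inv(SN_inv)
        mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)      # mN = SN (S0^{-1} m0 + beta * Phi^T t)
        return mN, SN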
Bayesian linear regression
▪ A common choice for the prior is

  p(w) = 𝒩(w | 0, α⁻¹ I)

  for which

  mN = β SN Φ⊤ t
  SN⁻¹ = α I + β Φ⊤ Φ

▪ Consider the following example to make the concept less abstract: y(x) = −0.3 + 0.5x
Bayesian linear regression
▪ 0 data points observed



Bayesian linear regression
▪ 1 data point observed



Bayesian linear regression
▪ 2 data points observed



Bayesian linear regression
▪ 20 data points observed

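The sequence of figures above can be reproduced by applying the posterior update from the previous slides one observation at a time. A minimal sketch; the prior precision α = 2.0, noise precision β = 25, and the synthetic data are assumptions chosen for illustration, not taken from the slides:

    import numpy as np

    rng = np.random.default_rng(2)
    a0, a1 = -0.3, 0.5                    # true line y(x) = -0.3 + 0.5x
    alpha, beta = 2.0, 25.0               # assumed prior and noise precisions

    def phi(x):
        return np.stack([np.ones_like(x), x], axis=-1)   # basis functions (1, x)

    mN = np.zeros(2)
    SN = np.eye(2) / alpha                # prior: N(w | 0, alpha^{-1} I)

    for n in range(20):                   # observe 20 points one at a time
        x = rng.uniform(-1, 1)
        t = a0 + a1 * x + rng.normal(scale=1 / np.sqrt(beta))
        Phi = phi(np.array([x]))
        SN_inv = np.linalg.inv(SN) + beta * Phi.T @ Phi
        SN_new = np.linalg.inv(SN_inv)
        mN = SN_new @ (np.linalg.inv(SN) @ mN + beta * Phi.T @ np.array([t]))
        SN = SN_new

    print("posterior mean after 20 points:", mN)    # approaches (-0.3, 0.5)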


Bayesian linear regression
▪ A common choice for the prior is

  p(w) = 𝒩(w | 0, α⁻¹ I)

  for which

  mN = β SN Φ⊤ t
  SN⁻¹ = α I + β Φ⊤ Φ

▪ What is the log of the posterior distribution, i.e., ln p(w | t)?

  ln p(w | t) = ln [p(t | w) p(w) / p(t)]
              = −(β/2) Σ_{n=1}^{N} {t_n − w⊤ φ(x_n)}² − (α/2) w⊤ w + const

▪ Maximizing this over w is therefore equivalent to minimizing the regularized sum-of-squares error with regularization coefficient λ = α/β
Bayesian linear regression
▪ A common choice for the prior is

  p(w) = 𝒩(w | 0, α⁻¹ I)

  for which

  mN = β SN Φ⊤ t
  SN⁻¹ = α I + β Φ⊤ Φ

▪ What if we have no prior information, i.e., α → 0?

  mN = (Φ⊤Φ)⁻¹ Φ⊤ t

  i.e., the posterior mean reduces to the maximum-likelihood (least-squares) solution
Bayesian linear regression
▪ A common choice for the prior is

  p(w) = 𝒩(w | 0, α⁻¹ I)

  for which

  mN = β SN Φ⊤ t
  SN⁻¹ = α I + β Φ⊤ Φ

▪ What if we have precise prior information, i.e., α → ∞?

  mN = 0
Bayesian linear regression
▪ A common choice for the prior is

  p(w) = 𝒩(w | 0, α⁻¹ I)

  for which

  mN = β SN Φ⊤ t
  SN⁻¹ = α I + β Φ⊤ Φ

▪ What if we have infinite data, i.e., N → ∞?

  lim_{N→∞} mN = (Φ⊤Φ)⁻¹ Φ⊤ t
Bayesian linear regression
▪ Predict t for new values of x by integrating over w:

  p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw
                 = 𝒩(t | mN⊤ φ(x), σN²(x))

  where

  σN²(x) = 1/β + φ(x)⊤ SN φ(x)
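Given mN and SN, the predictive mean and variance at a new input are one matrix-vector product each. A minimal sketch (the function and argument names are my own, and the inputs are assumed to be NumPy arrays):

    def predictive(phi_x, mN, SN, beta):
        """Predictive mean and variance at a single feature vector phi(x)."""
        mean = mN @ phi_x                        # mN^T phi(x)
        var = 1.0 / beta + phi_x @ SN @ phi_x    # sigma_N^2(x) = 1/beta + phi(x)^T SN phi(x)
        return mean, var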
Bayesian linear regression
▪ Sinusoidal data, 9 Gaussian basis functions: 1 data point



Bayesian linear regression
▪ Sinusoidal data, 9 Gaussian basis functions: 2 data points



Bayesian linear regression
▪ Sinusoidal data, 9 Gaussian basis functions: 4 data points



Bayesian linear regression
▪ Sinusoidal data, 9 Gaussian basis functions: 25 data points



Conclusions
▪ The use of maximum likelihood, or equivalently, least squares, can lead to severe overfitting if complex models are trained using data sets of limited size

▪ A Bayesian approach to machine learning avoids the overfitting and also quantifies the uncertainty in the model parameters
Linear models for regression

Advanced Machine Learning


Linear regression



Linear regression



Linear regression
▪ General model is:

  y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w⊤ φ(x)

▪ Take φ_j(x) = x^j
▪ Take M = 2
▪ Calculate wML = (Φ⊤Φ)⁻¹ Φ⊤ t
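For M = 2 with φ_0(x) = 1 and φ_1(x) = x, wML can be obtained with a standard least-squares solver, which is numerically preferable to forming (Φ⊤Φ)⁻¹ explicitly. A short sketch on synthetic data (the generating line and noise level are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 1, 50)
    t = -0.3 + 0.5 * x + rng.normal(scale=0.1, size=x.shape)

    Phi = np.stack([np.ones_like(x), x], axis=1)     # columns: phi_0(x) = 1, phi_1(x) = x
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # least-squares solution of Phi w = t
    # equivalent to (Phi^T Phi)^{-1} Phi^T t
    print("wML:", w_ml)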


Linear regression
▪ Thus, we have y(x, w) = w0 + w1 x
▪ Performance is measured by

  E(w) = (1 / 2N) Σ_{n=1}^{N} {y(x_n, w) − t_n}²

▪ Goal: min_{w0, w1} E(w0, w1)


Gradient descent



Gradient descent



Gradient descent
▪ Gradient descent algorithm:

  repeat until convergence {
      w_j := w_j − α ∂E(w0, w1)/∂w_j
  }


Gradient descent
▪ Correct update:

  temp0 := w0 − α ∂E(w0, w1)/∂w0
  temp1 := w1 − α ∂E(w0, w1)/∂w1
  w0 := temp0
  w1 := temp1

Gradient descent
▪ Incorrect update:

  temp0 := w0 − α ∂E(w0, w1)/∂w0
  w0 := temp0
  temp1 := w1 − α ∂E(w0, w1)/∂w1
  w1 := temp1
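The second version is wrong because it uses the already-updated w0 when computing the step for w1, so that step no longer follows the gradient of E evaluated at the current point. A small sketch of the correct, simultaneous pattern (the quadratic error function in the usage example is just a stand-in of my own):

    # Correct: evaluate both partial derivatives at the current (w0, w1), then assign.
    def step(w0, w1, alpha, dE_dw0, dE_dw1):
        temp0 = w0 - alpha * dE_dw0(w0, w1)
        temp1 = w1 - alpha * dE_dw1(w0, w1)   # still evaluated at the old w0
        return temp0, temp1

    # Usage with E(w0, w1) = w0**2 + 3 * w1**2:
    w0, w1 = 1.0, 1.0
    for _ in range(100):
        w0, w1 = step(w0, w1, 0.1, lambda a, b: 2 * a, lambda a, b: 6 * b)
    print(w0, w1)    # both approach the minimizer (0, 0)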


Gradient descent
▪ Feature scaling is important



Gradient descent
▪ Step size is important for convergence



Gradient descent
▪ Convexity of the problem is important for global optimality



Gradient descent for linear regression
▪ General model is:

  y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w⊤ φ(x)

▪ Repeat {

      w_j := w_j − α (1/N) Σ_{n=1}^{N} (y(x_n, w) − t_n) φ_j(x_n)

  }
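Putting the pieces together, here is a compact sketch of batch gradient descent for the two-parameter model y(x, w) = w0 + w1 x. The learning rate, iteration count, and synthetic data are assumptions chosen for illustration; the result should match the least-squares solution from the earlier snippet.

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(0, 1, 100)
    t = -0.3 + 0.5 * x + rng.normal(scale=0.1, size=x.shape)

    Phi = np.stack([np.ones_like(x), x], axis=1)    # phi_0(x) = 1, phi_1(x) = x
    w = np.zeros(2)
    alpha, n_iter = 0.5, 2000                       # step size and number of iterations

    for _ in range(n_iter):
        residual = Phi @ w - t                      # y(x_n, w) - t_n for all n
        grad = Phi.T @ residual / len(t)            # (1/N) sum_n (y(x_n, w) - t_n) phi(x_n)
        w = w - alpha * grad                        # simultaneous update of w_0 and w_1

    print("w after gradient descent:", w)           # close to the least-squares solution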
