M6.
Regression: Linear Models -
[a popular (supervised ML) appn. of linear algebra]
Manikandan Narayanan
Week 11 (Oct 6-, 2023)
PRML Jul-Nov 2023 (Grads section)
Acknowledgment of Sources
• Slides based on content from related courses and books:
• Courses:
• IITM – Profs. Arun/Harish/Chandra’s PRML offerings (slides, quizzes, notes, etc.), Prof.
Ravi’s “Intro to ML” slides – cited respectively as [AR], [HR]/[HG], [CC], [BR] in the bottom
right of a slide.
• India – NPTEL PR course by IISc Prof. PS. Sastry (slides, etc.) – cited as [PSS] in the bottom
right of a slide.
• Books:
• PRML by Bishop. (content, figures, slides, etc.) – cited as [CMB]
• Pattern Classification by Duda, Hart and Stork. (content, figures, etc.) – [DHS]
• Mathematics for ML by Deisenroth, Faisal and Ong. (content, figures, etc.) – [DFO]
• Information Theory, Inference and Learning Algorithms by David JC MacKay – [DJM]
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Context: ML Paradigms
• Unsupervised Learning (informally aka “learning patterns from
(unlabelled) data”)
• Density estimation
• Clustering
• Dimensionality reduction
• Supervised Learning (informally aka curve-fitting or function
approximation or “function learning from (labelled) data”)
• Learn a map from inputs to outputs (features x to target t)
• Regression: continuous output/target t
• Classification: categorical output
[BR,HR]
Regression: the problem
• Learn a map from input variables (features x) to a continuous output variable (target t).
Informally known as function approximation/learning or curve fitting, since
given pairs $\{(\mathbf{x}_n, t_n)\}_{n=1}^{N}$, we seek a function $y_{\mathbf{w}}(\mathbf{x})$ that “approximates” the map from x → t.
Linear regression assumes $y_{\mathbf{w}}(\mathbf{x}) := y(\mathbf{x}, \mathbf{w})$ is a linear function of the adjustable parameters $\mathbf{w}$;
it could be linear or non-linear in $\mathbf{x}$.
• A foundational supervised learning problem/algorithm:
• Practical limitations for complex data, but sets analytical foundation for other sophisticated learning algorithms.
• Due to its simplicity, it is the first and predominant choice of statistical model in many applied areas with moderate sample sizes:
((e.g., in bioinformatics: to adjust for known confounding factors (covariates) in disease genomics and genome-wide association studies (GWAS), in causal inference such as Mendelian Randomization, etc.))
• Our approach in this lecture: Mostly [CMB, Chapters 1, 3].
[CMB]
Regression: what does “approximate” mean?
recall three approaches (from Decision Theory)
• Generative model approach:
(I) Model the joint p(x, t)
(I) Infer the conditional p(t|x) from it
(D) Take conditional mean/median/mode/any other optimal decision outcome as y(x)
• Discriminative model approach:
(I) Model p(t|x) directly
(D) Take conditional mean/median/mode/any other optimal decision outcome as y(x)
• Direct regression approach:
(D) Learn a regression function y(x) directly from training data
[CMB]
Linear regression: from linear combination of input variables (x ∈
ℝ𝐷 ) to that of basis functions (𝜙(x) ∈ ℝ𝑀 )
• Simplest model of linear regn. involving 𝐷 input vars.:
$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_D x_D = w_0 \cdot 1 + \sum_{j=1}^{D} w_j x_j = [\,w_0 \; w_1 \; \dots \; w_D\,]\begin{bmatrix}1\\ \mathbf{x}\end{bmatrix} = \mathbf{w}^T \begin{bmatrix}1\\ \mathbf{x}\end{bmatrix}$$
• Model of linear regn. involving 𝑀 basis fns. (fixed non-linear fns. of the input vars.):
$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1\phi_1(\mathbf{x}) + \dots + w_{M-1}\phi_{M-1}(\mathbf{x}) = w_0 \cdot 1 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) \qquad (\phi: \mathbb{R}^D \to \mathbb{R}^M, \text{ with convention } \phi_0(\mathbf{x}) = 1)$$
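((Not from the slides: a minimal NumPy sketch of the design-matrix view above, assuming a polynomial basis $\phi_j(x) = x^j$ with the convention $\phi_0(x) = 1$; the model output for all inputs is then just $\Phi\mathbf{w}$.))

```python
import numpy as np

def poly_design_matrix(x, M):
    """N x M design matrix Phi with Phi[n, j] = phi_j(x_n) = x_n**j (so phi_0(x) = 1)."""
    x = np.asarray(x, dtype=float)
    return np.vstack([x**j for j in range(M)]).T

# y(x, w) = w^T phi(x), evaluated for all inputs at once as Phi @ w
x = np.array([0.0, 0.5, 1.0])
w = np.array([1.0, -2.0, 3.0])            # (w_0, w_1, w_2)
Phi = poly_design_matrix(x, M=3)
y = Phi @ w                               # y(x_n, w) for each n
```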
Linear regression: recall standard examples
• Predicting weight 𝑡 from height 𝑥 : 𝒚(𝒙, 𝒘) linear in both 𝒙 and 𝒘.
• Estimation of fetal weight 𝑡 (actually log10 t) from ultrasound measurements: 𝒚 𝒙, 𝒘 linear in 𝒘, not 𝒙.
[From Hoopmann et al. Fetal Diagn
Ther. 2011;30(1):29-34.
doi:10.1159/000323586]
More examples of lin. regn. of basis functions
Polynomials (& extn. to spatially restricted splines), Fourier basis (sinusoidal fns.), Wavelets, etc.
[CMB]
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝒘𝑳𝑺 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Linear regression: a direct approach
Approach: minimize the sum-of-squares error; aka the least-squares solution/approach
$$\min_{\mathbf{w}} E_D(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{2}\sum_{n=1}^{N}\{t_n - y(\mathbf{x}_n, \mathbf{w})\}^2, \quad \text{where } y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T\phi(\mathbf{x})$$
Solution: $\mathbf{w}_{LS}$ that minimizes $E_D(\mathbf{w})$ (via matrix notation)
Normal equations from setting the gradient to zero, using the N x M design matrix $\Phi$ (with $\Phi_{nj} = \phi_j(\mathbf{x}_n)$):
$$\Phi^T\Phi\,\mathbf{w}_{LS} = \Phi^T\mathbf{t} \;\Rightarrow\; \mathbf{w}_{LS} = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{t} \quad (\text{when } \Phi^T\Phi \text{ is invertible})$$
[CMB, HR]
Soln.: Geometry of the least-squares solution
[CMB, Figure 3.2]
[CMB, HR]
LA Refresher
• See Appendix for LA refresher on
• related topic of Ax=b when no solution is possible, and
• some matrix-vector gradient formulas/tricks.
Recall LA: To solve 𝐴𝑥 = 𝑏, we premult. by 𝐴𝑇 , and
simply solve 𝐴𝑇 𝐴𝑥 = 𝐴𝑇 𝑏.
Ex.: Prove:
1) at least one soln. x* exists for the normal eqn.
2) soln. x* unique iff (𝐴𝑇𝐴) is invertible (⟺ 𝐴 has lin. indep. cols.)
3) infinite solns. x* iff (𝐴𝑇𝐴) is non-invertible (⟺ 𝐴 has lin. dep. cols.)
Ex.:
i) Prove 𝑁𝑆(𝐴) = 𝑁𝑆(𝐴𝑇 𝐴).
ii) Use orthog. complementarity of 𝑁𝑆(𝐴𝑇 ), 𝐶𝑆(𝐴) to derive normal eqns.
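((A small sketch, not from the slides: solving the normal equations $\Phi^T\Phi\,\mathbf{w} = \Phi^T\mathbf{t}$ numerically for a made-up design matrix; np.linalg.lstsq is the numerically preferred equivalent and should agree when the columns are linearly independent.))

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4
Phi = rng.normal(size=(N, M))             # design matrix (row n is phi(x_n)^T)
t = rng.normal(size=N)                    # targets

# Normal equations: (Phi^T Phi) w = Phi^T t
w_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Numerically preferred route (also handles rank-deficient Phi)
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
assert np.allclose(w_ls, w_lstsq)
```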
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝒘𝑹𝑳𝑺 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
From sum-of-squares to regularized error!
Running example: polynomial curve fitting ((via the sum-of-squares error function / least-squares approach))
$$\min_{\mathbf{w}} E_D(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{2}\sum_{n=1}^{N}\{t_n - y(x_n, \mathbf{w})\}^2, \quad \text{where } y(x, \mathbf{w}) = \mathbf{w}^T\phi(x)$$
[CMB]
Polynomial Curve Fitting
[CMB]
0th Order Polynomial
[CMB]
1st Order Polynomial
[CMB]
3rd Order Polynomial
[CMB]
9th Order Polynomial
[CMB]
Brainstorm: What degree polynomial would
you choose?
Over-fitting
Root-Mean-Square (RMS) Error: $E_{RMS} = \sqrt{2E(\mathbf{w}^\star)/N}$
[CMB]
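((A rough sketch of the over-fitting experiment, not from [CMB]: assuming Bishop-style toy data $t = \sin(2\pi x) + \text{noise}$, fit polynomials of increasing degree by least squares and compare train/test RMS error; the noise level and seeds are my own choices.))

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(N, noise_sd=0.3):
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0, noise_sd, N)   # toy sinusoidal data
    return x, t

def poly_design_matrix(x, M):
    return np.vstack([x**j for j in range(M)]).T

def rms(Phi, t, w):
    return np.sqrt(np.mean((Phi @ w - t) ** 2))              # equals sqrt(2 E(w)/N)

x_tr, t_tr = make_data(10)     # small training set
x_te, t_te = make_data(100)    # held-out test set

for deg in (0, 1, 3, 9):                                     # degrees shown in the slides
    Phi_tr = poly_design_matrix(x_tr, deg + 1)
    Phi_te = poly_design_matrix(x_te, deg + 1)
    w, *_ = np.linalg.lstsq(Phi_tr, t_tr, rcond=None)
    print(deg, rms(Phi_tr, t_tr, w), rms(Phi_te, t_te, w))   # train RMS keeps falling; test RMS typically blows up at degree 9
```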
Polynomial Coefficients
[CMB]
The role of data set size N?
Data Set Size: 𝑁 = 10
9th Order Polynomial
[CMB]
Data Set Size:
9th Order Polynomial
[CMB]
Data Set Size:
9th Order Polynomial
[CMB]
How to take care of both data set size and
model complexity tradeoffs?
Regularization
• Penalize large coefficient values by minimizing the regularized error
$$\tilde{E}(\mathbf{w}) = \underbrace{\tfrac{1}{2}\sum_{n=1}^{N}\{y(x_n,\mathbf{w}) - t_n\}^2}_{\text{data term (favors complex models)}} + \underbrace{\tfrac{\lambda}{2}\|\mathbf{w}\|^2}_{\text{regularization term (penalizes complex models)}}$$
[CMB]
Polynomial Coefficients
[CMB]
Regularization:
[CMB]
Regularization:
[CMB]
Regularization: vs.
[CMB]
Now, let’s see how to solve the minimization problem!
$$\min_{\mathbf{w}} \tilde{E}(\mathbf{w})$$
Matrix notation gradient formula (from Appendix):
Regularized Least Squares (1): Solution for
ridge regression
• Consider the error function: data term + regularization term, where 𝜆 is called the regularization coefficient.
• With the sum-of-squares error function and a quadratic regularizer, we get
$$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\}^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$$
• which is minimized by
$$\mathbf{w}_{RLS} = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T\mathbf{t}$$
[CMB]
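((A minimal sketch, not from the slides, of the closed-form ridge/RLS solution above; the helper and variable names are mine.))

```python
import numpy as np

def ridge_solution(Phi, t, lam):
    """w_RLS = (lambda*I + Phi^T Phi)^{-1} Phi^T t (regularized least squares)."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Usage (assuming the poly_design_matrix helper from the earlier sketch):
# w_rls = ridge_solution(poly_design_matrix(x_tr, 10), t_tr, lam=np.exp(-18))
```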
Regularized Least Squares (2)
• With a more general regularizer, we have
$$\frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\}^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q \qquad (q = 1: \text{Lasso}, \quad q = 2: \text{Quadratic})$$
[CMB]
Regularized Least Squares (3)
• Lasso tends to generate sparser solutions than a ridge (quadratic) regularizer.
• Regularization is aka penalization/weight-decay in the ML literature, or parameter shrinkage in the statistics literature.
[CMB]
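((An illustrative sketch, not from the slides, of the sparsity contrast using scikit-learn's off-the-shelf Ridge and Lasso estimators; here alpha plays the role of the regularization coefficient 𝜆, and the synthetic data with only 3 relevant features is my own choice.))

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]                                      # only 3 features actually matter
t = X @ w_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, t)
lasso = Lasso(alpha=0.1).fit(X, t)

print("ridge exact zeros:", np.sum(np.abs(ridge.coef_) < 1e-6))    # typically 0: shrinks but rarely zeroes out
print("lasso exact zeros:", np.sum(np.abs(lasso.coef_) < 1e-6))    # typically many: sparse solution
```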
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝒘𝑴𝑳 = 𝒘𝑳𝑺 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
A different view of min. E(w) or its regularized
version?
Q: How do you convert this intuition/empirical-art into science, and
derive E(w) or its (other) regularized versions systematically?
A: Probabilistic view helps. Discriminative Approach: Model 𝑝(𝑡|𝑥)
Q: What has regression got to do with our
previous topic: density estimation?
• Brainstorm: how to model P(t|x)?
Q: What has regression got to do with our
previous topics?
A: P(t|x) captures the input-output map. Steps involved are:
(1) Model/estimate P(t|x)
(how? Density Estimation; MLE/Bayesian Inference)
(2) Predict t for a new x from estimated P(t|x)
(how? Decision Theory; e.g., 𝑦(x𝒏𝒆𝒘 ) = 𝐸[𝒕|x = x𝒏𝒆𝒘 ])
Curve Fitting: Going to the basics!
using a Probabilistic Model & its Density Estimation (MLE/Bayesian)
[CMB]
ML estimation
Model: $p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(x, \mathbf{w}), \beta^{-1})$. Maximizing the likelihood w.r.t. $\mathbf{w}$ is equivalent to minimizing the sum-of-squares error $E_D(\mathbf{w})$, so determine $\mathbf{w}_{ML}$ by minimizing $E_D(\mathbf{w})$.
[CMB]
Summary: Linear model for regression - 𝑤𝑀𝐿 = 𝑤𝐿𝑆
[CMB]
Addtnl. Advantage: ML Predictive Distribution
[CMB]
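((A sketch, not from [CMB], of the MLE computation under the linear-Gaussian model: $\mathbf{w}_{ML}$ coincides with the least-squares solution, and $1/\beta_{ML}$ is the average squared residual; the function names are mine.))

```python
import numpy as np

def fit_ml(Phi, t):
    """MLE for the linear-Gaussian model: w_ML (= least-squares solution) and noise precision beta_ML."""
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)     # 1/beta_ML = average squared residual
    return w_ml, beta_ml

def predict_ml(phi_new, w_ml, beta_ml):
    """ML predictive distribution: mean w_ML^T phi(x), variance 1/beta_ML (the same for every x)."""
    return phi_new @ w_ml, 1.0 / beta_ml
```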
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝒘𝑴𝑨𝑷 = 𝒘𝑹𝑳𝑺 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Recall: Probab. Model & its Density Estimation (MLE)
[CMB]
Bayesian inference: what would you model as
a rv instead of a fixed value?
• Brainstorm
• What would you model?
• What would you infer?
Relation between Bayesian MAP and
Regularized linear regression
Let’s look at a simpler problem first – the mode of the posterior of 𝑤:
$$\mathbf{w}_{MAP} = \arg\max_{\mathbf{w}} P(\mathbf{w} \mid D_N := (\mathbf{x}, \mathbf{t}) := \{(x_n, t_n)\}_{n=1 \text{ to } N})$$
We will actually show that $\mathbf{w}_{MAP} = \mathbf{w}_{RLS}$!!
Bayesian inference: a first step via MAP
Assume the likelihood $p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T\phi(x_n), \beta^{-1})$.
Then, assume the prior $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}I)$ (the prior shrinks 𝑤 towards 0).
Determine the maximizer of the resulting posterior, i.e., $\mathbf{w}_{MAP}$, by minimizing the regularized sum-of-squares error $\tilde{E}(\mathbf{w})$.
[cf. full details of proof in next slide]
[CMB]
Full details of 𝑤𝑀𝐴𝑃 derivation, & 𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 proof!
$$\ln p(\mathbf{w} \mid \alpha) = \frac{M}{2}\ln\alpha - \frac{M}{2}\ln(2\pi) - \frac{\alpha}{2}\mathbf{w}^T\mathbf{w}$$
$$\Rightarrow \ln p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \alpha, \beta) + \ln p(\mathbf{w} \mid \alpha) + \text{const.} \quad (\text{assume } \alpha, \beta \text{ are both known hyperparams.})$$
$$= \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w}) + \frac{M}{2}\ln\alpha - \frac{M}{2}\ln(2\pi) - \frac{\alpha}{2}\mathbf{w}^T\mathbf{w} + \text{const.}$$
So maximizing this $\ln(\text{posterior})$ is the same as maximizing $-\beta E_D(\mathbf{w}) - \frac{\alpha}{2}\|\mathbf{w}\|^2$ (ignoring terms indept. of 𝒘), or equivalently
$$\min_{\mathbf{w}} \; \beta E_D(\mathbf{w}) + \frac{\alpha}{2}\|\mathbf{w}\|^2 = \beta\Big(E_D(\mathbf{w}) + \frac{\alpha}{\beta}E_W(\mathbf{w})\Big) = \beta\tilde{E}(\mathbf{w}),$$
or equivalently $\min_{\mathbf{w}} \tilde{E}(\mathbf{w})$. ((here, we set $\lambda := \alpha/\beta$))
Changing the prior from Gaussian to Laplacian!
• Prior: $p(\mathbf{w}) = p(w_0)\,p(w_1)\cdots p(w_{M-1})$
• What if $p(w_i)$ changed from
$$\sqrt{\frac{\alpha}{2\pi}}\exp\Big\{-\frac{\alpha}{2}w_i^2\Big\} \;\to\; \frac{\alpha}{4}\exp\Big\{-\frac{\alpha}{2}|w_i|\Big\}\,?$$
• Then, $p(\mathbf{w})$ changes from
$$\Big(\frac{\alpha}{2\pi}\Big)^{M/2}\exp\Big(-\frac{\alpha}{2}\|\mathbf{w}\|_2^2\Big) \;\to\; \Big(\frac{\alpha}{4}\Big)^{M}\exp\Big(-\frac{\alpha}{2}\|\mathbf{w}\|_1\Big)$$
• Then the regularization term $E_W(\mathbf{w})$ in the regularized error $\tilde{E}(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$ changes from:
$$\frac{1}{2}\sum_i w_i^2 \;(\text{ridge regn. / } L_2 \text{ regularization}) \;\to\; \frac{1}{2}\sum_i |w_i| \;(\text{lasso regn. / } L_1 \text{ regularization})$$
Recall: General prior p(𝒘) that generalizes both the Gaussian and Laplacian priors
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Bayesian inference: second and third steps
• Assume Gaussian prior for 𝒘 going forward.
• We don’t want just a single-point estimate (MAP) of 𝒘; we want
Step 2) the full posterior of 𝒘, and in turn use it to
Step 3) predict t for new x (via model-averaging…
• …wherein each model is given by a particular possible value of w and the averaging
weight is given by the model’s posterior)
Step 2) Let’s see an example of full posterior…
$t \approx y(x, \mathbf{w}) = w_0 + w_1 x$
Use a given set of training data points to not just infer one optimal model specified by 𝑤𝑀𝐴𝑃 , but all
possible 𝑤 models and the training dataset’s support (posterior probab.) for each such model 𝑤.
Bayesian Linear Regression Example (1)
0 data points observed
Prior Data Space
[CMB]
Bayesian Linear Regression Example (2)
1 data point observed
Likelihood Posterior Data Space
[CMB]
Bayesian Linear Regression Example (3)
2 data points observed
Likelihood Posterior Data Space
[CMB]
Bayesian Linear Regression Example (4)
20 data points observed
Posterior Data Space
[CMB]
From example posterior plots to
Full posterior in the general case:
Bayesian Linear Regression
• Define a conjugate prior over w. A common choice is the zero-mean isotropic Gaussian $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}I)$.
• Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior
$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, S_N), \quad \text{with } S_N^{-1} = \alpha I + \beta\,\Phi^T\Phi$$
[Qn. What is $\mathbf{m}_N$?]
[CMB]
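((A minimal sketch, not from the slides, of the posterior computation for the zero-mean isotropic prior: it uses $S_N^{-1} = \alpha I + \beta\Phi^T\Phi$ and the standard form of the posterior mean, $\mathbf{m}_N = \beta S_N \Phi^T\mathbf{t}$, which is also the answer to the question above; the names are mine.))

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for prior N(w | 0, (1/alpha) I) and noise precision beta."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)   # S_N^{-1} = alpha*I + beta*Phi^T Phi
    m_N = beta * S_N @ Phi.T @ t                                  # standard form of the posterior mean
    return m_N, S_N
```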
Recall: MVG Handy Results (cheat-sheet)
[CMB: Bishop, Chapter 2]
Finally, Step 3: What about predicting t for a
new point x?
• ((we don’t want just an estimate or posterior of w; we want to use it
to predict t for new x))
((again example plots first, and post. predictive distbn. form for the
general case after that.))
Example Predictive Distribution (1)
• Example: Sinusoidal data, 9 Gaussian basis fns. (𝜙: ℝ → ℝ9 ), 1 data point
[CMB]
Example Predictive Distribution (2)
• Example: Sinusoidal data, 9 Gaussian basis functions, 2 data points
[CMB]
Example Predictive Distribution (3)
• Example: Sinusoidal data, 9 Gaussian basis functions, 4 data points
[CMB]
Example Predictive Distribution (4)
• Example: Sinusoidal data, 9 Gaussian basis functions, 25 data points
[CMB]
(Bayesian/Posterior) Predictive Distribution (1)
• Predict $t$ for new values of $x$ by integrating over w (model-averaging):
$$p(t \mid x, \mathbf{t}, \alpha, \beta) = \int p(t \mid x, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w} = \mathcal{N}\big(t \mid \mathbf{m}_N^T\phi(x), \; \sigma_N^2(x)\big)$$
• where $\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N\, \phi(x)$.
Exercise: Prove that $\mathbf{m}_N = \mathbf{w}_{RLS} = \mathbf{w}_{MAP}$ (recall $\lambda := \alpha/\beta$), and hence that the posterior predictive mean is the same as the predicted value in the direct RLS approach. [CMB]
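((A sketch, not from the slides, of the posterior predictive mean and variance at a new input, plus a numerical sanity check for the exercise; it assumes the posterior() and ridge_solution() helpers from the earlier sketches, and the α, β values shown are arbitrary.))

```python
import numpy as np

def predictive(phi_new, m_N, S_N, beta):
    """Posterior predictive N(t | m_N^T phi(x), 1/beta + phi(x)^T S_N phi(x))."""
    mean = phi_new @ m_N
    var = 1.0 / beta + phi_new @ S_N @ phi_new
    return mean, var

# Numerical check of the exercise (with helpers from earlier sketches):
# alpha, beta = 2.0, 25.0
# m_N, S_N = posterior(Phi, t, alpha, beta)
# w_rls = ridge_solution(Phi, t, lam=alpha / beta)
# assert np.allclose(m_N, w_rls)          # m_N = w_RLS = w_MAP
```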
Linear regression direct vs. discriminative
model approaches: summary
Discriminative model-based approaches have two advantages:
1. They convert intuitions behind objective fns. into probabilistic-model-driven motivations:
(Least-squares, or min $E_D(\mathbf{w})$): $\mathbf{w}_{LS} = \mathbf{w}_{ML}$ (MLE)
(Regularized least-squares, or min $\tilde{E}(\mathbf{w})$): $\mathbf{w}_{RLS} = \mathbf{w}_{MAP}$ (Bayesian MAP)
2. They give the additional advantage of capturing the uncertainty over the predicted values, viz., a predictive distribution
$p(t \mid x) = \mathcal{N}(t \mid y(x, \mathbf{w}^\ast) := \mathbf{w}^{\ast T}\phi(x), \; \mathrm{var})$, and not just a single predicted value $y(x, \mathbf{w}^\ast)$ as in the direct approach.
• In MLE, the (pred.) var is a dataset-wide single variance ($\sigma^2 = \beta^{-1}$)
• In Bayesian, the (post. pred.) var is datapoint-specific ($\sigma^2(x) = \beta^{-1} + \phi(x)^T S_N\,\phi(x)$)
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Recall: Motivating example for Regularization,
and Hyperparam. Tuning
ln 𝜆 = −∞
[CMB]
Regularization, and Hyperparam. tuning
Regularization penalizes large coefs.
of 9th order polynomial fit of the data.
[CMB]
Some Motivating Questions
• Can we understand the error in our predictions better?
• That is, can we identify the different components of the error in our predictions?
• How are these different components related to the complexity of our model, and to
overfitting?
• Can we use the above knowledge to better tune model complexity
(hyperparameters) to avoid overfitting?
• If the trends in data require fitting of a complex model, then
• can overfitting be detected by understanding the stability of the optimal
(frequentist/ML) model across different training datasets?
• can a Bayesian model overcome the overfitting “naturally/implicitly” by not settling
in on a single optimal model and instead averaging over multiple models?
• Bayesian view (model averaging & empirical Bayes) possible, but out of scope for this course.
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Recall: Decision Theory for Regression
min. squared loss (conditional expectation E[t|x] as the minimizer)
[CMB]
Bias-variance analysis proof
Goal: Decompose the average error $E_D[E_{x,t}[L]]$ into different terms.
Now, simply view $y(x; D) - h(x)$ as a random variable $Z$, and apply the variance formula
$\mathrm{Var}_D(Z) = E_D[Z^2] - (E_D[Z])^2$ to get the bias-variance decomposition of the error below:
[CMB]
Bias-variance decomposition in formula
$$\text{expected loss} = E_D\big[E_{x,t}[L]\big] = E_D\Big[E_{x,t}\big[\{y(x; D) - t\}^2\big]\Big] = (\text{bias})^2 + \text{variance} + \text{noise},$$
where $(\text{bias})^2 = E_x\big[\{E_D[y(x;D)] - h(x)\}^2\big]$, $\text{variance} = E_x\big[E_D\big[\{y(x;D) - E_D[y(x;D)]\}^2\big]\big]$, $\text{noise} = E_{x,t}\big[\{h(x) - t\}^2\big]$, and $h(x) = E[t \mid x]$.
Exercise: cf. worksheet for a careful understanding of what random variables the expectations above are taken over!
[CMB]
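((A simulation sketch of the decomposition, not from [CMB]: repeatedly draw training sets from the same sinusoidal generator, fit a (regularized) degree-9 polynomial to each, and estimate bias², variance and noise on a fixed test grid; all constants below are my own choices, and varying lam trades bias against variance.))

```python
import numpy as np

rng = np.random.default_rng(3)
noise_sd = 0.3
x_test = np.linspace(0, 1, 100)
h = np.sin(2 * np.pi * x_test)                        # h(x) = E[t|x], known by construction here

def poly_design_matrix(x, M):
    return np.vstack([x**j for j in range(M)]).T

L, N, M, lam = 200, 25, 10, 1e-3                      # L datasets of size N; degree-9 model; ridge coefficient
preds = np.empty((L, x_test.size))
for l in range(L):
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0, noise_sd, N)
    Phi = poly_design_matrix(x, M)
    w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
    preds[l] = poly_design_matrix(x_test, M) @ w      # y(x; D_l) on the fixed test grid

y_bar = preds.mean(axis=0)                            # E_D[y(x; D)]
bias2 = np.mean((y_bar - h) ** 2)                     # (bias)^2, averaged over x
variance = np.mean(preds.var(axis=0))                 # variance, averaged over x
print(bias2, variance, noise_sd**2)                   # small lam: low bias/high variance; large lam: the reverse
```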
Bias-variance in pictures (for an example)
[CMB]
Bias-variance analysis (for the example)
[CMB]
Bias-variance analysis: applicability in practice?
[CMB]
Bias-variance analysis: applicability in practice? (details)
[CMB]
Concluding thoughts
• Linear regression forms the foundation of other sophisticated methods, so
it is worth investing enough time in it.
• Two views: direct loss fn. view (E(w)/regularization) & probabilistic model view (MLE/Bayesian)
• But lin. regn. has limitations in practice, despite non-linear basis functions, closed-form solutions and other analytical advantages.
• Mainly because the basis functions are fixed before seeing the training data (curse of
dimensionality when the dimensionality D of the feature vectors grows).
• Next steps:
• Linear models for classification, which play a similarly basic role for other classification
methods.
• Move from fixed basis fns. to selecting or adapting basis functions using the training
data, in later lectures on non-linear models.
Thank you!
Backup slides
Linear Algebra (LA) Refresher
• Switch to LN Pandey’s notes
LA Cheat Sheet: The four subspaces of a matrix
[From https://mathworld.wolfram.com/FundamentalTheoremofLinearAlgebra.html, Strang LA book]
LA + Opt. Cheat Sheet
• Real, symmetric matrices 𝑆 can be diagonalized as 𝑆 = 𝑄ΛQ𝑇
• 𝑆 is psd iff all its eigenvalues are non-negative.
[HR]
Recall LA: To solve 𝐴𝑥 = 𝑏, we premult. by 𝐴𝑇 , and
simply solve 𝐴𝑇 𝐴𝑥 = 𝐴𝑇 𝑏.
Ex.: Prove:
1) at least one soln. x* exists for the normal eqn.
2) soln. x* unique iff (𝐴𝑇𝐴) is invertible (⟺ 𝐴 has lin. indep. cols.)
3) infinite solns. x* iff (𝐴𝑇𝐴) is non-invertible (⟺ 𝐴 has lin. dep. cols.)
Ex.:
i) Prove 𝑁𝑆(𝐴) = 𝑁𝑆(𝐴𝑇 𝐴).
ii) Use orthog. complementarity of 𝑁𝑆(𝐴𝑇 ), 𝐶𝑆(𝐴) to derive normal eqns.
Choice of Prior
Different priors on the parameter 𝜃 (≔ 𝑤) lead to…
…different regularizations (ridge vs. lasso regn.)
Bias-variance analysis (alternate proof)
[CMB]