M6.
Regression: Linear Models -
[a popular (supervised ML) appn. of linear algebra]
Manikandan Narayanan
Week 11 (Oct 6-, 2023)
PRML Jul-Nov 2023 (Grads section)
Acknowledgment of Sources
• Slides based on content from related courses and books:
• Courses:
• IITM – Profs. Arun/Harish/Chandra’s PRML offerings (slides, quizzes, notes, etc.), Prof.
Ravi’s “Intro to ML” slides – cited respectively as [AR], [HR]/[HG], [CC], [BR] in the bottom
right of a slide.
• India – NPTEL PR course by IISc Prof. PS. Sastry (slides, etc.) – cited as [PSS] in the bottom
right of a slide.
• Books:
• PRML by Bishop. (content, figures, slides, etc.) – cited as [CMB]
• Pattern Classification by Duda, Hart and Stork. (content, figures, etc.) – [DHS]
• Mathematics for ML by Deisenroth, Faisal and Ong. (content, figures, etc.) – [DFO]
• Information Theory, Inference and Learning Algorithms by David JC MacKay – [DJM]
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Context: ML Paradigms
• Unsupervised Learning (informally aka “learning patterns from
(unlabelled) data”)
• Density estimation
• Clustering
• Dimensionality reduction
• Supervised Learning (informally aka curve-fitting or function
approximation or “function learning from (labelled) data”)
• Learn a map from inputs to outputs (features x to target t)
• Regression: continuous output/target t
• Classification: categorical output
[BR,HR]
Regression: the problem
• Learn a map from input variables (features x) to a continuous output variable (target t).
Informally known as function approximation/learning or curve fitting, since
given pairs $\{(\mathbf{x}_n, t_n)\}_{n=1}^{N}$, we seek a function $y_{\mathbf{w}}(\mathbf{x})$ that “approximates” the map from x → t.
Linear regression assumes $y_{\mathbf{w}}(\mathbf{x}) := y(\mathbf{x}, \mathbf{w})$ is a linear function of the adjustable parameters $\mathbf{w}$;
it could be linear or non-linear in $\mathbf{x}$.
• A foundational supervised learning problem/algorithm:
• Practical limitations for complex data, but sets analytical foundation for other sophisticated learning algorithms.
• Due to its simplicity, it is the first and predominant choice of statistical model in many applied areas with moderate sample sizes:
((e.g., in bioinformatics: to adjust for known confounding factors (covariates) in disease genomics and genome-wide association studies (GWAS), in causal inference such as Mendelian Randomization, etc.))
• Our approach in this lecture: Mostly [CMB, Chapters 1, 3].
[CMB]
Regression: what does “approximate” mean?
recall three approaches (from Decision Theory)
• Generative model approach:
(I) Model the joint p(x, t)
(I) Infer the conditional p(t|x) from it
(D) Take conditional mean/median/mode/any other optimal decision outcome as y(x)
• Discriminative model approach:
(I) Model p(t|x) directly
(D) Take conditional mean/median/mode/any other optimal decision outcome as y(x)
• Direct regression approach:
(D) Learn a regression function y(x) directly from training data
[CMB]
Linear regression: from linear combination of input variables (x ∈
ℝ𝐷 ) to that of basis functions (𝜙(x) ∈ ℝ𝑀 )
• Simplest model of linear regn. involving 𝐷 input vars.:
$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_D x_D = w_0 \cdot 1 + \sum_{j=1}^{D} w_j x_j = [\,w_0 \; w_1 \; \dots \; w_D\,]\begin{bmatrix}1\\ \mathbf{x}\end{bmatrix} = \mathbf{w}^T \begin{bmatrix}1\\ \mathbf{x}\end{bmatrix}$$
• Model of linear regn. involving 𝑀 basis fns. (fixed non-linear fns. of the input vars.):
$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1\phi_1(\mathbf{x}) + \dots + w_{M-1}\phi_{M-1}(\mathbf{x}) = w_0 \cdot 1 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) \qquad (\phi: \mathbb{R}^D \to \mathbb{R}^M, \text{ with convention } \phi_0(\mathbf{x}) = 1)$$
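((Not from the slides: a minimal NumPy sketch of the design-matrix view above, assuming a polynomial basis $\phi_j(x) = x^j$ with the convention $\phi_0(x) = 1$; the model output for all inputs is then just $\Phi\mathbf{w}$.))

```python
import numpy as np

def poly_design_matrix(x, M):
    """N x M design matrix Phi with Phi[n, j] = phi_j(x_n) = x_n**j (so phi_0(x) = 1)."""
    x = np.asarray(x, dtype=float)
    return np.vstack([x**j for j in range(M)]).T

# y(x, w) = w^T phi(x), evaluated for all inputs at once as Phi @ w
x = np.array([0.0, 0.5, 1.0])
w = np.array([1.0, -2.0, 3.0])            # (w_0, w_1, w_2)
Phi = poly_design_matrix(x, M=3)
y = Phi @ w                               # y(x_n, w) for each n
```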
Linear regression: recall standard examples
• Predicting weight 𝑡 from height 𝑥 : 𝒚(𝒙, 𝒘) linear in both 𝒙 and 𝒘.
• Estimation of fetal weight 𝑡 (actually log10 t) from ultrasound measurements: 𝒚 𝒙, 𝒘 linear in 𝒘, not 𝒙.
[From Hoopmann et al. Fetal Diagn
Ther. 2011;30(1):29-34.
doi:10.1159/000323586]
More examples of lin. regn. of basis functions
Polynomials (& extn. to spatially restricted splines), Fourier basis (sinusoidal fns.), Wavelets, etc.
[CMB]
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝒘𝑳𝑺 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Linear regression: a direct approach
Approach: minimize the sum-of-squares error; aka the least-squares solution/approach
$$\min_{\mathbf{w}} E_D(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{2}\sum_{n=1}^{N}\{t_n - y(\mathbf{x}_n, \mathbf{w})\}^2, \quad \text{where } y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T\phi(\mathbf{x})$$
Solution: $\mathbf{w}_{LS}$ that minimizes $E_D(\mathbf{w})$ (via matrix notation)
Normal equations from setting the gradient to zero, using the N x M design matrix $\Phi$ (with $\Phi_{nj} = \phi_j(\mathbf{x}_n)$):
$$\Phi^T\Phi\,\mathbf{w}_{LS} = \Phi^T\mathbf{t} \;\Rightarrow\; \mathbf{w}_{LS} = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{t} \quad (\text{when } \Phi^T\Phi \text{ is invertible})$$
[CMB, HR]
Soln.: Geometry of the least-squares solution
[CMB, Figure 3.2]
[CMB, HR]
LA Refresher
• See Appendix for LA refresher on
• related topic of Ax=b when no solution is possible, and
• some matrix-vector gradient formulas/tricks.
Recall LA: To solve 𝐴𝑥 = 𝑏, we premult. by 𝐴𝑇 , and
simply solve 𝐴𝑇 𝐴𝑥 = 𝐴𝑇 𝑏.
Ex.: Prove:
1) at least one soln. x* exists for the normal eqn.
2) soln. x* unique iff (𝐴𝑇𝐴) is invertible (⟺ 𝐴 has lin. indep. cols.)
3) infinite solns. x* iff (𝐴𝑇𝐴) is non-invertible (⟺ 𝐴 has lin. dep. cols.)
Ex.:
i) Prove 𝑁𝑆(𝐴) = 𝑁𝑆(𝐴𝑇 𝐴).
ii) Use orthog. complementarity of 𝑁𝑆(𝐴𝑇 ), 𝐶𝑆(𝐴) to derive normal eqns.
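((A small sketch, not from the slides: solving the normal equations $\Phi^T\Phi\,\mathbf{w} = \Phi^T\mathbf{t}$ numerically for a made-up design matrix; np.linalg.lstsq is the numerically preferred equivalent and should agree when the columns are linearly independent.))

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4
Phi = rng.normal(size=(N, M))             # design matrix (row n is phi(x_n)^T)
t = rng.normal(size=N)                    # targets

# Normal equations: (Phi^T Phi) w = Phi^T t
w_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Numerically preferred route (also handles rank-deficient Phi)
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
assert np.allclose(w_ls, w_lstsq)
```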
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝒘𝑹𝑳𝑺 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
From sum-of-squares to regularized error!
Running example: polynomial curve fitting ((via the sum-of-squares error function / least-squares approach))
$$\min_{\mathbf{w}} E_D(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{2}\sum_{n=1}^{N}\{t_n - y(x_n, \mathbf{w})\}^2, \quad \text{where } y(x, \mathbf{w}) = \mathbf{w}^T\phi(x)$$
[CMB]
Polynomial Curve Fitting
[CMB]
0th Order Polynomial
[CMB]
1st Order Polynomial
[CMB]
3rd Order Polynomial
[CMB]
9th Order Polynomial
[CMB]
Brainstorm: What degree polynomial would
you choose?
Over-fitting
Root-Mean-Square (RMS) Error: $E_{RMS} = \sqrt{2E(\mathbf{w}^\star)/N}$
[CMB]
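((A rough sketch of the over-fitting experiment, not from [CMB]: assuming Bishop-style toy data $t = \sin(2\pi x) + \text{noise}$, fit polynomials of increasing degree by least squares and compare train/test RMS error; the noise level and seeds are my own choices.))

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(N, noise_sd=0.3):
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0, noise_sd, N)   # toy sinusoidal data
    return x, t

def poly_design_matrix(x, M):
    return np.vstack([x**j for j in range(M)]).T

def rms(Phi, t, w):
    return np.sqrt(np.mean((Phi @ w - t) ** 2))              # equals sqrt(2 E(w)/N)

x_tr, t_tr = make_data(10)     # small training set
x_te, t_te = make_data(100)    # held-out test set

for deg in (0, 1, 3, 9):                                     # degrees shown in the slides
    Phi_tr = poly_design_matrix(x_tr, deg + 1)
    Phi_te = poly_design_matrix(x_te, deg + 1)
    w, *_ = np.linalg.lstsq(Phi_tr, t_tr, rcond=None)
    print(deg, rms(Phi_tr, t_tr, w), rms(Phi_te, t_te, w))   # train RMS keeps falling; test RMS typically blows up at degree 9
```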
Polynomial Coefficients
[CMB]
The role of data set size N?
Data Set Size: 𝑁 = 10
9th Order Polynomial
[CMB]
Data Set Size:
9th Order Polynomial
[CMB]
Data Set Size:
9th Order Polynomial
[CMB]
How to take care of both data set size and
model complexity tradeoffs?
Regularization
• Penalize large coefficient values by minimizing the regularized error
$$\tilde{E}(\mathbf{w}) = \underbrace{\tfrac{1}{2}\sum_{n=1}^{N}\{y(x_n,\mathbf{w}) - t_n\}^2}_{\text{data term (favors complex models)}} + \underbrace{\tfrac{\lambda}{2}\|\mathbf{w}\|^2}_{\text{regularization term (penalizes complex models)}}$$
[CMB]
Polynomial Coefficients
[CMB]
Regularization:
[CMB]
Regularization:
[CMB]
Regularization: vs.
[CMB]
Now, let’s see how to solve the minimization problem!
$$\min_{\mathbf{w}} \tilde{E}(\mathbf{w})$$
Matrix notation gradient formula (from Appendix):
Regularized Least Squares (1): Solution for
ridge regression
• Consider the error function: data term + regularization term, where 𝜆 is called the regularization coefficient.
• With the sum-of-squares error function and a quadratic regularizer, we get
$$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\}^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$$
• which is minimized by
$$\mathbf{w}_{RLS} = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T\mathbf{t}$$
[CMB]
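((A minimal sketch, not from the slides, of the closed-form ridge/RLS solution above; the helper and variable names are mine.))

```python
import numpy as np

def ridge_solution(Phi, t, lam):
    """w_RLS = (lambda*I + Phi^T Phi)^{-1} Phi^T t (regularized least squares)."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Usage (assuming the poly_design_matrix helper from the earlier sketch):
# w_rls = ridge_solution(poly_design_matrix(x_tr, 10), t_tr, lam=np.exp(-18))
```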
Regularized Least Squares (2)
• With a more general regularizer, we have
$$\frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\}^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q \qquad (q = 1: \text{Lasso}, \quad q = 2: \text{Quadratic})$$
[CMB]
Regularized Least Squares (3)
• Lasso tends to generate sparser solutions than a ridge (quadratic) regularizer.
• Regularization is aka penalization/weight-decay in the ML literature, or parameter shrinkage in the statistics literature.
[CMB]
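((An illustrative sketch, not from the slides, of the sparsity contrast using scikit-learn's off-the-shelf Ridge and Lasso estimators; here alpha plays the role of the regularization coefficient 𝜆, and the synthetic data with only 3 relevant features is my own choice.))

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]                                      # only 3 features actually matter
t = X @ w_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, t)
lasso = Lasso(alpha=0.1).fit(X, t)

print("ridge exact zeros:", np.sum(np.abs(ridge.coef_) < 1e-6))    # typically 0: shrinks but rarely zeroes out
print("lasso exact zeros:", np.sum(np.abs(lasso.coef_) < 1e-6))    # typically many: sparse solution
```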
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝒘𝑴𝑳 = 𝒘𝑳𝑺 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
A different view of min. E(w) or its regularized
version?
Q: How do you convert this intuition/empirical-art into science, and
derive E(w) or its (other) regularized versions systematically?
A: Probabilistic view helps. Discriminative Approach: Model 𝑝(𝑡|𝑥)
Q: What has regression got to do with our
previous topic: density estimation?
• Brainstorm: how to model P(t|x)?
Q: What has regression got to do with our
previous topics?
A: P(t|x) captures the input-output map. Steps involved are:
(1) Model/estimate P(t|x)
(how? Density Estimation; MLE/Bayesian Inference)
(2) Predict t for a new x from estimated P(t|x)
(how? Decision Theory; e.g., 𝑦(x𝒏𝒆𝒘 ) = 𝐸[𝒕|x = x𝒏𝒆𝒘 ])
Curve Fitting: Going to the basics!
using a Probabilistic Model & its Density Estimation (MLE/Bayesian)
[CMB]
ML estimation
Model: $p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(x, \mathbf{w}), \beta^{-1})$. Maximizing the likelihood w.r.t. $\mathbf{w}$ is equivalent to minimizing the sum-of-squares error $E_D(\mathbf{w})$, so determine $\mathbf{w}_{ML}$ by minimizing $E_D(\mathbf{w})$.
[CMB]
Summary: Linear model for regression - 𝑤𝑀𝐿 = 𝑤𝐿𝑆
[CMB]
Addtnl. Advantage: ML Predictive Distribution
[CMB]
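((A sketch, not from [CMB], of the MLE computation under the linear-Gaussian model: $\mathbf{w}_{ML}$ coincides with the least-squares solution, and $1/\beta_{ML}$ is the average squared residual; the function names are mine.))

```python
import numpy as np

def fit_ml(Phi, t):
    """MLE for the linear-Gaussian model: w_ML (= least-squares solution) and noise precision beta_ML."""
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)     # 1/beta_ML = average squared residual
    return w_ml, beta_ml

def predict_ml(phi_new, w_ml, beta_ml):
    """ML predictive distribution: mean w_ML^T phi(x), variance 1/beta_ML (the same for every x)."""
    return phi_new @ w_ml, 1.0 / beta_ml
```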
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝒘𝑴𝑨𝑷 = 𝒘𝑹𝑳𝑺 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Recall: Probab. Model & its Density Estimation (MLE)
[CMB]
Bayesian inference: what would you model as
a rv instead of a fixed value?
• Brainstorm
• What would you model?
• What would you infer?
Relation between Bayesian MAP and
Regularized linear regression
Let’s look at a simpler problem first – the mode of the posterior of 𝑤:
$$\mathbf{w}_{MAP} = \arg\max_{\mathbf{w}} P(\mathbf{w} \mid D_N := (\mathbf{x}, \mathbf{t}) := \{(x_n, t_n)\}_{n=1 \text{ to } N})$$
We will actually show that $\mathbf{w}_{MAP} = \mathbf{w}_{RLS}$!!
Bayesian inference: a first step via MAP
Assume the likelihood $p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T\phi(x_n), \beta^{-1})$.
Then, assume the prior $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}I)$ (the prior shrinks 𝑤 towards 0).
Determine the maximizer of the resulting posterior, i.e., $\mathbf{w}_{MAP}$, by minimizing the regularized sum-of-squares error $\tilde{E}(\mathbf{w})$.
[cf. full details of proof in next slide]
[CMB]
Full details of 𝑤𝑀𝐴𝑃 derivation, & 𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 proof!
$$\ln p(\mathbf{w} \mid \alpha) = \frac{M}{2}\ln\alpha - \frac{M}{2}\ln(2\pi) - \frac{\alpha}{2}\mathbf{w}^T\mathbf{w}$$
$$\Rightarrow \ln p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \alpha, \beta) + \ln p(\mathbf{w} \mid \alpha) + \text{const.} \quad (\text{assume } \alpha, \beta \text{ are both known hyperparams.})$$
$$= \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w}) + \frac{M}{2}\ln\alpha - \frac{M}{2}\ln(2\pi) - \frac{\alpha}{2}\mathbf{w}^T\mathbf{w} + \text{const.}$$
So maximizing this $\ln(\text{posterior})$ is the same as maximizing $-\beta E_D(\mathbf{w}) - \frac{\alpha}{2}\|\mathbf{w}\|^2$ (ignoring terms indept. of 𝒘), or equivalently
$$\min_{\mathbf{w}} \; \beta E_D(\mathbf{w}) + \frac{\alpha}{2}\|\mathbf{w}\|^2 = \beta\Big(E_D(\mathbf{w}) + \frac{\alpha}{\beta}E_W(\mathbf{w})\Big) = \beta\tilde{E}(\mathbf{w}),$$
or equivalently $\min_{\mathbf{w}} \tilde{E}(\mathbf{w})$. ((here, we set $\lambda := \alpha/\beta$))
Changing the prior from Gaussian to Laplacian!
• Prior: $p(\mathbf{w}) = p(w_0)\,p(w_1)\cdots p(w_{M-1})$
• What if $p(w_i)$ changed from
$$\sqrt{\frac{\alpha}{2\pi}}\exp\Big\{-\frac{\alpha}{2}w_i^2\Big\} \;\to\; \frac{\alpha}{4}\exp\Big\{-\frac{\alpha}{2}|w_i|\Big\}\,?$$
• Then, $p(\mathbf{w})$ changes from
$$\Big(\frac{\alpha}{2\pi}\Big)^{M/2}\exp\Big(-\frac{\alpha}{2}\|\mathbf{w}\|_2^2\Big) \;\to\; \Big(\frac{\alpha}{4}\Big)^{M}\exp\Big(-\frac{\alpha}{2}\|\mathbf{w}\|_1\Big)$$
• Then the regularization term $E_W(\mathbf{w})$ in the regularized error $\tilde{E}(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$ changes from:
$$\frac{1}{2}\sum_i w_i^2 \;(\text{ridge regn. / } L_2 \text{ regularization}) \;\to\; \frac{1}{2}\sum_i |w_i| \;(\text{lasso regn. / } L_1 \text{ regularization})$$
Recall: General prior p(𝒘) that generalizes both the Gaussian and Laplacian priors
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Bayesian inference: second and third steps
• Assume Gaussian prior for 𝒘 going forward.
• We don’t want just a single-point estimate (MAP) of 𝒘; we want
Step 2) the full posterior of 𝒘, and in turn use it to
Step 3) predict t for new x (via model-averaging…
• …wherein each model is given by a particular possible value of w and the averaging
weight is given by the model’s posterior)
Step 2) Let’s see an example of full posterior…
$t \approx y(x, \mathbf{w}) = w_0 + w_1 x$
Use a given set of training data points to not just infer one optimal model specified by 𝑤𝑀𝐴𝑃 , but all
possible 𝑤 models and the training dataset’s support (posterior probab.) for each such model 𝑤.
Bayesian Linear Regression Example (1)
0 data points observed
Prior Data Space
[CMB]
Bayesian Linear Regression Example (2)
1 data point observed
Likelihood Posterior Data Space
[CMB]
Bayesian Linear Regression Example (3)
2 data points observed
Likelihood Posterior Data Space
[CMB]
Bayesian Linear Regression Example (4)
20 data points observed
Posterior Data Space
[CMB]
From example posterior plots to
Full posterior in the general case:
Bayesian Linear Regression
• Define a conjugate prior over w. A common choice is the zero-mean isotropic Gaussian $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}I)$.
• Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior
$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, S_N), \quad \text{with } S_N^{-1} = \alpha I + \beta\,\Phi^T\Phi$$
[Qn. What is $\mathbf{m}_N$?]
[CMB]
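((A minimal sketch, not from the slides, of the posterior computation for the zero-mean isotropic prior: it uses $S_N^{-1} = \alpha I + \beta\Phi^T\Phi$ and the standard form of the posterior mean, $\mathbf{m}_N = \beta S_N \Phi^T\mathbf{t}$, which is also the answer to the question above; the names are mine.))

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for prior N(w | 0, (1/alpha) I) and noise precision beta."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)   # S_N^{-1} = alpha*I + beta*Phi^T Phi
    m_N = beta * S_N @ Phi.T @ t                                  # standard form of the posterior mean
    return m_N, S_N
```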
Recall: MVG Handy Results (cheat-sheet)
[CMB: Bishop, Chapter 2]
Finally, Step 3: What about predicting t for a
new point x?
• ((we don’t want just an estimate or posterior of w; we want to use it
to predict t for new x))
((again example plots first, and post. predictive distbn. form for the
general case after that.))
Example Predictive Distribution (1)
• Example: Sinusoidal data, 9 Gaussian basis fns. (𝜙: ℝ → ℝ9 ), 1 data point
[CMB]
Example Predictive Distribution (2)
• Example: Sinusoidal data, 9 Gaussian basis functions, 2 data points
[CMB]
Example Predictive Distribution (3)
• Example: Sinusoidal data, 9 Gaussian basis functions, 4 data points
[CMB]
Example Predictive Distribution (4)
• Example: Sinusoidal data, 9 Gaussian basis functions, 25 data points
[CMB]
(Bayesian/Posterior) Predictive Distribution (1)
• Predict $t$ for new values of $x$ by integrating over w (model-averaging):
$$p(t \mid x, \mathbf{t}, \alpha, \beta) = \int p(t \mid x, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w} = \mathcal{N}\big(t \mid \mathbf{m}_N^T\phi(x), \; \sigma_N^2(x)\big)$$
• where $\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N\, \phi(x)$.
Exercise: Prove that $\mathbf{m}_N = \mathbf{w}_{RLS} = \mathbf{w}_{MAP}$ (recall $\lambda := \alpha/\beta$), and hence that the posterior predictive mean is the same as the predicted value in the direct RLS approach. [CMB]
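((A sketch, not from the slides, of the posterior predictive mean and variance at a new input, plus a numerical sanity check for the exercise; it assumes the posterior() and ridge_solution() helpers from the earlier sketches, and the α, β values shown are arbitrary.))

```python
import numpy as np

def predictive(phi_new, m_N, S_N, beta):
    """Posterior predictive N(t | m_N^T phi(x), 1/beta + phi(x)^T S_N phi(x))."""
    mean = phi_new @ m_N
    var = 1.0 / beta + phi_new @ S_N @ phi_new
    return mean, var

# Numerical check of the exercise (with helpers from earlier sketches):
# alpha, beta = 2.0, 25.0
# m_N, S_N = posterior(Phi, t, alpha, beta)
# w_rls = ridge_solution(Phi, t, lam=alpha / beta)
# assert np.allclose(m_N, w_rls)          # m_N = w_RLS = w_MAP
```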
Linear regression direct vs. discriminative
model approaches: summary
Discriminative model-based approaches have two advantages:
1. They convert intuitions behind objective fns. into probabilistic-model-driven motivations:
(Least-squares, or min $E_D(\mathbf{w})$): $\mathbf{w}_{LS} = \mathbf{w}_{ML}$ (MLE)
(Regularized least-squares, or min $\tilde{E}(\mathbf{w})$): $\mathbf{w}_{RLS} = \mathbf{w}_{MAP}$ (Bayesian MAP)
2. They give the additional advantage of capturing the uncertainty over the predicted values, viz., a predictive distribution
$p(t \mid x) = \mathcal{N}(t \mid y(x, \mathbf{w}^\ast) := \mathbf{w}^{\ast T}\phi(x), \; \mathrm{var})$, and not just a single predicted value $y(x, \mathbf{w}^\ast)$ as in the direct approach.
• In MLE, the (pred.) var is a dataset-wide single variance ($\sigma^2 = \beta^{-1}$)
• In Bayesian, the (post. pred.) var is datapoint-specific ($\sigma^2(x) = \beta^{-1} + \phi(x)^T S_N\,\phi(x)$)
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Recall: Motivating example for Regularization,
and Hyperparam. Tuning
ln 𝜆 = −∞
[CMB]
Regularization, and Hyperparam. tuning
Regularization penalizes large coefs.
of 9th order polynomial fit of the data.
[CMB]
Some Motivating Questions
• Can we understand the error in our predictions better?
• That is, can we identify the different components of the error in our predictions?
• How are these different components related to the complexity of our model, and to
overfitting?
• Can we use the above knowledge to better tune model complexity
(hyperparameters) to avoid overfitting?
• If the trends in data require fitting of a complex model, then
• can overfitting be detected by understanding the stability of the optimal
(frequentist/ML) model across different training datasets?
• can a Bayesian model overcome the overfitting “naturally/implicitly” by not settling
in on a single optimal model and instead averaging over multiple models?
• Bayesian view (model averaging & empirical Bayes) possible, but out of scope for this course.
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Recall: Decision Theory for Regression
min. squared loss (conditional expectation E[t|x] as the minimizer)
[CMB]
Bias-variance analysis proof
Goal: Decompose the average error $E_D[E_{x,t}[L]]$ into different terms.
Now, simply view $y(x; D) - h(x)$ as a random variable $Z$, and apply the variance formula
$\mathrm{Var}_D(Z) = E_D[Z^2] - (E_D[Z])^2$ to get the bias-variance decomposition of the error below:
[CMB]
Bias-variance decomposition in formula
$$\text{expected loss} = E_D\big[E_{x,t}[L]\big] = E_D\Big[E_{x,t}\big[\{y(x; D) - t\}^2\big]\Big] = (\text{bias})^2 + \text{variance} + \text{noise},$$
where $(\text{bias})^2 = E_x\big[\{E_D[y(x;D)] - h(x)\}^2\big]$, $\text{variance} = E_x\big[E_D\big[\{y(x;D) - E_D[y(x;D)]\}^2\big]\big]$, $\text{noise} = E_{x,t}\big[\{h(x) - t\}^2\big]$, and $h(x) = E[t \mid x]$.
Exercise: cf. worksheet for a careful understanding of what random variables the expectations above are taken over!
[CMB]
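((A simulation sketch of the decomposition, not from [CMB]: repeatedly draw training sets from the same sinusoidal generator, fit a (regularized) degree-9 polynomial to each, and estimate bias², variance and noise on a fixed test grid; all constants below are my own choices, and varying lam trades bias against variance.))

```python
import numpy as np

rng = np.random.default_rng(3)
noise_sd = 0.3
x_test = np.linspace(0, 1, 100)
h = np.sin(2 * np.pi * x_test)                        # h(x) = E[t|x], known by construction here

def poly_design_matrix(x, M):
    return np.vstack([x**j for j in range(M)]).T

L, N, M, lam = 200, 25, 10, 1e-3                      # L datasets of size N; degree-9 model; ridge coefficient
preds = np.empty((L, x_test.size))
for l in range(L):
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0, noise_sd, N)
    Phi = poly_design_matrix(x, M)
    w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
    preds[l] = poly_design_matrix(x_test, M) @ w      # y(x; D_l) on the fixed test grid

y_bar = preds.mean(axis=0)                            # E_D[y(x; D)]
bias2 = np.mean((y_bar - h) ** 2)                     # (bias)^2, averaged over x
variance = np.mean(preds.var(axis=0))                 # variance, averaged over x
print(bias2, variance, noise_sd**2)                   # small lam: low bias/high variance; large lam: the reverse
```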
Bias-variance in pictures (for an example)
[CMB]
Bias-variance analysis (for the example)
[CMB]
Bias-variance analysis: applicability in practice?
[CMB]
Bias-variance analysis: applicability in practice? (details)
[CMB]
Concluding thoughts
• Linear regression forms the foundation of other sophisticated methods, so
it is worth investing enough time in it.
• Two views: direct loss fn. view (E(w)/regularization) & probabilistic model view (MLE/Bayesian)
• But lin. regn. has limitations in practice, despite non-linear basis functions, closed-form solutions and other analytical advantages.
• Mainly because the basis functions are fixed before seeing the training data (curse of
dimensionality when the dimensionality D of the feature vectors grows).
• Next steps:
• Linear models for classification, which play a similarly basic role for other classification
methods.
• Move from fixed basis fns. to selecting or adapting basis functions using the training
data, in later lectures on non-linear models.
Thank you!
Backup slides
Linear Algebra (LA) Refresher
• Switch to LN Pandey’s notes
LA Cheat Sheet: The four subspaces of a matrix
[From https://mathworld.wolfram.com/FundamentalTheoremofLinearAlgebra.html, Strang LA book]
LA + Opt. Cheat Sheet
• Real, symmetric matrices 𝑆 can be diagonalized as 𝑆 = 𝑄ΛQ𝑇
• 𝑆 is psd iff all its eigenvalues are non-negative.
[HR]
Recall LA: To solve 𝐴𝑥 = 𝑏, we premult. by 𝐴𝑇 , and
simply solve 𝐴𝑇 𝐴𝑥 = 𝐴𝑇 𝑏.
Ex.: Prove:
1) at least one soln. x* exists for the normal eqn.
2) soln. x* unique iff (𝐴𝑇𝐴) is invertible (⟺ 𝐴 has lin. indep. cols.)
3) infinite solns. x* iff (𝐴𝑇𝐴) is non-invertible (⟺ 𝐴 has lin. dep. cols.)
Ex.:
i) Prove 𝑁𝑆(𝐴) = 𝑁𝑆(𝐴𝑇 𝐴).
ii) Use orthog. complementarity of 𝑁𝑆(𝐴𝑇 ), 𝐶𝑆(𝐴) to derive normal eqns.
Choice of Prior
Different priors on the parameter 𝜃 (≔ 𝑤) lead to…
…different regularizations (ridge vs. lasso regn.)
Bias-variance analysis (alternate proof)
[CMB]