Lecture 24: Linear regression I
Introduction to Mathematical Modeling, Spring 2025
Lecturer: Yijun Dong
Helpful references:
Percy Liang, Statistical Learning Theory Lecture Notes, §2.7 (§2.8 FYI)
Larry Wasserman, All of Statistics, §13
Example: determine ages based on face images
Figure 1: UTKFace dataset: face images with age labels.
Regression
• Consider a joint distribution $P(x, y)$ of a random vector $X \in \mathbb{R}^d$ and a random variable $Y \in \mathbb{R}$.
• Regression is a method for studying the relationship between a response R.V. $Y \in \mathbb{R}$ and a feature R.V. $X \in \mathbb{R}^d$ through a regression function:
$$f^\star(x) = \mathbb{E}[Y \mid X = x] = \int_{\mathbb{R}} y \, P(y \mid x) \, dy$$
• The goal of regression is to estimate the regression function $f^\star(x)$ based on data (observations) drawn from the joint distribution $P(x, y)$:
$$(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n) \sim P(x, y)$$
• Without prior knowledge of the regression function, finding a good estimate $f \approx f^\star$ is hard. A common approach is to assume a parametric model for the regression function (see the sketch below), e.g.:
  • Linear regression: $f(x) = \theta^\top x$ for some $\theta \in \mathbb{R}^d$.
  • Two-layer neural network regression: $f(x) = a^\top \sigma(W^\top x)$ for some $a \in \mathbb{R}^m$, $W \in \mathbb{R}^{d \times m}$, where $\sigma$ is a non-linear activation function.
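To make the two parametric families concrete, here is a minimal NumPy sketch; the dimensions $d$ and $m$, the ReLU choice for the activation $\sigma$, and the randomly drawn parameter values are illustrative assumptions, not part of the lecture.

import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 16                      # illustrative feature dimension and hidden width

# Linear regression model: f(x) = theta^T x with theta in R^d
theta = rng.normal(size=d)
def f_linear(x):
    return theta @ x

# Two-layer network: f(x) = a^T sigma(W^T x), with sigma = ReLU (one common choice)
a, W = rng.normal(size=m), rng.normal(size=(d, m))
def f_two_layer(x):
    return a @ np.maximum(W.T @ x, 0.0)   # ReLU applied elementwise

x = rng.normal(size=d)
print(f_linear(x), f_two_layer(x))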
Fixed design linear regression: parameter estimation
• Consider a joint distribution $P_{\theta^\star}(x, y)$ of a random vector $X \in \mathbb{R}^d$ and a random variable $Y \in \mathbb{R}$ parameterized by some unknown parameter $\theta^\star \in \mathbb{R}^d$:
$$(x, y) \sim P_{\theta^\star}(x, y) \iff y = x^\top \theta^\star + z, \quad z \sim \mathcal{N}(0, \sigma^2),$$
where $z \sim \mathcal{N}(0, \sigma^2)$ is an independent Gaussian noise on the response/label with mean $0$ and variance $\sigma^2$.
• Fixed design linear regression aims to estimate the parameter $\theta^\star$ based on a fixed set of features (i.e., no randomness) $X = [x_1, x_2, \cdots, x_n]^\top \in \mathbb{R}^{n \times d}$.
• For each $x_i$ ($i = 1, 2, \cdots, n$), the corresponding label (response) is
$$y_i = x_i^\top \theta^\star + z_i, \quad z_i \sim \mathcal{N}(0, \sigma^2),$$
where $z_i$ is the independent Gaussian label noise (see the simulation sketch below).
• Exercise: Let $y = [y_1, y_2, \cdots, y_n]^\top \in \mathbb{R}^n$. Is $y$ a random vector in fixed design? If so, where does the randomness come from?
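Below is a minimal simulation of the fixed-design model; the sample size, dimension, $\theta^\star$, and $\sigma$ are arbitrary illustrative choices. It highlights that $X$ and $\theta^\star$ are held fixed and the labels are random only through the noise $z$.

import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.5                    # illustrative sample size, dimension, noise level

X = rng.normal(size=(n, d))                 # fixed design: generated once, then held fixed
theta_star = np.array([1.0, -2.0, 0.5])     # "unknown" true parameter, chosen for illustration

z = rng.normal(scale=sigma, size=n)         # independent Gaussian label noise z ~ N(0, sigma^2 I_n)
y = X @ theta_star + z                      # labels: the only source of randomness is z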
Square loss for linear regression
• The fixed features $X \in \mathbb{R}^{n \times d}$ and corresponding labels $y \in \mathbb{R}^n$ are related as
$$y = X\theta^\star + z, \quad z = [z_1, z_2, \cdots, z_n]^\top \sim \mathcal{N}(0, \sigma^2 I_n).$$
• Exercise: show that with independent Gaussian label noises $z_i \sim \mathcal{N}(0, \sigma^2)$ for all $i \in [n]$, $z \sim \mathcal{N}(0, \sigma^2 I_n)$.
• The square loss (i.e., $\ell_2$ loss) is defined as $\ell(y, \hat{y}) = (y - \hat{y})^2$.
• The expected/population risk of a regression function parameterized by $\theta \in \mathbb{R}^d$ under the square loss is defined as
$$L(\theta) = \mathbb{E}_{(x,y) \sim P_{\theta^\star}(x,y)}\left[\ell(y, x^\top \theta)\right] = \mathbb{E}_{(x,y) \sim P_{\theta^\star}(x,y)}\left[(y - x^\top \theta)^2\right].$$
• For fixed design given $X \in \mathbb{R}^{n \times d}$, the expected risk can be expressed as
$$L(\theta) = \mathbb{E}_{y' \sim P_{\theta^\star}(\cdot \mid X)}\left[\frac{1}{n}\|X\theta - y'\|_2^2\right] = \mathbb{E}_{y' \mid X}\left[\frac{1}{n}\|X\theta - y'\|_2^2\right],$$
where $y' = X\theta^\star + z'$ denotes a fresh draw of the labels for the same fixed $X$, independent of the observed $y$ (a Monte Carlo sketch follows).
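One way to read the fixed-design population risk is as an average over fresh label draws $y'$ for the same fixed $X$. Here is a Monte Carlo sketch using the same illustrative setup as above; the closed form it is compared against, $\frac{1}{n}\|X(\theta - \theta^\star)\|_2^2 + \sigma^2$, is the expression derived on the next slides.

import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.5
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.5])

def population_risk_mc(theta, n_draws=20000):
    """Monte Carlo estimate of E_{y'|X}[ (1/n) * ||X theta - y'||_2^2 ] for the fixed design X."""
    Z = rng.normal(scale=sigma, size=(n_draws, n))   # fresh noise draws z'
    Y_fresh = X @ theta_star + Z                     # fresh labels y' = X theta_star + z'
    residuals = X @ theta - Y_fresh
    return np.mean(np.sum(residuals ** 2, axis=1) / n)

theta = np.zeros(d)
print(population_risk_mc(theta))                               # Monte Carlo estimate of L(theta)
print(np.sum((X @ (theta - theta_star)) ** 2) / n + sigma**2)  # closed form derived on the next slides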
Empirical risk minimization (ERM)
• Population (true distribution): the true joint distribution $P_{\theta^\star}(x, y)$ that generates the data (features $x$ and labels $y$).
• Samples (empirical distribution): the fixed features $x_1, \cdots, x_n$ and the corresponding random labels $y_1, \cdots, y_n$ drawn from the population.
• Population risk (fixed design, square loss): $L(\theta) = \mathbb{E}_{y' \mid X}\left[\frac{1}{n}\|X\theta - y'\|_2^2\right]$.
• Empirical risk (fixed design, square loss):
$$\hat{L}(\theta) = \frac{1}{n}\|X\theta - y\|_2^2.$$
Empirical risk minimization (ERM)
• What we want: estimate $\theta^\star$, which characterizes the population $P_{\theta^\star}(x, y)$.
• What we have: $n$ samples $(X, y)$ where $y = X\theta^\star + z$ and $z \sim \mathcal{N}(0, \sigma^2 I_n)$.
$$\text{ERM}: \quad \hat{\theta} = \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \left\{ \hat{L}(\theta) = \frac{1}{n}\|X\theta - y\|_2^2 \right\}.$$
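Minimizing the empirical risk above is an ordinary least-squares problem. A minimal sketch with the same illustrative data-generating choices as before; the normal-equations closed form $(X^\top X)^{-1} X^\top y$ is a standard fact quoted here only for comparison, not something derived on these slides.

import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.5
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star + rng.normal(scale=sigma, size=n)

# ERM under the square loss: minimize (1/n) * ||X theta - y||_2^2 over theta
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# When X^T X is invertible this agrees with the normal-equations solution.
theta_hat_ne = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat, theta_hat_ne)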
Generalization error
Generalization error measures how much the regression function $\hat{f}(x) = x^\top \hat{\theta}$ learned from the finite samples $(X, y)$ underperforms the best possible regression function over the entire population.
• The best possible regression function over the population is $\theta^\star$:
$$
\begin{aligned}
\min_{\theta \in \mathbb{R}^d} L(\theta) &= \min_{\theta \in \mathbb{R}^d} \mathbb{E}_{y' \mid X}\left[\frac{1}{n}\|X\theta - y'\|_2^2\right] \\
&= \min_{\theta \in \mathbb{R}^d} \mathbb{E}_{z'}\left[\frac{1}{n}\|X\theta - X\theta^\star - z'\|_2^2\right] \\
&= \min_{\theta \in \mathbb{R}^d} \frac{1}{n}\, \mathbb{E}_{z'}\left[\|X(\theta - \theta^\star)\|_2^2 + \|z'\|_2^2 - 2\, z'^\top X(\theta - \theta^\star)\right] \\
&= \min_{\theta \in \mathbb{R}^d} \frac{1}{n} \left(\|X(\theta - \theta^\star)\|_2^2 + \mathbb{E}_{z'}\left[\|z'\|_2^2\right] - 2\, \mathbb{E}_{z'}\left[z'\right]^\top X(\theta - \theta^\star)\right) \\
&= \min_{\theta \in \mathbb{R}^d} \frac{1}{n}\|X(\theta - \theta^\star)\|_2^2 + \sigma^2 = \sigma^2 \quad \text{when } \theta = \theta^\star,
\end{aligned}
$$
using $\mathbb{E}_{z'}[\|z'\|_2^2] = n\sigma^2$ and $\mathbb{E}_{z'}[z'] = 0$.
• The population risk of the best possible regression function $\theta^\star$ is $L(\theta^\star) = \sigma^2$.
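The only stochastic ingredients in the derivation above are $\mathbb{E}[\|z'\|_2^2] = n\sigma^2$ and $\mathbb{E}[z'] = 0$. A quick numerical sanity check with the same illustrative setup as in the earlier sketches, also confirming that the population risk at $\theta = \theta^\star$ is $\sigma^2$.

import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.5
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.5])

# Check E[||z'||_2^2] = n * sigma^2 with many fresh noise draws.
Z = rng.normal(scale=sigma, size=(100_000, n))
print(np.mean(np.sum(Z ** 2, axis=1)), n * sigma ** 2)

# At theta = theta_star the population risk reduces to sigma^2.
Y_fresh = X @ theta_star + Z                               # fresh labels y' = X theta_star + z'
print(np.mean(np.sum((X @ theta_star - Y_fresh) ** 2, axis=1) / n), sigma ** 2)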
Generalization error
• The population risk of the regression function learned via ERM over the $n$ samples $(X, y)$ is
$$L(\hat{\theta}) = \mathbb{E}_{y' \mid X}\left[\frac{1}{n}\left\|X\hat{\theta} - y'\right\|_2^2\right] = \frac{1}{n}\left\|X(\hat{\theta} - \theta^\star)\right\|_2^2 + \sigma^2.$$
• Formally, the generalization error is defined as the suboptimality of $\hat{\theta}$ compared to $\theta^\star$ in terms of the population risk, known as the excess risk:
$$\mathrm{ER}(\hat{\theta}) := L(\hat{\theta}) - L(\theta^\star) = \frac{1}{n}\left\|X(\hat{\theta} - \theta^\star)\right\|_2^2.$$
• Define the covariance matrix of the features $X$ as
$$\Sigma = \frac{1}{n} X^\top X \in \mathbb{R}^{d \times d}.$$
• Notice that $\mathrm{ER}(\hat{\theta}) = \|\hat{\theta} - \theta^\star\|_\Sigma^2$, where $\|u\|_\Sigma = \sqrt{u^\top \Sigma u}$ is the Mahalanobis norm of any $u \in \mathbb{R}^d$ with respect to $\Sigma \succeq 0$.
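The identity $\mathrm{ER}(\hat{\theta}) = \|\hat{\theta} - \theta^\star\|_\Sigma^2$ is easy to verify numerically; the sketch below reuses the illustrative data-generating choices and the least-squares estimate from the previous snippets.

import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.5
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star + rng.normal(scale=sigma, size=n)
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # ERM estimate, as in the earlier sketch

Sigma = X.T @ X / n                                  # feature covariance of the fixed design
diff = theta_hat - theta_star
er_direct = np.sum((X @ diff) ** 2) / n              # (1/n) * ||X (theta_hat - theta_star)||_2^2
er_mahalanobis = diff @ Sigma @ diff                 # ||theta_hat - theta_star||_Sigma^2
print(er_direct, er_mahalanobis)                     # agree up to floating-point error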
Intuition for generalization error