
Mathematical Foundations of Machine Learning

© 2022 Robert Nowak



Genesis of notes. These notes were developed as part of a course taught by Robert Nowak at the University of Wisconsin-Madison. The reader should beware that the notes have not been carefully proofread and edited. The notes assume the reader has background knowledge of basic probability, statistics, linear algebra, and optimization.

Contents

1 Probability in Machine Learning 1


1.1 Executive Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Joint, Marginal, and Conditional Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Histogram Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4.1 Basic Probability Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.6 Sums of Independent Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Discrete Probability Distributions and Classification 8


2.1 Optimal Binary Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Application to Classification Error Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Application to Nearest Neighbor Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Application to Histogram Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3 Multivariate Gaussian Models and Classification 16


3.1 MVN Models and Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Optimality of Likelihood Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4 Learning MVN Classifiers 22


4.1 Analysis of the “Plug-in” MVN Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Comparison of MVN Plug-in Classifier and Histogram Classifier . . . . . . . . . . . . . . . . . . 24
4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

5 Likelihood and Kullback-Leibler Divergence 26


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2 Optimal Classifiers and Likelihood Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3 Kullback-Leibler Divergence: Intrinsic Difficulty of Classification . . . . . . . . . . . . . . . . . 27
5.3.1 Gaussian Class-Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3.2 Separable Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3.3 Statistical Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

5.3.4 Nonnegativity of KL Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.3.5 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

6 Maximum Likelihood Estimation 32


6.1 The Maximum Likelihood Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.2 ML Estimation and Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.3 Examples of MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.4 MLE and KL Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

7 Sufficient Statistics 36
7.1 Sufficient Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.2 Minimal Sufficient Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7.3 Rao-Blackwell Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

8 Asymptotic Analysis of the MLE 41


8.1 Convergence of log likelihood to KL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.2 Asymptotic Distribution of MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

9 Maximum Likelihood Estimation and Empirical Risk Minimization 47


9.0.1 Maximum Likelihood Estimation Approach . . . . . . . . . . . . . . . . . . . . . . . . . 47
9.0.2 Empirical Risk Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
9.0.3 Example: Gaussian Models and Least Squares . . . . . . . . . . . . . . . . . . . . . . . 48
9.1 The Exponential Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
9.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

10 Linear Models 53
10.1 Common Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
10.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
10.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

11 Gradient Descent 56
11.1 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
11.2 Stochastic Gradient Descent for Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
11.3 Gradients and Subgradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
11.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

12 Analysis of Stochastic Gradient Descent 60


12.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

13 Bayesian Inference 64
13.1 Back to the Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
13.2 Posterior Distributions and Decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
13.3 Example: Temperature Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
13.4 Example: Twitter Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
13.5 Multivariate Normal Distributions in Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . 68
13.6 Bayesian Linear Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
13.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

14 Proximal Gradient Algorithms 72
14.1 Proximal Gradient Algorithm with Squared Error Loss . . . . . . . . . . . . . . . . . . . . . . . 73
14.2 Proximal Gradient Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
14.3 Analysis of Proximal Gradient Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
14.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

15 The Lasso and Soft-Thresholding 79


15.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

16 Concentration Inequalities 84
16.1 The Chernoff Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
16.2 Azuma-Hoeffding Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
16.3 KL-Based Tail Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
16.4 Proof of Hoeffding’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
16.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

17 Probably Approximately Correct (PAC) Learning 94


17.1 Analysis of Empirical Risk Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
17.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

18 Learning in Infinite Model Classes 97


18.1 Rademacher Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
18.2 Generalization Bounds for Classification with 0/1 Loss . . . . . . . . . . . . . . . . . . . . . . . 100
18.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

19 Vapnik-Chervonenkis Theory 102


19.1 Shatter Coefficient and VC Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
19.2 The VC Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
19.2.1 Proof of Massart’s Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
19.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

20 Learning with Continuous Loss Functions 106


20.1 Generalization Bounds for Continuous Loss Functions . . . . . . . . . . . . . . . . . . . . . . . 106
20.1.1 Application to Linear Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
20.2 Proof of Lemma 20.1.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
20.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

21 Introduction to Function Spaces 110


21.1 Constructions of Function Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
21.1.1 Parametric Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
21.1.2 Atomic Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
21.1.3 Nonparametric Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
21.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

22 Banach and Hilbert Spaces 113


22.1 Review of Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
22.2 Normed Vector Spaces and Banach Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
22.3 Hilbert Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
22.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

23 Reproducing Kernel Hilbert Spaces 120

23.1 Constructing an RKHS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
23.2 Examples of PSD Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
23.3 The Representer Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
23.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

24 Analysis of RKHS Methods 126


24.1 Rademacher Complexity Bounds for Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . . 126
24.2 Properties of Kernel Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
24.3 Take-Away Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
24.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

25 Neural Networks 131


25.1 Neural Network Function Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
25.2 ReLU Neural Network Banach Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
25.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

26 Neural Network Approximation and Generalization Bounds 134


26.1 Approximating Functions in F . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
26.2 Generalization Bounds for Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
26.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

27 Notation 140

28 Useful Inequalities 141

29 Convergence of Random Variables 142


29.1 Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
29.2 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
29.3 Law of the Iterated Logarithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

List of Figures

1 Probability in Machine Learning 1


1.1 Movie ratings for 40 students. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Movie ratings for all 8,094 students. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Histogram of movie ratings for all 8,094 students. . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Probabilities of movie ratings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5 All relevant statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Discrete Probability Distributions and Classification 8


2.1 Example of hypercube [0, 1]2 in M equally sized subsquares . . . . . . . . . . . . . . . . . . . . 11

3 Multivariate Gaussian Models and Classification 16

3.1 Multivariate Gaussian Distributions. The top row shows the Gaussian density (left) and its contour
plot (right) with mean [0 0]T and covariance [1 0; 0 1]. The second row is the same except that
the covariance is [1 0.5; 0.5 1]; positive correlation. The third row is the same except that the
covariance is [1 − 0.5; −0.5 1]; negative correlation. . . . . . . . . . . . . . . . . . . . . . . . . 16

4 Learning MVN Classifiers 22

5 Likelihood and Kullback-Leibler Divergence 26

6 Maximum Likelihood Estimation 32

7 Sufficient Statistics 36

8 Asymptotic Analysis of the MLE 41


8.1 Likelihood functions for scalar parameter θ. (a) Low curvature cases will lead to greater estimator
variances compared to (b) high curvature cases. . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

9 Maximum Likelihood Estimation and Empirical Risk Minimization 47

10 Linear Models 53
10.1 Comparison of loss functions for binary classification, with yi ∈ {−1, +1}. . . . . . . . . . . . . 54

11 Gradient Descent 56
11.1 A function f is said to be convex if for all λ ∈ [0, 1] and w1, w2 we have λf(w1) + (1 − λ)f(w2) ≥ f(λw1 + (1 − λ)w2). If f is differentiable, then an equivalent definition is that f(w2) ≥ f(w1) + ∇f(w1)^T (w2 − w1). A function is strictly convex if both inequalities are strict (i.e., hold with ≥ replaced by >). A function f is said to be α-strongly convex if f(w2) ≥ f(w1) + ∇f(w1)^T (w2 − w1) + (α/2)∥w2 − w1∥²_2 for all w1, w2. If f is twice differentiable, then an equivalent definition is that ∇²f(w) ≻ αI for all w. . . . . . . . . . . 57
11.2 The SGD algorithm can be thought of as considering each of the loss terms of (11.1) individually.
Because each term is “flat” in all but 1 of the total d directions, this implies that each term is
convex but not strongly convex (see Figure 11.1). However, if T > d we typically have that the
complete sum is strongly convex which can be exploited to achieve faster rates of convergence. . . 58

12 Analysis of Stochastic Gradient Descent 60

13 Bayesian Inference 64

14 Proximal Gradient Algorithms 72

15 The Lasso and Soft-Thresholding 79

16 Concentration Inequalities 84
16.1 Convexity of exponential function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

17 Probably Approximately Correct (PAC) Learning 94

18 Learning in Infinite Model Classes 97

19 Vapnik-Chervonenkis Theory 102

20 Learning with Continuous Loss Functions 106

21 Introduction to Function Spaces 110

22 Banach and Hilbert Spaces 113

23 Reproducing Kernel Hilbert Spaces 120

24 Analysis of RKHS Methods 126

25 Neural Networks 131

26 Neural Network Approximation and Generalization Bounds 134

27 Notation 140

28 Useful Inequalities 141

29 Convergence of Random Variables 142

Lecture 1: Probability in Machine Learning

1.1. Executive Summary

Probability and statistics are central to the design and analysis of ML algorithms. This note introduces some of
the key concepts from probability useful in understanding ML. There are many great references on this topic,
including [4, Chapter 2].

1.2. Introduction

Consider the training dataset depicted in Figure 1.1. Imagine that these data are based on 1 to 5 star movie reviews.
The horizontal axis is the rating of Star Wars (SW) and the vertical axis is the rating of Crazy Rich Asians (CRA),
a romantic comedy. Think of each point as a person’s ideal ratings (maybe fractional like 3.5), but they are forced
to select 1 to 5 stars. Each box in the plot indicates ratings that take on a specific combination of stars. For
example the box between [2, 3] × [3, 4] indicates people who gave SW a 3-star rating and CRA a 4-star rating.
There are four people in this box. The color of each point indicates whether or not they “liked” Guardians of the
Galaxy (GG); red if they liked GG, green otherwise.¹

Figure 1.1: Movie ratings for 40 students.

There are a total of 40 people in this dataset, but of course there are many more people in the world. Suppose we
are really interested in understanding the movie preferences of all graduate students at UW-Madison. Suppose
there are 8,094 students. Figure 1.2 shows what the preferences of all of them might look like. This is the full “population”
we would like to consider, but our available data is only the smaller sample of 40 students. We can use ML to try
to make predictions about the whole population based on the sample.
¹ Other possible scenarios: in computer vision, the two features might be distances between the eyes and eyebrows and between the center and corners of the mouth, with labels happy or not-happy face; in natural language processing, the features might be the number of times the words “ball” or “vote” appear, with labels indicating whether a document is about sports or politics.

Figure 1.2: Movie ratings for all 8,094 students.

The number of people in each box is shown in Figure 1.3.

Figure 1.3: Histogram of movie ratings for all 8,094 students.

Normalizing by the total number of students gives us the probability that a randomly selected student will be in
a particular box, as shown below. For example, the probability that a person is in the box [2, 3] × [3, 4], which
corresponds to a 3-star rating for SW and a 4-star rating for CRA, is 598/8094 ≈ 3/40. So, if we take a random
sub-sample of 40 people, then we would expect about 3 individuals to be in this box. In the random subsample
shown in Figure 1.1, there were 4 people in this box... very close to our expectation.

Figure 1.4: Probabilities of movie ratings.

So let’s think about things this way. There is a big population out there, and we want to learn what they think
based on a random subsample. We’d like to predict things like: Will a graduate student give both SW and CRA
5-star ratings? Will such a student like GG? If we know how a student rated SW and CRA, can we predict
whether she will like GG? The complete probability structure of this problem is summarized by two sets of counts,
one for those who don’t like GG and one for those who do. There are 4039 students who don’t like GG, and 4055
who do.

Figure 1.5: All relevant statistics: counts of ratings for students who didn’t like GG (left) and who liked GG (right).

1.3. Joint, Marginal, and Conditional Probabilities

Let x1i and x2i denote the two ratings for the ith person in the dataset, and let yi denote the binary ±1 label
indicating whether they liked GG.

1. If we select a person at random from the population, what is the probability that they are in the (3, 4) box?

2. If we select a person at random, what is the probability they gave SW a 3-star rating?

3. What is the probability that a student will like GG?

4. If a person gave SW 3-stars and CRA 4-stars, then what is the probability they will like GG?

5. What is the probability that a person who gives SW 3-stars will give CRA 1 or 2 stars?

6. What is the expected number of people in a subsample of size n that will rate SW 3 stars and CRA 5 stars?

1.4. Histogram Classifier

Suppose we want to predict whether a student will like GG based on her ratings of SW and CRA. Consider the
histogram classifier:
ŷ = { +1,  if p(y = +1 | x_1, x_2) / p(y = −1 | x_1, x_2) ≥ 1
    { −1,  otherwise

We can use a subsample to estimate the probabilities needed above, but they may be poor estimates if the sample
size is small.
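For concreteness, here is a minimal Python/NumPy sketch of this rule using hypothetical 5×5 count arrays in the spirit of Figure 1.5; the counts and function names are illustrative assumptions, not data from the notes.

```python
import numpy as np

# Hypothetical 5x5 count arrays: counts[sw-1, cra-1] = number of students
# giving SW `sw` stars and CRA `cra` stars (cf. Figure 1.5).
counts_dislike = np.random.randint(0, 200, size=(5, 5))   # didn't like GG
counts_like = np.random.randint(0, 200, size=(5, 5))      # liked GG

def histogram_classifier(sw, cra):
    """Predict +1 (likes GG) if p(y=+1 | x1,x2) >= p(y=-1 | x1,x2)."""
    n_like = counts_like[sw - 1, cra - 1]
    n_dislike = counts_dislike[sw - 1, cra - 1]
    if n_like + n_dislike == 0:
        return 0  # no data in this box; no prediction possible
    # The ratio p(y=+1|x)/p(y=-1|x) reduces to the ratio of counts in the box.
    return +1 if n_like >= n_dislike else -1

print(histogram_classifier(sw=3, cra=4))
```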

1.4.1. Basic Probability Calculus

Let X and Y be discrete random variables. The joint probability that X takes the value x and Y takes the value y is denoted by p(x, y). The marginal probability that X takes the value x is p(x) = \sum_y p(x, y), and p(y) is defined analogously. The conditional probability that Y takes the value y given that X equals x is denoted by p(y|x), which is the solution to the equation p(x, y) = p(y|x) p(x); in other words, p(y|x) = p(x, y)/p(x). If the random variables are independent, then p(x, y) = p(x)p(y).
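To make these definitions concrete, the following minimal NumPy sketch computes marginals and a conditional distribution from an assumed 2×3 joint probability table; the numbers are invented purely for illustration.

```python
import numpy as np

# Assumed joint distribution p(x, y) for X in {0,1} and Y in {0,1,2}.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])

p_x = p_xy.sum(axis=1)           # marginal p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)           # marginal p(y) = sum_x p(x, y)
p_y_given_x0 = p_xy[0] / p_x[0]  # conditional p(y | X = 0) = p(0, y) / p(0)

print(p_x, p_y, p_y_given_x0)
# Independence would mean p(x, y) = p(x) p(y) for every entry; false here.
print(np.allclose(p_xy, np.outer(p_x, p_y)))
```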

1.5. Expectation

The expected value or mean of a discrete random variable X is computed by taking the expectation
E[X] = \sum_x x p(x) ,

where the summation is over all possible discrete values X may take. In other words, the expected value is a
weighted combination of all the discrete values, where each weight is the probability that X will take that value.
We can also take the expectation of a function of the random variable
E[f(X)] = \sum_x f(x) p(x) .

Suppose we have two random variables X and Y and consider the sum X + Y . The expectation of the sum is
E[X + Y] = \sum_x \sum_y (x + y) p(x, y)
         = \sum_x \sum_y x p(x, y) + \sum_x \sum_y y p(x, y)
         = \sum_x x \sum_y p(x, y) + \sum_y y \sum_x p(x, y)
         = \sum_x x p(x) + \sum_y y p(y)
         = E[X] + E[Y]
The expectation of the product is
E[XY] = \sum_x \sum_y x y p(x, y) ,

and, in general, is not a function of E[X] and E[Y ] alone. However, if X and Y are independent, then
E[XY] = \sum_x \sum_y x y p(x, y)
      = \sum_x \sum_y x y p(x) p(y)
      = ( \sum_x x p(x) ) ( \sum_y y p(y) )
      = E[X] E[Y]
The variance of a random variable X is the expectation of (X − E[X])², that is, the average squared deviation of X from its mean value µ = E[X]. We write this as

V(X) = E[(X − µ)²] .
Another key concept is the conditional expectation defined as
E[Y | X = x] = \sum_y y p(y|x) .

Technically, this is just the first moment of p(y|x); in other words, the mean of the conditional distribution of y
given the “side information” X = x. The practical interpretation is that this is in some sense the best prediction
of Y given X = x. Note that if X and Y are independent, then p(y|x) = p(y) and E[Y |X = x] = E[Y ]; i.e.,
X = x doesn’t inform our prediction about Y .
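As a sanity check, here is a short Monte Carlo sketch (using independent die rolls, an assumption chosen only for concreteness) illustrating linearity of expectation, the product rule for independent variables, and the definition of variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.integers(1, 7, size=n)      # independent die rolls
Y = rng.integers(1, 7, size=n)

print(np.mean(X + Y), np.mean(X) + np.mean(Y))   # E[X+Y] = E[X] + E[Y]
print(np.mean(X * Y), np.mean(X) * np.mean(Y))   # E[XY] = E[X]E[Y] (independence)
print(np.var(X), np.mean((X - X.mean())**2))     # V(X) = E[(X - mu)^2]
```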

1.6. Sums of Independent Random Variables

In ML applications we often encounter sums or averages of independent random variables. For example, if
we select 100 people at random from a population and ask them if they like cheese, then we can estimate the
probability that a randomly chosen person likes cheese by averaging the answers in our survey.
Let X_1, X_2, . . . denote independent random variables and for any n let S_n = \sum_{i=1}^{n} X_i. As shown above, the mean of the sum is equal to the sum of the means:

E[S_n] = \sum_{i=1}^{n} E[X_i] .

This is true whether or not the terms in the sum are independent. The variance of the sum is more complicated in
general, since it involves not only the variances of the individual terms, but also their covariation.

V(S_n) = E[(S_n − E[S_n])²]
       = E[( \sum_{i=1}^{n} (X_i − E[X_i]) )²]
       = E[ \sum_{i=1}^{n} \sum_{j=1}^{n} (X_i − E[X_i])(X_j − E[X_j]) ]
       = \sum_{i=1}^{n} \sum_{j=1}^{n} E[(X_i − E[X_i])(X_j − E[X_j])]

However, if the terms are independent, then the variance of the sum is equal to the sum of the variances. Note that
if Xi and Xj are independent, then

E[(Xi − E[Xi ])(Xj − E[Xj ])] = E[Xi − E[Xi ]] E[Xj − E[Xj ]] = 0 ,

since both expectations are equal to zero. Thus, if the Xi are independent, the variance simplifies to
V(S_n) = \sum_{i=1}^{n} E[(X_i − E[X_i])²] = \sum_{i=1}^{n} V(X_i) .

Here is a nice application of this property. Suppose we sample n people uniformly at random² and ask them to rate SW and CRA. Based on this sample, we want to estimate the probability that a random person will give SW k stars and CRA ℓ stars. Let 1_{i,k,ℓ} denote the binary “indicator” variable that is assigned the value 1 if person i gives SW k stars and CRA ℓ stars, and the value 0 otherwise. Then the empirical probability that a person gave SW k stars and CRA ℓ stars is

p̂_{k,ℓ} = \frac{1}{n} \sum_{i=1}^{n} 1_{i,k,ℓ} .

The expectation of the empirical probability is

E[p̂_{k,ℓ}] = \frac{1}{n} \sum_{i=1}^{n} E[1_{i,k,ℓ}] = \frac{1}{n} \sum_{i=1}^{n} p_{k,ℓ} = p_{k,ℓ} ,

where p_{k,ℓ} is the “true” probability that a randomly selected person from the population at large will give SW k stars and CRA ℓ stars. So the empirical probability is an unbiased estimator of the true probability. However, depending on the sample size n, it may have a large variance. Since the indicator random variables are independent, the variance is

V(p̂_{k,ℓ}) = E[(p̂_{k,ℓ} − p_{k,ℓ})²] = \frac{1}{n²} \sum_{i=1}^{n} E[(1_{i,k,ℓ} − p_{k,ℓ})²] .

Note that (1_{i,k,ℓ} − p_{k,ℓ})² is either p_{k,ℓ}² or (1 − p_{k,ℓ})². In fact, it takes the value p_{k,ℓ}² with probability 1 − p_{k,ℓ} and the value (1 − p_{k,ℓ})² with probability p_{k,ℓ}. So its expectation is

E[(1_{i,k,ℓ} − p_{k,ℓ})²] = (1 − p_{k,ℓ})² p_{k,ℓ} + p_{k,ℓ}² (1 − p_{k,ℓ}) = p_{k,ℓ}(1 − p_{k,ℓ}) .

Therefore, V(p̂_{k,ℓ}) = p_{k,ℓ}(1 − p_{k,ℓ})/n. In other words, the variance is proportional to 1/n and the standard deviation is proportional to 1/√n.

² We will assume that we sample people uniformly at random with replacement, rather than without replacement. This means that we may select the same person more than once, but it ensures that our samples are independently and identically distributed. If the full population is large relative to the sample size, then there isn’t much chance of sampling the same person more than once.
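A brief simulation sketch (with an assumed true probability p = 3/40) illustrating that the empirical probability is unbiased and that its standard deviation shrinks like 1/√n; the parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
p_true = 3 / 40          # assumed true probability of landing in the (3, 4) box
for n in [40, 400, 4000]:
    # Each trial: sample n people, record the fraction falling in the box.
    p_hat = rng.binomial(n, p_true, size=10_000) / n
    print(n, p_hat.mean(), p_hat.std(), np.sqrt(p_true * (1 - p_true) / n))
```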

1.7. Exercises

1. This question is about using the rand or random function to simulate random variables. Let rand
denote a function that generates a random number uniformly distributed on [0, 1].

a. Use rand to select one of 8,094 students uniformly at random.


b. Write code to select k students uniformly at random without replacement.
c. Write code to generate a bivariate random variable according to the distribution below

2. Consider the movie preference prediction problem. The two arrays below denote student ratings for Star Wars (SW), 1-5 left to right, and Crazy Rich Asians (CRA), 1-5 bottom to top. The array on the left gives the counts of students that did not like Guardians of the Galaxy (GG), and the array on the right gives the counts of those that did.

didn’t like GG liked GG

(a) What is the probability that a randomly selected student (RSS) will like GG and give SW a 5-star
rating?
(b) What is the probability that an RSS who likes GG, will give SW 5-stars and CRA 2-stars?
(c) What is the probability that an RSS will give SW at least 3 stars and CRA at most 2 stars?

(d) Suppose a new student gave SW a 3-star rating and CRA a 2-star rating. Will she like GG?
(e) Suppose a new student gave SW a 3-star rating. Predict her rating for CRA.
(f) Suppose you know that a new student gave CRA a 2 star rating and also that she liked GG, what is
your estimate of her (unknown) rating of SW?
(g) Suppose all you know is that a new student didn’t like GG. Make predictions of her ratings for SW
and CRA?
(h) Suppose a new student gave SW X rating and CRA Y rating, and 7 ≤ X + Y ≤ 9. What is the
probability she will like GG?
(i) Suppose a new student gave SW X rating and CRA Y rating, and X < Y . What is the probability she
will like GG?

3. Consider the following two-player game. Each player rolls a six-sided die (and the outcomes of the rolls are
independent and uniformly distributed). They cannot see their own outcome but they can see the outcome
of the other player. They must submit a guess of their own roll and they do so in such a way that absolutely
no information is passed between players. If they are both correct, they win. If either is wrong, they lose.
What is the maximum probability of victory?

4. Suppose that scientists are studying how the brain performs a certain information-processing task. Three
regions of the brain are involved, denoted A, B and C. There is prior evidence that there are direct neural
connections between regions A and B and regions B and C. However, it is uncertain whether regions A and
C are directly connected. The scientists design an experiment to test this. The activity in human subjects’
brains is measured while they perform the information-processing tasks. The activity level in each region is
a binary-valued variable, indicating whether the region is significantly active (1) or not (0). Let xA , xB , and
xC denote the activity level in each region, which we will model as sequences of random variables. If there
is no direct connection between regions A and C, then we conjecture that xA and xC will be conditionally
independent given xB .
Many measurements of these variables, for repeated trials of the task and different human subjects, are
recorded. The dataset is {(x_A^{(i)}, x_B^{(i)}, x_C^{(i)})}_{i=1}^{n}, where n is the total number of measurements. We can model each triple (x_A^{(i)}, x_B^{(i)}, x_C^{(i)}) as independently and identically distributed (i.e., each triple is an independent realization from the same multivariate distribution, but x_A^{(i)}, x_B^{(i)}, and x_C^{(i)} may be correlated). How would you use the data to check whether x_A and x_C are conditionally independent given x_B?

Lecture 2: Discrete Probability Distributions and Classification

Let Y be a random variable that takes one of m discrete values {a_1, . . . , a_m}. For example, Y could be a label in a binary classification problem taking values −1 or +1. The probability distribution for discrete random variables is P(Y = a_j) = p_j, j = 1, . . . , m, with \sum_{j=1}^{m} p_j = 1. Here are some common distributions.

Bernoulli Suppose Y takes values 0 or 1, then


P(Y = y) = p^y (1 − p)^{1−y}
Its mean and variance are p and p(1 − p), respectively.
Binomial Consider n independent and identically distributed Bernoulli random variables Y1 , Y2 , . . . , Yn . Then
the joint probability distribution is
P(Y_1 = y_1, . . . , Y_n = y_n) = \prod_{i=1}^{n} P(Y_i = y_i) = \prod_{i=1}^{n} p^{y_i} (1 − p)^{1−y_i}

The sum of i.i.d. Bernoullis follows a Binomial distribution


P( \sum_{i=1}^{n} Y_i = k ) = \binom{n}{k} p^k (1 − p)^{n−k}

where \binom{n}{k} = \frac{n!}{k!(n−k)!} is the number of ways of choosing k out of the n variables to have the value 1. The sum K := \sum_{i=1}^{n} Y_i is said to be a binomial random variable. The mean and variance of K are np and np(1 − p), respectively.
Multinomial Consider n independent and identically distributed random variables Y1 , Y2 , . . . , Yn that take values
in {a1 , . . . , am }. Then the joint probability distribution is
P(Y_1 = y_1, . . . , Y_n = y_n) = \prod_{i=1}^{n} P(Y_i = y_i) = \prod_{i=1}^{n} \prod_{j=1}^{m} p_j^{1\{y_i = a_j\}}

Let Kj denote the number of times the value aj appears in the sample. Then
P(K_1 = k_1, . . . , K_m = k_m) = \binom{n}{k_1, k_2, . . . , k_m} \prod_{j=1}^{m} p_j^{k_j}

where \binom{n}{k_1, k_2, . . . , k_m} = \frac{n!}{k_1! k_2! \cdots k_m!} is the number of ways of choosing k_1 of the variables to have the value a_1, k_2 to have the value a_2, and so on. The mean and variance of each K_j are n p_j and n p_j (1 − p_j), respectively.
Poisson Let Y be a non-negative integer-valued random variable with distribution
P(Y = k) = \frac{e^{−λ} λ^k}{k!}

with parameter λ > 0. Both the mean and the variance are equal to λ.
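A short simulation sketch (sample sizes and parameters chosen arbitrarily) checking the stated means and variances of the Binomial and Poisson distributions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 20, 0.3, 4.0

K = rng.binomial(n, p, size=200_000)    # sum of n Bernoulli(p) variables
print(K.mean(), n * p)                  # mean  ~ np
print(K.var(), n * p * (1 - p))         # variance ~ np(1-p)

Y = rng.poisson(lam, size=200_000)
print(Y.mean(), Y.var(), lam)           # both ~ lambda
```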

2.1. Optimal Binary Classification

The goal of classification is to learn a mapping from the feature space X to a label space, Y. This mapping, f , is
called a classifier. For example, we might have

X = Rd
Y = {0, 1}.

The classifier output is a prediction of the label, ŷ = f(x). We can measure the error of our classifier using a loss function; e.g., the 0-1 loss

ℓ(ŷ, y) = 1_{\{ŷ ≠ y\}} = { 1,  ŷ ≠ y
                           { 0,  ŷ = y
Assume that the features and labels follow a joint probability distribution. The risk is defined to be the expected value of the loss function:

R(f) = E[ℓ(f(X), Y)] = E[1_{\{f(X) ≠ Y\}}] = P(f(X) ≠ Y).

Both the expectation and probability are with respect to a random (X, Y) pair. Note that 1_{\{f(X) ≠ Y\}} is a Bernoulli random variable with probability p = P(f(X) ≠ Y). Thus, if we have an i.i.d. dataset {X_i, Y_i}_{i=1}^{n}, then the total number of mistakes f makes on these data is \sum_{i=1}^{n} 1_{\{f(X_i) ≠ Y_i\}}, and this random variable is binomially distributed.

The performance of a given classifier can be evaluated in terms of how close its risk is to the Bayes risk.

Definition 1 (Bayes Risk). The Bayes risk is the infimum of the risk for all classifiers:

R* = inf_f R(f).

We can prove that the Bayes risk is achieved by the Bayes classifier.

Definition 2 (Bayes Classifier). The Bayes classifier is the following mapping:

f*(x) = { 1,  η(x) ≥ 1/2
        { 0,  otherwise

where

η(x) ≡ P_{Y|X}(Y = 1 | X = x).

Note that for any x, f ∗ (x) is the value of y ∈ {0, 1} that maximizes PXY (Y = y|X = x).

Theorem 1 (Risk of the Bayes Classifier).


R(f ∗ ) = R∗ .

Note that while the Bayes classifier achieves the Bayes risk, in practice this classifier is not realizable because we
do not know the distribution PXY and so cannot construct η(x).

2.2. Application to Classification Error Estimation

Let f be any classifier. Its probability of error is p_f := P(f(X) ≠ Y) = E[1_{\{f(X) ≠ Y\}}]. This is generally unknown, since we typically don’t know the joint distribution P_{XY} in practice. A common approach to estimate the error rate of a classifier f is to evaluate its performance on a test set {X_i, Y_i}_{i=1}^{n} drawn i.i.d. from P_{XY}. The empirical error rate is

p̂_f = \frac{1}{n} \sum_{i=1}^{n} 1_{\{f(X_i) ≠ Y_i\}} .

Since the binary indicator variables 1_{\{f(X_i) ≠ Y_i\}} are i.i.d., n p̂_f has a Binomial distribution. So the mean and variance of p̂_f are E[p̂_f] = p_f and V(p̂_f) = p_f(1 − p_f)/n.

2.3. Application to Nearest Neighbor Classification

Suppose we are given a set of binary labeled training data {X_i, Y_i}_{i=1}^{n} that are independently and identically distributed, and let X be a new point drawn from the same distribution. If we knew the data distribution, then the Bayes optimal classifier would label X with 1 if P(Y = 1|X) > P(Y = 0|X) and 0 otherwise. For any X let η(X) = P(Y = 1|X). The optimal classifier’s probability of error is R* := E[min(η(X), 1 − η(X))]. To relate this to the notation above, recall that the probability of error P(f*(X) ≠ Y) = E[1_{\{f*(X) ≠ Y\}}]. The expectation is taken with respect to a random pair (X, Y). We can break this into two expectations:

E[1_{\{f*(X) ≠ Y\}}] = E_X[ E_{Y|X}[1_{\{f*(X) ≠ Y\}}] ] = E_X[min(η(X), 1 − η(X))] .

The nearest neighbor classifier labels a new point X by finding the closest point in the training set, i_X = arg min_i dist(X, X_i), and assigning the corresponding label Y_{i_X}. The dist function could be any valid distance measure, for example the Euclidean distance dist(X, X_i) = ∥X − X_i∥_2. Its asymptotic error rate is characterized by the following theorem.
Theorem 2 ([8]). Let f_n^{NN} denote the nearest neighbor classifier based on n i.i.d. training examples and for any X let η(X) = P(Y = 1|X). Then

lim_{n→∞} P(f_n^{NN}(X) ≠ Y) = E[2η(X)(1 − η(X))]

The proof is a bit technical (see [12] for details), but the intuition is straightforward. If the training set is large
enough, then there is a point X ′ ∈ {X1 , . . . , Xn } that is very close to X, so let’s suppose that X ′ = X. Under
this assumption, the labels of X and X ′ , denoted Y and Y ′ , are independent and identically distributed Bernoulli
random variables with p = η(X). There are two cases of error Y ′ = 1 and Y = 0 or Y ′ = 0 and Y =
1. The probability of either of these two outcomes is η(X)(1 − η(X)) and so the total probability of error is
2η(X)(1 − η(X)).
NN
Let R∞ := E[2η(X)(1 − η(X))] denote the asymptotic error of the nearest neighbor rule. Since η ∈ [0, 1], we
have 2η(1 − η) ≥ min(η, 1 − η), which implies that the asymptotic error rate of the nearest neighbor rule is never
better than the optimal classifier. Also, if we let Z = min(η(X), (1 − η(X))), then
NN
R∞ = 2 E[Z(1 − Z)] = 2 (E[Z] − E[Z 2 ]
≤ 2 (E[Z] − E2 [Z]) , since E[Z 2 ] ≥ (E[Z])2
= 2R∗ (1 − R∗ ) ≤ 2R∗

So we have shown that the asymptotic error rate of the nearest neighbor classifier is never more than twice that of
the Bayes optimal classifier.
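A minimal 1-nearest-neighbor sketch (Euclidean distance, toy synthetic data) to make the rule concrete; the data and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy training set: two Gaussian blobs with labels 0 and 1.
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
Y_train = np.array([0] * 50 + [1] * 50)

def nn_classify(x):
    # Index of the closest training point in Euclidean distance.
    i_x = np.argmin(np.linalg.norm(X_train - x, axis=1))
    return Y_train[i_x]

print(nn_classify(np.array([2.1, 1.8])), nn_classify(np.array([-0.5, 0.2])))
```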

2.4. Application to Histogram Classifier

Let us assume that the (input) features are randomly distributed over the unit hypercube X = [0, 1]d (note that
by scaling and shifting any set of bounded features we can satisfy this assumption), and assume that the (output)
labels are binary, i.e., Y = {0, 1}. A histogram classifier is based on a partition of the hypercube [0, 1]^d into M smaller cubes or “bins” of equal size.

Example 1 (Partition of hypercube in 2 dimensions). Consider the unit square [0, 1]² and partition it into M subsquares of equal area (assuming M is a squared integer). Let the subsquares be denoted by {B_i}_{i=1}^{M}.

Figure 2.1: Partition of the hypercube [0, 1]² into M equally sized subsquares, each of side length M^{−1/2}.

A histogram classifier is any assignment of 0 or 1 to each bin. Given training examples {X_i, Y_i}_{i=1}^{n} drawn i.i.d. from P_{XY}, a reasonable rule is to assign each bin the majority vote of the examples that fall into that bin. Specifically, for the jth bin define

P̂_j = \frac{\sum_{i=1}^{n} 1_{\{X_i ∈ B_j, Y_i = 1\}}}{\sum_{i=1}^{n} 1_{\{X_i ∈ B_j\}}} ,

with the convention that 0/0 = 0. Assign the bin the label 1 if P̂_j ≥ 1/2 and 0 otherwise. Equivalently, define the following piecewise-constant estimator of η(x):

η̂_n(x) = \sum_{j=1}^{M} P̂_j 1_{\{x ∈ B_j\}}

and classify according to

f̂_n^H(x) = { 1,  η̂_n(x) ≥ 1/2
            { 0,  otherwise
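Here is a compact sketch of this majority-vote histogram classifier on the unit square (number of bins, data, and labels all synthetic and chosen only for illustration), assuming NumPy.

```python
import numpy as np

rng = np.random.default_rng(5)
m_side = 4                        # bins per axis, so M = m_side**2 bins in [0,1]^2
X = rng.uniform(size=(2000, 2))
Y = (X.sum(axis=1) + 0.2 * rng.normal(size=2000) > 1).astype(int)  # noisy labels

def bin_index(x):
    # Map a point in [0,1]^2 to its bin's (row, col) index.
    return tuple(np.minimum((x * m_side).astype(int), m_side - 1))

ones = np.zeros((m_side, m_side))
total = np.zeros((m_side, m_side))
for x, y in zip(X, Y):
    i, j = bin_index(x)
    total[i, j] += 1
    ones[i, j] += y

P_hat = np.divide(ones, total, out=np.zeros_like(ones), where=total > 0)
f_hat = (P_hat >= 0.5).astype(int)        # majority-vote label for each bin
print(f_hat[bin_index(np.array([0.9, 0.8]))], f_hat[bin_index(np.array([0.1, 0.2]))])
```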
The histogram classifier may differ from the Bayes classifier in two ways:

bias: its classification rule is constant on each bin

variance: the majority vote may not be the optimal rule for each bin

The bias tends to 0 as M → ∞, and the variance tends to 0 as n → ∞. Thus, if M, n → ∞ the histogram
classifier may converge to the Bayes classifier. Formally, we have the following theorem which proves that
histogram classifiers are universally consistent, meaning their error rate converges to the Bayes error rate. The
histogram classifier is similar to the nearest-neighbor classifier. Both label a new example based on the examples
in the training set that are close to it. The key difference is that the histogram classifier prediction is effectively
the majority vote of a number of nearby examples. This averaging effect enables it to achieve near-optimal
performance with a sufficiently large training set.
Theorem 3 (Consistency of Histogram Classifiers). If M → ∞ and n/M → ∞ as n → ∞, then as n → ∞ the error rate of the histogram classifier P(f_n^H(X) ≠ Y) → R*, the Bayes risk, for every distribution P_{XY}.

Here is a sketch of the main ideas in the proof (a more formal proof is given below). The histogram classifier
assigns each bin the label of the majority of training data in the bin. Equivalently, the label for bin Bj is 1 if
p̂_j := \frac{\sum_{i=1}^{n} 1_{\{x_i ∈ B_j, y_i = 1\}}}{\sum_{i=1}^{n} 1_{\{x_i ∈ B_j\}}}

is larger than 1/2 and 0 otherwise. The bias of the histogram classifier is due to the fact that it must assign the
same label to every point in each bin, whereas the optimal classifier can be arbitrary. However, as M → ∞ the
bins get smaller and smaller and this piecewise constant restriction becomes less and less of a limitation. This
implies that as M grows, there is a histogram classifier that can approximate the optimal classifier to arbitrary
accuracy. Think of it this way. Let pj be the probability of Y = 1 for a random X in this bin. The optimal
histogram decision is to assign the label 1 to the bin if pj > 1/2. As the bin size shrinks to 0 around a specific
point x, the value of pj tends to P(Y = 1|X = x).
What we need to show is that p̂_j is a good estimator of p_j. Let k_j = \sum_{i=1}^{n} 1_{\{x_i ∈ B_j, y_i = 1\}} and n_j = \sum_{i=1}^{n} 1_{\{x_i ∈ B_j\}}, so that p̂_j = k_j/n_j. Notice that given n_j, the random variable k_j is binomially distributed with parameter p_j.
Therefore, E[kj |nj ] = pj nj and V[kj |nj ] = nj pj (1 − pj ). If n/M → ∞, then the average number of examples
per bin tends to infinity. So we can conclude that nj → ∞. Combining this with the fact that kj |nj is binomially
distributed, it follows that the variance of kj /nj → 0 and therefore kj /nj → pj as n → ∞.

Here is the more formal proof of the theorem.

Proof. The proof can be found in [12, Chapter 6]. We will prove the result under some minor assumptions that simplify things. Assume that η(x) = P(Y = 1|X = x) is uniformly continuous³ and that the marginal density p(x) ≥ c for some constant c > 0. Let f* denote the Bayes optimal classifier (defined by η). It is easily verified that

P(f̂_n^H(X) ≠ Y) − P(f*(X) ≠ Y) ≤ 2 E[|η̂_n(X) − η(X)|] ,

so we will focus on the convergence of this upper bound.

Let P_j ≡ \frac{\int_{B_j} η(x) p_X(x) dx}{\int_{B_j} p_X(x) dx} (the theoretical analog of P̂_j) and define

η̄(x) = \sum_{j=1}^{M} P_j 1_{\{x ∈ B_j\}} .

The function η̄ is constant on each bin and its value on each bin is the average of η. By the triangle inequality,

E[|η̂_n(X) − η(X)|] ≤ E[|η̄(X) − η(X)|] + E[|η̂_n(X) − η̄(X)|] ,

where the first term on the right is the approximation error and the second is the estimation error.

³ A function f is uniformly continuous if, for every ϵ > 0, there exists a δ > 0 such that |x − y| < δ implies ∥f(y) − f(x)∥ < ϵ.

The expectation above is over both the training data (which define η̂_n) and the new test point X. We will show that E[|η̂_n(X) − η(X)|] → 0 as M and n grow, per the statement of the theorem. Since the Bayes risk is determined by η(X), this proves that the histogram classifier’s error converges to the Bayes risk.

The bias is bounded as follows.

E[|η̄(X) − η(X)|] = \sum_{j=1}^{M} \int_{B_j} |η̄(x) − η(x)| p_X(x) dx
                  ≤ ϵ_M \sum_{j=1}^{M} \int_{B_j} p_X(x) dx ,   for some small ϵ_M, by continuity of η
                  = ϵ_M \sum_{j=1}^{M} P(X ∈ B_j)
                  = ϵ_M ,   since \sum_{j=1}^{M} P(X ∈ B_j) = 1

By taking M sufficiently large, ϵ_M can be made arbitrarily small. Thus E[|η(X) − η̄(X)|] → 0.

The variance is bounded by noting that P̂_j is proportional to a binomial random variable. For any x ∈ [0, 1]^d, let B(x) denote the histogram bin in which x falls. Define the random variables

N(x) = \sum_{i=1}^{n} 1_{\{X_i ∈ B(x)\}}   and   K(x) = \sum_{i=1}^{n} 1_{\{X_i ∈ B(x), Y_i = 1\}} .

Then

η̂_n(x) = K(x) / N(x) .

Note that

K(x) | {N(x) = n_x} ∼ Binomial(n_x, η̄(x)) ,

since η̄(x) is the probability of a sample in B(x) having the label 1 and we are conditioning on the event of observing n_x samples in B(x). Therefore,

E[η̂_n(x)] = E[ E[η̂_n(x) | N(x) = n_x] ] = E[η̄(x)] = η̄(x) ,

and for any n_x > 0

E[|η̂_n(x) − η̄(x)|² | N(x) = n_x] = \frac{η̄(x)(1 − η̄(x))}{n_x} .

By convexity (i.e., Jensen’s inequality), E[|Z|] ≤ (E[|Z|²])^{1/2} for any random variable Z. So we have

E[|η̂_n(x) − η̄(x)| | N(x) = n_x] ≤ (E[|η̂_n(x) − η̄(x)|² | N(x) = n_x])^{1/2} = \sqrt{\frac{η̄(x)(1 − η̄(x))}{n_x}} .

Since p_X(x) ≥ c > 0 and n/M → ∞, it follows that N(x) → ∞. Therefore

E[|η̂_n(x) − η̄(x)| | N(x)] → 0   almost surely (i.e., with probability 1),

and thus E[|η̂_n(x) − η̄(x)|] → 0. Since this holds for every x ∈ [0, 1]^d, it follows that E[|η̂_n(X) − η̄(X)|] → 0, where the expectation is over the training data and a randomly chosen X.

2.5. Summary

We defined the optimal binary classifier f* and showed that the nearest-neighbor classifier f_n^{NN} and the histogram
classifier fnH can perform nearly or exactly as well in the limit as the training set size n → ∞. So is this the end
of the machine learning story? No. In practice, training set sizes are limited, and more sophisticated approaches
such as kernel methods and neural networks tend to work much better. We will see why this may be so in later
lectures.

2.6. Exercises

1. Consider the binary classification problem. Let η(x) = P(Y = 1|X = x) and recall that the Bayes Classifier
is defined as

f*(x) = { 1,  η(x) ≥ 1/2
        { 0,  otherwise
Let g(x) be any other classifier. Prove that

P(g(X) ̸= Y ) ≥ P(f ∗ (X) ̸= Y ) .

2. Consider a binary classification problem with (X, Y ) ∼ PXY . Recall the Bayes optimal classifier:

f*(x) = { 1,  η(x) ≥ 1/2
        { 0,  otherwise

where η(x) = P(Y = 1|X = x). Let η̃ denote any approximation to η and consider the “plug-in” classifier

f(x) = { 1,  η̃(x) ≥ 1/2
       { 0,  otherwise

Show that

P(f(X) ≠ Y) − P(f*(X) ≠ Y) ≤ 2 E[|η(X) − η̃(X)|] .

3. Let X be a nonnegative random variable. Prove Markov’s inequality:

P(X ≥ t) ≤ \frac{E[X]}{t}

4. A common approach to estimate the error rate of a classifier f is to evaluate its performance on a test set
{X_i, Y_i}_{i=1}^{n} drawn i.i.d. from P_{XY}. The empirical error rate is

p̂_f = \frac{1}{n} \sum_{i=1}^{n} 1_{\{f(X_i) ≠ Y_i\}} .

Let p_f := P(f(X) ≠ Y) and show that for any ϵ > 0,

P(|p̂_f − p_f| ≥ ϵ) ≤ \frac{p_f(1 − p_f)}{nϵ²} .

5. Let X be a random variable. Prove that E[X 2 ] ≥ (E[X])2 .

6. Let X be a discrete random variable taking values in the set {x1 , x2 , . . . , xk } ⊂ R. Prove Jensen’s inequal-
ity: For any convex function φ
E[φ(X)] ≥ φ(E[X]) .
Hint: By the definition of convexity, for any λ ∈ [0, 1] and x1 , x2 ∈ R we have φ(λx1 + (1 − λ)x2 ) ≤
λφ(x1 ) + (1 − λ)φ(x2 ). Use this fact and induction on k to prove the result. Jensen’s inequality also holds
for continuous real-valued random variables (essentially the limit of discrete distributions).

7. Let X be a random variable. Prove that E[|X|3 ] ≥ (E[|X|])3 . (Hint: apply Jensen’s inequality)
8. Show that the mean and variance of a Poisson random variable with distribution P(X = k) = e^{−λ} λ^k / k! are both λ.

9. Suppose X_1 and X_2 are independent Poisson random variables with parameters λ_1 and λ_2. 1) Show that the conditional distribution of X_1 given X_1 + X_2 = n is binomial with parameters n and λ_1/(λ_1 + λ_2). 2) Show that X_1 + X_2 is also a Poisson random variable with parameter λ_1 + λ_2.

10. What is the minimum probability of error classifier in the multiclass setting? Assume m > 2 classes and
knowledge of the joint distribution PXY .

11. Derive an expression for the minimum probability of error in the multiclass setting in terms of appropriate
functionals of the joint distribution (similar to the expression we derived in the binary classification setting).

12. Consider estimating the error rate of a classifier f in a multiclass setting. Give an expression for the estimator based on a training set {(X_i, Y_i)}_{i=1}^{n} drawn i.i.d. from P_{XY}.

13. Following from the previous problem, suppose that n = 1000 and the estimated error rate is 0.05. Are you
confident that the true error is probably less than 0.10?

Lecture 3: Multivariate Gaussian Models and Classification

A common framework in ML is to suppose that the training data {x_i, y_i}_{i=1}^{n} are drawn i.i.d. from P_{xy}, where P_{xy} is the joint distribution over pairs (x, y). The generative approach to designing a classifier is to fit a model to the training data, and then use this model to derive a classifier; e.g., based on p(y|x). One of the most common probability models in ML is the multivariate Gaussian (or Normal) model, abbreviated MVN. Let x ∈ R^d. The MVN density function is given by

p(x) = \frac{1}{\sqrt{(2π)^d |Σ|}} \exp\left( −\frac{1}{2} (x − µ)^T Σ^{−1} (x − µ) \right) ,
where µ ∈ Rd is the mean, Σ ∈ Rd×d is the covariance matrix, and |Σ| is the determinant of the covariance
matrix. If x is a random vector distributed according to this density function, then we write x ∼ N (µ, Σ) for
shorthand notation. Some bivariate Gaussians are depicted in the figure below. In this note we will examine how
the MVN model can be used to derive linear (and nonlinear) classifiers.

Figure 3.1: Multivariate Gaussian Distributions. The top row shows the Gaussian density (left) and its contour
plot (right) with mean [0 0]T and covariance [1 0; 0 1]. The second row is the same except that the covariance
is [1 0.5; 0.5 1]; positive correlation. The third row is the same except that the covariance is [1 − 0.5; −0.5 1];
negative correlation.

One of the most important and special properties of the Gaussian distribution is that linear transformations of
Gaussian random variables are Gaussian distributed. For example, the sum of two Gaussian random variables is
Gaussian. The general rule for affine transformations is: if x ∼ N (µ, Σ), then for any compatible matrix A and
vector b the transformed variable Ax + b ∼ N (Aµ + b, AΣAT ). This is a very special property that does not
hold for random variables in general. This linear transformation property can be proved using Fourier transforms.

3.1. MVN Models and Classification

Consider a multiclass prediction problem and let p(x|y = j), j = 0, 1, . . . , k denote the class conditional distri-
butions of the feature x; i.e., the features of examples belonging to class j are distributed according to the density
function p(x|y = j). In this note we focus on the special case where the class-conditional densities are Gaussian:
x|y = j ∼ N (µj , Σj ).
Note that each class-conditional density is Gaussian with its own mean and covariance.

Recall that the optimal classification rule is

ŷ(x) = arg max_j p(y = j|x) ,

where p(y = j|x) is the probability that y = j given the feature x. We say it is optimal because it minimizes the probability of error. This can be related to the class-conditional densities via Bayes Rule:

p(y = j|x) = \frac{p(x|y = j) p(y = j)}{p(x)} .

Here p(y = j) is the marginal probability that y = j, i.e., the probability that a random example has label j. This is also sometimes called the prior probability that y = j, since it is the probability prior to having seen the feature associated with y. p(x) is the marginal density of x. When predicting the label given x, the denominator p(x) is a constant, so the optimal classification rule can be expressed as

ŷ(x) = arg max_j p(x|y = j) p(y = j) .

Consider the special case of binary classification. In this case,

ŷ(x) = arg max_{j∈{0,1}} p(x|y = j) p(y = j)

     = { 1,  if p(x|y=1)/p(x|y=0) > p(y=0)/p(y=1)
       { 0,  if p(x|y=1)/p(x|y=0) < p(y=0)/p(y=1)

     = { 1,  if log( p(x|y=1)/p(x|y=0) ) > log( p(y=0)/p(y=1) )
       { 0,  if log( p(x|y=1)/p(x|y=0) ) < log( p(y=0)/p(y=1) )

This is called the log-likelihood ratio test. If we have Gaussian class-conditional densities, then the log-likelihood ratio is a quadratic function of x. So the decision boundary (the set of x where the log-likelihood ratio is 0) is a quadratic curve/surface in the feature space. For the special case of Gaussian class-conditional densities with equal covariances (i.e., Σ_0 = Σ_1 = Σ) and equal prior probabilities (i.e., p(y = 0) = p(y = 1) = 1/2), the log-likelihood ratio test simplifies to

ŷ(x) = { 1,  if 2(µ_1 − µ_0)^T Σ^{−1} x ≥ µ_1^T Σ^{−1} µ_1 − µ_0^T Σ^{−1} µ_0
       { 0,  otherwise

If we let w = 2Σ^{−1}(µ_1 − µ_0) and b = µ_0^T Σ^{−1} µ_0 − µ_1^T Σ^{−1} µ_1, then we see that this is just a linear classifier: predict 1 when w^T x + b > 0.
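A minimal sketch (synthetic data, equal covariance assumed) of building this linear rule by plugging in sample estimates of µ_0, µ_1, and the shared Σ; the estimator choices below are illustrative, not prescribed by the notes.

```python
import numpy as np

rng = np.random.default_rng(7)
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])

X0 = rng.multivariate_normal(mu0, Sigma, size=300)
X1 = rng.multivariate_normal(mu1, Sigma, size=300)

# Plug-in estimates of the class means and a pooled covariance.
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S = 0.5 * (np.cov(X0.T) + np.cov(X1.T))
Sinv = np.linalg.inv(S)

w = 2 * Sinv @ (m1 - m0)
b = m0 @ Sinv @ m0 - m1 @ Sinv @ m1

predict = lambda x: int(w @ x + b > 0)   # 1 if w^T x + b > 0, else 0
print(predict(np.array([2.0, 1.0])), predict(np.array([0.0, 0.0])))
```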

3.2. Optimality of Likelihood Ratio

If the class-conditional densities are known and the prior probabilities of each class are equal, then the maximum likelihood classification rule ŷ(x) = arg max_j p(x|y = j) is optimal, in the sense that it minimizes the probability of making a mistake⁴. Here we provide another interpretation of this optimality for the special case of two classes (binary classification). The generalization to multiple classes is straightforward.

⁴ If the prior probabilities are not equal, then the optimal classification rule is simply ŷ(x) = arg max_j p(x|y = j) p(y = j), where p(y = j) is the prior probability of class j.

To keep the notation simple, let pj , j = 0, 1, denote the two class-conditional distributions. Assume that we
observe a random variable distributed according to one of two distributions.

H0 : x ∼ p0
H1 : x ∼ p1

Deciding which of the two best fits an observation of x is called a simple binary hypothesis test, simple because
the two distributions are known precisely (i.e., without unknown parameters or other uncertainties). A decision is
made by partitioning the range of x into two disjoint regions. Let us denote the regions by R0 and R1 . If x ∈ Ri ,
then we decide that H_i is the best match to the data; i.e., we decide that the data were distributed according to p_i. The key question is how to design the decision regions R_0 and R_1. Note that since R_0 ∪ R_1 is assumed to be the
entire range of x, R1 is simply the complement of R0 , and so the choice of either region determines the other.

There are four possible outcomes in a test of this form, depending on the decision we make (H0 or H1 ), and the
true distribution of the data (also H0 or H1 ). Let us denote these as (0, 0), (0, 1), (1, 0), and (1, 1), where the first
argument denotes the decision based on the regions R0 and R1 and the second denotes the true distribution that
generated x. Note that the outcomes (0, 1) and (1, 0) are mistakes or errors. The test made the wrong decision
about which distribution generated x.

In order to optimize the choice of decision regions, we can specify a cost for incorrect (and correct, if we wish)
decisions. Without loss of generality, let’s assume the costs are non-negative. Let ci,j be the cost associated with
outcome (i, j), i, j ∈ {0, 1}. The costs reflect the relative importance of correct and incorrect decisions. Since
our aim is to design a test that makes few mistakes, it is reasonable to assume that c1,0 and c0,1 are larger than c0,0
and c1,1 ; in fact often it is reasonable to assign a zero cost to correct decisions. The costs c0,1 and c1,0 may be
different. For example, it may be that one type of error is more problematic than the other.

The overall cost associated with a test (i.e., with decision regions R0 and R1 ) is usually called the Bayes Cost, and
it is defined as follows.
C = ∑_{i,j=0}^{1} c_{i,j} π_j P(decide H_i | H_j is true)

where πj := p(y = j), j = 0, 1, is called the prior probability of Hj , and P(decide Hi | Hj is true) denotes
the probability of deciding Hi when Hj generated x. πj is the probability that an observation will be generated
according to pj . The prior probabilities sum to 1, since we assume the data are generated either according to p0
or p1 , but they need not be equal. One distribution may be more probable than the other (e.g., more people are
healthy than have a disease). Our goal is to design the decision regions in order to minimize the Bayes Cost.

The Bayes Cost can be expressed directly in terms of the decision regions as follows. We will assume that p0
and p1 are continuous densities, but an analogous representation exists when they are discrete probability mass
Footnote 4: If the prior probabilities are not equal, then the optimal classification rule is simply ŷ(x) = arg max_j p(x|y = j) p(y = j), where p(y = j) is the prior probability of class j.

functions (i.e., replace integrals with sums in expressions below).
C = ∑_{i,j=0}^{1} c_{i,j} π_j P(decide H_i | H_j is true)

  = ∑_{i,j=0}^{1} c_{i,j} π_j P(x ∈ R_i | H_j is true)

  = ∑_{i,j=0}^{1} c_{i,j} π_j ∫_{R_i} p_j(x) dx

The choice of R0 and R1 that minimizes the cost C becomes obvious if we expand the sum above:

C = ∫_{R_0} ( c_{0,0} π_0 p_0(x) + c_{0,1} π_1 p_1(x) ) dx + ∫_{R_1} ( c_{1,0} π_0 p_0(x) + c_{1,1} π_1 p_1(x) ) dx

The integrands are non-negative, so it follows that we should let R0 be the set of x for which the first integrand is
smaller than the second. That is,

R0 := {x : c0,0 π0 p0 (x) + c0,1 π1 p1 (x) < c1,0 π0 p0 (x) + c1,1 π1 p1 (x)}


R1 := {x : c0,0 π0 p0 (x) + c0,1 π1 p1 (x) > c1,0 π0 p0 (x) + c1,1 π1 p1 (x)}

Therefore, the optimal test (relative to the assigned costs) takes the following simple form:
p1(x)/p0(x)  ≷_{H_0}^{H_1}  π_0 (c_{1,0} − c_{0,0}) / ( π_1 (c_{0,1} − c_{1,1}) ) ,

where we decide H1 if the left-hand side is larger and H0 if it is smaller. Note that the term on the right-hand side is a constant that depends on the prior probabilities and the costs (i.e., it does not depend on x). The term on the left-hand side is a ratio of probability densities evaluated at x. The value of a probability density at the observed x is called the likelihood of x under that model. Thus, p1(x)/p0(x) is called the likelihood ratio and the test is called the likelihood ratio test (LRT). No matter what the prior probabilities are or how the costs are assigned, the optimal test always takes the form

p1(x)/p0(x)  ≷_{H_0}^{H_1}  γ ,

where γ > 0 is the threshold of the test. We have shown that the LRT, with an appropriate threshold, is optimal.
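
As a short illustration (a sketch, not from the notes; it assumes Python and arbitrarily chosen scalar Gaussian models, priors, and costs), the snippet below forms the likelihood ratio p1(x)/p0(x) and compares it with the threshold γ = π0(c_{1,0} − c_{0,0})/(π1(c_{0,1} − c_{1,1})).

    import numpy as np
    from math import sqrt, pi, exp

    # Hypothetical scalar Gaussian models for H0 and H1.
    def p0(x): return exp(-0.5 * x**2) / sqrt(2 * pi)          # N(0,1)
    def p1(x): return exp(-0.5 * (x - 2.0)**2) / sqrt(2 * pi)  # N(2,1)

    pi0, pi1 = 0.7, 0.3                                        # assumed prior probabilities
    c = {(0, 0): 0.0, (1, 1): 0.0, (1, 0): 1.0, (0, 1): 5.0}   # assumed costs c[(i, j)]

    gamma = pi0 * (c[(1, 0)] - c[(0, 0)]) / (pi1 * (c[(0, 1)] - c[(1, 1)]))

    def lrt(x):
        # Decide H1 (return 1) if the likelihood ratio exceeds the threshold gamma.
        return 1 if p1(x) / p0(x) > gamma else 0

    print(gamma, lrt(0.4), lrt(1.5))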

3.3. Exercises

1. Consider a binary classification problem with class-conditional densities p(x|y = j) given by MVNs
N (µj , Σ), j = 0, 1.
(a) Show that the log likelihood ratio test reduces to the form

    (µ1 − µ0)^T Σ^{-1} x  ≷_{H_0}^{H_1}  γ

    where γ is a threshold.

(b) Let t(x) = (µ1 − µ0 )T Σ−1 x denote the scalar test statistic. What is the distribution of t(x) when x
is drawn from N (µj , Σ) (i.e., when it is from class j)?
(c) Let Q(z) = ∫_z^∞ (1/√(2π)) e^{−t²/2} dt, the probability that a N(0, 1) random variable exceeds the value z.
Show that the probability of mistakenly predicting a label of 1 when the true label is 0 is given by

    Q( [ γ − (µ1 − µ0)^T Σ^{-1} µ0 ] / [ (µ1 − µ0)^T Σ^{-1} (µ1 − µ0) ]^{1/2} )

(d) Let A ∈ R^{k×d}, with k < d. Then x̃ = Ax ∈ R^k. Let us view x̃ as a means of reducing the dimensionality of x. What are the class-conditional distributions of x̃? If we base our classification on x̃ rather than x, will the optimal classifier perform better or worse?
(e) Prove that the covariance matrix Σ is positive semidefinite; i.e., v^T Σ v ≥ 0 for every v ∈ R^d.
(f) Let Σ = ∑_{i=1}^d λ_i u_i u_i^T be the eigendecomposition. Since Σ is symmetric and positive semidefinite (by construction) the eigenvalues λ_i ≥ 0, and u_i is the eigenvector associated with eigenvalue λ_i; i.e., Σ u_i = λ_i u_i. Express the inverse Σ^{-1} in terms of the eigenvalues and eigenvectors.
(g) Use the results above to design a linear dimensionality reduction from d to k that introduces the least distortion to the Mahalanobis distance, the distance appearing in the MVN model.
(h) Suppose that

    Σ = [ 1    0.9  0.9
          0.9  1    0.9
          0.9  0.9  1   ] .

Find the eigendecomposition and the best reduction to k = 1 dimension.

2. The Bayes optimal classifier minimizes the total probability of error at the point x. Recall there are two
types of errors, false positives and false negatives. The Bayes classifier f* treats both as equally costly (a loss of 1 for
both types of error).

(a) Suppose that we assign different costs to the two types of error. Let c01 be the cost/loss of predicting
yb = 0 when the true label is y = 1 and let c10 be the cost/loss of the other type of error. Suppose that
we do not know the prior probabilities, so we simply guess that they are equal. Derive an expression
for the optimal classifier in this case.
(b) Now suppose that we additionally know the probabilities of the two classes; i.e., we know that P(Y =
1) = p and P(Y = 0) = 1 − p. Show that the f ∗ above does not minimize the average probability of
error (expectation over X) if p ̸= 1/2.
(c) Derive a new classifier that is optimal for given costs c01 and c10 and prior probability p.

3. Consider a binary classification problem where the features x ∈ {−1, +1}d and the label y ∈ {−1, 1}.

(a) Let x = [x1 , · · · , xd ]T and suppose that y = x1 and x2 , . . . , xd are independent of x1 and y and are
i.i.d. binary random variables with P(xi = +1) = P(xi = −1) = 1/2 for i = 2, . . . , d. What is the
Bayes optimal classifier and Bayes Risk?
(b) Now consider the nearest-neighbor classifier based on a labeled training set of size n = 2 consisting
of one randomly chosen example from each class. Find a lower bound on the probability of error of
the nearest-neighbor classifier for d = 2, 3. (If both training points are equally close to the test point,
then randomly pick either label with equal probability). HINT: Consider the test point a fixed binary
vector and consider the randomness in the two training points.

(c) Use the binomial distribution to derive an expression for the error bound for general values of d. How
does the computational complexity of this calculation grow as a function of d?
(d) Conjecture about the limiting error rate as d → ∞.

Lecture 4: Learning MVN Classifiers

Suppose we are given training data {xi , yi }ni=1 . We can fit MVN models to these data (separating the data accord-
ing to their labels), and then derive a classifier based on the fitted models.

A key question is how to fit models to the data. There are many ways to do this, but one natural approach is to
match the empirical (data-based) moments to the moments of the MVN model. The MVN model is defined by
its first two moments, the mean and covariance. So we can just set the mean and covariance of the model to the
empirical mean and covariance.

Consider the subset of the training data with the label j, that is {xi }i:yi =j , where j is one of the possible labels.
Let nj denote the number of examples in this set. The empirical mean and covariance of these data are computed
as follows:
µ̂_j = (1/n_j) ∑_{i: y_i = j} x_i

Σ̂_j = (1/n_j) ∑_{i: y_i = j} (x_i − µ̂_j)(x_i − µ̂_j)^T

Then we “plug in” these estimates to obtain an MVN model for data in class j; i.e., p(x|y = j) is the MVN density N(µ̂_j, Σ̂_j).

If we assume the two class-conditional distributions have the same covariance, then we can estimate the covariance
as follows.
Σ̂ = (1/n) ∑_{i=1}^n (x_i − µ̂_{y_i})(x_i − µ̂_{y_i})^T

As discussed above, the optimal classifier is linear in this case. This type of linear classifier is referred to as
Fisher’s linear discriminant [16].
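
A minimal sketch of this plug-in procedure (not part of the original notes; it assumes Python with NumPy and a synthetic training set, and the variable names are ours):

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic training data: two Gaussian classes with a shared covariance.
    n_per_class, d = 100, 2
    X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n_per_class, d))
    X1 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(n_per_class, d))
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(n_per_class, dtype=int), np.ones(n_per_class, dtype=int)])

    # Empirical (plug-in) class means and pooled covariance.
    mu_hat = [X[y == j].mean(axis=0) for j in (0, 1)]
    centered = X - np.array([mu_hat[label] for label in y])
    Sigma_hat = centered.T @ centered / len(y)

    # Fisher's linear discriminant: predict class 1 when w^T x + b >= 0.
    Sigma_inv = np.linalg.inv(Sigma_hat)
    w = 2.0 * Sigma_inv @ (mu_hat[1] - mu_hat[0])
    b = mu_hat[0] @ Sigma_inv @ mu_hat[0] - mu_hat[1] @ Sigma_inv @ mu_hat[1]

    train_err = np.mean((X @ w + b >= 0).astype(int) != y)
    print("training error:", train_err)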

4.1. Analysis of the “Plug-in” MVN Classifier

Consider the following simplified case. Suppose we are given n i.i.d. training examples

x|y = +1 ∼ N (θ, I) and x|y = −1 ∼ N (−θ, I) .

Assume that θ ∈ Rd and that P(y = +1) = P(y = −1). In other words, the two classes are equally probable and
have means θ and −θ and the covariance matrices are both known to be I, the identity matrix. The likelihood
ratio classifier (Bayes classifier) minimizes the probability of error. The log likelihood ratio in this case is

log [ p(x|y = +1) / p(x|y = −1) ] = (1/2)(x + θ)^T(x + θ) − (1/2)(x − θ)^T(x − θ) = 2 x^T θ

So the optimal classification rule is

f*(x) = +1 if x^T θ > 0,   and −1 if x^T θ < 0.

This achieves the minimum probability of error, which is given by

P(f ∗ (x) ̸= y) = P(xT θ > 0|y = −1)P(y = −1) + P(xT θ < 0|y = +1)P(y = +1)
= P(xT θ > 0|y = −1) ,

where the last equality follows since the two types of error are equal because of the symmetry of the problem. Note
that xT θ|y = −1 ∼ N (−∥θ∥2 , ∥θ∥2 ). So the probability of error is equal to the probability that a N (0, ∥θ∥2 )
random variable exceeds ∥θ∥². This can be bounded by Markov’s inequality: if z ∼ N(0, ∥θ∥²), then P(z > ∥θ∥²) ≤ E[z²]/∥θ∥⁴ = 1/∥θ∥². So we have shown that P(f*(x) ≠ y) ≤ 1/∥θ∥², which makes sense since the error should decrease as the distance between the means increases (note that the distance between the means is 2∥θ∥).

Now let’s consider a learning set-up in which we don’t know the value of θ, but we do have a training set
{(xi , yi )}ni=1 . Note that xi |yi ∼ N (yi θ, I) and so yi xi |yi ∼ N (yi2 θ, I) = N (θ, I), and therefore yi xi ∼
N (θ, I). Thus, a natural estimator of θ is
θ̂ = (1/n) ∑_{i=1}^n y_i x_i .

Now we will plug this estimate into the form of the Bayes classifier to obtain the classification rule

f̂(x) = +1 if x^T θ̂ > 0,   and −1 if x^T θ̂ < 0.

Due to the symmetry of the problem, the probability of error of this classifier is

P(fb(x) ̸= y) = P(xT θb > 0|y = −1) ,

The test statistic xT θb doesn’t have as simple a distribution as the optimal statistic xT θ since both x and θb are
random. It is important, however, that these are independent of each other, since we assume that the new “test"
point x is independent of the training data. Specifically, x ∼ N (−θ, I) and θb ∼ N (θ, I/n). Equivalently,
x = −θ + e1 and θb = θ + e2 , where e1 ∼ N (0, I) and e2 ∼ N (0, I/n) . So we can expand the statistic as
follows

x^T θ̂ = (−θ + e1)^T (θ + e2)
      = −∥θ∥² + (e1 − e2)^T θ + e1^T e2

With this we can write the probability of error as follows:

P(f̂(x) ≠ y) = P( −∥θ∥² + (e1 − e2)^T θ + e1^T e2 > 0 )
            = P( (e1 − e2)^T θ + e1^T e2 > ∥θ∥² ) .

Now we can apply Markov’s inequality to get the bound

P(f̂(x) ≠ y) ≤ E[ ( (e1 − e2)^T θ + e1^T e2 )² ] / ∥θ∥⁴ .

The expectation can be easily computed as follows.

E[ ( (e1 − e2)^T θ + e1^T e2 )² ] = E[ ( (e1 − e2)^T θ + e1^T e2 )^T ( (e1 − e2)^T θ + e1^T e2 ) ]
  = E[ θ^T (e1 − e2)(e1 − e2)^T θ + e2^T e1 (e1 − e2)^T θ + θ^T (e1 − e2) e1^T e2 + e2^T e1 e1^T e2 ] .

Now if we first take the expectation with respect to e1 we get

E[ ( (e1 − e2)^T θ + e1^T e2 )² | e2 ] = ∥θ∥² + θ^T e2 e2^T θ + e2^T θ + θ^T e2 + e2^T e2 ,

since all cross terms vanish, i.e., E[e1^T e2 | e2] = 0 because e1 and e2 are independent and E[e1] = 0. Taking the expectation with respect to e2 yields

E[ ( (e1 − e2)^T θ + e1^T e2 )² ] = ∥θ∥² + (1/n)∥θ∥² + d/n .

Thus, we have the bound

P(f̂(x) ≠ y) ≤ ( (1 + 1/n)∥θ∥² + d/n ) / ∥θ∥⁴ .
Notice that if n ≫ d, then the bound is essentially equal to the one we obtained for the Bayes classifier.
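
The comparison above can be checked by simulation. The sketch below (ours, not from the notes; it assumes Python with NumPy, and θ, d, and n are arbitrary choices) estimates the error of f̂ by Monte Carlo and compares it with the error of the Bayes rule x^T θ > 0.

    import numpy as np

    rng = np.random.default_rng(1)
    d, n, n_test, n_trials = 10, 50, 2000, 200
    theta = np.ones(d) / np.sqrt(d) * 1.5     # arbitrary theta with ||theta|| = 1.5

    err_plugin, err_bayes = [], []
    for _ in range(n_trials):
        # Training set: y_i in {-1,+1} equally likely, x_i ~ N(y_i * theta, I).
        y_tr = rng.choice([-1, 1], size=n)
        x_tr = y_tr[:, None] * theta + rng.standard_normal((n, d))
        theta_hat = (y_tr[:, None] * x_tr).mean(axis=0)   # the plug-in estimate of theta

        # Independent test set.
        y_te = rng.choice([-1, 1], size=n_test)
        x_te = y_te[:, None] * theta + rng.standard_normal((n_test, d))
        err_plugin.append(np.mean(np.sign(x_te @ theta_hat) != y_te))
        err_bayes.append(np.mean(np.sign(x_te @ theta) != y_te))

    print("plug-in error :", np.mean(err_plugin))
    print("Bayes error   :", np.mean(err_bayes))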

4.2. Comparison of MVN Plug-in Classifier and Histogram Classifier

Recall that Theorem 3 shows that the histogram classifier is able to achieve performance of the optimal classifier
in any situation (arbitrary distributions) given enough training data. The MVN plug-in classifier is also able to
achieve the performance of the optimal classifier, but only under strong additional assumptions on the data distribution. So
why shouldn’t we prefer the histogram classifier, since it requires no assumptions on the data? If the class-
conditional densities are MVN with (known) covariances, then the MVN plug-in classifier’s error bound comes
close to matching that of the optimal classifier when n > d (i.e., the number of training data exceeds the dimension
of the feature space). The histogram classifier would require far more data in the same situation. Minimally, the
histogram partition would split each coordinate dimension in two. This means the histogram would have at least
2d bins. The histogram classifier also requires at least one example in each bin, leading to the requirement that
n > 2d . In other words, the training set size must grow exponentially with dimension. This is the curse of
dimensionality. Because of the curse, histogram classifiers are usually only used in low-dimensional problems.
For high-dimensional problems, approaches based on strong modeling assumptions (even though they may not
reflect the true distributions) can yield much better results.

4.3. Exercises

1. Are the empirical mean µ̂ and covariance Σ̂ unbiased estimators? If not, are they asymptotically unbiased?
Derive explicit formulas for the expectations.
2. The analysis above assumes MVN class-conditional distributions. Show that the error bound for fb applies
for any class-conditional distributions with the same means and covariance.
3. Derive another error bound for the case where the two means are different, say θ+1 ̸= θ−1 .
4. Generalize things further by assuming the class-conditional distributions have the same covariance, but that
it is a general covariance matrix Σ instead of the identity matrix I.
5. In 1936 Ronald Fisher published a famous paper on classification titled “The use of multiple measurements in taxonomic problems.” In the paper, Fisher studied the problem of classifying iris flowers based on measurements of the sepal and petal widths and lengths.
Fisher’s dataset is available in Matlab (fisheriris.mat) and is widely available on the web (e.g.,
Wikipedia). The dataset consists of 50 examples of three types of iris flowers. The sepal and petal measure-
ments can be used to classify the examples into the three types of flowers. Approach the iris classification
problem using a generative modeling approach. Assume that each class-conditional density is MVN.

(a) Fit the MVN models to the training data using estimates of means and covariances. Design a classifier
based on these fitted models.
(b) Constrain the MVN models so that the decision boundaries between pairs of classes are linear.

Lecture 5: Likelihood and Kullback-Leibler Divergence

5.1. Introduction

Consider a binary classification problem. For the special case of Gaussian class-conditional densities with equal
covariances (i.e., Σ0 = Σ1 = Σ) and equal prior probabilities (i.e., P(y = 1) = P(y = 0)), the log-likelihood
ratio simplifies to

ŷ(x) = 1  if  2(µ1 − µ0)^T Σ^{-1} x ≥ µ1^T Σ^{-1} µ1 − µ0^T Σ^{-1} µ0,   and 0 otherwise.

If we let w = 2Σ^{-1}(µ1 − µ0) and b = µ0^T Σ^{-1} µ0 − µ1^T Σ^{-1} µ1, then we see that this is just a linear classifier: if w^T x + b ≥ 0, then the predicted label is ŷ = 1. To get a sense of the difficulty of this classification problem, let us consider the expected value of the test statistic w^T x + b. Consider a random example from the class 1 distribution:
x ∼ N (µ1 , Σ). Then we have

E_{x∼p(x|y=1)}[w^T x + b] = 2(µ1 − µ0)^T Σ^{-1} µ1 + µ0^T Σ^{-1} µ0 − µ1^T Σ^{-1} µ1
                          = µ1^T Σ^{-1} µ1 − 2 µ0^T Σ^{-1} µ1 + µ0^T Σ^{-1} µ0
                          = (µ1 − µ0)^T Σ^{-1} (µ1 − µ0) ,

the squared Mahalanobis distance between the means. Similarly,

E_{x∼p(x|y=0)}[w^T x + b] = −(µ1 − µ0)^T Σ^{-1} (µ1 − µ0) .

So for any feature x we can write the test statistic as

wT x + b = ± (µ1 − µ0 )T Σ−1 (µ1 − µ0 ) + ζ ,

where the sign on the first term depends on whether x comes from class 1 or class 0 and ζ is a zero-mean random variable (see Footnote 5). This suggests that larger values of (µ1 − µ0)^T Σ^{-1} (µ1 − µ0) make it easier to classify new examples. In other words, the Mahalanobis distance between the means is a natural measure of the classification difficulty (i.e., a larger distance implies easier classification). In this lecture, we generalize this idea to any class-conditional distributions.
distributions.

5.2. Optimal Classifiers and Likelihood Functions

Suppose we have feature/label pairs (x, y) ∼ P . In this note, let us assume a binary classification setting with
y ∈ {0, 1}. If P is known, then the optimal classifier (i.e., the classifier that minimizes the probability of error) is
easily derived. Denote the joint probability function by p(x, y). Recall that conditional probability of y given x
is given by
p(x, y)
p(y|x) = .
p(x)
5
In fact, since the test statistic is linear and x is MVN, the test statistic is also Gaussian and it is easy to check that ζ ∼ N (0, 4(µ1 −
µ0 )T Σ−1 (µ1 − µ0 )).

In particular, the probability that y = 1 given x is p(1|x). To clarify what this refers to, we sometimes write it
more explicitly as p(y = 1|x). This is a probability (i.e., a number in [0, 1]) that depends on the value of x. If
p(y = 1|x) ≥ 1/2, then the optimal classifier predicts the label to be 1, and otherwise it predicts 0. Since this
conditional probability plays a crucial role in the optimal classification we give it a special notation:

η(x) := p(y = 1|x) .

The optimal classifier can be expressed as



f*(x) = 1 if η(x) ≥ 1/2,   and 0 if η(x) < 1/2.

Equivalently, since p(y = 0|x) = 1 − p(y = 1|x) we have

f*(x) = 1 if η(x)/(1 − η(x)) ≥ 1,   and 0 otherwise.

We can express the optimal classifier in a different way, in terms of the class-conditional probability functions (or
the conditional probabilities of x given y). Note that

p(y|x) = p(x, y)/p(x) = p(x|y) p(y) / p(x) .

So η(x) = p(x|y=1) p(y=1) / p(x) and p(y = 0|x) = p(x|y=0) p(y=0) / p(x), and thus the optimal classifier is

f*(x) = 1 if p(x|y=1) p(y=1) / [ p(x|y=0) p(y=0) ] ≥ 1,   and 0 otherwise,

which we recognize as the likelihood ratio test (notice that the common denominator p(x) cancels in the ratio).
The value of p(x|y = j) is called the likelihood that y = j given x, and p(x|y=1)/p(x|y=0) is called the
likelihood ratio.
5.3. Kullback-Leibler Divergence: Intrinsic Difficulty of Classification

Let pj (x) = p(x|y = j) for j = 0, 1 and denote the log likelihood ratio by

Λ(x) = log [ p1(x) / p0(x) ] .

Consider a random test point x ∼ q, where q may be p0, p1, or some other distribution. The log likelihood ratio classifier (assuming equal priors) is

Λ(x) ≷_{H_0}^{H_1} 0 .

The resulting log likelihood ratio Λ(x) is a random variable, and we can decompose it into its deterministic and
stochastic components:
Λ(x) = E[Λ(x)] + (Λ(x) − E[Λ(x)]) .

The first term is a deterministic number and the second term is a zero-mean random variable. In other words,
Λ(x) has a distribution with mean E[Λ(x)]. If the mean is far above or below the threshold 0, then we expect the
classifier to perform well, and conversely if the mean is close to zero it will perform poorly. Note that
E[Λ(x)] = ∫ q(x) log [ p1(x)/p0(x) ] dx

        = ∫ q(x) log [ q(x)/p0(x) ] dx − ∫ q(x) log [ q(x)/p1(x) ] dx

These integrals are the Kullback-Leibler (KL) divergences of p0 and p1 from q, denoted by D(q∥p0) and D(q∥p1). If q = p1, then the mean is E[Λ(x)] = D(p1 ∥ p0) ≥ 0, since the KL divergence is non-negative. If q = p0, then E[Λ(x)] = −D(p0 ∥ p1) ≤ 0. The larger the divergences, the easier the classification problem.

To illustrate further, consider an example x chosen randomly from class 1, that is x ∼ p1 (x) = p(x|y = 1). The
log likelihood ratio is then
Λ(x) = D(p1 ∥ p0 ) + ζ(x) ,
where the KL divergence D(p1 ∥ p0 ) ≥ 0 and ζ(x) := Λ(x) − D(p1 ∥ p0 ) is a zero-mean random variable.

5.3.1. Gaussian Class-Conditional Distributions

If x|y = j ∼ N(µj, Σ) for j = 0, 1 (note the common covariance), then

D(p0 ∥ p1) = D(p1 ∥ p0) = (1/2) (µ1 − µ0)^T Σ^{-1} (µ1 − µ0) ,

which is proportional to the squared Mahalanobis distance between the means. To gain some insight, consider the scalar case: x|y = j ∼ N(µj, σ²) for j = 0, 1, with µ1 = −µ0 = µ > 0. In this case, D(p0 ∥ p1) = D(p1 ∥ p0) = 2µ²/σ². The log likelihood ratio is proportional to 2µx; writing Λ(x) = 2µx and considering a test point x ∼ p1, we have Λ(x) ∼ N(2µ², 4µ²σ²). Note that the KL divergence is proportional to the signal-to-noise ratio, i.e., the ratio of the squared mean of x to its variance.
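
A short numerical check of the common-covariance formula above (a sketch, not from the notes; it assumes Python with NumPy, and the means and covariance are made up):

    import numpy as np

    def kl_gauss_common_cov(mu1, mu0, Sigma):
        """KL divergence D(p1 || p0) between N(mu1, Sigma) and N(mu0, Sigma)."""
        diff = mu1 - mu0
        return 0.5 * diff @ np.linalg.solve(Sigma, diff)

    mu0 = np.array([0.0, 0.0])
    mu1 = np.array([1.0, 2.0])
    Sigma = np.array([[2.0, 0.3],
                      [0.3, 1.0]])

    # Half the squared Mahalanobis distance between the means.
    print(kl_gauss_common_cov(mu1, mu0, Sigma))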

5.3.2. Separable Classes

The classes are separable if and only if the class-conditional densities do not overlap (i.e., the supports are disjoint
subsets of the feature space). Consider the KL divergences:
D(p1 ∥ p0) = ∫ p1(x) log [ p1(x)/p0(x) ] dx

D(p0 ∥ p1) = ∫ p0(x) log [ p0(x)/p1(x) ] dx
Consider a point x in the feature space where p1 (x) > 0 and p0 (x) = 0. The value of the integrand in D(p1 ∥ p0 )
is infinite at that point, and therefore so is D(p1 ∥ p0 ). This agrees with common sense: if the class-conditional
distributions are non-overlapping, then perfect (error-free) classification is possible. However, the optimal classi-
fication rule may be nonlinear and can be very difficult to learn from training data, so KL divergence does not tell
the whole story. The margin between separable classes (the minimum distance between the supports) is a more meaningful indicator of the difficulty of learning a classifier. Also notice that the KL divergence may be infinite even
if the supports of the two class-conditional densities have a large overlap (i.e., even if the two classes are difficult
to distinguish); if one has non-zero probability mass outside the support of the other, then the KL divergence will
be infinite. All this shows that the KL divergence may not be an appropriate measure of classification difficulty if
the two class-conditional distributions do not have the same support.

5.3.3. Statistical Hypothesis Testing

The KL divergence between two distributions p1 and p0 may be infinite even if they partially overlap. This
happens whenever the integrand p1(x) log [ p1(x)/p0(x) ] is infinite on a non-trivial subset of the feature space. Clearly in
such cases the classes are non-separable and perfect classification from a single sample is impossible. However,
the infinite value of the KL divergence still tells us something important. Suppose we are given a set of i.i.d.
examples {xi }ni=1 , drawn from either p1 or p0 . The statistical testing problem is to decide which distribution all
the examples come from. In this case, the log-likelihood ratio is
Λ(x1, . . . , xn) = ∑_{i=1}^n log [ p1(x_i)/p0(x_i) ] ,

which is proportional to the average (1/n) ∑_{i=1}^n log [ p1(x_i)/p0(x_i) ]. The strong law of large numbers implies that if the examples come from p1, then

(1/n) ∑_{i=1}^n log [ p1(x_i)/p0(x_i) ]  →  D(p1 ∥ p0) ,   as n → ∞ .
So as n grows it is possible to perfectly decide which distribution generated the “batch” of examples. More
generally, the size of D(p1 ∥ p0 ) and D(p0 ∥ p1 ) determines how many examples are sufficient to confidently
decide which distribution they came from.
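
This law-of-large-numbers behavior is easy to see in simulation. The sketch below (ours, not from the notes; it assumes Python with NumPy and two overlapping scalar Gaussians) averages the log likelihood ratio over a growing batch drawn from p1 and compares it with D(p1 ∥ p0).

    import numpy as np

    rng = np.random.default_rng(2)
    mu0, mu1, sigma = -1.0, 1.0, 1.0     # hypothetical scalar Gaussian classes

    def log_lr(x):
        # log p1(x) - log p0(x) for equal-variance Gaussians.
        return ((x - mu0)**2 - (x - mu1)**2) / (2 * sigma**2)

    kl_p1_p0 = (mu1 - mu0)**2 / (2 * sigma**2)   # D(p1 || p0) for this model
    for n in (10, 100, 1000, 100000):
        x = rng.normal(mu1, sigma, size=n)       # batch drawn from p1
        print(n, log_lr(x).mean(), "->", kl_p1_p0)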

5.3.4. Nonnegativity of KL Divergence

The key property in question is that D(q||p) ≥ 0, with equality if and only if q = p. To prove this, we will use
Jensen’s Inequality. Recall, a function is convex if ∀ λ ∈ [0, 1]

f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y)

The left hand side of this inequality is the function value at some point between x and y, and the right hand side
is the value of a straight line connecting the points (x, f (x)) and (y, f (y)). In other words, for a convex function
the function value between two points is always lower than the straight line between those points.

Jensen’s Inequality: If a function f (x) is convex, then

E[f (x)] ≥ f (E[x])

Now if we rearrange the KL divergence formula,

D(q∥p) = ∫ q(x) log [ q(x)/p(x) ] dx

       = E_q[ log ( q(x)/p(x) ) ]

       = −E_q[ log ( p(x)/q(x) ) ]

we can use Jensen’s inequality, since −log z is a convex function:

−E_q[ log ( p(x)/q(x) ) ] ≥ −log ( E_q[ p(x)/q(x) ] )

                          = −log ( ∫ q(x) [ p(x)/q(x) ] dx )

                          = −log ( ∫ p(x) dx )

                          = −log(1)

                          = 0 .

Therefore D(q||p) ≥ 0.

5.3.5. Mutual Information

KL divergence also plays a key role in quantifying the information that one random variable conveys about an-
other. In the context of machine learning, one may be interested in quantifying how informative a feature x is
for predicting a label y. If the feature and label are statistically independent, then the feature is useless. With
this in mind, consider D(p(x, y) ∥ p(x)p(y)) , the KL divergence between the joint distribution p(x, y) and the
distribution p(x)p(y). If x and y are independent, then p(x, y) = p(x)p(y) and the KL divergence is 0. If the
feature x is highly predictive of the label, then the joint distribution is very different than the factorized form and
the KL divergence is large. This particular KL divergence has a special name: the mutual information between x
and y. This quantify is often used for feature engineering and selection. Consult [10] and [4, Chapter 1] for more
on information-theoretic concepts.

5.4. Exercises

1. Consider the following two-dimensional class-conditional densities:

p(x|y = +1) = a1{x∈[0,1]×[0,1]} + 4(1 − a)1{x∈[0,0.5]×[0,0.5]}


p(x|y = −1) = a1{x∈[0,1]×[0,1]} + 4(1 − a)1{x∈[0.5,1]×[0.5,1]} ,

where a ∈ [0, 1].

(a) Assuming equal prior probabilities for the two classes, express the optimal classification rule and error
rate in terms of a.
(b) Express the KL divergences

D(p(x|y = +1) ∥ p(x|y = −1)) and D(p(x|y = −1) ∥ p(x|y = +1))

as functions of a.
(c) Now suppose that you do not know the class-conditional densities, but you are given a set of training data {(x_i, y_i)}_{i=1}^n iid∼ p(x, y) and the features are known to lie within the unit square. How would you
design a classifier using these data? Can you bound its probability of error and compare it to the
optimal error rate?

2. Consider a binary classification problem, and assume the class conditional densities have the form x|y =
0 ∼ N (µ0 , I) and x|y = 1 ∼ N (µ1 , I) and equal prior probabilities. Suppose you estimate the means
from a training dataset, denoted µ b 0 and µ
b 1 . Moreover, assume that you know that ∥µj − µ b j ∥2 ≤ ϵ for
j = 0, 1 and some ϵ > 0. Plug these estimates into the log likelihood ratio. The resulting classifier may be
suboptimal due to the errors in the estimates. Quantify the suboptimality by computing the KL divergences
of this classifier and comparing them to the KL divergences of the optimal classifier (based on the true
means).

3. Consider a binary classification problem with equal prior probabilities and class-conditional densities x|y =
0 ∼ uniform[0, 1] and x|y = 1 ∼ uniform[t, 1 + t] for some t ∈ R.

(a) Compute the mutual information between the feature x and the label y as a function of t. Interpret the
result for t = 0, |t| ≥ 1, and t ∈ (0, 1).
(b) Compute the minimum probability of misclassification as a function of t.

Lecture 6: Maximum Likelihood Estimation

Maximum likelihood estimation is one of the most popular approaches in statistical inference and machine learn-
ing. It is usually viewed as a methodology for estimating the parameters of a probabilistic model, but in fact the
core principle of maximum likelihood is density estimation.

6.1. The Maximum Likelihood Estimator

Consider a family of probability distributions indexed by a parameter θ. The parameter may be a scalar or mul-
tidimensional. For example, the family of multivariate Gaussian distributions N (µ, Σ) is indexed by the mean
vector µ and covariance matrix Σ, so in this case θ = (µ, Σ). In general, we will denote a family of density
functions by p(x|θ), θ ∈ Θ, where Θ denotes the set of all possible values the parameter can take. Given data x,
the Maximum Likelihood Estimate (MLE) is

θ̂ = arg max_{θ ∈ Θ} p(x|θ)

where p(x|θ) as a function of x with the parameter θ fixed is the probability density function or mass function.
Viewing p(x|θ) as a function of θ with the data x fixed is called the “likelihood function.” Sometimes we will denote
p(x|θ) by p_θ(x).

This is a generalization of the hypothesis testing problem considered in the previous lecture. Suppose that θ ∈ {0, 1}, i.e., θ is a binary-valued parameter which could indicate which of two classes, y = 0 or y = 1, generated the data x. In this case, maximum likelihood estimation of θ is equivalent to the log-likelihood ratio test. Denote the log-likelihood ratio by Λ(x) = log [ p1(x)/p0(x) ]. Then the Maximum Likelihood Estimator (MLE) of θ is

θ̂ = 1 if Λ(x) > 0,   and 0 if Λ(x) < 0.

6.2. ML Estimation and Density Estimation

ML Estimation is equivalent to density estimation. Assume

x_i iid∼ q,   i = 1, . . . , n,   where q is an unknown probability density,

and suppose we wish to model these data using a parametrized family of distributions {p_θ}_{θ∈Θ}. ML Estimation is equivalent to finding the density in {p_θ}_{θ∈Θ} that best fits the data, i.e., the generative model with the highest density/probability value on the points x = {x_i}. The true distribution of the data points is ∏_{i=1}^n q(x_i), because the x_i are i.i.d. The likelihood function for θ is ∏_{i=1}^n p_θ(x_i). The true generating density q may not be a member of the parametric family under consideration. The MLE is the solution to

max_{θ∈Θ} ∏_{i=1}^n p_θ(x_i)   or equivalently   max_{θ∈Θ} ∑_{i=1}^n log p_θ(x_i) .

The argument of the latter optimization is called the log-likelihood function of θ. The logarithm is often applied
because it simplifies the optimization in cases where the probability models involve exponential functions (e.g.,

the Gaussian). We can also express the MLE as the solution of the minimization

min_{θ∈Θ} − ∑_{i=1}^n log p_θ(x_i) .

6.3. Examples of MLE

Example 2. Structured Mean. Let B ∈ Rn×k be a known matrix and assume the n × n covariance matrix Σ is
known. Consider the model
p(x|θ) = 1/( (2π)^{n/2} |Σ|^{1/2} ) exp{ −(1/2)(x − Bθ)^T Σ^{-1} (x − Bθ) } ,   x ∈ R^n, θ ∈ R^k

The MLE θ̂ is given by

θ̂ = arg min_θ − log p(x|θ)
  = arg min_θ (x − Bθ)^T Σ^{-1} (x − Bθ)
  = (B^T Σ^{-1} B)^{-1} B^T Σ^{-1} x

Example 3. Poisson mean estimation. Let x_i iid∼ Poisson(λ), i = 1, . . . , n. Then the negative log-likelihood is

− ∑_{i=1}^n log ( e^{−λ} λ^{x_i} / x_i! ) = ∑_{i=1}^n ( λ − x_i log(λ) + log(x_i!) )

The derivative of each term with respect to λ is 1 − x_i/λ, so if we set the derivative equal to zero we find the minimizer is λ̂ = (1/n) ∑_{i=1}^n x_i.

Example 4. Linear Regression. Suppose y_i = w^T x_i + ϵ_i, for i = 1, . . . , n. Here {x_i, y_i}_{i=1}^n are observed, with x_i ∈ R^d, y_i ∈ R, and the ϵ_i ∼ N(0, σ²) are unobserved noises.

p(y|X, w) ∝ exp( −(1/(2σ²)) ∑_{i=1}^n (y_i − w^T x_i)² ) = exp( −(1/(2σ²)) ∥y − Xw∥² )

where y is a column vector of the {y_i} and X is an n × d matrix whose ith row is x_i.

The maximum likelihood estimate ŵ is given by

ŵ = arg min_w − log p(y|X, w)
  = arg min_w ∥y − Xw∥²
  = (X^T X)^{-1} X^T y ,   assuming X has full rank.

6.4. MLE and KL Divergence

We can view the negative log-likelihood function as sum of “loss" functions of the form
ℓi (q, pθ ) := − log pθ (xi )

which measure the loss incurred when using pθ to model the distribution of xi , which is actually distributed
according to q. So we may view this as an empirical measure of how well pθ fits q at the point xi . The notation
here makes the dependence of the loss on both q and the choice of pθ explicit, but it is common to simply write
ℓi (θ), since we view the loss as a function of the parameter θ. The expected value of a loss is called the “risk".
The risk associated with the negative log-likelihood loss function is

R(q, p_θ) = E[ℓ_i(q, p_θ)]
          = E_{x_i∼q}[ −log p_θ(x_i) ]
          = −∫ q(x) log p_θ(x) dx

It is also common just to write R(θ) to denote the risk, but we will be more explicit here. We can compare the
value of the risk of pθ with the that of the true distribution q. The excess risk

R(q, pθ ) − R(q, q)

quantifies how much larger the risk is when we use pθ instead of q. Note that

R(q, p_θ) − R(q, q) = E[ log q(x) − log p_θ(x) ]

                    = E[ log ( q(x)/p_θ(x) ) ]

                    = ∫ q(x) log [ q(x)/p_θ(x) ] dx

                    =: D(q ∥ p_θ)

                    ≥ 0
with equality if pθ = q. Recall that D(q ∥ pθ ) is the Kullback-Leibler divergence of pθ from q. This shows that q
minimizes the risk. We can consider

θ* = arg min_θ D(q∥p_θ)

to be the optimal value of θ. The density p_{θ*} is the member of the parametric class that is closest in KL divergence to the data-generating distribution q. If we have multiple iid observations x_i iid∼ q, i = 1, . . . , n, then the MLE is

θ̂_n = arg min_θ − ∑_{i=1}^n log p_θ(x_i)

By the strong law of large numbers, for any θ ∈ Θ

(1/n) ∑_{i=1}^n log [ q(x_i)/p_θ(x_i) ]  →a.s.  D(q∥p_θ) .

So intuitively θbn → θ∗ as n → ∞. To conclude, the analysis above shows that ML Estimation is essentially trying
to find a parametric probability model pθ that best fits the true data distribution in the sense of KL divergence.

6.5. Exercises

1. Let x1, . . . , xn iid∼ Uniform(0, θ).

a. Find the MLE θbn .
b. Give a mathematical expression for the exact distribution of θbn .
c. Show that the MLE is consistent in MSE; i.e., E[|θbn − θ|2 ] → 0 as n → ∞. Hint: Use the result in part
(b) to compute the mean and variance of θbn .

2. Suppose we are monitoring credit card payments for a population of N people, and model the number of
days (τ ) that will pass before each person defaults on their payments as independent, identically exponen-
tially distributed random variables with parameter θ > 0. That is, for i = 1, . . . , N , τi ∼ exp(θ), so that
p(τi |θ) = θe−τi θ . Find the MLE of θ.

3. Robust estimation. There are different approaches to finding the “center” of a dataset. Suppose x1 , . . . , xn
are scalars. The sample mean is one statistic that summarizes the central location of the data. Of course, if
one assumes the data are i.i.d. realizations of a N (θ, σ 2 ) random variable, then the sample mean is the MLE
of θ. The sample mean is the minimizer of
e2(θ) = ∑_{i=1}^n (x_i − θ)²

Another alternative is to minimize the sum of absolute errors

e1(θ) = ∑_{i=1}^n |x_i − θ|

The minimizer of e1 has some nice robustness properties, which you will investigate in this problem. It can
also be viewed as the MLE of θ if the data are assumed to be i.i.d. realizations from a Laplacian distribution
of the form p(x; θ) = (1/(2b)) e^{−|x−θ|/b}, where b > 0.

a. Find closed-form expressions for θbi = arg minθ ei (θ), i = 1, 2.


b. Suppose that n = 3 and (x1 , x2 , x3 ) = (1.1, 0.9, 1.0). What are θbi , i = 1, 2 in this case?
c. Suppose that the value of x3 was misrecorded as x3 = 100.0 instead of 1.0. What are θbi , i = 1, 2 in this
case? Which estimator is more robust to this sort of error?

4. Suppose that Twitter assigns each new user an integer k + 1, where k is the number of users who registered
before. At the moment there are N Twitter accounts, but Twitter keeps this number secret. However, the
integer assigned to each user can be found in their profile and you are going to exploit this in order to
estimate N . Sample n users uniformly at random from the set of all users. Let x1 , . . . , xn be the integers
assigned to these users. These integers are distributed uniformly at random from the set {1, . . . , N }.

a. Consider the estimator N̂_1 := (2/n) ( ∑_{i=1}^n x_i ) − 1. Compute the mean, variance, and MSE E[|N − N̂_1|²] of N̂_1. Is this estimator unbiased?
b. Let N̂_2 denote the MLE of N. Derive an expression for N̂_2. Compute the mean, variance, and MSE E[|N − N̂_2|²] of N̂_2. Hint: Recall that if Y is a non-negative integer-valued random variable, then E[Y] = ∑_{i=1}^∞ P(Y ≥ i).
c. Which estimator would you use and why? Can you propose an even better estimator?

Lecture 7: Sufficient Statistics

Consider a random variable x whose distribution p is parametrized by θ ∈ Θ where θ is a scalar or a vector.


Denote this distribution as p(x|θ). In many machine learning applications we need to make some decision about θ
from observations of x, where the density can be one of many in a family of distributions, {p(x|θ)}θ∈Θ , indexed by
different choices of the parameter
Qn θ. More generally, suppose we make n independent observations x1 , x2 , . . . , xn
where p(x1 . . . xn |θ) = i=1 p(xi |θ). These observations can be used to infer or estimate the correct value for θ.
This problem can be posed as follows. Let x = [x1 , x2 , . . . , xn ] be a vector containing the n observations.

Question: Is there a lower dimensional function of x, say t(x), that alone carries all the relevant information
about θ? For example, if θ is a scalar parameter, then one might suppose that all relevant information in the
observations can be summarized in a scalar statistic.

Goal: Given a family of distributions {p(x|θ)}θ∈Θ and one or more observations from a particular distribu-
tion p(x|θ∗ ) in this family, find a data compression strategy that preserves all information pertaining to θ∗ . The
function identified by such strategy is called a sufficient statistic.

7.1. Sufficient Statistics

Maximum likelihood estimation is based exclusively on the shape of the likelihood function p(x|θ); the goal is to find
a maximum point. Any processing operations that preserve the shape will not affect the ML estimation process.
This is key to the idea of sufficient statistics.
Example 5 (Bernoulli Random Variables). Suppose x is a 0/1-valued variable with P(x = 1) = θ and P(x = 0) = 1 − θ. That is, x ∼ p(x|θ) = θ^x (1 − θ)^{1−x}, x ∈ {0, 1}. We observe n independent realizations x1, . . . , xn with

p(x1, . . . , xn|θ) = ∏_{i=1}^n θ^{x_i} (1 − θ)^{1−x_i} = θ^k (1 − θ)^{n−k} ,   where k = ∑_{i=1}^n x_i (the number of 1’s).

Note that k = ∑_{i=1}^n x_i is a random variable with values in {0, 1, . . . , n} and

p(k|θ) = (n choose k) θ^k (1 − θ)^{n−k} ,   a binomial distribution with (n choose k) = n! / ((n − k)! k!) .

The joint probability mass function of (x1, . . . , xn) and k is

p(x1, . . . , xn, k|θ) = p(x1, . . . , xn|θ) if k = ∑_i x_i,   and 0 otherwise.

Therefore

p(x1, . . . , xn|k, θ) = θ^k (1 − θ)^{n−k} / [ (n choose k) θ^k (1 − θ)^{n−k} ] = 1 / (n choose k) .

This shows that the conditional probability of x1, . . . , xn given k = ∑_i x_i is uniformly distributed over the (n choose k) sequences that have exactly k 1’s. In other words, the conditional distribution of x1, . . . , xn given k is independent of θ. So k carries all relevant information about θ!

Note: k = ∑_i x_i compresses {0, 1}^n (n bits) to {0, . . . , n} (log n bits).
Definition 3. Let x denote a random variable whose distribution is parameterized by θ ∈ Θ. Let p(x|θ) denote the density or mass function. A statistic t(x) is sufficient for θ if the distribution of x given t(x) is independent of θ; i.e., p(x|t, θ) = p(x|t).

Theorem 4 (Fisher-Neyman Factorization). Let x be a random variable with density p(x|θ) for some θ ∈ Θ. The
statistic t(x) is sufficient for θ if and only if the density can be factorized into a function a(x) and a function
b(t, θ), a function of θ but only depending on x through the t(x); i.e.,

p(x|θ) = a(x)b(t, θ)

Proof. (if/sufficiency) Assume p(x|θ) = a(x) b(t, θ). Then

p(t|θ) = ∫_{x: t(x)=t} p(x|θ) dx = ( ∫_{x: t(x)=t} a(x) dx ) b(t, θ)

This shows that

p(x|t, θ) = p(x, t|θ)/p(t|θ) = p(x|θ)/p(t|θ) = a(x) / ∫_{x: t(x)=t} a(x) dx ,

which is independent of θ ⇒ t(x) is a sufficient statistic.

(only if/necessity) If p(x|t, θ) = p(x|t), independent of θ, then p(x|θ) = p(x, t|θ) = p(x|t, θ) p(t|θ) = p(x|t) p(t|θ), which has the factored form with a(x) = p(x|t) and b(t, θ) = p(t|θ).

Example 6 (Bernoulli Random Variables). Let x1, . . . , xn iid∼ Bernoulli(θ) and let x = (x1, . . . , xn). Then, with k = ∑_{i=1}^n x_i,

p(x|θ) = θ^k (1 − θ)^{n−k} = [ 1/(n choose k) ] · [ (n choose k) θ^k (1 − θ)^{n−k} ]

where a(x) = 1/(n choose k) and b(k, θ) = (n choose k) θ^k (1 − θ)^{n−k}, so k is sufficient for θ.

Example 7 (Poisson). Let λ be the average number of packets/sec sent over a network. Let x be a random variable representing the number of packets seen in 1 second. Assume p(x|λ) = e^{−λ} λ^x / x!.
Given x1, . . . , xn iid∼ p(x|λ), we have

p(x1, . . . , xn|λ) = ∏_{i=1}^n e^{−λ} λ^{x_i} / x_i! = [ ∏_{i=1}^n 1/x_i! ] · [ e^{−nλ} λ^{∑ x_i} ]

where a(x) = ∏_{i=1}^n 1/x_i! and b(∑ x_i, λ) = e^{−nλ} λ^{∑ x_i}.
So ∑_{i=1}^n x_i is a sufficient statistic for λ.
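
The factorization can also be checked numerically: two Bernoulli datasets with the same count k produce identical likelihood functions, so any likelihood-based inference about θ is the same for both. A sketch (ours, not from the notes; it assumes Python with NumPy and made-up data):

    import numpy as np

    def bernoulli_likelihood(x, theta_grid):
        k, n = x.sum(), len(x)
        return theta_grid**k * (1 - theta_grid)**(n - k)

    theta_grid = np.linspace(0.01, 0.99, 99)
    x_a = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # k = 4
    x_b = np.array([0, 1, 0, 1, 1, 0, 0, 1])   # different sequence, same k = 4

    La = bernoulli_likelihood(x_a, theta_grid)
    Lb = bernoulli_likelihood(x_b, theta_grid)
    print(np.allclose(La, Lb))  # True: the likelihood depends on x only through k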

Example 8 (Gaussian). Let x ∼ N(µ, Σ) be d-dimensional, let x1, . . . , xn iid∼ N(µ, Σ), and let θ = (µ, Σ) denote all the parameters of the model. Then

p(x1, . . . , xn|θ) = ∏_{i=1}^n p(x_i; θ) = ∏_{i=1}^n (2π)^{-d/2} |Σ|^{-1/2} exp( −(1/2)(x_i − µ)^T Σ^{-1} (x_i − µ) )

                 = (2π)^{-nd/2} |Σ|^{-n/2} exp( −(1/2) ∑_{i=1}^n (x_i − µ)^T Σ^{-1} (x_i − µ) )

Define the sample mean

µ̂ = (1/n) ∑_{i=1}^n x_i

and the sample covariance

Σ̂ = (1/n) ∑_{i=1}^n (x_i − µ̂)(x_i − µ̂)^T .

Then the exponent can be rewritten as follows:

exp( −(1/2) ∑_{i=1}^n (x_i − µ)^T Σ^{-1} (x_i − µ) )

  = exp( −(1/2) ∑_{i=1}^n (x_i − µ̂ + µ̂ − µ)^T Σ^{-1} (x_i − µ̂ + µ̂ − µ) )

  = exp( −(1/2) ∑_{i=1}^n (x_i − µ̂)^T Σ^{-1} (x_i − µ̂) − ∑_{i=1}^n (x_i − µ̂)^T Σ^{-1} (µ̂ − µ) − (1/2) ∑_{i=1}^n (µ̂ − µ)^T Σ^{-1} (µ̂ − µ) )

  = exp( −(1/2) ∑_{i=1}^n (x_i − µ̂)^T Σ^{-1} (x_i − µ̂) ) exp( −(1/2) ∑_{i=1}^n (µ̂ − µ)^T Σ^{-1} (µ̂ − µ) )

  = exp( −(1/2) ∑_{i=1}^n tr( Σ^{-1} (x_i − µ̂)(x_i − µ̂)^T ) ) exp( −(1/2) ∑_{i=1}^n (µ̂ − µ)^T Σ^{-1} (µ̂ − µ) )

  = exp( −(1/2) tr( Σ^{-1} (nΣ̂) ) ) exp( −(1/2) ∑_{i=1}^n (µ̂ − µ)^T Σ^{-1} (µ̂ − µ) )

The cross term ∑_{i=1}^n (x_i − µ̂)^T Σ^{-1} (µ̂ − µ) is zero because (1/n) ∑_i x_i = µ̂. For any matrix B, tr(B) is the sum of the diagonal elements, and in the step introducing the trace above we use the property tr(AB) = tr(BA). Putting everything together,

p(x1, . . . , xn|θ) = 1 · (2π)^{-nd/2} |Σ|^{-n/2} exp( −(1/2) ∑_{i=1}^n (µ̂ − µ)^T Σ^{-1} (µ̂ − µ) ) exp( −(1/2) tr( Σ^{-1} nΣ̂ ) )

which has the factored form a(x1, . . . , xn) b((µ̂, Σ̂), θ) with a(x1, . . . , xn) = 1, so (µ̂, Σ̂) is a sufficient statistic for θ = (µ, Σ).

7.2. Minimal Sufficient Statistic

Definition 4. A sufficient statistic is minimal if the dimension of t(x) cannot be further reduced and still be
sufficient.
Example 9. Let x1, . . . , xn iid∼ N(θ, 1), with n even, and consider the statistics

u(x1, . . . , xn) = [x1 + x2, x3 + x4, . . . , x_{n−1} + x_n]^T ,   an (n/2)-dimensional statistic,

t(x1, . . . , xn) = ∑_{i=1}^n x_i ,   a 1-dimensional statistic.

t is sufficient, and since t = ∑_{i=1}^{n/2} u_i it follows that u is also a sufficient statistic (but not minimal).

7.3. Rao-Blackwell Theorem

Sufficient statistics arise naturally in ML estimation, but there are many other criteria for estimation. The Rao-
Blackwell theorem provides additional support for the use of sufficient statistics.

Theorem 5. [5] Assume x ∼ p(x|θ), θ ∈ R, and t(x) is a sufficient statistic for θ. Let f (x) be an estimator of θ
and consider the mean square error E[(f (x) − θ)2 ]. Define g(t(x)) = E[f (x)|t(x)].

Then E[(g(t(x)) − θ)2 ] ≤ E[(f (x) − θ)2 ], with equality if and only if f (x) = g(t(x)) with probability 1; i.e., if
the function f is equal to g composed with t.

Proof. First note that because t(x) is a sufficient statistic for θ, it follows that g(t(x)) = E[f (x)|t(x)] does not
depend on θ, and so it too is a valid estimator (i.e., if t(x) were not sufficient, then g(t(x)) might be a function of
t(x) and θ and therefore not computable from the data alone).

Next recall the following basic facts about conditional expectation. Suppose x and y are random variables. Then

E[f(x)|y] = ∫ f(x) p(x|y) dx

where p(x|y) is the conditional density of x given y. Furthermore, for any random variables x and y

E[ E[f(x)|y] ] = ∫ E[f(x)|y] p(y) dy

              = ∫ ( ∫ f(x) p(x|y) dx ) p(y) dy

              = ∫ f(x) ( ∫ p(x|y) p(y) dy ) dx

              = ∫ f(x) p(x) dx = E[f(x)]
This is sometimes called the smoothing property.

Now consider the conditional expectation

E[f (x) − θ|t(x)] = g(t(x)) − θ

By Jensen’s Inequality

(E[f (x) − θ|t(x)])2 ≤ E[(f (x) − θ)2 |t(x)] .

Therefore

(g(t(x)) − θ)2 ≤ E[(f (x) − θ)2 |t(x)]

Taking the expectation of both sides (recall the smoothing property above) yields

E[(g(t(x)) − θ)2 ] ≤ E[(f (x) − θ)2 ]

7.4. Exercises

1. Let x1 , x2 , . . . , xn be independent and identically distributed random variables. Each xi ∼ Uniform(a, b).
That is, each random variable is uniformly distributed on the interval (a, b). Find scalar sufficient statistics
for a and b.

2. Suppose that x1, . . . , xn iid∼ Exp(λ) for some λ > 0, that is, p(x|λ) = λe^{−λx} for x ≥ 0 and 0 otherwise. Furthermore, suppose that y1, . . . , ym iid∼ Exp(λ/2) and independent of {x_i}. Find a one-dimensional sufficient statistic for λ.

3. Let x = [x1 x2 ]T denote a realization of a bivariate Gaussian random vector. Assume


   
E[x] = [µ, µ]^T := µ   and   E[(x − µ)(x − µ)^T] = [ 1  ρ
                                                     ρ  1 ]

where ρ is known and µ is an unknown mean parameter of interest. Note that x1 and x2 are correlated.

(a) Find a 1-dimensional sufficient statistic t for µ.


(b) Show that x1 is an unbiased estimator of µ.
(c) Verify the Rao-Blackwell Theorem by deriving the conditional expectation E[x1 |t] and showing that it
is also an unbiased estimator of µ and that its variance is lower than that of x1 . Hint: First determine
the conditional density of x1 |t.

Lecture 8: Asymptotic Analysis of the MLE

Finding the MLE is essentially density estimation based on a set of i.i.d. samples drawn from a distribution q.
As the number of samples increases, the MLE tends to the parameter value of the density that is closest to the
generating distribution q. The MLE is asymptotically Gaussian distributed with covariance given by the inverse
of the Fisher Information Matrix.

8.1. Convergence of log likelihood to KL

Suppose we make n i.i.d. samples x1 , . . . , xn drawn from distribution q, and consider a parametric family of
densities {p(x|θ)}. We will use the notation p(x|θ) and pθ interchangeably. The samples could be scalars or
vector-valued; this does not affect the analysis in this note. The MLE of θ is
θ̂_n = arg max_θ ∏_{i=1}^n p(x_i|θ) = arg min_θ − ∑_{i=1}^n log p(x_i|θ)

    = arg min_θ ∑_{i=1}^n [ log q(x_i) − log p(x_i|θ) ] = arg min_θ ∑_{i=1}^n log [ q(x_i)/p(x_i|θ) ]

By the strong law of large numbers (SLLN), for any θ ∈ Θ,

(1/n) ∑_{i=1}^n log [ q(x_i)/p(x_i|θ) ]  →a.s.  D(q∥p_θ)

where pθ denotes the parametric density and the quantity D(q∥pθ ) is the so-called Kullback-Leibler divergence of
pθ from q. Let θ∗ = arg minθ D(q∥pθ ), the parameter associated with the parametric density that is closest to q.
We would like to show that the MLE converges to θ* in the following sense:

D(q∥p_{θ̂_n}) −→ D(q∥p_{θ*}) .

Note that since θ̂_n maximizes ∑_{i=1}^n log p(x_i|θ) we have

0 ≥ (1/n) ∑_{i=1}^n log [ p(x_i|θ*) / p(x_i|θ̂_n) ]

  = (1/n) ∑_{i=1}^n log [ ( p(x_i|θ*)/q(x_i) ) · ( q(x_i)/p(x_i|θ̂_n) ) ]

  = (1/n) ∑_{i=1}^n log [ q(x_i)/p(x_i|θ̂_n) ] − (1/n) ∑_{i=1}^n log [ q(x_i)/p(x_i|θ*) ]

  ≈ D(q∥p_{θ̂_n}) − D(q∥p_{θ*}) .

This shows that D(q∥p_{θ*}) ≳ D(q∥p_{θ̂_n}) (i.e., this is an approximate inequality), but by definition D(q∥p_{θ*}) ≤ D(q∥p_θ) for all θ. This suggests that θ̂_n → θ* in the sense that D(q∥p_{θ̂_n}) → D(q∥p_{θ*}). The subtle issue here is that θ̂_n is a random variable depending on all the x_i (not a fixed θ ∈ Θ), so technically one needs to be a bit more careful when considering the convergence of (1/n) ∑_{i=1}^n log [ q(x_i)/p(x_i|θ̂_n) ], which is not a sum of independent random variables. However, the claimed convergence to KL holds under mild regularity assumptions on the likelihood function; this argument can be made precise [24, Chpt. 3.4].

8.2. Asymptotic Distribution of MLE

Above, we argued that under mild assumptions θ̂_n converges to θ*. Next we will characterize the asymptotic distribution of θ̂_n under the assumption that the data are generated by q = p_{θ*}. The notation θ̂_n ∼asymp p means that as n → ∞ the distribution of θ̂_n (a random variable because it is a function of a random set of data) tends to the distribution p.
Theorem 6 (Asymptotic Distribution of MLE). Let x1, . . . , xn be iid observations from p(x|θ*), where θ* ∈ R^d. Let θ̂_n = arg max_θ ∏_{i=1}^n p(x_i|θ) = arg max_θ ∑_{i=1}^n log p(x_i|θ), define L(θ) := ∑_{i=1}^n log p(x_i|θ), and assume ∂L(θ)/∂θ_j and ∂²L(θ)/∂θ_j∂θ_k exist for all j, k. Furthermore, assume that p(x|θ) satisfies the regularity condition (see Footnote 6) E[ ∂ log p(x|θ)/∂θ ] = 0. Then

θ̂_n ∼asymp N( θ*, n^{-1} I^{-1}(θ*) )

where I(θ*) is the Fisher Information Matrix (FIM), whose elements are given by

[I(θ*)]_{j,k} = −E_{x∼p_{θ*}}[ ∂² log p(x|θ)/∂θ_j ∂θ_k |_{θ=θ*} ]

The theorem tells us that the distribution tends to a Gaussian. It also tells us that θ̂_n is asymptotically unbiased,
since the mean of the limiting Gaussian distribution is θ*. It also characterizes the asymptotic covariance of the
estimator; the covariance decays to zero like 1/n, but the structure of the covariance is determined by the Fisher
information matrix. The Fisher Information matrix is the expected value of the negative of the Hessian matrix of
the log-likelihood function at the point θ∗ . It measures the curvature of the log-likelihood surface. For example,
in the case where θ is scalar, the Fisher Information matrix is simply the negative of the second derivative of
the log-likelihood function. Since we are maximizing the log-likelihood, the curvature should be negative. The
more negative the curvature, the more sharply defined is the location of the maximum. Therefore, more negative
curvatures lead to less variable estimates, which is precisely what is revealed by the limiting distribution above.
Figure 8.1 depicts this in the case of a scalar parameter.

Remark: Consider the scalar case with d = 1. The theorem above tells us that for large n the variance of the
MLE is E[(θb − θ∗ )2 ] ≈ I −1 (θ∗ )/n. This shows that the larger the Fisher Information I(θ∗ ), the fewer samples we
need to obtain a good estimate of θ∗ . Thus, knowing or bounding the Fisher information can help us decide how
many training examples are sufficient.

Proof. We will prove the theorem for the special case when θ is scalar. The proof for multidimensional vectors
follows the same steps using multivariate calculus. By the mean value theorem,
∂L(θ)/∂θ |_{θ=θ̂_n} = ∂L(θ)/∂θ |_{θ=θ*} + ∂²L(θ)/∂θ² |_{θ=θ̃} (θ̂_n − θ*) ,

where θ̃ is some value between θ* and θ̂_n. By definition, ∂L(θ)/∂θ |_{θ=θ̂_n} = 0, so

0 = ∂L(θ)/∂θ |_{θ=θ*} + ∂²L(θ)/∂θ² |_{θ=θ̃} (θ̂_n − θ*) .
Footnote 6: The regularity condition amounts to assuming that we can interchange the order of differentiation and integration/expectation (as we will do in the proof). Formally, this is an application of the dominated convergence theorem and requires certain conditions on p(x|θ) to be met. The conditions hold for most of the distributions we encounter in practice, but it is not true when the support of the distribution depends on θ (e.g., if p(x|θ) is the uniform density on [0, θ]).

42
[Figure 8.1: Likelihood functions for scalar parameter θ. (a) Low curvature cases will lead to greater estimator variances compared to (b) high curvature cases.]

From the equation above we have

θ̂_n − θ* = − [ ∂L(θ)/∂θ |_{θ=θ*} ] / [ ∂²L(θ)/∂θ² |_{θ=θ̃} ] .

Let us consider √n (θ̂_n − θ*). The reason for scaling the difference by √n is that this is the normalization needed to stabilize the limiting distribution. For example, if x1, . . . , xn were iid observations from the distribution N(θ*, 1), then it is easy to see that √n (θ̂_n − θ*) ∼ N(0, 1). So, from above we have

√n (θ̂_n − θ*) = − [ (1/√n) ∂L(θ)/∂θ |_{θ=θ*} ] / [ (1/n) ∂²L(θ)/∂θ² |_{θ=θ̃} ] .   (8.1)
Note that the denominator is the average curvature of the individual log-likelihood terms, and this average will
converge to the mean/expected curvature (a constant larger than zero unless the log-likelihood functions are ex-
actly “flat”). So think of the denominator as behaving like a non-zero constant. We will identify that constant
later.

First, let’s study the numerator:

(1/√n) ∂L(θ)/∂θ |_{θ=θ*} = (1/√n) ∑_{i=1}^n ∂ log p(x_i|θ)/∂θ |_{θ=θ*}

The expectation (mean) of this quantity is 0, as shown below.

E[ ∂ log p(x|θ)/∂θ |_{θ=θ*} ] = ∫ ∂ log p(x|θ)/∂θ |_{θ=θ*} p(x|θ*) dx

                             = ∫ (1/p(x|θ*)) ∂p(x|θ)/∂θ |_{θ=θ*} p(x|θ*) dx

                             = ∫ ∂p(x|θ)/∂θ |_{θ=θ*} dx

                             = ∂/∂θ ( ∫ p(x|θ) dx ) |_{θ=θ*} = 0 ,

since ∫ p(x|θ) dx = 1 for all θ and the derivative of a constant is 0. Recall the Central Limit Theorem: if z1, . . . , zn are iid random variables with mean E[z1] = 0 and variance E[z1²] = σ², then (1/√n) ∑_i z_i →D N(0, σ²), meaning that the random variable defined by the summation has a distribution that tends to the Gaussian as n → ∞. Therefore, by the CLT we have

(1/√n) ∂L(θ)/∂θ |_{θ=θ*} = (1/√n) ∑_{i=1}^n ∂ log p(x_i|θ)/∂θ |_{θ=θ*}  →D  N( 0, V[ ∂ log p(x|θ)/∂θ |_{θ=θ*} ] ) .

The notation →D means convergence in distribution. Since the mean is 0, its variance is just the second moment

V[ ∂ log p(x|θ)/∂θ |_{θ=θ*} ] = E[ ( ∂ log p(x|θ)/∂θ |_{θ=θ*} )² ] .

The variance can be related to the curvature of the log-likelihood function as follows. First observe that

∂² log p(x|θ)/∂θ² = ∂/∂θ [ (1/p(x|θ)) ∂p(x|θ)/∂θ ]

                  = −(1/p²(x|θ)) ( ∂p(x|θ)/∂θ )² + (1/p(x|θ)) ∂²p(x|θ)/∂θ²

Now let’s take the expectation:

E[ ∂² log p(x|θ)/∂θ² |_{θ=θ*} ] = −∫ ( (1/p(x|θ*)) ∂p(x|θ)/∂θ |_{θ=θ*} )² p(x|θ*) dx + ∫ (1/p(x|θ*)) ∂²p(x|θ)/∂θ² |_{θ=θ*} p(x|θ*) dx

                               = −∫ ( ∂ log p(x|θ)/∂θ |_{θ=θ*} )² p(x|θ*) dx + ∫ ∂²p(x|θ)/∂θ² |_{θ=θ*} dx

                               = −E[ ( ∂ log p(x|θ)/∂θ |_{θ=θ*} )² ] + ∂²/∂θ² ( ∫ p(x|θ) dx ) |_{θ=θ*} .

Since ∫ p(x|θ) dx = 1, the second term is 0. Therefore, the variance is equal to the negative expected curvature:

E[ ( ∂ log p(x|θ)/∂θ |_{θ=θ*} )² ] = −E[ ∂² log p(x|θ)/∂θ² |_{θ=θ*} ] = I(θ*) .

Now consider the denominator of (8.1). By the Strong Law of Large Numbers (SLLN), this average converges to its mean value, the negative Fisher Information, almost surely:

(1/n) ∂²L(θ)/∂θ² |_{θ=θ̃} = (1/n) ∑_{i=1}^n ∂² log p(x_i|θ)/∂θ² |_{θ=θ̃}  →a.s.  E[ ∂² log p(x|θ)/∂θ² |_{θ=θ*} ] = −I(θ*) .

Note that in the equation above we substituted θ* for θ̃, since we assume that θ̂_n →a.s. θ* (and therefore so does θ̃).

To summarize, the numerator of (8.1) converges in distribution to a Gaussian,

(1/√n) ∂L(θ)/∂θ |_{θ=θ*}  →D  N(0, I(θ*)) ,

and the denominator (1/n) ∂²L(θ)/∂θ² |_{θ=θ̃} →a.s. −I(θ*). So for large n, the numerator behaves like a Gaussian random variable and the denominator is almost constant. The ratio therefore converges in distribution to a Gaussian rescaled by the limiting constant of the denominator:

√n (θ̂_n − θ*)  →D  (1/I(θ*)) N(0, I(θ*)) ≡ N(0, I^{-1}(θ*)) .
This type of convergence is rigorously proved by Slutsky’s Theorem [23].
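
The limiting distribution can be explored by simulation. The sketch below (ours, not from the notes; it assumes Python with NumPy and a Poisson model) repeatedly computes the MLE λ̂_n = x̄ and compares the empirical variance of √n (λ̂_n − λ*) with the inverse Fisher information, which equals λ* for the Poisson family since I(λ) = 1/λ.

    import numpy as np

    rng = np.random.default_rng(4)
    lam_star, n, n_trials = 3.0, 500, 5000

    # The MLE for the Poisson rate is the sample mean (Example 3).
    lam_hat = rng.poisson(lam_star, size=(n_trials, n)).mean(axis=1)
    z = np.sqrt(n) * (lam_hat - lam_star)

    print("empirical variance of sqrt(n)*(MLE - lambda*):", z.var())
    print("inverse Fisher information I^{-1}(lambda*)   :", lam_star)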

8.3. Exercises

1. Congratulations! You have just been hired by Google to work on their online ad auction team. Website
advertising spaces are sold by an auction. For simplicity, let’s assume the following auction model. There
is one auction for each ad space. Each bidder places a single bid per ad space, and the highest bidder wins
that auction. Since Google runs the ad auction service, they can observe all the bids. The website selling
the ad space only observes the final winning bid.
Your first assignment at Google is to determine how much the sellers can learn about the distribution of bids
from observations of only the highest bids in each auction. Here is a mathematical model of the observation
process. Suppose there are n bidders in an auction and their (non-negative) bids are x1, x2, . . . , xn iid∼ θe^{−θx} (i.e., exponentially distributed with parameter θ). The seller observes y := max_{i=1,...,n} x_i.

a. Derive an expression for the probability density of y. Hint: Start by finding the cumulative distribution
function of y.
b. How would you find the MLE of θ given y?
c. Let θ̂ denote the MLE. Show that for n sufficiently large, θ̂ ≈ (log n)/y.

2. Suppose that n1 people are given treatment 1 and n2 people are given treatment 2. Let x1 be the number
of people on treatment 1 who respond favorably to the treatment and let x2 be the number of people on
treatment 2 who respond favorably. Assume x1 ∼ Binomial(n1 , p1 ), x2 ∼ Binomial(n2 , p2 ), and that x1
and x2 are statistically independent. Let ϕ = p1 − p2 .

a. Find the MLE ϕb for ϕ.


b. Find the Fisher information matrix I(p1 , p2 ).
c. Use the result of (b) to characterize the large-sample distribution of ϕ̂.
d. Assume that n1 = n2 . Based on the result of (c), roughly how many people are required in the study so
that P(|ϕb − ϕ| > 0.01) < 0.05?
e. Verify your conclusions in (d) by simulating this problem (in Python, Matlab, R, etc); i.e., generate
many iid realizations of x1 and x2 with n1 = n2 from part (d) and generate a Monte Carlo estimate of
P(|ϕb − ϕ| > 0.01). You can chose p1 and p2 as you like (e.g., p1 = p2 will suffice).

3. Suppose we observe measurements of the form

yi = fw (xi ) + ϵi , i = 1, . . . , n ,

where xi ∈ Rd are known (deterministic), the ϵi are i.i.d. zero-mean, unit-variance random variables, and
fw is a function parameterized by an unknown weight vector w.

(a) What are the conditional mean and variance of yi given xi ?


(b) Let p denote the probability density of the ϵi . What optimization would you solve to find the maximum
likelihood estimate (MLE) of w?
For the rest of the questions, assume that fw(x) = w^T x, a linear function, and that ϵ_i iid∼ N(0, 1). Also,
remember that we are treating the xi as known, deterministic vectors. Assume that the linear span of
{xi }ni=1 is Rd .

(a) Give an explicit (linear-algebraic) expression in terms of {(xi , yi )}ni=1 for the MLE of w and its dis-
tribution.

(b) What is the Fisher information matrix for the MLE?
(c) Suppose that you have a training sample budget. Assume n ≫ d, but you may only use d training examples. If our aim is to minimize the mean-square error E[∥w − ŵ∥²], propose a method for selecting the d examples. Hint: Express E[∥w − ŵ∥²] in terms of the Fisher information matrix.

Lecture 9: Maximum Likelihood Estimation and Empirical Risk Minimization

Suppose we have a “training” dataset of independent and identically distributed examples {(xi, yi)}_{i=1}^n. The goal
of prediction is to predict the unknown label y from an observation of a new x, under the assumption that (x, y)
is drawn from the same distribution as the training data, independently of it. Specifically, the goal is to “learn” a function f
using the training data so that f(x) agrees with y in some sense. We look at two approaches that lead to similar
(sometimes identical) solutions to this learning problem: Maximum Likelihood Estimation and Empirical Risk
Minimization.

9.0.1. Maximum Likelihood Estimation Approach

The MLE approach starts by assuming some form for the conditional distribution of y given x. For example, we
might assume that the conditional distribution is Gaussian or Bernoulli, which lead to least squares or logistic
regression, respectively (more on this later). One of the main ideas in this approach is to use models from the
exponential family of distributions, which includes many of the common distributions including the Gaussian
and Bernoulli. Using any distribution from the exponential family produces an MLE optimization problem that is
convex in the prediction function f. This is because any such distribution can be written in terms of a parameter θ,
called the natural parameter of the distribution, as

p(y|θ) ∝ exp(−ℓ(y, θ))

for some function ℓ that depends on the specific distribution under consideration. A key fact (which we prove
below) is that if p(y|x) is in the exponential family, then ℓ is convex in θ. The idea is to model the natural
parameter as a function of x, so that the distribution of y is determined by x. Substituting θ = f(x), we have

p(y|x) ∝ exp(−ℓ(y, f(x)))

The MLE is the solution to

max_{f∈F} Σ_{i=1}^n log p(yi|xi)   ≡   min_{f∈F} Σ_{i=1}^n ℓ(yi, f(xi))

where F is a certain set of allowable functions. The fact that ℓ is convex in f ensures that all local minima are
global.

9.0.2. Empirical Risk Minimization

One of the most common approaches to machine learning is called Empirical Risk Minimization (ERM). ERM
is based on a choice of a loss function ℓ that measures the disagreement between a label y and a predictor f(x).
The value of ℓ(y, f(x)) is called the loss of f on the example (x, y). Common losses include the squared
error loss ℓ(y, f(x)) = (y − f(x))^2 and the hinge loss ℓ(y, f(x)) = max(0, 1 − y f(x)), which is common in
binary classification problems with the label set y ∈ {−1, +1}. The expected value of the loss, with respect to the
joint distribution of (x, y), is called the risk. Given a class of functions F, one would like to find the f ∈ F that
minimizes the risk, i.e., to find a solution to

min_{f∈F} E[ ℓ(y, f(x)) ] .

ERM approximates the expectation with an i.i.d. set of training examples {(xi, yi)} and finds the solution to

min_{f∈F} Σ_{i=1}^n ℓ(yi, f(xi)) .

This is reasonable since (1/n) Σ_{i=1}^n ℓ(yi, f(xi)) → E[ℓ(y, f(x))] as n → ∞. Notice that the ERM optimization has
the same form as the MLE. So, given a choice of loss function, we can view ERM as MLE with a conditional
distribution model p(y|x) ∝ exp(−ℓ(y, f(x))).

9.0.3. Example: Gaussian Models and Least Squares

Consider the following model for a labeled dataset {(xi, yi)}_{i=1}^n. Suppose that the labels yi are conditionally independent
given their corresponding features xi and that yi ∼ N(fw(xi), 1). This means that E[yi|xi] = fw(xi),
where fw is a function parameterized by w. The variance of yi|xi is 1 (more generally, we could assume any
other value for the variance). In this case, the log-likelihood of w given {(xi, yi)} is

L(w) = −(1/2) Σ_{i=1}^n (yi − fw(xi))^2 + C

where C is a constant that does not depend on w. Thus, the MLE of w is given by the least squares optimization

ŵ = arg min_w Σ_{i=1}^n (yi − fw(xi))^2

If we assume a linear model, that is fw(xi) = w^T xi, then we have the classical least squares problem

arg min_w Σ_{i=1}^n (yi − x_i^T w)^2

and the MLE is a solution to the linear system

X^T X w = X^T y

where X is a matrix with rows x_i^T and y is a vector with entries yi, i = 1, . . . , n.
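As a concrete illustration (not part of the original notes), here is a minimal NumPy sketch that generates synthetic data and solves the normal equations X^T X w = X^T y; the data and variable names are illustrative assumptions.

    import numpy as np

    # Synthetic data: n examples, d features, with a "true" weight vector.
    rng = np.random.default_rng(0)
    n, d = 200, 5
    X = rng.normal(size=(n, d))          # rows are x_i^T
    w_true = rng.normal(size=d)
    y = X @ w_true + rng.normal(size=n)  # y_i = w^T x_i + noise, noise ~ N(0, 1)

    # MLE / least squares: solve the normal equations X^T X w = X^T y.
    # np.linalg.solve is preferred over forming an explicit matrix inverse.
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(w_hat)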

Generalized linear models extend this sort of linear modeling approach to other distributions, including cases
where the conditional distribution of yi given xi is binomial (binary classification), multinomial (multi-class),
Poisson, exponential, and other probability models in the exponential family. The common theme is that the
conditional probability model takes the form p(y|x) ∝ exp(−ℓ(y, wT x)), where ℓ(y, wT x) is a convex function
of w.

9.1. The Exponential Family

The Exponential Family is a class of distributions with the following form:

p(y|θ) = b(y) exp(θ T t(y) − a(θ)) .

The parameter θ is called the natural parameter of the distribution and t(y) is the sufficient statistic. The quantity
e^{−a(θ)} is a normalization constant, ensuring that p(y|θ) sums or integrates to 1. If y is continuous,

1 = ∫ p(y|θ) dy = e^{−a(θ)} ∫ b(y) exp(θ^T t(y)) dy

which shows that a(θ) = log( ∫ b(y) exp(θ^T t(y)) dy ). If y is discrete, then the integral is replaced by a summation.
a(θ) is called the log partition function. The factor b(y) is the non-negative base measure, and in many
cases it is equal to 1. Many familiar distributions belong to the exponential family (e.g., Gaussian, exponential,
log-normal, gamma, chi-squared, beta, Dirichlet, Bernoulli, Poisson, geometric). To use exponential family distributions
to model the conditional distribution of y given x, we take θ to be a parametric function of x, e.g., a
linear function θ = w^T x.

Note that the negative log-likelihood function of θ is

−log p(y|θ) = −θ^T t(y) + a(θ) − log b(y).

Remarkably, this is a convex function of θ, which means that a maximum likelihood estimator can be easily
computed using convex optimization. To show it is convex, note that the first term is linear and hence convex in
θ. We will verify a(θ) is convex by showing that for any θ1, θ2, and λ ∈ [0, 1]

a( λθ1 + (1 − λ)θ2 ) ≤ λ a(θ1) + (1 − λ) a(θ2) .

We will assume y is continuous, but the following argument holds in the discrete case by replacing integrals with
summations. We will use Hölder's inequality, which states that for any p, q ≥ 1 such that 1/p + 1/q = 1, any two
functions f(y), g(y), and measure b(y),

∫ |f(y) g(y)| b(y) dy ≤ ( ∫ |f(y)|^p b(y) dy )^{1/p} ( ∫ |g(y)|^q b(y) dy )^{1/q} .

Consider

exp( a(λθ1 + (1 − λ)θ2) ) = ∫ exp( (λθ1 + (1 − λ)θ2)^T t(y) ) b(y) dy
                          = ∫ ( exp(θ1^T t(y)) )^λ ( exp(θ2^T t(y)) )^{1−λ} b(y) dy
                          ≤ ( ∫ exp(θ1^T t(y)) b(y) dy )^λ ( ∫ exp(θ2^T t(y)) b(y) dy )^{1−λ}
                          = exp(a(θ1))^λ exp(a(θ2))^{1−λ}

where we applied Hölder's inequality with p = 1/λ and q = 1/(1 − λ). Taking the logarithm of both sides yields
the desired result.

Next we will consider several applications of the GLM framework.

Gaussian Distribution: Classical Least Squares

p(y|µ) = (1/√(2π)) e^{−(y−µ)^2/2} = (1/√(2π)) e^{−y^2/2} exp( µy − µ^2/2 )

and we identify b(y) = (1/√(2π)) e^{−y^2/2}, θ = µ, t(y) = y, and a(θ) = θ^2/2. Note that the natural parameter in this
case is the mean µ. If we use this as a model for the conditional distribution yi|xi and let θ = w^T xi, then
we have the least squares linear model above. In other words, in least squares we model yi as Gaussian with
unit variance and model its mean as a linear function of the corresponding feature xi.

Binomial Distribution: Logistic Regression

p(y|µ) = µ^y (1 − µ)^{1−y} = exp( y log(µ) + (1 − y) log(1 − µ) ) = exp( y log( µ/(1 − µ) ) + log(1 − µ) )

and we identify b(y) = 1, θ = log(µ/(1 − µ)), t(y) = y, and a(θ) = log(1 + e^θ). The natural parameter in
this case is not the mean. Reparameterizing the Binomial distribution in terms of its natural parameter, we
have

p(y|θ) = exp( θ y − log(1 + e^θ) )

Consider the log-likelihood function log p(y|θ):

log p(y = 1|θ) = θ − log(1 + e^θ) = log(e^θ) − log(1 + e^θ) = log( 1/(1 + e^{−θ}) )
log p(y = 0|θ) = log( 1/(1 + e^θ) )

For convenience, let us use the binary labels ±1 instead of 0 and 1; i.e., reassign y → 2y − 1. Then we have

log p(y|θ) = log( 1/(1 + e^{−yθ}) )
To use this as a model for the conditional distribution y|x, let θ = w^T x. Consider iid examples {(xi, yi)}_{i=1}^n.
In this case, we model yi as Bernoulli (with labels ±1) and model its natural parameter as a linear function
of the corresponding feature xi. This is a (generalized) linear model for a binary classification setting. The
maximum likelihood estimator of w is the solution to

min_w Σ_{i=1}^n log( 1 + e^{−yi x_i^T w} )

If ŵ is the solution, then the predicted label for a new x is given by

ŷ = sign(ŵ^T x)

This optimization is called logistic regression because the function 1/(1 + e^{−θ}) is known as
the logistic function. The function ℓ(θ) := log(1 + e^{−θ}) is called the logistic loss function. The logistic loss
function is convex in θ. To see this, note that the second derivative of ℓ(θ) is

( e^{−θ}/(1 + e^{−θ}) ) ( 1 − e^{−θ}/(1 + e^{−θ}) ) ≥ 0

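As a concrete illustration of the logistic-regression MLE above, here is a minimal gradient-descent sketch in Python/NumPy; the synthetic data, step size, and iteration count are illustrative assumptions, not part of the notes.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 500, 3
    X = rng.normal(size=(n, d))
    w_true = np.array([1.0, -2.0, 0.5])
    # Labels in {-1, +1} drawn from the logistic model P(y = 1 | x) = 1/(1 + exp(-w^T x)).
    p = 1.0 / (1.0 + np.exp(-X @ w_true))
    y = np.where(rng.uniform(size=n) < p, 1.0, -1.0)

    def neg_log_lik(w):
        # sum_i log(1 + exp(-y_i x_i^T w))
        return np.sum(np.log1p(np.exp(-y * (X @ w))))

    def grad(w):
        # gradient of the logistic loss: -sum_i y_i x_i / (1 + exp(y_i x_i^T w))
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))
        return -(X.T @ (y * s))

    w = np.zeros(d)
    step = 0.01
    for t in range(2000):
        w -= step * grad(w)

    print(neg_log_lik(w), w)        # fitted weights
    y_hat = np.sign(X @ w)          # predicted labels, y_hat = sign(w^T x)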
Multinomial Distribution: Multinomial Logistic Regression

Let y be a random variable that takes value k with probability P(y = k) = qk, k = 1, . . . , K. The
likelihood of q1, . . . , qK is

p(y|q1, . . . , qK) = Σ_{k=1}^K 1{y=k} qk = exp( Σ_{k=1}^K 1{y=k} log qk ) = exp( θ^T t(y) − a(θ) )

where θ ∈ R^K, qk = e^{θk} / Σ_{m=1}^K e^{θm}, a(θ) = log( Σ_{k=1}^K e^{θk} ), and t(y) is the “one-hot” vector

t(y) = ( 1{y=1}, . . . , 1{y=K} )^T .

The function e^{θk} / Σ_{m=1}^K e^{θm} is called the soft-max function. Notice that this parameterization ensures
that the probabilities sum to 1. Reparameterizing the likelihood function in terms of θ yields

L(θ) := log p(y|θ1, . . . , θK) = log( Σ_{k=1}^K 1{y=k} e^{θk} / Σ_{m=1}^K e^{θm} ) = Σ_{k=1}^K 1{y=k} log( e^{θk} / Σ_{m=1}^K e^{θm} )

To use this as a model for the conditional distribution y|x, we introduce weight vectors w1, . . . , wK and let
θk = wk^T x. This is a linear model for a multiclass classification setting. Consider iid examples {(xi, yi)}_{i=1}^n.
The maximum likelihood estimator of w1, . . . , wK is the solution to

min_{w1,...,wK} − Σ_{i=1}^n Σ_{k=1}^K 1{yi=k} log( e^{wk^T xi} / Σ_{m=1}^K e^{wm^T xi} )

The objective is a convex function of the weight vectors, because of the convexity with respect to θ. Each
term in the sum is a cross-entropy loss. This name comes from the following interpretation. Let {pk} and
{qk} be probability mass functions. The cross-entropy of {qk} relative to {pk} is −Σ_k pk log qk. Each term in
the objective above is the cross-entropy between {1{yi=k}} and {qk}, where qk = e^{wk^T xi} / Σ_{m=1}^K e^{wm^T xi}.

Finally, let ŵ1, . . . , ŵK denote the MLEs. The predicted label for a new x is given by

ŷ = arg max_{k∈{1,...,K}} ŵk^T x
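A minimal sketch of this softmax cross-entropy objective and its gradient, with illustrative synthetic data and variable names (an assumption, not from the notes), follows.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, K = 300, 4, 3
    X = rng.normal(size=(n, d))
    y = rng.integers(K, size=n)               # class labels in {0, ..., K-1}
    Y = np.eye(K)[y]                          # one-hot rows, i.e. t(y_i)

    def softmax(Z):
        Z = Z - Z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
        E = np.exp(Z)
        return E / E.sum(axis=1, keepdims=True)

    def loss_and_grad(W):
        """Negative log-likelihood of the multinomial GLM with theta_k = w_k^T x."""
        Q = softmax(X @ W)                    # q_k for each example (n x K)
        loss = -np.sum(Y * np.log(Q))         # sum of cross-entropy terms
        grad = X.T @ (Q - Y)                  # gradient with respect to W (d x K)
        return loss, grad

    W = np.zeros((d, K))
    for t in range(1000):
        loss, g = loss_and_grad(W)
        W -= 0.01 * g

    y_hat = np.argmax(X @ W, axis=1)          # predicted labels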

9.2. Exercises

1. Suppose that y ≥ 0 and p(y|µ) = (1/µ) e^{−y/µ}, the exponential density function.

(a) Show that E[y] = µ.


(b) Express this density in the exponential family form p(y|θ) = b(y) exp(θ t(y) − a(θ)).
(c) Suppose that y1 , . . . , yn are iid according to p(y|θ). Show that the negative log-likelihood is a convex
function of θ.

(d) What is the MLE for θ?

2. Consider the Binomial and Multinomial GLMs:

Binomial: p(y|x, w) = exp( y log( 1/(1 + e^{−x^T w}) ) + (1 − y) log( 1/(1 + e^{x^T w}) ) )

Multinomial: p(y = ℓ|x, w1, . . . , wk) = exp(x^T wℓ) / Σ_{j=1}^k exp(x^T wj) ,   for ℓ = 1, . . . , k

(a) Show that the two models are equivalent for the binary classification, k = 2, case. HINT: Relate w in
the Binomial model to w1 and w2 in the Multinomial model.
(b) What is the limiting behavior of the p(y|x, w) as ∥w∥ → ∞?
(c) Consider the class-conditional densities x|y = 1 ∼ N (w, I) and x|y = 0 ∼ N (−w, I). Assuming
equal prior probabilities, find an expression for p(y = 1|x, w).
(d) Suppose we have n training examples {(xi , yi )}ni=1 that are independently sampled with equal proba-
bility from the two Gaussian class-conditional densities above and consider the estimate

ŵ = (1/n) Σ_{i=1}^n (2yi − 1) xi .

i. What is the distribution of ŵ?
ii. What is the asymptotic distribution of ŵ/∥ŵ∥? Hint: Use the fact that, by the SLLN, ∥ŵ∥ → ∥w∥ almost surely.

3. Recall the negative log-likelihood function for the multinomial GLM

−L(θ) = −log( Σ_{k=1}^K 1{y=k} e^{θk} / Σ_{m=1}^K e^{θm} )

Prove directly that this is convex in θ by showing that the Hessian matrix is positive semidefinite via the
following steps.

(a) Denote the Hessian matrix by −∇^2 L(θ). The j, k-th element of this matrix is −∂^2 L(θ)/∂θj ∂θk. Let q be the
K × 1 vector of the probabilities qj = e^{θj} / Σ_{k=1}^K e^{θk}, j = 1, . . . , K. Show that

−∇^2 L(θ) = diag(q) − q q^T

(b) Show that the Hessian is positive semidefinite by showing that for any v ∈ R^K

−v^T ∇^2 L(θ) v = Σ_{k=1}^K qk vk^2 − ( Σ_{k=1}^K qk vk )^2 ≥ 0.

Hint: You can interpret the two sums in the previous expression as expectations and apply Jensen’s
inequality.

Lecture 10: Linear Models

The main idea of the Generalized Linear Model (GLM) is to model p(y|x) in terms of a linear function (i.e.,
weighted combination) of the features. Specifically, given a chosen probability distribution for y, we set the
natural parameter to be θ = wT x, and so we will write this model as p(y|wT x). In other words, we consider a
parametric family of models for the conditional distribution of y given x indexed by w ∈ Rd, which we denote
as { p(y|w^T x) : w ∈ Rd }. The MLE of w is the solution to

min_w − Σ_{i=1}^n log p(yi|w^T xi) .

This will be relatively easy to solve if − log p(y|wT x) is differentiable and convex in w, which is the case for
GLMs. In other words, p(y|wT x) ∝ exp(−ℓ(y, wT x)) for a function ℓ satisfying the following properties:

1. ℓ is convex in w.

2. ℓ : R → [0, ∞) so that p(y|wT x) = exp(log p(y|wT x)) ∝ exp(−ℓ(y, wT x)) ∈ [0, 1].

The function ℓ is called a loss function that measures the error/distortion between yi and the value predicted by w^T xi.
The general form of the optimization is

min_w Σ_{i=1}^n ℓ(yi, w^T xi)

If ℓ is convex in w, then the overall optimization is also convex.

10.1. Common Loss Functions

The GLM is a systematic framework for obtaining convex loss functions that are matched to the (assumed) data
distribution. Here is a list of common loss functions used in machine learning for regression and/or binary clas-
sification. Some are derived via the GLM framework (e.g., quadratic, logistic), while others are not (e.g., hinge
loss).

Quadratic/Gaussian: ℓ(yi , wT xi ) = (yi − wT xi )2

Absolute/Laplacian: ℓ(yi , wT xi ) = |yi − wT xi |

Logistic: ℓ(yi , wT xi ) = log(1 + exp(−yi wT xi ))

Hinge: ℓ(yi , wT xi ) = max(0, 1 − yi wT xi )

0/1: ℓ(yi , wT xi ) = 1{yi wT xi <0}

where we use the convention yi ∈ {−1, +1} for the final three “classification” loss functions. All of these are
convex functions, and thus easy to optimize, except for the 0/1 loss. Notice that in the context of classification,

we can express the quadratic/Gaussian loss and the absolute/Laplacian loss in terms of yi xTi w as well, since
yi ∈ {−1, +1} implies that |yi | = 1 and therefore

|yi − xTi w| = |yi ||1 − yi xTi w| = |1 − yi xTi w| .

Figure 10.1 compares the various loss functions in the binary classification setting as a function of yi wT xi . Recall
that the classification decision will be sign(xTi w). Thus, if yi wT xi > 0, then the decision is correct. The 0/1
loss is ideal, since its expected value is exactly the probability of error. The other loss functions can be viewed
as convex approximations to the 0/1 loss. The absolute value and quadratic losses have the undesirable feature
of assigning large losses to some points that are correctly classified (these losses are more suitable for regression
problems). The logistic and hinge losses reduce this problem, and thus they are preferred for classification tasks.

Figure 10.1: Comparison of loss functions for binary classification, with yi ∈ {−1, +1}.
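For reference, here is a small Python sketch (illustrative only, not from the notes) of these losses written as functions of the margin m = yi w^T xi, which is how they are compared in Figure 10.1.

    import numpy as np

    # Losses as functions of the margin m = y_i * w^T x_i (labels in {-1, +1}).
    def quadratic(m): return (1 - m) ** 2          # (y - w^T x)^2 when |y| = 1
    def absolute(m):  return np.abs(1 - m)         # |y - w^T x| when |y| = 1
    def logistic(m):  return np.log1p(np.exp(-m))  # log(1 + exp(-m))
    def hinge(m):     return np.maximum(0, 1 - m)  # max(0, 1 - m)
    def zero_one(m):  return (m < 0).astype(float) # 1{m < 0}

    m = np.linspace(-2, 3, 7)
    for f in (quadratic, absolute, logistic, hinge, zero_one):
        print(f.__name__, f(m))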

10.2. Optimization

In general, the optimization problem

min_w Σ_{i=1}^n ℓ(yi, w^T xi)

does not have a closed-form solution, and we need to solve it by gradient descent or other iterative algorithms.
Denote the gradient of the loss function with respect to w by ∇ℓ. Then gradient descent starts from an initial w0
and proceeds according to the iteration
wt = wt−1 − µ Σ_{i=1}^n ∇ℓ(yi, w_{t−1}^T xi) ,   t = 1, 2, . . .

where µ > 0 is a step size. If the loss is convex in w, then gradient descent will converge to a global minimum
(if µ is sufficiently small). If the loss is continuous but non-convex, then it may converge to a (suboptimal) local
minimum. If the loss is discontinuous, as in the case of 0/1 loss, then gradient descent cannot be used to solve
the optimization.
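A minimal sketch of the gradient descent loop just described, shown here with the quadratic loss as an illustrative example (the data, step size, and iteration count are assumptions, not part of the notes):

    import numpy as np

    def gradient_descent(grad_fn, w0, step=0.01, iters=1000):
        """Generic iteration w_t = w_{t-1} - step * grad_fn(w_{t-1})."""
        w = w0.copy()
        for _ in range(iters):
            w = w - step * grad_fn(w)
        return w

    # Illustrative use with the quadratic loss sum_i (y_i - w^T x_i)^2,
    # whose gradient is -2 X^T (y - X w).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.normal(size=100)
    w_gd = gradient_descent(lambda w: -2 * X.T @ (y - X @ w),
                            np.zeros(3), step=0.001, iters=5000)
    print(w_gd)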

In the special case of the quadratic loss (Gaussian negative log likelihood), we have a closed-form solution to the
optimization. The optimization can be expressed in matrix-vector form by defining X to be an n × d matrix with
ith row equal to xTi and y to be an n × 1 vector with ith row equal to yi . Then the optimization is
min_w ∥y − Xw∥^2 .

Setting the derivative with respect to w to 0 results in the system of equations

−X^T (y − Xw) = 0

and so (assuming X is full rank) the solution is

ŵ = (X^T X)^{−1} X^T y .

10.3. Exercises

1. Consider a binary classification problem with training data {(xi , yi )}ni=1 , with xi ∈ Rd and yi ∈ {−1, +1}.
(a) Suppose that training data are linearly separable. That is, there exists a vector w ∈ Rd such that
yi = sign(wT xi ), i = 1, . . . , n. Construct an example of this situation in d = 2 with n = 3 points that
are not collinear and with at least one from each class. Sketch your example in the two-dimensional
plane. In this simple setting, you can compare the behaviors of learning with various loss functions
without the need for numerical optimizations.
(b) Formulate the learning problem using 0/1-loss. What is the minimum value of the objective (loss)?
Find a solution to the learning problem. Is it unique?
(c) Formulate the learning problem as a logistic regression problem. What is the minimum value of the
logistic regression objective (loss)? Find a solution to the logistic regression. Is it unique?
(d) Derive an expression for the gradient descent steps for logistic regression (i.e., what is the gradient of
the loss function?).
(e) Formulate the learning problem using hinge loss instead. Prove that the hinge loss is convex.
(f) What is the minimum value of the hinge loss objective? Find a solution in this case. Is it unique?
(g) Finally, consider squared error loss. Find a solution in this case. Is it unique?
2. Suppose that yi ∼ Poisson(λi ), i = 1, . . . , n. Derive a GLM for this model, i.e., model the natural parameter
θ = Xw, where X ∈ Rn×p is known and w ∈ Rp is an unknown parameter. We can interpret this sort of
model as follows. The columns of X can denote a basis for representing the natural parameter θ ∈ Rn of
the Poisson process. If p < n, then the basis spans a subspace of Rn of dimension at most p.

(a) Express the distribution of y in the standard exponential family form and identify the natural parameter
θ ∈ Rn . How is θ related to λ = {λi }ni=1 .
(b) For a p < n of your choice, pick a basis X and generate data according to the Poisson model with
θ = Xw. You can use any standard software for your experiment, and you can pick w as you like. For
example, you could take X to be a basis for linear functions of the form θi = w1 i + w2 , i = 1, . . . , n.
(c) If we use a basis for linear functions to represent θ, then what type of function is the canonical
parameter λ?
(d) Now use the GLM to find the MLE of θ. Transform this MLE to obtain an estimate of λ. There are
built-in routines for finding the MLE in most standard software packages, usually under a name like
glmfit.

Lecture 11: Gradient Descent

The least squares optimization problem

min_w ∥y − Xw∥_2^2

has the solution ŵ = (X^T X)^{−1} X^T y, when X ∈ R^{n×d} is full-rank. In general, inverting the d × d matrix
X^T X requires O(d^3) floating point operations, which can be demanding both in terms of time and space
if d is large. Iterative solvers can be used to obtain good approximations to the solution at a lower cost. The
Landweber iteration is given by
wt+1 = wt + τ X T (y − Xwt ) , for t = 0, 1, . . . for some τ > 0.
This is equivalent to a gradient descent method, which we will examine in these notes. This iteration requires
matrix multiplications X^T X wt at each step. If d and n, the number of training examples, are large, then even
this iterative method can be prohibitive. In recent years, dataset sizes have grown faster than computer speeds,
and consequently large-scale machine learning is limited by computing time rather than the sample size. Incre-
mental versions of gradient descent, that process just one sample at each step rather than the whole dataset, are
increasingly useful because they are scalable to extremely large datasets and problem sizes. Incremental gradient
descent algorithms are also readily extended to handle regularized methods such as ridge regression and lasso,
and a variety of loss functions including the hinge loss. Such algorithms are the focus of this note.

11.1. Gradient Descent

Suppose we are given a training set {xi , yi }ni=1 and consider the least squares problem
w∗ = arg min_{w∈R^d} Σ_{i=1}^n (yi − w^T xi)^2 .    (11.1)

If X is full-rank, then a closed-form solution exists

w∗ = arg min_{w∈R^d} ∥y − Xw∥_2^2 = (X^T X)^{−1} X^T y .    (11.2)

An alternative to the matrix-inverse approach is to minimize the squared error using gradient descent. This
requires computing the gradient of the squared error, which is −2X T (y − Xw). Note that the gradient is zero at
the optimal solution, so the optimal w∗ is the solution to the normal equations X T Xw = X T y.

To gain a further insight, consider the gradient descent algorithm. Starting with an initial weight vector w0 (e.g.,
w0 = 0), gradient descent iterates the following step for t = 1, 2, . . .
wt = wt−1 − (γ/2) ∇∥y − Xw∥_2^2 |_{w=wt−1}
   = wt−1 + γ X^T (y − X wt−1)
where γ > 0 is a step-size. Note that the algorithm takes a step in the negative gradient direction (i.e., ‘downhill’).
The choice of the step size is crucial. If the steps are too large, then the algorithm may diverge. If they are too
small, then convergence may take a long time. We can understand the effect of the step size as follows. Note that
we can write the iterates as
wt = wt−1 + γ(X T y − X T Xwt−1 )
= wt−1 + γX T X((X T X)−1 X T y − wt−1 )
= wt−1 − γX T X(wt−1 − w∗ )

Subtracting w∗ from both sides gives us

vt = vt−1 − γ X^T X vt−1

where vt = wt − w∗, t = 1, 2, . . . So we have

vt = (I − γ X^T X) vt−1
   = (I − γ X^T X)^t v0

Thus the sequence vt → 0 if all the eigenvalues of (I − γ X^T X) are less than 1 in magnitude. This holds⁷ if γ < 2/λ_max(X^T X).
We see that the eigenvalues of X^T X play a key role in gradient descent algorithms. Let α < 1 denote the largest
magnitude among the eigenvalues of (I − γ X^T X). Then for any initialization w0, we have

∥wt − w∗∥ ≤ α^t ∥w0 − w∗∥ = O(α^t)

which shows that the error of gradient descent converges exponentially in t. This is a consequence of the fact that
the quadratic loss with a full-rank X is a strongly convex function. Figure 11.1 depicts convex functions and discusses
the notion of strong convexity.
Figure 11.1: A function f is said to be convex if for all λ ∈ [0, 1] and w1, w2 we have λf(w1) + (1 − λ)f(w2) ≥
f(λw1 + (1 − λ)w2). If f is differentiable, then an equivalent definition is that f(w2) ≥ f(w1) + ∇f(w1)^T (w2 −
w1). A function is strictly convex if both inequalities are strict (i.e., hold with ≥ replaced by >). A function f is
said to be α-strongly convex if f(w2) ≥ f(w1) + ∇f(w1)^T (w2 − w1) + (α/2)∥w2 − w1∥_2^2 for all w1, w2. If f is
twice differentiable, then an equivalent definition is that ∇^2 f(w) ⪰ αI for all w.

11.2. Stochastic Gradient Descent for Least Squares

Stochastic gradient descent (SGD) is an incremental version of gradient descent, where the gradient is replaced
by the gradient with respect to just one training example, rather than the entire training set. The SGD iterates are
wt = wt−1 − (γ/2) ∇(y_{it} − x_{it}^T w)^2 |_{w=wt−1}
   = wt−1 + γ (y_{it} − w_{t−1}^T x_{it}) x_{it} ,

where (yit , xit ) is one of the training examples. There are several choices for the training example used at each
step. For example, the algorithm can simply cycle through the training examples as it iterates: it = [t mod n] + 1.
⁷ Let U D U^T = I − γ X^T X be the eigendecomposition, where D is diagonal and U is orthogonal (i.e., U^T U = I). Note that
U^T (I − γ X^T X) U = I − γ U^T X^T X U, so U^T X^T X U is diagonal and its diagonal elements are the non-negative eigenvalues of
X^T X. Let λi(X^T X) denote the ith eigenvalue of X^T X and note that the convergence condition becomes |1 − γ λi(X^T X)| < 1.

This form of the algorithm is usually referred to as incremental gradient descent. The term stochastic gradient descent
refers to selecting each example uniformly at random from the full dataset. In this case, the average or expected value of the
gradient is equal to the full gradient; i.e., it ∼ uniform(1, 2, . . . , n) and

E[ ∇(y_{it} − x_{it}^T w)^2 ] = (1/n) Σ_{i=1}^n ∇(yi − x_i^T w)^2 .

One can think of the SGD algorithm as considering each term in the sum of (11.1) individually. Geometrically,
for t > d the complete sum of (11.1) tends to look like a convex, quadratic bowl while each individual term is
described by a degenerate quadratic in the sense that in all but 1 of the d orthogonal directions, the function is flat.
This concept is illustrated in Figure 11.2 with ft equal to (yit − xTit w)2 . Intuitively, each individual function ft
only tells us about at most one dimension of the total d so we should anticipate the algorithm will require t ≫ d
iterations.

Figure 11.2 (depicting f1(w) + f2(w) + · · · + fT(w) = Σ_{t=1}^T ft(w)): The SGD algorithm can be thought of as considering each of the loss terms of (11.1) individually.
Because each term is “flat” in all but 1 of the total d directions, this implies that each term is convex but not
strongly convex (see Figure 11.1). However, if T > d we typically have that the complete sum is strongly convex,
which can be exploited to achieve faster rates of convergence.
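A minimal SGD sketch for the least squares problem (11.1), with synthetic data, a fixed step size, and uniformly random example selection as illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 1000, 5
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.1 * rng.normal(size=n)

    w = np.zeros(d)
    gamma = 0.01
    for t in range(20000):
        i = rng.integers(n)                    # pick one example uniformly at random
        w += gamma * (y[i] - w @ X[i]) * X[i]  # w_t = w_{t-1} + gamma (y_i - w^T x_i) x_i

    print(np.linalg.norm(w - w_true))          # should be small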

11.3. Gradients and Subgradients

In general, the objective function f might not be differentiable. For example, we might replace a squared ℓ2
regularizer with the ℓ1 norm, or change the squared error loss function to the hinge loss. The idea of a gradient
can be extended to non-differentiable functions by introducing the notion of subgradients. Recall that for a convex
function f that is differentiable at w, for all u we have

f (u) ≥ f (w) + (u − w)T ∇f (w), (11.3)

i.e., the gradient at w defines the slope of a tangent that lies below f , as depicted in Figure 11.1 . If f is not
differentiable at w, we can write a similar inequality:

f (u) ≥ f (w) + (u − w)T v (11.4)

where we call v a subgradient. The formal definition is below.

Definition 11.3.1. Any vector v that satisfies (11.4) is called a subgradient of f at w. The set of subgradients of
f at w is called the differential set and denoted ∂f (w). So we write v ∈ ∂f (w).

If f is differentiable at w, then there is only one subgradient at w, and it is equal to the gradient at w. Subgradients
exist even if f is not differentiable at a certain point (e.g., the blue x on the left of Figure 11.1). For such cases,
we can replace the gradient in SGD with any of the subgradients. We will use ∇ft (wt ) in the SGD iterations,
with the understanding that if the gradient does not exist, then a subgradient (any one if there are multiple) is
used instead.

11.4. Exercises

1. (a) Suppose f : R → R and g : Rd → R are both convex functions. In addition, suppose f is non-
decreasing. Prove h(x) := f ◦ g(x) = f (g(x)) is a convex function in x.
(b) Suppose f : R → R is a convex function and non-increasing, and g : Rd → R is a concave function.
Prove h(x) := f ◦ g(x) = f (g(x)) is a convex function in x.
(c) Suppose f : R → R is a convex function, and g : Rd → R is an affine function, namely g(x) =
wT x + b for some known w ∈ Rd and b ∈ R. Prove h(x) := f ◦ g(x) = f (g(x)) is a convex function
in x.

2. (a) Derive an expression for the gradient descent iteration to minimize the ridge regression objective:

min_w ∥y − Xw∥_2^2 + λ∥w∥_2^2 ,   λ > 0 .

(b) Derive subgradient for the absolute value function f (x) = |x|.
(c) Derive an expression for a (sub)gradient descent iteration to minimize the Lasso criterion:

min_w ∥y − Xw∥_2^2 + λ∥w∥_1 ,   λ > 0 .

Lecture 12: Analysis of Stochastic Gradient Descent

To study the convergence behavior of stochastic gradient descent, let us consider the more general problem

w∗ = arg min_{w∈R^d} (1/T) Σ_{t=1}^T ℓt(w)    (12.1)

where each ℓt : Rd → R is a convex function (see Figure 11.1). In the context of LS, ℓt (w) := (yit −xTit w)2 , which
is quadratic and hence convex in w. In logistic regression with yi ∈ {−1, +1}, ℓt (w) := log(1+exp(−yit xTit w)),
which is also convex in w. The problem of (12.1) is known as an unconstrained online convex optimization
program [27]. The general SGD iteration is given by
wt+1 = wt − γt ∇ℓt (wt ) (12.2)
where γt is a positive, non-increasing sequence of step sizes and the algorithm is initialized with some arbitrary
w1 ∈ Rd . Each term in the sum of (12.1) typically corresponds to the loss on a different training example. If
the training set is finite, then the iteration process could make passes over the training set, say in a cyclical or
randomized fashion. The following theorem characterizes the performance of the SGD iteration (12.2) assuming
the gradients are bounded.
Theorem 12.0.1 (see [27], constant stepsize). Let ℓt be convex and ∥∇ℓt(w)∥_2 ≤ G for all t and all w. Further
define the optimal value w∗ = arg min_{w∈R^d} Σ_{t=1}^T ℓt(w). Using algorithm (12.2) with γt = γ (constant stepsize)
and arbitrary w1 ∈ R^d we have

(1/T) Σ_{t=1}^T (ℓt(wt) − ℓt(w∗)) ≤ ∥w1 − w∗∥_2^2 / (2γT) + (γ/2) G^2   for all T

Before proving the theorem, note that this is a strong result. It only uses the fact that the ℓt functions are convex
and that the gradients of ℓt are bounded. In particular, it assumes nothing about how the ℓt functions relate to each
other from time to time.

Proof. We begin by observing that

∥wt+1 − w∗∥_2^2 = ∥wt − γ∇ℓt(wt) − w∗∥_2^2
               = ∥wt − w∗∥_2^2 − 2γ ∇ℓt(wt)^T (wt − w∗) + γ^2 ∥∇ℓt(wt)∥_2^2

and after rearranging we have that

∇ℓt(wt)^T (wt − w∗) ≤ ( ∥wt − w∗∥_2^2 − ∥wt+1 − w∗∥_2^2 ) / (2γ) + (γ/2) G^2 .

By the convexity of ℓt, for all t and wt we have ℓt(w∗) − ℓt(wt) ≥ ∇ℓt(wt)^T (w∗ − wt). Thus, summing both sides
of this equation from t = 1 to T and dividing by T, we have

(1/T) Σ_{t=1}^T ( ℓt(wt) − ℓt(w∗) ) ≤ (1/T) Σ_{t=1}^T [ ( ∥wt − w∗∥_2^2 − ∥wt+1 − w∗∥_2^2 ) / (2γ) + (γ/2) G^2 ]
                                   = ∥w1 − w∗∥_2^2 / (2γT) − ∥w_{T+1} − w∗∥_2^2 / (2γT) + (γ/2) G^2
                                   ≤ ∥w1 − w∗∥_2^2 / (2γT) + (γ/2) G^2 .

Notice that we use the fact that the sum above is a telescoping series.

Note that in Theorem 12.0.1, as we increase the number of iterations (T → ∞), the first term in the bound tends
to zero but the second term does not. One way to ensure that we have a diminishing average error is to choose a
constant stepsize that shrinks with the number of iterations we plan on doing. The following corollary gives this
specialized result.

Corollary 12.0.2. Let ℓt be convex and ∥∇ℓt(w)∥_2 ≤ G for all t and all w. Further define the optimal value
w∗ = arg min_{w∈R^d} Σ_{t=1}^T ℓt(w). Using algorithm (12.2) with γt = 1/√T (constant stepsize) and arbitrary
w1 ∈ R^d we have

(1/T) Σ_{t=1}^T (ℓt(wt) − ℓt(w∗)) ≤ ( ∥w1 − w∗∥_2^2 + G^2 ) / (2√T)   for all T


The proof of Corollary 12.0.2 is straightforward; simply substitute γ = 1/√T into Theorem 12.0.1.

To illustrate how the results above apply to learning problems, consider the case where our goal is to solve
min_w Σ_{i=1}^n ℓi(w), where ℓi(w) is the loss incurred by w on training example (xi, yi). For example, ℓi(w) =
log(1 + exp(−yi w^T xi)) in the case of logistic regression. Let w∗ denote a solution. This solution may not
be unique if the overall sum of losses is not strictly convex. This is one reason that we measure the difference
between objective values rather than the weight vectors themselves. To apply the theory above, let T = kn for
some positive integer k ≥ 1 and let ℓt(w) = ℓ_{it}(w), with it = 1, 2, . . . , n, 1, 2, . . . , n, . . . . That is, let it simply make
cyclic passes over the training set. In this case, we have

(1/T) Σ_{t=1}^T (ℓt(wt) − ℓt(w∗)) = (1/T) Σ_{t=1}^T ℓ_{it}(wt) − (1/n) Σ_{i=1}^n ℓi(w∗)

Now let us apply Corollary 12.0.2, which tells us for large T (i.e., large k in this setting), the difference above
tends to zero. Since w∗ minimizes the sum of losses, this shows that (1/T) Σ_{t=1}^T ℓ_{it}(wt) tends to the global minimum.
This implies that (1/n) Σ_{i=1}^n ℓi(wt) → (1/n) Σ_{i=1}^n ℓi(w∗) as t grows.
1 n 1

Using a very small but constant stepsize, as in Corollary 12.0.2, may lead to slow initial convergence. One
way around this is to use a diminishing stepsize. For technical reasons that will become clear in our analysis
of the diminishing stepsize case, we will modify the algorithm as follows. Assume that the solution w∗ satisfies
∥w∗ ∥ ≤ B for some constant B. This is a reasonable assumption in practice; it just says that the solution vector
isn’t arbitrarily large. Now at each iteration t we do two things:

1. vt+1 = wt − γt ∇ℓt (wt )


2. wt+1 = vt+1 if ∥vt+1∥ ≤ B, and wt+1 = B vt+1/∥vt+1∥ otherwise.    (12.3)

The first step is the gradient descent, the second is a projection step that ensures the weight vectors always satisfy
∥wt∥ ≤ B, which we assume the solution w∗ must also satisfy. We will show that this algorithm, with γt = 1/√t,
satisfies a bound similar to the one we derived for γt = 1/√T. Before giving the result, we will require an auxiliary
result that will help us find bounds.

Lemma 12.0.3. For any T = 1, 2, . . . , the following inequality holds:

Σ_{t=1}^T 1/√t ≤ 2√T

Proof. This can be shown by thinking of the sum as the sum of areas of several rectangles of width 1 and height
1/√t. The sum is bounded above by an integral:

Σ_{t=1}^T 1/√t ≤ 1 + ∫_1^T dt/√t = 1 + 2(√T − 1) = 2√T − 1 ≤ 2√T

We are now ready to prove the bound result with diminishing stepsize.

Theorem 12.0.4 (diminishing stepsize). Let ℓt be convex and assume that ∥∇ℓt(w)∥_2 ≤ G for all t and all w,
and that ∥w∗∥ ≤ B. Further define the optimal value w∗ = arg min_{w∈R^d} Σ_{t=1}^T ℓt(w). Using algorithm (12.3)
with γt = 1/√t (diminishing stepsize) and arbitrary w1 ∈ R^d we have

(1/T) Σ_{t=1}^T (ℓt(wt) − ℓt(w∗)) ≤ (2B^2 + G^2)/√T   for all T

Proof. The first part of the proof is very similar to that of Theorem 12.0.1. First observe that ∥wt − w∗∥ ≤
∥vt − w∗∥ for all t, since the projection step never increases the error. Therefore,

∥wt+1 − w∗∥_2^2 ≤ ∥vt+1 − w∗∥_2^2
               = ∥wt − γt ∇ℓt(wt) − w∗∥_2^2
               = ∥wt − w∗∥_2^2 − 2γt ∇ℓt(wt)^T (wt − w∗) + γt^2 ∥∇ℓt(wt)∥_2^2
               ≤ ∥wt − w∗∥_2^2 − 2γt ∇ℓt(wt)^T (wt − w∗) + γt^2 G^2 .

Proceed as in Theorem 12.0.1 but don't divide by T to obtain:

Σ_{t=1}^T ( ℓt(wt) − ℓt(w∗) ) ≤ Σ_{t=1}^T [ ( ∥wt − w∗∥_2^2 − ∥wt+1 − w∗∥_2^2 ) / (2γt) + (γt/2) G^2 ]

Rearranging, substituting γt = 1/√t, and applying Lemma 12.0.3, we obtain:

Σ_{t=1}^T ( ℓt(wt) − ℓt(w∗) )
  ≤ ∥w1 − w∗∥_2^2 / (2γ1) − ∥w_{T+1} − w∗∥_2^2 / (2γ_{T+1}) + Σ_{t=2}^T ( ∥wt − w∗∥_2^2 / 2 ) ( 1/γt − 1/γ_{t−1} ) + Σ_{t=1}^T (γt/2) G^2
  ≤ ∥w1 − w∗∥_2^2 / 2 + Σ_{t=2}^T ( ∥wt − w∗∥_2^2 / 2 ) ( √t − √(t−1) ) + (G^2/2) Σ_{t=1}^T 1/√t
  ≤ 2B^2 Σ_{t=1}^T ( √t − √(t−1) ) + G^2 √T
  ≤ √T ( 2B^2 + G^2 ) ,

where we used the fact that ∥wt − w∗∥ ≤ ∥wt∥ + ∥w∗∥ ≤ 2B and that Σ_{t=1}^T ( √t − √(t−1) ) is a telescoping
series. Dividing both sides by T completes the proof.

2
√ √
≤ 2B t − t − 1 + G2 T
t=1
√ 
≤ T 2B 2 + G2 ,
PT √ √ 
where we used the fact that ∥wt − w∗ ∥ ≤ ∥wt ∥ + ∥w∗ ∥ ≤ 2B and that t=1 t− t − 1 is a telescoping
series. Divide both sides by T completes the proof.

This agnostic approach to the functions ℓt allows us to apply the theorem to analyzing SGD for least squares. In
this case we have ℓt(w) = (y_{it} − w^T x_{it})^2, so that ∇ℓt(w) = −2(y_{it} − w^T x_{it}) x_{it}, and we see that

∥∇ℓt(w)∥ ≤ 2 ( ∥y_{it} x_{it}∥ + |w^T x_{it}| ∥x_{it}∥ ) ≤ 2 ( max_i |yi| ∥xi∥ + ∥w∥ max_i ∥xi∥^2 ) .

To further simplify this expression, suppose we are in the classification setting where |yi| = 1 and assume that the
features are bounded, ∥xi∥ ≤ C. Then, using ∥w∥ ≤ B, we have

∥∇ℓt(w)∥ ≤ 2 (C + BC^2) =: G .

Plugging this into the theorem, we have

(1/T) Σ_{t=1}^T (ℓt(wt) − ℓt(w∗)) ≤ (2B^2 + G^2)/√T

for all T. The takeaway message is that if we assume nothing about the features, apart from ∥xi∥ ≤ C, then our
residual sum of squared errors converges to the residual using w∗ at a rate proportional to 1/√T.
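A minimal sketch of the projected SGD iteration (12.3) with the diminishing stepsize γt = 1/√t in this least squares / classification setting (the data, the bound B, and the iteration count are illustrative assumptions, not from the notes):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 500, 5
    X = rng.normal(size=(n, d))
    # Rescale any row with norm > 1 so that ||x_i|| <= C = 1.
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
    w_true = rng.normal(size=d)
    y = np.sign(X @ w_true)                      # classification-style labels in {-1, +1}

    B = 10.0                                     # assumed bound on ||w*||
    w = np.zeros(d)
    for t in range(1, 20001):
        gamma = 1.0 / np.sqrt(t)
        i = rng.integers(n)
        grad = -2.0 * (y[i] - w @ X[i]) * X[i]   # gradient of (y_i - w^T x_i)^2
        v = w - gamma * grad                     # step 1: gradient step
        norm_v = np.linalg.norm(v)
        w = v if norm_v <= B else B * v / norm_v # step 2: project back onto {||w|| <= B}

    print(np.mean((y - X @ w) ** 2))             # average squared residual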

12.1. Exercises

1. Derive an expression for an SGD iteration to minimize the sum of logistic losses:
min_w Σ_{i=1}^n log(1 + exp(−yi x_i^T w)) .

2. Prove that the logistic loss function log(1 + exp(−z)) is strictly convex in z ∈ R.

3. Derive an expression for an SGD iteration to minimize the sum of hinge losses:
min_w Σ_{i=1}^n (1 − yi x_i^T w)_+ .

4. Analyze stochastic GD (SGD) for Perceptron Algorithm. Let {xi , yi }ni=1 be a training set with yi ∈
{−1, +1} and ∥xi ∥ ≤ B for all i. Assume that there exists a w satisfying the “margin” condition
yi wT xi ≥ 1 for i = 1, . . . , n. Let w∗ be a vector that has the minimum norm among vectors that sat-
isfy the margin condition. Define the function

f(w) = max_i (1 − yi w^T xi) ,

which is related to the margin (distance) of the classification boundary to the nearest training example (assuming w
correctly classifies all the points, i.e., yi w^T xi > 0 for all i). This can be viewed as a loss function (recall
the hinge loss is max(0, 1 − yi w^T xi), essentially the same thing).

(a) Show that min_{w:∥w∥≤∥w∗∥} f(w) = 0 and show that any w that satisfies f(w) < 1 yields a linear
classifier that correctly classifies the training data.
(b) Show how to calculate a subgradient of f .
(c) Analyze the performance of SGD in this setting, using the theoretical bounds and analysis discussed
in class.

Lecture 13: Bayesian Inference

Consider a family of probability distributions indexed by a parameter θ. The parameter may be a scalar or
multidimensional. In general, we will denote a family of distributions by p(x|θ), θ ∈ Θ, where Θ denotes the
set of all possible values the parameter can take. Viewed as a function of θ with x fixed, p(x|θ) is called the
likelihood (function) of θ. Let p(θ) be a prior probability distribution on θ. This distribution models how
specific values of θ are more or less probable a priori, that is, before observing the data x. Bayes' Rule allows us
to compute the posterior distribution of θ:

p(θ|x) = p(x|θ) p(θ) / p(x)

The posterior distribution reflects the probability of different values of θ in light of the observed data x.

13.1. Back to the Basics

Recall the simple setting where θ takes just one of two values, say 0 and 1. This corresponds to two probability
models for x, p1(x) = p(x|θ = 1) and p0(x) = p(x|θ = 0). Deciding which model is better for observed data
x is a simple binary hypothesis test. The decision rule that minimizes the probability of error is to decide θ̂ = 1
if p(θ = 1|x) > p(θ = 0|x) and decide θ̂ = 0 otherwise (and flip a fair coin if p(θ = 1|x) = p(θ = 0|x)). From
Bayes' Rule, we know that this is equivalent to the likelihood ratio test

p1(x)/p0(x)  ≷  p(θ = 0)/p(θ = 1)

(deciding H1 when the left side is larger and H0 otherwise), where p(θ = 1) = 1 − p(θ = 0) is the prior probability that θ = 1. We see here how the optimal decision
about which model is best for the data hinges on our prior knowledge about θ. Bayesian inference is the natural
extension of this idea to more general settings, e.g., where θ may be a continuous parameter. This is analogous
to the idea of maximum likelihood, but now incorporating prior knowledge about θ that shapes how we decide
which model p(x|θ) is best for a given observation x.
To view this in the general Bayesian framework, consider the following example. Suppose that xi|θ are i.i.d. N(θ, 1), i =
1, . . . , n; i.e., a Gaussian likelihood function. Let the prior probability model be p(θ) = (1/2)δ(θ) + (1/2)δ(θ − µ), where
δ(θ − µ) is the Dirac delta function at µ. This means the prior is zero at all points except 0 or µ. The MLE of θ is
θ̂ = (1/n) Σ_{i=1}^n xi. However, the prior allows for only two possible probability models, Gaussian with mean θ = 0 or
θ = µ (since the prior is zero at all other values of θ). This implies that the posterior p(θ|x) is also nonzero only
at these two points. The best model is the one with larger posterior probability. This is equivalent to the simple
hypothesis testing problem above.

13.2. Posterior Distributions and Decisions

Bayesian inference methods have three core elements.

Prior Distribution: p(θ) encodes prior knowledge about which values of θ are more plausible models for the
inference problem at hand. In essence, p(θ) is a non-negative weighting function over the set Θ of all values

that θ may take. Prior knowledge can come in the form of physical reasoning, desired constraints or regularity
properties, or beliefs about θ.
Likelihood Function: The likelihood function is p(x|θ) viewed as a function of θ. Maximizing this function
itself is maximum likelihood estimation.
Posterior Distribution: The posterior distribution p(θ|x) ∝ p(x|θ)p(θ) combines prior knowledge with infor-
mation derived from the data x. Maximizing the posterior with respect to θ produces the Maximum a
Posteriori (MAP) estimator. Note that log p(θ|x) = log p(x|θ) + log p(θ) + constant, so the log-posterior
is just a linear combination of the log-likelihood and the log-prior. From this point of view, − log p(θ) can
be viewed as a regularization term in the estimation process.

Bayesian inference methods consider the full posterior distribution, since it tells the probability of any value of θ.
Often, we are interested in a specific estimate of θ. The two most common estimators are

Maximum a Posteriori Estimator: θ̂_MAP = arg max_θ p(θ|x), the mode of p(θ|x)

Posterior Mean Estimator: θ̂_PM = ∫ θ p(θ|x) dθ, the mean of p(θ|x)

13.3. Example: Temperature Estimation

Consider a family of probability distributions indexed by a parameter θ. The parameter may be a scalar or multi-
dimensional. In general, we will denote a family of density functions by p(x|θ), θ ∈ Θ, where Θ denotes the set
of all possible values the parameter can take. Recall that the Maximum Likelihood Estimator (MLE) is

θ̂_MLE = arg max_{θ∈Θ} p(x|θ)

where p(x|θ) as a function of x with the parameter θ fixed is the probability density function or mass function.
Viewing p(x|θ) as a function of θ with x fixed is called the “likelihood function.”

The MLE procedure implicitly treats all θ ∈ Θ as equally plausible, but this might not be reasonable in all
situations. For example, suppose that x is a temperature measurement somewhere on the surface of the earth
and θ is the mean temperature at that location. We know that temperatures outside the range −100 ◦C ≤ x ≤
100 ◦C have never been observed anywhere on earth, so we can safely restrict θ to the interval [−100, 100]; i.e.,
Θ = [−100, 100]. However, we also know that it is much more probable that temperatures will be in the range
[−30, 30] than below −30 or above +30. So perhaps it makes sense to “weight” different temperature ranges
more than others, since they are simply more probable a priori; i.e., before making any measurements.

One way to weight the set Θ to reflect prior knowledge of the plausibility of different θ (and hence different p(x|θ)
models) is to place a prior probability distribution on Θ. Let p(θ) denote such a distribution. We can view p(θ) as
a non-negative weighting function over the set Θ, and use this weighting to modify our optimization as follows

θ̂_MAP = arg max_{θ∈Θ} p(x|θ) p(θ)

This optimization tends to prefer solutions where p(θ) is large. This is called the Maximum a Posteriori (MAP)
estimator. The name MAP derives from the fact that

arg max_{θ∈Θ} p(x|θ) p(θ) = arg max_{θ∈Θ} p(x|θ) p(θ) / p(x) = arg max_{θ∈Θ} p(θ|x)

and p(θ|x) is called the posterior distribution of θ given x.

Continuing with the temperature example above, suppose the prior is

p(θ) = 1/130  for −30 ≤ θ ≤ 30,   and   p(θ) = 1/260  for 30 < |θ| ≤ 100.

In other words, the prior places twice as much probability density on values in the range [−30, 30]. Then |θ̂_MAP| ≤ |θ̂_MLE|. This
is easy to verify. Suppose that the likelihood achieves its maximum on the interval [−30, 30]. Then so does the
posterior density function. If the likelihood is maximized outside this region, then the maximum of the posterior
may be at the same point or at some other point in the range [−30, 30], due to the scaling factor of 2 in this range.
So in this case the prior biases the estimator towards lower temperatures. Although bias is usually undesirable,
the MAP estimator may also have a lower variance (since it may reduce the magnitude of the estimate), and thus
the overall mean-squared error of the MAP estimator may be smaller than that of the MLE. We will look at a more
concrete example of this bias-variance tradeoff later in this note.

13.4. Example: Twitter Monitoring

Suppose that we are monitoring Twitter for mentions of a particular topic or hashtag. Each hour for n hours we
count how many tweets were posted on the topic. Denote these counts by x = (x1 , . . . , xn ). We will assume the
counts are i.i.d. The Poisson probability distribution is a reasonable model for these data:
p(x|θ) = Π_{i=1}^n e^{−θ} θ^{xi} / xi! ,   θ > 0 .

The parameter θ is the mean of the Poisson distribution, and the MLE is given by

θ̂_MLE = (1/n) Σ_{i=1}^n xi .

Now imagine that we have some prior knowledge about whether the topic is hot and trending (e.g., probably a
large θ) or rare (i.e., small θ). We can represent this knowledge in terms of an exponential prior distribution:

p(θ) = αe−αθ , α > 0 .

The larger the value of α, the more quickly the prior density function decays away from 0. As α → 0, the prior
tends to a uniform (flat) distribution. The posterior distribution is
p(θ|x) ∝ α e^{−αθ} Π_{i=1}^n e^{−θ} θ^{xi}/xi! = α ( Π_{i=1}^n θ^{xi}/xi! ) e^{−θ(n+α)}

and

−log p(θ|x) = θ(n + α) − Σ_{i=1}^n xi log(θ) + constant .

Minimizing this expression with respect to θ yields the MAP estimator

θ̂_MAP = (1/(n + α)) Σ_{i=1}^n xi = (n/(n + α)) θ̂_MLE .

So we see that the MAP estimator is a “shrunken” version of the MLE (i.e., it is scaled down towards 0 by a factor
of n/(n + α)). Notice that as the sample size n grows, the MAP estimator converges to the MLE. This illustrates
a general fact: the prior only plays a significant role if the sample size is relatively small.

This shrinkage may be desirable if we are only going to count for a few hours and we believe that the number
of tweets on the topic will probably be relatively small. To understand this further, consider the bias-variance
decomposition of the mean-squared error (MSE). If θ̂ denotes any estimator of the true value θ, then

MSE(θ̂) := E[(θ − θ̂)^2] = E[(θ − E[θ̂] + E[θ̂] − θ̂)^2]
         = (θ − E[θ̂])^2 + E[(E[θ̂] − θ̂)^2]
           (squared bias)   (variance)

Note that the cross-term E[(θ − E[θ̂])(E[θ̂] − θ̂)] = (θ − E[θ̂]) E[(E[θ̂] − θ̂)] = 0. Now let us compute the bias and
variance of each estimator above. For the MLE we have
E[θ̂_MLE] = (1/n) Σ_{i=1}^n E[xi] = θ

and so it is unbiased. Since Σ_{i=1}^n xi is a sum of independent random variables, its variance is equal to the sum
of the variances of the xi. The variance of xi ∼ Poisson(θ) is θ. So the variance of the MLE is

V[θ̂_MLE] = Σ_{i=1}^n V[xi/n] = θ/n

Since the MAP estimator is just a scaled version of the MLE, its mean and variance are

E[θ̂_MAP] = (n/(n + α)) θ
V[θ̂_MAP] = (n/(n + α))^2 (θ/n)

So its variance is smaller than that of the MLE, by a factor of (1/(1 + α/n))^2, but it is biased. The squared bias of the
MAP estimator is

(θ − E[θ̂_MAP])^2 = ( θ − (n/(n + α)) θ )^2 = (α/(n + α))^2 θ^2

So the two estimators have the following MSEs

MSE(θ̂_MLE) = θ/n
MSE(θ̂_MAP) = (α/(n + α))^2 θ^2 + (n/(n + α))^2 (θ/n)
Since we don’t know θ, it is impossible to determine which estimator has a lower MSE. However, we can gain
some insight by minimizing this quantity. Setting the derivative with respect to α to zero yields the optimal value
α∗ = 1/θ ,
which tends to 0 as θ grows (i.e., if θ is very large, then we should use the MLE). So if our a priori expectation is
that we will get about θguess tweets on the topic per hour, then we could set α using this value.
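A small Monte Carlo sketch (with arbitrary, illustrative choices of θ, n, and α, not taken from the notes) that checks the two MSE formulas above by simulation:

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n = 5.0, 10
    alpha = 1.0 / theta                       # the value minimizing the MAP MSE
    trials = 100_000

    x = rng.poisson(theta, size=(trials, n))  # each row is one dataset x_1, ..., x_n
    theta_mle = x.mean(axis=1)                # (1/n) sum_i x_i
    theta_map = x.sum(axis=1) / (n + alpha)   # shrunken estimate

    print("MSE(MLE):", np.mean((theta_mle - theta) ** 2), "theory:", theta / n)
    print("MSE(MAP):", np.mean((theta_map - theta) ** 2),
          "theory:", (alpha / (n + alpha)) ** 2 * theta ** 2
                     + (n / (n + alpha)) ** 2 * theta / n)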

13.5. Multivariate Normal Distributions in Bayesian Inference

The multivariate Gaussian/normal (MVN) distribution holds a special place in Bayesian inference. The Gaussian
distribution is, of course, one of the most important models of probabilistic uncertainty. The Central Limit The-
orem tells us that the sum of many independent random effects tends to a Gaussian distribution. Its prominence
in Bayesian inference owes largely to the fact that if both the likelihood and prior distribution are multivariate
Gaussian, then so is the posterior distribution. In fact, the mean and covariance of the posterior can be computed
by simple linear-algebraic operations applied to the means and covariances of the likelihood and prior. This led to
important Bayesian methods like Wiener [25] and Kalman [17] filters, which are widely used in signal and image
processing and control systems.

The following theorem demonstrates a special property of jointly MVN random variables.

Theorem 13.5.1. Let x and y be jointly Gaussian random vectors, whose joint distribution can be expressed as

(x, y) ∼ N( (µx, µy) , [Σxx Σxy; Σyx Σyy] )

(the covariance is partitioned into the blocks Σxx, Σxy, Σyx, Σyy). Then the conditional distribution of y given x is

y|x ∼ N( µy + Σyx Σxx^{−1} (x − µx) , Σyy − Σyx Σxx^{−1} Σxy ) .

Proof. Without loss of generality assume that x and y are zero-mean random vectors. Therefore

p(y|x) = p(x, y)/p(x) ∝ |Σ|^{−1/2} exp( −(1/2) (x, y)^T Σ^{−1} (x, y) ) / ( |Σxx|^{−1/2} exp( −(1/2) x^T Σxx^{−1} x ) )

where

Σ = [Σxx Σxy; Σyx Σyy] .

To simplify the formula we need to determine Σ^{−1}. The inverse can be written as:

[Σxx Σxy; Σyx Σyy]^{−1} = [Σxx^{−1} 0; 0 0] + [−Σxx^{−1} Σxy; I] Q^{−1} [−Σyx Σxx^{−1}  I]

where

Q := Σyy − Σyx Σxx^{−1} Σxy .

This formula for the inverse is easily verified by multiplying it by Σ to get the identity matrix. Substituting the
inverse into p(y|x) yields

p(y|x) ∝ |Q|^{−1/2} exp( −(1/2) (y − Σyx Σxx^{−1} x)^T Q^{−1} (y − Σyx Σxx^{−1} x) )

which shows that y|x ∼ N(Σyx Σxx^{−1} x, Q). For the general case when E[x] = µx and E[y] = µy, then

(y − µy)|(x − µx) ∼ N( Σyx Σxx^{−1} (x − µx), Q )

and hence

y|x ∼ N( µy + Σyx Σxx^{−1} (x − µx), Q ) .

To apply this result to Bayesian inference, consider the likelihood x|θ ∼ N(θ, Σ), for some covariance Σ.
Assume an MVN prior θ ∼ N(0, Σθ,θ). Since x = θ + N(0, Σ), the marginal distribution is x ∼ N(0, Σ + Σθ,θ)
and the cross-covariance between x and θ is just the variance common to both, i.e., Σx,θ = Σθ,θ. The result
above tells us that

θ|x ∼ N( Σθ,θ (Σ + Σθ,θ)^{−1} x , Σθ,θ − Σθ,θ (Σ + Σθ,θ)^{−1} Σθ,θ ) .

The posterior mean and MAP estimator are the same,

θ̂ = Σθ,θ (Σ + Σθ,θ)^{−1} x ,

which is referred to as the Wiener filter in the signal processing literature.
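A minimal numerical sketch of this Gaussian posterior / Wiener filter computation (the dimensions and covariances below are illustrative assumptions, not from the notes):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 4
    Sigma = 0.5 * np.eye(d)                   # likelihood (noise) covariance
    Sigma_theta = np.eye(d)                   # prior covariance of theta

    theta = rng.multivariate_normal(np.zeros(d), Sigma_theta)
    x = theta + rng.multivariate_normal(np.zeros(d), Sigma)   # x | theta ~ N(theta, Sigma)

    # Posterior mean (Wiener filter) and posterior covariance.
    K = Sigma_theta @ np.linalg.inv(Sigma + Sigma_theta)
    theta_hat = K @ x
    post_cov = Sigma_theta - K @ Sigma_theta

    print(theta_hat, theta)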

13.6. Bayesian Linear Modeling

Recall the idea of a Generalized Linear Model (GLM) is to model p(y|x) in terms of a linear function (i.e.,
weighted combination) of the features. Specifically, given a chosen probability distribution for y, we set the
natural parameter to be θ = wT x, and so we will write this model as p(y|wT x). GLMs have the special property
that p(y|wT x) ∝ exp(−ℓ(y, wT x)) for a convex loss function ℓ. The Bayesian approach here is to place a prior
probability model on the weights w. Let p(w) denote the prior. The posterior is

p(w|x, y) ∝ p(w) exp(−ℓ(y, w^T x)) .

Given data {(xi, yi)}_{i=1}^n, the MAP estimator of w is the solution to

min_w Σ_{i=1}^n ℓ(yi, w^T xi) − log p(w) .

For example, priors of the form p(w) ∝ exp(−(λ/2)∥w∥_2^2) or p(w) ∝ exp(−λ∥w∥_1) lead to “ridge” or “lasso”
regularization, respectively.
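For instance, combining the squared error loss with the Gaussian prior p(w) ∝ exp(−(λ/2)∥w∥_2^2) gives the familiar ridge-regression closed form for the MAP estimate; a minimal sketch with illustrative data and λ (assumptions, not from the notes):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 100, 5, 1.0
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + rng.normal(size=n)

    # MAP with Gaussian likelihood and Gaussian prior:
    # minimize ||y - Xw||^2 + lam * ||w||^2  =>  w = (X^T X + lam I)^{-1} X^T y
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print(w_map)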

13.7. Exercises

1. Consider a sequence of independent coin tosses (i.i.d. Bernoulli random variables)


x = [x1 , . . . , xn ]T
Let xi = 1 denote “heads” and xi = 0 “tails,” and P(xi = 1) = θ, P(xi = 0) = 1 − θ. If we observe x, then
the likelihood function is
p(x|θ) = θ^{s(x)} (1 − θ)^{n − s(x)}

where s(x) = Σ_{i=1}^n xi. Suppose we are interested in estimating θ given x. Let's take a Bayesian approach.
A priori we believe that θ ≈ 1/2 (i.e., we’re tossing a reasonably fair coin), and we know 0 ≤ θ ≤ 1.

a. Show that a beta density prior for θ reflects this prior information. The beta density is given by

p(θ; α) = ( Γ(2α) / Γ(α)^2 ) θ^{α−1} (1 − θ)^{α−1}

where α ≥ 1 is a shape parameter to be specified by the user and Γ is the Euler gamma function
Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt.

b. Show that the beta density is a conjugate prior for the Bernoulli distribution. That is, show that the
posterior density also has a beta form.
c. Find the posterior mean estimator $E[\theta|x]$, the mean of the posterior distribution, which is a function of $\alpha$ and the data. How does $\alpha$ affect estimator performance? What is the asymptotic (large $n$) behavior of the estimator?

2. Suppose $x \sim p(x|\theta)$ and let $R$ be a risk function (e.g., MSE). An estimator $\hat{\theta}$ is said to be minimax optimal if it minimizes the maximum risk $\sup_\theta R(\hat{\theta}, \theta)$. The minimax estimator is related to Bayesian estimators as follows. Let $p(\theta)$ denote a prior probability distribution and let
$$ \hat{\theta}_p := \arg\min_{\hat{\theta}(x)} \int R(\hat{\theta}(x), \theta)\, p(\theta)\, d\theta , $$
where the minimization is over all possible estimators. The estimator $\hat{\theta}_p$ minimizes the average or Bayes risk under prior $p$. For example, if the risk is MSE, then $\hat{\theta}_p = E[\theta|x]$, the mean of the posterior distribution. If the estimator $\hat{\theta}_p$ satisfies
$$ \int R(\hat{\theta}_p, \theta)\, p(\theta)\, d\theta = \sup_\theta R(\hat{\theta}_p, \theta) , $$
then it is minimax optimal. To see this, let $\hat{\theta}$ be any estimator and note that
$$ \sup_\theta R(\hat{\theta}, \theta) \;\ge\; \int R(\hat{\theta}, \theta)\, p(\theta)\, d\theta \;\ge\; \int R(\hat{\theta}_p, \theta)\, p(\theta)\, d\theta = \sup_\theta R(\hat{\theta}_p, \theta) . $$
A corollary of this is that if $\hat{\theta}_p$ has constant risk, then it is minimax (since the average risk will be equal to the worst case). Use this to find a minimax optimal estimator under MSE risk for $x \sim \mathrm{Binomial}(\theta, n)$, the distribution arising in problem 1 above. Hint: Use a Beta prior with parameter $\alpha$ chosen so that the posterior mean has constant risk.

3. Recall that the solution to the optimization
$$ \min_{\theta\in\mathbb{R}^n} \|y - \theta\|_2^2 + \lambda\|\theta\|_1 $$
is the soft-thresholding operator applied to each element of $y$. Reasoning in a similar manner, determine the solution to
$$ \min_{\theta\in\mathbb{R}^n} \|y - \theta\|_2^2 + \lambda\|\theta\|_0 $$
where $\|\theta\|_0$ is equal to the number of non-zero elements in $\theta$.

4. The idea of a gradient can be extended to non-differentiable functions by introducing the notion of subgra-
dients. Recall that for a convex function f that is differentiable at w, for all u we have

f (u) ≥ f (w) + (u − w)T ∇f (w),

i.e., the gradient at w defines the slope of a tangent that lies below f . If f is not differentiable at w, we can
write a similar inequality:
f (u) ≥ f (w) + (u − w)T v (13.1)
where we call v a subgradient. The formal definition is below.

Definition 13.7.1. Any vector v that satisfies (13.1) is called a subgradient of f at w. The set of subgradients
of f at w is called the differential set and denoted ∂f (w). So we write v ∈ ∂f (w).

If $f$ is differentiable at $w$, then there is only one subgradient at $w$, and it is equal to the gradient at $w$. Let $\{(x_i, y_i)\}_{i=1}^n$ be a set of training data and arrange these data into a matrix $X$ and vector $y$. Derive the SGD steps for the following optimization problems. If the gradient does not exist at a point, then you may use any subgradient instead (in general there may be many, so you can pick any one).

(a) $\min_w \|y - Xw\|_2$
(b) $\min_w \|y - Xw\|_2 + \lambda\|w\|_2^2$
(c) $\min_w \|y - Xw\|_2 + \lambda\|w\|_1$
(d) $\min_w \sum_{i=1}^n \log(1 + \exp(-y_i w^Tx_i))$
(e) $\min_w \sum_{i=1}^n \max(0, 1 - y_i w^Tx_i)$
(f) $\min_w \sum_{i=1}^n \max(0, 1 - y_i w^Tx_i) + \lambda\|w\|_2^2$

Lecture 14: Proximal Gradient Algorithms

Consider optimization problems of the following form


min f (w) + g(w) ,
w∈Rp

where the functions $f$ and $g$ are convex, and $f$ is also differentiable. Special cases include "ridge" and "lasso" regression, where $f(w) = \sum_{i=1}^n \ell(y_i, w^Tx_i)$ with some convex loss function $\ell$ and $g(w) = \frac{\lambda}{2}\|w\|_2^2$ or $\lambda\|w\|_1$, respectively. $\lambda \ge 0$ is a regularization parameter that adjusts the tradeoff between the loss and the regularization.
A general class of iterative algorithms known as proximal gradient methods can be used to solve these optimiza-
tions. Proximal gradient algorithms are easy to implement when the function g has a computationally efficient
proximal operator, and have state-of-the-art performance.

The proximal operator for $g$ is defined as follows. For any $t > 0$ and $v \in \mathbb{R}^d$,
$$ \mathrm{prox}_{g,t}(v) := \arg\min_u \left\{ \frac{1}{2}\|u - v\|^2 + t\, g(u) \right\} . $$
The solution to the optimization is a point close (proximal) to the input $v$ and with a relatively small $g$ value. The parameter $t$ controls the tradeoff between staying close to $v$ and minimizing $g$. For example, suppose $g(w) = \|w\|_1$. Then the proximal operator is
$$ \mathrm{prox}_{g,t}(v) = \arg\min_u \left\{ \frac{1}{2}\|u - v\|^2 + t\|u\|_1 \right\} = \arg\min_u \sum_{i=1}^d \left\{ \frac{1}{2}(u_i - v_i)^2 + t|u_i| \right\} . $$
We see that the optimization objective is separable in the coordinates. Each coordinate $u_i$ can be optimized separately. In fact, there is a closed-form solution known as the soft-threshold operation
$$ \arg\min_{u_i} \left\{ \frac{1}{2}(u_i - v_i)^2 + t|u_i| \right\} = \mathrm{sign}(v_i)\max(0, |v_i| - t) . $$
The parameter $t$ plays the role of a threshold. Note that if $|v_i| \le t$, then the solution is 0.
Proposition 14.0.2. The solution $\hat{u}$ to the optimization
$$ \min_{u\in\mathbb{R}} \frac{1}{2}(u - v)^2 + t|u| $$
is the soft-threshold operation $\hat{u} = \mathrm{sign}(v)\max(0, |v| - t)$.

Proof. The subgradient of the objective function with respect to $u$ is
$$ \partial_u \left\{ \frac{1}{2}(u - v)^2 + t|u| \right\} = u - v + t\,\mathrm{sign}(u) . $$
The subgradient must be zero at the solution, yielding the equation $v = u + t\,\mathrm{sign}(u)$. Consider $u$ as a function of $v$. If $|v| < t$, then $u = -t\,\mathrm{sign}(u) + v$ implies $\hat{u} = 0$, since otherwise $\mathrm{sign}(\hat{u}) \ne \mathrm{sign}(-t\,\mathrm{sign}(\hat{u}) + v)$. If $v \ge t$, then $\mathrm{sign}(\hat{u}) = +1$ and $\hat{u} = v - t$. To see this, observe that if we assume $\mathrm{sign}(\hat{u}) = -1$, then the solution would be $\hat{u} = v + t > 0$, which contradicts the assumed sign. Similarly, if $v \le -t$, then the solution is $\hat{u} = v + t$.
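A minimal numpy sketch (not part of the original notes) of this elementwise soft-thresholding operation; the test vector and threshold below are arbitrary illustrative values.

    import numpy as np

    def prox_l1(v, t):
        # Elementwise soft-thresholding: sign(v_i) * max(|v_i| - t, 0).
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    v = np.array([3.0, -0.4, 1.2, -2.5, 0.05])
    print(prox_l1(v, t=0.5))   # entries with |v_i| <= 0.5 are set exactly to zero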

This particular case arises in the lasso regression problem, which was a major motivation for the development of
proximal gradient algorithms. This type of algorithm was first introduced for the lasso regression (i.e., squared
error loss and ℓ1 regularization) in [15]. Proximal gradient algorithms were further developed in many subsequent
papers, including [26, 3], and the analysis in this lecture follows ideas developed in these papers.

14.1. Proximal Gradient Algorithm with Squared Error Loss

To introduce the idea of proximal gradient algorithms, let us first consider the special case of squared error loss,
f (w) = ∥y − Xw∥22 . We can write the objective function as

L(w) := ∥y − Xw∥22 + g(w)


= ∥y − Xw(k) + Xw(k) − Xw∥22 + g(w)
= ∥y − Xw(k) ∥22 + 2(y − Xw(k) )T X(w(k) − w) + ∥X(w(k) − w)∥22 + g(w)

The goal of an iterative algorithm is to choose w to reduce our objective. Notice that the first term C := ∥y −
Xw(k) ∥2 is a constant that doesn’t depend on the variable w. The remaining terms are still somewhat complicated,
but we can simplify it using the following bound

L(w) = C + 2(y − Xw(k) )T X(w(k) − w) + ∥X(w(k) − w)∥22 + g(w)


≤ C + 2(y − Xw(k) )T X(w(k) − w) + ∥X∥22 ∥w(k) − w∥22 + g(w)
≤ C + 2(y − Xw(k) )T X(w(k) − w) + t−1 ∥w(k) − w∥22 + g(w) , (14.1)

where $\|X\|_2$ is the spectral norm of $X$ (the largest singular value) and $0 < t < 1/\|X\|_2^2$. Observe that the upper bound on the right-hand side "touches" the original objective $L$ at the point $w = w^{(k)}$. The parameter $t$ will play the same role that it did in the gradient descent iteration. Now we will choose $w$ to minimize this bound, that is

w(k+1) := arg min 2(y − Xw(k) )T X(w(k) − w) + t−1 ∥w(k) − w∥22 + g(w)
w

= arg min 2t(y − Xw(k) )T X(w(k) − w) + ∥w(k) − w∥22 + tg(w) .
w

Note that if we take $w = w^{(k)}$, then the value of the objective above is $g(w^{(k)})$. The minimization produces a value at least as small as this (so no ground is lost), and typically $w^{(k+1)}$ achieves a value strictly less than $g(w^{(k)})$. It is easy to see (sketch a picture of $L$ and the upper bound) that finding $w^{(k+1)}$ to reduce the upper bound must also reduce the original objective, i.e., $\|y - Xw^{(k+1)}\|_2^2 + g(w^{(k+1)}) < \|y - Xw^{(k)}\|_2^2 + g(w^{(k)})$, and progress
is made toward reducing the original objective function. This inequality is strict for the following reason. Let Luk
denote the upper bounding function in (14.1). Then L(w(k+1) ) ≤ Luk (w(k+1) ) < Luk (w(k) ) = L(w(k) ), where the
first inequality is because Luk upper bounds L at all points and the second is strict because Luk is strictly convex
due to the ∥w(k) − w∥2 term (unless w(k) is already the minimum point, in which case we have equality).

Next define v := tX T (y − Xw(k) ) = −tX T (Xw(k) − y). Then we can write the optimization above as

w(k+1) = arg min 2 v T (w(k) − w) + ∥w(k) − w∥22 + tg(w) .
w

So we can “complete the square” to obtain



w(k+1) = arg min ∥v + w(k) − w∥22 − ∥v∥22 + tg(w)
w

= arg min ∥v + w(k) − w∥22 + tg(w) ,
w

since v is a constant that does not depend on the optimization variable w. Finally, define

zk := v + w(k)
= w(k) − tX T (Xw(k) − y) ,

which you will recognize as the gradient descent iterate. Thus, the next iterate is given by

w(k+1) = arg min ∥zk − w∥22 + tg(w) .
w

Note that if $g$ is separable (i.e., it is a sum of terms, each involving just one coordinate of $w$), then the optimization above is separable. This is a separable approximation (and upper bound) of the original optimization. The solution to the optimization is the proximal operator of the function $g$ evaluated at $z_k$. So this sort of iterative optimization is often referred to as a proximal point algorithm. Note that if $g = 0$, then the solution is $w^{(k+1)} = z_k$, the gradient descent iterate. The optimization is also easy to solve for many choices of $g$.

14.2. Proximal Gradient Algorithm

Now let $f(w)$ be any convex loss function. The proximal gradient algorithm follows a simple iteration. Initialize with $w^{(0)}$ arbitrarily chosen and for $k \ge 1$ set

$$ w^{(k)} = \mathrm{prox}_{g,t}\left( w^{(k-1)} - t\nabla f(w^{(k-1)}) \right) , $$
where for any $v \in \mathbb{R}^p$
$$ \mathrm{prox}_{g,t}(v) := \arg\min_u \left\{ \frac{1}{2}\|u - v\|^2 + t g(u) \right\} $$
is called the proximal operator for the function $g$. The parameter value $w^+ = \mathrm{prox}_{g,t}(w - t\nabla f(w))$ minimizes the sum of $g(u)$ and a separable quadratic approximation of $f(u)$ around the point $w$. The separability of this approximation is the key to efficient algorithms. Note that if we define $v = w - t\nabla f(w)$ and if the regularization function is also separable (e.g., $g(u) = \lambda\|u\|_1 = \lambda\sum_{i=1}^p |u_i|$), then we can write the proximal operator optimization as
$$ \min_u \; \sum_{i=1}^p \left\{ \frac{1}{2}(u_i - v_i)^2 + t g(u_i) \right\} , $$
and we can solve for each scalar element separately.
and we can solve for each scalar element separately. In the case g(u) = λ∥u∥1 , the proximal gradient algorithm
iterates a gradient descent step followed by a soft-thresholding operation, and so is sometimes called an iterative
soft-thresholding algorithm (ISTA). The solutions in this case tend to be sparse vectors.
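Below is a short sketch of ISTA for the lasso (not from the original notes). It uses the $\frac{1}{2}\|y - Xw\|_2^2$ convention, step size $t = 1/\|X\|_2^2$, and an illustrative synthetic problem; the sizes, $\lambda$, and iteration count are assumptions made for the example only.

    import numpy as np

    # Sketch of ISTA: proximal gradient for 1/2||y - Xw||^2 + lam*||w||_1.
    # Problem sizes, lam, and iteration count are illustrative assumptions.
    rng = np.random.default_rng(2)
    n, d, lam = 100, 50, 0.5
    X = rng.normal(size=(n, d)) / np.sqrt(n)
    w_true = np.zeros(d); w_true[:5] = [3, -2, 4, 1.5, -3]       # sparse ground truth
    y = X @ w_true + 0.1 * rng.normal(size=n)

    t = 1.0 / np.linalg.norm(X, 2) ** 2                          # step size 1/||X||_2^2
    w = np.zeros(d)
    for _ in range(500):
        z = w - t * X.T @ (X @ w - y)                            # gradient step on the loss
        w = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)    # prox step (soft-threshold)

    print("nonzeros found:", np.flatnonzero(np.abs(w) > 1e-6))

The recovered support typically matches the first five coordinates, illustrating how the soft-threshold prox step produces exactly sparse iterates.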

14.3. Analysis of Proximal Gradient Algorithm

We will employ the following assumptions in our analysis. Unless otherwise noted, norms denote the Euclidean
norm.

1. f is convex on Rp and the gradient ∇f is L-Lipschitz, i.e.,

∥∇f (w) − ∇f (v)∥ ≤ L∥w − v∥ for all w, v ∈ Rp .

2. g is convex

An immediate consequence of the first assumption is the following Taylor series bound.

Lemma 14.3.1.
$$ f(v) \le f(w) + \nabla f(w)^T(v - w) + \frac{L}{2}\|v - w\|^2 . \qquad (14.2) $$

Proof. Since $f$ is differentiable, by the Fundamental Theorem of Calculus
$$ f(v) = f(w) + \int_0^1 \nabla f\big(w + \gamma(v - w)\big)^T (v - w)\, d\gamma . $$
Therefore, we have
$$ f(v) - f(w) - \nabla f(w)^T(v - w) = \int_0^1 \big( \nabla f(w + \gamma(v - w)) - \nabla f(w) \big)^T (v - w)\, d\gamma . $$
By the Cauchy-Schwarz inequality and the Lipschitz assumption above, we have
$$ \big( \nabla f(w + \gamma(v - w)) - \nabla f(w) \big)^T (v - w) \le \|\nabla f(w + \gamma(v - w)) - \nabla f(w)\|\, \|v - w\| \le L\gamma\|v - w\|^2 , $$
and since $\int_0^1 \gamma\, d\gamma = \frac{1}{2}$, the result follows.

Recall that because f is convex, f (v) ≥ f (w) + ∇f (w)T (v − w). So the Lipschitz assumption on the gradient
is guaranteeing that f (v) cannot be too much larger than this.

First note that we can express the proximal gradient update rule as
$$ w^{(k)} = w^{(k-1)} - tF(w^{(k-1)}) , $$
where
$$ F(w) := \frac{1}{t}\big( w - \mathrm{prox}_{g,t}(w - t\nabla f(w)) \big) . $$
We can interpret $tF(w^{(k-1)})$ as the step we take to adjust the weights at each iteration. Next, take $v = w - tF(w)$ in Equation (14.2) to obtain
$$ f(w - tF(w)) \le f(w) - t\nabla f(w)^T F(w) + \frac{t^2 L}{2}\|F(w)\|^2 . $$
If $0 < t \le \frac{1}{L}$, then
$$ f(w - tF(w)) \le f(w) - t\nabla f(w)^T F(w) + \frac{t}{2}\|F(w)\|^2 . $$
We will assume this condition on $t$ for the rest of the analysis. Also, we will make use of the following fact in two places in the proof: if $g$ is any convex function and $x$ is in the subdifferential of $g$ at the point $w$ (if $g$ is differentiable at $w$, then $x$ is the gradient of $g$ at $w$; otherwise $x$ may be any subgradient at $w$), then for any $u \in \mathbb{R}^p$
$$ g(u) \ge g(w) + x^T(u - w) . $$
In words, this says that "tangent" lines lie below the convex function.

Lemma 14.3.2. For all $v \in \mathbb{R}^p$,
$$ f(w - tF(w)) + g(w - tF(w)) \le f(v) + g(v) + F(w)^T(w - v) - \frac{t}{2}\|F(w)\|^2 . \qquad (14.3) $$

Proof. Using the Taylor series bound (14.2),
$$ f(w - tF(w)) + g(w - tF(w)) \le f(w) - t\nabla f(w)^T F(w) + \frac{t}{2}\|F(w)\|^2 + g(w - tF(w)) . $$

By convexity of $f$, for any $v$ we have $f(v) \ge f(w) + \nabla f(w)^T(v - w)$, and so $f(w) \le f(v) + \nabla f(w)^T(w - v)$. This yields
$$ f(w - tF(w)) + g(w - tF(w)) \le f(v) + \nabla f(w)^T(w - v) - t\nabla f(w)^T F(w) + \frac{t}{2}\|F(w)\|^2 + g(w - tF(w)) . \qquad (14.4) $$
Also, since $g$ is convex we can upper bound it in a similar fashion. Because $g$ may not be differentiable, we consider the subdifferentials of $g$. Recall the definition of the $\mathrm{prox}_{g,t}$ operator
$$ \mathrm{prox}_{g,t}(v) := \arg\min_u \underbrace{\left\{ \frac{1}{2}\|u - v\|^2 + t g(u) \right\}}_{=:p(u)} . $$
Now because $\mathrm{prox}_{g,t}(v)$ is a minimizer of $p(u)$, the zero vector $0$ is in the subdifferential of $p$ at the point $\mathrm{prox}_{g,t}(v)$. Note that the subdifferential is
$$ \partial p(\mathrm{prox}_{g,t}(v)) = \mathrm{prox}_{g,t}(v) - v + t\, \partial g(\mathrm{prox}_{g,t}(v)) . $$
Setting this expression to $0$ shows that
$$ \tfrac{1}{t}\big( v - \mathrm{prox}_{g,t}(v) \big) \in \partial g(\mathrm{prox}_{g,t}(v)) . $$
Now take $v = w - t\nabla f(w)$ to obtain
$$ \tfrac{1}{t}\big( w - t\nabla f(w) - \mathrm{prox}_{g,t}(w - t\nabla f(w)) \big) \in \partial g(\mathrm{prox}_{g,t}(w - t\nabla f(w))) . $$
Recall that $\mathrm{prox}_{g,t}(w - t\nabla f(w)) = w - tF(w)$, so we have
$$ F(w) - \nabla f(w) \in \partial g(w - tF(w)) . $$
By convexity of $g$, we have the bound $g(w - tF(w)) \le g(v) + \big( F(w) - \nabla f(w) \big)^T (w - tF(w) - v)$ for all $v$. Plugging this into Equation (14.4) yields
$$ f(w - tF(w)) + g(w - tF(w)) \le f(v) + \nabla f(w)^T(w - v) - t\nabla f(w)^T F(w) + \frac{t}{2}\|F(w)\|^2 + g(v) + \big( F(w) - \nabla f(w) \big)^T (w - tF(w) - v) $$
$$ = f(v) + g(v) + F(w)^T(w - v) - \frac{t}{2}\|F(w)\|^2 . $$

Lemma 14.3.2 shows that each iteration of the algorithm makes progress toward the global minimum. To simplify notation, let us denote the overall objective function by $\phi := f + g$. Take $v = w$ in Equation (14.3) to obtain
$$ \phi(w - tF(w)) \le \phi(w) - \frac{t}{2}\|F(w)\|^2 \le \phi(w) . $$
Now let $w^\star$ denote a global minimizer of $\phi$, and use Equation (14.3) with $v = w^\star$ to obtain
$$ \phi(w - tF(w)) - \phi(w^\star) \le F(w)^T(w - w^\star) - \frac{t}{2}\|F(w)\|^2 = \frac{1}{2t}\left( \|w - w^\star\|^2 - \|w - w^\star - tF(w)\|^2 \right) = \frac{1}{2t}\left( \|w - w^\star\|^2 - \|w - tF(w) - w^\star\|^2 \right) . $$
Since ϕ(w − tF (w)) − ϕ(w⋆ ) ≥ 0, we have

∥w − tF (w) − w⋆ ∥2 ≤ ∥w − w⋆ ∥2 , (14.5)

which shows that each iteration decreases the distance to a global minimizer.

Now we can analyze the rate of convergence of the algorithm. Add the inequalities above for $w = w^{(i-1)}$ and $w - tF(w) = w^{(i)}$, $i = 1, \ldots, k$:
$$ \sum_{i=1}^k \left( \phi(w^{(i)}) - \phi(w^\star) \right) \le \frac{1}{2t}\sum_{i=1}^k \left( \|w^{(i-1)} - w^\star\|^2 - \|w^{(i)} - w^\star\|^2 \right) = \frac{1}{2t}\left( \|w^{(0)} - w^\star\|^2 - \|w^{(k)} - w^\star\|^2 \right) \le \frac{1}{2t}\|w^{(0)} - w^\star\|^2 . $$
Now since we showed above that $\phi(w^{(i)})$ is nonincreasing,
$$ \phi(w^{(k)}) - \phi(w^\star) \le \frac{1}{k}\sum_{i=1}^k \left( \phi(w^{(i)}) - \phi(w^\star) \right) \le \frac{1}{2kt}\|w^{(0)} - w^\star\|^2 . $$
Finally, recall that we can take $t = \frac{1}{L}$, yielding
$$ \phi(w^{(k)}) - \phi(w^\star) \le \frac{L}{2k}\|w^{(0)} - w^\star\|^2 . $$
This shows that $\phi(w^{(k)}) - \phi(w^\star) \le \epsilon$ after $O(\epsilon^{-1})$ iterations.

14.4. Exercises

Suppose that we observe

y = Xw∗ + ϵ (14.6)

where $X \in \mathbb{R}^{n\times n}$ has orthonormal columns and $\epsilon \sim N(0, I)$. Consider the regularized optimization for estimating $w^*$
$$ \min_w \; \frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{p}\|w\|_p^p $$
for $p = 1$ or $2$ and $\lambda > 0$.

1. Show/explain how the optimization problem above can be transformed into an equivalent optimization of the form
$$ \min_w \; \frac{1}{2}\|\tilde{y} - w\|_2^2 + \frac{\lambda}{p}\|w\|_p^p $$
and explain how $\tilde{y}$ is related to $y$.

2. Let λ = 0.

(a) Give an expression for $\hat{w}$, the solution to the optimization above, and
(b) Consider the prediction error for a new observation of the form $y = x^T w^* + \epsilon$, for arbitrary, fixed $x \in \mathbb{R}^n$ and independent noise $\epsilon \sim N(0, 1)$. Show that the expected squared prediction error is
$$ E[(y - x^T\hat{w})^2] = \|x\|_2^2 + 1 . $$

3. Let $\lambda > 0$ and $p = 2$.

(a) Give an expression for $\hat{w}$, the solution to the optimization above in this case, and
(b) Consider the prediction error for a new observation of the form $y = x^T w^* + \epsilon$, as above. Derive an expression for the expected squared prediction error $E[(y - x^T\hat{w})^2]$, and show that it reduces to the expression in the MLE case when $\lambda = 0$.

4. The analysis of proximal gradient assumes that ∇f is L-Lipschitz, and the step size is then assumed to
satisfy t ≤ 1/L. The analysis also assumes that t is also the parameter of the proximal operator proxg,t .
What is the effect of changing the proximal operator parameter to some other values τ > t?

Lecture 15: The Lasso and Soft-Thresholding

The squared error "lasso" regression problem solves the optimization
$$ \min_w \; \frac{1}{2}\|y - Xw\|_2^2 + \lambda\|w\|_1 . $$
Recall that the $\ell_1$ regularization encourages sparse solutions. Suppose that $y \sim N(Xw, \sigma^2 I)$ for some $w \in \mathbb{R}^d$ and that $w$ is sparse (many of the elements of $w$ are exactly zero). Ideally, the solution to the lasso optimization produces a $\hat{w}$ that is also sparse in the same locations. Under certain assumptions on $X$, this sort of result can be proved [18, 7, 13]. The simplest setting where such results can be established is when $X = I$, the identity matrix, which we will study here.

Consider the "direct" observation model where $y \in \mathbb{R}^n$ is given by
$$ y = w + \epsilon , \quad \epsilon \sim N(0, \sigma^2 I) . $$
Suppose that many of the weights/coefficients in $w$ are equal to zero. The MLE of $w$ is simply $y$, and its MSE is $n\sigma^2$. Instead, consider the regularized problem
$$ \min_w \; \frac{1}{2}\|y - w\|_2^2 + \lambda\|w\|_1 . $$
Its solution is the soft-thresholding estimator
$$ \hat{w}_i = \mathrm{sign}(y_i)\max(|y_i| - \lambda, 0) , \quad \lambda > 0 , $$
which can perform much better, especially if $w$ is sparse.

Before we analyze the soft-thresholding estimator, let us consider an ideal thresholding estimator. Suppose that an oracle tells us the magnitude of each $w_i$. The oracle estimator is
$$ \hat{w}_i^O = \begin{cases} y_i & \text{if } |w_i|^2 \ge \sigma^2 \\ 0 & \text{if } |w_i|^2 < \sigma^2 \end{cases} $$
In other words, we estimate a coefficient if and only if the signal power is at least as large as the noise power. The MSE of this estimator is
$$ E\sum_{i=1}^n (\hat{w}_i^O - w_i)^2 = \sum_{i=1}^n \min(|w_i|^2, \sigma^2) . $$
Notice that the MSE of the oracle estimator is always less than or equal to the MSE of the MLE. If $w$ is sparse, then the MSE of the oracle estimator can be much smaller. If all but $k < n$ coefficients are zero, then the MSE of the oracle estimator is at most $k\sigma^2$. Remarkably, the soft-thresholding estimator comes very close to achieving the performance of the oracle, as shown by the following theorem from [14].
The theorem uses the threshold $\lambda = \sqrt{2\sigma^2\log n}$. This choice of threshold is motivated by the following observation. Assume, for the moment, that we observe no signal at all, just noise (i.e., $w_i = 0$ for $i = 1, \ldots, n$). In this case, we should set the threshold so that it is larger than the magnitude of any of the $y_i$ (so they are all set to zero). If we take $\lambda = \sqrt{2\sigma^2\log\frac{n}{\delta}}$, then using the Gaussian tail bound and the union bound we have $P\left( \bigcup_{i=1}^n \{|y_i| \ge \lambda\} \right) \le \delta$.


Theorem 15.0.1. Assume the direct observation model above and let
$$ \hat{w}_i = \mathrm{sign}(y_i)\max(|y_i| - \lambda, 0) $$
with $\lambda = \sqrt{2\sigma^2\log n}$. Then
$$ E\|\hat{w} - w\|_2^2 \le (2\log n + 1)\left\{ \sigma^2 + \sum_{i=1}^n \min(|w_i|^2, \sigma^2) \right\} . $$

The theorem shows that the soft-thresholding estimator mimics the MSE performance of the oracle estimator
to within a factor of roughly 2 log n. For example, if w is k-sparse (with non-zero coefficients larger than σ
in magnitude), then the MSE of the oracle is kσ 2 and the MSE of the soft-thresholding estimator is at most
(2 log n + 1)(k + 1)σ 2 ≈ 2k log n σ 2 when n is large. This also corresponds to a huge improvement over the MLE
if 2k log n ≪ n.
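A quick simulation sketch (not part of the original notes) comparing the MLE ($\hat{w} = y$) with the soft-thresholding estimator in the direct observation model; the values of $n$, $k$, the nonzero amplitude, and the number of trials are illustrative assumptions.

    import numpy as np

    # Empirical MSE of the MLE vs. soft-thresholding with lambda = sqrt(2 log n), sigma = 1.
    rng = np.random.default_rng(3)
    n, k, amp, trials = 1000, 10, 5.0, 200
    lam = np.sqrt(2 * np.log(n))

    w = np.zeros(n); w[:k] = amp            # k-sparse signal
    mse_mle, mse_soft = 0.0, 0.0
    for _ in range(trials):
        y = w + rng.normal(size=n)
        w_soft = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)
        mse_mle += np.sum((y - w) ** 2) / trials
        mse_soft += np.sum((w_soft - w) ** 2) / trials

    print("MLE MSE  ~", mse_mle)            # about n = 1000
    print("soft MSE ~", mse_soft)           # on the order of 2*k*log(n)
    print("theorem bound:", (2*np.log(n) + 1) * (k + 1))

In runs like this the soft-thresholding MSE is typically an order of magnitude smaller than the MLE's, consistent with the $2k\log n$ versus $n$ comparison above.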

Intuition: Consider the case with $\sigma^2 = 1$ (the general case follows by simple rescaling). First recall that if $y \sim N(0, 1)$, then $P(|y| \ge \lambda) \le e^{-\lambda^2/2}$. This inequality is easily derived as follows. Since $P(y \ge \lambda) = P(y \le -\lambda)$, we only need to show that $P(y \ge \lambda) = \frac{1}{\sqrt{2\pi}}\int_\lambda^\infty e^{-y^2/2}\, dy \le \frac{1}{2}e^{-\lambda^2/2}$. Note that
$$ \frac{\frac{1}{\sqrt{2\pi}}\int_\lambda^\infty e^{-y^2/2}\, dy}{\frac{1}{2}e^{-\lambda^2/2}} = \frac{\frac{1}{\sqrt{2\pi}}\int_\lambda^\infty e^{-(y^2 - \lambda^2)/2}\, dy}{\frac{1}{2}} = \frac{\frac{1}{\sqrt{2\pi}}\int_\lambda^\infty e^{-(y - \lambda)(y + \lambda)/2}\, dy}{\frac{1}{2}} . $$
The desired inequality results by making the change of variable $t = y - \lambda$ to yield
$$ \frac{\frac{1}{\sqrt{2\pi}}\int_\lambda^\infty e^{-y^2/2}\, dy}{\frac{1}{2}e^{-\lambda^2/2}} = \frac{\frac{1}{\sqrt{2\pi}}\int_0^\infty e^{-t(t + 2\lambda)/2}\, dt}{\frac{1}{2}} \le \frac{\frac{1}{\sqrt{2\pi}}\int_0^\infty e^{-t^2/2}\, dt}{\frac{1}{2}} = 1 . $$
Now observe that if $\lambda = \sqrt{2\log n}$, then $P(|y_i| \ge \lambda \,|\, w_i = 0) \le e^{-\log n} = \frac{1}{n}$. Using this we have
$$ E\left[ \sum_{i: w_i = 0} 1_{\{\hat{w}_i \ne 0\}} \right] \le \sum_{i: w_i = 0} \frac{1}{n} \le 1 . $$
In other words, using this threshold we expect that at most one of the $w_i = 0$ will not be estimated as $\hat{w}_i = 0$. Next consider cases when $w_i \ne 0$. Let us suppose that $|w_i| \gg \lambda$, so that $\hat{w}_i = y_i - \lambda\,\mathrm{sign}(y_i)$. In other words, if $|w_i| \gg \lambda$, then with high probability (tending to 1 as $|w_i|$ increases) we have $|y_i| \ge \lambda$. In this case,
$$ (w_i - \hat{w}_i)^2 = (-\epsilon_i + \lambda\,\mathrm{sign}(y_i))^2 \le \epsilon_i^2 + 2|\epsilon_i|\lambda + \lambda^2 . $$
Taking the expectation of this upper bound yields
$$ E[(w_i - \hat{w}_i)^2] \le 1 + 2\lambda + \lambda^2 \le 3\lambda^2 + 1 , \quad \text{assuming } \lambda > 1 . $$
Thus, if $w$ has only $k$ nonzero weights, then this intuition suggests that
$$ \sum_{i=1}^n E[(w_i - \hat{w}_i)^2] = O(k\log n) . $$
This is formalized in the following proof of Theorem 15.0.1.

Proof: To simplify the analysis, assume that $\sigma^2 = 1$. The general result follows directly. It suffices to show that
$$ E[(\hat{w}_i - w_i)^2] \le (2\log n + 1)\left\{ \frac{1}{n} + \min(w_i^2, 1) \right\} $$
for each $i$. So let $y \sim N(w, 1)$ and let $f_\lambda(y) = \mathrm{sign}(y)\max(|y| - \lambda, 0)$. We will show that with $\lambda = \sqrt{2\log n}$
$$ E[(f_\lambda(y) - w)^2] \le (2\log n + 1)\left\{ \frac{1}{n} + \min(w^2, 1) \right\} . $$
First note that $f_\lambda(y) = y - \mathrm{sign}(y)(|y| \wedge \lambda)$, where $a \wedge b$ is shorthand notation for $\min(a, b)$. It follows that
$$ E[(f_\lambda(y) - w)^2] = E[(y - w)^2] - 2E[\mathrm{sign}(y)(|y| \wedge \lambda)(y - w)] + E[y^2 \wedge \lambda^2] = 1 - 2E[\mathrm{sign}(y)(|y| \wedge \lambda)(y - w)] + E[y^2 \wedge \lambda^2] . $$
The expected value in the second term is equal to $P(|y| < \lambda)$, which is verified as follows.

The expectation can be split into integrals over four intervals, $(-\infty, -\lambda]$, $(-\lambda, 0]$, $(0, \lambda]$, and $(\lambda, \infty)$. Each integrand is a linear or quadratic function of $y$ times the Gaussian density function. Let $\phi(x) := \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ and $\Phi(x)$ be the cumulative distribution function of $\phi(x)$, and consider the following indefinite Gaussian integral forms:
$$ \int \phi(x)\, dx = \Phi(x) , \quad \text{by definition of } \Phi , $$
$$ \int x\phi(x)\, dx = \frac{1}{\sqrt{2\pi}}\int x e^{-x^2/2}\, dx = -\frac{1}{\sqrt{2\pi}}\int e^u\, du = -\frac{1}{\sqrt{2\pi}} e^u = -\phi(x) , \quad \text{with } u = -x^2/2 , $$
$$ \int x^2\phi(x)\, dx = \Phi(x) - x\phi(x) . $$
The last form is verified as follows. Let $u = x$ and $dv = x\phi(x)dx$. Then integration by parts $\int u\, dv = uv - \int v\, du$ and $\int x\phi(x)dx = -\phi(x)$ show that
$$ \int x^2\phi(x)\, dx = -x\phi(x) + \int \phi(x)\, dx = \Phi(x) - x\phi(x) . $$
The Gaussian distribution we are considering has mean $w$, so the shifted integral forms below, which follow immediately from the derivations above by variable substitution, will be used in our analysis:
(i) $\int \phi(x - w)\, dx = \Phi(x - w)$
(ii) $\int x\phi(x - w)\, dx = w\Phi(x - w) - \phi(x - w)$
(iii) $\int x^2\phi(x - w)\, dx = (1 + w^2)\Phi(x - w) - (x + w)\phi(x - w)$
Using these forms we compute
$$ E[\mathrm{sign}(x)(|x| \wedge \lambda)(x - w)] = \int_{-\infty}^{\infty} \mathrm{sign}(x)(|x| \wedge \lambda)(x - w)\,\phi(x - w)\, dx $$
$$ = \underbrace{\int_{-\infty}^{-\lambda} -\lambda(x - w)\phi(x - w)\, dx}_{\lambda\phi(-\lambda - w)} + \underbrace{\int_{-\lambda}^{0} x(x - w)\phi(x - w)\, dx}_{\Phi(-w) - \Phi(-\lambda - w) - \lambda\phi(-\lambda - w)} + \underbrace{\int_{0}^{\lambda} x(x - w)\phi(x - w)\, dx}_{\Phi(\lambda - w) - \Phi(-w) - \lambda\phi(\lambda - w)} + \underbrace{\int_{\lambda}^{\infty} \lambda(x - w)\phi(x - w)\, dx}_{\lambda\phi(\lambda - w)} $$
$$ = \Phi(\lambda - w) - \Phi(-\lambda - w) = P(|x| < \lambda) . $$
So we have shown that
$$ E[(f_\lambda(y) - w)^2] = 1 - 2P(|y| < \lambda) + E[y^2 \wedge \lambda^2] . $$
Note first that since $y^2 \wedge \lambda^2 \le \lambda^2$ we have
$$ E[(f_\lambda(y) - w)^2] \le 1 + \lambda^2 = 1 + 2\log n < (2\log n + 1)(1/n + 1) . $$
On the other hand, since $y^2 \wedge \lambda^2 \le y^2$ we also have
$$ E[(f_\lambda(y) - w)^2] \le 1 - 2P(|y| < \lambda) + w^2 + 1 = 2(1 - P(|y| < \lambda)) + w^2 = 2P(|y| \ge \lambda) + w^2 . $$
The proof will be finished if we show that
$$ 2P(|y| \ge \lambda) \le (2\log n + 1)/n + (2\log n)w^2 . $$
Define $g(w) := 2P(|y| \ge \lambda)$ and note that $g$ is symmetric about $0$. Using a Taylor series with remainder we have
$$ g(w) \le g(0) + \frac{1}{2}\sup|g''|\, w^2 , $$
where $g''$ is the second derivative of $g$. Note that $g(w) = 2[1 - P(z \le \lambda - w) + P(z \le -\lambda - w)]$, where $z \sim N(0, 1)$. Using the Gaussian tail bound $P(z > \lambda) \le \frac{1}{2}e^{-\lambda^2/2}$ and plugging in $\lambda = \sqrt{2\log n}$ we obtain $g(0) \le 2/n$. Note that $g'(w) = 2[\phi(\lambda - w) - \phi(-\lambda - w)]$ and $g'(0) = 0$. The integral (ii) above shows that the derivative of $\phi(\lambda - w)$ with respect to $w$ is equal to $(\lambda - w)\phi(\lambda - w)$. So we have $g''(w) = 2[(\lambda - w)\phi(\lambda - w) - (-\lambda - w)\phi(-\lambda - w)]$. It is easy to verify that $|g''(w)| < 1$, since $\sup_x |x\phi(x)| < 0.25$. To simplify the final bound, note that $4\log n > 1$ if $n \ge 2$, so it follows that $\sup_w g''(w) < 4\log n$ for all $n \ge 2$.

15.1. Exercises

1. Suppose that we observe


y = Xw∗ + ϵ (15.1)
where X ∈ Rn×n has orthonormal columns and ϵ ∼ N (0, I). Consider the regularized optimization for
estimating w∗
1
min ∥y − Xw∥22 + λ ∥w∥1
w 2
(a) What value of λ would you suggest for this case and why?
(b) Suppose that w∗ has only k < n nonzero elements. Consider the prediction error for a new observation
of the form y = xT w∗ + ϵ, as above. Show that the expected squared prediction error can be bounded
as follows
E[(y − xT w)b 2 ] ≤ (2 log n + 1)(k + 1)∥x∥2 + 1 .
Hint: You may use the soft-thresholding result from class which states that the solution to the opti-
b satisfying the bound
mization in (14.6), for appropriately chosen λ, produces an estimator w
 Xn 

E[∥w − w∥ 2
b 2 ] ≤ (2 log n + 1) 1 + min(|wi∗ |2 , 1) .
i=1

2. Consider the data model


yi = wi + ϵi , i = 1, 2, . . . , n
iid iid
where ϵi ∼ N (0, 1). Let’s model the weight sparsity using the Gaussian mixture model wi ∼ (1 −
b denote the estimate of w using a soft-thresholding operation with threshold
p)N (0, 0.1)+pN (0, 10). Let w
λ > 0. Find a bound on the MSE of this estimator in terms of p and λ. How would you optimize the choice
of λ in terms of p?

3. Let X be an n × n diagonal matrix and denote the ith diagonal entry as xi . Assume X is full rank (i.e.,
xi ̸= 0 for all i). Suppose we observe y = Xw + ϵ, with ϵ ∼ N (0, I). Equivalently, yi = xi wi + ϵi ,
i = 1, . . . , n. Consider the optimization
1
∥y − Xw∥22 + λ∥w∥1
2
and recall that our theory for the case X = I suggests that λ should be proportional to the standard deviation
of the noise. Is this reasonable for general diagonal matrix X? Suppose that a certain |xi | ≫ 1 or |xi | ≪ 1
to get some intuition. Can you suggest a modified
P regularization that might make more sense in this setting?
The goal is to find an estimator w bi − wi )2 ] is small.
bi so the i E[(w

Lecture 16: Concentration Inequalities

The most important form of statistic considered in this course is a sum of independent random variables.
Example 10. A biologist is studying the new artificial lifeform called synthia. She is interested to see if the synthia cells can survive in cold conditions. To test synthia's hardiness, the biologist will conduct $n$ independent experiments. She has grown $n$ cell cultures under ideal conditions and then exposed each to cold conditions. The number of cells in each culture is measured before and after spending one day in cold conditions. The fraction of cells surviving the cold is recorded. Let $x_1, \ldots, x_n$ denote the recorded fractions. The average $\hat{p} := \frac{1}{n}\sum_{i=1}^n x_i$ is an estimator of the survival probability.

Understanding the behavior of sums of independent random variables is extremely important. For instance, the biologist in the example above would like to know that the estimator is reasonably accurate. Let $X_1, \ldots, X_n$ be independent and identically distributed random variables with variance $\sigma^2 < \infty$ and consider the average $\hat{\mu} := \frac{1}{n}\sum_{i=1}^n X_i$. First note that $E[\hat{\mu}] = E[X]$. An easy calculation shows that the variance of $\hat{\mu}$ is $\sigma^2/n$. So the average has the same mean value as the random variables and the variance is reduced by a factor of $n$. Lower variance means less uncertainty. So it is possible to reduce uncertainty by averaging. The more we average, the less the uncertainty (assuming, as we are, that the random variables are independent, which implies they are uncorrelated).

The argument above quantifies the effect of averaging on the variance, but often we would like to say more
about the distribution of the average. The Central Limit Theorem is a classic result showing that the probability
distribution of the average of n independent and identically distributed random variables with mean µ and variance
σ 2 < ∞ tends to a Gaussian distribution with mean µ and variance σ 2 /n, regardless of the form of the distribution
of the variables. By ‘tends to’ we mean in the limit as n tends to infinity.

In many applications we would like to say something more about the distributional characteristics for finite values of $n$. One approach is to calculate the distribution of the average explicitly. Recall that if the random variables have a density $p_X$, then the density of the sum $\sum_{i=1}^n X_i$ is the $n$-fold convolution of the density $p_X$ with itself (again this hinges on the assumption that the random variables are independent; it is easy to see by considering the characteristic function of the sum and recalling that multiplication of Fourier transforms is equivalent to convolution in the inverse domain). However, this exact calculation can sometimes be difficult or impossible, if for instance we don't know the density $p_X$, and so sometimes probability bounds are more useful.

Let $Z$ be a non-negative random variable and take $t > 0$. Then
$$ E[Z] \ge E[Z\, 1_{\{Z \ge t\}}] \ge E[t\, 1_{\{Z \ge t\}}] = t\, P(Z \ge t) . $$
The result $P(Z \ge t) \le E[Z]/t$ is called Markov's Inequality. We can generalize this inequality as follows. Let $\phi$ be any non-decreasing, non-negative function. Then
$$ P(Z \ge t) = P(\phi(Z) \ge \phi(t)) \le \frac{E[\phi(Z)]}{\phi(t)} . $$
We can use this to get a bound on the probability 'tails' of any random variable $Z$. Let $t > 0$:
$$ P(|Z - E[Z]| \ge t) = P\big( (Z - E[Z])^2 \ge t^2 \big) \le \frac{E[(Z - E[Z])^2]}{t^2} = \frac{\mathrm{Var}(Z)}{t^2} , $$
where $\mathrm{Var}(Z)$ denotes the variance of $Z$. This inequality is known as Chebyshev's Inequality. If we apply this to the average $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i$, then we have
$$ P(|\hat{\mu} - \mu| \ge t) \le \frac{\sigma^2}{nt^2} , $$
where $\mu$ and $\sigma^2$ are the mean and variance of the random variables $\{X_i\}$. This shows that not only is the variance reduced by averaging, but the tails of the distribution (probability of observing values a distance of more than $t$ from the mean) are smaller.

The tail bound given by Chebyshev’s Inequality is loose, and much tighter bounds are possible under slightly
iid
stronger assumptions. For example, if Xi ∼ N (µ, 1), then µ b ∼ N (µ, 1/n). The following tail-bound for the
−nt2 /2
µ − µ| ≥ t) ≤ e
Gaussian density shows that in this case P(|b .
Theorem 7. The tail of the standard Gaussian N (0, 1) distribution satisfies the bound for any t ≥ 0,
Z∞  
1 −x2 1 −t2 1 −t2
√ e 2 dx ≤ min e 2 , √ e 2
2π 2 2π t2
t

Proof. Consider
R∞ −x2
√1 e 2 dx Z∞ Z∞
2π 1 −(x2 −t2 ) 1 −(x−t)(x+t)
t
R := 2 = √ e 2 dx = √ e 2 dx
− t2 2π 2π
e t t

For the first bound, let y = x − t,


Z∞ Z∞
1 −y(y+2t) 1 −y 2 1
R = √ e 2 dy ≤ √ e 2 dy =
2π 2π 2
0 0

For the second bound, note that


Z∞ Z∞ −t 2
1 −2t(x−t) 1 2 −tx 1 2 e 1
R ≤ √ e 2 dx = √ et e dx = √ et = √
2π 2π 2π t 2πt2
t t

16.1. The Chernoff Method

More generally, if the random variables {Xi } are bounded or sub-Gaussian (meaning the tails of the probability
distribution decay at least as fast as Gaussian tails), then the tails of the average converge exponentially fast in n.
The key to this sort of result is the so-called Chernoff bounding method, based on Markov’s inequality and the
exponential function (non-decreasing, non-negative). If Z is any real-valued random variable and s > 0, then

P(Z > t) = P(esZ > est ) ≤ e−st E[esZ ] .

We can choose $s > 0$ to minimize this upper bound. In particular, if we define the function
$$ \psi^*(t) = \max_{s > 0}\; st - \log E[e^{sZ}] , $$
then $P(Z > t) \le e^{-\psi^*(t)}$.

Exponential bounds of this form can be obtained explicitly for many classes of random variables. One of the most important is the class of sub-Gaussian random variables. A random variable $X$ is said to be sub-Gaussian if there exists a constant $c > 0$ such that $E[e^{sX}] \le e^{cs^2/2}$ for all $s \in \mathbb{R}$.

Theorem 8. Let $X_1, X_2, \ldots, X_n$ be independent sub-Gaussian random variables such that $E[e^{s(X_i - E[X_i])}] \le e^{cs^2/2}$ for a constant $c > 0$ and $i = 1, \ldots, n$. Let $S_n = \sum_{i=1}^n X_i$. Then for any $t > 0$, we have
$$ P(|S_n - E[S_n]| \ge t) \le 2e^{-t^2/(2nc)} , $$
and equivalently, if $\hat{\mu} := \frac{1}{n}S_n$, we have
$$ P(|\hat{\mu} - \mu| \ge t) \le 2e^{-nt^2/(2c)} . $$

Proof.
$$ P\left( \sum_{i=1}^n X_i - E[X_i] \ge t \right) \le e^{-st} E\left[ e^{s\left(\sum_{i=1}^n X_i - E[X_i]\right)} \right] = e^{-st} E\left[ \prod_{i=1}^n e^{s(X_i - E[X_i])} \right] = e^{-st}\prod_{i=1}^n E\left[ e^{s(X_i - E[X_i])} \right] = e^{-st} e^{ncs^2/2} = e^{-t^2/(2nc)} , $$
where the last step follows by taking $s = t/(nc)$.
  2
To apply the result above we need to verify that the sub-Gaussian condition, $E[e^{s(X_i - E[X_i])}] \le e^{cs^2/2}$, holds for some $c > 0$. As the name suggests, the condition holds if the tails of the probability distribution decay like $e^{-t^2/2}$ (or faster).

Theorem 9. If $P(|X_i - E[X_i]| \ge t) \le ae^{-bt^2/2}$ holds for constants $a \ge 1$, $b > 0$ and all $t \ge 0$, then
$$ E[e^{s(X_i - E[X_i])}] \le e^{4as^2/b} . $$

2
Proof. Let X be a zero-mean random variable satisfying P(|X| ≥ t) ≤ ae−bt /2 . First note since X has mean
zero, Jensen’s inequality implies E[esX ] ≥ esEX = 1 for all s ∈ R. Thus, if X1 and X2 are two independent
copies of X, then
E[es(X1 −X2 ) ] = E[esX1 ]E[e−sX2 ] ≥ E[esX1 ] = E[esX ] .
Thus, we can write
X sℓ E[(X1 − X2 )ℓ ]
E[esX ] ≤ E[es(X1 −X2 ) ] = 1 + .
ℓ≥1
ℓ!

Also, since E[(X1 − X2 )ℓ ] = 0 for ℓ odd (since symmetry X1 − X2 has a symmetric distribution), we have
X s2ℓ E[(X1 − X2 )2ℓ ]
E[esX ] ≤ 1 + .
ℓ≥1
(2ℓ)!

Next note that since x2ℓ is convex in x, by Jensen’s inequality we have

(X1 /2 − X2 /2)2ℓ = ((X1 + (−X2 ))/2)2ℓ ≤ (X12ℓ + (−X2 )2ℓ )/2 = (X12ℓ + X22ℓ )/2 ,

which yields

E[(X1 − X2 )2ℓ ] = E[2ℓ (X1 /2 − X2 /2)2ℓ ] ≤ 22ℓ−1 E[X12ℓ ] + E[X22ℓ ] = 22ℓ E[X 2ℓ ] .
R∞
Next note that E[X 2ℓ ] = 0 P(X 2ℓ > t) dt and by the change of variables t = x2ℓ we have
Z ∞ Z ∞
2
2ℓ
E[X ] = 2ℓ x 2ℓ−1
P(|X| > x) dx ≤ 2ℓa x2ℓ−1 e−bx /2 dx .
0 0
p
Now substitute x = 2y/b to get
Z ∞
2ℓ
E[X ] ≤ (2/b) ℓa ℓ
y ℓ−1 e−y dy = (2/b)ℓ a ℓ! ,
0
R∞
where we recognize that 0 y ℓ−1 e−y dy = Γ(ℓ) = (ℓ − 1)!, the gamma function. So we have E[(X1 − X2 )2ℓ ] ≤
23ℓ b−ℓ a ℓ! ≤ (8a/b)ℓ ℓ! since a ≥ 1. Now plugging this into the bound for E[esX ] above, we see that each term in
the sum is bounded by s2ℓ (8a/b)ℓ ℓ!/(2ℓ)!. Since (2ℓ)! ≥ 2ℓ (ℓ!)2 each term can be bounded by (4as2 /b)ℓ /ℓ!, and
2
so E[esX ] ≤ e4as /b .

The simplest result of this form is for bounded random variables.

Theorem 10. (Hoeffding’s Inequality). Let PnX1 , X2 , ..., Xn be independent bounded random variables such that
Xi ∈ [ai , bi ] with probability 1. Let Sn = i=1 Xi . Then for any t > 0, we have
2t2
− Pn
(bi −ai )2
P(|Sn − E[Sn ]| ≥ t) ≤ 2 e i=1

Proof. We prove a special case. The more general result above also has slightly better constants, and its proof is
later in the notes. Here, assume that a ≤ Xi ≤ b with probability 1 for all i. Then the following bound
2
− Pn t
(bi −ai )2
P(|Sn − E[Sn ]| ≥ t) ≤ 2 e 2
i=1

follows from Theorem 9 above by noting that if a ≤ X1 , X2 ≤ b with probability 1, then E[(X1 − X2 )2ℓ ] ≤
(b − a)2ℓ .

If the random variables $\{X_i\}$ are binary-valued, then this result is usually referred to as the Chernoff Bound. Another proof of Hoeffding's Inequality, which relies on Markov's inequality and some elementary concepts from convex analysis, is given in the next section. Note that the random variables in the average $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i$ are bounded according to $a \le X_i \le b$. Let $c = (b - a)^2$. Then Hoeffding's Inequality implies
$$ P(|\hat{\mu} - \mu| \ge t) \le 2e^{-2nt^2/c} \qquad (16.1) $$
In other words, the tails of the distribution of the average are tending to zero at an exponential rate in $n$, much faster than indicated by Chebyshev's Inequality.

Example 11. Let us revisit the synthia experiments. The biologist has collected $n$ observations, $x_1, \ldots, x_n$, each corresponding to the fraction of cells that survived in a given experiment. Her estimator of the survival rate is $\frac{1}{n}\sum_{i=1}^n x_i$. How confident can she be that this is an accurate estimator of the true survival rate? Let us model her observations as realizations of $n$ i.i.d. random variables $X_1, \ldots, X_n$ with mean $p$ and define $\hat{p} = \frac{1}{n}\sum_{i=1}^n X_i$. We say that her estimator is probably approximately correct with non-negative parameters $(\epsilon, \delta)$ if
$$ P(|\hat{p} - p| > \epsilon) \le \delta . $$
The random variables are bounded between 0 and 1 and so the value of $c$ in (16.1) above is equal to 1. For desired accuracy $\epsilon > 0$ and confidence $1 - \delta$, how many experiments will be sufficient? From (16.1) we equate $\delta = 2\exp(-2n\epsilon^2)$, which yields $n \ge \frac{1}{2\epsilon^2}\log(2/\delta)$. Note that this requires no knowledge of the distribution of the $\{X_i\}$ apart from the fact that they are bounded. The result can be summarized as follows. If $n \ge \frac{1}{2\epsilon^2}\log(2/\delta)$, then the probability that her estimate is off the mark by more than $\epsilon$ is less than $\delta$.
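A small sketch (not part of the original notes) of this sample-size calculation; the accuracy and confidence values passed in are illustrative.

    import numpy as np

    def hoeffding_sample_size(eps, delta, c=1.0):
        # Samples sufficient for P(|estimate - mean| > eps) <= delta,
        # from 2*exp(-2*n*eps^2/c) <= delta with c = (b - a)^2.
        return int(np.ceil(c * np.log(2.0 / delta) / (2.0 * eps ** 2)))

    # Example: accuracy 0.05 with 95% confidence for [0,1]-bounded observations.
    print(hoeffding_sample_size(eps=0.05, delta=0.05))   # 738 experiments suffice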

16.2. Azuma-Hoeffding Inequality

Hoeffding’s inequality can be generalized in a few ways. First, using Doob’s inequality we have the stronger
bound 2 2t
− Pn
(bi −ai )2
P( max |Sk − E[Sk ]| ≥ t) ≤ 2 e i=1
1≤k≤n

Second, we can extend the inequality to martingale sequences. A martingale sequence of random variables
S0 , S1 , . . . , Sn satisfies E[Sk+1 |S1 , . . . , Sk ] = Sk for all k = 0, 1, . . . , n. Note that sums of zero-mean and in-
dependent random variables are martingales.

Theorem 11. (Azuma’s Inequality). Let S0 , S1 , . . . , Sn be martingale sequence of random variables such that for
all i Si − Si−1 ∈ [ai , bi ] with probability 1. Then for any t > 0, we have
2
− Pn t
(b −ai )2
P(Sn − S0 ≥ t) ≤ 2 e 2
i=1 i

Example. Suppose you make a bet each day: you bet $b$ dollars and have a 50/50 chance of receiving $2b$ dollars or losing your money. Let $S_i$ denote your net gain on day $i$ and let $Y_i \in \{-1, +1\}$ be an indicator of the outcome of your bet on day $i$. Here are two strategies.

Independent Betting: Always bet $b$ dollars. Then the net gain is $S_n = b\sum_{i=1}^n Y_i$.

Recursive Betting: On day $i$ bet $pS_{i-1}$ for some $p \in [0, 1]$. Then the change of wealth on day $i$ can be expressed recursively as $S_i = S_{i-1} + pS_{i-1}Y_i$. This is a martingale.

16.3. KL-Based Tail Bounds

It is possible to derive tighter bounds by optimizing the exponent. In particular, if the random variables belong to the exponential family, then the resulting exponent turns out to be a KL divergence. Below we will work this out for the case of Bernoulli random variables. This results in a tail bound that is as good as or better than the sub-Gaussian/Hoeffding bounds above for $[0, 1]$-bounded random variables.

Let $x$ be a non-negative random variable. By Markov's inequality, for any $\lambda > 0$
$$ P(x \ge \epsilon) = P(e^{\lambda x} \ge e^{\lambda\epsilon}) \le e^{-\lambda\epsilon} E[e^{\lambda x}] = \exp\left( -\left( \lambda\epsilon - \log E[e^{\lambda x}] \right) \right) \le e^{-\psi^*(\epsilon)} $$
where
$$ \psi^*(\epsilon) := \sup_{\lambda\in\mathbb{R}}\left( \lambda\epsilon - \log E[e^{\lambda x}] \right) . $$
If $x_1, \ldots, x_n$ are i.i.d. non-negative random variables, then
$$ P\left( \frac{1}{n}\sum_{i=1}^n x_i \ge \epsilon \right) = P\left( \sum_{i=1}^n x_i \ge n\epsilon \right) \le e^{-n\lambda\epsilon} E\left[ e^{\lambda\sum_{i=1}^n x_i} \right] = e^{-n\lambda\epsilon} E[e^{\lambda x_1}]^n = \exp\left( -n\left( \lambda\epsilon - \log E[e^{\lambda x}] \right) \right) \le e^{-n\psi^*(\epsilon)} . $$
Now suppose that $x_i$ is Bernoulli with mean $p$. Then $P\left( \frac{1}{n}\sum_{i=1}^n x_i - p \ge \epsilon \right) \le \exp(-n\psi^*(p + \epsilon))$. Now consider
$$ \psi^*(p + \epsilon) = \sup_{\lambda\in\mathbb{R}}\left( \lambda(p + \epsilon) - \log E[e^{\lambda x}] \right) = \sup_{\lambda\in\mathbb{R}}\left( \lambda(p + \epsilon) - \log(1 - p + pe^\lambda) \right) . $$
Setting the derivative of the argument to zero,
$$ (p + \epsilon) = \frac{pe^\lambda}{1 - p + pe^\lambda} , $$
and solving for $\lambda$ yields
$$ \lambda = \log\left( \frac{(1 - p)(p + \epsilon)}{p(1 - (p + \epsilon))} \right) , $$
so $\psi^*(p + \epsilon) = (p + \epsilon)\log\left(\frac{p + \epsilon}{p}\right) + (1 - (p + \epsilon))\log\left(\frac{1 - (p + \epsilon)}{1 - p}\right) = \mathrm{KL}(p + \epsilon, p)$. Thus we have
$$ P\left( \frac{1}{n}\sum_{i=1}^n x_i - p \ge \epsilon \right) \le \exp(-n\,\mathrm{KL}(p + \epsilon, p)) , $$
and a similar derivation yields
$$ P\left( p - \frac{1}{n}\sum_{i=1}^n x_i \ge \epsilon \right) \le \exp(-n\,\mathrm{KL}(p - \epsilon, p)) . $$

These bounds can be used to construct confidence intervals as follows. Let $\hat{p} := \frac{1}{n}\sum_{i=1}^n x_i$ and consider the bound from above, $P(\hat{p} - p \ge \epsilon) \le \exp(-n\,\mathrm{KL}(p + \epsilon, p))$. In other words, if we choose $\delta$ so that $\mathrm{KL}(p + \epsilon, p) = \log(1/\delta)/n$, then $\hat{p} - p \le \epsilon$ with probability at least $1 - \delta$. Observe that $\mathrm{KL}(p + \epsilon, p)$ is increasing in $\epsilon \ge 0$. On the event $\frac{1}{n}\sum_{i=1}^n x_i - p \le \epsilon$ we have $\mathrm{KL}(\hat{p}, p) \le \mathrm{KL}(p + \epsilon, p)$. Therefore, we can construct a $(1 - \delta)$-confidence upper bound on $p$ as
$$ U(\hat{p}, \delta) := \sup\left\{ q \ge \hat{p} : \mathrm{KL}(\hat{p}, q) \le \log(1/\delta)/n \right\} . $$
Similarly, we can construct a $(1 - \delta)$-confidence lower bound on $p$ as
$$ L(\hat{p}, \delta) := \inf\left\{ q \le \hat{p} : \mathrm{KL}(\hat{p}, q) \le \log(1/\delta)/n \right\} . $$
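A small sketch (not part of the original notes) of the KL-based upper confidence bound, computed by bisection, since $\mathrm{KL}(\hat{p}, q)$ is increasing in $q$ for $q \ge \hat{p}$; the input values are illustrative.

    import numpy as np

    def kl_bernoulli(p, q):
        # KL divergence between Bernoulli(p) and Bernoulli(q), with clipping for stability.
        eps = 1e-12
        p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    def kl_upper_bound(p_hat, n, delta):
        # Largest q >= p_hat with KL(p_hat, q) <= log(1/delta)/n, found by bisection.
        level, lo, hi = np.log(1.0 / delta) / n, p_hat, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if kl_bernoulli(p_hat, mid) <= level:
                lo = mid
            else:
                hi = mid
        return lo

    print(kl_upper_bound(p_hat=0.3, n=100, delta=0.05))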

16.4. Proof of Hoeffding’s Inequality

Let X be any random variable and s > 0. Note that P(X ≥ t) = P(esX ≥ est ) ≤ e−st E[esX ] , by using Markov’s
inequality, and noting that esx is a non-negative monotone increasing function. For clever choices of s this can be
quite a good bound.
P
Let’s look now at ni=1 Xi − E[Xi ]. Then

Xn h Pn i
P( Xi − E[Xi ] ≥ t) ≤ e−st E es( i=1 Xi −E[Xi ])
i=1
" #
Y
n
= e−st E es(Xi −E[Xi ])
i=1
Y
n
 
= e−st E es(Xi −E[Xi ]) ,
i=1

where the last


 step follows from the independence of the Xi ’s. To complete the proof we need to find a good
bound for E es(Xi −E[Xi ]) .

Figure 16.1: Convexity of exponential function.

Lemma 16.4.1. Let Z be a r.v. such that E[Z] = 0 and a ≤ Z ≤ b with probability one. Then
  s2 (b−a)2
E esZ ≤ e 8 .

This upper bound is derived as follows. By the convexity of the exponential function (see Fig. 16.1),
z − a sb b − z sa
esz ≤ e + e , for a ≤ z ≤ b .
b−a b−a

Thus,
   
sZ Z − a sb b − Z sa
E[e ] ≤ E e +E e
b−a b−a
b sa a sb
= e − e , since E[Z] = 0
b−a b−a
−a
= (1 − λ + λes(b−a) )e−λs(b−a) , where λ =
b−a

Now let u = s(b − a) and define


ϕ(u) ≡ −λu + log(1 − λ + λeu ) ,
so that
E[esZ ] ≤ (1 − λ + λes(b−a) )e−λs(b−a) = eϕ(u) .

We want to find a good upper-bound on eϕ(u) . Let’s express ϕ(u) as its Taylor series with remainder:
u2 ′′
ϕ(u) = ϕ(0) + uϕ′ (0) + ϕ (v) for some v ∈ [0, u] .
2

λeu
ϕ′ (u) = −λ + ⇒ ϕ′ (0) = 0
1 − λ + λeu
λeu λ2 e2u
ϕ′′ (u) = −
1 − λ + λeu (1 − λ + λeu )2
λeu λeu
= (1 − )
1 − λ + λeu 1 − λ + λeu
= ρ(1 − ρ) ,
λeu
where ρ = 1−λ+λeu
. Now note that ρ(1 − ρ) ≤ 1/4, for any value of ρ (the maximum is attained when ρ = 1/2,
′′ u2 s2 (b−a)2
therefore ϕ (u) ≤ 1/4. So finally we have ϕ(u) ≤ 8
= 8
, and therefore
s2 (b−a)2
E[esZ ] ≤ e 8 .

Now, we can apply this upper bound to derive Hoeffding's inequality:
$$ P(S_n - E[S_n] \ge t) \le e^{-st}\prod_{i=1}^n E[e^{s(X_i - E[X_i])}] \le e^{-st}\prod_{i=1}^n e^{s^2(b_i - a_i)^2/8} = e^{-st} e^{s^2\sum_{i=1}^n (b_i - a_i)^2/8} = \exp\left( \frac{-2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right) , $$
by choosing
$$ s = \frac{4t}{\sum_{i=1}^n (b_i - a_i)^2} . $$
The same result applies to the r.v.'s $-X_1, \ldots, -X_n$, and combining these two results yields the claim of the theorem.

16.5. Exercises

1. Let x be a random variable with bounded variance V[x] < ∞. Recall Chebyshev’s inequality

V[x]
P(|x − E[x]| ≥ t) ≤
t2
which is obtained by applying Markov’s inequality to P(|x − E[x]|2 ≥ t2 ). It bounds the probability that
|x − E[x]| ≥ t, which is a two sided event. In this exercise you will derive a one-sided version of this
inequality. Assume first the random variable x has zero mean, meaning E[x] = 0. Let σ 2 = V[x], its
variance.

(a) Recall the Cauchy-Schwarz inequality. For any two random variables y and z we have that
p
E[yz] ≤ E(y 2 )E(z 2 ) .

Now write t = t − E[x] = E[t − x] ≤ E[(t − x)1{t>x} ], and use Cauchy-Schwarz to show that

t2 ≤ E[(t − x)2 ]P(x < t) .

(b) Using the fact that E[x] = 0 manipulate the above expression to obtain inequality

σ2
P(x ≥ t) ≤ .
σ 2 + t2

(c) Now make only the assumption that x is a random variable for which V[x] = σ 2 < ∞. Show that

σ2
P(x − E[x] ≥ t) ≤ .
σ 2 + t2
Note that the inequality in (1c) has the nice feature that the r.h.s. is always smaller than 1, so the bound
is never trivial. Hint: define a suitable random variable x as a function of x that has zero mean, and
apply the result in (b).
(d) Use the above result to derive a two-sided version of this inequality. Namely, use the union bound to
show that
2σ 2
P(|x − E[x]| ≥ t) ≤ 2 .
σ + t2
(e) Let $z$ be a standard normal random variable and use the result of (c) to get an upper bound on $P(z > 1/2)$. Noting that $z$ is symmetric around the origin, we have $P(z > 1/2) = \frac{1}{2}P(|z| > 1/2)$. Use this and the original Chebyshev inequality to get a bound on $P(z > 1/2)$. Which is a better bound? (Note that we can compute $P(z > 1/2)$ numerically and get approximately 0.3085.)

2. Hide-and-Seek. Consider the following problem. You are given two coins, one is fair, but the other one is
fake and flips heads with probability 1/2 + ε, where ε > 0. However, you don’t know the value of ε. You
would like to identify the fake coin quickly.
Consider the following strategy. Flip both coins n times and compute the proportion of heads of each coin
(say pb1 and pb2 for coins 1 and 2, respectively). Now deem the coin for which the proportion of heads is
larger to be the fake coin. What is the probability we’ll make a mistake? Suppose without loss of generality
that coin 1 is the fake.

(a) We’ll make a mistake if pb1 < pb2 . That is

p1 − pb2 < 0) .
P(b

Noting that n(bp1 − pb2 ) is the sum of 2n independent random variables use Hoeffding’s inequality to
2
show that the probability of making an error is bounded by e−nε .
(b) Now suppose you have m coins, where only one coin is a fake. Similar to what we have done for the
two coins we can flip each coins n times, and compute the proportion of times each coin flips heads,
denoted by pb1 , . . . , pbm . What is the probability of making an error then?
Suppose without loss of generality that the first coin is fake. The probability of making an error is
given by
!
[
m
P(bp1 < pb2 or pb1 < pb3 or · · · or pb1 < pbm ) = P {b
p1 < pbi } .
i=2

Use Hoeffding’s inequality an the union bound to see that the probability of making an error smaller
2
than (m − 1)e−nε .
(c) Implement the above procedure with ε = 0.1, m = 2 or m = 100, and the following values of
n = 10, 100, 500, 1000. For each choice of parameters m and n repeat the procedure N = 10000
times and compute the proportion of runs where the procedure failed to identify the correct coin.
Compare these with the bounds you got. How good are the bounds you derived?

Lecture 17: Probably Approximately Correct (PAC) Learning

Suppose we have training examples of the form $\{x_i, y_i\}$, where the $x_i$ are $d$-dimensional features and the $y_i$ are scalar and bounded labels/responses. Let $\mathcal{F}$ denote a collection of prediction rules. That is, each $f \in \mathcal{F}$ is a function that maps from features to labels. The aim of Probably Approximately Correct (PAC) learning is to use the training data to select an $\hat{f}$ from $\mathcal{F}$ so that its predictions are probably almost as good as those of the best possible predictor in $\mathcal{F}$.

Perhaps the most natural approach to this task is to choose fb to minimize the errors made on the training data.
This is called empirical risk minimization (ERM). To elaborate, let us introduce some terminology and notation.

feature space: X , feature: x ∈ X

label space: Y, label: y ∈ Y

predictor: f : X → Y, collection of predictors: F

loss function: ℓ : Y × Y → R+

For example, a common learning task is binary classification, wherein X = Rd , Y = {−1, +1}, and the loss
function is the 0/1-loss defined as follows. Let y be a true label and yb be the prediction (e.g., yb = f (x) for some
predictor f ). The 0/1-loss is 1 if yb ̸= y and 0 otherwise. We will assume throughout these notes that the losses
are bounded by a constant c (e.g., c = 1).

The basic premise of the PAC learning framework is that the training examples are i.i.d. samples from an unknown probability distribution $P$, written mathematically as $(x_i, y_i) \overset{i.i.d.}{\sim} P$. The goal of learning is to select a predictor that minimizes the expected loss or risk:
$$ \min_{f\in\mathcal{F}} E_{(x,y)\sim P}[\ell(y, f(x))] . $$
ERM is the optimization:
$$ \min_{f\in\mathcal{F}} \sum_{i=1}^n \ell(y_i, f(x_i)) . $$
A predictor that achieves the minimum is denoted by $\hat{f}$ and is called an empirical risk minimizer. Note that for large $n$, the average $\frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i)) \overset{a.s.}{\to} E_{(x,y)\sim P}[\ell(y, f(x))]$, since the losses $\ell(y_i, f(x_i))$ are i.i.d. So, ERM seems like a sensible approach to trying to select an $f$ that comes close to minimizing the risk.
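A small sketch of ERM (not part of the original notes) over a finite class of one-dimensional threshold classifiers with 0/1 loss; the simulated data, noise level, and grid of thresholds are illustrative assumptions.

    import numpy as np

    # ERM over the finite class F = { f_t(x) = sign(x - t) : t in a grid } with 0/1 loss.
    rng = np.random.default_rng(4)
    n = 200
    x = rng.uniform(0, 1, size=n)
    y = np.where(x > 0.6, 1, -1)
    y[rng.random(n) < 0.1] *= -1                      # 10% label noise

    thresholds = np.linspace(0, 1, 21)                # the finite class F
    emp_risk = [np.mean(np.where(x > t, 1, -1) != y) for t in thresholds]
    t_hat = thresholds[int(np.argmin(emp_risk))]
    print("ERM threshold:", t_hat, "empirical risk:", min(emp_risk))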

17.1. Analysis of Empirical Risk Minimization

To simplify the notation, let us denote the risk and empirical risk as follows:
$$ R(f) = E_{(x,y)\sim P}[\ell(y, f(x))] , \qquad \hat{R}(f) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i)) . $$

b )] = R(f ). Markov’s inequality provides a (weak) upper bound on the deviation of the empirical
Note that E[R(f
risk from the true risk. Assuming the losses are bounded in [0, c]
b ) − R(f )|2 ]
E[|R(f c2
b ) − R(f )| > t) ≤
P(|R(f ≤ ,
t2 4nt2
since the maximum variance of c-bounded random variables c2 /4. This bound can be improved using Chernoff’s
bounding technique
b ) − R(f ) > t) = inf P(eλ(R(f )−R(f )) > eλt ) ≤ e−2nt
P(R(f
b 2 /c2
,
λ>0

and by the union bound P(|R(fb ) − R(f )| > t) ≤ 2 e−2nt2 /c2 . For example, in the case of 0/1-loss, the losses are
i.i.d. binary random variables and we may take c = 1.

Recall that the empirical risk minimizer is $\hat{f} = \arg\min_{f\in\mathcal{F}} \hat{R}(f)$. If $\hat{R}(f) \approx R(f)$ for all $f \in \mathcal{F}$, then the minimizer of $\hat{R}$ should be "close to" the minimizer of $R$. The bound above shows that $\hat{R}(f)$ is close to $R(f)$ for a specific function $f$. To guarantee that it is close for all $f \in \mathcal{F}$ we must consider $P\left( \bigcup_{f\in\mathcal{F}} \left\{ |\hat{R}(f) - R(f)| > t \right\} \right)$, the probability that $\hat{R}$ deviates significantly from $R$ for one or more of the functions. To bound this probability, we will assume that $\mathcal{F}$ is finite and denote the number of functions in $\mathcal{F}$ by $|\mathcal{F}|$. Then we have
$$ P\left( \bigcup_{f\in\mathcal{F}} \left\{ |\hat{R}(f) - R(f)| > t \right\} \right) \le \sum_{f\in\mathcal{F}} P\left( |\hat{R}(f) - R(f)| > t \right) \le 2|\mathcal{F}|e^{-2nt^2/c^2} . $$
Let $\delta = 2|\mathcal{F}|e^{-2nt^2/c^2}$. The bound above says that $\hat{R}$ is uniformly close to $R$ over $\mathcal{F}$, with probability at least $1 - \delta$.

Now let $f^\star = \arg\min_{f\in\mathcal{F}} E_{(x,y)\sim P}[\ell(y, f(x))]$, which is the best predictor in $\mathcal{F}$. This is the $f$ we would choose if we knew the data distribution $P$. ERM tries to select a predictor that is approximately as good as $f^\star$ using only the training examples. Consider the following inequalities, which hold with probability at least $1 - \delta$:
$$ R(\hat{f}) \le \hat{R}(\hat{f}) + t \le \hat{R}(f^\star) + t \ (\text{since } \hat{f} \text{ minimizes } \hat{R}) \ \le R(f^\star) + 2t . $$
This shows that the risk of $\hat{f}$ is at most $2t$ larger than $R(f^\star)$, the minimum risk, with probability at least $1 - \delta$. In other words, $\hat{f}$ is probably approximately correct. Let us also define
$$ \epsilon := 2t = \sqrt{\frac{2c^2\log(2|\mathcal{F}|/\delta)}{n}} . $$
Then we have $R(\hat{f}) - R(f^\star) \le \epsilon$ with probability at least $1 - \delta$, and we say that $\hat{f}$ is $(\epsilon, \delta)$-PAC.

Notice that the approximation error decreases with $n$ and increases with the size of $\mathcal{F}$ (although only logarithmically in $|\mathcal{F}|$). Another way to view these results is to consider the expected risk of $\hat{f}$. Note that $R(\hat{f})$ is a random variable, since $\hat{f}$ is random (it depends on the random set of training examples). Taking the expectation over the training examples, we have
$$ E[R(\hat{f})] \le (1 - \delta)\left( R(f^\star) + \sqrt{\frac{2c^2\log(2|\mathcal{F}|/\delta)}{n}} \right) + \delta\max_{f\in\mathcal{F}} R(f) \le R(f^\star) + \sqrt{\frac{2c^2\log(2|\mathcal{F}|/\delta)}{n}} + c\delta . $$
Now since the second term is at least $O\left(\frac{1}{\sqrt{n}}\right)$, take $\delta = \frac{1}{\sqrt{n}}$ to obtain
$$ E[R(\hat{f})] \le R(f^\star) + \sqrt{\frac{2c^2\log(2|\mathcal{F}|\sqrt{n})}{n}} + \frac{c}{\sqrt{n}} = R(f^\star) + O\left( \sqrt{\frac{\log|\mathcal{F}| + \log n}{n}} \right) . $$

These bounds show that if the number of samples n = O(log |F|), then the class F is PAC-learnable.
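A tiny sketch (not from the original notes) evaluating the $(\epsilon, \delta)$-PAC accuracy formula above for a few sample sizes; the class size, $\delta$, and loss bound are illustrative assumptions.

    import numpy as np

    def pac_epsilon(n, card_F, delta, c=1.0):
        # Accuracy epsilon with R(f_hat) <= R(f*) + epsilon w.p. >= 1 - delta,
        # from epsilon = sqrt(2 c^2 log(2|F|/delta) / n).
        return np.sqrt(2 * c**2 * np.log(2 * card_F / delta) / n)

    for n in [100, 1000, 10000]:
        print(n, pac_epsilon(n, card_F=10**6, delta=0.05))

The printout illustrates the $1/\sqrt{n}$ decay of the accuracy parameter and its mild (logarithmic) dependence on the class size.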

17.2. Exercises

1. Consider a classification setting with training data {(xi , yi )}ni=1 , where xi ∈ [0, 1]d and yi ∈ {−1, +1}.

(a) First consider d = 1 and a discrete set of classifiers Fm of the form f2j (x) = 21{x≥j/m} − 1 and
f2j+1 = −21{x≥j/m} + 1 , for j = 0, 1, . . . , m and an integer m ≥ 1. Also consider any classifier of
form ft,σ (x) = σ(21{x≥t} − 1), with t ∈ [0, 1] and σ ∈ {−1, +1}. How large must m be so that for
fixed ε > 0, t, and σ Z
min |f (x) − ft,σ (x)| dx ≤ ε .
f ∈Fm

b ) = 1 Pn 1{f (x )̸=y } . Derive a PAC generalization bound which


(b) Let R(f ) = E[1{f (x)̸=y} ] and R(f n i=1 i i
states that with probability at least 1 − δ
s
b ) + log(|Fm |/δ)
R(f ) ≤ R(f , for all f ∈ Fm ,
n/2

where |Fm | is the cardinality of Fm . Use this to obtain a generalization error bound for the minimum
empirical risk minimizer fb = arg minf ∈Fm R(f
b ). Simplify the bound so that it depends only on R( b fb),
m, n, and δ.
(c) Now consider the general case with d ≥ 1. Specify the design of a discrete set of classifiers Fε such
that for any linear classifer g on [0, 1]d
Z
min |f (x) − g(x)| dx ≤ ε .
f ∈Fε x∈[0,1]d

b fb), ε, n, d and δ.
Derive a generalization error bound for this case expressed in terms of R(

2. Recall the histogram classifier studied in Lecture 2.4. Let FM denote the set of all histogram classifiers with
M equal-volume bins B1 , . . . , BM .

(a) Show that fbnH defined in Lecture 2.4 minimizes the empirical risk.
(b) Derive a bound on E[R(fbnH )] − minf ∈FM R(f ) using the empirical risk bounds developed in Sec-
tion 17.1.

Lecture 18: Learning in Infinite Model Classes

Let $R(f) = P(f(x) \ne y)$ and let $\hat{R}(f) = \frac{1}{n}\sum_{i=1}^n 1_{\{f(x_i)\ne y_i\}}$. The PAC bound for a finite model class $\mathcal{F}$ may be stated as
$$ P\left( \max_{f\in\mathcal{F}} |\hat{R}(f) - R(f)| \ge \epsilon \right) \le 2|\mathcal{F}|e^{-2n\epsilon^2} , $$
where $|\mathcal{F}|$ is the number of models in the class. We can also state this as a generalization bound. For any $\delta > 0$ and for every $f \in \mathcal{F}$, with probability at least $1 - \delta$,
$$ R(f) \le \hat{R}(f) + \sqrt{\frac{\log(|\mathcal{F}|/\delta)}{2n}} . $$
Since this holds for every model in $\mathcal{F}$, it holds for the empirical risk minimizer $\hat{f} = \arg\min_{f\in\mathcal{F}} \hat{R}(f)$. So if $\min_{f\in\mathcal{F}} \hat{R}(f)$ is small and if $n$ is large compared to $\log(|\mathcal{F}|/\delta)$, then $R(\hat{f}) = P(\hat{f}(x) \ne y)$ is probably small too.

Now we will generalize this sort of result to cases in which the model class is infinite. Linear classifiers are a prime example of such a class. Let $\mathcal{F}$ denote the set of all linear classifiers of the form
$$ f_w(x) = \begin{cases} +1 & \text{if } w^Tx + b \ge 0 \\ -1 & \text{if } w^Tx + b < 0 \end{cases} $$
for some $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$. There are an infinite number of choices for the weight $w$ and bias $b$, and so $|\mathcal{F}| = \infty$.

However, suppose we have a training set $\{(x_i, y_i)\}_{i=1}^n$. Each linear classifier defines a hyperplane (separator) in $\mathbb{R}^d$. Consider any one of the hyperplanes. It splits $\mathbb{R}^d$ into two halfspaces. Note that we can move the hyperplane until it just touches one or more of the points, without changing the binary labeling it produces. In particular, if $n \ge 2d$ there must be at least $d$ points in one of the halfspaces, and we can move the hyperplane until it touches $d$ points. Since $d$ points (in general position) define a hyperplane in $\mathbb{R}^d$, these $d$ points define two of the possible labelings that can be realized by linear classifiers (two because we can assign $+1$ to either one of the halfspaces it generates). In effect, this particular hyperplane represents all the linear classifiers that produce the same labelings of the dataset. There are $\binom{n}{d}$ unique subsets of $d$ points, so there are at least $2\binom{n}{d}$ possible ways linear classifiers can label the dataset. In fact, the total number is larger than this, since we also must consider cases where fewer than $d$ points define the hyperplane separator. The total number of unique labelings of $n$ points in $d$ dimensions using hyperplanes is $S(\mathcal{F}, n) := 2\sum_{k=0}^{d}\binom{n-1}{k}$. In other words, there are effectively at most $S(\mathcal{F}, n)$ unique linear classifiers for $n$ points in $\mathbb{R}^d$; each may be represented by a specific hyperplane and pair of linear classifiers [9], as we defined above. The quantity $S(\mathcal{F}, n)$ is called the shatter coefficient of $\mathcal{F}$ (we will discuss this further later).
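A one-function sketch (not part of the original notes) evaluating this shatter coefficient formula; the values of $n$ and $d$ are illustrative.

    from math import comb

    def shatter_linear(n, d):
        # S(F, n) = 2 * sum_{k=0}^{d} C(n-1, k): labelings of n points in general
        # position in R^d realizable by hyperplane classifiers.
        return 2 * sum(comb(n - 1, k) for k in range(d + 1))

    print(shatter_linear(n=10, d=2))    # 92, far fewer than 2**10 = 1024 possible labelings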

Since the shatter coefficient is the effective size of the class, and it is finite since $n$ is finite, it is tempting to apply the standard PAC bound for finite classes. However, there is a subtle issue. The finite set of representative linear classifiers depends on the specific locations of $x_1, \ldots, x_n$. This means that the set of representative classifiers is data-dependent, and the argument used for the standard PAC bound assumed the set of classifiers to be fixed and deterministic (not depending on the training examples). The problem is that if we consider a representative classifier $f$ that depends on $\{x_i\}$, then the errors $1_{\{f(x_i)\ne y_i\}}$ are no longer independent random variables.

18.1. Rademacher Complexity

Let $\mathcal{F}$ be an infinite model class. Our goal is to derive a bound of the form
$$ P\left( \sup_{f\in\mathcal{F}} |\hat{R}(f) - R(f)| \ge \epsilon \right) \le B(n, \epsilon) , $$
for some function $B(n, \epsilon)$. Bounds of this type are called uniform deviation bounds, since they bound the largest deviations over all possible $f \in \mathcal{F}$. For the class of linear models discussed above, we will show that
$$ B(n, \epsilon) = 8S(\mathcal{F}, n)e^{-n\epsilon^2/32} $$
will suffice. This shows that the shatter coefficient, which is the number of linear classifiers needed to represent all possible labelings of the dataset, indeed plays the role that $|\mathcal{F}|$ played in the case of finite classes. Rademacher complexity is a standard approach to construct uniform deviation bounds.

Let $\ell_1(f), \ldots, \ell_n(f)$ be i.i.d. bounded random variables satisfying $\ell_i(f) \in [0, 1]$ and indexed by functions $f \in \mathcal{F}$. In the next section, we will view $\ell_i(f)$ as the prediction error using $f$ on the $i$th training example. In other words, $\ell_i(f)$ could represent the 0-1 loss $1_{\{f(x_i)\ne y_i\}}$ or any other bounded loss function. Denote the ensemble and empirical mean by $R(f) = E[\ell_1(f)]$ and $\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^n \ell_i(f)$, respectively. We will use the following lemma.

Lemma 18.1.1. (McDiarmid’s Bounded Difference Inequality). Let g : Rn → R be a function satisfying

sup |g(ℓ1 , . . . , ℓi−1 , ℓi , ℓi+1 , . . . , ℓn ) − g(ℓ1 , . . . , ℓi−1 , ℓ′i , ℓi+1 , . . . , ℓn )| ≤ ci


ℓ1 ,...,ℓn ,ℓ′i

for some constant ci ≥ 0 for i = 1, . . . , n. Then if ℓ1 , . . . , ℓn are independent random variables


 
2t2
P(g(ℓ1 , . . . , ℓn ) − E[g(ℓ1 , . . . , ℓn )] ≥ t) ≤ exp − Pn 2 .
i=1 ci

Proof. For short


Phand
n
we will let G = g(ℓ1 , . . . , ℓn ) and define Pn Vi = E[G|ℓ1 , . . . , ℓi ] − E[G|ℓ1 , . . . , ℓi−1 ]. Then
G − E[G] = i=1 E[G|ℓ 1 , . . . , ℓ i ] − E[G|ℓ1 , . . . , ℓ i−1 ] = i=1 Vi . Note that E[Vi |ℓ1 , . . . , ℓi−1 ] = 0 and the
absolute value of Vi |ℓ1 , . . . , ℓi−1 is less than or equal to ci by assumption. Therefore, since it is a bounded random
2 2
variable, its moment generating function satisfies E[esVi |ℓ1 , . . . , ℓi−1 ] ≤ es ci /8 . Now for any t > 0 we have
!
Xn
P(G − E[G] ≥ t) = P Vi ≥ t
i=1
s n
P Pn
= P(e i=1 Vi ≥ est ) ≤ e−st E[es i=1 Vi
] , Markov’s inequality
h Pn−1 i
−st s i=1 Vi sVn
= e E e E[e |ℓ1 , . . . , ℓn−1 ]
2 2
Pn−1
≤ e−st es cn /8 E[es i=1 Vi
]
..
.
2
Pn 2
≤ e−st es i=1 ci /8
P
The result follows by taking s = 4t/ ni=1 c2i .

98
The function g(ℓ1 , . . . , ℓn ) := supf ∈F R(f ) − R bn (f ) satisfies the bounded differences assumption with ci = 1/n.
This is verified as follows. Suppose that ℓj = ℓ′j for all j ∈ {1, . . . , n} except a certain index i. Then
X X
|g(ℓ1 , . . . , ℓn ) − g(ℓ′1 , . . . , , ℓ′n )| ≤ sup n−1 (ℓj − Eℓj ) − sup n−1 (ℓ′j − Eℓ′j )
f ∈F j f ∈F j
X X 
−1 −1 ′ ′
≤ sup n (ℓj − Eℓj ) − n (ℓj − Eℓj )
f ∈F j j

≤ sup(ℓi − Eℓi )/n − (ℓ′i − Eℓ′i )/n ≤ 1/n ,


f ∈F

bn (f ) − R(f ). Everything that follows


since ℓi , ℓ′i ∈ [0, 1]. Note that this also holds for −g(ℓ1 , . . . , ℓn ) = supf ∈F R
holds for g and −g.

Therefore, for any δ ∈ (0, 1), with probability at least 1 − δ,


r
h i log(1/δ)
bn (f ) ≤ E sup R(f ) − R
sup R(f ) − R bn (f ) + .
f ∈F f ∈F 2n
h i
The next step is to bound E supf ∈F R(f ) − R bn (f ) . Since R(f ) depends on the underlying (and unknown) data
distribution, our approach will aim to eliminate this unknown quantity by a symmetrization step. We will bound
the difference between R(f ) and R bn (f ) by the difference between two independent versions of R bn (f ). Intuitively,
this makes sense since both are equal to R(f ) in expectation and the difference between two independent empirical
risks will tend to have even larger deviations.

To this end, introduce a “ghost sample” ℓ′ = {ℓ′1 (f ), . . . , ℓ′n (f )} independent of ℓ = {ℓ1 (f ), . . . , ℓn (f )} and
b′ (f ) denote the empirical mean of the ghost sample. Then by Jensen’s inequality
distributed identically. Let Rn
h i h  h ii h i
b b ′ b
Eℓ sup R(f ) − Rn (f ) = Eℓ sup Eℓ′ Rn (f ) − Rn (f ) ℓ1 (f ), . . . , ℓn (f ) b ′ b
≤ Eℓ,ℓ′ sup Rn (f ) − Rn (f ) .
f ∈F f ∈F f ∈F

Above we are using the simple fact that the supremum of an average is less than or equal to the average of the
supremum.

Now let σ = {σ1 , . . . , σn } be independent Rademacher random variables with P(σi = ±1) = 1/2, independent
of ℓi and ℓ′i . Then we have
" #
h i 1 X n
Eℓ,ℓ′ sup R bn′ (f ) − R
bn (f ) = Eℓ,ℓ′ sup ℓ′i (f ) − ℓi (f )
f ∈F f ∈F n i=1
" #
1X
n
= Eℓ,ℓ′ ,σ sup σi (ℓ′i (f ) − ℓi (f )) , by symmetry
f ∈F n i=1
" #
1X ′ 1X
n n
≤ Eℓ,ℓ′ ,σ sup σi ℓi (f ) + sup σi ℓi (f )
f ∈F n i=1 f ∈F n i=1
" #
1X
n
= 2 E sup σi ℓi (f ) .
f ∈F n i=1

The Rademacher Complexity of the class F with loss function ℓ is


" #
1X
n
Rn (ℓ(F)) := 2 E sup σi ℓi (f ) ,
f ∈F n i=1

99
where the expectation is taken with respect to {σi } and {ℓi }. The Rademacher complexity measures how easy it
is to find a function in F that correlates with random sign patterns [2]. Richer classes of functions have a higher
complexity. If we take the expectation only over {σi }, holding {ℓi } fixed, then we have the so-called empirical
Rademacher complexity " #
1 Xn
Rb n (ℓ(F)) := 2 E sup σi ℓi (f ) {ℓi } .
f ∈F n i=1
q
b
By McDiarmid’s inequality, Rn (ℓ(F)) ≤ Rn (ℓ(F)) + 2 log(1/δ) ; the factor of 2 appears due to the factor of 2 in
2n
the definition of the Rademacher complexity. The empirical Rademacher complexity is sometimes useful because
we can easily construct a Monte Carlo estimate of it. Putting everything together, we have the following result.

Theorem 12. With probability at least 1 − δ


r
bn (f ) ≤ Rn (ℓ(F)) + log(1/δ)
sup R(f ) − R
f ∈F 2n

and r
b n (ℓ(F)) + 3
bn (f ) ≤ R log(2/δ)
sup R(f ) − R
f ∈F 2n

Note that 1/δ replaced by 2/δ because we are union bounding over the original Rademacher bound and the McDi-
armid bound on the empirical Rademacher bound. As mentioned above, we could have just as easily considered
bn (f ) − R(f ) and obtained the same result. So we can also state two-sided bounds such as
R
r
bn (f )| ≤ Rn (ℓ(F)) + log(2/δ)
sup |R(f ) − R
f ∈F 2n

with probability at least 1 − δ. Note that the inequalities above can be stated as generalization bounds: For any
δ > 0 and for all f ∈ F, with probability at least 1 − δ
r
bn (f ) + Rn (ℓ(F)) + log(1/δ)
R(f ) ≤ R .
2n

In particular, the bound holds for fb that minimizes R.


b

18.2. Generalization Bounds for Classification with 0/1 Loss

Now let us specialize the results above to the case of binary classification with 0/1 loss. The Rademacher com-
plexity above Rn (ℓ(F)) depends implicitly on the choice of loss. Specifically, consider the 0/1 loss, ℓi (f ) =
1{f (xi )̸=yi } . Define the empirical Rademacher complexity for F (without a loss) to be
h 1X
n i
b
Rn (F) := Eσ sup σi f (xi ) .
f ∈F n i=1

The expectation Rn (F) := E[Rb n (F)], where the expectation is with respect to x1 , . . . , xn , is what is normally
referred to as the Rademacher complexity of the class F. For ℓ chosen to be 0/1 loss, we can relate these

100
complexities as follows.
h X n i
b n (ℓ(F)) = 2Eσ sup 1
R σi 1{f (xi )̸=yi }
f ∈F n i=1
h 1 X  1 − yi f (xi ) i
n
= 2Eσ sup σi
f ∈F n i=1 2
h1 X n
1X
n i
= Eσ σi + sup σi (−yi )f (xi )
n i=1 f ∈F n i=1
h 1X
n i
= Eσ sup σi yi f (xi ) = R b n (F) .
f ∈F n i=1

b )= 1
Pn
Also for 0/1 loss, R(f ) = P(f (x) ̸= y) and R(f n i=1 1{f (xi )̸=yi } . Using the results above, for any δ > 0
and for all f ∈ F, with probability at least 1 − δ
r
bn (f ) + Rn (F) + log(1/δ)
R(f ) ≤ R . (18.1)
2n

18.3. Exercises

1. The Hamming distance between two sequences ℓ = (ℓ1 , . . . , ℓn ) and ℓ′ = (ℓ′1 , . . . , ℓ′n ) is defined to be
the number of coordinates where ℓi ̸= ℓ′i . Denote this by dH (ℓ, ℓ′ ). For any set A of such sequences, the
distance of ℓ to A is dH (ℓ, A) := minℓ′ ∈A dH (ℓ, ℓ′ ). If ℓ1 , . . . , ℓn are independent random variables, prove
that   2
P dH (ℓ, A) − E[dH (ℓ, A)] ≥ t ≤ 2e2t /n
q
b n (F) + 2
2. Give a formal proof of the bound Rn (F) ≤ R log(1/δ)
.
2n

3. This exercise considers an application of the Rademacher complexity bounds to histogram classifiers. Con-
sider a classification setting with training data {(xi , yi )}ni=1 , where xi ∈ [0, 1]d and yi ∈ {−1, +1}.

(a) Let F beP the set of histogram classifiers with m bins B1 , . . . , Bm . Any f ∈ F can be written as
m
f (x) = j=1 Lj (f ) 1{x∈Bj } , where Lj (f ) ∈ {−1, +1} is the label f assigns to points in bin Bj .
Give an expression for the histogram classifier fb that minimizes the error on the training set.
(b) Show that the empirical Rademacher complexity can be bounded as follows
X√ m
b n (F) ≤ 1
R nj ,
n j=1

where nj is the number of training examples falling in bin j.


(c) Use Rb n (F) to bound the generalization error P(y ̸= fb(x)). Explain how this bound can provide
guidance on selecting the number of bins m.

101
Lecture 19: Vapnik-Chervonenkis Theory

Recall the set of linear classifiers in Rd


n o
F = f : Rd → {−1, +1} , f (x) = sign(wT x + b), w ∈ Rd , b ∈ R .

There are an infinite number of choices for the weight w and biasPb, and so |F| = ∞. However, for any training
dataset consisting of n examples there are at most S(F, n) := 2 dk=0 n−1 k
possible ways linear classifiers can
label the dataset [9]. The quantity S(F, n) is called the shatter coefficient of F.

Remark: Throughout this lecture we will exclusively consider the 0/1 loss function in all our analysis and results.

19.1. Shatter Coefficient and VC Dimension

Vapnik-Chervonenkis (VC) theory is based on a generalization of this idea. The main intuition behind VC theory
is that, although a collection of classifiers may be infinite, using a finite set of training data to select a good rule
effectively reduces the number of different classifiers we need to consider. We can measure the effective size of a
class F using the shatter coefficient. Suppose we have a training set Dn = {(xi , yi )}ni=1 for a binary classification
problem with labels yi ∈ {−1, +1}. Because there are only two possible labels for each xi , each classifier in F
produces a binary label sequence  
f (x1 ), . . . , f (xn ) ∈ {−1, +1}n .
There at are at most 2n distinct sequences, but often not all sequences can be generated by functions in F. Let
S(F, n) be the maximum number of labeling sequences the class F induces over n training points in the feature
space X . More formally,
Definition 19.1.1. The shatter coefficient of class F is defined as
n o
S(F, n) = max (f (x1 ), . . . , f (xn )) ∈ {−1, +1}n , f ∈ F ,
x1 ,...,xn ∈X

where | · | denotes the number of elements in the set.

Clearly S(F, n) ≤ 2n , but often it is much smaller. The class of linear classifiers is a canonical example.

The shatter coefficient S(F, n) is a measure of the “effective size” of F with respect to a training set of size n.
The sample complexity of selecting a classifier from a set of size N is O(log N ), because of the union bound.
Thus, log S(F, n) measures the “effective dimension” of F.
Definition 19.1.2. The Vapnik-Chervonenkis (VC) dimension is defined as the largest integer k such that S(F, k) =
2k . The VC dimension of a class F is denoted by V (F).

Note that the VC dimension is not a function of the number of training data. We have the following result,
presented here without a proof.
Lemma 19.1.3. Sauer’s Lemma:
S(F, n) ≤ (n + 1)V (F ) .
Example 19.1.4. Linear classifiers in d dimensions. Note that if n ≤ d+1, then every possible labeling sequence
can be realized by a linear classifier, but it is not possible if n > d+1. Thus, the VC dimension P of linear classifiers

d
in R is d + 1. Sauer’s Lemma shows that S(F, n) ≤ (n + 1) (d+1)
. Recall that S(F, n) = 2 dk=0 n−1k
, which
indeed is less than Sauer’s bound.

102
19.2. The VC Inequality

The main result in VC theory is the following theorem, which yields generalization bounds in terms of the shatter
coefficient and VC dimension.

Theorem 19.2.1. (Vapnik-Chervonenkis ’71): Let F be a class of binary classifiers with shatter coefficient
S(F, n). For any ϵ > 0  
P sup |Rbn (f ) − R(f )| ≥ ϵ ≤ 2S(F, n)e−nϵ2 /8 ,
f ∈F

or equivalently for any δ > 0, with probability at least 1 − δ


r
bn (f ) − R(f )| ≤ 8(log S(F, n) + log(2/δ))
sup |R .
f ∈F n

Using Sauer’s bound, we can state a generalization bound of the in terms of the VC dimension V (F). For any
δ > 0 and every f ∈ F, with probability at least 1 − δ
r
bn (f ) + 8(V (F) log(n + 1) + log(1/δ))
R(f ) ≤ R .
n

Proof. We will use the following Rademacher complexity bound from 18.1. For any δ > 0, with probability at
least 1 − δ r
  log(1/δ)
sup R(f ) − R bn (f ) ≤ Rn (F) + .
f ∈F 2n
n o
n
Let D = {(xi , yi )}i=1 denote a dataset and let FD = (f (x1 ), . . . , f (xn )) : f ∈ F , the set of all labelings of
the datapoints that can be generated using classifiers in F. The shatter coefficient S(F, n) bounds the cardinality
of FD . We can bound the Rademacher complexity as follows.
" " ## " " ##
1X 1X
n n
Rn (F) = ED Eσ sup σi f (xi ) = ED Eσ sup σi f (xi ) .
f ∈F n i=1 f ∈FD n i=1

We next apply the following lemma (which we will prove later).


Lemma 19.2.2. (Massart’s Lemma) Let A ⊂ Rn , with |A| < ∞. Set r = maxu∈A ∥u∥2 and let u = (u1 , . . . , un )T .
Then " # p
1 Xn
r 2 log |A|
Eσ sup σi ui ≤ .
n u∈A i=1 n

P
In our setting, take A = FD and notice that for every sequence u ∈ FD we have ∥u∥22 = ni=1 (±1)2 = n. So
applying the lemma we have
"p # "p # r
2n log |FD | 2n log S(F, n) 2 log S(F, n)
Rn (F) ≤ ED ≤ ED = .
n n n

Thus we have r r
  2 log S(F, n) log(1/δ)
bn (f )
sup R(f ) − R ≤ + .
f ∈F n 2n

103
√ √ √ √ √
Observe that for any a, b ≥ 0 we have a + b ≤ a + b + a + b = 2 a + b. Therefore,
s 
  8 log S(F, n) + log(1/δ)
sup R(f ) − R bn (f ) ≤ .
f ∈F n
 
b
The two-sided version follows by repeating the same argument for supf ∈F Rn (f ) − R(f ) and union bounding
over the two cases.

19.2.1. Proof of Massart’s Lemma

Lemma 19.2.3. (Massart’s Lemma) Let A ⊂ Rn , with |A| < ∞. Set r = maxu∈A ∥u∥2 . Then
" # p
1 Xn
r 2 log |A|
Eσ sup σi ui ≤ .
n u∈A i=1 n

Proof. For all t ≥ 0 we have


 h X
n i h 
X
n i
exp t Eσ sup σi ui = exp Eσ t sup σi ui
u∈A u∈A
i=1 i=1
h  X n i
≤ Eσ exp t sup σi ui , by Jensen’s inequality
u∈A
i=1
h  Xn i
= Eσ sup exp t σi ui , since exp is strictly increasing
u∈A
i=1
X h  Xn i
≤ Eσ exp t σi ui , by the union bound
u∈A i=1
XY
n h i
= Eσi exp tσi ui , since σi are iid .
u∈A i=1

Recall that we encountered expectations of exponential functions this form in the Chernoff bounding method. In
particular, Lemma 16.4.1 shows that since the random variable σi ui has mean zero and is bounded |σi ui | ≤ |ui |,
2 2
we have Eσi [etσi ui ] ≤ et (2ui ) /8 . Thus, we have
 h X
n i XY
n h  i XY
n
2 (2u )2 /8
exp t Eσ sup σi ui ≤ Eσi exp tσi ui ≤ et i

u∈A
i=1 u∈A i=1 u∈A i=1
X  t2 ∥u∥2  X  2 2
tr  t2 r2 
2
= exp ≤ exp = |A| exp
u∈A
2 u∈A
2 2

Now take the log of both sides and divide by t to obtain


h X
n i log |A| tr2
Eσ sup σi ui ≤ + .
u∈A
i=1
t 2

Choosing t to minimize this bound yields the result.

104
19.3. Exercises

1. Consider a binary classification problem with features in [0, 1]d and binary labels +1 and −1. For classifiers,
let’s use the set of all linear (hyperplane) classifiers in Rd .

(a) How many training examples are sufficient to learn linear classifier that (with large probability) has a
probability of error at most ϵ > 0 larger than that of the best possible linear classifier?
(b) Suppose that the Bayes optimal classifier f ∗ (i.e., the classifier that minimizes the probability of error)
is nonlinear and that the minimum probability of error achievable using a linear classifer is 0 < γ < 1
larger than the probability of error of the Bayes classifier. How many samples would suffice to learn a
linear classifier with probability of error at most η > γ in this case?

2. Show the following monotonicity property of VC-dimension: if F and F ′ are hypothesis classes with
F ⊂ F ′ , then the VC-dimension of F ′ is greater than or equal to that of F.

3. Suppose that F is the set of all axis aligned rectangles in Rd . How many training examples are needed to
learn an (ϵ, δ)-PAC classifier in F.

105
Lecture 20: Learning with Continuous Loss Functions

Let {(xi , yi )}ni=1 be iid training examples and let ℓ be a loss function. Consider the empirical risk function
bn (f ) = 1 Pn ℓ(yi , f (xi )) and its expectation R(f ) = E[ℓ(y, f (x)]. Assume the losses are bounded in [0, 1].
R n i=1
Theorem 12 states that with probability at least 1 − δ
r
bn (f ) ≤ Rn (ℓ(F)) + log(1/δ)
sup R(f ) − R
f ∈F 2n
where the Rademacher complexity with respect to ℓ is
" #
1X
n
Rn (ℓ(F)) = 2 E sup σi ℓ(yi , f (xi )) .
f ∈F n i=1

We will apply these bounds to continuous loss functions like


hinge: ℓ(y, f (x)) = max(0, 1 − yf (x))
logistic: ℓ(y, f (x)) = log(1 + exp(−yf (x)))

If the losses are bounded in [0, 1], then Rn (ℓ(F)) ≤ 1. Observe that Rn (ℓ(F)) = 1 leads to a vacuous bound
since if the losses are bounded in [0, 1], then trivially R(f ) ≤ 1. So generalization bounds based on Rademacher
complexity are only meaningful if Rn (ℓ(F )) < 1. In fact, the bounds are only interesting if Rn (ℓ(F)) decays as
n grows. Why might this sort of decay happen? Recall that one interpretation of Rn (ℓ(F)) is that it measures how
well functions in F can correlate with a random sequence of ±1 values. As n grows, it becomes more and more
difficult to match such a sequence. Even if F is infinite it may not be possible to do this for large n. To see this,
start with a finite set F0 with a certain Rademacher complexity. Suppose that we create a number of additional
functions that are small perturbations of each f ∈ F0 . Let F1 denote the new larger set. Since each f ∈ F1
is very close to one of the f ∈ F0 , the arguments yi f (xi ) of the losses above will not change much, and we
have Rn (ℓ(F1 )) ≈ Rn (ℓ(F0 )). In other words, increasing the set does not necessarily increase the Rademacher
complexity.

20.1. Generalization Bounds for Continuous Loss Functions

The 0/1 loss is natural for binary classification since its expectation is the probability of misclassification. How-
ever, minimizing the empirical risk is difficult due to the discontinuous nature of 0/1 loss. This is the main reason
we work with continuous loss functions like the hinge loss or logistic loss. These can be easily minimized using
gradient descent procedures. We will assume that yi = ±1 and thus the sign of the f (xi ) is the predicted label.
Note that the hinge and logistic losses are functions of z = yf (x). So we will express the loss as ℓ(yf (x)), a
function of a single scalar argument z = yf (x).

The bounds in Theorem 12 apply in such cases (assuming the losses are bounded). However, we would like to be
able to compute or bound the Rademacher complexity Rn (ℓ(F)), so that we can determine its dependence on n.
To the end, we will first bound Rn (ℓ(F)) in terms of
h Xn i
Rn (F) = 2 E sup σi f (xi )
f ∈F i=1

using the following lemma, and then bound Rn (F). It is often easier to bound Rn (F) than bounding Rn (ℓ(F))
directly.

106
Lemma 20.1.1. Assume the loss ℓ is L-Lipschitz: |ℓ(z) − ℓ(z ′ )| ≤ L |z − z ′ |. Then
1X 1X 1X
n n n

Rn (ℓ(F)) = 2 E sup σi ℓ yi f (xi ) ≤ 2L E sup σi yi f (xi ) = 2L E sup σi f (xi )
f ∈F n i=1 f ∈F n i=1 f ∈F n i=1

The hinge and logistic losses are 1-Lipschitz functions. We will prove the lemma in a bit, but first let us apply it
to an interesting case.

20.1.1. Application to Linear Classifiers

Theorem 20.1.2. Consider linear classifiers of the form f (x) = wT x, with ∥w∥2 ≤ 1 and ∥x∥2 ≤ 1. Let
B1d = {x ∈ Rd : ∥x∥2 ≤ 1} and define

Flin := f : B1d → R , f (x) = wT x , ∥w∥2 ≤ 1 .
Assume that the loss ℓ is L-Lipschitz. Then
2L
Rn (ℓ(Flin )) ≤ 2L Rn (Flin ) ≤ √ .
n

Proof. First inequality follows from Lemma 20.1.1. Let ∥ · ∥ denote the Euclidean norm ∥ · ∥2 . Now consider
h 1X
n i h 1X
n i
Rn (Flin ) = E sup σi f (xi ) = E sup σi wT xi
f ∈Flin n i=1 ∥w∥≤1 n i=1
h 1 Xn i
≤ E sup ∥w∥ σi xi , by Cauchy-Schwartz inequality
∥w∥≤1 n i=1
v v
h i hu n i u h n
X X X 2i
n
1 1 t u 2 1u
≤ E σi xi = E σi xi ≤ tE σ i xi , by Jensen’s inequality
n i=1
n i=1
n i=1
v " # v
u u n
1u X n
1u X 1
≤ t E σi σj xi xj = t
T
∥xi ∥2 = √ .
n i,j=1
n i=1 n

To conclude, we have shown the following result (which also holds for logistic loss).
b be a solution to the convex
Corollary 1. Assume yi ∈ [−1, 1] and ∥xi ∥2 ≤ 1 for i = 1, . . . , n, and let w
optimization
Xn

min 1 − yi w T xi + .
w:∥w∥2 ≤1
i=1
Then with probability at least 1 − δ
r
1X
n
T
 2 2 log 1/δ
b x)) ≤
P(y ̸= sign(w b T xi + + √ +
1 − yi w .
n i=1 n n

107
In this case, the losses are bounded in [0,q
2] rather than [0, 1]; q
this follows from the Cauchy-Schwartz inequality.
Therefore, the last term in the bound is 2 logn1/δ instead of log2n1/δ (this follows from the application McDi-
ramid’s inequality). Although here we specialized our discussion to the hinge loss, similar arguments can be used
for other Lipschitz losses such as the logistic loss, which is also 1-Lipschitz.

20.2. Proof of Lemma 20.1.1

Lemma 20.2.1. (Lipschitz Property of Rademacher Complexity). Suppose {ϕi } and {ψi } are two sets of functions
on domain F such that for each i and f, f ′ ∈ F,

|ϕi (f ) − ϕi (f ′ )| ≤ |ψi (f ) − ψi (f ′ )| (A1)

Let σ = {σ1 , σ2 , . . . } be a sequence of i.i.d. Rademacher random variables. Then


h X
n i h X
n i
Eσ sup σi ϕi (f ) ≤ Eσ sup σi ψi (f ) .
f i=1 f i=1

Proof. The proof is similar to one given in [19].


h X
n i h n X
n oi
Eσ1 ,...,σn sup σi ϕi (f ) = Eσ2 ,...,σn Eσ1 sup σ1 ϕ1 (f ) + σi ϕi (f )
f i=1 f i=2
h1 n X n o 1 n Xn oi
′ ′
= Eσ2 ,...,σn sup ϕ1 (f ) + σi ϕi (f ) + sup − ϕ1 (f ) + σi ϕi (f )
2 f i=2
2 f′ i=2
h n ϕ (f ) − ϕ (f ′ ) Pn σ ϕ (f ) + Pn σ ϕ (f ′ ) oi
1 1 i i i=2 i i
= Eσ2 ,...,σn sup + i=2
f,f ′ 2 2
h n |ϕ (f ) − ϕ (f ′ )| Pn σ ϕ (f ) + Pn σ ϕ (f ′ ) oi
1 1 i i i=2 i i
= Eσ2 ,...,σn sup + i=2 , (#)
f,f ′ 2 2
h n |ψ (f ) − ψ (f ′ )| Pn σ ϕ (f ) + Pn σ ϕ (f ′ ) oi
1 1 i i i=2 i i
≤ Eσ2 ,...,σn sup + i=2 , by A1
f,f ′ 2 2
h n ψ (f ) − ψ (f ′ ) Pn σ ϕ (f ) + Pn σ ϕ (f ′ ) oi
1 1 i i i=2 i i
= Eσ2 ,...,σn sup + i=2 , (#)
f,f ′ 2 2
h n X n oi
= Eσ2 ,...,σn Eσ1 sup σ1 ψ1 (f ) + σi ϕi (f ) , reverse of step in 2nd line above.
f i=2

To see that step (#) holds, note that if ϕ1 (f ) < ϕ1 (f ′ ) (or ψ1 (f ) < ψ1 (f ′ )), then swapping f and f ′ increases
the difference term while leaving the others fixed. Now continue from the last line above by repeating the same
argument with respect to σ2 . This yields
h X
n i h nX
2 X
n oi
Eσ1 ,...,σn sup σi ϕi (f ) ≤ Eσ1 ,...,σn sup σi ψi (f ) + σi ϕi (f )
f i=1 f i=1 i=3

Continuing this process for σ3 , . . . , σn yields the claimed inequality.

108
Corollary 2. Consider a finite collection of stochastic processes z1 (f ), z2 (f ), . . . , zn (f ) indexed by f ∈ f . Let
σ1 , . . . , σn be independent Rademacher random variables (i.e., each σi takes values −1 and +1 with probabilities
1/2). Then for any L−Lipschitz function ℓ (i.e., |ℓ(z) − ℓ(z ′ )| ≤ L|z − z ′ |, ∀z, z ′ )
" # " #
Xn
 Xn
E sup σi ℓ zi (f ) ≤ L E sup σi zi (f ) .
f ∈f i=1 f ∈f i=1


Proof. Apply the lemma above with ϕi (f ) = φ zi (f ) , ψi (f ) = Lzi (f ).

The following 2-sided generalization of the corollary above, sometimes called the “contraction” property of
Rademacher complexity, can be found in [6] (note the extra factor of 2 due to the absolute value).

Corollary 3. Consider a finite collection of stochastic processes z1 (f ), z2 (f ), . . . , zn (f ) indexed by f ∈ f . Let


σ1 , . . . , σn be independent Rademacher random variables (i.e., each σi takes values −1 and +1 with probabilities
1/2). Then for any L−Lipschitz function ℓ (i.e., |ℓ(z) − ℓ(z ′ )| ≤ L|z − z ′ |, ∀z, z ′ )
" # " #
Xn
 X n
E sup σi ℓ zi (f ) ≤ 2L E sup σi zi (f ) .
f ∈f i=1 f ∈f i=1

20.3. Exercises

1. Prove that the hinge and logistic loss functions are 1-Lipschitz.

2. Let ρ1 , ρ2 > 0. Derive a generalization bound like the one in (1) for the class
n o
Flin (ρ1 , ρ2 ) = f : f (x) = wT x, ∥w∥2 ≤ ρ1 , ∥x∥2 ≤ ρ2

3. Derive a generalization bound like the one in (1) for the class
n o
Flin,1,∞ = f : f (x) = wT x, ∥w∥1 ≤ 1, ∥x∥∞ ≤ 1

Hint: Use the fact that |wT x| ≤ ∥w∥1 ∥x∥∞ and Massart’s inequality.

4. Consider quadratic prediction functions of the form f (x) = xT W x + xT w, with ∥W ∥2 ≤ 1 and ∥w∥2 ≤
1, where ∥W ∥2 is the spectral norm of W . Assuming ∥x∥2 ≤ 1, derive a generalization bound analogous
to the one in Corollary 1.

109
Lecture 21: Introduction to Function Spaces

Let F be a class of functions. Given a training dataset {(xi , yi )}, we can pose the empirical risk minimization
problem
Xn
min ℓ(yi , f (xi )) .
f ∈F
i=1
The solution is a function f ∈ F that minimizes the sum of losses on the training data (i.e., that fits the training
data best). Learning linear classifiers is a good example. The function space of all (homogeneous) linear functions
on Rd is n o
F = f : f (x) = wT x , w ∈ Rd .
We can limit this further by restricting the norm of w. Define the function class
n o
FB = f : f (x) = wT x , ∥w∥ ≤ B .
In this case we have
X
n X
n
min ℓ(yi , f (xi )) ≡ min ℓ(yi , wT xi )
f ∈FB w:∥w∥≤B
i=1 i=1
and we can solve the optimization using gradient descent methods. The problem is also equivalent to the regular-
ized optimization
Xn
min ℓ(yi , wT xi ) + λB ∥w∥2
w∈Rd
i=1
for a certain regularization parameter λB > 0.

We can generalize this to other function classes as follows. Let ∥f ∥ denote the norm of the function f . Norms
map functions to real numbers, and satisfy: ∥f ∥ ≥ 0, ∥f + g∥ ≤ ∥f ∥ + ∥g∥, and if ∥f ∥ = 0, then f = 0. There
are many ways to define norms on functions, which we will discuss in more detail below. Norms based on the
P R (k) 1/2
integrals of f or its kth derivatives f (k) are common. For example, ∥f ∥ := K k=0 |f (x)|2
dx . Given a
norm, we can define a function space F = {f : ∥f ∥ < ∞} and classes of functions as FB = {f : ∥f ∥ ≤ B}
and then consider learning with this class by solving optimizations of the form
Xn X
n
min ℓ(yi , f (xi )) or min ℓ(yi , f (xi )) + λB ∥f ∥2 .
f ∈FB f ∈F
i=1 i=1
This leads to a number of questions:

1. What sorts of norms and function classes are useful in machine learning?
2. Can we derive generalization bounds for classes defined in terms of function norms?
3. If F is not defined in terms of a finite number of parameters, can we efficiently solve the optimizations?

21.1. Constructions of Function Classes

21.1.1. Parameteric Classes

The simplest way to construct a function class is in terms of a set of parameters or weights. The linear functions
in the class above are one example, and polynomial functions and neural networks others. For example, a single

110
P
output, two-layer neural network has the form f (x) = K T
k=1 vk ϕ(wk x + bk ), where the activation function ϕ is
fixed (e.g., Rectified Linear Unit) and the input and output weights, wk and vk , and the biases bk are the learnable
parameters. This gives us the neural network class
n X
K o
F = f : f (x) = vk ϕ(wkT x + bk ), wk ∈ Rd , vk , bk ∈ R .
k=1
We could further limit this class by placing constraints on the size of the weights and biases.

21.1.2. Atomic Classes

Consider a family of parameterized functions {ϕw }w∈W where each ϕw has the same functional form parameter-
ized by the choice of w. The set W could be a certain subset of Rd . We call the functions atoms since we can
take weight combinations of them to synthesize more complex functions. The neurons√in a neural network are an
T
example of atoms. Fourier basis functions of the form ϕw (x) := eiw x , where i = −1, are another. We can
define function classes in terms of atoms. For example, if W is finite or countably infinite we can define a class
like n X X o
FB = f : f (x) = v(w)ϕw (x) , v(w) ∈ R, |v(w)|2 ≤ B .
w∈W w∈W
This can be viewed as a generalization of the parametric class idea to include models with an infinite number of
parameters. We can even consider continuously infinite weighted combinations using integrals, such as
n Z Z o
FB = f : f (x) = v(w)ϕw (x) dw , |v(w)|2 dw ≤ B .
R
A classic example of this is the Fourier transform. Let f be a function satisfying |f (x)|2 dx < ∞. Such
R T R T
1
functions can be represented as (2π) d F (w)eiw x dw, where the function F (w) = f (x)e−iw x dx, the Fourier
R
transform of f . “Infinite width" neural networks of the form f (x) = v(w)ϕ(wT x + b) dw are another popular
example.

21.1.3. Nonparametric Classes

Given a function norm ∥f ∥ we can define the class FB = {f : ∥f ∥ ≤ B}. For example, consider contin-
uous functions on the interval [0, 1]. We may define a norm to be ∥f ∥C 0 = supx∈[0,1] |f (x)|. This defines a
class of functions without an explicit parameterization. If we consider functions
Pk that have continuous derivatives
f , . . . , f , then we could define a class based on the norm ∥f ∥C k = j=1 supx∈[0,1] |f (j) (x)|. Note that
(1) (k)

n o n o
FBk := f : ∥f ∥C k ≤ B ⊂ f : ∥f ∥C 0 ≤ B =: FB0 .
This shows that certain nonparametric classes are larger than others. A common approach to work with non-
parametric classes in practice is to approximate the functions in such classes with parameteric or atomic models,
revealing bridges between the different ways we may formulate model classes. A classical example of this is the
Weierstrauss theorem.

Weierstrauss (1885): If f is a continuous function on [0, 1], then for any continuous f : [0, 1] → R and any
ϵ > 0 there exists a polynomial p such that

sup |p(x) − f (x)| < ϵ.


x∈[0,1]

111
21.2. Exercises

Suppose we want to interpolate data {(xi , yi )}ni=1 , where xi ∈ [0, 1] and yi ∈ R. It is possible to interpolate these
data using a polynomial of degree n − 1, but this interpolation will typically widely fluctuate between the data
points. Suppose instead we interpolate with a degree d > n − 1 polynomial function. There an infinite number of
such “overparameterized” polynomials that interpolate the data, so it is natural to choose the one with minimum
norm. The norm could simply be the Euclidean norm of the polynomial coefficients or it could be another function
of the coefficients.

1. Let Fd denote the set of all a degree d polynomials. Consider the optimization
nX
n Z o
2 1
min yi − f (xi ) + λ |f (x)|2 dx
f ∈Fd 0
i=1

P
where λ > 0. The polynomial functions are parametric of the form f (x) = dk=0 wk xk . Reformulate this
as an optimization over the polynomial coefficients w = [w0 , w1 , · · · , wd ]T . Show that the solution has
the form
wb = (V T V + λA)−1 V T y .
Give explicit expressions for the elements of the matrices V and A.

2. Now consider the optimization


nX
n Z o
2 1
′′ 2
min yi − f (xi ) + λ (f (x)) dx
f ∈Fd 0
i=1

where f ′′ is the second derivative of f and λ > 0. Reformulate this as an optimization over the polynomial
coefficients w = [w0 , w1 , · · · , wd ]T . Show that the solution has the form

b = (V T V + λB)−1 V T y .
w

Give an explicit expression for the matrix B.

3. If we take λ = 0+ , then the optimization above corresponds to


Z 1
min (f ′′ (x))2 dx subject to f (xi ) = yi , i = 1, . . . , n .
f ∈Fd 0

Explain/discuss the behavior of the solution as d increases.

112
Lecture 22: Banach and Hilbert Spaces

22.1. Review of Vector Spaces

We start from an a brief review of vector spaces in this section, and then introduce normed vector spaces, complete
normed vector spaces (Banach spaces), and then Banach spaces with an inner product (Hilbert Spaces). Examples
are provided along the discussion.

Definition 22.1.1. A vector space F is a set of elements (vectors) together with addition and scalar multipli-
cation operators satisfying the following axioms. For any u, v, w ∈ F and any scalars α, β ∈ R:a

• If u, v ∈ F, then u + v ∈ F

• u+v =v+u (commutativity of addition)

• u + (v + w) = (u + v) + w (associativity of addition)

• ∃ null vector 0 ∈ F such that v + 0 = v (identity element of addition)

• ∃ − v ∈ F such that v + (−v) = 0 (inverse element of addition)

• If u ∈ F, then αu ∈ F

• α(βv) = (αβ)v (compatibility of scalar multiplication with field multiplication)

• 1v = v where 1 denotes the multiplicative identity in R (identity element of scalar multiplication)

• α(u + v) = αu + αv (distributivity of scalar multiplication with respect to vector addition)

• (α + β)v = αv + βv (distributivity of scalar multiplication with respect to field addition)


a
We could also work with complex valued functions and the field of complex numbers C or other fields. The default field
considered in this note is R unless otherwise mentioned.

Note that many other properties can be derived from above axioms. For instance, 0v = 0 can be derived by
noticing 0 + 0 = 0 =⇒ (0 + 0)v = 0v =⇒ 0v + 0v = 0v =⇒ 0v + 0v + (−0v) = 0v + (−0v) =⇒
0v + (0v + (−0v)) = 0v + (−0v) =⇒ 0v + 0 = 0 =⇒ 0v = 0. The abstract definition of vector space can
be easily understood with the following examples.
Example 22.1.2.

• R with v ∈ R.

• Rd with v = [v1 , v2 , . . . , vd ]⊤ and each vi ∈ R.

• R∞ with v = [v1 , v2 , . . . , ]⊤ and each vi ∈ R.

• C([0, 1]) with v being any real-valued continuous function defined on [0, 1].

113
• Pd ([0, 1]) with v being any polynomial of degree d or smaller defined on [0, 1].

We can also define a subspace of a vector space, which is a subset that is closed under linear combinations.

Definition 22.1.3. A non-empty subset S ⊆ F is a subspace of F if αu + βv ∈ S for all u, v ∈ S and


scalars α, β ∈ R.

One should notice that 0 ∈ S since we can always set α = β = 0. Examples of subspaces are provided as follows.

Example 22.1.4.

• {v : v = [v1 , v2 , . . . , vk , 0, . . . , 0] ∈ Rd } is a subspace of Rd .

• Pd ([0, 1]) is a subspace of C([0, 1]).

If S, T are subspaces of F, it’s easy to check that the following two subsets are also subspaces:

• S ∩ T = {v : v ∈ S, v ∈ T }

• S + T = {v : v = u + w, u ∈ S, w ∈ T }

One can also define an affine subspace, with respect to a fixed vector w ∈ F, as follows,

Sw = {v : v = u + w, u ∈ S, w ∈ F}.

Every subspace is an affine subspace (with w = 0), yet the converse is not true (an affine subspace need not to
contain 0).

We use dimension to measure the “size” of a vector (sub-)space. Before getting into that, we first introduce the
notions of linear dependence and linear independence.

Definition 22.1.5. A set of vectors {vj } is said to be linearly dependent if at least one vector vi in the set
can be written as the linear combination of the others, i.e.,
X
vi = αj vj .
j̸=i

If no vector in the set can be written in this way, the set of vectors are said to be linear independent.

Theorem 22.1.6. A set of vectors {vj } is linearly independent if and only if


X
αj vj = 0 =⇒ αj = 0, ∀j.
j

114
Proof. We prove the theorem by contradiction on both directions.
P P
“=⇒” Suppose there exists αi ̸= 0 such that j αi vj = 0, we then have vi = j̸=i −αj vj /αi . And this
contradicts with the definition of linear independence.
P
“⇐=” Suppose the P set of vectors is linear dependent, we then have vi = j̸=i αj vj for some vi . Rearranging
the equility gives j̸=i αj vj − vi = 0, which contradicts with the statement αj = 0, ∀j (the coefficient of vi is
−1 ̸= 0).

Definition 22.1.7. A set of linearly independent vectors {ui } in F is a basis for S ⊆ F if every v ∈ S can
be written as
X
v= αi ui .
i

If a basis {ui } is finite, then S is finite-dimensional. Otherwise, S is infinite-dimensional.

Another way to say that a set of linearly independent vectors {ui } forms a basis for S is that span({ui }) = S.
The span is the set of all vectors that can be formed by linear combinations of {ui }. Unless otherwise mentioned,
we will always be working with bases that are countable, i.e., finite or countably infinite. Examples of bases are
provided as follows.
Example 22.1.8.

• Rd is d-dimensional with basis {ei }di=1 , where ei is a d-dimensional vector with 1 on the i-th entry and 0
elsewhere.
• R∞ is infinite-dimensional with basis {ui }∞
i=1 , where ui is a infinite-dimensional vector with 1 on the i-th
entry and 0 elsewhere.
• Pd ([0, 1]) is (d + 1)-dimensional with basis {ui (x)}di=0 , where ui (x) = xi .

22.2. Normed Vector Spaces and Banach Spaces

We equip a vector space with a norm and introduce the normed vector space. A norm can be thought as the
formalization of “length/size” in vector spaces.

Definition 22.2.1. A normed vector space (F, ∥·∥) is a vector space F equipped with functional mapping
∥·∥ : F → R satisfying the following properties. For any u, v ∈ F and scalar α ∈ R:

• ∥v∥ ≥ 0 (nonnegativity)

• ∥v∥ = 0 ⇐⇒ v = 0 (positivity on non-zero vectors)

• ∥αv∥ = |α|∥v∥ (absolutely scalable)

• ∥u + v∥ ≤ ∥u∥ + ∥v∥ (triangle inequality)

115
It is important to specify the norm ∥·∥ when defining a normed vector space (F, ∥·∥). Different norms on the
same vector space induce different normed spaces with different properties. Some examples of normed vector
spaces are provided as follows.

Example 22.2.2.

P
• Rd is a normed vector space with norm ∥v∥p = ( di=1 |vi |p )1/p when p ≥ 1.8
R1
• C([0, 1]) is a normed vector space with norm ∥f ∥L∞ = supx∈[0,1] |f (x)| or norm ∥f ∥L1 = 0
|f (x)|dx.

• C 1 ([0, 1]) is a normed vector space with norm ∥f ∥ = supx∈[0,1] |f (x)| + supx∈[0,1] |f ′ (x)|.9

• BV([0, 1]) is a normed vector space with norm ∥f ∥ = |f (0)| + TV(f ) where
P −1
nX
TV(f ) = sup |f (xi+1 ) − f (xi )|
P ∈P
i=0

and P is the set of all partitions [0, 1] and 0 = x0 ≤ · · · ≤ xnP = 1 are the boundaries of the partition P .10

Equipped with a norm ∥·∥, one can easily define a metric d(u, v) = ∥u − v∥ to measure the distance between two
vectors. This allows us to analyze sequences of vectors and their limits. Several related definitions are provided
as follows.

Definition 22.2.3. Let (F, ∥·∥) be a normed vector space. A sequence {vn }n≥1 in F is said to converge to
v ∈ F if

lim ∥vn − v∥ = 0.
n→∞

Definition 22.2.4. A set S ⊆ F is closed if and only if every convergent sequence in S has its limit point in
S.

Definition 22.2.5. Let (F, ∥·∥) be a normed vector space. A sequence {vn }n≥1 in F is Cauchy if for any
ϵ > 0, there exists N (ϵ) ∈ N such that for any m, n ≥ N (ϵ), we have

∥vm − vn ∥ < ϵ.

The definition of Banach space is provided as follows.


8
It does not form a norm when 0 ≤ p < 1.
9
C 1 ([0, 1]) stands for all continuously differentiable functions defined on [0, 1].
10
We add the |f (0)| term in ∥f ∥ to make it a norm. TV(f ) itself is a semi-norm since it doesn’t satisfy the second property, i.e.,
positivity on non-zero vectors.

116
Definition 22.2.6. A normed vector space (F, ∥·∥) is said to be complete if every Cauchy sequence in F
converges to limits in F. A complete normed vector space is called a Banach space.

Examples of Banach spaces and non-Banach spaces are provided as follows.

Example 22.2.7.

• R with absolute-value norm is a Banach space and Rd with p-norm (p ≥ 1) is a Banach space.

• C([0, 1]) with norm ∥f ∥L∞ = supx∈[0,1] |f (x)| is a Banach space.


R1
• C([0, 1]) with norm ∥f ∥L1 = 0 |f (x)|dx is not complete (and thus not a Banach space).

Some brief explanations of the above examples are provided as follows.

• The proof of completeness of R is a consequence of the Bolzano-Weierstrass theorem. The proof of com-
pletenesso of Rd can be done by checking the convergence of each coordinate and use the completeness
result of R.

• Let (fn ) be a Cauchy sequence in C([0, 1]). For any x ∈ [0, 1], (fn (x)) is Cauchy in R; we then define f :
[0, 1] → R such that f (x) = limn→∞ fn (x). We next show that fn → f under L∞ norm and f ∈ C([0, 1])
as below.
fn → f : Fix any ϵ > 0. Since (fn ) is Cauchy, there exists N (ϵ) ∈ N such that for any m, n ≥ N (ϵ), we
have ∥fn − fm ∥L∞ ≤ ϵ. As a result, for any n ≥ N (ϵ), we have

∥fn − f ∥L∞ = sup |fn (x) − f (x)| = sup lim |fn (x) − fm (x)| ≤ sup lim ∥fn − fm ∥L∞ ≤ ϵ.
x∈[0,1] x∈[0,1] m→∞ x∈[0,1] m→∞

f ∈ C([0, 1]): Consider any n ≥ N (ϵ/3) so that ∥fn − f ∥L∞ ≤ ϵ/3 and any fixed x ∈ [0, 1]. Since fn is
continuous, there must exists a δ(ϵ/3) > 0 such that for any |y − x| ≤ δ(ϵ/3), we have |fn (y) − fn (x)| ≤
ϵ/3. As a result, we have

|f (y) − f (x)| ≤ |f (y) − fn (y)| + |fn (y) − fn (x)| + |fn (x) − f (x)| ≤ ϵ,

and thus f ∈ C([0, 1]).


R1
• Showing that C([0, 1]) with norm ∥f ∥L1 = 0
|f (x)|dx is not complete is left as an exercise.

22.3. Hilbert Spaces

Hilbert spaces generalize the familiar concept of Euclidean space. A Hilbert space is a type of Banach space
equipped with an additional geometric structure called an inner product, which allows the definition of length and
angle.

117
Inner product: Let H be a vector space. A inner product is a mapping from H × H to R satisfying the
following for any u, v, w ∈ H and any scalar α, β:

1. symmetry: ⟨u, v⟩ = ⟨v, u⟩;

2. linearity: ⟨αu + βv, w⟩ = α⟨u, w⟩ + β⟨v, w⟩;

3. positive-definite: ⟨v, v⟩ > 0 if v ̸= 0.

The inner product induces the norm p


∥v∥= ⟨v, v⟩. (22.1)

Obviously, the norm induced by inner product is nonnegative and for any α ∈ R, ∥αv∥= |α|∥v∥ . Thus, to show
the equation 22.1 is a valid norm, we only need to show that it also satisfies triangle inequality. To prove this
property, let us introduce Cauchy-Schwarz inequality first.

Cauchy-Schwarz Inequality: For any u, v ∈ H, |⟨u, v⟩| ≤ ∥u∥∥v∥.

⟨u,v⟩
Proof. For any α we have 0 ≤ ⟨u − αv, u − αv⟩ = ∥u∥2 −2α⟨u, v⟩ + α2 ∥v∥2 . Let α = ∥v∥2
. Then we have
2
0 ≤ ∥u∥2 − |⟨u,v⟩|
∥v∥2
, hence leads to the desired result.

Cauchy-Schwarz inequality leads to triangle inequality directly: ∥u + v∥≤ ∥u∥+∥v∥.

Hilbert Space: Let H be vector space equipped with an inner product and associated norm. If the space is
complete with respect to the norm, then it is called a Hilbert Space.

Parallelogram Law: If H is a Hilbert space, then for any u, v ∈ H,



∥u + v∥2 + ∥u − v∥2 = 2 ∥u∥2 + ∥v∥2 .

In fact a Banach space is a Hilbert space if and only if the parallelogram law holds. This means that if the
parallelogram law fails to hold for certain a u and v, then the space is not a Hilbert space.

P
Example 22.3.1. Rn equipped with inner product ⟨u, v⟩ = ui vi for any u, v ∈ Rn is a Hilbert space.
i
R
Example 22.3.2. L2 [0, 1] equipped with inner product ⟨f, g⟩ = f (x)g(x) dx for any f, g ∈ L2 [0, 1] is a Hilbert
space.
R
Example 22.3.3. H = {all polynomial functions on [0, 1]} equipped with inner product ⟨f, g⟩ = f (x)g(x) dx
for any f, g ∈ H is a Hilbert space. Note H is a subspace of L2 [0, 1].

There are many interesting properties of Hilbert space. Specifically geometric intuition plays an important role in
many aspects of Hilbert space theory. Analogs of the Pythagorean theorem holds in a Hilbert space.

118
Orthogonality: Consider a Hilbert space H. Two vectors u, v ∈ H are orthogonal if ⟨u, v⟩ = 0, denoted
by via u ⊥ v. A vector u ∈ H is orthogonal to an subspace S ∈ H if u ⊥ v, for every v ∈ S.

Pythagorean Theorem: If u ⊥ v, then ∥u + v∥2 = ∥u∥2 +∥v∥2 .

Parallelogram Law: For any u, v ∈ H,



∥u + v∥2 + ∥u − v∥2 = 2 ∥u∥2 + ∥v∥2 .

22.4. Exercises

1. Verify that C[0, 1] is a vector space.

2. Verify that Pd [0, 1] is a subspace of C[0, 1].

3. A basis {uj } is said to be orthonormal if ⟨ui , uj ⟩ = 0 for all i ̸= j. Construction an orthonormal basis for
the space of linear functions on [0, 1], i.e., P1 [0, 1].
R1 1/2
4. Consider C[0, 1] and show that supx∈[0,1] |f (x)| and 0 f 2 (x)dx are norms.
R1
5. Show that C[0, 1] with the norm ∥f ∥L1 = 0 |f (x)|dx is not a Banach space. Hint: Consider the sequence
of functions


0 0 ≤ x < 21 − n1 ,
fn (x) = nx + (1 − n2 ) 12 − n1 ≤ x < 21 ,

 1
1 2
≤ x ≤ 1.

R1
6. Verify that ⟨f, g⟩ = 0
f (x)g(x) dx is a a valid inner product and that it generates the space L2 [0, 1].

7. Explain why Pd [0, 1] is a subspace of L2 [0, 1].

119
Lecture 23: Reproducing Kernel Hilbert Spaces

A Reproducing Kernel Hilbert Space (RKHS) are a special type of Hilbert spaces that is especially important
in machine learning, because they enable practical and computationally efficient learning methods in infinite
dimensional function spaces. Before moving on, let us introduce a bit of notation. We write f to refer to a
function/vector in a vector space and f (x) to refer to the value of f at the point x. We also sometimes write f (·)
to refer to the function/vector. We will also be working with functions of two points or vectors. For example, an
inner product is such a function and we write ⟨·, ·⟩ to refer to this function and ⟨f, g⟩ to refer to its value for the
pair f, g.

Reproducing Kernel Hilbert Space: A Hilbert Space H of functions on a domain X is said to be a Repro-
ducing Kernel Hilbert Space (RKHS) if there is a function k defined on X × X satisfying two properties:

1. k(·, x) ∈ H for each x ∈ X

2. ⟨f, k(·, x)⟩ = f (x) for each f ∈ H

Such a function is called a reproducing kernel.

The domain X could be Rd , for example. Notice that the reproducing kernel satisfies
⟨k(·, x′ ), k(·, x)⟩ = k(x, x′ ) .
Example 23.0.1. Consider the Hilbert space Rd equipped with inner product ⟨u, v⟩ = uT v, denoted ℓ2d . Here
the domain is X = {1, . . . , d}. Define the kernel k(i, j) = 1 if i = j and 0 otherwise. Clearly, k(·, j) ∈ ℓ2d for
P
j = 1, . . . , d. And k satisfies the reproducing property u(j) = ⟨u, k(·, j)⟩ = di=1 u(i)k(i, j). This shows that
d-dimensional Euclidean space is an RKHS.
Example 23.0.2. Let f be a univariate function and let f (1) denote its first derivative. Let the domain X = [0, 1]
R1 1/2
and ∥f ∥L2 = 0 |f (u)|2 du . Consider the normed vector space

H1 [0, 1] = {f : [0, 1] → R , f (0) = 0, ∥f (1) ∥L2 < ∞}


R
with inner product ⟨f, g⟩ = fR(1) (u)g (1) (u) du. This is an RKHS with reproducing kernel k(x, x′ ) = min(x, x′ ).
x
To see this, write min(x, x′ ) = 0 1{u∈[0,x′ ]} du. Observe that for x fixed k (1) (u, x) = 1{u∈[0,x]} as a function of u
and thus Z 1 Z x
(1)
⟨f, k(·, x)⟩ = f (u)1{u∈[0,x]} du = f (1) (u) du = f (x) .
0 0

23.1. Constructing an RKHS

We can also construct an RKHS by starting with a positive semidefinite kernel function, defined as follows.

Positive semidefinite (psd) kernel: A symmetric bivariate function k : X × X → R is positive semidefinite


(psd) if for all integers n ≥ 1 and all {xi }ni=1 ⊂ X , the n × n matrix Kij = k(xi , xj ) is psd.

120
Let k : X × X → R be any psd kernel. The kernel defines a unique Hilbert space H of functions on X , with the
reproducing property
⟨f, k(·, x)⟩H = f (x) ∀f ∈ H, x ∈ X .
Consider functions of the form
X
n
f (·) = αi k(·, xi )
i=1

where {xi }ni=1


⊂ X and {αi }ni=1
⊂ R. It is easily verified that the set of all such functions is a vector space, which
we will denote by H. Now let f, fe ∈ H,
e e which we have the forms

X
n X
n
e
f (·) = αj k(·, xj ) and fe(·) = ej k(·, xj ) .
α
j=1 j=1

e as
Define the inner product on H
X
n X
n
e
⟨f, fe⟩He := αi αej k(xi , x
ej )
i=1 j=1

This is a valid inner product since it satisfies the following,

1. Symmetry. ⟨f, fe⟩He = ⟨fe, f ⟩He .

2. Linearity. ⟨af + bg, fe⟩He = a · ⟨f, fe⟩He + b · ⟨g, fe⟩He

3. ⟨f, f ⟩He ≥ 0 with equality iff f = 0. Note this is true because k is positive semidefinite.

Furthermore, this definition satisfies


X
n
⟨f, k(·, x)⟩He = αj k(x, xj ) = f (x).
j=1

e We complete H
The final step in the construction is to complete H. e by including limits of all Cauchy sequences
e and thus we get H, which is the RKHS.
in H

Lastly, we prove the uniqueness of H. Recall the notion of orthogonality in Hilbert spaces.

Orthogonality: Consider a Hilbert space H. Two vectors f, g ∈ H are orthogonal if ⟨f, g⟩ = 0, denoted by
via f ⊥ g. A vector f ∈ H is orthogonal to an subspace S ∈ H if f ⊥ g, for every g ∈ S. Let S ⊥ denote the
set of all vectors in H that are orthogonal to S. Then any vector f ∈ H may be decomposed as f = fS +fS ⊥ ,
where fS is the component in S and fS ⊥ is the component in S ⊥ . This is denoted by H = S ⊕ S ⊥ .

Suppose H1 is some other RKHS with k as its kernel. Then k(·, x) ∈ H1 for ∀x ∈ X . Since H1 is complete,
X
H = closure({f : f (·) = αj k(·, xj )}) ⊂ H1
L
Thus H is a subspace of H1 and H1 = H H⊥ . Then for g ∈ H⊥ , since k(·, x) ∈ H, we have

0 = ⟨k(·, x), g⟩ = g(x), ∀x

121
so g = 0, which means H⊥ = {0}. Therefore, H1 = H.

Any RKHS also has a unique kernel. To see this suppose k1 and k2 generate the same H. Then the reproducing
property implies that
⟨f, k1 ( ·, x) − k2 ( ·, x)⟩H = 0 for all f ∈ H, x ∈ X .
Now, let f (·) = k1 ( ·, x′ ) for any x′ ∈ X . Then we have

0 = ⟨f, k1 ( ·, x) − k2 ( ·, x)⟩H = ⟨k1 ( ·, x′ ), k1 ( ·, x) − k2 ( ·, x)⟩H = k1 (x′ , x) − k2 (x′ , x)

which implies that k1 = k2 .

23.2. Examples of PSD Kernels

Ex.1: Linear kernel. Let X = Rd , consider the kernel k(x1 , x2 ) = ⟨x1 , x2 ⟩ = xT1 x2 , then we have

X
n Xn
f (x) = αi k(x, xi ) = ( αi xTi )x
i=1 i=1

also it’s easy to show this kernel is psd because we have Kij = k(xi , xj ) = xTi xj , if we define X = [x1 , · · · , xn ]
then
K = XT X
which is psd since αT Kα = ||Xα||22 ≥ 0 for any α. Using the linear kernel is equivalent to simple linear
regression on Rd , which shows that the linear kernel generates an RKHS of dimension d.

Ex.2: Polynomial kernel. Let X = Rd and consider the kernel

k(x1 , x2 ) = (⟨x1 , x2 ⟩)p = (xT1 x2 )p


P P P
for simplicity, we consider the case p = 2 here. Then k(x1 , x2 ) = ( dj=1 x1j x2j )2 = dj=1 x21j x22j +2 i<j x1i x1j x2i x2j .
This could be rewritten as
k(x1 , x2 ) = ⟨ϕ(x1 ), ϕ(x2 )⟩ = ϕ(x1 )T ϕ(x2 )
where  2 
xj , j = 1, · · · , d
ϕ(x) = √
2xi xj , i < j
actually ϕ here is a so-called feature map. This shows that the polynomial kernel generates a finite dimensional
RKHS, with dimension d(d + 1)/2, the number of terms in ϕ(x). Then

X
n Xn
f (x) = αi k(x, xi ) = ( αi ϕ(xi ))T ϕ(x)
i=1 i=1

and it is also easy to show the kernel is psd since we have K = ΦT Φ where

Φ = [ϕ(x1 ), · · · , ϕ(xn )].

Ex.3: Gaussian kernel: Let α > 0 and

k(x1 , x2 ) = exp(−α∥x1 − x2 ∥22 ).

122
Ex.4: Laplace kernel:
k(x1 , x2 ) = exp(−α||x1 − x2 ||2 ).

The Gaussian and Laplace kernels each generate an RKHS of infinite dimension. The feature maps associated
with these spaces involve monomials of all possible degrees, but with scaling factors on each monomial that differ
depending on the choice of kernel.

23.3. The Representer Theorem

Now let us consider the problem of learning in an RKHS H. The goal will be to find a function f ∈ H that fits a
set of training data and has a small norm.

Representer Theorem: Let H be an RKHS with kernel k. Then for any data {(xi , yi )}ni=1 and any continu-
ous loss function ℓ, there exists an f ∈ H that minimizes
X
n
ℓ(yi , f (xi )) + λ∥f ∥2H , λ>0
i=1

and has a representation


X
n
f (·) = αi k( ·, xi ), α1 , . . . , αn ∈ R.
i=1

If the loss function is convex, then the solution is unique.

Proof. Let us assume that a solution exists. Let H0 = span{k( ·, xi )}ni=1 . The orthogonal complement to

H
P0n is H0 = {f ∈ H : f (xi ) = 0, i = 1, . . . , n}. To see this, note that every function in H0 has the form
i=1 αi k( ·, xi ). Let f be orthogonal to H0 . Then we have
* n +
X X
n
0 = f, αi k( ·, xi ) = αi f (xi ) .
i=1 H i=1

Since this equality holds for all choices of α1 , . . . , αn it holds if and only if f (xi ) = 0, P
i = 1, . . . , n. Any
f ∈ H may be decomposed as f = f0 + f0 , where f0 ∈ H0 and f0 ∈ H0 . Note that ni=1 ℓ(yi , f (xi )) =
⊥ ⊥ ⊥
Pn 2 2 ⊥ 2
i=1 ℓ(yi , f0 (xi )) and that ∥f ∥H = ∥f0 ∥H + f0 H by orthogonality. Since the loss term does not depend on
f0⊥ it is clear that the overall objective is minimized with f0⊥ = 0. Together these imply that a global minimizer
fb ∈ H0 which completes the proof.

The representer theorem shows that the solution is a linear combination of the functions k(·, x1 ), . . . , k(·, xn ). In
other words, the Representer Theorem shows that the solution is a linear-in-parameters model. So all our results
pertaining to linear modeling apply in the RKHS setting. This is often referred to as the kernel trick. The weights
α1 , . . . , αn can be found by solving a finite dimensional optimization problem as follows. Note that the norm of
the solution f is
DX n X
n E Xn
∥f ∥H = αi k(·, xi ) , αj k(·, xj ) = αi αj k(xi , xj ) .
H
i=1 j=1 i,j=1

123
Let K denote the n × n matrix with i, jth entry k(xi , xj ) and let α ∈ Rn be a vector with ith entry αi . Then we
can write the norm as ∥f ∥H = αT Kα. Thus, knowing that the solution has this form, we may equivalently
solve the optimization
Xn  X n 
min ℓ yi , αj k(xi , xj ) + αT Kα .
α∈Rd
i=1 j=1

This can be solved, for example, using gradient descent. Note that if the loss function is convex, then this is
a convex optimization and gradient descent (with a sufficiently small step size) is guaranteed to converge to a
minimizer.

23.4. Exercises

1. Recall the RKHS


H1 [0, 1] = {f : [0, 1] → R , f (0) = 0, ∥f (1) ∥L2 < ∞}
Consider the representor theorem in this case. Describe the nature of the function that solves the optimiza-
tion.

2. k(x, x′ ) is a valid kernel if and only if for every n ≥ 1 and every set {xi }ni=1 the matrix K with ij-th
element K(xi , xj ) is positive semi-definite. Use this to show that the following are valid kernels:

(a) k(x, x′ ) = xT x′
(b) k(x, x′ ) = (xT x′ + 1)p for integers p ≥ 1
(c) k(x, x′ ) = f (x)f (x′ ) for any function f

3. Suppose that k1 and k2 are valid kernels. Show that the following are also kernels.

(a) k(x, x′ ) = k1 (x, x′ ) + k2 (x, x′ )


(b) k(x, x′ ) = k1 (x, x′ )k2 (x, x′ ) P
Hint: Consider the Hardamard products of the eigendecompositions K = i λi ui uTi
(c) k(x, x′ ) = p(k1 (x, x′ )), where p is a polynomial with positive coefficients
(d) k(x, x′ ) = exp(k1 (x, x′ ))
Hint: Consider the Taylor Series of the exponential function.

4. Let {xi }ni=1 be points in Rd and assume n ≥ d. Let X T = [x1 · · · xn ], let k(x, x′ ) = (⟨x, x′ ⟩ + 1)2 and
let K be the associated n × n Gram matrix with ijth entry k(xi , xj ).

(a) What is the rank of K and K1 = XX T ?


(b) Suppose xi = xj for some i ̸= j. What is the rank of K and K1 = XX T ?

5. (Normalized Kernels). If k is a kernel such that k(x, x) > 0 for all x, then show that

e k(x, x )
k(x, x) := p
k(x, x)k(x′ , x′ )

is also a kernel.
′ 2 2
6. (Gaussian Kernel). Show that kG (x, x′ ) = e−∥x−x ∥ /σ is a valid kernel. Hint: Consider the exponential
T ′
kernel kE (x, x′ ) = ex x and use the normalization trick above.

124
7. (Laplace Kernel). Show that for α > 0

k(x, x′ ) = e−α∥x−x ∥

is a kernel. Hint: Use the following fact



Z ∞
−α s α α2
e = e−su √ e− 4u du
0 2 πu3
The Laplacian kernel does not decay to zero as rapidly as the Gaussian kernel, and therefore is less likely
to encounter numerical problems.

8. Consider the three binary classification regions in R2 depicted below. Is there a kernel function that can
represent all of them?

- + +
+

- -
+ -

(a) (b) (c)

125
Lecture 24: Analysis of RKHS Methods

The representer theorem tells us that the problem of finding the function in a (possibly infinite dimensional) RKHS
that minimizes training losses can be posed as a finite dimensional optimization of the form
X
n  X n 
b = arg min
α ℓ yi , αj k(xi , xj ) + αT Kα ,
α∈Rd
i=1 j=1
P
where k is the reproducing kernel. The function fb(·) = ni=1 α
bi k(·, xi ) is a solution to
X
n
min ℓ(yi , f (xi )) + ∥f ∥2H .
f ∈H
i=1

The soluion fb is a weighted combination of kernel functions “centered" at each training point xi . For example, if
k is a radial basis kernel, like the Gaussian or Laplacian, then the solution is a weighted combination of “bump"
functions at the data points. This lecture analyzes the performance and properties of kernel methods of this type.

24.1. Rademacher Complexity Bounds for Kernel Methods

Recall the Rademacher complexity bounds developed in Lecture 20. Let F be a class of functions, {(xi , yi )}ni=1
bePiid training examples, and ℓ be an L-Lipschitz loss function. Consider the empirical risk function R bn (f ) =
1 n
n i=1 ℓ(yi , f (xi )) and its expectation R(f ) = E[ℓ(y, f (x)]. Assume the losses are bounded in [0, C]. Theo-
rem 12 states that with probability at least 1 − δ
r
bn (f ) ≤ 2L Rn (F) + C log(1/δ)
sup R(f ) − R
f ∈F 2n
where " #
1X
n
Rn (F) = E sup σi f (xi ) .
f ∈F n i=1

To apply this machinery to the RKHS setting, we will consider a constrained form of the optimization
X
n
min ℓ(yi , f (xi )) subject to ∥f ∥2H ≤ B 2 .
f ∈H
i=1

In other words, we will consider the Rademacher complexity of the class of functions
HB = {f ∈ H : ∥f ∥H ≤ B} .
The bound above yields a generalization bound of the following form. For any δ > 0 with probability 1 − δ
r
R(fb) ≤ R( b fb) + 2L Rn (HB ) + C log 1/δ ,
2n
Here fb is the function in HB that minimizes the training loss, R(fb) is the test error, and R(
b fb) is train error. Recall
that this bound assumes that the losses are bounded in [0, C]. To check that this requirement is met, consider the
loss as a function of yi f (xi ). Assume that yi ∈ [−1, 1] and note that
|yi f (xi )| ≤ |f (xi )| = |⟨f, k(·, xi )⟩H | ≤ ∥f ∥H ∥k(·, xi )∥H

126
by the Cauchy-Schwartz inequality. Since we are working with the class HB we have ∥f ∥H ≤ B. And the
reproducing property yields
∥k(·, xi )∥2H = ⟨k(·, xi , k(·, xi )⟩H = k(xi , xi ) .
Thus, we have ∥k(·, xi )∥H ≤ supx k(x, x). So assuming the kernel function is bounded, we have
p
|yi f (xi )| ≤ B sup k(x, x) .
x
p p
Let C be an upper bound on the loss function over the range [−B supx k(x, x), B supx k(x, x)] and assume
the loss is lower bounded by 0. For example, if k is a Gaussian kernel and ℓ is logistic or hinge loss, then we can
use C = 1 + B and L = 1.

We can bound the Rademacher complexity of HB as follows.


" # " # " * n + #
1 X n
1 Xn
1 X
Rn (HB ) = E sup σi f (xi ) = E sup σi ⟨f, k(·, xi )⟩H = E sup f, σi k(·, xi )
n f ∈HB i=1 n f ∈HB i=1 n f ∈HB i=1 H
" # " n #
1 X n
B X
≤ E sup ∥f ∥H ∥ σi k(·, xi )∥H = E ∥ σi k(·, xi )∥H
n f ∈HB i=1
n i=1
v " # v " #
u u
Bt u Xn
B u X n X n
≤ E ∥ σi k(·, xi )∥2H = tE σi σj ⟨k(·, xi ), k(·, xj )⟩H
n i=1
n i=1 j=1
" n #1/2
B X B p
= k(xi , xi ) ≤ √ sup k(x, x),
n i=1 n x

where the first inequality follows by Cauchy-Schwartz inequality, the second inequality follows by the Jensen’s
inequality, and the last inequality holds because E[σi σj ] = 0 if i ̸= j. Putting everything together, we have shown
that for any δ > 0 with probability 1 − δ
p r
B sup k(x, x) log 1/δ
R(fb) ≤ R( b fb) + 2L x
√ +C .
n 2n
For example, if we use logistic or hinge loss and a radial kernel function like the Gaussian or Laplacian kernel,
then we have r
R(fb) ≤ R(
2B
b fb) + √ + (1 + B) log 1/δ .
n 2n

24.2. Properties of Kernel Functions

The Rademacher complexity bound depends on the maximum value of the kernel function, but otherwise does
not reflect particular characteristics of the kernel function. To gain insight into the differences between kernels
and the RKHSs they generate, let us focus on translation-invariant kernels that only depend on the difference
between $x$ and $x'$. We will denote translation-invariant kernels as $k(x, x') = k(x - x')$. The Gaussian kernel $k(x, x') = \exp(-\alpha \|x - x'\|_2^2)$ is an example.

We will use the Fourier transform to study such kernels. Recall that the Fourier transform of a function $f \in L_2(\mathbb{R}^d)$ is
$$
F(\omega) = \int f(x)\, e^{-i\omega^T x}\, dx
$$
and the inverse transform is
$$
f(x) = \frac{1}{(2\pi)^d} \int F(\omega)\, e^{i\omega^T x}\, d\omega .
$$
Here $i = \sqrt{-1}$, $\omega, x \in \mathbb{R}^d$, and $e^{i\omega^T x} = \cos(\omega^T x) + i\sin(\omega^T x)$. The squared $L_2$ norm $\int |f(x)|^2\, dx$ can be interpreted as the total "energy" of the function $f$. The Fourier transform $F(\omega)$ indicates how much of the energy is associated with each frequency $\omega$.

We will specifically consider kernels that can be expressed in terms of the parameter $\rho = x - x'$ and consider the Fourier transform of the function $k(\rho)$.
Example 24.2.1. Consider Gaussian kernels of the form $k(\rho) = \sigma^{-d} \exp\left(-\frac{\|\rho\|_2^2}{2\sigma^2}\right)$, for some $\sigma > 0$ that controls the width of the Gaussian bump. The Fourier transform of $k$ is
$$
K(\omega) = \exp(-\sigma^2 \|\omega\|_2^2 / 2) .
$$

This shows that the Fourier transform decays exponentially as the frequency of oscillation ∥ω∥2 increases and as
σ 2 increases. Larger values of σ 2 correspond to broader and smoother bumps. This tells us that solutions based
on Gaussian kernels tend to be relatively smooth functions.
Example 24.2.2. Consider Laplacian kernels of the form $k(\rho) = \exp(-\alpha \|\rho\|_2)$, for some $\alpha > 0$ that controls the width of this sort of bump. The Fourier transform of $k$ is
$$
K(\omega) = 2^{d/2}\, \alpha \sqrt{\pi}\; \Gamma\!\left(d/2 + 1\right) \left(\alpha^2 + \|\omega\|_2^2\right)^{-\frac{d+1}{2}} ,
$$
where $\Gamma$ is the Gamma function satisfying $\Gamma(n+1) = n!$ when $n$ is a positive integer. This shows us that its Fourier transform decays less rapidly than in the Gaussian case; it decays like $\|\omega\|_2^{-(d+1)}$, which is much slower than exponential decay. This tells us that solutions based on Laplacian kernels tend to be less smooth in comparison to those based on Gaussian kernels.

This Fourier analysis has several practical implications. As noted above, different kernels induce different spectral
decays in the frequency domain. If we have prior knowledge that the function we are trying to learn has certain
frequency characteristics, then we can try to match the kernel to these characteristics. For example, if we know
that the true function has little or no energy above a certain frequency, then we can use this information to choose
σ or α.
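As a concrete illustration of these decay rates, the short sketch below (not from the notes; it simply evaluates the one-dimensional closed forms with assumed width parameters $\sigma = \alpha = 1$) prints the Gaussian and Laplacian spectra at a few frequencies, showing exponential versus polynomial decay.

```python
import numpy as np

# Compare the one-dimensional (d = 1) kernel spectra at a few frequencies.
omega = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
sigma, alpha = 1.0, 1.0   # assumed width parameters, for illustration

K_gauss = np.exp(-(sigma ** 2) * omega ** 2 / 2)    # exponential decay in omega^2
K_laplace = 2 * alpha / (alpha ** 2 + omega ** 2)   # polynomial decay ~ omega^{-(d+1)} = omega^{-2}

for w, g, l in zip(omega, K_gauss, K_laplace):
    print(f"omega = {w:5.1f}   Gaussian = {g:.3e}   Laplacian = {l:.3e}")
```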

24.3. Take-Away Messages

Let $\mathcal{H}$ be an RKHS with kernel $k$ and consider the ball of radius $B > 0$ in $\mathcal{H}$:
$$
\mathcal{H}_B = \left\{ f \in \mathcal{H} : \|f\|_{\mathcal{H}} \le B \right\} .
$$

We saw that a solution to the constrained optimization
$$
\min_{f \in \mathcal{H}} \; \sum_{i=1}^n \ell(y_i, f(x_i)) \quad \text{subject to} \quad \|f\|_{\mathcal{H}} \le B
$$
can be represented as $\hat{f}(\cdot) = \sum_{i=1}^n \alpha_i k(\cdot, x_i)$, for some $\alpha_i$ depending on the data. This is remarkable and potentially useful, but since $\mathcal{H}$ may contain very complex functions we should worry about possibly overfitting to the training data.


The Rademacher complexity analysis shows that learning is well-posed if $B/\sqrt{n}$ is small, since this ensures the solution will generalize well to new examples. This bound cannot be improved too much, since in general we have an $n^{-1/2}$ term in the generalization bound for finite classes too. From this perspective, we see that learning in an RKHS ball is not more difficult than learning with a finite class of functions. This might seem surprising, but it crucially depends on the assumption that the norm of the solution is at most $B$. This is a restriction on what functions we consider. As we increase our training set size, we might want to allow for more functions and let $B$ grow with $n$. Also, we might want to allow $B$ to depend on the dimension of the feature space, possibly even exponentially in $d$. So there may be a lot that we are constraining or ruling out with the norm bound.

We also discussed how the Fourier transforms of translation-invariant kernels can have dramatically different decay characteristics. The decay of the Fourier transform, along with the norm bound $B$, affects how rapidly varying the solution can be. Consider the RKHS balls associated with the Gaussian kernel $k(x, x') = \exp(-\alpha^2 \|x - x'\|_2^2)$ and the Laplacian kernel $k(x, x') = \exp(-\alpha \|x - x'\|_2)$ for some $\alpha > 0$, denoted by $\mathcal{H}_B^G$ and $\mathcal{H}_B^L$, respectively. These balls both contain the function $f = 0$, but there are many functions that may be in one ball and not the other. The Rademacher complexity analysis tells us that both will generalize well if $B/\sqrt{n}$ is small, but the two solutions may be very different functions. This means that the empirical risk of one solution may be much smaller than the other and thus lead to a smaller bound on the generalization error. A nice overview of other approaches to understanding kernel methods is given in [11].

To put the analysis to practical use, we can find solutions using different kernels (varying the kernel function and
parameters) and then check the norms and the empirical risks of the various solutions. If a particular solution has
small empirical risk and a small norm in its RKHS, then this indicates it may be a better solution than another that
has a larger empirical risk and/or larger norm in its RKHS. Of course, another practical approach in this setting is
cross-validation: “hold-out" some of the training data and then estimate the error rate of each solution using these
data. The Rademacher complexity analysis is complementary to this as it (a) sheds light on the tradeoffs involved
in learning good classifiers and (b) could be used as a criterion for selection that does not require splitting the data
into train and validation sets (all the available data is used for training).
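The sketch below illustrates this practice under some assumptions not stated in the notes: synthetic one-dimensional data, squared-error loss, and the standard kernel ridge regression closed form $\hat{\alpha} = (K + \lambda I)^{-1} y$ (this closed form is derived in Exercise 3 below). For each kernel it reports the empirical risk and the RKHS norm $\|\hat{f}\|_{\mathcal{H}} = \sqrt{\hat{\alpha}^T K \hat{\alpha}}$ of the fitted function, the two quantities the Rademacher analysis suggests comparing.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 1-d regression data (assumed, for illustration only).
n = 100
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.1 * rng.normal(size=n)

def gaussian_K(x1, x2, sigma):
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * sigma ** 2))

def laplacian_K(x1, x2, alpha):
    return np.exp(-alpha * np.abs(x1[:, None] - x2[None, :]))

lam = 1e-3
for name, K in [("Gaussian (sigma = 0.3)", gaussian_K(x, x, 0.3)),
                ("Laplacian (alpha = 3)", laplacian_K(x, x, 3.0))]:
    alpha_hat = np.linalg.solve(K + lam * np.eye(n), y)   # kernel ridge solution
    f_hat = K @ alpha_hat                                 # fitted values at the training points
    emp_risk = np.mean((y - f_hat) ** 2)                  # empirical risk
    rkhs_norm = np.sqrt(alpha_hat @ K @ alpha_hat)        # ||f_hat||_H, since ||sum a_i k(.,x_i)||^2 = a^T K a
    print(f"{name:24s} empirical risk = {emp_risk:.4f}   RKHS norm = {rkhs_norm:.2f}")
```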

24.4. Exercises

1. Suppose that instead of learning a function from point evaluations, we instead consider learning a function
from generic continuous linear measurements. We can formulate this learning problem over a Hilbert space
H. We can model continuous linear measurements of a function f ∈ H by inner products of the form
⟨νi , f ⟩, i = 1, . . . , n, where νi ∈ H, i = 1, . . . , n are the measurement functionals. Prove that if a solution
exists to the following optimization problem
$$
\min_{f \in \mathcal{H}} \; \sum_{i=1}^n \ell(y_i, \langle \nu_i, f\rangle) + \lambda \|f\|_{\mathcal{H}}^2 , \qquad \lambda > 0,
$$

then the solution admits a representation of the form
$$
f = \sum_{i=1}^n \alpha_i \nu_i , \qquad \alpha_1, \ldots, \alpha_n \in \mathbb{R}.
$$

Hint: The RKHS representer theorem is a special case of this problem when νi = k(·, xi ) since ⟨k(·, xi ), f ⟩ =
f (xi ) by the reproducing property. Adapt the proof of the RKHS representer theorem.

2. Some kernels can be associated with an explicit feature map. For example, the polynomial kernel $k(x_1, x_2) = (x_1^T x_2 + 1)^p = \phi(x_1)^T \phi(x_2)$, where $\phi(x)$ is a $D \times 1$ vector containing all monomials of the elements in $x$ up to and including degree $p$ (here $D$ is the number of distinct monomial terms). In this case, the solution to a learning problem can be written as $\hat{f}(x) = w^T \phi(x)$ for some weight $w \in \mathbb{R}^D$. Suppose that $\|x_i\|_2 \le 1$, $i = 1, \ldots, n$. Derive Rademacher complexity bounds using the kernel approach discussed in this lecture and the linear modeling approach discussed in Lecture 20 with the model $w^T \phi(x)$. How do the bounds compare?

3. Let k be a Gaussian or Laplacian kernel and let H denote the corresponding RKHS.

(a) Show that the solution to
$$
\min_{f \in \mathcal{H}} \; \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2
$$
has the form $\hat{f}_\lambda(\cdot) = \sum_{i=1}^n \hat{\alpha}_i k(\cdot, x_i)$ where $\hat{\alpha} \in \mathbb{R}^n$ is given by
$$
\hat{\alpha} = \left( K + \lambda I \right)^{-1} y
$$
where $y \in \mathbb{R}^n$ is the vector of the labels.


(b) Argue that, in general, the kernel matrix K with ij entry equal to k(xi , xj ) will be full rank (i.e.,
positive definite). Exceptions include cases where certain xi values are repeated, for example.
(c) Assume $K$ is full rank. Then we may take $\lambda = 0$ and obtain the solution $\hat{\alpha} = K^{-1} y$. What is the training error of the solution $\hat{f}_0$? That is, what is $\sum_{i=1}^n (y_i - \hat{f}_0(x_i))^2$?
(d) The training error is the same for any solution of this form, no matter the choice of kernel or its width
parameter. Discuss why and how different choices might lead to different predictions on new test
examples. Use insights from the Rademacher complexity analysis to explain how different choices
might lead to better or worse generalization.

Lecture 25: Neural Networks

Recall that a positive definite kernel $k$ generates an RKHS and that solutions to learning in an RKHS have the form
$$
f(\cdot) = \sum_{i=1}^n \alpha_i k(\cdot, x_i)
$$
where $\{x_i\}_{i=1}^n$ are the training examples and $\{\alpha_i\}_{i=1}^n \subset \mathbb{R}$. We can interpret such a function as a linear combination of fixed nonlinear functions $\{k(\cdot, x_i)\}_{i=1}^n$. Two layer neural networks have a similar form and construction. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be an "activation function." The most common activation function today is the Rectified Linear Unit (ReLU) defined by $\sigma(\cdot) = \max\{0, \cdot\}$. A two layer neural network is a function $f : \mathbb{R}^d \to \mathbb{R}$ of the form
$$
f(x) = \sum_{j=1}^m v_j\, \sigma(w_j^T x + b_j) , \qquad \forall x \in \mathbb{R}^d ,
$$

where vj , bj ∈ R and wj ∈ Rd are “trainable" parameters. Like the RKHS functions, neural network functions
can be viewed as a linear combination of nonlinear functions. Unlike kernel methods, the nonlinear functions in a
neural network are not fixed, since wj and bj are adjustable parameters. This is a key distinction between neural
networks and kernel methods.
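As a minimal concrete picture of this form (a numpy sketch, not part of the original notes; the weights here are arbitrary), the forward pass of a two layer ReLU network is just a few lines:

```python
import numpy as np

def two_layer_relu(x, W, b, v):
    """f(x) = sum_j v_j * max(0, w_j^T x + b_j).

    x : (d,) input, W : (m, d) input weights, b : (m,) biases, v : (m,) output weights.
    """
    return v @ np.maximum(0.0, W @ x + b)

# A tiny network with m = 3 neurons on inputs in R^2 (arbitrary weights, for illustration).
rng = np.random.default_rng(0)
W, b, v = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=3)
print(two_layer_relu(np.array([0.5, -1.0]), W, b, v))
```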

25.1. Neural Network Function Spaces

To be concrete, let us fix the activation function to be the ReLU. Also for notational convenience, we will append
the bias bj to the input weight vector wj and append a 1 to x so that each neuron is written as

$$
\sigma(w_j^T x) = \max(0, w_j^T x) =: (w_j^T x)_+ .
$$
We will assume this for the remainder of the discussion. Define the set of neural network functions as
$$
\mathcal{F} = \Big\{ f : f(x) = \sum_{j=1}^m v_j (w_j^T x)_+ ,\ m \ge 1,\ w_j \in \mathbb{R}^{d+1},\ v_j \in \mathbb{R} \Big\} .
$$

If f, g ∈ F, then f + g is also a neural network function in F and so F is a vector space (all the other properties
of a vector space are easily verified).

The next step is to add a norm to this vector space. Since the weights of a neural network determine the function
it represents, any norm we choose will depend on the weights in some way. One natural thing we can try is to
define the norm of a neural network function as some norm on the weights of the neural network. Let us consider
the Euclidean norm of the weights, the square root of the sum of squared weights. To gain some insight, consider
a simple ReLU neural network with a single neuron, f (x) = v(wT x)+ . The Euclidean norm of the weights is
(∥w∥22 + |v|2 )1/2 . Note that the function f tends to zero if v or w tends to zero, but the Euclidean norm of the
weights tends to zero only if both v and w tend to zero. For example, if w ̸= 0 and v = 0, then f = 0 but
(∥w∥22 + |v|2 )1/2 ̸= 0. This shows that the Euclidean norm of the weights is not a valid norm for neural network
functions.

The problem arises because of the way neural networks are parameterized, with both input and output weights
for each neuron. Inspection of $f(x) = v(w^T x)_+$ shows that $f$ tends to the 0 function if and only if the product $|v|\,\|w\|_2 \to 0$. This suggests using this product as the basis for a norm on $\mathcal{F}$. Consider a general $f \in \mathcal{F}$ of the form $f(x) = \sum_{j=1}^m v_j (w_j^T x)_+$ and consider $\|f\| := \sum_{j=1}^m \|v_j w_j\|_2$ as a norm. Then $\|f\| = 0$ if and only if $f = 0$, and $\|\alpha f\| = \|\sum_{j=1}^m \alpha v_j (w_j^T x)_+\| = \sum_{j=1}^m \|\alpha v_j w_j\|_2 = |\alpha|\,\|f\|$ for any $\alpha \in \mathbb{R}$. Let $\tilde{f}$ be another neural network function with $\tilde{m}$ neurons and weights $\tilde{w}_j$ and $\tilde{v}_j$.

$$
\|f + \tilde{f}\| \;\le\; \sum_{j=1}^m \|v_j w_j\|_2 + \sum_{j=1}^{\tilde{m}} \|\tilde{v}_j \tilde{w}_j\|_2 = \|f\| + \|\tilde{f}\| .
$$

The inequality arises because there could be neurons in $f$ and $\tilde{f}$ with exactly the same input weights and biases but output weights of differing signs. So this is indeed a valid norm on $\mathcal{F}$, and it is often called the path-norm of the network. The vector space $\mathcal{F}$ of two layer ReLU neural networks equipped with the path-norm is a normed vector space and its completion is a Banach space.11

While the path-norm may seem unusual at first glance, it actually arises naturally in neural network training. The most common sort of regularization in neural network training is called "weight decay," explained as follows. Let $L(f) = \sum_{i=1}^n \ell(y_i, f(x_i))$, the sum of losses for the neural network function $f$ on a training set. Consider the optimization
$$
\min_{f} \; L(f) + \frac{\lambda}{2} \sum_{j=1}^m \left( \|w_j\|_2^2 + |v_j|^2 \right) .
$$

Neural network training is based on gradient descent. The negative gradient of the objective with respect to any weight in the network, say $v_k$ for example, is $-\frac{\partial L}{\partial v_k} - \lambda v_k$. So each step of gradient descent will take a small step in the direction $-\frac{\partial L}{\partial v_k}$ with a proportionately small amount of weight decay $-\lambda v_k$. In other words, weight decay in gradient descent training is equivalent to regularizing the sum of squared weights.

The ReLU is piecewise linear and the functions $\alpha^{-1}(\alpha x)_+$ are equivalent for all $\alpha > 0$. Let us reconsider the optimization in light of this. We can scale the input and output weights of the $j$th neuron by $\alpha_j > 0$ and $\alpha_j^{-1}$ without affecting the neural network function. Let $f_\alpha$ denote this equivalent function and consider the optimization
$$
\min_{f_\alpha} \; L(f_\alpha) + \frac{\lambda}{2} \sum_{j=1}^m \left( \alpha_j^2 \|w_j\|_2^2 + \alpha_j^{-2} |v_j|^2 \right) .
$$
The loss is invariant to $\alpha$, but for any set of weights it is easy to check that the regularization term is smallest for $\alpha_j^2 = |v_j|/\|w_j\|_2$. Thus, at a global minimum of the objective function we have
$$
\frac{1}{2}\left( \|w_j\|_2^2 + |v_j|^2 \right) = \|v_j w_j\|_2 .
$$
This implies that solutions to the optimization
$$
\min_{f} \; L(f) + \lambda \sum_{j=1}^m \|v_j w_j\|_2
$$
are equivalent to those of the weight decay objective.
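The rescaling argument above is easy to check numerically. The sketch below (a toy numpy illustration with an arbitrary single neuron, not from the notes) verifies that function-preserving rescalings leave the output unchanged while the weight-decay penalty varies, and that the minimal penalty equals the path-norm $|v|\,\|w\|_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)      # input weights of a single ReLU neuron (arbitrary)
v = 2.5                     # output weight
x = rng.normal(size=4)      # a test input

f = lambda w_, v_, x_: v_ * np.maximum(0.0, w_ @ x_)

for alpha in [0.25, 0.5, 1.0, 2.0, 4.0]:
    w_a, v_a = alpha * w, v / alpha                   # function-equivalent rescaling
    penalty = 0.5 * (np.sum(w_a ** 2) + v_a ** 2)     # weight-decay term for this parameterization
    print(f"alpha = {alpha:4.2f}   f(x) = {f(w_a, v_a, x):.4f}   penalty = {penalty:.4f}")

# The penalty is minimized at alpha^2 = |v| / ||w||_2, where it equals |v| * ||w||_2 (the path-norm).
alpha_star = np.sqrt(np.abs(v) / np.linalg.norm(w))
w_s, v_s = alpha_star * w, v / alpha_star
print("minimal penalty:", 0.5 * (np.sum(w_s ** 2) + v_s ** 2))
print("path-norm |v| * ||w||_2:", np.abs(v) * np.linalg.norm(w))
```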

Another perspective that sheds some light on this choice of norm is the notion of stability. We call a function $f$ stable if $f(x) \approx f(x + \epsilon)$ for any small perturbation $\epsilon$ and every $x$. Stable functions have good generalization and robustness properties, since they produce similar outputs for similar inputs. Consider a single ReLU neuron function $f(x) = v_j (w_j^T x)_+$ and note that
$$
|f(x + \epsilon) - f(x)| \le \|v_j w_j\|_2 \, \|\epsilon\|_2 .
$$
So it is stable if the product $v_j w_j$ has a small norm. This is the case if both $v_j$ and $w_j$ are small, but also if one is large and the other is much smaller. This illustrates the problem with the Euclidean norm of the weights. The Euclidean norm of the weights $(\sum_j \|w_j\|_2^2 + |v_j|^2)^{1/2}$ is small if both $v_j$ and $w_j$ are small, but it is not small if one is large and the other is very small (e.g., $\|w_j\|_2 = 1$ and $v_j = 0.001$). So the Euclidean norm of the weights does not reflect the stability of neural network functions.

11. Because this space contains generalized functions (measures), like the Dirac delta, the completion is not with respect to the norm topology of the space, but with respect to the weak* topology via Prokhorov's theorem.

25.2. ReLU Neural Network Banach Space

To characterize the sort of functions that are in the two layer ReLU neural network Banach space, let us consider the one-dimensional (univariate) case where $f : \mathbb{R} \to \mathbb{R}$. For this characterization we will not regularize the biases $b_j$, and so the path norm in this case is simply $\sum_{j=1}^m |v_j w_j|$. Regularizing the biases is unnecessary since if $v_j = 0$, then the neuron does not contribute to the neural network function $f$. To ensure this is still a valid norm, we can fix $|w_j| = 1$ and absorb its scale into $v_j$. With this normalization, the path norm is simply $\sum_{j=1}^m |v_j|$.

The derivative of a ReLU function is a step function, that is, for any $b \in \mathbb{R}$
$$
\frac{\partial \sigma(x - b)}{\partial x} = \frac{\partial \max\{0, x - b\}}{\partial x} = \begin{cases} 1 & x \ge b \\ 0 & x < b \end{cases}
$$
Consider a univariate neural network function $f(x) = \sum_{j=1}^m v_j \sigma(w_j x + b_j)$. Its path-norm is $\sum_{j=1}^m |v_j w_j|$. We can move all the scaling of $|w_j|$ out of the ReLU functions and write $f(x) = \sum_{j=1}^m v_j |w_j|\, \sigma\!\big(\tfrac{w_j}{|w_j|}(x + b_j/w_j)\big)$. The derivative of $f$ is
$$
f'(x) = \sum_{j=1}^m v_j w_j \, u\!\Big( \tfrac{w_j}{|w_j|}\big(x + b_j/w_j\big) \Big) ,
$$
where $u(\cdot)$ is the unit step function that is 1 when its argument is nonnegative and 0 otherwise. So $f'$ is a piecewise constant function. The total variation of such a function is the sum of the sizes of the changes/jumps in the function. So the total variation of $f'$ is
$$
\mathrm{TV}(f') = \sum_{j=1}^m |v_j w_j| .
$$

In other words, in the univariate case the path-norm is equal to the total variation of the derivative of $f$. The Banach space of functions with derivatives of finite total variation is called $\mathrm{BV}^2(\mathbb{R})$. This is the ReLU neural network Banach space. If a function $f$ has its second derivative $f'' \in L_1(\mathbb{R})$, then $\mathrm{TV}(f') = \int |f''(x)|\, dx$. So we can think of the path-norm as measuring the $L_1$ norm of the second derivative of the neural network function.
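This identity is easy to check numerically. The sketch below (a toy numpy check with arbitrary weights, not part of the notes) evaluates a small univariate ReLU network on a fine grid, computes the total variation of its finite-difference derivative, and compares it to the path-norm $\sum_j |v_j w_j|$.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
w = rng.uniform(0.5, 2.0, size=m) * rng.choice([-1.0, 1.0], size=m)  # scalar input weights
b = rng.uniform(-1.0, 1.0, size=m)                                   # biases (breakpoints at -b_j/w_j)
v = rng.normal(size=m)                                               # output weights

f = lambda x: np.sum(v * np.maximum(0.0, np.outer(x, w) + b), axis=1)

# Finite-difference derivative on a fine grid covering all breakpoints.
x = np.linspace(-5.0, 5.0, 200001)
fprime = np.diff(f(x)) / np.diff(x)

tv_numeric = np.sum(np.abs(np.diff(fprime)))   # total variation of the piecewise constant derivative
path_norm = np.sum(np.abs(v * w))

print("TV(f') (numerical):      ", tv_numeric)
print("path-norm sum |v_j w_j|: ", path_norm)
```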

25.3. Exercises

1. Let f be a univariate two layer ReLU neural network. What is its second derivative? Is it in L1 (R)?
2. The path-norm of a univariate ReLU neural network is $\sum_{j=1}^m |v_j w_j|$, which is the $\ell^1$ norm of the vector $\{v_j w_j\}_{j=1}^m$. Since regularizing with the path-norm is equivalent to weight decay, what does this suggest about the nature of solutions as we increase $\lambda$?

Lecture 26: Neural Network Approximation and Generalization Bounds

Let σ be the ReLU activation function, σ(x) = max{0, x}, and consider two layer neural networks with neurons
of the form σ(wT x + b) with x ∈ Rd . We will append the bias of each neuron to the input weight vector and
append a 1 to x, denoted by x. Hence, we will use the notation σ(wT x), with w ∈ Rd+1 , for each neuron.
Because the ReLU function is piecewise linear, the size of w can be absorbed into the output weight v, so let us
assume that ∥w∥2 = 1. Consider the space of neural network functions mapping Rd → R
$$
\Big\{ f : f(x) = \sum_{j=1}^m v_j \sigma(w_j^T x),\ m \ge 1,\ w_j \in \mathbb{R}^{d+1},\ \|w_j\|_2 = 1,\ v_j \in \mathbb{R} \Big\} .
$$

The set of vectors in $\mathbb{R}^{d+1}$ satisfying $\|w\|_2 = 1$ is the surface of the unit sphere in $d+1$ dimensions, denoted by $\mathbb{S}^d$. Let $\mathcal{F}$ be the space of all functions of the form
$$
f(x) = \int \sigma(w^T x)\, d\nu(w)
$$
where $\nu(w)$ is a finite measure on $\mathbb{S}^d$. The measure $\nu$ plays the role of the output weights. If we take the measure $d\nu(w) = \sum_{j=1}^m v_j \delta(w - w_j)\, dw$, the integral formula produces the finite width neural network
$$
f(x) = \sum_{j=1}^m v_j \sigma(w_j^T x) .
$$

Thus, $\mathcal{F}$ contains all functions in the vector space above. The measure $\nu$ can be split into positive and negative parts $\nu = \nu^+ + \nu^-$, which suggests the norm
$$
\|f\| = \int_{\mathbb{S}^d} d\nu^+(w) - \int_{\mathbb{S}^d} d\nu^-(w) .
$$
Observe that for a finite width neural network $\|f\| = \sum_{j=1}^m |v_j|$.

There is a small problem with the norm defined above, due to the fact that the same function could be represented by different neural networks (and different measures $\nu$). For example, adding the neurons $v\sigma(w^T x)$ and $-v\sigma(w^T x)$ to any network does not change the function it represents (since the contributions of the two neurons cancel each other). To deal with this, we will define the norm of a function $f$ to be
$$
\|f\| := \inf_{\nu \,:\, f = \int \sigma(w^T x)\, d\nu(w)} \left( \int_{\mathbb{S}^d} d\nu^+(w) - \int_{\mathbb{S}^d} d\nu^-(w) \right) .
$$
Taking the infimum over representations eliminates the problem of non-uniqueness.

Equipped with $\|f\|$, $\mathcal{F}$ is a Banach space written as
$$
\mathcal{F} = \Big\{ f : f(x) = \int \sigma(w^T x)\, d\nu(w) ,\ \|f\| < \infty \Big\} .
$$

Specialized to the case d = 1, this is the space BV2 discussed in the last lecture. Roughly speaking, we can think
of this as the space of functions with absolutely integrable second derivatives. For d ≥ 1, the space F is also
characterized in terms of second derivatives, but measured in the Radon transform domain [20, 21].

26.1. Approximating Functions in F

In general, an $f \in \mathcal{F}$ is represented by an infinite width neural network. However, in practice we will always use finite width neural networks. How wide should a practical neural network be? To answer this we will quantify how well any function in $\mathcal{F}$ can be approximated by a neural network of width $m$ in the following sense. Let $\mathcal{F}_m$ denote the set of all neural networks with width at most $m$ and for any $f \in \mathcal{F}$ consider
$$
\min_{f_m \in \mathcal{F}_m} \|f - f_m\|_{L_2(\Omega)}
$$
where $\|g\|_{L_2(\Omega)}^2 := \int_\Omega |g(x)|^2\, dx$ for some bounded domain $\Omega \subset \mathbb{R}^d$. A small approximation error $\|f - f_m\|_{L_2(\Omega)}$ means that $f_m$ is a good approximation to $f$. To interpret this, consider the following. Suppose we pick $x$ at random from a probability density $p$ supported in $\Omega$. Then
$$
\mathbb{E}[|f(x) - f_m(x)|^2] \le M \|f - f_m\|_{L_2(\Omega)}^2
$$

where M is the maximum value of the density p. Here is the result we will prove, which is from [1] and based on
a simple argument attributed to Maurey in [22].

Theorem 26.1.1. Let $\Omega \subset \mathbb{R}^d$ be a bounded domain. Then there exists a constant $C_0 > 0$ such that for every $m \ge 1$ and any $f \in \mathcal{F}$ there is a width $m$ neural network satisfying
$$
\|f - f_m\|_{L_2(\Omega)}^2 \le \frac{C_0}{m} .
$$

The theorem shows that the approximation error is proportional to $1/m$, which means that finite width neural networks are good approximations to any function in $\mathcal{F}$. Remarkably, the approximation rate has no dependence on the dimension $d$ of the domain. For any $\epsilon > 0$, a network of width $O(1/\epsilon)$ is sufficient for $\epsilon$ approximation error. This is unusual, since for most common multivariate function approximation methods the rate depends on the dimension like $\sqrt[d]{m}$, or equivalently an $\epsilon$ approximation error requires $O(1/\epsilon^d)$ terms. For example, suppose the domain is $\Omega = [0,1]^d$ and consider a histogram approximation (piecewise constant approximation) of $f \in \mathcal{F}$. In general, we would require at least $m^d$ bins in the histogram partition to guarantee an error of $1/m$. This is the so-called "curse of dimensionality". The same is true for any kernel method. In contrast, neural network approximations in $\mathcal{F}$ are immune to the curse.

The reason for this immunity is simple. Consider the ball of radius $C > 0$ in $\mathcal{F}$. This contains many functions, including finite width neural networks, which must satisfy $\sum_j |v_j| \le C$. This constraint means that the finite
width networks may have many neurons with very small vj , but only a few with larger vj . The magnitude |vj | is
the slope of the jth ReLU function. So functions in any ball of finite radius in F may only have large slopes and
variation in (at most) a small number of directions, which are determined by the corresponding input weights. So
the space F contains functions that are very smooth except possibly in a few directions. In this sense, any function
in F is intrinsically low dimensional, but there is no single low-dimensional subspace that contains every function
in F. The key characteristic of neural networks is that they can adapt to the special directions of an underlying f
by learning the appropriate input weights. In this sense, the neurons are what we might call “steerable".

We conclude this section with an elementary proof of the theorem based on a probabilistic argument. We have
specialized this result to the space L2 (Ω) and to ReLU neural networks. However, the arguments only hinge on
the fact that the functions we aim to approximate are essentially ℓ1 combinations of functions that are bounded on
Ω. So the same result can hold for other constructions that meet these requirements, including neural networks
with different activation functions.

Proof. Let $C > 0$ and, without loss of generality, assume that $\|f\| \le C$. Define the set of neurons
$$
\mathcal{N}_C = \left\{ \eta : \eta(x) = v\, \sigma(w^T x) ,\ |v| \le C \right\} .
$$

Let $\mathrm{conv}(\mathcal{N}_C)$ denote the convex hull of $\mathcal{N}_C$. This is the set of all functions of the form
$$
f(x) = \sum_{j \ge 1} \gamma_j v_j \sigma(w_j^T x)
$$
where $\gamma_j \ge 0$ and $\sum_{j \ge 1} \gamma_j = 1$. In other words, $\mathrm{conv}(\mathcal{N}_C)$ contains all two layer neural networks with $\|v\|_1 \le C$, where $v$ is a vector containing $\{v_j\}_{j \ge 1}$. These neural networks may be arbitrarily wide, but must satisfy this $\ell^1$ bound on the vector of output weights.

Because |v| ≤ C, ∥w∥2 = 1, and Ω is a bounded domain, every η ∈ NC is bounded on Ω and therefore there
exists a B > 0 such that ∥η∥L2 (Ω) ≤ B for all η ∈ NC . This and the triangle inequality implies that ∥f ∥L2 (Ω) ≤ B
for all f ∈ conv(NC ).

Let $\mathcal{G}_C$ denote the completion of $\mathrm{conv}(\mathcal{N}_C)$ in $L_2(\Omega)$. This means $\mathrm{conv}(\mathcal{N}_C)$ is dense in $\mathcal{G}_C$ (with respect to the $L_2(\Omega)$ norm) and $f \in \mathcal{G}_C$. For ease of notation, we will denote the norm in $L_2(\Omega)$ by $\|\cdot\|_{L_2}$, that is, $\|f\|_{L_2}^2 = \int_\Omega |f(x)|^2\, dx$.

Given some $\delta > 0$ and $f \in \mathcal{G}_C$, there exists $\bar{f} \in \mathrm{conv}(\mathcal{N}_C)$ satisfying $\|f - \bar{f}\|_{L_2} \le \delta/m$, since $\mathrm{conv}(\mathcal{N}_C)$ is dense in $\mathcal{G}_C$. Thus, $\bar{f} = \sum_{j=1}^N \gamma_j \bar{\eta}_j$ with $\bar{\eta}_j \in \mathcal{N}_C$, $\gamma_j \ge 0$, $\sum_{j=1}^N \gamma_j = 1$ for some sufficiently large $N$, possibly much larger than $m$. Let $\eta_i$, $i = 1, \ldots, m$, be drawn independently from $\{\bar{\eta}_1, \ldots, \bar{\eta}_N\}$ according to the probabilities $\{\gamma_1, \ldots, \gamma_N\}$. That is, $\mathbb{P}(\eta_i = \bar{\eta}_j) = \gamma_j$. Let $\hat{f}_m = \frac{1}{m}\sum_{i=1}^m \eta_i$. Because $\mathbb{E}[\eta_i] = \sum_{j=1}^N \gamma_j \bar{\eta}_j = \bar{f}$, we have $\mathbb{E}[\hat{f}_m] = \bar{f}$. Furthermore,
$$
\begin{aligned}
\mathbb{E}\big[ \|\hat{f}_m - \bar{f}\|_{L_2}^2 \big]
&= \mathbb{E}\Big[ \Big\| \frac{1}{m}\sum_{i=1}^m \eta_i - \bar{f} \Big\|_{L_2}^2 \Big]
 = \frac{1}{m}\, \mathbb{E}\big[ \|\eta_1 - \bar{f}\|_{L_2}^2 \big] \\
&= \frac{1}{m}\left( \mathbb{E}\big[ \|\eta_1\|_{L_2}^2 \big] - \|\bar{f}\|_{L_2}^2 \right)
 \le \frac{B^2 - \|\bar{f}\|_{L_2}^2}{m} \le \frac{B^2}{m}
\end{aligned}
$$
where the second and third equalities follow from the fact that the $\eta_i$ are iid and $\mathbb{E}[\eta_i] = \bar{f}$. Since $\hat{f}_m$ satisfies the bound on average, there must exist at least one specific $f_m \in \mathrm{conv}(\mathcal{N}_C)$ of width $m$ that does as well. Then we have
$$
\|f - f_m\|_{L_2} = \|f - \bar{f} + \bar{f} - f_m\|_{L_2} \le \|f - \bar{f}\|_{L_2} + \|\bar{f} - f_m\|_{L_2} \le \delta/m + \frac{B}{\sqrt{m}} .
$$
Since $\delta > 0$ was arbitrary, this completes the proof.
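The sampling construction in the proof is simple to simulate. The sketch below (an illustrative numpy experiment with an arbitrary wide "target" network and a Monte Carlo proxy for the $L_2(\Omega)$ norm; none of the specific values come from the notes) draws $m$ neurons at random from a convex combination of $N = 500$ neurons and shows that the squared error decays roughly like $1/m$, independent of the dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

# A wide "target" f_bar: a convex combination of N neurons eta_j(x) = C * s_j * relu(w_j^T x).
N, d, C = 500, 3, 2.0
W = rng.normal(size=(N, d + 1))
W /= np.linalg.norm(W, axis=1, keepdims=True)        # ||w_j||_2 = 1
s = rng.choice([-1.0, 1.0], size=N)
gamma = rng.dirichlet(np.ones(N))                    # convex weights

# Monte Carlo proxy for the L2(Omega) norm: average over random points in Omega = [-1, 1]^d.
P = 2000
Xa = np.hstack([rng.uniform(-1, 1, size=(P, d)), np.ones((P, 1))])  # append a 1 for the bias
neurons = C * s[None, :] * np.maximum(0.0, Xa @ W.T)                # eta_j(x) for all points and neurons
f_bar = neurons @ gamma

for m in [1, 4, 16, 64, 256]:
    errs = []
    for _ in range(50):
        idx = rng.choice(N, size=m, p=gamma)        # draw m neurons with probabilities gamma
        f_m = neurons[:, idx].mean(axis=1)          # the width-m average network
        errs.append(np.mean((f_m - f_bar) ** 2))
    print(f"m = {m:4d}   average squared error = {np.mean(errs):.5f}")
```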

26.2. Generalization Bounds for Neural Networks

Single-hidden-layer neural networks are functions mapping $\mathbb{R}^d \to \mathbb{R}$ with the following form
$$
f(x) = \sum_{j=1}^m v_j \sigma(w_j^T x) ,
$$

where σ is an activation function. Although different choices are possible we will focus on the popular Rectified
Linear Unit (ReLU) activation function, defined as σ(x) = max{0, x}. As above the input x ∈ Rd , x ∈ Rd+1 is
x appended with a 1, and wj ∈ Rd+1 .

Theorem 26.2.1. Let $\{x_1, \ldots, x_n\}$ be a set of points in $\mathbb{R}^d$. Consider the class of two layer neural networks
$$
\mathcal{F}_C = \left\{ f(x) = \sum_{j=1}^m v_j \sigma(w_j^T x) \,:\, m \ge 1 ,\ \sum_{j=1}^m |v_j|\, \|w_j\| \le C \right\} .
$$
The empirical Rademacher complexity of $\mathcal{F}_C$ satisfies
$$
\widehat{\mathcal{R}}_n(\mathcal{F}_C(x_1, \ldots, x_n)) \le \frac{2C}{n} \sqrt{ \sum_{i=1}^n \|x_i\|^2 } .
$$

Note that this bound does not involve m, the number of neurons. Rather, it depends on the size of the neural
network weights. This gives some insight on why having a large number of neurons does not negatively impact
the ability of neural networks to generalize well.

Proof. We will take $C = 1$ in the proof; the same argument applies for any $C > 0$. Recall $\sigma_i$, $i = 1, \ldots, n$, are $\pm 1$ valued iid random variables (not to be confused with $\sigma$, which denotes the ReLU).
$$
\begin{aligned}
\widehat{\mathcal{R}}_n(\mathcal{F}_1, (x_1, \ldots, x_n))
&= \mathbb{E}\left[ \sup_{f \in \mathcal{F}_1} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i) \right] \\
&= \mathbb{E}\left[ \sup_{\{v_j, w_j\} : \sum_{j=1}^N |v_j|\,\|w_j\| \le 1} \frac{1}{n}\sum_{i=1}^n \sigma_i \sum_{j=1}^N v_j \sigma(w_j^T x_i) \right] \\
&= \mathbb{E}\left[ \sup_{\{v_j, w_j\} : \sum_{j=1}^N |v_j|\,\|w_j\| \le 1} \frac{1}{n} \sum_{j=1}^N v_j \|w_j\| \left( \sum_{i=1}^n \sigma_i \sigma(x_i^T w_j / \|w_j\|) \right) \right] \\
&\le \mathbb{E}\left[ \sup_{\{v_j, w_j\} : \sum_{j=1}^N |v_j|\,\|w_j\| \le 1} \frac{1}{n} \left( \sum_{j=1}^N |v_j|\,\|w_j\| \right) \max_{1 \le j \le N} \sum_{i=1}^n \sigma_i \sigma(x_i^T w_j / \|w_j\|) \right] ,
\end{aligned}
$$
where in the second to last step we used the following property of the ReLU: $\sigma(\alpha z) = \alpha \sigma(z)$ for $\alpha \ge 0$. The last step follows from Hölder's inequality, namely, for two vectors $a, b \in \mathbb{R}^d$ we have
$$
a^T b = \sum_{i=1}^d a_i b_i \le \|a\|_1 \|b\|_\infty = \sum_{i=1}^d |a_i| \, \max_i |b_i| .
$$

Continuing, we have
$$
\begin{aligned}
\widehat{\mathcal{R}}_n(\mathcal{F}_1, (x_1, \ldots, x_n))
&\le \frac{1}{n}\, \mathbb{E}\left[ \sup_{\{w_j\} : \|w_j\| \le 1} \max_{1 \le j \le N} \sum_{i=1}^n \sigma_i \sigma(x_i^T w_j) \right] \\
&= \frac{1}{n}\, \mathbb{E}\left[ \sup_{w : \|w\| \le 1} \sum_{i=1}^n \sigma_i \sigma(x_i^T w) \right] \\
&\le \frac{2}{n}\, \mathbb{E}\left[ \sup_{w : \|w\| \le 1} \sum_{i=1}^n \sigma_i x_i^T w \right] \\
&\le \frac{2}{n}\, \mathbb{E}\left[ \Big\| \sum_{i=1}^n \sigma_i x_i \Big\| \right] \\
&\le \frac{2}{n} \sqrt{ \sum_{i=1}^n \|x_i\|^2 } .
\end{aligned}
$$
The first equality holds since the supremum over $w_j$ is the same for all terms. The first inequality is the contraction property of Rademacher complexity (see Lemma 26.2.2 below). The second to last inequality follows by the Cauchy-Schwarz inequality, and the last by Jensen's inequality.
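The last two steps are easy to check by simulation. The sketch below (an illustrative numpy check with synthetic feature vectors, not from the notes) estimates $\frac{1}{n}\mathbb{E}\|\sum_i \sigma_i x_i\|$, which is the Rademacher complexity of the linear class $\{x \mapsto w^T x : \|w\|_2 \le 1\}$, and compares it to the bound $\frac{1}{n}\sqrt{\sum_i \|x_i\|^2}$ coming from Jensen's inequality (the factor 2 from the contraction step is omitted here).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.normal(size=(n, d))     # synthetic feature vectors x_1, ..., x_n (assumed)

# Monte Carlo estimate of (1/n) E || sum_i sigma_i x_i ||_2.
vals = []
for _ in range(5000):
    sigma = rng.choice([-1.0, 1.0], size=n)
    vals.append(np.linalg.norm(sigma @ X) / n)

print("Monte Carlo estimate:", np.mean(vals))
print("Jensen bound (1/n) * sqrt(sum ||x_i||^2):", np.sqrt(np.sum(X ** 2)) / n)
```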

The following 2-sided generalization of Lemma 20.2.1, sometimes called the "contraction" property of Rademacher complexity, can be found in [6, Theorem 11.6].
Lemma 26.2.2. Consider $z_1(\theta), z_2(\theta), \ldots, z_n(\theta)$, a collection of stochastic processes indexed by $\theta \in \Theta$. Let $\sigma_1, \ldots, \sigma_n$ be independent Rademacher random variables. Then, for any $L$-Lipschitz function $\varphi$ satisfying $\varphi(0) = 0$,
$$
\mathbb{E}\left[ \sup_{\theta \in \Theta} \sum_{i=1}^n \sigma_i \varphi\big(z_i(\theta)\big) \right] \le 2L\, \mathbb{E}\left[ \sup_{\theta \in \Theta} \sum_{i=1}^n \sigma_i z_i(\theta) \right] .
$$

26.3. Exercises
1. Consider a neural network function $f(x) = \sum_{j=1}^m v_j \sigma(w_j^T x)$. Its norm is $\|f\| = \sum_{j=1}^m \|v_j w_j\|_2$. Show that the space of functions with this norm is not a Hilbert space by demonstrating that it fails to satisfy the Parallelogram Law. Hint: Consider functions $f$ and $g$ with just one ReLU neuron each.

2. In engineering textbooks, the Dirac delta on $\mathbb{R}$ is a distribution (generalized function) defined as
$$
\delta(x) = \begin{cases} 0 & x \ne 0 \\ \infty & x = 0 \end{cases}
$$
with the property that
$$
\int_{\mathbb{R}} f(x)\delta(x)\, dx = f(0)
$$
for some sufficiently nice $f : \mathbb{R} \to \mathbb{R}$. Use this property to morally12 show that $\delta \notin L_2(\mathbb{R})$.
Hint: Morally, $\|\delta\|_{L_2}^2 = \int_{\mathbb{R}} \delta(x)\delta(x)\, dx$.

3. Let $\Omega \subset \mathbb{R}^d$ be a bounded domain. Consider a set of functions $\Phi$, where each $\varphi \in \Phi$ maps from $\Omega$ to $\mathbb{R}$ and is bounded on $\Omega$. The set may be uncountably infinite. Consider the convex hull
$$
\mathcal{C}_\Phi = \Big\{ f : f = \sum_{j \ge 1} \gamma_j \varphi_j ,\ \varphi_j \in \Phi ,\ \gamma_j \ge 0 ,\ \sum_{j \ge 1} \gamma_j = 1 \Big\} .
$$
Let $\|\varphi\|_\infty = \sup_{x \in \Omega} |\varphi(x)|$ and define the norm $\|f\| = \sum_{j \ge 1} \gamma_j \|\varphi_j\|_\infty$. Let $\bar{\mathcal{C}}_\Phi$ denote the closure of $\mathcal{C}_\Phi$ with respect to the $L_2(\Omega)$ norm. Show that there exists a constant $C > 0$ such that for every $m \ge 1$ and any $f \in \bar{\mathcal{C}}_\Phi$ there is an $m$-term function $f_m \in \mathcal{C}_\Phi$ (i.e., at most $m$ nonzero terms) satisfying
$$
\int_\Omega |f(x) - f_m(x)|^2\, dx \le \frac{C}{m} .
$$
Notice there is no dependence on the dimension of $\Omega$.

4. In most cases the feature vectors $x_1, \ldots, x_n$ are bounded (meaning their norm is finite). Assume that $\|x_i\|_2 \le B$ and show that the Rademacher complexity of $\mathcal{F}_C$ is bounded by $2C\sqrt{B^2 + 1}/\sqrt{n}$.

5. Let $\hat{f} \in \mathcal{F}_C$ be a binary classifier that minimizes the logistic or hinge loss on an iid training set $\{(x_i, y_i)\}_{i=1}^n$. Bound the generalization error $\mathbb{P}(y \ne \hat{f}(x))$, where $(x, y)$ is drawn from the same distribution as the training data.

12. This fact can be shown rigorously using standard duality arguments used when working with distributions, but is out of the scope of this class.

Appendix A: Notation

P(A) is the probability of the event A with respect to everything that is random in the definition of A. The
symbol P alone means the joint distribution of everything random in the setting under consideration.

PXY is the joint distribution of (X, Y ), PX is the marginal distribution of X.

E[Z] is the expectation of the random variable Z with respect to everything that is random in the definition of Z.

V[Z] = E[(Z − E[Z])2 ] is the variance of the random variable Z with respect to everything that is random in the
definition of Z.

EY |X [f (X, Y )] is the conditional expectation of the random variable f (X, Y ) given X. This is also sometimes
denoted as E[f (X, Y )|X].

p(x, y) denotes the joint probability/mass function of X and Y . In classification settings Y is discrete valued
and X may be continuous and/or discrete valued. If both X and Y are discrete, then p(x, y) denotes the
probability that X = x and Y = y. If X is continuous, then p(x, y) = p(x|y)p(y) where p(x|y) is the
probability density of X given Y = y and p(y) is the probability that Y = y. p(x) is the marginal density
(or mass function) of x.

P(Y = 1|X = x) is more explicit notation for the conditional probability that Y = 1 given X = x, which is the
same as p(1|x) = p(x, 1)/p(x), using the notation above. It’s used to make clear that the 1 refers to Y = 1.
Another way this might be denoted is pY |X (1|x), which also clarifies that it refers to the situation when Y = 1.

X ∼ PX means that X is a random variable with distribution PX . If p(x) denotes the probability density function
of X, then we may also write X ∼ p.

If X1 , X2 , . . . , Xn are independent and identically distributed according to PX with density function p(x), then we write X1 , . . . , Xn ∼iid PX or X1 , . . . , Xn ∼iid p.

For clarity, we use X to denote a random variable and x to denote a specific value that X may take. Sometimes
this becomes cumbersome. For example, we often use capital letters to denote matrices and lower case
letters to denote vectors and scalars. So we just use x generically and the context will dictate whether we
are talking about the random variable or a specific value taken by the random variable.

In linear algebra notation, vectors and matrices are often denoted with bold lower or upper case symbols. For
example, x, X, and x denote a vector, matrix, and scalar, respectively.

Appendix B: Useful Inequalities

Markov's Inequality: For any nonnegative (scalar) random variable $X$
$$
\mathbb{P}(X \ge t) \le \frac{\mathbb{E}[X]}{t}
$$

Chebyshev's Inequality: Let $X$ be a random variable with mean $\mu$ and variance $\sigma^2$. Then
$$
\mathbb{P}(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}
$$

Jensen’s Inequality: Let X be a random variable. For any convex function φ

E[φ(X)] ≥ φ(E[X])

Cauchy-Schwarz Inequality: Let $f$ and $g$ be functions. Then
$$
\int f(x) g(x)\, dx \le \left( \int |f(x)|^2\, dx \right)^{1/2} \left( \int |g(x)|^2\, dx \right)^{1/2}
$$
If $x$ and $y$ are random variables, then
$$
|\mathbb{E}[xy]|^2 \le \mathbb{E}[x^2]\, \mathbb{E}[y^2]
$$

Hölder's Inequality: Let $f$ and $g$ be functions, and let $p, q \ge 1$ satisfy $1/p + 1/q = 1$. Then
$$
\int |f(x) g(x)|\, dx \le \left( \int |f(x)|^p\, dx \right)^{1/p} \left( \int |g(x)|^q\, dx \right)^{1/q}
$$

Chernoff's Bound: Let $z_1, z_2, \ldots, z_n$ be independent bounded random variables such that $z_i \in [0, 1]$ with probability 1. Then for any $\epsilon > 0$
$$
\mathbb{P}\left( \frac{1}{n}\sum_{i=1}^n (z_i - \mathbb{E}[z_i]) \ge \epsilon \right) \le 2 e^{-2n\epsilon^2}
$$

Hoeffding's Inequality: Let $z_1, z_2, \ldots, z_n$ be independent bounded random variables such that $z_i \in [a_i, b_i]$ with probability 1. Let $S_n = \sum_{i=1}^n z_i$. Then for any $t > 0$, we have
$$
\mathbb{P}(|S_n - \mathbb{E}[S_n]| \ge t) \le 2\, e^{- \frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}}
$$
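As a quick numerical sanity check of Hoeffding's inequality (an illustrative simulation with assumed Bernoulli(0.3) variables, not part of the notes), the snippet below compares the empirical deviation probability of $S_n$ to the bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# z_i ~ Bernoulli(0.3), so z_i in [0, 1] and b_i - a_i = 1 for every i.
n, p, t = 200, 0.3, 10.0
trials = 20000
Sn = rng.binomial(1, p, size=(trials, n)).sum(axis=1)

empirical = np.mean(np.abs(Sn - n * p) >= t)
hoeffding = 2 * np.exp(-2 * t ** 2 / n)

print("empirical P(|S_n - E[S_n]| >= t):", empirical)
print("Hoeffding bound:                  ", hoeffding)
```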

Appendix C: Convergence of Random Variables

The fact that averages of realizations of random variables converge to the corresponding expected (mean) value
is central to the analysis and design of machine learning algorithms. This note discusses several forms of conver-
gence. Let X be a real-valued random variable, and let X1 , X2 , . . . be an infinite sequence of independent and
identically distributed copies of X.

29.1. Law of Large Numbers


Let $\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n X_i$, $n \ge 1$, be the empirical averages of this sequence. The law of large numbers refers to the fact that $\hat{\mu}_n \to \mathbb{E}[X]$ as $n \to \infty$. Specifically, there is a weak and a strong version of the law of large numbers.

Weak Law of Large Numbers. If $\mathbb{E}[|X|] < \infty$, then $\hat{\mu}_n$ converges in probability to $\mathbb{E}[X]$, i.e., for every $\epsilon > 0$
$$
\lim_{n \to \infty} \mathbb{P}(|\hat{\mu}_n - \mathbb{E}[X]| \ge \epsilon) = 0 .
$$

Strong Law of Large Numbers. If $\mathbb{E}[|X|] < \infty$, then $\hat{\mu}_n$ converges almost surely to $\mathbb{E}[X]$, i.e.,
$$
\mathbb{P}\big( \lim_{n \to \infty} \hat{\mu}_n = \mathbb{E}[X] \big) = 1 .
$$

Proving the laws of large numbers involves a few standard results from analysis (e.g., the Borel-Cantelli lemma). Here we will simply provide a bit of intuition based on Markov's inequality: for any random variable $X$, $\mathbb{P}(X^2 \ge a) \le \frac{\mathbb{E}[X^2]}{a}$ for any $a > 0$. So, suppose that $\mathbb{E}[X^2] < \infty$, a slightly stronger moment condition than required for the laws of large numbers. Then we have
$$
\mathbb{P}(|\hat{\mu}_n - \mathbb{E}[X]| \ge \epsilon) = \mathbb{P}(|\hat{\mu}_n - \mathbb{E}[X]|^2 \ge \epsilon^2) \le \frac{\mathbb{E}[|\hat{\mu}_n - \mathbb{E}[X]|^2]}{\epsilon^2} = \frac{\frac{1}{n}\mathbb{E}[|X_1 - \mathbb{E}[X]|^2]}{\epsilon^2} \le \frac{\mathbb{E}[X^2]}{n\epsilon^2} ,
$$
where we use the fact that the variance of a sum of independent random variables is equal to the sum of the variances of each term. Thus, $\mathbb{P}(|\hat{\mu}_n - \mathbb{E}[X]| \ge \epsilon) \to 0$ as $n \to \infty$ for every $\epsilon > 0$. With a bit more work, this can be shown under the weaker assumption $\mathbb{E}[|X|] < \infty$.

29.2. Central Limit Theorem

The LLN tells us that empirical averages converge to expected values, but little else about the (random) behavior
of averages. The most elementary characterization of this is the Central Limit Theorem (CLT), which states that
the distribution of averages of random variables tends to a Gaussian distribution.


Central Limit Theorem. If $\mathbb{E}[X] = \mu$ and $\mathbb{E}[|X - \mu|^2] = \sigma^2 < \infty$, then $\sqrt{n}(\hat{\mu}_n - \mu)$ converges in distribution to $\mathcal{N}(0, \sigma^2)$.

Notice that $\sqrt{n}\,\hat{\mu}_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$. Therefore, the variance of $\sqrt{n}\,\hat{\mu}_n$ is
$$
\mathbb{V}(\sqrt{n}\,\hat{\mu}_n) = \mathbb{V}\Big( \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i \Big) = \frac{1}{n}\sum_{i=1}^n \mathbb{V}(X_i) = \sigma^2 ,
$$

since the $X_i$ are independently and identically distributed with common variance $\sigma^2$. In other words, normalizing the sum by $\sqrt{n}$ rather than $n$ stabilizes the variance and therefore it converges to a finite-variance random variable, rather than a deterministic constant. Stating the result a slightly different way, for large values of $n$, the distribution of the empirical average $\hat{\mu}_n$ is approximately $\mathcal{N}(\mu, \sigma^2/n)$. This provides a characterization of the random fluctuations of the empirical average (roughly Gaussian with variance $\sigma^2/n$).

The CLT suggests that it should be possible to sharpen Markov's inequality to obtain a better bound on the deviations of $\hat{\mu}_n$. Suppose $X \sim \mathcal{N}(0, \sigma^2)$. Markov's inequality shows that $\mathbb{P}(|X| > t) = \mathbb{P}(|X|^2 > t^2) \le \frac{\sigma^2}{t^2}$, for any $t > 0$. However, the tail of the Gaussian distribution decays exponentially, and therefore
$$
\mathbb{P}(X > t) = \int_t^\infty \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}}\, dx \;\le\; \frac{1}{2}\, e^{-\frac{t^2}{2\sigma^2}} .
$$
To see this consider
$$
R := \frac{ \int_t^\infty \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}}\, dx }{ e^{-\frac{t^2}{2\sigma^2}} }
 = \int_t^\infty \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2 - t^2}{2\sigma^2}}\, dx
 = \int_t^\infty \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-t)(x+t)}{2\sigma^2}}\, dx .
$$
Let $y = x - t$. Then
$$
R = \int_0^\infty \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{y(y+2t)}{2\sigma^2}}\, dy \;\le\; \int_0^\infty \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{y^2}{2\sigma^2}}\, dy = \frac{1}{2} .
$$

Now let us apply this reasoning to $\hat{\mu}_n$. Assume that $\hat{\mu}_n - \mathbb{E}[X] \sim \mathcal{N}(0, \sigma^2/n)$, which the CLT shows is approximately correct for large $n$. Markov's inequality gives the bound $\mathbb{P}(|\hat{\mu}_n - \mathbb{E}[X]| > t) \le \frac{\sigma^2}{nt^2}$. However, the Gaussian tail-bound shows that
$$
\mathbb{P}(|\hat{\mu}_n - \mu| > t) \le \exp\Big( -\frac{nt^2}{2\sigma^2} \Big) ,
$$
which is exponentially smaller (as a function of $n$) than the bound given by Markov's inequality.
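The gap between the two bounds is easy to see in a simulation. The sketch below (an illustrative numpy experiment with $X$ uniform on $[-1, 1]$, an assumption made only for this example) compares the empirical deviation probability of $\hat{\mu}_n$ to the Markov/Chebyshev bound $\sigma^2/(nt^2)$ and the Gaussian tail bound $\exp(-nt^2/2\sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# X uniform on [-1, 1]: mean 0 and variance sigma^2 = 1/3.
n, sigma2, t = 100, 1.0 / 3.0, 0.15
trials = 50000
mu_hat = rng.uniform(-1, 1, size=(trials, n)).mean(axis=1)

empirical = np.mean(np.abs(mu_hat) > t)
chebyshev = sigma2 / (n * t ** 2)                  # Markov applied to |mu_hat - mu|^2
gauss_tail = np.exp(-n * t ** 2 / (2 * sigma2))    # CLT-based Gaussian tail bound

print("empirical deviation probability:", empirical)
print("Chebyshev/Markov bound:         ", chebyshev)
print("Gaussian tail bound:            ", gauss_tail)
```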

29.3. Law of the Iterated Logarithm

The LLN shows that averages converge to the expected value and the CLT characterizes the distribution of the
average in the large-sample limit. The sequence of empirical means {b µn }n≥1 is a random process that fluctuates
about E[X]. Another natural question is to quantify how large these fluctuations may be, and this is what the Law
of the Iterated Logarithm (LIL) tells us.

The LLN and CLT consider partial sums of $X_1, X_2, \ldots$ normalized by $n$ or $\sqrt{n}$, respectively. Partial sums normalized by a higher power of $n$ converge to 0 (not interesting) and partial sums normalized by a lower power than $n^{1/2}$ will not converge to a finite-variance random variable (not interesting). So it is reasonable to consider normalizations between $n^{1/2}$ and $n$. In particular, normalizing by $\sqrt{n \log\log n}$ characterizes the maximal deviations of the sequence of empirical means from the expected value.

Law of the Iterated Logarithm.
$$
\mathbb{P}\left( \limsup_{n \to \infty} \frac{\sum_{i=1}^n (X_i - \mathbb{E}[X])}{\sqrt{2\sigma^2 n \log\log n}} = 1 \right) = 1 .
$$

Roughly speaking, this tells us that the sequence of random variables $\{|\hat{\mu}_n - \mu|\}_{n \ge 1}$ tends to be bounded by the function $\sqrt{\frac{2\sigma^2 \log\log n}{n}}$, which is $\sqrt{2\log\log n}$ times the standard deviation of $\hat{\mu}_n - \mu$ (note $\mathbb{E}[|\hat{\mu}_n - \mu|^2] = \frac{\sigma^2}{n}$).
function 2σ logn log n , which is 2 log log n times the standard deviation of µ
bn − µ (note E[|bµn − µ|2 ] = σn ).

144
Bibliography

[1] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE
Transactions on Information theory, 39(3):930–945, 1993.

[2] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural
results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

[3] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
SIAM journal on imaging sciences, 2(1):183–202, 2009.

[4] CM Bishop. Pattern recognition and machine learning (information science and statistics), 2006.

[5] David Blackwell. Conditional expectation and unbiased sequential estimation. The Annals of Mathematical
Statistics, pages 105–110, 1947.

[6] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic
theory of independence. Oxford university press, 2013.

[7] Emmanuel J Candès, Justin Romberg, and Terence Tao. Robust uncertainty principles: Exact signal re-
construction from highly incomplete frequency information. IEEE Transactions on information theory,
52(2):489–509, 2006.

[8] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE transactions on information
theory, 13(1):21–27, 1967.

[9] Thomas M Cover. Geometrical and statistical properties of systems of linear inequalities with applications
in pattern recognition. IEEE transactions on electronic computers, pages 326–334, 1965.

[10] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.

[11] Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American
mathematical society, 39(1):1–49, 2002.

[12] Luc Devroye, László Györfi, and Gábor Lugosi. A probabilistic theory of pattern recognition, volume 31.
Springer Science & Business Media, 2013.

[13] David L Donoho. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.

[14] David L Donoho and Jain M Johnstone. Ideal spatial adaptation by wavelet shrinkage. biometrika,
81(3):425–455, 1994.

[15] Mário AT Figueiredo and Robert D Nowak. An em algorithm for wavelet-based image restoration. IEEE
Transactions on Image Processing, 12(8):906–916, 2003.

[16] Ronald A Fisher. The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2):179–
188, 1936.

[17] Rudolph E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82:35–45, 1960.

[18] Nicolai Meinshausen and Bin Yu. Lasso-type recovery of sparse representations for high-dimensional data.
The annals of statistics, 37(1):246–270, 2009.

[19] Ron Meir and Tong Zhang. Generalization error bounds for bayesian mixture algorithms. Journal of Machine
Learning Research, 4(Oct):839–860, 2003.

[20] Greg Ongie, Rebecca Willett, Daniel Soudry, and Nathan Srebro. A function space view of bounded norm
infinite width relu nets: The multivariate case. In International Conference on Learning Representations,
2019.

[21] Rahul Parhi and Robert D Nowak. Banach space representer theorems for neural networks and ridge splines.
J. Mach. Learn. Res., 22(43):1–40, 2021.

[22] G Pisier. Remarques sur un résultat non publié de b. maurey. Séminaire d’Analyse fonctionnelle (dit"
Maurey-Schwartz"), pages 1–12, 1980.

[23] Eugen Slutsky. http://en.wikipedia.org/wiki/Slutsky’s_theorem.

[24] Aad W van der Vaart and Jon A Wellner. Weak convergence and empirical processes with applications to
statistics. Journal of the Royal Statistical Society-Series A Statistics in Society, 160(3):596–608, 1997.

[25] Norbert Wiener. Extrapolation, interpolation, and smoothing of stationary time series: with engineering applications, volume 113. MIT Press, Cambridge, MA, 1949.

[26] Stephen J Wright, Robert D Nowak, and Mário AT Figueiredo. Sparse reconstruction by separable approxi-
mation. IEEE Transactions on signal processing, 57(7):2479–2493, 2009.

[27] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings
of the 20th international conference on machine learning (icml-03), pages 928–936, 2003.
