CSE598 Statistical Learning Theory
Lecture01
Yingzhen Yang
1 / 20
Lecture Overview
Course information, syllabus and logistics
2 / 20
Instructor and TA
Instructor: Yingzhen Yang, assistant professor at SCAI
Research area: Statistical Machine Learning and Deep Learning. My research mostly
focuses on machine learning/deep learning theory, with tools from
statistics and other fields, as well as optimization/generalization for
machine learning.
Email: [email protected]. Office hours: Monday &
Wednesday, 2 pm – 3 pm, both online (Zoom link
https://asu.zoom.us/j/9531979310) and in person (BYENG 590).
TA: Yancheng Wang
PhD student in deep learning, especially Neural Architecture
Search (NAS)
Email: [email protected]. Office hours: Monday & Wednesday,
11 am – 12 pm, both online (Zoom link
https://asu.zoom.us/j/7737185236) and in person (BYENG 221).
3 / 20
Difference from Other Machine Learning Courses
The major difference between CSE 575 and this course is that CSE
575 focuses more on concrete machine learning algorithms, while
this course focuses on the theoretical foundations of those learning
algorithms.
4 / 20
Difference from Other Machine Learning Courses
Example: Support Vector Machines (SVMs) for classification
CSE 575 introduces the concept of maximum margin classification
(by SVMs) and its optimization algorithm
This course will explain the fundamental mathematical reason why
maximum margin is good.
5 / 20
Difference from Other Machine Learning Courses
Another example: Neural Networks
CSE 575 introduces the concept of neural networks and the basic
(stochastic) gradient descent algorithm for optimization of neural
networks.
This course will explain the fundamental mathematical reason why
gradient descent can be used for the optimization of neural
networks and when the trained neural networks can generalize well
(generalizing well means low test error or high test accuracy).
As you can see, prior knowledge of basic machine learning
algorithms is very helpful. However, all the relevant machine
learning algorithms will be reviewed in this course so that you can
understand why the learning theory studied in this course is crucial
to the design of these algorithms.
The major mathematical tools used for theoretical analysis of
machine learning algorithms will be reviewed in lectures. You are
expected to feel comfortable with basic mathematics and statistics.
6 / 20
Course Assessment Plan
40%: Four assignments of equal weight.
20%: Paper presentation. Every student will present at least one
paper related to theoretical aspects of machine learning or deep
learning. Presenting application-oriented papers with theoretical
background or theoretical motivation is welcome, and presenting
multiple papers is strongly recommended.
7 / 20
Course Assessment Plan
40%: Project
The project can be in any area related to the topics of the course.
For example, you may develop a new method and demonstrate its
empirical performance, or perform experiments with an existing
method.
You may also write a comprehensive survey of a topic of your own
interest (relevant to this course) as the project.
You will submit a final project report which will be due on
12/08/2025.
You are allowed to work on projects in groups of two, and all group
members will receive the same grade for the project.
8 / 20
Prerequisites and Course Objective
Prerequisites: linear algebra, probability, statistics, and algorithm design
and analysis. While the major mathematical tools will be reviewed
in lectures, a solid mathematical background is required to
understand the materials.
Course Objective: An in-depth understanding of statistical learning
theory behind the design of machine learning algorithms.
9 / 20
Topics to Cover (tentative)
Basics of statistical prediction theory
Concentration inequalities
Supervised and unsupervised learning
Empirical risk minimization
Complexity-regularized estimation; Generalization bounds for
learning algorithms; VC dimension and Rademacher complexity;
minimax lower bounds
Optimization for machine learning
Recent theoretical developments in deep learning, including
optimization and generalization of deep neural networks, and Neural
Tangent Kernel
10 / 20
More Information
Reference Book: Pattern Recognition and Machine Learning by
Christopher M. Bishop.
The relevant machine learning algorithms will be reviewed so that
you can understand why the learning theory studied in this course is
crucial to the design of such algorithms.
We will post links to additional resources which will be helpful for
you to understand the course materials.
11 / 20
What is Machine Learning?
Machine Learning: Study of algorithms that
improve their performance P
at some task T
with experience E
Well-defined learning task: (P,T,E)
A machine learning algorithm learns from some experience E
at some task T, aiming to improve some performance P.
E is also called the training data.
12 / 20
What is Machine Learning?
Example: Spam Filter, an algorithm to tell if an input email is spam
or not spam (the task T)
Examples of training emails (the experience E) are shown on the slide.
13 / 20
Probabilistic Formulation of Prediction Problems
Prediction problem: predict an outcome y from some set Y of
possible outcomes based on an input x from a feature space X. X
and Y are also called the input space and the output space.
In the example of spam filter, the outcome y is a label in
Y = {spam, no spam} based on an input email x.
The training data for such prediction problems is a data set of n
pairs, (x1 , y1 ), . . . , (xn , yn ), and we will use the training data to
produce a function f : X → Y so that f (x) is a good prediction of
y for some unseen pair (x, y).
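As a toy, hypothetical illustration of turning the n training pairs into a prediction function f (the 1-nearest-neighbor rule and all names below are my own choices, not part of the lecture):

```python
import numpy as np

def fit_1nn(x_train, y_train):
    """Build a prediction function f: X -> Y from training pairs (x_i, y_i)."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train)

    def f(x):
        # Predict the label of the closest training input (1-nearest neighbor).
        return y_train[np.argmin(np.abs(x_train - x))]

    return f

# Toy 1-D training data with labels in {1, -1} (e.g. spam / no spam).
f = fit_1nn([0.1, 0.4, 0.9], [1, 1, -1])
print(f(0.2), f(0.8))  # 1 -1
```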
14 / 20
Probabilistic Formulation of Prediction Problems
To measure the quality of the prediction, we define a loss function
ℓ : Y × Y → R, so that ℓ(ŷ, y) measures the cost of the prediction
ŷ = f(x) when the true outcome is y.
Our goal is to ensure that ℓ(f (x), y) is small.
In classification problems, the goal is to classify an input x as one of
a finite number of classes (that is, the label space Y is finite). We
can choose a 0 − 1 loss function defined as
ℓ(ŷ, y) = 1I{ŷ ≠ y} =
    1 if ŷ ≠ y,
    0 otherwise.
1I{E} is an indicator function with 1I{E} = 1 if the event E is true,
and 0 if E is not true.
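As a minimal NumPy sketch (the function name is mine), the 0 − 1 loss is just the indicator of a mismatch:

```python
import numpy as np

def zero_one_loss(y_hat, y):
    """0-1 loss: the indicator 1I{y_hat != y}, i.e. 1 on a mistake and 0 otherwise."""
    return (np.asarray(y_hat) != np.asarray(y)).astype(float)

y_true = np.array([1, -1, 1, 1])
y_pred = np.array([1, 1, 1, -1])
print(zero_one_loss(y_pred, y_true))  # [0. 1. 0. 1.]
```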
15 / 20
Probabilistic Formulation of Prediction Problems
In a regression problem, with Y = R, we can define the quadratic
loss function ℓ(ŷ, y) = (ŷ − y)².
Assume that there is a probability distribution P over X × Y, and
that the training data (X1, Y1), . . . , (Xn, Yn) are chosen
independently according to P. The aim is to produce a function f
using the training data so that the risk of f, defined as
R(f) = E_{(X,Y)∼P} [ℓ(f(X), Y)],
is small. E[·] denotes expectation.
Important notes:
We use capital letters to denote random variables.
Because the training data is random, the produced prediction
function f , which depends on the training data, is also random.
We hope to have a small R(f ). In statistical learning theory, we
aim to make R(f ) small with high probability.
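Since the true P is unknown in practice, R(f) is typically approximated by averaging the loss over fresh samples from P (this is what the test error does). Below is a minimal Monte Carlo sketch; the toy distribution and the classifier f are assumptions of mine, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution P over X x Y (an assumption for illustration):
# X ~ Uniform[0, 1], and Y = 1 with probability eta(x), else Y = -1.
def eta(x):
    return 1.0 / (1.0 + np.exp(-10.0 * (x - 0.5)))  # P(Y = 1 | X = x)

def sample(n):
    x = rng.uniform(size=n)
    y = np.where(rng.uniform(size=n) < eta(x), 1, -1)
    return x, y

# A candidate prediction function f (a simple threshold rule, also an assumption).
def f(x):
    return np.where(x >= 0.5, 1, -1)

# Monte Carlo approximation of the risk R(f) = E[1I{f(X) != Y}].
x, y = sample(200_000)
print("estimated risk:", np.mean(f(x) != y))
```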
16 / 20
Probabilistic Formulation of Prediction Problems
When the 0 − 1 loss function is used for a classification problem,
then R(f) = E_{(X,Y)∼P} [ℓ(f(X), Y)] = P(Y ≠ f(X)). Here we used
the basic fact that E[1I{E}] = P(E).
Assume that we are dealing with a binary classification problem with
the output space Y = {1, −1}, and P is the joint distribution over
X × Y. Define η as the conditional probability of Y = 1 given
X = x: η(x) = P(Y = 1 | X = x).
Then we can compute the risk of f .
17 / 20
Probabilistic Formulation of Prediction Problems
We can compute the risk of f .
R(f) = E[ℓ(f(X), Y)]
     = E[E[ℓ(f(X), Y) | X]]
     = E[ℓ(f(X), 1) P(Y = 1 | X) + ℓ(f(X), −1) P(Y = −1 | X)]
     = E[1I{f(X) ≠ 1} η(X) + 1I{f(X) ≠ −1} (1 − η(X))]
     = E[1I{f(X) ≠ 1} η(X) + (1 − 1I{f(X) ≠ 1})(1 − η(X))]
     = E[1I{f(X) ≠ 1} (2η(X) − 1) + 1 − η(X)].
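The identity above can be sanity-checked numerically; in the sketch below (the choices of η and f are mine, purely for illustration), a Monte Carlo estimate of P(Y ≠ f(X)) is compared with the last expression of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Toy setup (assumptions for illustration): X ~ Uniform[0, 1],
# eta(x) = P(Y = 1 | X = x), and a fixed classifier f.
eta = lambda x: 0.2 + 0.6 * x
f = lambda x: np.where(x >= 0.4, 1, -1)

x = rng.uniform(size=n)
y = np.where(rng.uniform(size=n) < eta(x), 1, -1)

# Direct Monte Carlo estimate of the risk P(Y != f(X)).
risk_direct = np.mean(f(x) != y)

# Last line of the derivation: E[1I{f(X) != 1} (2*eta(X) - 1) + 1 - eta(X)].
risk_formula = np.mean((f(x) != 1) * (2.0 * eta(x) - 1.0) + 1.0 - eta(x))

print(risk_direct, risk_formula)  # the two agree up to sampling noise
```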
19 / 20
Probabilistic Formulation of Prediction Problems
Remember that we aim to make R(f ) small.
Define a function f∗ that predicts the label “naturally”:
f∗(x) =
    1 if η(x) ≥ 1/2,
    −1 otherwise.
The optimal risk R∗ = inf_f R(f) is called the Bayes risk
(inf_f R(f) takes the infimum of R(f) over all possible prediction
functions f). We will see that f∗ achieves the Bayes risk; f∗ is
called the Bayes decision function.
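A small numerical illustration of the claim (the toy η and the competing rule below are assumptions of mine): on samples from P, the risk of f∗ is never larger than that of another decision rule, up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Toy distribution (an assumption for illustration): X ~ Uniform[0, 1],
# eta(x) = P(Y = 1 | X = x).
eta = lambda x: 0.2 + 0.6 * x

x = rng.uniform(size=n)
y = np.where(rng.uniform(size=n) < eta(x), 1, -1)

# The Bayes decision function: predict 1 exactly when eta(x) >= 1/2.
f_star = lambda x: np.where(eta(x) >= 0.5, 1, -1)

# Some other decision rule, e.g. thresholding x at a different point.
f_other = lambda x: np.where(x >= 0.8, 1, -1)

print("risk of f*    :", np.mean(f_star(x) != y))
print("risk of other :", np.mean(f_other(x) != y))  # >= risk of f* (up to noise)
```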
20 / 20