CSE598 Statistical Learning Theory

Lecture01

Yingzhen Yang

1 / 20
Lecture Overview

Course information, syllabus and logistics

2 / 20
Instructor and TA

Instructor: Yingzhen Yang, assistant professor at SCAI

Research area: Statistical Machine Learning and Deep Learning. My research
focuses mostly on machine learning/deep learning theory, with tools from
statistics and other fields, as well as optimization/generalization for
machine learning.
Email: [email protected], office hours: Monday & Wednesday 2 pm to 3 pm.
Office hours are both online (Zoom link https://asu.zoom.us/j/9531979310)
and in person (BYENG 590).

TA: Yancheng Wang

PhD student in deep learning, especially Neural Architecture Search (NAS).
Email: [email protected], office hours: Monday & Wednesday 11 am to 12 pm.
Office hours are both online (Zoom link https://asu.zoom.us/j/7737185236)
and in person (BYENG 221).

3 / 20
Difference from Other Machine Learning Courses

The major difference between CSE 575 and this course is that CSE 575
focuses more on concrete machine learning algorithms, while this course
focuses on the theoretical foundations of those learning algorithms.

4 / 20
Difference from Other Machine Learning Courses

Example: Support Vector Machines (SVMs) for classification

CSE 575 introduces the concept of maximum margin classification (by SVMs)
and its optimization algorithm.
This course will explain the fundamental mathematical reason why maximum
margin is good.
5 / 20
Difference from Other Machine Learning Courses

Another example: Neural Networks


CSE 575 introduces the concept of neural networks and the basic
(stochastic) gradient descent algorithm for optimization of neural
networks.
This course will explain the fundamental mathematical reason why
gradient descent can be used for the optimization of neural
networks and when the trained neural networks can generalize well
(generalizing well means low test error or high test accuracy).

As you can see, prior knowledge of basic machine learning algorithms is
very helpful. However, all the relevant machine learning algorithms will
be reviewed in this course so that you can understand why the learning
theory studied in this course is crucial to the design of these algorithms.
The major mathematical tools used for theoretical analysis of
machine learning algorithms will be reviewed in lectures. You are
expected to feel comfortable with basic mathematics and statistics.
6 / 20
Course Assessment Plan

40%: Four assignments of equal weight.

20%: Paper presentation. Every student will present at least one paper
related to theoretical aspects of machine learning or deep learning.
Presenting application-oriented papers with a theoretical background or
theoretical motivation is welcome, and presenting multiple papers is
strongly recommended.

7 / 20
Course Assessment Plan

40%: Project

The project can be in any area related to the topics of the course.
For example, you may develop a new method and demonstrate its
empirical performance, or perform experiments with an existing
method.
You may also write a comprehensive survey of a topic of your own interest
(relevant to this course) as the project.
You will submit a final project report, which is due on 12/08/2025.
You are allowed to work on projects in groups of two, and all group
members will receive the same grade for the project.

8 / 20
Prerequisites and Course Objective

Prerequisites: linear algebra, probability, statistics, and algorithm
design and analysis. While the major mathematical tools will be reviewed
in lectures, a solid mathematical background is required to understand
the materials.
Course Objective: an in-depth understanding of the statistical learning
theory behind the design of machine learning algorithms.

9 / 20
Topics to Cover (tentative)

Basics of statistical prediction theory
Concentration inequalities
Supervised and unsupervised learning
Empirical risk minimization
Complexity-regularized estimation; generalization bounds for learning
algorithms; VC dimension and Rademacher complexity; minimax lower bounds
Optimization for machine learning
Recent theoretical developments in deep learning, including optimization
and generalization of deep neural networks, and the Neural Tangent Kernel

10 / 20
More Information

Reference book: Pattern Recognition and Machine Learning by
Christopher M. Bishop.
The relevant machine learning algorithms will be reviewed so that you can
understand why the learning theory studied in this course is crucial to
the design of such algorithms.
We will post links to additional resources that will help you understand
the course materials.

11 / 20
What is Machine Learning?

Machine Learning: the study of algorithms that

improve their performance P
at some task T
with experience E

Well-defined learning task: (P, T, E)

A machine learning algorithm learns from some experience E at some task T,
aiming to improve some performance P.

E is also called the training data.

12 / 20
What is Machine Learning?

Example: Spam filter, an algorithm to tell whether an input email is spam
or not spam (the task T).
Below are examples of training emails (the experience E).

13 / 20
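A toy Python sketch of the (P, T, E) view of this task (the emails, labels,
and the keyword rule below are made-up illustrations, not the examples from
the slide):

# Toy illustration of the (P, T, E) view of the spam-filter task.
# The emails, labels, and keyword rule are made up for illustration.

# Experience E: labeled training emails (input x, label y)
training_emails = [
    ("win a free prize now", "spam"),
    ("meeting moved to 3 pm", "not spam"),
    ("limited offer click here", "spam"),
    ("lecture notes attached", "not spam"),
]

def predict(email):
    """Task T: decide whether an email is spam or not spam (simple keyword rule)."""
    spam_words = {"free", "prize", "offer", "click"}
    return "spam" if any(word in spam_words for word in email.split()) else "not spam"

# Performance P: accuracy of the predictions on the labeled emails
accuracy = sum(predict(x) == y for x, y in training_emails) / len(training_emails)
print(accuracy)  # 1.0 on this toy data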
Probabilistic Formulation of Prediction Problems

Prediction problem: predict an outcome y from some set Y of possible
outcomes based on an input x from a feature space X. X and Y are also
called the input space and the output space.
In the spam filter example, the outcome y is a label in
Y = {spam, not spam} based on an input email x.
The training data for such prediction problems is a data set of n pairs,
(x1, y1), ..., (xn, yn), and we will use the training data to produce a
function f : X → Y so that f(x) is a good prediction of y for an unseen
pair (x, y).

14 / 20
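One minimal way to "use the training data to produce a function f" is
sketched below with the 1-nearest-neighbor rule; the rule and the
one-dimensional toy data are illustrative choices, not part of the lecture:

# Illustrative sketch: produce f from the training data via the
# 1-nearest-neighbor rule on a one-dimensional toy data set.

def make_1nn_predictor(training_pairs):
    """Return f: X -> Y that outputs the label of the closest training point."""
    def f(x):
        _, nearest_y = min(training_pairs, key=lambda pair: abs(pair[0] - x))
        return nearest_y
    return f

# Usage on made-up pairs (x_i, y_i):
data = [(0.1, -1), (0.4, -1), (0.6, 1), (0.9, 1)]
f = make_1nn_predictor(data)
print(f(0.55))  # predicts 1, since the closest training input is 0.6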
Probabilistic Formulation of Prediction Problems

To measure the quality of the prediction, we define a loss function
ℓ : Y × Y → R, so that ℓ(ŷ, y) measures the cost of the prediction
ŷ = f(x) when the true outcome is y.
Our goal is to ensure that ℓ(f(x), y) is small.
In classification problems, the goal is to classify an input x as one of
a finite number of classes (that is, the label space Y is finite). We can
choose a 0-1 loss function defined as

ℓ(ŷ, y) = 1{ŷ ≠ y} = 1 if ŷ ≠ y, and 0 otherwise.

1{E} is an indicator function with 1{E} = 1 if the event E is true,
and 0 if E is not true.

15 / 20
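A minimal sketch of the 0-1 loss and the resulting average loss of a
predictor on a finite sample (illustrative code; the constant classifier
in the usage line is made up):

# Illustrative sketch of the 0-1 loss and the average loss on a sample.

def zero_one_loss(y_hat, y):
    """l(y_hat, y) = 1 if the prediction differs from the true label, else 0."""
    return 1 if y_hat != y else 0

def average_loss(f, pairs):
    """Average 0-1 loss of the predictor f over the pairs (x_i, y_i)."""
    return sum(zero_one_loss(f(x), y) for x, y in pairs) / len(pairs)

# Usage with a made-up constant classifier:
pairs = [(0.2, -1), (0.7, 1), (0.9, 1)]
print(average_loss(lambda x: 1, pairs))  # 1/3: one of the three labels is -1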
Probabilistic Formulation of Prediction Problems

In a regression problem, with Y = R, we can define the quadratic loss
function ℓ(ŷ, y) = (ŷ − y)².
Assume that there is a probability distribution P over X × Y, and that
the training data (X1, Y1), ..., (Xn, Yn) are chosen independently
according to P. The aim is to produce a function f using the training
data so that the risk of f, defined as

R(f) = E_{(X,Y)∼P}[ℓ(f(X), Y)],

is small. E[·] denotes expectation.
Important notes:
We use capital letters to denote random variables.
Because the training data is random, the produced prediction function f,
which depends on the training data, is also random.
We hope to have a small R(f). In statistical learning theory, we aim to
make R(f) small with high probability.
16 / 20
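Since R(f) is an expectation over P, it can be approximated by Monte Carlo
whenever we can sample from P. The sketch below assumes a made-up
distribution (X uniform on [0, 1], η(x) = x) and a fixed, not necessarily
optimal, classifier:

# Illustrative sketch: Monte Carlo estimate of R(f) = E[l(f(X), Y)]
# under the 0-1 loss, for a made-up distribution P over X x Y.
import random

random.seed(0)

def sample_pair():
    """Draw (X, Y) ~ P: X uniform on [0, 1], and Y = 1 with probability eta(X) = X."""
    x = random.random()
    y = 1 if random.random() < x else -1
    return x, y

def f(x):
    """A fixed (not necessarily optimal) classifier."""
    return 1 if x > 0.3 else -1

n = 100_000
risk_estimate = sum(f(x) != y for x, y in (sample_pair() for _ in range(n))) / n
print(risk_estimate)  # close to the true risk P(Y != f(X)), which is 0.29 here

In practice P is unknown, so the risk is usually estimated on a held-out
test set rather than on fresh samples from P.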
Probabilistic Formulation of Prediction Problems

When the 0-1 loss function is used for a classification problem, then
R(f) = E_{(X,Y)∼P}[ℓ(f(X), Y)] = P(Y ≠ f(X)).
Assume that we are dealing with a binary classification problem with
the output space Y = {1, −1}, and P is the joint distribution over
X × Y. Define η as the conditional probability of Y = 1 given X = x:
η(x) = P(Y = 1 | X = x).

We used the basic fact that E[1{E}] = P(E).

Then we can compute the risk of f.
17 / 20
Probabilistic Formulation of Prediction Problems

We can compute the risk of f.

R(f) = E[ℓ(f(X), Y)]
     = E[ E[ℓ(f(X), Y) | X] ]
     = E[ ℓ(f(X), 1) P(Y = 1 | X) + ℓ(f(X), −1) P(Y = −1 | X) ]
     = E[ 1{f(X) ≠ 1} η(X) + 1{f(X) ≠ −1} (1 − η(X)) ]
     = E[ 1{f(X) ≠ 1} η(X) + (1 − 1{f(X) ≠ 1})(1 − η(X)) ]
     = E[ 1{f(X) ≠ 1} (2η(X) − 1) + 1 − η(X) ].

19 / 20
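As a sanity check of the final line, both sides can be computed on a small
made-up discrete distribution (the support, probabilities, η values, and
classifier below are illustrative assumptions):

# Illustrative check of the identity on a made-up discrete distribution over X.
xs  = [0, 1, 2]          # support of X
p_x = [0.2, 0.5, 0.3]    # P(X = x)
eta = [0.1, 0.6, 0.9]    # eta(x) = P(Y = 1 | X = x)

def f(x):
    """An arbitrary classifier for the check."""
    return 1 if x >= 1 else -1

# Direct computation: R(f) = P(Y != f(X))
risk_direct = sum(p * ((1 - e) if f(x) == 1 else e) for x, p, e in zip(xs, p_x, eta))

# Via the final expression: E[ 1{f(X) != 1} (2*eta(X) - 1) + 1 - eta(X) ]
risk_rewritten = sum(p * ((1 if f(x) != 1 else 0) * (2 * e - 1) + 1 - e)
                     for x, p, e in zip(xs, p_x, eta))

print(risk_direct, risk_rewritten)  # both equal 0.25 (up to floating-point rounding)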
Probabilistic Formulation of Prediction Problems

Remember that we aim to make R(f) small.

Define a function f* that predicts the label "naturally":

f*(x) = 1 if η(x) ≥ 1/2, and −1 otherwise.

The optimal risk R* = inf_f R(f) is called the Bayes risk
(inf_f R(f) takes the infimum of R(f) over all possible prediction
functions f). We will see that f* achieves the Bayes risk. f* is
called the Bayes decision function.

20 / 20
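A brief preview of why f* is optimal (a one-step sketch of the argument the
course will formalize): from the last derivation,
R(f) = E[1{f(X) ≠ 1} (2η(X) − 1)] + E[1 − η(X)], and the second term does
not depend on f. The first term is minimized pointwise by taking
1{f(x) ≠ 1} = 1 exactly when 2η(x) − 1 < 0, that is, by predicting
f(x) = 1 whenever η(x) ≥ 1/2, which is exactly the rule f* follows.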
