
Ben-Gurion University - School of Electrical and Computer Engineering - 361-1-3040

Lecture 1: Statistical Learning


Fall 2024/5
Lecturers: Nir Shlezinger and Asaf Cohen

Before we go into learning, we first need to understand the notion of inference, which is the task
we wish to learn to carry out. We thus begin with a generic description of the inference problem, after
which we detail how it can be tackled, highlighting the differences between stochastic-model-based
methods (referred to henceforth as model-based methods) and techniques which rely on data, i.e.,
data-driven approaches, which constitute the foundations of deep learning and machine learning in
general. Most of the content detailed here is loosely based on the first two chapters of [2] (with
some change of notation).

1 Inference
The term inference refers to the ability to draw conclusions based on evidence and reasoning. While this
generic definition can refer to a broad range of tasks, we focus in our description on systems which
estimate or make predictions based on a set of observed variables. In this wide family of problems,
the system is required to map an input variable x taking values in an input space X into a prediction
of a label variable s which takes values in the label space S.

Input Space This is an arbitrary set, X , which represents the observations upon which we base
our decision. For example, in an image classification problem, X is the set of RGB images of a
given number of pixels. It is also sometimes referred to as the feature space.

Label Space The set S denotes the possible decisions we are allowed to make. For instance, in
an image classification problem where one attempts to classify which breed of dog appears in an
image, the set S represents all allowed breeds.

Data Generating Distribution The inputs are related to the labels via some statistical distribution
measure P defined over X × S. The stochastic nature implies that the labels are not fully
determined by the input we are observing. For instance, consider the problem of predicting the
price of a house based on a set of features comprising its location, size, and age; obviously, while
the price is indeed affected by these quantities, it is not deterministically determined by them.
Formally, P is a joint distribution over the domain of inputs and labels. One can view such a
distribution as being composed of two parts: a distribution over the unlabeled inputs P_x (sometimes
called the marginal distribution), and the conditional distribution over the labels given the inputs
P_{s|x}, also referred to as the inverse model.

Inference Rule An inference mapping can thus be expressed as

f : X → S.   (1)

We write the decision variable for a given input x ∈ X as ŝ = f (x) ∈ S. The space of all possible
inference mappings of the form (1) is denoted by F.

Success Criteria The fidelity of an inference mapping is measured using a loss function

l : F × X × S → R+.   (2)

We are generally interested in carrying out inference which minimizes the risk function, also known
as the generalization error, given by:

LP (f ) ≜ E(x,s)∼P {l(f, x, s)}. (3)

Thus, the goal is to design the inference rule f (·) to minimize the generalization loss LP (f ) for a
given problem.

2 Common Loss Measures


The inference task is essentially encapsulated in the selection of the loss function l(·). In the
following we detail some commonly used loss measures, focusing on supervised learning setups,
as explained in the sequel.

2.1 Classification
In the classification setting, we are required to assign to each input one of a finite number of
labels. For instance, we are interested in deciding whether an image represents a cat or a dog.
In classification, the label space S is finite, i.e., there exists a finite positive integer K such that
|S| = K and we write S = {sk }K k=1 . Common loss measures for classification problems are:

Error Rate The error-rate loss, also known as the zero-one loss, assigns the same score (zero) to
every correct decision, and the same score (one) to every incorrect decision. In particular, this
loss measure is given by:

l_Err(f, x, s) = { 0, f(x) = s;  1, f(x) ≠ s }.   (4)

The inference rule which minimizes the error-rate risk is the maximum a-posteriori probability
(MAP) rule, given by:

f_MAP(x) = arg max_{s∈S} P(s|x).   (5)

Proof. We first note that, by the law of total expectation, we have

L_P(f) = E_{(x,s)∼P}{l(f, x, s)}
       = E_{x∼P_x}{ E_{P_{s|x}}{l(f, x, s) | x} }.   (6)

Now, under the loss measure in (4), it holds for a given input x that

E_{P_{s|x}}{l(f, x, s) | x} = Σ_{s̃∈S} l(f, x, s̃) P(s̃|x)
                            = Σ_{s̃≠f(x)} P(s̃|x)
                            = 1 − P(f(x)|x).   (7)

Therefore, in order to minimize the risk (6), for a given x we should set the inference rule to
minimize (7), thus setting it to be

f(x) = arg min_{s∈S} (1 − P(s|x)) = arg max_{s∈S} P(s|x).   (8)

This proves (5).
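To make the rule concrete, the following Python sketch evaluates the MAP rule (5) and its risk (3)
under the 0-1 loss (4) for a small discrete problem; the joint distribution P_joint below is an
illustrative assumption, not part of the notes.

# A minimal sketch of the MAP rule (5) on a toy discrete problem.
import numpy as np

# Assumed joint distribution P(x, s) over X = {0, 1, 2} and S = {0, 1};
# rows are indexed by x, columns by s.
P_joint = np.array([[0.30, 0.10],
                    [0.05, 0.25],
                    [0.10, 0.20]])

P_x = P_joint.sum(axis=1)             # marginal P(x)
P_s_given_x = P_joint / P_x[:, None]  # conditional P(s|x)

# MAP rule: for each input x, pick the label maximizing P(s|x).
f_map = P_s_given_x.argmax(axis=1)
print("MAP decisions per input:", f_map)  # [0, 1, 1]

# Risk (3) under the 0-1 loss: sum over x of P(x) * (1 - max_s P(s|x)).
risk = (P_x * (1.0 - P_s_given_x.max(axis=1))).sum()
print("0-1 risk of the MAP rule:", risk)  # 0.25 for this P_joint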

Cross-Entropy An alternative widely-used loss function for classification problems is the cross-
entropy loss. Here, the inference rule does not produce a label (hard decision), but rather a prob-
ability mass function over the label space (soft decision). Namely, f(x) here is a K × 1 vector
[f_1(x), . . . , f_K(x)] with non-negative entries such that Σ_k f_k(x) = 1. For this setting, the cross-
entropy loss is given by:

l_CE(f, x, s) = − Σ_{k=1}^{K} 1_{s=s_k} log f_k(x),   (9)

where 1_(·) is the indicator function.

While the main motivation for using the cross-entropy loss stems from its ability to provide a
measure of confidence in the decision, as well as from computational reasons, it turns out that the
optimal inference rule in the sense of minimal cross-entropy risk is the true conditional distribution,
i.e.,

f_CE(x) = [P(s_1|x), . . . , P(s_K|x)].   (10)

Before we prove (10), we note that this implies that the MAP rule can be obtained by taking the
arg max over the entries of f_CE(·).

Proof. First, we write p(s_k|x) = f_k(x), and note that for any distribution measure p(s|x) over S
it holds that

L_P(f) = E_P{l_CE(f, x, s)}
       = E_P{− log p(s|x)}
       = E_P{− log (p(s|x)/P(s|x))} + E_P{− log P(s|x)}
       = D_KL(P(s|x)||p(s|x)) + H(s|x),   (11)

where H(s|x) is the conditional entropy of s given x, while D_KL(·||·) is the Kullback-
Leibler divergence [1]. As D_KL(P(s|x)||p(s|x)) is non-negative and equals zero only when
p(s|x) ≡ P(s|x), it holds that (11) is minimized when the inference output is the true condi-
tional distribution, thus proving (10).
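The decomposition (11) is easy to verify numerically. The following sketch checks it for a single
fixed input x; both the true conditional P(s|x) and the candidate soft decision are illustrative
assumptions.

# A short numerical check of the decomposition (11) at one fixed x:
# E[-log p(s|x)] = D_KL(P(.|x)||p(.|x)) + H(s|x), minimized at p = P.
import numpy as np

P_true = np.array([0.7, 0.2, 0.1])   # assumed true conditional P(s|x)
p_model = np.array([0.5, 0.3, 0.2])  # assumed candidate soft decision f(x)

def cross_entropy(P, p):
    # E_{s~P}[-log p(s|x)], the cross-entropy risk at this x
    return -(P * np.log(p)).sum()

def kl(P, p):
    # Kullback-Leibler divergence D_KL(P||p)
    return (P * np.log(P / p)).sum()

H = -(P_true * np.log(P_true)).sum()  # conditional entropy H(s|x)
assert np.isclose(cross_entropy(P_true, p_model), kl(P_true, p_model) + H)

# The true conditional achieves the minimum, as (10) states:
print(cross_entropy(P_true, p_model) > cross_entropy(P_true, P_true))  # True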

2.2 Regression
Another common task is regression, also referred to as estimation. Here, one attempts to predict a
set of continuous variables instead of a categorical one, i.e., S is some continuous set, e.g., R or
some specified range [a, b]. For instance, the task of predicting the price of a house based on a set
of features can be treated as a regression task. The most common loss function used for regression
problems is the squared error loss:

Squared Error The squared-error loss is given by:

l_MSE(f, x, s) = (s − f(x))².   (12)

The inference rule which minimizes the squared-error risk is the minimum mean-squared error
(MMSE) estimate, given by the conditional expected value, namely:

f_MMSE(x) = E_{P_{s|x}}{s|x}.   (13)

Proof. Follows from the orthogonality principle for MSE estimation: writing s − f(x) = (s − E{s|x}) + (E{s|x} − f(x)), the cross term has zero expectation by the law of total expectation, so L_P(f) = E{(s − E{s|x})²} + E{(E{s|x} − f(x))²}, which is minimized by setting f(x) = E{s|x}.
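As a quick numerical illustration, the following sketch assumes a toy model s = x + n with x and
n independent standard Gaussians, so that E{s|x} = x, and compares the squared-error risk of the
MMSE estimate with that of a mismatched estimate; the model is illustrative, not from the notes.

# A minimal Monte Carlo sketch of (13) under an assumed toy model.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
s = x + rng.normal(size=100_000)  # label = input plus independent noise

def mse(f_x):
    # empirical squared-error risk of the candidate estimates f_x
    return np.mean((s - f_x) ** 2)

print("MMSE estimate f(x)=x:    ", mse(x))      # ~1.0, the noise variance
print("Mismatched estimate 2x:  ", mse(2 * x))  # ~2.0, strictly worse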

3 Model-Based versus Data-Driven


So far, we have only discussed the goal, which is to identify the inference rule which achieves the
minimal risk measure. The main question is how to find this mapping, and the answer divides into two
main strategies: the statistical model-based strategy, referred to henceforth as model-based; and
the machine learning approach, which relies on data, and is thus referred to as data-driven. The
main difference between these strategies is what information is utilized to tune f(·).

3.1 Model-Based Methods
Model-based algorithms, also referred to as hand-designed methods, set their inference rule, i.e.
tune f in (1) to minimize the risk function L(·), based on full domain knowledge. The term domain
knowledge typically refers to prior knowledge of the underlying statistics relating the input x and
the label s, where full domain knowledge implies that the joint distribution P is known. For
instance, the inference rules in (5) and (13) are all model-based, as their computation requires full
knowledge of P.
Model-based methods are the foundations of statistical signal processing. So why do we need
machine learning?
1. In practice, accurate knowledge of the statistical model relating the observations and the de-
sired information is typically unavailable, and thus applying such techniques commonly re-
quires imposing some assumptions on the underlying statistics, which in some cases reflect
the actual behavior, but may also constitute a crude approximation of the true dynamics. For
instance, coming up with an analytical expression for the conditional distribution of a dog-breed
label given an image of a dog is infeasible.
2. In the presence of inaccurate model knowledge, either as a result of estimation errors or
due to enforcing a model which does not fully capture the environment, the performance of
model-based techniques tends to degrade.
This limits the applicability of model-based schemes in scenarios where, e.g., P is unknown, costly
to estimate accurately, or too complex to express analytically.

3.2 Data-Driven Schemes


While in many applications coming up with an accurate and tractable statistical model is difficult,
we are often given access to data describing the setup. For instance, while we cannot accurately
formulate the conditional distribution of the breed of a dog given an image of a dog, we are likely
to be able to aggregate massive amounts of images of dogs, from which the inference rule can be
learned.
Data-driven systems learn their mapping from data. In a supervised setting, the data is comprised
of a training set consisting of n_t pairs of inputs and their corresponding labels, denoted D =
{(x_t, s_t)}_{t=1}^{n_t}. However, since we do not have access
to the true distribution P, we cannot directly optimize the risk function (3), and must resort to the
empirical risk, given by

L_D(f) ≜ (1/n_t) Σ_{t=1}^{n_t} l(f, x_t, s_t).   (14)
Now, before going into how one can minimize the empirical risk, we first note that if our data
samples are generated in an i.i.d. manner from the true distribution P, then by the law of large
numbers, the empirical risk in (14) is expected to converge to the true risk (3). This implies that
the empirical risk minimizer will approach the true risk minimizer – sounds great, right? Well, not
always, as we will see in the next lecture.
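The convergence itself is easy to illustrate numerically. The following sketch reuses the assumed
toy model s = x + n (independent standard Gaussians, fixed rule f(x) = x, true risk 1.0) and
evaluates the empirical risk (14) for growing n_t; the model is illustrative only.

# A sketch of the empirical risk (14) converging to the true risk (3)
# under i.i.d. sampling, for the squared-error loss.
import numpy as np

rng = np.random.default_rng(1)
for n_t in [10, 100, 10_000, 1_000_000]:
    x = rng.normal(size=n_t)
    s = x + rng.normal(size=n_t)
    empirical_risk = np.mean((s - x) ** 2)  # L_D(f) for f(x) = x
    print(f"n_t = {n_t:>9}: empirical risk = {empirical_risk:.4f}")
# As n_t grows, the printed values approach the true risk 1.0
# (law of large numbers).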

3.3 Types of Learning Tasks
As we discussed above, the data-driven nature of machine learning is encapsulated in the loss
function’s dependence on training data. Thus, the loss function not only implicitly defines the
task of the resulting system, but also dictates what kind of data is required. Based on the require-
ments placed on the training data, problems in statistical learning largely fall under three main
categories – supervised, semi-supervised, and unsupervised learning – along with a fourth distinct
paradigm, reinforcement learning:

Supervised Learning In supervised learning, the training set consists of a set of input-label
pairs {(x_t, s_t)}_{t=1}^{n_t} ⊂ X × S. Each pair (x_t, s_t) satisfies s_t = f*(x_t) for some unknown ground
truth mapping f*. The goal is to have f_θ approximate f* as accurately as possible by utilizing
the given training data. This setting encompasses a wide range of problems including regression,
classification, and structured prediction, through a judicious choice of the loss function.

Unsupervised Learning In unsupervised learning, we are only given a set of examples {x_t}_{t=1}^{n_t} ⊂
X without labels. In this case, the loss measure l is defined over F × X, instead of over F × X × S as
defined in (2). Since there is no label to predict, unsupervised learning algorithms are often used
to discover interesting patterns present in the given data. Common tasks in this setting include
clustering, anomaly detection, generative modeling, and compression.

Semi-supervised Learning Semi-supervised learning lies in the middle ground between the
above two categories, where one typically has access to a large amount of unlabeled data and a
small set of labeled data. The goal is to leverage the unlabeled data to improve performance on
some supervised task trained using the labeled data. As labeling data is often a very costly
process, semi-supervised learning provides a way to quickly learn desired inference rules without
having to label all of the available data points.

Reinforcement Learning In reinforcement learning, we have an agent in an unknown environ-
ment, which can obtain rewards by interacting with the environment. The agent must take
actions, i.e., interact with the environment, so as to maximize its cumulative reward. In practice, the
agent could be, e.g., a bot playing a game to achieve high scores, or a robot trying to complete physi-
cal tasks with physical items. The goal of reinforcement learning is to
learn a good strategy for the agent from experimental trials and the relatively simple feedback received.
With the optimal strategy, the agent is capable of actively adapting to the environment to maximize
future rewards. The astounding success of deep learning in defeating human experts in challenging
games such as Go and StarCraft is based on reinforcement learning. Unfortunately, we will not deal
with reinforcement learning (neither classic nor deep) in this course, as it can constitute a complete
course on its own.

3.4 Empirical Risk Minimization
In the following we focus on supervised learning, where the data is comprised
of a training set consisting of n_t pairs of inputs and their corresponding labels, denoted D =
{(x_t, s_t)}_{t=1}^{n_t}. Since no mathematical model relating the
input and the desired decision is imposed, the objective used for setting the decision rule f(·) ∈ F
is the empirical risk, given by

L_D(f) ≜ (1/n_t) Σ_{t=1}^{n_t} l(f, x_t, s_t).   (15)

Recall that the true risk is defined as

LP (f ) ≜ E(x,s)∼P {l(f, x, s)}. (16)

Overfitting The main problem with minimizing the empirical risk in (15) is that, for any fixed n_t,
the mapping f(·) which minimizes (15) is the one which memorizes the training data. For instance,
for a classification task with the 0-1 loss, one can set

f(x) = { s_t, ∃t ∈ {1, . . . , n_t} : x = x_t;  0, otherwise }.   (17)

While the mapping in (17) achieves zero empirical risk, it is useless for any input sample which
does not appear in the training set. We say that such a mapping, in which there is a large gap
between the empirical risk (15) and the generalization error (16), does not generalize. In particular,
data-driven mappings which memorize their training data are said to exhibit overfitting.
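The following sketch illustrates this behavior under an assumed toy binary classification model
(the Gaussian inputs and the simple ground-truth rule are illustrative): a memorizing rule of the
form (17) attains zero empirical 0-1 risk yet performs at chance level on fresh samples.

# A sketch of the memorizing rule (17): zero empirical risk on the
# training set, chance-level performance on fresh samples.
import numpy as np

rng = np.random.default_rng(2)
n_t = 50
x_train = rng.normal(size=(n_t, 2))
s_train = (x_train.sum(axis=1) > 0).astype(int)  # assumed ground truth

def f_memorize(x):
    # Return the stored label on an exact training match, 0 otherwise.
    for x_t, s_t in zip(x_train, s_train):
        if np.array_equal(x, x_t):
            return s_t
    return 0

train_errs = [f_memorize(x) != s for x, s in zip(x_train, s_train)]
print("empirical 0-1 risk:", np.mean(train_errs))  # 0.0

x_test = rng.normal(size=(1000, 2))  # fresh inputs never match exactly
s_test = (x_test.sum(axis=1) > 0).astype(int)
test_errs = [f_memorize(x) != s for x, s in zip(x_test, s_test)]
print("test 0-1 risk:", np.mean(test_errs))  # ~0.5: does not generalize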

Inductive Bias So the question that comes to mind is: how can one learn a mapping from data without
overfitting? The answer is quite intuitive – the reason that we were able to come up with an overfitted
inference rule as in (17) is that the domain of feasible mappings F was not
constrained. Therefore, in order to allow learning inference mappings from data in a generalizing
manner, one has to constrain the set F, namely, to induce a bias on the selection of the mapping.
In the learnability analysis it is assumed that F is finite. The common approach in machine
learning does not limit the number of different feasible mappings, but rather assumes some para-
metric model on the mapping f(·). In such cases, the inference rule is dictated by a set of parame-
ters denoted θ, and thus the system mapping is written as f_θ ∈ F_θ.
Example 3.1 (Linear model). Arguably the simplest model one can impose is a linear model,
where the mapping is simply a linear combination of the input entries. In particular, when X = R^N
and S = R, a linear model boils down to f_θ(x) = θ^T x.
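As an illustration, the following sketch carries out empirical risk minimization over the linear
class of Example 3.1 with the squared-error loss, for which the minimizer admits the familiar
least-squares form; the synthetic data model (a noisy linear relation with an assumed theta_true)
is illustrative only.

# A sketch of ERM over the linear class F_theta with the squared-error loss.
import numpy as np

rng = np.random.default_rng(3)
N, n_t = 3, 200
theta_true = np.array([1.0, -2.0, 0.5])   # assumed ground-truth parameters
X = rng.normal(size=(n_t, N))
s = X @ theta_true + 0.1 * rng.normal(size=n_t)  # noisy linear labels

# ERM: argmin_theta (1/n_t) * sum_t (s_t - theta^T x_t)^2, i.e. least squares
theta_hat, *_ = np.linalg.lstsq(X, s, rcond=None)
print("estimated theta:", theta_hat)  # close to theta_true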
While a linear model of the form given above is quite simple, it may not be able to capture the
true characteristics of the underlying statistics, as illustrated in Fig. 1. As a result, we are usually
interested in a highly-expressive generic parametric model, which for some given configuration
of θ can approach the true risk minimizer. This model should also be one which we can actually
optimize based on the empirical risk (15). The remainder of the course deals with different settings
of F_θ, and the corresponding ways to find a suitable f_θ ∈ F_θ.

Figure 1: Illustration of overfitting versus underfitting for different levels of inductive bias.

References
[1] T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012.

[2] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algo-
rithms. Cambridge University Press, 2014.
