
Ben-Gurion University - School of Electrical and Computer Engineering - 361-1-3040

Lecture 1: Statistical Learning


Fall 2024/5
Lecturers: Nir Shlezinger and Asaf Cohen

Before we go into learning, we first need to understand the notion of inference, which is the task
we wish to learn to carry out. We thus begin with a generic description of the inference problem, after
which we detail how it can be tackled, highlighting the differences between stochastic-model-based
methods (referred to henceforth as model-based methods) and techniques which rely on data, i.e.,
data-driven approaches, which constitute the foundations of deep learning and machine learning in
general. Most of the content detailed here is loosely based on the first two chapters of [2] (with
some change of notation).

1 Inference
The term inference refers to the ability to draw conclusions based on evidence and reasoning. While this
generic definition can refer to a broad range of tasks, we focus in our description on systems which
estimate or make predictions based on a set of observed variables. In this wide family of problems,
the system is required to map an input variable x taking values in an input space X into a prediction
of a label variable s which takes values in the label space S.

Input Space This is an arbitrary set, X , which represents the observations upon which we base
our decision. For example, in an image classification problem, X is the set of RGB images of a
given number of pixels. It is also sometimes referred to as the feature space.

Label Space The set S denotes the possible decisions we are allowed to make. For instance, in
an image classification problem where one attempts to classify which breed of dog appears in an
image, the set S represents all allowed breeds.

Data Generating Distribution The inputs are related to the labels via some statistical distribution
measure P defined over X × S. The stochastic nature implies that the labels are not fully
determined by the input we are observing. For instance, consider the problem of predicting the
price of a house based on a set of features comprising its location, size, and age; obviously, while
the price is indeed affected by these quantities, it is not deterministically determined by them.
Formally, P is a joint distribution over the domain of inputs and labels. One can view such a
distribution as being composed of two parts: a distribution over the unlabeled inputs P_x (sometimes
called the marginal distribution), and the conditional distribution over the labels given the inputs
P_{s|x}, also referred to as the inverse model.

Inference Rule An inference mapping can thus be expressed as

f : X → S.   (1)

We write the decision variable for a given input x ∈ X as ŝ = f (x) ∈ S. The space of all possible
inference mappings of the form (1) is denoted by F.

Success Criteria The fidelity of an inference mapping is measured using a loss function

l : F × X × S → R+.   (2)

We are generally interested in carrying out inference which minimizes the risk function, also known
as the generalization error, given by:

LP (f ) ≜ E(x,s)∼P {l(f, x, s)}. (3)

Thus, the goal is to design the inference rule f (·) to minimize the generalization loss LP (f ) for a
given problem.

2 Common Loss Measures


The inference task is essentially encapsulated in the selection of the loss function l(·). In the
following we detail some commonly used loss measures, focusing on supervised learning setups,
as explained in the sequel.

2.1 Classification
In the classification setting, we are required to assign to each input one of a finite number of
labels. For instance, we are interested in deciding whether an image represents a cat or a dog.
In classification, the label space S is finite, i.e., there exists a finite positive integer K such that
|S| = K and we write S = {sk }K k=1 . Common loss measures for classification problems are:

Error Rate The error-rate loss, also known as the zero-one loss, assigns the same score (zero) to
every correct decision, and the same score (one) to every incorrect decision. In particular, this
loss measure is given by:

l_Err(f, x, s) = { 0, f(x) = s;  1, f(x) ≠ s }.   (4)

The inference rule which minimizes the error-rate risk is the maximum a-posteriori probability
(MAP) rule, given by:

f_MAP(x) = arg max_{s∈S} P(s|x).   (5)

Proof. We first note that, by the law of total expectation, we have

L_P(f) = E_{(x,s)∼P}{l(f, x, s)}
       = E_{x∼P_x}{ E_{P_{s|x}}{l(f, x, s) | x} }.   (6)

Now, under the loss measure in (4), it holds for a given input x that

E_{P_{s|x}}{l(f, x, s) | x} = Σ_{s̃∈S} l(f, x, s̃) P(s̃|x)
                            = Σ_{s̃≠f(x)} P(s̃|x)
                            = 1 − P(f(x)|x).   (7)

Therefore, in order to minimize the risk (6), for a given x we should set the inference rule to
minimize (7), thus setting it to be

f(x) = arg min_{s∈S} (1 − P(s|x)) = arg max_{s∈S} P(s|x).   (8)

This proves (5).
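To make the rule concrete, the following Python sketch evaluates the MAP rule (5) and its risk (3)
under the 0-1 loss (4) for a small discrete problem; the joint distribution P_joint below is an
illustrative assumption, not part of the notes.

# A minimal sketch of the MAP rule (5) on a toy discrete problem.
import numpy as np

# Assumed joint distribution P(x, s) over X = {0, 1, 2} and S = {0, 1};
# rows are indexed by x, columns by s.
P_joint = np.array([[0.30, 0.10],
                    [0.05, 0.25],
                    [0.10, 0.20]])

P_x = P_joint.sum(axis=1)             # marginal P(x)
P_s_given_x = P_joint / P_x[:, None]  # conditional P(s|x)

# MAP rule: for each input x, pick the label maximizing P(s|x).
f_map = P_s_given_x.argmax(axis=1)
print("MAP decisions per input:", f_map)  # [0, 1, 1]

# Risk (3) under the 0-1 loss: sum over x of P(x) * (1 - max_s P(s|x)).
risk = (P_x * (1.0 - P_s_given_x.max(axis=1))).sum()
print("0-1 risk of the MAP rule:", risk)  # 0.25 for this P_joint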

Cross-Entropy An alternative widely-used loss function for classification problems is the cross-
entropy loss. Here, the inference rule does not produce a label (hard decision), but rather a prob-
ability mass function over the label space (soft decision). Namely, f(x) here is a K × 1 vector
[f_1(x), . . . , f_K(x)] with non-negative entries such that Σ_k f_k(x) = 1. For this setting, the cross-
entropy loss is given by:

l_CE(f, x, s) = − Σ_{k=1}^{K} 1_{s=s_k} log f_k(x),   (9)

where 1_(·) is the indicator function.

While the main motivation for using the cross-entropy loss stems from its ability to provide a
measure of confidence in the decision, as well as from computational reasons, it turns out that the
optimal inference rule in the sense of minimal cross-entropy risk is the true conditional distribution,
i.e.,

f_CE(x) = [P(s_1|x), . . . , P(s_K|x)].   (10)

Before we prove (10), we note that this implies that the MAP rule can be obtained by taking the
arg max over the entries of f_CE(·).

Proof. First, we write p(s_k|x) = f_k(x), and note that for any distribution measure p(s|x) over S
it holds that

L_P(f) = E_P{l_CE(f, x, s)}
       = E_P{− log p(s|x)}
       = E_P{− log (p(s|x)/P(s|x))} + E_P{− log P(s|x)}
       = D_KL(P(s|x)||p(s|x)) + H(s|x),   (11)

where H(s|x) is the conditional entropy of s given x, while D_KL(·||·) is the Kullback-
Leibler divergence [1]. As D_KL(P(s|x)||p(s|x)) is non-negative and equals zero only when
p(s|x) ≡ P(s|x), it holds that (11) is minimized when the inference output is the true condi-
tional distribution, thus proving (10).
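The decomposition (11) is easy to verify numerically. The following sketch checks it for a single
fixed input x; both the true conditional P(s|x) and the candidate soft decision are illustrative
assumptions.

# A short numerical check of the decomposition (11) at one fixed x:
# E[-log p(s|x)] = D_KL(P(.|x)||p(.|x)) + H(s|x), minimized at p = P.
import numpy as np

P_true = np.array([0.7, 0.2, 0.1])   # assumed true conditional P(s|x)
p_model = np.array([0.5, 0.3, 0.2])  # assumed candidate soft decision f(x)

def cross_entropy(P, p):
    # E_{s~P}[-log p(s|x)], the cross-entropy risk at this x
    return -(P * np.log(p)).sum()

def kl(P, p):
    # Kullback-Leibler divergence D_KL(P||p)
    return (P * np.log(P / p)).sum()

H = -(P_true * np.log(P_true)).sum()  # conditional entropy H(s|x)
assert np.isclose(cross_entropy(P_true, p_model), kl(P_true, p_model) + H)

# The true conditional achieves the minimum, as (10) states:
print(cross_entropy(P_true, p_model) > cross_entropy(P_true, P_true))  # True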

2.2 Regression
Another common task is regression, also referred to as estimation. Here, one attempts to predict a
set of continuous variables instead of a categorical one, i.e., S is some continuous set, e.g., R or
some specified range [a, b]. For instance, the task of predicting the price of a house based on a set
of features can be treated as a regression task. The most common loss function used for regression
problems is the squared error loss:

Squared Error The squared-error loss is given by:

l_MSE(f, x, s) = (s − f(x))².   (12)

The inference rule which minimizes the squared-error risk is the minimum mean-squared error
(MMSE) estimate, given by the conditional expected value, namely:

f_MMSE(x) = E_{P_{s|x}}{s|x}.   (13)

Proof. Follows from the orthogonality principle for MSE estimation: writing s − f(x) = (s − E{s|x}) + (E{s|x} − f(x)), the cross term has zero expectation by the law of total expectation, so L_P(f) = E{(s − E{s|x})²} + E{(E{s|x} − f(x))²}, which is minimized by setting f(x) = E{s|x}.
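As a quick numerical illustration, the following sketch assumes a toy model s = x + n with x and
n independent standard Gaussians, so that E{s|x} = x, and compares the squared-error risk of the
MMSE estimate with that of a mismatched estimate; the model is illustrative, not from the notes.

# A minimal Monte Carlo sketch of (13) under an assumed toy model.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
s = x + rng.normal(size=100_000)  # label = input plus independent noise

def mse(f_x):
    # empirical squared-error risk of the candidate estimates f_x
    return np.mean((s - f_x) ** 2)

print("MMSE estimate f(x)=x:    ", mse(x))      # ~1.0, the noise variance
print("Mismatched estimate 2x:  ", mse(2 * x))  # ~2.0, strictly worse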

3 Model-Based versus Data-Driven


So far, we have only discussed the goal, which is to identify the inference rule which achieves the
minimal risk measure. The main question is how to find this mapping, and the answer divides into two
main strategies: the statistical model-based strategy, referred to henceforth as model-based; and
the machine learning approach, which relies on data, and is thus referred to as data-driven. The
main difference between these strategies is what information is utilized to tune f(·).

3.1 Model-Based Methods
Model-based algorithms, also referred to as hand-designed methods, set their inference rule, i.e.
tune f in (1) to minimize the risk function L(·), based on full domain knowledge. The term domain
knowledge typically refers to prior knowledge of the underlying statistics relating the input x and
the label s, where full domain knowledge implies that the joint distribution P is known. For
instance, the inference rules in (5) and (13) are all model-based, as their computation requires full
knowledge of P.
Model-based methods are the foundations of statistical signal processing. So why do we need
machine learning?
1. In practice, accurate knowledge of the statistical model relating the observations and the de-
sired information is typically unavailable, and thus applying such techniques commonly re-
quires imposing some assumptions on the underlying statistics, which in some cases reflect
the actual behavior, but may also constitute a crude approximation of the true dynamics. For
instance, coming up with an analytical expression for the conditional distribution of a dog-breed
label given an image of a dog is infeasible.
2. In the presence of inaccurate model knowledge, either as a result of estimation errors or
due to enforcing a model which does not fully capture the environment, the performance of
model-based techniques tends to degrade.
This limits the applicability of model-based schemes in scenarios where, e.g., P is unknown, costly
to estimate accurately, or too complex to express analytically.

3.2 Data-Driven Schemes


While in many applications coming up with an accurate and tractable statistical model is difficult,
we are often given access to data describing the setup. For instance, while we cannot accurately
formulate the conditional distribution of the breed of a dog given an image of a dog, we are likely
to be able to aggregate massive amounts of images of dogs, from which the inference rule can be
learned.
Data-driven systems learn their mapping from data. In a supervised setting, the data is comprised
of a training set consisting of n_t pairs of inputs and their corresponding labels, denoted D =
{(x_t, s_t)}_{t=1}^{n_t}. However, since we do not have access
to the true distribution P, we cannot directly optimize the risk function (3), and must resort to the
empirical risk, given by

L_D(f) ≜ (1/n_t) Σ_{t=1}^{n_t} l(f, x_t, s_t).   (14)
Now, before going into how one can minimize the empirical risk, we first note that if our data
samples are generated in an i.i.d. manner from the true distribution P, then by the law of large
numbers, the empirical risk in (14) is expected to converge to the true risk (3). This implies that
the empirical risk minimizer will approach the true risk minimizer – sounds great, right? Well, not
always, as we will see in the next lecture.
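The convergence itself is easy to illustrate numerically. The following sketch reuses the assumed
toy model s = x + n (independent standard Gaussians, fixed rule f(x) = x, true risk 1.0) and
evaluates the empirical risk (14) for growing n_t; the model is illustrative only.

# A sketch of the empirical risk (14) converging to the true risk (3)
# under i.i.d. sampling, for the squared-error loss.
import numpy as np

rng = np.random.default_rng(1)
for n_t in [10, 100, 10_000, 1_000_000]:
    x = rng.normal(size=n_t)
    s = x + rng.normal(size=n_t)
    empirical_risk = np.mean((s - x) ** 2)  # L_D(f) for f(x) = x
    print(f"n_t = {n_t:>9}: empirical risk = {empirical_risk:.4f}")
# As n_t grows, the printed values approach the true risk 1.0
# (law of large numbers).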

3.3 Types of Learning Tasks
As we discussed above, the data-driven nature of machine learning is encapsulated in the loss
function’s dependence on training data. Thus, the loss function not only implicitly defines the
task of the resulting system, but also dictates what kind of data is required. Based on the require-
ments placed on the training data, problems in statistical learning largely fall under three main
categories – supervised, semi-supervised, and unsupervised learning – along with a fourth distinct
paradigm, reinforcement learning:

Supervised Learning In supervised learning, the training set consists of a set of input-label
pairs {(x_t, s_t)}_{t=1}^{n_t} ⊂ X × S. Each pair (x_t, s_t) satisfies s_t = f*(x_t) for some unknown ground
truth mapping f*. The goal is to have f_θ approximate f* as accurately as possible by utilizing
the given training data. This setting encompasses a wide range of problems including regression,
classification, and structured prediction, through a judicious choice of the loss function.

Unsupervised Learning In unsupervised learning, we are only given a set of examples {x_t}_{t=1}^{n_t} ⊂
X without labels. In this case, the loss measure l is defined over F × X, instead of over F × X × S as
defined in (2). Since there is no label to predict, unsupervised learning algorithms are often used
to discover interesting patterns present in the given data. Common tasks in this setting include
clustering, anomaly detection, generative modeling, and compression.

Semi-supervised Learning Semi-supervised learning lies in the middle ground between the
above two categories, where one typically has access to a large amount of unlabeled data and a
small set of labeled data. The goal is to leverage the unlabeled data to improve performance on
some supervised task trained using the labeled data. As labeling data is often a very costly
process, semi-supervised learning provides a way to quickly learn desired inference rules without
having to label all of the available data points.

Reinforcement Learning In reinforcement learning, we have an agent in an unknown environ-
ment, which can obtain rewards by interacting with the environment. The agent must take
actions, i.e., interact with the environment, so as to maximize its cumulative reward. In practice, the
agent could be, e.g., a bot playing a game to achieve high scores, or a robot trying to complete physi-
cal tasks with physical items. The goal of reinforcement learning is to
learn a good strategy for the agent from experimental trials and the relatively simple feedback received.
With the optimal strategy, the agent is capable of actively adapting to the environment to maximize
future rewards. The astounding success of deep learning in defeating human experts in challenging
games such as Go and StarCraft is based on reinforcement learning. Unfortunately, we will not deal
with reinforcement learning (neither classic nor deep) in this course, as it can constitute a complete
course on its own.

3.4 Empirical Risk Minimization
In the following we focus on supervised learning, where the data is comprised
of a training set consisting of n_t pairs of inputs and their corresponding labels, denoted D =
{(x_t, s_t)}_{t=1}^{n_t}. Since no mathematical model relating the
input and the desired decision is imposed, the objective used for setting the decision rule f(·) ∈ F
is the empirical risk, given by

L_D(f) ≜ (1/n_t) Σ_{t=1}^{n_t} l(f, x_t, s_t).   (15)

Recall that the true risk is defined as

LP (f ) ≜ E(x,s)∼P {l(f, x, s)}. (16)

Overfitting The main problem with minimizing the empirical risk in (15) is that, for any fixed n_t,
the mapping f(·) which minimizes (15) is the one which memorizes the training data. For instance,
for a classification task with the 0-1 loss, one can set

f(x) = { s_t, ∃t ∈ {1, . . . , n_t} : x = x_t;  0, otherwise }.   (17)

While the mapping in (17) achieves zero empirical risk, it is useless for any input sample which
does not appear in the training set. We say that such a mapping, in which there is a large gap
between the empirical risk (15) and the generalization error (16), does not generalize. In particular,
data-driven mappings which memorize their training data are said to exhibit overfitting.
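The following sketch illustrates this behavior under an assumed toy binary classification model
(the Gaussian inputs and the simple ground-truth rule are illustrative): a memorizing rule of the
form (17) attains zero empirical 0-1 risk yet performs at chance level on fresh samples.

# A sketch of the memorizing rule (17): zero empirical risk on the
# training set, chance-level performance on fresh samples.
import numpy as np

rng = np.random.default_rng(2)
n_t = 50
x_train = rng.normal(size=(n_t, 2))
s_train = (x_train.sum(axis=1) > 0).astype(int)  # assumed ground truth

def f_memorize(x):
    # Return the stored label on an exact training match, 0 otherwise.
    for x_t, s_t in zip(x_train, s_train):
        if np.array_equal(x, x_t):
            return s_t
    return 0

train_errs = [f_memorize(x) != s for x, s in zip(x_train, s_train)]
print("empirical 0-1 risk:", np.mean(train_errs))  # 0.0

x_test = rng.normal(size=(1000, 2))  # fresh inputs never match exactly
s_test = (x_test.sum(axis=1) > 0).astype(int)
test_errs = [f_memorize(x) != s for x, s in zip(x_test, s_test)]
print("test 0-1 risk:", np.mean(test_errs))  # ~0.5: does not generalize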

Inductive Bias So the question that comes to mind is: how can one learn a mapping from data without
overfitting? The answer is quite intuitive – the reason that we were able to come up with an overfitted
inference rule as in (17) is that the domain of feasible mappings F was not
constrained. Therefore, in order to allow learning inference mappings from data in a generalizing
manner, one has to constrain the set F, namely, to induce a bias on the selection of the mapping.
In the learnability analysis it is assumed that F is finite. The common approach in machine
learning does not limit the number of different feasible mappings, but rather assumes some para-
metric model on the mapping f(·). In such cases, the inference rule is dictated by a set of parame-
ters denoted θ, and thus the system mapping is written as f_θ ∈ F_θ.
Example 3.1 (Linear model). Arguably the simplest model one can impose is a linear model,
where the mapping is simply a linear combination of the input entries. In particular, when X = R^N
and S = R, a linear model boils down to f_θ(x) = θ^T x.
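As an illustration, the following sketch carries out empirical risk minimization over the linear
class of Example 3.1 with the squared-error loss, for which the minimizer admits the familiar
least-squares form; the synthetic data model (a noisy linear relation with an assumed theta_true)
is illustrative only.

# A sketch of ERM over the linear class F_theta with the squared-error loss.
import numpy as np

rng = np.random.default_rng(3)
N, n_t = 3, 200
theta_true = np.array([1.0, -2.0, 0.5])   # assumed ground-truth parameters
X = rng.normal(size=(n_t, N))
s = X @ theta_true + 0.1 * rng.normal(size=n_t)  # noisy linear labels

# ERM: argmin_theta (1/n_t) * sum_t (s_t - theta^T x_t)^2, i.e. least squares
theta_hat, *_ = np.linalg.lstsq(X, s, rcond=None)
print("estimated theta:", theta_hat)  # close to theta_true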
While a linear model of the form given above is quite simple, it may not be able to capture the
true characteristics of the underlying statistics, as illustrated in Fig. 1. As a result, we are usually
interested in a highly-expressive generic parametric model, which for some given configuration
of θ can approach the true risk minimizer. This model should also be one which we can actually
optimize based on the empirical risk (15). The remainder of the course deals with different settings
of F_θ, and the corresponding ways to find a suitable f_θ ∈ F_θ.

Figure 1: Illustration of overfitting versus underfitting for different levels of inductive bias.

References
[1] T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012.

[2] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algo-
rithms. Cambridge University Press, 2014.
