
Ben-Gurion University - School of Electrical and Computer Engineering - 361-1-3040

Lecture 3: Nearest Neighbors and Decision Trees


Fall 2024/5
Lecturers: Nir Shlezinger and Asaf Cohen

So far, we have described the design of machine learning models from a dataset D based on a loss
measure l(·) as a procedure comprising the following steps:

1. Fix a family of parametric models Fθ .

2. Set the model by finding the parameters that minimize the empirical risk, i.e.,
\theta^\star = \arg\min_{\theta} \frac{1}{|D|} \sum_{(x,s)\in D} l(f_\theta, x, s).    (1)

3. When possible, solve (1); otherwise (which is often the case), apply some iterative optimizer
to estimate θ ⋆ .
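To make this three-step recipe concrete, here is a minimal sketch (not from the lecture) of empirical risk minimization for a linear model family f_θ(x) = θ^T x under the squared loss, using plain gradient descent as the iterative optimizer of step 3; the model family, loss, learning rate, and toy data are illustrative assumptions.

```python
import numpy as np

def empirical_risk(theta, X, s):
    # (1/|D|) * sum of squared losses l(f_theta, x, s) = (theta^T x - s)^2, as in (1)
    return np.mean((X @ theta - s) ** 2)

def erm_gradient_descent(X, s, lr=0.1, num_steps=500):
    # Step 3: iterative optimizer estimating theta_star of (1)
    theta = np.zeros(X.shape[1])
    for _ in range(num_steps):
        grad = 2.0 * X.T @ (X @ theta - s) / len(s)  # gradient of the empirical risk
        theta -= lr * grad
    return theta

# Toy dataset D = {(x_t, s_t)} drawn from a noisy linear model (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
s = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
theta_star = erm_gradient_descent(X, s)
print(theta_star, empirical_risk(theta_star, X, s))
```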

However, not all machine learning models are designed based on this systematic rationale. Some
are more heuristic (which in some cases actually makes them attractive and interpretable). Today,
we will discuss two important examples, nearest neighbors and decision trees, based mostly on [1, Ch. 10-11].

1 Nearest Neighbors
An extremely simple non-parametric machine learning decision rule is termed k-nearest neighbors.
When given an input x, it sets its output ŝ by merely observing the k nearest data points in the
training set D.
To formulate this mathematically, consider a labeled dataset D = {x_t, s_t}_{t=1}^{n_t} with x_t ∈ R^N.
For a given input x ∈ R^N, let π(x, t) denote the index of the t-th closest data point in D to x, i.e., the ordering satisfies

d(x_{\pi(x,t)}, x) \leq d(x_{\pi(x,t+1)}, x), \quad \forall t \in \{1, \ldots, n_t - 1\},    (2)

where d : R^N × R^N → R_+ is some distance measure, most commonly the Euclidean norm
d(x', x) = ‖x' − x‖. In k-nearest neighbors, the inference ŝ is determined based on the data points
{x_{π(x,t)}, s_{π(x,t)}}_{t=1}^{k}, with k being a hyperparameter (i.e., a predetermined, non-learned system
parameter). Such inference rules can be used for both classification and regression.
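As a small illustration (not from the notes), the ordering π(x, ·) can be computed by sorting the training inputs by their Euclidean distance from the query; the sketch below assumes the inputs are stacked in a NumPy array X with one row per x_t.

```python
import numpy as np

def neighbor_order(X, x):
    """Return the indices pi(x, 1), ..., pi(x, n_t): rows of X sorted by
    increasing Euclidean distance from the query x, as in (2)."""
    dists = np.linalg.norm(X - x, axis=1)  # d(x_t, x) for every training point
    return np.argsort(dists)
```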

1.1 Classification
In classification, the labels take values in a finite set, i.e., |S| < ∞, and can be written as
S = {s_1, . . . , s_{|S|}}. In this case, {x_{π(x,t)}, s_{π(x,t)}}_{t=1}^{k} are typically used to set ŝ via a majority rule, i.e.,
ŝ is the label that appears most often among {s_{π(x,t)}}_{t=1}^{k}, or mathematically

\hat{s} = f(x) = s_i : \; |\{t \in \{1, \ldots, k\} : s_{\pi(x,t)} = s_i\}| > |\{t \in \{1, \ldots, k\} : s_{\pi(x,t)} = s_j\}|, \quad \forall j \neq i.    (3)
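A minimal sketch of the majority rule in (3), reusing the neighbor_order helper sketched above; ties are broken arbitrarily here (by Counter ordering), a convention the notes do not specify.

```python
from collections import Counter

def knn_classify(X, S, x, k):
    """k-nearest-neighbor classification by majority vote over the k closest points."""
    idx = neighbor_order(X, x)[:k]           # indices pi(x, 1), ..., pi(x, k)
    votes = Counter(S[i] for i in idx)       # count labels among the k neighbors
    return votes.most_common(1)[0][0]        # label appearing most often, as in (3)
```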

1.2 Regression
In regression, the labels take values in a continuous set, e.g., S = R. In this case, there are several
ways to set ŝ based on {x_{π(x,t)}, s_{π(x,t)}}_{t=1}^{k}. A common approach is a weighted average of {s_{π(x,t)}},
with data points that are closer to x being assigned larger weights. This can be done by using the
inverse distances as weights, i.e.,

\hat{s} = f(x) = \sum_{t=1}^{k} \frac{1/d(x_{\pi(x,t)}, x)}{\sum_{i=1}^{k} 1/d(x_{\pi(x,i)}, x)} \, s_{\pi(x,t)}.    (4)
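The inverse-distance weighting in (4) can be sketched as follows, again assuming the neighbor_order helper above; the small eps guards against a zero distance when the query coincides with a training point, a corner case the notes do not discuss.

```python
import numpy as np

def knn_regress(X, S, x, k, eps=1e-12):
    """k-nearest-neighbor regression with inverse-distance weights, as in (4)."""
    S = np.asarray(S)                        # labels as an array, aligned with rows of X
    idx = neighbor_order(X, x)[:k]
    dists = np.linalg.norm(X[idx] - x, axis=1)
    weights = 1.0 / (dists + eps)            # closer neighbors receive larger weights
    return np.sum(weights * S[idx]) / np.sum(weights)
```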

1.3 Analysis
The simplicity of k-nearest neighbors facilitates its theoretical analysis. Here, we show how one
can bound its generalization error. To that aim, we consider the application of k = 1 nearest
neighbors to a binary classification task, where S = {0, 1}, and the loss used is the zero-one loss,
i.e., l(f, x, s) = 1_{f(x) ≠ s}.

MAP Rule As discussed in the first lecture, the optimal inference rule in that case is maximum
a-posteriori probability (MAP), i.e.,

f_{\mathrm{MAP}}(x) = \arg\max_{s \in S} P_{s|x}(s|x) = \mathbb{1}_{P_{s|x}(s=1|x) > 1/2} = \mathbb{1}_{\eta(x) > 1/2},    (5)

where in the last equality we defined η(x) ≜ P_{s|x}(s = 1|x). The risk function of the MAP rule is clearly

L_P(f_{\mathrm{MAP}}) = \mathbb{E}_x\{\mathbb{E}_{s|x}\{\mathbb{1}_{f_{\mathrm{MAP}}(x) \neq s} \mid x\}\}
= \mathbb{E}_x\{P(f_{\mathrm{MAP}}(x) \neq s \mid x)\}
= \mathbb{E}_x\{P(\mathbb{1}_{\eta(x) > 1/2} \neq s \mid x)\}.    (6)

Now, we note that the conditional risk of the MAP rule is

P(\mathbb{1}_{\eta(x) > 1/2} \neq s \mid x)
= \begin{cases}
\eta(x) = P(s = 1 \mid x), & \eta(x) \leq \tfrac{1}{2} \;\Leftrightarrow\; \min\{\eta(x), 1 - \eta(x)\} = \eta(x), \\
1 - \eta(x) = P(s = 0 \mid x), & \eta(x) > \tfrac{1}{2} \;\Leftrightarrow\; \min\{\eta(x), 1 - \eta(x)\} = 1 - \eta(x),
\end{cases}
\; = \min\{\eta(x), 1 - \eta(x)\}.    (7)

Generalization Bound For analytical tractability, we introduce the following assumptions:


AS1 The dataset D comprises n_t samples drawn i.i.d. from P.

AS2 The conditional probability η(x) = P_{s|x}(s = 1|x) is a c-Lipschitz continuous function of x, i.e.,

|\eta(x) - \eta(x')| \leq c \|x - x'\|, \quad \forall x, x' \in \mathbb{R}^N.

Since the dataset is random (by AS1), the k-nearest neighbors classifier, denoted f_D^{k-NN}(·),
is a random mapping. Therefore, we bound its expected generalization error, as stated in the
following theorem:
Theorem 1.1. Under AS1-AS2, the nearest neighbor classifier f_D^{k-NN}(·) with k = 1 satisfies

\mathbb{E}\{L_P(f_D^{k\text{-NN}})\} \leq 2 L_P(f_{\mathrm{MAP}}) + c \cdot \mathbb{E}\{\|x - x_{\pi(x,1)}\|\}.    (8)

Proof. Let us recall that the stochasticity in f_D^{k-NN}(·) stems from the randomness of D. Taking
this into account, along with the definition of the generalization error, leads to

\mathbb{E}\{L_P(f_D^{k\text{-NN}})\} = \mathbb{E}\{\mathbb{E}\{l(f_D^{k\text{-NN}}, x, s) \mid D\}\}
= \mathbb{E}_{D \sim P^{n_t}}\{\mathbb{E}_{(x,s) \sim P}\{\mathbb{1}_{f_D^{k\text{-NN}}(x) \neq s} \mid D\}\}
= \mathbb{E}_{D \sim P^{n_t},\, (x,s) \sim P}\{\mathbb{1}_{f_D^{k\text{-NN}}(x) \neq s}\}.    (9)

By letting P_x denote the marginal distribution of the input, we can write (9) as

\mathbb{E}\{L_P(f_D^{k\text{-NN}})\} = \mathbb{E}_{\{x_t\} \sim P_x^{n_t},\, x \sim P_x}\{\mathbb{E}_{\{s_t\} \sim P_{s|x}^{n_t},\, s \sim P_{s|x}}\{\mathbb{1}_{f_D^{k\text{-NN}}(x) \neq s} \mid \{x_t\}, x\}\}
= \mathbb{E}_{\{x_t\} \sim P_x^{n_t},\, x \sim P_x}\{\mathbb{E}_{s_{\pi(x,1)} \sim P_{s|x_{\pi(x,1)}},\, s \sim P_{s|x}}\{\mathbb{1}_{s_{\pi(x,1)} \neq s} \mid x_{\pi(x,1)}, x\}\}.    (10)

Now, the internal expectation in (10) is in fact

\mathbb{E}\{\mathbb{1}_{s_{\pi(x,1)} \neq s} \mid x_{\pi(x,1)}, x\} = P(s_{\pi(x,1)} \neq s \mid x_{\pi(x,1)}, x)
= P(s_{\pi(x,1)} = 1, s = 0 \mid x_{\pi(x,1)}, x) + P(s_{\pi(x,1)} = 0, s = 1 \mid x_{\pi(x,1)}, x)
\overset{(a)}{=} P(s_{\pi(x,1)} = 1 \mid x_{\pi(x,1)}) P(s = 0 \mid x) + P(s_{\pi(x,1)} = 0 \mid x_{\pi(x,1)}) P(s = 1 \mid x)
= \eta(x_{\pi(x,1)})(1 - \eta(x)) + \eta(x)(1 - \eta(x_{\pi(x,1)}))
= 2\eta(x)(1 - \eta(x)) + (\eta(x) - \eta(x_{\pi(x,1)}))(2\eta(x) - 1),    (11)

where (a) follows since, given the inputs, the training labels are drawn independently of the test
label, so s_{π(x,1)} depends only on x_{π(x,1)} and s only on x. Now, since |2η(x) − 1| ≤ 1 and by AS2, we have from (11) that

\mathbb{E}\{\mathbb{1}_{s_{\pi(x,1)} \neq s} \mid x_{\pi(x,1)}, x\} \leq 2\eta(x)(1 - \eta(x)) + c\|x - x_{\pi(x,1)}\|.    (12)

Substituting this into (10) leads to

\mathbb{E}\{L_P(f_D^{k\text{-NN}})\} \leq 2\,\mathbb{E}_{x \sim P_x}\{\eta(x)(1 - \eta(x))\} + c \cdot \mathbb{E}\{\|x - x_{\pi(x,1)}\|\}
\overset{(a)}{\leq} 2 L_P(f_{\mathrm{MAP}}) + c \cdot \mathbb{E}\{\|x - x_{\pi(x,1)}\|\},    (13)

where (a) follows since η(x)(1 − η(x)) ≤ min{η(x), 1 − η(x)}, and hence
\mathbb{E}\{\eta(x)(1 - \eta(x))\} \leq \mathbb{E}\{\min\{\eta(x), 1 - \eta(x)\}\} = L_P(f_{\mathrm{MAP}}).


The generalization bound in Theorem 1.1 implies that

\mathbb{E}\{L_P(f_D^{k\text{-NN}})\} \leq 2 L_P(f_{\mathrm{MAP}}) + c \cdot \mathbb{E}\{\|x - x_{\pi(x,1)}\|\}
= 2 L_P(f_{\mathrm{MAP}}) + c \cdot \mathbb{E}_x\{\mathbb{E}_{\{x_t\}}\{\min_t \|x_t - x\| \mid x\}\}.    (14)

When the entries of x are known to take values in some bounded set, one can further bound this
expression in a manner that does not depend on the marginal distribution of x. This derivation
builds on the fact that \mathbb{E}_{\{x_t\}}\{\min_t \|x_t - x\| \mid x\} is the expected minimum of a sequence of
n_t i.i.d. random variables. The detailed derivation can be found in [1, Ch. 19.2].
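To get a feel for the nearest-neighbor distance term, the following Monte Carlo sketch (an illustration, not part of the derivation) estimates E{‖x − x_{π(x,1)}‖} when the inputs are uniform on the unit square, and shows it shrinking as n_t grows; the input distribution and sample sizes are arbitrary assumptions.

```python
import numpy as np

def expected_nn_distance(n_t, N=2, num_trials=500, seed=0):
    """Monte Carlo estimate of E{||x - x_pi(x,1)||} for x, x_t ~ Uniform([0,1]^N)."""
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(num_trials):
        X = rng.uniform(size=(n_t, N))   # training inputs {x_t}
        x = rng.uniform(size=N)          # test input x
        dists.append(np.min(np.linalg.norm(X - x, axis=1)))  # ||x - x_pi(x,1)||
    return float(np.mean(dists))

for n_t in (10, 100, 1000):
    print(n_t, expected_nn_distance(n_t))
```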

2 Decision Trees
Decision trees are an extremely common type of inference mapping. They are widely encountered
in medical diagnosis, finance, and various other forms of decision rules. They represent the prediction
as a series of queries, each splitting on some of the features of the input. A decision tree used for
determining whether a papaya bought in the market is expected to be tasty is illustrated in Fig. 1.

2.1 Learning a Decision Tree


While decision trees are often considered to be obtained from expert knowledge, they can also be
derived from data as machine learning models. Their popularity stems mostly from their inter-
pretability, as one can fully understand the considerations incorporated in making the prediction.
In that sense, decision trees effectively divide the input space R^N into cells, where each cell is
assigned its own value.
A decision tree is comprised of internal nodes (where queries are made) and leaf nodes (where
decisions are dictated). There are various algorithms for translating a dataset D into a decision
tree. Here, we describe one of the basic decision tree algorithms: ID3 (Iterative Dichotomizer 3).
It is based on a gain measure, which will be discussed in the sequel.

Figure 1: Decision tree illustration

2.2 ID3
Let us focus our description on classification tasks (i.e., S = {0, 1}). To further simplify things, we
also consider binary features, i.e., x ∈ {0, 1}^N. This allows all queries to be of the form x_j = 1?.
We next provide the pseudocode of ID3, which is first called as ID3(D, {1, . . . , N}).

Algorithm 1: ID3
Initialization: dataset D, set of inspected features A ⊆ {1, . . . , N}
1 if all labels in D equal some s ∈ {0, 1} then
    Output: leaf node whose value is s
2 end
3 if A is empty then
    Output: leaf node whose value is the most common label in D
4 end
5 Set j = arg max_{i∈A} Gain(D, i);
6 Set T_1 ← ID3({(x, s) ∈ D : x_j = 1}, A \ {j});
7 Set T_2 ← ID3({(x, s) ∈ D : x_j = 0}, A \ {j});
Output: tree with root query x_j = 1?, left subtree T_2, and right subtree T_1 (see Fig. 2)

We note that a key aspect of Algorithm 1 is the Gain measure, which dictates which input
feature to inspect. We next elaborate on such candidate measures.
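To make the recursion concrete, here is a minimal Python sketch of Algorithm 1 for binary features and binary labels; the nested-dictionary tree representation, the gain argument, and the guard for a degenerate split (which the pseudocode does not address) are illustrative assumptions rather than part of the lecture.

```python
from collections import Counter

def id3(D, A, gain):
    """D: list of (x, s) pairs, x a tuple of {0,1} features, s in {0,1}.
    A: set of feature indices not yet queried. gain: callable Gain(D, i)."""
    labels = [s for _, s in D]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:                  # all labels in D equal some s
        return {"leaf": labels[0]}
    if not A:                                  # no features left to inspect
        return {"leaf": majority}
    j = max(A, key=lambda i: gain(D, i))       # feature with maximal gain
    D1 = [(x, s) for x, s in D if x[j] == 1]
    D0 = [(x, s) for x, s in D if x[j] == 0]
    if not D1 or not D0:                       # degenerate split (guard not in Algorithm 1)
        return {"leaf": majority}
    return {"query": j,
            "right": id3(D1, A - {j}, gain),   # T_1: branch with x_j = 1
            "left":  id3(D0, A - {j}, gain)}   # T_2: branch with x_j = 0

def tree_predict(T, x):
    """Follow the queries from the root down to a leaf and return its value."""
    while "leaf" not in T:
        T = T["right"] if x[T["query"]] == 1 else T["left"]
    return T["leaf"]
```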

Figure 2: ID3 recursive call illustration

2.3 Gain Measure


Different algorithms use different implementations of the gain measure in Step 5 of Algorithm 1.
To present these measures, let P_D denote the empirical distribution evaluated over D, e.g.,

P_D(x_i = 1) = \frac{|\{(x, s) \in D : x_i = 1\}|}{|D|}, \qquad
P_D(s = 1 \mid x_i = 1) = \frac{|\{(x, s) \in D : s = 1, x_i = 1\}|}{|\{(x, s) \in D : x_i = 1\}|}.

Using these notations we define two possible gain measures:

Train Error Recall that if we do not split based on querying feature i, then ID3 sets its value
based on the majority of labels in D. Accordingly, the training error we shall observe if not splitting
based on querying feature i equals

min(P_D(s = 1), P_D(s = 0)) = min(P_D(s = 1), 1 − P_D(s = 1)) = C(P_D(s = 1)),

where we have defined C(α) ≜ min(α, 1 − α). Similarly, the training error we shall observe when
splitting based on querying feature i is

P_D(x_i = 1) C(P_D(s = 1 | x_i = 1)) + P_D(x_i = 0) C(P_D(s = 1 | x_i = 0)).

Therefore, the train error gain measure is set as the difference between these training errors, i.e.,

\mathrm{Gain}(D, i) = C(P_D(s = 1)) - \big( P_D(x_i = 1)\, C(P_D(s = 1 \mid x_i = 1)) + P_D(x_i = 0)\, C(P_D(s = 1 \mid x_i = 0)) \big).    (15)

Information Gain A popular gain measure typically employed in ID3 is the information gain.
It follows an approach similar to the train error gain, but instead of inspecting the difference in
the train error, it inspects the difference in the (empirical) entropy. This formulation is obtained by
using (15) while setting

C(α) = −α log(α) − (1 − α) log(1 − α).
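As an illustration, both gain measures can be computed directly from the empirical distribution; the sketch below implements (15) with either choice of C, using base-2 logarithms and the convention 0·log 0 = 0 (details the notes do not spell out). It can be passed to the id3 sketch above, e.g., id3(D, set(range(N)), gain).

```python
import math

def train_error_impurity(alpha):
    # C(alpha) = min(alpha, 1 - alpha) for the train error gain
    return min(alpha, 1.0 - alpha)

def entropy_impurity(alpha):
    # C(alpha) = -alpha log alpha - (1 - alpha) log(1 - alpha), with 0 log 0 = 0
    if alpha in (0.0, 1.0):
        return 0.0
    return -alpha * math.log2(alpha) - (1.0 - alpha) * math.log2(1.0 - alpha)

def gain(D, i, C=entropy_impurity):
    """Gain(D, i) as in (15): impurity of D minus the weighted impurity
    of the two subsets obtained by splitting on the query x_i = 1?."""
    n = len(D)
    p_s1 = sum(s for _, s in D) / n                    # P_D(s = 1)
    after = 0.0
    for b in (1, 0):
        part = [(x, s) for x, s in D if x[i] == b]
        if part:                                       # an empty branch contributes zero
            p_branch = len(part) / n                   # P_D(x_i = b)
            p_s1_branch = sum(s for _, s in part) / len(part)  # P_D(s = 1 | x_i = b)
            after += p_branch * C(p_s1_branch)
    return C(p_s1) - after
```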

3 Random Forests
A random forest is a classifier consisting of a collection of decision trees, where each tree is
constructed by applying an algorithm (e.g., ID3) to the training set D together with an additional
random vector w, sampled i.i.d. from some distribution. The prediction of the random forest
is obtained by a majority vote over the predictions of the individual trees. A common way to do
so is to simply obtain multiple decision trees by building them from different randomly sampled
subsets of D.
Random forests can be viewed as a form of ensemble learning applied to decision trees. We will
discuss ensemble models a bit further when reviewing common techniques for optimizing neural
networks.
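A minimal sketch of this construction, reusing the hypothetical id3, gain, and tree_predict helpers from the earlier sketches; here the per-tree randomness w is realized as bootstrap resampling of D, which is one common choice rather than the only one.

```python
import random

def random_forest(D, N, num_trees=25, seed=0):
    """Build a forest by running ID3 on independently resampled copies of D."""
    rng = random.Random(seed)
    forest = []
    for _ in range(num_trees):
        D_w = [rng.choice(D) for _ in range(len(D))]   # randomly sampled subset of D
        forest.append(id3(D_w, set(range(N)), gain))
    return forest

def forest_predict(forest, x):
    """Majority vote over the predictions of the individual trees."""
    votes = [tree_predict(T, x) for T in forest]
    return max(set(votes), key=votes.count)
```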

References
[1] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
