
Ben-Gurion University - School of Electrical and Computer Engineering - 361-1-3040

Lecture 3: Nearest Neighbors and Decision Trees


Fall 2024/5
Lecturers: Nir Shlezinger and Asaf Cohen

So far, we have described the design of machine learning models from a dataset D based on a loss
measure l(·) as a procedure comprising the following steps:

1. Fix a family of parametric models Fθ .

2. Set the model by finding the parameters that minimize the empirical risk, i.e.,
\theta^\star = \arg\min_{\theta} \frac{1}{|D|} \sum_{(x,s)\in D} l(f_\theta, x, s).    (1)

3. When possible, solve (1); otherwise (which is often the case), apply some iterative optimizer
to estimate θ ⋆ .
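To make this three-step recipe concrete, here is a minimal sketch (not from the lecture) of empirical risk minimization for a linear model family f_θ(x) = θ^T x under the squared loss, using plain gradient descent as the iterative optimizer of step 3; the model family, loss, learning rate, and toy data are illustrative assumptions.

```python
import numpy as np

def empirical_risk(theta, X, s):
    # (1/|D|) * sum of squared losses l(f_theta, x, s) = (theta^T x - s)^2, as in (1)
    return np.mean((X @ theta - s) ** 2)

def erm_gradient_descent(X, s, lr=0.1, num_steps=500):
    # Step 3: iterative optimizer estimating theta_star of (1)
    theta = np.zeros(X.shape[1])
    for _ in range(num_steps):
        grad = 2.0 * X.T @ (X @ theta - s) / len(s)  # gradient of the empirical risk
        theta -= lr * grad
    return theta

# Toy dataset D = {(x_t, s_t)} drawn from a noisy linear model (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
s = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
theta_star = erm_gradient_descent(X, s)
print(theta_star, empirical_risk(theta_star, X, s))
```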

However, not all machine learning models are designed based on this systematic rationale. Some
are more heuristic (which in some cases actually makes them attractive and interpretable). Today,
we will discuss two important examples, nearest neighbors and decision trees, based mostly on [1, Ch. 10-11].

1 Nearest Neighbors
An extremely simple non-parametric machine learning decision rule is termed k-nearest neighbors.
When given an input x, it sets its output ŝ by merely observing the k nearest data points in the
training set D.
To formulate this mathematically, consider a labeled dataset D = {x_t, s_t}_{t=1}^{n_t} with x_t ∈ R^N.
For a given input x ∈ R^N, let π(x, t) denote the index of the t-th closest data point in D to x, i.e., the ordering satisfies

d(x_{\pi(x,t)}, x) \leq d(x_{\pi(x,t+1)}, x), \quad \forall t \in \{1, \ldots, n_t - 1\},    (2)

where d : R^N × R^N → R_+ is some distance measure, most commonly the Euclidean norm
d(x', x) = ‖x' − x‖. In k-nearest neighbors, the inference ŝ is determined based on the data points
{x_{π(x,t)}, s_{π(x,t)}}_{t=1}^{k}, with k being a hyperparameter (i.e., a predetermined, non-learned system
parameter). Such inference rules can be used for both classification and regression.
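As a small illustration (not from the notes), the ordering π(x, ·) can be computed by sorting the training inputs by their Euclidean distance from the query; the sketch below assumes the inputs are stacked in a NumPy array X with one row per x_t.

```python
import numpy as np

def neighbor_order(X, x):
    """Return the indices pi(x, 1), ..., pi(x, n_t): rows of X sorted by
    increasing Euclidean distance from the query x, as in (2)."""
    dists = np.linalg.norm(X - x, axis=1)  # d(x_t, x) for every training point
    return np.argsort(dists)
```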

1.1 Classification
In classification, the labels take values in a finite set, i.e., |S| < ∞, and can be written as
S = {s_1, . . . , s_{|S|}}. In this case, {x_{π(x,t)}, s_{π(x,t)}}_{t=1}^{k} are typically used to set ŝ via a majority rule, i.e.,
ŝ is the label that appears most often among {s_{π(x,t)}}_{t=1}^{k}, or mathematically

\hat{s} = f(x) = s_i : \; |\{t \in \{1, \ldots, k\} : s_{\pi(x,t)} = s_i\}| > |\{t \in \{1, \ldots, k\} : s_{\pi(x,t)} = s_j\}|, \quad \forall j \neq i.    (3)
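A minimal sketch of the majority rule in (3), reusing the neighbor_order helper sketched above; ties are broken arbitrarily here (by Counter ordering), a convention the notes do not specify.

```python
from collections import Counter

def knn_classify(X, S, x, k):
    """k-nearest-neighbor classification by majority vote over the k closest points."""
    idx = neighbor_order(X, x)[:k]           # indices pi(x, 1), ..., pi(x, k)
    votes = Counter(S[i] for i in idx)       # count labels among the k neighbors
    return votes.most_common(1)[0][0]        # label appearing most often, as in (3)
```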

1.2 Regression
In regression, the labels take values in a continuous set, e.g., S = R. In this case, there are several
ways to set ŝ based on {x_{π(x,t)}, s_{π(x,t)}}_{t=1}^{k}. A common approach is a weighted average of {s_{π(x,t)}},
with data points that are closer to x being assigned larger weights. This can be done by using the
inverse distances as weights, i.e.,

\hat{s} = f(x) = \sum_{t=1}^{k} \frac{1/d(x_{\pi(x,t)}, x)}{\sum_{i=1}^{k} 1/d(x_{\pi(x,i)}, x)} \, s_{\pi(x,t)}.    (4)
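The inverse-distance weighting in (4) can be sketched as follows, again assuming the neighbor_order helper above; the small eps guards against a zero distance when the query coincides with a training point, a corner case the notes do not discuss.

```python
import numpy as np

def knn_regress(X, S, x, k, eps=1e-12):
    """k-nearest-neighbor regression with inverse-distance weights, as in (4)."""
    S = np.asarray(S)                        # labels as an array, aligned with rows of X
    idx = neighbor_order(X, x)[:k]
    dists = np.linalg.norm(X[idx] - x, axis=1)
    weights = 1.0 / (dists + eps)            # closer neighbors receive larger weights
    return np.sum(weights * S[idx]) / np.sum(weights)
```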

1.3 Analysis
The simplicity of k-nearest neighbors facilitates its theoretical analysis. Here, we show how one
can bound its generalization error. To that aim, we consider the application of k = 1 nearest
neighbors to a binary classification task, where S = {0, 1}, and the loss used is the zero-one loss,
i.e., l(f, x, s) = 1_{f(x) ≠ s}.

MAP Rule As discussed in the first lecture, the optimal inference rule in that case is maximum
a-posteriori probability (MAP), i.e.,

f_{\mathrm{MAP}}(x) = \arg\max_{s \in S} P_{s|x}(s|x) = \mathbb{1}_{P_{s|x}(s=1|x) > 1/2} = \mathbb{1}_{\eta(x) > 1/2},    (5)

where in the last equality we defined η(x) ≜ P_{s|x}(s = 1|x). The risk function of the MAP rule is clearly

L_P(f_{\mathrm{MAP}}) = \mathbb{E}_x\{\mathbb{E}_{s|x}\{\mathbb{1}_{f_{\mathrm{MAP}}(x) \neq s} \mid x\}\}
= \mathbb{E}_x\{P(f_{\mathrm{MAP}}(x) \neq s \mid x)\}
= \mathbb{E}_x\{P(\mathbb{1}_{\eta(x) > 1/2} \neq s \mid x)\}.    (6)

Now, we note that the conditional risk of the MAP rule is

P(\mathbb{1}_{\eta(x) > 1/2} \neq s \mid x)
= \begin{cases}
\eta(x) = P(s = 1 \mid x), & \eta(x) \leq \tfrac{1}{2} \;\Leftrightarrow\; \min\{\eta(x), 1 - \eta(x)\} = \eta(x), \\
1 - \eta(x) = P(s = 0 \mid x), & \eta(x) > \tfrac{1}{2} \;\Leftrightarrow\; \min\{\eta(x), 1 - \eta(x)\} = 1 - \eta(x),
\end{cases}
\; = \min\{\eta(x), 1 - \eta(x)\}.    (7)

Generalization Bound For analytical tractability, we introduce the following assumptions:


AS1 The dataset D comprises n_t samples drawn i.i.d. from P.

AS2 The conditional probability η(x) = P_{s|x}(s = 1|x) is a c-Lipschitz continuous function of x, i.e.,

|\eta(x) - \eta(x')| \leq c \|x - x'\|, \quad \forall x, x' \in \mathbb{R}^N.

Since the dataset is random (by AS1), the k-nearest neighbors classifier, denoted f_D^{k-NN}(·),
is a random mapping. Therefore, we bound its expected generalization error, as stated in the
following theorem:
Theorem 1.1. Under AS1-AS2, the nearest neighbor classifier f_D^{k-NN}(·) with k = 1 satisfies

\mathbb{E}\{L_P(f_D^{k\text{-NN}})\} \leq 2 L_P(f_{\mathrm{MAP}}) + c \cdot \mathbb{E}\{\|x - x_{\pi(x,1)}\|\}.    (8)

Proof. Let us recall that the stochasticity in f_D^{k-NN}(·) stems from the randomness of D. Taking
this into account, along with the definition of the generalization error, leads to

\mathbb{E}\{L_P(f_D^{k\text{-NN}})\} = \mathbb{E}\{\mathbb{E}\{l(f_D^{k\text{-NN}}, x, s) \mid D\}\}
= \mathbb{E}_{D \sim P^{n_t}}\{\mathbb{E}_{(x,s) \sim P}\{\mathbb{1}_{f_D^{k\text{-NN}}(x) \neq s} \mid D\}\}
= \mathbb{E}_{D \sim P^{n_t},\, (x,s) \sim P}\{\mathbb{1}_{f_D^{k\text{-NN}}(x) \neq s}\}.    (9)

By letting P_x denote the marginal distribution of the input, we can write (9) as

\mathbb{E}\{L_P(f_D^{k\text{-NN}})\} = \mathbb{E}_{\{x_t\} \sim P_x^{n_t},\, x \sim P_x}\{\mathbb{E}_{\{s_t\} \sim P_{s|x}^{n_t},\, s \sim P_{s|x}}\{\mathbb{1}_{f_D^{k\text{-NN}}(x) \neq s} \mid \{x_t\}, x\}\}
= \mathbb{E}_{\{x_t\} \sim P_x^{n_t},\, x \sim P_x}\{\mathbb{E}_{s_{\pi(x,1)} \sim P_{s|x_{\pi(x,1)}},\, s \sim P_{s|x}}\{\mathbb{1}_{s_{\pi(x,1)} \neq s} \mid x_{\pi(x,1)}, x\}\}.    (10)

Now, the internal expectation in (10) is in fact

\mathbb{E}\{\mathbb{1}_{s_{\pi(x,1)} \neq s} \mid x_{\pi(x,1)}, x\} = P(s_{\pi(x,1)} \neq s \mid x_{\pi(x,1)}, x)
= P(s_{\pi(x,1)} = 1, s = 0 \mid x_{\pi(x,1)}, x) + P(s_{\pi(x,1)} = 0, s = 1 \mid x_{\pi(x,1)}, x)
\overset{(a)}{=} P(s_{\pi(x,1)} = 1 \mid x_{\pi(x,1)}) P(s = 0 \mid x) + P(s_{\pi(x,1)} = 0 \mid x_{\pi(x,1)}) P(s = 1 \mid x)
= \eta(x_{\pi(x,1)})(1 - \eta(x)) + \eta(x)(1 - \eta(x_{\pi(x,1)}))
= 2\eta(x)(1 - \eta(x)) + (\eta(x) - \eta(x_{\pi(x,1)}))(2\eta(x) - 1),    (11)

where (a) follows since, given the inputs, the training labels are drawn independently of the test
label, so s_{π(x,1)} depends only on x_{π(x,1)} and s only on x. Now, since |2η(x) − 1| ≤ 1 and by AS2, we have from (11) that

\mathbb{E}\{\mathbb{1}_{s_{\pi(x,1)} \neq s} \mid x_{\pi(x,1)}, x\} \leq 2\eta(x)(1 - \eta(x)) + c\|x - x_{\pi(x,1)}\|.    (12)

Substituting this into (10) leads to

\mathbb{E}\{L_P(f_D^{k\text{-NN}})\} \leq 2\,\mathbb{E}_{x \sim P_x}\{\eta(x)(1 - \eta(x))\} + c \cdot \mathbb{E}\{\|x - x_{\pi(x,1)}\|\}
\overset{(a)}{\leq} 2 L_P(f_{\mathrm{MAP}}) + c \cdot \mathbb{E}\{\|x - x_{\pi(x,1)}\|\},    (13)

where (a) follows since η(x)(1 − η(x)) ≤ min{η(x), 1 − η(x)}, and hence
\mathbb{E}\{\eta(x)(1 - \eta(x))\} \leq \mathbb{E}\{\min\{\eta(x), 1 - \eta(x)\}\} = L_P(f_{\mathrm{MAP}}).


The generalization bound in Theorem 1.1 implies that

\mathbb{E}\{L_P(f_D^{k\text{-NN}})\} \leq 2 L_P(f_{\mathrm{MAP}}) + c \cdot \mathbb{E}\{\|x - x_{\pi(x,1)}\|\}
= 2 L_P(f_{\mathrm{MAP}}) + c \cdot \mathbb{E}_x\{\mathbb{E}_{\{x_t\}}\{\min_t \|x_t - x\| \mid x\}\}.    (14)

When the entries of x are known to take values in some bounded set, one can further bound this
expression in a manner that does not depend on the marginal distribution of x. This derivation
builds on the fact that \mathbb{E}_{\{x_t\}}\{\min_t \|x_t - x\| \mid x\} is the expected minimum of a sequence of
n_t i.i.d. random variables. The detailed derivation can be found in [1, Ch. 19.2].
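To get a feel for the nearest-neighbor distance term, the following Monte Carlo sketch (an illustration, not part of the derivation) estimates E{‖x − x_{π(x,1)}‖} when the inputs are uniform on the unit square, and shows it shrinking as n_t grows; the input distribution and sample sizes are arbitrary assumptions.

```python
import numpy as np

def expected_nn_distance(n_t, N=2, num_trials=500, seed=0):
    """Monte Carlo estimate of E{||x - x_pi(x,1)||} for x, x_t ~ Uniform([0,1]^N)."""
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(num_trials):
        X = rng.uniform(size=(n_t, N))   # training inputs {x_t}
        x = rng.uniform(size=N)          # test input x
        dists.append(np.min(np.linalg.norm(X - x, axis=1)))  # ||x - x_pi(x,1)||
    return float(np.mean(dists))

for n_t in (10, 100, 1000):
    print(n_t, expected_nn_distance(n_t))
```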

2 Decision Trees
Decision trees are an extremely common type of inference mapping. They are widely encountered
in medical diagnosis, finance, and various other forms of decision rules. They represent the prediction
as a series of queries, each splitting on some of the features of the input. A decision tree used for
determining whether a papaya bought in the market is expected to be tasty is illustrated in Fig. 1.

2.1 Learning a Decision Tree


While decision trees are often considered to be obtained from expert knowledge, they can also be
derived from data as machine learning models. Their popularity stems mostly from their inter-
pretability, as one can fully understand the considerations incorporated in making the prediction.
In that sense, decision trees effectively divide the input space R^N into cells, where each cell is
assigned its own value.
A decision tree is comprised of internal nodes (where queries are made) and leaf nodes (where
decisions are dictated). There are various algorithms for translating a dataset D into a decision
tree. Here, we describe one of the basic decision tree algorithms: ID3 (Iterative Dichotomizer 3).
It is based on a gain measure, which will be discussed in the sequel.

Figure 1: Decision tree illustration

2.2 ID3
Let us focus our description on classification tasks (i.e., S = {0, 1}). To further simplify things, we
also consider binary features, i.e., x ∈ {0, 1}^N. This allows all queries to be of the form x_j = 1?.
We next provide the pseudocode of ID3, which is first called as ID3(D, {1, . . . , N}).

Algorithm 1: ID3
Initialization: dataset D, set of inspected features A ⊆ {1, . . . , N}
1 if all labels in D equal some s ∈ {0, 1} then
    Output: leaf node whose value is s
2 end
3 if A is empty then
    Output: leaf node whose value is the most common label in D
4 end
5 Set j = arg max_{i∈A} Gain(D, i);
6 Set T_1 ← ID3({(x, s) ∈ D : x_j = 1}, A \ {j});
7 Set T_2 ← ID3({(x, s) ∈ D : x_j = 0}, A \ {j});
Output: tree with root query x_j = 1?, left subtree T_2, and right subtree T_1 (see Fig. 2)

We note that a key aspect of Algorithm 1 is the Gain measure, which dictates which input
feature to inspect. We next elaborate on such candidate measures.
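To make the recursion concrete, here is a minimal Python sketch of Algorithm 1 for binary features and binary labels; the nested-dictionary tree representation, the gain argument, and the guard for a degenerate split (which the pseudocode does not address) are illustrative assumptions rather than part of the lecture.

```python
from collections import Counter

def id3(D, A, gain):
    """D: list of (x, s) pairs, x a tuple of {0,1} features, s in {0,1}.
    A: set of feature indices not yet queried. gain: callable Gain(D, i)."""
    labels = [s for _, s in D]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:                  # all labels in D equal some s
        return {"leaf": labels[0]}
    if not A:                                  # no features left to inspect
        return {"leaf": majority}
    j = max(A, key=lambda i: gain(D, i))       # feature with maximal gain
    D1 = [(x, s) for x, s in D if x[j] == 1]
    D0 = [(x, s) for x, s in D if x[j] == 0]
    if not D1 or not D0:                       # degenerate split (guard not in Algorithm 1)
        return {"leaf": majority}
    return {"query": j,
            "right": id3(D1, A - {j}, gain),   # T_1: branch with x_j = 1
            "left":  id3(D0, A - {j}, gain)}   # T_2: branch with x_j = 0

def tree_predict(T, x):
    """Follow the queries from the root down to a leaf and return its value."""
    while "leaf" not in T:
        T = T["right"] if x[T["query"]] == 1 else T["left"]
    return T["leaf"]
```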

Figure 2: ID3 recursive call illustration

2.3 Gain Measure


Different algorithms use different implementations of the gain measure in Step 5 of Algorithm 1.
To present these measures, let P_D denote the empirical distribution evaluated over D, e.g.,

P_D(x_i = 1) = \frac{|\{(x, s) \in D : x_i = 1\}|}{|D|}, \qquad
P_D(s = 1 \mid x_i = 1) = \frac{|\{(x, s) \in D : s = 1, x_i = 1\}|}{|\{(x, s) \in D : x_i = 1\}|}.

Using these notations we define two possible gain measures:

Train Error Recall that if we do not split based on querying feature i, then ID3 sets its value
based on the majority of labels in D. Accordingly, the training error we shall observe if not splitting
based on querying feature i equals

min(P_D(s = 1), P_D(s = 0)) = min(P_D(s = 1), 1 − P_D(s = 1)) = C(P_D(s = 1)),

where we have defined C(α) ≜ min(α, 1 − α). Similarly, the training error we shall observe when
splitting based on querying feature i is

P_D(x_i = 1) C(P_D(s = 1 | x_i = 1)) + P_D(x_i = 0) C(P_D(s = 1 | x_i = 0)).

Therefore, the train error gain measure is set as the difference between these training errors, i.e.,

\mathrm{Gain}(D, i) = C(P_D(s = 1)) - \big( P_D(x_i = 1)\, C(P_D(s = 1 \mid x_i = 1)) + P_D(x_i = 0)\, C(P_D(s = 1 \mid x_i = 0)) \big).    (15)

Information Gain A popular gain measure typically employed in ID3 is the information gain.
It follows an approach similar to the train error gain, but instead of inspecting the difference in
the train error, it inspects the difference in the (empirical) entropy. This formulation is obtained by
using (15) while setting

C(α) = −α log(α) − (1 − α) log(1 − α).
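As an illustration, both gain measures can be computed directly from the empirical distribution; the sketch below implements (15) with either choice of C, using base-2 logarithms and the convention 0·log 0 = 0 (details the notes do not spell out). It can be passed to the id3 sketch above, e.g., id3(D, set(range(N)), gain).

```python
import math

def train_error_impurity(alpha):
    # C(alpha) = min(alpha, 1 - alpha) for the train error gain
    return min(alpha, 1.0 - alpha)

def entropy_impurity(alpha):
    # C(alpha) = -alpha log alpha - (1 - alpha) log(1 - alpha), with 0 log 0 = 0
    if alpha in (0.0, 1.0):
        return 0.0
    return -alpha * math.log2(alpha) - (1.0 - alpha) * math.log2(1.0 - alpha)

def gain(D, i, C=entropy_impurity):
    """Gain(D, i) as in (15): impurity of D minus the weighted impurity
    of the two subsets obtained by splitting on the query x_i = 1?."""
    n = len(D)
    p_s1 = sum(s for _, s in D) / n                    # P_D(s = 1)
    after = 0.0
    for b in (1, 0):
        part = [(x, s) for x, s in D if x[i] == b]
        if part:                                       # an empty branch contributes zero
            p_branch = len(part) / n                   # P_D(x_i = b)
            p_s1_branch = sum(s for _, s in part) / len(part)  # P_D(s = 1 | x_i = b)
            after += p_branch * C(p_s1_branch)
    return C(p_s1) - after
```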

3 Random Forests
A random forest is a classifier consisting of a collection of decision trees, where each tree is
constructed by applying an algorithm (e.g., ID3) to the training set D together with an additional
random vector w, sampled i.i.d. from some distribution. The prediction of the random forest
is obtained by a majority vote over the predictions of the individual trees. A common way to do
so is to simply obtain multiple decision trees by building them from different randomly sampled
subsets of D.
Random forests can be viewed as a form of ensemble learning applied to decision trees. We will
discuss ensemble models a bit further when reviewing common techniques for optimizing neural
networks.
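A minimal sketch of this construction, reusing the hypothetical id3, gain, and tree_predict helpers from the earlier sketches; here the per-tree randomness w is realized as bootstrap resampling of D, which is one common choice rather than the only one.

```python
import random

def random_forest(D, N, num_trees=25, seed=0):
    """Build a forest by running ID3 on independently resampled copies of D."""
    rng = random.Random(seed)
    forest = []
    for _ in range(num_trees):
        D_w = [rng.choice(D) for _ in range(len(D))]   # randomly sampled subset of D
        forest.append(id3(D_w, set(range(N)), gain))
    return forest

def forest_predict(forest, x):
    """Majority vote over the predictions of the individual trees."""
    votes = [tree_predict(T, x) for T in forest]
    return max(set(votes), key=votes.count)
```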

References
[1] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
