

Machine Learning

Thien Huynh-The
Department of Computer and Communications Engineering
HCMC University of Technology and Education

February 10, 2025

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 1 / 78


AI, ML, and DL: Definitions
• Artificial Intelligence (AI): Machines performing tasks requiring human intelligence. Encompasses
  rule-based systems and advanced machine learning.

• Machine Learning (ML): A subfield of AI. Systems learn from data without explicit programming.
Algorithms learn patterns and make predictions. Examples: Linear Regression, SVMs.
• Deep Learning (DL): A subfield of ML using deep neural networks (multiple layers) to analyze data and
learn complex patterns. Examples: CNNs, RNNs.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 2 / 78


Machine Learning: Learning from Experience

• Definition: A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in T, as measured
by P, improves with experience E.
• Example: Email Spam Filtering
• T: Classifying emails as “spam” or “not spam.”
• P: Accuracy in classifying emails.
• E: A dataset of emails labeled as “spam” or “not spam.” The ML system learns patterns
from this labeled data to improve its spam detection accuracy.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 3 / 78


Learning from Experience: Task

T (Tasks): The specific problems the ML system is designed to solve. Examples:


• Classification: Assigning an input to one of several predefined categories (e.g., spam
filtering, image classification).
• Regression: Predicting a continuous value (e.g., house prices, stock prices).
• Clustering: Grouping similar inputs together (e.g., customer segmentation).

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 4 / 78


Tasks: Classification, Regression, and Clustering

Table: Comparison of ML Tasks
• Task:
  • Classification: Assign an input to one of a set of pre-defined categories.
  • Regression: Predict a continuous numerical value.
  • Clustering: Group similar inputs together into clusters.
• Output:
  • Classification: Discrete class label (e.g., “spam,” “not spam,” “cat,” “dog”).
  • Regression: Continuous numerical value (e.g., house price, temperature).
  • Clustering: Assignment of each input to a specific cluster.
• Training Data:
  • Classification: Labeled data (input-output pairs).
  • Regression: Labeled data (input-output pairs).
  • Clustering: Unlabeled data (inputs only).
• Goal:
  • Classification: Maximize the accuracy of classification.
  • Regression: Minimize the error between predicted and actual values.
  • Clustering: Maximize similarity within clusters and minimize similarity between clusters.
• Examples:
  • Classification: Email spam detection, image classification (cat vs. dog), medical diagnosis.
  • Regression: House price prediction, stock price forecasting, temperature prediction.
  • Clustering: Customer segmentation, document categorization, anomaly detection.
• Algorithms:
  • Classification: Logistic Regression, Support Vector Machines (SVMs), Decision Trees, Random Forests, Naive Bayes.
  • Regression: Linear Regression, Polynomial Regression, Support Vector Regression (SVR), Neural Networks.
  • Clustering: K-Means, DBSCAN, Hierarchical Clustering, Gaussian Mixture Models (GMMs).


Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 5 / 78
Tasks: Classification, Regression, and Clustering
Key Difference: Data Labels and Learning Paradigm
• Supervised Learning:
• Uses labeled data (input-output pairs).
• Learns a mapping from inputs to outputs to predict new outputs.
• Examples:
• Classification (e.g., spam detection).
• Regression (e.g., house price prediction).
• Unsupervised Learning:
• Uses unlabeled data (inputs only).
• Finds patterns or structure in the data (e.g., clustering).
• Example: Customer segmentation.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 6 / 78


Learning from Experience: Experience

The “experience” in machine learning refers to the data used to train the model. The nature
and quality of this data are crucial for the model’s performance.
Data Types:
• Labeled Data: Each data point is associated with a corresponding output or label. Used
in supervised learning.
• Examples: (input, output) pairs like (image, “cat”), (email, “spam”), (house features, price).
• Use Cases: Classification, Regression.
• Unlabeled Data: Data points without any associated outputs or labels. Used in
unsupervised learning.
• Examples: A collection of images without labels, customer purchase histories.
• Use Cases: Clustering, dimensionality reduction, anomaly detection.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 7 / 78


Learning from Experience: Experience (cont.)

Data Characteristics: The quality of the experience (data) significantly impacts the learning
process. Important characteristics include:
• Quantity: More data generally leads to better performance, especially for complex models.
• Quality: Clean, accurate, and consistent data is essential. Noisy or erroneous data can hinder learning.
• Relevance: The data should be relevant to the task being learned. Irrelevant data can confuse the model.
• Bias: Data can contain biases that reflect existing societal or other prejudices. These biases can be
learned by the model, leading to unfair or inaccurate predictions.
• Representation: The data should be representative of the real-world scenarios the model will encounter. If
the training data is not representative, the model may not generalize well to new data.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 8 / 78


Learning from Experience: Performance

P (Performance Measure): A metric used to evaluate how well the system performs on the
tasks. Examples:
• Accuracy: The proportion of correctly classified instances.
• Precision: The proportion of true positives among the predicted positives.
• Recall: The proportion of true positives among the actual positives.
• Mean Squared Error (MSE): Average squared difference between predicted and actual values (for
regression).

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 9 / 78


Performance Metrics for Classification

Confusion Matrix: A table summarizing the performance of a classification model.

                     Predicted Positive        Predicted Negative
Actual Positive      True Positive (TP)        False Negative (FN)
Actual Negative      False Positive (FP)       True Negative (TN)

• Accuracy: Fraction of correctly classified instances.
  Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
• Precision: Fraction of true positives among predicted positives.
  Precision = \frac{TP}{TP + FP}

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 10 / 78


Performance Metrics for Classification (cont.)

Confusion Matrix: A table summarizing the performance of a classification model.

                     Predicted Positive        Predicted Negative
Actual Positive      True Positive (TP)        False Negative (FN)
Actual Negative      False Positive (FP)       True Negative (TN)

• Recall (Sensitivity): Fraction of true positives among actual positives.
  Recall = \frac{TP}{TP + FN}
• F1-Score: Harmonic mean of precision and recall.
  F1-Score = \frac{2 \times Precision \times Recall}{Precision + Recall}

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 11 / 78
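
As a concrete illustration of these definitions, the short sketch below computes all four metrics directly from confusion-matrix counts. The TP/FP/FN/TN values are made-up numbers for demonstration only, not taken from the lecture.

# Classification metrics from confusion-matrix counts (illustrative values)
tp, fn = 40, 10   # actual positives: correctly / incorrectly classified
fp, tn = 5, 45    # actual negatives: incorrectly / correctly classified

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.850
print(f"Precision: {precision:.3f}")  # 0.889
print(f"Recall:    {recall:.3f}")     # 0.800
print(f"F1-Score:  {f1:.3f}")         # 0.842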


Performance Metrics for Regression
• Mean Squared Error (MSE): Average squared difference between predicted and actual values.
  MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
  where y_i is the actual value, \hat{y}_i is the predicted value, and n is the number of data points.
• Root Mean Squared Error (RMSE): Square root of MSE.
  RMSE = \sqrt{MSE}
• Mean Absolute Error (MAE): Average absolute difference between predicted and actual values.
  MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
• R-squared (Coefficient of Determination): Represents the proportion of variance in the dependent
  variable that is predictable from the independent variables.
  R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
  where \bar{y} is the mean of the actual values.


Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 12 / 78
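
The regression metrics above translate into a few numpy one-liners. The arrays in this sketch are invented values used only to show the computations:

import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # illustrative actual values
y_pred = np.array([2.5, 5.5, 6.0, 9.5])   # illustrative predictions

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_true - y_pred))
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)

print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}, R^2={r2:.3f}")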
Overview of a ML Framework

Training Phase:
• Data Collection: Gathering raw training data.
• Feature Extraction/Engineering: Transforming
raw data into meaningful features (Xtrain ). Prior
knowledge can be incorporated.
• Model Training: Applying a learning algorithm
(Classification, Regression, Clustering, etc.) to
Xtrain and ytrain to train a model.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 13 / 78


Overview of a ML Framework

Test Phase:
• Data Collection: Gathering raw test data.
• Feature Extraction/Engineering: Applying the
same feature extraction to raw test data to get
Xtest .
• Model Evaluation/Prediction: Using the trained
model to predict on Xtest and get ytest .

Key Considerations:
• Feature engineering is crucial.
• Test data should be representative.
• Model evaluation is essential.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 14 / 78


Feature Engineering: Introduction

What is Feature Engineering?

Feature engineering is the process of transforming raw data into features that better represent
the underlying problem to the predictive models, resulting in improved model accuracy on
unseen data. It is a crucial step in machine learning, often more impactful than the choice of
model itself.
• Why is it important?
• Improves model performance.
• Simplifies model complexity.
• Enhances model interpretability.
• Two Main Components:
• Feature Extraction
• Feature Selection

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 15 / 78


Feature Extraction
Feature Extraction: Creating New Features
Feature extraction involves creating new features from raw data. This can involve:
• Manual Feature Engineering: Using domain expertise to create features.
• Automated Feature Extraction: Using algorithms to automatically generate features.
• Examples:
• Text Data: Bag-of-words, TF-IDF, word embeddings.
• Image Data: Edge detection, SIFT, HOG, CNN features.
• Time Series Data: Rolling averages, Fourier transforms, autocorrelation.
• Dimensionality Reduction: Feature extraction can also be used to reduce the
dimensionality of the data while preserving important information (e.g., PCA).

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 16 / 78


Feature Extraction: Time Domain Examples

Time domain features are calculated directly from the raw time series data. They capture
characteristics of the signal’s amplitude and variation over time.
• Mean: Average value of the signal.
  Mean = \frac{1}{N} \sum_{i=1}^{N} x_i
• Standard Deviation (Std Dev): Measure of the signal’s dispersion around the mean.
  Std Dev = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - Mean)^2}
• Root Mean Square (RMS): Measure of the signal’s magnitude.
  RMS = \sqrt{\frac{1}{N} \sum_{i=1}^{N} x_i^2}
• Zero-Crossing Rate: Number of times the signal crosses zero.
• Peak-to-Peak Amplitude: Difference between the maximum and minimum values of the signal.
Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 17 / 78
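
A minimal numpy sketch of these time-domain features, computed on a synthetic signal (the signal itself is an assumption made only so the example runs):

import numpy as np

x = np.sin(np.linspace(0, 4 * np.pi, 200)) + 0.1 * np.random.randn(200)  # synthetic signal

mean = np.mean(x)
std_dev = np.std(x, ddof=1)                        # sample standard deviation (N - 1 in the denominator)
rms = np.sqrt(np.mean(x ** 2))
zero_crossings = np.sum(np.diff(np.sign(x)) != 0)  # number of sign changes
peak_to_peak = np.max(x) - np.min(x)

print(mean, std_dev, rms, zero_crossings, peak_to_peak)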
Feature Extraction: Frequency Domain Examples

Frequency domain features are calculated after transforming the signal into the frequency
domain using techniques like the Fourier Transform, revealing its frequency content.
• Discrete Fourier Transform (DFT): Decomposes the signal into its constituent frequencies.
  X_k = \sum_{n=0}^{N-1} x_n e^{-j 2\pi k n / N}
  where x_n is the time domain signal, X_k is the frequency domain representation, and j is the imaginary unit.
• Power Spectral Density (PSD): Describes how the power of a signal is distributed across different
  frequencies.
  PSD_k = \frac{|X_k|^2}{N}
• Dominant Frequency: The frequency with the highest power in the PSD.
• Spectral Centroid: The “center of mass” of the signal’s power spectrum (where f_k are the frequencies).
  Spectral Centroid = \frac{\sum_{k=1}^{N/2} f_k |X_k|^2}{\sum_{k=1}^{N/2} |X_k|^2}
• Spectral Bandwidth: Measure of the width of the frequency band occupied by the signal.
Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 18 / 78
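
A companion numpy sketch for the frequency-domain features, assuming a real-valued signal sampled at fs Hz (the signal and sampling rate below are synthetic, chosen only for illustration):

import numpy as np

fs = 100.0                                    # assumed sampling rate (Hz)
t = np.arange(0, 2, 1 / fs)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)  # synthetic two-tone signal

X = np.fft.rfft(x)                            # DFT of a real signal (non-negative frequencies)
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
psd = np.abs(X) ** 2 / len(x)                 # power spectral density (up to a scaling factor)

dominant_freq = freqs[np.argmax(psd)]
spectral_centroid = np.sum(freqs * psd) / np.sum(psd)

print(f"Dominant frequency: {dominant_freq:.1f} Hz")
print(f"Spectral centroid:  {spectral_centroid:.1f} Hz")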
Feature Selection

Feature Selection: Choosing the Best Features
Feature selection involves selecting a subset of the most relevant features from the existing set
of features. This can help to:
• Improve model performance by reducing overfitting.
• Simplify the model, making it faster and easier to interpret.
• Reduce computational cost.
• Methods:
• Filter Methods: Select features based on statistical measures (e.g., correlation, chi-squared
test).
• Wrapper Methods: Evaluate subsets of features by training a model on them (e.g., recursive
feature elimination).
• Embedded Methods: Feature selection is done as part of the model training process (e.g.,
LASSO regularization).

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 19 / 78


Feature Selection: Filter Methods

Filter methods select features based on statistical measures calculated from the data. They
evaluate each feature independently of the chosen learning algorithm.
• Key Characteristics:
• Evaluate features independently.
• Computationally efficient.
• Do not consider the interaction with the learning algorithm.
• Prone to selecting redundant features.
• Common Metrics/Methods:
• Correlation: Measures the linear relationship between two variables.
  Correlation(X, Y) = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}
  where Cov(X, Y) is the covariance between X and Y, and \sigma_X and \sigma_Y are the standard deviations of
  X and Y, respectively. Select features with high correlation to the target variable.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 20 / 78


Feature Selection: Filter Methods

• Common Metrics/Methods:
  • Chi-Squared Test (χ²): Measures the independence between categorical variables. Used for feature
    selection in classification problems with categorical features.
    \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}
    where O_i are the observed frequencies and E_i are the expected frequencies. Select features with high
    χ² values (indicating dependence on the target variable).
• Analysis of Variance: In regression with categorical features, ANOVA selects features with
significantly higher variance between groups than within groups.
• Information Gain/Mutual Information: Selects features that provide the greatest reduction in
uncertainty (entropy) about the target variable.
• Advantages: Computationally fast, scalable to high-dimensional datasets.
• Disadvantages: Ignores feature dependencies, may select redundant features.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 21 / 78
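
For reference, scikit-learn wraps these filter criteria behind a common interface. A minimal sketch using the chi-squared score and mutual information on the bundled iris data (the dataset and k=2 are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-squared score with respect to the target
selector = SelectKBest(score_func=chi2, k=2)
X_reduced = selector.fit_transform(X, y)
print("chi2 scores:", selector.scores_)
print("selected feature indices:", selector.get_support(indices=True))

# Same idea with mutual information as the scoring function
mi_selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)
print("mutual information scores:", mi_selector.scores_)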


Other Feature Engineering Techniques

Beyond Extraction and Selection


Besides extraction and selection, other important feature engineering techniques include:
• Data Cleaning: Handling missing values, outliers, and inconsistencies.
• Data Transformation: Scaling, normalization, and encoding categorical variables.
• Scaling: Standardizing or normalizing numerical features.
• Encoding: Converting categorical variables into numerical representations (e.g., one-hot
encoding, label encoding).
• Feature Creation/Construction: Combining existing features to create new ones (e.g.,
creating interaction terms).

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 22 / 78
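
A short pandas sketch of the encoding step mentioned above; the toy DataFrame and column names are invented for illustration:

import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],  # nominal (categorical) feature
    "size_cm": [12.0, 7.5, 9.1, 7.7],
})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df, columns=["color"], prefix="color")
print(one_hot)

# Label encoding: map each category to an integer code (implies an order, so use with care)
df["color_code"] = df["color"].astype("category").cat.codes
print(df)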


Data Transformation: Scaling

Scaling: Bringing Features to a Similar Scale


Scaling transforms numerical features to a similar range of values. This is important because
features with larger values can dominate those with smaller values, which can negatively
impact the performance of some machine learning algorithms.
• Why is scaling important?
• Prevents features with larger magnitudes from dominating others.
• Improves the convergence speed of gradient-based optimization algorithms.
• Improves the performance of distance-based algorithms (e.g., k-NN, SVM).

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 23 / 78


Data Transformation: Scaling

Common Scaling Techniques:
• Min-Max Scaling (Normalization): Scales features to a range between 0 and 1.
  x' = \frac{x - x_{min}}{x_{max} - x_{min}}
  where x is the original value, x_{min} is the minimum value, and x_{max} is the maximum value of the feature.
• Standardization (Z-score normalization): Scales features to have a mean of 0 and a standard deviation of 1.
  x' = \frac{x - \mu}{\sigma}
  where x is the original value, \mu is the mean, and \sigma is the standard deviation of the feature.
• Robust Scaling: Scales features using the median and interquartile range (IQR), making it less sensitive to outliers.
  x' = \frac{x - Median}{IQR}
  where IQR is the interquartile range (Q3 - Q1).
Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 24 / 78
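
All three scalers are available in scikit-learn; a brief sketch on an invented single-feature column:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

x = np.array([[150.0], [160.0], [165.0], [170.0], [190.0]])  # toy feature with one large value

print(MinMaxScaler().fit_transform(x).ravel())    # values scaled into [0, 1]
print(StandardScaler().fit_transform(x).ravel())  # zero mean, unit variance
print(RobustScaler().fit_transform(x).ravel())    # centered on the median, scaled by the IQR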
Feature Engineering: Summary

Key Takeaways
• Feature engineering is a crucial step in building effective machine learning models.
• It involves transforming raw data into meaningful features through extraction, selection,
cleaning, transformation, and creation.
• Domain expertise is often essential for effective feature engineering.
• It’s an iterative process that requires experimentation and evaluation.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 25 / 78


In-Class Assignment

Topics Covered:
• Evaluation protocols:
  • k-fold cross-validation
  • Leave-one-out cross-validation
• Feature selection methods:
  • Wrapper method
  • Embedded method
• Other feature engineering techniques:
  • Encoding method

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 26 / 78


Introduction to Regression

Regression: Predicting Continuous Values
k e r- s o ft w a k e r- s o ft w a
Regression is a supervised learning task where the goal is to predict a continuous numerical
output (target variable) based on input features.
• Examples:
• Predicting house prices based on features like size, location, and number of bedrooms.
• Forecasting stock prices based on historical data and market indicators.
• Estimating temperature based on weather patterns.
• Types of Regression:
• Linear Regression
• Polynomial Regression
• Support Vector Regression (SVR)
• Decision Tree Regression
• Random Forest Regression
• Neural Network Regression

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 27 / 78


Linear Regression

Linear Regression: Modeling Linear Relationships
Linear regression assumes a linear relationship between the input features and the target
variable.
• Simple Linear Regression: One input feature.
y = β0 + β1 x
where y is the predicted output, x is the input feature, β0 is the intercept, and β1 is the
slope.
• Multiple Linear Regression: Multiple input features.
y = β0 + β1 x1 + β2 x2 + · · · + βn xn
where x1 , x2 , . . . , xn are the input features, and β1 , β2 , . . . , βn are their corresponding
coefficients.
• Goal: Find the optimal values for the coefficients (βs) that minimize the difference
between the predicted and actual values (e.g., using Mean Squared Error).
Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 28 / 78
Other Regression Methods

Beyond Linear Relationships
k e r- s o ft w a k e r- s o ft w a
When the relationship between the input features and the target variable is non-linear, other
regression methods are more appropriate.
• Polynomial Regression: Models non-linear relationships using polynomial functions.
• Support Vector Regression (SVR): Uses support vectors to define a margin of
tolerance around the predicted values.
• Decision Tree Regression: Partitions the feature space into regions and predicts a
constant value within each region.
• Random Forest Regression: An ensemble method that combines multiple decision trees
to improve prediction accuracy.
• Neural Network Regression: Uses neural networks to learn complex non-linear
relationships.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 29 / 78


Linear Regression: Introduction

Linear regression models the relationship between a dependent variable y and one or more
independent variables x using a linear equation.


• Types of Linear Regression:
• Simple Linear Regression: One independent variable.

y = β0 + β1 x

• Multiple Linear Regression: Multiple independent variables.

y = β0 + β1 x1 + β2 x2 + · · · + βn xn

• Goal: Find the optimal coefficients (βs) that minimize the error between predicted and
actual values.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 30 / 78


Data Preprocessing for Linear Regression

Several preprocessing steps are often necessary to ensure the best performance of a linear
regression model:
• Handling Missing Values: Imputation (mean, median), removal.
• Outlier Detection and Treatment: Removal, transformation (e.g., log transformation).
• Feature Scaling: Important when features have different scales (e.g., standardization,
min-max scaling).
• Feature Encoding: Converting categorical variables into numerical representations (e.g.,
one-hot encoding).
• Feature Selection/Engineering: Selecting relevant features and creating new ones.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 31 / 78
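
These preprocessing steps are commonly chained into a single pipeline. The sketch below is one possible arrangement with scikit-learn; the toy DataFrame, column names, and chosen strategies are assumptions for illustration, not part of the lecture example:

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "size_m2": [50, 80, np.nan, 120, 65],   # toy feature with a missing value
    "bedrooms": [1, 2, 3, 4, 2],
    "price": [100, 160, 210, 260, 130],
})
X, y = df[["size_m2", "bedrooms"]], df["price"]

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # handle missing values
    ("scale", StandardScaler()),                   # feature scaling
    ("regress", LinearRegression()),
])
model.fit(X, y)
print(model.predict(X))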


Evaluating Linear Regression: Mean Squared Error (MSE)

Mean Squared Error (MSE): A Common Evaluation Metric

MSE measures the average squared difference between the predicted and actual values.
• Formula:
  MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
  where y_i is the actual value, \hat{y}_i is the predicted value, and n is the number of data points.
• Interpretation:
• A lower MSE indicates better model performance.
• MSE is sensitive to outliers due to the squared term.
• Other Metrics: RMSE (Root Mean Squared Error), MAE (Mean Absolute Error),
R-squared.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 32 / 78


Assumptions of Linear Regression

Key Assumptions
Linear regression models are based on several assumptions. Violating these assumptions can
affect the model’s validity.
• Linearity: The relationship between the independent and dependent variables is linear.
• Independence of Errors: The errors (residuals) are independent of each other.
• Homoscedasticity: The variance of the errors is constant across all levels of the
independent variables.
• Normality of Errors: The errors are normally distributed.
• No Multicollinearity: The independent variables are not highly correlated with each
other.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 33 / 78


Estimating Weights in Linear Regression

Finding the Best Fit Line

In linear regression, we aim to find a function that best describes the linear relationship
between input features (x) and the output (y ):

ŷ = f(x) = w_1 x_1 + w_0 = x^T w
where:
• ŷ is the predicted output.
• x = [x_1, 1]^T is the input vector (including a constant 1 for the bias term).
• w = [w_1, w_0]^T is the weight vector (w_1 is the weight for x_1, and w_0 is the bias).

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 34 / 78


Loss Function and Minimization

Minimizing the Error
We want to minimize the error between the predicted values (ŷ) and the actual values (y). A
common way to do this is by minimizing the Mean Squared Error (MSE). The loss function is:
  L(w) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - x_i^T w)^2

where:
• N is the number of data points.
• yi is the actual output for the i-th data point.
• xi is the input vector for the i-th data point.
We want to find the weights w that minimize this loss:

w^* = \arg\min_{w} L(w)

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 35 / 78


Matrix Form and Solution

Matrix Form and Normal Equation
The loss function can be expressed in matrix form:
  L(w) = \frac{1}{2N} \| Xw - y \|^2
where y = [y_1, y_2, \dots, y_N]^T is the vector of actual outputs and X is the design matrix whose rows are the
input vectors x_i^T.
To find the minimum, we take the derivative of the loss function with respect to w and set it to zero:
  \frac{\partial L(w)}{\partial w} = \frac{1}{N} X^T (Xw - y) = 0
Solving for w gives the normal equation:
  w = (X^T X)^{-1} X^T y

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 36 / 78


Handling Non-Invertibility

Pseudo-Inverse
If X^T X is not invertible (singular), we use the pseudo-inverse:
  w = (X^T X)^{\dagger} X^T y
where (X^T X)^{\dagger} is the Moore-Penrose pseudo-inverse of X^T X.
The pseudo-inverse always exists and provides the least-squares solution, minimizing the
squared error even when the matrix is not invertible.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 37 / 78


Linear Regression Example: Data Preparation

Data:
• Input (Height, x): x = [147, 150, 153, 155, 158, 160, 163, 165, 168, 170]^T
• Output (Weight, y): y = [49, 50, 51, 52, 54, 56, 58, 59, 60, 62]^T

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 38 / 78


Linear Regression Example: Calculating X^T X and (X^T X)^{-1}

Calculations:
• Design Matrix (X): Adding a column of ones for the bias term (each row is [height, 1]):
  X = [147 1; 150 1; 153 1; 155 1; 158 1; 160 1; 163 1; 165 1; 168 1; 170 1]

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 39 / 78


Linear Regression Example: Calculating X^T X and (X^T X)^{-1}

Calculations:
• X^T X:
  X^T X = [ \sum x_i^2   \sum x_i ;  \sum x_i   N ] = [ 253025  1589 ;  1589  10 ]
• (X^T X)^{-1}:
  (X^T X)^{-1} ≈ [ 0.0019  -0.298 ;  -0.298  47.48 ]

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 40 / 78


Linear Regression Example: Calculating X^T y and w

Calculations:
• X^T y:
  X^T y = [ \sum x_i y_i ;  \sum y_i ] = [ 87867 ;  551 ]
• Weights (w):
  w = (X^T X)^{-1} X^T y ≈ [ 0.587 ;  -38.26 ]

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 41 / 78


Linear Regression Example: The Equation and Prediction

The Linear Regression Equation:

ŷ = 0.612x − 41.31
Example Prediction (Height = 160):

ŷ = 0.612 × 160 − 41.31 ≈ 56.61


Conclusion:
A person with a height of 160 cm is predicted to have a weight of approximately 56.61 kg.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 42 / 78


Evaluating Linear Regression: Predictions

New Test Data:
• Height (x_test): x_test = [173, 175, 178, 180, 183]^T
• Actual Weight (y_test): y_test = [63, 64, 66, 67, 68]^T

Predictions (ŷ_test): Applying the equation to each height:

ŷ1 = 0.612 × 173 − 41.31 ≈ 64.476


ŷ2 = 0.612 × 175 − 41.31 ≈ 65.696
ŷ3 = 0.612 × 178 − 41.31 ≈ 67.536
ŷ4 = 0.612 × 180 − 41.31 ≈ 68.756
ŷ5 = 0.612 × 183 − 41.31 ≈ 70.596

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 43 / 78


Evaluating Linear Regression: Mean Squared Error (MSE)

Calculating MSE:
  MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
  MSE = \frac{1}{5} [(63 - 64.476)^2 + (64 - 65.696)^2 + (66 - 67.536)^2 + (67 - 68.756)^2 + (68 - 70.596)^2]
      ≈ \frac{1}{5} [2.179 + 2.876 + 2.350 + 3.084 + 6.749]
      ≈ \frac{1}{5} [17.238]
      ≈ 3.448

Result: The Mean Squared Error for this test data is approximately 3.448.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 44 / 78


Linear Regression: Calculation Summary

Given input data x and corresponding outputs y, the weights w for the linear regression
equation ŷ = x^T w can be calculated as follows:

1. Prepare the Data:
   • Create the design matrix X by adding a column of ones to x for the bias term.
   • Ensure x and y are column vectors.
2. Calculate X^T X: Multiply the transpose of X by X.
3. Calculate the Inverse of X^T X: Calculate (X^T X)^{-1}. If X^T X is singular (not invertible), use the
   pseudo-inverse (X^T X)^{\dagger}.
4. Calculate X^T y: Multiply the transpose of X by y.
5. Calculate the Weights w:
   w = (X^T X)^{-1} X^T y
   (or w = (X^T X)^{\dagger} X^T y if using the pseudo-inverse).
6. The Linear Regression Equation: The equation is then ŷ = x^T w, or in scalar form (for simple linear
   regression): ŷ = w_1 x + w_0.
Key Points:
• The pseudo-inverse is used when X^T X is not invertible.
• This method minimizes the sum of squared errors.
Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 45 / 78
Python Code - Manual Calculation
from __future__ import division, print_function, unicode_literals
import numpy as np
import matplotlib.pyplot as plt

# height (cm)
X = np.array([[147, 150, 153, 155, 158, 160, 163, 165, 168, 170]]).T
# weight (kg)
y = np.array([[49, 50, 51, 52, 54, 56, 58, 59, 60, 62]]).T
# Visualize data
plt.plot(X, y, 'ro')
plt.axis([140, 190, 45, 75])
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.show()

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 46 / 78


Python Code - Manual Calculation
# Building Xbar
one = np.ones((X.shape[0], 1))
Xbar = np.concatenate((one, X), axis = 1)
# Calculating weights of the fitting line
A = np.dot(Xbar.T, Xbar)
b = np.dot(Xbar.T, y)
w = np.dot(np.linalg.pinv(A), b)
print('w = ', w)
# Preparing the fitting line
w_0 = w[0][0]
w_1 = w[1][0]
x0 = np.linspace(145, 185, 2)
y0 = w_0 + w_1*x0

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 47 / 78


Python Code - Manual Calculation

# Drawing the fitting line


plt.plot(X.T, y.T, 'ro') # data
plt.plot(x0, y0) # the fitting line
plt.axis([140, 190, 45, 75])
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.show()

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 48 / 78


Python Code - Using Library

from sklearn import datasets, linear_model


# fit the model by Linear Regression
regr = linear_model.LinearRegression(fit_intercept=False)
# fit_intercept = False for calculating the bias
regr.fit(Xbar, y)

# Compare two results


print('Solution found by scikit-learn: ', regr.coef_)
print('Solution found by (5): ', w.T)

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 49 / 78


Student Assignment: Code Comparison

Objective: To compare the results of manual linear regression calculations with those
obtained using the scikit-learn library.
• Students will implement linear regression from scratch using matrix operations (as derived
in class).
• Students will then use scikit-learn’s LinearRegression class to fit the same data.
• By comparing the resulting weight vectors, students will:
• Verify the correctness of their manual calculations.
• Gain practical experience with a widely used machine learning library.
• Develop a deeper understanding of the underlying mathematical principles of linear regression.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 50 / 78


Introduction to Classification

Classification is a supervised learning task where the goal is to predict a categorical output (a
class label) based on input features.


• Examples:
• Email spam detection (spam/not spam).
• Image classification (cat/dog/bird).
• Medical diagnosis (disease/no disease).
• Common Classification Algorithms:
• Decision Trees
• Random Forests
• Support Vector Machines (SVMs)
• k-Nearest Neighbors (k-NN)
• Neural Networks
• Logistic Regression (often used for binary classification)

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 51 / 78


Decision Trees

Decision trees create a tree-like structure of decisions based on feature values. Each internal
node represents a feature, each branch represents a decision rule, and each leaf node
represents an outcome (class label).
• How it works: Recursively partitions the data based on feature values to create homogeneous subsets
(i.e., subsets with mostly the same class label).
• Advantages: Easy to understand and interpret, can handle both categorical and numerical data.
• Disadvantages: Prone to overfitting, can be sensitive to small changes in the data.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 52 / 78


Random Forests

Random forests are an ensemble learning method that combines multiple decision trees to
improve prediction accuracy and reduce overfitting.
• How it works: Creates multiple decision trees on random subsets of the data and
random subsets of features. The final prediction is made by aggregating the predictions of
all trees (e.g., by majority voting).
• Advantages: High accuracy, robust to overfitting, handles high-dimensional data well.
• Disadvantages: More complex than single decision trees, less interpretable.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 53 / 78


Support Vector Machines (SVMs)

SVMs find the optimal hyperplane that best separates data points of different classes.
• How it works: Maps data points to a high-dimensional space and finds a hyperplane with the largest
  margin between classes. Can use kernel functions to handle non-linearly separable data.
• Advantages: Effective in high-dimensional spaces, relatively memory efficient.
• Disadvantages: Can be computationally intensive for large datasets, sensitive to kernel choice and
hyperparameters.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 54 / 78


k-Nearest Neighbors (k-NN)

k-NN: Classifying Based on Neighbors


k-NN classifies a data point based on the majority class among its k nearest neighbors in the
feature space.
• How it works: Calculates the distance between the new data point and all training data
points. Selects the k nearest neighbors and assigns the most frequent class among them.
• Advantages: Simple to implement, no training phase.
• Disadvantages: Computationally expensive for large datasets, sensitive to feature scaling
and the choice of k.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 55 / 78


Neural Networks

Neural networks are complex models inspired by the structure of the human brain. They consist
of interconnected layers of neurons that learn complex non-linear relationships in the data.
• How it works: Multiple layers of interconnected nodes (neurons) process information through weighted
connections. Learning occurs by adjusting these weights based on the data.
• Advantages: Can learn highly complex patterns, achieve high accuracy on various tasks.
• Disadvantages: Can be computationally expensive to train, require large amounts of data, can be difficult
to interpret (black box).

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 56 / 78


Decision Trees: Introduction and Nominal Data

Decision Trees: A Tree-Based Approach to Classification


Decision trees are a powerful and interpretable classification method that uses a tree-like
structure to represent decisions and their possible outcomes.
Nominal Data:
• Nominal data represents categories or labels with no inherent order.
• Examples: Colors (red, blue, green), types of fruit (apple, banana, orange), weather
conditions (sunny, cloudy, rainy).
• Used as attributes (features) in decision trees to split the data.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 57 / 78


Decision Trees: Concept and Principle

A decision tree consists of:
• Nodes: Represent decisions based on feature values.
• Branches: Represent the outcomes of the decisions.
• Leaves: Represent the final outcomes or class labels.
The principle behind decision tree construction is to recursively partition the data into subsets
that are as “pure” as possible with respect to the target variable. “Purity” means that the
subsets contain mostly instances of a single class.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 58 / 78


Building a Decision Tree: Picking the Root Attribute

The core of decision tree construction lies in selecting the most informative attribute for
splitting the data at each node. This process aims to create subsets of data that are
increasingly “pure” (i.e., contain instances primarily of a single class).


• The Goal: To maximize the separation between classes after the split.
• How it Works:
• For each attribute:
• Calculate the entropy of the current node (before the split).
• Calculate the entropy of each child node that would result from splitting on that attribute.
• Calculate the information gain achieved by splitting on that attribute.
• Select the attribute with the highest information gain as the splitting attribute for the
current node.
• Example: With fruit data (apple, banana, orange) and features like color (red, yellow, orange) and size
(small, medium, large), we calculate information gain for each feature. The feature with higher gain is
chosen for the split.
• Recursive Process: This splitting process repeats for each child node until a stopping criterion is met
(e.g., pure nodes or maximum depth).
Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 59 / 78
Entropy

Entropy measures the impurity or disorder of a set of data. In the context of classification, it
measures the degree to which a set of instances is mixed with different class labels.
• Formula:
  H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)

where:
• S is the set of instances.
• c is the number of classes.
• pi is the proportion of instances in S that belong to class i.
• Interpretation:
• H(S) = 0 if all instances belong to the same class (pure set).
• H(S) is maximum when the instances are evenly distributed across all classes (most impure set).

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 60 / 78
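
A small helper function makes the entropy definition concrete. The class counts passed in below are grounded in the worked example that follows (9 positive and 5 negative instances):

import numpy as np

def entropy(counts):
    """Entropy (in bits) of a label distribution given as class counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()              # drop empty classes, normalize to probabilities
    return float(-np.sum(p * np.log2(p)))

print(entropy([9, 5]))   # ~0.940, matching the worked example below
print(entropy([7, 7]))   # 1.0 for an evenly mixed binary set
print(entropy([14, 0]))  # 0.0 for a pure set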


Information Gain

Information gain measures the reduction in entropy achieved by splitting the data on a
particular attribute.
• Formula:
  Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)

where:
• S is the set of instances.
• A is the attribute.
• Values(A) is the set of possible values of attribute A.
• Sv is the subset of S where attribute A has value v .
• Selecting the Root Node: The attribute with the highest information gain is chosen as
the root node for splitting.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 61 / 78
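
Building on the entropy helper sketched above, information gain is just the weighted drop in entropy across the child subsets. The counts below reproduce the Outlook split from the worked example on the following slides:

def information_gain(parent_counts, child_counts_list):
    """Gain(S, A) = H(S) - sum over values v of (|S_v| / |S|) * H(S_v)."""
    n = sum(sum(c) for c in child_counts_list)
    weighted_child_entropy = sum(sum(c) / n * entropy(c) for c in child_counts_list)
    return entropy(parent_counts) - weighted_child_entropy

# Outlook splits the 9+/5- dataset into Sunny (2+, 3-), Overcast (4+, 0-), Rainy (3+, 2-)
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.246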


Decision Tree Example: Data

Outlook  Temp.  Humidity  Wind  Play?
S        H      H         W     -
S        H      H         S     -
O        H      H         W     +
R        M      H         W     +
R        C      N         W     +
R        C      N         S     -
O        C      N         S     +
S        M      H         W     -
S        C      N         W     +
R        M      N         W     +
S        M      N         S     +
O        M      H         S     +
O        H      N         W     +
R        M      H         S     -

Attribute values:
• Outlook: S(unny), O(vercast), R(ainy)
• Temperature: H(ot), M(edium), C(ool)
• Humidity: H(igh), N(ormal), L(ow)
• Wind: S(trong), W(eak)

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 62 / 78


Decision Tree Example: Initial Entropy

Initial Entropy (S):


• Total instances (N) = 14
• Positive instances (+) = 9
• Negative instances (-) = 5
• Calculation:
  p(+) = 9/14 ≃ 0.643
  p(−) = 5/14 ≃ 0.357
  H(S) = −p(+) log2(p(+)) − p(−) log2(p(−)) ≃ 0.940

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 63 / 78


Decision Tree Example: Information Gain for Outlook

Information Gain for Outlook:
• Outlook = Sunny (S): H(S_Sunny) ≃ 0.971 (5 instances: 2+, 3-)
• Outlook = Overcast (O): H(S_Overcast) = 0 (4 instances: 4+, 0-)
• Outlook = Rainy (R): H(S_Rainy) ≃ 0.971 (5 instances: 3+, 2-)
where:
• H() represents the entropy function.
• S_Sunny, S_Overcast, and S_Rainy represent the subsets of the data where the Outlook is “Sunny,”
  “Overcast,” and “Rainy,” respectively.

Gain(S, Outlook) = H(S) − [ (5/14) H(S_Sunny) + (4/14) H(S_Overcast) + (5/14) H(S_Rainy) ]
                 ≃ 0.940 − [ (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 ]
                 ≃ 0.246
Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 64 / 78
hange E hange E
XC XC
Decision Tree Example: Information Gain for Other Attributes
PD
F-
di
t F-
di
t

PD
or

or
!

!
W

W
O

O
N

N
Y

Y
U

U
B

B
to

to
Information Gain for Other Attributes:
ww

ww
om

om
k

k
lic

lic
C

C
.c

.c
w

w
tr e tr re
•k e rTemperature:
ar
.

.
ac ac
- s o ft w k e r- s o ft w a

• H(S⟨sub⟩Hot⟨/sub⟩) = 1 (4 instances: 2+, 2-)


• H(S⟨sub⟩Medium⟨/sub⟩) ≃ 0.918 (6 instances: 4+, 2-)
• H(S⟨sub⟩Cool⟨/sub⟩) ≃ 0.811 (4 instances: 3+, 1-)
Gain(S, Temperature) ≃ 0.029
• Humidity:
• H(S⟨sub⟩High⟨/sub⟩) ≃ 0.985 (7 instances: 3+, 4-)
• H(S⟨sub⟩Normal⟨/sub⟩) ≃ 0.592 (7 instances: 6+, 1-)
Gain(S, Humidity) ≃ 0.151
• Wind:
• H(S⟨sub⟩Strong⟨/sub⟩) = 1 (6 instances: 3+, 3-)
• H(S⟨sub⟩Weak⟨/sub⟩) ≃ 0.811 (8 instances: 6+, 2-)
Gain(S, Wind) ≃ 0.048
Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 65 / 78
hange E hange E
XC XC
Decision Tree Example: Choosing the Root Node
PD
F-
di
t F-
di
t

PD
or

or
!

!
W

W
O

O
N

N
Y

Y
U

U
B

B
to

to
ww

ww
om

om
k

k
lic

lic
C

C
.c

.c
w

w
tr re tr re
.

.
ac ac
k e r- s o ft w a k e r- s o ft w a

Comparing Information Gains:


• Gain(S, Outlook) ≃ 0.246
• Gain(S, Temperature) ≃ 0.029
• Gain(S, Humidity) ≃ 0.151
• Gain(S, Wind) ≃ 0.048
Conclusion:
Outlook has the highest information gain. Therefore, Outlook is selected as the root node of
the decision tree.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 66 / 78


Building the Decision Tree: Next Steps

After determining “Outlook” as the root node, we proceed by creating subsets of the data
based on its values (Sunny, Overcast, Rainy) and repeating the information gain calculation for
each subset.
1. Create Subsets:
• Sunny Subset: Instances where Outlook = Sunny.
• Overcast Subset: Instances where Outlook = Overcast.
• Rainy Subset: Instances where Outlook = Rainy.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 67 / 78


Building the Decision Tree: Splitting Each Subset

2. Splitting Each Subset:


For each subset created in the previous step:
1. Calculate the entropy of the subset, H(S_subset).
2. For each remaining attribute (Temperature, Humidity, Wind):
   2.1 Calculate the information gain if we split the subset on that attribute: Gain(S_subset, Attribute).
3. Choose the attribute with the highest information gain as the splitting attribute for that
   branch.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 68 / 78


Building the Decision Tree: Recursive Process and Example

3. Recursive Process:
The splitting process is repeated recursively for each newly created branch until a stopping
criterion is met:
• Stopping Criteria:
• All instances in a subset belong to the same class (pure node).
• No more attributes to split on.
• A predefined maximum tree depth is reached.
• The number of instances in a node falls below a threshold.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 69 / 78


Outlook Subset
Example: Splitting the “Sunny” Branch:
If we are working with the “Sunny” subset, we would calculate the information gain for
splitting on “Temperature,” “Humidity,” and “Wind” within that subset and choose the
attribute that yields the highest information gain to create further branches.

Outlook = Sunny subset:
Temp.  Humidity  Wind  Play?
H      H         W     -
H      H         S     -
M      H         W     -
C      N         W     +
M      N         S     +

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 70 / 78


Splitting the “Sunny” Branch

Outlook = Sunny subset:
Temp.  Humidity  Wind  Play?
H      H         W     -
H      H         S     -
M      H         W     -
C      N         W     +
M      N         S     +

Calculating Information Gain:
• Gain(S_Sunny, Humidity) ≃ 0.971 − (3/5 × 0 + 2/5 × 0) = 0.971
• Gain(S_Sunny, Temperature) = 0.571
• Gain(S_Sunny, Wind) = 0.019
Humidity is chosen for the next split.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 71 / 78


Splitting the “Rainy” Branch

Outlook = Rainy subset:
Temp.  Humidity  Wind  Play?
M      H         W     +
C      N         W     +
C      N         S     -
M      N         W     +
M      H         S     -

Calculating Information Gain:
• Gain(S_Rainy, Wind) ≃ 0.971 − (3/5 × 0 + 2/5 × 0) = 0.971 (both Wind subsets are pure)
• Gain(S_Rainy, Temperature) = 0.019
• Gain(S_Rainy, Humidity) = 0.019
Wind is chosen for the next split.

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 72 / 78


The Complete Decision Tree

Root node: Outlook
• Outlook = Sunny → split on Humidity
  • Humidity = High → Play = -
  • Humidity = Normal → Play = +
• Outlook = Overcast → Play = +
• Outlook = Rainy → split on Wind
  • Wind = Strong → Play = -
  • Wind = Weak → Play = +

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 73 / 78


Python Code - Classification with DT

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Load the dataset


file_path = '/content/sample_data/iris.csv'
df = pd.read_csv(file_path)

# Inspect the first few rows of the dataset


print("First few rows of the dataset:")
print(df.head())

# Separate features and labels


X = df.iloc[:, :-1] # Features: all columns except the last
y = df.iloc[:, -1] # Labels: the last column
Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 74 / 78
Python Code - Classification with DT

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Create and train the Decision Tree classifier


clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the test set


y_pred = clf.predict(X_test)

# Evaluate the classifier


accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 75 / 78


Python Code - Classification with DT

# Visualize the Decision Tree


plt.figure(figsize=(12, 8))
plot_tree(clf, feature_names=X.columns, class_names=clf.classes_, filled=True)
plt.title("Decision Tree Visualization")
plt.show()

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 76 / 78


Python Code - Different Classification Algorithms

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Initialize classifiers
classifiers = {
    "SVM": SVC(kernel='linear', random_state=42),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=42)
}

# Train and evaluate each classifier
for name, clf in classifiers.items():
    print(f"\n=== {name} Classifier ===")
    clf.fit(X_train, y_train)                  # Train the classifier
    y_pred = clf.predict(X_test)               # Make predictions
    accuracy = accuracy_score(y_test, y_pred)  # Calculate accuracy
    print(f"Accuracy: {accuracy:.2f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 77 / 78


Student Assignment: Code Comprehension and Experimentation

Objective: Deepen understanding of the machine learning workflow through code analysis and
experimentation.
• Code Dissection: Students will meticulously analyze the provided Python code,
explaining the purpose of each line and its role within the overall machine learning
framework. This includes:
• Data loading and preprocessing.
• Model instantiation and training.
• Prediction and evaluation.
• Data Manipulation and Feature Engineering: Students will explore the impact of data
changes and feature engineering:
• Experiment with different train/test splits (e.g., 80/20, 70/30, k-fold cross-validation).
• Implement basic feature engineering techniques (e.g., feature scaling, creating interaction terms if
applicable).
• Analyze how these changes affect model performance.
• Algorithm Comparison: Students will extend the code to include and compare the
performance of other classification algorithms from scikit-learn (e.g., Logistic Regression,
Support Vector Machines, Random Forests).
Thien Huynh-The - HCMUTE Machine Learning February 10, 2025 78 / 78
