Lecture 03: Machine Learning
Machine Learning
Thien Huynh-The
Department of Computer and Communications Engineering
HCMC University of Technology and Education
• Artificial Intelligence (AI): Machines performing tasks requiring human intelligence. Encompasses
rule-based systems and advanced machine learning.
• Machine Learning (ML): A subfield of AI. Systems learn from data without explicit programming.
Algorithms learn patterns and make predictions. Examples: Linear Regression, SVMs.
• Deep Learning (DL): A subfield of ML using deep neural networks (multiple layers) to analyze data and
learn complex patterns. Examples: CNNs, RNNs.
• Definition: A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in T, as measured
by P, improves with experience E.
• Example: Email Spam Filtering
• T: Classifying emails as “spam” or “not spam.”
• P: Accuracy in classifying emails.
• E: A dataset of emails labeled as “spam” or “not spam.” The ML system learns patterns
from this labeled data to improve its spam detection accuracy.
Comparison of Classification, Regression, and Clustering
• Task:
  • Classification: Assign an input to one of a set of discrete categories.
  • Regression: Predict a continuous numerical value.
  • Clustering: Group similar inputs together.
Key Difference: Data Labels and Learning Paradigm
• Supervised Learning:
• Uses labeled data (input-output pairs).
• Learns a mapping from inputs to outputs to predict new outputs.
• Examples:
• Classification (e.g., spam detection).
• Regression (e.g., house price prediction).
• Unsupervised Learning:
• Uses unlabeled data (inputs only).
• Finds patterns or structure in the data (e.g., clustering).
• Example: Customer segmentation.
The “experience” in machine learning refers to the data used to train the model. The nature
and quality of this data are crucial for the model’s performance.
Data Types:
• Labeled Data: Each data point is associated with a corresponding output or label. Used
in supervised learning.
• Examples: (input, output) pairs like (image, “cat”), (email, “spam”), (house features, price).
• Use Cases: Classification, Regression.
• Unlabeled Data: Data points without any associated outputs or labels. Used in
unsupervised learning.
• Examples: A collection of images without labels, customer purchase histories.
• Use Cases: Clustering, dimensionality reduction, anomaly detection.
Data Characteristics: The quality of the experience (data) significantly impacts the learning
process. Important characteristics include:
• Quantity: More data generally leads to better performance, especially for complex models.
• Quality: Clean, accurate, and consistent data is essential. Noisy or erroneous data can hinder learning.
• Relevance: The data should be relevant to the task being learned. Irrelevant data can confuse the model.
• Bias: Data can contain biases that reflect existing societal or other prejudices. These biases can be
learned by the model, leading to unfair or inaccurate predictions.
• Representation: The data should be representative of the real-world scenarios the model will encounter. If
the training data is not representative, the model may not generalize well to new data.
P (Performance Measure): A metric used to evaluate how well the system performs on the
tasks. Examples:
• Accuracy: The proportion of correctly classified instances.
• Precision: The proportion of true positives among the predicted positives.
• Recall: The proportion of true positives among the actual positives.
• Mean Squared Error (MSE): Average squared difference between predicted and actual values (for
regression).
Confusion Matrix: A table summarizing the performance of a classification model.
                     Predicted Positive      Predicted Negative
Actual Positive      True Positive (TP)      False Negative (FN)
Actual Negative      False Positive (FP)     True Negative (TN)

  Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision: Fraction of true positives among predicted positives.
  Precision = TP / (TP + FP)
• Recall: Fraction of true positives among actual positives.
  Recall = TP / (TP + FN)
• F1-Score: Harmonic mean of precision and recall.
  F1-Score = (2 × Precision × Recall) / (Precision + Recall)
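A minimal NumPy sketch (added here for illustration; the label arrays are invented, not from the slides) showing how the confusion-matrix counts and the four metrics above can be computed:

import numpy as np

# Invented binary labels (1 = positive, 0 = negative), for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Confusion-matrix counts
TP = np.sum((y_true == 1) & (y_pred == 1))
TN = np.sum((y_true == 0) & (y_pred == 0))
FP = np.sum((y_true == 0) & (y_pred == 1))
FN = np.sum((y_true == 1) & (y_pred == 0))

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={TP}, TN={TN}, FP={FP}, FN={FN}")
print(f"Accuracy={accuracy:.3f}  Precision={precision:.3f}  Recall={recall:.3f}  F1={f1:.3f}")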
• Mean Squared Error (MSE): Average squared difference between predicted and actual values.
  MSE = (1/n) Σ_{i=1}^{n} (yi − ŷi)²
where yi is the actual value, ŷi is the predicted value, and n is the number of data points.
• Root Mean Squared Error (RMSE): Square root of MSE.
  RMSE = √MSE
• Mean Absolute Error (MAE): Average absolute difference between predicted and actual values.
  MAE = (1/n) Σ_{i=1}^{n} |yi − ŷi|
• R-squared (Coefficient of Determination): Represents the proportion of variance in the dependent
variable that is predictable from the independent variables.
  R² = 1 − [Σ_{i=1}^{n} (yi − ŷi)²] / [Σ_{i=1}^{n} (yi − ȳ)²]
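For illustration (not from the slides), the four regression metrics follow directly from the formulas above; the actual and predicted values here are invented:

import numpy as np

# Invented actual and predicted values, for illustration only.
y = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_hat = np.array([2.8, 5.4, 7.0, 10.5, 11.4])

mse = np.mean((y - y_hat) ** 2)                                   # Mean Squared Error
rmse = np.sqrt(mse)                                               # Root Mean Squared Error
mae = np.mean(np.abs(y - y_hat))                                  # Mean Absolute Error
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)   # R-squared

print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}  R2={r2:.4f}")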
Training Phase:
• Data Collection: Gathering raw training data.
• Feature Extraction/Engineering: Transforming
raw data into meaningful features (Xtrain ). Prior
knowledge can be incorporated.
• Model Training: Applying a learning algorithm
(Classification, Regression, Clustering, etc.) to
Xtrain and ytrain to train a model.
Test Phase:
• Data Collection: Gathering raw test data.
• Feature Extraction/Engineering: Applying the
same feature extraction to raw test data to get
Xtest .
• Model Evaluation/Prediction: Using the trained model to predict outputs for Xtest and evaluate the predictions against ytest.
Key Considerations:
• Feature engineering is crucial.
• Test data should be representative.
• Model evaluation is essential.
What is Feature Engineering?
Feature engineering is the process of transforming raw data into features that better represent
the underlying problem to the predictive models, resulting in improved model accuracy on
unseen data. It is a crucial step in machine learning, often more impactful than the choice of
model itself.
• Why is it important?
• Improves model performance.
• Simplifies model complexity.
• Enhances model interpretability.
• Two Main Components:
• Feature Extraction
• Feature Selection
Feature Extraction: Creating New Features
Feature extraction involves creating new features from raw data. This can involve:
• Manual Feature Engineering: Using domain expertise to create features.
• Automated Feature Extraction: Using algorithms to automatically generate features.
• Examples:
• Text Data: Bag-of-words, TF-IDF, word embeddings.
• Image Data: Edge detection, SIFT, HOG, CNN features.
• Time Series Data: Rolling averages, Fourier transforms, autocorrelation.
• Dimensionality Reduction: Feature extraction can also be used to reduce the
dimensionality of the data while preserving important information (e.g., PCA).
Time domain features are calculated directly from the raw time series data. They capture
characteristics of the signal’s amplitude and variation over time.
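The slide's list of specific time-domain features did not survive extraction; the sketch below computes a few commonly used ones (mean, standard deviation, RMS, peak-to-peak, zero-crossing count) on an invented test signal. These feature choices are illustrative assumptions, not the original list.

import numpy as np

# Invented test signal: a 5 Hz sine wave with noise, sampled at 100 Hz for 2 s.
fs = 100
t = np.arange(0, 2, 1 / fs)
x = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(t.size)

time_features = {
    "mean": np.mean(x),                                      # average amplitude
    "std": np.std(x),                                        # amplitude variation
    "rms": np.sqrt(np.mean(x ** 2)),                         # root-mean-square energy
    "peak_to_peak": np.ptp(x),                               # max minus min
    "zero_crossings": int(np.sum(np.diff(np.sign(x)) != 0)), # sign-change count
}
print(time_features)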
Frequency domain features are calculated after transforming the signal into the frequency
domain using techniques like the Fourier Transform, revealing its frequency content.
• Discrete Fourier Transform (DFT): Decomposes the signal into its constituent frequencies.
  Xk = Σ_{n=0}^{N−1} xn · e^(−j2πkn/N)
  where xn is the time domain signal, Xk is the frequency domain representation, and j is the imaginary unit.
• Power Spectral Density (PSD): Describes how the power of a signal is distributed across different frequencies.
  PSD = |Xk|² / N
• Dominant Frequency: The frequency with the highest power in the PSD.
• Spectral Centroid: The “center of mass” of the signal’s power spectrum (where fk are the frequencies).
  Spectral Centroid = [Σ_{k=1}^{N/2} fk |Xk|²] / [Σ_{k=1}^{N/2} |Xk|²]
• Spectral Bandwidth: Measure of the width of the frequency band occupied by the signal.
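A short NumPy sketch (added for illustration) computing the PSD, dominant frequency, and spectral centroid defined above for an invented test signal. The bandwidth is taken here as the power-weighted spread around the centroid, which is one common definition and an assumption on my part:

import numpy as np

# Invented test signal: a 5 Hz sine wave sampled at 100 Hz (N = 200 samples).
fs, N = 100, 200
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 5 * t)

X = np.fft.rfft(x)                      # one-sided DFT (k = 0 .. N/2)
freqs = np.fft.rfftfreq(N, d=1 / fs)    # frequency fk of each bin
psd = np.abs(X) ** 2 / N                # power spectral density |Xk|^2 / N

dominant_freq = freqs[np.argmax(psd)]                                      # highest-power frequency
centroid = np.sum(freqs * psd) / np.sum(psd)                               # spectral centroid
bandwidth = np.sqrt(np.sum((freqs - centroid) ** 2 * psd) / np.sum(psd))   # spread around centroid

print(f"Dominant frequency: {dominant_freq:.1f} Hz")
print(f"Spectral centroid:  {centroid:.2f} Hz")
print(f"Spectral bandwidth: {bandwidth:.2f} Hz")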
Feature Selection
Feature Selection: Choosing the Best Features
Feature selection involves selecting a subset of the most relevant features from the existing set
of features. This can help to:
• Improve model performance by reducing overfitting.
• Simplify the model, making it faster and easier to interpret.
• Reduce computational cost.
• Methods:
• Filter Methods: Select features based on statistical measures (e.g., correlation, chi-squared
test).
• Wrapper Methods: Evaluate subsets of features by training a model on them (e.g., recursive
feature elimination).
• Embedded Methods: Feature selection is done as part of the model training process (e.g.,
LASSO regularization).
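As a hedged illustration of the three method families (not code from the slides), the sketch below applies a filter, a wrapper, and an embedded selector to scikit-learn's breast-cancer dataset. The dataset, the choice of 5 features, and the use of an L1-penalised logistic regression standing in for LASSO (since the target here is categorical) are all illustrative assumptions:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: score each feature independently (ANOVA F-test), keep the best 5.
filter_sel = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Wrapper: recursive feature elimination, repeatedly refitting a model.
wrapper_sel = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=5).fit(X, y)

# Embedded: L1 regularisation drives the coefficients of irrelevant features to zero.
embedded = LogisticRegression(penalty="l1", solver="liblinear", C=0.5, max_iter=5000).fit(X, y)

print("Filter keeps:  ", np.where(filter_sel.get_support())[0])
print("Wrapper keeps: ", np.where(wrapper_sel.get_support())[0])
print("Embedded keeps:", np.where(embedded.coef_[0] != 0)[0])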
Filter methods select features based on statistical measures calculated from the data. They evaluate each feature independently of the chosen learning algorithm.
• Key Characteristics:
• Evaluate features independently.
• Computationally efficient.
• Do not consider the interaction with the learning algorithm.
• Prone to selecting redundant features.
• Common Metrics/Methods:
• Correlation: Measures the linear relationship between two variables.
  Correlation(X, Y) = Cov(X, Y) / (σX σY)
  where Cov(X, Y) is the covariance between X and Y, and σX and σY are the standard deviations of X and Y, respectively. Select features with high correlation to the target variable.
• Common Metrics/Methods:
• Chi-Squared Test (χ²): Measures the independence between categorical variables. Used for feature selection with categorical data.
Common Scaling Techniques:
• Min-Max Scaling (Normalization): Scales features to a range between 0 and 1.
  x′ = (x − xmin) / (xmax − xmin)
where x is the original value, xmin is the minimum value, and xmax is the maximum value of the
feature.
• Standardization (Z-score normalization): Scales features to have a mean of 0 and a standard
deviation of 1.
  x′ = (x − µ) / σ
where x is the original value, µ is the mean, and σ is the standard deviation of the feature.
• Robust Scaling: Scales features using median and interquartile range (IQR), making it less
sensitive to outliers.
  x′ = (x − Median) / IQR
where IQR is the interquartile range (Q3-Q1).
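A small illustrative sketch (not from the slides) applying the three scalers with scikit-learn to an invented feature column that contains an outlier:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Invented feature column with one obvious outlier (100.0).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

print("Min-Max:     ", MinMaxScaler().fit_transform(x).ravel())
print("Standardized:", StandardScaler().fit_transform(x).ravel())
print("Robust (IQR):", RobustScaler().fit_transform(x).ravel())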
Feature Engineering: Summary
Key Takeaways
• Feature engineering is a crucial step in building effective machine learning models.
• It involves transforming raw data into meaningful features through extraction, selection,
cleaning, transformation, and creation.
• Domain expertise is often essential for effective feature engineering.
• It’s an iterative process that requires experimentation and evaluation.
Topics Covered:
• Evaluation protocols (a minimal cross-validation sketch follows below):
  • k-fold cross-validation
  • Leave-one-out cross-validation
• Feature selection methods:
  • Wrapper method
  • Embedded method
• Other engineering techniques:
  • Encoding methods
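A minimal cross-validation sketch for self-study (added here for illustration; the iris dataset and the decision tree model are arbitrary choices, not from the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=42)

# k-fold: k train/test rounds, each fold used exactly once as the test set.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)
print("5-fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))

# Leave-one-out: one round per sample, each sample used once as the test set.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV mean accuracy:", loo_scores.mean().round(3))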
Regression: Predicting Continuous Values
Regression is a supervised learning task where the goal is to predict a continuous numerical
output (target variable) based on input features.
• Examples:
• Predicting house prices based on features like size, location, and number of bedrooms.
• Forecasting stock prices based on historical data and market indicators.
• Estimating temperature based on weather patterns.
• Types of Regression:
• Linear Regression
• Polynomial Regression
• Support Vector Regression (SVR)
• Decision Tree Regression
• Random Forest Regression
• Neural Network Regression
Linear Regression: Modeling Linear Relationships
Linear regression assumes a linear relationship between the input features and the target variable.
• Simple Linear Regression: One input feature.
y = β0 + β1 x
where y is the predicted output, x is the input feature, β0 is the intercept, and β1 is the
slope.
• Multiple Linear Regression: Multiple input features.
y = β0 + β1 x1 + β2 x2 + · · · + βn xn
where x1 , x2 , . . . , xn are the input features, and β1 , β2 , . . . , βn are their corresponding
coefficients.
• Goal: Find the optimal values for the coefficients (βs) that minimize the difference
between the predicted and actual values (e.g., using Mean Squared Error).
Other Regression Methods
Beyond Linear Relationships
When the relationship between the input features and the target variable is non-linear, other
regression methods are more appropriate.
• Polynomial Regression: Models non-linear relationships using polynomial functions.
• Support Vector Regression (SVR): Uses support vectors to define a margin of
tolerance around the predicted values.
• Decision Tree Regression: Partitions the feature space into regions and predicts a
constant value within each region.
• Random Forest Regression: An ensemble method that combines multiple decision trees
to improve prediction accuracy.
• Neural Network Regression: Uses neural networks to learn complex non-linear
relationships.
Linear regression models the relationship between a dependent variable y and one or more independent variables:
y = β0 + β1 x
y = β0 + β1 x1 + β2 x2 + · · · + βn xn
• Goal: Find the optimal coefficients (βs) that minimize the error between predicted and
actual values.
Several preprocessing steps are often necessary to ensure the best performance of a linear
regression model:
• Handling Missing Values: Imputation (mean, median), removal.
• Outlier Detection and Treatment: Removal, transformation (e.g., log transformation).
• Feature Scaling: Important when features have different scales (e.g., standardization,
min-max scaling).
• Feature Encoding: Converting categorical variables into numerical representations (e.g.,
one-hot encoding).
• Feature Selection/Engineering: Selecting relevant features and creating new ones.
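For illustration (not from the slides), these preprocessing steps can be combined in a single scikit-learn pipeline; the toy house-price DataFrame and the column names below are invented:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy housing data: one missing value and one categorical column.
df = pd.DataFrame({
    "size_m2":  [50.0, 80.0, None, 120.0],
    "bedrooms": [1, 2, 3, 4],
    "city":     ["Hanoi", "HCMC", "HCMC", "Danang"],
})

preprocess = ColumnTransformer([
    # Numeric columns: impute missing values with the median, then standardize.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["size_m2", "bedrooms"]),
    # Categorical columns: one-hot encode.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

X = preprocess.fit_transform(df)
print(X)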
Mean Squared Error (MSE): A Common Evaluation Metric
MSE measures the average squared difference between the predicted and actual values.
• Formula:
  MSE = (1/n) Σ_{i=1}^{n} (yi − ŷi)²
where yi is the actual value, ŷi is the predicted value, and n is the number of data points.
• Interpretation:
• A lower MSE indicates better model performance.
• MSE is sensitive to outliers due to the squared term.
• Other Metrics: RMSE (Root Mean Squared Error), MAE (Mean Absolute Error),
R-squared.
Key Assumptions
Linear regression models are based on several assumptions. Violating these assumptions can
affect the model’s validity.
• Linearity: The relationship between the independent and dependent variables is linear.
• Independence of Errors: The errors (residuals) are independent of each other.
• Homoscedasticity: The variance of the errors is constant across all levels of the
independent variables.
• Normality of Errors: The errors are normally distributed.
• No Multicollinearity: The independent variables are not highly correlated with each
other.
Finding the Best Fit Line
In linear regression, we aim to find a function that best describes the linear relationship
between input features (x) and the output (y ):
  ŷ = f(x) = w1 x1 + w0 = xᵀw
where:
• ŷ is the predicted output.
• x = [x1, 1]ᵀ is the input vector (including a constant 1 for the bias term).
• w = [w1, w0]ᵀ is the weight vector (w1 is the weight for x1, and w0 is the bias).
Minimizing the Error
We want to minimize the error between the predicted values (ŷ) and the actual values (y). A
common way to do this is by minimizing the Mean Squared Error (MSE). The loss function is:
  L(w) = (1/2N) Σ_{i=1}^{N} (yi − xiᵀw)²
where:
• N is the number of data points.
• yi is the actual output for the i-th data point.
• xi is the input vector for the i-th data point.
We want to find the weights w that minimize this loss:
  w* = argmin_w L(w)
Matrix Form and Normal Equation
The loss function can be expressed in matrix form:
  L(w) = (1/2N) ‖Xw − y‖²
where y = [y1, y2, …, yN]ᵀ is the vector of actual outputs and X = [x1ᵀ; x2ᵀ; …; xNᵀ] is the design matrix (each row is an input vector).
To find the minimum, we take the derivative of the loss function with respect to w and set it to zero:
  ∂L(w)/∂w = (1/N) Xᵀ(Xw − y) = 0
Solving for w gives the normal equation:
  w = (XᵀX)⁻¹Xᵀy
Pseudo-Inverse
If XᵀX is not invertible (singular), we use the pseudo-inverse:
  w = (XᵀX)†Xᵀy
where (XᵀX)† is the Moore-Penrose pseudo-inverse of XᵀX.
The pseudo-inverse always exists and provides the least-squares solution, minimizing the
squared error even when the matrix is not invertible.
Data:
• Input (Height, cm): x = [147, 150, 153, 155, 158, 160, 163, 165, 168, 170]ᵀ
• Output (Weight, kg): y = [49, 50, 51, 52, 54, 56, 58, 59, 60, 62]ᵀ
Calculations:
• Design Matrix (X): Adding a column of ones for the bias term:
  X = [[147, 1], [150, 1], [153, 1], [155, 1], [158, 1],
       [160, 1], [163, 1], [165, 1], [168, 1], [170, 1]]
Calculations:
• XᵀX:
  XᵀX = [[Σxi², Σxi], [Σxi, N]] = [[253025, 1589], [1589, 10]]
• (XᵀX)⁻¹:
  (XᵀX)⁻¹ ≈ [[0.0019, −0.298], [−0.298, 47.48]]
Calculations:
• Xᵀy:
  Xᵀy = [Σxiyi, Σyi]ᵀ = [87867, 551]ᵀ
• Weights (w):
  w = (XᵀX)⁻¹Xᵀy ≈ [0.587, −38.26]ᵀ
ŷ = 0.612x − 41.31
Example Prediction (Height = 160): ŷ = 0.612 × 160 − 41.31 ≈ 56.61 (kg)
New Test Data:
• Height (xtest)
Calculating MSE:
  MSE = (1/n) Σ_{i=1}^{n} (yi − ŷi)²
  MSE = (1/5) [(63 − 64.476)² + (64 − 65.696)² + (66 − 67.536)² + (67 − 68.756)² + (68 − 70.596)²]
      ≈ (1/5) [2.179 + 2.876 + 2.350 + 3.084 + 6.749]
      ≈ (1/5) [17.238]
      ≈ 3.448
Result: The Mean Squared Error for this test data is approximately 3.448.
Given input data x and corresponding outputs y, the weights w for the linear regression
equation ŷ = xᵀw can be calculated as follows:
from __future__ import division, print_function, unicode_literals
import numpy as np
import matplotlib.pyplot as plt
# height (cm)
X = np.array([[147, 150, 153, 155, 158, 160, 163, 165, 168, 170]]).T
# weight (kg)
y = np.array([[49, 50, 51, 52, 54, 56, 58, 59, 60, 62]]).T
# Visualize data
plt.plot(X, y, 'ro')
plt.axis([140, 190, 45, 75])
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.show()
# Building Xbar
one = np.ones((X.shape[0], 1))
Xbar = np.concatenate((one, X), axis = 1)
# Calculating weights of the fitting line
A = np.dot(Xbar.T, Xbar)
b = np.dot(Xbar.T, y)
w = np.dot(np.linalg.pinv(A), b)
print('w = ', w)
# Preparing the fitting line
w_0 = w[0][0]
w_1 = w[1][0]
x0 = np.linspace(145, 185, 2)
y0 = w_0 + w_1*x0
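The slide's remaining plotting step (drawing the fitted line over the data) did not survive extraction; a minimal continuation consistent with the variables already defined above might look like this:

# Continuation (reconstructed, not verbatim from the slide): draw the data
# points and the fitted line y0 = w_0 + w_1*x0 computed above.
plt.plot(X, y, 'ro')            # training data
plt.plot(x0, y0)                # fitted regression line
plt.axis([140, 190, 45, 75])
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.show()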
Objective: To compare the results of manual linear regression calculations with those
obtained using the scikit-learn library.
• Students will implement linear regression from scratch using matrix operations (as derived
in class).
• Students will then use scikit-learn’s LinearRegression class to fit the same data.
• By comparing the resulting weight vectors, students will:
• Verify the correctness of their manual calculations.
• Gain practical experience with a widely used machine learning library.
• Develop a deeper understanding of the underlying mathematical principles of linear regression.
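A possible sketch of that comparison (an illustration, not the official lab solution), fitting both the normal-equation solution and scikit-learn's LinearRegression on the same height/weight data:

import numpy as np
from sklearn.linear_model import LinearRegression

# Same height/weight data as in the worked example above.
X = np.array([[147, 150, 153, 155, 158, 160, 163, 165, 168, 170]]).T
y = np.array([[49, 50, 51, 52, 54, 56, 58, 59, 60, 62]]).T

# Manual solution via the normal equation / pseudo-inverse.
Xbar = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
w_manual = np.linalg.pinv(Xbar.T @ Xbar) @ Xbar.T @ y
print("manual:  w0 =", w_manual[0][0], " w1 =", w_manual[1][0])

# scikit-learn solution (fits the intercept separately by default).
reg = LinearRegression().fit(X, y)
print("sklearn: w0 =", reg.intercept_[0], " w1 =", reg.coef_[0][0])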
Classification is a supervised learning task where the goal is to predict a categorical output (a class label) based on input features.
Decision trees create a tree-like structure of decisions based on feature values. Each internal
node represents a feature, each branch represents a decision rule, and each leaf node
represents an outcome (class label).
• How it works: Recursively partitions the data based on feature values to create homogeneous subsets
(i.e., subsets with mostly the same class label).
• Advantages: Easy to understand and interpret, can handle both categorical and numerical data.
• Disadvantages: Prone to overfitting, can be sensitive to small changes in the data.
Random forests are an ensemble learning method that combines multiple decision trees to
improve prediction accuracy and reduce overfitting.
• How it works: Creates multiple decision trees on random subsets of the data and
random subsets of features. The final prediction is made by aggregating the predictions of
all trees (e.g., by majority voting).
• Advantages: High accuracy, robust to overfitting, handles high-dimensional data well.
• Disadvantages: More complex than single decision trees, less interpretable.
SVMs find the optimal hyperplane that best separates data points of different classes.
• How it works: Maps data points to a high-dimensional space and finds a hyperplane with the largest
margin between classes. Can use kernel functions to handle non-linearly separable data.
• Advantages: Effective in high-dimensional spaces, relatively memory efficient.
• Disadvantages: Can be computationally intensive for large datasets, sensitive to kernel choice and
hyperparameters.
Neural networks are complex models inspired by the structure of the human brain. They consist
of interconnected layers of neurons that learn complex non-linear relationships in the data.
• How it works: Multiple layers of interconnected nodes (neurons) process information through weighted
connections. Learning occurs by adjusting these weights based on the data.
• Advantages: Can learn highly complex patterns, achieve high accuracy on various tasks.
• Disadvantages: Can be computationally expensive to train, require large amounts of data, can be difficult
to interpret (black box).
A decision tree consists of:
• Nodes: Represent decisions based on feature values.
• Branches: Represent the outcomes of the decisions.
• Leaves: Represent the final outcomes or class labels.
The principle behind decision tree construction is to recursively partition the data into subsets
that are as “pure” as possible with respect to the target variable. “Purity” means that the
subsets contain mostly instances of a single class.
The core of decision tree construction lies in selecting the most informative attribute for
splitting the data at each node. This process aims to create subsets of data that are as pure as possible with respect to the target variable.
Entropy measures the impurity or disorder of a set of data. In the context of classification, it
measures the degree to which a set of instances is mixed with different class labels.
• Formula:
  H(S) = − Σ_{i=1}^{c} pi log₂(pi)
where:
• S is the set of instances.
• c is the number of classes.
• pi is the proportion of instances in S that belong to class i.
• Interpretation:
• H(S) = 0 if all instances belong to the same class (pure set).
• H(S) is maximum when the instances are evenly distributed across all classes (most impure set).
Information gain measures the reduction in entropy achieved by splitting the data on a particular attribute.
• Formula:
  Gain(S, A) = H(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) · H(Sv)
where:
• S is the set of instances.
• A is the attribute.
• Values(A) is the set of possible values of attribute A.
• Sv is the subset of S where attribute A has value v .
• Selecting the Root Node: The attribute with the highest information gain is chosen as
the root node for splitting.
Legend:
• Outlook: S(unny), O(vercast), R(ainy)
• Temperature: H(ot), M(edium), C(ool)
• Humidity: H(igh), N(ormal), L(ow)
• Wind: S(trong), W(eak)

Outlook  Temp.  Humidity  Wind  Play?
S        H      H         W     -
S        H      H         S     -
O        H      H         W     +
R        M      H         W     +
R        C      N         W     +
R        C      N         S     -
O        C      N         S     +
S        M      H         W     -
S        C      N         W     +
R        M      N         W     +
S        M      N         S     +
O        M      H         S     +
O        H      N         W     +
R        M      H         S     -
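As an added illustration (not part of the original slides), entropy and information gain can be computed directly from this table; the helper functions below implement the formulas from the previous slides, and the expected values for the Outlook attribute are noted in comments:

import numpy as np
from collections import Counter

def entropy(labels):
    # H(S) = -sum_i p_i * log2(p_i), computed from a list of class labels.
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(attribute_values, labels):
    # Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [lab for a, lab in zip(attribute_values, labels) if a == v]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

# Outlook column and Play? labels, in the row order of the table above.
outlook = list("SSORRROSSRSOOR")   # S = Sunny, O = Overcast, R = Rainy
play    = list("--+++-+-+++++-")   # + = play, - = don't play

print("H(S) =", round(entropy(play), 3))                               # expected ~0.940
print("Gain(S, Outlook) =", round(information_gain(outlook, play), 3)) # expected ~0.247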
Information Gain for Outlook:
• Outlook = Sunny (S): H(S_Sunny) ≈ 0.971 (5 instances: 2+, 3−)
• Outlook = Overcast (O): H(S_Overcast) = 0 (4 instances: 4+, 0−)
• Outlook = Rainy (R): H(S_Rainy) ≈ 0.971 (5 instances: 3+, 2−)
• Entropy of the whole set: H(S) ≈ 0.940 (14 instances: 9+, 5−)
• Gain(S, Outlook) = 0.940 − (5/14)·0.971 − (4/14)·0 − (5/14)·0.971 ≈ 0.247
Information Gain for Other Attributes:
• Temperature: Gain(S, Temperature) ≈ 0.029
• Humidity: Gain(S, Humidity) ≈ 0.152
• Wind: Gain(S, Wind) ≈ 0.048
Outlook has the highest information gain and is therefore selected as the root node.
After determining “Outlook” as the root node, we proceed by creating subsets of the data
based on its values (Sunny, Overcast, Rainy) and repeating the information gain calculation for
each subset.
1. Create Subsets:
• Sunny Subset: Instances where Outlook = Sunny.
• Overcast Subset: Instances where Outlook = Overcast.
• Rainy Subset: Instances where Outlook = Rainy.
3. Recursive Process:
The splitting process is repeated recursively for each newly created branch until a stopping
criterion is met:
• Stopping Criteria:
• All instances in a subset belong to the same class (pure node).
• No more attributes to split on.
• A predefined maximum tree depth is reached.
• The number of instances in a node falls below a threshold.
Example: Splitting the “Sunny” Branch:
If we are working with the “Sunny” subset, we would calculate the information gain for splitting on “Temperature,” “Humidity,” and “Wind” within that subset and choose the attribute that yields the highest information gain to create further branches.

Outlook = Sunny Subset:
Temp.  Humidity  Wind  Play?
H      H         W     -
H      H         S     -
M      H         W     -
C      N         W     +
M      N         S     +
Outlook = Rainy Subset:
Temp.  Humidity  Wind  Play?
M      H         W     +
C      N         W     +
C      N         S     -
M      N         W     +
M      H         S     -
Calculating Information Gain:
• Gain(S_Rainy, Wind) = 0.971 − (3/5 · 0 + 2/5 · 0) = 0.971 (both branches are pure)
• Gain(S_Rainy, Temperature) ≈ 0.019
• Gain(S_Rainy, Humidity) ≈ 0.019
Wind is chosen for the next split.
[Figure: the resulting decision tree, with Outlook at the root node]
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
# Split the data into training and testing sets
# Imports needed for the classifiers defined below
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Initialize classifiers
classifiers = {
    "SVM": SVC(kernel='linear', random_state=42),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=42)
}
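The data-loading, splitting, and evaluation code between the two comments above did not survive extraction. The sketch below is a reconstruction under the assumption that the weather table from the decision-tree example is the dataset, one-hot encoded with pd.get_dummies; the 70/30 split and the entropy criterion are illustrative choices, not taken from the slides.

# Reconstructed steps (assumption: the weather table from the decision-tree
# example is used as the dataset, one-hot encoded with pd.get_dummies).
data = pd.DataFrame({
    "Outlook":  list("SSORRROSSRSOOR"),
    "Temp":     list("HHHMCCCMCMMMHM"),
    "Humidity": list("HHHHNNNHNNNHNH"),
    "Wind":     list("WSWWWSSWWWSSWS"),
    "Play":     list("--+++-+-+++++-"),
})
X = pd.get_dummies(data.drop(columns="Play"))     # encode categorical features
y = (data["Play"] == "+").astype(int)

# Split the data into training and testing sets (70/30, an illustrative choice)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train and evaluate the decision tree
tree = DecisionTreeClassifier(criterion="entropy", random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
print("Decision tree accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

plot_tree(tree, feature_names=list(X.columns), class_names=["-", "+"], filled=True)
plt.show()

# Train and score the other classifiers defined above
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, clf.predict(X_test)))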
Objective: Deepen understanding of the machine learning workflow through code analysis and
experimentation.
• Code Dissection: Students will meticulously analyze the provided Python code,
explaining the purpose of each line and its role within the overall machine learning
framework. This includes:
• Data loading and preprocessing.
• Model instantiation and training.
• Prediction and evaluation.
• Data Manipulation and Feature Engineering: Students will explore the impact of data
changes and feature engineering:
• Experiment with different train/test splits (e.g., 80/20, 70/30, k-fold cross-validation).
• Implement basic feature engineering techniques (e.g., feature scaling, creating interaction terms if
applicable).
• Analyze how these changes affect model performance.
• Algorithm Comparison: Students will extend the code to include and compare the
performance of other classification algorithms from scikit-learn (e.g., Logistic Regression,
Support Vector Machines, Random Forests).