
EN3150 Pattern Recognition

Learning from data and related challenges

M. T. U. Sampath K. Perera,
Department of Electronic and Telecommunication Engineering,
University of Moratuwa.
([email protected]).
Semester 5 – Batch 21.
What is learning?
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." [1]

[Diagram: the three ingredients of learning (experience E, tasks T, performance measure P) and how performance at tasks in T, as measured by P, improves with experience E.]

[1] Mitchell, Tom M. Machine Learning. McGraw-Hill, 1997.


What is learning? An example
➢ The same ingredients, tasks T, performance measure P, and experience E, can be identified in a concrete problem such as recognizing handwritten digits from the MNIST dataset.

[Figure: sample of the MNIST dataset of handwritten digits (https://en.wikipedia.org/wiki/MNIST_database)]
Learning from data
➢ There are different types of learning from data:
  ➢ Supervised
  ➢ Unsupervised
  ➢ Semi-supervised
  ➢ Self-supervised
  ➢ Reinforcement learning

[Diagram: a typical workflow of data preparation ➔ model training ➔ model evaluation.]
Learning from data: Supervised learning
➢ Supervised learning:
  o The algorithm learns from labeled training data to make predictions or decisions.
➢ Labeled training data?
  o The training data consists of input examples (also called features) along with their corresponding output labels (also called targets or ground truth).
➢ The goal of supervised learning is to learn a mapping function that can predict the correct output label for new, unseen input examples.

[Figure: an ML model mapping images of handwritten digits to the labels "Zero" and "Five".]
Learning from data: Supervised learning
➢ Labeled training data: handwritten digits from the MNIST dataset.
➢ MNIST contains 28x28-pixel images of handwritten digits (0 to 9) along with their corresponding labels.
➢ Flattened into a table, each row holds the 784 pixel values of one image plus its label (only the first and last pixel columns are shown):

       0    1  ...  781  782  783  label
  0  0.0  0.0  ...  0.0  0.0  0.0      5
  1  0.0  0.0  ...  0.0  0.0  0.0      0
  2  0.0  0.0  ...  0.0  0.0  0.0      4
  3  0.0  0.0  ...  0.0  0.0  0.0      1
  4  0.0  0.0  ...  0.0  0.0  0.0      9
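A table like the one above can be obtained with scikit-learn's fetch_openml; this is a minimal sketch (not on the original slide), assuming the dataset can be downloaded on first use:

```python
from sklearn.datasets import fetch_openml

# Download MNIST (70,000 images of 28x28 = 784 pixels) as a pandas DataFrame
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=True)

print(X.shape)   # (70000, 784): one row per image, one column per pixel
print(y.head())  # corresponding labels: '5', '0', '4', '1', '9', ...
```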
Learning from data: Unsupervised learning
➢ Unsupervised learning involves training an algorithm on unlabeled data, without explicit output labels.
➢ The algorithm's objective is to find patterns, structures, or relationships within the data.

Learning from data: Semi-Supervised Learning
➢ Semi-supervised learning: supervised + unsupervised learning.
➢ The training data contains a mixture of labeled and unlabeled examples.
Learning from data: Self-Supervised Learning
➢ The model generates its own supervisory signal (targets) from the data itself, rather than relying on human-provided labels.
➢ E.g., Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) can be used to generate data.
➢ Contrastive learning: the model is trained to discriminate between positive pairs (similar samples) and negative pairs (dissimilar samples).

A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta and A. A. Bharath, "Generative Adversarial Networks: An Overview," IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 53-65, Jan. 2018, doi: 10.1109/MSP.2017.2765202.
Learning from data: Reinforcement Learning
➢ An agent learns to make decisions by interacting with an environment.
➢ The agent receives feedback in the form of rewards or penalties based on its actions.
➢ The goal of the agent is to learn an optimal policy that maximizes the cumulative reward over time.

[Diagram: agent-environment loop; the agent takes actions and receives rewards or penalties in return.]
Supervised vs Unsupervised learning
Supervised:
➢ Uses labeled input and output data.
➢ Well-defined objective (as labels are given, you know what type of results to expect).
➢ Human intervention is required to label the data.
➢ If you have labeled data and a clear target variable to predict, use supervised learning for accurate predictions.
Unsupervised:
➢ Labels are not available.
➢ May discover hidden relationships.
➢ Can be used to learn meaningful representations or features from raw data.
➢ If you have large amounts of unlabeled data and want to find patterns or groupings in the data, opt for unsupervised learning.
Either way:
➢ If you have a mix of labeled and unlabeled data, or the cost of labeling data is high, consider using semi-supervised learning to leverage both types of data.
Learning from data and related challenges
➢ Data Quality and Quantity
  o Noisy, incomplete data can lead to inaccurate and unreliable predictions.
  o Learning often requires large amounts of data.
➢ Data Imbalance
  o E.g., imbalanced classes in classification (such as a 90% / 10% class split).
  o May lead to poor performance.
Learning from data and related challenges
➢ Overfitting: the model performs exceptionally well on the training data but fails to generalize to new, unseen data.
➢ Underfitting: the model is too simplistic to capture the underlying patterns in the data.
➢ Generalization: ensuring that machine learning models generalize well to new, unseen data.

[Figure: example fits illustrating underfitting and overfitting.]
Data preparation
➢ Data cleaning: handle missing or inconsistent data.
  o Approaches: removing the affected samples, filling with zeros/mean/median, or interpolation (see the sketch after this list).
➢ Data cleaning: outlier* detection and removal.
➢ Data preprocessing: feature scaling (e.g., normalization).
➢ Data preprocessing: dimensionality reduction, e.g., Principal Component Analysis (PCA).

[Figure: scatter plot distinguishing inliers from outliers.]

*An outlier in a dataset refers to a data point that deviates significantly from the majority of the other data points.
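A minimal pandas sketch (not from the slides) of the cleaning approaches listed above, using a small hypothetical DataFrame with missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 5.0],
                   "y": [10.0, np.nan, 30.0, np.nan, 50.0]})

dropped      = df.dropna()            # remove rows with missing values
zero_filled  = df.fillna(0.0)         # fill missing entries with zeros
mean_filled  = df.fillna(df.mean())   # fill with the column mean (median: df.median())
interpolated = df.interpolate()       # linear interpolation between neighbouring values

print(interpolated)
```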
Data preparation
➢ Data augmentation: artificially expand the size and diversity of a given dataset.
  o E.g., image rotation, flipping, scaling, cropping ➔ new images.
➢ Imbalanced data (a small sketch follows below):
  o Undersampling of the majority class.
  o Generating synthetic samples of the minority class.
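A minimal sketch (not part of the slides) of undersampling the majority class, using a synthetic 90% / 10% imbalanced dataset; the column names are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic imbalanced dataset: roughly 90% class 0, 10% class 1
df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "label": (rng.random(1000) < 0.1).astype(int),
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Randomly undersample the majority class down to the minority-class size, then shuffle
majority_down = majority.sample(n=len(minority), random_state=42)
balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)
print(balanced["label"].value_counts())
```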
Data preprocessing example
➢ https://scikit-learn.org/stable/modules/preprocessing.html
1. Standardization: scale the features of a dataset to have zero mean and unit variance.
2. Scaling features to a range, e.g., between 0 and 1:
  ➢ Min-max scaler ➔ [0, 1]
  ➢ Max-abs scaler ➔ [-1, 1]

If there are outliers, will this still work? Suggestions?
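A minimal scikit-learn sketch of the three scalings mentioned above on a toy array (the values are made up; the last row plays the role of an outlier):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

X = np.array([[1.0, -2.0],
              [2.0,  0.0],
              [3.0,  4.0],
              [100.0, 1.0]])  # last row: an outlier in the first feature

print(StandardScaler().fit_transform(X))  # zero mean, unit variance per feature
print(MinMaxScaler().fit_transform(X))    # each feature mapped to [0, 1]
print(MaxAbsScaler().fit_transform(X))    # each feature mapped to [-1, 1]
```

Note how the outlier compresses the non-outlier values of the first feature into a narrow band under min-max scaling.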
Data preprocessing example
California Housing Dataset

import pandas as pd
from sklearn.datasets import fetch_california_housing

Use pandas df.describe() to get the following summary statistics for two of the features:

           MedInc       AveOccup
count   20640         20640
mean        3.870671      3.070655
std         1.899822     10.386050
min         0.499900      0.692308
25%         2.563400      2.429741
50%         3.534800      2.818116
75%         4.743250      3.282261
max        15.000100   1243.333333

Independent variables:
1. MedInc: median income in block group (measured in tens of thousands of US dollars)
2. HouseAge: median house age in block group (a lower number is a newer building)
3. AveRooms: average number of rooms per household
4. AveBedrms: average number of bedrooms per household
5. Population: block group population
6. AveOccup: average number of household members
7. Latitude: block group latitude (a higher value is farther north)
8. Longitude: block group longitude (a higher value is farther west)

Dependent variable:
1. medianHouseValue: median house value for households within a block (measured in US dollars)
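A short sketch of how the summary statistics above can be reproduced, assuming scikit-learn can download the dataset on first use:

```python
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset as a pandas DataFrame
housing = fetch_california_housing(as_frame=True)
df = housing.frame

# Summary statistics for two features with very different scales
print(df[["MedInc", "AveOccup"]].describe())
```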
Data preprocessing example: Min-max Scaler
➢ Average Occupancy: inliers are squeezed into the narrow range [0, 0.005], because the feature's maximum is an extreme outlier.
➢ The two features end up on very different scales.
➢ Balanced feature scales cannot be guaranteed in the presence of outliers.

[Figure: MedInc vs. AveOccup after min-max scaling.]
Data preprocessing example: Standard Scaler
➢ Average Occupancy: most values fall in [-0.2, 0.2]; Median Income spans roughly [-2, 4].
➢ The two features still end up on different scales.
➢ Balanced feature scales cannot be guaranteed in the presence of outliers.

[Figure: MedInc vs. AveOccup after standard scaling.]
Data preprocessing example: Robust Scaler
➢ Most data points in both features fall in the range [-2, 3].
➢ The features are scaled similarly even in the presence of outliers.
➢ The outliers themselves are still present after scaling.

Reference: sklearn.preprocessing.RobustScaler — scikit-learn 1.3.0 documentation

[Figure: MedInc vs. AveOccup after robust scaling.]
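A rough sketch (not from the slides) comparing how the three scalers spread the bulk of the two features; summarizing each scaled feature by its 1st and 99th percentiles is just one way to look at it:

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = fetch_california_housing(as_frame=True).frame[["MedInc", "AveOccup"]].to_numpy()

for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    Xs = scaler.fit_transform(X)
    # Range covered by the bulk of the data (1st to 99th percentile), per feature
    lo, hi = np.percentile(Xs, [1, 99], axis=0)
    print(type(scaler).__name__, np.round(lo, 3), np.round(hi, 3))
```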
Data preprocessing example: comparison of scalers
➢ Min-max Scaler: Average Occupancy inliers squeezed into the narrow range [0, 0.005].
➢ Standard Scaler: Average Occupancy in [-0.2, 0.2], Median Income in [-2, 4]; the feature scales still differ.
➢ Robust Scaler: most data points in both features fall in the range [-2, 3].
Data preprocessing example: Quantile transformation
➢ A non-linear transformation.
➢ Spreads out the most frequent values in each feature, aiming to follow a uniform or normal distribution.
➢ Reduces the impact of outliers, making it a robust preprocessing technique.
➢ The transformation is applied independently to each feature.
➢ It estimates the cumulative distribution function of a feature to map the original values to a uniform or normal distribution.
➢ The obtained values are then mapped to the desired output distribution using the associated quantile function.

Reference: sklearn.preprocessing.QuantileTransformer — scikit-learn 1.3.0 documentation
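A minimal sketch of sklearn.preprocessing.QuantileTransformer on a synthetic heavy-tailed feature (the data are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Heavy-tailed feature with a few extreme outliers appended
X = np.concatenate([rng.lognormal(size=1000), [50.0, 100.0]]).reshape(-1, 1)

qt = QuantileTransformer(output_distribution="normal", n_quantiles=500, random_state=0)
Xq = qt.fit_transform(X)

print(X.min(), X.max())    # original range is dominated by the outliers
print(Xq.min(), Xq.max())  # transformed values roughly follow a standard normal
```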
Homework
➢ Task: Comparing Data Normalization Methods (see course page in Moodle).
ML Training Process (Supervised Learning)
➢ Pipeline: data preparation ➔ data splitting ➔ model training (training set) ➔ model validation (validation set) ➔ model testing (testing set).
➢ Performance evaluation metrics:
  • Accuracy
  • Precision
  • Recall
  • F1-Score
  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)

Reference: sklearn.model_selection.train_test_split — scikit-learn 1.3.0 documentation
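A minimal sketch of the train/validation/test split using sklearn.model_selection.train_test_split; the 60/20/20 proportions are an illustrative choice, not prescribed by the slides:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)

# First split off a held-out test set, then carve a validation set out of the remainder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # roughly 60% / 20% / 20%
```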
ML Training Process (Supervised Learning)
➢ Loss functions: used to measure how far the predictions made by a machine learning model are from the actual correct answers.
  ➢ Mean Squared Error (MSE)
  ➢ Mean Absolute Error (MAE)
  ➢ Binary Cross-Entropy (Log Loss)
  ➢ Categorical Cross-Entropy
  ➢ Sparse Categorical Cross-Entropy
  ➢ Hinge Loss
  ➢ Kullback-Leibler Divergence (KL Divergence)
  ➢ Huber Loss
  ➢ Triplet Loss
  ➢ Ranking Losses (e.g., Hinge Rank Loss, RankNet Loss)
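A small NumPy sketch (not from the slides) of two of the listed regression losses and binary cross-entropy, evaluated on made-up predictions:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)    # Mean Squared Error
mae = np.mean(np.abs(y_true - y_pred))   # Mean Absolute Error

# Binary cross-entropy (log loss) for probabilistic class predictions
p_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.7, 0.99])
bce = -np.mean(p_true * np.log(p_pred) + (1 - p_true) * np.log(1 - p_pred))

print(mse, mae, bce)
```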
How to evaluate a model
➢ Accuracy, Precision, Recall, and F-Score

Confusion matrix (rows: true class, columns: predicted class):

                      Predicted positive (+)   Predicted negative (-)   Total
True positive (+)     True Pos. (TP)           False Neg. (FN)          P
True negative (-)     False Pos. (FP)          True Neg. (TN)           N
Total                 P*                       N*

FP is a type I error (false alarm); FN is a type II error (miss).

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}}
\]

➢ Accuracy can be a misleading metric for imbalanced data sets.
➢ Higher F1-score values indicate a better balance between precision and recall.
How to evaluate a model
➢ Accuracy, Precision, Recall, and F-Score

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}
\]

➢ A higher precision indicates a lower rate of false positives, which means the model is making fewer incorrect positive predictions.
➢ A higher recall indicates a lower rate of false negatives, meaning the model is correctly identifying more positive instances.

https://en.wikipedia.org/wiki/Precision_and_recall
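A minimal sketch computing these metrics with sklearn.metrics on made-up predictions (note that scikit-learn's confusion_matrix orders the classes [0, 1], so the negative class appears in the first row, unlike the table above):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))          # rows: true class, columns: predicted class
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```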
Model selection
➢ Model selection is the process of choosing the best model from a set of candidate models for a specific task.
➢ E.g., for MNIST handwritten-digit classification, the candidate models could include Convolutional Neural Networks (CNNs), Decision Trees, and k-Nearest Neighbors (k-NN).
➢ Hyperparameters: parameters that are set before the training process begins, e.g., the learning rate.
➢ Hyperparameters are not learned from data.

[Figure: sample of the MNIST dataset of handwritten digits (https://en.wikipedia.org/wiki/MNIST_database)]
Model selection
➢ Model selection is the process of choosing the best model from a set of candidate models for a specific task.
➢ A typical workflow: train the candidate model(s) on the training set ➔ tune the hyperparameters (parameters not learned from data) ➔ re-train the model(s) ➔ evaluate on the testing set ➔ select the best model, e.g., using performance metrics such as accuracy or F1-score.
Model selection
Resampling technique: k-fold cross validation (k-CV)
• The dataset is divided into k subsets (folds) of approximately equal size.
• The model is trained and evaluated k times, each time using a different fold as the test set and the rest as the training set.
• For each iteration, the model is trained on (k-1) folds and evaluated on the remaining fold.

[Image from scikit-learn.org]
Model selection: Resampling technique: k-fold cross validation (k-CV)
➢ Reduced overfitting: mitigates overfitting by testing the model on unseen data subsets, ensuring better generalization to new data.
➢ Evaluates model performance for various hyperparameter settings.
➢ Helps to identify the best hyperparameters for optimal model performance.
➢ Allows fair and consistent evaluation of multiple models.
➢ Maximizes data utilization for both training and testing.
➢ Ensures all available data contributes to model evaluation.
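A minimal k-fold cross-validation sketch with scikit-learn; the digits dataset and the k-NN classifier are arbitrary choices for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# 5-fold CV: each fold is used exactly once as the held-out evaluation set
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5, scoring="accuracy")
print(scores, scores.mean())
```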
Model selection: hyperparameter tuning with Grid Search
➢ Grid Search involves defining a grid of hyperparameter values to explore.
➢ It systematically evaluates all possible combinations from the grid to identify the best-performing one.
➢ To avoid overfitting during Grid Search, cross-validation is commonly used.
➢ Grid Search can be computationally expensive when the hyperparameter space is large.

[Image from scikit-learn.org]
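A minimal GridSearchCV sketch; the k-NN hyperparameter grid below is a made-up example:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Grid of candidate hyperparameter values; every combination is evaluated with 5-fold CV
param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)
```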


Model Selection: Probabilistic
➢ Statistical modeling is used to choose the most appropriate model among a set of candidate models.
➢ Model comparison criteria ("information theory perspective"):
  ➢ Akaike Information Criterion (AIC)
  ➢ Bayesian Information Criterion (BIC)
  ➢ Minimum Description Length (MDL)
Model Selection: Probabilistic
➢ Akaike Information Criterion (AIC):

\[
\text{AIC} = \ln p(\mathcal{D} \mid \hat{\theta}_{\text{ML}}) - M
\]

where \(\ln p(\mathcal{D} \mid \hat{\theta}_{\text{ML}})\) is the best-fit log likelihood and \(M\) is the number of adjustable parameters of the model.
✓ Select the model with the largest value.
✓ Both model complexity and model performance are considered.
➢ Bayesian Information Criterion (BIC): a variation of AIC.
✓ Generally penalizes model complexity more than AIC ➔ more complex models are less likely to be selected.
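A rough numerical sketch of the AIC idea in the form used above (best-fit log likelihood minus the number of adjustable parameters), comparing polynomial models of different degree on synthetic data; the Gaussian-noise likelihood and the parameter count are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # true function + noise

for degree in (1, 3, 5, 9):
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = np.mean(resid ** 2)                                 # ML estimate of noise variance
    log_lik = -0.5 * x.size * (np.log(2 * np.pi * sigma2) + 1)   # best-fit Gaussian log likelihood
    M = degree + 1                                               # number of polynomial coefficients
    aic = log_lik - M                                            # AIC in the form used on the slide
    print(f"degree={degree}: log-likelihood={log_lik:.1f}, AIC={aic:.1f}")
```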
Bias-variance trade-off
➢ Given a dataset with samples denoted as \((x_i, y_{\text{true},i}),\ i = 1, \dots, n\).
➢ A learned model \(\hat{f}\) maps input features to predicted outputs, \(y_{\text{pred},i} = \hat{f}(x_i)\).

Mean Squared Error (MSE):
\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_{\text{true},i} - y_{\text{pred},i} \right)^2
\]

➢ Assume the data are generated as \(y_i = y_{\text{true},i} = f(x_i) + e\). In expectation,
\[
\text{MSE} = \mathbb{E}\!\left[ \left( y_{\text{true}} - y_{\text{pred}} \right)^2 \right]
= \mathbb{E}\!\left[ \left( f(x) + e - \hat{f}(x) \right)^2 \right]
= \mathbb{E}\!\left[ \left( \big( f(x) - \hat{f}(x) \big) + e \right)^2 \right]
\]
\[
= \mathbb{E}\!\left[ \big( f(x) - \hat{f}(x) \big)^2 \right] + \mathbb{E}\!\left[ e^2 \right]
+ 2\,\mathbb{E}\!\left[ \big( f(x) - \hat{f}(x) \big)\, e \right].
\]
Assuming \(e\) and \(f(x) - \hat{f}(x)\) are independent, the cross term becomes \(2\,\mathbb{E}\!\left[ f(x) - \hat{f}(x) \right]\mathbb{E}[e]\), and with \(\sigma_e^2 = \mathbb{E}[e^2] - (\mathbb{E}[e])^2\) and \(\mathbb{E}[e] = 0\),
\[
\text{MSE} = \mathbb{E}\!\left[ \big( f(x) - \hat{f}(x) \big)^2 \right] + \sigma_e^2 .
\]

➢ Expanding the first term around \(\mathbb{E}[\hat{f}(x)]\):
\[
\mathbb{E}\!\left[ \big( f(x) - \hat{f}(x) \big)^2 \right]
= \mathbb{E}\!\left[ \left( \big( f(x) - \mathbb{E}[\hat{f}(x)] \big) + \big( \mathbb{E}[\hat{f}(x)] - \hat{f}(x) \big) \right)^2 \right]
\]
\[
= \mathbb{E}\!\left[ \big( f(x) - \mathbb{E}[\hat{f}(x)] \big)^2 \right]
+ \mathbb{E}\!\left[ \big( \mathbb{E}[\hat{f}(x)] - \hat{f}(x) \big)^2 \right]
+ 2\,\mathbb{E}\!\left[ \big( f(x) - \mathbb{E}[\hat{f}(x)] \big)\big( \mathbb{E}[\hat{f}(x)] - \hat{f}(x) \big) \right].
\]
The cross term vanishes because \(f(x) - \mathbb{E}[\hat{f}(x)]\) is deterministic:
\[
2\,\mathbb{E}\!\left[ \big( f(x) - \mathbb{E}[\hat{f}(x)] \big)\big( \mathbb{E}[\hat{f}(x)] - \hat{f}(x) \big) \right]
= 2 \big( f(x) - \mathbb{E}[\hat{f}(x)] \big)\,\mathbb{E}\!\left[ \mathbb{E}[\hat{f}(x)] - \hat{f}(x) \right] = 0 .
\]
Therefore
\[
\mathbb{E}\!\left[ \big( f(x) - \hat{f}(x) \big)^2 \right]
= \mathbb{E}\!\left[ \big( f(x) - \mathbb{E}[\hat{f}(x)] \big)^2 \right]
+ \mathbb{E}\!\left[ \big( \mathbb{E}[\hat{f}(x)] - \hat{f}(x) \big)^2 \right].
\]

➢ Putting the pieces together:
\[
\text{MSE}
= \mathbb{E}\!\left[ \big( f(x) - \mathbb{E}[\hat{f}(x)] \big)^2 \right]
+ \mathbb{E}\!\left[ \big( \mathbb{E}[\hat{f}(x)] - \hat{f}(x) \big)^2 \right]
+ \sigma_e^2
\]
\[
= \big( \mathbb{E}[\hat{f}(x)] - f(x) \big)^2
+ \mathbb{E}\!\left[ \big( \hat{f}(x) - \mathbb{E}[\hat{f}(x)] \big)^2 \right]
+ \sigma_e^2
\]
\[
= \text{bias}^2 + \text{variance} + \text{irreducible error},
\]
where the irreducible error \(\sigma_e^2\) cannot be reduced by any model.
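A small simulation sketch of the decomposition above: repeatedly draw training sets from \(y = f(x) + e\), fit models of different complexity, and estimate bias\(^2\) and variance of the prediction at a single test point (the choices of \(f\), the noise level, and the polynomial degrees are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                          # true underlying function
    return np.sin(2 * np.pi * x)

x_test = 0.3                       # single test point, for clarity
n_repeats, n_samples, sigma_e = 500, 20, 0.3

for degree in (1, 5):
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        x = rng.uniform(0, 1, n_samples)
        y = f(x) + rng.normal(scale=sigma_e, size=n_samples)   # y = f(x) + e
        coeffs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coeffs, x_test)                   # f_hat(x_test) for this dataset
    bias2 = (preds.mean() - f(x_test)) ** 2   # (E[f_hat(x)] - f(x))^2
    variance = preds.var()                    # E[(f_hat(x) - E[f_hat(x)])^2]
    print(f"degree={degree}: bias^2={bias2:.4f}, variance={variance:.4f}")
```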


Bias-variance trade-off
➢ High bias: inability to capture the true relationship between input data and output.

[Figure: marks vs. hours of study (per week) for training samples, comparing a high-bias fit with a low-bias fit that fits the training data well in this example.]
Bias-variance trade-off
➢ High variance: the fitted model changes substantially across different data sets, so it fails to fit new data sets consistently.
➢ Low variance means that the estimator does not change much when different training datasets are used.

[Figure: marks vs. hours of study (per week) for testing samples; the low-variance fit also fits the testing data well in this example, while the high-variance fit does not.]
Bias-variance trade-off
➢ A too simple model will have high bias but low variance ➔ underfitting.
➢ A highly complex model will have low bias but high variance ➔ overfitting.

[Figure: fits of increasing model complexity, from underfitting (low complexity) to overfitting (high complexity). Image from https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff]

Underfitting: high bias but low variance. Overfitting: low bias but high variance.
Bias-variance trade-off
➢ Underfitting: variance low, bias high.
➢ Overfitting: variance high, bias low.

[Figure: error vs. model complexity (from low to high) for training samples and testing samples.]
Bias-variance trade-off
➢ The aim is a model with low bias and low variance.

[Figure: the four combinations of low/high bias and low/high variance.]
Thank You
Q&A
