L05-HW2 Notes 2024

The document discusses various methods for data splitting in machine learning, including train-test splits, cross-validation, and hyperparameter tuning. It emphasizes the importance of proper validation and testing to avoid bias and ensure reliable performance estimates. Additionally, it covers the Receiver Operating Characteristic (ROC) curve and the trade-offs in selecting thresholds based on application needs.

HW 2 notes

Data splitting, ROC


Loose ends from HW2

• A majority class baseline (see the sketch after this list)
  • Powerful if one class dominates, which often happens in real life.
  • The recognizer becomes biased towards the majority class (the prior term).
  • How to deal with this?
• Zero probability in the estimation
• Hyperparameters
• ROC
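
A minimal sketch of a majority-class baseline using scikit-learn's DummyClassifier; the imbalanced dataset below is made up for illustration:

import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced labels: 90% of samples belong to class 0
X_train = np.zeros((100, 3))             # features are ignored by this baseline
y_train = np.array([0] * 90 + [1] * 10)

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

# The baseline predicts the majority class for every input, so its
# accuracy equals the majority-class prior (0.9 here).
print(baseline.score(X_train, y_train))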
Splitting the data
• We want to estimate the performance of the model.
• Need a test set.

• Train-test split
• Leave-one-out splitting
• Bootstrapping
• Cross-validation splitting
Splitting data
• Simple train-test split: e.g., 80% training / 20% test (the ratio can be varied as appropriate).
  • 1,000,000 samples → 800,000 training samples / 200,000 test samples
  • 500 samples → 400 training samples / 100 test samples
• Stratified splitting – tries to keep the class distribution in the training and test sets the same.
  • sklearn.model_selection.StratifiedShuffleSplit
  • sklearn.model_selection.StratifiedKFold
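
A short sketch of a stratified 80/20 split using the StratifiedShuffleSplit class named above; the imbalanced data is synthetic:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic imbalanced data for illustration: 90% class 0, 10% class 1
X = np.random.randn(1000, 5)
y = np.array([0] * 900 + [1] * 100)

# One stratified 80/20 split: class proportions are preserved in both parts
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(sss.split(X, y))

print(y[train_idx].mean(), y[test_idx].mean())   # both close to 0.1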
Leave-one-out
• The test set is a single sample; the training set is everything else. Repeat so that each sample serves as the test set once.
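
A minimal leave-one-out sketch, assuming scikit-learn; the data and classifier are placeholders:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=50, n_features=4, random_state=0)

# Each of the 50 samples takes a turn as the (single-sample) test set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(scores.mean())   # average accuracy over the 50 held-out samples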


Multiple train-test split

• Goal: estimate the true (expected) performance.
• Repeat 10 times: random train-test split → train the model (e.g., ŷ = 2.3·x1 − 3.4·x2 + 4.2) → predict on the test split.
• This is sometimes called bootstrapping (in statistics).
• This can also be used to calculate the variance of your method's performance.
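
A rough sketch of the repeated random-split idea: 10 random train-test splits of a synthetic regression dataset, giving a mean performance and its variance. The dataset and model are placeholders:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=2, noise=5.0, random_state=0)

scores = []
for seed in range(10):                       # 10 random train-test splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))   # R^2 on each test split

print(np.mean(scores), np.var(scores))       # expected performance and its variance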
Cross-validation (CV)

• Split the data into folds, e.g., 3 folds:
  • Split 1: Fold 1 = test set, Folds 2–3 = training set
  • Split 2: Fold 2 = test set, Folds 1 and 3 = training set
  • Split 3: Fold 3 = test set, Folds 1–2 = training set
• This is 3-fold cross-validation.
• Similar idea to bootstrapping, but there is no overlap between the test splits.
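
A minimal 3-fold cross-validation sketch, assuming scikit-learn; the classifier and data are placeholders:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# 3-fold CV: each fold is used as the test set exactly once, with no overlap
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3)
print(scores, scores.mean())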


Hyperparameter vs Parameters
• Parameters – something the model learns from the data
• Hyperparameters – something we pick for the model via trial and error

Model                    Parameters                           Hyperparameters
Linear regression        Weights                              Loss (L1/L2), polynomial degree, features used, …
Naïve Bayes              Distribution parameters              Type of distribution used
GMM distribution         Means, covariances, mixture weights  Number of mixtures, type of covariance matrix
Histogram distribution   Histogram heights                    Number of bins or size of bins
K-means                  Centroids                            K
ML model                 Model weights                        Type of ML model


Picking hyper-parameters

• Split the data into a training set, a validation set, and a test set.
• You decide on the model and hyperparameters based on validation performance.
• Make sure that you are not optimizing on the test set.
• Make the test set a good proxy for estimating real-world performance.

Don't cheat!
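
As a rough sketch of this workflow (the data and the polynomial-degree hyperparameter below are made up for illustration): split off a validation set, pick the hyperparameter by validation accuracy, and touch the test set only once at the end.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=600, n_features=4, random_state=0)

# 60/20/20 train/validation/test split
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

best_deg, best_acc = None, -1.0
for degree in [1, 2, 3]:                      # hyperparameter candidates
    model = make_pipeline(PolynomialFeatures(degree),
                          LogisticRegression(max_iter=2000))
    model.fit(X_train, y_train)
    acc = model.score(X_val, y_val)           # decide on the validation set only
    if acc > best_acc:
        best_deg, best_acc = degree, acc

# The test set is used exactly once, to report the final estimate
final = make_pipeline(PolynomialFeatures(best_deg),
                      LogisticRegression(max_iter=2000)).fit(X_train, y_train)
print(best_deg, final.score(X_test, y_test))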
Splitting a validation set

• If the test set is fixed, we can do CV on the training data to get the validation set:
  • Split 1: Fold 1 = validation set, remaining folds = training set, test set unchanged
  • Split 2: Fold 2 = validation set, remaining folds = training set, test set unchanged
  • Split 3: Fold 3 = validation set, remaining folds = training set, test set unchanged
Estimating the true performance of our pipeline

• Freeze the test set. Touch the test set as little as possible: the more often you look at it, the more you cheat.

• Example (CV on the training data to pick the hyperparameter, with a fixed test set):
  • Validation fold 1: best degree = 3, CV accuracy = 80, test accuracy = 75
  • Validation fold 2: best degree = 2, CV accuracy = 78, test accuracy = 80
  • Validation fold 3: best degree = 2, CV accuracy = 78, test accuracy = 85

• Estimate of the accuracy using our procedure = 80%

• Question: which model do we deploy?


Nested CV

• If the fixed test set is too small to give a reliable estimate, use nested CV. Or a mixture of techniques, e.g., leave-one-out CV with a validation set.
• Outer loop: rotate which part of the data is the test split (Test split 1, Test split 2, …).
• Inner loop: for each test split, run CV on the remaining data to get training and validation sets.
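
A compact nested-CV sketch, assuming scikit-learn: the inner GridSearchCV picks the hyperparameter, and the outer CV estimates the performance of the whole tune-then-train procedure. The SVC estimator and the C grid are placeholders, not the course's model.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Inner loop: 3-fold CV over the hyperparameter grid
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer loop: 3-fold CV that treats "tune, then train" as one pipeline
outer_scores = cross_val_score(inner, X, y, cv=3)
print(outer_scores.mean())   # estimate of the whole procedure's performance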


Size of split
• Train/validation/test split
  • 80/10/10, 90/5/5, 5-fold CV, leave-one-out CV, etc. for academia.
• For real applications, get dev and test sets that represent your users.
  • They should reflect the data you want to do well on.
  • There can be a mismatch between train and dev data, but avoid a mismatch between dev and test data.
  • If there are no users yet, recruit friends to pretend to be the users.

• Example: cat classifier.
  • Should you use ImageNet cat pictures as train/dev/test?
  • Go pretend you're a user and take cat pictures for the dev/test set.
Val and test set sizes
• Val – tune hyperparameters, select features, and make other decisions regarding the learning algorithm.
• Test – evaluate the performance of the algorithm, but do not use it to make any decisions about which learning algorithm or parameters to use.

• Val – big enough to notice the difference between algorithms (if you care about a 0.1% difference, make sure your dev set is large enough to spot it).
• Test – large enough to give confidence that your model will do well on the real task.
Congratulations on your first attempt at (almost) re-implementing a research paper!

Another trick to reduce 0 bins in histograms


Prediction and thresholds

Beer   Grass   Rice   Flood   Prediction   Metric (threshold = 0.5)
100    3       3      Yes     0.8          TP
20     1       1      Yes     0.3          FN
80     3       2      No      0.6          FA
40     1       1      No      0.2          TN
40     1       1      No      0.1          TN

What happens if I set my threshold at 0.5?

True positive rate = 1/2
False alarm rate = 1/3
Precision = 1/2
Recall = 1/2
Prediction and thresholds

Beer   Grass   Rice   Flood   Prediction   Metric (threshold = 0.15)
100    3       3      Yes     0.8          TP
20     1       1      Yes     0.3          TP
80     3       2      No      0.6          FA
40     1       1      No      0.2          FA
40     1       1      No      0.1          TN

What happens if I set my threshold at 0.15?

True positive rate = 1
False alarm rate = 2/3
Precision = 2/4
Recall = 2/2
Prediction and thresholds

Beer   Grass   Rice   Flood   Prediction
100    3       3      Yes     0.8
20     1       1      Yes     0.3
80     3       2      No      0.6
40     1       1      No      0.2
40     1       1      No      0.1

Threshold = 0.5                  Threshold = 0.15
True positive rate = 1/2         True positive rate = 1
False alarm rate = 1/3           False alarm rate = 2/3
Precision = 1/2                  Precision = 2/4
Recall = 1/2                     Recall = 2/2
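
A small sketch that reproduces the numbers above by thresholding the predicted scores (the labels and scores are copied from the table):

import numpy as np

y_true = np.array([1, 1, 0, 0, 0])            # Flood: Yes, Yes, No, No, No
scores = np.array([0.8, 0.3, 0.6, 0.2, 0.1])  # model predictions

def rates(threshold):
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false alarms
    fn = np.sum((y_pred == 0) & (y_true == 1))   # misses
    tn = np.sum((y_pred == 0) & (y_true == 0))
    tpr = tp / (tp + fn)          # true positive rate = recall
    far = fp / (fp + tn)          # false alarm rate
    prec = tp / (tp + fp)
    return tpr, far, prec

print(rates(0.5))    # (0.5, 0.333..., 0.5)
print(rates(0.15))   # (1.0, 0.666..., 0.5)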
Receiver Operating Characteristic (ROC) curve
• What if we change the threshold?
• FA vs. TP is a tradeoff. This is why we need to think of the application when thinking of metrics.
• Plot the FA rate and the TP rate as the threshold changes.

[ROC plot: TPR on the y-axis against FAR on the x-axis, both from 0 to 1]
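
A minimal sketch of plotting an ROC curve with scikit-learn, reusing the toy scores from the previous slides:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_true = [1, 1, 0, 0, 0]
scores = [0.8, 0.3, 0.6, 0.2, 0.1]

# roc_curve sweeps the threshold and returns matching (FAR, TPR) pairs
far, tpr, thresholds = roc_curve(y_true, scores)

plt.plot(far, tpr, marker="o")
plt.xlabel("False alarm rate (FAR)")
plt.ylabel("True positive rate (TPR)")
plt.title("ROC curve")
plt.show()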
Comparing detectors
• Which is better?

[Two ROC curves for different detectors plotted on the same TPR-vs-FAR axes]
Selecting the threshold
• Select based on the application.
• Trade off between TP and FA. Know your application, know your users.
• A miss is as bad as a false alarm: FAR = 1 − TPR, i.e., the line x = 1 − y on the ROC plot. This line has a special name: where it crosses the ROC curve gives the Equal Error Rate (EER).
Selecting the threshold
• Select based on the application.
• Trade off between TP and FA. Know your application, know your users. Is the application about safety?
• A miss is 1000 times more costly than a false alarm: FAR = 1000·(1 − TPR), i.e., the line x = 1000 − 1000y on the ROC plot.
Churn prediction
Predict whether a customer will stop their subscription, so we can send a promotional ad.
  • Usual subscription fee: 50
  • Cost of calling the customer: 5
  • Promotional subscription fee: 25
Describe the strategy to pick the threshold.
Churn prediction
Predict whether a customer will stop their subscription, so we can send a promotional ad.
  • Usual subscription fee: 50
  • Cost of calling the customer: 5
  • Promotional subscription fee: 25
Strategy to pick the threshold:
  • Cost of a miss = 50 (the lost subscription fee)
  • Cost of a FA = 30 (the call, 5, plus the discount given to a customer who would have stayed anyway, 50 − 25 = 25)
  • Trade-off line on the ROC curve: 30·FAR = 50·(1 − TPR)
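
A hedged sketch of that strategy in code: sweep candidate thresholds on held-out data and keep the one with the lowest total cost, using cost of a miss = 50 and cost of a false alarm = 30 from the slide. The labels and scores below are placeholders.

import numpy as np

COST_MISS = 50     # lost usual subscription fee
COST_FA = 30       # call (5) + needless discount (50 - 25 = 25)

# Placeholder validation labels (1 = will churn) and model scores
y_true = np.array([1, 1, 0, 0, 0, 1, 0, 0])
scores = np.array([0.9, 0.4, 0.7, 0.2, 0.1, 0.6, 0.3, 0.05])

best_thr, best_cost = None, np.inf
for thr in np.unique(scores):
    y_pred = scores >= thr
    misses = np.sum((~y_pred) & (y_true == 1))
    false_alarms = np.sum(y_pred & (y_true == 0))
    cost = COST_MISS * misses + COST_FA * false_alarms
    if cost < best_cost:
        best_thr, best_cost = thr, cost

print(best_thr, best_cost)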
Selecting the threshold
• Select based on the application.
• Trade off between TP and FA.
• Regulation or a hard threshold:
  • Cannot exceed 1 false alarm per year.
  • If 1 decision is made every day, FAR = 1/365 (the vertical line x = 1/365 on the ROC plot).
Notes about ROC
• Ways to compress the ROC into just a number for easier comparison -- use with care!
  • EER
  • Area under the curve (AUC)
  • F score
• Other similar curves:
  • Detection Error Tradeoff (DET) curve – plots the false alarm rate vs. the miss rate (MR); can be plotted on a log scale for clarity.
  • PR curve (precision-recall curve)
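
A short sketch computing two of these single-number summaries on placeholder scores: AUC via scikit-learn's roc_auc_score, and an approximate EER found where FAR ≈ 1 − TPR along the ROC curve.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 1, 0, 0, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.7, 0.2, 0.1, 0.6, 0.3, 0.8]

print("AUC:", roc_auc_score(y_true, scores))

# EER: the point on the ROC curve where FAR equals the miss rate (1 - TPR)
far, tpr, _ = roc_curve(y_true, scores)
idx = np.argmin(np.abs(far - (1 - tpr)))
print("EER (approx.):", (far[idx] + (1 - tpr[idx])) / 2)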
Summary
• Train-validation-test
• Hyperparameter vs parameter
• ROC
