Lec9 - Evaluation

Evaluation

Training Data and Test Data


• Training data: data used to build the model
• Test data: new data, not used in the training process

• Training performance is often a poor indicator of generalization performance
– Generalization is what we really care about in ML
– Easy to overfit the training data
– Performance on test data is a good indicator of generalization performance
– i.e., test accuracy is more important than training accuracy

Classification Metrics

accuracy = (# correct predictions) / (# test instances)

error = 1 - accuracy = (# incorrect predictions) / (# test instances)
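
A minimal sketch of these two metrics in Python (NumPy assumed; the arrays y_true and y_pred are hypothetical test-set labels, not from the slides):

```python
import numpy as np

# Hypothetical true labels and predicted labels for a test set
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# accuracy = (# correct predictions) / (# test instances)
accuracy = np.mean(y_true == y_pred)

# error = 1 - accuracy = (# incorrect predictions) / (# test instances)
error = 1 - accuracy

print(f"accuracy = {accuracy:.2f}, error = {error:.2f}")
```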
Confusion Matrix
• Given a dataset of P positive instances and N negative instances:
                     Predicted Class
                     Yes     No
Actual Class   Yes   TP      FN
               No    FP      TN

accuracy = (TP + TN) / (P + N)

• Imagine using the classifier to identify positive cases (i.e., for information retrieval):

precision = TP / (TP + FP)   (probability that a randomly selected result is relevant)
recall = TP / (TP + FN)   (probability that a randomly selected relevant document is retrieved)
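
A minimal sketch of the confusion-matrix counts and the two metrics, assuming NumPy and the same hypothetical y_true / y_pred vectors as above (positive class = 1):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # hypothetical test labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # hypothetical predictions

# Confusion-matrix counts (positive class = 1)
TP = np.sum((y_pred == 1) & (y_true == 1))
FP = np.sum((y_pred == 1) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))
TN = np.sum((y_pred == 0) & (y_true == 0))

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # = (TP + TN) / (P + N)
precision = TP / (TP + FP)                    # P(relevant | retrieved)
recall    = TP / (TP + FN)                    # P(retrieved | relevant)

print(f"accuracy = {accuracy:.2f}, precision = {precision:.2f}, recall = {recall:.2f}")
```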

Example: The Overfitting Phenomenon

[Figure: data generated by the true (simpler) model Y = aX + b + noise, fit by a complex model: a high-order polynomial in X.]

How Overfitting Affects Prediction

[Figure: predictive error vs. model complexity. Error on the training data decreases steadily as model complexity grows, while error on the test data first falls and then rises. Low complexity corresponds to underfitting, high complexity to overfitting; the ideal range for model complexity lies between the two.]
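
A minimal sketch of this effect, assuming NumPy: data are drawn from the simple linear model above, and polynomials of increasing degree are fit to a training split. The data values and degrees are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# The true (simpler) model: y = a*x + b + noise
x = rng.uniform(-1, 1, size=30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)

# Hold out half of the data as a test set
x_train, y_train = x[:15], y[:15]
x_test,  y_test  = x[15:], y[15:]

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # a more complex model as degree grows
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse  = np.mean((np.polyval(coeffs, x_test)  - y_test) ** 2)
    # Training error shrinks as degree grows; test error typically bottoms out and then rises
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```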
Comparing Classifiers
Say we have two classifiers, C1 and C2, and want to choose the best one to use for future predictions.

Can we use training accuracy to choose between them?
• No!
– e.g., C1 = pruned decision tree, C2 = k-NN
– training_accuracy(1-NN) = 100% (it memorizes the training set), but 1-NN may not generalize best

Instead, choose based on test accuracy...

Training and Test Data

Idea: split the full data set into training data and test data. Train each model on the “training data”, and then test each model’s accuracy on the test data (sketched below).
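
A minimal sketch of this comparison, assuming scikit-learn is available; the breast cancer data set is used only as a stand-in, and C1/C2 mirror the decision tree vs. k-NN example above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out part of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# C1: a depth-limited ("pruned") decision tree; C2: 1-NN
c1 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
c2 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# 1-NN scores 100% on the training data, but that says little about generalization
print("C1 train/test accuracy:", c1.score(X_train, y_train), c1.score(X_test, y_test))
print("C2 train/test accuracy:", c2.score(X_train, y_train), c2.score(X_test, y_test))
```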
k-Fold Cross-Validation
• Why just choose one particular “split” of the data?
– In principle, we should do this multiple times, since performance may be different for each split
• k-Fold Cross-Validation (e.g., k = 10)
– Randomly partition the full data set of n instances into k disjoint subsets (each roughly of size n/k)
– Choose each fold in turn as the test set; train the model on the other folds and evaluate
– Compute statistics over the k test performances, or choose the best of the k models (see the sketch after this list)
– Can also do “leave-one-out CV”, where k = n
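
A minimal sketch of k-fold CV, assuming scikit-learn; the data set and classifier are the same stand-ins as above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# k = 10: partition the n instances into 10 disjoint folds, each used once as the test set
kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf)

# Summary statistics over the k test performances
print("per-fold accuracy:", scores.round(3))
print("mean accuracy:", scores.mean().round(3), "+/-", scores.std().round(3))

# Leave-one-out CV is the special case k = n (sklearn.model_selection.LeaveOneOut)
```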
k-Fold Cross-Validation: Example (3-Fold CV)

[Figure: the full data set is partitioned into k folds. In the 1st partition the first fold is the test data and the remaining folds are the training data; in the 2nd partition the second fold is the test data; and so on through the kth partition. Each partition yields one test performance, and summary statistics are computed over the k test performances.]

Optimizing Model Parameters

[Figure: the test data is held out entirely. The training data is itself split k ways; in each partition a different fold serves as the validation set and the remaining folds as training data. The 1st partition finds that the optimal parameter P = p1, the 2nd that P = p2, ..., the kth that P = pk.]

Choose the value of p of the model with the best validation performance (a sketch follows).
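
A minimal sketch of this idea, assuming scikit-learn: cross-validation is run over the training data only (each fold plays the role of a validation set), and the held-out test data is touched once at the end. The parameter being tuned (k for k-NN) is a hypothetical stand-in for the parameter P in the slide.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Keep the test data completely out of the parameter search
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Cross-validate over the training data only; pick the parameter value
# with the best average validation performance
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 9, 15]},   # the parameter "P" being tuned
    cv=5)
search.fit(X_train, y_train)

print("chosen parameter:", search.best_params_)
print("test accuracy of the chosen model:", search.best_estimator_.score(X_test, y_test))
```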


More on Cross-Validation
• Cross-validation generates an approximate estimate of how well the classifier will do on “unseen” data
– As k → n, the model becomes more accurate (more training data)
– ...but CV becomes more computationally expensive
– Choosing k < n is a compromise

• Averaging over different partitions is more robust than just a single train/validate partition of the data

• It is an even better idea to do CV repeatedly!

Multiple Trials of k-Fold CV

1.) Loop for t trials:
a.) Randomize (shuffle) the full data set
b.) Perform k-fold CV on the shuffled data
2.) Compute statistics over the t x k test performances
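
A minimal sketch of repeated k-fold CV, assuming scikit-learn; the data is reshuffled before each repeat, giving t x k test performances:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, RepeatedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# t trials of k-fold CV: a fresh random partition is drawn for each repeat
t, k = 5, 10
cv = RepeatedKFold(n_splits=k, n_repeats=t, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)     # t * k test performances

print("number of scores:", len(scores))
print("mean accuracy:", scores.mean().round(3), "+/-", scores.std().round(3))
```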
