Evaluation
Training Data and Test Data
• Training data: data used to build the model
• Test data: new data, not used in the training process
• Training performance is often a poor indicator of generalization performance
– Generalization is what we really care about in ML
– Easy to overfit the training data
– Performance on test data is a good indicator of generalization performance
– i.e., test accuracy is more important than training accuracy
Classification Metrics
accuracy = (# correct predictions) / (# test instances)

error = 1 - accuracy = (# incorrect predictions) / (# test instances)
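A minimal sketch of these two metrics in Python, assuming the true labels and predictions are given as lists (the names y_true and y_pred are illustrative, not from the slides):

```python
# Sketch: accuracy and error on a test set.

def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)            # (# correct predictions) / (# test instances)

def error(y_true, y_pred):
    return 1.0 - accuracy(y_true, y_pred)   # (# incorrect predictions) / (# test instances)

y_true = [1, 0, 1, 1, 0]   # illustrative labels
y_pred = [1, 0, 0, 1, 1]   # illustrative predictions
print(accuracy(y_true, y_pred))  # 0.6
print(error(y_true, y_pred))     # 0.4
```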
Confusion Matrix
• Given a dataset of P positive instances and N negative instances:
                         Predicted Class
                         Yes       No
  Actual Class   Yes     TP        FN
                 No      FP        TN

  accuracy = (TP + TN) / (P + N)
• Imagine using the classifier to identify positive cases (i.e., for information retrieval):

  precision = TP / (TP + FP)
  (probability that a randomly selected result is relevant)

  recall = TP / (TP + FN)
  (probability that a randomly selected relevant document is retrieved)
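A sketch of how these counts and metrics could be computed, assuming labels are coded as 1 = positive and 0 = negative (the coding and the example labels are illustrative):

```python
# Sketch: confusion-matrix counts and the metrics derived from them.

def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

y_true = [1, 1, 1, 0, 0, 0, 0]   # P = 3 positives, N = 4 negatives (illustrative)
y_pred = [1, 1, 0, 1, 0, 0, 0]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)

accuracy  = (tp + tn) / (tp + fn + fp + tn)   # (TP + TN) / (P + N)
precision = tp / (tp + fp)                    # TP / (TP + FP)
recall    = tp / (tp + fn)                    # TP / (TP + FN)
print(accuracy, precision, recall)            # ~0.71, ~0.67, ~0.67
```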
Example: The Overfitting Phenomenon
• A complex model: Y = high-order polynomial in X
• The true (simpler) model: Y = aX + b + noise
How Overfitting Affects Prediction
[Figure: predictive error vs. model complexity. Error on the training data decreases steadily as model complexity grows, while error on the test data is U-shaped: high for overly simple models (underfitting), high for overly complex models (overfitting), and lowest in an ideal range of model complexity in between.]
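A small numerical sketch of this curve, assuming NumPy is available: data is drawn from the simple model Y = aX + b + noise (with illustrative a = 2, b = 1), and polynomials of increasing degree are fit to a small training set and scored on a larger test set. Training error typically keeps shrinking as the degree grows, while test error eventually rises.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """True (simpler) model: Y = 2*X + 1 + noise (constants are illustrative)."""
    x = rng.uniform(-3, 3, n)
    y = 2 * x + 1 + rng.normal(0, 1, n)
    return x, y

x_train, y_train = make_data(20)    # small training set -> easy to overfit
x_test, y_test = make_data(200)

for degree in [1, 3, 9]:            # increasing model complexity
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```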
Comparing Classifiers
Say we have two classifiers, C1 and C2, and want to choose the best one to use for future predictions
Can we use training accuracy to choose between them?
• No!
– e.g., C1 = pruned decision tree, C2 = K-NN
– training_accuracy(K-NN) = 100% (with K = 1, each training point is its own nearest neighbor), but K-NN may not be best on new data
Instead, choose based on test accuracy...
Training and Test Data
[Diagram: the full data set is split into training data and test data.]
Idea: train each model on the “training data”... and then test each model’s accuracy on the test data
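A sketch of this train/test comparison, picking up the C1 = pruned decision tree, C2 = K-NN example from the previous slide. It assumes scikit-learn, which the slides do not name; the toy dataset, split ratio, and model settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)   # toy data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)                    # 70/30 split (illustrative)

# C1: depth-limited tree standing in for a pruned tree; C2: K-NN with K = 1.
c1 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
c2 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# 1-NN scores 100% on its own training data, but the test accuracy is what
# matters for choosing between C1 and C2.
print("C1 train/test accuracy:", c1.score(X_train, y_train), c1.score(X_test, y_test))
print("C2 train/test accuracy:", c2.score(X_train, y_train), c2.score(X_test, y_test))
```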
k-Fold Cross-Validation
• Why just choose one particular “split” of the data?
– In principle, we should do this multiple times since performance may be different for each split
• k-Fold Cross-Validation (e.g., k = 10)
– Randomly partition the full data set of n instances into k disjoint subsets (each roughly of size n/k)
– Choose each fold in turn as the test set; train the model on the other folds and evaluate (see the sketch below)
– Compute statistics over the k test performances, or choose the best of the k models
– Can also do “leave-one-out CV”, where k = n
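A plain-NumPy sketch of the partitioning and evaluation loop described above; the toy 1-D data and the nearest-centroid classifier are illustrative stand-ins for a real model.

```python
import numpy as np

def k_fold_indices(n, k, rng):
    """Randomly partition indices 0..n-1 into k disjoint folds of roughly n/k each."""
    return np.array_split(rng.permutation(n), k)

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])   # toy 1-D data
y = np.concatenate([np.zeros(50), np.ones(50)])

k, scores = 10, []
for fold in k_fold_indices(len(X), k, rng):
    test_mask = np.zeros(len(X), dtype=bool)
    test_mask[fold] = True                       # this fold is the test set...
    X_tr, y_tr = X[~test_mask], y[~test_mask]    # ...the rest is the training set
    X_te, y_te = X[test_mask], y[test_mask]
    centroids = np.array([X_tr[y_tr == 0].mean(), X_tr[y_tr == 1].mean()])
    y_hat = np.argmin(np.abs(X_te[:, None] - centroids[None, :]), axis=1)
    scores.append(np.mean(y_hat == y_te))

print("mean accuracy over %d folds: %.3f" % (k, np.mean(scores)))   # summary statistic
```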
k-Fold Cross-Validation
Example: 3-Fold CV
[Diagram: the full data set is partitioned into k disjoint folds. In the 1st partition, fold 1 is the test data and the remaining folds are the training data; in the 2nd partition, fold 2 is the test data; ...; in the kth partition, fold k is the test data. Each partition yields a test performance, and summary statistics are computed over the k test performances.]
Optimizing Model Parameters
[Diagram: the test data is held out. The training data is split into k folds; in each partition, one fold serves as the validation set and the remaining folds as training data. Partition 1 finds that the optimal parameter value is P = p1, partition 2 finds P = p2, ..., partition k finds P = pk.]
Choose the value of P from the model with the best validation performance
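A sketch of this parameter-selection scheme, again assuming scikit-learn and treating K in K-NN as the parameter P (the candidate values and data are illustrative): the test data is held out, each training-data partition searches for the P with the best validation performance, and the overall best is kept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = [1, 3, 5, 9, 15]                    # candidate values of the parameter P
best_p, best_val = None, -np.inf
for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
    for p in candidates:                         # find the optimal P on this partition
        model = KNeighborsClassifier(n_neighbors=p).fit(X_train[tr_idx], y_train[tr_idx])
        val = model.score(X_train[val_idx], y_train[val_idx])
        if val > best_val:
            best_p, best_val = p, val

# Keep the P whose model had the best validation performance, retrain on all
# training data, and only now look at the held-out test set.
final = KNeighborsClassifier(n_neighbors=best_p).fit(X_train, y_train)
print("chosen P =", best_p, "test accuracy =", final.score(X_test, y_test))
```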
More on Cross-Validation
• Cross-validation generates an approximate estimate of how well the classifier will do on “unseen” data
– As k → n, the model becomes more accurate (more training data)
– ...but CV becomes more computationally expensive
– Choosing k < n is a compromise
• Averaging over different partitions is more robust than just a single train/validate partition of the data
• It is an even better idea to do CV repeatedly!
Multiple Trials of k-Fold CV
1.) Loop for t trials:
    a.) Randomize (shuffle) the full data set
    b.) Perform k-fold CV on the shuffled data
2.) Compute statistics over the t × k test performances
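A self-contained NumPy sketch of the two steps above; the trial count, fold count, and the toy nearest-centroid model are illustrative.

```python
import numpy as np

def fold_score(X, y, fold_idx):
    """Accuracy of a toy nearest-centroid classifier with one fold held out."""
    mask = np.zeros(len(X), dtype=bool)
    mask[fold_idx] = True
    X_tr, y_tr, X_te, y_te = X[~mask], y[~mask], X[mask], y[mask]
    centroids = np.array([X_tr[y_tr == 0].mean(), X_tr[y_tr == 1].mean()])
    y_hat = np.argmin(np.abs(X_te[:, None] - centroids[None, :]), axis=1)
    return np.mean(y_hat == y_te)

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 60), rng.normal(2, 1, 60)])   # toy 1-D data
y = np.concatenate([np.zeros(60), np.ones(60)])

t, k, scores = 5, 10, []
for _ in range(t):                                  # 1.) loop for t trials
    order = rng.permutation(len(X))                 # a.) shuffle the full data set
    for fold in np.array_split(order, k):           # b.) perform k-fold CV
        scores.append(fold_score(X, y, fold))

print("mean over t*k = %d runs: %.3f (std %.3f)"    # 2.) statistics over t x k results
      % (len(scores), np.mean(scores), np.std(scores)))
```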