Lecture 4: Model Selection¶

Can I trust you?

Joaquin Vanschoren

Evaluation¶

  • To know whether we can trust our method or system, we need to evaluate it.
  • Model selection: choose between different models in a data-driven way.
    • If you cannot measure it, you cannot improve it.
  • Convince others that your work is meaningful
    • Peers, leadership, clients, yourself(!)
  • When possible, try to interpret what your model has learned
    • The signal your model found may just be an artifact of your biased data
    • See 'Why Should I Trust You?' by Marco Ribeiro et al.


Designing Machine Learning systems¶

  • Just running your favourite algorithm is usually not a great way to start
  • Consider the problem: How to measure success? Are there costs involved?
    • Do you want to understand phenomena or do black box modelling?
  • Analyze your model's mistakes. Don't just finetune endlessly.
    • Build early prototypes. Should you collect more data, or different kinds of data?
    • Should the task be reformulated?
  • Overly complex machine learning systems are hard to maintain
    • See 'Machine Learning: The High Interest Credit Card of Technical Debt'


Real world evaluations¶

  • Evaluate predictions, but also how outcomes improve because of them
  • Beware of feedback loops: predictions can influence future input data
    • Medical recommendations, spam filtering, trading algorithms,...
  • Evaluate algorithms in the wild.
    • A/B testing: split users in groups, test different models in parallel
    • Bandit testing: gradually direct more users to the winning system


Performance estimation techniques¶

  • Always evaluate models as if they are predicting future data
  • We do not have access to future data, so we pretend that some data is hidden
  • Simplest way: the holdout (simple train-test split)
    • Randomly split data (and corresponding labels) into training and test set (e.g. 75%-25%)
    • Train (fit) a model on the training data, score on the test data
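
A minimal sketch of a holdout evaluation with scikit-learn, using the iris data purely as an example (the 75%-25% split is also scikit-learn's default):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out 25% of the data as a test set, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test set accuracy:", model.score(X_test, y_test))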

K-fold Cross-validation¶

  • Each random split can yield very different models (and scores)
    • e.g. all easy (or hard) examples could end up in the test set
  • Split data into k equal-sized parts, called folds
    • Create k splits, each time using a different fold as the test set
  • Compute k evaluation scores, aggregate afterwards (e.g. take the mean)
  • Examine the score variance to see how sensitive (unstable) models are
  • Large k gives better estimates (more training data), but is expensive

Can you explain this result?

kfold = KFold(n_splits=3)
cross_val_score(logistic_regression, iris.data, iris.target, cv=kfold)
Cross-validation scores KFold(n_splits=3):
[0. 0. 0.]
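
The culprit: the iris samples are ordered by class, and KFold does not shuffle by default, so each test fold contains exactly one class that the model never saw during training. A minimal sketch of one fix, shuffling before splitting (stratification, covered next, is another):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

iris = load_iris()
logistic_regression = LogisticRegression(max_iter=1000)

# Shuffle before splitting so that every fold contains samples from all classes
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
print(cross_val_score(logistic_regression, iris.data, iris.target, cv=kfold))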

Stratified K-Fold cross-validation¶

  • If the data is imbalanced, some classes have only a few samples
  • It is then likely that some classes are barely (or not at all) present in the test set
  • Stratification: proportions between classes are conserved in each fold
    • Order examples per class
    • Separate the samples of each class in k sets (strata)
    • Combine corresponding strata into folds
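
A minimal sketch with scikit-learn's StratifiedKFold, again on the iris data:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
# Class proportions are preserved in every fold, even without shuffling
skfold = StratifiedKFold(n_splits=3)
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skfold))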

Leave-One-Out cross-validation¶

  • k fold cross-validation with k equal to the number of samples
  • Completely unbiased (in terms of data splits), but computationally expensive
  • In practice, the resulting estimate generalizes less well to unseen data
    • The training sets overlap heavily, so the trained models (and their errors) are strongly correlated
    • The evaluation effectively overfits to the single dataset used for it
    • A different sample of the data can yield quite different results
  • Recommended only for small datasets
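
A sketch with LeaveOneOut on the iris data; each individual score is 0 or 1 (the single test sample is either classified correctly or not), so only the mean is informative:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
# One model fit per sample: 150 fits for the 150 iris examples
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("Folds:", len(scores), " Mean accuracy:", scores.mean())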

Shuffle-Split cross-validation¶

  • Shuffles the data, then randomly samples train_size points as the training set
  • Can also use a smaller test_size, which is handy for very large datasets
  • Never use if the data is ordered (e.g. time series)
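
A sketch with ShuffleSplit; the split sizes below are illustrative (train_size and test_size may sum to less than 1 to subsample very large datasets):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)
# 10 independent random 50%/50% train/test splits
shuffle_split = ShuffleSplit(n_splits=10, train_size=0.5, test_size=0.5, random_state=0)
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=shuffle_split))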

The Bootstrap¶

  • Sample n (the dataset size) data points, with replacement, as the training set (the bootstrap)
    • On average, a bootstrap sample contains about 63.2% of the unique data points (the rest are duplicates)
  • Use the unsampled (out-of-bootstrap) points as the test set
  • Repeat $k$ times to obtain $k$ scores
  • Similar to Shuffle-Split with train_size=0.63, test_size=0.37, except that the bootstrap training set contains duplicates
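
scikit-learn does not ship a bootstrap splitter, so here is a hand-rolled sketch with numpy (the number of repetitions and the iris data are illustrative):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
scores = []
for _ in range(10):  # k bootstrap repetitions
    # Sample n indices with replacement (the bootstrap), test on the left-out points
    train_idx = rng.integers(0, len(X), size=len(X))
    test_idx = np.setdiff1d(np.arange(len(X)), train_idx)
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print("Mean out-of-bootstrap accuracy:", np.mean(scores))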

Repeated cross-validation¶

  • Cross-validation is still biased in that the initial split can be made in many ways
  • Repeated, or n-times-k-fold cross-validation:
    • Shuffle data randomly, do k-fold cross-validation
    • Repeat n times, yields n times k scores
  • Unbiased, very robust, but n times more expensive
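
A sketch with scikit-learn's RepeatedStratifiedKFold (n_repeats reshuffles the data and reruns k-fold CV, yielding n_repeats * n_splits scores):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
rkfold = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=rkfold)
print(len(scores), "scores, mean:", scores.mean(), "std:", scores.std())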

Cross-validation with groups¶

  • Sometimes the data contains inherent groups:
    • Multiple samples from same patient, images from same person,...
  • Data from the same person could otherwise end up in both the training and the test set
  • We want to measure how well the model generalizes to other people
  • Make sure that all data from one person ends up entirely in either the train set or the test set
    • This is called grouping or blocking
    • Leave-one-subject-out cross-validation: test set for each subject/group
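
A sketch with GroupKFold on toy data (the patients, features, and labels are made up; with n_splits equal to the number of groups this amounts to leave-one-subject-out):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Toy data: 12 samples from 4 hypothetical patients (3 samples per patient)
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0])
groups = np.repeat([0, 1, 2, 3], 3)  # patient id per sample

# All samples of one patient land in either the training or the test fold, never both
print(cross_val_score(LogisticRegression(), X, y, groups=groups, cv=GroupKFold(n_splits=4)))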

Time series¶

When the data is ordered, random test sets are not a good idea

Test-then-train (prequential evaluation)¶

  • Every new sample is evaluated only once, then added to the training set
    • Can also be done in batches (of n samples at a time)
  • TimeSeriesSplit
    • In the kth split, the first k folds form the training set and the (k+1)th fold is the test set
    • Often, a maximum training set size (or window) is used
      • more robust against concept drift (change in data over time)
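
A sketch with TimeSeriesSplit on a toy ordered series (the data and window size are illustrative):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Toy ordered data standing in for a time series
X = np.arange(100).reshape(-1, 1).astype(float)
y = np.sin(X.ravel() / 5.0)

# Every split trains on (a window of) the past and tests on the next block of samples;
# max_train_size caps the training window to stay robust against concept drift
tscv = TimeSeriesSplit(n_splits=5, max_train_size=30)
print(cross_val_score(Ridge(), X, y, cv=tscv))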

Choosing a performance estimation procedure¶

No strict rules, only guidelines:

  • Always use stratification for classification (sklearn does this by default)
  • Use holdout for very large datasets (e.g. >1,000,000 examples)
    • Or when learners don't always converge (e.g. deep learning)
  • Choose k depending on dataset size and resources
    • Use leave-one-out for very small datasets (e.g. <100 examples)
    • Use cross-validation otherwise
      • Most popular (and theoretically sound): 10-fold CV
      • Literature suggests 5x2-fold CV is better
  • Use grouping or leave-one-subject-out for grouped data
  • Use test-then-train (prequential) evaluation for time series

Evaluation Metrics for Classification¶

Evaluation vs Optimization¶

  • Each algorithm optimizes a given objective function (on the training data)

    • E.g. remember L2 loss in Ridge regression $$\mathcal{L}_{Ridge} = \sum_{n=1}^{N} (y_n-(\mathbf{w}\mathbf{x_n} + w_0))^2 + \alpha \sum_{i=0}^{p} w_i^2$$
  • The choice of function is limited by what can be efficiently optimized

  • However, we evaluate the resulting model with a score that makes sense in the real world
    • Percentage of correct predictions (on a test set)
    • The actual cost of mistakes (e.g. in money, time, lives,...)
  • We also tune the algorithm's hyperparameters to maximize that score
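
A small sketch of this distinction: Ridge internally minimizes the regularized squared error above, yet we can evaluate it with a different, more interpretable metric via the scoring argument (the diabetes data and metric choice are illustrative):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
# The model is trained on the L2 objective, but scored with mean absolute error
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="neg_mean_absolute_error")
print(-scores.mean())  # sklearn negates error metrics so that higher is always better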

Binary classification¶

  • We have a positive and a negative class
  • Two different kinds of errors:
    • False Positive (type I error): model predicts positive while true label is negative
    • False Negative (type II error): model predicts negative while true label is positive
  • They are not always equally important
    • Which side do you want to err on for a medical test?


Confusion matrices¶

  • We can represent all predictions (correct and incorrect) in a confusion matrix
    • n by n array (n is the number of classes)
    • Rows correspond to true classes, columns to predicted classes
    • Count how often samples belonging to a class C are classified as C or any other class.
    • For binary classification, we label these true negative (TN), true positive (TP), false negative (FN), false positive (FP)
             Predicted Neg   Predicted Pos
Actual Neg   TN              FP
Actual Pos   FN              TP
confusion_matrix(y_test, y_pred): 
 [[48  5]
 [ 5 85]]
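
A self-contained sketch of how these counts can be obtained with scikit-learn, using the breast_cancer data as an illustrative binary task:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), LogisticRegression())
y_pred = clf.fit(X_train, y_train).predict(X_test)

# For binary problems, ravel() returns the four cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)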

Predictive accuracy¶

  • Accuracy can be computed based on the confusion matrix
  • Not useful if the dataset is very imbalanced
    • E.g. credit card fraud: is 99.99% accuracy good enough?
\begin{equation} \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \end{equation}
  • Three different models can make very different predictions yet achieve exactly the same accuracy
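
As a quick worked example, reading off the confusion matrix shown earlier (TN=48, FP=5, FN=5, TP=85):

\begin{equation} \text{Accuracy} = \frac{85 + 48}{85 + 48 + 5 + 5} = \frac{133}{143} \approx 0.93 \end{equation}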

Precision¶

  • Use when the goal is to limit FPs
    • Clinical trials: you only want to test drugs that really work
    • Search engines: you want to avoid bad search results
\begin{equation} \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \end{equation}

Recall¶

  • Use when the goal is to limit FNs
    • Cancer diagnosis: you don't want to miss a serious disease
    • Search engines: You don't want to omit important hits
  • Also known as sensitivity, hit rate, or true positive rate (TPR)
\begin{equation} \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \end{equation}

Comparison (figure)

F1-score¶

  • Trades off precision and recall:
\begin{equation} \text{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \end{equation}
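
A sketch computing all three metrics with scikit-learn, again using the breast_cancer data as an illustrative binary task:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_pred = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train).predict(X_test)

# All three metrics treat class 1 as the positive class by default
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))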