Machine Learning
Performance metrics
University of Bergamo
Outline
1. Metrics
2. Precision and recall
3. Receiver Operating Characteristic (ROC) curves
4. Worked example
Metrics
It is extremely important to use quantitative metrics for evaluating a machine learning model.
Until now, we relied on the cost function value for regression and classification.
Other metrics can be used to better evaluate and understand the model.
For classification: Accuracy/Precision/Recall/F1-score, ROC curves, …
For regression: Normalized RMSE, Normalized Mean Absolute Error (NMAE), …
Accuracy
Accuracy is a measure of how close the predictions of our model are to their true values.
If a classifier makes 10 predictions and 9 of them are correct, the accuracy is 90%.
Accuracy is a measure of how well a binary classifier correctly identifies or excludes a condition.
It is the proportion of correct predictions among the total number of cases examined.
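In terms of the confusion-matrix counts introduced later in these slides, this corresponds to the standard formula (stated here for reference, not on the original slide):
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
so that, for instance, 9 correct predictions out of 10 give an accuracy of $9/10 = 90\%$.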
Classification case: metrics for skewed classes
Disease dichotomous classification example:
Train a logistic regression model $h_\theta(x)$, with $y = 1$ if disease, $y = 0$ otherwise.
Suppose you find a very low error on the test set (almost all diagnoses are correct).
However, only a small fraction of the patients actually have the disease: the $y = 1$ class has very few samples with respect to the $y = 0$ class.
Then a classifier that always assigns the observations to the $y = 0$ class also achieves a very high accuracy!
For skewed classes, the accuracy metric can be deceptive.
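As an illustration, a minimal Matlab sketch (not part of the original slides; the 1000 patients and 5 positives are hypothetical numbers):

% Always predicting the majority class gives a high accuracy on skewed data
N = 1000;                      % hypothetical number of patients
y = zeros(N, 1); y(1:5) = 1;   % assume only 5 patients actually have the disease
y_hat = zeros(N, 1);           % a "classifier" that always predicts y = 0
accuracy = mean(y_hat == y)    % 0.995, even though no sick patient is detected
recall = sum(y_hat & y) / sum(y)   % 0: the classifier never finds the disease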
Outline
1. Metrics
2. Precision and recall
3. Receiver Operating Characteristic (ROC) curves
4. Worked example
Precision and recall
Suppose that $y = 1$ corresponds to a rare class that we want to detect.

Precision (how precise we are in the detection): of all patients for which we predicted $y = 1$, what fraction actually has the disease?
Precision $= TP / (TP + FP)$

Recall (how good we are at detecting): of all patients that actually have the disease, what fraction did we correctly detect as having the disease?
Recall $= TP / (TP + FN)$

Confusion matrix                     Actual class
                                    1 (p)                   0 (n)
Estimated     1 (Y)      True positive (TP)      False positive (FP)
class         0 (N)      False negative (FN)     True negative (TN)
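A minimal Matlab sketch of these two quantities, computed from vectors of true labels y and predicted labels y_hat (illustrative only, not from the slides):

TP = sum(y_hat == 1 & y == 1);   % predicted positive and actually positive
FP = sum(y_hat == 1 & y == 0);   % predicted positive but actually negative
FN = sum(y_hat == 0 & y == 1);   % missed positives
precision = TP / (TP + FP)
recall    = TP / (TP + FN)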
F1-score
It is usually better to compare models by means of one number only. The $F_1$ score can be used to combine precision and recall:
$F_1 = \frac{2 \, P \, R}{P + R}$

                 Precision (P)   Recall (R)   Average   F1 score
Algorithm 1      0.5             0.4          0.45      0.444
Algorithm 2      0.7             0.1          0.4       0.175
Algorithm 3      0.02            1.0          0.51      0.0392

Algorithm 3 always classifies $y = 1$. The simple average says (incorrectly) that Algorithm 3 is the best; according to the $F_1$ score, the best is Algorithm 1.
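As a quick check (worked out here, not shown explicitly on the slide), for Algorithm 1:
$$F_1 = \frac{2 \cdot 0.5 \cdot 0.4}{0.5 + 0.4} = \frac{0.4}{0.9} \approx 0.444$$
which matches the table, while the same computation for Algorithm 3 gives $2 \cdot 0.02 \cdot 1.0 / 1.02 \approx 0.0392$.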
Summaries of the confusion matrix
Different metrics can be computed from the confusion matrix, depending on the class of
interest (https://en.wikipedia.org/wiki/Precision_and_recall)
Outline
1. Metrics
2. Precision and recall
3. Receiver Operating Characteristic (ROC) curves
4. Worked example
Ranking instead of classifying
Classifiers such as logistic regression can output a probability of belonging to a class (or something similar).
We can use this to rank the different instances and take action on the cases at the top of the list.
We may have a budget, so we have to target the most promising individuals (see the sketch below).
Ranking enables the use of different techniques for visualizing model performance.
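A minimal Matlab sketch of this ranking step (my own illustration; the score vector and the budget K are hypothetical):

[~, order] = sort(scores, 'descend');   % highest predicted probability first
topK = order(1:K);                      % the K most promising instances to target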
Ranking instead of classifying
[Figure: 200 test instances, 100 positive (p) and 100 negative (n), listed in decreasing order of their classifier score (0.99, 0.98, 0.96, 0.90, 0.88, 0.87, 0.85, 0.80, 0.70, …). Lowering the classification threshold moves the cut point down the ranked list and yields a different confusion matrix at each threshold, e.g. (TP, FP) = (0, 0), (1, 0), (2, 0), (2, 1), (6, 4).]
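A minimal Matlab sketch of this threshold sweep (my own illustration, assuming vectors of scores and true labels y):

thresholds = sort(unique(scores), 'descend');
for t = thresholds'                      % one confusion matrix per threshold
    y_hat = scores >= t;                 % classify as positive above the threshold
    TP = sum(y_hat == 1 & y == 1);
    FP = sum(y_hat == 1 & y == 0);
    fprintf('threshold %.2f: TP = %d, FP = %d\n', t, TP, FP);
end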
Ranking instead of classifying
ROC curves are a very general way to represent and compare the performance of different models (on a binary classification task).

[Figure: ROC plane with Recall (True Positive Rate) on the vertical axis and 1 – specificity (False Positive Rate) on the horizontal axis. The point (0, 0) corresponds to classifying always negative, (1, 1) to classifying always positive, the top-left corner to perfection, and the diagonal to random guessing. A better classifier lies above the diagonal (better than random), a worse classifier below it (worse than random).]

Different classifiers can be compared.
Area Under the Curve (AUC): the probability that a randomly chosen positive instance will be ranked ahead of a randomly chosen negative instance.
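For reference (standard definitions, not spelled out on the slide), the two axes of the ROC plane are:
$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN} = 1 - \text{specificity}$$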
Outline
1. Metrics
2. Precision and recall
3. Receiver Operating Characteristic (ROC) curves
4. Worked examples
Breast cancer detection
Breast cancer is the most common cancer amongst women in the world.
It accounts for 25% of all cancer cases, and affected over 2.1 million people in 2015 alone.
It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen via X-ray or felt as lumps in the breast area.
The key challenge in its detection is how to classify tumors as malignant (cancerous) or benign (non-cancerous).
Goal: classify these tumors using machine learning and the Breast Cancer Wisconsin (Diagnostic) dataset.
Breast cancer Wisconsin dataset
This dataset has been taken from Kaggle.
Output: Class 4 stands for malignant cancer, Class 2 stands for benign cancer.

id_num    Clump      Uniformity    Uniformity     Marginal   Single Epithelial   Bare     Bland       Normal     Mitoses   Class
          Thickness  of Cell Size  of Cell Shape  Adhesion   Cell Size           Nuclei   Chromatin   Nucleoli
1041801   5          3             3              3          2                   3        4           4          1         4
1043999   1          1             1              1          2                   3        3           1          1         2
1044572   8          7             5              10         7                   9        5           5          4         4
1047630   7          4             6              4          6                   1        4           3          1         4
1048672   4          1             1              1          2                   1        2           1          1         2
1049815   4          1             1              1          2                   1        3           1          1         2
1050670   10         7             7              6          4                   10       4           1          2         4
…         …          …             …              …          …                   …        …           …          …         …
Breast cancer detection
We will use the dataset to compare different logistic regression models by means of the ROC curve associated with each of them.
To this aim we will work with 4 different feature sets (plus an extra one):
1. Case 1: the whole dataset
2. Case 2: the first group of 5 features
3. Case 3: the second group of 5 features
4. Case 4: only the first two features
Extra: after learning the model of Case 1, take only the features with the smallest p-values.
Matlab code

Class 4 stands for malignant cancer and it is for us the positive output: we set it to 1.
Class 2 stands for benign cancer and it is for us the negative output: we set it to 0.
perfcurve computes the points of the ROC curve as well as the AUC.

%% Load and clean data
data = readtable('breast_cancer_w.xlsx');  % load our data as a table
Phi = table2array(data(:,1:end-1));
y   = table2array(data(:,end));
y(y==4) = 1;  % in the original data 4 stands for malignant cancer
y(y==2) = 0;  % in the original data 2 stands for benign cancer
% Set up the data matrix appropriately, and add ones for the intercept term
[N, d] = size(Phi);
Phi = [ones(N, 1) Phi];  % add intercept term

%% Train and test data
mdl = fitglm(Phi, y, 'Distribution', 'binomial', 'Link', 'logit')

%% ============ Part 2: Compute the ROC curve ============
scores = mdl.Fitted.Probability;
[X, Y, T, AUC] = perfcurve(y, scores, 1);

% Plot the ROC curve
figure
plot(X, Y)
xlabel('False positive rate')
ylabel('True positive rate')
title('ROC for Classification by Logistic Regression')
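Cases 2–4 can be reproduced by refitting the same model on subsets of the columns of Phi and comparing the resulting ROC curves; a sketch under my own assumptions (the column indices are illustrative, not from the slides):

% Case 4: intercept column plus the first two features only
Phi4 = Phi(:, 1:3);
mdl4 = fitglm(Phi4, y, 'Distribution', 'binomial', 'Link', 'logit');
[X4, Y4, ~, AUC4] = perfcurve(y, mdl4.Fitted.Probability, 1);
% Overlay the ROC curves of Case 1 and Case 4
figure, plot(X, Y, X4, Y4)
legend(sprintf('Case 1 (AUC = %.3f)', AUC), sprintf('Case 4 (AUC = %.3f)', AUC4))
% For the extra case, the per-feature p-values of the full model are in mdl.Coefficients.pValue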
Results
Comparison of Cases 1, 2, 3 and 4.
Using only the first 2 features is not a smart choice.
Results
Comparison of Case 1, Case 4 and the best-features model.
Using only the best features provides a model that performs almost as well as using all the features.
Pneumonia detection
Suppose we have at our disposal X-ray images of lungs: healthy people and COVID-19 patients.
Acknowledgments
The COVID-19 X-ray images are curated by Dr. Joseph Cohen, a postdoctoral fellow at the University of Montreal, see https://josephpcohen.com/w/public-covid19-dataset/
The previous data contain only X-ray images of people with a disease. To collect images of healthy people, we can download another X-ray dataset from Kaggle: https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
The analysis is inspired by a tutorial by Adrian Rosebrock: https://www.pyimagesearch.com/2020/03/16/detecting-covid-19-in-x-ray-images-with-keras-tensorflow-and-deep-learning/
Pneumonia detection
We want to use a classifier to perform the classification:
Healthy patients: class 0
Patients with a disease: class 1
The input data are directly the X-ray images.
For these computer vision tasks, the state-of-the-art algorithms are Convolutional Neural Networks: we can use them to classify the images into healthy and disease.
Pneumonia detection
[Figure: example test X-ray images grouped by their true label, with the estimated label (COVID-19 or healthy) shown for each image.]
Pneumonia detection
Classification results on the test set:

Confusion matrix                      Actual class
                                     1 (p)                        0 (n)
Estimated     1 (Y)      True positive (TP) = 11      False positive (FP) = 0
class         0 (N)      False negative (FN) = 1      True negative (TN) = 11

Sensitivity (recall, true positive rate) = TP / (TP + FN) = 11/12 ≈ 0.92
Specificity (true negative rate) = TN / (TN + FP) = 11/11 = 1.00
Accuracy = (TP + TN) / (TP + TN + FP + FN) = 22/23 ≈ 0.96
Pneumonia detection
Classification results on the test set:
Sensitivity (recall, true positive rate): of the patients that do have COVID-19 (i.e., true positives), we could accurately identify them as "COVID-19 positive" 92% of the time using our model.
Specificity (true negative rate): of the patients that do not have COVID-19 (i.e., true negatives), we could accurately identify them as "COVID-19 negative" 100% of the time using our model.
Pneumonia detection
Classification results on the test set:
Being able to detect healthy patients with 100% accuracy is great: we do not want to quarantine someone for nothing.
…but we don't want to classify someone as «healthy» when they are «COVID-19 positive», since they could infect other people without knowing it.
Summary
Balancing sensitivity and specificity is incredibly challenging when it comes to medical applications.
The results should always be validated on another pool of people.
Furthermore, we need to be concerned about what the model is actually learning:
Do the results align with the medical knowledge?
Was the dataset representative of the population, or was there selection bias?
Summary
Furthermore, we need to be concerned about what the model is actually learning:
Did we account for all external factors (confounders) that could interfere with the response?