Supervised learning
Classification
Model evaluation
«Highly specialized digital skills in high-performance computing»
Classification
Assigning categories to objects
Classification
Goal of classification - assigning one* of
predefined class labels to an object
* special case – multi-label classification
Classification Example
• Example training database
• Two predictor attributes: Age and Car-type (S = Sport, M = Minivan, T = Truck)
• Age is an ordered attribute, Car-type is a categorical attribute
• Class label indicates whether the person bought the product
• Response attribute is categorical

Age  Car  Class
20   M    Yes
30   M    Yes
25   T    No
30   S    Yes
40   S    Yes
20   T    No
30   M    Yes
25   M    Yes
40   M    Yes
20   S    No
Definitions
• Random variables X1, …, Xk (predictor variables) and Y
(response or dependent variable)
• If Y is categorical, the problem is a classification problem
Predictor variables: X1, …, Xk. Response variable: Y (also called class or outcome).
Examples of classification approaches
• Artificial neural networks
• Decision trees
• Naive Bayes classifier
• Support vector machines
• ...
Classification of classification
• Eager learning (model based)
  • Naive Bayes
  • Decision trees, rules (inductive learning)
• Lazy learning (instance based)
  • K-nearest-neighbor

Several of the most commonly used learning approaches, such as neural networks and decision trees, use their training examples to construct an explicit global representation of the target function.

In contrast, instance-based learning algorithms simply store some or all of the training examples and postpone any generalization effort until a new instance must be classified. They can thus build query-specific local models, which attempt to fit the training examples only in a region around the query point.
Representation of patterns
• Useful patterns allow us to make nontrivial
predictions on new data
• a black box
• model with no human-readable explanation
• a transparent box
• rules
• decision trees
Approximation examples - finding the best hypothesis
[Figure: five plots a)-e), each showing a different hypothesis h fitted to the same observations of f(x)]
Function f is not known, therefore many different hypotheses h might be found.
Formalization
• Having observation pairs (x, f(x)), where x is the input and f(x) the output related to x, the goal of inductive learning in this case is: acquire a function h which approximates function f
• Function f is not known; only its results (outputs) are observed
• Function h is called a hypothesis of the function
• Acquisition of a function from observations of the form input - output is called inductive learning
The contact lenses data
Age Spectacle Astigmatism Tear production Recommended
prescription rate lenses
Young Myope No Reduced None
Young Myope No Normal Soft
Young Myope Yes Reduced None
Young Myope Yes Normal Hard
Young Hypermetrope No Reduced None
Young Hypermetrope No Normal Soft
Young Hypermetrope Yes Reduced None
Young Hypermetrope Yes Normal Hard
Pre-presbyopic Myope No Reduced None
Pre-presbyopic Myope No Normal Soft
Pre-presbyopic Myope Yes Reduced None
Pre-presbyopic Myope Yes Normal Hard
Pre-presbyopic Hypermetrope No Reduced None
Pre-presbyopic Hypermetrope No Normal Soft
Pre-presbyopic Hypermetrope Yes Reduced None
Pre-presbyopic Hypermetrope Yes Normal None
Presbyopic Myope No Reduced None
Presbyopic Myope No Normal None
Presbyopic Myope Yes Reduced None
Presbyopic Myope Yes Normal Hard
Presbyopic Hypermetrope No Reduced None
Presbyopic Hypermetrope No Normal Soft
Presbyopic Hypermetrope Yes Reduced None
Presbyopic Hypermetrope Yes Normal None
Structural descriptions
Example: if-then rules
If tear production rate = reduced
then recommendation = none
Otherwise, if age = young and astigmatic = no
then recommendation = soft
Age Spectacle Astigmatism Tear production Recommended
prescription rate lenses
Young Myope No Reduced None
Young Hypermetrope No Normal Soft
Pre-presbyopic Hypermetrope No Reduced None
Presbyopic Myope Yes Normal Hard
… … … … …
A decision tree for the contact lenses problem
[Figure: a decision tree with its root node, branches, and leaf nodes labelled]
Process of building classification model

Training data (class labels known):
Age | Spectacle prescription | Astigmatism | Tear production rate | Recommended lenses
Young | Myope | No | Reduced | None
Young | Hypermetrope | No | Normal | Soft
Pre-presbyopic | Hypermetrope | No | Reduced | None
Presbyopic | Myope | Yes | Normal | Hard
… | … | … | … | …

New cases to classify:
Young | Hypermetrope | Yes | Reduced | Soft / ?
Young | Myope | Yes | Normal | ?

Recommended lenses = …    Accuracy: …
https://www.kdnuggets.com/2016/10/decision-trees-concise-technical-overview.html
Now YOU classify!
Case 1: Young | Hypermetrope | Yes | Reduced (true class: Soft)
• According to the tree classifier, would this case be classified correctly?
Recommended lenses = none. Accuracy: 0%
Case 2: Young | Myope | Yes | Normal (true class unknown)
• What would be the classification for this patient?
Recommended lenses = hard
Example - Learning in decision trees
• Problem: Should we wait for a table in a restaurant?
• Goal: Learn a decision tree that allows generating the decision
Example - Learning in decision trees
• Friday / Saturday – true if it is Friday or Saturday;
• Hungry – Are we hungry?
• Customers – how many customers are in the restaurant
(None, Few, Full);
• Price level - $ cheap, $$ affordable, $$$ expensive;
• Rain – true if it is raining;
• Reservation – do we have a reservation;
• Type – French, Italian, Thai, Fast-food;
• Waiting time - 0-10 min, 10-30 min, 30-60 min, >60 min.
Data set
N. | Alter. | Bar | Fri./Saturday | Hungry | Customers | Price | Rain | Reservation | Type | Wait time | Decision: Wait for the table?
X1 Yes No No Yes Few $$$ No Yes French 0-10 Yes
X2 Yes No No Yes Full $ No No Thai 30-60 No
X3 No Yes No No Few $ No No Fast f. 0-10 Yes
X4 Yes No Yes Yes Full $ No No Thai 10-30 Yes
X5 Yes No Yes No Full $$$ No Yes French >60 No
X6 No Yes No Yes Few $$ Yes Yes Italian 0-10 Yes
X7 No Yes No No None $ Yes No Fast f. 0-10 No
X8 No No No Yes Few $$ Yes Yes Thai 0-10 Yes
X9 No Yes Yes No Full $ Yes No Fast f. >60 No
X10 Yes Yes Yes Yes Full $$$ No Yes Italian 10-30 No
X11 No No No No None $ No No Thai 0-10 No
X12 Yes Yes Yes Yes Full $ No No Fast f. 30-60 Yes
Learning decision trees
• Simplistic approach: create a decision tree where each path corresponds to a particular example. The number of attribute tests in each path corresponds to the total number of attributes
• All observations will be classified
• Problem:
  • The tree simply «remembers» all observations
  • It does not approximate the observations, which limits the possibilities to use the tree for unseen future observations
  • The tree will be too large
Learning decision trees
• Induction approach, or Occam's razor:
  • The most likely hypothesis is the simplest one that corresponds to all observations
  • Most probably, a simpler hypothesis is better than a more complex one due to better approximation
• Main idea:
  • We always select the attribute that splits the observations in the best way
  • This allows building a tree with a small number of attribute tests
  • Thereby the tree will approximate the observations well
How does it work?
Split the set of instances into subsets such that the variation within each subset becomes smaller.
Constructing decision trees
• Strategy: top down, in a recursive divide-and-conquer fashion
• First: select attribute for root node (Create branch for each
possible attribute value)
• Then: split instances into subsets (One for each branch extending
from the node)
• Finally: repeat recursively for each branch, using only instances
that reach the branch
• Stop if all instances have the same class
Selection of the right attribute

Full training set: +: X1, X3, X4, X6, X8, X12; -: X2, X5, X7, X9, X10, X11

Split on Type:
• French: +: X1; -: X5
• Italian: +: X6; -: X10
• Thai: +: X4, X8; -: X2, X11
• Fast food: +: X3, X12; -: X7, X9

Split on Customers:
• None: +: (none); -: X7, X11
• Few: +: X1, X3, X6, X8; -: (none)
• Full: +: X4, X12; -: X2, X5, X9, X10
Selection of the right attribute

Full training set: +: X1, X3, X4, X6, X8, X12; -: X2, X5, X7, X9, X10, X11

Split on Customers:
• None: -: X7, X11 → leaf: No
• Few: +: X1, X3, X6, X8 → leaf: Yes
• Full: +: X4, X12; -: X2, X5, X9, X10 → split on Hungry

Split on Hungry (Customers = Full):
• Yes: +: X4, X12; -: X2, X10 → split on Type
• No: -: X5, X9 → leaf: No

Split on Type (Customers = Full, Hungry = Yes):
• French: (no examples)
• Italian: -: X10 → leaf: No
• Thai: +: X4; -: X2 → split on Weekend (Friday/Saturday)
• Fast food: +: X12 → leaf: Yes

Split on Weekend (Customers = Full, Hungry = Yes, Type = Thai):
• No: -: X2 → leaf: No
• Yes: +: X4 → leaf: Yes
Weather data
Criterion for attribute selection
• Which is the best attribute?
• Want to get the smallest tree
• Heuristic: choose the attribute that produces the "purest" nodes
• Popular impurity criterion: information gain
• Information gain increases with the average purity of the subsets
• Strategy: choose attribute that gives greatest information
gain
• gain can be measured in bits
Continuing to split
• gain(Temperature) = 0.571
• gain(Humidity) = 0.971
• gain(Windy) = 0.020
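These gain values can be reproduced in a few lines of Python. The sketch below assumes the split being continued is the outlook = sunny branch of the standard weather data (five instances: 2 yes, 3 no), which is consistent with the numbers shown:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    """Entropy of the parent minus the weighted entropy of the subsets
    produced by splitting on the attribute at attr_index."""
    n = len(labels)
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr_index] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# The outlook = sunny subset of the weather data: (Temperature, Humidity, Windy)
rows = [("hot", "high", False), ("hot", "high", True), ("mild", "high", False),
        ("cool", "normal", False), ("mild", "normal", True)]
play = ["no", "no", "no", "yes", "yes"]

for i, name in enumerate(["Temperature", "Humidity", "Windy"]):
    print(name, round(info_gain(rows, i, play), 3))
```

Humidity splits this subset into two pure nodes, so its gain equals the full entropy of the subset (0.971 bits) and it is chosen next.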
Final decision tree
• Note: not all leaves need to be pure; sometimes
identical instances have different classes
• Splitting stops when data can’t be split any further
Rule generation
• Covering methods
If true then class = b
If x > 1.2 then class = a
If x > 1.2 and y > 2.6 then class = a
Could add more rules, get "perfect" rule set
Rule generation
• From decision tree
IF outlook = sunny AND humidity = high THEN play = no
• Sometimes too complex
The weather problem
Conditions for playing golf
Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild Normal False Yes
… … … … …
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
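The rule set above is an ordered decision list: the first matching rule fires. A minimal sketch of how such a list classifies a record (attribute names lowercased here for convenience):

```python
# The five rules from the slide as (condition, class) pairs, tried in order.
rules = [
    (lambda r: r["outlook"] == "sunny" and r["humidity"] == "high", "no"),
    (lambda r: r["outlook"] == "rainy" and r["windy"], "no"),
    (lambda r: r["outlook"] == "overcast", "yes"),
    (lambda r: r["humidity"] == "normal", "yes"),
    (lambda r: True, "yes"),  # "if none of the above"
]

def classify(record):
    for condition, cls in rules:
        if condition(record):
            return cls

print(classify({"outlook": "sunny", "humidity": "high", "windy": False}))  # no
print(classify({"outlook": "overcast", "humidity": "high", "windy": False}))  # yes
```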
Weather data with mixed attributes
Some attributes have numeric values
Outlook Temperature Humidity Windy Play
Sunny 85 85 False No
Sunny 80 90 True No
Overcast 83 86 False Yes
Rainy 75 80 False Yes
… … … … …
If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
Continuous values
Classical decision tree algorithms like ID3 or C4.5 require discrete values
for their attributes.
When continuous values are used, the algorithm may substitute a Boolean test
to check whether the value falls into a particular interval:
Temp. 40 48 60 72 80 90
Decision No No Yes Yes Yes No
The CART (Classification And Regression Tree) algorithm handles both categorical
and numerical values. The scikit-learn library uses an optimized version of the
CART algorithm; however, the scikit-learn implementation does not support
categorical variables directly.
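To illustrate how a CART-style algorithm picks a numeric threshold, here is a hand-rolled sketch of the Gini-based split search applied to the temperature test above (a simplified sketch, not scikit-learn's optimized implementation):

```python
def gini(labels):
    """Binary Gini impurity: 2p(1-p) for class proportion p."""
    if not labels:
        return 0.0
    p = labels.count("Yes") / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Try the midpoint between each pair of consecutive sorted values and
    return the threshold with the lowest weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    best_t, best_g = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        g = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if g < best_g:
            best_t, best_g = t, g
    return best_t, best_g

# The temperature example from the slide
temps = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_split(temps, play))  # threshold 54.0 separates the two leading "No" rows
```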
Example of CART - Iris
[Figure: a CART tree node, annotated with the factor name and test, the Gini coefficient value, the number of samples tested by the test, the class value distribution, and the class value if the node is a leaf]
Example of C4.5 (J48 in Weka) - Iris
J48 pruned tree
------------------
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
| petalwidth <= 1.7
| | petallength <= 4.9: Iris-versicolor (48.0/1.0)
| | petallength > 4.9
| | | petalwidth <= 1.5: Iris-virginica (3.0)
| | | petalwidth > 1.5: Iris-versicolor (3.0/1.0)
| petalwidth > 1.7: Iris-virginica (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9
Example of J48 decision list (PART in Weka)
PART decision list
------------------
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth <= 1.7 AND
petallength <= 4.9: Iris-versicolor (48.0/1.0)
: Iris-virginica (52.0/3.0) default rule
Number of Rules : 3
Comparison

J48 pruned tree
------------------
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9

PART decision list
------------------
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth <= 1.7 AND
petallength <= 4.9: Iris-versicolor (48.0/1.0)
: Iris-virginica (52.0/3.0)
Number of Rules : 3
Example of RIPPER (JRIP in Weka) - Iris
JRIP rules:
===========
(petallength >= 3.3) and (petalwidth <= 1.6) and
(petallength <= 4.9) => class=Iris-versicolor
(46.0/0.0)
(petallength <= 1.9) => class=Iris-setosa (50.0/0.0)
=> class=Iris-virginica (54.0/4.0) default rule
Number of Rules : 3
Example of C4.5 (J48 in Weka) – play golf
J48 pruned tree
------------------
outlook = sunny
| humidity = high: no (3.0)
| humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)
Number of Leaves : 5
Size of the tree : 8
Random Forest model
Machado, Gustavo & Recamonde-Mendoza, Mariana & Corbellini, Luís. (2015). What variables are important in predicting
bovine viral diarrhea virus? A random forest approach. Veterinary Research. 46. 85. 10.1186/s13567-015-0219-7.
Overfitting and underfitting
• Underfitting: the model does not capture the concept
  • Low training accuracy
  • Low testing accuracy
• Overfitting: the model 'misses the point'
  • High training accuracy
  • Low testing accuracy
• Balanced model
  • High training accuracy
  • High testing accuracy
Overfitting - solutions
The main methods are:
• Methods that stop training when overfitting sets in
• Methods that let the tree overfit and then prune it
Regardless of the method selected, the main question is the tree size at which the tree should be pruned.
C4.5 uses the tree-pruning approach.
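A minimal sketch of the "overfit, then prune" idea, using reduced-error pruning against a held-out validation set (a simpler scheme than the error-based pruning C4.5 actually uses; the tree encoding and data below are hypothetical):

```python
def classify(tree, rec):
    """A leaf is a class string; an inner node is (attribute, {value: subtree})."""
    while not isinstance(tree, str):
        attr, branches = tree
        tree = branches[rec[attr]]
    return tree

def accuracy(tree, data):
    return sum(classify(tree, r) == c for r, c in data) / len(data)

def prune(tree, val):
    """Bottom-up: replace a subtree by the majority class whenever that does
    not hurt accuracy on the validation set."""
    if isinstance(tree, str) or not val:
        return tree
    attr, branches = tree
    pruned = (attr, {v: prune(st, [(r, c) for r, c in val if r[attr] == v])
                     for v, st in branches.items()})
    labels = [c for _, c in val]
    leaf = max(set(labels), key=labels.count)  # majority class on validation data
    return leaf if accuracy(leaf, val) >= accuracy(pruned, val) else leaf if False else (leaf if accuracy(leaf, val) >= accuracy(pruned, val) else pruned)

# An overfitted tree: the test under "sunny" is spurious (both branches say "no")
tree = ("outlook", {
    "sunny": ("windy", {True: "no", False: "no"}),
    "overcast": "yes",
    "rainy": ("windy", {True: "no", False: "yes"}),
})
val = [({"outlook": "sunny", "windy": True}, "no"),
       ({"outlook": "sunny", "windy": False}, "no"),
       ({"outlook": "overcast", "windy": False}, "yes"),
       ({"outlook": "rainy", "windy": True}, "no"),
       ({"outlook": "rainy", "windy": False}, "yes")]
print(prune(tree, val))  # the spurious "sunny" subtree collapses to the leaf "no"
```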
Classifying handwritten digits
(Deep) Neural Networks
For each pixel in the input
image, the pixel's intensity
is encoded as the value for
a corresponding neuron in
the input layer.
For the 28×28 pixel
images, this means that
the network has 784
(=28×28) input neurons.
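A sketch of that input encoding: flatten the image row by row and scale each intensity from 0-255 to [0, 1], giving one value per input neuron (toy image assumed here):

```python
# A 28x28 grayscale image as nested lists of 0-255 intensities
image = [[0] * 28 for _ in range(28)]
image[14][14] = 255  # a single bright pixel in the middle

# One value per input neuron, in row-major order
input_layer = [pixel / 255.0 for row in image for pixel in row]

print(len(input_layer))           # 784 input neurons (28*28)
print(input_layer[14 * 28 + 14])  # 1.0 - the bright pixel
```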
Lazy learning: K-nearest-neighbor
• Classification without model building!
K-nearest-neighbor
• Difficulties:
• Data must be stored; for classification of a new record, all data
must be available
• Computationally expensive in high dimensions
• The right choice of k is not known in advance
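A minimal sketch of k-nearest-neighbor classification on a hypothetical 2-D data set; note that no model is built, the stored examples are consulted only at query time:

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, query, k=3):
    """Sort the stored examples by distance to the query and take a
    majority vote among the k nearest."""
    neighbors = sorted(train, key=lambda example: dist(example[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical training set: (point, class)
train = [((1, 1), "No"), ((1, 2), "No"), ((2, 1), "No"),
         ((6, 6), "Yes"), ((7, 6), "Yes"), ((6, 7), "Yes")]

print(knn_predict(train, (2, 2)))  # "No" - the nearest neighbours form the No cluster
```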
Summary on classification
• Some methods create models, others do not
• Models are either explanatory or black-box
• Applicable both to nominal and to (discretized) numerical data types
Classification vs Regression
• Similarities
  • Both learn from a data set
  • Supervised
• Difference
  • In classification, each example has a class associated with it
  • In regression, each example has a numerical value associated with it
• Regression is considered a subtype of classification learning: classes are not discrete categories, but continuous numeric values
Regression / Classification
Sum up
Regression – attempts to find a
function which models the data
with the least error. Can be
seen as classification for
numeric data.
Classification - the problem of
identifying to which category a
new observation belongs, on
the basis of a training set of
data containing observations
(or instances) whose category
membership is known. https://in.springboard.com/blog/regression-vs-classification-in-machine-learning/
Model credibility
Evaluation of what’s been learned
Evaluation
• Training and testing
• Hold-out
• Percentage split (70 % / 30 %)
• Cross-validation
• N-fold cross-validation
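Both evaluation schemes can be sketched in a few lines of plain Python (no ML library assumed):

```python
import random

def holdout_split(data, train_frac=0.7, seed=0):
    """Percentage split: shuffle once, then cut into train and test parts."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def nfold_indices(n_items, n_folds):
    """N-fold cross-validation: every item appears in the test fold exactly once."""
    folds = [list(range(n_items))[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

data = list(range(10))
train, test = holdout_split(data)
print(len(train), len(test))  # 7 3
```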
Evaluation with confusion matrix

                       True class
                       A                      B
Predicted    A         True Positive (TP)     False Positive (FP)
class        B         False Negative (FN)    True Negative (TN)

• Accuracy = (TP+TN) / (TP+TN+FP+FN)
• Precision = TP / (TP+FP)
• Recall = TP / (TP+FN)
• F-score = 2*Precision*Recall / (Precision+Recall)
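The four measures can be computed directly from the confusion-matrix counts; the sketch below uses the counts from the spam-filter example that follows:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Counts with Class A = legitimate mail
acc, prec, rec, f1 = confusion_metrics(tp=282, fp=5, fn=3, tn=10)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```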
Spam filter
• Class A - no spam
• Class B - spam
• This week, 300 incoming mails:
  • 285 regular mails
  • 15 junk mails
• Classifier says:
  • 287 regular (5 actually junk)
  • 13 junk (3 actually regular)

Resulting confusion matrix (A = legitimate mail):
TP = 282, FP = 5, FN = 3, TN = 10
If Class A - legitimate mail

                       True class
                       A                           B
Predicted    A         True Positive (TP) = 282    False Positive (FP) = 5
class        B         False Negative (FN) = 3     True Negative (TN) = 10

• Accuracy = (TP+TN) / (TP+TN+FP+FN) = 292 / 300 = 0.973
• Precision = TP / (TP+FP) = 282 / 287 = 0.983
• Recall = TP / (TP+FN) = 282 / 285 = 0.989
• F-score = 2*Precision*Recall / (Precision+Recall) = 2*0.983*0.989 / (0.983+0.989) ≈ 0.986
If Class A - spam

                       True class
                       A                           B
Predicted    A         True Positive (TP) = 10     False Positive (FP) = 3
class        B         False Negative (FN) = 5     True Negative (TN) = 282

• Accuracy = (TP+TN) / (TP+TN+FP+FN) = 292 / 300 = 0.973
• Precision = TP / (TP+FP) = 10 / 13 = 0.769
• Recall = TP / (TP+FN) = 10 / 15 = 0.667
• F-score = 2*Precision*Recall / (Precision+Recall) = 2*0.769*0.667 / (0.769+0.667) ≈ 0.714
Evaluating each class

                       True class
                       A                           B
Predicted    A         True Positive (TP) = 282    False Positive (FP) = 5
class        B         False Negative (FN) = 3     True Negative (TN) = 10

• Specificity = TN / (TN+FP) = 10 / 13 = 0.769
• Sensitivity = TP / (TP+FN) = 282 / 285 = 0.989 (= Recall)
https://towardsdatascience.com/should-i-look-at-precision-recall-or-specificity-sensitivity-3946158aace1
Evaluation in Weka
Imbalanced classes
Take-away practical work on
classification!
To review this lesson…
• www.menti.com
• CODE: 4627 9622