
Classification

The document discusses supervised learning, specifically focusing on classification and model evaluation. It explains the process of assigning predefined class labels to objects, various classification approaches like decision trees and neural networks, and the importance of attribute selection for building effective classification models. Additionally, it provides examples and methodologies for constructing decision trees and generating rules from them.


Supervised learning

Classification
Model evaluation

«Highly specialized digital skills in high-performance computing»


Classification
Assigning categories to objects
Classification
The goal of classification is assigning one* of the predefined class labels to an object.

* Special case: multi-label classification, where an object may receive several labels.
Classification Example
• Example training database
• Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
• Age is ordered; Car-type is a categorical attribute
• Class label indicates whether the person bought the product
• Response attribute is categorical

Age  Car  Class
20   M    Yes
30   M    Yes
25   T    No
30   S    Yes
40   S    Yes
20   T    No
30   M    Yes
25   M    Yes
40   M    Yes
20   S    No
Definitions
• Random variables X1, …, Xk (predictor variables) and Y (response or dependent variable)
• If Y is categorical, the problem is a classification problem
• The response variable is also called the class or the outcome
Examples of classification approaches
• Artificial neural networks
• Decision trees
• Naive Bayes classifier
• Support vector machines
• ...

Classification of classification
• Eager learning
  • Model based
  • Naive Bayes
  • Decision trees, rules (inductive learning)
• Lazy learning
  • Instance based
  • K-nearest-neighbor

Several of the most commonly used learning approaches, such as neural networks and decision trees, use their training examples to construct an explicit global representation of the target function.

In contrast, instance-based learning algorithms simply store some or all of the training examples and postpone any generalization effort until a new instance must be classified. They can thus build query-specific local models, which attempt to fit the training examples only in a region around the query point.
Representation of patterns
• Useful patterns allow us to make nontrivial
predictions on new data
• a black box
• model with no human-readable explanation
• a transparent box
• rules
• decision trees

Approximation examples - finding the best hypothesis
[Figure: five plots a)–e) of candidate hypotheses h fitted to the observed points of f(x)]

Function f is not known, therefore many different hypotheses h might be found.

Formalization

• Having observation pairs (x, f(x)), where x is an input and f(x) is the output related to x, the goal of inductive learning is to acquire a function h which approximates function f
• Function f is not known; only its results (outputs) are observed
• Function h is called a hypothesis of function f
• Acquisition of a function from observations of the form input–output is called inductive learning
The contact lenses data
Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Myope                   No           Normal                Soft
Young           Myope                   Yes          Reduced               None
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Young           Hypermetrope            Yes          Reduced               None
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   No           Reduced               None
Pre-presbyopic  Myope                   No           Normal                Soft
Pre-presbyopic  Myope                   Yes          Reduced               None
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            No           Reduced               None
Pre-presbyopic  Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   No           Reduced               None
Presbyopic      Myope                   No           Normal                None
Presbyopic      Myope                   Yes          Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            No           Reduced               None
Presbyopic      Hypermetrope            No           Normal                Soft
Presbyopic      Hypermetrope            Yes          Reduced               None
Presbyopic      Hypermetrope            Yes          Normal                None
Structural descriptions
Example: if-then rules
If tear production rate = reduced
then recommendation = none
Otherwise, if age = young and astigmatic = no
then recommendation = soft

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            No           Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
…               …                       …            …                     …
A decision tree for contact lenses problem

[Figure: a decision tree for the contact lenses problem, with the root node, branches, and leaf nodes labelled]
Process of building classification model
[Figure: the training data (left) is used to build a tree classifier, which is then applied to unseen cases to predict "Recommended lenses" and to measure accuracy]

https://www.kdnuggets.com/2016/10/decision-trees-concise-technical-overview.html
Now YOU classify!
• According to the tree classifier, would this case be classified correctly?

Age    Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young  Hypermetrope            Yes          Reduced               Soft / ?

Recommended lenses = none
Accuracy: 0%

• What would be the classification for this patient?

Age    Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young  Myope                   Yes          Normal                ?

Recommended lenses = hard

Example - Learning in decision trees
• Problem: should we wait for a table in the restaurant?
• Goal: learn a decision tree that allows us to make the decision
Example - Learning in decision trees
• Friday / Saturday – true if it is Friday or Saturday;
• Hungry – Are we hungry?
• Customers – how many customers are in the restaurant
(None, Few, Full);
• Price level - $ cheap, $$ affordable, $$$ expensive;
• Rain – true if it is raining;
• Reservation – do we have a reservation;
• Type – French, Italian, Thai, Fast-food;
• Waiting time - 0-10 min, 10-30 min, 30-60 min, >60 min.
Data set
N.   Alter.  Bar  Fri./Sat.  Hungry  Customers  Price  Rain  Reservation  Type     Wait time  Decision: Wait for the table?
X1   Yes     No   No         Yes     Few        $$$    No    Yes          French   0-10       Yes
X2   Yes     No   No         Yes     Full       $      No    No           Thai     30-60      No
X3   No      Yes  No         No      Few        $      No    No           Fast f.  0-10       Yes
X4   Yes     No   Yes        Yes     Full       $      No    No           Thai     10-30      Yes
X5   Yes     No   Yes        No      Full       $$$    No    Yes          French   >60        No
X6   No      Yes  No         Yes     Few        $$     Yes   Yes          Italian  0-10       Yes
X7   No      Yes  No         No      None       $      Yes   No           Fast f.  0-10       No
X8   No      No   No         Yes     Few        $$     Yes   Yes          Thai     0-10       Yes
X9   No      Yes  Yes        No      Full       $      Yes   No           Fast f.  >60        No
X10  Yes     Yes  Yes        Yes     Full       $$$    No    Yes          Italian  10-30      No
X11  No      No   No         No      None       $      No    No           Thai     0-10       No
X12  Yes     Yes  Yes        Yes     Full       $      No    No           Fast f.  30-60      Yes
Learning decision trees
• Simplistic approach: create a decision tree where each path corresponds to a particular example. The number of attribute tests in each path corresponds to the total number of attributes
• All observations will be classified
• Problem:
  • The tree simply «remembers» all observations
  • It does not approximate the observations, limiting the possibilities to use the tree for unseen future observations
  • The tree will be too large
Learning decision trees
• Induction approach, or Occam's razor:
  • The most likely hypothesis is the simplest one that corresponds to all observations
  • Most probably, a simpler hypothesis is better than a more complex one due to better approximation
• Main idea:
  • We always select the attribute which splits the observations in the best way
  • This allows us to build a tree with a small number of attribute tests
  • Thereby the tree will approximate the observations well
How does it work?

Split the set of instances into subsets such that the variation within each subset becomes smaller.
Constructing decision trees
• Strategy: top-down, in a recursive divide-and-conquer fashion
• First: select an attribute for the root node (create a branch for each possible attribute value)
• Then: split instances into subsets (one for each branch extending from the node)
• Finally: repeat recursively for each branch, using only instances that reach the branch
• Stop if all instances have the same class
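The recursive strategy above can be sketched in a few lines. This is a minimal ID3-style sketch under stated assumptions, not the exact algorithm of any particular tool; the attribute names and the four-row contact-lenses slice are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, attrs):
    """rows: list of (attribute-dict, class-label) pairs.
    Returns a class label (leaf) or a nested dict {attr: {value: subtree}}."""
    labels = [label for _, label in rows]
    if len(set(labels)) == 1:          # stop: all instances share one class
        return labels[0]
    if not attrs:                      # no attributes left: take majority class
        return Counter(labels).most_common(1)[0][0]

    def gain(a):                       # information gain of splitting on a
        groups = {}
        for feats, label in rows:
            groups.setdefault(feats[a], []).append(label)
        remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder

    best = max(attrs, key=gain)        # attribute that splits best
    rest = [a for a in attrs if a != best]
    return {best: {v: build_tree([(f, l) for f, l in rows if f[best] == v], rest)
                   for v in {f[best] for f, _ in rows}}}

# Illustrative 4-row slice of the contact-lenses data
rows = [({"tear": "reduced", "astigmatism": "no"},  "none"),
        ({"tear": "reduced", "astigmatism": "yes"}, "none"),
        ({"tear": "normal",  "astigmatism": "no"},  "soft"),
        ({"tear": "normal",  "astigmatism": "yes"}, "hard")]
tree = build_tree(rows, ["tear", "astigmatism"])
print(tree["tear"]["reduced"])                        # none
print(tree["tear"]["normal"]["astigmatism"]["yes"])   # hard
```

Information gain picks "tear" for the root, since the "reduced" branch is already pure; only the "normal" branch needs a further test.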
Selection of the right attribute

Splitting the full training set (+: X1, X3, X4, X6, X8, X12; -: X2, X5, X7, X9, X10, X11):

• Split on Type:
  French:    +: X1        -: X5
  Italian:   +: X6        -: X10
  Thai:      +: X4, X8    -: X2, X11
  Fast food: +: X3, X12   -: X7, X9
• Split on Customers:
  None: +: (none)          -: X7, X11
  Few:  +: X1, X3, X6, X8  -: (none)
  Full: +: X4, X12         -: X2, X5, X9, X10

Customers separates the examples much better: "None" and "Few" become pure, and only "Full" needs further splitting.

Continuing under Customers = Full (+: X4, X12; -: X2, X5, X9, X10), split on Hungry:
  Yes: +: X4, X12  -: X2, X10
  No:  +: (none)   -: X5, X9

Under Hungry = Yes (+: X4, X12; -: X2, X10), split on Type:
  French:    (empty)
  Italian:   +: (none)  -: X10
  Thai:      +: X4      -: X2
  Fast food: +: X12     -: (none)

Under Type = Thai, split on Friday/Saturday (weekend):
  Yes: +: X4      -: (none)
  No:  +: (none)  -: X2
Weather data

[Figure: the weather ("play golf") data set and the information-gain computation for the root attribute]
Criterion for attribute selection
• Which is the best attribute?
  • We want to get the smallest tree
  • Heuristic: choose the attribute that produces the "purest" nodes
• Popular impurity criterion: information gain
  • Information gain increases with the average purity of the subsets
• Strategy: choose the attribute that gives the greatest information gain
  • Gain can be measured in bits
Continuing to split

• gain(Temperature ) = 0.571
• gain(Humidity ) = 0.971
• gain(Windy ) = 0.020
38
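The gains above can be reproduced by hand. A minimal sketch, assuming (as in the classic weather data) that this slide scores the outlook = sunny subset, which contains 2 "yes" and 3 "no" instances:

```python
import math

def entropy(counts):
    """Entropy in bits of a class-count distribution, e.g. [2, 3]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_gain(parent, subsets):
    """Parent entropy minus the size-weighted entropy of the subsets."""
    total = sum(parent)
    remainder = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - remainder

# outlook = sunny subset: 2 yes, 3 no (entropy 0.971)
# humidity = high -> 0 yes, 3 no;  humidity = normal -> 2 yes, 0 no
print(round(info_gain([2, 3], [[0, 3], [2, 0]]), 3))   # 0.971
# windy = false -> 1 yes, 2 no;  windy = true -> 1 yes, 1 no
print(round(info_gain([2, 3], [[1, 2], [1, 1]]), 3))   # 0.02
```

Humidity splits the sunny subset into two pure nodes, so its gain equals the whole parent entropy, which is why it wins.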
Final decision tree

• Note: not all leaves need to be pure; sometimes identical instances have different classes
• Splitting stops when data can't be split any further
Rule generation
• Covering methods: add conditions to a rule step by step

If true then class = b
If x > 1.2 then class = a
If x > 1.2 and y > 2.6 then class = a

• Could add more rules, get "perfect" rule set
Rule generation
• From decision tree:
  IF outlook = sunny AND humidity = high THEN play = no
• Sometimes too complex
The weather problem
Conditions for playing golf
Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild Normal False Yes
… … … … …

If outlook = sunny and humidity = high then play = no


If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes

Weather data with mixed attributes

Some attributes have numeric values


Outlook Temperature Humidity Windy Play
Sunny 85 85 False No
Sunny 80 90 True No
Overcast 83 86 False Yes
Rainy 75 80 False Yes
… … … … …

If outlook = sunny and humidity > 83 then play = no


If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes

Continuous values
Classical decision tree algorithms like ID3 or C4.5 require discrete values
for their attributes.
In the case of continuous values, the algorithm might use Boolean test
substitutes to check whether the value falls into a particular interval:

Temp.     40   48   60   72   80   90
Decision  No   No   Yes  Yes  Yes  No

The CART (Classification And Regression Tree) algorithm handles both categorical
and numerical values. The scikit-learn library uses an optimized version of the
CART algorithm; however, the scikit-learn implementation does not support
categorical variables directly.
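For the Temp. table above, such a Boolean test Temp <= t can be found by scoring each candidate threshold (midpoints between adjacent distinct values) with information gain. A minimal sketch, stdlib only:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between adjacent distinct values of a numeric attribute
    and return (threshold, information gain) of the best binary split."""
    pairs = sorted(zip(values, labels))
    best = (None, -1.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        g = entropy(labels) - (len(left) * entropy(left) +
                               len(right) * entropy(right)) / len(labels)
        if g > best[1]:
            best = (t, g)
    return best

temp = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
t, g = best_threshold(temp, play)
print(t, round(g, 3))   # 54.0 0.459
```

The best single test here is Temp <= 54, which cleanly separates the first two "No" decisions; note that one binary test cannot capture the later Yes-to-No change at 90, which is why such attributes may be tested again deeper in the tree.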
Example of CART - Iris

[Figure: a CART tree for the Iris data; each node shows the factor name and test, the Gini coefficient value, the number of samples tested by the test, the class value distribution, and the class value if it is a leaf]
Example of C4.5 (J48 in Weka) - Iris
J48 pruned tree
------------------

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves : 5

Size of the tree : 9
Example of a decision list (PART in Weka) - Iris

PART decision list
------------------

petalwidth <= 0.6: Iris-setosa (50.0)

petalwidth <= 1.7 AND
petallength <= 4.9: Iris-versicolor (48.0/1.0)

: Iris-virginica (52.0/3.0)  ← default rule

Number of Rules : 3
Comparison
J48 pruned tree
------------------
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9

PART decision list
------------------
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth <= 1.7 AND
petallength <= 4.9: Iris-versicolor (48.0/1.0)
: Iris-virginica (52.0/3.0)
Number of Rules : 3
Example of RIPPER (JRIP in Weka) - Iris

JRIP rules:
===========

(petallength >= 3.3) and (petalwidth <= 1.6) and
(petallength <= 4.9) => class=Iris-versicolor (46.0/0.0)
(petallength <= 1.9) => class=Iris-setosa (50.0/0.0)
=> class=Iris-virginica (54.0/4.0)  ← default rule

Number of Rules : 3
Example of C4.5 (J48 in Weka) – play golf
J48 pruned tree
------------------

outlook = sunny
| humidity = high: no (3.0)
| humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)

Number of Leaves : 5

Size of the tree : 8


Random Forest model

[Figure: a Random Forest model, reproduced from Machado, Gustavo & Recamonde-Mendoza, Mariana & Corbellini, Luís. (2015). What variables are important in predicting bovine viral diarrhea virus? A random forest approach. Veterinary Research. 46. 85. 10.1186/s13567-015-0219-7.]
Overfitting and underfitting

• Underfitting: the model does not capture the concept
  • Low training accuracy, low testing accuracy
• Overfitting: the model 'misses the point'
  • High training accuracy, low testing accuracy
• Balanced model
  • High training accuracy, high testing accuracy
Overfitting - solutions
The main methods are:
• Methods that stop training in case of overfitting
• Methods that allow the tree to overfit and then prune it

Regardless of the method selected, the main question is the tree size at which the tree is pruned.

C4.5 uses the tree-pruning approach.
Classifying handwritten digits

(Deep) Neural Networks

For each pixel in the input image, the pixel's intensity is encoded as the value for a corresponding neuron in the input layer.
For the 28×28 pixel images, this means that the network has 784 (= 28×28) input neurons.
Lazy learning: K-nearest-neighbor
• Classification without model building!

K-nearest-neighbor
• Difficulties:
  • Data must be stored; for classification of a new record, all data must be available
  • Computationally expensive in high dimensions
  • The choice of k is not known in advance
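Despite these difficulties, the method itself is tiny. A minimal sketch with Euclidean distance and majority voting; the 2-D points and labels are illustrative toy data, not from the slides:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by a majority vote among its k nearest neighbours.
    `train` is a list of ((x, y), label) pairs; distance is Euclidean."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set: one cluster labelled "no", one labelled "yes"
train = [((1, 1), "no"), ((1, 2), "no"), ((2, 1), "no"),
         ((5, 5), "yes"), ((6, 5), "yes"), ((5, 6), "yes")]
print(knn_predict(train, (5.5, 5.2)))   # yes
print(knn_predict(train, (1.5, 1.5)))   # no
```

Note that every query re-scans the whole training set, which illustrates the storage and computation issues listed above; real implementations use spatial index structures such as k-d trees to speed this up.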
Summary on classification
• Methods which create or do not create models
• Models are explanatory (transparent) or black-box
• Both for nominal and numerical (discretized) data types

Classification vs Regression
• Similarities
  • Both learn from a data set
  • Supervised
• Difference
  • In classification, each example has a class associated
  • In regression, each example has a numerical value associated

• Regression is considered a subtype of classification learning: classes are not discrete categories, but continuous numeric values
Regression / Classification
Sum up
Regression attempts to find a function which models the data with the least error. It can be seen as classification for numeric data.

Classification is the problem of identifying to which category a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

https://in.springboard.com/blog/regression-vs-classification-in-machine-learning/
Model credibility
Evaluation of what’s been learned

Evaluation
• Training and testing
• Hold-out
  • Percentage split (70% / 30%)
• Cross-validation
  • N-fold cross-validation
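N-fold cross-validation simply partitions the examples into N test folds and trains on the rest each time. A minimal index-level sketch, stdlib only and without shuffling (real tools usually shuffle or stratify first):

```python
def kfold(n, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation
    over n examples; every index lands in exactly one test fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

for train, test in kfold(10, 5):
    print(test)   # each index appears in exactly one test fold
```

Each model is fit on the 8 training indices and evaluated on the 2 held-out ones; averaging the 5 scores gives the cross-validated estimate.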
Evaluation with confusion matrix
                         True class
                    A                     B
Predicted  A   True Positive (TP)    False Positive (FP)
class      B   False Negative (FN)   True Negative (TN)

• Accuracy = (TP+TN) / (TP+TN+FP+FN)
• Precision = TP / (TP+FP)
• Recall = TP / (TP+FN)
• F-score = 2*Precision*Recall / (Precision+Recall)
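These four formulas are one-liners. The sketch below evaluates them on the spam-filter counts used in the slides that follow (TP = 282, FP = 5, FN = 3, TN = 10):

```python
def evaluate(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f_score   = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

acc, prec, rec, f1 = evaluate(tp=282, fp=5, fn=3, tn=10)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# 0.973 0.983 0.989 0.986
```

Swapping which class counts as "positive" changes precision, recall and F-score but not accuracy, which is exactly what the two spam-filter slides below demonstrate.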
Spam filter

• Class A – no spam
• Class B – spam
• This week 300 incoming mails
  • 285 regular mails
  • 15 junk mails
• Classifier says:
  • 287 regular (5 actually junk)
  • 13 junk (3 actually regular)

                         True class
                    A                           B
Predicted  A   True Positive (TP) = 282    False Positive (FP) = 5
class      B   False Negative (FN) = 3     True Negative (TN) = 10
If Class A – legitimate mail

                         True class
                    A                           B
Predicted  A   True Positive (TP) = 282    False Positive (FP) = 5
class      B   False Negative (FN) = 3     True Negative (TN) = 10

• Accuracy = (TP+TN) / (TP+TN+FP+FN) = 292 / 300 = 0.973
• Precision = TP / (TP+FP) = 282 / 287 = 0.983
• Recall = TP / (TP+FN) = 282 / 285 = 0.989
• F-score = 2*Precision*Recall / (Precision+Recall) = 2*0.983*0.989 / (0.983+0.989) = 0.986
If Class A – spam

                         True class
                    A                           B
Predicted  A   True Positive (TP) = 10     False Positive (FP) = 3
class      B   False Negative (FN) = 5     True Negative (TN) = 282

• Accuracy = (TP+TN) / (TP+TN+FP+FN) = 292 / 300 = 0.973
• Precision = TP / (TP+FP) = 10 / 13 = 0.769
• Recall = TP / (TP+FN) = 10 / 15 = 0.667
• F-score = 2*Precision*Recall / (Precision+Recall) = 2*0.769*0.667 / (0.769+0.667) = 0.714
Evaluating each class

                         True class
                    A                           B
Predicted  A   True Positive (TP) = 282    False Positive (FP) = 5
class      B   False Negative (FN) = 3     True Negative (TN) = 10

• Specificity = TN / (TN+FP) = 10 / 15 = 0.667 (= Recall of class B)
• Sensitivity = TP / (TP+FN) = 282 / 285 = 0.989 (= Recall of class A)
https://towardsdatascience.com/should-i-look-at-precision-recall-or-specificity-sensitivity-3946158aace1
Evaluation in Weka

Imbalanced classes

Take-away practical work on classification!

To review this lesson…

• www.menti.com
• CODE: 4627 9622
