
Data Mining

Classification: Basic Concepts and Techniques

Lecture Notes for Chapter 3

Introduction to Data Mining, 2nd Edition

by Tan, Steinbach, Karpatne, Kumar


Predictive task

 The objective of a predictive task is to predict the value or the label of a particular attribute based on the values of other attributes.

 The attribute to be predicted is commonly known as the target or dependent variable, while the other attributes are known as explanatory or independent variables.


Predictive task

 There are two types of predictive modeling tasks:

 Classification.
Used to predict the value of a discrete target variable.
 Regression.
Used to predict the value of a continuous target variable.
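To make the distinction concrete, here is a minimal sketch using scikit-learn (assumed available; the toy feature matrix and target values are invented purely for illustration):

# A minimal sketch contrasting the two predictive tasks. scikit-learn is
# assumed available; the toy data below is invented for illustration.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[25, 40000], [35, 60000], [45, 80000], [20, 20000]]

# Classification: the target variable is discrete (a class label).
clf = DecisionTreeClassifier().fit(X, ["no", "yes", "yes", "no"])
print(clf.predict([[30, 50000]]))   # -> a discrete label

# Regression: the target variable is continuous (a numeric value).
reg = DecisionTreeRegressor().fit(X, [10.5, 22.0, 31.4, 7.9])
print(reg.predict([[30, 50000]]))   # -> a continuous value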


Predictive task

 Examples:
 Predicting whether a web user will make a purchase at an online bookstore is a classification task because the target variable is a binary-valued attribute.

 Forecasting the future price of a stock is a regression task because price is a continuous-valued attribute.


Predictive task

 The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable.


Predictive task

 Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables.

 The main strategy is to collect a training set and build a mapping function (that is, fit a model) from the input variables (X) to the output variable (y).


Classification: Definition

 Given a collection of records (training set)
– Each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label
 x: attribute, predictor, independent variable, input
 y: class, response, dependent variable, output

 Task:
– Learn a model that maps each attribute set x into one of the predefined class labels y



Classification: Definition

 In classification, the attribute set x can contain attributes of any type, while the class label y must be categorical.


Examples of Classification Task

Task                      Attribute set, x                       Class label, y
Categorizing email        Features extracted from email          spam or non-spam
messages                  message header and content
Identifying tumor cells   Features extracted from MRI scans      malignant or benign cells


General Approach for Building Classification Model

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

A learning algorithm is applied to the training set to learn a model (Induction).

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

The model is then applied to the test set to predict the unknown class labels (Deduction).
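A sketch of this induction/deduction loop on the exact tables above, assuming scikit-learn and pandas are available; the categorical attributes are one-hot encoded before fitting:

# A sketch of the induction/deduction workflow, using the training and
# test sets from the figure. scikit-learn and pandas are assumed
# available; incomes are stored in thousands (125 for 125K).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "Attrib1": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Attrib2": ["Large", "Medium", "Small", "Medium", "Large",
                "Medium", "Large", "Small", "Medium", "Small"],
    "Attrib3": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Class":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
test = pd.DataFrame({
    "Attrib1": ["No", "Yes", "Yes", "No", "No"],
    "Attrib2": ["Small", "Medium", "Large", "Small", "Large"],
    "Attrib3": [55, 80, 110, 95, 67],
})

# One-hot encode the categorical attributes, aligning test columns to train.
X_train = pd.get_dummies(train.drop(columns="Class"))
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

model = DecisionTreeClassifier().fit(X_train, train["Class"])  # induction
print(model.predict(X_test))                                   # deduction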


General Approach for Building Classification Model

 Classification is the task of assigning labels to unlabeled data instances, and a classifier is used to perform such a task.

 A classifier is typically described in terms of a model.


General Approach for Building Classification Model

 The model is created using a given set of instances, known as the training set, which contains attribute values as well as class labels for each instance.


General Approach for Building Classification Model

 The systematic approach for learning a classification model given a training set is known as a learning algorithm.

 The process of using a learning algorithm to build a classification model from the training data is known as induction, learning a model, or building a model.
General Approach for Building Classification Model

 The process of applying a classification model on unseen test instances to predict their class labels is known as deduction.


General Approach for Building Classification Model

 Thus, the process of classification involves two steps:

Induction
– Applying a learning algorithm to training data to learn a model.

Deduction
– Applying the model to assign labels to unlabeled instances.


General Approach for Building Classification Model

 Note on terminology:
– The terms “classifier” and “model” are often taken to be synonymous.
– The term “classifier” is often used in a more general sense to refer to a classification technique.


Classification Techniques

 Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Neural Networks
– Deep Learning
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines

 Ensemble Classifiers
– Boosting, Bagging, Random Forests
Decision Tree Classifier

 The decision tree is one of the most powerful and popular tools for classification and prediction.

 A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is a way to display an algorithm that contains only conditional control statements.


Decision Tree Classifier

 A decision tree is a tree-like structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.


Example of a Decision Tree

Training Data:
ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Model: Decision Tree (Home Owner is the first splitting attribute)
- Home Owner = Yes: predict Defaulted = NO
- Home Owner = No:
  - Marital Status = Married: predict Defaulted = NO
  - Marital Status = Single or Divorced:
    - Annual Income < 80K: predict Defaulted = NO
    - Annual Income > 80K: predict Defaulted = YES
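Since a decision tree is just conditional control statements, the model above can be sketched directly as nested if/else logic (a plain Python sketch; incomes are in thousands):

# A sketch of the tree above as plain conditional control statements:
# each if-test is an internal node, each return value is a leaf's class
# label. Annual income is given in thousands (e.g. 125 for 125K).
def defaulted(home_owner, marital_status, annual_income):
    if home_owner == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # marital_status is "Single" or "Divorced"
    if annual_income < 80:
        return "No"
    return "Yes"

For instance, record 5 (No, Divorced, 95K) reaches the right-most leaf: defaulted("No", "Divorced", 95) returns "Yes", matching the training table.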
Another Example of Decision Tree

For the same training data, a different tree (MarSt is the first splitting attribute):
- Marital Status = Married: predict Defaulted = NO
- Marital Status = Single or Divorced:
  - Home Owner = Yes: predict Defaulted = NO
  - Home Owner = No:
    - Annual Income < 80K: predict Defaulted = NO
    - Annual Income > 80K: predict Defaulted = YES

There could be more than one tree that fits the same data!


Apply Model to Test Data

Test Data:
Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Start from the root of the tree and, at each node, follow the branch that matches the test record:
- Root node: Home Owner = No, so take the No branch to the Marital Status node.
- Marital Status = Married, so take the Married branch, which leads to a leaf labeled NO.
- Assign Defaulted Borrower = "No" to the test record.
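Using the defaulted() sketch from the earlier example, this whole traversal is a single call:

# The traversal above, expressed with the defaulted() sketch defined
# after the "Example of a Decision Tree" slide.
print(defaulted("No", "Married", 80))   # -> "No"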


Decision Tree Classification Task

The same general framework as before, specialized to decision trees: a tree induction algorithm is applied to the training set (Tid 1-10) to learn a decision tree model (Induction), and the tree is then applied to the test set (Tid 11-15) to predict the unknown class labels (Deduction).
Decision Tree Induction

 Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT


Hunt’s Algorithm

 Hunt's algorithm grows a decision tree in a recursive fashion by partitioning the training records into successively purer subsets.

(The growth of the tree on the Defaulted Borrower data, with class counts at each node, is illustrated two slides ahead.)


General Structure of Hunt’s Algorithm

 Let Dt be the set of training records that reach a node t

 General Procedure:
– If Dt contains records that all belong to the same class y, then t is a leaf node labeled as y
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

(The accompanying figure shows the Defaulted Borrower training data from the earlier example reaching a node t whose subtree is still to be determined.)
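A compact Python sketch of this recursive procedure, under simplifying assumptions: records are (attributes, label) pairs, and a best_split function (assumed given, e.g. chosen with an impurity measure) returns a branching function, or None when no useful test remains (such as records with identical attribute values):

# A sketch of Hunt's algorithm. best_split(records) is assumed to return
# a function mapping an attribute dict to a branch outcome, or None.
from collections import Counter

def hunt(records, best_split):
    labels = [y for _, y in records]
    if len(set(labels)) == 1:
        return labels[0]                   # Dt is pure: leaf labeled y
    split = best_split(records)            # attribute test condition
    if split is None:                      # identical attributes, no test:
        return Counter(labels).most_common(1)[0][0]   # majority-class leaf
    branches = {}
    for attrs, label in records:
        branches.setdefault(split(attrs), []).append((attrs, label))
    # Recursively apply the procedure to each smaller subset.
    return {outcome: hunt(subset, best_split)
            for outcome, subset in branches.items()}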


Hunt’s Algorithm

Applied to the Defaulted Borrower training data, the tree grows in stages (class counts shown as (No, Yes)):

(a) A single leaf: Defaulted = No (7,3)

(b) Split on Home Owner:
    Yes -> Defaulted = No (3,0)
    No  -> Defaulted = No (4,3)

(c) Further split the Home Owner = No branch on Marital Status:
    Single, Divorced -> Defaulted = Yes (1,3)
    Married          -> Defaulted = No (3,0)

(d) Further split the Single/Divorced branch on Annual Income:
    < 80K  -> Defaulted = No (1,0)
    >= 80K -> Defaulted = Yes (0,3)
Design Issues of Decision Tree Induction

 How should training records be split?
– Method for specifying test condition
   depending on attribute types
– Measure for evaluating the goodness of a test condition

 How should the splitting procedure stop?
– Stop splitting if all the records belong to the same class or have identical attribute values
– Early termination
Methods for Expressing Test Conditions

 Depends on attribute types
– Binary
– Nominal
– Ordinal
– Continuous

 Depends on number of ways to split
– 2-way split
– Multi-way split


Test Condition for Nominal Attributes

 Multi-way split:
– Use as many partitions as distinct values.
  Example: Marital Status -> Single | Divorced | Married

 Binary split:
– Divides values into two subsets (an enumeration sketch follows).
  Example: Marital Status -> {Married} | {Single, Divorced}
       or  Marital Status -> {Single} | {Married, Divorced}
       or  Marital Status -> {Single, Married} | {Divorced}
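A small Python sketch, assuming we simply want to enumerate these candidate 2-way groupings of a nominal attribute (a k-valued attribute has 2^(k-1) - 1 of them):

# A sketch enumerating the candidate binary splits of a nominal
# attribute: each nonempty proper subset and its complement form one
# 2-way partition; mirror duplicates are suppressed.
from itertools import combinations

def binary_splits(values):
    values = sorted(values)
    for r in range(1, len(values)):
        for subset in combinations(values, r):
            left = set(subset)
            right = set(values) - left
            if min(left) < min(right):   # emit each partition only once
                yield left, right

for left, right in binary_splits({"Single", "Married", "Divorced"}):
    print(left, "|", right)              # prints the 3 splits listed above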


Test Condition for Ordinal Attributes

 Multi-way split:
– Use as many partitions as distinct values.
  Example: Shirt Size -> Small | Medium | Large | Extra Large

 Binary split:
– Divides values into two subsets and preserves the order property among attribute values (a sketch follows).
  Example: Shirt Size -> {Small, Medium} | {Large, Extra Large}
       or  Shirt Size -> {Small} | {Medium, Large, Extra Large}

– The grouping Shirt Size -> {Small, Large} | {Medium, Extra Large} violates the order property.
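A sketch of the same enumeration for an ordinal attribute: only order-preserving "prefix" groupings are valid, so k ordered values admit exactly k - 1 binary splits.

# A sketch of the order-preserving binary splits of an ordinal
# attribute: cut the ordered value list at each position.
def ordinal_splits(ordered_values):
    for i in range(1, len(ordered_values)):
        yield set(ordered_values[:i]), set(ordered_values[i:])

sizes = ["Small", "Medium", "Large", "Extra Large"]
for left, right in ordinal_splits(sizes):
    print(left, "|", right)
# {Small} | {Medium, Large, Extra Large}
# {Small, Medium} | {Large, Extra Large}
# {Small, Medium, Large} | {Extra Large}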
Test Condition for Continuous Attributes

(i) Binary split:    Annual Income > 80K?  ->  Yes | No

(ii) Multi-way split:  Annual Income?  ->  < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K


Splitting Based on Continuous Attributes

 Different ways of handling
– Discretization to form an ordinal categorical attribute
  Ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
   Static – discretize once at the beginning
   Dynamic – repeat at each node
– Binary Decision: (A < v) or (A ≥ v)
   consider all possible splits and find the best cut (see the sketch below)
   can be more compute intensive
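A sketch of the exhaustive best-cut search for the binary decision (A < v), using Gini impurity (one choice among the measures introduced on a later slide) and midpoints between adjacent distinct values as candidate cuts:

# A sketch of finding the best binary cut (A < v) for a continuous
# attribute: sort the values, try the midpoint between each pair of
# adjacent distinct values, keep the cut with lowest weighted Gini.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    best_v, best_impurity = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # no cut between equal values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2   # candidate split point
        left = [y for x, y in pairs if x < v]
        right = [y for x, y in pairs if x >= v]
        m = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if m < best_impurity:
            best_v, best_impurity = v, m
    return best_v, best_impurity

# Annual Income (in K) and Defaulted labels from the earlier training data:
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
defaults = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_cut(incomes, defaults))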
How to determine the Best Split

 There are many measures that can be used to determine the goodness of an attribute test condition.
 These measures try to give preference to attribute test conditions that partition the training instances into purer subsets.

 Having purer nodes is useful, since a node that has all of its training instances from the same class does not need to be expanded further.


How to determine the Best Split

 Large trees are less desirable because:
– They are more susceptible to model overfitting.
– They are difficult to interpret.
– They incur more training and test time as compared to smaller trees.


How to determine the Best Split

Before Splitting: 10 records of class 0, 10 records of class 1

Three candidate test conditions (class counts per child node):
- Gender (2-way split):       C0: 6, C1: 4  |  C0: 4, C1: 6
- Car Type (3-way split):     Family C0: 1, C1: 3  |  Sports C0: 8, C1: 0  |  Luxury C0: 1, C1: 7
- Customer ID (20-way split): c1 ... c20, one record per child (each C0: 1, C1: 0 or C0: 0, C1: 1)

Which test condition is the best?
How to determine the Best Split

 Greedy approach:
– Nodes with purer class distribution are preferred

 Need a measure of node impurity:
C0: 5, C1: 5  ->  high degree of impurity
C0: 9, C1: 1  ->  low degree of impurity


Measures of Node Impurity

 Gini Index

$Gini(t) = 1 - \sum_j [\,p(j \mid t)\,]^2$

 Entropy

$Entropy(t) = -\sum_j p(j \mid t) \log p(j \mid t)$

 Misclassification error

$Error(t) = 1 - \max_i P(i \mid t)$
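The three measures, sketched directly from the formulas; the input is the vector of class counts for the records reaching node t:

# A sketch of the three impurity measures, computed from class counts.
from math import log2

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def misclassification_error(counts):
    return 1.0 - max(counts) / sum(counts)

# The two distributions from the previous slide: (5,5) is maximally
# impure, (9,1) has low impurity under all three measures.
for counts in [(5, 5), (9, 1)]:
    print(counts, gini(counts), entropy(counts), misclassification_error(counts))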


Finding the Best Split

1. Compute the impurity measure (P) before splitting
2. Compute the impurity measure (M) after splitting
 Compute the impurity measure of each child node
 M is the weighted impurity of the children
3. Choose the attribute test condition that produces the highest gain

Gain = P – M

or equivalently, the lowest impurity measure after splitting (M)
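A sketch of this computation with Gini as the impurity measure, applied to the candidate splits from the "Which test condition is the best?" slide:

# A sketch of the gain computation: P is the parent's impurity, M the
# weighted impurity of the children, Gain = P - M. Gini is used here,
# but any impurity measure from the previous sketch works the same way.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gain(parent_counts, children_counts):
    n = sum(parent_counts)
    p = gini(parent_counts)                       # impurity P before splitting
    m = sum(sum(child) / n * gini(child)          # M: weighted child impurity
            for child in children_counts)
    return p - m

parent = (10, 10)
print(gain(parent, [(6, 4), (4, 6)]))            # Gender split: gain 0.02
print(gain(parent, [(1, 3), (8, 0), (1, 7)]))    # Car Type split: gain 0.3375

The Car Type split yields the larger gain here, matching the intuition that it produces purer children (one child is entirely class 0).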


Finding the Best Split

Before splitting, the parent node has class counts C0: N00, C1: N01 and impurity P.

Two candidate attribute tests, A? and B?, each split the parent into two child nodes (N1, N2 for A; N3, N4 for B), with class counts C0: N10, C1: N11, and so on. The weighted impurity of A's children is M1, and of B's children is M2.

Compare Gain = P – M1 vs P – M2, and choose the test with the higher gain.
Decision Tree Based Classification

 Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid overfitting are employed)
– Can easily handle redundant or irrelevant attributes (unless the attributes are interacting)

 Disadvantages:
– Space of possible decision trees is exponentially large. Greedy approaches are often unable to find the best tree.
– Does not take into account interactions between attributes
– Each decision boundary involves only a single attribute
