Module 3 Classification

Data mining, also known as knowledge discovery in databases (KDD), involves extracting useful patterns and information from large datasets. The process includes several steps such as goal identification, data preprocessing, data mining, and evaluation, with applications in areas like customer segmentation and classification. It utilizes both supervised and unsupervised learning techniques to analyze data and make predictions based on historical patterns.

What Is Data Mining?

● Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
– Data mining is the process of discovering interesting patterns and knowledge from large amounts of data
– Analogy: mining gold from sand is called gold mining, not sand mining; likewise, "data mining" really means knowledge mining from data
● Alternative names and their "inside stories":
– Data mining: a misnomer?
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
● What is not data mining?
– Query processing
– Expert systems or small ML/statistical programs

Why Data Mining?

● Four questions to be answered:
● Can the problem be clearly defined?
● Does potentially meaningful data exist?
● Does the data contain hidden knowledge, or is it useful only for reporting purposes?
● Will the cost of processing the data be less than the likely increase in profit from the knowledge gained by applying the data mining project?

Steps of a KDD Process (1)

● 1. Goal identification:
– Define the problem
– Identify relevant prior knowledge and the goals of the application
● 2. Creating a target data set: data selection
● 3. Data preprocessing: (may take 60%-80% of the effort!)
– removal of noise or outliers
– strategies for handling missing data fields
– accounting for time sequence information
● 4. Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant representation

Steps of a KDD Process (2)
● 5. Data Mining:
– Choosing functions of data mining:
● summarization, classification, regression, association,
clustering.
– Choosing the mining algorithm(s):
● which models or parameters
– Search for patterns of interest
● 6. Presentation and Evaluation:
– visualization, transformation, removing redundant patterns,
etc.
● 7. Taking action:
– incorporating into the performance system
– documenting
– reporting to interested parties

An Example: Customer Segmentation

● 1. The marketing department wants to perform a segmentation study on the customers of AE Company
● 2. Decide on relevant variables from a data warehouse on customers, sales, promotions
– Customers: name, ID, income, age, education, ...
– Sales: history of sales
– Promotions: promotion types, durations, ...
● 3. Handle missing income values, addresses, ...; determine outliers, if any
● 4. Generate new index variables representing the wealth of customers (see the sketch below)
– Wealth = a*income + b*#houses + c*#cars + ...
– Make necessary transformations (e.g., z-scores) so that some data mining algorithms work more efficiently
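A minimal sketch of step 4 in Python/pandas. The column names, the coefficient values for a, b and c, and the toy rows are illustrative assumptions, not values given on the slides.

# Sketch of step 4: derive a wealth index and z-score the numeric inputs.
# Column names and the coefficients a, b, c are illustrative assumptions.
import pandas as pd

customers = pd.DataFrame({
    "income":   [800, 600, 1200, 2500, 1700],
    "n_houses": [0, 0, 1, 2, 1],
    "n_cars":   [1, 0, 1, 3, 2],
})

a, b, c = 1.0, 500.0, 200.0  # weights chosen only for illustration
customers["wealth"] = (a * customers["income"]
                       + b * customers["n_houses"]
                       + c * customers["n_cars"])

# z-score transformation so that scale differences between variables do not
# dominate distance-based mining algorithms (e.g., k-means in step 5)
numeric = ["income", "n_houses", "n_cars", "wealth"]
zscores = (customers[numeric] - customers[numeric].mean()) / customers[numeric].std()
print(zscores.round(2))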

Example: Customer Segmentation (cont.)

● 5.a: Choose clustering as the data mining functionality, as it is the natural one for a segmentation study: find groups of customers with similar characteristics
● 5.b: Choose a clustering algorithm
– k-means, k-medoids, or any algorithm suitable for the problem
● 5.c: Apply the algorithm
– Find clusters or segments
● 6. Make the reverse transformations and visualize the customer segments
● 7. Present the results in the form of a report to the marketing department
– Implement the segmentation as part of a DSS so that it can be applied repeatedly at certain intervals as new customers arrive
– Develop marketing strategies for each segment

Data Mining: A KDD Process

[Figure: the KDD pipeline] Databases → Data Integration / Data Cleaning → Data Warehouse → Data Selection / Data Transformation → Task-relevant Data → Data Mining → Pattern Evaluation.
Data mining is the core of the knowledge discovery process.
Architecture of a Typical Data Mining System

[Figure: layered architecture] Graphical user interface → Pattern evaluation → Data mining engine → Database or data warehouse server (all consulting the Knowledge base), fed by data cleaning, filtering and data integration from Databases and the Data Warehouse.
Architecture of a Typical Data Mining System

● Database, data warehouse
● Database or data warehouse server
● Knowledge base
– concept hierarchies
– user beliefs
● used to assess a pattern's interestingness
– other thresholds
● Data mining engine
– functional modules
● characterization, association, classification, cluster analysis, evolution and deviation analysis
● Pattern evaluation module
● Graphical user interface

Data Mining Models and Tasks

Two Styles of Data Mining

● Descriptive data mining
– characterizes the general properties of the data in the database
– finds patterns in the data; the user determines which ones are important
● Predictive data mining
– performs inference on the current data to make predictions
– we know what to predict
● Not mutually exclusive
– often used together: descriptive → predictive
● E.g., customer segmentation (descriptive, by clustering)
● followed by a risk assignment model (predictive, by ANN)

Supervised vs. Unsupervised Learning

■ Supervised learning (classification, prediction)
■ Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
■ New data is classified based on the training set
■ Unsupervised learning (summarization, association, clustering)
■ The class labels of the training data are unknown
■ Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Classification:
● Classification is a form of data analysis that extracts models describing important data classes.
● Classification is the process of dividing a dataset into different categories or groups using class labels.
● Such models are called classifiers; they predict categorical class labels.
● Learning is supervised.
● The dependent variable is categorical.
● Build a model able to assign new instances to one of a set of well-defined classes.
● Classification models predict categorical (discrete, unordered) class labels.
• Applications of classification:
– Fraud detection, target marketing, performance prediction, manufacturing, medical diagnosis, banking.
Typical Classification Problems

● Given characteristics of individuals, differentiate those who have suffered a heart attack from those who have not
● Determine if a credit card purchase is fraudulent
● Classify a car loan applicant as a good or a poor credit risk

Methods of Classification

● Decision Trees
● Artificial Neural Networks
● Bayesian Classification
– Naïve
– Belief Networks
● k-nearest neighbor
● Regression
– Logistic (logit), probit
● predicts the probability of each class
● used when the dependent variable is categorical
– e.g., good customer vs. bad customer, or employed vs. unemployed

Typical Classification Process:
• Given a collection of records (the training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model.
– Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it (see the sketch below).
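A minimal sketch of this train/test protocol using scikit-learn on synthetic data; the dataset and the parameter choices are assumptions made for illustration, not part of the slides.

# Split labeled records into a training set (to build the model) and a
# test set (to estimate its accuracy on previously unseen records).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))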

Steps of classification process

● (1) Train the model


– using a training set
– data objects whose class labels are known
● (2) Test the model
– on a test sample
– whose class labels are known but not used for training the
model
● (3) Use the model for classification
– on new data whose class labels are unknown

● Classification is a two-step process consisting of a learning step and a classification step.
● Learning step: the classification model is constructed
– a classifier is built describing a predetermined set of data classes.
– The classification algorithm builds the classifier by analyzing, or learning from, a training set made up of database tuples and their associated class labels.
– A tuple X is represented as an n-dimensional attribute vector, X = (x1, x2, ..., xn).
– Each tuple X is assumed to belong to a predefined class, as determined by another attribute called the class label attribute.

● Classification step:
--The model is used for classification.
--Predictive accuracy of the classifier is estimated.
--the model is used to predict the class labels for given data.
– Test data are used to estimate the accuracy of the
classification rule.
– if the accuracy is considered acceptable, the rule can be
applied to the classification of new data tuples.

Classification—A Two-Step Process
● Step 1 - Model construction
– describe a set of predetermined classes
● Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
● The set of tuples used for model construction is the training set
● The model is represented as classification rules, decision trees, or
mathematical formulae
● Step 2 - Model usage
– Estimate accuracy of the model
● The known label of test sample is compared with the
classified result from the model
● Accuracy rate is the percentage of test set samples that are
correctly classified by the model
● Test set is independent of training set

– Use model to classify future or unknown objects


A classification application involves more than one class of data.
• Training data is a set of instances (samples, points) with known class labels.
• Test data is a set of instances whose class labels are to be predicted.

Notation
Training data: {⟨x1, y1⟩, ⟨x2, y2⟩, ..., ⟨xm, ym⟩}, where the xj are n-dimensional vectors and the yj are from a discrete space Y. E.g., Y = {L1, L2}.
Test data: {⟨u1, ?⟩, ⟨u2, ?⟩, ..., ⟨uk, ?⟩}

Process
Training data: X with class labels Y → learn f, a classifier (a mapping, a hypothesis) such that f(X) ≈ Y
Test data: U → f(U) gives the predicted class labels

Classification Process (1): Model Construction

[Figure] Training Data → Classification Algorithms → Classifier (Model)
Example of a learned model, expressed as a rule:
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
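The learned model above, written out as a plain Python function and applied to the unseen tuple from the next slide; the case-insensitive comparison on rank is an assumption.

# The slide's rule as a function over (rank, years).
def tenured(rank: str, years: int) -> str:
    # normalize case so 'Professor' in the test tuple matches the rule
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

print(tenured("Professor", 4))   # Jeff -> "yes" (the rank condition fires)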
Classification Process (2): Use the Model in Prediction

[Figure] Testing Data → Classifier → accuracy estimate (accuracy is typically != 100%); then Unseen Data → Classifier → prediction.
Unseen example: (Jeff, Professor, 4) → Tenured?

An example - classification

Historical data (training set): each customer's type (label) is known.
 CustID  age  income  education  Type
 1       35   800     undergrad  risky
 2       26   600     HighSch    risky
 3       48   1200    grad       normal
 8       52   2500    undergrad  good
 44      29   1700    HighSch    good

Testing set: labels are known but not used in training the model.
 CustID  age  income  education  Type
 17      43   550     Ph.D.      risky
 27      68   1650    grad       normal

New customers: their type is to be estimated; each new customer has to be classified as risky, normal, or good.
 CustID  age  income  education  Type
 11      36   850     undergrad  ?
 37      28   1650    grad       ?
An example – classification (cont.)

● Based on the historical data, develop a classification model
– Decision tree, neural network, regression, ...
● Test the performance of the model on a portion of the historical data
● If the accuracy of the model is satisfactory,
● use the model on the new customers
– 11 and 37, to assign a type to these new customers

Example: AE customers
[Scatter plot: customers plotted by yearly income (x-axis) vs. age (y-axis); points are labeled good or risky.]

Example: AE customers
[Same scatter plot of age vs. yearly income, with good and risky customers marked.]
Assign the new customer, whose type is unknown, to either * or +.
Solution
[Scatter plot with axes x1 = yearly income and x2 = age, split at income = 1000 and age = 35, separating good from risky customers.]
Rule: IF yearly income > 1000 AND age > 35
THEN good ELSE risky
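The solution rule expressed as a small Python function and applied to the two new customers (IDs 11 and 37) from the earlier tables; only the rule itself comes from the slides.

# The rule derived from the scatter plot, applied to the new customers.
def customer_type(yearly_income: float, age: float) -> str:
    return "good" if yearly_income > 1000 and age > 35 else "risky"

new_customers = {11: (850, 36), 37: (1650, 28)}   # id: (income, age)
for cid, (income, age) in new_customers.items():
    print(cid, customer_type(income, age))
# 11 -> risky (income <= 1000), 37 -> risky (age <= 35)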
For example, in a medical database the training set would have relevant patient information recorded previously, where the prediction attribute is whether or not the patient had a heart problem. Table 1 illustrates the training and prediction sets of such a database.

Classification:-
● Classification consists of predicting a certain outcome based on a given input.
● In order to predict the outcome, the algorithm processes a training set
containing a set of attributes and the respective outcome, usually called goal or
prediction attribute.
● The algorithm tries to discover relationships between the attributes that would
make it possible to predict the outcome.
● Next the algorithm is given a data set not seen before, called prediction set,
which contains the same set of attributes, except for the prediction attribute – not
yet known.
● The algorithm analyzes the input and produces a prediction.
● The prediction accuracy defines how “good” the algorithm is.

• Classification is a data mining function that assigns items in a collection to
target categories or classes.
• The goal of classification is to accurately predict the target class for each
case in the data.
• For example, a classification model could be used to identify loan applicants
as low, medium, or high credit risks.
• Classifications are discrete and do not imply order.
• In the model build (training) process, a classification algorithm finds
relationships between the values of the predictors and the values of the
target.
• Different classification algorithms use different techniques for finding
relationships.
• These relationships are summarized in a model, which can then be applied
to a different data set in which the class assignments are unknown.

• A classification task begins with a data set in which the class assignments
are known.
• For example, a classification model that predicts credit risk could be
developed based on observed data for many loan applicants over a period
of time.
• In addition to the historical credit rating, the data might track employment
history, home ownership or rental, years of residence, number and type of
investments, and so on.
• Credit rating would be the target, the other attributes would be the
predictors, and the data for each customer would constitute a case.
• The simplest type of classification problem is binary classification.
• In binary classification, the target attribute has only two possible values: for
example, high credit rating or low credit rating.
• Multiclass targets have more than two values: for example, low, medium,
high, or unknown credit rating.

• Classification models are tested by comparing the predicted values to
known target values in a set of test data.
• The historical data for a classification project is typically divided into two
data sets: one for building the model, the other for testing the model.
• For example, a model that classifies customers as low, medium, or high
value would also predict the probability of each classification for each
customer.
• Classification has many applications in customer segmentation, business
modeling, marketing, credit analysis, and biomedical and drug response
modeling.

• A Sample Classification Problem
• Suppose you want to predict which of your customers are likely to increase
spending if given an affinity card.
• You could build a model using demographic data about customers who have
used an affinity card in the past.
• Since we want to predict either a positive or a negative response (will or
will not increase spending), we will build a binary classification model.
Sample Data for Classification

• A two-step process is followed, to build a classification
model.
1. In the first step i.e. learning: A classification model based
on training data is built.
2. In the second step i.e. Classification, the accuracy of the
model is checked and then the model is used to classify
new data. The class labels presented here are in the
form of discrete values such as “yes” or “no”, “safe” or
“risky”.

The general approach for building classification models is
given below:

Classification

Classification is mainly done in the following ways:

1) Decision Tree Classification

2) Bayesian Classification

3) Rule-Based Classification

Decision Tree

Decision Tree Induction
During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). It was followed by:
-- C4.5
-- CART (Classification and Regression Trees)
-- ID3, C4.5 and CART adopt a greedy approach in which the decision tree is constructed in a top-down, recursive, divide-and-conquer manner.
• (Induction: a method of discovering general rules and principles from particular facts and examples)
● Most algorithms for decision tree induction follow this top-down approach, which starts with a training set of tuples and their associated class labels.
● The training set is recursively partitioned into smaller subsets as the tree is being built.

• Decision Tree Induction
• Decision tree induction is the method of learning decision trees from the training set.
• The training set consists of attributes and class labels.
• Applications of decision tree induction include astronomy, financial analysis, medical diagnosis, manufacturing, and production.
• A decision tree is a flowchart-like tree structure that is built from the training set tuples.
• The dataset is broken down into smaller subsets, which are represented by the nodes of the tree.
• The tree structure has a root node, internal (decision) nodes, leaf nodes, and branches.

• Decision Tree:
• Graphical representation of all possible solutions to a
decision.
• Decisions are based on certain condition

Decision Tree Induction
(Learning of a decision tree from class labeled training tuples)

Tree Induction
• Greedy strategy.
– Split the records based on an attribute test that
optimizes certain criterion.
Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting

Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are
discretized in advance)
– Examples are partitioned recursively based on selected
attributes
– Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
– There are no samples left
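A compact sketch of the greedy, top-down, divide-and-conquer procedure above. The data layout (a list of attribute dictionaries plus a label list) and the name select_attribute are assumptions; select_attribute stands in for any attribute selection measure (information gain, gain ratio, or Gini index).

# Sketch of the basic greedy induction algorithm described above.
from collections import Counter

def induce_tree(rows, labels, attributes, select_attribute):
    # stop: all samples at this node belong to the same class
    if len(set(labels)) == 1:
        return labels[0]
    # stop: no attributes left for further partitioning -> majority vote
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    best = select_attribute(rows, labels, attributes)   # heuristic choice
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]

    # partition the samples on each known value of the chosen attribute
    for value in set(row[best] for row in rows):
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [y for _, y in subset]
        node["branches"][value] = induce_tree(
            sub_rows, sub_labels, remaining, select_attribute)
    return node

Plugging in an information-gain-based select_attribute gives ID3-style behaviour; a Gini-based one gives CART-style splits.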
Let A be the splitting attribute. (a) If A is discrete-valued, then one branch is grown for each known value of A. (b) If A is continuous-valued, then two branches are grown, corresponding to A <= split_point and A > split_point. (c) If A is discrete-valued and a binary tree must be produced, then the test is of the form "A ∈ SA?", where SA is the splitting subset for A.
[Figure: the three partitioning scenarios (a), (b), (c).]
Attribute Selection Measures
(Also known as Splitting Rules)

■ An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of class-labeled training tuples into individual classes.
■ The attribute selection measure provides a ranking for each attribute describing the given training tuples.
■ The three popular attribute selection measures are:
1. information gain
2. gain ratio, and
3. Gini index

Attribute Selection Measures:
Information Gain (ID3 algorithm)
■ This measure is based on pioneering work by Claude Shannon on
information theory, which studied the value or “information
content” of messages.
■ The attribute with the highest information gain is chosen as the
splitting attribute for node N.
■ This attribute minimizes the information needed to classify the tuples
in the resulting partitions and reflects the least randomness or
“impurity” in these partitions.
■ Such an approach minimizes the expected number of tests needed to
classify a given tuple and guarantees that a simple (but not
necessarily the simplest) tree is found.

Information Gain (ID3/C4.5)

Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by

pi = |Ci,D| / |D|

Expected information (entropy) needed to classify a tuple in D:

Info(D) = - Σ (i=1..m) pi log2(pi)

Information needed (after using A to split D into v partitions) to classify D:

InfoA(D) = Σ (j=1..v) (|Dj| / |D|) × Info(Dj)

Information gained by branching on attribute A:

Gain(A) = Info(D) - InfoA(D)

Select the attribute with the highest information gain.

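A small sketch of these three quantities in Python, applied to class counts of 9 "yes" and 5 "no" tuples and an attribute that splits them into partitions with counts (2,3), (4,0) and (3,2), as in the age example used later; the helper names are assumptions.

# Info(D), Info_A(D) and Gain(A) computed from class counts.
from math import log2

def info(counts):
    # Info(D) = -sum(p_i * log2(p_i)) over the class counts in D
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_after_split(partitions):
    # Info_A(D) = sum(|D_j|/|D| * Info(D_j)) over the partitions
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

D = [9, 5]
age_split = [(2, 3), (4, 0), (3, 2)]
gain_age = info(D) - info_after_split(age_split)     # Gain(age)
print(round(info(D), 3), round(gain_age, 3))         # ~0.940, ~0.247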
Gain Ratio for Attribute Selection (C4.5)
■ The 'information gain' measure is biased towards attributes with a large number of values (e.g., a unique ID attribute like Product_ID).
■ C4.5 uses an extension of information gain known as the gain ratio, which attempts to overcome this bias.
■ It applies a kind of normalization to information gain using a "split information" value, defined analogously to Info(D) as

SplitInfoA(D) = - Σ (j=1..v) (|Dj| / |D|) log2(|Dj| / |D|)

■ This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A.
■ The gain ratio is defined as
■ GainRatio(A) = Gain(A) / SplitInfoA(D)
■ The attribute with the maximum gain ratio is selected as the splitting attribute.
Example: Gain Ratio
Calculate GainRatio(income) for the training data (table shown on the slide).
Example: Gain Ratio for the income attribute
■ From the previous example, Gain(income) = 0.029.
■ A test on the income attribute splits the data of the table into three partitions, namely low, medium, and high, containing four, six, and four tuples, respectively.

SplitInfoincome(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557

■ GainRatio(income) = Gain(income) / SplitInfoincome(D)
■ Therefore, GainRatio(income) = 0.029 / 1.557 = 0.019.
■ The attribute with the maximum gain ratio is selected as the splitting attribute.
■ Drawback: tends to prefer unbalanced splits in which one partition is much smaller than the others.
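A quick numeric check of the figures quoted above (partition sizes 4, 6 and 4 for income, and Gain(income) = 0.029); the helper code is only a sketch.

# Verify SplitInfo(income) ~ 1.557 and GainRatio(income) ~ 0.019.
from math import log2

counts = [4, 6, 4]                         # partition sizes for income
total = sum(counts)
split_info = -sum(c / total * log2(c / total) for c in counts)
gain_income = 0.029                        # taken from the example above
print(round(split_info, 3))                # ~1.557
print(round(gain_income / split_info, 3))  # ~0.019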
Gini Index
• The Gini index (Gini impurity) measures the degree, or probability, of a particular variable being wrongly classified when it is randomly chosen.
• It calculates the probability of a specific feature being classified incorrectly when selected randomly.

• But what is actually meant by 'impurity'?
• If all the elements belong to a single class, then the partition can be called pure. The Gini index varies between 0 and 1,
• where
'0' denotes that all elements belong to a certain class, or only one class exists (pure), and
'1' denotes that the elements are randomly distributed across various classes (impure).
• A Gini index of 0.5 denotes elements equally distributed over two classes.
• Formula of the Gini Index
• The formula of the Gini index is as follows:

Gini = 1 - Σ (i=1..n) (pi)^2

where pi is the probability of an element being classified into class i.
Example: Gini Index

Example: Gini index
■ Ex.: Let D be the training data, where nine tuples belong to the class buys_computer = yes and the remaining five tuples belong to the class buys_computer = no.
■ A (root) node N is created for the tuples in D. We first compute the impurity of D as:

Gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459

■ To find the splitting criterion for the tuples in D, we need to compute the Gini index for each attribute.
■ Let's start with the attribute income and consider the subset {low, medium}. This would result in 10 tuples in partition D1 satisfying the condition "income ∈ {low, medium}." The remaining four tuples of D would be assigned to partition D2.
■ The Gini index value computed based on this partitioning is

Gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.443

Similarly, Gini_{income ∈ {low,high}}(D) = 0.458 and Gini_{income ∈ {medium,high}}(D) = 0.450.

• Therefore, the best binary split for attribute income is on {low, medium} or {high}, because it minimizes the Gini index.
Calculation of Gini Index
(Worked example shown on the slides; see also https://blog.quantinsti.com/gini-index/ )
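A sketch of the Gini computations in Python. The class counts (9 yes / 5 no overall, and 7/3 vs. 2/2 for the {low, medium} vs. {high} income split) are taken from the standard buys_computer example the slides refer to; they are not listed explicitly on these pages.

# Gini impurity of a partition and weighted Gini of a binary split.
def gini(counts):
    # Gini(D) = 1 - sum(p_i^2) over the class counts in D
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    # weighted Gini over the partitions produced by a split
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

print(round(gini([9, 5]), 3))                  # ~0.459
# income in {low, medium}: 10 tuples (7 yes, 3 no); {high}: 4 (2 yes, 2 no)
print(round(gini_split([(7, 3), (2, 2)]), 3))  # ~0.443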
Gini index (CART)
■ The Gini index measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 - Σ (i=1..m) pi^2

where pi is the probability that a tuple in D belongs to class Ci.

■ The Gini index considers a binary split for each attribute.
■ Case 1: A is discrete-valued. To determine the best binary split on a discrete-valued attribute A, we examine all the possible subsets of the known values of A.
■ Each subset, SA, can be considered as a binary test for attribute A of the form "A ∈ SA?"
■ If A has v possible values, then there are 2^v possible subsets.
■ Excluding the full set and the empty set (which do not represent a split), there are 2^v - 2 possible ways to form two partitions of the data, D, based on a binary split on A.
Gini index (CART)

■ For each attribute, each of the possible binary splits is considered.
■ For a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is selected as its splitting subset.
■ If the split on A partitions D into D1 and D2, the Gini index of D given that partitioning is

GiniA(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)

■ The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is

ΔGini(A) = Gini(D) - GiniA(D)

■ The attribute that maximizes the reduction in impurity (or, equivalently, has the minimum Gini index) is selected as the splitting attribute. This attribute and its splitting subset (for a discrete-valued splitting attribute) together form the splitting criterion.

Gini index (Continuous-valued attribute)
■ For continuous-valued attributes, each possible split-point must be
considered.
■ The strategy is to take the midpoint between each pair of (sorted)
adjacent values as a possible split-point.
■ The point giving the minimum Gini index for a given (continuous-valued)
attribute is taken as the split-point of that attribute.

■ Recall that for a possible split-point of A, D1 is the set of tuples in D


satisfying A <= split point, and D2 is the set of tuples in D satisfying A >
split point

■ The attribute that maximizes the reduction in impurity (or, equivalently,


has the minimum Gini index) is selected as the splitting attribute. This
attribute and its split-point (for a continuous-valued splitting attribute)
together form the splitting criterion.

Example: Gini index (cont.)
■ Evaluating age, we obtain Gini_{age ∈ {youth,senior}}(D) = 0.357 = Gini_{age ∈ {middle_aged}}(D)

■ The attributes student and credit_rating are both binary, with Gini index values of 0.367 and 0.429, respectively.

■ The attribute age and splitting subset {youth, senior} therefore give the minimum Gini index overall, with a reduction in impurity of 0.459 - 0.357 = 0.102. The binary split "age ∈ {youth, senior}?" results in the maximum reduction in impurity of the tuples in D and is returned as the splitting criterion.

■ Node N is labeled with the criterion, two branches are grown from it, and the tuples are partitioned accordingly.

Comparing Attribute Selection Measures

■ The three measures, in general, return good results, but:
■ Information gain:
■ biased towards multivalued attributes
■ Gain ratio:
■ tends to prefer unbalanced splits in which one partition is much smaller than the others
■ Gini index:
■ biased towards multivalued attributes
■ has difficulty when the number of classes is large
■ tends to favor tests that result in equal-sized partitions and purity in both partitions
Building Decision Tree
• Top-down tree construction
– At start, all training examples are at the root.
– Partition the examples recursively by choosing one
attribute each time.
• Bottom-up tree pruning
– Remove subtrees or branches, in a bottom-up
manner, to improve the estimated accuracy on
new cases.

Root Node: This attribute is used for dividing the data into two or more sets.
The feature attribute in this node is selected based on Attribute Selection
Techniques.
Branch or Sub-Tree: A part of the entire decision tree is called a branch or
sub-tree.
Splitting: Dividing a node into two or more sub-nodes based on if-else conditions.
Decision Node: After splitting the sub-nodes into further sub-nodes, then it is
called the decision node.
Leaf or Terminal Node: This is the end of the decision tree where it cannot be
split into further sub-nodes.
Pruning: Removing a sub-node from the tree is called pruning.

Decision Tree:
•The root node is the topmost node.
•Root node represents the best attribute selected for classification.
•Internal nodes of the decision nodes represent a test of an attribute of the dataset
•Leaf node or terminal node which represents the classification or decision label.
•The branches show the outcome of the test performed.

• Decision trees classify the examples by sorting them down the tree from the
root to some leaf/terminal node, with the leaf/terminal node providing the
classification of the example.
• Each node in the tree acts as a test case for some attribute, and each edge
descending from the node corresponds to the possible answers to the test
case.
• This process is recursive in nature and is repeated for every subtree rooted
at the new node.
• Decision Tree Algorithms:
• ID3 (Iterative Dichotomiser 3)
• C4.5 (successor of ID3)
• CART (Classification And Regression Trees)
• CHAID (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
• MARS (Multivariate Adaptive Regression Splines)

• Attribute Selection Measures
• Entropy,
• Information gain,
• Gini index,
• Gain Ratio,
• Reduction in Variance
• Chi-Square

• While using Information Gain as a criterion, we assume


attributes to be categorical, and for the Gini index, attributes
are assumed to be continuous.

• Entropy
Entropy is a measure of the randomness or uncertainty in the information being processed.
The higher the entropy, the harder it is to draw any conclusions from that information.
• The entropy H(X) is zero when the probability is either 0 or 1.
• The entropy is maximum when the probability is 0.5, because that reflects perfect randomness in the data: there is no chance of perfectly determining the outcome.

• ID3 follows the rule: a branch with an entropy of zero is a leaf node, and a branch with entropy greater than zero needs further splitting.
• Mathematically, the entropy of a state S is represented as:

E(S) = - Σ (i=1..c) pi log2(pi)

where pi is the proportion of examples in S belonging to class i.
• Mathematically, the entropy for multiple attributes (conditional entropy) is represented as:

E(T, X) = Σ (c ∈ X) P(c) E(c)

where T → current state, X → selected attribute, and the sum runs over the values c of X.


• Information Gain
• Information gain or IG is a statistical property that measures
how well a given attribute separates the training examples
according to their target classification.
• Constructing a decision tree is all about finding an attribute that
returns the highest information gain and the smallest entropy.

• Information gain is a decrease in entropy.
• It computes the difference between entropy before split and
average entropy after split of the dataset based on given
attribute values.
• ID3 (Iterative Dichotomiser) decision tree algorithm uses
information gain.
• Mathematically, IG is represented as:

IG = Entropy(before) - Σ (j=1..K) w_j · Entropy(j, after)

where "before" is the dataset before the split, K is the number of subsets generated by the split, (j, after) is subset j after the split, and w_j is the fraction of records falling into subset j.
• Information Gain: the amount of information gained about a random variable by observing another variable.
• It is also called the change in information entropy from a prior state given some information.
• What is Entropy?
• Entropy is the degree of uncertainty, impurity or disorder of a random variable.
• Entropy is the measurement of impurity or randomness in the data points.
• Here, if all elements belong to a single class, then it is termed "pure", and if not, the distribution is termed "impure".
• In simpler terms: if a dataset contains homogeneous subsets of observations, then there is no impurity or randomness in the dataset, and if all the observations belong to one class, the entropy of that dataset becomes zero.

• Information gain is used for determining the best
features/attributes that render maximum information about a
class.
• It follows the concept of entropy while aiming at decreasing the
level of entropy, beginning from the root node to the leaf nodes.
• Information gain computes the difference between entropy
before and after split and specifies the impurity in class
elements.
• Information Gain = Entropy before splitting - Entropy after
splitting

• Gini Index
• The Gini index is used as a cost function to evaluate splits in the dataset.
• It is calculated by subtracting the sum of the squared probabilities of each class from one:

Gini = 1 - Σ (i=1..n) pi^2

where P = (p1, p2, ..., pn) and pi is the probability of an object being classified to a particular class.
• It favors larger partitions and is easy to implement, whereas information gain favors smaller partitions with distinct values.
• The Gini index works with a categorical target variable such as "Success" or "Failure". It performs only binary splits.
• A higher value of the Gini index implies higher inequality and higher heterogeneity.
• CART (Classification and Regression Trees) uses the Gini index method to create split points.

• The Gini index calculates the probability of a specific feature being classified incorrectly when selected randomly.
• The degree of the Gini index varies from 0 to 1,
• where 0 depicts that all the elements belong to a certain class, or only one class exists;
• a Gini index of 1 signifies that the elements are randomly distributed across various classes; and
• a value of 0.5 denotes elements equally distributed over some classes.
• Also, the attribute/feature with the least Gini index is preferred as the root node while building a decision tree.

• Gain ratio
• Information gain is biased towards choosing attributes with a large
number of values as root nodes.
• It means it prefers the attribute with a large number of distinct values.
• C4.5, an improvement of ID3, uses Gain ratio which is a modification of
Information gain that reduces its bias and is usually the best option.
• The attribute with the max gain ratio is used as a splitting attribute.
• Gain Ratio or Uncertainty Coefficient is used to normalize the
information gain of an attribute against how much entropy that
attribute has.

• The Gain Ratio (Uncertainty Coefficient) normalizes the information gain of an attribute against how much entropy that attribute has:

GainRatio(A) = Gain(A) / SplitInfo(A)
Choosing the Splitting Attribute
• At each node, the available attributes are evaluated on the basis of how well they separate the classes of the training examples. A goodness function is used for this purpose.
• Typical functions:
– information gain (ID3/C4.5)
– information gain ratio
– gini index

Attribute Selection Measure: Information Gain (ID3/C4.5)
■ Select the attribute with the highest information gain
■ Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|
■ Expected information (entropy) needed to classify a tuple in D:

Info(D) = - Σ (i=1..m) pi log2(pi)

■ Information needed (after using A to split D into v partitions) to classify D (conditional entropy):

InfoA(D) = Σ (j=1..v) (|Dj| / |D|) × Info(Dj)

■ Information gained by branching on attribute A:

Gain(A) = Info(D) - InfoA(D)
Attribute Selection: Information Gain

● Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

Info(D) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

● "age = youth" covers 5 out of the 14 samples, with 2 yes's and 3 no's, so I(2, 3) = 0.971. Hence

Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.693
Gain(age) = 0.940 - 0.693 = 0.247

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.
Information Gain (ID3)
Select the attribute with the highest information gain.
Assume there are two classes, P and N.
Let the set of examples S contain
• p elements of class P
• n elements of class N

The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as

I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
Information Gain in Decision Tree Induction
Assume that using attribute A, a set S will be partitioned into sets {S1, S2, …, Sv}.
If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is

E(A) = Σ (i=1..v) ((pi + ni) / (p + n)) I(pi, ni)

The encoding information that would be gained by branching on A is

Gain(A) = I(p, n) - E(A)

Attribute Selection by Information Gain Computation

Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

I(p, n) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Compute the entropy for age:

E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.693
Attribute Selection by Information Gain Computation
I(p, n) = I(9, 5) = 0.940
E(age) = 0.693

Gain(age) = 0.940 - 0.693 = 0.247

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.
Gain Ratio for Attribute Selection (C4.5)

• The information gain measure is biased towards attributes with a large number of values
• C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (normalization of information gain)
– GainRatio(A) = Gain(A) / SplitInfo(A)
• Ex.:
– gain_ratio(income) = 0.029 / 1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the splitting attribute
Gini index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index, gini(D), is defined as

gini(D) = 1 - Σ (j=1..n) pj^2

where pj is the relative frequency of class j in D.
• If a data set D is split on A into two subsets D1 and D2, the gini index giniA(D) is defined as

giniA(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

• The split with the smallest gini value (or the largest reduction in impurity) is selected. (We need to enumerate all the possible splitting points for each attribute.)
• Reduction in impurity:

Δgini(A) = gini(D) - giniA(D)

• The attribute that provides the largest Δgini(A) is chosen to split the node.
• The ID3 algorithm builds decision trees using a top-down greedy search through the space of possible branches, with no backtracking.
• Steps in the ID3 algorithm:
1. It begins with the original set S as the root node.
2. On each iteration, the algorithm iterates through every unused attribute of the set S and calculates the entropy H and the information gain IG of that attribute.
3. It then selects the attribute which has the smallest entropy or, equivalently, the largest information gain.
4. The set S is then split by the selected attribute to produce subsets of the data.
5. The algorithm continues to recurse on each subset, considering only attributes never selected before.

• Decision Tree Induction: ID3
• ID3 was later extended into C4.5.
• ID3 and C4.5 follow a greedy top-down approach for constructing decision trees.
• The algorithm starts with a training dataset with class labels that is partitioned into smaller subsets as the tree is being constructed.

• The algorithm starts with a training dataset with class labels that is partitioned into smaller subsets as the tree is being constructed.

• #1) Initially, there are three parameters, i.e., the attribute list, the attribute selection method, and the data partition. The attribute list describes the attributes of the training set tuples.

• #2) The attribute selection method describes the method for selecting the best attribute for discrimination among tuples. The methods used for attribute selection can be Information Gain, Gain Ratio, or Gini Index.

• #3) The structure of the tree (binary or non-binary) is decided by the attribute selection method.
• #4) When constructing a decision tree, it starts as a single node representing the tuples.
• #5) If the root node tuples represent different class labels, then it calls the attribute selection method to split or partition the tuples. This step leads to the formation of branches and decision nodes.
• #6) The splitting method determines which attribute should be selected to partition the data tuples. It also determines the branches to be grown from the node according to the test outcome.
• The main motive of the splitting criterion is that the partition at each branch of the decision tree should be as pure as possible, i.e., ideally represent a single class label.

• #7) The above partitioning steps are followed recursively to form a decision tree for the training dataset tuples.

• #8) The partitioning stops only when either all the partitions are made or when the remaining tuples cannot be partitioned further.

• #9) The complexity of the algorithm is O(n × |D| × log|D|), where n is the number of attributes in the training dataset D and |D| is the number of tuples.

• Constructing a Decision Tree
• Let us take an example of the weather dataset with
attributes outlook, temperature, wind, and humidity. The
outcome variable will be playing golf or not. We will use
the ID3 algorithm to build the decision tree.

• Step1: The first step will be to create a root node.
• Step2: If all results are yes, then the leaf node “yes”
will be returned else the leaf node “no” will be
returned.
• Step3: Find out the Entropy of all observations and
entropy with attribute “x” that is E(S) and E(S, x).
• Step4: Find out the information gain and select the
attribute with high information gain.
• Step5: Repeat the above steps until all attributes
are covered.

• Classification using the ID3 algorithm
• Consider a dataset based on which we will determine whether to play football or not.

• There are four independent variables used to determine the dependent variable. The independent variables are Outlook, Temperature, Humidity, and Wind. The dependent variable is whether to play football or not.
• As the first step, we have to find the parent (root) node for our decision tree. For that, follow these steps:
• Find the entropy of the class variable.
• Yes = 9, No = 5, n = 14
• F = {Outlook, Temperature, Humidity, Wind}
• E(S) = E(9, 5) = -[(9/14) log2(9/14) + (5/14) log2(5/14)] = 0.94

• From the above data, for Outlook we can easily arrive at the following counts: Sunny (2 yes, 3 no), Overcast (4 yes, 0 no), Rainy (3 yes, 2 no).

• Now we have to calculate the average weighted entropy, i.e., the sum over each feature value of its weight multiplied by its entropy.
• E(S, Outlook) = (5/14)·E(3,2) + (4/14)·E(4,0) + (5/14)·E(2,3)
  = (5/14)(-(3/5)log2(3/5) - (2/5)log2(2/5)) + (4/14)(0) + (5/14)(-(2/5)log2(2/5) - (3/5)log2(3/5)) = 0.693

• The next step is to find the information gain. It is the difference between the parent entropy and the average weighted entropy we found above.
• IG(S, Outlook) = 0.94 - 0.693 = 0.247
• Similarly, find the information gain for Temperature, Humidity, and Windy.
• IG(S, Temperature) = 0.940 - 0.911 = 0.029
• IG(S, Humidity) = 0.940 - 0.788 = 0.152
• IG(S, Windy) = 0.940 - 0.892 = 0.048
• Now select the feature having the largest information gain.
• Here it is Outlook. So it forms the first node (root node) of our decision tree.

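A short sketch reproducing the Outlook numbers worked out above for the play-golf data; the per-value class counts (2/3, 4/0, 3/2) come from the table later in this module.

# E(S), E(S, Outlook) and IG(S, Outlook) for the play-golf data.
from math import log2

def entropy(yes, no):
    total = yes + no
    return -sum(c / total * log2(c / total) for c in (yes, no) if c > 0)

e_s = entropy(9, 5)                                  # ~0.94
outlook = {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)}
e_s_outlook = sum((y + n) / 14 * entropy(y, n) for y, n in outlook.values())
# ~0.694 unrounded (the slides round the intermediate value to 0.693)
print(round(e_s, 2), round(e_s_outlook, 3), round(e_s - e_s_outlook, 3))  # IG ~0.247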
• Now our data, split by Outlook, looks as follows:

• Since overcast contains only examples of class ‘Yes’ we can set
it as yes. That means If outlook is overcast football will be
played.
• Now our decision tree looks as follows.

• The next step is to find the next node in our decision tree. Now we will find the one under Sunny. We have to determine which of Temperature, Humidity or Wind has the higher information gain.
• F = {Temperature, Humidity, Wind}

Outlook = Sunny
Calculate the parent entropy E(PlayGolf):
E(PlayGolf) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971
Now calculate the information gain of Temperature:
IG(PlayGolf, Temperature) = ?

E(PlayGolf, Temperature) = 0.4 (the weighted entropy of the Sunny subset when split by Temperature; the detailed table is shown on the slide).

Now calculate the information gain:

IG(PlayGolf, Temperature) = 0.971 - 0.4 = 0.571
• Similarly we get
• IG(PlayGolf, Humidity) = 0.971
• IG(PlayGolf, Windy) = 0.020
• Here IG(PlayGolf, Humidity) is the largest value. So
Humidity is the node that comes under sunny.

[Figure: under the Sunny branch, the data is split on Humidity into Humidity = High and Humidity = Normal.]
Extracting Classification Rules from Trees
• Represent the knowledge in the form of
IF-THEN rules
• One rule is created for each path from the root
to a leaf
• Each attribute-value pair along a path forms a
conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand

• Classification Rules:
• IF (outlook = "sunny" AND humidity = "high") THEN play football = "No"
• IF (outlook = "sunny" AND humidity = "normal") THEN play football = "Yes"
• IF (outlook = "overcast") THEN play football = "Yes"
• IF (outlook = "rain" AND wind = "strong") THEN play football = "No"
• IF (outlook = "rain" AND wind = "weak") THEN play football = "Yes"
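The five rules above, collected into a single Python function as a quick illustration; attribute values are assumed to be lower-case strings.

# The extracted classification rules, one root-to-leaf path each.
def play_football(outlook, humidity, wind):
    if outlook == "sunny":
        return "No" if humidity == "high" else "Yes"
    if outlook == "overcast":
        return "Yes"
    if outlook == "rain":
        return "No" if wind == "strong" else "Yes"
    return None  # unseen outlook value

print(play_football("sunny", "normal", "weak"))   # -> "Yes"
print(play_football("rain", "high", "strong"))    # -> "No"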

● How are decision trees used for classification?
● Given a tuple for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree.
● A path is traced from the root to a leaf node, which holds the class prediction for the tuple.
● A decision tree can easily be converted to classification rules.

Counts of "Play" by Outlook:
 Outlook    Yes  No  Total
 Sunny       2    3    5
 Overcast    4    0    4
 Rainy       3    2    5

E(outlook = Sunny) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971
E(outlook = Overcast) = 0
E(outlook = Rainy) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971

Information from outlook:
I(outlook) = (5/14)·E(outlook=Sunny) + (4/14)·E(outlook=Overcast) + (5/14)·E(outlook=Rainy)
I(outlook) = (5/14)·0.971 + (4/14)·0 + (5/14)·0.971 = 0.693
Gain(outlook) = E(S) - I(outlook) = 0.94 - 0.693 = 0.247
Counts of "Play" by Windy:
 Windy   Yes  No  Total
 False    6    2    8
 True     3    3    6

E(Windy = False) = -(6/8)log2(6/8) - (2/8)log2(2/8) = 0.811
E(Windy = True) = 1

I(Windy) = (8/14)·0.811 + (6/14)·1 = 0.892
Gain(Windy) = E(S) - I(Windy) = 0.94 - 0.892 = 0.048

• Example 2:

Decision Tree Induction: An Example
❑ Training data set: Buys_computer
❑ Resulting tree:

age?
  <=30:   student?        no → no;   yes → yes
  31..40: yes
  >40:    credit_rating?  excellent → no;  fair → yes
IF age = "<=30" AND student = "no"  THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"

Output: A Decision Tree for "buys_computer" (the tree shown above), together with the rules read off its root-to-leaf paths.
Advantages:
● The construction of a decision tree does not require any domain knowledge or parameter setting.
● Decision trees can handle multi-dimensional data.
● Easy to interpret.
● The learning and classification steps of decision tree induction are fast and simple.
● Good accuracy.
Disadvantages:
❑ If a decision tree is used with categorical variables that have many levels, those variables with more levels will show more information gain.
❑ Calculations can quickly become very complex, although this is usually only a problem if the tree is being created by hand.
Pruning Trees
• When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
• Tree pruning methods address this problem of overfitting the data.
• They use statistical measures to remove the least reliable branches.
• Pruned trees are smaller, faster, and better at correctly classifying independent test data than unpruned trees.
• Pruning is also a technique for reducing the number of attributes used in a tree.
• Pruning a decision tree means removing a subtree that is redundant, not a useful split, and replacing it with a leaf node.
• Decision tree pruning can be divided into two types: pre-pruning and post-pruning.
• Two types of pruning:
– Pre-pruning (forward pruning)
– Post-pruning (backward pruning)

• Decision trees are tree data structures that are generated using
learning algorithms for the purpose of Classification.
• One of the most common problem when learning a decision tree is
to learn the optimal size of the resulting tree that leads to a better
accuracy of the model.
• A tree that has too many branches and layers can result in
overfitting of the training data.
• Pruning a decision tree helps to prevent overfitting the training
data so that our model generalizes well to unseen data.
• Pruning a decision tree means to remove a subtree that is
redundant and not a useful split and replace it with a leaf node.
• Decision tree pruning can be divided into two types: pre-pruning
and post-pruning.
Pre-pruning

• Tree is pruned by halting its construction early.


• In pre-pruning, we decide during the building process when to
stop adding attributes (based on their information gain)
• However, this may be problematic – Why?
– Sometimes attributes individually do not contribute much
to a decision, but combined, they may have a significant
impact

• In pre-pruning, we evaluate the pruning condition based on the
above measures at each node.
• Examples of pruning conditions include
informationGain(Attr)> minGain or treeDepth == MaxDepth.
If the condition is satisfied, we prune the subtree.
• That means we replace the decision node with a leaf node.
• Otherwise, we continue building the tree using our decision
tree algorithm.
• Pre-pruning has the advantage of being faster and more
efficient as it avoids generating overly complex subtrees which
overfit the training data.

• Post-pruning
• As the name suggests, post-pruning means to prune after the tree is built.
• You grow the tree entirely using your decision tree algorithm and then you
prune the subtrees in the tree in a bottom-up fashion.
• You start from the bottom decision node and, based on measures such as
Gini Impurity or Information Gain, you decide whether to keep this decision
node or replace it with a leaf node.
• For example, say we want to prune out subtrees that result in least
information gain.
• When deciding the leaf node, we want to know what leaf our decision tree
algorithm would have created if it didn’t create this decision node.

• Pruning algorithms
• Pruning by information gain
• We can prune our decision tree by using information gain in both
post-pruning and pre-pruning.
• In pre-pruning, we check whether information gain at a particular
node is greater than minimum gain.
• In post-pruning, we prune the subtrees with the least information
gain until we reach a desired number of leaves.

• Reduced Error Pruning (REP)


• REP belongs to the Post-Pruning category.
• In REP, pruning is performed with the help of a validation set.
• In REP, all nodes are evaluated for pruning in a bottom up fashion. A
node is pruned if the resulting pruned tree performs no worse than the
original tree on the validation set.
• The subtree at the node is replaced with a leaf node which is assigned
the most common class.
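The pruning ideas above can be tried quickly with scikit-learn. Note that scikit-learn offers pre-pruning through stopping parameters (max_depth, min_impurity_decrease) and post-pruning through cost-complexity pruning (ccp_alpha), which is a different post-pruning method than REP; this is only a sketch on synthetic data.

# Pre-pruning (stopping conditions) vs. post-pruning (cost-complexity).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=1)

# Pre-pruning: halt construction early via stopping conditions.
pre = DecisionTreeClassifier(max_depth=4, min_impurity_decrease=0.01,
                             random_state=1).fit(X_tr, y_tr)

# Post-pruning: grow fully, then prune subtrees by cost-complexity,
# choosing the alpha that performs best on a held-out validation set.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)
alphas = [max(a, 0.0) for a in path.ccp_alphas]
best = max(alphas,
           key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=1)
                         .fit(X_tr, y_tr).score(X_val, y_val))
post = DecisionTreeClassifier(ccp_alpha=best, random_state=1).fit(X_tr, y_tr)
print(pre.get_depth(), post.get_depth())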
Post-pruning

• Post-pruning waits until the full decision tree has been built and then prunes the attributes.
• Two techniques:
– Subtree Replacement
– Subtree Raising

Subtree Replacement

• Entire subtree is replaced by a single leaf node

(Figure: subtree rooted at node C, with leaf children 1, 2 and 3, alongside sibling leaves 4 and 5)

Subtree Replacement

• Node 6 (a leaf) replaces the subtree rooted at C

• This generalizes the tree a little more, and may increase accuracy on unseen data

(Figure: pruned tree with leaf node 6 in place of C, alongside nodes 4 and 5)

Subtree Raising

• Entire subtree is raised onto another node

(Figure: subtree rooted at node C, with children 1, 2 and 3, before raising)

Subtree Raising

• Entire subtree is raised onto another node


• This was not discussed in detail as it is not
clear whether this is really worthwhile (as it
is very time consuming)
(Figure: after subtree raising, the children 1, 2 and 3 are attached directly to node A)
Bayesian Classification
• Bayesian classification is a statistical classification approach.
• Bayesian classifiers can predict class membership probabilities, such as the
probability that a given tuple belongs to a particular class.
• Bayesian classification gives high accuracy and speed when
applied to large databases.
• The naïve Bayes classifier assumes that the effect of an attribute value
on a given class is independent of the values of the other attributes.
This assumption is called class conditional independence.
• In text classification, the basic idea of NB approaches is to use the joint
probabilities of words and categories to estimate the probability of a
category given a document.

The Naive Bayes classifier makes two main assumptions:
● Conditional independence: Features are independent of each
other, meaning that the presence or absence of one feature
doesn't affect the presence or absence of another.
● Equal contribution: Each feature contributes equally to the
outcome.

Explanation
The Naive Bayes classifier is a probabilistic classifier that uses
Bayes' theorem. It's called "naive" because these assumptions are
often not true in real-world scenarios.

Bayesian Classification:
• A statistical classifier: performs probabilistic prediction, i.e., predicts class
membership probabilities (e.g., the probability that a given tuple belongs to a
particular class).
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian classifier, has
comparable performance with decision tree and selected neural network
classifiers
• Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct — prior knowledge can be combined
with observed data
• Standard: Even when Bayesian methods are computationally intractable,
they can provide a standard of optimal decision making against which other
methods can be measured

Bayesian Theorem: Basics
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), the probability that
the hypothesis holds given the observed data sample X
• P(H) (prior probability), the initial probability
– E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|H) (likelihood): the probability of
observing the sample X, given that the hypothesis holds
– E.g., Given that X will buy computer, the prob. that X is
31..40, medium income
Bayesian Theorem
• Given training data X, the posterior probability of a hypothesis H, P(H|X),
follows Bayes' theorem:

P(H|X) = P(X|H) P(H) / P(X)

• Informally, this can be written as

posterior = likelihood x prior / evidence

• Predicts that X belongs to class Ci iff the probability P(Ci|X) is the highest among
all the P(Ck|X) for the k classes
• Practical difficulty: require initial knowledge of many probabilities,
significant computational cost.
• Conditional probability is known as the possibility of an event or
outcome happening, based on the existence of a previous event or
outcome
• Conditional probability is often portrayed as the "probability of A given B,"
notated as P(A|B).
Naïve Bayesian Classifier
• Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-attribute
vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e.,
the maximal P(Ci|X)
• This can be derived from Bayes' theorem:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

• Since P(X) is constant for all classes, only P(X|Ci) P(Ci)
needs to be maximized
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):

P(X|Ci) = P(x1|Ci) x P(x2|Ci) x … x P(xn|Ci)
Naïve Bayesian Classifier: Training Dataset

Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’

Data sample
X = (age <=30,
Income = medium,
Student = yes
Credit_rating = Fair)

Naïve Bayesian Classifier: An Example
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
• Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044


P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
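The same arithmetic can be checked with a few lines of plain Python; the dictionary names below are ours, and the conditional probabilities are simply the fractions listed on this slide.

# Naive Bayes by hand: prior x product of per-attribute conditionals.
priors = {"yes": 9/14, "no": 5/14}
cond = {
    "yes": {"age<=30": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9},
    "no":  {"age<=30": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5},
}
x = ["age<=30", "income=medium", "student=yes", "credit=fair"]

scores = {}
for c in priors:
    p = priors[c]                     # P(Ci)
    for value in x:                   # multiply by each P(xk | Ci)
        p *= cond[c][value]
    scores[c] = p

print(scores)                         # {'yes': ~0.028, 'no': ~0.007}
print("predicted class:", max(scores, key=scores.get))   # 'yes'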
Play-tennis example: estimating P(xi|C)
Two classes: p (play) and n (don't play); P(p) = 9/14, P(n) = 5/14

outlook:      P(sunny|p) = 2/9      P(sunny|n) = 3/5
              P(overcast|p) = 4/9   P(overcast|n) = 0
              P(rain|p) = 3/9       P(rain|n) = 2/5
temperature:  P(hot|p) = 2/9        P(hot|n) = 2/5
              P(mild|p) = 4/9       P(mild|n) = 2/5
              P(cool|p) = 3/9       P(cool|n) = 1/5
humidity:     P(high|p) = 3/9       P(high|n) = 4/5
              P(normal|p) = 6/9     P(normal|n) = 1/5
windy:        P(true|p) = 3/9       P(true|n) = 3/5
              P(false|p) = 6/9      P(false|n) = 2/5
Play-tennis example: classifying X

● An unseen sample X = <rain, hot, high, false>

● P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
  = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582

● P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
  = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286

● Sample X is classified in class n (don’t play)
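For comparison, a hedged scikit-learn sketch using CategoricalNB; the 14-row weather table is not reproduced on these slides, so the rows below are the standard play-tennis data (they yield exactly the fractions tabulated above), and alpha is set near zero so the estimates are plain relative frequencies rather than Laplace-smoothed ones.

import numpy as np
from sklearn.naive_bayes import CategoricalNB

outlook = {"sunny": 0, "overcast": 1, "rain": 2}
temp    = {"hot": 0, "mild": 1, "cool": 2}
humid   = {"high": 0, "normal": 1}
windy   = {"false": 0, "true": 1}

rows = [  # (outlook, temperature, humidity, windy, class)
    ("sunny", "hot", "high", "false", "n"), ("sunny", "hot", "high", "true", "n"),
    ("overcast", "hot", "high", "false", "p"), ("rain", "mild", "high", "false", "p"),
    ("rain", "cool", "normal", "false", "p"), ("rain", "cool", "normal", "true", "n"),
    ("overcast", "cool", "normal", "true", "p"), ("sunny", "mild", "high", "false", "n"),
    ("sunny", "cool", "normal", "false", "p"), ("rain", "mild", "normal", "false", "p"),
    ("sunny", "mild", "normal", "true", "p"), ("overcast", "mild", "high", "true", "p"),
    ("overcast", "hot", "normal", "false", "p"), ("rain", "mild", "high", "true", "n"),
]
X = np.array([[outlook[o], temp[t], humid[h], windy[w]] for o, t, h, w, _ in rows])
y = np.array([c for *_, c in rows])

clf = CategoricalNB(alpha=1e-9).fit(X, y)
x_new = np.array([[outlook["rain"], temp["hot"], humid["high"], windy["false"]]])
print(clf.predict(x_new))        # ['n'] -- don't play, as derived above
print(clf.predict_proba(x_new))  # normalized versions of 0.018286 and 0.010582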

Naïve Bayes Classifier: Comments
• Advantages
– Easy to implement
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore
loss of accuracy
– Practically, dependencies exist among variables
• E.g., in hospital patient data: profile (age, family history, etc.),
symptoms (fever, cough, etc.), and disease (lung cancer,
diabetes, etc.)
• Dependencies among these cannot be modeled by Naïve Bayes
Classifier

Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)

Example of Confusion Matrix:

Actual class \ Predicted class   buy_computer = yes   buy_computer = no    Total
buy_computer = yes                      6954                  46            7000
buy_computer = no                        412                2588            3000
Total                                   7366                2634           10000

• Given m classes, an entry, CMi,j in a confusion matrix


indicates # of tuples in class i that were labeled by the
classifier as class j
• May have extra rows/columns to provide totals
Sensitivity:
The sensitivity of a classifier is the ratio of the number correctly identified as positive to the number actually positive.
Sensitivity = TP / (TP + FN)

Specificity:
The specificity of a classifier is the ratio of the number correctly classified as negative to the number actually negative.
Specificity = TN / (TN + FP)

Recall:
Recall and sensitivity are one and the same.
Recall = TP / (TP + FN)

Precision:
The proportion correctly classified as positive out of all tuples predicted positive.
Precision = TP / (TP + FP)
Classifier Evaluation Metrics: Accuracy, Error Rate,
Sensitivity and Specificity

Confusion matrix (actual \ predicted):
A\P    C     ¬C
C      TP    FN    P
¬C     FP    TN    N
       P'    N'    All

• Classifier Accuracy (recognition rate): percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN) / All
• Error rate: 1 – accuracy, i.e.
  Error rate = (FP + FN) / All

■ Class Imbalance Problem:
  ■ One class may be rare, e.g. fraud
  ■ Significant majority of the negative class and minority of the positive class
  ■ Sensitivity: True Positive recognition rate; Sensitivity = TP/P
  ■ Specificity: True Negative recognition rate; Specificity = TN/N
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
  Precision = TP / (TP + FP)
• Recall: completeness – what % of positive tuples did the
classifier label as positive?
  Recall = TP / (TP + FN)
• Perfect score is 1.0
• Inverse relationship between precision & recall
• F measure (F1 or F-score): harmonic mean of precision and recall
  F1 = 2 x precision x recall / (precision + recall)
• Fß: weighted measure of precision and recall
  Fß = (1 + ß²) x precision x recall / (ß² x precision + recall)
– assigns ß times as much weight to recall as to precision

Classifier Evaluation Metrics: Example

Actual class \ Predicted class   cancer = yes   cancer = no    Total   Recognition (%)
cancer = yes                           90            210         300    30.00 (sensitivity)
cancer = no                           140           9560        9700    98.56 (specificity)
Total                                 230           9770       10000    96.50 (accuracy)

– Precision = 90/230 = 39.13%      Recall = 90/300 = 30.00%
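The figures above can be verified with a few lines of plain Python, with TP, FN, FP and TN taken from this confusion matrix.

TP, FN, FP, TN = 90, 210, 140, 9560
All = TP + FN + FP + TN

accuracy    = (TP + TN) / All      # 0.965
error_rate  = (FP + FN) / All      # 0.035
sensitivity = TP / (TP + FN)       # 0.3000 (= recall)
specificity = TN / (TN + FP)       # 0.9856
precision   = TP / (TP + FP)       # 0.3913
f1 = 2 * precision * sensitivity / (precision + sensitivity)   # ~0.34

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} "
      f"specificity={specificity:.4f} precision={precision:.4f} f1={f1:.4f}")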

Predictive Model:
• A prediction model makes predictions about future outcomes
using historical data combined with statistical modeling.
• Predictive analytics uses predictors or known features to create
predictive models that will be used in obtaining an output.
• Two of the most widely used predictive modeling techniques are
regression and neural networks.
● Predictive modeling: Predict data values or construct generalized
linear models based on the database data.

What Is Prediction?
● Prediction is similar to classification
– First, construct a model
– Second, use model to predict unknown value
● Major method for prediction is regression
– Linear and multiple regression
– Non-linear regression
● Prediction is different from classification
– Classification refers to predict categorical class label
– Prediction models continuous-valued functions

• Representing Linear Regression Model-
• Regression is a technique that helps in finding the
relationship between independent and dependent variables.
• A linear regression model represents the linear relationship
between a dependent variable and independent variable(s) via a
sloped straight line.
• The sloped straight line representing the linear relationship
that best fits the given data is called the regression line.

Regression Models

• 1 explanatory variable → Simple regression: Linear or Non-linear
• 2+ explanatory variables → Multiple regression: Linear or Non-linear

• Linear regression is the most basic and commonly used predictive analysis.
One variable is considered to be an explanatory variable, and the other is
considered to be a dependent variable.

• Simple linear regression


• One dependent variable (interval or ratio)
• One independent variable (interval or ratio or dichotomous)

• Multiple linear regression


• One dependent variable (interval or ratio)
• Two or more independent variables (interval or ratio or dichotomous)

• Logistic regression
• One dependent variable (binary)
• Two or more independent variable(s) (interval or ratio)

Simple Linear Regression

y’ = b0 + b1X ± є

where:
• y (vertical axis) is the dependent variable and X (horizontal axis) is the independent variable
• b0 is the y-intercept
• b1 is the slope = ∆y / ∆x
• є is the error term

The output of a regression is a function that predicts the dependent variable


based upon values of the independent variables.
Simple regression fits a straight line to the data.

Linear Regression Model

1. Relationship Between Variables Is a Linear Function

Yi = β0 + β1Xi + εi

where Yi is the dependent (response) variable, Xi is the independent (explanatory) variable,
β0 is the population Y-intercept, β1 is the population slope, and εi is the random error.

Simple Linear Regression

(Figure: observed points y scattered around the fitted line of predictions ŷ)

The function will make a prediction for each observed data point.
The observation is denoted by y and the prediction is denoted by ŷ.

Simple Linear Regression

(Figure: the prediction error ε is the vertical distance between the observation y and the prediction ŷ)

For each observation, the variation can be described as:
y = ŷ + ε
Actual = Explained + Error
1. Simple Linear Regression-
•In simple linear regression, the dependent variable depends
only on a single independent variable.
•For simple linear regression, the form of the model is-
•Y = β0 + β1X
•Here,
•Y is a dependent variable.
•X is an independent variable.
•β0 and β1 are the regression coefficients.
•β0 is the intercept or the bias that fixes the offset to a line.
•β1 is the slope or weight that specifies the factor by which X
has an impact on Y.

• Simple linear regression :

• The formula for a simple linear regression is:

y = B0 + B1x + e
• y is the predicted value of the dependent variable (y) for any given
value of the independent variable (x).
• B0 is the intercept, the predicted value of y when the x is 0.
• B1 is the regression coefficient – how much we expect y to change as
x increases.
• x is the independent variable ( the variable we expect is influencing
y).
• e is the error of the estimate, or how much variation there is in our
estimate of the regression coefficient.
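As an illustration (not from the slides), a minimal NumPy sketch that estimates B0 and B1 by ordinary least squares; the x and y arrays are made-up example data.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope B1
b0 = y.mean() - b1 * x.mean()                                               # intercept B0

y_pred = b0 + b1 * x
print(f"B0 = {b0:.3f}, B1 = {b1:.3f}")
print("residuals (e):", np.round(y - y_pred, 3))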
• Case-01: β1 < 0
• It indicates that variable X has negative impact on Y.
• If X increases, Y will decrease and vice-versa.

• Case-02: β1 = 0
• It indicates that variable X has no impact on Y.
• If X changes, there will be no change in Y.

• Case-03: β1 > 0
• It indicates that variable X has positive impact on Y.
• If X increases, Y will increase and vice-versa.

• 2. Multiple Linear Regression-
• In multiple linear regression, the dependent variable depends
on more than one independent variable.
• For multiple linear regression, the form of the model is-
• Y = β0 + β1X1 + β2X2 + β3X3 + …… + βnXn
• Here,
• Y is a dependent variable.
• X1, X2, …., Xn are independent variables.
• β0, β1,…, βn are the regression coefficients.
• βj (1<=j<=n) is the slope or weight that specifies the factor by
which Xj has an impact on Y.

• Formula for linear regression equation is given by:

• Multiple Linear Regression
• Multiple linear regression is a method we can use to quantify
the relationship between two or more predictor variables and
a response variable.
• Suppose we have the following dataset with one response
variable y and two predictor variables X1 and X2:

• Use the following steps to fit a multiple linear regression
model to this dataset.
• Step 1: Calculate X1², X2², X1y, X2y and X1X2.
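The worked tables for this and the following steps are not reproduced in this transcript, so here is a hedged NumPy sketch that fits b0, b1 and b2 for two predictors by least squares; the small dataset is invented for illustration only.

import numpy as np

X1 = np.array([60, 62, 67, 70, 71, 72, 75, 78], dtype=float)
X2 = np.array([22, 25, 24, 20, 15, 14, 14, 11], dtype=float)
y  = np.array([140, 155, 159, 179, 192, 200, 212, 215], dtype=float)

# Design matrix with a leading column of ones for the intercept b0.
A = np.column_stack([np.ones_like(X1), X1, X2])
b0, b1, b2 = np.linalg.lstsq(A, y, rcond=None)[0]

print(f"y = {b0:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")
print("prediction for X1=70, X2=18:", round(b0 + b1 * 70 + b2 * 18, 2))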

• How to Interpret a Multiple Linear Regression
Equation

• Evaluation Metric for Regression:
• Mean Absolute Error(MAE)
• Mean Squared Error(MSE)
• RMSE
• RMSLE
• R squared
• Adjusted R Squared

• 1) Mean Absolute Error(MAE)
• MAE is a very simple metric which calculates the average
absolute difference between the actual and predicted values:
MAE = (1/n) Σ |y_i – ŷ_i|

• 2) Mean Squared Error(MSE)
• MSE is a widely used and very simple metric, a small variation
on mean absolute error: it averages the squared difference
between the actual and predicted values:
MSE = (1/n) Σ (y_i – ŷ_i)²

• 3) Root Mean Squared Error(RMSE)
• As the name suggests, RMSE is simply the square root of the
mean squared error:
RMSE = √MSE = √((1/n) Σ (y_i – ŷ_i)²)
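A short NumPy sketch computing all three metrics on made-up actual/predicted values:

import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mae  = np.mean(np.abs(y_true - y_pred))   # mean absolute error
mse  = np.mean((y_true - y_pred) ** 2)    # mean squared error
rmse = np.sqrt(mse)                       # root mean squared error

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")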
