Decision Tree Learning – CART & CHAID

U DINESH KUMAR
Classification/Decision Trees
• Introduction to Decision Trees.

• Classification/Decision Tree approach for classification.

• Classification and Regression Tree (CART)

• Chi-Square Automatic Interaction Detection (CHAID)


Classification/Decision Trees
• Decision trees (aka decision tree learning) are a collection of predictive analytics techniques that use a tree-like graph to predict the value of the outcome variable (or target variable) based on the values of predictors or features.

• When the response (outcome) variable takes discrete values, the trees are called classification trees.
Classification Trees – Generic Steps
• Start with a root node containing all the data (call this subset S0).

• Split the root node into two or more branches (connected by edges) using a split criterion, resulting in internal nodes.

• Each internal node is divided further until no further splitting is possible. The terminal nodes (aka leaf nodes) have no outgoing edges.

• Terminal nodes are used for generating predictions of the outcome variable and for deriving business rules.

• Stopping criteria (tree pruning), a process of restricting the size of the tree, are used to avoid large trees and overfitting the data. These generic steps are sketched in code below.
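To make the generic steps concrete, here is a minimal Python sketch of recursive tree growing with the Gini index. This is an illustration only, not the exact CART or CHAID procedure; the helper names (gini, best_split, grow) and the default stopping values are assumptions:

from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Search every (feature, threshold) pair for the split that most
    reduces the weighted impurity of the two child nodes."""
    best = None
    parent = gini(labels)
    for f in range(len(rows[0])):
        for t in sorted(set(r[f] for r in rows)):
            left = [y for r, y in zip(rows, labels) if r[f] < t]
            right = [y for r, y in zip(rows, labels) if r[f] >= t]
            if not left or not right:
                continue
            p_left = len(left) / len(labels)
            gain = parent - p_left * gini(left) - (1 - p_left) * gini(right)
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

def grow(rows, labels, depth=0, max_depth=2, min_node=2):
    """Generic steps: make a terminal node, or split and recurse."""
    split = best_split(rows, labels)
    if depth >= max_depth or len(labels) < min_node or split is None or split[0] <= 0:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] >= t]
    return {"feature": f, "threshold": t,
            "left": grow(*zip(*left), depth + 1, max_depth, min_node),
            "right": grow(*zip(*right), depth + 1, max_depth, min_node)}

For example, grow([(1,), (2,), (3,), (4,)], [0, 0, 1, 1]) returns {'feature': 0, 'threshold': 3, 'left': 0, 'right': 1}: one internal node and two terminal nodes.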
Classification and Regression Trees (CART)
• Splits are chosen based on either the Gini index or the entropy criterion (there are a few more).

• CART builds a binary tree, whereas CHAID can split a node into more than two branches.
Gini Index (Classification Impurity)
• The Gini index is used to measure the impurity at a node (in a classification problem) and is given by:

$$\text{Gini}(k) = \sum_{j=1}^{J} P(j \mid k)\,\bigl(1 - P(j \mid k)\bigr)$$

where $P(j \mid k)$ is the proportion of category $j$ in node $k$. A smaller Gini index implies less impurity.


Entropy (Impurity Measure)
• Entropy is another impurity measure that is frequently used. Entropy at node $k$ is given by:

$$\text{Entropy}(k) = -\sum_{j=1}^{c} P(j \mid k)\,\log_2 P(j \mid k)$$
Gini Index Calculation
Number of classes: 2 (say 0 and 1). Consider a node $k$ with 10 ones and 90 zeros.

$$\text{Gini}(k) = \sum_{j=1}^{2} P(j \mid k)\,(1 - P(j \mid k)) = 2\,P(1 \mid k)\,(1 - P(1 \mid k)) = 2 \times 0.1 \times 0.9 = 0.18$$

A smaller number implies less impurity.
Entropy Calculation
Number of classes: 2 (say 0 and 1). Consider a node $k$ with 10 ones and 90 zeros.

$$\text{Entropy}(k) = -\sum_{j=1}^{2} P(j \mid k)\,\log_2 P(j \mid k) = -0.1 \log_2(0.1) - 0.9 \log_2(0.9) = 0.4689$$

Note that this value is higher than the Gini index for the same node.
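Both worked values can be checked with a few lines of Python (a quick verification using only the standard library):

import math

p = [0.1, 0.9]  # class proportions at node k (10 ones, 90 zeros)

gini = sum(pj * (1 - pj) for pj in p)
entropy = -sum(pj * math.log2(pj) for pj in p)

print(round(gini, 4))     # 0.18
print(round(entropy, 4))  # 0.469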
Classification Tree Logic
A node $t$ is split into a left child $t_L$ and a right child $t_R$. The split is chosen to maximize the reduction in impurity:

$$\max\;\bigl[\, i(t) - P_L \, i(t_L) - P_R \, i(t_R) \,\bigr]$$

where
$i(\cdot)$ = impurity at node $(\cdot)$
$P_L$ = proportion of observations in the left node
$P_R$ = proportion of observations in the right node
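As a small numeric sketch of this criterion (the counts below are made up for illustration): splitting a node with 50/50 class counts into children with 40/10 and 10/40 class counts gives the following reduction in Gini impurity:

def gini2(p):
    """Gini impurity of a binary node with P(class 1) = p."""
    return 2 * p * (1 - p)

# Hypothetical split: the parent has 50 ones and 50 zeros; the left child
# gets 40 ones / 10 zeros, the right child gets 10 ones / 40 zeros.
i_parent = gini2(50 / 100)                        # 0.5
i_left, i_right = gini2(40 / 50), gini2(10 / 50)  # 0.32 each
p_left, p_right = 50 / 100, 50 / 100

reduction = i_parent - p_left * i_left - p_right * i_right
print(round(reduction, 2))  # 0.18 -- the split with the largest reduction wins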
CART Example: Marketing Head's Conundrum
Business Context
• Sales conversion may take more than a year in a B2B sales environment.

• The company invests a lot of money to convert a lead.

• The success rate (conversion rate) can be very low.
Data
1. Product
2. Industry
3. Region
4. Segment (combination of product, industry and region)
5. Profit of the customer
6. Sales value
7. Profit percentage (profit expected from the sale as a percentage of sales value)
8. Joint bid proportion (proportion of profit in case of a joint bid)
9. Sales outcome (outcome variable)
Box Plot
[Box plots: distributions of profit of the customer and sales value for lost and won cases]

Logistic Regression Output
[Logistic regression output tables for the sales conversion model]
CART Model
Hyperparameter values:

Hyperparameter   Value
Max Depth        2
Min Split        20
Min Bucket       7
Complexity       0.01

Terminal nodes are 2, 6 and 7.
Rule number 7 [Sales.Outcome=1, cover=699 (34%), prob=0.91]:
  Profit.of.customer....Mn < 1.015
  Profit.. < 54.5

Rule number 6 [Sales.Outcome=0, cover=350 (17%), prob=0.47]:
  Profit.of.customer....Mn < 1.015
  Profit.. >= 54.5

Rule number 2 [Sales.Outcome=0, cover=1026 (49%), prob=0.22]:
  Profit.of.customer....Mn >= 1.015
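The hyperparameter names above (Min Split, Min Bucket, Complexity) suggest the model was fitted with R's rpart package. A roughly equivalent model can be sketched in Python with scikit-learn; the file and column names below are assumptions for illustration, and scikit-learn's ccp_alpha is only loosely analogous to rpart's complexity parameter cp:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical file and column names -- adjust to the actual dataset.
df = pd.read_csv("b2b_sales_leads.csv")
X = df[["profit_of_customer_mn", "profit_pct"]]
y = df["sales_outcome"]  # 1 = won, 0 = lost

# Mirrors the slide's settings: Max Depth 2, Min Split 20, Min Bucket 7.
cart = DecisionTreeClassifier(
    criterion="gini",
    max_depth=2,
    min_samples_split=20,
    min_samples_leaf=7,
    ccp_alpha=0.01,  # only roughly analogous to rpart's cp = 0.01
)
cart.fit(X, y)

# Print the fitted tree as if/else rules, similar to the rule listing above.
print(export_text(cart, feature_names=list(X.columns)))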
Error Matrix (in Validation Data)

                 Predicted
Actual        0        1      Error
0           189       35      15.6%
1            18      202       8.2%

Overall error = 11.9%. AUC is 0.91 for CART.
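The error rates in this matrix follow directly from the counts; a quick check in Python:

# Confusion matrix from the validation data: rows = actual, cols = predicted.
cm = [[189, 35],   # actual 0: 35 of 224 misclassified
      [18, 202]]   # actual 1: 18 of 220 misclassified

class0_error = cm[0][1] / sum(cm[0])                 # 35 / 224 = 0.15625
class1_error = cm[1][0] / sum(cm[1])                 # 18 / 220 = 0.0818...
overall = (cm[0][1] + cm[1][0]) / sum(map(sum, cm))  # 53 / 444 = 0.1194...

print(f"{class0_error:.1%}, {class1_error:.1%}, {overall:.1%}")  # 15.6%, 8.2%, 11.9%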
CHAID
Chi-Square Automatic Interaction Detection

Introduction to CHAID
• CHAID is one of the popular decision tree techniques used for solving classification problems.

• Early versions of CHAID used the chi-square test of independence for splitting. CHAID was first presented in the article "An Exploratory Technique for Investigating Large Quantities of Categorical Data" by G. V. Kass in Applied Statistics (1980).
CHAID
• CHAID partitions the data into mutually exclusive, exhaustive subsets that best describe the dependent categorical variable.

• CHAID is an iterative procedure that examines the predictors (or classification variables) and uses them in the order of their statistical significance.
CHAID Splitting Rule
• The splitting criterion in CHAID depends on the type of the dependent variable (or target variable):
  – For a continuous dependent variable, the F test is used.
  – For a categorical dependent variable, the chi-square test of independence is used.
CHAID Procedure
• Step 1: Examine each feature for its statistical significance with the outcome variable, using the F test (for a continuous dependent variable) or the chi-square test (for a categorical dependent variable).

• Step 2: Determine the most significant of the features (the feature with the smallest p-value after Bonferroni correction).

• Step 3: Divide the data by the levels of the most significant feature. Each of these groups is then examined further individually.

• Step 4: For each sub-group, determine the most significant feature from the remaining features and divide the data again.

• Step 5: Repeat Step 4 until a stopping criterion is reached. (A sketch of the split-selection step follows below.)
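A minimal sketch of Steps 1 and 2 for a categorical target, using scipy's chi-square test with a Bonferroni correction. The DataFrame and column names are assumptions, and real CHAID implementations also merge predictor categories before testing (see the merging slide later):

import pandas as pd
from scipy.stats import chi2_contingency

def most_significant_feature(df, features, target):
    """Return (feature, Bonferroni-adjusted p-value) of the best split."""
    results = {}
    for f in features:
        table = pd.crosstab(df[f], df[target])     # contingency table
        _, p, _, _ = chi2_contingency(table)       # chi-square test of independence
        results[f] = min(1.0, p * len(features))   # Bonferroni adjustment
    best = min(results, key=results.get)
    return best, results[best]

# Hypothetical usage (Steps 3-5 then split by df[best]'s levels and recurse):
# best, p = most_significant_feature(df, ["Product", "Industry", "Region"],
#                                    "Sales.Outcome")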


CHAID Example
• Marketing Head’s Conundrum

• Target Variable – Sales Conversion


Chi-Square Test of Independence
• The chi-square test of independence starts with the assumption that there is no relationship between the two variables.

• For example, we assume that there is no relationship between the profit of the customer and sales conversion.
Contingency Table

              Sales Conversion
Profit      0 (Lost)   1 (Won)   Total
<= 1 MN        325       1129     1454
> 1 MN        1190        321     1511
Total         1515       1450     2965

H0: Profit of the customer and sales conversion are independent.
HA: Profit of the customer and sales conversion are dependent.


Chi-Square Statistic

$$\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{$i$th row sum}) \times (\text{$j$th column sum})}{\text{total sum}}$$

where $O_{ij}$ is the observed frequency and $E_{ij}$ is the expected frequency.


Observed Frequency

Profit      0 (Lost)   1 (Won)   Total
<= 1 MN        325       1129     1454
> 1 MN        1190        321     1511
Total         1515       1450     2965

Expected Frequency ($E_{ij}$ = row sum × column sum / total)

Profit      0 (Lost)   1 (Won)   Total
<= 1 MN      742.94     711.06    1454
> 1 MN       772.06     738.94    1511
Total          1515       1450    2965

$$\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 863.92$$

P-value = $6.8 \times 10^{-190}$; thus we reject the null hypothesis that profit of the customer and sales conversion are independent.
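As a sketch, the same test can be run with scipy on the observed counts above. Note that scipy applies Yates' continuity correction to 2×2 tables by default, so pass correction=False for the plain Pearson statistic:

from scipy.stats import chi2_contingency

# Observed frequencies from the contingency table above.
observed = [[325, 1129],   # profit <= 1 MN: lost, won
            [1190, 321]]   # profit  > 1 MN: lost, won

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p, dof)   # a very large statistic and a tiny p-value: reject H0
print(expected)       # expected frequencies E_ij = row sum * column sum / total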
Business Rules
• Node 1: If the profit of the customer is less than 0.35 mn, then predicted Yi = 1, with P(Yi = 1) = 0.9798.

• Node 3: If the profit of the customer is between 1.01 and 1.43 mn, then predicted Yi = 0, with P(Yi = 1) = 0.2981.

• Node 8: If the profit of the customer is between 1.01 and 1.43 mn and the profit % is more than 50, then predicted Yi = 1, with P(Yi = 1) = 0.9143.
Merging – CHAID
• CHAID uses both splitting and merging steps.

• In merging, the least significantly different groups (categories of a predictor) are merged to form one class, as sketched below.
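A minimal sketch of one merging pass, assuming the predictor is stored as plain string categories; this is an illustration, not the full Kass algorithm. Each pair of categories is tested against the target, and the pair that is least significantly different is merged, provided its p-value exceeds a merge threshold:

from itertools import combinations
import pandas as pd
from scipy.stats import chi2_contingency

def merge_once(df, feature, target, alpha_merge=0.05):
    """Merge the least significantly different pair of categories, if any."""
    worst_pair, worst_p = None, alpha_merge
    for a, b in combinations(df[feature].unique(), 2):
        sub = df[df[feature].isin([a, b])]
        table = pd.crosstab(sub[feature], sub[target])  # 2 x c table
        _, p, _, _ = chi2_contingency(table)
        if p > worst_p:                     # least different pair so far
            worst_pair, worst_p = (a, b), p
    if worst_pair is not None:
        a, b = worst_pair
        df[feature] = df[feature].replace(b, a)  # relabel b's rows as a
    return worst_pair  # None when no pair is similar enough to merge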
CHAID Stopping Criteria
• The maximum tree depth is reached (which is pre-defined).

• The minimum number of cases required to be a parent node is reached (again, pre-defined).

• The minimum number of cases required to be a child node is reached.