Tree-based methods:
CHAID

AID    Morgan and Sonquist (1963),
       Journal of the American Statistical Association, 58, 415-434
       Sonquist and Morgan (1964),
       Monograph 35, ISR, U. of Michigan
THAID  Messenger and Mandell (1972),
       Journal of the American Statistical Association, 67, 768-772
       Morgan and Messenger (1973),
       THAID, SRC-ISR, U. of Michigan
CHAID  Kass, G. V. (1980),
       Applied Statistics, 29, 119-127
CART   Breiman et al. (1984),
       Classification and Regression Trees, Wadsworth
categorical response variable
Credit Rating:
"bad" "poor" "good" "very good"
categorical explanatory variables
create a decision tree
Dividing the cases that reach a certain node in
the tree.
Algorithm:
(Step 1) Cross tabulate the response variable
with each of the explanatory variables.

            Fico < 700   700-750   Fico > 750
  Bad
  Poor
  Good
  V.Good

            NT = 0   NT ≥ 1
  Bad
  Poor
  Good
  V.Good

When there are more than two columns, find
the "best" subtable formed by combining column categories.
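The cross tabulation in Step 1 can be sketched in Python as follows. This is only an illustration: the helper `cross_tabulate`, the field names `rating`, `fico` and `nt`, and the six cases are all made up here; only the category labels echo the credit rating example.

```python
from collections import Counter

def cross_tabulate(cases, response, predictor):
    """Return {response category: {predictor category: count}} for one node."""
    counts = Counter((case[response], case[predictor]) for case in cases)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    # A Counter returns 0 for pairs that never occur, so empty cells are filled in.
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

# Hypothetical cases reaching one node (made-up data, not from the slides).
cases = [
    {"rating": "bad",       "fico": "<700",    "nt": "0"},
    {"rating": "bad",       "fico": "<700",    "nt": ">=1"},
    {"rating": "poor",      "fico": "700-750", "nt": "0"},
    {"rating": "good",      "fico": "700-750", "nt": "0"},
    {"rating": "good",      "fico": ">750",    "nt": ">=1"},
    {"rating": "very good", "fico": ">750",    "nt": "0"},
]

# One table per explanatory variable, as Step 1 asks.
tables = {x: cross_tabulate(cases, "rating", x) for x in ("fico", "nt")}
for name, table in tables.items():
    print(name, table)
```

Each table has one row per response category and one column per category of the explanatory variable, which is the layout the later steps work on.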
(Step 2) This is applied to each table with
more than 2 columns.
Compute Pearson $X^2$ tests for independence
for each allowable subtable.

  Subtable 1 ($X_1^2$): rows bad, poor, good, v.good; columns Fico < 700, 700-750
  Subtable 2 ($X_2^2$): rows bad, poor, good, v.good; columns 700-750, > 750

Look for the smallest $X^2$ value. If it is not
significant, combine the column categories.

  Combined table: rows bad, poor, good, v.good; columns < 750, > 750

Repeat step 2 if the new table has more than two columns.

(Step 3) Allows categories combined at
step 2 to be broken apart.
For each compound category consisting of at
least 3 of the original categories,
find the "most significant" binary split.
- If $X^2$ is significant, implement the
  split and return to step 2.
- Otherwise retain the compound
  categories for this variable, and move
  on to the next variable.
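A minimal Python sketch of the merging loop in Step 2, for an ordinal predictor where only adjacent columns may be combined. The counts are hypothetical, the function names are introduced here, and the 5% critical value 7.815 (chi-squared with 3 d.f., for a 4 x 2 subtable) is an assumed significance level; the slides do not fix one. Step 3 (re-splitting compound categories) is not covered.

```python
def pearson_chi2(columns):
    """Pearson X^2 for a table given as a list of columns of counts."""
    n_rows = len(columns[0])
    col_totals = [sum(col) for col in columns]
    row_totals = [sum(col[i] for col in columns) for i in range(n_rows)]
    n = sum(col_totals)
    x2 = 0.0
    for j, col in enumerate(columns):
        for i, observed in enumerate(col):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                x2 += (observed - expected) ** 2 / expected
    return x2

def merge_columns(labels, columns, critical):
    """Merge the least different adjacent pair while it is not significant."""
    labels, columns = list(labels), [list(c) for c in columns]
    while len(columns) > 2:
        stats = [pearson_chi2([columns[j], columns[j + 1]])
                 for j in range(len(columns) - 1)]
        j = min(range(len(stats)), key=stats.__getitem__)
        if stats[j] >= critical:
            break  # smallest X^2 is significant: keep the columns apart
        columns[j:j + 2] = [[a + b for a, b in zip(columns[j], columns[j + 1])]]
        labels[j:j + 2] = [labels[j] + "+" + labels[j + 1]]
    return labels, columns

CRITICAL_5PCT_DF3 = 7.815  # assumed level; upper 5% point of chi-squared, 3 d.f.

# Hypothetical counts; rows are bad, poor, good, v.good.
labels = ["<700", "700-750", ">750"]
columns = [[40, 30, 20, 10], [35, 32, 22, 11], [5, 15, 40, 40]]
merged_labels, merged_columns = merge_columns(labels, columns, CRITICAL_5PCT_DF3)
print(merged_labels, merged_columns)
```

With these made-up counts the first two Fico columns have nearly the same rating distribution, so they are combined and the loop stops at two columns.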
(Step 4) You have now completed the
"optimal" combining of categories
for each explanatory variable.

  Reduced table: rows bad, poor, good, v.good; columns C1+C2, C3, C4+C5+C6

Find the "most significant" of
these "optimally" merged explanatory variables.
Compute a "Bonferroni" adjusted
chi-squared test of independence for
the reduced table for each explanatory variable.

(Step 5) Use the "most significant" variable
in step 4 to split the node with respect to the "merged" categories
for that variable.

  Offspring nodes: C1+C2, C3, C4+C5+C6

- repeat steps 1-5 for
  each of the offspring
  nodes.
Stop if
- no variable is significant in step 4.
- the number of cases reaching a node is below a specified limit.
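Steps 4 and 5 and the stopping rules can be sketched as below. The Bonferroni multiplier is taken to be the number of ways c original categories can be merged into r groups (adjacent merges only for an ordinal predictor, any merges for a nominal one), which is the usual CHAID adjustment; the slides do not spell the formula out, so treat it as an assumption. The function names, the 0.05 level, the minimum node size of 50 and the candidate p-values are all hypothetical.

```python
from math import comb, factorial

def bonferroni_multiplier(c, r, ordinal):
    """Number of ways to merge c original categories into r groups."""
    if ordinal:
        # Only adjacent categories may be merged: choose r - 1 cut points.
        return comb(c - 1, r - 1)
    # Nominal: any grouping is allowed (Stirling number of the second kind).
    total = sum((-1) ** i * comb(r, i) * (r - i) ** c for i in range(r + 1))
    return total // factorial(r)

def choose_split(candidates, n_cases, alpha=0.05, min_cases=50):
    """Return the variable to split on, or None if a stopping rule applies.

    candidates: (name, unadjusted p-value, c, r, ordinal) per variable.
    """
    if n_cases < min_cases:
        return None  # too few cases reach this node
    adjusted = [(min(1.0, p * bonferroni_multiplier(c, r, ordinal)), name)
                for name, p, c, r, ordinal in candidates]
    best_p, best_name = min(adjusted)
    return best_name if best_p <= alpha else None  # None: nothing significant

# Hypothetical candidates after merging (made-up p-values).
candidates = [
    ("fico", 0.004, 3, 2, True),     # adjusted: 0.004 * 2  = 0.008
    ("region", 0.001, 6, 3, False),  # adjusted: 0.001 * 90 = 0.09
]
print(choose_split(candidates, n_cases=400))
print(choose_split(candidates, n_cases=20))
```

The example shows why the adjustment matters: the nominal variable has the smaller raw p-value, but six categories can be merged into three groups in 90 ways, so after adjustment the ordinal variable is the more significant one.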
Summary:
CHAID is an algorithm
Must categorize every variable
- ordinal variables
- nominal variables
At each node it tries to find
- best explanatory variable
- best merger of categories
Try to make the distributions of cases
across the response
categories as different as possible in the
"offspring" nodes.

TREEDISC macro in SAS
- modified version of CHAID
- now part of the data mining package
- application to the Wisconsin Driver data
    response: traffic violations in 1974
      (1) at least one
      (0) none
    explanatory variables:
      sex
      age
      history of cardiovascular disease
      place of residence
- missing values are treated as
  another category
/* This program uses the TREEDISC
   macro in SAS to apply a modified
   CHAID algorithm to the Wisconsin
   driver data. This code is stored
   in the file
   chaidwis.sas                        */

/* First set some graphics options */

/* To print PostScript files in UNIX */
/*
goptions cback=white ctext=black
         targetdevice=ps300 rotate=landscape;
*/

/* To print PostScript files from Windows */
goptions cback=white ctext=black
         device=WIN target=ps
         rotate=landscape;

DATA SET1;
  INFILE 'c:\courses\st557\sas\drivall.dat';
  /* The variable order after AGE is an assumption;
     match it to the layout of drivall.dat. */
  INPUT AGE SEX D V R X;
  LABEL AGE = 'AGE GROUP'
        D = 'DRIVER GROUP'
        V = 'VIOLATION STATUS'
        R = 'RESIDENTIAL AREA'
        X = 'COUNT';
run;

proc format;
  value sex 1 = 'Male'
            2 = 'Female';
  value age 1 = '16-36'
            2 = '36-55'
            3 = 'over 55';
  value d   1 = 'Disease'
            2 = 'Control';
  value v   1 = 'Some'
            2 = 'None';
  value r   1 = '> 150000'
            2 = '39-150000'
            3 = '10-39000'
            4 = '< 10000'
            5 = 'rural';
run;

proc print data=set1;
run;

/* Load in the xmacro file */
%inc 'c:\courses\st557\sas\xmacro.sas';

/* Load in the TREEDISC macro */
%inc 'c:\courses\st557\sas\treedisc.sas';

/* Compute a tree for predicting
   violation status (V) from age, sex,
   disease status (D) and residence (R) */
%treedisc(data=set1, depvar=v, freq=x,
          ordinal=age: r:, nominal=d: sex:,
          outtree=trd, options=noformat,
          trace=long);

/* Draw the tree on one page */
%treedisc(intree=trd, draw=graphics);

/* Draw a larger tree on several pages */
goptions cback=white ctext=black
         device=WIN target=ps rotate=portrait;
%treedisc(intree=trd,
          draw=graphics, pos=90 120);
TREEDISC Analysis

Dependent variable (DV): V
DV values:       1 2
Values of AGE :  1 2 3
Values of D   :  1 2
Values of R   :  1 2 3 4 5
Values of SEX :  1 2

Splits Considered for Node 1

Predictor   Type      Chi-Square   Adjusted p
AGE         Ordinal        57.39       0.0001
SEX         Nominal        36.80       0.0001
D           Nominal         4.40       0.0359
R           Ordinal         2.53       0.4458

Best split: AGE Ordinal with p = 0.0000

New node: 2
   AGE = 1
   DV count:  133   656
New node: 3
   AGE = 2 3
   DV count:  147  1864
Splits Considered for Node 2

Predictor   Type      Chi-Square   Adjusted p
SEX         Nominal        41.59       0.0001
D           Nominal         0.01       0.9193
R           Ordinal         0.15       0.9975

Best split: SEX Nominal with p = 0.0000

New node: 4
   SEX = 2
   DV count:   31   354
New node: 5
   SEX = 1
   DV count:  102   302

Splits Considered for Node 20

Predictor   Type      Chi-Square   Adjusted p
R           Ordinal         1.41       0.7031
D           Nominal         0.06       0.8101

Best split: R Ordinal with p = 0.7031
*** Reject split
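The chi-square values in the trace can be checked by hand from the node counts. The sketch below recomputes the AGE split at node 1 and the SEX split at node 2 from their DV counts, and converts two printed statistics to p-values using the 1 d.f. chi-squared tail. The function names are introduced here. The multiplier of 4 used for R assumes its five categories were merged into two groups (an ordinal variable then has C(4,1) = 4 possible splits); the output does not state this.

```python
from math import comb, erfc, sqrt

def pearson_chi2_2x2(node_a, node_b):
    """Pearson X^2 for two nodes, each given as (count V=1, count V=2)."""
    columns = [node_a, node_b]
    n = sum(sum(col) for col in columns)
    row_totals = [node_a[i] + node_b[i] for i in range(2)]
    x2 = 0.0
    for col in columns:
        for i, observed in enumerate(col):
            expected = row_totals[i] * sum(col) / n
            x2 += (observed - expected) ** 2 / expected
    return x2

def p_value_df1(x2):
    """Upper-tail probability of a chi-squared variable with 1 d.f."""
    return erfc(sqrt(x2 / 2.0))

# Node 1: AGE = 1 against the remaining age groups (DV counts from the trace).
age_split = pearson_chi2_2x2((133, 656), (147, 1864))
# Node 2: SEX = 1 against SEX = 2 within AGE = 1.
sex_split = pearson_chi2_2x2((102, 302), (31, 354))
# D at node 1: two categories, so no Bonferroni adjustment is needed.
p_d = p_value_df1(4.40)
# R at node 1: assumed merge of 5 ordered categories into 2 groups.
p_r = comb(5 - 1, 2 - 1) * p_value_df1(2.53)

print(round(age_split, 2), round(sex_split, 2), round(p_d, 4), round(p_r, 4))
```

The first three results agree with the printed 57.39, 41.59 and 0.0359. The adjusted p-value for R comes out close to the printed 0.4458; the small gap is plausibly because 2.53 is itself rounded.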
TREEDISC Analysis of Dependent Variable (DV) V

V value(s): 1 2
DV counts: 280 2520
Best p-value(s): 0.0001 0.0001

    AGE value(s): 1
    DV counts: 133 656
    Best p-value(s): 0.0001 0.9193

        SEX value(s): 2
        DV counts: 31 354
        Best p-value(s): 0.6064 0.8571

        SEX value(s): 1
        DV counts: 102 302
        Best p-value(s): 0.7334 0.9703

    AGE value(s): 2 3
    DV counts: 147 1864
    Best p-value(s): 0.0001 0.0221

        SEX value(s): 2
        DV counts: 20 563
        Best p-value(s): 0.0856 0.5368

            AGE value(s): 2
            DV counts: 14 284
            Best p-value(s): 0.8083 0.8990

            AGE value(s): 3
            DV counts: 6 279
            Best p-value(s): 0.0264 0.1102

                D value(s): 2
                DV counts: 0 127

                D value(s): 1
                DV counts: 6 152
                Best p-value(s): 0.0592

                    R value(s): 1
                    DV counts: 3 22

                    R value(s): 2 3
                    DV counts: 1 111
                    Best p-value(s): 0.5928

                    R value(s): 5
                    DV counts: 2 19

        SEX value(s): 1
        DV counts: 127 1301
        Best p-value(s): 0.0232 0.1940

            AGE value(s): 3
            DV counts: 69 839
            Best p-value(s): 0.0363 0.8254

                R value(s): 1
                DV counts: 20 139
                Best p-value(s): 0.8899

                R value(s): 2
                DV counts: 49 700
                Best p-value(s): 0.7031 0.8101

            AGE value(s): 2
            DV counts: 58 462
            Best p-value(s): 0.0215 0.7310

                D value(s): 2
                DV counts: 18 217
                Best p-value(s): 0.1317

                D value(s): 1
                DV counts: 40 245
                Best p-value(s): 0.3814
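One way to work with the fitted tree is to hold it as a nested structure, check that the DV counts of the offspring add up to those of their parent, and list the proportion of drivers with a violation (V = 1) in each terminal node. The sketch below does this in Python. The helper names are introduced here, and the nesting is an assumed reading of the output: each group of nodes is placed under the parent whose DV counts it sums to, with value lists copied as printed.

```python
def node(label, v1, v2, *children):
    """A tree node with DV counts (V=1, V=2) and its offspring nodes."""
    return {"label": label, "counts": (v1, v2), "children": list(children)}

tree = node("V = 1 2", 280, 2520,
    node("AGE = 1", 133, 656,
        node("SEX = 2", 31, 354),
        node("SEX = 1", 102, 302)),
    node("AGE = 2 3", 147, 1864,
        node("SEX = 2", 20, 563,
            node("AGE = 2", 14, 284),
            node("AGE = 3", 6, 279,
                node("D = 2", 0, 127),
                node("D = 1", 6, 152,
                    node("R = 1", 3, 22),
                    node("R = 2 3", 1, 111),
                    node("R = 5", 2, 19)))),
        node("SEX = 1", 127, 1301,
            node("AGE = 3", 69, 839,
                node("R = 1", 20, 139),
                node("R = 2", 49, 700)),
            node("AGE = 2", 58, 462,
                node("D = 2", 18, 217),
                node("D = 1", 40, 245)))))

def is_consistent(n):
    """True if every parent's counts equal the sum over its offspring."""
    kids = n["children"]
    if not kids:
        return True
    sums = tuple(sum(k["counts"][i] for k in kids) for i in range(2))
    return sums == n["counts"] and all(is_consistent(k) for k in kids)

def leaves(n, path=()):
    """Yield (path of labels, counts) for every terminal node."""
    path = path + (n["label"],)
    if not n["children"]:
        yield path, n["counts"]
    for k in n["children"]:
        yield from leaves(k, path)

print("consistent:", is_consistent(tree))
for path, (v1, v2) in leaves(tree):
    print(" / ".join(path[1:]), round(v1 / (v1 + v2), 3))
```

Under this reading the terminal node with the largest share of violations is AGE = 1 with SEX = 1, at 102 out of 404 drivers.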