Module 3 Part 2

The CART Algorithm

Presented By:
 Emaan Tahir-442
 Kaneez Amna-448
 Izza Shah-401
 Aiman Arshad-445
 Summaya Nazir-440

Presented To:
Ma’am Nazia
 1977: Breiman, Stone, Friedman and Olshen invented the first version of CART.
 1984: the official publication arrived, together with CART decision-tree software. It was a revolution in the world of algorithms, and even today CART remains one of the most used methods for decision-tree data analytics.

 CART is an umbrella term that refers to the following types of decision trees:
Classification trees: when the target variable is categorical, the tree is used to identify the "class" into which the target variable is most likely to fall.
Regression trees: these are used to predict the value of a continuous variable.
(A small scikit-learn illustration of both types follows this list.)
 The representation used for CART is a binary tree.
 Stopping criteria define how much the tree learns, and pruning can be used to improve a learned tree.
 Predictions are made with CART by traversing the binary tree given
a new input record.
 The tree is learned using a greedy algorithm on the training data to
pick splits in the tree.
 The nodes are split into subnodes on the basis of a threshold value of an attribute. The CART algorithm does this by searching for the split that produces the most homogeneous subnodes, with the help of the Gini index criterion.
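As a concrete illustration of the two tree types, here is a minimal sketch using scikit-learn, whose tree module implements an optimised CART variant; the height/weight rows and the continuous target values are invented for the example.

# Classification tree vs. regression tree on the same toy inputs.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[170, 65], [185, 90], [160, 50], [190, 100]]  # height (cm), weight (kg)
y_class = ["Female", "Male", "Female", "Male"]     # categorical target
y_reg = [62.0, 88.5, 48.0, 97.0]                   # continuous target (invented)

clf = DecisionTreeClassifier(criterion="gini").fit(X, y_class)
reg = DecisionTreeRegressor().fit(X, y_reg)

print(clf.predict([[182, 85]]))  # expected 'Male' on this toy data
print(reg.predict([[182, 85]]))  # a continuous prediction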

Make Predictions With CART Models

If Height > 180 cm Then Male
If Height <= 180 cm AND Weight > 80 kg Then Male
If Height <= 180 cm AND Weight <= 80 kg Then Female
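Read as code, a prediction is a single root-to-leaf traversal of these rules; a minimal Python sketch:

def predict(height_cm, weight_kg):
    """Traverse the two-level binary tree above for one new input record."""
    if height_cm > 180:
        return "Male"
    if weight_kg > 80:           # reached only when height <= 180 cm
        return "Male"
    return "Female"

print(predict(183, 70))  # -> Male
print(predict(175, 65))  # -> Female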

 The root node is taken as the training set and is split into two by considering the best attribute and threshold value. The resulting subsets are then split using the same logic. This continues until a pure subset is reached or the tree hits its maximum possible number of leaves. This growing procedure is known as recursive binary splitting.
Greedy Algorithm:
 The input variables and split points are selected through a greedy algorithm that minimises the Gini index at each step.
Stopping Criterion:
 The recursive binary splitting procedure described above must know when to stop splitting. A common criterion is a minimum count of training instances per node: if a candidate split would leave a node with fewer instances than this threshold, the split is rejected and the node becomes a leaf.
Tree Pruning:
 A decision tree's complexity is defined as the number of splits in the tree. Leaf nodes are removed only if doing so decreases the total cost function on a held-out test set. (A sketch combining the greedy split search and the stopping criterion follows.)
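The sketch below combines the pieces: a greedy search over every attribute/threshold pair scored by weighted Gini, and a minimum-node-size stopping criterion. It is an illustrative reconstruction, not the official CART implementation; min_size and all names are assumptions.

from collections import Counter

def gini(labels):
    # Gini = 1 - sum of squared class probabilities.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    # Greedy step: try every attribute/threshold pair, keep the lowest
    # weighted Gini index.
    best_attr, best_thr, best_score = None, None, float("inf")
    for attr in range(len(rows[0])):
        for thr in {r[attr] for r in rows}:
            left = [l for r, l in zip(rows, labels) if r[attr] <= thr]
            right = [l for r, l in zip(rows, labels) if r[attr] > thr]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best_attr, best_thr, best_score = attr, thr, score
    return best_attr, best_thr

def build_tree(rows, labels, min_size=5):
    # Stopping criterion: node too small or already pure -> majority leaf.
    if len(rows) < min_size or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0]
    attr, thr = best_split(rows, labels)
    if attr is None:  # no useful split exists
        return Counter(labels).most_common(1)[0][0]
    left = [(r, l) for r, l in zip(rows, labels) if r[attr] <= thr]
    right = [(r, l) for r, l in zip(rows, labels) if r[attr] > thr]
    return {"attr": attr, "threshold": thr,
            "left": build_tree([r for r, _ in left], [l for _, l in left], min_size),
            "right": build_tree([r for r, _ in right], [l for _, l in right], min_size)}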
EXAMPLE
Day  Outlook   Temp.  Humidity  Wind    Decision
1    Sunny     Hot    High      Weak    No
2    Sunny     Hot    High      Strong  No
3    Overcast  Hot    High      Weak    Yes
4    Rain      Mild   High      Weak    Yes
5    Rain      Cool   Normal    Weak    Yes
6    Rain      Cool   Normal    Strong  No
7    Overcast  Cool   Normal    Strong  Yes
8    Sunny     Mild   High      Weak    No
9    Sunny     Cool   Normal    Weak    Yes
10   Rain      Mild   Normal    Weak    Yes
11   Sunny     Mild   Normal    Strong  Yes
12   Overcast  Mild   High      Strong  Yes
13   Overcast  Hot    Normal    Weak    Yes
14   Rain      Mild   High      Strong  No

The Gini index is the metric used for classification tasks in CART. It is computed from the sum of squared probabilities of each class. We can formulate it as illustrated below:

Gini = 1 – Σ (pᵢ)²,  for i = 1 to the number of classes
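A direct transcription of this formula, assuming the class counts at a node are given as a list:

def gini(counts):
    """Gini index from class counts, e.g. [2, 3] for 2 Yes / 3 No."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([2, 3]))  # Outlook=Sunny below: 1 - (2/5)^2 - (3/5)^2 = 0.48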
Outlook   Yes  No  Instances
Sunny     2    3   5
Overcast  4    0   4
Rain      3    2   5

Gini(Outlook=Sunny) = 1 – (2/5)² – (3/5)² = 1 – 0.16 – 0.36 = 0.48
Gini(Outlook=Overcast) = 1 – (4/4)² – (0/4)² = 0
Gini(Outlook=Rain) = 1 – (3/5)² – (2/5)² = 1 – 0.36 – 0.16 = 0.48
Then we calculate the weighted sum of Gini indexes for the Outlook feature:
Gini(Outlook) = (5/14) × 0.48 + (4/14) × 0 + (5/14) × 0.48 = 0.171 + 0 + 0.171 = 0.342
Temperature  Yes  No  Instances
Hot          2    2   4
Cool         3    1   4
Mild         4    2   6

Gini(Temp=Hot) = 1 – (2/4)² – (2/4)² = 0.5
Gini(Temp=Cool) = 1 – (3/4)² – (1/4)² = 1 – 0.5625 – 0.0625 = 0.375
Gini(Temp=Mild) = 1 – (4/6)² – (2/6)² = 1 – 0.444 – 0.111 = 0.445
Next, we calculate the weighted sum of the Gini index for the Temperature feature:
Gini(Temp) = (4/14) × 0.5 + (4/14) × 0.375 + (6/14) × 0.445 = 0.142 + 0.107 + 0.190 = 0.439
Humidity  Yes  No  Instances
High      3    4   7
Normal    6    1   7

Gini(Humidity=High) = 1 – (3/7)² – (4/7)² = 1 – 0.184 – 0.327 = 0.489
Gini(Humidity=Normal) = 1 – (6/7)² – (1/7)² = 1 – 0.735 – 0.020 = 0.245
Next, the weighted sum for the Humidity feature:
Gini(Humidity) = (7/14) × 0.489 + (7/14) × 0.245 = 0.367
Wind    Yes  No  Instances
Weak    6    2   8
Strong  3    3   6

Gini(Wind=Weak) = 1 – (6/8)² – (2/8)² = 1 – 0.5625 – 0.0625 = 0.375
Gini(Wind=Strong) = 1 – (3/6)² – (3/6)² = 1 – 0.25 – 0.25 = 0.5
Gini(Wind) = (8/14) × 0.375 + (6/14) × 0.5 = 0.428
 We have calculated the Gini index for each feature. We select the Outlook feature because its cost is the lowest.

Feature      Gini index
Outlook      0.342
Temperature  0.439
Humidity     0.367
Wind         0.428
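The whole comparison can be reproduced with a short script; weighted_gini is an illustrative helper, and the printed values match the table above up to rounding.

from collections import Counter

data = [  # (Outlook, Temp, Humidity, Wind, Decision)
    ("Sunny", "Hot", "High", "Weak", "No"), ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(rows, col):
    # Weighted sum of the Gini index over each value of the feature.
    total = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        total += len(subset) / len(rows) * gini(subset)
    return total

for name, col in [("Outlook", 0), ("Temperature", 1), ("Humidity", 2), ("Wind", 3)]:
    print(name, round(weighted_gini(data, col), 3))
# Outlook 0.343, Temperature 0.44, Humidity 0.367, Wind 0.429 -> Outlook wins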
Sunny outlook

[Tree so far: Outlook is the root. The Overcast branch contains only Yes decisions, so it becomes a pure YES leaf; the Sunny and Rain branches are split further below.]
Day  Outlook  Temp.  Humidity  Wind    Decision
1    Sunny    Hot    High      Weak    No
2    Sunny    Hot    High      Strong  No
8    Sunny    Mild   High      Weak    No
9    Sunny    Cool   Normal    Weak    Yes
11   Sunny    Mild   Normal    Strong  Yes

Gini of temperature for sunny outlook

Temperature  Yes  No  Instances
Hot          0    2   2
Cool         1    0   1
Mild         1    1   2

Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)² – (2/2)² = 0
Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)² – (0/1)² = 0
Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)² – (1/2)² = 1 – 0.25 – 0.25 = 0.5
Gini(Outlook=Sunny and Temp.) = (2/5) × 0 + (1/5) × 0 + (2/5) × 0.5 = 0.2
Gini of humidity for sunny outlook

Humidity  Yes  No  Instances
High      0    3   3
Normal    2    0   2

Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)² – (3/3)² = 0
Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)² – (0/2)² = 0
Gini(Outlook=Sunny and Humidity) = (3/5) × 0 + (2/5) × 0 = 0
Gini of wind for sunny outlook
Wind    Yes  No  Instances
Weak    1    2   3
Strong  1    1   2

Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)² – (2/3)² = 0.444
Gini(Outlook=Sunny and Wind=Strong) = 1 – (1/2)² – (1/2)² = 0.5
Gini(Outlook=Sunny and Wind) = (3/5) × 0.444 + (2/5) × 0.5 = 0.266 + 0.2 = 0.466
Decision for sunny outlook

Feature      Gini index
Temperature  0.2
Humidity     0
Wind         0.466

Humidity has the lowest Gini index (0), so the Sunny branch is split on Humidity: High leads to No and Normal leads to Yes, and both subnodes are pure.
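A standalone check of the winning Humidity split, using the explicit Yes/No counts from the table:

def gini(yes, no):
    n = yes + no
    return 1 - (yes / n) ** 2 - (no / n) ** 2

# Within Sunny: Humidity=High has 0 Yes / 3 No, Humidity=Normal has 2 Yes / 0 No.
print((3 / 5) * gini(0, 3) + (2 / 5) * gini(2, 0))  # 0.0 -> both subnodes pure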
Rain outlook

Day  Outlook  Temp.  Humidity  Wind    Decision
4    Rain     Mild   High      Weak    Yes
5    Rain     Cool   Normal    Weak    Yes
6    Rain     Cool   Normal    Strong  No
10   Rain     Mild   Normal    Weak    Yes
14   Rain     Mild   High      Strong  No


Gini of temperature for rain outlook

Temperature  Yes  No  Instances
Cool         1    1   2
Mild         2    1   3

Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)² – (1/2)² = 0.5
Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)² – (1/3)² = 0.444
Gini(Outlook=Rain and Temp.) = (2/5) × 0.5 + (3/5) × 0.444 = 0.466
Gini of humidity for rain outlook

Humidity  Yes  No  Instances
High      1    1   2
Normal    2    1   3

Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)² – (1/2)² = 0.5
Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)² – (1/3)² = 0.444
Gini(Outlook=Rain and Humidity) = (2/5) × 0.5 + (3/5) × 0.444 = 0.466
Gini of wind for rain outlook

Wind    Yes  No  Instances
Weak    3    0   3
Strong  0    2   2

Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)² – (0/3)² = 0
Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)² – (2/2)² = 0
Gini(Outlook=Rain and Wind) = (3/5) × 0 + (2/5) × 0 = 0

Decision for rain outlook

Feature      Gini index
Temperature  0.466
Humidity     0.466
Wind         0

Wind has a Gini index of 0, so the Rain branch is split on Wind: Weak leads to Yes and Strong leads to No. The tree is now complete.
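With all three branches resolved, the finished tree collapses to a plain function; a sketch:

def play_tennis(outlook, humidity, wind):
    """Outlook at the root; Humidity under Sunny; Wind under Rain."""
    if outlook == "Overcast":
        return "Yes"                                   # pure branch
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    return "Yes" if wind == "Weak" else "No"           # outlook == "Rain"

print(play_tennis("Rain", "High", "Strong"))  # day 14 -> No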
Example

Age     Job    House  Credit     Loan Approved
Young   False  No     Fair       No
Young   False  No     Good       No
Young   True   No     Good       Yes
Young   True   Yes    Fair       Yes
Young   False  No     Fair       No
Middle  False  No     Fair       No
Middle  False  No     Good       No
Middle  True   Yes    Good       Yes
Middle  False  Yes    Excellent  Yes
Middle  False  Yes    Excellent  Yes
Old     False  Yes    Excellent  Yes
Old     False  Yes    Good       Yes
Old     True   No     Good       Yes
Old     True   No     Excellent  Yes
Old     False  No     Fair       No
Young   False  No     Good       ?

In this example, attributes are compared by their misclassification error rather than the Gini index.
Age     Instances  Yes  No
Young   5          2    3
Middle  5          3    2
Old     5          4    1

Individual error and total error for the Age attribute:

Attribute  Rules        Error  Total Error
Age        Young->No    2/5    5/15
           Middle->Yes  2/5
           Old->Yes     1/5
Job    Instances  Yes  No
False  10         4    6
True   5          5    0

Individual error and total error for the Job attribute:

Attribute  Rules      Error  Total Error
Job        False->No  4/10   4/15
           True->Yes  0/5
House  Instances  Yes  No
No     9          3    6
Yes    6          6    0

Individual error and total error for the House attribute:

Attribute  Rules     Error  Total Error
House      No->No    3/9    3/15
           Yes->Yes  0/6
Credit     Instances  Yes  No
Fair       5          1    4
Good       6          4    2
Excellent  4          4    0

Individual error and total error for the Credit attribute:

Attribute  Rules           Error  Total Error
Credit     Fair->No        1/5    3/15
           Good->Yes       2/6
           Excellent->Yes  0/4
Attribute  Rules           Error  Total Error
Age        Young->No       2/5    5/15
           Middle->Yes     2/5
           Old->Yes        1/5
Job        False->No       4/10   4/15
           True->Yes       0/5
House      No->No          3/9    3/15
           Yes->Yes        0/6
Credit     Fair->No        1/5    3/15
           Good->Yes       2/6
           Excellent->Yes  0/4

House has the lowest total error (3/15, tied with Credit), so House is selected as the root split.
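The attribute errors can be reproduced in a few lines; total_error is an illustrative helper that predicts the majority class within each attribute value and counts the mistakes.

from collections import Counter

rows = [  # (Age, Job, House, Credit, Loan Approved): the 15 labelled records
    ("Young", False, "No", "Fair", "No"), ("Young", False, "No", "Good", "No"),
    ("Young", True, "No", "Good", "Yes"), ("Young", True, "Yes", "Fair", "Yes"),
    ("Young", False, "No", "Fair", "No"), ("Middle", False, "No", "Fair", "No"),
    ("Middle", False, "No", "Good", "No"), ("Middle", True, "Yes", "Good", "Yes"),
    ("Middle", False, "Yes", "Excellent", "Yes"), ("Middle", False, "Yes", "Excellent", "Yes"),
    ("Old", False, "Yes", "Excellent", "Yes"), ("Old", False, "Yes", "Good", "Yes"),
    ("Old", True, "No", "Good", "Yes"), ("Old", True, "No", "Excellent", "Yes"),
    ("Old", False, "No", "Fair", "No"),
]

def total_error(rows, col):
    # For each value of the attribute, predict its majority class and
    # count the records that disagree.
    errors = 0
    for value in {r[col] for r in rows}:
        labels = [r[-1] for r in rows if r[col] == value]
        errors += len(labels) - Counter(labels).most_common(1)[0][1]
    return errors

for name, col in [("Age", 0), ("Job", 1), ("House", 2), ("Credit", 3)]:
    print(f"{name}: {total_error(rows, col)}/{len(rows)}")
# Age: 5/15, Job: 4/15, House: 3/15, Credit: 3/15 -> split on House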
[Tree so far: House is the root. The House = Yes branch contains only Yes decisions, so it becomes a pure YES leaf; the House = No branch (the nine records below) must be split further.]

Age     Job    Credit     Loan Approved
Young   False  Fair       No
Young   False  Good       No
Young   True   Good       Yes
Young   False  Fair       No
Middle  False  Fair       No
Middle  False  Good       No
Old     True   Good       Yes
Old     True   Excellent  Yes
Old     False  Fair       No
Attribute  Rules           Error  Total Error
Age        Young->No       1/4    2/9
           Middle->No      0/2
           Old->Yes        1/3
Job        False->No       0/6    0/9
           True->Yes       0/3
Credit     Fair->No        0/4    2/9
           Good->Yes/No    2/4
           Excellent->Yes  0/1

Job classifies this subset perfectly (total error 0/9), so the House = No branch is split on Job.
[Final tree: House is the root. House = Yes -> YES. House = No -> split on Job: Job = False -> NO, Job = True -> YES.]

Age    Job    House  Credit  Loan Approved
Young  False  No     Good    No

The unlabelled record follows the House = No, Job = False path, so its loan is predicted as not approved.
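As with the weather example, the finished loan tree collapses to a small function; a sketch:

def loan_approved(job, house):
    """House at the root; its Yes branch is pure, and the No branch
    splits perfectly on Job."""
    if house == "Yes":
        return "Yes"
    return "Yes" if job else "No"

# The unlabelled record (Young, Job=False, House=No, Credit=Good):
print(loan_approved(job=False, house="No"))  # -> No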
Advantages of the CART Algorithm
 CART requires minimal supervision and produces easy-to-understand models.
 It is effective at finding interactions and signal discontinuities.
 It identifies important variables automatically.
 It can use any combination of continuous and discrete variables.
Disadvantages of the CART Algorithm
 Each split involves a single variable; CART does not search over combinations of variables.
 The tree structure may be unstable: small changes in the data can produce a very different tree.
 A tree has a limited number of split positions, so it cannot accommodate all available predictors.
