Answer Midterm 2024 - 11 - 19

The document discusses various concepts related to decision trees, including the calculation of Gini index and Gini gain for attributes A and B, with B being preferred for splitting due to higher Gini gain. It also covers the Simple Matching Coefficient (SMC) and Jaccard Index for comparing binary arrays, highlighting their differences and lack of conflict. Additionally, it addresses data discretization methods, attribute classification, and the comparison of underfitting and overfitting in models.

Midterm 2024 – 11 – 19

Q1. Consider the following data set for a binary class problem.

     #    A   B   Class Label
     1    T   F   +
     2    T   T   +
     3    T   T   +
     4    T   F   -
     5    T   T   +
     6    F   F   -
     7    F   F   -
     8    F   F   -
     9    T   T   -
    10    T   F   -

a) Calculate the gain in the Gini index when splitting on A and B.

    Gini(t) = 1 - SUM_{i=0}^{c-1} [p_i(t)]^2

Calculate Gini for Parent

    Class   Count   p_i
    +       4       4/10 = 0.4
    -       6       6/10 = 0.6

    Gini = 1 - (0.4)^2 - (0.6)^2 = 0.48

Calculate Gini Gain for A

    A       Node 1 (A = T)   Node 2 (A = F)
    +       4                0
    -       3                3

    Gini(Node 1) = 1 - (4/7)^2 - (3/7)^2 = 24/49 ≈ 0.490
    Gini(Node 2) = 1 - (0/3)^2 - (3/3)^2 = 0

    Weighted Gini = (7/10) * 0.490 + (3/10) * 0 ≈ 0.343

    Gini Gain = 0.48 - 0.343 = 0.137

Calculate Gini Gain for B

    B       Node 1 (B = T)   Node 2 (B = F)
    +       3                1
    -       1                5

    Gini(Node 1) = 1 - (3/4)^2 - (1/4)^2 = 3/8 = 0.375
    Gini(Node 2) = 1 - (1/6)^2 - (5/6)^2 = 5/18 ≈ 0.278

    Weighted Gini = (4/10) * 0.375 + (6/10) * 0.278 ≈ 0.317

    Gini Gain = 0.48 - 0.317 = 0.163

b) Which attribute would the decision tree induction algorithm choose for splitting? Why?
→ The algorithm would choose Attribute B for splitting because it has a higher Gini gain (≈ 0.163) than
Attribute A (≈ 0.137), which indicates a better split for reducing impurity.
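As a quick numerical check (a sketch, not part of the original answer sheet; the helper names `gini` and `gini_gain` are illustrative), the hand calculation above can be reproduced in Python:

```python
def gini(counts):
    """Gini index of a node from its class counts: 1 - sum(p_i^2)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_gain(parent, children):
    """Parent Gini minus the size-weighted Gini of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted

parent = [4, 6]                                # [+, -] counts over all 10 records
gain_A = gini_gain(parent, [[4, 3], [0, 3]])   # children: A = T, A = F
gain_B = gini_gain(parent, [[3, 1], [1, 5]])   # children: B = T, B = F
```

`gain_B` (≈ 0.163) exceeds `gain_A` (≈ 0.137), matching the choice of Attribute B.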

"Indeed, Allah and His angels send blessings upon the Prophet; O you who believe, send blessings upon him and greet him with peace." (Quran 33:56)
Q2. Consider the following binary arrays for two data samples:

x = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
y = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1]
a) Calculate the Simple Matching Coefficient (SMC) and the Jaccard Index between x and y.

f00 = 3, f01 = 2, f10 = 2, f11 = 3


    SMC = (f11 + f00) / (f00 + f01 + f10 + f11) = (3 + 3) / (3 + 2 + 2 + 3) = 6/10 = 0.6

    Jaccard = f11 / (f01 + f10 + f11) = 3 / (2 + 2 + 3) = 3/7 ≈ 0.429
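The counts and both coefficients can be verified with a short Python sketch (illustrative, not part of the original answers):

```python
def smc_jaccard(x, y):
    """Simple Matching Coefficient and Jaccard index for binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if (a, b) == (1, 1))
    f00 = sum(1 for a, b in zip(x, y) if (a, b) == (0, 0))
    smc = (f11 + f00) / len(x)        # counts matches of either kind
    jaccard = f11 / (len(x) - f00)    # ignores the 0-0 matches
    return smc, jaccard

x = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
y = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1]
smc, jac = smc_jaccard(x, y)          # 0.6 and 3/7
```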

b) What can we conclude from the calculated coefficients?

→ SMC shows more similarity because it includes both 0-0 and 1-1 matches.

→ Jaccard is stricter, focusing only on 1-1 matches. So, it shows less similarity.

c) Are the calculated coefficients different? If yes, is there any conflict between them? Why? Why not?

→ Yes, they are different (SMC = 0.6, Jaccard ≈ 0.429).

→ There is no conflict because:


- SMC considers both 1-1 and 0-0 matches.

- Jaccard only considers 1-1 matches.

→ The difference is expected because they measure similarity in different ways.

Q3. (The three images represent a dataset which consists of 4 groups of dots with a
different color each)

[Equal Interval Width] [Equal Frequency] [K-means]

a) What is the process that converts this unlabeled dataset into only 3 discrete values?

→ Discretization
b) How can this be achieved using 3 different methods? Apply only one method on one image.

→ Equal Width: divides the range of the data into intervals of equal size.

→ Equal Frequency: divides the data so that each interval contains the same number of points.

→ K-means: groups the data into clusters based on similarity.

c) Which is the best method?

→ K-means: preferred when the data forms distinct clusters, as in these images.

→ Equal Width and Equal Frequency are simpler and can be effective when the data is evenly distributed.
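For the first two methods, a minimal 1-D sketch with NumPy (the sample values are hypothetical; K-means would additionally need a clustering routine such as scikit-learn's `KMeans`):

```python
import numpy as np

# Hypothetical 1-D sample with three visible groups.
data = np.array([1.0, 2.0, 2.5, 3.0, 8.0, 9.0, 9.5, 10.0,
                 20.0, 21.0, 22.0, 23.0])

# Equal width: 3 intervals of identical length spanning [min, max].
width_edges = np.linspace(data.min(), data.max(), 4)   # 4 edges -> 3 bins
width_labels = np.digitize(data, width_edges[1:-1])    # bin index 0..2

# Equal frequency: edges at the 1/3 and 2/3 quantiles, so every bin
# receives the same number of points.
freq_edges = np.quantile(data, [1 / 3, 2 / 3])
freq_labels = np.digitize(data, freq_edges)
```

With this sample, equal frequency puts exactly 4 points in each bin, while equal width does not, illustrating the difference between the two methods.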

Q4. Classify the following attributes as binary, discrete, or continuous. Also classify them
as qualitative (nominal or ordinal) or quantitative (interval or ratio).
Example: Age in years. Answer: Continuous, quantitative, ratio
a) Time in terms of AM or PM. → Binary, qualitative, nominal

b) Angles as measured in degrees between 0° and 360°. → continuous, quantitative, interval

c) bronze, silver, and gold medals as awarded at the Olympics. → discrete, qualitative, ordinal

d) Height above sea level. → continuous, quantitative, ratio

e) Number of patients in a hospital. → discrete, quantitative, ratio


f) Academic numbers (ID) for students. → discrete, qualitative, nominal

Q5.
a) Compare Underfitting and Overfitting.

→ Overfitting: the model is too complex and learns the training data too well, including its noise, which
makes it fail on new data.

→ Underfitting: the model is too simple and fails to capture the underlying pattern, leading to poor
performance on both the training data and new data.
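The contrast can be illustrated with a small polynomial-fitting sketch (the data, noise level, and degrees are hypothetical choices, not from the exam):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 10)
y_train = np.sin(3.0 * x_train) + rng.normal(0.0, 0.3, size=10)  # noisy target

# Too simple: a straight line cannot follow a sinusoid (underfitting).
line = np.polyfit(x_train, y_train, deg=1)
# Too complex: degree 9 with 10 points interpolates the noise (overfitting).
poly9 = np.polyfit(x_train, y_train, deg=9)

def train_mse(coeffs):
    """Mean squared error of a polynomial fit on the training points."""
    return float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
```

The degree-9 fit drives the training error to (numerically) zero by fitting the noise, while the line keeps a visible training error; on fresh points from the same sinusoid the overfit model typically does worse.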

b) What is the main drawback of Pearson correlation? How to overcome this drawback?

→ The main drawback of Pearson correlation is that it only measures linear relationships and does not
capture non-linear ones.

→ To overcome this, use other measures, such as mutual information, that capture non-linear relationships.
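A sketch of this drawback: below, y depends perfectly on x, yet the Pearson correlation is near zero, while a simple histogram-based mutual-information estimate is clearly positive (the data and bin count are illustrative choices):

```python
import numpy as np

# Deterministic but non-linear relationship: y = x^2 on a symmetric range.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Pearson correlation is ~0 even though y is fully determined by x.
r = np.corrcoef(x, y)[0, 1]

# Histogram-based mutual information estimate (in nats).
pxy, _, _ = np.histogram2d(x, y, bins=10)
pxy /= pxy.sum()                            # joint distribution
px = pxy.sum(axis=1, keepdims=True)         # marginal of x
py = pxy.sum(axis=0, keepdims=True)         # marginal of y
nz = pxy > 0
mi = np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))
```

Pearson misses the dependence entirely, while mutual information detects it.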

