Data Mining Presentation

The document outlines a course structure on data mining, detailing assessment components and chapters covering various aspects of data handling and analysis. It emphasizes the significance of data mining in the current data age, highlighting processes such as data cleaning, integration, and mining, as well as the types of data that can be mined. Additionally, it discusses various data mining tasks, algorithms, and techniques for managing data quality and privacy.

Dr. Mohammadi Zanjireh
Imam Khomeini International University (IKIU)

 Midterms: 4*10 = 40
 Exercises: 10
 Project: 20
 Final exam: 30
 Chapter 1 – Introduction
 Chapter 2 - Getting to Know Your Data
 Chapter 3 - Data Preprocessing
 Chapter 4- Frequent Patterns Mining
 Chapter 5 – Classification
 Chapter 6 - Clustering
▪ “We are living in the information age” is a popular saying; however, we are actually living in the data age.

▪ Terabytes or petabytes of data pour into our computer networks, the World Wide Web (WWW), and various data storage devices every day from business, society, science and engineering, medicine, and almost every other aspect of daily life.
▪ This explosively growing, widely available, and
gigantic body of data makes our time truly the data age.
Powerful tools are badly needed to automatically
uncover valuable information from the tremendous
amounts of data and to transform such data into
organised knowledge.

▪ This necessity has led to the birth of data mining. The field is young, dynamic, and promising. Data mining has made, and will continue to make, great strides in our journey from the data age toward the coming information age.
▪ Example 1-1: Search Engine.
▪ A data rich but information poor situation.
▪ Data tombs.

▪ What is Data Mining?


Data mining should have been more appropriately named
“knowledge mining from data,” which is unfortunately
somewhat long.
▪ Knowledge Discovery from Data (KDD):
▪ Data Cleaning (preprocessing step).
▪ Data Integration (preprocessing step).
▪ Data Selection (preprocessing step).
▪ Data Mining.
▪ Pattern Evaluation.
▪ Knowledge Presentation.
What Kinds of Data Can Be Mined?
As a general technology, data mining can be applied to
any kind of data as long as the data are meaningful
for a target application such as database data, data
warehouse data, transactional data, data streams,
multimedia data, and the WWW.
Exercise 1: What is the difference between a Database and a Data warehouse?

Exercise 2: Describe a number of data mining applications.
 Data mining tasks can be classified into two categories:

◦ Descriptive: Characterises properties of the data in a target data set.

◦ Predictive: Performs induction on the current data in order to make predictions.
▪ Data warehouse:
▪ Data Cube:
▪ Fig 1.7:
◦ Drill-down
◦ Roll-up

◦ Slice
◦ Dice

◦ OLAP: On-Line Analytical Processing

◦ OLTP: On-Line Transaction Processing
▪ Which Technologies Are Used?
▪ Statistics

▪ Machine Learning (ML)


▪ Supervised Learning
▪ Unsupervised Learning
▪ Semi-supervised Learning
▪ Efficiency and Scalability:
◦ The running time of a data mining algorithm must be predictable, short, and acceptable to applications.

▪ Parallel and distributed mining algorithms:
◦ Such algorithms first partition the data into “pieces”.
◦ Each piece is processed in parallel.
◦ The patterns from each partition are eventually merged.
◦ Parallel: within one machine.
◦ Distributed: across multiple machines.
 Handling noise, error, exceptions, and outliers:
◦ Data often contain noise, errors, exceptions, or outliers.
◦ These may confuse the data mining process, leading to the
derivation of erroneous patterns.
◦ Data cleaning, data preprocessing, and outlier detection and removal are examples of techniques that need to be integrated with the data mining process.

 Privacy-preserving data mining:


◦ Data mining is useful. However, it poses the risk of disclosing an
individual’s personal information.
◦ We have to observe data sensitivity while performing data mining.
 Attributes: the columns of a data table; the rows are the observations.

Student_ID   Name     Average
1001         Ali      17.12
1002         Reza     13.89
1003         Maryam   16.02
1004         Hasan    15.45
 Attribute types:
◦ Nominal: Subject, Occupation.

◦ Binary (0,1-T,F): Gender, medical test.


 Symmetric Binary.
 Asymmetric Binary.

◦ Ordinal: Drink_size (small, medium, large).

◦ Numeric
 Interval_scaled.
 Ratio_scaled.
 Discrete vs. Continuous attributes.
 Mean:

 Trimmed Mean:
 Median:
 Mode:
 Variance and Standard Deviation:
 Quartile:
Q1, Q2=Median, Q3
 Example 1:

1 2 3 4 5 6 7 8 9

Q2 = 5.0

Q1 = 2.5    Q3 = 7.5
 Example 2:

1 2 3 4 5 6 7 8 9 10

Q2 = 5.5

Q1 = 3.0    Q3 = 8.0
 Example 3:

1 2 3 4 5 6 7 8 9 10 11

Q2 = 6.0

Q1 = 3.0    Q3 = 9.0
 Example 4:

1 2 3 4 5 6 7 8 9 10 11 12

Q2 = 6.5

Q1 = 3.5    Q3 = 9.5
 Inter Quartile Range (IQR):

IQR = Q3 - Q1

 Lower fence: Q1 – 1.5 * IQR

 Upper fence: Q3 + 1.5 * IQR

 Outliers: values < Lower fence or > Upper fence.
 Boxplot:
 Example 5:

0 2 5 8 8 8 9 9 10 10 10 11 12 12 12 14 15 20 25

Q2 = 10.0

Q1 = 8.0    Q3 = 12.0

IQR = 12.0 - 8.0 = 4.0, so 1.5 * IQR = 6.0

Lower fence: 8.0 - 6.0 = 2.0
Upper fence: 12.0 + 6.0 = 18.0

Outliers: 0, 20, 25
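As a quick check of the arithmetic above, here is a minimal Python sketch (the quartile rule is the median-split convention used in these examples, which differs from NumPy's default interpolation):

```python
def quartiles(values):
    """Q1, Q2, Q3 using the median-split rule used in the examples above."""
    xs = sorted(values)
    n = len(xs)

    def median(seq):
        m = len(seq)
        mid = m // 2
        return seq[mid] if m % 2 else (seq[mid - 1] + seq[mid]) / 2

    q2 = median(xs)
    lower = xs[: n // 2]          # values below the median position
    upper = xs[(n + 1) // 2:]     # values above the median position
    return median(lower), q2, median(upper)

data = [0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 12, 14, 15, 20, 25]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q2, q3, iqr, lower_fence, upper_fence, outliers)
# -> 8 10 12 4 2.0 18.0 [0, 20, 25]
```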
 Data Matrix: rows are the objects (Student_1, ..., Student_i, ..., Student_n) and columns are their attributes.
 Dissimilarity Matrix:

 Similarity Matrix:
Sim(i,j)=1-d(i,j)
 Proximity measures for Nominal attributes:
d(i, j) = (p - m) / p, where p is the number of attributes and m is the number of attributes on which objects i and j match.

 Example:
Id   Subject      Birth_City   Living_City   Eye_Colour
1    Computer     Teh          Teh           Black
2    Electricity  Teh          Kar           Brown
3    Mechanic     Qaz          Qaz           Brown
4    Computer     Kar          Qaz           Green

Dissimilarity Matrix =
0.00
0.75  0.00
1.00  0.75  0.00
0.75  1.00  0.75  0.00
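A small Python sketch that reproduces the dissimilarity matrix above using d(i, j) = (p - m)/p:

```python
records = [
    ("Computer", "Teh", "Teh", "Black"),
    ("Electricity", "Teh", "Kar", "Brown"),
    ("Mechanic", "Qaz", "Qaz", "Brown"),
    ("Computer", "Kar", "Qaz", "Green"),
]

def nominal_dissimilarity(a, b):
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)   # number of matching attributes
    return (p - m) / p

for i, r in enumerate(records):
    row = [nominal_dissimilarity(r, records[j]) for j in range(i + 1)]
    print(["%.2f" % v for v in row])
# -> ['0.00'], ['0.75', '0.00'], ['1.00', '0.75', '0.00'], ['0.75', '1.00', '0.75', '0.00']
```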
 Proximity measures for Binary attributes:

Symmetric Binary:  d(i, j) = (r + s) / (q + r + s + t)
Asymmetric Binary: d(i, j) = (r + s) / (q + r + s)

(q = attributes that are 1 for both i and j, r = 1 for i and 0 for j, s = 0 for i and 1 for j, t = 0 for both.)
 Example: Gender: Symmetric, others: Asymmetric

Name     Gender   Fever   Cough   Test1   Test2   Test3   Test4
Ali      M        Y       N       P       N       N       N
Maryam   F        Y       N       P       N       P       N
Reza     M        Y       P       N       N       N       N

 d(Ali, Maryam) = 2/4 = 0.50
 d(Ali, Reza) = 2/4 = 0.50
 d(Maryam, Reza) = 4/5 = 0.80
 Euclidean Distance: d(i, j) = sqrt(Σ_f (x_if - x_jf)²)

 Manhattan Distance: d(i, j) = Σ_f |x_if - x_jf|

 Supremum Distance: d(i, j) = max_f |x_if - x_jf|
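A minimal sketch of the three distances for numeric vectors (plain Python; the sample points are illustrative):

```python
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def supremum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1.0, 3.5), (3.0, 1.0)
print(euclidean(x, y), manhattan(x, y), supremum(x, y))
# -> 3.2015..., 4.5, 2.5
```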
 Proximity measures for Ordinal attributes: replace each value by its rank r ∈ {1, ..., M} and map it to [0, 1] with z = (r - 1) / (M - 1).

Small   Medium   Large
1       2        3
 Example:
 Degree={Diploma, Undergraduate, Master, PhD}
 Drink_size={Small, Medium, Large}

Id Degree Drink_size
1 Undergraduate Medium
2 PhD Small
3 Diploma Medium
4 Undergraduate Large
 Example:
 Degree={Diploma, Undergraduate, Master, PhD}
 Drink_size={Small, Medium, Large}

Id Degree Drink_size
1 0.33 0.50
2 1.00 0.00
3 0.00 0.50
4 0.33 1.00
 Normalising (min-max): (Grade - min) / (max - min), with min = 20 and max = 91.

Id   Grade   Grade-min   (Grade-min)/(max-min)
1    30      10          0.14
2    52      32          0.45
3    84      64          0.90
4    45      25          0.35
5    25      5           0.07
6    20      0           0.00
7    91      71          1.00
8    65      45          0.63
9    42      22          0.31
10   32      12          0.17
 Proximity measures for mixed types:

 Example:
Id   Test_1 (nominal)   Test_2 (ordinal)   Test_3 (numeric)
1    Code A             Excellent          45
2    Code B             Fair               22
3    Code C             Good               64
4    Code A             Excellent          28

After converting the ordinal and numeric attributes to [0, 1]:

Id   Test_1 (nominal)   Test_2 (ordinal)   Test_3 (numeric)
1    Code A             1                  0.55
2    Code B             0                  0.00
3    Code C             0.5                1.00
4    Code A             1                  0.14

Dissimilarity Matrix =
0.00
0.85  0.00
0.65  0.83  0.00
0.13  0.71  0.79  0.00
 Exercise:
Id Test_1(nominal) Test_2(ordinal) Test_3(numeric)

1 Code A Excellent ----


2 ---- Fair 22
3 Code C ---- 64
4 Code A Excellent 28
 Cosine similarity:
D# team coach hockey baseball soccer penalty score win loss season
D1 5 0 3 0 2 0 0 2 0 0
D2 3 0 2 0 1 1 0 1 0 1
D3 0 7 0 2 1 0 0 3 0 0
D4 0 1 0 0 1 2 2 0 3 0
 Cosine similarity:

Similarity Matrix =
1.00
0.94  1.00
0.17  0.12  1.00
0.07  0.17  0.23  1.00
 3.1-Data Cleaning.

 3.2-Data Integration.

 3.3-Data Reduction.

 3.4-Data Transform.
 3.1- Data Cleaning

 3.1.1-Missing values
o Ignore the tuple.
o Fill in the missing value manually.
o Use a global constant.
o Use the attribute mean or median.
o Use the class mean or median.
 3.1- Data Cleaning

➢ 3.1.2-Noise
o Binning
o Smoothing by bin means.
o Smoothing by bin boundaries.

o Regression

o Clustering.
 Example:
4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into bins: bin1: 4, 8, 15. bin2: 21, 21, 24. bin3: 25, 28, 34

Smoothing by bin means: bin1: 9, 9, 9. bin2: 22, 22, 22. bin3: 29, 29, 29.

Smoothing by bin boundaries: bin1: 4, 4, 15. bin2: 21, 21, 24. bin3: 25, 25, 34
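A small Python sketch of this smoothing example (the bin size of 3 is taken from the example above):

```python
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value moves to the closest bin boundary.
by_bounds = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b] for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```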
 Regression:
 Clustering:
 3.2- Data Integration:

 3.3- Data Reduction


 3.3.1-Dimensionality reduction
o Discrete Wavelet Transform (DWT).
o Principal Components Reduction (PCA).
o Feature Subset Selection.

➢ 3.3.2-Numerosity reduction
o Sampling.
o Clustering.
 3.4- Data Transform
❑ Normalisation:
❑ Min_max normalisation:

❑ Z-Score normalisation:
 Example:
3, 4, 4, 5, 9
x̄ = 5.00, σ = 2.10 (population standard deviation)

Min-max: 0, 0.17, 0.17, 0.33, 1.00

Z-score: -0.95, -0.48, -0.48, 0, 1.91
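A short sketch reproducing both normalisations; the population standard deviation is assumed, which matches the z-scores above:

```python
from math import sqrt

data = [3, 4, 4, 5, 9]
lo, hi = min(data), max(data)
mean = sum(data) / len(data)
sigma = sqrt(sum((x - mean) ** 2 for x in data) / len(data))  # population std

min_max = [(x - lo) / (hi - lo) for x in data]
z_score = [(x - mean) / sigma for x in data]

print([round(v, 2) for v in min_max])  # [0.0, 0.17, 0.17, 0.33, 1.0]
print([round(v, 2) for v in z_score])  # [-0.95, -0.48, -0.48, 0.0, 1.91]
```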


 Descriptions
 Frequent Patterns
 Sequential Patterns
 Market Basket Analysis
 Association Rules:
 I={𝐼1 , 𝐼2 , … , 𝐼𝑛 }
 T
 D
 TID
 A => B[Support, Confidence]
A⊂ I , B⊂ I, A∩B=∅
A≠ ∅, B≠∅
 Example:

{Computer} => {Antivirus}[Support=2%, Confidence=60%]

• Support
Support(A => B) = P(A ∪ B)

• Confidence
Confidence(A => B) = P(B|A) = Support(A ∪ B) / Support(A) = Support_Count(A ∪ B) / Support_Count(A)
 Mining Association Rules
1. Find all frequent itemsets.
2. Generate strong association rules from the frequent itemsets.
• Example:
D = {T1, T2, T3, T4, T5, T6, T7, T8}
T1 = {parsley, onion, olives, cucumber, tomato}
T2 = {parsley, cucumber, tomato}
T3 = {parsley, onion, bread, salt, cucumber, tomato}
T4 = {onion, bread, cucumber, tomato}
T5 = {onion, salt, tomato}
T6 = {cheese, bread}
T7 = {cucumber, onion, tomato}
T8 = {butter, bread}

A => B [Support, Confidence]

{cucumber, tomato} => {parsley, onion}

• Exercise:

{cucumber} => {parsley, onion}
{cucumber, tomato} => {onion}
{tomato} => {onion}
 Apriori Algorithm
Proposed by Agrawal in 1994.
TID List of items
T100 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟓
T200 𝑰𝟐 , 𝑰𝟒
T300 𝑰𝟐 , 𝑰𝟑
T400 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟒
T500 𝑰𝟏 , 𝑰𝟑
T600 𝑰𝟐 , 𝑰𝟑
T700 𝑰𝟏 , 𝑰𝟑
T800 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑 , 𝑰𝟓
T900 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑
 1. 𝑪𝟏 : min-sup = 2, Scan D.
Itemset Sup_Count
{𝐼1 } 6
{𝐼2 } 7
{𝐼3 } 6
{𝐼4 } 2
{𝐼5 } 2
 2. 𝑳𝟏 :
Itemset Sup_Count
{𝐼1 } 6
{𝐼2 } 7
{𝐼3 } 6
{𝐼4 } 2
{𝐼5 } 2
 3. 𝑪𝟐 = 𝑳𝟏 ∗ 𝑳𝟏 , Scan D
Itemset Sup_Count
{𝐼1 , 𝐼2 } 4
{𝐼1 , 𝐼3 } 4
{𝐼1 , 𝐼4 } 1
{𝐼1 , 𝐼5 } 2
{𝐼2 , 𝐼3 } 4
{𝐼2 , 𝐼4 } 2
{𝐼2 , 𝐼5 } 2
{𝐼3 , 𝐼4 } 0
{𝐼3 , 𝐼5 } 1
{𝐼4 , 𝐼5 } 0
 4. 𝑳𝟐 :
Itemset Sup_Count
{𝐼1 , 𝐼2 } 4
{𝐼1 , 𝐼3 } 4
{𝐼1 , 𝐼5 } 2
{𝐼2 , 𝐼3 } 4
{𝐼2 , 𝐼4 } 2
{𝐼2 , 𝐼5 } 2
 Prune: remove candidates with an infrequent subset.
 5. 𝑪𝟑 = 𝑳𝟐 ∗ 𝑳𝟐
Itemset         Prune
{I1, I2, I3}    kept
{I1, I2, I5}    kept
{I1, I2, I4}    pruned ({I1, I4} ∉ L2)
{I1, I3, I5}    pruned ({I3, I5} ∉ L2)
{I2, I3, I4}    pruned ({I3, I4} ∉ L2)
{I2, I3, I5}    pruned ({I3, I5} ∉ L2)
{I2, I4, I5}    pruned ({I4, I5} ∉ L2)
 5. 𝑪𝟑 = 𝑳𝟐 ∗ 𝑳𝟐 , Scan D

Itemset Sup_Count
{𝐼1 , 𝐼2 , 𝐼3 } 2
{𝐼1 , 𝐼2 , 𝐼5 } 2
 6. 𝑳𝟑

Itemset Sup_Count
{𝐼1 , 𝐼2 , 𝐼3 } 2
{𝐼1 , 𝐼2 , 𝐼5 } 2
 7. 𝑪𝟒 = 𝑳𝟑 ∗ 𝑳𝟑 , Prune
 𝐿4 = ∅.

End of Step 1.
 Code of Apriori Algorithm:
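A minimal Python sketch of Apriori for the transaction table above (min_sup is an absolute count, matching the example; the function and variable names here are illustrative):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {k: v for k, v in counts.items() if v >= min_sup}
    frequent = dict(current)
    k = 2
    while current:
        # Candidate generation: join the frequent (k-1)-itemsets
        items = sorted({i for s in current for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)
                      # Apriori pruning: every (k-1)-subset must be frequent
                      if all(frozenset(sub) in current for sub in combinations(c, k - 1))]
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(current)
        k += 1
    return frequent

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
for itemset, sup in sorted(apriori(D, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```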
 Confidence: X = {I1, I2, I5}.

{I1} => {I2, I5}: Confidence = 2/6 = 33%   ✗
{I2} => {I1, I5}: Confidence = 2/7 = 29%   ✗
{I5} => {I1, I2}: Confidence = 2/2 = 100%  ✓
{I1, I2} => {I5}: Confidence = 2/4 = 50%   ✗
{I1, I5} => {I2}: Confidence = 2/2 = 100%  ✓
{I2, I5} => {I1}: Confidence = 2/2 = 100%  ✓

For Confidence ≥ 70%, only the rules marked ✓ are kept.

 Exercise_2: Calculate confidence for X={𝐼1 , 𝐼2 , 𝐼3 }.

 Exercise_3: Re-run the previous example using min-sup=3 and


min-conf=60%

 Exercise_4: Using min-sup=30% and min-conf=60%:

TID   Items Purchased
1     {orange juice, lemonade}
2     {milk, orange juice, glass cleaner}
3     {orange juice, detergent, lemonade}
4     {glass cleaner, lemonade}
5     {lemonade, chips}
 FP-growth (finding frequent itemsets without candidate generation).
Min-sup=2
TID List of items
T100 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟓
T200 𝑰𝟐 , 𝑰𝟒
T300 𝑰𝟐 , 𝑰𝟑
T400 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟒
T500 𝑰𝟏 , 𝑰𝟑
T600 𝑰𝟐 , 𝑰𝟑
T700 𝑰𝟏 , 𝑰𝟑
T800 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑 , 𝑰𝟓
T900 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑
1- Frequent items are sorted in descending order of support count:
L = {{I2: 7}, {I1: 6}, {I3: 6}, {I4: 2}, {I5: 2}}

(Each transaction is reordered according to L = {I2, I1, I3, I4, I5} before it is inserted into the tree rooted at null; a header table links all nodes carrying the same item.)

2- T100 = {I1, I2, I5} → ordered {I2, I1, I5}: insert the path null → I2:1 → I1:1 → I5:1.

3- T200 = {I2, I4}: I2 becomes I2:2; add a new child I4:1 under I2.

4- T300 = {I2, I3}: I2 becomes I2:3; add a new child I3:1 under I2.

5- T400 = {I1, I2, I4} → ordered {I2, I1, I4}: I2:4, I1:2; add a new child I4:1 under this I1.

6- T500 = {I1, I3}: no shared prefix with the I2 branch; add a new branch null → I1:1 → I3:1.

7- T600 = {I2, I3}: I2:5 and its I3 child becomes I3:2.

8- T700 = {I1, I3}: the root-level I1 branch becomes I1:2 → I3:2.

9- T800 = {I1, I2, I3, I5} → ordered {I2, I1, I3, I5}: I2:6, I1:3; add I3:1 under this I1 and I5:1 under that I3.

10- T900 = {I1, I2, I3} → ordered {I2, I1, I3}: I2:7, I1:4, and the I3 under it becomes I3:2.
11- FP-tree is mined: We start from the last item of L.

Item Conditional Pattern Base Conditional FP-tree


I5 {I2 , I1 : 1} , {I2 , I1 , I3 : 1} {I2 : 2, I1 : 2}
I4 {I2 , I1 : 1} , {I2 : 1} {I2 : 2}
I3 {I2 : 2},{I2 , I1 : 2},{I1 : 2} {I2 : 4, I1 : 2} ,{I1 : 2}
I1 {I2 : 4} {I2 : 4}
Frequent patterns generated
{I2 , I5 : 2}, {I1 , I5 : 2}, {I2 , I1 , I5 : 2}
{I2 , I4 : 2}
{I2 , I3 : 4}, {I1 , I3 : 4}, {I2 , I1 , I3 : 2}
{I2 , I1 : 4}

 Exercise_5: Calculate support and confidence for above rules.

 Exercise_6: Do exercise_4 using FP-growth algorithm.


 Code of FP-growth Algorithm:
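One way to run FP-growth in practice is via the third-party mlxtend package; the sketch below assumes its TransactionEncoder, fpgrowth, and association_rules API (an assumption about that library, not part of these slides):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
                ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
                ["I1", "I2", "I3"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# min_support = 2/9 corresponds to the absolute min-sup = 2 used above.
frequent = fpgrowth(df, min_support=2 / 9, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```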
 ECLAT (Equivalent CLAss Transformation) Algorithm
Vertical data format
Min-sup=2

TID List of items


T100 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟓
T200 𝑰𝟐 , 𝑰𝟒
T300 𝑰𝟐 , 𝑰𝟑
T400 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟒
T500 𝑰𝟏 , 𝑰𝟑
T600 𝑰𝟐 , 𝑰𝟑
T700 𝑰𝟏 , 𝑰𝟑
T800 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑 , 𝑰𝟓
T900 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑
Itemset   TID set
𝑰𝟏 {T100, T400, T500, T700, T800, T900}
𝑰𝟐 {T100, T200, T300, T400, T600, T800 , T900}
𝑰𝟑 {T300, T500, T600, T700, T800, T900}
𝑰𝟒 {T200, T400}
𝑰𝟓 {T100, T800}
Itemset   TID set
{𝑰𝟏 , 𝑰𝟐 } {T100, T400, T800, T900}
{𝑰𝟏 , 𝑰𝟑 } {T500, T700, T800 , T900}
{𝑰𝟏 , 𝑰𝟒 } {T400}
{𝑰𝟏 , 𝑰𝟓 } {T100, T800}
{𝑰𝟐 , 𝑰𝟑 } {T300, T600, T800, T900}

{𝑰𝟐 , 𝑰𝟒 } {T200, T400}


{𝑰𝟐 , 𝑰𝟓 } {T100, T800}

{𝑰𝟑 , 𝑰𝟒 } {}
{𝑰𝟑 , 𝑰𝟓 } {T800}
{𝑰𝟒 , 𝑰𝟓 } {}
Itemset   TID set
{𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑 } {T800, T900}
{𝑰𝟏 , 𝑰𝟐 , 𝑰𝟒 } {T400}
{𝑰𝟏 , 𝑰𝟐 , 𝑰𝟓 } {T100, T800}
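A minimal Python sketch of the Eclat idea shown above: keep each itemset's TID set and intersect TID sets to count support (names are illustrative):

```python
from itertools import combinations

transactions = {
    "T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"}, "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"}, "T500": {"I1", "I3"}, "T600": {"I2", "I3"},
    "T700": {"I1", "I3"}, "T800": {"I1", "I2", "I3", "I5"}, "T900": {"I1", "I2", "I3"},
}
min_sup = 2

# Vertical format: 1-itemset -> set of TIDs containing it.
tidsets = {}
for tid, items in transactions.items():
    for item in items:
        tidsets.setdefault(frozenset([item]), set()).add(tid)

level = {k: v for k, v in tidsets.items() if len(v) >= min_sup}
frequent = dict(level)
while level:
    next_level = {}
    for (a, ta), (b, tb) in combinations(level.items(), 2):
        union = a | b
        if len(union) == len(a) + 1:          # join itemsets that differ by one item
            tids = ta & tb                    # support = size of the TID-set intersection
            if len(tids) >= min_sup:
                next_level[union] = tids
    level = next_level
    frequent.update(level)

for itemset, tids in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sorted(tids))
```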

 Exercise_7: Explore association rules from above example.

 Exercise_8: Do exercise_4 again using Eclat algorithm.


 Strong Rules Are Not Necessarily Interesting:

 Example: |D| = 10,000   {A}: 6,000   {B}: 7,500   {A,B}: 4,000

min_sup = 30%, min_conf = 60%

{A} => {B} [support = 40%, confidence = 67%] : Strong

Confidence({A} => {B}) = 67% < P({B}) = 75% : Not interesting, negative correlation

 {A} => {B} [support, confidence, correlation]

 P(AB) = P(A) * P(B) : A and B are independent of each other.
 P(AB) < P(A) * P(B) : Negative correlation.
 P(AB) > P(A) * P(B) : Positive correlation.
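A quick numeric check of this example using lift, lift(A, B) = P(A∪B) / (P(A)·P(B)); a lift below 1 indicates negative correlation:

```python
n, a, b, ab = 10_000, 6_000, 7_500, 4_000

support = ab / n                        # 0.40
confidence = ab / a                     # 0.666...
lift = (ab / n) / ((a / n) * (b / n))   # < 1 means negative correlation

print(round(support, 2), round(confidence, 2), round(lift, 2))
# -> 0.4 0.67 0.89  (strong rule, but negatively correlated)
```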
• Basic Concepts

• Data classification is a two-step process:


• Learning step (training phase).

• Classification
• Supervised learning vs. unsupervised learning.

• The accuracy of a classifier.


• 5.1 - Decision Tree Induction.

 Why are decision tree classifiers so popular?


Example:

Age   Salary   Class
30    65       G
23    15       B
40    75       G
55    40       B
55    100      G
45    60       G

Decision tree: the root tests (Age ≤ 35); the left branch tests (Salary ≤ 40) and the right branch tests (Salary ≤ 50); in each case the "yes" leaf is B and the "no" leaf is G.

Classification Rules:
Class B: {(Age<=35) and (Salary<=40)} or {(Age>35) and (Salary<=50)}
Class G: {(Age<=35) and (Salary>40)} or {(Age>35) and (Salary>50)}
Test data: (Age=25 and Salary=50) => Class G
• Generating a Decision tree (SS = the attribute Selection Strategy):

1. Apply SS to D to find the Splitting-Criterion.
2. If the criterion splits D into n partitions:
   Use the best split to partition D into D1, D2, ..., Dn.
   For (i = 1; i <= n; ++i)
       Build_tree(ni, Di, SS)
3. End if
• Classification Algorithms:
  ID3 Algorithm (Information Gain)
  C4.5 (Gain Ratio)
  CART (Gini Index)
➢ ID3 (Iterative Dichotomiser 3) Algorithm:

• Iteratively (repeatedly) dichotomises (divides) features into two or more groups at each step.

• Invented by John Ross Quinlan, ID3 uses a top-down greedy approach to build a decision tree. In simple words, the top-down approach means that we start building the tree from the top, and the greedy approach means that at each iteration we select the best attribute at the present moment to create a node.

• ID3 uses Information Gain (or just Gain) to find the best attribute.

• The attribute with the highest Information Gain is selected as the best one.
➢ ID3 steps:

1. Calculate the Information Gain of each attribute.

2. Make a decision tree node using the attribute with the maximum
Information gain.

3. If all rows belong to the same class, make the current node as a leaf node
with the class as its label.

4. Repeat for the remaining features until we run out of all features, or the
decision tree has all leaf nodes.
• ID3 algorithm (Information Gain)

• Example: D contains 7,000 tuples with Info(D) = E = 0.82. Splitting on attribute A produces three partitions whose weighted entropies are e1 = 0.15, e2 = 0.40 (D2: 2,500 tuples), and e3 = 0.13.

• Gain(A) = E - (e1 + e2 + e3) = 0.82 - (0.15 + 0.40 + 0.13) = 0.14


• Information Gain (Entropy)

Gain(A) = Info(D) - Info_A(D)

Info(D) = -Σ_{i=1}^{m} p_i · log2(p_i),  where p_i = |C_{i,D}| / |D|

Info_A(D) = Σ_{j=1}^{V} (|D_j| / |D|) · Info(D_j)
RID age income student credit-rating Class: buy-computer

1 youth high no fair no


2 youth high no excellent no
3 middle high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle medium no excellent yes
13 middle high yes fair yes
14 senior medium no excellent no
For the whole data set D (14 tuples): 9 tuples of class yes and 5 of class no.

 Info(D) = -Σ_{i=1}^{m} p_i · log2(p_i),  p_i = |C_{i,D}| / |D|

 Info(D) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.94
Partitioning D on age: youth: 5 tuples (2 yes, 3 no), middle: 4 tuples (4 yes, 0 no), senior: 5 tuples (3 yes, 2 no).

➢ Info_A(D) = Σ_{j=1}^{V} (|D_j| / |D|) · Info(D_j)

 Info_age(D) = (5/14)·[-(2/5)log2(2/5) - (3/5)log2(3/5)]
             + (4/14)·[-(4/4)log2(4/4) - (0/4)log2(0/4)]
             + (5/14)·[-(3/5)log2(3/5) - (2/5)log2(2/5)] = 0.69

 Gain(age) = 0.94 - 0.69 = 0.25


• Example:
Gain(age) = 0.25
Gain(income) = 0.03
Gain(credit_rating) = 0.05
Gain(student) = 0.15
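As a sketch, the gains above can be recomputed from the 14-row training table in a few lines of Python:

```python
from math import log2

# (age, income, student, credit_rating, class) for the 14 training tuples.
rows = [
    ("youth", "high", "no", "fair", "no"), ("youth", "high", "no", "excellent", "no"),
    ("middle", "high", "no", "fair", "yes"), ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"), ("senior", "low", "yes", "excellent", "no"),
    ("middle", "low", "yes", "excellent", "yes"), ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"), ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"), ("middle", "medium", "no", "excellent", "yes"),
    ("middle", "high", "yes", "fair", "yes"), ("senior", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(subset):
    """Info(D) = -sum p_i log2 p_i over the class labels of the subset."""
    labels = [r[-1] for r in subset]
    return -sum((labels.count(c) / len(labels)) * log2(labels.count(c) / len(labels))
                for c in set(labels))

def gain(attr):
    col = ATTRS[attr]
    values = set(r[col] for r in rows)
    info_a = sum(len(part) / len(rows) * info(part)
                 for v in values
                 for part in [[r for r in rows if r[col] == v]])
    return info(rows) - info_a

for a in ATTRS:
    print(a, round(gain(a), 2))
# -> age 0.25, income 0.03, student 0.15, credit_rating 0.05
```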
• Split point for continuous-valued attributes:
• Given v values of A, (v - 1) possible split points are evaluated, each the midpoint of two adjacent values: (a_i + a_{i+1}) / 2

D1: A ≤ split-point
D2: A > split-point

• Exercise: Complete the decision tree in the Fig. 8-5.


➢ C4.5 Algorithm:

• Developed by John Ross Quinlan again and is an extension of ID3 algorithm.

• C4.5 uses Gain Ratio to find the best attribute.

• Gain Ratio is the normalised Information Gain; it reduces the bias toward multi-valued attributes.

• The attribute with the highest Gain Ratio is selected as the best one.
• Gain Ratio:

GainRatio(A) = Gain(A) / SplitInfo_A(D)

SplitInfo_A(D) = -Σ_{j=1}^{v} (|D_j| / |D|) · log2(|D_j| / |D|)

SplitInfo_age(D) = -(5/14)·log2(5/14) - (4/14)·log2(4/14) - (5/14)·log2(5/14) = 1.57
• Example:
GainRatio(age) = 0.25 / 1.57 = 0.16

• Exercise:
GainRatio(income) = ?
GainRatio(credit_rating) = ?
GainRatio(student) = ?

• Exercise: Redraw the previous tree using the GainRatio


➢ CART (Classification And Regression Tree) Algorithm:

• Developed by Breiman.

• CART uses Gini Index to find the best splitting criterion.

• CART makes a binary decision tree.

• The attribute with the lowest Gini Index is selected as the best one.
• Gini Index:

Gini_A(D) = (|D1| / |D|)·Gini(D1) + (|D2| / |D|)·Gini(D2)

Gini(D) = 1 - Σ_{i=1}^{m} p_i²,  where p_i = |C_{i,D}| / |D|
Binary splits on age (D1 = subset1, D2 = subset2):

subset1     subset2
{}          {youth, middle, senior}
{youth}     {middle, senior}
{middle}    {youth, senior}
{senior}    {youth, middle}

For the split {youth, senior} vs. {middle}: {youth, senior}: 10 tuples (5 yes, 5 no); {middle}: 4 tuples (4 yes, 0 no).

 Gini_age∈{youth,senior},{middle}(D)
 = (10/14)·[1 - (5/10)² - (5/10)²] + (4/14)·[1 - (4/4)² - (0/4)²] = 0.36
• Gini Index is a binary index.

• Example:
Gini_age({youth, senior})(D) = Gini_age({middle})(D) = 0.36
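A small sketch checking that value (class counts taken from the training table above):

```python
def gini(counts):
    """Gini impurity 1 - sum(p_i^2) for a list of class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

n_total = 14
# Split on age: {youth, senior} has 5 yes / 5 no, {middle} has 4 yes / 0 no.
gini_split = (10 / n_total) * gini([5, 5]) + (4 / n_total) * gini([4, 0])
print(round(gini_split, 2))  # -> 0.36
```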

• Exercise: Calculate the Gini Index for all attributes and all combinations.

• Exercise: Redraw the previous tree using the Gini Index.


• Evaluating a decision tree:

• Tree pruning

• Pre-pruning.

• Post-pruning.

• Advantages of using decision trees.


• Other algorithms for building decision trees.

• CHAID.

• MARS.

• Rainforest.

• Boat.
• Bayesian Classification

• Bayes’ Theorem:

P(A|B) = P(B|A) · P(A) / P(B)

• Naïve Bayesian:
A, B independent from each other => P(AB) = P(A) * P(B)

X = (x1, x2, ..., xn) ⇒

P(X) = P(x1) · P(x2) · ... · P(xn) = Π_{k=1}^{n} P(xk)

P(X|Ci) = P(x1|Ci) · P(x2|Ci) · ... · P(xn|Ci) = Π_{k=1}^{n} P(xk|Ci)
• Bayesian Classification

• X = (x1, x2, ..., xn), m classes: (C1, C2, ..., Cm)

• If P(Ci|X) > P(Cj|X) for all j, 1 ≤ j ≤ m, j ≠ i, then X ∈ Ci

P(Ci|X) = P(X|Ci) · P(Ci) / P(X)

P(Ci) = |C_{i,D}| / |D|

P(X|Ci) = P(x1|Ci) · P(x2|Ci) · ... · P(xn|Ci) = Π_{k=1}^{n} P(xk|Ci)

• For continuous attributes:

P(xk|Ci) = (1 / (√(2π)·σ_Ci)) · exp(-(xk - μ_Ci)² / (2σ_Ci²))
RID age income student credit-rating Class: buy-computer

1 youth high no fair no


2 youth high no excellent no
3 middle high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle medium no excellent yes
13 middle high yes fair yes
14 senior medium no excellent no
• Example:
X = (age=youth, income=medium, student=yes, credit-rating=fair)

X ∈ C1 or C2?

P(X|C1) * P(C1) = ?    P(X|C2) * P(C2) = ?

P(C1) = P(buy-computer = yes) = 9/14 = 0.643

P(C2) = P(buy-computer = no) = 5/14 = 0.357
• Example:
X = (age=youth, income=medium, student=yes, credit-rating=fair)

P(X|C1) = P(age=youth | buy-computer=yes) *
          P(income=medium | buy-computer=yes) *
          P(student=yes | buy-computer=yes) *
          P(credit-rating=fair | buy-computer=yes) = (2/9) * (4/9) * (6/9) * (6/9) = 0.044

P(X|C2) = P(age=youth | buy-computer=no) *
          P(income=medium | buy-computer=no) *
          P(student=yes | buy-computer=no) *
          P(credit-rating=fair | buy-computer=no) = (3/5) * (2/5) * (1/5) * (2/5) = 0.019
P(X|C1) * P(C1) = 0.044 * 0.643 = 0.028
P(X|C2) * P(C2) = 0.019 * 0.357 = 0.007

P(X ∈ C1) = 0.028 / (0.028 + 0.007) = 0.80

P(X ∈ C2) = 0.007 / (0.028 + 0.007) = 0.20

X ∈ C1
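A compact sketch that reproduces this Naïve Bayes example from the training table (no Laplacian correction; attribute order is age, income, student, credit-rating):

```python
rows = [
    ("youth", "high", "no", "fair", "no"), ("youth", "high", "no", "excellent", "no"),
    ("middle", "high", "no", "fair", "yes"), ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"), ("senior", "low", "yes", "excellent", "no"),
    ("middle", "low", "yes", "excellent", "yes"), ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"), ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"), ("middle", "medium", "no", "excellent", "yes"),
    ("middle", "high", "yes", "fair", "yes"), ("senior", "medium", "no", "excellent", "no"),
]
x = ("youth", "medium", "yes", "fair")

scores = {}
for c in ("yes", "no"):
    in_class = [r for r in rows if r[-1] == c]
    prior = len(in_class) / len(rows)                    # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(x):                        # P(xk | Ci), counted from the data
        likelihood *= sum(1 for r in in_class if r[k] == value) / len(in_class)
    scores[c] = prior * likelihood

print({c: round(s, 3) for c, s in scores.items()})                          # {'yes': 0.028, 'no': 0.007}
print({c: round(s / sum(scores.values()), 2) for c, s in scores.items()})   # {'yes': 0.8, 'no': 0.2}
```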
• Laplacian correction:

C1: buy_computer = yes: 1000 records

income = low: 0 records      => P(income=low|C1)    = 0/1000 = 0.000
income = medium: 990 records => P(income=medium|C1) = 990/1000 = 0.990
income = high: 10 records    => P(income=high|C1)   = 10/1000 = 0.010
• With the Laplacian correction (add 1 to each count):

C1: buy_computer = yes: 1000 records

income = low: 0 records      => P(income=low|C1)    = 1/1003 = 0.001
income = medium: 990 records => P(income=medium|C1) = 991/1003 = 0.988
income = high: 10 records    => P(income=high|C1)   = 11/1003 = 0.011
• Using IF-THEN Rules for Classification:

IF condition THEN conclusion
   (antecedent)    (consequent)

Example:
R1: IF age = youth AND student = yes THEN buy_computer = yes

R1: (age = youth) ^ (student = yes) => (buy_computer = yes)


• Coverage and Accuracy: A rule R can be assessed by its coverage and accuracy.

• Accuracy(R) = n_correct / n_covers        Coverage(R) = n_covers / |D|
• Extracting classification rules from a decision tree:

R1: IF age = youth AND student = no THEN buy_computer = no
R2: IF age = youth AND student = yes THEN buy_computer = yes
R3: IF age = middle aged THEN buy_computer = yes
R4: IF age = senior AND credit rating = excellent THEN buy_computer = no
R5: IF age = senior AND credit rating = fair THEN buy_computer = yes
• Metrics for Evaluating Classifier Performance:

• Confusion Matrix: actual classes in the rows (P positive and N negative tuples), predicted classes in the columns (P′ predicted positive, N′ predicted negative, with P′ + N′ = P + N); the cells hold TP, FN, FP, TN.

• Accuracy = (TP + TN) / (P + N)

• Precision = TP / (TP + FP) = TP / P′

• Recall = TP / (TP + FN) = TP / P
• Example:

             Predicted
             yes     no      Total
Actual yes   90      210     300
       no    140     9560    9700
Total        230     9770    10000

• Precision = 90/230 = 39.13%
• Recall = 90/300 = 30.00%

• F = (2 * precision * recall) / (precision + recall)
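A quick check of these metrics in Python (accuracy and the F-measure are included as well):

```python
tp, fn = 90, 210      # actual yes
fp, tn = 140, 9560    # actual no

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f_measure, 4))
# -> 0.965 0.3913 0.3 0.3396
```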
➢ Estimating Accuracy Methods:
1. Holdout Method: The given data are randomly partitioned into two
independent sets, a training set and a test set. Typically, two-thirds of the data
are allocated to the training set, and the remaining one-third is allocated to
the test set.
➢ Estimating Accuracy Methods:

2. Random Subsampling: The holdout method is repeated k times and


averaged.
➢ Estimating Accuracy Methods:
3. (k-fold) Cross-Validation: the initial data are randomly partitioned into k
subsets (folds): 𝐷1 , 𝐷2 , …, 𝐷𝑘 . Training and testing is performed k times. In
iteration i, partition 𝐷𝑖 is reserved as the test set, and the remaining partitions
are collectively used to train the model.
➢ Estimating Accuracy Methods:
4. Bootstrap: the bootstrap method samples the given training tuples uniformly
with replacement.

.632 bootstrap: On average, 63.2% of the original data tuples will end up in the
bootstrap sample, and the remaining 36.8% will form the test set.
• In addition to accuracy-based measures, classifiers can also be compared with respect to the following additional aspects:

• Speed
• Robustness
• Scalability
• Interpretability
• Introduction: Cohesion and Separation.

• Suppose a data set, D, contains n objects in Euclidean space. Partitioning


methods distribute the objects in D into k clusters, 𝑪𝟏 , ... , 𝑪𝒌 , that is, 𝑪𝒊 ⊂ D and
𝑪𝒊 ∩ 𝑪𝒋 = ∅ for (1 ≤ i, j ≤ k).

• k-means: A Centroid-Based Technique.


• Example 6-1: i 𝑿𝒊
𝒄𝟏 =3, 𝒄𝟐 =6 1 5
2 9
3 3
4 6
5 4
6 2
7 8
8 5
9 3
10 7
• Example 6-1:

Step_1: c1 = 3, c2 = 6

C1 = {X3, X5, X6, X9},  C2 = {X1, X2, X4, X7, X8, X10}

Step_2: c1 = (3 + 4 + 2 + 3)/4 = 3.0,  c2 = (5 + 9 + 6 + 8 + 5 + 7)/6 = 6.67

C1 = {X3, X5, X6, X9},  C2 = {X1, X2, X4, X7, X8, X10}  (no change, so the algorithm stops)
• Exercise 1: Solve the previous example with 𝒄𝟏 = 2.0 and 𝒄𝟐 = 7.0

• Exercise 2: Solve the previous example with 𝒄𝟏 = 2.5, 𝒄𝟐 = 5.0 , and 𝒄𝟑 = 7.5
• Example 6-2:
c1 = (1.0, 1.1), c2 = (5.0, 7.0)

i   X1(i)   X2(i)
1   1.0     1.0
2   1.5     2.0
3   3.0     4.0
4   5.0     7.0
5   3.5     5.0
6   4.5     5.0
7   3.5     4.5
Step_1: c1 = (1.0, 1.1), c2 = (5.0, 7.0)

C1 = {X1, X2, X3},  C2 = {X4, X5, X6, X7}

Step_2: c1 = ((1 + 1.5 + 3)/3, (1 + 2 + 4)/3) = (1.83, 2.33)
        c2 = ((5 + 3.5 + 4.5 + 3.5)/4, (7 + 5 + 5 + 4.5)/4) = (4.125, 5.375)

C1 = {X1, X2},  C2 = {X3, X4, X5, X6, X7}

Step_3: c1 = ((1 + 1.5)/2, (1 + 2)/2) = (1.25, 1.5)
        c2 = ((3 + 5 + 3.5 + 4.5 + 3.5)/5, (4 + 7 + 5 + 5 + 4.5)/5) = (3.9, 5.1)

C1 = {X1, X2},  C2 = {X3, X4, X5, X6, X7}  (no change, so the algorithm stops)
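A minimal k-means sketch in plain Python that reproduces Example 6-2 (Euclidean distance and mean centroids; the starting centroids are those given above):

```python
from math import dist  # Euclidean distance, Python 3.8+

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centroids = [(1.0, 1.1), (5.0, 7.0)]

while True:
    # Assignment step: each point goes to its nearest centroid.
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    # Update step: each centroid becomes the mean of its cluster.
    new_centroids = [tuple(sum(c) / len(c) for c in zip(*cluster)) for cluster in clusters]
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(centroids)  # ≈ [(1.25, 1.5), (3.9, 5.1)]
print(clusters)
```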

• Exercise 3: Solve the previous example with 𝒄𝟏 = (3.0, 3.0) and 𝒄𝟐 = (4.0, 4.0)
• k-means is sensitive to outliers:
• Complexity of k-means: O(nkt)
• n: No of Subjects.

• k: No of Clusters.

• t: No of iterations.

• k << n

• t << n

• Other drawbacks of k-means:

• Evaluation of Clusters:

E = Σ_{i=1}^{k} Σ_{p∈C_i} dist(p, c_i)²


• Hierarchical Clustering:
• Bottom-Up: AGglomerative NESting (AGNES).

• Top-Down: DIvisive ANAlysis (DIANA).


• Distance Measures in Algorithmic Methods (between two clusters r and s):

• Single Link
• Complete Link
• Average Link
• Centroid
• Ward’s method
• Single Linkage - Nearest Neighbour: with cluster r = {a, b, c} and cluster s = {d, e, f, g} from the figure,

dist(r, s) = dist(b, e)
• Farthest Neighbour - Complete Linkage: for the same clusters r = {a, b, c} and s = {d, e, f, g},

dist(r, s) = dist(a, g)
• Average Linkage:

dist(r, s) = [dist(a, d) + dist(a, e) + dist(a, f) + dist(a, g) + ... + dist(c, g)] / 12
• Centroid: with m the centroid of r and n the centroid of s,

dist(r, s) = dist(m, n)
• Ward’s method: Minimises the total within-cluster variance.

• Error Sum of Squares (ESS): for a cluster with members a, b, c and errors e1, e2, e3 from the cluster centre, ESS = e1² + e2² + e3²
• Example 6-3: Bottom-Up approach, Single-Linkage; merge clusters while the smallest inter-cluster distance is at most 1.8.

i 𝐗𝟏 𝐗𝟐
A 1.0 1.0
B 1.0 3.5
C 3.0 1.0
D 2.0 4.0
E 1.0 3.0
F 3.0 2.0
• Example 6-3: initial distance matrix (the smallest entry, d(B, E) = 0.5, is merged first):

     A      B      C      D      E      F
A    0      2.5    2.0    3.16   2.0    2.24
B    -      0      3.20   1.12   0.5    2.5
C    -      -      0      3.16   2.83   1.0
D    -      -      -      0      1.41   2.24
E    -      -      -      -      0      2.24
F    -      -      -      -      -      0
• Example 6-3: after merging B and E into BE (the next smallest entry, d(C, F) = 1.0, is merged next):

i    X1    X2
A    1.0   1.0
BE   1.0   3.25
C    3.0   1.0
D    2.0   4.0
F    3.0   2.0

     A      BE     C      D      F
A    0      2.0    2.0    3.16   2.24
BE   -      0      2.83   1.12   2.24
C    -      -      0      3.16   1.0
D    -      -      -      0      2.24
F    -      -      -      -      0
• Example 6-3: after merging C and F into CF (the next smallest entry, d(BE, D) = 1.12, is merged next):

i    X1    X2
A    1.0   1.0
BE   1.0   3.25
CF   3.0   1.5
D    2.0   4.0

     A      BE     CF     D
A    0      2.0    2.0    3.16
BE   -      0      2.24   1.12
CF   -      -      0      2.24
D    -      -      -      0
• Example 6-3: after merging BE and D into BDE:

i     X1    X2
A     1.0   1.0
BDE   1.0   3.25
CF    3.0   1.5

      A      BDE    CF
A     0      2.0    2.0
BDE   -      0      2.24
CF    -      -      0

All remaining distances exceed 1.8, so merging stops with the clusters {A}, {B, D, E}, {C, F}.
• Example 6-3: Dendrogram - AGglomerative NESting (AGNES): leaves in the order A, B, E, D, C, F; merges at heights 0.5 (B-E), 1.0 (C-F), and 1.12 (BE-D), with the vertical axis running from 0.5 to 2.0.
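For reference, a short sketch that reproduces this single-linkage clustering with SciPy; the scipy.cluster.hierarchy calls below (linkage, fcluster) are an assumption about that library being available:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.0, 3.5], [3.0, 1.0], [2.0, 4.0], [1.0, 3.0], [3.0, 2.0]])
labels = ["A", "B", "C", "D", "E", "F"]

Z = linkage(points, method="single", metric="euclidean")  # agglomerative, single link
clusters = fcluster(Z, t=1.8, criterion="distance")       # cut where merge distance > 1.8

for label, c in zip(labels, clusters):
    print(label, c)
# Expected grouping: {A}, {B, D, E}, {C, F}
```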
• Exercise: Solve the previous example with Complete-Link, Average-Link, Centroid, and Ward’s method.
