Data Mining Presentation
Mohammadi Zanjireh
◦ Slice
◦ Dice
Observations
1001 Ali 17.12
1002 Reza 13.89
1003 Maryam 16.02
1004 Hasan 15.45
Attribute types:
◦ Nominal: Subject, Occupation.
◦ Numeric
Interval_scaled.
Ratio_scaled.
Discrete vs. Continuous attributes.
Mean:
Trimmed Mean:
Median:
Mode:
Variance and Standard Deviation:
Quartile:
Q1, Q2=Median, Q3
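A minimal Python sketch (standard library only) of these measures of central tendency and dispersion; the data reuses the Grade column from the normalising example later in these slides, and the 10% trimming proportion is an illustrative assumption.

import statistics

grades = [30, 52, 84, 45, 25, 20, 91, 65, 42, 32]   # Grade values from the normalising example

mean = statistics.mean(grades)            # arithmetic mean
median = statistics.median(grades)        # middle value of the sorted data
mode = statistics.mode([1, 2, 2, 3, 4])   # most frequent value (shown on a small list with a repeat)
variance = statistics.variance(grades)    # sample variance (n - 1 in the denominator)
stdev = statistics.stdev(grades)          # sample standard deviation

def trimmed_mean(values, proportion=0.1):
    # Mean after dropping `proportion` of the values from each end of the sorted data.
    data = sorted(values)
    k = int(len(data) * proportion)
    return statistics.mean(data[k:len(data) - k] if k else data)

print(mean, median, mode, variance, stdev, trimmed_mean(grades))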
Example1:
1 2 3 4 5 6 7 8 9
Q2=5.0
Q1=2.5 Q3=7.5
Example2:
1 2 3 4 5 6 7 8 9 10
Q2=5.5
Q1=3.0 Q3=8.0
Example3:
1 2 3 4 5 6 7 8 9 10 11
Q2=6.0
Q1=3.0 Q3=9.0
Example4:
1 2 3 4 5 6 7 8 9 10 11 12
Q2=6.5
Q1=3.5 Q3=9.5
Inter Quartile Range (IQR):
IQR = Q3 - Q1
Example5:
0 2 5 8 8 8 9 9 10 10 10 11 12 12 12 14 15 20 25
Q2=10.0
Q1=8.0 Q3=12.0
IQR = Q3 - Q1 = 12.0 - 8.0 = 4.0
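A sketch of the quartile convention that reproduces the examples above (split the sorted data at the median, exclude the median itself when n is odd, and take the median of each half); this convention is an assumption read off the worked values, coded by hand rather than taken from a library.

from statistics import median

def quartiles(values):
    # Q1, Q2, Q3 using the median-of-halves rule used in the examples above.
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]       # values below the median position
    upper = data[-half:]      # values above the median position
    return median(lower), median(data), median(upper)

example5 = [0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 12, 14, 15, 20, 25]
q1, q2, q3 = quartiles(example5)
print(q1, q2, q3, q3 - q1)    # Q1=8, Q2=10, Q3=12, IQR=4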
Student_1
Student_i
Student_n
Dissimilarity Matrix:
Similarity Matrix:
Sim(i,j)=1-d(i,j)
Proximity measures for Nominal attributes
Example:
Id Subject Birth_City Living_City Eye_Colour
1 Computer Teh Teh Black
2 Electricity Teh Kar Brown
3 Mechanic Qaz Qaz Brown
4 Computer Kar Qaz Green
Dissimilarity Matrix =
0.00
0.75  0.00
1.00  0.75  0.00
0.75  1.00  0.75  0.00
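A sketch reproducing the matrix above; for nominal attributes the dissimilarity is assumed to be d(i,j) = (p - m) / p, where p is the number of attributes and m the number of matching values (this rule is inferred from the values shown).

records = [
    ("Computer",    "Teh", "Teh", "Black"),
    ("Electricity", "Teh", "Kar", "Brown"),
    ("Mechanic",    "Qaz", "Qaz", "Brown"),
    ("Computer",    "Kar", "Qaz", "Green"),
]

def nominal_dissimilarity(a, b):
    # d(i,j) = (p - m) / p : fraction of attributes that do not match.
    matches = sum(x == y for x, y in zip(a, b))
    return (len(a) - matches) / len(a)

for i, r in enumerate(records):
    row = ["%.2f" % nominal_dissimilarity(r, records[j]) for j in range(i + 1)]
    print(row)    # lower triangle; matches the matrix above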
Proximity measures for Binary attributes
Symmetric Binary: d(i,j) = (r + s) / (q + r + s + t)
Asymmetric Binary: d(i,j) = (r + s) / (q + r + s)
Example: Gender is symmetric; the other attributes are asymmetric.
d(Ali, Maryam) = 2/4 = 0.50
d(Ali, Reza) = 2/4 = 0.50
d(Maryam, Reza) = 4/5 = 0.80
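A sketch of the contingency counts q, r, s, t and the two binary distances; the Ali/Reza/Maryam records themselves are not reproduced above, so the function is demonstrated on hypothetical 0/1 vectors.

def binary_dissimilarity(a, b, symmetric=True):
    # a, b: lists of 0/1 values for one object each, same attribute order.
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)   # both 1
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)   # 1 in a, 0 in b
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)   # 0 in a, 1 in b
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)   # both 0
    if symmetric:
        return (r + s) / (q + r + s + t)   # symmetric binary distance
    return (r + s) / (q + r + s)           # asymmetric: 0/0 matches are ignored

# Hypothetical asymmetric binary attributes (1 = positive, 0 = negative):
ali, reza = [1, 0, 1, 0, 0], [1, 1, 0, 0, 0]
print(binary_dissimilarity(ali, reza, symmetric=False))   # (r + s) / (q + r + s) = 2/3 ≈ 0.67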
Euclidean Distance:
Manhattan Distance:
Supremum Distance:
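A sketch of the three distances named above for two numeric vectors; the sample points are hypothetical.

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))   # L2 norm

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))                # L1 norm

def supremum(a, b):
    return max(abs(x - y) for x, y in zip(a, b))                # L-infinity (Chebyshev)

p, q = (1.0, 2.0), (3.0, 5.0)    # hypothetical points
print(euclidean(p, q), manhattan(p, q), supremum(p, q))   # ≈ 3.61, 5.0, 3.0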
Proximity measures for Ordinal attributes
Small = 1, Medium = 2, Large = 3
Example:
Degree={Diploma, Undergraduate, Master, PhD}
Drink_size={Small, Medium, Large}
Id Degree Drink_size
1 Undergraduate Medium
2 PhD Small
3 Diploma Medium
4 Undergraduate Large
Example (ranks mapped onto [0, 1] with z = (rank - 1) / (M - 1)):
Degree={Diploma, Undergraduate, Master, PhD}
Drink_size={Small, Medium, Large}
Id Degree Drink_size
1 0.33 0.50
2 1.00 0.00
3 0.00 0.50
4 0.33 1.00
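A sketch of the rank mapping z = (rank - 1) / (M - 1) that reproduces the table above.

degree = ["Diploma", "Undergraduate", "Master", "PhD"]
drink_size = ["Small", "Medium", "Large"]

def ordinal_to_unit(value, ordered_levels):
    # Map an ordinal value onto [0, 1]: (rank - 1) / (M - 1).
    rank = ordered_levels.index(value) + 1
    return (rank - 1) / (len(ordered_levels) - 1)

rows = [("Undergraduate", "Medium"), ("PhD", "Small"),
        ("Diploma", "Medium"), ("Undergraduate", "Large")]
for d, s in rows:
    print(round(ordinal_to_unit(d, degree), 2), round(ordinal_to_unit(s, drink_size), 2))
# 0.33 0.5 / 1.0 0.0 / 0.0 0.5 / 0.33 1.0 — matches the table above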
Normalising:
Id Grade
1 30
2 52
3 84
4 45
5 25
6 20
7 91
8 65
9 42
10 32
Normalising:
Id Grade Grade-min
1 30 10
2 52 32
3 84 64
4 45 25
5 25 5
6 20 0
7 91 71
8 65 45
9 42 22
10 32 12
Normalising:
Id  Grade  Grade-min  (Grade-min)/(max-min)
1 30 10 0.14
2 52 32 0.45
3 84 64 0.90
4 45 25 0.35
5 25 5 0.07
6 20 0 0.00
7 91 71 1.00
8 65 45 0.63
9 42 22 0.31
10 32 12 0.17
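A sketch reproducing the last column above with min-max normalisation.

grades = [30, 52, 84, 45, 25, 20, 91, 65, 42, 32]

lo, hi = min(grades), max(grades)                      # 20 and 91
normalised = [(g - lo) / (hi - lo) for g in grades]    # (Grade - min) / (max - min)
print([round(v, 2) for v in normalised])
# [0.14, 0.45, 0.9, 0.35, 0.07, 0.0, 1.0, 0.63, 0.31, 0.17] — matches the last column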
Proximity measures for mixed types:
Examples:
Id Test_1(nominal) Test_2(ordinal) Test_3(numeric)
1 Code A Excellent 45
2 Code B Fair 22
3 Code C Good 64
4 Code A Excellent 28
Examples:
Id Test_1(nominal) Test_2(ordinal) Test_3(numeric)
1 Code A 1 0.55
2 Code B 0 0.00
3 Code C 0.5 1.00
4 Code A 1 0.14
Dissimilarity Matrix =
0.00
0.85  0.00
0.65  0.83  0.00
0.13  0.71  0.79  0.00
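The matrix above is consistent with averaging one dissimilarity per attribute (nominal: 0/1 mismatch, ordinal: absolute difference of the [0, 1] ranks, numeric: absolute difference of the min-max normalised values); a sketch under that assumption.

# Converted values from the table above: (Test_1 nominal, Test_2 ordinal in [0,1], Test_3 normalised)
objects = [("Code A", 1.0, 0.55), ("Code B", 0.0, 0.00),
           ("Code C", 0.5, 1.00), ("Code A", 1.0, 0.14)]

def mixed_dissimilarity(a, b):
    d_nominal = 0.0 if a[0] == b[0] else 1.0
    d_ordinal = abs(a[1] - b[1])
    d_numeric = abs(a[2] - b[2])
    return (d_nominal + d_ordinal + d_numeric) / 3     # unweighted average over the 3 attributes

for i, a in enumerate(objects):
    print(["%.2f" % mixed_dissimilarity(a, objects[j]) for j in range(i + 1)])
# 0.00 / 0.85 0.00 / 0.65 0.83 0.00 / 0.14 0.71 0.79 0.00
# ≈ the matrix above; the small differences come from rounding of the converted values.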
Exercise:
Id Test_1(nominal) Test_2(ordinal) Test_3(numeric)
Similarity Matrix =
1.00
0.94  1.00
0.17  0.12  1.00
0.07  0.17  0.23  1.00
3.1-Data Cleaning.
3.2-Data Integration.
3.3-Data Reduction.
3.4-Data Transform.
3.1- Data Cleaning
3.1.1-Missing values
o Ignore the tuple.
o Fill in the missing value manually.
o Use a global constant.
o Use the attribute mean or median.
o Use the class mean or median.
3.1- Data Cleaning
➢ 3.1.2-Noise
o Binning
o Smoothing by bin means.
o Smoothing by bin boundaries.
o Regression
o Clustering.
Example:
4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins: bin1: 4, 8, 15. bin2: 21, 21, 24. bin3: 25, 28, 34
Smoothing by bin means: bin1: 9, 9, 9. bin2: 22, 22, 22. bin3: 29, 29, 29
Smoothing by bin boundaries: bin1: 4, 4, 15. bin2: 21, 21, 24. bin3: 25, 25, 34
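A sketch of equal-frequency binning with the two smoothing options, on the same data.

data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [data[i:i + 3] for i in range(0, len(data), 3)]   # equal-frequency bins of size 3

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: every value moves to the closer of its bin's min/max.
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)            # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]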
➢ 3.3.2-Numerosity reduction
o Sampling.
o Clustering.
3.4- Data Transform
❑ Normalisation:
❑ Min_max normalisation:
❑ Z-Score normalisation:
Example:
3, 4, 4, 5, 9
x̄ = 5.00, σ = 2.34
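A sketch of z-score normalisation, (v - x̄) / σ, for the example above using the sample standard deviation.

import statistics

values = [3, 4, 4, 5, 9]
mean = statistics.mean(values)          # 5.0
sd = statistics.stdev(values)           # sample standard deviation, ≈ 2.34

z_scores = [round((v - mean) / sd, 2) for v in values]
print(z_scores)                         # [-0.85, -0.43, -0.43, 0.0, 1.71]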
• Support
Support(A=>B)=P(A∪B)
• Confidence
Confidence(A=>B) = P(B|A) = Support(A∪B) / Support(A) = Support_Count(A∪B) / Support_Count(A)
Mining Association Rules
1. Find all frequent itemsets.
2. Generate strong association rules from the frequent itemsets.
• Example:
D = {T1, T2, T3, T4, T5, T6, T7, T8}
T1 = {tomato, cucumber, olive, onion, parsley}
T2 = {tomato, cucumber, parsley}
T3 = {tomato, cucumber, salt, bread, onion, parsley}
T4 = {tomato, cucumber, bread, onion}
T5 = {tomato, salt, onion}
T6 = {bread, cheese}
T7 = {tomato, onion, cucumber}
T8 = {bread, butter}
6. L3
Itemset        Sup_Count
{I1, I2, I3}   2
{I1, I2, I5}   2
7. C4 = L3 * L3, Prune
L4 = ∅.
End of Step 1.
Code of Apriori Algorithm:
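A minimal Python sketch of the Apriori idea (level-wise candidate generation, subset pruning, and support counting), run on the T100..T900 transactions used later in this example; a minimum support count of 2 is assumed, consistent with L3 above.

from itertools import combinations

# Transactions from the running example (T100..T900).
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP_COUNT = 2   # assumed minimum support count

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
level = [frozenset([i]) for i in items if support_count(frozenset([i])) >= MIN_SUP_COUNT]
frequent = {s: support_count(s) for s in level}

k = 2
while level:
    # Join step: build size-k candidates from frequent (k-1)-itemsets.
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must itself be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    level = [c for c in candidates if support_count(c) >= MIN_SUP_COUNT]
    frequent.update({c: support_count(c) for c in level})
    k += 1

for s, cnt in sorted(frequent.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(s), cnt)   # e.g. ['I1', 'I2', 'I3'] 2 and ['I1', 'I2', 'I5'] 2 appear in L3 above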
Confidence: X = {I1, I2, I5} (× marks rules below the minimum confidence threshold)
{I1} => {I2, I5}: Confidence = 2/6 = 33% ×
{I2} => {I1, I5}: Confidence = 2/7 = 29% ×
{I5} => {I1, I2}: Confidence = 2/2 = 100%
{I1, I2} => {I5}: Confidence = 2/4 = 50% ×
{I1, I5} => {I2}: Confidence = 2/2 = 100%
{I2, I5} => {I1}: Confidence = 2/2 = 100%
2- T100 = {I1, I2, I5}
[FP-tree after inserting T100]
3- T200 = {I2, I4}
[FP-tree after inserting T200]
4- T300 = {I2, I3}
[FP-tree after inserting T300]
5- T400 = {I1, I2, I4}
[FP-tree after inserting T400]
6- T500 = {I1, I3}
[FP-tree after inserting T500]
7- T600 = {I2, I3}
[FP-tree after inserting T600]
8- T700 = {I1, I3}
[FP-tree after inserting T700]
9- T800 = {I1, I2, I3, I5}
[FP-tree after inserting T800]
10- T900 = {I1, I2, I3}
[FP-tree after inserting T900]
11- FP-tree is mined: We start from the last item of L.
Itemsets        List of TIDs
{I3, I4}        {}
{I3, I5}        {T800}
{I4, I5}        {}
{I1, I2, I3}    {T800, T900}
{I1, I2, I4}    {T400}
{I1, I2, I5}    {T100, T800}
• Classification
• Supervised learning vs. unsupervised learning.
Training data (Age, Salary, Class):
30   65   G
23   15   B
40   75   G
55   40   B
55  100   G
45   60   G
[Decision tree: the root tests Age ≤ 35; the left branch tests Salary ≤ 40, the right branch tests Salary ≤ 50; leaves: B, G, B, G]
Classification Rules:
Class B: {(Age<=35) and (Salary<=40)} or {(Age>35) and (Salary<=50)}
Class G: {(Age<=35) and (Salary>40)} or {(Age>35) and (Salary>50)}
Test data: (Age=25 and Salary=50) => Class G
• Generating Decision tree:
➢ ID3 (Iterative Dichotomiser 3) Algorithm:
• Invented by John Ross Quinlan, ID3 uses a top-down greedy approach to build a
decision tree. In simple words, the top-down approach means that we start
building the tree from the top and the greedy approach means that at each iteration
we select the best attribute at the present moment to create a node.
• ID3 uses Information Gain or just Gain to find the best attribute.
• The attribute with the highest Information Gain is selected as the best one.
➢ ID3 steps:
1. Calculate the Information Gain of each feature.
2. Make a decision tree node using the feature with the maximum Information Gain.
3. If all rows belong to the same class, make the current node a leaf node with that class as its label.
4. Repeat for the remaining features until we run out of features, or the decision tree has all leaf nodes.
• ID3 algorithm
[Figure: Information Gain — dataset D: 7,000 tuples, E = 0.82; one partition D2: 2,500 tuples, e2 = 0.40]
Info(D) = -Σ_{i=1..m} p_i * log2(p_i),  where p_i = |C_i,D| / |D|
Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) * Info(D_j)
RID age income student credit-rating Class: buy-computer
D: 14 tuples; buy-computer: yes = 9, no = 5
Info(D) = -Σ_{i=1..m} p_i * log2(p_i),  p_i = |C_i,D| / |D|
Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94
Partitioning D on age: youth: 5 (yes = 2, no = 3), middle: 4 (yes = 4, no = 0), senior: 5 (yes = 3, no = 2)
Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) * Info(D_j)
Info_age(D) = (5/14) * [-(2/5) log2(2/5) - (3/5) log2(3/5)] + (4/14) * [-(4/4) log2(4/4) - (0/4) log2(0/4)] + (5/14) * [-(3/5) log2(3/5) - (2/5) log2(2/5)] = 0.69
Gain(age) = Info(D) - Info_age(D) = 0.94 - 0.69 = 0.25
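A sketch that reproduces Info(D), Info_age(D), and Gain(age) from the class counts above.

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# buy-computer class counts (yes, no) in each age partition
partitions = {"youth": (2, 3), "middle": (4, 0), "senior": (3, 2)}
n = sum(sum(p) for p in partitions.values())        # 14 tuples in D

info_D = entropy((9, 5))                                               # Info(D) = 0.94
info_age = sum(sum(p) / n * entropy(p) for p in partitions.values())   # Info_age(D) = 0.69
print(round(info_D, 2), round(info_age, 2), round(info_D - info_age, 2))   # Gain(age) ≈ 0.25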
For a continuous attribute A, the candidate split points are the midpoints of adjacent sorted values, (a_i + a_{i+1}) / 2, and each split point gives:
D1: A ≤ split_point
D2: A > split_point
• Gain Ratio is the normalised Information Gain and reduces the bias toward multi-valued attributes.
• The attribute with the highest Gain Ratio is selected as the best one.
• Gain Ratio:
GainRatio(A) = Gain(A) / SplitInfo_A(D)
SplitInfo_A(D) = -Σ_{j=1..v} (|D_j| / |D|) * log2(|D_j| / |D|)
• Example:
SplitInfo_age(D) = -(5/14) log2(5/14) - (4/14) log2(4/14) - (5/14) log2(5/14) = 1.57
GainRatio(age) = 0.25 / 1.57 = 0.16
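A sketch reproducing SplitInfo_age(D) and GainRatio(age).

from math import log2

sizes = [5, 4, 5]                       # |D_j| for age = youth, middle, senior
n = sum(sizes)
split_info_age = -sum(s / n * log2(s / n) for s in sizes)
print(round(split_info_age, 3))         # ≈ 1.577
print(round(0.25 / split_info_age, 2))  # GainRatio(age) ≈ 0.16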
• Exercise:
GainRatio(income) = ?
GainRatio(credit_rating) = ?
GainRatio(student) = ?
• Developed by Breiman.
• The attribute with the lowest Gini Index is selected as the best one.
• Gini Index:
Gini(D) = 1 - Σ_{i=1..m} p_i^2,  where p_i = |C_i,D| / |D|
Gini_A(D) = (|D1| / |D|) * Gini(D1) + (|D2| / |D|) * Gini(D2)
[Figure: binary split on age into D1 (subset1) and D2 (subset2)]
• Example:
Gini_age∈{youth, senior}(D) = Gini_age∈{middle}(D) = 0.36
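A sketch of the Gini Index for the binary split of age into {youth, senior} vs {middle}, reproducing the 0.36 above.

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# class counts (yes, no): youth (2, 3), middle (4, 0), senior (3, 2)
d1 = (2 + 3, 3 + 2)      # age in {youth, senior}: 5 yes, 5 no
d2 = (4, 0)              # age in {middle}:        4 yes, 0 no
n = sum(d1) + sum(d2)    # 14 tuples

gini_split = sum(d1) / n * gini(d1) + sum(d2) / n * gini(d2)
print(round(gini_split, 2))   # 0.36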
• Exercise: Calculate the Gini Index for all attributes and all combinations.
• Tree pruning
• Pre-pruning.
• Post-pruning.
• CHAID.
• MARS.
• Rainforest.
• Boat.
• Bayesian Classification
• Bayes' Theorem:
P(A|B) = P(B|A) * P(A) / P(B)
• Naïve Bayesian:
A, B independent of each other => P(AB) = P(A) * P(B)
X = (x1, x2, ..., xn) =>
P(X|Ci) = P(x1|Ci) * P(x2|Ci) * ... * P(xn|Ci) = Π_{k=1..n} P(xk|Ci)
• Bayesian Classification
For a continuous attribute, P(xk|Ci) is estimated with a Gaussian:
P(xk|Ci) = (1 / (√(2π) * σ_Ci)) * exp(-(xk - μ_Ci)^2 / (2 * σ_Ci^2))
RID age income student credit-rating Class: buy-computer
X ∈ C1 or C2
P(C1) = P(buy-computer = yes) = 9/14 = 0.643
P(C2) = P(buy-computer = no) = 5/14 = 0.357
• Example:
X=(age=youth, income=medium, student=yes, credit-rating=fair)
P(X|C1) = P(age=youth | buy-computer=yes) *
P(income=medium | buy-computer=yes) *
P(student=yes | buy-computer=yes) *
P(credit-rating=fair | buy-computer=yes) = (2/9) * (4/9) * (6/9) * (6/9) = 0.044
P(X|C2) = P(age=youth | buy-computer=no) *
P(income=medium | buy-computer=no) *
P(student=yes | buy-computer=no) *
P(credit-rating=fair | buy-computer=no) = (3/5) * (2/5) * (1/5) * (2/5) = 0.019
P(X|C1) * P(C1) = 0.044 * 0.643 = 0.028
P(X|C2) * P(C2) = 0.019 * 0.357 = 0.007
P(X ∈ C1) = 0.028 / (0.028 + 0.007) = 0.80
P(X ∈ C2) = 0.007 / (0.028 + 0.007) = 0.20
X ∈ C1
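A sketch that reproduces the naïve Bayesian numbers above from the conditional probabilities and the class priors.

# P(x_k | C_i) values used above, and the class priors.
likelihoods = {
    "yes": {"age=youth": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9},
    "no":  {"age=youth": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5},
}
priors = {"yes": 9/14, "no": 5/14}

scores = {}
for c in priors:
    p = priors[c]
    for v in likelihoods[c].values():
        p *= v                       # naïve independence assumption
    scores[c] = p                    # P(X | C_i) * P(C_i)

total = sum(scores.values())
for c, s in scores.items():
    print(c, round(s, 3), round(s / total, 2))   # yes 0.028 0.8 / no 0.007 0.2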
• Laplacian correction:
C1: buy_computer = yes: 1000 records
income = low: 0 records => P(income=low | C1) = 0/1000 = 0.000
income = medium: 990 records => P(income=medium | C1) = 990/1000 = 0.990
income = high: 10 records => P(income=high | C1) = 10/1000 = 0.010
• Laplacian correction (add 1 to each count):
C1: buy_computer = yes: 1000 records
income = low: 0 records => P(income=low | C1) = 1/1003 = 0.001
income = medium: 990 records => P(income=medium | C1) = 991/1003 = 0.988
income = high: 10 records => P(income=high | C1) = 11/1003 = 0.011
• Using IF-THEN Rules for Classification:
antecedent consequent
Example:
R1: IF age = youth AND student = yes THEN buy_computer = yes
• Accuracy(R) = n_correct / n_covers
• Coverage(R) = n_covers / |D|
• Extracting classification rules from a decision tree:
• Confusion Matrix:
(P = TP + FN, N = FP + TN, P′ = TP + FP, N′ = FN + TN, total = P + N = P′ + N′)
• Accuracy = (TP + TN) / (P + N)
• precision = TP / (TP + FP) = TP / P′
• recall = TP / (TP + FN) = TP / P
• Example:
                Predicted
                yes     no      Total
Actual   yes     90    210       300
         no     140   9560      9700
Total           230   9770     10000
• precision = 90/230 = 39.13%
• recall = 90/300 = 30.00%
• F = (2 * precision * recall) / (precision + recall)
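A sketch computing the measures above from the confusion matrix.

TP, FN = 90, 210        # actual yes: predicted yes / predicted no
FP, TN = 140, 9560      # actual no:  predicted yes / predicted no

accuracy = (TP + TN) / (TP + TN + FP + FN)                   # 0.965
precision = TP / (TP + FP)                                   # 90 / 230 ≈ 0.3913
recall = TP / (TP + FN)                                      # 90 / 300 = 0.30
f_measure = 2 * precision * recall / (precision + recall)    # ≈ 0.34

print(round(accuracy, 3), round(precision, 4), round(recall, 2), round(f_measure, 2))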
➢ Estimating Accuracy Methods:
1. Holdout Method: The given data are randomly partitioned into two
independent sets, a training set and a test set. Typically, two-thirds of the data
are allocated to the training set, and the remaining one-third is allocated to
the test set.
➢ Estimating Accuracy Methods:
.632 bootstrap: On average, 63.2% of the original data tuples will end up in the
bootstrap sample, and the remaining 36.8% will form the test set.
• In addition to accuracy-based measures, classifiers can also be compared with
respect to the following additional aspects:
Speed
Robustness
Scalability
Interpretability
• Introduction: Cohesion and Separation.
• Example 6-1:
Step_1: c1 = 3, c2 = 6
C1 = {X3, X5, X6, X9}, C2 = {X1, X2, X4, X7, X8, X10}
Step_2: c1 = (3 + 4 + 2 + 3) / 4 = 3.0, c2 = (5 + 9 + 6 + 8 + 5 + 7) / 6 = 6.67
C1 = {X3, X5, X6, X9}, C2 = {X1, X2, X4, X7, X8, X10}
• Exercise 1: Solve the previous example with 𝒄𝟏 = 2.0 and 𝒄𝟐 = 7.0
• Exercise 2: Solve the previous example with 𝒄𝟏 = 2.5, 𝒄𝟐 = 5.0 , and 𝒄𝟑 = 7.5
• Example 6-2: c1 = (1.0, 1.1), c2 = (5.0, 7.0)
i    X1^(i)    X2^(i)
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
Step_1: c1 = (1.0, 1.1), c2 = (5.0, 7.0)
C1 = {X^(1), X^(2), X^(3)}, C2 = {X^(4), X^(5), X^(6), X^(7)}
Step_2: c1 = ((1 + 1.5 + 3)/3, (1 + 2 + 4)/3) = (1.83, 2.33)
c2 = ((5 + 3.5 + 4.5 + 3.5)/4, (7 + 5 + 5 + 4.5)/4) = (4.125, 5.375)
C1 = {X^(1), X^(2)}, C2 = {X^(3), X^(4), X^(5), X^(6), X^(7)}
Step_3: c1 = ((1 + 1.5)/2, (1 + 2)/2) = (1.25, 1.5)
c2 = ((3 + 5 + 3.5 + 4.5 + 3.5)/5, (4 + 7 + 5 + 5 + 4.5)/5) = (3.9, 5.1)
C1 = {X^(1), X^(2)}, C2 = {X^(3), X^(4), X^(5), X^(6), X^(7)}
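A minimal k-means sketch (standard library only) on the Example 6-2 points, starting from the same initial centres; it converges to the Step_3 centres and clusters.

import math

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centres = [(1.0, 1.1), (5.0, 7.0)]          # initial centres from the example

for step in range(10):                      # k-means usually converges in a few iterations
    # Assignment step: each point goes to its nearest centre (Euclidean distance).
    clusters = [[] for _ in centres]
    for p in points:
        nearest = min(range(len(centres)), key=lambda i: math.dist(p, centres[i]))
        clusters[nearest].append(p)
    # Update step: each centre moves to the mean of its cluster (assumes no cluster becomes empty).
    new_centres = [tuple(sum(c) / len(pts) for c in zip(*pts)) for pts in clusters]
    if new_centres == centres:
        break
    centres = new_centres

print(centres)     # ≈ [(1.25, 1.5), (3.9, 5.1)] — the Step_3 centres
print(clusters)    # {X(1), X(2)} and {X(3)..X(7)}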
• Exercise 3: Solve the previous example with 𝒄𝟏 = (3.0, 3.0) and 𝒄𝟐 = (4.0, 4.0)
• k-means is sensitive to outliers:
• Complexity of k-means: O(nkt)
• n: No of Subjects.
• k: No of Clusters.
• t: No of iterations.
• k << n
• t << n
• Evaluations of Clusters:
• Single Link
• Complete Link
• Average Link
• Centroid
• Ward's method
• Single Linkage - Nearest Neighbour
[Figure: two clusters r and s containing points a–g]
dist(r,s) = dist(b,e)
• Farthest Neighbour - Complete Linkage
[Figure: two clusters r and s containing points a–g]
dist(r,s) = dist(a,g)
• Average Linkage
[Figure: two clusters r and s containing points a–g]
• Centroid
[Figure: clusters r and s; m and n mark their centroids]
dist(r,s) = dist(m,n)
• Ward’s method: Minimizes the total within-cluster variance.
• Example 6-3: Bottom-Up approach, Single-Linkage, cut at distance 1.8
i 𝐗𝟏 𝐗𝟐
A 1.0 1.0
B 1.0 3.5
C 3.0 1.0
D 2.0 4.0
E 1.0 3.0
F 3.0 2.0
• Example 6-3: distance matrix
        A      B      C      D      E      F
A      0      2.5    2.0    3.16   2.0    2.24
B      -      0      3.24   1.12   0.5    2.5
C      -      -      0      3.16   2.83   1.0
D      -      -      -      0      1.41   2.24
E      -      -      -      -      0      1.41
F      -      -      -      -      -      0
• Example 6-3: merge B and E (the smallest distance, 0.5)
• Example 6-3: points A(1, 1), BE(1, 3.25), C(3, 1), D(2, 4), F(3, 2)
        A      BE     C      D      F
A      0      2.0    2.0    3.16   2.24
BE     -      0      2.83   1.12   1.41
C      -      -      0      3.16   1.0
D      -      -      -      0      2.24
F      -      -      -      -      0
• Example 6-3: merge C and F (distance 1.0)
• Example 6-3: points A(1, 1), BE(1, 3.25), CF(3, 1.5), D(2, 4)
        A      BE     CF     D
A      0      2.0    2.0    3.16
BE     -      0      1.41   1.12
CF     -      -      0      2.24
D      -      -      -      0
merge BE and D (distance 1.12)
• Example 6-3: remaining clusters A, BDE, CF
1.8 ≥ …
• Example 6-3: Dendrogram - AGglomerative NESting (AGNES)
[Dendrogram: distance axis 0.5, 1.0, 1.5, 2.0; leaf order A, B, E, D, C, F]
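A sketch of single-linkage agglomerative clustering on the Example 6-3 points using SciPy (an assumed dependency); distances are computed exactly from the coordinates, so a few values differ slightly from the rounded entries in the matrices above.

from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

points = [(1.0, 1.0), (1.0, 3.5), (3.0, 1.0), (2.0, 4.0), (1.0, 3.0), (3.0, 2.0)]
labels = ["A", "B", "C", "D", "E", "F"]

distances = pdist(points)                      # condensed Euclidean distance matrix
merges = linkage(distances, method="single")   # agglomerative, single (nearest-neighbour) linkage

clusters = fcluster(merges, t=1.8, criterion="distance")   # cut the dendrogram at distance 1.8
print(dict(zip(labels, clusters)))             # cluster id for each point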
• Exercise: Solve the previous example with Complete-Link, Average-Link, Centroid, and Ward's method.