Certificate
I hereby declare that the work presented in this report entitled “ Clustering Techniques in
Machine Learning” in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology in Computer Science and Engineering/Information Technology
submitted in the department of Computer Science & Engineering and Information Technology,
Jaypee University of Information Technology Waknaghat is an authentic record of my own
work carried out over a period from July 2022 to May 2023 under the supervision of Dr.
Monika Bharti Assistant Professor (SG).
The matter embodied in the report has not been submitted for the award of any other degree or
diploma.
Vatsal Singh, 191286
This is to certify that the above statement made by the candidate is true to the best of my
knowledge.
Dr. Monika Bharti
Assistant Ptofessor (SG)
Computer Science and Engineering/Information Technology
Dated:
i
Acknowledgement
The successful completion of any task would be incomplete without acknowledging the people
who made it possible and whose constant guidance and encouragement secured the success.
First of all I wish to acknowledge the benevolence of omnipotent God who gave me strength
and courage to overcome all obstacles and showed me the silver lining in the dark clouds with
the profound sense of gratitude and heartiest regard. I express my sincere feelings of
indebtedness to my guide Dr. Monika Bharti for their positive attitude, excellent guidance,
constant encouragement, keen interest, invaluable co-operation, generous attitude and above
all their blessings. She has been a source of inspiration for me.
Last but not the least I would like to express my heartfelt thanks to my parents and my friends
who with their thought provoking views, veracity and whole hearted cooperation helped in
doing this project.
Vatsal Singh
(191286)
ii
Abstract
In today’s era data generated by scientific applications and corporate environment has grown
rapidly not only in size but also in variety. This data collected is of huge amount and there is
a difficulty in collecting and analyzing such big data. Data mining is the technique in which
useful information and hidden relationship among data is extracted, but the traditional data
mining approaches cannot be directly used for big data due to their inherent complexity.
Data Clustering is one of the most important issues in data mining and machine learning.
Clustering is a task of discovering homogenous groups of the studied objects. Recently, many
researchers have a significant interest in developing clustering algorithms. The most problem
in clustering is that we do not have prior information knowledge about the given dataset.
Moreover, the choice of input parameters such as the number of clusters, number of nearest
neighbors and other factors in these algorithms make the clustering more challengeable topic.
Thus any incorrect choice of these parameters yields bad clustering results. Furthermore, these
algorithms suffer from unsatisfactory accuracy when the dataset contains clusters with
different complex shapes, densities, sizes, noise, and outliers. In this project, we propose a
new approach for unsupervised clustering task. Our approach consists of three phases of
operations. In the first phase we use the Genetic algorithm for finding first initial cluster
centroid. In genetic algorithm we use a crossover and mutation of the dataset. The second
phase, takes these initial cluster centroid produced by genetic algorithm for finding clusters
using K-means clustering. From the second phase we obtain a set of clusters of the given
dataset. Hence, the third phase considers these clusters for evaluation of cluster based on
Davies Bouldin Index. This new algorithm is named as Genetic K-means Algorithm (GKA).
We present experiments that provide the strength of our new proposed algorithm in
discovering clusters with different non-convex shapes, sizes, densities, noise, outliers and
higher accuracy. These experiments show the superiority of our proposed algorithm when
comparing with K-means algorithm.
iii
Table of Contents
Certificate ……………………………………………………………………….. i
Acknowledgement………………………………………………………………. ii
Abstract……………………………………………………….…………………. iii
Table of Contents……………………………………………….……………….. iv
List of Figures…………………………………………….……………………... vi
List of Table………………………………………….………………………….. vii
1. Introduction………………………………………………………................ 1
1.1 Introduction to Machine Learning…………………………………… 1
1.2 Unsupervised Learning……………………………………………… 1
1.2.1 Clustering……………………………………………….. 2
1.3 Types of Clustering ………………………………………………….. 3
1.3.1 Partitioning Methods……………………………………. 3
1.3.2 Hierarchical Clustering…………………………………. 3
1.3.3 Fuzzy Clustering………………………………………… 4
1.3.4 Model Based Clustering………………………………… 4
1.3.5 Density Based Clustering……………………………….. 4
1.4 Comparison of Clusters………………………………………………. 4
1.4.1 Euclidian Distance………………………………………. 5
1.4.2 Manhattan Distance……………………………………… 5
5
1.4.3 Edit Distance……………………………………………..
5
1.4.4 Hamming Distance……………………………………….
5
1.5 Techniques to find the optimum number of Clusters………………….
5
1.5.1 Elbow method…………………………………………….
6
1.5.2 Average silhouette method……………………………….
7
1.5.3 Gap Statistical method……………………………………
8
1.5.4 Davies Bouldin Index…………………………………….
8
1.5.5 Dunn Index……………………………………………….
8
1.6 Standard K-Means Algorithm…………………………………………
9
1.6.1 Flowchart of standard K-means………………………….
iv
1.6.2 Drawbacks of standard K means clustering……………… 9
1.6.3 Example of K-means…………………………………….. 10
1.7 Genetic Algorithm…………………………………………………….. 11
1.7.1 Flow chart of genetic algorithm…………………………. 14
2. Literature Survey…………………………………………………………… 15
2.1 Clustering……………………………………………………….…… 15
2.2 Partitioning Clustering………………………………………………. 15
2.3 Unsupervised learning technique……………………………………. 16
2.4 Conclusion…………………………………………………………… 18
3. System Design and Development…………………………………………… 19
3.1 Problem Statement…………………………………………………… 19
3.2 Research Gap………………………………………………………… 19
3.3 Objectives……………………………………………………………. 20
3.4 Research Methodology……………………………………………… 20
3.5 Proposed Hybrid Technique…………………………………………
20
3.6 Basic Genetic Algorithm…………………………………………..…
21
3.6.1 Application of genetic algorithm……………………………
23
3.6.2 Example of MaxOne using genetic algorithm………………
23
3.7 Proposed Algorithm……………………………………...……………
3.7.1 Example of proposed Algorithm……………………………. 23
4. Experiments and Result Analysis……………………………………………. 27
4.1 Implementation of Proposed Technique……………………………… 36
4.1.1 Iris Dataset for Implementation…………………………….. 36
4.2 Experimental Results…………………………………………………. 36
4.2.1 Confusion Matrix of K-means clustering……………………
38
4.2.2 Confusion Matrix of Genetic K-means clustering………….. 39
4.2.3 Test for performance Accuracy…………………………….. 40
4.2.4 Calculation of Intra cluster distance………………………… 41
5. Conclusion and Future Scope………………………………………………... 42
5.1 Conclusion……………………………………………………………. 45
5.2 Limitations……………………………………………………………. 45
5.3 Future Scope………………………………………………………….. 45
References………………………………………………………………………. 46
v
List of Figures
Figure No. Description Page No.
1.1 Inter and Intra Similarities of Cluster ............................................................. 2
1.2 Evaluation Graph of Elbow Method............................................................... 6
1.3 Evaluation Graph of Silhouette Method .......................................................... 7
1.4 Evaluation Graph of Gap Statistical Method................................................... 7
1.5 Flowchart of standard K-means Clustering ..................................................... 9
1.6 Clustering on Iris Dataset .............................................................................. 11
1.7 Genetic Algorithm Chromosomes and Population ........................................ 12
1.8 Execution Steps of Genetic Algorithm ......................................................... 14
3.1 Implementation Methodology for Clustering of dataset .............................. 21
3.2 Process of Genetic Algorithm ...................................................................... 22
3.3 Flowchart of Proposed Algorithm ................................................................ 27
4.1 Code of K-means Clustering on Iris Dataset................................................ 37
4.2 Code of Genetic K-means Clustering on Iris dataset ................................... 38
4.3 Confusion Matrix obtained from K-means Algorithm ................................ 39
4.4 Confusion Matrix obtained from Genetic K-means Algorithm................... 40
vi
List of Tables
Table No. Description Page No.
1.1 Crossover Operation on chromosome S1 and S3 ............................................... 13
1.2 Result of Crossover on Chromosome S1 and S3 ................................................ 13
1.3 Mutation Operation on Chromosome S1 and S3 ................................................ 13
1.4 Result of Mutation on Chromosome S1 and S3 ................................................. 13
3.1 Initialization of Chromosome for MaxOne Problem .......................................... 23
3.2 Arrangement of Chromosome based on Fitness value ....................................... 24
3.3 Crossover of chromosome S1 and S3 ................................................................. 25
3.4 Crossover Result of chromosome S1 and S3 ...................................................... 25
3.5 Crossover of chromosome S2 and S4 ................................................................. 25
3.6 Crossover Result of chromosome S2 and S4 ...................................................... 25
3.7 Crossover of chromosome S5 and S6 ................................................................. 26
3.8 Crossover Result of chromosome S5 and S6 ...................................................... 26
3.9 Mutation Result of chromosomes ....................................................................... 26
3.10 Iris dataset for Genetic K-means Clustering ..................................................... 31
3.11 Normalized dataset for Genetic K-means Clustering....................................... 32
3.12 Selected Row Indices and Chromosomes ........................................................ 33
3.13 Calculated Distance and Assignment of cluster ............................................... 34
3.14 Clusters obtained for Fifteen Records .............................................................. 34
4.1 Accuracy obtained from K-means and Genetic Algorithm ................................ 41
4.2 Intra Cluster distance using K-means algorithm ................................................ 42
4.3 Intra Cluster distance using Proposed algorithm ................................................ 42
4.4 Inter Cluster distance using K-means algorithm................................................. 43
4.5 Inter Cluster distance using Proposed algorithm ................................................ 44
vii