Classifying data is a basic requirement of many studies, and clustering is a widely used method for summarizing networks. Clustering methods fall into two categories: model-based approaches and algorithmic approaches. Since most clustering methods depend on their input parameters, it is important to evaluate the result of a clustering algorithm under different input parameters in order to choose the most appropriate one. Several clustering validity techniques, based on the inner and outer density of clusters, provide metrics for choosing the most appropriate clustering independently of the input parameters. Because previous methods depend on the input parameters, one challenge when facing large systems is that data arrive incrementally, which affects the final choice of the most appropriate clustering. Those methods take high density within a cluster and low density among different clusters as the measure for choosing the optimal clustering. This measure has a serious problem: not all data are available at the first stage. In this paper, we introduce an efficient measure that selects the clustering which occurs with the maximum number of repetitions across various initial values.
Lecture Notes in Computer Science, 2010
Many real-world systems can be modeled as networks or graphs. Clustering algorithms that help us to organize and understand these networks are usually referred to as graph-based clustering algorithms. Many algorithms exist in the literature for clustering network data. Evaluating the quality of these clustering algorithms is an important task addressed by different researchers. An important ingredient in evaluating these clustering techniques is the node-edge density of a cluster. In this paper, we argue that evaluation methods based on density are heavily biased toward networks with dense components, such as social networks, but are not well suited for data sets with other network topologies where the nodes are not densely connected. Examples of such data sets are transportation and Internet networks. We justify our hypothesis by presenting examples from real-world data sets. We present a new metric to evaluate the quality of a clustering algorithm that overcomes the limitations of existing cluster evaluation techniques. This new metric is based on the path length of the elements of a cluster and avoids judging quality based on cluster density. We show the effectiveness of the proposed metric by comparing its results with other existing evaluation methods on artificially generated and real-world data sets.
Progress in Artificial Intelligence, 2020
Clustering has an important role in the data mining field. However, there is a large variety of clustering algorithms, and each can generate quite different results depending on its input parameters. In the research literature, several cluster validity indices have been proposed to evaluate clustering results and find the partition that best fits the input dataset. However, these validity indices may fail to achieve satisfactory results, especially in the case of clusters with arbitrary shapes. In this paper, we propose a new cluster validity index for density-based, arbitrarily shaped clusters. Our new index is based on the density and connectivity relations extracted among the data points from a proximity graph, the Gabriel graph. Incorporating the connectivity and density relations allows achieving the best clustering results for clusters of any shape, size, or density. The experimental results on synthetic and real datasets, using the well-known neighborhood-based clustering (NBC) algorithm and the DBSCAN (density-based spatial clustering of applications with noise) algorithm, illustrate the superiority of the proposed index over some classical and recent indices and show its effectiveness for the evaluation of clustering algorithms and the selection of their appropriate parameters.
2020
We introduce graph-clustering quality measures based on comparisons of global, intra- and inter-cluster densities, an accompanying statistical significance test, and a step-by-step routine for clustering quality assessment. Our work is centered on the idea that well-clustered graphs will display a mean intra-cluster density that is higher than both the global density and the mean inter-cluster density. We do not rely on any generative model for the null model graph. Our measures are shown to meet the axioms of a good clustering quality function. They have an intuitive graph-theoretic interpretation, a formal statistical interpretation, and can be tested for significance. Empirical tests also show they are more responsive to graph structure, less likely to break down during numerical implementation, and less sensitive to uncertainty in connectivity than the commonly used measures.
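The core comparison this abstract describes (mean intra-cluster density above both global and mean inter-cluster density) can be sketched in a few lines of pure Python; the toy graph and clustering below are our own illustration, not the authors' implementation:

```python
from itertools import combinations

def density(n_nodes, n_edges):
    """Edge density: realized edges over possible edges."""
    possible = n_nodes * (n_nodes - 1) // 2
    return n_edges / possible if possible else 0.0

def density_summary(edges, clusters):
    """Global, mean intra-cluster, and mean inter-cluster densities
    for an undirected graph given as an edge list and a list of
    node sets (the clusters)."""
    nodes = set().union(*clusters)
    edge_set = {frozenset(e) for e in edges}
    global_d = density(len(nodes), len(edge_set))
    # Density inside each cluster, averaged over clusters.
    intra = []
    for c in clusters:
        within = sum(1 for e in edge_set if e <= c)
        intra.append(density(len(c), within))
    # Density between each pair of clusters, averaged over pairs.
    inter = []
    for a, b in combinations(clusters, 2):
        between = sum(1 for e in edge_set
                      if len(e & a) == 1 and len(e & b) == 1)
        possible = len(a) * len(b)
        inter.append(between / possible if possible else 0.0)
    return global_d, sum(intra) / len(intra), sum(inter) / len(inter)

# Two triangles joined by a single bridge edge: a clearly clustered graph.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
clusters = [{0, 1, 2}, {3, 4, 5}]
g, intra, inter = density_summary(edges, clusters)
```

On this toy graph the ordering the authors posit holds: mean intra-cluster density (1.0) exceeds the global density (7/15), which exceeds the mean inter-cluster density (1/9).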
International Journal of Computer Engineering in Research Trends, 2018
This paper presents a comparative study of clustering methods and the developments made at various times. Clustering is defined as unsupervised learning in which objects are grouped on the basis of some similarity inherent among them. There are different methods for clustering objects, such as hierarchical, partitional, grid, density-based, and model-based. Many algorithms exist that can solve the problem of clustering, but most of them are very sensitive to their input parameters. Therefore it is essential to evaluate the result of the clustering algorithm. It is difficult to define whether a clustering result is acceptable or not; thus several clustering validity techniques and indices have been developed. Cluster validity indices are used to measure the goodness of a clustering result compared with others created by other clustering algorithms, or by the same algorithm using different parameter values. The results of a clustering algorithm on the same data set can vary, since the input parameters of an algorithm can radically change its behaviour and execution. The intention of this paper is to describe the clustering process with an overview of different clustering methods and an analysis of clustering validity indices.
Proceedings of the 5th WSEAS …, 2006
Clustering is a process of discovering groups of objects such that the objects of the same group are similar, and the objects belonging to different groups are dissimilar. Several research fields deal with the problem of clustering: for example, pattern recognition, data mining, and machine learning. A number of algorithms exist that can solve the problem of clustering, but most of them are very sensitive to their input parameters. Therefore it is very important to evaluate the results of clustering algorithms. It is difficult to define whether a clustering result is acceptable or not; thus several clustering validity techniques and indices have been developed. This paper deals with the problem of clustering validity. The most commonly used validity indices are introduced and explained, and they are compared based on experimental results.
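Among the commonly used indices such a survey covers, the Dunn index is one of the simplest: the minimum distance between points of different clusters divided by the maximum cluster diameter, with larger values indicating compact, well-separated partitions. A minimal sketch, assuming Euclidean points and clusters of at least two points each (the toy clusters are our own):

```python
import math

def dunn_index(clusters):
    """Dunn index: minimum inter-cluster point distance divided by
    the maximum cluster diameter. Assumes every cluster has at
    least two points, so the maximum diameter is positive."""
    # Maximum diameter over all clusters.
    max_diam = max(
        max(math.dist(p, q) for p in c for q in c)
        for c in clusters
    )
    # Minimum distance between points of different clusters.
    min_sep = min(
        math.dist(p, q)
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
        for p in a for q in b
    )
    return min_sep / max_diam

# Two tight, well-separated clusters on a line.
clusters = [[(0.0, 0.0), (1.0, 0.0)], [(5.0, 0.0), (6.0, 0.0)]]
```

Here the closest cross-cluster pair is 4.0 apart and each cluster has diameter 1.0, so the index is 4.0.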
Applied Soft Computing, 2012
Identification of the correct number of clusters and the appropriate partitioning technique are some important considerations in clustering where several cluster validity indices, primarily utilizing the Euclidean distance, have been used in the literature. In this paper a new measure of connectivity is incorporated in the definitions of seven cluster validity indices namely, DB-index, Dunn-index, Generalized Dunn-index, PS-index, I-index, XB-index and SV-index, thereby yielding seven new cluster validity indices which are able to automatically detect clusters of any shape, size or convexity as long as they are well-separated. Here connectivity is measured using a novel approach following the concept of relative neighborhood graph. It is empirically established that incorporation of the property of connectivity significantly improves the capabilities of these indices in identifying the appropriate number of clusters. The well-known clustering techniques, single linkage clustering technique and K-means clustering technique are used as the underlying partitioning algorithms. Results on eight artificially generated and three real-life data sets show that connectivity based Dunn-index performs the best as compared to all the other six indices. Comparisons are made with the original versions of these seven cluster validity indices.
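The relative neighborhood graph mentioned here has a standard definition: an edge joins two points unless some third point is closer to both of them than they are to each other. A brute-force sketch of that construction (our own illustration, not the paper's implementation):

```python
import math
from itertools import combinations

def relative_neighborhood_graph(points):
    """Edge (i, j) is kept unless some third point k is strictly
    closer to both endpoints than they are to each other."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = math.dist(points[i], points[j])
        blocked = any(
            max(math.dist(points[i], points[k]),
                math.dist(points[j], points[k])) < d_ij
            for k in range(len(points)) if k not in (i, j)
        )
        if not blocked:
            edges.append((i, j))
    return edges

# Three collinear points: the middle one blocks the long edge.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
```

For these three points the graph keeps the two short edges and drops (0, 2), since point 1 is closer to both endpoints than they are to each other.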
Lecture Notes in Computer Science, 2011
Determining the number of clusters is a crucial problem in cluster analysis. Cluster validity measures are one way to try to find the optimum number of clusters, especially for prototype-based clustering. However, no validity measure turns out to work well in all cases. In this paper, we propose an approach to determine the number of clusters based on the minimum description length principle, which does not require high computational costs and is also applicable in the context of fuzzy clustering.
2015
Cluster analysis finds its place in many applications, especially data analysis, image processing, pattern recognition, and market research: grouping customers by purchasing pattern, classifying documents on the web for information discovery, detecting outliers, and serving as a tool to gain insight into the distribution of data and the characteristics of each cluster. Clustering thus has a presence in all these domains. This paper presents clustering validity measures that evaluate the results of clustering algorithms on data sets, covering the three main approaches of cluster validation: internal, external, and relative criteria. It also validates clusters using the Dunn index, the Davies-Bouldin index, and the Generalized Dunn index with the K-means and Chameleon algorithms.
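The Davies-Bouldin index mentioned above has a standard definition that is easy to sketch: for each cluster, take the worst ratio of summed within-cluster scatters to centroid separation, then average over clusters (smaller is better). The centroids, scatters, and toy clusters below are our own illustration:

```python
import math

def davies_bouldin(clusters):
    """Davies-Bouldin index over Euclidean point clusters.
    Lower values indicate compact, well-separated partitions."""
    def centroid(c):
        dim = len(c[0])
        return tuple(sum(p[d] for p in c) / len(c) for d in range(dim))

    def scatter(c, mu):
        # Mean distance of the cluster's points to its centroid.
        return sum(math.dist(p, mu) for p in c) / len(c)

    mus = [centroid(c) for c in clusters]
    s = [scatter(c, mu) for c, mu in zip(clusters, mus)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst similarity ratio of cluster i against any other cluster.
        total += max(
            (s[i] + s[j]) / math.dist(mus[i], mus[j])
            for j in range(k) if j != i
        )
    return total / k

# Two compact clusters with centroids 10 apart, scatter 1 each.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
```

For this toy data each cluster's scatter is 1.0 and the centroids are 10 apart, giving an index of (1 + 1) / 10 = 0.2.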
Neurocomputing, 2014
Many real-world networks have a structure of overlapping cohesive groups. In order to uncover this structure several clustering algorithms have been developed. In this paper, we focus on the evaluation of these algorithms. Quality measures are commonly used for this purpose and provide a means to assess the quality of a derived cluster structure. Currently, there are too few measures for graph clusterings with overlaps available that would enable a meaningful evaluation, even though many well studied crisp quality measures exist. In order to expand the pool of overlapping measures we propose three methods to adapt existing crisp quality measures so that they can handle graph overlaps appropriately. We demonstrate our methods on the well known measures Density, Modularity and Conductance. We also propose an enhancement of an existing modularity measure for networks with overlapping structure. We analyse the proposed quality indices using experiments on artificial graphs that possess overlapping structure. For this evaluation, we apply a graph generation model to create clustered graphs with overlaps that are similar to real-world networks, i.e., their node degree and cluster size distribution follow a power law.
International Journal of Information Technology and Computer Science
The main task of any clustering algorithm is to produce compact and well-separated clusters, although perfectly separated and compact clusters are rarely achieved in practice. Different types of clustering validation are used to evaluate the quality of the clusters a clustering produces, and these measures are central to its success. Different clusterings require different validity measures; for example, unsupervised algorithms require different evaluation measures than supervised ones. Clustering validity measures fall into two categories: external and internal validation. The main difference is that external validity measures use information external to the dataset, while internal validity measures use only the dataset itself. A well-known external validation measure is entropy, which measures the purity of the clusters using the given class labels. Internal measures validate the quality of the clustering without using any external information. External measures require the true number of clusters in advance, so they are used mainly to select the optimal clustering algorithm for a specific type of dataset. Internal validation measures are used not only to select the best clustering algorithm but also to select the optimal number of clusters. External validity measures are often impractical because predefined class labels are not available in many applications; in those cases, internal validation measures are the only option. All clustering validity measures currently in use are time-consuming, requiring additional time for their calculations, and none can be applied while the clustering process is still running.
This paper surveys existing and improved cluster validity measures. It then proposes time-efficient, optimized cluster validity measures that use the concepts of cluster representatives and random sampling. The work proposes optimized measures for cluster compactness, separation, and cluster validity. These three measures are simple and more time-efficient than the existing cluster validity measures, and they can be used to monitor clustering algorithms on large data while the clustering process is still running.
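The entropy measure described above (purity of clusters against given class labels) has a standard form: the size-weighted average, over clusters, of the Shannon entropy of the label distribution inside each cluster, with 0 meaning every cluster is pure. A minimal sketch with invented toy labels:

```python
import math
from collections import Counter

def clustering_entropy(clusters_labels):
    """Size-weighted average entropy of the class labels inside
    each cluster; 0 means every cluster is pure."""
    n = sum(len(c) for c in clusters_labels)
    total = 0.0
    for labels in clusters_labels:
        counts = Counter(labels)
        # Shannon entropy of this cluster's label distribution.
        h = -sum((m / len(labels)) * math.log2(m / len(labels))
                 for m in counts.values())
        total += (len(labels) / n) * h
    return total

pure = [["a", "a"], ["b", "b"]]    # each cluster holds one class
mixed = [["a", "b"], ["a", "b"]]   # each cluster splits 50/50
```

The pure clustering scores 0.0 and the fully mixed one scores 1.0 bit, matching the intuition that lower entropy means purer clusters.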
arXiv (Cornell University), 2019
Measuring graph clustering quality remains an open problem. To address it, we introduce quality measures based on comparisons of global, intra- and inter-cluster densities, an accompanying statistical significance test, and a step-by-step routine for clustering quality assessment. In doing so, we also offer our own definition of good clustering, as well as necessary and sufficient conditions that characterize it. Our null model does not rely on any generative model for the graph, unlike modularity, which uses the configuration model as a null. Our measures are also shown to meet the axioms of a good clustering quality function, unlike the very commonly used modularity measure. They also have an intuitive graph-theoretic interpretation and a formal statistical interpretation, and can be easily tested for significance. Our work is centered on the idea that well-clustered graphs will display a mean intra-cluster density that is higher than both the global density and the mean inter-cluster density. We develop tests to verify the existence of such a cluster structure, and we empirically explore the behavior of our measures.
Pattern Recognition Letters, 2006
Cluster validation is a major issue in cluster analysis. Many existing validity indices do not perform well when clusters overlap or there is significant variation in their covariance structure. The contribution of this paper is twofold. First, we propose a new validity index for fuzzy clustering. Second, we present a new approach for the objective evaluation of validity indices and clustering algorithms. Our validity index makes use of the covariance structure of clusters, while the evaluation approach utilizes a new concept of overlap rate that gives a formal measure of the difficulty of distinguishing between overlapping clusters. We have carried out experimental studies using data sets containing clusters of different shapes and densities and various overlap rates, in order to show how validity indices behave when clusters become less and less separable. Finally, the effectiveness of the new validity index is also demonstrated on a number of real-life data sets.
Proceedings of the 2009 3rd International Conference on Research Challenges in Information Science, RCIS 2009, 2009
Artificial graphs are commonly used for the evaluation of community mining and clustering algorithms. Each artificial graph is assigned a pre-specified clustering, which is compared to clustering solutions obtained by the algorithms under evaluation. Hence, the pre-specified clustering should comply with specifications that are assumed to delimit a good clustering. However, existing construction processes for artificial graphs do not set explicit specifications for the pre-specified clustering. We call these graphs, randomly clustered graphs. Here, we introduce a new class of benchmark graphs which are clustered according to explicit specifications. We call them optimally clustered graphs. We present the basic properties of optimally clustered graphs and propose algorithms for their construction. Experimentally, we compare two community mining algorithms using both randomly and optimally clustered graphs. Results of this evaluation reveal interesting insights both for the algorithms and the artificial graphs.
2008
Graph clustering methods such as spectral clustering are defined for general weighted graphs. In machine learning, however, data often is not given in form of a graph, but in terms of similarity (or distance) values between points. In this case, first a neighborhood graph is constructed using the similarities between the points and then a graph clustering algorithm is applied to this graph. In this paper we investigate the influence of the construction of the similarity graph on the clustering results. We first study the convergence of graph clustering criteria such as the normalized cut (Ncut) as the sample size tends to infinity. We find that the limit expressions are different for different types of graph, for example the r-neighborhood graph or the k-nearest neighbor graph. In plain words: Ncut on a kNN graph does something systematically different than Ncut on an r-neighborhood graph! This finding shows that graph clustering criteria cannot be studied independently of the kind of graph they are applied to. We also provide examples which show that these differences can be observed for toy and real data already for rather small sample sizes.
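The two graph constructions compared in this paper can be sketched directly; the toy points below are our own, chosen so that the two graphs differ (the symmetrized kNN graph links the isolated point to the rest, while the r-neighborhood graph leaves it disconnected):

```python
import math

def r_neighborhood_graph(points, r):
    """Connect every pair of points at distance <= r (undirected)."""
    n = len(points)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(points[i], points[j]) <= r}

def knn_graph(points, k):
    """Connect each point to its k nearest neighbours, then
    symmetrize (the usual undirected kNN graph)."""
    n = len(points)
    edges = set()
    for i in range(n):
        by_dist = sorted((j for j in range(n) if j != i),
                         key=lambda j: math.dist(points[i], points[j]))
        for j in by_dist[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

# Three nearby points plus one far-away point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
```

With r = 1.5 the r-neighborhood graph contains only the two short edges and isolates point 3, whereas the 1-NN graph adds the edge (2, 3), since every point, however remote, still has a nearest neighbour. This is the kind of structural difference that makes criteria like Ncut behave differently on the two graph types.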
2011
In this paper a new criterion for cluster validation is proposed. This new cluster validation criterion is used to approximate the goodness of a cluster. The clusters that satisfy a threshold of this measure are selected to participate in the clustering ensemble. For combining the chosen clusters, a co-association-based consensus function is applied. Since the Evidence Accumulation Clustering method cannot derive the co-association matrix from a subset of clusters, a new EAC-based method, called Extended EAC (EEAC), is applied to construct the co-association matrix from the subset of clusters. Employing this new cluster validation criterion, the obtained ensemble is evaluated on some well-known, standard data sets. The empirical studies show promising results for the ensemble obtained using the proposed criterion compared with the ensemble obtained using the standard cluster validation criterion.
Clustering attempts to discover significant groups present in a data set. It is an unsupervised process, and it is difficult to define when a clustering result is acceptable; thus, several clustering validity indices have been developed to evaluate the quality of clustering algorithms' results. In this paper, we propose to improve the quality of a clustering algorithm called "CLUSTER" by using a validity index. CLUSTER is an automatic clustering technique able to identify situations where data do not have any natural clusters. However, CLUSTER has some drawbacks: in several cases, it generates small and not well-separated clusters. Extending CLUSTER with validity indices overcomes these drawbacks. We propose four extensions of CLUSTER with four validity indices: Dunn, DunnRNG, DB, and DB*. These extensions provide an adequate number of clusters. The experimental results on real data show that these algorithms improve the clustering quality of CLUSTER. In part...
Pattern Analysis and Applications, 2009
In this paper, new measures—called clustering performance measures (CPMs)—for assessing the reliability of a clustering algorithm are proposed. These CPMs are defined using a validation measure, which determines how well the algorithm works with a given set of parameter values, and a repeatability measure, which is used for studying the stability of the clustering solutions and has the ability to
Lecture Notes in Computer Science, 2011
Graph clustering, the process of discovering groups of similar vertices in a graph, is a very interesting area of study, with applications in many different scenarios. One of the most important aspects of graph clustering is the evaluation of cluster quality, which is important not only to measure the effectiveness of clustering algorithms, but also to give insights into the dynamics of relationships in a given network. Many quality evaluation metrics for graph clustering have been proposed in the literature, but there is no consensus on how they compare to each other and how well they perform on different kinds of graphs. In this work we study five major graph clustering quality metrics in terms of their formal biases and their behavior when applied to clusters found by four implementations of classic graph clustering algorithms on five large, real-world graphs. Our results show that those popular quality metrics have strong biases toward incorrectly awarding good scores to some kinds of clusters, especially in larger networks. They also indicate that currently used clustering algorithms and quality metrics do not behave as expected when cluster structures differ from the more traditional, clique-like ones.
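Conductance, one of the quality metrics such studies typically cover, has a standard definition: the number of boundary edges of a cluster divided by the smaller of the cluster's volume (sum of degrees) and the rest of the graph's volume, with lower values indicating a better cut. A minimal sketch on an invented toy graph:

```python
def conductance(edges, cluster):
    """Conductance of a vertex set S in an undirected graph:
    boundary edges over min(vol(S), vol(V minus S))."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    # Edges with exactly one endpoint inside the cluster.
    cut = sum(1 for u, v in edges if (u in cluster) != (v in cluster))
    vol_s = sum(deg[u] for u in cluster)
    vol_rest = sum(d for u, d in deg.items() if u not in cluster)
    return cut / min(vol_s, vol_rest)

# Two triangles joined by a single bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
```

Taking one triangle as the cluster, the cut is the single bridge edge and both sides have volume 7, so the conductance is 1/7: a low value, as expected for a clearly clustered graph.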