2010, Lecture Notes in Computer Science
Work on clustering combination has shown that clustering combination methods typically outperform single runs of clustering algorithms. While there is much work reported in the literature on validating data partitions produced by the traditional clustering algorithms, little has been done in order to validate data partitions produced by clustering combination methods. We propose to assess the quality of a consensus partition using a pattern pairwise similarity induced from the set of data partitions that constitutes the clustering ensemble. A new validity index based on the likelihood of the data set given a data partition, and three modified versions of well-known clustering validity indices are proposed. The validity measures on the original, clustering ensemble, and similarity spaces are analysed and compared based on experimental results on several synthetic and real data sets.
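The pairwise similarity induced by a clustering ensemble, as described above, can be sketched as a co-association matrix: for each pair of points, the fraction of partitions in which they fall in the same cluster. A minimal sketch in Python — the function name and toy partitions are illustrative, not taken from the paper:

```python
import numpy as np

def co_association(partitions):
    """Pairwise similarity: fraction of partitions in which two points co-cluster."""
    partitions = np.asarray(partitions)   # shape (m, n): m partitions of n points
    m, n = partitions.shape
    S = np.zeros((n, n))
    for labels in partitions:
        # add 1 for every pair assigned the same label in this partition
        S += (labels[:, None] == labels[None, :])
    return S / m

# Two partitions of 4 points
P = [[0, 0, 1, 1],
     [0, 0, 0, 1]]
S = co_association(P)
# points 0 and 1 co-cluster in both partitions -> S[0, 1] == 1.0
```

Validity indices can then be evaluated on this similarity space instead of the original feature space, which is the comparison the abstract describes.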
2011
In this paper a new criterion for cluster validation is proposed. This criterion approximates the goodness of a cluster. Clusters that satisfy a threshold of this measure are selected to participate in the clustering ensemble. To combine the chosen clusters, a co-association-based consensus function is applied. Since the Evidence Accumulation Clustering (EAC) method cannot derive the co-association matrix from a subset of clusters, a new EAC-based method, called Extended EAC (EEAC), is applied to construct the co-association matrix from the subset of clusters. Employing this new cluster validation criterion, the obtained ensemble is evaluated on several well-known standard data sets. The empirical studies show promising results for the ensemble obtained using the proposed criterion compared with the ensemble obtained using the standard cluster validation criterion.
2008 Proceedings of the Tenth Workshop on Algorithm Engineering and Experiments (ALENEX), 2008
Consensus clustering is the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. Cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be NP-complete. A number of heuristics have been proposed as approximate solutions, some with performance guarantees. In practice, the problem is apparently easy to approximate, but guidance is necessary as to which heuristic to use depending on the number of elements and clusterings given. We have implemented a number of heuristics for the consensus clustering problem, and here we compare their performance, independent of data size, in terms of efficacy and efficiency, on both simulated and real data sets. We find that, based on the underlying algorithms and their behavior in practice, the heuristics can be categorized into two distinct groups, with ramifications as to which one to use in a given situation, and that a hybrid solution is the best bet in general. We have also developed a refined consensus clustering heuristic for occasions when the given clusterings may be too disparate and their consensus may not be representative of any one of them, and we show that in practice the refined consensus clusterings can be much superior to the general consensus clustering.
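The median-partition objective mentioned above minimizes the total pairwise disagreement (often called the Mirkin distance) between a candidate partition and the given clusterings. A hedged sketch of that objective, assuming label-vector partitions (function names are illustrative):

```python
import numpy as np

def mirkin_distance(a, b):
    """Count point pairs on which two partitions disagree:
    together in one partition but apart in the other."""
    a, b = np.asarray(a), np.asarray(b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    # count each unordered pair once (strict upper triangle)
    return int(np.triu(same_a != same_b, k=1).sum())

def median_partition_cost(candidate, ensemble):
    """Objective minimized by consensus clustering (median partition)."""
    return sum(mirkin_distance(candidate, p) for p in ensemble)

d = mirkin_distance([0, 0, 1, 1], [0, 0, 0, 1])  # the two partitions disagree on 3 pairs
```

Note the distance is invariant to label renaming, since it compares only whether pairs are together or apart; the heuristics the paper benchmarks are different strategies for approximately minimizing this cost.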
2017
Measuring the quality of data partitions is essential to the success of clustering applications. Many different validity indices have been proposed in the literature, but choosing the appropriate index for evaluating the results of a particular clustering algorithm remains a challenge. Clustering results can be evaluated using different indices based on external or internal criteria. An external criterion requires a partitioning of the data previously defined for comparison with the clustering results, while an internal criterion evaluates clustering results considering only the data properties. In a previous work we proposed a method for selecting the most suitable internal cluster validity index applied to the results of partitioning clustering algorithms. In this paper we extend our previous work, validating the method for density-based clustering algorithms. We have looked into the relationships between internal and external indices, relating them through linear regression.
International Journal of Computer Engineering in Research Trends, 2018
This paper presents a comparative study of clustering methods and the developments made at various times. Clustering is defined as unsupervised learning where objects are grouped on the basis of some similarity inherent among them. There are different methods for clustering objects, such as hierarchical, partitioning, grid, density-based and model-based. Many algorithms exist that can solve the problem of clustering, but most of them are very sensitive to their input parameters. Therefore it is essential to evaluate the result of a clustering algorithm. It is difficult to define whether a clustering result is acceptable or not; thus several clustering validity techniques and indices have been developed. Cluster validity indices are used for measuring the goodness of a clustering result compared to other results created by other clustering algorithms, or by the same algorithm using different parameter values. The results of a clustering algorithm on the same data set can vary, as the input parameters can drastically modify the behaviour and execution of the algorithm. The intention of this paper is to describe the clustering process with an overview of different clustering methods and an analysis of clustering validity indices.
Pattern Analysis and Applications, 2009
In this paper, new measures—called clustering performance measures (CPMs)—for assessing the reliability of a clustering algorithm are proposed. These CPMs are defined using a validation measure, which determines how well the algorithm works with a given set of parameter values, and a repeatability measure, which is used for studying the stability of the clustering solutions and has the ability to…
2002
The concept of cluster stability is introduced as a means for assessing the validity of data partitionings found by clustering algorithms. It allows us to explicitly quantify the quality of a clustering solution, without being dependent on external information. The principle of maximizing the cluster stability can be interpreted as choosing the most self-consistent data partitioning. We present an empirical estimator for the theoretically derived stability index, based on imitating independent sample-sets by way of resampling. Experiments on both toy-examples and real-world problems effectively demonstrate that the proposed validation principle is highly suited for model selection.
Intelligent Data Analysis, 2009
Clustering quality or validation indices allow the evaluation of the quality of clustering in order to support the selection of a specific partition or clustering structure in its natural unsupervised environment, where the real solution is unknown or not available. In this paper, we investigate the use of quality indices mostly based on the concepts of clusters' compactness and separation, for the evaluation of clustering results (partitions in particular). This work intends to offer a general perspective regarding the appropriate use of quality indices for the purpose of clustering evaluation. After presenting some commonly used indices, as well as indices recently proposed in the literature, key issues regarding the practical use of quality indices are addressed. A general methodological approach is presented which considers the identification of appropriate indices thresholds. This general approach is compared with the simple use of quality indices for evaluating a clustering solution.
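The compactness-and-separation idea discussed above can be illustrated with a generic ratio index: mean within-cluster scatter divided by the minimum distance between centroids, lower being better. This is a hedged sketch of the general concept, not any specific index from the paper; the function name and toy data are illustrative:

```python
import numpy as np

def compactness_separation(X, labels):
    """Generic quality score: mean within-cluster scatter divided by the
    minimum distance between cluster centroids (lower is better)."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    ks = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    # compactness: average distance of points to their own centroid
    compact = np.mean([np.linalg.norm(X[labels == k] - c, axis=1).mean()
                       for k, c in zip(ks, centroids)])
    # separation: smallest distance between any two centroids
    sep = min(np.linalg.norm(centroids[i] - centroids[j])
              for i in range(len(ks)) for j in range(i + 1, len(ks)))
    return compact / sep

# two tight clusters far apart -> small (good) score
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
labels = [0, 0, 1, 1]
score = compactness_separation(X, labels)  # 0.5 scatter / 10 separation = 0.05
```

Thresholding such an index, as the methodological approach in the abstract suggests, requires knowing what range of values counts as acceptable for the data at hand.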
2020
A key issue in cluster analysis is the choice of an appropriate clustering method and the determination of the best number of clusters. Different clusterings are optimal on the same data set according to different criteria, and the choice of such criteria depends on the context and aim of clustering. Therefore, researchers need to consider what data analytic characteristics the clusters they are aiming at are supposed to have, among others within-cluster homogeneity, between-clusters separation, and stability. Here, a set of internal clustering validity indexes measuring different aspects of clustering quality is proposed, including some indexes from the literature. Users can choose the indexes that are relevant in the application at hand. In order to measure the overall quality of a clustering (for comparing clusterings from different methods and/or different numbers of clusters), the index values are calibrated for aggregation. Calibration is relative to a set of random clusterings.
Pattern Recognition Letters, 2006
Cluster validation is a major issue in cluster analysis. Many existing validity indices do not perform well when clusters overlap or there is significant variation in their covariance structure. The contribution of this paper is twofold. First, we propose a new validity index for fuzzy clustering. Second, we present a new approach for the objective evaluation of validity indices and clustering algorithms. Our validity index makes use of the covariance structure of clusters, while the evaluation approach utilizes a new concept of overlap rate that gives a formal measure of the difficulty of distinguishing between overlapping clusters. We have carried out experimental studies using data sets containing clusters of different shapes and densities and various overlap rates, in order to show how validity indices behave when clusters become less and less separable. Finally, the effectiveness of the new validity index is also demonstrated on a number of real-life data sets.
Neural Computation, 2004
Data clustering describes a set of frequently employed techniques in exploratory data analysis to extract "natural" group structure in data. Such groupings need to be validated to separate the signal in the data from spurious structure. In this context, finding an appropriate number of clusters is a particularly important model selection question. We introduce a measure of cluster stability to assess the validity of a cluster model. This stability measure quantifies the reproducibility of clustering solutions on a second sample, and it can be interpreted as a classification risk with regard to class labels produced by a clustering algorithm. The preferred number of clusters is determined by minimizing this classification risk as a function of the number of clusters. Convincing results are achieved on simulated as well as gene expression data sets. Comparisons to other methods demonstrate the competitive performance of our method and its suitability as a general validation tool for clustering solutions in real-world problems.