2015, Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management
Stability has been considered an important property for evaluating clustering solutions. Nevertheless, there are no conclusive studies on the relationship between this property and the ability to recover the clusters inherent in the data (the "ground truth"). This study focuses on that relationship, drawing on experiments with synthetic data generated under diverse scenarios (controlling the relevant factors) and on experiments with real data sets. Stability is evaluated using a weighted cross-validation procedure. Indices of agreement (corrected for chance agreement) are used both to assess stability and for external validation. The results reveal a perspective not previously reported in the literature. Although there is a clear relationship between stability and external validity when a broad range of scenarios is considered, the within-scenario conclusions deserve special attention: faced with a specific clustering problem (as we are in practice), there is no significant relationship between clustering stability and the ability to recover the data clusters.
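The chance-corrected agreement indices this abstract refers to are typified by the Adjusted Rand Index, which scores two partitions of the same points by pair-counting and subtracts the agreement expected by chance. A minimal pure-Python sketch (an illustration of the general index, not the authors' code):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index: pair-counting agreement between two
    partitions of the same points, corrected for chance agreement.
    1.0 = identical partitions (up to relabeling), ~0 = chance level."""
    n = len(labels_a)
    # Contingency counts: how many points fall in each (cluster_a, cluster_b) cell.
    contingency = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total_pairs = comb(n, 2)
    expected = sum_a * sum_b / total_pairs   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                # degenerate partitions
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The same function can serve both uses mentioned in the abstract: stability (agreement between partitions of resampled data) and external validation (agreement with ground-truth labels).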
Neural Computation, 2004
Data clustering describes a set of frequently employed techniques in exploratory data analysis to extract "natural" group structure in data. Such groupings need to be validated to separate the signal in the data from spurious structure. In this context, finding an appropriate number of clusters is a particularly important model selection question. We introduce a measure of cluster stability to assess the validity of a cluster model. This stability measure quantifies the reproducibility of clustering solutions on a second sample, and it can be interpreted as a classification risk with regard to class labels produced by a clustering algorithm. The preferred number of clusters is determined by minimizing this classification risk as a function of the number of clusters. Convincing results are achieved on simulated as well as gene expression data sets. Comparisons to other methods demonstrate the competitive performance of our method and its suitability as a general validation tool for clustering solutions in real-world problems.
2014
Many different clustering algorithms have been developed to detect structure in data sets in an unsupervised way. As user intervention for these methods should be kept to a minimum, robustness with respect to user-defined initial conditions is of crucial importance. In a previous study, we have shown how the robustness of a hard clustering algorithm can be increased by the removal of what we called unstable data elements. Although robustness is a main characteristic of any clustering tool, the most important feature is still the quality of the produced clusterings. This paper experimentally investigates how the removal of unstable data elements from a data set affects the quality of the produced clusterings, as measured by the mutual information index, using three biological gene expression data sets. Keywords: hard clustering; cluster quality; unstable elements; mutual information context; microarray data.
Lecture Notes in Computer Science, 2010
Work on clustering combination has shown that clustering combination methods typically outperform single runs of clustering algorithms. While there is much work reported in the literature on validating data partitions produced by the traditional clustering algorithms, little has been done in order to validate data partitions produced by clustering combination methods. We propose to assess the quality of a consensus partition using a pattern pairwise similarity induced from the set of data partitions that constitutes the clustering ensemble. A new validity index based on the likelihood of the data set given a data partition, and three modified versions of well-known clustering validity indices are proposed. The validity measures on the original, clustering ensemble, and similarity spaces are analysed and compared based on experimental results on several synthetic and real data sets.
2002
The concept of cluster stability is introduced as a means for assessing the validity of data partitionings found by clustering algorithms. It allows us to explicitly quantify the quality of a clustering solution, without being dependent on external information. The principle of maximizing the cluster stability can be interpreted as choosing the most self-consistent data partitioning. We present an empirical estimator for the theoretically derived stability index, based on imitating independent sample-sets by way of resampling. Experiments on both toy-examples and real-world problems effectively demonstrate that the proposed validation principle is highly suited for model selection.
IEEE Symposium on Information Visualization, 2001. INFOVIS 2001., 2001
A clustering and ordination algorithm suitable for mining extremely large databases, including those produced by microarray expression studies, is described and analyzed for stability. Data from a yeast cell cycle experiment with 6000 genes and 18 experimental measurements per gene are used to test this algorithm under practical conditions. The process of assigning database objects to an X,Y coordinate, ordination, is shown to be stable with respect to random starting conditions and with respect to minor perturbations in the starting similarity estimates. Careful analysis of the way clusters typically co-locate, versus their occasional large displacements under different starting conditions, is shown to be useful in interpreting the data. This extra stability information is lost when only a single cluster is reported, which is currently the accepted practice. However, it is believed that the approaches presented here should become a standard part of best practices in analyzing computer clustering of large data collections.
Clustering attempts to discover significant groups present in a data set. It is an unsupervised process. It is difficult to define when a clustering result is acceptable. Thus, several clustering validity indices are developed to evaluate the quality of clustering algorithms results. In this paper, we propose to improve the quality of a clustering algorithm called "CLUSTER" by using a validity index. CLUSTER is an automatic clustering technique. It is able to identify situations where data do not have any natural clusters. However, CLUSTER has some drawbacks. In several cases, CLUSTER generates small and not well-separated clusters. The extension of CLUSTER with validity indices overcomes these drawbacks. We propose four extensions of CLUSTER with four validity indices Dunn, DunnRNG, DB, and DB * . These extensions provide an adequate number of clusters. The experimental results on real data show that these algorithms improve the clustering quality of CLUSTER. In part...
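The Dunn index named above is the ratio of the smallest between-cluster distance to the largest within-cluster diameter (higher is better). A minimal sketch for points in Euclidean space, assuming single-linkage separation and full-diameter compactness (the classical definition, not the paper's DunnRNG variant):

```python
from itertools import combinations
from math import dist  # Python 3.8+

def dunn_index(clusters):
    """Dunn index for a hard partition.
    `clusters` is a list of clusters, each a list of coordinate tuples.
    Returns min inter-cluster distance / max intra-cluster diameter."""
    # Largest diameter: maximum pairwise distance inside any cluster.
    max_diam = max(
        max((dist(p, q) for p, q in combinations(c, 2)), default=0.0)
        for c in clusters
    )
    if max_diam == 0.0:          # all clusters are singletons or duplicates
        return float("inf")
    # Smallest separation: minimum distance between points of distinct clusters.
    min_sep = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a for q in b
    )
    return min_sep / max_diam
```

A Dunn value well above 1 indicates clusters that are farther apart than they are wide, which is exactly the small, poorly separated cluster pathology the paper's extensions aim to avoid.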
International Journal of Information Technology and Computer Science
The main task of any clustering algorithm is to produce compact and well-separated clusters, although perfectly separated and compact clusters can rarely be achieved in practice. Different types of clustering validation are used to evaluate the quality of the clusters a clustering produces, and these measures are central to its success. Different clustering settings require different types of validity measures; for example, unsupervised algorithms require different evaluation measures than supervised algorithms. Clustering validity measures fall into two categories: external and internal validation. The main difference is that external validity measures use information external to the data set, while internal validity measures use only information internal to it. A well-known example of an external validation measure is Entropy, which measures the purity of the clusters using the given class labels. Internal measures validate the quality of the clustering without using any external information. External measures require an accurate value of the number of clusters in advance; therefore, they are used mainly for selecting the optimal clustering algorithm for a specific type of data set. Internal validation measures are used not only to select the best clustering algorithm but also to select the optimal number of clusters. External validity measures depend on predefined class labels, which are often unavailable in many applications; for these reasons, internal validation measures are the only option when no external information is available. The validity measures currently in use are time-consuming and require substantial additional computation, and no existing validity measure can be applied while the clustering process is still running.
This paper surveys existing and improved cluster validity measures and then proposes time-efficient, optimized alternatives. These measures use the concepts of cluster representatives and random sampling. The work proposes optimized measures for cluster compactness, separation, and overall cluster validity. These three measures are simpler and more time-efficient than existing cluster validity measures and can be used to monitor clustering algorithms on large data while the clustering process is still running.
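The Entropy measure cited in this abstract as the standard external criterion is the size-weighted average, over clusters, of the entropy of the class-label distribution within each cluster (0.0 means every cluster is pure). A minimal sketch, assuming hard cluster assignments and known class labels:

```python
from collections import Counter
from math import log2

def clustering_entropy(cluster_ids, class_labels):
    """Size-weighted average entropy of class labels within each cluster.
    0.0 = every cluster contains a single class; higher = more mixing."""
    n = len(cluster_ids)
    # Group class labels by the cluster each point was assigned to.
    clusters = {}
    for cid, cls in zip(cluster_ids, class_labels):
        clusters.setdefault(cid, []).append(cls)
    total = 0.0
    for members in clusters.values():
        size = len(members)
        counts = Counter(members)
        # Shannon entropy (in bits) of the class distribution in this cluster.
        h = -sum((c / size) * log2(c / size) for c in counts.values())
        total += (size / n) * h
    return total
```

Note the cost the abstract objects to: this scans every point and its label, which is why the paper proposes representative- and sampling-based alternatives.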
SMC'03 Conference Proceedings. 2003 IEEE International Conference on Systems, Man and Cybernetics. Conference Theme - System Security and Assurance (Cat. No.03CH37483), 2003
Cluster validity investigates whether generated clusters are true clusters or due to chance, usually by means of subsampling stability analysis. Related to this problem is estimating the true number of clusters in a given dataset. A number of methods for both purposes are described in the literature. In this paper, we propose three methods for estimating confidence in the validity of a clustering result. The first method validates the clustering result by employing supervised classifiers: the dataset is divided into training and test sets, and the accuracy of the classifier is evaluated on the test set. This method computes confidence in the generalization capability of the clustering. The second method is based on the fact that if a clustering is valid, then each of its subsets should be valid as well. The third method is similar to the second but takes the dual approach, i.e., each cluster is expected to be stable and compact. Confidence is estimated by repeating the process a number of times on subsamples. Experimental results illustrate the effectiveness of the proposed methods.
Pattern Analysis and Applications, 2009
In this paper, new measures—called clustering performance measures (CPMs)—for assessing the reliability of a clustering algorithm are proposed. These CPMs are defined using a validation measure, which determines how well the algorithm works with a given set of parameter values, and a repeatability measure, which is used for studying the stability of the clustering solutions and has the ability to
Neural Processing Letters, 2006
Cluster validity has been widely used to evaluate the fitness of partitions produced by clustering algorithms. This paper presents a new validity index, called the Vapnik–Chervonenkis-bound (VB) index, for data clustering. It is estimated based on the structural risk minimization (SRM) principle, which optimizes the bound simultaneously over both the distortion function (empirical risk) and the VC-dimension (model complexity). The smallest bound of the guaranteed risk, achieved at some appropriate cluster number, validates the best description of the data structure. We use the deterministic annealing (DA) algorithm as the underlying clustering technique to produce the partitions. Five numerical examples and two real data sets are used to illustrate the use of VB as a validity index. Its effectiveness is compared to several popular cluster validity indices. The results of the comparative study show that the proposed VB index is highly effective at producing a good estimate of the cluster number and, in addition, provides a new approach to cluster validity from the viewpoint of statistical learning theory.
Statistical Analysis and Data Mining, 2013
We present a technique for evaluating and comparing how clusterings reveal structure inherent in the data set. Our technique is based on a criterion evaluating how much point-to-cluster distances may be perturbed without affecting the membership of the points. Although similar to some existing perturbation methods, our approach distinguishes itself in five ways. First, the strength of the perturbations is indexed by a prior distribution controlling how close to boundary regions a point may be before it is considered unstable. Second, our approach is exact in that we integrate over all the perturbations; in practice, this can be done efficiently for well-chosen prior distributions. Third, we provide a rigorous theoretical treatment of the approach, showing that it is consistent for estimating the correct number of clusters. Fourth, it yields a detailed picture of the behavior and structure of the clustering. Finally, it is computationally tractable and easy to use, requiring only a point-to-cluster distance matrix as input. In a simulation study, we show that it outperforms several existing methods in terms of recovering the correct number of clusters. We also illustrate the technique in three real data sets.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2008
Cluster validation to determine the right number of clusters is an important issue in clustering processes. In this work, a strategy to address the problem of cluster validation based on cluster stability properties is introduced. The proposed stability index is based on information measures, taking into account the variation in some of these measures due to the variability in clustering solutions produced by different sample sets of the same problem. The experiments carried out on synthetic and real databases show the effectiveness of the cluster stability index when the clustering algorithm is based on a data structure model adequate to the problem.
Pattern Recognition, 2010
An important goal in cluster analysis is the internal validation of results using an objective criterion. Of particular relevance in this respect is the estimation of the optimum number of clusters capturing the intrinsic structure of the data. This paper proposes a method to determine this optimum number based on the evaluation of fuzzy partition stability under bootstrap resampling. The method is first characterized on synthetic data with respect to hyper-parameters, like the fuzzifier, and spatial clustering parameters, such as feature space dimensionality, the clusters' degree of overlap, and the number of clusters. The method is then validated on experimental datasets. Furthermore, the performance of the proposed method is compared to that obtained using a number of traditional fuzzy validity rules based on cluster compactness-to-separation criteria. The proposed method provides accurate and reliable results, and offers better generalization capabilities than the classical approaches.
2020
A key issue in cluster analysis is the choice of an appropriate clustering method and the determination of the best number of clusters. Different clusterings are optimal on the same data set according to different criteria, and the choice of such criteria depends on the context and aim of clustering. Therefore, researchers need to consider what data analytic characteristics the clusters they are aiming at are supposed to have, among others within-cluster homogeneity, between-clusters separation, and stability. Here, a set of internal clustering validity indexes measuring different aspects of clustering quality is proposed, including some indexes from the literature. Users can choose the indexes that are relevant in the application at hand. In order to measure the overall quality of a clustering (for comparing clusterings from different methods and/or different numbers of clusters), the index values are calibrated for aggregation. Calibration is relative to a set of random clustering...
Intelligent Data Analysis, 2009
Clustering quality or validation indices allow the evaluation of the quality of clustering in order to support the selection of a specific partition or clustering structure in its natural unsupervised environment, where the real solution is unknown or not available. In this paper, we investigate the use of quality indices mostly based on the concepts of clusters' compactness and separation, for the evaluation of clustering results (partitions in particular). This work intends to offer a general perspective regarding the appropriate use of quality indices for the purpose of clustering evaluation. After presenting some commonly used indices, as well as indices recently proposed in the literature, key issues regarding the practical use of quality indices are addressed. A general methodological approach is presented which considers the identification of appropriate indices thresholds. This general approach is compared with the simple use of quality indices for evaluating a clustering solution.
Neural computation, 2001
We introduce a method for validation of results obtained by clustering analysis of data. The method is based on resampling the available data. A figure of merit that measures the stability of clustering solutions against resampling is introduced. Clusters which are stable against resampling give rise to local maxima of this figure of merit. This is presented first for a one-dimensional data set, for which an analytic approximation for the figure of merit is derived and compared with numerical measurements. Next, the applicability of the method is demonstrated for higher dimensional data, including gene microarray expression data.
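A figure of merit in this resampling spirit (an illustrative variant, not necessarily the authors' exact definition) can be built from pair co-membership: for every pair of points that appears both in the full-data clustering and in a resampled clustering, check whether the two solutions agree on putting the pair in the same cluster. A minimal sketch, assuming hard labels and a subsample given as an index-to-label mapping:

```python
from itertools import combinations

def pairwise_stability(full_labels, sub_labels):
    """Fraction of point pairs (restricted to the resampled points) whose
    co-membership agrees between the full-data clustering and the
    subsample clustering. 1.0 = perfectly reproducible on this resample.
    `full_labels` is a list indexed by point; `sub_labels` maps an
    original point index to its label in the subsample clustering."""
    shared = sorted(sub_labels)
    agree = total = 0
    for i, j in combinations(shared, 2):
        same_full = full_labels[i] == full_labels[j]
        same_sub = sub_labels[i] == sub_labels[j]
        agree += (same_full == same_sub)
        total += 1
    return agree / total
```

Averaging this score over many resamples gives a stability curve; as in the abstract, genuinely stable clusters produce local maxima of the figure of merit.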
Marketing Letters, 2010
Segmentation results derived using cluster analysis depend on (1) the structure of the data and (2) algorithm parameters. Typically, neither the data structure is assessed in advance of clustering, nor is the sensitivity of the analysis to changes in algorithm parameters evaluated. We propose a benchmarking framework based on bootstrapping techniques that accounts for sample and algorithm randomness. This provides much-needed guidance, both to data analysts and to users of clustering solutions, regarding the choice of the final clusters from computations that are exploratory in nature.
2011
In this paper, a new criterion for cluster validation is proposed. This new cluster validation criterion is used to approximate the goodness of a cluster, and the clusters that satisfy a threshold of this measure are selected to participate in a clustering ensemble. For combining the chosen clusters, a co-association-based consensus function is applied. Since the Evidence Accumulation Clustering (EAC) method cannot derive the co-association matrix from a subset of clusters, a new EAC-based method, called Extended EAC (EEAC), is applied to construct the co-association matrix from the subset of clusters. Employing this new cluster validation criterion, the obtained ensemble is evaluated on some well-known and standard data sets. The empirical studies show promising results for the ensemble obtained using the proposed criterion compared with the ensemble obtained using the standard cluster validation criterion.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1981
Assessing the stability of a clustering method involves the measurement of the extent to which the generated clusters are affected by perturbations in the input data. A measure which specifies the disturbance in a set of clusters as the minimum number of operations required to restore the set of modified clusters to the original ones is adopted. A number of well-known graph theoretic clustering methods are compared in terms of their stability as determined by this measure. Specifically, it is shown that among the clustering methods in any of several families of graph theoretic methods, clusters defined as the connected components are the most stable and the clusters specified as the maximal complete subgraphs are the least stable. Furthermore, as one proceeds from the method producing the most narrow clusters (maximal complete subgraphs) to those producing relatively broader clusters, the clustering process is shown to remain at least as stable as any method in the previous stages. Finally, the lower and the upper bounds for the measure of stability, when clusters are defined as the connected components, are derived.
Pattern Recognition Letters, 2006
Cluster validation is a major issue in cluster analysis. Many existing validity indices do not perform well when clusters overlap or there is significant variation in their covariance structure. The contribution of this paper is twofold. First, we propose a new validity index for fuzzy clustering. Second, we present a new approach for the objective evaluation of validity indices and clustering algorithms. Our validity index makes use of the covariance structure of clusters, while the evaluation approach utilizes a new concept of overlap rate that gives a formal measure of the difficulty of distinguishing between overlapping clusters. We have carried out experimental studies using data sets containing clusters of different shapes and densities and various overlap rates, in order to show how validity indices behave when clusters become less and less separable. Finally, the effectiveness of the new validity index is also demonstrated on a number of real-life data sets.