2020, ArXiv
We introduce a cluster evaluation technique called the Tree Index. Our Tree Index algorithm aims at describing the structural information of the clustering, rather than following the quantitative format of cluster-quality indexes (where the representation power of a clustering is some cumulative error, similar to vector quantization). Our Tree Index looks for margins among clusters that make them easy to learn, without the complications of Minimum Description Length. Our Tree Index produces a decision tree from the clustered data set, using the cluster identifiers as labels. It combines the entropy of each leaf with its depth. Intuitively, a shorter tree with pure leaves generalizes the data well (the clusters are easy to learn because they are well separated), so the cluster labels are meaningful. If the clustering algorithm does not separate the data well, trees learned from its results will be large and overly detailed. We show that, on the clustering results (obtained by various techniques) on a brain dataset, ...
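A minimal sketch of the idea, assuming a scikit-learn workflow: learn a decision tree that predicts the cluster labels, then score the tree from its leaves. The exact way the authors combine leaf entropy and depth is not given in this abstract, so the depth-penalized, size-weighted sum below is an illustrative assumption.

```python
# Illustrative sketch of a Tree-Index-style score (not the authors' exact formula):
# learn a decision tree that predicts cluster labels, then combine the entropy
# of each leaf with its depth -- shallow trees with pure leaves score well.
import numpy as np
from scipy.stats import entropy
from sklearn.tree import DecisionTreeClassifier

def tree_index(X, cluster_labels, random_state=0):
    tree = DecisionTreeClassifier(random_state=random_state).fit(X, cluster_labels)
    t = tree.tree_
    # depth of every node, found by walking down from the root
    depth = np.zeros(t.node_count, dtype=int)
    stack = [(0, 0)]
    while stack:
        node, d = stack.pop()
        depth[node] = d
        if t.children_left[node] != -1:              # internal node
            stack.append((t.children_left[node], d + 1))
            stack.append((t.children_right[node], d + 1))
    score = 0.0
    for node in range(t.node_count):
        if t.children_left[node] == -1:              # leaf
            counts = t.value[node].ravel()
            weight = counts.sum() / len(cluster_labels)
            # leaf entropy weighted by leaf size and penalized by depth (assumed combination)
            score += weight * (entropy(counts / counts.sum()) + depth[node])
    return score                                     # lower = clusters are easier to learn
```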
Quality of clustering is an important issue in the application of clustering techniques. Most traditional cluster validity indices are geometry-based cluster quality measures. This work proposes a cluster validity index based on the decision-theoretic rough set model by considering various loss functions. Real-world retail data show the usefulness of the proposed validity index for the evaluation of rough and crisp clustering. The measure is shown to help determine the optimal number of clusters, as well as an important parameter called the threshold in rough clustering. The experiments with a promotional campaign for the retail data illustrate the ability of the proposed measure to incorporate financial considerations in evaluating the quality of a clustering scheme. This ability to deal with monetary values distinguishes the proposed decision-theoretic measure from other distance-based measures. The proposed validity index can also be effective for evaluating other clustering approaches such as fuzzy clustering.
2000
Clustering is an important unsupervised learning paradigm, but so far the traditional methodologies are mostly based on the minimization of the variance between the data and the cluster means. Here we propose a new evaluation function based on a previously developed information theoretic measure defined from Renyi's (1960) entropy. We show how to apply Renyi's entropy to clustering and analyze the resulting staircase nature of the performance function that can be expected during learning. We suggest simulated annealing as a possible optimization strategy.
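For concreteness, Renyi's quadratic entropy of a sample can be estimated non-parametrically with a Gaussian Parzen window via the so-called information potential; the kernel width sigma is a free parameter, and this sketch only shows the estimator, not the paper's full clustering evaluation function.

```python
# Sketch: non-parametric estimate of Renyi's quadratic entropy with a Gaussian
# Parzen window (the "information potential"); sigma is a user-chosen kernel width.
import numpy as np

def renyi_quadratic_entropy(X, sigma=1.0):
    X = np.asarray(X, dtype=float)                       # shape (n_samples, n_features)
    n, d = X.shape
    # pairwise squared Euclidean distances
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian kernel of variance 2*sigma^2 (convolution of two Parzen kernels)
    norm = (4 * np.pi * sigma ** 2) ** (d / 2.0)
    information_potential = np.sum(np.exp(-sq / (4 * sigma ** 2))) / (n ** 2 * norm)
    return -np.log(information_potential)
```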
Lecture Notes in Computer Science, 2007
Clustering is one of the most well known types of unsupervised learning. Evaluating the quality of results and determining the number of clusters in data is an important issue. Most current validity indices only cover a subset of important aspects of clusters. Moreover, these indices are relevant only for data sets containing at least two clusters. In this paper, a new bounded index for cluster validity, called the score function (SF), is introduced. The score function is based on standard cluster properties. Several artificial and real-life data sets are used to evaluate the performance of the score function. The score function is tested against four existing validity indices. The index proposed in this paper is found to be always as good as or better than these indices in the case of hyperspheroidal clusters. It is shown to work well on multidimensional data sets and is able to accommodate unique and sub-cluster cases.
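A hedged sketch of a score-function-style index built from standard cluster properties (between-centroid spread versus within-cluster spread), bounded via a double exponential; the exact SF definition is given in the paper, so the normalizations below should be treated as assumptions.

```python
# Sketch of a score-function-style bounded index from standard cluster properties:
# centroid separation ("bcd") versus within-cluster spread ("wcd").
import numpy as np

def score_function(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    n, k = len(X), len(clusters)
    z_tot = X.mean(axis=0)
    bcd = 0.0   # between-class distance: centroid spread, weighted by cluster size
    wcd = 0.0   # within-class distance: mean point-to-centroid distance per cluster
    for c in clusters:
        pts = X[labels == c]
        z_c = pts.mean(axis=0)
        bcd += len(pts) * np.linalg.norm(z_c - z_tot) / (n * k)
        wcd += np.linalg.norm(pts - z_c, axis=1).mean()
    return 1.0 - 1.0 / np.exp(np.exp(bcd - wcd))    # bounded in (0, 1); higher is better
```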
Proceedings of the 26th Annual International Conference on Machine Learning, 2009
Information-theoretic measures form a fundamental class of similarity measures for comparing clusterings, besides the classes of pair-counting-based and set-matching-based measures. In this paper, we discuss the necessity of correction for chance for information-theoretic measures used in clustering comparison. We observe that the baseline for such measures, i.e. the average value between random partitions of a data set, does not take on a constant value, and tends to have larger variation when the ratio between the number of data points and the number of clusters is small. This effect is similar in some other, non-information-theoretic measures such as the well-known Rand index. Assuming a hypergeometric model of randomness, we derive the analytical formula for the expected mutual information value between a pair of clusterings, and then propose adjusted versions of several popular information-theoretic measures. Some examples are given to demonstrate the need for and usefulness of the adjusted measures.
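The need for chance correction is easy to reproduce: scikit-learn ships both a raw and an adjusted mutual information score, and on two unrelated random labelings only the adjusted value stays near zero. The labelings below are made up purely for illustration.

```python
# Toy illustration: mutual information vs. its chance-adjusted version.
# Two random labelings share no real structure, yet raw MI is positive;
# the adjusted measure stays near zero.
import numpy as np
from sklearn.metrics import mutual_info_score, adjusted_mutual_info_score

rng = np.random.default_rng(0)
a = rng.integers(0, 10, size=50)   # a "clustering" with 10 clusters on 50 points
b = rng.integers(0, 10, size=50)   # an unrelated clustering of the same points

print(mutual_info_score(a, b))             # noticeably > 0 purely by chance
print(adjusted_mutual_info_score(a, b))    # close to 0 after chance correction
```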
Clustering attempts to discover significant groups present in a data set. It is an unsupervised process. It is difficult to define when a clustering result is acceptable. Thus, several clustering validity indices have been developed to evaluate the quality of clustering algorithms' results. In this paper, we propose to improve the quality of a clustering algorithm called "CLUSTER" by using a validity index. CLUSTER is an automatic clustering technique. It is able to identify situations where data do not have any natural clusters. However, CLUSTER has some drawbacks. In several cases, CLUSTER generates small and not well-separated clusters. The extension of CLUSTER with validity indices overcomes these drawbacks. We propose four extensions of CLUSTER with four validity indices: Dunn, DunnRNG, DB, and DB*. These extensions provide an adequate number of clusters. The experimental results on real data show that these algorithms improve the clustering quality of CLUSTER. In part...
Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (UAI-2002), 2002
In this paper we propose a measure of similarity/association between two partitions of a set of objects. Our motivation is the desire to use the measure to characterize the quality or accuracy of clustering algorithms by comparing the clusters they produce with "ground truth" consisting of classes assigned by manual means or some other means in whose veracity there is confidence. Such measures are referred to as "external". Our measure also allows clusterings with different numbers of clusters to be compared in a quantitative and principled way. Our evaluation scheme quantitatively measures how useful the cluster labels are as predictors of their class labels. It computes the reduction in the number of bits that would be required to encode (compress) the class labels if both the encoder and decoder have free access to the cluster labels. To achieve this encoding, the estimated conditional probabilities of the class labels given the cluster labels must also be encoded. In addition to defining the measure, we compare it to other commonly used external measures and demonstrate its superiority as judged by certain criteria.
1 The Clustering Problem. The most common unsupervised-learning problem is clustering, in which we are given a set of objects or patterns Ω = {ω_i | i = 1, 2, ..., n} and each object has a representation x_i = x(ω_i) in some feature space, which is frequently treated as an m-dimensional continuum R^m. Some of the features may be categorical, however. The goal in clustering is to group the objects by grouping their associated feature vectors X = {x_i | i = 1, 2, ..., n}. This grouping can be based on any number of criteria. It is assumed that the dimensions of x are attributes relevant to some application of interest. The grouping is performed on the basis of some measure of similarity relevant to the application and associated feature space. There are numerous objective functions and algorithms for clustering (see [JD88] for a survey), but we are not concerned with these here. Our task is to devise a measure of the quality of the output of clustering algorithms. Let K = {k_i | i = 1, 2, ..., n} be a set of cluster labels assigned to the elements of X. The labels themselves are taken from a set K, where |K| is the number of clusters. We have some clustering procedure f that maps X to K. Definition: clustering procedure f: f : X(Ω) → K(Ω) (1). The procedure f may determine the optimal number of clusters as well as the assignment of feature vectors (objects) to cluster labels, or it may accept the number of clusters as input. The set Ω can be considered to have been drawn from some larger population, which can be characterized by a probability density p(x). The combination of p(x) and the clustering procedure f results in a probability distribution {p(k)} over cluster labels. We define three clustering problems: (1) each pattern is assigned to one and only one cluster, so-called partitional clustering; (2) each pattern may be assigned to multiple clusters (binary assignments); (3) each pattern has a degree of membership in each cluster. The measure we propose applies to partitional clustering. In addition to these three categories, a distinction can be made between flat and hierarchical clustering (although flat clustering is technically a special case of hierarchical clustering, i.e. a depth-one tree). Our measure applies to flat clustering.
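A stripped-down sketch of the core quantity described above: the per-object reduction in code length for the class labels when the cluster labels are known is H(class) − H(class | cluster), i.e. the empirical mutual information. The full measure in the paper additionally charges for encoding the estimated conditional probability table, which this sketch omits.

```python
# Sketch: bits saved per object when encoding class labels given cluster labels,
# i.e. H(class) - H(class | cluster). The paper's measure also accounts for the
# cost of encoding the conditional probability table itself (omitted here).
import numpy as np
from collections import Counter

def bits_saved_per_object(class_labels, cluster_labels):
    n = len(class_labels)
    def H(counts):
        p = np.array(list(counts.values()), dtype=float)
        p /= p.sum()
        return -np.sum(p * np.log2(p))
    h_class = H(Counter(class_labels))
    h_class_given_cluster = 0.0
    for k, nk in Counter(cluster_labels).items():
        within = Counter(c for c, g in zip(class_labels, cluster_labels) if g == k)
        h_class_given_cluster += (nk / n) * H(within)
    return h_class - h_class_given_cluster    # = empirical I(class; cluster) in bits
```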
FAIR - Fundamental and Applied Information Technology Research - 2016, 2017
The clustering problem appears in many different fields, such as Data Mining, Pattern Recognition, and Bioinformatics. The basic objective of clustering is to group objects into clusters so that objects in the same cluster are more similar to one another than they are to objects in other clusters. Recently, many researchers have contributed to categorical data clustering, where data objects are made up of non-numerical attributes. In particular, rough set theory based attribute-selection clustering approaches for categorical data have attracted much attention. The key to these approaches is how to select, at each step, the single attribute that is best for clustering the objects from among many candidate attributes. In this paper, we review three rough set based techniques: Total Roughness (TR), Min-Min Roughness (MMR) and Maximum Dependency Attribute (MDA), and propose MAMD (Minimum value of Average Mantaras Distance), an alternative algorithm for hierarchical clustering attribute selection. MAMD uses the Mantaras metric, an information-theoretic metric on the set of partitions of a finite set of objects, and seeks a clustering attribute such that the average distance between the partition generated by this attribute and the partitions generated by the other attributes has a minimum value. To evaluate and compare MAMD with the three rough set based techniques, we use the concept of average intra-class similarity to measure the clustering quality of the selected attribute. The experimental results show that the clustering quality of the attribute selected by our method is higher than that of the attributes selected by the TR, MMR and MDA methods.
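As a reference point, the normalized Mantaras distance between the partitions induced by two attributes can be written as d(PA, PB) = (H(PA|PB) + H(PB|PA)) / H(PA, PB). The sketch below computes it from two label vectors and applies the minimum-average-distance selection rule described in the abstract; treat it as an assumption-laden illustration rather than the MAMD algorithm itself.

```python
# Sketch: normalized Lopez de Mantaras distance between the partitions induced
# by two categorical attributes, d = (H(A|B) + H(B|A)) / H(A,B), in [0, 1].
import numpy as np
from collections import Counter

def _entropy(labels):
    p = np.array(list(Counter(labels).values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

def mantaras_distance(a, b):
    h_a, h_b = _entropy(a), _entropy(b)
    h_ab = _entropy(list(zip(a, b)))               # joint entropy H(A, B)
    return (2.0 * h_ab - h_a - h_b) / h_ab         # = (H(A|B) + H(B|A)) / H(A,B)

# MAMD-style selection rule (as described in the abstract): pick the attribute
# whose average distance to all the other attributes is smallest.
def select_clustering_attribute(columns):          # columns: dict name -> label list
    names = list(columns)
    avg = {a: np.mean([mantaras_distance(columns[a], columns[b])
                       for b in names if b != a]) for a in names}
    return min(avg, key=avg.get)
```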
Systems, Man, and Cybernetics, Part B: …, 1998
We review two clustering algorithms (hard c-means and single linkage) and three indexes of crisp cluster validity (Hubert's statistics, the Davies-Bouldin index, and Dunn's index). We illustrate two deficiencies of Dunn's index which make it overly sensitive to noisy clusters and propose several generalizations of it that are not as brittle to outliers in the clusters. Our numerical examples show that the standard measure of interset distance (the minimum distance between points in a pair of sets) is the worst (least reliable) measure upon which to base cluster validation indexes when the clusters are expected to form volumetric clouds. Experimental results also suggest that intercluster separation plays a more important role in cluster validation than cluster diameter. Our simulations show that while Dunn's original index has operational flaws, the concept it embodies provides a rich paradigm for validation of partitions that have cloud-like clusters.
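For reference, the original Dunn index discussed here is the smallest inter-cluster (single-linkage) distance divided by the largest cluster diameter; a brute-force sketch follows. The paper's generalizations replace these two set measures with less outlier-sensitive alternatives, which are not shown.

```python
# Brute-force Dunn index: min inter-cluster (single-linkage) distance divided by
# max cluster diameter. O(n^2); fine for small data, illustrative only.
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    labels = np.asarray(labels)
    groups = [X[labels == c] for c in np.unique(labels)]
    # largest within-cluster diameter (max pairwise distance inside any cluster)
    max_diam = max(pdist(g).max() for g in groups if len(g) > 1)
    # smallest distance between points of two different clusters
    min_sep = min(cdist(gi, gj).min()
                  for i, gi in enumerate(groups)
                  for gj in groups[i + 1:])
    return min_sep / max_diam
```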
Computing Research Repository, 2003
Motivation: Clustering is a frequently used concept in a variety of bioinformatics applications. We present a new method for hierarchical clustering of data called the mutual information clustering (MIC) algorithm. It uses mutual information (MI) as a similarity measure and exploits its grouping property: the MI between three objects X, Y, and Z is equal to the sum of the ...
Intelligent Data Analysis, 2009
Clustering quality or validation indices allow the evaluation of the quality of clustering in order to support the selection of a specific partition or clustering structure in its natural unsupervised environment, where the real solution is unknown or not available. In this paper, we investigate the use of quality indices mostly based on the concepts of clusters' compactness and separation, for the evaluation of clustering results (partitions in particular). This work intends to offer a general perspective regarding the appropriate use of quality indices for the purpose of clustering evaluation. After presenting some commonly used indices, as well as indices recently proposed in the literature, key issues regarding the practical use of quality indices are addressed. A general methodological approach is presented which considers the identification of appropriate indices thresholds. This general approach is compared with the simple use of quality indices for evaluating a clustering solution.
Background: High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the cluster algorithm used works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogeneously distributed data or assign data points to incorrect clusters. We analyzed whether this can be avoided by using emergent self-organizing feature maps (ESOM). Methods: Data sets with different degrees of complexity were submitted to ESOM analysis with large numbers of neurons, using an interactive R-based bioinformatics tool. On top of the trained ESOM, the distance structure in the high-dimensional feature space was visualized in the form of a so-called U-matrix. Clustering results were compared with those provided by classical common cluster algorithms including single linkage, Ward and k-means. Results: Ward clustering imposed cluster structures on cluster-less "golf ball", "cuboid" and "S-shaped" data sets that contained no structure at all (random data). Ward clustering also imposed structures on permuted real-world data sets. By contrast, the ESOM/U-matrix approach correctly found that these data contain no cluster structure. However, ESOM/U-matrix was correct in identifying clusters in biomedical data truly containing subgroups. It was always correct in cluster structure identification in further canonical artificial data. Using intentionally simple data sets, it is shown that popular clustering algorithms typically used for biomedical data sets may fail to cluster data correctly, suggesting that they are also likely to perform erroneously on high-dimensional biomedical data. Conclusions: The present analyses emphasized that generally established classical hierarchical clustering algorithms carry a considerable tendency to produce erroneous results. By contrast, unsupervised machine-learned analysis of cluster structures, applied using the ESOM/U-matrix method, is a viable, unbiased method to identify true clusters in the high-dimensional space of complex data.
2016 International Joint Conference on Neural Networks (IJCNN), 2016
This paper deals with a major challenge in clustering, namely optimal model selection. It presents new, efficient clustering quality indexes relying on feature maximization, which is an alternative to the usual distributional measures relying on entropy or the Chi-square metric, and to vector-based measures such as Euclidean distance or correlation distance. First, experiments compare the behavior of these new indexes with usual cluster quality indexes based on Euclidean distance on different kinds of test datasets for which ground truth is available. This comparison clearly highlights the superior accuracy and stability of the new method on these datasets, its efficiency from the low- to the high-dimensional range, and its tolerance to noise. Further experiments are then conducted on "real life" textual data extracted from a multisource bibliographic database for which ground truth is unknown. These experiments show that the accuracy and stability of these new indexes allow them to deal efficiently with diachronic analysis, where other indexes do not fit the requirements of this task.
2007
Mutual information has been used in many clustering algorithms for measuring general dependencies between random data variables, but the difficulty of computing it for small datasets has limited its use for clustering in many applications. A novel clustering method is proposed which estimates mutual information based on an information potential computed pair-wise between data points, without any prior assumptions about the cluster density function. The proposed algorithm increases the mutual information at each step in an agglomerative hierarchy scheme. We show experimentally that maximizing the mutual information between data points and their class labels leads to efficient clustering. Experiments on a variety of artificial and real datasets show the superiority of this algorithm, as well as its low computational complexity, in comparison to other information-based clustering methods and also some ordinary clustering algorithms.
IAEME PUBLICATION, 2014
Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Hierarchical clustering cannot represent distinct clusters with similar expression patterns. To address the problems that the time complexity of existing hierarchical K-means algorithms is high and that most algorithms are sensitive to noise, a hierarchical K-means clustering algorithm based on silhouette and entropy (HKSE) is put forward. The optimal number of clusters is determined by computing the average Improved Silhouette of the dataset, so that the time complexity can be reduced. We develop an algorithm for calculating the overlap rate. We then show how the theory can be used to deal with the problem of cluster merging in a hierarchical approach to clustering and to give an optimal number of clusters automatically. Finally, experimental results demonstrate the effectiveness of the overlap-rate measuring method and the new hierarchical clustering algorithm.
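A minimal sketch of the silhouette-based selection step described above, using plain KMeans and the standard silhouette from scikit-learn; the paper's Improved Silhouette and the hierarchical/entropy components of HKSE are not reproduced here.

```python
# Sketch: choose the number of clusters k by maximizing the average silhouette,
# here with plain KMeans and the standard silhouette (not the paper's variant).
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 11)):
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores
```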
2019
Clustering analysis is one of the most commonly used techniques for uncovering patterns in data mining. Most clustering methods require establishing the number of clusters beforehand. However, due to the size of the data currently used, determining that value is a computationally expensive task in most cases. In this article, we present a clustering technique that avoids this requirement by using hierarchical clustering. There are many examples of this procedure in the literature, most of them focusing on the divisive (top-down) subtype, while in this article we cover the agglomerative (bottom-up) subtype. Although more expensive in computational and temporal cost, it nevertheless allows us to obtain very valuable information regarding the membership of elements to clusters and their groupings, that is to say, their dendrogram. Finally, several data sets of varying dimensionality have been used. For each of them, we provide the calculations of internal validation indexes to test...
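A short sketch of the agglomerative (bottom-up) workflow this article describes, assuming a SciPy/scikit-learn setting: build the linkage matrix (the dendrogram) once, then cut it at several levels and score each flat partition with internal validation indices.

```python
# Sketch: agglomerative hierarchical clustering with SciPy, cutting the dendrogram
# at several depths and scoring each flat partition with internal indices.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score, davies_bouldin_score

def score_dendrogram_cuts(X, max_k=10, method="ward"):
    Z = linkage(X, method=method)                 # full merge history (dendrogram)
    results = {}
    for k in range(2, max_k + 1):
        labels = fcluster(Z, t=k, criterion="maxclust")
        results[k] = {"silhouette": silhouette_score(X, labels),
                      "davies_bouldin": davies_bouldin_score(X, labels)}
    return results
```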
2010
Automatically determining the number of clusters present in a data set is a very important problem. Conventional clustering techniques assume a certain number of clusters and then try to find the cluster structure associated with that number. For very large and complex data sets it is not easy to guess this number of clusters. There exist validity-based clustering techniques, which compute a cluster validity measure for clustering results obtained with varying numbers of clusters. After doing this for a broad range of possible numbers of clusters, the number for which the validity measure is optimum is selected. This method is, however, awkward and may not always be applicable to very large data sets. Recently an interesting visual technique for determining clustering tendency, abbreviated VAT, has been developed. The original VAT and its different versions are found to determine the number of clusters very satisfactorily, before any clustering algorithm is actually applied. In this paper, we propose an out-of-core VAT algorithm (o-VAT) for very large data sets.
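A compact sketch of the VAT reordering step as it is commonly described (a Prim-like traversal of the dissimilarity matrix); displaying the reordered matrix as a grayscale image reveals dark diagonal blocks when clusters are present. The out-of-core bookkeeping of o-VAT is not shown.

```python
# Sketch of the VAT reordering step: order the objects with a Prim-like traversal
# of the dissimilarity matrix, then display the reordered matrix as an image;
# dark blocks along the diagonal suggest clusters. (o-VAT adds out-of-core handling.)
import numpy as np

def vat_order(D):
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # start from one end of the largest dissimilarity in the matrix
    first = np.unravel_index(np.argmax(D), D.shape)[0]
    order, remaining = [first], set(range(n)) - {first}
    while remaining:
        rem = np.array(sorted(remaining))
        sub = D[np.ix_(order, rem)]                # distances: selected -> unselected
        nxt = int(rem[np.argmin(np.min(sub, axis=0))])  # closest unselected object
        order.append(nxt)
        remaining.remove(nxt)
    return np.array(order)                         # D[np.ix_(order, order)] is the VAT image
```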
2014
The number of variables or attributes of a data set affects, to a large extent, the clustering of that data. These attributes directly affect the dissimilarity or distance measures, thereby affecting the accuracy of the clustering. Dimensionality reduction techniques can therefore definitely improve clustering. As clustering is an unsupervised machine learning technique, validating the results obtained from applying a clustering algorithm to a particular data set is a major issue. This paper formulates a new model for data clustering using a combination of feature extraction, a data clustering algorithm and clustering validity indices. The data clustering algorithm used is the Agglomerative Hierarchical Clustering Algorithm. The feature reduction techniques used are PCA, CMDS, ISOMAP and HLLE. The clustering validity indices used are the Silhouette index, Dunn index, Davies-Bouldin index and Calinski-Harabasz index. Keywords: Agglomerative Hierarchical Clustering, PCA, CMDS, ISOMAP, HLLE, Silhouett...
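A condensed sketch of the pipeline this abstract describes, using scikit-learn. Only PCA and Isomap are shown for brevity; HLLE is available there via LocallyLinearEmbedding(method='hessian'), and the Dunn index has no built-in scikit-learn implementation, so three of the four listed indices are computed.

```python
# Sketch: reduce dimensionality, run agglomerative clustering, and compare the
# resulting partitions with internal validity indices.
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

def evaluate(X, n_clusters=3):
    results = {}
    for name, reducer in [("PCA", PCA(n_components=2)),
                          ("Isomap", Isomap(n_components=2))]:
        Xr = reducer.fit_transform(X)
        labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(Xr)
        results[name] = {
            "silhouette": silhouette_score(Xr, labels),
            "davies_bouldin": davies_bouldin_score(Xr, labels),
            "calinski_harabasz": calinski_harabasz_score(Xr, labels),
        }
    return results
```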
IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2011
Evaluation of how well the extracted clusters fit the true partitions of a data set is one of the fundamental challenges in unsupervised clustering because the data structure and the number of clusters are unknown a priori. Cluster validity indices are commonly used to select the best partitioning from different clustering results; however, they are often inadequate unless clusters are well separated or have parametrical shapes. Prototype-based clustering (finding of clusters by grouping the prototypes obtained by vector quantization of the data), which is becoming increasingly important for its effectiveness in the analysis of large high-dimensional data sets, adds another dimension to this challenge. For validity assessment of prototype-based clusterings, previously proposed indexes, mostly devised for the evaluation of point-based clusterings, usually perform poorly. The poor performance is made worse when the validity indexes are applied to large data sets with complicated cluster structure. In this paper, we propose a new index, Conn_Index, which can be applied to data sets with a wide variety of clusters of different shapes, sizes, densities, or overlaps. We construct Conn_Index based on inter- and intra-cluster connectivities of prototypes. Connectivities are defined through a "connectivity matrix", which is a weighted Delaunay graph where the weights indicate the local data distribution. Experiments on synthetic and real data indicate that Conn_Index outperforms existing validity indices, used in this paper, for the evaluation of prototype-based clustering results. Index Terms: Cluster validity index, complex data structure, connectivity, Conn_Index, prototype-based clustering. I. INTRODUCTION. Unsupervised clustering aims to extract the natural partitions in a data set without a priori class information. It groups the data samples into subsets so that samples within a subset are more similar to each other than to samples in other subsets. Any given clustering method can produce a different partitioning depending on its parameters and criteria. This leads to one of the main challenges in clustering: to determine, without auxiliary information, how well the obtained clusters fit the natural partitions of the data set. The common approach for this evaluation is to use validity indices. A meaningful validity ...
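A sketch of the connectivity ("CONN") matrix on which the index is built, under the commonly cited definition that CONN(i, j) counts the data points whose best and second-best matching prototypes are i and j; this induces the weighted Delaunay-like graph the abstract mentions. The Conn_Index itself, which combines intra- and inter-cluster sums of these weights, is not reproduced, so treat the details as assumptions to check against the paper.

```python
# Sketch: build a CONN-style connectivity matrix between prototypes, where
# CONN[i, j] counts the points whose two nearest prototypes are i and j (in
# either order). This is the weighted Delaunay-like graph referred to above.
import numpy as np

def conn_matrix(X, prototypes):
    X = np.asarray(X, dtype=float)
    P = np.asarray(prototypes, dtype=float)
    conn = np.zeros((len(P), len(P)))
    for x in X:
        d = np.linalg.norm(P - x, axis=1)
        bmu, second = np.argsort(d)[:2]           # best and second-best matching units
        conn[bmu, second] += 1
        conn[second, bmu] += 1                    # keep the matrix symmetric
    return conn
```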
Entropy, 2011
Hierarchical clustering has been extensively used in practice, where clusters can be assigned and analyzed simultaneously, especially when estimating the number of clusters is challenging. However, due to the conventional proximity measures used in these algorithms, they are only capable of detecting mass-shaped clusters and encounter problems in identifying complex data structures. Here, we introduce two bottom-up hierarchical approaches that exploit an information-theoretic proximity measure to explore the nonlinear boundaries between clusters and extract data structures beyond second-order statistics. Experimental results on both artificial and real datasets demonstrate the superiority of the proposed algorithm compared to conventional and information-theoretic clustering algorithms reported in the literature, especially in detecting the true number of clusters.