2007, 2007 IEEE Symposium on Computational Intelligence and Data Mining
8 pages
In recent applications of clustering such as gene expression microarray analysis, collaborative filtering, and web mining, object similarity is no longer measured by physical distance, but rather by the behavior patterns objects manifest or the magnitude of correlations they induce. Current state-of-the-art algorithms aiming at this type of clustering typically postulate specific cluster models that are able to capture only specific behavior patterns or correlations, and omit the possibility that other information-carrying patterns or correlations may coexist in the data. We cast the problem of searching for pattern clusters, or clusters that induce large correlations in some subset of features, into the problem of searching for groups of points embedded in lines. The advantage of this approach is that it allows the clustering of different patterns or correlations simultaneously. It also allows the clustering of patterns and correlations that are overlooked by existing methods. A formal stochastic line cluster model is presented and its connection to correlation is established. Based on this model, an algorithm that uses feature selection to search for line clusters embedded in subspaces of the data is presented.
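The link between points embedded in a line and large correlations can be illustrated with a small sketch. It is not the paper's stochastic model or algorithm; the data, the line y = 2x + 1, and the helper name `pearson` are assumptions made purely for illustration. The points are far apart in Euclidean terms, yet the two features are almost perfectly correlated because the points lie close to a line.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: ten points scattered tightly around the line y = 2x + 1.
xs = [float(i) for i in range(10)]
noise = [0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1, 0.05, -0.05, 0.0]
ys = [2.0 * x + 1.0 + e for x, e in zip(xs, noise)]

# Correlation between the two features across the group of points.
r = pearson(xs, ys)

# Euclidean distance between the first and last point of the group.
spread = math.dist((xs[0], ys[0]), (xs[-1], ys[-1]))

print(f"correlation = {r:.4f}, end-to-end distance = {spread:.2f}")
```

A distance-based method would see these points as widely spread, whereas a line-based view treats them as one tight group.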
2002
Clustering has been an active research area of great practical importance in recent years. Most previous clustering models have focused on grouping objects with similar values on a (sub)set of dimensions (e.g., subspace cluster) and assumed that every object has an associated value on every dimension (e.g., bicluster). These existing cluster models may not always be adequate in capturing coherence exhibited among objects. Strong coherence may still exist among a set of objects (on a subset of attributes) even if they take quite different values on each attribute and the attribute values are not fully specified. This is very common in many applications including bio-informatics analysis as well as collaborative filtering analysis, where the data may be incomplete and subject to biases. In bio-informatics, a bicluster model has recently been proposed to capture coherence among a subset of the attributes. Here, we introduce a more general model, referred to as the δ-cluster model, to capture coherence exhibited by a subset of objects on a subset of attributes, while allowing absent attribute values. A move-based algorithm (FLOC) is devised to efficiently produce near-optimal clustering results. The δ-cluster model takes the bicluster model as a special case, where the FLOC algorithm performs far better than the bicluster algorithm. We demonstrate the correctness and efficiency of the δ-cluster model and the FLOC algorithm on a number of real and synthetic data sets.
2002
Clustering is the process of grouping a set of objects into classes of similar objects. Although definitions of similarity vary from one clustering model to another, in most of these models the concept of similarity is based on distances, e.g., Euclidean distance or cosine distance. In other words, similar objects are required to have close values on at least a set of dimensions. In this paper, we explore a more general type of similarity. Under the pCluster model we proposed, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. For instance, in DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. Discovery of such clusters of genes is essential in revealing significant connections in gene regulatory networks. E-commerce applications, such as collaborative filtering, can also benefit from the new model, which captures not only the closeness of values of certain leading indicators but also the closeness of (purchasing, browsing, etc.) patterns exhibited by the customers. Our paper introduces an effective algorithm to detect such clusters, and we perform tests on several real and synthetic data sets to show its effectiveness.
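The idea that two objects can be similar in pattern without being close in value can be sketched as follows. The abstract does not state a scoring formula, so the difference-of-differences score, the threshold name `delta`, and the gene profiles below are assumptions for illustration rather than the paper's definition.

```python
from itertools import combinations

def pair_score(x, y, a, b):
    """How differently objects x and y change between dimensions a and b."""
    return abs((x[a] - x[b]) - (y[a] - y[b]))

def is_coherent(rows, dims, delta):
    """True if every pair of objects changes alike on every pair of dimensions."""
    for x, y in combinations(rows, 2):
        for a, b in combinations(dims, 2):
            if pair_score(x, y, a, b) > delta:
                return False
    return True

# Hypothetical expression levels under four conditions.
gene1 = [10.0, 30.0, 20.0, 50.0]
gene2 = [110.0, 130.0, 120.0, 150.0]  # same rise-and-fall pattern, shifted by 100
gene3 = [10.0, 12.0, 45.0, 11.0]      # values near gene1's range, different pattern

dims = list(range(4))
same_pattern = is_coherent([gene1, gene2], dims, delta=1.0)
different_pattern = is_coherent([gene1, gene3], dims, delta=1.0)

print(same_pattern, different_pattern)
```

Here `gene1` and `gene2` are far apart under any distance measure but rise and fall together, so they pass the check; `gene3` does not.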
Statistical Analysis and Data Mining, 2009
Traditional similarity measurements often become meaningless when dimensions of datasets increase. Subspace clustering has been proposed to find clusters embedded in subspaces of high dimensional datasets.
2004
Clustering suffers from the curse of dimensionality, and similarity functions that use all input features with equal relevance may not be effective. We introduce an algorithm that discovers clusters in subspaces spanned by different combinations of dimensions via local weightings of features. This approach avoids the risk of loss of information encountered in global dimensionality reduction techniques, and does not assume any data distribution model. Our method associates with each cluster a weight vector, whose values capture the relevance of features within the corresponding cluster. We experimentally demonstrate the gain in performance our method achieves, using both synthetic and real data sets. In particular, our results show the feasibility of the proposed technique to perform simultaneous clustering of genes and conditions in microarray data.
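A per-cluster weight vector of this kind can be sketched as below. The abstract does not give the weighting rule, so the exponential form, the bandwidth parameter `h`, and the function name `local_feature_weights` are assumptions: features along which a cluster is tight receive large weights, and features along which it is spread out receive small ones.

```python
import math

def local_feature_weights(points, h=1.0):
    """Return one weight per feature for a single cluster; weights sum to 1.

    Assumed rule: weight is proportional to exp(-dispersion / h), where
    dispersion is the mean squared deviation from the cluster centroid.
    """
    n = len(points)
    dims = len(points[0])
    centroid = [sum(p[i] for p in points) / n for i in range(dims)]
    dispersion = [
        sum((p[i] - centroid[i]) ** 2 for p in points) / n for i in range(dims)
    ]
    raw = [math.exp(-d / h) for d in dispersion]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical cluster: tight along feature 0, widely spread along feature 1.
cluster = [(1.0, 0.0), (1.1, 5.0), (0.9, 10.0), (1.0, -5.0)]
weights = local_feature_weights(cluster)

print(weights)
```

Because each cluster gets its own weights, two clusters in the same data set can live in different subspaces without a global dimensionality reduction step.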
2007
The detection of correlations is a data mining task of increasing importance due to new areas of application such as DNA microarray analysis, collaborative filtering, and text mining. In these cases, object similarity is no longer measured by physical distance, but rather by the behavior patterns objects manifest or the magnitude of correlations they induce. Many approaches have been proposed to identify clusters complying with this requirement. However, most approaches assume specific cluster models, which in turn may lead to biased results. In this paper we present a novel methodology based on linear manifolds which provides a more general and flexible framework by which correlation clustering can be done. We discuss two stochastic linear manifold cluster models and demonstrate their applicability to a wide range of correlation clustering situations. The general model provides the ability to capture arbitrarily complex linear dependencies or correlations. The specialized m...
Knowledge-Based Systems, 2017
In high-dimensional data, clusters often exist in the form of complex hierarchical relationships. In order to explore these relationships, there is a need to integrate dimensionality reduction techniques with data mining approaches and graph theory. The correlations in data points emerge more clearly if this integration is flawless. We propose an approach called Local Graph Based Correlation Clustering (LGBACC). This approach merges hierarchical clustering with PCA to uncover complex hierarchical relationships, and uses graph models to visualize the results. We propose a framework for this approach that is divided into four phases. Each phase is flawlessly integrated with the next phase. Visualization of data after each phase is an important output and is knitted into the fabric of the framework. The focus of this technique remains on obtaining high quality clusters. The quality of the final clusters obtained is measured using standard indices. It is found that LGBACC is superior to the existing hierarchical clustering approaches. We have used real world data sets to validate our framework. These datasets test the approach on low as well as high-dimensional data. It is found that LGBACC produces high quality clusters across a wide spectrum of dimensionality. Scalability tests on synthetically produced high-dimensional and large datasets show that the proposed approach runs efficiently. Hence, LGBACC is an efficient and scalable approach that produces high quality clusters in high-dimensional and large data spaces.
International Journal of Advanced Computer Science and Applications, 2010
Biclustering, or subspace clustering, is a data mining technique that allows simultaneous clustering of rows and columns of a matrix. Though the definition of similarity varies from one biclustering model to another, in most of these models the concept of similarity is based on such metrics as Manhattan distance, Euclidean distance or other Lp distances. In other words, similar objects must have close values in at least a set of dimensions. Pattern-based clustering is important in many applications, such as DNA microarray data analysis, automatic recommendation systems and target marketing systems. However, pattern-based clustering in large databases is challenging. On the one hand, there can be a huge number of clusters, and many of them can be redundant and thus make pattern-based clustering ineffective. On the other hand, the previously proposed methods may not be efficient or scalable in mining large databases. The objective of this paper is to perform a comparative study of all subspace clustering algorithms in terms of efficiency, accuracy and time complexity.
IEEE International Conference on Data Mining, 2008
Clustering problems often involve datasets where only a part of the data is relevant to the problem, e.g., in microarray data analysis only a subset of the genes show cohesive expressions within a subset of the conditions/features. The existence of a large number of non-informative data points and features makes it challenging to hunt for coherent and meaningful clusters from such datasets. Additionally, since clusters could exist in different subspaces of the feature space, a co-clustering algorithm that simultaneously clusters objects and features is often more suitable as compared to one that is restricted to traditional "one-sided" clustering. We propose Robust Overlapping Co-clustering (ROCC), a scalable and very versatile framework that addresses the problem of efficiently mining dense, arbitrarily positioned, possibly overlapping co-clusters from large, noisy datasets. ROCC has several desirable properties that make it extremely well suited to a number of real life applications. Through extensive experimentation we show that our approach is significantly more accurate in identifying biologically meaningful co-clusters in microarray data as compared to several other prominent approaches that have been applied to this task. We also point out other interesting applications of the proposed framework in solving difficult clustering problems.
In several application domains, high-dimensional observations are collected and then analysed in search of naturally occurring data clusters which might provide further insights about the nature of the problem. In this paper we describe a new approach for partitioning such high-dimensional data. Our assumption is that, within each cluster, the data can be approximated well by a linear subspace estimated by means of a principal component analysis (PCA). The proposed algorithm, Predictive Subspace Clustering (PSC), partitions the data into clusters while simultaneously estimating cluster-wise PCA parameters. The algorithm minimises an objective function that depends upon a new measure of influence for PCA models. A penalised version of the algorithm is also described for carrying out simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in detail, and extensive simulation results and comparisons to competing methods are presented. The comparative performance of PSC has been assessed on six real gene expression data sets for which PSC often provides state-of-the-art results.
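The assumption that each cluster is well approximated by its own linear subspace can be sketched in two dimensions. This is not PSC and does not use its influence measure; it is a plain illustration in which each cluster is modelled by its first principal axis and a new point is assigned to the cluster whose line reconstructs it best. The data and the helper names `fit_line`, `residual`, `assign` and `nearest_centroid` are hypothetical.

```python
import math

def fit_line(points):
    """First principal axis of 2-D points: returns (centroid, unit direction)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Closed-form orientation of the leading eigenvector of a 2x2 scatter matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

def residual(point, model):
    """Perpendicular distance from the point to the cluster's fitted line."""
    (mx, my), (ux, uy) = model
    dx, dy = point[0] - mx, point[1] - my
    return abs(dx * uy - dy * ux)

def assign(point, models):
    """Index of the cluster whose line leaves the smallest residual."""
    return min(range(len(models)), key=lambda k: residual(point, models[k]))

def nearest_centroid(point, models):
    """Index of the cluster with the closest centroid, for comparison."""
    return min(range(len(models)), key=lambda k: math.dist(point, models[k][0]))

# Hypothetical clusters: A lies on y = x, B lies on y = 4.
cluster_a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
cluster_b = [(0.0, 4.0), (1.0, 4.0), (2.0, 4.0), (3.0, 4.0)]
models = [fit_line(cluster_a), fit_line(cluster_b)]

p = (5.0, 5.1)   # continues the trend of cluster A
q = (10.0, 4.2)  # continues the trend of cluster B
print(assign(p, models), assign(q, models), nearest_centroid(p, models))
```

The point `p` is closer to the centroid of cluster B, yet it lies almost exactly on the line of cluster A, so the subspace-based rule and a centroid-based rule disagree on it.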
BMC Bioinformatics, 2008
In DNA microarray experiments, discovering groups of genes that share similar transcriptional characteristics is instrumental in functional annotation, tissue classification and motif identification. However, in many situations a subset of genes exhibits a consistent pattern only over a subset of conditions. Conventional clustering algorithms that deal with the entire row or column in an expression matrix would therefore fail to detect these useful patterns in the data. Recently, biclustering has been proposed to detect a subset of genes exhibiting a consistent pattern over a subset of conditions. However, most existing biclustering algorithms are based on searching for sub-matrices within a data matrix by optimizing certain heuristically defined merit functions. Moreover, most of these algorithms can only detect a restricted set of bicluster patterns.