IEEE Journal of Selected Topics in Signal Processing
The problem of organizing data that evolves over time into clusters is encountered in a number of practical settings. We introduce evolutionary subspace clustering, a method whose objective is to cluster a collection of evolving data points that lie on a union of low-dimensional evolving subspaces. To learn the parsimonious representation of the data points at each time step, we propose a non-convex optimization framework that exploits the self-expressiveness property of the evolving data while taking into account the representation from the preceding time step. To find an approximate solution to this non-convex problem, we develop an alternating minimization scheme that both learns the parsimonious representation and adaptively tunes a smoothing parameter reflective of the rate of data evolution. The latter addresses a fundamental challenge in evolutionary clustering: determining if, and to what extent, previous clustering solutions should be considered when analyzing an evolving data collection. Our experiments on both synthetic and real-world datasets demonstrate that the proposed framework outperforms state-of-the-art static subspace clustering algorithms and existing evolutionary clustering schemes in terms of both accuracy and running time across a range of scenarios.

Index Terms: subspace clustering, evolutionary clustering, self-expressive models, temporal data, real-time clustering

I. INTRODUCTION

Massive amounts of high-dimensional data collected by contemporary information processing systems create new challenges in the fields of signal processing and machine learning. High dimensionality of data imposes computational and memory burdens and may adversely affect the performance of existing data analysis algorithms. An important unsupervised learning problem encountered in such settings is finding informative parsimonious structures that characterize large-scale high-dimensional datasets. This task is critical for detecting meaningful patterns in complex data and enabling accurate and efficient clustering. The problem of extracting low-dimensional structures for the purpose of clustering arises in many applications, including motion segmentation and face clustering in computer vision [1], [2], image representation and compression in image clustering [3], [4], robust ...
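As a rough illustration of this kind of objective (not the authors' exact algorithm), consider a ridge-regularized self-expressive step with a temporal smoothing term; the function names, the fixed smoothing weight alpha, and the final spectral-clustering step are assumptions made for the sketch:

import numpy as np
from sklearn.cluster import SpectralClustering

def evolve_representation(X, C_prev, lam=0.1, alpha=0.5):
    # Minimize ||X - X C||_F^2 + lam ||C||_F^2 + alpha ||C - C_prev||_F^2
    # over C; setting the gradient to zero gives the linear system below.
    # X is d x n with data points as columns, C_prev is the previous n x n
    # representation (the paper tunes alpha adaptively; here it is fixed).
    n = X.shape[1]
    G = X.T @ X
    C = np.linalg.solve(G + (lam + alpha) * np.eye(n), G + alpha * C_prev)
    np.fill_diagonal(C, 0.0)  # heuristic: suppress trivial self-representation
    return C

def cluster_at_time_t(X, C_prev, n_clusters):
    # Build a symmetric affinity from the representation and cluster it.
    C = evolve_representation(X, C_prev)
    A = np.abs(C) + np.abs(C).T
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(A)
    return labels, C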
International Journal of Machine Learning and Cybernetics
Massive volumes of high-dimensional data that evolve over time are continuously collected by contemporary information processing systems, raising the problem of organizing this data into clusters (i.e., achieving dimensionality reduction) while simultaneously learning its temporal evolution patterns. In this paper, a framework for evolutionary subspace clustering, referred to as LSTM-ESCM, is introduced, which aims at clustering a set of evolving high-dimensional data points that lie in a union of low-dimensional evolving subspaces. To obtain the parsimonious data representation at each time step, we propose to exploit the so-called self-expressiveness property of the data at each time point. At the same time, LSTM networks are used to extract the inherited temporal patterns behind the data over the whole time frame. An efficient algorithm is implemented in MATLAB. Experiments are then carried out on real-world datasets to demonstrate the effectiveness of the proposed approach, and the results show that the suggested algorithm dramatically outperforms comparable existing approaches in terms of both run time and accuracy.
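Purely as a hypothetical fragment (the paper's implementation is in MATLAB; the layer sizes and the flattened-matrix input below are assumptions, not the LSTM-ESCM architecture), an LSTM that maps a window of past self-expressive coefficient matrices to a prediction for the next one could look like:

import torch
import torch.nn as nn

class CoefficientLSTM(nn.Module):
    # Illustrative only: predict the next (flattened) n x n coefficient
    # matrix from a sequence of previous ones.
    def __init__(self, n, hidden=128):
        super().__init__()
        self.n = n
        self.lstm = nn.LSTM(input_size=n * n, hidden_size=hidden,
                            batch_first=True)
        self.out = nn.Linear(hidden, n * n)

    def forward(self, c_seq):
        # c_seq: (batch, T, n*n) flattened coefficient matrices
        h, _ = self.lstm(c_seq)
        return self.out(h[:, -1]).view(-1, self.n, self.n)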
The 2003 Congress on Evolutionary Computation (CEC '03), 2003
This paper proposes a new evolutionary algorithm for subspace clustering in very large and high-dimensional databases. The design includes task-specific coding and genetic operators, along with a non-random initialization procedure. Reported experimental results show the algorithm scales almost linearly with the size and dimensionality of the database as well as the dimensionality of the hidden clusters. Our algorithm is able to discover clusters of different densities embedded in low- as well as high-dimensional subspaces of the original space. Finally, the discovered knowledge is presented in the form of non-overlapping clustering rules in which only those features relevant to the clustering are reported. These two properties contribute to the high comprehensibility of the clustering output.
2005
Subspace clustering has been investigated extensively since traditional clustering algorithms often fail to detect meaningful clusters in high-dimensional data spaces. Many recently proposed subspace clustering methods suffer from two severe problems. First, the algorithms typically scale exponentially with the data dimensionality and/or the subspace dimensionality of the clusters. Second, for performance reasons, many algorithms use a global density threshold for clustering, which is quite questionable since clusters in subspaces of significantly different dimensionality will most likely exhibit significantly varying densities. In this paper, we propose a generic framework to overcome these limitations. Our framework is based on an efficient filter-refinement architecture that scales at most quadratically with the data dimensionality and the dimensionality of the subspace clusters. It can be applied to any clustering notion, including notions based on a local density threshold. A broad experimental evaluation on synthetic and real-world data empirically shows that our method achieves significant gains in runtime and quality over state-of-the-art subspace clustering algorithms.
INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY, 2007
When clustering high-dimensional data, traditional clustering methods are found lacking because they consider all dimensions of the dataset when discovering clusters, whereas only some of the dimensions are relevant. This may give rise to subspaces within the dataset where clusters can be found. Using feature selection, we can remove irrelevant and redundant dimensions by analyzing the entire dataset. The problem of automatically identifying clusters that exist in multiple, possibly overlapping subspaces of high-dimensional data, allowing better clustering of the data points, is known as subspace clustering. There are two major approaches to subspace clustering based on search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results. Bottom-up approaches start by finding low-dimensional dense regions and then use them to form clusters. Based on a survey on subspace ...
2004
Clustering suffers from the curse of dimensionality, and similarity functions that use all input features with equal relevance may not be effective. We introduce an algorithm that discovers clusters in subspaces spanned by different combinations of dimensions via local weightings of features. This approach avoids the risk of information loss encountered in global dimensionality reduction techniques and does not assume any data distribution model. Our method associates with each cluster a weight vector whose values capture the relevance of features within the corresponding cluster. We experimentally demonstrate the gain in performance our method achieves, using both synthetic and real data sets. In particular, our results show the feasibility of the proposed technique for simultaneous clustering of genes and conditions in microarray data.
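One standard way to realize such per-cluster feature weights (in the spirit of locally adaptive clustering, not necessarily this paper's exact update) weights each dimension by the exponential of its negative within-cluster dispersion; a minimal sketch, with W initialized uniformly to ones/sqrt(d):

import numpy as np

def lac_step(X, centers, W, h=1.0):
    # One iteration: weighted assignment, then per-cluster feature-weight
    # and center updates. X: (n, d); centers, W: (k, d).
    k, d = centers.shape
    D = ((X[:, None, :] - centers[None, :, :]) ** 2 * W[None, :, :]).sum(-1)
    labels = D.argmin(1)
    for j in range(k):
        pts = X[labels == j]
        if len(pts) == 0:
            continue  # leave empty clusters unchanged
        disp = ((pts - centers[j]) ** 2).mean(0)  # per-feature dispersion
        w = np.exp(-disp / h)                     # smaller spread -> larger weight
        W[j] = w / np.linalg.norm(w)              # normalize the weight vector
        centers[j] = pts.mean(0)
    return labels, centers, W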
2010
Clustering algorithms break down when the data points lie in huge-dimensional spaces. To tackle this problem, many subspace clustering methods have been proposed to build a subspace in which the data points cluster efficiently. The bottom-up approach is widely used to select a set of candidate features, a portion of which is then used to build up the hidden subspace step by step. The complexity depends exponentially or cubically on the number of selected features. In this paper, we present SEGCLU, a SEGregation-based subspace CLUstering method which significantly reduces the size of the candidate feature set and has cubic complexity. The algorithm was applied to noise-free DNA copy-number data from two groups of autistic and typically developing children to extract a potential biomarker for autism; 85% of the individuals were classified correctly in a 13-dimensional subspace.
ArXiv, 2018
The problem of dimension reduction is of increasing importance in modern data analysis. In this paper, we consider modeling the collection of points in a high dimensional space as a union of low dimensional subspaces. In particular we propose a highly scalable sampling based algorithm that clusters the entire data via first spectral clustering of a small random sample followed by classifying or labeling the remaining out of sample points. The key idea is that this random subset borrows information across the entire data set and that the problem of clustering points can be replaced with the more efficient and robust problem of "clustering sub-clusters". We provide theoretical guarantees for our procedure. The numerical results indicate we outperform other state-of-the-art subspace clustering algorithms with respect to accuracy and speed.
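A compact sketch of this sample-then-extend strategy, with a simple nearest-centroid rule standing in for the paper's out-of-sample classification step (sample size m and the nearest-neighbors affinity are illustrative choices):

import numpy as np
from sklearn.cluster import SpectralClustering

def sample_and_extend(X, n_clusters, m=200, seed=0):
    # Spectrally cluster a random subsample, then label the remaining
    # points by the nearest subsample-cluster centroid.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(m, len(X)), replace=False)
    sub_labels = SpectralClustering(n_clusters=n_clusters,
                                    affinity="nearest_neighbors").fit_predict(X[idx])
    # centroids of the in-sample clusters (assumes each cluster is non-empty)
    centroids = np.vstack([X[idx][sub_labels == j].mean(0)
                           for j in range(n_clusters)])
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = d.argmin(1)
    labels[idx] = sub_labels  # keep the in-sample assignments
    return labels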
Clustering is considered the most important unsupervised learning problem. It aims to find structure in a collection of unlabeled data. Dealing with a large quantity of data items can be problematic because of time complexity, and high-dimensional data, e.g., time series, poses a further challenge. Novel algorithms need to be robust, scalable, efficient, and accurate to cluster such data. In this study we propose a two-stage algorithm based on K-Means to achieve this objective.
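The abstract leaves the two stages unspecified; one plausible reading, sketched below under that assumption, compresses the data into many micro-clusters first and then clusters the micro-centroids:

import numpy as np
from sklearn.cluster import KMeans

def two_stage_kmeans(X, k, micro=50, seed=0):
    # Stage 1: coarse K-Means into many micro-clusters (a fast summary).
    stage1 = KMeans(n_clusters=micro, n_init=4, random_state=seed).fit(X)
    # Stage 2: K-Means on the micro-centroids, then map labels back to points.
    stage2 = KMeans(n_clusters=k, n_init=10,
                    random_state=seed).fit(stage1.cluster_centers_)
    return stage2.labels_[stage1.labels_]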
Background: High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the cluster algorithm used works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogeneously distributed data or assign data points to incorrect clusters. We analyzed whether this can be avoided by using emergent self-organizing feature maps (ESOM).

Methods: Data sets with different degrees of complexity were submitted to ESOM analysis with large numbers of neurons, using an interactive R-based bioinformatics tool. On top of the trained ESOM, the distance structure in the high-dimensional feature space was visualized in the form of a so-called U-matrix. Clustering results were compared with those provided by classical common cluster algorithms, including single linkage, Ward and k-means.

Results: Ward clustering imposed cluster structures on cluster-less "golf ball", "cuboid" and "S-shaped" data sets that contained no structure at all (random data). Ward clustering also imposed structures on permuted real-world data sets. By contrast, the ESOM/U-matrix approach correctly found that these data contain no cluster structure. However, ESOM/U-matrix was correct in identifying clusters in biomedical data truly containing subgroups. It was always correct in cluster structure identification in further canonical artificial data. Using intentionally simple data sets, it is shown that popular clustering algorithms typically used for biomedical data sets may fail to cluster data correctly, suggesting that they are also likely to perform erroneously on high-dimensional biomedical data.

Conclusions: The present analyses emphasize that generally established classical hierarchical clustering algorithms carry a considerable tendency to produce erroneous results. By contrast, unsupervised machine-learned analysis of cluster structures, applied using the ESOM/U-matrix method, is a viable, unbiased method to identify true clusters in the high-dimensional space of complex data.
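As a concrete aside, the U-matrix visualized here is conventionally computed from a trained SOM codebook as each unit's mean distance to its grid neighbors; a minimal sketch independent of the authors' R tool, assuming a rectangular grid:

import numpy as np

def u_matrix(weights):
    # weights: (rows, cols, d) SOM codebook. Returns a (rows, cols) map of
    # mean distances to the 4-neighborhood; high ridges separate clusters.
    r, c, _ = weights.shape
    U = np.zeros((r, c))
    for i in range(r):
        for j in range(c):
            nbrs = [weights[x, y]
                    for x, y in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= x < r and 0 <= y < c]
            U[i, j] = np.mean([np.linalg.norm(weights[i, j] - n) for n in nbrs])
    return U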
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019
Finding the informative subspaces of high-dimensional datasets is at the core of numerous applications in computer vision, where spectral-based subspace clustering is arguably the most widely studied method due to its strong empirical performance. Such algorithms first construct a self-representation for each sample, using the other samples as a dictionary, and compute the affinity matrix from it. Sparsity and connectivity of the self-representation play important roles in effective subspace clustering. However, simultaneous optimization of both factors is difficult due to their conflicting nature, and most existing methods are designed to address only one factor. In this paper, we propose a post-processing technique that optimizes both sparsity and connectivity by finding good neighbors. Good neighbors induce key connections among samples within a subspace: they not only have large affinity coefficients but are also strongly connected to each other. We reassign the coefficients of the good neighbors and eliminate the other entries to generate a new coefficient matrix. We show that a few good neighbors can effectively recover the subspace, and that the proposed post-processing step can complement most existing subspace clustering algorithms. Experiments on five benchmark datasets show that the proposed algorithm performs favorably against state-of-the-art methods with negligible additional computation cost.
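A simplified rendering of the idea (the mutual-top-k rule below approximates, rather than reproduces, the paper's good-neighbor criterion): keep only coefficients between samples that appear among each other's strongest entries, then rebuild a symmetric affinity:

import numpy as np

def keep_good_neighbors(C, k=5):
    # Zero all but mutually-top-k coefficients in |C| and symmetrize.
    A = np.abs(C)
    np.fill_diagonal(A, 0.0)
    topk = np.argsort(-A, axis=1)[:, :k]          # each sample's k strongest neighbors
    mask = np.zeros_like(A, dtype=bool)
    rows = np.repeat(np.arange(len(A)), k)
    mask[rows, topk.ravel()] = True
    mask &= mask.T                                # keep only reciprocal links
    A_new = np.where(mask, A, 0.0)
    return A_new + A_new.T                        # symmetric affinity for spectral clustering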
IEEE Transactions on Knowledge and Data Engineering, 2020
Subspace clustering assumes that the data is separable into distinct subspaces. Such a simple assumption does not always hold. We assume that, even if the raw data is not separable into subspaces, one can learn a representation (transform coefficients) such that the learnt representation is separable into subspaces. To achieve this goal, we embed subspace clustering techniques (locally linear manifold clustering, sparse subspace clustering and low-rank representation) into transform learning. The entire formulation is learnt jointly, giving rise to a new class of methods called transformed subspace clustering (TSC). To account for non-linearity, kernelized extensions of TSC are also proposed. To test the performance of the proposed techniques, benchmarking is performed on image clustering and document clustering datasets. Comparison with state-of-the-art clustering techniques shows that our formulation improves upon them.
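As a deliberately simplified sketch of the alternation (restricting the transform to be orthogonal, which the paper's transform-learning formulation does not require, and using soft-thresholding for the coefficient step):

import numpy as np

def transformed_representation(X, n_iter=10, lam=0.1):
    # X: (d, n) with points as columns. Alternate two steps:
    # (1) sparse code Z = soft_threshold(T X);
    # (2) orthogonal T via Procrustes: argmin_{T orthogonal} ||T X - Z||_F,
    #     solved by the SVD of Z X^T.
    d, _ = X.shape
    T = np.eye(d)
    for _ in range(n_iter):
        TX = T @ X
        Z = np.sign(TX) * np.maximum(np.abs(TX) - lam, 0.0)  # soft threshold
        U, _, Vt = np.linalg.svd(Z @ X.T)                    # Procrustes update
        T = U @ Vt
    return T, Z  # a subspace clustering method (e.g., SSC) then runs on Z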
BIOMAT 2013, 2014
Research on the theoretical foundation and applications of projected clustering of high-dimensional and big data has been supported by a number of programs and funding agencies, including the Canada Research Chairs program, the Natural Sciences and Engineering Research Council of Canada (discovery grant, collaborative research development program), the Mitacs Globalink program, the Mitacs Accelerate program, the Fields-Mitacs summer research program, the Canada Foundation for Innovation and the Ontario Innovation Trust. A few industrial partners have also been involved, including Generation 5 Mathematical Technologies Inc. and the InferSystems Corporation.

We present some progress in high-dimensional data clustering made at the Laboratory for Industrial and Applied Mathematics over the last ten years. The focus is on the role of information processing delay as an adaptive mechanism for pattern recognition in subspaces of high-dimensional data. Our objective is to develop both the mathematical foundation and effective techniques/tools for projective clustering. We also present applications to gene filtering, cancer diagnosis, neural spike train pattern recognition, text mining, stock associations, and online social network news aggregation.
We present a novel deep neural network architecture for unsupervised subspace clustering. This architecture is built upon deep auto-encoders, which non-linearly map the input data into a latent space. Our key idea is to introduce a novel self-expressive layer between the encoder and the decoder to mimic the "self-expressiveness" property that has proven effective in traditional subspace clustering. Being differentiable, our new self-expressive layer provides a simple but effective way to learn pairwise affinities between all data points through a standard back-propagation procedure. Being nonlinear, our neural-network based method is able to cluster data points having complex (often nonlinear) structures. We further propose pre-training and fine-tuning strategies that let us effectively learn the parameters of our subspace clustering networks. Our experiments show that our method significantly outperforms the state-of-the-art unsupervised subspace clustering techniques.
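A minimal rendition of such a self-expressive layer (with MLP encoder/decoder standing in for the paper's convolutional auto-encoders, and the zero-diagonal constraint on C omitted for brevity):

import torch
import torch.nn as nn

class SelfExpressiveAE(nn.Module):
    # Auto-encoder with a learnable n x n coefficient matrix C that
    # re-expresses each latent code as a combination of the others.
    def __init__(self, n_samples, d_in, d_latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                 nn.Linear(128, d_latent))
        self.dec = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(),
                                 nn.Linear(128, d_in))
        self.C = nn.Parameter(1e-4 * torch.randn(n_samples, n_samples))

    def forward(self, X):
        # X: (n_samples, d_in), processed as one full batch
        Z = self.enc(X)
        Z_se = self.C @ Z            # self-expressive layer: Z ~ C Z
        return self.dec(Z_se), Z, Z_se

def loss_fn(model, X, lam1=1.0, lam2=0.1):
    X_hat, Z, Z_se = model(X)
    return (((X - X_hat) ** 2).sum()           # reconstruction
            + lam1 * ((Z - Z_se) ** 2).sum()   # self-expressiveness
            + lam2 * model.C.abs().sum())      # sparsity on C

Spectral clustering on the affinity |C| + |C|^T then yields the segmentation, mirroring the classical pipeline.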
2009
Many real-world data sets have a very high-dimensional feature space. Most clustering techniques use the distance or similarity between objects as a measure to build clusters. But in high-dimensional spaces, distances between points become relatively uniform. In such cases, density-based approaches may give better results. Subspace clustering algorithms automatically identify lower-dimensional subspaces of the higher-dimensional feature space in which clusters exist. In this paper, we propose a new clustering algorithm, ISC (Intelligent Subspace Clustering), which tries to overcome three major limitations of existing state-of-the-art techniques. ISC determines input parameters such as the ε-distance at various levels of subspace clustering, which helps in finding meaningful clusters. A uniform-parameters approach is not suitable for different kinds of databases. ISC implements dynamic and adaptive determination of meaningful clustering parameters based on hierarchical filteri...
ACM SIGKDD Explorations Newsletter, 2004
Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. Often in high dimensional data, many dimensions are irrelevant and can mask existing clusters in noisy data. Feature selection removes irrelevant and redundant dimensions by analyzing the entire dataset. Subspace clustering algorithms localize the search for relevant dimensions allowing them to find clusters that exist in multiple, possibly overlapping subspaces. There are two major ...
New Directions in Statistical Physics, 2004
Cluster analysis divides data into groups (clusters) for the purposes of summarization or improved understanding. For example, cluster analysis has been used to group related documents for browsing, to find genes and proteins that have similar functionality, or as a means of data compression. While clustering has a long history and a large number of clustering techniques have been developed in statistics, pattern recognition, data mining, and other fields, significant challenges still remain. In this chapter we provide a short introduction to cluster analysis, and then focus on the challenge of clustering high dimensional data. We present a brief overview of several recent techniques, including a more detailed description of recent work of our own which uses a concept-based clustering approach.
Information and Software Technology, 2004
The aim of this paper is to present a novel subspace clustering method named FINDIT. Clustering is the process of finding interesting patterns residing in a dataset by grouping similar data objects and separating them from dissimilar ones based on their dimensional values. Subspace clustering is a newer area of clustering which achieves the clustering goal in high dimensions by allowing clusters to form with their own correlated dimensions. In subspace clustering, selecting the correct dimensions is very important because the distance between points changes easily according to the selected dimensions. However, selecting dimensions correctly is difficult, because data grouping and dimension selection must be performed simultaneously. FINDIT determines the correlated dimensions for each cluster based on two key ideas: a dimension-oriented distance measure which fully utilizes dimensional difference information, and a dimension-voting policy which determines important dimensions in a probabilistic way based on V nearest neighbors' information. Through various experiments on synthetic data, FINDIT is shown to be very successful on the high-dimensional clustering problem. FINDIT satisfies most requirements for good clustering methods, such as accuracy of results, robustness to noise and cluster density, and scalability with dataset size and dimensionality. Moreover, it scales gracefully to full dimension without any modification to the algorithm.
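The dimension-voting idea can be illustrated as follows (the thresholds and the simplified closeness ranking are assumptions, not FINDIT's exact dimension-oriented distance): each of a point's V nearest neighbors votes for the dimensions on which it lies within ε of the point, and heavily voted dimensions are taken as relevant:

import numpy as np

def vote_dimensions(X, i, V=10, eps=0.5, min_votes=6):
    # Return the dimensions deemed relevant around point X[i].
    diffs = np.abs(X - X[i])                 # (n, d) per-coordinate gaps
    # rank neighbors by how many dimensions fall within eps
    closeness = (diffs <= eps).sum(1)
    closeness[i] = -1                        # exclude the point itself
    nbrs = np.argsort(-closeness)[:V]        # the V "nearest" neighbors
    votes = (diffs[nbrs] <= eps).sum(0)      # per-dimension vote counts
    return np.where(votes >= min_votes)[0]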
2019
Abstract: Clustering high-dimensional data is a promising research area at present. It has become a crucial task to cluster multi-dimensional datasets, as data objects are widely dispersed in multi-dimensional space. Most conventional clustering algorithms work on all dimensions of the feature space when computing clusters, whereas only a few attributes are relevant, so their performance is not very precise. A modified subspace clustering is proposed in this paper which does not use all attributes of the high-dimensional feature space simultaneously; rather, it determines a subspace of attributes that is important for each individual cluster. This subspace of attributes may be the same or different for different clusters. The comparison between conventional K-Means and the modified subspace K-Means clustering algorithm was done based on variou...
2011 10th International Conference on Machine Learning and Applications and Workshops, 2011
The problem of detecting clusters in high-dimensional data is increasingly common in machine learning applications, for instance in computer vision and bioinformatics. Recently, a number of approaches in the field of subspace clustering have been proposed which search for clusters in subspaces of unknown dimensions. Learning the number of clusters, the dimension of each subspace, and the correct assignments is a challenging task, and many existing algorithms perform poorly in the presence of subspaces that have different dimensions and possibly overlap, or are otherwise computationally expensive. In this work we present a novel approach to subspace clustering that learns the number of clusters and the dimensionality of each subspace in an efficient way. We assume that the data points in each cluster are well represented in low dimensions by a PCA model. We propose a measure of the predictive influence of data points modelled by PCA, which we minimise to drive the clustering process. The proposed predictive subspace clustering algorithm is assessed on both simulated data and on the popular Yale faces database, where state-of-the-art performance and speed are obtained.
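A compact K-subspaces-style loop captures the "one PCA model per cluster" ingredient, with plain reconstruction error standing in for the paper's predictive-influence measure (the fixed subspace dimension and iteration count are illustrative):

import numpy as np
from sklearn.decomposition import PCA

def k_subspaces(X, k, dim=2, n_iter=10, seed=0):
    # Assign each point to the cluster whose local PCA reconstructs it
    # best, refit the PCA models, and repeat.
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))
    for _ in range(n_iter):
        errs = np.full((len(X), k), np.inf)
        for j in range(k):
            pts = X[labels == j]
            if len(pts) <= dim:
                continue  # too few points to fit this cluster's model
            p = PCA(n_components=dim).fit(pts)
            rec = p.inverse_transform(p.transform(X))  # project onto cluster j's subspace
            errs[:, j] = ((X - rec) ** 2).sum(1)
        labels = errs.argmin(1)
    return labels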
Data mining has emerged as a powerful tool to extract knowledge from huge databases. Researchers have introduced several machine learning algorithms to explore databases and discover information, hidden patterns, and rules that were not known at data recording time. Owing to remarkable developments in storage capacity, processing, and powerful algorithmic tools, practitioners are developing new and improved algorithms and techniques in several areas of data mining to discover rules and relationships among attributes in simple and complex higher-dimensional databases. Furthermore, data mining is applied in a large variety of areas, ranging from banking to marketing, engineering to bioinformatics, and from investment to risk analysis and fraud detection. Practitioners are analyzing and implementing artificial neural network techniques for classification and regression problems because of their accuracy and efficiency. The aim of this short ...