Neurocomputing
Data often exists in subspaces embedded within a high-dimensional space. Subspace clustering seeks to group data according to the dimensions relevant to each subspace, which requires estimating the subspaces as well as clustering the data. Subspace clustering becomes increasingly challenging in high-dimensional spaces due to the curse of dimensionality, which undermines reliable estimation of distances and density. Recently, another aspect of high-dimensional spaces has been observed, known as the hubness phenomenon, whereby a few data points appear frequently as nearest neighbors of the rest of the data. The distribution of neighbor occurrences becomes skewed with increasing intrinsic dimensionality of the data, and a few points with high neighbor occurrence counts emerge as hubs. Hubs exhibit useful geometric properties and have been leveraged for clustering data in the full-dimensional space. In this paper, we study hubs in the context of subspace clustering. We present new characterizations of hubs in relation to subspaces, and design graph-based meta-features to identify a subset of hubs that are well suited to serve as seeds for the discovery of local latent subspaces and clusters. We propose and evaluate a hubness-driven algorithm to find subspace clusters, and show that our approach is superior to the baselines and competitive against state-of-the-art subspace clustering methods. We also identify the data characteristics that make hubs suitable for subspace clustering. Such characterization gives valuable guidelines to data mining practitioners.
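The neighbor occurrence counts behind the hubness phenomenon are straightforward to compute from a k-nearest-neighbor graph. Below is a minimal sketch; the function name and the two-standard-deviation hub cutoff are illustrative choices, not details from the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_occurrence_counts(X, k=10):
    """Count how often each point appears among the k nearest
    neighbors of the other points (its k-occurrence, N_k)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    # Column 0 of each neighbor list is the point itself; drop it.
    return np.bincount(idx[:, 1:].ravel(), minlength=len(X))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50))   # a high-dimensional sample
N_k = k_occurrence_counts(X, k=10)   # mean of N_k is exactly k
# Points whose N_k greatly exceeds the mean act as hubs; a common
# (illustrative) cutoff is mean + 2 standard deviations.
hubs = np.where(N_k > N_k.mean() + 2 * N_k.std())[0]
print(f"{len(hubs)} hub points out of {len(X)}")
```

With growing intrinsic dimensionality, the distribution of N_k becomes increasingly right-skewed, so the hub set above shrinks to a few very frequent neighbors.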
IEEE Transactions on Knowledge and Data Engineering, 2000
High-dimensional data arise naturally in many domains, and have regularly presented a great challenge for traditional data-mining techniques, both in terms of effectiveness and efficiency. Clustering becomes difficult due to the increasing sparsity of such data, as well as the increasing difficulty in distinguishing distances between data points. In this paper we take a novel perspective on the problem of clustering high-dimensional data. Instead of attempting to avoid the curse of dimensionality by observing a lower-dimensional feature subspace, we embrace dimensionality by taking advantage of some inherently high-dimensional phenomena. More specifically, we show that hubness, i.e., the tendency of high-dimensional data to contain points (hubs) that frequently occur in k-nearest neighbor lists of other points, can be successfully exploited in clustering. We validate our hypothesis by proposing several hubness-based clustering algorithms and testing them on high-dimensional data. Experimental results demonstrate good performance of our algorithms in multiple settings, particularly in the presence of large quantities of noise.
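The abstract does not spell out the proposed algorithms; the sketch below shows one plausible variant in the spirit of hubness-based clustering, a k-means-style loop that uses each cluster's highest-hubness member as its prototype. All names and parameters here are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import NearestNeighbors

def k_hubs(X, n_clusters, k=10, n_iter=20, seed=0):
    """Toy K-hubs-style clustering: like k-means, except each cluster
    prototype is the member with the highest k-occurrence (hubness)
    rather than the centroid."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    N_k = np.bincount(idx[:, 1:].ravel(), minlength=len(X))  # hubness scores
    rng = np.random.default_rng(seed)
    proto = rng.choice(len(X), n_clusters, replace=False)
    for _ in range(n_iter):
        labels = cdist(X, X[proto]).argmin(axis=1)   # assignment step
        new_proto = proto.copy()
        for c in range(n_clusters):
            members = np.where(labels == c)[0]
            if len(members):
                # Update step: pick the "hubbiest" member as prototype.
                new_proto[c] = members[N_k[members].argmax()]
        if np.array_equal(new_proto, proto):
            break
        proto = new_proto
    return labels
```

Using hubs instead of centroids keeps prototypes on actual data points, which is one intuition for why such schemes tolerate noise well.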
International Journal of Computers & Technology, 2007
When clustering high-dimensional data, traditional clustering methods fall short because they consider all dimensions of the dataset when discovering clusters, whereas only some of the dimensions may be relevant. This gives rise to subspaces within the dataset where clusters may be found. Using feature selection, we can remove irrelevant and redundant dimensions by analyzing the entire dataset. The problem of automatically identifying clusters that exist in multiple, possibly overlapping subspaces of high-dimensional data, allowing better clustering of the data points, is known as subspace clustering. There are two major approaches to subspace clustering, based on search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results. Bottom-up approaches start by finding low-dimensional dense regions and then use them to form clusters. Based on a survey on subspace ...
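To make the bottom-up strategy concrete, here is a minimal sketch of its first step in the style of grid-based methods such as CLIQUE. The grid resolution n_bins and density threshold tau are illustrative parameters, not values from any surveyed algorithm.

```python
import numpy as np

def dense_units_1d(X, n_bins=10, tau=0.05):
    """Bottom-up first step: split each dimension into n_bins
    equal-width intervals and keep every (dimension, interval) unit
    holding at least a tau fraction of the points."""
    n, d = X.shape
    units = []
    for dim in range(d):
        col = X[:, dim]
        span = col.max() - col.min() or 1.0
        bins = np.minimum((col - col.min()) / span * n_bins,
                          n_bins - 1).astype(int)
        ids, counts = np.unique(bins, return_counts=True)
        units += [(dim, int(b)) for b, c in zip(ids, counts) if c / n >= tau]
    # Dense 1-D units are then joined into candidate 2-D units, and so
    # on, pruning with monotonicity: every projection of a dense unit
    # must itself be dense.
    return units
```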
2004
Clustering suffers from the curse of dimensionality, and similarity functions that use all input features with equal relevance may not be effective. We introduce an algorithm that discovers clusters in subspaces spanned by different combinations of dimensions via local weightings of features. This approach avoids the risk of information loss encountered in global dimensionality reduction techniques, and does not assume any data distribution model. Our method associates with each cluster a weight vector whose values capture the relevance of features within the corresponding cluster. We experimentally demonstrate the gain in performance our method achieves, using both synthetic and real data sets. In particular, our results show the feasibility of the proposed technique for performing simultaneous clustering of genes and conditions in microarray data.
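The paper's exact weighting scheme is not given here; the sketch below shows one common way such per-cluster weight vectors are formed in locally adaptive clustering, with low-dispersion dimensions receiving exponentially larger weights. The bandwidth h and the unit-norm normalization are assumptions for illustration.

```python
import numpy as np

def local_feature_weights(X, labels, h=1.0):
    """Per-cluster feature weights: dimensions along which a cluster
    is tight receive large weights; h is an (assumed) bandwidth that
    controls how sharply weights concentrate on those dimensions."""
    weights = {}
    for c in np.unique(labels):
        members = X[labels == c]
        # Average squared deviation from the cluster center, per dimension.
        dispersion = ((members - members.mean(axis=0)) ** 2).mean(axis=0)
        w = np.exp(-dispersion / h)
        weights[c] = w / np.linalg.norm(w)   # unit-norm weight vector
    return weights
```

Distances within a cluster can then be computed as weighted Euclidean distances using that cluster's weight vector, so each cluster effectively lives in its own softly selected subspace.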
Proceedings of the 20th ACM international conference on Information and knowledge management - CIKM '11, 2011
For knowledge discovery in high-dimensional databases, subspace clustering detects clusters in arbitrary subspace projections. Scalability is a crucial issue, as the number of possible projections is exponential in the number of dimensions. We propose a scalable density-based subspace clustering method that steers mining toward few selected subspace clusters. Our novel steering technique reduces subspace processing by identifying and clustering promising subspaces and their combinations directly. Thereby, it ...
ACM SIGKDD Explorations Newsletter, 2004
Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. Often in high-dimensional data, many dimensions are irrelevant and can mask existing clusters in noisy data. Feature selection removes irrelevant and redundant dimensions by analyzing the entire dataset. Subspace clustering algorithms localize the search for relevant dimensions, allowing them to find clusters that exist in multiple, possibly overlapping subspaces. There are two major ...
Statistical Analysis and Data Mining, 2009
Traditional similarity measurements often become meaningless as dataset dimensionality increases. Subspace clustering has been proposed to find clusters embedded in subspaces of high-dimensional datasets.
2009 Ninth IEEE International Conference on Data Mining, 2009
Subspace clustering aims at detecting clusters in any subspace projection of a high-dimensional space. As the number of possible subspace projections is exponential in the number of dimensions, the result is often tremendously large. Recent approaches fail to reduce results to the relevant subspace clusters: their results are typically highly redundant, i.e., many clusters are detected multiple times in several projections.
International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 2020
Most data of interest today in data-mining applications is complex and is usually represented by many different features. Such high-dimensional data is by its very nature often quite difficult to handle with conventional machine-learning algorithms. This is considered to be an aspect of the well-known curse of dimensionality. Consequently, high-dimensional data needs to be processed with care, which is why the design of machine-learning algorithms needs to take these factors into account. Furthermore, it was observed that some of the arising high-dimensional properties can in fact be exploited to improve overall algorithm design. One such phenomenon, related to nearest-neighbor learning methods, is known as hubness and refers to the emergence of very influential nodes (hubs) in k-nearest neighbor graphs. A crisp weighted voting scheme for the k-nearest neighbor classifier, which exploits this notion, has recently been proposed.
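The referenced voting scheme is not reproduced in this abstract; the following is a sketch of one hubness-aware weighting in that spirit, where a training point's vote is discounted by its standardized "bad" occurrence count (how often it appears as a neighbor of differently labeled points). Function names and the exponential discount are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hubness_weighted_knn_predict(X_train, y_train, X_test, k=10):
    """Sketch of hubness-aware kNN voting: down-weight the vote of
    training points that often appear as neighbors of points with a
    different label (high 'bad' hubness)."""
    y_train = np.asarray(y_train)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_train)
    _, idx = nn.kneighbors(X_train)
    bad = np.zeros(len(X_train))
    for i, neigh in enumerate(idx[:, 1:]):       # skip self at column 0
        bad[neigh[y_train[neigh] != y_train[i]]] += 1
    std = bad.std() or 1.0
    w = np.exp(-(bad - bad.mean()) / std)        # standardized bad hubness
    _, test_idx = nn.kneighbors(X_test, n_neighbors=k)
    preds = []
    for neigh in test_idx:
        votes = {}
        for j in neigh:
            votes[y_train[j]] = votes.get(y_train[j], 0.0) + w[j]
        preds.append(max(votes, key=votes.get))
    return np.array(preds)
```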
Information and Software Technology, 2004
The aim of this paper is to present a novel subspace clustering method named FINDIT. Clustering is the process of finding interesting patterns in a dataset by grouping similar data objects and separating them from dissimilar ones based on their dimensional values. Subspace clustering is a newer form of clustering that achieves this goal in high dimensions by allowing each cluster to form over its own correlated dimensions. In subspace clustering, selecting the correct dimensions is very important, because the distance between points changes readily with the dimensions selected. Selecting dimensions correctly is difficult, however, because data grouping and dimension selection must be performed simultaneously. FINDIT determines the correlated dimensions for each cluster based on two key ideas: a dimension-oriented distance measure that fully utilizes dimensional difference information, and a dimension voting policy that determines important dimensions in a probabilistic way based on the information of V nearest neighbors. Through various experiments on synthetic data, FINDIT is shown to be very successful on the high-dimensional clustering problem. FINDIT satisfies most requirements for good clustering methods, such as accuracy of results, robustness to noise and cluster density, and scalability with dataset size and dimensionality. Moreover, it scales gracefully to the full-dimensional case without any modification to the algorithm.
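A minimal sketch of the two key ideas as described above, assuming the dimension-oriented distance counts dimensions whose difference exceeds a threshold epsilon, and that a dimension is voted in when enough of the V nearest neighbors lie within epsilon of the medoid along it. The parameters epsilon, V, and min_votes are illustrative, not FINDIT's actual settings.

```python
import numpy as np

def dod(p, q, epsilon):
    """Dimension-oriented distance: the number of dimensions in which
    two points differ by more than epsilon (smaller = more correlated)."""
    return int(np.sum(np.abs(p - q) > epsilon))

def vote_dimensions(X, medoid_idx, V=20, epsilon=0.5, min_votes=0.8):
    """Dimension voting sketch: each of the V nearest neighbors of a
    medoid (under dod) 'votes' for the dimensions where it lies within
    epsilon of the medoid; dimensions with enough votes are kept."""
    m = X[medoid_idx]
    dists = np.array([dod(m, x, epsilon) for x in X])
    neigh = np.argsort(dists)[1:V + 1]           # skip the medoid itself
    votes = (np.abs(X[neigh] - m) <= epsilon).mean(axis=0)
    return np.where(votes >= min_votes)[0]       # indices of selected dims
```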