1996
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs.
Data Mining and Knowledge Discovery, 1997
Data clustering is an important technique for exploratory data analysis, and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as a particular branch of it. However, existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and CPU cycles). As a result, as the dataset size increases, they do not scale well in terms of memory requirements, running time, and result quality.
2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2016
Clustering techniques are very attractive for extracting and identifying patterns in datasets. However, their application to very large spatial datasets presents numerous challenges, such as high-dimensional data, heterogeneity, and the high complexity of some algorithms. For instance, some algorithms may have linear complexity but require domain knowledge in order to determine their input parameters. Distributed clustering techniques constitute a very good alternative for the big data challenges (e.g., Volume, Variety, Veracity, and Velocity). Usually these techniques consist of two phases: the first phase generates local models or patterns, and the second aggregates the local results to obtain global models. While the first phase can be executed in parallel on each site and is therefore efficient, the aggregation phase is complex and time consuming and may produce incorrect and ambiguous global clusters, and therefore incorrect models. In this paper we propose a new distributed clustering approach that deals efficiently with both phases: the generation of local results and the generation of global models by aggregation. For the first phase, our approach is capable of analysing the datasets located at each site using different clustering techniques. The aggregation phase is designed in such a way that the final clusters are compact and accurate while the overall process is efficient in time and memory allocation. For the evaluation, we use two well-known clustering algorithms: K-Means and DBSCAN. One of the key properties of this distributed clustering technique is that the number of global clusters is dynamic; it does not need to be fixed in advance. Experimental results show that the approach is scalable and produces high-quality results.
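A minimal sketch of the two-phase idea, under stated assumptions (this is not the paper's aggregation scheme, and names such as local_datasets are illustrative): each site clusters its own data independently, and the local cluster centres are then themselves clustered, with DBSCAN leaving the number of global clusters dynamic.

import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
# one small dataset per "site" (illustrative data)
local_datasets = [rng.normal(loc=c, scale=0.3, size=(200, 2))
                  for c in ([0, 0], [5, 5], [0, 5])]

# Phase 1: local clustering, run independently (and in parallel) on each site.
local_centres = []
for data in local_datasets:
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
    local_centres.append(km.cluster_centers_)

# Phase 2: aggregate by clustering the local centres; DBSCAN leaves the
# number of global clusters open rather than fixed in advance.
centres = np.vstack(local_centres)
global_labels = DBSCAN(eps=1.0, min_samples=1).fit_predict(centres)
print(f"{len(set(global_labels))} global clusters from {len(centres)} local centres")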
2002 IEEE International Conference on Data Mining, 2002. Proceedings.
Clustering large data sets of high dimensionality has always been a serious challenge for clustering algorithms. Many recently developed clustering algorithms have attempted to address either handling data sets with a very large number of records or data sets with a very high number of dimensions. This paper provides a discussion of the advantages and limitations of existing algorithms when they operate on very large multidimensional data sets. To simultaneously overcome both the "curse of dimensionality" and the scalability problems associated with large amounts of data, we propose a new clustering algorithm called O-Cluster. This new clustering method combines a novel active sampling technique with an axis-parallel partitioning strategy to identify continuous areas of high density in the input space. The method operates on a limited memory buffer and requires at most a single scan through the data. We demonstrate the high quality of the obtained clustering solutions, their robustness to noise, and O-Cluster's excellent scalability.
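The axis-parallel partitioning idea can be illustrated with a toy sketch (this is not O-Cluster's active-sampling procedure, and best_axis_parallel_cut is an invented name): histogram each dimension and cut at the deepest low-density valley separating two dense regions.

import numpy as np

def best_axis_parallel_cut(X, bins=20):
    """Return (dimension, cut_value) at the deepest density valley found."""
    best = None
    for d in range(X.shape[1]):
        counts, edges = np.histogram(X[:, d], bins=bins)
        # consider interior bins whose count is below both neighbours
        for i in range(1, bins - 1):
            if counts[i] < counts[i - 1] and counts[i] < counts[i + 1]:
                depth = min(counts[i - 1], counts[i + 1]) - counts[i]
                cut = 0.5 * (edges[i] + edges[i + 1])
                if best is None or depth > best[0]:
                    best = (depth, d, cut)
    return (best[1], best[2]) if best else None

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-3, 0], 0.5, (300, 2)),
               rng.normal([3, 0], 0.5, (300, 2))])
print(best_axis_parallel_cut(X))  # expect dimension 0, cut near 0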
ArXiv, 2021
We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high performance by evaluating distances of data points to only a subset of the cluster centres. Our contribution is substantially more efficient than k-means, as it does not require an all-to-all comparison of data points and clusters. We show that the optimal solutions of our approximation are the same as in the exact solution. However, our approach is considerably more efficient at extracting these clusters compared to the state of the art. We compare our approximation with exact k-means and alternative approximation approaches on a series of standardised clustering tasks. For the evaluation, we consider the algorithmic complexity, including the number of operations to convergence, and the stability of the results. An efficient open-source implementation of the algorithm is provided in the “peregrine” software repository.
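A hedged sketch of the core idea, restricting each point's distance computations to a candidate subset of centres; this illustrates the principle only, not the paper's algorithm, and subset_kmeans and its parameters are invented for the example.

import numpy as np

def subset_kmeans(X, k, n_candidates=5, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]
    # one full assignment pass to initialise labels
    labels = np.argmin(((X[:, None] - centres) ** 2).sum(-1), axis=1)
    for _ in range(n_iter):
        # each centre's nearest centres define the candidate set
        cdist = ((centres[:, None] - centres) ** 2).sum(-1)
        cand = np.argsort(cdist, axis=1)[:, :n_candidates]
        # each point compares only against candidates of its current centre
        c = cand[labels]                                  # (n, n_candidates)
        d = ((X[:, None] - centres[c]) ** 2).sum(-1)
        labels = c[np.arange(len(X)), np.argmin(d, axis=1)]
        for j in range(k):                                # usual update step
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
labels, centres = subset_kmeans(X, k=50)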
2003
Pattern-based clustering is important in many applications, such as DNA microarray data analysis, automatic recommendation systems, and target marketing systems. However, pattern-based clustering in large databases is challenging. On the one hand, there can be a huge number of clusters, many of which can be redundant and thus make pattern-based clustering ineffective. On the other hand, previously proposed methods may not be efficient or scalable when mining large databases.
International Journal of Computer Applications, 2013
Cluster analysis is a major business application of data mining. This investigation presents the NCDBC algorithm, which extends expansion seed selection within the DBSCAN algorithm. DBSCAN embodies the density-based clustering concept; its hierarchical extension OPTICS has been proposed more recently and is one of the most successful approaches to clustering. The aim of this research work is to advance the state of the art in clustering, mainly density-based clustering, by identifying new challenges for density-based clustering and proposing innovative solutions to them. The proposed procedure focuses on reducing the number of seed points and the execution time cost of searching neighborhood data. A hierarchical clustering procedure can then be applied to the resulting subspaces, here used to calculate latitudes for northern and southern cities and longitudes for different cities.
Clustering is one of the data mining techniques that extract knowledge from spatial datasets. The DBSCAN algorithm is considered a well-founded algorithm, as it discovers clusters of different shapes and handles noise effectively. Several algorithms improve on DBSCAN, such as the fast hybrid density-based algorithm (L-DBSCAN) and the fast density-based clustering algorithm. In this paper, an enhanced algorithm is proposed that improves the fast density-based clustering algorithm in its ability to discover clusters with different densities and to cluster large datasets.
2014 World Congress on Computer Applications and Information Systems (WCCAIS), 2014
Clustering algorithms are attractive for the task of class identification in spatial databases. However, application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape, and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN, relying on a density-based notion of clusters, which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data from the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.
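For readers who want to try the algorithm, a minimal usage example via scikit-learn's implementation; eps and min_samples correspond to the density parameters described above (the paper also describes a heuristic to help the user choose eps).

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two interleaving half-moons: arbitrary-shape clusters plus a little noise
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(f"clusters: {labels.max() + 1}, noise points: {(labels == -1).sum()}")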
Emergence of modern techniques for scientific data collection has resulted in large-scale accumulation of data pertaining to diverse fields. Cluster analysis is a primary method for database mining [8]. Among the different types of clustering, density-based clustering has advantages: its clusters are easy to understand, and it does not restrict itself to particular cluster shapes. But existing density-based algorithms still fall short. Almost all of the well-known clustering algorithms require input parameters that are hard to determine but have a significant influence on the clustering result. Furthermore, for many real data sets there does not even exist a global parameter setting for which the result of the clustering algorithm describes the intrinsic clustering structure accurately [1][2]. This paper gives a survey of density-based clustering algorithms. DBSCAN [15] is the base algorithm for density-based clustering techniques. It can detect clusters of different shapes and sizes in large amounts of data containing noise and outliers. The main drawback of the traditional clustering algorithm was largely overcome by the VDBSCAN algorithm, but in VDBSCAN the value of the parameter ‘k’ is a user-supplied input, which largely degrades the benefit over a fixed Eps. In our proposed method, Eps is determined from the value of ‘k’ in varied-density spatial cluster analysis by treating ‘k’ as a variable, using algorithmic averaging and Cartesian distance measurement (via the Cartesian product) on two-dimensional spatial datasets where data are sparsely distributed. The objective is thus to enhance the existing DBSCAN algorithm by automatically selecting its input parameters and finding clusters of varied density. The proposed algorithm discovers arbitrarily shaped clusters, requires no input parameters, and uses the same definitions as the DBSCAN algorithm.
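The k-distance heuristic this line of work builds on can be sketched briefly; estimate_eps below is an illustrative assumption (sorted k-nearest-neighbour distances, with the largest jump taken as the knee), not the exact procedure of any of the cited papers.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def estimate_eps(X, k=4):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: a point is its own neighbour
    dists, _ = nn.kneighbors(X)
    kdist = np.sort(dists[:, -1])                    # sorted k-dist values
    knee = np.argmax(np.diff(kdist))                 # crude knee: biggest jump
    return kdist[knee]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (200, 2)), rng.normal(4, 0.3, (200, 2))])
print("suggested Eps:", round(estimate_eps(X), 3))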
2004
A novel algorithm, named DESCRY, for clustering very large multidimensional data sets with numerical attributes is presented. DESCRY discovers clusters of different shape, size, and density, even when the data contains noise, by first finding and clustering a small set of points, called meta-points, that well depict the shape of the clusters present in the data set. Final clusters are obtained by assigning each point to one of the partial clusters. The computational complexity of DESCRY is linear both in the data set size and in the data set dimensionality. Experiments show that DESCRY obtains very good qualitative results, comparable with those obtained by state-of-the-art clustering algorithms.
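A hedged sketch of the meta-point idea, under the assumption that a plain random sample stands in for DESCRY's meta-point selection (which is more involved): cluster the sample, then let every point inherit the label of its nearest sampled point.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import NearestNeighbors

def metapoint_cluster(X, n_meta=100, n_clusters=3, seed=0):
    rng = np.random.default_rng(seed)
    meta = X[rng.choice(len(X), n_meta, replace=False)]      # "meta-points"
    meta_labels = AgglomerativeClustering(n_clusters=n_clusters,
                                          linkage="single").fit_predict(meta)
    nn = NearestNeighbors(n_neighbors=1).fit(meta)
    _, idx = nn.kneighbors(X)                                # nearest meta-point
    return meta_labels[idx.ravel()]                          # inherit its label

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (500, 2)) for c in ([0, 0], [4, 0], [2, 4])])
print(np.bincount(metapoint_cluster(X, n_clusters=3)))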
Complexity International Journal (CIJ), 2017
Clustering is the most well-known data mining method, used to classify data into clusters based on distance measures. With the growth of high-dimensional data such as microarray gene expression data, clustering high-dimensional data into groups by measuring similarity between objects in the full-dimensional space is often wrong because such data includes various types of attributes. We propose a clustering algorithm called CURE (Clustering Using Representatives) that is more robust to outliers and recognises clusters with non-spherical shapes and wide variation in size. CURE achieves this by representing each cluster by a fixed number of points generated by selecting well-scattered points from the cluster and then shrinking them toward the cluster's centre by a given fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes, and the shrinking dampens the effects of outliers. To handle large databases, CURE uses a combination of random sampling and partitioning: a random sample drawn from the data set is first partitioned, and each partition is partially clustered; the partial clusters are then clustered in a second pass to produce the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is significantly better than that achieved by existing algorithms. The proposed CURE algorithm is compared with the existing BIRCH algorithm on two measures, accuracy and execution time, and the experimental results demonstrate that the proposed algorithm achieves better accuracy than the previous algorithm.
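CURE's representative-point construction for a single cluster can be sketched directly from the description above; the hierarchical merge loop that uses these representatives is omitted, and cure_representatives is an illustrative name.

import numpy as np

def cure_representatives(points, n_rep=5, alpha=0.3):
    centroid = points.mean(axis=0)
    # start from the point farthest from the centroid
    reps = [points[np.argmax(((points - centroid) ** 2).sum(-1))]]
    while len(reps) < n_rep:
        # next representative: the point farthest from the chosen set
        d = np.min([((points - r) ** 2).sum(-1) for r in reps], axis=0)
        reps.append(points[np.argmax(d)])
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)  # shrink toward the centroid

rng = np.random.default_rng(0)
cluster = rng.normal(size=(300, 2))
print(cure_representatives(cluster).round(2))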
A plethora of algorithms exist for clustering to discover actionable knowledge from large data sources. Given unlabeled data objects, clustering is an unsupervised learning task that finds natural groups of similar objects. Each cluster is a subset of objects that exhibit high similarity. Cluster quality is high when clusters feature the highest intra-cluster similarity and the lowest inter-cluster similarity. The quality of clusters is influenced by the similarity measure employed for grouping objects, and is measured by the ability of the clustering technique to unearth latent trends distributed in the data. Clustering is ubiquitous in real-world applications such as market research, discovering web access patterns, document classification, image processing, pattern recognition, earth observation, banking, and insurance, to name a few. Clustering algorithms differ in the type of data handled, measure of similarity, computational efficiency, linkage method, soft versus hard assignment, and so on. Employing the correct clustering technique depends on the technical know-how one has of the various kinds of clustering algorithms and the scenarios suited to each. Toward this end, in this paper we explore clustering algorithms in terms of computational efficiency, measure of similarity, speed, and performance.
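The intra- versus inter-cluster criterion described above is what the silhouette coefficient quantifies; a quick, illustrative check with scikit-learn:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=600, centers=4, random_state=0)
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # typically highest near the true k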
2000
Data mining, also known as knowledge discovery in databases, is a statistical analysis technique used to find hidden patterns and identify untapped value in large datasets. Clustering is a principal data discovery technique in data mining that segregates a dataset into subsets or clusters so that data values in the same cluster share some common characteristics or attributes. A number of clustering techniques have been proposed in the past by many researchers that can identify arbitrarily shaped clusters, where a cluster is defined as a dense region separated by low-density regions; among them, DBSCAN is a prime density-based clustering algorithm. DBSCAN is capable of discovering clusters of any arbitrary shape and size in databases which even include noise and outliers. Many researchers have attempted to overcome certain deficiencies in the original DBSCAN, such as identifying patterns within datasets of varied densities and its high computational complexity; hence a number of augmented forms of the DBSCAN algorithm are available. We present an incremental density-based clustering technique which is based on the fundamental DBSCAN clustering algorithm and improves its computational complexity. Our proposed algorithm can be used in different knowledge domains like image processing, classification of patterns in GIS maps, X-ray crystallography, and information security.
IEEE Transactions on Knowledge and Data Engineering, 2018
Clustering large volumes of high-dimensional data is a challenging task. Many clustering algorithms have been developed to address either handling datasets with a very large sample size or with a very high number of dimensions, but they are often impractical when the data is large in both aspects. To simultaneously overcome both the 'curse of dimensionality' problem due to high dimensions and scalability problems due to large sample size, we propose a new fast clustering algorithm called FensiVAT. FensiVAT is a hybrid, ensemble-based clustering algorithm which uses fast data-space reduction and an intelligent sampling strategy. In addition to clustering, FensiVAT also provides visual evidence that is used to estimate the number of clusters (cluster tendency assessment) in the data. In our experiments, we compare FensiVAT with nine state-of-the-art approaches which are popular for large sample size or high-dimensional data clustering. Experimental results suggest that FensiVAT, which can cluster large volumes of high-dimensional datasets in a few seconds, is the fastest and most accurate method of the ones tested.
WIREs Data Mining and Knowledge Discovery, 2019
[Figure 1: Data mining techniques (Mittal et al.)]
Proceedings of 15th International …, 2002
Hierarchical clustering methods have attracted much attention by giving the user a maximum amount of flexibility: rather than requiring parameter choices to be predetermined, the result represents all possible levels of granularity. In this paper a hierarchical method is introduced that is fundamentally related to partitioning methods, such as k-medoids and k-means, as well as to a density-based method, namely center-defined DENCLUE. It is superior to both k-means and k-medoids in its reduction of outlier influence. Nevertheless it avoids both the time complexity of some partition-based algorithms and the storage requirements of density-based ones. An implementation is presented that is particularly suited to spatial, stream, and multimedia data, using P-trees for efficient data storage and access.
New Directions in Statistical Physics, 2004
Cluster analysis divides data into groups (clusters) for the purposes of summarization or improved understanding. For example, cluster analysis has been used to group related documents for browsing, to find genes and proteins that have similar functionality, or as a means of data compression. While clustering has a long history and a large number of clustering techniques have been developed in statistics, pattern recognition, data mining, and other fields, significant challenges still remain. In this chapter we provide a short introduction to cluster analysis, and then focus on the challenge of clustering high dimensional data. We present a brief overview of several recent techniques, including a more detailed description of recent work of our own which uses a concept-based clustering approach.
2002
Clustering is the process of grouping a set of objects into classes of similar objects. Although definitions of similarity vary from one clustering model to another, in most of these models the concept of similarity is based on distances, e.g., Euclidean distance or cosine distance. In other words, similar objects are required to have close values on at least a set of dimensions. In this paper, we explore a more general type of similarity. Under the pCluster model we proposed, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. For instance, in DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. Discovery of such clusters of genes is essential in revealing significant connections in gene regulatory networks. E-commerce applications, such as collaborative filtering, can also benefit from the new model, which captures not only the closeness of values of certain leading indicators but also the closeness of (purchasing, browsing, etc.) patterns exhibited by the customers. Our paper introduces an effective algorithm to detect such clusters, and we perform tests on several real and synthetic data sets to show its effectiveness.
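The coherence test behind the pCluster model can be written out directly: two objects are coherent on a set of dimensions if, for every pair of dimensions, the difference of their value differences stays within a threshold delta, so their values may differ in magnitude but rise and fall together. A minimal checker (is_delta_pcluster is an illustrative name):

import numpy as np
from itertools import combinations

def is_delta_pcluster(x, y, dims, delta):
    # coherent iff |(x_a - x_b) - (y_a - y_b)| <= delta for all dim pairs (a, b)
    return all(abs((x[a] - x[b]) - (y[a] - y[b])) <= delta
               for a, b in combinations(dims, 2))

gene1 = np.array([1.0, 3.0, 2.0, 8.0])
gene2 = np.array([5.1, 7.0, 6.2, 0.0])  # roughly a shifted copy of gene1 on dims 0-2
print(is_delta_pcluster(gene1, gene2, dims=[0, 1, 2], delta=0.3))  # True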
Alexandria Engineering Journal, 2015
In dynamic information environments such as the web, the amount of information is rapidly increasing. Thus, the need to organize such information in an efficient manner is more important than ever. Given this dynamic nature, incremental clustering algorithms are always preferred to traditional static algorithms. In this paper, an enhanced version of the incremental DBSCAN algorithm is introduced for incrementally building and updating arbitrarily shaped clusters in large datasets. The proposed algorithm enhances the incremental clustering process by limiting the search space to partitions rather than the whole dataset, which results in significant improvements in performance compared to relevant incremental clustering algorithms. Experimental results with datasets of different sizes and dimensions show that the proposed algorithm speeds up the incremental clustering process by a factor of up to 3.2 compared to existing incremental algorithms.
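One way to picture the restricted search space, as a sketch only (the paper's actual partitioning and cluster-update logic are not reproduced here): index points in a grid of eps-sized cells, so an incremental insertion inspects only the new point's cell and its immediate neighbours rather than the whole dataset.

import numpy as np
from collections import defaultdict
from itertools import product

class GridIndex:
    """Grid of eps-sized cells; 2-D points for simplicity."""
    def __init__(self, eps):
        self.eps = eps
        self.cells = defaultdict(list)

    def _key(self, p):
        return tuple(np.floor(p / self.eps).astype(int))

    def insert(self, p):
        self.cells[self._key(p)].append(p)

    def eps_neighbours(self, p):
        kx, ky = self._key(p)
        found = []
        for dx, dy in product((-1, 0, 1), repeat=2):  # 3x3 cell neighbourhood
            for q in self.cells.get((kx + dx, ky + dy), ()):
                if np.linalg.norm(p - q) <= self.eps:
                    found.append(q)
        return found

rng = np.random.default_rng(0)
idx = GridIndex(eps=0.5)
for p in rng.uniform(0, 5, size=(1000, 2)):
    idx.insert(p)
print(len(idx.eps_neighbours(np.array([2.5, 2.5]))))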