DBSCAN, a density-based clustering method for multi-dimensional points, was proposed in 1996.
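To fix ideas before the abstracts below, here is a minimal, textbook-style sketch of the DBSCAN procedure itself, using a brute-force O(n²) region query; the names eps and min_pts mirror the ε and MinPts parameters used throughout, and nothing here is taken from any one paper's implementation.

```python
# Minimal textbook-style DBSCAN sketch with brute-force O(n^2) neighbor search.
import numpy as np

def region_query(X, i, eps):
    """Indices of all points within distance eps of point i."""
    return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)              # -1 = noise / not yet assigned
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:     # not core; may become a border point later
            continue
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:                     # expand the cluster through core points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id
                jn = region_query(X, j, eps)
                if len(jn) >= min_pts:   # only core points keep the expansion going
                    seeds.extend(jn)
        cluster_id += 1
    return labels
```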
Proceedings of the ACM on Management of Data, 2024
DBSCAN is a popular density-based clustering algorithm that has many different applications in practice. However, the running time of DBSCAN in high-dimensional space or general metric space (e.g., clustering a set of texts by using edit distance) can be as large as quadratic in the input size. Moreover, most existing acceleration techniques for DBSCAN are available only for low-dimensional Euclidean space. In this paper, we study the DBSCAN problem under the assumption that the inliers (the core points and border points) have a low intrinsic dimension (a realistic assumption for many high-dimensional applications), while the outliers can be located anywhere in the space without any assumption. First, we propose a k-center clustering based algorithm that reduces the time-consuming labeling and merging tasks of DBSCAN to linear time. Further, we propose a linear-time approximate DBSCAN algorithm, where the key idea is building a novel small-size summary for the core points. Also, our algorithm can be efficiently implemented for streaming data, with required memory independent of the input size. Finally, we conduct experiments and compare our algorithms with several popular DBSCAN algorithms. The experimental results suggest that our proposed approach can significantly reduce the computational complexity in practice. CCS Concepts: • Theory of computation → Theory and algorithms for application domains.
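The k-center summary idea can be pictured with Gonzalez's farthest-first traversal, which picks k representatives so that every point is close to one of them. This is only a sketch of the summarization flavor, not the paper's actual construction; the function name and parameters are illustrative.

```python
# Hypothetical k-center style summary via Gonzalez's farthest-first traversal:
# choose k representatives greedily so every point lies near some representative.
import numpy as np

def farthest_first_summary(X, k, rng=np.random.default_rng(0)):
    centers = [int(rng.integers(len(X)))]             # arbitrary first center
    dist = np.linalg.norm(X - X[centers[0]], axis=1)  # distance to nearest center so far
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                    # farthest point becomes a center
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[centers], dist.max()                     # summary points and covering radius
```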
This thesis is concerned with efficient density-based clustering using algorithms such as DBSCAN and NBC, as well as the application of indices and the triangle inequality to make these algorithms faster.
International Journal of Computational Geometry & Applications, 2019
We present a new algorithm for the widely used density-based clustering method DBSCAN. For a set of n points in ℝ², our algorithm computes the DBSCAN clustering in O(n log n) time, irrespective of the scale parameter ε (and assuming the second parameter MinPts is set to a fixed constant, as is the case in practice). Experiments show that the new algorithm is not only fast in theory, but that a slightly simplified version is competitive in practice and much less sensitive to the choice of ε than the original DBSCAN algorithm. We also present an O(n log n) randomized algorithm for HDBSCAN in the plane (HDBSCAN is a hierarchical version of DBSCAN introduced recently), and we show how to compute an approximate version of HDBSCAN in near-linear time in any fixed dimension.
Data Mining and Knowledge Discovery, 1998
The clustering algorithm DBSCAN relies on a density-based notion of clusters and is designed to discover clusters of arbitrary shape as well as to distinguish noise. In this paper, we generalize this algorithm in two important directions. The generalized algorithm, called GDBSCAN, can cluster point objects as well as spatially extended objects according to both their spatial and their nonspatial attributes. In addition, four applications using 2D points (astronomy), 3D points (biology), 5D points (earth science) and 2D polygons (geography) are presented, demonstrating the applicability of GDBSCAN to real-world problems.
2005
We generalize the k-means algorithm presented by the authors and show that the resulting algorithm can solve a larger class of clustering problems satisfying certain properties (existence of a random sampling procedure and tightness). We prove these properties for the k-median and the discrete k-means clustering problems, resulting in O(2^((k/ε)^O(1)) · dn) time (1 + ε)-approximation algorithms for these problems. These are the first algorithms for these problems that are linear in the size of the input (nd for n points in d dimensions) and independent of the dimension in the exponent, assuming k and ε to be fixed. A key ingredient of the k-median result is a (1 + ε)-approximation algorithm for the 1-median problem with running time O(2^((1/ε)^O(1)) · d). The previous best known algorithm for this problem had linear running time.
Journal of The ACM, 2010
We present a general approach for designing approximation algorithms for a fundamental class of geometric clustering problems in arbitrary dimensions. More specifically, our approach leads to simple randomized algorithms for the k-means, k-median and discrete k-means problems that yield (1 + ε)-approximations with probability ≥ 1/2 and running times of O(2^((k/ε)^O(1)) · dn). These are the first algorithms for these problems whose running times are linear in the size of the input (nd for n points in d dimensions) assuming k and ε are fixed. Our method is general enough to be applicable to clustering problems satisfying certain simple properties and is likely to have further applications. Interestingly, the center in the optimal solution to the 1-mean problem is the same as the center of mass of the points. However, in the case of the 1-median problem, also known as the Fermat-Weber problem, no such closed form is known. We show that despite the lack of such a closed form, we can obtain an approximation to the optimal 1-median in O(1) time (independent of the number of points). There are many useful variations of these clustering problems; for example, in the discrete versions, the centers that we seek should belong to the input set of points.
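The random-sampling ingredient behind results like the 1-median approximation can be sketched as follows: draw a small candidate sample and score each candidate on another small sample, so the work is independent of n. This conveys the flavor only; the authors' actual procedure and its (1 + ε) guarantee are more involved, and the sample sizes below are arbitrary placeholders.

```python
# Hypothetical sketch of sampling-based 1-median approximation: pick candidate
# centers uniformly at random and score them on a second small sample, so the
# total work does not depend on the number of points n.
import numpy as np

def approx_1_median(X, sample_size=64, rng=np.random.default_rng(0)):
    cand = X[rng.choice(len(X), size=sample_size)]      # candidate centers
    evalset = X[rng.choice(len(X), size=sample_size)]   # points used to score them
    # costs[j] = sum of distances from the evaluation sample to candidate j
    costs = np.linalg.norm(evalset[:, None, :] - cand[None, :, :], axis=2).sum(axis=0)
    return cand[int(np.argmin(costs))]
```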
2004
Density-based and grid-based clustering are two main clustering approaches. The former is famous for its capability of discovering clusters of various shapes and eliminating noise, while the latter is well known for its high speed. Combining the two approaches seems to promise better clustering results. To the best of our knowledge, however, all existing algorithms that combine density-based and grid-based clustering take cells as atomic units, in the sense that either all objects in a cell belong to a cluster or no object in the cell belongs to any cluster. This requires the cells to be small enough to ensure the fine resolution of results. In high-dimensional spaces, however, the number of cells can be very large when cells are small, which makes the clustering process extremely costly. On the other hand, the number of neighbors of a cell grows exponentially with the dimensionality of the dataset, which increases the complexity further. In this paper, we present a new approach that takes objects (or points) as the atomic units, so that the restriction on cell size can be relaxed without degrading the resolution of the clustering results. In addition, a concept of ith-order neighbors is introduced to avoid considering the exponential number of neighboring cells; by considering only low-order neighbors, our algorithm is very efficient while losing only a little accuracy. Experiments on synthetic and public data show that our algorithm can cluster high-dimensional data effectively and efficiently.
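One way to read the ith-order neighbor idea is as the set of cells at Chebyshev distance exactly i from a given cell, which the following sketch enumerates; the paper's precise definition may differ, and the helper name is ours.

```python
# Sketch of enumerating "ith-order" neighbor cells, read here as the cells at
# Chebyshev distance exactly i from a given cell (a simplification of the
# paper's notion). Restricting to small i limits how many cells are examined.
from itertools import product

def ith_order_neighbors(cell, i):
    d = len(cell)
    return [tuple(c + o for c, o in zip(cell, offs))
            for offs in product(range(-i, i + 1), repeat=d)
            if max(abs(o) for o in offs) == i]

print(len(ith_order_neighbors((0, 0), 1)))  # 8 surrounding cells in 2D
```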
Procedia Computer Science, 2013
With the advent of Web 2.0, we see a new and differentiated scenario: there is more data than can be effectively analyzed. Organizing this data has become one of the biggest problems in Computer Science. Many algorithms have been proposed for this purpose, notably those from the Data Mining area, specifically clustering algorithms. However, these algorithms remain a computational challenge because of the volume of data that needs to be processed. Several proposals in the literature aim to make these algorithms feasible, and recently those based on parallelization on graphics processing units (GPUs) have shown good results. In this work we present G-DBSCAN, a GPU-parallel version of one of the most widely used clustering algorithms, DBSCAN. Although other parallel versions of this algorithm exist, our technique distinguishes itself by the simplicity with which the data are indexed, using graphs, allowing various parallelization opportunities to be explored. In our evaluation we show that G-DBSCAN, running on a GPU, can be over 100x faster than its sequential CPU version.
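The graph formulation at the heart of G-DBSCAN can be sketched sequentially: build the ε-neighborhood graph, mark core vertices, and extract clusters by BFS. Both stages are data-parallel, which is what the GPU version exploits; the sequential sketch below is illustrative, not the paper's CUDA code.

```python
# Sequential sketch of graph-based DBSCAN: build the eps-neighborhood graph,
# mark core vertices, then BFS so that connected core points form a cluster
# and border points join the first cluster that reaches them.
from collections import deque
import numpy as np

def graph_dbscan(X, eps, min_pts):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    adj = [np.flatnonzero(D[i] <= eps) for i in range(n)]   # eps-graph adjacency
    core = np.array([len(a) >= min_pts for a in adj])
    labels = np.full(n, -1)
    cid = 0
    for s in range(n):
        if labels[s] != -1 or not core[s]:
            continue
        labels[s] = cid
        q = deque([s])
        while q:                          # BFS over the eps-graph
            u = q.popleft()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = cid
                    if core[v]:           # only core points propagate the BFS
                        q.append(v)
        cid += 1
    return labels
```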
Journal of Systems and Software, 2004
Clustering on large databases has been studied actively as an increasing number of applications involve huge amounts of data. In this paper, we propose an efficient top-down approach to density-based clustering, which is based on the density information stored in the index nodes of a multidimensional index. We first provide a formal definition of the cluster based on the concept of region contrast partition. Based on this notion, we propose a novel top-down clustering algorithm, which improves efficiency through branch-and-bound pruning. For this pruning, we present a technique for determining the bounds based on sparse and dense internal regions and formally prove the correctness of the bounds. Experimental results show that the proposed method reduces the elapsed time by up to 96 times compared with BIRCH, a well-known clustering method. The results also show that the performance improvement becomes more marked as the size of the database increases.
As a research branch of data mining, clustering is an unsupervised learning scheme that assigns the objects in a dataset into several groups, called clusters, without any prior knowledge. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is one of the most widely used clustering algorithms for spatial datasets; it can detect clusters of any shape and automatically identifies noise points. However, DBSCAN has several troublesome limitations: (1) its performance depends on two specified parameters, ε and MinPts, where ε is the maximum radius of a neighborhood around the observed point and MinPts is the minimum number of data points such a neighborhood must contain; (2) the time spent searching the nearest neighbors of each object during cluster expansion is intolerable; (3) different starting points can produce quite different results; and (4) DBSCAN is unable to identify adjacent clusters of various densities. Beyond these restrictions, the identification of border points is often ignored. In this paper, we address the above problems. First, we improve the traditional locality sensitive hashing method to implement fast nearest-neighbor queries. Second, several definitions are restated on the basis of the influence space of each object, which takes both the nearest neighbors and the reverse nearest neighbors into account. The influence space is shown to be sensitive to local density changes, which reduces the number of parameters and allows adjacent clusters of different densities to be identified; this influence-space relationship also makes the result insensitive to the ordering of input points. Finally, a new concept, core density reachability based on the influence space, is put forward to distinguish border objects from noisy objects. Several experiments demonstrate that the proposed algorithm performs better than the traditional DBSCAN algorithm and the improved algorithm IS-DBSCAN.
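The paper's improved locality sensitive hashing is specific to its setting; a generic random-projection LSH, sketched below with an assumed bucket width of eps and an exact final filter, shows how candidate neighbors can be fetched cheaply before exact checking.

```python
# Generic random-projection LSH sketch (not the paper's improved variant):
# hash points into buckets so near points tend to collide, then answer a
# radius query by exact-checking only the candidates sharing a bucket.
# Single-table LSH can miss neighbors in adjacent buckets; it is approximate.
import numpy as np

class ProjectionLSH:
    def __init__(self, X, eps, n_projections=8, rng=np.random.default_rng(0)):
        self.X, self.eps = X, eps
        self.P = rng.normal(size=(X.shape[1], n_projections))
        self.keys = np.floor(X @ self.P / eps).astype(int)   # one bucket key per point
        self.buckets = {}
        for i, k in enumerate(map(tuple, self.keys)):
            self.buckets.setdefault(k, []).append(i)

    def query(self, i):
        cand = self.buckets.get(tuple(self.keys[i]), [])
        d = np.linalg.norm(self.X[cand] - self.X[i], axis=1)
        return [c for c, dc in zip(cand, d) if dc <= self.eps]  # exact filter
```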
2009 IEEE Symposium on Computational Intelligence and Data Mining, 2009
Clustering is the problem of finding relations in a data set in an unsupervised manner. These relations can be extracted using the density of a data set, where the density of a data point is defined as the number of data points around it. To find the number of data points around another point, region queries are adopted. Region queries are the most expensive construct in density-based algorithms, so they should be optimized to enhance the performance of density-based clustering algorithms, especially on large data sets. Finding the optimum set of region queries to cover all the data points has been proven to be NP-complete; this optimum set is called the skeletal points of a data set. In this paper, we propose a generic algorithm that fires at most 6 times the optimum number of region queries (i.e., it has an approximation factor of 6). We have also extended this generic algorithm to create a derivative of DBSCAN (the most well-known density-based algorithm), named ADBSCAN. The presented experimental results show that ADBSCAN is a better approximation to DBSCAN than DBRS (the most well-known randomized density-based algorithm) in terms of performance and quality of clustering, especially for large data sets.
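The covering objective can be conveyed with a naive greedy scheme: repeatedly fire a region query at some uncovered point and mark its neighborhood as covered. The sketch below illustrates the objective only and carries no approximation guarantee; it is not the paper's 6-approximation algorithm.

```python
# Naive greedy sketch of covering a data set with region queries: pick an
# uncovered point, fire one query, and mark everything it returns as covered.
import numpy as np

def greedy_region_cover(X, eps):
    covered = np.zeros(len(X), dtype=bool)
    query_points = []
    while not covered.all():
        i = int(np.flatnonzero(~covered)[0])         # any still-uncovered point
        query_points.append(i)
        covered |= np.linalg.norm(X - X[i], axis=1) <= eps
    return query_points                              # indices where queries fired
```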
arXiv, 2018
Density-based clustering techniques are used in a wide range of data mining applications. One of their most attractive features is that they require no prior knowledge of the number of clusters in a dataset or of their shapes. In this paper we propose a new algorithm named Linear DBSCAN (Lin-DBSCAN), a simple approach to clustering inspired by the density model introduced with the well-known algorithm DBSCAN. Designed to minimize the computational cost of density-based clustering on geospatial data, Lin-DBSCAN features a linear time complexity that makes it suitable for real-time applications on low-resource devices. Lin-DBSCAN uses a discrete version of the density model of DBSCAN that takes advantage of a grid-based scan-and-merge approach. The name of the algorithm stems exactly from these main features. The algorithm was tested on well-known data sets. Experimental results prove the efficiency and the validity of this approach over DBSCAN ...
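A grid-based scan-and-merge in the spirit of Lin-DBSCAN might look like the sketch below: bin points into cells, keep dense cells, and merge adjacent dense cells with union-find. The cell side, density threshold, and 8-neighbor merge rule are our assumptions, not the paper's exact choices.

```python
# Sketch of a grid scan-and-merge: bin points into cells of side eps, keep
# cells holding at least min_pts points, and union touching dense cells.
from collections import defaultdict
from itertools import product
import numpy as np

def grid_scan_merge(X, eps, min_pts):
    cells = defaultdict(list)
    for i, p in enumerate(X):
        cells[tuple(np.floor(p / eps).astype(int))].append(i)
    dense = {c for c, idx in cells.items() if len(idx) >= min_pts}

    parent = {c: c for c in dense}                   # union-find over dense cells
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]            # path halving
            c = parent[c]
        return c

    d = X.shape[1]
    for c in dense:                                  # merge adjacent dense cells
        for offs in product((-1, 0, 1), repeat=d):
            nb = tuple(a + b for a, b in zip(c, offs))
            if nb in dense:
                parent[find(c)] = find(nb)

    labels = np.full(len(X), -1)                     # points in sparse cells = noise
    roots = {}
    for c in dense:
        cid = roots.setdefault(find(c), len(roots))
        labels[cells[c]] = cid
    return labels
```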
18th International Conference on Pattern Recognition (ICPR'06), 2006
Density-based clustering techniques like DBSCAN can find arbitrarily shaped clusters along with noisy outliers. A severe drawback of the method is its huge time requirement, which makes it unsuitable for large data sets. One solution is to apply DBSCAN using only a few selected prototypes, but the clustering result can then deviate from the one obtained using the full data set. The novel method proposed in this paper uses two types of prototypes: one at a coarser level, meant to reduce the time requirement, and one at a finer level, meant to reduce the deviation of the result. Prototypes are derived using the leaders clustering method. The proposed hybrid clustering method, called l-DBSCAN, is analyzed and experimentally compared with DBSCAN, showing that it could be suitable for large data sets.
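The leaders clustering scan that derives the prototypes is a single pass over the data: a point joins the first leader within a threshold distance tau, otherwise it becomes a new leader. A minimal sketch, with tau as the coarseness knob:

```python
# One-pass leaders clustering sketch: each point joins the first leader within
# distance tau, or becomes a new leader itself. Larger tau = coarser prototypes.
import numpy as np

def leaders(X, tau):
    leader_idx, followers = [], []
    for i, p in enumerate(X):
        for j, L in enumerate(leader_idx):
            if np.linalg.norm(p - X[L]) <= tau:
                followers[j].append(i)   # p follows an existing leader
                break
        else:
            leader_idx.append(i)         # p becomes a new leader
            followers.append([i])
    return leader_idx, followers
```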
International Journal of Machine Learning and Computing, 2013
Clustering is an unsupervised learning problem: a procedure that partitions data objects into matching clusters, such that data objects in the same cluster are quite similar to each other and dissimilar to those in other clusters. Traditional algorithms do not meet the latest multiple requirements simultaneously. Density-based clustering algorithms find clusters based on the density of data points in a region. DBSCAN is one such density-based clustering algorithm; it can discover clusters of arbitrary shape and requires only two input parameters. In this paper, we propose a new algorithm based on DBSCAN. We design a new method for automatic parameter generation that creates clusters with different densities and generates arbitrarily shaped clusters. A kd-tree is used to increase memory efficiency. The performance of the proposed algorithm is compared with DBSCAN; experimental results indicate the superiority of the proposed algorithm.
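The kd-tree acceleration of region queries can be sketched with SciPy's cKDTree, whose query_ball_point answers all ε-neighborhood queries without a linear scan per point; the data and thresholds below are placeholders.

```python
# Sketch of kd-tree accelerated region queries for DBSCAN, using scipy's
# cKDTree; one batched query_ball_point call replaces n linear scans.
import numpy as np
from scipy.spatial import cKDTree

X = np.random.default_rng(0).random((1000, 2))
eps, min_pts = 0.05, 5
tree = cKDTree(X)
neighbors = tree.query_ball_point(X, r=eps)          # all region queries at once
core = np.array([len(nb) >= min_pts for nb in neighbors])
print(core.sum(), "core points")
```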
Density-based clustering forms the clusters of densely gathered objects separated by sparse regions. In this paper, we survey the previous and recent density-based clustering algorithms. DBSCAN [6], OPTICS [1], and DENCLUE [5, 6] are previous representative density-based clustering algorithms. Several recent algorithms such as PDBSCAN [8], CUDA-DClust [3], and GSCAN [7] have been proposed to improve the performance of DBSCAN. They make the most of multi-core CPUs and GPUs.
Over the last several years, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) has been widely used in many areas of science due to its simplicity and its ability to detect clusters of different sizes and shapes. However, the algorithm becomes unstable when detecting border objects of adjacent clusters, as was mentioned in the article that introduced the algorithm: the final clustering result obtained from DBSCAN depends on the order in which objects are processed in the course of the algorithm run. In this article, a modified version of the DBSCAN algorithm is proposed to solve this problem. Using the revised algorithm, the clustering results are considerably improved, in particular for data sets containing dense structures with connected clusters.
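One natural way to remove this order dependence, assigning each border point to its nearest core point once all core points are labeled, is sketched below; whether this matches the article's exact revision is an assumption on our part.

```python
# Sketch of an order-independent border assignment: after core points and
# their cluster labels are fixed, each border point goes to its nearest core
# point rather than to whichever cluster happened to reach it first.
import numpy as np

def assign_borders(X, core_idx, core_labels, border_idx):
    labels = {}
    for b in border_idx:
        d = np.linalg.norm(X[core_idx] - X[b], axis=1)
        labels[b] = core_labels[int(np.argmin(d))]   # nearest core point decides
    return labels
```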
2013
Density-based clustering is one of the primary methods for clustering in data mining. Clusters formed on the basis of density are easy to understand, and the approach does not limit itself to particular cluster shapes. DBSCAN is a well-known density-based clustering algorithm used for mining unsupervised data. The DBSCAN algorithm suffers from several deficiencies when the database size is large, and it does not respond well to data sets with varying densities; its worst-case complexity is O(n^2). The proposed novel algorithm NDCMD (A Unified Novel Density-Based Clustering Using Multidimensional Spatial Data) outperforms DBSCAN for varying densities and is motivated by the current state-of-the-art density clustering algorithm DBSCAN. Ultimately we show how to automatically and capably extract not only 'traditional' clustering information, such as representative points, but also the fundamental clustering structure. Extensive experiments on synthetic datasets show the validity of the proposed algorithm.
Proceedings of the 31st Annual ACM Symposium on Applied Computing - SAC '16, 2016
Clustering is a fundamental task in Knowledge Discovery and Data Mining. It aims to discover the unknown nature of data by grouping together data objects that are similar. While hundreds of clustering algorithms have been proposed, many are complex and do not scale well as more data become available, making them inadequate for analyzing very large datasets. In addition, many clustering algorithms are sequential and thus inherently difficult to parallelize. We propose PatchWork, a novel clustering algorithm to address those issues. PatchWork is a distributed density clustering algorithm with linear computational complexity and linear horizontal scalability. It presents several desirable characteristics for knowledge discovery: in particular, it does not require the number of clusters a priori and offers natural protection against outliers and noise. In addition, PatchWork makes it possible to discover spatially large clusters instead of dense clusters only. PatchWork relies on the map/reduce paradigm to parallelize computations and was implemented with Apache Spark, the distributed computation framework. As a result, PatchWork can cluster a billion points in only a few minutes, a 40x improvement over the distributed implementation of k-means in Spark MLLib.
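PatchWork's density-on-a-grid step maps naturally onto map/reduce: the map phase emits a (cell, 1) pair per point and the reduce phase sums counts per cell. A plain-Python stand-in is sketched below; a Spark version would use map followed by reduceByKey, and the density threshold here is an assumed placeholder.

```python
# Plain-Python sketch of the map/reduce flavor of grid density counting: the
# "map" step keys each point by its cell, the "reduce" step sums per cell.
from collections import Counter
import numpy as np

def cell_counts(X, cell_size):
    mapped = (tuple(np.floor(p / cell_size).astype(int)) for p in X)  # map
    return Counter(mapped)                                            # reduce

X = np.random.default_rng(0).random((10000, 2))
counts = cell_counts(X, 0.1)
dense = {c for c, n in counts.items() if n >= 50}   # assumed density threshold
print(len(dense), "dense cells")
```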
Lecture Notes in Computer Science, 2003
Data mining for spatial data has become increasingly important as more and more organizations are exposed to spatial data from sources such as remote sensing, geographical information systems, astronomy, computer cartography, environmental assessment and planning, etc. Recently, density-based clustering methods, such as DENCLUE, DBSCAN, and OPTICS, have been published and recognized as powerful clustering methods for data mining. These approaches have a run-time complexity of O(n log n) when using spatial index techniques such as the R+-tree and grid cells. However, these methods are known to lack scalability with respect to dimensionality. In this paper, a unique approach to efficient neighborhood search and a new efficient density-based clustering algorithm using EIN-rings are developed. Our approach exploits compressed vertical data structures, Peano trees (P-trees), and fast P-tree logical operations to accelerate the calculation of the density function within EIN-rings. This approach stands in contrast to the ubiquitous approach of vertically scanning horizontal data structures (records). The average run-time complexity of our algorithm for spatial data in d dimensions is O(dn√n). Our proposed method has cardinality scalability comparable with other density methods for small and medium data sizes, but superior speed and dimensional scalability.