Iterative clustering algorithms based on Lloyd's algorithm (often referred to as the k-means algorithm) have been used in a wide variety of areas, including graphics, computer vision, signal processing, compression, and computational geometry. We describe a method for accelerating many variants of iterative clustering by using programmable graphics hardware to perform the most computationally expensive portion of the work. In particular, we demonstrate significant speedups for k-means clustering (essential in vector quantization) and clustered principal component analysis. An additional contribution is a new hierarchical algorithm for k-means which performs less work than the brute-force algorithm but offers significantly more SIMD parallelism than the straightforward hierarchical approach.
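For readers unfamiliar with the structure being accelerated, here is a minimal NumPy sketch of one Lloyd iteration (an illustrative reconstruction, not the paper's GPU code); the assignment step and the centroid update are the two data-parallel phases that such papers map onto graphics hardware:

```python
# Minimal sketch of one Lloyd's iteration; illustrative only.
import numpy as np

def lloyd_step(X, centroids):
    """One assignment + update step. X: (n, d) data; centroids: (k, d)."""
    # Assignment step: label each point with its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (n, k)
    labels = dists.argmin(axis=1)
    # Update step: move each centroid to the mean of its assigned points.
    new_centroids = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(len(centroids))
    ])
    return labels, new_centroids
```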
Data Warehousing and Knowledge Discovery
We exploit the parallel architecture of the Graphics Processing Unit (GPU) found in desktops to efficiently implement the traditional k-means algorithm. Our approach avoids transferring data and cluster information between the GPU and CPU between iterations. In this paper we present the novelties of our approach and the techniques employed to represent data, compute distances and centroids, and identify cluster elements on the GPU. We measure performance using computational time per iteration as the metric. Our implementation of k-means clustering on an Nvidia 5900 graphics processor is 4 to 12 times faster than the CPU, and 7 to 22 times faster on the Nvidia 8500 graphics processor, for various data sizes. We also achieved speed gains of 12 to 64 times on the 5900 and 20 to 140 times on the 8500 in computational time per iteration for evaluations with various cluster sizes.
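The paper's kernels are not reproduced here, but the per-iteration centroid update that a GPU-resident k-means keeps on the device is, in essence, a scatter-add reduction; a sketch of that pattern in NumPy (hypothetical function name):

```python
# Hypothetical sketch: centroid update as a scatter-add, the reduction
# pattern a GPU-resident k-means performs on-device between iterations.
import numpy as np

def update_centroids(X, labels, k):
    d = X.shape[1]
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    np.add.at(sums, labels, X)      # scatter-add points into per-cluster sums
    np.add.at(counts, labels, 1.0)  # per-cluster point counts
    counts[counts == 0] = 1.0       # guard against empty clusters
    return sums / counts[:, None]
```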
… Conference on Parallel …, 2008
Graphics Processing Units (GPUs) have recently attracted research attention as efficient coprocessors for implementing many classes of highly parallel applications. The GPU's design is engineered for graphics applications, where many independent SIMD workloads are simultaneously dispatched to processing elements. While parallelism has been explored in the context of traditional CPU threads and SIMD processing elements, the principles involved in dividing the steps of a parallel algorithm for execution on GPU architectures remain a significant challenge. In this paper, we introduce a first step towards building an efficient GPU-based parallel implementation of the commonly used K-Means clustering algorithm on an NVIDIA G80 PCI Express graphics board using the CUDA processing extensions. Clustering algorithms are important for search, data mining, spam and intrusion detection applications. Modern desktop machines commonly include desktop search software that can be greatly enhanced by these advances, while low-power machines such as laptops can reduce power consumption by offloading these clustering and indexing operations to the video chip. Our preliminary results show over a 13x performance improvement, compared to a baseline 3 GHz Intel Pentium(R)-based PC running the same algorithm, with an average-spec G80 graphics card, the NVIDIA 8600GT. The low cost of these video cards (less than $100 market price as of 2008) and the high performance gains suggest that our approach is both practical and economical for common applications.
Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop - UCHPC-MAW '09, 2009
In this paper, we report our research on using GPUs to accelerate clustering of very large data sets, which are common in today's real-world applications. While many published works have shown that GPUs can be used to accelerate various general-purpose applications with respectable performance gains, few attempts have been made to tackle very large problems. Our goal here is to investigate whether GPUs can be useful accelerators even with very large data sets that cannot fit into the GPU's onboard memory.
Journal of Computer and System Sciences, 2013
Cluster analysis plays a critical role in a wide variety of applications, but it now faces a computational challenge due to continuously increasing data volumes. Parallel computing is one of the most promising solutions for overcoming this challenge. In this paper, we parallelize k-Means, one of the most popular clustering algorithms, using widely available Graphics Processing Units (GPUs). Unlike existing GPU-based k-Means algorithms, we observe that data dimensionality is an important factor to consider when parallelizing k-Means on GPUs. In particular, we use two different strategies for low-dimensional and high-dimensional data sets respectively, in order to make the best use of GPU computing horsepower. For low-dimensional data sets, we design an algorithm that exploits GPU on-chip registers to significantly decrease data access latency. For high-dimensional data sets, we design another novel algorithm that simulates matrix multiplication and exploits GPU on-chip shared memory to achieve a high compute-to-memory-access ratio. Our experimental results show that our GPU-based k-Means algorithms are three to eight times faster than the best reported GPU-based algorithms.
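The "simulated matrix multiplication" strategy for high-dimensional data plausibly rests on the standard identity ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, which turns the distance computation into a GEMM; a sketch under that assumption:

```python
# Sketch of the matrix-multiplication formulation of squared Euclidean
# distances: ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2. The cross term is
# a dense matrix product, exactly the workload GPUs excel at.
import numpy as np

def pairwise_sq_dists(X, C):
    """X: (n, d) points, C: (k, d) centroids -> (n, k) squared distances."""
    x_sq = (X ** 2).sum(axis=1)[:, None]  # (n, 1)
    c_sq = (C ** 2).sum(axis=1)[None, :]  # (1, k)
    cross = X @ C.T                       # (n, k) GEMM
    return x_sq - 2.0 * cross + c_sq
```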
2009
We explore the use of today's high-end graphics processing units on desktops to perform hierarchical agglomerative clustering with NVIDIA's Compute Unified Device Architecture (CUDA). Although the advancement in graphics cards has made the gaming industry flourish, there is much more to be gained in scientific computing, high performance computing, and their applications. Previous works have illustrated considerable speed gains in computing pairwise Euclidean distances between vectors, which is the fundamental operation in hierarchical clustering. We have used CUDA to implement the complete hierarchical agglomerative clustering algorithm and show almost double the speed gain using a much cheaper desktop graphics card. In this paper we briefly explain the highly parallel and internally distributed programming structure of CUDA. We explore CUDA's capabilities and propose methods to efficiently handle data within the graphics hardware for data-intense, data-independent, iterative or repetitive general-purpose algorithms such as hierarchical clustering. We achieved speed gains of about 30 to 65 times over the CPU implementation using microarray gene expressions.
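As a point of reference, a minimal serial sketch of single-linkage agglomeration (illustrative only; the paper's CUDA implementation and linkage choice are not reproduced here) shows where the O(n^2) distance work that the GPU parallelizes sits:

```python
# Serial sketch of single-linkage agglomerative clustering. The pairwise
# distance matrix and the repeated nearest-pair search are the hot spots
# a GPU version parallelizes.
import numpy as np

def single_linkage(X, target_clusters):
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    while len(clusters) > target_clusters:
        # Find the closest pair of clusters (single linkage = min point distance).
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a].extend(clusters[b])  # merge b into a
        del clusters[b]
    return clusters
```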
IJSRD, 2013
In today's digital world, data sets are increasing exponentially. Statistical analysis using clustering in various scientific and engineering applications has become a very challenging issue for such large data sets. Clustering huge data sets, and doing so with good performance, are two major demands for optimization. Parallelization is a well-known approach to optimizing performance. It has been observed from recent research that GPU-based parallelization helps achieve a high degree of performance. Hence, this work focuses on optimizing hierarchical clustering algorithms through parallelization, covering implementations of the optimized algorithm in several parallel environments: OpenMP on multi-core architectures and CUDA on many-core architectures.
Advances in Knowledge Discovery and Data …, 2010
Graphics Processing Units in today's desktops can well be thought of as high-performance parallel processors. Each single processor within the GPU is able to execute different tasks independently but concurrently. Such computational capabilities of the GPU are being exploited in the ...
2008
The exceptional growth of graphics hardware in programmability and data processing speed in the past few years has fuelled extensive research into using it for general-purpose computations beyond image-processing and gaming applications. We explore the use of graphics processors (GPUs) to speed up the computations involved in Fuzzy c-means (FCM). FCM is an important iterative clustering algorithm that usually performs better than k-means, but for large data sets it requires a substantial amount of time, which limits its applicability. FCM involves linear computations and repeated summations, and there is little reuse of the same data across FCM iterations (the cluster centres change in each iteration); these characteristics make it a good candidate for mapping to the parallel processors in the GPU to gain speed. We look at efficient methods for processing input data, handling intermediate results within the GPU with reusability of shader programs, and minimizing the use of GPU resources. Two previous implementations of FCM on the GPU are also analysed. Our implementation shows speed gains of over two orders of magnitude in computational time when compared with a recent generation of CPU under certain experimental conditions. This computational time includes both the processing time on the GPU and the data transfer time from the CPU to the GPU.
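The two FCM update rules referred to above are standard; a compact sketch, assuming a fuzzifier m > 1 and Euclidean distances (not the paper's shader code):

```python
# Sketch of the two Fuzzy c-means update rules. Both are dense,
# data-parallel computations, which is why they map well to GPUs.
import numpy as np

def fcm_step(X, centers, m=2.0, eps=1e-9):
    # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps  # (n, c)
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    u = 1.0 / ratio.sum(axis=2)                                            # (n, c)
    # Center update: c_j = sum_i u_ij^m x_i / sum_i u_ij^m
    w = u ** m
    centers = (w.T @ X) / w.sum(axis=0)[:, None]
    return u, centers
```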
International Journal of Artificial Intelligence & Applications, 2013
We explore the capabilities of today's high-end graphics processing units (GPUs) on desktop computers to efficiently perform hierarchical agglomerative clustering (HAC) through partitioning of gene expressions. Our focus is to significantly reduce the time and memory bottlenecks of the traditional HAC algorithm by parallelization and acceleration of computations without compromising the accuracy of clusters. We use partially overlapping partitions (PoP) to parallelize the HAC algorithm using the hardware capabilities of the GPU with the Compute Unified Device Architecture (CUDA). We compare the computational performance of the GPU against the CPU, and our experiments show that the GPU is much faster. The traditional HAC and partitioning-based HAC are up to 66 times and 442 times faster on the GPU, respectively, than the traditional HAC computations on a CPU. Moreover, PoP HAC on the GPU requires only a fraction of the memory required by the traditional algorithm on the CPU. The novelties in our research include boosting computational speed while utilizing GPU global memory, identifying the minimum-distance pair in virtually a single pass, avoiding the need to maintain huge data structures in memory, and completing the entire HAC computation within the GPU.
2017 New York Scientific Data Summit (NYSDS), 2017
Clustering has become an unavoidable step in big data analysis. It may be used to arrange data into a compact format, making operations on big data manageable. However, clustering of big data requires not only the capability of handling data with large volume and high dimensionality, but also the ability to process streaming data, all of which are less developed in most current algorithms. Furthermore, big data processing is seldom interactive, which is at odds with users who seek answers immediately. The best one can do is to process incrementally, such that partial and, hopefully, accurate results are available relatively quickly and are then progressively refined over time. We propose a clustering framework which uses multi-dimensional scaling for layout and GPU acceleration to accomplish these goals. Our domain application is the clustering of mass spectral data of individual aerosol particles, with 8 million data points of 450 dimensions each.
Procedia Computer Science, 2016
Due to the recent increase in the volume of generated data, organizing this data has become one of the biggest problems in Computer Science. Among the different strategies proposed to deal with it efficiently and effectively, we highlight those related to clustering, more specifically density-based clustering strategies, which stand out for their ability to define clusters of arbitrary shape and their robustness to data noise, such as DBSCAN and OPTICS. However, these algorithms remain a computational challenge since they are distance-based proposals. In this work we present a new approach to make OPTICS feasible, based on a data indexing strategy. Despite the simplicity with which the data are indexed, using graphs, this structure allows various parallelization opportunities to be explored, which we exploit using the graphics processing unit (GPU). Based on this structure, the complexity of OPTICS is reduced to O(E log V) in the worst case, making it very fast. In our evaluation we show that our proposal can be over 200x faster than its sequential CPU version.
Procedia Computer Science, 2013
With the advent of Web 2.0, we see a new and differentiated scenario: there is more data than can be effectively analyzed. Organizing this data has become one of the biggest problems in Computer Science. Many algorithms have been proposed for this purpose, notably clustering algorithms from the Data Mining area. However, these algorithms are still a computational challenge because of the volume of data that needs to be processed. Some proposals in the literature make these algorithms feasible, and recently those based on parallelization on graphics processing units (GPUs) have shown good results. In this work we present G-DBSCAN, a GPU-parallel version of DBSCAN, one of the most widely used clustering algorithms. Although other parallel versions of this algorithm exist, our technique distinguishes itself by the simplicity with which the data are indexed, using graphs, allowing various parallelization opportunities to be explored. In our evaluation we show that G-DBSCAN on a GPU can be over 100x faster than its sequential CPU version.
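A hedged sketch of the G-DBSCAN idea as described, eps-neighborhood graph construction followed by breadth-first cluster identification (serial here; both phases are the parallelization opportunities the paper explores):

```python
# Illustrative sketch: index the data as an eps-neighborhood graph, then
# label clusters by BFS from core points. Serial structure only; not the
# paper's GPU implementation.
import numpy as np
from collections import deque

def g_dbscan(X, eps, min_pts):
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    adj = [np.flatnonzero((D[i] <= eps) & (np.arange(n) != i)) for i in range(n)]
    core = [len(a) >= min_pts for a in adj]
    labels = np.full(n, -1)  # -1 = noise
    cluster = 0
    for s in range(n):
        if labels[s] != -1 or not core[s]:
            continue
        q = deque([s])
        labels[s] = cluster
        while q:              # BFS over the neighborhood graph
            u = q.popleft()
            if not core[u]:
                continue      # border points join but do not expand
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = cluster
                    q.append(v)
        cluster += 1
    return labels
```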
The growth in multicore CPUs and the emergence of powerful manycore GPUs have led to a proliferation of parallel applications, yet many applications are not straightforward to parallelize. This paper examines the performance of a parallelized implementation for calculating measurements of complex networks. We present an algorithm for calculating the clustering coefficient, a topological feature of complex networks, and executed serial, parallel, and parallel GPU implementations. A hash-table-based structure, different from the standard representation, was used for encoding the complex network's data; it also speeds up the parallel GPU implementations. Our results demonstrate that parallelizing the sequential implementation on a multicore CPU using OpenMP produces a significant speedup. Using OpenCL on a GPU produces an even larger speedup, depending on the volume of data being processed.
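For reference, the local clustering coefficient computed per node is C_i = 2 e_i / (k_i (k_i - 1)), where e_i counts edges among the k_i neighbors of node i; a minimal sketch (the paper's hash-table encoding is approximated here by a dict of neighbor sets):

```python
# Sketch of the local clustering coefficient C_i = 2 e_i / (k_i (k_i - 1)).
# Each node's value is independent of the others, which is what makes the
# computation a good OpenMP/OpenCL target.
def clustering_coefficient(adj, i):
    """adj: dict mapping node id (int) -> set of neighbor ids."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Count edges among neighbors, each unordered pair once.
    e = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
    return 2.0 * e / (k * (k - 1))
```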
Ingénierie des systèmes d'information, 2021
K-means++ is a clustering algorithm created to improve the selection of initial clusters in the K-means algorithm. The k-means++ algorithm selects the initial k centroids randomly, with probability proportional to each data point's distance to the already chosen centroids. The most noteworthy problem of this algorithm is that it runs sequentially, which reduces the speed of clustering. In this paper, we develop a new parallel k-means++ algorithm for graphics processing units (GPUs), where the Open Computing Language (OpenCL) platform is used as the programming environment to perform the data assignment phase in parallel, while Streaming SIMD Extensions (SSE) technology is used to perform the initialization step of selecting the initial centroids in parallel on the CPU. The focus is on optimizations directly targeted to this architecture to exploit the most of the available computing capabilities. Our objective is to minimize runtime while keeping ...
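The seeding rule being parallelized is standard k-means++: sample each next centroid with probability proportional to D(x)^2, the squared distance to the nearest centroid chosen so far. A minimal serial sketch (not the paper's SSE/OpenCL code):

```python
# Sketch of k-means++ seeding. The D(x)^2 computation over all points is
# the data-parallel part that SIMD/GPU implementations accelerate.
import numpy as np

def kmeanspp_init(X, k, rng=np.random.default_rng()):
    centroids = [X[rng.integers(len(X))]]  # first centroid chosen uniformly
    for _ in range(k - 1):
        d2 = np.min(
            [((X - c) ** 2).sum(axis=1) for c in centroids], axis=0
        )                                  # D(x)^2 per point
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```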
IEEE Access
This paper presents a new clustering algorithm, GPIC, a graphics processing unit (GPU) accelerated algorithm for power iteration clustering (PIC). Our algorithm is based on the original PIC proposal, adapted to take advantage of the GPU architecture while maintaining the algorithm's original properties. The proposed method was compared against the serial implementation, achieving a considerable speedup in tests with synthetic and real data sets. A significant volume of real application data (>10^7 records) was used, and we found that the GPIC implementation scales well to data sets with millions of data points. Our implementation efforts are directed towards two aspects: processing large data sets in less time and maintaining the same quality of cluster results generated by the original PIC version.
INDEX TERMS: Scalable machine learning algorithms, GPU, power iteration clustering.
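PIC's core loop, as in the original proposal, repeatedly applies the row-normalized affinity matrix to a vector and then clusters the near-converged one-dimensional embedding with k-means; a minimal sketch (initialization and stopping criterion simplified from the original):

```python
# Sketch of the power iteration at the heart of PIC. The repeated dense
# matrix-vector products are the workload a GPU version accelerates.
import numpy as np

def pic_embedding(A, iters=50):
    """A: (n, n) symmetric affinity matrix with zero diagonal."""
    W = A / A.sum(axis=1, keepdims=True)  # row-normalize affinities
    v = np.ones(len(A)) / len(A)
    for _ in range(iters):
        v = W @ v
        v /= np.abs(v).sum()              # L1 normalization each step
    return v                              # cluster v with k-means afterwards
```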
2013 IEEE International Conference on Big Data, 2013
Clustering is an important preparation step in big data processing. It may even be used to detect redundant data points as well as outliers. Elimination of redundant data and duplicates can serve as a viable means of data reduction, and it can also aid in sampling. Visual feedback is very valuable here to give users confidence in this process. Furthermore, big data preprocessing is seldom interactive, which is at odds with users who seek answers immediately. The best one can do is incremental preprocessing, in which partial and hopefully quite accurate results become available relatively quickly and are then refined over time. We propose a correlation clustering framework which uses MDS for layout and GPU acceleration to accomplish these goals. Our domain application is the correlation clustering of atmospheric mass spectrum data with 8 million data points of 450 dimensions each.
2016
K-Means is probably the leading clustering algorithm, with several applications in varying fields such as image processing and pattern analysis. K-Means has been the basis for several clustering algorithms, including Kernel K-Means. In machine intelligence and related domains, kernelization transforms the data into a higher-dimensional feature space by calculating the inner products between the different pairs of data in that space. This work targets Kernel K-Means and presents parallel implementations of the clustering algorithm using CUDA on the GPU and OpenMP, Cilk Plus, and BLAS on the CPU. The implementations are tested on different datasets, leading to different speedups, with CUDA achieving the fastest runtimes.
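The feature-space distance that kernel K-Means evaluates expands, via the kernel trick, into the pairwise kernel sums that dominate the computation and map naturally to the GEMM-style GPU/BLAS work mentioned above:

```latex
% Squared distance from \phi(x) to the implicit mean m_c of cluster c
% with |c| members, expanded so only kernel evaluations K(\cdot,\cdot)
% remain; no explicit feature map is ever computed.
\[
\left\| \phi(x) - m_c \right\|^2
  = K(x, x)
  - \frac{2}{|c|} \sum_{y \in c} K(x, y)
  + \frac{1}{|c|^2} \sum_{y, y' \in c} K(y, y')
\]
```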
Lecture Notes in Computer Science, 2013
The field of similarity search on metric spaces has been widely studied in recent years, mainly because it has proven suitable for a number of application domains such as multimedia retrieval and computational biology, just to name a few. To achieve efficient query execution throughput, it is essential to exploit the intrinsic parallelism in the respective search algorithms. Many strategies have been proposed in the literature to parallelize these algorithms on either shared or distributed memory multiprocessor systems. More recently, GPUs have been proposed to evaluate similarity queries for small indexes that fit completely in GPU memory. However, most real databases in production are much larger. In this paper, we propose multi-GPU metric space techniques capable of performing similarity search on datasets large enough not to fit in the memory of the GPUs. Specifically, we implemented a hybrid algorithm which makes use of CPU cores and GPUs in a pipeline. We also present a hierarchical multi-level index named List of Superclusters (LSC), with suitable properties for memory transfer on a GPU.
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020
Today's applications generate a large amount of data that needs to be processed by learning algorithms. In practice, the majority of the data are not associated with any labels. Unsupervised learning methods, i.e., clustering, are the most commonly used algorithms for data analysis. However, running clustering algorithms on traditional cores results in high energy consumption and slow processing speed due to the large amount of data movement between memory and processing units. In this paper, we propose DUAL, a Digital-based Unsupervised learning AcceLeration, which supports a wide range of popular algorithms on conventional crossbar memory. Instead of working with the original data, DUAL maps all data points into high-dimensional space, replacing complex clustering operations with memory-friendly operations. We accordingly design a PIM-based architecture that supports all essential operations in a highly parallel and scalable way, enabling in-place computation so that data points remain in memory. We have evaluated DUAL on several popular clustering algorithms for a wide range of large-scale datasets. Our evaluation shows that DUAL provides comparable quality to existing clustering algorithms while using a binary representation and a simplified distance metric. DUAL also provides a 58.8x speedup and a 251.2x energy efficiency improvement compared to the state-of-the-art solution running on a GPU.
2011
Hierarchical clustering is an important and powerful but computationally expensive operation. This motivates the exploration of both highly parallel approaches, such as those available on Graphics Processing Units, and low-complexity algorithms such as Adaptive Resonance Theory (ART). Although ART has been implemented on GPUs, this is the first hierarchical ART GPU implementation we are aware of. Each ART layer is distributed across the GPU multiprocessors and trained simultaneously. The experimental results show that for deep trees, the GPU performance advantage is significant.