2011, Procedia Computer Science
With the development of many-core GPUs offering very large memory bandwidth and computational power, there has been substantial interest in the scientific and engineering computing community in speeding up CPU-intensive tasks on graphics processing units (GPUs). Cluster analysis is a widely used technique for grouping a set of objects into classes of "similar" objects, commonly applied in fields such as data mining, bioinformatics and pattern recognition. WaveCluster defines a cluster as a dense region consisting of connected components in the transformed feature space. In this study, we present a GPU-level parallelization of the WaveCluster algorithm, a novel clustering approach based on the wavelet transform, and investigate its parallel performance on very large spatial datasets. CUDA implementations of the two main sub-algorithms of the WaveCluster approach, namely extraction of the low-frequency component from the signal using the wavelet transform and connected component labeling, are presented, and the corresponding performance evaluations are reported for each sub-algorithm. A divide-and-conquer approach is followed in the implementation of the wavelet transform, and a multi-pass sliding-window approach in the implementation of connected component labeling. The maximum kernel speedup achieved is 107x for the extraction of the low-frequency component and 6x for connected component labeling, relative to the sequential algorithms running on the CPU.
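As an illustration of the first sub-algorithm named above, the following is a minimal CUDA sketch of extracting a low-frequency (approximation) component from a quantized 2D feature grid with a one-level Haar-style low-pass step. The paper's actual wavelet choice, divide-and-conquer decomposition and grid layout are not reproduced here; all names and sizes are illustrative assumptions.

```cuda
// Minimal sketch: one-level low-pass extraction on a 2D feature grid.
// Each output cell is the scaled average (Haar-style approximation) of a
// 2x2 block of input cells. Sizes and names are illustrative assumptions.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void haarLowPass2D(const float* in, float* out, int inW, int inH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // output column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // output row
    int outW = inW / 2, outH = inH / 2;
    if (x >= outW || y >= outH) return;

    int ix = 2 * x, iy = 2 * y;
    float s = in[iy * inW + ix]       + in[iy * inW + ix + 1] +
              in[(iy + 1) * inW + ix] + in[(iy + 1) * inW + ix + 1];
    out[y * outW + x] = 0.25f * s;                   // approximation coefficient
}

int main()
{
    const int W = 1024, H = 1024;                    // assumed grid size
    size_t inBytes  = (size_t)W * H * sizeof(float);
    size_t outBytes = (size_t)(W / 2) * (H / 2) * sizeof(float);

    float* hIn = (float*)malloc(inBytes);
    for (int i = 0; i < W * H; ++i) hIn[i] = (float)(i % 7);   // toy cell densities

    float *dIn, *dOut;
    cudaMalloc(&dIn, inBytes);
    cudaMalloc(&dOut, outBytes);
    cudaMemcpy(dIn, hIn, inBytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((W / 2 + block.x - 1) / block.x, (H / 2 + block.y - 1) / block.y);
    haarLowPass2D<<<grid, block>>>(dIn, dOut, W, H);
    cudaDeviceSynchronize();

    float sample;
    cudaMemcpy(&sample, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", sample);

    cudaFree(dIn); cudaFree(dOut); free(hIn);
    return 0;
}
```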
Journal of Parallel and Distributed Computing, 2011
A linear-scaling parallel clustering algorithm implementation and its application to very large datasets for cluster analysis is reported. WaveCluster is a novel clustering approach based on wavelet transforms. Although this approach can detect clusters of arbitrary shape efficiently, it requires a considerable amount of time to produce results for large multi-dimensional datasets. We propose a parallel implementation of the WaveCluster algorithm based on the message-passing model for a distributed-memory multiprocessor system. In the proposed method, communication among processors and memory requirements are kept to a minimum to achieve high efficiency. We conducted experiments on a dense dataset and a sparse dataset to characterize the algorithm's behavior. Our experimental results demonstrate that the developed parallel WaveCluster algorithm achieves high speedup and scales linearly with an increasing number of processors.
IJSRD, 2013
In today's digital world, data sets are growing exponentially. Statistical analysis using clustering in various scientific and engineering applications has become a very challenging issue for such large data sets. Clustering of huge data sets and its performance are two major factors that demand optimization. Parallelization is a well-known approach to optimizing performance, and recent research has shown that GPU-based parallelization helps achieve a high degree of performance. Hence, this thesis focuses on optimizing hierarchical clustering algorithms through parallelization. It covers the implementation of the optimized algorithm in various parallel environments, using OpenMP on multi-core architectures and CUDA on many-core architectures.
… Conference on Parallel …, 2008
Graphics Processing Units (GPUs) have recently been the subject of attention in research as efficient coprocessors for implementing many classes of highly parallel applications. The GPU's design is engineered for graphics applications, where many independent SIMD workloads are simultaneously dispatched to processing elements. While parallelism has been explored in the context of traditional CPU threads and SIMD processing elements, the principles involved in dividing the steps of a parallel algorithm for execution on GPU architectures remain a significant challenge. In this paper, we introduce a first step towards building an efficient GPU-based parallel implementation of a commonly used clustering algorithm called K-Means on an NVIDIA G80 PCI Express graphics board using the CUDA processing extensions. Clustering algorithms are important for search, data mining, spam and intrusion detection applications. Modern desktop machines commonly include desktop search software that can be greatly enhanced by these advances, while low-power machines such as laptops can reduce power consumption by utilizing the video chip for these clustering and indexing operations. Our preliminary results show over a 13x performance improvement compared to a baseline 3 GHz Intel Pentium(R) based PC running the same algorithm with an average-spec G80 graphics card, the NVIDIA 8600GT. The low cost of these video cards (less than $100 market price as of 2008) and the high performance gains suggest that our approach is both practical and economical for common applications.
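For illustration, here is a minimal CUDA sketch of the most data-parallel step of K-Means, assigning each point to its nearest centroid with one thread per point. The array names, problem sizes and the omission of the centroid-update step are assumptions, not the paper's implementation.

```cuda
// Minimal sketch of the k-means assignment step: one thread per data point
// computes the nearest centroid by squared Euclidean distance. The centroid
// update is assumed to happen elsewhere (e.g. on the host) between iterations.
#include <cfloat>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void assignClusters(const float* points,    // n x d, row-major
                               const float* centroids, // k x d, row-major
                               int* labels, int n, int d, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int best = 0;
    float bestDist = FLT_MAX;
    for (int c = 0; c < k; ++c) {
        float dist = 0.0f;
        for (int j = 0; j < d; ++j) {
            float diff = points[i * d + j] - centroids[c * d + j];
            dist += diff * diff;                 // squared Euclidean distance
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
    }
    labels[i] = best;
}

int main()
{
    const int n = 100000, d = 8, k = 16;             // assumed problem sizes
    size_t pBytes = (size_t)n * d * sizeof(float);
    size_t cBytes = (size_t)k * d * sizeof(float);

    float* hPoints = (float*)malloc(pBytes);
    float* hCentroids = (float*)malloc(cBytes);
    for (int i = 0; i < n * d; ++i) hPoints[i] = (float)(i % 97);
    for (int i = 0; i < k * d; ++i) hCentroids[i] = (float)(i % 11);

    float *dPoints, *dCentroids;
    int* dLabels;
    cudaMalloc(&dPoints, pBytes);
    cudaMalloc(&dCentroids, cBytes);
    cudaMalloc(&dLabels, n * sizeof(int));
    cudaMemcpy(dPoints, hPoints, pBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dCentroids, hCentroids, cBytes, cudaMemcpyHostToDevice);

    assignClusters<<<(n + 255) / 256, 256>>>(dPoints, dCentroids, dLabels, n, d, k);
    cudaDeviceSynchronize();

    int label0;
    cudaMemcpy(&label0, dLabels, sizeof(int), cudaMemcpyDeviceToHost);
    printf("point 0 assigned to cluster %d\n", label0);

    cudaFree(dPoints); cudaFree(dCentroids); cudaFree(dLabels);
    free(hPoints); free(hCentroids);
    return 0;
}
```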
Advances in Knowledge Discovery and Data …, 2010
Graphics Processing Units in today's desktops can well be thought of as high-performance parallel processors. Each single processor within the GPU is able to execute different tasks independently but concurrently. Such computational capabilities of the GPU are being exploited in the ...
2009
We explore the use of today's high-end graphics processing units on desktops to perform hierarchical agglomerative clustering with NVIDIA's Compute Unified Device Architecture (CUDA). Although the advancement in graphics cards has made the gaming industry flourish, there is a lot more to be gained in the fields of scientific computing and high performance computing and their applications. Previous works have illustrated considerable speed gains in computing pairwise Euclidean distances between vectors, which is the fundamental operation in hierarchical clustering. We have used CUDA to implement the complete hierarchical agglomerative clustering algorithm and show almost double the speed gain using a much cheaper desktop graphics card. In this paper we briefly explain the highly parallel and internally distributed programming structure of CUDA. We explore CUDA capabilities and propose methods to efficiently handle data within the graphics hardware for data-intense, data-independent, iterative or repetitive general-purpose algorithms such as hierarchical clustering. We achieved speed gains of about 30 to 65 times over the CPU implementation using microarray gene expressions.
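Since the abstract identifies pairwise Euclidean distance computation as the fundamental operation of hierarchical clustering, the following minimal CUDA sketch fills a full n x n distance matrix with one thread per (i, j) pair. The names and sizes are illustrative assumptions, not the authors' code.

```cuda
// Minimal sketch of a pairwise Euclidean distance matrix on the GPU:
// thread (i, j) computes the distance between points i and j.
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void pairwiseDistances(const float* data, // n x d, row-major
                                  float* dist,       // n x n output
                                  int n, int d)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || j >= n) return;

    float sum = 0.0f;
    for (int k = 0; k < d; ++k) {
        float diff = data[i * d + k] - data[j * d + k];
        sum += diff * diff;
    }
    dist[i * n + j] = sqrtf(sum);
}

int main()
{
    const int n = 512, d = 32;                       // assumed sizes
    size_t dataBytes = (size_t)n * d * sizeof(float);
    size_t distBytes = (size_t)n * n * sizeof(float);

    float* hData = (float*)malloc(dataBytes);
    for (int i = 0; i < n * d; ++i) hData[i] = (float)(i % 13);

    float *dData, *dDist;
    cudaMalloc(&dData, dataBytes);
    cudaMalloc(&dDist, distBytes);
    cudaMemcpy(dData, hData, dataBytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    pairwiseDistances<<<grid, block>>>(dData, dDist, n, d);
    cudaDeviceSynchronize();

    float d01;
    cudaMemcpy(&d01, dDist + 1, sizeof(float), cudaMemcpyDeviceToHost);
    printf("dist(0,1) = %f\n", d01);

    cudaFree(dData); cudaFree(dDist); free(hData);
    return 0;
}
```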
The Journal of Supercomputing
Recent developments in Graphics Processing Units (GPUs) have enabled inexpensive high performance computing for general-purpose applications. The Compute Unified Device Architecture (CUDA) programming model provides programmers with adequate C-like APIs to better exploit the parallel power of the GPU. Data mining is widely used and has significant applications in various domains. However, current data mining toolkits cannot meet the requirements of applications with large-scale databases in terms of speed. In this paper, we propose three techniques to speed up fundamental problems in data mining algorithms on the CUDA platform: a scalable thread scheduling scheme for irregular patterns, a parallel distributed top-k scheme, and a parallel high-dimension reduction scheme. They play a key role in our CUDA-based implementations of three representative data mining algorithms: CU-Apriori, CU-KNN, and CU-K-means. These parallel implementations significantly outperform other state-of-the-art implementations on an HP xw8600 workstation with a Tesla C1060 GPU and a quad-core Intel Xeon CPU. Our results show that the GPU + CUDA parallel architecture is feasible and promising for data mining applications.
Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop - UCHPC-MAW '09, 2009
In this paper, we report our research on using GPUs to accelerate clustering of very large data sets, which are common in today's real-world applications. While many published works have shown that GPUs can be used to accelerate various general-purpose applications with respectable performance gains, few attempts have been made to tackle very large problems. Our goal here is to investigate whether GPUs can be useful accelerators even for very large data sets that cannot fit into the GPU's onboard memory.
International Journal of Artificial Intelligence & Applications, 2013
We explore the capabilities of today's high-end graphics processing units (GPUs) on desktop computers to efficiently perform hierarchical agglomerative clustering (HAC) through partitioning of gene expressions. Our focus is to significantly reduce the time and memory bottlenecks of the traditional HAC algorithm by parallelization and acceleration of computations without compromising the accuracy of clusters. We use partially overlapping partitions (PoP) to parallelize the HAC algorithm using the hardware capabilities of the GPU with the Compute Unified Device Architecture (CUDA). We compare the computational performance of the GPU with the CPU, and our experiments show that the GPU is much faster. The traditional HAC and partitioning-based HAC are up to 66 times and 442 times faster on the GPU, respectively, than the time taken by a CPU for the traditional HAC computations. Moreover, the PoP HAC on the GPU requires only a fraction of the memory required by the traditional algorithm on the CPU. The novelties in our research include boosting computational speed while utilizing GPU global memory, identifying the minimum-distance pair in virtually a single pass, avoiding the necessity to maintain huge data in memory, and completing the entire HAC computation within the GPU.
Concurrency and Computation: Practice and Experience, 2019
Classification and clustering techniques are used in different applications. Large-scale big data applications such as social network analysis need to process large data chunks in a short time, and classification and clustering tasks in such applications consume a lot of processing time. Improving the performance of classification and clustering algorithms enhances the performance of the applications that use them. This paper introduces an approach for exploiting the graphics processing unit (GPU) platform to improve the performance of classification and clustering algorithms. The proposed approach uses two GPU implementations: a pure (GPU-only) implementation and a GPU-CPU hybrid implementation. The results show that the hybrid implementation, which optimizes subtask scheduling for both the CPU and the GPU processing elements, outperforms the approach that uses only the GPU.
2014 1st International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), 2014
The discrete wavelet transform (DWT) has diverse applications in signal and image processing. In this paper, we have implemented the lifting-based "Le Gall 5/3" algorithm on a low-cost NVIDIA GPU (graphics processing unit) with MATLAB to achieve a speedup in computation. The efficiency of our GPU-based implementation is measured and compared with CPU-based algorithms. Our experimental results on the GPU show a performance improvement by a factor of more than 1.52 compared with the CPU for an image of size 3072x3072.
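The abstract names the lifting-based Le Gall 5/3 transform; below is a minimal CUDA sketch of one decomposition level expressed as separate predict and update kernels, so that each step sees fully written neighbours. The float (rather than integer) arithmetic, clamped symmetric border and signal length are illustrative assumptions; the paper's MATLAB/GPU implementation details are not reproduced.

```cuda
// Minimal sketch of one level of the Le Gall 5/3 lifting scheme on the GPU.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Predict step: detail d[i] = odd sample minus the average of its even neighbours.
__global__ void liftPredict53(const float* x, float* detail, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n / 2) return;
    int left  = 2 * i;
    int right = (2 * i + 2 < n) ? 2 * i + 2 : n - 2;   // symmetric clamp at the edge
    detail[i] = x[2 * i + 1] - 0.5f * (x[left] + x[right]);
}

// Update step: approximation s[i] = even sample plus a quarter of neighbouring details.
__global__ void liftUpdate53(const float* x, const float* detail, float* approx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n / 2) return;
    float dPrev = (i > 0) ? detail[i - 1] : detail[0]; // symmetric clamp at the edge
    approx[i] = x[2 * i] + 0.25f * (dPrev + detail[i]);
}

int main()
{
    const int n = 1 << 20;                              // assumed even signal length
    size_t full = n * sizeof(float), half = (n / 2) * sizeof(float);

    float* hx = (float*)malloc(full);
    for (int i = 0; i < n; ++i) hx[i] = (float)(i % 31);

    float *dx, *dDetail, *dApprox;
    cudaMalloc(&dx, full);
    cudaMalloc(&dDetail, half);
    cudaMalloc(&dApprox, half);
    cudaMemcpy(dx, hx, full, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n / 2 + threads - 1) / threads;
    liftPredict53<<<blocks, threads>>>(dx, dDetail, n);
    liftUpdate53<<<blocks, threads>>>(dx, dDetail, dApprox, n);
    cudaDeviceSynchronize();

    float a0;
    cudaMemcpy(&a0, dApprox, sizeof(float), cudaMemcpyDeviceToHost);
    printf("approx[0] = %f\n", a0);

    cudaFree(dx); cudaFree(dDetail); cudaFree(dApprox); free(hx);
    return 0;
}
```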
IEEE Access
This paper presents a new clustering algorithm, GPIC, a graphics processing unit (GPU) accelerated algorithm for power iteration clustering (PIC). Our algorithm is based on the original PIC proposal, adapted to take advantage of the GPU architecture while maintaining the algorithm's original properties. The proposed method was compared against the serial implementation, achieving a considerable speedup in tests with synthetic and real data sets. A significant volume of real application data (>10^7 records) was used, and we found that the GPIC implementation has good scalability for handling data sets with millions of data points. Our implementation efforts are directed towards two aspects: processing large data sets in less time and maintaining the same quality of cluster results as the original PIC version. Index terms: scalable machine learning algorithms, GPU, power iteration clustering.
Data Warehousing and Knowledge Discovery
We exploit the parallel architecture of the Graphics Processing Unit (GPU) used in desktops to efficiently implement the traditional K-means algorithm. Our clustering approach avoids the need to transfer data and cluster information between the GPU and CPU between iterations. In this paper we present the novelties in our approach and the techniques employed to represent data, compute distances, compute centroids and identify the cluster elements using the GPU. We measure performance using the metric of computational time per iteration. Our implementation of k-means clustering on an Nvidia 5900 graphics processor is 4 to 12 times faster than the CPU, and 7 to 22 times faster on the Nvidia 8500 graphics processor, for various data sizes. We also achieved 12 to 64 times speed gain on the 5900 and 20 to 140 times speed gain on the 8500 graphics processor in computational time per iteration for evaluations with various cluster sizes.
Procedia Computer Science, 2010
GPUs have recently attracted attention as accelerators for a wide variety of algorithms, including assorted examples within the image analysis field. Among them, wavelets are gaining popularity as solid tools for data mining and video compression, though this comes at the expense of a high computational cost. After proving the effectiveness of the GPU for accelerating the 2D Fast Wavelet Transform [1], we present in this paper a novel implementation on many-core GPUs and multicore CPUs for high-performance computation of the 3D Fast Wavelet Transform (3D-FWT). This algorithm poses a challenging access pattern on matrix operators demanding high sustainable bandwidth, as well as mathematical functions with remarkable arithmetic intensity on ALUs and FPUs. On the GPU side, we focus on CUDA programming to develop methods for an efficient mapping onto many-cores and to fully exploit the memory hierarchy, whose management is explicit to the programmer. On multicore CPUs, OpenMP and Pthreads are used as counterparts to maximize parallelism, and renowned techniques like tiling and blocking are exploited to optimize the use of memory. Experimental results on an Nvidia Tesla C870 GPU and an Intel Core 2 Quad Q6700 CPU indicate that our implementation runs three times faster on the Tesla, and up to fifteen times faster when communications are neglected, which enables the GPU to process video in real time in many applications where the 3D-FWT is involved.
Procedia Computer Science, 2016
Due to the recent increase in the volume of data being generated, organizing this data has become one of the biggest problems in computer science. Among the different strategies proposed to deal with this efficiently and effectively, we highlight those related to clustering, more specifically density-based clustering strategies such as DBSCAN and OPTICS, which stand out for their ability to define clusters of arbitrary shape and their robustness to data noise. However, these algorithms are still a computational challenge since they are distance-based proposals. In this work we present a new approach to make OPTICS feasible, based on a data indexing strategy. Despite the simplicity with which the data are indexed, using graphs, the structure allows us to explore various parallelization opportunities, which we exploit using a graphics processing unit (GPU). Based on this structure, the complexity of OPTICS is reduced to O(E log V) in the worst case, making it very fast. In our evaluation we show that our proposal can be over 200x faster than its sequential CPU version.
This paper illustrates the design and performance evaluation of several algorithms used for analyzing magnetic resonance imaging (MRI) datasets on massively parallel graphics processing units (GPUs) using the compute unified device architecture (CUDA) programming model, among them MRI brain image segmentation and feature extraction. Efficient segmentation of such huge medical data is very expensive. Feature extraction obtains statistical features from the MRI image for classification; based on the classification, abnormal and tumour slices are identified in the MRI datasets. A serial central processing unit (CPU) does not have enough computational power to process some types of complex algorithms on huge medical data, whereas the GPU can solve large data-parallel problems on such data much faster than the CPU. MRI image volumes are good candidates for GPU implementation, since parallelism is naturally provided by the proposed per-slice threading (PST) model. In this paper, simple algorithms such as K-means segmentation and feature extraction for MRI volumes were implemented, and the speedup gained in GPU computing was assessed. Our experiments show that the GPU-based implementation is 10x-30x faster than a conventional CPU processor under the PST model.
2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), 2018
In this paper the authors analyze the effectiveness of parallel graphics processing unit (GPU) realizations of the discrete wavelet transform (DWT) using the lattice structure and the matrix-based approach. Experimental verification shows that, in general, for smaller input vector sizes combined with larger filter lengths, DWT computation based on the direct approach using direct matrix multiplication is significantly faster than the application of the lattice structure factorization, while for large vector sizes the lattice structure becomes more effective. The detailed results define performance boundaries for both implementations and determine the most advantageous situations in which one might use a given approach. The results also include a comparative analysis of the time efficiency of the presented methods for two different GPU architectures. The presented effectiveness characteristics of the considered realizations of the two selected DWT computation methods allow the proper choice of method in future applications depending on input data sizes, filter lengths and the underlying GPU architecture.
Image processing and pattern recognition algorithms take more time to execute on a single-core processor. The Graphics Processing Unit (GPU) is popular nowadays due to its speed, programmability, low cost and the large number of execution cores built into it, and many researchers have started using GPUs alongside single-core computer systems to speed up the execution of algorithms. In the field of content-based medical image retrieval (CBMIR), the Euclidean and Mahalanobis distances play an important role, because the distance formula is what matches images against each other. In this research work, we parallelized the Euclidean distance algorithm on CUDA. The CPU version ran on an Intel® Dual-Core E5500 @ 2.80 GHz with 2.0 GB of main memory under Windows XP (SP2). The next step was to convert this code to run on a GPU, an NVIDIA GeForce 9500GT with 1023 MB of DDR2 video memory and a 64-bit bus, using the 270.81 series NVIDIA graphics driver. Both the CPU and GPU versions of the algorithm were implemented in MATLAB R2010: the CPU version in plain MATLAB and the GPU version with the help of the intermediate software Jacket-win-1.3.0. To use Jacket, we had to make some changes to our source code so that the CPU and GPU work simultaneously, thus reducing the overall computation time. Our work employs extensive usage of the highly multithreaded architecture of the many-core GPU. Efficient use of shared memory is required to optimize parallel reduction in the Compute Unified Device Architecture (CUDA); GPUs are emerging as powerful parallel systems at a cheap cost of a few thousand rupees.
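The abstract notes that efficient use of shared memory is needed to optimize parallel reduction in CUDA. As a hedged illustration, the sketch below accumulates the squared Euclidean distance between two long feature vectors with a block-wide shared-memory tree reduction. The vector length, block size and single-block launch are assumptions; a multi-block version would add a second pass over per-block partial sums.

```cuda
// Minimal sketch: squared Euclidean distance via shared-memory parallel reduction.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS 256

__global__ void euclideanSquared(const float* a, const float* b, float* result, int d)
{
    __shared__ float partial[THREADS];

    // Each thread accumulates a strided slice of the squared differences.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < d; i += blockDim.x) {
        float diff = a[i] - b[i];
        sum += diff * diff;
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) *result = partial[0];      // squared distance
}

int main()
{
    const int d = 1 << 16;                           // assumed feature vector length
    size_t bytes = d * sizeof(float);
    float* h = (float*)malloc(bytes);
    for (int i = 0; i < d; ++i) h[i] = 1.0f;         // a and b differ by 1 everywhere

    float *da, *db, *dres;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dres, sizeof(float));
    cudaMemcpy(da, h, bytes, cudaMemcpyHostToDevice);
    cudaMemset(db, 0, bytes);

    euclideanSquared<<<1, THREADS>>>(da, db, dres, d);
    cudaDeviceSynchronize();

    float out;
    cudaMemcpy(&out, dres, sizeof(float), cudaMemcpyDeviceToHost);
    printf("squared distance = %f (expected %d)\n", out, d);

    cudaFree(da); cudaFree(db); cudaFree(dres); free(h);
    return 0;
}
```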
Iterative clustering algorithms based on Lloyd's algorithm (often referred to as the k-means algorithm) have been used in a wide variety of areas, including graphics, computer vision, signal processing, compression, and computational geometry. We describe a method for accelerating many variants of iterative clustering by using programmable graphics hardware to perform the most computationally expensive portion of the work. In particular, we demonstrate significant speedups for k-means clustering (essential in vector quantization) and clustered principal component analysis. An additional contribution is a new hierarchical algorithm for k-means which performs less work than the brute-force algorithm, but which offers significantly more SIMD parallelism than the straightforward hierarchical approach.
Parallel Computing, 2011
In this work we propose several parallel algorithms to compute the two-dimensional discrete wavelet transform (2D-DWT), exploiting the available hardware resources. In particular, we explore OpenMP-optimized versions of the 2D-DWT on a multicore platform, and we also develop CUDA-based 2D-DWT algorithms able to run on GPUs (graphics processing units). The proposed algorithms are based on several 2D-DWT computation approaches: (1) filter-bank convolution, (2) the lifting transform and (3) matrix convolution, so we can determine which of them best adapts to our parallel versions. All proposed algorithms are based on the Daubechies 9/7 filter, which is widely used in image/video compression.
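As an illustration of the filter-bank convolution approach listed above, the following minimal CUDA sketch performs one horizontal analysis pass that convolves each image row with the 9-tap CDF 9/7 analysis low-pass filter and downsamples by two. The coefficient values are the commonly cited ones; the image size, symmetric border handling and the omission of the high-pass and vertical passes are assumptions, not the authors' implementation.

```cuda
// Minimal sketch: horizontal low-pass analysis pass of a 9/7 filter-bank 2D-DWT.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__constant__ float LP97[9] = {  // symmetric 9/7 analysis low-pass taps, centre at index 4
     0.026748757411f, -0.016864118443f, -0.078223266529f, 0.266864118443f,
     0.602949018236f,
     0.266864118443f, -0.078223266529f, -0.016864118443f, 0.026748757411f };

__global__ void rowLowPass97(const float* in, float* out, int width, int height)
{
    int ox = blockIdx.x * blockDim.x + threadIdx.x;   // output (downsampled) column
    int y  = blockIdx.y * blockDim.y + threadIdx.y;   // row
    int outW = width / 2;
    if (ox >= outW || y >= height) return;

    float acc = 0.0f;
    for (int t = -4; t <= 4; ++t) {
        int x = 2 * ox + t;
        if (x < 0) x = -x;                            // symmetric border extension
        if (x >= width) x = 2 * width - 2 - x;
        acc += LP97[t + 4] * in[y * width + x];
    }
    out[y * outW + ox] = acc;
}

int main()
{
    const int W = 2048, H = 2048;                     // assumed image size
    size_t inBytes  = (size_t)W * H * sizeof(float);
    size_t outBytes = (size_t)(W / 2) * H * sizeof(float);

    float* hIn = (float*)malloc(inBytes);
    for (int i = 0; i < W * H; ++i) hIn[i] = (float)(i % 255);

    float *dIn, *dOut;
    cudaMalloc(&dIn, inBytes); cudaMalloc(&dOut, outBytes);
    cudaMemcpy(dIn, hIn, inBytes, cudaMemcpyHostToDevice);

    dim3 block(32, 8);
    dim3 grid((W / 2 + block.x - 1) / block.x, (H + block.y - 1) / block.y);
    rowLowPass97<<<grid, block>>>(dIn, dOut, W, H);
    cudaDeviceSynchronize();

    float s;
    cudaMemcpy(&s, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0][0] = %f\n", s);

    cudaFree(dIn); cudaFree(dOut); free(hIn);
    return 0;
}
```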
Journal of Computer and System Sciences, 2013
Cluster analysis plays a critical role in a wide variety of applications, but it now faces a computational challenge due to continuously increasing data volumes. Parallel computing is one of the most promising solutions to this challenge. In this paper, we target parallelizing k-Means, which is one of the most popular clustering algorithms, using widely available Graphics Processing Units (GPUs). Different from existing GPU-based k-Means algorithms, we observe that data dimensionality is an important factor that should be taken into consideration when parallelizing k-Means on GPUs. In particular, we use two different strategies for low-dimensional and high-dimensional data sets respectively, in order to make the best use of GPU computing horsepower. For low-dimensional data sets, we design an algorithm that exploits GPU on-chip registers to significantly decrease data access latency. For high-dimensional data sets, we design another novel algorithm that simulates matrix multiplication and exploits GPU on-chip shared memory to achieve a high compute-to-memory-access ratio. Our experimental results show that our GPU-based k-Means algorithms are three to eight times faster than the best reported GPU-based algorithms.
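As a rough illustration of the low-dimensional strategy described above, the kernel sketch below keeps each thread's point in registers (possible when the dimensionality is small and fixed at compile time) while scanning the centroids. The fixed DIM, names and launch configuration are assumptions, and the high-dimensional, shared-memory tiled variant is not shown.

```cuda
// Minimal sketch: low-dimensional k-means assignment with the point cached
// in registers, so each coordinate is read from global memory only once.
#include <cfloat>
#include <cuda_runtime.h>

#define DIM 4   // assumed low dimensionality, known at compile time

__global__ void assignLowDim(const float* points,    // n x DIM, row-major
                             const float* centroids, // k x DIM, row-major
                             int* labels, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Cache this thread's point in registers: the fixed-size, fully
    // unrolled array lets the compiler keep it out of local memory.
    float p[DIM];
#pragma unroll
    for (int j = 0; j < DIM; ++j) p[j] = points[i * DIM + j];

    int best = 0;
    float bestDist = FLT_MAX;
    for (int c = 0; c < k; ++c) {
        float dist = 0.0f;
#pragma unroll
        for (int j = 0; j < DIM; ++j) {
            float diff = p[j] - centroids[c * DIM + j];
            dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
    }
    labels[i] = best;
}
```

A launch such as assignLowDim<<<(n + 255) / 256, 256>>>(dPoints, dCentroids, dLabels, n, k) assigns every point in one pass; host-side allocation and the centroid-update step are omitted from this sketch.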