In this paper, we provide a qualitative and quantitative analysis of the performance of parallel algorithms on modern multi-core hardware. We present a comparative study of the performance of algorithms (traditionally perceived as sequential in nature) in a parallel environment, using the Message Passing Interface (MPI) and based on Amdahl’s Law. First, we study sorting algorithms. Sorting is a fundamental problem in computer science, and one where there is a limit on the efficiency of the algorithms that exist. In theory, sorting contains a large amount of parallelism, and it should not be difficult to accelerate the sorting of very large datasets on modern architectures. Unfortunately, most serial sorting algorithms do not lend themselves to easy parallelization, especially in a distributed memory system such as we might use with MPI. While initial results show a promising speedup for sorting algorithms, owing to inter-process communication latency, we see a slower overall run-time with incr...
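Amdahl's Law, which the abstract above invokes, bounds the speedup on p processors by S(p) = 1 / ((1 - f) + f / p), where f is the fraction of the run time that can be parallelized. The following is a minimal sketch of that formula only; the function name and parameters are illustrative assumptions and are not taken from the paper.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Upper bound on speedup predicted by Amdahl's Law.

    parallel_fraction: share of the serial run time that can be parallelized (0..1).
    processors: number of processors used (>= 1).
    Both names are hypothetical; the paper does not define an API.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unaffected; the parallel part is divided among processors.
    return 1.0 / (serial_fraction + parallel_fraction / processors)
```

The formula ignores communication cost, so measured speedups in a message-passing setting can fall below this bound.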
In this paper, we investigate the performance of two different parallel sorting algorithms (PSRS and BRADIX) in both distributed and shared memory systems. These two algorithms are technically very different from each other. PSRS is a computation-intensive algorithm, whereas BRADIX relies mainly on communication among processors to sort data items. We observe that a communication-intensive algorithm exhibits worse performance than a computation-intensive algorithm, in terms of speedup, in both shared and distributed systems. This suggests that algorithms of this type, in general, cannot take advantage of an increased number of processors.
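For readers unfamiliar with PSRS (parallel sorting by regular sampling), the sketch below simulates its four textbook phases sequentially in a single process. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function name, the pivot positions, and the fallback for small inputs are choices made here, and each "processor" is just a list.

```python
import bisect
from heapq import merge


def psrs_sort(data, p):
    """Single-process simulation of PSRS with p simulated processors (hypothetical helper)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    n = len(data)
    if n < p * p:
        # Assumption: too few elements for regular sampling, so sort directly.
        return sorted(data)

    # Phase 1: each simulated processor sorts its contiguous block locally.
    size = -(-n // p)  # ceiling division
    blocks = [sorted(data[i * size:(i + 1) * size]) for i in range(p)]

    # Each processor contributes p regularly spaced samples from its block.
    samples = sorted(b[j * len(b) // p] for b in blocks for j in range(p))

    # Phase 2: choose p - 1 pivots from the sorted samples (one common choice).
    pivots = [samples[i * p + p // 2] for i in range(1, p)]

    # Phase 3: each processor splits its sorted block at the pivots.
    parts = []
    for b in blocks:
        cuts = [0] + [bisect.bisect_right(b, pv) for pv in pivots] + [len(b)]
        parts.append([b[cuts[j]:cuts[j + 1]] for j in range(p)])

    # Phase 4: processor j merges the j-th partition received from every processor.
    result = []
    for j in range(p):
        result.extend(merge(*(parts[i][j] for i in range(p))))
    return result
```

In a real distributed run, phase 3 ends with an all-to-all exchange of the partitions; here that exchange is replaced by indexing into `parts`.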
Journal of Experimental Algorithmics, 1998
We introduce a new deterministic parallel sorting algorithm for distributed memory machines based on the regular sampling approach. The algorithm uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. Moreover, unlike previous variations, our algorithm efficiently handles the presence of duplicate values without the overhead of tagging each element with a unique identifier. This algorithm was implemented in SPLIT-C and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2-WN, and the Cray Research T3D. We ran our code using widely different benchmarks to examine the dependence of our algorithm on the input distribution. Our experimental results illustrate the efficiency and scalability of our algorithm across different platforms. In fact, the performance compares closely to that of our random sample sort algorithm, which seems to outperform all similar algorithms known to the authors on these platforms. Together, their performance is nearly invariant over the set of input distributions, unlike previous efficient algorithms. However, unlike our randomized sorting algorithm, the performance and memory requirements of our regular sorting algorithm can be deterministically guaranteed. We present a novel variation on the approach of sorting by regular sampling which leads to a new deterministic sorting algorithm that achieves optimal computational speedup with very little communication. Our algorithm exchanges the single step of irregular communication used by previous implementations for two steps of regular communication. In return, our algorithm mitigates the problem of poor load balancing because it is able to sustain a high sampling rate at substantially less cost. In addition, our algorithm efficiently accommodates the presence of duplicates without the overhead of tagging each element. 
Finally, our algorithm achieves predictable, regular communication requirements which are essentially invariant with respect to the input distribution. Utilizing regular communication has become more important with the advent of message passing standards, such as MPI [16], which seek to guarantee the availability of very efficient (often machine-specific) implementations of certain basic collective communication routines. Our algorithm was implemented in a high-level language and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2, and the Cray Research T3D. We ran our code using a variety of benchmarks that we identified to examine the dependence of our algorithm on the input distribution. Our experimental results are consistent with the theoretical analysis and illustrate the efficiency and scalability of our algorithm across different platforms. In fact, the performance compares closely to that of our random sample sort algorithm, which seems to outperform all similar algorithms known to the authors on these platforms. Together, their performance is nearly indifferent to the set of input distributions, unlike previous efficient algorithms. However, unlike our randomized sorting algorithm, the performance and memory requirements of our regular sorting algorithm can be deterministically guaranteed. The high-level language used in our studies is SPLIT-C [10], an extension of C for distributed memory machines. The algorithm makes use of MPI-like communication primitives but does not make any assumptions as to how these primitives are actually implemented. The basic data transport is a read or write operation. The remote read and write typically have both blocking and non-blocking versions. Also, when reading or writing more than a single element, bulk data transports are provided with corresponding bulk read and bulk write primitives. 
Our collective communication primitives, described in detail in [4], are similar to those of the MPI [16], the IBM POWERparallel [6], and the Cray MPP systems [9] and, for example, include the following: transpose, bcast, gather, and scatter. Brief descriptions of these are as follows. The transpose primitive is an all-to-all personalized communication in which each processor has to send a unique block of data to every processor, and all the blocks are of the same size. The bcast primitive is used to copy a block of data from a single source to all the other processors. The primitives gather and scatter are companion primitives. Scatter divides a single array residing on a processor into equal-sized blocks, each of which is distributed to a unique processor, and gather coalesces these blocks back into a single array at a particular processor. See [3, 4, 5] for algorithmic details, performance analyses, and empirical results for these communication primitives. The organization of this paper is as follows. Section 2 presents our computation model for analyzing parallel algorithms. Section 3 describes in detail our improved sample sort algorithm. Finally, Section 4 describes our data sets and the experimental performance of our sorting algorithm.
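The semantics of the four primitives described above can be illustrated with a small single-process simulation in which a list of lists stands in for the per-processor memories. This is only a sketch of the data movement; it is not SPLIT-C or MPI code, and the function signatures are assumptions made for illustration.

```python
def bcast(block, p):
    """Copy one block from a single source to all p simulated processors."""
    return [list(block) for _ in range(p)]


def scatter(array, p):
    """Divide one array into p equal-sized blocks, one per simulated processor."""
    if len(array) % p != 0:
        raise ValueError("array length must be divisible by p")
    size = len(array) // p
    return [array[i * size:(i + 1) * size] for i in range(p)]


def gather(blocks):
    """Coalesce the per-processor blocks back into a single array."""
    return [x for block in blocks for x in block]


def transpose(send):
    """All-to-all personalized communication.

    send[i][j] is the block processor i sends to processor j;
    the result r satisfies r[j][i] == send[i][j].
    """
    p = len(send)
    return [[send[i][j] for i in range(p)] for j in range(p)]
```

For example, `gather(scatter(list(range(8)), 4))` returns the original eight elements in order, which reflects the description of scatter and gather as companion primitives.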
International Journal of Computer Applications, 2014
The aim of this paper is to show that the greater part of the execution time is consumed by computation. As the number of processors increases, the amount of work done by each processor decreases, regardless of the number of physical cores used. The time taken by computation still dominates the communication time: as the number of processors increases, tasks are divided further, so the overall time decreases. The total overhead generated by process initialization and inter-process communication negatively affects the execution time. Using MPI, parallel versions of five sorting techniques have been implemented: selection sort, bubble sort, quick sort, insertion sort, and shell sort.
Previous schemes for sorting on general-purpose parallel machines have had to choose between poor load balancing and irregular communication or multiple rounds of all-to-all personalized communication. In this paper, we introduce a novel variation on sample sort which uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. This algorithm was implemented in Split-C and run on a variety of platforms, including the Thinking Machines ...
Journal of Parallel and Distributed Computing, 1995
In this paper we present a new parallel sorting algorithm which maximizes the overlap between the disk, network, and CPU subsystems of a processing node. This algorithm is shown to be of similar complexity to known efficient sorting algorithms. The pipelining effect exploited by our algorithm should lead to higher levels of performance on distributed memory parallel processors. In order to achieve the best results using this strategy, the CPU, network and disk operations must take comparable time. We suggest acceptable levels of system balance for sorting machines and analyze the performance of the sorting algorithm as system parameters vary.
International Journal of Advanced Research in Computer Engineering & Technology, 2019
There are various sorting algorithms available in the literature, such as merge sort, bucket sort, heap sort, etc. The fundamental issues with all these sorting algorithms are their time and space complexities for huge amounts of data. In this paper we provide an optimized counting sort algorithm for sorting billions of integers using serial and parallel processing. Comparisons are made between various sorting algorithms such as counting sort, bucket sort, and merge sort. Counting sort and merge sort algorithms are implemented on the CPU and in parallel on a graphics processing unit (GPU). It is observed that our optimized counting sort takes only 6 m seconds to sort 100 million integers and is 23 times faster than the counting algorithm implemented before.
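As background for the abstract above, a plain serial counting sort can be sketched as follows. This is the textbook algorithm for non-negative integers with a known upper bound, not the optimized or GPU version the paper describes; the function name and the `max_value` parameter are assumptions.

```python
def counting_sort(values, max_value):
    """Textbook counting sort for integers in the range 0..max_value (hypothetical helper)."""
    counts = [0] * (max_value + 1)
    for v in values:
        if not 0 <= v <= max_value:
            raise ValueError("value outside the range 0..max_value")
        counts[v] += 1  # histogram of key occurrences
    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)  # emit each key as often as it was seen
    return result
```

The running time is proportional to the number of inputs plus the size of the key range, which is why the approach suits large arrays of bounded integers.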
International Journal of Advanced Science and Technology, 2016
Many sorting algorithms have been proposed and implemented in previous years. These algorithms are usually judged by their performance in terms of growth rate with respect to the input size. Efficient sorting algorithm implementation is important for optimizing the use of other algorithms such as searching algorithms, load balancing algorithms, etc. In this paper, parallel Quicksort, parallel Merge sort, and parallel Merge-Quicksort algorithms are evaluated and compared in terms of running time, speedup, and parallel efficiency. These sorting algorithms are implemented using the Message Passing Interface (MPI) library, and the experiments have been conducted on the IMAN1 supercomputer. Results show that the parallel Quicksort algorithm outperforms both the Merge sort and Merge-Quicksort algorithms in run time. Moreover, on a large number of processors, parallel Quicksort achieves the best parallel efficiency of up to 88%, while the Merge sort and Merge-Quicksort algorithms achieve up to 49% and 52% parallel efficiency, respectively.
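Speedup and parallel efficiency, the two metrics named above, are conventionally defined as S = T_serial / T_parallel and E = S / p for p processors. The sketch below applies these standard definitions; the function names are illustrative and the paper's own measurement procedure is not reproduced here.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Ratio of serial run time to parallel run time."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / t_parallel


def parallel_efficiency(t_serial: float, t_parallel: float, processors: int) -> float:
    """Speedup divided by the number of processors (1.0 means ideal scaling)."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return speedup(t_serial, t_parallel) / processors
```

Under these definitions, an efficiency of 88% on p processors corresponds to a speedup of 0.88 × p.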
Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing
We show the importance of sequential sorting in the context of in-memory parallel sorting of large data sets of 64-bit keys. First, we analyze several sequential strategies such as Straight Insertion, Quick sort, Radix sort and CC-Radix sort. As a consequence of the analysis, we propose a new algorithm that we call Sequential Counting Split Radix sort, SCS-Radix sort. SCS-Radix sort is a combination of some of the algorithms analyzed and other new ideas. There are three important contributions in SCS-Radix sort: first, the work saved by detecting data skew dynamically; second, the exploitation of the memory hierarchy by the algorithm; third, the execution time stability of SCS-Radix when sorting data sets with different characteristics. We evaluate the use of SCS-Radix sort in the context of a parallel sorting algorithm on an SGI Origin 2000. The parallel algorithm is from 1.2 to 45 times faster using SCS-Radix sort than using Radix sort or Quick sort.
Sorting is commonly viewed as the most fundamental problem in the study of algorithms. Two commonly cited reasons are that a great many software applications rely on sorting and that a great many algorithms use sorting as a subroutine [1]. Given its ubiquity, it is valuable to be able to solve the sorting problem efficiently. For this reason, many efficient sorting algorithms have been developed and studied. Three of the most popular and efficient sorting algorithms are Mergesort, Quicksort, and Heapsort. Given the asymptotic lower bound of Ω(n log n) for comparison-based sorting algorithms such as these, a natural route to greater performance is parallel computing. To select the optimal sorting algorithm for a particular parallel computing architecture, it is valuable to empirically compare the performance of different parallelized sorting algorithms. This is the aim of our research. In this project, we conduct an empirical analysis and comparison of parallelized versions of two popular sorting algorithms: Mergesort and Quicksort. Heapsort and the difficulties of parallelizing it are also considered. The criteria for evaluation are (i) execution time and (ii) scalability. The research was conducted on Case Western Reserve's high-performance computing (HPC) architecture, specifically the Markov cluster. We implement parallel Mergesort and Quicksort and execute them with variously sized and randomly permuted input arrays. The execution times are recorded for each run. Additionally, we run the algorithms on a varying number of CPUs (e.g., one CPU, two CPUs, four CPUs) in order to assess their scalability. After collecting the data, we perform data analysis and use it to compare the algorithms according to the aforementioned criteria for evaluation. 
The comparison will facilitate making an informed choice about which sorting algorithm to use under various conditions (e.g., the number of CPUs available and the size of the input array).
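One simple way to structure a parallel Mergesort is to split the input into one chunk per worker, sort the chunks concurrently, and then merge the sorted runs. The sketch below shows only that structure and is not the implementation used in the project: the function name and the use of a thread pool are assumptions, and in standard CPython threads are not expected to speed up this CPU-bound work, so a process pool or MPI would be needed for real measurements.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def chunked_merge_sort(data, workers):
    """Sort chunks concurrently, then k-way merge the sorted runs (hypothetical helper)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if not data:
        return []
    size = -(-len(data) // workers)  # ceiling division: at most `workers` chunks
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker sorts one chunk independently of the others.
        sorted_chunks = list(pool.map(sorted, chunks))
    # The final merge is sequential here; it is the serial part of this sketch.
    return list(merge(*sorted_chunks))
```

Timing this function for one, two, and four workers on randomly permuted arrays would mirror the scalability experiment described above, though the numbers would depend heavily on the runtime used.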
Parallel and Distributed Processing Techniques and Applications, 2006
Proceedings of the 16th International Conference on Applied Computing 2019, 2019
Lecture Notes in Computer Science, 1993
Lecture Notes in Computer Science, 2003
Proceedings of the 2010 Workshop on Parallel Programming Patterns - ParaPLoP '10, 2010
Proceedings of the 1st international workshop on Data management on new hardware - DAMON '05, 2005
International Journal of Parallel Programming, 1991
Journal of Computer and System Sciences, 2003
Modern Applied Science