Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing
We show the importance of sequential sorting in the context of in-memory parallel sorting of large data sets of 64-bit keys. First, we analyze several sequential strategies such as Straight Insertion, Quick sort, Radix sort and CC-Radix sort. As a consequence of the analysis, we propose a new algorithm that we call Sequential Counting Split Radix sort (SCS-Radix sort). SCS-Radix sort combines some of the algorithms analyzed with other new ideas. There are three important contributions in SCS-Radix sort: first, the work saved by detecting data skew dynamically; second, the algorithm's exploitation of the memory hierarchy; third, the execution-time stability of SCS-Radix when sorting data sets with different characteristics. We evaluate the use of SCS-Radix sort in the context of a parallel sorting algorithm on an SGI Origin 2000. The parallel algorithm is 1.2 to 4.5 times faster using SCS-Radix sort than using Radix sort or Quick sort.
International Journal of Advanced Research in Computer Engineering & Technology, 2019
There are various sorting algorithms available in the literature, such as merge sort, bucket sort and heap sort. The fundamental issue with all these sorting algorithms is their time and space complexity for huge amounts of data. In this paper we provide an optimized counting sort algorithm for sorting billions of integers using serial and parallel processing. Comparisons are made between various sorting algorithms, namely counting sort, bucket sort and merge sort. The counting sort and merge sort algorithms are implemented on the CPU and in parallel on the graphics processing unit (GPU). It is observed that our optimized counting sort takes only 6 milliseconds to sort 100 million integers and is 23 times faster than the counting sort implemented before.
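The baseline being optimized above is the textbook counting sort. As a point of reference, a minimal unoptimized version can be sketched as follows; the function name and key-range parameter are illustrative, not the authors' code.

```python
def counting_sort(keys, max_key):
    """Sketch of textbook counting sort: O(n + max_key) time, no comparisons.

    Assumes non-negative integer keys bounded by max_key (an illustrative
    interface, not the paper's optimized implementation).
    """
    counts = [0] * (max_key + 1)
    for k in keys:                      # histogram pass
        counts[k] += 1
    out = []
    for value, c in enumerate(counts):  # emit each value `c` times
        out.extend([value] * c)
    return out

print(counting_sort([3, 1, 4, 1, 5, 9, 2, 6], 9))  # → [1, 1, 2, 3, 4, 5, 6, 9]
```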
2011
The problem addressed in this paper is that we want to sort an integer array a[] of length n on a multi-core machine with k cores. Amdahl's law tells us that the inherent sequential part of any algorithm will in the end dominate and limit the speedup we get from parallelisation of that algorithm. This paper introduces PARL, a parallel left-radix sorting algorithm for use on ordinary shared-memory multi-core machines, that has just one simple statement in its sequential part. It can be seen as a major rework of the Partitioned Parallel Radix Sort (PPR) that was developed for use on a network of communicating machines with separate memories. The PARL algorithm, which was developed independently of the PPR algorithm, has in principle some of the same phases as PPR, but also many significant differences as described in this paper. On a 32-core server, a speedup of 5-12 times is achieved compared with the same sequential ARL algorithm when sorting more than 100 000 numbers and half that...
ArXiv, 2010
Sorting algorithms are the deciding factor for the performance of common operations such as removal of duplicates or database sort-merge joins. This work focuses on 32-bit integer keys, optionally paired with a 32-bit value. We present a fast radix sorting algorithm that builds upon a microarchitecture-aware variant of counting sort. Taking advantage of virtual memory and making use of write-combining yields a per-pass throughput corresponding to at least 88% of the system's peak memory bandwidth. Our implementation outperforms Intel's recently published radix sort by a factor of 1.5. It also compares favorably to the reported performance of an algorithm for Fermi GPUs when data-transfer overhead is included. These results indicate that scalar, bandwidth-sensitive sorting algorithms remain competitive on current architectures. Various other memory-intensive applications can benefit from the techniques described herein.
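The per-pass structure such an algorithm rests on, counting sort applied to one byte of the key at a time, can be sketched as follows. This is an illustrative scalar version for 32-bit keys, not the paper's write-combining, bandwidth-tuned implementation.

```python
def radix_sort_u32(keys):
    """Sketch of LSD radix sort: four stable counting-sort passes, radix 256."""
    for shift in (0, 8, 16, 24):          # least significant byte first
        counts = [0] * 256
        for k in keys:                    # histogram of the current byte
            counts[(k >> shift) & 0xFF] += 1
        offsets, total = [0] * 256, 0
        for d in range(256):              # prefix sums -> bucket start offsets
            offsets[d], total = total, total + counts[d]
        out = [0] * len(keys)
        for k in keys:                    # stable scatter into buckets
            d = (k >> shift) & 0xFF
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```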
Journal of Parallel and Distributed Computing, 2002
Partitioned parallel radix sort is a parallel radix sort that shortens the execution time by modifying load-balanced radix sort, which is known as one of the fastest internal sorts with parallel processing. Parallel sorts usually consist of a few phases of local sort and data movement across processors. Load-balanced radix sort requires data redistribution in each round for perfect load balancing, whereas partitioned parallel radix sort needs it only once, in the first round. The remaining work is only computation and data movement within each processor, requiring no further interprocessor communication. The proposed method has been implemented on an IBM SP2, a PC cluster, and a CRAY T3E. The experimental results show that partitioned parallel radix sort outperforms load-balanced radix sort on all three machines with various key distributions: in execution time, by 13% up to 30% on the SP2, by 20% to 100% on the T3E, and by 2.5-fold or more on the PC cluster.
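The key idea, a single most-significant-digit redistribution followed by purely local sorting, can be simulated serially as below. The processor count, the 16-bit key width, and the use of a library sort for the local phase are illustrative assumptions, not the paper's implementation.

```python
def partition_then_sort(keys, nprocs=4, key_bits=16):
    """Sketch: one MSD-based 'redistribution', then independent local sorts."""
    shift = key_bits - (nprocs - 1).bit_length()   # top bits select a processor
    parts = [[] for _ in range(nprocs)]
    for k in keys:                    # the single communication-like round
        parts[min(k >> shift, nprocs - 1)].append(k)
    out = []
    for p in parts:                   # purely local from here on: no further
        out.extend(sorted(p))         # inter-"processor" data movement
    return out
```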
2014
In this paper, we investigate the performance of two different parallel sorting algorithms (PSRS and BRADIX) in both distributed- and shared-memory systems. These two algorithms are technically very different from each other: PSRS is a computation-intensive algorithm, whereas BRADIX relies mainly on communication among processors to sort data items. We observe that the communication-intensive algorithm exhibits worse performance than the computation-intensive algorithm, in terms of speedup, in both shared and distributed systems. This suggests that algorithms of this type, in general, cannot take advantage of an increased number of processors.
Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM) - Supercomputing '99, 1999
The performance of parallel sorting is not well understood on hardware cache-coherent shared address space (CC-SAS) multiprocessors, which increasingly dominate the market for tightly coupled multiprocessing. We study two high-performance parallel sorting algorithms, radix and sample sorting, under three major programming models (a load-store CC-SAS, message passing, and the segmented SHMEM model) on a 64-processor SGI Origin2000. We observe surprisingly good speedups on this demanding application. The performance of radix sort is greatly affected by the programming model and particular implementation used. Sample sort exhibits more uniform performance across programming models on this platform, but it is usually not as good as that of the best radix sort for larger data sets if each is allowed to use the best programming model for itself. The best combination of algorithm and programming model is radix sorting under the SHMEM model for larger data sets and sample sorting under CC-SAS for smaller data sets.
ACM Journal of Experimental Algorithmics, 2001
We demonstrate the importance of reducing misses in the translation-lookaside buffer (TLB) for obtaining good performance on modern computer architectures. We focus on least-significant-bit-first (LSB) radix sort, standard implementations of which incur many TLB misses. We give three techniques which simultaneously reduce cache and TLB misses for LSB radix sort: reducing working-set size, explicit block transfer, and pre-sorting. We note that:
• All the techniques above yield algorithms whose implementations outperform optimised cache-tuned implementations of LSB radix sort and comparison-based sorting algorithms. The fastest running times are obtained by the pre-sorting approach, and these are over twice as fast as optimised cache-tuned implementations of LSB radix sort and quicksort. Even the simplest optimisation, using the TLB size to guide the choice of radix in standard implementations of LSB radix sort, gives good improvements over cache-tuned algorithms.
• One of the pre-sorting...
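The simplest optimisation noted above, letting the TLB size guide the radix, can be sketched as follows. The default dTLB entry count is an illustrative assumption, not a figure from the paper.

```python
def tlb_guided_radix_bits(tlb_entries=64):
    """Largest digit width whose bucket count does not exceed the dTLB entries,
    so each counting pass's actively written pages can stay TLB-resident."""
    bits = 0
    while (1 << (bits + 1)) <= tlb_entries:
        bits += 1
    return bits

# e.g. an (assumed) 64-entry data TLB -> 6-bit digits, i.e. 64 buckets per pass
```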
2016
The paper presents a new sorting algorithm that takes integer input elements and sorts them without any comparison operations between the data, i.e. a comparison-free sort. The algorithm uses a one-hot representation for each input element that is stored in a two-dimensional matrix called a one-hot matrix. Concurrently, each input element is also stored in a one-dimensional matrix in the input element's integer representation. Subsequently, the transposed one-hot matrix is mapped to a binary matrix, producing a sorted matrix with all elements in their sorted order. The algorithm exploits parallelism that is suitable for single-instruction multiple-thread (SIMT) computing, which can harness the resources of such machines as CPUs with multiple cores and GPUs with large thread blocks. We analyze our algorithm's sorting time on varying CPU architectures, including single- and multi-threaded implementations on a single CPU. Our results show a fast sorting time for the singl...
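The comparison-free mechanism can be rendered serially as below: each element marks its (value, position) entry in a conceptual one-hot matrix, and scanning the matrix in value order emits the sorted output. This is an illustrative sequential sketch of a scheme designed for SIMT hardware, not the authors' implementation.

```python
def one_hot_sort(keys, max_key):
    """Sketch: sort bounded non-negative integers with no key comparisons."""
    rows = [[] for _ in range(max_key + 1)]    # conceptual one-hot matrix
    for pos, k in enumerate(keys):
        rows[k].append(pos)                    # mark (value k, position pos)
    out = []
    for k, positions in enumerate(rows):       # read rows in value order
        out.extend([k] * len(positions))
    return out
```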
Ditse Journals of Pure and Applied Sciences (DUJOPAS), 2018
The multi-core processor is an improvement over the single-core architecture and represents the latest development in microprocessor technology. Performance evaluation of algorithms can be carried out on computers with different numbers of processing cores with a view to establishing their relative performance. This research paper carried out performance evaluation studies on three different sorting algorithms, namely QuickSort-Sequential, QuickSort-ParallelNaive, and QuickSort-ForkJoin, on single-core and multi-core processors to determine which of the algorithms has the better execution time, and to show the effect of multi-processor machines on their performance. System.nanoTime() was used to measure the performance of these three algorithms. The overall running times of these algorithms were reported and compared. Results showed that the running time of QuickSort-ForkJoin is about 46.62% faster than QuickSort-Sequential and 31.20% faster than QuickSort-ParallelNaive. QuickSort-ForkJoin therefore exhibits the best performance, probably due to its divide-and-conquer approach; work stealing is another likely reason for this good performance when measured on both single- and multi-core processors. It was also discovered that an increase in the number of processing cores in a machine significantly improves the performance of these algorithms.
Future Generation Computer Systems, 2006
The aim of the paper is to introduce techniques for tuning sequential in-core sorting algorithms in the frameworks of two applications. The first application is parallel sorting when the processor speeds in the parallel system are not identical. The second application is the Zeta-Data Project (Koskas, 2003), whose aim is to develop novel algorithms for database issues; about 50% of the work done in building indexes is devoted to sorting sets of integers. We develop and compare algorithms built to sort with equal keys. The algorithms are variations of the 3-way Quicksort of Sedgewick. In order to observe performance, fully exploit the functional units in processors, and optimize the use of the memory system, we use the hardware performance counters that are available on most modern microprocessors. We also develop analytical results for one of our algorithms and compare expected results with the measurements. For the two applications, we show through fine-grained experiments on an Athlon processor (a three-way superscalar x86 processor) that L1 data cache misses are not the central problem; rather, a subtle proportion of independent retired instructions must be maintained to get performance for in-core sorting.
Lecture Notes in Computer Science, 1993
The use of multiprocessor architectures requires the parallelization of sorting algorithms. A parallel sorting algorithm based on horizontal parallelization is presented. This algorithm is suited for large data volumes (external sorting) and does not suffer from processing skew in the presence of data skew. The core of the parallel sorting algorithm is a new adaptive partitioning method: the effect of data skew is remedied by taking samples representing the distribution of the input data. The parallel algorithm has been implemented on top of a shared-disk multiprocessor architecture. The performance evaluation of the algorithm shows that it has linear speedup. Furthermore, the optimal degree of CPU parallelism is derived when I/O limitations are taken into account.
2007
This paper introduces Buffered Adaptive Radix sort (BARsort), which adds two improvements to the well-known right-to-left radix sorting algorithm (Right Radix, or just Radix). The first improvement, the adaptive part, is that the size of the sorting digit is adjusted according to the maximum value of the elements in the array. This makes BARsort somewhat faster than ordinary 8-bit Radix sort (Radix8). The second and most important improvement is that data is transferred back and forth between the original array and a buffer that can be only a percentage of the size of the original array, as opposed to traditional Radix, where the second array is the same length as the original array. Even though a buffer size of 100% of the original array is the fastest choice, any percentage larger than 6% gives good to acceptable performance. This result is also explained analytically. This flexibility in memory requirement is important in programming languages such as Java, where the heap size is fixed ...
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures - SPAA '07, 2007
Most contemporary processors offer some version of Single Instruction Multiple Data (SIMD) machinery: vector registers and instructions to manipulate data stored in such registers. The central idea of this paper is to use these SIMD resources to improve the performance of the tail of recursive sorting algorithms. When the number of elements to be sorted reaches a set threshold, data is loaded into the vector registers, manipulated in-register, and the result stored back to memory. Three implementations of sorting with two different SIMD machineries, x86-64's SSE2 and the G5's AltiVec, demonstrate that this idea delivers significant speed improvements. The improvements provided are orthogonal to the gains obtained through empirical search for a suitable sorting algorithm. When integrated with the Dynamically Tuned Sorting Library (DTSL), this new code generation strategy reduces the time spent by DTSL by up to 22% for moderately sized arrays, with greater relative reductions for small arrays. Wall-clock performance of d-heaps is improved by up to 39% using a similar technique.
Lecture Notes in Computer Science, 2003
The efficient parallelization of an algorithm is a hard task that requires a good command of software techniques and considerable knowledge of the target computer. In such a task, either the programmer or the compiler has to tune different parameters to adapt the algorithm to the network topology and the communication libraries, or to expose the data locality of the algorithm on the memory hierarchy at hand.
2000
This paper introduces a new, faster sorting algorithm (ARL: Adaptive Left Radix) that does in-place, non-stable sorting. Left Radix, often called MSD (Most Significant Digit) radix, is not new in itself, but the adaptive feature and the in-place sorting ability are new features. ARL does sorting with only internal moves in the array, and uses a dynamically defined radix for each pass. ARL is a recursive algorithm that sorts by first sorting on the most significant 'digits' of the numbers, i.e. going from left to right. Its space requirement is O(N + log M) and its time performance is O(N*log M), where M is the maximum value sorted and N is the number of integers to sort. The performance of ARL is compared both with the built-in Quicksort algorithm in Java, Arrays.sort(), and with ordinary Radix sorting (sorting from right to left). ARL is almost twice as fast as Quicksort if N > 100. This applies to the normal case, a uniformly drawn distribution of the numbers 0:N...
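The left-to-right recursion can be sketched as follows. For clarity this version uses auxiliary bucket lists and a fixed 8-bit digit for 32-bit keys, whereas ARL itself sorts in place with only internal moves and a dynamically chosen radix.

```python
def msd_radix_sort(keys, shift=24):
    """Sketch of MSD (left) radix sort for 32-bit non-negative integers."""
    if len(keys) <= 1 or shift < 0:
        return keys
    buckets = [[] for _ in range(256)]
    for k in keys:                         # bucket on the current top byte
        buckets[(k >> shift) & 0xFF].append(k)
    out = []
    for b in buckets:                      # recurse on the next byte down,
        out.extend(msd_radix_sort(b, shift - 8))   # left to right
    return out
```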
2017
The paper introduces RADULS, a new parallel sorter based on the radix sort algorithm, intended to organize ultra-large data sets efficiently. For example, 4G 16-byte records can be sorted with 16 threads in less than 15 s on an Intel Xeon-based workstation. The implementation of RADULS is not only highly optimized to attain such performance, but is also parallelized in a cache-friendly manner to make the most of modern multi-core architectures. In addition, our parallel scheduler launches different procedures at runtime, according to the current parameters of the execution, for proper workload management. All experiments show RADULS to be superior to competing algorithms.
2015
Sorting is a widely studied problem in computer science and an elementary building block in many of its subfields. There are several known techniques to vectorise and accelerate a handful of sorting algorithms by using single instruction-multiple data (SIMD) instructions. It is expected that the widths and capabilities of SIMD support will improve dramatically in future microprocessor generations and it is not yet clear whether or not these sorting algorithms will be suitable or optimal when executed on them. This work extrapolates the level of SIMD support in future microprocessors and evaluates these algorithms using a simulation framework. The scalability, strengths and weaknesses of each algorithm are experimentally derived. We then propose VSR sort, our own novel vectorised non-comparative sorting algorithm based on radix sort. To facilitate the execution of this algorithm we define two new SIMD instructions and propose a complementary hardware structure for their execution. Our results show that VSR sort has maximum speedups between 14.9x and 20.6x over a scalar baseline and an average speedup of 3.4x over the next-best vectorised sorting algorithm.
Journal of Experimental Algorithmics, 1998
We introduce a new deterministic parallel sorting algorithm for distributed memory machines based on the regular sampling approach. The algorithm uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. Moreover, unlike previous variations, our algorithm efficiently handles the presence of duplicate values without the overhead of tagging each element with a unique identifier. This algorithm was implemented in SPLIT-C and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2-WN, and the Cray Research T3D. We ran our code using widely different benchmarks to examine the dependence of our algorithm on the input distribution. Our experimental results illustrate the efficiency and scalability of our algorithm across different platforms. In fact, the performance compares closely to that of our random sample sort algorithm, which seems to outperform all similar algorithms known to the authors on these platforms. Together, their performance is nearly invariant over the set of input distributions, unlike previous efficient algorithms. However, unlike our randomized sorting algorithm, the performance and memory requirements of our regular sorting algorithm can be deterministically guaranteed. We present a novel variation on the approach of sorting by regular sampling which leads to a new deterministic sorting algorithm that achieves optimal computational speedup with very little communication. Our algorithm exchanges the single step of irregular communication used by previous implementations for two steps of regular communication. In return, our algorithm mitigates the problem of poor load balancing because it is able to sustain a high sampling rate at substantially less cost. In addition, our algorithm efficiently accommodates the presence of duplicates without the overhead of tagging each element. 
Moreover, our algorithm achieves predictable, regular communication requirements which are essentially invariant with respect to the input distribution. Utilizing regular communication has become more important with the advent of message-passing standards, such as MPI [16], which seek to guarantee the availability of very efficient (often machine-specific) implementations of certain basic collective communication routines. Our algorithm was implemented in a high-level language and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2, and the Cray Research T3D, using a variety of benchmarks chosen to examine the dependence of our algorithm on the input distribution. The experimental results are consistent with the theoretical analysis. The high-level language used in our studies is SPLIT-C [10], an extension of C for distributed memory machines. The algorithm makes use of MPI-like communication primitives but does not make any assumptions as to how these primitives are actually implemented. The basic data transport is a read or write operation. The remote read and write typically have both blocking and non-blocking versions. Also, when reading or writing more than a single element, bulk data transports are provided with corresponding bulk read and bulk write primitives.
Our collective communication primitives, described in detail in [4], are similar to those of the MPI [16], the IBM POWERparallel [6], and the Cray MPP systems [9] and, for example, include the following: transpose, bcast, gather, and scatter. Brief descriptions of these are as follows. The transpose primitive is an all-to-all personalized communication in which each processor has to send a unique block of data to every processor, and all the blocks are of the same size. The bcast primitive is used to copy a block of data from a single source to all the other processors. The primitives gather and scatter are companion primitives. Scatter divides a single array residing on a processor into equal-sized blocks, each of which is distributed to a unique processor, and gather coalesces these blocks back into a single array at a particular processor. See [3, 4, 5] for algorithmic details, performance analyses, and empirical results for these communication primitives. The organization of this paper is as follows. Section 2 presents our computation model for analyzing parallel algorithms. Section 3 describes in detail our improved sample sort algorithm. Finally, Section 4 describes our data sets and the experimental performance of our sorting algorithm.
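The regular-sampling step at the heart of the sample sort described above can be simulated serially as below; the lists stand in for each processor's locally sorted block, and the helper name is illustrative, not from the paper.

```python
def regular_sample_splitters(blocks):
    """Sketch: each of p 'processors' contributes p regularly spaced samples
    from its locally sorted block; every p-th element of the sorted sample
    then serves as one of the p - 1 splitters for the all-to-all exchange."""
    p = len(blocks)
    sample = []
    for blk in blocks:                         # blk is assumed locally sorted
        n = len(blk)
        sample.extend(blk[(i * n) // p] for i in range(p))
    sample.sort()
    return sample[p::p]                        # the p - 1 splitters
```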
Computers in Physics, 1990
Four sorting algorithms (bubble, insertion, heap, and quick) are studied on an IBM 3090/600, a VAX 11/780, and the NYU Ultracomputer. It is verified that for N items the bubble and insertion sorts are of order N^2, whereas the heap and quick sorts are of order N ln N. It is shown that the choice of algorithm is more important than the choice of machine. Moreover, the influence of paging on algorithm performance is examined.