Norsk IKT-konferanse for forskning og utdanning, Nov 23, 2020
This paper investigates the problem of finding large prime gaps (the difference between two consecutive prime numbers, p_{i+1} − p_i) and the development of a small, efficient program for generating such large prime gaps on a single computer, a laptop or a workstation. In Wikipedia [1], one can find a table of all known record prime gaps less than 2^64; the record is a 20-decimal-digit number. We wanted to go beyond 64-bit numbers and demonstrate algorithms that do not need a huge number of computers in a grid to produce useful results. After some preliminary tests, we found that the Sieve of Eratosthenes, SE, from the year 250 BC was the fastest for finding prime numbers, and it could also be made space efficient. Each odd number is represented by one bit, and when storing 8 odd numbers in a single byte (representing 16 consecutive numbers, ignoring the even numbers), we found that we should not make one long SE table, but instead divide the SE table into segments (called SE segments), each of length 10^8 or 10^9, and dynamically generate the SE segments needed to find prime numbers. First, we made a basic segment of all prime numbers < 10^8 (in less than a second). We also relied heavily on the old observation [2] that when using SE to find all prime numbers ≤ N, we cross out all composite numbers using the prime numbers ≤ √N, and that the first number crossed off when crossing out for prime number p is p^2. When we want to find prime gaps, we first create one or more consecutive SE segments in that range, say starting at 2^74 and ending with the value M. Initially, these big segments are crossed out by our first basic set of primes < 10^8. To find all prime numbers in these big segments, we next need the rest of the prime numbers ≤ √M. These can all be constructed by using our first set of prime numbers to generate segments of consecutive SE starting from 10^8. The primes in these segments are used to cross out in the big SE segments and can then be discarded (each prime is used only once). Our most significant algorithm was to find a simple formula for using primes from the range 3 to 2^36 to cross out the non-primes in any SE segment without crossing out in all the numbers between 2^36 and 2^72. This leads to an exponential saving in both space and execution time. In addition to this, we created a small package, Int3, to represent numbers > 2^64 by storing 8 decimal digits in each of 3 integer variables, together with the necessary mathematical operations. The Int3 package can handle numbers up to 24 decimal digits and is significantly faster than the BigInteger package in the Java library. We also created a faster algorithm for finding all record prime gaps. The results presented in this paper are some tables of prime gaps for primes significantly larger than 2^64, and data supporting an observation that big prime gaps in these segments are much more frequent than the ones we find in the Wikipedia table, where the search starts at prime number 3. Our combined set of algorithms is also sufficiently fast to test every entry in the Wikipedia table in less than 5 minutes. We conclude by reflecting on the use of brute force (more computers) versus smarter algorithms.
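As a concrete illustration of the segmented, bit-packed sieve described above, the following Java sketch sieves one segment using a precomputed set of small primes, storing only the odd numbers at one bit each and starting the crossing-out for each prime p at p^2. The class and method names, segment bounds and the simple boolean sieve for the small primes are our own simplifications, not the paper's code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a segmented, bit-packed Sieve of Eratosthenes.
// Only odd numbers are stored, one bit per odd number (8 odd numbers per byte).
public class SegmentedSieve {

    // Classic sieve for the small "basic" primes (here up to sqrt of the segment end).
    static List<Long> smallPrimes(long limit) {
        boolean[] composite = new boolean[(int) (limit + 1)];
        List<Long> primes = new ArrayList<>();
        for (long p = 2; p <= limit; p++) {
            if (!composite[(int) p]) {
                primes.add(p);
                for (long m = p * p; m <= limit; m += p) composite[(int) m] = true;
            }
        }
        return primes;
    }

    // Sieve the odd numbers in [low, high) using the small primes; low must be odd.
    static byte[] sieveSegment(long low, long high, List<Long> primes) {
        int nOdd = (int) ((high - low + 1) / 2);         // number of odd values in the segment
        byte[] bits = new byte[(nOdd + 7) / 8];           // one bit per odd number, 0 = possibly prime
        for (long p : primes) {
            if (p == 2) continue;                         // even numbers are not represented
            long start = Math.max(p * p, ((low + p - 1) / p) * p);  // crossing out starts at p*p
            if ((start & 1) == 0) start += p;             // make the first crossed-out multiple odd
            for (long m = start; m < high; m += 2 * p) {
                int idx = (int) ((m - low) / 2);
                bits[idx >> 3] |= (byte) (1 << (idx & 7)); // mark composite
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        long low = 1_000_001, high = 1_100_000;           // an odd lower bound, small for illustration
        List<Long> primes = smallPrimes((long) Math.sqrt(high) + 1);
        byte[] bits = sieveSegment(low, high, primes);
        long count = 0;
        for (long n = low; n < high; n += 2) {
            int idx = (int) ((n - low) / 2);
            if ((bits[idx >> 3] & (1 << (idx & 7))) == 0) count++;
        }
        System.out.println("Odd primes in [" + low + ", " + high + "): " + count);
    }
}
```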
This paper demonstrates how an unstable in-place sorting algorithm, the ALR algorithm, can be made stable by temporarily changing the sorting keys during the recursion. At ‘the bottom of the recursion’, all subsequences with equal-valued elements are then individually sorted with a stable sorting subalgorithm (insertion sort or radix). Later, on backtrack, the original keys are restored. This results in a stable sorting of the whole input. Unstable ALR is much faster than Quicksort (which is also unstable). In this paper it is demonstrated that StableALR, which is some 10-30% slower than the original unstable ALR, is still in most cases 20-60% faster than Quicksort. It is also shown to be faster than Flashsort, a new unstable in-place, bucket-type sorting algorithm. This is demonstrated for five different distributions of integers in arrays of length from 50 to 97 million elements. The StableALR sorting algorithm can be extended to sort floating point numbers and strings and make eff...
It is believed that writing parallel programs is hard. Therefore, a paradigm for writing parallel programs that is not much harder than writing a sequential program is in demand. This paper describes a general method for parallelizing programs based on the idea of using recursive procedures as the grain of parallelism (PRP). This paradigm adds only a few new constructs to an ordinary sequential program with recursive procedures and makes both the underlying hardware topology and, in most cases, also the number of independent processing nodes transparent to the programmer. The implementation reported in this paper uses workstations and servers connected to an Ethernet. The efficiency achieved by the PRP concept must be said to be good, with more than 50% processor utilization on the two problems reported in this paper. Other implementations of the PRP paradigm are also reported.
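The PRP implementation described above distributes recursive calls over workstations on an Ethernet. As a minimal shared-memory illustration of the same underlying idea (recursive procedure calls as the grain of parallelism), the sketch below uses Java's fork/join framework; it is our analogy, not the PRP system, and the Fibonacci procedure and grain-size threshold are arbitrary placeholders.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Shared-memory illustration of "recursive procedures as the grain of parallelism":
// each recursive call may be forked as an independent task.
public class RecursiveGrain {

    // Naive recursive Fibonacci, only as a stand-in for any recursive procedure.
    static class Fib extends RecursiveTask<Long> {
        final int n;
        Fib(int n) { this.n = n; }

        @Override protected Long compute() {
            if (n < 20) return fibSeq(n);              // small calls stay sequential (grain size)
            Fib left = new Fib(n - 1);
            left.fork();                                // run one recursive call as a parallel task
            long right = new Fib(n - 2).compute();      // do the other call in the current thread
            return left.join() + right;
        }
    }

    static long fibSeq(int n) { return n < 2 ? n : fibSeq(n - 1) + fibSeq(n - 2); }

    public static void main(String[] args) {
        long result = ForkJoinPool.commonPool().invoke(new Fib(40));
        System.out.println("fib(40) = " + result);      // 102334155
    }
}
```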
The Design of an Efficient Portable Driver for Shared Memory Cluster Adapters
We describe the design of an efficient portable driver for shared memory interconnects. The driver provides a foundation for interfacing to commodity software like clustered database servers. We present performance figures for a driver implementation that uses SCI through the PCI bus on standard PCs running Windows NT. Keywords: Clustering, Shared memory, SCI. 1 Introduction: With the rapid advances in microprocessor design, the cluster of workstations is emerging as a cost-effective solution to high-end processing needs [1]. A cluster consists of a number of autonomous, loosely coupled computers. Proprietary interconnects like high-speed memory buses are not required for their operation. Clusters will instead use standard communication technologies like Ethernet, or (more recently) IO-bus based shared memory interconnects like the Dolphin SBUS-SCI [2] and PCI-SCI [3][4] adapters, ServerNet [5] and the DEC Memory Channel [6]. A cluster has two highly desirable properties: Scalability means that the...
The problem addressed in this paper is that we want to sort an integer array a[] of length n on a multicore machine with k cores. Amdahl's law tells us that the inherent sequential part of any algorithm will in the end dominate and limit the speedup we get from parallelisation of that algorithm. This paper introduces PARL, a parallel left radix sorting algorithm for use on ordinary shared memory multicore machines, that has just one simple statement in its sequential part. It can be seen as a major rework of the Partitioned Parallel Radix Sort (PPR), which was developed for use on a network of communicating machines with separate memories. The PARL algorithm, which was developed independently of the PPR algorithm, has in principle some of the same phases as PPR, but also many significant differences, as described in this paper. On a 32-core server, a speedup of 5-12 times is achieved compared with the same sequential ARL algorithm when sorting more than 100 000 numbers, and half that speedup on a 4-core PC workstation and on two dual-core laptops. Since the original sequential ARL algorithm is in addition 3-5 times faster than the standard Java Arrays.sort algorithm, this parallelisation translates to a significant speedup of approx. 10 to 30 for ordinary user programs sorting larger arrays. The reason that we don't get better results, i.e. a speedup equal to the number of cores when the number of cores exceeds 2, is chiefly explained by limited memory bandwidth. This thread pool implementation of PARL is also user friendly in that the user calling the PARL algorithm does not have to deal with creating threads themselves; to sort their data, they just create a sorting object and make a call to a thread-safe method in that object.
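To illustrate the kind of per-thread work a parallel left (most-significant-digit) radix sort performs, the sketch below runs only the first counting phase in parallel: each thread builds a histogram of the top 8-bit digit over its own slice of the array, and the per-thread histograms are then merged. The fixed 8-bit digit, the class name and the explicit thread handling are our own illustration, not the PARL implementation, which hides all thread management behind a sorting object.

```java
import java.util.Arrays;

// Sketch of the first phase of a parallel left (MSD) radix sort:
// each thread counts the top-digit histogram of its own slice of the array.
public class ParallelCountPhase {

    static int[][] countTopDigit(int[] a, int threads) throws InterruptedException {
        int[][] localCounts = new int[threads][256];
        Thread[] workers = new Thread[threads];
        int chunk = (a.length + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            final int id = t, from = t * chunk, to = Math.min(a.length, from + chunk);
            workers[t] = new Thread(() -> {
                for (int i = from; i < to; i++)
                    localCounts[id][(a[i] >>> 24) & 0xFF]++;   // top 8 bits as the first sorting digit
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return localCounts;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] a = new int[1_000_000];
        for (int i = 0; i < a.length; i++) a[i] = (int) (Math.random() * Integer.MAX_VALUE);

        int threads = Runtime.getRuntime().availableProcessors();
        int[][] local = countTopDigit(a, threads);

        int[] total = new int[256];                            // merge the per-thread histograms
        for (int[] c : local) for (int d = 0; d < 256; d++) total[d] += c[d];
        System.out.println("Elements counted: " + Arrays.stream(total).sum());
    }
}
```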
This paper introduces Buffered Adaptive Radix (BARsort), which adds two improvements to the well-known right-to-left Radix sorting algorithm (Right Radix or just Radix). The first improvement, the adaptive part, is that the size of the sorting digit is adjusted according to the maximum value of the elements in the array. This makes BARsort somewhat faster than ordinary 8-bit Radix sort (Radix8). The second and most important improvement is that data is transferred back and forth between the original array and a buffer that can be only a percentage of the size of the original array, as opposed to traditional Radix, where that second array is the same length as the original array. Even though a buffer size of 100% of the original array is the fastest choice, any percentage larger than 6% gives good to acceptable performance. This result is also explained analytically. This flexibility in memory requirement is important in programming languages such as Java, where the heap size is fixe...
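The "adaptive" part described above chooses the sorting-digit size from the maximum element value. A minimal Java sketch of that idea follows; the particular way the bits are split into passes and the 8-bit upper bound are our assumptions for illustration, not the paper's exact scheme.

```java
// Sketch of the "adaptive" idea: choose digit widths from the number of
// significant bits in the largest element, so small value ranges need fewer passes.
public class AdaptiveDigit {

    // Number of bits needed to represent the maximum value in a (assumed non-negative).
    static int bitsNeeded(int[] a) {
        int max = 0;
        for (int v : a) if (v > max) max = v;
        return 32 - Integer.numberOfLeadingZeros(Math.max(max, 1));
    }

    // Split those bits into digits of at most maxDigitBits bits each.
    static int[] digitWidths(int totalBits, int maxDigitBits) {
        int passes = (totalBits + maxDigitBits - 1) / maxDigitBits;
        int[] widths = new int[passes];
        int remaining = totalBits;
        for (int i = 0; i < passes; i++) {
            widths[i] = Math.min(maxDigitBits, remaining - (passes - 1 - i) * maxDigitBits);
            remaining -= widths[i];
        }
        return widths;
    }

    public static void main(String[] args) {
        int[] a = {7, 123, 45_000, 999_999};
        int bits = bitsNeeded(a);                 // 20 bits for max = 999_999
        System.out.println("bits=" + bits + " widths=" + java.util.Arrays.toString(digitWidths(bits, 8)));
    }
}
```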
We describe how a real application, the Synthetic Aperture Radar (SAR) program, is parallelized and run on a cluster of PCs connected by SCI. We have a prototype SCI switch that uses the serial IEEE 1355 HIC technology in its switching fabric. The prototype switch cannot keep up with the speed of the SCI interconnect, and we analyze and discuss how switch performance influences the execution time of our parallelized application.
It is widely documented that the cost of maintenance activities is the dominant part of the total lifetime expenditure on a software system. Still, there is no commonly accepted framework for evaluating technology that is supposed to support maintenance. This paper discusses methods for such evaluation, which include both analytical work, such as the development of taxonomies, and empirical work based on benchmarks. Several of the methods have been applied in experiments that we have already finished, in experiments that we are currently running, and in planned experiments. What we report is initial work carried out within severe resource limitations. Hence, our work should be regarded as a basis for the more extensive work which is required in this area.
The problem addressed in this paper is that we want to sort an integer array a[] of length n in parallel on a multicore machine with p cores using mergesort. Amdahl's law tells us that the inherent sequential part of any algorithm will in the end dominate and limit the speedup we get from parallelisation. This paper introduces ParaMerge, an all-parallel mergesort algorithm for use on an ordinary shared memory multicore machine that has just a few simple statements in its sequential part. The new algorithm is all parallel in the sense that, by recursive descent, it is two-way parallel at the top node, four-way parallel at the next level of the recursion, then eight-way parallel, until we have started at least one thread for each of the p cores. After parallelization, each thread then uses sequential recursive mergesort with a variant of insertion sort for sorting short subsections at the end. ParaMerge can be seen as an improvement over traditional parallelization of the mergesort algorithm where one ...
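A minimal sketch of the "double the parallelism at each recursion level" idea is shown below: each call forks one extra thread for its left half until roughly one thread per core has been started, and then falls back to a sequential sort. The thread budgeting, the use of Arrays.sort as the sequential stand-in and the simple merge routine are our own simplifications, not the ParaMerge algorithm itself.

```java
import java.util.Arrays;

// Sketch of parallel mergesort by recursive descent: two-way parallel at the top,
// four-way at the next level, and so on, until about one thread per core is running.
public class RecursiveDescentSort {

    static void sort(int[] a, int from, int to, int threadsLeft) {
        if (to - from < 2) return;
        int mid = (from + to) >>> 1;
        if (threadsLeft > 1) {
            Thread left = new Thread(() -> sort(a, from, mid, threadsLeft / 2));
            left.start();
            sort(a, mid, to, threadsLeft - threadsLeft / 2);
            try { left.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        } else {
            Arrays.sort(a, from, mid);       // stand-in for the sequential mergesort
            Arrays.sort(a, mid, to);
        }
        merge(a, from, mid, to);
    }

    static void merge(int[] a, int from, int mid, int to) {
        int[] tmp = new int[to - from];
        int i = from, j = mid, k = 0;
        while (i < mid && j < to) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i < mid) tmp[k++] = a[i++];
        while (j < to)  tmp[k++] = a[j++];
        System.arraycopy(tmp, 0, a, from, tmp.length);
    }

    public static void main(String[] args) {
        int[] a = new int[2_000_000];
        for (int i = 0; i < a.length; i++) a[i] = (int) (Math.random() * Integer.MAX_VALUE);
        sort(a, 0, a.length, Runtime.getRuntime().availableProcessors());
        for (int i = 1; i < a.length; i++) if (a[i - 1] > a[i]) throw new AssertionError("not sorted");
        System.out.println("sorted " + a.length + " elements");
    }
}
```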
This paper presents two new algorithms for inline transforming an integer array 'a' into its own sorting permutation; that is, after performing either of these algorithms, a[i] is the index in the unsorted input array 'a' of its i'th largest element (i = 0, 1, ..., n-1). The difference between the two IPS (Inline Permutation Substitution) algorithms is that the first and fastest generates an unstable permutation, while the second generates the unique, stable permutation array. The extra space needed in both algorithms is O(log n); no extra array of length n is needed! The motivation for using these algorithms is given along with their pseudo code. To evaluate their efficiency, they are tested against sorting the same array with Quicksort on 4 different machines and for 14 different distributions of the numbers in the input array, with n = 10, 50, 250, ..., 97M. This evaluation shows that both IPS algorithms are generally faster than Quicksort for values of n less than 10^7, but degenerate both in speed and demand for space for larger values. These are results with 32-bit integers (with 64-bit integers this limit would be proportionally higher, at least 10^14). The two IPS algorithms do a recursive, most significant digit radix sort (left to right) on the input array while substituting the values of the sorted elements with their indexes.
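To make the notion of a sorting permutation concrete, the following reference construction produces the same kind of output the IPS algorithms aim for, but in the straightforward way: with an extra index array and an ordinary comparison sort. It is only meant to define the result; it has none of the in-place, O(log n) extra-space properties of the IPS algorithms.

```java
import java.util.Arrays;
import java.util.Comparator;

// Reference construction of a sorting permutation: p[i] is the index in a
// of its i'th largest element. Uses an extra Integer[] index array, unlike IPS.
public class SortingPermutation {

    static int[] sortingPermutation(int[] a) {
        Integer[] idx = new Integer[a.length];
        for (int i = 0; i < a.length; i++) idx[i] = i;
        // Sort the indices by the values they point to, largest first (stable).
        Arrays.sort(idx, Comparator.comparingInt((Integer i) -> a[i]).reversed());
        int[] p = new int[a.length];
        for (int i = 0; i < a.length; i++) p[i] = idx[i];
        return p;
    }

    public static void main(String[] args) {
        int[] a = {42, 7, 99, 7, 13};
        int[] p = sortingPermutation(a);
        System.out.println(Arrays.toString(p));                  // [2, 0, 4, 1, 3] (equal 7's keep input order)
        for (int i = 0; i < p.length; i++) System.out.print(a[p[i]] + " ");  // 99 42 13 7 7
    }
}
```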
Practical Parallel Programming – a B.S. course on how to design an efficient parallel algorithm
This paper describes a new course in parallel programming at the University of Oslo that emphasizes thread-based parallel programming in Java on multicore computers. The quality of a parallel algorithm is evaluated by its speedup, since the main reason for making a parallel algorithm is to create a faster program. This course stresses that there are many possible correct parallel algorithms for a given problem. Since the course focuses on measured efficiency, a more theoretical approach like the PRAM (Parallel Random Access Machine) model is not considered. However, we examine those practical factors that contribute most to the total running time – like the startup time of a thread-based solution, the Java JIT compilation, the number of synchronizations, and the locality of data in the caches. A partly new pattern for designing a parallel algorithm, by dividing it into fully parallel parts with barrier synchronization between each part, is taught. Instead of synchronization, when u...
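The pattern of fully parallel parts separated by barrier synchronization can be sketched in Java with a CyclicBarrier, as below. The two phases (a per-thread maximum, then a read of all per-thread results) are placeholders chosen only to show the structure; they are not an example taken from the course.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Pattern sketch: split the algorithm into fully parallel phases and let all
// threads meet at a barrier between phases (no locking inside a phase).
public class BarrierPattern {
    public static void main(String[] args) throws InterruptedException {
        int threads = Runtime.getRuntime().availableProcessors();
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = (int) (Math.random() * 1000);

        int[] localMax = new int[threads];
        CyclicBarrier barrier = new CyclicBarrier(threads);
        Thread[] workers = new Thread[threads];
        int chunk = (data.length + threads - 1) / threads;

        for (int t = 0; t < threads; t++) {
            final int id = t, from = t * chunk, to = Math.min(data.length, from + chunk);
            workers[t] = new Thread(() -> {
                try {
                    // Phase 1: each thread works only on its own slice.
                    for (int i = from; i < to; i++) localMax[id] = Math.max(localMax[id], data[i]);
                    barrier.await();                 // wait until every thread has finished phase 1
                    // Phase 2: all per-thread results are now safely visible; read them freely.
                    int globalMax = 0;
                    for (int m : localMax) globalMax = Math.max(globalMax, m);
                    if (id == 0) System.out.println("max = " + globalMax);
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
    }
}
```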
In this paper we first prove that for any set of points P in the plane, the closest neighbor b to any point p in P forms a proper triangle edge pb in D(P), the Delaunay triangulation of P. We also generalize this result and prove that the j'th (second, third, ...) closest neighbors bj to p are also edges pbj in D(P) if they satisfy simple tests on distances from a point. Even though we can find many of the edges in D(P) in this way by looking at close neighbors, we give a three-point example showing that not all edges in D(P) will be found. For a random dataset we give results from test runs and a model which show that our method finds on average 4 edges per point in D(P). We also prove that the Delaunay edges found in this way form a connected graph. We use these results to outline two new parallel, and potentially faster, algorithms for finding D(P). We then report results from parallelizing one of these algorithms on a multicore CPU (MPU), which resulted in a significant speedup; and...
The problem addressed in this paper is that we want to sort an integer array a[] of length n in parallel on a multicore machine with p cores using Quicksort. Amdahl's law tells us that the inherent sequential part of any algorithm will in the end dominate and limit the speedup we get from parallelisation. This paper introduces ParaQuick, a fully parallel quicksort algorithm for use on an ordinary shared memory multicore machine that has just a few simple statements in its sequential part. It can be seen as an improvement over traditional parallelization of the Quicksort algorithm, where one follows the sequential algorithm and substitutes recursive calls with the creation of parallel threads for these calls at the top of the recursion tree. The ParaQuick algorithm starts with k parallel threads, where k is a multiple of p (here k = 8*p), in a k-way partition of the original array with the same pivot value, and hence we get 2k partitioned areas in the first pass. We then calculate wh...
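As a sequential building block for the k-way partitioning described above, the sketch below partitions one segment around a given pivot value; per the abstract, ParaQuick runs k such partitions in parallel on k segments with the same pivot, producing 2k partitioned areas. The two-pointer partition shown is a generic one, not the paper's implementation.

```java
// Partition one segment around a shared pivot value: elements <= pivot end up first.
public class SegmentPartition {

    // Partition a[from..to); returns the start index of the "greater than pivot" area.
    static int partition(int[] a, int from, int to, int pivot) {
        int lo = from, hi = to - 1;
        while (lo <= hi) {
            while (lo <= hi && a[lo] <= pivot) lo++;
            while (lo <= hi && a[hi] > pivot) hi--;
            if (lo < hi) { int tmp = a[lo]; a[lo] = a[hi]; a[hi] = tmp; }
        }
        return lo;
    }

    public static void main(String[] args) {
        int[] a = {5, 9, 1, 7, 3, 8, 2, 6, 4, 0};
        int mid = partition(a, 0, a.length, 4);                    // pivot value 4
        System.out.println("<= pivot: " + mid + " elements");       // 5 elements: {0, 1, 2, 3, 4}
        System.out.println(java.util.Arrays.toString(a));
    }
}
```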
The problem addressed in this paper is that we want to sort an array a[] of n floating point numbers conforming to the IEEE 754 standard, both in the 64-bit double precision and the 32-bit single precision formats, on a multicore computer with p real cores and shared memory (an ordinary PC). This we do by introducing a new stable sorting algorithm, RadixInsert, both in a sequential version and with two parallel implementations. RadixInsert is tested on two different machines, a 2-core laptop and a 4-core desktop, outperforming the unstable Quicksort-based algorithms from the Java library – both the sequential Arrays.sort() and the merge-based parallel version Arrays.parallelSort() – for 500 < n < 250 million by a factor of 3 to 10. The RadixInsert algorithm resembles in many ways the Shell sort algorithm [1]. First, the array is presorted to some degree – and in the case of Shell, insertion sort is first used with long jumps and later shorter jumps along the array to ensure that small...
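The sketch below combines two ingredients a radix-plus-insertion approach to IEEE 754 doubles typically relies on: an order-preserving mapping from double bits to unsigned-comparable integer keys, and a final insertion-sort pass over a presorted array. The key mapping and the single wide counting-sort digit are our assumptions for illustration; only the presort-then-insertion-sort structure is taken from the abstract, and this is not the RadixInsert implementation.

```java
import java.util.Random;

// Sketch: presort doubles by the top bits of an order-preserving key (one counting-sort
// pass), then finish the nearly sorted array with insertion sort.
public class RadixInsertSketch {

    // Map a double to a long so that unsigned comparison of keys matches double ordering.
    static long sortableBits(double d) {
        long bits = Double.doubleToRawLongBits(d);
        return (bits < 0) ? ~bits : bits ^ Long.MIN_VALUE;
    }

    // Presort by the top `topBits` bits of the key (stable counting sort on one wide digit).
    static double[] presortByTopBits(double[] a, int topBits) {
        int buckets = 1 << topBits, shift = 64 - topBits;
        int[] count = new int[buckets + 1];
        for (double v : a) count[(int) (sortableBits(v) >>> shift) + 1]++;
        for (int b = 0; b < buckets; b++) count[b + 1] += count[b];
        double[] out = new double[a.length];
        for (double v : a) out[count[(int) (sortableBits(v) >>> shift)]++] = v;
        return out;
    }

    static void insertionSort(double[] a) {            // finishes the nearly sorted array
        for (int i = 1; i < a.length; i++) {
            double v = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > v) { a[j + 1] = a[j]; j--; }
            a[j + 1] = v;
        }
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        double[] a = new double[200_000];
        for (int i = 0; i < a.length; i++) a[i] = r.nextGaussian() * 1e6;
        double[] sorted = presortByTopBits(a, 16);
        insertionSort(sorted);
        for (int i = 1; i < sorted.length; i++)
            if (sorted[i - 1] > sorted[i]) throw new AssertionError("not sorted");
        System.out.println("sorted " + sorted.length + " doubles");
    }
}
```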