1982
including computational physics, weather forecasting, etc. The current state of hardware will extend the use of such parallel processors to many more applications as the speed and the number of processors that can be tightly coupled increase dramatically. (A good introduction to the future promise of "highly parallel computing" can be found in the January 1982 issue of Computer, published by the IEEE Computer Society.)
Symposium on the Theory of Computing, 1982
1994
Cole presented a parallel merge sort for the PRAM model that performs in O(log n) parallel steps using n processors. He gave an algorithm for the CREW PRAM model for which the constant in the running time is small. He also gave a more complex version of the algorithm for the ...
IEEE Transactions on Computers, 1989
Shear-sort opened new avenues in the research of sorting techniques for mesh-connected processor arrays. The algorithm is extremely simple and converges to a snake-like sorted sequence with a time complexity that is suboptimal by a logarithmic factor. The techniques used for analyzing shear-sort have been used to derive more efficient algorithms, which have important ramifications from both practical and theoretical viewpoints. Although the algorithms described apply to any general two-dimensional computational model, the focus of most discussions is on mesh-connected computers, which are now commercially available. In spite of a rich history of O(n) sorting algorithms on an n × n SIMD mesh, the constants associated with the leading term (i.e., n) are fairly large. This has led researchers to speculate about the tightness of the lower bound. The work in this paper sheds more light on this problem, as a 4n-step algorithm is shown to exist for a model slightly more powerful than the conventional SIMD model. Moreover, this algorithm has a running time of 3n steps on the more powerful MIMD model, which is "truly" optimal for such a model. Index Terms: distance bound, lower bound, mesh-connected network, parallel algorithm, sorting, time complexity, upper bound. Two-dimensional sorting is defined as the ordering of a rectangular array of numbers such that every element is routed to a distinct position of the array predetermined by some indexing scheme; some of the standard indexing schemes are illustrated in the figure. The simplest computational model onto which this problem can be mapped is the mesh-connected processor array (mesh for short). The simplicity of the interconnection pattern, and the locality of communication, makes the mesh easy to build and program, and it was the basis of one of the earliest parallel computers (ILLIAC IV).
Since then, more machines have been built on a much larger scale, including the MPP and the DAP, using similar interconnection patterns. This simple architecture further motivates the idea of dealing with a given set of numbers as a rectangular array rather than as a linear sequence. More recently, Scherson [15] and Tseng et al. [22] have independently proposed a network which they call the orthogonal access architecture and the reduced-mesh network, respectively, consisting of p processors connected through a shared memory.
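The shear-sort procedure these abstracts refer to alternates snake-order row sorts with column sorts until the array converges to snake-like sorted order. As a rough sequential illustration of the idea (not the paper's 4n-step or 3n-step mesh algorithm), a minimal sketch:

```python
import math

def shear_sort(grid):
    """Shear-sort an n x n array into snake-like row-major order by
    alternating row phases (alternating directions) and column phases.
    O(log n) + 1 such phase pairs suffice; each phase is O(n) steps
    when the rows/columns are sorted in parallel on a mesh."""
    n = len(grid)
    phases = math.ceil(math.log2(n)) + 1 if n > 1 else 1
    for _ in range(phases):
        for i in range(n):                       # row phase: snake order
            grid[i].sort(reverse=(i % 2 == 1))
        for j in range(n):                       # column phase: top-down
            col = sorted(grid[i][j] for i in range(n))
            for i in range(n):
                grid[i][j] = col[i]
    for i in range(n):                           # final row phase
        grid[i].sort(reverse=(i % 2 == 1))
    return grid
```

Reading the result along the snake (reversing every odd row) yields the fully sorted sequence.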
Communications of the ACM, 1977
Two algorithms are presented for sorting n² elements on an n × n mesh-connected processor array that require O(n) routing and comparison steps. The best previous algorithm takes time O(n log n). The algorithms of this paper are shown to be optimal in time within small constant factors. Extensions to higher-dimensional arrays are also given.
We provide the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple secondary storage devices for the problems of sorting, FFT, matrix transposition, standard matrix multiplication, and related problems. Our two-level memory model is new and gives a realistic treatment of parallel block transfer, in which during a single I/O each of the P secondary storage devices can simultaneously transfer a contiguous block of B records. The model pertains to a large-scale uniprocessor system or parallel multiprocessor system with P disks. In addition, the sorting, FFT, permutation network, and standard matrix multiplication algorithms are typically optimal in terms of the amount of internal processing time. The difficulty in developing optimal algorithms is to cope with the partitioning of memory into P separate physical devices. Our algorithms' performances can be significantly better than those obtained by the well-known but nonoptimal technique of disk striping. Our optimal sorting algorithm is randomized, but practical; the probability of using more than l times the optimal number of I/Os is exponentially small in l (log l) log(M/B), where M is the internal memory size.
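To see why striping is suboptimal in this model, it helps to write out the well-known sorting bound from this line of work (notation as in the abstract: N records, internal memory M, block size B, P disks):

```latex
\text{Sorting with independent disks: } \quad
\Theta\!\left(\frac{N}{PB}\cdot\frac{\log(N/B)}{\log(M/B)}\right) \text{ I/Os.}
```

Disk striping treats the P disks as a single logical disk with block size PB, which gives \(\Theta\!\big(\tfrac{N}{PB}\cdot\tfrac{\log(N/(PB))}{\log(M/(PB))}\big)\) I/Os: the base of the logarithm shrinks from M/B to M/(PB), so the number of merge/distribute passes grows, which is the gap the optimal algorithms close.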
Telecommunication …, 2000
In this work an efficient model for parallel computing, called Shuffled Mesh (SM), is introduced. This bounded degree model has the mesh as subgraph and it is based on the union of mesh and shuffle-exchange topologies. It is shown that an N-processor SM combines ...
18th International Parallel and Distributed Processing Symposium, 2004. Proceedings., 2004
We study conflict-free data distribution schemes in parallel memories in multiprocessor system architectures. Given a host graph G, the problem is to map the nodes of G into memory modules such that any instance of a template type T in G can be accessed without memory conflicts. A conflict occurs if two or more nodes of T are mapped to the same memory module. The mapping algorithm should: (i) be fast in terms of data access (possibly mapping each node in constant time); (ii) minimize the required number of memory modules for accessing any instance in G of the given template type; and (iii) guarantee load balancing on the modules. In this paper, we consider conflict-free access to star templates, i.e., to any node of G along with all of its neighbors. Such a template type arises in many classical algorithms like breadth-first search in a graph, message broadcasting in networks, and nearest neighbor based approximation in numerical computation. We consider the star-template access problem on two specific host graphs-tori and hypercubes-that are also popular interconnection network topologies. The proposed conflict-free mappings on these graphs are fast, use an optimal or provably good number of memory modules, and guarantee load balancing.
Journal of Parallel and Distributed Computing, 1995
In this paper we present a new parallel sorting algorithm which maximizes the overlap between the disk, network, and CPU subsystems of a processing node. This algorithm is shown to be of similar complexity to known efficient sorting algorithms. The pipelining effect exploited by our algorithm should lead to higher levels of performance on distributed memory parallel processors. In order to achieve the best results using this strategy, the CPU, network and disk operations must take comparable time. We suggest acceptable levels of system balance for sorting machines and analyze the performance of the sorting algorithm as system parameters vary.
Journal of the ACM, 1987
Lecture Notes in Computer Science, 2002
Merge sort is useful for sorting large amounts of data progressively, especially when they can be partitioned and easily collected to a few processors. Merge sort can be parallelized; however, conventional algorithms using distributed-memory computers have poor performance due to the successive halving of the number of participating processors, down to one in the last merging stage. This paper presents a load-balanced parallel merge sort in which all processors do the merging throughout the computation. Data are evenly distributed to all processors, and every processor is forced to work in all merging phases. An analysis shows the upper bound of the speedup of the merge time is (P − 1)/log P, where P is the number of processors. We reached a speedup of 8.2 (upper bound is 10.5) on a 32-processor Cray T3E when sorting 4M 32-bit integers.
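The idling problem the paper addresses can be made concrete with a small back-of-the-envelope computation (an illustration, not the paper's analysis): in a conventional merge tree with P processors, stage k of the merging is carried out by only P/2^k of them, so across the log P merging stages only P − 1 processor-stages do merge work.

```python
P = 32
stages = P.bit_length() - 1                 # log2(P) merging stages
busy = sum(P >> k for k in range(1, stages + 1))   # processor-stages merging
total = P * stages                                  # processor-stages available
print(f"{stages} stages: {busy}/{total} processor-stages busy "
      f"({busy / total:.0%} utilization)")
```

For P = 32 this gives 31 busy processor-stages out of 160 (about 19% utilization), which is exactly the imbalance that load-balanced merging, with its (P − 1)/log P speedup ceiling, is designed to remove.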
Computers & Electrical Engineering, 2009
Optical interconnections have attracted the attention of many engineers and scientists due to their potential for gigahertz transfer rates and concurrent access to the bus in a pipelined fashion. These unique characteristics of optical interconnections give us the opportunity to reconsider traditional algorithms designed for ideal parallel computing models, such as PRAMs. Since the PRAM model is far from practice, not all algorithms designed on this model can be implemented on a realistic parallel computing system. From this point of view, we study Cole's pipelined merge sort [Cole R. Parallel merge sort. SIAM J Comput 1988;17(4):770-85] on the CREW PRAM and extend it in an innovative way to an optical interconnection model, the LARPBS (Linear Array with a Reconfigurable Pipelined Bus System) model [Pan Y, Li K. Linear array with a reconfigurable pipelined bus system: concepts and applications. J Inform Sci 1998;106:237-58]. Although Cole's algorithm is optimal, communication details have not been provided, since it was designed for a PRAM. We close this gap with our sorting algorithm on the LARPBS model and obtain an O(log N)-time optimal sorting algorithm using O(N) processors. This is a substantial improvement over the previous best sorting algorithm on the LARPBS model, which runs in O(log N log log N) worst-case time using N processors [Datta A, Soundaralakshmi S, Owens R. Fast sorting algorithms on a linear array with a reconfigurable pipelined bus system. IEEE Trans Parallel Distribut Syst 2002;13(3):212-22]. Our solution allows processors to be efficiently assigned and reused. We also discover two new properties of Cole's sorting algorithm, which are presented as lemmas in this paper.
Lecture Notes in Computer Science, 1993
Theory of Computing Systems / Mathematical Systems Theory, 1999
There has been a great deal of interest recently in the development of general-purpose bridging models for parallel computation. Models such as the BSP and LogP have been proposed as more realistic alternatives to the widely used PRAM model. The BSP and LogP models imply a rather different style for designing algorithms when compared with the PRAM model. Indeed, while many consider data parallelism as a convenient style, and the shared-memory abstraction as an easy-to-use platform, the bandwidth limitations of current machines have diverted much attention to message-passing and distributed-memory models (such as the BSP and LogP) that account more properly for these limitations.
Lecture Notes in Computer Science, 1990
We address the problem of sorting n integers, each in the range {1, ..., m}, for m = n^{O(1)}, in parallel on the PRAM model of computation. We present a randomized algorithm that runs with very high probability in O(log n / log log n) time with a processor-time product of O(n log log m) and O(n) space on the CRCW (COLLISION) PRAM [13]. The improvements that this algorithm makes over existing ones [5, 20, 27] include a weakening of the model of computation used and a reduction of the space requirement to O(n), without increasing the time needed or work done. For larger values of m our algorithm is better than existing algorithms in several other ways as well. We show that the algorithm can be analyzed using O(log^{O(1)} n)-wise independence, which implies that the amount of true randomness needed is small. We also give an improved randomized algorithm for the problem of chaining [22, 29]. An interesting subroutine used is an algorithm for solving a class of processor allocation problems quickly. The algorithms for chaining and integer sorting both make use of efficient algorithms for the construction of the fast priority queue of van Emde Boas [35].
Lecture Notes in Computer Science, 1989
IEEE Transactions on Computers, 1996
Mesh connected computers have become attractive models of computing because of their varied special features. In this paper we consider two variations of the mesh model: 1) a mesh with fixed buses, and 2) a mesh with reconfigurable buses. Both these models have been the subject matter of extensive previous research. We solve numerous important problems related to packet routing and sorting on these models. In particular, we provide lower bounds and very nearly matching upper bounds for the following problems on both these models: 1) routing on a linear array; and 2) k-k routing and k-k sorting on a 2D mesh for any k ≥ 12. We provide an improved algorithm for 1-1 routing and a matching sorting algorithm. In addition we present greedy algorithms for 1-1 routing, k-k routing, and k-k sorting that are better on average and supply matching lower bounds. We also show that sorting can be performed in logarithmic time on a mesh with fixed buses. Most of our algorithms have considerably better time bounds than known algorithms for the same problems. (This research was supported in part by an NSF Research Initiation Award CCR-92-09260. Preliminary versions of some of the results in this paper were presented at the First Annual European Symposium on Algorithms, 1993.)
In this paper we propose a new approach to sorting. Sorting algorithms for serial computers (random-access machines, or RAMs) allow only one operation to be executed at a time. We instead investigate sorting algorithms based on a comparison-network model of computation, in which many comparison operations can be performed simultaneously.
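A standard example of a comparison network is odd-even transposition sort, in which each round is a layer of disjoint compare-exchange operations that a parallel machine could execute simultaneously (a generic illustration of the model, not necessarily the specific network this paper proposes):

```python
def odd_even_transposition_sort(a):
    """Comparison-network sort of n elements in n rounds. Each round is a
    layer of independent compare-exchanges on disjoint adjacent pairs, so
    on a parallel machine every comparison in a layer runs at once."""
    a = list(a)
    n = len(a)
    for rnd in range(n):
        start = rnd % 2  # alternate between even and odd layers
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:          # compare-exchange on pair (i, i+1)
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

Since the data-independent pattern of comparisons is fixed in advance, the same network sorts every input of length n, which is what distinguishes comparison networks from sequential comparison sorts on a RAM.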
Proceedings 16th International Parallel and Distributed Processing Symposium, 2002
Sorting can be sped up on parallel computers by dividing the data and processing the parts individually in parallel. Merge sort can be parallelized; however, the conventional algorithm implemented on distributed-memory computers has poor performance due to the successive halving of the number of active (non-idling) processors, down to one in the last merging stage. This paper presents a load-balanced parallel merge sort algorithm in which all processors participate in merging throughout the computation. Data are evenly distributed to all processors, and every processor is forced to work in every merging phase. Significant enhancement of the performance has been achieved. Our analysis shows the upper bound of the speedup of the merge time is (P − 1)/log P. We achieved a speedup of 9.6 (upper bound is 10.5) on a 32-processor Cray T3E when sorting 4M 32-bit integers. The same idea can be applied to parallelize other sorting algorithms.