2002, Lecture Notes in Computer Science
The emerging discipline of algorithm engineering has primarily focused on transforming pencil-and-paper sequential algorithms into robust, efficient, well tested, and easily used implementations. As parallel computing becomes ubiquitous, we need to extend algorithm engineering techniques to parallel computation. Such an extension adds significant complications. After a short review of algorithm engineering achievements for sequential computing, we review the various complications caused by parallel computing, present some examples of successful efforts, and give a personal view of possible future research.
We present a new parallel computation model called the Parallel Resource-Optimal (PRO) computation model. PRO is a framework being proposed to enable the design of efficient and scalable parallel algorithms in an architecture-independent manner, and to simplify the analysis of such algorithms. A focus on three key features distinguishes PRO from existing parallel computation models. First, the design and analysis of a parallel algorithm in the PRO model is performed relative to the time and space complexity of a sequential reference algorithm. Second, a PRO algorithm is required to be both time- and space-optimal relative to the reference sequential algorithm. Third, the quality of a PRO algorithm is measured by the maximum number of processors that can be employed while optimality is maintained. Inspired by the Bulk Synchronous Parallel model, an algorithm in the PRO model is organized as a sequence of supersteps. Each superstep consists of distinct computation and communication phases, but the supersteps are not required to be separated by synchronization barriers. Both computation and communication costs are accounted for in the runtime analysis of a PRO algorithm. Experimental results on parallel algorithms designed using the PRO model, and implemented using its accompanying programming environment SSCRAP, demonstrate that the model indeed delivers efficient and scalable implementations on a wide range of platforms.
Microelectronics Journal, 2001
This book grew out of lecture notes for a course on parallel algorithms that I gave at Drexel University over a period of several years. I was frustrated by the lack of texts that had the focus that I wanted. Although the book also addresses some architectural issues, the main focus is on the development of parallel algorithms on "massively parallel" computers. This book could be used in several versions of a course on Parallel Algorithms. We tend to focus on SIMD parallel algorithms in several general areas of application:
ACM SIGPLAN Notices, 2011
With core counts on the rise, the sequential components of applications are becoming the major bottleneck in performance scaling as predicted by Amdahl's law. We are therefore faced with the simultaneous problems of occupying an increasing number of cores and speeding up sequential sections. In this work, we reconcile these two seemingly incompatible problems with a novel programming model called N-way. The core idea behind N-way is to benefit from the algorithmic diversity available to express certain key computational steps. By simultaneously launching in parallel multiple ways to solve a given computation, a runtime can just-in-time pick the best (for example the fastest) way and therefore achieve speedup. Previous work has demonstrated the benefits of such an approach but has not addressed its inherent waste. In this work, we focus on providing a mathematically sound learning-based statistical model that can be used by a runtime to determine the optimal balance between resou...
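The bottleneck that the abstract above attributes to Amdahl's law can be made concrete with a short calculation. The following is a minimal sketch of the standard Amdahl formula only; the function name `amdahl_speedup` and the 10% serial fraction are illustrative assumptions and are not taken from the N-way work.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper bound on speedup when a fixed fraction of the work stays sequential.

    Amdahl's law: S(p) = 1 / (s + (1 - s) / p), where s is the serial
    fraction and p is the number of cores.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    # Hypothetical workload: 10% of the running time cannot be parallelized.
    for cores in (1, 4, 16, 64, 1024):
        print(cores, round(amdahl_speedup(0.1, cores), 2))
```

With a serial fraction of 0.1, the bound never reaches 10 no matter how many cores are added, which is why speeding up the sequential sections matters as core counts grow.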
Traditional parallel data-processing algorithms are based on a common method: according to some criterion (the number of cores, the processing speed, etc.), all the data to be processed is split into parts, and each part is then processed separately on a different core (or processor). Splitting the data takes considerable time and is not always an optimal solution. Core idle time cannot be reduced to a minimum, and it is not always possible to make an optimal choice (or to change the algorithm during execution). The algorithms we propose follow the principle that the cores should process the data in parallel while core idle time is kept to a minimum. This article reviews two algorithms that follow this principle: "smart-delay" and a further development of the transposed ribbon-like matrix multiplication algorithm.
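The traditional split-then-process scheme described in the abstract above can be sketched as follows. This is an illustrative sketch of the baseline only, not of the "smart-delay" or ribbon-like algorithms; the helper names `split_into_parts` and `parallel_sum` are hypothetical, and threads are used purely to show the structure (in CPython they do not give true CPU parallelism for this kind of work).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(data, n_parts):
    """Split data into n_parts contiguous chunks whose sizes differ by at most one."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    base, extra = divmod(len(data), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` parts receive one additional element.
        size = base + (1 if i < extra else 0)
        parts.append(data[start:start + size])
        start += size
    return parts


def parallel_sum(data, n_workers):
    """Sum each part on its own worker, then combine the partial results."""
    parts = split_into_parts(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, parts))
    return sum(partial_sums)


if __name__ == "__main__":
    print(split_into_parts(list(range(10)), 3))
    print(parallel_sum(list(range(101)), 4))
```

In this scheme the splitting step runs before any worker starts, and a worker that finishes its part early simply waits, which is the idle time the abstract sets out to reduce.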
This paper presents a 7-step, semi-systematic approach for designing and implementing parallel algorithms. In this paper, the target implementation uses MPI for message passing. The approach is applied to a family of matrix factorization algorithms (LU, QR, and Cholesky) that share a common structure, namely, that the second factor of each is upper right triangular. The efficacy of the approach is demonstrated by implementing, tuning, and timing execution on two commercially available multiprocessor computers.
International Conference on Parallel Processing, 1990
Siam Journal on Computing, 1989
Abstract. Techniques for parallel divide-and-conquer are presented, resulting in improved parallel algorithms for a number of problems. The problems for which improved algorithms are given include segment intersection detection, trapezoidal decomposition, and planar point location. Efficient parallel algorithms are also given for fractional cascading, three-dimensional maxima, two-set dominance counting, and visibility from a point. All of the algorithms presented run in O(log n) time with either a linear or a sublinear number of processors in the CREW PRAM model. Key words. parallel algorithms, parallel data structures, divide-and-conquer, computational geometry, fractional cascading, visibility, planar point location, trapezoidal decomposition, dominance, intersection detection. AMS(MOS) subject classifications. 68E05, 68C05, 68C15. 1. Introduction. This paper presents a number of general techniques for parallel divide-and-conquer. These techniques are based on nontrivial generalizations of Cole's recent parallel merge sort result [13] and enable us to achieve improved complexity bounds for a large number of problems. In particular, our techniques can be applied
Lecture Notes in Computer Science, 1996
Lecture Notes in Computer Science, 2012
For more than thirty years, the parallel programming community has used the dependence graph as the main abstraction for reasoning about and exploiting parallelism in "regular" algorithms that use dense arrays, such as finite-differences and FFTs. In this paper, we argue that the dependence graph is not a suitable abstraction for algorithms in new application areas like machine learning and network analysis in which the key data structures are "irregular" data structures like graphs, trees, and sets.
Theoretical Computer Science, 1990
Abstract. This paper outlines a theory of parallel algorithms that emphasizes two crucial aspects of parallel computation: speedup, the improvement in running time due to parallelism, and efficiency, the ratio of work done by a parallel algorithm to the work done by a sequential algorithm. We define six classes of algorithms in these terms; of particular interest is the class, EP, of algorithms that achieve a polynomial speedup with constant efficiency. The relations between these classes are examined. We investigate the robustness of these classes across various models of parallel computation. To do so, we examine simulations across models where the simulating machine may be smaller than the simulated machine. These simulations are analyzed with respect to their efficiency and to the reduction in the number of processors. We show that a large number of parallel computation models are related via efficient simulations, if a polynomial reduction of the number of processors is allowed. This implies that the class EP is invariant across all these models. Many open problems motivated by our approach are listed. 1. Introduction. As parallel computers become increasingly available, a theory of parallel algorithms is needed to guide the design of algorithms for such machines. To be useful, such a theory must address two major concerns in parallel computation, namely speedup and efficiency. It should classify algorithms and problems into a few, meaningful classes that are, to the largest extent possible, model independent. This paper outlines an approach to the analysis of parallel algorithms that we feel answers these concerns without sacrificing too much generality or abstractness. We propose a classification of parallel algorithms in terms of parallel running time and inefficiency, which is the extra amount of work done by a parallel algorithm as compared to a sequential algorithm.
Both running time and inefficiency are measured as a function of the sequential running time, which is used as a yardstick. * A preliminary version of this paper was presented at 15th International Colloquium on Automata,
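The two quantities the abstract above builds on, speedup and the ratio of parallel to sequential work, can be illustrated with a small numeric sketch. The paper's formal definitions are not reproduced in this excerpt, so the functions below are assumptions based on the common convention that the work of a parallel algorithm is the number of processors multiplied by its running time; the timings are hypothetical.

```python
def speedup(t_seq: float, t_par: float) -> float:
    """Improvement in running time due to parallelism: T_seq / T_par."""
    return t_seq / t_par


def parallel_work(t_par: float, processors: int) -> float:
    """Work of the parallel algorithm, assumed here to be processors * T_par."""
    return processors * t_par


def work_ratio(t_seq: float, t_par: float, processors: int) -> float:
    """Ratio of parallel work to sequential work (1.0 means no extra work)."""
    return parallel_work(t_par, processors) / t_seq


if __name__ == "__main__":
    # Hypothetical measurements: 1000 time units sequentially,
    # 125 time units on 16 processors.
    t_seq, t_par, p = 1000.0, 125.0, 16
    print("speedup:", speedup(t_seq, t_par))
    print("parallel work:", parallel_work(t_par, p))
    print("work ratio:", work_ratio(t_seq, t_par, p))
```

In this example the parallel run is 8 times faster but performs twice the sequential work, so a gain in speedup is paid for with extra work, the trade-off the classification in the abstract is designed to capture.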
Large-scale parallel computations are more common than ever, due to the increasing availability of multi-processor systems. However, writing parallel software is often a complicated and error-prone task. To relieve Diffpack users of the tedious and low-level technical details of parallel programming, we have designed a set of new software modules, tools, and programming rules, which will be the topic of
IEEE Transactions on Software Engineering, 1981
The best enterprises have both a compelling need pulling them forward and an innovative technological solution pushing them on. In high-performance computing, we have the need for increased computational power in many applications and the inevitable long-term solution is massive parallelism. In the short term, the relation between pull and push may seem unclear as novel algorithms and software are needed to support parallel computing.
Throughout the history of computing, sequential uniprocessor computers were used for years to solve scientific and business problems. To meet the demands of compute- and data-hungry applications, it was observed that better response times could be achieved only through parallelism. Large computational problems were partitioned and solved using multiple CPUs in parallel. Computing performance was further improved by adopting multi-core architectures, which provide hardware parallelism through multiple cores. Efficient utilization of the resources of a parallel computing environment through software and hardware parallelism is a major research challenge. Present hardware technologies give algorithm developers the freedom to control and manage resources in software, for example through thread-to-core mapping on recent multi-core processors. In this paper, a survey is presented from the beginning of parallel computing up to the use of present state-of-the-art multi-core...
14th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP'06), 2006
The Parallel Resource-Optimal (PRO) computation model was introduced by Gebremedhin et al. [2002] as a framework for the design and analysis of efficient parallel algorithms.
