Because the interconnection scheme among processors (or between processors and memory) significantly affects the running time, efficient parallel algorithms must take the interconnection scheme into account. This in turn entails tradeoffs between efficiency and portability among different architectures. Our goal is to develop algorithms that are portable among massively parallel fine grain architectures such as hypercubes, meshes, and pyramids, while yielding a fairly efficient implementation on each. Our approach is to utilize standardized operations such as prefix, broadcast, sort, compression, and crossproduct calculations. This paper describes an approach for designing efficient, portable algorithms and gives sample algorithms to solve some fundamental geometric problems. The difficulties of portability and efficiency for these geometric problems have been redirected into similar difficulties for the standardized operations. However, the cost of developing efficient implementations of them on the various target architectures can be amortized over numerous algorithms.
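As a rough illustration of how one standardized operation can be built from another, the sketch below expresses compression (packing the selected items into consecutive positions) in terms of prefix. It is a sequential Python simulation under assumed names (`inclusive_scan`, `compress`), not the implementation from the paper; on a real hypercube, mesh, or pyramid each round of the scan would be a communication step among processors.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def inclusive_scan(values: Sequence[T], op: Callable[[T, T], T]) -> List[T]:
    """Inclusive prefix over `values`, simulated round by round.

    Each round doubles the distance to the partner, so the loop runs
    about log2(n) times (hypothetical helper, for illustration only).
    """
    result = list(values)
    step = 1
    while step < len(result):
        previous = list(result)  # snapshot: all "processors" read old values
        for i in range(step, len(result)):
            result[i] = op(previous[i - step], previous[i])
        step *= 2
    return result


def compress(items: Sequence[T], keep: Callable[[T], bool]) -> List[T]:
    """Pack the items satisfying `keep` into consecutive slots, in order."""
    if not items:
        return []
    flags = [1 if keep(item) else 0 for item in items]
    # The prefix sum of the flags gives each kept item its 1-based output slot.
    positions = inclusive_scan(flags, lambda a, b: a + b)
    packed: List[T] = [None] * positions[-1]  # type: ignore[list-item]
    for item, flag, position in zip(items, flags, positions):
        if flag:
            packed[position - 1] = item
    return packed
```

Only `inclusive_scan` would need an architecture-specific implementation here; `compress` is written purely in terms of it, which is the kind of portability the abstract argues for.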
1989
We consider the problem of subsystem allocation in the mesh, torus, and hypercube multicomputers. Although the usual practice is to use a serial algorithm on the host processor to do the allocation, we show how the free and non-faulty processors can be used to perform the allocation in parallel. The algorithms we provide are dynamic, require very little storage, and work correctly even in the presence of faults.
Journal of Parallel and Distributed Computing, 1986
This paper presents new algorithms for solving some geometric problems on a shared memory parallel computer, where concurrent reads are allowed but no two processors can simultaneously attempt to write in the same memory location. The algorithms are quite different from known sequential algorithms, and are based on the use of a new parallel divide-and-conquer technique. One of our results is an O(log n) time, O(n) processor algorithm for the convex hull problem. Another result is an O(log n log log n) time, O(n) processor algorithm for the problem of selecting a closest pair of points among n input points.
Proceedings of the ACM on Computer Graphics and Interactive Techniques, 2018
Due to its flexibility, compute mode is becoming more and more attractive as a way to implement many of the algorithms part of a state-of-the-art rendering pipeline. A key problem commonly encountered in graphics applications is streaming vertex and geometry processing. In a typical triangle mesh, the same vertex is on average referenced six times. To avoid redundant computation during rendering, a post-transform cache is traditionally employed to reuse vertex processing results. However, such a vertex cache can generally not be implemented efficiently in software and does not scale well as parallelism increases. We explore alternative strategies for reusing per-vertex results on-the-fly during massively-parallel software geometry processing. Given an input stream divided into batches, we analyze the effectiveness of sorting, hashing, and intra-thread-group communication for identifying and exploiting local reuse potential. We design and present four vertex reuse strategies tailored...
Journal of Systems and …, 2008
This paper describes Cronus, a platform for parallelizing general nested loops. General nested loops contain complex loop bodies (assignments, conditionals, repetitions) and exhibit uniform loop-carried dependencies. The novelty of Cronus is twofold: (1) it determines the optimal scheduling hyperplane using the QuickHull algorithm, which is more efficient than previously used methods, and (2) it implements a simple and efficient dynamic rule (successive dynamic scheduling) for the runtime scheduling of the loop iterations along the optimal hyperplane. This scheduling policy enhances data locality and improves the makespan. Cronus provides an efficient runtime library, specifically designed for communication minimization, that performs better than more generic systems, such as Berkeley UPC. Its performance was evaluated through extensive testing. Three representative case studies are examined: the Floyd–Steinberg dithering algorithm, the Transitive Closure algorithm, and the FSBM motion estimation algorithm. The experimental results corroborate the efficiency of the parallel code. The tests show speedup ranging from 1.18 (out of the ideal 4) to 12.29 (out of the ideal 16) on distributed-systems and 3.60 (out of 4) to 15.79 (out of 16) on shared-memory systems. Cronus outperforms UPC by 5–95% depending on the test case.
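To put the reported speedups in perspective, the short sketch below converts them to parallel efficiency (speedup divided by processor count). The speedup figures and processor counts are taken from the abstract; the function name and the dictionary labels are assumptions made for illustration.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Fraction of the ideal (linear) speedup that was actually achieved."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup / processors


# (speedup, processor count) pairs quoted in the abstract; labels are ours.
reported = {
    "distributed, lowest": (1.18, 4),
    "distributed, highest": (12.29, 16),
    "shared memory, lowest": (3.60, 4),
    "shared memory, highest": (15.79, 16),
}

for label, (speedup, processors) in reported.items():
    efficiency = parallel_efficiency(speedup, processors)
    print(f"{label}: {efficiency:.1%} of ideal")
```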
2008
We present the design and implementation of an automatic polyhedral source-to-source transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical model-driven automatic transformation in the polyhedral model. Unlike previous polyhedral frameworks, our approach is an end-to-end fully automatic one driven by an integer linear optimization framework that takes an explicit view of finding good ways of tiling for parallelism and locality using affine transformations. The framework has been implemented into a tool to automatically generate OpenMP parallel code from C program sections. Experimental results from the tool show very high performance for local and parallel execution on multi-cores, when compared with state-of-the-art compiler frameworks from the research community as well as the best native production compilers. The system also enables the easy use of powerful empirical/iterative optimization for general arbitrarily nested loop sequences.
2003
Abstract ParLeda is a software library that provides the basic primitives needed for parallel implementation of computational geometry applications. It can also be used in implementing a parallel application that uses geometric data structures. The parallel model that we use is based on a new heterogeneous parallel model named HBSP which is based on BSP and is introduced here.
Parallel Computing, 2003
In application areas such as GIS, the Euclidean metric is often less meaningfully applied to determine a shortest path than metrics which capture, through weights, the varying nature of the terrain (e.g., water, rock, forest). Considering weighted metrics however increases the run-time of algorithms considerably suggesting the use of a parallel approach. In this paper, we provide a parallel implementation of shortest path algorithms for the Euclidean and weighted metrics on triangular irregular networks (i.e., a triangulated point set in which each point has an associated height value). To the best of our knowledge, this is the first parallel implementation of shortest path problems in these metrics. We provide a detailed discussion of the algorithmic issues and the factors related to data, machine, implementation determining the performance of parallel shortest path algorithms. We describe our parallel algorithm for weighted shortest paths, its implementation and performance for single-source and multiple-source instances. Our experiments were performed on standard architectures with different communication/computation characteristics, including PCs interconnected by a cross-bar switch using fast ethernet, a state-of-the-art Beowulf cluster with gigabit interconnect and a shared-memory architecture, SunFire.
2008
This paper describes CRONUS, a platform for parallelizing general nested loops. General nested loops contain complex loop bodies (assignments, conditionals, repetitions) and exhibit uniform loop-carried dependencies. The novelty of CRONUS is twofold: (1) it determines the optimal scheduling hyperplane using the QuickHull algorithm, which is more efficient than previously used methods, and (2) it implements a simple and efficient dynamic rule (successive dynamic scheduling) for the runtime scheduling of the loop iterations along the optimal hyperplane. This scheduling policy enhances data locality and improves the makespan. CRONUS provides an efficient runtime library, specifically designed for communication minimization, that performs better than more generic systems, such as Berkeley UPC. Its performance was evaluated through extensive testing. Three representative case studies are examined: the Floyd-Steinberg dithering algorithm, the Transitive Closure algorithm, and the FSBM motion estimation algorithm. The experimental results corroborate the efficiency of the parallel code. The tests show speedup ranging from 1.18 (out of the ideal 4) to 12.29 (out of the ideal 16) on distributed systems and 3.60 (out of 4) to 15.79 (out of 16) on shared-memory systems. CRONUS outperforms UPC by 5-95% depending on the test case.
… on Numerical Grid …, 2007
Scalable and locality-aware multiprocessor memory allocators are critical for harnessing the potential of emerging multithreaded and multicore architectures. This paper evaluates two state-of-the-art generic multithreaded allocators designed for both scalability and locality, against custom allocators, written to optimize the multithreaded implementation of parallel mesh generation algorithms. We use three algorithms that differ in their communication/synchronization requirements. The implementations of all three algorithms are heavily dependent on dynamically allocated pointer-based data structures and all three use optimized internal memory allocators based on application-specific knowledge. For our study we used memory allocators which are implemented and evaluated on two real multiprocessors with a multi-SMT (quad Hyperthreaded Intel) and a multi-CMP/SMT (dual IBM Power5) organization. Our results indicate that properly engineered generic memory allocators can come close to or sometimes exceed (in sequential allocation) the performance of custom multi-threaded allocators. These results suggest that in the near future we should be able to develop generic multi-threaded allocators that can adapt to application characteristics and increase productivity without compromising performance.
2018
As attention is focused upon the "time to solution", it becomes obvious that the entire process must be taken into account, not just the cost and efficiency of the solver. Amdahl's Law tells us that any serial portion of the application will be the limiting factor in scalability. Therefore it does not matter how efficient a solver is if both the pre- and post-processing have not been given the same focus towards scalability. The most obvious way to ensure that a scalable process exists is to view the process as an integrated whole and remove any serial portions. The work discussed in this paper makes geometry available in a parallel environment to support parallel mesh generation, solver-based grid adaptation, and the curving of linear meshes to support high(er) order spatial discretizations.
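The Amdahl's Law argument can be made concrete with a few lines of Python. This is a minimal sketch using the standard textbook form of the law, speedup = 1 / (s + (1 - s) / p); the function name and the 10% serial fraction are illustrative assumptions, not figures from the paper.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


# Assumed example: serial pre- and post-processing account for 10% of the run time.
for processors in (4, 16, 64, 1024):
    print(processors, round(amdahl_speedup(0.10, processors), 2))
```

With any nonzero serial fraction s, the speedup can never exceed 1/s regardless of processor count, which is why an efficient solver alone does not make the whole workflow scale.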
This paper discusses the implementation of a distributed geometry for parallel mesh generation, involving dynamic load-balancing and hence dynamic re-partitioning of the geometry. A novel approach is described for improving the efficiency of the distributed geometry interface when dealing with irregular shaped mesh partitions.
Microprocessors and Microsystems, 1992
The popular hypercube interconnection network has high wiring (VLSI) complexity. The reduced hypercube (RH) is obtained by a uniform reduction in the number of channels for each hypercube node in order to reduce the VLSI complexity. It is known that the RH achieves performance comparable to that of the hypercube, at much lower hardware cost, through hypercube emulation. The reduced complexity of the RH permits the construction of powerful, massively parallel computers. This paper proposes algorithms for data broadcasting and reduction, prefix computation, and sorting on the RH. These operations are fundamental to many parallel algorithms. A worst case analysis of each algorithm is given and compared with that of equivalent algorithms for the hypercube. It is shown that the proposed algorithms for the RH yield performance comparable to that of the hypercube. (S.G. Ziavras, A. Mukherjee, Parallel Computing 22 (1996) 595-606.)
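For reference, the baseline that such algorithms are compared against can be sketched as follows: the well-known one-to-all broadcast on an ordinary (full) hypercube, which finishes in as many rounds as the cube has dimensions. This is a sequential Python simulation with an assumed function name; it does not model the reduced hypercube or the paper's algorithms.

```python
from typing import List, Tuple


def hypercube_broadcast(dimension: int, root: int = 0) -> List[List[Tuple[int, int]]]:
    """Simulate one-to-all broadcast on a full hypercube with 2**dimension nodes.

    Returns one list of (sender, receiver) pairs per round. In round d,
    every node that already holds the message forwards it to its
    neighbor across dimension d, doubling the informed set.
    """
    informed = {root}
    rounds: List[List[Tuple[int, int]]] = []
    for d in range(dimension):
        sends = [(node, node ^ (1 << d)) for node in sorted(informed)]
        informed.update(receiver for _, receiver in sends)
        rounds.append(sends)
    return rounds


# Example: an 8-node (3-dimensional) hypercube, broadcasting from node 0.
for round_number, sends in enumerate(hypercube_broadcast(3)):
    print(round_number, sends)
```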
2000
In this work an efficient model for parallel computing, called Shuffled Mesh (SM), is introduced. This bounded degree model has the mesh as subgraph and it is based on the union of mesh and shuffle-exchange topologies. It is shown that an N-processor SM combines the features of mesh, shuffle-exchange, hypercubic networks, mesh of trees and hypercube, and is able to support all the algorithms designed for such topologies with constant or logarithmic time performance degradation. Finally, it is proved that the VLSI layout of a SM is the same as of a shuffle exchange of the same size.
Circuits, Systems, and Signal Processing, 1988
Parallel algorithms for solving geometric problems on two array processor models, the mesh-connected computer (MCC) and a two-dimensional systolic array, are presented. We illustrate a recursive divide-and-conquer paradigm for MCC algorithms by presenting a time-optimal solution for the problem of finding the nearest neighbors of a set of planar points represented by their Cartesian coordinates. The algorithm executes on a √n × √n MCC, and requires an optimal O(√n) time. An algorithm for constructing the convex hull of a set of planar points and an update algorithm for the disk placement problem on an n^(2/3) × n^(2/3) two-dimensional systolic array are presented. Both these algorithms require O(n^(2/3)) time steps. The advantage of the systolic solutions lies in their suitability for direct hardware implementation.
Sigplan Notices, 1994
The help project proposes a model of data-parallel programming that allows a programmer to develop an algorithm in a form as close as possible to the way they think about it. Usually, for many parts of a data-parallel program, the manipulations of data can be modeled as geometrical migrations inside a Cartesian reference space.
Parallel Computing, 1996
The popular hypercube interconnection network has high wiring (VLSI) complexity. The reduced hypercube (RH) is obtained by a uniform reduction in the number of channels for each hypercube node in order to reduce the VLSI complexity. It is known that the RH achieves performance comparable to that of the hypercube, at much lower hardware cost, through hypercube emulation. The reduced complexity of the RH permits the construction of powerful, massively parallel computers. This paper proposes algorithms for data broadcasting and reduction, prefix computation, and sorting on the RH. These operations are fundamental to many parallel algorithms. A worst case analysis of each algorithm is given and compared with that of equivalent algorithms for the hypercube. It is shown that the proposed algorithms for the RH yield performance comparable to that of the hypercube.
Algorithmica, 1992
We present parallel algorithms for some fundamental problems in computational geometry which have a running time of O(log n) using n processors, with very high probability (approaching 1 as n → ∞). These include planar-point location, triangulation, and trapezoidal decomposition. We also present optimal algorithms for three-dimensional maxima and two-set dominance counting by an application of integer sorting. Most of these algorithms run on a CREW PRAM model and have an optimal processor-time product, which improves on the previously best-known algorithms of Atallah and Goodrich [5] for these problems. The crux of these algorithms is a useful data structure which emulates the plane-sweeping paradigm used for sequential algorithms. We extend some of the techniques used by Reischuk [26] and Reif and Valiant [25] for the flashsort algorithm to perform divide and conquer in a plane very efficiently, leading to the improved performance of our approach.
This paper presents several parallel algorithms for the construction of the Delaunay triangulation in E² and E³, one of the fundamental problems in computer graphics. The proposed algorithms are designed for parallel systems with shared memory and several processors. Such a hardware configuration (especially the two-processor case) has become widespread in the computer graphics area in the last few years. Some of the proposed algorithms are easy to implement but not very efficient, while others show the opposite characteristics. Some of them are usable in E² only; others work in E³ as well. The algorithms themselves were already published in computer graphics venues, where the computer graphics criteria were highlighted. This paper concentrates on the parallel and systematic point of view and gives detailed information about the parallelization of a computational geometry application to the community oriented toward parallel and distributed computation.