2000, Parallel Computing - Fundamentals and Applications - Proceedings of the International Conference ParCo99
The choice of parallel programming models reflects a trade-off between the ability to express parallelism and the cost of efficiently optimizing the program for specific parallel machines. In particular, to be able to perform scheduling and/or cost estimation for real applications at acceptable cost, the coordination model must be highly structured, i.e., the related task graphs (DAGs) must be of series-parallel (SP) type. Given the choice of such a structured coordination model, a critical question is to what extent the ability to express parallelism is sacrificed by restricting parallelism to SP form. Previous work based on small, random topologies suggests that only for extremely unbalanced workload distributions does the relative increase of the critical path due to transforming a task graph to SP form exceed a relatively small constant. The research described in this paper focuses on large graphs with well-known topologies generated by program skeletons for regular problems (e.g., cellular automata, linear algebra solvers, macro-pipelines). We also analyze synthetic topologies to bring to light the main graph properties related to the loss of parallelism. Results show that several basic parameters of the graph, such as the maximum degree of parallelism, the depth, and the mean number of predecessors/successors per node, are the key factors. An analytical model that approximates the loss of parallelism when transforming two specific topologies to SP form confirms the experimental evidence. These results indicate that a wide range of parallel computations can be expressed using a structured coordination model with a loss of parallelism that is small and predictable.
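The quantity at issue can be written compactly. The notation below (cp for critical-path length, SP(G) for the series-parallel-ized graph, gamma for the relative increment) is assumed here for illustration rather than taken from the paper:

```latex
% Assumed notation: cp(G) is the critical-path length of task graph G under a
% given workload distribution; SP(G) is the graph after transformation to
% series-parallel form. The relative increment of the critical path is then
\[
  \gamma(G) \;=\; \frac{cp\bigl(SP(G)\bigr) - cp(G)}{cp(G)} ,
\]
% and the finding above is that gamma(G) exceeds a small constant only for
% extremely unbalanced workload distributions.
```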
Parallel Computing, 2009
The restricted synchronization structure of so-called structured parallel programming paradigms has an advantageous effect on programmer productivity, cost modeling, and scheduling complexity. However, imposing these restrictions can lead to a loss of parallelism compared to a programming approach that does not impose synchronization structure. In this paper we study the potential loss of parallelism when expressing parallel computations in a programming model that limits the computation graph (DAG) to a series-parallel topology, which characterizes all well-known structured programming models. We present an analytical model that approximately captures this loss of parallelism in terms of simple parameters related to DAG topology and workload distribution. We validate the model using a wide range of synthetic and real-world parallel computations running on shared- and distributed-memory machines. Although the loss of parallelism is theoretically unbounded, our measurements show that for all of the above applications the performance loss due to choosing a series-parallel structured model is invariably limited to at most 10%. In all cases, the loss of parallelism is predictable provided the topology and workload variability of the DAG are known.
Nested-parallelism programming models, in which the task graph associated with a computation is series-parallel, have good analysis properties that can be exploited for scheduling, cost estimation, or automatic mapping to different architectures. However, restricting synchronization structures may force the programmer to accept a potential loss of parallelism and performance compared with more generic solutions based on unstructured models (e.g., message passing). We present an experimental study of the impact of mapping regular and irregular applications to nested parallelism. New graph transformation techniques are applied to random and real application task graphs to investigate the potential performance degradation when converting them to series-parallel form. We extend previous results with interesting new application classes. Our conclusion is that a wide range of irregular applications can be expressed using a structured coordination model with a small loss of parallelism.
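As a concrete illustration of the quantity these abstracts study, the sketch below computes the critical path of a small weighted task DAG. The data structures, the example graph, and its weights are assumptions made here for illustration and are not code from the papers.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Dag {
    std::vector<double> weight;          // work of each task
    std::vector<std::vector<int>> succ;  // successors of each task
};

// Longest weighted path of a DAG whose nodes are topologically numbered
// (every edge goes from a lower index to a higher one).
double critical_path(const Dag& g) {
    const int n = static_cast<int>(g.weight.size());
    std::vector<double> finish(n, 0.0);  // holds max predecessor finish, then own finish
    double cp = 0.0;
    for (int i = 0; i < n; ++i) {
        finish[i] += g.weight[i];        // start time = latest predecessor finish
        cp = std::max(cp, finish[i]);
        for (int s : g.succ[i])
            finish[s] = std::max(finish[s], finish[i]);
    }
    return cp;
}

int main() {
    // A small non-series-parallel DAG ("N" shape): edges 0->2, 0->3, 1->3.
    Dag g;
    g.weight = {1.0, 1.0, 1.0, 4.0};
    g.succ   = {{2, 3}, {3}, {}, {}};
    std::printf("critical path = %.1f\n", critical_path(g));  // prints 5.0
}
```

With these balanced weights an SP-ization that joins {0,1} before {2,3} leaves the critical path unchanged; skewing the weights of the independent predecessors is what makes the ratio grow, matching the papers' observation about unbalanced workload distributions.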
2001
The Network of Tasks (NOT) model allows adaptive node programs written in a variety of parallel languages to be connected together in an almost-acyclic task graph. The main difference between NOT and other task graph models is that it is designed to make the performance of the graph predictable from knowledge of the performance of the component node programs and the visible structure of the graph. It can therefore be regarded as a coordination language that is transparent about performance. For large-scale computations distributed to geographically dispersed compute servers, the NOT model helps programmers to plan, assemble, schedule, and distribute their problems.
Lecture Notes in Computer Science, 2001
Microprocessing and Microprogramming, 1994
Many parallel algorithms can be modelled as directed acyclic task graphs. Recently, Degree of Simultaneousness (DS) and Degree of Connection (DC) have been defined as the two measures of parallelism in algorithms represented by task graphs. ...
1996
In this paper, we survey algorithms that allocate a parallel program represented by an edge-weighted directed acyclic graph (DAG), also called a task graph or macro-dataflow graph, to a set of homogeneous processors, with the objective of minimizing the completion time. We analyze 21 such algorithms and classify them into four groups. The first group includes algorithms that schedule the DAG to a bounded number of processors directly. These algorithms are called the bounded number of processors (BNP) scheduling algorithms. The algorithms in the second group schedule the DAG to an unbounded number of clusters and are called the unbounded number of clusters (UNC) scheduling algorithms. The algorithms in the third group schedule the DAG using task duplication and are called the task duplication based (TDB) scheduling algorithms. The algorithms in the fourth group perform allocation and mapping on arbitrary processor network topologies. These algorithms are called the arbitrary processor network (APN) scheduling algorithms. The design philosophies and principles behind these algorithms are discussed, and the performance of all of the algorithms is evaluated and compared on a unified basis using various scheduling parameters.
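To make the first of these classes concrete, here is a minimal, generic list-scheduling sketch in the BNP style (static bottom-level priorities, earliest-idle processor, communication costs ignored). It is an illustrative stand-in under those assumptions, not one of the 21 surveyed algorithms.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // Task weights and successor lists of a small, topologically numbered DAG.
    std::vector<double> w = {2, 3, 1, 4, 2};
    std::vector<std::vector<int>> succ = {{2, 3}, {3, 4}, {}, {}, {}};
    const int n = static_cast<int>(w.size()), P = 2;

    // Bottom level: blevel(i) = w[i] + max over successors' bottom levels.
    std::vector<double> blevel(n, 0.0);
    for (int i = n - 1; i >= 0; --i) {
        double best = 0.0;
        for (int s : succ[i]) best = std::max(best, blevel[s]);
        blevel[i] = w[i] + best;
    }

    // Indegrees, used to detect ready tasks.
    std::vector<int> indeg(n, 0);
    for (int i = 0; i < n; ++i)
        for (int s : succ[i]) ++indeg[s];

    // Static priority list: highest bottom level first.
    std::vector<int> order(n);
    for (int i = 0; i < n; ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return blevel[a] > blevel[b]; });

    std::vector<double> procFree(P, 0.0), finish(n, 0.0), ready(n, 0.0);
    std::vector<bool> done(n, false);
    for (int scheduled = 0; scheduled < n; ++scheduled) {
        // Pick the highest-priority task whose predecessors have all finished.
        int t = -1;
        for (int cand : order)
            if (!done[cand] && indeg[cand] == 0) { t = cand; break; }
        // Place it on the processor that becomes idle earliest.
        int p = static_cast<int>(
            std::min_element(procFree.begin(), procFree.end()) - procFree.begin());
        double start = std::max(procFree[p], ready[t]);
        finish[t] = start + w[t];
        procFree[p] = finish[t];
        done[t] = true;
        for (int s : succ[t]) {
            --indeg[s];
            ready[s] = std::max(ready[s], finish[t]);
        }
        std::printf("task %d -> proc %d  [%.1f, %.1f)\n", t, p, start, finish[t]);
    }
    std::printf("makespan = %.1f\n",
                *std::max_element(procFree.begin(), procFree.end()));
}
```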
1996
Many of today's high-level parallel languages support dynamic, fine-grained parallelism. These languages allow the user to expose all the parallelism in the program, which is typically of a much higher degree than the number of processors. Hence an efficient scheduling algorithm is required to assign computations to processors at runtime. Besides having low overheads and good load balancing, it is important for the scheduling algorithm to minimize the space usage of the parallel program. In this paper, we first present a general framework to model non-preemptive parallel computations based on task graphs, in which schedules of the graphs represent executions of the computations. We then prove bounds on the space and time requirements of certain classes of schedules that can be generated by an offline scheduler. Next, we present an online scheduling algorithm that is provably space-efficient and time-efficient for multithreaded computations with nested parallelism. If a seria...
Journal of Parallel and Distributed Computing, 1992
A model of parallel computation is introduced which employs the PRAM as a sub-model, while simultaneously being more reflective of realistic parallel architectures by accounting for and providing abstract control over communication and synchronization costs. Cost control is achieved via the representation of general degrees of locality ("neighborhoods" of activity). The model organizes "control asynchrony" via an implicit hierarchy relation, and restricts "communication asynchrony" in order to obtain determinate algorithms.
The International Journal of High Performance Computing Applications
We take a historical approach to our presentation of self-scheduled task parallelism, a programming model with its origins in early irregular and nondeterministic computations encountered in automated theorem proving and logic programming. We show how an extremely simple task model has evolved into a system, asynchronous dynamic load balancing (ADLB), and a scalable implementation capable of supporting sophisticated applications on today’s (and tomorrow’s) largest supercomputers; and we illustrate the use of ADLB with a Green’s function Monte Carlo application, a modern, mature nuclear physics code in production use. Our lesson is that by surrendering a certain amount of generality and thus applicability, a minimal programming model (in terms of its basic concepts and the size of its application programmer interface) can achieve extreme scalability without introducing complexity.
2012 SC Companion: High Performance Computing, Networking Storage and Analysis, 2012
Many sparse or irregular scientific computations are memory bound and benefit from locality-improving optimizations such as blocking or tiling. These optimizations result in asynchronous parallelism that can be represented by arbitrary task graphs. Unfortunately, most popular parallel programming models, with the exception of Threading Building Blocks (TBB), do not directly execute arbitrary task graphs. In this paper, we compare the programming and execution of arbitrary task graphs qualitatively and quantitatively in TBB, the OpenMP doall model, the OpenMP 3.0 task model, and Cilk Plus. We present performance and scalability results for 8-core and 40-core shared-memory systems on a sparse matrix iterative solver and a molecular dynamics benchmark.
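For readers unfamiliar with how an arbitrary (non-series-parallel) task graph is expressed directly, the sketch below uses OpenMP task dependences. The depend clause requires OpenMP 4.0 or later, a superset of the OpenMP 3.0 task model compared in the paper; the example is an assumption for illustration, not the paper's benchmark code.

```cpp
// Build and run a tiny non-series-parallel task graph with OpenMP task
// dependences. Compile with e.g. `g++ -fopenmp dag.cpp`.
#include <cstdio>

int main() {
    int a = 0, b = 0, c = 0, d = 0;  // per-task results, also used as dependence tokens
    #pragma omp parallel
    #pragma omp single
    {
        // DAG edges: A->C, A->D, B->D (the crossing dependence makes it non-SP).
        #pragma omp task depend(out: a)
        a = 1;                                      // task A
        #pragma omp task depend(out: b)
        b = 2;                                      // task B
        #pragma omp task depend(in: a) depend(out: c)
        c = a + 10;                                 // task C waits only for A
        #pragma omp task depend(in: a, b) depend(out: d)
        d = a + b;                                  // task D waits for A and B
        #pragma omp taskwait                        // wait for all child tasks
        std::printf("c=%d d=%d\n", c, d);           // prints c=11 d=3
    }
}
```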
Proceedings of the 19th international conference on Parallel architectures and compilation techniques - PACT '10, 2010
There has been a proliferation of task-parallel programming systems to address the requirements of multicore programmers. Current production task-parallel systems include Cilk++, Intel Threading Building Blocks, Java Concurrency, .Net Task Parallel Library, OpenMP 3.0, and current research task-parallel languages include Cilk, Chapel, Fortress, X10, and Habanero-Java (HJ). It is desirable for the programmer to express all the parallelism intrinsic to their algorithm in their code for forward scalability and portability, but the overhead incurred by doing so can be prohibitively large in today's systems. In this paper, we address the problem of reducing the total amount of overhead incurred by a program due to excessive task creation and termination. We introduce a transformation framework to optimize task-parallel programs with finish, forall and next statements. Our approach includes elimination of redundant task creation and termination operations as well as strength reduction of termination operations (finish) to lighter-weight synchronizations (next). Experimental results were obtained on three platforms: a dual-socket 128-thread (16-core) Niagara T2 system, a quad-socket 16-way Intel Xeon SMP and a quad-socket 32-way Power7 SMP. The results showed maximum speedups of 66.7×, 11.25× and 23.1× respectively on each platform, and performance improvements of 4.6×, 2.1× and 6.4× respectively in geometric mean relative to non-optimized parallel codes. The original benchmarks in this study were written with medium-grained parallelism; a larger relative improvement can be expected for programs written with finer-grained parallelism. However, even for the medium-grained parallel benchmarks studied in this paper, the significant improvement obtained by the transformation framework underscores the importance of the compiler optimizations introduced in this paper. We discuss differences between our transformation framework and past work on SPMDization transformations in Section 7.
Theoretical Computer Science, 1990
Abstract. This paper outlines a theory of parallel algorithms that emphasizes two crucial aspects of parallel computation: speedup, the improvement in running time due to parallelism, and efficiency, the ratio of work done by a parallel algorithm to the work done by a sequential algorithm. We define six classes of algorithms in these terms; of particular interest is the class EP, of algorithms that achieve a polynomial speedup with constant efficiency. The relations between these classes are examined. We investigate the robustness of these classes across various models of parallel computation. To do so, we examine simulations across models where the simulating machine may be smaller than the simulated machine. These simulations are analyzed with respect to their efficiency and to the reduction in the number of processors. We show that a large number of parallel computation models are related via efficient simulations, if a polynomial reduction of the number of processors is allowed. This implies that the class EP is invariant across all these models. Many open problems motivated by our approach are listed.

1. Introduction. As parallel computers become increasingly available, a theory of parallel algorithms is needed to guide the design of algorithms for such machines. To be useful, such a theory must address two major concerns in parallel computation, namely speedup and efficiency. It should classify algorithms and problems into a few, meaningful classes that are, to the largest extent possible, model independent. This paper outlines an approach to the analysis of parallel algorithms that we feel answers these concerns without sacrificing too much generality or abstractness. We propose a classification of parallel algorithms in terms of parallel running time and inefficiency, which is the extra amount of work done by a parallel algorithm as compared to a sequential algorithm. Both running time and inefficiency are measured as a function of the sequential running time, which is used as a yardstick.
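The two quantities the classification rests on have the usual textbook definitions; the notation below (T_1, T_p, S_p, E_p) is assumed here and may differ in detail from the paper's.

```latex
% Assumed notation: T_1 = sequential running time, T_p = parallel running time
% on p processors.
\[
  S_p \;=\; \frac{T_1}{T_p}, \qquad
  E_p \;=\; \frac{S_p}{p} \;=\; \frac{T_1}{p\,T_p}, \qquad
  \text{inefficiency} \;=\; \frac{p\,T_p}{T_1} \;=\; \frac{1}{E_p}.
\]
% Roughly, the class EP then consists of algorithms whose speedup S_p is
% polynomial in the sequential time while the efficiency E_p remains bounded
% below by a constant.
```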
2010
This thesis reviews selected topics from the theory of parallel computation. The research begins with a survey of the proposed models of parallel computation. It examines the characteristics of each model and it discusses its use either for theoretical studies, or for practical applications. Subsequently, it employs common simulation techniques to evaluate the computational power of these models. The simulations establish certain model relations before advancing to a detailed study of the parallel complexity theory, which is the subject of the second part of this thesis. The second part examines classes of feasible highly parallel problems and it investigates the limits of parallelization. It is concerned with the benefits of the parallel solutions and the extent to which they can be applied to all problems. It analyzes the parallel complexity of various well-known tractable problems and it discusses the automatic parallelization of the efficient sequential algorithms. Moreover, it ...
There is a clear industrial trend towards chip multiprocessors (CMP) as the most power-efficient way of further increasing performance. Heterogeneous CMP architectures take one more step along this power-efficiency trend by using multiple types of processors, tailored to the workloads they will execute. Programming these CMP architectures has been identified as one of the main challenges in the near future, and programming heterogeneous systems is even more challenging. High-level programming models that allow the programmer to identify parallel tasks, combined with runtime management of the inter-task dependencies, have been identified as a suitable approach for programming such heterogeneous CMP architectures.
Integrated Research in Grid Computing, CoreGRID, 2008
2001
Large scale parallel programming projects may become heterogeneous in both language and architectural model. We propose that skeletal programming techniques can alleviate some of the costs involved in designing and porting such programs, illustrating our approach with a simple program which combines shared memory and message passing code. We introduce Activity Graphs as a simple and practical means of capturing model independent aspects of the operational semantics of skeletal parallel programs. They are independent of low level details of parallel implementation and so can act as an intermediate layer for compilation to diverse underlying models. Activity graphs provide a notion of parallel activities, dependencies between activities, and the process groupings within which these take place. The compilation process uses a set of graph generators (templates) to derive the activity graph. We describe simple schemes for transforming activity graphs into message passing programs, targeting both MPI and BSP.
Lecture Notes in Computer Science, 1996
In this paper we propose methods to optimize the speedup that can be obtained for a parallel program in a distributed system by modelling the assignment of the tasks of a parallel program as a graph partitioning problem. The tasks (sets of instructions that must be executed sequentially) which compose the program are represented by weighted nodes, and the arcs of the graph represent the precedence order between tasks. Because this problem is in general NP-hard, we propose and investigate several heuristic algorithms and compare their performance. The approaches we present are: a neural network based algorithm (based on the random neural model of Gelenbe), an algorithm based on simulated annealing, and a genetic algorithm based heuristic.
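Of the three heuristics named above, the simulated-annealing one is the simplest to sketch. The cost function (load imbalance plus communication cut), the cooling schedule, and the toy graph below are generic assumptions for illustration, not the paper's exact formulation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

struct Edge { int u, v; double comm; };  // precedence/communication arc with a cost

// Assumed objective: heaviest processor load plus total inter-processor communication.
double cost(const std::vector<int>& part, const std::vector<double>& w,
            const std::vector<Edge>& edges, int P) {
    std::vector<double> load(P, 0.0);
    for (size_t i = 0; i < part.size(); ++i) load[part[i]] += w[i];
    double makespan = *std::max_element(load.begin(), load.end());
    double cut = 0.0;
    for (const Edge& e : edges)
        if (part[e.u] != part[e.v]) cut += e.comm;  // arc crosses the partition
    return makespan + cut;
}

int main() {
    const int P = 2;                                     // number of processors
    std::vector<double> w = {2, 3, 1, 4, 2};             // task weights
    std::vector<Edge> edges = {{0, 2, 1}, {0, 3, 1}, {1, 3, 2}, {1, 4, 1}};

    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pickTask(0, (int)w.size() - 1), pickProc(0, P - 1);
    std::uniform_real_distribution<double> unit(0.0, 1.0);

    std::vector<int> part(w.size(), 0);                  // start with everything on proc 0
    double cur = cost(part, w, edges, P);
    for (double T = 5.0; T > 0.01; T *= 0.95) {          // geometric cooling schedule
        std::vector<int> next = part;
        next[pickTask(rng)] = pickProc(rng);             // random single-task move
        double c = cost(next, w, edges, P);
        // Accept improvements always, worse moves with Boltzmann probability.
        if (c < cur || unit(rng) < std::exp((cur - c) / T)) { part = next; cur = c; }
    }
    std::printf("cost %.1f, assignment:", cur);
    for (int p : part) std::printf(" %d", p);
    std::printf("\n");
}
```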
2003
In this paper, we consider the execution of a complex application on a heterogeneous "grid" computing platform. The complex application consists of a suite of identical, independent problems to be solved. In turn, each problem consists of a set of tasks. There are dependences (precedence constraints) between these tasks. A typical example is the repeated execution of the same algorithm on several distinct data samples. We use a non-oriented graph to model the grid platform, where resources have different speeds of computation and communication. We show how to determine the optimal steady-state scheduling strategy for each processor (the fraction of time spent computing and the fraction of time spent communicating with each neighbor) and how to build such a schedule. This result holds for a quite general framework, allowing for cycles and multiple paths in the platform graph.