The Network of Tasks (NOT) model allows adaptive node programs written in a variety of parallel languages to be connected together in an almost acyclic task graph. The main difference between NOT and other task graphs is that it is designed to make the performance of the graph predictable from knowledge of the performance of the component node programs and the visible structure of the graph. It can therefore be regarded as a coordination language that is transparent about performance. For large-scale computations that are spread across geographically distributed compute servers, the NOT model helps programmers to plan, assemble, schedule, and distribute their problems.
In this paper, we consider the execution of a complex application on a heterogeneous "grid" computing platform. The complex application consists of a suite of identical, independent problems to be solved. In turn, each problem consists of a set of tasks. There are dependences (precedence constraints) between these tasks. A typical example is the repeated execution of the same algorithm on several distinct data samples. We use a non-oriented graph to model the grid platform, where resources have different speeds of computation and communication. We show how to determine the optimal steady-state scheduling strategy for each processor (the fraction of time spent computing and the fraction of time spent communicating with each neighbor) and how to build such a schedule. This result holds for a quite general framework, allowing for cycles and multiple paths in the platform graph.
Parallel Computing - Fundamentals and Applications - Proceedings of the International Conference ParCo99, 2000
The choice of parallel programming models reflects a trade-off between the ability to express parallelism and the cost associated with the efficient optimization of the program to specific parallel machines. In particular, to be able to perform scheduling and/or cost estimation for real applications at any acceptable cost, the coordination model must be highly structured, i.e., the related task graphs (DAG) must be of series-parallel (SP) type. Assuming the choice for such a structured coordination model, a critical question is to what extent the ability to express parallelism is sacrificed by restricting parallelism to SP form. Previous work based on small, random topologies suggests that only for extremely unbalanced workload distributions does the relative increase in the critical path caused by transforming a task graph to SP form exceed a relatively small constant. The research described in this paper is mainly focused on huge graphs with well-known topologies generated by program skeletons for regular problems (e.g., cellular automata, linear algebra solvers, macro-pipelines). We also analyze synthetic topologies to bring to light the main graph properties that are related to the loss of parallelism. Results show that several basic parameters of the graph, such as the maximum degree of parallelism, the depth, and the mean number of predecessors/successors per node, are the key factors. An analytical model to approximate the loss of parallelism while transforming a couple of specific topologies to SP form confirms the experimental evidence. These results indicate that a wide range of parallel computations can be expressed using a structured coordination model with a loss of parallelism that is small and predictable.
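To make the "relative increase in the critical path" concrete, here is a minimal Python sketch. It assumes one simple way of forcing a DAG into series-parallel form, namely placing a barrier between successive levels of the graph; this is an illustrative choice and not necessarily the transformation studied in the paper. The function names and the example graph are hypothetical.

```python
from collections import defaultdict

def node_levels(weights, edges):
    """Return (level, finish): hop depth of each node and the length of the
    longest weighted path ending at (and including) each node."""
    pred, succ = defaultdict(list), defaultdict(list)
    indeg = {v: 0 for v in weights}
    for u, v in edges:
        pred[v].append(u)
        succ[u].append(v)
        indeg[v] += 1
    ready = [v for v in weights if indeg[v] == 0]
    level, finish = {}, {}
    while ready:  # Kahn-style traversal: a node is popped after all its predecessors
        v = ready.pop()
        level[v] = 1 + max((level[u] for u in pred[v]), default=-1)
        finish[v] = weights[v] + max((finish[u] for u in pred[v]), default=0)
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    return level, finish

def critical_path(weights, edges):
    """Critical path length of the original DAG (sum of node weights)."""
    _, finish = node_levels(weights, edges)
    return max(finish.values())

def layered_critical_path(weights, edges):
    """Critical path after inserting a barrier between consecutive levels
    (assumed SP-ization): each level costs as much as its slowest task."""
    level, _ = node_levels(weights, edges)
    slowest = defaultdict(int)
    for v, lvl in level.items():
        slowest[lvl] = max(slowest[lvl], weights[v])
    return sum(slowest.values())

def relative_increase(weights, edges):
    return layered_critical_path(weights, edges) / critical_path(weights, edges) - 1.0

# The "N"-shaped graph is the smallest DAG that is not series-parallel.
n_edges = [("a", "b"), ("c", "d"), ("a", "d")]
balanced = {"a": 1, "b": 1, "c": 1, "d": 1}
unbalanced = {"a": 1, "b": 5, "c": 5, "d": 1}
print(relative_increase(balanced, n_edges))
print(relative_increase(unbalanced, n_edges))
```

With equal task weights the barrier costs nothing, whereas the unbalanced weights lengthen the critical path, which matches the qualitative claim that workload imbalance drives the loss.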
Parallel Processing Letters, 1995
We present a model of parallel computation, the parameterized task graph, which is a compact, problem size independent, representation of some frequently used directed acyclic task graphs. Techniques automating the construction of such a representation, starting from an annotated sequential program are proposed. We show that many important properties of the task graph such as the computational load of the nodes and the communication volume of the edges can be automatically deduced in a problem size independent way.
In this paper, we survey algorithms that allocate a parallel program represented by an edge-weighted directed acyclic graph (DAG), also called a task graph or macrodataflow graph, to a set of homogeneous processors, with the objective of minimizing the completion time. We analyze 21 such algorithms and classify them into four groups. The first group includes algorithms that schedule the DAG to a bounded number of processors directly. These algorithms are called the bounded number of processors (BNP) scheduling algorithms. The algorithms in the second group schedule the DAG to an unbounded number of clusters and are called the unbounded number of clusters (UNC) scheduling algorithms. The algorithms in the third group schedule the DAG using task duplication and are called the task duplication based (TDB) scheduling algorithms. The algorithms in the fourth group perform allocation and mapping on arbitrary processor network topologies. These algorithms are called the arbitrary processor network (APN) scheduling algorithms. The design philosophies and principles behind these algorithms are discussed, and the performance of all of the algorithms is evaluated and compared against each other on a unified basis by using various scheduling parameters.
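As an illustration of the first group (scheduling directly onto a bounded number of processors), the following Python sketch implements a generic list scheduler. It is not one of the 21 surveyed algorithms reproduced faithfully; it assumes a static-level priority (longest computation path to an exit node) and an earliest-start-time processor choice, with communication cost charged only when producer and consumer sit on different processors. The function name and the example graph are hypothetical.

```python
from collections import defaultdict

def list_schedule(weights, comm, num_procs):
    """weights: node -> computation cost; comm: (u, v) -> communication cost.
    Returns (start times, processor assignment, makespan)."""
    pred, succ = defaultdict(list), defaultdict(list)
    for u, v in comm:
        pred[v].append(u)
        succ[u].append(v)

    level = {}
    def static_level(v):
        # Longest chain of computation costs from v to an exit node (memoized).
        if v not in level:
            level[v] = weights[v] + max((static_level(w) for w in succ[v]), default=0)
        return level[v]

    proc_free = [0] * num_procs
    start, finish, placed = {}, {}, {}
    unscheduled = set(weights)
    while unscheduled:
        # A node is ready once all of its predecessors have been scheduled.
        ready = [v for v in unscheduled if all(u in finish for u in pred[v])]
        # Highest static level first; the node name breaks ties deterministically.
        v = max(ready, key=lambda n: (static_level(n), n))
        best_start, best_proc = None, None
        for p in range(num_procs):
            # Data from a predecessor on another processor arrives after comm cost.
            data_ready = max(
                (finish[u] + (0 if placed[u] == p else comm[(u, v)]) for u in pred[v]),
                default=0,
            )
            candidate = max(proc_free[p], data_ready)
            if best_start is None or candidate < best_start:
                best_start, best_proc = candidate, p
        start[v] = best_start
        finish[v] = best_start + weights[v]
        placed[v] = best_proc
        proc_free[best_proc] = finish[v]
        unscheduled.remove(v)
    return start, placed, max(finish.values())

# Fork-join example: a -> {b, c} -> d, unit communication costs.
weights = {"a": 2, "b": 3, "c": 3, "d": 1}
comm = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 1, ("c", "d"): 1}
print(list_schedule(weights, comm, num_procs=2))
```

The UNC, TDB, and APN groups relax or extend this setting (unbounded clusters, duplicated tasks, explicit network topology) and are not covered by the sketch.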
Lecture Notes in Computer Science, 2001
Journal of Computer and System Sciences, 2003
We present a polynomial time algorithm for precedence-constrained scheduling problems in which the task graph can be partitioned into large disjoint parts by removing edges with high float, where the float of an edge is defined as the difference between the length of the longest path in the graph and the length of the longest path containing the edge. Our algorithm guarantees schedules within a factor of 1.875 of the optimal, independent of the number of processors. The best-known factor for this problem, and in general, is 2 - 2/p, where p is the number of processors, due to Coffman-Graham. Our algorithm is unusual and considerably different from that of Coffman-Graham and other algorithms in the literature.
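The definition of edge float given above can be computed with two longest-path passes over the DAG. The Python sketch below is only an illustration of that definition, assuming path length is the sum of the node processing times along the path; the function name and the example graph are hypothetical, and the scheduling algorithm itself is not shown.

```python
from collections import defaultdict

def edge_floats(weights, edges):
    """weights: node -> processing time; edges: list of (u, v) pairs.
    Returns {edge: float}, where float = longest path in the graph
    minus the longest path that contains the edge."""
    pred, succ = defaultdict(list), defaultdict(list)
    indeg = {v: 0 for v in weights}
    for u, v in edges:
        pred[v].append(u)
        succ[u].append(v)
        indeg[v] += 1

    # Topological order (Kahn's algorithm).
    ready = [v for v in weights if indeg[v] == 0]
    order = []
    while ready:
        u = ready.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    # head[v]: longest path ending at v; tail[v]: longest path starting at v.
    head, tail = {}, {}
    for v in order:
        head[v] = weights[v] + max((head[u] for u in pred[v]), default=0)
    for v in reversed(order):
        tail[v] = weights[v] + max((tail[w] for w in succ[v]), default=0)

    longest = max(head.values())
    # The longest path through edge (u, v) is head[u] followed by tail[v].
    return {(u, v): longest - (head[u] + tail[v]) for u, v in edges}

weights = {"a": 2, "b": 3, "c": 1, "d": 2}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(edge_floats(weights, edges))
```

Edges on a critical path have float 0; edges with high float are the ones the abstract proposes removing to split the graph into large disjoint parts.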
Journal of Parallel and Distributed Computing, 2002
This paper presents BCL, a border-based coordination language focused on the solution of numerical applications. Our approach provides a simple parallelism model. Coordination and computational aspects are clearly separated. The former are established using the coordination language and the latter are coded using HPF (together with only a few extensions related to coordination). This way, we have a coordinator process that is in charge of both creating the different HPF tasks and establishing the communication and synchronization scheme among them. In the coordination part, processor and data layouts are also specified. Data distribution belonging to the different HPF tasks is known at the coordination level. This is the key for an efficient implementation of the communication among them. Besides that, our system implementation requires no change to the runtime support of the underlying HPF compiler. By means of some examples, the suitability and expressiveness of the language are shown. Some experimental results also demonstrate the efficiency of the model. © 2002 Elsevier Science (USA)
Languages and Compilers for Parallel Computing, 2000
In this paper, we present a new powerful method for parallel program representation called Data Driven Graph (DDG). DDG retains all the advantages of the classical Directed Acyclic Graph (DAG) and adds more: a simple definition, flexibility, and the ability to represent loops and dynamically created tasks. With DDG, scheduling becomes an efficient tool for increasing the performance of parallel systems. DDG is not only a parallel program model; it also initiates a new parallel programming style, allowing programmers to write a parallel program with minimal difficulty. We also present our parallel program development tool with support for DDG and scheduling.
Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2016
Declarative programming has been hailed as a promising approach to parallel programming since it makes it easier to reason about programs while hiding the implementation details of parallelism from the programmer. However, its advantage is also its disadvantage as it leaves the programmer with no straightforward way to optimize programs for performance. In this paper, we introduce Coordinated Linear Meld (CLM), a concurrent forward-chaining linear logic programming language, with a declarative way to coordinate the execution of parallel programs allowing the programmer to specify arbitrary scheduling and data partitioning policies. Our approach allows the programmer to write graph-based declarative programs and then optionally to use coordination to fine-tune parallel performance. In this paper we specify the set of coordination facts, discuss their implementation in a parallel virtual machine, and show, through examples, how they can be used to optimize parallel execution. We compare the execution of CLM programs against the original uncoordinated Linear Meld and several other frameworks.
Journal of Parallel and Distributed Computing, 1999
The problem of scheduling a parallel program represented by a weighted directed acyclic graph (DAG) to a set of homogeneous processors for minimizing the completion time of the program has been extensively studied. The NP-completeness of the problem has stimulated researchers to propose a myriad of heuristic algorithms. While most of these algorithms are reported to be efficient, it is not clear how they compare against each other. A meaningful performance evaluation and comparison of these algorithms is a complex task and it must take into account a number of issues. First, most scheduling algorithms are based upon diverse assumptions, making the performance comparison rather purposeless. Second, there does not exist a standard set of benchmarks to examine these algorithms. Third, most algorithms are evaluated using small problem sizes, and, therefore, their scalability is unknown. In this paper, we first provide a taxonomy for classifying various algorithms into distinct categories according to their assumptions and functionalities. We then propose a set of benchmarks that are based on diverse structures and are not biased towards a particular scheduling technique. We have implemented 15 scheduling algorithms and compared them on a common platform by using the proposed benchmarks as well as by varying important problem parameters. We interpret the results based upon the design philosophies and principles behind these algorithms, drawing inferences why some algorithms perform better than the others. We also propose a performance measure called the scheduling scalability (SS) that captures the collective effectiveness of a scheduling algorithm in terms of its solution quality, the number of processors used, and the running time.
The prevalence of multicore processors is bound to drive most kinds of software development towards parallel programming. To limit the difficulty and overhead of parallel software design and maintenance, it is crucial that parallel programming models allow an easy-to-understand, concise and dense representation of parallelism. Parallel programming models such as Cilk++ and Intel TBB attempt to offer a better, higher-level abstraction for parallel programming than threads and locking synchronization. It is not straightforward, however, to express all patterns of parallelism in these models. Pipelines are an important parallel construct, yet they are difficult to express in Cilk and TBB without a verbose restructuring of the code. In this paper we demonstrate that pipeline parallelism can be easily and concisely expressed in a Cilk-like language, which we extend with input, output and input/output dependency types on procedure arguments, enforced at runtime by the scheduler. We evaluate our implementation on real applications and show that our Cilk-like scheduler, extended to track and enforce these dependencies, has performance comparable to Cilk++.
Microprocessing and microprogramming, 1994
Many parallel algorithms can be modelled as directed acyclic task graphs. Recently, Degree of Simultaneousness (DS) and Degree of Connection (DC) have been defined as the two measures of parallelism in algorithms represented by task graphs. ...
Large scale parallel programming projects may become heterogeneous in both language and architectural model. We propose that skeletal programming techniques can alleviate some of the costs involved in designing and porting such programs, illustrating our approach with a simple program which combines shared memory and message passing code. We introduce Activity Graphs as a simple and practical means of capturing model independent aspects of the operational semantics of skeletal parallel programs. They are independent of low level details of parallel implementation and so can act as an intermediate layer for compilation to diverse underlying models. Activity graphs provide a notion of parallel activities, dependencies between activities, and the process groupings within which these take place. The compilation process uses a set of graph generators (templates) to derive the activity graph. We describe simple schemes for transforming activity graphs into message passing programs, targeting both MPI and BSP.
Lecture Notes in Computer Science, 1999
This paper presents a new task manager aimed at general-purpose small-scale multiprocessors. It is designed to exploit complex parallelism structures in general applications. The manager uses an explicit parallelism encoding based on a topological description of the task dependence graph. A dedicated task manager decodes the parallelism information and builds a structured representation of the dependence graph in a queue bank. The queue bank design allows efficient extraction of the tasks that are ready to execute and implements synchronization between consecutive tasks. Simulations performed on SPLASH and image processing benchmarks validate the parallelism exploitation. Results show that, for complex parallel structures, performance can be significantly improved.
Lecture Notes in Computer Science, 2013
Programming parallel machines as effectively as sequential ones would ideally require a language that provides high-level programming constructs to avoid the programming errors frequent when expressing parallelism. Since task parallelism is considered more error-prone than data parallelism, we survey six popular and efficient parallel language designs that tackle this difficult issue: Cilk, Chapel, X10, Habanero-Java, OpenMP and OpenCL. Using a parallel implementation of the computation of the Mandelbrot set as a single running example, this paper describes how the fundamentals of task parallel programming, i.e., collective and point-to-point synchronization and mutual exclusion, are dealt with in these languages. We discuss how these languages allocate and distribute data over memory. Our study suggests that, even though there are many keywords and notions introduced by these languages, they all boil down, as far as control issues are concerned, to three key task concepts: creation, synchronization and atomicity. Regarding memory models, these languages adopt one of three approaches: shared memory, message passing and PGAS (Partitioned Global Address Space). The paper is designed to give users and language and compiler designers an up-to-date comparative overview of current parallel languages. Recent programming models explore the best trade-offs between expressiveness and performance when addressing parallelism. Traditionally, there are two general ways to break an application into concurrent parts in order to take advantage of a parallel computer and execute them simultaneously on different CPUs: data and task parallelisms.
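The three task concepts named above (creation, synchronization, atomicity) can be seen in a small Mandelbrot computation. The sketch below uses Python's standard `concurrent.futures` and `threading` modules rather than any of the six surveyed languages, so it only illustrates the concepts, not how those languages spell them. The function names, grid size, and iteration limit are hypothetical choices.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def escape_iteration(c, max_iter=50):
    """Return the iteration at which z -> z*z + c leaves the disk |z| <= 2,
    or max_iter if it never does within the limit."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter

def count_inside(width=40, height=20, max_iter=50, workers=4):
    """Count grid points that stay bounded, one task per image row."""
    inside = 0
    lock = threading.Lock()

    def row_task(y):
        nonlocal inside
        im = -1.0 + 2.0 * y / (height - 1)
        local = 0
        for x in range(width):
            re = -2.0 + 3.0 * x / (width - 1)
            if escape_iteration(complex(re, im), max_iter) == max_iter:
                local += 1
        # Atomicity: the shared counter is updated under mutual exclusion.
        with lock:
            inside += local

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Creation: one task is spawned per row.
        futures = [pool.submit(row_task, y) for y in range(height)]
        # Synchronization: wait for every task (and re-raise any exception).
        for future in futures:
            future.result()
    return inside

print(count_inside())
```

Accumulating into a per-task `local` counter and taking the lock once per row keeps contention low; the result does not depend on the number of workers.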
IEEE Transactions on Parallel and Distributed Systems, 2000
Task scheduling is an essential aspect of parallel programming. Most heuristics for this NP-hard problem are based on a simple system model that assumes fully connected processors and concurrent interprocessor communication. Hence, contention for communication resources is not considered in task scheduling, yet it has a strong influence on the execution time of a parallel program. This paper investigates the incorporation of contention awareness into task scheduling. A new system model for task scheduling is proposed, allowing us to capture both end-point and network contention. To achieve this, the communication network is reflected by a topology graph for the representation of arbitrary static and dynamic networks. The contention awareness is accomplished by scheduling the communications, represented by the edges in the task graph, onto the links of the topology graph. Edge scheduling is theoretically analyzed, including aspects like heterogeneity, routing, and causality. The proposed contention-aware scheduling preserves the theoretical basis of task scheduling. It is shown how classic list scheduling is easily extended to this more accurate system model. Experimental results show the significantly improved accuracy and efficiency of the produced schedules.
Parallel programming is becoming mainstream due to the increased availability of multiprocessor and multicore architectures and the need to solve larger and more complex problems. Languages and tools available for the development of parallel applications are often difficult to learn and use. The Standard Template Adaptive Parallel Library (STAPL) is being developed to help programmers address these difficulties. STAPL is a parallel C++ library with functionality similar to STL, the ISO adopted C++ Standard Template Library. STAPL provides a collection of parallel pContainers for data storage and pViews that provide uniform data access operations by abstracting away the details of the pContainer data distribution. Generic pAlgorithms are written in terms of PARAGRAPHs, high level task graphs expressed as a composition of common parallel patterns. These task graphs define a set of operations on pViews as well as any ordering (i.e., dependences) on these operations that must be enforce...
International Journal of Computational Science and Engineering, 2005
A set of communication operations is defined which allows a form of task parallelism to be achieved in a data parallel architecture. The set of processors can be subdivided recursively into groups, and a communication operation inside a group never conflicts with communications taking place in other groups. The groups may be subdivided and recombined at any time, allowing the task structure to adapt to the needs of the data. The algorithms implementing the grouping and communications are defined using parallel scans and folds which can be executed efficiently in an abstract tree machine. This approach is best suited for massively parallel systems with fine grain processors.
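One standard way to keep a scan confined to processor groups is the segmented scan, in which a flag marks the first element of each group. The Python sketch below is a sequential illustration of that general technique, not the paper's own communication operations; the names `seg_op` and `segmented_sum_scan` are hypothetical. The point of interest is that the combining operator on (flag, value) pairs is associative, which is what allows such a scan to be evaluated on a tree.

```python
from itertools import accumulate

def seg_op(left, right):
    """Associative operator on (flag, value) pairs for a segmented sum.
    If the right element starts a new group, the running sum restarts there."""
    left_flag, left_val = left
    right_flag, right_val = right
    return (left_flag or right_flag,
            right_val if right_flag else left_val + right_val)

def segmented_sum_scan(values, flags):
    """Inclusive prefix sums that never cross a group boundary.
    flags[i] is True when element i is the first member of its group."""
    pairs = zip(flags, values)
    return [value for _, value in accumulate(pairs, seg_op)]

values = [1, 2, 3, 4, 5, 6]
flags = [True, False, False, True, False, True]  # groups: [1,2,3] [4,5] [6]
print(segmented_sum_scan(values, flags))  # [1, 3, 6, 4, 9, 6]
```

Regrouping the processors then amounts to changing the flag vector, with no change to the scan itself.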
Several approaches for handling the multi-processor scheduling problem have been presented over the last decades, such as scheduling techniques, clustering techniques and task merging techniques. However, the increasing gap between processor speed and communication speed demands more accurate models of parallel computation and efficient algorithms that work under the restrictions of these models.
Language Reference Manual is discussed with a mechanism suggested for the control of distribution in order to achieve optimal use of available resources. This approach uses compile-time complexity analysis of the source to permit the dynamic allocation of tasks to processing nodes based on the assumed computational intensity. The implementation of this structure using the Java Virtual Machine as the compilation target is discussed.