2018, Concurrency and Computation: Practice and Experience
In High Performance Computing, heterogeneity is now the norm, with specialized accelerators such as GPUs providing efficient computational power. The resulting complexity has led to the development of task-based runtime systems, in which complex computations are described as task graphs and scheduling decisions are made at runtime to balance the load between all resources of the platform. Developing good scheduling strategies, even at the scale of a single node, and analyzing them both theoretically and in practice is expected to have a very high impact on the performance of current HPC systems. The special case of two kinds of resources, typically CPUs and GPUs, is already of great practical interest. The scheduling policy HeteroPrio was proposed in the context of fast multipole method (FMM) computations and has been extended to general task graphs with very promising results. In this paper, we provide a theoretical study of the performance of HeteroPrio, proving approximation bounds compared to the optimal schedule, both in the case of independent tasks and in the case of general task graphs. Interestingly, our results establish that spoliation (a technique that enables a resource to restart an uncompleted task on another resource) is enough to prove bounded approximation ratios for a list scheduling algorithm on two unrelated resources, which is known to be impossible otherwise. This result holds both for independent and dependent task graphs. Additionally, we provide an experimental evaluation of HeteroPrio on real task graphs from dense linear algebra computations, which establishes its strong performance in practice.
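To make the HeteroPrio mechanism concrete, here is a minimal Python sketch for independent tasks on two resource types, assuming hypothetical (cpu_time, gpu_time) pairs as input. It follows the description above: CPUs pick tasks from the least-accelerated end of a queue sorted by acceleration factor, GPUs from the most-accelerated end, and an idle worker may spoliate (restart) a task still running on the other resource kind if it can finish it strictly earlier. This is an illustrative simplification, not the paper's exact algorithm; in particular, it does not free the victim resource when one of its tasks is spoliated.

```python
def heteroprio(tasks, n_cpu, n_gpu):
    """Toy HeteroPrio for independent tasks; tasks = [(cpu_time, gpu_time)]."""
    # Phase 1: list scheduling driven by the acceleration factor.
    queue = sorted(tasks, key=lambda t: t[0] / t[1])   # ascending cpu/gpu ratio
    workers = [["cpu", 0.0] for _ in range(n_cpu)] + \
              [["gpu", 0.0] for _ in range(n_gpu)]     # [kind, next idle time]
    running = []                                       # [finish, kind, task]
    while queue:
        w = min(workers, key=lambda x: x[1])           # earliest idle worker
        task = queue.pop(0) if w[0] == "cpu" else queue.pop()
        dur = task[0] if w[0] == "cpu" else task[1]
        w[1] += dur
        running.append([w[1], w[0], task])
    # Phase 2: spoliation pass. An idle worker restarts the latest-finishing
    # task of the other kind whenever it can complete it strictly earlier.
    improved = True
    while improved:
        improved = False
        w = min(workers, key=lambda x: x[1])
        for r in sorted(running, key=lambda r: -r[0]):
            dur = r[2][0] if w[0] == "cpu" else r[2][1]
            if r[1] != w[0] and r[0] > w[1] + dur:     # strictly earlier finish
                r[0], r[1] = w[1] + dur, w[0]
                w[1] += dur
                improved = True
                break
    return max(r[0] for r in running)

# Hypothetical instance: 2 CPUs, 1 GPU.
print(heteroprio([(6.0, 1.0), (5.0, 1.0), (2.0, 2.0), (3.0, 4.0)], 2, 1))
```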
Concurrency and Computation: Practice and Experience, 2011
In the field of HPC, the current hardware trend is to design multiprocessor architectures featuring heterogeneous technologies such as specialized coprocessors (e.g., Cell/BE SPUs) or data-parallel accelerators (e.g., GPGPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offloading parts of the computations. However, designing an execution model that unifies all computing units and the associated embedded memory remains a major challenge. We have therefore designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and to easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run time, and we have demonstrated their efficiency by analyzing the impact of these scheduling policies on several classical linear algebra algorithms that take advantage of multiple cores and GPUs at the same time. In addition to substantial improvements regarding execution times, we obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine.
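As a rough illustration of what "strategies selected seamlessly at run time" means in a task-based runtime, here is a toy Python sketch, written from scratch rather than against StarPU's actual C API: tasks carry per-architecture cost estimates, and a policy chosen by name at run time decides where each task executes. The "ef" policy mimics the spirit of performance-model-based strategies (such as StarPU's dmda family), while "eager" ignores heterogeneity; all names and numbers are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    cpu_time: float    # estimated duration on a CPU core (hypothetical)
    gpu_time: float    # estimated duration on a GPU (hypothetical)

def eager(task, loads):
    # Heterogeneity-blind: pick the least-loaded worker.
    return min(loads, key=loads.get)

def earliest_finish(task, loads):
    # Performance-model based: pick the worker with the earliest
    # predicted completion time for this task.
    def finish(w):
        dur = task.gpu_time if w.startswith("gpu") else task.cpu_time
        return loads[w] + dur
    return min(loads, key=finish)

POLICIES = {"eager": eager, "ef": earliest_finish}

def run(tasks, workers, policy="ef"):
    loads = {w: 0.0 for w in workers}    # accumulated busy time per worker
    choose = POLICIES[policy]            # strategy selected at run time
    for t in tasks:
        w = choose(t, loads)
        loads[w] += t.gpu_time if w.startswith("gpu") else t.cpu_time
    return max(loads.values())           # resulting makespan estimate

tasks = [Task("gemm", 10.0, 1.0), Task("trsm", 4.0, 1.0), Task("potrf", 2.0, 3.0)]
print(run(tasks, ["cpu0", "cpu1", "gpu0"], policy="ef"))
```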
2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020
We consider the problem of scheduling task graphs on two types of unrelated resources, which arises in the context of task-based runtime systems on modern platforms containing CPUs and GPUs. In this paper, we focus on an algorithm named HeteroPrio, which was originally introduced as an efficient heuristic for a particular application. HeteroPrio is an adaptation of the well-known list scheduling algorithm, in which tasks are picked by the resources in the order of their acceleration factor. This algorithm is augmented with a spoliation mechanism: a task assigned by the list algorithm can later be reassigned to a different resource if doing so allows the task to finish earlier. We propose here the first theoretical analysis of the HeteroPrio algorithm in the presence of dependencies. More specifically, if the platform contains m and n processors of each type, we show that the worst-case approximation ratio of HeteroPrio is between $1 + \max \left( \frac{m}{n},\frac{n}{m} \right)$ ...
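For intuition on the stated lower bound (the upper bound is truncated in the abstract and is not reconstructed here), the following worked instance shows what it implies on a balanced platform:

```latex
% Lower bound on HeteroPrio's worst-case approximation ratio, as stated above:
%   rho >= 1 + max(m/n, n/m).
% Worked instance with m = n (as many processors of each type):
%   1 + max(1, 1) = 2,
% i.e. even on a balanced platform no guarantee better than 2 is possible.
\rho_{\text{HeteroPrio}} \;\ge\; 1 + \max\left(\frac{m}{n}, \frac{n}{m}\right)
```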
Information Processing Letters, 2002
A general parallel task scheduling problem is considered. A task can be processed in parallel on one of several alternative subsets of processors, and its processing time depends on the subset of processors assigned to it. We first show the hardness of approximating the problem, for both the preemptive and nonpreemptive cases, in the general setting. Next we focus on a linear array network of m processors. We give an approximation algorithm with ratio O(log m) for nonpreemptive scheduling, and another with ratio 2 for preemptive scheduling. Finally, we give a nonpreemptive scheduling algorithm with ratio O(log² m) for m × m two-dimensional meshes.
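As a sketch of the problem model (with hypothetical instance data), each task lists the alternative processor subsets it can run on together with the subset-dependent processing time; two classic makespan lower bounds then follow directly:

```python
# Toy encoding of the model: each task can run on one of several
# alternative processor subsets, with a subset-dependent duration.
instance = {
    "T1": [({0, 1}, 4.0), ({2}, 9.0)],       # 2 procs for 4s, or 1 proc for 9s
    "T2": [({1, 2, 3}, 3.0), ({0}, 12.0)],
    "T3": [({3}, 5.0)],
}
m = 4  # processors 0..3, e.g. a linear array as in the paper

def lower_bounds(instance, m):
    # Two classic makespan lower bounds: the longest unavoidable task,
    # and the total minimal work area spread over all m processors.
    longest = max(min(t for _, t in alts) for alts in instance.values())
    area = sum(min(len(s) * t for s, t in alts)
               for alts in instance.values()) / m
    return max(longest, area)

print(lower_bounds(instance, m))   # 5.5 for this toy instance
```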
Lecture Notes in Computer Science, 2014
More and more computers use hybrid architectures combining multi-core processors and hardware accelerators like GPUs (Graphics Processing Units). We present in this paper a new method for efficiently scheduling parallel applications on m CPUs and k GPUs, where each task of the application can be processed either on a core (CPU) or on a GPU. The objective is to minimize the makespan. The corresponding scheduling problem is NP-hard; we propose an efficient approximation algorithm which achieves an approximation ratio of $\frac{4}{3} + \frac{1}{3k}$. We first detail and analyze the method, based on a dual approximation scheme that uses dynamic programming to balance the load evenly between the heterogeneous resources. Finally, we run simulations based on realistic benchmarks and compare the solution obtained by a relaxed version of this method to the one provided by a classical greedy algorithm and to lower bounds on the value of the optimal makespan. This work is supported by a CNRS-Google contract and the French national program GDR-RO.
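Below is a hedged Python sketch of the dual approximation scheme: binary-search a makespan guess λ, and for each guess ask an oracle to build an assignment whose loads fit within λ. The paper's oracle is a dynamic program; the sketch replaces it with a plain greedy load-balancing rule, so it illustrates the structure of the method rather than its $\frac{4}{3} + \frac{1}{3k}$ guarantee.

```python
def feasible(tasks, m, k, lam):
    """Greedy stand-in for the paper's dynamic-programming oracle.
    tasks: list of (cpu_time, gpu_time); m CPUs, k GPUs; lam: guessed makespan.
    Returns a CPU/GPU assignment whose loads fit within lam, or None."""
    cpu_load = gpu_load = 0.0
    assign = []
    for c, g in tasks:
        if c > lam and g > lam:
            return None                  # task fits on neither side
        if c > lam:
            choice = "gpu"               # forced onto the GPUs
        elif g > lam:
            choice = "cpu"               # forced onto the CPUs
        else:
            # Both fit: send to the side with the smaller normalized load.
            choice = "cpu" if cpu_load / m <= gpu_load / k else "gpu"
        if choice == "cpu":
            cpu_load += c
        else:
            gpu_load += g
        assign.append(choice)
    ok = cpu_load <= m * lam and gpu_load <= k * lam
    return assign if ok else None

def dual_approx(tasks, m, k, eps=1e-3):
    # Start from a guess that is certainly feasible, then binary-search down.
    hi = sum(min(c, g) for c, g in tasks)
    while feasible(tasks, m, k, hi) is None:
        hi *= 2
    lo = 0.0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if feasible(tasks, m, k, mid) is not None:
            hi = mid
        else:
            lo = mid
    return hi, feasible(tasks, m, k, hi)

print(dual_approx([(6.0, 1.0), (5.0, 1.0), (2.0, 2.0), (3.0, 4.0)], m=2, k=1))
```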
2017
We study the problem of executing an application represented by a precedence task graph on a multi-core machine composed of standard computing cores and accelerators. Contrary to most existing approaches, we distinguish the allocation and the scheduling phases, and we mainly focus on the allocation part of the problem: choosing the most appropriate type of computing unit for each task. We address both off-line and on-line settings. In the off-line case, we establish strong lower bounds on the worst-case performance of a known approach based on Linear Programming for solving the allocation problem. We then refine the scheduling phase, replacing the greedy list scheduling policy used in this approach by a better ordering of the tasks. Although this modification leads to the same approximability guarantees, it performs much better in practice. In the on-line case, we assume that the tasks arrive in an order that is not known in advance but respects the precedence relations, and the scheduler has to take irrevocable decisions about their allocation and execution. In this setting, we propose the first on-line scheduling algorithm that takes precedences into account. Our algorithm is based on adequate rules for selecting the type of processor on which to allocate each task, and it achieves a constant-factor approximation guarantee if the ratio of the number of CPUs to the number of GPUs is bounded. Finally, all the previous algorithms have been evaluated on a large number of simulations built on actual libraries. These simulations assess the good practical behavior of the algorithms with respect to state-of-the-art solutions, whenever these exist, or to baseline algorithms.
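A minimal sketch of the on-line setting just described: tasks arrive in a precedence-respecting order and receive an irrevocable allocation from a threshold rule on their acceleration factor. The sqrt(m/k) threshold below is an assumption chosen for illustration (it is the kind of ratio-based rule the abstract alludes to), not necessarily the paper's exact rule.

```python
import math

def allocate(cpu_time, gpu_time, m, k):
    # Hypothetical threshold rule: highly accelerated tasks go to the GPUs.
    return "gpu" if cpu_time / gpu_time >= math.sqrt(m / k) else "cpu"

def online_schedule(arrivals, m, k):
    """arrivals: (name, cpu_time, gpu_time, deps) in a precedence-respecting
    order; allocation decisions are irrevocable, as in the on-line model."""
    cpu_free, gpu_free = [0.0] * m, [0.0] * k
    finish = {}
    for name, c, g, deps in arrivals:
        ready = max((finish[d] for d in deps), default=0.0)
        pool, dur = (gpu_free, g) if allocate(c, g, m, k) == "gpu" else (cpu_free, c)
        i = min(range(len(pool)), key=pool.__getitem__)   # earliest-idle unit
        pool[i] = max(pool[i], ready) + dur
        finish[name] = pool[i]
    return max(finish.values())

# Hypothetical 3-task chain plus one independent task; m=4 CPUs, k=1 GPU.
dag = [("a", 4.0, 1.0, []), ("b", 2.0, 2.0, ["a"]),
       ("c", 6.0, 1.0, ["b"]), ("d", 3.0, 3.0, [])]
print(online_schedule(dag, m=4, k=1))
```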
2010 3rd Workshop on Many-Task Computing on Grids and Supercomputers, 2010
In order for many-task applications to be attractive candidates for running on high-end supercomputers, they must be able to benefit from the additional compute, I/O, and communication performance provided by high-end HPC hardware relative to clusters, grids, or clouds. Typically this means that the application should use the HPC resource in such a way that it can reduce time to solution beyond what is possible otherwise. Furthermore, it is necessary to make efficient use of the computational resources, achieving high levels of utilization.
2018
We study the problem of executing an application represented by a precedence task graph on a parallel machine composed of standard computing cores and accelerators. Contrary to most existing approaches, we distinguish the allocation and the scheduling phases, and we mainly focus on the allocation part of the problem: choosing the most appropriate type of computing unit for each task. We address both off-line and on-line settings and design generic scheduling approaches. In the off-line case, we establish strong lower bounds on the worst-case performance of a known approach based on Linear Programming for solving the allocation problem. We then refine the scheduling phase, replacing the greedy List Scheduling policy used in this approach by a better ordering of the tasks. Although this modification leads to the same approximability guarantees, it performs much better in practice. We also extend this algorithm to more types of computing units, achieving an approximation ratio which depends on the number of different types. In the on-line case, we assume that the tasks arrive in an order that is not known in advance but respects the precedence relations, and the scheduler has to take irrevocable decisions about their allocation and execution. In this setting, we propose the first on-line scheduling algorithm that takes precedences into account. Our algorithm is based on adequate rules for selecting the type of processor on which to allocate each task, and it achieves a constant-factor approximation guarantee if the ratio of the number of CPUs to the number of GPUs is bounded. Finally, all the previous algorithms for hybrid architectures have been evaluated on a large number of simulations built on actual libraries. These simulations assess the good practical behavior of the algorithms with respect to state-of-the-art solutions, whenever these exist, or to baseline algorithms.
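The two abstracts above mention replacing the greedy list-scheduling order by "a better ordering of the tasks" without spelling it out; a classic candidate in this literature is to rank tasks by bottom-level, i.e. the longest remaining path to an exit task, as in the sketch below. The DAG and durations are hypothetical; a common heterogeneous convention is to use, for instance, the mean of a task's CPU and GPU times as its representative duration.

```python
# Hypothetical DAG: task -> successors, plus a representative duration
# per task (e.g. an average of its CPU and GPU times).
succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
dur = {"A": 2.0, "B": 4.0, "C": 1.0, "D": 3.0}

def bottom_level(t):
    # Longest path from t to an exit task, t's own duration included.
    return dur[t] + max((bottom_level(s) for s in succs[t]), default=0.0)

# List-schedule ready tasks by decreasing bottom-level: tasks on the
# critical path are dispatched first.
priority = sorted(succs, key=bottom_level, reverse=True)
print(priority)   # ['A', 'B', 'C', 'D'] for this toy DAG
```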
Lecture Notes in Computer Science, 2014
Multi-core architectures comprising several GPUs have become mainstream in the field of High-Performance Computing. However, obtaining the maximum performance of such heterogeneous machines is challenging, as it requires carefully offloading computations and managing data movements between the different processing units. The most promising and successful approaches so far rely on task-based runtimes that abstract the machine and use opportunistic scheduling algorithms. As a consequence, the problem gets shifted to choosing the task granularity and task graph structure, and to optimizing the scheduling strategies. Trying different combinations of these alternatives is itself a challenge. Indeed, getting accurate measurements requires reserving the target system for the whole duration of the experiments. Furthermore, observations are limited to the few systems at hand and may be difficult to generalize. In this article, we show how we crafted a coarse-grain hybrid simulation/emulation of StarPU, a dynamic runtime for hybrid architectures, on top of SimGrid, a versatile simulator for distributed systems. This approach makes it possible to obtain performance predictions accurate to within a few percent on classical dense linear algebra kernels in a matter of seconds, which allows both runtime and application designers to quickly decide which optimizations to enable or whether it is worth investing in higher-end GPUs.
2003
In this paper, we consider the execution of a complex application on a heterogeneous "grid" computing platform. The complex application consists of a suite of identical, independent problems to be solved. In turn, each problem consists of a set of tasks, with dependences (precedence constraints) between these tasks. A typical example is the repeated execution of the same algorithm on several distinct data samples. We use an undirected graph to model the grid platform, where resources have different speeds of computation and communication. We show how to determine the optimal steady-state scheduling strategy for each processor (the fraction of time spent computing and the fraction of time spent communicating with each neighbor) and how to build such a schedule. This result holds in a quite general framework, allowing for cycles and multiple paths in the platform graph.
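The steady-state approach boils down to a linear program over time fractions. Here is a toy instance (hypothetical numbers, using SciPy) for a master sending task data to two workers under a one-port communication model; it captures the "fraction of time computing / communicating" flavor of the result, not the paper's general multi-path formulation.

```python
# Toy steady-state throughput LP: a master sends task data to two
# workers over a shared one-port link; maximize tasks per time unit.
from scipy.optimize import linprog

w = [2.0, 5.0]   # compute time per task on worker 1 and worker 2
s = [1.0, 0.5]   # master send time per task to worker 1 and worker 2

# Variables: rho_i = tasks per time unit processed by worker i.
# Maximize rho_1 + rho_2, i.e. minimize its negation.
res = linprog(
    c=[-1.0, -1.0],
    A_ub=[[w[0], 0.0],      # worker 1 computes at most 100% of the time
          [0.0, w[1]],      # worker 2 likewise
          [s[0], s[1]]],    # master's outgoing port is busy at most 100%
    b_ub=[1.0, 1.0, 1.0],
    bounds=[(0, None), (0, None)],
)
print("steady-state throughput:", -res.fun, "tasks per time unit")  # 0.7 here
```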
Concurrency and Computation: Practice and Experience, 2015
Multi-core architectures comprising several GPUs have become mainstream in the field of High-Performance Computing. However, obtaining the maximum performance of such heterogeneous machines is challenging, as it requires carefully offloading computations and managing data movements between the different processing units. The most promising and successful approaches so far build on task-based runtimes that abstract the machine and rely on opportunistic scheduling algorithms. As a consequence, the problem gets shifted to choosing the task granularity and task graph structure, and to optimizing the scheduling strategies. Trying different combinations of these alternatives is itself a challenge. Indeed, getting accurate measurements requires reserving the target system for the whole duration of the experiments. Furthermore, observations are limited to the few systems at hand and may be difficult to generalize. In this article, we show how we crafted a coarse-grain hybrid simulation/emulation of StarPU, a dynamic runtime for hybrid architectures, on top of SimGrid, a versatile simulator of distributed systems. This approach makes it possible to obtain performance predictions of classical dense linear algebra kernels accurate to within a few percent and in a matter of seconds, which allows both runtime and application designers to quickly decide which optimizations to enable or whether it is worth investing in higher-end GPUs. Additionally, it allows robust and extensive scheduling studies to be conducted in a controlled environment whose characteristics are very close to those of real platforms, while having reproducible behavior.