Load balancing is an important prerequisite for efficiently executing dynamic computations on parallel computers. In this context, this project has focused on two topics: cost-efficiently balancing dynamically generated workload in a network, and partitioning graphs to distribute connected tasks equally across the processing nodes while reducing the communication overhead. We summarize new insights and results in these areas.
In parallel computing, dynamic load balancing of parallel codes is considered a crucial problem. The goal is to distribute roughly equal amounts of computational load across a number of processors while minimizing inter-processor communication, with the objective of optimizing the execution time of the simulation. In some applications, the load grows in unpredictable ways, which is why a new distribution must be computed dynamically. Graph partitioning and repartitioning are usually combined to solve the dynamic load-balancing problem. In this paper we study and evaluate heuristic partitioning methods, such as region expansion, multilevel, and Kernighan-Lin algorithms, as well as methods for repartitioning graphs, and compare these different methods. The advantages and limitations of the different heuristics existing in the literature are made clear.
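Since the abstract names Kernighan-Lin among the heuristics it evaluates, a minimal sketch of one KL-style pass may help fix ideas. The function names and the toy 4-cycle graph are ours, not the paper's, and a real implementation repeats such passes, locking swapped vertices, until no positive-gain sequence remains.

```python
# A minimal, illustrative sketch of one Kernighan-Lin pass on an
# unweighted graph (adjacency given as a dict of neighbor sets).

def kl_gain(graph, part, v):
    """External minus internal edges of v: the cut reduction if v moves."""
    ext = sum(1 for u in graph[v] if part[u] != part[v])
    internal = sum(1 for u in graph[v] if part[u] == part[v])
    return ext - internal

def kl_single_pass(graph, part):
    """Greedily swap the best pair (a in side 0, b in side 1), if any."""
    side0 = [v for v in graph if part[v] == 0]
    side1 = [v for v in graph if part[v] == 1]
    best, best_pair = 0, None
    for a in side0:
        for b in side1:
            # Swap gain: g(a) + g(b) - 2 if a and b are adjacent.
            g = kl_gain(graph, part, a) + kl_gain(graph, part, b)
            g -= 2 * (1 if b in graph[a] else 0)
            if g > best:
                best, best_pair = g, (a, b)
    if best_pair:
        a, b = best_pair
        part[a], part[b] = part[b], part[a]
    return best

# Example: a 4-cycle split badly; one swap halves the cut from 4 to 2.
graph = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
part = {0: 0, 1: 1, 2: 0, 3: 1}
print(kl_single_pass(graph, part), part)  # gain 2, cut drops to 2
```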
Scalable Computing: Practice and Experience
In modern computing, high-performance computing (HPC) and parallel computing devote much of their decision-making to distributing the payload (input) uniformly across the available set of resources, mainly processors; the former deals with the hardware and its better utilization. In parallel computing, a large, complex problem is broken down into multiple smaller calculations that are executed simultaneously on several processors. The efficient use of resources (processors) plays a vital role in achieving maximum throughput, which necessitates uniform load distribution across the available processors, i.e. load balancing. Load balancing in parallel computing is modeled as a graph partitioning problem. In the graph partitioning problem, the weighted nodes represent the computing cost at each node, and the weighted edges represent the communication cost between the connected nodes. The goal is to partition the graph G into k partitions such that: I) the sum of weights on the ...
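The partitioning model described in this abstract (node weights as compute cost, edge weights as communication cost) can be made concrete with a few lines that score a candidate k-way partition. The names `edge_cut` and `imbalance`, and the toy data, are our own illustration.

```python
# Evaluate a k-way partition under the stated model: node weights are
# compute costs, edge weights are communication costs between nodes.

def edge_cut(edges, part):
    """Total weight of edges whose endpoints lie in different parts."""
    return sum(w for (u, v, w) in edges if part[u] != part[v])

def imbalance(node_w, part, k):
    """Max part load divided by average part load (1.0 = perfect)."""
    loads = [0.0] * k
    for v, w in node_w.items():
        loads[part[v]] += w
    avg = sum(loads) / k
    return max(loads) / avg

node_w = {0: 2, 1: 1, 2: 1, 3: 2}                     # compute costs
edges = [(0, 1, 3), (1, 2, 1), (2, 3, 3), (3, 0, 1)]  # comm costs
part = {0: 0, 1: 0, 2: 1, 3: 1}                       # a 2-way partition
print(edge_cut(edges, part), imbalance(node_w, part, 2))  # 2, 1.0
```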
Applied Numerical Mathematics, 2005
Data partitioning and load balancing are important components of parallel computations. Many different partitioning strategies have been developed, with great effectiveness in parallel applications. But the load-balancing problem is not yet solved completely; new applications and architectures require new partitioning features. Existing algorithms must be enhanced to support more complex applications. New models are needed for non-square, non-symmetric, and highly connected systems arising from applications in biology, circuits, and materials simulations. Increased use of heterogeneous computing architectures requires partitioners that account for non-uniform computing, network, and memory resources. And, for greatest impact, these new capabilities must be delivered in toolkits that are robust, easy-to-use, and applicable to a wide range of applications. In this paper, we discuss our approaches to addressing these issues within the Zoltan Parallel Data Services toolkit.
2012
Load imbalance leads to an increasing waste of resources as an application is scaled to more and more processors. Achieving the best parallel efficiency for a program requires optimal load balancing, which is an NP-hard problem. However, finding near-optimal solutions to this problem for complex computational science and engineering applications is becoming increasingly important. Charm++, a migratable-objects-based programming model, provides a measurement-based dynamic load balancing framework. This framework instruments and then migrates over-decomposed objects to balance computational load and communication at runtime. This paper explores the use of graph partitioning algorithms, traditionally used for partitioning physical domains/meshes, for measurement-based dynamic load balancing of parallel applications. In particular, we present repartitioning methods developed in a graph partitioning toolbox called SCOTCH that consider the previous mapping to minimize migration costs. We also discuss a new imbalance reduction algorithm for graphs with irregular load distributions. We compare several load balancing algorithms using microbenchmarks on Intrepid and Ranger and evaluate the effect of communication, number of cores, and number of objects on the benefit achieved from load balancing. New algorithms developed in SCOTCH lead to better performance compared to the METIS partitioners for several cases, both in terms of application execution time and the number of objects migrated.
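A hedged sketch of the general remapping idea the abstract alludes to, considering the previous mapping to cut migration: after partitioning from scratch, relabel the new parts to maximize overlap with the old assignment. This greedy matching is our own illustration, not SCOTCH's actual algorithm.

```python
# Relabel new part ids so they overlap the old assignment as much as
# possible, so fewer objects migrate. Greedy matching by overlap count.
from collections import Counter

def greedy_remap(old_part, new_part, k):
    overlap = Counter()
    for v in old_part:
        overlap[(new_part[v], old_part[v])] += 1
    mapping, used_new, used_old = {}, set(), set()
    # Match (new part, old part) pairs in decreasing overlap order.
    for (np_, op) , _ in overlap.most_common():
        if np_ not in used_new and op not in used_old:
            mapping[np_] = op
            used_new.add(np_); used_old.add(op)
    # Unmatched new parts take any leftover old label.
    leftover = iter(sorted(set(range(k)) - used_old))
    for p in range(k):
        if p not in mapping:
            mapping[p] = next(leftover)
    return {v: mapping[new_part[v]] for v in new_part}

old = {0: 0, 1: 0, 2: 1, 3: 1}
new = {0: 1, 1: 1, 2: 0, 3: 0}    # same cut, labels flipped
print(greedy_remap(old, new, 2))  # matches old exactly: zero migration
```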
1997
Parallel computing has become increasingly ubiquitous in recent years. Problem sizes continue to increase rapidly for solving real-world applications. A recent survey indicates that molecular dynamics simulations require a minimum of teraflops to tens of petaflops. The survey continues to report that computational cosmology needs tens of exaflops. The computational requirements for some physical problems, such as a study of the interactions between atoms, are currently unknown due to their complex nature; however, they will be extremely large. One way to satisfy these computational demands is to use large-scale distributed-memory multiprocessors. The Intel ASCI Red machine and the IBM and SGI/Cray Blue machines, consisting of hundreds to thousands of processors, are designed specifically to satisfy such computational demands. Another possibility for meeting such immense computational needs is to use a cluster of personal computers or workstations. The Berkeley Millennium project aims to provide computational resources through clusters of clusters of Intel-based personal computers.
7th International Symposium on Parallel Architectures, Algorithms and Networks, Proceedings, 2004
Efficient load balancing algorithms are the key to many efficient parallel applications. Until now, research in this area has mainly focused on static networks. However, observations show that diffusive algorithms, originally designed for these networks, can also be applied in non-static scenarios. In this paper we prove that the general diffusion scheme can be deployed on dynamic networks and show that its convergence rate depends on the average value of the quotient of the second smallest eigenvalue and the maximum vertex degree of the networks occurring during the iterations. In the presented experiments we illustrate that even if communication links of static networks fail with high probability, load can still be balanced quite efficiently. Simulating diffusion on ad-hoc networks, we demonstrate that diffusive schemes provide a reliable and efficient load balancing strategy in mobile environments as well.
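For readers unfamiliar with diffusion schemes, here is a minimal first-order diffusion round: each node repeatedly averages load with its neighbors. The step size alpha = 1/(max degree + 1) is a standard stable choice; the toy path graph is ours. In the dynamic setting studied in the paper, the graph passed to each round may change between rounds.

```python
# First-order diffusion (FOS): each node moves alpha * (difference)
# of load across every incident edge, per round.

def diffusion_round(graph, load, alpha):
    new = dict(load)
    for v, nbrs in graph.items():
        for u in nbrs:
            new[v] += alpha * (load[u] - load[v])
    return new

graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}  # a path; could vary per round
load = {0: 8.0, 1: 0.0, 2: 0.0, 3: 0.0}
alpha = 1.0 / (2 + 1)                            # max degree is 2
for _ in range(50):
    load = diffusion_round(graph, load, alpha)
print({v: round(x, 2) for v, x in load.items()})  # all loads near 2.0
```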
2007
Adaptive scientific computations require that periodic repartitioning (load balancing) occur dynamically to maintain load balance. Hypergraph partitioning is a successful model for minimizing communication volume in scientific computations, and partitioning software for the static case is widely available. In this paper, we present a new hypergraph model for the dynamic case, where we minimize the sum of communication in the application plus the migration cost to move data, thereby reducing total execution time. The new model can be solved using hypergraph partitioning with fixed vertices. We describe an implementation of a parallel multilevel repartitioning algorithm within the Zoltan load-balancing toolkit, which to our knowledge is the first code for dynamic load balancing based on hypergraph partitioning. Finally, we present experimental results that demonstrate the effectiveness of our approach on a Linux cluster with up to 64 processors. Our new algorithm compares favorably to the widely used ParMETIS partitioning software in terms of quality, and would have reduced total execution time in most of our test cases. (* Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company.)

Much of the early work in load balancing focused on diffusive methods, where overloaded processors give work to neighboring processors that have lower than average loads. A quite different approach is to partition the new problem "from scratch" without accounting for existing partition assignments, and then try to remap partitions to minimize the migration cost. These two strategies have very different properties. Diffusive schemes are fast and have low migration cost, but may incur high communication volume. Scratch-remap schemes give low communication volume but may incur higher migration cost.
Theoretical Computer Science, 2006
This work considers the problem of efficiently performing a set of tasks using a network of processors in a setting where the network is subject to dynamic reconfigurations, including partitions and merges. A key challenge for this setting is the implementation of dynamic load balancing that reduces the number of tasks performed redundantly because of the reconfigurations. We explore new approaches for load balancing in dynamic networks that can be employed by applications using a group communication service. The group communication services that we consider include a membership service (establishing new groups to reflect dynamic changes) but do not include maintenance of a primary component. For the n-processor, n-task load balancing problem defined in this work, the following specific results are obtained. For the case of fully dynamic changes, including fragmentations and merges, we show that the termination time of any on-line task assignment algorithm is greater than the termination time of an off-line task assignment algorithm by a factor greater than n/12. We present a load balancing algorithm that guarantees completion of all tasks in all fragments caused by partitions with work O(n + f·n) in the presence of f fragmentation failures. We develop an effective scheduling strategy for minimizing task execution redundancy and we prove that our strategy provides each of the n processors with a schedule of Θ(n^(1/3)) tasks such that at most one task is performed redundantly by any two processors.
Parallel Computing, 2011
One of the most significant causes of performance degradation for scientific and engineering applications on high performance computing systems is the uneven distribution of the computational work across the resources of the system. This effect, known as load imbalance, is even more noticeable in the case of irregular applications and heterogeneous distributed systems. This has motivated the parallel and distributed computing research community to focus on methods that provide good load balancing for scientific and engineering applications running on (heterogeneous) distributed systems. Efficient load balancing and scheduling methods are employed for scientific applications from various fields, such as mechanics, materials, physics, chemistry, biology, applied mathematics, etc. Such applications typically employ a large number of computational methods in order to simulate complex phenomena, on very large scales of time and magnitude. These simulations consist of routines that perform repetitive computations (in the form of DO/FOR loops) over very large data sets, which, if not properly implemented and executed, may suffer from poor performance. The number of repetitive computations in the simulation codes is not always constant. Moreover, the computational nature of these simulations may in fact be irregular, leading to cases where one computation takes (unpredictably) more time than others.

For successful and timely results, large scale simulations require the use of large scale computing systems, which often are widely distributed and highly heterogeneous. Moreover, large scale computing systems are usually shared among multiple users, which causes the quality and quantity of the available resources to be highly unpredictable. There are numerous load balancing methods in the literature for different parallel architectures. The most recent of these methods typically follow the master-worker paradigm, where a single coordinator (master) is responsible for making all the scheduling decisions based on information provided by the workers. Depending on the application requirements, the scheduling policy, and the computational environment, the benefits of this paradigm may be limited as follows: (1) its efficiency may not scale as the number of processors increases, and (2) it is quite probable that the scheduling decisions are made based on outdated information, especially on systems where the workload changes rapidly.

In an effort to address these limitations, we propose a distributed (master-less) load balancing scheme, in which the scheduling decisions are made by the workers in a distributed fashion. We implemented this method, along with two other master-worker schemes (a previously existing one and a recently modified one), for three different scientific computational kernels. In order to validate the usefulness and efficiency of the proposed scheme, we conducted a series of comparative performance tests with the two master-worker schemes for each computational kernel. The target system is an SMP cluster, on which we simulated three different patterns of system load fluctuation. The experiments strongly support the belief that the distributed approach offers greater performance and better scalability on such systems, showing an overall improvement ranging from 13% to 24% over the master-worker approaches.
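To make the master-less idea concrete, a small thread-based stand-in is sketched below: workers claim chunks from a shared counter themselves instead of asking a master. This is our illustration of decentralized self-scheduling, not the authors' implementation.

```python
# Master-less self-scheduling: each worker atomically claims the next
# chunk index from a shared counter; no coordinator hands out work.
import threading

def worker(counter, lock, chunk, tasks, results, wid):
    n = len(tasks)
    while True:
        with lock:                 # atomic claim of the next chunk
            start = counter[0]
            counter[0] += chunk
        if start >= n:
            return
        for i in range(start, min(start + chunk, n)):
            results[i] = (wid, tasks[i] ** 2)   # stand-in computation

tasks = list(range(100))
results = [None] * len(tasks)
counter, lock = [0], threading.Lock()
threads = [threading.Thread(target=worker,
                            args=(counter, lock, 8, tasks, results, w))
           for w in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(all(r is not None for r in results))  # True: every task done once
```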
Journal of Parallel and Distributed Computing, 2009
In parallel adaptive applications, the computational structure of the applications changes over time, leading to load imbalances even though the initial load distributions were balanced. To restore balance and to keep communication volume low in further iterations of the applications, dynamic load balancing (repartitioning) of the changed computational structure is required. Repartitioning differs from static load balancing (partitioning) due to the additional requirement of minimizing migration cost to move data from an existing partition to a new partition. In this paper, we present a novel repartitioning hypergraph model for dynamic load balancing that accounts for both communication volume in the application and migration cost to move data, in order to minimize the overall cost. Use of a hypergraph-based model allows us to accurately model communication costs rather than approximating them with graph-based models. We show that the new model can be realized using hypergraph partitioning with fixed vertices and describe our parallel multilevel implementation within the Zoltan load-balancing toolkit. To the best of our knowledge, this is the first implementation for dynamic load balancing based on hypergraph partitioning. To demonstrate the effectiveness of our approach, we conducted experiments on a Linux cluster with 1024 processors. The results show that, in terms of reducing total cost, our new model compares favorably to the graph-based dynamic load balancing approaches, and multilevel approaches improve the repartitioning quality significantly.
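The total cost such a repartitioning model minimizes can be illustrated in a few lines: a hypergraph connectivity (lambda minus one) metric for communication volume, plus a migration term for vertices that change parts. The data and function names below are our own sketch, not the paper's code.

```python
# Total repartitioning cost = communication volume (connectivity metric
# over hyperedges) + migration cost of vertices that changed parts.

def connectivity_cost(hyperedges, part):
    """Sum over hyperedges of (number of parts spanned - 1) * weight."""
    return sum((len({part[v] for v in pins}) - 1) * w
               for pins, w in hyperedges)

def migration_cost(old_part, new_part, sizes):
    """Total data size of vertices assigned to a different part."""
    return sum(sizes[v] for v in old_part if old_part[v] != new_part[v])

hyperedges = [({0, 1, 2}, 2), ({2, 3}, 1)]   # (pins, weight)
sizes = {0: 4, 1: 1, 2: 1, 3: 4}             # per-vertex data sizes
old = {0: 0, 1: 0, 2: 0, 3: 1}
new = {0: 0, 1: 0, 2: 1, 3: 1}
total = connectivity_cost(hyperedges, new) + migration_cost(old, new, sizes)
print(total)  # 2 (comm) + 1 (migration) = 3
```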
2017
Acceleration of graph applications on GPUs has found large interest due to the ubiquitous use of graph processing in various domains. The inherent irregularity in graph applications leads to several challenges for parallelization. A key challenge, which we address in this paper, is that of load-imbalance. If the work-assignment to threads uses node-based graph partitioning, it can result in skewed task-distribution, leading to poor load-balance. In contrast, if the work-assignment uses edge-based graph partitioning, the load-balancing is better, but the memory requirement is relatively higher. This makes it unsuitable for large graphs. In this work, we propose three techniques for improved load-balancing of graph applications on GPUs. Each technique brings in unique advantages, and a user may have to employ a specific technique based on the requirement. Using Breadth First Search and Single Source Shortest Paths as our processing kernels, we illustrate the effectiveness of ...
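The node-based versus edge-based trade-off described above is easy to see on a CSR-style graph with one high-degree hub. The sketch below, using our own toy graph rather than the paper's kernels, compares the per-thread work each assignment produces.

```python
# Per-thread work under the two assignments discussed above, on a
# CSR-style row-pointer array (row_ptr[v]..row_ptr[v+1] = edges of v).

def node_based_work(row_ptr, n_threads):
    """Each thread owns a contiguous block of nodes; work = edges scanned."""
    n = len(row_ptr) - 1
    per = (n + n_threads - 1) // n_threads
    return [row_ptr[min((t + 1) * per, n)] - row_ptr[min(t * per, n)]
            for t in range(n_threads)]

def edge_based_work(row_ptr, n_threads):
    """Each thread owns a contiguous slice of edges; work is near-uniform."""
    m = row_ptr[-1]
    per = (m + n_threads - 1) // n_threads
    return [min((t + 1) * per, m) - min(t * per, m) for t in range(n_threads)]

# One hub node with 12 edges, three nodes with 1 edge each (skewed degrees).
row_ptr = [0, 12, 13, 14, 15]
print(node_based_work(row_ptr, 4))  # [12, 1, 1, 1] -> imbalanced
print(edge_based_work(row_ptr, 4))  # [4, 4, 4, 3]  -> balanced
```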
Lecture Notes in Computer Science, 2004
The task of balancing dynamically generated work load occurs in a wide range of parallel and distributed applications. Diffusion based schemes, which belong to the class of nearest neighbor load balancing algorithms, are a popular way to address this problem. Originally created to equalize the amount of arbitrarily divisible load among the nodes of a static and homogeneous network, they have been generalized to heterogeneous topologies. Additionally, some simple diffusion algorithms have been adapted to work in dynamic networks as well. However, if the load is not arbitrarily divisible but consists of indivisible unit-size tokens, diffusion schemes are not able to balance the load properly. In this paper we consider the problem of balancing indivisible unit-size tokens on dynamic and heterogeneous systems. By modifying a randomized strategy invented for homogeneous systems, we can achieve an asymptotically minimal expected overload in the l_1, l_2 and l_∞ norms while only slightly increasing the run-time by a logarithmic factor. Our experiments show that this additional factor is usually not required in applications.
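As a loose illustration of randomized balancing of indivisible tokens (not the paper's actual strategy), each node in the sketch below compares load with one random neighbor per round and ships a single token toward the lighter side.

```python
# Randomized unit-token balancing: per round, each node picks a random
# neighbor and sends one indivisible token if it is strictly heavier.
import random

def token_round(graph, load):
    sends = []                               # decide all sends first
    for v, nbrs in graph.items():
        u = random.choice(sorted(nbrs))
        if load[v] > load[u] + 1:            # strictly heavier: send one
            sends.append((v, u))
    for v, u in sends:                       # then apply synchronously
        if load[v] > 0:
            load[v] -= 1
            load[u] += 1

random.seed(1)
graph = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}  # a 4-cycle
load = {0: 9, 1: 1, 2: 1, 3: 1}                        # 12 tokens total
for _ in range(60):
    token_round(graph, load)
print(load)  # close to 3 tokens per node (average load)
```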
2007
This paper examines MPI's ability to support continuous, dynamic load balancing for unbalanced parallel applications. We use an unbalanced tree search benchmark (UTS) to compare two approaches, 1) work sharing using a centralized work queue, and 2) work stealing using explicit polling to handle steal requests. Experiments indicate that in addition to a parameter defining the granularity of load balancing, message-passing paradigms require additional parameters such as polling intervals to manage runtime overhead. Using these additional parameters, we observed an improvement of up to 2X in parallel performance. Overall we found that while work sharing may achieve better peak performance on certain workloads, work stealing achieves comparable if not better performance across a wider range of chunk sizes and workloads.
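A thread-based analogue of the work-sharing variant may clarify the role of the chunk-size parameter. MPI, polling intervals, and the real UTS benchmark are omitted; the toy tree generator below is our own stand-in.

```python
# Work sharing via a centralized queue of chunks: workers take a chunk,
# process it, and push newly generated work back in chunks.
import queue, random, threading

def worker(q, chunk, counted, lock):
    while True:
        try:
            nodes = q.get(timeout=0.1)       # grab one chunk of work
        except queue.Empty:
            return                           # no work seen for a while
        new = []
        for depth in nodes:
            with lock:
                counted[0] += 1
            # Stand-in for UTS: nodes spawn 0-2 children up to depth 3.
            if depth < 3:
                new.extend([depth + 1] * random.randint(0, 2))
        for i in range(0, len(new), chunk):  # share work back in chunks
            q.put(new[i:i + chunk])

random.seed(0)
q = queue.Queue()
q.put([0])                                   # root of the tree
counted, lock = [0], threading.Lock()
threads = [threading.Thread(target=worker, args=(q, 4, counted, lock))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counted[0], "tree nodes processed")
```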
Proc. of the International Conference on Computing and Systems 2010 (ICCS-2010), ISBN: 93-80813-01-5, 2010
A parallel computer system is a collection of processing elements that communicate and cooperate to solve large computational problems efficiently. To achieve this, the large computational problem is first partitioned into several tasks with different workloads, which are then assigned to different processing elements for computation. The distribution of the workload is known as load balancing. An appropriate distribution of workload across the various processing elements is very important, as disproportionate workloads can eliminate the performance benefit of parallelizing the job. Hence, load balancing on parallel systems is a critical and challenging activity. Load balancing algorithms can be broadly categorized as static or dynamic. Static load balancing algorithms distribute the tasks to processing elements at compile time, while dynamic algorithms bind tasks to processing elements at run time. This paper briefly explains the different dynamic load balancing techniques used in parallel systems, concluding with a comparative performance analysis of these algorithms.
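The static/dynamic distinction drawn in this abstract can be demonstrated in a few lines: a compile-time round-robin mapping versus a run-time greedy assignment to the least-loaded element. The task sizes below are invented to show the gap.

```python
# Static round-robin fixes the task-to-processor mapping up front;
# dynamic greedy assigns each task to the currently least-loaded element.
import heapq

def static_round_robin(task_costs, p):
    loads = [0] * p
    for i, c in enumerate(task_costs):
        loads[i % p] += c                # mapping fixed at "compile time"
    return loads

def dynamic_least_loaded(task_costs, p):
    heap = [(0, pe) for pe in range(p)]  # (current load, processor id)
    for c in task_costs:
        load, pe = heapq.heappop(heap)   # decided at "run time"
        heapq.heappush(heap, (load + c, pe))
    return sorted(load for load, _ in heap)

tasks = [4, 1, 4, 1, 4, 1, 4, 1]         # irregular task sizes
print(static_round_robin(tasks, 2))      # [16, 4]  -> imbalanced
print(dynamic_least_loaded(tasks, 2))    # [10, 10] -> balanced
```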
Journal of Parallel and Distributed Computing, 1994
In this paper we analyze the scalability of a number of load balancing algorithms which can be applied to problems that have the following characteristics: the work done by a processor can be partitioned into independent work pieces; the work pieces are of highly variable sizes; and it is not possible (or very difficult) to estimate the size of total work at a given processor. Such problems require a load balancing scheme that distributes the work dynamically among different processors.
PDCN, 2004
Dynamic load balancing techniques, in practice, do not assume any information about the tasks to be executed at compilation time. Parameters like execution time or communication time are unknown at compilation time. These techniques are used to distribute the computation tasks of an application between different processors at execution time to achieve some defined performance objectives [1]. In this paper we present a dynamic load balancing algorithm designed especially for heterogeneous networks of workstations. The algorithm distributes the parallel tasks dynamically, attempting to minimize their execution time. The experiments are done over a network of workstations interconnected via Fast Ethernet. It is a Linux cluster with some degree of heterogeneity in the processing nodes. Our algorithm is shown to be efficient in increasing resource utilization and reducing the total execution time of the applications.
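A minimal sketch of one heterogeneity-aware policy consistent with the description above (our own illustration, not necessarily the authors' algorithm): assign each task to the workstation whose estimated finish time, load divided by speed, is currently smallest. Speeds and task costs here are invented.

```python
# Greedy heterogeneous assignment: give each task to the node with the
# smallest estimated finish time (accumulated work / node speed).
import heapq

def assign_hetero(task_costs, speeds):
    heap = [(0.0, i) for i in range(len(speeds))]  # (finish time, node)
    heapq.heapify(heap)
    work = [0.0] * len(speeds)
    for c in task_costs:
        _, i = heapq.heappop(heap)
        work[i] += c
        heapq.heappush(heap, (work[i] / speeds[i], i))
    return work

speeds = [1.0, 2.0, 4.0]             # relative node speeds (heterogeneous)
tasks = [1.0] * 28                   # 28 unit-cost tasks
work = assign_hetero(tasks, speeds)
print(work, [w / s for w, s in zip(work, speeds)])
# [4.0, 8.0, 16.0] -> all nodes finish at the same time (4.0)
```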
Parallel & Distributed …, 1996
The authors compare the performances of five dynamic load-balancing strategies. The simulator they've developed lets them measure these performances across a range of network topologies, including a 2D mesh, a 4D hypercube, a linear array, and a composite Fibonacci cube. A multiprocessor network without load balancing processes processor-generated tasks locally, with little or no sharing of computational resources. Load balancing, on the other hand, uses a multiprocessor network's inherently redundant processing power by redistributing the workload among the processors to improve the application's overall performance.