2010, International Journal of Critical …
In this paper, we present an approach to safety scheduling in distributed computing based on strategies of resource co-allocation for complex sets of tasks (jobs). The necessity of guaranteed job execution within given time limits requires taking into account the dynamics of the distributed environment, namely changes in the number of jobs awaiting service, in the volumes of computations, possible failures of processor nodes, etc. As a consequence, in the general case, a set of scheduling and resource co-allocation versions, or a strategy, is required instead of a single version. Safety strategies are formed for structurally different job models with various levels of task granularity and data replication policies. We develop and consider scheduling strategies which combine fine-grain and coarse-grain computations, multiple data replicas, and constrained data movement. These strategies are evaluated using simulation studies addressing a variety of metrics.
Third International Conference on …, 2008
In this paper, we present an approach to safety scheduling in distributed computing based on strategies of resource co-allocation. Safety strategies are formed for structurally different program models with various levels of task granularity and data replication policies. We develop and consider scheduling strategies which combine fine-grain and coarse-grain computations, multiple data replicas, and constrained data movement. These strategies are evaluated using simulation studies addressing a variety of metrics.
… on Computer Systems …, 2010
In this paper, we present an approach to scalable coscheduling in distributed computing for complex sets of interrelated tasks (jobs). Scalability means that schedules are formed for job models with various levels of task granularity and data replication policies, and that processor and memory resources can be upgraded. The necessity of guaranteed job execution at the required quality of service requires taking into account the dynamics of the distributed environment, namely changes in the number of jobs awaiting service, in the volumes of computations, possible failures of processor nodes, etc. As a consequence, in the general case, a set of scheduling versions, or a strategy, is required instead of a single version. We propose a scalable model of scheduling based on multicriteria strategies. The choice of a specific schedule depends on the load level and the resource dynamics and is formed as a resource query which is sent to a local batch-job management system.
This paper addresses the problem of task allocation in distributed computing systems with the goal of maximizing system reliability. It first develops a mathematical model of reliability based on a cost function representing the unreliability caused by executing tasks on the system processors and the unreliability caused by inter-processor communication, subject to constraints imposed by both the application and the system resources. It then presents an exact algorithm for this problem derived from the well-known branch-and-bound technique. To reduce the computations needed to find an optimal allocation, the algorithm solves the dual problem, uses a best-first branching strategy for selecting the node to be expanded, and orders tasks across the tree levels so that tasks with more connectivity are handled first.
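The cost structure described above can be sketched in a few lines. The following is an illustrative branch-and-bound search over task-to-processor assignments; all cost figures are hypothetical, not the paper's data, and the exact bounding and dual-problem machinery of the paper is replaced by simple cost-based pruning:

```python
# Hypothetical unreliability costs (illustrative, not the paper's data):
# exec_cost[t][p] = unreliability of executing task t on processor p
exec_cost = [
    [0.10, 0.30],   # task 0
    [0.25, 0.05],   # task 1
    [0.20, 0.20],   # task 2
]
# comm_cost[(i, j)] = unreliability added when tasks i and j
# communicate across different processors
comm_cost = {(0, 1): 0.15, (1, 2): 0.10}

def branch_and_bound(n_tasks, n_procs):
    best_cost, best_assign = float("inf"), None

    def dfs(assign, partial):
        nonlocal best_cost, best_assign
        if partial >= best_cost:        # prune: bound already exceeded
            return
        t = len(assign)
        if t == n_tasks:
            best_cost, best_assign = partial, tuple(assign)
            return
        # expand the cheapest-looking branch first (best-first flavour)
        for p in sorted(range(n_procs), key=lambda q: exec_cost[t][q]):
            extra = exec_cost[t][p] + sum(
                c for (i, j), c in comm_cost.items()
                if j == t and assign[i] != p)
            dfs(assign + [p], partial + extra)

    dfs([], 0.0)
    return best_cost, best_assign

cost, assignment = branch_and_bound(n_tasks=3, n_procs=2)
```

Here the search trades a little communication unreliability (co-locating tasks 1 and 2) against execution unreliability, and pruning cuts branches whose partial cost already exceeds the incumbent.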
… and Communication (UbiCC) Journal. Special Issue …, 2009
This paper presents an integrated approach for scheduling in distributed computing with strategies as sets of job-supporting schedules generated by a critical works method. The strategies are implemented using a combination of job-flow and application-level techniques of scheduling within virtual organizations of Grid. Applications are regarded as compound jobs with a complex structure containing several tasks co-allocated to processor nodes. The choice of a specific schedule depends on the load level and the resource dynamics and is formed as a resource request, which is sent to a local batch-job management system. We propose a scheduling framework and compare diverse types of scheduling strategies using simulation studies.
ArXiv, 2019
This work presents a decentralized allocation algorithm for safety-critical applications on parallel computing architectures, where individual Computational Units can be affected by faults. The described method consists in representing the architecture by an abstract graph in which each node represents a Computational Unit. Applications are likewise represented by the graph of Computational Units they require for execution. The problem is then to decide how to allocate Computational Units to applications so as to guarantee execution of the safety-critical application. The problem is formulated as an optimization problem in the form of an Integer Linear Program, and a state-of-the-art solver is used to solve it. Decentralizing the allocation process is achieved through redundancy of the allocator executed on the architecture. No centralized element decides on the allocation of the entire architecture, thus improving the reliability of the system. Experimental reproduction of a multi-co...
Dependability of …, 2009
This paper presents scheduling strategies for distributed computing. The fact that the architecture of the computational environment is distributed, heterogeneous, and dynamic, along with the autonomy of processor nodes, makes it much more difficult to manage and assign resources for job execution that fulfils user expectations for quality of service (QoS). The strategies are implemented using a combination of job-flow and application-level techniques of scheduling and resource co-allocation. A strategy is considered a set of possible job scheduling variants with a coordinated allocation of the tasks to the processor nodes. Applications are regarded as compound jobs with a complex structure containing several tasks. The choice of a specific schedule depends on the load level and the resource dynamics and is formed as a resource request, which is sent to a local batch-job management system.
Task scheduling algorithms in distributed and parallel systems play a vital role in providing better performance platforms for multiprocessor networks. A large number of policies that can determine the best structures of task scheduling algorithms have been explored so far. These policies have significant value for optimizing system efficiency. The objective of all these approaches is to maximize system throughput by assigning each task to a suitable processor, to maximize resource utilization, and to minimize execution time. In this survey, various algorithms for parallel and distributed systems are classified by reviewing former surveys. Then, task scheduling algorithms are discussed from different points of view, such as dependency among tasks, static vs. dynamic approaches, and heterogeneity of processors. Precedence-order methods like list heuristics are studied. Duplication-based algorithms, clustering heuristics, and scheduling methods inspired by nature's laws, such as the Genetic Algorithm (GA), are other kinds of approaches covered in this study.
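As one concrete instance of the list heuristics mentioned above, the sketch below is a generic longest-task-first list scheduler with hypothetical task data (not any specific published algorithm): tasks become ready when their predecessors finish and are placed on the processor that can start them earliest:

```python
def list_schedule(durations, preds, n_procs):
    """Toy list scheduler: repeatedly take ready tasks (all predecessors
    scheduled), prioritize the longest first, and assign each to the
    processor giving the earliest start time."""
    finish = {}                      # task -> finish time
    proc_free = [0.0] * n_procs      # per-processor availability
    remaining = set(durations)
    while remaining:
        ready = [t for t in remaining
                 if all(p in finish for p in preds.get(t, []))]
        ready.sort(key=lambda t: -durations[t])   # longest task first
        for t in ready:
            earliest = max([finish[p] for p in preds.get(t, [])],
                           default=0.0)
            i = min(range(n_procs),
                    key=lambda k: max(proc_free[k], earliest))
            start = max(proc_free[i], earliest)
            finish[t] = start + durations[t]
            proc_free[i] = finish[t]
            remaining.discard(t)
    return finish

# diamond DAG: a -> b, a -> c, (b, c) -> d, on two processors
durations = {"a": 2, "b": 3, "c": 1, "d": 2}
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
finish = list_schedule(durations, preds, 2)
makespan = max(finish.values())
```

On this diamond DAG, b and c run in parallel after a, and d waits for the slower branch, illustrating how the precedence order drives the schedule.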
International Journal of Intelligent Systems and Applications, 2016
Cloud users usually have different preferences over the applications they outsource to the cloud, based on the financial profit of each application's execution. Moreover, various types of virtual machines are offered by a cloud service provider with distinct characteristics, such as rental prices and availability levels, each with a different probability of occurrence and a penalty, which is paid to the user in case the virtual machine is not available. Therefore, the problem of application scheduling in cloud computing environments, considering the risk of financial loss of application-to-VM assignment, becomes a challenging issue. In this paper, we propose a risk-aware scheduling model, using risk analysis to allocate the applications to the virtual machines so that the expected total pay-off of an application is maximized, taking into account the priority of applications. A running example is used throughout the paper to better illustrate the model and its application to improving the efficiency of resource assignment in cloud computing scenarios.
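The expected pay-off of an application-to-VM assignment can be illustrated as follows. This is a minimal sketch with hypothetical prices, availabilities, and penalties (not the paper's model or data): the user gains the profit if the VM is available, collects the penalty otherwise, and pays the rental price either way:

```python
# Illustrative VM types (hypothetical figures, not from the paper)
vm_types = {
    "small": {"price": 1.0, "avail": 0.90, "penalty": 0.5},
    "large": {"price": 3.0, "avail": 0.99, "penalty": 2.0},
}

def expected_payoff(profit, vm):
    """Expected pay-off of running an application on a given VM type."""
    p = vm["avail"]
    return p * profit + (1 - p) * vm["penalty"] - vm["price"]

def best_vm(profit):
    """Pick the VM type maximizing the application's expected pay-off."""
    return max(vm_types,
               key=lambda name: expected_payoff(profit, vm_types[name]))

# A low-profit application tolerates the risk of the cheap VM;
# a high-profit one is worth the reliable, expensive VM.
cheap_choice = best_vm(2.0)
valuable_choice = best_vm(50.0)
```

The risk analysis thus steers valuable applications toward highly available VMs even at a higher rental price.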
International Journal of Computer Applications, 2013
Performance estimation of distributed software is a challenging problem. Distributed software runs on multiple processing nodes interconnected in some fashion, so its computational load is distributed onto the processing nodes of the given system. Such a system makes use of an appropriate task scheduling algorithm to obtain good performance. The program used in this work emulates a distributed system; an emulator gives results like an actual system. The emulator models a fully connected distributed system in which any two processors can communicate directly. The objective of this experiment is to identify the task scheduling algorithm that also performs well in the presence of communication fault delays, which occur because of network failures, or computation fault delays, which occur because of unresponsive processors in a distributed system.
2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2018
We study the expected completion time of some recently proposed algorithms for distributed computing which redundantly assign computing tasks to multiple machines in order to tolerate a certain number of machine failures. We analytically show that not only the amount of redundancy but also the task-to-machine assignments affect the latency in a distributed system. We study systems with a fixed number of computing tasks that are split in possibly overlapping batches, and independent exponentially distributed machine service times. We show that, for such systems, the uniform replication of non-overlapping (disjoint) batches of computing tasks achieves the minimum expected computing time.
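The disjoint-replication result above can be checked numerically. The sketch below (illustrative parameter values, not the paper's analysis) computes the closed-form expected completion time for b disjoint batches, each replicated on r machines with exponential service times, and verifies it against a Monte Carlo estimate:

```python
import random
from math import isclose

def expected_disjoint(b, r, mu=1.0):
    """Closed-form expected completion time for b disjoint batches, each
    replicated on r machines with i.i.d. Exp(mu) service times: a batch
    finishes with its fastest replica, i.e. in Exp(r*mu) time, and the
    job waits for the slowest batch (the max of b such minima), so
    E[T] = H_b / (r * mu) with H_b the b-th harmonic number."""
    harmonic = sum(1.0 / k for k in range(1, b + 1))
    return harmonic / (r * mu)

def simulate_disjoint(b, r, mu=1.0, runs=100_000, seed=1):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        total += max(min(rng.expovariate(mu) for _ in range(r))
                     for _ in range(b))
    return total / runs

analytic = expected_disjoint(b=2, r=2)   # harmonic(2) / (2 * 1) = 0.75
estimate = simulate_disjoint(b=2, r=2)
```

With overlapping batches the minima are no longer independent exponentials, which is why the disjoint assignment comes out ahead in the paper's analysis.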
S. Ghosh et al. (1998) presented a novel approach to providing fault tolerance for sets of independent, periodic tasks with rate-monotonic scheduling. We extend this approach to tasks that share logical or physical resources (and hence require synchronization). We show that if the simple rate-monotonic dispatch is replaced by stack scheduling (T. Baker, 1991), the worst-case blocking overhead of the stack resource policy and the worst-case retry overhead for fault tolerance are not additive; rather, only the maximum of the two overheads is incurred.
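The quantitative point — that the two overheads combine by maximum rather than by sum — can be shown with hypothetical numbers (illustrative only, not a full schedulability analysis):

```python
def recovery_overhead(blocking, retry):
    """With stack scheduling, the worst-case blocking overhead B of the
    stack resource policy and the worst-case retry overhead R for fault
    tolerance do not both land on the same critical path: only
    max(B, R) is charged, not B + R."""
    return max(blocking, retry)

# hypothetical per-task overheads (illustrative numbers)
blocking, retry = 3.0, 5.0
additive = blocking + retry                  # 8.0: pessimistic bound
tight = recovery_overhead(blocking, retry)   # 5.0: the tighter bound
```

The difference between the two terms is exactly the pessimism removed from the worst-case response-time bound.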
IEEE Design & Test, 2018
The safe and, at the same time, efficient deployment of parallelisable applications on many-core platforms is a challenging task. Theoretical Models of Computation (MoC) require realistic estimation of task Worst-Case Execution Time (WCET) to provide safe latency guarantees. Due to interferences on shared resources, task WCET estimations are often exceedingly pessimistic. In reality, though, rarely do all the tasks execute with their WCET, thus introducing an efficiency gap, which is of consequence in realizing safety-critical and mixed-criticality systems. In this paper, we outline the additional research efforts required to i) derive a safe deployment from a MoC reducing that efficiency gap and ii) adapt at runtime to further improve performance while still preserving safety. We also outline the impact of the level of data-parallelisation on this efficiency gap and present experimental evidence of the performance improvements from accurate WCET estimation, the level of data-parallelisation, and runtime adaptation.
Proceedings of the 17th …, 2005
Scientific investigations have to deal with rapidly growing amounts of data from simulations and experiments. During data analysis, scientists typically want to extract subsets of the data and perform computations on them. In order to speed up the analysis, computations are performed on distributed systems such as computer clusters or Grid systems. A well-known difficult problem is to build systems that execute the computations and data movement in a coordinated fashion. In this paper, we describe an architecture for executing co-scheduled tasks of computation and data movement on a computer cluster that takes advantage of two technologies currently used in distributed Grid systems. The first is Condor, which manages the scheduling and execution of distributed computation, and the second is Storage Resource Managers (SRMs), which manage the space usage and content of storage systems. This is achieved by including the information about the availability of files on the nodes provided by SRMs into the advertised information that Condor uses for the purpose of matchmaking. The system is capable of dynamic load balancing by replicating popular files on idle nodes. To confirm the feasibility of our approach, a prototype system was built on a computer cluster. Several experiments based on real work logs were performed. We observed that without replication compute nodes are underutilized and job wait times in the scheduler's queue are longer. This architecture can be used in wide-area Grid systems since the basic components are already used for the Grid.
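The matchmaking idea — folding SRM-advertised file locations into the node information used for matching — can be sketched roughly as follows. This is a toy stand-in for ClassAd matching with hypothetical ad fields, not Condor's actual API:

```python
def matches(job, node):
    """A node matches a job if it satisfies the job's requirements
    (here, simply enough CPUs)."""
    return node["cpus"] >= job["cpus"]

def rank(job, node):
    """Rank data-local nodes higher: per the paper's idea, the node ad
    carries the SRM-provided list of locally cached files."""
    return 1 if job["input_file"] in node["local_files"] else 0

nodes = [
    {"name": "n1", "cpus": 4, "local_files": {"a.dat"}},
    {"name": "n2", "cpus": 4, "local_files": {"b.dat"}},
]
job = {"cpus": 2, "input_file": "b.dat"}

# among matching nodes, prefer the one already holding the input file
best = max((n for n in nodes if matches(job, n)),
           key=lambda n: rank(job, n))
```

The same rank term is what makes replicating popular files onto idle nodes pay off: replicas turn previously unattractive nodes into data-local matches.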
2014
In this paper, we describe an algorithm to detect a stable property for a dynamic distributed system that does not suffer from any of the limitations described above. Our approach is based on maintaining a spanning tree of all processes currently participating in the computation. The spanning tree, which is dynamically changing, is used to collect local snapshots of processes periodically. Processes can join and leave the system while a snapshot algorithm is in progress. We identify sufficient conditions under which a collection of local snapshots can be safely used to evaluate a stable property. Specifically, the collection has to be consistent (local states in the collection are pair-wise consistent) and complete (no local state necessary for correctly evaluating the property is missing from the collection). We also identify a condition that allows the current root of the spanning tree to detect termination of the snapshot algorithm even if the algorithm was initiated by an ―earli...
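As an illustration of pairwise consistency, a standard vector-clock check (used here as an assumed stand-in for the paper's exact condition) rejects collections in which one local state has observed more of another process's history than that process's own snapshot records:

```python
def pairwise_consistent(clocks):
    """Consistent-cut check over vector clocks: local snapshots are
    pairwise consistent iff no snapshot j has seen more events of
    process i than process i's own snapshot records."""
    n = len(clocks)
    return all(clocks[j][i] <= clocks[i][i]
               for i in range(n) for j in range(n))

# state 1 has seen 1 event of process 0; state 0 itself records 2: OK
consistent = pairwise_consistent([[2, 1], [1, 3]])
# state 1 has seen 3 events of process 0, but state 0 records only 2
inconsistent = pairwise_consistent([[2, 1], [3, 3]])
```

Completeness is the separate requirement that no needed local state is missing from the collection; the check above covers only consistency.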
2007
We present buffered coscheduling, a new methodology to multitask parallel jobs in a message-passing environment and to develop parallel programs that can pave the way to the efficient implementation of a distributed operating system. Buffered coscheduling is based on three innovative techniques: communication buffering, strobing, and non-blocking communication. By leveraging these techniques, we can perform effective optimizations based on the global status of the parallel machine rather than on the limited knowledge available locally to each processor. The advantages of buffered coscheduling include higher resource utilization, reduced communication overhead, efficient implementation of flow-control strategies and fault-tolerant protocols, accurate performance modeling, and a simplified yet still expressive parallel programming model. Preliminary experimental results show that buffered coscheduling is very effective in increasing the overall performance in the presence of load imbalance and communication-intensive workloads.
1996
The problem considered in this thesis is how to run a workload of multiple parallel jobs on a single parallel machine. Jobs are assumed to be data-parallel with large degrees of parallelism, and the machine is assumed to have an MIMD architecture. We identify a spectrum of scheduling policies between the two extremes of time-slicing, in which jobs take turns to use the whole machine, and space-slicing, in which jobs get disjoint subsets of processors for their own dedicated use. Each of these scheduling policies is evaluated using a metric suited for interactive execution: the minimum machine power being devoted to any job, averaged over time. The following result is demonstrated. If there is no advance knowledge of job characteristics (such as running time, I/O frequency and communication locality) the best scheduling policy is gang-scheduling with instruction-balance. This conclusion validates some of the current practices in commercial systems. The proof uses the notions of clair...
ubicc.org
In this work, we present slot selection algorithms for job batch scheduling in distributed computing with non-dedicated resources. Jobs are parallel, mutually independent applications. Existing approaches to resource co-allocation and parallel job scheduling in economic models of distributed computing are based on a search for time-slots in resource occupancy schedules. The sought time-slots must match the requirements of necessary span, computational resource properties, and cost. Usually such scheduling methods consider only one suitable variant of the time-slot set. This work discloses a scheduling scheme that features a multi-variant search. Two linear-complexity algorithms for the search of alternative variants are proposed. Having several optional resource configurations for each job provides an opportunity to optimize the execution of the whole batch of jobs and to increase the overall efficiency of scheduling.
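A minimal sketch of the multi-variant slot search, assuming a per-node list of sorted, disjoint busy intervals (illustrative only, not the paper's exact algorithms):

```python
def free_slots(busy, span, horizon):
    """Single linear scan over a node's occupancy schedule (sorted,
    disjoint (start, end) busy intervals), returning *all* free windows
    of length >= span up to the horizon — the multi-variant search:
    every alternative slot is kept so the batch scheduler can later
    pick the best combination across jobs."""
    slots, cursor = [], 0
    for start, end in busy:
        if start - cursor >= span:
            slots.append((cursor, start))
        cursor = max(cursor, end)
    if horizon - cursor >= span:
        slots.append((cursor, horizon))
    return slots

# busy intervals on one node; we need a 3-unit window before time 20
alternatives = free_slots(busy=[(2, 5), (9, 10), (14, 18)], span=3,
                          horizon=20)
```

Each returned window is one candidate resource configuration for the job; collecting all of them per node in one pass is what keeps the search linear.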
2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2014
Task scheduling and execution over large-scale, distributed systems play an important role in achieving good performance and high system utilization. Due to the explosion of parallelism found in today's hardware, applications need to perform over-decomposition to deliver good performance; this over-decomposition is driving job management systems' requirements to support applications with a growing number of tasks with finer granularity. Our goal in this work is to provide a compact, light-weight, scalable, and distributed task execution framework (CloudKon) that builds upon cloud computing building blocks (Amazon EC2, SQS, and DynamoDB). Most of today's state-of-the-art job execution systems have predominantly Master/Slaves architectures, which have inherent limitations, such as scalability issues at extreme scales and single points of failure. On the other hand, distributed job management systems are complex and employ non-trivial load balancing algorithms to maintain good utilization. CloudKon is a distributed job management system that can support both HPC and MTC workloads with millions of tasks/jobs. We compare our work with other state-of-the-art job management systems, including Sparrow and MATRIX. The results show that CloudKon delivers better scalability compared to other state-of-the-art systems for some metrics, all with a significantly smaller code-base (5%).
Integrated Research in Grid Computing, CoreGRID, 2008