2001, Lecture Notes in Computer Science
We develop a new metric for job scheduling that includes the effects of memory contention amongst simultaneously-executing jobs that share a given level of memory. Rather than assuming each job or process has a fixed, static memory requirement, we consider a general scenario wherein a process's performance monotonically increases as a function of allocated memory, as defined by a miss-rate versus memory size curve. Given a schedule of jobs in a shared-memory multiprocessor (SMP), and an isolated miss-rate versus memory size curve for each job, we use an analytical memory model to estimate the overall memory miss-rate for the schedule. This, in turn, can be used to estimate overall performance. We develop a heuristic algorithm to find a good schedule of jobs on an SMP that minimizes memory contention, thereby improving memory and overall performance.
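As a rough illustration of the problem described above, the sketch below estimates the combined miss rate a group of co-scheduled jobs could achieve by splitting a shared memory level according to each job's isolated miss-rate versus memory-size curve, and then groups jobs greedily to keep that combined miss rate low. This is a minimal sketch under assumed inputs (the example curves, the exhaustive split, and the greedy pairing are illustrative), not the paper's analytical model or heuristic.

```python
# Illustrative sketch (not the paper's analytical model): estimate the best
# combined miss rate of jobs that split a shared memory level, then greedily
# group jobs so that combined miss rates stay low.
from itertools import combinations

def combined_miss_rate(curves, total_mem, step=1):
    """Split `total_mem` among the jobs (in units of `step`) and return the
    lowest achievable sum of miss rates.  `curves` maps a job name to a
    callable: allocated memory -> isolated miss rate."""
    jobs = list(curves)
    best = float("inf")

    def recurse(i, remaining, acc):
        nonlocal best
        if i == len(jobs) - 1:                 # the last job takes what is left
            best = min(best, acc + curves[jobs[i]](remaining))
            return
        for m in range(0, remaining + 1, step):
            recurse(i + 1, remaining - m, acc + curves[jobs[i]](m))

    recurse(0, total_mem, 0.0)
    return best

def greedy_schedule(curves, total_mem, group_size=2):
    """Greedily form co-schedule groups with the smallest combined miss rate."""
    remaining = set(curves)
    groups = []
    while len(remaining) > group_size:
        group = min(combinations(sorted(remaining), group_size),
                    key=lambda g: combined_miss_rate({j: curves[j] for j in g},
                                                     total_mem))
        groups.append(group)
        remaining -= set(group)
    if remaining:
        groups.append(tuple(sorted(remaining)))
    return groups

# Made-up miss-rate curves for three jobs sharing 8 units of memory.
curves = {
    "A": lambda m: 1.0 / (1 + m),          # benefits strongly from more memory
    "B": lambda m: 0.4,                    # streaming job, largely insensitive
    "C": lambda m: 0.8 / (1 + 0.5 * m),
}
print(greedy_schedule(curves, total_mem=8))
```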
2016
Scientific and technological advances in the area of integrated circuits have allowed the performance of microprocessors to grow exponentially since the late 1960s. However, the imbalance between processor and memory bus capacity has increased in recent years. The increasing on-chip parallelism of multi-core processors has turned the memory subsystem into a key factor for achieving high performance. When two or more processes share the memory subsystem their execution times typically increase, even at relatively low levels of memory traffic. Current research shows that a throughput increase of up to 40% is possible if the job-scheduler can minimize the slowdown caused by memory contention in industrial multi-core systems such as high-performance clusters, datacenters, or clouds. In order to optimize throughput, the job-scheduler has to know how much slower a process will execute when co-scheduled on the same server as other processes. Consequently, unless the slowdown is known...
IFIP International Federation for Information Processing, 2006
This paper presents an extension of the Latency Time (LT) scheduling algorithm for assigning tasks with arbitrary execution times on a multiprocessor with shared memory. The Extended Latency Time (ELT) algorithm adds to the priority function the synchronization associated with access to the shared memory. The assignment is carried out by associating with each task a time window of the same size as its duration, which decreases for every time unit that goes by. The proposed algorithm is compared with the Insertion Scheduling Heuristic (ISH). Analysis of the results establishes that ELT has better performance with fine-granularity tasks (computing time comparable to synchronization time), and also when the number of processors available to carry out the assignment increases.
ACM SIGMETRICS Performance Evaluation Review, 1991
In shared-memory multiprocessor systems it may be more efficient to schedule a task on one processor than on another. Due to the inevitability of idle processors in these environments, there exists an important tradeoff between keeping the workload balanced and scheduling tasks where they run most efficiently. The purpose of an adaptive task migration policy is to determine the appropriate balance between the extremes of this load sharing tradeoff. We make the observation that there are considerable differences between this load sharing problem in distributed and shared-memory multiprocessor systems, and we formulate a queueing-theoretic model of task migration to study the problem. A detailed mathematical analysis of the model is developed, which includes the effects of increased contention for system resources induced by the task migration policy. Our objective is to provide a better understanding of task migration in shared-memory multiprocessor environments. In particular, we illu...
2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010
In a modern chip-multiprocessor system, memory is a shared resource among multiple concurrently executing threads. The memory scheduling algorithm should resolve memory contention by arbitrating memory access in such a way that competing threads progress at a relatively fast and even pace, resulting in high system throughput and fairness. Previously proposed memory scheduling algorithms are predominantly optimized for only one of these objectives: no scheduling algorithm provides the best system throughput and best fairness at the same time. This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately with the goal of achieving the best of both. The main idea is to divide threads into two separate clusters and employ different memory request scheduling policies in each cluster. Our proposal, Thread Cluster Memory scheduling (TCM), dynamically groups threads with similar memory access behavior into either the latency-sensitive (memory-non-intensive) or the bandwidth-sensitive (memory-intensive) cluster. TCM introduces three major ideas for prioritization: 1) we prioritize the latency-sensitive cluster over the bandwidth-sensitive cluster to improve system throughput; 2) we introduce a "niceness" metric that captures a thread's propensity to interfere with other threads; 3) we use niceness to periodically shuffle the priority order of the threads in the bandwidth-sensitive cluster to provide fair access to each thread in a way that reduces inter-thread interference. On the one hand, prioritizing memory-non-intensive threads significantly improves system throughput without degrading fairness, because such "light" threads only use a small fraction of the total available memory bandwidth. On the other hand, shuffling the priority order of memory-intensive threads improves fairness because it ensures no thread is disproportionately slowed down or starved. We evaluate TCM on a wide variety of multiprogrammed workloads and compare its performance to four previously proposed scheduling algorithms, finding that TCM achieves both the best system throughput and fairness. Averaged over 96 workloads on a 24-core system with 4 memory channels, TCM improves system throughput and reduces maximum slowdown by 4.6%/38.6% compared to ATLAS (previous work providing the best system throughput) and 7.6%/4.6% compared to PAR-BS (previous work providing the best fairness).
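The clustering and "niceness" ideas above can be illustrated with a small sketch. The thresholds, the niceness formula, and the shuffling rule below are simplified assumptions for illustration; they are not TCM's actual parameters or hardware mechanism.

```python
# Illustrative sketch of thread clustering and niceness-based shuffling;
# thresholds and formulas are simplified assumptions, not TCM's actual rules.
import random
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    mpki: float               # misses per kilo-instruction (memory intensity)
    bank_parallelism: float   # how well the thread spreads requests over banks
    row_hit_rate: float       # row-buffer locality

def cluster(threads, latency_fraction=0.10):
    """Put the lightest threads into the latency-sensitive cluster until they
    account for `latency_fraction` of the total memory intensity."""
    total = sum(t.mpki for t in threads)
    latency, bandwidth, used = [], [], 0.0
    for t in sorted(threads, key=lambda th: th.mpki):
        if used + t.mpki <= latency_fraction * total:
            latency.append(t)
            used += t.mpki
        else:
            bandwidth.append(t)
    return latency, bandwidth

def niceness(t):
    # High bank parallelism -> suffers more from interference (nicer);
    # high row-buffer locality -> causes more interference (less nice).
    return t.bank_parallelism - t.row_hit_rate

def shuffled_priorities(bandwidth_cluster):
    """Rank the bandwidth-sensitive cluster by niceness, then shuffle the
    lower half so every thread periodically gets a turn at high priority."""
    ranked = sorted(bandwidth_cluster, key=niceness, reverse=True)
    half = len(ranked) // 2
    tail = ranked[half:]
    random.shuffle(tail)
    return ranked[:half] + tail

threads = [Thread(i, random.uniform(0.1, 20), random.random(), random.random())
           for i in range(8)]
lat, bw = cluster(threads)
print([t.tid for t in lat], [t.tid for t in shuffled_priorities(bw)])
```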
JCO, 1998
We show that there is a good algorithm for minimizing the average completion time of a set of unknown DAGs (i.e., data dependency relation graphs of programs) on a multiprocessor in the PRAM model (or other similar shared-memory models). Then, we show that a large class of parallel jobs can be scheduled with near-optimal average completion time in the BSP model (Valiant, 1990a), though this is not possible for the class of all unknown DAGs (Deng and Koutsoupias, 1993); the same holds for other similar distributed-memory models.
Space-sharing is regarded as the proper resource management scheme for many-core OSes. For today's many-core chips and parallel programming models providing no explicit resource requirements, an important research problem is to provide a proper resource allocation to the running applications while considering not only the architectural features but also the characteristics of the parallel applications. In this paper, we introduce a space-shared scheduling strategy for shared-memory parallel programs. To properly assign disjoint sets of cores to simultaneously running parallel applications, the proposed scheme considers the performance characteristics of the executing (parallel) code section of all running applications. The information about the performance is used to compute a proper core allocation in accordance with the goal of the scheduling policy given by the system manager. We have first implemented a user-level scheduling framework that runs on Linux-based multi-core chips. A simple performance model based solely on online profile data is used to characterize the performance scalability of applications. The framework is evaluated for two scheduling policies, balancing and maximizing QoS, and on two different many-core platforms, a 64-core AMD Opteron platform and a 36-core Tile-Gx36 processor. Experimental results of various OpenMP benchmarks show that in general our space-shared scheduling outperforms the standard Linux scheduler and meets the goal of the active scheduling policy.
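A hypothetical sketch of how online scalability profiles could drive a space-shared core allocation, in the spirit of the framework described above: each application reports a measured speedup curve, and cores are handed out greedily to whichever application gains the most from one more core. The speedup data and the water-filling rule are illustrative assumptions, not the paper's performance model.

```python
# Hypothetical sketch: greedy "water-filling" core allocation driven by
# measured speedup curves; the profile data and the rule are illustrative.
def allocate_cores(speedups, total_cores):
    """Give every application one core, then repeatedly hand the next free
    core to the application whose measured speedup gains the most from it.
    `speedups[app][k]` is the measured speedup of `app` on k+1 cores."""
    alloc = {app: 1 for app in speedups}
    for _ in range(total_cores - len(alloc)):
        def marginal_gain(app):
            k = alloc[app]
            curve = speedups[app]
            if k >= len(curve):        # no profile data beyond this core count
                return 0.0
            return curve[k] - curve[k - 1]
        best = max(alloc, key=marginal_gain)
        alloc[best] += 1
    return alloc

# Illustrative online profiles of two OpenMP applications.
speedups = {
    "cg": [1.0, 1.9, 2.7, 3.4, 3.9, 4.2],   # keeps scaling
    "ft": [1.0, 1.6, 2.0, 2.2, 2.3, 2.3],   # saturates early
}
print(allocate_cores(speedups, total_cores=8))
```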
Techniques for predicting the efficiency of multi-core processing associated with a set of tasks with varied CPU and main memory requirements are introduced. Given a set of tasks, each with different CPU and main memory requirements, and a multi-core system (which generally has fewer cores than the number of tasks), our goal is to derive equations for upper and lower bounds to estimate the efficiency with which the tasks are executed. Prediction of execution efficiency of processes due to CPU and required memory availability is important in the context of making process assignment, load balancing, and scheduling decisions in distributed systems. Input parameters to the models include: number of cores, number of threads, CPU usage factor of threads, available memory frames, required amount of memory for each thread, and others. Additionally, a CPU availability average prediction model is introduced from the empirical study for the set of applications that require a single predicted value instead of bounds. Extensive experimental studies and statistical analyses are performed, and we observe that the proposed efficiency bounds are consistently tight. The model provides the basis of an empirical model for predicting execution efficiency of threads while CPU and memory resources are uncertain. To facilitate scientific and controlled empirical evaluation, real-world benchmark programs with dynamic behavior are employed on UNIX systems, parameterized by their CPU usage factor and memory requirement.
Software Engineering / 811: Parallel and Distributed Computing and Networks / 816: Artificial Intelligence and Applications, 2014
When two or more programs are co-scheduled on the same multicore computer they might experience a slowdown due to the limited off-chip memory bandwidth. According to our measurements, this slowdown does not depend on the total bandwidth use in a simple way. One thing we observe is that a higher memory bandwidth usage will not always lead to a larger slowdown. This means that relying on bandwidth usage as input to a job scheduler might cause non-optimal scheduling of processes on multicore nodes in clusters, clouds, and grids. To guide scheduling decisions, we instead propose a slowdown-based characterization approach. Real slowdowns are complex to measure due to the exponential number of experiments needed. Thus, we present a novel method for estimating the slowdown programs will experience when co-scheduled on the same computer. We evaluate the method by comparing its predictions with real slowdown data and with the often-used memory bandwidth based method. This study shows that a scheduler relying on slowdown-based categorization makes fewer incorrect co-scheduling choices, and the negative impact on program execution times is less than when using a bandwidth-based categorization method.
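To make the contrast concrete, the sketch below pairs programs for co-scheduling using a table of predicted pairwise slowdowns. The slowdown values are illustrative inputs and the greedy pairing is an assumption, not the paper's estimation method.

```python
# Illustrative sketch: choose co-schedule pairs from a table of predicted
# pairwise slowdowns (the values and the greedy pairing are assumptions).
from itertools import combinations

def pair_by_slowdown(programs, slowdown):
    """Greedily pair programs so the predicted mutual slowdown is smallest.
    `slowdown[(a, b)]` = predicted slowdown of `a` when co-run with `b`."""
    remaining, pairs = set(programs), []
    while len(remaining) > 1:
        a, b = min(combinations(sorted(remaining), 2),
                   key=lambda p: slowdown[(p[0], p[1])] + slowdown[(p[1], p[0])])
        pairs.append((a, b))
        remaining -= {a, b}
    return pairs

programs = ["lbm", "mcf", "povray", "namd"]
slowdown = {(a, b): 1.0 for a in programs for b in programs if a != b}
slowdown[("lbm", "mcf")] = slowdown[("mcf", "lbm")] = 1.6   # both memory-heavy
print(pair_by_slowdown(programs, slowdown))   # keeps lbm and mcf apart
```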
Parallel and Distributed Computing Systems, 2000
In this work we present an innovative kernel-level scheduling methodology designed for multiprogrammed shared-memory multiprocessors. We propose three scheduling policies equipped with both dynamic space sharing and time sharing, to ensure the scalability of parallel programs under multiprogramming while increasing processor utilization and overall system performance. Our scheduling methodology is designed for multidisciplinary multiprocessor schedulers that need to handle
Lecture Notes in Computer Science, 1998
The evaluation of parallel job schedulers hinges on two things: the use of appropriate metrics, and the use of appropriate workloads on which the scheduler can operate. We argue that the focus should be on on-line open systems, and propose that a standard workload should be used as a benchmark for schedulers. This benchmark will specify distributions of parallelism and runtime, as found by analyzing accounting traces, and also internal structures that create different speedup and synchronization characteristics. As for metrics, we present some problems with slowdown and bounded slowdown that have been proposed recently.
Proceedings 11th International Parallel Processing Symposium, 1997
This paper demonstrates the effectiveness of the two-phase method of scheduling, in which task clustering is performed prior to the actual scheduling process. Task clustering determines the optimal or near-optimal number of processors on which to schedule the task graph. In other words, there is never a need to use more processors (even though they are available) than the number of clusters produced by the task clustering algorithm. The paper also indicates that when task clustering is performed prior to scheduling, load balancing (LB) is the preferred approach for cluster merging. LB is fast, easy to implement, and produces significantly better final schedules than communication traffic minimizing (CTM). In summary, the two-phase method consisting of task clustering and load balancing is a simple yet highly effective strategy for scheduling task graphs on distributed memory parallel architectures.
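The cluster-merging step favored above, load balancing, can be sketched as follows: once clustering has produced more clusters than processors, assign each cluster, heaviest first, to the currently least-loaded processor. The cluster weights in the example are made up for illustration.

```python
# Illustrative sketch of LB cluster merging: assign each cluster, heaviest
# first, to the currently least-loaded processor (cluster weights are made up).
import heapq

def load_balance_merge(cluster_weights, num_processors):
    """Map clusters onto processors so as to keep processor loads balanced."""
    heap = [(0.0, p) for p in range(num_processors)]   # (current load, processor)
    heapq.heapify(heap)
    assignment = {}
    for cluster, weight in sorted(cluster_weights.items(),
                                  key=lambda kv: kv[1], reverse=True):
        load, proc = heapq.heappop(heap)               # least-loaded processor
        assignment[cluster] = proc
        heapq.heappush(heap, (load + weight, proc))
    return assignment

clusters = {"c0": 12.0, "c1": 7.5, "c2": 7.0, "c3": 3.0, "c4": 2.5}
print(load_balance_merge(clusters, num_processors=2))
```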
Proceedings of the ACM on Measurement and Analysis of Computing Systems
To keep pace with Moore's law, chip designers have focused on increasing the number of cores per chip rather than single core performance. In turn, modern jobs are often designed to run on any number of cores. However, to effectively leverage these multi-core chips, one must address the question of how many cores to assign to each job. Given that jobs receive sublinear speedups from additional cores, there is an obvious tradeoff: allocating more cores to an individual job reduces the job's runtime, but in turn decreases the efficiency of the overall system. We ask how the system should schedule jobs across cores so as to minimize the mean response time over a stream of incoming jobs. To answer this question, we develop an analytical model of jobs running on a multi-core machine. We prove that EQUI, a policy which continuously divides cores evenly across jobs, is optimal when all jobs follow a single speedup curve and have exponentially distributed sizes. EQUI requires jobs t...
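A minimal sketch of the EQUI policy named above: at every scheduling instant the cores are divided as evenly as possible among the jobs currently in the system. The sublinear speedup curve in the example is an illustrative assumption.

```python
# Minimal sketch of EQUI: divide the cores as evenly as possible among the
# jobs currently in the system (the speedup curve below is illustrative).
def equi_allocation(num_jobs, num_cores):
    """Per-job core counts under EQUI, spreading any remainder one by one."""
    if num_jobs == 0:
        return []
    base, extra = divmod(num_cores, num_jobs)
    return [base + (1 if i < extra else 0) for i in range(num_jobs)]

def service_rate(cores, speedup=lambda k: k ** 0.5):
    """Rate at which a job completes work, given a sublinear speedup curve."""
    return speedup(cores)

alloc = equi_allocation(num_jobs=3, num_cores=16)
print(alloc, [round(service_rate(k), 2) for k in alloc])
```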
International Journal of Advanced Computer Science and Applications, 2014
Multicore technology enables the system to perform more tasks with higher overall system performance. However, this performance cannot be fully exploited due to the high miss rate in the second-level shared cache among the cores, which represents one of the challenges of multicore systems. This paper addresses the dynamic co-scheduling of tasks in multicore real-time systems. The focus is on the basic idea of the megatask technique for grouping the tasks that may affect the shared cache miss rate, and the Pfair scheduling that is then used for reducing the concurrency within the grouped tasks while ensuring the real-time constraints. Consequently, the shared cache miss rate is reduced. The dynamic co-scheduling is proposed through the combination of the symbiotic technique with the megatask technique for co-scheduling the tasks based on the collected information, using two schemes. The first scheme measures the temporal working set size of each running task at run time, while the second scheme collects the shared cache miss rate of each running task at run time. Experiments show that the proposed dynamic co-scheduling can decrease the shared cache miss rate compared to the static one by 52%. This indicates that dynamic co-scheduling is important to achieve high performance with shared cache memory when running heavy workloads such as multimedia applications that require real-time response and continuous-media data types.
The problem of scheduling a set of applications on a multiprocessor system has been investigated from a number of different points of view. This paper describes our work on the scheduling problem at the user level, where we have to distribute evenly the parallel tasks that compose a program among a set of processors. We investigated dynamic scheduling heuristics applied to loops on distributed-memory multiprocessors. Our approaches try to combine the workload balancing goal with the data locality exploitation, as many scientific applications exhibit some processor affinity when executing on a distributed-memory machine. Some experimental results on a CM-5 comparing our heuristics with some other approaches are presented.
Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems - SIGMETRICS '93, 1993
As a process executes on a CPU, it builds up state in that CPU's cache. In multiprogrammed workloads, the opportunity to reuse this state may be lost when a process gets rescheduled, either because intervening processes destroy its cache state or because the process may migrate to another processor. In this paper, we explore affinity scheduling, a technique that helps reduce cache misses by preferentially scheduling a process on a CPU where it has run recently. Our study focuses on a bus-based multiprocessor executing a variety of workloads, including mixes of scientific, software development, and database applications. In addition to quantifying the performance benefits of exploiting affinity, our study is distinctive in that it provides low-level data from a hardware performance monitor that details why the workloads perform as they do. Overall, for the workloads studied, we show that affinity scheduling reduces the number of cache misses by 7-36%, resulting in execution time improvements of up to 10%. Although the overall improvements are small, modifying the OS scheduler to exploit affinity appears worthwhile: affinity has no negative impact on the workloads, and we show that it is extremely simple to add to existing schedulers.
Journal of Parallel and Distributed Computing, 1995
As a process executes on a CPU, it builds up state in that CPU's cache. In multiprogrammed workloads, the opportunity to reuse this state may be lost when a process gets rescheduled, either because intervening processes destroy its cache state or because the process may migrate to another processor. In this paper, we explore affinity scheduling, a technique that helps reduce cache misses by preferentially scheduling a process on a CPU where it has run recently. Our study focuses on a bus-based multiprocessor executing a variety of workloads, including mixes of scientific, software development, and database applications. In addition to quantifying the performance benefits of exploiting affinity, our study is distinctive in that it provides low-level data from a hardware performance monitor that details why the workloads perform as they do. Overall, for the workloads studied, we show that affinity scheduling reduces the number of cache misses by 7-36%, resulting in execution time improvements of up to 10%. Although the overall improvements are small, modifying the OS scheduler to exploit affinity appears worthwhile: affinity has no negative impact on the workloads, and we show that it is extremely simple to add to existing schedulers.
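The affinity-scheduling idea described in the two abstracts above can be sketched as follows: when a CPU becomes idle, prefer a runnable process that last ran on that CPU, so it can reuse the cache state it built up there. The run-queue representation below is illustrative, not an actual kernel's data structure.

```python
# Illustrative sketch of affinity scheduling: when a CPU goes idle, prefer a
# runnable process that last ran on that CPU (run-queue layout is made up).
from collections import deque

def pick_next(run_queue, idle_cpu):
    """Return the first queued process with affinity for `idle_cpu`,
    falling back to plain FIFO order if none has run there before."""
    for proc in run_queue:
        if proc.get("last_cpu") == idle_cpu:
            run_queue.remove(proc)
            return proc
    return run_queue.popleft() if run_queue else None

run_queue = deque([
    {"pid": 101, "last_cpu": 2},
    {"pid": 102, "last_cpu": 0},
    {"pid": 103, "last_cpu": None},   # has not been scheduled yet
])
print(pick_next(run_queue, idle_cpu=0))        # picks pid 102
print([p["pid"] for p in run_queue])           # [101, 103]
```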
2007
The major chip manufacturers have all introduced chip multiprocessing (CMP) and simultaneous multithreading (SMT) technology into their processing units. As a result, even low-end computing systems and game consoles have become shared memory multiprocessors with L1 and L2 cache sharing within a chip. Mid- and large-scale systems will have multiple processing chips and hence consist of an SMP-CMP-SMT configuration with non-uniform data sharing overheads. Current operating system schedulers are not aware of these new cache organizations, and as a result, distribute threads across processors in a way that causes many unnecessary, long-latency cross-chip cache accesses.
2008 37th International Conference on Parallel Processing, 2008
On systems with multi-core processors, the memory access scheduling scheme plays an important role not only in utilizing the limited memory bandwidth but also in balancing the program execution on all cores. In this study, we propose a scheme, called ME-LREQ, which considers the utilization of both processor cores and the memory subsystem. It takes into consideration both the long-term and short-term gains of serving a memory request by prioritizing requests that hit in the row buffers and that come from cores that can utilize memory more efficiently and have fewer pending requests. We have also thoroughly evaluated a set of memory scheduling schemes that differentiate and prioritize requests from different cores. Our simulation results show that for memory-intensive, multiprogramming workloads, the new policy improves the overall performance by 10.7% on average and up to 17.7% on a four-core processor, when compared with the scheme that serves row-buffer-hit memory requests first and allows memory reads to bypass writes; and by up to 9.2% (6.4% on average) when compared with the scheme that serves requests from the core with the fewest pending requests first.
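A hypothetical sketch of the prioritization idea described above: among pending memory requests, favor row-buffer hits first, then requests from cores that use memory more efficiently and have fewer outstanding requests. The priority ordering and the efficiency metric are illustrative assumptions, not the actual ME-LREQ rule.

```python
# Hypothetical sketch of the prioritization idea (not the actual ME-LREQ rule):
# favor row-buffer hits, then requests from cores that use memory efficiently
# and have fewer pending requests.
from dataclasses import dataclass

@dataclass
class Request:
    core: int
    row_hit: bool    # would this request hit the currently open row buffer?

def pick_request(requests, core_efficiency, pending_per_core):
    """Select the next memory request to service under the sketched policy."""
    def priority(req):
        return (
            req.row_hit,                       # short-term gain: row hits first
            core_efficiency[req.core],         # long-term gain: efficient cores
            -pending_per_core[req.core],       # then fewest pending requests
        )
    return max(requests, key=priority)

reqs = [Request(core=0, row_hit=False),
        Request(core=1, row_hit=True),
        Request(core=2, row_hit=True)]
efficiency = {0: 0.9, 1: 0.4, 2: 0.7}   # e.g. useful work per memory access
pending = {0: 3, 1: 8, 2: 2}
print(pick_request(reqs, efficiency, pending))   # Request(core=2, row_hit=True)
```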
Proceedings of the 1997 International Symposium on Parallel Architectures, Algorithms and Networks (I-SPAN'97), 1997
The thesis of this research is that the task of exposing the parallelism in a given application should be left to the algorithm designer, who has intimate knowledge of the application characteristics. On the other hand, the task of limiting the parallelism in a chosen parallel algorithm is best handled by the compiler or operating system for the target MPP machine. Toward this end, we have developed CASS (for Clustering And Scheduling System), a task management system that provides facilities for automatic granularity optimization and task scheduling of parallel programs on distributed memory parallel architectures. Our tool environment, CASS, consists of a two-phase method of compile-time scheduling, in which task clustering is performed prior to the actual scheduling process. The clustering module identifies the optimal number of processing nodes that the program will require to obtain maximum performance on the target parallel machine. The scheduling module maps the clusters onto a fixed number of processors and determines the order of execution of tasks in each processor.