1990
In this paper we analyze a model of a parallel processing system. In our model there is a single queue which is served by K ≥ 1 identical processors. Jobs are assumed to consist of a sequence of barrier synchronizations where, at each step, the number of tasks that must be synchronized is random with a known distribution. An exact analysis of the model is derived. The model leads to a rich set of results characterizing the performance of parallel processing systems. We show that the number of jobs concurrently in execution, as well as the number of synchronization variables, grows linearly with the load of the system and strongly depends on the average number of parallel tasks found in the workload. Properties of expected response time of such systems are extensively analyzed and, in particular, we report on some non-obvious response time behavior that arises as a function of the variance of parallelism found in the workload. Based on exact response time analysis, we propose a simple calculation that can be used as a rule of thumb to predict speedups. This can be viewed as a generalization of Amdahl's law that includes queueing effects. This generalization is reformulated for the case when precise workloads cannot be characterized, but rather only the fraction of sequential work and the average number of parallel tasks are assumed to be known.
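The classical law that this rule of thumb generalizes can be sketched in a few lines. The snippet below shows only plain Amdahl's law, stated in terms of the two workload quantities the abstract mentions (fraction of sequential work and average number of parallel tasks); it is an illustrative sketch, not the paper's queueing-aware formula.

```python
def amdahl_speedup(seq_fraction, avg_parallelism):
    """Classic Amdahl's law: a job spends seq_fraction of its work in a
    sequential phase and the remainder spread over avg_parallelism tasks."""
    assert 0.0 <= seq_fraction <= 1.0 and avg_parallelism >= 1
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / avg_parallelism)

# A job that is 10% sequential cannot exceed a speedup of 10,
# no matter how many parallel tasks it exposes.
print(amdahl_speedup(0.1, 100))   # ~9.17
```

The paper's contribution is precisely that this ideal figure must be discounted for queueing effects under load, which the closed form above ignores.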
[1990] Proceedings. Second IEEE Workshop on Future Trends of Distributed Computing Systems
Generic queueing models of parallel systems with K ≥ 2 exponential servers, where jobs may be split into K independent tasks, are considered. The queueing of jobs is distributed if each server has its own queue, and centralized if there is a common queue. The scheduling of jobs is no-splitting if all tasks of a job must run on one processor, and splitting if they can run concurrently on different processors. Exact and approximate expressions for the mean response time, T_{r:K}, of the r-th, r = 1, 2, ..., K, departing task in a job are obtained and compared for four models: Distributed/Splitting (D/S), Distributed/No Splitting (D/NS), Centralized/Splitting (C/S) and Centralized/No Splitting (C/NS). It is shown that T_{r:K} for D/S systems is lower than that of C/S systems for small values of r and medium to high utilizations. The effect of splitting jobs into tasks is studied and it is shown that C/NS systems yield lower values of T_{r:K} than D/S systems only when r and the utilization are quite high. Also, the relative range, defined as (T_{K:K} - T_{1:K}) / T_{K:K}, is shown to be bounded for the systems considered except for D/S systems, which implies the possibility of overflow of the waiting space used for task synchronization. These results are useful in evaluating the performance of parallel algorithms with r-out-of-K operations, in assessing lock holding times in replicated database systems, and in comparing alternative architectures of parallel processing systems.
2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010
1989
This technical report has been reviewed and is approved for publication.
ACM SIGMETRICS Performance …, 1999
The study and design of computer systems requires good models of the workload to which these systems are subjected. Until recently, the data necessary to build these models (observations from production installations) were not available, especially for parallel computers. Instead, most models were based on assumptions and mathematical attributes that facilitate analysis. Recently a number of supercomputer sites have made accounting data available that make it possible to build realistic workload models. It is not clear, however, how to generalize from specific observations to an abstract model of the workload. This paper presents observations of workloads from several parallel supercomputers and discusses modeling issues that have caused problems for researchers in this area.
2006
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. I would like to extend my thanks to all the faculty and staff in the Computer Science Department for providing me with a friendly and productive work environment. In particular, I would like to thank Vanessa Godwin for her kind assistance on many administrative issues, enabling me to focus completely on my research work, and for not letting me forget any University deadlines. To my fellow graduate student colleagues: all of you have made my stay at William and Mary truly memorable. To all my past teachers, I would like to thank them for their excellent teaching and advice during my formative years in China. Their dedication has given me a firm foundation for my further studies in the United States. Finally, I would like to dedicate this dissertation to my dear father Quanxing Zhang, my mother Wenling Li, and my husband Xin Chen, for their unlimited love and understanding during my Ph.D. period. My parents taught me the value of education and worked hard to provide me with the best of it. My brother Yi Zhang has been a constant source of joy and support. I am also fortunate to have Xin by my side during this journey, always standing by me, loving me, understanding me and encouraging me. Without them, I would never have gone this far. 3.16 CCDFs of (a) round trip time, (b) response time of front server, (c) response time of database server when the database server has ACF in its service process. In all experiments MPL is equal to 512.
IEEE Transactions on Software Engineering, 1992
We model a job in a parallel processing system as a sequence of stages, each of which requires a certain integral number of processors for a certain interval of time. With this model we derive the speedup of the system for two cases: systems with no arrivals, and systems with arrivals. In the case with no arrivals, our speedup result is a generalization of Amdahl's Law. We extend the notion of "power" (the simplest definition is power = throughput/response time) as previously applied to general queueing and computer-communication systems to our case of parallel processing systems. With this definition of power we are able to find the optimal system operating point (i.e., the optimal input rate of jobs) and the optimal number of processors to use in the parallel processing system such that power is maximized. Many of the results for the case of arrivals are the same as for the case of no arrivals. A familiar and intuitively pleasing result is obtained, which states that the average number of jobs in the system with arrivals equals unity when power is maximized. We also model a job in a way such that the number of processors required is a continuous variable that changes continuously over time. The same performance indices and parameters studied in the discrete model are evaluated for this continuous model. These continuous results are more easily obtained, are easier to state, and are simpler to interpret than for the discrete model.
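The "power" criterion described above has a well-known closed form in the simplest setting. Assuming a plain M/M/1 queue (an assumption of this sketch, not necessarily the paper's job model), power = throughput / mean response time = λ(μ − λ), which is maximized at λ = μ/2, exactly where the mean number of jobs in the system, ρ/(1 − ρ), equals one:

```python
def mm1_power(lam, mu):
    """Power = throughput / mean response time for an M/M/1 queue.
    Returns 0 outside the stable region 0 < lam < mu."""
    if lam <= 0 or lam >= mu:
        return 0.0
    mean_response = 1.0 / (mu - lam)
    return lam / mean_response          # equals lam * (mu - lam)

mu = 2.0
# Grid search for the input rate that maximizes power.
best_lam = max((i / 1000.0 * mu for i in range(1, 1000)),
               key=lambda lam: mm1_power(lam, mu))
rho = best_lam / mu
mean_jobs = rho / (1.0 - rho)           # mean number in system at the optimum
print(best_lam, mean_jobs)              # 1.0 1.0
```

This reproduces the "intuitively pleasing result" stated in the abstract: at the power-maximizing operating point there is, on average, exactly one job in the system.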
Journal of Mathematical Psychology, 2012
A critical component of how we understand a mental process is given by measuring the effect of varying the workload. The capacity coefficient is a measure on response times for quantifying changes in performance due to workload. Despite its precise mathematical foundation, until now rigorous statistical tests have been lacking. In this paper, we demonstrate statistical properties of the components of the capacity measure and propose a significance test for comparing the capacity coefficient to a baseline measure or two capacity coefficients to each other.
The Journal of Supercomputing
This paper presents a new formulation of the isoefficiency function that can be applied to parallel systems executing balanced or unbalanced workloads, allowing the scalability of such systems to be analyzed in either case. The validity of this new metric is evaluated using synthetic benchmarks. The experimental results show the importance of considering unbalanced workloads when analyzing the scalability of parallel systems.
… Computing, 1997. Proceedings. The Sixth IEEE …, 1996
We develop a workload model based on observations of parallel computers at the San Diego Supercomputer Center and the Cornell Theory Center. This model gives us insight into the performance of strategies for scheduling moldable jobs on space-sharing parallel computers. We find that Adaptive Static Partitioning (ASP), which has been reported to work well for other workloads, does not perform as well as strategies that adapt better to system load. The best of the strategies we consider is one that explicitly reduces allocations when load is high (a variation of Sevcik's A+ strategy (1989)).
2011 IEEE International Conference on High Performance Computing and Communications, 2011
2009
Computer workloads have many attributes. When modeling these workloads it is often difficult to decide which attributes are important, and which can be abstracted away. In many cases, the modeler only includes attributes that are believed to be important, and ignores the rest. We argue, however, that this can lead to impaired workloads and unreliable system evaluations. Using parallel job scheduling as a case study, and daily cycles of activity as the attribute in dispute, we present two schedulers whose simulated performance seems identical without cycles, but then becomes significantly different when daily cycles are included in the workload. We trace this to the ability of one scheduler to prioritize interactive jobs, which leads to implicitly delaying less critical work to nighttime, when it can utilize resources that otherwise would have been left idle. Notably, this was not a design feature of this scheduler, but rather an emergent property that was not anticipated in advance. Figure 6: Characteristics of running and waiting jobs as sampled at different times of the day, with the CREASY and EASY schedulers (250 users). The situation with no daily cycles is shown on the right for comparison.
2010 Second International Conference on Computational Intelligence, Modelling and Simulation, 2010
From a queueing-system perspective, a compute-intensive application can be defined as any application where the arrival rate of processes into the processors' queue is greater than the overall departure rate of processes from the processors. Such a compute-intensive application is ideal for a parallel computer because there are always processes for the processors to execute. This paper therefore aims to use a novel and efficient queueing approach to model some of the performance metrics of parallel processors for compute-intensive applications.
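The definition above is, in queueing terms, the saturation condition of a multi-server queue. A minimal sketch, assuming num_procs identical processors each completing jobs at rate service_rate (the function and parameter names are illustrative, not from the paper):

```python
def is_compute_intensive(arrival_rate, service_rate, num_procs):
    """An application keeps every processor busy indefinitely when jobs
    arrive faster than the processors can collectively complete them,
    i.e. when utilization arrival_rate / (num_procs * service_rate) > 1."""
    return arrival_rate > num_procs * service_rate

print(is_compute_intensive(10.0, 2.0, 4))   # True: 10 jobs/s > 4 * 2 jobs/s
print(is_compute_intensive(5.0, 2.0, 4))    # False: 5 jobs/s < 8 jobs/s
```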
Proceedings of 6th Symposium on the Frontiers of Massively Parallel Computation (Frontiers '96), 1996
The analysis of workload traces from real production parallel machines can aid a wide variety of parallel processing research, providing a realistic basis for experimentation in the management of resources over an entire workload. We analyze a five-month workload trace of an Intel Paragon machine supporting a production parallel workload at the San Diego Supercomputer Center (SDSC), comparing and contrasting it with a similar workload study of an Intel iPSC/860 machine at NASA Ames NAS. Our analysis of workload characteristics takes into account the job scheduling policies of the sites and focuses on characteristics such as job size distribution (job parallelism), resource usage, runtimes, submission patterns, and wait times. Despite fundamental differences in the two machines and their respective usage environments, we observe a number of interesting similarities with respect to job size distribution, system utilization, runtime distribution, and interarrival time distribution. We hope to gain insight into the potential use of workload traces for evaluating resource management policies at supercomputing sites and for providing both real-world job streams and accurate stochastic workload models for use in simulation analysis of resource management policies.
Our goal in this section is to extend the previous performance analysis (based only on mean execution times) to include both fluctuations about mean values and ordering constraints or correlations between different steps and different jobs. This demands additional assumptions concerning the job arrival and execution time statistics, as well as the scheduling policy. We focus on a class of Markovian mathematical models in which the future statistical behavior of the system is determined by knowing only its present state, not its entire operating history. These are called Jackson queueing networks in honor of the pioneering researcher who first studied their behavior. If all possible system states and the associated state transition rates are exhaustively enumerated, the number of states in such models often explodes (into tens of trillions for straightforward applications such as a single-terminal word processing system, and into far greater numbers for more sophisticated applications such as airline reservation systems with hundreds of terminals, tens of communication lines, and upwards of one hundred disk spindles), far exceeding the storage and processing capabilities of the most powerful supercomputers, both available and envisioned: it can be virtually impossible to extract any analytically tractable formulae or numerical approximations of different measures of performance. For many people, for years, the only credible avenues for quantifying system performance appeared to be simulation studies or building a prototype (which is another form of simulation, using the actual system but simulating the load). CHAPTER 6: JACKSON NETWORK ANALYSIS. Figure 6.1. System Block Diagram.
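The state-space explosion described above is easy to quantify in the simplest closed-network case: with N indistinguishable jobs circulating among K queues, each state is an occupancy vector, and the number of such vectors is the stars-and-bars count C(N + K − 1, K − 1). A small sketch of how quickly this grows (a standard combinatorial identity, not a formula from the text):

```python
from math import comb

def closed_network_states(n_jobs, k_queues):
    """Number of ways to distribute n_jobs indistinguishable jobs over
    k_queues queues: the stars-and-bars count C(n + k - 1, k - 1)."""
    return comb(n_jobs + k_queues - 1, k_queues - 1)

print(closed_network_states(2, 2))      # 3 states: (0,2), (1,1), (2,0)
print(closed_network_states(200, 20))   # astronomically many states
```

Product-form results for Jackson networks matter precisely because they let one compute performance measures without enumerating this state space.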
Parallel Computing - Fundamentals and Applications - Proceedings of the International Conference ParCo99, 2000
This paper describes the collection and analysis of usage data for a large (hundreds of nodes) distributed memory machine over a period of 31 months during which 178,000 batch jobs were submitted. A number of data items were collected for each job, including the queue wait times, elapsed (wall clock) execution times, the number of nodes used, as well as the actual job CPU, system and wait times in node hours. This data set represents perhaps the most comprehensive such information on the use of a 100 Gflop parallel machine by a large (over 1,200 users in any given month) and diverse set of users. The results of this analysis provide some insights on how much machines are used and on workload profiles in terms of the number of nodes used, average queue wait times, elapsed and CPU times, as well as their distributions. A longitudinal analysis shows how these have changed over time and how scheduling policies affect user behavior. Some of these results confirm earlier studies, while others reveal new information. That knowledge has been used to develop a new scheduler for the system which has increased system node utilization from the 60% to the 90-95% range if there are sufficient jobs waiting in the queue.
Lecture Notes in Computer Science, 1995
Performance evaluation studies should be an integral part of the design and tuning of parallel applications, whose structure and behavior are the dominating factors. We propose a hierarchical approach to the systematic characterization of the workload of a parallel system, kept as modular and flexible as possible. The methodology is based on three different, but related, layers: the application, the algorithm, and the routine layer. For each of these layers, different characteristics representing functional, sequential, parallel, and quantitative descriptions have been identified. Taking architectural and mapping features into consideration as well, the hierarchical workload characterization can be used for any type of performance study.
Job Scheduling Strategies for Parallel Processing, 2009
In parallel systems, similar jobs tend to arrive within bursty periods. This fact leads to the existence of the locality phenomenon, a persistent similarity between nearby jobs, in real parallel computer workloads. This important phenomenon deserves to be taken into account and used as a characteristic of any workload model. Regrettably, this property has received little if any attention from researchers, and synthetic workloads used for performance evaluation to date often do not have locality. With respect to this research trend, Feitelson has suggested a general repetition approach to model locality in synthetic workloads. Using this approach, Li et al. recently introduced a new method for modeling temporal locality in workload attributes such as run time and memory. However, under the assumption that each job in the synthetic workload requires a single processor, parallelism was not taken into account in their study. In this paper, we propose a new model for parallel computer workloads based on their result. We first improve their model to better control the locality of the run time process, and then model the parallelism. The key idea for modeling the parallelism is to control the cross-correlation between the run time and the number of processors. Experimental results show that not only is the cross-correlation controlled well by our model, but the marginal distribution can also be fitted nicely. Furthermore, the locality feature is also obtained in our model.
IEEE Transactions on Computers, 1989
Lecture Notes in Computer Science, 1998
The evaluation of parallel job schedulers hinges on two things: the use of appropriate metrics, and the use of appropriate workloads on which the scheduler can operate. We argue that the focus should be on on-line open systems, and propose that a standard workload should be used as a benchmark for schedulers. This benchmark will specify distributions of parallelism and runtime, as found by analyzing accounting traces, and also internal structures that create different speedup and synchronization characteristics. As for metrics, we present some problems with slowdown and bounded slowdown that have been proposed recently.