1999, ACM SIGMETRICS Performance …
The study and design of computer systems requires good models of the workload to which these systems are subjected. Until recently, the data necessary to build these models, observations from production installations, were not available, especially for parallel computers. Instead, most models were based on assumptions and mathematical attributes that facilitate analysis. Recently a number of supercomputer sites have made accounting data available that make it possible to build realistic workload models. It is not clear, however, how to generalize from specific observations to an abstract model of the workload. This paper presents observations of workloads from several parallel supercomputers and discusses modeling issues that have caused problems for researchers in this area.
Proceedings of 6th Symposium on the Frontiers of Massively Parallel Computation (Frontiers '96), 1996
The analysis of workload traces from real production parallel machines can aid a wide variety of parallel processing research, providing a realistic basis for experimentation in the management of resources over an entire workload. We analyze a five-month workload trace of an Intel Paragon machine supporting a production parallel workload at the San Diego Supercomputer Center (SDSC), comparing and contrasting it with a similar workload study of an Intel iPSC/860 machine at NASA Ames NAS. Our analysis of workload characteristics takes into account the job scheduling policies of the sites and focuses on characteristics such as job size distribution (job parallelism), resource usage, runtimes, submission patterns, and wait times. Despite fundamental differences in the two machines and their respective usage environments, we observe a number of interesting similarities with respect to job size distribution, system utilization, runtime distribution, and interarrival time distribution. We hope to gain insight into the potential use of workload traces for evaluating resource management policies at supercomputing sites and for providing both real-world job streams and accurate stochastic workload models for use in simulation analysis of resource management policies.
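As a sketch of the kind of trace analysis described above, the following Python fragment computes job size, runtime, and interarrival-time distributions from a job log; the file name and column names are hypothetical stand-ins for the actual trace format.

```python
import pandas as pd

# Hypothetical trace with columns: submit_time (s), run_time (s), nodes.
df = pd.read_csv("paragon_trace.csv").sort_values("submit_time")

# Job size distribution (fraction of jobs at each node count).
print(df["nodes"].value_counts(normalize=True).sort_index())

# Runtime distribution summarized by quantiles.
print(df["run_time"].quantile([0.5, 0.9, 0.99]))

# Interarrival times between successive job submissions.
print(df["submit_time"].diff().dropna().describe())
```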
2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010
1996
The methodologies and techniques for characterizing the workload of parallel systems are related to the performance of such systems. Parallel systems are characterized by a large number of cooperating/communicating processors. Their performance is mainly influenced by the ...
Lecture Notes in Computer Science, 1995
Performance evaluation studies are to be an integral part of the design and tuning of parallel applications. Their structure and their behavior are the dominating factors. We propose a hierarchical approach to the systematic characterization of the workload of a parallel system, kept as modular and flexible as possible. The methodology is based on three different, but related, layers: the application, the algorithm, and the routine layer. For each of these layers, different characteristics representing functional, sequential, parallel, and quantitative descriptions have been identified. Taking architectural and mapping features into consideration as well, the hierarchical workload characterization can be used for any type of performance study.
Journal of Systems Architecture, 2000
Experimental design of parallel computers calls for quantifiable methods to compare and evaluate the requirements of different workloads within an application domain. Such methods can help establish the basis for scientific design of parallel computers driven by application needs, to optimize performance relative to cost. In this paper, a framework is presented for representing and comparing workloads, based on the way they would exercise parallel machines. This workload characterization is derived from the parallel instruction centroid and parallel workload similarity. The centroid is a workload approximation that captures the type and amount of parallel work generated by the workload on average. The centroid is a simple measure that aggregates average parallelism, instruction mix, and critical path length. When combined with abstracted information about communication requirements, the result is a powerful tool for understanding the requirements of workloads and their potential performance on target machines. The workload similarity is based on measuring the normalized Euclidean distance (ned) between workload centroids. It will be shown that this workload representation method outperforms comparable ones in accuracy, as well as in time and space requirements. Analysis of the NAS Parallel Benchmark workloads and their performance will be presented to demonstrate some of the applications and insight provided by this framework. This includes the use of the proposed framework for predicting the performance of real-life workloads on target machines, with good accuracy.
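The following is a minimal sketch of the centroid-and-distance idea, assuming a hypothetical feature vector of [average parallelism, integer fraction, float fraction, critical path length]; the normalization scheme is an illustrative choice, not necessarily the paper's.

```python
import numpy as np

def centroid(instr_vectors):
    """Average the per-interval feature vectors of a workload."""
    return np.mean(np.asarray(instr_vectors, dtype=float), axis=0)

def ned(c1, c2, scale):
    """Normalized Euclidean distance between two workload centroids."""
    return np.linalg.norm((np.asarray(c1) - np.asarray(c2)) / np.asarray(scale))

# Two hypothetical workloads, each a list of sampled feature vectors.
wA = centroid([[64, 0.50, 0.30, 1.0e6], [72, 0.55, 0.25, 1.2e6]])
wB = centroid([[16, 0.20, 0.70, 4.0e6], [20, 0.25, 0.65, 3.8e6]])

scale = np.maximum(np.abs(wA), np.abs(wB))  # per-dimension normalization
print("ned(A, B) =", ned(wA, wB, scale))
```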
2009
Computer workloads have many attributes. When modeling these workloads it is often difficult to decide which attributes are important, and which can be abstracted away. In many cases, the modeler only includes attributes that are believed to be important, and ignores the rest. We argue, however, that this can lead to impaired workloads and unreliable system evaluations. Using parallel job scheduling as a case study, and daily cycles of activity as the attribute in dispute, we present two schedulers whose simulated performance seems identical without cycles, but then becomes significantly different when daily cycles are included in the workload. We trace this to the ability of one scheduler to prioritize interactive jobs, which leads to implicitly delaying less critical work to nighttime, when it can utilize resources that otherwise would have been left idle. Notably, this was not a design feature of this scheduler, but rather an emergent property that was not anticipated in advance. [Figure 6: Characteristics of running and waiting jobs as sampled at different times of the day, with the CREASY and EASY schedulers (250 users); the situation with no daily cycles is shown on the right for comparison.]
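One common way to add the daily cycle the authors highlight is a nonhomogeneous Poisson arrival process; the sketch below uses thinning with an illustrative sinusoidal rate, which is an assumption rather than the paper's workload model.

```python
import numpy as np

rng = np.random.default_rng(0)

def daily_rate(t_hours, base=2.0, amp=1.5):
    """Jobs per hour, peaking around 14:00 (illustrative constants)."""
    return base + amp * np.cos(2 * np.pi * ((t_hours % 24) - 14) / 24)

def arrivals(horizon_hours, lam_max=3.5):
    """Thinning: sample at the max rate, keep with probability rate/max."""
    t, times = 0.0, []
    while t < horizon_hours:
        t += rng.exponential(1.0 / lam_max)
        if rng.random() < daily_rate(t) / lam_max:
            times.append(t)
    return times

print(len(arrivals(7 * 24)), "job arrivals in one simulated week")
```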
2007
Workload characterization is an important technique that helps us understand the performance of parallel applications and the demands they place on the system. Each application run is profiled using instrumentation at the MPI library level. Characterizing the performance of the MPI library based on the sizes of messages helps us understand how the performance of an application is affected by messages of different sizes. Partitioning the time spent in MPI routines based on the type of MPI operation and the message size involved requires a two-level mapping of performance data. This paper describes how performance mapping is implemented in the TAU performance system to support workload characterization.
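As an illustration of the two-level mapping described above (not TAU's actual implementation), the sketch below attributes time spent in MPI calls to (operation, message-size bin) pairs; the event records are hypothetical.

```python
import math
from collections import defaultdict

profile = defaultdict(float)  # (mpi_op, size_bin) -> accumulated seconds

def record(op, nbytes, seconds):
    # First level: MPI operation; second level: log2 message-size bin.
    size_bin = 0 if nbytes == 0 else int(math.log2(nbytes)) + 1
    profile[(op, size_bin)] += seconds

record("MPI_Send", 1024, 0.002)
record("MPI_Send", 1 << 20, 0.150)
record("MPI_Allreduce", 64, 0.010)

for (op, size_bin), t in sorted(profile.items()):
    print(f"{op:14s} size bin {size_bin:2d}: {t:.3f} s")
```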
2017
In the K computer, the job manager and peripheral tools collect various metrics and store them in databases. Part of the metrics is provided directly to users by the job manager, and some of the metrics are summarized and reported by administrators. However, most of the data are not fully exploited for analysis to help inform our operations, because the amount of data stored in the databases is growing constantly and has become too large to handle easily. In this study, to get a picture of workload behavior in terms of arithmetic, memory-access, and I/O intensity, we attempt to classify the workloads using modern statistical methods. First, before classifying the workloads, we analyze metric behavior in a preliminary study using PCA and select the features to be used in classification. After that, we partition the workloads into several groups using the k-means and DBSCAN clustering methods on 10,000 workload records sampled from nearly one million records i...
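A sketch of this PCA-then-cluster workflow using scikit-learn; the synthetic records and all parameter choices (component count, k, DBSCAN radius) are stand-in assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(1)
# Stand-in for 10,000 sampled records of metrics such as FLOP rate,
# memory bandwidth, and I/O volume.
X = rng.lognormal(size=(10_000, 6))

# Standardize, reduce to a few principal components, then cluster.
Z = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)
dbscan_labels = DBSCAN(eps=0.5, min_samples=20).fit_predict(Z)

print("k-means cluster sizes:", np.bincount(kmeans_labels))
print("DBSCAN clusters (label -1 = noise):", np.unique(dbscan_labels))
```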
2010 IEEE International Conference on Computer Design, 2010
This work contributes to throughput calculation for real-time multiprocessor applications experiencing dynamic workload variations. We focus on a method to predict the system throughput when processing an arbitrarily long data frame given the meta-characteristics of the workload in that frame. This is useful for different purposes, such as resource allocation or dynamic voltage scaling in embedded systems.
Characterizing the I/O requirements of parallel applications that manipulate huge amounts of data, such as scientific codes, is critical to the achievement of good application performance and to the effective use of parallel systems. In this paper we formulate a compositional stochastic model of the behavior of I/O-intensive scientific applications, which can be applied at various granularity levels of characterization. Based on the observation of the interaction of CPU and I/O activity in a set of scientific codes, we exercise the stochastic model. The model is in excellent agreement with experimental data for the set of examined codes. For the model to be used for performance prediction, we also propose a set of functional forms for forecasting the scalability of the computation and I/O components of an application. The complexity of the functional form adopted affects the accuracy of performance prediction.
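As a toy instance of such a compositional model (not the paper's fitted distributions), the sketch below simulates an application as alternating CPU and I/O phases with exponentially distributed lengths, an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(n_phases, mean_cpu_s=1.0, mean_io_s=0.3):
    """Alternate n_phases CPU bursts and I/O bursts; return totals."""
    cpu = rng.exponential(mean_cpu_s, n_phases)
    io = rng.exponential(mean_io_s, n_phases)
    total = cpu.sum() + io.sum()
    return total, io.sum() / total

total_s, io_fraction = simulate(1000)
print(f"simulated runtime {total_s:.1f} s, I/O fraction {io_fraction:.1%}")
```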
ECMS 2014 Proceedings edited by: Flaminio Squazzoni, Fabio Baronio, Claudia Archetti, Marco Castellani, 2014
Multicore architectures are now available for a wide range of high performance applications, ranging from embedded systems to large scale servers deployed in cloud environments. Multicore architectures are usually subject to two conflicting goals: obtaining full utilization of the cores while achieving given performance objectives, such as throughput, response time or reduced energy consumption. Moreover, there is a strong interdependence between the software characteristics of the applications and the underlying CPU architecture. In this scenario, simulation and analytical techniques can provide solid tools to properly design the considered class of systems; however, properly characterizing the workload of multithreaded applications in multicore environments is not an easy task, and is thus a hot research topic. In this paper we present several models, of increasing complexity, that can characterize multithreaded applications running on multicore architectures.
… Computing, 1997. Proceedings. The Sixth IEEE …, 1996
We develop a workload model based on observations of parallel computers at the San Diego Supercomputer Center and the Cornell Theory Center. This model gives us insight into the performance of strategies for scheduling moldable jobs on space-sharing parallel computers. We find that Adaptive Static Partitioning (ASP), which has been reported to work well for other workloads, does not perform as well as strategies that adapt better to system load. The best of the strategies we consider is one that explicitly reduces allocations when load is high (a variation of Sevcik's A+ strategy (1989)).
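The sketch below shows one load-adaptive allocation rule in the spirit of the strategies compared above; the shrink formula is an illustrative assumption, not Sevcik's A+ strategy itself.

```python
def allocate(requested, free_nodes, queued_jobs):
    """Give a moldable job fewer nodes as the waiting queue grows."""
    load_factor = 1.0 / (1 + queued_jobs)      # 1.0 when idle, -> 0 under load
    share = max(1, int(free_nodes * load_factor))
    return min(requested, share)

print(allocate(requested=64, free_nodes=128, queued_jobs=0))  # light load: 64
print(allocate(requested=64, free_nodes=128, queued_jobs=7))  # heavy load: 16
```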
International Journal of Embedded and Real-Time Communication Systems, 2013
Modern nomadic mobile devices, for example internet tablets and high-end mobile phones, support diverse distributed and stand-alone applications that a decade ago required dedicated single devices. Furthermore, the complex heterogeneous platforms supporting these applications contain multi-core processors, hardware accelerators and IP cores, and all these components can be integrated into a single integrated circuit (chip). The high complexity of both the platform and the applications makes the design space very complex due to the availability of several alternatives. Therefore the system designer must be able to quickly evaluate the performance of different application architectures and implementations on potential platforms. The most popular technique employed nowadays is termed system-level performance evaluation, which uses abstract workload and platform capacity models. The platform capacity models and application workload models reside at a higher abstraction level. The...
Parallel Computing - Fundamentals and Applications - Proceedings of the International Conference ParCo99, 2000
This paper describes the collection and analysis of usage data for a large (hundreds of nodes) distributed memory machine over a period of 31 months during which 178,000 batch jobs were submitted. A number of data items were collected for each job, including the queue wait times, elapsed (wall clock) execution times, the number of nodes used, as well as the actual job CPU, system and wait times in node hours. This data set represents perhaps the most comprehensive such information on the use of a 100 Gflop parallel machine by a large (over 1,200 users in any given month) and diverse set of users. The results of this analysis provide some insights on how much such machines are used and on workload profiles in terms of the number of nodes used, average queue wait times, elapsed and CPU times, as well as their distributions. A longitudinal analysis shows how these have changed over time and how scheduling policies affect user behavior. Some of these results confirm earlier studies, while others reveal new information. That knowledge has been used to develop a new scheduler for the system, which has increased system node utilization from around 60% to the 90-95% range when there are sufficient jobs waiting in the queue.
In computing, the workload is the amount of processing that the computer has been given to do at a given time. Workloads are of two types: synthetic and real. Real workloads are generally not publicly available, though some traces, such as the Google trace, the World Cup 98 trace, and the ClarkNet trace, can be found on the internet. Synthetic workloads are generated for our experiments. The real trace used here is the Google cluster data, which consists of two workloads: the first covers a 7-hour period with a set of tasks, and the second covers a 30-day period of work. Each dataset is packaged as a set of one or more files, each provided in compressed Comma-Separated Values (CSV) format. In this paper we analyze the Google cluster data version 2 trace in IBM SPSS Statistics and generate a synthetic workload with the same characteristics and behavior as the real workload, based on formulae derived using linear regression in IBM SPSS Statistics.
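A sketch of the regression-and-resample step described above, using scikit-learn in place of IBM SPSS Statistics; the file name and column names are hypothetical stand-ins for the Google cluster data fields.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("google_trace_v2_sample.csv")  # hypothetical extract
X, y = df[["cpu_request", "mem_request"]], df["run_time"]

# Fit a linear model on the real trace, as the paper does with regression.
model = LinearRegression().fit(X, y)
resid_std = float(np.std(y - model.predict(X)))

# Synthesize: resample resource requests, predict runtimes, add fitted noise.
rng = np.random.default_rng(3)
sample = df.sample(n=1000, replace=True, random_state=3)
synthetic_runtime = (model.predict(sample[["cpu_request", "mem_request"]])
                     + rng.normal(0.0, resid_std, 1000))
print(synthetic_runtime[:5])
```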
The Mermaid simulation environment facilitates the performance evaluation of a wide range of design options in MIMD multicomputer architectures. Because the Mermaid architectural simulators are driven by events that are more abstract than real machine instructions, workloads must explicitly be modelled. In this paper, we provide an overview of the workload modelling techniques used within Mermaid. This includes the tracing of real programs and the stochastic modelling of application behaviour. Moreover, we present several validation and performance results indicating that our simulation methodology obtains good accuracy while being fairly efficient.
Job Scheduling Strategies for Parallel Processing, 2009
In parallel systems, similar jobs tend to arrive within bursty periods. This fact leads to the existence of the locality phenomenon, a persistent similarity between nearby jobs, in real parallel computer workloads. This important phenomenon deserves to be taken into account and used as a characteristic of any workload model. Regrettably, this property has received little if any attention from researchers, and synthetic workloads used for performance evaluation to date often do not have locality. With respect to this research trend, Feitelson has suggested a general repetition approach to model locality in synthetic workloads. Using this approach, Li et al. recently introduced a new method for modeling temporal locality in workload attributes such as run time and memory. However, under the assumption that each job in the synthetic workload requires a single processor, parallelism has not been taken into account in their study. In this paper, we propose a new model for parallel computer workloads based on their result. In our research, we first improve their model to better control the locality of the run time process and then model the parallelism. The key idea for modeling the parallelism is to control the cross-correlation between the run time and the number of processors. Experimental results show that not only is the cross-correlation controlled well by our model, but also the marginal distribution can be fitted nicely. Furthermore, the locality feature is also obtained in our model.
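One standard way to generate job attributes with a target cross-correlation is a Gaussian copula; the sketch below is a simplified illustration of the goal, with lognormal run times and power-of-two processor counts as assumed marginals, not the model proposed in the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
rho, n = 0.5, 10_000  # target correlation in the Gaussian domain

z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)

run_time = np.exp(2.0 + 1.5 * z[:, 0])               # lognormal marginal
u = norm.cdf(z[:, 1])                                  # uniform in (0, 1)
procs = 2 ** np.clip((u * 8).astype(int), 0, 7)        # 1..128 processors

print("empirical corr(log run time, log2 procs):",
      np.corrcoef(np.log(run_time), np.log2(procs))[0, 1])
```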
2010 IEEE International Conference on Computer Design, 2010
A new generation of high-performance engines now combines graphics-oriented parallel processors with a cache architecture. In order to meet this new trend, new highly parallel workloads are being developed. However, it is often difficult to predict how a given application would perform on a given architecture. This paper provides a new model capturing the behavior of such parallel workloads on different multi-core architectures. Specifically, we provide a simple analytical model which, for a given application, describes its performance and power as a function of the number of threads it runs in parallel, on a range of architectures. We use our model (backed by simulations) to study both synthetic workloads and real ones from the PARSEC suite. Our findings recognize distinctly different behavior patterns for different application families and architectures.
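A toy model in the same spirit, performance and power as functions of thread count, is sketched below; the Amdahl-style speedup curve and linear active-core power are illustrative assumptions, not the paper's model.

```python
def speedup(threads, serial_frac=0.05):
    """Amdahl-style speedup for a workload with a serial fraction."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / threads)

def power_watts(threads, idle=10.0, per_core=4.0):
    """Hypothetical linear power model: idle power plus active cores."""
    return idle + per_core * threads

for t in (1, 2, 4, 8, 16):
    print(f"{t:2d} threads: speedup {speedup(t):5.2f}, "
          f"perf/W {speedup(t) / power_watts(t):.3f}")
```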