1999, IEEE Micro
Designers face many choices when planning a new high-performance, general-purpose microprocessor. Options include superscalar organization (the ability to dispatch and execute more than one instruction at a time), out-of-order issue of instructions, speculative execution, branch prediction, and cache hierarchy. However, the interaction of multiple microarchitecture features is often counterintuitive, raising questions concerning potential performance benefits and other effects on various workloads. Complex design trade-offs require accurate and timely performance modeling, which in turn requires flexible, efficient environments for exploring microarchitecture processor performance. Workload-driven simulation models are essential for microprocessor design space exploration. A processor model must ideally: capture in sufficient detail those features that are already well defined; make evolving assumptions and approximations in interpreting the desired execution semantics for those features that are not yet well defined; and be validated against the existing specification. These requirements suggest the need for an evolving but reasonably precise specification, so that validating against such a specification provides confidence in the results. Processor model validation normally relies on behavioral timing specifications based on test cases that exercise the microarchitecture. This approach, commonly used in simulation-based functional validation methods, is also useful for performance validation. In this article, we describe a workload-driven simulation environment for PowerPC processor microarchitecture performance exploration. We summarize the environment's properties and give examples of its usage.
2006 IEEE International Symposium on Performance Analysis of Systems and Software
The latest high-performance IBM PowerPC microprocessor, the POWER5 chip, poses challenges for performance model validation. The current state-of-the-art is to use simple hand-coded bandwidth and latency testcases, but these are not comprehensive for processors as complex as the POWER5 chip. Applications and benchmark suites such as SPEC CPU are difficult to set up or take too long to execute on functional models or even on detailed performance models. We present an automatic testcase synthesis methodology to address these concerns. By basing testcase synthesis on the workload characteristics of an application, source code is created that largely represents the performance of the application but executes in a fraction of the runtime. We synthesize representative PowerPC versions of the SPEC2000, STREAM, TPC-C and Java benchmarks, compile and execute them, and obtain an average IPC within 2.4% of the average IPC of the original benchmarks, with many similar average workload characteristics. The synthetic testcases often execute two orders of magnitude faster than the original applications, typically in less than 300K instructions, making performance model validation for today's complex processors feasible.
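As a loose illustration of the synthesis idea (not the paper's actual methodology, which also models dependences, branching, and memory behavior), the Python sketch below emits a small C testcase whose loop body approximates a given instruction-mix profile. The profile values and code templates are invented for illustration.

```python
import random

# Hypothetical workload profile: fraction of dynamic instructions in each class.
profile = {"int_alu": 0.40, "load": 0.25, "store": 0.10, "fp_mul": 0.15, "branch": 0.10}

# One C statement standing in for each instruction class (illustrative only).
TEMPLATES = {
    "int_alu": "    acc += i * 3;",
    "load":    "    acc += buf[i & MASK];",
    "store":   "    buf[(i + 7) & MASK] = acc;",
    "fp_mul":  "    facc *= 1.000001;",
    "branch":  "    if (acc & 1) acc ^= 0x5bd1e995;",
}

def synthesize(profile, body_len=200, seed=0):
    """Emit a small C testcase whose loop body roughly matches the target mix."""
    rng = random.Random(seed)
    classes, weights = zip(*profile.items())
    body = [TEMPLATES[c] for c in rng.choices(classes, weights=weights, k=body_len)]
    return "\n".join([
        "#include <stdint.h>",
        "#define MASK 0xFFFF",
        "volatile int64_t buf[MASK + 1];",
        "int main(void) {",
        "  int64_t acc = 0; double facc = 1.0;",
        f"  for (long i = 0; i < 300000 / {body_len}; i++) {{",
        *body,
        "  }",
        "  return (int)(acc + (int64_t)facc);",
        "}",
    ])

print(synthesize(profile))
```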
1989
The task of choosing the right architecture for a VLSI chip is becoming more difficult as the complexity of microprocessors increases. The process of designing such an architecture, examining the various alternatives and obtaining initial performance estimates requires a flexible and fast simulation environment that addresses those issues. This paper describes a performance simulation environment for the evaluation process of an advanced microprocessor, its use and our experience during definition and design time.
Modern microprocessors achieve high performance through the use of speculative execution and mechanisms to exploit instruction-level parallelism. Performance evaluation of such architectures is generally made using detailed, cycle-by-cycle simulation. Since detailed simulation is slow, the design of recent simulators has been focused on developing fast simulation engines. However, these optimized simulators are difficult to modify or extend. In addition, intensive benchmarking is required to validate simulation performance results. This task consumes a significant amount of time even if very fast simulators are used. This paper presents a novel simulation environment to study high-performance microarchitectures. This environment consists of an extensible simulator for superscalar architectures and a group of utilities to perform benchmarking in parallel. The new simulator developed has features that are not found in other simulators reported in the literature...
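The parallel-benchmarking side of such an environment can be approximated with standard tooling. The sketch below assumes a hypothetical simulator binary and benchmark list (both placeholders, not the tool described above) and simply fans simulator runs out across worker processes.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor, as_completed

# Placeholder simulator binary and benchmark names, for illustration only.
SIM = "./sim-superscalar"
BENCHMARKS = ["gcc", "mcf", "perlbmk", "swim"]

def run_one(bench):
    """Run one benchmark under the simulator and capture its statistics output."""
    out = subprocess.run([SIM, "-config", "base.cfg", f"benchmarks/{bench}.bin"],
                         capture_output=True, text=True, check=True)
    return bench, out.stdout

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_one, b) for b in BENCHMARKS]
        for fut in as_completed(futures):
            bench, stats = fut.result()
            print(f"=== {bench} ===\n{stats}")
```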
Architectural simulators used for microprocessor design study and optimization can require large amounts of computational time and/or resources. In such cases, models can be a fast alternative to lengthy simulations and can help a designer reach a near-optimal system configuration. However, the nonlinear characteristics of a processor system make the modeling task quite challenging. The models need to incorporate not only the microarchitectural parameters but also the dynamic behavior of programs. This paper presents a hybrid (hardware/software), non-linear model for processors. The model provides accurate predictions of processor throughput over a wide range of the design space.
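A minimal sketch of the modeling idea, assuming toy configuration/IPC data (the paper's model is hybrid hardware/software and considerably more sophisticated): fit a regression with nonlinear interaction terms over microarchitectural parameters and a workload feature, then predict throughput for an unseen configuration.

```python
import numpy as np

# Toy training data: each row is (issue_width, rob_size, l1_kb, branch_mpki);
# targets are measured IPCs. All numbers are made up for illustration.
X = np.array([[2, 32, 16, 5.0], [4, 64, 32, 5.0], [4, 128, 32, 2.0],
              [8, 128, 64, 2.0], [8, 256, 64, 1.0], [2, 64, 16, 8.0]], float)
y = np.array([0.9, 1.4, 1.9, 2.3, 2.6, 0.8])

def quadratic_features(X):
    """Augment raw parameters with pairwise products to capture nonlinear interactions."""
    n, d = X.shape
    cross = [X[:, i] * X[:, j] for i in range(d) for j in range(i, d)]
    return np.column_stack([np.ones(n), X] + cross)

# Least squares on the augmented features (rank-deficient with toy data;
# lstsq returns the minimum-norm solution).
Phi = quadratic_features(X)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Predict IPC for an unseen configuration.
cfg = np.array([[4, 96, 32, 3.0]], float)
print("predicted IPC:", quadratic_features(cfg) @ w)
```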
ACM Sigsoft Software Engineering Notes, 2001
We measure the experimental error that arises from the use of non-validated simulators in computer architecture research, with the goal of increasing the rigor of simulation-based studies. We describe the methodology that we used to validate a microprocessor simulator against a Compaq DS-10L workstation, which contains an Alpha 21264 processor. Our evaluation suite consists of a set of 21 microbenchmarks that stress different aspects of the 21264 microarchitecture. Using the microbenchmark suite as the set of workloads, we describe how we reduced our simulator error to an arithmetic mean of 2%, and include details about the specific aspects of the pipeline that required extra care to reduce the error. We show how these low-level optimizations reduce average error from 40% to less than 20% on macrobenchmarks drawn from the SPEC2000 suite. Finally, we examine the degree to which performance optimizations are stable across different simulators, showing that researchers would draw different conclusions, in some cases, if using validated simulators.
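The error metric itself is straightforward. A minimal sketch, assuming made-up cycle counts for a few microbenchmarks (not the paper's measurements), computes the arithmetic-mean percentage error between hardware and simulator.

```python
# Hypothetical cycle counts; real validation would read these from hardware
# performance counters and from the simulator's statistics output.
hardware  = {"dep_chain": 1000, "l1_hit": 1210, "l2_miss": 5400, "br_mispred": 2300}
simulator = {"dep_chain": 1015, "l1_hit": 1180, "l2_miss": 5650, "br_mispred": 2270}

def mean_abs_error_pct(hw, sim):
    """Arithmetic mean of per-testcase absolute percentage error in cycles."""
    errs = [abs(sim[k] - hw[k]) / hw[k] * 100.0 for k in hw]
    return sum(errs) / len(errs)

print(f"mean error: {mean_abs_error_pct(hardware, simulator):.1f}%")
```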
Proceedings of the 22nd annual international symposium on Computer architecture - ISCA '95, 1995
The PowerPC 620™ microprocessor is the most recent and performance-leading member of the PowerPC™ family. The 64-bit PowerPC 620 microprocessor employs a two-phase branch prediction scheme, dynamic renaming for all the register files, distributed multi-entry reservation stations, true out-of-order execution by six execution units, and a completion buffer for ensuring precise exceptions. This paper presents an instruction-level performance evaluation of the 620 microarchitecture. A performance simulator is developed using the VMW (Visualization-based Microarchitecture Workbench) retargetable framework. The VMW-based simulator accurately models the microarchitecture down to the machine cycle level. Extensive trace-driven simulation is performed using the SPEC92 benchmarks. Detailed quantitative analyses of the effectiveness of all key microarchitecture features are presented.
acmbulletin.fiit.stuba.sk
Embedded systems have become an indivisible part of our everyday activities. They are dedicated devices performing a particular job. The computing core of more complicated embedded systems is formed by one or more application-specific instruction set processors. Therefore, powerful tools for processor development are necessary. One of the most important phases is the testing and optimization phase of the processor design and target software. In the testing phase, the most often used tool is a simulator. The simulator can discover bugs in the processor design and target software before the embedded system is realized. This paper describes several advanced methods of processor simulation, which can be used in the different phases of processor development. In the optimization phase, the most frequently used tool is a profiler. The profiler can uncover problematic parts, such as bottlenecks, in the processor design or in the target software. Then, using the results from the profiler, the designer can easily find which parts of the processor design or target software should be modified to get better performance or reduce power consumption. In this paper, two methods of profiling are described. Furthermore, ways to simulate and profile multiprocessor systems are also described. The processor or multiprocessor system is designed using an architecture description language.
Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis - CODES+ISSS '07, 2007
Performance analysis of microprocessors is a critical step in defining the microarchitecture, prior to register-transfer-level (RTL) design. In complex chip multiprocessor systems, including multiple cores, caches and busses, this problem is compounded by complex performance interactions between cores, caches and interconnections, as well as by tight interdependencies between performance, power and physical characteristics of the design (i.e., floorplan). Although there are many point tools for the analysis of performance, or power, or floorplan of complex systems-on-chip (SoCs), there are surprisingly few works on an integrated tool that is capable of analyzing these various system characteristics simultaneously and allows the user to explore different design configurations and their effect on performance, power, size and thermal aspects. This paper describes an integrated tool for early analysis of performance, power, physical and thermal characteristics of multi-core systems. It includes cycle-accurate, transaction-level SystemC-based performance models of POWER processors and system components (i.e., caches, buses). Power models for power computation, physical models for floorplanning, and packaging models for thermal analysis are also included. The tool allows the user to build different systems by selecting components from a library and connecting them together in a visual environment. Using these models, users can simulate and dynamically analyze the performance, power and thermal aspects of multi-core systems.
Workload Characterization of Emerging Computer Applications, 2001
The large input datasets in the SPEC 2000 benchmark suite result in unreasonably long simulation times when using detailed execution-driven simulators for evaluating future computer architecture ideas. To address this problem, we have an ongoing project to reduce the execution times of the SPEC 2000 benchmarks in a quantitatively defensible way. Upon completion of this work, we will have smaller input datasets for several SPEC 2000 benchmarks. The programs using our reduced input datasets will produce execution profiles that accurately reflect the program behavior of the full reference dataset, as measured using standard statistical tests. In the process of reducing and verifying the SPEC 2000 benchmark datasets, we also obtain instruction mix, memory behavior, and instructions-per-cycle characterization information about each benchmark program.
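One way such a statistical check can look in practice, sketched with invented instruction-mix counts rather than the paper's actual tests or data: compare the class distribution of a reduced-input run against the reference run with a chi-squared test.

```python
import numpy as np
from scipy.stats import chisquare

# Sampled instruction-class counts (load, store, branch, integer, FP) from the
# reference and reduced runs; the numbers are invented for illustration.
reference = np.array([2600, 1100, 1300, 4100, 900], float)
reduced   = np.array([2630, 1085, 1310, 4090, 885], float)

# Scale the reference mix to the reduced run's total so the test compares
# proportions rather than absolute counts.
expected = reference / reference.sum() * reduced.sum()
stat, p = chisquare(f_obs=reduced, f_exp=expected)
print(f"chi2={stat:.2f}, p={p:.3f}",
      "-> mixes statistically similar" if p > 0.05 else "-> mixes differ")
```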
2009
The SPEC CPU2006 suite, released in August 2006, is the current industry-standard, CPU-intensive benchmark suite, created from a collection of popular modern workloads. However, these workloads take machine-weeks to months when fed to cycle-accurate simulators and exhibit widely varying behavior even over large time scales. Notably, we do not see simulation-based papers using SPEC CPU2006 even 1.5 years after its release. A well-known technique to address this problem is the use of simulation points. We have generated the simulation points for SPEC CPU2006 and made them available at [3]. We also report the accuracies of these simulation points based on CPI, branch mispredictions, and cache and TLB miss ratios by comparing against full runs for a subset of the benchmarks. Simulation points have so far been used only for cache, branch, and CPI studies; this is the first attempt to validate them for TLB studies. They have also been found to be equally representative in depicting the TLB characteristics. Using the generated simulation points, we provide an analysis of the behavior of the workloads in the suite for different branch predictor and cache configurations and report the optimally performing configurations. The simulations for the different TLB configurations revealed that using large page sizes significantly reduces translation misses and improves the overall CPI of these modern workloads.
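Once simulation points and their weights are available, whole-program metrics are estimated as weighted sums. The sketch below, with invented weights, per-point CPIs, and an assumed full-run CPI, shows the aggregation and the comparison against a full detailed run.

```python
# Hypothetical SimPoint output for one benchmark: (weight, CPI over that
# simulation point). Weights sum to 1; all numbers are illustrative only.
simpoints = [(0.35, 1.42), (0.25, 0.97), (0.20, 1.88), (0.12, 1.10), (0.08, 2.35)]

# Whole-program CPI estimate is the weight-averaged per-point CPI.
estimated_cpi = sum(w * cpi for w, cpi in simpoints)

full_run_cpi = 1.39   # assumed value from a complete detailed simulation
error_pct = abs(estimated_cpi - full_run_cpi) / full_run_cpi * 100.0
print(f"estimated CPI = {estimated_cpi:.3f}, error vs. full run = {error_pct:.1f}%")
```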
ACM SIGARCH Computer Architecture News, 2004
Designing a new microprocessor is extremely time-consuming. One of the contributing reasons is that computer designers rely heavily on detailed architectural simulations, which are themselves very time-consuming. Recent work has focused on statistical simulation to address this issue. The basic idea of statistical simulation is to measure characteristics during program execution, generate a synthetic trace with those characteristics, and then simulate the synthetic trace. The statistically generated synthetic trace is orders of magnitude smaller than the original program sequence and hence results in significantly faster simulation.
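A minimal sketch of the trace-generation step, assuming a toy set of measured statistics (real statistical simulation profiles capture much more, such as branch behavior and per-opcode latencies): sample instruction records from the measured distributions to build a short synthetic trace.

```python
import random

# Illustrative program statistics of the kind statistical simulation collects.
stats = {
    "mix": {"int": 0.45, "load": 0.25, "store": 0.10, "fp": 0.12, "branch": 0.08},
    "dep_distance": {1: 0.30, 2: 0.25, 4: 0.20, 8: 0.15, 16: 0.10},  # producer distance
    "l1_load_hit_rate": 0.93,
}

def synthetic_trace(stats, n, seed=1):
    """Sample a synthetic instruction trace matching the measured statistics."""
    rng = random.Random(seed)
    ops, op_w = zip(*stats["mix"].items())
    dists, d_w = zip(*stats["dep_distance"].items())
    trace = []
    for i in range(n):
        op = rng.choices(ops, weights=op_w)[0]
        # Producer index; may be negative near the start (no producer in the trace).
        rec = {"op": op, "dep": i - rng.choices(dists, weights=d_w)[0]}
        if op == "load":
            rec["l1_hit"] = rng.random() < stats["l1_load_hit_rate"]
        trace.append(rec)
    return trace

trace = synthetic_trace(stats, 100_000)   # far shorter than the original program
print(trace[:3])
```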
ACM SIGARCH Computer Architecture News, 2005
The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers. We leverage an existing full-system functional simulation infrastructure (Simics [14]) as the basis around which to build a set of timing simulator modules for modeling the timing of the memory system and microprocessors. This simulator infrastructure enables us to run architectural experiments using a suite of scaled-down commercial workloads. To enable other researchers to more easily perform such research, we have released these timing simulator modules as the Multifacet General Execution-driven Multiprocessor Simulator (GEMS) Toolset, release 1.0, under the GNU GPL.
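The functional/timing split can be illustrated in miniature; this is a conceptual sketch only, not the GEMS or Simics API. A stand-in functional model supplies a reference stream, and a separate timing module charges latencies against simple cache state.

```python
# Assumed latencies in cycles; real timing modules model coherence protocols,
# banks, and interconnect contention in far greater detail.
L1_LAT, L2_LAT, MEM_LAT = 2, 12, 200

class TimingMemory:
    """Toy two-level cache timing model keyed by cache-line number."""
    def __init__(self):
        self.l1, self.l2 = set(), set()

    def access(self, line):
        # Charge a latency for one reference and update the simple cache state.
        if line in self.l1:
            return L1_LAT
        if line in self.l2:
            self.l1.add(line)
            return L2_LAT
        self.l1.add(line)
        self.l2.add(line)
        return MEM_LAT

def functional_model():
    """Stand-in for the functional simulator: yields byte addresses of loads."""
    for i in range(10_000):
        yield (i * 64) % 4096 if i % 3 else i * 64  # some reuse, some streaming

mem = TimingMemory()
cycles = sum(mem.access(addr // 64) for addr in functional_model())
print("memory stall cycles:", cycles)
```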
ACM SIGMETRICS Performance Evaluation Review, 2000
Generating an accurate estimate of the performance of a program on a given system is important to a large number of people. Computer architects, compiler writers, and developers all need insight into a machine's performance. There are a number of performance estimation techniques in use, from profile-based approaches to full machine simulation. This paper discusses a profile-based performance estimation technique that uses a lightweight instrumentation phase running in time proportional to the number of dynamic instructions, followed by an analysis phase running in time roughly proportional to the number of static instructions. This technique accurately predicts the performance of the core pipeline of a detailed out-of-order issue processor model while scheduling far fewer instructions than does full simulation. The difference between the predicted execution time and the time obtained from full simulation is only a few percent.
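The two-phase structure can be sketched as follows, with invented basic-block counts and per-block cycle estimates standing in for real instrumentation and pipeline scheduling: the profile is gathered once over the dynamic execution, and the analysis touches each static block only once.

```python
# Phase 1 (lightweight instrumentation): per-basic-block execution counts,
# gathered in time proportional to the dynamic instruction count.
# Counts and block contents below are invented for illustration.
block_counts = {"B0": 1, "B1": 500_000, "B2": 480_000, "B3": 20_000, "B4": 1}

# Static description of each block: (instruction count, estimated cycles per visit).
# A real analysis would schedule each block's instructions on a pipeline model once.
static_blocks = {
    "B0": (12, 14.0), "B1": (8, 6.5), "B2": (15, 11.0), "B3": (30, 42.0), "B4": (5, 7.0),
}

# Phase 2 (analysis): one pass over the static blocks, scaled by the profile.
total_cycles = sum(static_blocks[b][1] * n for b, n in block_counts.items())
total_insts  = sum(static_blocks[b][0] * n for b, n in block_counts.items())
print(f"estimated cycles = {total_cycles:,.0f}, CPI = {total_cycles / total_insts:.2f}")
```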
International Conference on Computer Design, 2008
Increasing sizes of benchmarks make detailed simulation an extremely time-consuming process. Statistical techniques such as the SimPoint methodology have been proposed in order to address this problem during the initial design phase. The SimPoint methodology attempts to identify repetitive, long, large-grain phases in programs and predict the performance of the architecture based on its aggregate performance on the individual phases. This study attempts to compare the accuracy of the SimPoint methodology for the SPEC CPU 2006 benchmark suite with that of SPEC CPU 2000 and to study the large-grain phases in the two benchmark suites using the SimPoint methodology. We find that there has not been a significant increase in the number of simulation points required to accurately predict the behavior of the programs in SPEC CPU 2006, despite its significantly larger data footprint and dynamic instruction count. We also find that the programs in both benchmark suites have similar characteristics in terms of the number of phases that contribute significantly towards overall behavior, further emphasizing the similarity between the two benchmark suites with respect to the number of simulation points required for similar accuracies.
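The phase-identification core of the methodology can be approximated as clustering of basic-block vectors. The sketch below runs over synthetic BBVs with two obvious phases, picks one representative interval per cluster, and weights it by cluster size; the real SimPoint flow adds random projection, BIC-based selection of the number of clusters, and more.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy basic-block vectors: one row per execution interval, one column per static
# basic block, holding that block's share of the interval's execution.
# Two artificial phases touch disjoint sets of blocks; real BBVs come from profiling.
phase_a = np.hstack([rng.dirichlet(np.ones(25), size=60), np.zeros((60, 25))])
phase_b = np.hstack([np.zeros((40, 25)), rng.dirichlet(np.ones(25), size=40)])
bbvs = np.vstack([phase_a, phase_b])

k = 2
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(bbvs)

# One representative interval (simulation point) per cluster, weighted by cluster size.
for c in range(k):
    members = np.where(labels == c)[0]
    centroid = bbvs[members].mean(axis=0)
    rep = members[np.argmin(np.linalg.norm(bbvs[members] - centroid, axis=1))]
    print(f"cluster {c}: simulate interval {rep}, weight {len(members) / len(bbvs):.2f}")
```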
2008 IEEE International Symposium on Workload Characterization, 2008
As multiprocessors become mainstream, techniques to address efficient simulation of multi-threaded workloads are needed. Multi-threaded simulation presents a new challenge: non-determinism across simulations for different architecture configurations. If the execution paths between two simulation runs of the same benchmark with the same input are too different, the simulation results cannot be used to compare the configurations. In this paper we focus on a simulation technique to efficiently collect simulation checkpoints for multi-threaded workloads, and to compare simulation runs addressing this non-determinism problem. We focus on user-level simulation of multi-threaded workloads for multiprocessor architectures. We present an approach, based on binary instrumentation, to collect checkpoints for simulation. Our checkpoints allow reproducible execution of the samples across different architecture configurations by controlling the sources of nondeterminism during simulation. This results in stalls that would not naturally occur in execution. We propose techniques that allow us to accurately compare performance across architecture configurations in the presence of these stalls.
Proceedings of the Rapid Simulation and Performance Evaluation: Methods and Tools on - RAPIDO '19
Given the constantly growing complexity of multi-core architectures, Design Space Exploration (DSE) tools play an important role to evaluate different design options. In this paper, we present a DSE toolset targeting massively parallelized HW/SW architectures with a high degree of flexibility in order to successfully simulate multi-core, multi-node platforms. Our DSE tools provide a rapid and simple-to-use workflow to easily retrieve and analyze the key metrics and eventually evaluate the design. We examine the DSE toolset and methodology while performing several simulations of a general-purpose 1K-core architecture and evaluate not only standard metrics like the L2 cache miss rates, but also operating system activity and its impact. We leverage the knowledge gained in our methodology to develop and evaluate a novel dataflow execution model named "DataFlow-Threads" (DF-Threads). We validated the outcomes of the simulator against an equivalent FPGA-based design.
2011
As the evolution of multi-core, multi-threaded processors continues, the complexity demanded to perform an extensive trade-off analysis increases proportionally. Cycle-accurate or trace-driven simulators are too slow to execute the large number of experiments required to obtain indicative results. To achieve a thorough analysis of the system, software benchmarks or traces are required. In many cases when an analysis is needed most, during the earlier stages of the processor design, benchmarks or traces are not available. Analytical models overcome these limitations but do not provide the fine-grain details needed for a deep analysis of these architectures.
2005
M-Sim is a multi-threaded microarchitectural simulation environment with a detailed cycle-accurate model of the key pipeline structures. M-Sim extends the SimpleScalar 3.0d toolset with accurate models of the pipeline structures, including explicit register renaming, and support for the concurrent execution of multiple threads according to the Simultaneous Multithreading (SMT) model. For power estimation, M-Sim includes the Wattch framework as applied to SimpleScalar. This technical report provides an overview of M-Sim, including a detailed description of the simulated processor as well as instructions for the installation and use of the M-Sim environment. The description focuses only on the changes made with respect to SimpleScalar.
Lecture Notes in Computer Science, 2006
In this paper we present sim-async, an architectural simulator able to model a 64-bit asynchronous superscalar microarchitecture. The aim of this tool is to serve designers in the study of different architectural proposals for asynchronous processors. Sim-async models the data-dependent timing of the processor modules by using distribution functions that describe the probability of a given delay being spent on a computation. This idea of characterizing the timing of the modules at the architectural level of abstraction using distribution functions is introduced for the first time with this work. In addition, sim-async models the delays of all the relevant hardware involved in the asynchronous communication between stages. To tackle the development of sim-async we have modified the source code of SimpleScalar by substituting the simulator's core with our own execution engine, which provides the functionality of a parameterizable microarchitecture adapted to the Alpha ISA. The correctness of sim-async was checked by comparing the outputs of the SPEC2000 benchmarks with SimpleScalar executions, and the asynchronous behavior was successfully tested in relation to a synchronous configuration of sim-async.
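The distribution-function idea can be illustrated with a toy model; the delay values below are invented, and pipelining/overlap between stages is ignored for brevity. Each module's delay for an operation is drawn from its own discrete distribution.

```python
import random

# Illustrative per-module delay distributions as (delay in ps, probability) pairs.
# The real simulator characterizes actual asynchronous modules; these are made up.
DELAY_DIST = {
    "decode": [(120, 0.70), (160, 0.20), (220, 0.10)],
    "alu":    [(200, 0.50), (310, 0.30), (450, 0.20)],
    "commit": [(90, 0.90), (140, 0.10)],
}

def sample_delay(module, rng):
    """Draw one data-dependent delay for a module from its distribution function."""
    delays, probs = zip(*DELAY_DIST[module])
    return rng.choices(delays, weights=probs)[0]

def simulate(n_ops, seed=0):
    """Accumulate completion time of n_ops flowing through three stages (no overlap)."""
    rng = random.Random(seed)
    t = 0
    for _ in range(n_ops):
        t += sum(sample_delay(m, rng) for m in ("decode", "alu", "commit"))
    return t

print(f"total time for 10k ops: {simulate(10_000) / 1e6:.2f} us")
```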