Multi2Sim: A Simulation Framework for CPU-GPU Computing
† Electrical and Computer Engineering Dept., Northeastern University, 360 Huntington Ave., Boston, MA 02115
‡ Computer and Information Science Dept., University of Mississippi, P. O. Box 1848, University, MS 38677
ABSTRACT

Accurate simulation is essential for the proper design and evaluation of any computing platform. With the current move toward the CPU-GPU heterogeneous computing era, researchers need a simulation framework that can model both kinds of computing devices and their interaction. In this paper, we present Multi2Sim, an open-source, modular, and fully configurable toolset that enables ISA-level simulation of an x86 CPU and an AMD Evergreen GPU. Focusing on a model of the AMD Radeon 5870 GPU, we address program emulation correctness, as well as architectural simulation accuracy, using AMD's OpenCL benchmark suite. Simulation capabilities are demonstrated with a preliminary architectural exploration study and workload characterization examples. The project source code, benchmark packages, and a detailed user's guide are publicly available at www.multi2sim.org.

Categories and Subject Descriptors

C.1.2 [Computer Systems Organization]: Processor Architectures—Multiple Data Stream Architectures; B.1.2 [Hardware]: Control Structures and Microprogramming—Control Structure Performance and Design Aids

Keywords

GPU, AMD, Evergreen ISA, Multi2Sim

∗ This work was done while the author was with AMD.

1. INTRODUCTION

GPUs have become an important component of High Performance Computing (HPC) platforms by accelerating the ever-demanding data-parallel portions of a wide range of applications. The success of GPU computing has led microprocessor researchers in both academia and industry to believe that CPU-GPU heterogeneous computing is not just an alternative, but the future of HPC. GPUs are now showing up as integrated accelerators for general-purpose platforms [8, 5, 9]. This move attempts to leverage the combined capabilities of multi-core CPU and many-core GPU architectures.

As CPU-GPU heterogeneous computing research gains momentum, the need for a robust simulation environment becomes more critical. Simulation frameworks provide a number of benefits to researchers: they allow pre-silicon designs to be evaluated and performance results to be obtained for a range of design points. A number of CPU simulators supporting simulation at the ISA level have been developed [11, 14] and successfully used in a range of architectural studies. Although tools are currently available for simulating GPUs at the intermediate language level (e.g., PTX) [12, 13], the research community still lacks a publicly available framework integrating both fast functional simulation and cycle-accurate detailed architectural simulation at the ISA level that considers a true heterogeneous CPU-GPU model.

In this paper we present Multi2Sim, a simulation framework for CPU-GPU computing. The proposed framework integrates a publicly available model of the data-parallel AMD Evergreen GPU family [3] (AMD has used the Evergreen ISA specification for the implementation of its mainstream Radeon 5000 and 6000 series of GPUs) with the simulation of superscalar, multi-threaded, and multicore x86 processors. This work also offers important insight into the architecture of an AMD Evergreen GPU by describing our models of the instruction pipelines and memory hierarchy to a deeper extent than, to the best of our knowledge, any previous public work.
Multi2Sim is provided as a Linux-based command-line toolset, designed with an emphasis on presenting a user-friendly interface. It runs OpenCL applications without any source code modifications, and provides a number of instrumentation capabilities that enable research in application characterization, code optimization, compiler optimization, and hardware architecture design. To illustrate the utility and power of our toolset, we report on a wide range of experimental results based on benchmarks taken from AMD's Accelerated Parallel Processing (APP) SDK 2.5 [1].

The rest of this paper is organized as follows. Section 2 introduces the functional simulation model in Multi2Sim. Section 3 presents the Evergreen GPU architecture and its simulation. Section 4 reports our experimental evaluation. We summarize related work in Section 5, and conclude the paper in Section 6.

2. THE MULTI2SIM PROJECT

The Multi2Sim project started as a free, open-source, cycle-accurate simulation framework targeting superscalar, multi-threaded, and multicore x86 CPUs. The CPU simulation framework consists of two major interacting software components: the functional simulator and the architectural simulator. The functional simulator (i.e., emulator) mimics the execution of a guest program on a native x86 processor, by interpreting the program binary and dynamically reproducing its behavior at the ISA level. The architectural simulator (i.e., detailed or timing simulator) obtains a trace of x86 instructions from the functional simulator, and tracks execution of the processor hardware structures on a cycle-by-cycle basis.

The current version of the CPU functional simulator supports the execution of a number of different benchmark suites without any porting effort, including single-threaded benchmark suites (e.g., SPEC2006 and Mediabench), multi-threaded parallel benchmarks (SPLASH-2 and PARSEC 2.1), as well as custom self-compiled user code. The architectural simulator models many-core superscalar pipelines with out-of-order execution, a complete memory hierarchy with cache coherence, interconnection networks, and additional components.

Multi2Sim integrates a configurable model for the commercial AMD Evergreen GPU family (e.g., Radeon 5870). The latest releases fully support both functional and architectural simulation of a GPU, following the same interaction model between them as for CPU simulation. While the GPU emulator provides traces of Evergreen instructions, the detailed simulator tracks execution times and architectural state.

All simulated programs begin with the execution of CPU code. The interface to the GPU simulator is the Open Computing Language (OpenCL). When OpenCL programs are executed, the host (i.e., CPU) portions of the program are run using the CPU simulation modules. When OpenCL API calls are encountered, they are intercepted and used to set up or begin GPU simulation.

2.1 The OpenCL Programming Model

OpenCL is an industry-standard programming framework designed specifically for developing programs targeting heterogeneous computing platforms, consisting of CPUs, GPUs, and other classes of processing devices [7]. OpenCL's programming model emphasizes parallel processing by using the single-program multiple-data (SPMD) paradigm, in which a single piece of code, called a kernel, maps to multiple subsets of input data, creating a massive amount of parallel execution.

Figure 1 provides a view of the basic execution element hierarchy defined in OpenCL. An instance of the OpenCL kernel is called a work-item, which can access its own pool of private memory. Work-items are arranged into work-groups with two basic properties: i) work-items contained in the same work-group can perform efficient synchronization operations, and ii) work-items within the same work-group can share data through a low-latency local memory. The totality of work-groups forms the ND-Range (a grid of work-item groups), and all work-groups share a common global memory.

Figure 1: OpenCL programming and memory model.

2.2 OpenCL Simulation

Figure 2: Comparison of software modules of an OpenCL program: native AMD GPU-based heterogeneous system versus the Multi2Sim simulation framework.

The call stack of an OpenCL program running on Multi2Sim differs from the native call stack starting at the OpenCL library call, as shown in Figure 2. When an OpenCL API function call is issued, our implementation of the OpenCL runtime (libm2s-opencl.so) handles the call. This call is intercepted by the CPU simulation module, which transfers control to the GPU module as soon as the guest application launches the device kernel execution. This infrastructure allows unmodified x86 binaries (precompiled OpenCL host programs) to run on Multi2Sim with total binary compatibility with the native environment.
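As an illustration of the execution hierarchy described in Section 2.1, the following OpenCL C kernel (a minimal example written for this discussion, not taken from the APP SDK) reduces the elements assigned to each work-group using the work-group's local memory; the group_sum name and the scratch buffer are ours, while get_global_id, get_local_id, barrier, and the other built-ins are standard OpenCL.

    __kernel void group_sum(__global const float *in,
                            __global float *out,
                            __local float *scratch)
    {
        uint lid  = get_local_id(0);    /* position within the work-group */
        uint gid  = get_global_id(0);   /* position within the ND-Range */
        uint size = get_local_size(0);  /* work-items per work-group */

        scratch[lid] = in[gid];         /* each work-item loads one element */
        barrier(CLK_LOCAL_MEM_FENCE);   /* intra-work-group synchronization */

        /* Tree reduction in local memory (assumes a power-of-two work-group
           size); half of the work-items drop out at every step, which is a
           typical source of work-item divergence. */
        for (uint stride = size / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (lid == 0)
            out[get_group_id(0)] = scratch[0];  /* one result per work-group */
    }

When the corresponding precompiled x86 host program runs on Multi2Sim, the clEnqueueNDRangeKernel call that launches this kernel is serviced by libm2s-opencl.so and handed to the GPU emulator as described above, without any change to the application binary.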
3. ARCHITECTURAL SIMULATION OF AN AMD EVERGREEN GPU

This section presents the architecture of a generic AMD Evergreen GPU device, focusing on the hardware components devoted to general-purpose computing of OpenCL kernels. As one of the novelties of this paper, the following block diagrams and descriptions provide insight into the instruction pipelines, memory components, and interconnects, which tend to be kept private by the major GPU vendors and remain undocumented in currently available tools. All presented architectural details are accurately modeled in Multi2Sim, as described next.

Figure 4: Example of AMD Evergreen assembly code: (a) main CF clause instruction counter, (b) internal clause instruction counter, (c) ALU clause, (d) TEX clause.

3.1 The Evergreen GPU Architecture

A GPU consists of an ultra-threaded dispatcher, an array of independent compute units, and a memory hierarchy. The ultra-threaded dispatcher processes the ND-Range and maps waiting work-groups onto available compute units.
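To make the dispatcher's role concrete, the following C fragment is a simplified sketch of our own (not Multi2Sim source code) of the kind of mapping just described: a pending work-group is assigned to whichever compute unit still has a free slot. The constants NUM_COMPUTE_UNITS and MAX_GROUPS_PER_UNIT are illustrative assumptions rather than fixed hardware values.

    #define NUM_COMPUTE_UNITS    20   /* assumed number of compute units */
    #define MAX_GROUPS_PER_UNIT   8   /* assumed per-unit work-group limit */

    /* Work-groups currently resident on each compute unit. */
    static int resident[NUM_COMPUTE_UNITS];

    /* Return the compute unit that receives the next pending work-group,
       or -1 if every unit is already at its work-group limit and the
       ND-Range must wait for a resident work-group to complete. */
    static int dispatch_work_group(void)
    {
        for (int cu = 0; cu < NUM_COMPUTE_UNITS; cu++) {
            if (resident[cu] < MAX_GROUPS_PER_UNIT) {
                resident[cu]++;
                return cu;
            }
        }
        return -1;
    }

The per-unit work-group limit is precisely the kind of performance-sensitive parameter discussed next.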
The number of assigned work-groups per compute unit is a performance-sensitive decision that can be evaluated on Multi2Sim.

Each work-group assigned to a compute unit is partitioned into wavefronts, which are then placed into a ready wavefront pool. The CF engine selects wavefronts from the wavefront pool for execution, based on a wavefront scheduling algorithm. A new wavefront starts running the main CF clause of the OpenCL kernel binary, and subsequently spawns secondary ALU and TEX clauses. The wavefront scheduling algorithm is another performance-sensitive parameter, which can be evaluated with Multi2Sim.

When a wavefront is extracted from the pool, it is only inserted back when the executed CF instruction completes. This ensures that there is only a single CF instruction in flight at any time for a given wavefront, avoiding the need for branch prediction or speculative execution in case control flow is affected. The performance penalty of this serialization is hidden by overlapping the execution of different wavefronts. Determining the extent to which execution overlap occurs, and identifying the cause of bottlenecks, are additional benefits of simulating execution with Multi2Sim.

3.2.2 Work-Item Divergence

In a SIMD execution model, work-item divergence is a side effect that arises when a conditional branch instruction is resolved differently for work-items within the same wavefront. To address work-item divergence during SIMD execution, the Evergreen ISA provides each wavefront with an active mask. The active mask is a bit map, where each bit represents the active status of an individual work-item in the wavefront. If a work-item is labeled as inactive, the result of any arithmetic computation performed in its associated stream core is ignored, preventing the work-item from changing the kernel state.

This work-item divergence strategy attempts to converge all work-items together across all possible execution paths, allowing only those active work-items whose conditional execution matches the currently fetched instruction flow to continue execution. To support nested conditionals and procedure calls, an active mask stack is used to push and pop active masks, so that the active mask at the top of the stack always represents the active mask of the currently executing work-items. Using Multi2Sim, statistics related to work-item divergence are available to researchers (see Section 4.3).

3.3 The Instruction Pipelines

In a compute unit, the CF, ALU, and TEX engines are organized as instruction pipelines. Figure 5 presents a block diagram of each engine's instruction pipeline. Within each pipeline, decisions about scheduling policies, latencies, and buffer sizes must be made. These subtle factors have performance implications, and provide another opportunity for researchers to benefit from experimenting with design decisions within Multi2Sim.

Figure 5: Block diagram of the execution engine pipelines.

The CF engine (Figure 5a) runs the CF clause of an OpenCL kernel. The fetch stage selects a new wavefront from the wavefront pool on every cycle, switching among them at the granularity of a single CF instruction. Instructions from different wavefronts are interpreted by the decode stage in a round-robin fashion. When a CF instruction triggers a secondary clause, the corresponding execution engine (ALU or TEX engine) is allocated, and the CF instruction remains in the execute stage until the secondary clause completes. Other CF instructions from other wavefronts can be executed in the interim, as long as they do not request a busy execution engine. CF instruction execution (including all instructions run in a secondary clause, if any) finishes in order in the complete stage. The wavefront is then returned to the wavefront pool, making it again a candidate for instruction fetching. Global memory writes are run asynchronously in the CF engine itself, without requiring a secondary engine.

The ALU engine is devoted to the execution of ALU clauses from the allocated wavefront (Figure 5b). After the fetch and decode stages, decoded VLIW instructions are placed into a VLIW bundle buffer. The read stage consumes the VLIW bundle and reads the source operands from the register file and/or local memory for each work-item in the wavefront. The execute stage issues an instance of a VLIW bundle to each of the stream cores every cycle. The number of stream cores in a compute unit might be smaller than the number of work-items in a wavefront. Thus, a wavefront is split into subwavefronts, where each subwavefront contains as many work-items as there are stream cores in a compute unit. The result of the computation is written back to the destination operands (register file or local memory) at the write stage.

The TEX engine (Figure 5c) is devoted to the execution of global memory fetch instructions in TEX clauses. The TEX instruction bytes are stored into a TEX instruction buffer after being fetched and decoded. Memory addresses for each work-item in the wavefront are read from the register file, and a read request to the global memory hierarchy is performed at the read stage. Completed global memory reads are handled in order by the write stage, and the fetched data is stored into the corresponding locations of the register file for each work-item. The lifetime of a memory read is modeled in detail throughout the global memory hierarchy, as specified in the following sections.
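Before turning to the memory subsystem, the divergence handling of Section 3.2.2 can be made concrete with a short C sketch of our own (not simulator code); a 64-bit integer stands in for the active mask of a 64-work-item wavefront, and the function names are hypothetical.

    #include <assert.h>
    #include <stdint.h>

    #define STACK_DEPTH 32

    typedef struct {
        uint64_t mask[STACK_DEPTH];  /* bit i set => work-item i is active */
        int top;                     /* entry 0 holds the all-active wavefront mask */
    } active_mask_stack_t;

    /* Enter a conditional: save the enclosing mask and keep only those
       work-items whose branch condition evaluated to true. */
    static void push_branch(active_mask_stack_t *s, uint64_t taken)
    {
        assert(s->top + 1 < STACK_DEPTH);
        s->mask[s->top + 1] = s->mask[s->top] & taken;
        s->top++;
    }

    /* Switch to the else-path: active work-items are those that were active
       in the enclosing region but did not take the branch. */
    static void else_branch(active_mask_stack_t *s)
    {
        s->mask[s->top] = s->mask[s->top - 1] & ~s->mask[s->top];
    }

    /* Leave the conditional: restore the active mask of the enclosing region. */
    static void pop_branch(active_mask_stack_t *s)
    {
        s->top--;
    }

The stack starts with a single all-ones entry when the wavefront begins execution, and a stream core's result is committed only if the corresponding bit of the top-of-stack mask is set, matching the behavior described above.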
3.4 Memory Subsystem

The GPU memory subsystem contains different components for data storage and transfer. In Multi2Sim, the memory subsystem is highly configurable, including customizable settings for the number of cache levels, memory capacities, block sizes, and the number of banks and ports. A description of the memory components of the Evergreen model follows.

Register file (GPRs). Multi2Sim provides a model with no contention for register file accesses. In a given cycle, the register file can be accessed by the TEX and ALU engines simultaneously by different wavefronts. Work-items within and among wavefronts always access different register sets.

Local memory. A separate local memory module is present in each compute unit, and is modeled in Multi2Sim with a configurable latency, number of banks, ports, and allocation chunk size. In an OpenCL kernel, accesses to local memory are defined by the programmer by specifying a variable's scope, and these accesses are then compiled into distinct assembly instructions. Contention for local memory is modeled by serializing accesses to the same memory bank whenever no read or write port is available. Memory access coalescing is also considered, by grouping accesses from different work-items to the same memory block.

Global memory. The GPU global memory is accessible by all compute units. It is presented to the programmer as a separate memory scope, and implemented as a memory hierarchy managed by hardware in order to reduce access latency. In Multi2Sim, the global memory hierarchy has a configurable number of cache levels and interconnects. A possible configuration is shown in Figure 6a, using private L1 caches per compute unit and multiple L2 caches that are shared among subsets of compute units. L1 caches usually provide an access time similar to that of local memory, but they are managed transparently by hardware, similarly to how a memory hierarchy is managed on a CPU.

Interconnection networks. Each cache in the global memory hierarchy is connected to the lower-level cache (or to global memory) using an interconnection network. Interconnects are organized as point-to-point connections using a switch, whose architecture block diagram is presented in Figure 6b. A switch contains two disjoint inner subnetworks, each devoted to packet transfers in one of the two opposite directions.

Cache access queues. Each cache memory has a buffer where access requests are enqueued, as shown in Figure 6c. On one hand, access buffers allow for asynchronous writes that prevent stalls in the instruction pipelines. On the other hand, memory access coalescing is handled in the access buffers at every level of the global memory hierarchy (both caches and global memory). Each sequence of subsequent entries in the access queue reading or writing to the same cache block is grouped into one single actual memory access. The coalescing degree depends on the memory block size, the access queue size, and the memory access pattern, and it is a very performance-sensitive metric measurable with Multi2Sim.
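The grouping performed in the access queues can be pictured with the following C sketch of our own (not the simulator's implementation), in which consecutive queued byte addresses that fall into the same cache block collapse into a single memory access; the 64-byte block size is an assumption for illustration.

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE 64   /* assumed cache block size in bytes */

    /* Count the actual memory accesses produced by a queue of byte addresses
       when subsequent entries that touch the same cache block are coalesced
       into a single access. */
    static size_t coalesced_accesses(const uint64_t *queue, size_t n)
    {
        size_t accesses = 0;
        uint64_t prev_block = UINT64_MAX;

        for (size_t i = 0; i < n; i++) {
            uint64_t block = queue[i] / BLOCK_SIZE;
            if (block != prev_block) {   /* new block: issue a new access */
                accesses++;
                prev_block = block;
            }
        }
        return accesses;
    }

For example, a wavefront of 64 work-items reading consecutive 4-byte elements enqueues 64 addresses that collapse into only 4 accesses with this block size, which is why the coalescing degree is so sensitive to the access pattern.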
4. EXPERIMENTAL EVALUATION

This section presents a set of experiments aimed at validating and demonstrating the range of functional and architectural simulation features available with Multi2Sim. All simulations are based on a baseline GPU model resembling the commercial AMD Radeon 5870 GPU, whose hardware parameters are summarized in Table 1. For the simulator performance studies, simulations were run on a machine with four quad-core Intel Xeon processors (2.27 GHz, 8 MB cache, 24 GB DDR3). Experimental evaluations were performed using a subset of the AMD OpenCL SDK [1] applications, representing a wide range of application behaviors and memory access patterns [16]. The applications discussed in this paper are listed in Table 2, where we include a short description of the programs and the corresponding input dataset characteristics.

Figure 7: Validation for the architectural simulation, comparing simulated and native absolute execution times: (a) simulated execution time reported by Multi2Sim; (b) native execution time on the AMD Radeon 5870; (c) average error percentage between the native execution time and simulated execution time for APP SDK benchmarks.
Figure 8: Validation for architectural simulation, comparing trends between simulated and native execution times.
4.1 Validation

Our validation methodology for establishing the fidelity of the GPU simulator considered the correctness of both the functional and architectural simulation models, though we follow two different validation methodologies. For the functional simulator, the correctness of the instruction decoder is validated by comparing the disassembled code to the Evergreen output generated by the AMD compiler. We also validate the correctness of each benchmark's execution by comparing the simulated application output with the output of the application run directly on the CPU. All simulations generate functionally correct results for all programs and input problem sets studied.

Regarding the fidelity of the architectural model, Multi2Sim's performance results have been compared against native execution performance (native here refers to the actual Radeon 5870 hardware), using ten different input sizes within the ranges shown in Table 2 (column Input Range). Since our architectural model is cycle-based, and the native execution is measured as kernel execution time, it is challenging to compare our metrics directly. To convert simulated cycles into time, we use the documented ALU clock frequency of 850 MHz of the 5870 hardware. The native execution time is computed as the average time of 1000 kernel executions for each benchmark. Native kernel execution time was measured using the AMD APP profiler [2]. The execution time provided by the APP profiler does not include overheads such as kernel setup and host-device I/O [2].

Figure 7a and Figure 7b plot simulated execution time and native execution time performance trends, respectively (only four benchmarks are shown for clarity). Figure 7c shows the percentage difference in performance for a larger selection of benchmarks. The value shown for each benchmark in Figure 7c is the average of the absolute percent error over all inputs of the benchmark. For those cases where simulation accuracy decreases, Figure 8 shows detailed trends, leading to the following analysis.

In Figure 8a, we show the correlation between the native execution time and the simulated execution time for the studied benchmarks. For some of the benchmarks (e.g., Histogram or RecursiveGaussian), execution times vary significantly. However, we still see a strong correlation between each of the native execution points and their associated simulator results for all benchmarks. In other words, a change in the problem size for a benchmark has the same relative performance impact for both native and simulated executions. The linear trend line is obtained using a curve-fitting algorithm that minimizes the squared distance between every data point and the line. For the benchmarks that are modeled accurately by the simulator, the data points lie on the 45° line. The divergent slopes can be attributed to the lack of a precise representation of several aspects of the 5870 GPU's memory hierarchy.
Table 1: Baseline GPU simulation parameters.
4.3 Benchmark Characterization

As a case study of GPU simulation, this section presents a brief characterization of OpenCL benchmarks carried out on Multi2Sim, based on instruction classification, VLIW bundle occupancy, and control flow divergence. These statistics are dynamic in nature, and are reported by Multi2Sim as part of its simulation reports.

Figure 10a shows the Evergreen instruction mix executed by each OpenCL kernel. The instruction categories are control flow instructions (jumps, stack operations, and synchronizations), global memory reads, global memory writes, local memory accesses, and arithmetic-logic operations. Arithmetic-logic operations form the bulk of the executed instructions (these are GPU-friendly workloads).

Figure 10b represents the average occupancy of VLIW bundles executed in the stream cores of the GPU's ALU engine. If a VLIW instruction uses fewer than 5 slots, there will be idle VLIW lanes in the stream core, resulting in an underutilization of the available execution resources. The Evergreen compiler tries to maximize the VLIW slot occupancy, but there is an upper limit imposed by the instruction-level parallelism available in the kernel code. Results show that all 5 slots are rarely utilized (except for SobelFilter, thanks to its high fraction of ALU instructions), and the worst case of only a single filled slot is encountered frequently.

Finally, Figure 10c illustrates the control flow divergence among work-items. When work-items within a wavefront executing in a SIMD fashion diverge on branch conditions, the entire wavefront must go through all possible execution paths. Thus, frequent work-item divergence has a negative impact on performance. For each benchmark in Figure 10c, each color stride within a bar represents a different control flow path through the program. If a bar has one single stride, then only one path was taken by all work-items for that kernel. If there are n strides, then n different control flow paths were taken by different work-items. Note that different colors are used here only to delimit bar strides; no specific meaning is assigned to each color. The size of each stride represents the percentage of work-items that took that control flow path for the kernel. Results show benchmarks with the following divergence characteristics:

• No control flow divergence at all (URNG, DCT).

• Groups of divergence with a logarithmically decreasing size due to a different number of loop iterations (Reduction, DwtHaar1D).

• Multiple divergence groups depending on input data (BinomialOption; the darker color for BinomialOption is caused by many small divergence regions represented in the same bar).

4.4 Architectural Exploration

The architectural GPU model provided in Multi2Sim allows researchers to perform large design space evaluations. As a sample of this simulation flexibility, this section presents three case studies, where performance varies significantly for different input parameter values. In each case, we compare two benchmarks with respect to their architectural sensitivity. Performance is measured using the number of instructions per cycle (IPC), where the instruction count is incremented by one for a whole wavefront, regardless of the number of comprising work-items.

Figure 11a shows performance scaling with respect to the number of compute units. The total memory bandwidth provided by global memory is shared by all compute units, so increasing the number of compute units decreases the available bandwidth per executed work-group. The available memory bandwidth for the device in this experiment only increases between compute unit counts that are a multiple of 5, when a new L2 cache is added (Table 1). When the total bandwidth is exhausted, the trend flattens (as seen between 10-15 and 15-20 compute units). This point is clearly observed when we increase the number of compute units for compute-intensive kernels with high ALU-to-fetch instruction ratios (e.g., URNG), and less so for memory-intensive benchmarks (e.g., Histogram).

Figure 11b presents the performance achieved by varying the number of stream cores per compute unit. In the BinomialOption kernel we observe a step function, where each step corresponds to a multiple of the wavefront size (64). This behavior is due to the fact that the number of stream cores determines the number of subwavefronts (or time-multiplexed slots) that stream cores deal with for each VLIW bundle. When an increase in the number of stream cores causes a decrease in the number of subwavefronts (e.g., 15 to 16, 21 to 22, and 31 to 32), performance improves. When the number of stream cores matches the number of work-items per wavefront, the bottleneck due to serialized stream core utilization disappears. This effect is not observed for ScanLargeArrays due to its lower wavefront occupancy.

Figure 11c plots the impact of increasing the L1 cache size. For benchmarks that lack temporal locality and exhibit large strided accesses in the data stream, performance is insensitive to increasing the cache size, as seen for Reduction. In contrast, benchmarks with locality are more sensitive to changes in the L1 cache size, as observed for FloydWarshall.
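The step behavior observed in Figure 11b follows directly from the subwavefront mechanism of Section 3.3; the small C helper below (our own illustration, assuming the 64-work-item wavefronts used throughout this section) makes the relationship explicit.

    #define WAVEFRONT_SIZE 64   /* work-items per wavefront */

    /* Time-multiplexed slots needed to issue one VLIW bundle of a full
       wavefront through the given number of stream cores. */
    static int subwavefronts(int num_stream_cores)
    {
        return (WAVEFRONT_SIZE + num_stream_cores - 1) / num_stream_cores;
    }

Going from 15 to 16 stream cores reduces the count from 5 to 4 subwavefronts, 21 to 22 reduces it from 4 to 3, and 31 to 32 reduces it from 3 to 2, which is exactly where the performance steps appear.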
Figure 10: Examples of benchmark characterization, based on program features relevant to GPU performance. IPC is calculated as the total number of instructions for the entire kernel, divided by the total number of cycles to execute the entire kernel. (Benchmarks: BinomialOption, BitonicSort, DCT, DwtHaar1D, FloydWarshall, MatrixMultiplication, MatrixTranspose, PrefixSum, RecursiveGaussian, Reduction, ScanLargeArrays, SobelFilter, URNG.)
Figure 11: Architectural exploration, showing results for those benchmarks with interesting performance trends: (a) scaling the number of compute units; (b) scaling the number of stream cores; (c) scaling the L1 cache size.
5. RELATED WORK

While numerous mature CPU simulators at various levels are available, GPU simulators are still in their infancy. There continues to be a growing need for architectural GPU simulators that model a GPU at the ISA level, and in the near future we will see a more pressing need for a true CPU-GPU heterogeneous simulation framework. This section briefly summarizes existing simulators targeting GPUs.

Barra [12] is an ISA-level functional simulator targeting the NVIDIA G80 GPUs. It runs CUDA executables without any modification. Since NVIDIA's G80 ISA specification is not publicly available, the simulator relies on a reverse-engineered ISA provided by another academic project. Similar to our approach, Barra intercepts API calls to the CUDA library and reroutes them to the simulator. Unfortunately, it is limited to GPU functional simulation, lacking an architectural simulation model.

GPGPUSim [10] is a detailed simulator that models a GPU architecture similar to NVIDIA's. It includes a shader core, interconnects, thread block (work-group) scheduling, and a memory hierarchy. Multi2Sim models a different GPU ISA and architecture (Evergreen). GPGPUSim can provide us with important insight into design problems for GPUs. However, Multi2Sim also supports CPU simulation within the same tool, enabling additional architectural research into heterogeneous architectures.

Ocelot [13] is a widely used functional simulator and dynamic compilation framework that works at a virtual ISA level. Taking NVIDIA's CUDA PTX code as input, it can either emulate it or dynamically translate it to multiple platforms such as x86 CPUs, NVIDIA GPUs, and AMD GPUs. Ocelot has objectives different from GPU architectural simulation, so there is extensive functionality not provided or targeted by Multi2Sim, which makes them complementary tools.

When compared to previous work, Multi2Sim is unique in the following aspects. First, it models the native ISA of a commercially available GPU. Second, it provides an architectural simulation of a real GPU with tractable accuracy. Third, Multi2Sim is a CPU-GPU heterogeneous simulation framework, which can be used to evaluate upcoming architectures where the CPU and GPU are merged on silicon and share a common memory address space [8].

6. CONCLUSIONS

In this paper we have presented Multi2Sim, a full-fledged simulation framework that supports both fast functional and detailed architectural simulation for x86 CPUs and Evergreen GPUs at the ISA level. It is modular, fully configurable, and easy to use. The toolset is actively maintained and is available as a free, open-source project at www.multi2sim.org, together with packages of benchmarks, a complete user guide, and active mailing lists and forums.

Ongoing work for Multi2Sim includes expanding benchmark support by increasing Evergreen ISA coverage. Future releases will include a model for the AMD Fusion architecture, where the CPU and GPU share a common global memory hierarchy and address space. Supporting shared memory for heterogeneous architectures highlights the potential of Multi2Sim, as no other simulator can provide useful architectural statistics in this type of environment. Current development also includes support for OpenGL applications and exploration of OpenCL language extensions. Since Multi2Sim is currently being used by a number of leading research groups, we believe this is a great opportunity to accelerate research on heterogeneous, parallel architectures.

Acknowledgments

This work was supported in part by NSF Award EEC-0946463, and through the support and donations from AMD and NVIDIA. The authors would also like to thank Norman Rubin (AMD) for his advice and feedback on this work.

7. REFERENCES

[1] AMD Accelerated Parallel Processing (APP) Software Development Kit (SDK). http://developer.amd.com/sdks/amdappsdk/.
[2] AMD Accelerated Parallel Processing OpenCL Programming Guide (v1.3c).
[3] AMD Evergreen Family Instruction Set Architecture (v1.0d). http://developer.amd.com/sdks/amdappsdk/documentation/.
[4] AMD Intermediate Language (IL) Specification (v2.0e). http://developer.amd.com/sdks/amdappsdk/documentation/.
[5] Intel Ivy Bridge. http://ark.intel.com/products/codename/29902/Ivy-Bridge.
[6] NVIDIA PTX: Parallel Thread Execution ISA. http://developer.nvidia.com/cuda-downloads/.
[7] OpenCL: The Open Standard for Parallel Programming of Heterogeneous Systems. www.khronos.org/opencl.
[8] The AMD Fusion Family of APUs. http://fusion.amd.com/.
[9] The NVIDIA Denver Project. http://blogs.nvidia.com/.
[10] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In Proc. of the Int'l Symposium on Performance Analysis of Systems and Software (ISPASS), Apr. 2009.
[11] N. L. Binkert, E. G. Hallnor, and S. K. Reinhardt. Network-Oriented Full-System Simulation Using M5. In 6th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW), Feb. 2003.
[12] S. Collange, M. Daumas, D. Defour, and D. Parello. Barra: A Parallel Functional Simulator for GPGPU. In Proc. of the 18th Int'l Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Aug. 2010.
[13] G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark. Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems. In Proc. of the 19th Int'l Conference on Parallel Architectures and Compilation Techniques, Sept. 2010.
[14] P. S. Magnusson et al. Simics: A Full System Simulation Platform. IEEE Computer, 35(2), 2002.
[15] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In Proc. of the 40th Int'l Symposium on Microarchitecture, Dec. 2007.
[16] B. Jang, D. Schaa, P. Mistry, and D. Kaeli. Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures. IEEE Transactions on Parallel and Distributed Systems, 22(1), Jan. 2011.
[17] M. Houston and M. Mantor. AMD Graphics Core Next. http://developer.amd.com/afds/assets/presentations/2620_final.pdf.
[18] G. L. Yuan, A. A. Bakhoda, and T. M. Aamodt. Complexity Effective Memory Access Scheduling for Many-Core Accelerator Architectures. In Proc. of the 42nd Int'l Symposium on Microarchitecture, Dec. 2009.