Multi2Sim: A Simulation Framework for CPU-GPU Computing
† Electrical and Computer Engineering Dept., Northeastern University, 360 Huntington Ave., Boston, MA 02115
‡ Computer and Information Science Dept., University of Mississippi, P. O. Box 1848, University, MS 38677
ABSTRACT

Accurate simulation is essential for the proper design and evaluation of any computing platform. With the current move toward the CPU-GPU heterogeneous computing era, researchers need a simulation framework that can model both kinds of computing devices and their interaction. In this paper, we present Multi2Sim, an open-source, modular, and fully configurable toolset that enables ISA-level simulation of an x86 CPU and an AMD Evergreen GPU. Focusing on a model of the AMD Radeon 5870 GPU, we address program emulation correctness, as well as architectural simulation accuracy, using AMD's OpenCL benchmark suite. Simulation capabilities are demonstrated with a preliminary architectural exploration study and workload characterization examples. The project source code, benchmark packages, and a detailed user's guide are publicly available at www.multi2sim.org.

Categories and Subject Descriptors

C.1.2 [Computer Systems Organization]: Processor Architectures—Multiple Data Stream Architectures; B.1.2 [Hardware]: Control Structures and Microprogramming—Control Structure Performance and Design Aids

Keywords

GPU, AMD, Evergreen ISA, Multi2Sim

∗ This work was done while the author was with AMD.

1. INTRODUCTION

GPUs have become an important component of High Performance Computing (HPC) platforms by accelerating the ever-demanding data-parallel portions of a wide range of applications. The success of GPU computing has led microprocessor researchers in both academia and industry to believe that CPU-GPU heterogeneous computing is not just an alternative, but the future of HPC. GPUs are now showing up as integrated accelerators for general-purpose platforms [8, 5, 9]. This move attempts to leverage the combined capabilities of multi-core CPU and many-core GPU architectures.

As CPU-GPU heterogeneous computing research gains momentum, the need for a robust simulation environment becomes more critical. Simulation frameworks provide a number of benefits to researchers: they allow pre-silicon designs to be evaluated and performance results to be obtained for a range of design points. A number of CPU simulators supporting simulation at the ISA level have been developed [11, 14] and successfully used in a range of architectural studies. Although tools are currently available for simulating GPUs at the intermediate language level (e.g., PTX) [12, 13], the research community still lacks a publicly available framework integrating both fast functional simulation and cycle-accurate detailed architectural simulation at the ISA level that considers a true heterogeneous CPU-GPU model.

In this paper we present Multi2Sim, a simulation framework for CPU-GPU computing. The proposed framework integrates a publicly available model of the data-parallel AMD Evergreen GPU family [3] (AMD has used the Evergreen ISA specification for the implementation of its mainstream Radeon 5000 and 6000 series of GPUs) with the simulation of superscalar, multi-threaded, and multicore x86 processors. This work also offers important insight into the architecture of an AMD Evergreen GPU by describing our models of the instruction pipelines and memory hierarchy to a deeper extent than, to the best of our knowledge, any previous public work.
Multi2Sim is provided as a Linux-based command-line toolset, designed with an emphasis on presenting a user-friendly interface. It runs OpenCL applications without any source code modifications, and provides a number of instrumentation capabilities that enable research in application characterization, code optimization, compiler optimization, and hardware architecture design. To illustrate the utility and power of our toolset, we report on a wide range of experimental results based on benchmarks taken from AMD's Accelerated Parallel Processing (APP) SDK 2.5 [1].

The rest of this paper is organized as follows. Section 2 introduces the functional simulation model in Multi2Sim. Section 3 presents the Evergreen GPU architecture and its simulation. Section 4 reports our experimental evaluation. We summarize related work in Section 5, and conclude the paper in Section 6.

2. THE MULTI2SIM PROJECT

The Multi2Sim project started as a free, open-source, cycle-accurate simulation framework targeting superscalar, multi-threaded, and multicore x86 CPUs. The CPU simulation framework consists of two major interacting software components: the functional simulator and the architectural simulator. The functional simulator (i.e., emulator) mimics the execution of a guest program on a native x86 processor, by interpreting the program binary and dynamically reproducing its behavior at the ISA level. The architectural simulator (i.e., detailed or timing simulator) obtains a trace of x86 instructions from the functional simulator, and tracks execution of the processor hardware structures on a cycle-by-cycle basis.

The current version of the CPU functional simulator supports the execution of a number of different benchmark suites without any porting effort, including single-threaded benchmark suites (e.g., SPEC2006 and Mediabench), multi-threaded parallel benchmarks (SPLASH-2 and PARSEC 2.1), as well as custom self-compiled user code. The architectural simulator models many-core superscalar pipelines with out-of-order execution, a complete memory hierarchy with cache coherence, interconnection networks, and additional components.

Multi2Sim integrates a configurable model for the commercial AMD Evergreen GPU family (e.g., Radeon 5870). The latest releases fully support both functional and architectural simulation of a GPU, following the same interaction model between them as for CPU simulation. While the GPU emulator provides traces of Evergreen instructions, the detailed simulator tracks execution times and architectural state.

All simulated programs begin with the execution of CPU code. The interface to the GPU simulator is the Open Computing Language (OpenCL). When OpenCL programs are executed, the host (i.e., CPU) portions of the program are run using the CPU simulation modules. When OpenCL API calls are encountered, they are intercepted and used to set up or begin GPU simulation.

2.1 The OpenCL Programming Model

OpenCL is an industry-standard programming framework designed specifically for developing programs targeting heterogeneous computing platforms, consisting of CPUs, GPUs, and other classes of processing devices [7]. OpenCL's programming model emphasizes parallel processing by using the single-program multiple-data (SPMD) paradigm, in which a single piece of code, called a kernel, maps to multiple subsets of input data, creating a massive amount of parallel execution.

Figure 1 provides a view of the basic execution element hierarchy defined in OpenCL. An instance of the OpenCL kernel is called a work-item, which can access its own pool of private memory. Work-items are arranged into work-groups with two basic properties: i) work-items contained in the same work-group can perform efficient synchronization operations, and ii) work-items within the same work-group can share data through a low-latency local memory. The totality of work-groups forms the ND-Range (a grid of work-item groups), and all work-groups share a common global memory.

Figure 1: OpenCL programming and memory model.

2.2 OpenCL Simulation

Figure 2: Comparison of software modules of an OpenCL program: native AMD GPU-based heterogeneous system versus the Multi2Sim simulation framework.

The call stack of an OpenCL program running on Multi2Sim differs from the native call stack starting at the OpenCL library call, as shown in Figure 2. When an OpenCL API function call is issued, our implementation of the OpenCL runtime (libm2s-opencl.so) handles the call. This call is intercepted by the CPU simulation module, which transfers control to the GPU module as soon as the guest application launches the device kernel execution. This infrastructure allows unmodified x86 binaries (precompiled OpenCL host programs) to run on Multi2Sim with total binary compatibility with the native environment.
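As an illustration of the execution hierarchy described in Section 2.1, the following OpenCL C kernel (a minimal example written for this discussion, not taken from the APP SDK) reduces the elements assigned to each work-group using the work-group's local memory; the group_sum name and the scratch buffer are ours, while get_global_id, get_local_id, barrier, and the other built-ins are standard OpenCL.

    __kernel void group_sum(__global const float *in,
                            __global float *out,
                            __local float *scratch)
    {
        uint lid  = get_local_id(0);    /* position within the work-group */
        uint gid  = get_global_id(0);   /* position within the ND-Range */
        uint size = get_local_size(0);  /* work-items per work-group */

        scratch[lid] = in[gid];         /* each work-item loads one element */
        barrier(CLK_LOCAL_MEM_FENCE);   /* intra-work-group synchronization */

        /* Tree reduction in local memory (assumes a power-of-two work-group
           size); half of the work-items drop out at every step, which is a
           typical source of work-item divergence. */
        for (uint stride = size / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (lid == 0)
            out[get_group_id(0)] = scratch[0];  /* one result per work-group */
    }

When the corresponding precompiled x86 host program runs on Multi2Sim, the clEnqueueNDRangeKernel call that launches this kernel is serviced by libm2s-opencl.so and handed to the GPU emulator as described above, without any change to the application binary.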
3. ARCHITECTURAL SIMULATION OF AN AMD EVERGREEN GPU

This section presents the architecture of a generic AMD Evergreen GPU device, focusing on the hardware components devoted to general-purpose computing of OpenCL kernels. As one of the novelties of this paper, the following block diagrams and descriptions provide insight into the instruction pipelines, memory components, and interconnects, which tend to be kept private by the major GPU vendors and remain undocumented in currently available tools. All presented architectural details are accurately modeled in Multi2Sim, as described next.

Figure 4: Example of AMD Evergreen assembly code: (a) main CF clause instruction counter, (b) internal clause instruction counter, (c) ALU clause, (d) TEX clause.

3.1 The Evergreen GPU Architecture

A GPU consists of an ultra-threaded dispatcher, an array of independent compute units, and a memory hierarchy. The ultra-threaded dispatcher processes the ND-Range and maps waiting work-groups onto available compute units.
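To make the dispatcher's role concrete, the following C fragment is a simplified sketch of our own (not Multi2Sim source code) of the kind of mapping just described: a pending work-group is assigned to whichever compute unit still has a free slot. The constants NUM_COMPUTE_UNITS and MAX_GROUPS_PER_UNIT are illustrative assumptions rather than fixed hardware values.

    #define NUM_COMPUTE_UNITS    20   /* assumed number of compute units */
    #define MAX_GROUPS_PER_UNIT   8   /* assumed per-unit work-group limit */

    /* Work-groups currently resident on each compute unit. */
    static int resident[NUM_COMPUTE_UNITS];

    /* Return the compute unit that receives the next pending work-group,
       or -1 if every unit is already at its work-group limit and the
       ND-Range must wait for a resident work-group to complete. */
    static int dispatch_work_group(void)
    {
        for (int cu = 0; cu < NUM_COMPUTE_UNITS; cu++) {
            if (resident[cu] < MAX_GROUPS_PER_UNIT) {
                resident[cu]++;
                return cu;
            }
        }
        return -1;
    }

The per-unit work-group limit is precisely the kind of performance-sensitive parameter discussed next.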
The number of assigned work-groups per compute unit is a performance-sensitive decision that can be evaluated on Multi2Sim.

Each work-group assigned to a compute unit is partitioned into wavefronts, which are then placed into a ready wavefront pool. The CF engine selects wavefronts from the wavefront pool for execution, based on a wavefront scheduling algorithm. A new wavefront starts running the main CF clause of the OpenCL kernel binary, and subsequently spawns secondary ALU and TEX clauses. The wavefront scheduling algorithm is another performance-sensitive parameter, which can be evaluated with Multi2Sim.

When a wavefront is extracted from the pool, it is only inserted back when the executed CF instruction completes. This ensures that there is only a single CF instruction in flight at any time for a given wavefront, avoiding the need for branch prediction or speculative execution in case control flow is affected. The performance penalty of this serialization is hidden by overlapping the execution of different wavefronts. Determining the extent to which execution overlap occurs, and identifying the cause of bottlenecks, are additional benefits of simulating execution with Multi2Sim.

3.2.2 Work-Item Divergence

In a SIMD execution model, work-item divergence is a side effect that arises when a conditional branch instruction is resolved differently for work-items within the same wavefront. To address work-item divergence during SIMD execution, the Evergreen ISA provides each wavefront with an active mask. The active mask is a bit map, where each bit represents the active status of an individual work-item in the wavefront. If a work-item is labeled as inactive, the result of any arithmetic computation performed in its associated stream core is ignored, preventing the work-item from changing the kernel state.

This work-item divergence strategy attempts to converge all work-items together across all possible execution paths, allowing only those active work-items whose conditional execution matches the currently fetched instruction flow to continue execution. To support nested conditionals and procedure calls, an active mask stack is used to push and pop active masks, so that the active mask at the top of the stack always represents the active mask of the currently executing work-items. Using Multi2Sim, statistics related to work-item divergence are available to researchers (see Section 4.3).

3.3 The Instruction Pipelines

In a compute unit, the CF, ALU, and TEX engines are organized as instruction pipelines. Figure 5 presents a block diagram of each engine's instruction pipeline. Within each pipeline, decisions about scheduling policies, latencies, and buffer sizes must be made. These subtle factors have performance implications, and provide another opportunity for researchers to benefit from experimenting with design decisions within Multi2Sim.

Figure 5: Block diagram of the execution engine pipelines.

The CF engine (Figure 5a) runs the CF clause of an OpenCL kernel. The fetch stage selects a new wavefront from the wavefront pool on every cycle, switching among them at the granularity of a single CF instruction. Instructions from different wavefronts are interpreted by the decode stage in a round-robin fashion. When a CF instruction triggers a secondary clause, the corresponding execution engine (ALU or TEX engine) is allocated, and the CF instruction remains in the execute stage until the secondary clause completes. Other CF instructions from other wavefronts can be executed in the interim, as long as they do not request a busy execution engine. CF instruction execution (including all instructions run in a secondary clause, if any) finishes in order in the complete stage. The wavefront is then returned to the wavefront pool, making it again a candidate for instruction fetching. Global memory writes are run asynchronously in the CF engine itself, without requiring a secondary engine.

The ALU engine is devoted to the execution of ALU clauses from the allocated wavefront (Figure 5b). After the fetch and decode stages, decoded VLIW instructions are placed into a VLIW bundle buffer. The read stage consumes the VLIW bundle and reads the source operands from the register file and/or local memory for each work-item in the wavefront. The execute stage issues an instance of a VLIW bundle to each of the stream cores every cycle. The number of stream cores in a compute unit might be smaller than the number of work-items in a wavefront. Thus, a wavefront is split into subwavefronts, where each subwavefront contains as many work-items as there are stream cores in a compute unit. The result of the computation is written back to the destination operands (register file or local memory) at the write stage.

The TEX engine (Figure 5c) is devoted to the execution of global memory fetch instructions in TEX clauses. The TEX instruction bytes are stored into a TEX instruction buffer after being fetched and decoded. Memory addresses for each work-item in the wavefront are read from the register file, and a read request to the global memory hierarchy is performed at the read stage. Completed global memory reads are handled in order by the write stage, and the fetched data is stored into the corresponding locations of the register file for each work-item. The lifetime of a memory read is modeled in detail throughout the global memory hierarchy, as specified in the following sections.
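Before turning to the memory subsystem, the divergence handling of Section 3.2.2 can be made concrete with a short C sketch of our own (not simulator code); a 64-bit integer stands in for the active mask of a 64-work-item wavefront, and the function names are hypothetical.

    #include <assert.h>
    #include <stdint.h>

    #define STACK_DEPTH 32

    typedef struct {
        uint64_t mask[STACK_DEPTH];  /* bit i set => work-item i is active */
        int top;                     /* entry 0 holds the all-active wavefront mask */
    } active_mask_stack_t;

    /* Enter a conditional: save the enclosing mask and keep only those
       work-items whose branch condition evaluated to true. */
    static void push_branch(active_mask_stack_t *s, uint64_t taken)
    {
        assert(s->top + 1 < STACK_DEPTH);
        s->mask[s->top + 1] = s->mask[s->top] & taken;
        s->top++;
    }

    /* Switch to the else-path: active work-items are those that were active
       in the enclosing region but did not take the branch. */
    static void else_branch(active_mask_stack_t *s)
    {
        s->mask[s->top] = s->mask[s->top - 1] & ~s->mask[s->top];
    }

    /* Leave the conditional: restore the active mask of the enclosing region. */
    static void pop_branch(active_mask_stack_t *s)
    {
        s->top--;
    }

The stack starts with a single all-ones entry when the wavefront begins execution, and a stream core's result is committed only if the corresponding bit of the top-of-stack mask is set, matching the behavior described above.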
3.4 Memory Subsystem

The GPU memory subsystem contains different components for data storage and transfer. In Multi2Sim, the memory subsystem is highly configurable, including customizable settings for the number of cache levels, memory capacities, block sizes, and the number of banks and ports. A description of the memory components of the Evergreen model follows.

Register file (GPRs). Multi2Sim provides a model with no contention for register file accesses. In a given cycle, the register file can be accessed by the TEX and ALU engines simultaneously by different wavefronts. Work-items within and among wavefronts always access different register sets.

Local memory. A separate local memory module is present in each compute unit, and is modeled in Multi2Sim with a configurable latency, number of banks, ports, and allocation chunk size. In an OpenCL kernel, accesses to local memory are defined by the programmer by specifying a variable's scope, and these accesses are then compiled into distinct assembly instructions. Contention for local memory is modeled by serializing accesses to the same memory bank whenever no read or write port is available. Memory access coalescing is also considered, by grouping accesses from different work-items to the same memory block.

Global memory. The GPU global memory is accessible by all compute units. It is presented to the programmer as a separate memory scope, and implemented as a memory hierarchy managed by hardware in order to reduce access latency. In Multi2Sim, the global memory hierarchy has a configurable number of cache levels and interconnects. A possible configuration is shown in Figure 6a, using private L1 caches per compute unit and multiple L2 caches that are shared among subsets of compute units. L1 caches usually provide an access time similar to that of local memory, but they are managed transparently by hardware, similarly to how a memory hierarchy is managed on a CPU.

Interconnection networks. Each cache in the global memory hierarchy is connected to the lower-level cache (or to global memory) using an interconnection network. Interconnects are organized as point-to-point connections using a switch, whose architecture block diagram is presented in Figure 6b. A switch contains two disjoint inner subnetworks, each devoted to packet transfers in one of the two opposite directions.

Cache access queues. Each cache memory has a buffer where access requests are enqueued, as shown in Figure 6c. On one hand, access buffers allow for asynchronous writes that prevent stalls in the instruction pipelines. On the other hand, memory access coalescing is handled in the access buffers at every level of the global memory hierarchy (both caches and global memory). Each sequence of subsequent entries in the access queue reading or writing to the same cache block is grouped into one single actual memory access. The coalescing degree depends on the memory block size, the access queue size, and the memory access pattern, and it is a very performance-sensitive metric measurable with Multi2Sim.
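The grouping performed in the access queues can be pictured with the following C sketch of our own (not the simulator's implementation), in which consecutive queued byte addresses that fall into the same cache block collapse into a single memory access; the 64-byte block size is an assumption for illustration.

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE 64   /* assumed cache block size in bytes */

    /* Count the actual memory accesses produced by a queue of byte addresses
       when subsequent entries that touch the same cache block are coalesced
       into a single access. */
    static size_t coalesced_accesses(const uint64_t *queue, size_t n)
    {
        size_t accesses = 0;
        uint64_t prev_block = UINT64_MAX;

        for (size_t i = 0; i < n; i++) {
            uint64_t block = queue[i] / BLOCK_SIZE;
            if (block != prev_block) {   /* new block: issue a new access */
                accesses++;
                prev_block = block;
            }
        }
        return accesses;
    }

For example, a wavefront of 64 work-items reading consecutive 4-byte elements enqueues 64 addresses that collapse into only 4 accesses with this block size, which is why the coalescing degree is so sensitive to the access pattern.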
4. EXPERIMENTAL EVALUATION

This section presents a set of experiments aimed at validating and demonstrating the range of functional and architectural simulation features available with Multi2Sim. All simulations are based on a baseline GPU model resembling the commercial AMD Radeon 5870 GPU, whose hardware parameters are summarized in Table 1. For the simulator performance studies, simulations were run on a machine with four quad-core Intel Xeon processors (2.27 GHz, 8 MB cache, 24 GB DDR3). Experimental evaluations were performed using a subset of the AMD OpenCL SDK [1] applications, representing a wide range of application behaviors and memory access patterns [16]. The applications discussed in this paper are listed in Table 2, where we include a short description of the programs and the corresponding input dataset characteristics.

Figure 7: Validation for the architectural simulation, comparing simulated and native absolute execution times: (a) simulated execution time reported by Multi2Sim; (b) native execution time on the AMD Radeon 5870; (c) average error percentage between the native execution time and simulated execution time for APP SDK benchmarks.
Figure 8: Validation for architectural simulation, comparing trends between simulated and native execution times.
4.1 Validation

Our validation methodology for establishing the fidelity of the GPU simulator considered the correctness of both the functional and architectural simulation models, though we follow two different validation methodologies. For the functional simulator, the correctness of the instruction decoder is validated by comparing the disassembled code to the Evergreen output generated by the AMD compiler. We also validate the correctness of each benchmark's execution by comparing the simulated application output with the output of the application run directly on the CPU. All simulations generate functionally correct results for all programs and input problem sets studied.

Regarding the fidelity of the architectural model, Multi2Sim's performance results have been compared against native execution performance (native here refers to the actual Radeon 5870 hardware), using ten different input sizes within the ranges shown in Table 2 (column Input Range). Since our architectural model is cycle-based, and the native execution is measured as kernel execution time, it is challenging to compare our metrics directly. To convert simulated cycles into time, we use the documented ALU clock frequency of 850 MHz of the 5870 hardware. The native execution time is computed as the average time of 1000 kernel executions for each benchmark. Native kernel execution time was measured using the AMD APP profiler [2]. The execution time provided by the APP profiler does not include overheads such as kernel setup and host-device I/O [2].

Figure 7a and Figure 7b plot simulated execution time and native execution time performance trends, respectively (only four benchmarks are shown for clarity). Figure 7c shows the percentage difference in performance for a larger selection of benchmarks. The value shown for each benchmark in Figure 7c is the average of the absolute percent error over all inputs of the benchmark. For those cases where simulation accuracy decreases, Figure 8 shows detailed trends, leading to the following analysis.

In Figure 8a, we show the correlation between the native execution time and the simulated execution time for the studied benchmarks. For some of the benchmarks (e.g., Histogram or RecursiveGaussian), execution times vary significantly. However, we still see a strong correlation between each of the native execution points and their associated simulator results for all benchmarks. In other words, a change in the problem size for a benchmark has the same relative performance impact for both native and simulated executions. The linear trend line is obtained using a curve-fitting algorithm that minimizes the squared distance between every data point and the line. For the benchmarks that are modeled accurately by the simulator, the data points lie on the 45° line. The divergent slopes can be attributed to the lack of a precise representation of several aspects of the 5870 GPU's memory hierarchy.
Table 1: Baseline GPU simulation parameters.
4.3 Benchmark Characterization

As a case study of GPU simulation, this section presents a brief characterization of OpenCL benchmarks carried out on Multi2Sim, based on instruction classification, VLIW bundle occupancy, and control flow divergence. These statistics are dynamic in nature, and are reported by Multi2Sim as part of its simulation reports.

Figure 10a shows the Evergreen instruction mix executed by each OpenCL kernel. The instruction categories are control flow instructions (jumps, stack operations, and synchronizations), global memory reads, global memory writes, local memory accesses, and arithmetic-logic operations. Arithmetic-logic operations form the bulk of the executed instructions (these are GPU-friendly workloads).

Figure 10b represents the average occupancy of VLIW bundles executed in the stream cores of the GPU's ALU engine. If a VLIW instruction uses fewer than 5 slots, there will be idle VLIW lanes in the stream core, resulting in an underutilization of the available execution resources. The Evergreen compiler tries to maximize the VLIW slot occupancy, but there is an upper limit imposed by the instruction-level parallelism available in the kernel code. Results show that all 5 slots are rarely utilized (except for SobelFilter, thanks to its high fraction of ALU instructions), and the worst case of only a single filled slot is encountered frequently.

Finally, Figure 10c illustrates the control flow divergence among work-items. When work-items within a wavefront executing in a SIMD fashion diverge on branch conditions, the entire wavefront must go through all possible execution paths. Thus, frequent work-item divergence has a negative impact on performance. For each benchmark in Figure 10c, each color stride within a bar represents a different control flow path through the program. If a bar has one single stride, then only one path was taken by all work-items for that kernel. If there are n strides, then n different control flow paths were taken by different work-items. Note that different colors are used here only to delimit bar strides; no specific meaning is assigned to each color. The size of each stride represents the percentage of work-items that took that control flow path for the kernel. Results show benchmarks with the following divergence characteristics:

• No control flow divergence at all (URNG, DCT).

• Groups of divergence with a logarithmically decreasing size due to a different number of loop iterations (Reduction, DwtHaar1D).

• Multiple divergence groups depending on input data (BinomialOption; the darker color for BinomialOption is caused by many small divergence regions represented in the same bar).

4.4 Architectural Exploration

The architectural GPU model provided in Multi2Sim allows researchers to perform large design space evaluations. As a sample of this simulation flexibility, this section presents three case studies, where performance varies significantly for different input parameter values. In each case, we compare two benchmarks with respect to their architectural sensitivity. Performance is measured using the number of instructions per cycle (IPC), where the instruction count is incremented by one for a whole wavefront, regardless of the number of comprising work-items.

Figure 11a shows performance scaling with respect to the number of compute units. The total memory bandwidth provided by global memory is shared by all compute units, so increasing the number of compute units decreases the available bandwidth per executed work-group. The available memory bandwidth for the device in this experiment only increases between compute unit counts that are a multiple of 5, when a new L2 cache is added (Table 1). When the total bandwidth is exhausted, the trend flattens (as seen between 10-15 and 15-20 compute units). This point is clearly observed when we increase the number of compute units for compute-intensive kernels with high ALU-to-fetch instruction ratios (e.g., URNG), and less so for memory-intensive benchmarks (e.g., Histogram).

Figure 11b presents the performance achieved by varying the number of stream cores per compute unit. In the BinomialOption kernel we observe a step function, where each step corresponds to a multiple of the wavefront size (64). This behavior is due to the fact that the number of stream cores determines the number of subwavefronts (or time-multiplexed slots) that stream cores deal with for each VLIW bundle. When an increase in the number of stream cores causes a decrease in the number of subwavefronts (e.g., 15 to 16, 21 to 22, and 31 to 32), performance improves. When the number of stream cores matches the number of work-items per wavefront, the bottleneck due to serialized stream core utilization disappears. This effect is not observed for ScanLargeArrays due to its lower wavefront occupancy.

Figure 11c plots the impact of increasing the L1 cache size. For benchmarks that lack temporal locality and exhibit large strided accesses in the data stream, performance is insensitive to increasing the cache size, as seen for Reduction. In contrast, benchmarks with locality are more sensitive to changes in the L1 cache size, as observed for FloydWarshall.
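The step behavior observed in Figure 11b follows directly from the subwavefront mechanism of Section 3.3; the small C helper below (our own illustration, assuming the 64-work-item wavefronts used throughout this section) makes the relationship explicit.

    #define WAVEFRONT_SIZE 64   /* work-items per wavefront */

    /* Time-multiplexed slots needed to issue one VLIW bundle of a full
       wavefront through the given number of stream cores. */
    static int subwavefronts(int num_stream_cores)
    {
        return (WAVEFRONT_SIZE + num_stream_cores - 1) / num_stream_cores;
    }

Going from 15 to 16 stream cores reduces the count from 5 to 4 subwavefronts, 21 to 22 reduces it from 4 to 3, and 31 to 32 reduces it from 3 to 2, which is exactly where the performance steps appear.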
Figure 10: Examples of benchmark characterization, based on program features relevant to GPU performance. IPC is calculated as the total number of instructions for the entire kernel, divided by the total number of cycles to execute the entire kernel. (Benchmarks: BinomialOption, BitonicSort, DCT, DwtHaar1D, FloydWarshall, MatrixMultiplication, MatrixTranspose, PrefixSum, RecursiveGaussian, Reduction, ScanLargeArrays, SobelFilter, URNG.)
Figure 11: Architectural exploration, showing results for those benchmarks with interesting performance trends: (a) scaling the number of compute units; (b) scaling the number of stream cores; (c) scaling the L1 cache size.
5. RELATED WORK

While numerous mature CPU simulators at various levels are available, GPU simulators are still in their infancy. There continues to be a growing need for architectural GPU simulators that model a GPU at the ISA level, and in the near future we will see a more pressing need for a true CPU-GPU heterogeneous simulation framework. This section briefly summarizes existing simulators targeting GPUs.

Barra [12] is an ISA-level functional simulator targeting the NVIDIA G80 GPUs. It runs CUDA executables without any modification. Since NVIDIA's G80 ISA specification is not publicly available, the simulator relies on a reverse-engineered ISA provided by another academic project. Similar to our approach, Barra intercepts API calls to the CUDA library and reroutes them to the simulator. Unfortunately, it is limited to GPU functional simulation, lacking an architectural simulation model.

GPGPUSim [10] is a detailed simulator that models a GPU architecture similar to NVIDIA's. It includes a shader core, interconnects, thread block (work-group) scheduling, and a memory hierarchy. Multi2Sim models a different GPU ISA and architecture (Evergreen). GPGPUSim can provide us with important insight into design problems for GPUs. However, Multi2Sim also supports CPU simulation within the same tool, enabling additional architectural research into heterogeneous architectures.

Ocelot [13] is a widely used functional simulator and dynamic compilation framework that works at a virtual ISA level. Taking NVIDIA's CUDA PTX code as input, it can either emulate it or dynamically translate it to multiple platforms such as x86 CPUs, NVIDIA GPUs, and AMD GPUs. Ocelot has objectives different from GPU architectural simulation, so there is extensive functionality not provided or targeted by Multi2Sim, which makes them complementary tools.

When compared to previous work, Multi2Sim is unique in the following aspects. First, it models the native ISA of a commercially available GPU. Second, it provides an architectural simulation of a real GPU with tractable accuracy. Third, Multi2Sim is a CPU-GPU heterogeneous simulation framework, which can be used to evaluate upcoming architectures where the CPU and GPU are merged on silicon and share a common memory address space [8].

6. CONCLUSIONS

In this paper we have presented Multi2Sim, a full-fledged simulation framework that supports both fast functional and detailed architectural simulation for x86 CPUs and Evergreen GPUs at the ISA level. It is modular, fully configurable, and easy to use. The toolset is actively maintained and is available as a free, open-source project at www.multi2sim.org, together with packages of benchmarks, a complete user guide, and active mailing lists and forums.

Ongoing work for Multi2Sim includes expanding benchmark support by increasing Evergreen ISA coverage. Future releases will include a model for the AMD Fusion architecture, where the CPU and GPU share a common global memory hierarchy and address space. Supporting shared memory for heterogeneous architectures highlights the potential of Multi2Sim, as no other simulator can provide useful architectural statistics in this type of environment. Current development also includes support for OpenGL applications and exploration of OpenCL language extensions. Since Multi2Sim is currently being used by a number of leading research groups, we believe this is a great opportunity to accelerate research on heterogeneous, parallel architectures.

Acknowledgments

This work was supported in part by NSF Award EEC-0946463, and through the support and donations from AMD and NVIDIA. The authors would also like to thank Norman Rubin (AMD) for his advice and feedback on this work.

7. REFERENCES

[1] AMD Accelerated Parallel Processing (APP) Software Development Kit (SDK). http://developer.amd.com/sdks/amdappsdk/.
[2] AMD Accelerated Parallel Processing OpenCL Programming Guide (v1.3c).
[3] AMD Evergreen Family Instruction Set Architecture (v1.0d). http://developer.amd.com/sdks/amdappsdk/documentation/.
[4] AMD Intermediate Language (IL) Specification (v2.0e). http://developer.amd.com/sdks/amdappsdk/documentation/.
[5] Intel Ivy Bridge. http://ark.intel.com/products/codename/29902/Ivy-Bridge.
[6] NVIDIA PTX: Parallel Thread Execution ISA. http://developer.nvidia.com/cuda-downloads/.
[7] OpenCL: The Open Standard for Parallel Programming of Heterogeneous Systems. www.khronos.org/opencl.
[8] The AMD Fusion Family of APUs. http://fusion.amd.com/.
[9] The NVIDIA Denver Project. http://blogs.nvidia.com/.
[10] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In Proc. of the Int'l Symposium on Performance Analysis of Systems and Software (ISPASS), Apr. 2009.
[11] N. L. Binkert, E. G. Hallnor, and S. K. Reinhardt. Network-Oriented Full-System Simulation Using M5. In 6th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW), Feb. 2003.
[12] S. Collange, M. Daumas, D. Defour, and D. Parello. Barra: A Parallel Functional Simulator for GPGPU. In Proc. of the 18th Int'l Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Aug. 2010.
[13] G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark. Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems. In Proc. of the 19th Int'l Conference on Parallel Architectures and Compilation Techniques, Sept. 2010.
[14] P. S. Magnusson et al. Simics: A Full System Simulation Platform. IEEE Computer, 35(2), 2002.
[15] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In Proc. of the 40th Int'l Symposium on Microarchitecture, Dec. 2007.
[16] B. Jang, D. Schaa, P. Mistry, and D. Kaeli. Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures. IEEE Transactions on Parallel and Distributed Systems, 22(1), Jan. 2011.
[17] M. Houston and M. Mantor. AMD Graphics Core Next. http://developer.amd.com/afds/assets/presentations/2620_final.pdf.
[18] G. L. Yuan, A. A. Bakhoda, and T. M. Aamodt. Complexity Effective Memory Access Scheduling for Many-Core Accelerator Architectures. In Proc. of the 42nd Int'l Symposium on Microarchitecture, Dec. 2009.