ASPLOS2017 Gem5 Tutorial
gem5
Andreas Sandberg
Stephan Diestelhorst
William Wang
ARM Research
2 © ARM 2017
Agenda
Presenters: Andreas Sandberg, William Wang, Stephan Diestelhorst (ARM Cambridge, UK)
What is gem5?
Level of detail
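[Figure: simulation modes ordered by level of detail. Surviving labels: HW virtualization (gem5 + kvm) runs at GIPS speeds with no or very limited timing and requires the same host and guest ISA; another mode is labelled “Loosely Timed”.]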
Users and contributors
Widely used in academia and industry
Contributions from
ARM, AMD, Google, …
Wisconsin, Cambridge, Michigan, BSC, …
[Figure: publications with gem5 per year, 2011 to 2016; vertical axis from 0 to 1200]
When not to use gem5
Performance validation
gem5 is not a cycle-accurate microarchitecture model!
This typically requires more accurate models such as RTL simulation.
Commercial products such as ARM CycleModels operate in this space.
Core microarchitecture exploration
Only do this if you have a custom, detailed, CPU model!
gem5’s core models were not designed to replace more accurate microarchitectural models.
To validate functional correctness or test bleeding-edge ISA improvements
gem5 is not as rigorously tested as commercial products.
New (ARMv8.0+) or optional instructions are sometimes not implemented
Commercial products such as ARM FastModels offer better reliability in this space.
Why gem5?
Runs real workloads
Analyze workloads that customers use and care about
… including complex workloads such as Android
Comprehensive model library
Memory and I/O devices
Full OS, Web browsers
Clients and servers
But not a microarchitectural model out of the box!
[Screenshots: Android Nougat; Ubuntu (Linux 4.x)]
Rapid early prototyping
New ideas can be tested quickly
System-level impact can be quantified
System-level insights
Enables us to study complex
memory-system interactions
Can be wired to custom models
Add detail where it matters, when it matters!
Getting Started
William Wang
Prerequisites
Operating system:
OSX, Linux
Limited support for Windows 10 with a Linux environment
Software:
git
Python 2.7 (dev packages)
SCons
gcc 4.8 or clang 3.1 (or newer)
SWIG 2.0.4 or newer
make
Optional:
dtc (to compile device trees)
ARMv8 cross compilers (to compile workloads)
python-pydot (to generate system diagrams)
Compiling gem5
$ scons build/ARM/gem5.opt
Compiling gem5’s device trees
1. sudo apt install device-tree-compiler
2. make -C system/arm/dt
armv8_gem5_v1_Ncpu.dtb
Traditional CMP/SMP configuration with N cores
Built from armv8.dts and platforms/vexpress_gem5_v1.dtsi
armv8_gem5_v1_big_little_M_N.dtb
big.LITTLE configurations with M big cores and N small cores
Built from armv8.dts and platforms/vexpress_gem5_v1.dtsi
Compiling Linux for gem5
1. sudo apt install gcc-aarch64-linux-gnu
2. git clone -b gem5/v4.4 https://github.com/gem5/linux-arm-gem5
3. cd linux-arm-gem5
4. make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- gem5_defconfig
5. make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j `nproc`
Example disk images
Example kernels and disk images can be downloaded from gem5.org/Download
This includes pre-compiled boot loaders
Old but useful to get started
Download and extract this into a new directory:
wget http://www.gem5.org/dist/current/arm/aarch-system-2014-10.tar.xz
mkdir dist; cd dist
tar xvf ../aarch-system-2014-10.tar.xz
Set the M5_PATH variable to point to this directory:
export M5_PATH=/path/to/dist
Most example scripts try to find files using M5_PATH
Kernels/boot loaders/device trees in ${M5_PATH}/binaries
Disk images in ${M5_PATH}/disks
Running an example script
$ build/ARM/gem5.opt configs/example/arm/fs_bigLITTLE.py \
--kernel path/to/vmlinux \
--cpu-type atomic \
--dtb $PWD/system/arm/dt/armv8_gem5_v1_big_little_1_1.dtb \
--disk your_disk_image.img
Simulates a big.LITTLE system with 1+1 cores
Uses a functional ‘atomic’ CPU model
Use the ‘timing’ CPU type for an example OoO + InO configuration
Demo
Configuration and Control
Andreas Sandberg
Design philosophy
gem5 is conceptually a Python library implemented in C++
Configured by instantiating Python classes with matching C++ classes
Model parameters exposed as attributes in Python
Running is controlled from Python, but implemented in C++
Useful tricks
gem5 can be launched interactively
Use the -i option
Pretty prompt if ipython has been installed
Still requires a simulation script
Control flow
[Figure: control flow between Python, C++ and the simulated system. Python creates the Python objects and calls m5.instantiate(), which instantiates the C++ objects. Each m5.simulate() call runs the simulation in C++ until an exit event returns control to Python. Callbacks link the C++ simulation and the simulated system.]
General structure
The simulator contains exactly one Root object
Controls global configuration options
root = Root(full_system=True)
System Overview
Creating a “simple” system
Overriding model parameters
import m5
print 'Exiting @ tick %i: %s' \
    % (m5.curTick(), event.getCause())
• Print why the simulator exited
• Sometimes desirable to call m5.simulate() again.
Creating Checkpoints
m5.checkpoint('name.cpt')
Checkpoint limitations:
The act of taking a checkpoint affects system state!
Checkpoints don’t store cache state
Checkpoints don’t store pipeline state
Restoring Checkpoints
m5.instantiate('name.cpt')
• Instantiate system and load state from checkpoint
Guest to simulation script communication
system.exit_on_work_items = True
…
event = m5.simulate()
• Work item handling in Python
• Exit event will contain information about work items
-----
m5_work_begin(id, 0);
// Region of interest
m5_work_end(id, 0);
• Annotate your regions of interest
Exit Events
event.getCause()                 | event.getCode()         | Description
user interrupt received          | -                       | User pressed Ctrl+C
simulate() limit reached         | -                       | gem5 reached the specified time limit
m5_exit instruction encountered  | Exit code from guest    | Guest executed m5_exit()
m5_fail instruction encountered  | Failure code from guest | Guest executed m5_fail()
checkpoint                       | -                       | Guest executed m5_checkpoint()
workbegin/workend                | Work item ID            | Guest work item annotation
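The causes in the table above are usually handled in a loop around m5.simulate(). The sketch below is plain Python 3 that runs outside gem5: FakeEvent and the scripted event list are hypothetical stand-ins for the real exit event, and the choice of which causes resume the run is an assumption made for illustration.

```python
class FakeEvent:
    """Hypothetical stand-in for the exit event returned by m5.simulate()."""

    def __init__(self, cause, code=0):
        self._cause = cause
        self._code = code

    def getCause(self):
        return self._cause

    def getCode(self):
        return self._code


def run_until_exit(simulate):
    """Call simulate() repeatedly; return (cause, code, log) of the final exit."""
    log = []
    while True:
        event = simulate()
        cause = event.getCause()
        log.append(cause)
        if cause in ("workbegin", "workend", "checkpoint"):
            continue  # bookkeeping exits: resume the simulation
        return cause, event.getCode(), log


# Scripted exits replace a real simulation in this sketch.
scripted = iter([
    FakeEvent("workbegin", 1),
    FakeEvent("workend", 1),
    FakeEvent("m5_exit instruction encountered", 0),
])
cause, code, log = run_until_exit(lambda: next(scripted))
print(cause, code, log)
```

In a real script the lambda would be replaced by m5.simulate, and the work item branches are a natural place to reset or dump statistics.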
Dumping statistics
Can be requested from Python:
m5.stats.dump(): Dump statistics
m5.stats.reset(): Reset stat counters
Examples
Simple full system configuration file: ARM big.LITTLE configuration example
configs/example/arm/{fs_bigLITTLE.py, devices.py}
Demonstrates how to set up a single system
Reasonably small and well documented
Debugging
William Wang
Debugging Facilities
Tracing
Instruction tracing
Diffing traces
Pipeline viewer
Tracing/Debugging
printf() is a nice debugging tool
Keep good print statements in code and selectively enable them
Lots of debug output can be a very good thing when a problem arises
Use DPRINTFs in code
DPRINTF(TLB, "Inserting entry into TLB with pfn:%#x…)
Example flags:
Fetch, Decode, Ethernet, Exec, TLB, DMA, Bus, Cache, O3CPUAll
Print out all flags with ./build/ARM/gem5.opt --debug-help
Sample Run with Debugging
Command Line:
22:44:28 [/work/gem5] ./build/ARM/gem5.opt --debug-flags=Decode \
    --debug-start=50000 --debug-file=my_trace.out configs/example/se.py \
    -c tests/test-progs/hello/bin/arm/linux/hello
…
**** REAL SIMULATION ****
info: Entering event queue @ 0. Starting simulation...
Hello world!
Exiting @ tick 3107500 because target called exit()
my_trace.out:
2:44:47 [ /work/gem5] head m5out/my_trace.out
50000: system.cpu: Decode: Decoded cmps instruction: 0xe353001e
50500: system.cpu: Decode: Decoded ldr instruction: 0x979ff103
51000: system.cpu: Decode: Decoded ldr instruction: 0xe5107004
51500: system.cpu: Decode: Decoded ldr instruction: 0xe4903008
52000: system.cpu: Decode: Decoded addi_uop instruction: 0xe4903008
52500: system.cpu: Decode: Decoded cmps instruction: 0xe3530000
53000: system.cpu: Decode: Decoded b instruction: 0x1affff84
53500: system.cpu: Decode: Decoded sub instruction: 0xe2433003
54000: system.cpu: Decode: Decoded cmps instruction: 0xe353001e
54500: system.cpu: Decode: Decoded ldr instruction: 0x979ff103
Adding Your Own Flag
Print statements are placed in the source code
You are encouraged to add flags to your own models and to contribute any you find particularly useful
Using GDB with gem5
Several gem5 functions are designed to be called from GDB
schedBreakCycle() – also with --debug-break
setDebugFlag()/clearDebugFlag()
dumpDebugStatus()
eventqDump()
SimObject::find()
takeCheckpoint()
Using GDB with gem5
2:44:47 [/work/gem5] gdb --args ./build/ARM/gem5.opt configs/example/fs.py
GNU gdb Fedora (6.8-37.el5)
...
(gdb) b main
Breakpoint 1 at 0x4090b0: file build/ARM/sim/main.cc, line 40.
(gdb) run
Breakpoint 1, main (argc=2, argv=0x7fffa59725f8) at build/ARM/sim/main.cc
main(int argc, char **argv)
(gdb) call schedBreakCycle(1000000)
(gdb) continue
Continuing.
gem5 Simulator System
...
0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
**** REAL SIMULATION ****
info: Entering event queue @ 0. Starting simulation...
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
Using GDB with gem5
(gdb) p _curTick
$1 = 1000000
…but you really don’t want to run the simulation to completion first
util/rundiff
Perl script for diffing two pipes on the fly
util/tracediff
Handy wrapper for using rundiff to compare gem5 outputs
tracediff "a/gem5.opt|b/gem5.opt" --debug-flags=Exec
Compares instruction traces from two builds of gem5
See comments for details
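The idea behind diffing on the fly is to stop at the first line where two traces disagree rather than comparing complete runs. The sketch below is a plain Python 3 illustration of that idea only; it is not how util/rundiff or util/tracediff are implemented, and first_divergence and the sample trace lines are hypothetical.

```python
from itertools import zip_longest


def first_divergence(trace_a, trace_b):
    """Return (index, line_a, line_b) for the first differing line, or None.

    Both arguments may be lazy iterators (for example lines read from two
    pipes), so nothing beyond the first mismatch is consumed.
    """
    for index, (a, b) in enumerate(zip_longest(trace_a, trace_b)):
        if a != b:
            return index, a, b
    return None


good = ["50000: cmps", "50500: ldr", "51000: ldr"]
bad = ["50000: cmps", "50500: ldr", "51000: str"]
print(first_divergence(good, bad))
```

A trace that ends early shows up as a mismatch against None, which makes a crashed or truncated run easy to spot.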
Advanced Trace Diffing
Sometimes if you run into a nasty bug it’s hard to compare apples-to-apples traces
Different cycle counts, different code paths from interrupts/timers
State trace
PTRACE program that runs binary on real system and compares cycle-by-cycle to gem5
Supports ARM, x86, SPARC
See wiki for more information [http://gem5.org/Trace_Based_Debugging]
Checker CPU
Runs a complex CPU model such as the O3 model in tandem with a special
Atomic CPU model
Checker re-executes and compares architectural state for each instruction
executed by complex model at commit
Used to help determine where a complex model begins executing instructions
incorrectly in complex code
Remote Debugging
./build/ARM/gem5.opt configs/example/fs.py
gem5 Simulator System
...
command line: ./build/ARM/gem5.opt configs/example/fs.py
Global frequency set at 1000000000000 ticks per second
info: kernel located at: /dist/binaries/vmlinux.arm
Listening for system connection on port 5900
Listening for system connection on port 3456
0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
info: Entering event queue @ 0. Starting simulation...
Remote Debugging
GNU gdb (Sourcery G++ Lite 2010.09-50) 7.2.50.20100908-cvs
Copyright (C) 2010 Free Software Foundation, Inc.
...
(gdb) symbol-file /dist/binaries/vmlinux.arm
Reading symbols from /dist/binaries/vmlinux.arm...done.
(gdb) set remote Z-packet on
(ARMv7 only; ARMv8 doesn’t need this)
(gdb) set tdesc filename arm-with-neon.xml
(gdb) target remote 127.0.0.1:7000
Remote debugging using 127.0.0.1:7000
cache_init_objs (cachep=0xc7c00240, flags=3351249472) at mm/slab.c:2658
(gdb) step
sighand_ctor (data=0xc7ead060) at kernel/fork.c:1467
(gdb) info registers
r0 0xc7ead060 -940912544
r1 0x5201312
r2 0xc002f1e4 -1073548828
r3 0xc7ead060 -940912544
r4 0x00
r5 0xc7ead020 -940912608
…
O3 Pipeline Viewer
Use --debug-flags=O3PipeView and util/o3-pipeview.py
Adding new models
Andreas Sandberg
How are models implemented
[Figure: the Python description of a model describes its parameters and exported methods. It generates the Python wrappers and the parameter structs used by the C++ model.]
How are models instantiated
[Figure: the simulation script creates Python objects through the Python wrappers; each object's parameters are passed in a parameter struct, and MyObjParams::create() constructs the C++ model.]
Discrete event based simulation
[Figure: timeline. MyObj::startup() schedules an event; when simulated time reaches it, the event handler is called.]
Creating a SimObject
Derive Python class from Python SimObject
Define parameters, ports and configuration
Parameters in Python are automatically turned into C++ struct and passed to C++ object
Add Python file to SConscript
Or, place it in an existing Python file
Recompile
SimObject initialization
Parameters and SimObjects
Parameters to SimObjects are synthesized from Python structures
Object hierarchy in Python reflects the C++ world
This example is from src/dev/arm/Realview.py
SimObject Parameters
Parameters can be:
Scalars – Param.Unsigned(5), Param.Float(5.0), Param.UInt32(42), …
Arrays – VectorParam.Unsigned([1,1,2,3])
SimObjects – Param.PhysicalMemory(…)
Arrays of SimObjects – VectorParam.PhysicalMemory(Parent.any)
Memory address ranges – Param.AddrRange(AddrRange(0, Addr.max))
Normally converted from strings with units:
Latency – Param.Latency('15ns') -> Tick
Frequency – Param.Frequency('100MHz') -> Tick
MemorySize – Param.MemorySize('1GB') -> Bytes
Time – Param.Time('Mon Mar 25 09:00:00 CST 2012')
Ethernet Address – Param.EthernetAddr("90:00:AC:42:45:00")
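The unit conversions listed above can be worked through by hand. The sketch below is plain Python 3 and is not gem5's conversion code: the helper names are hypothetical, the tick rate of 10^12 ticks per second is taken from the "Global frequency" log line elsewhere in this tutorial, and treating GB as 2^30 bytes is an assumption.

```python
import re
from fractions import Fraction

TICKS_PER_SECOND = 10**12  # assumed tick rate: 1 tick = 1 ps

_TIME = {"ps": Fraction(1, 10**12), "ns": Fraction(1, 10**9),
         "us": Fraction(1, 10**6), "ms": Fraction(1, 10**3), "s": Fraction(1)}
_FREQ = {"Hz": 1, "kHz": 10**3, "MHz": 10**6, "GHz": 10**9}
_SIZE = {"B": 1, "kB": 2**10, "MB": 2**20, "GB": 2**30}  # assumed binary prefixes


def _split(text):
    """Split a string such as '15ns' into an exact number and a unit."""
    match = re.match(r"^\s*([0-9.]+)\s*([A-Za-z]+)\s*$", text)
    if match is None:
        raise ValueError("cannot parse %r" % text)
    return Fraction(match.group(1)), match.group(2)


def latency_to_ticks(text):
    value, unit = _split(text)
    return int(value * _TIME[unit] * TICKS_PER_SECOND)


def frequency_to_ticks(text):
    """Clock period, in ticks, for a frequency string."""
    value, unit = _split(text)
    return int(TICKS_PER_SECOND / (value * _FREQ[unit]))


def memory_size_to_bytes(text):
    value, unit = _split(text)
    return int(value * _SIZE[unit])


print(latency_to_ticks("15ns"))      # 15 ns at 1 ps per tick
print(frequency_to_ticks("100MHz"))  # period of a 100 MHz clock
print(memory_size_to_bytes("1GB"))
```

Exact fractions are used instead of floats so that values such as 1.5ns convert without rounding error.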
Auto-generated Header file
Generated C++ header:
#ifndef __PARAMS__Pl011__
#define __PARAMS__Pl011__
class Pl011;
#include <cstddef>
#include "base/types.hh"
#include "params/Gic.hh"
#include "params/Uart.hh"
struct Pl011Params
    : public UartParams
{
    Pl011 * create();    // Factory method
    uint32_t int_num;
    Gic * gic;
    bool end_on_eot;
    Tick int_delay;
};
#endif // __PARAMS__Pl011__
Python source:
class Pl011(Uart):
    type = 'Pl011'
    gic = Param.Gic(Parent.any, …)
    int_num = Param.UInt32(…)
    end_on_eot = Param.Bool(False, "End …)
    int_delay = Param.Latency("100ns", "Time …")
How Parameters are used in C++
src/dev/arm/pl011.cc:
Pl011::Pl011(const Pl011Params *p)
: Uart(p), …,
intNum(p->int_num), gic(p->gic),
endOnEOT(p->end_on_eot), intDelay(p->int_delay)
{
…
}
You can also access parameters through params() accessor after instantiation.
Creating/Using Events
One of the most common things in an event driven simulator is
scheduling events
Declaring events and handlers is easy:
/* Handle when a timer event occurs */
void timerHappened();
EventWrapper<MyClass, &MyClass::timerHappened> event;
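To make the schedule-then-callback pattern concrete, here is a toy discrete-event queue in plain Python 3. It is a sketch of the general technique, not gem5's event queue: EventQueue, Timer and their methods are hypothetical names, and the startup() method only loosely mirrors the role of MyObj::startup().

```python
import heapq


class EventQueue:
    """Minimal event queue: events are (tick, handler) pairs run in tick order."""

    def __init__(self):
        self.cur_tick = 0
        self._heap = []
        self._seq = 0  # tie-breaker keeps same-tick events in schedule order

    def schedule(self, when, handler):
        assert when >= self.cur_tick, "cannot schedule an event in the past"
        heapq.heappush(self._heap, (when, self._seq, handler))
        self._seq += 1

    def run(self, limit):
        """Process events until the queue is empty or the tick limit is hit."""
        while self._heap and self._heap[0][0] <= limit:
            when, _, handler = heapq.heappop(self._heap)
            self.cur_tick = when
            handler()


class Timer:
    """Toy model whose handler reschedules itself a fixed number of times."""

    def __init__(self, eventq, period, repeats):
        self.eventq = eventq
        self.period = period
        self.remaining = repeats
        self.fired_at = []

    def startup(self):
        self.eventq.schedule(self.eventq.cur_tick + self.period,
                             self.timer_happened)

    def timer_happened(self):
        self.fired_at.append(self.eventq.cur_tick)
        self.remaining -= 1
        if self.remaining > 0:
            self.eventq.schedule(self.eventq.cur_tick + self.period,
                                 self.timer_happened)


eventq = EventQueue()
timer = Timer(eventq, period=1000, repeats=3)
timer.startup()
eventq.run(limit=10000)
print(timer.fired_at)
```

Simulated time only advances when an event is processed, which is why idle periods cost nothing to simulate.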
Checkpointing SimObject State
If your object has state, it needs to be written to the checkpoint
Checkpointing takes place on a drained simulator
Draining ensures that microarchitectural state is flushed
Models may need to flush pipelines and wait for outstanding requests to finish.
Checkpoint implemented by overriding
SimObject::serialize(CheckpointOut &)
Save necessary state
No need to store parameters from the config system!
Use SERIALIZE_*() macros or paramOut
To implement restore, override
SimObject::unserialize(CheckpointIn &)
Use UNSERIALIZE_*() macros or paramIn
Creating a checkpoint
Restoring from a checkpoint
Draining
[Flowchart: the script requests draining and repeats until the system reports that it is drained (Yes), at which point draining is done.]
Checkpointing Example
// uint16_t control;
void
Pl011::serialize(CheckpointOut &cp) const
{
SERIALIZE_SCALAR(control);
}
void
Pl011::unserialize(CheckpointIn &cp)
{
UNSERIALIZE_SCALAR(control);
}
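The same save and restore pattern can be tried out in a few lines of plain Python 3. This is a toy analogue, not gem5 code: ToyUart is hypothetical and a dict stands in for the checkpoint. It shows why only run-time state is written, while configuration parameters are supplied again by the script on restore.

```python
class ToyUart:
    """Toy device with one configuration parameter and one piece of state."""

    def __init__(self, int_num):
        self.int_num = int_num  # configuration parameter: not checkpointed
        self.control = 0        # run-time state: must be checkpointed

    def serialize(self, cp):
        cp["control"] = str(self.control)

    def unserialize(self, cp):
        self.control = int(cp["control"])


original = ToyUart(int_num=44)
original.control = 0x301
checkpoint = {}
original.serialize(checkpoint)

# On restore the object is rebuilt from the config script first,
# then its state is loaded from the checkpoint.
restored = ToyUart(int_num=44)
restored.unserialize(checkpoint)
print(hex(restored.control))
```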
Good Examples
Simple IO devices: IsaFake
See: src/dev/isa_fake.{cc,hh} and src/dev/Device.py
Demonstrates a basic memory-mapped device using the BasicPioDevice base class
PCI devices: PciVirtIO
See: src/dev/virtio/pci.{cc,hh} and src/dev/VirtIO.py
PCI device with a single BAR and interrupts
More complex PCI device: CopyEngine
See: src/dev/pci/copy_engine.{cc,hh} and src/dev/pci/CopyEngine.py
PCI device with DMA support
Python exports: PowerModelState
See: src/sim/power/PowerModelState.py
Exports two methods (getDynamicPower & getStaticPower) to Python
<Insert coffee break here>
Memory System
Stephan Diestelhorst
Goals
Model a system with heterogeneous applications, running on a set of
heterogeneous processing engines, using heterogeneous memories and
interconnect
CPU centric: capture memory system behaviour accurately enough
Memory centric: Investigate memory subsystem and interconnect architectures
[Figure: interconnects attached to heterogeneous memories: DRAM, 3D-DRAM, SRAM, NAND, PCM, STT-RAM]
Goals, contd.
Two worlds...
Computation-centric simulation
e.g. SimpleScalar, Asim etc
More behaviourally oriented, with ad-hoc ways of describing parallel behaviours and
intercommunication
Communication-centric simulation
e.g. SystemC+TLM2 (IEEE standard)
More structurally oriented, with parallelism and interoperability as a key component
Deterministic
fixed random number seed
no dependence on host addresses
Multi-Queue
multiple workers
[Figure: cache model timeline showing a cache lookup and the cache response against time, relative to curTick]
Ports, Masters and Slaves
MemObjects are connected through master and slave ports
A master module has at least one master port, a slave module at least one slave
port, and an interconnect module at least one of each
A master port always connects to a slave port
Similar to TLM-2 notation
[Figure: master, interconnect and slave modules; example with a CPU, I$ and D$, a bus, memory0 and memory1]
Functional
Debug interface that doesn’t affect coherency states.
Blocking: Requests complete within a single call chain.
Communication Monitor
Insert as a structural component where stats are desired
memmonitor = CommMonitor()
membus.master = memmonitor.slave
memmonitor.master = memctrl.slave
Footprint estimation
[Figure: latency distribution; x-axis: Latency (ns), y-axis: Distribution (%), 0 to 60]
Traffic generator
Test scenarios for memory system regression and performance validation
High-level of control for scenario creation
Black-box models for components that are not yet modeled
Video/baseband/accelerator for memory-system loading
Inject requests based on (probabilistic) state-transition diagrams
Idle, random, linear and trace replay states
[Figure: generated addresses for the different states; surviving labels: Address, idle]
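A probabilistic state-transition diagram can be sketched as a small table of next-state probabilities. The code below is plain Python 3 and does not use gem5's traffic generator or its configuration format: generate_states, the state names in TRANSITIONS and the probabilities are all hypothetical. A fixed seed keeps the sequence reproducible.

```python
import random

# Hypothetical diagram: each state lists (next_state, probability) pairs.
TRANSITIONS = {
    "idle": [("linear", 0.5), ("random", 0.5)],
    "linear": [("idle", 1.0)],
    "random": [("linear", 0.3), ("idle", 0.7)],
}


def generate_states(transitions, start, steps, seed=0):
    """Walk the diagram for a number of steps; return the visited states."""
    rng = random.Random(seed)  # private, seeded generator: repeatable runs
    state = start
    visited = [state]
    for _ in range(steps):
        draw, cumulative = rng.random(), 0.0
        for candidate, probability in transitions[state]:
            cumulative += probability
            nxt = candidate
            if draw < cumulative:
                break
        state = nxt
        visited.append(state)
    return visited


print(generate_states(TRANSITIONS, "idle", 10, seed=1))
```

In a fuller model each state would also carry its own address pattern and duration; only the transitions are shown here.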
Memory controllers
All memories in the system inherit from AbstractMemory
Basic single-channel memory controller
Instantiate multiple times if required
Interleaving support added in the bus/crossbar (to be posted)
SimpleMemory
Fixed latency (possibly with a variance)
Fixed throughput (request throttling without buffering)
SimpleDRAM
High-level configurable DRAM controller model to mimic DDRx, LPDDRx, WideIO, HBM etc
Memory organization: ranks, banks, row-buffer size
Controller architecture: Read/write buffers, open/close page, mapping, scheduling policy
Key timing constraints: tRCD, tCL, tRP, tBURST, tRFC, tREFI, tTAW/tFAW
Top-down controller model
Don’t model the actual DRAM, only the timing constraints
DDR3/4, LPDDR2/3/4, WIO1/2, GDDR5, HBM, HMC, even PCM
See src/mem/DRAMCtrl.py and src/mem/dram_ctrl.{hh, cc}
[Figure: DRAM memory controller with a read queue and a write queue]
Organization parameters: device width, burst length, #ranks, #banks, page size
Timing parameters: tRCD, tCL, tRP, tRAS, tBURST, tRFC & tREFI, tWTR, tRRD, tFAW/tTAW, …
Hansson et al, Simulating DRAM controllers for future system architecture exploration, ISPASS’14
Controller model correlation
Comparing with a real memory controller
Synthetic traffic sweeping bytes per activate and number of banks
See configs/dram/sweep.py and util/dram_sweep_plot.py
[Figure: energy saving due to power-down (%), axis from 0 to 90]
[Figure: BBench DRAM energy analysis (LPDDR3 x32), GPU-AngryBirds; legend: active energy, refresh energy; surviving value: 36%]
Naji et al, A High-Level DRAM Timing, Power and Area Exploration Tool, SAMOS’15
Address interleaving
Multi-channel memory support is essential
Emerging DRAM standards are multi-channel by nature
(LPDDR4, WIO1/2, HBM1/2, HMC)
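As a worked example of what interleaving means, the sketch below maps addresses to channels by taking the address in units of the interleaving granularity, modulo the number of channels. It is plain Python 3 and only illustrates the arithmetic: channel_of and its parameters are hypothetical, and gem5's own configuration of interleaved address ranges is not shown.

```python
def channel_of(addr, num_channels, granularity):
    """Channel index for an address; both parameters must be powers of two."""
    for value in (num_channels, granularity):
        if value <= 0 or value & (value - 1):
            raise ValueError("expected a power of two, got %d" % value)
    return (addr // granularity) % num_channels


# Four channels interleaved every 128 bytes: consecutive 128-byte blocks
# rotate through the channels and then wrap around.
channels = [channel_of(a, num_channels=4, granularity=128)
            for a in range(0, 640, 128)]
print(channels)
```

A finer granularity spreads a single stream over more channels; a coarser one keeps more of each row-buffer page on one channel.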
XBar
Bridge
Connects two buses
Queues requests and forwards them
Configurable amount of queuing space for requests and responses
[Figure: XBar, Bridge, XBar]
Caches
Single cache model with several components:
Cache: request processing, miss handling, coherence
Tags: data storage and replacement (LRU, Random, etc.)
Prefetcher: N-Block Ahead, Tagged Prefetching, Stride Prefetching
MSHR & MSHRQueue: track pending/outstanding requests
Also used for write buffer
Parameters: size, hit latency, block size, associativity, number of MSHRs (max outstanding requests)
[Figure: cache block diagram with Prefetch, Tags, Data and MSHR components]
Coherence protocol
MOESI bus-based snooping protocol
Support nearly arbitrary multi-level hierarchies at the expense of some realism
Does not enforce inclusion
Magic “express snoops” propagate upward in zero time
Avoid complex race conditions when snoops get delayed
Timing is similar to some real-world configurations
L2 keeps copies of all L1 tags
L2 and L1s snooped in parallel
Snoop (probe) filtering
Broadcast-based coherence protocol
Incurs performance and power cost
Does not reflect realistic implementations
Ruby for Networks and Coherence
As an alternative to its native memory system gem5 also integrates Ruby
Create networked interconnects based on domain-specific language (SLICC) for
coherence protocols
Detailed statistics
e.g., Request size/type distribution, state transition frequencies, etc...
Detailed component simulation
Network (fixed/flexible pipeline and simple)
Caches (Pluggable replacement policies)
Supports Alpha and x86
Limited ARM support about to be added
Limited support for functional accesses
Instantiating and Connecting Objects
class BaseCPU(MemObject):
    icache_port = MasterPort("Instruction Port")
    dcache_port = MasterPort("Data Port")
    …
class BaseCache(MemObject):
    cpu_side = SlavePort("Port on side closer to CPU")
    mem_side = MasterPort("Port on side closer to MEM")
    ...
class Bus(MemObject):
    slave = VectorSlavePort("vector port for connecting masters")
    master = VectorMasterPort("vector port for connecting slaves")
    …
system.cpu.icache_port = system.icache.cpu_side
system.cpu.dcache_port = system.dcache.cpu_side
system.icache.mem_side = system.l2bus.slave
system.dcache.mem_side = system.l2bus.slave
[Figure: CPU connected to I$ and D$, which connect through the bus to memory]
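The rule that a master port always connects to a slave port can be captured in a few lines. The sketch below is plain Python 3 and is not gem5's port implementation: the Port classes and bind() are hypothetical, vector ports are left out, and the port names are only borrowed from the example above.

```python
class Port:
    def __init__(self, name):
        self.name = name
        self.peer = None


class MasterPort(Port):
    pass


class SlavePort(Port):
    pass


def bind(master, slave):
    """Connect a master port to a slave port, rejecting any other pairing."""
    if not isinstance(master, MasterPort) or not isinstance(slave, SlavePort):
        raise TypeError("a master port must connect to a slave port")
    if master.peer is not None or slave.peer is not None:
        raise RuntimeError("port is already connected")
    master.peer, slave.peer = slave, master


icache_port = MasterPort("cpu.icache_port")
cpu_side = SlavePort("icache.cpu_side")
mem_side = MasterPort("icache.mem_side")

bind(icache_port, cpu_side)
print(icache_port.peer.name)
```

Trying to bind two master ports, for example bind(mem_side, icache_port), raises TypeError in this toy model.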
Requests & Packets
Protocol stack based on Requests and Packets
Uniform across all MemObjects (with the exception of Ruby)
Aimed at modelling general memory-mapped interconnects
A master module, e.g. a CPU, changes the state of a slave module, e.g. a memory, through a
Request transported between master ports and slave ports using Packets
if (req_pkt->needsResponse()) {
    req_pkt->makeResponse();
} else {
    delete req_pkt;
}
...
...
delete resp_pkt;
[Figure: CPU and memory exchanging request and response packets]
Requests & Packets
Requests contain information persistent throughout a transaction
Virtual/physical addresses, size
MasterID uniquely identifying the module initiating the request
Stats/debug info: PC, CPU, and thread ID
Requests are transported as Packets
Command (ReadReq, WriteReq, ReadResp, etc.) (MemCmd)
Address/size (may differ from request, e.g., block aligned cache miss)
Pointer to request and pointer to data (if any)
Source & destination port identifiers (relative to interconnect)
Used for routing responses back to the master
Always follow the same path
SenderState opaque pointer
Enables adding arbitrary information along packet path
Functional transport interface
On a master port we send a request packet using sendFunctional
This in turn calls recvFunctional on the connected slave port
For a specific slave port we implement the desired functionality by overriding recvFunctional
Typically check internal (packet) buffers against request packet
For a slave module, turn the request into a response (without altering state)
For an interconnect module, forward the request through the appropriate master port using
sendFunctional
Potentially after performing snoops by issuing sendFunctionalSnoop
Atomic transport interface
On a master port we send a request packet using sendAtomic
This in turn calls recvAtomic on the connected slave port
For a specific slave port we implement the desired functionality by overriding recvAtomic
For a slave module, perform any state updates and turn the request into a response
For an interconnect module, perform any state updates and forward the request through the
appropriate master port using sendAtomic
Potentially after performing snoops by issuing sendAtomicSnoop
Return an approximate latency
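The atomic path above amounts to a chain of nested calls in which each module adds its own delay to the value returned from downstream. The sketch below is plain Python 3, not gem5 code: Memory, Interconnect, recv_atomic and the dict used as a packet are hypothetical, and the latencies are arbitrary.

```python
class Memory:
    def __init__(self, latency):
        self.latency = latency
        self.storage = {}

    def recv_atomic(self, pkt):
        # Slave module: update state and turn the request into a response.
        if pkt["cmd"] == "WriteReq":
            self.storage[pkt["addr"]] = pkt["data"]
            pkt["cmd"] = "WriteResp"
        else:
            pkt["data"] = self.storage.get(pkt["addr"], 0)
            pkt["cmd"] = "ReadResp"
        return self.latency


class Interconnect:
    def __init__(self, latency, downstream):
        self.latency = latency
        self.downstream = downstream

    def recv_atomic(self, pkt):
        # Interconnect module: forward the request and add its own delay.
        return self.latency + self.downstream.recv_atomic(pkt)


bus = Interconnect(latency=2, downstream=Memory(latency=30))
write = {"cmd": "WriteReq", "addr": 0x1000, "data": 42}
read = {"cmd": "ReadReq", "addr": 0x1000, "data": None}
write_latency = bus.recv_atomic(write)
read_latency = bus.recv_atomic(read)
print(write_latency, read["cmd"], read["data"])
```

Because the whole access completes inside one call chain, the returned latency is only an approximation: contention and queuing are not modelled on this path.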
Timing transport interface
On a master port we try to send a request packet using sendTimingReq
This in turn calls recvTimingReq on the connected slave port
For a specific slave port we implement the desired functionality by overriding recvTimingReq
Perform state updates and potentially forward request packet
For a slave module, typically schedule an action to send a response at a later time
A slave port can choose not to accept a request packet by returning false
The slave port later has to call sendRetryReq to alert the master port to try again
Timing transport interface (cont’d)
Responses follow a symmetric pattern in the opposite direction
On a slave port we try to send a response packet using sendTimingResp
This in turn calls recvTimingResp on the connected master port
For a specific master port we implement the desired functionality by overriding recvTimingResp
Perform state updates and potentially forward response packet
For a master module, typically schedule a succeeding request
A master port can choose not to accept a response packet by returning false
The master port later has to call sendRetryResp to alert the slave port to try again
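The accept, reject and retry handshake is easier to follow in code. The sketch below is a plain Python 3 toy of the request direction only, not gem5's port classes: the class and method names are hypothetical and merely echo the names used above, and the slave accepts one request at a time by assumption.

```python
class SlavePort:
    def __init__(self):
        self.master = None
        self.busy = False
        self.retry_needed = False
        self.accepted = []

    def recv_timing_req(self, pkt):
        if self.busy:
            self.retry_needed = True
            return False  # reject: the master must wait for a retry
        self.busy = True
        self.accepted.append(pkt)
        return True

    def finish_current(self):
        """The slave has finished its request and can accept a new one."""
        self.busy = False
        if self.retry_needed:
            self.retry_needed = False
            self.master.recv_req_retry()  # tell the master to try again


class MasterPort:
    def __init__(self, slave):
        self.slave = slave
        slave.master = self
        self.pending = []

    def send(self, pkt):
        self.pending.append(pkt)
        if len(self.pending) == 1:  # otherwise a retry is already awaited
            self._try_send()

    def _try_send(self):
        while self.pending and self.slave.recv_timing_req(self.pending[0]):
            self.pending.pop(0)

    def recv_req_retry(self):
        self._try_send()


slave = SlavePort()
master = MasterPort(slave)
master.send("A")                     # accepted immediately
master.send("B")                     # rejected: the slave is busy with A
before_retry = list(slave.accepted)
slave.finish_current()               # triggers the retry, B is accepted
print(before_retry, slave.accepted)
```

The response direction follows the same pattern with the roles of the two ports swapped.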
CPU Models
Andreas Sandberg
CPU models overview
[Figure: CPU model class hierarchy rooted at BaseCPU; surviving labels: ArmV8KvmCPU, X86KvmCPU, TimingSimpleCPU, AtomicSimpleCPU]
Atomic Simple CPU
On every CPU tick() perform all
operations for an instruction
Timing Simple CPU
Memory accesses use timing path
Detailed CPU Models
Parameterizable pipeline models w/SMT support
Two Types
MinorCPU – Parameterizable in-order pipeline model
O3CPU – Parameterizable out-of-order pipeline model
“Execute in Execute”, detailed modeling
Roughly an order-of-magnitude slower than Simple
Models the timing for each pipeline stage
Forces both timing and execution of simulation to be accurate
Important for Coherence, I/O, Multiprocessor Studies, etc
Key Resources
Cache, Execution, BranchPredictor, etc.
Pipeline stages
Key Resources
Physical Registers, IQ, LSQ, ROB, Functional Units
ThreadContext
Interface for accessing total architectural state of a single thread (PC, registers, etc.)
Holds pointers to important structures (TLB, CPU, etc.)
CPU models typically implement custom versions or use SimpleThread
ExecContext
Abstract interface defining how an instruction interfaces with the CPU model
Accelerating gem5
Switching modes
(kvm +) functional + timing / detailed
Checkpoints
boot Linux -> checkpoint
run multiple configurations in parallel
run multiple checkpoints in parallel
Multi-threading
multiple queues
multiple workers execute events
data sharing and tight coupling limits speedup
Multi-processed gem5
for design space explorations
[Figure: several simulated systems, each running on its own host; surviving labels: simulated system #3, Host #3]
Object Diagram: Simulating a 2-node Cluster Example
[Figure: two simulated compute nodes, each with its own Root and an NSGigE network interface, connected through a simulated Ethernet switch (EtherSwitch) that has its own Root]
Elastic replay
Capture data dependencies and MLP
Predict scalability for SMPs
Additional 10x speedup
[Figure: relative CPI and error (%)]
[Figure: error (0% to 40%) for instructions, IPC, BP miss ratio, DL1 miss ratio, IL1 miss ratio, L2 miss ratio and DRAM read BW]
[Figure: power model construction flow overlaid on a core block diagram (fetch, decode/rename, issue, branch execute, integer execute, vector execute, load & store, L2). Surviving flow labels: simulate gates, toggle rates, decompose, complex aggregation, aggregate, top-down. Surviving hierarchy labels: core, L2, interconnect, DRAM, SOC.]
ODROID-XU3
Exynos-5422
4x Cortex-A7
4x Cortex-A15
2. Record:
• Performance Counters (PMCs)
• Voltage, Power
Power&Energy Framework Overview
[Figure: three cooperating environments]
PE Model Generation Env.: express PE model in gem5 fitting form
Gem5 Simulation Env.: power states** (definition & migration); runtime statistics (voltage, freq, power state, event count); system controller (extendable) with DVFS control registers, energy monitoring registers** and temperature monitor**
S/W Power Management Env.: high-level drivers, CPUFreq driver, low-level drivers
CPU employs power-saving techniques (DVFS, DPM, asymmetric multi-core, e.g. ARM big.LITTLE)
Need accurate power estimations to make performance-power trade-offs
gem5.opt configs/example/arm/fs_power.py \
--caches --kernel vmlinux
./build/ARM/gem5.opt \
    configs/example/arm/fs_bigLITTLE.py \
    --cpu-type kvm \
    --kernel vmlinux --disk my_disk.img \
    --big-cpus 1 --little-cpus 0 \
    --dtb $GEM5/system/arm/dt/armv8_gem5_v1_1cpu.dtb
Terminology:
Intervals – slices in time, sampling granularity (e.g. 10K instructions)
Phases – intervals with similar behavior that often recur periodically
[Figure: IPC over time (intervals 1 to 5) with phases labelled A A B A B; examples: gzip, gcc]
The output of SimPoint analysis is a set of slices and a weight for each slice (choose a clustering
within 5% of the CPI of the full run)
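The slices and weights are combined as a weighted average. The sketch below is plain Python 3 with made-up numbers: the weights, per-slice CPI values and the full-run CPI are hypothetical, chosen only to show the arithmetic and the 5% check.

```python
def weighted_cpi(simpoints):
    """Estimate whole-program CPI from (weight, cpi) pairs, one per slice."""
    total_weight = sum(weight for weight, _ in simpoints)
    return sum(weight * cpi for weight, cpi in simpoints) / total_weight


# Hypothetical slices: (weight, CPI measured in detailed simulation).
simpoints = [(0.5, 1.0), (0.25, 2.0), (0.25, 4.0)]
estimate = weighted_cpi(simpoints)  # 0.5*1.0 + 0.25*2.0 + 0.25*4.0

full_run_cpi = 2.05                 # hypothetical reference value
error = abs(estimate - full_run_cpi) / full_run_cpi
print(estimate, error < 0.05)
```

Only the selected slices need detailed simulation, which is where the speedup over a full run comes from.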
Very limited coverage of full mobile systems’ behavior
[Figure: scatter plot of benchmarks; surviving labels: 473_astar, 483_xalancbmk, 200_sixtrack, 172_mgrid, 450_soplex, 470_lbm, 183_equake, 179_art1/2, 433_milc, 429_mcf, 181_mcf]
Looks for parameters where the average ‘+’ run is very different from ‘-’
Experiments are tolerant to noise
[Figure: ‘+’/‘-’ design matrix over DL1 Lat, DL1 Size and DL1 Assoc]
Clustering based on Similar Characteristics
Characterize and cluster workload phases
Cluster based on performance sensitivity to various hardware parameters
Identification of ideal H/W config per core type
Selectively enable or disable hardware parameters per cluster of similar workload phases to improve their efficiency
Evaluation of Heterogeneous Systems
Optimal Systems
AutoGUI
Record and deterministically playback
GUI interactions
300x speedup of our simulations
SimPoints
Identifies unique phases of software behavior
Good correlation to full runs for statistics of interest
[Figure: PCA scatter plot comparing workloads; surviving labels: Android, rlbench, bbench, caffeinemark, andebench, wps, angrybirds, specInt2000Ref, specInt2006Ref, specFp2000Ref, specFp2006Ref]
[Figure: methodology flow; surviving labels: workload characterization, PCA workload comparison, guided parameter selection, PCA phase comparison, reduced detailed simulation]
Sunwoo, et al. “A Structured Approach to the Simulation, Analysis and Characterization of Smartphone Applications.” Published at IISWC 2013.
How to Contribute to gem5
Andreas Sandberg
Prerequisites
gem5 is distributed under a 3-clause BSD license
See LICENSE in the repository
Body: Swig wrappers for native objects currently share the _m5.internal name
space with Python code. This is undesirable if we ever want to switch
from Swig to some other framework for native binding (e.g., PyBind11
or Boost::Python). This changeset moves all of such wrappers to the
_m5 namespace, which is now reserved for native code.
[Flowchart: wait for reviews (apply stick to reviewer); if the reviewers are not happy, update the change and wait again; if they are happy, continue]
Legal aspects
Patch author’s responsibility, but reviewers should look out for obvious issues.
Where to start:
http://gem5.googlesource.com
Git authentication
Required to push changes for review
Uses https unlike most other installations
Requires an authentication cookie
[Flowchart: if the reviewers are not happy, update the change; once they are, the maintainer decides; when the maintainer is happy, commit the change. Done.]
How to review code
Start with the commit message
Does it make sense?
Is it a change that makes sense in gem5? Why/Why not?
Look at the code
Is it solving the problem in the description?
Is the implementation technically sound? Are there obvious bugs?
Comment on the code and submit a review score
-2: Don’t submit under any circumstances (blocks submission)
…
+2: Looks good, approved!
Be polite and kind
Developers and reviewers are people too!