ASPLOS2017 Gem5 Tutorial

The document is a presentation on architectural exploration using gem5, detailing its features, configuration, and debugging capabilities. It outlines the agenda, prerequisites for getting started, and the design philosophy behind gem5, emphasizing its utility in simulating complex workloads. Additionally, it covers how to compile gem5, create systems, and utilize debugging tools effectively.

Architectural Exploration with gem5

Andreas Sandberg
Stephan Diestelhorst
William Wang
ARM Research

Xi’An: ASPLOS 2017


2017-04-09
© ARM 2017
This is an interactive presentation
Please ask questions! Even if they are in:
• English
• Chinese
• Swedish
• German

Agenda
 Presenters: Andreas Sandberg, William Wang, Stephan Diestelhorst (ARM Cambridge, UK)

 13:00 Introduction (10 min) – Stephan


 13:10 Getting Started (15 min) – William
 13:25 Configuration (25 min) – Andreas
 13:50 Debug & Trace (20 min) – William
 14:10 Creating SimObjects (20 min) – Andreas
 14:30 Coffee Break (30 min)
 15:00 Memory System (40 min) – Stephan
 15:40 CPU Models (20 min) – Andreas
 16:00 Advanced Features (45 min) – all
 16:45 Contributing to gem5 (20 min) – Andreas

What is gem5?

Level of detail
 HW Virtualization
 No or very limited timing
 The same host/guest ISA
 Functional mode
 No timing, chain basic blocks of instructions
 Can add cache models for warming
 Timing mode
 Single time for execute and memory lookup
 Advanced on bundle
 Detailed mode
 Full out-of-order, in-order CPU models
 Hit-under-miss, reordering, …

[Figure: simulation speed versus level of detail. Abstraction levels: HW Virt., Loosely Timed, Approximately Timed, Cycle Accurate. Tools and speeds: gem5 + kvm (GIPS), Qemu (50–200 MIPS), gem5 (0.2–3 MIPS), RTL simulation (1–50 KIPS). Use cases: SW Dev, µarch Exploration, HW Validation, Perf. Validation, High-level perf./power, Architecture exploration.]

Users and contributors
 Widely used in academia and industry
 Contributions from
 ARM, AMD, Google,…
 Wisconsin, Cambridge, Michigan, BSC, …

[Figure: Publications with gem5 per year, 2011–2016; vertical axis 0–1200]

When not to use gem5
 Performance validation
 gem5 is not a cycle-accurate microarchitecture model!
 This typically requires more accurate models such as RTL simulation.
 Commercial products such as ARM CycleModels operate in this space.
 Core microarchitecture exploration
 Only do this if you have a custom, detailed, CPU model!
 gem5’s core models were not designed to replace more accurate microarchitectural models.
 To validate functional correctness or test bleeding-edge ISA improvements
 gem5 is not as rigorously tested as commercial products.
 New (ARMv8.0+) or optional instructions are sometimes not implemented
 Commercial products such as ARM FastModels offer better reliability in this space.

Why gem5?
 Runs real workloads
 Analyze workloads that customers use and care about
 … including complex workloads such as Android
 Comprehensive model library
 Memory and I/O devices
 Full OS, Web browsers
 Clients and servers
 Rapid early prototyping
 New ideas can be tested quickly
 System-level impact can be quantified
 System-level insights
 Enables us to study complex memory-system interactions
 Can be wired to custom models
 Add detail where it matters, when it matters!

But not a microarchitectural model out of the box!

[Figure: Android Nougat and Ubuntu (Linux 4.x)]

Getting Started

William Wang

Prerequisites
 Operating system:
 OSX, Linux
 Limited support for Windows 10 with a Linux environment
 Software:
 git
 Python 2.7 (dev packages)
 SCons
 gcc 4.8 or clang 3.1 (or newer)
 SWIG 2.0.4 or newer
 make
 Optional:
 dtc (to compile device trees)
 ARMv8 cross compilers (to compile workloads)
 python-pydot (to generate system diagrams)

Compiling gem5
$ scons build/ARM/gem5.opt

 Guest architecture
 Several architectures in the source tree.
 Most common ones are:
 ARM
 NULL – Used for trace-driven simulation
 X86 – Popular in academia, but very strange timing behavior
 Optimization level:
 debug: Debug symbols, no/few optimizations
 opt: Debug symbols + most optimizations
 fast: No symbols + even more optimizations

Compiling gem5’s device trees
1. sudo apt install device-tree-compiler
2. make -C system/arm/dt

 Device trees are used to describe hard-to-discover devices

 armv8_gem5_v1_Ncpu.dtb
 Traditional CMP/SMP configuration with N cores
 Built from armv8.dts and platforms/vexpress_gem5_v1.dtsi
 armv8_gem5_v1_big_little_M_N.dtb
 bigLittle configurations with M big cores and N small cores
 Built from armv8.dts and platforms/vexpress_gem5_v1.dtsi

Compiling Linux for gem5
1. sudo apt install gcc-aarch64-linux-gnu
2. git clone -b gem5/v4.4 https://github.com/gem5/linux-arm-gem5
3. cd linux-arm-gem5
4. make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- gem5_defconfig
5. make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j `nproc`

 Builds the default kernel configuration for gem5


 Has support for most of the devices that gem5 supports

Example disk images
 Example kernels and disk images can be downloaded from gem5.org/Download
 This includes pre-compiled boot loaders
 Old but useful to get started
 Download and extract this into a new directory:
 wget http://www.gem5.org/dist/current/arm/aarch-system-2014-10.tar.xz
 mkdir dist; cd dist
 tar xvf ../aarch-system-2014-10.tar.xz
 Set the M5_PATH variable to point to this directory:
 export M5_PATH=/path/to/dist
 Most example scripts try to find files using M5_PATH
 Kernels/boot loaders/device trees in ${M5_PATH}/binaries
 Disk images in ${M5_PATH}/disks

Running an example script
$ build/ARM/gem5.opt configs/example/arm/fs_bigLITTLE.py \
--kernel path/to/vmlinux \
--cpu-type atomic \
--dtb $PWD/system/arm/dt/armv8_gem5_v1_big_little_1_1.dtb \
--disk your_disk_image.img
 Simulates a bL system with 1+1 cores
 Uses a functional ‘atomic’ CPU model
 Use the ‘timing’ CPU type for an example OoO + InO configuration

Demo

Configuration and Control

Andreas Sandberg

Design philosophy
 gem5 is conceptually a Python library implemented in C++
 Configured by instantiating Python classes with matching C++ classes
 Model parameters exposed as attributes in Python
 Running is controlled from Python, but implemented in C++

 Configuration and running are two distinct steps


 Configuration phase ends with a call to instantiate the C++ world
 Parameters cannot be changed after the C++ world has been created

Useful tricks
 gem5 can be launched interactively
 Use the -i option
 Pretty prompt if ipython has been installed
 Still requires a simulation script

 Ignore configs/example/{fs,se}.py and configs/common/FSConfig.py


 Far too complex
 Tries to handle every single use case in a single configuration file

 Good configuration examples:


 configs/learning_gem5/
 configs/example/arm/

Control flow
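[Figure: control flow across three layers over time.
Python: create Python objects, then m5.instantiate() (instantiate objects), then m5.simulate() (run simulation), then m5.simulate() again (run simulation).
C++: instantiate C++ objects, then simulate in C++ for each m5.simulate() call; an exit event hands control back to Python.
Simulated system: running guest code during each simulate call; callback arrows link the C++ layer and the simulated system.]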

General structure
 The simulator contains exactly one Root object
 Controls global configuration options
 root = Root(full_system=True)

 The root object contains one or more System instances


 A system represents a shared memory machine
 Contains devices, CPUs, and memories

 Multiple systems may be connected using network interfaces


 Cluster on cluster simulation
 Not within the scope of this presentation

System Overview

Creating a “simple” system

 The system contains basic platform devices


 Interrupt controllers, PCI bridge, debug UART
 Sets up the boot loader and kernel as well
 See examples in configs/example/arm:
 SimpleSystem (devices.py) defines a basic ARM system with PCI support
 Instantiated by createSystem() in fs_bigLITTLE.py

Overriding model parameters
import m5

class L1DCache(m5.objects.Cache):    # Use gem5’s base Cache
    assoc = 2                        # Override associativity
    size = '16kB'                    # Override size

class L1ICache(L1DCache):            # Use defaults from L1DCache
    assoc = 16                       # Override associativity again

# Override parameters at instantiation time
l1i = L1ICache(assoc=8,
               repl=m5.objects.RandomRepl())

• We’ll cover memory ports later
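
The override order on this slide (base-class defaults, subclass overrides, then keyword arguments at instantiation) can be mimicked in plain Python. The sketch below is a toy stand-in that does not import gem5: `ToyParamObject`, `ToyCache`, the `resolve()` helper and the base default values are hypothetical and only illustrate the lookup order.

```python
class ToyParamObject(object):
    """Toy stand-in for a configurable object (not a gem5 class)."""

    def __init__(self, **overrides):
        # Keyword arguments override class-level defaults for this instance.
        for name, value in overrides.items():
            setattr(self, name, value)

    def resolve(self, name):
        # Normal attribute lookup: instance first, then the class hierarchy.
        return getattr(self, name)


class ToyCache(ToyParamObject):
    assoc = 4          # assumed base default, for illustration only
    size = '64kB'      # assumed base default, for illustration only


class L1DCache(ToyCache):
    assoc = 2          # override associativity
    size = '16kB'      # override size


class L1ICache(L1DCache):
    assoc = 16         # override associativity again, keep size from L1DCache


l1i = L1ICache(assoc=8)   # override at instantiation time
l1d = L1DCache()
```

Here `l1i.resolve('assoc')` picks up the instantiation-time value, while `l1i.resolve('size')` falls through to the value inherited from `L1DCache`.
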
Running
m5.instantiate()                     # Instantiate the C++ world

event = m5.simulate()                # Start the simulation

# Print why the simulator exited
print 'Exiting @ tick %i: %s' \
    % (m5.curTick(),
       event.getCause())

# Sometimes desirable to call m5.simulate() again.
# Run for a fixed number of simulated seconds
m5.simulate(m5.ticks.fromSeconds(0.1))

Creating Checkpoints
m5.checkpoint('name.cpt')

 Checkpoints can be used to store the simulator’s state


 Can be used to implement SimPoints or similar methodologies

 Checkpoint limitations:
 The act of taking a checkpoint affects system state!
 Checkpoints don’t store cache state
 Checkpoints don’t store pipeline state

Restoring Checkpoints
# Instantiate system and load state from checkpoint
m5.instantiate('name.cpt')

event = m5.simulate()                # Run in the same way as before

Guest to simulation script communication
Simulation script (Python):
system.exit_on_work_items = True     # Work item handling in Python
event = m5.simulate()                # Exit event will contain information about work items
-----
Guest code (C):
#include "m5op.h"                    // Include the m5op header; remember to link with libm5.a

m5_work_begin(id, 0);                // Annotate your regions of interest
// Region of interest
m5_work_end(id, 0);

Exit Events
event.getCause()                  event.getCode()           Description
user interrupt received           -                         User pressed Ctrl+C
simulate() limit reached          -                         gem5 reached the specified time limit
m5_exit instruction encountered   Exit code from guest      Guest executed m5_exit()
m5_fail instruction encountered   Failure code from guest   Guest executed m5_fail()
checkpoint                        -                         Guest executed m5_checkpoint()
workbegin/workend                 Work item ID              Guest work item annotation
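
A simulation script typically branches on the cause string after each `m5.simulate()` call. The following is a minimal sketch of such a dispatch using the cause strings from the table; `FakeExitEvent` and `next_action` are hypothetical names so that the example runs without gem5, and the actions chosen are only one possible policy.

```python
class FakeExitEvent(object):
    """Hypothetical stand-in for the object returned by m5.simulate()."""

    def __init__(self, cause, code=0):
        self._cause = cause
        self._code = code

    def getCause(self):
        return self._cause

    def getCode(self):
        return self._code


def next_action(event):
    """Map an exit event to what a simulation script might do next."""
    cause = event.getCause()
    if cause == "checkpoint":
        return "take checkpoint, then continue"
    if cause in ("workbegin", "workend"):
        return "work item %d: %s" % (event.getCode(), cause)
    if cause.startswith("m5_exit") or cause.startswith("m5_fail"):
        return "stop with guest code %d" % event.getCode()
    # "user interrupt received", "simulate() limit reached", anything else
    return "stop"
```

In a real script the same branching would sit inside a loop that calls `m5.simulate()` again whenever the action is to continue.
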
Dumping statistics
 Can be requested from Python:
 m5.stats.dump(): Dump statistics
 m5.stats.reset(): Reset stat counters

 Guest command line:


 m5 dumpstats [[delay] [period]]
 m5 dumpresetstats [[delay] [period]]

 Guest code using libm5.a:


 m5_dump_stats(delay, periodicity): Dump statistics
 m5_dumpreset_stats(delay, periodicity): Dump & reset statistics

Examples
 Simple full system configuration file: ARM big.LITTLE configuration example
 configs/example/arm/{fs_bigLITTLE.py, devices.py}
 Demonstrates how to set up a single system
 Reasonably small and well documented

 Distributed multi-system configuration:


 configs/example/arm/dist_bigLITTLE.py
 Reuses the configuration file above

 Simple syscall emulation mode example: Jason Lowe-Power’s Learning gem5


 configs/learning_gem5/part1

Debugging

William Wang

Debugging Facilities
 Tracing
 Instruction tracing
 Diffing traces

 Using gdb to debug gem5


 Debugging C++ and gdb-callable functions
 Remote debugging

 Pipeline viewer

Tracing/Debugging
 printf() is a nice debugging tool
 Keep good print statements in code and selectively enable them
 Lots of debug output can be a very good thing when a problem arises
 Use DPRINTFs in code
 DPRINTF(TLB, "Inserting entry into TLB with pfn:%#x…)

 Example flags:
 Fetch, Decode, Ethernet, Exec, TLB, DMA, Bus, Cache, O3CPUAll
 Print out all flags with ./build/ARM/gem5.opt --debug-help

 Enabled on the command line


 --debug-flags=Exec
 --debug-start=30000
 --debug-file=my_trace.out
 Enable the flag Exec; Start at tick 30000; Write to my_trace.out

Sample Run with Debugging
Command Line:
22:44:28 [/work/gem5] ./build/ARM/gem5.opt --debug-flags=Decode --debug-start=50000 --debug-file=my_trace.out configs/example/se.py -c tests/test-progs/hello/bin/arm/linux/hello

**** REAL SIMULATION ****
info: Entering event queue @ 0. Starting simulation...
Hello world!
Exiting @ tick 3107500 because target called exit()

my_trace.out:
2:44:47 [ /work/gem5] head m5out/my_trace.out
50000: system.cpu: Decode: Decoded cmps instruction: 0xe353001e
50500: system.cpu: Decode: Decoded ldr instruction: 0x979ff103
51000: system.cpu: Decode: Decoded ldr instruction: 0xe5107004
51500: system.cpu: Decode: Decoded ldr instruction: 0xe4903008
52000: system.cpu: Decode: Decoded addi_uop instruction: 0xe4903008
52500: system.cpu: Decode: Decoded cmps instruction: 0xe3530000
53000: system.cpu: Decode: Decoded b instruction: 0x1affff84
53500: system.cpu: Decode: Decoded sub instruction: 0xe2433003
54000: system.cpu: Decode: Decoded cmps instruction: 0xe353001e
54500: system.cpu: Decode: Decoded ldr instruction: 0x979ff103

Adding Your Own Flag
 Print statements put in source code
 We encourage you to add some to your models or contribute ones you find particularly useful

 Macros remove them from the gem5.fast binary


 There is no performance penalty for adding them
 To enable them you need to run gem5.opt or gem5.debug

 Adding one with an existing flag


 DPRINTF(<flag>, "normal printf %s\n", "arguments");

 To add a new flag add the following in a SConscript

 DebugFlag('MyNewFlag')
 Include corresponding header, e.g. #include "debug/MyNewFlag.hh"

Instruction Tracing
 Separate from the general debug/trace facility
 But both are enabled the same way
 Per-instruction records populated as instruction executes
 Start with PC and mnemonic
 Add argument and result values as they become known
 Printed to trace when instruction completes
 Flags for printing cycle, symbolic addresses, etc.
2:44:47 [ /work/gem5] head m5out/my_trace.out
50000: T0 : 0x14468 : cmps r3, #30 : IntAlu : D=0x00000000
50500: T0 : 0x1446c : ldrls pc, [pc, r3 LSL #2] : MemRead : D=0x00014640 A=0x14480
51000: T0 : 0x14640 : ldr r7, [r0, #-4] : MemRead : D=0x00001000 A=0xbeffff0c
51500: T0 : 0x14644.0 : ldr r3, [r0] #8 : MemRead : D=0x00000011 A=0xbeffff10
52000: T0 : 0x14644.1 : addi_uop r0, r0, #8 : IntAlu : D=0xbeffff18
52500: T0 : 0x14648 : cmps r3, #0 : IntAlu : D=0x00000001
53000: T0 : 0x1464c : bne : IntAlu :

Using GDB with gem5
 Several gem5 functions are designed to be called from GDB
 schedBreakCycle() – also with --debug-break
 setDebugFlag()/clearDebugFlag()
 dumpDebugStatus()
 eventqDump()
 SimObject::find()
 takeCheckpoint()

Using GDB with gem5
2:44:47 [/work/gem5] gdb --args ./build/ARM/gem5.opt
configs/example/fs.py
GNU gdb Fedora (6.8-37.el5)
...
(gdb) b main
Breakpoint 1 at 0x4090b0: file build/ARM/sim/main.cc, line 40.
(gdb) run
Breakpoint 1, main (argc=2, argv=0x7fffa59725f8) at
build/ARM/sim/main.cc
main(int argc, char **argv)
(gdb) call schedBreakCycle(1000000)
(gdb) continue
Continuing.
gem5 Simulator System
...
0: system.remote_gdb.listener: listening for remote gdb #0 on
port 7000
**** REAL SIMULATION ****
info: Entering event queue @ 0. Starting simulation...
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000003ccb6306f7 in kill () from /lib64/libc.so.6

Using GDB with gem5
(gdb) p _curTick
$1 = 1000000

(gdb) call setDebugFlag("Exec")


(gdb) call schedBreakCycle(1001000)
(gdb) continue
Continuing.

1000000: system.cpu T0 : @_stext+148.1 : addi_uop r0, r0, #4 : IntAlu : D=0x00004c30
1000500: system.cpu T0 : @_stext+152 : teqs r0, r6 : IntAlu : D=0x00000000
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000003ccb6306f7 in kill () from /lib64/libc.so.6
(gdb) print SimObject::find("system.cpu")
$2 = (SimObject *) 0x19cba130
(gdb) print (BaseCPU*)SimObject::find("system.cpu")
$3 = (BaseCPU *) 0x19cba130
(gdb) p $3->instCnt
$4 = 431

Diffing Traces
 Often useful to compare traces from two simulations
 Find where known good and modified simulators diverge

 Standard diff only works on files (not pipes)

 …but you really don’t want to run the simulation to completion first

 util/rundiff
 Perl script for diffing two pipes on the fly

 util/tracediff
 Handy wrapper for using rundiff to compare gem5 outputs
 tracediff "a/gem5.opt|b/gem5.opt" --debug-flags=Exec
 Compares instruction traces from two builds of gem5
 See comments for details
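
The idea of comparing two traces on the fly, without waiting for either simulation to finish, can be sketched in a few lines of Python 3. This is not how util/rundiff or util/tracediff are implemented; `strip_tick` and `first_divergence` are hypothetical helpers, and the assumed trace format is the `<tick>: <rest>` layout shown in the sample output earlier.

```python
import itertools


def strip_tick(line):
    """Drop the leading '<tick>: ' field so timing differences are ignored."""
    head, sep, rest = line.partition(": ")
    return rest if sep and head.strip().isdigit() else line


def first_divergence(trace_a, trace_b, ignore_ticks=False):
    """Return (index, line_a, line_b) for the first mismatch, or None.

    Both arguments may be any iterables of lines (files, pipes, lists);
    they are consumed lazily, one line at a time.
    """
    pairs = itertools.zip_longest(trace_a, trace_b)
    for index, (a, b) in enumerate(pairs):
        if a is None or b is None:
            return index, a, b          # one trace ended early
        ka, kb = (strip_tick(a), strip_tick(b)) if ignore_ticks else (a, b)
        if ka != kb:
            return index, a, b
    return None
```

Because the inputs are plain iterables, the same function works on two open pipes, and it stops reading as soon as the first mismatch is found.
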

Advanced Trace Diffing
 Sometimes if you run into a nasty bug it’s hard to compare apples-to-apples traces
 Different cycle counts, different code paths from interrupts/timers

 Some mechanisms that can help:


 -ExecTicks don’t print out ticks
 -ExecKernel don’t print out kernel code
 -ExecUser don’t print out user code
 ExecAsid print out ASID of currently running process

 State trace
 PTRACE program that runs binary on real system and compares cycle-by-cycle to gem5
 Supports ARM, x86, SPARC
 See wiki for more information [http://gem5.org/Trace_Based_Debugging]

Checker CPU
 Runs a complex CPU model such as the O3 model in tandem with a special
Atomic CPU model
 Checker re-executes and compares architectural state for each instruction
executed by complex model at commit
 Used to help determine where a complex model begins executing instructions
incorrectly in complex code

 Checker cannot be used to debug MP or SMT systems


 Checker cannot verify proper handling of interrupts
 Certain instructions must be marked unverifiable, e.g. “wfi”

Remote Debugging
./build/ARM/gem5.opt configs/example/fs.py
gem5 Simulator System
...
command line: ./build/ARM/gem5.opt configs/example/fs.py
Global frequency set at 1000000000000 ticks per second
info: kernel located at: /dist/binaries/vmlinux.arm
Listening for system connection on port 5900
Listening for system connection on port 3456
0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
info: Entering event queue @ 0. Starting simulation...

Remote Debugging
GNU gdb (Sourcery G++ Lite 2010.09-50) 7.2.50.20100908-cvs
Copyright (C) 2010 Free Software Foundation, Inc.
...
(gdb) symbol-file /dist/binaries/vmlinux.arm
Reading symbols from /dist/binaries/vmlinux.arm...done.
(gdb) set remote Z-packet on
# ARMv7 only, ARMv8 doesn’t need
(gdb) set tdesc filename arm-with-neon.xml
(gdb) target remote 127.0.0.1:7000
Remote debugging using 127.0.0.1:7000
cache_init_objs (cachep=0xc7c00240, flags=3351249472) at
mm/slab.c:2658
(gdb) step
sighand_ctor (data=0xc7ead060) at kernel/fork.c:1467
(gdb) info registers
r0             0xc7ead060       -940912544
r1             0x520            1312
r2             0xc002f1e4       -1073548828
r3             0xc7ead060       -940912544
r4             0x0              0
r5             0xc7ead020       -940912608

O3 Pipeline Viewer
Use --debug-flags=O3PipeView and util/o3-pipeview.py

Adding new models

Andreas Sandberg

How are models implemented
[Figure: the Python description describes parameters and exported methods, and generates the Python wrappers and the parameter structs. The C++ model implements your model and includes the parameter structs.]

How are models instantiated

[Figure: in the simulation script, obj = MyObj() creates a Python object through the Python wrappers. m5.instantiate() then instantiates and populates the parameter struct MyObjParams, and MyObjParams::create() builds the C++ model.]

Discrete event based simulation
[Figure: timeline. MyObj::startup() schedules an event; when simulated time reaches it, the event handler is called.]

 Discrete: Handles time in discrete steps
 Each step is a tick
 Tick rate is usually 1 THz in gem5
 Simulator skips to the next event on the timeline
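
The "skip to the next event" behaviour can be illustrated with a priority queue. The sketch below is a generic discrete-event loop in plain Python, not gem5's event queue; `ToyEventQueue`, `from_seconds` and `timer` are hypothetical names, and the 1 THz tick rate is taken from the slide.

```python
import heapq

TICKS_PER_SECOND = 10**12   # 1 THz, so one tick is one picosecond


def from_seconds(seconds):
    """Convert simulated seconds to ticks (toy helper, not m5's)."""
    return int(round(seconds * TICKS_PER_SECOND))


class ToyEventQueue(object):
    def __init__(self):
        self.cur_tick = 0
        self._heap = []
        self._order = 0            # tie-breaker keeps same-tick events FIFO

    def schedule(self, when, handler):
        assert when >= self.cur_tick, "cannot schedule in the past"
        heapq.heappush(self._heap, (when, self._order, handler))
        self._order += 1

    def simulate(self, limit):
        """Run events up to and including tick `limit`."""
        while self._heap and self._heap[0][0] <= limit:
            when, _, handler = heapq.heappop(self._heap)
            self.cur_tick = when   # skip straight to the next event
            handler(self)
        self.cur_tick = limit


log = []
queue = ToyEventQueue()


def timer(q):
    log.append(q.cur_tick)
    if len(log) < 3:
        q.schedule(q.cur_tick + 500, timer)   # handler re-schedules itself


queue.schedule(1000, timer)
queue.simulate(from_seconds(1e-8))   # 10 ns = 10000 ticks
```

No work is done for the ticks between events: the loop jumps from tick 1000 to 1500 to 2000 and then directly to the simulation limit.
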

Creating a SimObject
 Derive Python class from Python SimObject
 Define parameters, ports and configuration
 Parameters in Python are automatically turned into C++ struct and passed to C++ object
 Add Python file to SConscript
 Or, place it in an existing Python file

 Derive C++ class from C++ SimObject


 Defines the simulation behavior
 See src/sim/sim_object.{cc,hh}
 Add C++ filename to SConscript in directory of new object
 Need to make sure you have a create factory method for the object
 Look at the bottom of an existing object for info

 Recompile

SimObject initialization

1. Instantiation: uses a factory method, MyObjectParams::create()
2. Register stats: MyObject::regStats()
3. Initialize architectural state: MyObject::initState()
4. Reset stats: MyObject::resetStats()
5. Start model: MyObject::startup()

Parameters and SimObjects
 Parameters to SimObjects are synthesized from Python structures
 Object hierarchy in Python reflects the C++ world
 This example is from src/dev/arm/Realview.py

class Pl011(Uart):                       # Python class name: Pl011; Python base class: Uart
    type = 'Pl011'                       # C++ class
    cxx_header = "dev/arm/pl011.hh"      # C++ header
    gic = Param.Gic(Parent.any, "Gic to use for interrupting")
    int_num = Param.UInt32("Interrupt number that connects to GIC")
    end_on_eot = Param.Bool(False, "End the simulation when …")
    int_delay = Param.Latency("100ns", "Time between action …")

Each parameter line reads: parameter name = Param.<parameter type>(default value, "parameter description")

SimObject Parameters
 Parameters can be:
 Scalars – Param.Unsigned(5), Param.Float(5.0), Param.UInt32(42), …
 Arrays – VectorParam.Unsigned([1,1,2,3])
 SimObjects – Param.PhysicalMemory(…)
 Arrays of SimObjects – VectorParam.PhysicalMemory(Parent.any)
 Memory address ranges – Param.AddrRange(0, Addr.max)
 Normally converted from strings with units:
 Latency – Param.Latency('15ns') -> Tick
 Frequency – Param.Frequency('100MHz') -> Tick
 MemorySize – Param.MemorySize('1GB') -> Bytes
 Time – Param.Time('Mon Mar 25 09:00:00 CST 2012')
 Ethernet Address – Param.EthernetAddr("90:00:AC:42:45:00")
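
To make the string-to-value conversions concrete, here is a small sketch in plain Python. It is not gem5's converter: the helper names are hypothetical, it assumes the usual 1 THz tick rate, it reads the frequency arrow as "clock period in ticks", and it assumes binary multiples for memory sizes (1 kB = 1024 bytes).

```python
from fractions import Fraction

TICKS_PER_SECOND = 10**12          # assumes the usual 1 THz tick rate

TIME_UNITS = {'s': Fraction(1), 'ms': Fraction(1, 10**3),
              'us': Fraction(1, 10**6), 'ns': Fraction(1, 10**9),
              'ps': Fraction(1, 10**12)}
FREQ_UNITS = {'Hz': 1, 'kHz': 10**3, 'MHz': 10**6, 'GHz': 10**9}
SIZE_UNITS = {'B': 1, 'kB': 2**10, 'MB': 2**20, 'GB': 2**30}


def split_unit(text, units):
    """Split '15ns' into (Fraction(15), 'ns'), trying longer suffixes first."""
    for suffix in sorted(units, key=len, reverse=True):
        if text.endswith(suffix):
            return Fraction(text[:-len(suffix)]), suffix
    raise ValueError("no known unit in %r" % text)


def latency_to_ticks(text):
    value, unit = split_unit(text, TIME_UNITS)
    return int(value * TIME_UNITS[unit] * TICKS_PER_SECOND)


def frequency_to_ticks(text):
    """Return the clock period in ticks for a frequency string."""
    value, unit = split_unit(text, FREQ_UNITS)
    return int(TICKS_PER_SECOND / (value * FREQ_UNITS[unit]))


def memory_size_to_bytes(text):
    value, unit = split_unit(text, SIZE_UNITS)
    return int(value * SIZE_UNITS[unit])
```

Exact rational arithmetic via `Fraction` avoids floating-point rounding when a value such as `'1.5ns'` is scaled to picosecond ticks.
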

Auto-generated Header file
#ifndef __PARAMS__Pl011__
#define __PARAMS__Pl011__

class Pl011;

#include <cstddef>
#include "base/types.hh"
#include "params/Gic.hh"
#include "base/types.hh"

#include "params/Uart.hh"

struct Pl011Params
    : public UartParams
{
    Pl011 * create();      // Factory method
    uint32_t int_num;      // int_num = Param.UInt32(…)
    Gic * gic;             // gic = Param.Gic(Parent.any, …)
    bool end_on_eot;       // end_on_eot = Param.Bool(False, "End …")
    Tick int_delay;        // int_delay = Param.Latency("100ns", "Time …")
};
#endif // __PARAMS__Pl011__

Generated from the Python class Pl011(Uart) with type = 'Pl011'.

How Parameters are used in C++
src/dev/arm/pl011.cc:
Pl011::Pl011(const Pl011Params *p)
: Uart(p), …,
intNum(p->int_num), gic(p->gic),
endOnEOT(p->end_on_eot), intDelay(p->int_delay)
{

}

You can also access parameters through params() accessor after instantiation.

Creating/Using Events
 One of the most common things in an event driven simulator is
scheduling events
 Declaring events and handlers is easy:
/* Handle when a timer event occurs */
void timerHappened();
EventWrapper<MyClass, &MyClass::timerHappened> event;

 Scheduling them is easy too:

/* something that requires me to schedule an event at time t*/


if (event.scheduled())
reschedule(event, curTick() + t);
else
schedule(event, curTick() + t);

Checkpointing SimObject State
 If your object has state, it needs to be written to the checkpoint
 Checkpointing takes place on a drained simulator
 Draining ensures that microarchitectural state is flushed
 Models may need to flush pipelines and wait for outstanding requests to finish.
 Checkpoint implemented by overriding
SimObject::serialize(CheckpointOut &)
 Save necessary state
 No need to store parameters from the config system!
 Use SERIALIZE_*() macros or paramOut
 To implement restore, override
SimObject::unserialize(CheckpointIn &)
 Use UNSERIALIZE_*() macros or paramIn

Creating a checkpoint

1. Trigger checkpointing: script call m5.checkpoint("my.cpt")
2. Drain the simulator
• Ensures a well-defined architectural state
• Flushes CPU pipelines
• Writes back caches
3. Serialize objects: MyObject::serialize(CheckpointOut&)
4. Resume drained objects: MyObject::drainResume()
5. Resume simulation: script call m5.simulate()

Restoring from a checkpoint

1. Instantiation: uses a factory method, MyObjectParams::create()
2. Register stats: MyObject::regStats()
3. Restore architectural state: MyObject::unserialize(CheckpointIn&)
4. Reset stats: MyObject::resetStats()
5. Start model: MyObject::startup()
6. Resume system: MyObject::drainResume()

Draining
1. Script requests draining
2. Call SimObject::drain()
• Flush internal state
• Stop producing new messages
3. All objects drained?
• No: simulate until signalDrainDone(), then go back to step 2
• Yes: done
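
The drain loop above can be sketched as "ask every object to drain, otherwise simulate a little and ask again". The code below is a toy model in plain Python, not gem5's drain manager; `ToyDrainable`, `tick()` and `drain_all` are hypothetical, and one call to `tick()` stands in for simulating until an object signals that it has finished draining.

```python
class ToyDrainable(object):
    """Toy model with some in-flight requests (not a gem5 class)."""

    def __init__(self, name, outstanding):
        self.name = name
        self.outstanding = outstanding
        self.draining = False

    def drain(self):
        """Stop accepting new work; return True if already drained."""
        self.draining = True
        return self.outstanding == 0

    def tick(self):
        """One simulation step: retire one in-flight request, if any."""
        if self.outstanding:
            self.outstanding -= 1


def drain_all(objects, max_steps=1000):
    """Repeat 'ask everyone to drain, else simulate' until all are drained."""
    steps = 0
    # The list comprehension makes sure drain() is called on every object.
    while not all([obj.drain() for obj in objects]):
        if steps >= max_steps:
            raise RuntimeError("objects did not drain")
        for obj in objects:
            obj.tick()
        steps += 1
    return steps


cpu = ToyDrainable("cpu", outstanding=3)
dma = ToyDrainable("dma", outstanding=1)
steps_needed = drain_all([cpu, dma])
```

The loop finishes only when every object reports that it is drained, so the slowest object determines how long draining takes.
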

Checkpointing Example
// uint16_t control;
void
Pl011::serialize(CheckpointOut &cp) const
{
SERIALIZE_SCALAR(control);
}

void
Pl011::unserialize(CheckpointIn &cp)
{
UNSERIALIZE_SCALAR(control);
}

Good Examples
 Simple IO devices: IsaFake
 See: src/dev/isa_fake.{cc,hh} and src/dev/Device.py
 Demonstrates a basic memory-mapped device using the BasicPioDevice base class
 PCI devices: PciVirtIO
 See: src/dev/virtio/pci.{cc,hh} and src/dev/VirtIO.py
 PCI device with a single BAR and interrupts
 More complex PCI device: CopyEngine
 See: src/dev/pci/copy_engine.{cc,hh} and src/dev/pci/CopyEngine.py
 PCI device with DMA support
 Python exports: PowerModelState
 See: src/sim/power/PowerModelState.py
 Exports two methods (getDynamicPower & getStaticPower) to Python

<Insert coffee break here>

Memory System

Stephan Diestelhorst

Goals
 Model a system with heterogeneous applications, running on a set of heterogeneous processing engines, using heterogeneous memories and interconnect
 CPU centric: capture memory system behaviour accurately enough
 Memory centric: Investigate memory subsystem and interconnect architectures

[Figure: processing engines (processors, CPU, GPU, video backend, video decoder, DMA) connected through interconnects to memories (DRAM, 3D-DRAM, SRAM, NAND, PCM, STT-RAM)]

Goals, contd.
 Two worlds...
 Computation-centric simulation
 e.g. SimpleScalar, Asim etc
 More behaviourally oriented, with ad-hoc ways of describing parallel behaviours and
intercommunication
 Communication-centric simulation
 e.g. SystemC+TLM2 (IEEE standard)
 More structurally oriented, with parallelism and interoperability as a key component

 gem5 is trying to balance


 Easy to extend (flexible)
 Easy to understand (well defined)
 Fast enough (to run full-system simulation at MIPS)
 Accurate enough (to draw the right conclusions)

Event Simulation
 Event-driven
 no activity -> no clocking
 event queue
 Deterministic
 fixed random number seed
 no dependence on host addresses
 Multi-Queue
 multiple workers

[Figure: event queue over time for a cache model, with a cache lookup at curTick and a cache response later]

Ports, Masters and Slaves
 MemObjects are connected through master and slave ports
 A master module has at least one master port, a slave module at least one slave port, and an interconnect module at least one of each
 A master port always connects to a slave port
 Similar to TLM-2 notation

[Figure: a master module (CPU with I$ and D$) connects through an interconnect module (bus) to slave modules (memory0, memory1); each master port connects to a slave port]

Transport interfaces
 Atomic
 Similar to loosely timed in TLM
 Blocking: Requests complete in a single call chain
 Each component along the way adds latency to the request
 Timing
 Similar to approximately timed in TLM
 Asynchronous: One call to send a packet, callback when response is ready.
 The Atomic and Timing interfaces are mutually exclusive
 Functional
 Debug interface that doesn’t affect coherency states.
 Blocking: Requests complete within a single call chain.
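
The atomic interface can be pictured as one nested function call in which every component adds its own latency to the result. The sketch below is a toy model in plain Python, not gem5's port API; the class names, the `atomic_access` method and the latency values (in ticks) are all hypothetical.

```python
class ToyMemory(object):
    def __init__(self, latency):
        self.latency = latency
        self.data = {}

    def atomic_access(self, addr):
        # End of the chain: return data together with this component's latency.
        return self.data.get(addr, 0), self.latency


class ToyBus(object):
    def __init__(self, latency, downstream):
        self.latency = latency
        self.downstream = downstream

    def atomic_access(self, addr):
        # Forward the request, then add our own latency on the way back.
        value, below = self.downstream.atomic_access(addr)
        return value, below + self.latency


class ToyL1(object):
    def __init__(self, hit_latency, downstream):
        self.hit_latency = hit_latency
        self.downstream = downstream
        self.lines = {}

    def atomic_access(self, addr):
        if addr in self.lines:
            return self.lines[addr], self.hit_latency
        value, below = self.downstream.atomic_access(addr)
        self.lines[addr] = value           # fill on miss
        return value, below + self.hit_latency


memory = ToyMemory(latency=30000)
memory.data[0x40] = 7
cache = ToyL1(hit_latency=2000, downstream=ToyBus(1000, memory))
miss = cache.atomic_access(0x40)   # goes cache -> bus -> memory
hit = cache.atomic_access(0x40)    # served by the cache alone
```

A timing-mode equivalent would instead return immediately and deliver the response through a later callback, which is why the two interfaces cannot be mixed.
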

Communication Monitor
 Insert as a structural component where stats are desired
memmonitor = CommMonitor()
membus.master = memmonitor.slave
memmonitor.master = memctrl.slave

 A wide range of communication stats


 bandwidth, latency, inter-transaction (read/write) time, outstanding transactions, address heatmap, etc
 Provides an attachment point for communication probes:
 Tracing (using protobuf)
 Stack distance monitoring
 Footprint estimation

[Figure: Latency distribution; horizontal axis: Latency (ns); vertical axis: Distribution (%), 0–70]

Traffic generator
 Test scenarios for memory system regression and performance validation
 High level of control for scenario creation
 Black-box models for components that are not yet modeled
 Video/baseband/accelerator for memory-system loading
 Inject requests based on (probabilistic) state-transition diagrams
 Idle, random, linear and trace replay states

[Figure: address over time, alternating linear access phases and idle phases]
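
A probabilistic state-transition generator of this kind can be sketched in plain Python 3. This is not gem5's traffic generator: `generate_traffic`, the transition table and the default address range are hypothetical, and trace replay is left out for brevity.

```python
import random


def generate_traffic(transitions, start, steps, block_size=64,
                     addr_range=(0, 4096), seed=1):
    """Yield (state, address) pairs; address is None while idle.

    `transitions` maps a state name to a list of (next_state, probability).
    """
    rng = random.Random(seed)           # fixed seed keeps runs repeatable
    low, high = addr_range
    state, next_linear = start, low
    for _ in range(steps):
        if state == "idle":
            yield state, None
        elif state == "linear":
            yield state, next_linear
            next_linear += block_size
            if next_linear >= high:
                next_linear = low       # wrap around the address range
        elif state == "random":
            blocks = (high - low) // block_size
            yield state, low + rng.randrange(blocks) * block_size
        else:
            raise ValueError("unknown state %r" % state)
        # Pick the next state according to the transition probabilities.
        names = [name for name, _ in transitions[state]]
        weights = [weight for _, weight in transitions[state]]
        state = rng.choices(names, weights=weights)[0]


transitions = {
    "idle":   [("linear", 1.0)],
    "linear": [("linear", 0.5), ("idle", 0.25), ("random", 0.25)],
    "random": [("idle", 1.0)],
}
trace = list(generate_traffic(transitions, "idle", steps=50))
```

The fixed seed mirrors the determinism goal stated earlier in the deck: two runs with the same transition table produce the same request stream.
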

Memory controllers
 All memories in the system inherit from AbstractMemory
 Basic single-channel memory controller
 Instantiate multiple times if required
 Interleaving support added in the bus/crossbar (to be posted)

 SimpleMemory
 Fixed latency (possibly with a variance)
 Fixed throughput (request throttling without buffering)
 SimpleDRAM
 High-level configurable DRAM controller model to mimic DDRx, LPDDRx, WideIO, HBM etc
 Memory organization: ranks, banks, row-buffer size
 Controller architecture: Read/write buffers, open/close page, mapping, scheduling policy
 Key timing constraints: tRCD, tCL, tRP, tBURST, tRFC, tREFI, tTAW/tFAW

Top-down controller model
 Don’t model the actual DRAM, only the timing constraints
 DDR3/4, LPDDR2/3/4, WIO1/2, GDDR5, HBM, HMC, even PCM
 See src/mem/DRAMCtrl.py and src/mem/dram_ctrl.{hh, cc}
[Figure: DRAM memory controller block diagram. System interfaces feed a read queue and a write queue, followed by page policy & arbitration and PHY & timing constraints. Organization parameters: device width, burst length, #ranks, #banks, page size. Timing parameters: tRCD, tCL, tRP, tRAS, tBURST, tRFC & tREFI, tWTR, tRRD, tFAW/tTAW.]

Hansson et al, Simulating DRAM controllers for future system architecture exploration, ISPASS’14
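
As a rough illustration of how a few timing constraints combine, the sketch below computes a first-order read latency for the three classic row-buffer cases. It is a textbook simplification, not the gem5 controller model: it ignores queueing, refresh, tRAS, tWTR, tRRD and tFAW, the function name is hypothetical, and the timing values are assumed for illustration only.

```python
def access_latency(row_state, tRCD, tCL, tRP, tBURST):
    """First-order read latency for one burst, in the unit of the inputs.

    row_state is 'hit' (row already open), 'closed' (bank precharged)
    or 'conflict' (a different row is open and must be precharged first).
    """
    if row_state == "hit":
        return tCL + tBURST
    if row_state == "closed":
        return tRCD + tCL + tBURST
    if row_state == "conflict":
        return tRP + tRCD + tCL + tBURST
    raise ValueError("unknown row state %r" % row_state)


# Illustrative timings in picoseconds (assumed values, not from the slides).
timing = dict(tRCD=13750, tCL=13750, tRP=13750, tBURST=5000)
latencies = {state: access_latency(state, **timing)
             for state in ("hit", "closed", "conflict")}
```

The gap between the hit and conflict cases is what page policy and scheduling decisions in the controller try to exploit.
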
Controller model correlation
 Comparing with a real memory controller
 Synthetic traffic sweeping bytes per activate and number of banks
 See configs/dram/sweep.py and util/dram_sweep_plot.py

[Figure: two surface plots, "gem5 model" and "Real memory controller", over Number of Banks (1–8) and Bytes per Activate (64–256); vertical scale 0–100 in bands of 20]

DRAM power modeling
 DRAM accounts for a large portion of system power
 Need to capture power states, and system impact
 Integrated model opens up for developing more clever strategies
 DRAMPower adapted and adopted for gem5 use-case

[Figure: Energy Saving due to Power-Down (%) for GPU-AngryBirds, bbench and AndeBench; horizontal axis 0–90]
[Figure: BBench DRAM Energy Analysis (LPDDR3 x32): Static Energy (mJ) and Dynamic Energy (mJ) shares labelled 36% and 64%; components: Active, Precharge, Read/Write, Background and Refresh Energy]

Naji et al, A High-Level DRAM Timing, Power and Area Exploration Tool, SAMOS’15

Address interleaving
 Multi-channel memory support is essential
 Emerging DRAM standards are multi-channel by nature
(LPDDR4, WIO1/2, HBM1/2, HMC)

 Interleaving support added to address range


 Understood by memory controller and interconnect
 See src/base/addr_range.hh for matching and
src/mem/xbar.{hh, cc} for actual usage
 Interleaving not visible in checkpoints

 XOR-based hashing to avoid imbalances
 Simple yet effective, and widely published
 See configs/common/MemConfig.py for system configuration
Source: Micron
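The effect of XOR-based hashing on channel selection can be sketched in a few lines of Python. This is only an illustration: the function name, its parameters and the chosen bit positions are assumptions made for the example and are not gem5's address-range API.

```python
def channel_of(addr, intlv_low_bit, intlv_bits, xor_low_bit=None):
    """Pick a channel from intlv_bits address bits starting at intlv_low_bit.

    If xor_low_bit is given, the selected bits are XORed with the same
    number of higher address bits starting at xor_low_bit (hypothetical
    parameter names, chosen for this sketch).
    """
    mask = (1 << intlv_bits) - 1
    channel = (addr >> intlv_low_bit) & mask
    if xor_low_bit is not None:
        channel ^= (addr >> xor_low_bit) & mask
    return channel


# Four channels interleaved on 64-byte granularity (address bits 6 and 7).
# A 256-byte stride always hits the same channel without hashing.
addresses = [i * 256 for i in range(8)]
plain = [channel_of(a, 6, 2) for a in addresses]
hashed = [channel_of(a, 6, 2, xor_low_bit=8) for a in addresses]
print("plain :", plain)
print("hashed:", hashed)
```

With plain interleaving the strided stream lands on a single channel, while folding in two higher address bits spreads it across all four.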
Crossbars & Bridges
 Create rich system interconnect topologies using a simple bus model and bus bridge
 Crossbars do address decoding and arbitration
     Distributes snoops and aggregates snoop responses
     Routes responses
     Configurable width and clock speed
 Bridges connect two buses
     Queues requests and forwards them
     Configurable amount of queuing space for requests and responses
[Figure: example topology with Cores, L1i/L1d caches, XBars, an L2, and a Bridge between two XBars.]
Caches
 Single cache model with several components:
     Cache: request processing, miss handling, coherence
     Tags: data storage and replacement (LRU, Random, etc.)
     Prefetcher: N-Block Ahead, Tagged Prefetching, Stride Prefetching
     MSHR & MSHRQueue: track pending/outstanding requests
         Also used for write buffer
 Parameters: size, hit latency, block size, associativity, number of MSHRs (max outstanding requests)
[Figure: Cache block diagram with Prefetch, Tags, Data and MSHR components.]
Coherence protocol
 MOESI bus-based snooping protocol
 Support nearly arbitrary multi-level hierarchies at the expense of some realism
 Does not enforce inclusion
 Magic “express snoops” propagate upward in zero time
 Avoid complex race conditions when snoops get delayed
 Timing is similar to some real-world configurations
 L2 keeps copies of all L1 tags
 L2 and L1s snooped in parallel

Snoop (probe) filtering
 Broadcast-based coherence protocol
 Incurs performance and power cost
 Does not reflect realistic implementations

 Snoop filter goes one step towards directories


 Track sharers, based on writeback and clean eviction
 Direct snoops and benefit from locality

 Many possible implementations


 Currently ideal (infinite), no back invalidations
 Can be used with coherent crossbars on any level
 See src/mem/SnoopFilter.py and src/mem/snoop_filter.{hh, cc}
Source: AMD
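The sharer-tracking idea can be illustrated with a small Python sketch. It is a deliberately simplified, hypothetical model (class and method names are invented for the example, and reads and writes are not distinguished); it is not the code in src/mem/snoop_filter.{hh, cc}.

```python
class ToySnoopFilter:
    """Ideal (unbounded) filter that remembers which ports hold each line."""

    def __init__(self):
        self.sharers = {}  # line address -> set of port ids

    def on_request(self, line, port):
        """Record the requester and return the ports that need a snoop."""
        holders = self.sharers.setdefault(line, set())
        targets = sorted(holders - {port})
        holders.add(port)
        return targets

    def on_eviction(self, line, port):
        """Forget a port after a writeback or clean eviction."""
        holders = self.sharers.get(line)
        if holders is not None:
            holders.discard(port)
            if not holders:
                del self.sharers[line]


snoop_filter = ToySnoopFilter()
print(snoop_filter.on_request(0x40, port=0))  # nobody to snoop yet
print(snoop_filter.on_request(0x40, port=1))  # only port 0 is snooped
```

Compared with broadcasting to every port, only the ports recorded as sharers receive the snoop.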
Memory system verification
 Check adherence to consistency model
     Notion of functional reference memory is too simplistic
     Need to track valid values according to consistency model
 Memory checker and monitors
     Tracking in src/mem/MemChecker.py and src/mem/mem_checker.{hh, cc}
     Probing in src/mem/mem_checker_monitor.{hh, cc}
 Revamped testing
     Complex cache (tree) hierarchies in configs/example/{memtest, memcheck}.py
     Randomly generated soak test in util/memtest-soak.py
     For any changes to the memory system, please use these
[Figure: Core 0, Core 1 and Core 2, each with a Monitor and an L1, connected through an XBar to an L2, with a MemChecker.]
Ruby for Networks and Coherence
 As an alternative to its native memory system gem5 also integrates Ruby
 Create networked interconnects based on domain-specific language (SLICC) for
coherence protocols
 Detailed statistics
 e.g., Request size/type distribution, state transition frequencies, etc...
 Detailed component simulation
 Network (fixed/flexible pipeline and simple)
 Caches (Pluggable replacement policies)
 Supports Alpha and x86
 Limited ARM support about to be added
 Limited support for functional accesses

Instantiating and Connecting Objects
class BaseCPU(MemObject):
    icache_port = MasterPort("Instruction Port")
    dcache_port = MasterPort("Data Port")

class BaseCache(MemObject):
    cpu_side = SlavePort("Port on side closer to CPU")
    mem_side = MasterPort("Port on side closer to MEM")
    ...

class Bus(MemObject):
    slave = VectorSlavePort("vector port for connecting masters")
    master = VectorMasterPort("vector port for connecting slaves")
    ...

system.cpu.icache_port = system.icache.cpu_side
system.cpu.dcache_port = system.dcache.cpu_side

system.icache.mem_side = system.l2bus.slave
system.dcache.mem_side = system.l2bus.slave

[Figure: CPU connected to I$ and D$, which connect to a Bus and then Memory.]
Requests & Packets
 Protocol stack based on Requests and Packets
 Uniform across all MemObjects (with the exception of Ruby)
 Aimed at modelling general memory-mapped interconnects
 A master module, e.g. a CPU, changes the state of a slave module, e.g. a memory through a
Request transported between master ports and slave ports using Packets

// Master module (CPU)
Request req(addr, size, flags, masterId);
Packet* req_pkt = new Packet(req, MemCmd::ReadReq);
...
...
delete resp_pkt;

// Slave module (memory)
...
if (req_pkt->needsResponse()) {
    req_pkt->makeResponse();
} else {
    delete req_pkt;
}
...
Requests & Packets
 Requests contain information persistent throughout a transaction
 Virtual/physical addresses, size
 MasterID uniquely identifying the module initiating the request
 Stats/debug info: PC, CPU, and thread ID
 Requests are transported as Packets
 Command (ReadReq, WriteReq, ReadResp, etc.) (MemCmd)
 Address/size (may differ from request, e.g., block aligned cache miss)
 Pointer to request and pointer to data (if any)
 Source & destination port identifiers (relative to interconnect)
 Used for routing responses back to the master
 Always follow the same path
 SenderState opaque pointer
 Enables adding arbitrary information along packet path

Functional transport interface
 On a master port we send a request packet using sendFunctional
 This in turn calls recvFunctional on the connected slave port
 For a specific slave port we implement the desired functionality by overriding recvFunctional
     Typically check internal (packet) buffers against request packet
     For a slave module, turn the request into a response (without altering state)
     For an interconnect module, forward the request through the appropriate master port using sendFunctional
         Potentially after performing snoops by issuing sendFunctionalSnoop

// Master module (CPU)
masterPort.sendFunctional(pkt);
// packet is now a response

// Slave module (memory)
MySlavePort::recvFunctional(PacketPtr pkt)
{
    ...
}
Atomic transport interface
 On a master port we send a request packet using sendAtomic
 This in turn calls recvAtomic on the connected slave port
 For a specific slave port we implement the desired functionality by overriding recvAtomic
     For a slave module, perform any state updates and turn the request into a response
     For an interconnect module, perform any state updates and forward the request through the appropriate master port using sendAtomic
         Potentially after performing snoops by issuing sendAtomicSnoop
     Return an approximate latency

// Master module (CPU)
Tick latency = masterPort.sendAtomic(pkt);
// packet is now a response

// Slave module (memory)
MySlavePort::recvAtomic(PacketPtr pkt)
{
    ...
    return latency;
}
Timing transport interface
 On a master port we try to send a request packet using sendTimingReq
 This in turn calls recvTimingReq on the connected slave port
 For a specific slave port we implement the desired functionality by overriding recvTimingReq
     Perform state updates and potentially forward request packet
     For a slave module, typically schedule an action to send a response at a later time
 A slave port can choose not to accept a request packet by returning false
     The slave port later has to call sendRetryReq to alert the master port to try again

// Master module (CPU)
bool success = masterPort.sendTimingReq(pkt);
if (success) {
    // request packet is sent
    ...
} else {
    // failed, wait for recvReqRetry from slave port
    ...
}

// Slave module (memory)
MySlavePort::recvTimingReq(PacketPtr pkt)
{
    assert(pkt->isRequest());
    ...
    return true/false;
}
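The accept/reject/retry handshake of the timing interface can be mimicked with a toy Python model. The classes below are hypothetical stand-ins written for this sketch (they are not gem5 classes); only the method names echo recvTimingReq and recvReqRetry from the slide.

```python
from collections import deque


class ToySlave:
    """Accepts requests until its queue is full, then asks for a retry later."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.waiting_master = None

    def recv_timing_req(self, master, pkt):
        if len(self.queue) >= self.capacity:
            self.waiting_master = master  # remember who to wake up
            return False
        self.queue.append(pkt)
        return True

    def service_one(self):
        """Finish one request; space is free again, so send the retry."""
        done = self.queue.popleft()
        if self.waiting_master is not None:
            master, self.waiting_master = self.waiting_master, None
            master.recv_req_retry()
        return done


class ToyMaster:
    """Sends packets in order and stalls when the slave refuses one."""

    def __init__(self, slave, packets):
        self.slave = slave
        self.pending = deque(packets)
        self.blocked = False

    def send_all(self):
        while self.pending and not self.blocked:
            if self.slave.recv_timing_req(self, self.pending[0]):
                self.pending.popleft()
            else:
                self.blocked = True  # wait for recv_req_retry

    def recv_req_retry(self):
        self.blocked = False
        self.send_all()


slave = ToySlave(capacity=2)
master = ToyMaster(slave, ["a", "b", "c"])
master.send_all()
print(master.blocked, list(slave.queue))
slave.service_one()
print(master.blocked, list(slave.queue))
```

The master never polls: it holds on to the rejected packet and resends it only when the slave signals the retry.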
Timing transport interface (cont’d)
 Responses follow a symmetric pattern in the opposite direction
 On a slave port we try to send a response packet using sendTimingResp
 This in turn calls recvTimingResp on the connected master port
 For a specific master port we implement the desired functionality by overriding recvTimingResp
     Perform state updates and potentially forward response packet
     For a master module, typically schedule a succeeding request
 A master port can choose not to accept a response packet by returning false
     The master port later has to call sendRetryResp to alert the slave port to try again

// Slave module (memory)
bool success = slavePort.sendTimingResp(pkt);
if (success) {
    // response packet is sent
    ...
} else {
    ...
}

// Master module (CPU)
MyMasterPort::recvTimingResp(PacketPtr pkt)
{
    assert(pkt->isResponse());
    ...
    return true/false;
}
CPU Models

Andreas Sandberg

CPU models overview
Class hierarchy:
BaseCPU
    BaseKvmCPU
        ArmV8KvmCPU
        X86KvmCPU
    BaseSimpleCPU
        TimingSimpleCPU
        AtomicSimpleCPU
    TraceCPU
    DerivO3CPU
    MinorCPU

            KVM CPUs       Simple CPUs    TraceCPU       DerivO3CPU / MinorCPU
Timing      No timing      Some timing    Some timing    Full timing
Caches      No caches      Caches         Caches         Caches
BP          No BP          Limited BPs    No BPs         Branch predictors
Speed       Really fast    Fast           Fast           Slow
Atomic Simple CPU
 On every CPU tick() perform all operations for an instruction
 Memory accesses use atomic methods
 Fastest functional simulation
     Except for KVM-accelerated CPUs
Timing Simple CPU
 Memory accesses use timing path
 CPU waits until memory access returns
 Fast, provides some level of timing
Detailed CPU Models
 Parameterizable pipeline models w/SMT support
 Two Types
 MinorCPU – Parameterizable in-order pipeline model
 O3CPU – Parameterizable out-of-order pipeline model
 “Execute in Execute”, detailed modeling
 Roughly an order-of-magnitude slower than Simple
 Models the timing for each pipeline stage
 Forces both timing and execution of simulation to be accurate
 Important for Coherence, I/O, Multiprocessor Studies, etc



In-Order CPU Model
 Models a “standard” 4-stage pipeline
 Fetch1, Fetch2, Decode, Execute

 Key Resources
 Cache, Execution, BranchPredictor, etc.
 Pipeline stages



Out-of-Order (O3) CPU Model
 Defaults to a 7-stage pipeline
 Fetch, Decode, Rename, Issue, Execute, Writeback, Commit
 Model a varying number of stages by changing the delay between them
 For example: fetchToDecodeDelay

 Key Resources
 Physical Registers, IQ, LSQ, ROB, Functional Units



Important CPU interfaces
 BaseCPU
 Base class for all CPU models
 Provides a common interface for checkpointing/switching/interrupts/…
 Even used by KVM-based CPUs

 ThreadContext
 Interface for accessing total architectural state of a single thread (PC, registers, etc.)
 Holds pointers to important structures (TLB, CPU, etc.)
 CPU models typically implement custom versions or use SimpleThread

 ExecContext
 Abstract interface defining how an instruction interfaces with the CPU model



StaticInst
 Represents a decoded instruction
 Has classifications of the inst
 Corresponds to the binary machine inst
 Only has static information

 Has all the methods needed to execute an instruction


 Tells which regs are source and dest
 Contains the execute() function
 ISA parser generates execute() for all insts



DynInst
 Complex CPU models need to track resources used by instructions

 Dynamic version of StaticInst


 Used to hold extra information for in-flight instructions
 Holds PC, Results, Branch Prediction Status
 Interface for TLB translations

 Specialized versions for detailed CPU models



Examples
 Virtualization-based CPU: BaseKvmCPU
 See: src/cpu/kvm/base.{cc,hh} and src/cpu/kvm/BaseKvmCPU.py
 Implements the basic interfaces required by all CPU models
 Reasonably small and well documented
 Does not simulate instructions or implement ExecContext
 Simplest possible simulated CPU: AtomicSimpleCPU
 See: src/cpu/simple/{base.cc,base.hh,atomic.cc,atomic.hh,
AtomicSimpleCPU.py}
 Minimal simulated CPU that includes SMT
 Simplest “real” model: MinorCPU
 See src/cpu/minor/*
 Implements a pipelined in-order CPU



Advanced Features &
Capabilities

Accelerating gem5
 Switching modes
 (kvm +) functional + timing / detailed

 Checkpoints
 boot Linux -> checkpoint
 run multiple configurations in parallel
 run multiple checkpoints in parallel

 Multi-threading
 multiple queues
 multiple workers execute events
 data sharing and tight coupling limit speedup

 Multi-processed gem5
 for design space explorations



Distributed gem5 simulation
 gem5 running in parallel on a cluster of host machines
 Packet forwarding engine
     Forward packets among the simulated systems
     Synchronize the distributed simulation
     Simulate network topology
 Tested with ~30 nodes, 100s planned
[Figure: host machines #1, #2 and #3, each running a gem5 process that simulates systems #1, #2 and #3, connected through packet forwarding.]
Object Diagram : Simulating a 2-node Cluster Example
[Figure: object diagram of three gem5 processes connected by TCP sockets.
Each simulated compute node: Root, NSGigE, DistEtherLink, TCPIface, SyncEvent, SyncNode.
Simulated Ethernet switch: Root, EtherSwitch, two DistEtherLinks, two TCPIfaces, SyncEvent, SyncSwitch.]
Elastic Traces – fast, realistic memory exploration
 High-level OOO core model: speedy simulation
     Capture data dependencies and MLP
     Elastic replay
     5x-8x => ~1MIPS
 High-level synchronisation event capture
     Predict scalability for SMPs
     Additional 10x speedup
[Figure: relative CPI and error (%) for (B) L2 size 1MB --> 2MB; mean error = 1.4%.]


Data Profiling and Heterogeneous Memory
 Address rising cost of communication
 Optimize data structures to improve cache utilization and efficiency
 Optimize data storage onto heterogeneous memories



Graphics & Android
Andreas


Common Approach: CPU-Centric
 Software renderer instead of a real GPU
     Optimization friendly code
         Can be vectorized
         Easy-to-predict branches
     Large memory foot print
 Doesn’t simulate the driver
     Known to be the bottleneck for some workloads
     Horrible code
 Workload and software renderer compete for resources
     Can significantly skew core behavior
     Affects 2D applications and 3D applications
[Figure: Workload on Android with a SW renderer, running on two CPUs with L1D/L1I caches, an L2 and LPDDR3; Display Controller and GPU shown alongside.]


Full system NoMali modelling
 Passes the duck test (almost)
     Most GPU integration tests work (no pixels)
     Implements the Mali register interface & interrupts
 Accurate CPU+GPU interactions
     Runs the full driver stack
     Complex software with significant CPU component
 Limitations:
     Doesn’t produce any display output
     No memory system interactions
     Requires a properly optimized driver stack
 Use cases:
     CPU-centric studies (driver performance)
     Fast-forward (boot / long traces)
[Figure: Workload on Android with GPU drivers, running on two CPUs with L1D/L1I caches, an L2 and LPDDR3; Display Controller and NoMali shown alongside.]


De Jong, Rene, and Andreas Sandberg. "NoMali: Simulating a Realistic Graphics Driver Stack Using a Stub GPU." ISPASS 2016
Why do you care?
[Figure: Relative Error of Software Rendering vs NoMali for Instructions, IPC, BP Miss Ratio, DL1 Miss Ratio, IL1 Miss Ratio, L2 Miss Ratio and DRAM Read BW; axis 0% to 50%, with off-scale bars labelled 103%, 73%, 135% and 54%.
bbench on Android K (real GPU as reference)]


Power Modelling
Stephan


Power Models
 bottom-up
     simulate gates
     toggle rates
     complex aggregation
 top-down
     high level activities
     few voltage rails
     measure real devices
[Figure: core pipeline block diagram (Fetch, Decode / Rename, Issue, Branch Execute, Integer Execute, Vector Execute, Load & Store, L2, Commit) beside a Core / SOC / Interconnect / DRAM hierarchy annotated "Decompose" and "Aggregate".]


Top Down vs. Bottom Up

Top-down also has uses in design-space exploration – accurate reference


Top Down Power Models
 Built experimentally
 Often uses regression
 Extremely accurate
 Inflexible, often tied to a specific platform



Bottom Up Power Models
 Built on theory
 E.g. McPAT – Power Area and Timing Multi- and Many- core modelling framework
 Good for design-space exploration
 Large errors (largely due to abstraction)
 Relatively slow (not suitable for run-time management)



Power Modeling Based on Existing Hardware
Platform: ODROID-XU3 (Exynos-5422, 4x Cortex-A7, 4x Cortex-A15)
1. Run: workloads
    @ different DVFS level
    @ different affinities
    60 workloads used: MiBench, MediaBench, LMbench, NEON, OpenMP
2. Record:
    • Performance Counters (PMCs)
    • Voltage, Power
3. Choose PMCs: Hierarchical cluster analysis, correlation matrix analysis, exhaustive search etc.
4. Build Model
    • OLS multiple linear regression
    • Deals with PMC multicollinearity
    • Considers heteroscedasticity
5. Validate
    • K-fold cross validation
    • R2: ~0.99
    • 3-6% Av. Error
6. Uses
    • OS run-time management
    • Reference for research
    • gem5 add-on
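The "Build Model" step can be illustrated with a deliberately reduced Python sketch: an ordinary least-squares fit of power against a single event rate, plus the R2 figure used for validation. The flow described above uses multiple PMCs; the single predictor, the function name and the sample numbers here are assumptions made purely for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit ys ~ intercept + slope * xs; returns (intercept, slope, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination: share of variance explained by the fit.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1 - ss_res / ss_tot


# Made-up samples: event rate (events per microsecond) vs measured power (W).
event_rate = [1.0, 2.0, 3.0, 4.0]
power = [0.5, 0.7, 0.9, 1.1]
intercept, slope, r2 = fit_line(event_rate, power)
print(f"power ~= {intercept:.3f} + {slope:.3f} * rate (R2 = {r2:.3f})")
```

The intercept plays the role of the rate-independent power and the slope the energy cost per event; a multi-PMC model adds one coefficient per chosen counter.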
Power & Energy Framework Overview
PE Model Generation Env.
    Power/Energy (PE) Model (IP Characterization or otherwise)
    Express PE Model in gem5 fitting form
    P&E Model Database (Use model generator scripts to create equivalent *.json)
Gem5 Simulation Env.
    Clocks, Clock Domains, Voltage Domains
    Generic DVFS Handler
    Power States**: Definition & Migration
    Runtime Statistics: Voltage, Freq, Power State, Event Count
    P&E Estimator (Generate P&E Stats Equation)
    System Controller (Extendable)
        - DVFS Control Registers
        - Energy Monitoring Registers**
        - Temperature Monitor**
S/W Power Management Env.
    Derive OSPM Policies
    CPUFreq, DEVFreq, CPUIdle
    High level Drivers
    CPUFreq Driver
    Device Tree: Define clock domains and associate them with devices
    Low-level Drivers
Ongoing activities within P&E framework
** Needs to be spec’ed out
Why are CPU power models important?
 Design space exploration
 To see the effect of making architectural changes

 Run-time management
 CPU employs power-saving techniques (DVFS, DPM, asymmetric multi-core e.g. ARM
big.LITTLE)
 Need accurate power estimations to make performance-power trade-off



Enable Power Modelling in gem5
 configs/example/arm/fs_power.py
     dyn = "voltage * (2 * ipc + 3 * 0.000000001 * dcache.overall_misses / sim_seconds)"
     st = "4 * temp"

 gem5.opt configs/example/arm/fs_power.py \
        --caches --kernel vmlinux

 grep pm0.dynamic_power m5out/stats.txt
     system.bigCluster.cpus.power_model.pm0.dynamic_power 0.057501 # Dynamic power for this object (Watts)
     ...
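To get a feel for the two expressions, they can be evaluated by hand in plain Python. The formulas are copied from the slide; the function names and the input values are made-up assumptions for this sketch and are not read from a gem5 run.

```python
def dynamic_power(voltage, ipc, dcache_misses, sim_seconds):
    """dyn expression from the slide, evaluated on plain numbers."""
    return voltage * (2 * ipc + 3 * 0.000000001 * dcache_misses / sim_seconds)


def static_power(temp):
    """st expression from the slide."""
    return 4 * temp


# Hypothetical statistics: 1.0 V, IPC of 1.5, one million D-cache misses
# over half a second of simulated time, and a temperature value of 25.
print(dynamic_power(1.0, 1.5, 1_000_000, 0.5))
print(static_power(25))
```

The miss term contributes little here because the 0.000000001 factor scales the misses-per-second rate down heavily relative to the IPC term.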


And it wiggles!



KVM
Andreas



Problem: Simulation is Slow
SPEC CPU2006 runtime: ~1 year / benchmark in detailed mode
<1 hour per SPEC benchmark on native HW

Native        Fast      Detailed
3,000 MIPS    1 MIPS    0.1 MIPS
A KVM-Based CPU Model
Simulation modes (can switch between modes during simulation):

Detailed (~0.1 MIPS): Pipeline simulator (timing, queues, speculation…)
    • caches, TLBs, branch predictor
Fast (~1 MIPS): 1 instruction per cycle
    • caches, TLBs, branch predictor
KVM (~90% of native): Hardware CPU via virtualization
    • Only simulates IO devices
    • No/Limited timing
Current state of KVM on ARM
Already in use despite known limitations
 Requirements
     Server-class ARMv8-based system
     RAM: 4+ GiB
     Host system and kernel with KVM support
 Known-working:
     Running full-systems with simulated devices
     Able to boot Android N
 Limited-support:
     Multiple CPUs
     Graphics, KMI
     CPU switching
     Checkpointing


How Do I Use KVM?
 Supported by configs/example/fs.py and configs/example/arm/fs_bigLITTLE.py
     Only the bL configuration supports multi-core!
 Behaves like a “normal” CPU model

./build/ARM/gem5.opt \
    configs/example/arm/fs_bigLITTLE.py \
    --cpu-type kvm \
    --kernel vmlinux --disk my_disk.img \
    --big-cpus 1 --little-cpus 0 \
    --dtb $GEM5/system/arm/dt/armv8_gem5_v1_1cpu.dtb


Demo



Methodology
William



SimPoints
 Generate wieldable, representative slices of full benchmarks
 Terminology:
     Intervals: slices in time, sampling granularity (e.g. 10K instructions)
     Phases: intervals with similar behavior that often recur periodically
[Figure: IPC over Time (Intervals 1 to 5), with phases labelled A A B A B; examples for gzip and gcc.]
 Output from SimPoint analysis is a set of slices and a weight for each slice (choose a clustering within 5% of CPI of full run)
 gem5 is instrumented to capture SimPoints
     Run one time to analyze basic block vectors
     Second time generates gem5 checkpoints at every identified phase
     Runs can be repeated with different experimental configuration
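How the slices and weights are combined afterwards can be sketched in Python: the whole-program CPI is estimated as the weight-averaged CPI of the simulated slices and compared with a full run. The helper names and all numbers below are illustrative assumptions, not output of the SimPoint tool.

```python
def weighted_cpi(simpoints):
    """Estimate full-run CPI from (weight, cpi) pairs, one pair per slice."""
    total_weight = sum(weight for weight, _ in simpoints)
    return sum(weight * cpi for weight, cpi in simpoints) / total_weight


def within_tolerance(estimate, reference, tolerance=0.05):
    """True if the estimate is within the given relative error of the reference."""
    return abs(estimate - reference) <= tolerance * reference


# Two hypothetical phases: phase A covers 60% of the intervals, phase B 40%.
slices = [(0.6, 1.2), (0.4, 2.0)]
estimate = weighted_cpi(slices)
print(estimate, within_tolerance(estimate, reference=1.5))
```

Only the representative slices need detailed simulation; the weights carry the information about how often each phase occurs.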
Principal Component Analysis (PCA)
 Find the most important parameters from a large data set automatically
 How to describe “most important” using math?
 High variance
 How do we represent our data so that the most important features can be extracted easily?
 Change of basis
 Can infer similarities and dissimilarities of workloads
 Based on distance on projected component space

PCA reveals the internal structure of the data that best explains the variance in the data!
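The "high variance" and "change of basis" ideas can be made concrete with a tiny two-dimensional example in plain Python. This is a sketch under simplifying assumptions: only two metrics per workload, the closed-form eigen-decomposition of a 2x2 covariance matrix, and invented sample points; real studies use many more dimensions and a numerical library.

```python
import math


def pca_2d(points):
    """Return ((lam1, lam2), first_axis) for a list of (x, y) samples."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix [[a, b], [b, c]] of the centred data.
    a = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Eigenvector belonging to the larger eigenvalue (the first component).
    if b != 0:
        vx, vy = lam1 - c, b
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)


# Invented workload metrics that are perfectly correlated.
samples = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
(lam1, lam2), axis = pca_2d(samples)
print("variance along PC1:", lam1, "PC2:", lam2, "PC1 direction:", axis)
```

All of the variance ends up on the first component, so a single coordinate along that axis describes these samples as well as the original two metrics did.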
Studying Complex Software is Important
 Android workloads stress the Instruction-side aspects of a system
 The popular SPEC benchmarks primarily stress only the Data-side
 Very limited coverage of full mobile systems’ behavior
[Figure: Principal Components of SPEC and Android Workloads (scatter plot).
X-axis (PC1) key components: CPI, DTLB MPKI, L2 MPKI, L1-D MPKI, IQ_full_events, …
Y-axis (PC2) key components: L1-I MPKI, ITLB MPKI, BP MPKI, Inst mix, …
Series: Android, specInt2000Ref, specInt2006Ref, specFp2000Ref, specFp2006Ref.
Labelled points: rlbench, bbench, caffeinemark, wps, andebench, angrybirds, 400_perlbench, 445_gobmk, 253_perlbmk, 252_eon, 403_gcc, 471_omnetpp, 473_astar, 483_xalancbmk, 200_sixtrack, 172_mgrid, 450_soplex, 470_lbm, 183_equake, 179_art1/2, 433_milc, 429_mcf, 181_mcf.]
Fractional Factorial Designs
 Balanced experiment distribution
 Identify important factors
     2^(N-M) experiments << 2^N
 Looks for parameters where the average ‘+’ run is very different from ‘-’
 Experiments are tolerant to noise
 Does not identify what are the best options
 Narrows design space to what matters most
[Figure: cube of experiments with corners ---, +--, -+-, ++-, --+, +-+, -++, +++ over the factors DL1 Lat, DL1 Size and DL1 Assoc, with a table of the corresponding +/- settings.]
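A half-fraction design for three two-level factors, and the "average ‘+’ run versus average ‘-’ run" comparison, can be sketched in Python as follows. The generator C = A*B, the function names and the synthetic response are assumptions chosen for illustration; they are not taken from the study on the slide.

```python
from itertools import product


def half_fraction():
    """2^(3-1) design: the third factor's level is the product of the first two."""
    return [(a, b, a * b) for a, b in product((-1, 1), repeat=2)]


def main_effects(design, responses):
    """Per factor: mean response of '+' runs minus mean response of '-' runs."""
    effects = []
    for k in range(len(design[0])):
        plus = [r for run, r in zip(design, responses) if run[k] == 1]
        minus = [r for run, r in zip(design, responses) if run[k] == -1]
        effects.append(sum(plus) / len(plus) - sum(minus) / len(minus))
    return effects


design = half_fraction()  # 4 runs instead of 2^3 = 8
# Synthetic response: factor A matters a lot, B not at all, C a little.
responses = [10 + 3 * a + 0 * b + 1 * c for a, b, c in design]
print(design)
print(main_effects(design, responses))
```

The effect sizes rank the factors by importance, which is what the method is for; they do not say which level is best overall, and in a fractional design each main effect is confounded with an interaction of the other factors.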
Methodology
 Objective: To find the ideal heterogeneous system for a given set of workloads and hardware parameters
 Characterize and cluster workload phases
     Cluster based on performance sensitivity to various hardware parameters
 Selectively enable or disable hardware parameters per cluster of similar workload phases to improve their efficiency
[Flow: Workloads, Characterization, Clustering based on Similar Characteristics, Identification of ideal H/W config per core type, Evaluation of Heterogeneous Systems, Optimal Systems.]


Characterization Methodology
AutoGUI
     Record and deterministically playback GUI interactions
SimPoints
     300x speedup of our simulations
     Good correlation to full runs for statistics of interest
    Identifies unique phases of software behavior
PCA
     Quickly and automatically expose differences in elements of a large data set
     Compare and contrast phase behavior
Fractional Factorial
     Perform high-level coverage architectural exploration using a limited set of experiments
[Flow: Workloads, AutoGUI, SimPoints, Characterization (PCA, Fractional Factorial), Reduced Detailed Simulation.]
[Figures: Full Run vs SimPoint Run comparison; principal-component scatter plot of Android and SPEC workloads.]
Characterization Methodology
Tractable Simulation          Stage                   Comprehensive Characterization
Repeatable Simulation         AutoGUI                 Full Runs for Correlations
Reduced Simulation Time       SimPoints               Key Phase Identification
Guided Parameter Selection    PCA                     Workload Comparison, Phase Comparison
Reduced # of Experiments      Fractional Factorial    Sensitivity Analysis
[Flow: Workloads, AutoGUI, SimPoints, Characterization (PCA, Fractional Factorial), Reduced Detailed Simulation.]
Sunwoo, et al. “A Structured Approach to the Simulation, Analysis and Characterization of Smartphone Applications.” Published at IISWC 2013.
How to Contribute to gem5

Andreas Sandberg

Prerequisites
 gem5 is distributed under a 3-clause BSD license
 See LICENSE in the repository

 New code must have this license as well!

 It’s your responsibility to:


 Ensure that your contribution is covered by the license.
 Ensure that you have the right to submit the code
 Ensure that the right copyright notices are in place



Best practice
“How to operate your friendly reviewer”



How to structure your change
 What characterizes a good change?
 Small: Smaller changes are easier to review and understand.
 Well-defined: One commit == logical change
 No unrelated changes: Don’t sneak bug fixes into feature commits
 Descriptive commit message
 Always use your real name and email in the commit meta data

 What characterizes a change that makes reviewers cringe?


 Multiple changes going into the same commit “various bug fixes in Foo”
 Large changes that could have been broken into incremental changes
 Poorly written commit messages



The structure of a commit message
Summary: python: Move native wrappers to the _m5 namespace

Body: Swig wrappers for native objects currently share the _m5.internal name
space with Python code. This is undesirable if we ever want to switch
from Swig to some other framework for native binding (e.g., PyBind11
or Boost::Python). This changeset moves all of such wrappers to the
_m5 namespace, which is now reserved for native code.

Meta data: Change-Id: I2d2bc12dbc05b57b7c5a75f072e08124413d77f3


Signed-off-by: Andreas Sandberg <[email protected]>
Reviewed-by: Curtis Dunham <[email protected]>
Reviewed-by: Jason Lowe-Power <[email protected]>



Commit message: Summary line
Summary: python: Move native wrappers to the _m5 namespace

 Short summary of your change (max 65 characters)


 Think of it as a subject in an email
 Should uniquely identify your change
 Typically the first thing a potential reviewer sees
 Sometimes the only information shown about a change

 Keywords used to identify affected components


 See the wiki for details
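The two mechanical rules for the summary line (a length limit and leading keywords) are easy to check automatically. The Python sketch below is hypothetical: the exact keyword syntax is defined on the wiki, so the regular expression here is only an assumed approximation, and the function is not part of gem5's tooling.

```python
import re

# Assumed keyword syntax: one or more lowercase tags separated by commas,
# followed by ": " and the description (see the wiki for the real rules).
KEYWORD_PREFIX = re.compile(r"^[a-z0-9_-]+(,[a-z0-9_-]+)*: \S")


def check_summary(line, max_len=65):
    """Return a list of problems found in a commit summary line."""
    problems = []
    if len(line) > max_len:
        problems.append(f"summary is {len(line)} characters, limit is {max_len}")
    if not KEYWORD_PREFIX.match(line):
        problems.append("summary does not start with component keywords")
    return problems


print(check_summary("python: Move native wrappers to the _m5 namespace"))
print(check_summary("various bug fixes in Foo"))
```

A check like this could run from a local commit hook, so that formatting problems are caught before a reviewer sees the change.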



Commit message: Body
Body: Swig wrappers for native objects currently share the _m5.internal name
space with Python code. ...

 Should describe your change in detail – think of it as documentation


 Reviewers will read this before they see any code

 Describe what the change does and why


 Not necessarily how, that should be clear from the code
 Describe any implementation trade-offs
 Describe known limitations



Commit message: Metadata
Meta data: Change-Id: I2d2bc12dbc05b57b7c5a75f072e08124413d77f3
Signed-off-by: Andreas Sandberg <[email protected]>
Reviewed-by: Curtis Dunham <[email protected]>
Reviewed-by: Jason Lowe-Power <[email protected]>

 Change-Id: Unique ID used by Gerrit to identify the change (generated)


 Signed-off-by: It’s complicated…
 Reviewed-by: Use this to acknowledge reviewers (generated by Gerrit)
 Reviewed-on: Link to review request (generated by Gerrit)
 Reported-by: Use this to acknowledge users that report bugs
 Tested-by: Can be used to acknowledge testers



Developer Certificate of Origin
 By making a contribution to this project, I certify that:
a) The contribution was … by me and I have the right to submit it…; or
b) … is based upon previous work that … is covered under an appropriate open source
license and I have the right under that license to submit that work with modifications… ; or
c) The contribution was provided directly to me by some other person who certified (a), (b)
or (c) and I have not modified it.
d) I understand and agree that this project and the contribution are public and that a record
of the contribution … is maintained indefinitely and may be redistributed…

 See https://developercertificate.org/ for the full version.


 A Signed-off-by: tag indicates that you understand and agree to the DCO.



Submitting Code:
How to use the new Gerrit-based flow



Code submission flow
1. Post change for review
2. Wait for reviews (apply stick to reviewer)
3. Reviewers happy?
    No: Update change, then wait for reviews again
    Yes: Commit change
4. Done


The job of a reviewer
 Evaluate technical aspects
 Is it doing what it says in the commit message?
 Is it a technically sound implementation?

 Evaluate implementation aspects


 Is the commit message describing the change?
 Is it following the style guidelines?

 Legal aspects
 Patch author’s responsibility, but reviewers should look out for obvious issues.

You are the reviewers!


gem5 is changing
 Recently switched from Mercurial to Git
 Canonical repository on http://gem5.googlesource.com
 Mirror on GitHub: http://github.com/gem5

 Recently switched from ReviewBoard to Gerrit


 Automates code submission
 Tightly integrated with git
 Google (e.g., GMail) accounts for authentication
 Will integrate support for automatic testing



Setting up gerrit & git
 Prerequisites
 Google account registered with the email
address you use for contributions

 Where to start:
 http://gem5.googlesource.com

 Git authentication
 Required to push changes for review
 Uses https unlike most other installations
 Requires an authentication cookie



Posting a change for review
 Push to a “magical” git ref:
 refs/for/<branch>: Create a review request
 refs/drafts/<branch>: Create a draft review

 A push either updates an existing review or creates a new one


 More advanced usage described in the Gerrit manual

 Tips and tricks:


 Make sure that you assign one or more reviewers to the change
 Assign a topic name to related changes



Simple Example
# Create a local clone
$ git clone https://gem5.googlesource.com/public/gem5

<hack hack hack>

# Commit your changes
$ git add -i
$ git commit -m "test commit"

# Push changes for review
$ git push origin HEAD:refs/for/master

remote: New Changes:
remote:   https://gem5-review.googlesource.com/2160 Test commit
remote:
To https://gem5.googlesource.com/public/gem5
 * [new branch] HEAD -> refs/for/master
https://gem5-review.googlesource.com/2160



Reviewing code in Gerrit
 Changes can only be submitted if they have been:
 Reviewed
 Accepted by a maintainer
 Passed automatic testing

 Gerrit uses labels to enforce these policies:


 Code-Review: Normal code reviews, anyone can use these.
 Maintainer: Only available to maintainers, required for submission.
 Verified: Used by CI system to accept/reject depending on test outcomes
 Style-Check: Automatic style checking

 Maintainers can override labels if they are obviously wrong
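The label policy can be read as a simple predicate over label scores. The sketch below is an illustration only: the function name `can_submit` is hypothetical, and the exact thresholds (at least one positive Code-Review, Maintainer and Verified at +1 or higher) are assumptions rather than the actual gem5 Gerrit configuration. Style-Check and maintainer overrides are left out for brevity.

```shell
# can_submit MIN_CODE_REVIEW MAX_CODE_REVIEW MAINTAINER VERIFIED
# Returns 0 if the change may be submitted under the ASSUMED policy:
#   - no reviewer has voted -2 on Code-Review (a -2 blocks submission)
#   - at least one reviewer has given a positive Code-Review score
#   - the Maintainer and Verified labels are both positive
can_submit() {
    min_cr="$1"; max_cr="$2"; maintainer="$3"; verified="$4"
    [ "$min_cr" -gt -2 ]    || return 1   # any -2 blocks the change
    [ "$max_cr" -ge 1 ]     || return 1   # reviewed (assumed threshold)
    [ "$maintainer" -ge 1 ] || return 1   # accepted by a maintainer
    [ "$verified" -ge 1 ]   || return 1   # passed automatic testing
    return 0
}

can_submit -1 2 1 1 && echo "submittable"
can_submit -2 2 1 1 || echo "blocked by a -2 review"
can_submit 0 2 0 1  || echo "waiting for a maintainer"
```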



Code submission flow
 Post change for review
 Wait for reviews
 Reviewers happy?
 No: update change and wait for reviews again
 Yes: continue
 Maintainer happy?
 No: update change and wait for reviews again
 Yes: commit change
 Done

How to review code
 Start with the commit message
 Does it make sense?
 Is it a change that makes sense in gem5? Why/Why not?
 Look at the code
 Is it solving the problem in the description?
 Is the implementation technically sound? Are there obvious bugs?
 Comment on the code and submit a review score
 -2: Don’t submit under any circumstances (blocks submission)
 …
 +2: Looks good, approved!
 Be polite and kind
 Developers and reviewers are people too!



Further information - gem5 related papers from ARM Research
 Sunwoo, Dam, et al. "A structured approach to the simulation, analysis and
characterization of smartphone applications." IISWC'13
 Gutierrez, Anthony, et al. "Sources of error in full-system simulation." ISPASS'14
 Hansson, Andreas, et al. "Simulating DRAM controllers for future system architecture
exploration." ISPASS'14
 De Jong, Rene, and Andreas Sandberg. "NoMali: Simulating a realistic graphics driver
stack using a stub GPU." ISPASS'16
 Rusitoru, Roxana. "ARMv8 micro-architectural design space exploration for high
performance computing using fractional factorial." PMBS'15
 Spiliopoulos, Vasileios, et al. "Introducing DVFS-Management in a Full-System
Simulator." MASCOTS'13
 Walker, Matthew J., et al. "Accurate and Stable Run-Time Power Modeling for Mobile
and Embedded CPUs." IEEE Trans. on CAD of Integrated Circuits and Systems 36, 2017

 Jagtap, Radhika, et al. "Elastic traces for fast and accurate system performance
exploration." ISPASS'16
 Alian, Mohammad, et al. "dist-gem5: Distributed simulation of computer clusters."
ISPASS'17



11-13 September 2017, Robinson College, Cambridge, UK
 Submission deadline: 30 April 2017
 Early-bird discount ends: 30 June 2017
