Computer Architecture
Lecture 14: Simulation
(with a Focus on Memory)
Prof. Onur Mutlu
ETH Zürich
Fall 2020
12 November 2020
Simulating (Memory)
Systems
Evaluating New Ideas
for New (Memory)
Architectures
Potential Evaluation Methods
How do we assess how an idea will affect a target
metric X?
A variety of evaluation methods are available:
Theoretical proof
Analytical modeling/estimation
Simulation (at varying degrees of abstraction and
accuracy)
Prototyping with a real system (e.g., FPGAs)
Real implementation
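As a small illustration of analytical modeling, the sketch below estimates average memory access time (AMAT) as hit time plus miss rate times miss penalty. The function name and all numbers are illustrative assumptions, not values from the lecture.

```python
def amat(hit_time_cycles: float, miss_rate: float, miss_penalty_cycles: float) -> float:
    """Average memory access time = hit time + miss rate * miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time_cycles + miss_rate * miss_penalty_cycles


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show how the estimate moves.
    for miss_rate in (0.02, 0.05, 0.10):
        print(f"miss rate {miss_rate:.2f}: AMAT = {amat(2, miss_rate, 100):.1f} cycles")
```

Such a model answers "what if the miss rate halves?" instantly, but it says nothing about effects it does not model (e.g., contention or overlap of misses), which is where simulation comes in.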
The Difficulty in Architectural
Evaluation
The answer is usually workload dependent
E.g., think caching
E.g., think pipelining
E.g., think any idea we talked about (RAIDR, Mem.
Sched., …)
Workloads change
System has many design choices and parameters
Architect needs to decide on many ideas and many
parameters for a design
Not easy to evaluate all possible combinations!
System parameters may change
Simulation: The Field of Dreams
Dreaming and Reality
An architect is in part a dreamer, a creator
Simulation is a key tool of the architect
Allows the evaluation & understanding of non-existent
systems
Simulation enables
The exploration of many dreams
A reality check of the dreams
Deciding which dream is better
Simulation also enables
The ability to fool yourself with false dreams
Why High-Level Simulation?
Problem: RTL simulation is intractable for design
space exploration: too time consuming to design
and evaluate
Especially over a large number of workloads
Especially if you want to predict the performance of a
good chunk of a workload on a particular design
Especially if you want to consider many design choices
Cache size, associativity, block size, algorithms
Memory control and scheduling algorithms
In-order vs. out-of-order execution
Reservation station sizes, ld/st queue size, register file
size, …
…
Goal: Explore design choices quickly to see their
impact on the workloads we are designing the
Different Goals in Simulation
Explore the design space quickly and see what you
want to
potentially implement in a next-generation platform
propose as the next big idea to advance the state of the art
the goal is mainly to see relative effects of design decisions
Match the behavior of an existing system so that
you can
debug and verify it at cycle-level accuracy
propose small tweaks to the design that can make a
difference in performance or energy
the goal is very high accuracy
Other goals in-between:
Refine the explored design space without going into a
full detailed, cycle-accurate design
Gain confidence in your design decisions made by
Tradeoffs in Simulation
Three metrics to evaluate a simulator
Speed
Flexibility
Accuracy
Speed: How fast the simulator runs (xIPS, xCPS,
slowdown)
Flexibility: How quickly one can modify the simulator
to evaluate different algorithms and design choices?
Accuracy: How accurate the performance (energy)
numbers the simulator generates are vs. a real
design (Simulation error)
The relative importance of these metrics varies
depending on where you are in the design process
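The speed and accuracy metrics above can be made concrete with a few one-line formulas. The sketch below is a minimal illustration; the function names and the example numbers are assumptions, not measurements of any real simulator.

```python
def simulated_ips(instructions: int, wall_seconds: float) -> float:
    """Simulation speed: simulated instructions per wall-clock second."""
    return instructions / wall_seconds


def slowdown(sim_wall_seconds: float, native_seconds: float) -> float:
    """How many times slower the simulation is than the real machine."""
    return sim_wall_seconds / native_seconds


def simulation_error(simulated: float, measured: float) -> float:
    """Relative error of a simulated metric (e.g., IPC) vs. a real design."""
    return abs(simulated - measured) / measured


if __name__ == "__main__":
    # Hypothetical run: 1 billion instructions simulated in 500 seconds,
    # while the same region takes 0.5 seconds on real hardware.
    print(f"speed:    {simulated_ips(1_000_000_000, 500.0) / 1e6:.1f} MIPS")
    print(f"slowdown: {slowdown(500.0, 0.5):.0f}x")
    print(f"error:    {simulation_error(1.9, 2.0):.1%}")
```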
Trading Off Speed, Flexibility,
Accuracy
Speed & flexibility affect:
How quickly you can make design tradeoffs
Accuracy affects:
How good your design tradeoffs may end up being
How fast you can build your simulator (simulator design
time)
Flexibility also affects:
How much human effort you need to spend modifying
the simulator
You can trade off between the three to achieve
design exploration and decision goals
High-Level Simulation
Key Idea: Raise the abstraction level of modeling to
give up some accuracy to enable speed & flexibility
(and quick simulator design)
Advantage
+ Can still make the right tradeoffs, and can do it quickly
+ All you need is modeling the key high-level factors,
you can omit corner case conditions
+ All you need is to get the “relative trends”
accurately, not exact performance numbers
Disadvantage
-- Opens up the possibility of potentially wrong decisions
-- How do you ensure you get the “relative trends”
accurately?
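To illustrate what "relative trends" from a high-level model can look like, here is a minimal trace-driven sketch of a set-associative LRU cache that compares hit rates across cache sizes. It models no timing at all. The synthetic trace, the function names, and the parameter values are all assumptions made for illustration.

```python
import random
from collections import OrderedDict


def hit_rate(trace, cache_bytes, block_bytes=64, ways=4):
    """Hit rate of a set-associative LRU cache over a trace of byte addresses."""
    num_sets = cache_bytes // (block_bytes * ways)
    if num_sets < 1:
        raise ValueError("cache too small for this block size and associativity")
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in trace:
        block = addr // block_bytes
        lines = sets[block % num_sets]
        if block in lines:
            hits += 1
            lines.move_to_end(block)       # mark as most recently used
        else:
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used block
            lines[block] = True
    return hits / len(trace)


def synthetic_trace(n=20000, seed=0):
    """Hypothetical workload: 90% of accesses fall in a 16 KiB hot region."""
    rng = random.Random(seed)
    hot, cold = 16 * 1024, 4 * 1024 * 1024
    return [rng.randrange(hot) if rng.random() < 0.9 else rng.randrange(cold)
            for _ in range(n)]


if __name__ == "__main__":
    trace = synthetic_trace()
    for kib in (4, 16, 64):
        print(f"{kib:3d} KiB: hit rate = {hit_rate(trace, kib * 1024):.3f}")
```

The absolute hit rates mean little for a real design, since the trace is synthetic. The useful output is the ordering of the design points, and whether that ordering can be trusted is exactly the question the disadvantage bullets raise.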
Simulation as Progressive
Refinement
High-level models (Abstract, C)
…
Medium-level models (Less abstract)
…
Low-level models (RTL with everything modeled)
…
Real design
As you refine (go down the above list)
Abstraction level reduces
Accuracy (hopefully) increases (not necessarily, if not
careful)
Flexibility reduces; Speed likely reduces except for real
design
Making The Best of
Architecture
A good architect is comfortable at all levels of
refinement
Including the extremes
A good architect knows when to use what type of
simulation
And, more generally, what type of evaluation method
Recall: A variety of evaluation methods are
available:
Theoretical proof
Analytical modeling
Simulation (at varying degrees of abstraction and
accuracy)
Prototyping with a real system (e.g., FPGAs)
An Example Simulator
Ramulator: A Fast and
Extensible DRAM Simulator
[IEEE Comp Arch Letters’15]
Ramulator Motivation
DRAM and Memory Controller landscape is changing
Many new and upcoming standards
Many new controller designs
A fast and easy-to-extend simulator is very much needed
Ramulator
Provides out-of-the-box support for many DRAM
standards:
DDR3/4, LPDDR3/4, GDDR5, WIO1/2, HBM, plus new
proposals (SALP, AL-DRAM, TLDRAM, RowClone, and
SARP)
~2.5X faster than the fastest open-source simulator
Modular and extensible to different standards
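To give a feel for what a DRAM simulator models at its core, the sketch below implements a single bank with an open-row policy and three latency cases (row hit, closed row, row conflict). This is not Ramulator's code or API. The class names and timing values are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Timing:
    """Hypothetical timing parameters in DRAM cycles (not from any datasheet)."""
    tRCD: int = 14  # ACTIVATE to READ delay
    tCL: int = 14   # READ to data delay
    tRP: int = 14   # PRECHARGE to ACTIVATE delay


class Bank:
    """One DRAM bank that keeps the last accessed row open."""

    def __init__(self, timing: Timing):
        self.timing = timing
        self.open_row: Optional[int] = None

    def read(self, row: int) -> int:
        """Return the latency in cycles of a read to `row` and update bank state."""
        t = self.timing
        if self.open_row == row:       # row hit: data is already in the row buffer
            latency = t.tCL
        elif self.open_row is None:    # closed row: activate, then read
            latency = t.tRCD + t.tCL
        else:                          # row conflict: precharge, activate, read
            latency = t.tRP + t.tRCD + t.tCL
        self.open_row = row
        return latency


if __name__ == "__main__":
    bank = Bank(Timing())
    print([bank.read(row) for row in (3, 3, 7)])  # prints [28, 14, 42]
```

In this sketch, a different standard would be modeled by passing a different `Timing` instance. A real simulator also has to model commands, multiple banks and ranks, refresh, and many more timing constraints.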
Case Study: Comparison of DRAM
Standards
Across 22 workloads, simple CPU model
Ramulator Paper and Source
Code
Yoongu Kim, Weikun Yang, and Onur Mutlu,
"Ramulator: A Fast and Extensible DRAM Simulator"
IEEE Computer Architecture Letters (CAL), March 2015.
[Source Code]
Source code is released under the liberal MIT
License
https://github.com/CMU-SAFARI/ramulator
Bonus Assignment as Part of HW #4
Review the Ramulator paper
Same points as any other BONUS review in HW #4
An Example Study using
Ramulator
An Example Study with Ramulator (I)
Saugata Ghose, Tianshi Li, Nastaran Hajinazar, Damla Senol Cali, and Onur Mutlu,
"Demystifying Workload–DRAM Interactions: An Experimental Study"
Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Phoenix, AZ, USA, June 2019.
[Preliminary arXiv Version]
[Abstract]
[Slides (pptx) (pdf)]
[MemBen Benchmark Suite]
[Source Code for GPGPUSim-Ramulator]
Why Study Workload–DRAM Interactions?
Manufacturers are developing many new
types of DRAM
• DRAM limits performance, energy improvements:
new types may overcome some limitations
• Memory systems now serve a very diverse set of
applications:
can no longer take a one-size-fits-all approach
So which DRAM type works best with which
application?
• Difficult to understand intuitively due to the complexity of
the interaction
• Can’t be tested methodically on real systems: new type
needs a new CPU
We perform a wide-ranging experimental study to uncover the combined behavior of workloads, DRAM types
Modern DRAM Types: Comparison to DDR3
DRAM Type                      Banks per Rank
DDR3                           8
DDR4                           16
GDDR5                          16
HBM (High-Bandwidth Memory)    16
HMC (Hybrid Memory Cube)       256
Wide I/O                       4
Wide I/O 2                     8
LPDDR3                         8
Other table columns: Bank Groups, 3D-Stacked, Low-Power
Figure annotations:
Bank groups: increased latency, increased area/power
3D-stacked DRAM: high bandwidth with Through-Silicon Vias (TSVs)
HMC: narrower rows, higher latency
Diagram labels: bank, bank group, memory channel, memory layers, dedicated logic layer
4. Need for Lower Access Latency: Performance
New DRAM types often increase access latency in order to provide more banks, higher throughput
Many applications can’t make up for the increased latency
• Especially true of common OS routines (e.g., file I/O, process forking)
[Figure: speedup (y-axis, 0.8 to 1.2) with DDR4, GDDR5, HBM, and HMC for forkbench, shell, bootup, Netperf (TCP_STREAM, TCP_RR, UDP_STREAM, UDP_RR), and IOZone with a 64MB file (Test 0 to Test 12)]
Several applications don’t benefit from more parallelism
Key Takeaways
1. DRAM latency remains a critical bottleneck for many applications
2. Bank parallelism is not fully utilized by a wide variety of our applications
3. Spatial locality continues to provide significant performance benefits if it is exploited by the memory subsystem
4. For some classes of applications, low-
Conclusion
Manufacturers are developing many new types of DRAM
• DRAM limits performance, energy improvements: new types may overcome some limitations
• Memory systems now serve a very diverse set of applications: can no longer take a one-size-fits-all approach
• Difficult to intuitively determine which DRAM–workload pair works best
We perform a wide-ranging experimental study to uncover the combined behavior of workloads, DRAM types
• 115 prevalent/emerging applications and multiprogrammed workloads
Open-source tools: https://github.com/CMU-SAFARI/ramulator
Full paper: https://arxiv.org/pdf/1902.07609
For More Information…
Saugata Ghose, Tianshi Li, Nastaran Hajinazar, Damla Senol Cali, and Onur Mutlu,
"Demystifying Workload–DRAM Interactions: An Experimental Study"
Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Phoenix, AZ, USA, June 2019.
[Preliminary arXiv Version]
[Abstract]
[Slides (pptx) (pdf)]
[MemBen Benchmark Suite]
[Source Code for GPGPUSim-Ramulator]
Ramulator for Processing in
Memory
Simulation Infrastructures for
PIM
Ramulator extended for PIM
Flexible and extensible DRAM simulator
Can model many different memory standards and
proposals
Kim+, “Ramulator: A Fast and Extensible
DRAM Simulator”, IEEE CAL 2015.
https://github.com/CMU-SAFARI/ramulator-pim
https://github.com/CMU-SAFARI/ramulator
[Source Code for Ramulator-PIM]
Ramulator for PIM
Gagandeep Singh, Juan Gomez-Luna, Giovanni Mariani, Geraldo F. Oliveira, Stefano Corda, Sander Stuijk, Onur Mutlu, and Henk Corporaal,
"NAPEL: Near-Memory Computing Application Performance Prediction via Ensemble Learning"
Proceedings of the 56th Design Automation Conference (DAC), Las Vegas, NV, USA, June 2019.
[Slides (pptx) (pdf)]
[Poster (pptx) (pdf)]
[Source Code for Ramulator-PIM]
What We Discussed Is Applicable to
Other Types of Simulation
Case Study:
COVID-19 Spread
Modeling and Prediction
COVID-19 Measures: Evaluation
Methods
How do we assess how an idea will affect a target
metric X?
A variety of evaluation methods are available:
Theoretical proof
Analytical modeling/estimation
Simulation (at varying degrees of abstraction and
accuracy)
Prototyping with a real system (e.g., FPGAs)
Real implementation
Simulating COVID-19 Spread
An architect is in part a dreamer, a creator
Simulation is a key tool of the architect
Allows the evaluation & understanding of non-existent
systems
Simulation enables
The exploration of many dreams
A reality check of the dreams
Deciding which dream is better
Simulation also enables
The ability to fool yourself with false dreams
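In the COVID-19 setting, a high-level simulation can be as simple as a compartmental model. The sketch below uses a discrete-time SIR (susceptible, infected, recovered) model to compare two scenarios. The lecture does not name a specific model, so SIR is an assumption here, and all parameter values are hypothetical rather than fitted to real data.

```python
def simulate_sir(population, initial_infected, beta, gamma, days):
    """Discrete-time SIR model with a one-day step.

    beta:  new infections per infected person per day when everyone
           else is susceptible
    gamma: fraction of infected people who recover each day
    Returns a list of (susceptible, infected, recovered), one tuple per day.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


def peak_infected(history):
    """Largest number of simultaneously infected people over the run."""
    return max(infected for _, infected, _ in history)


if __name__ == "__main__":
    # Hypothetical scenarios: a measure that halves the contact rate beta.
    baseline = simulate_sir(1_000_000, 10, beta=0.30, gamma=0.10, days=365)
    measure = simulate_sir(1_000_000, 10, beta=0.15, gamma=0.10, days=365)
    print(f"peak infected, baseline:     {peak_infected(baseline):,.0f}")
    print(f"peak infected, with measure: {peak_infected(measure):,.0f}")
```

As with architectural simulation, the relative effect of the measure is the interesting output. The absolute numbers are only as good as the model and its parameters, which makes it easy to fool yourself.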
Goals in Simulating COVID-19
Spread
Explore the design space quickly and see what you
want to
potentially implement in a next-generation platform
propose as the next big idea to advance the state of the art
the goal is mainly to see relative effects of design decisions
Match the behavior of an existing system so that
you can
debug and verify it at cycle-level accuracy
propose small tweaks to the design that can make a
difference in performance or energy
the goal is very high accuracy
Other goals in-between:
Refine the explored design space without going into a
full detailed, cycle-accurate design
Gain confidence in your design decisions made by