07 Firesim Intro
https://fires.im
@firesimproject
Sagar Karandikar, David Biancolin, Howard Mao, Alon Amid, Nathan Pemberton, Albert Magyar,
Albert Ou, Qijing Huang, Randy Katz, Borivoje Nikolić, Jonathan Bachrach, Krste Asanović
The architect/chip-developer’s design flow
1. High-level Simulation
2. Write RTL + Software, plug into your favorite ecosystem (e.g. Chipyard)
3. Co-design in software RTL sim (e.g. Verilator, VCS, etc.)
• Run microbenchmarks
4. Co-design in FPGA-accelerated simulation
• Boot an OS and run the complete software stack,
obtain realistic performance measurements
5. Tapeout → Chip
• Boot OS and run applications, but no more opportunity for co-design
What about FPGA prototyping?
[Figure: a taped-out SoC (Rocket cores with L1I/L1D caches, a shared L2, DRAM at 100 ns latency) beside an FPGA prototype of the same SoC, where the Rocket RTL runs on the FPGA fabric and the FPGA's DRAM is also at 100 ns latency.]
Want HW simulators that:
• Are as fast as silicon
• Are as detailed as silicon
• Have all the benefits of SW-based simulators
• Are low-cost
Our Thesis:
• FPGAs are the only viable basis technology
→ Build FPGA-accelerated simulators with
SW-like flexibility using an open-source tool
How? Useful Trends Throughout the Stack
• Open ISA
• Open, Silicon-Proven SoC Implementations
• High-Productivity Hardware Design Language & IR
• FPGAs in the Cloud
FireSim at 35,000 feet
• Open-source, fast, automatic, deterministic FPGA-accelerated
hardware simulation for pre-silicon verification and performance
validation
• Ingests:
• Your RTL design (FIRRTL, either via Chisel or Verilog via Yosys*)
• HW and/or SW IO models (e.g. UART, Ethernet, DRAM, etc.)
• Workload descriptions
• Produces:
• Fast, cycle-exact simulation of your design + models around it
• Automatically deployed to cloud FPGAs (AWS EC2 F1)
Three Distinguishing Features of FireSim
1) Not FPGA prototypes, rather FPGA-accelerated simulators
• Automatic transformation of designs into FPGA-accelerated
simulators
• Enables new debugging, resource optimization, and profiling
capabilities
2) Uses cloud FPGAs
• Inexpensive, elastic supply of large FPGAs
• Easy to collaborate with other researchers
• Heavy automation to hide FPGA complexity
3) Open-source (https://fires.im)
Separating Target and Host
• Target: the machine under simulation
• Host: the machine executing (hosting) the simulation
[Figure: the target is a taped-out multiprocessor (RTL CPU cores at 1 GHz, DRAM at 100 ns latency); the host is an FPGA whose physical DRAM sits behind a memory channel at 100 ns latency.]
Closed simulation world.
FireSim Generates FPGA-Hosted Simulators
[Figure: the target, a taped-out multiprocessor (RTL CPU cores at 1 GHz, DRAM at 100 ns latency), is hosted on the FPGA fabric, whose physical DRAM also has 100 ns latency.]
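The figure labels give the target a 1 GHz clock and a 100 ns DRAM latency. A short calculation, sketched below, shows why the clock frequency matters when the same physical DRAM serves a design running at FPGA speeds. The 50 MHz FPGA clock is an illustrative assumption, not a number from the slides.

```python
def latency_in_cycles(latency_ns: float, clock_hz: float) -> float:
    """Convert a wall-clock latency into clock cycles at the given frequency."""
    return latency_ns * clock_hz / 1e9


TARGET_CLOCK_HZ = 1e9     # 1 GHz taped-out target (from the slide)
DRAM_LATENCY_NS = 100.0   # 100 ns DRAM latency (from the slide)
FPGA_CLOCK_HZ = 50e6      # ASSUMED FPGA clock, for illustration only

target_cycles = latency_in_cycles(DRAM_LATENCY_NS, TARGET_CLOCK_HZ)
fpga_cycles = latency_in_cycles(DRAM_LATENCY_NS, FPGA_CLOCK_HZ)

print(f"DRAM latency seen by the 1 GHz target: {target_cycles:.0f} cycles")
print(f"DRAM latency seen by RTL at 50 MHz:    {fpga_cycles:.0f} cycles")
```

Under that assumption, a design clocked directly on the FPGA would see DRAM that looks far faster, in cycles, than the taped-out chip would. This is the mismatch a target/host split has to correct.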
Host Decoupling in FireSim: Transforming the Target
1) Convert RTL into a latency-insensitive [1] model using FIRRTL transform
4) Allocate host resources to models
[Figure: the RTL design is decoupled from its DRAM by a request queue and a response queue. A FASED[2] DDR3 DRAM timing model (4 GB) on the FPGA fabric presents a 100-cycle latency to the target, backed by physical DRAM behind a host memory channel with 100 ns latency.]
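Step 1 above converts the RTL into a latency-insensitive model. The idea can be illustrated with a toy software sketch: the target advances one cycle only when an input token is available, so stalls on the host change how long the simulation takes but not what the target computes. This is not FireSim's generated hardware; `ToyTarget` and `run` are hypothetical names used only for illustration.

```python
from collections import deque
import random


class ToyTarget:
    """Toy target: each target cycle consumes one token and emits a running sum."""

    def __init__(self):
        self.acc = 0
        self.cycle = 0

    def step(self, token: int) -> int:
        self.acc += token
        self.cycle += 1
        return self.acc


def run(inputs, stall_prob: float, seed: int):
    """Simulate the target behind an input queue fed by an unreliable host.

    Returns (target outputs, number of host cycles spent).
    stall_prob must be < 1.0 or the host never delivers a token.
    """
    rng = random.Random(seed)
    pending = deque(inputs)   # tokens the host has not delivered yet
    in_q = deque()            # queue between host and target
    outputs = []
    target = ToyTarget()
    host_cycles = 0
    while len(outputs) < len(inputs):
        host_cycles += 1
        # Host side: delivery may stall (models variable host latency).
        if pending and rng.random() >= stall_prob:
            in_q.append(pending.popleft())
        # Target side: advance one target cycle only if a token is ready.
        if in_q:
            outputs.append(target.step(in_q.popleft()))
    return outputs, host_cycles


fast_out, fast_cycles = run([1, 2, 3, 4, 5], stall_prob=0.0, seed=0)
slow_out, slow_cycles = run([1, 2, 3, 4, 5], stall_prob=0.7, seed=0)
print(fast_out == slow_out, fast_cycles, slow_cycles)
```

The two runs produce identical target outputs even though the stalled host needs more host cycles. That property is what allows a slow or variable host resource, such as physical DRAM, to stand behind a timing model without changing target behavior.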
Example use cases: Evaluating SoC Designs
• Performance Measurement
• Run SPECint 2017 with reference inputs on Rocket Chip in parallel on ~10
FPGAs within a day (e.g., in D. Biancolin, et al., FASED, FPGA ’19)
• Rapid Full-System Design Space Exploration
• Data-parallel accelerators (Hwacha) and multi-core processors
• Complex software stacks (Linux, OpenMP, GraphMat, Caffe)
• Security:
• BOOM Spectre replication
• A. Gonzalez, et al., Replicating and Mitigating Spectre Attacks on an
Open Source RISC-V Microarchitecture, CARRV ’19
• Keystone Enclave performance evaluation
• D. Lee, et al., Keystone, EuroSys ’20
• Accelerator evaluation
• Chisel-based accelerators:
• ML (H. Genc, et al., Gemmini, arXiv)
• Garbage collection (M. Maas, et al., A Hardware Accelerator for
Tracing Garbage Collection, ISCA ’18)
• NVDLA (F. Farshchi, et al., Integrating NVIDIA Deep Learning
Accelerator (NVDLA) with RISC-V SoC on FireSim, EMC2 ’19)
• HLS-based rapid prototyping (Q. Huang, et al., Centrifuge,
ICCAD ’19)
• Novel scale-out systems
• nanoPU NIC-CPU co-design (S. Ibanez, et al., nanoPU, OSDI ’21)
Example use cases: Debugging and Profiling SoC Designs
• Debugging a Chisel design at FPGA speeds
• e.g. FireSim Debugging Docs
• e.g. Fixing BOOM Bugs (D. Kim, et al., DESSERT, FPL ’18)
• Profiling a custom RISC-V SoC at FPGA speeds
• e.g. HW/SW Co-design of a networked RISC-V system (S. Karandikar, et al., FirePerf, ASPLOS 2020)
How to build a datacenter-scale FireSim simulation
[1] S. Karandikar et al., “FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud.” ISCA 2018
[2] S. Karandikar et al., “FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud.” IEEE Micro Top Picks 2018
The new datacenter hardware environment
• Deeper memory/storage hierarchies, e.g. 3DXPoint, HBM
• New datacenter architectures, e.g. disaggregation
[1]
Disaggregated Datacenters
Diagram from Gao et al., OSDI’16
Mapping a datacenter simulation
• DC simulation requires:
• Model hardware at scale, cycle-accurately
• Run real software
• RTL and abstract SW model co-simulation
• Server Simulations
• Good fit for the FPGA
• We have tapeout-proven RTL: FAME-1 transform w/Golden-Gate
• Network simulation
• Little parallelism in switch models (e.g. a thread per port)
• Need to coordinate all the distributed server simulations
• So use CPUs + host network
[Figure: an f1.16xlarge host. The host CPU runs the switch model and connects over PCIe to 8 FPGAs, each running server simulation(s).]
Step 1: Server SoC in RTL
Modeled System
- 4x RISC-V Rocket Cores @ 3.2 GHz
- 16K I/D L1$
- 256K Shared L2$
- 200 Gb/s Eth. NIC
Resource Util.
- < ¼ of an FPGA
Sim Rate
- N/A
[Figure: server blade with four Rocket cores (each with L1I and L1D caches), a shared L2, a NIC, and other peripherals; simulation endpoints for the NIC and other peripherals connect over PCIe to the host.]
Step 2: FPGA Simulation of one server blade
Modeled System
- 4x RISC-V Rocket Cores @ 3.2 GHz
- 16K I/D L1$
- 256K Shared L2$
- 200 Gb/s Eth. NIC
- 16 GB DDR3
Resource Util.
- < ¼ of an FPGA
- ¼ Mem Chans
Sim Rate
- ~150 MHz
- ~40 MHz (netw)
[Figure: the server blade simulation on the FPGA fabric with a DRAM model backed by FPGA DRAM; NIC and other-peripheral simulation endpoints connect over PCIe to the host.]
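To put the simulation rates in perspective, the sketch below compares them with the 3.2 GHz target clock listed for this step. It assumes the quoted sim rate means target cycles simulated per wall-clock second; the function name is hypothetical.

```python
TARGET_CLOCK_HZ = 3.2e9   # modeled Rocket cores @ 3.2 GHz (from the slide)


def slowdown(sim_rate_hz: float, target_hz: float = TARGET_CLOCK_HZ) -> float:
    """Wall-clock seconds needed to simulate one second of target time."""
    return target_hz / sim_rate_hz


standalone = slowdown(150e6)   # ~150 MHz sim rate, non-networked
networked = slowdown(40e6)     # ~40 MHz sim rate, networked

print(f"non-networked: {standalone:.1f}x slower than real time")
print(f"networked:     {networked:.1f}x slower than real time")
```

Under that reading, one second of target execution takes a few tens of seconds of wall-clock time on one FPGA, which is what makes booting an OS and running full workloads practical.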
Step 3: FPGA Simulation of 4 server blades
Modeled System
- 4 Server Blades
- 16 Cores
- 64 GB DDR3
Resource Util.
- < 1 FPGA
- 4/4 Mem Chans
Sim Rate
- ~14.3 MHz (netw)
Cost:
- $1.65 per hour (on-demand)
[Figure: one FPGA running four server blade simulations (4 Sims), each with its own DRAM model and FPGA DRAM channel.]
Step 4: Simulating a 32 node rack
Modeled System
- 32 Server Blades
- 128 Cores
- 512 GB DDR3
- 200 Gb/s, 2us links
Resource Util.
- 8 FPGAs = 1x f1.16xlarge
Sim Rate
- ~10.7 MHz (netw)
[Figure: eight FPGAs (4 Sims each) attached to one host instance, whose CPU runs the ToR switch model.]
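The link parameters on this slide (200 Gb/s, 2us) can be restated in target cycles, which is the unit a cycle-accurate network model works in. The sketch below assumes the 3.2 GHz target clock from the server blade slides; the variable names are illustrative.

```python
TARGET_CLOCK_HZ = 3.2e9       # target clock from the server blade slides
LINK_BANDWIDTH_BPS = 200e9    # 200 Gb/s links (from the slide)
LINK_LATENCY_S = 2e-6         # 2 us link latency (from the slide)

# Target cycles that elapse while data crosses one link.
link_latency_cycles = round(LINK_LATENCY_S * TARGET_CLOCK_HZ)

# Bits each link carries per target cycle at full rate.
bits_per_cycle = LINK_BANDWIDTH_BPS / TARGET_CLOCK_HZ

# Bits in flight on one fully loaded link.
bits_in_flight = round(LINK_BANDWIDTH_BPS * LINK_LATENCY_S)

print(link_latency_cycles, bits_per_cycle, bits_in_flight)
```

At these settings each link spans thousands of target cycles. The server simulations on the FPGAs and the switch model on the host CPU must agree on that delay cycle for cycle.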
Step 5: Simulating a 256 node “aggregation pod”
Modeled System
- 256 Server Blades
- 1024 Cores
- 4 TB DDR3
- 8 ToRs, 1 Aggr
- 200 Gb/s, 2us links
Resource Util.
- 64 FPGAs
Sim Rate
- ~9 MHz (netw)
[Figure: racks, each with its own ToR switch, connected through one aggregation switch.]
Step 6: Simulating a 1024 node datacenter
Modeled System
- 1024 Servers
- 4096 Cores
- 16 TB DDR3
- 32 ToRs, 4 Aggr, 1
Resource Util.
- 256 FPGAs
at a cost-to-user of only 100s of dollars/hour
[Figure: aggregation pods, each made of racks under an aggregation switch.]
FireSim ISCA’18 paper selected as an IEEE Micro Top Pick of 2018 Arch. Confs
and as the CACM Research Highlights Nominee from ISCA’18
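The per-step numbers in Steps 2 to 6 follow a simple pattern: 4 cores and 16 GB of DDR3 per blade, 4 blade simulations per FPGA, and 8 FPGAs per f1.16xlarge. The sketch below recomputes them for a given node count. The instance counts for 256 and 1024 nodes are derived here rather than taken from the slides, and any host instances needed only for switch models are not counted.

```python
import math
from dataclasses import dataclass

CORES_PER_BLADE = 4          # 4x Rocket cores per server blade
DRAM_GB_PER_BLADE = 16       # 16 GB DDR3 per server blade
SIMS_PER_FPGA = 4            # 4 blade simulations per FPGA
FPGAS_PER_F1_16XLARGE = 8    # 8 FPGAs = 1x f1.16xlarge


@dataclass(frozen=True)
class Footprint:
    nodes: int
    cores: int
    dram_gb: int
    fpgas: int
    f1_16xlarge: int   # derived count of FPGA instances only


def footprint(nodes: int) -> Footprint:
    """Estimate target size and FPGA resources for a simulated cluster."""
    fpgas = math.ceil(nodes / SIMS_PER_FPGA)
    return Footprint(
        nodes=nodes,
        cores=nodes * CORES_PER_BLADE,
        dram_gb=nodes * DRAM_GB_PER_BLADE,
        fpgas=fpgas,
        f1_16xlarge=math.ceil(fpgas / FPGAS_PER_F1_16XLARGE),
    )


for n in (32, 256, 1024):
    print(footprint(n))
```

For 32, 256, and 1024 nodes this reproduces the core, DRAM, and FPGA counts listed in the slides (for example, 1024 nodes map to 4096 cores, 16 TB of DDR3, and 256 FPGAs).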
Wrapping-up: Productive Open-Source FPGA Simulation
• github.com/firesim/firesim, BSD Licensed
• An “easy” button for fast, FPGA-accelerated full-system simulation
• Plug in your own RTL designs, your own HW/SW models
• One-click: Parallel FPGA builds, Simulation run/result collection, building target software
• Scales to a variety of use cases:
• Networked (performance depends on scale)
• Non-networked (150+ MHz), limited by your budget
The information, data, or work presented herein was funded in part by the Advanced
Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award
Number DE-AR0000849, and by DARPA, Award Number HR0011-12-2-0016. Research
was also partially funded by ADEPT Lab industrial sponsors and affiliates Intel, Apple,
Futurewei, Google, and Seagate, and RISE Lab sponsor Amazon Web Services. The
views and opinions of authors expressed herein do not necessarily state or reflect
those of the United States Government or any agency thereof.