- 1. Overview
- 2. Using Ramulator 2.1
- 3. Your First Run
- 4. Writing Configurations
- 5. Validation and Regression Tests
- 6. gem5 Integration
- 7. Using Ramulator as a pure C++ Library
- 8. Extending Ramulator
- 9. How Ramulator Works Internally
Ramulator 2.1 is a modern, modular, and extensible cycle-level DRAM simulator. It is the successor of Ramulator 1.0 [Kim+, CAL'16] and a major overhaul of Ramulator 2.0 [Luo+, CAL'23]. The goal of Ramulator 2.1 is to enable rapid and agile implementation and evaluation of design changes in the memory controller and DRAM, supporting the growing research effort in improving the performance, security, and reliability of memory systems. Ramulator 2.1 features a clean and modular C++ codebase with automatically generated Python wrappers that enable easy and scriptable configuration and extension. Users can focus on the C++ code that implements the modeling and simulation logic without having to manually maintain boilerplate code.
Ramulator 2.1 can either be used as a standalone simulator that takes memory traces, or be easily integrated into other simulators as a DRAM and memory controller simulation library. We currently provide gem5 wrappers that work as drop-in components for both SE and FS mode (tested with gem5 v25.1; FS mode tested with a post-boot checkpoint).
This Github repository contains the public version of Ramulator 2.1. From time to time, we will synchronize improvements to the code framework, additional functionalities, bug fixes, etc. from our internal version. Ramulator 2.1 welcomes your contributions as well as new ideas and implementations for the memory system.
Currently, Ramulator 2.1 provides the DRAM device and memory controller models for the following standards:
- DDR3, DDR4, DDR5
- LPDDR5 (with WCK2CK sync and expiry tracking, and tAAD-deadline-aware scheduling for the separate ACT1 and ACT2 commands)
- HBM, HBM2, HBM3 (row/column command dual-issue and pseudo-channels)
What has changed from Ramulator 2.0:
- Aggregated bug fixes (identified from both Github issues and internal testing)
- More comprehensive support for newer DRAM & controller features
- More comprehensive sets of test and validation workflows
- Significantly improved ease of use, configuration, and extension
- Overall code quality improvements
| Directory | Contents |
|---|---|
| `src/` | C++ code for the main simulator implementation |
| `python/` | Python wrappers for easy and scriptable configuration of Ramulator |
| `examples/` | Ready-to-run example configurations and traces |
| `tests/` | Tests and validation workflows |
| `resources/gem5_wrappers/` | Reference wrapper code for gem5 integration |
We highly recommend using our container (Dockerfile available at `.devcontainer/Dockerfile`) to avoid dependency issues. The repository is also configured so it can be opened with one click as a Dev Container with all dependencies preinstalled. The easiest way to start using and developing Ramulator 2.1 is to create a Codespace directly from the Github page of the Ramulator 2.1 repository. If you are using Visual Studio Code, it should automatically prompt you to reopen the repository in a Dev Container after you clone and open the repo for the first time.
If you want to set up the container locally, run:

```shell
docker compose up -d --build
docker compose exec ramulator2 bash
```

Doing so creates a container with all the dependencies, mounts the Ramulator 2.1 repository at `/workspace`, and automatically activates `ramulator2-venv` in the container bash.
If you need to configure your own environment, please refer to Section 2.3 for detailed instructions.
From the repository root:

```shell
mkdir -p build
cd build
cmake ..
make -j
cd ..
```

This default build does three useful things:
- Builds `libramulator.so` in the repository root
- Builds the Python extension module under `python/ramulator/`
- Runs the code generator so all the automatically generated code is in sync with the source code
Then run Ramulator 2.1 in standalone mode with an example configuration:
```shell
python3 examples/example_config.py
```

You should see some example statistics being printed. You can head to Section 3 directly for detailed explanations and instructions on how to use and configure Ramulator 2.1 if you do not need to build Ramulator 2.1 in your custom environment.
Required:
- A C++20 compiler, such as `g++-12` or `clang++-15`
- CMake 3.14 or newer
- Python 3.10 or newer if you want the Python bindings, the CLI, or the tests
Auto-fetched by CMake (no manual install):
- `yaml-cpp` — YAML configuration parsing
- `fmt` — C++ print formatting
- `nanobind` — Python-C++ bindings (only when `RAMULATOR_PYTHON_BINDINGS=ON`)
- `argparse` — command-line argument parsing (vendored in `ext/`)
Python dev dependencies (install with pip install -r requirements-dev.txt):
- `pytest` — test framework
- `matplotlib` — latency-throughput plotting
- `ruff` — Python linter and formatter
`setuptools >= 64` is required as the build backend and is pulled in automatically by `pip install -e .`.
Optional: `clang-format`.
To build Ramulator 2.1 in standalone mode:

```shell
mkdir -p build
cd build
cmake ..
make -j
cd ..
```

If you only want the pure C++ library without Python bindings for your own simulator:
```shell
mkdir -p build
cd build
cmake .. -DRAMULATOR_PYTHON_BINDINGS=OFF
make -j
cd ..
```

After building, install the Python package in editable mode so that `python -m ramulator` and `import ramulator` work from any directory:
```shell
pip install -e .
```

Then run examples directly:

```shell
python3 examples/example_config.py
```

If you prefer not to install the Python package, you can also set `PYTHONPATH=python` as a one-off alternative:

```shell
PYTHONPATH=python python3 examples/example_config.py
```

`examples/example_config.py` looks like the following:
"""Example Ramulator2 configuration and simulation script"""
import ramulator
# Configure the simulation frontend that sends memory requests
frontend = ramulator.frontend.SimpleO3(
clock_ratio=8,
traces=["./examples/traces/example_inst.trace"],
num_expected_insts=500000,
translation=ramulator.translation.NoTranslation(max_addr=2147483648),
)
# Create DRAM configuration
ddr4 = ramulator.dram.DDR4(org_preset="DDR4_8Gb_x8", timing_preset="DDR4_2400R", rank=1)
# Instantiate the memory controller with the DRAM configuation
ctrl = ramulator.controller.GenericDDR(
dram=ddr4,
scheduler=ramulator.scheduler.FRFCFS(),
refresh_manager=ramulator.refresh_manager.AllBank(scope="Rank"),
row_policy=ramulator.row_policy.Open(),
addr_mapper=ramulator.addr_mapper.RoBaRaCoCh(),
)
# Create a memory system with the controller
mem = ramulator.memory_system.GenericDRAM(
clock_ratio=3,
controllers=[ctrl],
channel_mapper=ramulator.channel_mapper.CacheLineInterleave(),
)
# Run the simulation
sim = ramulator.Simulation(frontend, mem)
sim.run()
# sim.stats returns a nested Python dict of all simulation statistics
stats = sim.stats
# Controller stats are under memory_system → controller
ctrl_stats = stats["memory_system"]["controller"]
print(f"Controller cycles: {ctrl_stats['cycles']}")
print(f"Avg read latency: {ctrl_stats['avg_read_latency']:.1f} cycles")
print(f"Read requests: {ctrl_stats['num_read_reqs']}")
print(f"Write requests: {ctrl_stats['num_write_reqs']}")
print(f"Row hits: {ctrl_stats['row_hits']}")
print(f"Row misses: {ctrl_stats['row_misses']}")
print(f"Row conflicts: {ctrl_stats['row_conflicts']}")There are two top-level components in Ramulator 2.1:
- `frontend` generates memory traffic. In this example it is a simple out-of-order core model driven by a memory instruction trace.
- `memory_system` encapsulates one or more memory controllers (one controller per channel). Each controller owns a DRAM device model.
Any Ramulator 2.1 simulation must contain these two components. They are used to create the main simulation object (`ramulator.Simulation(frontend, mem)`), which serves as the entry point of the simulation.
```python
frontend = ramulator.frontend.SimpleO3(
    clock_ratio=8,
    traces=["./examples/traces/example_inst.trace"],
    num_expected_insts=500000,
    translation=ramulator.translation.NoTranslation(max_addr=2147483648),
)
```

The frontend generates memory requests and sends them to the memory system. In this example, `SimpleO3` models a simple out-of-order processor with an LLC. It reads one or more instruction trace files, each of which corresponds to a processor core. The simulation will run until the number of retired instructions reaches 500000. The configured frontend will not apply address translation to the memory addresses in the trace (i.e., `ramulator.translation.NoTranslation`). The `clock_ratio` values are relative: `clock_ratio=8` means the frontend is ticked 8 times for every x ticks of the memory system, where x is the memory system's `clock_ratio`.
```python
ddr4 = ramulator.dram.DDR4(
    org_preset="DDR4_8Gb_x8",
    timing_preset="DDR4_2400R",
    rank=2,
)
```

This includes:
- An organization preset, such as die density, DQ width, number of banks, etc.
- A timing constraints preset, such as tRCD, tRAS, tRP, etc.
- Optional overrides to both presets. In this example, we set `rank=2`. You can append as many overrides as you want.
If you want to understand what this object turns into at runtime, Section 9.3 walks through the full DRAM device model and the hierarchical state machine behind it.
```python
ctrl = ramulator.controller.GenericDDR(
    dram=ddr4,
    scheduler=ramulator.scheduler.FRFCFS(),
    refresh_manager=ramulator.refresh_manager.AllBank(scope="Rank"),
    row_policy=ramulator.row_policy.Open(),
    addr_mapper=ramulator.addr_mapper.RoBaRaCoCh(),
)
```

This configures a `GenericDDR` memory controller for the `ddr4` DRAM we just configured. It has an FRFCFS (First-Ready First-Come-First-Served) scheduler, an all-bank refresh that happens at the rank level (`AllBank(scope="Rank")`), an open-row policy, and a `RoBaRaCoCh` address mapper.
```python
mem = ramulator.memory_system.GenericDRAM(
    clock_ratio=3,
    controllers=[ctrl],
    channel_mapper=ramulator.channel_mapper.CacheLineInterleave(),
)
```

`GenericDRAM` is a thin top-level wrapper around one or more controllers. It contains a `clock_ratio` that sets the memory-side tick rate, a list of controllers (`controllers=[...]`), and a channel mapper (`channel_mapper=...`) that decides which memory requests go to which controller. `clock_ratio=3` means the memory system is ticked 3 times for every y frontend ticks, where y is the frontend's `clock_ratio`. Currently, `GenericDRAM` requires all its memory controllers to have the same frequency.
The example prints a few key numbers from sim.stats after the simulation finishes:
```
Controller cycles: 81302
Avg read latency: 45.2 cycles
Read requests: 6
Write requests: 0
Row hits: 6
Row misses: 0
Row conflicts: 0
```
The exact numbers depend on the workload and configuration.
sim.stats is a nested Python dict with two top-level keys: "frontend" and "memory_system". The most useful counters live under stats["memory_system"]["controller"]:
| Stat | Meaning |
|---|---|
| `cycles` | Controller cycles |
| `avg_read_latency` | Average read latency in controller cycles |
| `num_read_reqs` | Read requests accepted |
| `num_write_reqs` | Write requests accepted |
| `row_hits` | Total row hits |
| `row_misses` | Total row misses |
| `row_conflicts` | Total row conflicts |
| `read_queue_len_avg` | Average read queue occupancy |
| `write_queue_len_avg` | Average write queue occupancy |
| `total_num_read_requests` | Total reads accepted by the memory system (one level up, under `stats["memory_system"]`) |
| `total_num_write_requests` | Total writes accepted by the memory system |
With a single channel, stats["memory_system"]["controller"] is a dict. With multiple channels it becomes a list of dicts (one per channel), each with an "id" field (e.g., "Channel 0").
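Because the controller entry changes shape with the channel count, analysis scripts often normalize it first. A small helper (hypothetical, but matching the dict-vs-list shapes described above):

```python
def controller_stats_list(stats: dict) -> list:
    """Return controller stats as a list of dicts, whether the simulation
    had one channel (plain dict) or several (list of per-channel dicts)."""
    ctrl = stats["memory_system"]["controller"]
    return ctrl if isinstance(ctrl, list) else [ctrl]

# Single-channel shape: "controller" is a dict.
single = {"memory_system": {"controller": {"cycles": 81302}}}
# Multi-channel shape: "controller" is a list with per-channel "id" fields.
multi = {"memory_system": {"controller": [
    {"id": "Channel 0", "cycles": 81302},
    {"id": "Channel 1", "cycles": 81199},
]}}

print(len(controller_stats_list(single)))  # 1
print(len(controller_stats_list(multi)))   # 2
```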
The Ramulator Python package exposes the major components as a set of namespaces:
- `ramulator.frontend`
- `ramulator.dram`
- `ramulator.controller`
- `ramulator.scheduler`
- `ramulator.refresh_manager`
- `ramulator.row_policy`
- `ramulator.addr_mapper`
- `ramulator.channel_mapper`
- `ramulator.translation`
- `ramulator.controller_plugin`
- `ramulator.memory_system`
You can swap DDR4 for another standard by replacing the DRAM object and, when needed, the controller class.
Examples:
```python
dram = ramulator.dram.DDR5(org_preset="DDR5_8Gb_x8", timing_preset="DDR5_4800AN")
ctrl = ramulator.controller.GenericDDR(dram=dram, ...)

dram = ramulator.dram.LPDDR5(org_preset="LPDDR5_8Gb_x16", timing_preset="LPDDR5_5500")
ctrl = ramulator.controller.LPDDR5(dram=dram, ...)

dram = ramulator.dram.HBM2(org_preset="HBM2_2Gb", timing_preset="HBM2_2000Mbps")
ctrl = ramulator.controller.HBM(dram=dram, ...)
```

Use the controller that matches the standard you want to model. DDR3, DDR4, and DDR5 use `GenericDDR`. LPDDR5 uses `LPDDR5`. HBM1, HBM2, and HBM3 use `HBM`.
The DRAM object accepts preset names plus overrides:
```python
dram = ramulator.dram.DDR4(
    org_preset="DDR4_8Gb_x8",
    timing_preset="DDR4_2400R",
    rank=2,  # Overrides the 1-rank setting in org_preset to 2 ranks
    # Add more organization and timing overrides here
)
```

Overrides are validated against the DRAM specification. Overriding a non-existent parameter raises an error.
One controller corresponds to one channel.
```python
ctrl = ramulator.controller.GenericDDR(
    dram=ramulator.dram.DDR4(org_preset="DDR4_8Gb_x8", timing_preset="DDR4_2400R", rank=2),
    scheduler=ramulator.scheduler.FRFCFS(),
    refresh_manager=ramulator.refresh_manager.AllBank(scope="Rank"),
    row_policy=ramulator.row_policy.Open(),
    addr_mapper=ramulator.addr_mapper.RoBaRaCoCh(),
)

num_channels = 2
mem = ramulator.memory_system.GenericDRAM(
    clock_ratio=3,
    controllers=[ctrl] * num_channels,
    channel_mapper=ramulator.channel_mapper.CacheLineInterleave(),
)
```

One detail matters here: `CacheLineInterleave` currently requires the number of channels to be a power of two.
The built-in frontends serve different purposes:
- `SimpleO3` — Best first stop for memory-trace-driven studies, with a simple core model and LLC. The memory trace includes both 1) the memory requests and 2) the interval (i.e., the number of non-memory instructions) between consecutive memory requests. Please check `src/ramulator/frontend/impl/processor/simpleO3/simpleO3.cpp` for the trace format.
- `LoadStoreTrace` — Replays a flat-address trace with `LD` and `ST` records. Intervals between memory requests are not modeled (i.e., memory requests are sent to the memory system on every cycle).
- `ReadWriteTrace` — Replays a trace with `R` and `W` records. Similar to `LoadStoreTrace` but expects the address vector instead of flat addresses. Good for debugging/testing.
- `LatencyThroughputTrace` — Synthetic load generator used by the validation workflow. It generates two kinds of memory requests: 1) random-access, pointer-chasing-like requests that probe the memory access latency, and 2) streaming-access requests that generate load (configurable via the interval between consecutive streaming requests) on the memory system.
`sim.stats` returns all simulation statistics as a nested Python dict. This is the easiest way to access results, and it lets you streamline your entire experiment workflow (configuration, parameter sweeps, result analysis) in a single Python script. `sim.stats_yaml` returns the same data as a YAML-formatted string in case you want to save the results to disk.
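For example, a small parameter sweep can stay in one script. In the sketch below, `run_sim` is a hypothetical helper you would write yourself (building the components shown earlier, calling `sim.run()`, and returning `sim.stats`); here it is stubbed out so only the sweep-and-collect pattern is shown.

```python
def sweep_avg_read_latency(run_sim, rank_options):
    """Run one simulation per rank setting and collect avg_read_latency
    from the nested stats dict returned by each run."""
    results = {}
    for rank in rank_options:
        stats = run_sim(rank)  # real code: configure components, sim.run(), sim.stats
        results[rank] = stats["memory_system"]["controller"]["avg_read_latency"]
    return results

# Stub standing in for a real Ramulator run:
def fake_run_sim(rank):
    return {"memory_system": {"controller": {"avg_read_latency": 50.0 - 2.0 * rank}}}

print(sweep_avg_read_latency(fake_run_sim, [1, 2, 4]))
# {1: 48.0, 2: 46.0, 4: 42.0}
```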
Ramulator includes four practical test layers under tests/.
- `smoke` — The fastest end-to-end sanity check. It verifies that each supported standard can build a realistic configuration and run without crashing.
- `latency_throughput` — A modeling-fidelity suite built around generated frontend traffic. It checks whether unloaded latency and sustained bandwidth line up with the theoretical behavior implied by the DRAM definition.
- `device_timings` — A short-sequence device-level correctness suite. It checks DRAM command legality, prerequisites, and timing-gate behavior one command at a time.
- `controller_scheduling` — A short-sequence controller-level correctness suite. It checks emitted command sequences, row-hit/miss/conflict behavior, and priority-buffer scheduling contracts.
These suites are complementary:
- `smoke` answers "does it run at all?"
- `latency_throughput` answers "does the high-level performance shape look right?"
- `device_timings` answers "does the DRAM timing and prerequisite behavior look right?"
- `controller_scheduling` answers "does the controller schedule and emit commands the right way?"
Smoke tests make sure each supported standard can run end to end without crashing.
```shell
PYTHONPATH=python pytest tests/smoke -q
```

This is the fastest confidence check after a build or a local code change.
Fast latency-throughput is the main modeling-fidelity check used in day-to-day development. It should finish in just a few minutes.
```shell
PYTHONPATH=python pytest tests/latency_throughput/test_fast.py -v -s
```

It does three things for each DRAM standard:
- Runs a no-refresh latency-throughput sweep
- Checks unloaded latency against the timing formula
- Checks measured streaming throughput against the theoretical peak
It also writes annotated plots to:
`tests/latency_throughput/plots/fast/`
If you only want one standard:
```shell
PYTHONPATH=python pytest tests/latency_throughput/test_fast.py -v -s -k DDR4
```

Full latency-throughput is a longer run with refresh enabled:

```shell
PYTHONPATH=python pytest tests/latency_throughput/test_full.py -v -s
```

Device timings is the device-level correctness suite for short DRAM command sequences.
```shell
PYTHONPATH=python pytest tests/device_timings -q
```

It focuses on cases that are too small and specific for a throughput sweep:
- DRAM prerequisites, such as `RD` requiring `ACT`
- timing gates, such as `nRCD`, `nRTP`, and `nRP`
Use it when you change command semantics or device timing enforcement.
`DeviceUnderTest` gives you two complementary operations:

- `probe(command, addr_vec, clk)` — A read-only query. It asks, "if I wanted this command at this cycle, what would the device say?" It does not change device state.
- `issue(command, addr_vec, clk)` — A state-mutating operation. It actually issues the command into the device and updates timing/state. It throws if the command is not fully legal at that cycle.
`probe()` returns five pieces of information:

- `preq` — The command that should happen next from the device's point of view. If you probe `RD` on a closed bank, this will be `ACT`.
- `timing_OK` — Whether timing alone allows the probed command at that cycle.
- `ready` — Whether the command is fully issuable now. This is the combination of the functional prerequisite and timing: `ready = (preq == command) and timing_OK`.
- `row_hit` — Whether the target bank is already open to the requested row.
- `row_open` — Whether some row is already open in the target bank.
That distinction matters because a command can fail for two different reasons:
- The bank state is wrong, so a different prerequisite command must happen first.
- The bank state is right, but timing still blocks the command.
For example, a closed-bank read is not ready, but it can still be timing-OK:
```python
# A closed-bank read is functionally blocked until the row is opened.
closed = dut.probe("RD", a, clk=0)
# Without ACT issued, the prerequisite is ACT.
assert closed.preq == "ACT"
# Here we only check the timing constraints. The timing is OK since no ACT has been issued yet!
assert closed.timing_OK is True
# ready means the command is fully issuable now: correct prerequisite and timing_OK.
assert closed.ready is False
```

Here, timing is not the problem. The missing prerequisite is.
After you issue ACT, the prerequisite changes to RD, but the access is still blocked until nRCD expires:
```python
dut.issue("ACT", a, clk=0)
early = dut.probe("RD", a, clk=dut.timings["nRCD"] - 1)
assert early.preq == "RD"
assert early.timing_OK is False
assert early.ready is False
```

Now the state is correct, but timing is not.
At exactly nRCD, the same probe becomes fully ready:
```python
ontime = dut.probe("RD", a, clk=dut.timings["nRCD"])
assert ontime.preq == "RD"
assert ontime.timing_OK is True
assert ontime.ready is True
```

That is the point where `issue()` becomes valid:

```python
dut.issue("RD", a, clk=dut.timings["nRCD"])
```

`issue()` is intentionally strict. It does not try to fix the sequence for you. If you call it on a command whose prerequisite is different, or on a command that is still timing-blocked, it raises an error. A good testing pattern is:
- Use `probe()` to understand what the device expects next.
- Assert on `preq`, `timing_OK`, and `ready`.
- Call `issue()` only when you expect `ready` to be `True`.
A minimal DeviceUnderTest example looks like this:
```python
import ramulator
import ramulator.device_timings

dram = ramulator.dram.DDR4(org_preset="DDR4_8Gb_x8", timing_preset="DDR4_2400R", rank=1)
dut = ramulator.device_timings.DeviceUnderTest(dram)
a = dut.addr_vec(Rank=0, BankGroup=0, Bank=0, Row=12, Column=0)

assert dut.probe("RD", a, clk=0).preq == "ACT"
dut.issue("ACT", a, clk=0)
assert dut.probe("RD", a, clk=dut.timings["nRCD"]).ready is True
```

The canonical full device example lives in:

`tests/device_timings/example.py`
Controller scheduling is the controller-level correctness suite for short request and maintenance sequences.
```shell
PYTHONPATH=python pytest tests/controller_scheduling -q
```

It focuses on cases that are too small and specific for a throughput sweep:
- controller-issued command sequences for row hits, misses, and conflicts
- scheduling preferences such as FRFCFS choosing a ready request
- controller contracts around priority/internal commands
Use it when you change row-policy behavior, controller scheduling, or command issuance logic.
A minimal ControllerUnderTest example looks like this:
```python
import ramulator
import ramulator.controller_scheduling

dram = ramulator.dram.DDR4(org_preset="DDR4_8Gb_x8", timing_preset="DDR4_2400R", rank=1)
dut = ramulator.controller_scheduling.ControllerUnderTest.make_generic_ddr(dram)

row0 = dut.addr_vec(Rank=0, BankGroup=0, Bank=0, Row=0, Column=0)
row1 = dut.addr_vec(Rank=0, BankGroup=0, Bank=0, Row=1, Column=0)
dut.send_request("Read", row0)
dut.send_request("Read", row1)

history = dut.run_until_idle(max_ticks=128)
dut.assert_commands(["ACT", "RD", "PREpb", "ACT", "RD"], history=history)
```

The canonical full controller example lives in:

`tests/controller_scheduling/examples/test_controller_example.py`
- Use smoke tests after a clean build or a small code change
- Use fast latency-throughput when you change timing behavior, controller logic, or DRAM definitions
- Use device timings when you change command legality or timing enforcement
- Use controller scheduling when you change controller command sequencing or row-policy behavior
- Use full latency-throughput when you want a more comprehensive sanity check
Ramulator2 integrates with gem5 as a drop-in memory system. You can configure Ramulator2 in the same gem5 Python configuration script. The integration uses gem5's stdlib Board API, so it works with SimpleBoard and other stdlib boards.
Tested with gem5 v25.1.
- gem5 built from source (stable branch recommended)
- Ramulator2 built with `libramulator.so` (the default build)
- Ramulator2 Python package available (either `pip install -e .` or `PYTHONPATH`)
Copy the wrapper files into your gem5 source tree:
```shell
cp -r resources/gem5_wrappers/ <gem5>/src/mem/ramulator2/
```

This creates four files:
| File | Purpose |
|---|---|
| `Ramulator2.py` | gem5 SimObject declaration |
| `ramulator2.hh` | C++ header |
| `ramulator2.cc` | C++ implementation |
| `SConscript` | Build configuration |
Edit `SConscript` and set `RAMULATOR2_HOME` to your ramulator2 directory:

```python
# In <gem5>/src/mem/ramulator2/SConscript
RAMULATOR2_HOME = '/path/to/ramulator2'  # ← change this
```

Then rebuild gem5:
```shell
cd <gem5>
scons build/X86/gem5.opt -j$(nproc) --ignore-style
```
```python
import sys

# Replace with the actual path
sys.path.insert(0, "/path/to/ramulator2/python")
import ramulator

from gem5.components.boards.simple_board import SimpleBoard
from gem5.components.cachehierarchies.classic.no_cache import NoCache
from gem5.components.processors.cpu_types import CPUTypes
from gem5.components.processors.simple_processor import SimpleProcessor
from gem5.isas import ISA
from gem5.resources.resource import BinaryResource
from gem5.simulate.simulator import Simulator

# ── Ramulator2 memory configuration ──
ddr4 = ramulator.dram.DDR4(org_preset="DDR4_8Gb_x8", timing_preset="DDR4_2400R", rank=1)
ctrl = ramulator.controller.GenericDDR(
    dram=ddr4,
    scheduler=ramulator.scheduler.FRFCFS(),
    refresh_manager=ramulator.refresh_manager.AllBank(scope="Rank"),
    row_policy=ramulator.row_policy.Open(),
    addr_mapper=ramulator.addr_mapper.RoBaRaCoCh(),
)
mem_sys = ramulator.memory_system.GenericDRAM(
    clock_ratio=3,
    controllers=[ctrl],
    channel_mapper=ramulator.channel_mapper.CacheLineInterleave(),
)
memory = ramulator.gem5.Memory(mem_sys, size="4GiB")

# ── gem5 system setup ──
processor = SimpleProcessor(cpu_type=CPUTypes.TIMING, isa=ISA.X86, num_cores=1)
board = SimpleBoard(
    clk_freq="3GHz",
    processor=processor,
    memory=memory,
    cache_hierarchy=NoCache(),
)
board.set_se_binary_workload(binary=BinaryResource(local_path="/path/to/binary"))

simulator = Simulator(board=board)
simulator.run()
```

Run it with:

```shell
cd <gem5>
build/X86/gem5.opt configs/your_config.py
```

The SConscript sets RPATH so `libramulator.so` is found automatically. If you moved the library after building, set `LD_LIBRARY_PATH` instead:
```shell
LD_LIBRARY_PATH=/path/to/ramulator2 build/X86/gem5.opt configs/your_config.py
```

For request-level debug tracing, add `--debug-flags=Ramulator2`.
Ramulator's internal stats (row hits, queue lengths, read latency, etc.) are written to `m5out/ramulator_stats.yaml` alongside gem5's own `stats.txt`.
If you want to use Ramulator in your simulator without introducing Python bindings into your simulator, you can still use Ramulator as a pure C++ library.
When embedding Ramulator inside another simulator, you use the External frontend. Your simulator sends memory requests to Ramulator and ticks the memory system; Ramulator handles scheduling, timing, and state tracking.
The typical flow is:
- Build Ramulator (`libramulator.so`)
- Write a Python config that uses `ramulator.frontend.External` and export it to YAML
- Load that YAML from C++
- Send memory requests via `receive_external_requests()`
- Tick the memory system at the DRAM clock rate
Export the config like this:
```shell
python3 -m ramulator export examples/example_config.py -o config.yaml
```

Note that the generated YAML is fully equivalent to the Python configuration that exports it (from Ramulator's perspective). It is less readable than the Python configuration because it is intended solely for Ramulator to parse. You are not expected to manually edit the YAML files.
The exported YAML must use External as the frontend implementation. When building the Python config for export, replace your usual frontend with:
```python
frontend = ramulator.frontend.External(clock_ratio=1)
```

```cpp
#include <ramulator/base/config.h>
#include <ramulator/base/factory.h>
#include <ramulator/base/request.h>
#include <ramulator/frontend/i_frontend.h>
#include <ramulator/memory_system/i_memory_system.h>

// 1. Load config and create components
auto config = Ramulator::Config::parse_config_file("config.yaml");
auto* frontend = Ramulator::Factory::create_frontend(config);
auto* memory_system = Ramulator::Factory::create_memory_system(config);
frontend->connect_memory_system(memory_system);
memory_system->connect_frontend(frontend);

// 2. Send a read request
//    req_type_id: Request::Type::Read (0) or Request::Type::Write (1)
//    addr:        byte address
//    source_id:   identifies which core or source (0 for single-core)
//    callback:    called when the request completes
bool accepted = frontend->receive_external_requests(
    Ramulator::Request::Type::Read,  // read request
    0x1000,                          // address
    0,                               // source id
    [](Ramulator::Request& req) {
        // Request completed — req.depart has the completion cycle
    }
);
// If accepted is false, the memory system's queue is full.
// Retry on the next cycle after ticking.

// 3. Tick the memory system each DRAM cycle
memory_system->tick();

// 4. When done, finalize to flush stats
frontend->finalize();
memory_system->finalize();
```

The `External` frontend's `tick()` is a no-op — your simulator controls when and how requests are injected. You only need to tick the memory system.
- Exported configs are fully expanded. The C++ side expects resolved values, not symbolic Python presets.
- `libramulator.so` is the library you link against.
- `receive_external_requests()` returns `false` when the controller's request queue is full. The caller must retry on a subsequent cycle.
- Request type IDs are `0` (read) and `1` (write), matching `Request::Type::Read` and `Request::Type::Write`.
- For a complete working integration, see the gem5 wrapper in `resources/gem5_wrappers/` or the gem5 integration section above.
This section covers the extension points most contributors actually touch.
The most common change is adding a new implementation of an existing interface, such as a scheduler.
Create a single `.cpp` file:

```cpp
#include "controller/controller_base.h"
#include "controller/scheduler/i_scheduler.h"

namespace Ramulator {

// Inherit from both its interface and a common Implementation base class
class FooBarScheduler : public IScheduler, public Implementation {
  // Register the implementation class to the interface with a name ("FooBar")
  // Python wrappers are *automatically* generated
  RAMULATOR_REGISTER_IMPLEMENTATION(IScheduler, FooBarScheduler, "FooBar")

  ControllerBase* m_ctrl = nullptr;
  int m_weight = 0;
  size_t s_decisions = 0;

  void init() override {
    // Initialize your component with parameters from the config
    RAMULATOR_PARSE_PARAM(m_weight, int, "weight").default_val(4);
    // If you need to access the parent component
    m_ctrl = cast_parent<ControllerBase>();
    // Register your variables as stats to be automatically printed
    m_stats.add("foobar_decisions", s_decisions);
  }

  void setup(IFrontEnd* frontend, IMemorySystem* memory_system) override {
    // setup() gets called *after* all components have been initialized
    // You can resolve configurations that depend on other components here
    // ...
  }

  ReqBuffer::iterator get_best_request(
      ReqBuffer& buffer,
      RequestFilterRef filter) override {
    // Implement the actual behavior and logic of the component by
    // overriding the interface's virtual functions
  }
};

}  // namespace Ramulator
```

Then add that file to the relevant `CMakeLists.txt`, rebuild, and use it from Python:
```python
ctrl = ramulator.controller.GenericDDR(
    dram=dram,
    scheduler=ramulator.scheduler.FooBar(weight=8),  # New scheduler!
    refresh_manager=ramulator.refresh_manager.AllBank(scope="Rank"),
    row_policy=ramulator.row_policy.Open(),
    addr_mapper=ramulator.addr_mapper.RoBaRaCoCh(),
)
```

Controller plugins are optional observer components attached to a controller.
Today, the plugin lifecycle is:
- `pre_schedule()` — Runs before candidate selection
- `on_issue(const Request&)` — Runs after a command is issued
- `post_schedule()` — Runs at the end of the controller tick
Built-in plugins include:
- `CommandCounter` — Counts selected DRAM commands and writes `command, count` lines to a CSV file
- `CmdTraceRecorder` — Records every issued command to a per-channel trace file such as `trace.csv.ch0`
- `IssuedCommandValidationHook` — Forwards each issued DRAM command to the controller scheduling test suite in Python
Example:
```python
ctrl = ramulator.controller.GenericDDR(
    dram=dram,
    scheduler=ramulator.scheduler.FRFCFS(),
    refresh_manager=ramulator.refresh_manager.AllBank(scope="Rank"),
    row_policy=ramulator.row_policy.Open(),
    addr_mapper=ramulator.addr_mapper.RoBaRaCoCh(),
    controller_plugins=[
        ramulator.controller_plugin.CommandCounter(
            commands_to_count=["ACT", "PREpb", "RD", "WR", "REFab"],
            path="cmd_counts.csv",
        ),
    ],
)
```

DRAM standards are defined in Python scripts that describe the specifications of the DRAM standard. Ramulator automatically generates C++ code that fits the modeling methodology during the build process. Users can enjoy the readability, flexibility, and extensibility of Python and avoid low-level code that is much less readable.
For example, if you want to create a variant of an existing DRAM standard by adding a new DRAM command (e.g., FOO), follow these two steps:
1. Add the implementation of the `FOO` command under `src/ramulator/dram/commands/`.
2. Create the variant DRAM standard by inheriting from the base DRAM standard and specifying only what changes:
```python
import math

from ramulator.dram.ddr3 import DDR3
from ramulator.dram.spec import TimingConstraint

class DDR3Foo(DDR3):
    name = "DDR3Foo"

    # Add the new command
    commands = DDR3.commands + ["FOO"]

    # Add new timing constraints from the new command
    timing_params = DDR3.timing_params + ["nFOO"]
    timing_constraints = DDR3.timing_constraints + [
        TimingConstraint(level="Bank", preceding=["FOO"], following=["ACT"], latency="nFOO"),
        TimingConstraint(level="Bank", preceding=["ACT"], following=["FOO"], latency="nRC"),
    ]
```

A new DRAM standard is a `DRAMStandard` subclass under `python/ramulator/dram/`.
You define: `name`, `levels`, `commands`, `states`, `timing_params`, `supported_requests`, `timing_constraints`, `org_presets`, `timing_presets`, and `resolve_secondary_timings()`.
Code generation imports modules under python/ramulator/dram/, discovers these classes, and generates the corresponding C++ implementation in src/ramulator/dram/impl/.
After the DRAM definition exists, add a testcase file in tests/latency_throughput/testcases/.
The current latency-throughput flow expects a config shape like this:
```python
import ramulator

CONFIG = dict(
    dram_class="MyStandard",
    org_preset="MyOrgPreset",
    timing_preset="MyTimingPreset",
    dram_kwargs=dict(),
    controller_class="GenericDDR",
    fast_ctrl_extra_kwargs=dict(
        refresh_manager=ramulator.refresh_manager.NoRefresh(),
    ),
    full_ctrl_extra_kwargs=dict(
        refresh_manager=ramulator.refresh_manager.AllBank(scope="Rank"),
    ),
    full_streaming_requests=1_000_000,
    frontend_clock_ratio=4,
    stream_cols=8,
    nop_counters=[1, 10, 100, 1000],
)
```

The exact controller class depends on the standard. For example, LPDDR5 uses LPDDR5, and HBM-family standards use HBM.
Then run:

```shell
PYTHONPATH=python pytest tests/smoke -v -k MyStandard
PYTHONPATH=python pytest tests/latency_throughput/test_fast.py -v -s -k MyStandard
PYTHONPATH=python pytest tests/latency_throughput/test_full.py -v -s -k MyStandard
```

This section is here so that the guide remains useful once you get your hands on the codebase. If you only want to run experiments, you can stop earlier and come back when you need the deeper model.
Ramulator uses an interface and implementation pattern. An interface models a type of component in the simulated system. The interface class defines the high-level protocol and contract that the component exposes to other components in the system (i.e., how other components should interact with this component) through virtual functions. Interface class names start with `I`, and their header files are prefixed with `i_`.
Examples: `IFrontEnd`, `IMemorySystem`, `IController`, `IScheduler`, `IRowPolicy`, `IRefreshManager`, `IAddrMapper`, `IChannelMapper`, `IControllerPlugin`, `ITranslation`.
Implementations are concrete realizations of the component type that their interface models. An implementation class defines the concrete behavior of the component it models by overriding the virtual functions of its interface class. In Ramulator2, all implementation classes must inherit from both their interface class and a common Implementation class that provides basic shared utilities and boilerplate.
Ramulator2 implements a self-registering component factory so that it can create the component hierarchy from the configuration automatically. Users do not need to handle the factory boilerplate as long as they make their custom interfaces and implementations discoverable through the following one-line macros:
```cpp
RAMULATOR_REGISTER_INTERFACE(IfceClassName, "ifce_name")
RAMULATOR_REGISTER_IMPLEMENTATION(IfceClassName, ImplClassName, "ImplName")
```

The factory uses these registrations to automatically discover and create components from config data (e.g., the example above becomes `ramulator.ifce_name.ImplName` on the Python side). There is no hand-maintained registry file.
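The self-registration idea can be sketched in a few lines of toy Python (illustrative only, not Ramulator's actual mechanism): each implementation registers itself into a global table under its interface and implementation names, and a factory instantiates it from plain config data.

```python
# Toy sketch of a self-registering factory (hypothetical names).
_REGISTRY = {}

def register(ifce_name, impl_name):
    """Decorator: record the class under (interface, implementation)."""
    def decorator(cls):
        _REGISTRY[(ifce_name, impl_name)] = cls
        return cls
    return decorator

def create(ifce_name, config):
    """Build an instance from config data; no hand-maintained registry needed."""
    cls = _REGISTRY[(ifce_name, config["impl"])]
    return cls(**{k: v for k, v in config.items() if k != "impl"})

@register("scheduler", "FRFCFS")
class FRFCFS:
    def __init__(self, cap=16):
        self.cap = cap

sched = create("scheduler", {"impl": "FRFCFS", "cap": 32})
print(type(sched).__name__, sched.cap)  # FRFCFS 32
```

The decorator runs at class-definition time, which mirrors how the C++ macros make components discoverable without any central list.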
Similarly, by using the provided macros to parse parameters and create child components, the boilerplate code that lets Python discover them is also generated automatically. No manual maintenance of C++-to-Python bindings is necessary.
```cpp
RAMULATOR_PARSE_PARAM(parsed_variable, type_t, "param_name")
RAMULATOR_CREATE_CHILD(IfceClassName, "ifce_name")
```

In Python mode, the configuration path is:
- You create Python component objects.
- Each object serializes itself with `to_config()`.
- The Python binding converts that nested dictionary into `ConfigNode`.
- The C++ factory creates the top-level frontend and memory-system objects.
- Child components are created recursively during `init()`.
- The simulation loop advances the frontend and memory system according to their clock ratios.
In C++ library mode, the only difference is that you load the ConfigNode tree from an exported YAML file instead of starting from Python objects.
The controller owns a DRAMDevice, and that device is where Ramulator turns a DRAM standard description into a live protocol model. The easiest way to think about it is that the device keeps two views of the same channel at the same time:
- A hierarchy of nodes for timing
- A flat bank-oriented view for command semantics
That split is deliberate. DRAM timing rules are written at different scopes. Some live at the bank level, some at bank group or rank, and some at the channel or pseudo-channel level. Functional state, on the other hand, is usually easiest to answer from the point of view of a specific bank. Ramulator uses the hierarchy where scope matters, and the flat bank view where direct bank-local answers are faster and clearer.
Another important point is ownership of time. The device does not keep its own free-running clock. The controller owns m_clk and passes the current cycle into every device query and every issued command. That is why DRAMDevice::check_timing, DRAMDevice::get_preq_command, and DRAMDevice::issue_command all take clk as an argument.
The model starts in Python, not in C++. Each DRAM standard is a `DRAMStandard` subclass under `python/ramulator/dram/`. That class is the single source of truth for:

- The hierarchy, through `levels`
- The legal command set, through `commands`
- The state names, through `states`
- The timing vocabulary, through `timing_params`
- The external request to DRAM-command mapping, through `supported_requests`
- The timing rules, through `TimingConstraint`
- Optional bus timing behavior, through `command_cycles`, `tick_multiplier`, `row_commands`, and `column_commands`
`to_config()` is where that Python definition becomes runtime data. It does more work than its name might suggest:

- Resolves the chosen organization and timing presets
- Applies user overrides such as `rank=2`
- Computes derived timings in `resolve_secondary_timings()`
- Evaluates timing expressions such as `nCL + nBL + 2 - nCWL`
- Scales everything into simulation ticks when `tick_multiplier` is greater than 1
- Expands each `TimingConstraint` into integer-indexed entries
- Adjusts latencies for multi-cycle commands using `command_cycles`
- Auto-generates bus occupancy constraints for standards that need them
By the time the config reaches C++, the symbolic DRAM description is already fully resolved. The C++ DRAMSpec does not re-derive JEDEC tables on the fly. It stores the resolved names, timing values, timing constraints, command metadata, bank-targeting mode, and function pointers for command behavior.
That design keeps the runtime lean. All the expensive symbolic work happens once during config creation instead of every tick.
When the controller initializes its device, `DRAMDevice::init()` does three things:

- Takes ownership of the resolved `DRAMSpec`
- Builds the root `DRAMNode`
- Collects a flat list of all bank nodes
The node tree represents the structural hierarchy of one channel. For DDR4, that hierarchy is effectively:
Channel -> Rank -> BankGroup -> Bank
For HBM3 it is:
Channel -> PseudoChannel -> BankGroup -> Bank
One subtle detail matters here. The tree stops before the Row level. Ramulator does not instantiate one node per physical row, because that would explode the runtime footprint for no practical benefit. Instead, it tracks row state lazily inside the bank-level node that owns those rows.
Each DRAMNode stores four kinds of state:

- `m_state`: the coarse protocol state for that node, such as `Closed`, `Opened`, or LPDDR5's `Activating`
- `m_cmd_ready_clk`: for each command, the earliest cycle when that command may next issue at this node
- `m_cmd_history`: recent issue times for each command, sized large enough to model the largest rolling window seen at that level
- `m_row_state`: a map of currently open rows for that bank-like node
That last field is the reason Ramulator can model large devices without creating millions of row objects. Rows only appear in the state map if they have been opened. A closed bank has an empty m_row_state.
The flat bank array, m_bank_nodes, is just a different view of the same tree. It lets the controller ask bank-local questions without walking down the hierarchy every time.
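A minimal sketch of that lazy row tracking (illustrative toy code, not the actual `DRAMNode` implementation): rows appear in the map only when opened, so a closed bank carries no per-row storage at all.

```python
# Toy bank node with lazily populated row state (hypothetical names).
class BankNode:
    def __init__(self):
        self.row_state = {}          # row id -> "Opened"; empty means closed

    def act(self, row):
        self.row_state[row] = "Opened"

    def pre(self):
        self.row_state.clear()       # precharge closes the bank

    def is_open_to(self, row):
        return row in self.row_state

bank = BankNode()
print(bank.is_open_to(0x1A2B))       # False: closed bank, empty map
bank.act(0x1A2B)
print(bank.is_open_to(0x1A2B))       # True: row exists only after ACT
bank.pre()
print(bank.row_state)                # {}
```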
Components query DRAMSpec (defined in src/ramulator/dram/dram_spec.h) for level IDs, command IDs, and timing values. The API has two families:
Existence checks return bool — use these to test whether an optional feature is present before acting on it:
| Method | Returns |
|---|---|
| `has_level("X")` / `has_command("X")` / `has_state("X")` / `has_timing("X")` | `bool` |
Value getters return the integer ID or resolved value. If the name does not exist in the DRAM standard, they throw std::runtime_error immediately — there is no silent -1 return:
| Method | Returns |
|---|---|
| `get_level_id("X")` / `get_command_id("X")` / `get_state_id("X")` | `int`: the integer ID for the given name |
| `get_timing_value("X")` | `int`: the resolved timing value (not an index) |
| `get_level_size("X")` | `int`: the number of instances at the named level (from `organization.level_sizes`) |
Best practice: cache lookups at init-time. The API does not prevent calling these at runtime, but string-keyed map lookups are unnecessary overhead on hot paths. The recommended pattern is to call them once in init() and store the results in member variables:
```cpp
void init() override {
    const auto& spec = *m_ctrl->m_device.m_spec;
    // Required lookups — throw immediately if the DRAM standard is missing these
    m_cmd_act = spec.get_command_id("ACT");
    m_cmd_rd = spec.get_command_id("RD");
    m_nCL = spec.get_timing_value("nCL");
    m_bank_lvl = spec.get_level_id("Bank");
    // Optional features — check first, branch explicitly
    m_cmd_rda = spec.has_command("RDA") ? spec.get_command_id("RDA") : -1;
    m_has_ap = (m_cmd_rda != -1);
}
```

At runtime, use the cached member variables (`m_cmd_act`, `m_nCL`, etc.) and direct array access (`spec.command_meta[cmd]`, `spec.bank_targets[cmd]`).
The timing model is driven by TimingConstraint. Each constraint says:
- At which hierarchy level it applies
- Which preceding commands create the constraint
- Which following commands are blocked by it
- How long the latency is
- Whether a rolling history window is needed
- Whether the effect applies to sibling nodes rather than only the exact addressed path
At config time, those objects become TimingConsEntry records in DRAMSpec::timing_cons. From that point on, the runtime only deals with integer command IDs, level IDs, and resolved cycle counts.
Two methods on DRAMNode implement the timing algorithm:
- `DRAMNode::check_timing`
- `DRAMNode::update_timing`
check_timing() is the read-only timing side. It walks from the root toward the addressed scope and asks, at each node, whether the candidate command is still timing-blocked there. If the current cycle is earlier than m_cmd_ready_clk[command], the answer is immediately false. Otherwise the walk continues.
If the address vector names a specific child, check_timing() follows that one path. If the address vector contains -1 at the next level, the command is scoped broadly and timing must hold for every descendant in that scope. That is how commands such as PREab and REFab naturally become multi-bank checks without special-case traversal logic in the controller.
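That scoped traversal can be sketched with a stripped-down toy model (assuming a simple two-bank rank; none of these names are the real implementation): a concrete index follows one path, while `-1` requires the check to hold for every child.

```python
# Toy hierarchical timing check with -1 meaning "all children at this level".
class Node:
    def __init__(self, children=()):
        self.children = list(children)
        self.cmd_ready_clk = {}   # command -> earliest legal cycle

    def check_timing(self, cmd, addr_vec, clk, depth=0):
        if clk < self.cmd_ready_clk.get(cmd, 0):
            return False          # timing-blocked at this node's scope
        if depth == len(addr_vec) or not self.children:
            return True
        idx = addr_vec[depth]
        if idx == -1:             # broadcast: every descendant must be ready
            return all(c.check_timing(cmd, addr_vec, clk, depth + 1)
                       for c in self.children)
        return self.children[idx].check_timing(cmd, addr_vec, clk, depth + 1)

# A rank with two banks; bank 1 is busy until cycle 100.
rank = Node([Node(), Node()])
rank.children[1].cmd_ready_clk["PREab"] = 100

print(rank.check_timing("PREab", [0], 50))    # True: bank 0 alone is ready
print(rank.check_timing("PREab", [-1], 50))   # False: broadcast hits busy bank 1
print(rank.check_timing("PREab", [-1], 100))  # True: legal once bank 1 is ready
```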
update_timing() is the write side. It runs when a command actually issues, and it updates both the targeted path and any relevant siblings.
The logic is easier to understand in two cases.
First, consider the node that lies on the addressed path. update_timing():

- Records the issue time in `m_cmd_history[command]`
- Looks up all non-sibling timing constraints triggered by that command at this level
- Uses the recorded history to compute when each blocked command becomes legal again
- Updates `m_cmd_ready_clk` for those blocked commands
- Recurses into child nodes
Now consider a sibling node at the same level. If the address vector names a different child and the constraint entry is marked sibling=true, Ramulator updates that sibling's m_cmd_ready_clk without descending further. This is how rules that affect peer ranks or peer bank groups are modeled cleanly.
The window field is what makes rolling constraints work. A good example is nFAW, which limits how many activates can occur in a recent interval. If a constraint has window=4, the node keeps the four most recent issue times for that preceding command. When a new command arrives, Ramulator looks back to the fourth most recent one and uses that timestamp to decide when the next blocked command may issue.
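A toy sketch of that window mechanism, using assumed values `nFAW=16` and `window=4` (illustrative only):

```python
from collections import deque

# Toy rolling-window constraint in the spirit of nFAW: the next blocked
# command becomes legal nFAW cycles after the 4th-most-recent ACT.
nFAW, WINDOW = 16, 4
history = deque(maxlen=WINDOW)   # most recent ACT issue times, oldest first

def record_act(clk):
    history.append(clk)

def act_ready_clk():
    # With fewer than WINDOW activates recorded, the rolling rule cannot bind.
    if len(history) < WINDOW:
        return 0
    return history[0] + nFAW     # oldest entry is the 4th-most-recent ACT

for clk in (0, 4, 8, 12):        # four activates in quick succession
    record_act(clk)
print(act_ready_clk())           # 16: the fifth ACT must wait until 0 + nFAW
```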
Because the recursion visits every relevant level, timing rules compose naturally:
- Channel-level rules model shared buses or top-level serialization
- Rank-level rules model rank-wide interactions such as refresh and activate windows
- Bank-group rules model same-group restrictions
- Bank-level rules model per-bank open, close, and access timing
- Pseudo-channel rules model the per-PC timing domains used by HBM-family devices
The important takeaway is that Ramulator does not flatten all timing into one giant table. It keeps timing at the scope where the rule actually lives, then lets recursion combine those scopes at runtime.
Timing tells you whether a command may issue now. It does not tell you which command should issue next for a request. That part comes from the command handlers registered in populate_commands().
Each command may provide up to four bank-level handlers:
- `preq`: returns the prerequisite command that should be issued next
- `action`: mutates state when the command actually issues
- `rowhit`: answers whether the request is a row hit
- `rowopen`: answers whether some row is already open in the target bank
These handlers are bank-level on purpose. They receive a bank node, not the full hierarchy. The controller already knows how to find the relevant bank or banks, and bank-local command semantics are usually where the functional state machine is easiest to express.
For a normal DDR-style access, the command chain is driven by preq:
- If the bank is closed, an access command's prerequisite is `ACT`
- If the bank is open to the requested row, the prerequisite is the access itself
- If the bank is open to the wrong row, the prerequisite becomes `PREpb`
That logic lives directly in the command handlers. ACT::preq() is a good example. It checks whether the bank is closed, already open to the same row, or open to a conflicting row, then returns ACT, the original command, or PREpb respectively.
RD::preq() and WR::preq() reuse that open-row logic instead of duplicating it. PREpb::action() closes the bank and clears m_row_state. RDA and WRA are modeled as access commands whose action also closes the bank. REFab::preq() checks whether all targeted banks are closed, and if not, it first requires PREab.
BankTarget determines how wide that bank-local dispatch is:
- `Single`: one specific bank, used by commands such as `ACT`, `RD`, `WR`, and `PREpb`
- `All`: every bank in the addressed scope, used by commands such as `PREab` and `REFab`
- `SameBank`: the same bank ID across a wider scope, used by standards that need that pattern, such as DDR5
DRAMDevice::get_target_banks() turns the address vector and BankTarget into concrete bank-node indices. That is why a refresh handler can still be written as bank-local logic while affecting many banks.
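A toy version of that mapping for an assumed organization of 4 bank groups with 4 banks each, flat-indexed as `group * 4 + bank` (illustrative only, not the actual `DRAMDevice::get_target_banks()` code):

```python
# Toy BankTarget -> concrete bank-index expansion for one rank.
GROUPS, BANKS = 4, 4

def target_banks(mode, group, bank):
    if mode == "Single":             # e.g., ACT, RD, WR, PREpb
        return [group * BANKS + bank]
    if mode == "All":                # e.g., PREab, REFab
        return [g * BANKS + b for g in range(GROUPS) for b in range(BANKS)]
    if mode == "SameBank":           # e.g., DDR5-style same-bank commands
        return [g * BANKS + bank for g in range(GROUPS)]
    raise ValueError(mode)

print(target_banks("Single", 2, 1))    # [9]
print(target_banks("SameBank", 2, 1))  # [1, 5, 9, 13]
print(len(target_banks("All", 0, 0)))  # 16
```

The controller then runs the same bank-local handler once per returned index, which is why even wide commands stay bank-local in their semantics.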
It helps to walk through one ordinary read request on a closed DDR4 bank.
- A frontend sends a read request into the controller.
- The controller maps the physical address into `addr_vec`.
- The controller sets `final_command` from the DRAM standard's `supported_requests`, so a read request targets `RD`.
- The scheduler or controller calls `get_preq_command(final_command, addr_vec)`.
- The device dispatches that question to the relevant bank node. Because the bank is closed, the answer is `ACT`.
- The controller calls `check_timing(ACT, addr_vec)`. The hierarchy checks channel, rank, bank group, and bank timing state.
- If timing allows it, `issue_command(ACT, addr_vec, clk)` runs. First it updates timing through the node tree, then it applies the functional action that changes the bank state to open and records the opened row.
- Because `ACT` is marked as an opening command, the request moves to the active buffer instead of retiring.
- On a later tick, the controller asks again for the prerequisite command. Now the bank is open to the right row, so the answer is `RD`.
- `check_timing(RD, addr_vec)` validates the access against all relevant timing scopes.
- `issue_command(RD, addr_vec, clk)` updates timing again. `RD` does not open a new row, so the request is now complete from the DRAM-command point of view.
- The controller retires the request and assigns its departure time using the DRAM read latency.
If the bank had been open to the wrong row, the same flow would insert PREpb before ACT, and only then reach RD. The controller does not hardcode that sequence. It falls out of repeated prerequisite checks against current bank state.
That is the key modeling idea. A request is not expanded into a fixed command script ahead of time. Instead, each cycle the controller asks the device, "Given the state right now, what is the next legal command for this request?"
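That repeated-question loop fits in a few lines of toy Python (a hypothetical mini-model that ignores timing; the names are illustrative, not the real controller code):

```python
# Toy "ask for the prerequisite each cycle" loop for one request.
def preq(final_cmd, bank_open_row, req_row):
    if bank_open_row is None:
        return "ACT"                # closed bank: open the row first
    if bank_open_row == req_row:
        return final_cmd            # row hit: issue the access itself
    return "PREpb"                  # row conflict: close the wrong row first

def run_request(final_cmd, bank_open_row, req_row):
    issued = []
    while True:
        cmd = preq(final_cmd, bank_open_row, req_row)
        issued.append(cmd)          # assume timing always allows the issue
        if cmd == "ACT":
            bank_open_row = req_row
        elif cmd == "PREpb":
            bank_open_row = None
        else:
            return issued           # final command issued: request done

print(run_request("RD", None, 7))   # ['ACT', 'RD']
print(run_request("RD", 3, 7))      # ['PREpb', 'ACT', 'RD']
```

No command script is precomputed: the `PREpb -> ACT -> RD` sequence falls out of asking the same question against evolving bank state.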
We explain the GenericDDR controller flow as an example:
- `tick_prologue()`: advance the controller clock, accumulate queue-length statistics, and serve completed reads (i.e., call their callbacks when they are due to return to the frontend).
- `m_refresh->tick()`: give the refresh manager a chance to inject maintenance work.
- `m_rowpolicy->pre_schedule()` and plugin `pre_schedule()`: let policies react before candidate selection.
- Candidate selection across three request buffers (active, priority, and normal R/W). The controller tries to schedule active requests first, then priority requests, then normal read or write traffic. Active requests are the ones that already have their DRAM row open; prioritizing them reduces premature precharges that waste cycles.
- `m_rowpolicy->try_upgrade_command(req)`: the row policy may change a command in place, for example `RD` to `RDA` (i.e., a close-row policy), if the upgraded command is valid and ready.
- Issue the command that the scheduled request needs to progress. `m_device.issue_command(...)` updates timing state and any command-driven state changes.
- Update stats and notify observers. The controller updates row-hit and row-miss statistics, then calls `on_issue(...)` on the row policy and all plugins.
- Advance the request lifecycle. If the issued command is the final command, the request is retired. If it is an opening command such as `ACT`, the request is promoted to the active buffer.
- `post_schedule()` hooks: the row policy and plugins can do work at the end of the tick.
If you want to understand the codebase without getting lost, this order works well:
- `examples/example_config.py`
- `python/ramulator/__init__.py`
- `src/ramulator/python/bindings.cpp`
- `src/ramulator/controller/impl/generic_ddr_controller.cpp`
- `src/ramulator/controller/controller_base.cpp`
- `src/ramulator/dram/device.h` and `src/ramulator/dram/node.cpp`
- One DRAM definition in `python/ramulator/dram/impl`, such as DDR4
That path starts from the public API, then drops into the execution path, then finally into the deeper modeling machinery.