
Monday, December 8, 2025

RocksDB performance over time on a small Arm server

This post has results for RocksDB on an Arm server. I previously shared results for RocksDB performance using gcc and clang. Here I share results using clang with LTO.

RocksDB is boring: there are few performance regressions.

tl;dr

  • for cached workloads, throughput with RocksDB 10.8 is as good as or better than with 6.29
  • for not-cached workloads, throughput with RocksDB 10.8 is similar to 6.29, except for the overwrite test where it is ~7% lower, probably from correctness checks added in 7.x and 8.x

Software

I used RocksDB versions 6.29, 7.0, 7.10, 8.0, 8.4, 8.8, 8.11, 9.0, 9.4, 9.8, 9.11 and 10.0 through 10.8.

I compiled each version with clang 18.3.1 and link-time optimization (LTO) enabled. The build command line was:

flags=( DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 )

# for clang+LTO
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
    make USE_LTO=1 "${flags[@]}" static_lib db_bench

Hardware

I used a small Arm server from the Google cloud running Ubuntu 22.04. The server type was c4a-standard-8-lssd with 8 cores and 32G of RAM. Storage was 2 local SSDs with RAID 0 and ext-4.
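The RAID and filesystem setup was similar to the sketch below. These are not the exact commands from my setup scripts and the local SSD device names (/dev/nvme0n1, /dev/nvme0n2) and mount point are assumptions:

# assemble the 2 local SSDs into a RAID 0 array and format it with ext-4
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme0n2
sudo mkfs.ext4 /dev/md0
sudo mount -o noatime /dev/md0 /data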

Benchmark

Overviews on how I use db_bench are here and here.

The benchmark was run with 1 thread and used the LRU block cache.

Tests were run for three workloads:

  • byrx - database cached by RocksDB
  • iobuf - database is larger than RAM and RocksDB used buffered IO
  • iodir - database is larger than RAM and RocksDB used O_DIRECT
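For the iodir workload, a sketch of the O_DIRECT configuration is below. This is not the exact command line from my helper scripts and the other flag values are placeholders:

# enable O_DIRECT for user reads and for flush/compaction writes
./db_bench --benchmarks=overwrite --use_existing_db=1 --num=2000000000 \
    --use_direct_reads=1 --use_direct_io_for_flush_and_compaction=1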

The benchmark steps that I focus on are:
  • fillseq
    • load RocksDB in key order with 1 thread
  • revrangeww, fwdrangeww
    • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries
  • readww
    • do point queries with a rate-limited writer. Report performance for the point queries.
  • overwrite
    • overwrite (via Put) random keys
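The steps above are run by my helper scripts. A rough sketch of the db_bench command lines behind them is below -- these are not the exact command lines from the scripts and the flag values are placeholders:

# fillseq: load in key order
./db_bench --benchmarks=fillseq --num=20000000 --threads=1

# readww: point queries with a rate-limited writer (rate is in bytes/second)
./db_bench --benchmarks=readwhilewriting --use_existing_db=1 --num=20000000 \
    --threads=1 --duration=1800 --benchmark_write_rate_limit=2097152

# overwrite: overwrite random keys via Put
./db_bench --benchmarks=overwrite --use_existing_db=1 --num=20000000 \
    --threads=1 --duration=1800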

Relative QPS

Many of the tables below (inlined and via URL) show the relative QPS which is:
    (QPS for my version / QPS for RocksDB 6.29)

When the relative QPS is > 1.0 then my version is faster than RocksDB 6.29. When it is < 1.0 then there might be a performance regression or there might just be noise.
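For example, if RocksDB 10.8 gets 110,000 QPS on a test where 6.29 gets 100,000 QPS then the relative QPS is 110000 / 100000 = 1.10, meaning 10.8 is about 10% faster on that test.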

The spreadsheet with numbers and charts is here. Performance summaries are here.

Results: byrx

This has results for the byrx workload where the database is cached by RocksDB.

RocksDB 10.x is faster than 6.29 for all tests.

Results: iobuf

This has results for the iobuf workload where the database is larger than RAM and RocksDB used buffered IO.

Performance in RocksDB 10.x is about the same as in 6.29 except for overwrite, where throughput in 10.8 is ~7% less than in 6.29. I think the decreases for overwrite that arrived in versions 7.x and 8.x are from new correctness checks. The big drop for fillseq in 10.6.2 was from bug 13996.

Results: iodir

This has results for the iodir workload where the database is larger than RAM and RocksDB used O_DIRECT.

Performance in RocksDB 10.x is about the same as in 6.29 except for overwrite, where throughput in 10.8 is ~7% less than in 6.29. I think the decreases for overwrite that arrived in versions 7.x and 8.x are from new correctness checks. The big drop for fillseq in 10.6.2 was from bug 13996.

Monday, December 1, 2025

Using db_bench to measure RocksDB performance with gcc and clang

This has results for db_bench, a benchmark for RocksDB, when compiling it with gcc and clang. On one of my servers I saw a regression on one of the tests (fillseq) when compiling with gcc. The result on that server didn't match what I measured on two other servers. So I repeated tests after compiling with clang to see if I could reproduce it.

tl;dr

  • a common outcome is
    • ~10% more QPS with clang+LTO than with gcc
    • ~5% more QPS with clang than with gcc
  • the performance gap between clang and gcc is larger in RocksDB 10.x than in earlier versions

Variance

I always worry about variance when I search for performance bugs. Variance can be misinterpreted as a performance regression and I strive to avoid that because I don't want to file bogus performance bugs.

Possible sources of variance are:

  • the compiler toolchain
    • a bad code layout might hurt performance by increasing cache and TLB misses
  • RocksDB
    • the overhead from compaction is intermittent and the LSM tree layout can help or hurt CPU overhead during reads
  • hardware
    • sources include noisy neighbors on public cloud servers, insufficient CPU cooling and CPU frequency management that is too clever
  • benchmark client
    • the way in which I run tests can create more or less variance and more information on that is here and here

Software

I used RocksDB versions 6.29.5, 7.10.2, 8.0, 8.4, 8.8, 8.11, 9.0, 9.4, 9.8, 9.11 and 10.0 through 10.8.

I compiled each version three times:

  • gcc - using version 13.3.0
  • clang - using version 18.3.1
  • clang+LTO - using version 18.3.1, where LTO is link-time optimization

The build command lines are below:

flags=( DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 )

# for gcc
make "${flags[@]}" static_lib db_bench

# for clang
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
    make "${flags[@]}" static_lib db_bench

# for clang+LTO
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
    make USE_LTO=1 "${flags[@]}" static_lib db_bench

On the small servers I used the LRU block cache. On the large server I used hyper clock when possible:
  • lru_cache was used for versions 7.6 and earlier
  • hyper_clock_cache was used for versions 7.7 through 8.5
  • auto_hyper_clock_cache was used for versions 8.5+
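The block cache implementation is selected via the --cache_type argument. A sketch is below; the other flag values are placeholders, not my exact command lines:

# use the newer hyper clock cache on versions that support it (8.5+)
./db_bench --benchmarks=readwhilewriting --use_existing_db=1 \
    --cache_type=auto_hyper_clock_cache --cache_size=$((16 * 1024 * 1024 * 1024))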

Hardware

I used two small servers and one large server, all running Ubuntu 22.04:

  • pn-53
    • Ryzen 7 (AMD) CPU with 8 cores and 32G of RAM. It is v5 in the blog post
    • benchmarks are run with 1 client (thread)
  • arm
    • an ARM server from the Google cloud -- c4a-standard-8-lssd with 8 cores and 32G of RAM, 2 local SSDs using RAID 0 and ext-4
    • benchmarks are run with 1 client (thread)
  • hetzner
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G of RAM, 2 SSDs with RAID 1 (3.8T each) using ext4
    • benchmarks are run with 36 clients (threads)

Benchmark

Overviews on how I use db_bench are here and here.

Tests were run for a workload with the database cached by RocksDB that I call byrx in my scripts.

The benchmark steps that I focus on are:
  • fillseq
    • load RocksDB in key order with 1 thread
  • revrangeww, fwdrangeww
    • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries
  • readww
    • do point queries with a rate-limited writer. Report performance for the point queries.
  • overwrite
    • overwrite (via Put) random keys

Relative QPS

Many of the tables below (inlined and via URL) show the relative QPS which is:
    (QPS for my version / QPS for RocksDB 6.29 compiled with gcc)

The base version varies and is listed below. When the relative QPS is > 1.0 then my version is faster than the base version. When it is < 1.0 then there might be a performance regression or there might just be noise.

The spreadsheet with numbers and charts is here.

Results: fillseq

Results for the pn53 server

  • clang+LTO provides ~15% more QPS than gcc in RocksDB 10.8
  • clang provides ~11% more QPS than gcc in RocksDB 10.8

Results for the Arm server

  • I am fascinated by how stable the QPS is here for clang and clang+LTO
  • clang+LTO and clang provide ~3% more QPS than gcc in RocksDB 10.8

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • the performance for RocksDB 10.8.3 with gcc is what motivated me to repeat tests with clang
  • clang+LTO and clang provide ~20% more QPS than gcc in RocksDB 10.8

Results: revrangeww

Results for the pn53 server

  • clang+LTO provides ~9% more QPS than gcc in RocksDB 10.8
  • clang provides ~6% more QPS than gcc in RocksDB 10.8

Results for the Arm server

  • clang+LTO provides ~11% more QPS than gcc in RocksDB 10.8
  • clang provides ~6% more QPS than gcc in RocksDB 10.8

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~8% more QPS than gcc in RocksDB 10.8
  • clang provides ~3% more QPS than gcc in RocksDB 10.8

Results: fwdrangeww

Results for the pn53 server

  • clang+LTO provides ~9% more QPS than gcc in RocksDB 10.8
  • clang provides ~4% more QPS than gcc in RocksDB 10.8

Results for the Arm server

  • clang+LTO provides ~13% more QPS than gcc in RocksDB 10.8
  • clang provides ~7% more QPS than gcc in RocksDB 10.8

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~4% more QPS than gcc in RocksDB 10.8
  • clang provides ~1% more QPS than gcc in RocksDB 10.8

Results: readww

Results for the pn53 server

  • clang+LTO provides ~6% more QPS than gcc in RocksDB 10.8
  • clang provides ~5% less QPS than gcc in RocksDB 10.8

Results for the Arm server

  • clang+LTO provides ~14% more QPS than gcc in RocksDB 10.8
  • clang provides ~2% more QPS than gcc in RocksDB 10.8

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~4% more QPS than gcc in RocksDB 10.8
  • clang provides ~1% more QPS than gcc in RocksDB 10.8

Results: overwrite

Results for the pn53 server

  • clang+LTO provides ~6% less QPS than gcc in RocksDB 10.8
  • clang provides ~8% less QPS than gcc in RocksDB 10.8
  • but for most versions there is similar QPS for gcc, clang and clang+LTO

Results for the Arm server

  • QPS is similar for gcc, clang and clang+LTO

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~2% more QPS than gcc in RocksDB 10.8
  • clang provides ~1% more QPS than gcc in RocksDB 10.8

Thursday, October 23, 2025

            How efficient is RocksDB for IO-bound, point-query workloads?

            How efficient is RocksDB for workloads that are IO-bound and read-only? One way to answer this is to measure the CPU overhead from RocksDB as this is extra overhead beyond what libc and the kernel require to perform an IO. Here my focus is on KV pairs that are smaller than the typical RocksDB block size that I use -- 8kb.

            By IO efficiency I mean:
                (storage read IOPs from RocksDB benchmark / storage read IOPs from fio)

            And I measure this in a setup where RocksDB doesn't get much benefit from RocksDB block cache hits (database size > 400G, block cache size was 16G).

            This value will be less than 1.0 in such a setup. But how much less than 1.0 will it be? On my hardware the IO efficiency was ~0.85 at 1 client and ~0.88 at 6 clients. Were I to use slower storage, such as an SSD where read latency was ~200 usecs at io_depth=1 then the IO efficiency would be closer to 0.95.

             Note that:

            • IO efficiency increases (decreases) when SSD read latency increases (decreases)
            • IO efficiency increases (decreases) when the RocksDB CPU overhead decreases (increases)
            • RocksDB QPS increases by ~8% for IO-bound workloads when --block_align is enabled

            The overheads per 8kb block read on my test hardware were:

            • about 11 microseconds from libc + kernel
            • between 6 and 10 microseconds from RocksDB
            • ~100 usecs of IO latency at io_depth=1, ~150 usecs at io_depth=6

            A simple performance model

            A simple model to predict the wall-clock latency for reading a block is:
                userland CPU + libc/kernel CPU + device latency

            For fio I assume that userland CPU is zero. I measured libc/kernel CPU at ~11 usecs and estimate that device latency is ~91 usecs. The device latency estimate comes from read-only fio benchmarks where fio reports the average latency as 102 usecs, which includes 11 usecs of CPU from libc+kernel, so 91 = 102 - 11.

            This model isn't perfect, as I show below when reporting results for RocksDB, but it might be sufficient and it allows you to predict latencies and IO efficiency when the RocksDB CPU overhead is increased or reduced.

            Q and A

            The RocksDB API could function as a universal API for storage engines, and if new DBMS built on that then it would be possible to combine new DBMS with new storage engines much faster than what is possible today.

            Persistent hash indexes are not widely implemented, but getting one that uses the RocksDB API would be interesting for workloads such as the one I run here. However, there are fewer use cases for a hash index (no range queries) than for a range index like an LSM so it is harder to justify the investment in such work.

            Q: What is the CPU overhead from libc + kernel per 8kb read?
            A: About 10 microseconds on this CPU.

            Q: Can you write your own code that will be faster than RocksDB for such a workload?
            A: Yes, you can.

            Q: Should you write your own library for this?
            A: It depends on how many features you need and the opportunity cost in spending time writing that code vs doing something else.

            Q: Will RocksDB add features to make this faster?
            A: That is for them to answer. But all projects have a complexity budget. Code can become too expensive to maintain when that budget is exceeded. There is also the opportunity cost to consider as working on this delays work on other features.

            Q: Does this matter?
            A: It matters more when storage is fast (read latency less than 100 usecs). As read response time grows the CPU overhead from RocksDB becomes much less of an issue.

            Benchmark hardware

            I ran tests on a Beelink SER7 with a Ryzen 7 7840HS CPU that has 8 cores and 32G of RAM. The storage device is a Crucial CT1000P3PSSD8 (Crucial P3, 1TB) using ext-4 with discard enabled. The OS is Ubuntu 24.04.

            From fio, the average read latency for the SSD is 102 microseconds using O_DIRECT with io_depth=1 and the sync engine.

            CPU frequency management makes it harder to claim that the CPU runs at X GHz, but the details are:

            $ cpupower frequency-info

            analyzing CPU 5:
              driver: acpi-cpufreq
              CPUs which run at the same hardware frequency: 5
              CPUs which need to have their frequency coordinated by software: 5
              maximum transition latency:  Cannot determine or is not supported.
              hardware limits: 1.60 GHz - 3.80 GHz
              available frequency steps:  3.80 GHz, 2.20 GHz, 1.60 GHz
              available cpufreq governors: conservative ... powersave performance schedutil
              current policy: frequency should be within 1.60 GHz and 3.80 GHz.
                              The governor "performance" may decide which speed to use
                              within this range.
              current CPU frequency: Unable to call hardware
              current CPU frequency: 3.79 GHz (asserted by call to kernel)
              boost state support:
                Supported: yes
                Active: no

            Results from fio

            I started with fio using a command-line like the following for NJ=1 and NJ=6 to measure average IOPs and the CPU overhead per IO.

            fio --name=randread --rw=randread --ioengine=sync --numjobs=$NJ --iodepth=1 \
              --buffered=0 --direct=1 \
              --bs=8k \
              --size=400G \
              --randrepeat=0 \
              --runtime=600s --ramp_time=1s \
              --filename=G_1:G_2:G_3:G_4:G_5:G_6:G_7:G_8  \
              --group_reporting

            Results are:

            legend:
            * iops - average reads/s reported by fio
            * usPer, syPer - user, system CPU usecs per read
            * cpuPer - usPer + syPer
            * lat.us - average read latency in microseconds
            * numjobs - the value for --numjobs with fio

            iops    usPer   syPer   cpuPer  lat.us  numjobs
             9884   1.351    9.565  10.916  101.61  1
            43782   1.379   10.642  12.022  136.35  6

            Results from RocksDB

            I used an edited version of my benchmark helper scripts that run db_bench. In this case the sequence of tests was:

            1. fillseq - loads the LSM tree in key order
            2. revrange - I ignore the results from this
            3. overwritesome - overwrites 10% of the KV pairs
            4. flush_mt_l0 - flushes the memtable, waits, compacts L0 to L1, waits
            5. readrandom - does random point queries when LSM tree has many levels
            6. compact - compacts LSM tree into one level
            7. readrandom2 - does random point queries when LSM tree has one level, bloom filters enabled
            8. readrandom3 - does random point queries when LSM tree has one level, bloom filters disabled

            I use readrandom, readrandom2 and readrandom3 to vary the amount of work that RocksDB must do per query and measure the CPU overhead of that work. The most work happens with readrandom as the LSM tree has many levels and there are bloom filters to check. The least work happens with readrandom3 as the LSM tree only has one level and there are no bloom filters to check.
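            A sketch of how the bloom filter toggle can be expressed with db_bench flags is below. These are not the exact command lines from my helper scripts, the other flag values are placeholders, and using --bloom_bits for the toggle is my assumption about how the scripts do it:

            # readrandom2: one-level LSM tree, bloom filters configured
            ./db_bench --benchmarks=readrandom --use_existing_db=1 --bloom_bits=10 \
                --cache_size=$((16 * 1024 * 1024 * 1024)) --duration=600

            # readrandom3: one-level LSM tree, bloom filters disabled
            ./db_bench --benchmarks=readrandom --use_existing_db=1 --bloom_bits=0 \
                --cache_size=$((16 * 1024 * 1024 * 1024)) --duration=600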

            Initially I ran tests with --block_align not set because that reduces space-amplification (less padding), but then 8kb reads are likely to cross file system page boundaries and become larger reads. Given that the focus here is on IO efficiency, I enabled --block_align.

            A summary of the results for db_bench with 1 user (thread) and 6 users (threads) is:

            --- 1 user
            qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
            8282     8350   8.5     11.643   7.602  19.246  120.74  101     readrandom
            8394     8327   8.7      9.997   8.525  18.523  119.13  105     readrandom2
            8522     8400   8.2      8.732   8.718  17.450  117.34  100     readrandom3

            --- 6 users
            38391   38628   8.1     14.645   7.291  21.936  156.27  134     readrandom
            39359   38623   8.3     10.449   9.346  19.795  152.43  144     readrandom2
            39669   38874   8.0      9.459   9.850  19.309  151.24  140     readrandom3

            From the tables that follow:
            • IO efficiency is approximately 0.84 at 1 client and 0.88 at 6 clients
            • With 1 user, RocksDB adds between 6.534 and 8.330 usecs of CPU time per query compared to fio, depending on the amount of work it has to do
            • With 6 users, RocksDB adds between 7.287 and 9.914 usecs of CPU time per query
            • IO latency as reported by RocksDB is ~20 usecs larger than as reported by iostat. But I have to re-read the RocksDB source code to understand where and how it is measured.

            legend:
            * io.eff - IO efficiency as (db_bench storage read IOPs / fio storage read IOPs)
            * us.inc - incremental user CPU usecs per read as (db_bench usPer - fio usPer)
            * cpu.inc - incremental total CPU usecs per read as (db_bench cpuPer - fio cpuPer)

            --- 1 user

                    io.eff          us.inc          cpu.inc         test
                    ------          ------          ------
                    0.844           10.292           8.330          readrandom
                    0.842            8.646           7.607          readrandom2
                    0.849            7.381           6.534          readrandom3

            --- 6 users

                    io.eff          us.inc          cpu.inc         test
                    ------          ------          ------
                    0.882           13.266           9.914          readrandom
                    0.882            9.070           7.773          readrandom2
                    0.887            8.080           7.287          readrandom3

            Evaluating the simple performance model

            I described a simple performance model earlier in this blog post and now it is time to see how well it does for RocksDB. First I will use values from the 1 user/client/thread case:

            • IO latency is ~91 usecs per fio
            • libc+kernel CPU overhead is ~11 usecs per fio
            • RocksDB CPU overhead is 8.330, 7.607 and 6.534 usecs for readrandom, *2 and *3

            The model is far from perfect as it predicts that RocksDB will sustain:

            • 9063 IOPs for readrandom, when it actually did 8350
            • 9124 IOPs for readrandom2, when it actually did 8327
            • 9214 IOPs for readrandom3, when it actually did 8400

            Regardless, the model is a good way to think about the problem.
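            For example, the prediction for readrandom comes from the model as: 1,000,000 usecs / (91 + 11 + 8.330) usecs per read ≈ 9063 reads/s, versus the 8350 reads/s that it actually sustained.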

            The impact from --block_align

            RocksDB QPS increases by between 7% and 9% when --block_align is enabled. Enabling it reduces read-amp and increases space-amp. But given the focus here is on IO efficiency I prefer to enable it. RocksDB QPS increases with it enabled because fewer storage read requests cross file system page boundaries, thus the average read size from storage is reduced (see the reqsz column below).

            legend:
            * qps - RocksDB QPS
            * iops - average reads/s reported by fio
            * reqsz - average read request size in KB per iostat
            * usPer, syPer, cpuPer - user, system and (user+system) CPU usecs per read
            * rx.lat - average read latency in microseconds, per RocksDB
            * io.lat - average read latency in microseconds, per iostat
            * test - the db_bench test name

            - block_align disabled
            qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
            7629     7740   8.9     12.133   8.718  20.852  137.92  111     readrandom
            7866     7813   9.1     10.094   9.098  19.192  127.12  115     readrandom2
            7972     7862   8.6      8.931   9.326  18.257  125.44  110     readrandom3

            - block_align enabled
            qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
            8282     8350   8.5     11.643   7.602  19.246  120.74  101     readrandom
            8394     8327   8.7      9.997   8.525  18.523  119.13  105     readrandom2
            8522     8400   8.2      8.732   8.718  17.450  117.34  100     readrandom3

            Async IO in RocksDB

            Per the wiki, RocksDB can do async IO for point queries that use MultiGet. That is done via coroutines and requires linking with Folly. My builds do not support that today and because my focus is on efficiency rather than throughput I did not try it for this test.

            Flamegraphs

            Flamegraphs are here for readrandom, readrandom2 and readrandom3.

            A summary of where CPU time is spent based on the flamegraphs.

            Legend:
            * rr, rr2, rr3 - readrandom, readrandom2, readrandom3
            * libc+k - time in libc + kernel
            * checksm - verify data block checksum after read
            * IBI:Sk - IndexBlockIter::SeekImpl
            * DBI:Sk - DataBlockIter::SeekImpl
            * LRU - lookup, insert blocks in the LRU, update metrics
            * bloom - check bloom filters
            * BSI - BinarySearchIndexReader::NewIterator
            * File - FilePicker::GetNextFile, FindFileInRange
            * other - other parts of the call stack, from DBImpl::Get and functions called by it

            rr is readrandom, rr2 is readrandom2, rr3 is readrandom3

            Percentage of samples
                    rr      rr2     rr3
            libc+k  37.30   42.22   50.92
            checksm  3.76    2.66    2.91
            IBI:Sk   7.07    7.36    7.76
            DBI:Sk   3.05    2.15    1.96
            LRU      5.19    6.19    6.02
            bloom   18.35    8.14    0
            BSI      2.28    4.02    3.12
            File     3.74    3.34    4.44
            other   19.26   23.92   22.87


            Sunday, May 18, 2025

            RocksDB 10.2 benchmarks: large & small servers with a cached workload

            I previously shared benchmark results for RocksDB using the larger server that I have. In this post I share more results from two other large servers and one small server. This is arbitrary but I mean >= 20 cores for large, 10 to 19 cores for medium and less than 10 cores for small.

            tl;dr

            • There are several big improvements
            • There might be a small regression in fillseq performance, I will revisit this
            • For the block cache hyperclock does much better than LRU on CPU-bound tests
            • I am curious about issue 13546 but not sure the builds I tested include it

            Software

            I used RocksDB versions 6.29.5, 7.10.2, 8.11.4, 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.2, 9.7.4, 9.8.4, 9.9.3, 9.10.0, 9.11.2, 10.0.1, 10.1.3 and 10.2.1. Everything was compiled with gcc 11.4.0.

            For 8.x, 9.x and 10.x the benchmark was repeated using both the LRU block cache (older code) and hyperclock (newer code). That was done by setting the --cache_type argument:

            • lru_cache was used for versions 7.6 and earlier
            • hyper_clock_cache was used for versions 7.7 through 8.5
            • auto_hyper_clock_cache was used for versions 8.5+

            Hardware

            My servers are described here. From that list I used:

            • The small server is a Ryzen 7 (AMD) CPU with 8 cores and 32G of RAM. It is v5 in the blog post.
            • The first large server has 24 cores with 64G of RAM. It is v6 in the blog post.
            • The other large server has 32 cores and 128G of RAM. It is v7 in the blog post.

            Benchmark

            Overviews on how I use db_bench are here and here.

            Tests were run for a workload with the database cached by RocksDB that I call byrx in my scripts.

            The benchmark steps that I focus on are:
            • fillseq
              • load RocksDB in key order with 1 thread
            • revrangeww, fwdrangeww
              • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries
            • readww
              • do point queries with a rate-limited writer. Report performance for the point queries.
            • overwrite
              • overwrite (via Put) random keys using many threads

            Relative QPS

            Many of the tables below (inlined and via URL) show the relative QPS which is:
                (QPS for my version / QPS for base version)

            The base version varies and is listed below. When the relative QPS is > 1.0 then my version is faster than the base version. When it is < 1.0 then there might be a performance regression or there might just be noise 

            Small server

            The benchmark was run using 1 client thread and 20M KV pairs. Each benchmark step was run for 1800 seconds. Performance summaries are here

            For the byrx (cached database) workload with the LRU block cache:

            • see relative and absolute performance summaries, the base version is RocksDB 6.29.5
            • fillseq is ~14% faster in 10.2 vs 6.29 with improvements in 7.x and 9.x
            • revrangeww and fwdrangeww are ~6% slower in 10.2 vs 6.29, I might revisit this
            • readww has similar perf from 6.29 through 10.2
            • overwrite is ~14% faster in 10.2 vs 6.29 with most of the improvement in 7.x

            For the byrx (cached database) workload with the Hyper Clock block cache

            • see relative and absolute performance summaries, the base version is RocksDB 8.11.4
            • there might be a small regression (~3%) or there might be noise in the results

            The table below shows QPS for RocksDB 10.2.1 with the Hyper Clock block cache relative to 10.2.1 with the LRU block cache. Here the QPS for revrangeww, fwdrangeww and readww is between 9% and 15% better with Hyper Clock.

            relQPS  test
            0.99    fillseq.wal_disabled.v400
            1.09    revrangewhilewriting.t1
            1.13    fwdrangewhilewriting.t1
            1.15    readwhilewriting.t1
            0.96    overwriteandwait.t1.s0

            Large server (24 cores)

            The benchmark was run using 16 client threads and 40M KV pairs. Each benchmark step was run for 1800 seconds. Performance summaries are here.

            For the byrx (cached database) workload with the LRU block cache

            • see relative and absolute performance summaries, the base version is RocksDB 6.29.5
            • fillseq might have a new regression of ~4% in 10.2.1 or that might be noise, I will revisit this
            • revrangeww, fwdrangeww, readww and overwrite are mostly unchanged since 8.x

            For the byrx (cached database) workload with the Hyper Clock block cache

            • see relative and absolute performance summaries, the base version is RocksDB 8.11.4
            • fillseq might have a new regression of ~8% in 10.2.1 or that might be noise, I will revisit this
            • revrangeww, fwdrangeww, readww and overwrite are mostly unchanged since 8.x

            The table below shows QPS for RocksDB 10.2.1 with the Hyper Clock block cache relative to 10.2.1 with the LRU block cache. Hyper Clock is much better for workloads where multiple threads frequently access the block cache.

            relQPS  test
            0.97    fillseq.wal_disabled.v400
            1.35    revrangewhilewriting.t16
            1.43    fwdrangewhilewriting.t16
            1.69    readwhilewriting.t16
            0.97    overwriteandwait.t16.s0

            Large server (32 cores)

            The benchmark was run using 24 client threads and 50M KV pairs. Each benchmark step was run for 1800 seconds. Performance summaries are here.

            For the byrx (cached database) workload with the LRU block cache

            • see relative and absolute performance summaries, the base version is RocksDB 6.29.5
            • fillseq might have a new regression of ~10% from 7.10 through 10.2, I will revisit this
            • revrangeww, fwdrangeww, readww and overwrite are mostly unchanged since 8.x

            For the byrx (cached database) workload with the Hyper Clock block cache

            • see relative and absolute performance summaries, the base version is RocksDB 7.10.2
            • fillseq might have a new regression of ~10% from 7.10 through 10.2, I will revisit this
            • revrangeww, fwdrangeww, readww and overwrite are mostly unchanged since 8.x

            The table below shows QPS for RocksDB 10.2.1 with the Hyper Clock block cache relative to 10.2.1 with the LRU block cache. Hyper Clock is much better for workloads where multiple threads frequently access the block cache.

            relQPS  test
            1.02    fillseq.wal_disabled.v400
            1.39    revrangewhilewriting.t24
            1.55    fwdrangewhilewriting.t24
            1.77    readwhilewriting.t24
            1.00    overwriteandwait.t24.s0

            Tuesday, May 6, 2025

            RocksDB 10.2 benchmarks: large server

             This post has benchmark results for RocksDB 10.x, 9.x, 8.11, 7.10 and 6.29 on a large server.

            tl;dr

            • There are several big improvements
            • There are no new regressions
            • For the block cache hyperclock does much better than LRU on CPU-bound tests

            Software

            I used RocksDB versions 6.0.2, 6.29.5, 7.10.2, 8.11.4, 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.2, 9.7.4, 9.8.4, 9.9.3, 9.10.0, 9.11.2, 10.0.1, 10.1.3, 10.2.1. Everything was compiled with gcc 11.4.0.

            For 8.x, 9.x and 10.x the benchmark was repeated using both the LRU block cache (older code) and hyperclock (newer code). That was done by setting the --cache_type argument:

            • lru_cache was used for versions 7.6 and earlier
            • hyper_clock_cache was used for versions 7.7 through 8.5
            • auto_hyper_clock_cache was used for versions 8.5+

            Hardware

            The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

            Benchmark

            Overviews on how I use db_bench are here and here.

            All of my tests here run the benchmark with 36 client threads.

            Tests were repeated for 3 workload+configuration setups:

            • byrx - database is cached by RocksDB
            • iobuf - database is larger than RAM and RocksDB uses buffered IO
            • iodir - database is larger than RAM and RocksDB uses O_DIRECT

            The benchmark steps named on the charts are:
            • fillseq
              • load RocksDB in key order with 1 thread
            • revrangeww, fwdrangeww
              • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries
            • readww
              • do point queries with a rate-limited writer. Report performance for the point queries.
            • overwrite
              • overwrite (via Put) random keys using many threads

            Results: byrx

            Performance summaries are here for: LRU block cache, hyperclock and LRU vs hyperclock. A spreadsheet with relative QPS and charts is here.

            The graphs below show relative QPS which is: (QPS for me / QPS for base case). When the relative QPS is greater than one then performance improved relative to the base case. The y-axis doesn't start at zero in most graphs to make it easier to see changes.

            This chart has results for the LRU block cache and the base case is RocksDB 6.29.5:
            • overwrite
              • ~1.2X faster in modern RocksDB
            • revrangeww, fwdrangeww, readww
              • slightly faster in modern RocksDB
            • fillseq
              • ~15% slower in modern RocksDB most likely from new code added for correctness checks
            This chart has results for the hyperclock block cache and the base case is RocksDB 8.11.4:
            • there are approximately zero regressions. The changes are small and might be normal variance.

            This chart has results from RocksDB 10.2.1. The base case uses the LRU block cache and that is compared with hyperclock:
            • readww
              • almost 3X faster with hyperclock because it suffers the most from block cache contention
            • revrangeww, fwdrangeww
              • almost 2X faster with hyperclock
            • fillseq
              • no change with hyperclock because the workload uses only 1 thread
            • overwrite
              • no benefit from hyperclock because write stalls are the bottleneck

            Results: iobuf

            Performance summaries are here for: LRU block cache, hyperclock and LRU vs hyperclock. A spreadsheet with relative QPS and charts is here.

            The graphs below show relative QPS which is: (QPS for me / QPS for base case). When the relative QPS is greater than one then performance improved relative to the base case. The y-axis doesn't start at zero in most graphs to make it easier to see changes.

            This chart has results for the LRU block cache and the base case is RocksDB 6.29.5.
            • fillseq
              • ~1.6X faster since RocksDB 7.x
            • readww
              • ~6% faster in modern RocksDB
            • overwrite
            • revrangeww, fwdrangeww
              • ~5% slower since early 8.x

            This chart has results for the hyperclock block cache and the base case is RocksDB 8.11.4.
            • overwrite
              • suffered from issue 12038 in versions 8.6 through 9.8. The line would be similar to what I show above had the base case been 8.5 or earlier
            • fillseq
              • ~7% faster in 10.2 relative to 8.11
            • revrangeww, fwdrangeww, readww
              • unchanged from 8.11 to 10.2

            This chart has results from RocksDB 10.2.1. The base case uses the LRU block cache and that is compared with hyperclock.

            • readww
              • ~8% faster with hyperclock. The benefit here is smaller than above for byrx because the workload here is less CPU-bound
            • revrangeww, fwdrangeww, overwrite
              • slightly faster with hyperclock
            • fillseq
              • no change with hyperclock because the workload uses only 1 thread

            Results: iodir

            Performance summaries are here for: LRU block cache, hyperclock and LRU vs hyperclock. A spreadsheet with relative QPS and charts is here

            The graphs below show relative QPS which is: (QPS for me / QPS for base case). When the relative QPS is greater than one then performance improved relative to the base case. The y-axis doesn't start at zero in most graphs to make it easier to see changes.

            This chart has results for the LRU block cache and the base case is RocksDB 6.29.5.

            • fillseq
              • ~1.6X faster since RocksDB 7.x (see results above for iobuf)
            • overwrite
              • ~1.2X faster in modern RocksDB
            • revrangeww, fwdrangeww, readww
              • unchanged from 6.29 to 10.2

            This chart has results for the hyperclock block cache and the base case is RocksDB 8.11.4.

            • overwrite
              • might have a small regression (~3%) from 8.11 to 10.2
            • revrangeww, fwdrangeww, readww, fillseq
              • unchanged from 8.11 to 10.2

            This chart has results from RocksDB 10.2.1. The base case uses the LRU block cache and that is compared with hyperclock.

            • there are small regressions and/or small improvements and/or normal variance



            Friday, November 29, 2024

            RocksDB on a big server: LRU vs hyperclock, v2

            This post shows that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a large server to show the speedup from the hyperclock block cache implementation for different concurrency levels with RocksDB 9.6. Here I share results from the same server and different (old and new) RocksDB releases.

            Results are amazing on a large (48 cores) server with 40 client threads

            • ~2X more QPS for range queries with hyperclock
            • ~3X more QPS for point queries with hyperclock

            Software

            I used RocksDB versions 6.0.2, 6.29.5, 7.0.4, 7.6.0, 7.7.8, 8.5.4, 8.6.7, 9.0.1, 9.1.2, 9.3.2, 9.5.2, 9.7.4 and 9.9.0. Everything was compiled with gcc 11.4.0.

            The --cache_type argument selected the block cache implementation:

            • lru_cache was used for versions 7.6 and earlier. Because some of the oldest releases don't support --cache_type I also used --undef_params=...,cache_type
            • hyper_clock_cache was used for versions 7.7 through 8.5
            • auto_hyper_clock_cache was used for versions 8.5+

            Hardware

            The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

            Benchmark

            Overviews on how I use db_bench are here and here.

            All of my tests here use a CPU-bound workload with a database that is cached by RocksDB and the benchmark is run for 40 threads.

            I focus on the read-heavy benchmark steps:

            • revrangeww (reverse range while writing) - this does short reverse range scans
            • fwdrangeww (forward range while writing) - this does short forward range scans
            • readww (read while writing) - this does point queries

            For each of these there is a fixed rate for writes done in the background and performance is reported for the reads. I prefer to measure read performance when there are concurrent writes because read-only benchmarks with an LSM suffer from non-determinism as the state (shape) of the LSM tree has a large impact on CPU overhead and throughput.
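            As a sketch of what the fixed write rate looks like with db_bench (not my exact command line, and the flag values here are placeholders), the background writer is capped via --benchmark_write_rate_limit, which is in bytes per second:

            # readww: point queries at 40 threads with a rate-limited background writer
            ./db_bench --benchmarks=readwhilewriting --use_existing_db=1 --threads=40 \
                --duration=1800 --benchmark_write_rate_limit=$((2 * 1024 * 1024))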

            Results

            All results are in this spreadsheet and the performance summary is here.

            The graph below shows relative QPS which is: (QPS for my version / QPS for RocksDB 6.0.2) and the results are amazing:

            • ~2X more QPS for range queries with hyperclock
            • ~3X more QPS for point queries with hyperclock

            The average values for vmstat metrics provide more detail on why hyperclock is so good for performance. The context switch rate drops dramatically when it is enabled because there is much less mutex contention. The user CPU utilization increases by ~1.6X because more useful work can get done when there is less mutex contention.

            legend
            * cs - context switches per second per vmstat
            * us - user CPU utilization per vmstat
            * sy - system CPU utilization per vmstat
            * id - idle CPU utilization per vmstat
            * wa - wait CPU utilization per vmstat
            * version - RocksDB version

            cs      us      sy      us+sy   id      wa      version
            1495325 50.3    14.0    64.3    18.5    0.1     7.6.0
            2360    82.7    14.0    96.7    16.6    0.1     9.9.0
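            One way to collect such averages is to sample vmstat while a benchmark step runs, roughly like the sketch below (the interval and count are placeholders):

            # sample vmstat every 10 seconds for the duration of the benchmark step
            vmstat 10 180 > vmstat.out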

            Monday, November 25, 2024

            RocksDB benchmarks: large server, universal compaction

            This post has results for universal compaction from the same large server for which I recently shared leveled compaction results. The results are boring (no large regressions) but a bit more exciting than the ones for leveled compaction because there is more variance. A somewhat educated guess is that variance is more likely with universal compaction.

            tl;dr

            • there are some small regressions for cached workloads (see byrx below)
            • there are some small to medium improvements for IO-bound workloads (see iodir and iobuf)
            • modern RocksDB would look better were I to use the Hyper Clock block cache, but here I don't so that I test similar code across all versions

            Hardware

            The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

            Builds

            I compiled db_bench from source on all servers. I used versions:
            • 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
            • 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
            • 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
            • 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1 and 9.7.3

            Benchmark

            All tests used the default value for compaction_readahead_size and the block cache (LRU).

            I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 40 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is: bash x3.sh 40 no 1800 c48r128 100000000 2000000000 byrx iobuf iodir

            The tests on the charts are named as:
            • fillseq -- load in key order with the WAL disabled
            • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
            • fwdrangeww -- like revrangeww except do short forward range scans
            • readww - like revrangeww except do point queries
            • overwrite - do overwrites (Put) as fast as possible

            Workloads

            There are three workloads, all of which use 40 threads:

            • byrx - the database is cached by RocksDB (100M KV pairs)
            • iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
            • iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)

            A spreadsheet with all results is here and performance summaries with more details are here for byrx, iobuf and iodir.

            Relative QPS

            The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (perf regression!).

            The base version is RocksDB 6.0.2.

            Results: byrx

            The byrx tests use a cached database. The performance summary is here

            The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

            Summary:
            • fillseq has new CPU overhead in 7.0 from code added for correctness checks and QPS has been stable since then
            • QPS for other tests has been stable, with some variance, since late 6.x

            Results: iobuf

            The iobuf tests use an IO-bound database with buffered IO. The performance summary is here.

            The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

            Summary:
            • fillseq has been stable since 7.6
            • readww has always been stable
            • overwrite improved in 7.6 and has been stable since then
            • fwdrangeww and revrangeww improved in late 6.x and have been stable since then

            Results: iodir

            The iodir tests use an IO-bound database with O_DIRECT. The performance summary is here

            The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

            Summary:
            • fillseq has been stable since 7.6
            • readww has always been stable
            • overwrite improved in 7.6 and has been stable since then
            • fwdrangeww and revrangeww have been stable but there is some variance
