
Monday, December 8, 2025

RocksDB performance over time on a small Arm server

This post has results for RocksDB on an Arm server. I previously shared results for RocksDB performance using gcc and clang. Here I share results using clang with LTO.

RocksDB is boring: there are few performance regressions.

tl;dr

  • for cached workloads, throughput with RocksDB 10.8 is as good as or better than with 6.29
  • for not-cached workloads, throughput with RocksDB 10.8 is similar to 6.29, except for the overwrite test where it is ~7% lower, probably from correctness checks added in 7.x and 8.x

Software

I used RocksDB versions 6.29, 7.0, 7.10, 8.0, 8.4, 8.8, 8.11, 9.0, 9.4, 9.8, 9.11 and 10.0 through 10.8.

I compiled each version with clang 18.3.1 and link-time optimization (LTO) enabled. The build command line was:

flags=( DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 )

# for clang+LTO
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
    make USE_LTO=1 "${flags[@]}" static_lib db_bench

Hardware

I used a small Arm server from the Google cloud running Ubuntu 22.04. The server type was c4a-standard-8-lssd with 8 cores and 32G of RAM. Storage was 2 local SSDs with RAID 0 and ext-4.
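The RAID and filesystem setup was similar to the sketch below. These are not the exact commands from my setup scripts and the local SSD device names (/dev/nvme0n1, /dev/nvme0n2) and mount point are assumptions:

# assemble the 2 local SSDs into a RAID 0 array and format it with ext-4
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme0n2
sudo mkfs.ext4 /dev/md0
sudo mount -o noatime /dev/md0 /data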

Benchmark

Overviews on how I use db_bench are here and here.

The benchmark was run with 1 thread and used the LRU block cache.

Tests were run for three workloads:

  • byrx - database cached by RocksDB
  • iobuf - database is larger than RAM and RocksDB used buffered IO
  • iodir - database is larger than RAM and RocksDB used O_DIRECT
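For the iodir workload, a sketch of the O_DIRECT configuration is below. This is not the exact command line from my helper scripts and the other flag values are placeholders:

# enable O_DIRECT for user reads and for flush/compaction writes
./db_bench --benchmarks=overwrite --use_existing_db=1 --num=2000000000 \
    --use_direct_reads=1 --use_direct_io_for_flush_and_compaction=1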

The benchmark steps that I focus on are:
  • fillseq
    • load RocksDB in key order with 1 thread
  • revrangeww, fwdrangeww
    • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries
  • readww
    • do point queries with a rate-limited writer. Report performance for the point queries.
  • overwrite
    • overwrite (via Put) random keys
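The steps above are run by my helper scripts. A rough sketch of the db_bench command lines behind them is below -- these are not the exact command lines from the scripts and the flag values are placeholders:

# fillseq: load in key order
./db_bench --benchmarks=fillseq --num=20000000 --threads=1

# readww: point queries with a rate-limited writer (rate is in bytes/second)
./db_bench --benchmarks=readwhilewriting --use_existing_db=1 --num=20000000 \
    --threads=1 --duration=1800 --benchmark_write_rate_limit=2097152

# overwrite: overwrite random keys via Put
./db_bench --benchmarks=overwrite --use_existing_db=1 --num=20000000 \
    --threads=1 --duration=1800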

Relative QPS

Many of the tables below (inlined and via URL) show the relative QPS which is:
    (QPS for my version / QPS for RocksDB 6.29)

When the relative QPS is > 1.0 then my version is faster than RocksDB 6.29. When it is < 1.0 then there might be a performance regression or there might just be noise.
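For example, if RocksDB 10.8 gets 110,000 QPS on a test where 6.29 gets 100,000 QPS then the relative QPS is 110000 / 100000 = 1.10, meaning 10.8 is about 10% faster on that test.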

The spreadsheet with numbers and charts is here. Performance summaries are here.

Results: byrx

This has results for the byrx workload where the database is cached by RocksDB.

RocksDB 10.x is faster than 6.29 for all tests.

Results: iobuf

This has results for the iobuf workload where the database is larger than RAM and RocksDB used buffered IO.

Performance in RocksDB 10.x is about the same as in 6.29 except for overwrite, where throughput in 10.8 is ~7% less than in 6.29. I think the decreases for overwrite that arrived in versions 7.x and 8.x are from new correctness checks. The big drop for fillseq in 10.6.2 was from bug 13996.

Results: iodir

This has results for the iodir workload where the database is larger than RAM and RocksDB used O_DIRECT.

Performance in RocksDB 10.x is about the same as in 6.29 except for overwrite, where throughput in 10.8 is ~7% less than in 6.29. I think the decreases for overwrite that arrived in versions 7.x and 8.x are from new correctness checks. The big drop for fillseq in 10.6.2 was from bug 13996.

Monday, December 1, 2025

Using db_bench to measure RocksDB performance with gcc and clang

This has results for db_bench, a benchmark for RocksDB, when compiling it with gcc and clang. On one of my servers I saw a regression on one of the tests (fillseq) when compiling with gcc. The result on that server didn't match what I measured on two other servers. So I repeated tests after compiling with clang to see if I could reproduce it.

tl;dr

  • a common outcome is
    • ~10% more QPS with clang+LTO than with gcc
    • ~5% more QPS with clang than with gcc
  • the performance gap between clang and gcc is larger in RocksDB 10.x than in earlier versions

Variance

I always worry about variance when I search for performance bugs. Variance can be misinterpreted as a performance regression and I strive to avoid that because I don't want to file bogus performance bugs.

Possible sources of variance are:

  • the compiler toolchain
    • a bad code layout might hurt performance by increasing cache and TLB misses
  • RocksDB
    • the overhead from compaction is intermittent and the LSM tree layout can help or hurt CPU overhead during reads
  • hardware
    • sources include noisy neighbors on public cloud servers, insufficient CPU cooling and CPU frequency management that is too clever
  • benchmark client
    • the way in which I run tests can create more or less variance and more information on that is here and here

Software

I used RocksDB versions 6.29.5, 7.10.2, 8.0, 8.4, 8.8, 8.11, 9.0, 9.4, 9.8, 9.11 and 10.0 through 10.8.

I compiled each version three times:

  • gcc - using version 13.3.0
  • clang - using version 18.3.1
  • clang+LTO - using version 18.3.1, where LTO is link-time optimization

The build command lines are below:

flags=( DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 )

# for gcc
make "${flags[@]}" static_lib db_bench

# for clang
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
    make "${flags[@]}" static_lib db_bench

# for clang+LTO
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
    make USE_LTO=1 "${flags[@]}" static_lib db_bench

On the small servers I used the LRU block cache. On the large server I used hyper clock when possible:
  • lru_cache was used for versions 7.6 and earlier
  • hyper_clock_cache was used for versions 7.7 through 8.5
  • auto_hyper_clock_cache was used for versions 8.5+
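The block cache implementation is selected via the --cache_type argument. A sketch is below; the other flag values are placeholders, not my exact command lines:

# use the newer hyper clock cache on versions that support it (8.5+)
./db_bench --benchmarks=readwhilewriting --use_existing_db=1 \
    --cache_type=auto_hyper_clock_cache --cache_size=$((16 * 1024 * 1024 * 1024))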

Hardware

I used two small servers and one large server, all running Ubuntu 22.04:

  • pn-53
    • Ryzen 7 (AMD) CPU with 8 cores and 32G of RAM. It is v5 in the blog post
    • benchmarks are run with 1 client (thread)
  • arm
    • an ARM server from the Google cloud -- c4a-standard-8-lssd with 8 cores and 32G of RAM, 2 local SSDs using RAID 0 and ext-4
    • benchmarks are run with 1 client (thread)
  • hetzner
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G of RAM, 2 SSDs with RAID 1 (3.8T each) using ext4
    • benchmarks are run with 36 clients (threads)

Benchmark

Overviews on how I use db_bench are here and here.

Tests were run for a workload with the database cached by RocksDB that I call byrx in my scripts.

The benchmark steps that I focus on are:
  • fillseq
    • load RocksDB in key order with 1 thread
  • revrangeww, fwdrangeww
    • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries
  • readww
    • do point queries with a rate-limited writer. Report performance for the point queries.
  • overwrite
    • overwrite (via Put) random keys

Relative QPS

Many of the tables below (inlined and via URL) show the relative QPS which is:
    (QPS for my version / QPS for RocksDB 6.29 compiled with gcc)

The base version varies and is listed below. When the relative QPS is > 1.0 then my version is faster than the base version. When it is < 1.0 then there might be a performance regression or there might just be noise.

The spreadsheet with numbers and charts is here.

Results: fillseq

Results for the pn53 server

  • clang+LTO provides ~15% more QPS than gcc in RocksDB 10.8
  • clang provides ~11% more QPS than gcc in RocksDB 10.8

Results for the Arm server

  • I am fascinated by how stable the QPS is here for clang and clang+LTO
  • clang+LTO and clang provide ~3% more QPS than gcc in RocksDB 10.8

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • the performance for RocksDB 10.8.3 with gcc is what motivated me to repeat tests with clang
  • clang+LTO and clang provide ~20% more QPS than gcc in RocksDB 10.8

Results: revrangeww

Results for the pn53 server

  • clang+LTO provides ~9% more QPS than gcc in RocksDB 10.8
  • clang provides ~6% more QPS than gcc in RocksDB 10.8

Results for the Arm server

  • clang+LTO provides ~11% more QPS than gcc in RocksDB 10.8
  • clang provides ~6% more QPS than gcc in RocksDB 10.8

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~8% more QPS than gcc in RocksDB 10.8
  • clang provides ~3% more QPS than gcc in RocksDB 10.8

Results: fwdrangeww

Results for the pn53 server

  • clang+LTO provides ~9% more QPS than gcc in RocksDB 10.8
  • clang provides ~4% more QPS than gcc in RocksDB 10.8

Results for the Arm server

  • clang+LTO provides ~13% more QPS than gcc in RocksDB 10.8
  • clang provides ~7% more QPS than gcc in RocksDB 10.8

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~4% more QPS than gcc in RocksDB 10.8
  • clang provides ~1% more QPS than gcc in RocksDB 10.8

Results: readww

Results for the pn53 server

  • clang+LTO provides ~6% more QPS than gcc in RocksDB 10.8
  • clang provides ~5% less QPS than gcc in RocksDB 10.8

Results for the Arm server

  • clang+LTO provides ~14% more QPS than gcc in RocksDB 10.8
  • clang provides ~2% more QPS than gcc in RocksDB 10.8

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~4% more QPS than gcc in RocksDB 10.8
  • clang provides ~1% more QPS than gcc in RocksDB 10.8

Results: overwrite

Results for the pn53 server

  • clang+LTO provides ~6% less QPS than gcc in RocksDB 10.8
  • clang provides ~8% less QPS than gcc in RocksDB 10.8
  • but for most versions there is similar QPS for gcc, clang and clang+LTO

Results for the Arm server

  • QPS is similar for gcc, clang and clang+LTO

Results for the Hetzner server

  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~2% more QPS than gcc in RocksDB 10.8
  • clang provides ~1% more QPS than gcc in RocksDB 10.8

Thursday, October 23, 2025

            How efficient is RocksDB for IO-bound, point-query workloads?

            How efficient is RocksDB for workloads that are IO-bound and read-only? One way to answer this is to measure the CPU overhead from RocksDB as this is extra overhead beyond what libc and the kernel require to perform an IO. Here my focus is on KV pairs that are smaller than the typical RocksDB block size that I use -- 8kb.

            By IO efficiency I mean:
                (storage read IOPs from RocksDB benchmark / storage read IOPs from fio)

            And I measure this in a setup where RocksDB doesn't get much benefit from RocksDB block cache hits (database size > 400G, block cache size was 16G).

            This value will be less than 1.0 in such a setup. But how much less than 1.0 will it be? On my hardware the IO efficiency was ~0.85 at 1 client and ~0.88 at 6 clients. Were I to use slower storage, such as an SSD where read latency was ~200 usecs at io_depth=1 then the IO efficiency would be closer to 0.95.

             Note that:

            • IO efficiency increases (decreases) when SSD read latency increases (decreases)
            • IO efficiency increases (decreases) when the RocksDB CPU overhead decreases (increases)
            • RocksDB QPS increases by ~8% for IO-bound workloads when --block_align is enabled

            The overheads per 8kb block read on my test hardware were:

            • about 11 microseconds from libc + kernel
            • between 6 and 10 microseconds from RocksDB
            • ~100 usecs of IO latency at io_depth=1, ~150 usecs at io_depth=6

            A simple performance model

            A simple model to predict the wall-clock latency for reading a block is:
                userland CPU + libc/kernel CPU + device latency

            For fio I assume that userland CPU is zero. I measured libc/kernel CPU at ~11 usecs and estimate that device latency is ~91 usecs. The device latency estimate comes from read-only fio benchmarks where fio reports the average latency as 102 usecs, which includes 11 usecs of CPU from libc+kernel, so 91 = 102 - 11.

            This model isn't perfect, as I show below when reporting results for RocksDB, but it might be sufficient and it allows you to predict latencies and IO efficiency when the RocksDB CPU overhead is increased or reduced.

            Q and A

            The RocksDB API could function as a universal API for storage engines, and if new DBMS built on that then it would be possible to combine new DBMS with new storage engines much faster than what is possible today.

            Persistent hash indexes are not widely implemented, but getting one that uses the RocksDB API would be interesting for workloads such as the one I run here. However, there are fewer use cases for a hash index (no range queries) than for a range index like an LSM so it is harder to justify the investment in such work.

            Q: What is the CPU overhead from libc + kernel per 8kb read?
            A: About 10 microseconds on this CPU.

            Q: Can you write your own code that will be faster than RocksDB for such a workload?
            A: Yes, you can.

            Q: Should you write your own library for this?
            A: It depends on how many features you need and the opportunity cost in spending time writing that code vs doing something else.

            Q: Will RocksDB add features to make this faster?
            A: That is for them to answer. But all projects have a complexity budget. Code can become too expensive to maintain when that budget is exceeded. There is also the opportunity cost to consider as working on this delays work on other features.

            Q: Does this matter?
            A: It matters more when storage is fast (read latency less than 100 usecs). As read response time grows the CPU overhead from RocksDB becomes much less of an issue.

            Benchmark hardware

            I ran tests on a Beelink SER7 with a Ryzen 7 7840HS CPU that has 8 cores and 32G of RAM. The storage device is a Crucial CT1000P3PSSD8 (Crucial P3, 1TB) using ext-4 with discard enabled. The OS is Ubuntu 24.04.

            From fio, the average read latency for the SSD is 102 microseconds using O_DIRECT with io_depth=1 and the sync engine.

            CPU frequency management makes it harder to claim that the CPU runs at X GHz, but the details are:

            $ cpupower frequency-info

            analyzing CPU 5:
              driver: acpi-cpufreq
              CPUs which run at the same hardware frequency: 5
              CPUs which need to have their frequency coordinated by software: 5
              maximum transition latency:  Cannot determine or is not supported.
              hardware limits: 1.60 GHz - 3.80 GHz
              available frequency steps:  3.80 GHz, 2.20 GHz, 1.60 GHz
              available cpufreq governors: conservative ... powersave performance schedutil
              current policy: frequency should be within 1.60 GHz and 3.80 GHz.
                              The governor "performance" may decide which speed to use
                              within this range.
              current CPU frequency: Unable to call hardware
              current CPU frequency: 3.79 GHz (asserted by call to kernel)
              boost state support:
                Supported: yes
                Active: no

            Results from fio

            I started with fio using a command-line like the following for NJ=1 and NJ=6 to measure average IOPs and the CPU overhead per IO.

            fio --name=randread --rw=randread --ioengine=sync --numjobs=$NJ --iodepth=1 \
              --buffered=0 --direct=1 \
              --bs=8k \
              --size=400G \
              --randrepeat=0 \
              --runtime=600s --ramp_time=1s \
              --filename=G_1:G_2:G_3:G_4:G_5:G_6:G_7:G_8  \
              --group_reporting

            Results are:

            legend:
            * iops - average reads/s reported by fio
            * usPer, syPer - user, system CPU usecs per read
            * cpuPer - usPer + syPer
            * lat.us - average read latency in microseconds
            * numjobs - the value for --numjobs with fio

            iops    usPer   syPer   cpuPer  lat.us  numjobs
             9884   1.351    9.565  10.916  101.61  1
            43782   1.379   10.642  12.022  136.35  6

            Results from RocksDB

            I used an edited version of my benchmark helper scripts that run db_bench. In this case the sequence of tests was:

            1. fillseq - loads the LSM tree in key order
            2. revrange - I ignore the results from this
            3. overwritesome - overwrites 10% of the KV pairs
            4. flush_mt_l0 - flushes the memtable, waits, compacts L0 to L1, waits
            5. readrandom - does random point queries when LSM tree has many levels
            6. compact - compacts LSM tree into one level
            7. readrandom2 - does random point queries when LSM tree has one level, bloom filters enabled
            8. readrandom3 - does random point queries when LSM tree has one level, bloom filters disabled

            I use readrandom, readrandom2 and readrandom3 to vary the amount of work that RocksDB must do per query and measure the CPU overhead of that work. The most work happens with readrandom as the LSM tree has many levels and there are bloom filters to check. The least work happens with readrandom3 as the LSM tree only has one level and there are no bloom filters to check.
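            A sketch of how the bloom filter toggle can be expressed with db_bench flags is below. These are not the exact command lines from my helper scripts, the other flag values are placeholders, and using --bloom_bits for the toggle is my assumption about how the scripts do it:

            # readrandom2: one-level LSM tree, bloom filters configured
            ./db_bench --benchmarks=readrandom --use_existing_db=1 --bloom_bits=10 \
                --cache_size=$((16 * 1024 * 1024 * 1024)) --duration=600

            # readrandom3: one-level LSM tree, bloom filters disabled
            ./db_bench --benchmarks=readrandom --use_existing_db=1 --bloom_bits=0 \
                --cache_size=$((16 * 1024 * 1024 * 1024)) --duration=600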

            Initially I ran tests with --block_align not set because that reduces space-amplification (less padding), but then 8kb reads are likely to cross file system page boundaries and become larger reads. Given that the focus here is on IO efficiency, I enabled --block_align.

            A summary of the results for db_bench with 1 user (thread) and 6 users (threads) is:

            --- 1 user
            qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
            8282     8350   8.5     11.643   7.602  19.246  120.74  101     readrandom
            8394     8327   8.7      9.997   8.525  18.523  119.13  105     readrandom2
            8522     8400   8.2      8.732   8.718  17.450  117.34  100     readrandom3

            --- 6 users
            38391   38628   8.1     14.645   7.291  21.936  156.27  134     readrandom
            39359   38623   8.3     10.449   9.346  19.795  152.43  144     readrandom2
            39669   38874   8.0      9.459   9.850  19.309  151.24  140     readrandom3

            From the tables that follow:
            • IO efficiency is approximately 0.84 at 1 client and 0.88 at 6 clients
            • With 1 user, RocksDB adds between 6.534 and 8.330 usecs of CPU time per query compared to fio, depending on the amount of work it has to do
            • With 6 users, RocksDB adds between 7.287 and 9.914 usecs of CPU time per query
            • IO latency as reported by RocksDB is ~20 usecs larger than as reported by iostat. But I have to re-read the RocksDB source code to understand where and how it is measured.

            legend:
            * io.eff - IO efficiency as (db_bench storage read IOPs / fio storage read IOPs)
            * us.inc - incremental user CPU usecs per read as (db_bench usPer - fio usPer)
            * cpu.inc - incremental total CPU usecs per read as (db_bench cpuPer - fio cpuPer)

            --- 1 user

                    io.eff          us.inc          cpu.inc         test
                    ------          ------          ------
                    0.844           10.292           8.330          readrandom
                    0.842            8.646           7.607          readrandom2
                    0.849            7.381           6.534          readrandom3

            --- 6 users

                    io.eff          us.inc          cpu.inc         test
                    ------          ------          ------
                    0.882           13.266           9.914          readrandom
                    0.882            9.070           7.773          readrandom2
                    0.887            8.080           7.287          readrandom3

            Evaluating the simple performance model

            I described a simple performance model earlier in this blog post and now it is time to see how well it does for RocksDB. First I will use values from the 1 user/client/thread case:

            • IO latency is ~91 usecs per fio
            • libc+kernel CPU overhead is ~11 usecs per fio
            • RocksDB CPU overhead is 8.330, 7.607 and 6.534 usecs for readrandom, *2 and *3

            The model is far from perfect as it predicts that RocksDB will sustain:

            • 9063 IOPs for readrandom, when it actually did 8350
            • 9124 IOPs for readrandom2, when it actually did 8327
            • 9214 IOPs for readrandom3, when it actually did 8400

            Regardless, the model is a good way to think about the problem.
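            For example, the prediction for readrandom comes from the model as: 1,000,000 usecs / (91 + 11 + 8.330) usecs per read ≈ 9063 reads/s, versus the 8350 reads/s that it actually sustained.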

            The impact from --block_align

            RocksDB QPS increases by between 7% and 9% when --block_align is enabled. Enabling it reduces read-amp and increases space-amp. But given the focus here is on IO efficiency I prefer to enable it. RocksDB QPS increases with it enabled because fewer storage read requests cross file system page boundaries, thus the average read size from storage is reduced (see the reqsz column below).

            legend:
            * qps - RocksDB QPS
            * iops - average reads/s reported by fio
            * reqsz - average read request size in KB per iostat
            * usPer, syPer, cpuPer - user, system and (user+system) CPU usecs per read
            * rx.lat - average read latency in microseconds, per RocksDB
            * io.lat - average read latency in microseconds, per iostat
            * test - the db_bench test name

            - block_align disabled
            qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
            7629     7740   8.9     12.133   8.718  20.852  137.92  111     readrandom
            7866     7813   9.1     10.094   9.098  19.192  127.12  115     readrandom2
            7972     7862   8.6      8.931   9.326  18.257  125.44  110     readrandom3

            - block_align enabled
            qps     iops    reqsz   usPer   syPer   cpuPer  rx.lat  io.lat  test
            8282     8350   8.5     11.643   7.602  19.246  120.74  101     readrandom
            8394     8327   8.7      9.997   8.525  18.523  119.13  105     readrandom2
            8522     8400   8.2      8.732   8.718  17.450  117.34  100     readrandom3

            Async IO in RocksDB

            Per the wiki, RocksDB can do async IO for point queries that use MultiGet. That is done via coroutines and requires linking with Folly. My builds do not support that today and because my focus is on efficiency rather than throughput I did not try it for this test.

            Flamegraphs

            Flamegraphs are here for readrandom, readrandom2 and readrandom3.

            A summary of where CPU time is spent based on the flamegraphs.

            Legend:
            * rr, rr2, rr3 - readrandom, readrandom2, readrandom3
            * libc+k - time in libc + kernel
            * checksm - verify data block checksum after read
            * IBI:Sk - IndexBlockIter::SeekImpl
            * DBI:Sk - DataBlockIter::SeekImpl
            * LRU - lookup, insert blocks in the LRU, update metrics
            * bloom - check bloom filters
            * BSI - BinarySearchIndexReader::NewIterator
            * File - FilePicker::GetNextFile, FindFileInRange
            * other - other parts of the call stack, from DBImpl::Get and functions called by it

            rr is readrandom, rr2 is readrandom2, rr3 is readrandom3

            Percentage of samples
                    rr      rr2     rr3
            libc+k  37.30   42.22   50.92
            checksm  3.76    2.66    2.91
            IBI:Sk   7.07    7.36    7.76
            DBI:Sk   3.05    2.15    1.96
            LRU      5.19    6.19    6.02
            bloom   18.35    8.14    0
            BSI      2.28    4.02    3.12
            File     3.74    3.34    4.44
            other   19.26   23.92   22.87


            Sunday, May 18, 2025

            RocksDB 10.2 benchmarks: large & small servers with a cached workload

            I previously shared benchmark results for RocksDB using the larger server that I have. In this post I share more results from two other large servers and one small server. This is arbitrary but I mean >= 20 cores for large, 10 to 19 cores for medium and less than 10 cores for small.

            tl;dr

            • There are several big improvements
            • There might be a small regression in fillseq performance, I will revisit this
            • For the block cache hyperclock does much better than LRU on CPU-bound tests
            • I am curious about issue 13546 but not sure the builds I tested include it

            Software

            I used RocksDB versions 6.29.5, 7.10.2, 8.11.4, 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.2, 9.7.4, 9.8.4, 9.9.3, 9.10.0, 9.11.2, 10.0.1, 10.1.3 and 10.2.1. Everything was compiled with gcc 11.4.0.

            For 8.x, 9.x and 10.x the benchmark was repeated using both the LRU block cache (older code) and hyperclock (newer code). That was done by setting the --cache_type argument:

            • lru_cache was used for versions 7.6 and earlier
            • hyper_clock_cache was used for versions 7.7 through 8.5
            • auto_hyper_clock_cache was used for versions 8.5+

            Hardware

            My servers are described here. From that list I used:

            • The small server is a Ryzen 7 (AMD) CPU with 8 cores and 32G of RAM. It is v5 in the blog post.
            • The first large server has 24 cores with 64G of RAM. It is v6 in the blog post.
            • The other large server has 32 cores and 128G of RAM. It is v7 in the blog post.

            Benchmark

            Overviews on how I use db_bench are here and here.

            Tests were run for a workload with the database cached by RocksDB that I call byrx in my scripts.

            The benchmark steps that I focus on are:
            • fillseq
              • load RocksDB in key order with 1 thread
            • revrangeww, fwdrangeww
              • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries
            • readww
              • do point queries with a rate-limited writer. Report performance for the point queries.
            • overwrite
              • overwrite (via Put) random keys using many threads

            Relative QPS

            Many of the tables below (inlined and via URL) show the relative QPS which is:
                (QPS for my version / QPS for base version)

            The base version varies and is listed below. When the relative QPS is > 1.0 then my version is faster than the base version. When it is < 1.0 then there might be a performance regression or there might just be noise 

            Small server

            The benchmark was run using 1 client thread and 20M KV pairs. Each benchmark step was run for 1800 seconds. Performance summaries are here

            For the byrx (cached database) workload with the LRU block cache:

            • see relative and absolute performance summaries, the base version is RocksDB 6.29.5
            • fillseq is ~14% faster in 10.2 vs 6.29 with improvements in 7.x and 9.x
            • revrangeww and fwdrangeww are ~6% slower in 10.2 vs 6.29, I might revisit this
            • readww has similar perf from 6.29 through 10.2
            • overwrite is ~14% faster in 10.2 vs 6.29 with most of the improvement in 7.x

            For the byrx (cached database) workload with the Hyper Clock block cache

            • see relative and absolute performance summaries, the base version is RocksDB 8.11.4
            • there might be a small regression (~3%) or there might be noise in the results

            The table below shows QPS for RocksDB 10.2.1 with the Hyper Clock block cache relative to 10.2.1 with the LRU block cache. Here the QPS for revrangeww, fwdrangeww and readww is between 9% and 15% better with Hyper Clock.

            relQPS  test
            0.99    fillseq.wal_disabled.v400
            1.09    revrangewhilewriting.t1
            1.13    fwdrangewhilewriting.t1
            1.15    readwhilewriting.t1
            0.96    overwriteandwait.t1.s0

            Large server (24 cores)

            The benchmark was run using 16 client threads and 40M KV pairs. Each benchmark step was run for 1800 seconds. Performance summaries are here.

            For the byrx (cached database) workload with the LRU block cache

            • see relative and absolute performance summaries, the base version is RocksDB 6.29.5
            • fillseq might have a new regression of ~4% in 10.2.1 or that might be noise, I will revisit this
            • revrangeww, fwdrangeww, readww and overwrite are mostly unchanged since 8.x

            For the byrx (cached database) workload with the Hyper Clock block cache

            • see relative and absolute performance summaries, the base version is RocksDB 8.11.4
            • fillseq might have a new regression of ~8% in 10.2.1 or that might be noise, I will revisit this
            • revrangeww, fwdrangeww, readww and overwrite are mostly unchanged since 8.x

            The table below shows QPS for RocksDB 10.2.1 with the Hyper Clock block cache relative to 10.2.1 with the LRU block cache. Hyper Clock is much better for workloads where multiple threads frequently access the block cache.

            relQPS  test
            0.97    fillseq.wal_disabled.v400
            1.35    revrangewhilewriting.t16
            1.43    fwdrangewhilewriting.t16
            1.69    readwhilewriting.t16
            0.97    overwriteandwait.t16.s0

            Large server (32 cores)

            The benchmark was run using 24 client threads and 50M KV pairs. Each benchmark step was run for 1800 seconds. Performance summaries are here.

            For the byrx (cached database) workload with the LRU block cache

            • see relative and absolute performance summaries, the base version is RocksDB 6.29.5
            • fillseq might have a new regression of ~10% from 7.10 through 10.2, I will revisit this
            • revrangeww, fwdrangeww, readww and overwrite are mostly unchanged since 8.x

            For the byrx (cached database) workload with the Hyper Clock block cache

            • see relative and absolute performance summaries, the base version is RocksDB 7.10.2
            • fillseq might have a new regression of ~10% from 7.10 through 10.2, I will revisit this
            • revrangeww, fwdrangeww, readww and overwrite are mostly unchanged since 8.x

            The table below shows QPS for RocksDB 10.2.1 with the Hyper Clock block cache relative to 10.2.1 with the LRU block cache. Hyper Clock is much better for workloads where multiple threads frequently access the block cache.

            relQPS  test
            1.02    fillseq.wal_disabled.v400
            1.39    revrangewhilewriting.t24
            1.55    fwdrangewhilewriting.t24
            1.77    readwhilewriting.t24
            1.00    overwriteandwait.t24.s0

            Tuesday, May 6, 2025

            RocksDB 10.2 benchmarks: large server

             This post has benchmark results for RocksDB 10.x, 9.x, 8.11, 7.10 and 6.29 on a large server.

            tl;dr

            • There are several big improvements
            • There are no new regressions
            • For the block cache hyperclock does much better than LRU on CPU-bound tests

            Software

            I used RocksDB versions 6.0.2, 6.29.5, 7.10.2, 8.11.4, 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.2, 9.7.4, 9.8.4, 9.9.3, 9.10.0, 9.11.2, 10.0.1, 10.1.3, 10.2.1. Everything was compiled with gcc 11.4.0.

            For 8.x, 9.x and 10.x the benchmark was repeated using both the LRU block cache (older code) and hyperclock (newer code). That was done by setting the --cache_type argument:

            • lru_cache was used for versions 7.6 and earlier
            • hyper_clock_cache was used for versions 7.7 through 8.5
            • auto_hyper_clock_cache was used for versions 8.5+

            Hardware

            The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

            Benchmark

            Overviews on how I use db_bench are here and here.

            All of my tests here run the benchmark with 36 client threads.

            Tests were repeated for 3 workload+configuration setups:

            • byrx - database is cached by RocksDB
            • iobuf - database is larger than RAM and RocksDB uses buffered IO
            • iodir - database is larger than RAM and RocksDB uses O_DIRECT

            The benchmark steps named on the charts are:
            • fillseq
              • load RocksDB in key order with 1 thread
            • revrangeww, fwdrangeww
              • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries
            • readww
              • do point queries with a rate-limited writer. Report performance for the point queries.
            • overwrite
              • overwrite (via Put) random keys using many threads

            Results: byrx

            Performance summaries are here for: LRU block cache, hyperclock and LRU vs hyperclock. A spreadsheet with relative QPS and charts is here.

            The graphs below show relative QPS which is: (QPS for me / QPS for base case). When the relative QPS is greater than one then performance improved relative to the base case. The y-axis doesn't start at zero in most graphs to make it easier to see changes.

            This chart has results for the LRU block cache and the base case is RocksDB 6.29.5:
            • overwrite
              • ~1.2X faster in modern RocksDB
            • revrangeww, fwdrangeww, readww
              • slightly faster in modern RocksDB
            • fillseq
              • ~15% slower in modern RocksDB most likely from new code added for correctness checks
            This chart has results for the hyperclock block cache and the base case is RocksDB 8.11.4:
            • there are approximately zero regressions. The changes are small and might be normal variance.

            This chart has results from RocksDB 10.2.1. The base case uses the LRU block cache and that is compared with hyperclock:
            • readww
              • almost 3X faster with hyperclock because it suffers the most from block cache contention
            • revrangeww, fwdrangeww
              • almost 2X faster with hyperclock
            • fillseq
              • no change with hyperclock because the workload uses only 1 thread
            • overwrite
              • no benefit from hyperclock because write stalls are the bottleneck

            Results: iobuf

            Performance summaries are here for: LRU block cache, hyperclock and LRU vs hyperclock. A spreadsheet with relative QPS and charts is here.

            The graphs below show relative QPS which is: (QPS for me / QPS for base case). When the relative QPS is greater than one then performance improved relative to the base case. The y-axis doesn't start at zero in most graphs to make it easier to see changes.

            This chart has results for the LRU block cache and the base case is RocksDB 6.29.5.
            • fillseq
              • ~1.6X faster since RocksDB 7.x
            • readww
              • ~6% faster in modern RocksDB
            • overwrite
            • revrangeww, fwdrangeww
              • ~5% slower since early 8.x

            This chart has results for the hyperclock block cache and the base case is RocksDB 8.11.4.
            • overwrite
              • suffered from issue 12038 in versions 8.6 through 9.8. The line would be similar to what I show above had the base case been 8.5 or earlier
            • fillseq
              • ~7% faster in 10.2 relative to 8.11
            • revrangeww, fwdrangeww, readww
              • unchanged from 8.11 to 10.2

            This chart has results from RocksDB 10.2.1. The base case uses the LRU block cache and that is compared with hyperclock.

            • readww
              • ~8% faster with hyperclock. The benefit here is smaller than above for byrx because the workload here is less CPU-bound
            • revrangeww, fwdrangeww, overwrite
              • slightly faster with hyperclock
            • fillseq
              • no change with hyperclock because the workload uses only 1 thread

            Results: iodir

            Performance summaries are here for: LRU block cache, hyperclock and LRU vs hyperclock. A spreadsheet with relative QPS and charts is here

            The graphs below show relative QPS which is: (QPS for me / QPS for base case). When the relative QPS is greater than one then performance improved relative to the base case. The y-axis doesn't start at zero in most graphs to make it easier to see changes.

            This chart has results for the LRU block cache and the base case is RocksDB 6.29.5.

            • fillseq
              • ~1.6X faster since RocksDB 7.x (see results above for iobuf)
            • overwrite
              • ~1.2X faster in modern RocksDB
            • revrangeww, fwdrangeww, readww
              • unchanged from 6.29 to 10.2

            This chart has results for the hyperclock block cache and the base case is RocksDB 8.11.4.

            • overwrite
              • might have a small regression (~3%) from 8.11 to 10.2
            • revrangeww, fwdrangeww, readww, fillseq
              • unchanged from 8.11 to 10.2

            This chart has results from RocksDB 10.2.1. The base case uses the LRU block cache and that is compared with hyperclock.

            • there are small regressions and/or small improvements and/or normal variance



            Friday, November 29, 2024

            RocksDB on a big server: LRU vs hyperclock, v2

            This post shows that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a large server to show the speedup from the hyperclock block cache implementation for different concurrency levels with RocksDB 9.6. Here I share results from the same server and different (old and new) RocksDB releases.

            Results are amazing on a large (48 cores) server with 40 client threads

            • ~2X more QPS for range queries with hyperclock
            • ~3X more QPS for point queries with hyperclock

            Software

            I used RocksDB versions 6.0.2, 6.29.5, 7.0.4, 7.6.0, 7.7.8, 8.5.4, 8.6.7, 9.0.1, 9.1.2, 9.3.2, 9.5.2, 9.7.4 and 9.9.0. Everything was compiled with gcc 11.4.0.

            The --cache_type argument selected the block cache implementation:

            • lru_cache was used for versions 7.6 and earlier. Because some of the oldest releases don't support --cache_type I also used --undef_params=...,cache_type
            • hyper_clock_cache was used for versions 7.7 through 8.5
            • auto_hyper_clock_cache was used for versions 8.5+

            Hardware

            The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

            Benchmark

            Overviews on how I use db_bench are here and here.

            All of my tests here use a CPU-bound workload with a database that is cached by RocksDB and the benchmark is run for 40 threads.

            I focus on the read-heavy benchmark steps:

            • revrangeww (reverse range while writing) - this does short reverse range scans
            • fwdrangeww (forward range while writing) - this does short forward range scans
            • readww (read while writing) - this does point queries

            For each of these there is a fixed rate for writes done in the background and performance is reported for the reads. I prefer to measure read performance when there are concurrent writes because read-only benchmarks with an LSM suffer from non-determinism as the state (shape) of the LSM tree has a large impact on CPU overhead and throughput.
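            As a sketch of what the fixed write rate looks like with db_bench (not my exact command line, and the flag values here are placeholders), the background writer is capped via --benchmark_write_rate_limit, which is in bytes per second:

            # readww: point queries at 40 threads with a rate-limited background writer
            ./db_bench --benchmarks=readwhilewriting --use_existing_db=1 --threads=40 \
                --duration=1800 --benchmark_write_rate_limit=$((2 * 1024 * 1024))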

            Results

            All results are in this spreadsheet and the performance summary is here.

            The graph below shows relative QPS which is: (QPS for my version / QPS for RocksDB 6.0.2) and the results are amazing:

            • ~2X more QPS for range queries with hyperclock
            • ~3X more QPS for point queries with hyperclock

            The average values for vmstat metrics provide more detail on why hyperclock is so good for performance. The context switch rate drops dramatically when it is enabled because there is much less mutex contention. The user CPU utilization increases by ~1.6X because more useful work can get done when there is less mutex contention.

            legend
            * cs - context switches per second per vmstat
            * us - user CPU utilization per vmstat
            * sy - system CPU utilization per vmstat
            * id - idle CPU utilization per vmstat
            * wa - wait CPU utilization per vmstat
            * version - RocksDB version

            cs      us      sy      us+sy   id      wa      version
            1495325 50.3    14.0    64.3    18.5    0.1     7.6.0
            2360    82.7    14.0    96.7    16.6    0.1     9.9.0
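            One way to collect such averages is to sample vmstat while a benchmark step runs, roughly like the sketch below (the interval and count are placeholders):

            # sample vmstat every 10 seconds for the duration of the benchmark step
            vmstat 10 180 > vmstat.out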

            Monday, November 25, 2024

            RocksDB benchmarks: large server, universal compaction

            This post has results for universal compaction from the same large server for which I recently shared leveled compaction results. The results are boring (no large regressions) but a bit more exciting than the ones for leveled compaction because there is more variance. A somewhat educated guess is that variance is more likely with universal compaction.

            tl;dr

            • there are some small regressions for cached workloads (see byrx below)
            • there are some small to medium improvements for IO-bound workloads (see iodir and iobuf)
            • modern RocksDB would look better were I to use the Hyper Clock block cache, but here I don't so that I test similar code across all versions

            Hardware

            The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

            Builds

            I compiled db_bench from source on all servers. I used versions:
            • 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
            • 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
            • 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
            • 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1 and 9.7.3

            Benchmark

            All tests used the default value for compaction_readahead_size and the block cache (LRU).

            I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 40 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is: bash x3.sh 40 no 1800 c48r128 100000000 2000000000 byrx iobuf iodir

            The tests on the charts are named as:
            • fillseq -- load in key order with the WAL disabled
            • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
            • fwdrangeww -- like revrangeww except do short forward range scans
            • readww - like revrangeww except do point queries
            • overwrite - do overwrites (Put) as fast as possible

            Workloads

            There are three workloads, all of which use 40 threads:

            • byrx - the database is cached by RocksDB (100M KV pairs)
            • iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
            • iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)

            A spreadsheet with all results is here and performance summaries with more details are here for byrx, iobuf and iodir.

            Relative QPS

            The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (perf regression!).

            The base version is RocksDB 6.0.2.

            Results: byrx

            The byrx tests use a cached database. The performance summary is here

            The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

            Summary:
            • fillseq has new CPU overhead in 7.0 from code added for correctness checks and QPS has been stable since then
            • QPS for other tests has been stable, with some variance, since late 6.x

            Results: iobuf

            The iobuf tests use an IO-bound database with buffered IO. The performance summary is here.

            The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

            Summary:
            • fillseq has been stable since 7.6
            • readww has always been stable
            • overwrite improved in 7.6 and has been stable since then
            • fwdrangeww and revrangeww improved in late 6.x and have been stable since then

            Results: iodir

            The iodir tests use an IO-bound database with O_DIRECT. The performance summary is here

            The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

            Summary:
            • fillseq has been stable since 7.6
            • readww has always been stable
            • overwrite improved in 7.6 and has been stable since then
            • fwdrangeww and revrangeww have been stable but there is some variance
