Thursday, December 11, 2025

Sysbench for MySQL 5.6 through 9.5 on a 2-socket, 24-core server

This has results for the sysbench benchmark on a 2-socket, 24-core server. A post with results from 8-core and 32-core servers is here.

tl;dr

  • old bad news - there were many large regressions from 5.6 to 5.7 to 8.0
  • new bad news - there are some new regressions after MySQL 8.0
Normally I claim that there are few regressions after MySQL 8.0 but that isn't the case here. I also see regressions after MySQL 8.0 on the other larger servers that I use, but that topic will be explained in another post.

Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.43, 8.0.44, 8.4.6, 8.4.7, 9.4.0 and 9.5.0.
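
A minimal sketch of the kind of build this implies, not my exact options: CMAKE_BUILD_TYPE, DOWNLOAD_BOOST and WITH_BOOST are real MySQL cmake options (the boost ones only matter for versions that need an external boost), while the install prefix and boost path here are hypothetical.

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release \
    -DDOWNLOAD_BOOST=1 -DWITH_BOOST=/tmp/boost \
    -DCMAKE_INSTALL_PREFIX=/opt/mysql
make -j"$(nproc)" && make install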

The server is a SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM and one m.2 SSD (2TB, ext4 with discard enabled). The OS is Ubuntu 24.04. The CPUs are Intel Xeon Silver 4214R @ 2.40GHz.

The config files are here for 5.6, 5.7, 8.0, 8.4 and 9.x.

Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with the database cached by InnoDB.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 16 clients and 8 tables with 10M rows per table. 
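
My runs use the wrapper scripts from the post linked above, but the shape of a run looks like this sketch using the stock oltp_point_select Lua script that ships with sysbench 1.x. The flag values match the setup above (8 tables, 10M rows per table, 16 clients, 600s for a read-heavy test); the host, user and database names are hypothetical.

sysbench oltp_point_select \
    --mysql-host=127.0.0.1 --mysql-user=root --mysql-db=test \
    --tables=8 --table-size=10000000 --threads=16 prepare

sysbench oltp_point_select \
    --mysql-host=127.0.0.1 --mysql-user=root --mysql-db=test \
    --tables=8 --table-size=10000000 --threads=16 --time=600 run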

The purpose is to search for regressions from new CPU overhead and mutex contention. The workload is cached -- there should be no read IO but will be some write IO.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
When the relative QPS is > 1 then some version is faster than the base version.  When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than the base version.
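
As a concrete example, this sketch computes relative QPS from a hypothetical two-column file (version, QPS) whose first row is the base version:

# qps.txt is hypothetical: one "version QPS" line per version, base first.
# Prints each version and its QPS relative to the base version.
awk 'NR == 1 { base = $2 } { printf "%s\t%.2f\n", $1, $2 / base }' qps.txt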

I present two sets of charts. One set uses MySQL 5.6.51 as the base version which is my standard practice. The other uses MySQL 8.0.44 as the base version to show how performance changes after MySQL 8.0.

Values from iostat and vmstat divided by QPS are here. These can help to explain why something is faster or slower because they show how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o) which are often a proxy for mutex contention.
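
A rough sketch of that normalization, assuming vmstat.txt holds data lines captured with the default "vmstat 1" column layout (cs in column 12, us and sy in columns 13 and 14) and that average QPS for the interval is known; the 50000 here is a made-up value.

qps=50000  # hypothetical average QPS for the measurement interval
awk -v qps="$qps" '
    # header lines have no digits, so this matches only data lines;
    # cpu is us+sy in percent, so cpu/o is CPU utilization per operation
    /[0-9]/ { cs += $12; cpu += $13 + $14; n++ }
    END { printf "cs/o: %.6f  cpu/o: %.6f\n", cs/n/qps, cpu/n/qps }
' vmstat.txt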

The spreadsheet and charts are here and in some cases are easier to read than the charts below. Converting the Google Sheets charts to PNG files does the wrong thing for some of the test names listed at the bottom of the charts below.

Results: point queries

Summary
  • from 5.6 to 5.7 there are big improvements for 5 tests, no changes for 2 tests and small regressions for 2 tests
  • from 5.7 to 8.0 there are big regressions for all tests
  • from 8.0 to 9.5 performance is stable
  • for 9.5 the common result is ~20% less throughput vs 5.6
Using vmstat from the hot-points test to understand the performance changes (see here)
  • context switch rate (cs/o) is stable, mutex contention hasn't changed
  • CPU per query (cpu/o) drops by 35% from 5.6 to 5.7
  • CPU per query (cpu/o) grows by 23% from 5.7 to 8.0
  • CPU per query (cpu/o) is stable from 8.0 through 9.5
Results: range queries without aggregation

Summary
  • from 5.6 to 5.7 throughput drops by 10% to 15%
  • from 5.7 to 8.0 throughput drops by about 15%
  • from 8.0 to 9.5 throughput is stable
  • for 9.5 the common result is ~30% less throughput vs 5.6
Using vmstat from the scan test to understand the performance changes (see here)
  • context switch rates are low and can be ignored
  • CPU per query (cpu/o) grows by 11% from 5.6 to 5.7
  • CPU per query (cpu/o) grows by 15% from 5.7 to 8.0
  • CPU per query (cpu/o) is stable from 8.0 through 9.5
Results: range queries with aggregation

Summary
  • from 5.6 to 5.7 there are big improvements for 2 tests, no changes for 1 test and regressions for 5 tests
  • from 5.7 to 8.0 there are regressions for all tests
  • from 8.0 through 9.5 performance is stable
  • for 9.5 the common result is ~25% less throughput vs 5.6
Using vmstat from the read-only-count test to understand the performance changes (see here)
  • context switch rates are similar
  • CPU per query (cpu/o) grows by 16% from 5.6 to 5.7
  • CPU per query (cpu/o) grows by 15% from 5.7 to 8.0
  • CPU per query (cpu/o) is stable from 8.0 through 9.5
Results: writes

Summary
  • from 5.6 to 5.7 there are big improvements for 9 tests and no changes for 1 test
  • from 5.7 to 8.0 there are regressions for all tests
  • from 8.4 to 9.x there are regressions for 8 tests and no change for 2 tests
  • for 9.5 vs 5.6: 5 are slower in 9.5, 3 are similar and 2 are faster in 9.5
Using vmstat from the insert test to understand the performance changes (see here)
  • in 5.7, CPU per insert drops by 30% while context switch rates are stable vs 5.6
  • in 8.0, CPU per insert grows by 36% while context switch rates are stable vs 5.7
  • in 9.5, CPU per insert grows by 3% while context switch rates grow by 23% vs 8.4
The first chart doesn't truncate the y-axis, which shows the full size of the big improvement for update-index but makes it hard to see the smaller changes on the other tests.
The second chart truncates the y-axis to make it easier to see changes on tests other than update-index.


Wednesday, December 10, 2025

The insert benchmark on a small server : MySQL 5.6 through 9.5

This has results for MySQL versions 5.6 through 9.5 with the Insert Benchmark on a small server. Results for Postgres on the same hardware are here.

tl;dr

  • good news - there are no large regressions after MySQL 8.0
  • bad news - there are many large regressions from 5.6 to 5.7 to 8.0

Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.43, 8.0.44, 8.4.6, 8.4.7, 9.4.0 and 9.5.0.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, 32G of RAM. Storage is one NVMe device for the database using ext4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here.

The config files are here for 5.6.51, 5.7.44, 8.0.4x, 8.4.x and 9.x.0.

The Benchmark

The benchmark is explained here and is run with 1 client and 1 table. I repeated it with two workloads:
  • cached - the values for X, Y, Z are 30M, 40M, 10M
  • IO-bound - the values for X, Y, Z are 800M, 4M, 1M
The point query (qp100, qp500, qp1000) and range query (qr100, qr500, qr1000) steps are run for 1800 seconds each.

The benchmark steps are:

  • l.i0
    • insert X rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts Y rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. A simplified sketch of this step follows the list.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and Z rows are inserted and deleted per table.
    • Wait for S seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
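
This is a simplified sketch of the l.i1 pattern, not the real benchmark client: one connection inserts 50 rows per transaction (one autocommit statement) while a second deletes the 50 oldest rows per transaction at roughly the same rate. The table name, column and iteration count are hypothetical.

(
  for i in $(seq 1 1000); do
    # builds a 50-row multi-value INSERT; each statement is one transaction
    mysql -D test -e "INSERT INTO t (val) VALUES $(printf '(1),%.0s' {1..49})(1);"
  done
) &
(
  for i in $(seq 1 1000); do
    # MySQL allows ORDER BY ... LIMIT on a single-table DELETE
    mysql -D test -e "DELETE FROM t ORDER BY pk LIMIT 50;"
  done
) &
wait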
Results: overview

The performance reports are here for the cached and IO-bound workloads.
The summary sections from the performance reports have 3 tables. The first shows absolute throughput for each DBMS tested and benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result for MySQL 5.6.51.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.

Results: cached

Performance summaries are here for all versions and latest versions. I focus on the latest versions.

Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements. There are large regressions from new CPU overheads.
  • the load step (l.i0) is almost 2X faster for 5.6.51 vs 8.4.7 (relative QPS is 0.59)
  • the create index step (l.x) is more than 2X faster for 8.4.7 vs 5.6.51
  • the first write-only step (l.i1) has similar throughput for 5.6.51 and 8.4.7
  • the second write-only step (l.i2) is 14% slower in 8.4.7 vs 5.6.51
  • the range-query steps (qr*) are ~30% slower in 8.4.7 vs 5.6.51
  • the point-query steps (qp*) are 38% slower in 8.4.7 vs 5.6.51

dbms     l.i0   l.x    l.i1   l.i2   qr100  qp100  qr500  qp500  qr1000 qp1000
5.6.51   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
5.7.44   0.91   1.53   1.16   1.09   0.83   0.83   0.83   0.84   0.83   0.83
8.0.44   0.60   2.42   1.05   0.87   0.69   0.62   0.70   0.62   0.70   0.62
8.4.7    0.59   2.54   1.04   0.86   0.68   0.61   0.68   0.61   0.67   0.60
9.4.0    0.59   2.57   1.03   0.86   0.69   0.62   0.69   0.62   0.70   0.61
9.5.0    0.59   2.61   1.05   0.85   0.69   0.62   0.69   0.62   0.69   0.62

Results: IO-bound

Performance summaries are here for all versions and latest versions. I focus on the latest versions.

Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements. There are large regressions from new CPU overheads.
  • the load step (l.i0) is almost 2X faster for 5.6.51 vs 8.4.7 (relative QPS is 0.60)
  • the create index step (l.x) is more than 2X faster for 8.4.7 vs 5.6.51
  • the first write-only step (l.i1) is 1.54X faster for 8.4.7 vs 5.6.51
  • the second write-only step (l.i2) is 1.82X faster for 8.4.7 vs 5.6.51
  • the range-query steps (qr*) are ~20% slower in 8.4.7 vs 5.6.51
  • the point-query steps (qp*) are 13% slower, 3% slower and 17% faster in 8.4.7 vs 5.6.51
dbms     l.i0   l.x    l.i1   l.i2   qr100  qp100  qr500  qp500  qr1000 qp1000
5.6.51   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
5.7.44   0.91   1.42   1.52   1.78   0.84   0.92   0.87   0.97   0.93   1.17
8.0.44   0.62   2.58   1.56   1.81   0.76   0.88   0.79   0.99   0.85   1.18
8.4.7    0.60   2.65   1.54   1.82   0.74   0.87   0.77   0.98   0.82   1.17
9.4.0    0.61   2.68   1.52   1.76   0.75   0.86   0.80   0.97   0.85   1.16
9.5.0    0.60   2.75   1.53   1.73   0.75   0.87   0.79   0.97   0.84   1.17

The insert benchmark on a small server : Postgres 12.22 through 18.1

This has results for Postgres versions 12.22 through 18.1 with the Insert Benchmark on a small server.

Postgres continues to be boring in a good way. It is hard to find performance regressions.

tl;dr for a cached workload

  • performance has been stable from Postgres 12 through 18
tl;dr for an IO-bound workload
  • performance has mostly been stable
  • create index has been ~10% faster since Postgres 15
  • throughput for the write-only steps has been ~10% less since Postgres 15
  • throughput for the point-query steps (qp*) has been ~20% better since Postgres 13
Builds, configuration and hardware

I compiled Postgres from source using -O2 -fno-omit-frame-pointer for versions 12.22, 13.22, 13.23, 14.19, 14.20, 15.14, 15.15, 16.10, 16.11, 17.6, 17.7, 18.0 and 18.1.
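
A minimal sketch of that build using the flags stated above; the install prefix is hypothetical.

./configure --prefix=/opt/pg CFLAGS="-O2 -fno-omit-frame-pointer"
make -j"$(nproc)" && make install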

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, 32G of RAM. Storage is one NVMe device for the database using ext4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here.

For versions prior to 18 the config files are named conf.diff.cx10a_c8r32, are as similar as possible across versions, and are here for versions 12, 13, 14, 15, 16 and 17.

For Postgres 18 I used 3 variations, which are here and sketched after this list:
  • conf.diff.cx10b_c8r32
    • uses io_method='sync' to match Postgres 17 behavior
  • conf.diff.cx10c_c8r32
    • uses io_method='worker' and io_workers=16 to do async IO via a thread pool. I eventually learned that 16 is too large.
  • conf.diff.cx10d_c8r32
    • uses io_method='io_uring' to do async IO via io_uring
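
The three variations reduce to small config fragments. This sketch writes them out; io_method and io_workers are real Postgres 18 settings and the file names match those above, but the real config files contain more than these lines.

cat > conf.diff.cx10b_c8r32 <<'EOF'
io_method = 'sync'        # match Postgres 17 behavior
EOF

cat > conf.diff.cx10c_c8r32 <<'EOF'
io_method = 'worker'      # async IO via a pool of IO workers
io_workers = 16           # I eventually learned that 16 is too large
EOF

cat > conf.diff.cx10d_c8r32 <<'EOF'
io_method = 'io_uring'    # async IO via io_uring
EOF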
The Benchmark

The benchmark is explained here and is run with 1 client and 1 table. I repeated it with two workloads:
  • cached - the values for X, Y, Z are 30M, 40M, 10M
  • IO-bound - the values for X, Y, Z are 800M, 4M, 1M
The point query (qp100, qp500, qp1000) and range query (qr100, qr500, qr1000) steps are run for 1800 seconds each.

The benchmark steps are:

  • l.i0
    • insert X rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts Y rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and Z rows are inserted and deleted per table.
    • Wait for S seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: overview

The performance reports are here for the cached and IO-bound workloads.
The summary sections from the performance reports have 3 tables. The first shows absolute throughput for each DBMS tested and benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result for Postgres 12.22.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
This statement doesn't apply to this blog post, but I keep it here for copy/paste into future posts. Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

Results: cached

The performance summaries are here for all versions and latest versions.

I focus on the latest versions. Throughput for 18.1 is within 2% of 12.22, with the exception of the l.i2 benchmark step. This is great news because it means that Postgres has avoided introducing new CPU overhead as they improve the DBMS. There is some noise from the l.i2 benchmark step and that doesn't surprise me because it is likely variance from two issues -- vacuum and get_actual_variable_range.

Results: IO-bound

The performance summaries are here for all versions and latest versions.

I focus on the latest versions.
  • throughput for the load step (l.i0) is 1% less in 18.1 vs 12.22
  • throughput for the index step (l.x) is 13% better in 18.1 vs 12.22
  • throughput for the write-only steps (l.i1, l.i2) is 11% and 12% less in 18.1 vs 12.22
  • throughput for the range-query steps (qr*) is 2%, 3% and 3% less in 18.1 vs 12.22
  • throughput for the point-query steps (qp*) is 22%, 23% and 23% better in 18.1 vs 12.22
The improvements for the index step arrived in Postgres 15.

The regressions for the write-only steps arrived in Postgres 15 and are likely from two issues -- vacuum and get_actual_variable_range.

The improvements for the point-query steps arrived in Postgres 13.

Monday, December 8, 2025

RocksDB performance over time on a small Arm server

This post has results for RocksDB on an Arm server. I previously shared results for RocksDB performance using gcc and clang. Here I share results using clang with LTO.

RocksDB is boring, there are few performance regressions.

tl;dr

  • for cached workloads throughput with RocksDB 10.8 is as good or better than with 6.29
  • for not-cached workloads throughput with RocksDB 10.8 is similar to 6.29 except for the overwrite test where it is 7% less, probably from correctness checks added in 7.x and 8.x

Software

I used RocksDB versions 6.29, 7.0, 7.10, 8.0, 8.4, 8.8, 8.11, 9.0, 9.4, 9.8, 9.11 and 10.0 through 10.8.

I compiled each version with clang version 18.3.1 and link-time optimization (LTO) enabled. The build command line was:

flags=( DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 )

# for clang+LTO
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
    make USE_LTO=1 "${flags[@]}" static_lib db_bench

Hardware

I used a small Arm server from the Google cloud running Ubuntu 22.04. The server type was c4a-standard-8-lssd with 8 cores and 32G of RAM. Storage was 2 local SSDs with RAID 0 and ext4.

Benchmark

Overviews on how I use db_bench are here and here.

The benchmark was run with 1 thread and used the LRU block cache.
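
A minimal sketch of one benchmark step under that setup (1 thread, cached database); benchmarks, num, threads, cache_size and db are real db_bench flags, but my real runs use the scripts linked above and the key count, cache size and db path here are hypothetical.

./db_bench --benchmarks=fillseq --num=20000000 --threads=1 \
    --cache_size=$(( 8 * 1024 * 1024 * 1024 )) \
    --db=/data/rocksdb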

Tests were run for three workloads:

  • byrx - database cached by RocksDB
  • iobuf - database is larger than RAM and RocksDB used buffered IO
  • iodir - database is larger than RAM and RocksDB used O_DIRECT

The benchmark steps that I focus on are:
  • fillseq
    • load RocksDB in key order with 1 thread
  • revrangeww, fwdrangeww
    • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries.
  • readww
    • do point queries with a rate-limited writer. Report performance for the point queries.
  • overwrite
    • overwrite (via Put) random keys

Relative QPS

Many of the tables below (inlined and via URL) show the relative QPS which is:
    (QPS for my version / QPS for RocksDB 6.29)

The base version is RocksDB 6.29. When the relative QPS is > 1.0 then my version is faster than RocksDB 6.29. When it is < 1.0 then there might be a performance regression or there might just be noise.

The spreadsheet with numbers and charts is here. Performance summaries are here.

Results: byrx

This has results for the byrx workload where the database is cached by RocksDB.

RocksDB 10.x is faster than 6.29 for all tests.

Results: iobuf

This has results for the iobuf workload where the database is larger than RAM and RocksDB used buffered IO.

Performance in RocksDB 10.x is about the same as 6.29 except for overwrite. I think the performance decreases for overwrite that arrived in versions 7.x and 8.x are from new correctness checks, and throughput in 10.8 is 7% less than in 6.29. The big drop for fillseq in 10.6.2 was from bug 13996.

Results: iodir

This has results for the iodir workload where the database is larger than RAM and RocksDB used O_DIRECT.

Performance in RocksDB 10.x is about the same as 6.29 except for overwrite. I think the performance decreases for overwrite that arrived in versions 7.x and 8.x are from new correctness checks, and throughput in 10.8 is 7% less than in 6.29. The big drop for fillseq in 10.6.2 was from bug 13996.

Monday, December 1, 2025

Using db_bench to measure RocksDB performance with gcc and clang

This has results for db_bench, a benchmark for RocksDB, when compiling it with gcc and clang. On one of my servers I saw a regression on one of the tests (fillseq) when compiling with gcc. The result on that server didn't match what I measured on two other servers. So I repeated tests after compiling with clang to see if I could reproduce it.

tl;dr

  • a common outcome is
    • ~10% more QPS with clang+LTO than with gcc
    • ~5% more QPS with clang than with gcc
  • the performance gap between clang and gcc is larger in RocksDB 10.x than in earlier versions

Variance

I always worry about variance when I search for performance bugs. Variance can be misinterpreted as a performance regression and I strive to avoid that because I don't want to file bogus performance bugs.

Possible sources of variance are:

  • the compiler toolchain
    • a bad code layout might hurt performance by increasing cache and TLB misses
  • RocksDB
    • the overhead from compaction is intermittent and the LSM tree layout can help or hurt CPU overhead during reads
  • hardware
    • sources include noisy neighbors on public cloud servers, insufficient CPU cooling and CPU frequency management that is too clever
  • benchmark client
    • the way in which I run tests can create more or less variance and more information on that is here and here

Software

I used RocksDB versions 6.29.5, 7.10.2, 8.0, 8.4, 8.8, 8.11, 9.0, 9.4, 9.8, 9.11 and 10.0 through 10.8.

I compiled each version three times:

  • gcc - using version 13.3.0
  • clang - using version 18.3.1
  • clang+LTO - using version 18.3.1, where LTO is link-time optimization
The build command lines are below:

flags=( DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 )

# for gcc
make "${flags[@]}" static_lib db_bench

# for clang
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
    make "${flags[@]}" static_lib db_bench

# for clang+LTO
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
    make USE_LTO=1 "${flags[@]}" static_lib db_bench

On the small servers I used the LRU block cache. On the large server I used hyper clock when possible (a flag sketch follows this list):
  • lru_cache was used for versions 7.6 and earlier
  • hyper_clock_cache was used for versions 7.7 through 8.5
  • auto_hyper_clock_cache was used for versions 8.5+
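
A sketch of how the block cache choice maps to a db_bench flag, assuming the cache_type flag and value names used by db_bench in these versions; the key count is hypothetical.

./db_bench --benchmarks=readrandom --num=10000000 --threads=36 \
    --cache_type=auto_hyper_clock_cache   # for 8.5+
# use --cache_type=hyper_clock_cache for 7.7 through 8.5
# and --cache_type=lru_cache for 7.6 and earlier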

Hardware

I used two small servers and one large server, all running Ubuntu 22.04:

  • pn53
    • Ryzen 7 (AMD) CPU with 8 cores and 32G of RAM. It is v5 in the blog post
    • benchmarks are run with 1 client (thread)
  • arm
    • an Arm server from the Google cloud -- c4a-standard-8-lssd with 8 cores and 32G of RAM, 2 local SSDs using RAID 0 and ext4
    • benchmarks are run with 1 client (thread)
  • hetzner
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-core processor with SMT disabled, 128G of RAM, 2 SSDs with RAID 1 (3.8T each) using ext4
    • benchmarks are run with 36 clients (threads)

Benchmark

Overviews on how I use db_bench are here and here.

Tests were run for a workload with the database cached by RocksDB that I call byrx in my scripts.

The benchmark steps that I focus on are:
  • fillseq
    • load RocksDB in key order with 1 thread
  • revrangeww, fwdrangeww
    • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries.
  • readww
    • do point queries with a rate-limited writer. Report performance for the point queries.
  • overwrite
    • overwrite (via Put) random keys

Relative QPS

Many of the tables below (inlined and via URL) show the relative QPS which is:
    (QPS for my version / QPS for RocksDB 6.29 compiled with gcc)

The base version varies and is listed below. When the relative QPS is > 1.0 then my version is faster than the base version. When it is < 1.0 then there might be a performance regression or there might just be noise.

The spreadsheet with numbers and charts is here.

Results: fillseq

Results for the pn53 server
  • clang+LTO provides ~15% more QPS than gcc in RocksDB 10.8
  • clang provides ~11% more QPS than gcc in RocksDB 10.8
Results for the Arm server
  • I am fascinated by how stable the QPS is here for clang and clang+LTO
  • clang+LTO and clang provide ~3% more QPS than gcc in RocksDB 10.8
Results for the Hetzner server
  • I don't show results for 6.29 or 7.x to improve readability
  • the performance for RocksDB 10.8.3 with gcc is what motivated me to repeat tests with clang
  • clang+LTO and clang provide ~20% more QPS than gcc in RocksDB 10.8

Results: revrangeww

Results for the pn53 server
  • clang+LTO provides ~9% more QPS than gcc in RocksDB 10.8
  • clang provides ~6% more QPS than gcc in RocksDB 10.8
Results for the Arm server
  • clang+LTO provides ~11% more QPS than gcc in RocksDB 10.8
  • clang provides ~6% more QPS than gcc in RocksDB 10.8
Results for the Hetzner server
  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~8% more QPS than gcc in RocksDB 10.8
  • clang provides ~3% more QPS than gcc in RocksDB 10.8

Results: fwdrangeww

Results for the pn53 server
  • clang+LTO provides ~9% more QPS than gcc in RocksDB 10.8
  • clang provides ~4% more QPS than gcc in RocksDB 10.8
Results for the Arm server
  • clang+LTO provides ~13% more QPS than gcc in RocksDB 10.8
  • clang provides ~7% more QPS than gcc in RocksDB 10.8
Results for the Hetzner server
  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~4% more QPS than gcc in RocksDB 10.8
  • clang provides ~1% more QPS than gcc in RocksDB 10.8

Results: readww

Results for the pn53 server
  • clang+LTO provides ~6% more QPS than gcc in RocksDB 10.8
  • clang provides ~5% less QPS than gcc in RocksDB 10.8
Results for the Arm server
  • clang+LTO provides ~14% more QPS than gcc in RocksDB 10.8
  • clang provides ~2% more QPS than gcc in RocksDB 10.8
Results for the Hetzner server
  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~4% more QPS than gcc in RocksDB 10.8
  • clang provides ~1% more QPS than gcc in RocksDB 10.8

Results: overwrite

Results for the pn53 server
  • clang+LTO provides ~6% less QPS than gcc in RocksDB 10.8
  • clang provides ~8% less QPS than gcc in RocksDB 10.8
  • but for most versions there is similar QPS for gcc, clang and clang+LTO
Results for the Arm server
  • QPS is similar for gcc, clang and clang+LTO
Results for the Hetzner server
  • I don't show results for 6.29 or 7.x to improve readability
  • clang+LTO provides ~2% more QPS than gcc in RocksDB 10.8
  • clang provides ~1% more QPS than gcc in RocksDB 10.8

Saturday, November 29, 2025

Using sysbench to measure how Postgres performance changes over time, November 2025 edition

This has results for the sysbench benchmark on a small and big server for Postgres versions 12 through 18. Once again, Postgres is boring because I search for perf regressions and can't find any here. Results from MySQL are here and MySQL is not boring.

While I don't show the results here, I don't see regressions when comparing the latest point releases with their predecessors -- 13.22 vs 13.23, 14.19 vs 14.20, 15.14 vs 15.15, 16.10 vs 16.11, 17.6 vs 17.7 and 18.0 vs 18.1.

tl;dr

  • a few small regressions
  • many more small improvements
  • for write-heavy tests at high-concurrency there are many large improvements starting in PG 17

Builds, configuration and hardware

I compiled Postgres from source for versions 12.22, 13.22, 13.23, 14.19, 14.20, 15.14, 15.15, 16.10, 16.11, 17.6, 17.7, 18.0 and 18.1.

I used two servers:
  • small
    • an ASUS ExpertCenter PN53 with AMD Ryzen 7735HS CPU, 32G of RAM, 8 cores with AMD SMT disabled, Ubuntu 24.04 and an NVMe device with ext4 and discard enabled
  • big
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-core processor with SMT disabled
    • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
    • 128G RAM
    • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)
Configuration files for the small server:
  • Configuration files are here for Postgres versions 12, 13, 14, 15, 16 and 17.
  • For Postgres 18 I used io_method=sync and the configuration file is here.
Configuration files for the big server:
  • Configuration files are here for Postgres versions 12, 13, 14, 15, 16 and 17.
  • For Postgres 18 I used io_method=sync and the configuration file is here.
Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with the database cached by Postgres.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. On the small server the benchmark is run with 1 client and 1 table with 50M rows. On the big server the benchmark is run with 12 clients and 8 tables with 10M rows per table.

The purpose is to search for regressions from new CPU overhead and mutex contention. I use the small server with low concurrency to find regressions from new CPU overheads and then larger servers with high concurrency to find regressions from new CPU overheads and mutex contention.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation.

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for Postgres 12.22)
When the relative QPS is > 1 then some version is faster than Postgres 12.22. When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than Postgres 12.22.

Values from iostat and vmstat divided by QPS are here for the small server and the big server. These can help to explain why something is faster or slower because they show how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o) which are often a proxy for mutex contention.

Results: point queries

This is from the small server.
  • a large improvement arrived in Postgres 17 for the hot-points test
  • otherwise results have been stable from 12.22 through 18.1
This is from the big server.
  • a large improvement arrived in Postgres 17 for the hot-points test
  • otherwise results have been stable from 12.22 through 18.1
Results: range queries without aggregation

This is from the small server.
  • there are small improvements for the scan test
  • otherwise results have been stable from 12.22 through 18.1
This is from the big server.
  • there are small improvements for the scan test
  • otherwise results have been stable from 12.22 through 18.1
Results: range queries with aggregation

This is from the small server.
  • there are small improvements for a few tests
  • otherwise results have been stable from 12.22 through 18.1
This is from the big server.
  • there might be small regressions for a few tests
  • otherwise results have been stable from 12.22 through 18.1
Results: writes

This is from the small server.
  • there are small improvements for most tests
  • otherwise results have been stable from 12.22 through 18.1
This is from the big server.
  • there are large improvements for half of the tests
  • otherwise results have been stable from 12.22 through 18.1
From vmstat results for update-index the per-operation CPU overhead and context switch rate are much smaller starting in Postgres 17.7. The CPU overhead is about 70% of what it was in 16.11 and the context switch rate is about 50% of the rate for 16.11. Note that context switch rates are often a proxy for mutex contention.

Friday, November 28, 2025

Using sysbench to measure how MySQL performance changes over time, November 2025 edition

This has results for the sysbench benchmark on a small and big server for MySQL versions 5.6 through 9.5. The good news is that the arrival rate of performance regressions has mostly stopped as of 8.0.43. The bad news is that there were large regressions from 5.6 through 8.0.

tl;dr for low-concurrency tests

  • for point queries
    • MySQL 5.7.44 gets about 10% less QPS than 5.6.51
    • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • for range queries without aggregation
    • MySQL 5.7.44 gets about 15% less QPS than 5.6.51
    • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • for range queries with aggregation
    • MySQL 5.7.44 is faster than 5.6.51 for two tests, as fast for one and gets about 15% less QPS for the other five
    • MySQL 8.0 to 9.5 are faster than 5.6.51 for one test, as fast for one and get about 30% less QPS for the other six
  • for writes
    • MySQL 5.7.44 gets between 10% and 20% less QPS than 5.6.51 for most tests
    • MySQL 8.0 to 9.5 get between 40% and 50% less QPS than 5.6.51 for most tests
tl;dr for high-concurrency tests
  • for point queries
    • for most tests MySQL 5.7 to 9.5 get at least 1.5X more QPS than 5.6.51
    • for tests that use secondary indexes MySQL 5.7 to 9.5 get about 25% less QPS than 5.6.51
  • for range queries without aggregation
    • MySQL 5.7.44 gets about 10% less QPS than 5.6.51
    • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • for range queries with aggregation
    • MySQL 5.7.44 is faster than 5.6.51 for six tests, as fast for one test and gets about 20% less QPS for one test
    • MySQL 8.0 to 9.5 are a lot faster than 5.6.51 for two tests, about as fast for three tests and get between 10% and 30% less QPS for the other three tests
  • for writes
    • MySQL 5.7.44 gets more QPS than 5.6.51 for all tests
    • MySQL 8.0 to 9.5 get more QPS than 5.6.51 for all tests

Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.43, 8.0.44, 8.4.6, 8.4.7, 9.4.0 and 9.5.0.

I used two servers:
  • small
    • an ASUS ExpertCenter PN53 with AMD Ryzen 7735HS CPU, 32G of RAM, 8 cores with AMD SMT disabled, Ubuntu 24.04 and an NVMe device with ext4 and discard enabled
  • big
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-core processor with SMT disabled
    • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
    • 128G RAM
    • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)
The config files are here for 5.6, 5.7, 8.0, 8.4 and 9.x.
Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with the database cached by InnoDB.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. On the small server the benchmark is run with 1 client and 1 table with 50M rows. On the big server the benchmark is run with 40 clients and 8 tables with 10M rows per table.

The purpose is to search for regressions from new CPU overhead and mutex contention. I use the small server with low concurrency to find regressions from new CPU overheads and then larger servers with high concurrency to find regressions from new CPU overheads and mutex contention.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation.

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for MySQL 5.6.51)
When the relative QPS is > 1 then some version is faster than MySQL 5.6.51. When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than MySQL 5.6.51.

Values from iostat and vmstat divided by QPS are here for the small server and the big server. These can help to explain why something is faster or slower because they show how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o) which are often a proxy for mutex contention.

The spreadsheet and charts are here and in some cases are easier to read than the charts below. Converting the Google Sheets charts to PNG files does the wrong thing for some of the test names listed at the bottom of the charts below.

Results: point queries

This is from the small server.
  • MySQL 5.7.44 gets about 10% less QPS than 5.6.51
  • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • There are few regressions after MySQL 8.0
  • New CPU overheads explain the regressions. See the vmstat results for the hot-points test.
This is from the large server.
  • For most point query tests MySQL 5.7 to 9.5 get at least 1.5X more QPS than 5.6.51
    • MySQL 5.7 to 9.5 use less CPU, see the vmstat results for the hot-points test
  • For tests that use secondary indexes (*-si) MySQL 5.7 to 9.5 get about 25% less QPS than 5.6.51.
    • This result is similar to what happens on the small server above.
    • The regressions are from extra CPU overhead, see the vmstat results
  • MySQL 5.7 does better than 8.0 to 9.5. There are few regressions after MySQL 8.0.
Results: range queries without aggregation

This is from the small server.
  • MySQL 5.7.44 gets about 15% less QPS than 5.6.51
  • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • There are few regressions after MySQL 8.0
  • New CPU overheads explain the regressions. See the vmstat results for the scan test.
This is from the large server.
  • MySQL 5.7.44 gets about 10% less QPS than 5.6.51
  • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • There are few regressions after MySQL 8.0
  • New CPU overheads explain the regressions. See the vmstat results for the scan test.
Results: range queries with aggregation

This is from the small server.
  • for the read-only-distinct test, MySQL 5.7 to 9.5 are faster than 5.6.51
  • for the read-only_range=X tests
    • with the longest range scan (*_range=10000), MySQL 5.7.44 is faster than 5.6.51 and 8.0 to 9.5 have the same QPS as 5.6.51
    • with shorter range scans (*_range=100 & *_range=10) MySQL 5.6.51 is faster than 5.7 to 9.5. This implies that the regressions are from code above the storage engine layer.
    • From vmstat results the perf differences are explained by CPU overheads
  • for the other tests
    • MySQL 5.7.44 gets about 15% less QPS than 5.6.51
    • MySQL 8.0 to 9.5 get about 30% less QPS than 5.6.51
    • From vmstat results for read-only-count the reason is new CPU overhead
This is from the large server.
  • for the read-only-distinct test, MySQL 5.7 to 9.5 are faster than 5.6.51
  • for the read-only_range=X tests
    • MySQL 5.7.44 is as fast as 5.6.51 for the longest range scan and faster than 5.6.51 for the shorter range scans
    • MySQL 8.0 to 9.5 are much faster than 5.6.51 for the longest range scan and somewhat faster for the shorter range scans
    • From vmstat results the perf differences are explained by CPU overheads and possibly by changes in mutex contention
  • for the other tests
    • MySQL 5.7.44 gets about 20% less QPS than 5.6.51 for read-only-count and about 10% more QPS than 5.6.51 for read-only-simple and read-only-sum
    • MySQL 8.0 to 9.5 get about 30% less QPS than 5.6.51 for read-only-count and up to 20% less QPS than 5.6.51 for read-only-simple and read-only-sum
    • From vmstat results for read-only-count the reason is new CPU overhead
Results: writes

This is from the small server.
  • For most tests
    • MySQL 5.7.44 gets between 10% and 20% less QPS than 5.6.51
    • MySQL 8.0 to 9.5 get between 40% and 50% less QPS than 5.6.51
    • From vmstat results for the insert test, MySQL 5.7 to 9.5 use a lot more CPU
  • For the update-index test
    • MySQL 5.7.44 is faster than 5.6.51
    • MySQL 8.0 to 9.5 get about 10% less QPS than 5.6.51
    • From vmstat metrics MySQL 5.6.51 has more mutex contention
  • For the update-inlist test
    • MySQL 5.7.44 is as fast as 5.6.51
    • MySQL 8.0 to 9.5 get about 30% less QPS than 5.6.51
    • From vmstat metrics MySQL 5.6.51 has more mutex contention
This is from the large server and the chart truncates the y-axis for the update-index result to improve readability for the other results.
  • For all tests MySQL 5.7 to 9.5 get more QPS than 5.6.51
    • From vmstat results for the write-only test MySQL 5.6.51 uses more CPU and has more mutex contention.
  • For some tests (read-write_range=X) MySQL 8.0 to 9.5 get less QPS than 5.7.44
    • These are the classic sysbench transactions with different range scan lengths and the performance is dominated by the range query response time, thus 5.7 is fastest.
  • For most tests MySQL 5.7 to 9.5 have similar perf with two exceptions
    • For the delete test, MySQL 8.0 to 9.5 are faster than 5.7. From vmstat metrics 5.7 uses more CPU and has more mutex contention than 8.0 to 9.5.
    • For the update-inlist test, MySQL 8.0 to 9.5 are faster than 5.7. From vmstat metrics 5.7 uses more CPU than 8.0 to 9.5.
This is also from the large server and does not truncate the update-index test result.
