
Thursday, December 11, 2025

Sysbench for MySQL 5.6 through 9.5 on a 2-socket, 24-core server

This has results for the sysbench benchmark on a 2-socket, 24-core server. A post with results from 8-core and 32-core servers is here.

tl;dr

  • old bad news - there were many large regressions from 5.6 to 5.7 to 8.0
  • new bad news - there are some new regressions after MySQL 8.0
Normally I claim that there are few regressions after MySQL 8.0, but that isn't the case here. I also see regressions after MySQL 8.0 on the other larger servers that I use, but that topic will be explained in another post.

Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.43, 8.0.44, 8.4.6, 8.4.7, 9.4.0 and 9.5.0.

The server is a SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, one m.2 SSD (2TB, ext4 with discard enabled). The OS is Ubuntu 24.04. The CPUs are Intel Xeon Silver 4214R @ 2.40GHz.

The config files are here for 5.6, 5.7, 8.0, 8.4 and 9.x.

Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with the database cached by InnoDB.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 16 clients and 8 tables with 10M rows per table. 

The purpose is to search for regressions from new CPU overhead and mutex contention. The workload is cached -- there should be no read IO but will be some write IO.
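
For reference, a hypothetical command line for this setup, following the same r.sh argument pattern shown in the other posts below (tables, rows per table, read-heavy seconds, write-heavy seconds, storage device, then flags and the client count). The device name and the middle flags are placeholders, not the exact invocation used here:

  bash r.sh 8 10000000 600 900 nvme0n1 1 1 16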

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
When the relative QPS is > 1 then some version is faster than the base version.  When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than the base version.

I present two sets of charts. One set uses MySQL 5.6.51 as the base version, which is my standard practice. The other uses MySQL 8.0.44 as the base version to make the changes after MySQL 8.0 easier to see.

Values from iostat and vmstat divided by QPS are here. These can help to explain why something is faster or slower because they show how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o), which are often a proxy for mutex contention.
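
Roughly, and as a sketch only since my exact normalization script isn't shown here, those columns mean:

  cs/o  ~= (average context switches per second from vmstat) / QPS
  cpu/o ~= (average us+sy from vmstat, scaled to CPU time) / QPS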

The spreadsheet and charts are here and in some cases are easier to read than the charts below. Converting the Google Sheets charts to PNG files does the wrong thing for some of the test names listed at the bottom of the charts below.

Results: point queries

Summary
  • from 5.6 to 5.7 there are big improvements for 5 tests, no changes for 2 tests and small regressions for 2 tests
  • from 5.7 to 8.0 there are big regressions for all tests
  • from 8.0 to 9.5 performance is stable
  • for 9.5 the common result is ~20% less throughput vs 5.6
Using vmstat from the hot-points test to understand the performance changes (see here)
  • context switch rate (cs/o) is stable, mutex contention hasn't changed
  • CPU per query (cpu/o) drops by 35% from 5.6 to 5.7
  • CPU per query (cpu/o) grows by 23% from 5.7 to 8.0
  • CPU per query (cpu/o) is stable from 8.0 through 9.5
Results: range queries without aggregation

Summary
  • from 5.6 to 5.7 throughput drops by 10% to 15%
  • from 5.7 to 8.0 throughput drops by about 15%
  • from 8.0 to 9.5 throughput is stable
  • for 9.5 the common result is ~30% less throughput vs 5.6
Using vmstat from the scan test to understand the performance changes (see here)
  • context switch rates are low and can be ignored
  • CPU per query (cpu/o) grows by 11% from 5.6 to 5.7
  • CPU per query (cpu/o) grows by 15% from 5.7 to 8.0
  • CPU per query (cpu/o) is stable from 8.0 through 9.5
Results: range queries with aggregation

Summary
  • from 5.6 to 5.7 there are big improvements for 2 tests, no changes for 1 test and regressions for 5 tests
  • from 5.7 to 8.0 there are regressions for all tests
  • from 8.0 through 9.5 performance is stable
  • for 9.5 the common result is ~25% less throughput vs 5.6
Using vmstat from the read-only-count test to understand the performance changes (see here)
  • context switch rates are similar
  • CPU per query (cpu/o) grows by 16% from 5.6 to 5.7
  • CPU per query (cpu/o) grows by 15% from 5.7 to 8.0
  • CPU per query (cpu/o) is stable from 8.0 through 9.5
Results: writes

Summary
  • from 5.6 to 5.7 there are big improvements for 9 tests and no changes for 1 test
  • from 5.7 to 8.0 there are regressions for all tests
  • from 8.4 to 9.x there are regressions for 8 tests and no change for 2 tests
  • for 9.5 vs 5.6: 5 are slower in 9.5, 3 are similar and 2 are faster in 9.5
Using vmstat from the insert test to understand the performance changes (see here)
  • in 5.7, CPU per insert drops by 30% while context switch rates are stable vs 5.6
  • in 8.0, CPU per insert grows by 36% while context switch rates are stable vs 5.7
  • in 9.5, CPU per insert grows by 3% while context switch rates grow by 23% vs 8.4
The first chart doesn't truncate the y-axis, which shows the big improvement for update-index but makes it hard to see the smaller changes on the other tests.
This chart truncates the y-axis to make it easier to see changes on tests other than update-index.


Monday, August 25, 2025

MySQL 5.6 thru 9.4: small server, Insert Benchmark

This has results for the Insert Benchmark on a small server with InnoDB from MySQL 5.6 through 9.4. The workload here uses low concurrency (1 client), a small server and a cached database. I run it this way to look for CPU regressions before moving on to IO-bound workloads with high concurrency.

tl;dr

  • good news - there are no large regressions after MySQL 8.0
  • bad news - there are large regressions from MySQL 5.6 to 5.7 to 8.0
    • load in 8.0, 8.4 and 9.4 gets about 60% of the throughput vs 5.6
    • queries in 8.0, 8.4 and 9.4 get between 60% and 70% of the throughput vs 5.6

Builds, configuration and hardware

I compiled MySQL 5.6.51, 5.7.44, 8.0.43, 8.4.6 and 9.4.0 from source.

The server is an ASUS PN53 with 8 cores, AMD SMT disabled and 32G of RAM. The OS is Ubuntu 24.04. Storage is 1 NVMe device with ext4. More details on it are here.

I used the cz12a_c8r32 config file (my.cnf) which is here for 5.6.51, 5.7.44, 8.0.43, 8.4.6 and 9.4.0.

The Benchmark

The benchmark is explained here. I recently updated the benchmark client to connect via socket rather than TCP so that I can get non-SSL connections for all versions tested. AFAIK, with TCP I can only get SSL connections for MySQL 8.4 and 9.4.
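
My benchmark client is not the mysql CLI, but the same idea with the stock client looks like this (the socket path below is a placeholder):

  mysql --protocol=SOCKET --socket=/path/to/mysql.sock --user=root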

The workload uses 1 client, 1 table with 30M rows and a cached database.

The benchmark steps are:

  • l.i0
    • insert 30 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts 40 million rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and 10 million rows are inserted and deleted per table.
    • Wait for N seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of N is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: overview

The performance report is here.

The summary section has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA. The summary section is here.

Below I use relative QPS (rQPS) to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from MySQL 5.6.51.

When rQPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. When it is 0.90 then I claim there is a 10% regression. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.

Results: details

This table is a copy of the second table in the summary section. It lists the relative QPS (rQPS) for each benchmark step where rQPS is explained above.

The benchmark steps are explained above, they are:
  • l.i0 - initial load in PK order
  • l.x - create 3 secondary indexes per table
  • l.i1, l.i2 - random inserts and random deletes
  • qr100, qr500, qr1000 - short range queries with background writes
  • qp100, qp500, qp1000 - point queries with background writes

dbms    l.i0    l.x     l.i1    l.i2    qr100   qp100   qr500   qp500   qr1000  qp1000
5.6.51  1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
5.7.44  0.89    1.52    1.14    1.08    0.83    0.84    0.83    0.84    0.84    0.84
8.0.43  0.60    2.50    1.04    0.86    0.69    0.62    0.69    0.63    0.70    0.62
8.4.6   0.60    2.53    1.03    0.86    0.68    0.61    0.67    0.61    0.68    0.61
9.4.0   0.60    2.53    1.03    0.87    0.70    0.63    0.70    0.63    0.70    0.62



The summary is:
  • l.i0
    • there are large regressions starting in 8.0 and modern MySQL only gets ~60% of the throughput relative to 5.6 because modern MySQL has more CPU overhead
  • l.x
    • I ignore this but there have been improvements
  • l.i1, l.i2
    • there was a large improvement in 5.7 but new CPU overhead since 8.0 reduces that
  • qr100, qr500, qr1000
    • there are large regressions from 5.6 to 5.7 and then again from 5.7 to 8.0
    • throughput in modern MySQL is ~60% to 70% of what it was in 5.6


    Thursday, August 21, 2025

    Sysbench for MySQL 5.6 thru 9.4 on a small server

    This has performance results for InnoDB from MySQL 5.6.51, 5.7.44, 8.0.43, 8.4.6 and 9.4.0 on a small server with sysbench microbenchmarks. The workload here is cached by InnoDB and my focus is on regressions from new CPU overheads. This work was done by Small Datum LLC and not sponsored. 

    tl;dr

    • Low concurrency (1 client) is the worst case for regressions in modern MySQL
    • MySQL 8.0, 8.4 and 9.4 are much slower than 5.6.51 in all but 2 of the 32 microbenchmarks
      • The bad news - performance regressions aren't getting fixed
      • The good news - regressions after MySQL 8.0 are small

    Builds, configuration and hardware

    I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.43, 8.4.6 and 9.4.0.

    The server is an ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, 32G RAM and an m.2 device for the database. More details on it are here. The OS is Ubuntu 24.04 and the database filesystem is ext4 with discard enabled.

    The my.cnf files are here.

    Benchmark

    I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by InnoDB.

    The tests are run using 1 table with 50M rows. The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.

    Results

    All files I saved from the benchmark are here and the spreadsheet is here.

    The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

    I provide charts below with relative QPS. The relative QPS is the following:
    (QPS for some version) / (QPS for MySQL 5.6.51)
    When the relative QPS is > 1 then some version is faster than 5.6.51. When it is < 1 then there might be a regression. Values from iostat and vmstat divided by QPS are also provided here. These can help to explain why something is faster or slower because they show how much HW is used per request.

    Results: point queries

    Based on results from vmstat the regressions are from new CPU overheads.
    Results: range queries without aggregation

    Based on results from vmstat the regressions are from new CPU overheads.
    Results: range queries with aggregation

    Based on results from vmstat the regressions are from new CPU overheads.
    Results: writes

    Based on results from vmstat the regressions are from new CPU overheads.


    Thursday, May 1, 2025

    The impact of innodb_doublewrite_pages in MySQL 8.0.41

    After reading a blog post from JFG on changes to innodb_doublewrite_pages and bug 111353, I wanted to understand the impact from that on the Insert Benchmark using a large server.

    I test the impact from:

    • using a larger (non-default) value for innodb_doublewrite_pages
    • disabling the doublewrite buffer

    tl;dr

    • Using a larger value for innodb_doublewrite_pages improves QPS by up to 10%
    • Disabling the InnoDB doublewrite buffer is great for performance, but bad for durability. I don't suggest you do this in production.

    Builds, configuration and hardware

    I compiled upstream MySQL 8.0.41 from source.

    The server is an ax162-s from Hetzner with 48 cores (AMD EPYC 9454P), 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 using SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

    The MySQL configuration files are:
    • cz11a_c32r128 - the base configuration file that does not set innodb_doublewrite_pages and gets innodb_doublewrite_pages=8
    • cz11e_c32r128 - adds innodb_doublewrite_pages=128 to the base config
    • cz11f_c32r128 - adds innodb_doublewrite=0 to the base config (disables doublewrite)
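
    As a sketch of what those deltas look like in my.cnf (the real cz11* files are linked above; placement under [mysqld] is how I would write it, not a quote from those files):

      # cz11e_c32r128: base config plus a larger doublewrite batch
      [mysqld]
      innodb_doublewrite_pages=128

      # cz11f_c32r128: base config with doublewrite disabled -- faster but bad for durability
      [mysqld]
      innodb_doublewrite=0
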
    The Benchmark

    The benchmark is explained here and is run with 20 clients and a table per client with an IO-bound workload. The database is larger than memory with 200M rows per table and 20 tables.

    The benchmark steps are:

    • l.i0
      • insert 200 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts 4M rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and 1M rows are inserted and deleted per table.
      • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
    • qr100
      • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
    Results: overview

    The performance report is here.

    The summary section in the performance report has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

    Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result with the cz11e_c32r128 or cz11f_c32r128 configs and $base is the result from the cz11a_c32r128 config. The configs are explained above: cz11e_c32r128 increases innodb_doublewrite_pages and cz11f_c32r128 disables the doublewrite buffer.

    When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

    Results: more IO-bound

    The performance summary is here.

    From the cz11e_c32r128 config that increases innodb_doublewrite_pages to 128:
    • the impact on write-heavy steps is mixed: create index was ~7% slower and l.i2 was ~10% faster
    • the impact on range query + write steps is positive but small. The improvements were 0%, 0% and 4%. Note that these steps are not as IO-bound as point query + write steps and the range queries do ~0.3 reads per query (see here).
    • the impact on point query + write steps is positive and larger. The improvements were 3%, 8% and 9%. These benchmark steps are much more IO-bound than the steps that do range queries.
    From the cz11f_c32r128 config that disables the InnoDB doublewrite buffer:
    • the impact on write-heavy steps is large -- from 1% to 36% faster.
    • the impact on range query + write steps is positive but small. The improvements were 0%, 2% and 15%. Note that these steps are not as IO-bound as point query + write steps and the range queries do ~0.3 reads per query (see here).
    • the impact on point query + write steps is positive and larger. The improvements were 14%, 41% and 42%.

    Saturday, April 19, 2025

    Battle of the Mallocators: part 2

    This post addresses some of the feedback I received from my previous post on the impact of the malloc library when using RocksDB and MyRocks. Here I test:

    • MALLOC_ARENA_MAX with glibc malloc
      • see here for more background on MALLOC_ARENA_MAX. By default glibc can use too many arenas for some workloads (8 X number_of_CPU_cores, which would be 384 on this 48-core server) so I tested it with 1, 8, 48 and 96 arenas.
    • compiling RocksDB and MyRocks with jemalloc specific code enabled
      • In my previous results I just set malloc-lib in my.cnf which uses LD_LIBRARY_PATH to link with your favorite malloc library implementation.
    tl;dr: jemalloc
    • For mysqld with jemalloc enabled via malloc-lib (LD_LIBRARY_PATH) versus mysqld with jemalloc specific code enabled
      • performance, VSZ and RSS were similar
    • After setting rocksdb_cache_dump=0 in the binary with jemalloc specific code
      • performance is slightly better (excluding the outlier, the benefit is up to 3%)
      • peak VSZ is cut in half
      • peak RSS is reduced by ~9%
    tl;dr: glibc malloc on a 48-core server
    • With 1 arena performance is lousy but the RSS bloat is mostly solved
    • With 8, 48 or 96 arenas the RSS bloat is still there
    • With 48 arenas there are still significant (5% to 10%) performance drops
    • With 96 arenas the performance drop was mostly ~2%
    Building MyRocks with jemalloc support

    This was harder than I expected. The first step was easy -- I added these to the CMake command line, the first is for MyRocks and the second is for RocksDB. When the first is set then HAVE_JEMALLOC is defined in config.h. When the second is set then ROCKSDB_JEMALLOC is defined on the compiler command line.

      -DHAVE_JEMALLOC=1
      -DWITH_JEMALLOC=1
    The hard part is that there were linker errors for unresolved symbols -- the open-source build was broken. The fix that worked for me is here. I removed libunwind.so and added libjemalloc.so in its place.
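
    One way to sanity-check the build (a sketch, not part of my build scripts; the mysqld path is an assumption):

      grep HAVE_JEMALLOC config.h         # run from the build directory, should show the define
      ldd bin/mysqld | grep -i jemalloc   # should list libjemalloc.so when it is linked in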

    Running mysqld with MALLOC_ARENA_MAX

    I wasn't sure if it was sufficient for me to set an environment variable when invoking mysqld_safe, so I just edited the mysqld_safe script to do that for me:

    182a183,184
    >   cmd="MALLOC_ARENA_MAX=1 $cmd"
    >   echo Run :: $cmd
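
    For reference, the environment-variable approach that I wasn't sure about would look something like this (a sketch, not verified here; the open question is whether mysqld_safe passes the value through to mysqld):

      MALLOC_ARENA_MAX=1 bin/mysqld_safe --defaults-file=/path/to/my.cnf &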

    Results: jemalloc

    The jemalloc specific code in MyRocks and RocksDB is useful but most of it is not there to boost performance. The jemalloc specific code most likely to boost performance is here in MyRocks and is enabled when rocksdb_cache_dump=0 is added to my.cnf.

    Results are here for 3 setups:
    • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128
      • This is the base case in the table below
      • this is what I used in my previous post and jemalloc is enabled via setting malloc-lib in my.cnf which uses LD_LIBRARY_PATH
    • fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
      • This is col-1 in the table below
      • MySQL with jemalloc specific code enabled at compile time
    • fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128
      • This is col-2 in the table below
      • MySQL with jemalloc specific code enabled at compile time and rocksdb_cache_dump=0 added to my.cnf
    These results use the relative QPS, which is the following where the base case is the binary with jemalloc enabled via malloc-lib (the first binary listed above). When this value is larger than 1.0 then QPS is better than the base case.
    (QPS for col-1 or col-2) / (QPS for the base case)
    From the results below:
    • results in col-1 are similar to the base case. So compiling in the jemalloc specific code didn't help performance.
    • results in col-2 are slightly better than the base case with one outlier (hot-points). So consider setting rocksdb_cache_dump=0 in my.cnf after compiling in jemalloc specific code.
    Relative to: fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128

    col-1 : fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
    col-2 : fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128

    col-1   col-2
    0.92    1.40    hot-points_range=100
    1.00    1.01    point-query_range=100
    1.01    1.02    points-covered-pk_range=100
    0.94    1.03    points-covered-si_range=100
    1.01    1.02    points-notcovered-pk_range=100
    0.98    1.02    points-notcovered-si_range=100
    1.01    1.03    random-points_range=1000
    1.01    1.02    random-points_range=100
    0.99    1.00    random-points_range=10
    0.98    1.00    range-covered-pk_range=100
    0.96    0.97    range-covered-si_range=100
    0.98    0.98    range-notcovered-pk_range=100
    1.00    1.02    range-notcovered-si_range=100
    0.98    1.00    read-only-count_range=1000
    1.01    1.01    read-only-distinct_range=1000
    0.99    0.99    read-only-order_range=1000
    1.00    1.00    read-only_range=10000
    0.99    0.99    read-only_range=100
    0.99    1.00    read-only_range=10
    0.98    0.99    read-only-simple_range=1000
    0.99    0.99    read-only-sum_range=1000
    0.98    0.98    scan_range=100
    1.01    1.02    delete_range=100
    1.01    1.03    insert_range=100
    0.99    1.01    read-write_range=100
    1.00    1.01    read-write_range=10
    1.00    1.02    update-index_range=100
    1.02    1.02    update-inlist_range=100
    1.01    1.03    update-nonindex_range=100
    0.99    1.01    update-one_range=100
    1.01    1.03    update-zipf_range=100
    1.00    1.01    write-only_range=10000

    The impact on VSZ and RSS is interesting. The table below shows the peak values for VSZ and RSS from mysqld during the benchmark. The last column is the ratio (peak RSS / buffer pool size). To save space I use abbreviated names for the binaries.
    • jemalloc.1
      • base case, fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128
    • jemalloc.2
      • col-1 above, fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
      • This has little impact on VSZ and RSS
    • jemalloc.3
      • col-2 above, fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128
      • This cuts peak VSZ in half and reduces peak RSS by 9%
    Peak values for MyRocks with 10G buffer pool
    alloc           VSZ     RSS     RSS/10
    jemalloc.1      45.6    12.2    1.22
    jemalloc.2      46.0    12.5    1.25
    jemalloc.3      20.2    11.6    1.16

    Results: MALLOC_ARENA_MAX

    The binaries tested are:
    • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_c32r128
      • base case in the table below
    • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_1arena_c32r128
      • col-1 in the table below
      • uses MALLOC_ARENA_MAX=1
    • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_8arena_c32r128
      • col-2 in the table below
      • uses MALLOC_ARENA_MAX=8
    • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_48arena_c32r128
      • col-3 in the table below
      • uses MALLOC_ARENA_MAX=48
    • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_96arena_c32r128
      • col-4 in the table below
      • uses MALLOC_ARENA_MAX=96
    These results use the relative QPS, which is the following where the base case uses glibc malloc with the default arena settings. When this value is larger than 1.0 then QPS is better with MALLOC_ARENA_MAX set.
    (QPS with MALLOC_ARENA_MAX set) / (QPS with default glibc malloc)
    From the results below:
    • performance with 1 or 8 arenas is lousy
    • performance drops some (often 5% to 10%) with 48 arenas
    • performance drops ~2% with 96 arenas
    Relative to: fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_c32r128

    col-1 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_1arena_c32r128
    col-2 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_8arena_c32r128
    col-3 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_48arena_c32r128
    col-4 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_96arena_c32r128

    col-1   col-2   col-3   col-4
    0.89    0.78    0.72    0.78    hot-points_range=100
    0.23    0.61    0.96    0.98    point-query_range=100
    0.31    0.86    0.96    1.01    points-covered-pk_range=100
    0.24    0.87    0.95    1.01    points-covered-si_range=100
    0.31    0.86    0.97    1.01    points-notcovered-pk_range=100
    0.20    0.86    0.97    1.00    points-notcovered-si_range=100
    0.35    0.79    0.96    1.01    random-points_range=1000
    0.30    0.87    0.96    1.01    random-points_range=100
    0.23    0.67    0.96    0.99    random-points_range=10
    0.06    0.48    0.92    0.96    range-covered-pk_range=100
    0.14    0.52    0.97    0.99    range-covered-si_range=100
    0.13    0.46    0.91    0.97    range-notcovered-pk_range=100
    0.23    0.87    0.96    1.01    range-notcovered-si_range=100
    0.23    0.76    0.97    0.99    read-only-count_range=1000
    0.56    1.00    0.96    0.97    read-only-distinct_range=1000
    0.20    0.47    0.90    0.94    read-only-order_range=1000
    0.68    1.04    1.00    1.00    read-only_range=10000
    0.21    0.76    0.98    0.99    read-only_range=100
    0.19    0.70    0.97    0.99    read-only_range=10
    0.21    0.58    0.94    0.98    read-only-simple_range=1000
    0.19    0.57    0.95    1.00    read-only-sum_range=1000
    0.53    0.98    1.00    1.01    scan_range=100
    0.30    0.81    0.98    1.00    delete_range=100
    0.50    0.92    1.00    1.00    insert_range=100
    0.23    0.72    0.97    0.98    read-write_range=100
    0.20    0.67    0.96    0.98    read-write_range=10
    0.33    0.88    0.99    1.00    update-index_range=100
    0.36    0.76    0.94    0.98    update-inlist_range=100
    0.30    0.85    0.98    0.99    update-nonindex_range=100
    0.86    0.98    1.00    1.01    update-one_range=100
    0.32    0.86    0.98    0.98    update-zipf_range=100
    0.27    0.80    0.97    0.98    write-only_range=10000

    The impact on VSZ and RSS is interesting. The table below shows the peak values for VSZ and RSS from mysqld during the benchmark. The last column is the ratio (peak RSS / buffer pool size). To save space I use abbreviated names for the binaries.

    Using 1 arena prevents RSS bloat but comes at a huge cost in performance. If I had more time I would have tested for 2, 4 and 6 arenas but I don't think glibc malloc + RocksDB are meant to be.

    Peak values for MyRocks with 10G buffer pool
    alloc           VSZ     RSS     RSS/10
    default         46.1    36.2    3.62
    arena = 1       15.9    14.1    1.41
    arena = 8       32.6    27.7    2.77
    arena = 48      35.2    29.2    2.92
    arena = 96      39.3    32.5    3.25


    Friday, April 11, 2025

    Battle of the Mallocators

    If you use RocksDB and want to avoid OOM then use jemalloc or tcmalloc and avoid glibc malloc. That was true in 2015 and remains true in 2025 (see here). The problem is that RocksDB can be an allocator stress test because it does an allocation (calls malloc) when a block is read from storage and then does a deallocation (calls free) on eviction. These allocations have very different lifetimes as some blocks remain cached for a long time and that leads to much larger RSS than expected when using glibc malloc. Fortunately, jemalloc and tcmalloc are better at tolerating that allocation pattern without making RSS too large.

    I have yet to notice a similar problem with InnoDB, in part because it does a few large allocations at process start for the InnoDB buffer pool and it doesn't do malloc/free per block read from storage.

    There was a recent claim from a MySQL performance expert, Dimitri Kravtchuk, that either RSS or VSZ can grow too large with InnoDB and jemalloc. I don't know all of the details for his setup and I failed to reproduce his result on my setup. To be fair, I show here that VSZ for InnoDB + jemalloc can be larger than you might expect, but that isn't a problem -- it is just an artifact of jemalloc that can be confusing. But RSS for jemalloc with InnoDB is similar to what I get from tcmalloc.

    tl;dr

    • For glibc malloc with MyRocks I get OOM on a server with 128G of RAM when the RocksDB buffer pool size is 50G. I might have been able to avoid OOM by using between 30G and 40G for the buffer pool. On that host I normally use jemalloc with MyRocks and a 100G buffer pool.
    • With respect to peak RSS
      • For InnoDB the peak RSS with all allocators is similar and peak RSS is ~1.06X larger than the InnoDB buffer pool.
      • For MyRocks the peak RSS is smallest with jemalloc, slightly larger with tcmalloc and much too large with glibc malloc. For (jemalloc, tcmalloc, glibc malloc) it was (1.22, 1.31, 3.62) times larger than the 10G MyRocks buffer pool. I suspect those ratios would be smaller for jemalloc and tcmalloc had I used an 80G buffer pool.
    • For performance, QPS with jemalloc and tcmalloc is slightly better than with glibc malloc
      • For InnoDB: [jemalloc, tcmalloc] get [2.5%, 3.5%] more QPS than glibc malloc
      • For MyRocks: [jemalloc, tcmalloc] get [5.1%, 3.0%] more QPS than glibc malloc

    Prior art

    I have several blog posts on using jemalloc with MyRocks.

    • October 2015 - MyRocks with glibc malloc, jemalloc and tcmalloc
    • April 2017 - Performance for large, concurrent allocations
    • April 2018 - RSS for MyRocks with jemalloc vs glibc malloc
    • August 2023 - RocksDB and glibc malloc
    • September 2023 - A regression in jemalloc 4.4.0 and 4.5.0 (too-large RSS) 
    • September 2023 - More on the regression in jemalloc 4.4.0 and 4.5.0
    • October 2023 - Even more on the regression in jemalloc 4.4.0 and 4.5.0

    Builds, configuration and hardware

    I compiled upstream MySQL 8.0.40 from source for InnoDB. I also compiled FB MySQL 8.0.32 from source for MyRocks. For FB MySQL I used source as of October 23, 2024 at git hash ba9709c9 with RocksDB 9.7.1.

    The server is an ax162-s from Hetzner with 48 cores (AMD EPYC 9454P), 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 with SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

    For malloc the server uses:
    • glibc
      • version 2.35-0ubuntu3.9
    • tcmalloc
      • provided by libgoogle-perftools-dev and apt-cache show claims this is version 2.9.1
      • enabled by malloc-lib=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so in my.cnf
    • jemalloc
      • provided by libjemalloc-dev and apt-cache show claims this is version 5.2.1-4ubuntu1
      • enabled by malloc-lib=/usr/lib/x86_64-linux-gnu/libjemalloc.so in my.cnf
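
    A minimal sketch of how that looks in my.cnf, assuming the stock layout where malloc-lib belongs in the [mysqld_safe] section:

      [mysqld_safe]
      # pick one; paths are from the Ubuntu 22.04 packages listed above
      malloc-lib=/usr/lib/x86_64-linux-gnu/libjemalloc.so
      #malloc-lib=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so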

    The configuration files are here for InnoDB and for MyRocks. For InnoDB I used an 80G buffer pool. I tried to use a 50G buffer pool for MyRocks but with glibc malloc there was OOM so I repeated all tests with a 10G buffer pool. I might have been able to avoid OOM with MyRocks and glibc malloc by using between 30G and 40G for the buffer pool -- but I didn't want to spend more time figuring that out when the real answer is to use jemalloc or tcmalloc.

    Benchmark

    I used sysbench and my usage is explained here. To save time I only run 27 of the 42 microbenchmarks and most test only 1 type of SQL statement.

    The tests run with 16 tables and 50M rows/table. There are 256 client threads and each microbenchmark runs for 1200 seconds. Normally I don't run with (client threads / cores) >> 1 but I do so here (256 / 48 is more than 5) to create more stress and to copy what I think Dimitri had done.

    Normally when I run sysbench I configure it so that the test tables fit in the buffer pool (block cache) but I don't do that here because I want MyRocks to do IO, as allocations per storage read create much drama for the allocator.

    The command line to run all tests is: bash r.sh 16 50000000 1200 1200 md2 1 0 256

    Peak VSZ and RSS

    The tables below show the peak values for VSZ and RSS from mysqld during the benchmark. The last column is the ratio (peak RSS / buffer pool size). I am not sure it is fair to compare these ratios between InnoDB and MyRocks from this work because the buffer pool size is so much larger for InnoDB. Regardless, RSS is more than 3X larger than the MyRocks buffer pool size with glibc malloc and that is a problem.

    Peak values for InnoDB with 80G buffer pool
    alloc           VSZ     RSS     RSS/80
    glibc           88.2    86.5    1.08
    tcmalloc        88.1    85.3    1.06
    jemalloc        91.5    87.0    1.08

    Peak values for MyRocks with 10G buffer pool
    alloc           VSZ     RSS     RSS/10
    glibc           46.1    36.2    3.62
    tcmalloc        15.3    13.1    1.31
    jemalloc        45.6    12.2    1.22

    Performance: InnoDB

    From the results here, QPS is mostly similar between tcmalloc and jemalloc but there are a few microbenchmarks where tcmalloc is much better than jemalloc and those are highlighted.

    The results for read-only_range=10000 are an outlier (tcmalloc much faster than jemalloc) and from vmstat metrics here I see that CPU/operation (cpu/o) and context switches/operation (cs/o) are much larger for jemalloc than for tcmalloc.

    These results use the relative QPS, which is the following where $allocator is tcmalloc or jemalloc. When this value is larger than 1.0 then QPS is larger with tcmalloc or jemalloc.
    (QPS with $allocator) / (QPS with glibc malloc)
    Relative to results with glibc malloc
    col-1 : results with tcmalloc
    col-2 : results with jemalloc

    col-1 col-2
    0.99 1.02 hot-points_range=100
    1.05 1.04 point-query_range=100
    0.96 0.99 points-covered-pk_range=100
    0.98 0.99 points-covered-si_range=100
    0.96 0.99 points-notcovered-pk_range=100
    0.97 0.98 points-notcovered-si_range=100
    0.97 1.00 random-points_range=1000
    0.95 0.99 random-points_range=100
    0.99 0.99 random-points_range=10
    1.04 1.03 range-covered-pk_range=100
    1.05 1.07 range-covered-si_range=100
    1.04 1.03 range-notcovered-pk_range=100
    0.98 1.00 range-notcovered-si_range=100
    1.02 1.02 read-only-count_range=1000
    1.05 1.07 read-only-distinct_range=1000
    1.07 1.12 read-only-order_range=1000
    1.28 1.09 read-only_range=10000
    1.03 1.05 read-only_range=100
    1.05 1.08 read-only_range=10
    1.08 1.07 read-only-simple_range=1000
    1.04 1.03 read-only-sum_range=1000
    1.02 1.02 scan_range=100
    1.01 1.00 delete_range=100
    1.03 1.01 insert_range=100
    1.02 1.02 read-write_range=100
    1.03 1.03 read-write_range=10
    1.01 1.02 update-index_range=100
    1.15 0.98 update-inlist_range=100
    1.06 0.99 update-nonindex_range=100
    1.03 1.03 update-one_range=100
    1.02 1.01 update-zipf_range=100
    1.18 1.05 write-only_range=10000

    Performance: MyRocks

    From the results here, QPS is mostly similar between tcmalloc and jemalloc with a slight advantage for jemalloc but there are a few microbenchmarks where jemalloc is much better than tcmalloc and those are highlighted.

    The results for hot-points below are odd (jemalloc is a lot faster than tcmalloc) and from vmstat metrics here I see that CPU/operation (cpu/o) and context switches/operation (cs/o) are both much larger for tcmalloc.

    These results use the relative QPS, which is the following where $allocator is tcmalloc or jemalloc. When this value is larger than 1.0 then QPS is larger with tcmalloc or jemalloc.
    (QPS with $allocator) / (QPS with glibc malloc)
    Relative to results with glibc malloc
    col-1 : results with tcmalloc
    col-2 : results with jemalloc

    col-1 col-2
    0.68 1.00 hot-points_range=100
    1.04 1.04 point-query_range=100
    1.09 1.09 points-covered-pk_range=100
    1.00 1.09 points-covered-si_range=100
    1.09 1.09 points-notcovered-pk_range=100
    1.10 1.12 points-notcovered-si_range=100
    1.08 1.08 random-points_range=1000
    1.09 1.09 random-points_range=100
    1.05 1.10 random-points_range=10
    0.99 1.07 range-covered-pk_range=100
    1.01 1.03 range-covered-si_range=100
    1.05 1.09 range-notcovered-pk_range=100
    1.10 1.09 range-notcovered-si_range=100
    1.07 1.05 read-only-count_range=1000
    1.00 1.00 read-only-distinct_range=1000
    0.98 1.04 read-only-order_range=1000
    1.03 1.03 read-only_range=10000
    0.96 1.03 read-only_range=100
    1.02 1.04 read-only_range=10
    0.98 1.07 read-only-simple_range=1000
    1.07 1.09 read-only-sum_range=1000
    1.02 1.02 scan_range=100
    1.05 1.03 delete_range=100
    1.11 1.07 insert_range=100
    0.96 0.97 read-write_range=100
    0.94 0.95 read-write_range=10
    1.08 1.04 update-index_range=100
    1.08 1.07 update-inlist_range=100
    1.09 1.04 update-nonindex_range=100
    1.04 1.04 update-one_range=100
    1.07 1.04 update-zipf_range=100
    1.03 1.02 write-only_range=10000

    Sunday, March 16, 2025

    At what level of concurrency do MySQL 5.7 and 8.0 become faster than 5.6?

    Are MySQL 5.7 and 8.0 faster than 5.6? That depends a lot on the workload -- both types of SQL and amount of concurrency. Here I summarize results from sysbench on a larger server (48 cores) using 1, 4, 6, 8, 10, 20 and 40 clients to show how things change.

    tl;dr

    • the workload here is microbenchmarks with a database cached by InnoDB
    • 5.7.44 is faster than 8.0.x at all concurrency levels on most microbenchmarks
    • for 5.6.51 vs 8.0.x
      • for point queries, 5.6.51 is faster at <= 8 clients
      • for range queries without aggregation 5.6.51 is always faster
      • for range queries with aggregation 5.6.51 is faster except at 40 clients
      • for writes, 5.6.51 is almost always faster at 10 or fewer clients (excluding update-index)
    Performance summaries

    For point queries:
    • 5.7.44 is always faster than 8.0
    • 8.0.28 suffers from bug 102037 - found by me with sysbench, fixed in 8.0.31
    • at what level of concurrency do most things get faster in 5.7 & 8.0 vs 5.6?
      • 5.7.44 becomes faster than 5.6.51 at 6+ clients
      • 8.0.x becomes faster than 5.6.51 at between 10 and 20 clients
      • Two of the microbenchmarks are always faster in 5.6.51 - both do non-covering queries on a secondary index
    For range queries without aggregation
    • 5.7.44 is always faster than 8.0.x
    • 5.6.51 is always faster than 5.7.44 and 8.0.x
    For range queries with aggregation
    • 5.7.44 is almost always faster than 8.0.x
    • 5.7.44 becomes faster than 5.6.51 at 6+ clients
    • 8.0.x becomes faster than 5.6.51 at 40 clients
    For writes
    • For update-index
      • 5.7.44 and 8.0.x are always faster than 5.6.51 at 4+ clients
      • There is an odd drop from ~6X to ~3X for 8.0.32 and 8.0.39 at 20 clients
    • 5.7.44 is mostly faster than 8.0.x for 1 to 20 clients and they have similar results at 40 clients
    • 5.7.44 & 8.0.x are always faster than 5.6.51 at 20+ clients
    Builds, configuration and hardware

    I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.28, 8.0.32, 8.0.39 and 8.0.41.

    The server is an ax162-s from Hetzner with 48 cores (AMD EPYC 9454P), 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 with SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

    The configuration files are named my.cnf.cz11a_c32r128 and here for 5.6.51, 5.7.44, 8.0.28, 8.0.32, 8.0.39 and 8.0.41.

    Benchmark

    I used sysbench and my usage is explained here. To save time I only run 27 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by InnoDB.

    The tests run with 8 tables and 10M rows/table using 1 to 40 client threads (see the command lines below). Read-heavy microbenchmarks run for 180 seconds and write-heavy run for 300 seconds.

    The command lines to run all tests are:
    bash r.sh 8 10000000 180 300 md2 1 1 1
    bash r.sh 8 10000000 180 300 md2 1 1 4
    bash r.sh 8 10000000 180 300 md2 1 1 6
    bash r.sh 8 10000000 180 300 md2 1 1 8
    bash r.sh 8 10000000 180 300 md2 1 1 10
    bash r.sh 8 10000000 180 300 md2 1 1 20
    bash r.sh 8 10000000 180 300 md2 1 1 40

    Results

    For the results below I split the microbenchmarks into 4 groups: point queries, range queries without aggregation, range queries with aggregation, and writes. The spreadsheet with all data is here. Files with performance summaries for relative and absolute QPS are here. Values from iostat and vmstat per microbenchmark are here for 1 client, 4 clients, 6 clients, 8 clients, 10 clients, 20 clients and 40 clients. These help to explain why something is faster or slower because they show how much HW is used per query.

    The relative QPS is the following where $version is >= 5.7.44.
    (QPS for $version) / (QPS for MySQL 5.6.51)
    The numbers in the spreadsheets are the relative QPS. When the relative QPS is > 1 then $version is faster than MySQL 5.6.51. When it is 3.0 then $version is 3X faster than the base case.

    Results: charts 

    Notes on the charts

    • the y-axis shows the relative QPS
    • the y-axis starts at 0.80 to make it easier to see differences
    • in some cases the y-axis truncates the good outliers, cases where the relative QPS is greater than 1.5. I do this to improve readability for values near 1.0. Regardless, the improvements are nice.
    Results: point queries

    Summary
    • 5.7.44 is always faster than 8.0
    • 8.0.28 suffers from bug 102037 - found by me with sysbench, fixed in 8.0.31
    • at what level of concurrency do most things get faster in 5.7 & 8.0 vs 5.6?
      • 5.7.44 becomes faster than 5.6.51 at 6+ clients
      • 8.0.x becomes faster than 5.6.51 at between 10 and 20 clients
      • Two of the microbenchmarks are always faster in 5.6.51 - both do non-covering queries on a secondary index
    Results: range queries without aggregation

    Summary
    • 5.7.44 is always faster than 8.0.x
    • 5.6.51 is always faster than 5.7.44 and 8.0.x
    Results: range queries with aggregation

    Summary
    • 5.7.44 is almost always faster than 8.0.x
    • 5.7.44 becomes faster than 5.6.51 at 6+ clients
    • 8.0.x becomes faster than 5.6.51 at 40 clients
    Results: writes

    The relative speedup for the update-index microbenchmark is frequently so large that it obscures the smaller changes on other microbenchmarks. So here I truncate the y-axis for some of the charts (for 6+ clients) and the section that follows has the charts without truncation.

    Summary
    • For update-index
      • 5.7.44 and 8.0.x are always faster than 5.6.51 at 4+ clients
      • There is an odd drop from ~6X to ~3X for 8.0.32 and 8.0.39 at 20 clients but you can't see that on the charts in this section because of the truncation. It is visible in the next section. From vmstat I see an increase in CPU/operation (cpu/o) and context switches/operation (cs/o) at 20 clients but not at 40 clients.
    • 5.7.44 is mostly faster than 8.0.x for 1 to 20 clients and they have similar results at 40 clients
    • 5.7.44 & 8.0.x are always faster than 5.6.51 at 20+ clients
    Results: charts for writes without truncation

    The y-axis is truncated in the charts for writes in the previous section for 6+ clients. This section has those charts without truncation.
