100% CPU Lockup with subkey expiry on 5.3.3 #187

@markabey

Describe the bug

We migrated from Redis to KeyDB for the subkey expiry feature. On most hosts this worked fine, but one got into a high-CPU lockup state. Every host has different data, and I am unsure what is different about this one, but the lockup was unexpected since Redis had worked fine there for ages.

Log Files

CPU was pinned at 100%; commands like INFO took up to 2 minutes to return, and the server continued to slow down.

Output of INFO (attached):
keydbinfo.log

Running a BGSAVE here, the hang seemed to happen before the connection was even accepted; note the two-minute gap between issuing the command and the Accepted lines in the log below:

$  date; time redis-cli -p 6389 bgsave; date
Wed 27 May 10:17:49 UTC 2020
(error) ERR Background save already in progress

real    1m59.435s
user    0m0.001s
sys     0m0.000s
Wed 27 May 10:19:49 UTC 2020
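
As a simpler probe of the same delay (not from the original report, just a standard redis-cli feature that should work against KeyDB too), the built-in latency mode sends PINGs in a loop and reports min/max/avg round-trip times, so a multi-minute stall shows up as a huge max:

$  redis-cli -p 6389 --latency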

Log file for that time:

17160:17362:M 27 May 2020 10:17:23.961 - Client closed connection
17160:17362:M 27 May 2020 10:17:23.961 - Client closed connection
17160:17362:M 27 May 2020 10:17:23.962 - Client closed connection
17160:17362:M 27 May 2020 10:19:48.990 - Accepted 127.0.0.1:38274
17160:17362:M 27 May 2020 10:19:48.991 - Accepted 127.0.0.1:35410
17160:17362:M 27 May 2020 10:19:48.991 - Accepted 127.0.0.1:48794
17160:17362:M 27 May 2020 10:19:48.991 - Accepted 127.0.0.1:33910
17160:17362:M 27 May 2020 10:19:48.991 - Accepted 127.0.0.1:39824
17160:17362:M 27 May 2020 10:19:48.991 - Accepted 127.0.0.1:47330
17160:17362:M 27 May 2020 10:19:48.991 - Accepted 127.0.0.1:49230
17160:17362:M 27 May 2020 10:19:49.055 * Background saving started by pid 32518
17160:17362:M 27 May 2020 10:19:49.056 - DB 6: 60007 keys (60003 volatile) in 65536 slots HT.
17160:17362:M 27 May 2020 10:19:49.056 - 21 clients connected (0 replicas), 328125192 bytes in use
17160:17362:M 27 May 2020 10:19:49.109 - Client closed connection
17160:17362:M 27 May 2020 10:19:49.109 - Client closed connection
17160:17362:M 27 May 2020 10:19:49.109 - Client closed connection
17160:17362:M 27 May 2020 10:19:49.109 - Client closed connection
17160:17362:M 27 May 2020 10:19:49.109 - Client closed connection
17160:17362:M 27 May 2020 10:19:49.109 - Client closed connection
17160:17362:M 27 May 2020 10:19:49.109 - Client closed connection
32518:17362:C 27 May 2020 10:19:51.363 * DB saved on disk
32518:17362:C 27 May 2020 10:19:51.377 * RDB: 46 MB of memory used by copy-on-write
17160:17362:M 27 May 2020 10:19:51.473 * Background saving terminated with success
17160:17362:M 27 May 2020 10:19:55.395 - DB 6: 60160 keys (60157 volatile) in 65536 slots HT.
17160:17362:M 27 May 2020 10:19:55.395 - 14 clients connected (0 replicas), 368008936 bytes in use
17160:17362:M 27 May 2020 10:19:59.079 - Accepted 127.0.0.1:41888
17160:17362:M 27 May 2020 10:19:59.657 - Client closed connection
17160:17362:M 27 May 2020 10:20:02.029 - DB 6: 68047 keys (68044 volatile) in 131072 slots HT.
17160:17362:M 27 May 2020 10:20:02.029 - 14 clients connected (0 replicas), 375634136 bytes in use

On startup, the logs contain many lines like the following for the various subkeys we use, all with key: (null):

17309:17309:M 27 May 2020 10:40:39.349 # Corrupt subexpire entry in RDB skipping. key: (null) subkey: pra2
17309:17309:M 27 May 2020 10:40:39.349 # Corrupt subexpire entry in RDB skipping. key: (null) subkey: rrc2
17309:17309:M 27 May 2020 10:40:39.369 # Corrupt subexpire entry in RDB skipping. key: (null) subkey: j5
17309:17309:M 27 May 2020 10:40:39.369 # Corrupt subexpire entry in RDB skipping. key: (null) subkey: sn
17309:17309:M 27 May 2020 10:40:39.369 # Corrupt subexpire entry in RDB skipping. key: (null) subkey: gh
17309:17309:M 27 May 2020 10:40:39.369 # Corrupt subexpire entry in RDB skipping. key: (null) subkey: sv
17309:17309:M 27 May 2020 10:40:39.369 # Corrupt subexpire entry in RDB skipping. key: (null) subkey: sc
17309:17309:M 27 May 2020 10:40:39.369 # Corrupt subexpire entry in RDB skipping. key: (null) subkey: cn
17309:17309:M 27 May 2020 10:40:39.369 # Corrupt subexpire entry in RDB skipping. key: (null) subkey: ca

I believe we use the same expiry time (14400 seconds) for subkeys as for their parent keys. The parent keys get refreshed occasionally; the subkeys do not. A sketch of this pattern is below.
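
For clarity, here is a minimal sketch of that access pattern using KeyDB's EXPIREMEMBER command (the key and field names here are invented for illustration; the real ones come from our application):

$  redis-cli -p 6389 HSET session:example sv somevalue
$  redis-cli -p 6389 EXPIREMEMBER session:example sv 14400   # subkey TTL, set once, never refreshed
$  redis-cli -p 6389 EXPIRE session:example 14400            # parent TTL, refreshed periodically on access

With this pattern the parent key keeps being renewed while its subkey expiries run down on their own, which might be relevant to the (null)-key subexpire entries above, though that is speculation.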

Flame graphs:

Database load took a very long time (multiple minutes for 78k keys / ~400 MB); the flame graph is in the zip below. Note the executable is named keydb-serverOLD because we moved it aside to put Redis back in place, but it is still 5.3.3.

Then, while it was running, here is a 1-minute flame graph (gdb confirmed the same functions were on the stack):
flame-graph.zip
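
The report doesn't say how the profiles were captured; one standard way to produce an equivalent 1-minute flame graph is perf plus Brendan Gregg's FlameGraph scripts (assuming those scripts are on PATH; PID 17160 is the server process from the logs above):

$  perf record -F 99 -g -p 17160 -- sleep 60
$  perf script | stackcollapse-perf.pl | flamegraph.pl > keydb-flame.svg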

And a screenshot of the flame graph (attached as an image).

To Reproduce

I tried to reproduce this by copying the database out and running it standalone on another port (this had to be done before the data expired). Loading the data from the RDB was again extremely slow.
While running the ping benchmark below, it also kept spamming lines like this to the log:
Corrupt subexpire entry in RDB skipping. key: (null) subkey: CX
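
The reproduction setup was roughly the following (paths are hypothetical; port 6390 matches the benchmark runs below):

$  cp /var/lib/keydb/dump.rdb /tmp/repro/      # copy the snapshot before the keys expire
$  keydb-server --port 6390 --dir /tmp/repro   # standalone instance loading the copied RDB
$  keydb-benchmark -p 6390                     # ping benchmark against the copy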

Here is some benchmark data (it got slower as it ran) before I Ctrl+C'd it:

====== PING_INLINE ======
  100000 requests completed in 327.07 seconds
  50 parallel clients
  3 bytes payload
  keep alive: 1
  host configuration "save": 3600 100000
  host configuration "appendonly": no
  multi-thread: no

0.00% <= 0.1 milliseconds
0.54% <= 0.2 milliseconds
19.94% <= 0.3 milliseconds
59.78% <= 0.4 milliseconds
78.01% <= 0.5 milliseconds
85.70% <= 0.6 milliseconds
93.66% <= 0.7 milliseconds
95.92% <= 0.8 milliseconds
96.87% <= 0.9 milliseconds
97.18% <= 1.0 milliseconds
...
99.90% <= 27059 milliseconds
99.90% <= 36692 milliseconds
99.91% <= 36693 milliseconds
99.95% <= 48527 milliseconds
99.99% <= 48528 milliseconds
100.00% <= 48528 milliseconds
305.74 requests per second

Once it finished loading the DB:

11284:11284:M 27 May 2020 11:27:00.246 * DB loaded from disk: 340.658 seconds
11284:11284:M 27 May 2020 11:27:00.246 * Ready to accept connections
11284:18340:M 27 May 2020 11:27:00.246   Thread 0 alive.

Then it got fast again. This speedup never happened on the live host, where data was actively being written: there it stayed slow and pinned at 100% CPU.

$  keydb-benchmark -p 6390
====== PING_INLINE ======
  100000 requests completed in 1.02 seconds
  50 parallel clients
  3 bytes payload
  keep alive: 1
  host configuration "save": 3600 100000
  host configuration "appendonly": no
  multi-thread: no

0.00% <= 0.1 milliseconds
3.82% <= 0.2 milliseconds
51.79% <= 0.3 milliseconds
80.65% <= 0.4 milliseconds
92.42% <= 0.5 milliseconds
97.84% <= 0.6 milliseconds
99.16% <= 0.7 milliseconds
99.50% <= 0.8 milliseconds
99.63% <= 0.9 milliseconds
99.75% <= 1.0 milliseconds
99.84% <= 1.1 milliseconds
99.85% <= 1.3 milliseconds
99.86% <= 1.4 milliseconds
99.88% <= 1.5 milliseconds
99.89% <= 1.6 milliseconds
99.92% <= 1.7 milliseconds
99.92% <= 1.8 milliseconds
99.93% <= 1.9 milliseconds
99.93% <= 2 milliseconds
99.95% <= 8 milliseconds
99.96% <= 9 milliseconds
100.00% <= 9 milliseconds
98039.22 requests per second

Switching back to Redis and wiping the RDB dump resolved the issue. When we tried KeyDB again, the hangs and errors returned.
