Skip to content

redis crash when close other normal connections after a timeout lua script if io threads do reads #6988

@soloestoy

Description

@soloestoy
65976:M 15 Mar 2020 22:05:11.079 # Lua slow script detected: still in execution after 5003 milliseconds. You can try killing the script using the SCRIPT KILL command. Script SHA1 is: 5cc492ff8069ef954920c191887905108182c43b

=== REDIS BUG REPORT START: Cut & paste starting from here ===
65976:M 15 Mar 2020 22:05:48.563 # === ASSERTION FAILED ===
65976:M 15 Mar 2020 22:05:48.563 # ==> networking.c:1024 'ln != NULL' is not true
65976:M 15 Mar 2020 22:05:48.563 # (forcing SIGSEGV to print the bug report.)
65976:M 15 Mar 2020 22:05:48.563 # Redis 999.999.999 crashed by signal: 11
65976:M 15 Mar 2020 22:05:48.563 # Crashed running the instruction at: 0x47126a
65976:M 15 Mar 2020 22:05:48.563 # Accessing address: 0xffffffffffffffff
65976:M 15 Mar 2020 22:05:48.563 # Failed assertion: ln != NULL (networking.c:1024)

------ STACK TRACE ------
EIP:
./redis-server *:6379(_serverAssert+0x6a)[0x47126a]

Backtrace:
./redis-server *:6379(logStackTrace+0x29)[0x473739]
./redis-server *:6379(sigsegvHandler+0xa6)[0x473dd6]
/lib64/libpthread.so.0(+0xf620)[0x7f26b5388620]
./redis-server *:6379(_serverAssert+0x6a)[0x47126a]
./redis-server *:6379(unlinkClient+0x229)[0x4412d9]
./redis-server *:6379(freeClient+0xf5)[0x441465]
./redis-server *:6379(freeClientsInAsyncFreeQueue+0x27)[0x441b57]
./redis-server *:6379(beforeSleep+0x8c)[0x431bac]
./redis-server *:6379(aeMain+0x1e)[0x42ee8e]
./redis-server *:6379(main+0x4c4)[0x42b974]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f26b4fcd445]
./redis-server *:6379[0x42bbfd]

We can see that redis crashes because assertion failed, the assertion is in freeClient:

    /* Remove from the list of pending reads if needed. */
    if (c->flags & CLIENT_PENDING_READ) {
        ln = listSearchKey(server.clients_pending_read,c);
        serverAssert(ln != NULL);
        listDelNode(server.clients_pending_read,ln);
        c->flags &= ~CLIENT_PENDING_READ;
    }

It means the client we try to unlink is in server.clients_pending_read, it's easy to reproduce:

  1. Start redis with config open io-threads-do-reads

    ./redis-server --io-threads 4 --io-threads-do-reads yes

  2. Process some commands using redis-benchmark

    ./redis-benchmark -t get,set -n 10000000 -l --threads 6

  3. Send a slow lua script

    127.0.0.1:6379> eval "for i=1,100000000,1 do redis.call('get','a') end" 0

  4. Kill redis-benchmark after Lua slow script detected and before the script done.

    now redis crashes

The root cause is we use function processEventsWhileBlocked to handle events if lua timeout, and the read events are delayed and just put client into server.clients_pending_read's head, but aeProcessEvents is called without flag AE_CALL_AFTER_SLEEP in processEventsWhileBlocked, so the new fired read event cannot be processed, since we call handleClientsWithPendingReadsUsingThreads in afterSleep. And client close connection before lua script done.

Then after lua script done, function handleClientsWithPendingReadsUsingThreads empty the list server.clients_pending_read but does not remove the new fired client's CLIENT_PENDING_READ flag.

Finally freeClient try to remove the client from server.clients_pending_read because it's still with CLIENT_PENDING_READ flag but cannot find it, so the assertion failed.

To fix it, I don't call beforeSleep and handleClientsWithPendingReadsUsingThreads in processEventsWhileBlocked, because I think in this scenario we do not need the io-threads mode, just handle all events in main thread is OK.

So I add a new variable process_while_blocked to check, if we are in processEventsWhileBlocked do not postpone client read.

See commit 606a01d in PR #6989

There are also other scenarios too, like DEBUG RELOAD|LOADAOF or reboot or slave loading data.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions