Fixed tracking of command duration for multi/eval/module/wait by madolson · Pull Request #11970 · redis/redis

madolson · 2023-03-25T00:51:57Z

In #11012, we changed the way command durations were computed to handle the same command being executed multiple times. This commit fixes some misses from that commit.

Wait commands were not correctly reporting their duration if the timeout was reached.
MULTI/EXEC, scripts, and modules using RM_Call were not resetting the duration between inner calls, so they reported cumulative durations.
When a blocked client is freed, the call and duration are always discarded.

This commit also adds an assert that fires if the duration was not properly reset, which may indicate that a report to call statistics was missed. The assert can potentially be removed in the future, as it is mainly intended to detect misses in tests.
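As a rough illustration of the cumulative-duration problem described above, here is a minimal C sketch. The names (`sketch_client`, `sketch_cmd_stats`, `sketch_record_call`, `sketch_run_inner_calls`) are hypothetical stand-ins and not the Redis source; the sketch only assumes that a per-client duration is added to per-command statistics after each inner call.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the per-command stats and the client. */
typedef struct {
    int64_t calls;
    int64_t microseconds;
} sketch_cmd_stats;

typedef struct {
    int64_t duration; /* time accumulated for the command currently in flight */
} sketch_client;

/* Record one finished inner call. If reset_after_use is 0, the duration is
 * left in place and leaks into the next call (the cumulative behavior). */
static void sketch_record_call(sketch_client *c, sketch_cmd_stats *s,
                               int64_t elapsed_us, int reset_after_use) {
    c->duration += elapsed_us;
    s->microseconds += c->duration;
    s->calls++;
    if (reset_after_use) c->duration = 0;
}

/* Run n inner calls that each take elapsed_us and return the total number
 * of microseconds that ended up in the stats. */
static int64_t sketch_run_inner_calls(int n, int64_t elapsed_us,
                                      int reset_after_use) {
    sketch_client c = {0};
    sketch_cmd_stats s = {0, 0};
    for (int i = 0; i < n; i++) {
        sketch_record_call(&c, &s, elapsed_us, reset_after_use);
    }
    return s.microseconds;
}
```

With two inner calls of 100000 usec each, the non-resetting variant records 300000 usec in total while the resetting variant records 200000, which has the same shape as the `cmdstat_debug` numbers in the before/after output below.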

Before:

(error) ERR EXEC without MULTI
127.0.0.1:6379> multi
OK
127.0.0.1:6379> debug sleep 0.1
QUEUED
127.0.0.1:6379> debug sleep 0.1
QUEUED
127.0.0.1:6379> info commandstats
127.0.0.1:6379> exec
1) OK
2) OK
3) "# Commandstats\r\ncmdstat_exec:calls=1,usec=4,usec_per_call=4.00,rejected_calls=0,failed_calls=1\r\ncmdstat_debug:calls=2,usec=300206,usec_per_call=150103.00,rejected_calls=0,failed_calls=0\r\ncmdstat_multi:calls=1,usec=5,usec_per_call=5.00,rejected_calls=0,failed_calls=0\r\n"

After:

127.0.0.1:6379> multi
OK
127.0.0.1:6379> debug sleep 0.1
QUEUED
127.0.0.1:6379> debug sleep 0.1
QUEUED
127.0.0.1:6379> exec
1) OK
2) OK
127.0.0.1:6379> info commandstats
# Commandstats
cmdstat_exec:calls=2,usec=200176,usec_per_call=100088.00,rejected_calls=0,failed_calls=1
cmdstat_info:calls=1,usec=35,usec_per_call=35.00,rejected_calls=0,failed_calls=0
cmdstat_debug:calls=2,usec=200132,usec_per_call=100066.00,rejected_calls=0,failed_calls=0
cmdstat_multi:calls=2,usec=4,usec_per_call=2.00,rejected_calls=0,failed_calls=0
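The `usec_per_call` values in both outputs are consistent with a plain average of `usec` over `calls`. The helper below is a sketch under that assumption; `sketch_usec_per_call` is a hypothetical name, not a Redis function.

```c
/* Average microseconds per call, guarding against division by zero.
 * Assumption: usec_per_call is simply usec / calls. */
static double sketch_usec_per_call(long long usec, long long calls) {
    return calls == 0 ? 0.0 : (double)usec / (double)calls;
}
```

For the `cmdstat_debug` lines, 300206 / 2 gives 150103.00 before the fix and 200132 / 2 gives 100066.00 after it, so the "before" average is inflated by roughly one extra 0.1 second sleep spread over the two calls.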

ranshid · 2023-03-25T04:46:40Z

good catch - I missed it (after struggling to adjust lua to start calling resetClient)
I think this fix makes sense. I just need to check the blocked timeout case though

ranshid · 2023-03-25T11:57:35Z

maybe we can also zero the duration when we are updating stats?

───────────────────────────────────────────────────────────────────────────────
modified: src/blocked.c
───────────────────────────────────────────────────────────────────────────────
@ src/blocked.c:111 @ void updateStatsOnUnblock(client *c, long blocked_us, long reply_us, int had_err
    c->lastcmd->microseconds += total_cmd_duration;
    c->lastcmd->calls++;
    server.stat_numcommands++;
    c->duration = 0; <-- here
    if (had_errors)
        c->lastcmd->failed_calls++;
    if (server.latency_tracking_enabled)

oranagra · 2023-03-26T14:39:00Z

i think i agree with Ran, it'll look better if we zero it right after using it (explicitly in call() and in updateStatsOnUnblock()).
i think we can also keep the one in resetClient just for safety.

madolson · 2023-03-26T21:07:49Z

> i think we can also keep the one in resetClient just for safety.

I moved it to a serverAssert(), maybe it will catch a place where we miss. Missed one case related to WAIT where we weren't recording the wait time and one extra false positive related to module client freeing.

ranshid · 2023-03-27T04:51:29Z

> > i think we can also keep the one in resetClient just for safety.
>
> I moved it to a serverAssert(), maybe it will catch a place where we miss. Missed one case related to WAIT where we weren't recording the wait time and one extra false positive related to module client freeing.

classic case to place a "debugAssert" which we do not have :)
anyway I think that is fine, but I wonder if the code wouldn't have been cleaner if we just had the original fix+keep the zeroing inside resetClient?
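The trade-off discussed here (silently zeroing the duration on reset versus asserting that it was already zero) can be sketched as follows. All names are hypothetical, and instead of aborting like an assert would, the checked variant returns 0 so the difference is easy to observe.

```c
#include <stdint.h>

/* Hypothetical, simplified client carrying only the pending duration. */
typedef struct {
    int64_t duration;
} sketch_client;

typedef enum {
    SKETCH_RESET_SILENT, /* always zero the duration, hiding a missed report */
    SKETCH_RESET_CHECKED /* flag a non-zero duration instead of hiding it */
} sketch_reset_policy;

/* Returns 1 when the reset is clean. Returns 0 where a checked reset would
 * have tripped an assertion because a duration was never reported. */
static int sketch_reset_client(sketch_client *c, sketch_reset_policy policy) {
    if (policy == SKETCH_RESET_CHECKED && c->duration != 0) {
        return 0; /* unreported duration detected; left in place */
    }
    c->duration = 0;
    return 1;
}
```

The silent policy always leaves the client in a clean state but gives no signal when a stats update was skipped; the checked policy surfaces the skipped update, at the cost of needing every code path to clear the duration itself.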

oranagra

i think i'd prefer a simpler approach of clearing it in obvious places:

createClient, resetClient
and after using the value (i.e. updateStatsOnUnblock and call)

not sure we need the assertion...

madolson · 2023-03-27T15:42:05Z

> i think i'd prefer a simpler approach of clearing it in obvious places:

The problem is that the "obvious places" aren't really obvious anymore.

> not sure we need the assertion...

I would have agreed if it didn't find another issue with wait. It wasn't critical, but still

oranagra · 2023-03-27T19:30:32Z

You mean the condition that ended up in call?

Anyway. OK. Go ahead and merge it..

madolson · 2023-03-29T04:58:15Z

src/networking.c


    /* Deallocate structures used to block on blocking ops. */
+    /* If there is any in-flight command, we don't record their duration. */
+    c->duration = 0;


I started running into issues with module APIs: when a client with a pending blocking RM_Call() was freed, the free path called resetClient(), which crashed on the assert because the client still had an unrecorded command. There might be a more elegant way to fix it, but I ended up just zeroing the duration when the client is being freed.

oranagra

i don't understand. are we calling resetClient from within / after freeClient?
same thing for moduleReleaseTempClient?

madolson

Yes, resetClient is being called from within freeClient, because freeClient calls unblockClient, which calls resetClient here: https://github.com/redis/redis/blob/unstable/src/blocked.c#L211.

oranagra

ok.
but now that we fixed the problem with WAIT, can we (at least in theory) replace that assert with a =0, and drop many other changes?
i'm ok with keeping the assert, just curious since i'm not looking at this PR with much care..
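The call path described in this thread (free, then unblock, then reset) can be modeled with a small C sketch. The functions below are hypothetical simplifications named after the roles in the discussion, not the Redis implementations, and the checked reset returns 0 instead of aborting.

```c
#include <stdint.h>

/* Hypothetical, simplified client: a pending duration and a blocked flag. */
typedef struct {
    int64_t duration;
    int blocked;
} sketch_client;

/* Checked reset: returns 0 where an assertion on the duration would fire. */
static int sketch_reset_client(sketch_client *c) {
    return c->duration == 0;
}

/* Unblocking a client ends with a reset, as in the thread above. */
static int sketch_unblock_client(sketch_client *c) {
    c->blocked = 0;
    return sketch_reset_client(c);
}

/* Freeing a blocked client goes through unblock and therefore reset.
 * With discard_in_flight set, the in-flight duration is dropped first,
 * so the checked reset no longer sees an unrecorded command. */
static int sketch_free_client(sketch_client *c, int discard_in_flight) {
    if (discard_in_flight) c->duration = 0;
    if (c->blocked) return sketch_unblock_client(c);
    return 1;
}
```

Freeing a blocked client that still carries a duration trips the check unless the duration is discarded first, which is the behavior the `c->duration = 0;` line in the diff is meant to provide.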

ranshid

LGTM - thank you for fixing this!

oranagra

please mention the change about unblockClientWaitingReplicas in the top comment.

redis#526)

In redis#11012, we changed the way command durations were computed to handle the same command being executed multiple times. In redis#11970, we added an assert that fires if the duration is not properly reset, potentially indicating that a call to report statistics was missed.

I found an edge case where this happens, easily reproduced by blocking a client on `XREADGROUP` and migrating the stream's slot. This causes the engine to process the `XREADGROUP` command twice:

1. The first time, we block on the stream and wait to be unblocked so that we can come back to the command a second time. In most cases, when we come back after the unblock, we process the command normally, which includes recording the duration and then resetting it.
2. After unblocking, we come back to process the command, and this is where we hit the edge case: by this point the slot has already been migrated to another node, so we return a `MOVED` response. But when we do that, we don't reset the duration field.

Fix: also reset the duration when returning a `MOVED` response. I think this is right, because the client should redirect the command to the right node, which in turn will calculate the execution duration.

Also wrote a test that reproduces this; it fails without the fix and passes with it.

Signed-off-by: Nitai Caro
Co-authored-by: Nitai Caro
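The edge case in this follow-up can be illustrated with a minimal C sketch. Everything here is a hypothetical model (`sketch_reprocess`, `slot_is_local`, `reset_on_moved` are made-up names), assuming only that a reprocessed command either records and clears its duration or returns a redirect.

```c
#include <stdint.h>

typedef struct {
    int64_t calls;
    int64_t microseconds;
} sketch_cmd_stats;

typedef struct {
    int64_t duration; /* accumulated while the client was blocked */
} sketch_client;

typedef enum { SKETCH_REPLY_OK, SKETCH_REPLY_MOVED } sketch_reply;

/* Second pass over a command after the client was unblocked.
 * slot_is_local: whether this node still owns the key's slot.
 * reset_on_moved: whether the redirect path clears the duration. */
static sketch_reply sketch_reprocess(sketch_client *c, sketch_cmd_stats *s,
                                     int slot_is_local, int reset_on_moved) {
    if (!slot_is_local) {
        /* Redirect: nothing is recorded on this node. */
        if (reset_on_moved) c->duration = 0;
        return SKETCH_REPLY_MOVED;
    }
    /* Normal path: record the duration, then clear it. */
    s->microseconds += c->duration;
    s->calls++;
    c->duration = 0;
    return SKETCH_REPLY_OK;
}
```

Without the reset on the redirect path, the client leaves `sketch_reprocess` with a non-zero duration and no stats recorded, which is the state a reset-time assertion is designed to catch.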

Fixed tracking of command duration for multi/eval/module

4b7b145

madolson requested a review from ranshid March 25, 2023 00:52

Restructure the commit

062392f

oranagra approved these changes Mar 27, 2023

oranagra added the release-notes label (indication that this issue needs to be mentioned in the release notes) Mar 27, 2023

vitarb reviewed Mar 28, 2023

src/server.c (outdated)

vitarb reviewed Mar 28, 2023

src/blocked.c

madolson added 4 commits March 28, 2023 20:08

Handle more edge cases

fe1ae5a

Actually commit files

015549b

Don't crash when a client without a pending client is freed

671dcc9

Remove a stat that was probably not needed

d5dcaa5

madolson commented Mar 29, 2023

src/networking.c (outdated)

Update src/networking.c

fea8e57

ranshid reviewed Mar 29, 2023

src/replication.c

Ran said these are so 2022

5c3fd1e

ranshid approved these changes Mar 29, 2023

oranagra approved these changes Mar 29, 2023

madolson changed the title from "Fixed tracking of command duration for multi/eval/module" to "Fixed tracking of command duration for multi/eval/module/wait" Mar 29, 2023

madolson merged commit 971b177 into redis:unstable Mar 30, 2023

oranagra mentioned this pull request May 15, 2023

Release Redis 7.2 RC2 #12173

Merged
