Skip to content

Fixed tracking of command duration for multi/eval/module/wait#11970

Merged
madolson merged 8 commits intoredis:unstablefrom
madolson:duration_multi
Mar 30, 2023
Merged

Fixed tracking of command duration for multi/eval/module/wait#11970
madolson merged 8 commits intoredis:unstablefrom
madolson:duration_multi

Conversation

@madolson
Copy link
Contributor

@madolson madolson commented Mar 25, 2023

In #11012, we changed the way command durations were computed to handle the same command being executed multiple times. This commit fixes some misses from that commit.

  1. Wait commands were not correctly reporting their duration if the timeout was reached.
  2. Multi/scripts/and modules with RM_Call were not properly resetting the duration between inner calls, leading to them reporting cumulative duration.
  3. When a blocked client is freed, the call and duration are always discarded.

This commit also adds an assert if the duration is not properly reset, potentially indicating that a report to call statistics was missed. The assert potentially be removed in the future, as it's mainly intended to detect misses in tests.

Before:

(error) ERR EXEC without MULTI
127.0.0.1:6379> multi
OK
127.0.0.1:6379> debug sleep 0.1
QUEUED
127.0.0.1:6379> debug sleep 0.1
QUEUED
127.0.0.1:6379> info commandstats
127.0.0.1:6379> exec
1) OK
2) OK
3) "# Commandstats\r\ncmdstat_exec:calls=1,usec=4,usec_per_call=4.00,rejected_calls=0,failed_calls=1\r\ncmdstat_debug:calls=2,usec=300206,usec_per_call=150103.00,rejected_calls=0,failed_calls=0\r\ncmdstat_multi:calls=1,usec=5,usec_per_call=5.00,rejected_calls=0,failed_calls=0\r\n"

After:

127.0.0.1:6379> multi
OK
127.0.0.1:6379> debug sleep 0.1
QUEUED
127.0.0.1:6379> debug sleep 0.1
QUEUED
127.0.0.1:6379> exec
1) OK
2) OK
127.0.0.1:6379> info commandstats
# Commandstats
cmdstat_exec:calls=2,usec=200176,usec_per_call=100088.00,rejected_calls=0,failed_calls=1
cmdstat_info:calls=1,usec=35,usec_per_call=35.00,rejected_calls=0,failed_calls=0
cmdstat_debug:calls=2,usec=200132,usec_per_call=100066.00,rejected_calls=0,failed_calls=0
cmdstat_multi:calls=2,usec=4,usec_per_call=2.00,rejected_calls=0,failed_calls=0

@madolson madolson requested a review from ranshid March 25, 2023 00:52
@ranshid
Copy link
Contributor

ranshid commented Mar 25, 2023

good catch - I missed it (after struggling to adjust lua to start calling resetClient)
I think this fix makes sense. I just need to check the blocked timeout case though

@ranshid
Copy link
Contributor

ranshid commented Mar 25, 2023

maybe we can also zero the duration when we are updating stats?

───────────────────────────────────────────────────────────────────────────────
modified: src/blocked.c
───────────────────────────────────────────────────────────────────────────────
@ src/blocked.c:111 @ void updateStatsOnUnblock(client *c, long blocked_us, long reply_us, int had_err
    c->lastcmd->microseconds += total_cmd_duration;
    c->lastcmd->calls++;
    server.stat_numcommands++;
    c->duration = 0; <-- here
    if (had_errors)
        c->lastcmd->failed_calls++;
    if (server.latency_tracking_enabled)

@oranagra
Copy link
Member

i think i agree, with Ran, it'll look better if we zero it right after using it (explicitly in call() and in updateStatsOnUnblock()).
i think we can also keep the one in resetClient just for safety.

@madolson
Copy link
Contributor Author

i think we can also keep the one in resetClient just for safety.

I moved it to a serverAssert(), maybe it will catch a place where we miss. Missed one case related to WAIT where we weren't recording the wait time and one extra false positive related to module client freeing.

@ranshid
Copy link
Contributor

ranshid commented Mar 27, 2023

i think we can also keep the one in resetClient just for safety.

I moved it to a serverAssert(), maybe it will catch a place where we miss. Missed one case related to WAIT where we weren't recording the wait time and one extra false positive related to module client freeing.

classic case to place a "debugAssert" which we do not have :)
anyway I think that is fine, but I wonder if the code wouldn't have been cleaner if we just had the original fix+keep the zeroing inside resetClient?

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i'd prefer a simpler approach of clearing it in obvious places:

  • creteClient, resetClient
  • and after using the value (i.e. updateStatsOnUnblock and call)

not sure we need the assertion...

@oranagra oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Mar 27, 2023
@madolson
Copy link
Contributor Author

i think i'd prefer a simpler approach of clearing it in obvious places:

The problem is that the "obvious places" aren't really obvious anymore.

not sure we need the assertion...

I would have agreed if it didn't find another issue with wait. It wasn't critical, but still

@oranagra
Copy link
Member

You mean the condition that ended up in call?

Anyway. OK. Go ahead and merge it..


/* Deallocate structures used to block on blocking ops. */
/* If there is any in-flight command, we don't don't record their duration. */
c->duration = 0;
Copy link
Contributor Author

@madolson madolson Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started running into issues related to module APIs that had resetClient() being called when a client with a pending blocking RM_Call() was getting freed, which triggered the resetClient() call path (which crashed since it had an unrecorded command). There might be a more elegant way to fix it, but I ended up just zero'ing the duration when the client was being freed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't understand. are we calling resetClient from within / after freeClient?
same thing for moduleReleaseTempClient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, resetClient is being called from within freeClient, because freeClient calls unbockClient which calls resetClient here: https://github.com/redis/redis/blob/unstable/src/blocked.c#L211.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.
but now that we fixed the problem with WAIT, can we (at least in theory), replace that assert with a =0, and drop many other changes?
i'm ok with to keep the assert, just curious since i'm not looking at this PR with much care..

Copy link
Contributor

@ranshid ranshid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thank you for fixing this!

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please mention the change about unblockClientWaitingReplicas in the top comment.

@madolson madolson changed the title Fixed tracking of command duration for multi/eval/module Fixed tracking of command duration for multi/eval/module/wait Mar 29, 2023
@madolson madolson merged commit 971b177 into redis:unstable Mar 30, 2023
@oranagra oranagra mentioned this pull request May 15, 2023
sundb pushed a commit to sundb/redis that referenced this pull request Jul 28, 2025
redis#526)

In redis#11012, we changed the way command durations were computed to handle
the same command being executed multiple times. In redis#11970, we added an
assert if the duration is not properly reset, potentially indicating
that a call to report statistics was missed.

I found an edge case where this happens - easily reproduced by blocking
a client on `XGROUPREAD` and migrating the stream's slot. This causes
the engine to process the `XGROUPREAD` command twice:

1. First time, we are blocked on the stream, so we wait for unblock to
come back to it a second time. In most cases, when we come back to
process the command second time after unblock, we process the command
normally, which includes recording the duration and then resetting it.
2. After unblocking we come back to process the command, and this is
where we hit the edge case - at this point, we had already migrated the
slot to another node, so we return a `MOVED` response. But when we do
that, we don’t reset the duration field.

Fix: also reset the duration when returning a `MOVED` response. I think
this is right, because the client should redirect the command to the
right node, which in turn will calculate the execution duration.

Also wrote a test which reproduces this, it fails without the fix and
passes with it.

---------

Signed-off-by: Nitai Caro <[email protected]>
Co-authored-by: Nitai Caro <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes indication that this issue needs to be mentioned in the release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants