Fixed tracking of command duration for multi/eval/module/wait#11970
Fixed tracking of command duration for multi/eval/module/wait#11970madolson merged 8 commits intoredis:unstablefrom
Conversation
|
good catch - I missed it (after struggling to adjust lua to start calling resetClient) |
|
maybe we can also zero the duration when we are updating stats? |
|
i think i agree, with Ran, it'll look better if we zero it right after using it (explicitly in |
I moved it to a serverAssert(), maybe it will catch a place where we miss. Missed one case related to |
classic case to place a "debugAssert" which we do not have :) |
oranagra
left a comment
There was a problem hiding this comment.
i think i'd prefer a simpler approach of clearing it in obvious places:
- creteClient, resetClient
- and after using the value (i.e.
updateStatsOnUnblockandcall)
not sure we need the assertion...
The problem is that the "obvious places" aren't really obvious anymore.
I would have agreed if it didn't find another issue with wait. It wasn't critical, but still |
|
You mean the condition that ended up in Anyway. OK. Go ahead and merge it.. |
|
|
||
| /* Deallocate structures used to block on blocking ops. */ | ||
| /* If there is any in-flight command, we don't don't record their duration. */ | ||
| c->duration = 0; |
There was a problem hiding this comment.
I started running into issues related to module APIs that had resetClient() being called when a client with a pending blocking RM_Call() was getting freed, which triggered the resetClient() call path (which crashed since it had an unrecorded command). There might be a more elegant way to fix it, but I ended up just zero'ing the duration when the client was being freed.
There was a problem hiding this comment.
i don't understand. are we calling resetClient from within / after freeClient?
same thing for moduleReleaseTempClient?
There was a problem hiding this comment.
Yes, resetClient is being called from within freeClient, because freeClient calls unbockClient which calls resetClient here: https://github.com/redis/redis/blob/unstable/src/blocked.c#L211.
There was a problem hiding this comment.
ok.
but now that we fixed the problem with WAIT, can we (at least in theory), replace that assert with a =0, and drop many other changes?
i'm ok with to keep the assert, just curious since i'm not looking at this PR with much care..
ranshid
left a comment
There was a problem hiding this comment.
LGTM - thank you for fixing this!
oranagra
left a comment
There was a problem hiding this comment.
please mention the change about unblockClientWaitingReplicas in the top comment.
redis#526) In redis#11012, we changed the way command durations were computed to handle the same command being executed multiple times. In redis#11970, we added an assert if the duration is not properly reset, potentially indicating that a call to report statistics was missed. I found an edge case where this happens - easily reproduced by blocking a client on `XGROUPREAD` and migrating the stream's slot. This causes the engine to process the `XGROUPREAD` command twice: 1. First time, we are blocked on the stream, so we wait for unblock to come back to it a second time. In most cases, when we come back to process the command second time after unblock, we process the command normally, which includes recording the duration and then resetting it. 2. After unblocking we come back to process the command, and this is where we hit the edge case - at this point, we had already migrated the slot to another node, so we return a `MOVED` response. But when we do that, we don’t reset the duration field. Fix: also reset the duration when returning a `MOVED` response. I think this is right, because the client should redirect the command to the right node, which in turn will calculate the execution duration. Also wrote a test which reproduces this, it fails without the fix and passes with it. --------- Signed-off-by: Nitai Caro <[email protected]> Co-authored-by: Nitai Caro <[email protected]>
In #11012, we changed the way command durations were computed to handle the same command being executed multiple times. This commit fixes some misses from that commit.
This commit also adds an assert if the duration is not properly reset, potentially indicating that a report to call statistics was missed. The assert potentially be removed in the future, as it's mainly intended to detect misses in tests.
Before:
After: