Fix race in sentinel manual failover test #11900
Merged
oranagra merged 1 commit into redis:unstable on Mar 12, 2023
In redis#9408, we added some SENTINEL DEBUG settings to reduce the default timeouts and allow tests to execute faster. The change in 05-manual.tcl may cause a race in which SENTINEL FAILOVER responds with NOGOODSLAVE:

```
Manual failover works: FAILED: Expected NOGOODSLAVE No suitable replica to promote eq "OK" (context: type eval line 6 cmd {assert {$reply eq "OK"}} proc ::test)
(Jumping to next unit after error)
FAILED: caught an error in the test assertion:Expected NOGOODSLAVE No suitable replica to promote eq "OK" (context: type eval line 6 cmd {assert {$reply eq "OK"}} proc ::test)
```

The reason is that the info-period value was reduced in redis#9408 (the default value is 10000) and the manual failover was performed immediately afterwards. At that point INFO may not yet have been exchanged between the sentinel and the replicas under the new period, so the sentinel skips every replica in sentinelSelectSlave (because each replica's info_refresh has not been updated, see the code snippet below), returns NOGOODSLAVE, and breaks the test.

Code snippet from sentinelSelectSlave:

```
while((de = dictNext(di)) != NULL) {
    sentinelRedisInstance *slave = dictGetVal(de);
    mstime_t info_validity_time;

    if (master->flags & SRI_S_DOWN)
        info_validity_time = sentinel_ping_period*5;
    else
        info_validity_time = sentinel_info_period*3;
    if (mstime() - slave->info_refresh > info_validity_time) continue;
}
```

By adding a wait_for_condition, we give the sentinel the opportunity to update the info_refresh of the replicas before the failover result is asserted.
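To make the timing argument concrete, here is a minimal C sketch of the freshness check, written as a pure function plus a simulated polling loop. It is not the code from this PR or from sentinel.c: the names `replica_info_is_fresh` and `wait_until_fresh`, the simulated clock, and the reduced info period of 1000 ms used in the usage note are all assumptions made for illustration. Only the `*5` / `*3` validity rule and the 10000 ms default come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t mstime_t; /* milliseconds, mirroring the type name in the snippet */

/* Hypothetical helper: the validity rule from the snippet, pulled out as a
 * pure function so it can be evaluated without a running sentinel.
 * Returns true when the replica's last INFO is recent enough to be selected. */
bool replica_info_is_fresh(mstime_t now, mstime_t info_refresh,
                           bool master_sdown,
                           mstime_t ping_period, mstime_t info_period)
{
    mstime_t info_validity_time = master_sdown ? ping_period * 5
                                               : info_period * 3;
    /* The snippet skips the replica when the age is strictly greater. */
    return (now - info_refresh) <= info_validity_time;
}

/* Hypothetical model of a bounded retry loop (the role wait_for_condition
 * plays in the test): poll up to max_tries times, advancing a simulated
 * clock by delay between attempts. A new INFO reply is assumed to arrive at
 * time info_arrives. Returns the attempt number that succeeded, or -1 if
 * every attempt saw stale info. */
int wait_until_fresh(mstime_t start, mstime_t last_refresh,
                     mstime_t info_arrives, int max_tries,
                     mstime_t delay, mstime_t info_period)
{
    assert(max_tries > 0 && delay > 0);
    mstime_t now = start;
    mstime_t info_refresh = last_refresh;

    for (int attempt = 1; attempt <= max_tries; attempt++) {
        if (now >= info_arrives) info_refresh = info_arrives;
        /* Master is treated as up, so ping_period is unused here. */
        if (replica_info_is_fresh(now, info_refresh, false, 0, info_period))
            return attempt;
        now += delay;
    }
    return -1;
}
```

With these assumed numbers, a replica whose last INFO is 5000 ms old is fresh under the default period (window 30000 ms) but stale under a period of 1000 ms (window 3000 ms), which is the situation that yields NOGOODSLAVE. Calling `wait_until_fresh(5000, 0, 5100, 10, 50, 1000)` succeeds on the third attempt, once the simulated INFO reply has arrived; a single immediate check would have failed.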
moticless approved these changes on Mar 12, 2023
oranagra approved these changes on Mar 12, 2023
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request on Jul 31, 2023