Fix race in sentinel manual failover test by enjoy-binbin · Pull Request #11900 · redis/redis

enjoy-binbin · 2023-03-10T06:48:46Z

In #9408, we added some SENTINEL DEBUG settings to reduce the
default timeouts and let tests run faster. The change in
05-manual.tcl can cause a race in which SENTINEL FAILOVER
responds with NOGOODSLAVE:

Manual failover works: FAILED: Expected NOGOODSLAVE No suitable replica to promote eq "OK" (context: type eval line 6 cmd {assert {$reply eq "OK"}} proc ::test)
(Jumping to next unit after error)
FAILED: caught an error in the test
assertion:Expected NOGOODSLAVE No suitable replica to promote eq "OK" (context: type eval line 6 cmd {assert {$reply eq "OK"}} proc ::test)

The reason is that the info-period value was reduced in #9408
(the default value is 10000), and the manual failover is then
performed immediately. At that point INFO may not yet have been
exchanged between the sentinel and the replicas, so the sentinel
skips all the replicas in sentinelSelectSlave (because each
replica's info_refresh has not been updated, see the code snippet
below), returns NOGOODSLAVE, and the test fails.

Code snippet from sentinelSelectSlave:

while((de = dictNext(di)) != NULL) {
    sentinelRedisInstance *slave = dictGetVal(de);
    mstime_t info_validity_time;
    if (master->flags & SRI_S_DOWN)
        info_validity_time = sentinel_ping_period*5;
    else
        info_validity_time = sentinel_info_period*3;
    if (mstime() - slave->info_refresh > info_validity_time) continue;
}
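
To make the timing concrete, the freshness check from the snippet above can be isolated into a small standalone function. This is only a sketch: replica_info_is_fresh and its parameters are hypothetical names, not part of sentinel.c, and the periods used in the usage note are illustrative values apart from the 10000 ms default mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t mstime_t;

/* Hypothetical helper mirroring the check in the snippet above:
 * a replica stays a failover candidate only if its last INFO
 * refresh is no older than the validity window. */
static bool replica_info_is_fresh(mstime_t now, mstime_t info_refresh,
                                  bool master_sdown,
                                  mstime_t ping_period, mstime_t info_period)
{
    /* Same window selection as the snippet: ping_period*5 when the
     * master is flagged SRI_S_DOWN, info_period*3 otherwise. */
    mstime_t info_validity_time =
        master_sdown ? ping_period * 5 : info_period * 3;

    /* The snippet skips the replica when the age exceeds the window. */
    return (now - info_refresh) <= info_validity_time;
}
```

With an info period of 10000 the window is 30000 ms, so an INFO refresh that is 20000 ms old still counts as fresh. With an assumed reduced period of 1000 the window shrinks to 3000 ms, and the same replica is skipped until a new INFO reply arrives.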

By adding a wait_for_condition, we give the sentinel a chance
to update the info_refresh of the replicas.
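
The retry pattern behind wait_for_condition can be sketched as below. The real helper is part of the Tcl test suite and the actual change in this PR is not shown on this page, so this C version is only an illustration under assumptions: the function signature, the condition_fn callback type, and ready_after_countdown (a stand-in for "the failover command finally replied OK") are all hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <time.h>

/* Hypothetical callback type: returns true once the condition holds. */
typedef bool (*condition_fn)(void *ctx);

/* Poll cond up to max_tries times, sleeping delay_ms between attempts.
 * Returns true as soon as cond succeeds, false if every attempt fails. */
static bool wait_for_condition(int max_tries, long delay_ms,
                               condition_fn cond, void *ctx)
{
    struct timespec delay = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };
    for (int i = 0; i < max_tries; i++) {
        if (cond(ctx)) return true;
        nanosleep(&delay, NULL);
    }
    return false;
}

/* Hypothetical stand-in for a command that fails a few times before it
 * succeeds: each failed call decrements the counter, and the call made
 * once the counter is zero succeeds. */
static bool ready_after_countdown(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return false;
    }
    return true;
}
```

Retrying instead of asserting on the first reply means an early NOGOODSLAVE is treated as "not ready yet" rather than as a test failure, while a bounded number of attempts still lets a real failure surface.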

enjoy-binbin requested a review from moticless March 11, 2023 08:30

moticless approved these changes Mar 12, 2023

enjoy-binbin requested a review from oranagra March 12, 2023 09:57

oranagra approved these changes Mar 12, 2023

oranagra merged commit 4e7eb16 into redis:unstable Mar 12, 2023

enjoy-binbin deleted the fix_sentinel_manual_failover_race branch March 12, 2023 12:10
