Skip to content

Fix expectations for manual failover tests#2453

Merged
enjoy-binbin merged 2 commits intovalkey-io:unstablefrom
amanosme:amanosme-unstable
Aug 8, 2025
Merged

Fix expectations for manual failover tests#2453
enjoy-binbin merged 2 commits intovalkey-io:unstablefrom
amanosme:amanosme-unstable

Conversation

@amanosme
Copy link
Contributor

@amanosme amanosme commented Aug 7, 2025

Test Instance #5 is still a slave after some time (no failover) is supposed to verify that command CLUSTER FAILOVER will not promote a replica without quorum from the primary; later in the file (Instance 5 is a master after some time), we verify that CLUSTER FAILOVER FORCE does promote a replica under the same conditions.

There's a couple issues with the tests:

  1. Instance #5 is still a slave after some time (no failover) should verify that instance 5 is a replica (i.e. that there's no failover), but we call assert {[s -5 role] eq {master}}.
  2. The reason why the above assert works is that we previously send DEBUG SLEEP 10 to the primary, which pauses the primary for longer than the configured 3 seconds for cluster-node-timeout. The primary is marked as failed from the perspective of the rest of the cluster, so quorum can be established and instance 5 is promoted as primary.

This commit fixes the two by shortening the sleep to less than 3 seconds, and then asserting the role is still replica. Test Instance #5 is a master after some time is updated to sleep for a shorter duration to ensure that FAILOVER FORCE succeeds under the exact same conditions.

Testing

./runtest --single unit/cluster/manual-failover --loop --fastfail

@amanosme amanosme force-pushed the amanosme-unstable branch from 51adff5 to cf42ba8 Compare August 7, 2025 21:49
@codecov
Copy link

codecov bot commented Aug 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.56%. Comparing base (7bbf523) to head (f34af45).
⚠️ Report is 1 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2453      +/-   ##
============================================
+ Coverage     71.51%   71.56%   +0.04%     
============================================
  Files           125      125              
  Lines         69214    69214              
============================================
+ Hits          49499    49530      +31     
+ Misses        19715    19684      -31     

see 9 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@amanosme amanosme force-pushed the amanosme-unstable branch from cf42ba8 to 5dd1833 Compare August 7, 2025 22:55
@amanosme amanosme force-pushed the amanosme-unstable branch from 5dd1833 to e2f7960 Compare August 7, 2025 23:21
@amanosme amanosme marked this pull request as draft August 7, 2025 23:22
@amanosme amanosme marked this pull request as ready for review August 7, 2025 23:39
Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seem corret, did you encounter this failure somewhere? Or did you just doing a code review and find this?

@amanosme
Copy link
Contributor Author

amanosme commented Aug 8, 2025

@enjoy-binbin I caught this while debugging the test under valgrind. I think valgrind slows the tests down enough to the point where the primary would be back up when the CLUSTER FAILOVER command was issued (and the check for master role failed)

@enjoy-binbin enjoy-binbin merged commit de7bb61 into valkey-io:unstable Aug 8, 2025
52 checks passed
@amanosme amanosme deleted the amanosme-unstable branch August 8, 2025 08:11
allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request Aug 19, 2025
Test `Instance valkey-io#5 is still a slave after some time (no failover)` is
supposed to verify that command `CLUSTER FAILOVER` will not promote a
replica without quorum from the primary; later in the file (`Instance 5
is a master after some time`), we verify that `CLUSTER FAILOVER FORCE`
does promote a replica under the same conditions.

There's a couple issues with the tests:

1. `Instance valkey-io#5 is still a slave after some time (no failover)` should
verify that instance 5 is a replica (i.e. that there's no failover), but
we call `assert {[s -5 role] eq {master}}`.
2. The reason why the above assert works is that we previously send
`DEBUG SLEEP 10` to the primary, which pauses the primary for longer
than the configured 3 seconds for`cluster-node-timeout`.
The primary is marked as failed from the perspective of the rest of the
cluster, so quorum can be established and instance 5 is promoted as
primary.

This commit fixes the two by shortening the sleep to less than 3
seconds, and then asserting the role is still replica. Test `Instance valkey-io#5
is a master after some time` is updated to sleep for a shorter duration
to ensure that `FAILOVER FORCE` succeeds under the exact same
conditions.

### Testing
`./runtest --single unit/cluster/manual-failover --loop --fastfail`

Signed-off-by: Tyler Amano-Smerling <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants