Skip to content

[fix][client] Fix stale Healthy state in SameAuthParamsLookupAutoClusterFailover causing flaky test#25388

Merged
merlimat merged 1 commit intoapache:masterfrom
lhotari:lh-fix-SameAuthParamsLookupAutoClusterFailoverTest
Mar 23, 2026
Merged

[fix][client] Fix stale Healthy state in SameAuthParamsLookupAutoClusterFailover causing flaky test#25388
merlimat merged 1 commit intoapache:masterfrom
lhotari:lh-fix-SameAuthParamsLookupAutoClusterFailoverTest

Conversation

@lhotari
Copy link
Copy Markdown
Member

@lhotari lhotari commented Mar 23, 2026

Motivation

The SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover integration test is flaky. The root cause is a bug in SameAuthParamsLookupAutoClusterFailover.findFailoverTo(): when it probes intermediate services and finds them unavailable, it does not update their state. They remain stale Healthy, which causes a spurious recovery bounce on the next check cycle.

For example, when failing over from index 0 to index 2 (skipping index 1 which is unavailable), index 1's state remains Healthy. On the next check, firstHealthyPulsarService() sees the stale Healthy state and immediately "recovers" to index 1 — which is actually a broken service. This causes unnecessary bouncing (0→2→1→2 instead of 0→2). Combined with 3-second probe timeouts on dead services, the extra bounce adds ~30 seconds, pushing the total failover time past the test's 60-second awaitility timeout.

Modifications

  • In findFailoverTo(), when a probe fails for a service, mark it as Failed immediately. This prevents stale Healthy states from causing spurious recovery bounces after failover.
  • Added unit tests in pulsar-client to verify the fix and prevent regression.

Note on failoverThreshold: The fix marks skipped services directly as Failed, bypassing the gradual Healthy → PreFail → Failed transition that checkPulsarServices uses with failoverThreshold. This is intentional: findFailoverTo() is only called after the current service has already gone through the full threshold-based detection and reached Failed. The intermediate services being skipped were never monitored by checkPulsarServices (which only checks i <= currentPulsarServiceIndex), so there is no threshold to respect — we just probed them and they are down. The recoverThreshold is still fully respected when these services come back up, via the normal Failed → PreRecover → Healthy path in checkPulsarServices.

Verifying this change

This change added tests and can be verified as follows:

  • testFindFailoverToMarksSkippedServicesAsFailed — directly verifies that findFailoverTo marks skipped unavailable services as Failed (the core bug). Verified this test fails without the fix.
  • testNoSpuriousRecoveryBounceAfterFailover — verifies no spurious recovery bounce occurs on the next check cycle after failover.
  • testRecoveryAfterFindFailoverToMarksServiceFailed — verifies that recovery still works correctly after a service was marked Failed by findFailoverTo, once it becomes available again.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc-not-needed

Matching PR in forked repository

PR in forked repository: <!-- ENTER URL HERE after creating forked PR -->

🤖 Generated with Claude Code

@github-actions github-actions Bot added the doc-not-needed Your PR changes do not impact docs label Mar 23, 2026
…terFailover causing spurious recovery bounce

When findFailoverTo() skipped over an unavailable service, it left that
service's state as stale Healthy. This caused a spurious recovery bounce
(e.g. 0→2→1→2 instead of 0→2) on the next check cycle, as
firstHealthyPulsarService() would immediately "recover" to the stale
service. Combined with 3-second probe timeouts on dead services, this
could push total failover time past the integration test's awaitility
timeout, making SameAuthParamsLookupAutoClusterFailoverTest flaky.

Fix: mark skipped services as Failed in findFailoverTo() when probing
finds them unavailable.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@lhotari lhotari force-pushed the lh-fix-SameAuthParamsLookupAutoClusterFailoverTest branch from 47fd724 to 189d32b Compare March 23, 2026 11:55
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.76%. Comparing base (ba851d3) to head (189d32b).
⚠️ Report is 123 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #25388      +/-   ##
============================================
+ Coverage     72.13%   72.76%   +0.63%     
- Complexity    34023    34289     +266     
============================================
  Files          1954     1954              
  Lines        154857   154859       +2     
  Branches      17739    17740       +1     
============================================
+ Hits         111699   112689     +990     
+ Misses        34120    33110    -1010     
- Partials       9038     9060      +22     
Flag Coverage Δ
inttests 25.77% <0.00%> (?)
systests 22.50% <0.00%> (+0.06%) ⬆️
unittests 73.74% <100.00%> (+0.59%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
.../impl/SameAuthParamsLookupAutoClusterFailover.java 68.04% <100.00%> (+1.58%) ⬆️

... and 154 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@merlimat merlimat merged commit 3bc834f into apache:master Mar 23, 2026
56 checks passed
@lhotari lhotari added this to the 4.2.0 milestone Mar 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-not-needed Your PR changes do not impact docs ready-to-test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants