Skip to content

chore: fix TestDoUntilQuorumWithoutSuccessfulContextCancellation_CancelsEntireZoneImmediatelyOnSingleFailure flakyness#948

Merged
pracucci merged 1 commit intomainfrom
fix-TestDoUntilQuorumWithoutSuccessfulContextCancellation_CancelsEntireZoneImmediatelyOnSingleFailure
Apr 2, 2026
Merged

chore: fix TestDoUntilQuorumWithoutSuccessfulContextCancellation_CancelsEntireZoneImmediatelyOnSingleFailure flakyness#948
pracucci merged 1 commit intomainfrom
fix-TestDoUntilQuorumWithoutSuccessfulContextCancellation_CancelsEntireZoneImmediatelyOnSingleFailure

Conversation

@pracucci
Copy link
Copy Markdown
Contributor

@pracucci pracucci commented Apr 2, 2026

What this PR does:

I've seen TestDoUntilQuorumWithoutSuccessfulContextCancellation_CancelsEntireZoneImmediatelyOnSingleFailure hanging in CI until the 30m tests timeout expires:

Details

panic: test timed out after 30m0s
	running tests:
		TestDoUntilQuorumWithoutSuccessfulContextCancellation_CancelsEntireZoneImmediatelyOnSingleFailure (26m56s)

goroutine 1582 [running]:
testing.(*M).startAlarm.func1()
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:2802 +0x605
created by time.goFunc
	/opt/hostedtoolcache/go/1.26.1/x64/src/time/sleep.go:215 +0x45

goroutine 1 [chan receive, 26 minutes]:
testing.(*T).Run(0xc000312248, {0x1dd60a1, 0x61}, 0x1def0b0)
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:2109 +0xb3e
testing.runTests.func1(0xc000312248)
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:2585 +0x86
testing.tRunner(0xc000312248, 0xc00050dad0)
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:2036 +0x21d
testing.runTests({0x1d8fa84, 0x18}, {0x1d97450, 0x1d}, 0xc0001a6528, {0x2aaad40, 0xf2, 0xf2}, {0xc26b6ea34f6f5834, 0x1a31a2c19f5, ...})
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:2583 +0x9ea
testing.(*M).Run(0xc000128140)
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:2443 +0xf4c
main.main()
	_testmain.go:572 +0x165

goroutine 398 [chan receive, 29 minutes]:
testing.(*T).Parallel(0xc0000e2fc8)
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:1803 +0x50c
github.com/grafana/dskit/ring.TestDeletePersistedTokensOnShutdown(0xc0000e2fc8)
	/home/runner/work/dskit/dskit/ring/lifecycler_test.go:1528 +0x3f
testing.tRunner(0xc0000e2fc8, 0x1def060)
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:2036 +0x21d
created by testing.(*T).Run in goroutine 1
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:2101 +0xb13

goroutine 366 [chan receive, 29 minutes]:
testing.(*T).Parallel(0xc000312488)
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:1803 +0x50c
github.com/grafana/dskit/ring.TestWaitBeforeJoining(0xc000312488)
	/home/runner/work/dskit/dskit/ring/lifecycler_test.go:1691 +0x3f
testing.tRunner(0xc000312488, 0x1def5e0)
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:2036 +0x21d
created by testing.(*T).Run in goroutine 1
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:2101 +0xb13

goroutine 493 [chan receive, 29 minutes]:
testing.(*T).Parallel(0xc0001fa6c8)
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:1803 +0x50c
github.com/grafana/dskit/ring.TestInstanceDesc_IsHealthy_ForIngesterOperations(0xc0001fa6c8)
	/home/runner/work/dskit/dskit/ring/model_test.go:11 +0x3f
testing.tRunner(0xc0001fa6c8, 0x1def180)
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:2036 +0x21d
created by testing.(*T).Run in goroutine 1
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:2101 +0xb13

goroutine 1563 [select, 26 minutes]:
github.com/grafana/dskit/ring.DoUntilQuorumWithoutSuccessfulContextCancellation[...]({0x1e12880, 0x2ad5a80}, {{0xc00017c008, 0x6, 0x6}, 0x0, 0x1, 0x0}, {0x1, 0x0, ...}, ...)
	/home/runner/work/dskit/dskit/ring/replication_set.go:344 +0x12e5
github.com/grafana/dskit/ring.TestDoUntilQuorumWithoutSuccessfulContextCancellation_CancelsEntireZoneImmediatelyOnSingleFailure(0xc0001fbb08)
	/home/runner/work/dskit/dskit/ring/replication_set_test.go:745 +0x75e
testing.tRunner(0xc0001fbb08, 0x1def0b0)
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:2036 +0x21d
created by testing.(*T).Run in goroutine 1
	/opt/hostedtoolcache/go/1.26.1/x64/src/testing/testing.go:2101 +0xb13
FAIL	github.com/grafana/dskit/ring	1800.056s

Why?

There were two bugs in the test, both related to the f() callback running inside goroutines spawned by DoUntilQuorumWithoutSuccessfulContextCancellation:

  1. Race condition causing the hang (primary): With MinimizeRequests: true, the test relies on the failing zone's replica-2 entering f() and closing a synchronization channel when its context is cancelled. However, there was no guarantee that replica-2's goroutine would reach awaitStart before replica-1's error was processed by the main loop. When replica-1's error was processed first, cancelContextFor cancelled all zone contexts — including replica-2's. If replica-2's goroutine then reached awaitStart with both ctx.Done() and the zone release channel ready, Go's select could randomly pick ctx.Done(), causing the goroutine to return early without ever calling f(). The synchronization channel was never closed, and all other goroutines blocked waiting for it.
  2. require.FailNow in non-test goroutines (secondary): The timeout branches called require.FailNow, which invokes runtime.Goexit(). When called from a goroutine spawned by DoUntilQuorumWithoutSuccessfulContextCancellation (not the test goroutine), this terminated the goroutine without sending its result to the internal results channel. The main loop then hung forever waiting for results that would never arrive.

The fix:

  • Added a sync.WaitGroup so that replica-1 waits for replica-2 to enter f() before returning its error. This guarantees both goroutines have passed awaitStart before the error triggers context cancellation, eliminating the race.
  • Replaced require.FailNow calls with returning errors from f(). These errors are still caught by the test: the replica-2 timeout is caught by require.True(t, failingZoneSawCancelledContext), and the default case timeout causes a second zone failure which is caught by require.NoError(t, err).

Which issue(s) this PR fixes:

N/A

Checklist

  • Tests updated

…elsEntireZoneImmediatelyOnSingleFailure flakyness

Signed-off-by: Marco Pracucci <[email protected]>
@pracucci pracucci merged commit b8c86ad into main Apr 2, 2026
14 checks passed
@pracucci pracucci deleted the fix-TestDoUntilQuorumWithoutSuccessfulContextCancellation_CancelsEntireZoneImmediatelyOnSingleFailure branch April 2, 2026 10:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants