fix: race in gatewaapi runner#8037

Merged
zirain merged 3 commits into envoyproxy:main from zirain:runner/data-race
Jan 26, 2026

Conversation

@zirain
Member

@zirain zirain commented Jan 24, 2026

fixes: #8035

@zirain zirain requested a review from a team as a code owner January 24, 2026 12:48
@zirain zirain changed the title fix: fix race in gatewaapi runner fix: race in gatewaapi runner Jan 24, 2026
@netlify

netlify bot commented Jan 24, 2026

Deploy Preview for cerulean-figolla-1f9435 canceled.

Name Link
🔨 Latest commit 2d84080
🔍 Latest deploy log https://app.netlify.com/projects/cerulean-figolla-1f9435/deploys/6976fe1b35703f00085cca3b

@zirain
Member Author

zirain commented Jan 24, 2026

@codecov

codecov bot commented Jan 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.75%. Comparing base (424d039) to head (2d84080).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8037      +/-   ##
==========================================
+ Coverage   73.70%   73.75%   +0.04%     
==========================================
  Files         237      237              
  Lines       35703    35709       +6     
==========================================
+ Hits        26316    26338      +22     
+ Misses       7529     7515      -14     
+ Partials     1858     1856       -2     

r.Logger = r.Logger.WithName(r.Name()).WithValues("runner", r.Name())

go r.startWasmCache(ctx)
r.done.Add(2)
Contributor

can you add a comment here, outlining why 2 is needed here

arkodg
arkodg previously approved these changes Jan 24, 2026
@arkodg arkodg requested review from a team January 24, 2026 21:50
@zirain zirain requested a review from arkodg January 25, 2026 04:06
@zirain zirain force-pushed the runner/data-race branch 3 times, most recently from 0314fcc to 9535661 Compare January 25, 2026 06:36
keyCache *KeyCache

// Goroutine synchronization
done sync.WaitGroup
Member

nit: name it as wg? done.Done() was confusing at first glance.

go r.startWasmCache(ctx)
// Add 2 to the WaitGroup: one for the WASM cache server goroutine and one for the
// subscribeAndTranslate goroutine that handles resource translation
r.done.Add(2)
Member

@rudrakhp rudrakhp Jan 25, 2026

Would we want to add 2 right away or Add(1) just before calling each routine? We wouldn't want to wait on routines that might not have started for some reason. Also any new routine that we might add here will follow the same pattern.

	// Increment by 1 specifically for the WASM cache
	r.done.Add(1)
	go func() {
		defer r.done.Done()
		r.startWasmCache(ctx)
	}()

	// If Subscribe crashes or returns an error, the WaitGroup 
	// won't be stuck waiting for a goroutine that never started.
	c := r.ProviderResources.GatewayAPIResources.Subscribe(ctx)

	// Increment by 1 specifically for the translation handler
	r.done.Add(1)
	go func() {
		defer r.done.Done()
		r.subscribeAndTranslate(c)
	}()

Member Author

I have no strong opinion on this.

Contributor

+1 to keeping the Add close to the go func()
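
For readers following the thread, the pattern being agreed on here can be sketched as a standalone program. This is only an illustration under stated assumptions: the `runner` type, its `record` helper, and `runAndStop` are hypothetical stand-ins, not the actual gateway-api runner code. Each `Add(1)` sits directly before the goroutine it counts, so the counter only ever reflects goroutines that were really started.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// runner is a hypothetical stand-in for the real runner; only the
// WaitGroup bookkeeping is meant to mirror the discussion.
type runner struct {
	wg      sync.WaitGroup
	mu      sync.Mutex
	started []string
}

// record notes that a goroutine ran; the mutex guards the slice.
func (r *runner) record(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.started = append(r.started, name)
}

// start launches two background goroutines, calling Add(1) right
// before each one instead of Add(2) up front.
func (r *runner) start(ctx context.Context) {
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		r.record("cache")
		<-ctx.Done() // run until shutdown is requested
	}()

	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		r.record("translate")
		<-ctx.Done()
	}()
}

// runAndStop starts the runner, requests shutdown, and waits for both
// goroutines to exit before reading the shared state.
func runAndStop() int {
	ctx, cancel := context.WithCancel(context.Background())
	r := &runner{}
	r.start(ctx)
	cancel()
	r.wg.Wait()
	return len(r.started)
}

func main() {
	fmt.Println(runAndStop())
}
```

Because `Wait` returns only after both `Done` calls, reading `r.started` afterwards does not race with the goroutines, which is the property the fix is after.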

Member Author

we could simplify it with r.done.Go()
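
As background for this suggestion: `sync.WaitGroup` gained a `Go` method in Go 1.25 that bundles `Add(1)`, the `go` statement, and the deferred `Done()` into one call. The sketch below shows roughly what that method does, using a hypothetical `group` wrapper so that it also compiles on older toolchains. The `group` type and `sumConcurrently` are illustrative names only and are not taken from the PR.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// group is a hypothetical wrapper that approximates the behavior of
// sync.WaitGroup.Go (available in the standard library from Go 1.25).
type group struct {
	wg sync.WaitGroup
}

// Go increments the counter, runs f in a new goroutine, and marks the
// goroutine as done when f returns.
func (g *group) Go(f func()) {
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		f()
	}()
}

// Wait blocks until every function started through Go has returned.
func (g *group) Wait() { g.wg.Wait() }

// sumConcurrently adds 1..n using one goroutine per term.
func sumConcurrently(n int) int64 {
	var g group
	var total atomic.Int64
	for i := 1; i <= n; i++ {
		v := int64(i) // copy so the closure does not depend on loop-variable semantics
		g.Go(func() { total.Add(v) })
	}
	g.Wait()
	return total.Load()
}

func main() {
	fmt.Println(sumConcurrently(4))
}
```

With a Go 1.25 or newer toolchain, the wrapper is unnecessary and the call sites would read `wg.Go(func() { ... })` directly on a `sync.WaitGroup`, which removes the chance of an `Add` and a `go` statement drifting apart.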

// t.Output() while goroutines are still active.
//
// Run with: go test -race -run TestRunnerGoroutineRace -count=100 ./internal/cmd/
func TestRunnerGoroutineRace(t *testing.T) {
Member

I see another test in runner_race_test, do we need this as well? Which one would we need to detect a race if someone spawns another routine without WG?

Member Author

This is in case we have another, unknown race. TBH it's hard to reproduce here.

@zirain
Member Author

zirain commented Jan 26, 2026

/retest

@zirain zirain requested a review from rudrakhp January 26, 2026 03:31
Signed-off-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>
@zirain zirain merged commit 1f9c321 into envoyproxy:main Jan 26, 2026
57 of 59 checks passed
@zirain zirain deleted the runner/data-race branch January 26, 2026 07:31
zirain added a commit to zirain/gateway that referenced this pull request Jan 26, 2026
* add testcase

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* simply

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>
rudrakhp pushed a commit to rudrakhp/gateway that referenced this pull request Jan 26, 2026
* add testcase

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* simply

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>
rudrakhp added a commit that referenced this pull request Jan 26, 2026
* fix: extproc is discarded with failOpen is enabled for wasm (#7956)

* fix: extproc is discarded with failOpen is enabled for wasm

Signed-off-by: Huabing Zhao <[email protected]>

* add test

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* polish code

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* add test

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

---------

Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: sanitize control plane config dump (#7901)

* mask secrets

Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

---------

Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: server run race (#7964)

* add test

Signed-off-by: zirain <[email protected]>

* fix race

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* use Semaphore instead of WaitGroup

Signed-off-by: zirain <[email protected]>

* comments

Signed-off-by: zirain <[email protected]>

* lint

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* callback

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* run hook sequentially

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* rename to cfgMux

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: wrong cluster type with mixed FQDN backend and service backend refs (#7994)

* fix: wrong cluster type with mixed FQDN backend and service backend refs

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* fix mirror cluster endpoint type

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* simplify the test

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* update comment

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

---------

Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: fail fast when unrecoverable discovery errors happens on checking optional CRDs (#7872)

* fail fast when unrecoverable discovery errors happens

Signed-off-by: Huabing Zhao <[email protected]>

* only retry transient errors

Signed-off-by: Huabing Zhao <[email protected]>

* fix potenial dead lock

Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: Huabing Zhao <[email protected]>

* minor wording

Signed-off-by: Huabing Zhao <[email protected]>

* create discovery client once

Signed-off-by: Huabing Zhao <[email protected]>

* fix lint

Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: Huabing Zhao <[email protected]>

* remove redundant logging

Signed-off-by: Huabing Zhao <[email protected]>

* add e2e test

Signed-off-by: Huabing Zhao <[email protected]>

* fix test

Signed-off-by: Huabing(Robin) Zhao <[email protected]>

* fix test

Signed-off-by: Huabing(Robin) Zhao <[email protected]>

---------

Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: merge route match rule with match all route (#8011)

Signed-off-by: zirain <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: do not set autoHTTPConfig when used mixed(HTTP + HTTPS) backends (#7950)

* fix: do not set autoHTTPConfig when used mixed backend

Signed-off-by: zirain <[email protected]>

* release notes

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* add e2e

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: backend tls default namespace (#7987)

Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: race in gatewaapi runner (#8037)

* add testcase

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* simply

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* [release/v1.6] v1.6.3 release notes (#8054)

Signed-off-by: Rudrakh Panigrahi <[email protected]>

* v1.6.3 version

Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix gen-check

Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix lint

Signed-off-by: Rudrakh Panigrahi <[email protected]>

---------

Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>
Signed-off-by: zirain <[email protected]>
Co-authored-by: Huabing (Robin) Zhao <[email protected]>
Co-authored-by: zirain <[email protected]>
SadmiB pushed a commit to SadmiB/gateway that referenced this pull request Jan 30, 2026
* add testcase

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* simply

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>
Signed-off-by: Sadmi Bouhafs <[email protected]>
Development

Successfully merging this pull request may close these issues.

Data race while running tests that fixes with retry

3 participants