
fix: fail fast when unrecoverable discovery errors happens on checking optional CRDs#7872

Merged
zhaohuabing merged 20 commits into envoyproxy:main from zhaohuabing:fix-7871 on Jan 15, 2026

Conversation

@zhaohuabing
Member

@zhaohuabing zhaohuabing commented Jan 7, 2026

What type of PR is this?

This PR adds retries to the controller when it fails to discover optional CRDs from the API server. If all retries fail, the error is propagated and causes the EG pod to restart. This prevents the EG pod from reconciling incomplete resources and serving partial xDS configuration to Envoy.

It also propagates runner startup errors to the server, so the Envoy Gateway process can exit and restart cleanly. Previously, runner startup failures were only logged, and Envoy Gateway continued running even with failed runners.

Fixes #7871

Release Notes: Yes

@zhaohuabing zhaohuabing requested a review from a team as a code owner January 7, 2026 02:16
@zhaohuabing zhaohuabing marked this pull request as draft January 7, 2026 02:16
@zhaohuabing zhaohuabing changed the title fail fast when unrecoverable discovery errors happens fix: fail fast when unrecoverable discovery errors happens Jan 7, 2026
@zhaohuabing zhaohuabing force-pushed the fix-7871 branch 2 times, most recently from 36e3fe1 to d4af0fb Compare January 7, 2026 02:25
@codecov

codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 48.57143% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.74%. Comparing base (3fd3e4a) to head (29071df).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/provider/kubernetes/controller.go | 56.79% | 17 Missing and 18 partials ⚠️ |
| internal/cmd/server.go | 0.00% | 10 Missing ⚠️ |
| ...nternal/envoygateway/config/loader/configloader.go | 22.22% | 7 Missing ⚠️ |
| internal/provider/kubernetes/controller_watch.go | 60.00% | 1 Missing and 1 partial ⚠️ |

❌ Your patch status has failed because the patch coverage (48.57%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7872      +/-   ##
==========================================
- Coverage   72.80%   72.74%   -0.07%     
==========================================
  Files         235      235              
  Lines       35313    35380      +67     
==========================================
+ Hits        25709    25736      +27     
- Misses       7781     7806      +25     
- Partials     1823     1838      +15     

@zhaohuabing zhaohuabing changed the title fix: fail fast when unrecoverable discovery errors happens fix: fail fast when unrecoverable discovery errors happens on checking optional CRDs Jan 7, 2026
@zhaohuabing zhaohuabing force-pushed the fix-7871 branch 3 times, most recently from ff6bbad to 0e6c3a9 Compare January 7, 2026 07:14
@zhaohuabing zhaohuabing marked this pull request as ready for review January 7, 2026 07:17
Signed-off-by: Huabing Zhao <[email protected]>
@netlify

netlify bot commented Jan 8, 2026

Deploy Preview for cerulean-figolla-1f9435 ready!

Name Link
🔨 Latest commit 29071df
🔍 Latest deploy log https://app.netlify.com/projects/cerulean-figolla-1f9435/deploys/69675027f84f4d00088fe9d9
😎 Deploy Preview https://deploy-preview-7872--cerulean-figolla-1f9435.netlify.app
Signed-off-by: Huabing Zhao <[email protected]>
@zhaohuabing zhaohuabing force-pushed the fix-7871 branch 2 times, most recently from 273f904 to f96ea29 Compare January 8, 2026 02:25
Signed-off-by: Huabing(Robin) Zhao <[email protected]>
@zhaohuabing zhaohuabing marked this pull request as ready for review January 12, 2026 11:28
@zhaohuabing zhaohuabing requested a review from arkodg January 12, 2026 23:55
@zhaohuabing zhaohuabing added this to the v1.7.0-rc.1 Release milestone Jan 13, 2026
arkodg previously approved these changes Jan 14, 2026
Contributor

@arkodg arkodg left a comment

LGTM thanks

@arkodg arkodg requested review from a team January 14, 2026 05:14
zirain previously approved these changes Jan 14, 2026
Signed-off-by: Huabing (Robin) Zhao <[email protected]>
@zhaohuabing zhaohuabing merged commit 09b6456 into envoyproxy:main Jan 15, 2026
56 of 59 checks passed
@zhaohuabing zhaohuabing deleted the fix-7871 branch January 15, 2026 01:15
andreik-n2 pushed a commit to andreik-n2/gateway that referenced this pull request Jan 15, 2026
…g optional CRDs (envoyproxy#7872)

* fail fast when unrecoverable discovery errors happens

Signed-off-by: Huabing Zhao <[email protected]>

* only retry transient errors

Signed-off-by: Huabing Zhao <[email protected]>

* fix potenial dead lock

Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: Huabing Zhao <[email protected]>

* minor wording

Signed-off-by: Huabing Zhao <[email protected]>

* create discovery client once

Signed-off-by: Huabing Zhao <[email protected]>

* fix lint

Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: Huabing Zhao <[email protected]>

* remove redundant logging

Signed-off-by: Huabing Zhao <[email protected]>

* add e2e test

Signed-off-by: Huabing Zhao <[email protected]>

* fix test

Signed-off-by: Huabing(Robin) Zhao <[email protected]>

* fix test

Signed-off-by: Huabing(Robin) Zhao <[email protected]>

---------

Signed-off-by: Huabing (Robin) Zhao <[email protected]>
@zirain
Member

zirain commented Jan 15, 2026

FYI, while testing #7964, it took more than 60s before the runner returned an error for the discovery failure.

@zhaohuabing
Member Author

> FYI, while testing #7964, it took more than 60s before the runner returned an error for the discovery failure.

Hi @zirain, is this because of the retries in this PR?

@zirain
Member

zirain commented Jan 23, 2026

> > FYI, while testing #7964, it took more than 60s before the runner returned an error for the discovery failure.
>
> Hi @zirain, is this because of the retries in this PR?

I'm not 100% sure. Maybe we need a way to disable the retry in test code?

zirain pushed a commit to zirain/gateway that referenced this pull request Jan 26, 2026
…g optional CRDs (envoyproxy#7872)

rudrakhp pushed a commit to rudrakhp/gateway that referenced this pull request Jan 26, 2026
…g optional CRDs (envoyproxy#7872)

Signed-off-by: Rudrakh Panigrahi <[email protected]>
zirain added a commit that referenced this pull request Jan 26, 2026
* fix: fail fast when unrecoverable discovery errors happens on checking optional CRDs (#7872)

* fail fast when unrecoverable discovery errors happens

Signed-off-by: Huabing Zhao <[email protected]>

* only retry transient errors

Signed-off-by: Huabing Zhao <[email protected]>

* fix potenial dead lock

Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: Huabing Zhao <[email protected]>

* minor wording

Signed-off-by: Huabing Zhao <[email protected]>

* create discovery client once

Signed-off-by: Huabing Zhao <[email protected]>

* fix lint

Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: Huabing Zhao <[email protected]>

* remove redundant logging

Signed-off-by: Huabing Zhao <[email protected]>

* add e2e test

Signed-off-by: Huabing Zhao <[email protected]>

* fix test

Signed-off-by: Huabing(Robin) Zhao <[email protected]>

* fix test

Signed-off-by: Huabing(Robin) Zhao <[email protected]>

---------

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* fix: extproc is discarded with failOpen is enabled for wasm (#7956)

* fix: extproc is discarded with failOpen is enabled for wasm

Signed-off-by: Huabing Zhao <[email protected]>

* add test

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* polish code

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* add test

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

---------

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* fix: sanitize control plane config dump (#7901)

* mask secrets

Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

---------

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* fix: server run race (#7964)

* add test

Signed-off-by: zirain <[email protected]>

* fix race

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* use Semaphore instead of WaitGroup

Signed-off-by: zirain <[email protected]>

* comments

Signed-off-by: zirain <[email protected]>

* lint

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* callback

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* run hook sequentially

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* rename to cfgMux

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>

* fix: wrong cluster type with mixed FQDN backend and service backend refs (#7994)

* fix: wrong cluster type with mixed FQDN backend and service backend refs

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* fix mirror cluster endpoint type

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* simplify the test

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* update comment

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

---------

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* fix: merge route match rule with match all route (#8011)

Signed-off-by: zirain <[email protected]>

* fix gen

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* fix for golang 11.24

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* fix watch CRD version

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: zirain <[email protected]>
Co-authored-by: Huabing (Robin) Zhao <[email protected]>
rudrakhp added a commit that referenced this pull request Jan 26, 2026
* fix: extproc is discarded with failOpen is enabled for wasm (#7956)

* fix: extproc is discarded with failOpen is enabled for wasm

Signed-off-by: Huabing Zhao <[email protected]>

* add test

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* polish code

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* add test

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

---------

Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: sanitize control plane config dump (#7901)

* mask secrets

Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

---------

Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: server run race (#7964)

* add test

Signed-off-by: zirain <[email protected]>

* fix race

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* use Semaphore instead of WaitGroup

Signed-off-by: zirain <[email protected]>

* comments

Signed-off-by: zirain <[email protected]>

* lint

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* callback

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* run hook sequentially

Signed-off-by: zirain <[email protected]>

* fix lint

Signed-off-by: zirain <[email protected]>

* rename to cfgMux

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: wrong cluster type with mixed FQDN backend and service backend refs (#7994)

* fix: wrong cluster type with mixed FQDN backend and service backend refs

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* fix mirror cluster endpoint type

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* simplify the test

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* update comment

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

---------

Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: fail fast when unrecoverable discovery errors happens on checking optional CRDs (#7872)

* fail fast when unrecoverable discovery errors happens

Signed-off-by: Huabing Zhao <[email protected]>

* only retry transient errors

Signed-off-by: Huabing Zhao <[email protected]>

* fix potenial dead lock

Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: Huabing Zhao <[email protected]>

* minor wording

Signed-off-by: Huabing Zhao <[email protected]>

* create discovery client once

Signed-off-by: Huabing Zhao <[email protected]>

* fix lint

Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: Huabing Zhao <[email protected]>

* remove redundant logging

Signed-off-by: Huabing Zhao <[email protected]>

* add e2e test

Signed-off-by: Huabing Zhao <[email protected]>

* fix test

Signed-off-by: Huabing(Robin) Zhao <[email protected]>

* fix test

Signed-off-by: Huabing(Robin) Zhao <[email protected]>

---------

Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: merge route match rule with match all route (#8011)

Signed-off-by: zirain <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: do not set autoHTTPConfig when used mixed(HTTP + HTTPS) backends (#7950)

* fix: do not set autoHTTPConfig when used mixed backend

Signed-off-by: zirain <[email protected]>

* release notes

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* add e2e

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: backend tls default namespace (#7987)

Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: race in gatewaapi runner (#8037)

* add testcase

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* simply

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>

* [release/v1.6] v1.6.3 release notes (#8054)

Signed-off-by: Rudrakh Panigrahi <[email protected]>

* v1.6.3 version

Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix gen-check

Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix lint

Signed-off-by: Rudrakh Panigrahi <[email protected]>

---------

Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>
Signed-off-by: zirain <[email protected]>
Co-authored-by: Huabing (Robin) Zhao <[email protected]>
Co-authored-by: zirain <[email protected]>
SadmiB pushed a commit to SadmiB/gateway that referenced this pull request Jan 30, 2026
…g optional CRDs (envoyproxy#7872)

Signed-off-by: Sadmi Bouhafs <[email protected]>
Development

Successfully merging this pull request may close these issues.

Optional CRDs skipped when discovery errors are treated as “absent”

4 participants