Skip to content

fix: retry reconcile on transient errors during reconcile #6299

Merged
arkodg merged 13 commits intoenvoyproxy:mainfrom
patrostkowski:fix/transient-errors-6284
Jul 3, 2025
Merged

fix: retry reconcile on transient errors during reconcile #6299
arkodg merged 13 commits intoenvoyproxy:mainfrom
patrostkowski:fix/transient-errors-6284

Conversation

@patrostkowski
Copy link
Copy Markdown
Contributor

@patrostkowski patrostkowski commented Jun 12, 2025

What type of PR is this?

fix: #6284

What this PR does / why we need it:

This PR:

  • Introduces isTransientError to detect transient Kubernetes errors and enable proper reconciliation retries.
  • Ignores non-transient errors from the provider layer and publishes them to the translation layer - the errors mostly come from missing referenced resources or misconfigurations that can be caught later in the Gateway API translator and surfaced in the resource status.

Description updated by @zhaohuabing on Jul 3 2025.

Which issue(s) this PR fixes:

Fixes #6284

Release Notes: Yes

@patrostkowski patrostkowski requested a review from a team as a code owner June 12, 2025 18:06
@patrostkowski patrostkowski force-pushed the fix/transient-errors-6284 branch from c45000a to a69693a Compare June 12, 2025 19:15
@codecov
Copy link
Copy Markdown

codecov bot commented Jun 13, 2025

Codecov Report

Attention: Patch coverage is 27.44186% with 156 lines in your changes missing coverage. Please review.

Project coverage is 70.73%. Comparing base (a394c61) to head (80b953e).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
internal/provider/kubernetes/controller.go 27.44% 150 Missing and 6 partials ⚠️

❌ Your patch status has failed because the patch coverage (27.44%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6299      +/-   ##
==========================================
- Coverage   70.92%   70.73%   -0.19%     
==========================================
  Files         220      220              
  Lines       37260    37354      +94     
==========================================
- Hits        26426    26423       -3     
- Misses       9289     9383      +94     
- Partials     1545     1548       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@guydc
Copy link
Copy Markdown
Contributor

guydc commented Jun 13, 2025

@arkodg, @patrostkowski
My question here would be: In case of a non-transient error, we will still store inconsistent resources (partial? invalid?) that may lead to corruption of XDS. Is there a way to avoid this, e.g. by using previously-stored resources for the affected GWC if they exist? This would be more aligned and compatible with previous behavior, where ALL XDS is essentially not updated in case of ANY error in this area of provider?

@patrostkowski patrostkowski force-pushed the fix/transient-errors-6284 branch from a69693a to ba30677 Compare June 16, 2025 05:50
@guydc
Copy link
Copy Markdown
Contributor

guydc commented Jun 18, 2025

Notes from the community meeting:

  • Provider layer should publish an accurate state-of-the-world.
  • If transient errors make it impossible to build an accurate snapshot, the reconciliation should be retried by returning early with an error
  • If validation errors are encountered, state should be published, so that status can be updated and user notified about issues with resources.
  • If users want to implement alternative strategies for handling invalid configuration (e.g. pausing XDS updates), that should be implemented in other layers.

@arkodg
Copy link
Copy Markdown
Contributor

arkodg commented Jun 19, 2025

can we also log the transient error

@arkodg
Copy link
Copy Markdown
Contributor

arkodg commented Jun 24, 2025

@patrostkowski still working on this ? we'd like to get this into the patch release scheduled to be out next week

@zhaohuabing zhaohuabing marked this pull request as draft June 26, 2025 02:40
@zhaohuabing zhaohuabing marked this pull request as draft June 26, 2025 02:40
@zhaohuabing
Copy link
Copy Markdown
Member

It seems @patrostkowski is currently unavailable. I’ll take over from here since we need this in the upcoming patch.

@zhaohuabing zhaohuabing force-pushed the fix/transient-errors-6284 branch 3 times, most recently from fcd7f69 to 40ec0c9 Compare June 26, 2025 03:31
@patrostkowski
Copy link
Copy Markdown
Contributor Author

thanks @zhaohuabing and sorry, I am very busy time atm with other things 😞

@zhaohuabing
Copy link
Copy Markdown
Member

thanks @zhaohuabing and sorry, I am very busy time atm with other things 😞

No worries. I'll take care of it.

@zhaohuabing zhaohuabing force-pushed the fix/transient-errors-6284 branch 2 times, most recently from 5eeaad0 to 5313be2 Compare June 26, 2025 11:29
@zhaohuabing zhaohuabing self-assigned this Jun 26, 2025
@zhaohuabing zhaohuabing force-pushed the fix/transient-errors-6284 branch 2 times, most recently from 74541e2 to 27b84b4 Compare June 26, 2025 13:14
@zhaohuabing zhaohuabing changed the title fix: add isTransientError helper to classify retryable errors during reconcile fix: retry reconcile onvtransient errors during reconcile and use the last state to allow Envoy continue serving traffic Jun 26, 2025
@zhaohuabing zhaohuabing force-pushed the fix/transient-errors-6284 branch 3 times, most recently from 6f9a261 to 9b633af Compare June 27, 2025 01:11
@zhaohuabing zhaohuabing force-pushed the fix/transient-errors-6284 branch from a3b57a9 to 63c2eee Compare June 27, 2025 02:33
service := new(corev1.Service)
err := r.client.Get(ctx, types.NamespacedName{Namespace: string(*backendRef.Namespace), Name: string(backendRef.Name)}, service)
if err != nil {
if isTransientError(err) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is going to hard to remember to add this for every client call

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can figure out a better approach to handle this later.

Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: Huabing (Robin) Zhao <[email protected]>
@zhaohuabing zhaohuabing force-pushed the fix/transient-errors-6284 branch from 9926890 to 8704fd8 Compare July 3, 2025 01:08
@zhaohuabing zhaohuabing requested a review from arkodg July 3, 2025 01:12
Signed-off-by: Huabing (Robin) Zhao <[email protected]>
@zhaohuabing zhaohuabing requested a review from arkodg July 3, 2025 01:28
arkodg
arkodg previously approved these changes Jul 3, 2025
Copy link
Copy Markdown
Contributor

@arkodg arkodg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks

@arkodg arkodg requested review from a team and shahar-h and removed request for shahar-h July 3, 2025 01:36
Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: Huabing (Robin) Zhao <[email protected]>
@arkodg arkodg merged commit 71ce56f into envoyproxy:main Jul 3, 2025
27 of 28 checks passed
zhaohuabing added a commit to shawnh2/gateway that referenced this pull request Jul 4, 2025
…#6299)

* fix: add isTransientError helper to classify retryable errors

Introduces isTransientError to detect transient Kubernetes errors and
enable proper reconciliation retries.

Signed-off-by: Patryk Rostkowski <[email protected]>

handle errors from processing BackendRefs

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

handle errors from processing ConfigMap

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* skip invalid GatewayClass

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* address comment

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* handle all transient errors

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* don't skip failed GCs

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

---------

Signed-off-by: Patryk Rostkowski <[email protected]>
Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Co-authored-by: Huabing (Robin) Zhao <[email protected]>
(cherry picked from commit 71ce56f)
Signed-off-by: Huabing (Robin) Zhao <[email protected]>
zhaohuabing added a commit that referenced this pull request Jul 4, 2025
* fix(translator): ext-proc full duplex streamed trailers and validation (#6323)
* fix ext proc validation and trailer management for full duplex streamed mode

Signed-off-by: Guy Daich <[email protected]>
Signed-off-by: shawnh2 <[email protected]>

* feat: disable automountServiceAccountToken for proxy and ratelimit (#6364)

Signed-off-by: Jeff Davis <[email protected]>

* bugfix: make EnvoyPatchPolicy able to replace telemetry cluster (#6367)

Signed-off-by: zirain <[email protected]>
Signed-off-by: shawnh2 <[email protected]>

* feat: add validation of section name for Gateway listener (#6343)

* add validation of section name

Signed-off-by: kkk777-7 <[email protected]>

* update error status reason

Signed-off-by: kkk777-7 <[email protected]>

* refactor: define as function of validate section name for gateway listener

Signed-off-by: kkk777-7 <[email protected]>
Signed-off-by: shawnh2 <[email protected]>

* fix: add configMap indexers for EEP reconciler (#6369)

Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: use buildEndpointType for access and tracing (#6370)

Signed-off-by: zirain <[email protected]>

* fix: default accesslog not working (#6441)
* fix default accesslog

Signed-off-by: zirain <[email protected]>

* release notes

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>
Signed-off-by: shawnh2 <[email protected]>

* chore: fix cve (#6446)

* fix cve

Signed-off-by: zirain <[email protected]>

* lint

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>

* fix: Do not set backendRequestTimeout when Retries are set (#6421)

* fix: Do not set backendRequestTimeout when Retries are set

Signed-off-by: sudipto baral <[email protected]>

* fix: update comment

Signed-off-by: sudipto baral <[email protected]>

---------

Signed-off-by: sudipto baral <[email protected]>

* gatewayapi: don't append gwcResource if there's invalid GatewayClass (#6379)

* gatewayapi: don't process gloabal resources when acceptedGateways is 0

Signed-off-by: zirain <[email protected]>

* update

Signed-off-by: zirain <[email protected]>

* fix test

Signed-off-by: zirain <[email protected]>

* don't skip gateways

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>
Signed-off-by: shawnh2 <[email protected]>

* fix testdata

Signed-off-by: shawnh2 <[email protected]>

* fix k8s provider controller

Signed-off-by: shawnh2 <[email protected]>

* fix: retry reconcile on transient errors during reconcile  (#6299)

* fix: add isTransientError helper to classify retryable errors

Introduces isTransientError to detect transient Kubernetes errors and
enable proper reconciliation retries.

Signed-off-by: Patryk Rostkowski <[email protected]>

handle errors from processing BackendRefs

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

handle errors from processing ConfigMap

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* skip invalid GatewayClass

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* address comment

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* handle all transient errors

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* don't skip failed GCs

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

---------

Signed-off-by: Patryk Rostkowski <[email protected]>
Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Co-authored-by: Huabing (Robin) Zhao <[email protected]>
(cherry picked from commit 71ce56f)
Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* fix: fix bug in hostname overlap detection (#6332)

fix bug in hostname overlap detection

Signed-off-by: Rudrakh Panigrahi <[email protected]>
(cherry picked from commit e78e268)
Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* fix telemetry with host port not working (#6460)

Signed-off-by: zirain <[email protected]>
(cherry picked from commit c0a2ce7)
Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* bugfix: BackendTlsPolicy should not reference across namespace (#6309)

* bugfix: BackendTlsPolicy should not reference across namespace

Signed-off-by: zirain <[email protected]>
(cherry picked from commit 9925189)
Signed-off-by: Huabing (Robin) Zhao <[email protected]>

---------

Signed-off-by: Guy Daich <[email protected]>
Signed-off-by: shawnh2 <[email protected]>
Signed-off-by: Jeff Davis <[email protected]>
Signed-off-by: zirain <[email protected]>
Signed-off-by: kkk777-7 <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>
Signed-off-by: sudipto baral <[email protected]>
Signed-off-by: Patryk Rostkowski <[email protected]>
Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Co-authored-by: Guy Daich <[email protected]>
Co-authored-by: Jeff Davis <[email protected]>
Co-authored-by: zirain <[email protected]>
Co-authored-by: Kota Kimura <[email protected]>
Co-authored-by: Rudrakh Panigrahi <[email protected]>
Co-authored-by: Sudipto Baral <[email protected]>
Co-authored-by: Patryk Rostkowski <[email protected]>
Co-authored-by: Huabing (Robin) Zhao <[email protected]>
tjvdmolen pushed a commit to tjvdmolen/gateway that referenced this pull request Jul 11, 2025
…#6299)

* fix: add isTransientError helper to classify retryable errors

Introduces isTransientError to detect transient Kubernetes errors and
enable proper reconciliation retries.

Signed-off-by: Patryk Rostkowski <[email protected]>

handle errors from processing BackendRefs

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

handle errors from processing ConfigMap

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* skip invalid GatewayClass

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* address comment

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* handle all transient errors

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* don't skip failed GCs

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

---------

Signed-off-by: Patryk Rostkowski <[email protected]>
Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Co-authored-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: Tjeerd Jan van der Molen <[email protected]>
shawnh2 added a commit to shawnh2/gateway that referenced this pull request Sep 15, 2025
* fix(translator): ext-proc full duplex streamed trailers and validation (envoyproxy#6323)
* fix ext proc validation and trailer management for full duplex streamed mode

Signed-off-by: Guy Daich <[email protected]>
Signed-off-by: shawnh2 <[email protected]>

* feat: disable automountServiceAccountToken for proxy and ratelimit (envoyproxy#6364)

Signed-off-by: Jeff Davis <[email protected]>

* bugfix: make EnvoyPatchPolicy able to replace telemetry cluster (envoyproxy#6367)

Signed-off-by: zirain <[email protected]>
Signed-off-by: shawnh2 <[email protected]>

* feat: add validation of section name for Gateway listener (envoyproxy#6343)

* add validation of section name

Signed-off-by: kkk777-7 <[email protected]>

* update error status reason

Signed-off-by: kkk777-7 <[email protected]>

* refactor: define as function of validate section name for gateway listener

Signed-off-by: kkk777-7 <[email protected]>
Signed-off-by: shawnh2 <[email protected]>

* fix: add configMap indexers for EEP reconciler (envoyproxy#6369)

Signed-off-by: Rudrakh Panigrahi <[email protected]>

* fix: use buildEndpointType for access and tracing (envoyproxy#6370)

Signed-off-by: zirain <[email protected]>

* fix: default accesslog not working (envoyproxy#6441)
* fix default accesslog

Signed-off-by: zirain <[email protected]>

* release notes

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>
Signed-off-by: shawnh2 <[email protected]>

* chore: fix cve (envoyproxy#6446)

* fix cve

Signed-off-by: zirain <[email protected]>

* lint

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>

* fix: Do not set backendRequestTimeout when Retries are set (envoyproxy#6421)

* fix: Do not set backendRequestTimeout when Retries are set

Signed-off-by: sudipto baral <[email protected]>

* fix: update comment

Signed-off-by: sudipto baral <[email protected]>

---------

Signed-off-by: sudipto baral <[email protected]>

* gatewayapi: don't append gwcResource if there's invalid GatewayClass (envoyproxy#6379)

* gatewayapi: don't process gloabal resources when acceptedGateways is 0

Signed-off-by: zirain <[email protected]>

* update

Signed-off-by: zirain <[email protected]>

* fix test

Signed-off-by: zirain <[email protected]>

* don't skip gateways

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>
Signed-off-by: shawnh2 <[email protected]>

* fix testdata

Signed-off-by: shawnh2 <[email protected]>

* fix k8s provider controller

Signed-off-by: shawnh2 <[email protected]>

* fix: retry reconcile on transient errors during reconcile  (envoyproxy#6299)

* fix: add isTransientError helper to classify retryable errors

Introduces isTransientError to detect transient Kubernetes errors and
enable proper reconciliation retries.

Signed-off-by: Patryk Rostkowski <[email protected]>

handle errors from processing BackendRefs

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

handle errors from processing ConfigMap

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* skip invalid GatewayClass

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* address comment

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* handle all transient errors

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* don't skip failed GCs

Signed-off-by: Huabing (Robin) Zhao <[email protected]>

---------

Signed-off-by: Patryk Rostkowski <[email protected]>
Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Co-authored-by: Huabing (Robin) Zhao <[email protected]>
(cherry picked from commit 71ce56f)
Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* fix: fix bug in hostname overlap detection (envoyproxy#6332)

fix bug in hostname overlap detection

Signed-off-by: Rudrakh Panigrahi <[email protected]>
(cherry picked from commit e78e268)
Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* fix telemetry with host port not working (envoyproxy#6460)

Signed-off-by: zirain <[email protected]>
(cherry picked from commit c0a2ce7)
Signed-off-by: Huabing (Robin) Zhao <[email protected]>

* bugfix: BackendTlsPolicy should not reference across namespace (envoyproxy#6309)

* bugfix: BackendTlsPolicy should not reference across namespace

Signed-off-by: zirain <[email protected]>
(cherry picked from commit 9925189)
Signed-off-by: Huabing (Robin) Zhao <[email protected]>

---------

Signed-off-by: Guy Daich <[email protected]>
Signed-off-by: shawnh2 <[email protected]>
Signed-off-by: Jeff Davis <[email protected]>
Signed-off-by: zirain <[email protected]>
Signed-off-by: kkk777-7 <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>
Signed-off-by: sudipto baral <[email protected]>
Signed-off-by: Patryk Rostkowski <[email protected]>
Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Co-authored-by: Guy Daich <[email protected]>
Co-authored-by: Jeff Davis <[email protected]>
Co-authored-by: zirain <[email protected]>
Co-authored-by: Kota Kimura <[email protected]>
Co-authored-by: Rudrakh Panigrahi <[email protected]>
Co-authored-by: Sudipto Baral <[email protected]>
Co-authored-by: Patryk Rostkowski <[email protected]>
Co-authored-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: shawnh2 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Errors during reconcile causes a creation of partial xDS configuraion

7 participants