Skip to content

fix(xds-server): clear snapshot on stream close#6618

Merged
arkodg merged 2 commits intoenvoyproxy:mainfrom
Hackzzila:zvacura/snapshotcache-bug
Jul 30, 2025
Merged

fix(xds-server): clear snapshot on stream close#6618
arkodg merged 2 commits intoenvoyproxy:mainfrom
Hackzzila:zvacura/snapshotcache-bug

Conversation

@Hackzzila
Copy link
Copy Markdown
Contributor

In the snapshotcache, only snapshots for nodes with active connections are updated.

for _, node := range s.getNodeIDs(irKey) {
s.log.Debugf("Generating a snapshot with Node %s", node)
if err = s.SetSnapshot(context.TODO(), node, snapshot); err != nil {

As far as I can tell, these snapshots are never deleted. When a node disconnects and then reconnects it will receive the configuration from the time it disconnected. It will only get the latest configuration when there is an update while it is still connected.

This is a big problem when you are running multiple replicas of envoy-gateway, if a proxy flips between which instance it is connected to it can receive data that is several days out of date. We have seen proxy instances suddenly start using certificates that are expired and have already been rotated, and use pod IPs that do not exist anymore.

If we clear the snapshot for a node when it disconnects, it will get the latest snapshot when it reconnects (this does not happen if a node already has a cached snapshot).

_, err := s.GetSnapshot(nodeID)
if err != nil {
err = s.SetSnapshot(context.TODO(), nodeID, s.lastSnapshot[cluster])

@Hackzzila Hackzzila requested a review from a team as a code owner July 28, 2025 18:26
@codecov
Copy link
Copy Markdown

codecov bot commented Jul 28, 2025

Codecov Report

❌ Patch coverage is 0% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.02%. Comparing base (ceec666) to head (cfb177c).
⚠️ Report is 483 commits behind head on main.

Files with missing lines Patch % Lines
internal/xds/cache/snapshotcache.go 0.00% 21 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6618      +/-   ##
==========================================
- Coverage   71.03%   71.02%   -0.01%     
==========================================
  Files         225      225              
  Lines       39153    39173      +20     
==========================================
+ Hits        27811    27823      +12     
- Misses       9732     9743      +11     
+ Partials     1610     1607       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@arkodg
Copy link
Copy Markdown
Contributor

arkodg commented Jul 28, 2025

thanks for flagging this edge case
although for ADS there should be 1 stream per node, we cannot rely on it

instead of clearing the snapshot in OnStreamClosed , we could compare snapshots using GetVersionMap in OnStreamRequest which executesGetSnapshot

a cheaper alternative could be to hold a nodeId frequency map of stream counts, and only clear snapshot when no streams are available, my preference would be this approach

@arkodg arkodg added this to the v1.5.0 Release milestone Jul 28, 2025
@arkodg
Copy link
Copy Markdown
Contributor

arkodg commented Jul 28, 2025

This is a big problem when you are running multiple replicas of envoy-gateway, if a proxy flips between which instance it is connected to it can receive data that is several days out of date. We have seen proxy instances suddenly start using certificates that are expired and have already been rotated, and use pod IPs that do not exist anymore.

trying to understand when this realistically occurs - is this a race between pod restart and Gateway API config update ?

@Hackzzila
Copy link
Copy Markdown
Contributor Author

trying to understand when this realistically occurs - is this a race between pod restart and Gateway API config update ?

We've seen this happen at random in our clusters. If there is a brief network interruption a proxy may get disconnected from envoy-gateway and then connect to a different instance (one that it was connected to previously).

Copy link
Copy Markdown
Contributor

@arkodg arkodg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks !

@arkodg arkodg requested review from a team July 29, 2025 19:35
@guydc
Copy link
Copy Markdown
Contributor

guydc commented Jul 29, 2025

@Hackzzila - Thanks for this. I would really appreciate an E2E test like the ones that we have in the resilience suite, demonstrating that this works as expected after creating an disconnection with a network policy.

If that's too much work, can you confirm that you manually tested the change in a similar way ?

@Hackzzila
Copy link
Copy Markdown
Contributor Author

@Hackzzila - Thanks for this. I would really appreciate an E2E test like the ones that we have in the resilience suite, demonstrating that this works as expected after creating an disconnection with a network policy.

If that's too much work, can you confirm that you manually tested the change in a similar way ?

I did a test by connecting to envoy-gateway with grpcurl and sending requests manually. I connected and verified the correct config was being returned, then disconnected and restarted a deployment, and then reconnected and saw that the IPs were not updated. I then did this same test with the patch and verified that it returned the correct data. I did not test the delta variant, or multiple streams for the same node.

If nothing major comes up at work I can investigate writing a test for this on Thursday/Friday 👍

Copy link
Copy Markdown
Contributor

@jukie jukie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on adding an E2E but non-blocking

@jukie
Copy link
Copy Markdown
Contributor

jukie commented Jul 30, 2025

Thanks! If you're unable to do so before merging could you create an issue tracking the E2E followup?

@arkodg arkodg merged commit 6e21739 into envoyproxy:main Jul 30, 2025
27 of 28 checks passed
shawnh2 pushed a commit to shawnh2/gateway that referenced this pull request Oct 15, 2025
* fix(xds-server): clear snapshot on stream close

Signed-off-by: Zachary Vacura <[email protected]>

* check if there are other active connections before clearning the snapshot

Signed-off-by: Zachary Vacura <[email protected]>
Signed-off-by: shawnh2 <[email protected]>
arkodg added a commit that referenced this pull request Oct 15, 2025
* fix(xds-server): clear snapshot on stream close (#6618)

* fix(xds-server): clear snapshot on stream close

Signed-off-by: Zachary Vacura <[email protected]>

* check if there are other active connections before clearning the snapshot

Signed-off-by: Zachary Vacura <[email protected]>
Signed-off-by: shawnh2 <[email protected]>

* bug: disable x-envoy-ratelimited by default (#7110)

* bug: disable x-envoy-ratelimited by default

* can be enabled with `enableEnvoyHeaders` in CTP

Relates to #7034

Signed-off-by: Arko Dasgupta <[email protected]>

* fix tests and release note

Signed-off-by: Arko Dasgupta <[email protected]>

* fix testdata

Signed-off-by: Arko Dasgupta <[email protected]>

---------

Signed-off-by: Arko Dasgupta <[email protected]>
Signed-off-by: shawnh2 <[email protected]>

* fix(translator): Fix panic with request mirror + grpcroute (#6875)

Signed-off-by: Andrew Moreland <[email protected]>
Signed-off-by: shawnh2 <[email protected]>

* fix: bug in overlap detection of cert SANs (#7234)

Signed-off-by: shawnh2 <[email protected]>

* bump golang for crypto/x509 reggression (#7236)

Signed-off-by: zirain <[email protected]>
Signed-off-by: shawnh2 <[email protected]>

* fix gen-check

Signed-off-by: shawnh2 <[email protected]>

---------

Signed-off-by: Zachary Vacura <[email protected]>
Signed-off-by: shawnh2 <[email protected]>
Signed-off-by: Arko Dasgupta <[email protected]>
Signed-off-by: Andrew Moreland <[email protected]>
Signed-off-by: zirain <[email protected]>
Co-authored-by: Zach Vacura <[email protected]>
Co-authored-by: Arko Dasgupta <[email protected]>
Co-authored-by: Andrew Moreland <[email protected]>
Co-authored-by: Rudrakh Panigrahi <[email protected]>
Co-authored-by: zirain <[email protected]>
jukie pushed a commit to jukie/gateway that referenced this pull request Dec 5, 2025
* fix(xds-server): clear snapshot on stream close

Signed-off-by: Zachary Vacura <[email protected]>

* check if there are other active connections before clearning the snapshot

Signed-off-by: Zachary Vacura <[email protected]>
Signed-off-by: jukie <[email protected]>
jukie added a commit that referenced this pull request Dec 5, 2025
* fix(xds-server): clear snapshot on stream close (#6618)

* fix(xds-server): clear snapshot on stream close

Signed-off-by: Zachary Vacura <[email protected]>

* check if there are other active connections before clearning the snapshot

Signed-off-by: Zachary Vacura <[email protected]>
Signed-off-by: jukie <[email protected]>

* fix: oidc authentication endpoint was overwritten by discovered value (#7460)

fix: oid authentication endpoint was overriden by discovered value

Signed-off-by: Huabing Zhao <[email protected]>
Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: jukie <[email protected]>

* ci: add script to free disk space (#7534)

* feat: free disk space

Signed-off-by: Shreemaan Abhishek <[email protected]>

* lint

Signed-off-by: Shreemaan Abhishek <[email protected]>

* cleanup

Signed-off-by: Shreemaan Abhishek <[email protected]>

* make target and tools/hack

Signed-off-by: Shreemaan Abhishek <[email protected]>

* lint

Signed-off-by: Shreemaan Abhishek <[email protected]>

* modular action

Signed-off-by: Shreemaan Abhishek <[email protected]>

---------

Signed-off-by: Shreemaan Abhishek <[email protected]>
Signed-off-by: jukie <[email protected]>

* treat too many addresses as programmed (#7542)

Signed-off-by: cong <[email protected]>
Signed-off-by: jukie <[email protected]>

* feat: reclaim space in release pipeline (#7587)

Signed-off-by: Shreemaan Abhishek <[email protected]>
Signed-off-by: jukie <[email protected]>

* chore: bump golang.org/x/crypto (#7588)

* chore: bump golang.org/x/crypto

Signed-off-by: zirain <[email protected]>

* fix gen

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>
Signed-off-by: jukie <[email protected]>

* findOwningGateway should return controller based on linked GatewayClass (#7611)

* fix: filter Gateway by controller in findOwningGateway

Prevent cross-controller Gateway mutations by validating GatewayClass

Signed-off-by: Sudipto Baral <[email protected]>
Signed-off-by: jukie <[email protected]>

* fix: use default when namespace is unset (#7612)

* fix: use default when namespace is unset

Signed-off-by: zirain <[email protected]>

* fix

Signed-off-by: zirain <[email protected]>

* fix test

Signed-off-by: zirain <[email protected]>

---------

Signed-off-by: zirain <[email protected]>
Signed-off-by: jukie <[email protected]>

* fix: prevent skeleton route status entries for unmanaged GatewayClasses (#7536)

* fix: prevent skeleton route status entries for unmanaged GatewayClasses

When processing policies (EnvoyExtensionPolicy, SecurityPolicy), the translator
was calling GetRouteParentContext for ALL parentRefs in a route, even those
referencing gateways with different GatewayClasses not managed by this translator.

GetRouteParentContext creates a skeleton RouteParentStatus entry with just the
controllerName when called on a parentRef that hasn't been processed yet. Since
all GatewayClass instances share the same controller name, these skeleton entries
persisted in status without conditions.

The fix checks if a parentRef context already exists before attempting to apply
policy configuration to it. If the context doesn't exist, it means this parentRef
wasn't processed by this translator and should be skipped.

Signed-off-by: Raj Singh <[email protected]>

* fix: also prevent skeleton entries in BackendTrafficPolicy processing

The same issue exists in BackendTrafficPolicy route processing - calling
GetRouteParentContext for all parentRefs creates skeleton status entries.

Apply the same fix: check if parentRef context exists before adding to list.

Signed-off-by: Raj Singh <[email protected]>

---------

Signed-off-by: Raj Singh <[email protected]>
Signed-off-by: jukie <[email protected]>

* lint

Signed-off-by: jukie <[email protected]>

---------

Signed-off-by: Zachary Vacura <[email protected]>
Signed-off-by: jukie <[email protected]>
Signed-off-by: Huabing Zhao <[email protected]>
Signed-off-by: Huabing (Robin) Zhao <[email protected]>
Signed-off-by: Shreemaan Abhishek <[email protected]>
Signed-off-by: cong <[email protected]>
Signed-off-by: zirain <[email protected]>
Signed-off-by: Sudipto Baral <[email protected]>
Signed-off-by: Raj Singh <[email protected]>
Co-authored-by: Zach Vacura <[email protected]>
Co-authored-by: Huabing (Robin) Zhao <[email protected]>
Co-authored-by: shreealt <[email protected]>
Co-authored-by: 聪 <[email protected]>
Co-authored-by: zirain <[email protected]>
Co-authored-by: Sudipto Baral <[email protected]>
Co-authored-by: Raj Singh <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants