fix(xds-server): clear snapshot on stream close by Hackzzila · Pull Request #6618 · envoyproxy/gateway

Hackzzila · 2025-07-28T18:26:51Z

In the snapshotcache, only snapshots for nodes with active connections are updated.

gateway/internal/xds/cache/snapshotcache.go

Lines 92 to 95 in ceec666

    
           for _, node := range s.getNodeIDs(irKey) { 
        
           	s.log.Debugf("Generating a snapshot with Node %s", node) 
        
           	if err = s.SetSnapshot(context.TODO(), node, snapshot); err != nil {

As far as I can tell, these snapshots are never deleted. When a node disconnects and then reconnects it will receive the configuration from the time it disconnected. It will only get the latest configuration when there is an update while it is still connected.

This is a big problem when you are running multiple replicas of envoy-gateway, if a proxy flips between which instance it is connected to it can receive data that is several days out of date. We have seen proxy instances suddenly start using certificates that are expired and have already been rotated, and use pod IPs that do not exist anymore.

If we clear the snapshot for a node when it disconnects, it will get the latest snapshot when it reconnects (this does not happen if a node already has a cached snapshot).

gateway/internal/xds/cache/snapshotcache.go

Lines 208 to 210 in ceec666

    
           _, err := s.GetSnapshot(nodeID) 
        
           if err != nil { 
        
           	err = s.SetSnapshot(context.TODO(), nodeID, s.lastSnapshot[cluster])

Signed-off-by: Zachary Vacura <[email protected]>

internal/xds/cache/snapshotcache.go

codecov · 2025-07-28T18:53:56Z

Codecov Report

❌ Patch coverage is 0% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.02%. Comparing base (ceec666) to head (cfb177c).
⚠️ Report is 483 commits behind head on main.

Files with missing lines	Patch %	Lines
internal/xds/cache/snapshotcache.go	0.00%	21 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6618      +/-   ##
==========================================
- Coverage   71.03%   71.02%   -0.01%     
==========================================
  Files         225      225              
  Lines       39153    39173      +20     
==========================================
+ Hits        27811    27823      +12     
- Misses       9732     9743      +11     
+ Partials     1610     1607       -3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

arkodg · 2025-07-28T20:06:24Z

thanks for flagging this edge case
although for ADS there should be 1 stream per node, we cannot rely on it

instead of clearing the snapshot in OnStreamClosed , we could compare snapshots using GetVersionMap in OnStreamRequest which executesGetSnapshot

a cheaper alternative could be to hold a nodeId frequency map of stream counts, and only clear snapshot when no streams are available, my preference would be this approach

arkodg · 2025-07-28T20:14:25Z

This is a big problem when you are running multiple replicas of envoy-gateway, if a proxy flips between which instance it is connected to it can receive data that is several days out of date. We have seen proxy instances suddenly start using certificates that are expired and have already been rotated, and use pod IPs that do not exist anymore.

trying to understand when this realistically occurs - is this a race between pod restart and Gateway API config update ?

Hackzzila · 2025-07-29T14:01:39Z

trying to understand when this realistically occurs - is this a race between pod restart and Gateway API config update ?

We've seen this happen at random in our clusters. If there is a brief network interruption a proxy may get disconnected from envoy-gateway and then connect to a different instance (one that it was connected to previously).

…shot Signed-off-by: Zachary Vacura <[email protected]>

arkodg

LGTM thanks !

guydc · 2025-07-29T23:06:43Z

@Hackzzila - Thanks for this. I would really appreciate an E2E test like the ones that we have in the resilience suite, demonstrating that this works as expected after creating an disconnection with a network policy.

If that's too much work, can you confirm that you manually tested the change in a similar way ?

Hackzzila · 2025-07-30T00:25:32Z

@Hackzzila - Thanks for this. I would really appreciate an E2E test like the ones that we have in the resilience suite, demonstrating that this works as expected after creating an disconnection with a network policy.

If that's too much work, can you confirm that you manually tested the change in a similar way ?

I did a test by connecting to envoy-gateway with grpcurl and sending requests manually. I connected and verified the correct config was being returned, then disconnected and restarted a deployment, and then reconnected and saw that the IPs were not updated. I then did this same test with the patch and verified that it returned the correct data. I did not test the delta variant, or multiple streams for the same node.

If nothing major comes up at work I can investigate writing a test for this on Thursday/Friday 👍

jukie

+1 on adding an E2E but non-blocking

jukie · 2025-07-30T14:46:04Z

Thanks! If you're unable to do so before merging could you create an issue tracking the E2E followup?

* fix(xds-server): clear snapshot on stream close Signed-off-by: Zachary Vacura <[email protected]> * check if there are other active connections before clearning the snapshot Signed-off-by: Zachary Vacura <[email protected]> Signed-off-by: shawnh2 <[email protected]>

* fix(xds-server): clear snapshot on stream close (#6618) * fix(xds-server): clear snapshot on stream close Signed-off-by: Zachary Vacura <[email protected]> * check if there are other active connections before clearning the snapshot Signed-off-by: Zachary Vacura <[email protected]> Signed-off-by: shawnh2 <[email protected]> * bug: disable x-envoy-ratelimited by default (#7110) * bug: disable x-envoy-ratelimited by default * can be enabled with `enableEnvoyHeaders` in CTP Relates to #7034 Signed-off-by: Arko Dasgupta <[email protected]> * fix tests and release note Signed-off-by: Arko Dasgupta <[email protected]> * fix testdata Signed-off-by: Arko Dasgupta <[email protected]> --------- Signed-off-by: Arko Dasgupta <[email protected]> Signed-off-by: shawnh2 <[email protected]> * fix(translator): Fix panic with request mirror + grpcroute (#6875) Signed-off-by: Andrew Moreland <[email protected]> Signed-off-by: shawnh2 <[email protected]> * fix: bug in overlap detection of cert SANs (#7234) Signed-off-by: shawnh2 <[email protected]> * bump golang for crypto/x509 reggression (#7236) Signed-off-by: zirain <[email protected]> Signed-off-by: shawnh2 <[email protected]> * fix gen-check Signed-off-by: shawnh2 <[email protected]> --------- Signed-off-by: Zachary Vacura <[email protected]> Signed-off-by: shawnh2 <[email protected]> Signed-off-by: Arko Dasgupta <[email protected]> Signed-off-by: Andrew Moreland <[email protected]> Signed-off-by: zirain <[email protected]> Co-authored-by: Zach Vacura <[email protected]> Co-authored-by: Arko Dasgupta <[email protected]> Co-authored-by: Andrew Moreland <[email protected]> Co-authored-by: Rudrakh Panigrahi <[email protected]> Co-authored-by: zirain <[email protected]>

* fix(xds-server): clear snapshot on stream close Signed-off-by: Zachary Vacura <[email protected]> * check if there are other active connections before clearning the snapshot Signed-off-by: Zachary Vacura <[email protected]> Signed-off-by: jukie <[email protected]>

* fix(xds-server): clear snapshot on stream close (#6618) * fix(xds-server): clear snapshot on stream close Signed-off-by: Zachary Vacura <[email protected]> * check if there are other active connections before clearning the snapshot Signed-off-by: Zachary Vacura <[email protected]> Signed-off-by: jukie <[email protected]> * fix: oidc authentication endpoint was overwritten by discovered value (#7460) fix: oid authentication endpoint was overriden by discovered value Signed-off-by: Huabing Zhao <[email protected]> Signed-off-by: Huabing (Robin) Zhao <[email protected]> Signed-off-by: jukie <[email protected]> * ci: add script to free disk space (#7534) * feat: free disk space Signed-off-by: Shreemaan Abhishek <[email protected]> * lint Signed-off-by: Shreemaan Abhishek <[email protected]> * cleanup Signed-off-by: Shreemaan Abhishek <[email protected]> * make target and tools/hack Signed-off-by: Shreemaan Abhishek <[email protected]> * lint Signed-off-by: Shreemaan Abhishek <[email protected]> * modular action Signed-off-by: Shreemaan Abhishek <[email protected]> --------- Signed-off-by: Shreemaan Abhishek <[email protected]> Signed-off-by: jukie <[email protected]> * treat too many addresses as programmed (#7542) Signed-off-by: cong <[email protected]> Signed-off-by: jukie <[email protected]> * feat: reclaim space in release pipeline (#7587) Signed-off-by: Shreemaan Abhishek <[email protected]> Signed-off-by: jukie <[email protected]> * chore: bump golang.org/x/crypto (#7588) * chore: bump golang.org/x/crypto Signed-off-by: zirain <[email protected]> * fix gen Signed-off-by: zirain <[email protected]> --------- Signed-off-by: zirain <[email protected]> Signed-off-by: jukie <[email protected]> * findOwningGateway should return controller based on linked GatewayClass (#7611) * fix: filter Gateway by controller in findOwningGateway Prevent cross-controller Gateway mutations by validating GatewayClass Signed-off-by: Sudipto Baral <[email protected]> Signed-off-by: jukie <[email protected]> * fix: use default when namespace is unset (#7612) * fix: use default when namespace is unset Signed-off-by: zirain <[email protected]> * fix Signed-off-by: zirain <[email protected]> * fix test Signed-off-by: zirain <[email protected]> --------- Signed-off-by: zirain <[email protected]> Signed-off-by: jukie <[email protected]> * fix: prevent skeleton route status entries for unmanaged GatewayClasses (#7536) * fix: prevent skeleton route status entries for unmanaged GatewayClasses When processing policies (EnvoyExtensionPolicy, SecurityPolicy), the translator was calling GetRouteParentContext for ALL parentRefs in a route, even those referencing gateways with different GatewayClasses not managed by this translator. GetRouteParentContext creates a skeleton RouteParentStatus entry with just the controllerName when called on a parentRef that hasn't been processed yet. Since all GatewayClass instances share the same controller name, these skeleton entries persisted in status without conditions. The fix checks if a parentRef context already exists before attempting to apply policy configuration to it. If the context doesn't exist, it means this parentRef wasn't processed by this translator and should be skipped. Signed-off-by: Raj Singh <[email protected]> * fix: also prevent skeleton entries in BackendTrafficPolicy processing The same issue exists in BackendTrafficPolicy route processing - calling GetRouteParentContext for all parentRefs creates skeleton status entries. Apply the same fix: check if parentRef context exists before adding to list. Signed-off-by: Raj Singh <[email protected]> --------- Signed-off-by: Raj Singh <[email protected]> Signed-off-by: jukie <[email protected]> * lint Signed-off-by: jukie <[email protected]> --------- Signed-off-by: Zachary Vacura <[email protected]> Signed-off-by: jukie <[email protected]> Signed-off-by: Huabing Zhao <[email protected]> Signed-off-by: Huabing (Robin) Zhao <[email protected]> Signed-off-by: Shreemaan Abhishek <[email protected]> Signed-off-by: cong <[email protected]> Signed-off-by: zirain <[email protected]> Signed-off-by: Sudipto Baral <[email protected]> Signed-off-by: Raj Singh <[email protected]> Co-authored-by: Zach Vacura <[email protected]> Co-authored-by: Huabing (Robin) Zhao <[email protected]> Co-authored-by: shreealt <[email protected]> Co-authored-by: 聪 <[email protected]> Co-authored-by: zirain <[email protected]> Co-authored-by: Sudipto Baral <[email protected]> Co-authored-by: Raj Singh <[email protected]>

fix(xds-server): clear snapshot on stream close

d5d95b9

Signed-off-by: Zachary Vacura <[email protected]>

Hackzzila requested a review from a team as a code owner July 28, 2025 18:26

Hackzzila commented Jul 28, 2025

View reviewed changes

internal/xds/cache/snapshotcache.go Outdated Show resolved Hide resolved

arkodg added this to the v1.5.0 Release milestone Jul 28, 2025

check if there are other active connections before clearning the snap…

cfb177c

…shot Signed-off-by: Zachary Vacura <[email protected]>

arkodg approved these changes Jul 29, 2025

View reviewed changes

arkodg requested review from a team July 29, 2025 19:35

jukie approved these changes Jul 30, 2025

View reviewed changes

arkodg merged commit 6e21739 into envoyproxy:main Jul 30, 2025
27 of 28 checks passed

arkodg mentioned this pull request Jul 30, 2025

Add Proxy disconnects with Configs updates in Resilience E2E #6650

Open

arkodg added the cherrypick/release-v1.4.5 label Oct 9, 2025

zirain added the cherrypick/release-v1.5.6 label Dec 3, 2025

arkodg mentioned this pull request Dec 10, 2025

Inconsistent Envoy Gateway replica state #7115

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(xds-server): clear snapshot on stream close#6618

fix(xds-server): clear snapshot on stream close#6618
arkodg merged 2 commits intoenvoyproxy:mainfrom
Hackzzila:zvacura/snapshotcache-bug

Hackzzila commented Jul 28, 2025

Uh oh!

Uh oh!

codecov bot commented Jul 28, 2025 •

edited

Loading

Uh oh!

arkodg commented Jul 28, 2025

Uh oh!

arkodg commented Jul 28, 2025

Uh oh!

Hackzzila commented Jul 29, 2025

Uh oh!

arkodg left a comment

Uh oh!

guydc commented Jul 29, 2025

Uh oh!

Hackzzila commented Jul 30, 2025

Uh oh!

jukie left a comment

Uh oh!

jukie commented Jul 30, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

	for _, node := range s.getNodeIDs(irKey) {
	s.log.Debugf("Generating a snapshot with Node %s", node)

	if err = s.SetSnapshot(context.TODO(), node, snapshot); err != nil {

	_, err := s.GetSnapshot(nodeID)
	if err != nil {
	err = s.SetSnapshot(context.TODO(), nodeID, s.lastSnapshot[cluster])

Conversation

Hackzzila commented Jul 28, 2025

Uh oh!

Uh oh!

codecov bot commented Jul 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

arkodg commented Jul 28, 2025

Uh oh!

arkodg commented Jul 28, 2025

Uh oh!

Hackzzila commented Jul 29, 2025

Uh oh!

arkodg left a comment

Choose a reason for hiding this comment

Uh oh!

guydc commented Jul 29, 2025

Uh oh!

Hackzzila commented Jul 30, 2025

Uh oh!

jukie left a comment

Choose a reason for hiding this comment

Uh oh!

jukie commented Jul 30, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

codecov bot commented Jul 28, 2025 •

edited

Loading