fix(xds-server): clear snapshot on stream close#6618
Conversation
Signed-off-by: Zachary Vacura <[email protected]>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #6618 +/- ##
==========================================
- Coverage 71.03% 71.02% -0.01%
==========================================
Files 225 225
Lines 39153 39173 +20
==========================================
+ Hits 27811 27823 +12
- Misses 9732 9743 +11
+ Partials 1610 1607 -3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
thanks for flagging this edge case instead of clearing the snapshot in a cheaper alternative could be to hold a |
trying to understand when this realistically occurs - is this a race between pod restart and Gateway API config update ? |
We've seen this happen at random in our clusters. If there is a brief network interruption a proxy may get disconnected from envoy-gateway and then connect to a different instance (one that it was connected to previously). |
…shot Signed-off-by: Zachary Vacura <[email protected]>
|
@Hackzzila - Thanks for this. I would really appreciate an E2E test like the ones that we have in the resilience suite, demonstrating that this works as expected after creating an disconnection with a network policy. If that's too much work, can you confirm that you manually tested the change in a similar way ? |
I did a test by connecting to envoy-gateway with grpcurl and sending requests manually. I connected and verified the correct config was being returned, then disconnected and restarted a deployment, and then reconnected and saw that the IPs were not updated. I then did this same test with the patch and verified that it returned the correct data. I did not test the delta variant, or multiple streams for the same node. If nothing major comes up at work I can investigate writing a test for this on Thursday/Friday 👍 |
jukie
left a comment
There was a problem hiding this comment.
+1 on adding an E2E but non-blocking
|
Thanks! If you're unable to do so before merging could you create an issue tracking the E2E followup? |
* fix(xds-server): clear snapshot on stream close Signed-off-by: Zachary Vacura <[email protected]> * check if there are other active connections before clearning the snapshot Signed-off-by: Zachary Vacura <[email protected]> Signed-off-by: shawnh2 <[email protected]>
* fix(xds-server): clear snapshot on stream close (#6618) * fix(xds-server): clear snapshot on stream close Signed-off-by: Zachary Vacura <[email protected]> * check if there are other active connections before clearning the snapshot Signed-off-by: Zachary Vacura <[email protected]> Signed-off-by: shawnh2 <[email protected]> * bug: disable x-envoy-ratelimited by default (#7110) * bug: disable x-envoy-ratelimited by default * can be enabled with `enableEnvoyHeaders` in CTP Relates to #7034 Signed-off-by: Arko Dasgupta <[email protected]> * fix tests and release note Signed-off-by: Arko Dasgupta <[email protected]> * fix testdata Signed-off-by: Arko Dasgupta <[email protected]> --------- Signed-off-by: Arko Dasgupta <[email protected]> Signed-off-by: shawnh2 <[email protected]> * fix(translator): Fix panic with request mirror + grpcroute (#6875) Signed-off-by: Andrew Moreland <[email protected]> Signed-off-by: shawnh2 <[email protected]> * fix: bug in overlap detection of cert SANs (#7234) Signed-off-by: shawnh2 <[email protected]> * bump golang for crypto/x509 reggression (#7236) Signed-off-by: zirain <[email protected]> Signed-off-by: shawnh2 <[email protected]> * fix gen-check Signed-off-by: shawnh2 <[email protected]> --------- Signed-off-by: Zachary Vacura <[email protected]> Signed-off-by: shawnh2 <[email protected]> Signed-off-by: Arko Dasgupta <[email protected]> Signed-off-by: Andrew Moreland <[email protected]> Signed-off-by: zirain <[email protected]> Co-authored-by: Zach Vacura <[email protected]> Co-authored-by: Arko Dasgupta <[email protected]> Co-authored-by: Andrew Moreland <[email protected]> Co-authored-by: Rudrakh Panigrahi <[email protected]> Co-authored-by: zirain <[email protected]>
* fix(xds-server): clear snapshot on stream close Signed-off-by: Zachary Vacura <[email protected]> * check if there are other active connections before clearning the snapshot Signed-off-by: Zachary Vacura <[email protected]> Signed-off-by: jukie <[email protected]>
* fix(xds-server): clear snapshot on stream close (#6618) * fix(xds-server): clear snapshot on stream close Signed-off-by: Zachary Vacura <[email protected]> * check if there are other active connections before clearning the snapshot Signed-off-by: Zachary Vacura <[email protected]> Signed-off-by: jukie <[email protected]> * fix: oidc authentication endpoint was overwritten by discovered value (#7460) fix: oid authentication endpoint was overriden by discovered value Signed-off-by: Huabing Zhao <[email protected]> Signed-off-by: Huabing (Robin) Zhao <[email protected]> Signed-off-by: jukie <[email protected]> * ci: add script to free disk space (#7534) * feat: free disk space Signed-off-by: Shreemaan Abhishek <[email protected]> * lint Signed-off-by: Shreemaan Abhishek <[email protected]> * cleanup Signed-off-by: Shreemaan Abhishek <[email protected]> * make target and tools/hack Signed-off-by: Shreemaan Abhishek <[email protected]> * lint Signed-off-by: Shreemaan Abhishek <[email protected]> * modular action Signed-off-by: Shreemaan Abhishek <[email protected]> --------- Signed-off-by: Shreemaan Abhishek <[email protected]> Signed-off-by: jukie <[email protected]> * treat too many addresses as programmed (#7542) Signed-off-by: cong <[email protected]> Signed-off-by: jukie <[email protected]> * feat: reclaim space in release pipeline (#7587) Signed-off-by: Shreemaan Abhishek <[email protected]> Signed-off-by: jukie <[email protected]> * chore: bump golang.org/x/crypto (#7588) * chore: bump golang.org/x/crypto Signed-off-by: zirain <[email protected]> * fix gen Signed-off-by: zirain <[email protected]> --------- Signed-off-by: zirain <[email protected]> Signed-off-by: jukie <[email protected]> * findOwningGateway should return controller based on linked GatewayClass (#7611) * fix: filter Gateway by controller in findOwningGateway Prevent cross-controller Gateway mutations by validating GatewayClass Signed-off-by: Sudipto Baral <[email protected]> Signed-off-by: jukie <[email protected]> * fix: use default when namespace is unset (#7612) * fix: use default when namespace is unset Signed-off-by: zirain <[email protected]> * fix Signed-off-by: zirain <[email protected]> * fix test Signed-off-by: zirain <[email protected]> --------- Signed-off-by: zirain <[email protected]> Signed-off-by: jukie <[email protected]> * fix: prevent skeleton route status entries for unmanaged GatewayClasses (#7536) * fix: prevent skeleton route status entries for unmanaged GatewayClasses When processing policies (EnvoyExtensionPolicy, SecurityPolicy), the translator was calling GetRouteParentContext for ALL parentRefs in a route, even those referencing gateways with different GatewayClasses not managed by this translator. GetRouteParentContext creates a skeleton RouteParentStatus entry with just the controllerName when called on a parentRef that hasn't been processed yet. Since all GatewayClass instances share the same controller name, these skeleton entries persisted in status without conditions. The fix checks if a parentRef context already exists before attempting to apply policy configuration to it. If the context doesn't exist, it means this parentRef wasn't processed by this translator and should be skipped. Signed-off-by: Raj Singh <[email protected]> * fix: also prevent skeleton entries in BackendTrafficPolicy processing The same issue exists in BackendTrafficPolicy route processing - calling GetRouteParentContext for all parentRefs creates skeleton status entries. Apply the same fix: check if parentRef context exists before adding to list. Signed-off-by: Raj Singh <[email protected]> --------- Signed-off-by: Raj Singh <[email protected]> Signed-off-by: jukie <[email protected]> * lint Signed-off-by: jukie <[email protected]> --------- Signed-off-by: Zachary Vacura <[email protected]> Signed-off-by: jukie <[email protected]> Signed-off-by: Huabing Zhao <[email protected]> Signed-off-by: Huabing (Robin) Zhao <[email protected]> Signed-off-by: Shreemaan Abhishek <[email protected]> Signed-off-by: cong <[email protected]> Signed-off-by: zirain <[email protected]> Signed-off-by: Sudipto Baral <[email protected]> Signed-off-by: Raj Singh <[email protected]> Co-authored-by: Zach Vacura <[email protected]> Co-authored-by: Huabing (Robin) Zhao <[email protected]> Co-authored-by: shreealt <[email protected]> Co-authored-by: 聪 <[email protected]> Co-authored-by: zirain <[email protected]> Co-authored-by: Sudipto Baral <[email protected]> Co-authored-by: Raj Singh <[email protected]>
In the snapshotcache, only snapshots for nodes with active connections are updated.
gateway/internal/xds/cache/snapshotcache.go
Lines 92 to 95 in ceec666
As far as I can tell, these snapshots are never deleted. When a node disconnects and then reconnects it will receive the configuration from the time it disconnected. It will only get the latest configuration when there is an update while it is still connected.
This is a big problem when you are running multiple replicas of envoy-gateway, if a proxy flips between which instance it is connected to it can receive data that is several days out of date. We have seen proxy instances suddenly start using certificates that are expired and have already been rotated, and use pod IPs that do not exist anymore.
If we clear the snapshot for a node when it disconnects, it will get the latest snapshot when it reconnects (this does not happen if a node already has a cached snapshot).
gateway/internal/xds/cache/snapshotcache.go
Lines 208 to 210 in ceec666