Skip to content

Optimize pod cache#6936

Merged
zirain merged 8 commits intoenvoyproxy:mainfrom
jukie:topology-webhook-cache-bug
Sep 11, 2025
Merged

Optimize pod cache#6936
zirain merged 8 commits intoenvoyproxy:mainfrom
jukie:topology-webhook-cache-bug

Conversation

@jukie
Copy link
Copy Markdown
Contributor

@jukie jukie commented Sep 10, 2025

What this PR does / why we need it:

  • Optimize controller cache to only cache EnvoyProxy pods. There's a race condition between cache sync and the topology injector running and in large clusters ~5% of the time the caches aren't synced in time.
  • If cach.Get() fails, run get against API server directly.

Which issue(s) this PR fixes:

Fixes #6932

Release Notes: Yes

Signed-off-by: jukie <[email protected]>
Signed-off-by: jukie <[email protected]>
@jukie jukie marked this pull request as ready for review September 10, 2025 21:22
@jukie jukie requested a review from a team as a code owner September 10, 2025 21:22
@jukie jukie requested a review from a team September 10, 2025 21:23
@jukie jukie added this to the v1.6.0-rc.1 Release milestone Sep 10, 2025
@codecov
Copy link
Copy Markdown

codecov bot commented Sep 10, 2025

Codecov Report

❌ Patch coverage is 38.88889% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.03%. Comparing base (80d00a0) to head (f6ea9d5).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
internal/provider/kubernetes/topology_injector.go 0.00% 7 Missing ⚠️
internal/provider/kubernetes/kubernetes.go 63.63% 4 Missing ⚠️

❌ Your patch status has failed because the patch coverage (38.88%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6936   +/-   ##
=======================================
  Coverage   71.03%   71.03%           
=======================================
  Files         226      226           
  Lines       40026    40037   +11     
=======================================
+ Hits        28431    28442   +11     
- Misses       9911     9914    +3     
+ Partials     1684     1681    -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Signed-off-by: jukie <[email protected]>
@jukie jukie force-pushed the topology-webhook-cache-bug branch from de74d1c to 0233919 Compare September 10, 2025 22:06
@jukie jukie requested a review from arkodg September 10, 2025 22:07
@jukie jukie force-pushed the topology-webhook-cache-bug branch from 5b85f3f to af47036 Compare September 10, 2025 23:15
arkodg
arkodg previously approved these changes Sep 10, 2025
Copy link
Copy Markdown
Contributor

@arkodg arkodg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @jukie can you share any data on mem reduction

@arkodg arkodg requested review from a team September 10, 2025 23:57
@jukie
Copy link
Copy Markdown
Contributor Author

jukie commented Sep 11, 2025

The cache can probably be optimized even further and drop other unused resources completely or use similar filtering. In the case of pods the memory savings would in theory scale based on the size of the cluster.

I can share more data tomorrow once I run this in a cluster actually hitting the issue but in a staging cluster running this patch it's down from ~570Mi --> 540Mi.

Edit: that doesn't seem accurate so will probably need to let it run longer.

@jukie jukie requested a review from arkodg September 11, 2025 01:12
@arkodg arkodg requested review from a team September 11, 2025 02:27
@zirain zirain enabled auto-merge (squash) September 11, 2025 02:51
@jukie
Copy link
Copy Markdown
Contributor Author

jukie commented Sep 11, 2025

/retest

@zirain zirain merged commit da0d913 into envoyproxy:main Sep 11, 2025
50 of 53 checks passed
@jukie jukie deleted the topology-webhook-cache-bug branch September 11, 2025 06:50
@jukie
Copy link
Copy Markdown
Contributor Author

jukie commented Sep 11, 2025

Could the main pipeline for da0d913 be retried for the docker push please?

zirain added a commit to zirain/gateway that referenced this pull request Sep 16, 2025
* Optimize pod cache

Signed-off-by: jukie <[email protected]>

* release note

Signed-off-by: jukie <[email protected]>

* Remove retry

Signed-off-by: jukie <[email protected]>

* cleanup

Signed-off-by: jukie <[email protected]>

---------

Signed-off-by: jukie <[email protected]>
Signed-off-by: Isaac <[email protected]>
Co-authored-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>
zirain added a commit to zirain/gateway that referenced this pull request Sep 16, 2025
* Optimize pod cache

Signed-off-by: jukie <[email protected]>

* release note

Signed-off-by: jukie <[email protected]>

* Remove retry

Signed-off-by: jukie <[email protected]>

* cleanup

Signed-off-by: jukie <[email protected]>

---------

Signed-off-by: jukie <[email protected]>
Signed-off-by: Isaac <[email protected]>
Co-authored-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>
zirain added a commit that referenced this pull request Sep 16, 2025
* fix: cluster stat name: lowercase Kind (#6780)

cluster stat name: lowercase Kind

Signed-off-by: Guy Daich <[email protected]>
Signed-off-by: zirain <[email protected]>

* fix: envoy service cluster name for zone-aware routing (#6763)

* fix!: fix envoy service cluster name for zone-aware routing

Signed-off-by: y-rabie <[email protected]>

* extend e2e tests for zone aware routing

Signed-off-by: y-rabie <[email protected]>

* extend unit tests for zone aware routing

Signed-off-by: y-rabie <[email protected]>

---------

Signed-off-by: y-rabie <[email protected]>
Signed-off-by: zirain <[email protected]>

* conformance: update experimental test report (#6782)

* conformance: update experimental test report

Signed-off-by: zirain <[email protected]>

* fix version

Signed-off-by: zirain <[email protected]>

* fix(api): image validation regex, support port in repository (#6819)

fix: match repository in image with port

Signed-off-by: Windfarer <[email protected]>
Signed-off-by: zirain <[email protected]>

* fix: Actually update xdsIR with maxAcceptPerSocketEvent (#6834)

* Actually update xdsIR with maxAcceptPerSocketEvent

Signed-off-by: jukie <[email protected]>

* release note

Signed-off-by: jukie <[email protected]>

* newline lint

Signed-off-by: jukie <[email protected]>

---------

Signed-off-by: jukie <[email protected]>
Signed-off-by: zirain <[email protected]>

* bugfix: fix the topologyInjectorDisabled and the local cluster was not defined (#6847)

* bugfix: fix the topologyInjectorDisabled and the local cluster was not defined.

Signed-off-by: qicz <[email protected]>

* fix ut

Signed-off-by: qicz <[email protected]>

* add topology-injector-enabled ut

Signed-off-by: qicz <[email protected]>

* add release note

Signed-off-by: qi <[email protected]>

---------

Signed-off-by: qicz <[email protected]>
Signed-off-by: qi <[email protected]>
Signed-off-by: zirain <[email protected]>

* fix(logging): correct log formatting to avoid DPANIC in controller-runtime logger (#6846)

* Update filters.go

Signed-off-by: TomerJLevy <[email protected]>

* add release notes

Signed-off-by: TomerJLevy <[email protected]>

---------

Signed-off-by: TomerJLevy <[email protected]>
Signed-off-by: zirain <[email protected]>

* fix: handle context errors as transient errors (#6850)

* handle context errors as transient errors

Signed-off-by: TomerJLevy <[email protected]>

* add test cases

Signed-off-by: TomerJLevy <[email protected]>

* no need the new line

Signed-off-by: TomerJLevy <[email protected]>

* add release notes

Signed-off-by: TomerJLevy <[email protected]>

* Return the error as is

Signed-off-by: TomerJLevy <[email protected]>

* revert redundant changes

Signed-off-by: TomerJLevy <[email protected]>

* revert unrelated changes

Signed-off-by: TomerJLevy <[email protected]>

* revert more changes...

Signed-off-by: TomerJLevy <[email protected]>

---------

Signed-off-by: TomerJLevy <[email protected]>
Co-authored-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>

* bugfix: the controller cannot read the EnvoyProxy attached gatewayclass only. (#6838)

Signed-off-by: zirain <[email protected]>

* chore: fix CVE (#6903)

Signed-off-by: Shahar Harari <[email protected]>
Signed-off-by: zirain <[email protected]>

* fix: nil pointer dereference in btp configmap indexer (#6921)

Signed-off-by: Rudrakh Panigrahi <[email protected]>
Signed-off-by: zirain <[email protected]>

* improve targetRef selection for targetSelectors (#6917)

* improve targetRef selection for targetSelectors

* only select refs in the same namespace as the policy

Signed-off-by: Arko Dasgupta <[email protected]>

* fix lint

Signed-off-by: Arko Dasgupta <[email protected]>

---------

Signed-off-by: Arko Dasgupta <[email protected]>
Signed-off-by: zirain <[email protected]>

* fix: suppress lua validation logs (#6929)

Signed-off-by: Rudrakh Panigrahi <[email protected]>
Signed-off-by: zirain <[email protected]>

* fix: rm incorrectly set exclusiveMaximum field in CRD (#6926)

* fix: rm incorrectly set exclusiveMaximum field in CRD

* Also fix maximum value to 599 which includes 599 as a valid num

Fixes: #6925

Signed-off-by: Arko Dasgupta <[email protected]>
Signed-off-by: zirain <[email protected]>

* fix: rm Strict SameSite default (#6941)

* by default it should be unset which implies `Lax`

Relates to #6347

Signed-off-by: Arko Dasgupta <[email protected]>
Signed-off-by: zirain <[email protected]>

* Optimize pod cache (#6936)

* Optimize pod cache

Signed-off-by: jukie <[email protected]>

* release note

Signed-off-by: jukie <[email protected]>

* Remove retry

Signed-off-by: jukie <[email protected]>

* cleanup

Signed-off-by: jukie <[email protected]>

---------

Signed-off-by: jukie <[email protected]>
Signed-off-by: Isaac <[email protected]>
Co-authored-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>

* reduce DeepCopy in gateway-api layer (#6940)

* reduce deep copy in gateway-api layer

* also fixed the DeepCopy implementation for ControllerResources
which was performing a Shallow Copy resulting it lack of isolation
b/w provider and gateway-api layer

Relates to #6919

Signed-off-by: Arko Dasgupta <[email protected]>

* fix lint

Signed-off-by: Arko Dasgupta <[email protected]>

---------

Signed-off-by: Arko Dasgupta <[email protected]>
Signed-off-by: zirain <[email protected]>

* fix: validation for grpc routes with extension ref filters (#6949)

Signed-off-by: Rudrakh Panigrahi <[email protected]>
Signed-off-by: zirain <[email protected]>

* fix: cleanup dangling route status conditions (#6812)

Signed-off-by: y-rabie <[email protected]>
Signed-off-by: zirain <[email protected]>

* Fix: Add missing patch annotations to Compression struct for proper Merge (#6951)

* fix: merge compression annotation

Signed-off-by: sudipto baral <[email protected]>

* test: add more compression merge test cases

Signed-off-by: sudipto baral <[email protected]>

---------

Signed-off-by: sudipto baral <[email protected]>
Signed-off-by: zirain <[email protected]>

* fix: update distroless image to resolve glibc CVEs (#6953)

Signed-off-by: Shahar Harari <[email protected]>
Signed-off-by: zirain <[email protected]>

* chore: bump golang to 1.24.7 (#6959)

chore: bump golang

Signed-off-by: zirain <[email protected]>

* fix: Make sure proxy protocol filter is the first listener filter (#6972)

Fixes: #6873

Signed-off-by: Arko Dasgupta <[email protected]>
Co-authored-by: Jacob Neil Taylor <[email protected]>
Signed-off-by: zirain <[email protected]>

* release notes

Signed-off-by: zirain <[email protected]>

* Removes reflection from RouteContext to reduce allocations (#6820)

* bench: adds APIToXDS bench & small opt

Signed-off-by: Takeshi Yoneda <[email protected]>

* no refect

goos: darwin
goarch: arm64
pkg: github.com/envoyproxy/gateway/test/gobench
cpu: Apple M1 Pro
                          │   old.txt    │               new.txt               │
                          │    sec/op    │   sec/op     vs base                │
GatewayAPItoXDS/small-10    881.2µ ±  7%   803.4µ ± 1%   -8.82% (p=0.000 n=10)
GatewayAPItoXDS/medium-10   4.130m ± 26%   2.959m ± 4%  -28.36% (p=0.000 n=10)
GatewayAPItoXDS/large-10     5.375 ±  2%    4.553 ± 1%  -15.28% (p=0.000 n=10)
geomean                     26.94m         22.12m       -17.90%

                          │   old.txt    │               new.txt                │
                          │     B/op     │     B/op      vs base                │
GatewayAPItoXDS/small-10    507.2Ki ± 0%   492.9Ki ± 0%   -2.83% (p=0.000 n=10)
GatewayAPItoXDS/medium-10   2.545Mi ± 7%   1.954Mi ± 2%  -23.21% (p=0.000 n=10)
GatewayAPItoXDS/large-10    2.832Gi ± 0%   2.831Gi ± 0%        ~ (p=0.529 n=10)
geomean                     15.40Mi        13.97Mi        -9.31%

                          │   old.txt   │               new.txt               │
                          │  allocs/op  │  allocs/op   vs base                │
GatewayAPItoXDS/small-10    8.328k ± 0%   8.017k ± 0%   -3.73% (p=0.000 n=10)
GatewayAPItoXDS/medium-10   39.45k ± 6%   29.74k ± 2%  -24.60% (p=0.000 n=10)
GatewayAPItoXDS/large-10    38.75M ± 0%   38.71M ± 0%   -0.11% (p=0.000 n=10)
geomean                     233.5k        209.8k       -10.16%

Signed-off-by: Takeshi Yoneda <[email protected]>

* removes garbage

Signed-off-by: Takeshi Yoneda <[email protected]>

* more tests

Signed-off-by: Takeshi Yoneda <[email protected]>

* more tests

Signed-off-by: Takeshi Yoneda <[email protected]>

---------

Signed-off-by: Takeshi Yoneda <[email protected]>

---------

Signed-off-by: Guy Daich <[email protected]>
Signed-off-by: zirain <[email protected]>
Signed-off-by: y-rabie <[email protected]>
Signed-off-by: Windfarer <[email protected]>
Signed-off-by: jukie <[email protected]>
Signed-off-by: qicz <[email protected]>
Signed-off-by: qi <[email protected]>
Signed-off-by: TomerJLevy <[email protected]>
Signed-off-by: Shahar Harari <[email protected]>
Signed-off-by: Rudrakh Panigrahi <[email protected]>
Signed-off-by: Arko Dasgupta <[email protected]>
Signed-off-by: Isaac <[email protected]>
Signed-off-by: sudipto baral <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Co-authored-by: Guy Daich <[email protected]>
Co-authored-by: Youssef Rabie <[email protected]>
Co-authored-by: Windfarer <[email protected]>
Co-authored-by: Isaac <[email protected]>
Co-authored-by: qi <[email protected]>
Co-authored-by: TomerJLevy <[email protected]>
Co-authored-by: shahar-h <[email protected]>
Co-authored-by: Rudrakh Panigrahi <[email protected]>
Co-authored-by: Arko Dasgupta <[email protected]>
Co-authored-by: Sudipto Baral <[email protected]>
Co-authored-by: Jacob Neil Taylor <[email protected]>
Co-authored-by: Takeshi Yoneda <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: topologyInjector race condition

3 participants