ipam/aws: add flag to disable route table discovery by vietcgi · Pull Request #42365 · cilium/cilium

vietcgi · 2025-10-24T02:27:22Z

What

Adds --aws-disable-route-table-discovery flag to the operator.

Why

After #37229, the operator fetches all route tables on every instance resync. This is causing production issues for users with large VPCs:

We have 172 route tables with 22k routes in our VPC
That's ~6MB of JSON per DescribeRouteTables call
Happens 100+ times per hour during normal operations
Operator keeps OOM'ing and AWS is rate limiting us

The route table feature is great for ensuring pods route traffic the same way as nodes, but many of us run with pod subnets completely separate from node subnets. In those cases we don't need route table awareness at all, and the memory/API cost isn't worth it.

How

When --aws-disable-route-table-discovery=true:

Skips all DescribeRouteTables API calls
Subnet selection falls back to availability zone matching
No route table mismatch warnings logged

Default is false so existing behavior is unchanged.

Testing

Updated all tests to pass the new parameter
Tested that operator compiles
Backwards compatible (no changes when flag not set)

Usage

operator:
  extraArgs:
    - --aws-disable-route-table-discovery=true

Fixes #42310

@liyihuang would love your thoughts on this approach. The flag gives immediate relief for users hitting memory limits while keeping the route table feature for those who need it.

Route table discovery was added in eec7036 to enable route-table-aware subnet selection. However, this causes issues in large VPCs: - Fetches all route tables on every resync (100+ calls per hour) - Large VPCs with 172 route tables and 22k routes = 6MB per call - Results in memory bloat, OOM kills, and AWS API throttling - Route table data is mostly static but never cached Many users run Cilium with pod subnets separate from node subnets, making route-table-aware selection unnecessary. Add a flag to disable it entirely for these environments. When disabled: - Zero route table API calls - Minimal memory overhead - Falls back to AZ-based subnet selection - No route table mismatch warnings Fixes: cilium#42310 Signed-off-by: Kevin Vu <[email protected]>

squeed · 2025-10-24T13:36:25Z

Thanks for the PR! As an uninitiated outsider to this particular corner of the codebase, two questions come to mind:

Is there a way we can safely auto-detect this situation and disable route-table listing by heuristic? I am on a minor mission to remove configuration flags. Ideally this would be in the operator itself, but at cilium install time would also be nice.
Given that listing route tables can be expensive, it sure seems like we do it too frequently. In your ideal world, what would be a good edge to trigger such a list?

I'm not asking you to do all this, but I'd like to know exactly the problem space.

liyihuang · 2025-10-24T14:24:04Z

Do you mind to wait a bit to figure out the best way in #42310 before we start to work on the flag?

vietcgi · 2025-10-25T15:26:21Z

@squeed

Q1: Can we auto-detect this?

Safest heuristic: Count route tables per AZ. If there's only one route table in the AZ, all subnets route identically → awareness is pointless. This is provably safe.

Problem with other heuristics (subnet overlap checks, etc): We can't know if "pod traffic needs same routing as nodes" without knowing user intent. Separate subnets might still need identical VPN routes. A
heuristic could silently break hybrid connectivity.

Recommendation: Auto-disable when len(routeTablesInAZ) == 1, otherwise require explicit config. Covers simple VPCs (majority) while avoiding silent failures.

Q2: What's the right trigger for listing route tables?

Current: Every resync (100+/hour)Change frequency of route tables: Hours to weeksRatio: We refetch ~1000× more than they change

Ideal triggers:

Cache with 15-30min TTL (solves 98% of the problem immediately)
Only fetch when creating ENIs (not on every resync)
Lazy load (try simple subnet selection first, only fetch if needed)

Route tables are slow-changing infrastructure. We shouldn't be in the hot path of IPAM resync at all.

In my ideal world:

Fetch once at startup, cache for 30min
Skip entirely during periodic resync (only fetch instances/ENIs)
Refetch on cache expiry or manual trigger

This drops my 600MB/hour to ~12MB/hour (98% reduction) and eliminates the OOM.

The core problem: Route tables are 100× larger than other resources but change just as rarely. We treat them like instances (fetch constantly) when they should be treated like VPC topology (fetch once, cache
long).

liyihuang · 2025-10-26T02:09:10Z

I have different opinions where I expressed it in the #42310.

There are 2 issues where my previous PR introduced.

The OOM issue on operator. That can be resolved by only querying certain field and VPC filter where I mentioned here Significant memory usage increase for AWS Operator with 1.18 #42310 (comment). We can further reduce the memory consumption for other resource like SG, subnets, describe ec2 etc.
rate limit issue. I know we have 1 min internal full sync which is not ideal, but that shouldn't be the major reason of the issue since AWS refill 20 token sper second. I guess multiple single instance resync in a short time is causing the issue. I could review the ipam logic to see if we can skip VPC, subnet, RT and SG for single instance sync.

I can see your environment having the around 180 subnets and 180 route tables. it's more like route table per subnet. I dont think that's a common setup. Do you mind to share a bit about why? From my view, I usually see people put several subnets together sharing the same route table, where I think that's the target setup that cilium should consider.

Do you mind to share your opinion on my approach in #42310?

vietcgi · 2025-10-31T20:23:03Z

@liyihuang Fair points on the architectural approach. I think the disable flag and your refactor are complementary. The flag gives users like @antonipp immediate relief without waiting for the resync refactor. Skip VPC/subnet/RT/SG for instance resync. This helps everyone and is the right long-term fix. Both address different aspects, the flag is pragmatic, your refactor is clean. We can do both incrementally.

For the flag itself: it's low-risk, backwards compatible, and unblocks a real user. I think we should merge it while you work on the resync optimization. Thoughts?

squeed

One naming nit, but seems reasonable to me. I appreciate there is a broader fix in the works.

squeed · 2025-11-03T13:36:50Z

operator/cmd/provider_aws_flags.go

 	flags.Bool(operatorOption.AWSPaginationEnabled, true, "Enable pagination for AWS EC2 API requests. The default page size is 1000 items.")
 	option.BindEnv(vp, operatorOption.AWSPaginationEnabled)

+	flags.Bool(operatorOption.AWSDisableRouteTableDiscovery, false, "Disable route table discovery and route-table-aware subnet selection. "+


nit: we tend to use "enable" rather than "disable" options, since the double-negative is hard to parse. So call this aws-enable-route-table-discovery with a default of true. It will parse --aws-enable-route-table-discovery=false correctly.

This applies to the PR as a whole.

Done! I've renamed the flag to aws-enable-route-table-discovery with a default of true. This eliminates the double negative and matches Cilium's naming conventions. All logic has been inverted correctly, the behavior stays the same (route table discovery is enabled by default), just the flag semantics are clearer now.

liyihuang · 2025-11-03T14:10:27Z

@vietcgi Since we already have a lot of flags, do you mind to see #42529 fixes the issue? If not, let's look into this PR.

I have chatted one of my friends working at AWS as the solution architect in networking area. We all feel the route table should be the resource that only has a few in a VPC. I understand you may have your reason to have 100+ route table, I personally think we should keep it as true by default since that's more common setup for all AWS users if we want to merge this PR

antonipp · 2025-11-03T14:34:42Z

I think this flag could actually be useful, since #42529 should improve the memory usage but we still have the Route Table calls on the critical path of the full resync, so it would be great if we could turn it off. And I'm fine with having the flag be true by default.

vietcgi · 2025-11-03T14:50:15Z

Hey @liyihuang

checked out PR #42529. It’s close, but there are a few issues that could break prod:

Cache not initialized, m.vpcs/m.subnets start empty, so InstanceSync() can crash if it runs before the first resync.

Data race, reads happen without locks, writes use a mutex → classic race.

No tests, nothing covers InstanceSync() or cache behavior.

The idea’s solid, but needs cache init, proper locking, and some tests. Until then, PR #42365 is safer short-term.

maintainer-s-little-helper · 2025-11-03T15:05:45Z

Commit feddbcc does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

Signed-off-by: Kevin Vu <[email protected]>

liyihuang · 2025-11-03T15:21:42Z

Thanks for comment. Do you mind to put the comments in PR?

Cache not initialized, m.vpcs/m.subnets start empty, so InstanceSync() can crash if it runs before the first resync.

I personally dont think caches can be empty for InstanceSync() since the 1st full sync at here https://github.com/liyihuang/cilium/blob/e56bad826338931f14fe9f8735222b9e0905129a/operator/cmd/root.go#L647-L650.

I dont think there is a way for operator can start and let agent to create CiliumNode.

I'm not sure if I'm following it looks like there is a lock here

e56bad8#diff-120e1fc9482652cdd5001efe26f26ff049c32c601e30d1c3b34e9a3150431faeR386-R392.

Let me think about the testing

pkg/aws/eni/node.go

maintainer-s-little-helper · 2025-11-03T15:37:54Z

Commit 6b2d453 does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

operator/option/config.go

squeed · 2025-11-03T16:25:27Z

operator/cmd/provider_aws_flags.go

 	flags.Bool(operatorOption.AWSPaginationEnabled, true, "Enable pagination for AWS EC2 API requests. The default page size is 1000 items.")
 	option.BindEnv(vp, operatorOption.AWSPaginationEnabled)

+	flags.Bool(operatorOption.AWSEnableRouteTableDiscovery, true, "Enable route table discovery and route-table-aware subnet selection. "+


I'm not the expert here, but it's not always safe to disable this, right? Can you add a sentence describing the safety?

If this is really an experts-only option, you can mark it hidden to exclude it from docs.

Route table discovery was only added in 1.18 with #37229 and currently brings performance drawbacks (see #42310). Disabling it restores pre-1.18 operator behavior so it's completely safe and more stable performance-wise.
We could even consider defaulting this flag to false until the underlying feature is more stable.

@squeed Good point. I already added "Disabling is safe and restores pre-1.18 operator behavior" to the help text. It's a good safety valve for large route table environments.

maintainer-s-little-helper · 2025-11-03T17:21:17Z

Commits 6b2d453, 40e441b do not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

maintainer-s-little-helper · 2025-11-03T17:26:04Z

Commits 6b2d453, 40e441b, 9e2e5e7 do not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

… is disabled Signed-off-by: Kevin Vu <[email protected]>

Signed-off-by: Kevin Vu <[email protected]>

…odeSubnet Signed-off-by: Kevin Vu <[email protected]>

maintainer-s-little-helper · 2025-11-04T04:41:49Z

Commit 87d00cf does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

pchaigno · 2025-11-24T15:16:17Z

This doesn't look it has been in a passing state for the past three weeks (see the bot's comments above), so switching to draft while it's being worked on.

github-actions · 2025-12-25T02:12:42Z

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

github-actions · 2026-01-08T02:12:47Z

This pull request has not seen any activity since it was marked stale.
Closing.

vietcgi requested review from a team as code owners October 24, 2025 02:27

vietcgi requested review from hemanthmalla and tommyp1ckles October 24, 2025 02:27

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 24, 2025

vietcgi requested a review from squeed October 24, 2025 02:27

github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Oct 24, 2025

vietcgi force-pushed the fix-aws-route-table-memory branch 2 times, most recently from 0b2d35f to 49d6edb Compare October 24, 2025 02:32

vietcgi force-pushed the fix-aws-route-table-memory branch from 49d6edb to 73332f7 Compare October 24, 2025 02:40

squeed requested a review from gandro October 24, 2025 12:53

squeed requested changes Nov 3, 2025

View reviewed changes

maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Nov 3, 2025

refactor: rename flag to aws-enable-route-table-discovery

88eb62c

Signed-off-by: Kevin Vu <[email protected]>

vietcgi force-pushed the fix-aws-route-table-memory branch from feddbcc to 88eb62c Compare November 3, 2025 15:09

maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Nov 3, 2025

liyihuang requested changes Nov 3, 2025

View reviewed changes

pkg/aws/eni/node.go Outdated Show resolved Hide resolved

maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Nov 3, 2025

liyihuang reviewed Nov 3, 2025

View reviewed changes

operator/option/config.go Outdated Show resolved Hide resolved

squeed reviewed Nov 3, 2025

View reviewed changes

vietcgi added 3 commits November 3, 2025 09:32

fix: fallback to AZ-based subnet selection when route table discovery…

60f7003

… is disabled Signed-off-by: Kevin Vu <[email protected]>

fix: make route table discovery flag checks explicit in caller

a1d25a8

Signed-off-by: Kevin Vu <[email protected]>

fix: add explicit route table discovery flag checks for consistency

71aaee6

Signed-off-by: Kevin Vu <[email protected]>

vietcgi force-pushed the fix-aws-route-table-memory branch from 9e2e5e7 to 71aaee6 Compare November 3, 2025 17:33

maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Nov 3, 2025

vietcgi added 2 commits November 3, 2025 09:41

fix: remove redundant flag check from findSubnetInSameRouteTableWithN…

fc483a5

…odeSubnet Signed-off-by: Kevin Vu <[email protected]>

Merge branch 'main' into fix-aws-route-table-memory

87d00cf

maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Nov 4, 2025

pchaigno marked this pull request as draft November 24, 2025 15:16

github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Dec 25, 2025

github-actions bot closed this Jan 8, 2026

liyihuang mentioned this pull request Jan 8, 2026

pkg/aws: improve aws resources resync logic #42529

Merged

Conversation

vietcgi commented Oct 24, 2025

What

Why

How

Testing

Usage

Uh oh!

squeed commented Oct 24, 2025

Uh oh!

liyihuang commented Oct 24, 2025

Uh oh!

vietcgi commented Oct 25, 2025

Uh oh!

liyihuang commented Oct 26, 2025

Uh oh!

vietcgi commented Oct 31, 2025

Uh oh!

squeed left a comment

Choose a reason for hiding this comment

Uh oh!

squeed Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

vietcgi Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

liyihuang commented Nov 3, 2025

Uh oh!

antonipp commented Nov 3, 2025

Uh oh!

vietcgi commented Nov 3, 2025

Uh oh!

maintainer-s-little-helper bot commented Nov 3, 2025

Uh oh!

liyihuang commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

maintainer-s-little-helper bot commented Nov 3, 2025

Uh oh!

Uh oh!

squeed Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

HadrienPatte Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

vietcgi Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

maintainer-s-little-helper bot commented Nov 3, 2025

Uh oh!

maintainer-s-little-helper bot commented Nov 3, 2025

Uh oh!

maintainer-s-little-helper bot commented Nov 4, 2025

Uh oh!

pchaigno commented Nov 24, 2025

Uh oh!

github-actions bot commented Dec 25, 2025

Uh oh!

github-actions bot commented Jan 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

liyihuang commented Nov 3, 2025 •

edited

Loading

vietcgi Nov 3, 2025 •

edited

Loading