Skip to content

ipam/aws: add flag to disable route table discovery#42365

Closed
vietcgi wants to merge 7 commits intocilium:mainfrom
vietcgi:fix-aws-route-table-memory
Closed

ipam/aws: add flag to disable route table discovery#42365
vietcgi wants to merge 7 commits intocilium:mainfrom
vietcgi:fix-aws-route-table-memory

Conversation

@vietcgi
Copy link
Copy Markdown
Contributor

@vietcgi vietcgi commented Oct 24, 2025

What

Adds --aws-disable-route-table-discovery flag to the operator.

Why

After #37229, the operator fetches all route tables on every instance resync. This is causing production issues for users with large VPCs:

  • We have 172 route tables with 22k routes in our VPC
  • That's ~6MB of JSON per DescribeRouteTables call
  • Happens 100+ times per hour during normal operations
  • Operator keeps OOM'ing and AWS is rate limiting us

The route table feature is great for ensuring pods route traffic the same way as nodes, but many of us run with pod subnets completely separate from node subnets. In those cases we don't need route table awareness at all, and the memory/API cost isn't worth it.

How

When --aws-disable-route-table-discovery=true:

  • Skips all DescribeRouteTables API calls
  • Subnet selection falls back to availability zone matching
  • No route table mismatch warnings logged

Default is false so existing behavior is unchanged.

Testing

  • Updated all tests to pass the new parameter
  • Tested that operator compiles
  • Backwards compatible (no changes when flag not set)

Usage

operator:
  extraArgs:
    - --aws-disable-route-table-discovery=true

Fixes #42310


@liyihuang would love your thoughts on this approach. The flag gives immediate relief for users hitting memory limits while keeping the route table feature for those who need it.

@vietcgi vietcgi requested review from a team as code owners October 24, 2025 02:27
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 24, 2025
@vietcgi vietcgi requested a review from squeed October 24, 2025 02:27
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Oct 24, 2025
@vietcgi vietcgi force-pushed the fix-aws-route-table-memory branch 2 times, most recently from 0b2d35f to 49d6edb Compare October 24, 2025 02:32
Route table discovery was added in eec7036 to enable route-table-aware
subnet selection. However, this causes issues in large VPCs:

- Fetches all route tables on every resync (100+ calls per hour)
- Large VPCs with 172 route tables and 22k routes = 6MB per call
- Results in memory bloat, OOM kills, and AWS API throttling
- Route table data is mostly static but never cached

Many users run Cilium with pod subnets separate from node subnets, making
route-table-aware selection unnecessary. Add a flag to disable it entirely
for these environments.

When disabled:
- Zero route table API calls
- Minimal memory overhead
- Falls back to AZ-based subnet selection
- No route table mismatch warnings

Fixes: cilium#42310

Signed-off-by: Kevin Vu <[email protected]>
@vietcgi vietcgi force-pushed the fix-aws-route-table-memory branch from 49d6edb to 73332f7 Compare October 24, 2025 02:40
@squeed squeed requested a review from gandro October 24, 2025 12:53
@squeed
Copy link
Copy Markdown
Contributor

squeed commented Oct 24, 2025

Thanks for the PR! As an uninitiated outsider to this particular corner of the codebase, two questions come to mind:

  1. Is there a way we can safely auto-detect this situation and disable route-table listing by heuristic? I am on a minor mission to remove configuration flags. Ideally this would be in the operator itself, but at cilium install time would also be nice.
  2. Given that listing route tables can be expensive, it sure seems like we do it too frequently. In your ideal world, what would be a good edge to trigger such a list?

I'm not asking you to do all this, but I'd like to know exactly the problem space.

@liyihuang
Copy link
Copy Markdown
Contributor

Do you mind to wait a bit to figure out the best way in #42310 before we start to work on the flag?

@vietcgi
Copy link
Copy Markdown
Contributor Author

vietcgi commented Oct 25, 2025

@squeed

Q1: Can we auto-detect this?

Safest heuristic: Count route tables per AZ. If there's only one route table in the AZ, all subnets route identically → awareness is pointless. This is provably safe.

Problem with other heuristics (subnet overlap checks, etc): We can't know if "pod traffic needs same routing as nodes" without knowing user intent. Separate subnets might still need identical VPN routes. A
heuristic could silently break hybrid connectivity.

Recommendation: Auto-disable when len(routeTablesInAZ) == 1, otherwise require explicit config. Covers simple VPCs (majority) while avoiding silent failures.

Q2: What's the right trigger for listing route tables?

Current: Every resync (100+/hour)Change frequency of route tables: Hours to weeksRatio: We refetch ~1000× more than they change

Ideal triggers:

  1. Cache with 15-30min TTL (solves 98% of the problem immediately)
  2. Only fetch when creating ENIs (not on every resync)
  3. Lazy load (try simple subnet selection first, only fetch if needed)

Route tables are slow-changing infrastructure. We shouldn't be in the hot path of IPAM resync at all.

In my ideal world:

  • Fetch once at startup, cache for 30min
  • Skip entirely during periodic resync (only fetch instances/ENIs)
  • Refetch on cache expiry or manual trigger

This drops my 600MB/hour to ~12MB/hour (98% reduction) and eliminates the OOM.

The core problem: Route tables are 100× larger than other resources but change just as rarely. We treat them like instances (fetch constantly) when they should be treated like VPC topology (fetch once, cache
long).

@liyihuang
Copy link
Copy Markdown
Contributor

I have different opinions where I expressed it in the #42310.

There are 2 issues where my previous PR introduced.

  • The OOM issue on operator. That can be resolved by only querying certain field and VPC filter where I mentioned here Significant memory usage increase for AWS Operator with 1.18 #42310 (comment). We can further reduce the memory consumption for other resource like SG, subnets, describe ec2 etc.

  • rate limit issue. I know we have 1 min internal full sync which is not ideal, but that shouldn't be the major reason of the issue since AWS refill 20 token sper second. I guess multiple single instance resync in a short time is causing the issue. I could review the ipam logic to see if we can skip VPC, subnet, RT and SG for single instance sync.

I can see your environment having the around 180 subnets and 180 route tables. it's more like route table per subnet. I dont think that's a common setup. Do you mind to share a bit about why? From my view, I usually see people put several subnets together sharing the same route table, where I think that's the target setup that cilium should consider.

Do you mind to share your opinion on my approach in #42310?

@vietcgi
Copy link
Copy Markdown
Contributor Author

vietcgi commented Oct 31, 2025

@liyihuang Fair points on the architectural approach. I think the disable flag and your refactor are complementary. The flag gives users like @antonipp immediate relief without waiting for the resync refactor. Skip VPC/subnet/RT/SG for instance resync. This helps everyone and is the right long-term fix. Both address different aspects, the flag is pragmatic, your refactor is clean. We can do both incrementally.

For the flag itself: it's low-risk, backwards compatible, and unblocks a real user. I think we should merge it while you work on the resync optimization. Thoughts?

Copy link
Copy Markdown
Contributor

@squeed squeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One naming nit, but seems reasonable to me. I appreciate there is a broader fix in the works.

flags.Bool(operatorOption.AWSPaginationEnabled, true, "Enable pagination for AWS EC2 API requests. The default page size is 1000 items.")
option.BindEnv(vp, operatorOption.AWSPaginationEnabled)

flags.Bool(operatorOption.AWSDisableRouteTableDiscovery, false, "Disable route table discovery and route-table-aware subnet selection. "+
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we tend to use "enable" rather than "disable" options, since the double-negative is hard to parse. So call this aws-enable-route-table-discovery with a default of true. It will parse --aws-enable-route-table-discovery=false correctly.

This applies to the PR as a whole.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! I've renamed the flag to aws-enable-route-table-discovery with a default of true. This eliminates the double negative and matches Cilium's naming conventions. All logic has been inverted correctly, the behavior stays the same (route table discovery is enabled by default), just the flag semantics are clearer now.

@liyihuang
Copy link
Copy Markdown
Contributor

@vietcgi Since we already have a lot of flags, do you mind to see #42529 fixes the issue? If not, let's look into this PR.

I have chatted one of my friends working at AWS as the solution architect in networking area. We all feel the route table should be the resource that only has a few in a VPC. I understand you may have your reason to have 100+ route table, I personally think we should keep it as true by default since that's more common setup for all AWS users if we want to merge this PR

@antonipp
Copy link
Copy Markdown
Contributor

antonipp commented Nov 3, 2025

I think this flag could actually be useful, since #42529 should improve the memory usage but we still have the Route Table calls on the critical path of the full resync, so it would be great if we could turn it off. And I'm fine with having the flag be true by default.

@vietcgi
Copy link
Copy Markdown
Contributor Author

vietcgi commented Nov 3, 2025

Hey @liyihuang

checked out PR #42529. It’s close, but there are a few issues that could break prod:

Cache not initialized, m.vpcs/m.subnets start empty, so InstanceSync() can crash if it runs before the first resync.

Data race, reads happen without locks, writes use a mutex → classic race.

No tests, nothing covers InstanceSync() or cache behavior.

The idea’s solid, but needs cache init, proper locking, and some tests. Until then, PR #42365 is safer short-term.

@maintainer-s-little-helper
Copy link
Copy Markdown

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Nov 3, 2025
@vietcgi vietcgi force-pushed the fix-aws-route-table-memory branch from feddbcc to 88eb62c Compare November 3, 2025 15:09
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Nov 3, 2025
@liyihuang
Copy link
Copy Markdown
Contributor

liyihuang commented Nov 3, 2025

Thanks for comment. Do you mind to put the comments in PR?

Cache not initialized, m.vpcs/m.subnets start empty, so InstanceSync() can crash if it runs before the first resync.

I personally dont think caches can be empty for InstanceSync() since the 1st full sync at here https://github.com/liyihuang/cilium/blob/e56bad826338931f14fe9f8735222b9e0905129a/operator/cmd/root.go#L647-L650.

I dont think there is a way for operator can start and let agent to create CiliumNode.

I'm not sure if I'm following it looks like there is a lock here

e56bad8#diff-120e1fc9482652cdd5001efe26f26ff049c32c601e30d1c3b34e9a3150431faeR386-R392.

Let me think about the testing

@maintainer-s-little-helper
Copy link
Copy Markdown

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Nov 3, 2025
flags.Bool(operatorOption.AWSPaginationEnabled, true, "Enable pagination for AWS EC2 API requests. The default page size is 1000 items.")
option.BindEnv(vp, operatorOption.AWSPaginationEnabled)

flags.Bool(operatorOption.AWSEnableRouteTableDiscovery, true, "Enable route table discovery and route-table-aware subnet selection. "+
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not the expert here, but it's not always safe to disable this, right? Can you add a sentence describing the safety?

If this is really an experts-only option, you can mark it hidden to exclude it from docs.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Route table discovery was only added in 1.18 with #37229 and currently brings performance drawbacks (see #42310). Disabling it restores pre-1.18 operator behavior so it's completely safe and more stable performance-wise.
We could even consider defaulting this flag to false until the underlying feature is more stable.

Copy link
Copy Markdown
Contributor Author

@vietcgi vietcgi Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@squeed Good point. I already added "Disabling is safe and restores pre-1.18 operator behavior" to the help text. It's a good safety valve for large route table environments.

@maintainer-s-little-helper
Copy link
Copy Markdown

@maintainer-s-little-helper
Copy link
Copy Markdown

@vietcgi vietcgi force-pushed the fix-aws-route-table-memory branch from 9e2e5e7 to 71aaee6 Compare November 3, 2025 17:33
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Nov 3, 2025
@maintainer-s-little-helper
Copy link
Copy Markdown

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Nov 4, 2025
@pchaigno
Copy link
Copy Markdown
Member

This doesn't look it has been in a passing state for the past three weeks (see the bot's comments above), so switching to draft while it's being worked on.

@pchaigno pchaigno marked this pull request as draft November 24, 2025 15:16
@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Dec 25, 2025
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 8, 2026

This pull request has not seen any activity since it was marked stale.
Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. kind/community-contribution This was a contribution made by a community member. stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Significant memory usage increase for AWS Operator with 1.18

6 participants