ipam/aws: add flag to disable route table discovery#42365
ipam/aws: add flag to disable route table discovery#42365vietcgi wants to merge 7 commits intocilium:mainfrom
Conversation
0b2d35f to
49d6edb
Compare
Route table discovery was added in eec7036 to enable route-table-aware subnet selection. However, this causes issues in large VPCs: - Fetches all route tables on every resync (100+ calls per hour) - Large VPCs with 172 route tables and 22k routes = 6MB per call - Results in memory bloat, OOM kills, and AWS API throttling - Route table data is mostly static but never cached Many users run Cilium with pod subnets separate from node subnets, making route-table-aware selection unnecessary. Add a flag to disable it entirely for these environments. When disabled: - Zero route table API calls - Minimal memory overhead - Falls back to AZ-based subnet selection - No route table mismatch warnings Fixes: cilium#42310 Signed-off-by: Kevin Vu <[email protected]>
49d6edb to
73332f7
Compare
|
Thanks for the PR! As an uninitiated outsider to this particular corner of the codebase, two questions come to mind:
I'm not asking you to do all this, but I'd like to know exactly the problem space. |
|
Do you mind to wait a bit to figure out the best way in #42310 before we start to work on the flag? |
|
Q1: Can we auto-detect this? Safest heuristic: Count route tables per AZ. If there's only one route table in the AZ, all subnets route identically → awareness is pointless. This is provably safe. Problem with other heuristics (subnet overlap checks, etc): We can't know if "pod traffic needs same routing as nodes" without knowing user intent. Separate subnets might still need identical VPN routes. A Recommendation: Auto-disable when len(routeTablesInAZ) == 1, otherwise require explicit config. Covers simple VPCs (majority) while avoiding silent failures. Q2: What's the right trigger for listing route tables? Current: Every resync (100+/hour)Change frequency of route tables: Hours to weeksRatio: We refetch ~1000× more than they change Ideal triggers:
Route tables are slow-changing infrastructure. We shouldn't be in the hot path of IPAM resync at all. In my ideal world:
This drops my 600MB/hour to ~12MB/hour (98% reduction) and eliminates the OOM. The core problem: Route tables are 100× larger than other resources but change just as rarely. We treat them like instances (fetch constantly) when they should be treated like VPC topology (fetch once, cache |
|
I have different opinions where I expressed it in the #42310. There are 2 issues where my previous PR introduced.
I can see your environment having the around 180 subnets and 180 route tables. it's more like route table per subnet. I dont think that's a common setup. Do you mind to share a bit about why? From my view, I usually see people put several subnets together sharing the same route table, where I think that's the target setup that cilium should consider. Do you mind to share your opinion on my approach in #42310? |
|
@liyihuang Fair points on the architectural approach. I think the disable flag and your refactor are complementary. The flag gives users like @antonipp immediate relief without waiting for the resync refactor. Skip VPC/subnet/RT/SG for instance resync. This helps everyone and is the right long-term fix. Both address different aspects, the flag is pragmatic, your refactor is clean. We can do both incrementally. For the flag itself: it's low-risk, backwards compatible, and unblocks a real user. I think we should merge it while you work on the resync optimization. Thoughts? |
squeed
left a comment
There was a problem hiding this comment.
One naming nit, but seems reasonable to me. I appreciate there is a broader fix in the works.
operator/cmd/provider_aws_flags.go
Outdated
| flags.Bool(operatorOption.AWSPaginationEnabled, true, "Enable pagination for AWS EC2 API requests. The default page size is 1000 items.") | ||
| option.BindEnv(vp, operatorOption.AWSPaginationEnabled) | ||
|
|
||
| flags.Bool(operatorOption.AWSDisableRouteTableDiscovery, false, "Disable route table discovery and route-table-aware subnet selection. "+ |
There was a problem hiding this comment.
nit: we tend to use "enable" rather than "disable" options, since the double-negative is hard to parse. So call this aws-enable-route-table-discovery with a default of true. It will parse --aws-enable-route-table-discovery=false correctly.
This applies to the PR as a whole.
There was a problem hiding this comment.
Done! I've renamed the flag to aws-enable-route-table-discovery with a default of true. This eliminates the double negative and matches Cilium's naming conventions. All logic has been inverted correctly, the behavior stays the same (route table discovery is enabled by default), just the flag semantics are clearer now.
|
@vietcgi Since we already have a lot of flags, do you mind to see #42529 fixes the issue? If not, let's look into this PR. I have chatted one of my friends working at AWS as the solution architect in networking area. We all feel the route table should be the resource that only has a few in a VPC. I understand you may have your reason to have 100+ route table, I personally think we should keep it as true by default since that's more common setup for all AWS users if we want to merge this PR |
|
I think this flag could actually be useful, since #42529 should improve the memory usage but we still have the Route Table calls on the critical path of the full resync, so it would be great if we could turn it off. And I'm fine with having the flag be true by default. |
|
Hey @liyihuang checked out PR #42529. It’s close, but there are a few issues that could break prod: Cache not initialized, m.vpcs/m.subnets start empty, so InstanceSync() can crash if it runs before the first resync. Data race, reads happen without locks, writes use a mutex → classic race. No tests, nothing covers InstanceSync() or cache behavior. The idea’s solid, but needs cache init, proper locking, and some tests. Until then, PR #42365 is safer short-term. |
|
Commit feddbcc does not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin |
Signed-off-by: Kevin Vu <[email protected]>
feddbcc to
88eb62c
Compare
|
Thanks for comment. Do you mind to put the comments in PR?
I personally dont think caches can be empty for InstanceSync() since the 1st full sync at here https://github.com/liyihuang/cilium/blob/e56bad826338931f14fe9f8735222b9e0905129a/operator/cmd/root.go#L647-L650. I dont think there is a way for operator can start and let agent to create I'm not sure if I'm following it looks like there is a lock here e56bad8#diff-120e1fc9482652cdd5001efe26f26ff049c32c601e30d1c3b34e9a3150431faeR386-R392. Let me think about the testing |
|
Commit 6b2d453 does not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin |
| flags.Bool(operatorOption.AWSPaginationEnabled, true, "Enable pagination for AWS EC2 API requests. The default page size is 1000 items.") | ||
| option.BindEnv(vp, operatorOption.AWSPaginationEnabled) | ||
|
|
||
| flags.Bool(operatorOption.AWSEnableRouteTableDiscovery, true, "Enable route table discovery and route-table-aware subnet selection. "+ |
There was a problem hiding this comment.
I'm not the expert here, but it's not always safe to disable this, right? Can you add a sentence describing the safety?
If this is really an experts-only option, you can mark it hidden to exclude it from docs.
There was a problem hiding this comment.
Route table discovery was only added in 1.18 with #37229 and currently brings performance drawbacks (see #42310). Disabling it restores pre-1.18 operator behavior so it's completely safe and more stable performance-wise.
We could even consider defaulting this flag to false until the underlying feature is more stable.
There was a problem hiding this comment.
@squeed Good point. I already added "Disabling is safe and restores pre-1.18 operator behavior" to the help text. It's a good safety valve for large route table environments.
|
Commits 6b2d453, 40e441b do not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin |
|
Commits 6b2d453, 40e441b, 9e2e5e7 do not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin |
… is disabled Signed-off-by: Kevin Vu <[email protected]>
Signed-off-by: Kevin Vu <[email protected]>
Signed-off-by: Kevin Vu <[email protected]>
9e2e5e7 to
71aaee6
Compare
…odeSubnet Signed-off-by: Kevin Vu <[email protected]>
|
Commit 87d00cf does not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin |
|
This doesn't look it has been in a passing state for the past three weeks (see the bot's comments above), so switching to draft while it's being worked on. |
|
This pull request has been automatically marked as stale because it |
|
This pull request has not seen any activity since it was marked stale. |
What
Adds
--aws-disable-route-table-discoveryflag to the operator.Why
After #37229, the operator fetches all route tables on every instance resync. This is causing production issues for users with large VPCs:
DescribeRouteTablescallThe route table feature is great for ensuring pods route traffic the same way as nodes, but many of us run with pod subnets completely separate from node subnets. In those cases we don't need route table awareness at all, and the memory/API cost isn't worth it.
How
When
--aws-disable-route-table-discovery=true:DescribeRouteTablesAPI callsDefault is
falseso existing behavior is unchanged.Testing
Usage
Fixes #42310
@liyihuang would love your thoughts on this approach. The flag gives immediate relief for users hitting memory limits while keeping the route table feature for those who need it.