Significant memory usage increase for AWS Operator with 1.18 #42310
Closed
Labels
affects/v1.18: This issue affects v1.18 branch
area/agent: Cilium agent related.
area/operator: Impacts the cilium-operator component
kind/bug: This is a bug in the Cilium logic.
kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
kind/regression: This functionality worked fine before, but was broken in a newer release of Cilium.
Is there an existing issue for this?
Version
Equal to or higher than v1.18.2 and lower than v1.19.0
What happened?
We noticed a significant regression in memory usage for the AWS Operator after upgrading from 1.17 to 1.18. We started getting consistent OOMs during scale-ups that previously went unnoticed in our AWS clusters (on the order of hundreds of nodes), with the Operator memory limit set to 5GiB, which used to be more than enough.
I managed to catch a pprof heap profile of the Operator shortly before an OOM happened. It shows millions of AWS Route Table objects taking up all the memory.
Looking further into it, I found #37229, added in 1.18, which added a Route Tables refresh for every single instance in the resync operation (cilium/pkg/aws/eni/instances.go, line 222 in 80a4025).
Moreover, routeTableFilters are never set, so it fetches all route tables. Running aws ec2 describe-route-tables > route-tables.json in one of our affected accounts produces 12MiB of JSON. It is of course serialized differently in Operator memory, but this gives some sense of the size of the data retrieved on each call.
Looking at CloudTrail, we see hundreds of calls per Operator (and we are even rate-limited from time to time!).
So it is now very easy for us to blow up memory, because Route Tables are quite large. Even with the VPC filter added on main, the number of calls and the number of results will still be very high. To summarize, this is quite disruptive for our operations: it severely limits our ability to rapidly scale up clusters, since we very quickly exhaust the Operator's memory. We temporarily increased the memory limit, but we would like to find a proper solution.
Route tables are extremely static objects that almost never change. Why should they be refreshed for every single instance, hundreds of times? Moreover, the goal of this logic appears to be "when creating a new ENI in AWS, try to select a subnet with the same route table as the host's primary ENI", but in our case this is unnecessary: our Cilium-managed ENIs are always in separate subnets from the host ENIs (we manage capacity differently for hosts vs. pods). Could we have a way to turn this logic off completely?
How can we reproduce the issue?
Cilium Version
Kernel Version
n/a
Kubernetes Version
n/a
Regression
1.17
Sysdump
No response
Relevant log output
Anything else?
No response
Cilium Users Document
Code of Conduct