Fix for excess IP release race condition | handshake between agent and operator#24
hemanthmalla wants to merge 6 commits into 1.8.x-dd from
Conversation
Currently there's a potential 15 sec delay in updating the CiliumNode CRD after an IP is allocated to a pod; meanwhile, the operator can determine that the node has excess IPs and release one of them, causing pod connectivity issues.
```go
} else {
	n.ownNode.Status.IPAM.ReleaseIps[ip] = ipamOption.IPAMReadyForRelease
}
allocator.mutex.Unlock()
```
Shouldn't this be in a defer just after locking?
… AWS release IP selection deterministic
Thanks a lot for tackling this! The overall approach looks good to me. Doing the release via handshake seems like a solid approach and at least I couldn't come up with a simpler approach that maintains correctness. While I haven't reviewed the code in depth yet, the implementation also looks good to me. A few smaller comments though:
@gandro Thank you for taking the time and providing early feedback.
Correct me if I'm wrong, but I did not see an interval-based refresh in the agent for … Also, currently in the unit test for checking agent-related changes …
Yeah, you're right. Releasing the IP a minute after receiving the CRD update sounds okay though. But I can update the code to trigger …
I'll update this to use strings instead of integers.
Yeah, currently even if an IP has a release entry in the CRD, the operator will not release the IP until its local timestamp satisfies the excess IP release delay. During that cool-down period, the operator does not override any ACK / NACK from the agent either.
Thanks for the reply!
You are 100% correct. The 15 second min interval will be respected, so updates are coalesced (which should remove unnecessary load from the apiserver). Apologies, I had myself confused there (probably because I'm currently working on similar code which has a periodic controller).
Ah yes, my comment was probably a bit unclear (it was in regard to the timing of agent writes, which I misunderstood). I absolutely agree that releasing the IP after 1 min is totally fine. No changes needed.
Awesome, thanks!
Great, thanks for confirming!
Force-pushed from 3a369f8 to 2643da9
@gandro Updated the PR, changing the map values from … Also, can I go ahead and forward-port these changes to the latest on cilium/cilium?
In my opinion, yes. But maybe it's worth having the @cilium/kubernetes team provide their feedback here as well, as I'm not an expert in our CRD best practices.
Yes, please do!
```go
var allocator *crdAllocator
var ipFamily Family
if ipAddr := net.ParseIP(ip); ipAddr != nil {
	if ipAddr.To4() != nil {
		ipFamily = IPv4
	} else {
		ipFamily = IPv6
	}
}
if ipFamily == "" {
	continue
}
for _, a := range n.allocators {
	if a.family == ipFamily {
		allocator = a
	}
}
if allocator == nil {
	continue
}
```
We can just set the allocator in place of the ipFamily and remove the latter variable. There can only be v4 or v6 allocators anyway.
It might be worth refactoring the n.allocators first to be a map from family to allocator, to avoid having to run this code in a loop. If you decide to do this, it would make sense as a separate prior commit
@christarazi Thanks for the feedback. Will address it shortly. Regarding the datatype for the IP release map values, are we good with using strings instead of ints?
@hemanthmalla Yes, I think it's fine to proceed with …
aanm left a comment
I didn't look too much into the details of the code changes since I only have one overall concern.
The introduction of the CiliumNode structure was, among other things, meant to prevent the huge number of events created by the K8s Node struct. Each cilium-agent watches for changes from all other cilium-agents through the k8s watcher mechanism, and since the K8s Node struct is modified every 15-30 seconds, it was impossible to use this watching mechanism on large clusters. Thus, we introduced CiliumNode: the number of changes made to this struct is almost zero, which avoids having to send events from one cilium-agent to another.
I assume you will perform tests around this caveat, and if necessary I would suggest batching the events sent from the operator that mark an IP as released. Otherwise, if a large number of pods are removed from a node, we will send a CiliumNode event for each pod removed.
@aanm Thanks for the feedback. Mark-for-release updates from the operator are already batched (via the existing 1 min interval-based sync). I'm not triggering a k8s update when there are IPs marked for release; a CRD sync is triggered immediately only when an actual allocation / release event occurs. We discussed this a little in #24 (comment)
Closing in favor of #81
Currently there's a 15 sec delay in updating the CiliumNode (CN) CRD after an IP is allocated to a pod; meanwhile, the operator can determine that the node has excess IPs and release an IP, causing pod connectivity issues. More details on the race condition: cilium#13412 (comment)
Forcing a CN write to go through before allocation can result in too many writes from the agent.
The operator picks the ENI with the most free IPs to release excess IPs from; this is done so that the ENI could potentially be released in the future. An alternative approach considered was to move the excess-detection logic to the agent, but that involves calling into cloud-provider-specific implementations, which AFAIK the agent IPAM implementation doesn't do.
So, this PR introduces a handshake between the operator and agent before releasing an excess IP to avoid the race condition. A new operator flag excess-ip-release-delay is added to control how long the operator should wait before considering an IP for release (defaults to 180 secs). This is done to better handle IP reuse during rollouts. The operator and agent use a new map in the CiliumNode status, .status.ipam.release-ips, to exchange information during the handshake.