Fix for excess IP release race condition | handshake between agent and operator #24

Closed

hemanthmalla wants to merge 6 commits into 1.8.x-dd from hemanth.malla/ip_release_handshake

Conversation


@hemanthmalla hemanthmalla commented Oct 25, 2021

Currently there's a 15 sec delay in updating the CiliumNode (CN) CRD after an IP is allocated to a pod; in that window the operator can determine that the node has excess IPs and release one of them, breaking pod connectivity. More details on the race condition: cilium#13412 (comment)

Forcing a CN write to go through before allocation would close the window, but can result in too many writes from the agent.

The operator releases excess IPs from the ENI with the most free IPs, so that the ENI itself can potentially be released in the future. An alternative approach considered was to move the excess-IP detection logic to the agent, but that involves calling into cloud-provider-specific implementations, which AFAIK the agent IPAM code doesn't do today.

So, this PR introduces a handshake between the operator and the agent before an excess IP is released, avoiding the race condition. A new operator flag, excess-ip-release-delay, controls how long the operator waits before considering an IP for release (defaults to 180 secs); this gives rollouts a chance to reuse the IP. The operator and agent exchange handshake state through a new map in the CiliumNode status, .status.ipam.release-ips.

[diagram: full_handshake — sequence of the operator/agent release handshake]
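The handshake can be sketched as a small state transition on the release-ips map. The state names follow the string values discussed later in this review; the helper itself is illustrative, not the PR's code:

```go
package main

import "fmt"

// Hypothetical state strings for .status.ipam.release-ips, mirroring the
// handshake described above (names follow the strings adopted in review).
const (
	markedForRelease = "marked-for-release" // operator: IP looks excess, start handshake
	readyForRelease  = "ready-for-release"  // agent: IP is unused, safe to release
	doNotRelease     = "do-not-release"     // agent: IP was allocated meanwhile, abort
)

// agentAck models the agent's side of the handshake: it ACKs or NACKs an
// IP the operator marked, depending on whether the IP is still unused.
func agentAck(status map[string]string, ip string, inUse bool) {
	if status[ip] != markedForRelease {
		return // nothing to answer for this IP
	}
	if inUse {
		status[ip] = doNotRelease
	} else {
		status[ip] = readyForRelease
	}
}

func main() {
	status := map[string]string{"192.0.2.10": markedForRelease}
	agentAck(status, "192.0.2.10", false)
	fmt.Println(status["192.0.2.10"]) // ready-for-release
}
```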

	} else {
		n.ownNode.Status.IPAM.ReleaseIps[ip] = ipamOption.IPAMReadyForRelease
	}
	allocator.mutex.Unlock()
Member

Shouldn't this be in a defer just after locking?
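The pattern the reviewer is asking for looks roughly like this (a simplified sketch of the allocator, not the PR's exact code):

```go
package main

import (
	"fmt"
	"sync"
)

type allocator struct {
	mutex      sync.Mutex
	releaseIps map[string]string
}

// markReadyForRelease holds the lock for the whole update. Deferring the
// unlock right after locking guarantees the mutex is released on every
// return path, including early returns and panics.
func (a *allocator) markReadyForRelease(ip string) {
	a.mutex.Lock()
	defer a.mutex.Unlock() // runs on every exit path

	if a.releaseIps == nil {
		return // without the defer, this early return would leak the lock
	}
	a.releaseIps[ip] = "ready-for-release"
}

func main() {
	a := &allocator{releaseIps: map[string]string{}}
	a.markReadyForRelease("192.0.2.10")
	fmt.Println(a.releaseIps["192.0.2.10"]) // ready-for-release
}
```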

@gandro

gandro commented Oct 28, 2021

Thanks a lot for tackling this! The overall approach looks good to me. Doing the release via a handshake seems solid, and at least I couldn't come up with a simpler approach that maintains correctness.

While I haven't reviewed the code in depth yet, the implementation also looks good to me. A few smaller comments though:

  • In the current implementation, the agent will trigger an update to its CiliumNode CRD immediately if an IP is ready-for-release, instead of waiting for the 15 second update interval to expire. While I understand this likely makes testing a bit easier, this will also create a bit of unnecessary load on the API server. If the trigger is required only for unit testing purposes, I would try to trigger the controller manually in the test code.
  • On that point: If I understand the code correctly, the operator will only invoke the pool maintenance code once every minute (earlier, if it detects that the node is running out of IPs). So the triggered write will likely not cause the operator to perform an earlier release. But I might be wrong on the last part.
  • This is more of a style comment, but I believe in Cilium CRD we prefer to use strings instead of opaque integers to indicate states. This means I would make ReleaseIps a map[string]string, where each IP can have the string values marked-for-release, ready-for-release, and do-not-release. See CiliumEndpoint.Status.State for an example how to do this (including kubebuilder validation).
  • I assume you probably have already considered this, but in the case the operator pod restarts after marking IPs for release, every release delay timer will be restarted. I believe the current code handles this correctly (i.e. will start a new timer in the operator for any IP with a pre-existing marked-for-release flag), but just something I wanted to point out so it doesn't get forgotten.
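The third point (strings with kubebuilder validation) could look roughly like this. Field and type names here are illustrative assumptions modeled on CiliumEndpoint.Status.State, not the PR's final code:

```go
package main

// IPReleaseStatus is the handshake state for a single IP. The kubebuilder
// enum marker restricts values to the three handshake states, so the API
// server rejects anything else at admission time. (Illustrative sketch;
// names are assumptions, not the PR's final code.)
// +kubebuilder:validation:Enum=marked-for-release;ready-for-release;do-not-release
type IPReleaseStatus string

// IPAMStatus carries the per-IP handshake map serialized as
// .status.ipam.release-ips on the CiliumNode CRD.
type IPAMStatus struct {
	// +optional
	ReleaseIPs map[string]IPReleaseStatus `json:"release-ips,omitempty"`
}

func main() {}
```

With a typed string, each state gets a readable literal in the CRD while the validation marker keeps the value set closed, which is what the map[string]string suggestion buys over opaque integers.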

@hemanthmalla
Author

@gandro Thank you for taking the time and providing early feedback.

the agent will trigger an update to its CiliumNode CRD immediately

Correct me if I'm wrong, but I did not see an interval-based refresh in the agent for crd-allocator-node-refresher; it's only event-triggered on operations like allocate, release, and CRD update failures. Also the MinInterval is set to 15 secs in the trigger. So 15 secs should still be respected, right?

Also, currently in the unit test for the agent-side changes, TestMarkForReleaseNoAllocate, I'm not relying on the trigger being called; n.ownNode is updated directly, so I'm testing against that.

So the triggered write will likely not cause the operator to perform an earlier release

Yeah, you're right. Releasing the IP a minute after receiving the CRD update sounds okay though. But I can update the code to trigger poolMaintainer on changes to IP release status. Something like this in the UpdatedResource func:

	if allocationNeeded || releaseNeeded {
		n.requirePoolMaintenance()
		n.poolMaintainer.Trigger()
	}

in Cilium CRD we prefer to use strings instead of opaque integers

I'll update this to use strings instead of integers.

will start a new timer in the operator for any IP with a pre-exsting marked-for-release flag

Yeah, currently even if an IP has a release entry in the CRD, the operator will not release the IP until its local timestamp satisfies the excess IP release delay. During that cool-down period, the operator does not override any ACK / NACK from the agent either.

@gandro

gandro commented Oct 28, 2021

@hemanthmalla

Thanks for the reply!

Correct me if I'm wrong, but I did not see an interval-based refresh in the agent for crd-allocator-node-refresher; it's only event-triggered on operations like allocate, release, and CRD update failures. Also the MinInterval is set to 15 secs in the trigger. So 15 secs should still be respected, right?

You are 100% correct. The 15 second min interval will be respected, so updates are coalesced (which should remove unnecessary load from the apiserver). Apologies, I had myself confused there (probably because I'm currently working on similar code which has a periodic controller).

So the triggered write will likely not cause the operator to perform an earlier release

Yeah, you're right. Releasing the IP a minute after receiving the CRD update sounds okay though. But I can update the code to trigger poolMaintainer on changes to IP release status. Something like this in the UpdatedResource func:

Ah yes, my comment was probably a bit unclear (it was in regard to the timing of agent writes, which I misunderstood). I absolutely agree that releasing the IP after 1 min is totally fine. No changes needed.

I'll update this to use strings instead of integers.

Awesome, thanks!

Yeah, currently even if an IP has a release entry in the CRD, the operator will not release the IP until its local timestamp satisfies the excess IP release delay. During that cool-down period, the operator does not override any ACK / NACK from the agent either.

Great, thanks for confirming!

@hemanthmalla hemanthmalla force-pushed the hemanth.malla/ip_release_handshake branch from 3a369f8 to 2643da9 on October 29, 2021 00:16
@hemanthmalla
Author

@gandro Updated the PR, changing the map values from uint8 to string. This data is stored twice in the operator: ipReleaseStatus on the node struct and the cilium node cache. Just to make sure we're on the same page, the slight increase in memory is justified for readability, correct?

Also, can I go ahead and forward port these changes to the latest on cilium/cilium ?

@gandro

gandro commented Oct 29, 2021

@gandro Updated the PR, changing the map values from uint8 to string. This data is stored twice in the operator: ipReleaseStatus on the node struct and the cilium node cache. Just to make sure we're on the same page, the slight increase in memory is justified for readability, correct?

In my opinion, yes. But maybe it's worth if the @cilium/kubernetes team provides their feedback here as well, as I'm not an expert in our CRD best practices.

Also, can I go ahead and forward port these changes to the latest on cilium/cilium ?

Yes, please do!

@christarazi christarazi left a comment

LGTM overall, nice work and thanks for the helpful diagram! A few relatively minor comments as code suggestions.

Comment on lines +326 to +345
	var allocator *crdAllocator
	var ipFamily Family
	if ipAddr := net.ParseIP(ip); ipAddr != nil {
		if ipAddr.To4() != nil {
			ipFamily = IPv4
		} else {
			ipFamily = IPv6
		}
	}
	if ipFamily == "" {
		continue
	}
	for _, a := range n.allocators {
		if a.family == ipFamily {
			allocator = a
		}
	}
	if allocator == nil {
		continue
	}

We can just set the allocator in place of the ipFamily and remove the latter variable. There can only be v4 or v6 allocators anyway.
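Applying that suggestion, the hunk could collapse to something like this (a sketch against simplified types, not the PR's exact code):

```go
package main

import (
	"fmt"
	"net"
)

type Family string

const (
	IPv4 Family = "ipv4"
	IPv6 Family = "ipv6"
)

type crdAllocator struct{ family Family }

// allocatorForIP returns the matching allocator directly, dropping the
// intermediate ipFamily sentinel check: there can only be v4 or v6
// allocators, so the family is decided by a single To4() test.
func allocatorForIP(ip string, allocators []*crdAllocator) *crdAllocator {
	ipAddr := net.ParseIP(ip)
	if ipAddr == nil {
		return nil // unparseable IP: caller skips this entry
	}
	family := IPv6
	if ipAddr.To4() != nil {
		family = IPv4
	}
	for _, a := range allocators {
		if a.family == family {
			return a
		}
	}
	return nil
}

func main() {
	allocs := []*crdAllocator{{family: IPv4}, {family: IPv6}}
	fmt.Println(allocatorForIP("192.0.2.1", allocs).family) // ipv4
}
```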


It might be worth refactoring the n.allocators first to be a map from family to allocator, to avoid having to run this code in a loop. If you decide to do this, it would make sense as a separate prior commit
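With n.allocators keyed by family, the lookup becomes a single map index instead of a loop (a sketch of the suggested refactor, not the PR's final code):

```go
package main

import "fmt"

type Family string

const (
	IPv4 Family = "ipv4"
	IPv6 Family = "ipv6"
)

type crdAllocator struct{ family Family }

// node holds its allocators in a map from family to allocator, so callers
// index directly instead of scanning a slice on every IP.
type node struct {
	allocators map[Family]*crdAllocator
}

// allocatorFor returns nil when no allocator exists for the family,
// matching the old loop's "not found" behavior.
func (n *node) allocatorFor(family Family) *crdAllocator {
	return n.allocators[family]
}

func main() {
	n := &node{allocators: map[Family]*crdAllocator{IPv4: {family: IPv4}}}
	fmt.Println(n.allocatorFor(IPv4).family, n.allocatorFor(IPv6) == nil) // ipv4 true
}
```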

@hemanthmalla
Author

@christarazi Thanks for the feedback. Will address it shortly.

Regarding the datatype for the IP release map values, are we good with using strings instead of int?

@christarazi

@hemanthmalla Yes I think it's fine to proceed with string.

@aanm aanm left a comment

I didn't look too much into the details of the code changes since I only have one overall concern.

The introduction of the CiliumNode structure was, besides other things, to prevent the huge number of events created by the K8s Node struct. Since each cilium-agent watches for changes from all other cilium-agents through the k8s watcher mechanism, and since the K8s Node struct is modified every 15-30 seconds, it was impossible to use this watching mechanism on large clusters. Thus we introduced CiliumNode, where the number of changes is almost zero, which avoids having to send events from one cilium-agent to another.

I assume you will perform tests around this caveat; if necessary, I would suggest batching the events sent from the operator that mark an IP as released. Otherwise, if a large number of pods are removed from a node, we will send a CiliumNode event for each pod removed.

@hemanthmalla
Author

hemanthmalla commented Nov 3, 2021

@aanm Thanks for the feedback.

Mark-for-release updates from the operator are already batched (via the existing 1 min interval-based sync). I'm not triggering a k8s update when there are IPs marked for release; a CRD sync is triggered immediately only when an actual allocation / release event occurs.

We discussed this a little in #24 (comment)

@hemanthmalla
Author

Closing in favor of #81

@HadrienPatte HadrienPatte deleted the hemanth.malla/ip_release_handshake branch March 26, 2025 13:24