Fix for excess IP release race condition | handshake between agent and operator #24

Closed

hemanthmalla wants to merge 6 commits into 1.8.x-dd from hemanth.malla/ip_release_handshake

Conversation


@hemanthmalla hemanthmalla commented Oct 25, 2021

Currently there's a 15 sec delay in updating the CiliumNode (CN) CRD after an IP is allocated to a pod; in that window the operator can determine that the node has excess IPs and release one of them, breaking pod connectivity. More details on the race condition: cilium#13412 (comment)

Forcing a CN write to go through before allocation would close the window, but can result in too many writes from the agent.

The operator releases excess IPs from the ENI with the most free IPs, so that the ENI itself can potentially be released in the future. An alternative approach considered was to move the excess-IP detection logic to the agent, but that involves calling into cloud-provider-specific implementations, which AFAIK the agent IPAM code doesn't do today.

So, this PR introduces a handshake between the operator and the agent before an excess IP is released, avoiding the race condition. A new operator flag, excess-ip-release-delay, controls how long the operator waits before considering an IP for release (defaults to 180 secs); this gives rollouts a chance to reuse the IP. The operator and agent exchange handshake state through a new map in the CiliumNode status, .status.ipam.release-ips.

[diagram: full_handshake — sequence of the operator/agent release handshake]
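The handshake can be sketched as a small state transition on the release-ips map. The state names follow the string values discussed later in this review; the helper itself is illustrative, not the PR's code:

```go
package main

import "fmt"

// Hypothetical state strings for .status.ipam.release-ips, mirroring the
// handshake described above (names follow the strings adopted in review).
const (
	markedForRelease = "marked-for-release" // operator: IP looks excess, start handshake
	readyForRelease  = "ready-for-release"  // agent: IP is unused, safe to release
	doNotRelease     = "do-not-release"     // agent: IP was allocated meanwhile, abort
)

// agentAck models the agent's side of the handshake: it ACKs or NACKs an
// IP the operator marked, depending on whether the IP is still unused.
func agentAck(status map[string]string, ip string, inUse bool) {
	if status[ip] != markedForRelease {
		return // nothing to answer for this IP
	}
	if inUse {
		status[ip] = doNotRelease
	} else {
		status[ip] = readyForRelease
	}
}

func main() {
	status := map[string]string{"192.0.2.10": markedForRelease}
	agentAck(status, "192.0.2.10", false)
	fmt.Println(status["192.0.2.10"]) // ready-for-release
}
```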

	} else {
		n.ownNode.Status.IPAM.ReleaseIps[ip] = ipamOption.IPAMReadyForRelease
	}
	allocator.mutex.Unlock()
Member

Shouldn't this be in a defer just after locking?
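The pattern the reviewer is asking for looks roughly like this (a simplified sketch of the allocator, not the PR's exact code):

```go
package main

import (
	"fmt"
	"sync"
)

type allocator struct {
	mutex      sync.Mutex
	releaseIps map[string]string
}

// markReadyForRelease holds the lock for the whole update. Deferring the
// unlock right after locking guarantees the mutex is released on every
// return path, including early returns and panics.
func (a *allocator) markReadyForRelease(ip string) {
	a.mutex.Lock()
	defer a.mutex.Unlock() // runs on every exit path

	if a.releaseIps == nil {
		return // without the defer, this early return would leak the lock
	}
	a.releaseIps[ip] = "ready-for-release"
}

func main() {
	a := &allocator{releaseIps: map[string]string{}}
	a.markReadyForRelease("192.0.2.10")
	fmt.Println(a.releaseIps["192.0.2.10"]) // ready-for-release
}
```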

@gandro

gandro commented Oct 28, 2021

Thanks a lot for tackling this! The overall approach looks good to me. Doing the release via a handshake seems solid, and at least I couldn't come up with a simpler approach that maintains correctness.

While I haven't reviewed the code in depth yet, the implementation also looks good to me. A few smaller comments though:

  • In the current implementation, the agent will trigger an update to its CiliumNode CRD immediately if an IP is ready-for-release, instead of waiting for the 15 second update interval to expire. While I understand this likely makes testing a bit easier, this will also create a bit of unnecessary load on the API server. If the trigger is required only for unit testing purposes, I would try to trigger the controller manually in the test code.
  • On that point: If I understand the code correctly, the operator will only invoke the pool maintenance code once every minute (earlier, if it detects that the node is running out of IPs). So the triggered write will likely not cause the operator to perform an earlier release. But I might be wrong on the last part.
  • This is more of a style comment, but I believe in Cilium CRD we prefer to use strings instead of opaque integers to indicate states. This means I would make ReleaseIps a map[string]string, where each IP can have the string values marked-for-release, ready-for-release, and do-not-release. See CiliumEndpoint.Status.State for an example how to do this (including kubebuilder validation).
  • I assume you probably have already considered this, but in the case the operator pod restarts after marking IPs for release, every release delay timer will be restarted. I believe the current code handles this correctly (i.e. will start a new timer in the operator for any IP with a pre-existing marked-for-release flag), but just something I wanted to point out so it doesn't get forgotten.
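The third point (strings with kubebuilder validation) could look roughly like this. Field and type names here are illustrative assumptions modeled on CiliumEndpoint.Status.State, not the PR's final code:

```go
package main

// IPReleaseStatus is the handshake state for a single IP. The kubebuilder
// enum marker restricts values to the three handshake states, so the API
// server rejects anything else at admission time. (Illustrative sketch;
// names are assumptions, not the PR's final code.)
// +kubebuilder:validation:Enum=marked-for-release;ready-for-release;do-not-release
type IPReleaseStatus string

// IPAMStatus carries the per-IP handshake map serialized as
// .status.ipam.release-ips on the CiliumNode CRD.
type IPAMStatus struct {
	// +optional
	ReleaseIPs map[string]IPReleaseStatus `json:"release-ips,omitempty"`
}

func main() {}
```

With a typed string, each state gets a readable literal in the CRD while the validation marker keeps the value set closed, which is what the map[string]string suggestion buys over opaque integers.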

@hemanthmalla
Author

@gandro Thank you for taking the time and providing early feedback.

the agent will trigger an update to its CiliumNode CRD immediately

Correct me if I'm wrong, but I did not see an interval-based refresh in the agent for crd-allocator-node-refresher; it's only event-triggered on operations like allocate, release, and CRD update failures. Also the MinInterval is set to 15 secs in the trigger. So 15 secs should still be respected, right?

Also, currently in the unit test for the agent-side changes, TestMarkForReleaseNoAllocate, I'm not relying on the trigger being called; n.ownNode is updated directly, so I'm testing against that.

So the triggered write will likely not cause the operator to perform an earlier release

Yeah, you're right. Releasing the IP a minute after receiving the CRD update sounds okay though. But I can update the code to trigger poolMaintainer on changes to IP release status. Something like this in the UpdatedResource func:

	if allocationNeeded || releaseNeeded {
		n.requirePoolMaintenance()
		n.poolMaintainer.Trigger()
	}

in Cilium CRD we prefer to use strings instead of opaque integers

I'll update this to use strings instead of integers.

will start a new timer in the operator for any IP with a pre-exsting marked-for-release flag

Yeah, currently even if an IP has a release entry in the CRD, the operator will not release the IP until its local timestamp satisfies the excess IP release delay. During that cool-down period, the operator does not override any ACK / NACK from the agent either.

@gandro

gandro commented Oct 28, 2021

@hemanthmalla

Thanks for the reply!

Correct me if I'm wrong, but I did not see an interval-based refresh in the agent for crd-allocator-node-refresher; it's only event-triggered on operations like allocate, release, and CRD update failures. Also the MinInterval is set to 15 secs in the trigger. So 15 secs should still be respected, right?

You are 100% correct. The 15 second min interval will be respected, so updates are coalesced (which should remove unnecessary load from the apiserver). Apologies, I had myself confused there (probably because I'm currently working on similar code which has a periodic controller).

So the triggered write will likely not cause the operator to perform an earlier release

Yeah, you're right. Releasing the IP a minute after receiving the CRD update sounds okay though. But I can update the code to trigger poolMaintainer on changes to IP release status. Something like this in the UpdatedResource func:

Ah yes, my comment was probably a bit unclear (it was in regard to the timing of agent writes, which I misunderstood). I absolutely agree that releasing the IP after 1 min is totally fine. No changes needed.

I'll update this to use strings instead of integers.

Awesome, thanks!

Yeah, currently even if an IP has a release entry in the CRD, the operator will not release the IP until its local timestamp satisfies the excess IP release delay. During that cool-down period, the operator does not override any ACK / NACK from the agent either.

Great, thanks for confirming!

@hemanthmalla hemanthmalla force-pushed the hemanth.malla/ip_release_handshake branch from 3a369f8 to 2643da9 on October 29, 2021 00:16
@hemanthmalla
Author

@gandro Updated the PR, changing the map values from uint8 to string. This data is stored twice in the operator: ipReleaseStatus on the node struct and the cilium node cache. Just to make sure we're on the same page, the slight increase in memory is justified for readability, correct?

Also, can I go ahead and forward port these changes to the latest on cilium/cilium ?

@gandro

gandro commented Oct 29, 2021

@gandro Updated the PR, changing the map values from uint8 to string. This data is stored twice in the operator: ipReleaseStatus on the node struct and the cilium node cache. Just to make sure we're on the same page, the slight increase in memory is justified for readability, correct?

In my opinion, yes. But maybe it's worth if the @cilium/kubernetes team provides their feedback here as well, as I'm not an expert in our CRD best practices.

Also, can I go ahead and forward port these changes to the latest on cilium/cilium ?

Yes, please do!

@christarazi christarazi left a comment

LGTM overall, nice work and thanks for the helpful diagram! A few relatively minor comments as code suggestions.

Comment on lines +326 to +345
	var allocator *crdAllocator
	var ipFamily Family
	if ipAddr := net.ParseIP(ip); ipAddr != nil {
		if ipAddr.To4() != nil {
			ipFamily = IPv4
		} else {
			ipFamily = IPv6
		}
	}
	if ipFamily == "" {
		continue
	}
	for _, a := range n.allocators {
		if a.family == ipFamily {
			allocator = a
		}
	}
	if allocator == nil {
		continue
	}

We can just set the allocator in place of the ipFamily and remove the latter variable. There can only be v4 or v6 allocators anyway.
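Applying that suggestion, the hunk could collapse to something like this (a sketch against simplified types, not the PR's exact code):

```go
package main

import (
	"fmt"
	"net"
)

type Family string

const (
	IPv4 Family = "ipv4"
	IPv6 Family = "ipv6"
)

type crdAllocator struct{ family Family }

// allocatorForIP returns the matching allocator directly, dropping the
// intermediate ipFamily sentinel check: there can only be v4 or v6
// allocators, so the family is decided by a single To4() test.
func allocatorForIP(ip string, allocators []*crdAllocator) *crdAllocator {
	ipAddr := net.ParseIP(ip)
	if ipAddr == nil {
		return nil // unparseable IP: caller skips this entry
	}
	family := IPv6
	if ipAddr.To4() != nil {
		family = IPv4
	}
	for _, a := range allocators {
		if a.family == family {
			return a
		}
	}
	return nil
}

func main() {
	allocs := []*crdAllocator{{family: IPv4}, {family: IPv6}}
	fmt.Println(allocatorForIP("192.0.2.1", allocs).family) // ipv4
}
```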


It might be worth refactoring the n.allocators first to be a map from family to allocator, to avoid having to run this code in a loop. If you decide to do this, it would make sense as a separate prior commit
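With n.allocators keyed by family, the lookup becomes a single map index instead of a loop (a sketch of the suggested refactor, not the PR's final code):

```go
package main

import "fmt"

type Family string

const (
	IPv4 Family = "ipv4"
	IPv6 Family = "ipv6"
)

type crdAllocator struct{ family Family }

// node holds its allocators in a map from family to allocator, so callers
// index directly instead of scanning a slice on every IP.
type node struct {
	allocators map[Family]*crdAllocator
}

// allocatorFor returns nil when no allocator exists for the family,
// matching the old loop's "not found" behavior.
func (n *node) allocatorFor(family Family) *crdAllocator {
	return n.allocators[family]
}

func main() {
	n := &node{allocators: map[Family]*crdAllocator{IPv4: {family: IPv4}}}
	fmt.Println(n.allocatorFor(IPv4).family, n.allocatorFor(IPv6) == nil) // ipv4 true
}
```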

@hemanthmalla
Author

@christarazi Thanks for the feedback. Will address it shortly.

Regarding the datatype for the IP release map values, are we good with using strings instead of int?

@christarazi

@hemanthmalla Yes I think it's fine to proceed with string.

@aanm aanm left a comment

I didn't look too much into the details of the code changes since I only have one overall concern.

The introduction of the CiliumNode structure was, besides other things, to prevent the huge number of events created by the K8s Node struct. Since each cilium-agent watches for changes from all other cilium-agents through the k8s watcher mechanism, and since the K8s Node struct is modified every 15-30 seconds, it was impossible to use this watching mechanism on large clusters. Thus we introduced CiliumNode, where the number of changes is almost zero, which avoids having to send events from one cilium-agent to another.

I assume you will perform tests around this caveat; if necessary, I would suggest batching the events sent from the operator that mark an IP as released. Otherwise, if a large number of pods are removed from a node, we will send a CiliumNode event for each pod removed.

@hemanthmalla
Author

hemanthmalla commented Nov 3, 2021

@aanm Thanks for the feedback.

Mark-for-release updates from the operator are already batched (via the existing 1 min interval-based sync). I'm not triggering a k8s update when there are IPs marked for release; a CRD sync is triggered immediately only when an actual allocation / release event occurs.

We discussed this a little in #24 (comment)

@hemanthmalla
Author

Closing in favor of #81

@HadrienPatte HadrienPatte deleted the hemanth.malla/ip_release_handshake branch March 26, 2025 13:24