Fix for excess IP release race condition | handshake between agent and operator #17939
Conversation
gandro left a comment:
Awesome, thanks a ton for addressing this long-standing issue.
I think we also need to ensure that allocateNext doesn't hand out IPs that are marked for release (see inline comment); the rest are smaller nits.
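A minimal sketch of what that allocateNext guard could look like, assuming a plain slice of available IPs and a set of marked IPs. The function name, signature, and types are hypothetical, not the actual Cilium IPAM API:

```go
// Hypothetical sketch of the constraint raised above: allocation must
// skip IPs that the operator has marked for release. Names and types
// are illustrative, not the real Cilium implementation.
package main

import "errors"

// allocateNextIP returns the first available IP that is not currently
// part of a release handshake.
func allocateNextIP(available []string, markedForRelease map[string]bool) (string, error) {
	for _, ip := range available {
		if markedForRelease[ip] {
			// The operator may release this IP at any moment; handing it
			// out to a pod would recreate the race this PR fixes.
			continue
		}
		return ip, nil
	}
	return "", errors.New("no IP available for allocation")
}
```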
/test Job 'Cilium-PR-K8s-1.22-kernel-4.9' has 1 failure, but it might be a new flake since it also hit 1 known flake: #17919 (93.81)
There's one more corner case that needs addressing. After the handshake is complete and the operator is done releasing an IP, the CiliumNode status (release-ips) and spec (pool) are updated in two consecutive requests. There's a tiny window between the two updates where the entry is removed from .status.ipam.release-ips but the IP is still present in .spec.ipam.pool. @gandro / @aanm Can I flip the order of the updates only when there are changes to IP release status? (See comments on lines 700 to 701 in eeb7f1b.)
Oh wow, good catch! In dynamic cluster pool, we avoid this by having the agent remove the entry from
While I'm also not sure, I believe the order might have to do with the fact that the agent needs to know the ENI metadata for each IP present in spec. If the agent sees an IP in the pool before it knows about its ENI metadata from status (e.g. subnet CIDR and ENI MAC), this error will be hit: line 528 in 124fa2b
The two CI failures (besides checkpatch) look like unrelated flaky tests:
@christarazi Thanks for the review. Addressed most of the feedback, except for the conversations that are still unresolved. There's also one more refactor suggestion in the original PR DataDog#24 (comment) that we can add to the list for a future PR.
/test
Job 'Cilium-PR-K8s-1.23-kernel-4.9' failed and has not been observed before, so it may be related to your PR. If it is a flake, comment.
Job 'Cilium-PR-K8s-1.22-kernel-4.19' failed and has not been observed before, so it may be related to your PR. If it is a flake, comment.
@hemanthmalla the unit test failures are legitimate
Currently there's a potential 15 sec delay in updating the CiliumNode CRD after IP allocation to a pod; meanwhile, the operator can determine that a node has excess IPs and release an IP, causing pod connectivity issues. A new operator flag `excess-ip-release-delay` is added to control how long the operator should wait before considering an IP for release (defaults to 180 secs). This is done to better handle IP reuse during rollouts. The operator and agent use a new map in the CiliumNode status, .status.ipam.release-ips, to exchange information during the handshake. Fixes: cilium#13412 Signed-off-by: Hemanth Malla <[email protected]>
After the handshake is complete and the operator is done releasing an IP, the CiliumNode status (release-ips) and spec (pool) are updated in two consecutive requests. There's a tiny window between the two updates where the entry is removed from .status.ipam.release-ips but the IP is still present in .spec.ipam.pool, so the IP could be allocated between these requests. This commit introduces a new state called released to deal with this. Now the agent removes the entry from release-ips only once the IP has been removed from .spec.ipam.pool as well. Signed-off-by: Hemanth Malla <[email protected]>
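The cleanup rule in this commit message can be sketched as follows. Only the "released" state and the two fields come from the PR; the helper name and the plain map types are illustrative:

```go
// Sketch of the agent-side cleanup described above: an entry in
// .status.ipam.release-ips is deleted only once the IP is also gone
// from .spec.ipam.pool, closing the window in which a released IP
// could still be allocated. Function name and types are illustrative.
package main

func cleanupReleasedIPs(releaseIPs map[string]string, pool map[string]struct{}) {
	for ip, status := range releaseIPs {
		if status != "released" {
			// Handshake still in progress; leave the entry alone.
			continue
		}
		if _, stillInPool := pool[ip]; !stillInPool {
			// IP is gone from spec.ipam.pool, so it can no longer be
			// allocated; it is now safe to drop the handshake entry.
			delete(releaseIPs, ip)
		}
	}
}
```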
/test
/ci-gke Edit: The previous failure looks like a variant of #17102
Thanks @hemanthmalla! Merging.
for k := range n.enis {
	eniIds = append(eniIds, k)
}
sort.Strings(eniIds)
The same iteration order does not ensure that the same ENI is selected to release IPs from across multiple PrepareIPRelease calls. What is the use case? @hemanthmalla
Because we select the ENI with the most addresses available for release on every PrepareIPRelease call.
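In other words, the sorted iteration only makes tie-breaking deterministic; the actual choice is driven by how many addresses each ENI can free. A rough sketch under that assumption (the function and the map of free-address counts are hypothetical, not the actual Cilium code):

```go
package main

import "sort"

// selectENIForRelease picks the ENI with the most addresses available
// for release; sorting the IDs first makes the result stable across
// calls when several ENIs tie. This mirrors the behavior described in
// the comment above, not the real implementation.
func selectENIForRelease(freeByENI map[string]int) string {
	ids := make([]string, 0, len(freeByENI))
	for id := range freeByENI {
		ids = append(ids, id)
	}
	sort.Strings(ids) // deterministic tie-breaking across calls

	best, bestFree := "", -1
	for _, id := range ids {
		if freeByENI[id] > bestFree {
			best, bestFree = id, freeByENI[id]
		}
	}
	return best
}
```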
Forward port of DataDog#24
Currently there's a 15 sec delay in updating the ciliumnode (CN) CRD after IP allocation to a pod; meanwhile, the operator can determine that a node has excess IPs and release an IP, causing pod connectivity issues.
This PR introduces a handshake between the operator and agent before releasing an excess IP to avoid the race condition. A new operator flag `excess-ip-release-delay` is added to control how long the operator should wait before considering an IP for release (defaults to 180 secs). This is done to better handle IP reuse during rollouts. The operator and agent use a new map in the CiliumNode status, .status.ipam.release-ips, to exchange information during the handshake. Fixes: #13412
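The handshake over .status.ipam.release-ips can be pictured as a small per-IP state machine. The state names below are an assumption based on the PR discussion ("released" is the only one named explicitly); the types and helper are a hedged sketch, not the real implementation:

```go
package main

// ReleaseStatus models the per-IP handshake state stored in
// .status.ipam.release-ips. The state names are assumptions drawn from
// the PR description, used here only for illustration.
type ReleaseStatus string

const (
	MarkedForRelease ReleaseStatus = "marked-for-release" // operator proposes releasing the IP
	ReadyForRelease  ReleaseStatus = "ready-for-release"  // agent agrees: IP is unused
	DoNotRelease     ReleaseStatus = "do-not-release"     // agent vetoes: IP is in use
	Released         ReleaseStatus = "released"           // operator has released the IP
)

// agentAck models the agent's side of the handshake: it only answers a
// pending marked-for-release proposal, based on whether the IP is in use.
func agentAck(current ReleaseStatus, inUse bool) ReleaseStatus {
	if current != MarkedForRelease {
		return current // nothing for the agent to decide
	}
	if inUse {
		return DoNotRelease
	}
	return ReadyForRelease
}
```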
The following alternative options were considered:
We could force a CN write to go through before allocation, but that could result in too many writes from the agent.
The operator picks the ENI with the most free IPs to release excess IPs from; this is done so that the ENI itself could potentially be released in the future. We could move the excess detection logic to the agent, but this involves calling into cloud-provider-specific implementations, which AFAIU the agent IPAM implementation does not do.
Note : DataDog#24 already received some valuable feedback from @aanm @gandro and @christarazi