Excess IP release handshake 1.8 backport by hemanthmalla · Pull Request #81 · DataDog/cilium

hemanthmalla · 2021-12-13T13:45:11Z

Backport of cilium#17939, cilium#18217 and cilium#18330 to cilium 1.8.7.

See commit and PR description for more details.

Currently, there can be a delay of up to 15 seconds in updating the CiliumNode CRD after an IP is allocated to a pod. In the meantime, the operator can determine that the node has excess IPs and release that IP, causing pod connectivity issues. A new operator flag, `excess-ip-release-delay`, controls how long the operator waits before considering an IP for release (defaults to 180 seconds). This is done to better handle IP reuse during rollouts. The operator and agent use a new map in the CiliumNode status, `.status.ipam.release-ips`, to exchange information during the handshake.

Fixes: cilium#13412
Signed-off-by: Hemanth Malla <[email protected]>

After the handshake is complete and the operator has released an IP, the CiliumNode status (release-ips) and spec (pool) are updated in two consecutive requests. There is a small window between the two updates in which the entry has been removed from .status.ipam.release-ips but the IP is still present in .spec.ipam.pool, so the IP could be allocated between these requests. This commit introduces a new state called released to deal with this. The agent now removes the entry from release-ips only once the IP has been removed from .spec.ipam.pool as well.

Signed-off-by: Hemanth Malla <[email protected]>
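A rough sketch of the agent side of this rule follows. Only the `ready-for-release` and `released` state names appear on this page; `marked-for-release` and `do-not-release` are assumed here from upstream Cilium, and `agentSync` and `allocatable` are hypothetical helpers rather than functions from the agent.

```go
package main

// Handshake states. "ready-for-release" and "released" are named in this PR;
// the other two are assumptions based on upstream Cilium.
const (
	markedForRelease = "marked-for-release"
	readyForRelease  = "ready-for-release"
	doNotRelease     = "do-not-release"
	released         = "released"
)

// agentSync is a hypothetical agent-side step. pool mirrors .spec.ipam.pool,
// releaseIPs mirrors .status.ipam.release-ips and inUse holds the IPs that
// are currently allocated to pods. It returns the updated release-ips map.
func agentSync(pool map[string]bool, releaseIPs map[string]string, inUse map[string]bool) map[string]string {
	next := make(map[string]string, len(releaseIPs))
	for ip, state := range releaseIPs {
		switch state {
		case markedForRelease:
			// Acknowledge the operator's request, or refuse it if the IP is in use.
			if inUse[ip] {
				next[ip] = doNotRelease
			} else {
				next[ip] = readyForRelease
			}
		case released:
			// Keep the entry until the operator has also removed the IP from
			// the pool; dropping it earlier would reopen the allocation window.
			if pool[ip] {
				next[ip] = released
			}
		default:
			next[ip] = state
		}
	}
	return next
}

// allocatable reports whether the agent may hand ip to a pod: it has to be in
// the pool and must not be part of an ongoing handshake.
func allocatable(ip string, pool map[string]bool, releaseIPs map[string]string) bool {
	_, inHandshake := releaseIPs[ip]
	return pool[ip] && !inHandshake
}
```

Because the `released` entry outlives the first of the two updates, `allocatable` keeps returning false for the IP until it has disappeared from the pool.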

Fixes: cilium#18204 Signed-off-by: Hemanth Malla <[email protected]>

JulienBalestra

🙏

* Fix for blocked state transition from ready-for-release to released
* Fix for unnecessary updates between agent and operator during handshake

Signed-off-by: Hemanth Malla <[email protected]>

crdAllocator.Allocate() acquires the lock on the allocator first and then on the nodestore, but updateLocalNodeResource() acquires the locks in the opposite order. This commit releases the nodestore lock before acquiring the allocator lock to avoid potential deadlocks due to inconsistent lock ordering.

Signed-off-by: Hemanth Malla <[email protected]>
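The pattern behind this fix can be illustrated with two simplified types. The sketch below does not reproduce `crdAllocator` or the nodestore; `allocator`, `nodeStore`, `allocate` and `usedIPs` are hypothetical stand-ins that only show one path nesting the locks (allocator, then store) while the other path drops the store lock before taking the allocator lock.

```go
package main

import "sync"

// allocator and nodeStore are simplified stand-ins for illustration only.
type allocator struct {
	mu        sync.Mutex
	allocated map[string]bool
}

type nodeStore struct {
	mu    sync.Mutex
	pool  []string
	alloc *allocator
}

// allocate takes the allocator lock first and the store lock second.
func (a *allocator) allocate(s *nodeStore) (string, bool) {
	a.mu.Lock()
	defer a.mu.Unlock()
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, ip := range s.pool {
		if !a.allocated[ip] {
			a.allocated[ip] = true
			return ip, true
		}
	}
	return "", false
}

// usedIPs needs data guarded by both locks. Instead of nesting them in the
// opposite order, it copies what it needs under the store lock, releases that
// lock, and only then takes the allocator lock.
func (s *nodeStore) usedIPs() []string {
	s.mu.Lock()
	pool := append([]string(nil), s.pool...)
	s.mu.Unlock() // released before touching the allocator

	s.alloc.mu.Lock()
	defer s.alloc.mu.Unlock()
	var used []string
	for _, ip := range pool {
		if s.alloc.allocated[ip] {
			used = append(used, ip)
		}
	}
	return used
}
```

Since `usedIPs` never holds both locks at once, it cannot end up waiting on the allocator lock while a concurrent `allocate` call waits on the store lock. The trade-off is that the copied pool may be slightly stale by the time the allocator is consulted.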

… node Imagine a scenario where a node has 2 unused IPs and pre-allocate is set to 1. Suppose one of the IPs is in the middle of a handshake when a new pod is scheduled on the node. The other unused IP is allocated to the pod. When the operator re-evaluates, the node is no longer considered to be in excess. Without this commit, the operator does not act further on IPs in this state. As a result, no new IPs are allocated to the node, and the agent cannot allocate the remaining unused IPs because they are in the middle of a handshake.

Signed-off-by: Hemanth Malla <[email protected]>
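One way to picture the abort flow is an operator-side check that drops in-flight handshake entries once the node is no longer in excess. How the real operator implements this is not shown on this page, so the sketch below is an assumption throughout: `abortStaleReleases` is a hypothetical function, and excess is simplified to unused IPs minus pre-allocate.

```go
package main

import "sort"

// abortStaleReleases is a hypothetical operator-side step. unused is the number
// of IPs on the node that are not allocated to pods (including IPs that are in
// a handshake), preAllocate is the configured pre-allocate value and releaseIPs
// mirrors .status.ipam.release-ips. It removes in-flight entries beyond the
// current excess and returns the IPs whose handshake was aborted.
func abortStaleReleases(unused, preAllocate int, releaseIPs map[string]string) []string {
	// Simplified excess calculation; the real operator considers more inputs.
	excess := unused - preAllocate
	if excess < 0 {
		excess = 0
	}
	var inFlight []string
	for ip, state := range releaseIPs {
		// Entries already in the released state are past the point of aborting.
		if state != "released" {
			inFlight = append(inFlight, ip)
		}
	}
	sort.Strings(inFlight) // deterministic choice of which handshakes to keep
	var aborted []string
	for i := excess; i < len(inFlight); i++ {
		delete(releaseIPs, inFlight[i])
		aborted = append(aborted, inFlight[i])
	}
	return aborted
}
```

In the scenario above, the node ends up with 1 unused IP and pre-allocate 1, so the excess is 0 and the single in-flight entry is removed, which makes that IP allocatable by the agent again.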

…ol() Signed-off-by: Hemanth Malla <[email protected]>

Signed-off-by: Hemanth Malla <[email protected]>

hemanthmalla added 3 commits December 6, 2021 15:55

Fix for data race in IP release features

c3e2d3d

Fixes: cilium#18204 Signed-off-by: Hemanth Malla <[email protected]>

hemanthmalla requested review from EricMountain, JulienBalestra and lvsl December 13, 2021 13:45

JulienBalestra approved these changes Dec 13, 2021

View reviewed changes

hemanthmalla added 2 commits December 21, 2021 11:57

Improvements to IP release handshake

52733af

* Fix for blocked state transition from ready-for-release to released
* Fix for unnecessary updates between agent and operator during handshake

Signed-off-by: Hemanth Malla <[email protected]>

hemanthmalla force-pushed the hemanth.malla/release_handshake_1.8_backport branch from 46035d2 to e635317 December 21, 2021 18:27

hemanthmalla added 3 commits December 30, 2021 16:11

Refactoring IP release and allocate functionality out of maintainIPPo…

f186312

…ol() Signed-off-by: Hemanth Malla <[email protected]>

Adding unit test for IP release abort flow

36a4268

Signed-off-by: Hemanth Malla <[email protected]>

hemanthmalla force-pushed the hemanth.malla/release_handshake_1.8_backport branch from 8a60d01 to 36a4268 December 30, 2021 21:22

hemanthmalla merged commit 7688cbc into 1.8.x-dd Jan 6, 2022

hemanthmalla mentioned this pull request Feb 1, 2022

Fix for excess IP release race condition | handshake between agent and operator #24

Closed

HadrienPatte deleted the hemanth.malla/release_handshake_1.8_backport branch March 26, 2025 13:17

hemanthmalla merged 8 commits into 1.8.x-dd from hemanth.malla/release_handshake_1.8_backport

2 participants
