Skip to content

Fix for excess IP release race condition | handshake between agent and operator#17939

Merged
nbusseneau merged 2 commits intocilium:masterfrom
DataDog:hemanthmalla/ip_release_handshake_1.10
Dec 9, 2021
Merged

Fix for excess IP release race condition | handshake between agent and operator#17939
nbusseneau merged 2 commits intocilium:masterfrom
DataDog:hemanthmalla/ip_release_handshake_1.10

Conversation

@hemanthmalla
Copy link
Copy Markdown
Member

@hemanthmalla hemanthmalla commented Nov 19, 2021

Forward port of DataDog#24

Currently there's a 15 sec delay in updating ciliumnode (CN) CRD after IP allocation to a pod, meanwhile the operator can determine that a node has excess IPs and release the IP causing pod connectivity issues.

This PR introduces a handshake between the operator and agent before releasing an excess IP to avoid the race condition. A new operator flag excess-ip-release-delay is added to control how long operator should wait before considering an IP for release (defaults to 180 secs). This is done to better handle IP reuse during rollouts. Operator and agent use a new map in cilium node status .status.ipam.release-ips to exchange information during the handshake.

Fixes: #13412

handshake_upstream

The following alternative options were considered :

  • We could force a CN write to go through before allocation but it can result in too many writes from the agent.

  • The operator picks the ENI with the most free IPs to release excess IPs from, this is done so that the ENI could potentially be released in the future. We could move the excess detection logic to the agent, but this involves calling into cloud provider specific implementation, which AFAIU the agent IPAM implementation does not.

Fix for excess IP release race condition. New operator flag excess-ip-release-delay is introduced to control waiting period before marking an IP for release.

Note : DataDog#24 already received some valuable feedback from @aanm @gandro and @christarazi

@hemanthmalla hemanthmalla requested a review from a team as a code owner November 19, 2021 17:47
@hemanthmalla hemanthmalla requested review from a team November 19, 2021 17:47
@hemanthmalla hemanthmalla requested a review from a team as a code owner November 19, 2021 17:47
@hemanthmalla hemanthmalla requested a review from tgraf November 19, 2021 17:47
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 19, 2021
@aanm aanm removed the request for review from tgraf November 19, 2021 23:58
@aanm aanm unassigned tgraf Nov 19, 2021
@aanm aanm requested review from aanm, christarazi and gandro November 19, 2021 23:59
@aanm aanm added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Nov 19, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 19, 2021
@gandro gandro added area/eni Impacts ENI based IPAM. sig/ipam labels Nov 22, 2021
Copy link
Copy Markdown
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks a ton for addressing this long-standing issue.

I think we need to ensure that allocateNext doesn't hand out IPs which are marked for release too (see inline comment), the rest are smaller nits.

@hemanthmalla hemanthmalla force-pushed the hemanthmalla/ip_release_handshake_1.10 branch 2 times, most recently from d11f4ed to 618bfd5 Compare November 23, 2021 20:04
@hemanthmalla hemanthmalla requested a review from a team as a code owner November 23, 2021 20:04
@hemanthmalla hemanthmalla force-pushed the hemanthmalla/ip_release_handshake_1.10 branch from 618bfd5 to c76cc5a Compare November 23, 2021 21:42
@joestringer
Copy link
Copy Markdown
Member

joestringer commented Nov 24, 2021

/test

Job 'Cilium-PR-K8s-1.22-kernel-4.9' has 1 failure but they might be new flake since it also hit 1 known flake: #17919 (93.81)

@hemanthmalla
Copy link
Copy Markdown
Member Author

hemanthmalla commented Nov 24, 2021

There's one more corner case that needs addressing.

After the handshake is complete and the operator is done releasing an IP, CiliumNode status (release-ips) and spec (pool) are updated in two consecutive requests. There's a tiny window between the two updates where the entry is removed from .status.ipam.release-ips but the IP is till present in spec.ipam.pool. It might be possible that the IP can be allocated between these requests.

@gandro / @aanm Can i flip the order of update only when there are changes to IP release status ? Comments in syncToAPIServer() say that the status should always be updated first though. I don't fully understand why.

cilium/pkg/ipam/node.go

Lines 700 to 701 in eeb7f1b

// Always update the status first to ensure that the IPAM information
// is synced for all addresses that are marked as available.

@gandro
Copy link
Copy Markdown
Member

gandro commented Nov 24, 2021

There's one more corner case that needs addressing.

After the handshake is complete and the operator is done releasing an IP, CiliumNode status (release-ips) and spec (pool) are updated in two consecutive requests. There's a tiny window between the two updates where the entry is removed from .status.ipam.release-ips but the IP is till present in spec.ipam.pool. It might be possible that the IP can be allocated between these requests.

Oh wow, good catch! In dynamic cluster pool, we avoid this by having the agent remove the entry from status.ipam.release-ips once it is no longer present in spec.pool, which might also be an option here.

@gandro / @aanm Can i flip the order of update only when there are changes to IP release status ? Comments in syncToAPIServer() say that the status should always be updated first though. I don't fully understand why.

While I'm also not sure, I believe the order might have to do with the fact that that the agent needs to know the ENI metadata for each IP present in spec. If the agent sees an IP in the pool before it knows about its ENI metadata from status (e.g. subnet CIDR and ENI MAC), this error will be hit:

return nil, fmt.Errorf("unable to find ENI %s", ipInfo.Resource)

@gandro
Copy link
Copy Markdown
Member

gandro commented Nov 24, 2021

@hemanthmalla hemanthmalla force-pushed the hemanthmalla/ip_release_handshake_1.10 branch from 5bd0924 to 77f286b Compare November 29, 2021 21:01
@hemanthmalla
Copy link
Copy Markdown
Member Author

@christarazi Thanks for the review. Addressed most the the feedback except the conversations still unresolved. There's also one more refactor suggestion in the original PR DataDog#24 (comment) we can add to the list for the future PR.

@christarazi
Copy link
Copy Markdown
Member

christarazi commented Nov 29, 2021

/test

Job 'Cilium-PR-K8s-1.23-kernel-4.9' failed and has not been observed before, so may be related to your PR:

Click to show.

Test Name

K8sServicesTest Checks graceful termination of service endpoints Checks client terminates gracefully on service endpoint deletion

Failure Output

FAIL: Timed out after 30.001s.

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.23-kernel-4.9 so I can create a new GitHub issue to track it.

Job 'Cilium-PR-K8s-1.22-kernel-4.19' failed and has not been observed before, so may be related to your PR:

Click to show.

Test Name

K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing Tests NodePort

Failure Output

FAIL: Request from k8s1 to service http://[fd04::12]:31301 failed

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.22-kernel-4.19 so I can create a new GitHub issue to track it.

@aanm
Copy link
Copy Markdown
Member

aanm commented Nov 30, 2021

@hemanthmalla the unit tests failures are legitimate

Currently there's a potential 15 sec delay in updating ciliumnode CRD after IP
allocation to a pod, meanwhile the operator can determine that a node has
excess IPs and release the IP causing pod connectivity issues.

A new operator flag `excess-ip-release-delay` is added to control how long
operator should wait before considering an IP for release(defaults to 180 secs).
This is done to better handle IP reuse during rollouts. Operator and agent use
a new map in cilium node status .status.ipam.release-ips to exchange information
during the handshake.

Fixes: cilium#13412

Signed-off-by: Hemanth Malla <[email protected]>
After the handshake is complete and the operator is done releasing an IP,
CiliumNode status (release-ips) and spec (pool) are updated in two
consecutive requests. There's a tiny window between the two updates where
the entry is removed from .status.ipam.release-ips but the IP is still
present in spec.ipam.pool. It was possible that the IP could be allocated
between these requests.

This commit introduces a new state called released to deal with this.
Now agent removes the entry from release-ips only when the IP was
removed from .spec.ipam.pool as well.

Signed-off-by: Hemanth Malla <[email protected]>
@hemanthmalla hemanthmalla force-pushed the hemanthmalla/ip_release_handshake_1.10 branch from 77f286b to bcdb763 Compare December 6, 2021 13:47
@gandro
Copy link
Copy Markdown
Member

gandro commented Dec 6, 2021

/test

@gandro
Copy link
Copy Markdown
Member

gandro commented Dec 6, 2021

@aanm aanm removed their request for review December 9, 2021 03:01
@aanm aanm removed their assignment Dec 9, 2021
@aanm aanm added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 9, 2021
@nbusseneau
Copy link
Copy Markdown
Member

Thanks @hemanthmalla! Merging.

for k := range n.enis {
eniIds = append(eniIds, k)
}
sort.Strings(eniIds)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same iterate order not ensure selecting the same ENI to release IPs in multi PrepareIPRelease called. what is use case ? @hemanthmalla

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because select the ENI with the most addresses available for release in every call PrepareIPRelease

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/eni Impacts ENI based IPAM. backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AWS ENI: Pod assigned IP that is not attached to an ENI

9 participants