multipool: Fix allocations when deleting or shrinking pools #42126
Merged
pippolo84 merged 9 commits into cilium:main on Oct 16, 2025
Conversation
Force-pushed from 30d095c to b823529
Force-pushed from ced20c8 to 0985735
Force-pushed from 0985735 to eb45a53
pippolo84 (Member, Author) commented:
/test
gandro (Member) reviewed on Oct 13, 2025:
Awesome work! One thing around operator restart that we need to think about a bit.
36cbc58 to
0478ca5
Compare
gandro (Member) reviewed on Oct 14, 2025:
Neat, thanks! One minor piece of feedback.

gandro reviewed on Oct 14, 2025
When one or more CIDRs are removed from a CiliumPodIPPool, the allocator
scans the list of allocated CIDRs to emit a warning in case a node is
still using one of the removed CIDRs. Moreover, the removed CIDRs are
deleted from the operator's internal map of allocated CIDRs, to keep its
status consistent with the latest snapshot of the CiliumPodIPPool
resource.
Since a CiliumPodIPPool allows specifying a mask size, the CIDR
belonging to the custom resource is usually sliced into smaller CIDRs
that are individually allocated to nodes.
As an example, consider the following CiliumPodIPPool:
```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumPodIPPool
metadata:
  name: test-pool
spec:
  ipv4:
    cidrs:
      - 10.100.0.0/16
    maskSize: 24
```
A CiliumNode requesting allocations from that pool will end up with CIDRs
such as "10.100.0.0/24", "10.100.1.0/24", "10.100.2.0/24", and so on,
depending on how many IPs are requested.
The previous logic that checked whether a removed CIDR was still in use
compared against the entire pool CIDR (i.e. "10.100.0.0/16") from the
CiliumPodIPPool, instead of checking whether the allocated CIDRs
(i.e. "10.100.1.0/24", "10.100.2.0/24", and so on) were contained in the
removed one.
This commit fixes the bug and adds a unit test to prove that the
CiliumPodIPPool update is now handled correctly.
Signed-off-by: Fabio Falzoi <[email protected]>
Use the pool name as a key into the per-node in-use pools map instead of looping over each key to find a match.

Signed-off-by: Fabio Falzoi <[email protected]>
Rely on the same code path both to add a new pool and to update an existing one, removing the old function that handled insertion separately. This commit is not meant to change anything in the pool handling logic, but it allows applying additional checks while handling a pool upsert in a single place, whether it is a new pool or an update of an existing one.

Signed-off-by: Fabio Falzoi <[email protected]>
Whenever a CIDR is removed from a pool, or a pool is removed altogether, there might be nodes that are still using those CIDRs. Previously, the operator would emit a warning but would not track those CIDRs, leading to issues in case of a subsequent upsert of the same CIDRs. Specifically, consider the following sequence of events:

- A CiliumPodIPPool is added with a CIDR "10.10.0.0/16" and a mask size equal to 24.
- A CiliumNode requests IPs from that pool and is assigned the CIDR "10.10.1.0/24".
- The CiliumPodIPPool is deleted. The operator emits a warning, but the allocated CIDR is released from its internal allocation maps. Nonetheless, the CIDR is still in use by the node, and there might be pods scheduled on it with IPs belonging to that CIDR.
- The previously deleted CiliumPodIPPool is added again. No further action is taken by the operator, that is, the operator thinks "10.10.1.0/24" is free to use.
- A new node joins the cluster and requests a CIDR allocation from the same pool. The CIDR "10.10.1.0/24" is assigned to it, ultimately leading to multiple pods on different nodes having the same IPs.

To avoid the above scenario we introduce the concept of "orphan CIDRs": an orphan CIDR is a CIDR still allocated to a node but no longer listed in any pool, either because the parent CiliumPodIPPool has been removed or because it has been shrunk. When the operator detects that a CIDR has been orphaned, it stores it in a separate map. The map is checked at each subsequent CiliumPodIPPool upsert event: if the original pool containing the orphan CIDR is restored, the CIDR is marked as allocated to reconcile the allocation status in the operator's internal maps with the effective one.

Fixes: cilium#41986

Signed-off-by: Fabio Falzoi <[email protected]>
Since orphan CIDRs are stored in a map, we lose track of them after an operator restart. Therefore, at startup, after the allocator has received all the CiliumPodIPPool resources from k8s, it marks as orphan all the CIDRs that are "allocated" in the CiliumNodes. This allows reconstructing the status of the orphan CIDRs after the restart, even if a CiliumPodIPPool was deleted while its CIDRs were still in use by nodes.

Signed-off-by: Fabio Falzoi <[email protected]>
Add a test to verify that a newly added pool with a name different from the one originating the orphan CIDRs cannot "steal" those CIDRs. Since we always keep orphan CIDRs untouched in the spec.IPAM.Pools.Allocated section of the CiliumNode, we cannot reassign those CIDRs to other pools; otherwise, the allocations reported in the CiliumNode would be out of date. Therefore, in order to keep the CiliumNode always consistent, we return an error when upserting such a new pool. To unorphan a CIDR, the original pool must be restored first.

Signed-off-by: Fabio Falzoi <[email protected]>
When a pool is updated, we have to loop over all the old CIDRs to remove
the ones that are not part of the latest version of the pool.
However, we should not delete the removed CIDRs while still looping over
them, because doing so interferes with the loop itself, leading to
incorrect results. Instead, we should mark each stale CIDR and remove
them from the slice later, once the loop is finished.
This can be proven by running the new unit test, which was failing
before this change:

    === RUN   TestUpdatePoolKeepOldCIDRs
        pool_allocator_test.go:1082:
            Error Trace:  github.com/cilium/cilium/pkg/ipam/allocator/multipool/pool_allocator_test.go:1082
            Error:        Should be false
            Test:         TestUpdatePoolKeepOldCIDRs
    --- FAIL: TestUpdatePoolKeepOldCIDRs (0.00s)
Signed-off-by: Fabio Falzoi <[email protected]>
Force-pushed from 0478ca5 to 2fc8c99
pippolo84 (Member, Author) commented:
@gandro I kept the release of the orphan CIDRs when they are not used by nodes anymore in a separate commit. The same goes for the test that asserts that "stealing" orphan CIDRs is not possible (see this commit). I hope that will ease the next round of review a bit. Also, while working on the above I discovered another bug and fixed it, see this commit. I understand the PR is becoming pretty convoluted; if you prefer, I can try to split it, let me know.
pippolo84 (Member, Author) commented:
/test
pippolo84 (Member, Author) commented:
ci-integration failure seems legit, marking this back to draft while I investigate.
When a node no longer claims an orphan CIDR (i.e., when all pods scheduled on the node with an IP from an orphan CIDR are deleted), release the orphan CIDR in the pool allocator just like we do for non-orphan CIDRs.

Signed-off-by: Fabio Falzoi <[email protected]>
Whenever a pool is updated, we have to reconcile the status of the pool allocator accordingly, for both the v4 and v6 allocators. Simplify and improve the readability of the update logic.

Signed-off-by: Fabio Falzoi <[email protected]>
Force-pushed from 2fc8c99 to ca8bb6d
pippolo84 (Member, Author) commented:
/test
gandro approved these changes on Oct 16, 2025
When a CiliumPodIPPool is shrunk or deleted but its CIDRs are still in use by nodes (i.e., there are pods scheduled on those nodes with IPs from the removed CIDRs), we enter a state where the operator is no longer able to correctly account for the allocated CIDRs.
Specifically, it is possible to add the deleted pool again and end up with another node being assigned the same CIDRs still in use by the initial node, ultimately having the same IPs assigned to multiple pods.
To avoid this scenario, besides emitting a warning when the operator detects that a CIDR still in use has been deleted, we introduce the concept of orphan CIDRs: CIDRs that no longer have a parent pool. The operator keeps track of those CIDRs specifically, in order to allow fixing the orphan CIDRs scenario by either:
a) restoring the CiliumPodIPPool that should not have been shrunk or deleted in the first place, or
b) deleting all pods using an IP from the deleted pool.
Please refer to the individual commit messages for more information.
Fixes: #41986