
Support tunnel routing in Multi-Pool IPAM mode #38015

Closed
gandro wants to merge 6 commits into main from pr/gandro/use-ipcache-map-for-tunneling-v1.1

Conversation

@gandro (Member) commented Mar 5, 2025

This PR introduces support for tunnel routing with the Multi-Pool IPAM mode. This is achieved by removing the reliance on the tunnel map as a fallback mechanism and instead using pod CIDR entries in IPCache as the fallback. The pod CIDR entries contain the node's tunnel endpoint and encryption key and carry the reserved:world identity (i.e. the same identity as the existing catch-all fallback, which would have been hit before this PR).

The bulk of the work in this PR was done by @pippolo84. This PR is a replacement for his PR #37146 with a few notable changes:

  1. The first commit has been reworked to add unit tests and fix up some issues in the previous version (cf. Use ipcache for tunneling #37146 (comment)).
  2. A commit has been added with unit tests for the IPCache package, emulating some of the expected corner cases.
  3. The commit "node: Remove insertions and deletions to tunnel map" has been dropped. In other words, even though the tunnel map is no longer used on the main branch, we still populate it. This addresses concerns with regard to downgrades.
  4. The Conformance Multi-Pool workflow has been extended to also test in tunnel mode.

This change will eventually allow us to remove the tunnel map completely (ref #20170).

@gandro gandro added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/ipam IP address management, including cloud IPAM area/multipool Affects Multi-Pool IPAM labels Mar 5, 2025
@gandro gandro force-pushed the pr/gandro/use-ipcache-map-for-tunneling-v1.1 branch from 2b92eb9 to 07e6d26 on March 5, 2025 16:08
@gandro (Member Author) commented Mar 5, 2025

/ci-multi-pool

@gandro gandro force-pushed the pr/gandro/use-ipcache-map-for-tunneling-v1.1 branch from 07e6d26 to 2bb568d on March 5, 2025 16:11
@gandro (Member Author) commented Mar 5, 2025

/ci-multi-pool

@gandro gandro force-pushed the pr/gandro/use-ipcache-map-for-tunneling-v1.1 branch from 2bb568d to 59f8ed8 on March 5, 2025 16:32
@gandro (Member Author) commented Mar 5, 2025

/ci-multi-pool

@gandro gandro force-pushed the pr/gandro/use-ipcache-map-for-tunneling-v1.1 branch from 59f8ed8 to a06a68b on March 5, 2025 16:51
@gandro (Member Author) commented Mar 5, 2025

/ci-multi-pool

@gandro gandro force-pushed the pr/gandro/use-ipcache-map-for-tunneling-v1.1 branch from a06a68b to 7b9439f on March 6, 2025 08:43
@gandro (Member Author) commented Mar 6, 2025

/ci-multi-pool

@gandro gandro force-pushed the pr/gandro/use-ipcache-map-for-tunneling-v1.1 branch 2 times, most recently from 763380b to d836551 on March 6, 2025 09:16
@gandro (Member Author) commented Mar 6, 2025

/ci-multi-pool

@gandro gandro force-pushed the pr/gandro/use-ipcache-map-for-tunneling-v1.1 branch from d836551 to 8067549 on March 6, 2025 09:54
@gandro gandro marked this pull request as ready for review March 6, 2025 09:54
@gandro gandro requested review from a team as code owners March 6, 2025 09:54
@gandro gandro requested review from a user, doniacld, thorn3r and ysksuzuki March 6, 2025 09:54
@gandro gandro requested review from aanm and christarazi March 6, 2025 09:54
@gandro (Member Author) commented Mar 6, 2025

/test

@julianwiedmann julianwiedmann self-requested a review March 6, 2025 11:11
@pippolo84 (Member) left a comment

Thanks for completing (and improving) #37146 🙏

LGTM, just a (very minor) nit left inline ✅

@christarazi (Member) left a comment

🚀

@julianwiedmann (Member) commented

👋 Thank you for pushing this forward! Just some (potentially obvious) questions from the peanut gallery:

  1. on initial upgrade, how do we learn about PodCIDRs for existing nodes and insert them into the ipcache? I suppose we replay NodeUpdated() as the agent initially syncs with k8s?
  2. on initial upgrade, how do we handle the race condition where an endpoint's BPF program is re-generated (removing the tunnel-map lookup) before the PodCIDRs are populated into the ipcache? Is that just a dependency that's already enforced inside the agent?
  3. for deleted nodes during agent restart, we rely on the synthesized node deletion feature to clean up the stale PodCIDRs?

@ghost left a comment

docs good

pippolo84 and others added 6 commits March 11, 2025 09:49
Store allocation CIDRs from each remote node into the ipcache, using the world identity. This lays the groundwork for using the ipcache for tunneling instead of relying on the additional tunnel map.

Co-authored-by: Sebastian Wicki <[email protected]>
Signed-off-by: Sebastian Wicki <[email protected]>
Signed-off-by: Fabio Falzoi <[email protected]>
This commit adds a unit test asserting the expected behavior for pod
CIDR entries upserted to IPCache by the node manager.

In particular, we also test two corner cases:

  - The first case tests a scenario where the pod CIDR is also selected
    by a second resource, ensuring that tunnel and encryption key are
    preserved.

  - The second case tests a scenario where the pod CIDR is a /32 and the
    legacy API upserts the pod IP (e.g. based on CEP or kvstore),
    thereby shadowing the existing pod CIDR entry.

Signed-off-by: Sebastian Wicki <[email protected]>
The ipcache map now stores entries describing remote node pod CIDRs,
including the tunnel endpoint and encryption key.

Therefore, when a packet must be encapsulated or encrypted, we can leverage the information from the ipcache lookup, avoiding the additional lookup into the tunnel map.

Signed-off-by: Fabio Falzoi <[email protected]>
When handling the deletion of a node, we should take into account all of the node's allocation CIDRs, not just the primary one.

Signed-off-by: Fabio Falzoi <[email protected]>
Since using the ipcache map to support tunneling makes multi-pool IPAM compatible with it, lift the startup check that stops the agent when both features are enabled.

Signed-off-by: Fabio Falzoi <[email protected]>
As it is now supported.

Signed-off-by: Sebastian Wicki <[email protected]>
@gandro gandro force-pushed the pr/gandro/use-ipcache-map-for-tunneling-v1.1 branch from 8067549 to ef0deac on March 11, 2025 08:50
@gandro (Member Author) commented Mar 11, 2025

👋 Thank you for pushing this forward! Just some (potentially obvious) questions from the peanut gallery:

These are great questions!

> 1. on initial upgrade, how do we learn about PodCIDRs for existing nodes and insert them into the ipcache? I suppose we replay `NodeUpdated()` as the agent initially syncs with k8s?

Yes. The high-level source of information is unchanged here. Previously, the node manager's NodeUpdated() forwarded the node object to the linuxNodeHandler to update the tunnel map; now the node manager extracts the pod CIDRs itself (but the linuxNodeHandler part is preserved for downgrade compatibility).

> 2. on initial upgrade, how do we handle the race condition where an endpoint's BPF program is re-generated (removing the tunnel-map lookup) **before** the PodCIDRs are populated into the ipcache? Is that just a dependency that's already enforced inside the agent?

Yes, the dependency is enforced by the agent. Endpoints are not regenerated until IPCache is fully populated: we wait for the K8s caches to be synced (and the initial IPCache revision to be realized) before we allow endpoint regeneration to start.

> 3. for deleted nodes during agent restart, we rely on the [synthesized node deletion](https://github.com/cilium/cilium/pull/33278) feature to clean up the stale PodCIDRs?

Since IPCache is re-created when the agent restarts, there are no stale pod CIDRs to clean up (at least with regard to IPCache; when it comes to the $podCIDR-via-cilium_host routes installed by linuxNodeHandler, we do rely on those, but that behavior is unchanged by this PR).

@doniacld (Contributor) left a comment

Nice 🌍

@gandro (Member Author) commented Mar 11, 2025

So I just had the realization that keeping the old tunnel map around for downgrade purposes is pretty pointless, since we re-create it every time the agent restarts:

if err := tunnel.TunnelMap().Recreate(); err != nil {

So it might make sense to re-add "node: Remove insertions and deletions to tunnel map" and remove the tunnel map completely.
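To make the argument concrete, here is a toy model of that behavior (the `tunnelMap` type and its `Recreate` method only mimic the delete-and-create-on-startup semantics; they are not Cilium's implementation). Because the map is wiped on every start, entries written purely for a future downgrade never survive into the downgraded agent anyway:

```go
package main

import "fmt"

// tunnelMap is a toy stand-in for the pinned BPF tunnel map.
type tunnelMap struct {
	entries map[string]string // pod CIDR -> tunnel endpoint
}

// Recreate mimics delete-and-create on agent startup: whatever a previous
// agent version wrote is discarded before the starting agent repopulates it.
func (m *tunnelMap) Recreate() {
	m.entries = map[string]string{}
}

func main() {
	m := &tunnelMap{entries: map[string]string{
		"10.244.1.0/24": "192.0.2.10", // written "for downgrade compatibility"
	}}
	m.Recreate() // agent restart, e.g. into a downgraded version
	fmt.Println(len(m.entries)) // the downgrade-compat entries are already gone
}
```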

@gandro (Member Author) commented Mar 12, 2025

/test

@gandro (Member Author) commented Mar 13, 2025

Putting this back into draft, as we have discovered that this needs a cluster-id aware upsert into IPCache.

@gandro gandro marked this pull request as draft March 13, 2025 11:29
@julianwiedmann (Member) commented

> 2. on initial upgrade, how do we handle the race condition where an endpoint's BPF program is re-generated (removing the tunnel-map lookup) **before** the PodCIDRs are populated into the ipcache? Is that just a dependency that's already enforced inside the agent?
>
> Yes, the dependency is enforced by the agent. Endpoints are not regenerated until IPCache is fully populated: we wait for the K8s caches to be synced (and the initial IPCache revision to be realized) before we allow endpoint regeneration to start.
>
> 3. for deleted nodes during agent restart, we rely on the [synthesized node deletion](https://github.com/cilium/cilium/pull/33278) feature to clean up the stale PodCIDRs?
>
> Since IPCache is re-created when the agent restarts, there are no stale pod CIDRs to clean up (at least with regard to IPCache; when it comes to the $podCIDR-via-cilium_host routes installed by linuxNodeHandler, we do rely on those, but that behavior is unchanged by this PR).

Awesome, thank you! This helps a lot to understand the bigger picture. And it also clarifies what happens on downgrade - as the IPCache is rebuilt, the PodCIDR entries naturally disappear again.

@snowhanse commented

Just wanted to add that this is exactly what I've been looking for. I've done a much cruder version of this, proving out that Cilium can support my use case of having the default pod IP pool use tunneling and a separate pool be routable, with IPs advertised over BGP.

So I'll also wait for this one to merge, and in the meantime use it as a base for my testing.

@pippolo84 (Member) commented

Superseded by #38483

@pippolo84 pippolo84 closed this Mar 25, 2025
@christarazi christarazi deleted the pr/gandro/use-ipcache-map-for-tunneling-v1.1 branch March 26, 2025 01:47