Make ipcache async api cluster id aware#38379

Merged
julianwiedmann merged 6 commits into cilium:main from
pippolo84:pr/pippolo84/ipcache-async-api-cluster-id-aware
Mar 25, 2025

Conversation

@pippolo84
Member

@pippolo84 pippolo84 commented Mar 20, 2025

The clusterID is already part of the ipcache BPF map key. This allows supporting a multi-cluster scenario where the ipcache may store multiple entries with the same prefix but different clusterIDs.

The control-plane part of the ipcache allows specifying a clusterID when upserting an address only with the legacy sync API, that is, through the Upsert method. It lacks support for doing the same with the new async API (UpsertMetadata, RemoveMetadata, and so on).

The async API works internally by enqueuing multiple metadata updates and then reconciling them to emit a set of changes to the underlying BPF map. A CIDRTrie is used to store the upserted prefixes and to efficiently find the affected child prefixes for each metadata upsertion or removal.

To support the multi-clusterID case, the CIDRTrie is changed to a map of CIDRTries, keeping one separate trie for each clusterID. The async APIs are then updated to accept a PrefixCluster instead of a bare prefix.

The remaining changes update all the ipcache users to pass a clusterID alongside the prefix. For now the clusterID is always set to 0, that is, all prefixes are considered local to the cluster.

The ability to specify a clusterID will be used in #38015 to allow the insertion of remote nodes' pod allocation CIDRs into the ipcache map with their clusterID, just like it is currently done with the tunnel map.

Note to reviewers: please review commit by commit.

@pippolo84 pippolo84 added area/daemon Impacts operation of the Cilium daemon. release-note/misc This PR makes changes that have no direct user impact. labels Mar 20, 2025
@github-actions github-actions bot added the sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. label Mar 20, 2025
@pippolo84
Member Author

/test

@pippolo84 pippolo84 marked this pull request as ready for review March 20, 2025 19:35
@pippolo84 pippolo84 requested review from a team as code owners March 20, 2025 19:35
Member

@christarazi christarazi left a comment

LGTM 🚀

Member

@bimmlerd bimmlerd left a comment


FQDN lgtm, couple of nits

I'm assuming this was discussed, but did you consider extending the key of the Trie with the clusterID bits instead of having a map of Tries? What's the tradeoff there?

@pippolo84 pippolo84 force-pushed the pr/pippolo84/ipcache-async-api-cluster-id-aware branch from 1af4883 to 42bb749 Compare March 21, 2025 13:32
@pippolo84
Member Author

pippolo84 commented Mar 21, 2025

I'm assuming this was discussed, but did you consider extending the key of the Trie with the clusterID bits instead of having a map of Tries? What's the tradeoff there?

Excellent question, TBH we didn't think about this. My hunch is that we almost always end up in a scenario where:

  • the number of prefixes in the ipcache is high
  • the number of keys of the CIDRTrieMap (i.e., the number of clusters) is almost always 1, or << the number of CIDRs

With these assumptions (one or very few keys in the map), I think that prepending the clusterID bits to each CIDRTrie key (that is, to each upserted CIDR) might be less efficient compared to the map overhead. But I didn't run any benchmark to confirm this.

One possible improvement to the current "bare" map might be to extend this:

```go
// Set contains zero, one, or more members. Zero or one members do not consume any additional
// storage, more than one members are held in an non-exported membersMap.
type Set[T comparable] struct {
	single  *T
	members map[T]empty
}
```

to have a map type optimized to store a single mapping.

Running the existing BenchmarkInjectLabels benchmark to compare the CIDRTrieMap in a single cluster scenario against the previous CIDRTrie shows that with the map we are ~8% slower:

                │   old.txt   │               new.txt               │
                │   sec/op    │    sec/op     vs base               │
InjectLabels-24   9.343µ ± 2%   10.079µ ± 4%  +7.88% (p=0.000 n=10)

                │   old.txt    │            new.txt             │
                │     B/op     │     B/op      vs base          │
InjectLabels-24   6.501Ki ± 1%   6.317Ki ± 4%  ~ (p=0.218 n=10)

                │  old.txt   │            new.txt             │
                │ allocs/op  │ allocs/op   vs base            │
InjectLabels-24   57.00 ± 0%   57.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

Member

@giorio94 giorio94 left a comment


Looks good to me, thanks! Just a few minor comments inline.

@pippolo84 pippolo84 force-pushed the pr/pippolo84/ipcache-async-api-cluster-id-aware branch from 42bb749 to 57e66c4 Compare March 21, 2025 19:48
@pippolo84
Member Author

/test

@giorio94
Member

It looks like this PR needs a rebase to make CI happy.

@bimmlerd
Member

With these assumptions (one or very few keys in the map), I think that prepending the clusterID bits to each CIDRTrie key (that is, to each upserted CIDR) might be less efficient compared to the map overhead. But I didn't run any benchmark to confirm this.

There's the cost of constructing the key, which you'll pay for the additional bits, but that's identical in the map-of-tries variant. In the actual Trie I think they'd matter very little:

  • in the single cluster case, the only change would be that the topmost node would have its prefix extended
  • in the multi cluster case, you'd branch at most sizeof(clusterid) times before getting to the same trie the hash map maps to.

Speaking of, what's the actual number of bits in the clusterID in the DP? would it be feasible to store an array of pointers instead of a proper map, with trivial indexing?

@giorio94
Member

giorio94 commented Mar 24, 2025

Speaking of, what's the actual number of bits in the clusterID in the DP?

IIRC the type is a u16, although currently only either 8 (default) or 9 (extended clustermesh) bits can be used.

@pippolo84
Member Author

  • in the single cluster case, the only change would be that the topmost node would have its prefix extended

IIUC your proposal, we'd end up storing a copy of the key (a cmtypes.PrefixCluster instance) inside each trie node for each upsertion. Thus we'd have 4 more bytes for each node (since clusterID is a uint32):

```go
upsertNode := &node[K, T]{
	prefixLen: prefixLen,
	key:       k,
	value:     value,
}
```

Speaking of, what's the actual number of bits in the clusterID in the DP? would it be feasible to store an array of pointers instead of a proper map, with trivial indexing?

IIRC the type is a u16, although currently only either 8 (default) or 9 (extended clustermesh) bits can be used.

It is a uint32 actually:

But if only 9 bits are used we could have a [512]*CIDRTrie inner type. I can also rely on the cmtypes.ClusterIDExt511 constant. I'll try to implement this variant too and run the above benchmark against it. 👍

@giorio94
Member

giorio94 commented Mar 24, 2025

IIRC the type is a u16, although currently only either 8 (default) or 9 (extended clustermesh) bits can be used.

It is a uint32 actually:

Yeah, I was referring to the datapath. Not sure if it even matters at the end of the day though in terms of space usage, considering alignment and alike.

@pippolo84
Member Author

Yeah, I was referring to the datapath. Not sure if it even matters at the end of the day though in terms of space usage, considering alignment and alike.

Ok, let's go ahead with this approach, then. We can still rework this in the future with the ideas we discussed above 👍

UpsertLabels and RemoveLabels are just aliases of, respectively,
UpsertMetadata and RemoveMetadata. Since the inner methods are exported
as well, remove the aliases and call them directly. Also clean up any
mention of the aliases in comments and tests.

Signed-off-by: Fabio Falzoi <[email protected]>
In preparation for the new ipcache clusterID aware async API, add two
helpers to easily build a PrefixCluster from a netip.Prefix and a
clusterID as well as a helper to build a local PrefixCluster.

Signed-off-by: Fabio Falzoi <[email protected]>
In preparation for the ipcache clusterID aware async API, the DNS rules
restoration should support prefixes qualified with their clusterID, in
the form <prefix>@<clusterID>.

Since these are DNS rules, non-local prefixes are discarded during the
restoration process.

Signed-off-by: Fabio Falzoi <[email protected]>
The clusterID is already part of the ipcache BPF map key. This allows
supporting a multi-cluster scenario where the ipcache may store
multiple entries with the same prefix but different clusterIDs.

The control-plane part of the ipcache allows specifying a clusterID
when upserting an address only with the legacy sync API, that is,
through the `Upsert` method. It lacks support for doing the same with
the new async API (`UpsertMetadata`, `RemoveMetadata` and so on).

The async api works internally by enqueuing multiple metadata updates,
and then reconciling them to emit a set of changes to the underlying BPF
map. A CIDRTrie is used to store the upserted prefixes and to
efficiently find the affected child prefixes for each metadata upsertion
or removal.

To support the multi-clusterID case, the CIDRTrie is changed to a map of
CIDRTries, keeping one separate trie for each clusterID. Furthermore,
the map that associates the prefixes to their metadata now uses a
PrefixCluster as key. Finally, the async APIs are updated to accept a
PrefixCluster instead of a bare prefix.

The remaining changes update all the ipcache users to pass a clusterID
alongside the prefix. In this commit the clusterID is always set to 0,
that is, all prefixes are considered local to the cluster.

Signed-off-by: Fabio Falzoi <[email protected]>
@pippolo84 pippolo84 force-pushed the pr/pippolo84/ipcache-async-api-cluster-id-aware branch from 57e66c4 to 6a624b2 Compare March 24, 2025 15:02
@pippolo84
Member Author

/test


@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 24, 2025
@julianwiedmann julianwiedmann added this pull request to the merge queue Mar 25, 2025
Merged via the queue into cilium:main with commit 8f4c6bb Mar 25, 2025
67 checks passed
@pippolo84 pippolo84 added the affects/v1.17 This issue affects v1.17 branch label Apr 4, 2025