Make ipcache async api cluster id aware#38379

Merged
julianwiedmann merged 6 commits into cilium:main from
pippolo84:pr/pippolo84/ipcache-async-api-cluster-id-aware
Mar 25, 2025

Conversation

@pippolo84
Member

@pippolo84 pippolo84 commented Mar 20, 2025

The clusterID is already part of the ipcache BPF map key. This allows supporting a multi-cluster scenario where the ipcache may store multiple entries with the same prefix but different clusterIDs.

The control-plane part of the ipcache allows specifying a clusterID when upserting an address only with the legacy sync API, that is, through the Upsert method. It lacks support for doing the same with the new async API (UpsertMetadata, RemoveMetadata, and so on).

The async API works internally by enqueuing multiple metadata updates and then reconciling them to emit a set of changes to the underlying BPF map. A CIDRTrie is used to store the upserted prefixes and to efficiently find the affected child prefixes for each metadata upsertion or removal.

To support the multi-clusterID case, the CIDRTrie is changed to a map of CIDRTries, keeping one separate trie for each clusterID. The async APIs are then updated to accept a PrefixCluster instead of a bare prefix.

The remaining changes update all the ipcache users to pass a clusterID alongside the prefix. For now the clusterID is always set to 0, that is, all prefixes are considered local to the cluster.

The ability to specify a clusterID will be used in #38015 to allow the insertion of remote nodes' pod allocation CIDRs into the ipcache map with their clusterID, just like it is currently done with the tunnel map.

Note to reviewers: please review commit by commit.

@pippolo84 pippolo84 added area/daemon Impacts operation of the Cilium daemon. release-note/misc This PR makes changes that have no direct user impact. labels Mar 20, 2025
@github-actions github-actions bot added the sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. label Mar 20, 2025
@pippolo84
Member Author

/test

@pippolo84 pippolo84 marked this pull request as ready for review March 20, 2025 19:35
@pippolo84 pippolo84 requested review from a team as code owners March 20, 2025 19:35
Member

@christarazi christarazi left a comment

LGTM 🚀

Member

@bimmlerd bimmlerd left a comment


FQDN lgtm, couple of nits

I'm assuming this was discussed, but did you consider extending the key of the Trie with the clusterID bits instead of having a map of Tries? What's the tradeoff there?

@pippolo84 pippolo84 force-pushed the pr/pippolo84/ipcache-async-api-cluster-id-aware branch from 1af4883 to 42bb749 Compare March 21, 2025 13:32
@pippolo84
Member Author

pippolo84 commented Mar 21, 2025

I'm assuming this was discussed, but did you consider extending the key of the Trie with the clusterID bits instead of having a map of Tries? What's the tradeoff there?

Excellent question, TBH we didn't think about this. My hunch is that we almost always end up in a scenario where:

  • the number of prefixes in the ipcache is high
  • the number of keys of the CIDRTrieMap (i.e., the number of clusters) is almost always 1, or << the number of CIDRs

With these assumptions (one or very few keys in the map), I think that prepending the clusterID bits to each CIDRTrie key (that is, to each upserted CIDR) might be less efficient compared to the map overhead. But I didn't run any benchmark to confirm this.

One possible improvement to the current "bare" map might be to extend this:

```go
// Set contains zero, one, or more members. Zero or one members do not consume any additional
// storage, more than one members are held in an non-exported membersMap.
type Set[T comparable] struct {
	single  *T
	members map[T]empty
}
```

to have a map type optimized to store a single mapping.

Running the existing BenchmarkInjectLabels benchmark to compare the CIDRTrieMap in a single cluster scenario against the previous CIDRTrie shows that with the map we are ~8% slower:

                │   old.txt   │               new.txt               │
                │   sec/op    │    sec/op     vs base               │
InjectLabels-24   9.343µ ± 2%   10.079µ ± 4%  +7.88% (p=0.000 n=10)

                │   old.txt    │            new.txt             │
                │     B/op     │     B/op      vs base          │
InjectLabels-24   6.501Ki ± 1%   6.317Ki ± 4%  ~ (p=0.218 n=10)

                │  old.txt   │            new.txt             │
                │ allocs/op  │ allocs/op   vs base            │
InjectLabels-24   57.00 ± 0%   57.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

Member

@giorio94 giorio94 left a comment


Looks good to me, thanks! Just a few minor comments inline.

@pippolo84 pippolo84 force-pushed the pr/pippolo84/ipcache-async-api-cluster-id-aware branch from 42bb749 to 57e66c4 Compare March 21, 2025 19:48
@pippolo84
Member Author

/test

@giorio94
Member

It looks like this PR needs a rebase to make CI happy.

@bimmlerd
Member

With these assumptions (one or very few keys in the map), I think that prepending the clusterID bits to each CIDRTrie key (that is, to each upserted CIDR) might be less efficient compared to the map overhead. But I didn't run any benchmark to confirm this.

There's the cost of constructing the key, which you'll pay for the additional bits, but that's identical in the map-of-tries variant. In the actual Trie I think they'd matter very little:

  • in the single cluster case, the only change would be that the topmost node would have its prefix extended
  • in the multi cluster case, you'd branch at most sizeof(clusterid) times before getting to the same trie the hash map maps to.

Speaking of, what's the actual number of bits in the clusterID in the DP? would it be feasible to store an array of pointers instead of a proper map, with trivial indexing?

@giorio94
Member

giorio94 commented Mar 24, 2025

Speaking of, what's the actual number of bits in the clusterID in the DP?

IIRC the type is a u16, although currently only either 8 (default) or 9 (extended clustermesh) bits can be used.

@pippolo84
Member Author

  • in the single cluster case, the only change would be that the topmost node would have its prefix extended

IIUC your proposal, we'd end up storing a copy of the key (a cmtypes.PrefixCluster instance) inside each trie node for each upsertion. Thus we'd have 4 more bytes for each node (since clusterID is a uint32):

```go
upsertNode := &node[K, T]{
	prefixLen: prefixLen,
	key:       k,
	value:     value,
}
```

Speaking of, what's the actual number of bits in the clusterID in the DP? would it be feasible to store an array of pointers instead of a proper map, with trivial indexing?

IIRC the type is a u16, although currently only either 8 (default) or 9 (extended clustermesh) bits can be used.

It is a uint32 actually:

But if only 9 bits are used we could have a [512]*CIDRTrie inner type. I can also rely on the cmtypes.ClusterIDExt511 constant. I'll try to implement this variant too and run the above benchmark against it. 👍

@giorio94
Member

giorio94 commented Mar 24, 2025

IIRC the type is a u16, although currently only either 8 (default) or 9 (extended clustermesh) bits can be used.

It is a uint32 actually:

Yeah, I was referring to the datapath. Not sure if it even matters at the end of the day though in terms of space usage, considering alignment and alike.

@pippolo84
Member Author

Yeah, I was referring to the datapath. Not sure if it even matters at the end of the day though in terms of space usage, considering alignment and alike.

Ok, let's go ahead with this approach, then. We can still rework this in the future with the ideas we discussed above 👍

UpsertLabels and RemoveLabels are just aliases of, respectively,
UpsertMetadata and RemoveMetadata. Since the inner methods are exported
as well, remove the aliases and call them directly. Also clean up any
mention of the aliases in comments and tests.

Signed-off-by: Fabio Falzoi <[email protected]>
In preparation for the new ipcache clusterID aware async API, add two
helpers to easily build a PrefixCluster from a netip.Prefix and a
clusterID as well as a helper to build a local PrefixCluster.

Signed-off-by: Fabio Falzoi <[email protected]>
In preparation for the ipcache clusterID aware async API, the DNS rules
restoration should support prefixes qualified with their clusterID, in
the form <prefix>@<clusterID>.

Since these are DNS rules, non-local prefixes are discarded during the
restoration process.

Signed-off-by: Fabio Falzoi <[email protected]>
The clusterID is already part of the ipcache BPF map key. This allows
supporting a multi-cluster scenario where the ipcache may store
multiple entries with the same prefix but different clusterIDs.

The control-plane part of the ipcache allows specifying a clusterID
when upserting an address only with the legacy sync API, that is,
through the `Upsert` method. It lacks support for doing the same with
the new async API (`UpsertMetadata`, `RemoveMetadata` and so on).

The async api works internally by enqueuing multiple metadata updates,
and then reconciling them to emit a set of changes to the underlying BPF
map. A CIDRTrie is used to store the upserted prefixes and to
efficiently find the affected child prefixes for each metadata upsertion
or removal.

To support the multi-clusterID case, the CIDRTrie is changed to a map of
CIDRTries, keeping one separate trie for each clusterID. Furthermore,
the map that associates the prefixes to their metadata now uses a
PrefixCluster as key. Finally, the async APIs are updated to accept a
PrefixCluster instead of a bare prefix.

The remaining changes update all the ipcache users to pass a clusterID
alongside the prefix. In this commit the clusterID is always set to 0,
that is, all prefixes are considered local to the cluster.

Signed-off-by: Fabio Falzoi <[email protected]>
@pippolo84 pippolo84 force-pushed the pr/pippolo84/ipcache-async-api-cluster-id-aware branch from 57e66c4 to 6a624b2 Compare March 24, 2025 15:02
@pippolo84
Member Author

/test


@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 24, 2025
@julianwiedmann julianwiedmann added this pull request to the merge queue Mar 25, 2025
Merged via the queue into cilium:main with commit 8f4c6bb Mar 25, 2025
67 checks passed
@pippolo84 pippolo84 added the affects/v1.17 This issue affects v1.17 branch label Apr 4, 2025