Make ipcache async api cluster id aware #38379
Conversation
/test
bimmlerd left a comment
FQDN lgtm, couple of nits
I'm assuming this was discussed, but did you consider extending the key of the Trie with the clusterID bits instead of having a map of Tries? What's the tradeoff there?
1af4883 to 42bb749 Compare
Excellent question, TBH we didn't think about this. My hunch is that we almost always end up in a scenario where the map contains only one, or very few, keys.
With these assumptions (one or very few keys in the map), I think that prepending the clusterID bits to each CIDRTrie key (that is, to each upserted CIDR) might be less efficient than the map overhead, but I didn't run any benchmark to confirm this. One possible improvement to the current "bare" map might be to extend the approach of cilium/pkg/container/set/set.go (lines 15 to 20 in df50162) to have a map type optimized for storing a single mapping. Running the existing […]
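To make the "map type optimized to store a single mapping" idea concrete, here is a minimal Go sketch. It is purely illustrative and not Cilium's implementation: the type name `smallMap` and its methods are invented for this example. The first key/value pair is stored inline, and a real map is only allocated once a second distinct key appears, matching the expectation above that the map almost always holds a single (local cluster) key.

```go
package main

import "fmt"

// smallMap is a hypothetical map optimized for the single-key common case:
// the first mapping lives inline in the struct, and the overflow map is
// allocated lazily only when a second distinct key is inserted.
type smallMap[K comparable, V any] struct {
	key      K
	value    V
	hasOne   bool
	overflow map[K]V
}

func (m *smallMap[K, V]) Get(k K) (V, bool) {
	if m.hasOne && m.key == k {
		return m.value, true
	}
	v, ok := m.overflow[k] // lookup on a nil map is safe and returns ok=false
	return v, ok
}

func (m *smallMap[K, V]) Set(k K, v V) {
	switch {
	case !m.hasOne && m.overflow == nil:
		// First insertion: store inline, no allocation.
		m.key, m.value, m.hasOne = k, v, true
	case m.hasOne && m.key == k:
		// Update of the inline entry.
		m.value = v
	default:
		// Second distinct key: fall back to a real map.
		if m.overflow == nil {
			m.overflow = map[K]V{}
		}
		m.overflow[k] = v
	}
}

func main() {
	// clusterID -> per-cluster trie (a string stands in for the trie here).
	var tries smallMap[uint32, string]
	tries.Set(0, "local trie")
	tries.Set(7, "remote trie")
	fmt.Println(tries.Get(0))
	fmt.Println(tries.Get(7))
}
```

In the single-key case this avoids both the map allocation and the hashing on every lookup, which is the potential win discussed above.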
giorio94 left a comment
Looks good to me, thanks! Just a few minor comments inline.
42bb749 to 57e66c4 Compare
/test
It looks like this PR needs a rebase to make CI happy.
There's the aspect of constructing the key, where you'll pay for the additional bits - but that's identical in the map-of-tries variant. In the actual Trie I think they'd matter very little.
Speaking of, what's the actual number of bits in the clusterID in the DP? Would it be feasible to store an array of pointers instead of a proper map, with trivial indexing?
IIRC the type is a […]
IIUC your proposal, we'll end up storing a copy of the key (a […]): see cilium/pkg/container/bitlpm/trie.go, lines 576 to 580 in 7e2d98a.
It is a […]. But if only 9 bits are used, we could have a […].
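The array-of-pointers alternative floated above can be sketched as follows. This is an illustration, not Cilium code: it assumes (per the thread) a 9-bit clusterID, i.e. values 0..511, so a fixed 512-slot array indexed directly by the ID replaces the map, trading a few KiB of mostly-nil pointers for trivial O(1) indexing with no hashing. The `trie` type is a placeholder for the real per-cluster CIDRTrie.

```go
package main

import "fmt"

// Assumption from the discussion: a 9-bit clusterID gives 512 possible values.
const maxClusterID = 1 << 9

// trie is a stand-in for a per-cluster CIDR trie.
type trie struct{ name string }

// perCluster indexes tries directly by clusterID instead of using a map.
type perCluster struct {
	tries [maxClusterID]*trie
}

func (p *perCluster) get(id uint16) *trie {
	if int(id) >= len(p.tries) {
		return nil // out-of-range IDs simply have no trie
	}
	return p.tries[id]
}

func (p *perCluster) set(id uint16, t *trie) {
	if int(id) < len(p.tries) {
		p.tries[id] = t
	}
}

func main() {
	var p perCluster
	p.set(0, &trie{name: "local"})
	p.set(7, &trie{name: "remote-7"})
	fmt.Println(p.get(0).name, p.get(7).name, p.get(8) == nil)
}
```

Whether the fixed 512-pointer footprint beats a small map in practice would need the benchmark mentioned earlier in the thread.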
Yeah, I was referring to the datapath. Not sure it even matters at the end of the day, though, in terms of space usage, considering alignment and the like.
Ok, let's go ahead with this approach, then. We can still rework this in the future with the ideas we discussed above 👍
Signed-off-by: Fabio Falzoi <[email protected]>
Signed-off-by: Fabio Falzoi <[email protected]>
UpsertLabels and RemoveLabels are just aliases of UpsertMetadata and RemoveMetadata, respectively. Since the inner methods are exported as well, remove the aliases and call them directly. Also clean up any mention of the aliases in comments and tests. Signed-off-by: Fabio Falzoi <[email protected]>
In preparation for the new ipcache clusterID-aware async API, add two helpers to easily build a PrefixCluster from a netip.Prefix and a clusterID, as well as a helper to build a local PrefixCluster. Signed-off-by: Fabio Falzoi <[email protected]>
In preparation for the ipcache clusterID-aware async API, the DNS rules restoration should support prefixes qualified with their clusterID, in the form <prefix>@<clusterID>. Since these are DNS rules, non-local prefixes are discarded during the restoration process. Signed-off-by: Fabio Falzoi <[email protected]>
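An illustrative parser for the restoration format described above (not the actual Cilium restoration code; the function name is invented): a bare prefix is treated as local, an explicit `@0` suffix is accepted as local, and any non-local or unparseable entry is discarded.

```go
package main

import (
	"fmt"
	"net/netip"
	"strconv"
	"strings"
)

// parseRestoredPrefix parses a restored DNS-rules prefix of the form
// <prefix> or <prefix>@<clusterID>, returning the prefix and true only
// when the entry is valid and local (clusterID 0).
func parseRestoredPrefix(s string) (netip.Prefix, bool) {
	prefixStr, idStr, qualified := strings.Cut(s, "@")
	var id uint64
	if qualified {
		var err error
		id, err = strconv.ParseUint(idStr, 10, 32)
		if err != nil {
			return netip.Prefix{}, false // malformed clusterID: discard
		}
	}
	p, err := netip.ParsePrefix(prefixStr)
	if err != nil || id != 0 {
		return netip.Prefix{}, false // unparseable or non-local: discard
	}
	return p, true
}

func main() {
	for _, s := range []string{"10.0.0.0/24", "10.0.0.0/24@0", "10.0.0.0/24@3"} {
		p, ok := parseRestoredPrefix(s)
		fmt.Println(s, "->", p, ok)
	}
}
```

The discard-if-non-local behavior mirrors the commit message: during DNS rules restoration, only prefixes local to the cluster are kept.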
The clusterID is already part of the ipcache BPF map key. This allows supporting a multi-cluster scenario where the ipcache may store multiple entries with the same prefix but different clusterIDs. The controlplane part of the ipcache allows specifying a clusterID when upserting an address only with the legacy sync API, that is, through the `Upsert` method. But it lacks support to do the same with the new async API (`UpsertMetadata`, `RemoveMetadata` and so on). The async API works internally by enqueuing multiple metadata updates, and then reconciling them to emit a set of changes to the underlying BPF map. A CIDRTrie is used to store the upserted prefixes and to efficiently find the affected child prefixes for each metadata upsert or removal. To support the multi-clusterID case, the CIDRTrie is changed to a map of CIDRTries, in order to keep one separate trie for each clusterID. Furthermore, the map that associates the prefixes with their metadata now uses a PrefixCluster as key. Finally, the async APIs are updated to accept a PrefixCluster instead of a bare prefix. The remaining changes update all the ipcache users to pass a clusterID alongside the prefix. In this commit the clusterID is always set to 0, that is, the prefixes are all considered local to the cluster. Signed-off-by: Fabio Falzoi <[email protected]>
57e66c4 to 6a624b2 Compare
/test
The clusterID is already part of the ipcache BPF map key. This allows supporting a multi-cluster scenario where the ipcache may store multiple entries with the same prefix but different clusterIDs.
The controlplane part of the ipcache allows specifying a clusterID when upserting an address only with the legacy sync API, that is, through the `Upsert` method. But it lacks support to do the same with the new async API (`UpsertMetadata`, `RemoveMetadata` and so on).
The async API works internally by enqueuing multiple metadata updates, and then reconciling them to emit a set of changes to the underlying BPF map. A CIDRTrie is used to store the upserted prefixes and to efficiently find the affected child prefixes for each metadata upsert or removal.
To support the multi-clusterID case, the CIDRTrie is changed to a map of CIDRTries, in order to keep one separate trie for each clusterID. Then, the async API is updated to accept a `PrefixCluster` instead of a bare prefix.
The remaining changes update all the ipcache users to pass a clusterID alongside the prefix. For now the clusterID is always set to 0, that is, the prefixes are all considered local to the cluster.
The ability to specify a clusterID will be used in #38015 to allow the insertion of remote nodes' pod allocation CIDRs into the ipcache map with their clusterID, just like it is currently done with the tunnel map.
Notes to the reviewers: please review commit by commit