Implement Consumable Capacity and fine-grained CPU Allocation on the DRA driver#16
Conversation
|
@johnbelamaric @ffromani This is currently a prototype and this PR is not ready for review yet. I wanted to get feedback on the idea of using consumable capacity along with reusing CPU allocation logic from Kubelet to make the fine-grained CPU assignment decisions on the Node (DRA driver) instead of the scheduler. This should also address #5. I have captured more details in this doc. |
|
/cc @catblade @dchen1107 |
|
@pravk03: GitHub didn't allow me to request PR reviews from the following users: catblade, dchen1107. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
ffromani
left a comment
super quick and very partial review because I'm going to be OOO next week till Sep 1.
- I'm super interested in the consumable capacity approach. I'm admittedly catching up with the DRA ecosystem (and CDI), but unless @johnbelamaric / @klueska, who have a much deeper understanding, raise concerns about the general direction, this should be the way to go.
- historically, cpumanager suffered quite a lot from the fact that it had to work with the limited and inflexible hardware representation cadvisor offered, which was a pain to extend and a poor fit for modeling modern, complex CPU topologies. The cpu assignment logic also indirectly suffers from these limitations. IMO, it's OK to start replicating/copying cpumanager logic to quickly catch up, but we should treat it as (a) an implementation detail subject to change and (b) only the first bootstrap step. This should be made abundantly clear in the docs, and the code layout should probably also reflect that (move cpuinfo and cpuassignment, and possibly the driver, into internal/?)
```go
func init() {
	flag.StringVar(&kubeconfig, "kubeconfig", "", "absolute path to the kubeconfig file")
	flag.StringVar(&hostnameOverride, "hostname-override", "", "If non-empty, will be used as the name of the Node that kube-network-policies is running on. If unset, the node name is assumed to be the same as the node's hostname.")
	flag.StringVar(&cpuDeviceMode, "cpu-device-mode", "grouped", "Sets the mode for exposing CPU devices. 'grouped' exposes a single device for a group of CPUs (currently grouping by socket). 'individual' exposes each CPU as a separate device.")
```
Let's evaluate if we want to keep the old method at all. I fail to see the advantages of the old method, besides the fact the new method depends on less mature APIs.
I think there is still value in the "individual" mode. It can expose much more granular details (e.g., P-cores and E-cores), which allows the workload to explicitly request them in the ResourceClaim. With consumable capacity (currently), this level of detail is abstracted away within a group (like the socket in this implementation).
Long term: The ideal solution could be a combination of consumable capacity and partitionable devices (kubernetes/enhancements#4815) to express hierarchical topology information. For instance, we could partition a node or socket into smaller, potentially overlapping sets based on attributes like core type or L3 cache. A workload could then request CPUs from a specific partition (e.g., "only E-cores"), and the DRA driver would allocate from that partition while also tracking consumption at the group level. FWIU, partitionable devices may not be fully compatible with consumable capacity at present, but it's a direction we could explore to achieve both fine-grained resource selection and scalability.
Yes, that's a really interesting option. I have a couple other use cases where I may want to use consumable capacity and partitionable devices together. The key API I think we need here to connect the two is a way to specify how to translate consumable capacity consumption into consumption from both that user-facing entry AND some counter sets. Alternatively, we may want to allow the base functionality of partitionable - counter sets - to be utilized without pre-enumeration of devices via a similar mapping. This essentially would allow us to encode the "rules" for how to generate partitions without having to enumerate them all. When there can be 1000s of partitions, this can be necessary. I want to explore this more and would love more details on this use case.
An alternative to using partitionable would be to use aggregation on the claim side. So, your claim would say "I need 15 CPUs from any number of sockets", as opposed to using AllocationMode: ExactCount, which would require the user to know the CPU/socket topology ahead of time. We have talked about that as a feature on the GPU use case before ("I can take anywhere from 1-4 GPUs, so long as aggregate memory is at least 80Gi").
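The aggregate-claim idea above ("I need 15 CPUs from any number of sockets") can be illustrated with a small greedy sketch. This is not the DRA scheduler's actual algorithm; the function name and the map-based inventory are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

// pickAggregate greedily selects per-socket CPU counts until the aggregate
// request is met, preferring fuller sockets to reduce fragmentation.
// socketAvail maps a socket ID to the number of allocatable CPUs it offers.
// It returns nil if the request cannot be satisfied.
func pickAggregate(socketAvail map[int]int, want int) map[int]int {
	ids := make([]int, 0, len(socketAvail))
	for id := range socketAvail {
		ids = append(ids, id)
	}
	// Sort by descending availability, breaking ties by socket ID for determinism.
	sort.Slice(ids, func(a, b int) bool {
		if socketAvail[ids[a]] != socketAvail[ids[b]] {
			return socketAvail[ids[a]] > socketAvail[ids[b]]
		}
		return ids[a] < ids[b]
	})
	picked := map[int]int{}
	remaining := want
	for _, id := range ids {
		if remaining == 0 {
			break
		}
		n := socketAvail[id]
		if n > remaining {
			n = remaining
		}
		if n > 0 {
			picked[id] = n
			remaining -= n
		}
	}
	if remaining > 0 {
		return nil // not enough CPUs across all sockets
	}
	return picked
}

func main() {
	// Two sockets offering 10 and 8 CPUs; ask for 15 in aggregate.
	fmt.Println(pickAggregate(map[int]int{0: 10, 1: 8}, 15))
}
```

The same greedy shape also fits the GPU example from the comment if the map values are per-device memory instead of CPU counts.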
```go
	flag.StringVar(&kubeconfig, "kubeconfig", "", "absolute path to the kubeconfig file")
	flag.StringVar(&hostnameOverride, "hostname-override", "", "If non-empty, will be used as the name of the Node that kube-network-policies is running on. If unset, the node name is assumed to be the same as the node's hostname.")
	flag.StringVar(&cpuDeviceMode, "cpu-device-mode", "grouped", "Sets the mode for exposing CPU devices. 'grouped' exposes a single device for a group of CPUs (currently grouping by socket). 'individual' exposes each CPU as a separate device.")
	// TODO: We can add a --cpu-grouping-strategy flag to specify how the CPUs should be grouped (e.g., 'socket', 'l3cache', 'numa').
```
There always was a strong push to have this policy per-workload vs per-node with cpumanager and topology manager, so I'd be really wary of adding this option. We should IMO push harder for a cleaner/more flexible solution in the long run.
|
cc @sunya-ch |
|
Hi, sorry I've been out on leave. Still out on leave until October 20, but if I'm reading this correctly, this has us back to talking as if a CPU is a CPU, with no distinguishing between types and locations. For HPC and for other high-performance workloads, schedulers like Slurm schedule according to position. I truly understand the desire to push this decision into the node, but part of the point of this work is to be able to handle multi-node workloads where placement absolutely will matter. Additionally, part of the reason we couldn't do blocks of CPUs is that sometimes heterogeneous CPUs are not in an easily predictable location. It might be useful to have this as an option, however, as if you're not using that type of workload in the cluster, it may not be useful to have the scheduler work harder for placement. |
|
@catblade Thanks for the feedback. I agree that for HPC and other high-performance workloads, topology-aware scheduling is critical, and we need to preserve the ability to distinguish between CPU types and their locations. The goal of this PR is not to replace the existing model but to offer a more scalable alternative for certain use cases. This PR introduces a --cpu-device-mode flag that allows the operator to choose between the two modes. The primary motivation for the grouped mode is to address the scalability limitations of the current approach, which can create a large number of API objects on nodes with many CPUs and increase scheduling latency. I see the grouped mode as a good solution for use cases where you need alignment between different DRA drivers, or for workloads that are not as sensitive to topology and where managing a large number of individual CPU devices is not necessary. In the future, I believe that integrating consumable capacity with partitionable devices could allow for hierarchical topology definitions and resource requests from specific partitions, achieving both scalability and granular control. |
|
@pravk03 I understand. And we'll have to be clever eventually even in grouped mode, because the number of cores can already exceed the 256 limit per partition today. Maybe we can come up with something clever in a brainstorm. |
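One naive way to stay under a per-partition cap in grouped mode would be to split an oversized group into several published devices. This is only a sketch of the idea; splitGroup and the 256 cap used here are taken from the comment above as assumptions, not anything the driver implements.

```go
package main

import "fmt"

// splitGroup splits one group's CPU IDs into chunks of at most maxPerDevice,
// so each chunk could be published as its own device. The 256 value mirrors
// the limit mentioned in the discussion; treat it as an assumption.
func splitGroup(cpus []int, maxPerDevice int) [][]int {
	var chunks [][]int
	for len(cpus) > 0 {
		n := maxPerDevice
		if len(cpus) < n {
			n = len(cpus)
		}
		chunks = append(chunks, cpus[:n])
		cpus = cpus[n:]
	}
	return chunks
}

func main() {
	// A hypothetical 300-CPU socket would need two devices under a 256 cap.
	cpus := make([]int, 300)
	for i := range cpus {
		cpus[i] = i
	}
	fmt.Println(len(splitGroup(cpus, 256)))
}
```

The downside, and probably why a brainstorm is needed, is that a claim spanning two chunks of the same socket loses the single-group accounting that consumable capacity provides.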
6c99e51 to bbe5e91 Compare
08a6874 to 88e9a23 Compare
|
I'm obviously in favor of supporting the HPC use case. I'm still trying to reconcile the declarative model, which should be preserved, with the benefits of exposing granular CPUs and, well, allowing workloads to pick the exact CPU they wish to run on. Where can I find again the document detailing why we have such a strict and precise requirement? I probably already asked, but I'm struggling to find it. |
88e9a23 to 78c9b37 Compare
dbd3676 to 969be76 Compare
```go
	// L3CacheID is the L3 cache ID
	L3CacheID int64 `json:"l3CacheID"`

	// TODO: FIX
	UncoreCacheID int `json:"uncoreCacheID"`
```
TODO is because we have a duplication, right?
I think we are at a junction:
- we fork the cpumanager code: import it and then start to adapt it to the driver's needs. Then we can rename UncoreCacheID to L3CacheID and move on
- we want to import the cpumanager code with minimal or no changes. Then we should adapt the driver code and rename L3CacheID to UncoreCacheID.
For the time being, I don't see the shared code moving into a public package, one of the main reasons being that the cpumanager code was never really made to be a library, so we need quite some review before we enable this.
Nevertheless, being able to update it with a trivial copy/paste seems desirable, so I think we should bite the bullet and rename L3CacheID to UncoreCacheID, aligning with the cpumanager naming.
Sidenote: I think the UncoreCache naming is actually slightly better than L3Cache in this specific context.
Thanks for the suggestion, and I agree with the above. Replacing L3CacheID with UncoreCacheID to maintain compatibility with the code borrowed from the kubelet codebase.
```go
	// Process the last block of cpu info.
	if len(cpuInfoLines) > 0 {
		cpuInfo := parseCPUInfo(isHybrid, eCoreCpus, cpuInfoLines...)
		if cpuInfo != nil {
			cpuInfos = append(cpuInfos, *cpuInfo)
		}
	}
```
for the future (milestone 2 or so) we may want to try to use only sysfs and stop processing cpuinfo. This should lead to simpler code, and we should have all the information we need, and then some.
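If the driver does move to sysfs-only discovery, most of the topology files there (e.g. /sys/devices/system/cpu/cpuN/topology/core_cpus_list) use the kernel's "cpulist" string format such as "0-3,8,10-11". A hedged sketch of a parser for that format; the function name is my own invention, not driver code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList parses the kernel cpulist format used by sysfs topology
// files, e.g. "0-3,8,10-11", into an expanded slice of CPU IDs.
func parseCPUList(s string) ([]int, error) {
	var cpus []int
	s = strings.TrimSpace(s)
	if s == "" {
		return cpus, nil
	}
	for _, part := range strings.Split(s, ",") {
		if lo, hi, ok := strings.Cut(part, "-"); ok {
			// A range like "0-3": expand every CPU ID in it.
			start, err := strconv.Atoi(lo)
			if err != nil {
				return nil, err
			}
			end, err := strconv.Atoi(hi)
			if err != nil {
				return nil, err
			}
			for i := start; i <= end; i++ {
				cpus = append(cpus, i)
			}
		} else {
			// A single CPU ID like "8".
			v, err := strconv.Atoi(part)
			if err != nil {
				return nil, err
			}
			cpus = append(cpus, v)
		}
	}
	return cpus, nil
}

func main() {
	cpus, err := parseCPUList("0-3,8,10-11")
	fmt.Println(cpus, err)
}
```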
```go
func (d CPUDetails) KeepOnly(cpus cpuset.CPUSet) CPUDetails {
	result := CPUDetails{}
	for cpu, info := range d {
```
do we want to extract this code in a separate file, to make future imports from cpumanager codebase easier?
Makes sense. Moved these interface functions (along with cpu_assignment.go) to a separate file. We can further clean this up and use the same directory structure as the kubelet code. Added a TODO for now and will take up all cleanups in a follow-up PR.
```go
	socketID := int64(i)
	deviceName := fmt.Sprintf("%s%d", cpuDeviceSocketGroupedPrefix, i)
	socketCPUSet := topo.CPUDetails.CPUsInSockets(i)
	allocatableCPUs := socketCPUSet.Difference(cp.reservedCPUs)
```
This is important to make sure we use consistent IDs so that we account the reserved CPUs against the right pool. It may be a source of nasty bugs later on if the kernel enumerates resources in a nontrivial way (yep, it actually happened: kubernetes/kubernetes#100145)
Updated to read the IDs from the topology directly.
```go
	socketCPUSet := topo.CPUDetails.CPUsInSockets(i)
	allocatableCPUs := socketCPUSet.Difference(cp.reservedCPUs)
	availableCPUsInSocket := int64(allocatableCPUs.Size())
	hyperThreadEnabled := topo.CPUsPerCore() > 1
```
this was a historical flaw in the cpumanager code. We should fix this design issue and fetch the actual value from the system rather than try to guess.
We can check system-wide or per-core.
A system-wide check is already an improvement, and we need to read /sys/devices/system/cpu/smt/control
Added an explicit check based on /sys/devices/system/cpu/smt/control in the cpuinfo package. I retained the fallback mechanism based on CPU and core counts if the file is not present. Do you think that is needed?
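The control file reports states such as "on", "off", "forceoff", and "notsupported". A minimal sketch of the explicit check plus the count-based fallback described above; the function name and signature are assumptions, not the driver's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// smtEnabled interprets the contents of /sys/devices/system/cpu/smt/control.
// Only "on" means SMT is active; "off", "forceoff", "notsupported", etc.
// all mean it is not. When the file is absent (haveControl == false), it
// falls back to the historical CPU/core-count heuristic from cpumanager.
func smtEnabled(control string, haveControl bool, numCPUs, numCores int) bool {
	if haveControl {
		return strings.TrimSpace(control) == "on"
	}
	// Fallback: more logical CPUs than physical cores implies SMT siblings.
	return numCores > 0 && numCPUs/numCores > 1
}

func main() {
	fmt.Println(smtEnabled("on\n", true, 0, 0)) // sysfs says SMT is on
	fmt.Println(smtEnabled("", false, 16, 8))   // fallback heuristic path
}
```

Keeping the fallback seems cheap insurance for kernels built without CONFIG_HOTPLUG_SMT, where the control file does not exist.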
```go
	Attributes: map[resourceapi.QualifiedName]resourceapi.DeviceAttribute{
		"dra.cpu/socketID":       {IntValue: &socketID},
		"dra.cpu/numCPUs":        {IntValue: &availableCPUsInSocket},
		"dra.cpu/hyperthreading": {BoolValue: &hyperThreadEnabled},
```
I'd use "smt" rather than the commercial name (hyperthreading).
I think we can and should expose the NUMANodeID anyway, can't we?
In general, shouldn't we expose the same attributes in both grouping modes?
renamed to dra.cpu/smtEnabled
Regarding NUMANodeID: since there could be multiple NUMA domains in a socket, I decided not to expose it with socket grouping.
We could technically expose a list of NUMA IDs, but we currently lack primitives in ResourceClaim to match against a list.
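One workaround, under the assumption that exact string matching is acceptable, would be to publish the group's NUMA node IDs as a single canonical string attribute. The attribute name dra.cpu/numaNodeIDs and the helper below are hypothetical, not part of this PR:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// numaIDsAttr renders a group's NUMA node IDs as a stable, sorted,
// comma-separated string (e.g. "0,1"). Published under a hypothetical
// string attribute like "dra.cpu/numaNodeIDs", an exact-match selector
// could at least target a known NUMA combination, even without
// list-matching primitives in ResourceClaim.
func numaIDsAttr(ids []int) string {
	sorted := append([]int(nil), ids...) // copy so the caller's slice is untouched
	sort.Ints(sorted)
	parts := make([]string, len(sorted))
	for i, id := range sorted {
		parts[i] = strconv.Itoa(id)
	}
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(numaIDsAttr([]int{1, 0})) // canonical form regardless of input order
}
```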
ffromani
left a comment
need to deep dive on the e2e tests, but at a glance they look reasonable.
It is not a blocker for this PR, but we need to start thinking about how to keep the test matrix as small as possible, because it is starting to become unwieldy.
a76eb94 to 2e686f6 Compare
```go
// NOTE: This file is a copy of https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/cpu_assignment.go
// as of commit https://github.com/kubernetes/kubernetes/commit/fd5b2efa76e44c5ef523cd0711f5ed23eb7e6b1a with minor modifications
```
This is a lot of code copied from Kubernetes core! How will this be maintained? What is the sync strategy going to be? Are there any divergences, and have they been documented?
That's a valid concern. We decided to start with the kubelet code to leverage its maturity and existing test coverage. @ffromani and I briefly discussed the long-term strategy, but it wasn't very clear at that point and definitely needs more discussion. We may not necessarily keep in sync with the kubelet code, as customizing the implementation to fit the DRA driver's specific use cases (and removing unused logic) could be more beneficial in the long term.
Currently, the divergences from the kubelet code are minimal. I can add a comment in the code to document them.
In addition to what @pravk03 mentioned, the kubelet code is neither ready nor designed to be consumed as a library. I reckon the effort to make it so is not trivial, and it is not clear yet whether we should do that from the kube side. The ecosystem is still very fluid, and until we stabilize, I think that our poor man's vendoring (borrowing files from kubelet) is actually the cheapest approach.
Surely this is a long-term concern we need to address somehow.
I won't rule out that we end up rolling our own independent allocator code.
Thanks both! This gives helpful context on why the kubelet code was brought in directly. I agree that starting from a mature, well-tested implementation is a sensible short-term choice, especially while the DRA ecosystem is still evolving.
At the same time, I think it would be good for us to define an explicit maintenance strategy so we don’t end up with silent drift or unexpected behavioural differences over time. Even something lightweight would help, for example:
- documenting the initial divergence points and any future deviations as they happen
- clarifying whether we plan to periodically re-sync with kubelet or intentionally fork and evolve independently
- outlining the criteria that would signal it’s time to move away from borrowed kubelet code toward a standalone allocator
This doesn’t need to be finalised today, but having a small design note capturing the intended direction would give contributors and users better clarity and help avoid surprises when kubelet logic evolves.
Thanks, @swatisehgal. That makes sense. Updating the documentation is long overdue, so I will take that up in a follow-up PR after this one merges. I'll make sure to include the suggestions you outlined above.
|
Diff between kubelet code (commit kubernetes/kubernetes@fd5b2ef) and code in this repo - |
ffromani
left a comment
LGTM with minor comments, most notably the spurious file still seems to be around?
I'll have another final pass focusing on tests, then I think we can merge and possibly iterate a bit later to polish.
```go
func (t *CPUTopology) CPUsPerCore() int {
	if t.NumCores == 0 {
		return 0
	}
	return t.NumCPUs / t.NumCores
}

func (t *CPUTopology) CPUsPerSocket() int {
	if t.NumSockets == 0 {
		return 0
	}
	return t.NumCPUs / t.NumSockets
}

func (t *CPUTopology) CPUsPerUncore() int {
	if t.NumUncoreCache == 0 {
		// Avoid division by zero. If there are no uncore caches, then
		// no CPUs can be in one.
		return 0
	}
	// Note: this is an approximation that assumes all uncore caches have the same number of CPUs.
	return t.NumCPUs / t.NumUncoreCache
}
```
I'm really torn here. On one hand we should not block on this, but this really egregious oversimplification has been a subtle source of trouble in cpumanager. The easiest case to consider is a user offlining a set of CPUs asymmetrically, e.g. only on one NUMA node or LLC complex. All the math goes haywire.
This is probably one of the best areas in which we should diverge from the cpumanager code (which is pretty much unfixable at this point in time) and do better. I'm confident that this "do better" doesn't require a total rewrite or a super complex representation.
I think the best way forward is to be very aware of this fragility and plan improvements for the next milestones.
Makes sense. I added a TODO and filed #35 for the follow-up work.
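To make the fragility concrete: counting from the per-CPU records instead of dividing totals stays correct under asymmetric offlining. A sketch under the assumption of a simplified cpuInfo record (the real driver types differ):

```go
package main

import "fmt"

// cpuInfo is a simplified stand-in for the driver's per-CPU topology record.
type cpuInfo struct {
	CPUID         int
	UncoreCacheID int
}

// cpusPerUncore counts CPUs per uncore cache from the actual per-CPU
// details, instead of assuming a uniform NumCPUs/NumUncoreCache split.
// Asymmetric offlining then yields correct per-cache counts rather than
// a skewed average.
func cpusPerUncore(details []cpuInfo) map[int]int {
	counts := map[int]int{}
	for _, c := range details {
		counts[c.UncoreCacheID]++
	}
	return counts
}

func main() {
	// Cache 0 has lost two CPUs to offlining; cache 1 is fully populated.
	details := []cpuInfo{
		{CPUID: 0, UncoreCacheID: 0}, {CPUID: 1, UncoreCacheID: 0},
		{CPUID: 4, UncoreCacheID: 1}, {CPUID: 5, UncoreCacheID: 1},
		{CPUID: 6, UncoreCacheID: 1}, {CPUID: 7, UncoreCacheID: 1},
	}
	fmt.Println(cpusPerUncore(details))
}
```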
* pkg/cpumanager/cpu_assignment.go and tests are branched from kubelet [cpu_assignment.go](https://github.com/kubernetes/kubernetes/blob/4cf195304caa519be476b367267f6c656bce19f7/pkg/kubelet/cm/cpumanager/cpu_assignment.go)
* The pkg/cpuinfo interface and data structures have been updated to provide the detailed topology information required by the imported CPU assignment logic, including methods for querying CPUs by NUMA node, socket, and core.
* Renamed NumaNode to NUMANodeID in CPUInfo for consistency.
1ef2eb8 to 3aff81c Compare
|
/approve thanks for your good work @pravk03 . I think we have some more room for polishing, but that's better done once this PR is merged. So let's move on. |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ffromani, pravk03. The full list of commands accepted by this bot can be found here. The pull request process is described here. Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/hold |
|
/hold cancel. Only the notification email still had the "WIP" prefix. We're good to merge. |
This PR enables modeling CPU resources as a consumable capacity. The behavior is controlled by the cpu-device-mode flag; setting it to individual will revert to the current model, where every CPU is exposed as a device in the ResourceClaim.
POC
Design Details and Rationale: