KEP-5194: DRA: ReservedFor Workloads in 1.34 #5379
mortent wants to merge 5 commits into kubernetes:master
Conversation
> @@ -0,0 +1,3 @@
> kep-number: 5194
> alpha:
>   approver: "@johnbelamaric"
Probably best to pick someone else since I will be out all but two days between now and KEP freeze
> and therefore know that it is safe to deallocate the `ResourceClaim`.
>
> This requires that the controller/user has permissions to update the status
> subresource of the `ResourceClaim`. The resourceclaim controller will also try to detect if
So, if the other controllers need to do it, then they each need to be granted update permissions to the status. Whereas for the resource claim controller to do it, it needs get/list/watch for that resource type. I think that the latter option has way less opportunity for orphaned resource claims. I would suggest we lead with that as the primary mechanism.
If the resource claim controller sees something in there and doesn't have the right permissions, it can complain in events and logs.
Realistically, this probably means giving the resourceclaim controller the ability to watch deployments, jobs, and statefulsets. And maybe a few others.
So I've updated the design here a bit.
What I think might be a possible challenge with making the resource claim controller responsible for removing the reference in the ReservedFor list is that we probably can only do this safely when the resource has been deleted (and even then I guess there can be orphaned pods). If a Deployment is scaled down to zero, there will not be any pods using the ResourceClaim, so it could potentially be deallocated. But I don't think the resource claim controller can deallocate just based on the number of replicas, since there will be a race between the deallocation and the possible creation of new pods. If the workload controller is responsible for handling this, I think it should be able to handle this safely.
Based on other feedback and thinking more about this, I have changed this to actually put the responsibility on the controllers. Two primary reasons for this:
- It lets us implement a single solution that is available to both in-tree and out-of-tree controllers.
- Controllers have more context about the workloads, so they are in a better position than the resourceclaim controller to decide when deallocation is safe.
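As a rough sketch of what "the workload controller removes its reference once deallocation is safe" could look like: the types and helper below are simplified, hypothetical stand-ins (the real consumer-reference types live in the resource.k8s.io API group), not the actual implementation.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the real resource.k8s.io types.
type ConsumerReference struct {
	APIGroup string
	Resource string
	Name     string
	UID      string
}

type ResourceClaimStatus struct {
	ReservedFor []ConsumerReference
}

// removeReference sketches what a workload controller might do once it has
// decided that no pods of its workload use the claim anymore and no new ones
// will be created: drop its own entry from ReservedFor. An empty list then
// signals to the resourceclaim controller that deallocation is safe.
func removeReference(status *ResourceClaimStatus, uid string) bool {
	for i, ref := range status.ReservedFor {
		if ref.UID == uid {
			status.ReservedFor = append(status.ReservedFor[:i], status.ReservedFor[i+1:]...)
			return true
		}
	}
	return false
}

func main() {
	status := &ResourceClaimStatus{ReservedFor: []ConsumerReference{
		{APIGroup: "apps", Resource: "deployments", Name: "my-app", UID: "uid-1"},
	}}
	removed := removeReference(status, "uid-1")
	fmt.Println(removed, len(status.ReservedFor))
}
```

The important property is that only the entity that controls pod creation makes this call, which avoids the race between deallocation and new pods being created.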
> ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
>
> Yes. Applications that were already running will continue to run and the allocated
> devices will remain so.
What happens when they finish, though? Won't the dangling reference to, say, a Job in the ReservedFor do something weird in the deallocation process?
Yeah, I was thinking that removing them was handled by the workload controller and therefore wouldn't be affected if the feature was disabled. But with the change that we want the resource claim controller responsible for removing the reference that is obviously not true. And even if we made it the responsibility of the workload controller, that functionality would probably be covered by the same feature gate.
I have updated this section.
What happens if kubelet restarts in the meantime? Won't it try to re-admit the pod (and it will fail because now it would try to validate the ReservedFor, which is not set to the pod)?
/assign @pohly

/wg device-management
> Rather than expecting the `ReservedFor` field to contain an exhaustive list of
> all pods using the ResourceClaim, we propose letting the controller managing
> a ResourceClaim specify the reference to the resource consuming the claim
Isn't it enough to specify the ResourceClaim itself to which all pods that are supposed to use the allocated device are referring to?
What is the use case for using a different object, since it may contain a different set of pods than the ones that refer to this ResourceClaim?
So the main reason for having the reference here is that we need a way to signal to the resourceclaim controller that there are no more pods consuming the claim and it can therefore be deallocated. We have discussed two ways this can happen:
- The resourceclaim controller checks whether the referenced resource exists, and if not, concludes there are no pods consuming the claim and deallocates it.
- The workload controller decides when there are no more pods consuming the resourceclaim and therefore decides it can be deallocated.
In both cases, I think it is useful that we capture information about the "owning workload" of the ResourceClaim and I think the ReservedFor field seems like a reasonable place to do it.
As for having a workload managing pods where not all of them are consuming the ResourceClaim, for that scenario I think it will be up to the workload controller to decide when it is safe to remove the reference in the ReservedFor list and therefore deallocate the claim.
+1 to Morten
I think in theory we could consider ownerRef for that, but given we may want to create this ResourceClaim before workload, there would be a chance of races here. So I think that having this explicit pointer is reasonable.
> The `ReservedFor` list already accepts generic resource references, so this
> field doesn't need to be changed. However, we are proposing adding two new
> fields to the `ResourceClaim` type:
> * `spec.ReservedFor` which allows the creator of a `ResourceClaim` to specify in
If we assume that the ResourceClaim itself could be specified (and so all pods referring to it), why not treat it as a default way of reserving resources (after making the feature flag gated)?
Is there any reason why the new approach cannot replace the existing one?
So I think there are two ways we use the ReservedFor list today:
- To determine when the ResourceClaim can be deallocated.
- To find all pods consuming (or referencing) the ResourceClaim.
We can deallocate when there are no more pods consuming the ResourceClaim. Without the ReservedFor list of pods, the only entity that can determine this is the controller responsible for managing creation/deletion of the pods, i.e. the workload controller. And it "communicates" with the resourceclaim controller by removing the reference in the ReservedFor list.
Finding the pods referencing the ResourceClaim is not sufficient for deallocation, because we have a race between the check and new pods being created. But for the situations where we just need the pods referencing the claim, that is pretty easy to find. And the proposal here is that we will do that in the device_taint_eviction controller.
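The difference between the two uses could be sketched as below; all types and helpers are hypothetical simplifications, and the key point is the comment on the race between listing pods and deallocating.

```go
package main

import "fmt"

// Hypothetical, simplified types for illustration only.
type PodSpecClaim struct{ ResourceClaimName string }
type Pod struct {
	Name   string
	Claims []PodSpecClaim
}

// podsReferencingClaim finds pods that reference a claim by name. This is
// sufficient for use cases like device taint eviction, but NOT for
// deallocation: between this check and the deallocation, the scheduler may
// admit a new pod that references the same claim.
func podsReferencingClaim(pods []Pod, claimName string) []Pod {
	var out []Pod
	for _, p := range pods {
		for _, c := range p.Claims {
			if c.ResourceClaimName == claimName {
				out = append(out, p)
				break
			}
		}
	}
	return out
}

// canDeallocate sketches the race-free signal instead: deallocate only once
// the responsible workload controller has emptied the ReservedFor list.
func canDeallocate(reservedFor []string) bool { return len(reservedFor) == 0 }

func main() {
	pods := []Pod{
		{Name: "a", Claims: []PodSpecClaim{{ResourceClaimName: "claim-1"}}},
		{Name: "b"},
	}
	fmt.Println(len(podsReferencingClaim(pods, "claim-1")), canDeallocate(nil))
}
```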
> the spec which resource is the consumer of the `ResourceClaim`. When the first pod
> referencing the `ResourceClaim` is scheduled, the reference will be copied into
> the `status.ReservedFor` list.
> * `status.allocation.ReservedForAnyPod` which will be set to `true` by the DRA
The name ReservedForAnyPod may be a bit misleading and inaccurate. The claim would be reserved for any pod that is referencing this claim (or another object specified in spec.ReservedFor).
Actually the name ReservedFor itself is misleading, as we should rather say that the ResourceClaim is AllocatedFor all pods that are referencing it. We probably should also say that the ResourceClaim should be deallocated when none of the pods referencing it is scheduled (assumed in the scheduler cache).
So I've tried to make the difference between a pod referencing a claim and a pod consuming a claim. The pod references in the list give us the latter, since they are set by the scheduler. But without the explicit list of pods, we have to find another way to handle deallocation. As mentioned in another comment, I don't think there is a safe way to do deallocation by listing pods referencing the claim.
Agree that the naming isn't necessarily perfect, but I think the new names mostly make sense assuming we keep the existing names. But definitely open to changing them.
> deleted or finish running. An empty list means there are no current consumers of the claim
> and it can be deallocated.
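The two proposed fields could look roughly like the Go sketch below. The field names follow the KEP text; the surrounding struct shapes and types are assumptions for illustration, loosely modeled on the existing resource.k8s.io types.

```go
package main

import "fmt"

// Hypothetical simplified shape of the existing consumer reference.
type ResourceClaimConsumerReference struct {
	APIGroup string
	Resource string
	Name     string
}

type ResourceClaimSpec struct {
	// ReservedFor lets the creator declare the consuming resource up front;
	// per the KEP, the reference is copied into status.ReservedFor when the
	// first pod referencing the claim is scheduled.
	ReservedFor *ResourceClaimConsumerReference
}

type AllocationResult struct {
	// ReservedForAnyPod indicates that any pod referencing the claim may
	// use it, rather than requiring an exhaustive per-pod list.
	ReservedForAnyPod bool
}

func main() {
	spec := ResourceClaimSpec{ReservedFor: &ResourceClaimConsumerReference{
		APIGroup: "apps", Resource: "deployments", Name: "inference-server",
	}}
	alloc := AllocationResult{ReservedForAnyPod: true}
	fmt.Println(spec.ReservedFor.Name, alloc.ReservedForAnyPod)
}
```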
> #### Finding pods using a ResourceClaim
ReservedFor is also used to avoid races between different schedulers scheduling pods that share the same ResourceClaim (see https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/4381-dra-structured-parameters/README.md)
It may be hard to replace this use case since currently there is no good way of coordinating claim allocation between different schedulers. We are working on KEP #5287, which allows setting NominatedNodeName after the scheduler decides on a pod placement (the pod is assumed, meaning it's scheduled but not bound yet), but the way it's designed now, this field also conveys more information than just whether a pod is assumed, so unfortunately this bit of information cannot be used directly.
We also work on an ultimate way of reserving resources that could address the problem of races between schedulers, but there are no details yet to share.
Do you have an example of how this could become a problem with multiple schedulers? I'm not familiar with that situation.
> ##### Deallocation
>
> The resourceclaim controller will remove Pod references from the `ReservedFor` list just
> like it does now using the same logic. For non-Pod references, the controller will recognize
> a small number of built-in types, starting with `Deployment`, `StatefulSet` and `Job`, and will
I think this is somewhat misleading. I believe we should either make it work for an arbitrary owner or not at all.
Ignoring out-of-tree resources is not friendly to out-of-tree controllers, and additionally may lead to situations where people assume that it works (based on experience with in-tree types) and end up leaking resources...
I would personally suggest requiring a controller to unset it for Alpha and add a beta graduation to revisit that decision.
Yeah, I agree. I also think workload controllers are in a better position to make decisions on when deallocation is safe, so there are several reasons for moving this responsibility to the workload controllers.
> in the `spec.ReservedFor` list. As a result, the workload will get scheduled, but
> it will be subject to the 256 limit on the size of the `ReservedFor` list and the
> controller creating the `ResourceClaim` will not find the reference it expects
> in the `ReservedFor` list when it tries to remove it.
To clarify: if there is already a reference to a pod in ReservedFor, then for newly scheduled pods the scheduler will continue adding those, despite spec.ReservedFor being set to something else, right?
[If so, it would be useful to add that explicitly here too]
Yeah, looking at this again I don't think there are any good ways to completely avoid situations where there might be both pod and non-pod references in the ReservedFor list. So I've called out here that this is something the controllers need to be able to handle.
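One way a controller could cope with such a mixed list is sketched below; the types are simplified stand-ins, and the point is the split: Pod entries keep being handled as today, while non-Pod entries are left to the owning workload controllers.

```go
package main

import "fmt"

// Illustrative, simplified stand-in for the real consumer reference type.
type ConsumerReference struct {
	Resource string // e.g. "pods", "deployments"
	Name     string
}

// splitReferences separates Pod references from non-Pod references so that
// each can be processed by the component responsible for it.
func splitReferences(refs []ConsumerReference) (pods, others []ConsumerReference) {
	for _, r := range refs {
		if r.Resource == "pods" {
			pods = append(pods, r)
		} else {
			others = append(others, r)
		}
	}
	return pods, others
}

func main() {
	refs := []ConsumerReference{
		{Resource: "pods", Name: "web-0"},
		{Resource: "deployments", Name: "web"},
	}
	pods, others := splitReferences(refs)
	// The resourceclaim controller keeps handling the pod entries as today,
	// while non-pod entries are left for the owning workload controller.
	fmt.Println(len(pods), len(others))
}
```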
We haven't made much progress on this one over the last couple of weeks, so it will not make it for 1.35.
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
/remove-lifecycle rotten
Workload is targeted to 1.35 alpha. In order to refer to Workloads in

Question:
That's too early. The pods might still be running, using the devices that were allocated for the claim.
This is better, but not 100% reliable: a "sufficiently determined" user could delete the PodGroup while pods are running. In that case there's no place for a controller to post that all pods have terminated. Extra protections like a finalizer would be necessary to discourage such undesirable behavior.
Same problem.
When it is certain that the pod is in a final terminated state (resource removed, deleted and not scheduled, or kubelet has confirmed termination): https://github.com/kubernetes/kubernetes/blob/9998041e0ffe0dd3f2abab3b9f95505c4402bf14/pkg/controller/resourceclaim/controller.go#L979-L984 Users can also screw up the resource tracking here by force-deleting a pod. They shouldn't.
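A simplified sketch of that "pod is done" check, loosely modeled on the linked resourceclaim controller code; the fields here are stand-ins, and the real logic handles more edge cases (including the force-deletion caveat above).

```go
package main

import "fmt"

// Hypothetical, simplified view of the pod state relevant to the check.
type Pod struct {
	Deleted  bool   // object removed or carries a deletion timestamp
	NodeName string // non-empty once the pod has been scheduled
	Phase    string // "Succeeded", "Failed", ...
}

// podIsDone reports whether it is safe to treat the pod as a final consumer
// that no longer uses its allocated devices.
func podIsDone(p Pod) bool {
	switch {
	case p.Phase == "Succeeded" || p.Phase == "Failed":
		return true // kubelet has confirmed termination
	case p.Deleted && p.NodeName == "":
		return true // deleted and never scheduled, so it cannot be running
	default:
		return false // may still be running; keep the reservation
	}
}

func main() {
	fmt.Println(podIsDone(Pod{Phase: "Succeeded"}))
	fmt.Println(podIsDone(Pod{Deleted: true, NodeName: "node-1"}))
}
```

Note that a deleted pod that was already scheduled is deliberately not treated as done until a terminal phase is observed, which mirrors the concern above about running pods.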
A new feature and direct promotion to beta, or a new feature in alpha with promotion one release later?

Workload API status will be a new alpha feature in v1.36, planned to be promoted to beta in v1.37.
Based on @pohly's comments, it seems like there are two categories of solutions:
The benefits of Approach 1 are that:
The benefits of Approach 2 are that:
Not clear to me which is the best option.
Beware that there is no "DRA manager" in the control plane (kubelet has one). Managing allocations is split across kube-scheduler (allocates) and kube-controller-manager (the "resourceclaim" controller deallocates). That's just an aside; kube-controller-manager, and thus the controller, has all of the necessary information.

But I don't like approach 1 because it implies that the "resourceclaim" controller has to make assumptions about how the workload controller manages pods: more specifically, it has to assume that once one pod terminates or gets deleted, no new pod will get started. If the workload controller restarts pods, then there's a race. I therefore prefer approach 2. Let one entity manage pods and the declaration that "all pods are gone for good".
/assign
So from what I can tell from different discussions, it looks likely that there will be a `PodGroup` resource. This seems to imply that rather than adding references to a `Workload`, we should target the `PodGroup` objects.

From the discussion above, this seems to imply the need for a finalizer on the PodGroup objects to make sure they can't be deleted while there are still pods running that belong to the group. Does this seem like a reasonable way to handle this based on what is happening on the other KEPs? If so, I can take a pass at updating this KEP. Right now it seems to me like this handles a different issue than #5736, although they are closely related. @nojnhuh do you agree with this?
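The finalizer idea could be sketched roughly as below; the finalizer name and helpers are hypothetical, not an existing Kubernetes finalizer, and a real controller would patch the PodGroup object through the API server.

```go
package main

import "fmt"

// Hypothetical finalizer a controller could add to PodGroup objects and only
// remove once no pods belonging to the group remain.
const claimProtectionFinalizer = "resource.k8s.io/reservedfor-protection" // illustrative name

// addFinalizer returns the finalizer list with f added, idempotently.
func addFinalizer(finalizers []string, f string) []string {
	for _, existing := range finalizers {
		if existing == f {
			return finalizers
		}
	}
	return append(finalizers, f)
}

// removeFinalizer returns the finalizer list with f removed, allowing the
// PodGroup deletion to proceed once cleanup is complete.
func removeFinalizer(finalizers []string, f string) []string {
	out := finalizers[:0:0]
	for _, existing := range finalizers {
		if existing != f {
			out = append(out, existing)
		}
	}
	return out
}

func main() {
	fs := addFinalizer(nil, claimProtectionFinalizer)
	fs = addFinalizer(fs, claimProtectionFinalizer) // idempotent
	fmt.Println(len(fs))
	fs = removeFinalizer(fs, claimProtectionFinalizer)
	fmt.Println(len(fs))
}
```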
Yes, overall I expect we will be targeting the new PodGroup objects instead of Workloads directly.
I don't think it's possible to guarantee at any point that no new Pods will be created. The brute-force workaround would be to consider the claim should be allocated as long as the PodGroup object exists, no matter the state of its Pods. That's obviously not ideal if the PodGroup sits at 0 replicas or keeps around a bunch of Completed/Failed Pods for a long time. Another approach would be like what I've drafted in #5736 to deallocate when the PodGroup is done "for now" i.e. when all of its Pods are currently terminated or gone, then reallocating when new Pods needing the claim are created. I suppose
Yes, I think this KEP and #5736 are targeting different problems. #5736 should be able to address sharing resources between Pods in a PodGroup while still listing individual Pods in `ReservedFor`.
cc @helayoty
Ok, couple things:
So, it seems to me the scope of this KEP is:
Assuming the ResourceClaimController does the de-allocation, then listing only Pods in the ResourceClaim will cause the de-allocation to happen once the last Pod is gone, unless the controller also looks at the PodGroup objects referenced by the Pods. So, it's not clear to me why we would bother implementing the flow of "resource claims that follow the lifecycle of the podgroup, but put the pod reference into the reservedfor instead of podgroup". What are we saving by implementing that instead of just going all the way with this KEP at the same time?
Workload-aware scheduling improves ResourceClaim scheduling when both the PodGroup and the ResourceClaim are aligned. Technically it should be possible to have two or more gangs using the same ResourceClaim. In that case two PodGroups would be listed in ReservedFor, similarly to how Pods are listed now. Sure, they won't be gang scheduled (as long as there is no parent gang), but I think that implementing just this part may make sense, especially since it's not clear whether #5736 can make 1.36.
We save having to do both at the same time, even though they can be implemented independently.
So that approach works when we are listing individual pods in the `ReservedFor` list. I don't think we have a way to check if all pods are terminated or gone that doesn't race with the scheduler scheduling new pods that might reference the same ResourceClaim. It could cause situations where a running pod is referencing a claim that doesn't have allocated devices. I think deallocating only when the
I agree that it seems like the real workload controllers would have to give the signal when all pods have completed or are terminating AND no new pods will be created.
The main feature of this KEP is to find a way around the 256 limit on the number of pods that can share a ResourceClaim. While probably not a hard requirement for #5736, its usefulness will be somewhat limited with the constraint on the number of pods in a PodGroup.
Yes, the idea is that the deletion of the PodGroup (or some lifecycle state represented in a spec field) is an explicit indicator from the true workload controller that it is safe to de-allocate the claim. Anything else carries risk (though maybe we just live with that risk and some workaround for now).
Ref the discussion on Slack, the indicator is that the PodGroup has no pods using the ResourceClaim and will not create new pods that do. The consequence of this is that the reference to the PodGroup in the ReservedFor list can be removed. It does not mean that it is safe to deallocate the claim, since there might be other PodGroups or Pods using the claim.
One-line PR description: DRA: ReservedFor Workloads as alpha in 1.34
Issue link: DRA: ReservedFor Workloads #5194
Other comments: first revision