
KEP-5194: DRA: ReservedFor Workloads in 1.34 #5379

Closed

mortent wants to merge 5 commits into kubernetes:master from mortent:ReservedFor

Conversation

@mortent (Member) commented Jun 5, 2025

@k8s-ci-robot added the cncf-cla: yes label Jun 5, 2025
@k8s-ci-robot added the kind/kep and sig/scheduling labels Jun 5, 2025
@github-project-automation bot moved this to Needs Triage in SIG Scheduling Jun 5, 2025
@k8s-ci-robot added the size/XL label Jun 5, 2025
@@ -0,0 +1,3 @@
kep-number: 5194
alpha:
approver: "@johnbelamaric"
Member:

Probably best to pick someone else since I will be out all but two days between now and KEP freeze

Member (Author):

Changed this to @wojtek-t. @wojtek-t are you ok with that?

Member:

Yes - I can take it

and therefore know that it is safe to deallocate the `ResourceClaim`.

This requires that the controller/user has permissions to update the status
subresource of the `ResourceClaim`. The resourceclaim controller will also try to detect if
Member:

So, if the other controllers need to do it, then they each need to be granted update permissions to the status. Whereas for the resource claim controller to do it, it needs get/list/watch for that resource type. I think that the latter option has way less opportunity for orphaned resource claims. I would suggest we lead with that as the primary mechanism.

If the resource claim controller sees something in there and doesn't have the right permissions, it can complain in events and logs.

Realistically, this probably means giving the resourceclaim controller the ability to watch deployments, jobs, and statefulsets. And maybe a few others.
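For illustration, here's a minimal sketch (not from the KEP) of the extra read-only RBAC this option implies for the resourceclaim controller; the role name is hypothetical, while the group/resource names are the standard ones for the workload types mentioned above:

```go
package sketch

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// workloadWatchRole sketches the access the resourceclaim controller would
// need in order to watch the workload types referenced in ReservedFor.
func workloadWatchRole() *rbacv1.ClusterRole {
	return &rbacv1.ClusterRole{
		// The role name is hypothetical, not an existing bootstrap role.
		ObjectMeta: metav1.ObjectMeta{Name: "system:resourceclaim-workload-reader"},
		Rules: []rbacv1.PolicyRule{
			{
				APIGroups: []string{"apps"},
				Resources: []string{"deployments", "statefulsets"},
				Verbs:     []string{"get", "list", "watch"},
			},
			{
				APIGroups: []string{"batch"},
				Resources: []string{"jobs"},
				Verbs:     []string{"get", "list", "watch"},
			},
		},
	}
}
```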

Member (Author):

So I've updated the design here a bit.

What I think might be a possible challenge with making the resource claim controller responsible for removing the reference in the ReservedFor list is that we probably can only do this safely once the resource has been deleted (and even then I guess there can be orphaned pods). If a Deployment is scaled down to zero, there will not be any pods using the ResourceClaim, so it could potentially be deallocated. But I don't think the resource claim controller can deallocate based just on the number of replicas, since there would be a race between the deallocation and the possible creation of new pods. If the workload controller is responsible instead, I think it can handle this safely.

Member (Author):

Based on other feedback and thinking more about this, I have changed this to actually put the responsibility on the controllers. Two primary reasons for this:

  • It lets us implement a single solution that is available to both in-tree and out-of-tree controllers.
  • Controllers have more context about the workloads, so they are in a better position than the resourceclaim controller to decide when deallocation is safe.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Applications that were already running will continue to run and the allocated
devices will remain so.
Member:

What happens when they finish, though? Won't the dangling reference to, say, a Job in the ReservedFor do something weird in the deallocation process?

Member (Author):

Yeah, I was thinking that removing them was handled by the workload controller and therefore wouldn't be affected if the feature was disabled. But with the change making the resource claim controller responsible for removing the reference, that is obviously not true. And even if we made it the responsibility of the workload controller, that functionality would probably be covered by the same feature gate.

I have updated this section.

Member:

What happens if kubelet restarts in the meantime? Won't it try to re-admit the pod (and won't that fail, because it would now try to validate the ReservedFor, which is not set to the pod)?

@mortent commented Jun 9, 2025

/assign @pohly

@mortent commented Jun 9, 2025

/wg device-management

@k8s-ci-robot added the wg/device-management label Jun 9, 2025
@pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jun 10, 2025
@wojtek-t self-assigned this Jun 10, 2025

Rather than expecting the `ReservedFor` field to contain an exhaustive list of
all pods using the ResourceClaim, we propose letting the controller managing
a ResourceClaim specify the reference to the resource consuming the claim
Member:

Isn't it enough to specify the ResourceClaim itself, to which all pods that are supposed to use the allocated device refer?

What is the use case for referencing a different object, given that it may cover a different set of pods than the ones that refer to this ResourceClaim?

Member (Author):

So the main reason for having the reference here is that we need a way to signal to the resourceclaim controller that there are no more pods consuming the claim and it can therefore be deallocated. We have discussed two ways this can happen:

  • The resourceclaim controller checks whether the referenced resource exists and, if not, concludes there are no pods consuming the claim and deallocates it.
  • The workload controller decides when there are no more pods consuming the resourceclaim and therefore decides it can be deallocated.

In both cases, I think it is useful that we capture information about the "owning workload" of the ResourceClaim and I think the ReservedFor field seems like a reasonable place to do it.

As for having a workload managing pods where not all of them are consuming the ResourceClaim, for that scenario I think it will be up to the workload controller to decide when it is safe to remove the reference in the ReservedFor list and therefore deallocate the claim.

Member:

+1 to Morten

I think in theory we could consider ownerRef for that, but given we may want to create this ResourceClaim before workload, there would be a chance of races here. So I think that having this explicit pointer is reasonable.
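To make the "explicit pointer" concrete, here is a sketch of a non-Pod ReservedFor entry using the existing consumer-reference type; the Deployment name and UID below are made up for illustration:

```go
package sketch

import (
	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

// workloadReference builds a ReservedFor entry that points at a workload
// instead of an individual Pod. The existing consumer-reference type already
// supports this; only the values below are hypothetical.
func workloadReference(uid types.UID) resourceapi.ResourceClaimConsumerReference {
	return resourceapi.ResourceClaimConsumerReference{
		APIGroup: "apps",             // "" would mean the core API group
		Resource: "deployments",      // lowercase plural resource name
		Name:     "inference-server", // hypothetical workload name
		UID:      uid,
	}
}
```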

The `ReservedFor` list already accepts generic resource references, so this
field doesn't need to be changed. However, we are proposing adding two new
fields to the `ResourceClaim` type:
* `spec.ReservedFor` which allows the creator of a `ResourceClaim` to specify in
Member:

If we assume that the ResourceClaim itself could be specified (and thus all pods referring to it), why not treat it as the default way of reserving resources (after making the feature flag-gated)?

Is there any reason why the new approach cannot replace the existing one?

Member (Author):

So I think there are two ways we use the ReservedFor list today:

  • To determine when the ResourceClaim can be deallocated.
  • To find all pods consuming (or referencing) the ResourceClaim.

We can deallocate when there are no more pods consuming the ResourceClaim. Without the ReservedFor list of pods, the only entity that can determine this is the controller responsible for managing creation/deletion of the pods, i.e. the workload controller. And it "communicates" with the resourceclaim controller by removing the reference in the ReservedFor list.

Finding the pods referencing the ResourceClaim is not sufficient for deallocation, because we have a race between the check and new pods being created. But for the situations where we just need the pods referencing the claim, that is pretty easy to find. And the proposal here is that we will do that in the device_taint_eviction controller.
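As a rough sketch of that second use (this is not the device_taint_eviction controller's actual code), finding the pods that reference a claim is a simple scan of each pod's spec:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// podsReferencingClaim returns the pods whose spec directly names the given
// ResourceClaim. Illustrative only: claims created from a
// ResourceClaimTemplate would additionally require a lookup through
// pod.Status.ResourceClaimStatuses, which is omitted here.
func podsReferencingClaim(pods []*corev1.Pod, claimName string) []*corev1.Pod {
	var out []*corev1.Pod
	for _, pod := range pods {
		for _, prc := range pod.Spec.ResourceClaims {
			if prc.ResourceClaimName != nil && *prc.ResourceClaimName == claimName {
				out = append(out, pod)
				break
			}
		}
	}
	return out
}
```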

the spec which resource is the consumer of the `ResourceClaim`. When the first pod
referencing the `ResourceClaim` is scheduled, the reference will be copied into
the `status.ReservedFor` list.
* `status.allocation.ReservedForAnyPod` which will be set to `true` by the DRA
Member:

The name ReservedForAnyPod may be a bit misleading and inaccurate. The claim would be reserved for any pod that is referencing this claim (or the other object specified in spec.ReservedFor).

Actually the name ReservedFor itself is misleading, as we should rather say that the ResourceClaim is AllocatedFor all pods that are referencing it. We should probably also say that the ResourceClaim should be deallocated when none of the pods referencing it is scheduled (assumed in the scheduler cache).

Member (Author):

So I've tried to distinguish between a pod referencing a claim and a pod consuming a claim. The pod references in the list give us the latter, since they are set by the scheduler. But without the explicit list of pods, we have to find another way to handle deallocation. As mentioned in another comment, I don't think there is a safe way to do deallocation by listing the pods referencing the claim.

Agree that the naming isn't necessarily perfect, but I think the new names mostly make sense assuming we keep the existing names. But definitely open to changing them.
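For readers following along, here is a sketch of what the two proposed fields from the excerpt could look like on the Go types; the field names come from the KEP text, while the types, JSON tags, and placement are assumptions, not final API:

```go
package sketch

// ResourceClaimConsumerReference stands in for the existing API type
// (fields elided here).
type ResourceClaimConsumerReference struct{ /* ... */ }

// ResourceClaimSpec sketch: only the proposed field is shown.
type ResourceClaimSpec struct {
	// ReservedFor (proposed) names the resource that consumes this claim.
	// When the first pod referencing the claim is scheduled, this
	// reference is copied into status.ReservedFor instead of the pod.
	ReservedFor *ResourceClaimConsumerReference `json:"reservedFor,omitempty"`
}

// AllocationResult sketch: only the proposed field is shown.
type AllocationResult struct {
	// ReservedForAnyPod (proposed) marks allocations where any pod that
	// references the claim may consume it, rather than enumerating each
	// pod in status.ReservedFor.
	ReservedForAnyPod bool `json:"reservedForAnyPod,omitempty"`
}
```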

deleted or finish running. An empty list means there are no current consumers of the claim
and it can be deallocated.

#### Finding pods using a ResourceClaim
Member:

ReservedFor is also used to avoid races between different schedulers scheduling pods that share the same ResourceClaim (see https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/4381-dra-structured-parameters/README.md)

It may be hard to replace this use case, since currently there is no good way of coordinating claim allocation between different schedulers. We are working on KEP #5287, which allows setting NominatedNodeName after the scheduler decides on a pod placement (the pod is assumed, meaning it's scheduled but not bound yet), but as designed now this field conveys more information than just whether a pod is assumed, so unfortunately this bit of information cannot be used directly.

@wojtek-t @sanposhiho @macsko

Member:

We are also working on an ultimate way of reserving resources that could address the problem of races between schedulers, but there are no details to share yet.

Member (Author):

Do you have an example of how this could become a problem with multiple schedulers? I'm not familiar with that situation.

@mortent requested a review from dom4ha June 10, 2025

##### Deallocation
The resourceclaim controller will remove Pod references from the `ReservedFor` list just
like it does now using the same logic. For non-Pod references, the controller will recognize
a small number of built-in types, starting with `Deployment`, `StatefulSet` and `Job`, and will
Member:

I think this is somewhat misleading. I believe we should either make it work for an arbitrary owner or not at all.
Ignoring out-of-tree resources makes us unfriendly to out-of-tree controllers, and may additionally lead to situations where people assume it works (based on experience with in-tree types) and end up leaking resources...

I would personally suggest requiring a controller to unset it for Alpha and adding a beta graduation criterion to revisit that decision.

Member (Author):

Yeah, I agree. I also think workload controllers are in a better position to decide when deallocation is safe, so there are several reasons for moving this responsibility to the workload controllers.

in the `spec.ReservedFor` list. As a result, the workload will get scheduled, but
it will be subject to the 256 limit on the size of the `ReservedFor` list and the
controller creating the `ResourceClaim` will not find the reference it expects
in the `ReservedFor` list when it tries to remove it.
Member:

To clarify: if there is already a reference to a pod in ReservedFor, then for newly scheduled pods the scheduler will continue adding those, despite spec.ReservedFor being set to something else, right?

[If so, it would be useful to add that explicitly here too]

Member (Author):

Yeah, looking at this again I don't think there are any good ways to completely avoid situations where there might be both pod and non-pod references in the ReservedFor list. So I've called out here that this is something the controllers need to be able to handle.

@alimaazamat:
latest-milestone should be set to v1.35, and the milestone updated so that alpha is set to v1.35. Otherwise we cannot set this KEP to be Tracked for PRR Freeze by tomorrow's deadline!
@pohly @wojtek-t @mortent

@mortent commented Oct 8, 2025

We haven't made much progress on this one over the last couple of weeks, so it will not make it for 1.35.

@k8s-triage-robot:

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Nov 7, 2025
@johnbelamaric (Member):

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Nov 7, 2025
@erictune commented Dec 1, 2025

Workload is targeted to 1.35 alpha. In order to refer to Workloads in ReservedFor lists, we need to decide a few things:

  1. Decide when and how to add a Workload to the ReservedFor list
    • When: When device is first allocated to any pod of the workload
    • How: should be easy because ResourceClaimConsumerReference can refer to any whole object
  2. Decide when and how to add a PodGroup Replica to the ReservedFor list
    • When: When the device is first allocated to any pod of this PodGroup Replica
    • How: need to add a field to ResourceClaimConsumerReference to hold a podGroupReplicaKey; a generic "path" is not going to work (see the sketch after this comment).
  3. Decide when and how to remove a Workload from a ReservedFor list
    • Immediately when a Workload is deleted.
    • When the last PodGroup Replica of a workload is "terminated". What controller will track this? It should perhaps come from the Status of the Workload (which is being worked on for the 1.36 beta of the Workload API).
  4. Decide when and how to remove a PodGroup Replica from a ReservedFor list
    • Immediately if the Workload is deleted.
    • When the last Pod of a PodGroup Replica is "terminated". Or when the group falls below minCount? What controller will track this? It should perhaps come from the Status of the Workload (which is being worked on for the 1.36 beta of the Workload API).

Question:
At what phase or condition of the Pod lifecycle is a pod removed from ReservedFor? Knowing this may help determine how to aggregate the status of a collection of pods (Workload or PodGroup Replica).
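A sketch of what item 2's extension could look like; only the podGroupReplicaKey idea comes from the comment above, everything else is guesswork:

```go
package sketch

import "k8s.io/apimachinery/pkg/types"

// ResourceClaimConsumerReference, extended as item 2 suggests. The first
// four fields mirror the existing API; PodGroupReplicaKey is hypothetical.
type ResourceClaimConsumerReference struct {
	APIGroup string    `json:"apiGroup,omitempty"`
	Resource string    `json:"resource"`
	Name     string    `json:"name"`
	UID      types.UID `json:"uid"`

	// PodGroupReplicaKey (hypothetical) identifies which replica of the
	// referenced PodGroup this reservation belongs to.
	PodGroupReplicaKey string `json:"podGroupReplicaKey,omitempty"`
}
```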

@pohly commented Dec 2, 2025

Decide when and how to remove a Workload from a ReservedFor list

  • Immediately when a Workload is deleted.

That's too early. The pods might still be running, using the devices that were allocated for the claim.

  • When the last PodGroup Replica of a workload is "terminated". What controller will track this? It maybe should come from the Status of the Workload (which is being worked on for 1.36 beta of Workload API).

This is better, but not 100% reliable: a "sufficiently determined" user could delete the PodGroup while pods are running. In that case there's no place for a controller to post that all pods have terminated. Extra protections like a finalizer would be necessary to discourage such undesirable behavior.

Decide when and how to remove a PodGroup Replica from a ReservedFor list

Same problem.

At what phase or condition of Pod lifecycle is the pod removed from ReservedFor?

When it is certain that the pod is in a final terminated state (resource removed, deleted and not scheduled, or kubelet has confirmed termination): https://github.com/kubernetes/kubernetes/blob/9998041e0ffe0dd3f2abab3b9f95505c4402bf14/pkg/controller/resourceclaim/controller.go#L979-L984

Users can also screw up the resource tracking here by force-deleting a pod. They shouldn't.
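Paraphrased as a sketch (the authoritative logic is in the controller.go lines linked above and may differ in detail):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// podIsDone paraphrases the conditions described above for removing a pod
// from ReservedFor. The "kubelet has confirmed termination" case requires
// inspecting container statuses and is elided from this sketch.
func podIsDone(pod *corev1.Pod) bool {
	switch {
	case pod.Status.Phase == corev1.PodSucceeded, pod.Status.Phase == corev1.PodFailed:
		return true // reached a terminal phase
	case pod.DeletionTimestamp != nil && pod.Spec.NodeName == "":
		return true // deleted and never scheduled
	default:
		return false
	}
}
```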

@pohly commented Dec 2, 2025

(which is being worked on for 1.36 beta of Workload API)

A new feature and direct promotion to beta, or a new feature in alpha with promotion one release later?

@macsko commented Dec 2, 2025

(which is being worked on for 1.36 beta of Workload API)

A new feature and direct promotion to beta, or a new feature in alpha with promotion one release later?

The Workload API status will be a new alpha feature in v1.36, planned for promotion to beta in v1.37.

@erictune commented Dec 2, 2025

Based on @pohly's comments, it seems like there are two categories of solutions:

  • Approach 1 -- The DRA manager tracks the number of not-isDone pods for every Workload and PodGroup replica that is mentioned in any ReservedFor. When it sees a pod transition to isDone, it updates this count. When a count goes to zero, it removes the object reference to the Workload or PodGroup from ReservedFor.
  • Approach 2 -- A Finalizer on a Workload is used so that Workloads cannot easily be deleted while they have running pods. Workloads and PodGroup Replicas go through terminal states analogous to the pod's isDone state, which are observable by watching the Workload's status. The DRA manager only watches Workloads and removes them from any ReservedFor when it sees them reach a terminal state.

The benefits of Approach 1 are that:

  • The DRA manager is likely already tracking most of the information it needs to do this, since it watches pods, and kube-scheduler already watches Workloads.
  • The DRA manager may need to run a node-local cleanup action after every pod termination anyway, so it already needs to track every pod reaching a terminal state.
  • A Workload object can be deleted and replaced by a different one with the same name and a different UID, while pods with long terminations continue terminating in the background. This is helpful to clients that create Workloads, but it also means that devices can have object references to non-existent resources.

The benefits of Approach 2 are that:

  • Other clients might benefit from knowing whether a workload is terminating/terminated.
    • However, other clients might have different definitions of workload state. For example, in KEP-5682: PDBs for multi-pod replicas, they need to know how many running pods are in a PodGroup, which is different from knowing how many are isDone.
  • It matches a reasonable user expectation that objects referenced in ReservedFor actually exist.
  • (A problem, not a benefit:) How do we ensure terminated podGroupReplicas are seen in status before the mention of the podGroupReplica is eventually removed from status?

Not clear to me which is the best option.
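A minimal sketch of Approach 1's bookkeeping, under the assumption that some controller already observes every pod transition; all names here are hypothetical:

```go
package sketch

import "k8s.io/apimachinery/pkg/types"

// reservationTracker counts not-isDone pods per consumer (Workload or
// PodGroup replica) named in any ReservedFor list.
type reservationTracker struct {
	livePods map[types.UID]int // consumer UID -> number of not-isDone pods
}

// onPodDone records that a pod belonging to the given consumer reached
// isDone. A true result means the count hit zero and the consumer's
// ReservedFor entry can be removed (Approach 1's trigger).
func (t *reservationTracker) onPodDone(consumer types.UID) bool {
	if n := t.livePods[consumer]; n > 1 {
		t.livePods[consumer] = n - 1
		return false
	}
	delete(t.livePods, consumer)
	return true
}
```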

@pohly commented Dec 3, 2025

Beware that there is no "DRA manager" in the control plane (the kubelet has one). Managing allocations is split across kube-scheduler (allocates) and kube-controller-manager (the "resourceclaim" controller deallocates). That's just an aside; kube-controller-manager, and thus the controller, has all of the necessary information.

But I don't like approach 1 because it implies that the "resourceclaim" controller has to make assumptions about how the workload controller manages pods: more specifically, it has to assume that once one pod terminates or gets deleted, no new pod will get started. If the workload controller restarts pods, then there's a race.

I therefore prefer approach 2. Let one entity manage pods and the declaration that "all pods are gone for good".

@helayoty (Member):

/assign

@helayoty moved this from Needs Review to Backlog in SIG Scheduling Jan 21, 2026
@mortent commented Jan 26, 2026

So from what I can tell from different discussions, it looks likely that there will be a PodGroup API added in 1.36: #5833. And from #5736, we are planning to make it possible to associate ResourceClaims with PodGroups rather than with Workloads.

This seems to imply that rather than adding references to a Workload object in the status.reservedFor list, we will be adding references to PodGroup objects. I think this means that this will work as follows:

  • The reference to the PodGroup is added when the first pod in the group is getting scheduled. The referenced PodGroup must always exist for a Pod during scheduling, otherwise the scheduler will not handle it.
  • There will be some way to determine isDone on a PodGroup object, meaning that there are no currently running Pods in the group and no new ones will be created. This allows someone, I would assume either the PodGroup controller or the resourceclaim controller, to know when it is safe to remove the reference from the status.reservedFor list.

From the discussion above, this seems to imply the need for a finalizer on the PodGroup objects, to make sure they can't be deleted while there are still pods running that belong to the group.

Does this seem like a reasonable way to handle this based on what is happening on the other KEPs? If so, I can take a pass at updating this KEP.

Right now it seems to me like this handles a different issue than #5736, although they are closely related. @nojnhuh do you agree with this?

@nojnhuh commented Jan 26, 2026

Yes, overall I expect we will be targeting the new PodGroup objects instead of Workloads directly.

  • There will be some way to determine isDone on a PodGroup object, meaning that there are no currently running Pods in the group and no new ones will be created.

I don't think it's possible to guarantee at any point that no new Pods will be created. The brute-force workaround would be to consider that the claim should stay allocated as long as the PodGroup object exists, no matter the state of its Pods. That's obviously not ideal if the PodGroup sits at 0 replicas or keeps around a bunch of Completed/Failed Pods for a long time. Another approach would be like what I've drafted in #5736: deallocate when the PodGroup is done "for now", i.e. when all of its Pods are currently terminated or gone, then reallocate when new Pods needing the claim are created.

I suppose isDone could map to some spec field on PodGroups which workload controllers like Job or JobSet can control to say whether or not ResourceClaims owned by the PodGroup should remain allocated. I'd expect changes like that to be proposed in this KEP vs. #5833. I'll be updating #5736 similarly to propose changes to the PodGroup API on top of #5833.

Right now it seems to me like this handles a different issue than #5736, although they are closely related. @nojnhuh do you agree with this?

Yes, I think this KEP and #5736 are targeting different problems. #5736 should be able to address sharing resources between Pods in a PodGroup while still listing individual Pods in status.reservedFor, though changing that to list the PodGroup in status.reservedFor is an obvious next step once this KEP is implemented.

@nojnhuh commented Jan 26, 2026

cc @helayoty

@johnbelamaric (Member):

I suppose isDone could map to some spec field on PodGroups which workload controllers like Job or JobSet can control to say whether or not ResourceClaims owned by the PodGroup should remain allocated.

Ok, couple things:

  1. ResourceClaims that follow the PodGroup lifecycle - for sure. This means that the ResourceClaim will be de-allocated when the PodGroup is deleted, OR possibly when we have some sort of field in the spec of PodGroup as @nojnhuh suggests here (e.g., the PodGroup is in TERMINATED or PAUSED state or something).

  2. Eventually we may ALSO want to associate ResourceClaims at the Workload-level. We should think about whether we want to address that now or it's something for later. @helayoty and @kannon92 do you have thoughts on that?

So, it seems to me the scope here of this KEP is:

  • I put Workload and/or PodGroup in the ReservedFor, and we will have a deterministic way to know when to de-allocate the claim

Yes, I think this KEP and #5736 are targeting different problems. #5736 should be able to address sharing resources between Pods in a PodGroup while still listing individual Pods in status.reservedFor, though changing that to list the PodGroup in status.reservedFor is an obvious next step once this KEP is implemented.

Assuming the ResourceClaimController does the de-allocation, listing only Pods in the ResourceClaim will cause the de-allocation to happen once the last Pod is gone, unless the controller also looks at the PodGroup objects referenced by the Pods. So it's not clear to me why we would bother implementing the flow of "resource claims that follow the lifecycle of the podgroup, but put the pod reference into the reservedfor instead of podgroup". What are we saving by implementing that instead of just going all the way with this KEP at the same time?

@dom4ha commented Jan 26, 2026

Workload-aware scheduling improves ResourceClaim scheduling when the PodGroup and the ResourceClaim are aligned.

Technically it should be possible to have two or more gangs using the same ResourceClaim. In that case two PodGroups would be listed in ReservedFor, similarly to how Pods are listed now. Sure, they won't be gang scheduled (as long as there is no parent gang), but I think implementing just this part may make sense, especially since it's not clear whether #5736 can make 1.36.

@nojnhuh commented Jan 26, 2026

So, it's not clear to me why we would bother implementing the flow of "resource claims that follow the lifecycle of the podgroup, but put the pod reference into the reservedfor instead of podgroup". What are we saving by implementing that instead of just going all the way with this KEP at the same time?

We save having to do both at the same time even though they can be implemented independently.

@mortent commented Jan 26, 2026

I don't think it's possible to guarantee at any point that no new Pods will be created. The brute-force workaround would be to consider the claim should be allocated as long as the PodGroup object exists, no matter the state of its Pods. That's obviously not ideal if the PodGroup sits at 0 replicas or keeps around a bunch of Completed/Failed Pods for a long time. Another approach would be like what I've drafted in #5736 to deallocate when the PodGroup is done "for now" i.e. when all of its Pods are currently terminated or gone, then reallocating when new Pods needing the claim are created.

So that approach works when we are listing individual pods in the ReservedFor list, since we can detect concurrent updates to the list and therefore prevent situations where a new pod referencing a ResourceClaim is scheduled at the same time as the last remaining pod terminates, which could otherwise cause the claim to be deallocated.

I don't think we have a way to check if all pods are terminated or gone that doesn't race with the scheduler scheduling new pods that might reference the same ResourceClaim. It could cause situations where a running pod is referencing a claim that doesn't have allocated devices.

I think deallocating only when the PodGroup has been deleted would work, but that requires that the PodGroup is never deleted as long as there are running pods belonging to the group. So it potentially has a similar problem, although maybe the lifecycle of a PodGroup is defined such that this can never happen.

I suppose isDone could map to some spec field on PodGroups which workload controllers like Job or JobSet can control to say whether or not ResourceClaims owned by the PodGroup should remain allocated. I'd expect changes like that to be proposed in this KEP vs. #5833. I'll be updating #5736 similarly to propose changes to the PodGroup API on top of #5833.

I agree that it seems like the real workload controllers would have to give the signal when all pods have completed or are terminating AND no new pods will be created.

Yes, I think this KEP and #5736 are targeting different problems. #5736 should be able to address sharing resources between Pods in a PodGroup while still listing individual Pods in status.reservedFor, though changing that to list the PodGroup in status.reservedFor is an obvious next step once this KEP is implemented.

The main feature of this KEP is to find a way around the 256-entry limit on the number of pods that can share a ResourceClaim. While probably not a hard requirement for #5736, its usefulness will be somewhat limited by the constraint on the number of pods in a PodGroup.

@johnbelamaric (Member):

I agree that it seems like the real workload controllers would have to give the signal when all pods have completed or are terminating AND no new pods will be created.

Yes, the idea is that the deletion of the PodGroup (or some lifecycle state represented in a spec field) is an explicit indicator from the true workload controller that it is safe to de-allocate the claim. Anything else carries risk (though maybe we just live with that risk and some workaround for now).

@mortent commented Jan 27, 2026

Yes, the idea is that the deletion of the PodGroup (or some lifecycle state represented in a spec field) is an explicit indicator from the true workload controller that it is safe to de-allocate the claim. Anything else carries risk (though maybe we just live with that risk and some workaround for now).

Ref the discussion on Slack: the indicator is that the PodGroup has no pods using the ResourceClaim and will not create new pods that do. The consequence is that the reference to the PodGroup in the ReservedFor list can be removed. It does not mean that it is safe to deallocate the claim, since there might be other PodGroups or Pods using the claim.

@mortent commented Jan 28, 2026

This KEP is being merged with #5729. Work will continue in #5736.

@mortent closed this Jan 28, 2026
@github-project-automation bot moved this from Backlog to Closed in SIG Scheduling Jan 28, 2026
@pohly moved this from 📋 Backlog to ✅ Done in Dynamic Resource Allocation Feb 23, 2026