[KEP-4815] Update KEP for 1.35 (#5515)
Conversation
mortent commented on Sep 5, 2025:
- One-line PR description: Update the design to simplify validation while preserving flexibility.
- Issue link: DRA: Partitionable Devices #4815
- Other comments:
/assign @johnbelamaric @pohly @klueska I have updated the KEP to reflect the idea that
pohly left a comment:
I like the approach, but we need to sort out some details.
pohly left a comment:
Some nits, but overall LGTM.
@liggitt can you render an opinion here on the API discussions?

/approve
johnbelamaric left a comment:
From a PRR perspective, can you describe, in the feature enablement/disablement section and the rollout, upgrade, and rollback planning section, what happens to resources using the old alpha fields? These are now alpha fields in v1 types. We do NOT need to support old alpha fields, nor do we need to support upgrade of alpha resources. However, we should document clearly what happens if people have existing objects with those fields set.
@johnbelamaric I have updated the PR to include more details about this.
> An invalid resource pool will not prevent a pod from getting scheduled if devices
> can be allocated from valid resource pools. For requests with `AllocationMode: All`,
> this means that those cannot be satisfied if any of the devices available on the node
> are in an invalid resource pool. Only if a pod can't be scheduled on any of the nodes in
> the cluster _and_ an invalid resource pool was encountered on at least one of the nodes,
> will the system return an error. This error will be surfaced as a scheduling error that
> will be added to the `Pod` status.
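For context, a request with `AllocationMode: All` as discussed in the quoted text might look like the following sketch. The field layout reflects my understanding of the `resource.k8s.io/v1` types; the device class and names are illustrative, not taken from the KEP:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: all-node-gpus            # illustrative name
spec:
  devices:
    requests:
    - name: gpus
      exactly:
        deviceClassName: gpu.example.com   # hypothetical device class
        allocationMode: All                # claim every matching device on the node
```

With `allocationMode: All`, the claim is only satisfiable on a node where every matching device can be allocated, which is why a single invalid resource pool on a node makes such a request unsatisfiable there.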
How could anyone notice that they have a misconfiguration here? Will there be any metric or other way of checking that?
So they would notice when they try to schedule a pod, but the scheduler is unable to satisfy all claims and it encounters one or more invalid resource pools. So if there are sufficient devices available, it is possible that this might go unnoticed for some time, although arguably it doesn't really have any impact.
I could add a metric from the allocator that exposes the number of invalid resource pools encountered, but it has a few challenges:
- The allocator is invoked once per pod-node combination, so we will have to figure out a way to report the metric once per pod I think.
- The allocator tries to do as little work as possible per invocation, so it discards resource pools that aren't relevant to allocating a pod on a node. As a result, the number of invalid resource pools identified might differ based on the pod and node.
I think the best option is the "Future options" proposal in the KEP: https://github.com/kubernetes/enhancements/pull/5515/files#diff-8985915ed27658e0b602c60b13c47d5d7e0309f9dff42df6cc9d514da47d652fR660. Such a controller could update status on ResourceSlice objects and/or expose a metric with the number of invalid resource pools in the cluster.
However, misconfiguration may be hidden, and scheduling as a whole will work differently than expected. It would be good to be able to inform the admin (?) about this. This is not a blocker, as the KEP remains in the alpha stage, but it's worth reconsidering this aspect before moving to beta.
I agree that an error in a resource pool might lead to an unexpected set of devices being allocated to a pod, since the expected devices might be in an invalid resource pool. But that should be OK, since DRA will only allocate devices that meet the requirements in the ResourceClaim(s), so whatever is allocated should be usable.
I've updated the "Future options" section in the doc to also include metrics as a possible way to surface information about invalid resource pools.
updates to the API docs / validation lgtm
johnbelamaric left a comment:
small request to clarify some language
@macsko Any other concerns about this KEP from SIG-Scheduling?

Latest changes LGTM

/approve for SIG scheduling
/lgtm

The hold from @johnbelamaric seems to be resolved.

/approve
/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dom4ha, johnbelamaric, mortent, pohly

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.