DRA: device taints: new ResourceSlice API, new features #134152
k8s-ci-robot merged 11 commits into kubernetes:master from
Conversation
When the ResourceSlice no longer exists, the ResourceSlice tracker didn't and couldn't report the tainted devices, even if they are allocated and in use. The controller must keep track of DeviceTaintRules itself and handle this scenario. In this scenario it is impossible to evaluate CEL expressions because the necessary device attributes aren't available. We could:
- Copy them into the allocation result: too large, big change.
- Limit usage of CEL expressions to rules with no eviction: inconsistent.
- Remove the fields which cannot be supported well.

The last option was chosen. The tracker is now no longer needed by the eviction controller. Reading directly from the informer means that we cannot assume that pointers are consistent. We have to track ResourceSlices by their name, not their pointer.
While at it, ensure that future unknown effects are treated like the None effect.
We know how often the controller should get a pod, so let's check it. This must run before we do our own GET call.
The approach copied from node taint eviction was to fire off one goroutine per pod at the intended time. This leads to the "thundering herd" problem: when a single taint causes eviction of several pods and those all have no toleration grace period, or the same one, then they all get deleted concurrently at the same time. For node taint eviction that is limited by the number of pods per node, which is typically ~100. In an integration test, that already led to problems with watchers:

    cacher.go:855] cacher (pods): 100 objects queued in incoming channel.
    cache_watcher.go:203] Forcing pods watcher close due to unresponsiveness: key: "/pods/", labels: "", fields: "". len(c.input) = 10, len(c.result) = 10, graceful = false

It also causes spikes in memory consumption (mostly the 2 KB stack per goroutine plus closure) with no upper limit.

Using a workqueue makes concurrency more deterministic because there is an upper limit. In the integration test, 10 workers kept the watch active. Another advantage is that failures to evict a pod get retried with exponential backoff per affected pod, forever. Previously, eviction was tried a few times at a fixed rate and then the controller gave up; if the apiserver was down long enough, pods didn't get evicted.
/assign @soltysh (for kube-controller-manager)
liggitt left a comment
API / validation / strategy changes lgtm
would like a final ack from a DV reviewer on the annotations there, and can squash as you think makes sense before merge
/assign @aaron-prindle

Any thoughts on declarative validation tags for these API changes? See also #134152 (comment) (helpfully hidden by GitHub, but not resolved yet).
    //
    // +optional
    // +listType=atomic
    // +featureGate=DRADeviceTaints
For the updated max # of taints validation, could use the +k8s:maxItems DV tag here:

    // +k8s:maxItems=16
    Taints []DeviceTaint `json:"taints,omitempty" protobuf:"bytes,8,rep,name=taints"`
Shall I give this a try in this PR? Can this be a follow-up, or should this be a follow-up?
This can be a follow-up 👍. Perhaps we can track the known DV tags we want to add in an issue to make it easier for folks to follow up. I've created this issue where we can put the known follow-ups:
Thanks! I'll leave it in your capable hands then because I already have mine full with other things 😅
LGTM w.r.t. declarative validation tags. NOTE: no DV tags were added in this PR; they are tracked as follow-ups in the umbrella tracking issue:
great, API bits are approved

LGTM label has been added. Git tree hash: a9059764f4af85fd67a75e4a6893b99675f0c532
    // ValidatingAdmissionPolicyStatusController related features.
    ValidatingAdmissionPolicyStatusController validatingadmissionpolicystatusconfig.ValidatingAdmissionPolicyStatusControllerConfiguration
    // DeviceTaintEvictionControllerConfiguration contains elements configuring the device taint eviction controller.
    DeviceTaintEvictionController devicetaintevictionconfig.DeviceTaintEvictionControllerConfiguration
Nit: the contents of this struct are kept in alphabetical order, so it would be nice to move this up after DeploymentController. Also, while at it, would you mind moving StatefulSetController and CronJobController to their appropriate places? These two seem to be out of order 😅
Changed (all of them).
Dims added a "sorted" linter for feature gate definitions. It was almost generic enough to be applied to arbitrary fields, enums, etc. but not quite yet - might be worth picking up again...
Thank you!
Dims added a "sorted" linter for feature gate definitions. It was almost generic enough to be applied to arbitrary fields, enums, etc. but not quite yet - might be worth picking up again...
Hmmm... that could be useful. Do I see correctly that it's #132668?
I'll check it out, but probably after the freeze 😉
    // ruleStatusPeriod is the shortest time between DeviceTaintRule status
    // updates while eviction is in progress. Once it is done, it no longer gets
    // updated until in progress again.
    ruleStatusPeriod = 10 * time.Second
Conservative is the usual approach we go with. Your initial approach of not exposing any configs is a good start, expanding later based on tests with the number of workers. If you wonder about the usefulness of additional configs, that means you don't need them, at least not yet, so I'd not expose anything.
I'm aware of all the gotchas you described above, but those values are rarely configured and we should treat exposing them with utmost caution.
    // that a watch in the integration test couldn't keep up:
    // cacher.go:855] cacher (pods): 100 objects queued in incoming channel.
    // cache_watcher.go:203] Forcing pods watcher close due to unresponsiveness: key: "/pods/", labels: "", fields: "". len(c.input) = 10, len(c.result) = 10, graceful = false
    obj.ConcurrentSyncs = 10
Is this number 10 coming from any particular tests? I see above you're explaining why 100 was too much, and I think I wouldn't be comfortable with such a big default. But most of our controllers oscillate around 5, with some outliers like GC with 20, namespace with 10, and evictions with 8.
    }

    // trackedTaint augments a DeviceTaint with a pointer to its origin.
    // rule and slice are mutually exclusive. At least one is set.
If they are mutually exclusive, why "At least one is set"? Based on the code it seems rule will always take priority.
Replaced with "mutually exclusive. Exactly one of them is always set."
I wanted to make it clear that if rule == nil, then slice must be non-empty.
    }

    podDeletionLatency := time.Since(eviction.when.Time)
    logger.Info("Evicted pod by deleting it", "latency", podDeletionLatency, "reason", eviction.reason)
Most of the controller logs are at V(1) or V(2) ideally.
Copied from node eviction... Changed to V(2).
    tc.handleRuleChange(rule, ruleEvict)
    numPendingPods, numPendingNamespaces, err = tc.countPendingPods(rule)
    numTaintedSliceDevices, numTaintedAllocatedDevices = tc.countTaintedDevices(rule)
Why do this, rather than just invoking the necessary logic? It seems somewhat counterintuitive 🤔
The logic is implemented such that it depends on the state of the controller. To compute stats for the "what if" scenario, I first have to set up a controller that is in that state. I was experimenting with temporarily changing the state, then reverting back, but that turned out to be more complicated and expensive.
Counting tainted devices is something that is only necessary for the dry-run message for effect: None.
That's a fair approach. Still, I'd consider eventually refactoring that into a re-usable bit, rather than instantiating a whole controller.
Basically a simpler struct which just has the relevant state and update methods. I had considered that, but it didn't feel worth it. I might give it another try.
Yeah, exactly that. I'm worried that the footprint (both cpu and mem) for duplicated controller might be significant at scale. Although I'm pretty sure you've done your fair share of testing 😅
    rules, err = tc.ruleLister.List(labels.Everything())
    // TODO: instead of listing and handling an error, keep track of rules in the informer event handler?
    if err != nil {
        panic(err)
None of our controllers panic, and I don't think this one should either. Maybe log the error and return nil, or something like that?
This is a local lister, so this should never happen. A panic for something that shouldn't happen seemed appropriate, but as it won't happen I can also switch to logging the error. Done.
soltysh left a comment
minor nits
/approve
the controller changes
    config := new(kubectrlmgrconfigv1alpha1.DeviceTaintEvictionControllerConfiguration)
    RecommendedDefaultDeviceTaintEvictionControllerConfiguration(config)
    if config.ConcurrentSyncs != 8 {
        t.Errorf("incorrect default value, expected 10 but got %v", config.ConcurrentSyncs)
Nit: the message says 10 still
It might never be necessary to change the default, but it is hard to be sure. It's better to have the option, just in case.
To update the right statuses, the controller must collect more information about why a pod is being evicted. Updating the DeviceTaintRule statuses is then handled by the same work queue as evicting pods. Both operations already share the same client instance and thus QPS + server-side throttling, so they might as well share the same work queue. Deleting pods is not necessarily more important than informing users, or vice versa, so there is no strong argument for having different queues.

While at it, the unit tests are switched to using the same mock work queue as in staging/src/k8s.io/dynamic-resource-allocation/internal/workqueue. Because there is no time to add it properly to a staging repo, the implementation gets copied.
This covers permissions and correct usage of server-side apply (SSA).
/approve

LGTM label has been added. Git tree hash: d5e30bcb47b3d439cfe03a81a0eef75934f5f214
[APPROVAL NOTIFIER] This PR is APPROVED. This pull request has been approved by: liggitt, nojnhuh, pohly, soltysh.
/retest
@pohly: The following tests failed.
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
This implements the updated functionality as described in kubernetes/enhancements#5512
In addition, fields in DeviceTaintRule which cannot be supported properly without more fundamental changes (DeviceClass, CEL selectors) get removed. The use case for them is not strong enough to justify additional complexity.
Which issue(s) this PR is related to:
KEP: kubernetes/enhancements#5055
Special notes for your reviewer:
Does this PR introduce a user-facing change?