DRA: device taints: new ResourceSlice API, new features#134152

Merged
k8s-ci-robot merged 11 commits into kubernetes:master from pohly:dra-device-taints-1.35 on Nov 4, 2025
Conversation


@pohly pohly commented Sep 19, 2025

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

This implements the updated functionality as described in kubernetes/enhancements#5512

In addition, fields in DeviceTaintRule which cannot be supported properly without more fundamental changes (DeviceClass, CEL selectors) get removed. The use case for them is not strong enough to justify additional complexity.

Which issue(s) this PR is related to:

KEP: kubernetes/enhancements#5055

Special notes for your reviewer:

Does this PR introduce a user-facing change?

DRA device taints: the DeviceTaintRule status now provides information about the rule, in particular whether pods still need to be evicted (the "EvictionInProgress" condition). The new "None" effect can be used to preview what a DeviceTaintRule would do if it used the "NoExecute" effect, and to taint devices (e.g. for device health) without immediately affecting scheduling or running pods.
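As an illustration of the release note, a hypothetical manifest using the new "None" effect to preview a maintenance taint. Field names follow the resource.k8s.io alpha API and may differ in the released version; the driver name and taint key are made up for this example:

```yaml
# Hypothetical example: taint every device of one driver with effect None
# to preview what a NoExecute rule would evict, without touching scheduling
# or running pods. API version and field names are assumptions.
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: gpu-maintenance-preview
spec:
  deviceSelector:
    driver: gpu.example.com      # assumed driver name, for illustration
  taint:
    key: example.com/maintenance
    value: planned
    effect: None                 # new in this PR: no immediate effect
```

Switching `effect` to `NoExecute` would then actually evict the pods that the status previewed.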

@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 19, 2025
@k8s-ci-robot

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. area/code-generation area/test sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 19, 2025
@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Sep 19, 2025
@github-project-automation github-project-automation Bot moved this to Needs Triage in SIG Apps Sep 19, 2025
@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Sep 19, 2025
pohly added 5 commits October 31, 2025 18:11
When a ResourceSlice no longer exists, the ResourceSlice tracker cannot report its tainted devices even if they are still allocated and in use. The controller must keep track of DeviceTaintRules itself and handle this scenario.

In this scenario it is impossible to evaluate CEL expressions because the
necessary device attributes aren't available. We could:
- Copy them in the allocation result: too large, big change.
- Limit usage of CEL expressions to rules with no eviction: inconsistent.
- Remove the fields which cannot be supported well.

The last option is chosen.

The tracker is now no longer needed by the eviction controller. Reading
directly from the informer means that we cannot assume that pointers are
consistent. We have to track ResourceSlices by their name, not their pointer.
While at it, ensure that future unknown effects are treated like
the None effect.
We know how often the controller should get a pod, let's check it.
Must run before we do our own GET call.
The approach copied from node taint eviction was to fire off one goroutine per
pod at the intended time. This leads to a "thundering herd" problem: when a
single taint causes eviction of several pods and those all have no toleration
grace period, or the same one, then they all get deleted concurrently at the
same time.

For node taint eviction that is limited by the number of pods per node, which
is typically ~100. In an integration test, that already led to problems with
watchers:

   cacher.go:855] cacher (pods): 100 objects queued in incoming channel.
   cache_watcher.go:203] Forcing pods watcher close due to unresponsiveness: key: "/pods/", labels: "", fields: "". len(c.input) = 10, len(c.result) = 10, graceful = false

It also causes spikes in memory consumption (mostly the 2KB stack per goroutine
plus closure) with no upper limit.

Using a workqueue makes concurrency more deterministic because there is an
upper limit. In the integration test, 10 workers kept the watch active.

Another advantage is that failures to evict the pod get retried with
exponential backoff per affected pod forever. Previously, evicting was tried a
few times with a fixed rate and then the controller gave up. If the apiserver
was down long enough, pods didn't get evicted.
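The bounded-worker idea from the commit message above can be sketched with plain channels. This is a simplified stand-in for the real work queue (the actual controller uses client-go's rate-limiting workqueue, which also provides the per-item exponential backoff mentioned above); names and counts are illustrative:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runEvictions drains a fixed set of pending evictions with a bounded
// number of workers. Unlike one-goroutine-per-pod, concurrency can never
// exceed the worker count, so there is no thundering herd.
func runEvictions(workers, pending int) (evicted, peak int64) {
	queue := make(chan string, pending)
	for i := 0; i < pending; i++ {
		queue <- fmt.Sprintf("pod-%d", i)
	}
	close(queue)

	var inFlight int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range queue {
				cur := atomic.AddInt64(&inFlight, 1)
				// Record the highest concurrency observed so far.
				for {
					p := atomic.LoadInt64(&peak)
					if cur <= p || atomic.CompareAndSwapInt64(&peak, p, cur) {
						break
					}
				}
				atomic.AddInt64(&evicted, 1) // stand-in for the pod DELETE call
				atomic.AddInt64(&inFlight, -1)
			}
		}()
	}
	wg.Wait()
	return evicted, peak
}

func main() {
	ev, pk := runEvictions(3, 20)
	fmt.Printf("evicted=%d, peak concurrency=%d (workers=3)\n", ev, pk)
}
```

All 20 items get processed, but at most 3 deletions are ever in flight, which is the property that kept the watch responsive in the integration test.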

pohly commented Oct 31, 2025

/assign @soltysh

For kube-controller-manager.


@liggitt liggitt left a comment


API / validation / strategy changes lgtm

would like a final ack from a DV reviewer on the annotations there, and can squash as you think makes sense before merge

Comment thread pkg/apis/resource/types.go

pohly commented Nov 3, 2025

would like a final ack from a DV reviewer on the annotations there

/assign @aaron-prindle

Any thoughts on declarative validation tags for these API changes?

See also #134152 (comment) (helpfully hidden by GitHub, but not resolved yet).

//
// +optional
// +listType=atomic
// +featureGate=DRADeviceTaints

@aaron-prindle aaron-prindle Nov 3, 2025


For the updated max # of taints validation, could use the +k8s:maxItems DV tag here:

// +k8s:maxItems=16
Taints []DeviceTaint `json:"taints,omitempty" protobuf:"bytes,8,rep,name=taints"`



Shall I give this a try in this PR, or can/should this be a follow-up?


@aaron-prindle aaron-prindle Nov 3, 2025


This can be a follow-up 👍 . Perhaps we can track the known DV tags we want to add in an issue to make it easier to follow-up for folks. I've created this issue where we can put the known followups:

#135073



Thanks! I'll leave it in your capable hands then because I already have mine full with other things 😅


aaron-prindle commented Nov 3, 2025

LGTM w.r.t declarative validation tags

NOTE: no DV tags were added in this PR but added as follow-up in umbrella tracking issue:
#135073


liggitt commented Nov 3, 2025

great, API bits are approved


@nojnhuh nojnhuh left a comment


/lgtm

@k8s-ci-robot

LGTM label has been added.

Git tree hash: a9059764f4af85fd67a75e4a6893b99675f0c532


@soltysh soltysh left a comment


I left a few questions

Comment thread pkg/controller/apis/config/types.go Outdated
// ValidatingAdmissionPolicyStatusController related features.
ValidatingAdmissionPolicyStatusController validatingadmissionpolicystatusconfig.ValidatingAdmissionPolicyStatusControllerConfiguration
// DeviceTaintEvictionControllerConfiguration contains elements configuring the device taint eviction controller.
DeviceTaintEvictionController devicetaintevictionconfig.DeviceTaintEvictionControllerConfiguration


Nit: the contents of this struct are kept in alphabetical order; it would be nice to move this up after DeploymentController. Also, while at it, would you mind moving StatefulSetController and CronJobController to the appropriate places? These two seem to be out of order 😅



Changed (all of them).

Dims added a "sorted" linter for feature gate definitions. It was almost generic enough to be applied to arbitrary fields, enums, etc. but not quite yet - might be worth picking up again...



Thank you!

Dims added a "sorted" linter for feature gate definitions. It was almost generic enough to be applied to arbitrary fields, enums, etc. but not quite yet - might be worth picking up again...

Hmmm... that could be useful. Do I see correctly that it's #132668?



Yes.



I'll check it out, but probably after the freeze 😉

// ruleStatusPeriod is the shortest time between DeviceTaintRule status
// updates while eviction is in progress. Once it is done, it no longer gets
// updated until in progress again.
ruleStatusPeriod = 10 * time.Second


Conservative is the usual approach we go with. Your initial approach of not exposing any configs is a good start; expand based on tests with the number of workers. If you're wondering about the usefulness of additional configs, that means you don't need them, at least not yet, and so I'd not expose anything.

I'm aware of all the gotchas you described above, but those values are rarely configured and we should treat exposing them with utmost cautiousness.

// that a watch in the integration test couldn't keep up:
// cacher.go:855] cacher (pods): 100 objects queued in incoming channel.
// cache_watcher.go:203] Forcing pods watcher close due to unresponsiveness: key: "/pods/", labels: "", fields: "". len(c.input) = 10, len(c.result) = 10, graceful = false
obj.ConcurrentSyncs = 10


Is this number 10 coming from any particular tests? I see above you're explaining why 100 was too much, and I think I wouldn't be comfortable with such a big default. But most of our controllers oscillate around 5, with some outliers like GC with 20, namespace with 10, and evictions with 8.

}

// trackedTaint augments a DeviceTaint with a pointer to its origin.
// rule and slice are mutually exclusive. At least one is set.


If they are mutually exclusive why At least one is set? Based on the code it seems rule will always take priority.



Replaced with "mutually exclusive. Exactly one of them is always set."

I wanted to make it clear that if rule == nil, then slice must be non-empty.
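That invariant can be made concrete with a hypothetical simplification (strings here instead of the controller's actual pointer fields, and all names are illustrative):

```go
package main

import "fmt"

// trackedTaint remembers where a taint came from. Invariant, as discussed
// above: rule and slice are mutually exclusive, and exactly one is set.
type trackedTaint struct {
	rule  string // DeviceTaintRule name; empty if the taint came from a slice
	slice string // ResourceSlice name; empty if the taint came from a rule
}

// origin relies on the invariant: if rule is empty, slice must be non-empty.
func (t trackedTaint) origin() string {
	if t.rule != "" {
		return "rule/" + t.rule
	}
	return "slice/" + t.slice
}

func main() {
	fmt.Println(trackedTaint{rule: "maintenance"}.origin()) // rule/maintenance
	fmt.Println(trackedTaint{slice: "gpu-node-1"}.origin()) // slice/gpu-node-1
}
```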

}

podDeletionLatency := time.Since(eviction.when.Time)
logger.Info("Evicted pod by deleting it", "latency", podDeletionLatency, "reason", eviction.reason)


Most of the controller logs are at V(1) or V(2) ideally.



Copied from node eviction... Changed to V(2).


tc.handleRuleChange(rule, ruleEvict)
numPendingPods, numPendingNamespaces, err = tc.countPendingPods(rule)
numTaintedSliceDevices, numTaintedAllocatedDevices = tc.countTaintedDevices(rule)


Why doing this, rather than just invoking the necessary logic? It seems somewhat counterintuitive 🤔



The logic is implemented such that it depends on the state of the controller. To compute stats for the "what if" scenario, I first have to set up a controller that is in that state. I was experimenting with temporarily changing the state, then reverting back, but that turned out to be more complicated and expensive.

Counting tainted devices is something that is only necessary for the dry-run message for effect: None.



That's a fair approach. Still, I'd consider eventually refactoring that into a re-usable bit, rather than instantiating a whole controller.


@pohly pohly Nov 4, 2025


Basically a simpler struct that just has the relevant state and update methods. I had considered that, but it didn't feel worth it - I might give it another try.



Yeah, exactly that. I'm worried that the footprint (both CPU and memory) of a duplicated controller might be significant at scale. Although I'm pretty sure you've done your fair share of testing 😅

rules, err = tc.ruleLister.List(labels.Everything())
// TODO: instead of listing and handling an error, keep track of rules in the informer event handler?
if err != nil {
panic(err)


None of our controllers panic, and I don't think this one should either. Maybe log error and return nil, or something like that?



This is a local lister, so this should never happen. A panic for something that shouldn't happen seemed appropriate, but as it won't happen I can also switch to logging the error - done.


@soltysh soltysh left a comment


minor nits

/approve
the controller changes

config := new(kubectrlmgrconfigv1alpha1.DeviceTaintEvictionControllerConfiguration)
RecommendedDefaultDeviceTaintEvictionControllerConfiguration(config)
if config.ConcurrentSyncs != 8 {
t.Errorf("incorrect default value, expected 10 but got %v", config.ConcurrentSyncs)


Nit: the message says 10 still



pohly added 4 commits November 4, 2025 21:57
It might never be necessary to change the default, but it is hard to be sure.
It's better to have the option, just in case.
To update the right statuses, the controller must collect more information
about why a pod is being evicted. Updating the DeviceTaintRule statuses then is
handled by the same work queue as evicting pods.

Both operations already share the same client instance and thus QPS+server-side
throttling, so they might as well share the same work queue. Deleting pods is
not necessarily more important than informing users or vice-versa, so there is
no strong argument for having different queues.

While at it, switch the unit tests to use the same mock work queue as
in staging/src/k8s.io/dynamic-resource-allocation/internal/workqueue. Because
there is no time to add it properly to a staging repo, the implementation gets
copied.
This covers permissions and correct usage of SSA.

liggitt commented Nov 4, 2025

/approve
for API / config bits


@nojnhuh nojnhuh left a comment


/lgtm

@k8s-ci-robot

LGTM label has been added.

Git tree hash: d5e30bcb47b3d439cfe03a81a0eef75934f5f214

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, nojnhuh, pohly, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


nojnhuh commented Nov 4, 2025

/retest

@k8s-ci-robot

@pohly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-apidiff-client-go 3f45a67 link false /test pull-kubernetes-apidiff-client-go
pull-kubernetes-audit-kind-conformance 3f45a67 link false /test pull-kubernetes-audit-kind-conformance

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/code-generation area/release-eng Issues or PRs related to the Release Engineering subproject area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Projects

Status: API review completed, 1.35
Status: ✅ Done
Archived in project
