
KEP 5055: DRA Device Taints: update for 1.35 #5512

Merged
k8s-ci-robot merged 3 commits into kubernetes:master from pohly:dra-device-taints-1.35
Oct 15, 2025

Conversation

@pohly
Contributor

@pohly pohly commented Sep 4, 2025

  • One-line PR description: update for 1.35

  • Issue link: DRA: device taints and tolerations #5055

  • Other comments: This now covers publishing purely informational taints ("Effect: None"), rate-limited eviction, and an improved user experience (DeviceTaintRule status, kubectl describe DeviceTaintRule).

The "informational taint" is a potential alternative to #5469.

The one-of ResourceSlice is relevant for #5234.

/cc @nojnhuh @byako @eero-t @mortent

@k8s-ci-robot
Contributor

@pohly: GitHub didn't allow me to request PR reviews from the following users: eero-t.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.


In response to this:

  • One-line PR description: update for 1.35

  • Issue link: DRA: device taints and tolerations #5055

  • Other comments: This now covers publishing purely informational taints ("Effect: None"), rate-limited eviction, and an improved user experience (DeviceTaintRule status, kubectl describe DeviceTaintRule).

The "informational taint" is a potential alternative to #5469.

The one-of ResourceSlice is relevant for #5234.

/cc @nojnhuh @byako @eero-t @mortent

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes, kind/kep, and sig/scheduling labels on Sep 4, 2025
@github-project-automation github-project-automation Bot moved this to Needs Triage in SIG Scheduling Sep 4, 2025
@k8s-ci-robot k8s-ci-robot added the size/L label on Sep 4, 2025
@pohly pohly force-pushed the dra-device-taints-1.35 branch from 73ef40f to 100e650 on September 4, 2025 09:29
Comment thread keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md Outdated
Comment on lines +220 to +222
Once eviction starts, it happens at a low enough rate that admins have a chance
to delete the DeviceTaintRule before all pods are evicted if they made a
mistake after all. This rate is configurable to enable faster eviction, if
Contributor

Is effect: None + kubectl describe by itself an adequate safeguard against erroneous taints? Or are there different kinds of mistakes that only rate limiting eviction would mitigate? In general I like the idea of effect: None essentially being a dry run instead of trying to determine in real-time whether a taint is working while it is or is not actively evicting pods. Wondering if that's a worthwhile way we could narrow the scope here.

Contributor Author

I don't think the rate limiting would uncover anything that kubectl describe may have missed, except perhaps a race (one pod shown as to be evicted, a large number of additional such pods created, then eviction turned on and all of them deleted).

But primarily, kubectl describe is optional; some admin might forget to double-check. Then the rate limiting may help as a second line of defense.

Comment thread keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md Outdated
// some Pods and then gets restarted before updating this field.
//
// +optional
PodsEvicted *int64
Contributor Author

@pohly pohly Sep 22, 2025

I'm undecided about this field. As it stands, this is a lower bound: "at least this number of pods were evicted, but maybe also more". 99% of the time it'll be accurate, but guaranteeing that would imply evicting a pod and updating this field in a single atomic transaction. If the kube-controller-manager dies and restarts, it cannot figure out whether all evicted pods were accounted for.

When using conditions, determining the old count becomes icky (we would need to parse the string on startup, which may not always work because IMHO we shouldn't make the string format part of the official API).

Contributor

Is the idea with the PodsEvicted field that it will monotonically increase forever? Would it make more sense to expose that as a metric tied to the lifetime of a kube-controller-manager instance? Then Prometheus would make it easier to add up the total evictions for a given DeviceTaintRule among all past and present kube-controller-manager pods without Kubernetes having to handle that itself.

I think the API is still a good place to at least keep some sign of "in progress"/"done" but maybe doesn't need to detail the exact progress. For that I think a condition makes sense.

Contributor Author

Is the idea with the PodsEvicted field that it will monotonically increase forever?

Yes, but per DeviceTaintRule. Basically some feedback to the admin that "yes, this did something".

Would it make more sense to expose that as a metric tied to the lifetime of a kube-controller-manager instance? Then Prometheus would make it easier to add up the total evictions for a given DeviceTaintRule among all past and present kube-controller-manager pods without Kubernetes having to handle that itself.

Such a metric makes sense for "overall work done". But it cannot be per DeviceTaintRule, because using e.g. the DeviceTaintRule UID as a dimension would lead to unlimited growth of that metric.

I think the API is still a good place to at least keep some sign of "in progress"/"done" but maybe doesn't need to detail the exact progress. For that I think a condition makes sense.

So you think we can drop PodsEvicted and also not represent it in the condition description?
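To illustrate the cardinality concern above, here is a minimal sketch, assuming a hypothetical metric name and label; this is not an existing kube-controller-manager metric:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// evictionsPerRule sketches why a per-DeviceTaintRule metric dimension is
// problematic: every distinct UID creates a new time series that is never
// garbage collected, so the metric grows without bound as rules come and go.
var evictionsPerRule = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "device_taint_evictions_per_rule_total", // hypothetical name
		Help: "Pods evicted per DeviceTaintRule (illustration only).",
	},
	[]string{"devicetaintrule_uid"}, // unbounded label values -> unbounded series
)

func main() {
	prometheus.MustRegister(evictionsPerRule)
	// Each new rule UID adds another series for the lifetime of the process.
	evictionsPerRule.WithLabelValues("8ff1a6af-2f0f-4c1a-9fb8-example").Inc()
}
```

A counter without the per-rule label, like the per-controller "pods deleted since start" metric mentioned below, avoids that problem.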

Contributor Author

Such a metric makes sense for "overall work done".

Actually, it already exists: "Total number of Pods deleted by DeviceTaintEvictionController since its start."

I think as part of the condition description it can be included like this: "10 pods are pending eviction, 100 evicted already since starting the controller". That makes it clear that the count resets when restarting the controller and thus avoids potential confusion when the number is too low.

Contributor Author

Resolved by switching to conditions. The exact wording of the description can be decided during the implementation phase.

}

// DeviceTaintRuleStatus provides information about an on-going pod eviction.
type DeviceTaintRuleStatus struct {
Contributor Author

@pohly pohly Sep 22, 2025

While implementing the type updates, I was reminded that bumping the spec increments the Generation. This was meant as preparation for conditions, which can record the observed generation to document what they are based on.

But now these new status fields aren't using conditions. Should they? There's no field to store data like "pods pending eviction" in machine-readable form in https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#Condition. The numbers could be part of the message field:

Condition{
    Type:               "InProgress",
    Status:             ConditionTrue,
    ObservedGeneration: rule.Generation,
    LastTransitionTime: metav1.Now(),
    Reason:             "PodsPendingEviction",
    Message:            fmt.Sprintf("%d pods are pending eviction, %d evicted already", numPodsPending, numPodsEvicted),
}

This becomes `Status: ConditionFalse` when there are no pods which need to be evicted.

Contributor Author

The advantage of conditions is that kubectl wait --for=condition=InProgress=false DeviceTaintRule/my-taint-rule can be used, which is nicer than kubectl wait --for=jsonpath='{.status.podsPendingEviction}'=0.

Contributor Author

I've switched to Conditions []metav1.Condition.
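For illustration, a minimal sketch of what that shape could look like; the field comment and list markers below are assumptions, the KEP text itself is authoritative:

```go
package resource

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// DeviceTaintRuleStatus, sketched with the conditions approach discussed in
// this thread.
type DeviceTaintRuleStatus struct {
	// Conditions describe the state of the pod eviction triggered by this
	// rule. Currently only the "EvictionInProgress" type is defined.
	//
	// +optional
	// +listType=map
	// +listMapKey=type
	Conditions []metav1.Condition
}
```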

Member

While implementing the type updates, I was reminded that bumping the spec increments the Generation.

It can (and should), but must be specifically done in the strategy PrepareForUpdate method

I've switched to Conditions []metav1.Condition.

+1 for using conditions

Contributor Author

It can (and should), but must be specifically done in the strategy PrepareForUpdate method

I wasn't clear: I meant to point out that I already implemented it like that in the initial implementation in 1.33 😁.

I agree that managing Generation should be considered best practice for any type with a spec.
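As an aside, here is a self-contained sketch of that pattern, with hypothetical stand-in types instead of the real API and registry strategy:

```go
package main

import (
	"fmt"
	"reflect"
)

// Minimal stand-ins; the real types live in the resource.k8s.io API group and
// the real logic in the kube-apiserver registry strategy's PrepareForUpdate.
type DeviceTaintRuleSpec struct {
	Key, Value, Effect string
}

type DeviceTaintRule struct {
	Generation int64
	Spec       DeviceTaintRuleSpec
	Status     struct{ Conditions []string }
}

// prepareForUpdate drops client-side status writes and bumps Generation only
// when the spec changed, so that conditions can later record the generation
// they observed via ObservedGeneration.
func prepareForUpdate(newObj, oldObj *DeviceTaintRule) {
	newObj.Status = oldObj.Status // status is only writable via the /status subresource
	newObj.Generation = oldObj.Generation
	if !reflect.DeepEqual(newObj.Spec, oldObj.Spec) {
		newObj.Generation = oldObj.Generation + 1
	}
}

func main() {
	oldRule := &DeviceTaintRule{Generation: 3, Spec: DeviceTaintRuleSpec{Key: "example.com/unhealthy", Effect: "NoExecute"}}
	newRule := *oldRule
	newRule.Spec.Effect = "None" // spec change, so the generation gets bumped
	prepareForUpdate(&newRule, oldRule)
	fmt.Println(newRule.Generation) // prints 4
}
```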

@pohly pohly force-pushed the dra-device-taints-1.35 branch from 4282b1f to b404e59 on September 30, 2025 08:13
@pohly
Contributor Author

pohly commented Sep 30, 2025

/assign @liggitt

For comments on the proposed API.

Comment thread keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md Outdated
### API

The ResourceSlice content gets extended:
The ResourceSlice content gets extended. A natural place for a taint that
Member

Since slices can now be “devices” or “taints”, can we call out:

  • How the scheduler/DRA controllers index taint-slices vs device-slices.
  • Cache invalidation and event fan-out when taint-slices update, to avoid O(N×M) reprocessing.

Contributor Author

Short answer: the ResourceSlice tracker keeps track of that state and tries to minimize the changes that the consumers (scheduler and eviction controller) see.

The long answer is a bit more involved and I'd like to punt on that now while still in alpha. It needs to be added for beta when discussing scalability.

check its policy configuration and decide to take devices offline by creating
a DeviceTaintRule with a taint for affected devices.

#### Safe Pod Eviction
Member

How do we handle the races between scheduling and post-allocation taints? For example:

  • Taint appears after scheduling but before kubelet admission.
  • Taint appears while pod is running.

Contributor Author

Is that question related to this user story?

Taint appears after scheduling but before kubelet admission.

If the effect is NoSchedule, then nothing happens. The pod is allowed to run.
If the effect is NoExecute, then the pod gets evicted.

Taint appears while pod is running.

Same here.

Member

What I meant is that it would be nice to mention in the Safe Pod Eviction section that:

  • NoSchedule only affects scheduling decisions. Once a pod is bound, adding the taint later has no impact.
  • NoExecute applies retroactively. If the taint appears either right after scheduling but before kubelet admission or while the pod is already running, the pod is subject to eviction.

That way readers don’t misinterpret this as a race with undefined behavior.

Contributor Author

IMHO NoSchedule does not belong in this section because the use case is about eviction.

I've extended the story with additional comments about that race.

I'm intentionally not using the NoExecute constant here because at this point I don't want to educate readers about the API details.

//
// The following condition is currently defined as part of this API, more may
// get added:
// - Type: EvictionInProgress
Member

I would add a Ready or Completed condition state.

Contributor Author

@pohly pohly Oct 2, 2025

Can you describe what the semantics of those states would be?

"Completed" could replace "EvictionInProgress", but it could be a bit misleading: it might be "Completed" now, but in the unlikely case that some pod pops up which is scheduled (for example, the scheduler missed the taint or a user manually scheduled a pod), then it's possible to go back from "Completed: True" to "Completed: False". It's not really a final state.

Member

I agree. How about this?

  • EvictionInProgress=True → pods are currently being evicted.
  • EvictionInProgress=False, Reason=Idle → no pods are pending eviction at the moment.

Contributor Author

I updated the comment in the API docs and added reason: Idle in the explanations beneath it.

Comment thread keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md Outdated
Reasons are not specified as part of the API because it's not required by the
use cases and would restrict future changes unnecessarily.

### Test Plan
Member

Can we add a test plan for the case when multiple rules hit the same device set with different rates?

We may need to add another test case to validate rate limiting and controller back-pressure.

Contributor Author

Ack.

@helayoty helayoty moved this from Needs Review to In Progress in SIG Scheduling Oct 1, 2025
pohly added a commit to pohly/kubernetes that referenced this pull request Oct 2, 2025
The original idea was that avoiding repeated conversion of ResourceSlices to
the internal type by doing it in an informer would bring some performance
advantage. Because the DRA scheduler_perf benchmarks didn't clearly distinguish
between "setup phase" and "measurement phase", some of the processing done by
the informer affected the measurements and we got confusing benchmark results
in kubernetes#131168. This has been fixed
together with some other
issues (kubernetes#133941,
kubernetes#134010) and now results make more
sense. However, although the conversion shows up in CPU profiles, it's not slow
enough to have much impact.

The main reason for this change is that it enables extending the internal type
with fields that get set by the ResourceSlice
tracker (staging/src/k8s.io/dynamic-resource-allocation/resourceslice/tracker). Those
additional fields are needed for the revised device taint
handling (kubernetes/enhancements#5512) because that
change will remove the device taints that are embedded inside the Device
struct. Without that, there would be no place where the tracker can report all
taints that apply to a device; a separate data structure and method of
communicating when it changes would be needed. With this commit,
dynamicresources plugin and allocator only use the internal type and get it
either from the informer (this commit) or the ResourceSlice tracker.

Shifting the conversion may also enable other future optimizations because
internalizing strings amortizes itself better compared to doing it repeatedly,
but this has not been explored.
Comment thread keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md Outdated
// The default is 10 Pods/s.
//
// +optional
EvictionsPerSecond *int64
Member

I'm still finding this pretty weird to reason about. I think this is the max evictions/second this taint would drive, but it's pretty best-effort / there's no guarantee pods would actually be evicted at this rate, right?

  • The eviction controller sync loop might be busy with other things
  • The eviction controller API client might be rate-limited client-side or server-side
  • The eviction controller API client might not be able to keep up with evictions
  • The attempts to evict pods could be rejected due to PDB (does the controller use the /evict subresource that respects PDB?)

If we keep this here, the docs need to make it really clear that this is not a strong guarantee pods will be deleted this fast

Contributor Author

Yes, it would be "max rate with best effort".

does the controller use the /evict subresource that respects PDB?

No. I copied the node eviction controller, which just deletes the pod. I don't have a strong opinion about what's best, but following that established pattern might be less surprising because it's consistent.

Member

Maybe two separate effects? One to "forcibly evict" and one to "evict respecting policy (PDB)"?

Contributor Author

With rate limiting gone there is no need for this field.

Member

Correct, but I think the idea of an effect that honors PDBs vs. an effect that does not is a separate discussion.

Contributor Author

True. I'll keep this open for the PDB aspect. Like you, I prefer to treat support for PDB as a future extension. It raises several questions that are similar to the rate limit (who gets to decide whether it should be used or not), and the API isn't obvious (new effect or a separate field).

But let's prepare for it now and explicitly allow adding more effects:

    // The effect of the taint on claims that do not tolerate the taint
    // and through such claims on the pods using them.
    //
    // Valid effects are None, NoSchedule and NoExecute. PreferNoSchedule as used for
    // nodes is not valid here. More effects may get added in the future.
    // Consumers must treat unknown effects like None.
    //
    // +required
    Effect DeviceTaintEffect

    // ^^^^
    //
    // Implementing PreferNoSchedule would depend on a scoring solution for DRA.
    // It might get added as part of that.
    //
    // A possible future new effect is NoExecuteWithPodDisruptionBudget:
    // honor the pod disruption budget instead of simply deleting pods.
    // This is currently undecided, it could also be a separate field.
    //
    // Validation must be prepared to allow unknown enums in stored objects,
    // which will enable adding new enums within a single release without
    // ratcheting.

@liggitt liggitt moved this from In progress to Changes requested in API Reviews Oct 9, 2025
A discussion about use cases and usage led to the conclusion
that rate limiting and arbitrary health data are not needed.
The idea of reducing the number of devices when using new
fields addresses the size issue.

The original, more complex proposal is kept in the initial
commit to record it for future reference.
Member

@johnbelamaric johnbelamaric left a comment

LGTM from the DRA design perspective.

There is one open question about whether the eviction should honor PDB or not, and whether there should be separate effects for those options. IMO this is non-blocking, because I think the current behavior of working the same as node taints is fine and makes sense. If we want one that honors PDB (and some other policies?), we can add it as a separate feature later.

//
// Validation must be prepared to allow unknown enums in stored objects,
// which will enable adding new enums within a single release without
// ratcheting.
Member

@liggitt liggitt Oct 10, 2025

Define how clients (schedulers, evictors, etc) should treat unrecognized taint effects (safest option is probably to treat them as None). If we contemplate adding more effects in the future, we have to ensure the oldest clients still present which do things with taint effects handle unrecognized effects safely. A 1.x API server can have a 1.(x-1) scheduler or kube-controller-manager running against it.

Contributor Author

Define how clients (schedulers, evictors, etc) should treat unrecognized taint effects (safest option is probably to treat them as None).

I already did above in the API docs ("Consumers must treat unknown effects like None."). There it is visible to anyone who may want to react to the API.

This note here is more geared towards "how do we implement this type" and thus not part of the official API.
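A minimal consumer-side sketch of the "treat unknown effects like None" rule; the constant names below are assumptions that only mirror the enum values listed in the API docs above:

```go
package main

import "fmt"

// Local stand-ins for the DeviceTaintEffect enum; the real constants live in
// the resource.k8s.io API group.
type DeviceTaintEffect string

const (
	DeviceTaintEffectNone       DeviceTaintEffect = "None"
	DeviceTaintEffectNoSchedule DeviceTaintEffect = "NoSchedule"
	DeviceTaintEffectNoExecute  DeviceTaintEffect = "NoExecute"
)

// shouldEvict only acts on effects the consumer explicitly recognizes; any
// effect added by a newer API server falls through to the safe default and is
// treated like None.
func shouldEvict(effect DeviceTaintEffect) bool {
	switch effect {
	case DeviceTaintEffectNoExecute:
		return true
	case DeviceTaintEffectNone, DeviceTaintEffectNoSchedule:
		return false
	default:
		return false // unknown effect: behave as if it were None
	}
}

func main() {
	fmt.Println(shouldEvict(DeviceTaintEffectNoExecute))         // true
	fmt.Println(shouldEvict("NoExecuteWithPodDisruptionBudget")) // false: unknown, treated like None
}
```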

Member

@liggitt liggitt left a comment

lgtm for API updates

just a couple doc suggestions

fine (newer drivers work on an older Kubernetes).

`effect: None` is only valid for Kubernetes >= 1.35. DeviceTaintRules
and ResourceSliced with that value must be removed before a downgrade.
Member

Suggested change
and ResourceSliced with that value must be removed before a downgrade.
and ResourceSlices with that value must be removed before a downgrade.

Contributor Author

Fixed.

@liggitt liggitt moved this from Changes requested to API review completed, 1.35 in API Reviews Oct 10, 2025
@pohly pohly force-pushed the dra-device-taints-1.35 branch from 6add3c1 to 315c45f on October 15, 2025 07:01
Contributor Author

@pohly pohly left a comment

@macsko: I think I have addressed everything.

Can you do one last pass and LGTM/approve?


A taint is not meant to be used as an access control mechanism. Users are
allowed to ignore taints (at their own risk). Adding a taint in a live cluster
is inherently racy because it first needs to be observed by e.g. the scheduler.
Member

@dom4ha dom4ha Oct 15, 2025

Is there a plan to close that race window at some point by blocking new allocations while a taint is created?

IIUC, currently the SoT for the NoSchedule case is in the scheduler, but for NoExecute it is in the Device Driver, which triggers eviction.

Theoretically, if the NoSchedule taint is created, the DRA driver could reject binding requests for pods which do not have a matching toleration, so in fact it could have an immediate effect.

Contributor Author

Is there a plan to close that race window at some point by blocking new allocations while a taint is created?

No. I don't see how that could be done. The scheduler is always going to observe the taint creation with some delay.

IIUC, currently the SoT for the NoSchedule case is in the scheduler, but for NoExecute it is in the Device Driver, which triggers eviction.

Sorry, I'm not following. The SoT for a taint is in the object which defines the taint (DeviceTaintRule or ResourceSlice). Different consumers (scheduler, controller) react to that once they observe it. There's no difference regarding that between NoSchedule and NoExecute.

Member

IIUC NoSchedule is enforced in scheduler, but NoExecute in scheduler, but eventually pods are evicted by the taint controller (not the Driver as said).

I wonder if the Driver should/could block new allocations for RCs which do not have the necessary tolerations and reject them as well (assuming it can reject allocation, but I'm not sure). If we don't have this mechanism, then the Device Driver cannot guarantee that new pods don't appear once it has published pods to evict.

Contributor Author

NoSchedule is enforced in scheduler, but NoExecute in scheduler

Did you mean "NoSchedule is enforced in scheduler, but NoExecute in controller"?

The scheduler also enforces NoExecute by treating it like NoSchedule. It just doesn't evict, leaving that to the controller.

I wonder if the Driver should/could block new allocations for RCs which does not have necessary tolerations and reject them as well (assuming it can reject allocation, but I'm not sure).

It cannot. The driver is not consulted by the scheduler when allocating ResourceClaims.

// eviction and/or statistics (number of matching devices,
// affected allocated claims, pods remaining to be evicted,
// etc.).
// Status provides information about what was requested in the spec.
Member

Suggested change
// Status provides information about what was requested in the spec.
// Status provides information about the effect of what was requested in the spec.

Contributor Author

Added.

const DeviceTaintRuleStatusMaxConditions = 8
```

The condition is meant to be used to check for completion of a triggered eviction with:
Member

Since there is a race with the scheduler, it may be possible that the initial set of conditions does not list any pod to evict. In that case kubectl wait may complete, but some pods may appear later and become subject to eviction without being tracked by the conditions.

Contributor Author

True. We can make it very unlikely, though, by delaying setting the condition to False for a short period after the DeviceTaintRule was created (5 to 10 seconds?). That should be long enough to propagate the new DeviceTaintRule to a running kube-scheduler. After a restart, the kube-scheduler first waits for cache sync before scheduling pods.

become subject to eviction without being tracked by the conditions

As discussed elsewhere with @helayoty, this condition can go from "False" to "True". The controller must reconcile to catch this.
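A minimal sketch of that delayed-Idle idea, assuming a hypothetical helper and a 10-second grace period; the actual value and wiring are open questions:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical grace period before the EvictionInProgress condition may go to
// False; the discussion above suggests something in the 5-10 second range.
const evictionIdleGracePeriod = 10 * time.Second

// evictionInProgress keeps reporting "in progress" while pods are pending and,
// as a mitigation for the scheduler race, also while the rule is so new that a
// running scheduler may not have observed the taint yet.
func evictionInProgress(ruleCreated time.Time, podsPending int, now time.Time) bool {
	if podsPending > 0 {
		return true
	}
	return now.Sub(ruleCreated) < evictionIdleGracePeriod
}

func main() {
	created := time.Now()
	fmt.Println(evictionInProgress(created, 0, created.Add(2*time.Second)))  // true: too early to report Idle
	fmt.Println(evictionInProgress(created, 0, created.Add(30*time.Second))) // false: safe to report Idle
}
```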

Member

So I guess there is nothing better we can do than add this best-effort delay to mitigate the race. I think we can discuss it again before beta promotion. I bet that long term we need some way to address this sort of race, especially since the same problem exists between the scheduler and the kubelet, so some unification would be welcome.

@pohly pohly force-pushed the dra-device-taints-1.35 branch from 315c45f to c59c3b0 on October 15, 2025 11:39
@dom4ha
Member

dom4ha commented Oct 15, 2025

LGTM, leaving approval on @macsko

@pohly pohly force-pushed the dra-device-taints-1.35 branch from c59c3b0 to e2a20a4 on October 15, 2025 13:54
@macsko
Member

macsko commented Oct 15, 2025

/lgtm
/approve
SIG Scheduling

@k8s-ci-robot k8s-ci-robot added the lgtm label on Oct 15, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: macsko, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved label on Oct 15, 2025
@k8s-ci-robot k8s-ci-robot merged commit ab8b0cd into kubernetes:master Oct 15, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.35 milestone Oct 15, 2025
@github-project-automation github-project-automation Bot moved this from In Progress to Done in SIG Scheduling Oct 15, 2025
pohly added a commit to pohly/kubernetes that referenced this pull request Oct 21, 2025
pohly added a commit to pohly/kubernetes that referenced this pull request Oct 22, 2025
pohly added a commit to pohly/kubernetes that referenced this pull request Oct 23, 2025
pohly added a commit to pohly/kubernetes that referenced this pull request Nov 3, 2025

Labels

  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory
  • lgtm "Looks good to me", indicates that a PR is ready to be merged.
  • sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.
  • size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: API review completed, 1.35
Archived in project

Development

Successfully merging this pull request may close these issues.