KEP 5055: DRA Device Taints: update for 1.35 (#5512)
k8s-ci-robot merged 3 commits into kubernetes:master from
Conversation

@pohly: GitHub didn't allow me to request PR reviews from the following users: eero-t. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
> Once eviction starts, it happens at a low enough rate that admins have a chance
> to delete the DeviceTaintRule before all pods are evicted if they made a
> mistake after all. This rate is configurable to enable faster eviction, if
Is `effect: None` + `kubectl describe` by itself an adequate safeguard against erroneous taints? Or are there different kinds of mistakes that only rate limiting eviction would mitigate? In general I like the idea of `effect: None` essentially being a dry run instead of trying to determine in real time whether a taint is working while it is or is not actively evicting pods. Wondering if that's a worthwhile way we could narrow the scope here.
I don't think the rate limiting would uncover anything that `kubectl describe` may have missed, except perhaps a race (one pod shown as to be evicted, a large number of additional such pods created, then turning on eviction and deleting all of them).
But primarily it is that kubectl describe is optional, some admin might forget to double-check. Then the rate limiting may help as a second line of defense.
```go
// some Pods and then gets restarted before updating this field.
//
// +optional
PodsEvicted *int64
```
I'm undecided about this field. As it stands, this is a lower bound: "at least this number of pods were evicted, but maybe also more". 99% of the time it'll be accurate, but guaranteeing that would imply evicting a pod and updating this field in a single atomic transaction. If the kube-controller-manager dies and restarts, it cannot figure out whether all evicted pods were accounted for.
When using conditions, determining the old count becomes icky (need to parse string on startup, which may not always work because IMHO we shouldn't make the string format part of the official API).
Is the idea with the PodsEvicted field that it will monotonically increase forever? Would it make more sense to expose that as a metric tied to the lifetime of a kube-controller-manager instance? Then Prometheus would make it easier to add up the total evictions for a given DeviceTaintRule among all past and present kube-controller-manager pods without Kubernetes having to handle that itself.
I think the API is still a good place to at least keep some sign of "in progress"/"done" but maybe doesn't need to detail the exact progress. For that I think a condition makes sense.
Is the idea with the PodsEvicted field that it will monotonically increase forever?
Yes, but per DeviceTaintRule. Basically some feedback to the admin that "yes, this did something".
Would it make more sense to expose that as a metric tied to the lifetime of a kube-controller-manager instance? Then Prometheus would make it easier to add up the total evictions for a given DeviceTaintRule among all past and present kube-controller-manager pods without Kubernetes having to handle that itself.
Such a metric makes sense for "overall work done". But it cannot be per DeviceTaintRule because using e.g. the DeviceTaintRule UID as dimension would lead to unlimited growth of that metric.
I think the API is still a good place to at least keep some sign of "in progress"/"done" but maybe doesn't need to detail the exact progress. For that I think a condition makes sense.
So you think we can drop PodsEvicted and also not represent it in the condition description?
Such a metric makes sense for "overall work done".
Actually, it already exists: "Total number of Pods deleted by DeviceTaintEvictionController since its start."
I think as part of condition description it can be included like this "10 pods are pending eviction, 100 evicted already since starting the controller". That makes it clear that the count resets when restarting the controller and thus avoids potential confusion when the number is too low.
Resolved by switching to conditions. The exact wording of the description can be decided during the implementation phase.
```go
}

// DeviceTaintRuleStatus provides information about an on-going pod eviction.
type DeviceTaintRuleStatus struct {
```
While implementing the type updates I was reminded that bumping the spec increments the Generation. This was meant as preparation for conditions, which can record the observed generation to document what they are based on.
But now these new status fields aren't using conditions. Should they? There's no field to store data like "pods pending eviction" in machine-readable form in https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#Condition. The numbers could be part of the message field:
```go
Condition{
	Type:               "InProgress",
	Status:             ConditionTrue,
	ObservedGeneration: rule.spec.Generation,
	LastTransitionTime: time.Now,
	Reason:             "PodsPendingEviction",
	Message:            fmt.Sprintf("%d pods are pending eviction, %d evicted already", numPodsPending, numPodsEvicted),
}
```
This becomes `Status: ConditionFalse` when there are no pods which need to be evicted.
The advantage of conditions is that `kubectl wait --for=condition=InProgress=false DeviceTaintRule/my-taint-rule` can be used, which is nicer than `kubectl wait --for=jsonpath='{.status.podsPendingEviction}'=0`.
I've switched to `Conditions []metav1.Condition`.
> While implementing the type updates I was reminded that bumping the spec increments the Generation.

It can (and should), but must be specifically done in the strategy `PrepareForUpdate` method.

> I've switched to `Conditions []metav1.Condition`.

+1 for using conditions
> It can (and should), but must be specifically done in the strategy `PrepareForUpdate` method.

I wasn't clear: I meant to point out that I already implemented it like that in the initial implementation in 1.33 😁
I agree that managing Generation should be considered best practice for any type with a spec.
/assign @liggitt

For comments on the proposed API.
```diff
 ### API

-The ResourceSlice content gets extended:
+The ResourceSlice content gets extended. A natural place for a taint that
```
Since slices can now be “devices” or “taints”, can we call out:
- How the scheduler/DRA controllers index taint-slices vs device-slices.
- Cache invalidation and event fan-out when taint-slices update, to avoid O(N×M) reprocessing.
Short answer: the ResourceSlice tracker keeps track of that state and tries to minimize the changes that the consumers (scheduler and eviction controller) see.
The long answer is a bit more involved and I'd like to punt on that now while still in alpha. It needs to be added for beta when discussing scalability.
> check its policy configuration and decide to take devices offline by creating
> a DeviceTaintRule with a taint for affected devices.
>
> #### Safe Pod Eviction
How do we handle the races between scheduling and post-allocation taints? For example:
- Taint appears after scheduling but before kubelet admission.
- Taint appears while pod is running.
Is that question related to this user story?
Taint appears after scheduling but before kubelet admission.
If the effect is NoSchedule, then nothing happens. The pod is allowed to run.
If the effect is NoExecute, then the pod gets evicted.
Taint appears while pod is running.
Same here.
What I meant is that it would be nice to mention in the Safe Pod Eviction section that:

- `NoSchedule` only affects scheduling decisions. Once a pod is bound, adding the taint later has no impact.
- `NoExecute` applies retroactively. If the taint appears either right after scheduling but before kubelet admission or while the pod is already running, the pod is subject to eviction.
That way readers don’t misinterpret this as a race with undefined behavior.
IMHO NoSchedule does not belong in this section because the use case is about eviction.
I've extended the story with additional comments about that race.
I'm intentionally not using the NoExecute constant here because at this point I don't want to educate readers about the API details.
```go
//
// The following condition is currently defined as part of this API, more may
// get added:
// - Type: EvictionInProgress
```
I would add a Ready or Completed condition state.
Can you describe what the semantic of those states would be?
"Completed" could replace "EvictionInProgress", but it could be a bit misleading: it might be "Completed" now, but in the unlikely case that some pod pops up which is scheduled (for example, the scheduler missed the taint or a user manually scheduled a pod), then it's possible to go back from "Completed: True" to "Completed: False". It's not really a final state.
I agree. How about this?
- EvictionInProgress=True → pods are currently being evicted.
- EvictionInProgress=False, Reason=Idle → no pods are pending eviction at the moment.
I updated the comment in the API docs and added reason: Idle in the explanations beneath it.
> Reasons are not specified as part of the API because it's not required by the
> use cases and would restrict future changes unnecessarily.
>
> ### Test Plan
Can we add a test plan for the case where multiple rules hit the same device set with different rates?
We may need to add another test case to validate rate limiting and controller back-pressure.
The original idea was that avoiding repeated conversion of ResourceSlices to the internal type by doing it in an informer would bring some performance advantage. Because the DRA scheduler_perf benchmarks didn't clearly distinguish between "setup phase" and "measurement phase", some of the processing done by the informer affected the measurements and we got confusing benchmark results in kubernetes#131168. This has been fixed together with some other issues (kubernetes#133941, kubernetes#134010) and now the results make more sense. However, although the conversion shows up in CPU profiles, it's not slow enough to have much impact.

The main reason for this change is that it enables extending the internal type with fields that get set by the ResourceSlice tracker (staging/src/k8s.io/dynamic-resource-allocation/resourceslice/tracker). Those additional fields are needed for the revised device taint handling (kubernetes/enhancements#5512) because that change will remove the device taints that are embedded inside the Device struct. Without that, there would be no place where the tracker can report all taints that apply to a device; a separate data structure and method of communicating when it changes would be needed.

With this commit, the dynamicresources plugin and allocator only use the internal type and get it either from the informer (this commit) or the ResourceSlice tracker. Shifting the conversion may also enable other future optimizations because internalizing strings amortizes itself better compared to doing it repeatedly, but this has not been explored.
```go
// The default is 10 Pods/s.
//
// +optional
EvictionsPerSecond *int64
```
I'm still finding this pretty weird to reason about. I think this is the max evictions/second this taint would drive, but it's pretty best-effort / there's no guarantee pods would actually be evicted at this rate, right?
- The eviction controller sync loop might be busy with other things
- The eviction controller API client might be rate-limited client-side or server-side
- The eviction controller API client might not be able to keep up with evictions
- The attempts to evict pods could be rejected due to PDB (does the controller use the /evict subresource that respects PDB?)
If we keep this here, the docs need to make it really clear that this is not a strong guarantee pods will be deleted this fast
Yes, it would be "max rate with best effort".
does the controller use the /evict subresource that respects PDB?
No. I copied the node eviction controller, which just deletes the pod. I don't have a strong opinion about what's best, but following that established pattern might be less surprising because it's consistent.
Maybe two separate effects? One to "forcibly evict" and one to "evict respecting policy (PDB)"?
With rate limiting gone there is no need for this field.
Correct, but I think the idea of an effect that honors PDBs vs an effect that does not is a separate discussion.
True. I'll keep this open for the PDB aspect. Like you I prefer to treat support for PDB as a future extension. It raises several questions that are similar to the rate limit (who gets to decide whether it should be used or not) and the API isn't obvious (new effect or a separate field).
But let's prepare for it now and explicitly allow adding more effects:
```go
// The effect of the taint on claims that do not tolerate the taint
// and through such claims on the pods using them.
//
// Valid effects are None, NoSchedule and NoExecute. PreferNoSchedule as used for
// nodes is not valid here. More effects may get added in the future.
// Consumers must treat unknown effects like None.
//
// +required
Effect DeviceTaintEffect
// ^^^^
//
// Implementing PreferNoSchedule would depend on a scoring solution for DRA.
// It might get added as part of that.
//
// A possible future new effect is NoExecuteWithPodDisruptionBudget:
// honor the pod disruption budget instead of simply deleting pods.
// This is currently undecided, it could also be a separate field.
//
// Validation must be prepared to allow unknown enums in stored objects,
// which will enable adding new enums within a single release without
// ratcheting.
```
A discussion about use cases and usage led to the conclusion that rate limiting and arbitrary health data are not needed. The idea of reducing the number of devices when using new fields addresses the size issue. The original, more complex proposal is kept in the initial commit to record it for future reference.
johnbelamaric left a comment:
LGTM from the DRA design perspective.
There is one open question about whether the eviction should honor PDB or not, and whether there should be separate effects for those options. IMO this is non-blocking, because I think that the current way of working the same as node taints is fine and makes sense. If we want one that honors PDB (and some other policies?), we can add it as a separate feature later.
```go
//
// Validation must be prepared to allow unknown enums in stored objects,
// which will enable adding new enums within a single release without
// ratcheting.
```
Define how clients (schedulers, evictors, etc) should treat unrecognized taint effects (safest option is probably to treat them as None). If we contemplate adding more effects in the future, we have to ensure the oldest clients still present which do things with taint effects handle unrecognized effects safely. A 1.x API server can have a 1.(x-1) scheduler or kube-controller-manager running against it.
Define how clients (schedulers, evictors, etc) should treat unrecognized taint effects (safest option is probably to treat them as None).
I already did above in the API docs ("Consumers must treat unknown effects like None."). There it is visible to anyone who may want to react to the API.
This note here is more geared towards "how do we implement this type" and thus not part of the official API.
liggitt left a comment:

lgtm for API updates, just a couple doc suggestions
> fine (newer drivers work on an older Kubernetes).
>
> `effect: None` is only valid for Kubernetes >= 1.35. DeviceTaintRules
> and ResourceSliced with that value must be removed before a downgrade.

```suggestion
and ResourceSlices with that value must be removed before a downgrade.
```
> fine (newer drivers work on an older Kubernetes).
>
> `effect: None` is only valid for Kubernetes >= 1.35. DeviceTaintRules
> and ResourceSliced with that value must be removed before a downgrade.
>
> A taint is not meant to be used as an access control mechanism. Users are
> allowed to ignore taints (at their own risk). Adding a taint in a live cluster
> is inherently racy because it first needs to be observed by e.g. the scheduler.
Is there a plan to close that race window at some point by blocking new allocations while a taint is created?
IIUC, currently the SoT for the NoSchedule case is in the scheduler, but for NoExecute it's in the device driver, which triggers eviction.
Theoretically, if a NoSchedule taint is created, the DRA driver could reject binding requests for pods which do not have a matching toleration, so in fact it could have an immediate effect.
Is there a plan to close that race window at some point by blocking new allocations while a taint is created?
No. I don't see how that could be done. The scheduler is always going to observe the taint creation with some delay.
IIUC currently the SoT for NoSchedule case is in scheduler, but NoExecute in Device Driver, which triggers eviction.
Sorry, I'm not following. The SoT for a taint is in the object which defines the taint (DeviceTaintRule or ResourceSlice). Different consumers (scheduler, controller) react to that once they observe it. There's no difference regarding that between NoSchedule and NoExecute.
IIUC NoSchedule is enforced in scheduler, but NoExecute in scheduler, but eventually pods are evicted by the taint controller (not the Driver as said).
I wonder if the Driver should/could block new allocations for RCs which does not have necessary tolerations and reject them as well (assuming it can reject allocation, but I'm not sure). If we don't have this mechanism, then Device Driver cannot guarantee that new pods don't appear once it published pods to evict.
NoSchedule is enforced in scheduler, but NoExecute in scheduler
Did you mean "NoSchedule is enforced in scheduler, but NoExecute in controller"?
The scheduler also enforces NoExecute by treating it like NoSchedule. It just doesn't evict, leaving that to the controller.
I wonder if the Driver should/could block new allocations for RCs which does not have necessary tolerations and reject them as well (assuming it can reject allocation, but I'm not sure).
It cannot. The driver is not consulted by the scheduler when allocating ResourceClaims.
```go
// eviction and/or statistics (number of matching devices,
// affected allocated claims, pods remaining to be evicted,
// etc.).
// Status provides information about what was requested in the spec.
```

```suggestion
// Status provides information about the effect of what was requested in the spec.
```
```go
const DeviceTaintRuleStatusMaxConditions = 8
```

> The condition is meant to be used to check for completion of a triggered eviction with:
Since there is a race with the scheduler, it may be possible that the initial set of conditions does not list any pod to evict. In that case `kubectl wait` may complete, but some pods may appear later and become subject to eviction without being tracked with conditions.
True. We can make it very unlikely, though, by delaying setting the condition to False until shortly after the DeviceTaintRule was created (5 to 10 seconds?). That should be long enough to propagate the new DeviceTaintRule to a running kube-scheduler. After a restart, the kube-scheduler first waits for cache sync before scheduling pods.
will become a subject for eviction not tracked with conditions
As discussed elsewhere with @helayoty, this condition can go from "False" to "True". The controller must reconcile to catch this.
So I guess there is nothing better we can do than adding this best-effort delay to mitigate the race. I think we can discuss it again before beta promotion. I bet long term we need some way to address this sort of race, especially since the same problem exists between scheduler and kubelet, so some unification would be welcome.
LGTM, leaving approval on @macsko
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: macsko, pohly

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
One-line PR description: update for 1.35

Issue link: DRA: device taints and tolerations #5055

Other comments: This now covers publishing purely informational taints (`Effect: None`), rate limited eviction, and improved user experience (DeviceTaintRule status, `kubectl describe DeviceTaintRule`). The "informational taint" is a potential alternative to #5469. The one-of ResourceSlice is relevant for #5234.

/cc @nojnhuh @byako @eero-t @mortent