DRA: device taints: new ResourceSlice API, new features #134152
k8s-ci-robot merged 11 commits into kubernetes:master from
Conversation
When the ResourceSlice no longer exists, the ResourceSlice tracker didn't and couldn't report the tainted devices, even if they are allocated and in use. The controller must keep track of DeviceTaintRules itself and handle this scenario. In this scenario it is impossible to evaluate CEL expressions because the necessary device attributes aren't available. We could:
- Copy them into the allocation result: too large, big change.
- Limit usage of CEL expressions to rules with no eviction: inconsistent.
- Remove the fields which cannot be supported well.

The last option was chosen. The tracker is now no longer needed by the eviction controller. Reading directly from the informer means that we cannot assume that pointers are consistent. We have to track ResourceSlices by their name, not their pointer.
While at it, ensure that future unknown effects are treated like the None effect.
We know how often the controller should get a pod, so let's check it. This must run before we do our own GET call.
The approach copied from node taint eviction was to fire off one goroutine per pod at the intended time. This leads to the "thundering herd" problem: when a single taint causes eviction of several pods and those all have no toleration grace period, or the same one, then they all get deleted concurrently at the same time. For node taint eviction that is limited by the number of pods per node, which is typically ~100. In an integration test, that already led to problems with watchers:

    cacher.go:855] cacher (pods): 100 objects queued in incoming channel.
    cache_watcher.go:203] Forcing pods watcher close due to unresponsiveness: key: "/pods/", labels: "", fields: "". len(c.input) = 10, len(c.result) = 10, graceful = false

It also causes spikes in memory consumption (mostly the 2 KB stack per goroutine plus closure) with no upper limit.

Using a workqueue makes concurrency more deterministic because there is an upper limit. In the integration test, 10 workers kept the watch active. Another advantage is that failures to evict a pod get retried with exponential backoff per affected pod, forever. Previously, eviction was tried a few times at a fixed rate and then the controller gave up; if the apiserver was down long enough, pods didn't get evicted.
/assign @soltysh (for kube-controller-manager)
liggitt left a comment
API / validation / strategy changes lgtm
would like a final ack from a DV reviewer on the annotations there, and can squash as you think makes sense before merge
/assign @aaron-prindle

Any thoughts on declarative validation tags for these API changes? See also #134152 (comment) (helpfully hidden by GitHub, but not resolved yet).
    //
    // +optional
    // +listType=atomic
    // +featureGate=DRADeviceTaints
For the updated max # of taints validation, could use the +k8s:maxItems DV tag here:

    // +k8s:maxItems=16
    Taints []DeviceTaint `json:"taints,omitempty" protobuf:"bytes,8,rep,name=taints"`
Shall I give this a try in this PR? Can this be a follow-up, or should this be a follow-up?
This can be a follow-up 👍. Perhaps we can track the known DV tags we want to add in an issue to make it easier for folks to follow up. I've created this issue where we can put the known follow-ups:
Thanks! I'll leave it in your capable hands then because I already have mine full with other things 😅
LGTM w.r.t. declarative validation tags. NOTE: no DV tags were added in this PR; they are tracked as follow-ups in the umbrella tracking issue:
great, API bits are approved

LGTM label has been added. Git tree hash: a9059764f4af85fd67a75e4a6893b99675f0c532
    // ValidatingAdmissionPolicyStatusController related features.
    ValidatingAdmissionPolicyStatusController validatingadmissionpolicystatusconfig.ValidatingAdmissionPolicyStatusControllerConfiguration
    // DeviceTaintEvictionControllerConfiguration contains elements configuring the device taint eviction controller.
    DeviceTaintEvictionController devicetaintevictionconfig.DeviceTaintEvictionControllerConfiguration
Nit: the contents of this struct are kept in alphabetical order, so it would be nice to move this up after DeploymentController. Also, while at it, would you mind moving StatefulSetController and CronJobController to their appropriate places? These two seem to be out of order 😅
Changed (all of them).
Dims added a "sorted" linter for feature gate definitions. It was almost generic enough to be applied to arbitrary fields, enums, etc. but not quite yet - might be worth picking up again...
Thank you!
Dims added a "sorted" linter for feature gate definitions. It was almost generic enough to be applied to arbitrary fields, enums, etc. but not quite yet - might be worth picking up again...
Hmmm... that could be useful. Do I see correctly that it's #132668?
I'll check it out, but probably after the freeze 😉
    // ruleStatusPeriod is the shortest time between DeviceTaintRule status
    // updates while eviction is in progress. Once it is done, it no longer gets
    // updated until in progress again.
    ruleStatusPeriod = 10 * time.Second
Conservative is the usual approach we go with. Your initial approach of not exposing any configs is a good start, expanding later based on tests with the number of workers. If you wonder about the usefulness of additional configs, that means you don't need them, at least not yet, so I'd not expose anything.
I'm aware of all the gotchas you described above, but those values are rarely configured and we should treat exposing them with utmost caution.
    // that a watch in the integration test couldn't keep up:
    // cacher.go:855] cacher (pods): 100 objects queued in incoming channel.
    // cache_watcher.go:203] Forcing pods watcher close due to unresponsiveness: key: "/pods/", labels: "", fields: "". len(c.input) = 10, len(c.result) = 10, graceful = false
    obj.ConcurrentSyncs = 10
Is this number 10 coming from any particular tests? I see above you're explaining why 100 was too much, and I think I wouldn't be comfortable with such a big default. But most of our controllers oscillate around 5, with some outliers like GC with 20, namespace with 10, and evictions with 8.
    }

    // trackedTaint augments a DeviceTaint with a pointer to its origin.
    // rule and slice are mutually exclusive. At least one is set.
If they are mutually exclusive, why "At least one is set"? Based on the code it seems rule will always take priority.
Replaced with "mutually exclusive. Exactly one of them is always set."
I wanted to make it clear that if rule == nil, then slice must be non-empty.
    }

    podDeletionLatency := time.Since(eviction.when.Time)
    logger.Info("Evicted pod by deleting it", "latency", podDeletionLatency, "reason", eviction.reason)
Most of the controller logs are at V(1) or V(2) ideally.
Copied from node eviction... Changed to V(2).
    tc.handleRuleChange(rule, ruleEvict)
    numPendingPods, numPendingNamespaces, err = tc.countPendingPods(rule)
    numTaintedSliceDevices, numTaintedAllocatedDevices = tc.countTaintedDevices(rule)
Why do this, rather than just invoking the necessary logic? It seems somewhat counterintuitive 🤔
The logic is implemented such that it depends on the state of the controller. To compute stats for the "what if" scenario, I first have to set up a controller that is in that state. I was experimenting with temporarily changing the state, then reverting back, but that turned out to be more complicated and expensive.
Counting tainted devices is something that is only necessary for the dry-run message for effect: None.
That's a fair approach. Still, I'd consider eventually refactoring that into a re-usable bit, rather than instantiating a whole controller.
Basically a simpler struct which just has the relevant state and update methods. I had considered that, but it didn't feel worth it. I might give it another try.
Yeah, exactly that. I'm worried that the footprint (both cpu and mem) for duplicated controller might be significant at scale. Although I'm pretty sure you've done your fair share of testing 😅
    rules, err = tc.ruleLister.List(labels.Everything())
    // TODO: instead of listing and handling an error, keep track of rules in the informer event handler?
    if err != nil {
        panic(err)
None of our controllers panic, and I don't think this one should either. Maybe log the error and return nil, or something like that?
This is a local lister, so this should never happen. A panic for something that shouldn't happen seemed appropriate, but as it won't happen I can also switch to logging the error. Done.
soltysh left a comment
minor nits
/approve
the controller changes
    config := new(kubectrlmgrconfigv1alpha1.DeviceTaintEvictionControllerConfiguration)
    RecommendedDefaultDeviceTaintEvictionControllerConfiguration(config)
    if config.ConcurrentSyncs != 8 {
        t.Errorf("incorrect default value, expected 10 but got %v", config.ConcurrentSyncs)
Nit: the message says 10 still
It might never be necessary to change the default, but it is hard to be sure. It's better to have the option, just in case.
To update the right statuses, the controller must collect more information about why a pod is being evicted. Updating the DeviceTaintRule statuses is then handled by the same work queue as evicting pods. Both operations already share the same client instance and thus QPS + server-side throttling, so they might as well share the same work queue. Deleting pods is not necessarily more important than informing users, or vice versa, so there is no strong argument for having different queues.

While at it, the unit tests are switched to using the same mock work queue as in staging/src/k8s.io/dynamic-resource-allocation/internal/workqueue. Because there is no time to add it properly to a staging repo, the implementation gets copied.
This covers permissions and correct usage of server-side apply (SSA).
/approve

LGTM label has been added. Git tree hash: d5e30bcb47b3d439cfe03a81a0eef75934f5f214
[APPROVAL NOTIFIER] This PR is APPROVED. This pull request has been approved by: liggitt, nojnhuh, pohly, soltysh.
/retest
@pohly: The following tests failed.
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
This implements the updated functionality as described in kubernetes/enhancements#5512
In addition, fields in DeviceTaintRule which cannot be supported properly without more fundamental changes (DeviceClass, CEL selectors) get removed. The use case for them is not strong enough to justify additional complexity.
Which issue(s) this PR is related to:
KEP: kubernetes/enhancements#5055
Special notes for your reviewer:
Does this PR introduce a user-facing change?