[KEP-5440]: MutablePodResourcesForSuspendedJobs #132441
k8s-ci-robot merged 1 commit into kubernetes:master
Conversation
lmktfy left a comment
Some of this validation could go into a highly recommended ValidatingAdmissionPolicy.
For example: we could say that shrinking limits and requests for a template is always OK, but restrict resizes that enlarge them.
💭 What would the authz look like for template changes?
Maybe. But right now we are forbidding any resizing. So we could relax it in-tree and then leave it to people to implement their own VAP? But that could lead to nobody implementing this, and then it's free rein. So I'm not sure what is best here.
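As a concrete illustration of the shrink-versus-grow idea, here is a minimal Go sketch of the comparison such a policy would encode; an actual ValidatingAdmissionPolicy would express this as a CEL expression over oldObject and object, and the isShrinkOnly helper below is invented for this sketch:

```go
// Hypothetical sketch only: isShrinkOnly and this wiring are invented here
// to illustrate the policy; a real VAP would express this in CEL.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// isShrinkOnly reports whether every quantity in newList is less than or
// equal to its counterpart in oldList; names absent from oldList are allowed.
func isShrinkOnly(oldList, newList v1.ResourceList) bool {
	for name, newQty := range newList {
		if oldQty, ok := oldList[name]; ok && newQty.Cmp(oldQty) > 0 {
			return false // the update enlarges this resource
		}
	}
	return true
}

func main() {
	oldList := v1.ResourceList{v1.ResourceCPU: resource.MustParse("500m")}
	smaller := v1.ResourceList{v1.ResourceCPU: resource.MustParse("200m")}
	larger := v1.ResourceList{v1.ResourceCPU: resource.MustParse("1")}
	fmt.Println(isShrinkOnly(oldList, smaller)) // true: shrinking allowed
	fmt.Println(isShrinkOnly(oldList, larger))  // false: enlarging restricted
}
```

One possible answer to the authz question above: a VAP of this shape needs no extra authorization; it only narrows which updates are admitted.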
Force-pushed from ffc331b to add236c
Force-pushed from add236c to f49bc5f
Force-pushed from 60c0bfa to 5c1f00c
/retest
Force-pushed from c1d22c4 to 0e439db
/hold Gotta address this.
deads2k left a comment
I could use some clarifications.
if utilfeature.DefaultFeatureGate.Enabled(features.MutablePodResourcesForSuspendedJobs) {
	mutablePodTemplate := suspended && batchvalidation.IsConditionTrue(oldJob.Status.Conditions, batch.JobSuspended) && oldJob.Status.Active == 0
	opts.AllowMutableSchedulingDirectives = mutablePodTemplate
This change is less obvious to me. Why does this existing option sometimes become true? Actually, it's more than that: this change takes us from AllowMutableSchedulingDirectives = suspended && notStarted to ignoring whether there is a start time, so a job that has started but has no active pod can change. Why did this validation change as part of the pod resources change?
I believe this comes from #134980 (comment).
The KEP update (done today) was kubernetes/enhancements#5678.
> Why does this existing option sometimes become true?
Yeah, currently the precondition logic is inside strategy.go. We could also set the option based only on the feature gate and then check all preconditions inside validation.go. I think the only justification for the current placement is keeping the code next to the pre-existing FG that follows this pattern, i.e. AllowMutableSchedulingDirectives. +1 to moving it to validation.go - from my PoV either in this PR or a small follow-up.
> Why did this validation change as part of the pod resources change?
The motivation is to align the handling with AllowMutableSchedulingDirectives. Over time it turned out that the check on status.startTime is not good, because:
- It does not allow changing the PodTemplate while the job is suspended, which is essential for Kueue, and thus we do this trick of clearing startTime ourselves. It allows changes only if the job never started, which is too restrictive.
- It does not really guarantee that all Pods are deleted (have a deletionTimestamp), and thus in Kueue we need to defer the stop until status.Active == 0; we do it here.
- There is not much conceptual difference between scheduling directives, like nodeSelectors, and Pod resources - both influence scheduling decisions, so it seems reasonable to align the condition for mutability (see the sketch below).
I think it is reasonable to align the mutability condition in this KEP; otherwise it would require the overhead of yet another KEP just to align.
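To make the bullets above concrete, here is a small sketch contrasting the old startTime-based rule with the aligned precondition; the names are invented stand-ins for the real strategy/validation logic:

```go
// Sketch with invented names: jobView, oldMutable, and alignedMutable are
// illustrative stand-ins, not the actual Job strategy code.
package main

import "fmt"

type jobView struct {
	suspendRequested   bool  // spec.suspend
	startTimeSet       bool  // status.startTime != nil
	suspendedCondition bool  // JobSuspended condition is True
	active             int32 // status.active
}

// Old rule: mutable only while suspended and never started.
func oldMutable(j jobView) bool {
	return j.suspendRequested && !j.startTimeSet
}

// Aligned rule: mutable once suspension is requested, acknowledged by the
// controller via the JobSuspended condition, and no pods remain active.
func alignedMutable(j jobView) bool {
	return j.suspendRequested && j.suspendedCondition && j.active == 0
}

func main() {
	// A job that ran, was suspended, and fully quiesced: the old rule still
	// forbids template changes (startTime is set); the aligned rule allows them.
	j := jobView{suspendRequested: true, startTimeSet: true, suspendedCondition: true}
	fmt.Println(oldMutable(j), alignedMutable(j)) // false true
}
```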
> I think it is reasonable to align the mutability condition in this KEP, otherwise it would require the overhead of yet another KEP just to align.
I don't see a need for another KEP, but another feature gate with a different graduation lifecycle makes sense. We can update our job controller to manipulate startTime in this way, but we have no practical way of doing that for other controllers, and certainly not in version-skew cases. This means that we cannot simply change the validation under an unrelated feature gate and promote that feature gate in a normal cycle.
I see, so can we just have another feature gate under this KEP, like AdjustMutableSchedulingDirectives?
Our controller might make a change (a weird feature gate to wire it to, btw), but other job controllers may not - certainly not across a version skew. I can see the justification for expanding the validation to allow oldCondition || newCondition, but this tightening of validation needs a separate gate and a different cycle for deciding when to force it.
I can appreciate that it is a gap we'd like to close, but suddenly requiring new validation to use unrelated features is untenable.
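A toy sketch of the distinction being drawn here, with invented names: pure expansion can accept oldCondition || newCondition so nothing previously valid starts failing, whereas tightening has to sit behind its own gate with its own promotion cycle:

```go
// Illustrative only: allowMutation and its parameters are invented to show
// the relax-vs-tighten asymmetry, not actual kube-apiserver code.
package main

import "fmt"

// allowMutation models the two regimes: without the (separate) tightening
// gate, the union of old and new conditions is accepted, so no update that
// used to be valid starts failing; with the gate on, only the new rule holds.
func allowMutation(oldRuleOK, newRuleOK, tighteningGateEnabled bool) bool {
	if !tighteningGateEnabled {
		return oldRuleOK || newRuleOK // pure expansion: safe in a normal cycle
	}
	return newRuleOK // tightening: needs its own gate and rollout cycle
}

func main() {
	fmt.Println(allowMutation(true, false, false)) // true: old behaviour preserved
	fmt.Println(allowMutation(true, false, true))  // false: enforced only once gated
}
```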
allErrs := field.ErrorList{}

// Create a deep copy of the old pod spec to modify
oldPodCopy := oldPod.DeepCopy()
Non-blocking remark, but this might be heavy, and we already take the entire template above: https://github.com/kubernetes/kubernetes/pull/132441/files#diff-67fdcc10f97b3bbec1288558a1575efd01474516066db5332a141592ee5dd4e4R642
So we could optimize by reusing the same copy if, early on, we do:
if opts.AllowMutableSchedulingDirectives || opts.AllowMutablePodResources {
	oldTemplate = oldSpec.Template.DeepCopy() // +k8s:verify-mutation:reason=clone
}
This comment is non-blocking.
	oldTemplate.Labels = template.Labels // +k8s:verify-mutation:reason=clone
	oldTemplate.Spec.SchedulingGates = template.Spec.SchedulingGates // +k8s:verify-mutation:reason=clone
}
allErrs = append(allErrs, apivalidation.ValidateImmutableField(template, oldTemplate, fldPath.Child("template"))...)
I think we could still rely on this line if we just follow the tested pattern above of munging the oldTemplate. It would also save the unnecessary DeepCopy; see: https://github.com/kubernetes/kubernetes/pull/132441/files#r2489218558
This comment is non-blocking.
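For reference, a self-contained sketch of the munging pattern referred to above, using simplified stand-in types rather than the real batch/core structures: copy the old template, overwrite exactly the fields that are allowed to change, and let a single immutability check flag everything else:

```go
// Simplified stand-ins: PodTemplate and validateTemplateUpdate are invented
// here; the real code munges batch types and calls ValidateImmutableField.
package main

import (
	"fmt"
	"reflect"
)

type PodTemplate struct {
	Labels          map[string]string
	SchedulingGates []string
	CPURequest      string
}

func validateTemplateUpdate(newTpl, oldTpl PodTemplate, allowMutableResources bool) error {
	mungedOld := oldTpl // shallow copy suffices for this sketch; real code deep-copies
	// Always-mutable fields are copied from the new template before comparing.
	mungedOld.Labels = newTpl.Labels
	mungedOld.SchedulingGates = newTpl.SchedulingGates
	if allowMutableResources {
		// With the gate's preconditions met, resources become mutable too.
		mungedOld.CPURequest = newTpl.CPURequest
	}
	if !reflect.DeepEqual(newTpl, mungedOld) {
		return fmt.Errorf("spec.template: field is immutable")
	}
	return nil
}

func main() {
	oldTpl := PodTemplate{CPURequest: "100m"}
	newTpl := PodTemplate{CPURequest: "200m"}
	fmt.Println(validateTemplateUpdate(newTpl, oldTpl, true))  // <nil>: resize accepted
	fmt.Println(validateTemplateUpdate(newTpl, oldTpl, false)) // immutable error
}
```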
opts.AllowMutableSchedulingDirectives = suspended && notStarted

if utilfeature.DefaultFeatureGate.Enabled(features.MutablePodResourcesForSuspendedJobs) {
	opts.AllowMutablePodResources = batchvalidation.IsConditionTrue(oldJob.Status.Conditions, batch.JobSuspended) && oldJob.Status.Active == 0
Suggested change:
- opts.AllowMutablePodResources = batchvalidation.IsConditionTrue(oldJob.Status.Conditions, batch.JobSuspended) && oldJob.Status.Active == 0
+ opts.AllowMutablePodResources = suspended && batchvalidation.IsConditionTrue(oldJob.Status.Conditions, batch.JobSuspended) && oldJob.Status.Active == 0
I think it better captures the intention, and it also prevents a race condition where someone could mutate the Job while suspend=false but the JobSuspended condition is still there from the previous suspend.
Oh, I see you moved this condition here: https://github.com/kubernetes/kubernetes/pull/132441/files#diff-67fdcc10f97b3bbec1288558a1575efd01474516066db5332a141592ee5dd4e4
I think splitting these checks makes it hard to follow. Shouldn't we also move batchvalidation.IsConditionTrue(oldJob.Status.Conditions, batch.JobSuspended) && oldJob.Status.Active == 0 to validation.go?
Also, splitting the checks allows adding tests in validation_test.go that are detached from reality (bypassing the checks).
I think we could have opts.AllowMutablePodResources = utilfeature.DefaultFeatureGate.Enabled(features.MutablePodResourcesForSuspendedJobs)
I have a thread with David on this.
Validation checks spec and status as two separate function calls.
So in the code where I added validation, I don't have access to anything on the status field. The same applies to updates to status.
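A sketch of that split with simplified stand-in names: the strategy, which can see the old object's status, folds the preconditions into an options struct, and the spec validation consumes only the resulting booleans:

```go
// Illustrative stand-ins: these names mirror the shape of the real
// strategy/validation split but are not the actual Kubernetes code.
package main

import "fmt"

type JobStatus struct {
	Active             int32
	SuspendedCondition bool // JobSuspended condition is True
}

type JobValidationOptions struct {
	AllowMutablePodResources bool
}

// validationOptionsForJob runs in the strategy, where the old status is visible.
func validationOptionsForJob(suspendRequested bool, oldStatus JobStatus) JobValidationOptions {
	return JobValidationOptions{
		AllowMutablePodResources: suspendRequested &&
			oldStatus.SuspendedCondition && oldStatus.Active == 0,
	}
}

// validateJobSpecUpdate never touches status; it only sees the precomputed options.
func validateJobSpecUpdate(resourcesChanged bool, opts JobValidationOptions) error {
	if resourcesChanged && !opts.AllowMutablePodResources {
		return fmt.Errorf("spec.template: field is immutable")
	}
	return nil
}

func main() {
	opts := validationOptionsForJob(true, JobStatus{Active: 0, SuspendedCondition: true})
	fmt.Println(validateJobSpecUpdate(true, opts)) // <nil>: resize allowed while quiesced
}
```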
I left the commit in because I do find this harder to follow.
The structure highlights edge cases in the "spec cannot stop status updates" rule. We haven't had a peer "status cannot stop spec updates" rule, so it's an interesting dynamic. I'm glad to have the spec-related validation all sitting in validation, at least.
Please squash to something more reasonable before merge.
/approve
Force-pushed from cdf30cf to 37c2054
/hold cancel
The hold was put on some changes that moved into a separate PR. This is ready for final review.
LGTM label has been added. Git tree hash: e965cf172fe3b68ba106bedfc4548cbab1289d28
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deads2k, kannon92, soltysh
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
		{Type: field.ErrorTypeInvalid, Field: "spec.template"},
	},
},
"feature gate enabled - unsuspended job resource updates rejected": {
Please add a test case which shows that the status.Active == 0 check is used, so:
"feature gate enabled - job suspended, but with remaining active Pods resource updates rejected": {
	enableFeatureGate: true,
	job: &batch.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:            "myjob",
			Namespace:       metav1.NamespaceDefault,
			ResourceVersion: "0",
		},
		Spec: batch.JobSpec{
			Selector:       validSelector,
			Template:       validPodTemplateSpec,
			ManualSelector: ptr.To(true),
			Parallelism:    ptr.To[int32](1),
			Suspend:        ptr.To(true),
		},
		Status: batch.JobStatus{
			Active: 1,
			Conditions: []batch.JobCondition{
				{
					Type:   batch.JobSuspended,
					Status: api.ConditionStatus(metav1.ConditionFalse),
				},
			},
		},
	},
	update: func(job *batch.Job) {
		job.Spec.Template.Spec.Containers[0].Resources.Requests = api.ResourceList{
			api.ResourceCPU: resource.MustParse("200m"),
		}
	},
	wantErrs: field.ErrorList{
		{Type: field.ErrorTypeInvalid, Field: "spec.template"},
	},
},
and
"feature gate enabled - job suspended, with the suspended condition, but pods remain active - reject": {
enableFeatureGate: true,
job: &batch.Job{
ObjectMeta: metav1.ObjectMeta{
Name: "myjob",
Namespace: metav1.NamespaceDefault,
ResourceVersion: "0",
},
Spec: batch.JobSpec{
Selector: validSelector,
Template: validPodTemplateSpec,
ManualSelector: ptr.To(true),
Parallelism: ptr.To[int32](1),
Suspend: ptr.To(true ),
},
Status: batch.JobStatus{
Active: 1,
Conditions: []batch.JobCondition{
{
Type: batch.JobSuspended,
Status: api.ConditionStatus(metav1.ConditionTrue),
},
},
},
},
update: func(job *batch.Job) {
job.Spec.Template.Spec.Containers[0].Resources.Requests = api.ResourceList{
api.ResourceCPU: resource.MustParse("200m"),
}
},
wantErrs: field.ErrorList{
{Type: field.ErrorTypeInvalid, Field: "spec.template"},
},
},- Add feature gate to control mutable pod resources for suspended jobs - Implement validatePodResourceUpdatesOnly function to allow only container resource updates - Allow resource updates for suspended jobs regardless of whether they have started - Add comprehensive unit and integration tests for all scenarios including started-then-suspended - Ensure backward compatibility when feature gate is disabled This enables users to update container resources on suspended jobs, including jobs that have previously started and been suspended, while maintaining immutability for all other pod template fields.
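For the record, a hedged sketch of what the validatePodResourceUpdatesOnly helper named in the commit message might look like; the actual function in this PR may differ. It munges the old spec's container resources to the new values and then requires everything else to be identical:

```go
// Sketch only: inferred from the commit message above, not copied from the PR.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/apimachinery/pkg/api/resource"
)

func validatePodResourceUpdatesOnly(newSpec, oldSpec *v1.PodSpec) error {
	if len(newSpec.Containers) != len(oldSpec.Containers) {
		return fmt.Errorf("containers may not be added or removed")
	}
	munged := oldSpec.DeepCopy()
	for i := range munged.Containers {
		// Container resources are the only field allowed to change.
		munged.Containers[i].Resources = newSpec.Containers[i].Resources
	}
	if !apiequality.Semantic.DeepEqual(newSpec, munged) {
		return fmt.Errorf("only container resources may be updated")
	}
	return nil
}

func main() {
	oldSpec := &v1.PodSpec{Containers: []v1.Container{{
		Name:  "c",
		Image: "busybox",
		Resources: v1.ResourceRequirements{
			Requests: v1.ResourceList{v1.ResourceCPU: resource.MustParse("100m")},
		},
	}}}
	newSpec := oldSpec.DeepCopy()
	newSpec.Containers[0].Resources.Requests[v1.ResourceCPU] = resource.MustParse("200m")
	fmt.Println(validatePodResourceUpdatesOnly(newSpec, oldSpec)) // <nil>: resize-only update
}
```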
/unhold Thank you 👍
LGTM label has been added. Git tree hash: c67bb1e9198cd912765c9418d436a48cb7ace432
Known flake, tracked in #134998.
/test pull-kubernetes-node-e2e-containerd
/test pull-kubernetes-node-e2e-containerd
/retest
@kannon92: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/test pull-kubernetes-node-e2e-containerd
What type of PR is this?
/kind feature
What this PR does / why we need it:
This enables users to update container resources on suspended jobs that haven't started yet, while maintaining immutability for all other pod template fields.
Which issue(s) this PR is related to:
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: