-
Notifications
You must be signed in to change notification settings - Fork 40.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TTL for cleaning up Jobs after they finish #66840
Conversation
a1ff331
to
94efaae
Compare
@janetkuo could we use dynamic client/informer to handle TTL for both finished Jobs and Pods in 1 controller? Seems like this would be fairly similar logic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally approve. Needs e2e as well.
pkg/apis/batch/types.go
Outdated
// finishes. This field is alpha-level and is only honored by servers that | ||
// enable the TTLAfterFinished feature. | ||
// +optional | ||
TTLSecondsAfterFinished *int32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to consider using TTLSecondsAfterCompletion to be consistent with the terminology in JobStatus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jobs finishes once its status is either complete or failed. Combined complete + failed into finished for simplicity, based on user feedback from kubeflow.
// alpha: v1.12 | ||
// | ||
// Allow TTL controller to clean up Pods and Jobs after they finish. | ||
TTLAfterFinished utilfeature.Feature = "TTLAfterFinished" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as previous comment with respect to Completion instead of Finished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
clock clock.Clock | ||
} | ||
|
||
func New(jobInformer batchinformers.JobInformer, client clientset.Interface) *TTLAfterFinishedController { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment for publicly exposed method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
"k8s.io/kubernetes/pkg/util/metrics" | ||
) | ||
|
||
type TTLAfterFinishedController struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should describe the type. You should add a comment explaining what the controller does and the design choice to have the contorller live outside of JobController.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
return tc | ||
} | ||
|
||
func (tc *TTLAfterFinishedController) Run(workers int, stopCh <-chan struct{}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if err != nil { | ||
return nil, err | ||
} | ||
// TODO: handle finish time that happens in the future |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should come up with a well defined behavior for when this occurs. Consider simply deferring destroying the victim after the finish time. It's conservative, but it can't do harm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Because it happens in the future it'll only delay destorying the Job.
/approve |
@tnozicka this TTL controller only needs to watch Jobs and Pods for now; Jobs and Pods have different ways of indicating that they "finished". I'd like to switch to dynamic client/informer when we extend this to custom resources. |
a439b11
to
322c881
Compare
/approve |
/lgtm |
Just fixing verify failure. Re-applying tag. |
) | ||
|
||
var ( | ||
ttlEnabled = utilfeature.DefaultFeatureGate.Enabled(features.TTLAfterFinished) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
declare this in TestJobsStrategy since it's the only place you are using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -172,6 +173,9 @@ func NewKubeControllerManagerOptions() (*KubeControllerManagerOptions, error) { | |||
ServiceController: &cmoptions.ServiceControllerOptions{ | |||
ConcurrentServiceSyncs: componentConfig.ServiceController.ConcurrentServiceSyncs, | |||
}, | |||
TTLAfterFinishedController: &TTLAfterFinishedControllerOptions{ | |||
ConcurrentTTLSyncs: componentConfig.TTLAfterFinishedController.ConcurrentTTLSyncs, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a proponent of YAGNI when it comes to configuration like parallel syncs. Is there a reason we want it explicitly for this feature as compared to other loops? An operator guessed value might be slighter better than a developer guessed value in an edge situation but the cost is extra code, maintainablity and suportability complexity. This was @cheftako's request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer adding workqueue latency and handletime metrics and explore adaptive worker counts then hand knobs to operators as a last resort.
…abled 1. If TTLAfterFinished feature is enabled, the value should be non-negative. 2. If TTLAfterFinished feature is disabled, the field value should not be kept.
job := cur.(*batch.Job) | ||
glog.V(4).Infof("Updating job %s/%s", job.Namespace, job.Name) | ||
|
||
if job.DeletionTimestamp == nil && needsCleanup(job) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not move job.DeletionTimestamp == nil
into needsCleanup?
} | ||
|
||
func getFinishAndExpireTime(j *batch.Job) (*time.Time, *time.Time, error) { | ||
if !needsCleanup(j) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call this once at the begining of processJob, then don't check it again in all these helpers?
/approve I'd like to see whether we can remove the flag before beta, but I won't block this PR. |
per previous lgtms |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cheftako, dixudx, janetkuo, kow3ns, liggitt, mikedanese, tnozicka The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test pull-kubernetes-e2e-kops-aws |
1 similar comment
/test pull-kubernetes-e2e-kops-aws |
/test all [submit-queue is verifying that this PR is safe to merge] |
/retest |
Automatic merge from submit-queue (batch tested with PRs 66840, 68159). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md. |
@janetkuo: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
What this PR does / why we need it: kubernetes/enhancements#592
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #64470
For kubernetes/enhancements#592
Special notes for your reviewer: @kubernetes/sig-apps-pr-reviews
Release note: