-
Notifications
You must be signed in to change notification settings - Fork 42k
Add timeout option for docker's exec operation #58925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to set the timeout in the context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This patch allows the code that calls StartExec to be notified on timeout. Otherwise, each caller should have its own implementation for detecting timeouts.
|
/ok-to-test |
049a5e4 to
750ea4b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this wording might be a bit clearer: "Unable to kill process %d of exec session %s in container %s for timeout termination!", inspect.Pid, startExec, inspect.ContainerID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably a good idea to store the time.Timer returned by time.AfterFunc so we can defer timeoutTimer.Stop() - this will ensure we don't have timers with long timeouts sticking around for a while after the exec has returned successfully. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, stopping the timer on function exit is really a good idea.
|
LGTM. Will defer to sig-node for final lgtm & approval |
|
/uncc @dims |
|
Another month gone, anyone has any comments on this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/does/did
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest inspectError instead of err2, for a more semantic name. Similar advice below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest something like glog.Errorf("failed to kill exec process after timeout: session: %s, container: %s, error: %v", startExec, inspect.ContainerID, inspectError)
Regarding inspect.ContainerID, can you trust that inspect holds a value when you get an error here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar suggestions below
Currently, docker's exec operation never timeout. This patch introduces a new timeout value into the parameters of `libdocker.StartExec`. If the exec operation timeouts, the `StartExec` function returns the `libdocker.operationTimeout` error.
|
@mtaufen Thanks for your suggestions. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: louyihua Assign the PR to them by writing The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
|
/remove-lifecycle stale |
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
|
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
|
any new progress on this? @louyihua |
|
@ncdc Can you help to review this PR? |
|
Sorry, I am not involved in this area of the code any more - I would recommend getting someone from SIG Node to review. /uncc |
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
|
/remove-lifecycle stale |
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
|
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
|
This PR solves the same problem as the PR that is currently being actively discussed #94115. The agreement from SIG node meeting today was that the code change is only a small part of the fix and we need to actively look at side effects this change may bring to the production payloads. Please join the discussion at the PR: #94115. /close |
|
@SergeyKanzhelev: Closed this PR. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
The timeout wrapper in health checks was added in helm/charts#11355 to work around Docker/containerd not respecting timeouts in probes (cf. kubernetes/kubernetes#58925). The upstream issue has been fixed since Kubernetes 1.20 (kubernetes/kubernetes#94115), and this wrapper causes degraded behavior (ie. any failure in the wrapped command only gets reported as "The monitored command dumped core", without details for the specific failure), so the original behavior should be restored.
What this PR does / why we need it:
Currently, docker's exec operation never timeout. This patch introduces a new
ExecOptionsthat contains a timeout value into the parameters oflibdocker.StartExec. When a positive timeout is passed and the exec operation timeouts, theStartExecfunction returns thelibdocker.OperationTimeouterror.Release note: