Skip to content

Conversation

@louyihua
Copy link
Contributor

What this PR does / why we need it:
Currently, docker's exec operation never timeout. This patch introduces a new ExecOptions that contains a timeout value into the parameters of libdocker.StartExec. When a positive timeout is passed and the exec operation timeouts, the StartExec function returns the libdocker.OperationTimeout error.

Release note:

Add timeout option into libdocker.StartExec

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 28, 2018
@k8s-ci-robot k8s-ci-robot requested review from dims and mtaufen January 28, 2018 08:52
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 28, 2018
@louyihua
Copy link
Contributor Author

@yujuhong
This is a more generic patch than #58510.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to set the timeout in the context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch allows the code that calls StartExec to be notified on timeout. Otherwise, each caller should have its own implementation for detecting timeouts.

@dims
Copy link
Member

dims commented Mar 7, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 7, 2018
@louyihua louyihua force-pushed the exec-timeout branch 2 times, most recently from 049a5e4 to 750ea4b Compare April 11, 2018 06:27
@louyihua
Copy link
Contributor Author

@yujuhong
This PR is reworked. Now it matches @ncdc 's option 2

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this wording might be a bit clearer: "Unable to kill process %d of exec session %s in container %s for timeout termination!", inspect.Pid, startExec, inspect.ContainerID

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably a good idea to store the time.Timer returned by time.AfterFunc so we can defer timeoutTimer.Stop() - this will ensure we don't have timers with long timeouts sticking around for a while after the exec has returned successfully. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, stopping the timer on function exit is really a good idea.

@ncdc
Copy link
Member

ncdc commented Apr 12, 2018

LGTM. Will defer to sig-node for final lgtm & approval
/assign @yujuhong @Random-Liu @derekwaynecarr

@dims
Copy link
Member

dims commented Apr 30, 2018

/uncc @dims

@k8s-ci-robot k8s-ci-robot removed the request for review from dims April 30, 2018 12:39
@louyihua
Copy link
Contributor Author

Another month gone, anyone has any comments on this?
@Random-Liu @derekwaynecarr @yujuhong

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/does/did

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest inspectError instead of err2, for a more semantic name. Similar advice below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest something like glog.Errorf("failed to kill exec process after timeout: session: %s, container: %s, error: %v", startExec, inspect.ContainerID, inspectError)

Regarding inspect.ContainerID, can you trust that inspect holds a value when you get an error here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar suggestions below

Currently, docker's exec operation never timeout. This patch introduces
a new timeout value into the parameters of `libdocker.StartExec`. If the
exec operation timeouts, the `StartExec` function returns the
`libdocker.operationTimeout` error.
@louyihua
Copy link
Contributor Author

@mtaufen Thanks for your suggestions.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: louyihua
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: derekwaynecarr

Assign the PR to them by writing /assign @derekwaynecarr in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 21, 2018
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2019
@louyihua
Copy link
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 12, 2019
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 10, 2019
@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Oct 10, 2019
@pigletfly
Copy link
Contributor

any new progress on this? @louyihua

@wangrzneu
Copy link
Contributor

@ncdc Can you help to review this PR?

@ncdc
Copy link
Member

ncdc commented Jan 2, 2020

Sorry, I am not involved in this area of the code any more - I would recommend getting someone from SIG Node to review.

/uncc

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 1, 2020
@wangrzneu
Copy link
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 8, 2020
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 7, 2020
@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 6, 2020
@SergeyKanzhelev
Copy link
Member

This PR solves the same problem as the PR that is currently being actively discussed #94115. The agreement from SIG node meeting today was that the code change is only a small part of the fix and we need to actively look at side effects this change may bring to the production payloads. Please join the discussion at the PR: #94115.

/close

@k8s-ci-robot
Copy link
Contributor

@SergeyKanzhelev: Closed this PR.

Details

In response to this:

This PR solves the same problem as the PR that is currently being actively discussed #94115. The agreement from SIG node meeting today was that the code change is only a small part of the fix and we need to actively look at side effects this change may bring to the production payloads. Please join the discussion at the PR: #94115.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

stuartpb added a commit to stuartpb/bitnami-charts that referenced this pull request Sep 16, 2021
The timeout wrapper in health checks was added in helm/charts#11355
to work around Docker/containerd not respecting timeouts in probes
(cf. kubernetes/kubernetes#58925). The upstream issue has been fixed
since Kubernetes 1.20 (kubernetes/kubernetes#94115), and this wrapper
causes degraded behavior (ie. any failure in the wrapped command only
gets reported as "The monitored command dumped core", without details
for the specific failure), so the original behavior should be restored.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.