Skip to content

Add timeout in CLI waiting for autoremove#31744

Closed
cpuguy83 wants to merge 1 commit intomoby:masterfrom
cpuguy83:client_timeout_autoremove
Closed

Add timeout in CLI waiting for autoremove#31744
cpuguy83 wants to merge 1 commit intomoby:masterfrom
cpuguy83:client_timeout_autoremove

Conversation

@cpuguy83
Copy link
Member

This makes sure the CLI doesn't hang in case, for whatever reason, an
event never comes.

@cpuguy83 cpuguy83 force-pushed the client_timeout_autoremove branch 5 times, most recently from ccf8195 to 1cda2b7 Compare March 11, 2017 01:52
This makes sure the CLI doesn't hang in case, for whatever reason, an
event never comes.
Times out 30seconds after the `die` event is received.
Additionally polls the container inspect API every 5 seconds

Signed-off-by: Brian Goff <[email protected]>
@MHBauer
Copy link
Contributor

MHBauer commented Mar 27, 2017

So many levels of indentation makes my head hurt. That's even after you removed one.

Was there an issue that prompted this work?

Copy link
Contributor

@MHBauer MHBauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as far as I understand it LGTM.

Only question is curious why to put the new arg to waitExitOrRemoved "in the middle" and not on the right edge.

var removeErr error
statusChan := make(chan int)
exitCode := 125
ctxWithCancel, cancel := context.WithCancel(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cancellable context, good to see this used.

}

go func() {
statusChan <- waitExitedOrRemovedEvent(ctxWithCancel, dockerCli, containerID, waitRemove, options)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start the event processor in a separate thread

}

func waitRemovedPolling(ctx context.Context, dockerCli *command.DockerCli, containerID string) {
ticker := time.NewTicker(5 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this anything we need to worry about being configurable?

select {
case <-attachDone:
case <-ctxWithCancel.Done():
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bail early if we've finished and cleaned up already.

case <-ctx.Done():
return
case <-ticker.C:
_, err := dockerCli.Client().ContainerInspect(ctx, containerID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything else we need to do in here? Or just loop until the container is gone?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I guess if anything happened that caused container failed to remove, the CLI could still hang? right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a 30s timeout.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't see how the 30s works. In my understanding, if we failed to remove container, and container is always there, waitRemovedPolling won't exit, isn't that true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context will get cancelled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The waitRemovedPolling rely on line 34: https://github.com/docker/docker/pull/31744/files#diff-1fe5ea7c675b9700809b89d15a414576R34 to cancel the context right? and line 34 waitExitedOrRemovedEvent has a param of ctxWithCancel which doesn't have a timeout, so it still rely on Event API. If container fail to remove, there's no destroy event and nobody will cancel the context. 30s timeout only works after waitRemovedPolling returns.

That's why I think it removal failed, this could still hang~

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WeiZhang555 Today, a remove can't fail (when using --force, which autoremove does) unless there is something horribly wrong on the system (and kill -9 doesn't work).

https://github.com/docker/docker/blob/master/daemon/delete.go#L126

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can certainly put a timeout on the poller, or start warning after a period of time or something, but the user can just as easily cancel out of it.

eventq, errq := dockerCli.Client().Events(eventCtx, eventOpts)

exitCode = 125 // set default exit code
eventProcessor := func(e events.Message) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

process a single event and indicate if we've received an event telling us to finish or not.

}
case err := <-errq:
logrus.Errorf("error getting events from daemon: %v", err)
for {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loop forever processing events.

@cpuguy83
Copy link
Member Author

@MHBauer Mostly because I had to rebase this change on some other refactorings that happened and just kept stuff in place.

@MHBauer
Copy link
Contributor

MHBauer commented Mar 28, 2017

The most recent other person prior in this code looks like @WeiZhang555. Any opinions here?

@WeiZhang555
Copy link
Contributor

I can understand the logic, this looks more complicated than it's first implemented, I'm just wondering if this is worth it without a specific issue as @MHBauer said.

If the EVENT api failed to store and send some specific event, shouldn't we solve the EVENT api problem instead of this? If previous codes behaved correctly before(I guess so), why do we need to change it?

@cpuguy83
Copy link
Member Author

@WeiZhang555 Yes there is a specific issue, a couple really:

  1. Classic swarm events are unreliable
  2. Removal failure (when --force is fixed to not force remove on error)

I believe @tonistiigi has also mentioned adjusting the wait API to support removal as well.

@WeiZhang555
Copy link
Contributor

The implementation looks good to me 😄

@cpuguy83
Copy link
Member Author

Closing for now in favor of #32237

@cpuguy83 cpuguy83 closed this Apr 27, 2017
@cpuguy83 cpuguy83 deleted the client_timeout_autoremove branch September 20, 2017 16:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants