
Fix hang when container fails to start #5060

Merged
thaJeztah merged 2 commits into docker:master from laurazard:fix-hang-ctx on May 7, 2024
Conversation

@laurazard
Member

@laurazard laurazard commented May 5, 2024

- What I did

fixes #5053

When running a container, if it fails to start, we immediately cancel the context used for waitExitOrRemoved and, if running the container with --rm, wait on the channel:

```go
if attach {
	cancelFun()
	<-errCh
}
reportError(stderr, "run", err.Error(), false)
if copts.autoRemove {
	// wait container to be removed
	<-statusChan
}
```

However, in 840016e, we added the following line

```go
case <-ctx.Done():
	return
```

but in this case, we return from the goroutine without notifying the caller, which causes the caller to hang.

- How I did it

We should not be cancelling the context used to wait on the container's exit/removal in this case; it only happens by accident because of this line, which is there for other reasons:

```go
if attach {
	cancelFun()
	<-errCh
}
```

So I used a new context for the waitExitOrRemoved call.

Also, close the returned channel to unblock any caller when we hit this branch, and add another close in a later case where we also return without notifying the caller.

- How to verify it

Reproduce with a loop such as:

```bash
#!/bin/bash
while :; do
	docker run --rm --entrypoint invalidcommand alpine param
done
```

- Description for the changelog

Fix issue where the CLI process would sometimes hang when a container failed to start.

- A picture of a cute animal (not mandatory but encouraged)

(image of a cute animal)

@laurazard laurazard added the kind/bugfix (PR's that fix bugs) label May 5, 2024
@laurazard laurazard requested a review from vvoland May 5, 2024 22:21
@laurazard laurazard self-assigned this May 5, 2024
@laurazard
Member Author

Should also add a test that would catch a regression such as this, but I'll leave that for tomorrow 😅

@codecov-commenter

codecov-commenter commented May 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.07%. Comparing base (647ccf3) to head (8d6e571).

Additional details and impacted files
```
@@           Coverage Diff           @@
##           master    #5060   +/-   ##
=======================================
  Coverage   61.06%   61.07%           
=======================================
  Files         295      295           
  Lines       20667    20670    +3     
=======================================
+ Hits        12621    12624    +3     
  Misses       7147     7147           
  Partials      899      899           
```

@thaJeztah
Member

Will this affect client-side handling of --rm (API 1.25 and older)? I noticed that's handled in this area of the code, and I recall there was some rather complex logic around this.

@laurazard
Member Author

I don't think it should; that check/path happens before the changes in this patch/the other changes, right?

```go
if versions.LessThan(apiClient.ClientVersion(), "1.30") {
	return legacyWaitExitOrRemoved(ctx, apiClient, containerID, waitRemove)
}
```

@laurazard laurazard force-pushed the fix-hang-ctx branch 2 times, most recently from f431514 to 109a06e on May 6, 2024 13:02
@laurazard laurazard requested review from krissetto and thaJeztah May 6, 2024 13:51
@laurazard laurazard force-pushed the fix-hang-ctx branch 3 times, most recently from 109a06e to 32d7da6 on May 7, 2024 09:34
@laurazard laurazard requested a review from vvoland May 7, 2024 09:35
Collaborator

@vvoland vvoland left a comment


LGTM; left a small nit, but feel free to ignore 😅

Member

@thaJeztah thaJeztah left a comment


LGTM

We should probably look at some of our tests to make sure cleanup happens; it looks like icmd.RunCmd may be broken (as it doesn't set the timeout on the cmd that's passed to icmd.WaitOnCmd). But while doing a little test (running sleep 10 instead of invalid command) I noticed that the container is not cleaned up when the test completes; we may either need some cleanup step as part of tests, or as part of the setup of the e2e tests 🤔

@thaJeztah thaJeztah added this to the 27.0.0 milestone May 7, 2024
@laurazard
Member Author

Agreed! I'll file a ticket so we remember to do that; we should go through all the tests and check.

@thaJeztah
Member

Thanks! Yes, looks like a tracking ticket won't hurt. In most cases it probably shouldn't be problematic, but it could explain some flakiness (containers from other tests still being around, causing tests to be inconsistent).

In either case, it's not critical for this PR; I think this PR should be good to go.


Labels

kind/bugfix PR's that fix bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

docker-cli broken in v26.1.1 with entrypoint is invalid. Keeps hanging forever.

5 participants