
Fix race condition between exec start and resize. #37172

Merged
thaJeztah merged 1 commit into moby:master from zq-david-wang:resizefix2
Jun 8, 2018

Conversation

@zq-david-wang
Contributor

@zq-david-wang zq-david-wang commented May 30, 2018

Signed-off-by: David Wang [email protected]

- What I did
Fix a race condition between exec start and resize
- How I did it
Add lock when calling resize
- How to verify it
run docker exec -it $(docker run -d centos sleep 3600) bash, then type a long line and verify that it actually wraps around to the next line
- Description for the changelog

fixes #31223
fixes #35407
fixes #36354
fixes #36483
fixes #36477
fixes docker/for-linux#314
fixes docker/for-linux#268

- A picture of a cute animal (not mandatory but encouraged)

Contributor

@arm64b arm64b left a comment


These 2 commits can be squashed into one since they implement the same logic.

@cpuguy83
Member

What is this racing with?
I thought this race was mostly in the CLI, due to async handling of start+resize?

@zq-david-wang
Contributor Author

@arm64b Thanks for the suggestion. I am not quite familiar with the procedure. Do I have to create a new PR, or is there a way to fix the problem in place?

@arm64b
Contributor

arm64b commented May 30, 2018

@zq-david-wang No, you don't need to create a new PR; just squash the 2 commits into one on your local branch, then push again. But that's not a big problem. I think you first need to think about the question from @cpuguy83 about the fix itself.

@zq-david-wang
Contributor Author

@cpuguy83 Correct me if I am wrong. I am not sure whether docker-cli sends its requests in sequence or not (I hope it does). But in dockerd, the request handlers for exec start and exec resize can race with each other. Exec start creates a task process object and stores the exec-id/process-object pair in a map, but if the resize request is processed before the process object is stored in the map, the resize request gets a "no such exec" response even though the exec-id is valid. I noticed that before dockerd starts the exec, it acquires a lock for the container and the exec session, but resize acquires no such lock. I tested the change in my environment, and it does work.
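A hypothetical, simplified model of the race described above (the names daemon, startExec, and resizeExec are illustrative, not the actual moby identifiers): exec start registers the exec session in a map while holding a lock, so taking the same lock in the resize handler prevents resize from observing the map between "request accepted" and "process object stored".

```go
package main

import (
	"errors"
	"sync"
)

// daemon is a toy stand-in for the dockerd state involved in the race.
type daemon struct {
	mu    sync.Mutex
	execs map[string]bool // exec-id -> process object registered
}

func (d *daemon) startExec(id string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	// ... containerd.Exec() would run here, with the lock held ...
	d.execs[id] = true
}

func (d *daemon) resizeExec(id string) error {
	// Without this lock, resize can run before startExec stores the
	// entry and spuriously report "no such exec" for a valid id.
	d.mu.Lock()
	defer d.mu.Unlock()
	if !d.execs[id] {
		return errors.New("no such exec")
	}
	// ... send the resize to the process here ...
	return nil
}
```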

@codecov

codecov bot commented May 31, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@f11f249). Click here to learn what that means.
The diff coverage is 85.71%.

@@            Coverage Diff            @@
##             master   #37172   +/-   ##
=========================================
  Coverage          ?   34.71%           
=========================================
  Files             ?      609           
  Lines             ?    45007           
  Branches          ?        0           
=========================================
  Hits              ?    15625           
  Misses            ?    27291           
  Partials          ?     2091

@zq-david-wang zq-david-wang requested a review from vdemeester as a code owner June 1, 2018 03:01
@zq-david-wang
Contributor Author

@vdemeester The test case DockerSwarmSuite.TestSwarmClusterRotateUnlockKey reports the following error, but I do not think it has anything to do with this PR, which only changes logic for docker exec. Is there any chance that this is a test-env issue, or is the master branch currently unstable?

FAIL: docker_cli_swarm_test.go:1372: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
...
04:44:09 Fail to exec node ls, stdout: Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
04:44:09 , stderr: exit status 1
04:44:09 docker_cli_swarm_test.go:1455:
04:44:09 c.Assert(err, checker.IsNil)
04:44:09 ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc422ea54c0), Stderr:[]uint8(nil)} ("exit status 1")
04:44:09

Contributor

What's the purpose of this newly added code block?

Contributor Author

The test seems to assume something not clearly defined, in this case how soon the daemon node should rejoin as a manager after a restart. I just add some delay and then retry. In a previous code change I used a 500ms delay and the test still reported FAIL for 1 job. The current change uses a 10s delay and seems to PASS all test jobs, at least for one run. I have also cleaned up those commits and pushed new ones, so a new round of tests has started. Fingers crossed.

Contributor

We might prefer to have this change/discussion in a different PR, wdyt @thaJeztah ?

Member

Yes, having it in a separate PR would be preferable (so that the PR is focused on just the changes needed for the race condition)

Contributor Author

roger that~! I will move this change out.

@arm64b
Contributor

arm64b commented Jun 1, 2018

@zq-david-wang

DockerSwarmSuite.TestSwarmClusterRotateUnlockKey

is a flaky test; you can ignore this case. Ah, I see you're trying to mitigate this in your commit 😭

@thaJeztah
Member

ping @tonistiigi could you have a look? You had some thoughts on this change

Contributor

@mlaventure mlaventure left a comment

This should indeed fix the race.

I'm trying to remember if there's a reason why we haven't done so until now, but I can't think of one.

Contributor

Why this change? Is the comment not valid anymore?

Contributor Author

I think the test should uncover the exec start/resize issue if it does not ignore other errors.
I have run this new test for several rounds with no failure reported yet. I will run more rounds of tests to see whether it reports strange failures.

Contributor

We might prefer to have this change/discussion in a different PR, wdyt @thaJeztah ?

Member

We should combine this with the checks above, and use err != nil, which is a more common approach;

_, rc, err := request.Post(fmt.Sprintf("/exec/%s/resize?h=24&w=80", execID), request.ContentType("text/plain"))
// It's probably a panic of the daemon if io.ErrUnexpectedEOF is returned.
if err != nil {
	if err == io.ErrUnexpectedEOF {
		return fmt.Errorf("The daemon might have crashed.")
	}
	return errors.Wrap(err, "failed to exec resize")
}

rc.Close()

// We're only interested in the io.ErrUnexpectedEOF error, so we return nil otherwise.
return nil

Reading the comment though, it looks like previously we ignored other errors; is there a specific error we don't want to ignore?

Contributor Author

I think the exec start/resize issue should be detected by this test case if we do not ignore those errors.
I have run several rounds of tests; so far no strange errors have shown up. I will run more rounds to see how stable this test is.

Contributor Author

Changes made. But I did not use the errors package, because all the other errors are generated by fmt.Errorf and it felt strange to mix styles... I will make the change if you still think it is necessary. @thaJeztah

daemon/resize.go Outdated
Member

Could you add a comment here how this lock is correcting the synchronization between request handlers? What request is holding the lock and keeps this waiting etc.

Contributor Author

I added some comments in a new commit; please take a look.
Thx~!

@zq-david-wang zq-david-wang force-pushed the resizefix2 branch 2 times, most recently from ed5ac2d to cda8237 Compare June 5, 2018 14:05
Contributor

@mlaventure mlaventure left a comment

While writing my comment below, I realized that this doesn't actually fix the race.

Depending on scheduling, this could potentially still be called before containerd.Exec() is done. To check that it has returned, we would need to check that ec.Pid is not 0.

daemon/resize.go Outdated
Contributor

I think it could simply be:

Synchronize with exec-start to ensure the resize request is not sent before the process is actually started.

Would that work for you too @tonistiigi ?

@tonistiigi
Member

@zq-david-wang @mlaventure Would it be more understandable if, instead of these locks, we just added a started chan struct{} to the ec that is closed on startup? Resize would then select between waiting for this channel to be closed and a timeout.

@zq-david-wang
Contributor Author

@mlaventure the code calling containerd.Exec() is protected by ec.Lock(), but I agree that there is still a chance that resize could be handled even before containerd.Exec() is called.

@mlaventure @tonistiigi considering adding timeout logic for resize, I kind of prefer a client-side change: docker-cli would add retry & timeout for sending the resize request.

@tonistiigi
Member

considering adding timeout logic for resize, I kind of prefer a client-side change: docker-cli would add retry & timeout for sending the resize request.

I disagree, as we can't change old clients, and this would be very hard to use with the API. If I remember correctly there was an attempt, but it also meant changing the API, as there is currently no way to detect start. It was discussed in a maintainers meeting and maintainers were in favor of a daemon-side solution similar to this.

@zq-david-wang
Contributor Author

@tonistiigi thanks for the information. There are indeed lots of clients out there....

I have made further changes to use a wait group to indicate the readiness of the exec session; please take a look. @cpuguy83 @arm64b @tonistiigi @mlaventure @thaJeztah

daemon/resize.go Outdated
Contributor

Not a huge fan of the timeout; I would prefer for the context to be passed as one of the arguments, but I guess that can be done in a separate PR.

Contributor Author

Me too. The timeout does seem quite ugly. I added it because I am afraid that something could somehow go wrong and the process would hang forever. Maybe I can wrap this code into a public function of exec.Config, which would make the code much clearer.

daemon/resize.go Outdated
Contributor

since we end up making a channel here to allow for the timeout, we may as well make ReadyWaitGroup into a channel, as originally suggested by @tonistiigi

Contributor Author

@mlaventure @tonistiigi I am not an experienced Go programmer, so correct me if I am wrong. I think a channel event is a one-time thing: after a poll, the event is processed and no longer resides in the channel. This means that a second poll (another resize request in the same exec context) would get nothing. So I think WaitGroup is a better choice here.

Contributor Author

@tonistiigi @mlaventure I just learned that a closed channel can always be received from. (I do feel I have learned a lot about Go programming just from this PR.) I will make code changes to use a chan to indicate readiness. Thanks for the suggestions.
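For anyone following along, this is the Go property being referenced: receiving from a closed channel never blocks, so a single close can signal any number of resize requests, past and future. A minimal illustration:

```go
package main

// receiveN performs n receives on ch and returns how many completed.
// On a closed channel every receive returns immediately with the zero
// value, so the "ready" signal is observable any number of times,
// unlike a single send, which only one receiver would consume.
func receiveN(ch chan struct{}, n int) int {
	done := 0
	for i := 0; i < n; i++ {
		<-ch
		done++
	}
	return done
}
```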

@zq-david-wang
Contributor Author

@thaJeztah @mlaventure @tonistiigi @arm64b @vdemeester thank you guys for all the help~! It's an honor to contribute!
I also opened an issue for the hardcoded 10s timeout: https://github.com/moby/moby/issues/37251

@thaJeztah
Member

Great, thanks for that!

@sargun
Contributor

sargun commented Jun 26, 2018

Curious, will this make it into 18.03.2-ce?

@thaJeztah
Member

Likely 18.06-ce

@andymadge

@thaJeztah when is that due? We're rapidly running out of June...

@thaJeztah
Member

thaJeztah commented Jun 26, 2018 via email

@xrobau

xrobau commented Jul 18, 2018

Random thought: seeing that we're almost in August, is it going to be called 18.07, or 18.06? Or maybe even 18.08? I'm just happy that it's fixed 8)

@mkurz

mkurz commented Jul 18, 2018

@xrobau Looks like it will be called 18.06, see https://github.com/docker/docker-ce/releases

@pjeby

pjeby commented Jul 22, 2018

Unfortunately, this PR didn't fix the problem entirely -- as of 18.06-ce there is still a race condition, in that depending on what program you exec, it can miss the resizing of the terminal during the program's startup. In my experiments, I've in some cases had to add a 'sleep .1' to the command being exec'd to be certain that the resize is performed before the command reads the terminal size.

If I understand correctly, this patch asynchronously resizes the terminal after the target process is started; ISTM that the correct behavior would be to properly size the pty on creation, before actually starting the target process. Otherwise, this is still a race condition, albeit one with a narrower window for triggering it.

@mpepping

@pjeby care to share an example command to reproduce this? I ran into this issue in multiple scenarios, but those are all solved by this PR.

@mlaventure
Contributor

@pjeby The program shouldn't miss the resize with this PR; what can happen is that it starts with the wrong size for a few ms before the signal is sent to the program.

That said, it wouldn't hurt to initialize the ConsoleSize in the linux createSpec so the size is closer to reality on start. Although there would still be a race, as the terminal size can change between the program starting and it receiving the resize signal.
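For reference, a hedged sketch of what initializing the console size in the spec might look like. The Process.ConsoleSize field with Height/Width comes from the OCI runtime spec; the types are redeclared here so the sketch is self-contained, and the helper name withInitialConsoleSize is hypothetical, not the actual moby createSpec code.

```go
package main

// Box mirrors the OCI runtime-spec console-size box (specs.Box in
// github.com/opencontainers/runtime-spec), redeclared for this sketch.
type Box struct {
	Height uint
	Width  uint
}

// Process mirrors the subset of the OCI Process spec relevant here.
type Process struct {
	Terminal    bool
	ConsoleSize *Box
}

// withInitialConsoleSize sets the pty size in the spec before the
// process starts, narrowing the window in which the program can read
// the default terminal size.
func withInitialConsoleSize(p *Process, height, width uint) {
	if p.Terminal {
		p.ConsoleSize = &Box{Height: height, Width: width}
	}
}
```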

@pjeby

pjeby commented Jul 25, 2018

The specific issue I had was with exec'ing nano, or any other tool which reads the terminal size but doesn't resize on sigwinch. Such tools used to work correctly when not resizing the terminal that docker and docker-compose commands were run from. (That is, they ran correctly before the docker-ce version that introduced the bug this PR is aimed at.)

IOW, the race condition is simply the fact that the program being exec'd can read an uninitialized value (the default terminal size) before it's set. This is akin to starting a thread that reads a variable and then setting it in the calling thread, even if you then also notify the called thread that the variable has changed. 😉

@mlaventure
Contributor

@pjeby I understand what the issue is. But it was always there; it's just that the timing changed, as docker has never set the actual terminal size before starting the container, since we don't pass that information to the start endpoint.

So the endpoint would need to take the TTY size as extra parameters for this to work.

Feel free to open a PR :)

@coretemp

I am not sure what the structure is here, but it seems that nobody is actually responsible for fixing such serious bugs. Is that correct?

Is there anyone in this project who is actually paid to fix issues or otherwise cares about this project?

@pjeby

pjeby commented Jul 31, 2018

I think that comment is both unfairly phrased and counterproductive; even I understand that this is a pretty niche issue compared to docker's main, non-interactive use cases, and after this PR it's a heck of a lot better than it was before. (I'm not sure I'd class it as a "serious bug" any more, and I'm the one who reopened this discussion.)

@coretemp

@pjeby It is so serious that I will have to fix it if whoever "maintains" this project doesn't. That's my definition of serious.

I don't like your tone either, because you say "even I understand", suggesting I am a moron.

Let me also note that you have only wasted my time (and that of others reading this) by only talking about what you think about the comment and not actually answering the question.

@thaJeztah
Member

Please watch your language, and keep your wording respectful to others (see the community guidelines).

As to your question: whoever "maintains" this project - as in any open source project; it's contributors and maintainers (who also contribute); open source is about participation, which means that the community as a whole can assist in improving a project; either by opening issues / tickets to report bugs, by assisting others, or by contributing code to fix bugs or implement features.

Contributors can be paid to do so (for example if the company they work for has an interest in fixing a bug, or wants to implement a feature), but many contributors don't get paid for their work, and do it to add their part in contributing to the project.

It is so serious that I will have to fix it if whoever "maintains" this project doesn't

So; yes! If you want to give back, and work on fixing the issue you're seeing; we welcome your contribution.

@pjeby

pjeby commented Jul 31, 2018

Perhaps "understand" was a poor word choice; substituting "accept" would make the intended meaning clearer. That is, my statement was intended to read as something like, "even I (as a party who thought this important enough to reopen discussion) would not go so far as to suggest this issue is a top priority for the project as a whole".

In other words, the word "understand" was a statement about accepting the project's priorities being different than my own, not about mine or anyone else's comprehension of those priorities.

In colloquial American English at least, "understand" often has the social meaning of acceptance, agreement, or acquiescence to another person's position or terms, especially when offered by an inferior (me in this case) towards a superior (the project team). That was the intended use here. (Ironically, my choice of this word was driven by the desire for politeness, even if it turned out not to work that well. 😀 )

@coretemp

@thaJeztah There is nothing in those guidelines that I broke. As such waiving with them is not particularly nice and in fact against those same guidelines.

I understand you have the illusion that those rules help the "community", but trust me, they don't. They are only there to give an excuse to ban people without reason and literally even the rookiest of developers knows this.

You have also not answered my question. Essentially you have said the equivalent of "happy bunnies come out of the trenches at night and magically fix everything". Having said that, by using these vague statements, you have confirmed my idea about the state of the project.

I gave you a clear chance to explain to me how the project is in a good state (despite the mountain of evidence against it) and you didn't grab that opportunity, which is of course well within your rights to do so.

Please don't make the mistake of responding to this message; it will lead nowhere and I have no intention to say anything else in this thread. Essentially, I will consider it harassment (since you like your guidelines so much).

@pjeby Yes, it was a poor word choice or a highly intelligent one (a free pass at insulting someone if I hadn't said anything). We will never know what it really was.

@moby moby locked and limited conversation to collaborators Jul 31, 2018