Fix race condition between exec start and resize.#37172
Fix race condition between exec start and resize.#37172thaJeztah merged 1 commit intomoby:masterfrom
Conversation
arm64b
left a comment
There was a problem hiding this comment.
This 2 commits can squash into one since they're doing the same logic.
|
What is this racing with? |
|
@arm64b Thanks for the suggestion. I am not quite familiar with the procedure. Do I have to create a new PR? Or is there any procedure to fix the problem inplace? |
|
@zq-david-wang No, you don't need to create the PR, just squash the 2 commits into one on you local branch, then push it again. But that's not a big problem, I think you need to think about the question from @cpuguy83 first about the fixing itself. |
|
@cpuguy83 Correct my if I am wrong. I am not sure where docker-cli send requests in sequence or not (Hope it sends request in sequence). But in dockerd, the request handler for exec start and exec resize would race with each other. Exec start would create a task process object and store in a map the key-value pair of exec-id and this process object, but if the resize request got processed before the process object got stored in map, then the resize request would get a "no such exec" response even though the exec-id is valid. I notice that before dockerd start exec, it would acquire a lock for the container and exec-session, but resize acquires no such lock. And I test the code in my env, it does work. |
Codecov Report
@@ Coverage Diff @@
## master #37172 +/- ##
=========================================
Coverage ? 34.71%
=========================================
Files ? 609
Lines ? 45007
Branches ? 0
=========================================
Hits ? 15625
Misses ? 27291
Partials ? 2091 |
|
@vdemeester The test case DockerSwarmSuite.TestSwarmClusterRotateUnlockKey would report following error, But I do not think it has anything to do with this PR, which only change logic for
|
There was a problem hiding this comment.
What's the purpose of those new added code block?
There was a problem hiding this comment.
The test seems to assume something not clearly defined, in this case, how soon should the daemon node rejoin as a manager after restart. I just add some delay and then make a retry. In previous code change, I use 500ms delay and the test still report FAIL for 1 jobs. Current code change use a 10s delay and it seems PASS all test jobs at least for one time. And I clean up those commits and push new ones, now new round of test is started. Finger crossed.
There was a problem hiding this comment.
We might prefer to have this change/discussion in a different PR, wdyt @thaJeztah ?
There was a problem hiding this comment.
Yes, having it in a separate PR would be preferable (so that the PR is focused on just the changes needed for the race condition)
There was a problem hiding this comment.
roger that~! I will move this change out.
is a flaky test, you can ignore this case. Ah, I understood you're trying to mitigate this in your commit 😭 |
|
ping @tonistiigi could you have a look? You had some thoughts on this change |
mlaventure
left a comment
There was a problem hiding this comment.
This should indeed fix the race.
I'm trying to remember if there's a reason why we haven't done so until then, but can't think of one.
There was a problem hiding this comment.
Why this change? Is the comment not valid anymore?
There was a problem hiding this comment.
I think the test should uncover the exec start/resize issue if it does not ignore other errors.
I run this new test several round, no failure reported, yet. I will run more round of tests to see whether it would report strange failures.
There was a problem hiding this comment.
We might prefer to have this change/discussion in a different PR, wdyt @thaJeztah ?
There was a problem hiding this comment.
We should combine this with the checks above, and use err != nil, which is a more common approach;
_, rc, err := request.Post(fmt.Sprintf("/exec/%s/resize?h=24&w=80", execID), request.ContentType("text/plain"))
// It's probably a panic of the daemon if io.ErrUnexpectedEOF is returned.
if err != nil {
if err == io.ErrUnexpectedEOF {
return fmt.Errorf("The daemon might have crashed.")
}
return errors.Wrap(err, "failed to exec resize")
}
rc.Close()
// We only interested in the io.ErrUnexpectedEOF error, so we return nil otherwise.
return nilReading the comment though, it looks like previously we ignored other errors; is there a specific error we don't want to ignore?
There was a problem hiding this comment.
I think the exec start/resize issue should be detected by this test case if we do not ignore those errors.
I run several rounds of test, so far no strange error found. I would run more round of tests to see how stable this test would be.
There was a problem hiding this comment.
Changes made. But I dose not use the errors package because all other errors is generated by fmt.Errorf, kind of strange to change style... I will make the change if you still think it is necessary. @thaJeztah
daemon/resize.go
Outdated
There was a problem hiding this comment.
Could you add a comment here how this lock is correcting the synchronization between request handlers? What request is holding the lock and keeps this waiting etc.
There was a problem hiding this comment.
I add some comments in a new commit, please help take a look.
Thx~!
ed5ac2d to
cda8237
Compare
mlaventure
left a comment
There was a problem hiding this comment.
While writing my comment below, I realized that this doesn't actually fix the race.
Scheduling helping, this could potentially still be called before containerd.Exec() is done. To check that it has returned, we would need to check that ec.Pid is not 0.
daemon/resize.go
Outdated
There was a problem hiding this comment.
If think it could simply be:
Synchronize with exec-start to ensure the resize request is not sent before the process is actually started.
Would that work for you too @tonistiigi ?
|
@zq-david-wang @mlaventure Would it be more understandable if instead of this locks we would just add a |
|
@mlaventure the code calling @mlaventure @tonistiigi considering adding |
I disagree as we can't change old clients and this would be very hard to use with API. If I remember correctly there was an attempt but it also meant changing the API as there is no way to detect start atm. It was discussed in a maintainers meeting and maintainers were in favor of a daemon side solution similar to this. |
|
@tonistiigi thanks for the information. There are indeed lots of clients out there.... I have made further changes to use wait group to indicate the readiness of the exec session, please help give it a look. @cpuguy83 @arm64b @tonistiigi @mlaventure @thaJeztah |
daemon/resize.go
Outdated
There was a problem hiding this comment.
Not a huge fan of the timeout, I would prefer for the context to be passed as one of the argument, but I guess it can be done in a separate PR.
There was a problem hiding this comment.
Me, too. The timeout dose seem quite ugly. I add it there because I am afraid that something could somehow go wrong and the process would hang for ever. Maybe I can wrap those code into a public function of exc.Config, this could make the code much clear.
daemon/resize.go
Outdated
There was a problem hiding this comment.
since we end up making a channel here to allow for the timeout, may as well make ReadyWaitGroup into a channel like originally suggested by @tonistiigi
There was a problem hiding this comment.
@mlaventure @tonistiigi I am not a experienced go programmer, correct me if I am wrong. I think channel event is a one-time thing, after a poll, the event is processed and no longer reside in the channel. This means that a second pool (another resize request in the same exec context) would get nothing. So I think WaitGroup is a better choice here.
There was a problem hiding this comment.
@tonistiigi @mlaventure I just learned that a closed channel would always be able to poll from. (Do feel have learned a lot about go programming just for this PR.) I will make code changes to use a chan to indicate the readiness. Thanks for the suggestions.
|
@thaJeztah @mlaventure @tonistiigi @arm64b @vdemeester thank you guys for all the help~! Its a honor to contribute! |
|
Great, thanks for that! |
|
Curious, will this make it into 18.03.2-ce? |
|
Likely 18.06-ce |
|
@thaJeztah when is that due? We're rapidly running out of June... |
|
When it's ready :-)
Sorry, don't have a date
… On 26 Jun 2018, at 09:44, Andy Madge ***@***.***> wrote:
@thaJeztah when is that due? We're rapidly running out of June...
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
|
Random though: Being that we're almost in August, is it going to be called 18.07, or 18.06? Or maybe even 18.08? I'm just happy that it's fixed 8) |
|
@xrobau Looks like it will be called |
|
Unfortunately, this PR didn't fix the problem entirely -- as of 18.06-ce there is still a race condition, in that depending on what program you exec, it can miss the resizing of the terminal during the program's startup. In my experiments, I've had to in some cases add a 'sleep .1' to the command being exec'd to be certain that the resize is performed before the command reads the terminal size. If I understand correctly, this patch asynchronously resizes the terminal after the target process is started; ISTM that the correct behavior would be to properly size the pty on creation, before actually starting the target process. Otherwise, this is still a race condition, albeit one with a narrower window for triggering it. |
|
@pjeby care to share an example command to reproduce this? I ran into this issue with multiple scenarios but these are all solved with this PR. |
|
@pjeby The program shouldn't miss the resize with this PR, what can happen is that it starts with the wrong size for a few ms before the signal is sent to the program. That said, it wouldn't hurt to initialize the |
|
The specific issue I had was with exec'ing IOW, the race condition is simply the fact that the program being exec'd can read an uninitialized value (the default terminal size) before it's set. This is akin to starting a thread that reads a variable and then setting it in the calling thread, even if you then also notify the called thread that the variable has changed. 😉 |
|
@pjeby I understand what the issue is. But it was always there, it just that the timing changed as docker has never set the actual terminal size before starting the container since we don't pass that information to the start endpoint. So the endpoint would need to take the TTY size as extra parameters for this to occurs. Feel free to open a PR :) |
|
I am not sure what the structure is here, but it seems that nobody is actually responsible for fixing such serious bugs. Is that correct? Is there anyone in this project who is actually paid to fix issues or otherwise cares about this project? |
|
I think that comment is both unfairly phrased and counterproductive; even I understand that this is a pretty niche issue compared to docker's main, non-interactive use cases, and after this PR it's a heck of a lot better than it was before. (I'm not sure I'd class it as a "serious bug" any more, and I'm the one who reopened this discussion.) |
|
@pjeby It is so serious that I will have to fix it if whoever "maintains" this project doesn't. That's my definition of serious. I don't like your tone either, because you say "even I understand", suggesting I am a moron. Let me also note that you have only wasted my time (and that of others reading this) by only talking about what you think about the comment and not actually answering the question. |
|
Please watch your language, and keep your wording respectful to others (see the community guidelines). As to your question: whoever "maintains" this project - as in any open source project; it's contributors and maintainers (who also contribute); open source is about participation, which means that the community as a whole can assist in improving a project; either by opening issues / tickets to report bugs, by assisting others, or by contributing code to fix bugs or implement features. Contributors can be paid to do so (for example if the company they work for has an interest in fixing a bug, or wants to implement a feature), but many contributors don't get paid for their work, and do it to add their part in contributing to the project.
So; yes! If you want to give back, and work on fixing the issue you're seeing; we welcome your contribution. |
|
Perhaps "understand" was a poor word choice; substituting "accept" would make the intended meaning clearer. That is, my statement was intended to read as something like, "even I (as a party who thought this important enough to reopen discussion) would not go so far as to suggest this issue is a top priority for the project as a whole". In other words, the word "understand" was a statement about accepting the project's priorities being different than my own, not about mine or anyone else's comprehension of those priorities. In colloquial American English at least, "understand" often has the social meaning of acceptance, agreement, or acquiescence to another person's position or terms, especially when offered by an inferior (me in this case) towards a superior (the project team). That was the intended use here. (Ironically, my choice of this word was driven by the desire for politeness, even if it turned out not to work that well. 😀 ) |
|
@thaJeztah There is nothing in those guidelines that I broke. As such waiving with them is not particularly nice and in fact against those same guidelines. I understand you have the illusion that those rules help the "community", but trust me, they don't. They are only there to give an excuse to ban people without reason and literally even the rookiest of developers knows this. You have also not answered my question. Essentially you have said the equivalent of "happy bunnies come out of the trenches at night and magically fix everything". Having said that, by using these vague statements, you have confirmed my idea about the state of the project. I gave you a clear chance to explain to me how the project is in a good state (despite the mountain of evidence against it) and you didn't grab that opportunity, which is of course well within your rights to do so. Please don't make the mistake of responding to this message; it will lead nowhere and I have no intention to say anything else in this thread. Essentially, I will consider it harassment (since you like your guidelines so much). @pjeby Yes, it was a poor word choice or a highly intelligent one (a free pass at insulting someone if I hadn't said anything). We will never know what it really was. |
Signed-off-by: David Wang [email protected]
- What I did
Fix a race condition between exec start and resize
- How I did it
Add lock when calling resize
- How to verify it
run
docker exec -it $(docker run -d centos sleep 3600) bash, and then type a long line to see that the line actually wrapped around to next line- Description for the changelog
fixes #31223
fixes #35407
fixes #36354
fixes #36483
fixes #36477
fixes docker/for-linux#314
fixes docker/for-linux#268
- A picture of a cute animal (not mandatory but encouraged)