fix a potential race between domain termination and backup thread reuse #11801
Conversation
See ocaml#11800 for a CI failure report and a proposed failure scenario:

> 1. the main domain spawns a domain A
> 2. A starts its backup thread
> 3. A terminates, and in the process:
>    - it sets `domain_self->backup_thread_running = 0`
>    - it has not signalled the backup thread to terminate yet
>    - it releases `all_domains_lock`, and is about to enter the delicate part that collects the rest of the state without owning `domain_self` anymore
> 4. the main domain spawns a domain B, which reuses the slot of A
> 5. B starts its backup thread
>
> At this point, if my reasoning is right, `B` sees `domain_self->backup_thread_running == 0`, but the backup thread of A has not terminated yet, and the assumption may fail.

The present commit should fix this failure scenario, by ensuring that the backup thread is signalled before the terminating domain makes its domain-internal data available for reuse. We don't busy-wait on the backup thread at this point (unlike in the previous iteration a8e8ff3), so the backup thread may still be active when a new domain tries to install it again, but in this (rare) case it will be in the BT_TERMINATE state expected by `install_backup_thread`, which will busy-wait itself.
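To make the intended handshake concrete, here is a minimal standalone C model of the protocol described above; it is not the runtime code, and the `slot` type, the `BT_IDLE` value, and the function bodies are simplifications invented for this sketch:

```c
/* Hypothetical standalone model of the BT_TERMINATE handshake; NOT the
   OCaml runtime code, just the shape of the proposed fix. */
#include <stdatomic.h>
#include <stdio.h>

enum { BT_IDLE, BT_TERMINATE };   /* BT_IDLE is made up for this model */

struct slot { atomic_int backup_thread_msg; };

/* Terminating domain: publish BT_TERMINATE *before* the slot becomes
   reusable; do not wait for the backup thread to actually exit. */
static void terminate_domain(struct slot *s)
{
  atomic_store_explicit(&s->backup_thread_msg, BT_TERMINATE,
                        memory_order_release);
  /* in the runtime this is paired with caml_plat_signal() on the
     domain's condition variable to wake the backup thread */
}

/* New domain reusing the slot: a lingering backup thread is necessarily
   in BT_TERMINATE, so busy-wait until it has acknowledged and cleared
   the message before installing a fresh one. */
static void install_backup_thread(struct slot *s)
{
  while (atomic_load_explicit(&s->backup_thread_msg,
                              memory_order_acquire) == BT_TERMINATE)
    ;                              /* spin: termination is imminent */
  /* ... start the new backup thread here ... */
}

int main(void)
{
  struct slot s = { BT_IDLE };
  terminate_domain(&s);
  /* simulate the old backup thread clearing the message as it exits: */
  atomic_store_explicit(&s.backup_thread_msg, BT_IDLE,
                        memory_order_release);
  install_backup_thread(&s);
  puts("handshake completed");
  return 0;
}
```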
```c
caml_plat_unlock(&domain_self->domain_lock);

CAMLassert (domain_self->backup_thread_running);
/* CR gasche: why is this assertion correct in the case where
```
Isn't it because the main OCaml thread does not go through `domain_terminate`?
That would be my guess as well, thanks! But I understand that this could change in the future. Should we remove the assertion? (Or maybe it could or should become an `if (domain_self->backup_thread_running) { ... }` guard, which would also disable the BT_TERMINATE signalling.) In any case, it is currently odd that the comment now sitting shortly above contradicts this assertion.

(I wonder what happens if we start calling `domain_terminate` for the main domain as well. Do other things break, or is it supposed to work?)
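A minimal sketch of that guarded variant, assuming it would replace the assertion shown in the diff above (a fragment in the context of the surrounding runtime code, not a tested patch):

```c
/* Hypothetical replacement for the CAMLassert: only signal the backup
   thread to terminate if one is actually running for this domain. */
if (domain_self->backup_thread_running) {
  atomic_store_rel(&domain_self->backup_thread_msg, BT_TERMINATE);
  caml_plat_signal(&domain_self->domain_cond);
}
```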
I'm not sure this assertion is useful anymore, I'm in favor of removing it at this point.
If a thread goes through domain_terminate, then a backup thread must be running, and it feels weird to check for this invariant at termination.
For Domain 0 calling domain_terminate, I am not sure this is meant to happen. Domain 0 returning means all remaining threads will be shut down on exit.
However, I see issues coming when we re-implement proper cleanup at exit, where this dual path of cleaning and exiting may come back to bite us.
(And, while not quite relevant here: a single remaining domain 0 does not mean it has no backup thread. It may very well be that all other domains have shut down, with domain 0 retaining its backup thread afterwards.)
cc @Engil and @kayceesrk (who worked on this aspect of the code).

One aspect where this PR is lacking is on the testing front: I haven't tried to reproduce the CI failure on my machine (and I doubt it would work as-is), so I can't tell whether this really was the issue and whether this PR solves it. (In theory it was and it does, but that is only the theory.)

Given that apparently this bug was already noticed during multicore-ocaml development ( #11800 (comment) ), it would be nice to think of a reliable way to have a regression test. Any suggestions? (One experiment I can make is to introduce a large pause at the end of domain termination, after …)
I think this is quite a fun idea: we could introduce such delays in the debug runtime, and it could perhaps make it easier to spot lifecycle regressions with tight timing triggers?
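A hedged sketch of what such a debug-runtime hook could look like; the macro name, the `DEBUG` guard, and the 10ms figure are all invented for illustration:

```c
/* Hypothetical delay-injection hook for the debug runtime: widen narrow
   race windows so lifecycle bugs fire more reliably under test.
   CAML_DEBUG_RACE_PAUSE does not exist in the tree; it is illustrative. */
#ifdef DEBUG
#include <unistd.h>
#define CAML_DEBUG_RACE_PAUSE() usleep(10 * 1000)  /* 10ms pause */
#else
#define CAML_DEBUG_RACE_PAUSE() ((void)0)
#endif
```

Sprinkling such a pause into the delicate part of domain termination would stretch step 3 of the scenario above, giving a new domain more time to reuse the slot.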
I tried to add a pause in various places using

```c
caml_plat_spin_wait(
  10*1000*1000 /* 10ms */,
  __FILE__, __LINE__, __func__);
```

but I didn't get any luck triggering the bug after that -- I tried to run the …
Looking at the CI, I feel like the fix is not quite it; there's something more subtle to be taken into account, and I was wrong in my assumption that it was probably OK to move BT termination here. I feel like we may now be releasing the backup thread too early, and either that is bad in itself or there's another bug lurking around, but it is bedtime for me.
```c
   will not have been started */
atomic_store_rel(&domain_self->backup_thread_msg, BT_TERMINATE);
caml_plat_signal(&domain_self->domain_cond);
caml_plat_unlock(&domain_self->domain_lock);
```
This is releasing `domain_self->domain_lock` too early compared to trunk. Should we leave the unlock where it was and only move the other two lines here?
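A hedged sketch of that reordering, keeping the unlock at its original (later) position; illustrative only, not a tested patch:

```c
/* Hoist only the signalling; the unlock stays where trunk had it, so
   the remaining termination steps still run while holding domain_lock. */
atomic_store_rel(&domain_self->backup_thread_msg, BT_TERMINATE);
caml_plat_signal(&domain_self->domain_cond);
/* ... rest of termination, still under domain_lock ... */
caml_plat_unlock(&domain_self->domain_lock);
```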
Also, I would like to recreate this failure under rr and make sure that the failure is indeed due to the reason we think it is.
@kayceesrk I agree that releasing the domain lock is quite probably the cause of the testsuite failure here. (But then I tried to understand more precisely the protocol around the domain lock and I got a bit lost.)
Let me try to explain what is bugging me about Lines 1753 to 1758 in 97aa649 …

On the other hand, Lines 583 to 588 in 97aa649 … and the call to Lines 1081 to 1082 in 97aa649 …

Naively this suggests that the race scenario I described above, when creating this PR, cannot actually happen: a new domain reusing the same …

But: there are other uses of the domain lock, and it may be that something funny happens. (Could something in …)

I must say that it's not so easy to follow the use of the various locking mechanisms in domain.c and the invariants they should protect. For example the backup thread sometimes waits on an interruptor-related condition variable …

Side question: online documentation on condition variables suggests to use a …

(I'm at the point where I realize that I don't understand much of what is going on here, and I am thinking of closing the PR to go back to the issue discussing the assert failure.)
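For context on that side question, here is the standard condition-variable pattern that such documentation typically describes; a generic pthread sketch with placeholder names (`m`, `c`, `predicate`), not the runtime's code:

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static bool predicate = false;   /* the condition being waited on */

static void wait_for_predicate(void)
{
  pthread_mutex_lock(&m);
  while (!predicate)             /* loop: wakeups may be spurious */
    pthread_cond_wait(&c, &m);   /* atomically unlocks m while waiting */
  pthread_mutex_unlock(&m);
}
```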
I am happy to take a look at clarifying/reorganizing the runtime code around domain termination for readability; this can be done elsewhere. For this issue, I do think that we should understand and reproduce the assertion failure before we can propose a solution, so closing this PR is the right thing to do.