Audit and fix caml_plat_lock_blocking usage #13227
Conversation
Could you consider rebasing this on top of your recent changes to #12410?
Force-pushed from 44f728c to 9b8051c.
It was not so quick, but I have rebased this PR on top of trunk (removing the changes that depended on #12410), so it can now be reviewed. I have updated the description.
Force-pushed from 1ed7f7e to e4ad75d.
Since people subscribed to this issue or to all issues might not get notifications for the change in the description: I now think this fixes at least one deadlock.
```c
    return NULL;
}

CAMLexport void caml_iterate_named_values(caml_named_action f)
```
The deadlock fixed would be here, right?
I would not put it like this, given that a deadlock involves at least two locations. But yes, the constraints of caml_plat_lock_blocking are probably too strong and specific to be placed on a caml_named_action supplied by the user.
The uncertainty is what users actually pass as the argument in practice (it is not even clear from a GitHub search that anyone uses this at all).
```c
  }
  if (Is_block(action)) {
    /* Speculatively allocate this so we don't hold the lock for
       a GC */
```
The previous code was careful not to hold the lock across a caml_alloc call. If I understand correctly, you are pointing out that doing so is safe (caml_alloc checks for pending STW but does not execute asynchronous callbacks, which is the job of caml_process_pending_actions), and that it results in simpler code for an infrequent (and therefore not performance-sensitive) operation. Is that right?
When using caml_plat_lock_blocking, one must be careful not to trigger any STW from the mutator (lock inversion). But caml_plat_lock_non_blocking does not have this constraint, so we can call caml_alloc.
On the other hand, caml_plat_lock_non_blocking releases the domain lock, so the reasoning based on holding the domain lock does not work in this function; that is why we have to move it upwards.
I think the end result is simpler to understand (the actual critical section is now what's between the lock and the unlock).
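To make the difference concrete, here is a simplified sketch of the two locking disciplines. It is only an assumption about their general shape for illustration, not the actual runtime sources, and the sketch_* names are made up:

```c
#include <pthread.h>
#include <caml/threads.h>   /* caml_release/acquire_runtime_system */

typedef pthread_mutex_t caml_plat_mutex;   /* as in runtime/caml/platform.h */

/* Blocking variant: waits while keeping the domain lock, so neither the
   wait nor the critical section may ever require a STW. */
static void sketch_lock_blocking(caml_plat_mutex *m)
{
  pthread_mutex_lock(m);
}

/* Non-blocking variant: releases the domain lock while waiting, so a
   concurrent STW can proceed.  The real implementation is more careful,
   e.g. it must not run asynchronous callbacks while holding the mutex. */
static void sketch_lock_non_blocking(caml_plat_mutex *m)
{
  if (pthread_mutex_trylock(m) == 0) return;   /* uncontended fast path */
  caml_release_runtime_system();               /* give up the domain lock */
  pthread_mutex_lock(m);                       /* wait without blocking a STW */
  caml_acquire_runtime_system();               /* take the domain lock back */
}
```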
```c
  // at the same instant: snapshot user_events list and set
  // runtime_events_enabled to 1

  /* calling from STW */
```
Note: I'm not sure I see the deadlock mentioned in this commit message. I thought about cases where a mutator would try to take the lock while it is already held by a STW section. But runtime_events_create_from_stw_single is invoked within a global barrier inside the STW section, so the STW mechanism already excludes mutators from running at the same time. Where is the potential issue?
One mutator is blocked waiting for the lock (and, being blocked, it cannot reach a safepoint), while another mutator holds the lock and waits for a STW, which cannot start until the first mutator joins it.
The STW case above is a red herring (but the comment clarifies why we must use blocking rather than non_blocking).
```diff
-  caml_plat_lock_blocking(&user_events_lock);
+  caml_plat_lock_non_blocking(&user_events_lock);
```
In this critical section we call caml_alloc_small, and therefore we cannot use caml_plat_lock_blocking.
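A minimal sketch of the resulting pattern, assuming a hypothetical primitive and mutex (this is not the actual events.c code): because every acquisition of this mutex uses the non-blocking variant, it is safe to allocate while holding it.

```c
#include <caml/mlvalues.h>
#include <caml/memory.h>
#include <caml/alloc.h>
#include "caml/platform.h"   /* internal runtime header: caml_plat_* */

/* Hypothetical mutex, initialised elsewhere with caml_plat_mutex_init. */
static caml_plat_mutex example_events_lock;

/* Hypothetical primitive that records an event under the lock. */
CAMLprim value caml_example_register_event(value v_name)
{
  CAMLparam1(v_name);
  CAMLlocal1(ev);
  /* Non-blocking: if we have to wait, the domain lock is released,
     so a concurrent STW is never blocked on this thread. */
  caml_plat_lock_non_blocking(&example_events_lock);
  /* Allocating inside the critical section is fine here: no other
     thread blocks on this mutex while refusing to service a STW. */
  ev = caml_alloc_small(2, 0);
  Field(ev, 0) = v_name;
  Field(ev, 1) = Val_int(0);
  caml_plat_unlock(&example_events_lock);
  CAMLreturn(ev);
}
```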
gasche left a comment:
With the explanations and the kind reminders on the various invariants, I am convinced that this PR is an improvement. Thanks!
(Calling caml_plat_lock_non_blocking is probably costlier than caml_plat_lock_blocking for short critical sections, so I initially wondered about the choice of using non_blocking by default; but the operations where the change was made are called rarely, so this is just fine, and I agree it is safer.)
But I have not found uses of caml_iterate_named_values in the wild to evaluate the actual impact.
We apply the rule of using the non-blocking version when calling from the mutator. The present change does not fix potential bugs, AFAIU.
Force-pushed from e4ad75d to a3c0dd0.
I just rebased. Could this delicate code have a second review from an expert in synchronisation in the runtime? Otherwise please merge if deemed unnecessary.
I believe that this is okay and we haven't had expressions of interest from other potential reviewers, so if the CI is still okay I will go ahead and merge.
I could not understand the AppVeyor failure. Is this known to be unreliable?
Yes, AppVeyor is currently broken on all PRs.
Thanks!
```diff
 /* Don't run concurrently with caml_ml_out_channels_list that may resurrect
    a dead channel. */
-  caml_plat_lock_blocking(&caml_all_opened_channels_mutex);
+  caml_plat_lock_non_blocking(&caml_all_opened_channels_mutex);
```
Indeed, one cannot use non_blocking here.
But we cannot just use the blocking variant at this one site: all acquisitions of this mutex have to be turned into the blocking variant. This is because the finalizer does not run inside a STW either; it can run in parallel with mutators (when finalizing young custom blocks). So there is another special case, for locks held during custom finalizers.
Actually, as you can see, this is the reason why caml_ml_out_channels_list was written in a roundabout way:
ocaml-multicore/ocaml-multicore@30622d7
Here is a patch that documents what happens: gadmm@0dec8b4
This is unsatisfactory because the critical section inside caml_ml_out_channels_list is not guaranteed to be short.
I don't follow your explanation of why other mutators cannot use the non-blocking variant for the same mutex. Can you clarify?
(A priori I don't see a problem if caml_plat_lock_blocking is called on a mutex by a finalizer, in parallel with caml_plat_lock_non_blocking being called on it by a mutator. I would expect that one of them wins and the other waits. If the non_blocking call waits, it can transfer control in the meantime, but I don't see where the problem lies with that.)
I am thinking of something like:
- The finalizer (blocking) holds the domain lock D0 and waits for the mutex M.
- In parallel, the mutator T1 (non-blocking) holds the mutex M and waits for the domain lock D1.
- In parallel, the mutator T2 on D1 holds the domain lock D1 and triggers a STW, waiting for the finalizer to complete so that a new STW can begin.
If I understand correctly, the problem is that caml_plat_lock_non_blocking(m) has a lock order (in the hard path) where it first takes m, then takes the domain lock. This is fundamentally incompatible with parallel code taking m while already holding the domain lock, which is what happens in the typical caml_plat_lock_blocking(m) scenario.
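Stated as code (a sketch with hypothetical thread bodies and a hypothetical mutex, not runtime sources), mixing the two variants on one mutex produces an inverted lock order between the mutex and the domain lock:

```c
#include "caml/platform.h"   /* internal runtime header: caml_plat_* */

/* Hypothetical mutex taken in both modes, initialised elsewhere. */
static caml_plat_mutex m;

/* Lock order of the non-blocking caller (slow path):
   first the mutex m, then (re-)acquire the domain lock. */
static void mutator_side(void)
{
  caml_plat_lock_non_blocking(&m);   /* takes m, then waits for the domain lock */
  /* ... critical section, may use the runtime ... */
  caml_plat_unlock(&m);
}

/* Lock order of a blocking caller that already holds the domain lock
   (e.g. a custom-block finalizer): domain lock first, then m. */
static void finalizer_side(void)
{
  caml_plat_lock_blocking(&m);       /* waits for m while keeping the domain lock */
  /* ... critical section, must not use the runtime ... */
  caml_plat_unlock(&m);
}
```

The two call sites order the same pair of locks in opposite ways, which is the classic recipe for a deadlock once a STW prevents either side from backing out.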
(user_events_lock in runtime/events.c gets locked both in blocking and non-blocking mode.)
So maybe we should distinguish two different kinds of mutexes:
- cooperative locks: one must own the runtime to take them, and their critical sections can use the runtime (they are C counterparts of OCaml-level mutexes);
- non-cooperative locks (critical locks?): taking them is valid without owning the runtime, and their critical sections cannot use the runtime.
Ideally we would have two C types to distinguish them; a sketch of the idea follows. What do you think?
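A rough sketch of what the two C types could look like (all names here are hypothetical; the actual interface was only settled later in the follow-up work):

```c
#include "caml/platform.h"   /* internal runtime header: caml_plat_* */

/* Non-cooperative ("critical") lock: may be taken without owning the
   runtime; its critical sections must not use the runtime. */
typedef struct { caml_plat_mutex mutex; } example_runtime_lock;

/* Cooperative lock: must be taken while owning the runtime; blocking on
   it releases the domain lock, and its critical sections may use the
   runtime (allocate, raise, etc.). */
typedef struct { caml_plat_mutex mutex; } example_mutator_lock;

static void example_runtime_lock_acquire(example_runtime_lock *l)
{
  caml_plat_lock_blocking(&l->mutex);
}

static void example_mutator_lock_acquire(example_mutator_lock *l)
{
  caml_plat_lock_non_blocking(&l->mutex);
}
```

Because the two wrappers are distinct C types, the compiler rejects code that takes the same mutex in both modes, which is exactly the mistake discussed above.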
I think that a good first step would be to perform the minimal fix for #13713. I agree with your gadmm@0dec8b4 now that I understand things better, but I suppose that you might want to reword the documentation comments a bit (or maybe drop that part for now, if a longer-term fix is also coming later). Would you be willing to submit this as a PR?
Fix submitted at #13714. I think it is a full fix for this PR. I did consider having two types of mutexes, but this would be a larger change, and it is still legal to call both caml_plat_lock_non_blocking and caml_plat_lock_blocking from the mutator if not holding the domain lock during the latter.
One lesson is that we could have run the multicore tests with the label …
I implemented a refactoring that separates the lock types into two; a first attempt: gadmm/ocaml@caml_plat_lock_non_blocking3...gasche:ocaml:caml_plat_lock_non_blocking4 What do you think? My problem with the current approach is that it relies on a bit too much expertise: as we just found out, it is easy to get things wrong. Just the documentation of what is and is not allowed is a mouthful. I hope that, with two separate types, it will be easier to document and easier for non-experts to use.
(Instead of having …
Now that I no longer see a subtlety with the interaction with the STW, I think this is an improvement, so please feel free to open a PR so we can comment on the details.
Also, you could test for …
I pushed a new iteration that takes your remarks into account, and also moves the non-blocking code to sync.h. What do you think?
I think it is worth opening a PR for further comments.
Done in #13716. |
- 'runtime' mutexes are for blocking critical sections which do not access the runtime.
- 'mutator' mutexes are for non-blocking critical sections (blocking on a mutex releases the runtime lock) which may access the runtime system.

This refactoring comes from the discussions in ocaml#13227; it tries to avoid a class of bug where the same mutex is used in both blocking and non-blocking fashion, resulting in subtle deadlock situations.

Reviewed-by: Antonin Décimo <[email protected]>
Reviewed-by: Guillaume Munch-Maccagnoni <[email protected]>
Reviewed-by: Miod Vallat <[email protected]>
This PR audits and fixes the usage of caml_plat_lock_blocking in the runtime. One, maybe two, causes of deadlocks are fixed, in runtime events and named values.

In two places (io.c, signals.c), I make proper use of caml_plat_lock_non_blocking even though there is no indication that the code used to be buggy. This is to enforce a rule that caml_plat_lock_non_blocking should be used in general when locking from the mutator. The two corresponding commits can be removed from this PR if there is no consensus on this rule.

In other places, I document why one cannot use caml_plat_lock_non_blocking (and therefore why one must be careful about what is placed in the critical section).

This is best reviewed commit by commit.