
Conversation

gadmm (Contributor) commented Jun 9, 2024

This PR audits and fixes the usage of caml_plat_lock_blocking in the runtime. One, maybe two, causes of deadlocks are fixed, in runtime events and named values.

In two places (io.c, signals.c), I make proper use of caml_plat_lock_non_blocking even though there is no indication that the code used to be buggy. This enforces a rule that caml_plat_lock_non_blocking should be used in general when locking from the mutator. The two corresponding commits can be removed from this PR if there is no consensus on this rule.

In other places, I document why one cannot use caml_plat_lock_non_blocking (and therefore one must be careful about what is placed in the critical section).

This is best reviewed commit-per-commit.

@gadmm gadmm marked this pull request as draft June 9, 2024 16:49
MisterDA (Contributor) commented Jul 5, 2024

Could you consider rebasing this on top of your recent changes to #12410?

gadmm force-pushed the caml_plat_lock_non_blocking2 branch from 44f728c to 9b8051c on December 11, 2024 16:31
@gadmm gadmm marked this pull request as ready for review December 11, 2024 16:31
gadmm (Contributor, Author) commented Dec 11, 2024

It was not so quick but I have rebased this PR on top of trunk (removing the changes that depended on #12410). So it can now be reviewed. I have updated the description.

gadmm force-pushed the caml_plat_lock_non_blocking2 branch 2 times, most recently from 1ed7f7e to e4ad75d on December 11, 2024 17:18
gadmm (Contributor, Author) commented Dec 12, 2024

Since people subscribed to this issue or to all issues might not get notifications for the change in description: I now think this fixes at least one deadlock.

return NULL;
}

CAMLexport void caml_iterate_named_values(caml_named_action f)
Member:

The deadlock fixed would be here, right?

Contributor Author:

I would not say it like this given that a deadlock involves at least two locations. But yes, the constraints of caml_plat_lock_blocking are probably too strong and specific to be placed on a caml_named_action supplied by the user.

The uncertainty is what users actually pass as an argument in practice (from a GitHub search, it is not even clear that anyone uses this at all).

}
if (Is_block(action)) {
/* Speculatively allocate this so we don't hold the lock for
a GC */
Member:

The previous code was careful to not hold the lock across a caml_alloc call. If I understand correctly, you are pointing out doing so is safe (caml_alloc checks for pending STW but it does not execute asynchronous callbacks, this is the job of caml_process_pending_actions), and results in simpler code for an infrequent (and therefore not performance-sensitive) operation. Is that right?

gadmm (Contributor, Author) commented Dec 12, 2024

caml_plat_lock_blocking must be careful not to invoke any STW from the mutator (lock inversion). But caml_plat_lock_non_blocking does not have this constraint so we can call caml_alloc.

On the other hand, caml_plat_lock_non_blocking releases the domain lock so the reasoning based on holding the domain lock does not work in this function, that is why we have to move it upwards.

I think the end result is simpler to understand (the actual critical section is now what's between lock and unlock).


// at the same instant: snapshot user_events list and set
// runtime_events_enabled to 1
/* calling from STW */
Member:

Note: I'm not sure I see the deadlock mentioned in this commit message. I thought about cases where a mutator would try to take the lock, but it is already held by a STW section. But the runtime_events_create_from_stw_single is invoked within a global barrier within the STW section, so the STW mechanism already excludes mutators from running at the same time. Where is the potential issue?

Contributor Author:

One mutator waiting for the lock, another waiting for a STW while holding the lock.

The STW case above is a red herring (but the comment clarifies why we must use blocking rather than non_blocking).



- caml_plat_lock_blocking(&user_events_lock);
+ caml_plat_lock_non_blocking(&user_events_lock);
Contributor Author:

In this critical section we call caml_alloc_small, and therefore we cannot use caml_plat_lock_blocking.

gasche (Member) left a comment

With the explanations and the kind reminders on the various invariants, I am convinced that this PR is an improvement. Thanks!

(Calling caml_plat_lock_non_blocking is probably costlier/less efficient than caml_plat_lock_blocking for short critical sections, so I initially wondered about the choice of using non_blocking by default, but in fact the operations where the change was made are called rarely so this is just fine, and I agree it is safer.)

gadmm added 7 commits January 6, 2025 13:49
But have not found uses of caml_iterate_named_values in the wild to
evaluate the actual impact.
We apply the rule of using the non-blocking version when calling from
the mutator. The present change does not fix potential bugs AFAIU.
gadmm force-pushed the caml_plat_lock_non_blocking2 branch from e4ad75d to a3c0dd0 on January 6, 2025 12:51
gadmm (Contributor, Author) commented Jan 6, 2025

I just rebased. Could this delicate code have a second review from an expert in synchronisation in the runtime? Otherwise, please merge if a second review is deemed unnecessary.

gasche (Member) commented Jan 6, 2025

I believe that this is okay, and we haven't had expressions of interest from other potential reviewers, so if the CI is still okay I will go ahead and merge.

@gasche gasche merged commit aaf378e into ocaml:trunk Jan 6, 2025
20 of 21 checks passed
gadmm (Contributor, Author) commented Jan 6, 2025

I could not understand the AppVeyor failure. Is this known to be unreliable?

gasche (Member) commented Jan 6, 2025

Yes, AppVeyor is currently broken on all PRs.

gadmm (Contributor, Author) commented Jan 6, 2025

Thanks!

/* Don't run concurrently with caml_ml_out_channels_list that may resurrect
   a dead channel. */
- caml_plat_lock_blocking(&caml_all_opened_channels_mutex);
+ caml_plat_lock_non_blocking(&caml_all_opened_channels_mutex);
Member:

@gadmm (re-reading this due to #13713): caml_finalize_channel is used as the finalization function of a custom block, is it really correct to call the non_blocking variant here?

Contributor Author:

Indeed, one cannot use non_blocking here.

But we cannot just use the blocking variant at this one call site: all uses of this mutex have to be turned into the blocking variant. This is because the finalizer does not run inside a STW section either; it can run in parallel with mutators (when finalizing young custom blocks). So there is another special case, for locks held during custom finalizers.

Actually as you can see this is the reason why caml_ml_out_channels_list was written in a roundabout way:
ocaml-multicore/ocaml-multicore@30622d7

Here is a patch, that documents what happens: gadmm@0dec8b4

This is unsatisfactory because the critical section inside caml_ml_out_channels_list is not guaranteed to be short.

Member:

I don't follow your explanation of why other mutators cannot use the non-blocking variant for the same mutex. Can you clarify?

Member:

(A priori I don't see a problem if caml_plat_blocking is called by a finalizer in a mutex, in parallel with caml_plat_non_blocking being called by a mutator. I would expect that one of them wins, and the other waits. If the non_blocking call waits it can transfer control in the meantime, but I don't see where the problem lies with that.)

gadmm (Contributor, Author) commented Jan 7, 2025

I am thinking of something like:
The finalizer (blocking) holds the domain lock D0, and waits for the mutex M. In parallel, the mutator (T1) (non-blocking) holds the mutex M, and waits for the domain lock D1. In parallel, the mutator (T2) on D1 holds the domain lock D1 and triggers a STW, waiting for the finalizer to complete so that a new STW can begin.

Member:

If I understand correctly, the problem is that caml_plat_lock_non_blocking(m) has a lock order (in the hard path) where it first takes m, then takes the domain lock. This is fundamentally incompatible with parallel code taking m while already holding the domain lock, which is what happens in the typical caml_plat_lock_blocking(m) scenario.

Member:

(user_events_lock in runtime/events.c gets locked both in blocking and non-blocking mode.)

Member:

So maybe we should distinguish two different kinds of mutexes:

  • cooperative locks: one must own the runtime to take them, and their critical sections can use the runtime (they are C counterparts of OCaml-level mutexes)
  • non-cooperative locks (critical locks?): taking them is valid without owning the runtime, and their critical sections cannot use the runtime

Ideally we would have two C types to distinguish them, what do you think?

Member:

I think that a good first step would be to perform the minimal fix for #13713. I agree with your gadmm@0dec8b4 now that I understand things better, but I suppose that you might want to reword the documentation comments a bit (or maybe drop that part for now, if a longer-term fix is also coming later). Would you be willing to submit this as a PR?

Contributor Author:

Fix submitted at #13714. I think it is a full fix for this PR. I did consider having two types of mutexes, but this would be a larger change, and it is still legal to call both caml_plat_lock_non_blocking and caml_plat_lock_blocking from the mutator if not holding the domain lock during the latter.

gadmm (Contributor, Author) commented Jan 7, 2025

One lesson is that we could have run the multicore tests with the label run-multicoretests, as explained by @jmid. In particular, there is no need for regression tests in the fixing PR, as they are already present.

gasche (Member) commented Jan 7, 2025

I implemented a refactoring that separates the lock types in two -- a first attempt:

gadmm/ocaml@caml_plat_lock_non_blocking3...gasche:ocaml:caml_plat_lock_non_blocking4

What do you think?

My problem with the current approach is that it relies on a bit too much expertise: as we just found out, it is easy to get things wrong. Just the documentation of what is and is not allowed is a mouthful. I hope that, with two separate types, it will be easier to document and easier for non-experts to use.

gasche (Member) commented Jan 7, 2025

(Instead of having coop_ functions in platform, I think it would be rather natural to have them in sync as a C counterpart of caml_ml_mutex which has almost the same logic. I'm willing to iterate on this if we think that it is worth it.)

gadmm (Contributor, Author) commented Jan 7, 2025

Now that I no longer see a subtlety with the interaction with the STW, I think this is an improvement, so please feel free to open a PR so we can comment on the details.
In terms of terminology, the term "cooperative" does not make sense to me, and I find "lock" confusing for a structure that contains a mutex. You have high-rank mutexes (higher than the domain mutex: they can be acquired while the domain lock is held) and low-rank mutexes (lower than the domain mutex: they require releasing the domain lock before being acquired). Note that this terminology is the inverse of the intuition in terms of expressiveness: "low-level" for the former (used by the runtime) and "high-level" for the latter (usable in OCaml contexts).

gadmm (Contributor, Author) commented Jan 7, 2025

Also, you could test for Caml_state_opt before releasing the domain lock in the non_blocking long path. This would make it callable regardless of whether the domain lock is held, avoiding the need to add a new function. I get the point that this capability is not yet used in the runtime.

gasche (Member) commented Jan 7, 2025

I pushed a new iteration that takes your remarks into account, and also moves the non-blocking code to sync.h. What do you think?

gadmm (Contributor, Author) commented Jan 7, 2025

I think it is worth opening a PR for further comments.

gasche added a commit to gasche/ocaml that referenced this pull request Jan 8, 2025
- 'runtime' mutexes are for blocking critical sections which do not
   access the runtime.

- 'mutator' mutexes are for non-blocking critical sections (blocking on
   a mutex releases the runtime lock) which may access the runtime
   system.

This refactoring comes from the discussions in ocaml#13227, it tries to avoid
a class of bug where the same mutex is used in both blocking and
non-blocking fashion, resulting in subtle deadlock situations.
gasche (Member) commented Jan 8, 2025

Done in #13716.

gasche added a commit to gasche/ocaml that referenced this pull request May 18, 2025

Reviewed-by: Antonin Décimo <[email protected]>
Reviewed-by: Guillaume Munch-Maccagnoni <[email protected]>
Reviewed-by: Miod Vallat <[email protected]>