
Conversation

gadmm (Contributor) commented Jun 9, 2024

This PR audits and fixes the usage of caml_plat_lock_blocking in the runtime. One, maybe two, causes of deadlocks are fixed, in runtime events and named values.

In two places (io.c, signals.c), I make proper use of caml_plat_lock_non_blocking even though there is no indication that the code used to be buggy. This enforces a rule that caml_plat_lock_non_blocking should be used in general when locking from the mutator. The two corresponding commits can be removed from this PR if there is no consensus on this rule.

In other places, I document why one cannot use caml_plat_lock_non_blocking (and therefore one must be careful about what is placed in the critical section).

This is best reviewed commit-per-commit.

@gadmm gadmm marked this pull request as draft June 9, 2024 16:49
MisterDA (Contributor) commented Jul 5, 2024

Could you consider rebasing this on top of your recent changes to #12410?

gadmm force-pushed the caml_plat_lock_non_blocking2 branch from 44f728c to 9b8051c on December 11, 2024 16:31
@gadmm gadmm marked this pull request as ready for review December 11, 2024 16:31
gadmm (Contributor, Author) commented Dec 11, 2024

It was not so quick but I have rebased this PR on top of trunk (removing the changes that depended on #12410). So it can now be reviewed. I have updated the description.

gadmm force-pushed the caml_plat_lock_non_blocking2 branch 2 times, most recently from 1ed7f7e to e4ad75d on December 11, 2024 17:18
gadmm (Contributor, Author) commented Dec 12, 2024

Since people subscribed to this issue or to all issues might not get notifications for the change in description: I now think this fixes at least one deadlock.

return NULL;
}

CAMLexport void caml_iterate_named_values(caml_named_action f)
Member:

The deadlock fixed would be here, right?

Contributor Author:

I would not say it like this given that a deadlock involves at least two locations. But yes, the constraints of caml_plat_lock_blocking are probably too strong and specific to be placed on a caml_named_action supplied by the user.

The uncertainty is what users actually pass as an argument in practice (from a GitHub search, it is not even clear that anyone uses this at all).

}
if (Is_block(action)) {
/* Speculatively allocate this so we don't hold the lock for
a GC */
Member:

The previous code was careful to not hold the lock across a caml_alloc call. If I understand correctly, you are pointing out doing so is safe (caml_alloc checks for pending STW but it does not execute asynchronous callbacks, this is the job of caml_process_pending_actions), and results in simpler code for an infrequent (and therefore not performance-sensitive) operation. Is that right?

gadmm (Contributor, Author) commented Dec 12, 2024

caml_plat_lock_blocking must be careful not to invoke any STW from the mutator (lock inversion). But caml_plat_lock_non_blocking does not have this constraint so we can call caml_alloc.

On the other hand, caml_plat_lock_non_blocking releases the domain lock so the reasoning based on holding the domain lock does not work in this function, that is why we have to move it upwards.

I think the end result is simpler to understand (the actual critical section is now what's between lock and unlock).


// at the same instant: snapshot user_events list and set
// runtime_events_enabled to 1
/* calling from STW */
Member:

Note: I'm not sure I see the deadlock mentioned in this commit message. I thought about cases where a mutator would try to take the lock, but it is already held by a STW section. But the runtime_events_create_from_stw_single is invoked within a global barrier within the STW section, so the STW mechanism already excludes mutators from running at the same time. Where is the potential issue?

Contributor Author:

One mutator waiting for the lock, another waiting for a STW while holding the lock.

The STW case above is a red herring (but the comment clarifies why we must use blocking rather than non_blocking).



- caml_plat_lock_blocking(&user_events_lock);
+ caml_plat_lock_non_blocking(&user_events_lock);
Contributor Author:

In this critical section we call caml_alloc_small, and therefore we cannot use caml_plat_lock_blocking.

gasche (Member) left a comment

With the explanations and the kind reminders on the various invariants, I am convinced that this PR is an improvement. Thanks!

(Calling caml_plat_lock_non_blocking is probably costlier/less efficient than caml_plat_lock_blocking for short critical sections, so I initially wondered about the choice of using non_blocking by default, but in fact the operations where the change was made are called rarely so this is just fine, and I agree it is safer.)

gadmm added 7 commits January 6, 2025 13:49
But have not found uses of caml_iterate_named_values in the wild to
evaluate the actual impact.
We apply the rule of using the non-blocking version when calling from
the mutator. The present change does not fix potential bugs AFAIU.
gadmm force-pushed the caml_plat_lock_non_blocking2 branch from e4ad75d to a3c0dd0 on January 6, 2025 12:51
gadmm (Contributor, Author) commented Jan 6, 2025

I just rebased. Could this delicate code have a second review from an expert in synchronisation in the runtime? Otherwise, please merge if a second review is deemed unnecessary.

gasche (Member) commented Jan 6, 2025

I believe that this is okay, and we haven't had expressions of interest from other potential reviewers, so if the CI is still okay I will go ahead and merge.

@gasche gasche merged commit aaf378e into ocaml:trunk Jan 6, 2025
20 of 21 checks passed
gadmm (Contributor, Author) commented Jan 6, 2025

I could not understand the AppVeyor failure. Is this known to be unreliable?

gasche (Member) commented Jan 6, 2025

Yes, AppVeyor is currently broken on all PRs.

gadmm (Contributor, Author) commented Jan 6, 2025

Thanks!

/* Don't run concurrently with caml_ml_out_channels_list that may resurrect
   a dead channel. */
- caml_plat_lock_blocking(&caml_all_opened_channels_mutex);
+ caml_plat_lock_non_blocking(&caml_all_opened_channels_mutex);
Member:

@gadmm (re-reading this due to #13713): caml_finalize_channel is used as the finalization function of a custom block, is it really correct to call the non_blocking variant here?

Contributor Author:

Indeed, one cannot use non_blocking here.

But we cannot just use the blocking variant at this one call site: all uses of this mutex have to be turned into the blocking variant. This is because the finalizer does not run inside a STW section either; it can run in parallel with mutators (when finalizing young custom blocks). So there is another special case, for locks held during custom finalizers.

Actually as you can see this is the reason why caml_ml_out_channels_list was written in a roundabout way:
ocaml-multicore/ocaml-multicore@30622d7

Here is a patch, that documents what happens: gadmm@0dec8b4

This is unsatisfactory because the critical section inside caml_ml_out_channels_list is not guaranteed to be short.

Member:

I don't follow your explanation of why other mutators cannot use the non-blocking variant for the same mutex. Can you clarify?

Member:

(A priori I don't see a problem if caml_plat_blocking is called by a finalizer in a mutex, in parallel with caml_plat_non_blocking being called by a mutator. I would expect that one of them wins, and the other waits. If the non_blocking call waits it can transfer control in the meantime, but I don't see where the problem lies with that.)

gadmm (Contributor, Author) commented Jan 7, 2025

I am thinking of something like:
The finalizer (blocking) holds the domain lock D0, and waits for the mutex M. In parallel, the mutator (T1) (non-blocking) holds the mutex M, and waits for the domain lock D1. In parallel, the mutator (T2) on D1 holds the domain lock D1 and triggers a STW, waiting for the finalizer to complete so that a new STW can begin.

Member:

If I understand correctly, the problem is that caml_plat_lock_non_blocking(m) has a lock order (in the hard path) where it first takes m, then takes the domain lock. This is fundamentally incompatible with parallel code taking m while already holding the domain lock, which is what happens in the typical caml_plat_lock_blocking(m) scenario.

Member:

(user_events_lock in runtime/events.c gets locked both in blocking and non-blocking mode.)

Member:

So maybe we should distinguish two different kinds of mutexes:

  • cooperative locks: one must own the runtime to take them, and their critical sections can use the runtime (they are C counterparts of OCaml-level mutexes)
  • non-cooperative locks (critical locks?): taking them is valid without owning the runtime, and their critical sections cannot use the runtime

Ideally we would have two C types to distinguish them, what do you think?

Member:

I think that a good first step would be to perform the minimal fix for #13713. I agree with your gadmm@0dec8b4 now that I understand things better, but I suppose that you might want to reword the documentation comments a bit (or maybe drop that part for now, if a longer-term fix is also coming later). Would you be willing to submit this as a PR?

Contributor Author:

Fix submitted at #13714. I think it is a full fix for this PR. I did consider having two types of mutexes, but this would be a larger change, and it is still legal to call both caml_plat_lock_non_blocking and caml_plat_lock_blocking from the mutator if not holding the domain lock during the latter.

gadmm (Contributor, Author) commented Jan 7, 2025

One lesson is that we could have run the multicore tests with the label run-multicoretests, as explained by @jmid. In particular, there is no need for regression tests in the fixing PR, as they are already present.

gasche (Member) commented Jan 7, 2025

I implemented a refactoring that separates the lock types in two -- a first attempt:

gadmm/ocaml@caml_plat_lock_non_blocking3...gasche:ocaml:caml_plat_lock_non_blocking4

What do you think?

My problem with the current approach is that it relies on a bit too much expertise: as we just found out, it is easy to get things wrong. Just the documentation of what is and is not allowed is a mouthful. I hope that, with two separate types, it will be easier to document and easier for non-experts to use.

gasche (Member) commented Jan 7, 2025

(Instead of having coop_ functions in platform, I think it would be rather natural to have them in sync as a C counterpart of caml_ml_mutex which has almost the same logic. I'm willing to iterate on this if we think that it is worth it.)

gadmm (Contributor, Author) commented Jan 7, 2025

Now that I no longer see a subtlety with the interaction with the STW, I think this is an improvement, so please feel free to open a PR so we can comment on the details.
In terms of terminology, the term "cooperative" does not make sense to me, and I find "lock" confusing for a structure that contains a mutex. You have high-rank mutexes (higher than the domain mutex: they can be acquired while the domain lock is held) and low-rank mutexes (lower than the domain mutex: they require releasing the domain lock before being acquired). Note that this terminology is the inverse of the intuition in terms of expressiveness: "low-level" for the former (used by the runtime) and "high-level" for the latter (usable in OCaml contexts).

gadmm (Contributor, Author) commented Jan 7, 2025

Also, you could test for Caml_state_opt before releasing the domain lock in the non_blocking long path. This would make it callable regardless of whether the domain lock is held, avoiding the need to add a new function. I get the point that this capability is not yet used in the runtime.

gasche (Member) commented Jan 7, 2025

I pushed a new iteration that takes your remarks into account, and also moves the non-blocking code to sync.h. What do you think?

gadmm (Contributor, Author) commented Jan 7, 2025

I think it is worth opening a PR for further comments.

gasche added a commit to gasche/ocaml that referenced this pull request Jan 8, 2025
- 'runtime' mutexes are for blocking critical sections which do not
   access the runtime.

- 'mutator' mutexes are for non-blocking critical sections (blocking on
   a mutex releases the runtime lock) which may access the runtime
   system.

This refactoring comes from the discussions in ocaml#13227, it tries to avoid
a class of bug where the same mutex is used in both blocking and
non-blocking fashion, resulting in subtle deadlock situations.
gasche (Member) commented Jan 8, 2025

Done in #13716.

gasche added a commit to gasche/ocaml that referenced this pull request May 18, 2025

Reviewed-by: Antonin Décimo <[email protected]>
Reviewed-by: Guillaume Munch-Maccagnoni <[email protected]>
Reviewed-by: Miod Vallat <[email protected]>