Skip to content

Conversation

@gadmm
Copy link
Contributor

@gadmm gadmm commented Jan 7, 2025

As noticed by @gasche, after @jmid's report of a deadlock at #13713, the channel finalizer cannot call caml_plat_lock_non_blocking. This reverts the parts of the change that caused the bug at #13713 and better documents what happens here.

Better reasoning about the interaction between caml_plat_lock_non_blocking and caml_plat_lock_blocking leads to tighter constraints about how they should be mixed. This shows that the previous patch did not fix the deadlock inside runtime events as thought, so I fix it differently here.

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is consistent with our discussions in #13713.

@gasche gasche added the merge-me label Jan 7, 2025
@jmid
Copy link
Member

jmid commented Jan 7, 2025

Can we add the run-multicoretests label for extra certainty?

I think the test suite is showing its value, both in this case - and in #13580 - and hope to see label usage pick up on runtime changing PRs 🙏

@gasche gasche added the run-multicoretests Makes the CI run the multicore testsuite label Jan 7, 2025
@gasche
Copy link
Member

gasche commented Jan 7, 2025

multicore-tests is happy, merging. Thanks again @jmid for the prompt report and @gadmm for the prompt fix.

@gasche gasche merged commit 3ad6586 into ocaml:trunk Jan 7, 2025
39 of 40 checks passed
/* [user_events_lock] can be acquired during STW, so we must use
caml_plat_lock_blocking and be careful to avoid triggering any
STW while holding it */
caml_plat_lock_blocking(&user_events_lock);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gadmm I wonder if the following alternative fix would work:

  • take the lock non_blocking for all the critical sections outside the STW code
  • in the STW section, do not take users_events_lock, with a barrier after the user_events-using section to guarantee that no mutators run in parallel

More generally, I think that we can have resources that use non-blocking locks in mutators and are followed by a barrier in STW.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-me run-multicoretests Makes the CI run the multicore testsuite

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants