Skip to content

Assertion failed: atomic_load_acquire(&di->backup_thread_msg) == BT_INIT #13677

@jmid

Description

@jmid

On Sunday in multicoretests we saw a new assertion error triggered on trunk that we haven't seen before:
https://github.com/ocaml-multicore/multicoretests/actions/runs/12337399971/job/34431080832

random seed: 164588596
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s Lin Weak HashSet stress test with Domain
[02] file runtime/domain.c; line 1126 ### Assertion failed: atomic_load_acquire(&di->backup_thread_msg) == BT_INIT
File "src/weak/dune", line 28, characters 7-24:
28 |  (name lin_tests_hashset)
            ^^^^^^^^^^^^^^^^^
(cd _build/default/src/weak && ./lin_tests_hashset.exe --verbose)
Command got signal ILL.

The relevant part of install_backup_thread reads a follows:

ocaml/runtime/domain.c

Lines 1112 to 1126 in 849726f

if (atomic_load_acquire(&di->backup_thread_running) != 0) {
/* If the backup thread is running, but has been instructed to terminate,
we need to wait for it to stop until we can spawn another. */
uintnat msg;
msg = atomic_load_acquire(&di->backup_thread_msg);
while (msg != BT_INIT) {
/* Give a chance for backup thread on this domain to terminate */
caml_plat_unlock (&di->domain_lock);
cpu_relax ();
caml_plat_lock_blocking(&di->domain_lock);
msg = atomic_load_acquire(&di->backup_thread_msg);
}
}
CAMLassert(atomic_load_acquire(&di->backup_thread_msg) == BT_INIT);

Looking a bit into it, I spotted a couple of atomic updates above in backup_thread_func
thinking that it might be caused by the small window between these two:

ocaml/runtime/domain.c

Lines 1099 to 1100 in 849726f

atomic_store_release(&di->backup_thread_running, 0);
atomic_store_release(&di->backup_thread_msg, BT_INIT);

I thus inserted a sleep to open the window a bit more:

diff --git a/runtime/domain.c b/runtime/domain.c
index 6086ed8155..bac5754173 100644
--- a/runtime/domain.c
+++ b/runtime/domain.c
@@ -1097,6 +1097,7 @@ static void* backup_thread_func(void* v)
 
   /* doing terminate */
   atomic_store_release(&di->backup_thread_running, 0);
+  thrd_sleep(&(struct timespec){.tv_sec=0, .tv_nsec=050000000}, NULL); /* sleep for 50ms */
   atomic_store_release(&di->backup_thread_msg, BT_INIT);
 
   return 0;

With this patch, any Domain-using test I've tried now fails immediately with an assertion error:

multicoretests$ _build/default/src/hashtbl/stm_tests.exe 
### OCaml runtime: debug mode ###
### set OCAMLRUNPARAM=v=0 to silence this message
random seed: 347896408
[02] file runtime/domain.c; line 1127 ### Assertion failed: atomic_load_acquire(&di->backup_thread_msg) == BT_INIT
Illegal instruction (core dumped)

What's a good fix for this? Switching the two atomic updates? (edit: apparently not) 🤔

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions