-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
On Sunday in multicoretests we saw a new assertion error triggered on trunk that we haven't seen before:
https://github.com/ocaml-multicore/multicoretests/actions/runs/12337399971/job/34431080832
random seed: 164588596
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s Lin Weak HashSet stress test with Domain
[02] file runtime/domain.c; line 1126 ### Assertion failed: atomic_load_acquire(&di->backup_thread_msg) == BT_INIT
File "src/weak/dune", line 28, characters 7-24:
28 | (name lin_tests_hashset)
^^^^^^^^^^^^^^^^^
(cd _build/default/src/weak && ./lin_tests_hashset.exe --verbose)
Command got signal ILL.
The relevant part of install_backup_thread reads a follows:
Lines 1112 to 1126 in 849726f
| if (atomic_load_acquire(&di->backup_thread_running) != 0) { | |
| /* If the backup thread is running, but has been instructed to terminate, | |
| we need to wait for it to stop until we can spawn another. */ | |
| uintnat msg; | |
| msg = atomic_load_acquire(&di->backup_thread_msg); | |
| while (msg != BT_INIT) { | |
| /* Give a chance for backup thread on this domain to terminate */ | |
| caml_plat_unlock (&di->domain_lock); | |
| cpu_relax (); | |
| caml_plat_lock_blocking(&di->domain_lock); | |
| msg = atomic_load_acquire(&di->backup_thread_msg); | |
| } | |
| } | |
| CAMLassert(atomic_load_acquire(&di->backup_thread_msg) == BT_INIT); |
Looking a bit into it, I spotted a couple of atomic updates above in backup_thread_func
thinking that it might be caused by the small window between these two:
Lines 1099 to 1100 in 849726f
| atomic_store_release(&di->backup_thread_running, 0); | |
| atomic_store_release(&di->backup_thread_msg, BT_INIT); |
I thus inserted a sleep to open the window a bit more:
diff --git a/runtime/domain.c b/runtime/domain.c
index 6086ed8155..bac5754173 100644
--- a/runtime/domain.c
+++ b/runtime/domain.c
@@ -1097,6 +1097,7 @@ static void* backup_thread_func(void* v)
/* doing terminate */
atomic_store_release(&di->backup_thread_running, 0);
+ thrd_sleep(&(struct timespec){.tv_sec=0, .tv_nsec=050000000}, NULL); /* sleep for 50ms */
atomic_store_release(&di->backup_thread_msg, BT_INIT);
return 0;With this patch, any Domain-using test I've tried now fails immediately with an assertion error:
multicoretests$ _build/default/src/hashtbl/stm_tests.exe
### OCaml runtime: debug mode ###
### set OCAMLRUNPARAM=v=0 to silence this message
random seed: 347896408
[02] file runtime/domain.c; line 1127 ### Assertion failed: atomic_load_acquire(&di->backup_thread_msg) == BT_INIT
Illegal instruction (core dumped)
What's a good fix for this? Switching the two atomic updates? (edit: apparently not) 🤔