Description
We're observing major heap corruption or crashes in 32-bit mode.
I'm able to trigger it relatively consistently with both 5.2.0 and trunk on x86_64 under GitHub Actions.
The kicker is, so far I've not been able to do so locally, in Docker, under rr, or on a 32-bit RPi.
The test in question is this one: https://github.com/ocaml-multicore/multicoretests/blob/domain-dls-32-bit-focus/src/domain/stm_tests_dls.ml
It performs a sequence of random `Domain.DLS.set` (or `get`) calls which run in a child domain (there are no evil parallel domains). Crashes can be triggered with only sets, or with only gets, so the issue does not seem closely tied to either operation per se.
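For illustration, here is a hypothetical, heavily simplified version of the pattern the test exercises (the real test is generated by qcheck-stm; see the linked file): a single child domain issuing a sequence of `Domain.DLS.set` / `Domain.DLS.get` calls, with no parallel domains involved.

```ocaml
(* Hypothetical minimal sketch -- not the actual test file. *)
let key = Domain.DLS.new_key (fun () -> 0)

(* Spawn one child domain that performs n set/get pairs on its own DLS
   slot and returns the last value it wrote. *)
let run_child n =
  let d =
    Domain.spawn (fun () ->
        for i = 1 to n do
          Domain.DLS.set key i;          (* exercise DLS writes *)
          ignore (Domain.DLS.get key)    (* and DLS reads *)
        done;
        Domain.DLS.get key)
  in
  Domain.join d

let () = Printf.printf "last value: %d\n" (run_child 10_000)
```

Under the debug runtime this kind of loop is what eventually trips the heap verifier on the affected platforms.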
I've earlier done a bisection, establishing that this started triggering on bdd8d96 - Make the GC compact again from #12193.
It seems easiest to trigger under the debug runtime. Out of 200 repetitions here https://github.com/ocaml-multicore/multicoretests/actions/runs/10297760451/job/28501499241 there are 21 failures, distributed as follows:
- 13 cases of major heap verification failure:
  `file runtime/shared_heap.c; line 784 ### Assertion failed: Has_status_val(v, caml_global_heap_state.UNMARKED)` followed by `/usr/bin/bash: line 1: 63840 Illegal instruction (core dumped)`
- 4 cases of OCaml heap value corruption showing up as invalid counterexamples (which the generator did not produce):
  `Test STM Domain.DLS test sequential errored on (25 shrink steps): Set (-133652508, -134770910)`
- 3 cases of a clean
  `Segmentation fault (core dumped)`
- 1 case of
  `Fatal error: bad opcode (5971072b)` followed by `Aborted (core dumped)`
IIUC, the first one indicates that a reachable major heap value is either MARKED or GARBAGE after a completed mark-sweep cycle (the mark bits are recycled each cycle to avoid an additional heap traversal).
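Since the debug runtime re-verifies the heap after each completed major cycle, one way to raise the odds of hitting this assertion is simply to force extra cycles. A hedged sketch (my own stress loop, not part of the test suite):

```ocaml
(* Churn short-lived major-heap allocations and force a full major
   collection each round, so the debug runtime's heap verifier runs
   as often as possible. *)
let churn rounds =
  for _ = 1 to rounds do
    (* Sys.opaque_identity keeps the allocation from being optimised away. *)
    ignore (Sys.opaque_identity (List.init 1_000 string_of_int));
    Gc.full_major ()
  done

let () = churn 10
```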
By lowering the space_overhead parameter from the default o=120 to o=20, it fails in 45 out of 200 iterations, a clear increase with more major GC activity.
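For reference, the same knob can be turned programmatically via the `Gc` module instead of `OCAMLRUNPARAM`'s `o=` parameter (a hedged equivalent, not what the CI workflow does):

```ocaml
(* Lower space_overhead from the default 120 to 20, making the major GC
   work much more eagerly -- the in-program equivalent of OCAMLRUNPARAM=o=20. *)
let () =
  Gc.set { (Gc.get ()) with Gc.space_overhead = 20 }
```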
Using tmate I've been able to log in to the GitHub actions machine and obtain two stack traces.
(I'm a bit hesitant about doing this as we rely on and benefit from the GitHub actions runners)
Here's one from an assertion failure:
Thread 4 (Thread 0xf03ffac0 (LWP 172957)):
#0 0xf1a6c579 in __kernel_vsyscall ()
#1 0xf1683243 in ?? () from /lib/i386-linux-gnu/libc.so.6
#2 0xf168a06a in pthread_mutex_lock () from /lib/i386-linux-gnu/libc.so.6
#3 0x5922ab80 in caml_plat_lock_blocking (m=0x5a9df4c0) at runtime/caml/platform.h:457
#4 backup_thread_func (v=<optimized out>) at runtime/domain.c:1076
#5 0xf1686c01 in ?? () from /lib/i386-linux-gnu/libc.so.6
#6 0xf172372c in ?? () from /lib/i386-linux-gnu/libc.so.6
Thread 3 (Thread 0xf1880740 (LWP 171712)):
#0 0xf1a6c579 in __kernel_vsyscall ()
#1 0xf1715336 in ?? () from /lib/i386-linux-gnu/libc.so.6
#2 0xf1682e81 in ?? () from /lib/i386-linux-gnu/libc.so.6
#3 0xf1686079 in pthread_cond_wait () from /lib/i386-linux-gnu/libc.so.6
#4 0x592572d1 in sync_condvar_wait (m=0x5a9e3920, c=0x5a9e1620) at runtime/sync_posix.h:116
#5 caml_ml_condition_wait (wcond=<optimized out>, wmut=<optimized out>) at runtime/sync.c:172 ##### In a C call
#6 0x5925dce2 in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at runtime/interp.c:1047
#7 0x59261052 in caml_startup_code_exn (pooling=0, argv=0xffdf2e24, section_table_size=3683, section_table=0x59292020 <caml_sections> "\204\225\246\276", data_size=21404, data=0x59292ea0 <caml_data> "\204\225\246\276", code_size=528104, code=0x59298240 <caml_code>) at runtime/startup_byt.c:655
#8 caml_startup_code_exn (code=0x59298240 <caml_code>, code_size=528104, data=0x59292ea0 <caml_data> "\204\225\246\276", data_size=21404, section_table=0x59292020 <caml_sections> "\204\225\246\276", section_table_size=3683, pooling=0, argv=0xffdf2e24) at runtime/startup_byt.c:588
#9 0x59261101 in caml_startup_code (code=0x59298240 <caml_code>, code_size=528104, data=0x59292ea0 <caml_data> "\204\225\246\276", data_size=21404, section_table=0x59292020 <caml_sections> "\204\225\246\276", section_table_size=3683, pooling=0, argv=0xffdf2e24) at runtime/startup_byt.c:669
#10 0x592120b4 in main (argc=4, argv=0xffdf2e24) at camlprim.c:25901
Thread 2 (Thread 0xee0f6ac0 (LWP 172956)):
#0 0x592512cc in caml_verify_root (state=0xf10ae180, v=-245978988, p=0xf11f7410) at runtime/shared_heap.c:759
#1 0x5923013d in caml_scan_stack (f=0x592512c0 <caml_verify_root>, fflags=0, fdata=0xf10ae180, stack=0xf117f010, v_gc_regs=0x0) at runtime/fiber.c:396
#2 0x5924f826 in caml_do_local_roots (f=0x592512c0 <caml_verify_root>, fflags=0, fdata=0xf10ae180, local_roots=0xee0f61ec, current_stack=0xf117f010, v_gc_regs=0x0) at runtime/roots.c:65
#3 0x5924f865 in caml_do_roots (f=0x592512c0 <caml_verify_root>, fflags=0, fdata=0xf10ae180, d=0xf0402620, do_final_val=1) at runtime/roots.c:41
#4 0x5925343e in caml_verify_heap_from_stw (domain=0xf0402620) at runtime/shared_heap.c:804
#5 0x59240c39 in stw_cycle_all_domains (domain=<optimized out>, args=<optimized out>, participating_count=<optimized out>, participating=<optimized out>) at runtime/major_gc.c:1434
#6 0x5922af41 in caml_try_run_on_all_domains_with_spin_work (sync=<optimized out>, handler=<optimized out>, data=<optimized out>, leader_setup=<optimized out>, enter_spin_callback=<optimized out>, enter_spin_data=<optimized out>) at runtime/domain.c:1695
#7 0x5922b10a in caml_try_run_on_all_domains (handler=0x592407c0 <stw_cycle_all_domains>, data=0xee0f5ca8, leader_setup=0x0) at runtime/domain.c:1717
#8 0x5924324e in major_collection_slice (howmuch=<optimized out>, participant_count=participant_count@entry=0, barrier_participants=barrier_participants@entry=0x0, mode=<optimized out>, force_compaction=<optimized out>) at runtime/major_gc.c:1851
#9 0x59243670 in caml_major_collection_slice (howmuch=-1) at runtime/major_gc.c:1869
#10 0x5922a7d8 in caml_poll_gc_work () at runtime/domain.c:1874
#11 0x59254e67 in caml_do_pending_actions_res () at runtime/signals.c:338
#12 0x5924cc9c in caml_alloc_small_dispatch (dom_st=0xf0402620, wosize=2, flags=3, nallocs=1, encoded_alloc_lens=0x0) at runtime/minor_gc.c:896
#13 0x5925f1f9 in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at runtime/interp.c:788
#14 0x59224cbc in caml_callbackN_exn (closure=<optimized out>, narg=<optimized out>, args=<optimized out>) at runtime/callback.c:131
#15 0x59224faa in caml_callback_exn (arg1=<optimized out>, closure=<optimized out>) at runtime/callback.c:144
#16 caml_callback_res (closure=-243007372, arg=1) at runtime/callback.c:320
#17 0x59229e4a in domain_thread_func (v=<optimized out>) at runtime/domain.c:1244
#18 0xf1686c01 in ?? () from /lib/i386-linux-gnu/libc.so.6
#19 0xf172372c in ?? () from /lib/i386-linux-gnu/libc.so.6
Thread 1 (Thread 0xef3feac0 (LWP 171715)):
#0 caml_failed_assert (expr=0x5926bf18 "Has_status_val(v, caml_global_heap_state.UNMARKED)", file_os=0x5926b995 "runtime/shared_heap.c", line=784) at runtime/misc.c:48
#1 0x59253709 in verify_object (v=-298832060, st=0xf1074c70) at runtime/shared_heap.c:784
#2 caml_verify_heap_from_stw (domain=0x5a9e0120) at runtime/shared_heap.c:807
#3 0x59240c39 in stw_cycle_all_domains (domain=<optimized out>, args=<optimized out>, participating_count=<optimized out>, participating=<optimized out>) at runtime/major_gc.c:1434
#4 0x5922aa28 in stw_handler (domain=0x5a9e0120) at runtime/domain.c:1486
#5 handle_incoming (s=<optimized out>) at runtime/domain.c:351
#6 0x5922ac9a in caml_handle_incoming_interrupts () at runtime/domain.c:364
#7 backup_thread_func (v=<optimized out>) at runtime/domain.c:1057
#8 0xf1686c01 in ?? () from /lib/i386-linux-gnu/libc.so.6
#9 0xf172372c in ?? () from /lib/i386-linux-gnu/libc.so.6
Here it seems that:
- Thread 4 is a waiting backup thread for the child domain
- Thread 3 is the main thread, paused during a blocked `C_CALL2` to `caml_ml_condition_wait`
- Thread 2 is the child domain triggering a major GC slice on a `MAKEBLOCK2`
- Thread 1 is the main domain's backup thread participating in the STW section
Here's another one from a clean segfault:
Thread 2 (Thread 0xf17feac0 (LWP 122715)):
#0 0xf3f47579 in __kernel_vsyscall ()
#1 0xf3b15336 in ?? () from /lib/i386-linux-gnu/libc.so.6
#2 0xf3a82e81 in ?? () from /lib/i386-linux-gnu/libc.so.6
#3 0xf3a86079 in pthread_cond_wait () from /lib/i386-linux-gnu/libc.so.6
#4 0x648daa9d in caml_plat_wait (cond=0x64d693b4, mut=0x64d6939c) at runtime/platform.c:127
#5 0x648b6c1a in backup_thread_func (v=<optimized out>) at runtime/domain.c:1068
#6 0xf3a86c01 in ?? () from /lib/i386-linux-gnu/libc.so.6
#7 0xf3b2372c in ?? () from /lib/i386-linux-gnu/libc.so.6
Thread 1 (Thread 0xf3d5b740 (LWP 122712)):
#0 0x648ea29f in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at runtime/interp.c:573
#1 0x648ed052 in caml_startup_code_exn (pooling=0, argv=0xff8ca1e4, section_table_size=3683, section_table=0x6491e020 <caml_sections> "\204\225\246\276", data_size=21404, data=0x6491eea0 <caml_data> "\204\225\246\276", code_size=528104, code=0x64924240 <caml_code>) at runtime/startup_byt.c:655
#2 caml_startup_code_exn (code=0x64924240 <caml_code>, code_size=528104, data=0x6491eea0 <caml_data> "\204\225\246\276", data_size=21404, section_table=0x6491e020 <caml_sections> "\204\225\246\276", section_table_size=3683, pooling=0, argv=0xff8ca1e4) at runtime/startup_byt.c:588
#3 0x648ed101 in caml_startup_code (code=0x64924240 <caml_code>, code_size=528104, data=0x6491eea0 <caml_data> "\204\225\246\276", data_size=21404, section_table=0x6491e020 <caml_sections> "\204\225\246\276", section_table_size=3683, pooling=0, argv=0xff8ca1e4) at runtime/startup_byt.c:669
#4 0x6489e0b4 in main (argc=4, argv=0xff8ca1e4) at camlprim.c:25901
This one segfaults on a `RETURN` instruction in `pc = Code_val(accu);`
Here's a fresh branch reproducing the issue on CI (5.2, 5.3, and 5.4 which is now trunk):
https://github.com/ocaml-multicore/multicoretests/tree/domain-dls-32-bit-share-repro
The workflows roughly follow these steps, if anyone wants to attempt reproducing locally:

```shell
# install opam and a switch for the target compiler with ocaml-option-32bit
$ opam install qcheck-core
$ git clone -b domain-dls-32-bit-share-repro https://github.com/ocaml-multicore/multicoretests.git
$ cd multicoretests
$ dune build --profile=debug-runtime src/
$ OCAMLRUNPARAM="o=20,s=4096,v=0,V=1" dune build "@ci" -j1 --no-buffer --display=quiet --cache=disabled --error-reporting=twice --profile=debug-runtime src/
```