Skip to content

Conversation

@OlivierNicole
Copy link
Contributor

@OlivierNicole OlivierNicole commented Apr 10, 2025

A few weeks ago, @jmid discovered a type safety bug ocaml-multicore/multicoretests#528 happening when using the Dynarray module in parallel.

It appears that the root cause is probably in Array.sub, which, in some cases, calls memcpy. Depending on what assembly instructions memcpy is compiled to (which can vary a great deal depending on the platform), the reads from a in Array.sub a ofs len are not necessarily word-per-word; which, if writes are made to a concurrently, can cause tearing and fill the new array with garbage values.

Closing in on the bug was possible thanks to helpful comments and testing work from @jmid, @gasche and @shym.

The bug was observed when using musl, presumably because memcpy compiles to different code in that case. But it’s hardly a musl-specific problem. As @shym noted, building OCaml with glibc but with CFLAGS=-Os results in memcpy copying data byte-per-byte.

Fix details

  • Instead of using memcpy, we need to enforce that the source array is read word-wise. To achieve this, I’m using relaxed atomic loads.
  • Maybe memcpy could be used if caml_domain_alone () is true, as is done elsewhere, but I’m not entirely sure that it’s correct: without any synchronisation, it is in principle possible that a second domain is spawned and writes to the aray before the memcpy/memmove is finished. But I’m happy to be proven wrong. I have been proven wrong, see below.

@gasche
Copy link
Member

gasche commented Apr 10, 2025

I wrote an Array.sub microbenchmark (sources).

Implementations compared:

  • reference (source) is the current code in the runtime system, simplified from a generic caml_array_gather version to a specialized version just for sub
  • atomic (source) uses the modification proposed in this PR
  • stdlib is calling caml_array_sub from the runtime directly, as a control, it should perform about as well as reference
  • ocaml (source) is a pure-OCaml implementation

On arrays of size 200, when sub is used to copy the full array, the results are as follows:

Summary
  reference ran
    1.10 ± 0.12 times faster than stdlib
    2.12 ± 0.24 times faster than atomic
   14.63 ± 1.07 times faster than ocaml

In other words, this PR makes Array.sub 2x slower on small arrays.

I think that we should use a caml_domain_alone fast-path to avoid this overhead in sequential settings. (Olivier and I already discussed this a bit in this thread)

@OlivierNicole OlivierNicole added run-multicoretests Makes the CI run the multicore testsuite run-thread-sanitizer Makes the CI run the testsuite with TSAN enabled labels Apr 10, 2025
@gasche
Copy link
Member

gasche commented Apr 10, 2025

I added an atomic-opt version with a sequential fast path checking caml_domain_alone (), and indeed the performance is now as good as before:

Summary
  reference ran
    1.04 ± 0.16 times faster than atomic-opt
    1.11 ± 0.22 times faster than stdlib
    1.79 ± 0.21 times faster than atomic
   12.50 ± 1.21 times faster than ocaml

@gadmm
Copy link
Contributor

gadmm commented Apr 16, 2025

Maybe memcpy could be used if caml_domain_alone () is true, as is done elsewhere, but I’m not entirely sure that it’s correct: without any synchronisation, it is in principle possible that a second domain is spawned and writes to the aray before the memcpy/memmove is finished. But I’m happy to be proven wrong.

The second domain is necessarily spawned from the first domain. Wouldn't the (single) domain lock enforce the synchronization you need?

@gasche
Copy link
Member

gasche commented Apr 16, 2025

Re. caml_domain_alone, see #13952 where I documented my reasoning for why this is safe.

@OlivierNicole
Copy link
Contributor Author

I have added a fast path as suggested.

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with the current version -- with minor comments.

Co-authored-by: Gabriel Scherer <[email protected]>
@gasche
Copy link
Member

gasche commented Apr 23, 2025

The CI failures seem broadly unrelated to this PR.

List of failed tests:
    tests/tsan/exn_from_c.ml
    tests/tsan/exn_in_callback.ml
    tests/weak-ephe-final/weak_array_par.ml

The tsan/exn_* tests look related to printing order, either they are flaky or maybe there was an unrelated flushing change somewhere that broke their reference output.

The weak_array_par test is failing due to data races reported by TSan. One is on strings value exchanged across domains by Weak.set and Weak.get:

2025-04-23T10:05:17.2430563Z > WARNING: ThreadSanitizer: data race (pid=35412)
2025-04-23T10:05:17.2430942Z >   Read of size 1 at 0x7fdd215fffb7 by thread T4 (mutexes: write M0):
2025-04-23T10:05:17.2431514Z >     #0 caml_string_length runtime/str.c:36 (weak_array_par.opt+0x110508) (BuildId: 1b0d5496b0088fdebbc8f0f4f50edb9670d3c5de)
2025-04-23T10:05:17.2432238Z >     #1 caml_hash_mix_string runtime/hash.c:147 (weak_array_par.opt+0xe9379) (BuildId: 1b0d5496b0088fdebbc8f0f4f50edb9670d3c5de)
2025-04-23T10:05:17.2432931Z >     #2 caml_string_hash runtime/hash.c:306 (weak_array_par.opt+0xe9967) (BuildId: 1b0d5496b0088fdebbc8f0f4f50edb9670d3c5de)
[...]
2025-04-23T10:05:17.2438479Z >   Previous atomic write of size 8 at 0x7fdd215fffb0 by thread T1 (mutexes: write M1):
2025-04-23T10:05:17.2439090Z >     #0 caml_alloc_string runtime/alloc.c:188 (weak_array_par.opt+0xca1e7) (BuildId: 1b0d5496b0088fdebbc8f0f4f50edb9670d3c5de)
2025-04-23T10:05:17.2439835Z >     #1 caml_alloc_initialized_string runtime/alloc.c:197 (weak_array_par.opt+0xca2ef) (BuildId: 1b0d5496b0088fdebbc8f0f4f50edb9670d3c5de)
2025-04-23T10:05:17.2440575Z >     #2 caml_alloc_sprintf runtime/str.c:415 (weak_array_par.opt+0x11173a) (BuildId: 1b0d5496b0088fdebbc8f0f4f50edb9670d3c5de)
2025-04-23T10:05:17.2441266Z >     #3 caml_format_int runtime/ints.c:182 (weak_array_par.opt+0xee7f7) (BuildId: 1b0d5496b0088fdebbc8f0f4f50edb9670d3c5de)
2025-04-23T10:05:17.2441901Z >     #4 caml_c_call <null> (weak_array_par.opt+0x11b27b) (BuildId: 1b0d5496b0088fdebbc8f0f4f50edb9670d3c5de)
2025-04-23T10:05:17.2442702Z >     #5 camlStdlib$string_of_int_175 /home/runner/work/ocaml/ocaml/stdlib/stdlib.ml:266 (weak_array_par.opt+0x5f58b) (BuildId: 1b0d5496b0088fdebbc8f0f4f50edb9670d3c5de)
[...]

The code of the test, with the initialization, read and publication annotated by myself:

let () = Random.self_init ()

let num_domains = 4
let iters = 1_000_000
let len = 10_000
let table = Weak.create len

let go () =
  for i = 1 to 1_000_000 do
    let s = string_of_int i in (* STRING INITIALIZATION *)
    let h = String.hash s mod len in
    begin match Weak.get table h with
    | None -> ()
    | Some s' -> assert (String.hash s' (* CONCURRENT STRING READ *) mod len = h)
    end;
    Weak.set table h (Some s) (* PUBLICATION *)
  done

let () =
  let domains = Array.init (num_domains - 1) (fun _ -> Domain.spawn go) in
  go ();
  Array.iter Domain.join domains;
  print_endline "ok"

This is unrelated to the present PR but also an issue worth fixing. (I don't understand what needs to be done: maybe we really don't have enough synchronization here, or maybe we need some more opt-out code to teach TSan about our publish-safety assumptions.)

@gasche gasche merged commit 331b5e6 into ocaml:trunk Apr 23, 2025
20 of 24 checks passed
@OlivierNicole OlivierNicole deleted the bugfix-array-sub branch April 23, 2025 13:29
gasche added a commit that referenced this pull request Apr 23, 2025
Avoid tearing in Array.sub

(cherry picked from commit 331b5e6)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-multicoretests Makes the CI run the multicore testsuite run-thread-sanitizer Makes the CI run the testsuite with TSAN enabled

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants