Skip to content

Fix uninitialized read in runtime_events and toplevel bigarray printing. Document how to use MSAN.#13290

Closed
edwintorok wants to merge 7 commits intoocaml:trunkfrom
edwintorok:private/edvint/msan
Closed

Fix uninitialized read in runtime_events and toplevel bigarray printing. Document how to use MSAN.#13290
edwintorok wants to merge 7 commits intoocaml:trunkfrom
edwintorok:private/edvint/msan

Conversation

@edwintorok
Copy link
Contributor

@edwintorok edwintorok commented Jul 11, 2024

Back in 2018 tools/ci/inria/sanitizers/script claimed that MSAN output is impossible to debug.
clang has improved a lot since then, and in fact with clang-18 on Fedora40 it only reports 2 bug, both of which I was able to track down.

First of all zstd has to be disabled with -fsanitize=memory, because all external libraries would have to be compiled with -fsanitize=memory, otherwise the instrumentation won't see the writes performed by these and consider all memory written to by these functions uninitialized.

The first one is an uninitialized read in runtime_events, see this commit for the output

commit 053256277c640552ff6ac43d76d72e017b3454fc
Author: Edwin Török <[email protected]>
Date:   Thu Jul 11 21:41:14 2024 +0100

    runtime_events: avoid uninitialized value read

The error report contained enough information to locate the source of the problem and how to fix it: EV_RING_STOP has a data size of 0, but buf[2] was always read. This can be fixed by checking msg_length before accessing buf[2], and I've included that fix here.

The second one was considerably more difficult to track down, it was in the toplevel when printing bigarrays, see this commit for details:

commit 1f99e98ecce5c5edd4bdb7737844482a634d1e9c
Author: Edwin Török <[email protected]>
Date:   Thu Jul 11 21:08:31 2024 +0100

    sanitizers(MSAN): Suppress toplevel bug

I was only able to track down the reason, but I don't know how to fix it. First I had to locate which line from multi_indices.ml causes the failure, it was the one that printed the bigarray as <abstr>.
With that information and the stacktrace showing caml_ba_hash I was able to track down the problem to ObjTbl.mem nested_values repr in toplevel/genprintval.ml by sprinkling some prerr_endline to confirm it was indeed that call causing the issue (for some reason Printexc.get_callstack wasn't working here).
I don't have a solution in mind for this: bigarrays are uninit on allocation, and the hashing seems to be needed to avoid endless recursion, etc. and there isn't an easy way to exclude bigarrays from that. For now I've added it to the skipped tests.

With these 2 changes the testsuite now passes.

I've included some compile flag improvements as well, the manual recommends -O1 or higher, not -O0.

ocamlopt has to be kept disabled, it'd need a similar solution to TSAN to enable it (ocamlopt would have to emit calls to instrumentation functions compatible with Clang's, otherwise it won't see any of the writes from OCaml and consider them uninitialized).

I kept the memory sanitizer commented out for now, because:

  • I don't know what clang versions are available on Inria's CI. I've only tested with clang-18.
  • I haven't yet set up an account on Inria's CI to test this change
  • I'm not sure whether you'd want this enabled by default, although clang's output has improved it can still be difficult to locate the problem.

For now I'm opening the PR in draft form to gather feedback, especially on whether you'd want this enabled by default or not, i.e. are the error reports useful enough and understandable enough to justify enabling them?
Also should they be part of the same CI job as the other 2 sanitizers, or would you rather have them as 3 separate jobs?

@edwintorok edwintorok force-pushed the private/edvint/msan branch from 1f99e98 to 83e68fc Compare July 11, 2024 21:00
@gasche
Copy link
Member

gasche commented Jul 11, 2024

I wonder if the following patch could help with the toplevel cycle-detection-on-unitialized-memory issue (untested, just a quick thought):

diff --git i/toplevel/genprintval.ml w/toplevel/genprintval.ml
index 8d365fc6f7b..7390e9b7176 100644
--- i/toplevel/genprintval.ml
+++ w/toplevel/genprintval.ml
@@ -249,7 +249,7 @@ module Make(O : OBJ)(EVP : EVALPATH with type valu = O.t) = struct
       let nested_values = ObjTbl.create 8 in
       let nest_gen err f depth obj ty =
         let repr = obj in
-        if not (O.is_block repr) then
+        if not (O.is_block repr) || (O.tag repr >= Obj.no_scan_tag) then
           f depth obj ty
         else
           if ObjTbl.mem nested_values repr then

https://clang.llvm.org/docs/MemorySanitizer.html and https://clang.llvm.org/docs/AddressSanitizer.html
both recommend using `-O1 -fno-optimize-sibling-calls` for better
stacktraces, and recommend -O1 or higher for reasonable performance.

Try to future proof this by using -Og, which is currently documented in
`clang` as equivalent to `-O1`, but may in the future disable
optimizations to improve debuggability. GCC supports this flag too, and
infact does disable some -O1 optimizations that interfere with
debugging.

MSAN was using -O0 which is slower than needed.

Signed-off-by: Edwin Török <[email protected]>
Tested with clang-18.1.6 on Fedora 40.

According https://github.com/google/sanitizers/wiki/MemorySanitizer#using-instrumented-libraries
all the code in the program must be built with it, otherwise false
positives are reported.
Disable zstd when compiling with the memory sanitizer, because this is
an external library, that is unlikely to have been compiled with it.

Signed-off-by: Edwin Török <[email protected]>
Enable detailed origin tracking for memory sanitizer.
Memory sanitizer failures are sometimes difficult to reason about,
gather all the information that we can, even at the cost of additional
overhead.

Signed-off-by: Edwin Török <[email protected]>
I've only tested this with clang-18.

Signed-off-by: Edwin Török <[email protected]>
Reported by `-fsanitize=memory`:
```
> ==102752==WARNING: MemorySanitizer: use-of-uninitialized-value
>     #0 0x7f2ba7fb4ea4 in caml_runtime_events_read_poll /var/home/edwin/git/ocaml/otherlibs/runtime_events/runtime_events_consumer.c:496:18
>     #1 0x7f2ba7fbc016 in caml_ml_runtime_events_read_poll /var/home/edwin/git/ocaml/otherlibs/runtime_events/runtime_events_consumer.c:1207:9
>     ocaml#2 0x59ba5c in caml_interprete /var/home/edwin/git/ocaml/runtime/interp.c:1058:14
>     ocaml#3 0x5a9220 in caml_main /var/home/edwin/git/ocaml/runtime/startup_byt.c:575:9
>     ocaml#4 0x540d6b in main /var/home/edwin/git/ocaml/runtime/main.c:37:3
>     ocaml#5 0x7f2ba8120087 in __libc_start_call_main (/lib64/libc.so.6+0x2a087) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
>     ocaml#6 0x7f2ba812014a in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x2a14a) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
>     ocaml#7 0x441804 in _start (/var/home/edwin/git/ocaml/runtime/ocamlrun+0x441804) (BuildId: 617637580ee48eff08a2bce790e1667ad09f3b69)
>
>   Uninitialized value was stored to memory at
>     #0 0x7f2ba7fb4e9d in caml_runtime_events_read_poll /var/home/edwin/git/ocaml/otherlibs/runtime_events/runtime_events_consumer.c:497:69
>     #1 0x7f2ba7fbc016 in caml_ml_runtime_events_read_poll /var/home/edwin/git/ocaml/otherlibs/runtime_events/runtime_events_consumer.c:1207:9
>     ocaml#2 0x59ba5c in caml_interprete /var/home/edwin/git/ocaml/runtime/interp.c:1058:14
>     ocaml#3 0x5a9220 in caml_main /var/home/edwin/git/ocaml/runtime/startup_byt.c:575:9
>     ocaml#4 0x540d6b in main /var/home/edwin/git/ocaml/runtime/main.c:37:3
>     ocaml#5 0x7f2ba8120087 in __libc_start_call_main (/lib64/libc.so.6+0x2a087) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
>     ocaml#6 0x7f2ba812014a in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x2a14a) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
>     ocaml#7 0x441804 in _start (/var/home/edwin/git/ocaml/runtime/ocamlrun+0x441804) (BuildId: 617637580ee48eff08a2bce790e1667ad09f3b69)
>
>   Uninitialized value was created by an allocation of 'buf' in the stack frame
>     #0 0x7f2ba7fb3dbc in caml_runtime_events_read_poll /var/home/edwin/git/ocaml/otherlibs/runtime_events/runtime_events_consumer.c:402:7
>
```

This is in fact an EV_LIFECYCLE with EV_RING_STOP, which has 0
additional data, and thus msg_length 2:
```
runtime/runtime_events.c:      EV_RUNTIME, (ev_message_type){.runtime=EV_LIFECYCLE}, EV_RING_STOP, 0,
```

Attempting to read from `buf[2]` would read uninitialized data (or
potentially beyond the end of the buffer).

Check `msg_length` before reading.

Signed-off-by: Edwin Török <[email protected]>
Bigarrays are printed in the toplevel as `<abstr>`, but ObjTbl.mem
computes a hash. However bigarrays are uninitialized when allocated,
so this seems to be a genuine bug:
```
==133712==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x4e6d11 in caml_ba_hash /var/home/edwin/git/ocaml/runtime/bigarray.c:486:45
    #1 0x52474a in caml_hash /var/home/edwin/git/ocaml/runtime/hash.c:251:35
    ocaml#2 0x599ebf in caml_interprete /var/home/edwin/git/ocaml/runtime/interp.c:1065:14
    ocaml#3 0x5a909a in caml_main /var/home/edwin/git/ocaml/runtime/startup_byt.c:575:9
    ocaml#4 0x540ccb in main /var/home/edwin/git/ocaml/runtime/main.c:37:3
    ocaml#5 0x7f0910abb087 in __libc_start_call_main (/lib64/libc.so.6+0x2a087) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    ocaml#6 0x7f0910abb14a in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x2a14a) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    ocaml#7 0x441804 in _start (/var/home/edwin/git/ocaml/runtime/ocamlrun+0x441804) (BuildId: 7a60eef57e1c2baf770bc38d10d6c227e60ead37)

  Uninitialized value was created by a heap allocation
    #0 0x47d306 in malloc (/var/home/edwin/git/ocaml/runtime/ocamlrun+0x47d306) (BuildId: 7a60eef57e1c2baf770bc38d10d6c227e60ead37)
    #1 0x4e7960 in caml_ba_alloc /var/home/edwin/git/ocaml/runtime/bigarray.c:246:12
    ocaml#2 0x4e801f in caml_ba_create /var/home/edwin/git/ocaml/runtime/bigarray.c:673:10
    ocaml#3 0x59b8fc in caml_interprete /var/home/edwin/git/ocaml/runtime/interp.c:1058:14
    ocaml#4 0x5a909a in caml_main /var/home/edwin/git/ocaml/runtime/startup_byt.c:575:9
    ocaml#5 0x540ccb in main /var/home/edwin/git/ocaml/runtime/main.c:37:3
    ocaml#6 0x7f0910abb087 in __libc_start_call_main (/lib64/libc.so.6+0x2a087) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    ocaml#7 0x7f0910abb14a in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x2a14a) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    ocaml#8 0x441804 in _start (/var/home/edwin/git/ocaml/runtime/ocamlrun+0x441804) (BuildId: 7a60eef57e1c2baf770bc38d10d6c227e60ead37)

SUMMARY: MemorySanitizer: use-of-uninitialized-value /var/home/edwin/git/ocaml/runtime/bigarray.c:486:45 in caml_ba_hash
```

The hashing is only needed to avoid recursion, skip it when the OCaml
value doesn't contain further nested OCaml values, i.e. when it wouldn't
be scanned by the GC.

The testsuite passes now:
```
Summary:
  1335 tests passed
   102 tests skipped
     0 tests failed
     0 tests not started (parent test skipped or failed)
     0 unexpected errors
  1437 tests considered
```

Suggested-by: Gabriel Scherer <[email protected]>
Signed-off-by: Edwin Török <[email protected]>
@edwintorok edwintorok force-pushed the private/edvint/msan branch from 83e68fc to c2b0aed Compare July 11, 2024 22:53
@edwintorok
Copy link
Contributor Author

edwintorok commented Jul 11, 2024

(untested, just a quick thought):

Thanks, it passes the MSAN testsuite. I've replaced my suppression commit with your suggestion in:

commit d6f6e3010ee17ca48450155c5062590ccfbb3814
Author: Edwin Török <[email protected]>
Date:   Thu Jul 11 21:08:31 2024 +0100

    fix(toplevel): avoid hashing uninitialized bigarrays when printing

@edwintorok edwintorok changed the title Fix runtime events uninitialized read and document how to use MSAN Fix uninitialized read in runtime_events and toplevel bigarray printing. Document how to use MSAN. Jul 11, 2024
@edwintorok
Copy link
Contributor Author

BTW I just realized that UBSAN is not actually enabled, manual says "-fsanitize-trap= and -fsanitize-recover= are a no-op in the absence of a -fsanitize= option. There is no unused command line option warning.".
I'll open a separate PR to fix that, using -fsanitize-trap on its own is a ... trap.
Something like this fixes it (and testsuite still passes):

 #########################################################################
@@ -77,7 +77,7 @@ shift-exponent,\
 unreachable"
 
 # Select address sanitizer and UB sanitizer, with trap-on-error behavior
-sanitizers="-fsanitize=address -fsanitize-trap=$ubsan"
+sanitizers="-fsanitize=address -fsanitize=$ubsan -fsanitize-trap=$ubsan"
 
 # Don't optimize too much to get better backtraces of errors
 opt_stacktrace="-Og -fno-optimize-sibling-calls"
@@ -85,7 +85,7 @@ opt_stacktrace="-Og -fno-optimize-sibling-calls"
 ./configure \
   CC=clang \
   CFLAGS="$opt_stacktrace -fno-omit-frame-pointer $sanitizers" \
-  LDFLAGS="$sanitizers" \
+  LDFLAGS="$sanitizers -Og" \

@gasche
Copy link
Member

gasche commented Jul 13, 2024

In #13294 I sent the genprintval.ml fix alone, to make it easier to review. I think that it may make sense to send the runtime-events changes as their own, separate PR as well.

@edwintorok
Copy link
Contributor Author

edwintorok commented Jul 14, 2024

I have now done the same with the other changes and discussions in this PR (this PR started out relatively small, but grew as related issues were discovered).

The full list of PRs split from this one are:
#13294 - genprintval fix
#13297 - runtime_events_consumer.c fix (turned out to be bigger than the initial one-liner)
#13293 - enable UBSAN (it was disabled by accident)
#13298 - document how to use MSAN

The sanitizer PRs probably need to wait for #13219 to be merged first.
The bugfix PRs are independent.

I've also created a branch with everything merged together (and MSAN enabled by default) at trunk...edwintorok:ocaml:private/edvint/sanitizers-all, and triggered a sanitizer CI run at https://ci.inria.fr/ocaml/job/precheck-sanitizers/4/console (will keep an eye on it and try to fix any issues).

I'm closing this PR, it is now out of date, lets continue on the other PRs.

@edwintorok edwintorok closed this Jul 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants