Context
I was wondering whether TMC (tail-modulo-cons) is actually still useful in the Multicore runtime, or if we should get rid of it.
In the Multicore OCaml runtime, the runtime manages its own stack, which gives us the freedom to write programs that use a lot of stack space without running into operating-system limits. The Multicore runtime sets a stack-size limit by default and raises `Stack_overflow` when it is reached; this is a usability device to stop infinite stack-eating loops early, instead of consuming all the system memory and swapping. But the user can easily raise the stack limit with `OCAMLRUNPARAM="l=123456789"`, which used to work only in bytecode but now also works in native code. (The details of how Multicore uses its stack limit may change, but the above should remain true in the future.)
(In particular, we can now get rid of the dreaded `Stack_overflow` errors when compiling large/generated OCaml programs. But we are still at the mercy of quadratic/exponential slowdowns caused by our algorithms.)
So what's stopping us from passing very large stack limits when a program hits `Stack_overflow` due to (terminating) non-tail-recursive functions?
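For concreteness, here is a small illustration (assuming a runtime where the `l` parameter also applies to native code, as described above; the program and its numbers are my own example, not from the issue): a non-tail-recursive function that overflows at the default limit can often be rescued just by running with a larger `OCAMLRUNPARAM="l=..."`.

```ocaml
(* Hypothetical demo: a non-tail-recursive sum. With the default stack
   limit this may raise Stack_overflow for large n; re-running the same
   binary with e.g. OCAMLRUNPARAM="l=1000000000" raises the limit so
   that it completes. *)
let rec sum n = if n = 0 then 0 else n + sum (n - 1)

let () =
  let n = try int_of_string Sys.argv.(1) with _ -> 100_000 in
  match sum n with
  | s -> Printf.printf "sum %d = %d\n" n s
  | exception Stack_overflow ->
    prerr_endline "Stack_overflow: retry with a larger OCAMLRUNPARAM l= limit"
```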
I did some experiments, and I observe a quadratic slowdown when the stack gets very large, due to repeated scanning of the whole stack by the minor collector. (See benchmark code at the end.) I'm opening the present issue to discuss the problem.
Quadratic stack scanning
This phenomenon already existed with the sequential runtime, but it was rarely observed due to system stack limits. The sequential runtime has an optional optimization that marks the part of the stack already scanned by a minor GC (it cannot contain young roots anymore, so it does not need to be scanned again), but it was only enabled on some architectures:
- https://github.com/ocaml/ocaml/blob/4.13.0/runtime/roots_nat.c#L306-L311
- f6a0392
- 2395d35
What would be a good approach to optimize the Multicore runtime to avoid quadratic stack scanning, without overhead for programs that don't use a lot of stack? I see two potential strategies, using the fact that the Multicore runtime uses a linked list of contiguous stacks. (Normally, calls always go to the last contiguous stack, with resizing if necessary; only effect handlers create new stacks.)
- We could ensure that the minor GC only scans the last/active stack, not previous stacks. (I tried to do this naively and my code doesn't work.) Then we can add a limit on contiguous stack size: when trying to grow the stack above that limit, we add a new stack instead. This bounds the amount of stack-scanning work per minor GC.
- Or we could keep "scanned up to this frame" information in the stack metadata, updated "upward" (when pushing to the stack) by the stack-resizing code, and updated "downward" (when popping from the stack) by special call frames pushed by the stack-resizing code. We can decide how often to do those updates by tuning the resizing strategy (currently it is exponential; we would add an upward limit).
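To make the second strategy more concrete, here is a small OCaml sketch of the watermark bookkeeping. The names and structure are my invention for illustration, not the actual runtime C code: the minor GC scans only the frames above the last watermark, and popping below the watermark lowers it.

```ocaml
(* Illustrative model only (not runtime code): a stack with a watermark
   recording how far a previous minor GC already scanned. *)
type stack = {
  mutable sp : int;            (* current stack depth, in slots *)
  mutable scanned_upto : int;  (* slots below this were already scanned
                                  and cannot contain young roots *)
}

let push s n = s.sp <- s.sp + n

let pop s n =
  s.sp <- s.sp - n;
  (* popping below the watermark invalidates part of it; the runtime
     would detect this with special frames pushed by the resizing code *)
  if s.sp < s.scanned_upto then s.scanned_upto <- s.sp

(* A minor GC scans only the unscanned top portion, then updates the
   watermark. Returns the number of slots scanned. *)
let minor_scan s =
  let work = s.sp - s.scanned_upto in
  s.scanned_upto <- s.sp;
  work

let () =
  let s = { sp = 0; scanned_upto = 0 } in
  push s 1000; assert (minor_scan s = 1000);
  push s 10;   assert (minor_scan s = 10);   (* only the new frames *)
  pop s 500;   push s 20;
  assert (minor_scan s = 20)
```

In this model, repeated minor GCs on a slowly growing stack do total work proportional to the stack growth rather than to the stack depth, which is the linear behavior the strategy aims for.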
I wonder if that sounds reasonable to people, and something worth exploring.
(cc people with their name in relevant code files: @xavierleroy @stedolan @kayceesrk @ctk21)
Measurement code
```ocaml
(* This non-tail-recursive version sees a quadratic slowdown on very
   large lists, due to repeated scanning of the call stack by the minor
   GC. *)
let rec direct_map f xs =
  match xs with
  | [] -> []
  | x :: xs ->
    let y = f x in
    y :: direct_map f xs

(* The TMC version uses constant stack space; no slowdown. *)
let[@tail_mod_cons] rec tmc_map f xs =
  match xs with
  | [] -> []
  | x :: xs ->
    let y = f x in
    y :: tmc_map f xs

(* This version is similar to direct_map, except that every N calls it
   creates a new stack/fiber by using a dummy effect handler. If the
   minor GC would only scan the last fiber, this version should not
   suffer from a quadratic slowdown; but currently all stacks/fibers
   are scanned, so this has no effect. *)
let handling_map f xs =
  let handler_threshold = 100_000 in
  let rec map f xs count =
    if count = 0 then
      map_on_new_stack f xs
    else
      match xs with
      | [] -> []
      | x :: xs ->
        let y = f x in
        y :: map f xs (count - 1)
  and map_on_new_stack f xs =
    let open Effect.Shallow in
    let continue () = start_mapping f xs in
    continue_with (fiber continue) () {
      retc = Fun.id;
      exnc = raise;
      effc = (fun _ -> None);
    }
  and start_mapping f xs =
    map f xs handler_threshold
  in start_mapping f xs

let benchs = [
  "direct", direct_map;
  "handling", handling_map;
  "tmc", tmc_map;
]

let () =
  match
    List.assoc Sys.argv.(1) benchs,
    int_of_string Sys.argv.(2),
    int_of_string Sys.argv.(3)
  with
  | exception _ ->
    Printf.eprintf "Usage: %s [%s] <list-size> <niter>\n%!"
      Sys.argv.(0)
      (String.concat "|" (List.map fst benchs))
  | map, list_size, niter ->
    let li = List.init list_size Fun.id in
    let old = ref [] in
    for _i = 1 to niter do
      old := map succ li
    done
```
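The benchmark above prints nothing itself; one would time it externally (e.g. with `time`). For in-program measurements, a minimal self-contained timing helper could look like this (my own sketch, not part of the benchmark, using the stdlib's `Sys.time` for CPU seconds):

```ocaml
(* Hypothetical helper: run a thunk, print the CPU time it took,
   and return its result. *)
let time_it label f =
  let t0 = Sys.time () in
  let r = f () in
  Printf.printf "%s: %.3fs\n%!" label (Sys.time () -. t0);
  r

let () =
  let li = List.init 1_000_000 Fun.id in
  ignore (time_it "List.map" (fun () -> List.map succ li))
```

Timing each `niter` iteration separately this way makes the quadratic growth visible as per-iteration times that increase with stack depth.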