Garbage collector colours change #9756
Am investigating the cause of the crash on 32-bit.
310e4c1 to abb6fc1
Currently, gray stack overflow is a fatal error if more memory cannot be allocated.
Deals with overflow by emptying mark stack and indicating ranges in major heap chunks that need to be redarkened.
runtime/compact.c (outdated):

```c
#define Caml_gray (1 << 8)
#define Is_gray_hd(hd) (Color_hd (hd) == Caml_gray)
#define Grayhd_hd(hd) (((hd) & ~Caml_black) | Caml_gray)
```
This is unfortunate and probably needs a discussion around how to best integrate the new compactor from #9728 with this change.
It may be the simplest thing to do for now is rename these definitions and push the rest of the discussion to how compaction will work when more of multicore is merged.
In the short term, I'd suggest you leave the "gray"-related macros defined in <caml/gc.h>, and just concentrate on not using them in the major GC. This way the compactor is not affected.
In the longer term, we should re-think the interface between the compactor and the Multicore OCaml major GC. My guess is that you'd need an extra pass of sweeping before calling the compactor, so that the compactor only sees "marked" and "unmarked" blocks but no "garbage" blocks.
Sure. I've moved them back in to gc.h in the latest commit.
On the second point, I think @kayceesrk had a few ideas on how that might work.
In the multicore GC [1], at the end of the major cycle, the meaning of the GC colours are cycled such that MARKED -> UNMARKED, UNMARKED -> GARBAGE, GARBAGE -> MARKED and FREE remains FREE. Just before cycling, there are no GARBAGE objects in the heap since the sweeper has made all of them FREE. As a result, just after cycling, there are no MARKED objects. Hence, depending on when the compaction is (before or after cycling), one of the colours is available for use in compaction.
[1] See https://arxiv.org/abs/2004.11663 for the details of the Multicore GC.
That should work for compaction but I'm curious to know how you deal with out-of-heap blocks, which need to remain MARKED at all times.
Multicore uses the bit pattern NOT_MARKABLE (3 << 8) for out-of-heap objects. MARKED, UNMARKED and GARBAGE cycle among 0 << 8, 1 << 8 and 2 << 8. FREE is not a distinguished colour in Multicore (although my comment above made it sound like it is). Sweeping sets the header of GARBAGE objects to 0 and adds them to the freelist [1]. More information is in the Multicore wiki [2].
[1] https://github.com/ocaml-multicore/ocaml-multicore/blob/172fe18bbc7fcdaa1fc286869c28a2c59ecbf27a/runtime/shared_heap.c#L421-L428.
[2] https://github.com/ocaml-multicore/ocaml-multicore/wiki/Garbage-collector-invariants#gc-cycles
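As a toy illustration of the cycling described above (a sketch under assumed names, not the runtime's actual code), the three cycling colours can be modelled as 2-bit tags whose meaning rotates while NOT_MARKABLE stays fixed:

```c
#include <assert.h>

/* Toy model of multicore's colour cycling: tags 0, 1 and 2 cycle
   between MARKED, UNMARKED and GARBAGE; tag 3 (NOT_MARKABLE) never
   changes meaning.  All names here are illustrative assumptions. */
enum { NOT_MARKABLE = 3 };

static int marked_tag = 0;  /* which tag currently means MARKED */

static int tag_unmarked(void) { return (marked_tag + 1) % 3; }
static int tag_garbage(void)  { return (marked_tag + 2) % 3; }

/* End of major cycle: objects that were MARKED become UNMARKED,
   UNMARKED become GARBAGE, GARBAGE become MARKED.  No header is
   rewritten; only the interpretation of the tags moves. */
static void cycle_colours(void) { marked_tag = (marked_tag + 2) % 3; }
```

Since the sweeper has already freed all GARBAGE objects before cycling, no live object carries the tag that becomes MARKED, which is what frees a colour for the compactor.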
jhjourdan
left a comment
Generally, the change looks good to me. I have a few cosmetic/stylistic remarks, and two other more important remarks:
- There is a bug when resizing the stack fails. With the current code, nothing prevents pushing even though there is no space left.
- Some eventlog tracking code has been removed here and there.
- Contrary to what is advertised in the PR description, the traversal algorithm is depth-first rather than breadth-first. This is in line with the original behavior of the GC.
Whoops, will fix.
Ah yes, that's a mistake in the PR description. @damiendoligez noticed it in a draft but I only fixed one of the two references. Will correct now.
Co-authored-by: Jacques-Henri Jourdan <[email protected]>
jhjourdan
left a comment
The current version looks good to me. Approving.
I still want to propose a fix in mark_stack_push.
Of course, I think you still need at the very least a review by @damiendoligez.
xavierleroy
left a comment
I'm not competent to do a real review, but I left some comments and minor suggestions nonetheless.
```c
end = (block_wsz < 8 ? block_wsz : 8);
```
Why only 8? I agree the loop below must be limited so that it doesn't introduce noticeable latency while skipping over a huge integer array. But it should be fine if it runs a few hundred iterations. And it's a valuable optimization for integer arrays.
So my concern with the optimisation in mark_stack_push is that the Is_black_val check we do inside the loop involves a dereference, and it's possible to hit mark_stack_push from caml_modify. Hundreds of iterations could end up polluting the cache with headers you won't need again until the next major slice. I figured we might as well leave them to be checked during the major slice. May have been a premature optimisation though.
I also had a look at a few of the macro benchmarks we have in sandmark with an instrumented runtime and for all of them 95%+ of calls to mark_stack_push had block_wsz <= 8.
We could try a few different benchmarking runs with different values though and see what the impact is.
Thanks for the benchmarking. OK to keep 8 as the threshold.
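The threshold discussed above can be modelled as follows (a simplified sketch: the helper name, the fields array and its encoding are assumptions, not the runtime's Is_black_val logic):

```c
#include <assert.h>
#include <stddef.h>

/* fields[i] is nonzero if field i would still need marking.  Scan at
   most the first 8 fields: return the offset of the first field worth
   pushing, or block_wsz if a short block needs no push at all. */
static size_t first_markable_field(const int *fields, size_t block_wsz)
{
  size_t end = (block_wsz < 8 ? block_wsz : 8);
  size_t i;
  for (i = 0; i < end; i++)
    if (fields[i])
      return i;
  /* No markable field in the scanned prefix: a block of <= 8 fields
     can be skipped entirely; a longer one is pushed from offset 8. */
  return (block_wsz <= 8) ? block_wsz : 8;
}
```

Capping the scan at 8 bounds both the latency and the cache traffic of the dereferences mentioned above, while still skipping the short, already-marked blocks that dominate in the Sandmark measurements.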
```c
CAMLassert(start <= Wosize_val(v));

while (1){
  int can_mark = 0;
```
I'm grateful for the simplification in this "suspendable / restartable" loop, compared with the original code. The micro-optimization person in me still wonders whether the can_mark Boolean could be replaced by control flow (a goto or two). Maybe not.
I'm also grateful for the simplification and the use of can_mark looks OK to me.
Can we label this PR as …
…for redarken start/end
@damiendoligez was wondering if you'd had a chance to review? Should be similar on the whole to the draft PR but with simplifications from review feedback.
damiendoligez
left a comment
I have only one type error and a few small remarks, plus the feature request of making the skip list persistent, unless it's too much work.
```c
mark_entry* mark_stack = stk->stack;

char* heap_chunk = caml_heap_start;
struct skiplist chunk_sklist = SKIPLIST_STATIC_INITIALIZER;
```
I wonder how much work it would be to make this skip list global and maintain it when extending/shrinking the heap. If you do this, we get to keep our beloved Is_in_heap assertions when (in the near future) the page table disappears from the runtime.
This is a good idea. I'll do a bit of investigating as to what's involved.
Having had a think about this, we should avoid doing it for now. It is unclear how this would work safely with multicore's allocator (and indeed it runs into the same kind of problems as the page table).
```c
  stk->count = 0;
}

static void realloc_mark_stack (struct mark_stack* stk)
```
I agree with @jhjourdan here. If you want to maintain the invariant that the mark stack size is at most 1/32nd of the heap, you need only shrink it when the heap shrinks, at compaction.
Co-authored-by: Damien Doligez <[email protected]>
Formatting changes Co-authored-by: Damien Doligez <[email protected]>
@jhjourdan @damiendoligez we now shrink the mark stack on GC compaction: 727e1f3 I wasn't sure whether to just shrink it down to 1/32nd of the heap or reset it to the initial size. I decided to go with the latter as sizing up seems to be inexpensive and the heap structure may have changed significantly enough that a large mark stack is no longer necessary. I'm happy to resize it to 1/32nd though if there are any compelling arguments the other way.
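A minimal sketch of the shrink-at-compaction behaviour discussed here (struct layout, names and the initial size are assumptions based on this thread, not the committed code):

```c
#include <assert.h>
#include <stdlib.h>

#define INITIAL_MARK_STACK_SIZE 2048  /* entries, per the discussion */

struct mark_stack { void **stack; size_t count; size_t size; };

/* At compaction, reset the stack to its initial capacity.  Returns 1
   if shrunk, 0 if already small enough, -1 if realloc failed (in
   which case the old, larger stack is kept). */
static int shrink_mark_stack(struct mark_stack *stk)
{
  void **smaller;
  if (stk->size <= INITIAL_MARK_STACK_SIZE) return 0;
  smaller = realloc(stk->stack,
                    INITIAL_MARK_STACK_SIZE * sizeof *smaller);
  if (smaller == NULL) return -1;
  stk->stack = smaller;
  stk->size = INITIAL_MARK_STACK_SIZE;
  return 1;
}
```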
damiendoligez
left a comment
The PR looks fine now, modulo a small suggestion in the mark stack shrinking function.
Co-authored-by: Damien Doligez <[email protected]>
Merging now. Thanks everyone.

This PR makes a change to the garbage collector colour scheme in the major collector. It removes the existing gray colour to match the scheme used by the multicore major collector.
This note covers the differences between trunk and multicore's major GC behaviour with respect to colours and overflow, then it lays out how this PR aims to remove the gray colour to aid merging the multicore collector whilst minimising trunk behaviour changes.
GC Colours
On trunk we currently have four colours: White (not yet visited; garbage at the end of the cycle), Gray (visited, but fields not yet scanned), Black (visited, fields scanned) and Blue (on the free list).
For the purposes of marking, multicore only has the two states of UNMARKED and MARKED, which correspond to the White and Black trunk states above. This is because marking is concurrent and idempotent. Additionally, marking and sweeping by different domains can overlap, which is enabled by the new GC colour scheme. For more details: https://arxiv.org/abs/2004.11663
In terms of how these affect the different runtimes:
On trunk when a block is processed it is marked Gray and then added to a list of gray values to be later visited.
On multicore when a block is processed it is MARKED and then each field is pushed on to the mark stack to be later visited and marked.
Different strategies for dealing with overflow
Both of these schemes need a way to deal with the mark stack or gray values overflowing, that is, when their size grows beyond some proportion of the major heap size (in both cases, 1/32nd).
On trunk, when the gray values list overflows, it discards some of its entries and marks the heap as impure, which later in the marking loop causes the heap to be rescanned for gray blocks so that they can be marked.
On multicore, when the mark stack overflows, entries are scanned to find the minimum number of pools whose removal gives a 20% reduction in the size of the stack. These pools are then marked as needing redarkening at a later point in time and their entries are removed from the stack.
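That pool selection can be sketched roughly as a greedy pick of the fullest pools until 20% of the stack is covered (illustrative only; the real runtime works over pool metadata, not a counts array):

```c
#include <assert.h>
#include <stddef.h>

/* per_pool[i] is the number of stack entries in pool i.  Flag the
   fewest pools whose entries add up to at least 20% of stack_count;
   returns how many pools were chosen. */
static size_t choose_pools(const size_t *per_pool, size_t npools,
                           size_t stack_count, int *prune)
{
  size_t target = stack_count / 5;  /* a 20% reduction */
  size_t pruned = 0, chosen = 0, i;
  for (i = 0; i < npools; i++) prune[i] = 0;
  while (pruned < target) {
    size_t best = npools, best_count = 0;
    for (i = 0; i < npools; i++)      /* fullest unchosen pool */
      if (!prune[i] && per_pool[i] > best_count) {
        best = i;
        best_count = per_pool[i];
      }
    if (best == npools) break;        /* nothing left to choose */
    prune[best] = 1;
    pruned += best_count;
    chosen++;
  }
  return chosen;
}
```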
Differences in gray values and mark stack
Gray values on trunk and the mark stack on multicore differ slightly in their representation. On trunk, the gray values are an array of values that have been marked gray. On multicore, the mark stack is an array of mark entry structs, each holding the block currently being scanned, an offset index into the block for the next field to be marked, and the block's total number of fields (Wosize).
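An illustrative layout for such a mark entry (field names and C types here are assumptions, not the runtime's exact definitions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multicore-style mark entry: the block being scanned, the index of
   the next field to mark, and the block's field count (Wosize). */
typedef struct {
  uintptr_t *block;  /* block currently being scanned */
  size_t offset;     /* next field to be marked */
  size_t end;        /* number of fields in the block (Wosize) */
} mark_entry;
```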
Differences in marking behaviour
A subtle difference between trunk and multicore is their behaviour in marking the object graph.
In trunk, all of a block's fields are pushed on to the stack first and only then descended into; in multicore, each field is descended into in turn as it is encountered.
This may lead to different overflow behaviours on the same object graph.
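The difference can be seen on a toy tree (a sketch; the field order and LIFO pop order are assumptions about how the two schemes fall out, not the runtimes' actual code):

```c
#include <assert.h>

/* Five-node tree: 0 -> (1, 2), 1 -> (3, 4); -1 means no child. */
#define N 5
static const int children[N][2] = {
  {1, 2}, {3, 4}, {-1, -1}, {-1, -1}, {-1, -1}
};

/* Trunk-style: when a block is scanned, push all of its fields, then
   pop the stack; the last field pushed is descended into first. */
static void mark_trunk(int root, int *order, int *n)
{
  int stack[16], sp = 0, b, i;
  stack[sp++] = root;
  while (sp > 0) {
    b = stack[--sp];
    order[(*n)++] = b;
    for (i = 0; i < 2; i++)
      if (children[b][i] >= 0) stack[sp++] = children[b][i];
  }
}

/* Multicore-style: descend into each field as soon as it is seen. */
static void mark_multicore(int b, int *order, int *n)
{
  int i;
  order[(*n)++] = b;
  for (i = 0; i < 2; i++)
    if (children[b][i] >= 0) mark_multicore(children[b][i], order, n);
}
```

On this tree the trunk order is 0, 2, 1, 4, 3 while the multicore order is 0, 1, 3, 4, 2, so the peak stack contents (and hence the overflow point) can differ on the same graph.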
This PR
Mark stack
In this PR we remove the gray colour from trunk and we replace the gray values array with multicore's mark stack. One slight change from multicore is that we no longer store the end (Wosize) in the mark stack's entries, as there seems to be no performance benefit from doing so.
Marking behaviour
We retain trunk's marking behaviour.
Overflow behaviour
We deviate here from multicore's existing strategy. All heap chunks are added to a skiplist and the mark stack entries are iterated. For each chunk we build up a range in the chunk that needs to be redarkened at a later time. We do this for all mark stack entries and prune the whole mark stack.
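The range building above can be sketched as follows (a toy model; addresses are plain indices and the skiplist lookup is elided):

```c
#include <assert.h>
#include <stddef.h>

/* Pending redarken range for one heap chunk.  Each pruned mark stack
   entry widens its chunk's range to cover the entry's block. */
struct redarken_range { size_t start, end; int active; };

static void note_redarken(struct redarken_range *r, size_t addr)
{
  if (!r->active) {
    r->start = addr;
    r->end = addr + 1;
    r->active = 1;
  } else {
    if (addr < r->start) r->start = addr;
    if (addr + 1 > r->end) r->end = addr + 1;
  }
}
```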
Later when the marking loop exhausts the mark stack we iterate through chunks and redarken blocks with black headers that have any non-black fields. This process will only populate up to a quarter of the mark stack, to prevent subsequent overflows. It is incremental and can be called again when the mark stack empties.
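The incremental pass described above might look roughly like this (a toy model over an array of flags; the real code walks block headers within a chunk):

```c
#include <assert.h>
#include <stddef.h>

/* needs[i] is nonzero if block i is black with a non-black field.
   Push such blocks until the stack is a quarter full, and return the
   index to resume from so the pass can continue later. */
static size_t redarken_chunk(const int *needs, size_t nblocks,
                             size_t start, size_t *stack,
                             size_t *count, size_t stack_size)
{
  size_t limit = stack_size / 4;  /* fill at most a quarter */
  size_t i;
  for (i = start; i < nblocks; i++) {
    if (*count >= limit) return i;  /* resume here next time */
    if (needs[i]) stack[(*count)++] = i;
  }
  return nblocks;  /* chunk fully processed */
}
```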
Initially we ported multicore's addrmap-based implementation (https://github.com/ocaml-multicore/ocaml-multicore/blob/parallel_minor_gc/runtime/major_gc.c#L1385) and broke the major heap into heap ranges, as heap chunks can vary in size. This didn't offer a performance advantage over the much simpler skiplist-based approach.
Performance
The following are a set of macro benchmarks from the Sandmark suite:
Of the macro benchmarks, matrix multiply and the two zarith benchmarks show performance regressions. From profiling all three spend nearly all their time in a few hot loops and next to no time in marking. Re-running on a machine with layout randomization shows no difference in performance between this PR and trunk, which points to a loop layout-related effect.
Below are some existing microbenchmarks in the Sandmark suite that did manage to overflow their mark stack. Overall there seems to be little negative impact on them.
Considerations
Overflow algorithm
In theory the simple overflow algorithm should be O(n log c), where n is the number of entries in the mark stack and c is the number of heap chunks. This could be a problem for very large heaps, as the mark stack size is a proportion of the heap size, though the size of heap chunks also grows rapidly as the heap grows.
Initial mark stack size
We currently have an initial mark stack size of 2048 mark stack entries. At 16 bytes per entry on 64-bit architectures, that is 32 KiB. Should this be a parameter tunable through OCAMLRUNPARAM? An alternative is to set the initial mark stack size much lower but allow it to grow beyond 1/32nd of the heap up to a certain size.