
Garbage collector colours change#9756

Merged
damiendoligez merged 37 commits into ocaml:trunk from sadiqj:gc_colours_skip
Sep 17, 2020

Conversation


@sadiqj sadiqj commented Jul 12, 2020

This PR makes a change to the garbage collector colour scheme in the major collector. It removes the existing gray colour to match the scheme used by the multicore major collector.

This note covers the differences between trunk and multicore's major GC behaviour with respect to colours and overflow, then it lays out how this PR aims to remove the gray colour to aid merging the multicore collector whilst minimising trunk behaviour changes.

GC Colours

On trunk we currently have:

  • White - blocks that we haven't seen referenced by any live (black) major heap blocks yet
  • Gray - blocks that have been referenced by a live (black) major heap block but whose fields we haven't yet marked
  • Black - blocks that have been marked as live and whose fields we have also marked

For the purposes of marking, multicore only has the two states of UNMARKED and MARKED, which correspond to the White and Black trunk states above. This is because marking is concurrent and idempotent. Additionally, marking and sweeping by different domains can overlap, which is enabled by the new GC colour scheme. For more details: https://arxiv.org/abs/2004.11663

In terms of how these affect the different runtimes:

On trunk when a block is processed it is marked Gray and then added to a list of gray values to be later visited.

On multicore when a block is processed it is MARKED and then each field is pushed on to the mark stack to be later visited and marked.

Different strategies for dealing with overflow

Both of these schemes need a way to deal with the mark stack or gray values overflowing, that is, when their size grows beyond some proportion of the major heap (1/32nd in both cases).

On trunk, when the gray values array overflows, it discards some of its entries and marks the heap as impure, which later in the marking loop causes the heap to be rescanned for gray blocks so that they can be marked.

On multicore, when the mark stack overflows, it goes through a process whereby entries are scanned to find the minimum number of pools that will result in a 20% reduction in the size of the stack. These pools are then marked as needing to be redarkened at a later point in time and their entries are removed from the stack.

Differences in gray values and mark stack

Gray values on trunk and the mark stack on multicore differ slightly in their representation. On trunk, gray values is an array of values that are marked gray. On multicore, the mark stack is stored as an array of mark entry structs, each representing a block currently being scanned, an offset index into the block that will be marked next, and the number of fields (Wosize) of the block.

Differences in marking behaviour

A subtle difference between trunk and multicore is their behaviour in marking the object graph.

In trunk this goes as follows:

  • A value is taken from the gray values
  • Each of its fields is iterated
    • The field is marked gray and added to gray values

In multicore:

  • An entry e is taken from the mark stack
  • The offset in the entry e is used to construct a new entry child
  • The entry e's offset is incremented and it is pushed back on to the mark stack
  • e is set to child and we continue from step 2

i.e. trunk pushes all fields on to the stack first and then descends, whereas multicore descends through each field in turn.

This may lead to different overflow behaviours on the same object graph.

This PR

Mark stack

In this PR we remove the gray colour from trunk and replace the gray values array with multicore's mark stack. One slight change from multicore is that we no longer store the end (Wosize) in the mark stack's entries, as there seems to be no performance benefit from doing so.

Marking behaviour

We retain trunk's marking behaviour.

Overflow behaviour

We deviate here from multicore's existing strategy. All heap chunks are added to a skiplist and the mark stack entries are iterated. As the entries are iterated we build up, for each chunk, a range within the chunk that needs to be redarkened at a later time. Once all mark stack entries have been processed we prune the whole mark stack.

Later when the marking loop exhausts the mark stack we iterate through chunks and redarken blocks with black headers that have any non-black fields. This process will only populate up to a quarter of the mark stack, to prevent subsequent overflows. It is incremental and can be called again when the mark stack empties.

Initially we ported multicore's addrmap based implementation (https://github.com/ocaml-multicore/ocaml-multicore/blob/parallel_minor_gc/runtime/major_gc.c#L1385) and broke the major heap into heap ranges, as heap chunks can vary in size. This actually didn't offer a performance advantage over the much simpler skiplist-based approach.

Performance

The following are a set of macro benchmarks from the Sandmark suite:

[image: macro benchmark results]

Of the macro benchmarks, matrix multiply and the two zarith benchmarks show performance regressions. From profiling all three spend nearly all their time in a few hot loops and next to no time in marking. Re-running on a machine with layout randomization shows no difference in performance between this PR and trunk, which points to a loop layout-related effect.

Below are some existing microbenchmarks in the Sandmark suite that did manage to overflow their mark stack. Overall there seems to be little negative impact on them.

[image: microbenchmark results]

Considerations

Overflow algorithm

In theory the simple overflow algorithm should be O(n log c), where n is the number of entries in the mark stack and c is the number of heap chunks. This could be a problem for very large heaps, as the mark stack is a proportion of the heap size, though the size of heap chunks also grows rapidly as the heap grows.

Initial mark stack size

We currently have an initial mark stack size of 2048 mark stack entries. At 16 bytes per entry on 64-bit architectures, that is 32 KiB. Should this be a parameter tunable through OCAMLRUNPARAM? An alternative is to set the initial mark stack size much lower but allow it to grow beyond 1/32nd of the heap up to a certain size.

@sadiqj sadiqj force-pushed the gc_colours_skip branch from 3941a33 to d3a4126 on July 12, 2020 16:05

sadiqj commented Jul 12, 2020

Am investigating the cause of the crash on 32-bit.

@sadiqj sadiqj force-pushed the gc_colours_skip branch 2 times, most recently from 310e4c1 to abb6fc1 on July 14, 2020 13:11
stedolan and others added 3 commits July 14, 2020 15:22
Currently, gray stack overflow is a fatal error if more memory
cannot be allocated.
Deals with overflow by emptying mark stack and indicating ranges
in major heap chunks that need to be redarkened.
@sadiqj sadiqj force-pushed the gc_colours_skip branch from abb6fc1 to 134d975 on July 14, 2020 14:22

#define Caml_gray (1 << 8)
#define Is_gray_hd(hd) (Color_hd (hd) == Caml_gray)
#define Grayhd_hd(hd) (((hd) & ~Caml_black) | Caml_gray)
Contributor Author

This is unfortunate and probably needs a discussion around how to best integrate the new compactor from #9728 with this change.

It may be that the simplest thing to do for now is to rename these definitions and push the rest of the discussion to how compaction will work once more of multicore is merged.

Contributor

In the short term, I'd suggest you leave the "gray"-related macros defined in <caml/gc.h>, and just concentrate on not using them in the major GC. This way the compactor is not affected.

In the longer term, we should re-think the interface between the compactor and the Multicore OCaml major GC. My guess is that you'd need an extra pass of sweeping before calling the compactor, so that the compactor only sees "marked" and "unmarked" blocks but no "garbage" blocks.

Contributor Author

Sure. I've moved them back into gc.h in the latest commit.

On the second point, I think @kayceesrk had a few ideas on how that might work.

Contributor

@kayceesrk kayceesrk Jul 15, 2020

In the multicore GC [1], at the end of the major cycle the meanings of the GC colours are cycled such that MARKED -> UNMARKED, UNMARKED -> GARBAGE, GARBAGE -> MARKED, and FREE remains FREE. Just before cycling there are no GARBAGE objects in the heap, since the sweeper has made all of them FREE. As a result, just after cycling there are no MARKED objects. Hence, depending on when compaction happens (before or after cycling), one of the colours is available for use by compaction.

[1] See https://arxiv.org/abs/2004.11663 for the details of the Multicore GC.

Member

That should work for compaction but I'm curious to know how you deal with out-of-heap blocks, which need to remain MARKED at all times.

Contributor

Multicore uses the bit pattern NOT_MARKABLE (3 << 8) for out-of-heap objects. MARKED, UNMARKED and GARBAGE cycle among 0 << 8, 1 << 8 and 2 << 8. FREE is not a distinguished colour in Multicore (although my comment above made it sound like it is). Sweeping sets the header of GARBAGE objects to 0 and adds it to the freelist [1]. More information is in the Multicore wiki [2].

[1] https://github.com/ocaml-multicore/ocaml-multicore/blob/172fe18bbc7fcdaa1fc286869c28a2c59ecbf27a/runtime/shared_heap.c#L421-L428.
[2] https://github.com/ocaml-multicore/ocaml-multicore/wiki/Garbage-collector-invariants#gc-cycles

Contributor

@jhjourdan jhjourdan left a comment

Generally, the change looks good to me. I have a few cosmetic/stylistic remarks, and two other more important remarks:

  • There is a bug when resizing the stack fails. With the current code, nothing prevents pushing even though there is no space left.
  • Some eventlog tracking code has been removed here and there.
  • Contrary to what is advertised in the PR description, the traversal algorithm is depth-first rather than breadth-first. This is in line with the original behaviour of the GC.


sadiqj commented Jul 16, 2020

* Some eventlog tracking code has been removed here and there.

Whoops, will fix.

* Contrary to what is advertised in the PR description, the traversal algorithm is depth-first rather than breadth-first. This is in line with the original behaviour of the GC.

Ah yes, that's a mistake in the PR description. @damiendoligez noticed it in a draft but I only fixed one of the two references. Will correct now.

Contributor

@jhjourdan jhjourdan left a comment

The current version looks good to me. Approving.

I still want to propose a fix in mark_stack_push.

Of course, I think you still need at the very least a review by @damiendoligez.

Contributor

@xavierleroy xavierleroy left a comment

I'm not competent to do a real review, but I left some comments and minor suggestions nonetheless.

}
#endif

end = (block_wsz < 8 ? block_wsz : 8);
Contributor

Why only 8? I agree the loop below must be limited so that it doesn't introduce noticeable latency while skipping over a huge integer array. But it should be fine if it runs a few hundred iterations. And it's a valuable optimization for integer arrays.

Contributor Author

So my concern with the optimisation in mark_stack_push is that the Is_black_val we do inside the loop involves a dereference, and it's possible to hit mark_stack_push from caml_modify. Hundreds of iterations could end up polluting the cache with headers you won't need again until the next major slice. I figured we might as well leave them to be checked during the major slice. May have been a premature optimisation though.

I also had a look at a few of the macro benchmarks we have in sandmark with an instrumented runtime and for all of them 95%+ of calls to mark_stack_push had block_wsz <= 8.

We could try a few different benchmarking runs with different values though and see what the impact is.

Contributor Author

I've done a benchmarking run for 8, 16, 32, 64, 128 and 256 for the optimisation in mark_stack_push, and it looks like in most cases smaller is better. These numbers are relative to trunk as of about twelve hours ago:

[image: benchmark results for the different thresholds]

Contributor

Thanks for the benchmarking. OK to keep 8 as the threshold.

CAMLassert(start <= Wosize_val(v));

while (1){
int can_mark = 0;
Contributor

I'm grateful for the simplification in this "suspendable / restartable" loop, compared with the original code. The micro-optimization person in me still wonders whether the can_mark Boolean could be replaced by control flow (a goto or two). Maybe not.

Member

I'm also grateful for the simplification and the use of can_mark looks OK to me.

@kayceesrk
Contributor

Can we label this PR as multicore-prerequisite? Thanks.

@sadiqj sadiqj force-pushed the gc_colours_skip branch from 5fc7093 to 44f9e12 on July 19, 2020 13:59

sadiqj commented Jul 28, 2020

@damiendoligez was wondering if you'd had a chance to review?

Should be similar on the whole to draft PR but with simplifications from review feedback.

Member

@damiendoligez damiendoligez left a comment

I have only one type error and a few small remarks, plus the feature request of making the skip list persistent, unless it's too much work.

mark_entry* mark_stack = stk->stack;

char* heap_chunk = caml_heap_start;
struct skiplist chunk_sklist = SKIPLIST_STATIC_INITIALIZER;
Member

I wonder how much work it would be to make this skip list global and maintain it when extending/shrinking the heap. If you do this, we get to keep our beloved Is_in_heap assertions when (in the near future) the page table disappears from the runtime.

Contributor Author

This is a good idea. I'll do a bit of investigating as to what's involved.

Contributor Author

Having had a think about this, we should avoid doing it for now. It is unclear how this would work safely with multicore's allocator (and indeed it runs into the same kind of problems as the page table).

stk->count = 0;
}

static void realloc_mark_stack (struct mark_stack* stk)
Member

I agree with @jhjourdan here. If you want to maintain the invariant that the mark stack size is at most 1/32nd of the heap, you need only shrink it when the heap shrinks, at compaction.



sadiqj commented Sep 8, 2020

@jhjourdan @damiendoligez we now shrink the mark stack on GC compaction: 727e1f3

I wasn't sure whether to just shrink it down to 1/32nd of the heap or reset it to the initial size. I decided to go with the latter, as sizing up seems to be inexpensive and the heap structure may have changed significantly enough that a large mark stack is no longer necessary.

I'm happy to resize it to 1/32nd though if there are any compelling arguments the other way.

@xavierleroy xavierleroy added this to the 4.12 milestone Sep 9, 2020
Member

@damiendoligez damiendoligez left a comment

The PR looks fine now, modulo a small suggestion in the mark stack shrinking function.

@damiendoligez
Member

Merging now. Thanks everyone.
