
Garbage collector colours change#9756

Merged
damiendoligez merged 37 commits into ocaml:trunk from sadiqj:gc_colours_skip
Sep 17, 2020

Conversation


@sadiqj sadiqj commented Jul 12, 2020

This PR makes a change to the garbage collector colour scheme in the major collector. It removes the existing gray colour to match the scheme used by the multicore major collector.

This note covers the differences between trunk and multicore's major GC behaviour with respect to colours and overflow, then it lays out how this PR aims to remove the gray colour to aid merging the multicore collector whilst minimising trunk behaviour changes.

GC Colours

On trunk we currently have:

  • White - blocks that we haven't seen referenced by any live (black) major heap blocks yet
  • Gray - blocks that have been referenced by a live (black) major heap block but whose fields we haven't yet marked
  • Black - blocks that have been marked as live and whose fields we have also marked

For the purposes of marking, multicore only has the two states of UNMARKED and MARKED, which correspond to the White and Black trunk states above. This is because marking is concurrent and idempotent. Additionally, marking and sweeping by different domains can overlap, which is enabled by the new GC colour scheme. For more details: https://arxiv.org/abs/2004.11663

In terms of how these affect the different runtimes:

On trunk when a block is processed it is marked Gray and then added to a list of gray values to be later visited.

On multicore when a block is processed it is MARKED and then each field is pushed on to the mark stack to be later visited and marked.

Different strategies for dealing with overflow

Both of these schemes need a way to deal with the mark stack or gray values overflowing, that is, when their size grows beyond some proportion of the major heap (1/32nd in both cases).

On trunk, when the gray values array overflows, it discards some of its entries and marks the heap as impure, which later in the marking loop causes the heap to be rescanned for gray blocks so that they can be marked.

On multicore, when the mark stack overflows, it goes through a process whereby entries are scanned to find the minimum number of pools that will result in a 20% reduction in the size of the stack. These pools are then marked as needing to be redarkened at a later point in time and their entries are removed from the stack.

Differences in gray values and mark stack

Gray values on trunk and the mark stack on multicore differ slightly in their representation. On trunk, gray values is an array of values that are marked gray. On multicore, the mark stack is stored as an array of mark entry structs, each representing a block currently being scanned, an offset index into the block that will be marked next, and the number of fields (Wosize) of the block.

Differences in marking behaviour

A subtle difference between trunk and multicore is their behaviour in marking the object graph.

In trunk this goes as follows:

  • A value is taken from the gray values
  • Each of its fields is iterated
    • The field is marked gray and added to gray values

In multicore:

  • An entry e is taken from the mark stack
  • The offset in the entry e is used to construct a new entry child
  • The entry e's offset is incremented and it is pushed back on to the mark stack
  • e is set to child and we continue from step 2

i.e. trunk pushes all fields on to the stack first and then descends, whereas multicore descends through each field in turn.

This may lead to different overflow behaviours on the same object graph.

This PR

Mark stack

In this PR we remove the gray colour from trunk and replace the gray values array with multicore's mark stack. One slight change from multicore is that we no longer store the end (Wosize) in the mark stack's entries, as there seems to be no performance benefit from doing so.

Marking behaviour

We retain trunk's marking behaviour.

Overflow behaviour

We deviate here from multicore's existing strategy. All heap chunks are added to a skiplist and the mark stack entries are iterated. As the entries are iterated we build up, for each chunk, a range within the chunk that needs to be redarkened at a later time. Once all mark stack entries have been processed we prune the whole mark stack.

Later when the marking loop exhausts the mark stack we iterate through chunks and redarken blocks with black headers that have any non-black fields. This process will only populate up to a quarter of the mark stack, to prevent subsequent overflows. It is incremental and can be called again when the mark stack empties.

Initially we ported multicore's addrmap based implementation (https://github.com/ocaml-multicore/ocaml-multicore/blob/parallel_minor_gc/runtime/major_gc.c#L1385) and broke the major heap into heap ranges, as heap chunks can vary in size. This actually didn't offer a performance advantage over the much simpler skiplist-based approach.

Performance

The following are a set of macro benchmarks from the Sandmark suite:

[image: macro benchmark results]

Of the macro benchmarks, matrix multiply and the two zarith benchmarks show performance regressions. From profiling all three spend nearly all their time in a few hot loops and next to no time in marking. Re-running on a machine with layout randomization shows no difference in performance between this PR and trunk, which points to a loop layout-related effect.

Below are some existing microbenchmarks in the Sandmark suite that did manage to overflow their mark stack. Overall there seems to be little negative impact on them.

[image: microbenchmark results]

Considerations

Overflow algorithm

In theory the simple overflow algorithm should be O(n log c), where n is the number of entries in the mark stack and c is the number of heap chunks. This could be a problem for very large heaps, as the mark stack is a proportion of the heap size, though the size of heap chunks also grows rapidly as the heap grows.

Initial mark stack size

We currently have an initial mark stack size of 2048 mark stack entries. At 16 bytes per entry on 64-bit architectures, that is 32 KiB. Should this be a parameter tunable through OCAMLRUNPARAM? An alternative is to set the initial mark stack size much lower but allow it to grow beyond 1/32nd of the heap up to a certain size.

@sadiqj sadiqj force-pushed the gc_colours_skip branch from 3941a33 to d3a4126 on July 12, 2020 16:05

sadiqj commented Jul 12, 2020

Am investigating the cause of the crash on 32-bit.

@sadiqj sadiqj force-pushed the gc_colours_skip branch 2 times, most recently from 310e4c1 to abb6fc1 on July 14, 2020 13:11
stedolan and others added 3 commits July 14, 2020 15:22
Currently, gray stack overflow is a fatal error if more memory
cannot be allocated.
Deals with overflow by emptying mark stack and indicating ranges
in major heap chunks that need to be redarkened.
@sadiqj sadiqj force-pushed the gc_colours_skip branch from abb6fc1 to 134d975 on July 14, 2020 14:22

#define Caml_gray (1 << 8)
#define Is_gray_hd(hd) (Color_hd (hd) == Caml_gray)
#define Grayhd_hd(hd) (((hd) & ~Caml_black) | Caml_gray)
Contributor Author

This is unfortunate and probably needs a discussion around how to best integrate the new compactor from #9728 with this change.

It may be that the simplest thing to do for now is to rename these definitions and push the rest of the discussion to how compaction will work once more of multicore is merged.

Contributor

In the short term, I'd suggest you leave the "gray"-related macros defined in <caml/gc.h>, and just concentrate on not using them in the major GC. This way the compactor is not affected.

In the longer term, we should re-think the interface between the compactor and the Multicore OCaml major GC. My guess is that you'd need an extra pass of sweeping before calling the compactor, so that the compactor only sees "marked" and "unmarked" blocks but no "garbage" blocks.

Contributor Author

Sure. I've moved them back into gc.h in the latest commit.

On the second point, I think @kayceesrk had a few ideas on how that might work.

Contributor

@kayceesrk kayceesrk Jul 15, 2020

In the multicore GC [1], at the end of the major cycle the meanings of the GC colours are cycled such that MARKED -> UNMARKED, UNMARKED -> GARBAGE, GARBAGE -> MARKED, and FREE remains FREE. Just before cycling there are no GARBAGE objects in the heap, since the sweeper has made all of them FREE. As a result, just after cycling there are no MARKED objects. Hence, depending on when compaction happens (before or after cycling), one of the colours is available for use by compaction.

[1] See https://arxiv.org/abs/2004.11663 for the details of the Multicore GC.

Member

That should work for compaction but I'm curious to know how you deal with out-of-heap blocks, which need to remain MARKED at all times.

Contributor

Multicore uses the bit pattern NOT_MARKABLE (3 << 8) for out-of-heap objects. MARKED, UNMARKED and GARBAGE cycle among 0 << 8, 1 << 8 and 2 << 8. FREE is not a distinguished colour in Multicore (although my comment above made it sound like it is). Sweeping sets the header of GARBAGE objects to 0 and adds it to the freelist [1]. More information is in the Multicore wiki [2].

[1] https://github.com/ocaml-multicore/ocaml-multicore/blob/172fe18bbc7fcdaa1fc286869c28a2c59ecbf27a/runtime/shared_heap.c#L421-L428.
[2] https://github.com/ocaml-multicore/ocaml-multicore/wiki/Garbage-collector-invariants#gc-cycles

Contributor

@jhjourdan jhjourdan left a comment

Generally, the change looks good to me. I have a few cosmetic/stylistic remarks, and two other more important remarks:

  • There is a bug when resizing the stack fails. With the current code, nothing prevents pushing even though there is no space left.
  • Some eventlog tracking code has been removed here and there.
  • Contrary to what is advertised in the PR description, the traversal algorithm is depth-first rather than breadth-first. This is in line with the original behaviour of the GC.


sadiqj commented Jul 16, 2020

* Some eventlog tracking code has been removed here and there.

Whoops, will fix.

* Contrary to what is advertised in the PR description, the traversal algorithm is depth-first rather than breadth-first. This is in line with the original behaviour of the GC.

Ah yes, that's a mistake in the PR description. @damiendoligez noticed it in a draft but I only fixed one of the two references. Will correct now.

Contributor

@jhjourdan jhjourdan left a comment

The current version looks good to me. Approving.

I still want to propose a fix in mark_stack_push.

Of course, I think you still need at the very least a review by @damiendoligez.

Contributor

@xavierleroy xavierleroy left a comment

I'm not competent to do a real review, but I left some comments and minor suggestions nonetheless.

}
#endif

end = (block_wsz < 8 ? block_wsz : 8);
Contributor

Why only 8? I agree the loop below must be limited so that it doesn't introduce noticeable latency while skipping over a huge integer array. But it should be fine if it runs a few hundred iterations. And it's a valuable optimization for integer arrays.

Contributor Author

So my concern with the optimisation in mark_stack_push is that the Is_black_val we do inside the loop involves a dereference, and it's possible to hit mark_stack_push from caml_modify. Hundreds of iterations could end up polluting the cache with headers you won't need again until the next major slice. I figured we might as well leave them to be checked during the major slice. May have been a premature optimisation though.

I also had a look at a few of the macro benchmarks we have in sandmark with an instrumented runtime and for all of them 95%+ of calls to mark_stack_push had block_wsz <= 8.

We could try a few different benchmarking runs with different values though and see what the impact is.

Contributor Author

I've done a benchmarking run for 8, 16, 32, 64, 128 and 256 for the optimisation in mark_stack_push, and it looks like in most cases smaller is better. These numbers are relative to trunk as of about twelve hours ago:

[image: benchmark results for the different thresholds]

Contributor

Thanks for the benchmarking. OK to keep 8 as the threshold.

CAMLassert(start <= Wosize_val(v));

while (1){
int can_mark = 0;
Contributor

I'm grateful for the simplification in this "suspendable / restartable" loop, compared with the original code. The micro-optimization person in me still wonders whether the can_mark Boolean could be replaced by control flow (a goto or two). Maybe not.

Member

I'm also grateful for the simplification and the use of can_mark looks OK to me.

@kayceesrk
Contributor

Can we label this PR as multicore-prerequisite? Thanks.

@sadiqj sadiqj force-pushed the gc_colours_skip branch from 5fc7093 to 44f9e12 on July 19, 2020 13:59

sadiqj commented Jul 28, 2020

@damiendoligez was wondering if you'd had a chance to review?

Should be similar on the whole to draft PR but with simplifications from review feedback.

Member

@damiendoligez damiendoligez left a comment

I have only one type error and a few small remarks, plus the feature request of making the skip list persistent, unless it's too much work.

mark_entry* mark_stack = stk->stack;

char* heap_chunk = caml_heap_start;
struct skiplist chunk_sklist = SKIPLIST_STATIC_INITIALIZER;
Member

I wonder how much work it would be to make this skip list global and maintain it when extending/shrinking the heap. If you do this, we get to keep our beloved Is_in_heap assertions when (in the near future) the page table disappears from the runtime.

Contributor Author

This is a good idea. I'll do a bit of investigating as to what's involved.

Contributor Author

Having had a think about this, we should avoid doing it for now. It is unclear how this would work safely with multicore's allocator (and indeed it runs into the same kind of problems as the page table).

stk->count = 0;
}

static void realloc_mark_stack (struct mark_stack* stk)
Member

I agree with @jhjourdan here. If you want to maintain the invariant that the mark stack size is at most 1/32nd of the heap, you need only shrink it when the heap shrinks, at compaction.



sadiqj commented Sep 8, 2020

@jhjourdan @damiendoligez we now shrink the mark stack on GC compaction: 727e1f3

I wasn't sure whether to just shrink it down to 1/32nd of the heap or reset it to the initial size. I decided to go with the latter, as sizing up seems to be inexpensive and the heap structure may have changed significantly enough that a large mark stack is no longer necessary.

I'm happy to resize it to 1/32nd though if there are any compelling arguments the other way.

@xavierleroy xavierleroy added this to the 4.12 milestone Sep 9, 2020
Member

@damiendoligez damiendoligez left a comment

The PR looks fine now, modulo a small suggestion in the mark stack shrinking function.

@damiendoligez
Member

Merging now. Thanks everyone.
