Skip to content

GC latency improvements#297

Merged
damiendoligez merged 13 commits intoocaml:trunkfrom
damiendoligez:trunk-gc-patches
Dec 21, 2015
Merged

GC latency improvements#297
damiendoligez merged 13 commits intoocaml:trunkfrom
damiendoligez:trunk-gc-patches

Conversation

@damiendoligez
Copy link
Member

This patch introduces several changes to the GC, mostly aimed at reducing worst-case latency:

  1. When a GC cycle starts, the global roots must be marked. This is now done incrementally. (asmrun/roots.c -- native-code only because byte-code doesn't have this problem, it has only one global root).
  2. The minor GC and major GC slice are separated and the major slice is run at the mid-point between minor collections (i.e. when the minor heap is half-full).
  3. Major GC slice size is now (under user control) amortized over several minor collections instead of varying directly with the promotion rate.
  4. Large arrays are now marked incrementally instead of blocking the program for a long time.
  5. Allocation of major heap chunks is (optionally) done with mmap using the HUGE_PAGES feature to lower the TLB load on large heap sizes.

minor improvements:

  • better control of GC verbosity in debug runtime
  • added a runtime variant with GC performance instrumentation
  • missing include in instrtrace.c
  • add an inline function for adding an entry to a ref_table
  • add functions to stdlib modules Gc and Sys to control the new features
  • better display of GC stats

@kit-ty-kate
Copy link
Member

Have you tried to squash all the commits, stash the resulting commit, pull from trunk and apply the stashed changes ? Maybe it can work.

@gasche
Copy link
Member

gasche commented Nov 19, 2015

We discussed this large change at the developer meeting yesterday, and decided to include it for 4.03 -- the latency improvement are rather impressive. Below is a quick summary of the salient points of the discussion.

The flambda people in the room ( @chambart and @mshinwell ) seemed nervous at the mention that the work on incremental marking of GC roots relied on their immutability, so they may want to check the details with Damien.

One thing that was forgotten in the above description is the incremental marking of large arrays.

One other thing that is unclear in the above description is the status of the mmap tricks. My understanding is that they are disabled by default, and can be enabled (but is it a configure-time flag or a program-runtime flag?).

There was discussion of the interaction with @mshinwell 's more ambitious no-naked-pointers work in #203 . One idea mentioned by Damien was to use the mmap trick in addition to no-naked-pointers, to serve as a "debug mode" making it easy to check that the naked accesses are indeed in the heap (but maybe just keeping the pagetable around for this debug mode would suffice?).

@damiendoligez
Copy link
Member Author

I have updated the patch to turn it into a single commit, and to remove the mmap code. (BTW the mmap option was neither configure-time nor run-time, but compile-time). The consensus was that no-naked-pointers is the right thing to do, even if the migration path is more dangerous.

Indeed, the mmap trick is not needed to get a debug mode for no-naked-pointers.

I have updated the description above to reflect the current state of the patch (and add the missing bit about large arrays).

@damiendoligez
Copy link
Member Author

@bobot (and everyone else who would volunteer)
This is now ready for review.

@mshinwell
Copy link
Contributor

@gasche Nitpick on no-naked-pointers: despite the name (which should maybe be fixed), pointers outside the heap are permitted. It's just that they need to point at something that looks like an OCaml value.

@bobot
Copy link
Contributor

bobot commented Nov 23, 2015

It should be a black value, no?

@damiendoligez
Copy link
Member Author

The complete no-naked-pointers rule is that every pointer to outside the heap must be at least one of the following:

  • odd (i.e. look like an int)
  • stored in a closure
  • to something that has a black header (i.e. the word before the pointed thing must be readable and have the right values in its two color bits)

@mshinwell
Copy link
Contributor

As I have mentioned to @damiendoligez in private email, I suspect there is a bug in this patch that causes it to fail when merged with the flambda branch. It's probably to do with the multiple roots thing (GPR #262).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why "sub" ? I was naively expecting "add".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, and this begs the question: why didn't it get caught in testing? With the current design, it's very unlikely to cause problems because the pointer is (almost) never read. I need to brainstorm with you about this.

@bobot
Copy link
Contributor

bobot commented Dec 1, 2015

I'm adding comments during the review that you could perhaps take if they are interesting. You can find them in my fork branch trunk-gc-patches: eg bobot/ocaml@2023b60

@bobot
Copy link
Contributor

bobot commented Dec 1, 2015

The travis build failed https://travis-ci.org/bobot/ocaml/builds/94135050, perhaps because HAS_HUGE_PAGES is activated there.

@damiendoligez
Copy link
Member Author

I've made changes according to the remarks of the reviewers, and fixed a nasty bug in the root scanning (triggered by testing with flambda).

Following a discussion with @xavierleroy I'm going to remove the change to the protocol for calling the GC.

@damiendoligez damiendoligez force-pushed the trunk-gc-patches branch 2 times, most recently from 6baff7f to 80953ea Compare December 4, 2015 13:51
@damiendoligez
Copy link
Member Author

@bobot could you please review this version with particular attention to asmrun/roots.c? It's quite tricky.

asmrun/roots.c Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not a strange use of for loop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is to have an exact copy of the loops found in caml_do_roots. Maybe I should add a comment to this effect.

@bobot
Copy link
Contributor

bobot commented Dec 10, 2015

I tried to review this code in asmrun/roots.c and it is indeed very tricky. Instead of three for loops and three complicated while loops could we use non-natural for loops that feel a lot more natural:
df2d87c

intnat caml_darken_all_roots_slice (intnat work)
{
  mlsize_t i, j;
  value * glob;
  intnat work_done = 0;
  CAML_INSTR_SETUP (tmr, "");

  if(incr_roots_to_resume){
    goto resume;
  }
  /* otherwise we just start darkening */
  incr_roots_count = 0;

  /* The global roots */
  for (i = 0; caml_globals[i] != 0; i++) {
    for(glob = caml_globals[i]; *glob != 0; glob++) {
      for (j = 0; j < Wosize_val(*glob); j++){
        caml_darken (Field (*glob, j), &Field (*glob, j));
        ++work_done;
        if(work <= work_done){
          /** save the state of the loop */
          incr_roots_i = i;
          incr_roots_glob = glob;
          incr_roots_j = j;
          incr_roots_count += work_done;
          incr_roots_to_resume = 1;
          goto finished;
        resume:
          i = incr_roots_i;
          glob = incr_roots_glob;
          j = incr_roots_j;
        }
      }
    }
  }

  /** We finished before the maximum amount of work is reached, so the
      darkening of all roots is done */
  caml_incremental_roots_count = incr_roots_count + work_done;
  incr_roots_to_resume = 0; /** not needed since set during start but clearer */

finished:
  /** In every case */
  CAML_INSTR_TIME (tmr, "major/mark/global_roots_slice");
  return work_done;
}

Do you have a test for the fix in def7f4d ?

EDITED: There was a bug because I assigned -1 to a mlsize_t. I don't understand why caml_do_roots define int i,j; instead of mlsize_t i,j;. Now all the tests pass except the same than without this commit (tests/warnings/deprecated_module*.ml)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

caml_stat_heap_size have been redefined as #define caml_stat_heap_size Bsize_wsize(caml_stat_heap_wsz). But I don't understand how the compatibility function can use Bsize_wsize and also the code above which becomes:

  && Bsize_wsize (Bsize_wsize(caml_stat_heap_wsz)) <= HUGE_PAGE_SIZE)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bobot
Copy link
Contributor

bobot commented Dec 10, 2015

For asmrun/roots.c I'm confident in the simplified version in my fork branch trunk-gc-patches.

@damiendoligez damiendoligez self-assigned this Dec 15, 2015
damiendoligez added a commit that referenced this pull request Dec 21, 2015
@damiendoligez damiendoligez merged commit 423ddc8 into ocaml:trunk Dec 21, 2015
@mrvn
Copy link
Contributor

mrvn commented Mar 24, 2016

  1. If no naked pointer are allowed then where is the helper function to allocate a chunk of memory outside the ocaml heap with a suitable black colored header? Something like

CAMLextern void * caml_alloc_unmovable(mlsize_t); /* allocate a block of memory outside the ocaml heap suitable for naked C/C++ objects /
CAMLextern void caml_free_unmovale(void *); /
free block allocated by caml_alloc_unmovable */

  1. I frequently have C objects that also need to hold some values. At the moment one has to allocate an ocaml block to hold the values and malloc for the C object. Without naked pointers now one further needs a second ocaml block with No_scan_tag (or the alloc from 1). Instead couldn't we have a single block outside the ocaml heap containing the values at the start and the C structure at the end where the Gc scans only the begining?

/* allocate a block of memory outside the ocaml heap with mixed content

  • Allocate a block with tag containing ocaml values that
  • will be scanned by the Gc if reachable, followed by <no_scan> bytes of
  • of memory that will be ignored by the Gc. The block will never be moved
  • by the Gc and the no_scan portion can be accessed without holding the
  • runtime lock. The values portion must only be accessed while holding the
  • runtime lock.
    */
    CAMLextern void * caml_alloc_unmovable(mlsize_t values, tag_t tag, mlsize_t no_scan);
  1. Alternatively maybe the 'struct custom_operations' could be extended to include a field saying how many words to scan in the custom block. That way one could allocate single blocks containing both ocaml values and raw data and still have them movable by the Gc.

@damiendoligez
Copy link
Member Author

I think (1) is a good idea. cc @mshinwell

For (2): if it's outside the heap and points into the heap, then it has to be a root. It would probably not reasonable to have too many such blocks. We don't currently have a data structure to register blocks of roots, so this option would need big changes to globroots.c.

(3) looks like a good idea, but we have to be careful to remain compatible with C code that uses the current version.

@bobot
Copy link
Contributor

bobot commented Mar 24, 2016

An alternative version of 3) is that a custom_operations contains a function pointer that will iterate on the value in the custom block. It is more general and could allow lots more other uses.

@let-def
Copy link
Contributor

let-def commented Mar 24, 2016

Hijacking a bit the conversation, but related to C-FFI and GC:

When dealing with C-FFI needing a back-reference to OCaml values, I felt that a special area in the OCaml heap that is guaranteed to be non-compacted (and as such never moved) would help a lot.
This area would behave exactly like the major heap (for collection), but without compaction. Values would be allocated via a specific function.

The recent PR 7162 reminded me of this. Exchanging the idea with @garrigue, he thought this could globally benefit LablGTK (not having to disable compaction for the whole program).
@yminsky also suggested this could enable working on some OCaml values from C-FFI without locking the runtime (e.g. having C-like arrays / bigstrings in the heap).

@damiendoligez do you think this is reasonable?

@damiendoligez
Copy link
Member Author

@bobot I've thought of that but won't it be too slow?

@def-lkb It looks doable although it'll need a fair amount of work. Could you open a feature request on Mantis?

@jhjourdan
Copy link
Contributor

@damiendoligez Instead of a general function, we could have a bitset describing which words are pointers and which are raw data.

@mrvn
Copy link
Contributor

mrvn commented Mar 25, 2016

@damiendoligez For 2) allocate a chunk of memory, put a block with N values at the start, add one value pointing at itself and register it as root, remaining space is for C use. The free function would unregister the root and free the block. It's what everyone already does now except that the allocations get merged into one block and the registering is automatic.

@def-lkb I'm not sure a special unmovable heap would be that beneficial. You could only use it from C and with values a module allocates itself because the type system doesn't know about unmovables. And you still have to register a root to keep the value alive. You might as well just allocate the object fully outside the ocaml heap (there should be helper functions for this creating the block header). With the no-naked-pointer change I think that would do exactly what you want without ocaml having a special unmovable heap. Secondly accessing such a block without the runtime lock would only work if all values in the block only point to other unmovable objects, which basically rules out mutable fields. You could only do that in a verry controlled structure that you set up with a deep copy of all fields.

@mrvn
Copy link
Contributor

mrvn commented Mar 25, 2016

@damiendoligez

It would probably not reasonable to have too many such blocks.

Is 10-1000 appearing and disapearing per second too many? In my QT5 bindings if you do catch for example mouse events and then react to them by calling Qt functions that return anything but primitive types you get a ton of allocationd and eventual garbage collections verry quickyly. At the moment I register them as generational global root and the referenced objects generally never make it out of the minor heap. Hopefully that means the roots get unregistered before a major collection and their numbers doesn't matter. Right?

@bobot
Copy link
Contributor

bobot commented Mar 25, 2016

@bobot I've thought of that but won't it be too slow?
I think only experimenting could say, indirect calls are predicted by cpu. And if caml_darken (or caml_slice_darken) is exported in some way, its call during the iteration will be direct. I will open a mantis feature request.

@DemiMarie
Copy link
Contributor

One feature I would like to see is "mixed blocks": blocks that are in the ordinary OCaml heap, but have only part of themselves scanned by the GC. Applications include:

  • Unboxing of int32, int64, nativeint, double when in records/tuples that also contain other OCaml values
  • Resizable arrays (only the part that is in use needs to be scanned)
  • Storing C-allocated data (such as a callback pointer) in an OCaml block without requiring multiple allocations

@mrvn Accessing unboxed data will still work if you know (through external invariants) that no OCaml thread is accessing it. Applications include zero-copy writing of data to disk or the network.

@damiendoligez
Copy link
Member Author

@mrvn
It's not a question of how many appear and disappear, but how many are present at any time.
Anyway, your use case looks reasonable.

@damiendoligez damiendoligez deleted the trunk-gc-patches branch April 4, 2016 13:50
@alainfrisch
Copy link
Contributor

tbrk added a commit to inria-parkas/sundialsml that referenced this pull request Jul 22, 2016
Ensure compatibility with the new OCaml -no-naked-pointers option. See, for
example, ocaml/ocaml#297 (comment).
tbrk added a commit to inria-parkas/sundialsml that referenced this pull request Jul 22, 2016
Follow the rules set out in
  ocaml/ocaml#297 (comment)
so that backrefs become compatible with the -no-naked-pointers option.
mshinwell added a commit to mshinwell/ocaml that referenced this pull request Jan 11, 2021
chambart pushed a commit to chambart/ocaml-1 that referenced this pull request Jan 4, 2022
* Emit roundsd instruction (amd64)

* Add comment about the meaning of different bits of rounding mode

* Handle "current" rounding mode for [roundsd]
EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
* Fix vertical scroll on TOC navbar

* Remove border on TOC

Co-authored-by: Mirza Babar <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.