Skip to content

Do not cache young_limit in a processor register#9876

Merged
nojb merged 4 commits intoocaml:trunkfrom
xavierleroy:young-limit-not-in-register
Dec 31, 2020
Merged

Do not cache young_limit in a processor register#9876
nojb merged 4 commits intoocaml:trunkfrom
xavierleroy:young-limit-not-in-register

Conversation

@xavierleroy
Copy link
Contributor

Note: this PR is intended for benchmarking the impact of the proposed change. It is not to be merged immediately.

On target architectures with 32 or more registers, a register was used to cache the value of the young_limit field of the domain state. This reduced the size and execution time of the code for inlined allocations.

However, this usage is problematic with respect to polling for signals and to inter-domain communication in Multicore OCaml, because it is often impossible to change the value of the register when we change young_limit. So, the change to young_limit doesn't take effect immediately, only when the register is reloaded from young_limit.

This PR removes the caching of young_limit in a register from the ARM64, PowerPC and RISC-V ports.

The first commit 66fbe29 turns the caching off.

The second commit e7a2f15 recycles the register previously used as the cache as an allocatable register.

ARM64 and PowerPC were tested. RISC-V could not be tested.

Cc: @kayceesrk @ctk21 @sadiqj

@kayceesrk
Copy link
Contributor

Thanks @xavierleroy. @avsm can you provision an ARM64 and PowerPC machine for @shakthimaan, who can run Sandmark on this.

@xavierleroy xavierleroy force-pushed the young-limit-not-in-register branch from fdb04c5 to 5a852d4 Compare September 6, 2020 16:44
@avsm
Copy link
Member

avsm commented Sep 15, 2020

I've provisioned an ARM64 machine now for @shaktimaan, but we've had a hard drive failure on one of our two PowerPC boxes (prothoe), so it's a few days to get a replacement for that before we can run Sandmark on it. I might need to borrow the one running the Inria Jenkins CI @xavierleroy (pisto) next week if the drive replacement takes too long.

@xavierleroy
Copy link
Contributor Author

No problem taking pisto off the CI so that it can be used for benchmarking. Just let me know.

@shubhamkumar13
Copy link
Contributor

The following results are from the arm64 machine (this was run against 2bb2bde)

name runtime change cached uncached minor collections change (cached vs uncached)
35 quicksort.4000000 6.15% 9.37238 9.98706 0 (0.0%)
1 bdd.26 4.86% 24.0601 25.2893 2 (0.04%)
27 menhir.ocamly 2.4% 927.577 950.364 0 (0.0%)
20 knucleotide. 2.39% 178.394 182.768 0 (0.0%)
14 fft. 1.48% 24.2063 24.569 363 (35.38%)
2 binarytrees5.21 0.31% 44.5407 44.6772 0 (0.0%)
10 fannkuchredux.12 0.19% 266.758 267.267 0 (0.0%)
41 test_decompress.64_524_288 0.13% 18.2821 18.3062 0 (0.0%)
33 pidigits5.10_000 0.13% 24.9492 24.9813 0 (0.0%)
11 fannkuchredux2.12 0.07% 256.083 256.266 0 (0.0%)
4 cpdf.scale 0.05% 45.1205 45.1428 0 (0.0%)
26 matrix_multiplication.1024 0.04% 128.08 128.132 0 (0.0%)
23 lexifi-g2pp. 0.03% 81.7732 81.7972 0 (0.0%)
25 mandelbrot6.16_000 0.01% 114.818 114.835 0 (0.0%)
19 kb_no_exc. 0.01% 8.08046 8.08137 0 (0.0%)
32 nbody.50_000_000 0.0% 55.4536 55.4553 0 (0.0%)
13 fasta6.25_000_000 -0.01% 10.8102 10.8092 0 (0.0%)
36 regexredux2. -0.01% 58.1703 58.164 0 (0.0%)
42 yojson_ydump.sample.json -0.03% 2.36695 2.36615 0 (0.0%)
37 revcomp2. -0.03% 15.8918 15.8871 0 (0.0%)
43 zarith_pi.5000 -0.08% 5.29911 5.29506 0 (0.0%)
7 cubicle.szymanski_at.cub -0.15% 1992.3 1989.32 0 (0.0%)
17 grammatrix. -0.27% 488.454 487.136 0 (0.0%)
39 setrip.-enc_-rseed_1067894368 -0.28% 4.36783 4.35565 0 (0.0%)
3 cpdf.blacktext -0.42% 14.6691 14.6085 0 (0.0%)
16 game_of_life.256 -0.44% 38.7667 38.5975 0 (0.0%)
24 lu-decomposition. -0.49% 11.8898 11.8316 0 (0.0%)
5 cpdf.squeeze -0.53% 53.8528 53.5708 0 (0.0%)
21 knucleotide3. -0.67% 214.953 213.52 0 (0.0%)
30 minilight.roomfront -0.79% 107.066 106.231 0 (0.0%)
6 cubicle.german_pfs.cub -0.99% 752.228 744.859 0 (0.0%)
12 fasta3.25_000_000 -1.1% 19.8219 19.6056 0 (0.0%)
38 sequence_cps.10000 -1.2% 13.1027 12.9475 0 (0.0%)
18 kb. -1.3% 10.0976 9.96795 0 (0.0%)
22 levinson-durbin. -1.57% 17.4087 17.1397 0 (0.0%)
40 spectralnorm2.5_500 -1.96% 62.8052 61.5959 0 (0.0%)
28 menhir.sql-parser -2.03% 28.0578 27.4992 0 (0.0%)
8 durand-kerner-aberth. -2.09% 0.977001 0.957 0 (0.0%)
9 evolutionary_algorithm.10000_10000 -2.36% 249.876 244.11 4 (0.03%)
31 naive-multilayer. -2.55% 26.6982 26.0333 0 (0.0%)
15 floyd_warshall.512 -3.05% 21.0906 20.4672 0 (0.0%)
0 LU_decomposition.1024 -3.43% 26.7818 25.8929 0 (0.0%)
34 qr-decomposition. -3.8% 15.5745 15.0048 0 (0.0%)
29 menhir.sysver -4.03% 302.358 290.635 0 (0.0%)

Normalized runtime graph for younglimit (vs 2bb2bde)

normalised_younglimit_vs_2bb2bde

Normalized minor collections graph for younlimit (vs 2bb2bde)

normalised_minor_gc_younglimit_vs_2bb2bde

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shubhamkumar13 for the report! It is very strange that fft has more minor collections after this change. This suggests that the value of young_limit is not the same before and after the change, at a point where we decide whether to do a GC. This sounds like a bug.

| Some label -> call_gc_label := label
end;
let offset = Domainstate.(idx_of_field Domain_young_limit) * 8 in
` {emit_string lg} 0, {emit_int offset}(30)\n`;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we loading from 30 and not reg_domain_state_ptr here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This backend uses "literal" registers everywhere, I guess @xavierleroy preferred to make the diff as small as possible?

@sadiqj
Copy link
Contributor

sadiqj commented Oct 28, 2020

Thanks @shubhamkumar13 for the report! It is very strange that fft has more minor collections after this change. This suggests that the value of young_limit is not the same before and after the change, at a point where we decide whether to do a GC. This sounds like a bug.

One theory is that this is caused by the remembered set threshold being crossed which results in a minor gc being requested. There's a comment in signals.c on this:

ocaml/runtime/signals.c

Lines 113 to 116 in 86c8a98

/* When this function is called without [caml_c_call] (e.g., in
[caml_modify]), this is only moderately effective on ports that cache
[Caml_state->young_limit] in a register, so it may take a while before the
register is reloaded from [Caml_state->young_limit]. */

@shubhamkumar13 is still investigating but we thought we'd post the numbers in case anyone had any other theories.

@gasche
Copy link
Member

gasche commented Oct 28, 2020

TL;DR: I understand what scenario you are referring to, it sounds plausible, and it would be reassuring as in this case both behaviors (before and after the patch) are correct. Thansk for the hint!

I was thinking that it would be a worrying bug if the register value was "earlier" than the stored value (it could lead to allocations overwriting existing values), but it may actually be not-dramatic in the case, when the stored value was moved to the end of the region due to pending actions. If this is indeed what is happening, the previous runtime would sometimes delay "pending events" due to this phenomenon, and the new behavior is arguably better than the previous one. (In some cases delaying pending events could be considered a bug.)

The specific "pending event" that you are referring to is the crossing of the "threshold" of the remembered set table when called from caml_modify (as suggested by the comment on caml_set_pending_action you referred to); the minor-gc dynamic tables will request a minor GC when they cross a certain "threshold" before they have to be dynamically resized. In this specific case there is no issue with delaying a pending event, we were merely ignoring a GC speed heuristic. If this is really the only use-case in which caml_set_pending_event is called at a place that does not properly restore the young-limit register afterwards, then there was nothing to worry about with the old code. (This is what the comment suggests.)

@xavierleroy
Copy link
Contributor Author

Thanks for the numbers. There's something I don't understand in your graph, and it's not the first time I'm confused.

The raw figures indicate a change in execution time between +6.15% and -4.03%, meaning, I believe, that the ratio run-time(uncached) / run-time(cached) is between 1.0615 and 0.9597.

Yet the graph depicts values between 0.94 and 1.04. What are these values depicted on the graph?

@sadiqj
Copy link
Contributor

sadiqj commented Nov 5, 2020

Yet the graph depicts values between 0.94 and 1.04. What are these values depicted on the graph?

Just had a look at this. I think the runtime values have the wrong sign. It should be -6.15% to 4.03% for the runtime. Those then line up correctly with the values on the graph.

@gasche
Copy link
Member

gasche commented Nov 5, 2020

I believe that the "runtime change" column should be read as the speedup provided by caching (and not a slowdown resulting from uncaching). For quicksort for example, the uncached/cached ratio is 1.0656, which suggests that uncached has a 6.56% slowdown, instead of the 6.15% reported in the same column. On the other hand, the cached/uncached ratio (the speedup of caching) is 0.938, which lines up perfectly with the data in the graph and suggests that caching provides a 6.15% performance boost (1 - 0.938, etc.) on this benchmark.

@shubhamkumar13
Copy link
Contributor

shubhamkumar13 commented Nov 5, 2020

I believe that the "runtime change" column should be read as the speedup provided by caching (and not a slowdown resulting from uncaching)

Yeah sorry about that, it was a representation mistake from my part. I could have named it "speedup" instead of "runtime change"

@sadiqj sadiqj mentioned this pull request Nov 22, 2020
@xavierleroy
Copy link
Contributor Author

I made some measurements using perf stat on ARM64 (Raspberry Pi 4), focusing on benchmarks that allocate a lot (e.g. KB).

Recall that the main impact of the change is to add one "load" per inlined allocation site. The secondary impact is to free one integer register for register allocation. This causes the following changes:

The number of instructions executed increases by 0.5% to 0.9% (depending on the benchmark).
The number of "cache references" (not sure what this means in perf: all memory accesses?) increases by 1.1% to 2.4%.
The static code size of ocamlc.opt increases by 0.7%.

To me this suggests that the slowdown directly caused by this change is on the order of 1% for allocation-intensive programs. Of course, code placement differs, so there are other effects on total running time, but not directly attributable to the change.

On target architectures with 32 or more registers,
a register was used to cache the value of the young_limit field
of the domain state.  This reduced the size and execution time
of the code for inlined allocations.

However, this usage is problematic with respect to polling for signals
and to inter-domain communication in Multicore OCaml, because it is
often not possible to change the value of the register when we change
young_limit.  So, the change to young_limit doesn't take effect
immediately, only when the register is reloaded from young_limit.

This commit removes the caching of young_limit in a register from the
ARM64, PowerPC and RISC-V ports.
@xavierleroy xavierleroy force-pushed the young-limit-not-in-register branch 2 times, most recently from 88567e1 to 073bd67 Compare December 10, 2020 18:44
…e register

Now that we have a unused callee-save register on ARM64, PowerPC, and RISC-V,
make it available for register allocation.

Assorted cleanups in runtime/*.S and in asmcomp/*/proc.ml
@xavierleroy xavierleroy force-pushed the young-limit-not-in-register branch from 073bd67 to f728a5a Compare December 11, 2020 08:51
@xavierleroy xavierleroy marked this pull request as ready for review December 11, 2020 09:46
@xavierleroy
Copy link
Contributor Author

At the latest developers meeting, it was decided to go on with this removal of the "young limit" register in the ARM64, PowerPC and Risc-V ports. I rebased the code, integrating the Risc-V changes that occurred recently on trunk. The test suite passes on all architectures (incl. Risc-V). This is now ready for review.

@xavierleroy
Copy link
Contributor Author

To repeat myself: this PR is READY FOR REVIEW. Thank you.

Copy link
Contributor

@nojb nojb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! The new code is both simpler and more uniform.

I have reviewed the patch and looks correct to me. I left a number of comments/questions/etc but they are just to confirm my own understanding. Having said that, I think it is somewhat unlikely that one would be able to spot a bug by reading the code, unless it was fairly obvious. But am approving on the basis that the CI passes on each of the affected architectures (I consider this to be a pretty good test for this kinds of changes).

| Some label -> call_gc_label := label
end;
let offset = Domainstate.(idx_of_field Domain_young_limit) * 8 in
` {emit_string lg} 0, {emit_int offset}(30)\n`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This backend uses "literal" registers everywhere, I guess @xavierleroy preferred to make the diff as small as possible?

There are only 7 callee-save integer registers (x19 to x25), not 10.
@nojb
Copy link
Contributor

nojb commented Dec 31, 2020

Thanks for the nice PR! I'm happy with the current state and I think it is ready to be merged. Given that it is new year's eve, perhaps we should wait a few days to give time to any stragglers to chime in?

@gasche
Copy link
Member

gasche commented Dec 31, 2020

(No one else has written in those few months to say that they wanted to look at this PR, so I think it is reasonable to assume that no stragglers have plans on this.)

@nojb nojb merged commit 9e25043 into ocaml:trunk Dec 31, 2020
@nojb
Copy link
Contributor

nojb commented Dec 31, 2020

Merged, thanks!

@xavierleroy xavierleroy deleted the young-limit-not-in-register branch January 9, 2021 17:12
dbuenzli pushed a commit to dbuenzli/ocaml that referenced this pull request Mar 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants