Do not cache young_limit in a processor register by xavierleroy · Pull Request #9876 · ocaml/ocaml

xavierleroy · 2020-09-04T09:47:26Z

Note: this PR is intended for benchmarking the impact of the proposed change. It is not to be merged immediately.

On target architectures with 32 or more registers, a register was used to cache the value of the young_limit field of the domain state. This reduced the size and execution time of the code for inlined allocations.

However, this usage is problematic with respect to polling for signals and to inter-domain communication in Multicore OCaml, because it is often impossible to change the value of the register when we change young_limit. So, the change to young_limit doesn't take effect immediately, only when the register is reloaded from young_limit.

This PR removes the caching of young_limit in a register from the ARM64, PowerPC and RISC-V ports.

The first commit 66fbe29 turns the caching off.

The second commit e7a2f15 recycles the register previously used as the cache as an allocatable register.

ARM64 and PowerPC were tested. RISC-V could not be tested.

Cc: @kayceesrk @ctk21 @sadiqj

kayceesrk · 2020-09-05T04:37:21Z

Thanks @xavierleroy. @avsm can you provision an ARM64 and PowerPC machine for @shakthimaan, who can run Sandmark on this.

avsm · 2020-09-15T08:31:37Z

I've provisioned an ARM64 machine now for @shaktimaan, but we've had a hard drive failure on one of our two PowerPC boxes (prothoe), so it's a few days to get a replacement for that before we can run Sandmark on it. I might need to borrow the one running the Inria Jenkins CI @xavierleroy (pisto) next week if the drive replacement takes too long.

xavierleroy · 2020-09-15T13:17:18Z

No problem taking pisto off the CI so that it can be used for benchmarking. Just let me know.

shubhamkumar13 · 2020-10-28T09:23:53Z

The following results are from the arm64 machine (this was run against 2bb2bde)

	name	runtime change	cached	uncached	minor collections change (cached vs uncached)
35	quicksort.4000000	6.15%	9.37238	9.98706	0 (0.0%)
1	bdd.26	4.86%	24.0601	25.2893	2 (0.04%)
27	menhir.ocamly	2.4%	927.577	950.364	0 (0.0%)
20	knucleotide.	2.39%	178.394	182.768	0 (0.0%)
14	fft.	1.48%	24.2063	24.569	363 (35.38%)
2	binarytrees5.21	0.31%	44.5407	44.6772	0 (0.0%)
10	fannkuchredux.12	0.19%	266.758	267.267	0 (0.0%)
41	test_decompress.64_524_288	0.13%	18.2821	18.3062	0 (0.0%)
33	pidigits5.10_000	0.13%	24.9492	24.9813	0 (0.0%)
11	fannkuchredux2.12	0.07%	256.083	256.266	0 (0.0%)
4	cpdf.scale	0.05%	45.1205	45.1428	0 (0.0%)
26	matrix_multiplication.1024	0.04%	128.08	128.132	0 (0.0%)
23	lexifi-g2pp.	0.03%	81.7732	81.7972	0 (0.0%)
25	mandelbrot6.16_000	0.01%	114.818	114.835	0 (0.0%)
19	kb_no_exc.	0.01%	8.08046	8.08137	0 (0.0%)
32	nbody.50_000_000	0.0%	55.4536	55.4553	0 (0.0%)
13	fasta6.25_000_000	-0.01%	10.8102	10.8092	0 (0.0%)
36	regexredux2.	-0.01%	58.1703	58.164	0 (0.0%)
42	yojson_ydump.sample.json	-0.03%	2.36695	2.36615	0 (0.0%)
37	revcomp2.	-0.03%	15.8918	15.8871	0 (0.0%)
43	zarith_pi.5000	-0.08%	5.29911	5.29506	0 (0.0%)
7	cubicle.szymanski_at.cub	-0.15%	1992.3	1989.32	0 (0.0%)
17	grammatrix.	-0.27%	488.454	487.136	0 (0.0%)
39	setrip.-enc_-rseed_1067894368	-0.28%	4.36783	4.35565	0 (0.0%)
3	cpdf.blacktext	-0.42%	14.6691	14.6085	0 (0.0%)
16	game_of_life.256	-0.44%	38.7667	38.5975	0 (0.0%)
24	lu-decomposition.	-0.49%	11.8898	11.8316	0 (0.0%)
5	cpdf.squeeze	-0.53%	53.8528	53.5708	0 (0.0%)
21	knucleotide3.	-0.67%	214.953	213.52	0 (0.0%)
30	minilight.roomfront	-0.79%	107.066	106.231	0 (0.0%)
6	cubicle.german_pfs.cub	-0.99%	752.228	744.859	0 (0.0%)
12	fasta3.25_000_000	-1.1%	19.8219	19.6056	0 (0.0%)
38	sequence_cps.10000	-1.2%	13.1027	12.9475	0 (0.0%)
18	kb.	-1.3%	10.0976	9.96795	0 (0.0%)
22	levinson-durbin.	-1.57%	17.4087	17.1397	0 (0.0%)
40	spectralnorm2.5_500	-1.96%	62.8052	61.5959	0 (0.0%)
28	menhir.sql-parser	-2.03%	28.0578	27.4992	0 (0.0%)
8	durand-kerner-aberth.	-2.09%	0.977001	0.957	0 (0.0%)
9	evolutionary_algorithm.10000_10000	-2.36%	249.876	244.11	4 (0.03%)
31	naive-multilayer.	-2.55%	26.6982	26.0333	0 (0.0%)
15	floyd_warshall.512	-3.05%	21.0906	20.4672	0 (0.0%)
0	LU_decomposition.1024	-3.43%	26.7818	25.8929	0 (0.0%)
34	qr-decomposition.	-3.8%	15.5745	15.0048	0 (0.0%)
29	menhir.sysver	-4.03%	302.358	290.635	0 (0.0%)

Normalized runtime graph for younglimit (vs 2bb2bde)

Normalized minor collections graph for younlimit (vs 2bb2bde)

gasche

Thanks @shubhamkumar13 for the report! It is very strange that fft has more minor collections after this change. This suggests that the value of young_limit is not the same before and after the change, at a point where we decide whether to do a GC. This sounds like a bug.

gasche · 2020-10-28T09:43:41Z

asmcomp/power/emit.mlp

+    | Some label -> call_gc_label := label
+  end;
+  let offset = Domainstate.(idx_of_field Domain_young_limit) * 8 in
+  `	{emit_string lg}	0, {emit_int offset}(30)\n`;


why are we loading from 30 and not reg_domain_state_ptr here?

This backend uses "literal" registers everywhere, I guess @xavierleroy preferred to make the diff as small as possible?

sadiqj · 2020-10-28T10:21:42Z

Thanks @shubhamkumar13 for the report! It is very strange that fft has more minor collections after this change. This suggests that the value of young_limit is not the same before and after the change, at a point where we decide whether to do a GC. This sounds like a bug.

One theory is that this is caused by the remembered set threshold being crossed which results in a minor gc being requested. There's a comment in signals.c on this:

ocaml/runtime/signals.c

Lines 113 to 116 in 86c8a98

    
             /* When this function is called without [caml_c_call] (e.g., in 
        
                [caml_modify]), this is only moderately effective on ports that cache 
        
                [Caml_state->young_limit] in a register, so it may take a while before the 
        
                register is reloaded from [Caml_state->young_limit]. */

@shubhamkumar13 is still investigating but we thought we'd post the numbers in case anyone had any other theories.

gasche · 2020-10-28T10:49:23Z

TL;DR: I understand what scenario you are referring to, it sounds plausible, and it would be reassuring as in this case both behaviors (before and after the patch) are correct. Thansk for the hint!

I was thinking that it would be a worrying bug if the register value was "earlier" than the stored value (it could lead to allocations overwriting existing values), but it may actually be not-dramatic in the case, when the stored value was moved to the end of the region due to pending actions. If this is indeed what is happening, the previous runtime would sometimes delay "pending events" due to this phenomenon, and the new behavior is arguably better than the previous one. (In some cases delaying pending events could be considered a bug.)

The specific "pending event" that you are referring to is the crossing of the "threshold" of the remembered set table when called from caml_modify (as suggested by the comment on caml_set_pending_action you referred to); the minor-gc dynamic tables will request a minor GC when they cross a certain "threshold" before they have to be dynamically resized. In this specific case there is no issue with delaying a pending event, we were merely ignoring a GC speed heuristic. If this is really the only use-case in which caml_set_pending_event is called at a place that does not properly restore the young-limit register afterwards, then there was nothing to worry about with the old code. (This is what the comment suggests.)

xavierleroy · 2020-11-02T09:17:27Z

Thanks for the numbers. There's something I don't understand in your graph, and it's not the first time I'm confused.

The raw figures indicate a change in execution time between +6.15% and -4.03%, meaning, I believe, that the ratio run-time(uncached) / run-time(cached) is between 1.0615 and 0.9597.

Yet the graph depicts values between 0.94 and 1.04. What are these values depicted on the graph?

sadiqj · 2020-11-05T11:08:59Z

Yet the graph depicts values between 0.94 and 1.04. What are these values depicted on the graph?

Just had a look at this. I think the runtime values have the wrong sign. It should be -6.15% to 4.03% for the runtime. Those then line up correctly with the values on the graph.

gasche · 2020-11-05T13:01:59Z

I believe that the "runtime change" column should be read as the speedup provided by caching (and not a slowdown resulting from uncaching). For quicksort for example, the uncached/cached ratio is 1.0656, which suggests that uncached has a 6.56% slowdown, instead of the 6.15% reported in the same column. On the other hand, the cached/uncached ratio (the speedup of caching) is 0.938, which lines up perfectly with the data in the graph and suggests that caching provides a 6.15% performance boost (1 - 0.938, etc.) on this benchmark.

shubhamkumar13 · 2020-11-05T13:13:25Z

I believe that the "runtime change" column should be read as the speedup provided by caching (and not a slowdown resulting from uncaching)

Yeah sorry about that, it was a representation mistake from my part. I could have named it "speedup" instead of "runtime change"

xavierleroy · 2020-11-25T16:40:53Z

I made some measurements using perf stat on ARM64 (Raspberry Pi 4), focusing on benchmarks that allocate a lot (e.g. KB).

Recall that the main impact of the change is to add one "load" per inlined allocation site. The secondary impact is to free one integer register for register allocation. This causes the following changes:

The number of instructions executed increases by 0.5% to 0.9% (depending on the benchmark).
The number of "cache references" (not sure what this means in perf: all memory accesses?) increases by 1.1% to 2.4%.
The static code size of ocamlc.opt increases by 0.7%.

To me this suggests that the slowdown directly caused by this change is on the order of 1% for allocation-intensive programs. Of course, code placement differs, so there are other effects on total running time, but not directly attributable to the change.

On target architectures with 32 or more registers, a register was used to cache the value of the young_limit field of the domain state. This reduced the size and execution time of the code for inlined allocations. However, this usage is problematic with respect to polling for signals and to inter-domain communication in Multicore OCaml, because it is often not possible to change the value of the register when we change young_limit. So, the change to young_limit doesn't take effect immediately, only when the register is reloaded from young_limit. This commit removes the caching of young_limit in a register from the ARM64, PowerPC and RISC-V ports.

…e register Now that we have a unused callee-save register on ARM64, PowerPC, and RISC-V, make it available for register allocation. Assorted cleanups in runtime/*.S and in asmcomp/*/proc.ml

xavierleroy · 2020-12-11T09:49:13Z

At the latest developers meeting, it was decided to go on with this removal of the "young limit" register in the ARM64, PowerPC and Risc-V ports. I rebased the code, integrating the Risc-V changes that occurred recently on trunk. The test suite passes on all architectures (incl. Risc-V). This is now ready for review.

xavierleroy · 2020-12-22T18:32:48Z

To repeat myself: this PR is READY FOR REVIEW. Thank you.

nojb

Thanks! The new code is both simpler and more uniform.

I have reviewed the patch and looks correct to me. I left a number of comments/questions/etc but they are just to confirm my own understanding. Having said that, I think it is somewhat unlikely that one would be able to spot a bug by reading the code, unless it was fairly obvious. But am approving on the basis that the CI passes on each of the affected architectures (I consider this to be a pretty good test for this kinds of changes).

nojb · 2020-12-23T03:29:42Z

asmcomp/power/emit.mlp

+    | Some label -> call_gc_label := label
+  end;
+  let offset = Domainstate.(idx_of_field Domain_young_limit) * 8 in
+  `	{emit_string lg}	0, {emit_int offset}(30)\n`;


This backend uses "literal" registers everywhere, I guess @xavierleroy preferred to make the diff as small as possible?

asmcomp/riscv/proc.ml

asmcomp/arm64/proc.ml

runtime/riscv.S

There are only 7 callee-save integer registers (x19 to x25), not 10.

nojb · 2020-12-31T08:26:33Z

Thanks for the nice PR! I'm happy with the current state and I think it is ready to be merged. Given that it is new year's eve, perhaps we should wait a few days to give time to any stragglers to chime in?

gasche · 2020-12-31T08:48:05Z

(No one else has written in those few months to say that they wanted to look at this PR, so I think it is reasonable to assume that no stragglers have plans on this.)

nojb · 2020-12-31T09:12:16Z

Merged, thanks!

xavierleroy force-pushed the young-limit-not-in-register branch from fdb04c5 to 5a852d4 Compare September 6, 2020 16:44

sadiqj mentioned this pull request Oct 9, 2020

Draft Safepoints PR sadiqj/ocaml#2

Closed

gasche reviewed Oct 28, 2020

View reviewed changes

sadiqj mentioned this pull request Nov 22, 2020

Safepoints #10039

Merged

xavierleroy force-pushed the young-limit-not-in-register branch 2 times, most recently from 88567e1 to 073bd67 Compare December 10, 2020 18:44

Recycle the former "young limit" register, giving one more allocatabl…

f728a5a

…e register Now that we have a unused callee-save register on ARM64, PowerPC, and RISC-V, make it available for register allocation. Assorted cleanups in runtime/*.S and in asmcomp/*/proc.ml

xavierleroy force-pushed the young-limit-not-in-register branch from 073bd67 to f728a5a Compare December 11, 2020 08:51

xavierleroy marked this pull request as ready for review December 11, 2020 09:46

nojb approved these changes Dec 23, 2020

View reviewed changes

xavierleroy added 2 commits December 27, 2020 15:04

ARM64: wrong register pressure limits for Iextcall

12b62da

There are only 7 callee-save integer registers (x19 to x25), not 10.

Changes entry for ocaml#9876

2d9afd0

xavierleroy mentioned this pull request Dec 27, 2020

Masking of asynchronous callbacks #8961

Open

nojb merged commit 9e25043 into ocaml:trunk Dec 31, 2020

xavierleroy deleted the young-limit-not-in-register branch January 9, 2021 17:12

dbuenzli pushed a commit to dbuenzli/ocaml that referenced this pull request Mar 25, 2021

Changes entry for ocaml#9876

1b705ea

gretay-js mentioned this pull request Sep 3, 2021

Port #9876: do not cache the young_limit GC variable in a processor register. oxcaml/oxcaml#207

Closed

Conversation

xavierleroy commented Sep 4, 2020

Uh oh!

kayceesrk commented Sep 5, 2020

Uh oh!

avsm commented Sep 15, 2020

Uh oh!

xavierleroy commented Sep 15, 2020

Uh oh!

shubhamkumar13 commented Oct 28, 2020

Uh oh!

gasche left a comment

Choose a reason for hiding this comment

Uh oh!

gasche Oct 28, 2020

Choose a reason for hiding this comment

Uh oh!

nojb Dec 23, 2020

Choose a reason for hiding this comment

Uh oh!

sadiqj commented Oct 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gasche commented Oct 28, 2020

Uh oh!

xavierleroy commented Nov 2, 2020

Uh oh!

sadiqj commented Nov 5, 2020

Uh oh!

gasche commented Nov 5, 2020

Uh oh!

shubhamkumar13 commented Nov 5, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xavierleroy commented Nov 25, 2020

Uh oh!

xavierleroy commented Dec 11, 2020

Uh oh!

xavierleroy commented Dec 22, 2020

Uh oh!

nojb left a comment

Choose a reason for hiding this comment

Uh oh!

nojb Dec 23, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nojb commented Dec 31, 2020

Uh oh!

gasche commented Dec 31, 2020

Uh oh!

nojb commented Dec 31, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

sadiqj commented Oct 28, 2020 •

edited

Loading

shubhamkumar13 commented Nov 5, 2020 •

edited

Loading