Do not cache young_limit in a processor register#9876
Conversation

Thanks @xavierleroy. @avsm, can you provision an ARM64 and a PowerPC machine for @shakthimaan, who can run Sandmark on this?
Force-pushed fdb04c5 to 5a852d4

I've provisioned an ARM64 machine now for @shakthimaan, but we've had a hard drive failure on one of our two PowerPC boxes (…)

No problem taking …

The following results are from the ARM64 machine (this was run against 2bb2bde):

[Normalized runtime graph for young_limit (vs 2bb2bde)]
[Normalized minor collections graph for young_limit (vs 2bb2bde)]
gasche left a comment:
Thanks @shubhamkumar13 for the report! It is very strange that fft has more minor collections after this change. This suggests that the value of young_limit is not the same before and after the change, at a point where we decide whether to do a GC. This sounds like a bug.
      | Some label -> call_gc_label := label
    end;
    let offset = Domainstate.(idx_of_field Domain_young_limit) * 8 in
    `	{emit_string lg}	0, {emit_int offset}(30)\n`;
why are we loading from 30 and not reg_domain_state_ptr here?
This backend uses "literal" registers everywhere, I guess @xavierleroy preferred to make the diff as small as possible?
One theory is that this is caused by the remembered-set threshold being crossed, which results in a minor GC being requested. There's a comment at lines 113 to 116 in 86c8a98. @shubhamkumar13 is still investigating, but we thought we'd post the numbers in case anyone had any other theories.

TL;DR: I understand what scenario you are referring to; it sounds plausible, and it would be reassuring, as in this case both behaviors (before and after the patch) are correct. Thanks for the hint! I was thinking that it would be a worrying bug if the register value was "earlier" than the stored value (it could lead to allocations overwriting existing values), but it may actually be not dramatic in this case, when the stored value was moved to the end of the region due to pending actions. If this is indeed what is happening, the previous runtime would sometimes delay "pending events" due to this phenomenon, and the new behavior is arguably better than the previous one. (In some cases, delaying pending events could be considered a bug.) The specific "pending event" that you are referring to is the crossing of the "threshold" of the remembered set table when called from …

Thanks for the numbers. There's something I don't understand in your graph, and it's not the first time I'm confused. The raw figures indicate a change in execution time between +6.15% and -4.03%, meaning, I believe, that the ratio run-time(uncached) / run-time(cached) is between 1.0615 and 0.9597. Yet the graph depicts values between 0.94 and 1.04. What are these values depicted on the graph?
Just had a look at this. I think the runtime values have the wrong sign: it should be -6.15% to +4.03% for the runtime. Those then line up correctly with the values on the graph.

I believe that the "runtime change" column should be read as the speedup provided by caching (and not as a slowdown resulting from uncaching). For quicksort, for example, the uncached/cached ratio is 1.0656, which suggests that uncached has a 6.56% slowdown, instead of the 6.15% reported in the same column. On the other hand, the cached/uncached ratio (the speedup of caching) is 0.938, which lines up perfectly with the data in the graph and suggests that caching provides a 6.15% performance boost (1 - 0.938, etc.) on this benchmark.
Yeah, sorry about that, it was a representation mistake on my part. I could have named it "speedup" instead of "runtime change".

I made some measurements using … Recall that the main impact of the change is to add one "load" per inlined allocation site. The secondary impact is to free one integer register for register allocation. This causes the following change: the number of instructions executed increases by 0.5% to 0.9% (depending on the benchmark). To me this suggests that the slowdown directly caused by this change is on the order of 1% for allocation-intensive programs. Of course, code placement differs, so there are other effects on total running time, but these are not directly attributable to the change.
On target architectures with 32 or more registers, a register was used to cache the value of the young_limit field of the domain state. This reduced the size and execution time of the code for inlined allocations. However, this usage is problematic with respect to polling for signals and to inter-domain communication in Multicore OCaml, because it is often not possible to change the value of the register when we change young_limit. So, the change to young_limit doesn't take effect immediately, only when the register is reloaded from young_limit. This commit removes the caching of young_limit in a register from the ARM64, PowerPC and RISC-V ports.
Force-pushed 88567e1 to 073bd67
…e register

Now that we have an unused callee-save register on ARM64, PowerPC, and RISC-V, make it available for register allocation. Assorted cleanups in runtime/*.S and in asmcomp/*/proc.ml.
Force-pushed 073bd67 to f728a5a

At the latest developers meeting, it was decided to go ahead with this removal of the "young limit" register in the ARM64, PowerPC and RISC-V ports. I rebased the code, integrating the RISC-V changes that occurred recently on trunk. The test suite passes on all architectures (including RISC-V). This is now ready for review.

To repeat myself: this PR is READY FOR REVIEW. Thank you.
nojb left a comment:
Thanks! The new code is both simpler and more uniform.
I have reviewed the patch and it looks correct to me. I left a number of comments/questions/etc., but they are just to confirm my own understanding. Having said that, I think it is somewhat unlikely that one would be able to spot a bug by reading the code, unless it was fairly obvious. But I am approving on the basis that the CI passes on each of the affected architectures (I consider this a pretty good test for this kind of change).
There are only 7 callee-save integer registers (x19 to x25), not 10.

Thanks for the nice PR! I'm happy with the current state and I think it is ready to be merged. Given that it is New Year's Eve, perhaps we should wait a few days to give any stragglers time to chime in?

(No one else has written in these few months to say that they wanted to look at this PR, so I think it is reasonable to assume that no stragglers have plans for it.)

Merged, thanks!


Note: this PR is intended for benchmarking the impact of the proposed change. It is not to be merged immediately.
On target architectures with 32 or more registers, a register was used to cache the value of the `young_limit` field of the domain state. This reduced the size and execution time of the code for inlined allocations.

However, this usage is problematic with respect to polling for signals and to inter-domain communication in Multicore OCaml, because it is often impossible to change the value of the register when we change `young_limit`. So, the change to `young_limit` doesn't take effect immediately, only when the register is reloaded from `young_limit`.

This PR removes the caching of `young_limit` in a register from the ARM64, PowerPC and RISC-V ports.
The first commit 66fbe29 turns the caching off.
The second commit e7a2f15 recycles the register previously used as the cache as an allocatable register.
ARM64 and PowerPC were tested. RISC-V could not be tested.
Cc: @kayceesrk @ctk21 @sadiqj