Change interners to start preallocated with an increased capacity #137354
Conversation
@bors try @rust-timer queue
[perf experiment] Changed interners to start preallocated with an increased capacity

Inspired by rust-lang#137005. *Not meant to be merged in its current form.*

Added a `with_capacity` function to `InternedSet`. Changed the `CtxtInterners` to start with `InternedSet`s preallocated with a capacity. This *does* increase memory usage very slightly (by 1 MB at the start), although that increase quickly disappears for larger crates (since they require such capacity anyway). A local perf run indicates this improves compile times for small crates (like `ripgrep`), without a negative effect on larger ones.

The current default capacities are chosen somewhat arbitrarily, and are relatively low. Depending on what kind of memory usage is acceptable, it may be beneficial to increase that capacity for some interners. From a second local perf run (with the capacity of `type_` increased to `131072`), it looks like increasing the size of the preallocated type interner has the biggest impact.

What would be the maximum acceptable memory usage increase? I think most people would not mind sacrificing 1-2 MB for an improvement in compile speed, but I am curious what the general opinion here is.
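To make the idea concrete, here is a minimal, self-contained sketch of the pattern, using a plain `std::collections::HashSet` as a stand-in for rustc's sharded, arena-backed `InternedSet` (the `MiniInterner` type below is invented purely for illustration):

```rust
use std::collections::HashSet;

/// Stand-in for an interner backed by a hash set. The real `InternedSet`
/// in rustc is sharded and arena-backed; this only illustrates the
/// `new` vs. `with_capacity` distinction.
struct MiniInterner {
    set: HashSet<&'static str>,
}

impl MiniInterner {
    /// Starts empty; the backing table grows (and rehashes) as items are interned.
    fn new() -> Self {
        MiniInterner { set: HashSet::new() }
    }

    /// Starts with room for at least `cap` entries, so small compilations
    /// never pay for a resize.
    fn with_capacity(cap: usize) -> Self {
        MiniInterner { set: HashSet::with_capacity(cap) }
    }
}

fn main() {
    let eager = MiniInterner::with_capacity(1 << 12);
    let lazy = MiniInterner::new();
    // The preallocated interner reserves 4096 slots up front, while the
    // lazily-grown one starts at zero and will reallocate as it fills up.
    assert!(eager.set.capacity() >= 1 << 12);
    assert_eq!(lazy.set.capacity(), 0);
}
```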
☀️ Try build successful - checks-actions
Finished benchmarking commit (f5637ed): comparison URL.

Overall result: ❌✅ regressions and improvements - please read the text below.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never

Instruction count: This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

Max RSS (memory usage): Results (primary -2.5%, secondary -0.5%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This benchmark run did not return any relevant results for this metric.

Binary size: This benchmark run did not return any relevant results for this metric.

Bootstrap: 773.525s -> 772.968s (-0.07%)
Ideally the buffer sizes would be based on some collected data, but it also seems OK to merge this as-is and add a FIXME for the future.
I have done a few more local perf runs, and it seems like the bigger the initial capacity the better - at least for the type interner. The gains start to diminish somewhere past 2^16, but I still see some improvements even as far as 2^20. This is both good (since I see more improvements) and bad - since there is no clear local maximum, it is hard to decide at which point the increase in minimum RAM usage stops being worth it. IMHO, the best way to tune this would be to look at the sizes of those interners for some commonly built crates. Does rustc have any minimum required specs? I don't know how much headroom I have here.
There are not really any minimum required specs, but if you take a look at compiling hello world, the max RSS shouldn't increase dramatically after the preallocation, IMO.
That should be relatively simple to figure out: you could print the sizes of the maps before the compilation ends and then use the collected numbers to pick the capacities. CC @nnethercote
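As a hedged sketch of that suggestion (the struct and field names below are placeholders, not the actual `CtxtInterners` fields; the data for this PR was eventually gathered via `x perf eprintln --includes cargo`), a one-off debug dump could look like this:

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a few of the interner tables; the real
/// `CtxtInterners` has many more fields and uses sharded sets, whose
/// lengths would need to be summed across shards.
struct MiniInterners {
    type_: HashSet<String>,
    type_lists: HashSet<Vec<u32>>,
    region: HashSet<u32>,
}

/// Debug-only dump of interner sizes, intended to run just before the
/// compiler session ends, so the observed lengths can guide the
/// preallocation constants.
fn dump_interner_sizes(i: &MiniInterners) {
    eprintln!("type_:      {}", i.type_.len());
    eprintln!("type_lists: {}", i.type_lists.len());
    eprintln!("region:     {}", i.region.len());
}

fn main() {
    let i = MiniInterners {
        type_: HashSet::new(),
        type_lists: HashSet::new(),
        region: HashSet::new(),
    };
    dump_interner_sizes(&i);
}
```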
I have made some bigger changes to the capacities based on the collected data. For each variable I tweaked, I also included a comment with some observations about the values I encountered during my tests. There are some things I have not tweaked yet (those don't have comments). With those much larger capacities, the max RSS has increased in almost all cases, while instruction count got reduced on average. Are the RSS increases reasonable for those compile-time gains? If not, what would be the biggest acceptable RSS increase?
We didn't accept mimalloc even though it was ~5% faster across the board, because it regressed RSS by 15-25%, so that's a lot :) Let's see it on perf.rlo.

@bors try @rust-timer queue

While the perf wins are kinda nice, I'm not completely sure if they are worth having a ton of magic constants in the code, especially since their optimality will necessarily change as the codebase evolves.
Makes sense - with different "magic numbers", I managed to see some perf improvements without RSS regressions (on average). I'll have to see what the optimal values are there, and what can be done without increasing the average RSS.
That makes sense. Would having one constant and then multiplying / dividing it be better? E.g.:

```rust
const INTERN_CAP: usize = 1024;
// ...
CtxtInterners {
    arena,
    type_: InternedSet::with_capacity(INTERN_CAP * 16),
    const_lists: InternedSet::with_capacity(INTERN_CAP / 2),
    args: InternedSet::with_capacity(INTERN_CAP * 16),
    type_lists: InternedSet::with_capacity(INTERN_CAP * 8),
    region: InternedSet::with_capacity(INTERN_CAP),
    // And so on...
```

That could make future adjustments clearer. Alternatively, this could be a command line / environment option (to allow people with more RAM to get the perf benefits), or I could just use one constant cap (e.g. 1024) for all interners - although that would be a bit wasteful.
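For the environment-variable alternative mentioned above, a minimal sketch could scale a single base capacity from an override (the `RUSTC_INTERN_CAP` variable name is invented for illustration and is not an existing rustc option):

```rust
use std::env;

/// Base capacity used when no override is provided.
const DEFAULT_INTERN_CAP: usize = 1024;

/// Reads a hypothetical `RUSTC_INTERN_CAP` environment variable and falls
/// back to the compiled-in default when it is unset or not a number.
fn base_intern_capacity() -> usize {
    env::var("RUSTC_INTERN_CAP")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(DEFAULT_INTERN_CAP)
}

fn main() {
    let cap = base_intern_capacity();
    // Individual interners would then scale this single knob, e.g.
    // `type_` at `cap * 16` and `const_lists` at `cap / 2`.
    println!("type_ capacity: {}", cap * 16);
    println!("const_lists capacity: {}", cap / 2);
}
```

This keeps a single tunable in one place, at the cost of baking the per-interner ratios into the code rather than the configuration.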
☀️ Try build successful - checks-actions
Finished benchmarking commit (c5e7aa9): comparison URL.

Overall result: ❌✅ regressions and improvements - please read the text below.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never

Instruction count: This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

Max RSS (memory usage): Results (primary 15.9%, secondary 18.1%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: Results (primary 3.2%, secondary 5.2%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Binary size: This benchmark run did not return any relevant results for this metric.

Bootstrap: 773.873s -> 786.312s (1.61%)
Yeah, that went way too much in the wrong direction :)
Force-pushed from a0b66d3 to 84178df
@nnethercote would you be up for reviewing PRs like this in the future? I have some more changes like this (based on some local profiling), and this seems like something you might want to review.
```diff
@@ -2847,6 +2853,7 @@ impl<'tcx> TyCtxt<'tcx> {
     // FIXME consider asking the input slice to be sorted to avoid
     // re-interning permutations, in which case that would be asserted
     // here.
+
```
unintentional?
```rust
offset_of: Default::default(),
valtree: Default::default(),
// The factors have been chosen by @FractalFir based on observed interner sizes
// (obtained by printing them using `x perf eprintln --includes cargo`),
```
What was printed, i.e. where was the `eprintln!` call inserted? Also worth mentioning that `cargo` is one of the larger benchmarks.
Sure, I've done tons of these kinds of micro-optimizations in the past :)
Force-pushed from 84178df to 1e16ec4
Force-pushed from 1e16ec4 to 7d2cfca
Should be good to go now.
@rustbot label -S-waiting-on-author +S-waiting-on-review
Looks good. The very first comment in the PR gets used as the merge comment. Can you update it above to not say "Not meant to be merged in its current form", and maybe remove the images? r=me with that, thanks. @bors delegate=FractalFir
✌️ @FractalFir, you can now approve this pull request! If @nnethercote told you to "r=me" after making some further change, please make that change, then do so.
@bors r+
@FractalFir The reviewer (r) should've been nnethercote, not yourself (via r=nnethercote). Just so you know |
Sorry, my bad. I was in a bit of a rush, and did not notice I used the wrong command. Is there something I need to do now? |
No worries, such things happen from time to time. I don't think we can retroactively update it without starting over. It's okay tho |
☀️ Test successful - checks-actions
Finished benchmarking commit (ac91805): comparison URL.

Overall result: ✅ improvements - no action needed

@rustbot label: -perf-regression

Instruction count: This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

Max RSS (memory usage): Results (primary 0.6%, secondary -1.2%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This benchmark run did not return any relevant results for this metric.

Binary size: This benchmark run did not return any relevant results for this metric.

Bootstrap: 771.836s -> 771.284s (-0.07%)
Inspired by #137005.

Added a `with_capacity` function to `InternedSet`. Changed the `CtxtInterners` to start with `InternedSet`s preallocated with a capacity. This does increase memory usage very slightly (by ~1 MB at the start), although that increase quickly disappears for larger crates (since they require such capacity anyway).

A local perf run indicates this improves compile times for small crates (like `ripgrep`), without a negative effect on larger ones.
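As a rough, self-contained illustration of why preallocation helps at all (this is not rustc code, just `std::collections::HashSet` with a wall-clock timer), growing a table from empty pays for several intermediate rehashes that a preallocated table skips:

```rust
use std::collections::HashSet;
use std::time::Instant;

/// Inserts `n` spread-out keys and returns the final element count.
fn fill(mut set: HashSet<u64>, n: u64) -> usize {
    for i in 0..n {
        // Multiply by an odd constant so keys are spread out rather than sequential.
        set.insert(i.wrapping_mul(0x9E37_79B9_7F4A_7C15));
    }
    set.len()
}

fn main() {
    const N: u64 = 1 << 20;

    let start = Instant::now();
    let grown = fill(HashSet::new(), N);
    let grow_time = start.elapsed();

    let start = Instant::now();
    let preallocated = fill(HashSet::with_capacity(N as usize), N);
    let prealloc_time = start.elapsed();

    // Both sets end up with the same contents; the preallocated one just
    // skips the intermediate grow-and-rehash steps.
    assert_eq!(grown, preallocated);
    println!("grown from empty: {:?}", grow_time);
    println!("preallocated:     {:?}", prealloc_time);
}
```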