Make bitset `would_modify_words` more vectorizer-friendly #153640

Zalathar wants to merge 3 commits into rust-lang:main

Conversation
@bors try @rust-timer queue
Let's see what happens if we double the subchunk length from 32 bytes (4 words) to 64 bytes (8 words). @bors try @rust-timer queue
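For context, here is a minimal sketch of what that subchunk size means in bytes. The constant names are hypothetical illustrations, not the actual rustc identifiers:

```rust
// Hypothetical constant names for illustration; not the actual rustc code.
const WORD_BYTES: usize = std::mem::size_of::<u64>(); // 8 bytes per word

// Before this experiment: 4 words (32 bytes) per subchunk.
// After doubling: 8 words (64 bytes), i.e. a full cache line on most
// x86-64 CPUs, and one AVX-512 register's worth of data.
const SUBCHUNK_WORDS: usize = 8;
const SUBCHUNK_BYTES: usize = SUBCHUNK_WORDS * WORD_BYTES;

fn main() {
    assert_eq!(SUBCHUNK_BYTES, 64);
    println!("{SUBCHUNK_BYTES} bytes per subchunk");
}
```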
Finished benchmarking commit (b3e83b4): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

@bors rollup=never

Instruction count: our most reliable metric, used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage): Results (secondary 7.5%). A less reliable metric. May be of interest, but not used to determine the overall result above.
Cycles: Results (secondary -2.3%). A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary size: this benchmark run did not return any relevant results for this metric.

Bootstrap: 480.034s -> 484.684s (0.97%)
I was initially disappointed to see this only affect `cranelift-codegen`. I wonder what code patterns cause these paths to be relevant.
Probably it's huge functions with a bunch of locals, exercising the move/init dataflow a lot?
If it has large functions with thousands of locals, then yeah, I can imagine that stressing MixedBitSet in ways that most crates never come close to.
My recollection is that it does indeed, with the usual suspect being machine-generated code.
r? @madsmtm

rustbot has assigned @madsmtm.
No substantial changes since last time, but let's have another perf run to make sure everything's still good. @bors try @rust-timer queue
Finished benchmarking commit (55f9125): comparison URL.

Overall result: ❌✅ regressions and improvements - please read the text below

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please do so in sufficient writing along with

@bors rollup=never

Instruction count: our most reliable metric, used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage): Results (secondary -2.8%). A less reliable metric. May be of interest, but not used to determine the overall result above.
Cycles: Results (secondary 0.0%). A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary size: this benchmark run did not return any relevant results for this metric.

Bootstrap: 480.488s -> 481.781s (0.27%)
I know that some of the dataflow analyses for cranelift reach a fixpoint very slowly, and many of the basic blocks get reprocessed over and over. A while back I tried tweaking the basic-block traversal order and got up to 20% improvements(!) on some cranelift builds, but I also got similar slowdowns on some other benchmarks, so I never filed any PRs about it.
Currently this function compares a single pair of `u64` at a time, which is potentially slower than comparing multiple words before each early-exit check, especially for the large chunks used by `ChunkedBitSet`.

Perf shows a notable improvement in `cranelift-codegen`, which is the one benchmark that is known to stress these code paths.
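To illustrate the idea, here is a minimal sketch of comparing words one fixed-size subchunk at a time instead of one word at a time. This is not the actual rustc implementation; the function signature and the `SUBCHUNK_WORDS` constant are assumptions. The point is that the inner loop over a subchunk has no branches, so LLVM can auto-vectorize it, and the early-exit check runs once per subchunk rather than once per word:

```rust
const SUBCHUNK_WORDS: usize = 4; // 4 words = 32 bytes per subchunk (hypothetical value)

/// Returns true if OR-ing `other` into `self_words` would change any word.
/// Sketch only: the real rustc function may differ in signature and details.
fn would_modify_words(self_words: &[u64], other: &[u64]) -> bool {
    debug_assert_eq!(self_words.len(), other.len());

    let mut chunks_a = self_words.chunks_exact(SUBCHUNK_WORDS);
    let mut chunks_b = other.chunks_exact(SUBCHUNK_WORDS);
    for (a, b) in chunks_a.by_ref().zip(chunks_b.by_ref()) {
        // Accumulate "would change" bits across the whole subchunk in a
        // branch-free loop, then do a single early-exit check per subchunk.
        let mut changed = 0u64;
        for (&x, &y) in a.iter().zip(b) {
            changed |= y & !x; // bits set in `other` but not in `self_words`
        }
        if changed != 0 {
            return true;
        }
    }
    // Handle any trailing words (fewer than a full subchunk) one at a time.
    chunks_a
        .remainder()
        .iter()
        .zip(chunks_b.remainder())
        .any(|(&x, &y)| y & !x != 0)
}

fn main() {
    let a = [0b1010u64; 8];
    let b = [0b0101u64; 8];
    assert!(would_modify_words(&a, &b)); // OR would set new bits
    assert!(!would_modify_words(&b, &b)); // OR-ing a set into itself changes nothing
    println!("ok");
}
```

The trade-off being benchmarked in this thread is the subchunk width: wider subchunks mean fewer early-exit branches and wider SIMD loads, but more wasted work when a difference appears early in a subchunk.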