port SpanHelpers.IndexOf(ref byte, byte, int) to Vector128/256 #73364
Tagging subscribers to this area: @dotnet/area-system-memory
Issue Details
x64
10% improvement for AVX2, no regressions for AVX and no HI.
Details
BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
[Host] : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
Job-BJEYEU : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-KDQZPU : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
[Host] : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
Job-SUXHIF : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
Job-KVLSYC : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
EnvironmentVariables=COMPlus_EnableAVX2=0
BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
[Host] : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
Job-KLMBAP : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-KSWPBM : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
EnvironmentVariables=COMPlus_EnableHWIntrinsic=0
arm64
The initial implementation got a 2x perf hit, but after I moved the call to ExtractMostSignificantBits so that it is performed only when a match is found, the perf is on par (see the second commit). BDN reports half a nanosecond difference, which translates to 2-3%, and I think that we should just ignore it (it's within the range of error).
Details
BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22403.8
[Host] : .NET 7.0.0 (7.0.22.40210), Arm64 RyuJIT AdvSIMD
Job-SUFTPF : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-WFBVQA : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
contributes to #64451
    }

    // Find bitflag offset of first match and add to current offset
    uint matches = compareResult.ExtractMostSignificantBits();
Last time I tried to do the same it was a noticeable regression for ARM64
Same here, but 28aa082 solved the problem
Now this operation is performed only once, after we find a match (not once for every vector we compare)
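A minimal sketch of the pattern described in this comment, not the actual `SpanHelpers.IndexOf` code from the PR. The class and method names (`IndexOfSketch`, `IndexOfByte`) are hypothetical, only `Vector128` is covered, and a plain scalar loop handles the tail. Inside the loop, each vector is only compared against zero; `ExtractMostSignificantBits` runs once, after a match has been found.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class IndexOfSketch
{
    // Hypothetical helper for illustration; requires .NET 7 or later.
    public static int IndexOfByte(ReadOnlySpan<byte> span, byte value)
    {
        int i = 0;

        if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<byte>.Count)
        {
            ref byte start = ref MemoryMarshal.GetReference(span);
            Vector128<byte> target = Vector128.Create(value);
            int lastVectorStart = span.Length - Vector128<byte>.Count;

            for (; i <= lastVectorStart; i += Vector128<byte>.Count)
            {
                // Lanes equal to the target become 0xFF, all others 0x00.
                Vector128<byte> compareResult =
                    Vector128.Equals(target, Vector128.LoadUnsafe(ref start, (nuint)i));

                // Per-iteration check: no lane matched, so move to the next vector.
                if (compareResult == Vector128<byte>.Zero)
                {
                    continue;
                }

                // A match exists: build the bitmask once and locate the first set bit.
                uint matches = compareResult.ExtractMostSignificantBits();
                return i + BitOperations.TrailingZeroCount(matches);
            }
        }

        // Scalar tail for the remaining elements (and for short inputs).
        for (; i < span.Length; i++)
        {
            if (span[i] == value)
            {
                return i;
            }
        }

        return -1;
    }
}
```

Whether the zero comparison is cheaper than extracting the mask on every iteration depends on the target architecture; the thread above reports the difference for ARM64.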
I think I did it too 🙂 But maybe it was before I did #65632
What is the cost of this approach vs. doing it on every loop iteration for Vector256<T>?
Is it better, particularly for large inputs, to do this there as well?
x64
10% improvement for AVX2, no regressions for AVX and no HI.
arm64
The initial implementation got a 2x perf hit, but after I moved the call to ExtractMostSignificantBits so that it is performed only when a match is found, the perf is on par (see the second commit). BDN reports half a nanosecond difference, which translates to 2-3%, and I think that we should just ignore it (it's within the range of error).
contributes to #64451