@adamsitnik
Member

x64

About 10% improvement with AVX2; no regressions with AVX only or with hardware intrinsics (HI) disabled.

Details
BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
  Job-BJEYEU : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-KDQZPU : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2

| Method | Job | Toolchain | Size | Mean | Ratio |
|---|---|---|---:|---:|---:|
| IndexOfValue | Job-BJEYEU | \PR\corerun.exe | 512 | 6.896 ns | 0.89 |
| IndexOfValue | Job-KDQZPU | \baseline\corerun.exe | 512 | 7.740 ns | 1.00 |

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
  Job-SUXHIF : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
  Job-KVLSYC : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX

EnvironmentVariables=COMPlus_EnableAVX2=0

| Method | Job | Toolchain | Size | Mean | Ratio |
|---|---|---|---:|---:|---:|
| IndexOfValue | Job-JAXEOB | \PR\corerun.exe | 512 | 9.438 ns | 1.00 |
| IndexOfValue | Job-JLYYUT | \baseline\corerun.exe | 512 | 9.466 ns | 1.00 |

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
  Job-KLMBAP : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-KSWPBM : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

EnvironmentVariables=COMPlus_EnableHWIntrinsic=0

| Method | Job | Toolchain | Size | Mean | Ratio |
|---|---|---|---:|---:|---:|
| IndexOfValue | Job-KLMBAP | \PR\corerun.exe | 512 | 35.78 ns | 1.01 |
| IndexOfValue | Job-KSWPBM | \baseline\corerun.exe | 512 | 35.35 ns | 1.00 |
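
The three x64 runs above differ only in the environment variables that restrict which instruction sets the JIT may use. As a minimal sketch, assuming .NET 7 or later, one way to confirm from inside a process which SIMD paths are actually available is to read the runtime's capability flags. `SimdSupportReport` and `Describe` are hypothetical names chosen for illustration; they are not part of this PR or of the benchmark harness.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class SimdSupportReport
{
    // Hypothetical helper: summarizes what the JIT reports for this process.
    // The values depend on the CPU and on settings such as
    // COMPlus_EnableAVX2=0 or COMPlus_EnableHWIntrinsic=0.
    public static string Describe()
    {
        return $"Vector128 accelerated: {Vector128.IsHardwareAccelerated}, " +
               $"Vector256 accelerated: {Vector256.IsHardwareAccelerated}, " +
               $"Avx: {Avx.IsSupported}, Avx2: {Avx2.IsSupported}";
    }
}
```

Printing `SimdSupportReport.Describe()` at startup, once per configuration, is a quick sanity check that each benchmark job exercised the code path it was meant to.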

arm64

The initial implementation was about 2x slower, but after I moved the call to ExtractMostSignificantBits so that it runs only when a match is found, the performance is on par with the baseline (see the second commit).
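
The idea of deferring the mask extraction can be illustrated with a minimal sketch. It assumes .NET 7 or later and searches bytes only; `DeferredMaskSketch` and `IndexOfByte` are hypothetical names, and this is not the code from the PR, whose diff is not shown here. The loop performs only a vector compare and a "no lane matched" test per iteration, and calls `ExtractMostSignificantBits` once, after a match is known to exist.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class DeferredMaskSketch
{
    // Hypothetical helper, not the PR's implementation: returns the index of
    // the first occurrence of value in span, or -1 if it is absent.
    public static int IndexOfByte(ReadOnlySpan<byte> span, byte value)
    {
        int i = 0;

        if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<byte>.Count)
        {
            ref byte start = ref MemoryMarshal.GetReference(span);
            Vector128<byte> target = Vector128.Create(value);
            int lastVectorStart = span.Length - Vector128<byte>.Count;

            for (; i <= lastVectorStart; i += Vector128<byte>.Count)
            {
                Vector128<byte> compareResult = Vector128.Equals(
                    target, Vector128.LoadUnsafe(ref start, (nuint)i));

                // Per-iteration work: only test whether any lane matched.
                if (compareResult == Vector128<byte>.Zero)
                {
                    continue;
                }

                // A match exists, so extract the bitmask now (once per search).
                uint matches = compareResult.ExtractMostSignificantBits();
                return i + BitOperations.TrailingZeroCount(matches);
            }
        }

        // Scalar loop for the tail, and for the non-accelerated case.
        for (; i < span.Length; i++)
        {
            if (span[i] == value)
            {
                return i;
            }
        }

        return -1;
    }
}
```

The lowest set bit of the mask corresponds to the first matching lane, which is why `BitOperations.TrailingZeroCount` gives the offset within the vector.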

BDN reports a difference of about half a nanosecond, which translates to 2-3%. I think we should ignore it, since it is within the margin of error.

Details
BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22403.8
  [Host]     : .NET 7.0.0 (7.0.22.40210), Arm64 RyuJIT AdvSIMD
  Job-SUFTPF : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-WFBVQA : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

| Method | Job | Toolchain | Size | Mean | Ratio |
|---|---|---|---:|---:|---:|
| IndexOfValue | Job-SUFTPF | /PR/corerun | 512 | 20.06 ns | 0.97 |
| IndexOfValue | Job-WFBVQA | /main/corerun | 512 | 20.78 ns | 1.00 |

contributes to #64451

@adamsitnik added the area-System.Memory and tenet-performance (Performance related issue) labels on Aug 4, 2022
@adamsitnik added this to the 7.0.0 milestone on Aug 4, 2022
@ghost assigned adamsitnik on Aug 4, 2022
@ghost
ghost commented Aug 4, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

}

// Find bitflag offset of first match and add to current offset
uint matches = compareResult.ExtractMostSignificantBits();
Member

Last time I tried to do the same it was a noticeable regression for ARM64

Member Author

Same here, but 28aa082 solved the problem

Member Author

Now this operation is performed only once, after we find a match (not once for every vector we compare).

Member

I think I did it too 🙂 But maybe it was before I did #65632

Member

What is the cost of this approach versus doing it on every loop iteration for Vector256<T>?

Is it better, particularly for large inputs, to do this there as well?

@adamsitnik merged commit 7085105 into dotnet:main on Aug 4, 2022
@ghost locked as resolved and limited conversation to collaborators on Sep 3, 2022