port SpanHelpers.IndexOf(ref byte, byte, int) to Vector128/256 #73364

adamsitnik · 2022-08-04T10:31:28Z

x64

10% improvement for AVX2, no regressions for AVX or with hardware intrinsics disabled (no HI).

Details

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
  Job-BJEYEU : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-KDQZPU : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Method	Job	Toolchain	Size	Mean	Ratio
IndexOfValue	Job-BJEYEU	\PR\corerun.exe	512	6.896 ns	0.89
IndexOfValue	Job-KDQZPU	\baseline\corerun.exe	512	7.740 ns	1.00

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
  Job-SUXHIF : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
  Job-KVLSYC : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX

EnvironmentVariables=COMPlus_EnableAVX2=0

Method	Job	Toolchain	Size	Mean	Ratio
IndexOfValue	Job-JAXEOB	\PR\corerun.exe	512	9.438 ns	1.00
IndexOfValue	Job-JLYYUT	\baseline\corerun.exe	512	9.466 ns	1.00

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
  Job-KLMBAP : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-KSWPBM : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

EnvironmentVariables=COMPlus_EnableHWIntrinsic=0

Method	Job	Toolchain	Size	Mean	Ratio
IndexOfValue	Job-KLMBAP	\PR\corerun.exe	512	35.78 ns	1.01
IndexOfValue	Job-KSWPBM	\baseline\corerun.exe	512	35.35 ns	1.00
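
The three x64 runs above differ only in which instruction sets the JIT is allowed to use. As a rough way to confirm which path a given configuration exercises, one could print the hardware-acceleration flags under the same environment variables. This is a minimal sketch; the `SimdSupportReport` class and its `Describe` method are hypothetical names, not part of the PR. With `COMPlus_EnableAVX2=0` one would expect `Avx2.IsSupported` to be reported as `False`, and with `COMPlus_EnableHWIntrinsic=0` all flags to be `False`.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper (not part of the PR): reports which SIMD paths
// the current process is allowed to use.
public static class SimdSupportReport
{
    public static string Describe() =>
        $"Vector128.IsHardwareAccelerated={Vector128.IsHardwareAccelerated}, " +
        $"Vector256.IsHardwareAccelerated={Vector256.IsHardwareAccelerated}, " +
        $"Avx.IsSupported={Avx.IsSupported}, " +
        $"Avx2.IsSupported={Avx2.IsSupported}, " +
        $"AdvSimd.IsSupported={AdvSimd.IsSupported}";

    public static void Main() => Console.WriteLine(Describe());
}
```

Running it once without any variables set and once with each variable from the runs above shows whether the setting took effect before benchmark numbers are compared.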

arm64

The initial implementation took a 2x perf hit, but after I moved the call to ExtractMostSignificantBits so that it is performed only when a match is found, the perf is on par (see the second commit).

BDN reports a difference of about half a nanosecond, which translates to 2-3%; I think we should just ignore it (it's within the range of error).
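
The idea behind the second commit can be illustrated with a simplified loop. This is a minimal sketch, not the actual `SpanHelpers` code: `IndexOfSketch` and `IndexOfByte` are hypothetical names, only the `Vector128` path is shown, and the tail is handled with a plain scalar loop. The loop first does a cheap "did any lane match?" comparison against zero and calls `ExtractMostSignificantBits` only once a match is known to exist.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical sketch (requires .NET 7+); not the SpanHelpers implementation.
public static class IndexOfSketch
{
    public static int IndexOfByte(ReadOnlySpan<byte> span, byte value)
    {
        ref byte start = ref MemoryMarshal.GetReference(span);
        int offset = 0;

        if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<byte>.Count)
        {
            Vector128<byte> target = Vector128.Create(value);
            int lastVectorStart = span.Length - Vector128<byte>.Count;

            while (offset <= lastVectorStart)
            {
                // Matching lanes become 0xFF, non-matching lanes become 0x00.
                Vector128<byte> compareResult = Vector128.Equals(
                    target, Vector128.LoadUnsafe(ref start, (nuint)offset));

                // Cheap check inside the loop: no mask extraction yet.
                if (compareResult != Vector128<byte>.Zero)
                {
                    // Performed once, only after a match has been found.
                    uint matches = compareResult.ExtractMostSignificantBits();
                    return offset + BitOperations.TrailingZeroCount(matches);
                }

                offset += Vector128<byte>.Count;
            }
        }

        // Scalar fallback for the remaining bytes (or for short inputs).
        for (; offset < span.Length; offset++)
        {
            if (span[offset] == value)
            {
                return offset;
            }
        }

        return -1;
    }

    public static void Main()
    {
        byte[] data = new byte[512];
        data[300] = 42;
        Console.WriteLine(IndexOfByte(data, 42));
    }
}
```

Keeping the mask extraction out of the loop body means the per-iteration work is only a load, a compare and a zero test, which is where the arm64 cost was reported to be.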

Details

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22403.8
  [Host]     : .NET 7.0.0 (7.0.22.40210), Arm64 RyuJIT AdvSIMD
  Job-SUFTPF : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-WFBVQA : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

Method	Job	Toolchain	Size	Mean	Ratio
IndexOfValue	Job-SUFTPF	/PR/corerun	512	20.06 ns	0.97
IndexOfValue	Job-WFBVQA	/main/corerun	512	20.78 ns	1.00

contributes to #64451

ghost · 2022-08-04T10:31:44Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Author:	adamsitnik
Assignees:	-
Labels:	`area-System.Memory`, `tenet-performance`
Milestone:	7.0.0

EgorBo · 2022-08-04T10:57:46Z

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs

                        }

                        // Find bitflag offset of first match and add to current offset
+                        uint matches = compareResult.ExtractMostSignificantBits();


Last time I tried to do the same it was a noticeable regression for ARM64

adamsitnik · 2022-08-04

Same here, but 28aa082 solved the problem.

adamsitnik · 2022-08-04

Now this operation is performed only once, after we find a match (not once for every vector we compare).

EgorBo · 2022-08-04

I think I did it too 🙂 But maybe it was before I did #65632.

tannergooding · 2022-08-04

What is the cost of this approach vs. doing it on every loop iteration for Vector256<T>?

Is it better, particularly for large inputs, to do this there as well?

adamsitnik added 2 commits August 4, 2022 11:54

port SpanHelpers.IndexOf(ref byte, byte, int) to Vector128/256

d3c77f8

try to solve arm64 regression by not calling ExtractMostSignificantBits for every loop iteration

28aa082

adamsitnik added the area-System.Memory and tenet-performance labels Aug 4, 2022

adamsitnik added this to the 7.0.0 milestone Aug 4, 2022

adamsitnik requested review from EgorBo, stephentoub and tannergooding August 4, 2022 10:31

ghost assigned adamsitnik Aug 4, 2022

filipnavara approved these changes Aug 4, 2022

EgorBo reviewed Aug 4, 2022

EgorBo approved these changes Aug 4, 2022

adamsitnik mentioned this pull request Aug 4, 2022

Switch from direct intrinsics usage to Vector/Vector64/Vector128/Vector256 #64451

Open

75 tasks

filipnavara mentioned this pull request Aug 4, 2022

port SpanHelpers.IndexOf(ref char, char, int) to Vector128/256 #73368

Merged

adamsitnik merged commit 7085105 into dotnet:main Aug 4, 2022

ghost locked as resolved and limited conversation to collaborators Sep 3, 2022
