try to port ASCIIUtility.NarrowUtf16ToAscii to x-plat intrinsics #73064

adamsitnik · 2022-07-29T13:57:19Z

For both x64 and ARM64 I am observing a 10-15% regression.

x64

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
  
Method=GetBytes

Type	Job	size	encName	Input	Mean	Ratio
Perf_Encoding	PR	16	ascii	?	18.22 ns	1.00
Perf_Encoding	main	16	ascii	?	18.20 ns	1.00

Perf_Encoding	PR	16	utf-8	?	17.81 ns	1.09
Perf_Encoding	main	16	utf-8	?	16.63 ns	1.00

Perf_Encoding	PR	512	ascii	?	50.45 ns	1.12
Perf_Encoding	main	512	ascii	?	44.88 ns	1.00

Perf_Encoding	PR	512	utf-8	?	74.30 ns	1.16
Perf_Encoding	main	512	utf-8	?	64.30 ns	1.00

Perf_Utf8Encoding	PR	?	?	EnglishAllAscii	17,846.09 ns	1.08
Perf_Utf8Encoding	main	?	?	EnglishAllAscii	16,454.56 ns	1.00

Perf_Utf8Encoding	PR	?	?	EnglishMostlyAscii	79,019.77 ns	0.97
Perf_Utf8Encoding	main	?	?	EnglishMostlyAscii	81,128.97 ns	1.00

Perf_Utf8Encoding	PR	?	?	Chinese	82,925.92 ns	0.99
Perf_Utf8Encoding	main	?	?	Chinese	83,539.51 ns	1.00

Perf_Utf8Encoding	PR	?	?	Cyrillic	97,716.52 ns	0.98
Perf_Utf8Encoding	main	?	?	Cyrillic	99,311.72 ns	1.00

Perf_Utf8Encoding	PR	?	?	Greek	151,446.91 ns	0.96
Perf_Utf8Encoding	main	?	?	Greek	157,171.66 ns	1.00

Here I was able to use VTune (I also tried uProf, but it leaves a lot to be desired).

From what I can see, the regression comes from three additional vpand instructions.

It seems that the first one comes from the new VectorContainsNonAsciiChar implementation (this is expected). But do the other two come from Vector128.Narrow? @tannergooding is that expected?

ARM64

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22379.1
  [Host]     : .NET 7.0.0 (7.0.22.37802), Arm64 RyuJIT AdvSIMD

Method=GetBytes

Type	Job	size	encName	Input	Mean	Ratio
Perf_Encoding	PR	16	ascii	?	67.68 ns	0.98
Perf_Encoding	main	16	ascii	?	68.75 ns	1.00

Perf_Encoding	PR	16	utf-8	?	75.34 ns	1.00
Perf_Encoding	main	16	utf-8	?	75.26 ns	1.00

Perf_Encoding	PR	512	ascii	?	202.77 ns	1.12
Perf_Encoding	main	512	ascii	?	180.61 ns	1.00

Perf_Encoding	PR	512	utf-8	?	269.92 ns	1.13
Perf_Encoding	main	512	utf-8	?	239.16 ns	1.00

Perf_Utf8Encoding	PR	?	?	EnglishAllAscii	79,699.39 ns	1.15
Perf_Utf8Encoding	main	?	?	EnglishAllAscii	69,283.09 ns	1.00

Perf_Utf8Encoding	PR	?	?	EnglishMostlyAscii	285,553.35 ns	1.00
Perf_Utf8Encoding	main	?	?	EnglishMostlyAscii	285,896.31 ns	1.00

Perf_Utf8Encoding	PR	?	?	Chinese	308,349.96 ns	1.00
Perf_Utf8Encoding	main	?	?	Chinese	307,750.04 ns	1.00

Perf_Utf8Encoding	PR	?	?	Cyrillic	217,887.57 ns	1.00
Perf_Utf8Encoding	main	?	?	Cyrillic	217,912.29 ns	1.00

Perf_Utf8Encoding	PR	?	?	Greek	312,137.16 ns	1.00
Perf_Utf8Encoding	main	?	?	Greek	311,067.67 ns	1.00

ghost · 2022-07-29T13:57:32Z

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	adamsitnik
Assignees:	-
Labels:	`area-System.Text.Encoding`
Milestone:	-

EgorBo · 2022-07-29T20:56:01Z

src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs


            ref byte asciiBuffer = ref *pAsciiBuffer;
-            Vector128<byte> asciiVector = ExtractAsciiVector(utf16VectorFirst, utf16VectorFirst);
+            Vector128<byte> asciiVector = Vector128.Narrow(utf16VectorFirst, utf16VectorFirst);


@adamsitnik Narrow doesn't know that utf16VectorFirst is already ASCII at this point (from my understanding), so it applies a mask via AND to cut off anything above 0xFF (not needed in this case). Consider this:

So for this case, for better perf, you probably want to keep ExtractAsciiVector.

Can we log an issue here for Narrow? There are possibly some codegen improvements we can make in .NET 8.
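
The point about the extra AND can be illustrated with a small sketch. `Vector128.Narrow` truncates each 16-bit element to its low byte, whereas the x64 `packuswb` instruction (exposed as `Sse2.PackUnsignedSaturate`) saturates, so a portable narrow presumably has to mask first on x64 to get truncating behavior. For input already known to be ASCII the two agree, which is why the mask is redundant here. The class and method names below are hypothetical and this is not the PR's code.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class NarrowSemanticsDemo
{
    // Portable narrowing: keeps only the low byte of every 16-bit element.
    public static Vector128<byte> NarrowPortable(Vector128<ushort> v)
        => Vector128.Narrow(v, v);

    // x64-only narrowing: packuswb reads its input as signed 16-bit values and
    // saturates them to the 0..255 range. Call only when Sse2.IsSupported is true.
    public static Vector128<byte> NarrowSaturating(Vector128<ushort> v)
        => Sse2.PackUnsignedSaturate(v.AsInt16(), v.AsInt16());

    public static void Main()
    {
        Vector128<ushort> ascii = Vector128.Create((ushort)'A');    // every element is 0x0041
        Vector128<ushort> wide = Vector128.Create((ushort)0x1234);  // every element is non-ASCII

        // Truncation: the ASCII value survives, the wide value loses its high byte.
        Console.WriteLine(NarrowPortable(ascii).GetElement(0));
        Console.WriteLine(NarrowPortable(wide).GetElement(0));

        if (Sse2.IsSupported)
        {
            // ASCII input: saturation never kicks in, so both approaches match.
            Console.WriteLine(NarrowSaturating(ascii) == NarrowPortable(ascii));
            // Non-ASCII input: the value saturates instead of being truncated.
            Console.WriteLine(NarrowSaturating(wide).GetElement(0));
        }
    }
}
```

Because the saturating and truncating forms only differ for values above 0xFF, a caller that has already rejected non-ASCII input can use either one.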

EgorBo · 2022-07-29T20:56:54Z

src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs

+            const ushort asciiMask = ushort.MaxValue - 127; // 0xFF80
+            Vector128<ushort> zeroIsAscii = utf16Vector & Vector128.Create(asciiMask);
+            // If a non-ASCII bit is set in any WORD of the vector, we have seen non-ASCII data.
+            return !Vector128.EqualsAll(zeroIsAscii, Vector128<ushort>.Zero);


You said that it's expected to have a redundant AND here - why? 🙂
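
For readers following the check under discussion, here is a minimal standalone sketch of the same logic. A UTF-16 code unit is ASCII exactly when its value is at most 0x7F, that is, when none of the bits in `ushort.MaxValue - 127` (which evaluates to 0xFF80) are set. AND-ing with that mask and comparing against zero therefore flags any non-ASCII element. The wrapping class name and the `Main` driver are hypothetical.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class AsciiCheckDemo
{
    public static bool VectorContainsNonAsciiChar(Vector128<ushort> utf16Vector)
    {
        // 65535 - 127 = 65408 = 0xFF80: every bit that an ASCII code unit must leave clear.
        const ushort asciiMask = ushort.MaxValue - 127;
        Vector128<ushort> zeroIsAscii = utf16Vector & Vector128.Create(asciiMask);
        // Any surviving bit means at least one element was above 0x7F.
        return !Vector128.EqualsAll(zeroIsAscii, Vector128<ushort>.Zero);
    }

    public static void Main()
    {
        Vector128<ushort> allAscii = Vector128.Create((ushort)'a');
        // Replace one element with U+00E9 to make the vector non-ASCII.
        Vector128<ushort> oneNonAscii = allAscii.WithElement(3, (ushort)0x00E9);

        Console.WriteLine(VectorContainsNonAsciiChar(allAscii));
        Console.WriteLine(VectorContainsNonAsciiChar(oneNonAscii));
    }
}
```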

adamsitnik · 2022-08-10T12:56:43Z

@tannergooding Since I was not able to get the same perf with Vector128, I've kept the architecture-specific intrinsics and added Vector128 as a fallback for when SSE2 and AdvSIMD are not supported, which some configurations may benefit from in theory.
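
The dispatch described in this comment could look roughly like the sketch below: prefer an architecture-specific narrowing instruction when one is available and fall back to the cross-platform `Vector128.Narrow` otherwise. This is an illustration under assumptions, not the merged code. In particular, the ARM64 branch uses `AdvSimd.Arm64.UnzipEven` as one plausible choice, and the PR may use a different AdvSimd intrinsic. The class name is hypothetical.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

public static class ExtractAsciiSketch
{
    // Assumes both inputs were already verified to contain only ASCII (<= 0x7F),
    // so saturating and truncating narrowing give the same result.
    public static Vector128<byte> ExtractAsciiVector(Vector128<ushort> first, Vector128<ushort> second)
    {
        if (Sse2.IsSupported)
        {
            // packuswb: no extra AND needed for ASCII input.
            return Sse2.PackUnsignedSaturate(first.AsInt16(), second.AsInt16());
        }
        else if (AdvSimd.Arm64.IsSupported)
        {
            // Picks the even-indexed bytes, i.e. the low byte of each
            // little-endian 16-bit element (assumed choice, see lead-in).
            return AdvSimd.Arm64.UnzipEven(first.AsByte(), second.AsByte());
        }
        else
        {
            // Cross-platform fallback for everything else.
            return Vector128.Narrow(first, second);
        }
    }

    public static void Main()
    {
        Vector128<ushort> a = Vector128.Create((ushort)'x');
        Vector128<ushort> b = Vector128.Create((ushort)'y');

        // For ASCII input every branch should agree with the portable helper.
        Console.WriteLine(ExtractAsciiVector(a, b) == Vector128.Narrow(a, b));
    }
}
```

The `IsSupported` properties are treated as constants by the JIT, so only one branch is expected to survive in the generated code.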

kunalspathak · 2022-08-16T16:50:48Z

Improvements - dotnet/perf-autofiling-issues#7322

adamsitnik added 2 commits July 29, 2022 15:15

port VectorContainsNonAsciiChar

f1f5ab2

port NarrowUtf16ToAscii*

f9810d0

adamsitnik added the area-System.Text.Encoding label Jul 29, 2022

ghost assigned adamsitnik Jul 29, 2022

EgorBo reviewed Jul 29, 2022

adamsitnik mentioned this pull request Aug 8, 2022

Add glob filters support to disassembler to allow disassembling specific methods dotnet/BenchmarkDotNet#2072

Merged

prefer architecture specific intrinsic as they offer better perf

bca0dcf

adamsitnik marked this pull request as ready for review August 10, 2022 12:54

adamsitnik requested a review from tannergooding August 10, 2022 12:55

tannergooding approved these changes Aug 10, 2022

adamsitnik merged commit aae6e9b into dotnet:main Aug 10, 2022

adamsitnik mentioned this pull request Aug 17, 2022

Switch from direct intrinsics usage to Vector/Vector64/Vector128/Vector256 #64451

Open

75 tasks

ghost locked as resolved and limited conversation to collaborators Sep 15, 2022
