@radekdoulik

Add initial SIMD support for wasm. This is a subset of the original draft PR without the public API additions. I left the underlying parts of the new WasmBase class implementation here as well, so as not to lose them.

Add the WasmSIMD property to enable SIMD in AOT builds. With the property enabled, apps built with AOT get SIMD intrinsics inlined for parts of the S.R.I.Vector128 and S.R.I.Vector128<T> APIs.

Add a test to build and run a simple app with SIMD enabled.

Example of the produced code:

> wa-info -d -f Vector.*Multiply.*RunStep src/mono/sample/wasm/browser-bench/bin/Debug/AppBundle/dotnet.wasm
(func Wasm_Browser_Bench_Sample_Sample_VectorTask_Multiply_RunStep(param $0 i32, $1 i32))
 local.get $0
 i32.eqz
 if
  call mini_llvmonly_throw_nullref_exception
  unreachable

 local.get $0
 local.get $0
 v128.load offset:24    [SIMD]
 local.get $0
 v128.load offset:8    [SIMD]
 i32x4.mul    [SIMD]
 v128.store offset:40    [SIMD]

The C# code:

            ...
            Vector128<int> vector1, vector2, vector3;

            public override void RunStep() => vector3 = vector1 * vector2;
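
A minimal, self-contained sketch of the operation behind the snippet above. The class name `MultiplyTask`, the input values, and the `Result` property are hypothetical and only there for illustration; the three `Vector128<int>` fields and the `RunStep` body follow the snippet. It assumes .NET 7 or later, where `Vector128<T>` defines the `*` operator and `Vector128.IsHardwareAccelerated` is available.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical stand-in for the benchmark task; not the actual sample class.
class MultiplyTask
{
    // Illustrative inputs (assumed values, not taken from the benchmark).
    Vector128<int> vector1 = Vector128.Create(1, 2, 3, 4);
    Vector128<int> vector2 = Vector128.Create(10, 20, 30, 40);
    Vector128<int> vector3;

    // Element-wise multiply of four 32-bit lanes.
    public void RunStep() => vector3 = vector1 * vector2;

    public Vector128<int> Result => vector3;
}

static class Program
{
    static void Main()
    {
        var task = new MultiplyTask();
        task.RunStep();

        // Reports whether the runtime accelerates Vector128 operations.
        Console.WriteLine($"Accelerated: {Vector128.IsHardwareAccelerated}");

        // Print each lane of the product.
        for (int i = 0; i < Vector128<int>.Count; i++)
            Console.WriteLine(task.Result.GetElement(i));
    }
}
```

The result is the same with or without SIMD; only the generated code differs, as the listings in this description show.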

And a comparison with the non-SIMD version:

> wa-diff -d -f Vector.*Multiply.*RunStep src/mono/sample/wasm/browser-bench/bin-nosimd/Debug/AppBundle/dotnet.wasm src/mono/sample/wasm/browser-bench/bin/Debug/AppBundle/dotnet.wasm
(func Wasm_Browser_Bench_Sample_Sample_VectorTask_Multiply_RunStep(param i32, i32))
-  local $2 i32
+  local.get $0
-  local $3 i32
-  local $4 i64
-  local $5 i64
-  local $6 i64
-  local $7 i64
-  local $8 i32
-  global.get $__stack_pointer
-  i32.const 80
-  i32.sub
-  local.tee $2
-  global.set $__stack_pointer
-  i32.const 3121246
-  i32.load8.u
   i32.eqz
   if
-   i32.const 2056316
-   call mono_aot_Wasm_Browser_Bench_Sample_init_method
+   call mini_llvmonly_throw_nullref_exception
+   unreachable
-   i32.const 3121246
-   i32.const 1
-   i32.store8

   local.get $0
-  if
-   local.get $2
   local.get $0
-   i64.load offset:16
+  v128.load offset:24    [SIMD]
-   local.tee $4
-   i64.store offset:40 align:3
-   local.get $2
   local.get $0
-   i64.load offset:8
-   local.tee $5
+  v128.load offset:8    [SIMD]
+  i32x4.mul    [SIMD]
-   i64.store offset:32 align:3
-   local.get $2
-   local.get $0
-   i64.load offset:32
-   local.tee $6
-   i64.store offset:56 align:3
-   local.get $2
-   local.get $0
-   i64.load offset:24
-   local.tee $7
-   i64.store offset:48 align:3
-   i32.const 3114552
-   i32.load align:2
-   local.tee $3
-   i32.load offset:4 align:2
-   local.set $8
-   local.get $3
-   i32.load align:2
-   local.set $3
-   local.get $2
-   local.get $4
-   i64.store offset:24 align:3
-   local.get $2
-   local.get $6
-   i64.store offset:8 align:3
-   local.get $2
-   local.get $5
-   i64.store offset:16 align:3
-   local.get $2
-   local.get $7
-   i64.store align:3
-   local.get $2
-   i32.const -1
-   i32.sub
-   local.get $2
-   i32.const 16
-   i32.add
-   local.get $2
-   local.get $8
-   local.get $3
-   call.indirect (func (param i32, i32, i32, i32))
-   local.get $0
-   local.get $2
-   v128.load offset:64 align:3    [SIMD]
   v128.store offset:40    [SIMD]
-   local.get $2
-   i32.const 80
-   i32.add
-   global.set $__stack_pointer
-   return
-
-  call mini_llvmonly_throw_nullref_exception
-  unreachable

Measurements of the bench-sample (aot and aot + SIMD are relevant here):

browser-bench/Release configuration

.NET7 May 19th

*1 .NET7 May 13th + emscripten 3.1.9 + SIMD

Chrome Version 101.0.4951.67 (Official Build) (64-bit)

| measurement | aot | aot + EH | aot + SIMD *1 | aot + EH + SIMD *1 | interp | interp + EH |
|---|---|---|---|---|---|---|
| AppStart, Page show | 26.1179ms | 29.3718ms | 31.4959ms | 25.8079ms | 34.3243ms | 24.7880ms |
| AppStart, Reach managed | 213.6154ms | 199.3214ms | 204.4444ms | 198.1786ms | 201.6667ms | 196.5714ms |
| Exceptions, NoExceptionHandling | 0.0537us | 0.0555us | 0.0547us | 0.0532us | 0.1068us | 0.0938us |
| Exceptions, TryCatch | 0.0767us | 0.0755us | 0.0916us | 0.0781us | 0.1042us | 0.0950us |
| Exceptions, TryCatchThrow | 0.0080ms | 0.0079ms | 0.0078ms | 0.0079ms | 0.0019ms | 0.0019ms |
| Exceptions, TryCatchFilter | 0.0776us | 0.0751us | 0.0820us | 0.0752us | 0.1039us | 0.0983us |
| Exceptions, TryCatchFilterInline | 0.0531us | 0.0521us | 0.0563us | 0.0532us | 0.0889us | 0.0864us |
| Exceptions, TryCatchFilterThrow | 0.0128ms | 0.0122ms | 0.0120ms | 0.0123ms | 0.0026ms | 0.0026ms |
| Exceptions, TryCatchFilterThrowApplies | 0.0100ms | 0.0096ms | 0.0097ms | 0.0098ms | 0.0019ms | 0.0018ms |
| Json, non-ASCII text serialize | 0.3699ms | 0.3529ms | 0.3553ms | 0.3681ms | 8.1483ms | 7.9713ms |
| Json, non-ASCII text deserialize | 1.5710ms | 1.5414ms | 1.5283ms | 1.5166ms | 12.3198ms | 12.3657ms |
| Json, small serialize | 0.0371ms | 0.0370ms | 0.0359ms | 0.0360ms | 0.2512ms | 0.2510ms |
| Json, small deserialize | 0.0557ms | 0.0546ms | 0.0537ms | 0.0533ms | 0.4237ms | 0.3976ms |
| Json, large serialize | 10.1708ms | 9.9264ms | 9.6885ms | 9.5669ms | 75.5970ms | 72.3194ms |
| Json, large deserialize | 15.5446ms | 15.0997ms | 14.9708ms | 14.7407ms | 112.7872ms | 107.1837ms |
| Vector, Create Vector128 | 0.0602us | 0.0581us | 0.0450us | 0.0503us | 0.1793us | 0.1610us |
| Vector, Add 2 Vector128's | 0.5338us | 0.5359us | 0.0450us | 0.0481us | 0.2434us | 0.2325us |
| Vector, Multiply 2 Vector128's | 0.5331us | 0.5460us | 0.0451us | 0.0481us | 0.2421us | 0.2303us |
| WebSocket, PartialSend 1B | 0.4781us | 0.4582us | 0.4422us | 0.4462us | 0.0017ms | 0.0017ms |
| WebSocket, PartialSend 64KB | 0.0628ms | 0.0703ms | 0.0653ms | 0.0617ms | 0.0685ms | 0.0640ms |
| WebSocket, PartialSend 1MB | 0.9000ms | 0.9545ms | 0.9727ms | 0.9364ms | 0.9727ms | 0.9455ms |
| WebSocket, PartialReceive 1B | 0.8237us | 0.8083us | 0.7852us | 0.7852us | 0.0023ms | 0.0023ms |
| WebSocket, PartialReceive 10KB | 0.0020ms | 0.0020ms | 0.0040ms | 0.0020ms | 0.0040ms | 0.0040ms |
| WebSocket, PartialReceive 100KB | 0.0000us | 0.0000us | 0.0000us | 0.0000us | 0.0000us | 0.0000us |
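
To make the Vector rows easier to read, here is a small sketch that computes the aot to aot + SIMD ratio from the figures in the measurements above. The `SimdSpeedup` class and `Speedup` helper are hypothetical names introduced for illustration. Note that the SIMD columns come from a different build (*1), so the ratios are indicative rather than a strict like-for-like comparison.

```csharp
using System;

static class SimdSpeedup
{
    // Speedup factor: baseline time divided by SIMD time (same unit for both).
    public static double Speedup(double aotMicros, double aotSimdMicros)
        => aotMicros / aotSimdMicros;

    static void Main()
    {
        // (measurement, aot, aot + SIMD) in microseconds, copied from the Vector rows.
        var rows = new (string Name, double Aot, double AotSimd)[]
        {
            ("Create Vector128", 0.0602, 0.0450),
            ("Add 2 Vector128's", 0.5338, 0.0450),
            ("Multiply 2 Vector128's", 0.5331, 0.0451),
        };

        foreach (var (name, aot, simd) in rows)
            Console.WriteLine($"{name}: {Speedup(aot, simd):F1}x");
    }
}
```

With these inputs, add and multiply come out at roughly 12x faster with SIMD, while creating a vector improves by about 1.3x.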

The LLVM code generator works nicely with them, so that the C# code

    WasmBase.Constant(0xff11ff22ff33ff44, 0xff55ff66ff77ff88)

is compiled into wasm code

    v128.const 0xff11ff22ff33ff44ff55ff66ff77ff88    [SIMD]
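
`WasmBase.Constant` is part of this PR's internal implementation and is not public API, so it cannot be called from application code. As a rough analog, the sketch below builds the same 128-bit value from two 64-bit halves with the public `Vector128.Create(ulong, ulong)`. The `ConstantSketch` class and `Make` method are hypothetical names, and whether the AOT compiler emits `v128.const` for this call is an assumption that this example does not verify.

```csharp
using System;
using System.Runtime.Intrinsics;

static class ConstantSketch
{
    // Builds a 128-bit vector from two 64-bit elements (element 0, then element 1).
    public static Vector128<ulong> Make()
        => Vector128.Create(0xff11ff22ff33ff44UL, 0xff55ff66ff77ff88UL);

    static void Main()
    {
        Vector128<ulong> v = Make();

        // Print both 64-bit elements as zero-padded hex.
        Console.WriteLine($"0x{v.GetElement(0):x16} 0x{v.GetElement(1):x16}");
    }
}
```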

This will need more work, as it crashes clang during the 'WebAssembly Instruction Selection' pass:

    WasmApp.Native.targets(353,5): error : 3.    Running pass 'WebAssembly Instruction Selection' on function '@corlib_System_Runtime_Intrinsics_Wasm_WasmBase_Shuffle_System_Runtime_Intrinsics_Vector128_1_byte_System_Runtime_Intrinsics_Vector128_1_byte_System_Runtime_Intrinsics_Vector128_1_byte'

Also add "experimental" to the property comment

@ghost ghost assigned radekdoulik Jun 1, 2022
@fanyang-mono

What does EH stand for?

@SamMonoRT

> What does EH stand for?

Exception Handling

@ghost ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jun 1, 2022
@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Jun 2, 2022
@radekdoulik radekdoulik requested a review from fanyang-mono June 3, 2022 13:10
@radical radical added the arch-wasm WebAssembly architecture label Jun 3, 2022
@ghost

ghost commented Jun 3, 2022

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: radekdoulik
Assignees: radekdoulik
Labels: arch-wasm, area-Build-mono
Milestone: -

@radical radical left a comment

The build, and test changes look good.

@radical

radical commented Jun 3, 2022

Are there benchmarks in dotnet/performance that need to be enabled now, or do new ones need to be added?

@fanyang-mono

> Are there benchmarks in dotnet/performance that need to be enabled now, or do new ones need to be added?

These should see improvements: https://github.com/dotnet/performance/blob/main/src/benchmarks/micro/libraries/System.Numerics.Vectors/Perf_VectorOfT.cs

But they should be running already with bigger sets of instructions.

@radekdoulik radekdoulik merged commit 1f2eaaa into dotnet:main Jun 6, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jul 6, 2022
6 participants