Add intrinsic for IndexOfAny on Arm64 #74010
Tagging subscribers to this area: @dotnet/area-system-memory
Benchmarking numbers for x64 and Arm64: x64 (Xeon Gold 6152), Arm64 (Altra)
Avoiding goto showed a performance improvement on x64. @adamsitnik, feel free to confirm this on your end. If it is useful, we can apply it in other places.
hm.. I'd expect inputVector != Vector128<ushort>.Zero to be lowered to MaxPairwise too. Does it have a different codegen?
Namely with #65632
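For reference, the two forms being compared in this thread can be sketched as follows. This is an illustration, not the code from the PR: `NonZeroCheckSketch` and `AnyNonZero` are hypothetical names, and the explicit branch assumes the `AdvSimd.Arm64.MaxPairwise` intrinsic from `System.Runtime.Intrinsics.Arm`. Whether the JIT emits the same instructions for both forms depends on the runtime version.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class NonZeroCheckSketch
{
    // Hypothetical helper: returns true if any of the 8 ushort lanes is non-zero.
    public static bool AnyNonZero(Vector128<ushort> v)
    {
        if (AdvSimd.Arm64.IsSupported)
        {
            // Explicit form: the pairwise max of (v, v) folds the 8 lanes into
            // the lower 64 bits, which are then read as a single scalar.
            Vector128<ushort> folded = AdvSimd.Arm64.MaxPairwise(v, v);
            return folded.AsUInt64().ToScalar() != 0;
        }

        // Portable form: rely on the JIT to pick a lowering for the inequality.
        return v != Vector128<ushort>.Zero;
    }
}
```

On hardware without Arm64 AdvSimd support, only the portable branch runs, so the sketch behaves the same on either architecture.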
Interesting! Apparently, the JIT is somehow failing to optimise this one. It's still emitting umaxv.
...
cmeq v19.8h, v16.8h, v18.8h
cmeq v18.8h, v17.8h, v18.8h
orr v18.8h, v19.8h, v18.8h
umaxv s19, v18.4s
umov w0, v19.s[0]
cmp w0, #0
...
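As a rough guide to what the snippet above computes, the compare/compare/or/test sequence corresponds to source along the following lines. This is a minimal sketch using the cross-platform `Vector128` APIs, not the implementation from the PR; `IndexOfAnySketch` and `IndexOfAnyInBlock` are hypothetical names, and the method only inspects the first eight elements of its input.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

static class IndexOfAnySketch
{
    // Hypothetical helper: index of the first of the first 8 elements of `block`
    // that equals value0 or value1, or -1 if none of them match.
    public static int IndexOfAnyInBlock(ReadOnlySpan<ushort> block, ushort value0, ushort value1)
    {
        if (block.Length < Vector128<ushort>.Count)
            throw new ArgumentException("Need at least 8 elements.", nameof(block));

        Vector128<ushort> input = Vector128.Create(block);   // loads the first 8 elements
        Vector128<ushort> v0 = Vector128.Create(value0);     // broadcast, like dup
        Vector128<ushort> v1 = Vector128.Create(value1);

        // Lanes that match either value become 0xFFFF (cmeq, cmeq, orr).
        Vector128<ushort> matches = Vector128.Equals(v0, input) | Vector128.Equals(v1, input);

        // This is the "any lane set?" test discussed in the thread.
        if (matches == Vector128<ushort>.Zero)
            return -1;

        // One mask bit per ushort lane, so the lowest set bit is the element index.
        uint mask = matches.ExtractMostSignificantBits();
        return BitOperations.TrailingZeroCount(mask);
    }
}
```

The sketch extracts one bit per `ushort` lane, so no division is needed at the end. The listing in this thread instead extracts one bit per byte and halves the resulting index.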
Full Arm64 Assembly
; Assembly listing for method System.SpanHelpers:IndexOfAny(byref,ushort,ushort,int):int
; Emitting BLENDED_CODE for generic ARM64 CPU - Unix
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 4 single block inlinees; 3 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T05] ( 5, 8.50) byref -> x19 single-def
; V01 arg1 [V01,T09] ( 5, 5 ) ushort -> x22 single-def
; V02 arg2 [V02,T06] ( 5, 8.50) ushort -> x20 single-def
; V03 arg3 [V03,T08] ( 6, 5.50) int -> x21 single-def
; V04 loc0 [V04,T00] ( 18, 38.50) long -> x23
; V05 loc1 [V05,T02] ( 13, 27.50) long -> x24
; V06 loc2 [V06,T01] ( 15, 36 ) int -> x25
; V07 loc3 [V07,T13] ( 3, 2.50) long -> x0
;* V08 loc4 [V08 ] ( 0, 0 ) long -> zero-ref
; V09 loc5 [V09,T07] ( 5, 10 ) byref -> x0
; V10 loc6 [V10,T14] ( 3, 1.50) int -> x0
; V11 loc7 [V11,T11] ( 3, 5 ) byref -> x19
;* V12 loc8 [V12 ] ( 0, 0 ) ref -> zero-ref class-hnd
;* V13 loc9 [V13 ] ( 0, 0 ) ref -> zero-ref class-hnd
;* V14 loc10 [V14 ] ( 0, 0 ) ref -> zero-ref class-hnd
; V15 loc11 [V15,T15] ( 6, 13.50) simd16 -> d18 HFA(simd16)
; V16 loc12 [V16,T16] ( 6, 10 ) simd16 -> d18 HFA(simd16)
; V17 loc13 [V17,T17] ( 3, 5 ) simd16 -> d16 HFA(simd16)
; V18 loc14 [V18,T18] ( 3, 5 ) simd16 -> d17 HFA(simd16)
;* V19 loc15 [V19 ] ( 0, 0 ) simd16 -> zero-ref HFA(simd16) ld-addr-op
;* V20 loc16 [V20 ] ( 0, 0 ) simd16 -> zero-ref HFA(simd16) ld-addr-op
;* V21 loc17 [V21 ] ( 0, 0 ) simd16 -> zero-ref HFA(simd16)
;* V22 loc18 [V22 ] ( 0, 0 ) simd16 -> zero-ref HFA(simd16) ld-addr-op
;# V23 OutArgs [V23 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace"
; V24 tmp1 [V24,T19] ( 3, 3 ) simd16 -> d16 HFA(simd16) "Clone op1 for vector extractmostsignificantbits"
; V25 tmp2 [V25,T20] ( 3, 3 ) simd16 -> d16 HFA(simd16) "Clone op1 for vector extractmostsignificantbits"
;* V26 tmp3 [V26 ] ( 0, 0 ) bool -> zero-ref "Inlining Arg"
;* V27 tmp4 [V27 ] ( 0, 0 ) bool -> zero-ref "Inlining Arg"
;* V28 tmp5 [V28 ] ( 0, 0 ) bool -> zero-ref "Inlining Arg"
;* V29 tmp6 [V29 ] ( 0, 0 ) bool -> zero-ref "Inlining Arg"
;* V30 tmp7 [V30 ] ( 0, 0 ) int -> zero-ref "Inline return value spill temp"
; V31 tmp8 [V31,T10] ( 5, 5 ) int -> x23 "Single return block return value"
; V32 cse0 [V32,T12] ( 6, 3 ) ref -> x1 "CSE - moderate"
; V33 cse1 [V33,T04] ( 9, 15.50) int -> x26 "CSE - aggressive"
; V34 cse2 [V34,T03] ( 9, 19 ) int -> x27 "CSE - aggressive"
;
; Lcl frame size = 8
G_M34414_IG01: ;; offset=0000H
A9BA7BFD stp fp, lr, [sp,#-96]!
A901D3F3 stp x19, x20, [sp,#24]
A902DBF5 stp x21, x22, [sp,#40]
A903E3F7 stp x23, x24, [sp,#56]
A904EBF9 stp x25, x26, [sp,#72]
F9002FFB str x27, [sp,#88]
910003FD mov fp, sp
AA0003F3 mov x19, x0
2A0103F6 mov w22, w1
2A0203F4 mov w20, w2
2A0303F5 mov w21, w3
;; size=44 bbWeight=1 PerfScore 8.50
G_M34414_IG02: ;; offset=002CH
710002BF cmp w21, #0
5400016A bge G_M34414_IG04
;; size=8 bbWeight=1 PerfScore 1.50
G_M34414_IG03: ;; offset=0034H
D2840500 movz x0, #0x2028
F2BB6800 movk x0, #0xdb40 LSL #16
F2DFF020 movk x0, #0xff81 LSL #32
F9400001 ldr x1, [x0]
AA0103E0 mov x0, x1
D28E3F02 movz x2, #0x71f8 // code for System.Diagnostics.Debug:Fail
F2AAAA42 movk x2, #0x5552 LSL #16
F2DFFFE2 movk x2, #0xffff LSL #32
F9400042 ldr x2, [x2]
D63F0040 blr x2
;; size=40 bbWeight=0.50 PerfScore 5.25
G_M34414_IG04: ;; offset=005CH
AA1F03F7 mov x23, xzr
2A1503F8 mov w24, w21
93407EA0 sxtw x0, w21
D1002000 sub x0, x0, #8
F100001F cmp x0, #0
540003AB blt G_M34414_IG07
;; size=24 bbWeight=1 PerfScore 3.50
G_M34414_IG05: ;; offset=0074H
AA0003F8 mov x24, x0
14000039 b G_M34414_IG16
align [0 bytes for IG09]
align [0 bytes]
align [0 bytes]
align [0 bytes]
;; size=8 bbWeight=0.50 PerfScore 0.75
G_M34414_IG06: ;; offset=007CH
D37FFAE0 lsl x0, x23, #1
8B000260 add x0, x19, x0
79400019 ldrh w25, [x0]
53003EDA uxth w26, w22
6B19035F cmp w26, w25
54000640 beq G_M34414_IG15
53003E9B uxth w27, w20
6B19037F cmp w27, w25
540005E0 beq G_M34414_IG15
79400419 ldrh w25, [x0,#2]
6B19035F cmp w26, w25
54000540 beq G_M34414_IG14
6B19037F cmp w27, w25
54000500 beq G_M34414_IG14
79400819 ldrh w25, [x0,#4]
6B19035F cmp w26, w25
54000460 beq G_M34414_IG13
6B19037F cmp w27, w25
54000420 beq G_M34414_IG13
79400C19 ldrh w25, [x0,#6]
6B19035F cmp w26, w25
54000360 beq G_M34414_IG12
6B19037F cmp w27, w25
54000320 beq G_M34414_IG12
910012F7 add x23, x23, #4
D1001318 sub x24, x24, #4
;; size=104 bbWeight=2 PerfScore 55.00
G_M34414_IG07: ;; offset=00E4H
F100131F cmp x24, #4
54FFFCA2 bhs G_M34414_IG06
;; size=8 bbWeight=4 PerfScore 6.00
G_M34414_IG08: ;; offset=00ECH
B4000198 cbz x24, G_M34414_IG10
53003EDA uxth w26, w22
;; size=8 bbWeight=0.50 PerfScore 0.75
G_M34414_IG09: ;; offset=00F4H
D37FFAE0 lsl x0, x23, #1
78606A79 ldrh w25, [x19, x0]
6B19035F cmp w26, w25
540002C0 beq G_M34414_IG15
53003E9B uxth w27, w20
6B19037F cmp w27, w25
54000260 beq G_M34414_IG15
910006F7 add x23, x23, #1
D1000718 sub x24, x24, #1
B5FFFEF8 cbnz x24, G_M34414_IG09
;; size=40 bbWeight=4 PerfScore 38.00
G_M34414_IG10: ;; offset=011CH
12800000 movn w0, #0
;; size=4 bbWeight=0.50 PerfScore 0.25
G_M34414_IG11: ;; offset=0120H
F9402FFB ldr x27, [sp,#88]
A944EBF9 ldp x25, x26, [sp,#72]
A943E3F7 ldp x23, x24, [sp,#56]
A942DBF5 ldp x21, x22, [sp,#40]
A941D3F3 ldp x19, x20, [sp,#24]
A8C67BFD ldp fp, lr, [sp],#96
D65F03C0 ret lr
;; size=28 bbWeight=0.50 PerfScore 4.00
G_M34414_IG12: ;; offset=013CH
11000EF7 add w23, w23, #3
1400004D b G_M34414_IG22
D503201F align [4 bytes for IG18]
align [0 bytes]
align [0 bytes]
align [0 bytes]
;; size=12 bbWeight=0.50 PerfScore 0.75
G_M34414_IG13: ;; offset=0148H
11000AF7 add w23, w23, #2
1400004A b G_M34414_IG22
;; size=8 bbWeight=0.50 PerfScore 0.75
G_M34414_IG14: ;; offset=0150H
110006F7 add w23, w23, #1
14000048 b G_M34414_IG22
;; size=8 bbWeight=0.50 PerfScore 0.75
G_M34414_IG15: ;; offset=0158H
14000047 b G_M34414_IG22
;; size=4 bbWeight=0.50 PerfScore 0.50
G_M34414_IG16: ;; offset=015CH
710022BF cmp w21, #8
5400016A bge G_M34414_IG17
D2840500 movz x0, #0x2028
F2BB6800 movk x0, #0xdb40 LSL #16
F2DFF020 movk x0, #0xff81 LSL #32
F9400001 ldr x1, [x0]
AA0103E0 mov x0, x1
D28E3F02 movz x2, #0x71f8 // code for System.Diagnostics.Debug:Fail
F2AAAA42 movk x2, #0x5552 LSL #16
F2DFFFE2 movk x2, #0xffff LSL #32
F9400042 ldr x2, [x2]
D63F0040 blr x2
;; size=48 bbWeight=0.50 PerfScore 6.00
G_M34414_IG17: ;; offset=018CH
53003EDA uxth w26, w22
4E020F50 dup v16.8h, w26
53003E9B uxth w27, w20
4E020F71 dup v17.8h, w27
B40001B8 cbz x24, G_M34414_IG19
;; size=20 bbWeight=0.50 PerfScore 3.00
G_M34414_IG18: ;; offset=01A0H
D37FFAE0 lsl x0, x23, #1
3CE06A72 ldr q18, [x19, x0]
6E728E13 cmeq v19.8h, v16.8h, v18.8h
6E728E32 cmeq v18.8h, v17.8h, v18.8h
4EB21E72 orr v18.8h, v19.8h, v18.8h
6EB0AA53 umaxv s19, v18.4s
0E043E60 umov w0, v19.s[0]
7100001F cmp w0, #0
54000361 bne G_M34414_IG20
910022F7 add x23, x23, #8
EB17031F cmp x24, x23
54FFFEA8 bhi G_M34414_IG18
;; size=48 bbWeight=4 PerfScore 56.00
G_M34414_IG19: ;; offset=01D0H
D37FFB00 lsl x0, x24, #1
3CE06A72 ldr q18, [x19, x0]
AA1803F7 mov x23, x24
6E728E10 cmeq v16.8h, v16.8h, v18.8h
6E728E31 cmeq v17.8h, v17.8h, v18.8h
4EB11E12 orr v18.8h, v16.8h, v17.8h
6EB0AA50 umaxv s16, v18.4s
0E043E00 umov w0, v16.s[0]
7100001F cmp w0, #0
54FFF940 beq G_M34414_IG10
9C000550 ldr q16, [@RWD00]
4E301E52 and v18.16b, v18.16b, v16.16b
9C000590 ldr q16, [@RWD16]
6E304650 ushl v16.16b, v18.16b, v16.16b
4F000411 movi v17.4s, #0x00
6E114211 ext v17.16b, v16.16b, v17.16b, #8
0E31BA31 addv b17, v17.8b
0E013E20 umov w0, v17.b[0]
53185C00 lsl w0, w0, #8
0E31BA10 addv b16, v16.8b
0E013E01 umov w1, v16.b[0]
2A010000 orr w0, w0, w1
1400000D b G_M34414_IG21
;; size=92 bbWeight=0.50 PerfScore 14.00
G_M34414_IG20: ;; offset=022CH
9C0003B0 ldr q16, [@RWD00]
4E301E50 and v16.16b, v18.16b, v16.16b
9C0003F1 ldr q17, [@RWD16]
6E314610 ushl v16.16b, v16.16b, v17.16b
4F000411 movi v17.4s, #0x00
6E114211 ext v17.16b, v16.16b, v17.16b, #8
0E31BA31 addv b17, v17.8b
0E013E20 umov w0, v17.b[0]
53185C00 lsl w0, w0, #8
0E31BA10 addv b16, v16.8b
0E013E01 umov w1, v16.b[0]
2A010000 orr w0, w0, w1
;; size=48 bbWeight=0.50 PerfScore 7.25
G_M34414_IG21: ;; offset=025CH
5AC00000 rbit w0, w0
5AC01000 clz w0, w0
2A0003E0 mov w0, w0
D341FC00 lsr x0, x0, #1
8B0002F7 add x23, x23, x0
17FFFFBA b G_M34414_IG15
;; size=24 bbWeight=0.50 PerfScore 2.25
G_M34414_IG22: ;; offset=0274H
2A1703E0 mov w0, w23
;; size=4 bbWeight=0.50 PerfScore 0.25
G_M34414_IG23: ;; offset=0278H
F9402FFB ldr x27, [sp,#88]
A944EBF9 ldp x25, x26, [sp,#72]
A943E3F7 ldp x23, x24, [sp,#56]
A942DBF5 ldp x21, x22, [sp,#40]
A941D3F3 ldp x19, x20, [sp,#24]
A8C67BFD ldp fp, lr, [sp],#96
D65F03C0 ret lr
;; size=28 bbWeight=0.50 PerfScore 4.00
RWD00 dq 8080808080808080h, 8080808080808080h
RWD16 dq 00FFFEFDFCFBFAF9h, 00FFFEFDFCFBFAF9h
; Total bytes of code 660, prolog size 44, PerfScore 285.00, instruction count 172, allocated bytes for code 660 (MethodHash=41e07991) for method System.SpanHelpers:IndexOfAny(byref,ushort,ushort,int):int
; ============================================================
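The tail of the listing (IG20 and IG21) turns the comparison result into an element index. It keeps the top bit of each byte (the `and` with RWD00), shifts byte i of each 8-byte half right by 7 - i using `ushl` with the negative counts stored in RWD16, sums each half with `addv`, and joins the two sums into a 16-bit mask. It then counts trailing zeros with `rbit` and `clz`, and halves the result with `lsr` to convert a byte position into a `ushort` position. A scalar model of those steps, written as a sketch with hypothetical names (`MoveMaskModel`, `FirstMatchIndex`) and not as runtime code, might look like this:

```csharp
using System;
using System.Numerics;

static class MoveMaskModel
{
    // Hypothetical scalar model. `compareBytes` holds the 16 bytes of the
    // comparison result, each byte either 0x00 or 0xFF.
    public static int FirstMatchIndex(ReadOnlySpan<byte> compareBytes)
    {
        if (compareBytes.Length != 16)
            throw new ArgumentException("Expected exactly 16 bytes.", nameof(compareBytes));

        uint mask = 0;
        for (int half = 0; half < 2; half++)
        {
            uint sum = 0;
            for (int i = 0; i < 8; i++)
            {
                int topBit = compareBytes[half * 8 + i] & 0x80; // and with RWD00
                sum += (uint)(topBit >> (7 - i));               // ushl by -(7 - i), then addv
            }
            mask |= sum << (8 * half);                          // lsl #8 and orr for the upper half
        }

        if (mask == 0)
            return -1; // the listing branches away before reaching this point

        // rbit + clz is a trailing-zero count; lsr #1 maps a byte index to a ushort index.
        return BitOperations.TrailingZeroCount(mask) >> 1;
    }
}
```

In the listing, the resulting index is then added to the running offset held in x23 before the method returns.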
@SwapnilGaikwad I am afraid I've caused a huge merge conflict with #73768. Could you please apply this optimization to the new generic
Sure. I'll also take a look at the x64 failures. Interestingly, when I downloaded a bundle of failing tests locally using the
The ones in your PR, like this one? They were caused by our internal service outage (tests were passing but were being reported as failures). As soon as you sync your branch and re-run the CI, they should be gone.
@stephentoub would this materially improve regex scenarios, do you think?
Depends on the expression and how much they're dominated by such a call. For example, if the expression were
bf93551 to 5898e80
@SwapnilGaikwad could you please provide updated benchmark results?
Here are the numbers from an Altra. Byte: Char: Int32:
@adamsitnik what can we do to push this forward? 🙂
Add small performance improvement to IndexOfAny Char intrinsics