Arm SVE ZGEMM by xrq-phys · Pull Request #2 · xrq-phys/blis

xrq-phys · 2021-09-21T07:06:17Z

Self-PR.


Recommended kernels set:

    ...
    BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
    BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
    BLIS_RCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
    BLIS_CRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
    BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
    BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
    ...
    bli_blksz_init     ( &blkszs[ BLIS_MR ], -1, 6, -1, -1, -1, 8, -1, -1 );
    bli_blksz_init_easy( &blkszs[ BLIS_NR ], -1, 8, -1, -1 );
    ...


Details:
- Defined a new packm variant for the 'gemmlike' sandbox. This new variant (bls_l3_packm_var3.c) parallelizes the packing operation over the k dimension rather than the m or n dimensions. Note that the gemmlike implementation still uses var1 by default, and use of the new code would require changing bls_l3_packm_a.c and/or bls_l3_packm_b.c so that var3 is called instead. Thanks to Jeff Diamond for proposing this (perhaps NUMA-friendly) solution.
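The idea of splitting a packing operation over the k dimension can be sketched as below. This is only an illustration under simplifying assumptions, not the code in bls_l3_packm_var3.c: the names `range_t`, `k_range` and `pack_k_slice` are hypothetical, the source matrix is assumed column-major, and the destination is a plain contiguous m x k buffer rather than a BLIS micropanel layout.

```c
#include <assert.h>

/* Half-open index range [start, end) along the k dimension. */
typedef struct { long start; long end; } range_t;

/* Split [0, k) into nt near-equal contiguous chunks and return the
   chunk owned by thread tid. The first (k % nt) threads get one extra
   column each. */
range_t k_range(long k, int nt, int tid)
{
    assert(nt > 0 && tid >= 0 && tid < nt);
    long base  = k / nt;
    long rem   = k % nt;
    long start = tid * base + (tid < rem ? tid : rem);
    long len   = base + (tid < rem ? 1 : 0);
    range_t r  = { start, start + len };
    return r;
}

/* Copy the columns in r of a column-major m x k matrix a (leading
   dimension lda) into the contiguous buffer p. Each thread calls this
   with its own range, so no two threads write the same element. */
void pack_k_slice(const double *a, long lda, long m, double *p, range_t r)
{
    for (long j = r.start; j < r.end; ++j)
        for (long i = 0; i < m; ++i)
            p[j * m + i] = a[j * lda + i];
}
```

Each thread would call `pack_k_slice(a, lda, m, p, k_range(k, nt, tid))`; because the ranges are disjoint and together cover [0, k), the threads need no synchronization until the packed buffer is consumed.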

Details:
- Removed the commented-out #define BLIS_NUM_ARCHS in bli_type_defs.h and its associated (now outdated) comments. BLIS_NUM_ARCHS has been part of the arch_t enum for some time now, and so this change is mostly about removing any opportunity for confusion for people who may be reading the code. Thanks to Minh Quan Ho for leading me to this cleanup.

- There was redundancy between the macro BLIS_MAX_NUM_ERR_MSGS (=200) and the enum BLIS_ERROR_CODE_MAX (-170), although both mean the same thing: the maximal number of error codes/messages.
- The previous compile-time initialization of error messages overlooked that the 'bli_error_string' array still occupies memory needlessly because of its 2D char[][] declaration. Instead, it should be just an array of pointers to strings in the .rodata section.
- This commit makes two modifications:
  * retire the macros BLIS_MAX_NUM_ERR_MSGS and BLIS_MAX_ERR_MSG_LENGTH everywhere
  * switch bli_error_string from char[][] to char *[] to reduce its footprint from 40KB (200*200) to 1.3KB (170*sizeof(char*)).
  (Using the enum BLIS_ERROR_CODE_MAX at compile time is no problem, since the compiler is smart enough to determine that its value is 170.)
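The footprint difference between the two declarations can be checked with `sizeof`. The sketch below uses hypothetical names and placeholder messages (`ERR_COUNT`, `err_2d`, `err_ptrs`, `err_msg`); it is not the BLIS source, only an illustration of why a table of pointers is smaller than a 2D character array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes mirroring the numbers quoted in the commit message. */
enum { MAX_MSGS = 200, MAX_MSG_LEN = 200, ERR_COUNT = 170 };

/* Old style: every slot reserves a full fixed-length buffer,
   whether or not a message is stored there. */
static char err_2d[MAX_MSGS][MAX_MSG_LEN];

/* New style: one pointer per slot; the string literals themselves
   live in read-only data. Messages here are placeholders. */
static const char *err_ptrs[ERR_COUNT] = {
    [1] = "Invalid error code.",
    [2] = "Undefined error code.",
};

size_t footprint_2d(void)   { return sizeof err_2d;   }
size_t footprint_ptrs(void) { return sizeof err_ptrs; }

/* Look up a message by non-negative index; unset slots yield a default. */
const char *err_msg(int idx)
{
    assert(idx >= 0 && idx < ERR_COUNT);
    return err_ptrs[idx] ? err_ptrs[idx] : "Unknown error.";
}
```

On a platform with 8-byte pointers, `footprint_2d()` is 40000 bytes and `footprint_ptrs()` is 1360 bytes, which matches the 40KB and 1.3KB figures above.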


Previously, this was a global variable. Setting the value was synchronized via a mutex but reading the value was not. Of course, these accesses are almost certainly atomic, but there is still the possibility of one thread attempting to set the value and then reading the value set by another thread. For correct operation under user threading (e.g. pthreads), this should probably be thread-local with no mutex.
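A minimal sketch of the thread-local alternative, assuming C11 `_Thread_local` and pthreads. The variable and function names (`tl_value`, `set_value`, `get_value`, `run_workers`) are hypothetical, since the message does not name the variable in question.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WORKERS 16

/* Each thread gets its own copy, so no mutex is needed. */
static _Thread_local int tl_value = 0;

void set_value(int v) { tl_value = v; }
int  get_value(void)  { return tl_value; }

/* Each worker sets its own value and checks that it reads the same
   value back; with a shared global, another thread could have
   overwritten it in between. */
static void *worker(void *arg)
{
    int mine = (int)(intptr_t)arg;
    set_value(mine);
    return (void *)(intptr_t)(get_value() == mine);
}

/* Start up to n workers and return how many read back their own value. */
int run_workers(int n)
{
    pthread_t tid[MAX_WORKERS];
    int created = 0, ok = 0;

    assert(n >= 0);
    if (n > MAX_WORKERS) n = MAX_WORKERS;

    for (int i = 0; i < n; ++i) {
        if (pthread_create(&tid[i], NULL, worker, (void *)(intptr_t)(i + 1)) != 0)
            break;
        ++created;
    }
    for (int i = 0; i < created; ++i) {
        void *res = NULL;
        pthread_join(tid[i], &res);
        ok += (int)(intptr_t)res;
    }
    return ok;
}
```

A value set in the calling thread before `run_workers` is unchanged afterwards, because the workers only ever touch their own copies.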


RuQing Xu and others added 30 commits August 13, 2021 02:40

Armv8-A Add 8x4 Kernel WIP

a29c163

Test result: slightly lower GFLOPS than the 6x8 kernel.

Armv8A DGEMM 4x4 Kernel WIP. Slow

6639999

Quite slow.

Armv8-A Add Part of GEMMSUP 8x4m Kernel

df40efe

- Compiles with both GCC and Clang.
- Only the block part is implemented; edge cases are WIP.
- Not the optimal kernel scheme; should do 4x8 instead.

Armv8-A Add GEMMSUP 4x8n Kernel

a9ba79e

- Compiles with both GCC and Clang.
- Edge cases use ref-kernels.
- Can give a performance boost in some contexts.

Armv8-A Add More DGEMMSUP

8ed8f5e

- Add 6x8 GEMMSUP.
- Adjust prefetching.
- Work around Clang's inability to handle register clobbering.
- Subproduct 6x8 row-major GEMM (incomplete).

Armv8-A DGEMMSUP Adjustments

3efe707

Armv8-A Introduced s/d Packing Kernels

49b05df

Sizes according to the 2014 kernels.

Armv8-A s/d Packing Kernels Fix Typo

3c5f740

For GCC.

Armv8-A GEMMSUP-RD 6x8n

afd0fa6

Armv8-A GEMMSUP-RD 6x8m

8a32d19

Armv8-A now has a complete set of GEMMSUP kernels.

Armv8-A Adjust Types for PACKM Kernels

ce44735

GCC does not have full NEON intrinsics support.

Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm

c792d50

Suffixed NEON opcodes are not supported by the GNU assembler.

Armv8-A Supplementary GEMMSUP Sizes for RD

4e7e225

Arm64 8x4 Kernel Use Less Regs

3df0e9b

Added Apple Firestorm (A14/M1) Subconfig

e38ca28

- Use the same bulk kernel as Cortex-A53 / ThunderX2;
- Larger block size;
- Use gemmsup kernels for double precision.

Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin

7d5903d

Forgot to support `alpha`/`beta` in gemmsup_armv8a_int.

Arm: Implement GEMMSUP Fallback Method

e6799b2

bli_dgemmsup_rv_armv8a_int_6x4mn

Arm: DGEMMSUP ?rc(rd) Invoke Edge Size

a361492

Arm: DGEMMSUP ??r(rv) Invoke Edge Size

35409eb

Plus some fixes at the edges. TODO: ensure that no ref kernel appears at the beginning of the gemmsup kernels, since ref does not recognise panel strides.

Header Typo

4fd82b0

Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref

7e2951e

Ref cannot handle panel strides (packed cases) and thus cannot be called from the beginning of `gemmsup` (i.e. it cannot be the target when gemmsup dispatches to other sizes).

Arm Whole GEMMSUP Call Route is Asm/Int Optimized

820f11a

- The `ref2` call in `bli_gemmsup_rv_armv8a_asm_d6x8m.c` is commented out.
- `bli_gemmsup_rv_armv8a_asm_d4x8m.c` contains a tail `ref2` call, but it is not called by any upper routine.

Fix config_name in bli_arch.c

9c0064f

Fix more copy-paste errors in the haswell gemmsup code.

5191c43

Fixes flame#486.

Fix problem where uninitialized registers are included in vhaddpd in …

e3dc195

…the Mx1 gemmsup kernels for haswell. The fix is to use the same (valid) source register twice in the horizontal addition.

Merge pull request flame#544 from flame/haswell-gemmsup-fpe

b6f71fd

Fix more copy-paste errors in the haswell gemmsup code.

RuQing Xu and others added 29 commits October 6, 2021 01:00

Armv8 Handle *beta == 0 for GEMMSUP ??r Case.

40baf83

Firestorm Block Size Fixes

4bfadf9

Update .appveyor.yml

353a0d8

[ci skip]

Fix data race in testsuite.

c302499

Armv8 GEMMSUP Edge Cases Require Signed Ints

b9da6d5

Fix a bug in bli_gemmsup_rd_armv8a_asm_d6x8m.c. To guard against similar problems in the future, change all [mn]_[iter/left] variables to signed ints.
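The commit message does not say what the bug was. The sketch below shows one generic hazard that signed counters avoid: a "remaining rows" counter that is decremented past zero. The function names are hypothetical and the loops are not taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Signed counter: once `left` drops below zero the loop stops. */
int count_blocks_signed(int64_t m, int64_t step)
{
    assert(step > 0);
    int n = 0;
    for (int64_t left = m; left > 0; left -= step)
        ++n;
    return n;
}

/* Unsigned counter: when `left` is smaller than `step`, the subtraction
   wraps around to a huge positive value and the loop keeps going.
   `cap` bounds the iteration count so the demonstration terminates. */
int count_blocks_unsigned(uint64_t m, uint64_t step, int cap)
{
    assert(step > 0);
    int n = 0;
    for (uint64_t left = m; left > 0 && n < cap; left -= step)
        ++n;
    return n;
}
```

With m = 10 and step = 4, the signed version visits 3 blocks and stops, while the unsigned version only stops when it hits the cap. When m is a multiple of the step, both versions agree.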

Firestorm CPUID Dispatcher

a024715

Commenting out <sys/sysctl.h> due to what is possibly an Xcode bug.

Add test for Apple M1 (firestorm)

14b1358

This test will run on Linux, but all the kernels should run just fine. This does not test autodetection but then none of the other ARM tests do either.

Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo

2920dde

Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo

d7a3372

Register firestorm into arm64 Metaconfig

a4066f2

Enable testing 1m in make check.

70b52ca

Armv8 Trash New Bulk Kernels

f44149f

- They didn't bring much improvement.
- Row-preferring and column-preferring ukrs can't be registered at the same time; doing so will break 1m.

Update Travis CI badge

2329d99

[ci skip]

Merge pull request flame#533 from xrq-phys/arm64-hi-bw

4277fec

ARMv8 PACKM and GEMMSUP Kernels + Apple Firestorm Subconfig

Arm SVE Add ZGEMM 2Vx8 Unindexed

49b9d79

Arm SVE Add ZGEMM 2Vx7 Unindexed

e13abde

Arm SVE Add ZGEMM 2Vx10 Unindexed

c19db2f

Arm SVE ZGEMM Support Gather Load / Scatt. St.

3f68e83

Arm SVE Add SGEMM 2Vx10 Unindexed

b677e0d

Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg

e4cabb9

A64FX Config Use ZGEMM/CGEMM

f7c6c2b

Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0

9e1e781

Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0

66a018e

Arm SVE: Update Perf. Graph

f76ea90

Picture size seems a bit different from upstream. Generated w/ MATLAB. Open to any change.

Arm SVE Config armsve Use ZGEMM/CGEMM

4b648e4

Arm SVE C/ZGEMM Support *beta==0

1749dfa

SH Kernel Unused Either

82b6128

Arm SVE C/ZGEMM Fix FMOV 0 Mistake

ccf1628

FMOV [hsd]M, #imm does not allow a zero immediate. Use wzr or xzr instead.

xrq-phys force-pushed the armsve-zgemm branch from fe35edc to ccf1628 on October 8, 2021 03:34

xrq-phys wants to merge 82 commits into amd-plus from armsve-zgemm


4 participants
