ARMv8 PACKM and GEMMSUP Kernels + Apple Firestorm Subconfig #533
devinamatthews merged 36 commits into flame:master
Test result: slightly lower GFLOPS than 6x8.
Quite slow.
- Compiles with both GCC and Clang.
- Only the block part is implemented; edge cases are WIP.
- Not an optimal kernel scheme; should do 4x8 instead.

- Compiles with both GCC and Clang.
- Edge cases use ref kernels.
- Can give a performance boost in some contexts.

- Add 6x8 GEMMSUP.
- Adjust prefetching.
- Work around Clang's inability to handle register clobbering.
- Subproduct 6x8 row-major GEMM is still incomplete.
Recommended kernels set:
...
BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
BLIS_RCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
BLIS_CRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
...
bli_blksz_init ( &blkszs[ BLIS_MR ], -1, 6, -1, -1,
-1, 8, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NR ], -1, 8, -1, -1 );
...
Sizes according to the 2014 kernels.
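For readers unfamiliar with the block sizes above, here is a minimal sketch of how an MR=6, NR=8 tiling splits a small row-major product into full tiles and edge blocks, with the edges handed to a plain reference loop as the commit notes describe. It is plain C, not the assembly from this PR, and the names `ref_block`, `tile_6x8` and `small_gemm` are hypothetical.

```c
enum { MR = 6, NR = 8 };  /* register block sizes quoted above */

/* Reference update for an arbitrary m x n block: C += A * B.
   All matrices are row-major with leading dimensions lda, ldb, ldc. */
static void ref_block(int m, int n, int k,
                      const double *a, int lda,
                      const double *b, int ldb,
                      double *c, int ldc)
{
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += a[i * lda + p] * b[p * ldb + j];
            c[i * ldc + j] += acc;
        }
}

/* Stand-in for an optimized full-tile kernel: the MR x NR accumulators
   live in a local array, as they would live in registers in assembly. */
static void tile_6x8(int k,
                     const double *a, int lda,
                     const double *b, int ldb,
                     double *c, int ldc)
{
    double acc[MR][NR] = {{0.0}};
    for (int p = 0; p < k; ++p)
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                acc[i][j] += a[i * lda + p] * b[p * ldb + j];
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}

/* Walk C in MR x NR tiles; full tiles use tile_6x8, edges use ref_block.
   Returns the number of full tiles, purely for illustration. */
static int small_gemm(int m, int n, int k,
                      const double *a, int lda,
                      const double *b, int ldb,
                      double *c, int ldc)
{
    int full_tiles = 0;
    for (int i = 0; i < m; i += MR) {
        int mb = (m - i < MR) ? m - i : MR;
        for (int j = 0; j < n; j += NR) {
            int nb = (n - j < NR) ? n - j : NR;
            if (mb == MR && nb == NR) {
                tile_6x8(k, a + i * lda, lda, b + j, ldb, c + i * ldc + j, ldc);
                ++full_tiles;
            } else {
                ref_block(mb, nb, k, a + i * lda, lda, b + j, ldb,
                          c + i * ldc + j, ldc);
            }
        }
    }
    return full_tiles;
}
```

With m=7 and n=10, for example, only one 6x8 tile is full and the remaining three blocks go through the reference path, which is why edge handling matters for the small shapes gemmsup targets.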
Armv8-A now has a complete set of GEMMSUP kernels.
GCC does not have full NEON intrinsics support.
Suffixed NEON opcodes are not supported by the GNU assembler.
- Use the same bulk kernel as Cortex-A53 / ThunderX2;
- Larger block size;
- Use gemmsup kernels for double precision.
Forgot to support `alpha`/`beta` in gemmsup_armv8a_int.
|
What shape GEMM is the last graph? |
|
@xrq-phys have all of the build problems on M1 been worked out? |
bli_dgemmsup_rv_armv8a_int_6x4mn
|
@devinamatthews Sorry, I forgot to add labels there. The last graph is DGEMM M=8 N=6 K=P. Regarding build problems, LLVM requires number of Another option could be GCC. There have been reports that GCC 10 is not fully reliable on M1, but what about GCC 11? |
Plus some fixes at the edges. TODO: ensure that no ref kernel is called at the beginning of gemmsup kernels, as ref does not recognise panel strides.
Ref cannot handle panel strides (packed cases) and thus cannot be called at the beginning of `gemmsup` (i.e. it cannot be the dispatch target when gemmsup forwards to other sizes).
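The panel-stride issue can be illustrated with a simplified packing layout. The sketch below assumes A is packed into micropanels of MR rows, each stored column by column, with consecutive micropanels separated by a panel stride `ps`; this is a simplification for illustration, and the helper names `pack_a`, `get_panel_aware` and `get_strided` are hypothetical. A reader that only knows a row stride and a column stride addresses the first micropanel correctly but lands on the wrong element once the row index crosses into the next panel.

```c
#include <stddef.h>

enum { MR = 6 };  /* micropanel height, matching the 6x8 kernels */

/* Pack a row-major m x k matrix into ceil(m/MR) micropanels.
   Inside a micropanel, column p occupies MR contiguous elements.
   Rows past m are zero-padded. Returns the panel stride. */
static size_t pack_a(int m, int k, const double *a, int lda, double *packed)
{
    size_t ps = (size_t)MR * (size_t)k;
    int n_panels = (m + MR - 1) / MR;
    for (int q = 0; q < n_panels; ++q)
        for (int p = 0; p < k; ++p)
            for (int r = 0; r < MR; ++r) {
                int i = q * MR + r;
                packed[(size_t)q * ps + (size_t)p * MR + (size_t)r] =
                    (i < m) ? a[i * lda + p] : 0.0;
            }
    return ps;
}

/* Correct access: select the micropanel first, then index inside it. */
static double get_panel_aware(const double *packed, size_t ps, int i, int p)
{
    return packed[(size_t)(i / MR) * ps + (size_t)p * MR + (size_t)(i % MR)];
}

/* What a stride-only reference kernel would do: it has no notion of ps. */
static double get_strided(const double *packed, int rs, int cs, int i, int p)
{
    return packed[(size_t)i * (size_t)rs + (size_t)p * (size_t)cs];
}
```

Calling `get_strided` with `rs = 1` and `cs = MR` agrees with `get_panel_aware` for rows below MR and disagrees for rows in later micropanels, which is the failure mode the commit message guards against.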
- `ref2` call in `bli_gemmsup_rv_armv8a_asm_d6x8m.c` is commented out.
- `bli_gemmsup_rv_armv8a_asm_d4x8m.c` contains a tail `ref2` call but it's not called by any upper routine.
|
@fgvanzee can you give this a final blessing? |
|
There are two little things that I want to add before merging:
|
|
@devinamatthews Oh thanks for the reminder I forgot to push that piece of change. Would you mind me commenting out that |
|
@xrq-phys I was going to just pop these in, or are you working on it? No use doing it twice 😄 |
|
Please put an |
Commenting out <sys/sysctl.h> due to a possible Xcode bug.
|
@devinamatthews Inclusion of |
|
Oh yeah, yes it is. I think I put that in there so we could use hwctl but I ended up just hard-coding M1 for now. |
|
@devinamatthews For Travis part would a duplicate of Anyway it's better let you decide. My free Travis-CI has expired already ;) |
|
Putting in Travis test now. |
This test will run on Linux, but all the kernels should run just fine. This does not test autodetection but then none of the other ARM tests do either.
|
@xrq-phys can you test the autodetection? |
|
Uh-oh, Travis failed. I'll run it on a VM and see what's up. |
|
One CRR case failed. |
|
Glad you found it. BLIS hasn't even finished compiling on QEMU yet.... |
|
Problematic routines are Trying to fix. |
|
@devinamatthews The two kernels are fixed. Registered subconfig |
|
@xrq-phys, I'm still getting incorrect results for
|
|
Hmmm. 3) seems to be optional since I am also getting wrong |
|
Reproduced. Will try to fix 🥲 |
|
@fgvanzee if there are no objections I'm going to enable 1m testing for |
Without objection. |
- They didn't yield much improvement.
- Can't register row-preferring and column-preferring ukrs at the same time; doing so breaks 1m.
|
Fixed. This kind of thing worked for It seems that mixing row and column-preferring bulk kernels would break |
|
Interesting. I'll open an issue for that. |
|
Confirmed fixed. |
Hi!
This PR adds a subconfig, some supplementary `gemm` sizes, some `packm` kernels and a set of `dgemmsup` kernels.
The subconfig is roughly tuned against Apple's latest CPUs (i.e. the CPU part of A14 and M1). This should slightly improve performance on Arm-based Mac machines (c.f. #492, the peak is a bit higher than OpenBLAS).
Regarding `gemmsup`, I did not write assembly for all sizes but relied on some fallback methods:
- `gemmsup_ref` for `rv` cases;
- `bli_dgemmsup_rd_armv8a_int_???` for `rd` cases.
(Sorry the last graph has its labels missing: the title should be DGEMM M=8 N=6 K=P with the x-axis representing K=P.)
Fixes #495.