ARMv8 PACKM and GEMMSUP Kernels + Apple Firestorm Subconfig by xrq-phys · Pull Request #533 · flame/blis

xrq-phys · 2021-08-20T09:17:54Z

Hi!

This PR adds a subconfig, some supplementary gemm sizes, some packm kernels and a set of dgemmsup kernels.

The subconfig is roughly tuned for Apple's latest CPUs (i.e. the CPU cores of the A14 and M1). This should slightly improve performance on Arm-based Macs (cf. #492; the peak is a bit higher than OpenBLAS).

Regarding gemmsup, I did not write assembly for all sizes but relied on some fallback methods:

- gemmsup_ref for rv cases;
- NEON-intrinsic-based bli_dgemmsup_rd_armv8a_int_??? for rd cases.
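
For readers unfamiliar with the rv/rd distinction, here is a rough plain-C sketch, not the PR's assembly or intrinsics. As far as I understand the BLIS naming, rv kernels broadcast single elements of A against contiguous rows of B, while rd kernels accumulate dot products over contiguous rows of A and columns of B. The function names `tile_rv` and `tile_rd` and the fixed storage layouts are assumptions made only for this illustration.

```c
#include <assert.h>

/* "rv"-style sketch: A is row-major (m x k), B is row-major (k x n),
 * C is row-major (m x n). Each a(i,p) is broadcast against a
 * contiguous row of B, so the inner loop walks B and C contiguously. */
static void tile_rv(int m, int n, int k,
                    const double *a, const double *b, double *c)
{
    assert(m > 0 && n > 0 && k > 0);
    for (int i = 0; i < m; ++i)
        for (int p = 0; p < k; ++p) {
            const double aip = a[i * k + p];
            for (int j = 0; j < n; ++j)
                c[i * n + j] += aip * b[p * n + j];
        }
}

/* "rd"-style sketch: A is row-major (m x k), B is column-major
 * (k x n, column j starts at bt + j*k), C is row-major (m x n).
 * Each c(i,j) is a dot product of two contiguous length-k vectors. */
static void tile_rd(int m, int n, int k,
                    const double *a, const double *bt, double *c)
{
    assert(m > 0 && n > 0 && k > 0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += a[i * k + p] * bt[j * k + p];
            c[i * n + j] += acc;
        }
}
```

Both functions compute C += A*B; they differ only in which operand layout makes the innermost loop contiguous, which is why the choice depends on how A and B are stored.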

(Sorry, the last graph is missing its labels. The title should be DGEMM M=8 N=6 K=P, with the x-axis representing K=P.)

Fixes #495.

Recommended kernels set:

	...
	BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
	BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
	BLIS_RCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
	BLIS_CRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
	BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
	BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
	...
	bli_blksz_init ( &blkszs[ BLIS_MR ], -1, 6, -1, -1, -1, 8, -1, -1 );
	bli_blksz_init_easy( &blkszs[ BLIS_NR ], -1, 8, -1, -1 );
	...

devinamatthews · 2021-08-20T16:59:23Z

What shape GEMM is the last graph?

devinamatthews · 2021-08-20T16:59:50Z

@xrq-phys have all of the build problems on M1 been worked out?

xrq-phys · 2021-08-20T17:51:29Z

@devinamatthews Sorry I forgot to add labels there. The last graph is DGEMM M=8 N=6 K=P.

Regarding build problems, LLVM requires the number of registers used in an __asm__ block to be limited anyway. I cannot guarantee all cases (I have no idea how LLVM allocates its registers), but at least all the assembly kernels have been adjusted to build with Xcode 12.5.

Another option could be GCC. There were reports that GCC 10 is not fully reliable on M1, but what about GCC 11?

devinamatthews · 2021-08-20T20:00:18Z

@xrq-phys if it builds, compiles, and runs on M1 without modification then that is awesome. We need to dust off #344 and then everything will look very nice.

devinamatthews · 2021-09-10T17:53:52Z

@fgvanzee can you give this a final blessing?

devinamatthews · 2021-10-06T15:07:12Z

There are two little things that I want to add before merging:

- A Travis test for M1 because there is a lot of new packing and gemmsup code.
- Autodetection. There is already a "stub" for M1, it just needs to activate the right config.

xrq-phys · 2021-10-06T15:11:47Z

@devinamatthews Oh, thanks for the reminder. I forgot to push that change.
But there seems to be a problem with <sys/sysctl.h> on Xcode 13.0:

In file included from /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/sysctl.h:83:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/ucred.h:101:2: fatal error: unknown type name 'u_int'
        u_int   cr_version;             /* structure layout version */

Would you mind me commenting out that #include? It's not used at the moment anyway.

devinamatthews · 2021-10-06T15:14:25Z

@xrq-phys I was going to just pop these in, or are you working on it? No use doing it twice 😄

devinamatthews · 2021-10-06T15:15:01Z

Please put an #ifndef __APPLE__ around that include since that is the least change.

xrq-phys · 2021-10-06T15:17:22Z

@devinamatthews The inclusion of <sys/sysctl.h> is already surrounded by #ifdef __APPLE__, so I guess it's an Apple-only thing?

devinamatthews · 2021-10-06T15:19:48Z

Oh yeah, yes it is. I think I put that in there so we could use hwctl but I ended up just hard-coding M1 for now.
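
A minimal sketch of what "hard-coding M1" can look like, assuming detection is done purely from compiler-defined macros. The enum, the function names and the returned strings are hypothetical and are not BLIS's actual `bli_cpuid` code. Because nothing is queried at run time, `<sys/sysctl.h>` is not needed at all.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical architecture ids for this sketch only (not BLIS's arch_t). */
typedef enum { SKETCH_ARCH_GENERIC, SKETCH_ARCH_FIRESTORM } sketch_arch_t;

/* Compile-time "detection": every AArch64 Apple target is assumed to be
 * Firestorm-class. No sysctl call is made. */
static sketch_arch_t sketch_detect_arch(void)
{
#if defined(__APPLE__) && defined(__aarch64__)
    return SKETCH_ARCH_FIRESTORM;
#else
    return SKETCH_ARCH_GENERIC;
#endif
}

/* Map an id to the name of the subconfig it would select. */
static const char *sketch_arch_name(sketch_arch_t id)
{
    assert(id == SKETCH_ARCH_GENERIC || id == SKETCH_ARCH_FIRESTORM);
    return (id == SKETCH_ARCH_FIRESTORM) ? "firestorm" : "generic";
}
```

The obvious limitation is that such a check cannot tell different Apple cores apart; a run-time query would be needed for that.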

xrq-phys · 2021-10-06T15:19:54Z

@devinamatthews For the Travis part, would a duplicate of the cortexa57 entry work?

Anyway, it's better to let you decide. My free Travis-CI has already expired ;)

devinamatthews · 2021-10-06T15:20:11Z

Putting in Travis test now.

This test will run on Linux, but all the kernels should run just fine. This does not test autodetection but then none of the other ARM tests do either.

devinamatthews · 2021-10-06T15:28:06Z

@xrq-phys can you test the autodetection?

devinamatthews · 2021-10-06T15:28:57Z

Uh-oh, Travis failed. I'll run it on a VM and see what's up.

xrq-phys · 2021-10-06T15:48:55Z

One CRR case failed.
Sorry, I didn't run the BLAS tests.

devinamatthews · 2021-10-06T15:50:10Z

Glad you found it. BLIS hasn't even finished compiling on QEMU yet....

xrq-phys · 2021-10-06T16:06:44Z

Problematic routines are bli_dgemmsup_rv_armv8a_asm_8x4m and bli_dgemmsup_rv_armv8a_int_6x4mn.

Trying to fix.

xrq-phys · 2021-10-06T17:27:48Z

@devinamatthews The two kernels are fixed.

Registered the firestorm subconfig in arm64 and confirmed that the (hard-coded) autodetection works :D.

devinamatthews · 2021-10-06T22:48:11Z

@xrq-phys, I'm still getting incorrect results for zgemm1m (only that operation) for m=n=k=50. You can easily test this by:

1. Taking the default input.operations and input.general files from the testsuite directory and copying them to the top source directory.
2. In input.operations, change the number in front of gemm from a 1 to a 2.
3. In input.general, change both the min. size to test (100) and max. size to test (100) to 50.
4. Running ./test_libblis.x in the top directory. You can build this binary without running make check by doing make ./test_libblis.x.

devinamatthews · 2021-10-06T22:50:18Z

Hmmm. Step 3) seems to be optional, since I am getting wrong zgemm1m results for larger sizes too. make check doesn't test 1m (which it should!) so Travis wouldn't catch this.

xrq-phys · 2021-10-07T17:29:14Z

Reproduced. Will try to fix 🥲

devinamatthews · 2021-10-07T17:30:38Z

@fgvanzee if there are no objections I'm going to enable 1m testing for make check. There are several times this would have saved me much grief.

fgvanzee · 2021-10-07T17:33:20Z

> @fgvanzee if there are no objections I'm going to enable 1m testing for make check. There are several times this would have saved me much grief.

Without objection.

xrq-phys · 2021-10-07T17:40:54Z

Fixed.

This kind of thing worked for 1m:

	bli_cntx_set_l3_nat_ukrs
	(
	  1,
	  // BLIS_GEMM_UKR, BLIS_FLOAT,    bli_sgemm_armv8a_asm_8x12, FALSE,
	  BLIS_GEMM_UKR, BLIS_DOUBLE,   bli_dgemm_armv8a_asm_6x8r, TRUE,
	  cntx
	);

It seems that mixing row- and column-preferring bulk kernels would break 1m.
Anyway, I moved all row-preferring bulk kernels to old since they didn't show much advantage.
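
For context on what 1m relies on, here is a small sketch of the underlying identity only: a complex matrix product can be computed by a purely real GEMM if each element of A is expanded into a 2x2 real block and B and C are viewed as real matrices with interleaved real and imaginary rows. This is not BLIS's packing or 1m code, and it does not by itself explain the failure above. The function names and the column-major, interleaved layout are assumptions for the illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Plain real GEMM, column-major: C(m x n) = A(m x k) * B(k x n). */
static void real_gemm(int m, int n, int k,
                      const double *a, const double *b, double *c)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += a[i + p * m] * b[p + j * k];
            c[i + j * m] = acc;
        }
}

/* Complex C = A * B using only real_gemm. a, b and c hold interleaved
 * (re, im) pairs in column-major order. Returns 0 on success and -1 if
 * the temporary buffer cannot be allocated. */
static int zgemm_via_real(int m, int n, int k,
                          const double *a, const double *b, double *c)
{
    assert(m > 0 && n > 0 && k > 0);
    const int ld = 2 * m;  /* leading dimension of the expanded A */
    double *ae = malloc(sizeof(double) * (size_t)ld * (size_t)(2 * k));
    if (ae == NULL)
        return -1;

    /* Expand each a(i,p) = ar + i*ai into the block [ ar -ai ; ai ar ]. */
    for (int p = 0; p < k; ++p)
        for (int i = 0; i < m; ++i) {
            const double ar = a[2 * (i + p * m)];
            const double ai = a[2 * (i + p * m) + 1];
            ae[(2 * i)     + (2 * p)     * ld] =  ar;
            ae[(2 * i)     + (2 * p + 1) * ld] = -ai;
            ae[(2 * i + 1) + (2 * p)     * ld] =  ai;
            ae[(2 * i + 1) + (2 * p + 1) * ld] =  ar;
        }

    /* Interleaved column-major B (k x n complex) is already a real
     * (2k x n) matrix, and likewise C is a real (2m x n) matrix. */
    real_gemm(2 * m, n, 2 * k, ae, b, c);
    free(ae);
    return 0;
}
```

In this column-major picture the expansion falls on A; for row-stored output the roles of A and B would be exchanged, which is where a micro-kernel's storage preference enters.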

devinamatthews · 2021-10-07T17:43:42Z

Interesting. I'll open an issue for that.

devinamatthews · 2021-10-07T18:45:40Z

Confirmed fixed.

RuQing Xu added 17 commits August 13, 2021 02:40

Armv8-A Add 8x4 Kernel WIP

a29c163

Test result: a bit lower GFlOps than 6x8.

Armv8A DGEMM 4x4 Kernel WIP. Slow

6639999

Quite slow.

Armv8-A Add Part of GEMMSUP 8x4m Kernel

df40efe

- Compile w/ both GCC & Clang - Only block part is implement. Edge cases WIP - Not Optimal kernel scheme. Should do 4x8 instead

Armv8-A Add GEMMSUP 4x8n Kernel

a9ba79e

- Compile w/ both GCC & Clang. - Edge cases use ref-kernels. - Can give performance boost in some contexts.

Armv8-A Add More DGEMMSUP

8ed8f5e

- Add 6x8 GEMMSUP. - Adjust prefetching. - Workaround for Clang's disability to handle reg clobbering. - Subproduct 6x8 row-major GEMM <- incomplete.

Armv8-A DGEMMSUP Adjustments

3efe707

Armv8-A Introduced s/d Packing Kernels

49b05df

Sizes according to the 2014 kernels.

Armv8-A s/d Packing Kernels Fix Typo

3c5f740

For GCC.

Armv8-A GEMMSUP-RD 6x8n

afd0fa6

Armv8-A GEMMSUP-RD 6x8m

8a32d19

Armv8-A now has a complete set of GEMMSUP kernels..

Armv8-A Adjust Types for PACKM Kernels

ce44735

GCC does not have full NEON intrinsics support.

Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm

c792d50

Suffixed NEON opcode is not supported by GNU assembler

Armv8-A Supplimentary GEMMSUP Sizes for RD

4e7e225

Arm64 8x4 Kernel Use Less Regs

3df0e9b

Added Apple Firestorm (A14/M1) Subconfig

e38ca28

- Use the same bulk kernel as Cortex-A53 / ThunderX2; - Larger block size; - Use gemmsup kernels for double precision.

Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin

7d5903d

Forgot to support `alpha`/`beta` in gemmsup_armv8a_int.

Arm: Implement GEMMSUP Fallback Method

e6799b2

bli_dgemmsup_rv_armv8a_int_6x4mn

RuQing Xu and others added 6 commits August 23, 2021 01:13

Arm: DGEMMSUP ?rc(rd) Invoke Edge Size

a361492

Arm: DGEMMSUP ??r(rv) Invoke Edge Size

35409eb

Plus some fix at edges. TODO: Should ensure that no ref kernel appear in beginning of gemmsup kernels. As ref does not recognise panel stride.

Header Typo

4fd82b0

Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref

7e2951e

Ref cannot handle panel strides (packed cases) thus cannot be called from the beginning of `gemmsup` (i.e. cannot be dispatch target of gemmsup to other sizes.)

Arm Whole GEMMSUP Call Route is Asm/Int Optimized

820f11a

- `ref2` call in `bli_gemmsup_rv_armv8a_asm_d6x8m.c` is commented out. - `bli_gemmsup_rv_armv8a_asm_d4x8m.c` contains a tail `ref2` call but it's not called by any upper routine.

Fix config_name in bli_arch.c

9c0064f

xrq-phys mentioned this pull request Oct 2, 2021

Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs. #552

Merged

Firestorm CPUID Dispatcher

a024715

Commenting out <sys/sysctl.h> due to possibly a Xcode bug.

Add test for Apple M1 (firestorm)

14b1358

This test will run on Linux, but all the kernels should run just fine. This does not test autodetection but then none of the other ARM tests do either.

RuQing Xu added 3 commits October 7, 2021 02:01

Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo

2920dde

Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo

d7a3372

Register firestorm into arm64 Metaconfig

a4066f2

Armv8 Trash New Bulk Kernels

f44149f

- They didn't make much improvements. - Can't register row-preferral and column-preferral ukrs at the same time. Will break 1m.

devinamatthews mentioned this pull request Oct 7, 2021

All dcomplex 1m operations fail for mixed row/col dgemm ukrs #557

Closed

devinamatthews merged commit 4277fec into flame:master Oct 7, 2021
