Skip to content

ARMv8 PACKM and GEMMSUP Kernels + Apple Firestorm Subconfig#533

Merged
devinamatthews merged 36 commits intoflame:masterfrom
xrq-phys:arm64-hi-bw
Oct 7, 2021
Merged

ARMv8 PACKM and GEMMSUP Kernels + Apple Firestorm Subconfig#533
devinamatthews merged 36 commits intoflame:masterfrom
xrq-phys:arm64-hi-bw

Conversation

@xrq-phys
Copy link
Copy Markdown
Collaborator

@xrq-phys xrq-phys commented Aug 20, 2021

Hi!

This PR adds a subconfig, some supplementary gemm sizes, some packm kernels and a set of dgemmsup kernels.

The subconfig is roughly tuned against Apple's latest CPUs (i.e. CPU part of A14 and M1). This should slightly improve performance on Arm-based mac machines (c.f. #492, the peak a bit higher than OpenBLAS).

Regarding gemmsup, I did not write assembly for all sizes but relied on some fallback methods:

  • gemmsup_ref for rv cases;
  • NEON-intrinsic-based bli_dgemmsup_rd_armv8a_int_??? for rd cases.

image

image

image

(Sorry the last graph has its labels missing: The title should be DGEMM M=8 N=6 K=P with the x-axis representing K=P.)

Fixes #495.

RuQing Xu added 17 commits August 13, 2021 02:40
Test result: a bit lower GFlOps than 6x8.
- Compile w/ both GCC & Clang
- Only block part is implement. Edge cases WIP
- Not Optimal kernel scheme. Should do 4x8 instead
- Compile w/ both GCC & Clang.
- Edge cases use ref-kernels.
- Can give performance boost in some contexts.
- Add 6x8 GEMMSUP.
- Adjust prefetching.
- Workaround for Clang's disability to handle reg clobbering.
- Subproduct 6x8 row-major GEMM <- incomplete.
Recommended kernels set:
  ...
  BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
  BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
  BLIS_RCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
  BLIS_CRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
  BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
  BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
  ...
  bli_blksz_init     ( &blkszs[ BLIS_MR ],    -1,     6,    -1,    -1,
                                              -1,     8,    -1,    -1 );
  bli_blksz_init_easy( &blkszs[ BLIS_NR ],    -1,     8,    -1,    -1 );
  ...
Sizes according to the 2014 kernels.
Armv8-A now has a complete set of GEMMSUP kernels..
GCC does not have full NEON intrinsics support.
Suffixed NEON opcode is not supported by GNU assembler
- Use the same bulk kernel as Cortex-A53 / ThunderX2;
- Larger block size;
- Use gemmsup kernels for double precision.
Forgot to support `alpha`/`beta` in gemmsup_armv8a_int.
@devinamatthews
Copy link
Copy Markdown
Member

What shape GEMM is the last graph?

@devinamatthews
Copy link
Copy Markdown
Member

@xrq-phys have all of the build problems on M1 been worked out?

bli_dgemmsup_rv_armv8a_int_6x4mn
@xrq-phys
Copy link
Copy Markdown
Collaborator Author

@devinamatthews Sorry I forgot to add labels there. The last graph is DGEMM M=8 N=6 K=P.

Regarding build problems, LLVM requires number of __asm__ registers to be limited anyway. Though I cannot guarantee for all cases (no idea how LLVM routes its registers), at least all the assemblies are adjusted to pass Xcode 12.5.

Another option could be GCC. There seemed to be sayings that GCC 10 is not fully trustable on M1, but what about GCC 11?

@devinamatthews
Copy link
Copy Markdown
Member

@xrq-phys if it builds, compiles, and runs on M1 without modification then that is awesome. We need to dust off #344 and then everything will look very nice.

RuQing Xu and others added 6 commits August 23, 2021 01:13
Plus some fix at edges.

TODO: Should ensure that no ref kernel appear in beginning of gemmsup
kernels. As ref does not recognise panel stride.
Ref cannot handle panel strides (packed cases) thus cannot be called
from the beginning of `gemmsup` (i.e. cannot be dispatch target of
gemmsup to other sizes.)
- `ref2` call in `bli_gemmsup_rv_armv8a_asm_d6x8m.c` is commented out.
- `bli_gemmsup_rv_armv8a_asm_d4x8m.c` contains a tail `ref2` call but
  it's not called by any upper routine.
@devinamatthews
Copy link
Copy Markdown
Member

@fgvanzee can you give this a final blessing?

@devinamatthews
Copy link
Copy Markdown
Member

There are two little things that I want to add before merging:

  1. A Travis test for M1 because there is a lot of new packing and gemmsup code.
  2. Autodetection. There is already a "stub" for M1, it just needs to activate the right config.

@xrq-phys
Copy link
Copy Markdown
Collaborator Author

xrq-phys commented Oct 6, 2021

@devinamatthews Oh thanks for the reminder I forgot to push that piece of change.
But in fact there seems a problem with <sys/sysctl.h> on Xcode 13.0 as:

In file included from /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/sysctl.h:83:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/ucred.h:101:2: fatal error: unknown type name 'u_int'
        u_int   cr_version;             /* structure layout version */

Would you mind me commenting out that #include? It's not used at the moment anyway.

@devinamatthews
Copy link
Copy Markdown
Member

@xrq-phys I was going to just pop these in, or are you working on it? No use doing it twice 😄

@devinamatthews
Copy link
Copy Markdown
Member

Please put an #ifndef __APPLE__ around that include since that is the least change.

Commenting out <sys/sysctl.h> due to possibly a Xcode bug.
@xrq-phys
Copy link
Copy Markdown
Collaborator Author

xrq-phys commented Oct 6, 2021

@devinamatthews Inclusion of <sys/sysctl.h> is already surrounded by #ifdef __APPLE__ so I guess it's an Apply-only thing?

@devinamatthews
Copy link
Copy Markdown
Member

Oh yeah, yes it is. I think I put that in there so we could use hwctl but I ended up just hard-coding M1 for now.

@xrq-phys
Copy link
Copy Markdown
Collaborator Author

xrq-phys commented Oct 6, 2021

@devinamatthews For Travis part would a duplicate of cortexa57 work?

Anyway it's better let you decide. My free Travis-CI has expired already ;)

@devinamatthews
Copy link
Copy Markdown
Member

Putting in Travis test now.

This test will run on Linux, but all the kernels should run just fine. This does not test autodetection but then none of the other ARM tests do either.
@devinamatthews
Copy link
Copy Markdown
Member

@xrq-phys can you test the autodetection?

@devinamatthews
Copy link
Copy Markdown
Member

Uh-oh, Travis failed. I'll run it on a VM and see what's up.

@xrq-phys
Copy link
Copy Markdown
Collaborator Author

xrq-phys commented Oct 6, 2021

One CRR case fallen.
Sorry I didn't do BLAS tests.

@devinamatthews
Copy link
Copy Markdown
Member

Glad you found it. BLIS hasn't even finished compiling on QEMU yet....

@xrq-phys
Copy link
Copy Markdown
Collaborator Author

xrq-phys commented Oct 6, 2021

Problematic routines are bli_dgemmsup_rv_armv8a_asm_8x4m and bli_dgemmsup_rv_armv8a_int_6x4mn.

Trying to fix.

@xrq-phys
Copy link
Copy Markdown
Collaborator Author

xrq-phys commented Oct 6, 2021

@devinamatthews The two kernels are fixed.

Registered subconfig firestorm to arm64 and confirmed (hard-coded) autodetect is working :D.

@devinamatthews
Copy link
Copy Markdown
Member

devinamatthews commented Oct 6, 2021

@xrq-phys, I'm still getting incorrect results for zgemm1m (only that operation) for m=n=k=50. You can easily test this by:

  1. Taking the default input.operations and input.general files from the testsuite directory and copying them to the top source directory.
  2. In input.operations, change the number in front of gemm from a 1 to a 2.
  3. In input.general, change both the min. size to test (100) and max. size to test (100) to 50.
  4. Running ./test_libblis.x in the top directory. You can build this binary without running make check by doing make ./test_libblis.x.

@devinamatthews
Copy link
Copy Markdown
Member

devinamatthews commented Oct 6, 2021

Hmmm. 3) seems to be optional since I am also getting wrong zgemm1m for larger sizes too. make check doesn't test 1m (which it should!) so Travis wouldn't catch this.

@xrq-phys
Copy link
Copy Markdown
Collaborator Author

xrq-phys commented Oct 7, 2021

Reproduced. Will try to fix 🥲

@devinamatthews
Copy link
Copy Markdown
Member

@fgvanzee if there are no objections I'm going to enable 1m testing for make check. There are several times this would have saved me much grief.

@fgvanzee
Copy link
Copy Markdown
Member

fgvanzee commented Oct 7, 2021

@fgvanzee if there are no objections I'm going to enable 1m testing for make check. There are several times this would have saved me much grief.

Without objection.

- They didn't make much improvements.
- Can't register row-preferral and column-preferral ukrs at the same time.
  Will break 1m.
@xrq-phys
Copy link
Copy Markdown
Collaborator Author

xrq-phys commented Oct 7, 2021

Fixed.

This kind of thing worked for 1m:

	bli_cntx_set_l3_nat_ukrs
	(
	  1,
	  // BLIS_GEMM_UKR, BLIS_FLOAT,    bli_sgemm_armv8a_asm_8x12, FALSE,
	  BLIS_GEMM_UKR, BLIS_DOUBLE,   bli_dgemm_armv8a_asm_6x8r, TRUE,
	  cntx
	);

It seems that mixing row and column-preferring bulk kernels would break 1m.
Anyway, moved all row-preferring bulk kernels to old since they didn't show much advantage.

@devinamatthews
Copy link
Copy Markdown
Member

Interesting. I'll open an issue for that.

@devinamatthews
Copy link
Copy Markdown
Member

Confirmed fixed.

@devinamatthews devinamatthews merged commit 4277fec into flame:master Oct 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Mac M1 not automatically detected

3 participants