Skip to content

Arm SVE ZGEMM#2

Open
xrq-phys wants to merge 82 commits intoamd-plusfrom
armsve-zgemm
Open

Arm SVE ZGEMM#2
xrq-phys wants to merge 82 commits intoamd-plusfrom
armsve-zgemm

Conversation

@xrq-phys
Copy link
Copy Markdown
Owner

Self-PR.

RuQing Xu and others added 30 commits August 13, 2021 02:40
Test result: a bit lower GFlOps than 6x8.
- Compile w/ both GCC & Clang
- Only block part is implement. Edge cases WIP
- Not Optimal kernel scheme. Should do 4x8 instead
- Compile w/ both GCC & Clang.
- Edge cases use ref-kernels.
- Can give performance boost in some contexts.
- Add 6x8 GEMMSUP.
- Adjust prefetching.
- Workaround for Clang's disability to handle reg clobbering.
- Subproduct 6x8 row-major GEMM <- incomplete.
Recommended kernels set:
  ...
  BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
  BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
  BLIS_RCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
  BLIS_CRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
  BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
  BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
  ...
  bli_blksz_init     ( &blkszs[ BLIS_MR ],    -1,     6,    -1,    -1,
                                              -1,     8,    -1,    -1 );
  bli_blksz_init_easy( &blkszs[ BLIS_NR ],    -1,     8,    -1,    -1 );
  ...
Sizes according to the 2014 kernels.
Armv8-A now has a complete set of GEMMSUP kernels..
GCC does not have full NEON intrinsics support.
Suffixed NEON opcode is not supported by GNU assembler
- Use the same bulk kernel as Cortex-A53 / ThunderX2;
- Larger block size;
- Use gemmsup kernels for double precision.
Forgot to support `alpha`/`beta` in gemmsup_armv8a_int.
bli_dgemmsup_rv_armv8a_int_6x4mn
Plus some fix at edges.

TODO: Should ensure that no ref kernel appear in beginning of gemmsup
kernels. As ref does not recognise panel stride.
Ref cannot handle panel strides (packed cases) thus cannot be called
from the beginning of `gemmsup` (i.e. cannot be dispatch target of
gemmsup to other sizes.)
- `ref2` call in `bli_gemmsup_rv_armv8a_asm_d6x8m.c` is commented out.
- `bli_gemmsup_rv_armv8a_asm_d4x8m.c` contains a tail `ref2` call but
  it's not called by any upper routine.
…the Mx1 gemmsup kernels for haswell.

The fix is to use the same (valid) source register twice in the horizontal addition.
Fix more copy-paste errors in the haswell gemmsup code.
Details:
- Defined a new packm variant for the 'gemmlike' sandbox. This new
  variant (bls_l3_packm_var3.c) parallelizes the packing operation over
  the k dimension rather than the m or n dimensions. Note that the
  gemmlike implementation still uses var1 by default, and use of the new
  code would require changing bls_l3_packm_a.c and/or bls_l3_packm_b.c
  so that var3 is called instead. Thanks to Jeff Diamond for proposing
  this (perhaps NUMA-friendly) solution.
Details:
- Removed the commented-out #define BLIS_NUM_ARCHS in bli_type_defs.h
  and its associated (now outdated) comments. BLIS_NUM_ARCHS has been
  part of the arch_t enum for some time now, and so this change is
  mostly about removing any opportunity for confusion for people who
  may be reading the code. Thanks to Minh Quan Ho for leading me to
  cleanup.
- There was redundance between the macro BLIS_MAX_NUM_ERR_MSGS (=200) and
  the enum BLIS_ERROR_CODE_MAX (-170), while they both mean the same thing:
  the maximal number of error codes/messages.
- The previous initialization of error messages at compile time ignored that
  the 'bli_error_string' array still occupies useless memory due to 2D char[][]
  declaration. Instead, it should be just an array of pointers, pointing at
  strings in .rodata section.
- This commit does the two modifications:
   * retired macros BLIS_MAX_NUM_ERR_MSGS and BLIS_MAX_ERR_MSG_LENGTH everywhere
   * switch bli_error_string from char[][] to char *[] to reduce its footprint
     from 40KB (200*200) to 1.3KB (170*sizeof(char*)).
     (No problem to use the enum BLIS_ERROR_CODE_MAX at compile-time,
     since compiler is smart enough to determine its value is 170.)
RuQing Xu and others added 29 commits October 6, 2021 01:00
Previously, this was a global variable. Setting the value was synchronized via a mutex but reading the value was not. Of course, these accesses are almost certainly atomic, but there is still the possibility of one thread attempting to set the value and then reading the value set by another thread. For correct operation under user threading (e.g. pthreads), this should probably be thread-local with no mutex.
Fix a bug in bli_gemmsup_rd_armv8a_asm_d6x8m.c.
For safety upon similar strategies in the future,
 change all [mn]_[iter/left] into signed ints.
Commenting out <sys/sysctl.h> due to possibly a Xcode bug.
This test will run on Linux, but all the kernels should run just fine. This does not test autodetection but then none of the other ARM tests do either.
- They didn't make much improvements.
- Can't register row-preferral and column-preferral ukrs at the same time.
  Will break 1m.
ARMv8 PACKM and GEMMSUP Kernels + Apple Firestorm Subconfig
Pic. size seems a bit different from upstream.
Generaged w/ MATLAB. Open to any change.
FMOV [hsd]M, #imm does not allow zero immediate.
Use wzr, xzr instead.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants