Open
Conversation
Test result: a bit lower GFlOps than 6x8.
Quite slow.
- Compile w/ both GCC & Clang - Only block part is implement. Edge cases WIP - Not Optimal kernel scheme. Should do 4x8 instead
- Compile w/ both GCC & Clang. - Edge cases use ref-kernels. - Can give performance boost in some contexts.
- Add 6x8 GEMMSUP. - Adjust prefetching. - Workaround for Clang's disability to handle reg clobbering. - Subproduct 6x8 row-major GEMM <- incomplete.
Recommended kernels set:
...
BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
BLIS_RCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
BLIS_CRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
...
bli_blksz_init ( &blkszs[ BLIS_MR ], -1, 6, -1, -1,
-1, 8, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NR ], -1, 8, -1, -1 );
...
Sizes according to the 2014 kernels.
Armv8-A now has a complete set of GEMMSUP kernels..
GCC does not have full NEON intrinsics support.
Suffixed NEON opcode is not supported by GNU assembler
- Use the same bulk kernel as Cortex-A53 / ThunderX2; - Larger block size; - Use gemmsup kernels for double precision.
Forgot to support `alpha`/`beta` in gemmsup_armv8a_int.
bli_dgemmsup_rv_armv8a_int_6x4mn
Plus some fix at edges. TODO: Should ensure that no ref kernel appear in beginning of gemmsup kernels. As ref does not recognise panel stride.
Ref cannot handle panel strides (packed cases) thus cannot be called from the beginning of `gemmsup` (i.e. cannot be dispatch target of gemmsup to other sizes.)
- `ref2` call in `bli_gemmsup_rv_armv8a_asm_d6x8m.c` is commented out. - `bli_gemmsup_rv_armv8a_asm_d4x8m.c` contains a tail `ref2` call but it's not called by any upper routine.
…the Mx1 gemmsup kernels for haswell. The fix is to use the same (valid) source register twice in the horizontal addition.
Fix more copy-paste errors in the haswell gemmsup code.
Details: - Defined a new packm variant for the 'gemmlike' sandbox. This new variant (bls_l3_packm_var3.c) parallelizes the packing operation over the k dimension rather than the m or n dimensions. Note that the gemmlike implementation still uses var1 by default, and use of the new code would require changing bls_l3_packm_a.c and/or bls_l3_packm_b.c so that var3 is called instead. Thanks to Jeff Diamond for proposing this (perhaps NUMA-friendly) solution.
Details: - Removed the commented-out #define BLIS_NUM_ARCHS in bli_type_defs.h and its associated (now outdated) comments. BLIS_NUM_ARCHS has been part of the arch_t enum for some time now, and so this change is mostly about removing any opportunity for confusion for people who may be reading the code. Thanks to Minh Quan Ho for leading me to cleanup.
- There was redundance between the macro BLIS_MAX_NUM_ERR_MSGS (=200) and
the enum BLIS_ERROR_CODE_MAX (-170), while they both mean the same thing:
the maximal number of error codes/messages.
- The previous initialization of error messages at compile time ignored that
the 'bli_error_string' array still occupies useless memory due to 2D char[][]
declaration. Instead, it should be just an array of pointers, pointing at
strings in .rodata section.
- This commit does the two modifications:
* retired macros BLIS_MAX_NUM_ERR_MSGS and BLIS_MAX_ERR_MSG_LENGTH everywhere
* switch bli_error_string from char[][] to char *[] to reduce its footprint
from 40KB (200*200) to 1.3KB (170*sizeof(char*)).
(No problem to use the enum BLIS_ERROR_CODE_MAX at compile-time,
since compiler is smart enough to determine its value is 170.)
[ci skip]
Previously, this was a global variable. Setting the value was synchronized via a mutex but reading the value was not. Of course, these accesses are almost certainly atomic, but there is still the possibility of one thread attempting to set the value and then reading the value set by another thread. For correct operation under user threading (e.g. pthreads), this should probably be thread-local with no mutex.
Fix a bug in bli_gemmsup_rd_armv8a_asm_d6x8m.c. For safety upon similar strategies in the future, change all [mn]_[iter/left] into signed ints.
Commenting out <sys/sysctl.h> due to possibly a Xcode bug.
This test will run on Linux, but all the kernels should run just fine. This does not test autodetection but then none of the other ARM tests do either.
- They didn't make much improvements. - Can't register row-preferral and column-preferral ukrs at the same time. Will break 1m.
[ci skip]
ARMv8 PACKM and GEMMSUP Kernels + Apple Firestorm Subconfig
Pic. size seems a bit different from upstream. Generaged w/ MATLAB. Open to any change.
FMOV [hsd]M, #imm does not allow zero immediate. Use wzr, xzr instead.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Self-PR.