Flame Upstream Since Mar, 2021 by xrq-phys · Pull Request #1 · xrq-phys/blis

xrq-phys · 2021-09-21T06:40:02Z

Self-PR including:

POWER10
Arm SVE Real (Tuned for A64fx) + some clean-up.

in order to create a temporary mixed branch with all features.

Details: - Switched the small block allocator (sba), as defined in bli_sba.c and bli_apool.c, to static initialization of its internal mutex. Did a similar thing for the packing block allocator (pba), which appears as global_membrk in bli_membrk.c. - Commented out bli_membrk_init_mutex() and bli_membrk_finalize_mutex() to ensure they won't be used in the future. - In bli_thrcomm_pthreads.c and .h, removed old, commented-out cpp blocks guarded by BLIS_USE_PTHREAD_MUTEX.

Details: - Renamed the files, variables, and functions relating to the packing block allocator from its legacy name (membrk) to its current name (pba). This more clearly contrasts the packing block allocator with the small block allocator (sba). - Fixed a typo in bli_pack_set_pack_b(), defined in bli_pack.c, that caused the function to erroneously change the value of the pack_a field of the global rntm_t instead of the pack_b field. (Apparently nobody has used this API yet.) - Comment updates.

Details: - Removed the option to finalize BLIS after every BLAS call, which also means that BLIS would initialize at the beginning of every BLAS call. This option never really made sense and wasn't even implemented properly to begin with. (Because bli_init_auto() and _finalize_auto() were implemented in terms of bli_init_once() and _finalize_once(), respectively, the application would have only been able to call one BLAS routine before BLIS would find itself in a unusable, permanently uninitialized state.) Because this option was never meant for regular use, it never made it into configure as an actual configure-time option, and therefore this commit only removes parts of the code affected by the cpp macro guard BLIS_ENABLE_STAY_AUTO_INITIALIZED.

Details: - Added an err_t* parameter to memory allocation functions including bli_malloc_intl(), bli_calloc_intl(), bli_malloc_user(), bli_fmalloc_align(), and bli_fmalloc_noalign(). Since these functions already use the return value to return the allocated memory address, they can't communicate errors to the caller through the return value. This commit does not employ any error checking within these functions or their callers, but this sets up BLIS for a more comprehensive commit that moves in that direction. - Moved the typedefs for malloc_ft and free_ft from bli_malloc.h to bli_type_defs.h. This was done so that what remains of bli_malloc.h can be included after the definition of the err_t enum. (This ordering was needed because bli_malloc.h now contains function prototypes that use err_t.) - Defined bli_is_success() and bli_is_failure() static functions in bli_param_macro_defs.h. These functions provide easy checks for error codes and will be used more heavily in future commits. - Unfortunately, the additional err_t* argument discussed above breaks the API for bli_malloc_user(), which is an exported symbol in the shared library. However, it's quite possible that the only application that calls bli_malloc_user()--indeed, the reason it is was marked for symbol exporting to begin with--is the BLIS testsuite. And if that's the case, this breakage won't affect anyone. Nonetheless, the "major" part of the so_version file has been updated accordingly to 4.0.0.

Needed for compiling on e.g. Mac M1. AFAIK clang supports the same -mcpu flag for ThunderX2 as gcc.

Details: - Changed bli_pack_get_pack_a() and bli_pack_get_pack_b() so that instead of returning a bool, they set a bool that is passed in by address. This does break the public exported API, but I expect very few users actually use this function. (This change is being made in preparation for a much more extensive commit relating to error checking.)

Details: - Defined getijv, setijv operations to get and set elements of a vector, in bli_setgetijv.c and .h. - Renamed bli_setgetij.c and .h to bli_setgetijm.c and .h, respectively. - Added additional bounds checking to getijm and setijm to prevent actions with negative indices. - Added documentation to BLISObjectAPI.md and BLISTypedAPI.md for getijv and setijv. - Added documentation to BLISTypedAPI.md for getijm and setijm, which were inadvertently missing. - Added a new entry to the FAQ titled "Why does BLIS have vector (level-1v) and matrix (level-1m) variations of most level-1 operations?" - Comment updates.

Details: - Added new implementations of bli_slamch() and bli_dlamch() that use constants from the standard C library in lieu of dynamically-computed values (via code inherited from netlib). The previous implementation is still available when the cpp macro BLIS_ENABLE_LEGACY_LAMCH is defined by the subconfiguration at compile-time. Thanks to Devin Matthews for providing this patch, and to Stefano Zampini for reporting the issue (flame#497) that prompted Devin to propose the patch.

Details: - Defined eqsc, eqv, and eqm operations, which set a bool depending on whether the two scalars, two vectors, or two matrix operands are equal (element-wise). eqsc and eqv support implicit conjugation and eqm supports diagonal offset, diag, uplo, and trans parameters (in a manner consistent with other level-1m operations). These operations are currently housed under frame/util, at least for now, because they are not computational in nature. - Redefined bli_obj_equals() in terms of eqsc, eqv, and eqm. - Documented eqsc, eqv, and eqm in BLISObjectAPI.md and BLISTypedAPI.md. Also: - Documented getsc and setsc in both docs. - Reordered entry for setijv in BLISTypedAPI.md, and added separator bars to both docs. - Added missing "Observed object properties" clauses to various levle-1v entries in BLISObjectAPI.md. - Defined bli_apply_trans() in bli_param_macro_defs.h. - Defined supporting _check() function, bli_l0_xxbsc_check(), in bli_l0_check.c for eqsc. - Programming style and whitespace updates to bli_l1m_unb_var1.c. - Whitespace updates to bli_l0_oapi.c, bli_l1m_oapi.c - Consolidated redundant macro redefinition for copym function pointer type in bli_l1m_ft.h. - Added macros to bli_oapi_ba.h, _ex.h, and bli_tapi_ba.h, _ex.h that allow oapi and tapi source files to forego defining certain expert functions. (Certain operations such as printv and printm do not need to have both basic expert interfaces. This also includes eqsc, eqv, and eqm.)

Details: - Changed #ifdef BLIS_OAPI_BASIC to #ifdef BLIS_TAPI_BASIC in bli_util_ft.h. This typo was causing some types to be redefined when they weren't supposed to be.

Details: - Added frame/include/bli_xapi_undef.h, which explicitly undefines all macros defined in bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and bli_tapi_ex.h. (This is for safety and good cpp coding practice, not because it fixes anything.) - Added #include "bli_xapi_undef.h" to bli_l1v.h, bli_l1d.h, bli_l1f.h, bli_l1m.h, bli_l2.h, bli_l3.h, and bli_util.h. - Comment updates to bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and bli_tapi_ex.h. - Moved frame/3/bli_l3_ft_ex.h to local 'old' directory after realizing that nothing in BLIS used those function pointer types. Also commented out the "#include bli_l3_ft_ex.h" directive in frame/3/bli_l3.h.

Details: - Inserted a "#include bli_xapi_undef.h" after each usage of the basic and expert API macro setup headers: bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and bli_tapi_ex.h. This is functionally equivalent to the previous status quo, in which each header made minimal #undef prior to its own definitions and then a single instance of "#include bli_xapi_undef.h" cleaned up any remaining macro defs after all other headers were used. This commit will guarantee that macro defs from the setup of one header (say, bli_oapi_ex.h) don't "infect" the definitions made in a subsequent header. As with this previous commit, this change does not fix any issue but rather attempts to avoid creating orphaned macro definitions that are only needed within a very limited scope. - Removed minimal #undef from bli_?api_[ba|ex].h. - Removed old commented-out lines from bli_?api_[ba|ex].h.

Details: - Added 512-bit specific 'a64fx' subconfiguration that uses empirically tuned block size by Stepan Nassyr. This subconfig also sets the sector cache size and enables memory-tagging code in SVE gemm kernels. This subconfig utilizes (16, k) and (10, k) DPACKM kernels. - Added a vector-length agnostic 'armsve' subconfiguration that computes blocksizes according to the analytical model. This part is ported from Stepan Nassyr's repository. - Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE at size (2*VL, 10). These kernels use unindexed FMLA instructions because indexed FMLA takes 2 FMA units in many implementations. PS: There are indexed-FLMA kernels in Stepan Nassyr's repository. - Implemented 512-bit SVE dpackm kernels with in-register transpose support for sizes (16, k) and (10, k). - Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for size (12, k). This dpackm kernel is not currently used by any subconfiguration. - Implemented several experimental dgemmsup kernels which would improve performance in a few cases. However, those dgemmsup kernels generally underperform hence they are not currently used in any subconfig. - Note: This commit squashes several commits submitted by RuQing Xu via PR flame#424.

- Updated distro to 20.04 focal aarch64-gcc-10. This is minimal version required by aarch64-gcc-10. SVE intrinsics would not compile without GCC >=10. - x86 toolchains use official repo instead of ubuntu-toolchain-r/test. 20.04 focal is not supported by that PPA at the moment. - Add extra configuration-time options to .travis.yml. - Add Arm SVE entry to .travis.yml.

- ArmSVE don't test gemmt (seems Qemu-only problem); - Clang use TravisCI-provided version instead of fixing to clang-8 due to that clang-8 seems conflicting with TravisCI's clang-7.

- Removed `V=1` in make line - Removed `CFLAGS` in configure line - Restored `pwd` surrounding OOT line

AMD requested removal due to unclear licensing terms; original code was from stackoverflow. The function is unused but could easily be replaced by new implementation.

Remove `rm-dupls` function in common.mk.

Check the C compiler for a predefined macro `_WIN32` to indicate (cross-)compilation for Windows. Fixes flame#463.

Upgrade Travis CI for Arm SVE

Add explicit compiler check for Windows.

Details: - Updated the performance graphs (pdfs and pngs) for the Fugaku/a64fx entry within Performance.md, and also updated the experiment details accordingly. Thanks to RuQing Xu for re-running the BLIS and SSL2 experiments reflected in this commit. - In Performance.md, added an English translation of the project name under which the Fugaku results were gathered, courtesy of RuQing Xu.

Details: - Added a new sandbox called 'gemmlike', which implements sequential and multithreaded gemm in the style of gemmsup but also unconditionally employs packing. The purpose of this sandbox is to (1) avoid select abstractions, such as objects and control trees, in order to allow readers to better understand how a real-world implementation of high-performance gemm can be constructed; (2) provide a starting point for expert users who wish to build something that is gemm-like without "reinventing the wheel." Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi Parikh for requesting and inspiring this work. - The functions defined in this sandbox currently use the "bls_" prefix instead of "bli_" in order to avoid any symbol collisions in the main library. - The sandbox contains two variants, each of which implements gemm via a block-panel algorithm. The only difference between the two is that variant 1 calls the microkernel directly while variant 2 calls the microkernel indirectly, via a function wrapper, which allows the edge case handling to be abstracted away from the classic five loops. - This sandbox implementation utilizes the conventional gemm microkernel (not the skinny/unpacked gemmsup kernels). - Updated some typos in the comments of a few files in the main framework.

Apple+Arm64 requires additional "tagging" of local symbols.

- x7, x8: Used to store address for Alpha and Beta. As Alpha & Beta was not used in k-loops, use x0, x1 to load Alpha & Beta's addresses after k-loops are completed, since A & B's addresses are no longer needed there. This "ldr [addr]; -> ldr val, [addr]" would not cause much performance drawback since it is done outside k-loops and there are plenty of instructions between Alpha & Beta's loading and usage. - x9: Used to store cs_c. x9 is multiplied by 8 into x10 and not used any longer. Directly loading cs_c and into x10 and scale by 8 spares x9 straightforwardly. - x11, x12: Not used at all. Simply remove from clobber list. - x13: Alike x9, loaded and scaled by 8 into x14, except that x13 is also used in a conditional branch so that "cmp x13, #1" needs to be modified into "cmp x14, flame#8" to completely free x13. - x3, x4: Used to store next_a & next_b. Untouched in k-loops. Load these addresses into x0 and x1 after Alpha & Beta are both loaded, since then neigher address of A/B nor address of Alpha/Beta is needed.

Roughly the same as 916e1fa , additionally with x15 clobbering removed. - x15: Not used at all. Compilation w/ Clang shows warning about x18 reservation, but compilation itself is OK and all tests got passed.

Avoid x18 use in FP32 kernel: - C address lines x[18-26] renamed to x[19-27] (reg index +1) - Original role of x27 fulfilled by x5 which is free after k-loop pert. FP64 does not require changing since x18 is not used there.

Details: - Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the kappa scalar was incorrectly loaded at an offset of 8 bytes (instead of 4 bytes) from the real component. This was almost certainly a copy- paste bug carried over from the corresonding zpackm kernels. Thanks to Devin Matthews for bringing this to my attention. - Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and bls_gemm_bp_var2.c that initializes the elements of the temporary microtile to zero. (This bug was never observed in output but rather noticed analytically. It probably would have also manifested as intermittent failures, this time involving edge cases.) - Minor commented-out/disabled changes to testsuite/src/test_gemm.c relating to debugging.

Horizontal subtraction instructions added to bli_x86_asm_macros.h, currently unused [ci skip].

Details: - Disabled a sanity check in bli_pool_finalize() that was meant to alert the user if a pool_t was being finalized while some blocks were still checked out. However, this is exactly the situation that might happen when a pool_t is re-initialized for a larger blocksize, and currently bli_pool_reinit() is implemeneted as _finalize() followed by _init(). So, this sanity check is not universally appropriate. Thanks to AMD-India for reporting this issue.

…ite objects. This fixes a bug where "make -j<N> check" may fail after a change to one or more header files, or where testsuite code doesn't get properly recompiled after internal changes.

Add dependency on the "flat" blis.h file for the BLIS and BLAS testuite objects.

Clean up some warnings that show up on clang/OSX.

Implement proposed new function pointer fields for obj_t.

Details: - Changed the implementation in the 'gemmlike' sandbox to more easily allow others to provide custom implementations of packm. These changes include: - Calling a local version of packm_cxk() that can be modified. This version of packm_cxk() uses inlined loops in packm_cxk() rather than querying the context for packm kernels (or even using scal2m). - Providing two variants of packm, one of which calls the aforementioned packm_cxk(), the other of which inlines the contents of packm_cxk() into the variant itself, making it self-contained. To switch from one to the other, simply change which function gets called within bls_packm_a() and bls_packm_b(). - Simplified and cleaned up some variant names in both variants of packm, relative to their parent code.

Details: - Added code to the gemmlike sandbox that handles parameter checking. Previously, the gemmlike implementation called bli_gemm_check(), which resides within the BLIS framework proper. Certain modifications that a user may wish to perform on the sandbox, such as adding a new matrix or vector operand, would have required additional checks, and so these changes make it easier for such a person to implement those checks for their custom gemm-like operation.

Details: - In the gemmlike sandbox, changed the loop index variable of inner loop of packm_cxk() from 'd' to 'i' (and likewise for the corresponding inlined code within packm_var2()). - Pack matrices A and B using packm_var1() instead of packm_var2().

Details: - Moved miscellaneous language-related definitions, including defs related to the handling of the 'restrict' keyword, from the top half of bli_macro_defs.h into a new file, bli_lang_defs.h, which is now #included immediately after "bli_system.h" in blis.h. This change is an attempt to fix a report of recent breakage of C++ compilers due to the recent introduction of 'restrict' in bli_type_defs.h (which previously was being included *before* bli_macro_defs.h and its restrict handling therein. Thanks to Ivan Korostelev for reporting this issue in flame#527. - CREDITS file update.

…atible.

Details: - Prohibit use of clang 10.x and older or gcc 9.x and older for the 'armsve' subconfiguration. Addresses issue flame#535.

Details: - Updated two out-of-date calls to bli_malloc_intl() within the gemmlike sandbox. These calls to malloc_intl(), which resided in bls_l3_decor_pthreads.c, were missing the err_t argument that the function uses to report errors. Thanks to Jeff Diamond for helping isolate this issue.

Details: - Modified bli_system.h so that the cpp macro BLIS_OS_NONE is defined when BLIS_DISABLE_SYSTEM is defined. Otherwise, the previous OS- detecting macro conditionals are considered. This change is to accommodate a solution to a cross-compilation issue described in flame#532.

Details: - Reverted changes in 8e0c425 due to AppVeyor build failures that we do not yet understand.

Add test to Travis using C++ compiler to make sure blis.h is C++-compatible

These code are not used anywhere now.

For those files, use R/W reg for asm args passing instead of memory.

xrq-phys · 2021-09-21T07:02:52Z

Create separate branch.

fgvanzee and others added 30 commits March 27, 2021 15:15

Merge branch 'master' into dev

f9ad55c

Allow clang for ThunderX2 config

6548ceb

Needed for compiling on e.g. Mac M1. AFAIK clang supports the same -mcpu flag for ThunderX2 as gcc.

Fixed typo in cpp guard in bli_util_ft.h.

5aa63cd

Details: - Changed #ifdef BLIS_OAPI_BASIC to #ifdef BLIS_TAPI_BASIC in bli_util_ft.h. This typo was causing some types to be redefined when they weren't supposed to be.

Adjust TravisCI

bd156a2

- ArmSVE don't test gemmt (seems Qemu-only problem); - Clang use TravisCI-provided version instead of fixing to clang-8 due to that clang-8 seems conflicting with TravisCI's clang-7.

Travis CI Revert Unnecessary Extras from 91d3636

932dfe6

- Removed `V=1` in make line - Removed `CFLAGS` in configure line - Restored `pwd` surrounding OOT line

Remove rm-dupls function in common.mk.

859fb77

AMD requested removal due to unclear licensing terms; original code was from stackoverflow. The function is unused but could easily be replaced by new implementation.

Merge pull request flame#502 from flame/rm-rm-dupls

6d4ab02

Remove `rm-dupls` function in common.mk.

Add explicit compiler check for Windows.

5feb04e

Check the C compiler for a predefined macro `_WIN32` to indicate (cross-)compilation for Windows. Fixes flame#463.

Merge pull request flame#500 from xrq-phys/armsve+travis

cbd8d39

Upgrade Travis CI for Arm SVE

Merge pull request flame#503 from flame/windows-compiler-check

e5c85da

Add explicit compiler check for Windows.

Asm Flag Mingling for Darwin_Aarch64

7fabd89

Apple+Arm64 requires additional "tagging" of local symbols.

Armv8A Rename Regs for Clang Compile: FP32 Part

9f4a4a3

Roughly the same as 916e1fa , additionally with x15 clobbering removed. - x15: Not used at all. Compilation w/ Clang shows warning about x18 reservation, but compilation itself is OK and all tests got passed.

Armv8A Rename Regs for Safe Darwin Compile

5fc93e2

Avoid x18 use in FP32 kernel: - C address lines x[18-26] renamed to x[19-27] (reg index +1) - Original role of x27 fulfilled by x5 which is free after k-loop pert. FP64 does not require changing since x18 is not used there.

Add vhsubpd/vhsubpd.

7c3eb44

Horizontal subtraction instructions added to bli_x86_asm_macros.h, currently unused [ci skip].

fgvanzee and others added 28 commits August 12, 2021 14:44

Add dependency on the "flat" blis.h file for the BLIS and BLAS testsu…

ec06b6a

…ite objects. This fixes a bug where "make -j<N> check" may fail after a change to one or more header files, or where testsuite code doesn't get properly recompiled after internal changes.

Remove schema field on obj_t (redundant) and add new API functions.

3cddce1

Clean up some warnings that show up on clang/OSX.

4f70eb7

Add row- and column-strides for A/B in obj_ukr_fn_t.

1772db0

Merge pull request flame#529 from flame/fix_make_check_dependencies

e6d68bc

Add dependency on the "flat" blis.h file for the BLIS and BLAS testuite objects.

Merge pull request flame#530 from flame/fix_clang_warnings

c99fae5

Clean up some warnings that show up on clang/OSX.

Whitespace tweaks.

4b8ed99

Merge pull request flame#527 from flame/obj_t_makeover

2c0b415

Implement proposed new function pointer fields for obj_t.

README.md citation updates (e.g. BLIS7 bibtex).

7144230

Minor tweaks to gemmlike sandbox.

3b275f8

Details: - In the gemmlike sandbox, changed the loop index variable of inner loop of packm_cxk() from 'd' to 'i' (and likewise for the corresponding inlined code within packm_var2()). - Pack matrices A and B using packm_var1() instead of packm_var2().

Add test to Travis using C++ compiler to make sure blis.h is C++-comp…

5fc65cd

…atible.

Merge branch 'master' into cxx_test

eaea674

Blacklist clang10/gcc9 and older for 'armsve'.

2f7325b

Details: - Prohibit use of clang 10.x and older or gcc 9.x and older for the 'armsve' subconfiguration. Addresses issue flame#535.

Disabled (at least temporarily) commit 8e0c425.

2be78fc

Details: - Reverted changes in 8e0c425 due to AppVeyor build failures that we do not yet understand.

Updated travis-ci.org link in README.md to .com.

ade10f4

Attempt to fix cxx-test for OOT builds.

fbb3560

Use C++ cross-compiler for ARM tests.

e486d66

Fix AArch64 tests and consolidate some other tests.

c76fcad

Do a fast test on OSX. [ci skip]

98ce6e8

Merge pull request flame#534 from flame/cxx_test

9293a68

Add test to Travis using C++ compiler to make sure blis.h is C++-compatible

Arm SVE: Remove Experimental GEMMSUP Code

a3ae409

These code are not used anywhere now.

Arm SVE: Address Clang Compilation

5766d68

For those files, use R/W reg for asm args passing instead of memory.

xrq-phys closed this Sep 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flame Upstream Since Mar, 2021#1

Flame Upstream Since Mar, 2021#1
xrq-phys wants to merge 90 commits intoamdfrom
armsve-clang

xrq-phys commented Sep 21, 2021

Uh oh!

xrq-phys commented Sep 21, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

xrq-phys commented Sep 21, 2021

Uh oh!

xrq-phys commented Sep 21, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants