Added an 'arm64' entry to .travis.yml. #726

Merged
fgvanzee merged 6 commits into master from travis_arm64_build
Feb 18, 2023

Conversation

@fgvanzee
Member

Details:

  • Added a new arm64 entry to the .travis.yml file in an attempt to get Travis CI to compile both NEON and SVE kernels, even if only NEON kernels are exercised in the testing. With this new arm64 entry, the cortexa57 entry becomes redundant and may be removed. Thanks to RuQing Xu for this suggestion.

cc: @xrq-phys

@fgvanzee fgvanzee requested a review from xrq-phys February 13, 2023 16:58
@fgvanzee fgvanzee self-assigned this Feb 13, 2023
@fgvanzee
Member Author

fgvanzee commented Feb 13, 2023

@xrq-phys Looks like we may need to provide a non-default value for one or more of the compile-time constants in bli_family_arm64.h:

libblis: Configured maximum stack buffer size is insufficient for register blocksizes currently in use.
libblis: Aborting.

The check fails if mr * nr * dt_size > BLIS_STACK_BUF_MAX_SIZE for any datatype size (dt_size).

BLIS_STACK_BUF_MAX_SIZE is not currently defined in the arm64 configuration, so it gets the default value of ( BLIS_SIMD_MAX_NUM_REGISTERS * BLIS_SIMD_MAX_SIZE * 2 ). BLIS_SIMD_MAX_NUM_REGISTERS is already defined at 32, but BLIS_SIMD_MAX_SIZE still has the default value of ~~32~~ 64 (bytes).

What value should we increase BLIS_SIMD_MAX_SIZE to in order to ensure enough stack buffer space?
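
To make the arithmetic concrete, here is a minimal sketch of the check as described above. It is not the BLIS source: the macro names drop the BLIS_ prefix, `stack_buf_is_sufficient` is a hypothetical helper, and the tile sizes mentioned afterwards are made-up examples rather than the actual arm64 register blocksizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values quoted in the discussion: the register count and the default
   SIMD width in bytes. */
#define SIMD_MAX_NUM_REGISTERS 32
#define SIMD_MAX_SIZE          64

/* Default formula for the stack buffer limit, in bytes. */
#define STACK_BUF_MAX_SIZE ( SIMD_MAX_NUM_REGISTERS * SIMD_MAX_SIZE * 2 )

/* The arithmetic from the thread: 32 * 64 * 2 = 4096 bytes. */
static_assert( STACK_BUF_MAX_SIZE == 4096, "default limit should be 4096 bytes" );

/* Hypothetical helper: true when an mr x nr microtile whose elements are
   dt_size bytes each fits within the stack buffer limit. */
static bool stack_buf_is_sufficient( size_t mr, size_t nr, size_t dt_size )
{
    return mr * nr * dt_size <= STACK_BUF_MAX_SIZE;
}
```

With these defaults, a hypothetical 32 x 16 tile of 8-byte elements (4096 bytes) just fits, while a 32 x 32 tile of the same elements (8192 bytes) would trigger the abort.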

@devinamatthews
Member

If I read the name of the macro as its intent, then BLIS_SIMD_MAX_SIZE should be the largest SIMD vector size that BLIS knows about, or 64 bytes by default.

@fgvanzee
Member Author

> If I read the name of the macro as its intent, then BLIS_SIMD_MAX_SIZE should be the largest SIMD vector size that BLIS knows about, or 64 bytes by default.

Generally I agree. But with arm SVE, can't vectors get up to 4096 bits (512 bytes)?

@devinamatthews
Member

Yes, but no processor like that currently exists.

@fgvanzee
Member Author

> Yes, but no processor like that currently exists.

Okay, so we'll change the default BLIS_SIMD_MAX_SIZE to 64 bytes, and the BLIS_SIMD_MAX_SIZE for the arm64 configuration to 512 bytes (or whatever @xrq-phys suggests).

@fgvanzee
Member Author

Oops, looks like I mis-quoted the code. The default BLIS_SIMD_MAX_SIZE is already 64 bytes. I probably was looking at BLIS_SIMD_MAX_NUM_REGISTERS when I made the mistake.

Details:
- Previously, the macro BLIS_SIMD_MAX_SIZE was *not* being set in
  bli_family_arm64.h, which meant that the default value of 64 was
  being used. This caused a runtime consistency check to fail in
  bli_gks.c, one which requires that

    mr * nr * dt_size <= BLIS_STACK_BUF_MAX_SIZE

  for all datatype sizes dt_size, where BLIS_STACK_BUF_MAX_SIZE is
  defined as

    BLIS_SIMD_MAX_NUM_REGISTERS * BLIS_SIMD_MAX_SIZE * 2

  This commit sets BLIS_SIMD_MAX_SIZE to 512 for the 'arm64'
  configuration, thus overriding the default and (hopefully) avoiding
  the aforementioned consistency check failures.
@devinamatthews
Member

I don't understand, how can 32 * 64 * 2 = 4096 bytes not be enough stack buffer size?

@fgvanzee
Member Author

> I don't understand, how can 32 * 64 * 2 = 4096 bytes not be enough stack buffer size?

I guess because mr * nr * dt_size is still greater than 4096 bytes for some datatype.

@devinamatthews
Member

Then how does it fit in registers?

@xrq-phys
Collaborator

Hey.

Thanks for letting me know and I'm sorry for the issue.
I would like to change /travis/do_testsuite.sh to include the contents of output.testsuite after any failing test runs.
Would you mind applying the patch?

diff --git a/travis/do_testsuite.sh b/travis/do_testsuite.sh
index 6778f81d..af2a575c 100755
--- a/travis/do_testsuite.sh
+++ b/travis/do_testsuite.sh
@@ -9,27 +9,27 @@ export BLIS_JR_NT=1
 export BLIS_IR_NT=1
 
 if [ "$TEST" = "FAST" -o "$TEST" = "ALL" ]; then
-    make testblis-fast
+    make testblis-fast || cat ./output.testsuite
     $DIST_PATH/testsuite/check-blistest.sh ./output.testsuite
 fi
 
 if [ "$TEST" = "MD" -o "$TEST" = "ALL" ]; then
-       make testblis-md
+       make testblis-md || cat ./output.testsuite
     $DIST_PATH/testsuite/check-blistest.sh ./output.testsuite
 fi
 
 if [ "$TEST" = "SALT" -o "$TEST" = "ALL" ]; then
        # Disable multithreading within BLIS.
        export BLIS_JC_NT=1 BLIS_IC_NT=1 BLIS_JR_NT=1 BLIS_IR_NT=1
-       make testblis-salt
+       make testblis-salt || cat ./output.testsuite
     $DIST_PATH/testsuite/check-blistest.sh ./output.testsuite
 fi
 
 if [ "$TEST" = "1" -o "$TEST" = "ALL" ]; then
-    make testblis
+    make testblis || cat ./output.testsuite
     $DIST_PATH/testsuite/check-blistest.sh ./output.testsuite
 fi
 
-make testblas
+make testblas || cat ./output.testsuite
 $DIST_PATH/blastest/check-blastest.sh

(i.e., append || cat ./output.testsuite to all make test* runs.)

Details:
- Removed debug output from bli_check.c
- Lowered BLIS_SIMD_MAX_SIZE from 512 to 128 in bli_family_arm64.h
- Appended '|| cat ./output.testsuite' to all 'make' commands in
  travis/do_testsuite.sh. Thanks to RuQing Xu for this suggestion.
@fgvanzee
Member Author

> Would you mind applying the patch?

This slipped by me yesterday -- sorry. I've applied the patch. I also lowered BLIS_SIMD_MAX_SIZE for arm64 to 128. This should still be plenty of space even with the qemu bug.
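
For reference, a sketch of how a per-configuration override of this kind can interact with a default value. It assumes the defaults are guarded by `#ifndef`, which is an assumption about the header layout and not a quote of the BLIS source; `stack_buf_limit` is a hypothetical helper added only so the result can be inspected.

```c
#include <assert.h>
#include <stddef.h>

/* Configuration-specific header (sketch): override the SIMD width, in
   bytes, with the value chosen for arm64 in this PR. */
#define BLIS_SIMD_MAX_SIZE 128

/* Framework defaults (sketch, assumed layout): each one applies only if
   the configuration did not already define the macro. */
#ifndef BLIS_SIMD_MAX_NUM_REGISTERS
#define BLIS_SIMD_MAX_NUM_REGISTERS 32
#endif

#ifndef BLIS_SIMD_MAX_SIZE
#define BLIS_SIMD_MAX_SIZE 64
#endif

#ifndef BLIS_STACK_BUF_MAX_SIZE
#define BLIS_STACK_BUF_MAX_SIZE \
    ( BLIS_SIMD_MAX_NUM_REGISTERS * BLIS_SIMD_MAX_SIZE * 2 )
#endif

/* With the override in place: 32 * 128 * 2 = 8192 bytes. */
static_assert( BLIS_STACK_BUF_MAX_SIZE == 8192, "limit should be 8192 bytes" );

/* Hypothetical helper: the resulting stack buffer limit in bytes. */
static size_t stack_buf_limit( void )
{
    return BLIS_STACK_BUF_MAX_SIZE;
}
```

Under these assumptions the limit is 8192 bytes, twice the 4096 bytes given by the 64-byte default; the earlier value of 512 would have given 32 * 512 * 2 = 32768 bytes.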

@fgvanzee
Member Author

@devinamatthews @xrq-phys Are you both good with this PR now?

@devinamatthews
Member

OK

@xrq-phys
Collaborator

xrq-phys commented Feb 17, 2023 via email

@fgvanzee
Member Author

Thank you both!

@fgvanzee fgvanzee merged commit 0b421ef into master Feb 18, 2023
@fgvanzee fgvanzee deleted the travis_arm64_build branch February 18, 2023 19:11
ct-clmsn pushed a commit to ct-clmsn/blis that referenced this pull request Jul 29, 2023
Details:
- Added a new 'arm64' entry to the .travis.yml file in an attempt to get
  Travis CI to compile both NEON and SVE kernels, even if only NEON
  kernels are exercised in the testing. With this new 'arm64' entry, the
  'cortexa57' entry becomes redundant and may be removed. Thanks to
  RuQing Xu for this suggestion.
- Previously, the macro BLIS_SIMD_MAX_SIZE was *not* being set in
  bli_family_arm64.h, which meant that the default value of 64 was
  being used. This caused a runtime consistency check to fail in
  bli_gks.c (in Travis CI), one which requires that

    mr * nr * dt_size <= BLIS_STACK_BUF_MAX_SIZE

  for all datatype sizes dt_size, where BLIS_STACK_BUF_MAX_SIZE is
  defined as

    BLIS_SIMD_MAX_NUM_REGISTERS * BLIS_SIMD_MAX_SIZE * 2

  This commit increases BLIS_SIMD_MAX_SIZE to 128 for the 'arm64'
  configuration, thus overriding the default and (hopefully) avoiding
  the aforementioned consistency check failures.
- Appended '|| cat ./output.testsuite' to all 'make' commands in
  travis/do_testsuite.sh. Thanks to RuQing Xu for this suggestion.
- Whitespace changes.
fgvanzee added a commit that referenced this pull request May 20, 2024
Details:
- Restored general storage case in armsve kernels.
- Reason for doing this: Though real `g`-storage is difficult to
  speed up, the `g`-codepath here can provide good support for
  transposed storage, i.e., it is at least good for `GEMM_UKR_SETUP_CT_AMBI`.
- In our experience, this solution is only *a little* slower than in-reg
  transpose. Plus, in-reg transpose is only possible for a fixed VL in
  our case.
- (cherry picked from commit 4e18cd3)

Refined emacs handling of indentation. (#717)

Details:
- This refines the emacs autoformatting to be better in line with
  contribution guidelines.
- Removed a stray shebang in a .mk file which confuses emacs about the
  file mode, which should be makefile-mode. (emacs also removes stray
  whitespace at the ends of lines.)
- (cherry picked from 0ba6e9e)

Updated hpx namespace for make_count_shape. (#725)

Details:
- The hpx namespace for *counting_shape changed. This PR updates the use
  of counting_shape in blis to comply with the change in hpx.
- Co-authored-by: ctaylor
- (cherry picked from 059f151)

Added an 'arm64' entry to `.travis.yml`. (#726)

Details:
- Added a new 'arm64' entry to the .travis.yml file in an attempt to get
  Travis CI to compile both NEON and SVE kernels, even if only NEON
  kernels are exercised in the testing. With this new 'arm64' entry, the
  'cortexa57' entry becomes redundant and may be removed. Thanks to
  RuQing Xu for this suggestion.
- Previously, the macro BLIS_SIMD_MAX_SIZE was *not* being set in
  bli_family_arm64.h, which meant that the default value of 64 was
  being used. This caused a runtime consistency check to fail in
  bli_gks.c (in Travis CI), one which requires that

    mr * nr * dt_size <= BLIS_STACK_BUF_MAX_SIZE

  for all datatype sizes dt_size, where BLIS_STACK_BUF_MAX_SIZE is
  defined as

    BLIS_SIMD_MAX_NUM_REGISTERS * BLIS_SIMD_MAX_SIZE * 2

  This commit increases BLIS_SIMD_MAX_SIZE to 128 for the 'arm64'
  configuration, thus overriding the default and (hopefully) avoiding
  the aforementioned consistency check failures.
- Appended '|| cat ./output.testsuite' to all 'make' commands in
  travis/do_testsuite.sh. Thanks to RuQing Xu for this suggestion.
- Whitespace changes.
- (cherry picked from 0b421ef)

Redirect grep stderr to /dev/null. (#723)

Details:
- In common.mk, added a redirection of stderr to /dev/null for the grep
  command being used to gather a list of header files #included from
  bli_cntx_ref.c. The redirection is desirable because as of grep 3.8,
  regular expressions with "stray" backslashes trigger warnings [1].
  But removing the backslash seems to break the BLIS build system when
  using pre-3.8 versions of grep, so this seems to be the easiest way to
  satisfy the BLIS build system for both pre- and post-3.8 grep
  environments.

  [1] https://lists.gnu.org/archive/html/info-gnu/2022-09/msg00001.html
- (cherry picked from b1d3fc7)

Added runtime selection of 'power' config family. (#718)

Details:
- Created a 'power' umbrella configuration family, which, when targeted
  at configure-time, will build both 'power9' and 'power10' subconfigs.
  (With this feature, a BLIS shared library could be compiled on a
  power9 system and run on power10 and vice-versa. Unoptimised code
  will execute if it is linked and run on any other generic system.)
- This new configuration family will only work with gcc, since that is
  the only compiler supported by both power9 and power10 subconfigs in
  BLIS.
- Documented power9 and power10 as supported microarchitectures in the
  docs/HardwareSupport.md document.
- (cherry picked from e3d352f)

Define `BLIS_VERSION_STRING` in `blis.h`. (#720)

Details:
- Previously, the version string was communicated from configure to
  config.mk (via the config.mk.in template), where it was included via
  the top-level Makefile, where it was then used to define the
  preprocessor macro BLIS_VERSION_STRING via a command line argument to
  the compiler (via -D). This macro is then used within bli_info.c to
  initialize a static string which can then be queried via the
  bli_info_get_version_str() function. However, there are some
  applications that may find utility in being able to access the version
  string by inspecting the monolithic (flattened) blis.h header file
  that is created at compile time and installed alongside the library.
  This commit moves the definition of BLIS_VERSION_STRING into
  bli_config.h (via the bli_config.h.in template) so that it is
  embedded in blis.h. The version string is now available in three
  places:
  - the static/shared library, which is installed in the 'lib'
    subdirectory of the install prefix (query-able via the
    bli_info_get_version_str() function);
  - the config.mk makefile fragment, which is installed in the 'share'
    subdirectory of the install prefix (in the VERSION variable);
  - the blis.h header file, which is installed in the 'include'
    subdirectory of the install prefix (via the BLIS_VERSION_STRING
    macro constant).
  Thanks to Mohsen Aznaveh and Tim Davis for providing the idea for this
  change.
- CREDITS file update.
- (cherry picked from e730c68)
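
A minimal sketch of the header-based access described in this entry. In a real build the macro would come from the installed blis.h; the fallback definition here is a placeholder so the sketch compiles on its own, and `header_version` and `print_header_version` are hypothetical helpers, not BLIS functions.

```c
#include <assert.h>
#include <stdio.h>

/* Placeholder used only when blis.h has not been included. */
#ifndef BLIS_VERSION_STRING
#define BLIS_VERSION_STRING "0.0.0-placeholder"
#endif

/* The macro is a string literal, so it can be checked at compile time. */
static_assert( sizeof( BLIS_VERSION_STRING ) > 1, "version string must not be empty" );

/* Hypothetical helper: the version string seen by the compiler. */
static const char* header_version( void )
{
    return BLIS_VERSION_STRING;
}

/* Hypothetical helper: print the compile-time version. */
static void print_header_version( void )
{
    printf( "built against BLIS %s\n", header_version() );
}
```

This reports the version of the header the application was compiled against, whereas bli_info_get_version_str() reports the version of the library that is linked.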

Typecast printf() args to avoid compiler warnings. (#716)

Details:
- In bli_thread_range_tlb.c, typecast integer arguments passed to
  printf() -- which are typically disabled unless debugging -- to type
  "long" to guarantee a match to the "%ld" format specifiers used in
  those calls. This avoids spurious warnings with certain compilers in
  certain toolchain environments, such as 32-bit RISC-V (rv32iv).
- (cherry picked from dc5d00a)
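
The casting pattern this entry describes can be sketched as follows. `dim_sketch_t` and `format_range` are hypothetical names used for illustration; they are not taken from bli_thread_range_tlb.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for an integer type whose width depends on the
   build (32 bits here, as it might be on a 32-bit target). */
typedef int32_t dim_sketch_t;

/* Writes "[start, end)" into buf. The casts make each argument a long, so
   it matches "%ld" regardless of how wide dim_sketch_t is. Returns the
   value returned by snprintf(). */
static int format_range( char* buf, size_t len, dim_sketch_t start, dim_sketch_t end )
{
    assert( buf != NULL && len > 0 );
    return snprintf( buf, len, "[%ld, %ld)", ( long )start, ( long )end );
}
```

Without the casts, passing a 32-bit integer where "%ld" expects a long is a format mismatch on platforms where the two types differ, which is what the compiler warnings point out.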

Use here-document for 'configure --help' output. (#714)

Details:
- Changed the configure script function that outputs "--help" text to do
  so via so-called "here-document" syntax for improved readability and
  maintainability. The change eliminates hundreds of echo statements and
  makes it easier to change existing configure options' help text, along
  with other benefits such as eliminating the need to escape double-
  quote characters (").
- (cherry picked from ecbcf40)

Merge tlb- and slab/rr-specific gemm macrokernels. (#711)

Details:
- Merged the tlb-specific gemm macrokernel (_var2b) with the slab/rr-
  specific one (var2) so that a single function can be compiled with
  either tlb or slab/rr support, depending on which of the
  BLIS_ENABLE_JRIR_TLB, _SLAB, and _RR macros is defined. This is done by incorporating
  information from both approaches: the start/end/inc for the JR and IR
  loops from slab or rr partitioning; and the number of assigned
  microtiles, plus the starting IR dimension offset for all iterations
  after the first (ir_next). With these changes, slab, rr, and tlb can
  all be parameterized by initializing a similar set of variables prior
  to the jr loop.
- Removed the wrap-around logic that sets the "b_next" field of the
  auxinfo_t struct, which executes during the last IR iteration of the
  last JR iteration. The potential benefit of this code is so minor
  (and hinges on the microkernel making use of the b_next field) that
  it's arguably not worth including. The code also does the wrong
  thing for some threads whenever JR_NT > 1, since only thread 0 (in the
  JR group) would even compute with the first micropanel of B.
- Re-expressed the definition of bli_is_last_iter_slrr so that slab and
  tlb use the same code rather than rr and tlb.
- Adjusted the initialization of the gemm control tree accordingly.
- (cherry picked from c334ec2)

Fixed mis-mapped instruction for VEXTRACTF64X2. (#713)

Details:
- This commit fixes a typo in the macro definition for the extended
  inline assembly macro VEXTRACTF64X2 in bli_x86_asm_macros.h. The macro
  was previously defined (incorrectly) in terms of the vextractf64x4
  instruction rather than vextractf64x2.
- CREDITS file update.
- (cherry picked from 5793a77)

Defined lt, lte, gt, gte + misc. other updates. (#712)

Details:
- Changed invertsc operation to be a non-destructive operation; that is,
  it now takes separate input and output operands. This change applies
  to both the object and typed APIs.
- Defined an alternative square root operation, sqrtrsc, which, when
  operating on complex scalars, assumes the imaginary part of the input
  to be zero.
- Changed the semantics of addm, subm, copym, axpym, scal2m, and xpbym
  so that when the source matrix has an implicit unit diagonal, the
  operation leaves the diagonal of the destination matrix untouched.
  Previously, the operations would interpret an implicit unit diagonal
  on the source matrix as a request to manifest the unit diagonal
  *explicitly* on output (either as something to copy in the case of
  copym, or something to compute with in the cases of addm, subm, axpym,
  scal2m, and xpbym). It turns out that this behavior was too cute by
  half and could cause unintended headaches for practical use cases.
  (This change in behavior also required small modifications to the trmv
  and trsv testsuite modules so that they would properly test matrices
  with unit diagonals.)
- Added missing dependencies for copym to gemv, ger, hemv, trmv, and
  trsv testsuite modules.
- Implemented level-0-like ltsc, ltesc, gtsc, gtesc operations in
  frame/util, which use lt, lte, gt, and gte level-0 scalar macros.
- Trivial variable rename in bli_part.c to harmonize with other
  variable naming conventions.
- (cherry picked from 16d2e9e)

Implement cntx_t pointer caching in gks. (#709)

Details:
- Refactored the gks cntx_t query functions so that: (1) there is a
  clearer pattern of similarity between functions that query a native
  context and those that query its induced (1m) counterpart; and (2)
  queried cntx_t pointers (for both native and induced cntx_t pointers)
  are cached (by default), or deep-queried upon each invocation,
  depending on whether cpp macro BLIS_ENABLE_GKS_CACHING is defined.
- Refactored query-related functions in bli_arch.c to cache the queried
  arch_t value (by default), or deep-query the arch_t value upon each
  invocation, depending on whether cpp macro BLIS_ENABLE_GKS_CACHING is
  defined.
- Tweaked the behavior of bli_gks_query_ind_cntx_impl() (formerly named
  bli_gks_query_ind_cntx()) so that the induced method cntx_t struct is
  repopulated each time the function is called. (It is still only
  allocated once on first call.) This was mostly done in preparation for
  some future in which the arch_t value might change at runtime. In such
  a scenario, the induced method context would need to be recalculated
  any time the native context changes.
- Added preprocessor logic to bli_config_macro_defs.h to handle enabling
  or disabling of cntx_t pointer caching (via BLIS_ENABLE_GKS_CACHING).
- For now, cntx_t pointer caching is enabled by default and does not
  correspond to any official configure option. Disabling can be done
  by inserting a #define for BLIS_DISABLE_GKS_CACHING into the
  appropriate bli_family_*.h header file within the configuration of
  interest.
- Thanks to Harihara Sudhan S (AMD) for suggesting that cntx_t pointers
  (and not just arch_t values) be cached.
- Comment updates.
- (cherry picked from 9a366b1)

Fixing type-mismatch errors in power10 sandbox (#701)

Details:
- This commit fixes a mismatch between the function type signature of
  bli_gemm_ex() required by BLIS and the version of the function defined
  within the power10 sandbox. It also performs typecasting upon calling
  bli_gemm_front() to attain type consistency with the type signature
  defined by BLIS for bli_gemm_front().
- (cherry picked from b895ec9)

Define new global scalar (obj_t) constants. (#703)

Details:
- This commit defines the following new global scalar constants:
  - BLIS_ONE_I: This constant encodes the imaginary unit.
  - BLIS_MINUS_ONE_I: This constant encodes the negative imaginary unit.
  - BLIS_NAN: This constant encodes a not-a-number value. Both real and
    imaginary parts are set to NaN for complex datatypes.
- (cherry picked from 38d88d5)

Disable power10 kernels other than sgemm, dgemm. (#705)

Details:
- There is a power10 sandbox which uses microkernels for datatypes other
  than float and double (or scomplex/dcomplex). In a regular power10-
  configured build (that is, with the sandbox disabled), there were
  compile errors for some of these other non-sgemm/non-dgemm
  microkernels. This commit protects those kernels with a new cpp macro
  guard (which is defined in sandbox/power10/bli_sandbox.h) that
  prevents that kernel code from being compiled for normal, non-sandbox
  power10 builds.
- (cherry picked from cdb22b8)

Fix k = 0 edge case in power10 microkernels (#706)

Details:
- When power10 sgemm and dgemm microkernels are called with k = 0, they
  become caught in infinite loops and segfault. This is fixed now via an
  early exit in the case of k = 0.
- (cherry picked from d220f9c)
fgvanzee added a commit that referenced this pull request May 21, 2024

Fixed clang compiler warning in bli_l0_ft.h.

Details:
- Fixed a type redefinition in frame/0/bli_l0_ft.h that unintentionally
  slipped in with commit 02b5acd.