Skip to content

Arm micro-architecture dispatch#344

Merged
fgvanzee merged 23 commits intoflame:masterfrom
loveshack:arm64-cpuid
Oct 4, 2021
Merged

Arm micro-architecture dispatch#344
fgvanzee merged 23 commits intoflame:masterfrom
loveshack:arm64-cpuid

Conversation

@loveshack
Copy link
Copy Markdown
Contributor

This provides a cpuid-like mechanism for arm64, per #299.
After doing it, I realized it presumably has limited effect because it doesn't affect block sizes, though it does affect the ISA for ThunderX2, which I haven't been able to try yet. (OpenBLAS does have different kernel configurations for some of the micro-arches, but I don't know if significantly different.)
There are also comments about arm32 implementation if anyone is interested.

@loveshack loveshack mentioned this pull request Sep 25, 2019
Minor changes, as it was mostly there already.

Fixme:  Find out whether or not there's a way to avoid reading /proc,
as for arm64.  Otherwise, select the cpuinfo entry for the core we're
actually running on in case of big.little etc.
@loveshack loveshack changed the title Arm64 micro-architecture dispatch Arm micro-architecture dispatch Nov 4, 2019
@devinamatthews
Copy link
Copy Markdown
Member

@fgvanzee we should revisit this.

@fgvanzee
Copy link
Copy Markdown
Member

fgvanzee commented Aug 7, 2020

@loveshack What is the status of this PR? Is it ready to go as far as you know?

@loveshack
Copy link
Copy Markdown
Contributor Author

loveshack commented Aug 17, 2020 via email

@devinamatthews
Copy link
Copy Markdown
Member

@loveshack Leaving out some uarches for now is fine. Getting the base code in is what is important from this PR I think. I did a remerge recently but you should check my changes.

Not exposed as family yet.
Also treat ThunderX3 as ThunderX2.
devinamatthews added a commit that referenced this pull request Sep 7, 2020
Set BLIS_NUM_ARCHS automatically by inserting it as the last member in the enum. Also remove `=0` on first enum member since it is guaranteed by the standard. This make is easier and less error-prone to reorder enum entries.

This change was also included in changes I made to #344 and #345 in remerging them, but I thought it important to make this change earlier rather than later.
@devinamatthews
Copy link
Copy Markdown
Member

@loveshack is this mergeable? I know you had wanted to add some stuff like neoverse but I think getting those can be handled in smaller PRs later.

xrq-phys pushed a commit to xrq-phys/tblis that referenced this pull request Apr 15, 2021
"a64fx" was included twice and some other entries were mis-ordered.
- Remove stray SystemZ prototype.
- Add a "stub" for Apple Silicon support (and Neoverse N1).
- Slightly rework the detection logic.

`make check` passes on qemu-aarch64 with `configure auto` and `configure
arm64`.
@devinamatthews
Copy link
Copy Markdown
Member

@fgvanzee can you give this PR a final look? I think everything should pass now, except maybe the ARMSVE QEMU test (I can check this out if it does fail).

@devinamatthews
Copy link
Copy Markdown
Member

?? @fgvanzee all the Travis builds seem to be getting canceled?

@fgvanzee
Copy link
Copy Markdown
Member

?? @fgvanzee all the Travis builds seem to be getting canceled?

Actually, the first one failed after all of the jobs were in progress and running for a while. Then the second one was cancelled.

But yeah, no idea.

Only gcc for ARMSVE as of now due to the clang inline assembly register
usage issue.
Minor change to trigger Travis.

that just yields "v7l" on cortexa9.

*/
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if these comments are sort of duplicating with L994.

Maybe it'd be also good to add a comment here mentioning it as a fallback method for ARMv7?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually hadn't looked at the ARMv7 part. I will give it a once-over.

@xrq-phys
Copy link
Copy Markdown
Collaborator

Additionally, may I propose the following changes so that feature is correctly set under the new detection method?
(Although the variable seems not used at all.)

diff --git a/frame/base/bli_cpuid.c b/frame/base/bli_cpuid.c
index 3626f9bb..d704d825 100644
--- a/frame/base/bli_cpuid.c
+++ b/frame/base/bli_cpuid.c
@@ -956,7 +956,10 @@ et al
 #include <sys/sysctl.h>
 #endif
 
-static uint32_t get_coretype()
+static uint32_t get_coretype
+     (
+       uint32_t* features
+     )
 {
        int implementer = 0x00, part = 0x000;
 
@@ -983,7 +986,10 @@ static uint32_t get_coretype()
        }
        
        bool has_sve = getauxval( AT_HWCAP ) & HWCAP_SVE;
+       if (has_sve)
+         *features |= FEATURE_SVE;
 #endif //__linux__
+       *features |= FEATURE_NEON;
 
 #ifdef __APPLE__
        // Better values could be obtained from sysctlbyname()
@@ -1135,8 +1141,7 @@ uint32_t bli_cpuid_query
      )
 {
        *model    = MODEL_ARMV8;
-       *features = 0;
-       *part     = get_coretype();
+       *part     = get_coretype(features);
 
        return VENDOR_ARM;
 }

@devinamatthews
Copy link
Copy Markdown
Member

Additionally, may I propose the following changes so that feature is correctly set under the new detection method?
(Although the variable seems not used at all.)

I can add that, but at least for now the features aren't used in selecting the configuration (that is actually happening before get_coretype() returns...). has_sve is used there, but it really only kicks in for emulated SVE (QEMU) or some "unknown" SVE chip. In principle, I suppose the code should also check the SVE vector length?

I'll make the above changes for now and save these other things for PRs later, I just want to finally get this code in to master.

@devinamatthews
Copy link
Copy Markdown
Member

@fgvanzee can you give this a final blessing?

@devinamatthews
Copy link
Copy Markdown
Member

@fgvanzee pinging again, this should be ready to merge.

@fgvanzee
Copy link
Copy Markdown
Member

fgvanzee commented Oct 4, 2021

Yes, sorry for the delay. This looks fine to me.

(We may need to resolve conflicts since the PR is now a few weeks old.)

@fgvanzee fgvanzee merged commit d0a0b4b into flame:master Oct 4, 2021
@fgvanzee
Copy link
Copy Markdown
Member

fgvanzee commented Oct 4, 2021

No conflicts after all. Yay. 🎆

sireeshasanga pushed a commit to amd/blis that referenced this pull request Oct 11, 2024
* commit '81e10346':
  Alloc at least 1 elem in pool_t block_ptrs. (flame#560)
  Fix insufficient pool-growing logic in bli_pool.c. (flame#559)
  Arm SVE C/ZGEMM Fix FMOV 0 Mistake
  SH Kernel Unused Eigher
  Arm SVE C/ZGEMM Support *beta==0
  Arm SVE Config armsve Use ZGEMM/CGEMM
  Arm SVE: Update Perf. Graph
  Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0
  Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0
  A64FX Config Use ZGEMM/CGEMM
  Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg
  Arm SVE Add SGEMM 2Vx10 Unindexed
  Arm SVE ZGEMM Support Gather Load / Scatt. St.
  Arm SVE Add ZGEMM 2Vx10 Unindexed
  Arm SVE Add ZGEMM 2Vx7 Unindexed
  Arm SVE Add ZGEMM 2Vx8 Unindexed
  Update Travis CI badge
  Armv8 Trash New Bulk Kernels
  Enable testing 1m in `make check`.
  Config ArmSVE Unregister 12xk. Move 12xk to Old
  Revert __has_include(). Distinguish w/ BLIS_FAMILY_**
  Register firestorm into arm64 Metaconfig
  Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo
  Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo
  Add test for Apple M1 (firestorm)
  Firestorm CPUID Dispatcher
  Armv8 GEMMSUP Edge Cases Require Signed Ints
  Make error checking level a thread-local variable.
  Fix data race in testsuite.
  Update .appveyor.yml
  Firestorm Block Size Fixes
  Armv8 Handle *beta == 0 for GEMMSUP ??r Case.
  Move unused ARM SVE kernels to "old" directory.
  Add an option to control whether or not to use @rpath.
  Fix $ORIGIN usage on linux.
  Arm micro-architecture dispatch (flame#344)
  Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite inaries.
  Armv8 Handle *beta == 0 for GEMMSUP ?rc Case.
  Armv8 Fix 6x8 Row-Maj Ukr
  Apply patch from @xrq-phys.
  Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.
  bli_error: more cleanup on the error strings array
  Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9
  Arm SVE: Correct PACKM Ker Name: Intrinsic Kers
  Fix config_name in bli_arch.c
  Arm Whole GEMMSUP Call Route is Asm/Int Optimized
  Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref
  Header Typo
  Arm: DGEMMSUP ??r(rv) Invoke Edge Size
  Arm: DGEMMSUP ?rc(rd) Invoke Edge Size
  Arm: Implement GEMMSUP Fallback Method
  Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin
  Added Apple Firestorm (A14/M1) Subconfig
  Arm64 8x4 Kernel Use Less Regs
  Armv8-A Supplimentary GEMMSUP Sizes for RD
  Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm
  Armv8-A Adjust Types for PACKM Kernels
  Armv8-A GEMMSUP-RD 6x8m
  Armv8-A GEMMSUP-RD 6x8n
  Armv8-A s/d Packing Kernels Fix Typo
  Armv8-A Introduced s/d Packing Kernels
  Armv8-A DGEMMSUP 6x8m Kernel
  Armv8-A DGEMMSUP Adjustments
  Armv8-A Add More DGEMMSUP
  Armv8-A Add GEMMSUP 4x8n Kernel
  Armv8-A Add Part of GEMMSUP 8x4m Kernel
  Armv8A DGEMM 4x4 Kernel WIP. Slow
  Armv8-A Add 8x4 Kernel WIP

AMD-Internal: [CPUPL-2698]
Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants