Arm micro-architecture dispatch#344
Conversation
This only affects compiling with cpu-specific flags, not anything in the micro-kernel.
Minor changes, as it was mostly there already. Fixme: Find out whether or not there's a way to avoid reading /proc, as for arm64. Otherwise, select the cpuinfo entry for the core we're actually running on in case of big.little etc.
|
@fgvanzee we should revisit this. |
|
@loveshack What is the status of this PR? Is it ready to go as far as you know? |
|
@loveshack What is the status of this PR? Is it ready to go as far as you know?
Probably not now. It was all tested to the extent I had virtual
hardware, but needs merging and testing again. It also needs an
addition for a64fx/sve, presumably, and other micro-arches, like
neoverse. It's unclear to me to what extent aarch64 variants should be
configured individually.
|
|
@loveshack Leaving out some uarches for now is fine. Getting the base code in is what is important from this PR I think. I did a remerge recently but you should check my changes. |
Not exposed as family yet. Also treat ThunderX3 as ThunderX2.
Set BLIS_NUM_ARCHS automatically by inserting it as the last member in the enum. Also remove `=0` on first enum member since it is guaranteed by the standard. This make is easier and less error-prone to reorder enum entries. This change was also included in changes I made to #344 and #345 in remerging them, but I thought it important to make this change earlier rather than later.
|
@loveshack is this mergeable? I know you had wanted to add some stuff like neoverse but I think getting those can be handled in smaller PRs later. |
"a64fx" was included twice and some other entries were mis-ordered.
- Remove stray SystemZ prototype. - Add a "stub" for Apple Silicon support (and Neoverse N1). - Slightly rework the detection logic. `make check` passes on qemu-aarch64 with `configure auto` and `configure arm64`.
|
@fgvanzee can you give this PR a final look? I think everything should pass now, except maybe the ARMSVE QEMU test (I can check this out if it does fail). |
|
?? @fgvanzee all the Travis builds seem to be getting canceled? |
Actually, the first one failed after all of the jobs were in progress and running for a while. Then the second one was cancelled. But yeah, no idea. |
Only gcc for ARMSVE as of now due to the clang inline assembly register usage issue.
Minor change to trigger Travis.
|
|
||
| that just yields "v7l" on cortexa9. | ||
|
|
||
| */ |
There was a problem hiding this comment.
I'm wondering if these comments are sort of duplicating with L994.
Maybe it'd be also good to add a comment here mentioning it as a fallback method for ARMv7?
There was a problem hiding this comment.
I actually hadn't looked at the ARMv7 part. I will give it a once-over.
|
Additionally, may I propose the following changes so that diff --git a/frame/base/bli_cpuid.c b/frame/base/bli_cpuid.c
index 3626f9bb..d704d825 100644
--- a/frame/base/bli_cpuid.c
+++ b/frame/base/bli_cpuid.c
@@ -956,7 +956,10 @@ et al
#include <sys/sysctl.h>
#endif
-static uint32_t get_coretype()
+static uint32_t get_coretype
+ (
+ uint32_t* features
+ )
{
int implementer = 0x00, part = 0x000;
@@ -983,7 +986,10 @@ static uint32_t get_coretype()
}
bool has_sve = getauxval( AT_HWCAP ) & HWCAP_SVE;
+ if (has_sve)
+ *features |= FEATURE_SVE;
#endif //__linux__
+ *features |= FEATURE_NEON;
#ifdef __APPLE__
// Better values could be obtained from sysctlbyname()
@@ -1135,8 +1141,7 @@ uint32_t bli_cpuid_query
)
{
*model = MODEL_ARMV8;
- *features = 0;
- *part = get_coretype();
+ *part = get_coretype(features);
return VENDOR_ARM;
} |
I can add that, but at least for now the features aren't used in selecting the configuration (that is actually happening before I'll make the above changes for now and save these other things for PRs later, I just want to finally get this code in to master. |
|
@fgvanzee can you give this a final blessing? |
|
@fgvanzee pinging again, this should be ready to merge. |
|
Yes, sorry for the delay. This looks fine to me. (We may need to resolve conflicts since the PR is now a few weeks old.) |
|
No conflicts after all. Yay. 🎆 |
* commit '81e10346': Alloc at least 1 elem in pool_t block_ptrs. (flame#560) Fix insufficient pool-growing logic in bli_pool.c. (flame#559) Arm SVE C/ZGEMM Fix FMOV 0 Mistake SH Kernel Unused Eigher Arm SVE C/ZGEMM Support *beta==0 Arm SVE Config armsve Use ZGEMM/CGEMM Arm SVE: Update Perf. Graph Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0 Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0 A64FX Config Use ZGEMM/CGEMM Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg Arm SVE Add SGEMM 2Vx10 Unindexed Arm SVE ZGEMM Support Gather Load / Scatt. St. Arm SVE Add ZGEMM 2Vx10 Unindexed Arm SVE Add ZGEMM 2Vx7 Unindexed Arm SVE Add ZGEMM 2Vx8 Unindexed Update Travis CI badge Armv8 Trash New Bulk Kernels Enable testing 1m in `make check`. Config ArmSVE Unregister 12xk. Move 12xk to Old Revert __has_include(). Distinguish w/ BLIS_FAMILY_** Register firestorm into arm64 Metaconfig Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo Add test for Apple M1 (firestorm) Firestorm CPUID Dispatcher Armv8 GEMMSUP Edge Cases Require Signed Ints Make error checking level a thread-local variable. Fix data race in testsuite. Update .appveyor.yml Firestorm Block Size Fixes Armv8 Handle *beta == 0 for GEMMSUP ??r Case. Move unused ARM SVE kernels to "old" directory. Add an option to control whether or not to use @rpath. Fix $ORIGIN usage on linux. Arm micro-architecture dispatch (flame#344) Use @path-based install name on MacOS and use relocatable RPATH entries for testsuite inaries. Armv8 Handle *beta == 0 for GEMMSUP ?rc Case. Armv8 Fix 6x8 Row-Maj Ukr Apply patch from @xrq-phys. Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs. bli_error: more cleanup on the error strings array Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9 Arm SVE: Correct PACKM Ker Name: Intrinsic Kers Fix config_name in bli_arch.c Arm Whole GEMMSUP Call Route is Asm/Int Optimized Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref Header Typo Arm: DGEMMSUP ??r(rv) Invoke Edge Size Arm: DGEMMSUP ?rc(rd) Invoke Edge Size Arm: Implement GEMMSUP Fallback Method Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin Added Apple Firestorm (A14/M1) Subconfig Arm64 8x4 Kernel Use Less Regs Armv8-A Supplimentary GEMMSUP Sizes for RD Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm Armv8-A Adjust Types for PACKM Kernels Armv8-A GEMMSUP-RD 6x8m Armv8-A GEMMSUP-RD 6x8n Armv8-A s/d Packing Kernels Fix Typo Armv8-A Introduced s/d Packing Kernels Armv8-A DGEMMSUP 6x8m Kernel Armv8-A DGEMMSUP Adjustments Armv8-A Add More DGEMMSUP Armv8-A Add GEMMSUP 4x8n Kernel Armv8-A Add Part of GEMMSUP 8x4m Kernel Armv8A DGEMM 4x4 Kernel WIP. Slow Armv8-A Add 8x4 Kernel WIP AMD-Internal: [CPUPL-2698] Change-Id: I194ff69356740bb36ca189fd1bf9fef02eec3803
This provides a cpuid-like mechanism for arm64, per #299.
After doing it, I realized it presumably has limited effect because it doesn't affect block sizes, though it does affect the ISA for ThunderX2, which I haven't been able to try yet. (OpenBLAS does have different kernel configurations for some of the micro-arches, but I don't know if significantly different.)
There are also comments about arm32 implementation if anyone is interested.