New GEMM Assembly & Configuration Set for Arm SVE#424
fgvanzee merged 62 commits into flame:master from xrq-phys:armsve-cfg-venture
Conversation
|
Sorry, the way A64fx handles indexed FMLA is actually implied at the end of fujitsu/A64FX. Anyway, based on this fact a separate kernel has been added. Or would it be better to have separate kernels but still put them under the same `armsve` directory?
|
RuQing,
Thank you for your efforts. Would it be possible for you to send me an e-mail? There are some issues that we would like to discuss with you offline.
Robert
[email protected]
… On Jul 19, 2020, at 11:55 AM, RuQing Xu ***@***.***> wrote:
Sorry the way A64fx handles indexed FMLA is actually documented at the end of fujitsu/A64fx.
Anyway, based on this fact a separated kernel is added. Or would it be better to have separated kernels but still put them under the same ˋarmsveˋ directory?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub <#424 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABLLYJ3WOGX622RCFDNS36TR4MQRVANCNFSM4PAR7TSQ>.
|
|
@xrq-phys to answer your specific question:
I would put this in the same directory and leave a commented-out section in the configuration code for the "old" one. Many of the architectures have multiple implementations sitting around, e.g. |
|
There seem to be some architectural similarities to SkylakeX: for that kernel I found that the prefetching of the C microtile really has to be distributed among the k iterations and spread out quite a bit (and the prefetches of A and B should be spread out as much as possible too) to avoid overloading the L1 prefetch buffer. Additionally, because the panels of A and B do not fit into the L1 cache, you have to wait until a certain number of k iterations before the end before you start prefetching C, so that it does not get flushed out. The other thing that might make a difference is prefetching the next panel of B (the one that will be used in the next microkernel call) into the L2 cache. I don't think you have to fetch the entire panel this way, but may be able to just prefetch enough to train the L2 stream prefetcher (I assume there is such a thing on this chip).
|
@devinamatthews Thanks a lot for your advice. Updates are committed to
|
I'll try to submit a |
6e5e44b to
47f2be9
Compare
|
Rebased against 2c554c2. Host: Microsoft SQ1 (Cortex-A76 / A55)
47f2be9 to
041dca2
Compare
|
Rebased against current master (a.k.a. 7d41128). |
|
BTW, performance is now between 42 and 45 GFLOPS, but only for large m, n, and k (>2000, >500). (Ref: the vendor-provided BLAS has a peak of roughly 54 GFLOPS.) Still trying some modifications.
|
Awesome! |
|
Following suggestions by @rvdg, I'm placing here a comparison against the theoretical peak. Currently, performance is read from a Fujitsu profiler (with an MS Excel frontend). Part of the performance report looks like the following (captured from a test with (M,N,K) = ). For now I'm directly treating the Floating Point Operation Pipeline Busy rate as the ratio of achieved GFLOPS to the theoretical peak.
|
Updates to test suite output:
|
|
Hi @xrq-phys, I guess the performance in the note is for 1 core? Have you tried the whole chip yet?
|
Ah yes. I've not tested for 48 cores yet... I'll try it out as soon as the machine is ready. |
|
I've just noticed a serious typo. (I may or may not have complained about strange behavior of Fujitsu's BLAS but it was caused by my environment setting mistake related to this typo. Sorry.) |
|
BTW, the multithreaded tests look good in the 12-core case, but 48-core performance is quite unstable. Trying to sort out the performance data...
|
@xrq-phys I don't have access to A64fx yet and am not an expert on the architecture, but it seems that NUMA is the issue here. I don't know what your goals are, but I would expect that the most important use cases for BLIS on A64fx will have one or more processes per CMG, and thus scaling BLIS across the NoC isn't critical. This issue is not unique to A64fx, or even to BLIS: it's very hard to scale threaded programming models with shared data across NUMA domains. I'm sure you know this, but for anyone else who is reading this issue, the following is useful:
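To make the "one process per CMG" setup concrete, a hypothetical invocation might look like the following config fragment. The NUMA node number, thread count, and test binary name are all assumptions for illustration; `BLIS_NUM_THREADS` is BLIS's own thread-count variable, and the rest are standard OpenMP/numactl knobs.

```shell
# Hypothetical: pin one 12-thread BLIS run to a single CMG (here NUMA
# node 0), so all threads share that CMG's local HBM and L2 cache.
export BLIS_NUM_THREADS=12     # one thread per compute core in the CMG
export OMP_PROC_BIND=close     # keep threads on adjacent cores
export OMP_PLACES=cores
numactl --cpunodebind=0 --membind=0 ./test_gemm.x   # binary name is made up
```

Running four such processes, one per CMG, sidesteps cross-NoC sharing entirely; each process only ever touches its own NUMA domain.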
|
Due to Fugaku's set-up process, their administrators seem to need to switch the A64fx clock back and forth between 2.2 GHz and 2.0 GHz. The processor had been at 2.2 GHz since April, but it's now at 2.0 GHz. As a consequence, I'm afraid results from here on might differ from last month's.
|
@jeffhammond Thanks. I guess it'll then be OK to just post a 12-thread benchmark (or 13? I still have no idea whether the 1 auxiliary core can be used for OMP threading) as the multithreaded result.
|
If the 13th core on the CMG is reserved for OS/MPI/etc. it would be imprudent to run heavy compute on it. Even if it's possible, real apps will likely need the 13th core to be free to do its job. The right target is 12 cores. If the 13th core on A64fx is like the 17th core on Blue Gene/Q, it can't be used for compute. I do not know if Fugaku implements that level of control. |
|
Seems that the 13th core is indeed OS-reserved. In my ~1200 GEMM test program, both of the following tests give around 43.5 GFLOPS per core (thanks to HBM, I guess?):
|
2c13e80 to
e455224
Compare
|
Hi. I don't expect this branch to be the one that gets merged, but keeping this PR open would allow me to add some explanations. I'm not very clear about the reason, but my UKR seems not that sensitive to the way A & B are stored, so I tried to make it a
|
I tried this but cannot get the BLIS build system to accept that Python 2 is dead. |
|
@jeffhammond to be fair I don't know of any shebang that means "give me any python2 or python3" that actually works portably. Suggestions? (other than just looks for python3, because that isn't portable either 😢 ). |
|
Just use Python3 and refer anyone who asks for Python2 support to https://pythonclock.org/
|
A different issue here is that this code appears to rely on some non-standard Clang intrinsic, |
- Only use institution name; - Add Fz. Juelich as coauthor.
0567e78 to
334f3e2
Compare
|
@fgvanzee I suppose the code is clean enough to get merged.
|
|
@xrq-phys how feasible is it to split the Travis CI changes into a separate PR?
d16a12d to
e0705b2
Compare
|
@devinamatthews Indeed. Moved |
@xrq-phys If/when you can rerun the same experiment set as before, I'll be happy to use them to regenerate graphs, at which time I can also update the commit that is referenced in the document. In the meantime, we can proceed with the merge. Before I actually merge, I may try to do a few minor tweaks (mostly related to reordering code inserted for the new subconfigs), although I recently had trouble pushing to someone else's PR branch, so we'll see if I can even get it to work. If not, I'll merge and then do the cleanups on |
I did not know. Also, that seems less than intuitive to me. If the PR is coming from a branch in RuQing's repository to |
|
You would edit the PR to come from the local copy, so you can update it. Not a big difference either way. |
|
@xrq-phys Is this a typo (in `frame/base/bli_cpuid.c`)?

```c
#ifdef BLIS_CONFIG_ARMSVE
	if ( bli_cpuid_is_armsve( model, part, features ) )
		return BLIS_ARCH_ARMSVE;
#endif
#ifdef BLIS_CONFIG_ARMSVE
	if ( bli_cpuid_is_a64fx( model, part, features ) )
		return BLIS_ARCH_A64FX;
#endif
```

Seems like the second macro guard would normally be `BLIS_CONFIG_A64FX`?
Details:
- Changed the order of the new A64fx and SVE code fragments to appear at the beginning of the armv8a-related code (rather than after other armv8a code).
- Fixed what is probably a copy-paste bug in frame/base/bli_cpuid.c. Previously, the a64fx conditional check was guarded by the cpp macro BLIS_CONFIG_ARMSVE, which has now been changed to BLIS_CONFIG_A64FX.
|
Thanks @xrq-phys. I'll merge this now. (And don't worry about the typo. It was easy for me to spot.) |
|
Apologies, I got distracted by other tasks/people yesterday and forgot to click the button. |
|
@xrq-phys As I was preparing to squash-and-merge, I realized that the default log message (ie: the concatenation of all constituent commit log entries) is a bit unwieldy. Could you summarize the changes you made in a way that would allow me to create a more concise commit log entry? |
|
@fgvanzee Oh, that's right. This branch has been rebased several times, so the commit messages have become a little unwieldy. I suppose a brief summary would be:
|
|
Thanks RuQing, that summary was great. |
|
Thanks a lot! |
|
Thank you RuQing for all your hard work.
|
NOTICE: Branch xrq-phys:armsve-cfg-venture has been rebased/reworked several times. Comments below may not reflect the code commits that were originally their context.

I'm reopening #422 with a few updates:

- A new `dgemm` kernel specialized for the A64fx chip (reason below);
- A new `a64fx` subconfiguration.

Reason for a different `dgemm` kernel for A64fx:

The `dgemm_armsve512_asm` kernel under `kernels/armsve/3` is mainly composed of SVE indexed `FMLA` instructions (opcode = 0x64). This strategy is the same as the `dgemm_armsve256_asm` kernel located in the same directory. It is able to increase the interval between a vector's load and its reference by `FMLA`. However, the actual profiling result of that kernel (test size: GEMM(2000,1400,500)) gives the following: the left part of the combo histogram shows that most of the time the processor is committing only 1 or 2 instructions, while A64fx has 2 FP pipelines and 2 integer pipelines, summing up to 4. This fact drastically lowers the final GFLOPS yielded (c.f. the spread at the end). However, the FP stall rate and memory/cache access wait are quite low, indicating no impediment to the FP pipelines.

According to the A64fx uarch manual (https://github.com/fujitsu/A64FX), the FP pipeline in A64fx does not have an element duplicator for indexed SVE operations, so that one single 0x64 `FMLA` is executed with both FP pipelines, each half-occupied. As a workaround, another kernel was created for A64fx with 0x64 `FMLA` replaced by 0x65 `FMLA`, and it does yield higher GFLOPS.

Again, I want to say that this pull request contains 2 components:

- `dgemm` kernels, with one more specialized for A64fx;
- a new configuration set for Arm SVE (including the `a64fx` subconfig).

It can be separated into 4 dependent but self-inclusive themes, so if you feel this pull request is too big, please feel free to close it and let me know. I'll relaunch with separated code changes.
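The contrast between the two `FMLA` forms described above can be sketched roughly as follows. This is illustrative SVE assembly, not copied from the actual kernels; the register choices and addressing are made up.

```asm
// Indexed FMLA (opcode group 0x64): multiply every lane of z1 by a
// single element selected out of z2. Per the A64fx uarch manual,
// this form occupies both FP pipelines, each half-utilized.
fmla    z0.d, z1.d, z2.d[0]

// Workaround used in the A64fx-specific kernel (schematically):
// replicate the B element into all lanes at load time with ld1rd,
// then issue the regular predicated FMLA (opcode group 0x65),
// which runs on a single FP pipeline at full rate.
ld1rd   { z2.d }, p0/z, [x1]    // broadcast one double to all lanes
fmla    z0.d, p0/m, z1.d, z2.d
```

The trade-off is that the broadcast moves from inside the FMA unit to an explicit `ld1rd`, which costs load-port bandwidth and register pressure but keeps both FP pipelines fully fed.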