
New GEMM Assembly & Configuration Set for Arm SVE #424

Merged
fgvanzee merged 62 commits into flame:master from xrq-phys:armsve-cfg-venture
May 19, 2021

Conversation

@xrq-phys
Collaborator

@xrq-phys xrq-phys commented Jul 19, 2020

NOTICE: Branch xrq-phys:armsve-cfg-venture has been rebased/reworked several times. Comments below may not reflect the code commits they originally referred to.

I'm reopening #422 with a few updates:

  • A few improvements to the Arm SVE-512 kernel;
  • A new dgemm kernel specialized for the A64fx chip (reason below);
  • A new configuration called a64fx.

Reason for a different dgemm kernel for A64fx

The dgemm_armsve512_asm kernel under kernels/armsve/3 is mainly composed of SVE indexed FMLA instructions (opcode byte 0x64). This is the same strategy as the dgemm_armsve256_asm kernel in the same directory; it increases the interval between a vector's load and its use by FMLA. However, actual profiling of that kernel gives the following (test size: GEMM (2000,1400,500), corrected from (2400,1600,500)):

Bmk_DGEMM

The left part of the combo histogram shows that most of the time the processor is committing only 1 or 2 instructions, while A64fx has 2 FP pipelines and 2 integer pipelines, 4 in total. This drastically lowers the final GFLOPS yield (c.f. the spread at the end). However, the FP stall rate and memory/cache access wait are quite low, indicating no impediment to the FP pipelines.

According to the A64fx microarchitecture manual (https://github.com/fujitsu/A64FX), the FP pipeline in A64fx has no element duplicator for indexed SVE operations, so a single 0x64 (indexed) FMLA is executed with both FP pipelines, each half-occupied. As a workaround, another kernel was created for A64fx with the 0x64 FMLA replaced by 0x65 (unindexed) FMLA, and it does yield higher GFLOPS:

Bmk_DGEMM 2

                      BLIS + 0x64 kernel   BLIS + 0x65 kernel   Fujitsu BLAS
DGEMM(2000,1400,500)  24 GFLOPS            33 GFLOPS            42 GFLOPS

(Table has typo corrected.)
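As a rough illustration of the mechanism above, the per-cycle flop budget halves when one indexed FMLA occupies both pipelines. This is a sketch under assumed figures (8 doubles per 512-bit vector, 2 flops per FMA, 2 FP pipelines), not profiler output:

```c
/* Per-core flop budget per cycle under the two FMLA issue scenarios
   described above.  Assumed figures: 512-bit vectors of doubles,
   2 flops per FMA, 2 FP pipelines. */
static inline int flops_per_cycle(int fmla_issued_per_cycle)
{
    const int lanes = 512 / 64;     /* doubles per SVE-512 vector */
    const int flops_per_fma = 2;    /* one multiply + one add     */
    return fmla_issued_per_cycle * lanes * flops_per_fma;
}

/* Unindexed (0x65) FMLA: one per pipe per cycle -> flops_per_cycle(2) == 32.
   Indexed (0x64) FMLA: one occupies both pipes  -> flops_per_cycle(1) == 16. */
```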

To reiterate, this pull request contains 2 components:

  • Arm SVE-512 dgemm kernels with one more specialized for A64fx;
  • Configurations for general Arm SVE and A64fx.

It can be separated into 4 dependent but self-contained themes, so if you feel this pull request is too big, please feel free to close it and let me know. I'll relaunch it as separate code changes.

@xrq-phys
Collaborator Author

xrq-phys commented Jul 19, 2020

Sorry — the way A64fx handles indexed FMLA is actually implied at the end of fujitsu/A64FX.

Anyway, based on this fact a separate kernel has been added. Or would it be better to keep separate kernels but still put them under the same armsve directory?

@rvdg
Collaborator

rvdg commented Jul 19, 2020 via email

@devinamatthews
Member

@xrq-phys to answer your specific question:

Anyway, based on this fact a separate kernel has been added. Or would it be better to keep separate kernels but still put them under the same armsve directory?

I would put this in the same directory and leave a commented-out section in the configuration code for the "old" one. Many of the architectures have multiple implementations sitting around, e.g. haswell has or had at least 4.

@devinamatthews
Member

There seem to be some architectural similarities to SkylakeX: for that kernel I found that the prefetching of the C microtile really has to be distributed among the k iterations and spread out quite a bit (and the prefetches of A and B should be spread out as much as possible too) to avoid overloading the L1 prefetch buffer. Additionally, because the panels of A and B do not fit into the L1 cache, you have to wait until a certain number of k iterations before the end to start prefetching C, so that it does not get flushed out.

The other thing that might make a difference is prefetching the next panel of B (the one that will be used in the next microkernel call) to the L2 cache. I don't think you have to prefetch the entire panel this way; you may be able to just prefetch enough to train the L2 stream prefetcher (I assume there is such a thing on this chip).
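To make the suggestion concrete, here is a portable C sketch of spreading the C-microtile prefetches over the tail k iterations. The MR/NR/KC values are hypothetical, and `__builtin_prefetch` (a GCC/Clang builtin) stands in for the prefetch instructions the real assembly kernel would issue:

```c
#include <stddef.h>

/* Hypothetical sizes for illustration only. */
#define MR 8                     /* microtile rows                   */
#define NR 8                     /* microtile columns                */
#define KC 256                   /* k iterations per microkernel call */
#define PREF_C_START (KC - MR)   /* begin prefetching C near the end */

void microkernel_sketch(const double *a, const double *b,
                        double *c, size_t rs_c)
{
    double acc[MR][NR] = { { 0.0 } };

    for (int k = 0; k < KC; ++k)
    {
        /* Spread the C-microtile prefetches over the last MR iterations,
           one row per iteration: late enough that C is not flushed out
           before use, and spaced out so the L1 prefetch buffer is not
           overloaded. */
        if (k >= PREF_C_START)
            __builtin_prefetch(&c[(size_t)(k - PREF_C_START) * rs_c], 1, 3);

        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                acc[i][j] += a[k * MR + i] * b[k * NR + j];
    }

    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            c[i * rs_c + j] += acc[i][j];
}
```

The A/B prefetches (not shown) would be interleaved into the same k loop at regular spacing, following the same principle.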

@xrq-phys
Collaborator Author

@devinamatthews Thanks a lot for your advice.
By leaving next_a and next_b to the stream prefetcher, FP performance climbs to 36 GFLOPS.
Distribution of the C-microtile prefetching is also implemented, but unfortunately the machine has just been shut down. I'll update this thread when test results become available again.

Updates are committed to xrq-phys/blis/tree/armsve512-a64fx instead of this src branch at the moment. I would like this src branch to have as few git merges as possible. (Though rebasing might still be necessary later.)

@xrq-phys
Collaborator Author

I'll try to submit a make checkblis output on an A64fx machine.

@xrq-phys xrq-phys force-pushed the armsve-cfg-venture branch from 6e5e44b to 47f2be9 Compare August 16, 2020 19:57
@xrq-phys
Collaborator Author

xrq-phys commented Aug 16, 2020

Rebased against 2c554c2 .
(Later will rebase against current master.)
make checkblis-fast passed on Arm Instruction Emulator.

Host: Microsoft SQ1 (Cortex-A76 / A55)
OS: Debian 10 via WSL 2
GCC: 10.1
ARMIE: 20.1
Output: output.testsuite.txt

@xrq-phys xrq-phys force-pushed the armsve-cfg-venture branch from 47f2be9 to 041dca2 Compare August 18, 2020 14:13
@xrq-phys
Collaborator Author

xrq-phys commented Aug 18, 2020

Rebased against current master (a.k.a. 7d41128).

ARMIE test has passed.

@xrq-phys
Collaborator Author

xrq-phys commented Sep 10, 2020

BTW, performance is now over 42 and up to 45 GFLOPS, but only for large m, n and k (>2000, >500).

(Ref: the vendor-provided BLAS has a peak of roughly 54 GFLOPS.)

Still trying some modifications.

@devinamatthews
Member

Awesome!

@xrq-phys
Collaborator Author

xrq-phys commented Sep 11, 2020

Following suggestions by @rvdg , I'm placing here comparison against theoretical peak.

Currently performance is read from a profiler from Fujitsu (with an MS Excel frontend). Part of the performance report looks like the following (captured from a test with (M,N,K)=(2000,1400,500), corrected from (2400,1600,500), where Fujitsu's LAPACK was giving 42 GFLOPS in April):
Screenshot 2020-10-15 121843
(Screenshot updated to include frequency info.)

For now I'm directly treating the floating-point operation pipeline busy rate as the GFLOPS ratio against the theoretical peak.
(Sorry, there was a mistake in a previous email: I should have read the pipeline rate instead of the peak ratio in the first table.)
The value is only ~65%, so I suppose there's still a lot of room for improvement even after merging.
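As a back-of-envelope check on that reading (assuming two 512-bit FMA pipelines per core and the 2.2 GHz clock in effect at this time, per the thread), the theoretical per-core peak and the implied ratio can be computed directly:

```c
/* Theoretical double-precision peak for one A64fx core, assuming two
   512-bit FMA pipelines and the given clock (2.2 GHz per this thread). */
static inline double peak_gflops(double ghz)
{
    const double pipes = 2.0;
    const double lanes = 512.0 / 64.0;   /* doubles per vector */
    const double flops_per_fma = 2.0;
    return pipes * lanes * flops_per_fma * ghz;   /* 70.4 at 2.2 GHz */
}
```

At 2.2 GHz this gives 70.4 GFLOPS/core, so the measured 45 GFLOPS is ~64% of peak, which is consistent with the ~65% pipeline busy rate above.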

@xrq-phys
Collaborator Author

xrq-phys commented Oct 1, 2020

Updates to test suite output:

  • ARMIE passed almost all make checkblis tests;
  • Fugaku real-machine runs failed the symm routines; here's the output:
    output.testsuite.serial.txt (line 18952)
    Could this be caused by the OS, the compiler or something else?
  • Basic BLAS tests passed (including SYMM — why?). See
    out.dblat3.txt and
    out.zblat3.txt.
    (out.[sc]blat3 uses the ref kernel, hence trivial.)

At the same time, please let me summarize the trivia used in tuning for the A64fx chip. I saw someone else trying to implement a more generic kernel set, and this already-tuned kernel could provide some hints. Here's the note (deleted)

@devinamatthews
Member

Hi @xrq-phys I guess the performance in the note is for 1 core? Have you tried the whole chip yet?

@xrq-phys
Collaborator Author

xrq-phys commented Oct 1, 2020

Ah yes. I've not tested for 48 cores yet... I'll try it out as soon as the machine is ready.

@xrq-phys
Collaborator Author

I've just noticed a serious typo.
All tests above are (2000, 1400, 500) instead of (2400, 1600, 500).
The GFLOPS data remains valid.

(I may or may not have complained about strange behavior of Fujitsu's BLAS but it was caused by my environment setting mistake related to this typo. Sorry.)

@xrq-phys
Collaborator Author

BTW, the multithreaded test looks good in the 12-core case, but 48-core performance is quite unstable.

Trying to sort out performance data...

@jeffhammond
Member

@xrq-phys I don't have access to A64fx yet and am not an expert on the architecture, but it seems that NUMA is the issue here. I don't know what your goals are, but I would expect that the most important use cases for BLIS on A64fx will have one or more processes per CMG, and thus scaling BLIS across the NoC isn't critical.

This issue is not unique to A64fx or even to BLIS. It's very hard to scale threaded programming models with shared data across NUMA domains.

I'm sure you know it but for anyone else who is reading this issue, the following is useful:
fujitsu-a64fx-block-diagram

@xrq-phys
Collaborator Author

Due to Fugaku's set-up process, the administrators seem to need to switch A64fx's clock back and forth between 2.2 GHz and 2.0 GHz.

The processor had been at 2.2 GHz since April, but it's now at 2.0 GHz. As a consequence, I'm afraid results from here on might differ from last month's.

@xrq-phys
Collaborator Author

@jeffhammond Thanks. I guess it'll then be OK to just post a 12-thread (or 13? I still have no idea whether the 1 auxiliary core can be used for OMP threading) benchmark as multithreaded result.

@xrq-phys
Collaborator Author

Screenshot 2020-10-15 001134

GFLOPS updated to 47 (→51 if it were still at 2.2 GHz)

@jeffhammond
Member

If the 13th core on the CMG is reserved for OS/MPI/etc. it would be imprudent to run heavy compute on it. Even if it's possible, real apps will likely need the 13th core to be free to do its job. The right target is 12 cores.

If the 13th core on A64fx is like the 17th core on Blue Gene/Q, it can't be used for compute. I do not know if Fugaku implements that level of control.

@xrq-phys
Collaborator Author

Seems that the 13th core is indeed OS-reserved.
I heard that all 13 cores are symmetric (with 512-bit FP pipelines) and naively thought it could be used.
Forgive my ignorance of hardware design.

In my ~1200 GEMM test program, both of the following tests give around 43.5 GFLOPS per core (thanks to HBM, I guess?):

  • 12 OMP threads and
  • 4 replicas of test program, 12 OMP threads each, running at the same time on the same chip.

@xrq-phys
Collaborator Author

xrq-phys commented Jan 19, 2021

Hi.

I don't expect this branch to be the one to get merged, but keeping this PR would allow me to add some explanations.

I'm not very clear about the reason, but my UKR seems not that sensitive to the way A & B are stored, so I tried to make it a dgemmsup_rv kernel.
Currently it's merely the same UKR with rs_a, cs_a and cs_b supplied as inputs. The loop over n0 is written outside the assembly, so I don't expect much performance from it; a bit above gemmsup_ref is the current target.
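The shape described above can be sketched in C as follows. The names and the tiny MR are hypothetical (the real inner kernel is SVE assembly): A and B are read through general strides, and the loop over n0 stays outside the inner kernel.

```c
/* Illustrative semantics only; names and MR are hypothetical, and the
   real kernel body is SVE assembly.  A is read through general strides
   (rs_a, cs_a), B through (rs_b, cs_b), and the loop over n0 sits
   outside the inner kernel. */
enum { MR = 2 };   /* microtile height, reduced for illustration */

static void ukr_one_column(int k,
                           const double *a, int rs_a, int cs_a,
                           const double *b_col, int rs_b,
                           double *c_col, int rs_c)
{
    for (int i = 0; i < MR; ++i)
    {
        double acc = 0.0;
        for (int p = 0; p < k; ++p)
            acc += a[i * rs_a + p * cs_a] * b_col[p * rs_b];
        c_col[i * rs_c] += acc;
    }
}

void gemmsup_rv_sketch(int k, int n0,
                       const double *a, int rs_a, int cs_a,
                       const double *b, int rs_b, int cs_b,
                       double *c, int rs_c, int cs_c)
{
    for (int j = 0; j < n0; ++j)   /* n0 loop outside the assembly */
        ukr_one_column(k, a, rs_a, cs_a,
                       b + j * cs_b, rs_b,
                       c + j * cs_c, rs_c);
}
```

Because the strides are runtime parameters, the same kernel serves both row- and column-stored operands, at the cost of losing the contiguity guarantees a packed panel provides.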

@jeffhammond
Member

I tried this but cannot get the BLIS build system to accept that Python 2 is dead.

jhammond43@octavius1:~/BLIS$ ./configure CC=gcc-10 CXX=g++-10 FC=ftn PYTHON=python3 -t openmp a64fx
configure: detected Linux kernel version 4.18.0-240.1.1.el8_3.aarch64.
configure: python interpeter search list is: python python3 python2.
configure: using 'python3' python interpreter.
configure: found python version 3.6.8 (maj: 3, min: 6, rev: 8).
configure: python 3.6.8 appears to be supported.
configure: C compiler search list is: gcc clang cc.
configure: using 'gcc-10' C compiler.
configure: C++ compiler search list is: g++ clang++ c++.
configure: using 'g++-10' C++ compiler (for sandbox only).
configure: found gcc version 10.2.0 (maj: 10, min: 2, rev: 0).
configure: checking for blacklisted configurations due to gcc-10 10.2.0.
configure: checking gcc-10 10.2.0 against known consequential version ranges.
configure: found assembler ('as') version 2.32 (maj: 2, min: 32, rev: ).
configure: checking for blacklisted configurations due to as 2.32.
configure: warning: assembler ('as' 2.32) does not support 'bulldozer'; adding to blacklist.
configure: warning: assembler ('as' 2.32) does not support 'sandybridge'; adding to blacklist.
configure: warning: assembler ('as' 2.32) does not support 'haswell'; adding to blacklist.
configure: warning: assembler ('as' 2.32) does not support 'piledriver'; adding to blacklist.
configure: warning: assembler ('as' 2.32) does not support 'steamroller'; adding to blacklist.
configure: warning: assembler ('as' 2.32) does not support 'excavator'; adding to blacklist.
configure: warning: assembler ('as' 2.32) does not support 'skx'; adding to blacklist.
configure: warning: assembler ('as' 2.32) does not support 'knl'; adding to blacklist.
configure: configuration blacklist:
configure:   bulldozer sandybridge haswell piledriver steamroller excavator skx knl
configure: reading configuration registry...done.
configure: determining default version string.
configure: found '.git' directory; assuming git clone.
configure: executing: git describe --tags.
configure: got back 0.7.0-65-gec63fdf6.
configure: truncating to 0.7.0-65.
configure: starting configuration of BLIS 0.7.0-65.
configure: configuring with official version string.
configure: found shared library .so version '3.0.0'.
configure:   .so major version: 3
configure:   .so minor.build version: 0.0
configure: manual configuration requested; configuring with 'a64fx'.
configure: checking configuration against contents of 'config_registry'.
configure: configuration 'a64fx' is registered.
configure: 'a64fx' is defined as having the following sub-configurations:
configure:    a64fx
configure: which collectively require the following kernels:
configure:    armsve armv8a
configure: checking sub-configurations:
configure:   'a64fx' is registered...and exists.
configure: checking sub-configurations' requisite kernels:
configure:   'armsve' kernels...exist.
configure:   'armv8a' kernels...exist.
configure: no install prefix option given; defaulting to '/usr/local'.
configure: no install exec_prefix option given; defaulting to PREFIX.
configure: no install libdir option given; defaulting to EXECPREFIX/lib.
configure: no install includedir option given; defaulting to PREFIX/include.
configure: no install sharedir option given; defaulting to PREFIX/share.
configure: final installation directories:
configure:   prefix:      /usr/local
configure:   exec_prefix: ${prefix}
configure:   libdir:      ${exec_prefix}/lib
configure:   includedir:  ${prefix}/include
configure:   sharedir:    ${prefix}/share
configure: NOTE: the variables above can be overridden when running make.
configure: no preset CFLAGS detected.
configure: no preset LDFLAGS detected.
configure: debug symbols disabled.
configure: disabling verbose make output. (enable with 'make V=1'.)
configure: disabling ARG_MAX hack.
configure: building BLIS as both static and shared libraries.
configure: exporting only public symbols within shared library.
configure: using OpenMP for threading.
configure: requesting slab threading in jr and ir loops.
configure: internal memory pools for packing blocks are enabled.
configure: internal memory pools for small blocks are enabled.
configure: memory tracing output is disabled.
configure: libmemkind not found; disabling.
configure: compiler appears to support #pragma omp simd.
configure: the BLAS compatibility layer is enabled.
configure: the CBLAS compatibility layer is disabled.
configure: mixed datatype support is enabled.
configure: mixed datatype optimizations requiring extra memory are enabled.
configure: small matrix handling is enabled.
configure: the BLIS API integer size is automatically determined.
configure: the BLAS/CBLAS API integer size is 32-bit.
configure: configuring for conventional gemm implementation.
configure: creating ./config.mk from ./build/config.mk.in
configure: creating ./bli_config.h from ./build/bli_config.h.in
configure: creating ./obj/a64fx
configure: creating ./obj/a64fx/config/a64fx
configure: creating ./obj/a64fx/kernels/armsve
configure: creating ./obj/a64fx/kernels/armv8a
configure: creating ./obj/a64fx/ref_kernels/a64fx
configure: creating ./obj/a64fx/frame
configure: creating ./obj/a64fx/blastest
configure: creating ./obj/a64fx/testsuite
configure: creating ./lib/a64fx
configure: creating ./include/a64fx
configure: mirroring ./config/a64fx to ./obj/a64fx/config/a64fx
configure: mirroring ./kernels/armsve to ./obj/a64fx/kernels/armsve
configure: mirroring ./kernels/armv8a to ./obj/a64fx/kernels/armv8a
configure: mirroring ./ref_kernels to ./obj/a64fx/ref_kernels
configure: mirroring ./ref_kernels to ./obj/a64fx/ref_kernels/a64fx
configure: mirroring ./frame to ./obj/a64fx/frame
configure: creating makefile fragments in ./obj/a64fx/config/a64fx
configure: creating makefile fragments in ./obj/a64fx/kernels/armsve
configure: creating makefile fragments in ./obj/a64fx/kernels/armv8a
configure: creating makefile fragments in ./obj/a64fx/ref_kernels
configure: creating makefile fragments in ./obj/a64fx/frame
configure: configured to build within top-level directory of source distribution.
jhammond43@octavius1:~/BLIS$ make -j48 
Generating monolithic blis.h
/usr/bin/env: 'python': No such file or directory
make: *** [Makefile:460: include/a64fx/blis.h] Error 127

@devinamatthews
Member

@jeffhammond to be fair, I don't know of any shebang that means "give me any python2 or python3" that actually works portably. Suggestions? (Other than just looking for python3, because that isn't portable either 😢.)

@jeffhammond
Member

Just use Python 3 and refer anyone who asks for Python 2 support to https://pythonclock.org/

@jeffhammond
Member

A different issue here is that this code appears to rely on some non-standard Clang intrinsic, __aarch64_ldeor8_rel:

jhammond43@octavius2:~/PRK/FORTRAN$ rm *blas ; make dgemm-blas && ./dgemm-blas 100 1024
ftn -e F -O2 -DRADIUS=2 -DSTAR dgemm-blas.F90 -h omp -I/nethome/jhammond43/BLIS//include/a64fx /nethome/jhammond43/BLIS//lib/a64fx/libblis.a -o dgemm-blas
/opt/cray/pe/cce-sve/10.0.1/binutils/aarch64/aarch64-unknown-linux-gnu/bin/ld: /nethome/jhammond43/BLIS//lib/a64fx/libblis.a(bli_thrcomm.o): in function `bli_thrcomm_barrier_atomic':
bli_thrcomm.c:(.text+0xc4): undefined reference to `__aarch64_ldadd8_acq_rel'
/opt/cray/pe/cce-sve/10.0.1/binutils/aarch64/aarch64-unknown-linux-gnu/bin/ld: bli_thrcomm.c:(.text+0x100): undefined reference to `__aarch64_ldeor8_rel'
make: *** [Makefile:94: dgemm-blas] Error 1
jhammond43@octavius2:~/PRK/FORTRAN$ 

@xrq-phys xrq-phys force-pushed the armsve-cfg-venture branch from 0567e78 to 334f3e2 Compare May 15, 2021 07:27
@xrq-phys
Collaborator Author

xrq-phys commented May 17, 2021

@fgvanzee I suppose the code is clean enough to get merged.
Still, there are 2 questions:

  • After the rebase, commit 757cb1c used in our Performance.md is no longer available in this branch (it is in xrq-phys:armsve-cfg+premerge instead). It should be quick to rerun the BLIS+SSL2 tests alone, though.
  • (Postponed) I modified the TravisCI config to use QEMU to test through ArmSVE. As the SVE kernels require GCC 10+ and QEMU 4.4+, the base OS is updated from trusty to focal with many other changes. I suppose I need you to decide whether these updates are acceptable.

@devinamatthews
Member

@xrq-phys how feasible is it to split the TravisCI changes into a separate PR?

@xrq-phys xrq-phys force-pushed the armsve-cfg-venture branch from d16a12d to e0705b2 Compare May 17, 2021 15:49
@xrq-phys
Collaborator Author

@devinamatthews Indeed. Moved .travis.yml away for now.

@fgvanzee
Member

After the rebase, commit 757cb1c used in our Performance.md is no longer available in this branch (it is in xrq-phys:armsve-cfg+premerge instead). It should be quick to rerun the BLIS+SSL2 tests alone, though.

@xrq-phys If/when you can rerun the same experiment set as before, I'll be happy to use them to regenerate graphs, at which time I can also update the commit that is referenced in the document. In the meantime, we can proceed with the merge.

Before I actually merge, I may try to do a few minor tweaks (mostly related to reordering code inserted for the new subconfigs), although I recently had trouble pushing to someone else's PR branch, so we'll see if I can even get it to work. If not, I'll merge and then do the cleanups on master.

@devinamatthews
Member

@fgvanzee you can also make a copy of @xrq-phys's branch in this repo and update the PR merge branch.

@fgvanzee
Member

@fgvanzee you can also make a copy of @xrq-phys's branch in this repo and update the PR merge branch.

I did not know. Also, that seems less than intuitive to me. If the PR is coming from a branch in RuQing's repository to master in flame/blis, how would pushing commits to a copy of his branch help us?

@devinamatthews
Member

You would edit the PR to come from the local copy, so you can update it. Not a big difference either way.

@fgvanzee
Member

fgvanzee commented May 17, 2021

@xrq-phys Is this a typo (in frame/base/bli_cpuid.c)?

#ifdef BLIS_CONFIG_ARMSVE
            if ( bli_cpuid_is_armsve( model, part, features ) )
                return BLIS_ARCH_ARMSVE;
#endif
#ifdef BLIS_CONFIG_ARMSVE
            if ( bli_cpuid_is_a64fx( model, part, features ) )
                return BLIS_ARCH_A64FX;
#endif

Seems like the second macro guard would normally be BLIS_CONFIG_A64FX.

Details:
- Changed the order of the new A64fx and SVE code fragments to appear
  as the beginning of the armv8a-related code (rather appearing after
  other armv8a code).
- Fixed what is probably a copy-paste bug in frame/base/bli_cpuid.c.
  Previously, the a64fx conditional check was guarded by the cpp macro
  BLIS_CONFIG_ARMSVE, which has now been changed to BLIS_CONFIG_A64FX.
@fgvanzee
Member

fgvanzee commented May 17, 2021

@xrq-phys Commit e782546 assumes that the inconsistency I highlighted above was indeed a bug and addresses it. If it was not a bug, please feel free to undo my change to that line. (The cpp guard in question is now on line 462 of frame/base/bli_cpuid.c)

@xrq-phys
Collaborator Author

@fgvanzee It was indeed a typo. Sorry.
In fact, it is possible to also add vendor ID checks in bli_cpuid_is_a64fx, but I guess #344 is preferable.

I'll try to rerun the BLIS test on Fugaku if the environment config (OS, freq. throttling, etc.) hasn't changed too much.

@fgvanzee
Member

Thanks @xrq-phys. I'll merge this now. (And don't worry about the typo. It was easy for me to spot.)

@fgvanzee
Member

Apologies, I got distracted by other tasks/people yesterday and forgot to click the button.

@fgvanzee
Member

@xrq-phys As I was preparing to squash-and-merge, I realized that the default log message (ie: the concatenation of all constituent commit log entries) is a bit unwieldy. Could you summarize the changes you made in a way that would allow me to create a more concise commit log entry?

@xrq-phys
Collaborator Author

xrq-phys commented May 19, 2021

@fgvanzee Oh, that's right. This branch has been rebased several times, so the commit messages have become a little unwieldy.

I suppose a brief summary would be:

  • Implemented vector-length-agnostic [D/S/SH]GEMM kernels for Arm SVE at size (2*VL, 10).
    These kernels use unindexed FMLA instructions because indexed FMLA takes 2 FMA units in many implementations.
    PS: There are indexed-FMLA kernels in S. Nassyr's repo.
  • Implemented several experimental DGEMMSUP kernels which would improve performance in a few cases.
    However, those DGEMMSUP kernels generally underperform and hence are not used in any subconfiguration.
  • Extended the 256-bit SVE DPACKM kernels by Linaro Ltd. to 512-bit for size (12, k).
    This DPACKM kernel is not used by any subconfiguration.
  • Implemented 512-bit SVE DPACKM kernels with in-register transpose support for sizes (16, k) and (10, k).
  • Added a vector-length-agnostic subconfiguration armsve that computes block sizes according to an analytical model.
    This part is ported from the repo of @stepannassyr.
  • Added a 512-bit-specific subconfiguration a64fx that uses empirically tuned block sizes by @stepannassyr.
    This subconfiguration also sets the sector cache size and enables memory-tagging code in the SVE GEMM kernels.
    It utilizes the (16, k) and (10, k) DPACKM kernels.
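For readers unfamiliar with packm, here is a reference C sketch of what a DPACKM kernel computes (illustrative code with made-up names; the SVE kernels in this PR realize the same gather/transpose with in-register permutes):

```c
/* Reference semantics of DPACKM: an mr x k slice of A, stored with
   general strides (rs_a, cs_a), is repacked into a contiguous panel of
   mr-tall micro-columns, the layout consumed by the GEMM microkernel. */
void packm_ref(int mr, int k,
               const double *a, int rs_a, int cs_a,
               double *p)
{
    for (int j = 0; j < k; ++j)        /* one micro-column per j   */
        for (int i = 0; i < mr; ++i)   /* contiguous within column */
            p[j * mr + i] = a[i * rs_a + j * cs_a];
}
```

When A is row-stored (cs_a == 1), this repacking is effectively a transpose, which is why the (16, k) and (10, k) kernels need in-register transpose support.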

@fgvanzee fgvanzee merged commit 61584de into flame:master May 19, 2021
@fgvanzee
Member

Thanks RuQing, that summary was great.

@xrq-phys
Collaborator Author

Thanks a lot!

@rvdg
Collaborator

rvdg commented May 19, 2021 via email
