New GEMM Assembly & Configuration Set for Arm SVE#424
fgvanzee merged 62 commits into flame:master from xrq-phys:armsve-cfg-venture
Conversation
|
Sorry, the way A64fx handles indexed FMLA is actually implied at the end of fujitsu/A64FX. Anyway, based on this fact a separate kernel has been added. Or would it be better to have separate kernels but still put them under the same `armsve` directory?
|
RuQing,
Thank you for your efforts. Would it be possible for you to send me an e-mail? There are some issues that we would like to discuss with you offline.
Robert
[email protected]
… On Jul 19, 2020, at 11:55 AM, RuQing Xu ***@***.***> wrote:
Sorry the way A64fx handles indexed FMLA is actually documented at the end of fujitsu/A64fx.
Anyway, based on this fact a separated kernel is added. Or would it be better to have separated kernels but still put them under the same ˋarmsveˋ directory?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub <#424 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABLLYJ3WOGX622RCFDNS36TR4MQRVANCNFSM4PAR7TSQ>.
|
|
@xrq-phys to answer your specific question:
I would put this in the same directory and leave a commented-out section in the configuration code for the "old" one. Many of the architectures have multiple implementations sitting around, e.g. |
|
There seem to be some architectural similarities to SkylakeX: for that kernel I found that the prefetching of the C microtile really has to be distributed among the k iterations and spread out quite a bit (and the prefetches of A and B should be spread out as much as possible too) to avoid overloading the L1 prefetch buffer. Additionally, because the panels of A and B do not fit into the L1 cache, you have to wait until a certain number of k iterations before the end before you start prefetching C, so that it does not get flushed out. The other thing that might make a difference is prefetching the next panel of B (the one that will be used in the next microkernel call) into the L2 cache. I don't think you have to fetch the entire panel this way, but may be able to just prefetch enough to train the L2 stream prefetcher (I assume there is such a thing on this chip).
|
@devinamatthews Thanks a lot for your advice. Updates are committed to
|
I'll try to submit a |
6e5e44b to
47f2be9
Compare
|
Rebased against 2c554c2. Host: Microsoft SQ1 (Cortex-A76 / A55)
47f2be9 to
041dca2
Compare
|
Rebased against current master (a.k.a. 7d41128). |
|
BTW, performance is now between 42 and 45 GFLOPS, but only for large m, n, and k (>2000, >500). (Ref: the vendor-provided BLAS has a peak of roughly 54 GFLOPS.) Still trying some modifications.
|
Awesome! |
|
Following suggestions by @rvdg, I'm placing here a comparison against the theoretical peak. Currently, performance is read from a Fujitsu profiler (with an MS Excel frontend). Part of the performance report looks like the following (captured from a test with (M,N,K) = ). For now I'm directly treating the Floating Point Operation Pipeline Busy rate as the ratio of achieved GFLOPS to the theoretical peak.
|
Updates to test suite output:
|
|
Hi @xrq-phys, I guess the performance in the note is for 1 core? Have you tried the whole chip yet?
|
Ah yes. I've not tested for 48 cores yet... I'll try it out as soon as the machine is ready. |
|
I've just noticed a serious typo. (I may or may not have complained about strange behavior of Fujitsu's BLAS but it was caused by my environment setting mistake related to this typo. Sorry.) |
|
BTW, the multithreaded tests look good in the 12-core case, but 48-core performance is quite unstable. Trying to sort out the performance data...
|
@xrq-phys I don't have access to A64fx yet and am not an expert on the architecture, but it seems that NUMA is the issue here. I don't know what your goals are, but I would expect that the most important use cases for BLIS on A64fx will have one or more processes per CMG, and thus scaling BLIS across the NoC isn't critical. This issue is not unique to A64fx, or even to BLIS: it's very hard to scale threaded programming models with shared data across NUMA domains. I'm sure you know this, but for anyone else who is reading this issue, the following is useful:
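To make the "one process per CMG" setup concrete, a hypothetical invocation might look like the following config fragment. The NUMA node number, thread count, and test binary name are all assumptions for illustration; `BLIS_NUM_THREADS` is BLIS's own thread-count variable, and the rest are standard OpenMP/numactl knobs.

```shell
# Hypothetical: pin one 12-thread BLIS run to a single CMG (here NUMA
# node 0), so all threads share that CMG's local HBM and L2 cache.
export BLIS_NUM_THREADS=12     # one thread per compute core in the CMG
export OMP_PROC_BIND=close     # keep threads on adjacent cores
export OMP_PLACES=cores
numactl --cpunodebind=0 --membind=0 ./test_gemm.x   # binary name is made up
```

Running four such processes, one per CMG, sidesteps cross-NoC sharing entirely; each process only ever touches its own NUMA domain.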
|
Due to Fugaku's set-up process, their administrators seem to need to switch the A64fx clock back and forth between 2.2 GHz and 2.0 GHz. The processor had been at 2.2 GHz since April, but it's now at 2.0 GHz. As a consequence, I'm afraid results from here on might differ from last month's.
|
@jeffhammond Thanks. I guess it'll then be OK to just post a 12-thread benchmark (or 13? I still have no idea whether the 1 auxiliary core can be used for OMP threading) as the multithreaded result.
|
If the 13th core on the CMG is reserved for OS/MPI/etc. it would be imprudent to run heavy compute on it. Even if it's possible, real apps will likely need the 13th core to be free to do its job. The right target is 12 cores. If the 13th core on A64fx is like the 17th core on Blue Gene/Q, it can't be used for compute. I do not know if Fugaku implements that level of control. |
|
Seems that the 13th core is indeed OS-reserved. In my ~1200 GEMM test program, both of the following tests give around 43.5 GFLOPS per core (thanks to HBM, I guess?):
|
2c13e80 to
e455224
Compare
|
Hi. I don't expect this branch to be the one that gets merged, but keeping this PR open would allow me to add some explanations. I'm not very clear about the reason, but my UKR seems not that sensitive to the way A & B are stored, so I tried to make it a
|
I tried this but cannot get the BLIS build system to accept that Python 2 is dead. |
|
@jeffhammond to be fair I don't know of any shebang that means "give me any python2 or python3" that actually works portably. Suggestions? (other than just looks for python3, because that isn't portable either 😢 ). |
|
Just use Python3 and refer anyone who asks for Python2 support to https://pythonclock.org/
|
A different issue here is that this code appears to rely on some non-standard Clang intrinsic, |
- Only use institution name; - Add Fz. Juelich as coauthor.
0567e78 to
334f3e2
Compare
|
@fgvanzee I suppose the code is clean enough to get merged.
|
|
@xrq-phys how feasible is it to split the Travis CI changes into a separate PR?
d16a12d to
e0705b2
Compare
|
@devinamatthews Indeed. Moved |
@xrq-phys If/when you can rerun the same experiment set as before, I'll be happy to use them to regenerate graphs, at which time I can also update the commit that is referenced in the document. In the meantime, we can proceed with the merge. Before I actually merge, I may try to do a few minor tweaks (mostly related to reordering code inserted for the new subconfigs), although I recently had trouble pushing to someone else's PR branch, so we'll see if I can even get it to work. If not, I'll merge and then do the cleanups on |
I did not know. Also, that seems less than intuitive to me. If the PR is coming from a branch in RuQing's repository to |
|
You would edit the PR to come from the local copy, so you can update it. Not a big difference either way. |
|
@xrq-phys Is this a typo (in `frame/base/bli_cpuid.c`)?

```c
#ifdef BLIS_CONFIG_ARMSVE
	if ( bli_cpuid_is_armsve( model, part, features ) )
		return BLIS_ARCH_ARMSVE;
#endif
#ifdef BLIS_CONFIG_ARMSVE
	if ( bli_cpuid_is_a64fx( model, part, features ) )
		return BLIS_ARCH_A64FX;
#endif
```

Seems like the second macro guard would normally be `BLIS_CONFIG_A64FX`?
Details:
- Changed the order of the new A64fx and SVE code fragments to appear at the beginning of the armv8a-related code (rather than after other armv8a code).
- Fixed what is probably a copy-paste bug in frame/base/bli_cpuid.c. Previously, the a64fx conditional check was guarded by the cpp macro BLIS_CONFIG_ARMSVE, which has now been changed to BLIS_CONFIG_A64FX.
|
Thanks @xrq-phys. I'll merge this now. (And don't worry about the typo. It was easy for me to spot.) |
|
Apologies, I got distracted by other tasks/people yesterday and forgot to click the button. |
|
@xrq-phys As I was preparing to squash-and-merge, I realized that the default log message (ie: the concatenation of all constituent commit log entries) is a bit unwieldy. Could you summarize the changes you made in a way that would allow me to create a more concise commit log entry? |
|
@fgvanzee Oh, that's right. This branch has been rebased several times, so the commit messages have become a little unwieldy. I suppose a brief summary would be:
|
|
Thanks RuQing, that summary was great. |
|
Thanks a lot! |
|
Thank you RuQing for all your hard work.
|
NOTICE: Branch xrq-phys:armsve-cfg-venture has been rebased/reworked several times. Comments below may not reflect the code commits that were originally their context.

I'm reopening #422 with a few updates:

- A new `dgemm` kernel specialized for the A64fx chip (reason below);
- A new `a64fx` subconfiguration.

Reason for a different `dgemm` kernel for A64fx:

The `dgemm_armsve512_asm` kernel under `kernels/armsve/3` is mainly composed of SVE indexed `FMLA` instructions (opcode = 0x64). This strategy is the same as the `dgemm_armsve256_asm` kernel located in the same directory. It is able to increase the interval between a vector's load and its reference by `FMLA`. However, the actual profiling result of that kernel (test size: GEMM(2000,1400,500)) gives the following: the left part of the combo histogram shows that most of the time the processor is committing only 1 or 2 instructions, while A64fx has 2 FP pipelines and 2 integer pipelines, summing up to 4. This fact drastically lowers the final GFLOPS yielded (c.f. the spread at the end). However, the FP stall rate and memory/cache access wait are quite low, indicating no impediment to the FP pipelines.

According to the A64fx uarch manual (https://github.com/fujitsu/A64FX), the FP pipeline in A64fx does not have an element duplicator for indexed SVE operations, so that one single 0x64 `FMLA` is executed with both FP pipelines, each half-occupied. As a workaround, another kernel was created for A64fx with 0x64 `FMLA` replaced by 0x65 `FMLA`, and it does yield higher GFLOPS.

Again, I want to say that this pull request contains 2 components:

- `dgemm` kernels, with one more specialized for A64fx;
- a new configuration set for Arm SVE (including the `a64fx` subconfig).

It can be separated into 4 dependent but self-inclusive themes, so if you feel this pull request is too big, please feel free to close it and let me know. I'll relaunch with separated code changes.
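The contrast between the two `FMLA` forms described above can be sketched roughly as follows. This is illustrative SVE assembly, not copied from the actual kernels; the register choices and addressing are made up.

```asm
// Indexed FMLA (opcode group 0x64): multiply every lane of z1 by a
// single element selected out of z2. Per the A64fx uarch manual,
// this form occupies both FP pipelines, each half-utilized.
fmla    z0.d, z1.d, z2.d[0]

// Workaround used in the A64fx-specific kernel (schematically):
// replicate the B element into all lanes at load time with ld1rd,
// then issue the regular predicated FMLA (opcode group 0x65),
// which runs on a single FP pipeline at full rate.
ld1rd   { z2.d }, p0/z, [x1]    // broadcast one double to all lanes
fmla    z0.d, p0/m, z1.d, z2.d
```

The trade-off is that the broadcast moves from inside the FMA unit to an explicit `ld1rd`, which costs load-port bandwidth and register pressure but keeps both FP pipelines fully fed.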