Add external PR integration process and flowchart to CONTRIBUTING.md#42

Closed
chandrkr wants to merge 427 commits into amd:dev from chandrkr:aocl-blas-contribution-guide

Conversation

@chandrkr

  1. Documented the process for handling external pull requests, including validation, review, and notification steps.
  2. Added a markdown flowchart illustrating the PR workflow.
  3. Included communication and contributor attribution notes.

harsdave and others added 30 commits September 30, 2021 11:41
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 4x3 ZGEMM kernel with vector fma by utilizing ymm registers
   efficiently to produce 12 dcomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ztrsm_small in the ztrsm_ BLAS path for single thread
   when (m,n)<500 and for multithread when (m+n)<128
-- Taken care of --disable_pre_inversion configuration
-- Achieved 10% average performance improvement for sizes less than 500
-- modularized all 16 combinations of trsm into 4 kernels

Change-Id: I3cb42a1385f6b3b82d6c470912242675789cce75
    -- Added number of threads used in DTL logs
    -- Added support for timestamps in DTL traces
    -- Added time taken by API at BLAS layer in the DTL logs
    -- Added GFLOPS achieved in DTL logs
    -- Added support to enable/disable execution time and
       gflops printing for individual APIs. We may not want
       it for all APIs; it will also help us migrate APIs
       to execution time and gflops logs in stages.
    -- Updated GEMM bench to match new logs
    -- Refactored aocldtl_blis.c to remove code duplication.
    -- Clean up logs generation and reading to use spaces
       consistently to separate various fields.
    -- Updated AOCL_gettid() to return correct thread id
       when using pthreads.

AMD-Internal: [CPUPL-1691]
Change-Id: Iddb8a3be2a5cd624a07ccdbf5ae0695799d8ae8e
Details
- The axpyf-based implementation incurs function-call (axpyf) overhead.
- The new implementation reduces this function-call overhead.
- It uses a kernel of size 4x4.
- It gives better performance for smaller sizes when
  compared to the axpyf-based implementation.

AMD-Internal: [CPUPL-1402]
Change-Id: I5fa421b8c1d2b44c991c2a05e8f5b01b83eb4b37
Details:
-- AMD Internal Id: [CPUPL-1702]
-- Used 16x6 SGEMM kernel with vector fma by utilizing ymm registers
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Taken care of --disable_pre_inversion configuration
-- Modularized all 16 combinations of strsm into 4 kernels

Change-Id: I30a1551967c36f6bae33be3b7ae5b7fcc7c905ea
1. Max thread cap added for small dimension based on product(n*k).

AMD-Internal: [CPUPL-1388]

Change-Id: I34412a1374bb58a9c4b3fd8e40949a69006cf057
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 8x3 CGEMM kernel with vector fma by utilizing ymm registers
   efficiently to produce 24 scomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ctrsm_small in the ctrsm_ BLAS path for single thread
   when (m,n)<1000 and for multithread when (m+n)<320
-- Taken care of --disable_pre_inversion configuration
-- Achieved 13% average performance improvement for sizes less than 1000
-- modularized all 16 combinations of trsm into 4 kernels

Change-Id: I557c5bcd8cb7c034acd99ce0666bc411e9c4fe64
Details
- The axpyf-based implementation incurs function-call (axpyf) overhead.
- The new implementation reduces this function-call overhead.
- It uses a kernel of size 8x4.
- It gives better performance for smaller sizes when
  compared to the axpyf-based implementation.

AMD-Internal: [CPUPL-1402]
Change-Id: Ic9a5e59363290caf26284548638da9065952fd48
Details:
AMD Internal Id: CPUPL-1702
- While performing the trsm function, A's imaginary
part needs to be negated as per the conjugate
transpose.
- So in the case of conjugate transpose, A's imaginary
part is negated before doing trsm.

Change-Id: Ic736733a483eeadf6356952b434128c0af988e36
Details:
AMD Internal Id: CPUPL-1702
- For the cases where A is of dimension 1x1, on either the
left or right hand side, A's only element is conjugate
transposed by negating its imaginary component.

Change-Id: I696ae982d9d60e0e702edaba98acbe9a5b0cd44c
AMD-Internal: [CPUPL-1691]
Change-Id: Idc273666054529db5a2fb96a7d7ebbf7a3f5b008
  -- Reverted changes made to include lp/ilp info in binary name
     This reverts commit c5e6f88.

  -- Included BLAS int size in 'make showconfig'

  -- Renamed amdepyc configuration to amdzen

Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6
        Details:
          - Accuracy failures observed when fast math and ILP64 are enabled.
          - Disabled the feature with the macro BLIS_ENABLE_FAST_MATH.

        AMD-Internal: [CPUPL-1907]

Change-Id: I92c661647fb8cc5f1d0af8f6c4eae0fac1df5f16
Change-Id: Ie05eafbeacbd5589b514d9353517330515104939
Details:
-- Reverted cscalv, zscalv, ctrsm, ztrsm changes to address accuracy issues
observed in libflame and ScaLAPACK application testing.
-- AMD-Internal: [CPUPL-1906], [CPUPL-1914]

Change-Id: Ic364eacbdf49493dd3a166a66880c12ee84c2204
Details:
- Perf regression is observed for certain m,n,k inputs where (m,n,k > 512)
  and (m > 4 * n) in BLIS 3.1. The root cause was traced to commit
  11dfc17 where BLIS_THREAD_RATIO_M was
  updated from 2 to 1. This change was not part of BLIS 3.0.6 and hence
  resulted in the new perf drop in 3.1.
- This workaround updates the m dimension (doubles it) that is passed as
  argument to bli_rntm_set_ways_for_op which is used to determine the ic,jc
  work split in the threads. The BLIS_THREAD_RATIO_M is not updated (to 2)
  and rather the effect is induced using the doubled m dimension.

AMD-Internal: [CPUPL-1909]
Change-Id: I3b6ec4d4a22154289cb56d8f7db4cb60e5f34afe
This commit fixes the issue for the gemm and copy APIs.

The BLIS binary with the dynamic dispatch feature was crashing on non-Zen
CPUs (specifically CPUs without AVX2 support).
The crash was caused by unsupported instructions in Zen-optimized kernels.
The issue is fixed by calling only reference kernels if the architecture
detected at runtime is not zen, zen2 or zen3.

AMD-Internal: [CPUPL-1930]

Change-Id: Ief57cd457b87542aa1a7bad64dc36c01f0d1a366
Removed direct calling of zen kernels in cblas source itself.
Similar optimizations are done by the function directly invoked from
Cblas layer.

The BLIS binary with the dynamic dispatch feature was crashing on non-Zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by unsupported
instructions in Zen-optimized kernels. The issue is fixed by calling only
reference kernels if the architecture detected at runtime is not zen, zen2 or zen3.

AMD-Internal: [CPUPL-1930]
Change-Id: I9178b7a98f2563dee2817064f37fcbb84073eeea
Removed direct calling of zen kernels in blis interface for
trsm, scalv, swapv.

The BLIS binary with the dynamic dispatch feature was crashing on non-Zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by unsupported
instructions in Zen-optimized kernels. The issue is fixed by calling only
reference kernels if the architecture detected at runtime is not zen, zen2 or zen3.

AMD-Internal: [CPUPL-1930]
Change-Id: I7944d131d376e2c4e778fe441a8b030674952b81
Direct calls to zen kernels replaced by architecture
dependent calls for dotv and amaxv kernels. For non-zen
architecture, generic function is called using the BLIS
interface. For zen architecture, direct calls to zen
optimized kernels are made.

Change-Id: I49fc9abc813434d6a49a23f49e47d16e95b7899f
Removed direct calling of zen kernels in ctrsv, ztrsv interface.

The BLIS binary with the dynamic dispatch feature was crashing on non-Zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by unsupported
instructions in Zen-optimized kernels.

AMD-Internal: [CPUPL-1930]
Change-Id: I21f25a09cd6ffb013d16c66ea10aa9a42f7cad5b
…nd axpy routines.

  Summary:
    1. This commit fixes the issue for the gemv and axpy APIs.
    2. The BLIS binary with the dynamic dispatch feature was
    crashing on non-Zen CPUs (specifically CPUs without
    AVX2 support).
    3. The crash was caused by unsupported instructions
    in Zen-optimized kernels. The issue is fixed by calling
    only reference kernels if the architecture detected at
    runtime is not zen, zen2 or zen3.

Change-Id: Icc6f7fdc80bc58fac1a97b1502b6f269e5e89aa4
      The small gemm implementation is called from the gemmnat path;
      when the library is built as multi-threaded, small gemm
      is completely disabled.

      For single-threaded builds, the crash is fixed by disabling
      small gemm on the generic architecture.

AMD-Internal: [CPUPL-1930]
Change-Id: If718870d89909cef908a1c23918b7ef6f7d80f7a
MithunMohanKadavil and others added 22 commits February 26, 2025 06:30
- New packing kernels for the A matrix, based on both the AVX512 and AVX2 ISAs,
for both row- and column-major storage, are added as part of this change.
The dependency on haswell A-packing kernels is removed by this.
- Tiny GEMM thresholds are further tuned for the BF16 and F32 APIs.

AMD-Internal: [SWLCSG-3380, SWLCSG-3415]

Change-Id: I7330defacbacc9d07037ce1baf4a441f941e59be
Description:

1. When the gcc compiler version is less than 11.2, a few BF16
   instructions are not supported by the compiler even though the
   zen4 and zen5 processor architectures support them.

2. These instructions are now guarded with a macro.


Change-Id: Ib07d41ff73d8fe14937af411843286c0e80c4131
Description
 - Zero point support for <s32/s8/bf16/u8> datatype in element-wise
   postop only f32o<f32/s8/u8/s32/bf16> APIs.

 AMD-Internal: [SWLCSG-3390]

Change-Id: I2fdb308b05c1393013294df7d8a03cdcd7978379
Description:
1. Implemented the s8 unreorder API function, which performs
   unreordering of a reordered int8 matrix.
2. Removed the bf16vnni check for the bf16 unreorder reference API
   because reference code can work on any architecture.
3. Tested the reference code for all main and fringe paths.

AMD-Internal: [SWLCSG-3426]

Change-Id: I920f807be870e1db5f9d0784cdcec7b366e1eff5
- Currently the scale factor is loaded without using a mask in the
downscale and matrix add/mul ops in the F32 eltwise kernels. This results
in out-of-bounds memory reads when n is not a multiple of NR (64).
- The loads are updated to masked loads to fix this.

AMD-Internal: [SWLCSG-3390]

Change-Id: Ib2fc555555861800c591344dc28ac0e3f63fd7cb
- Updated the bli_dgemv_zen_ref( ... ) kernel to support general stride.
- Since the latest dgemv kernels don't support general stride, added
  checks to invoke bli_dgemv_zen_ref( ... ) when A matrix has a general
  stride.
- Thanks to Vignesh Balasubramanian <[email protected]>
  for finding this issue.

AMD-Internal: [CPUPL-6492]
Change-Id: Ia987ce7674cb26cb32eea4a6e9bd6623f2027328
- In 8x24 DGEMM kernel, prefetch is always done assuming
  row major C.
- For TRSM, the DGEMM kernel can be called with column major C also.
- Current prefetch logic results in suboptimal performance.
- Changed C prefetch logic so that correct C is prefetched for both row
  and column major C.

 AMD-Internal: [CPUPL-6493]

Change-Id: I7c732ceac54d1056159b3749544c5380340aacd2
(cherry picked from commit e6ca01c)
Details:
- Fixed the logic to identify an API that has int4 weights in
  bench files for gemm and batch_gemm.
- Eliminated the memcpy instructions used in pack functions of
  zen4 kernels and replaced them with masked load instruction.
  This ensures that the load register will be populated with
  zeroes at locations where mask is set to zero.

Change-Id: I8dd1ea7779c8295b7b4adec82069e80c6493155e
AMD-Internal:[SWLCSG-3274]
(cherry picked from commit 6c29236)
Details:
- Group quantization is a technique to improve accuracy
  where the scale factors used to quantize inputs and weights
  vary at the group level instead of at the per-channel
  and per-tensor level.
- Added new bench files to test GEMM with symmetric static
  quantization.
- Added new get_size and reorder functions to account for
  storing sum of col-values separately per group.
- Added new framework, kernels to support the same.
- The scalefactors could be of type float or bf16.

AMD-Internal:[SWLCSG-3274]

Change-Id: I3e69ecd56faa2679a4f084031d35ffb76556230f
(cherry picked from commit 7243a5d)
Details:
- Setting post_op_grp to NULL at the start of the post-op
  creator to ensure that there is no junk (non-null) value,
  which might lead to the destroyer trying to free
  non-allocated buffers.

AMD-Internal: [SWLCSG-3274]
Change-Id: I45a54d01f0d128d072d5d9c7e66fc08412c7c79c
(cherry picked from commit 1da554a)
- Currently the thread-id spread does not happen for n=4096 even if
there are threads available to facilitate it. Updated the threshold
to account for this.

AMD-Internal: [SWLCSG-3185]
Change-Id: I281b1639c32ba2145bd84062324f1f11b1167eeb
 - Currently, the bf16 reorder function does not add padding for
   n=1 cases. But, the bf16 AVX2 Unreorder path considers the input
   re-ordered B matrix to be padded along the n and k dimension.
 - Hence, modified the conditions to make sure the path doesn't break
   when the AVX2 kernels are executed on AVX512 machines with a
   reordered B matrix.

Change-Id: I7dd3d37a24758a8e93e80945b533abfcf15f65a1
Updated CMakeLists.txt to remove GNU extensions for both C and C++.
Now during building -std=c99 is used instead of -std=gnu99.

Signed-off-by: Jagadish R <[email protected]>
AMD-Internal: [CPUPL-6553]
Change-Id: I98150707990112c5736660d287f1ddbe71a4e8e6
- Corrected a typo in dgemm kernel implementation, beta=0 and
  n_left=6 edge kernel.

Thanks to Shubham Sharma<[email protected]> for helping with debugging.

AMD-Internal: [CPUPL-6443]
Change-Id: Ifa1e16ec544b7e85c21651bc23c4c27e86d6730b
(cherry picked from commit a359a25)
Rename generated aocl-blas.pc and aocl-blas-mt.pc to blis.pc and blis-mt.pc.

AMD-Internal: [SWLCSG-3446]
Change-Id: Ica784c7a0fd1e52b4d419795659947316e932ef6
(cherry picked from commit 9f263d2)
Description:
1. For the column-major case, when m=1 there was an accuracy mismatch
   with post ops (bias, matrix_add).
2. Added a check for the column-major case and replaced _mm512_loadu_ps
   with _mm512_maskz_loadu_ps.

AMD-Internal: [CPUPL-6585]

Change-Id: I8d98e2cb0b9dd445c9868f4c8af3abbc6c2dfc95
 - Added a column-major path for the BF16 tiny path.
 - Tuned tiny-path thresholds to allow a few more inputs into the
   tiny path.

AMD-Internal: [SWLCSG-3380]
Change-Id: I9a5578c9f0d689881fc5a67ab778e6a917c4fce1
Change-Id: Ifb9542adec045ca4afbb8237469a391051673527
 - TRSM small was wrongly selected instead of the native path; this is now fixed.

AMD-Internal: [SWLCSG-3338]
Change-Id: I7a06a483fd874c71562a924b50118e0fc9e3b213
(cherry picked from commit 350c718)
Change-Id: I522b9b07f6b03ea31fa829a038db3016c3fb13ac
@chandrkr chandrkr force-pushed the aocl-blas-contribution-guide branch 2 times, most recently from 3eb7bb4 to 7a33f29 on September 17, 2025 10:05
1. Documented the process for handling external pull requests,
   including validation, review, and notification steps.
2. Added a markdown flowchart illustrating the PR workflow.
3. Included communication and contributor attribution notes.
@chandrkr chandrkr force-pushed the aocl-blas-contribution-guide branch from 7a33f29 to fbff020 on September 17, 2025 10:16
@chandrkr chandrkr changed the base branch from master to dev September 19, 2025 06:40
Collaborator

@kvaragan kvaragan left a comment


Make the changes in dev branch.

@chandrkr chandrkr closed this Sep 19, 2025
vkirangoud pushed a commit to vkirangoud/blis that referenced this pull request Nov 15, 2025
Details:
- Fixed the problem decomposition for n-fringe case of
  6x64 AVX512 FP32 kernel by updating the pointers
  correctly after each fringe kernel call.

- AMD-Internal: [SWLCSG-3556]
