Add external PR integration process and flowchart to CONTRIBUTING.md#42

Closed
chandrkr wants to merge 427 commits into amd:dev from chandrkr:aocl-blas-contribution-guide

Conversation

@chandrkr

  1. Documented the process for handling external pull requests, including validation, review, and notification steps.
  2. Added a markdown flowchart illustrating the PR workflow.
  3. Included communication and contributor attribution notes.

harsdave and others added 30 commits September 30, 2021 11:41
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 4x3 ZGEMM kernel with vector fma by utilizing ymm registers
   efficiently to produce 12 dcomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ztrsm_small in the ztrsm_ BLAS path for single thread
   when (m,n)<500 and for multithread when (m+n)<128
-- Taken care of --disable_pre_inversion configuration
-- Achieved 10% average performance improvement for sizes less than 500
-- modularized all 16 combinations of trsm into 4 kernels

Change-Id: I3cb42a1385f6b3b82d6c470912242675789cce75
    -- Added number of threads used in DTL logs
    -- Added support for timestamps in DTL traces
    -- Added time taken by API at BLAS layer in the DTL logs
    -- Added GFLOPS achieved in DTL logs
    -- Added support to enable/disable execution time and
       gflops printing for individual APIs. We may not want
       it for all APIs; it will also help us migrate APIs
       to execution time and gflops logs in stages.
    -- Updated GEMM bench to match new logs
    -- Refactored aocldtl_blis.c to remove code duplication.
    -- Clean up logs generation and reading to use spaces
       consistently to separate various fields.
    -- Updated AOCL_gettid() to return correct thread id
       when using pthreads.

AMD-Internal: [CPUPL-1691]
Change-Id: Iddb8a3be2a5cd624a07ccdbf5ae0695799d8ae8e
Details
- The axpyf-based implementation incurs function-call (axpyf) overhead.
- The new implementation reduces this function-call overhead.
- It uses a kernel of size 4x4.
- It gives better performance for smaller sizes when
  compared to the axpyf-based implementation.

AMD-Internal: [CPUPL-1402]
Change-Id: I5fa421b8c1d2b44c991c2a05e8f5b01b83eb4b37
Details:
-- AMD Internal Id: [CPUPL-1702]
-- Used 16x6 SGEMM kernel with vector fma by utilizing ymm registers
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Taken care of --disable_pre_inversion configuration
-- Modularized all 16 combinations of strsm into 4 kernels

Change-Id: I30a1551967c36f6bae33be3b7ae5b7fcc7c905ea
1. Max thread cap added for small dimension based on product(n*k).

AMD-Internal: [CPUPL-1388]

Change-Id: I34412a1374bb58a9c4b3fd8e40949a69006cf057
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 8x3 CGEMM kernel with vector fma by utilizing ymm registers
   efficiently to produce 24 scomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ctrsm_small in the ctrsm_ BLAS path for single thread
   when (m,n)<1000 and for multithread when (m+n)<320
-- Taken care of --disable_pre_inversion configuration
-- Achieved 13% average performance improvement for sizes less than 1000
-- modularized all 16 combinations of trsm into 4 kernels

Change-Id: I557c5bcd8cb7c034acd99ce0666bc411e9c4fe64
Details
- The axpyf-based implementation incurs function-call (axpyf) overhead.
- The new implementation reduces this function-call overhead.
- It uses a kernel of size 8x4.
- It gives better performance for smaller sizes when
  compared to the axpyf-based implementation.

AMD-Internal: [CPUPL-1402]
Change-Id: Ic9a5e59363290caf26284548638da9065952fd48
Details:
AMD Internal Id: CPUPL-1702
- While performing the trsm function, A's imaginary
part needs to be negated as per the conjugate
transpose.
- So in the case of conjugate transpose, A's imaginary
part is negated before doing trsm.

Change-Id: Ic736733a483eeadf6356952b434128c0af988e36
Details:
AMD Internal Id: CPUPL-1702
- For the cases where A is of dimension 1x1, on either the
left or right hand side, A's only element is conjugate
transposed by negating its imaginary component.

Change-Id: I696ae982d9d60e0e702edaba98acbe9a5b0cd44c
AMD-Internal: [CPUPL-1691]
Change-Id: Idc273666054529db5a2fb96a7d7ebbf7a3f5b008
  -- Reverted changes made to include lp/ilp info in binary name
     This reverts commit c5e6f88.

  -- Included BLAS int size in 'make showconfig'

  -- Renamed amdepyc configuration to amdzen

Change-Id: Ie87ec1c03e105f606aef1eac397ba0d8338906a6
        Details:
          - Accuracy failures observed when fast math and ILP64 are enabled.
          - Disabled the feature with the macro BLIS_ENABLE_FAST_MATH.

        AMD-Internal: [CPUPL-1907]

Change-Id: I92c661647fb8cc5f1d0af8f6c4eae0fac1df5f16
Change-Id: Ie05eafbeacbd5589b514d9353517330515104939
Details:
-- Reverted cscalv, zscalv, ctrsm, ztrsm changes to address accuracy issues
observed in libflame and ScaLAPACK application testing.
-- AMD-Internal: [CPUPL-1906], [CPUPL-1914]

Change-Id: Ic364eacbdf49493dd3a166a66880c12ee84c2204
Details:
- Perf regression is observed for certain m,n,k inputs where (m,n,k > 512)
  and (m > 4 * n) in BLIS 3.1. The root cause was traced to commit
  11dfc17 where BLIS_THREAD_RATIO_M was
  updated from 2 to 1. This change was not part of BLIS 3.0.6 and hence
  resulted in the new perf drop in 3.1.
- This workaround updates the m dimension (doubles it) that is passed as
  argument to bli_rntm_set_ways_for_op which is used to determine the ic,jc
  work split in the threads. The BLIS_THREAD_RATIO_M is not updated (to 2)
  and rather the effect is induced using the doubled m dimension.

AMD-Internal: [CPUPL-1909]
Change-Id: I3b6ec4d4a22154289cb56d8f7db4cb60e5f34afe
This commit fixes the issue for the gemm and copy APIs.

The BLIS binary with the dynamic dispatch feature was crashing on non-Zen
CPUs (specifically CPUs without AVX2 support).
The crash was caused by unsupported instructions in Zen-optimized kernels.
The issue is fixed by calling only reference kernels if the architecture
detected at runtime is not zen, zen2 or zen3.

AMD-Internal: [CPUPL-1930]

Change-Id: Ief57cd457b87542aa1a7bad64dc36c01f0d1a366
Removed direct calling of zen kernels in cblas source itself.
Similar optimizations are done by the function directly invoked from
Cblas layer.

The BLIS binary with the dynamic dispatch feature was crashing on non-Zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by unsupported
instructions in Zen-optimized kernels. The issue is fixed by calling only
reference kernels if the architecture detected at runtime is not zen, zen2 or zen3.

AMD-Internal: [CPUPL-1930]
Change-Id: I9178b7a98f2563dee2817064f37fcbb84073eeea
Removed direct calling of zen kernels in blis interface for
trsm, scalv, swapv.

The BLIS binary with the dynamic dispatch feature was crashing on non-Zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by unsupported
instructions in Zen-optimized kernels. The issue is fixed by calling only
reference kernels if the architecture detected at runtime is not zen, zen2 or zen3.

AMD-Internal: [CPUPL-1930]
Change-Id: I7944d131d376e2c4e778fe441a8b030674952b81
Direct calls to zen kernels replaced by architecture
dependent calls for dotv and amaxv kernels. For non-zen
architecture, generic function is called using the BLIS
interface. For zen architecture, direct calls to zen
optimized kernels are made.

Change-Id: I49fc9abc813434d6a49a23f49e47d16e95b7899f
Removed direct calling of zen kernels in ctrsv, ztrsv interface.

The BLIS binary with the dynamic dispatch feature was crashing on non-Zen CPUs
(specifically CPUs without AVX2 support). The crash was caused by unsupported
instructions in Zen-optimized kernels.

AMD-Internal: [CPUPL-1930]
Change-Id: I21f25a09cd6ffb013d16c66ea10aa9a42f7cad5b
…nd axpy routines.

  Summary:
    1. This commit fixes the issue for the gemv and axpy APIs.
    2. The BLIS binary with the dynamic dispatch feature was
    crashing on non-Zen CPUs (specifically CPUs without
    AVX2 support).
    3. The crash was caused by unsupported instructions
    in Zen-optimized kernels. The issue is fixed by calling
    only reference kernels if the architecture detected at
    runtime is not zen, zen2 or zen3.

Change-Id: Icc6f7fdc80bc58fac1a97b1502b6f269e5e89aa4
      The small gemm implementation is called from the gemmnat path;
      when the library is built as multi-threaded, small gemm
      is completely disabled.

      For single-threaded builds, the crash is fixed by disabling
      small gemm on the generic architecture.

AMD-Internal: [CPUPL-1930]
Change-Id: If718870d89909cef908a1c23918b7ef6f7d80f7a
MithunMohanKadavil and others added 22 commits February 26, 2025 06:30
- New packing kernels for the A matrix, based on both the AVX512 and AVX2 ISAs,
for both row- and column-major storage, are added as part of this change.
The dependency on haswell A-packing kernels is removed by this.
- Tiny GEMM thresholds are further tuned for the BF16 and F32 APIs.

AMD-Internal: [SWLCSG-3380, SWLCSG-3415]

Change-Id: I7330defacbacc9d07037ce1baf4a441f941e59be
Description:

1. When the gcc compiler version is less than 11.2, a few BF16
   instructions are not supported by the compiler even though the
   zen4 and zen5 processor architectures support them.

2. These instructions are now guarded with a macro.


Change-Id: Ib07d41ff73d8fe14937af411843286c0e80c4131
Description
 - Zero point support for <s32/s8/bf16/u8> datatype in element-wise
   postop only f32o<f32/s8/u8/s32/bf16> APIs.

 AMD-Internal: [SWLCSG-3390]

Change-Id: I2fdb308b05c1393013294df7d8a03cdcd7978379
Description:
1. Implemented the s8 unreorder API function, which performs
   unreordering of a reordered int8 matrix.
2. Removed the bf16vnni check for the bf16 unreorder reference API
   because reference code can work on any architecture.
3. Tested the reference code for all main and fringe paths.

AMD-Internal: [SWLCSG-3426]

Change-Id: I920f807be870e1db5f9d0784cdcec7b366e1eff5
- Currently the scale factor is loaded without using a mask in the
downscale and matrix add/mul ops in the F32 eltwise kernels. This results
in out-of-bounds memory reads when n is not a multiple of NR (64).
- The loads are updated to masked loads to fix this.

AMD-Internal: [SWLCSG-3390]

Change-Id: Ib2fc555555861800c591344dc28ac0e3f63fd7cb
- Updated the bli_dgemv_zen_ref( ... ) kernel to support general stride.
- Since the latest dgemv kernels don't support general stride, added
  checks to invoke bli_dgemv_zen_ref( ... ) when A matrix has a general
  stride.
- Thanks to Vignesh Balasubramanian <[email protected]>
  for finding this issue.

AMD-Internal: [CPUPL-6492]
Change-Id: Ia987ce7674cb26cb32eea4a6e9bd6623f2027328
- In 8x24 DGEMM kernel, prefetch is always done assuming
  row major C.
- For TRSM, the DGEMM kernel can be called with column major C also.
- Current prefetch logic results in suboptimal performance.
- Changed C prefetch logic so that correct C is prefetched for both row
  and column major C.

 AMD-Internal: [CPUPL-6493]

Change-Id: I7c732ceac54d1056159b3749544c5380340aacd2
(cherry picked from commit e6ca01c)
Details:
- Fixed the logic to identify an API that has int4 weights in
  bench files for gemm and batch_gemm.
- Eliminated the memcpy instructions used in pack functions of
  zen4 kernels and replaced them with masked load instruction.
  This ensures that the load register will be populated with
  zeroes at locations where mask is set to zero.

Change-Id: I8dd1ea7779c8295b7b4adec82069e80c6493155e
AMD-Internal:[SWLCSG-3274]
(cherry picked from commit 6c29236)
Details:
- Group quantization is a technique to improve accuracy
  where the scale factors used to quantize inputs and weights
  vary at the group level instead of at the per-channel
  and per-tensor level.
- Added new bench files to test GEMM with symmetric static
  quantization.
- Added new get_size and reorder functions to account for
  storing sum of col-values separately per group.
- Added new framework, kernels to support the same.
- The scalefactors could be of type float or bf16.

AMD-Internal:[SWLCSG-3274]

Change-Id: I3e69ecd56faa2679a4f084031d35ffb76556230f
(cherry picked from commit 7243a5d)
Details:
- Setting post_op_grp to NULL at the start of the post-op
  creator to ensure that there is no junk (non-null) value,
  which might lead to the destroyer trying to free
  non-allocated buffers.

AMD-Internal: [SWLCSG-3274]
Change-Id: I45a54d01f0d128d072d5d9c7e66fc08412c7c79c
(cherry picked from commit 1da554a)
- Currently the thread-id spread does not happen for n=4096 even if
there are threads available to facilitate it. Updated the threshold
to account for this.

AMD-Internal: [SWLCSG-3185]
Change-Id: I281b1639c32ba2145bd84062324f1f11b1167eeb
 - Currently, the bf16 reorder function does not add padding for
   n=1 cases. But, the bf16 AVX2 Unreorder path considers the input
   re-ordered B matrix to be padded along the n and k dimension.
 - Hence, modified the conditions to make sure the path doesn't break
   when the AVX2 kernels are executed on AVX512 machines with a
   reordered B matrix.

Change-Id: I7dd3d37a24758a8e93e80945b533abfcf15f65a1
Updated CMakeLists.txt to remove GNU extensions for both C and C++.
Now during building -std=c99 is used instead of -std=gnu99.

Signed-off-by: Jagadish R <[email protected]>
AMD-Internal: [CPUPL-6553]
Change-Id: I98150707990112c5736660d287f1ddbe71a4e8e6
- Corrected a typo in dgemm kernel implementation, beta=0 and
  n_left=6 edge kernel.

Thanks to Shubham Sharma<[email protected]> for helping with debugging.

AMD-Internal: [CPUPL-6443]
Change-Id: Ifa1e16ec544b7e85c21651bc23c4c27e86d6730b
(cherry picked from commit a359a25)
Rename generated aocl-blas.pc and aocl-blas-mt.pc to blis.pc and blis-mt.pc.

AMD-Internal: [SWLCSG-3446]
Change-Id: Ica784c7a0fd1e52b4d419795659947316e932ef6
(cherry picked from commit 9f263d2)
Description:
1. For the column-major case, when m=1 there was an accuracy mismatch
   with post ops (bias, matrix_add).
2. Added a check for the column-major case and replaced _mm512_loadu_ps
   with _mm512_maskz_loadu_ps.

AMD-Internal: [CPUPL-6585]

Change-Id: I8d98e2cb0b9dd445c9868f4c8af3abbc6c2dfc95
 - Added a column-major path for the BF16 tiny path.
 - Tuned tiny-path thresholds to allow a few more inputs into the
   tiny path.

AMD-Internal: [SWLCSG-3380]
Change-Id: I9a5578c9f0d689881fc5a67ab778e6a917c4fce1
Change-Id: Ifb9542adec045ca4afbb8237469a391051673527
 - TRSM small was wrongly selected instead of the native path; this is now fixed.

AMD-Internal: [SWLCSG-3338]
Change-Id: I7a06a483fd874c71562a924b50118e0fc9e3b213
(cherry picked from commit 350c718)
Change-Id: I522b9b07f6b03ea31fa829a038db3016c3fb13ac
@chandrkr chandrkr force-pushed the aocl-blas-contribution-guide branch 2 times, most recently from 3eb7bb4 to 7a33f29 on September 17, 2025 10:05
1. Documented the process for handling external pull requests,
   including validation, review, and notification steps.
2. Added a markdown flowchart illustrating the PR workflow.
3. Included communication and contributor attribution notes.
@chandrkr chandrkr force-pushed the aocl-blas-contribution-guide branch from 7a33f29 to fbff020 on September 17, 2025 10:16
@chandrkr chandrkr changed the base branch from master to dev September 19, 2025 06:40
Collaborator

@kvaragan kvaragan left a comment


Make the changes in dev branch.

@chandrkr chandrkr closed this Sep 19, 2025
vkirangoud pushed a commit to vkirangoud/blis that referenced this pull request Nov 15, 2025
Details:
- Fixed the problem decomposition for n-fringe case of
  6x64 AVX512 FP32 kernel by updating the pointers
  correctly after each fringe kernel call.

- AMD-Internal: [SWLCSG-3556]
