Fix/f16 reduction accum #344
Merged
HenryNdubuaku merged 5 commits into cactus-compute:main on Feb 18, 2026
Conversation
Signed-off-by: vyomshah05 <[email protected]>
Collaborator
I looked over this PR and it looks good. Have you found any other instances of this mistake in other parts of the kernel?
Signed-off-by: vyomshah05 <[email protected]>
Contributor
Author
Yes, I noticed the error in the axis and mean functions as well, and made similar fixes and validation tests for them.
Collaborator
Taking a further look into this. Will provide feedback by EOD tomorrow at worst.
Signed-off-by: HenryNdubuaku <[email protected]>
Collaborator
So there are a few changes we need to make. First, please remove the tests: we do not need a test for this instance, since we understand that increased precision will result in higher accuracy. Second, remove all comments from the code.
Signed-off-by: vyomshah05 <[email protected]>
Contributor
Author
@ParkiratS I cleaned up the code and tests as you said.
Collaborator
@vyomshah05 Looks good, I will let Henry know that this is ready to merge.
ncylich pushed a commit that referenced this pull request on Feb 24, 2026:

* Fix FP16 reduction accumulation
* Fixed test
* Fixed error across all functions
* Add citation
* Removed unnecessary tests

Signed-off-by: vyomshah05 <[email protected]>
Signed-off-by: HenryNdubuaku <[email protected]>
Co-authored-by: HenryNdubuaku <[email protected]>
cattermelon1234 pushed a commit to cattermelon1234/cactus that referenced this pull request on Feb 28, 2026.
Fix: FP16 Reduction Accumulation Precision
Summary
Fixes a numerical precision issue in `cactus_sum_all_f16` where accumulation was performed in FP16, causing potential overflow and silent precision loss for large tensors. Accumulation is now performed in FP32 while keeping the API unchanged.
Problem
Previous implementation accumulated directly in FP16:
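The original C++ snippet is not reproduced here, but the failure mode can be illustrated with a NumPy sketch (the function name is hypothetical; NumPy's `float16` follows the same IEEE 754 binary16 rounding as `__fp16`):

```python
import numpy as np

def sum_all_f16_naive(x):
    # Mirrors the buggy pattern: the running total is rounded
    # back to FP16 after every single addition.
    acc = np.float16(0.0)
    for v in x:
        acc = np.float16(acc + v)  # FP16 + FP16 stays FP16
    return acc

ones = np.ones(32768, dtype=np.float16)
print(sum_all_f16_naive(ones))  # 2048.0, not 32768.0
```

Once the running total reaches 2048, the FP16 ulp becomes 2, so adding 1.0 rounds back to 2048 and the sum stops advancing entirely.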
This could cause:

* Overflow for large sums (the maximum finite FP16 value is 65504)
* Silent precision loss: above 2048, FP16 cannot represent every integer, so small addends are rounded away
Fix
Upcast to FP32 immediately after loading and accumulate in FP32:
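Continuing the NumPy sketch above (illustrative names, not the actual kernel code), the fix keeps a wide accumulator and rounds back to FP16 only once at the end, so the API is unchanged:

```python
import numpy as np

def sum_all_f16_fixed(x):
    # Upcast each FP16 element to FP32, accumulate in FP32,
    # and round to FP16 only once at the very end.
    acc = np.float32(0.0)
    for v in x:
        acc += np.float32(v)
    return np.float16(acc)

ones = np.ones(32768, dtype=np.float16)
print(sum_all_f16_fixed(ones))  # 32768.0
```

32768 is a power of two within FP16 range, so the final downcast is exact; the intermediate sums are exact in FP32 because they stay well below 2^24.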
The accumulator type is widened from `__fp16` to `double`. This matches standard ML framework behavior (PyTorch, cuBLAS, etc.).
Validation
Added stress test summing 32k FP16 ones:
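The test itself was later removed from the PR at a reviewer's request; a sketch of what such a stress test could look like (hypothetical, using the NumPy simulation rather than the C++ kernel):

```python
import numpy as np

def test_fp16_sum_of_ones():
    # 32k ones: the exact sum is 32768, which FP16 can represent,
    # but naive FP16 accumulation would stall at 2048.
    n = 32 * 1024
    x = np.ones(n, dtype=np.float16)
    acc = np.float32(0.0)
    for v in x:
        acc += np.float32(v)
    assert float(np.float16(acc)) == float(n)

test_fp16_sum_of_ones()
print("ok")
```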
Impact