Skip to content

Conversation

@lcskrishna
Copy link
Contributor

@lcskrishna lcskrishna commented Dec 29, 2023

This PR fixes the accuracy issues for hipblasLT for mm case on ROCm.
This PR is a follow up to the integration PR #114329 and #114890

The accuracy issue arises for mm usecase for ROCm where hipblasLT is enabled, and a bias has been passed which is not required. This PR addresses that issue.
Added a unit-test case for this issue (bias=None) case.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang

@pytorch-bot pytorch-bot bot added ciflow/rocm Trigger "default" config CI on ROCm module: rocm AMD GPU support for Pytorch labels Dec 29, 2023
@pytorch-bot
Copy link

pytorch-bot bot commented Dec 29, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/116537

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (8 Unrelated Failures)

As of commit ba7fafd with merge base 57491d2 (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@lcskrishna
Copy link
Contributor Author

@pytorchbot label ciflow/periodic

@pytorch-bot pytorch-bot bot added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Dec 29, 2023
@lcskrishna
Copy link
Contributor Author

@pytorchbot label ciflow/trunk

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 29, 2023
@cpuhrsch cpuhrsch added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jan 3, 2024
Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more detailed PR description and a test case in OpInfo that triggers this failure would be great.

@jeffdaily jeffdaily self-requested a review January 3, 2024 18:30
@pytorch-bot pytorch-bot bot added the release notes: linalg_frontend release notes category label Jan 9, 2024
@lezcano lezcano removed their request for review January 9, 2024 07:53
Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but see comments about some minor issues (but as tests are limited to ROCm platform I'm not requesting changes)

@jeffdaily
Copy link
Collaborator

@pytorchbot merge

@jeffdaily
Copy link
Collaborator

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 6 jobs have failed, first few of them are: rocm, linux-binary-libtorch-pre-cxx11, linux-binary-manywheel, trunk, linux-binary-libtorch-cxx11-abi

Details for Dev Infra team Raised by workflow job

@jeffdaily jeffdaily added ciflow/trunk Trigger trunk jobs on your pull request and removed ciflow/trunk Trigger trunk jobs on your pull request labels Jan 9, 2024
@facebook-github-bot
Copy link
Contributor

@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@xw285cornell
Copy link
Contributor

Any chance we can change the hipblaslt behavior to avoid cuda/hip divergence?

@jeffdaily
Copy link
Collaborator

Any chance we can change the hipblaslt behavior to avoid cuda/hip divergence?

Which part of it? The bias issue or the double type not being supported or both?

#if defined(USE_ROCM)
// This condition is needed for mm case on ROCm for hipblasLt path.
// Passing the bias ptr as null to avoid accuracy issues for mm case.
(&result != &self) ? self.const_data_ptr<scalar_t>() : nullptr,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeffdaily I meant here that uses bias==nullptr to avoid setting the attributes in computeDesc iiuc. I wonder why setting those epilog attr will end up with wrong results.

@jeffdaily
Copy link
Collaborator

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 2 jobs have failed, first few of them are: rocm, periodic

Details for Dev Infra team Raised by workflow job

@jeffdaily jeffdaily added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/rocm Trigger "default" config CI on ROCm and removed ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/rocm Trigger "default" config CI on ROCm labels Jan 10, 2024
@jeffdaily
Copy link
Collaborator

pytorchmergebot got confused. I had to remove and re-add ciflow labels. Hopefully all missing ciflows are available to mergebot now.

@jeffdaily
Copy link
Collaborator

@pytorchbot merge -f "unrelated failures"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@lezcano lezcano changed the title [ROCm] Fixes for hipblasLt for mm use case. Fix mm accuracy in ROCm for some inputs Mar 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/rocm Trigger "default" config CI on ROCm ciflow/trunk Trigger trunk jobs on your pull request Merged module: rocm AMD GPU support for Pytorch open source release notes: linalg_frontend release notes category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants