@trixirt trixirt commented Feb 24, 2024

The hipblaslt package is not available on Fedora.
Instead of requiring the package, make it optional. If it is found, define the preprocessor variable HIPBLASLT. Convert the checks for ROCM_VERSION >= 50700 to HIPBLASLT checks.

Fixes #119081

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang
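
The guard change described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual diff: it assumes the build system passes `-DHIPBLASLT` only when the hipblaslt package was found, and `kHasHipblaslt`, `gemm_backend`, and `check_guards_agree` are hypothetical names invented for the example.

```cpp
#include <cassert>
#include <string>

// Assumption: the build system passes -DHIPBLASLT only when it found the
// hipblaslt package. Nothing defines it here, so this file takes the
// fallback branch unless compiled with -DHIPBLASLT.

// Old style (shown for contrast): availability inferred from the version.
//   #if defined(ROCM_VERSION) && ROCM_VERSION >= 50700
// New style: availability stated directly by the build.
#ifdef HIPBLASLT
constexpr bool kHasHipblaslt = true;
#else
constexpr bool kHasHipblaslt = false;
#endif

// Hypothetical helper: names the GEMM path this build would take.
std::string gemm_backend() {
#ifdef HIPBLASLT
  return "hipblaslt";
#else
  return "fallback";
#endif
}

// Sanity check that both guards agree.
void check_guards_agree() {
  assert(kHasHipblaslt == (gemm_backend() == "hipblaslt"));
}
```

Keying the guard on a build-provided define rather than a version number means a new enough ROCm without the library installed still compiles, which is the situation the description reports for Fedora.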

pytorch-bot bot commented Feb 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/120551

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 2 Unrelated Failures

As of commit 8edc7b9 with merge base bab4b5a (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@janeyx99 janeyx99 requested a review from jeffdaily February 28, 2024 15:43
@janeyx99 janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Feb 28, 2024
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Feb 28, 2024
@jeffdaily
Collaborator

@xw285cornell would appreciate your review of this. I'm assuming this PR will break your internal build?

The hipblaslt package is not available on Fedora.
Instead of requiring the package, make it optional.
If it is found, define the preprocessor variable HIPBLASLT
Convert the checks for ROCM_VERSION >= 50700 to HIPBLASLT checks

Signed-off-by: Tom Rix <[email protected]>
@trixirt trixirt force-pushed the optional-hipblaslt branch from 2c029b1 to 8edc7b9 Compare March 2, 2024 12:25
@trixirt
Author

trixirt commented Mar 2, 2024

Update for a couple more hipblaslt usages that were added in main this week.

@FelixSchwarz

This PR looks pretty straightforward, and using a variable instead of "magic" version numbers seems much cleaner. It would be nice if this PR didn't linger much longer.

@jeffdaily
Collaborator

@trixirt I am in favor of this PR. My apologies for adding yet more exposure to hipblaslt APIs that you need to work around again. Please resolve conflicts likely due to #122106.

Collaborator

@jithunnair-amd jithunnair-amd left a comment

Suggest USE_HIPBLASLT instead of HIPBLASLT as name for define

@IMbackK
Contributor

IMbackK commented Jun 10, 2024

Considering that the underlying issue of this PR, #119081, is now correctly recognized as a bug, I think it would be good not to leave this lingering much longer.

@xw285cornell
Contributor

Sorry, I just saw this. Thanks @jeffdaily for pinging. This will break our internal codebase, but it should be an easy fix. I'm not objecting to the idea; if you can ping me before this PR lands, I can easily put a fix into our internal system.

@jithunnair-amd jithunnair-amd added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 1, 2024
@jithunnair-amd jithunnair-amd changed the title Optionally use hipblaslt [ROCm] Optionally use hipblaslt Jul 1, 2024
@pytorch-bot pytorch-bot bot added the module: rocm AMD GPU support for Pytorch label Jul 1, 2024
@jithunnair-amd jithunnair-amd requested a review from malfet July 1, 2024 20:13
@jithunnair-amd jithunnair-amd added the rocm priority high priority ROCm PRs from performance or other aspects label Jul 1, 2024
@jithunnair-amd
Collaborator

@trixirt Please resolve conflicts and I'll request an upstream maintainer to approve.

@jithunnair-amd
Collaborator

> Suggest USE_HIPBLASLT instead of HIPBLASLT as name for define

@trixirt Please do consider this renaming, so that the define is aligned with the current naming.

@trixirt
Author

trixirt commented Jul 1, 2024

I am working on this. It's a bit involved, and on a parallel track I am trying to get hipblaslt to build on Fedora.

Contributor

@malfet malfet left a comment

LGTM

@trixirt
Author

trixirt commented Jul 4, 2024

I have submitted the hipBLASLt package for Fedora; see the hipBLASLt package review.

A preliminary refactoring of the above change is available, refactored for 2.4.

Fedora/My preference is to use the package once it is available.

@AngryLoki
Contributor

@trixirt, could you fix the git conflict, please? I'd like to see this patch merged.

@trixirt
Author

trixirt commented Jul 24, 2024

@AngryLoki
This is not really fixing a merge conflict; the base code changed out from under the original patch.
I am more in favor of having hipblaslt packaged than having to rebase the patch.
If I refactor this patch for 2.4, can I also help you package up hipblaslt in Gentoo or your favorite distro?

@AngryLoki
Contributor

AngryLoki commented Jul 25, 2024

> If I refactor this patch for 2.4, can I also help you package up hipblaslt in Gentoo or your favorite distro?

Yes, please. For 2.3.x I will add the current patch (with the USE_HIPBLASLT CMake option), but for later versions I'd like to see it upstream.

@IMbackK
Contributor

IMbackK commented Jul 25, 2024

The problem is that this patch is not only required to support distros that don't want to package hipblaslt. hipblaslt also has a bug, stemming from a misunderstanding of how the ROCm runtime works, that subtly breaks (or crashes, depending on whether assertions in the runtime are enabled) every system with GPUs that don't support hipblaslt whenever a binary that links against it is run.

@pruthvistony pruthvistony added rocm This tag is for PRs from ROCm team and removed rocm priority high priority ROCm PRs from performance or other aspects labels Jul 29, 2024
@pruthvistony pruthvistony marked this pull request as draft July 29, 2024 21:39
@pruthvistony
Collaborator

@trixirt ,
I have moved the PR to draft since it is NOT yet ready for review, please move it out of draft when the PR is ready for review and CI is green. Thank you.

@AngryLoki
Contributor

> If I refactor this patch for 2.4, can I also help you package up hipblaslt in Gentoo or your favorite distro?

@trixirt, hi again. I tried to update this patch myself, and now I understand what you mean by "the base code changed". I think it is too difficult to do now, so I packaged pytorch-2.4.0 with hipblaslt required (but it is still possible to build hipblaslt with 0 GPUs). Feel free to close this PR if you don't need it anymore.

@trixirt
Author

trixirt commented Aug 14, 2024

> > If I refactor this patch for 2.4, can I also help you package up hipblaslt in Gentoo or your favorite distro?
>
> @trixirt, hi again. I tried to update this patch myself, and now I understand what you mean by "the base code changed". I think it is too difficult to do now, so I packaged pytorch-2.4.0 with hipblaslt required (but it is still possible to build hipblaslt with 0 GPUs). Feel free to close this PR if you don't need it anymore.

Fedora has hipblaslt now with 1 gpu (WEEEE) .. so closing

@trixirt trixirt closed this Aug 14, 2024
Labels

ciflow/rocm Trigger "default" config CI on ROCm ciflow/trunk Trigger trunk jobs on your pull request module: rocm AMD GPU support for Pytorch open source rocm This tag is for PRs from ROCm team triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Development

Successfully merging this pull request may close these issues.

ROCm loses some supported GPUs by requiring hipblaslt