-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[CI][CUDA][cuSPARSELt] cusparselt 0.6.3 and cu121 related cleanups #145696
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CI][CUDA][cuSPARSELt] cusparselt 0.6.3 and cu121 related cleanups #145696
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145696
Note: Links to docs will display an error until the docs builds have been completed. ❌ 2 New Failures, 6 Cancelled Jobs, 72 PendingAs of commit 52354fd with merge base b808774 ( NEW FAILURES - The following jobs have failed:
CANCELLED JOBS - The following jobs were cancelled. Please retry:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
I may need to get rid of lingering cu121 jobs before merging this |
| CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.5.2.1-archive" | ||
| CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.2.3-archive" | ||
| curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz | ||
| elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This matching error would have been caught a bit easier if you made sure the last else in this chain raised an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, could you please clarify? cuda matching error, or cusparselt matching error?
Where is the "matching"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh where no CUSPARSELT_NAME name is selected, and there it's blank. IE: None of the if else are matched
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think he is referring to if none of the "if condition" matched, need to raise an CUSPARLT_NAME not defined error in a separate "else" clause.
|
cc @huydhn for help reviewing test_trymerge and test_filter* changes, thanks! |
huydhn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is still one failed test https://hud.pytorch.org/pr/pytorch/pytorch/145696#36245113047, but overall LGTM! Please fix the test before landing
(pytorch#104312) to test the API. Since the old PR has cu121, I cannot remove cu121 occurences here. Some other cases, it is just testing remove suffix. So it is fine to not use cu121 but cu124, well let's just use cu126
…pytorch#145568) Will be used in inductor. Pull Request resolved: pytorch#145568 Approved by: https://github.com/malfet, https://github.com/Skylion007
…)" This reverts commit c7b2f7d. Reverted pytorch#136780 on behalf of https://github.com/izaitsevfb due to internal regression ([comment](pytorch#136780 (comment)))
…#145488) pytorch#144989 This fixes tts_angular model on torchbench for `--export-aot-inductor` I put meta function in cpp, as shape calculation requires cudnn API calls. I've extracted shape calculation to be used in implementation as this logic has some non-trivial actions and comments. ``` └─ $ python benchmarks/dynamo/torchbench.py --only tts_angular --accuracy --no-translation-validation --inference --bfloat16 --export-aot-inductor --disable-cudagraphs --device cuda loading model: 0it [00:00, ?it/s]WARNING:common:Model tts_angular does not support bfloat16, running with amp instead loading model: 0it [00:01, ?it/s] WARNING:common:Model tts_angular does not support bfloat16, running with amp instead cuda eval tts_angular WARNING:common:Model tts_angular does not support bfloat16, running with amp instead pass ``` Pull Request resolved: pytorch#145488 Approved by: https://github.com/eqy, https://github.com/zou3519
Let TORCHINDUCTOR_FX_GRAPH_CACHE=0 being respected in unit test. This is helpful if I want the compilation to happen for testing. Setting INDUCTOR_TEST_DISABLE_FRESH_CACHE to 1 is not the same, since that will cause the generated wrapper file being deleted. But we may want to check those files after running a test. Pull Request resolved: pytorch#141195 Approved by: https://github.com/masnesral, https://github.com/desertfire
Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: pytorch#145579 Approved by: https://github.com/gchanan
…torch#145572) Summary: I think skipping these tests is suboptimal. If we categorize as expected failures, then we'll see test failures when they start passing, which means they're more likely to be removed. As a skip, they quietly continue to skip. Pull Request resolved: pytorch#145572 Approved by: https://github.com/Skylion007, https://github.com/zou3519
Test Plan: CI D68155591 Pull Request resolved: pytorch#144773 Approved by: https://github.com/izaitsevfb
|
Oops, sorry for the disaster. Closing. |
…145793) Make ci cusparselt installation be consistent with nightly binary Remove cu121 related docker build jobs and inductor runs Update test failures relating to cu121 Retry of #145696 Pull Request resolved: #145793 Approved by: https://github.com/eqy, https://github.com/tinglvv
Fix inconsistency between CD (nightly binary) and CI (what ci jobs install for cu126)
Remove remaining cu121 ci/docker jobs
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @mcarilli @ptrblck @leslie-fang-intel @jgong5 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn @ekr0 @xmfan @atalman @malfet @eqy @tinglvv