
Conversation

@nWEIdia
Collaborator

@nWEIdia nWEIdia commented Jul 31, 2024

Trying to keep a record of the steps before I lose track of them.

update:
Post Merge:
Step 1: TL;DR, watch for simultaneous pytorch/test-infra AMI updates that could break things between the green signals and the final landing. See #132202 (comment)
Step 2: Since this is a CUDA 12.4.0 -> CUDA 12.4.1 bump, on the Windows side we just need to check that the driver is bumped to the version recommended for CUDA 12.4.1 (see the sketch after this list).
https://github.com/pytorch/test-infra/blob/0c3a2634aaa2f638c8f640e743f03d696ce1191f/aws/ami/windows/scripts/Installers/Install-CUDA-Tools.ps1#L33
still shows 551.61; it needs to be bumped to 551.78 according to Table 3 of https://docs.nvidia.com/cuda/archive/12.4.1/cuda-toolkit-release-notes/index.html
Step 3: Trigger inductor performance testing, e.g. https://github.com/pytorch/pytorch/actions/runs/10527627251
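
As a quick aid for Step 2, here is a minimal sketch (not part of this PR; the nvidia-smi query flags are standard, but the threshold table is just the versions quoted above):

```python
# Minimal sketch: check the installed NVIDIA driver against the minimum
# recommended for CUDA 12.4.1 (551.78 on Windows, 550.54.15 on Linux, per
# Table 3 of the CUDA 12.4.1 release notes linked above).
import platform
import subprocess

MIN_DRIVER = {"Windows": (551, 78), "Linux": (550, 54, 15)}

def installed_driver_version() -> tuple[int, ...]:
    # nvidia-smi prints one driver version per GPU; take the first one.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    return tuple(int(x) for x in out.strip().splitlines()[0].split("."))

required = MIN_DRIVER[platform.system()]
actual = installed_driver_version()
print(f"driver {actual} meets minimum {required}: {actual >= required}")
```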

cc @atalman @ptrblck @eqy @tinglvv @malfet

@nWEIdia nWEIdia requested a review from jeffdaily as a code owner July 31, 2024 01:37
@pytorch-bot

pytorch-bot bot commented Jul 31, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/132202

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 4 Unrelated Failures

As of commit 29e15e9 with merge base 8624a57:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jul 31, 2024
@nWEIdia nWEIdia added ciflow/inductor ciflow/inductor-cu124 ciflow/inductor-micro-benchmark ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR and removed ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/inductor ciflow/inductor-micro-benchmark ciflow/inductor-cu124 labels Jul 31, 2024
@nWEIdia nWEIdia requested a review from a team as a code owner July 31, 2024 09:04
@Skylion007
Collaborator

Skylion007 commented Jul 31, 2024

What's the goal of updating the minor version? Are there important perf / bug fixes? Does it update the minimum driver version or requirements? If we are doing this, we should probably also consider updating NCCL and cuDNN in a concurrent PR.

@nWEIdia
Collaborator Author

nWEIdia commented Jul 31, 2024

What's the goal of updating the minor version? Are there important perf / bug fixes? Does it update the minimum driver version or requirements? If we are doing this, we should probably also consider updating NCCL and cuDNN in a concurrent PR.

The update originally comes from the suggestion of @malfet and @ptrblck to always use the latest update of CUDA 12.4, which is CUDA 12.4.1.
At the very least it would fix the nanogpt smoke-test regression, see https://github.com/pytorch/pytorch/blob/main/.ci/pytorch/test.sh#L640

The Linux minimum driver version (550.54.15) needed by CUDA 12.4.1 was already adopted when enabling CUDA 12.4.0 [yeah, we probably should have just enabled CUDA 12.4.1 back then]. Only on the Windows side may the driver need a bump.

NCCL and cuDNN should work with CUDA 12.4.1 much as they do with CUDA 12.4.0, though we could consider updating them.
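
To confirm what a given build actually ships, here is a minimal sketch (assuming a CUDA-enabled torch wheel is installed) that prints the bundled CUDA, cuDNN, and NCCL versions:

```python
# Minimal sketch: report the CUDA / cuDNN / NCCL versions a torch build was
# compiled against, useful for comparing 12.4.0- vs 12.4.1-based wheels.
import torch

print("CUDA :", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
# NCCL is only bundled in the Linux builds.
if torch.distributed.is_available() and torch.distributed.is_nccl_available():
    print("NCCL :", torch.cuda.nccl.version())
```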

@colesbury colesbury added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jul 31, 2024
@nWEIdia nWEIdia force-pushed the enable_cuda_12.4.1 branch from aaa456e to 46c2b60 Compare July 31, 2024 21:52
@nWEIdia nWEIdia added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Jul 31, 2024
@nWEIdia
Collaborator Author

nWEIdia commented Aug 1, 2024

For the pip binary jobs:
https://ossci-raw-job-status.s3.amazonaws.com/log/28186744872 shows the error below:
pip install /final_pkgs/torch-2.5.0.dev20240731+cu124-cp310-cp310-linux_x86_64.whl --index-url https://download.pytorch.org/whl/nightly/cu124
No matching distribution found for nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == "Linux" and platform_machine == "x86_64"
This says the PyPI dependency packages have to be copied to and hosted on the PyTorch nightly index (https://download.pytorch.org/whl/nightly/cu124).
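
A minimal sketch for checking this kind of failure (assuming the nightly index exposes a PEP 503-style simple page per package; the URL below follows that layout): verify whether the pinned nvrtc version is actually listed on the cu124 index.

```python
# Minimal sketch: check whether the pinned nvidia-cuda-nvrtc-cu12 version is
# present on the cu124 nightly index that the torch wheel points at.
import urllib.request

INDEX_PAGE = "https://download.pytorch.org/whl/nightly/cu124/nvidia-cuda-nvrtc-cu12/"
PINNED = "12.4.127"

with urllib.request.urlopen(INDEX_PAGE) as resp:
    html = resp.read().decode("utf-8", errors="replace")

print(f"{PINNED} listed on index: {PINNED in html}")
```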

For the conda binary jobs:
Linking: pytorch/builder#1947

@Skylion007 Skylion007 requested a review from atalman August 1, 2024 19:08
@Skylion007
Collaborator

@atalman can help with the nightly packages.

@Skylion007
Collaborator

Also pinging @malfet. @nWEIdia Are there any older versions of CUDA we should do minor version bumps for? Like 11.8 or 12.1?

@atalman
Contributor

atalman commented Aug 5, 2024

@nWEIdia @Skylion007 I uploaded packages for 12.4 and the 12.4 split build.

@nWEIdia
Collaborator Author

nWEIdia commented Aug 5, 2024

Also pinging @malfet. @nWEIdia Are there any older versions of CUDA we should do minor version bumps for? Like 11.8 or 12.1?

https://developer.nvidia.com/cuda-toolkit-archive shows that CUDA 11.8 does not have patch versions. 12.1 does indeed have an Update 1, but fortunately upstream PyTorch already builds with 12.1.1, as indicated by https://github.com/pytorch/pytorch/blob/main/.github/scripts/generate_binary_build_matrix.py#L21
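
For illustration only, a sketch of the kind of full-version mapping that script maintains (the variable name and values here are assumptions for this sketch, not copied from the file):

```python
# Illustrative sketch: map each CUDA "arch" to the concrete toolkit patch
# release used for binary builds. Names/values are assumptions, not the
# actual contents of generate_binary_build_matrix.py.
CUDA_FULL_VERSION = {
    "11.8": "11.8.0",  # 11.8 has no patch releases
    "12.1": "12.1.1",  # Update 1 already used upstream
    "12.4": "12.4.1",  # this PR's bump from 12.4.0
}

print(CUDA_FULL_VERSION["12.4"])  # "12.4.1"
```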

@nWEIdia nWEIdia force-pushed the enable_cuda_12.4.1 branch from 46c2b60 to c18ac47 Compare August 5, 2024 21:29
@nWEIdia nWEIdia force-pushed the enable_cuda_12.4.1 branch from c18ac47 to 7753234 Compare August 6, 2024 19:47
Contributor

@atalman atalman left a comment

LGTM, please make sure the manywheel builds are green, revert the temp change, and merge.

@pytorchmergebot
Collaborator

Successfully rebased enable_cuda_12.4.1 onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout enable_cuda_12.4.1 && git pull --rebase)

@nWEIdia
Collaborator Author

nWEIdia commented Aug 15, 2024

Looks like the signal is green for the CUDA tests, but the 6 pending jobs are due to a long queue for the linux.9xlarge.ephemeral runner.

cc @atalman @malfet

@nWEIdia
Collaborator Author

nWEIdia commented Aug 15, 2024

For this kind of PR, it would be great if we could retain the test results from 653b4ac.

Update: 653b4ac is a rebase of https://hud.pytorch.org/pytorch/pytorch/commit/77532344e35f7bd8d6e4c221e1b6807e17734204
Fortunately, the pre-rebase commit and its test results are still available!

How to get to the old commit? A job link posted in this PR (e.g. https://github.com/pytorch/pytorch/actions/runs/10295868135/job/28500435981) can serve as an index for finding the old commit ID (see the sketch below).
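
A minimal sketch of that lookup (using the public GitHub Actions REST API; the run ID comes from the example job link above):

```python
# Minimal sketch: given a workflow run ID from a job link posted in the PR,
# look up the commit (head SHA) that run was triggered on.
import json
import urllib.request

RUN_ID = 10295868135  # from the example job URL above

url = f"https://api.github.com/repos/pytorch/pytorch/actions/runs/{RUN_ID}"
with urllib.request.urlopen(url) as resp:
    run = json.load(resp)

print(run["head_sha"])  # the pre-rebase commit the old test results belong to
```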

@nWEIdia
Collaborator Author

nWEIdia commented Aug 20, 2024

@pytorchbot merge -i

@nWEIdia
Collaborator Author

nWEIdia commented Aug 21, 2024

1st bad build: https://github.com/pytorch/pytorch/actions/runs/10433835042/job/28936166917

2024-08-19T09:45:31.8316859Z #15 421.6 Installing : fipscheck-1.4.1-6.el7.x86_64 2/73
2024-08-19T09:52:25.7608721Z #15 421.7 Installing : fipscheck-lib-1.4.1-6.el7.x86_64 3/73

Note the ~7-minute gap between these two consecutive package installs! (A quick way to compute it is sketched below.)

Last known good build: https://github.com/pytorch/pytorch/actions/runs/10425348037/job/28876036438
There, the installation of these two packages is instant!

Could the runner be from a different region? If not, the multiple `RUN sed -i 's/^#.*baseurl=http/baseurl=http/g' /etc/yum.repos.d/*.repo` lines seem suspicious.
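
A minimal sketch of the gap computation (timestamps copied from the log excerpt above):

```python
# Minimal sketch: compute the gap between the two consecutive yum install
# lines quoted above. The log timestamps carry 7 fractional digits plus a
# trailing 'Z', so trim to microseconds before parsing.
from datetime import datetime

def parse(ts: str) -> datetime:
    return datetime.strptime(ts[:26], "%Y-%m-%dT%H:%M:%S.%f")

t_fipscheck = parse("2024-08-19T09:45:31.8316859Z")
t_fipscheck_lib = parse("2024-08-19T09:52:25.7608721Z")
print(t_fipscheck_lib - t_fipscheck)  # ~6 min 54 s for a single package install
```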

cc @atalman @malfet @ptrblck

@atalman
Contributor

atalman commented Aug 21, 2024

@nWEIdia Looks like this may be related to an AMI update:
Last successful build: https://github.com/pytorch/pytorch/actions/runs/10424302444/job/28872834620 uses AMI Name: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

The current build takes 5 hrs and uses:
https://github.com/pytorch/pytorch/actions/runs/10463836139/job/28976474949
AMI Name: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

cc: @ZainRizvi

@ZainRizvi
Contributor

Yeah, this does seem related to the new Amazon Linux 2023 AMI upgrade. The green builds all use the Amazon Linux 2 AMI, while the red ones are on Amazon Linux 2023. @atalman is working on a mitigation PR.

@nWEIdia
Collaborator Author

nWEIdia commented Sep 10, 2024

Post-merge Step 4:
Missed step: aarch64 binary validation (the build was OK, but validation of the binary in the builder repo was not):

https://github.com/pytorch/builder/actions/runs/10776424717/job/29883178793#step:12:4025

@tinglvv
Collaborator

tinglvv commented Sep 10, 2024

The aarch64 build failure above is seen on CUDA aarch64.
Error: https://github.com/pytorch/builder/actions/runs/10776424717/job/29883178793#step:12:4025

++ python3 ./test/smoke_test/smoke_test.py --package torchonly
Traceback (most recent call last):
  File "/pytorch/builder/./test/smoke_test/smoke_test.py", line 9, in <module>
    import torch._dynamo
  File "/opt/conda/envs/conda-env-10794919545/lib/python3.9/site-packages/torch/_dynamo/__init__.py", line 3, in <module>
    from . import convert_frame, eval_frame, resume_execution
  File "/opt/conda/envs/conda-env-10794919545/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 53, in <module>
    from . import config, exc, trace_rules
  File "/opt/conda/envs/conda-env-10794919545/lib/python3.9/site-packages/torch/_dynamo/trace_rules.py", line 45, in <module>
    from .utils import getfile, hashable, NP_SUPPORTED_MODULES, unwrap_if_wrapper
ImportError: cannot import name 'NP_SUPPORTED_MODULES' from 'torch._dynamo.utils' (/opt/conda/envs/conda-env-10794919545/lib/python3.9/site-packages/torch/_dynamo/utils.py)

Will need to see if this is due to the CUDA 12.4.1 upgrade and fix it for the 2.5.0 release.
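
A minimal sketch of a quick follow-up check (not the builder repo's smoke test; whether this separates a broken wheel from a CUDA 12.4.1-specific problem is an assumption): import torch._dynamo directly in the same environment.

```python
# Minimal sketch: check whether the installed nightly wheel itself can import
# torch._dynamo, independent of smoke_test.py.
import torch

print("torch:", torch.__version__, "cuda:", torch.version.cuda)
import torch._dynamo  # reproduces the ImportError above if the wheel is broken
print("torch._dynamo import OK")
```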

cc @atalman
