add nvidia-smi to run_torchbench #83857
Seeing an OOM in #83239, this would help understand whether the issue is with the infra or with the test. RUN_TORCHBENCH: nvfuser [ghstack-poisoned]
Dr. CI: 1 new failure as of commit b60bfee (more details on the Dr. CI page). 1 failure not recognized by patterns.
This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.
@xuzhao9 do you know why torchbench is OOMing here? I tried running this on the AWS cluster on A100s, but I couldn't reproduce the OOM.
@davidberard98 The runner has 8x NVIDIA T4 GPUs (16 GB each), not A100s. Maybe that's the difference?
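To see which GPUs a runner actually exposes and how much memory is already in use before the benchmark starts, the `nvidia-smi` query interface can be read from a script. The following is a minimal sketch, not the change made in this PR: the helper names `parse_gpu_rows` and `query_gpus` are hypothetical, and it assumes `nvidia-smi` is on `PATH` (it returns an empty list otherwise).

```python
import shutil
import subprocess
from typing import Dict, List

# Fields passed to `nvidia-smi --query-gpu=...`; memory values are in MiB
# when the `nounits` format option is used.
QUERY_FIELDS = ["index", "name", "memory.total", "memory.used"]


def parse_gpu_rows(csv_text: str) -> List[Dict[str, str]]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip blank or unexpected lines
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus() -> List[Dict[str, str]]:
    """Run nvidia-smi if it is available; return [] if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(
            f"GPU {gpu['index']} ({gpu['name']}): "
            f"{gpu['memory.used']} / {gpu['memory.total']} MiB used"
        )
```

If a GPU already shows significant memory in use before the benchmark runs, that points toward the infra (for example, a leftover process) rather than the test itself.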
@pytorchbot rebase -s
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased
.github/workflows/run_torchbench.yml (outdated)
    . "${HOME}"/anaconda3/etc/profile.d/conda.sh
    conda activate pr-ci
    python3 pytorch/.github/scripts/run_torchbench.py \
    # python3 -c "import torch; torch.rand((4, 4), device='cuda')"
Looks like we can remove this? I am also okay with keeping and uncommenting it, just to make sure the hardware works.
Thanks for catching that. I will remove it. It doesn't actually work because PyTorch isn't built at this point.
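For reference, a hardware sanity check like the commented-out one could be made tolerant of the case described here, where `torch` is not yet importable. This is only a sketch under that assumption and is not part of the PR; the function name `cuda_sanity_check` and its return strings are hypothetical.

```python
import importlib.util


def cuda_sanity_check() -> str:
    """Try a tiny CUDA allocation; report 'ok', 'skipped: ...', or 'failed: ...'."""
    # Skip cleanly when PyTorch has not been built or installed yet.
    if importlib.util.find_spec("torch") is None:
        return "skipped: torch is not installed"
    try:
        import torch
    except (ImportError, OSError) as exc:
        return f"skipped: torch could not be imported ({exc})"

    if not torch.cuda.is_available():
        return "skipped: CUDA is not available"

    try:
        # Same allocation as the commented-out line in the workflow.
        torch.rand((4, 4), device="cuda")
        torch.cuda.synchronize()
    except RuntimeError as exc:
        return f"failed: {exc}"
    return "ok"


if __name__ == "__main__":
    print(cuda_sanity_check())
```

Run before the build step, this would report a skip instead of erroring; run after the build, it would exercise the GPU the way the original one-liner intended.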
@pytorchbot rebase -s
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased
@pytorchbot merge
@pytorchbot successfully started a merge job. Check the current status here.
Hey @davidberard98.
Summary: Seeing an OOM in #83239, this would help understand whether the issue is with the infra or with the test.
RUN_TORCHBENCH: nvfuser
Pull Request resolved: #83857
Approved by: https://github.com/xuzhao9
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/71d99662a0d7f8a9ad68999c9a014b71591cbb68
Reviewed By: mehtanirav
Differential Revision: D39172015
Pulled By: davidberard98
fbshipit-source-id: 208f7d8bf00937a459bb5abd5baf9461660d19c3