Parametrize test_lstm_packed #137447
Conversation
The test runs all its combinations (512) sequentially
🔗 Helpful Links 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137447
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure) As of commit c87af06 with merge base 14b4099. BROKEN TRUNK - The following job failed but was present on the merge base: 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
albanD left a comment:
Ho nice refactor, thanks!
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: slow / linux-focal-cuda12.1-py3-gcc9-slow-gradcheck / test (default, 6, 8, lf.linux.g5.4xlarge.nvidia.gpu). Details for Dev Infra team: Raised by workflow job.
@pytorchbot merge -f 'Existing slow failures'
Merge started: Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m 'Need to up few more instance to 4xlarge, revert to reland' -c weird
@pytorchbot successfully started a revert job. Check the current status here.
@huydhn your PR has been successfully reverted.
This reverts commit d5493ed. Reverted #137447 on behalf of https://github.com/huydhn due to Need to up few more instance to 4xlarge, revert to reland ([comment](#137447 (comment)))
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: slow / linux-focal-cuda12.1-py3-gcc9-slow-gradcheck / test (default, 2, 8, lf.linux.g5.4xlarge.nvidia.gpu). Details for Dev Infra team: Raised by workflow job.
@pytorchbot merge -f 'Existing slow failures'
Merge started: Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The failed test was recently moved back from slow, and it requires more RAM than is available on a 2xlarge runner. It looks OK to up the instance size to 4xlarge instead. I missed the periodic jobs in #137447. Example periodic failures: https://hud.pytorch.org/pytorch/pytorch/commit/de4c2a3b4e89d96334dc678d1c3f2ae51a6630a0 (test_cpu_repro). Pull Request resolved: #137633. Approved by: https://github.com/seemethere, https://github.com/malfet
The test runs all its combinations (512) sequentially, so it takes more than 30 minutes to finish, or times out on ASAN after one hour. Parametrizing it breaks it up, so individual tests can finish and no longer need to be marked as slow.
Also, the test seems to run OOM on a 2xlarge with a `std::bad_alloc` memory error. Maybe this would also fix that issue (pending CI testing).
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang
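For context, here is a minimal sketch of what this kind of parametrization looks like with PyTorch's test helpers. The parameter names, values, and the `_check_lstm_packed` helper below are hypothetical placeholders, not the actual code changed in this PR.

```python
# A minimal sketch of parametrizing a combination-heavy test with
# PyTorch's test utilities. Parameter names/values and the helper
# method are illustrative, not the real test_lstm_packed code.
from torch.testing._internal.common_utils import (
    TestCase,
    instantiate_parametrized_tests,
    parametrize,
    run_tests,
)


class LSTMPackedTests(TestCase):
    # Before: a single test body looping over every combination
    # (e.g. nested loops over bias/bidirectional/batch_first/...),
    # which is what made the original test slow enough to time out.
    #
    # After: stacked @parametrize decorators take the cross product of
    # the listed values, so each combination becomes its own test case
    # that finishes quickly and reports its own result.
    @parametrize("bias", [True, False])
    @parametrize("bidirectional", [True, False])
    @parametrize("batch_first", [True, False])
    def test_lstm_packed(self, bias, bidirectional, batch_first):
        self._check_lstm_packed(bias, bidirectional, batch_first)

    def _check_lstm_packed(self, bias, bidirectional, batch_first):
        # Placeholder for the actual packed-LSTM check.
        pass


# Expands the decorated test into individual methods with names along
# the lines of
# test_lstm_packed_bias_True_bidirectional_False_batch_first_True.
instantiate_parametrized_tests(LSTMPackedTests)


if __name__ == "__main__":
    run_tests()
```

Each generated test then passes, fails, or times out independently, which is what allows the original slow marker to be dropped and lets CI shard the combinations across runners.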