download test times during build to avoid race conditions #81915
Conversation
✅ Dr. CI: No failures (0 pending) as of commit 78482f9 (more details on the Dr. CI page).
janeyx99 left a comment:
Test plan? Have you verified that sharding works for all configs?
looks good though! and your ASan solution is... genius. Can't believe we didn't think of that for all these months and years.
This PR would also fix #74620
updated the PR body
```python
    with open(path, "r") as f:
        test_file_times = cast(Dict[str, Any], json.load(f))
else:
    test_file_times = {}
```
If the path doesn't exist, would it make sense to fall back to the previous behavior `test_file_times = get_test_times(str(REPO_ROOT), filename=TEST_TIMES_FILE)` here instead?
if we do that, we run into the issue of race conditions + reruns being different
Oh, what I was curious about is whether the possibility of race conditions + differing reruns might still be better than the default sharding plan, i.e. whether we knowingly used the old code for that reason before your fix. If I understand your answer correctly, it's not worth the trouble to keep the old logic here as a fallback.
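A minimal sketch of the behavior being discussed, with illustrative names (`load_test_times`, `path` are not the actual helpers in the pytorch tools): if the artifact downloaded at build time is missing, return an empty dict so every test shard falls back to the same default plan, instead of each shard independently re-fetching a possibly different version of the file.

```python
import json
import os
from typing import Any, Dict


def load_test_times(path: str) -> Dict[str, Any]:
    """Return test times from the build job's artifact, or {} if absent.

    Returning {} (rather than re-downloading from the source) keeps every
    shard on the same default sharding plan, avoiding the race where
    different shards see different versions of the file.
    """
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    return {}
```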
@pytorchbot merge

@pytorchbot successfully started a merge and created land time checks. See merge status here and land check progress here
After #81116, we started pulling test times straight from the source instead of first downloading them in the build job and having the test job take the build job's version. This can cause an issue where different shards pull different versions of the file, leading to incorrect sharding (e.g. two shards running the same test file by accident). This generally happens if the test jobs run while the test times file is being updated (unlikely, but not impossible) or if someone reruns a test job the next day.

In this PR, I return to the old method of downloading the test times file during the build job and having the test jobs pull from the build job's uploaded artifacts. If there is no test times file in the build job's artifacts, we fall back to the default sharding plan.

Notes:
* script moved to a new file to avoid needing to import torch, which would require torch to be built, which can cause issues with asan
* I got errors with asan (`ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.`), so I put the script at the beginning of the build

### Test Plan

Verified that the number of tests run in the pull and trunk workflows is similar to workflows run on master. Checked logs to see if artifacts were being used for sharding. Spot-checked a few test configs to check that their lists of selected tests didn't overlap.

Pull Request resolved: #81915
Approved by: https://github.com/huydhn
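To make the failure mode concrete, here is a simplified, hypothetical sketch of greedy time-based sharding (illustrative only, not the actual pytorch tools implementation): each shard computes the full plan and keeps only its own slice, so the plan partitions the tests correctly only when every shard runs it against the same test times file. If two shards load different versions of the file, their plans diverge and a test file can land on two shards or on none.

```python
from typing import Dict, List, Tuple


def calculate_shards(
    num_shards: int, tests: List[str], times: Dict[str, float]
) -> List[Tuple[float, List[str]]]:
    """Greedy longest-processing-time sharding sketch.

    Tests with known times are placed longest-first on the currently
    lightest shard; tests without timing data are dealt round-robin.
    """
    shards: List[Tuple[float, List[str]]] = [(0.0, []) for _ in range(num_shards)]
    timed = sorted((t for t in tests if t in times), key=lambda t: times[t], reverse=True)
    untimed = [t for t in tests if t not in times]
    for t in timed:
        # pick the shard with the smallest accumulated time
        i = min(range(num_shards), key=lambda j: shards[j][0])
        total, files = shards[i]
        shards[i] = (total + times[t], files + [t])
    for k, t in enumerate(untimed):
        total, files = shards[k % num_shards]
        shards[k % num_shards] = (total, files + [t])
    return shards
```

With identical inputs on every shard this is deterministic, so the shards form an exact partition; with an empty times dict it degrades to the round-robin default plan mentioned in the PR description.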
Merge failed due to Failed to merge; some land checks failed: pull, pull / linux-focal-py3.7-clang10-onnx / test (default, 1, 2, linux.2xlarge)

@pytorchbot merge

@pytorchbot successfully started a merge job. Check the current status here

Hey @clee2000.
Landed commit (#81915):
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/86f038dd56dab9ecec5893b60efc74e46ca19e36
Reviewed By: osalpekar
Differential Revision: D38252585
Pulled By: clee2000
fbshipit-source-id: 912b5fa0977647a79785e24613355ff0879bcacf