Conversation

@clee2000 (Contributor) commented Jul 21, 2022

After #81116, we started pulling test times straight from the source instead of first downloading them in the build job and then having the test job take the build job's version. This can cause an issue where different shards pull different versions of the file, leading to incorrect sharding (e.g., two shards running the same test file by accident). This generally happens if the test jobs run while the test times file is being updated (unlikely, but not impossible) or if someone reruns a test job the next day.

In this PR, I return to the old method of downloading the test times file during the build job and having the test jobs pull from the build job's uploaded artifacts. If there is no test times file in the build job's artifacts, we fall back to the default sharding plan.
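
Roughly, the flow looks like the sketch below; the URL, file name, and function names are illustrative placeholders, not the exact scripts touched by this PR:

```python
import json
import os
import urllib.request

TEST_TIMES_URL = "https://example.com/test-times.json"  # placeholder for the real source
TEST_TIMES_FILE = ".pytorch-test-times.json"  # assumed artifact file name


def download_test_times() -> None:
    """Build job: fetch the test times once; CI then uploads this file as a build artifact."""
    with urllib.request.urlopen(TEST_TIMES_URL) as resp:
        times = json.load(resp)
    with open(TEST_TIMES_FILE, "w") as f:
        json.dump(times, f)


def load_test_times() -> dict:
    """Test job: every shard reads the same snapshot from the downloaded build artifacts."""
    if os.path.exists(TEST_TIMES_FILE):
        with open(TEST_TIMES_FILE) as f:
            return json.load(f)
    # No test times file in the artifacts: fall back to the default sharding plan.
    return {}
```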

Notes:

  • Script moved to a new file to avoid needing to import torch, which would require torch to be built and can cause issues with ASan.
  • I got errors with ASan (`ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.`), so I put the script at the beginning of the build.

Test Plan

Verified that the number of tests run in the pull and trunk workflows is similar to that of workflows run on master. Checked logs to confirm that artifacts were being used for sharding. Spot-checked a few test configs to verify that their lists of selected tests didn't overlap.
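
For illustration, the overlap spot-check can be as simple as a small helper like this (hypothetical, not the exact script used):

```python
# Hypothetical helper to confirm two shards' selected test lists don't overlap.
def assert_shards_disjoint(shard_a: list, shard_b: list) -> None:
    overlap = set(shard_a) & set(shard_b)
    assert not overlap, f"Shards unexpectedly share tests: {sorted(overlap)}"


# Example with made-up test file names:
assert_shards_disjoint(["test_nn", "test_ops"], ["test_autograd", "test_jit"])
```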

@facebook-github-bot (Contributor) commented Jul 21, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit 78482f9 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@clee2000 clee2000 added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 21, 2022
@clee2000 clee2000 force-pushed the csl/artifacts branch 2 times, most recently from 86f4728 to e911e43 on July 26, 2022 22:29
@clee2000 clee2000 marked this pull request as ready for review July 27, 2022 19:13
@clee2000 clee2000 requested a review from a team as a code owner July 27, 2022 19:13
@janeyx99 (Contributor) left a comment

Test plan? Have you verified that sharding works for all configs?

looks good though! and your asan solution is...genius....can't believe we didn't think of that for all these months and years.

@janeyx99 (Contributor):

This PR would also fix #74620

@pytorch-bot bot commented Jul 27, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results here

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 78482f9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@clee2000 (Contributor, Author):

> Test plan? Have you verified that sharding works for all configs?

Updated the PR body.

@clee2000 clee2000 requested a review from janeyx99 July 27, 2022 21:34
if os.path.exists(path):
    with open(path, "r") as f:
        test_file_times = cast(Dict[str, Any], json.load(f))
else:
    test_file_times = {}
Contributor:

If the path doesn't exist, would it make sense to fall back to the previous behavior `test_file_times = get_test_times(str(REPO_ROOT), filename=TEST_TIMES_FILE)` here instead?

Contributor (Author):

if we do that, we run into the issue of race conditions + reruns being different

Contributor:

Oh, what I'm curious about is whether the possibility of race conditions and differing reruns might still be better than the default sharding plan; after all, we knowingly used the old code before your fix. If I read your answer correctly, it's not worth the trouble to keep the old logic here as a fallback.

@clee2000 (Contributor, Author):

@pytorchbot merge

@pytorchmergebot (Collaborator):

@pytorchbot successfully started a merge and created land time checks. See merge status here and land check progress here

pytorchmergebot pushed a commit that referenced this pull request Jul 28, 2022
Pull Request resolved: #81915
Approved by: https://github.com/huydhn
@pytorchmergebot (Collaborator):

@clee2000 (Contributor, Author):

@pytorchbot merge

@pytorchmergebot (Collaborator):

@pytorchbot successfully started a merge job. Check the current status here

@github-actions (Contributor):

Hey @clee2000.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Jul 29, 2022

Pull Request resolved: #81915
Approved by: https://github.com/huydhn

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/86f038dd56dab9ecec5893b60efc74e46ca19e36

Reviewed By: osalpekar

Differential Revision: D38252585

Pulled By: clee2000

fbshipit-source-id: 912b5fa0977647a79785e24613355ff0879bcacf
@clee2000 clee2000 deleted the csl/artifacts branch September 28, 2022 17:16

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), cla signed, Merged

6 participants