Conversation

@suo (Member) commented Jul 8, 2022

Use the nightly-published test stats to perform sharding, instead of
calculating it in every build job.

[ghstack-poisoned]
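Roughly, the idea: a nightly job publishes per-test-file timing stats, and CI shards the test suite by expected runtime from that published file instead of recomputing the stats in every build job. Below is a minimal sketch of that kind of time-based sharding; the stats URL, JSON layout, and function names are illustrative assumptions, not PyTorch's actual tooling.

```python
# Hypothetical sketch of time-based test sharding driven by published stats.
# The URL, JSON layout, and function names are assumptions for illustration.
import json
import urllib.request
from heapq import heapify, heappop, heappush

STATS_URL = "https://example-bucket.s3.amazonaws.com/test-times.json"  # placeholder


def load_test_times(url=STATS_URL):
    """Fetch the {test_file: expected_seconds} mapping published nightly."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def shard_tests(tests, times, num_shards):
    """Greedy longest-processing-time packing: assign each test, longest
    first, to whichever shard currently has the smallest total runtime."""
    heap = [(0.0, shard_idx, []) for shard_idx in range(num_shards)]
    heapify(heap)
    for test in sorted(tests, key=lambda t: times.get(t, 0.0), reverse=True):
        load, shard_idx, assigned = heappop(heap)
        assigned.append(test)
        heappush(heap, (load + times.get(test, 0.0), shard_idx, assigned))
    return [assigned for _, _, assigned in sorted(heap, key=lambda s: s[1])]


if __name__ == "__main__":
    times = {"test_ops.py": 3600.0, "test_nn.py": 1800.0, "test_jit.py": 1200.0}
    print(shard_tests(list(times), times, num_shards=2))
    # [['test_ops.py'], ['test_nn.py', 'test_jit.py']]
```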
@suo suo requested a review from a team as a code owner July 8, 2022 17:05
@facebook-github-bot (Contributor) commented Jul 8, 2022


❌ 1 New Failure

As of commit b5c388a (more details on the Dr. CI page):

  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failure does not appear to be due to upstream breakages:

See GitHub Actions build pull / linux-focal-py3.7-gcc7 / test (backwards_compat, 1, 1, linux.2xlarge) (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-07-09T07:37:10.6186977Z processing existing schema:  duration_ns(__torch__.torch.classes.profiling.InstructionStats _0) -> int _0
2022-07-09T07:37:10.6188753Z processing existing schema:  source(__torch__.torch.classes.profiling.SourceStats _0) -> __torch__.torch.classes.profiling.SourceRef _0
2022-07-09T07:37:10.6191175Z processing existing schema:  line_map(__torch__.torch.classes.profiling.SourceStats _0) -> Dict(int, __torch__.torch.classes.profiling.InstructionStats) _0
2022-07-09T07:37:10.6192506Z processing existing schema:  __init__(__torch__.torch.classes.profiling._ScriptProfile _0) -> NoneType _0
2022-07-09T07:37:10.6197018Z processing existing schema:  enable(__torch__.torch.classes.profiling._ScriptProfile _0) -> NoneType _0
2022-07-09T07:37:10.6197289Z processing existing schema:  disable(__torch__.torch.classes.profiling._ScriptProfile _0) -> NoneType _0
2022-07-09T07:37:10.6200090Z processing existing schema:  _dump_stats(__torch__.torch.classes.profiling._ScriptProfile _0) -> __torch__.torch.classes.profiling.SourceStats[] _0
2022-07-09T07:37:10.6200378Z processing existing schema:  __init__(__torch__.torch.classes.c10d.ProcessGroup _0, int _1, int _2) -> NoneType _0
2022-07-09T07:37:10.6201442Z processing existing schema:  __init__(__torch__.torch.classes.c10d.Work _0) -> NoneType _0
2022-07-09T07:37:10.6203879Z processing existing schema:  __init__(__torch__.torch.classes.dist_rpc.WorkerInfo _0, str _1, int _2) -> NoneType _0
2022-07-09T07:37:10.6204318Z The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not. 
2022-07-09T07:37:10.6204365Z 
2022-07-09T07:37:10.6204441Z Broken ops: [
2022-07-09T07:37:10.6204727Z 	__getstate__(__torch__.torch.classes.sparse.LinearPackedParamsBase _0) -> ((Tensor, Tensor?, int[]) _0)
2022-07-09T07:37:10.6205025Z 	__setstate__(__torch__.torch.classes.sparse.LinearPackedParamsBase _0, (Tensor, Tensor?, int[]) _1) -> NoneType _0
2022-07-09T07:37:10.6205087Z ]
2022-07-09T07:37:10.7424445Z ##[error]Process completed with exit code 1.
2022-07-09T07:37:10.7455086Z Prepare all required actions
2022-07-09T07:37:10.7455245Z Getting action download info
2022-07-09T07:37:10.8825065Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-07-09T07:37:10.8825140Z with:

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


suo added a commit that referenced this pull request Jul 8, 2022
Use the nightly-published test stats to perform sharding, instead of
calculating it in every build job.

ghstack-source-id: 30694e6
Pull Request resolved: #81116
suo added a commit that referenced this pull request Jul 8, 2022
Use the nightly-published test stats to perform sharding, instead of
calculating it in every build job.

ghstack-source-id: 66742d5
Pull Request resolved: #81116
@janeyx99 (Contributor) left a comment


Code looks good, but before landing, could you verify from the stats that the same tests were run as before?

A brief glance at the logs also looks good.

@suo suo added the ciflow/trunk label (Trigger trunk jobs on your pull request) Jul 11, 2022
@suo (Member, Author) commented Jul 12, 2022

[Screenshot: number of tests run]

num tests run looks fine

@malfet (Contributor) commented Jul 12, 2022

It should have been better documented, but the reason test sharding is preserved at build time is to avoid a situation where the test suite is only partially run (or some tests are run twice) as a result of network flakiness.

After you've removed this logic, it's possible that some shards would build their schedule from the latest nightly stats while others use the default ones (if S3 is unavailable for some reason).

To avoid that, please either submit a follow-up fix or restore the logic.
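To make that failure mode concrete, here is a toy illustration (not the actual CI scripts; the test names and the interleaved sharding rule are made up): if one shard orders tests by fresh nightly stats while another falls back to no stats, their slices no longer form a consistent partition of the suite.

```python
# Illustration of the hazard described above: shard 0 uses fresh nightly
# stats, shard 1 hits the fallback (e.g. S3 unreachable) and has no stats,
# so the two shards compute inconsistent slices of the same test list.

def assign(tests, times, num_shards, which_shard):
    # Order by descending expected runtime, then take every num_shards-th test.
    ordered = sorted(tests, key=lambda t: times.get(t, 0.0), reverse=True)
    return ordered[which_shard::num_shards]

tests = ["test_nn", "test_ops", "test_jit", "test_autograd"]
fresh = {"test_ops": 3600, "test_jit": 1200, "test_nn": 900, "test_autograd": 600}
fallback = {}  # no timing data available on this shard

shard0 = assign(tests, fresh, 2, 0)     # built from fresh stats
shard1 = assign(tests, fallback, 2, 1)  # built from the fallback

print(shard0)  # ['test_ops', 'test_nn']
print(shard1)  # ['test_ops', 'test_autograd']
# test_ops runs twice and test_jit is never run at all.
```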

facebook-github-bot pushed a commit that referenced this pull request Jul 12, 2022
Summary:
Use the nightly-published test stats to perform sharding, instead of
calculating it in every build job.

Pull Request resolved: #81116
Approved by: https://github.com/janeyx99

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/9f58d5d7ce7d2779f08b8fa8cbbb52ba1bf153af

Reviewed By: DanilBaibak

Differential Revision: D37782040

Pulled By: suo

fbshipit-source-id: b50e2efe8f26fbc05361d815625921571d2ae5d3
@suo (Member, Author) commented Jul 12, 2022 via email

@facebook-github-bot facebook-github-bot deleted the gh/suo/585/head branch July 15, 2022 14:18
pytorchmergebot pushed a commit that referenced this pull request Jul 28, 2022
After #81116, we started pulling test times straight from the source instead of first downloading them in the build job and having the test job take the build job's version. This can cause an issue where different shards pull different versions of the file, leading to incorrect sharding (e.g., two shards running the same test file by accident). This generally happens if the test jobs run while the test times file is being updated (unlikely, but not impossible), or if someone reruns a test job the next day.

In this PR, I return to the old method of downloading the test times file during the build job and having the test jobs pull from the build job's uploaded artifacts. If there is no test times file in the build job's artifacts, we fall back to the default sharding plan.

Notes:
* The script moved to a new file to avoid needing to import torch, which would require torch to be built and can cause issues with ASan.
* I got errors with ASan (`ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.`), so I put the script at the beginning of the build.

### Test Plan
Verified that the number of tests run in the pull and trunk workflows is similar to workflows run on master. Checked logs to see whether artifacts were being used for sharding. Spot-checked a few test configs to confirm that their lists of selected tests didn't overlap.
Pull Request resolved: #81915
Approved by: https://github.com/huydhn
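A hedged sketch of the flow that message describes, with illustrative paths and names rather than the real scripts: the build job snapshots the stats once into an artifact, and every test shard plans from that single copy, falling back to the same default plan if the snapshot is missing.

```python
# Sketch of the flow described in #81915; paths, names, and the fallback
# policy are assumptions, not the real CI scripts.
import json
import os

ARTIFACT_PATH = ".additional_ci_files/test-times.json"  # hypothetical artifact path


def build_job_snapshot(fetch_stats):
    """Run once in the build job: snapshot the published stats into the
    artifact every test shard will later download, so all shards see the
    same version (or the same empty fallback)."""
    try:
        stats = fetch_stats()
    except OSError:
        stats = {}  # stats unreachable; all shards will use the default plan
    os.makedirs(os.path.dirname(ARTIFACT_PATH), exist_ok=True)
    with open(ARTIFACT_PATH, "w") as f:
        json.dump(stats, f)


def test_job_plan(tests, num_shards, which_shard):
    """Each test shard reads the single snapshot; if it is missing or empty,
    every shard falls back to the same default round-robin plan."""
    try:
        with open(ARTIFACT_PATH) as f:
            times = json.load(f)
    except (OSError, json.JSONDecodeError):
        times = {}
    if not times:
        return tests[which_shard::num_shards]  # default plan, identical on every shard
    ordered = sorted(tests, key=lambda t: times.get(t, 0.0), reverse=True)
    return ordered[which_shard::num_shards]
```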
facebook-github-bot pushed a commit that referenced this pull request Jul 29, 2022
…81915)

Summary:
After #81116, we started pulling test times straight from the source instead of first downloading them in the build job and having the test job take the build job's version. This can cause an issue where different shards pull different versions of the file, leading to incorrect sharding (e.g., two shards running the same test file by accident). This generally happens if the test jobs run while the test times file is being updated (unlikely, but not impossible), or if someone reruns a test job the next day.

In this PR, I return to the old method of downloading the test times file during the build job and having the test jobs pull from the build job's uploaded artifacts. If there is no test times file in the build job's artifacts, we fall back to the default sharding plan.

Notes:
* The script moved to a new file to avoid needing to import torch, which would require torch to be built and can cause issues with ASan.
* I got errors with ASan (`ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.`), so I put the script at the beginning of the build.

### Test Plan
Verified that the number of tests run in the pull and trunk workflows is similar to workflows run on master. Checked logs to see whether artifacts were being used for sharding. Spot-checked a few test configs to confirm that their lists of selected tests didn't overlap.

Pull Request resolved: #81915
Approved by: https://github.com/huydhn

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/86f038dd56dab9ecec5893b60efc74e46ca19e36

Reviewed By: osalpekar

Differential Revision: D38252585

Pulled By: clee2000

fbshipit-source-id: 912b5fa0977647a79785e24613355ff0879bcacf

Labels: ciflow/trunk (Trigger trunk jobs on your pull request), cla signed
