@henrylhtsang
Contributor

@henrylhtsang henrylhtsang commented Sep 3, 2024

Summary: Add debug utilities to help diagnose a flaky test in fbcode CI.

Some context: #126545

Test Plan: ci

Differential Revision: D62005445

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

@pytorch-bot

pytorch-bot bot commented Sep 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135038

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2fa2703 with merge base 04118d8:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D62005445

@henrylhtsang
Contributor Author

@jgong5 can you take a look?

Collaborator

@jgong5 jgong5 left a comment

LGTM. It would be better to wrap it in an is_fbcode() check, though.

Comment on lines 242 to 250
Collaborator

Since this is only used for debugging an fbcode issue, can we wrap it in an is_fbcode() check?
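
A minimal sketch of what the requested guard could look like. The `is_fbcode` function below is a local stand-in written for this example (PyTorch's Inductor config exposes a helper with that name, but it is not imported here), and `collect_debug_info` is a hypothetical name, not the utility this PR adds.

```python
import platform
from typing import Callable, List


def is_fbcode() -> bool:
    """Stand-in for the real environment check (an assumption for this sketch)."""
    return False


def collect_debug_info(in_fbcode: Callable[[], bool] = is_fbcode) -> List[str]:
    """Return debug lines only when running inside fbcode; otherwise return nothing."""
    if not in_fbcode():
        # Outside fbcode the debug path is skipped entirely.
        return []
    return [
        f"system: {platform.system()}",
        f"machine: {platform.machine()}",
    ]
```

Passing the predicate as a parameter keeps the debug path easy to exercise in a test: `collect_debug_info(lambda: True)` forces the fbcode branch.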

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 6, 2024

henrylhtsang added a commit to henrylhtsang/pytorch that referenced this pull request Sep 6, 2024
…35038)

Summary:
Pull Request resolved: pytorch#135038

Add debug utils to debug a flaky test in fbcode ci.

Test Plan:
ci

```
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cpu_select_algorithm_cpu -- --exact 'caffe2/test/inductor:cpu_select_algorithm_cpu - test_linear_with_pointwise_batch_size_384_in_features_196_out_features_384_bias_False_epilogue_sigmoid_cpu_bfloat16 (caffe2.test.inductor.test_cpu_select_algorithm.TestSelectAlgorithmCPU)' --run-disabled --stress-runs 10 --record-results --print-passing-details
```

Differential Revision: D62005445

@henrylhtsang
Contributor Author

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 0 checks:

@henrylhtsang
Contributor Author

It looks like in some cases `torch.ops.mkldnn._is_mkldnn_bf16_supported()` returns False. I don't know why that would happen only occasionally, given that the runs appear to use the same hardware.

The quickest way forward might be to add that check to the `if` conditions.
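
A rough sketch of adding such a check to a condition. The function name `can_run_case` and its parameters are hypothetical; the support probe is passed in as a callable so the example runs without PyTorch. In real code the probe would presumably be `torch.ops.mkldnn._is_mkldnn_bf16_supported`.

```python
from typing import Callable


def can_run_case(dtype_name: str, bf16_supported: Callable[[], bool]) -> bool:
    """Decide whether a test case should run for the given dtype.

    Only bfloat16 cases depend on the hardware probe; every other dtype
    is allowed through without calling it.
    """
    if dtype_name != "bfloat16":
        return True
    return bf16_supported()
```

With this shape, a bfloat16 case is skipped on a machine where the probe returns False, while float32 cases are unaffected.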

@henrylhtsang
Contributor Author

Correction:
The run where `_is_mkldnn_bf16_supported()` returned False was on an Intel(R) Xeon(R) CPU E5-2680 v4, while the run where it returned True was on an Intel(R) Xeon(R) Platinum 8321HC. This may be something obvious, so @jgong5, please let me know if this is expected.

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
pytorchmergebot pushed a commit that referenced this pull request Sep 24, 2024
…136290)

Sometimes the test runs on an older CPU, e.g. an Intel(R) Xeon(R) CPU E5-2680 v4. If we inspect its `lscpu` output, the flags do not include `avx512_bf16`. That probably means bf16 is not supported on that hardware, which is why the unit test can fail. So we add the check in the code.
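
The `lscpu` inspection described above can be sketched as follows. This only illustrates reading the flags line and is not necessarily the check the commit adds. The helper name `has_cpu_flag` and both sample strings are made up for the example and are heavily abbreviated compared to real `lscpu` output.

```python
def has_cpu_flag(lscpu_output: str, flag: str) -> bool:
    """Return True if `flag` appears in the 'Flags:' line of lscpu-style text."""
    for line in lscpu_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "flags":
            # Flags are whitespace-separated; compare whole tokens so that
            # e.g. "avx512f" does not match a query for "avx512_bf16".
            return flag in value.split()
    return False


# Hypothetical, abbreviated samples (not captured from real machines).
SAMPLE_WITHOUT_BF16 = "Architecture: x86_64\nFlags: fpu sse2 avx avx2\n"
SAMPLE_WITH_BF16 = "Architecture: x86_64\nFlags: fpu sse2 avx2 avx512f avx512_bf16\n"

print(has_cpu_flag(SAMPLE_WITHOUT_BF16, "avx512_bf16"))
print(has_cpu_flag(SAMPLE_WITH_BF16, "avx512_bf16"))
```

On a real machine the input could come from `subprocess.run(["lscpu"], capture_output=True, text=True).stdout`, assuming `lscpu` is installed.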

Context: #135038

Differential Revision: D62984129

Pull Request resolved: #136290
Approved by: https://github.com/XuehaiPan, https://github.com/chenyang78
PaliC pushed a commit to PaliC/pytorch that referenced this pull request Sep 25, 2024
BoyuanFeng pushed a commit to BoyuanFeng/pytorch that referenced this pull request Sep 25, 2024