
Conversation

@XiaobingSuper
Collaborator

@XiaobingSuper XiaobingSuper commented Dec 7, 2020

Stack from ghstack:

Differential Revision: D25537188

@dr-ci

dr-ci bot commented Dec 7, 2020

💊 CI failures summary and remediations

As of commit 070f721 (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_doc_test (1/2)

Step: "Doc test" (full log | diagnosis details | 🔁 rerun)

Feb 06 14:01:55 sccache: error: couldn't connect to server
Feb 06 14:01:55 ++++ eval 'extract_trap_cmd '
Feb 06 14:01:55 +++++ extract_trap_cmd
Feb 06 14:01:55 +++++ printf '%s\n' ''
Feb 06 14:01:55 ++++ printf '%s\n' cleanup
Feb 06 14:01:55 +++ trap -- '
Feb 06 14:01:55 cleanup' EXIT
Feb 06 14:01:55 +++ [[ pytorch-linux-xenial-py3.6-gcc5.4-build != *pytorch-win-* ]]
Feb 06 14:01:55 +++ which sccache
Feb 06 14:01:55 +++ sccache --stop-server
Feb 06 14:01:55 Stopping sccache server...
Feb 06 14:01:55 sccache: error: couldn't connect to server
Feb 06 14:01:55 sccache: caused by: Connection refused (os error 111)
Feb 06 14:01:55 +++ true
Feb 06 14:01:55 +++ rm /var/lib/jenkins/sccache_error.log
Feb 06 14:01:55 +++ [[ pytorch-linux-xenial-py3.6-gcc5.4-build == *rocm* ]]
Feb 06 14:01:55 +++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log
Feb 06 14:01:55 +++ SCCACHE_IDLE_TIMEOUT=1200
Feb 06 14:01:55 +++ RUST_LOG=sccache::server=error
Feb 06 14:01:55 +++ sccache --start-server
Feb 06 14:01:55 sccache: Starting the server...
Feb 06 14:01:55 +++ sccache --zero-stats

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2 (2/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Feb 06 16:47:02 [E request_callback_no_python.cpp:653] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future
Feb 06 16:47:02 At:
Feb 06 16:47:02   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize
Feb 06 16:47:02   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize
Feb 06 16:47:02 
Feb 06 16:47:02 [E request_callback_no_python.cpp:653] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future
Feb 06 16:47:02 
Feb 06 16:47:02 At:
Feb 06 16:47:02   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize
Feb 06 16:47:02   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize
Feb 06 16:47:02 
Feb 06 16:47:02 [E request_callback_no_python.cpp:653] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future
Feb 06 16:47:02 
Feb 06 16:47:02 At:
Feb 06 16:47:02   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize
Feb 06 16:47:02   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize
Feb 06 16:47:02 
Feb 06 16:47:02 ok (1.530s)
Feb 06 16:47:04   test_return_future_remote (__main__.TensorPipeRpcTestWithSpawn) ... ok (1.530s)
Feb 06 16:47:05   test_return_local_rrefs (__main__.TensorPipeRpcTestWithSpawn) ... ok (1.532s)
Feb 06 16:47:11   test_rpc_profiling_async_function (__main__.TensorPipeRpcTestWithSpawn) ... ok (5.437s)
Feb 06 16:47:17   test_rpc_profiling_async_function_single_threaded (__main__.TensorPipeRpcTestWithSpawn) ... ok (5.939s)


@XiaobingSuper
Collaborator Author

@VitalyFedyunin, the error is caused by running on an old CPU that only supports AVX2, but OneDNN needs the CPU to support at least AVX512 for the bfloat16 path. So I want to add a global context flag, similar to torch._C.has_mkldnn, to skip the bfloat16 test cases when the CPU does not support AVX512. Is that OK on your side?

@XiaobingSuper
Collaborator Author

@VitalyFedyunin, there is another option: fall back to the fp32 path inside ideep when OneDNN bf16 is not supported on an older device. This option does not require any PyTorch code changes apart from updating ideep. If this option is acceptable, I will update ideep first. Thanks!

@jgong5
Collaborator

jgong5 commented Dec 10, 2020

There is another option: fall back to the fp32 path inside ideep when OneDNN bf16 is not supported on an older device. This option does not require any PyTorch code changes apart from updating ideep. If this option is acceptable, I will update ideep first. Thanks!

@VitalyFedyunin This option sounds better to us since it does not require an extra CPU arch check on the PyTorch side. IDEEP would support BF16 regardless of CPU arch. We will do a quick upgrade of IDEEP if it looks good to you. :-)

VitalyFedyunin
VitalyFedyunin previously approved these changes Dec 14, 2020
@VitalyFedyunin
Contributor

The current error doesn't say anything about AVX support:

        x_mkldnn = x if x.is_mkldnn else x.to_mkldnn()
        y_mkldnn = torch._C._nn.mkldnn_linear(x_mkldnn, self.weight, self.bias)
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        y = y_mkldnn if x.is_mkldnn else y_mkldnn.to_dense()

Falling back to the fp32 path will give users a false impression. It is much better to error out with a clear message when something is not supported.

For tests you don't need to introduce a new global context. Something simple like @skipIf(test_arch(), "not supported architecture") will suffice.
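
A minimal sketch of that suggestion, with a hypothetical predicate standing in for the test_arch() placeholder above (the helper name and the /proc/cpuinfo parsing are assumptions, not the final implementation):

    import unittest

    def unsupported_arch():
        # Hypothetical predicate (Linux-only): True when the CPU lacks the AVX512
        # features OneDNN needs for the bfloat16 path (avx512bw/avx512vl/avx512dq).
        try:
            with open("/proc/cpuinfo") as cpuinfo:
                flags = cpuinfo.read()
        except OSError:
            return True
        return not all(feat in flags for feat in ("avx512bw", "avx512vl", "avx512dq"))

    class TestMkldnnBF16(unittest.TestCase):
        @unittest.skipIf(unsupported_arch(), "not supported architecture")
        def test_linear_bf16(self):
            pass  # the actual bf16 test body would go here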

@jgong5
Collaborator

jgong5 commented Dec 15, 2020

Falling back to the fp32 path will give users a false impression. It is much better to error out with a clear message when something is not supported.

@VitalyFedyunin Thanks for the suggestions. So, we will also add a CPU arch check inside use_mkldnn (https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Convolution.cpp#L227) so that bf16 is not supported without AVX512. Does that make sense to you?

@VitalyFedyunin
Contributor

OK, we fixed the tests, but can we please make sure that we are going to have a clear error if someone tries it on an old CPU?

@XiaobingSuper
Collaborator Author

XiaobingSuper commented Jan 13, 2021

OK, we fixed the tests, but can we please make sure that we are going to have a clear error if someone tries it on an old CPU?

Done.

@XiaobingSuper
Collaborator Author

@VitalyFedyunin, please help review this. Thanks!

    try:
        # For the bf16 path, OneDNN requires the CPU to have Intel AVX512 with
        # avx512bw, avx512vl, and avx512dq.
        cmd = "grep avx512bw /proc/cpuinfo | grep avx512vl | grep avx512dq"
Collaborator

Did we give up on testing bf16 on Windows? If so, the tests should be marked to skip on Windows, instead of relying on obscure errors being thrown.

Collaborator Author

For Windows, the bf16 cases are skipped now; the reason is that I can't find an easy way to check the device like on Linux.
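
For reference, a hedged sketch of how the grep check from the excerpt above could be wired up, returning False on Windows so those cases are skipped there (only the grep pipeline comes from the PR; the subprocess wiring and the rest of the helper are an illustrative reconstruction):

    import subprocess
    import sys

    def has_bf16_support():
        # On Windows (and other non-Linux platforms) there is no equally easy
        # check, so the bf16 cases are simply skipped there.
        if not sys.platform.startswith("linux"):
            return False
        # For the bf16 path, OneDNN requires avx512bw, avx512vl, and avx512dq.
        cmd = "grep avx512bw /proc/cpuinfo | grep avx512vl | grep avx512dq"
        try:
            subprocess.check_output(cmd, shell=True)
            return True
        except subprocess.CalledProcessError:
            return False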

    if isinstance(m, torch.nn.Linear):
-       return MkldnnLinear(m)
+       return MkldnnLinear(m, d)
    elif isinstance(m, torch.nn.Conv1d):
Collaborator

Why is the dtype argument ignored on Conv1d, Conv3d, and BatchNorm? If they don't support it, we should error out; we cannot silently ignore it. Similarly, if a dtype argument that's generally not supported is passed (e.g. torch.double), we should error out.

Collaborator Author

The dtype argument is now added on Conv1d and Conv3d, and the code asserts if the given dtype is not supported.
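
A rough sketch of the kind of dtype validation described here, assuming a hypothetical helper; the helper name and the exact assertion message are not from the PR:

    import torch

    SUPPORTED_MKLDNN_DTYPES = (torch.float, torch.bfloat16)

    def _check_mkldnn_dtype(dtype):
        # Reject dtypes the MKLDNN wrappers do not handle (e.g. torch.double)
        # instead of silently ignoring the argument.
        assert dtype in SUPPORTED_MKLDNN_DTYPES, (
            "MKLDNN modules only support torch.float and torch.bfloat16, "
            "got {}".format(dtype))

    # Usage inside the conversion helper, e.g.:
    #   _check_mkldnn_dtype(d)
    #   return MkldnnConv1d(m, d)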

        self._test_serialization(mkldnn_conv2d, (x.to_mkldnn(),))
        self._test_tracing(mkldnn_conv2d, (x.to_mkldnn(),))

    @unittest.skipIf(not has_bf16_support(), "OneDNN bfloat16 path requires AVX512")
Collaborator

These tests should not be skipped if there's no bf16 support; instead they should catch and assert that the appropriate errors are raised.

Collaborator Author

Added test cases for the case where there's no bf16 support.
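
A sketch of what such a negative test might look like; the exception type and the way the failure is triggered here are assumptions, not the PR's final test. has_bf16_support() is the helper referenced in the excerpt above, and to_mkldnn(module, dtype) is the conversion helper shown in the diffs:

    import unittest
    import torch
    import torch.utils.mkldnn as mkldnn_utils
    # has_bf16_support() comes from the test file (see the excerpt above).

    class TestMkldnnBF16NoAVX512(unittest.TestCase):
        @unittest.skipIf(has_bf16_support(), "only meaningful on CPUs without AVX512 bf16 support")
        def test_bf16_errors_on_old_cpu(self):
            m = torch.nn.Linear(4, 4).eval()
            x = torch.randn(2, 4)
            # On CPUs without the required AVX512 features, the bf16 MKLDNN path
            # should fail loudly rather than silently fall back; the exception
            # type asserted here is an assumption.
            with self.assertRaises(RuntimeError):
                bf16_model = mkldnn_utils.to_mkldnn(m, torch.bfloat16)
                bf16_model(x.to_mkldnn())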

@XiaobingSuper XiaobingSuper requested a review from ngimel January 27, 2021 03:08
Collaborator

@ngimel ngimel left a comment

This PR would really benefit from a better description and documentation; e.g., the fact that bfloat16 in mkldnn is supported only on AVX512 processors really should be put in the description and in a Note somewhere in the code.
Similarly, I would like to understand why the bias is never converted to bfloat16. If it is by design, it should be commented upon and put in a note.
Similarly, for the BatchNorm parameters: if OneDNN wants them in fp32, that's fine, but there should be a comment about this.

-       return MkldnnConv3d(m)
+       return MkldnnConv3d(m, d)
    elif isinstance(m, torch.nn.BatchNorm2d) or isinstance(m, torch.nn.BatchNorm3d):
        return MkldnnBatchNorm(m)
Collaborator

why is dtype ignored for batchnorm? Does it expect its parameters to remain in fp32?

Collaborator Author

@XiaobingSuper XiaobingSuper Jan 28, 2021

Yes, batchnorm's parameters remain in fp32: the MKLDNN batchnorm bf16 path requires the input to be bf16 while the parameters stay fp32. For conv and linear, the weights require fp32 and the bias can be fp32 or bf16; for good accuracy, we chose fp32 for the bias.
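
A tiny eager-mode illustration of the dtype split described above (plain dense tensors, not the MKLDNN path itself): only the activation is converted to bf16, while the batchnorm parameters stay in fp32.

    import torch

    bn = torch.nn.BatchNorm2d(16)
    x = torch.randn(1, 16, 8, 8)

    # Only the activation would be converted to bf16 for the MKLDNN bf16 path;
    # the batchnorm parameters (and the conv/linear bias) remain in fp32.
    x_bf16 = x.to(torch.bfloat16)
    print(x_bf16.dtype)     # torch.bfloat16
    print(bn.weight.dtype)  # torch.float32, left untouched
    print(bn.bias.dtype)    # torch.float32, left untouched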

@XiaobingSuper
Collaborator Author

This PR would really benefit from a better description and documentation; e.g., the fact that bfloat16 in mkldnn is supported only on AVX512 processors really should be put in the description and in a Note somewhere in the code.
Similarly, I would like to understand why the bias is never converted to bfloat16. If it is by design, it should be commented upon and put in a note.
Similarly, for the BatchNorm parameters: if OneDNN wants them in fp32, that's fine, but there should be a comment about this.

Yes, I added some comments in test_mkldnn.py and torch/utils/mkldnn.py to describe them.

@XiaobingSuper XiaobingSuper requested a review from ngimel January 28, 2021 03:26
@XiaobingSuper
Collaborator Author

@ngimel, the failing test case does not seem to be related to this PR.

@facebook-github-bot
Contributor

@VitalyFedyunin merged this pull request in 324c6aa.

@facebook-github-bot facebook-github-bot deleted the gh/XiaobingSuper/3/head branch February 21, 2021 15:16
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Summary: Pull Request resolved: pytorch#48922

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D25537188

Pulled By: VitalyFedyunin

fbshipit-source-id: ab6eb1ba8cffb5ba9d00d05db8ef616628f8c932