Conversation

@rodrigoberriel (Contributor) commented Sep 21, 2021

Related to #30987 and #33628. This PR completes the following tasks:

  • Remove the use of .data in all our internal code:
    • benchmarks/
    • torch/utils/tensorboard/
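For context, the kind of change being made can be sketched as follows (a minimal illustration of the general .data-to-.detach() migration, not code from this PR):

```python
import torch

t = torch.ones(3, requires_grad=True)

# Legacy pattern: .data returns a detached view, but in-place edits to it
# are invisible to autograd and can silently produce wrong gradients.
legacy = t.data

# Preferred pattern: .detach() also returns a detached view sharing storage,
# but autograd can detect problematic in-place modifications and raise an
# error instead of computing incorrect gradients.
safe = t.detach()

assert legacy.requires_grad is False
assert safe.requires_grad is False
assert safe.data_ptr() == t.data_ptr()  # same underlying storage
```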

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @gcramer23 @albanD @gchanan

@facebook-github-bot added the oncall: distributed and cla signed labels Sep 21, 2021
@facebook-github-bot commented Sep 21, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 87ea5d3 (more details on the Dr. CI page):


  • 3/3 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-bionic-py3.8-gcc9-coverage / test (distributed, 1, 1, linux.2xlarge) (1/2)

Step: "Unknown"

2021-09-21T18:07:13.5205239Z test_udf_remote_...yUniqueId(created_on=0, local_id=0) to be created.
2021-09-21T18:06:32.7961469Z frame #15: <unknown function> + 0x48a6a (0x7f60cdb2ea6a in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
2021-09-21T18:06:32.7963085Z frame #16: <unknown function> + 0xc9039 (0x7f60cda3a039 in /opt/conda/lib/libstdc++.so.6)
2021-09-21T18:06:32.7964874Z frame #17: <unknown function> + 0x76db (0x7f60f175e6db in /lib/x86_64-linux-gnu/libpthread.so.0)
2021-09-21T18:06:32.7966479Z frame #18: clone + 0x3f (0x7f60f148771f in /lib/x86_64-linux-gnu/libc.so.6)
2021-09-21T18:06:32.7967159Z 
2021-09-21T18:06:33.2693677Z ok (3.724s)
2021-09-21T18:06:48.5199865Z   test_rpc_builtin_timeout (__main__.FaultyFaultyAgentRpcTest) ... ok (15.250s)
2021-09-21T18:06:57.7602784Z   test_rpc_script_timeout (__main__.FaultyFaultyAgentRpcTest) ... ok (9.240s)
2021-09-21T18:07:01.4857230Z   test_rref_to_here_timeout (__main__.FaultyFaultyAgentRpcTest) ... ok (3.725s)
2021-09-21T18:07:09.2179948Z   test_udf_remote_message_delay_timeout (__main__.FaultyFaultyAgentRpcTest) ... ok (7.732s)
2021-09-21T18:07:13.5205239Z   test_udf_remote_message_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTest) ... [E request_callback_no_python.cpp:559] Received error while processing request type 261: falseINTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":385, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
2021-09-21T18:07:13.5208055Z Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:385 (most recent call first):
2021-09-21T18:07:13.5210017Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x59 (0x7f1e6f2d92d9 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
2021-09-21T18:07:13.5213060Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xa3 (0x7f1e6f2afe44 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
2021-09-21T18:07:13.5215342Z frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x61 (0x7f1e6f2d66c1 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
2021-09-21T18:07:13.5217070Z frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x628 (0x7f1e788bf398 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
2021-09-21T18:07:13.5219309Z frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 0x8c (0x7f1e788a59dc in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
2021-09-21T18:07:13.5221765Z frame #5: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0xf5 (0x7f1e8928b505 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
2021-09-21T18:07:13.5224104Z frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x1f0 (0x7f1e788ac780 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
2021-09-21T18:07:13.5226409Z frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x60 (0x7f1e8928add0 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
2021-09-21T18:07:13.5227979Z frame #8: <unknown function> + 0x935b510 (0x7f1e788a1510 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)

See GitHub Actions build linux-bionic-py3.6-clang9 / test (noarch, 1, 1, linux.2xlarge) (2/2)

Step: "Test"

2021-09-21T18:07:33.6051266Z AssertionError: RuntimeError not raised
2021-09-21T18:07:33.6046228Z Traceback (most recent call last):
2021-09-21T18:07:33.6046819Z   File "/var/lib/jenkins/workspace/test/jit/test_tracer.py", line 244, in test_canonicalize_tensor_iterator
2021-09-21T18:07:33.6047781Z     self.assertTrue(str(traced.graph_for(x)).count(': int = prim::Constant') == 5)
2021-09-21T18:07:33.6048318Z AssertionError: False is not true
2021-09-21T18:07:33.6048652Z 		
2021-09-21T18:07:33.6049196Z ❌ Failure: jit.test_tracer.TestTracer.test_inplace_check
2021-09-21T18:07:33.6049572Z 
2021-09-21T18:07:33.6049889Z Traceback (most recent call last):
2021-09-21T18:07:33.6050436Z   File "/var/lib/jenkins/workspace/test/jit/test_tracer.py", line 342, in test_inplace_check
2021-09-21T18:07:33.6050888Z     ge(x)
2021-09-21T18:07:33.6051266Z AssertionError: RuntimeError not raised
2021-09-21T18:07:33.6051625Z 		
2021-09-21T18:07:33.6052343Z 🚨 ERROR: jit.test_freezing.TestMKLDNNReinplacing.test_always_alive_values
2021-09-21T18:07:33.6052879Z 
2021-09-21T18:07:33.6053187Z Traceback (most recent call last):
2021-09-21T18:07:33.6053860Z   File "/var/lib/jenkins/workspace/test/jit/test_freezing.py", line 2134, in test_always_alive_values
2021-09-21T18:07:33.6054588Z     self.checkResults(mod_eager, mod)
2021-09-21T18:07:33.6055187Z   File "/var/lib/jenkins/workspace/test/jit/test_freezing.py", line 2091, in checkResults
2021-09-21T18:07:33.6055773Z     self.assertEqual(mod1(inp), mod2(inp))
2021-09-21T18:07:33.6056524Z   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
2021-09-21T18:07:33.6057165Z     return forward_call(*input, **kwargs)

1 failure not recognized by patterns:

Job: GitHub Actions linux-xenial-py3.6-gcc5.4 / build-docs (cpp)
Step: Unknown

This comment was automatically generated by Dr. CI.

@albanD (Collaborator) left a comment

Thanks for the cleanup!

    wave_write.setsampwidth(2)
    wave_write.setframerate(sample_rate)
-   wave_write.writeframes(tensor.data)
+   wave_write.writeframes(tensor.detach())
@albanD (Collaborator) commented:

This sometimes gets a numpy array, actually, so we should not unconditionally call .detach() here. Not sure what .data does on numpy arrays, though...
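The first half of the concern is easy to reproduce: numpy.ndarray has no .detach() method, so calling it unconditionally breaks for array inputs (a standalone sketch, not the tensorboard code path):

```python
import numpy as np

arr = np.zeros(4, dtype=np.int16)

# numpy arrays do not implement .detach(); only torch.Tensor does, so an
# unconditional call raises AttributeError for array inputs.
try:
    arr.detach()
except AttributeError as exc:
    print(f"as expected: {exc}")
```

A common guard is to check `isinstance(x, torch.Tensor)` before detaching, which is the kind of normalization a conversion helper can centralize.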

@rodrigoberriel (Contributor, Author) commented Sep 21, 2021:
Indeed. I think we can just remove .detach(), because there is already a call to make_np(tensor). If that's OK, I can update. By the way, should we also update the docstrings (of both audio and video) to something like this?

img_tensor (torch.Tensor, numpy.array, or string/blobname): Image data

@albanD (Collaborator) commented:
Actually, .data is a valid attribute on numpy arrays: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.data.html
So for this one, I think we just want to rename tensor to array and keep the .data!

@rodrigoberriel (Contributor, Author) commented Sep 21, 2021:
Although .data is valid, it is not required in this case: it returns a memoryview over the array's buffer (equivalent to memoryview(x)), and wave ends up with a memoryview in both cases (see here), so the test also passes without .data. That said, I think we can proceed as you suggest: given the make_np call at the beginning, the object is guaranteed to be a numpy array at this point. I submitted the change.
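The equivalence claimed above can be checked directly. The following standalone sketch (the helper name is hypothetical, not the tensorboard code) shows that the wave module writes identical bytes whether it is handed the array itself or its .data memoryview:

```python
import io
import wave

import numpy as np

# A short fake int16 audio signal.
audio = (np.sin(np.linspace(0, 2 * np.pi, 100)) * 32767).astype("<i2")


def write_wav(frames):
    """Write frames to an in-memory mono 16-bit WAV and return its bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(44100)
        w.writeframes(frames)
    return buf.getvalue()


# .data on a numpy array is a memoryview over the same buffer, so both
# calls serialize the same bytes.
assert isinstance(audio.data, memoryview)
assert write_wav(audio) == write_wav(audio.data)
```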

@albanD (Collaborator) left a comment
Thanks for the update. Looks good now.

@facebook-github-bot commented:
@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot commented:
@albanD merged this pull request in a0dea07.

@rodrigoberriel rodrigoberriel deleted the remove-data-benchmark-tensorboard branch September 22, 2021 18:22

Labels: cla signed, Merged, oncall: distributed, open source


4 participants