@janeyx99
Contributor

@janeyx99 janeyx99 commented Jul 16, 2021

@facebook-github-bot
Contributor

facebook-github-bot commented Jul 16, 2021

💊 CI failures summary and remediations

As of commit 5846082 (more details on the Dr. CI page and at hud.pytorch.org/pr/61774):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_macos_10_13_py3_test (1/1)

Step: "Test"

Jul 16 21:56:55 test_remote_message_script_de...yUniqueId(created_on=0, local_id=0) to be created.
Jul 16 21:56:27 frame #12: std::__1::__function::__func<std::__1::__bind<torch::distributed::rpc::ProcessGroupAgent::enqueueRecv(torch::distributed::rpc::RecvWork)::$_6, torch::distributed::rpc::RecvWork>, std::__1::allocator<std::__1::__bind<torch::distributed::rpc::ProcessGroupAgent::enqueueRecv(torch::distributed::rpc::RecvWork)::$_6, torch::distributed::rpc::RecvWork> >, void ()>::operator()() + 42 (0x118049a6a in libtorch_cpu.dylib)
Jul 16 21:56:27 frame #13: c10::ThreadPool::main_loop(unsigned long) + 569 (0x1125e8369 in libc10.dylib)
Jul 16 21:56:27 frame #14: void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, c10::ThreadPool::ThreadPool(int, int, std::__1::function<void ()>)::$_0> >(void*) + 67 (0x1125e8a13 in libc10.dylib)
Jul 16 21:56:27 frame #15: _pthread_start + 148 (0x7fff696de109 in libsystem_pthread.dylib)
Jul 16 21:56:27 frame #16: thread_start + 15 (0x7fff696d9b8b in libsystem_pthread.dylib)
Jul 16 21:56:27 
Jul 16 21:56:27 ok (4.012s)
Jul 16 21:56:36   test_remote_message_dropped_pickle (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.418s)
Jul 16 21:56:44   test_remote_message_dropped_pickle_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.355s)
Jul 16 21:56:51   test_remote_message_script_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (7.235s)
Jul 16 21:56:55   test_remote_message_script_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:555] Received error while processing request type 260: falseINTERNAL ASSERT FAILED at "../torch/csrc/distributed/rpc/rref_context.cpp":390, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Jul 16 21:56:55 Exception raised from getOwnerRRef at ../torch/csrc/distributed/rpc/rref_context.cpp:390 (most recent call first):
Jul 16 21:56:55 frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 98 (0x1184d16b2 in libc10.dylib)
Jul 16 21:56:55 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 106 (0x1184cfe2a in libc10.dylib)
Jul 16 21:56:55 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 64 (0x1184d0060 in libc10.dylib)
Jul 16 21:56:55 frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 1711 (0x11d23cf6f in libtorch_cpu.dylib)
Jul 16 21:56:55 frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 86 (0x11d2277c6 in libtorch_cpu.dylib)
Jul 16 21:56:55 frame #5: torch::distributed::rpc::RequestCallbackImpl::processScriptRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 376 (0x117acdc08 in libtorch_python.dylib)
Jul 16 21:56:55 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 437 (0x11d226415 in libtorch_cpu.dylib)
Jul 16 21:56:55 frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 74 (0x117ace97a in libtorch_python.dylib)
Jul 16 21:56:55 frame #8: c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> > c10::ivalue::Future::thenAsync<torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const::$_1>(torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const::$_1, std::__1::shared_ptr<c10::Type>)::'lambda'(c10::ivalue::Future&)::operator()(c10::ivalue::Future&) + 223 (0x11d22e0df in libtorch_cpu.dylib)

This comment was automatically generated by Dr. CI.

facebook-github-bot pushed a commit that referenced this pull request Jul 16, 2021
Summary:
Move non-libtorch Linux 11.3 scheduled CI job to GHA.
Libtorch builds will be migrated here: #61774

Successful run: https://github.com/pytorch/pytorch/actions/runs/1035592487

Pull Request resolved: #61732

Reviewed By: seemethere

Differential Revision: D29735637

Pulled By: janeyx99

fbshipit-source-id: dce13370b218ae7833483fdaa00137db95e27c98
@facebook-github-bot
Contributor

@janeyx99 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@janeyx99 janeyx99 requested a review from a team July 16, 2021 23:00
@janeyx99 changed the title from "Migrate libtorch to GHA" to "Migrate linux libtorch to GHA" on Jul 16, 2021
@samestep
Contributor

so just to clarify: does this take periodic libtorch jobs and migrate them to GHA making them non-periodic? or does it only migrate jobs that were already non-periodic?

  "num_test_shards": num_test_shards,
- "exclude_test": exclude_test,
+ "is_libtorch": is_libtorch,
+ "exclude_test": is_libtorch or exclude_test,  # libtorch is build only
Contributor


hmm I guess this is fine; I was about to say that this would be the only place where we don't just pass the parameter straight through to a field of the same name, but then I realized that #60215 already did that in PyTorchWindowsWorkflow
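
For anyone reading along, here is a minimal sketch of the pattern under discussion: most constructor parameters are passed straight through to a template field of the same name, while `exclude_test` is derived from `is_libtorch`. The class name `WorkflowSketch`, the method `template_fields`, and the build environment strings are hypothetical and are not taken from the PR; only the three dictionary keys come from the diff above.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class WorkflowSketch:
    """Hypothetical stand-in for a CI workflow generator config."""

    build_environment: str
    num_test_shards: int = 1
    exclude_test: bool = False
    is_libtorch: bool = False

    def template_fields(self) -> Dict[str, Any]:
        """Return the values a workflow template would be rendered with."""
        return {
            # Passed straight through to a field of the same name.
            "build_environment": self.build_environment,
            "num_test_shards": self.num_test_shards,
            "is_libtorch": self.is_libtorch,
            # Derived, not passed through: libtorch is build only,
            # so tests are excluded whenever is_libtorch is set.
            "exclude_test": self.is_libtorch or self.exclude_test,
        }


# Illustrative names only; these are not real job names.
libtorch_job = WorkflowSketch("linux-libtorch-example", is_libtorch=True)
regular_job = WorkflowSketch("linux-example", num_test_shards=2)
```

With this shape, a caller cannot accidentally enable tests for a libtorch job, at the cost of one field whose value no longer mirrors its parameter exactly.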

Contributor

@samestep samestep left a comment


lgtm

@janeyx99
Contributor Author

> so just to clarify: does this take periodic libtorch jobs and migrate them to GHA making them non-periodic? or does it only migrate jobs that were already non-periodic?

clarified in my description, but this PR simply does the migration of libtorch jobs without changing their periodicity.

@facebook-github-bot
Contributor

@janeyx99 merged this pull request in d565b3e.

@zhouzhuojie zhouzhuojie mentioned this pull request Jul 19, 2021
facebook-github-bot pushed a commit that referenced this pull request Jul 19, 2021
…61872)

Summary:
Forward fixes merge conflict on master: https://github.com/pytorch/pytorch/runs/3106027618

for PR #61774

Pull Request resolved: #61872

Reviewed By: dzhulgakov

Differential Revision: D29775595

Pulled By: janeyx99

fbshipit-source-id: 8194dd123f166fd5f3fd1e77417e865c188f40c8