Skip to content

Conversation

@zhaojuanmao
Copy link
Contributor

@zhaojuanmao zhaojuanmao commented Oct 2, 2019

Stack from ghstack:

[test all]

Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit.

See similar issues reported pybind/pybind11#1598
and pybind/pybind11#1493.

Our local tests also caught this segment faults if py::objects are cleaned
up at program exit. The explaination is: CPython cleans up most critical
utitlies before cleaning up PythonRpcHandler singleton, so when
PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it
will crash.

The solution is to clean up py::objects earlier when Rpc agent join().
Be note that py::objects can not be cleaned up when Rpc agent is destroyed
as well, as Rpc agent is global variable and it will have same issue as
PythonRpcHandler.

close #27182

Differential Revision: D17727362

we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5.

This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5.

close #27182

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)

[ghstack-poisoned]
@pytorchbot pytorchbot added oncall: distributed Add this issue/PR to distributed oncall triage queue module: pybind Related to our Python bindings / interactions with other Python libraries module: tests Issues related to tests (not the torch.testing module) labels Oct 2, 2019
zhaojuanmao added a commit that referenced this pull request Oct 2, 2019
we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5.

This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5.

close #27182

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)

ghstack-source-id: 91230848
Pull Request resolved: #27251
// the python objects in PythonRpcHandler could be cleaned up before
// destructing ProcessGroupAgent, then ProcessGroupAgent destructor will
// dec_ref deallocated python objects and crash.
PythonRpcHandler::getInstance().cleanUp();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we just set runUDFFunction_ and loadResultFunction_ to None in the destructor of PythonRPCHandler? Does that work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we only set them to None, not free the memory, then we are assuming python objects will be cleaned up before destructing pythonRpcHandler, but this assumption is only true for python 3.5. For python 3.6, it is not the case, we may be at risk of leaking memory in python 3.6 if set them as None only in destructor of PythonRPCHandler

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, PythonRpcHandler is a singleton so it would be destructed only on program exit. As a result, we won't leak any memory since the OS will just clean up all the memory once the process dies. Although, I'm not sure if ASAN would fail due to this. Can you give it a shot?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pritamdamania87 tests passed in both py3.5 and py3.6, fork asan did not complain as well, not sure whether spawn asan will complain or not because we disabled spawn asan.

I will update diff to see how others think about it as well.

we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5.

This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5.

close #27182

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Oct 3, 2019
Pull Request resolved: #27251

we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5.

This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5.

close #27182
ghstack-source-id: 91310248

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)
// ProcessGroupAgent. Otherwise, in Python 3.5, when python program exits,
// the python objects in PythonRpcHandler could be cleaned up before
// destructing ProcessGroupAgent, then ProcessGroupAgent destructor will
// dec_ref deallocated python objects and crash.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: do we know why inc_ref was not called properly in the first place? If ProcessGroupAgent is holding a reference as well, should the reference counting be consistent with that?

SendWork(workerIds_[dst], Message({}, {}, MessageType::SHUTDOWN)));
threadPool_.waitWorkComplete();
listenerThread_.join();
// explicitly clean up python objects in PythonRpcHandler before destructing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we expect all RpcAgent implementations would use the same PythonRpcHandler, shall we add this to RpcAgent instead of an implementation of RpcAgent.

void PythonRpcHandler::cleanUp() {
AutoGIL ag;
if (!runUDFFunction_.is_none()) {
runUDFFunction_.dec_ref();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please remind me where is the corresponding inc_ref for this? Is it when we load the function in module.attr? I am a little confused of why we need to do dec_ref on our own but not inc_ref?

@zhaojuanmao
Copy link
Contributor Author

@mrshenli the theory now is: inc_ref works as expected, but python objects are deallocated before destructing pythonRpcHandler when program exits, so when pythonRpcHandler destructor tries to call dec_ref to clean up python objects, it is trying to clean up null objects and crashed.

And this only happens to python 3.5, I do not know why python objects are cleaned up before destructing pythonRpcHandler in python 3.5 only. This is not the case for python 3.6 though.

Does it make sense?

we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5.

This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5.

close #27182

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Oct 4, 2019
Pull Request resolved: #27251

we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5.

This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5.

close #27182
ghstack-source-id: 91359967

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)
@mrshenli
Copy link
Contributor

mrshenli commented Oct 4, 2019

inc_ref works as expected, but python objects are deallocated before destructing pythonRpcHandler when program exits, so when pythonRpcHandler destructor tries to call dec_ref to clean up python objects, it is trying to clean up null objects and crashed.

I am not aware of the root cause for this, but this Python deallocation behavior seems weird to me. If pythonRpcHandler is still holding a reference to the Python object, wouldn't that Python object has a non-zero ref-count? If so, I am confused why they are deallocated before pythonRpcHandler. I would expect Python walk through the reference graph to destruct things in order. Or maybe Python just don't respect the topological order just in case there are circular dependencies.

@zhaojuanmao
Copy link
Contributor Author

@mrshenli right, that is also weird to me as well. But be noted, Python 3.6 does things as we expected. but python 3.5 does not

// PythonRpcHandler singleton when program exits, all memories will be cleaned
// up by OS.
PythonRpcHandler::~PythonRpcHandler() {
if (!runUDFFunction_.is_none()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Do we need these if statements? Can't we just set this to py::none()?

we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5.

This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5.

close #27182

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Oct 7, 2019
Pull Request resolved: #27251

we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5.

This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5.

close #27182
ghstack-source-id: 91445621

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)
Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be conflicts with #27284.

we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5.

This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5.

close #27182

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Oct 8, 2019
Pull Request resolved: #27251

we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5.

This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5.

close #27182
ghstack-source-id: 91556219

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)
@pietern
Copy link
Contributor

pietern commented Oct 10, 2019

Typo in title and summary spwan -> spawn

Copy link
Contributor

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two checks were carried over from the previous diff but are no longer needed.

The explanation of why this works on Python 3.5 is counter intuitive to me.

If Python 3.5 is proving to be such a pain, can't we restrict RPC to 3.6?

std::vector<torch::Tensor>& responseTensorTable) {
AutoGIL ag;
auto pargs = py::bytes(pickledPayload.data(), pickledPayload.size());
TORCH_CHECK(!pyRunFunction_.is_none(), "pyRunFunction_ is none");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this check. This member is initialized and confirmed to not be none in the constructor.

const std::vector<torch::Tensor>& tensorTable) {
AutoGIL ag;
auto pargs = py::bytes(pickledPayload.data(), pickledPayload.size());
TORCH_CHECK(!pyLoadReturnValue_.is_none(), "pyLoadReturnValue_ is none");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idem as above.

test/run_test.py Outdated
'cuda_primary_ctx',
'dataloader',
'dist_autograd_fork',
'dist_autograd_spawn',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The distributed autograd tests directly import a number of private torch.distributed.rpc APIs. If this is not initialized, the files can't even be sourced, and cause an import error. I have posted #27612 to fix this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I was waiting to rebase on your diff

[test all] we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5.

This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5.

close #27182

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)

[ghstack-poisoned]
@zhaojuanmao zhaojuanmao changed the title fix python rpc handler exit crash in python 3.5 spwan tests fix python rpc handler exit crash Oct 14, 2019
[test all] 

Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit.

See similar issues reported pybind/pybind11#1598
and pybind/pybind11#1493.

Our local tests also caught this segment faults if py::objects are cleaned
up at program exit. The explaination is: CPython cleans up most critical
utitlies before cleaning up PythonRpcHandler singleton, so when
PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it
will crash.

The solution is to clean up py::objects earlier when Rpc agent join().
Be note that py::objects can not be cleaned up when Rpc agent is destroyed
as well, as Rpc agent is global variable and it will have same issue as
PythonRpcHandler.

close #27182

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Oct 14, 2019
Pull Request resolved: #27251

 Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit.

See similar issues reported pybind/pybind11#1598
and pybind/pybind11#1493.

Our local tests also caught this segment faults if py::objects are cleaned
up at program exit. The explaination is: CPython cleans up most critical
utitlies before cleaning up PythonRpcHandler singleton, so when
PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it
will crash.

The solution is to clean up py::objects earlier when Rpc agent join().
Be note that py::objects can not be cleaned up when Rpc agent is destroyed
as well, as Rpc agent is global variable and it will have same issue as
PythonRpcHandler.

close #27182
ghstack-source-id: 91898924

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)

// if cachedFn is not none, assign it to local function; otherwise, import
// function from the scratch and assign it to local function;
py::object getLocalFunction(const py::object& cachedFn, const char* name) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this given that the three py functions are already loaded in the constructor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the three py functions will be cleaned up in Rpc agent join(), but they could be called after Rpc agent join() in rare cases.

for most cases, we do not need to load them again, just copy cachedFn to localFn, this should be cheap as it is just copying pointer.

Does it make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but they could be called after Rpc agent join() in rare cases.

when could this happen? And if we load them again, do we need to cleanup them again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. loadPythonUDFResult() is called in future.wait(), which could be called after Rpc agent join().

it does not need to clean up them again, as it is loaded to a local variable intentionally for this rare case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Let's get this PR in to fix master, and let's discuss whether we need to support the above case. I feel that all future.wait() in user code should be resolved before calling join, and if we detect termination in join(), RequestCallbackImpl won't hit this case either (?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrshenli yes, RequestCallbackImpl will not hit this case. but I still add it for runPythonUDF() function for safety reason, because in theory functions in PythonRpcHandler could be called anytime.

Comment on lines 22 to 30
py::object localFn = py::none();
if (cachedFn.is_none()) {
localFn =
getFunction(py::module::import("torch.distributed.rpc.internal"), name);
} else {
localFn = cachedFn;
}
return localFn;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

  if (cachedFn.is_none()) {
    return getFunction(py::module::import("torch.distributed.rpc.internal"), name);
  } else {
    return cachedFn;
  }

// and https://github.com/pybind/pybind11/issues/1493
// Our local tests also caught this segment faults if py::objects are cleaned
// up at program exit. The explaination is: CPython cleans up most critical
// utitlies before cleaning up PythonRpcHandler singleton, so when
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utitlies -> utilities ?

localFn =
getFunction(py::module::import("torch.distributed.rpc.internal"), name);
} else {
localFn = cachedFn;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Does not need to be fixed in this PR) This will always make a copy of the py::object even if the cachedFn is available. Do you know how expensive it is to make a copy of py::object? It is just similar as creating a shared_ptr? Will it work if we return a const reference here to extend the lifetime of a temporary var?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I understand correctly, py::object is a pointer, so should be cheap. I would rather not to return a const reference to extend lifetime, because I do not know when the const will be destroyed, may hit the same issue as now

[test all] 

Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit.

See similar issues reported pybind/pybind11#1598
and pybind/pybind11#1493.

Our local tests also caught this segment faults if py::objects are cleaned
up at program exit. The explaination is: CPython cleans up most critical
utitlies before cleaning up PythonRpcHandler singleton, so when
PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it
will crash.

The solution is to clean up py::objects earlier when Rpc agent join().
Be note that py::objects can not be cleaned up when Rpc agent is destroyed
as well, as Rpc agent is global variable and it will have same issue as
PythonRpcHandler.

close #27182

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)

[ghstack-poisoned]
@zhaojuanmao zhaojuanmao requested a review from apaszke as a code owner October 15, 2019 20:49
zhaojuanmao added a commit that referenced this pull request Oct 15, 2019
Pull Request resolved: #27251

 Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit.

See similar issues reported pybind/pybind11#1598
and pybind/pybind11#1493.

Our local tests also caught this segment faults if py::objects are cleaned
up at program exit. The explaination is: CPython cleans up most critical
utitlies before cleaning up PythonRpcHandler singleton, so when
PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it
will crash.

The solution is to clean up py::objects earlier when Rpc agent join().
Be note that py::objects can not be cleaned up when Rpc agent is destroyed
as well, as Rpc agent is global variable and it will have same issue as
PythonRpcHandler.

close #27182
ghstack-source-id: 91961049

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)
Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest version looks good!

@zhaojuanmao
Copy link
Contributor Author

failures are not related

[test all] 

Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit.

See similar issues reported pybind/pybind11#1598
and pybind/pybind11#1493.

Our local tests also caught this segment faults if py::objects are cleaned
up at program exit. The explaination is: CPython cleans up most critical
utitlies before cleaning up PythonRpcHandler singleton, so when
PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it
will crash.

The solution is to clean up py::objects earlier when Rpc agent join().
Be note that py::objects can not be cleaned up when Rpc agent is destroyed
as well, as Rpc agent is global variable and it will have same issue as
PythonRpcHandler.

close #27182

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Oct 16, 2019
Pull Request resolved: #27251

 Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit.

See similar issues reported pybind/pybind11#1598
and pybind/pybind11#1493.

Our local tests also caught this segment faults if py::objects are cleaned
up at program exit. The explaination is: CPython cleans up most critical
utitlies before cleaning up PythonRpcHandler singleton, so when
PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it
will crash.

The solution is to clean up py::objects earlier when Rpc agent join().
Be note that py::objects can not be cleaned up when Rpc agent is destroyed
as well, as Rpc agent is global variable and it will have same issue as
PythonRpcHandler.

close #27182
ghstack-source-id: 92035069

Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 3214f13.

@facebook-github-bot facebook-github-bot deleted the gh/zhaojuanmao/9/head branch October 28, 2019 22:23
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
Summary:
Pull Request resolved: pytorch#27251

 Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit.

See similar issues reported pybind/pybind11#1598
and pybind/pybind11#1493.

Our local tests also caught this segment faults if py::objects are cleaned
up at program exit. The explaination is: CPython cleans up most critical
utitlies before cleaning up PythonRpcHandler singleton, so when
PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it
will crash.

The solution is to clean up py::objects earlier when Rpc agent join().
Be note that py::objects can not be cleaned up when Rpc agent is destroyed
as well, as Rpc agent is global variable and it will have same issue as
PythonRpcHandler.

close pytorch#27182
ghstack-source-id: 92035069

Test Plan: unit tests on python 3.6 and python 3.5

Differential Revision: D17727362

fbshipit-source-id: c254023f6a85acce35528ba756a4efabba9a519f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged module: pybind Related to our Python bindings / interactions with other Python libraries module: tests Issues related to tests (not the torch.testing module) oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants