-
Notifications
You must be signed in to change notification settings - Fork 26.3k
fix python rpc handler exit crash #27251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5. This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5. close #27182 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/) [ghstack-poisoned]
we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5. This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5. close #27182 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/) ghstack-source-id: 91230848 Pull Request resolved: #27251
| // the python objects in PythonRpcHandler could be cleaned up before | ||
| // destructing ProcessGroupAgent, then ProcessGroupAgent destructor will | ||
| // dec_ref deallocated python objects and crash. | ||
| PythonRpcHandler::getInstance().cleanUp(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we just set runUDFFunction_ and loadResultFunction_ to None in the destructor of PythonRPCHandler? Does that work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we only set them to None, not free the memory, then we are assuming python objects will be cleaned up before destructing pythonRpcHandler, but this assumption is only true for python 3.5. For python 3.6, it is not the case, we may be at risk of leaking memory in python 3.6 if set them as None only in destructor of PythonRPCHandler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, PythonRpcHandler is a singleton so it would be destructed only on program exit. As a result, we won't leak any memory since the OS will just clean up all the memory once the process dies. Although, I'm not sure if ASAN would fail due to this. Can you give it a shot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pritamdamania87 tests passed in both py3.5 and py3.6, fork asan did not complain as well, not sure whether spawn asan will complain or not because we disabled spawn asan.
I will update diff to see how others think about it as well.
we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5. This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5. close #27182 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/) [ghstack-poisoned]
Pull Request resolved: #27251 we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5. This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5. close #27182 ghstack-source-id: 91310248 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)
| // ProcessGroupAgent. Otherwise, in Python 3.5, when python program exits, | ||
| // the python objects in PythonRpcHandler could be cleaned up before | ||
| // destructing ProcessGroupAgent, then ProcessGroupAgent destructor will | ||
| // dec_ref deallocated python objects and crash. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: do we know why inc_ref was not called properly in the first place? If ProcessGroupAgent is holding a reference as well, should the reference counting be consistent with that?
| SendWork(workerIds_[dst], Message({}, {}, MessageType::SHUTDOWN))); | ||
| threadPool_.waitWorkComplete(); | ||
| listenerThread_.join(); | ||
| // explicitly clean up python objects in PythonRpcHandler before destructing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we expect all RpcAgent implementations would use the same PythonRpcHandler, shall we add this to RpcAgent instead of an implementation of RpcAgent.
| void PythonRpcHandler::cleanUp() { | ||
| AutoGIL ag; | ||
| if (!runUDFFunction_.is_none()) { | ||
| runUDFFunction_.dec_ref(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please remind me where is the corresponding inc_ref for this? Is it when we load the function in module.attr? I am a little confused of why we need to do dec_ref on our own but not inc_ref?
|
@mrshenli the theory now is: inc_ref works as expected, but python objects are deallocated before destructing pythonRpcHandler when program exits, so when pythonRpcHandler destructor tries to call dec_ref to clean up python objects, it is trying to clean up null objects and crashed. And this only happens to python 3.5, I do not know why python objects are cleaned up before destructing pythonRpcHandler in python 3.5 only. This is not the case for python 3.6 though. Does it make sense? |
we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5. This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5. close #27182 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/) [ghstack-poisoned]
Pull Request resolved: #27251 we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5. This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5. close #27182 ghstack-source-id: 91359967 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)
I am not aware of the root cause for this, but this Python deallocation behavior seems weird to me. If |
|
@mrshenli right, that is also weird to me as well. But be noted, Python 3.6 does things as we expected. but python 3.5 does not |
| // PythonRpcHandler singleton when program exits, all memories will be cleaned | ||
| // up by OS. | ||
| PythonRpcHandler::~PythonRpcHandler() { | ||
| if (!runUDFFunction_.is_none()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Do we need these if statements? Can't we just set this to py::none()?
we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5. This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5. close #27182 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/) [ghstack-poisoned]
Pull Request resolved: #27251 we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5. This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5. close #27182 ghstack-source-id: 91445621 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)
mrshenli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be conflicts with #27284.
we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5. This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5. close #27182 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/) [ghstack-poisoned]
Pull Request resolved: #27251 we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5. This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5. close #27182 ghstack-source-id: 91556219 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)
|
Typo in title and summary |
pietern
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two checks were carried over from the previous diff but are no longer needed.
The explanation of why this works on Python 3.5 is counter intuitive to me.
If Python 3.5 is proving to be such a pain, can't we restrict RPC to 3.6?
| std::vector<torch::Tensor>& responseTensorTable) { | ||
| AutoGIL ag; | ||
| auto pargs = py::bytes(pickledPayload.data(), pickledPayload.size()); | ||
| TORCH_CHECK(!pyRunFunction_.is_none(), "pyRunFunction_ is none"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for this check. This member is initialized and confirmed to not be none in the constructor.
| const std::vector<torch::Tensor>& tensorTable) { | ||
| AutoGIL ag; | ||
| auto pargs = py::bytes(pickledPayload.data(), pickledPayload.size()); | ||
| TORCH_CHECK(!pyLoadReturnValue_.is_none(), "pyLoadReturnValue_ is none"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idem as above.
test/run_test.py
Outdated
| 'cuda_primary_ctx', | ||
| 'dataloader', | ||
| 'dist_autograd_fork', | ||
| 'dist_autograd_spawn', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The distributed autograd tests directly import a number of private torch.distributed.rpc APIs. If this is not initialized, the files can't even be sourced, and cause an import error. I have posted #27612 to fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, I was waiting to rebase on your diff
[test all] we found python rpc handler exit crash in python 3.5 spwan tests, backtrace showed pythonRpcHandler destructor dec_ref() deallocated python objects, that means python objects in pythonRpcHandler have been cleaned up before destructing pythonRpcHandler while python program exits in python 3.5. This diff is to explicitly clean up python objects of pythonRpcHandler in RpcAgent.join(). so that pythonRpcHandler destructor will not have chances to call dec_ref(). we did not explicitly clean up python objects in RpcAngent destructor, it is also because python objects will be cleaned up before destructing RpcAgent when python program exit in python 3.5. close #27182 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/) [ghstack-poisoned]
[test all] Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit. See similar issues reported pybind/pybind11#1598 and pybind/pybind11#1493. Our local tests also caught this segment faults if py::objects are cleaned up at program exit. The explaination is: CPython cleans up most critical utitlies before cleaning up PythonRpcHandler singleton, so when PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it will crash. The solution is to clean up py::objects earlier when Rpc agent join(). Be note that py::objects can not be cleaned up when Rpc agent is destroyed as well, as Rpc agent is global variable and it will have same issue as PythonRpcHandler. close #27182 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/) [ghstack-poisoned]
Pull Request resolved: #27251 Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit. See similar issues reported pybind/pybind11#1598 and pybind/pybind11#1493. Our local tests also caught this segment faults if py::objects are cleaned up at program exit. The explaination is: CPython cleans up most critical utitlies before cleaning up PythonRpcHandler singleton, so when PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it will crash. The solution is to clean up py::objects earlier when Rpc agent join(). Be note that py::objects can not be cleaned up when Rpc agent is destroyed as well, as Rpc agent is global variable and it will have same issue as PythonRpcHandler. close #27182 ghstack-source-id: 91898924 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)
|
|
||
| // if cachedFn is not none, assign it to local function; otherwise, import | ||
| // function from the scratch and assign it to local function; | ||
| py::object getLocalFunction(const py::object& cachedFn, const char* name) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this given that the three py functions are already loaded in the constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because the three py functions will be cleaned up in Rpc agent join(), but they could be called after Rpc agent join() in rare cases.
for most cases, we do not need to load them again, just copy cachedFn to localFn, this should be cheap as it is just copying pointer.
Does it make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but they could be called after Rpc agent join() in rare cases.
when could this happen? And if we load them again, do we need to cleanup them again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. loadPythonUDFResult() is called in future.wait(), which could be called after Rpc agent join().
it does not need to clean up them again, as it is loaded to a local variable intentionally for this rare case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Let's get this PR in to fix master, and let's discuss whether we need to support the above case. I feel that all future.wait() in user code should be resolved before calling join, and if we detect termination in join(), RequestCallbackImpl won't hit this case either (?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrshenli yes, RequestCallbackImpl will not hit this case. but I still add it for runPythonUDF() function for safety reason, because in theory functions in PythonRpcHandler could be called anytime.
| py::object localFn = py::none(); | ||
| if (cachedFn.is_none()) { | ||
| localFn = | ||
| getFunction(py::module::import("torch.distributed.rpc.internal"), name); | ||
| } else { | ||
| localFn = cachedFn; | ||
| } | ||
| return localFn; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
if (cachedFn.is_none()) {
return getFunction(py::module::import("torch.distributed.rpc.internal"), name);
} else {
return cachedFn;
}| // and https://github.com/pybind/pybind11/issues/1493 | ||
| // Our local tests also caught this segment faults if py::objects are cleaned | ||
| // up at program exit. The explaination is: CPython cleans up most critical | ||
| // utitlies before cleaning up PythonRpcHandler singleton, so when |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utitlies -> utilities ?
| localFn = | ||
| getFunction(py::module::import("torch.distributed.rpc.internal"), name); | ||
| } else { | ||
| localFn = cachedFn; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Does not need to be fixed in this PR) This will always make a copy of the py::object even if the cachedFn is available. Do you know how expensive it is to make a copy of py::object? It is just similar as creating a shared_ptr? Will it work if we return a const reference here to extend the lifetime of a temporary var?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if I understand correctly, py::object is a pointer, so should be cheap. I would rather not to return a const reference to extend lifetime, because I do not know when the const will be destroyed, may hit the same issue as now
[test all] Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit. See similar issues reported pybind/pybind11#1598 and pybind/pybind11#1493. Our local tests also caught this segment faults if py::objects are cleaned up at program exit. The explaination is: CPython cleans up most critical utitlies before cleaning up PythonRpcHandler singleton, so when PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it will crash. The solution is to clean up py::objects earlier when Rpc agent join(). Be note that py::objects can not be cleaned up when Rpc agent is destroyed as well, as Rpc agent is global variable and it will have same issue as PythonRpcHandler. close #27182 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/) [ghstack-poisoned]
Pull Request resolved: #27251 Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit. See similar issues reported pybind/pybind11#1598 and pybind/pybind11#1493. Our local tests also caught this segment faults if py::objects are cleaned up at program exit. The explaination is: CPython cleans up most critical utitlies before cleaning up PythonRpcHandler singleton, so when PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it will crash. The solution is to clean up py::objects earlier when Rpc agent join(). Be note that py::objects can not be cleaned up when Rpc agent is destroyed as well, as Rpc agent is global variable and it will have same issue as PythonRpcHandler. close #27182 ghstack-source-id: 91961049 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)
mrshenli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest version looks good!
|
failures are not related |
[test all] Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit. See similar issues reported pybind/pybind11#1598 and pybind/pybind11#1493. Our local tests also caught this segment faults if py::objects are cleaned up at program exit. The explaination is: CPython cleans up most critical utitlies before cleaning up PythonRpcHandler singleton, so when PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it will crash. The solution is to clean up py::objects earlier when Rpc agent join(). Be note that py::objects can not be cleaned up when Rpc agent is destroyed as well, as Rpc agent is global variable and it will have same issue as PythonRpcHandler. close #27182 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/) [ghstack-poisoned]
Pull Request resolved: #27251 Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit. See similar issues reported pybind/pybind11#1598 and pybind/pybind11#1493. Our local tests also caught this segment faults if py::objects are cleaned up at program exit. The explaination is: CPython cleans up most critical utitlies before cleaning up PythonRpcHandler singleton, so when PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it will crash. The solution is to clean up py::objects earlier when Rpc agent join(). Be note that py::objects can not be cleaned up when Rpc agent is destroyed as well, as Rpc agent is global variable and it will have same issue as PythonRpcHandler. close #27182 ghstack-source-id: 92035069 Differential Revision: [D17727362](https://our.internmc.facebook.com/intern/diff/D17727362/)
|
This pull request has been merged in 3214f13. |
Summary: Pull Request resolved: pytorch#27251 Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit. See similar issues reported pybind/pybind11#1598 and pybind/pybind11#1493. Our local tests also caught this segment faults if py::objects are cleaned up at program exit. The explaination is: CPython cleans up most critical utitlies before cleaning up PythonRpcHandler singleton, so when PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it will crash. The solution is to clean up py::objects earlier when Rpc agent join(). Be note that py::objects can not be cleaned up when Rpc agent is destroyed as well, as Rpc agent is global variable and it will have same issue as PythonRpcHandler. close pytorch#27182 ghstack-source-id: 92035069 Test Plan: unit tests on python 3.6 and python 3.5 Differential Revision: D17727362 fbshipit-source-id: c254023f6a85acce35528ba756a4efabba9a519f
Stack from ghstack:
[test all]
Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit.
See similar issues reported pybind/pybind11#1598
and pybind/pybind11#1493.
Our local tests also caught this segment faults if py::objects are cleaned
up at program exit. The explaination is: CPython cleans up most critical
utitlies before cleaning up PythonRpcHandler singleton, so when
PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it
will crash.
The solution is to clean up py::objects earlier when Rpc agent join().
Be note that py::objects can not be cleaned up when Rpc agent is destroyed
as well, as Rpc agent is global variable and it will have same issue as
PythonRpcHandler.
close #27182
Differential Revision: D17727362