-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Allow to create local RRef with value #28948
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users. This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module. Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/) [ghstack-poisoned]
Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users. This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module. Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/) ghstack-source-id: 92972276 Pull Request resolved: #28948
| addForkOfOwner(rfd.rrefId_, rfd.forkId_); | ||
| // ensure that this RRef is in the owners_ list to keep it alive. | ||
| // this is needed for OwnerRRefs that were created locally. | ||
| owners_[rref->rrefId()] = rref; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we need a lock here to guard owners_.
torch/distributed/rpc/api.py
Outdated
| from torch.distributed import _start_rpc_agent | ||
| from torch.distributed import _destroy_rref_context, _cleanup_python_rpc_handler | ||
| from torch.distributed import WorkerInfo | ||
| from torch.distributed import WorkerInfo, RRef as PyRRef |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need as PyRRef here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to expose RRef to rpc.RRef -- but avoid the lint issues for RRef not being used in the file. I can alternatively add a lint ignore rule. Is this how you'd do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, either adding a # noqa: F401 or do something similar to this PR #25823
torch/csrc/distributed/rpc/init.cpp
Outdated
| "local_value", | ||
| &PyRRef::localValue, | ||
| py::call_guard<py::gil_scoped_release>()) | ||
| .def(py::init(&PyRRef::local)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this to top as this is a constructor?
| getOrCreateOwnerRRef<py::object>(const RRefId& rrefId); | ||
|
|
||
| template <typename T> | ||
| std::shared_ptr<OwnerRRef<T>> RRefContext::createOwnerRRef() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that this does not take a T argument as we can also use this to support run rpc.remote on self?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we can use this to run rpc.remote on self, but I don't follow how it would be possible to remove the T argument -- I need to pass it to the constructor of OnwerRRef.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I mean I was thinking this API should take a T&& value for now as this is how we use it currently, i.e., something like:
template <typename T>
std::shared_ptr<OwnerRRef<T>> RRefContext::createOwnerRRef(T&& value) {There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes i'm thinking of a use case where we don't set the value right away.
Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users. This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module. Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/) [ghstack-poisoned]
Pull Request resolved: #28948 Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users. This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module. ghstack-source-id: 92996021 Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)
test/rpc_test.py
Outdated
|
|
||
| @dist_init | ||
| def test_local_rref_no_fork(self): | ||
| local_rref = rpc.RRef(35) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API is a little weird. This RRef has nothing to do with rpc, yet we're creating it with rpc.RRef(). I think RRef should be a separate package since it is independent from RPC (as we can see here). If we want to allow creating local RRefs like this, RRef shouldn't be tightly coupled with RPC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's 2 things in your comment:
-
would you prefer to have simply torch.distributed.RRef ? I can do that.
-
Currently init_rpc() needs to be called before we can call RRef() constructor even locally since only OnwerRRef constructor is private and only accessible by RRefContext. I can change that too if you think is appropriate? In that case, RRefs would become known to RRefContext only at fork time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yes, I think that makes sense.
- I think this is why we probably need a higher level initialization like
init_model_parallel(I'm open to a better name) that initializes rpc, rref and autograd. The user has to call this API even if they want to only use rpc/rref/dist_autograd individually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought a bit about this -- I don't think that it makes sense to initialize RRef without initializing RPC, actually. RRef semantic requires RPC concepts such as owner() . owner() only makes sense if an RPCAgent is present.
RPC without RRef doesn't make sense either -- since RRef is a native type supported as both input and output of any RPC calls, particularly Remote.
As for 1., I'll change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't do 1. The top level torch.distributed package concerns only collectives. As you say, RPC without RRef and RRef without RPC are both useless. If that ever changes, we could consider promoting RRefs to their own package, but not to the top level torch.distributed module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: I see they're part of torch.distributed today, but that's because that's where they were defined from the very beginning. I have #27529 out (pending type checking fixes) to update that and move the pybind11 wrappers to the appropriate modules. We can discuss moving them around, but let's use a new issue for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok i'm just going to reuse the existing distributed.RRef for now. Will leave to @pietern to move it afterwards.
| py::call_guard<py::gil_scoped_release>()) | ||
| .def( | ||
| "local_value", | ||
| &PyRRef::localValue, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just realized that these are all part of the public user API and we should have good documentation for all of these. We should define all of these methods in api.py and have proper python docs. Don't have to do this as part of this PR.
| return PyRRef(std::move(rref)); | ||
| } | ||
|
|
||
| PyRRef PyRRef::local(const py::object& value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do we have a static method like this instead of having a constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constructor of PyRRef is not exposed to python. This is something that works on top of that constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we define a constructor for PyRRef that takes a py::object and use that in py::init()?
Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users. This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module. Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/) [ghstack-poisoned]
Pull Request resolved: #28948 Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users. This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module. ghstack-source-id: 93053110 Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)
Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users. This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module. Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/) [ghstack-poisoned]
Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users. This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module. Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/) [ghstack-poisoned]
Pull Request resolved: #28948 Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users. This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module. ghstack-source-id: 93218186 Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)
|
@aazzolini Please hold off on merging until #27529 has landed. I've been trying to merge it for a while and it should be any moment now... |
|
@aazzolini The PR has been merged. If you rebase you can use |
Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users. This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module. Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/) [ghstack-poisoned]
Pull Request resolved: #28948 Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users. This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module. ghstack-source-id: 93372618 Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)
mrshenli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good from my side! Thanks!
test/rpc_test.py
Outdated
| rpc.remote( | ||
| dst_worker, | ||
| add_rref_to_value, | ||
| args=(rref, 50)).to_here().wait(), 90) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is currently flaky. Can you explicitly hold the rref alive for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am landing #29396, which will revert to_here() and make it block again.
Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users. This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module. Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/) [ghstack-poisoned]
Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users. This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module. Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/) [ghstack-poisoned]
Pull Request resolved: #28948 Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users. This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module. ghstack-source-id: 93565010 Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)
|
This pull request has been merged in 93b5c9d. |
Stack from ghstack:
Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.
Differential Revision: D18241227