Skip to content

Conversation

@aazzolini
Copy link
Contributor

@aazzolini aazzolini commented Oct 31, 2019

Stack from ghstack:

Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.

Differential Revision: D18241227

Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.

Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)

[ghstack-poisoned]
aazzolini added a commit that referenced this pull request Oct 31, 2019
Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.

Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)

ghstack-source-id: 92972276
Pull Request resolved: #28948
addForkOfOwner(rfd.rrefId_, rfd.forkId_);
// ensure that this RRef is in the owners_ list to keep it alive.
// this is needed for OwnerRRefs that were created locally.
owners_[rref->rrefId()] = rref;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we need a lock here to guard owners_.

from torch.distributed import _start_rpc_agent
from torch.distributed import _destroy_rref_context, _cleanup_python_rpc_handler
from torch.distributed import WorkerInfo
from torch.distributed import WorkerInfo, RRef as PyRRef
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need as PyRRef here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to expose RRef to rpc.RRef -- but avoid the lint issues for RRef not being used in the file. I can alternatively add a lint ignore rule. Is this how you'd do it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, either adding a # noqa: F401 or do something similar to this PR #25823

"local_value",
&PyRRef::localValue,
py::call_guard<py::gil_scoped_release>())
.def(py::init(&PyRRef::local))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to top as this is a constructor?

getOrCreateOwnerRRef<py::object>(const RRefId& rrefId);

template <typename T>
std::shared_ptr<OwnerRRef<T>> RRefContext::createOwnerRRef() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that this does not take a T argument as we can also use this to support run rpc.remote on self?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we can use this to run rpc.remote on self, but I don't follow how it would be possible to remove the T argument -- I need to pass it to the constructor of OnwerRRef.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I mean I was thinking this API should take a T&& value for now as this is how we use it currently, i.e., something like:

template <typename T>
std::shared_ptr<OwnerRRef<T>> RRefContext::createOwnerRRef(T&& value) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes i'm thinking of a use case where we don't set the value right away.

Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.

Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)

[ghstack-poisoned]
aazzolini added a commit that referenced this pull request Oct 31, 2019
Pull Request resolved: #28948

Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.
ghstack-source-id: 92996021

Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)
test/rpc_test.py Outdated

@dist_init
def test_local_rref_no_fork(self):
local_rref = rpc.RRef(35)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API is a little weird. This RRef has nothing to do with rpc, yet we're creating it with rpc.RRef(). I think RRef should be a separate package since it is independent from RPC (as we can see here). If we want to allow creating local RRefs like this, RRef shouldn't be tightly coupled with RPC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's 2 things in your comment:

  1. would you prefer to have simply torch.distributed.RRef ? I can do that.

  2. Currently init_rpc() needs to be called before we can call RRef() constructor even locally since only OnwerRRef constructor is private and only accessible by RRefContext. I can change that too if you think is appropriate? In that case, RRefs would become known to RRefContext only at fork time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yes, I think that makes sense.
  2. I think this is why we probably need a higher level initialization like init_model_parallel (I'm open to a better name) that initializes rpc, rref and autograd. The user has to call this API even if they want to only use rpc/rref/dist_autograd individually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought a bit about this -- I don't think that it makes sense to initialize RRef without initializing RPC, actually. RRef semantic requires RPC concepts such as owner() . owner() only makes sense if an RPCAgent is present.

RPC without RRef doesn't make sense either -- since RRef is a native type supported as both input and output of any RPC calls, particularly Remote.

As for 1., I'll change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't do 1. The top level torch.distributed package concerns only collectives. As you say, RPC without RRef and RRef without RPC are both useless. If that ever changes, we could consider promoting RRefs to their own package, but not to the top level torch.distributed module.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I see they're part of torch.distributed today, but that's because that's where they were defined from the very beginning. I have #27529 out (pending type checking fixes) to update that and move the pybind11 wrappers to the appropriate modules. We can discuss moving them around, but let's use a new issue for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok i'm just going to reuse the existing distributed.RRef for now. Will leave to @pietern to move it afterwards.

py::call_guard<py::gil_scoped_release>())
.def(
"local_value",
&PyRRef::localValue,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realized that these are all part of the public user API and we should have good documentation for all of these. We should define all of these methods in api.py and have proper python docs. Don't have to do this as part of this PR.

return PyRRef(std::move(rref));
}

PyRRef PyRRef::local(const py::object& value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do we have a static method like this instead of having a constructor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constructor of PyRRef is not exposed to python. This is something that works on top of that constructor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we define a constructor for PyRRef that takes a py::object and use that in py::init()?

Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.

Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)

[ghstack-poisoned]
aazzolini added a commit that referenced this pull request Nov 1, 2019
Pull Request resolved: #28948

Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.
ghstack-source-id: 93053110

Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)
Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.

Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)

[ghstack-poisoned]
Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.

Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)

[ghstack-poisoned]
aazzolini added a commit that referenced this pull request Nov 4, 2019
Pull Request resolved: #28948

Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.
ghstack-source-id: 93218186

Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)
@pietern
Copy link
Contributor

pietern commented Nov 5, 2019

@aazzolini Please hold off on merging until #27529 has landed. I've been trying to merge it for a while and it should be any moment now...

@pietern
Copy link
Contributor

pietern commented Nov 5, 2019

@aazzolini The PR has been merged. If you rebase you can use rpc.RRef. See #27529 and b4df413.

Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.

Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)

[ghstack-poisoned]
aazzolini added a commit that referenced this pull request Nov 6, 2019
Pull Request resolved: #28948

Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.
ghstack-source-id: 93372618

Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)
Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from my side! Thanks!

test/rpc_test.py Outdated
rpc.remote(
dst_worker,
add_rref_to_value,
args=(rref, 50)).to_here().wait(), 90)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently flaky. Can you explicitly hold the rref alive for now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am landing #29396, which will revert to_here() and make it block again.

Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.

Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)

[ghstack-poisoned]
Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.

Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)

[ghstack-poisoned]
aazzolini added a commit that referenced this pull request Nov 8, 2019
Pull Request resolved: #28948

Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.
ghstack-source-id: 93565010

Differential Revision: [D18241227](https://our.internmc.facebook.com/intern/diff/D18241227/)
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 93b5c9d.

@facebook-github-bot facebook-github-bot deleted the gh/aazzolini/4/head branch November 15, 2019 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants