Conversation

@lukeyeager
Contributor

Allows you to use a bash script wrapper in-between launch and your
training script. e.g.
```
python -m torch.distributed.launch --nproc_per_node=8 --no_python --use_env \
    bash -c 'exec numactl --cpunodebind=$(( LOCAL_RANK / 4 )) "$@"' -- \
    python train.py ...
```
@pietern
Contributor

pietern commented Nov 6, 2019

Thanks, @lukeyeager. Looks good to me.

It's unfortunate that the tool is becoming a pile of backward compat hacks, though. I think a next version of this thing would 1) never prepend Python, and 2) always use the environment.

Contributor

@facebook-github-bot left a comment

@pietern is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@pietern added the oncall: distributed label (Add this issue/PR to distributed oncall triage queue), Nov 6, 2019
@lukeyeager requested a review from pietern, November 11, 2019 19:03
@lukeyeager
Contributor Author

lukeyeager commented Nov 15, 2019

Temporary workaround 😁

```
python -m torch.distributed.launch --nproc_per_node=8 --use_env -- -c \
    "import os, sys, subprocess; \
    ret = subprocess.run(['numactl', '--cpunodebind={}'.format(int(int(os.environ['LOCAL_RANK'])/4)), *sys.argv[1:]]); \
    sys.exit(ret.returncode)" python train.py ...
```
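
For readability, the inline `-c` one-liner above is roughly equivalent to the following standalone script (same logic, just unrolled; the `/ 4` assumes 4 local ranks per NUMA node):

```python
# Unrolled version of the inline one-liner above: run the wrapped command
# under numactl, binding to a NUMA node derived from LOCAL_RANK, and
# propagate the child's exit code.
import os
import subprocess
import sys

local_rank = int(os.environ["LOCAL_RANK"])
cpu_node = local_rank // 4  # assumes 4 local ranks (GPUs) per NUMA node
ret = subprocess.run(["numactl", "--cpunodebind={}".format(cpu_node), *sys.argv[1:]])
sys.exit(ret.returncode)
```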

@pietern Thoughts on the more serious proposals above?

@pietern
Contributor

pietern commented Nov 20, 2019

Thanks for investigating, @lukeyeager. Unfortunate that we can't have a bool option in argparse...

I like the second option best: a --no_python option and negation in code for readability.
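
(For illustration only, a minimal sketch of how such a flag could be wired up with argparse; the exact code in the PR may differ:)

```python
# Hypothetical sketch: add --no_python as a store_true flag and negate it
# where the launcher builds the per-worker command line.
import sys
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument("--no_python", default=False, action="store_true",
                    help="Skip prepending the training script with the "
                         "Python interpreter; execute it directly.")
args, script_args = parser.parse_known_args()

cmd = []
if not args.no_python:  # negation in code, as suggested above
    cmd.append(sys.executable)
cmd.extend(script_args)
# cmd is then launched once per local rank (e.g. via subprocess.Popen).
```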

Regarding a v2, we could create a parallel tool called torch.distributed.run and start from scratch. Then there won't be any BC issues and we can make it do more stuff as needed. Thoughts?

@lukeyeager
Contributor Author

> I like the second option best

Great! Done.

> Regarding a v2, we could create a parallel tool called torch.distributed.run and start from scratch. Then there won't be any BC issues and we can make it do more stuff as needed. Thoughts?

I don't really have any feelings about that. I haven't been using pytorch much so I don't have much context. I do know there's already torch.nn.parallel.DistributedDataParallel which is (was?) "better" (why?) in some circumstances than torch.distributed.launch. Adding a third thing might be even more confusing? I expect others at NVIDIA might have more well-formed thoughts. I know several people want something which plays nicely with MPI. I would hate for that to become a fourth launch option.

@pietern
Contributor

pietern commented Nov 22, 2019

> I do know there's already torch.nn.parallel.DistributedDataParallel which is (was?) "better" (why?) in some circumstances than torch.distributed.launch.

They are complementary. You might be thinking of nn.DataParallel (single process, multi GPU, not distributed) and nn.DistributedDataParallel (single process, single or multi GPU, distributed). To launch jobs that use the latter, you need either mpirun, srun, or if you DIY something, you can use torch.distributed.launch. As you see in the code, its job is really simple, and it just launches N processes with a local rank argument / environment variable.
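
(As a rough illustration of that job, and not the actual launcher code, a much-simplified single-node version might look like this:)

```python
# Simplified sketch: spawn nproc_per_node workers, each with its rank in the
# environment, then wait for them. Rendezvous variables are assumed defaults;
# train.py and the worker count are illustrative.
import os
import subprocess
import sys

nproc_per_node = 8  # e.g. one worker per GPU (illustrative assumption)
base_env = dict(os.environ,
                MASTER_ADDR="127.0.0.1", MASTER_PORT="29500",
                WORLD_SIZE=str(nproc_per_node))

workers = []
for local_rank in range(nproc_per_node):
    env = dict(base_env, RANK=str(local_rank), LOCAL_RANK=str(local_rank))
    workers.append(subprocess.Popen([sys.executable, "train.py"], env=env))

for w in workers:
    w.wait()
```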

> Adding a third thing might be even more confusing? I expect others at NVIDIA might have more well-formed thoughts. I know several people want something which plays nicely with MPI. I would hate for that to become a fourth launch option.

We could have some kind of ptrun frontend for MPI / Slurm / DIY. I know Horovod has horovodrun and does something similar with environment detection and launching a job. This is definitely an area we can improve on.

Thanks for updating the PR.

Contributor

@facebook-github-bot left a comment

@pietern is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
