Implement xm.rendezvous with XLA collective communication
#4181
We have found that `gloo` doesn't scale effectively to large pod sizes, and it's not easily possible to use `torch.distributed` in a multithreaded context such as TPU v3.

This change implements `xm.rendezvous` with XLA collective communication. `xm.rendezvous` will now call `xm.mark_step` to sync results from XLA. It also removes initialization based on `XRT_MESH_SERVICE_ADDRESS`, since host 0 is not predictable anyway.

Note that this requires calling `xm.rendezvous` from all replicas, per XLA requirements:

> Computing the result of AllReduce requires having one input from each replica, so if one replica executes an AllReduce node more times than another, then the former replica will wait forever.

This covers the vast majority of the usage of `rendezvous` in our experience.

Tested manually on a TPU v4-8 with 1 process and 4 threads to simulate a v3.
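To illustrate the idea (this is a minimal sketch, not the code in this PR), a collective-based rendezvous can be built from `xm.all_gather` plus `xm.mark_step`: each replica contributes its payload to an all-gather, and executing the graph blocks until every replica has reached the same point. The helper name `rendezvous_sketch` and the fixed `MAX_SIZE` padding are assumptions for illustration; the real implementation handles payload sizes more carefully.

```python
import torch
import torch_xla.core.xla_model as xm


def rendezvous_sketch(tag, payload=b''):
  """Sketch of a rendezvous built on XLA collectives (illustrative only)."""
  device = xm.xla_device()

  # Encode the payload as a fixed-size uint8 tensor. This sketch assumes a
  # maximum payload size; a real implementation would exchange sizes first.
  MAX_SIZE = 1024
  data = torch.zeros(MAX_SIZE, dtype=torch.uint8, device=device)
  raw = torch.tensor(list(payload), dtype=torch.uint8)
  data[:raw.numel()] = raw.to(device)
  size = torch.tensor([len(payload)], dtype=torch.int32, device=device)

  # all_gather requires participation from every replica; mark_step executes
  # the pending graph, so this call blocks until all replicas arrive here.
  gathered = xm.all_gather(data, dim=0)
  sizes = xm.all_gather(size, dim=0)
  xm.mark_step()

  # Reassemble one payload per replica, trimming the padding.
  gathered = gathered.cpu().view(-1, MAX_SIZE)
  sizes = sizes.cpu()
  return [
      bytes(gathered[i, :sizes[i]].tolist()) for i in range(sizes.numel())
  ]
```

Because the all-gather is part of the traced graph, every replica has to reach the same rendezvous call; invoking it from only one ordinal (for example, only under `if xm.get_ordinal() == 0:`) would hang, which is exactly the AllReduce requirement quoted above.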