Implement xm.rendezvous with XLA collective communication
#4181
We have found that `gloo` doesn't scale effectively to large pod sizes, and it's not easily possible to use `torch.distributed` in a multithreaded context such as TPU v3.

This change implements `xm.rendezvous` with XLA collective communication. `xm.rendezvous` will now call `xm.mark_step` to sync results from XLA. It also removes initialization based on `XRT_MESH_SERVICE_ADDRESS`, since host 0 is not predictable anyway.

Note that this requires calling `xm.rendezvous` from all replicas, per XLA requirements:

> Computing the result of AllReduce requires having one input from each replica, so if one replica executes an AllReduce node more times than another, then the former replica will wait forever.

This covers the vast majority of the usage of `rendezvous` in our experience.

Tested manually on a TPU v4-8 with 1 process and 4 threads to simulate a v3.
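To illustrate the idea (this is a minimal sketch, not the code in this PR), a collective-based rendezvous can be built from `xm.all_gather` plus `xm.mark_step`: each replica contributes its payload to an all-gather, and executing the graph blocks until every replica has reached the same point. The helper name `rendezvous_sketch` and the fixed `MAX_SIZE` padding are assumptions for illustration; the real implementation handles payload sizes more carefully.

```python
import torch
import torch_xla.core.xla_model as xm


def rendezvous_sketch(tag, payload=b''):
  """Sketch of a rendezvous built on XLA collectives (illustrative only)."""
  device = xm.xla_device()

  # Encode the payload as a fixed-size uint8 tensor. This sketch assumes a
  # maximum payload size; a real implementation would exchange sizes first.
  MAX_SIZE = 1024
  data = torch.zeros(MAX_SIZE, dtype=torch.uint8, device=device)
  raw = torch.tensor(list(payload), dtype=torch.uint8)
  data[:raw.numel()] = raw.to(device)
  size = torch.tensor([len(payload)], dtype=torch.int32, device=device)

  # all_gather requires participation from every replica; mark_step executes
  # the pending graph, so this call blocks until all replicas arrive here.
  gathered = xm.all_gather(data, dim=0)
  sizes = xm.all_gather(size, dim=0)
  xm.mark_step()

  # Reassemble one payload per replica, trimming the padding.
  gathered = gathered.cpu().view(-1, MAX_SIZE)
  sizes = sizes.cpu()
  return [
      bytes(gathered[i, :sizes[i]].tolist()) for i in range(sizes.numel())
  ]
```

Because the all-gather is part of the traced graph, every replica has to reach the same rendezvous call; invoking it from only one ordinal (for example, only under `if xm.get_ordinal() == 0:`) would hang, which is exactly the AllReduce requirement quoted above.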