Added idist.one_rank_first method #2926

Merged
vfdev-5 merged 15 commits into pytorch:master from AlexanderChaptykov:master
Apr 27, 2023

Conversation

@AlexanderChaptykov
Contributor

@AlexanderChaptykov AlexanderChaptykov commented Apr 18, 2023

Fixes #2923

Description:

  • Added idist.one_rank_first context manager
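
For context, the requested semantics can be sketched without ignite at all: rank 0 enters the guarded block immediately while every other rank waits at a barrier, and once rank 0 leaves the block a second barrier wait releases the rest. The thread-based helper below is purely illustrative (`run_one_rank_first` and `n_ranks` are names invented for this sketch, not part of the PR) and simulates ranks with threads to show the ordering:

```python
import threading
from contextlib import contextmanager

def run_one_rank_first(n_ranks, work):
    """Simulate `n_ranks` distributed processes with threads; rank 0 runs first."""
    barrier = threading.Barrier(n_ranks)
    order, lock = [], threading.Lock()

    @contextmanager
    def one_rank_first(rank):
        if rank != 0:
            barrier.wait()  # non-zero ranks block until rank 0 is done
        yield
        if rank == 0:
            barrier.wait()  # rank 0 finished: release the waiting ranks

    def worker(rank):
        with one_rank_first(rank):
            with lock:
                order.append(rank)  # record who got into the block, in order
            work(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order

order = run_one_rank_first(4, lambda rank: None)
assert order[0] == 0  # rank 0 always enters the guarded block first
```

In the real distributed setting the barrier would be a collective call (e.g. `idist.barrier()`) rather than a `threading.Barrier`, but the two-wait structure is the same.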

@github-actions github-actions bot added the module: distributed Distributed module label Apr 18, 2023
Collaborator

@vfdev-5 vfdev-5 left a comment


Thanks for the PR @AlexanderChaptykov
I left a few comments in the code to address.
Once we agree on the API, names, etc., let's also write a test and a docstring.

@vfdev-5 vfdev-5 changed the title from "RankProcessFirst" to "Added idist.one_process_first method" Apr 20, 2023
@AlexanderChaptykov
Contributor Author

@vfdev-5 PR updated

Collaborator

@vfdev-5 vfdev-5 left a comment


@AlexanderChaptykov
Contributor Author

AlexanderChaptykov commented Apr 26, 2023

@vfdev-5 looks like we pass rank=1 with world_size=1

@vfdev-5
Collaborator

vfdev-5 commented Apr 26, 2023

@vfdev-5 looks like we pass rank=1 with world_size=1

Let's use the WORLD_SIZE env var; check the tests codebase to see how it is done.

@vfdev-5
Collaborator

vfdev-5 commented Apr 26, 2023

Also, we may want to update this example:

```python
def get_dataflow(config):
    # - Get train/test datasets
    if idist.get_local_rank() > 0:
        # Ensure that only local rank 0 downloads the dataset
        # Thus each node will download a copy of the dataset
        idist.barrier()

    train_dataset, test_dataset = utils.get_train_test_datasets(config["data_path"])

    if idist.get_local_rank() == 0:
        # Ensure that only local rank 0 downloads the dataset
        idist.barrier()

    # Setup data loader also adapted to distributed config: nccl, gloo, xla-tpu
    train_loader = idist.auto_dataloader(
        train_dataset, batch_size=config["batch_size"], num_workers=config["num_workers"], shuffle=True, drop_last=True
    )
    test_loader = idist.auto_dataloader(
        test_dataset, batch_size=2 * config["batch_size"], num_workers=config["num_workers"], shuffle=False
    )
    return train_loader, test_loader
```
and others
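
With the context manager from this PR, the two-barrier dance in that example could collapse into a single `with` block. The sketch below stubs `idist` and `utils` with single-process stand-ins so it runs without a distributed launcher; in real code you would `import ignite.distributed as idist`, and the actual `one_rank_first` signature may differ from this no-op stand-in:

```python
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def _noop_one_rank_first():
    # Single-process stand-in: with one process there is nothing to wait for.
    # The real implementation would barrier non-zero ranks before the block
    # and barrier rank 0 after it.
    yield

# Stand-ins for ignite.distributed and the example's utils module
idist = SimpleNamespace(one_rank_first=_noop_one_rank_first)
utils = SimpleNamespace(
    get_train_test_datasets=lambda path: (["train"], ["test"])  # stub download
)

def get_dataflow(config):
    # Only one rank downloads the dataset; the others wait, then reuse the copy
    with idist.one_rank_first():
        train_dataset, test_dataset = utils.get_train_test_datasets(config["data_path"])
    return train_dataset, test_dataset

train_ds, test_ds = get_dataflow({"data_path": "/tmp/data"})
assert train_ds == ["train"] and test_ds == ["test"]
```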

@vfdev-5 vfdev-5 changed the title from "Added idist.one_process_first method" to "Added idist.one_rank_first method" Apr 26, 2023
Collaborator

@vfdev-5 vfdev-5 left a comment


Thanks for your work on this PR @AlexanderChaptykov
Hopefully, with a collective effort from myself and @sadra-barikbin, we can land something better.

@vfdev-5 vfdev-5 merged commit 9d38754 into pytorch:master Apr 27, 2023

Labels

module: distributed Distributed module

Development

Successfully merging this pull request may close these issues.

Provide a context manager for running distributed code on a given process first

2 participants