Added idist.one_rank_first method #2926

Merged
vfdev-5 merged 15 commits into pytorch:master from AlexanderChaptykov:master
Apr 27, 2023

Conversation

@AlexanderChaptykov
Contributor

@AlexanderChaptykov AlexanderChaptykov commented Apr 18, 2023

Fixes #2923

Description:

  • Added idist.one_rank_first context manager
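
For context, the requested semantics can be sketched without ignite at all: rank 0 enters the guarded block immediately while every other rank waits at a barrier, and once rank 0 leaves the block a second barrier wait releases the rest. The thread-based helper below is purely illustrative (`run_one_rank_first` and `n_ranks` are names invented for this sketch, not part of the PR) and simulates ranks with threads to show the ordering:

```python
import threading
from contextlib import contextmanager

def run_one_rank_first(n_ranks, work):
    """Simulate `n_ranks` distributed processes with threads; rank 0 runs first."""
    barrier = threading.Barrier(n_ranks)
    order, lock = [], threading.Lock()

    @contextmanager
    def one_rank_first(rank):
        if rank != 0:
            barrier.wait()  # non-zero ranks block until rank 0 is done
        yield
        if rank == 0:
            barrier.wait()  # rank 0 finished: release the waiting ranks

    def worker(rank):
        with one_rank_first(rank):
            with lock:
                order.append(rank)  # record who got into the block, in order
            work(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order

order = run_one_rank_first(4, lambda rank: None)
assert order[0] == 0  # rank 0 always enters the guarded block first
```

In the real distributed setting the barrier would be a collective call (e.g. `idist.barrier()`) rather than a `threading.Barrier`, but the two-wait structure is the same.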

@github-actions github-actions bot added the module: distributed Distributed module label Apr 18, 2023
Collaborator

@vfdev-5 vfdev-5 left a comment


Thanks for the PR @AlexanderChaptykov
I left a few comments in the code to address.
Once we agree on the API, names, etc., let's also write a test and a docstring.

@vfdev-5 vfdev-5 changed the title from "RankProcessFirst" to "Added idist.one_process_first method" Apr 20, 2023
@AlexanderChaptykov
Contributor Author

@vfdev-5 PR updated

Collaborator

@vfdev-5 vfdev-5 left a comment


@AlexanderChaptykov
Contributor Author

AlexanderChaptykov commented Apr 26, 2023

@vfdev-5 looks like we pass rank=1 with world_size=1

@vfdev-5
Collaborator

vfdev-5 commented Apr 26, 2023

@vfdev-5 looks like we pass rank=1 with world_size=1

Let's use the WORLD_SIZE env var; check the tests codebase to see how it is done.

@vfdev-5
Collaborator

vfdev-5 commented Apr 26, 2023

Also, we may want to update this example:

```python
def get_dataflow(config):
    # - Get train/test datasets
    if idist.get_local_rank() > 0:
        # Ensure that only local rank 0 downloads the dataset
        # Thus each node will download a copy of the dataset
        idist.barrier()

    train_dataset, test_dataset = utils.get_train_test_datasets(config["data_path"])

    if idist.get_local_rank() == 0:
        # Ensure that only local rank 0 downloads the dataset
        idist.barrier()

    # Setup data loader also adapted to distributed config: nccl, gloo, xla-tpu
    train_loader = idist.auto_dataloader(
        train_dataset, batch_size=config["batch_size"], num_workers=config["num_workers"], shuffle=True, drop_last=True
    )
    test_loader = idist.auto_dataloader(
        test_dataset, batch_size=2 * config["batch_size"], num_workers=config["num_workers"], shuffle=False
    )
    return train_loader, test_loader
```
and others
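
With the context manager from this PR, the two-barrier dance in that example could collapse into a single `with` block. The sketch below stubs `idist` and `utils` with single-process stand-ins so it runs without a distributed launcher; in real code you would `import ignite.distributed as idist`, and the actual `one_rank_first` signature may differ from this no-op stand-in:

```python
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def _noop_one_rank_first():
    # Single-process stand-in: with one process there is nothing to wait for.
    # The real implementation would barrier non-zero ranks before the block
    # and barrier rank 0 after it.
    yield

# Stand-ins for ignite.distributed and the example's utils module
idist = SimpleNamespace(one_rank_first=_noop_one_rank_first)
utils = SimpleNamespace(
    get_train_test_datasets=lambda path: (["train"], ["test"])  # stub download
)

def get_dataflow(config):
    # Only one rank downloads the dataset; the others wait, then reuse the copy
    with idist.one_rank_first():
        train_dataset, test_dataset = utils.get_train_test_datasets(config["data_path"])
    return train_dataset, test_dataset

train_ds, test_ds = get_dataflow({"data_path": "/tmp/data"})
assert train_ds == ["train"] and test_ds == ["test"]
```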

@vfdev-5 vfdev-5 changed the title from "Added idist.one_process_first method" to "Added idist.one_rank_first method" Apr 26, 2023
Collaborator

@vfdev-5 vfdev-5 left a comment


Thanks for your work on this PR @AlexanderChaptykov
Hopefully, with a collective effort from myself and @sadra-barikbin, we can land something better.

@vfdev-5 vfdev-5 merged commit 9d38754 into pytorch:master Apr 27, 2023

Labels

module: distributed Distributed module

Development

Successfully merging this pull request may close these issues.

Provide a context manager for running distributed code on a given process first

2 participants