[dtensor][random] add 1d and 2d model meta init tests #141731

XilunWu · 2024-11-27T23:28:15Z

Stack from ghstack (oldest at bottom):

[dtensor][random] use torch.uint64 as the seed/offset tensor dtype to avoid overflow #141532
[dtensor][random] allow user to manual_seed different seed on device mesh; only sync RNG state in WORLD when manual_seed has not been called #141223
[dtensor][random][tp] remove the adhoc DTensor RNG tracker TensorParallelRNGTracker since it does not match FSDP2+TP #141220
-> [dtensor][random] add 1d and 2d model meta init tests #141731

Summary
Added tests for model meta init on a 1-d mesh (TP) and a 2-d mesh (FSDP+TP). These tests expose the issue where DTensor RNG failed to initialize weights differently across FSDP ranks.

Test
pytest test/distributed/_tensor/test_random_ops.py -s -k meta_init

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

[ghstack-poisoned]

pytorch-bot · 2024-11-27T23:28:19Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141731

✅ No Failures

As of commit d4f1b46 with merge base 6a22cae:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab

LGTM! thanks for adding these.

wconstab · 2024-11-27T23:46:13Z

test/distributed/_tensor/test_random_ops.py

+        )
+
+        # verify the weights are initialized differently on all ranks
+        with self.assertRaisesRegex(AssertionError, "AssertionError not raised"):


nit: maybe better practice to write the test "correctly" and then mark it as 'xfail', though this is OK. And anyway there's no point updating it since your next PR is going to remove the assertion.
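For readers comparing the two styles mentioned in this comment, here is a minimal sketch using only the standard `unittest` module. It is not the code from this PR: `buggy_init` is a hypothetical stand-in for the RNG bug, and `unittest.expectedFailure` is used as the stdlib counterpart of an 'xfail' marker.

```python
import io
import unittest


def buggy_init(num_ranks: int) -> list[list[float]]:
    """Hypothetical stand-in for the bug: every rank ends up with the same shard."""
    return [[0.5, 0.25] for _ in range(num_ranks)]


class InvertedAssertionStyle(unittest.TestCase):
    def test_shards_differ(self):
        shards = buggy_init(4)
        # The inner assertRaises fails with "AssertionError not raised" because
        # the shards are equal; the outer context expects exactly that failure.
        with self.assertRaisesRegex(AssertionError, "AssertionError not raised"):
            with self.assertRaises(AssertionError):
                self.assertEqual(shards[0], shards[1])


class ExpectedFailureStyle(unittest.TestCase):
    @unittest.expectedFailure
    def test_shards_differ(self):
        shards = buggy_init(4)
        # Written the way the test should read once the bug is fixed.
        self.assertNotEqual(shards[0], shards[1])


def run_sketch() -> unittest.TestResult:
    """Run both styles quietly and return the result object."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    suite.addTests(loader.loadTestsFromTestCase(InvertedAssertionStyle))
    suite.addTests(loader.loadTestsFromTestCase(ExpectedFailureStyle))
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
```

Both styles pass while the bug exists. The difference shows up after the fix: the inverted assertion has to be rewritten, whereas the expected-failure test only needs its decorator removed.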

…llelRNGTracker since it does not match FSDP2+TP (#141220)

**Summary**
The ad-hoc DTensor RNG tracker was used to mimic Megatron DDP+TP RNG behavior, but it turns out to be incompatible with PyTorch Distributed FSDP2+TP, so we decided to deprecate it and replace it with `OffsetBasedRNGTracker`, which follows the SPMD semantics (replicas get the same random sampling result, shards get different results).

**Motivation**
`TensorParallelRNGTracker` was designed for DDP+TP, where the random operators produce the same result along the data parallel mesh dimension and different results along the tensor parallel dimension. However, this does not apply to the new FSDP+TP composable combination, where the model weights are sharded along the data parallel mesh dimension as well. Therefore we decided to remove this outdated RNG tracker type for now. If users need an exact match between PyTorch Distributed and Megatron on random number generation results, feel free to file an issue.

**Impact**
`TensorParallelRNGTracker` was only used when Tensor Parallel is used (i.e. when calling `parallelize_module`).

For non-FSDP users, the "replicas get the same random numbers and shards get different ones" behavior remains unchanged. Unlike `TensorParallelRNGTracker`, which sets different seeds (`base_seed + 2718 + TP_rank`) within the TP group, DTensor now sets the same seed on all ranks (the default value is 1234, but users can call `torch.distributed.tensor._random.manual_seed` to change it) and chooses the right RNG offset based on DTensor placements to enforce the "replicas get the same random numbers and shards get different ones" invariant.

For FSDP2 users, the improvement is that DTensors sharded within the DP group now get different random number samples, which `TensorParallelRNGTracker` failed to do, though we're not sure how much this change will improve the eventual training loss convergence.

**Test**
1-d model weight meta init: `pytest test/distributed/_tensor/test_random_ops.py -s -k test_tp_model_meta_init`
2-d model weight meta init: `pytest test/distributed/_tensor/test_random_ops.py -s -k test_fsdp_tp_model_meta_init`
TP model weight init test: `pytest test/distributed/tensor/parallel/test_tp_random_state.py`
FSDP+TP model weight init test: `pytest test/distributed/_composable/fsdp/test_fully_shard_init.py`

Pull Request resolved: #141220
Approved by: https://github.com/wconstab
ghstack dependencies: #141731
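The contrast drawn in the Impact section (one shared seed with placement-dependent offsets versus a per-TP-rank seed) can be illustrated with a small simulation. This is a sketch under loud assumptions: it uses Python's `random` module instead of a device generator, the helper names (`draw`, `shard_offset`, `offset_based_init`, `tp_seed_init`) are hypothetical, and the offset formula is a simplification rather than the one `OffsetBasedRNGTracker` actually uses.

```python
import itertools
import random

BASE_SEED = 1234  # default seed value named in the commit message


def draw(seed: int, offset: int, count: int) -> list[float]:
    """Return `count` samples starting `offset` positions into the stream for `seed`."""
    rng = random.Random(seed)
    for _ in range(offset):  # skip ahead one sample at a time (illustration only)
        rng.random()
    return [rng.random() for _ in range(count)]


def shard_offset(mesh_shape, placements, coord, local_numel):
    """Simplified offset: ranks differing only along replicated mesh dims share it."""
    shard_index = 0
    for size, placement, c in zip(mesh_shape, placements, coord):
        if placement == "shard":
            shard_index = shard_index * size + c
    return shard_index * local_numel


def offset_based_init(mesh_shape, placements, local_numel=4):
    """Same seed on every rank, placement-dependent offset (new scheme, simplified)."""
    return {
        coord: draw(
            BASE_SEED,
            shard_offset(mesh_shape, placements, coord, local_numel),
            local_numel,
        )
        for coord in itertools.product(*(range(s) for s in mesh_shape))
    }


def tp_seed_init(mesh_shape, local_numel=4):
    """Per-TP-rank seed with no offset (removed scheme, simplified); last dim is TP."""
    return {
        coord: draw(BASE_SEED + 2718 + coord[-1], 0, local_numel)
        for coord in itertools.product(*(range(s) for s in mesh_shape))
    }
```

On a hypothetical 2x2 `(dp, tp)` mesh, `offset_based_init((2, 2), ("shard", "shard"))` gives every rank a different slice of the stream, while `tp_seed_init((2, 2))` gives ranks `(0, 0)` and `(1, 0)` identical values, which is the FSDP+TP mismatch the commit message describes.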

…mesh; only sync RNG state in WORLD when manual_seed has not been called (#141223)

**Summary**
This PR proposes 4 changes to DTensor RNG management:

1. DTensor allows users to eagerly initialize the RNG tracker by calling `torch.distributed.tensor._random.manual_seed`.

2. DTensor `manual_seed` no longer checks the integrity of the `seed` argument. Users are responsible for setting the same seed on all ranks within an SPMD group, but if there are multiple separate SPMD groups (e.g. across pipeline stages), users should set a _different_ seed for each SPMD group. For cases like Pipeline Parallel, users can set a different initial seed for each pipeline stage by calling

```
import torch
import torch.distributed.tensor._random  # private module, as named in this commit message
from torch.distributed.device_mesh import init_device_mesh

world_mesh = init_device_mesh(
    device_type="cuda",
    mesh_shape=(2, 2, 2),
    mesh_dim_names=("pp", "dp", "tp"),
)
pp_mesh = world_mesh["pp"]
pp_rank = pp_mesh.get_local_rank()
# this flattening is only needed if you need to call collective over this mesh
spmd_mesh = world_mesh["dp", "tp"]._flatten("spmd")
torch.distributed.tensor._random.manual_seed(123 + pp_rank, spmd_mesh)
```

In other words, if users want to call `torch.distributed.tensor._random.manual_seed`, they are responsible for passing in the right value, and DTensor won't perform any checks on it. If the current rank is not a part of the mesh, it will use the current device RNG state to initialize.

3. `OffsetBasedRNGTracker` still performs RNG state synchronization by broadcasting the RNG state on rank 0 to `WORLD`. However, calling `torch.distributed.tensor._random.manual_seed` is an exception. In this case, no broadcast will happen.

4. Enforce that the `manual_seed` call only accepts a "full mesh", i.e. the DTensor RNG state on every rank must be set through the call. This ensures that no rank has its RNG state left uninitialized and that the SPMD ranks have synchronized RNG state.

**Motivation**
tl;dr

1. Lazily initializing the DTensor RNG tracker causes hangs in non-SPMD code such as Pipeline Parallel.
2. Users may want to set different seeds on ranks in one device mesh.
3. We want to keep the old behavior if users prefer not to curate the RNG state and want DTensor to take care of it.

See details in #140301.

**Test**
`pytest test/distributed/_tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`

Pull Request resolved: #141223
Approved by: https://github.com/wconstab
ghstack dependencies: #141731, #141220
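To make the per-stage seeding in the snippet above concrete, the following sketch computes which seed each global rank would pass to `manual_seed` on the `(2, 2, 2)` mesh, without starting any processes. It assumes global ranks are laid out in row-major order over `("pp", "dp", "tp")`; the helpers `mesh_coordinate` and `seed_for_rank` are hypothetical and exist only for this illustration.

```python
MESH_SHAPE = (2, 2, 2)  # ("pp", "dp", "tp"), as in the commit message snippet
BASE_SEED = 123         # base seed used in the commit message snippet


def mesh_coordinate(rank: int, mesh_shape: tuple[int, ...]) -> tuple[int, ...]:
    """Coordinate of a global rank, assuming a row-major rank layout."""
    coord = []
    for size in reversed(mesh_shape):
        coord.append(rank % size)
        rank //= size
    return tuple(reversed(coord))


def seed_for_rank(rank: int, mesh_shape=MESH_SHAPE, base_seed=BASE_SEED) -> int:
    """Seed a rank would use: the base seed shifted by its pipeline-stage index."""
    pp_rank = mesh_coordinate(rank, mesh_shape)[0]
    return base_seed + pp_rank


def seeds_by_rank(mesh_shape=MESH_SHAPE) -> dict[int, int]:
    """Map every global rank in the mesh to its seed."""
    world_size = 1
    for size in mesh_shape:
        world_size *= size
    return {rank: seed_for_rank(rank, mesh_shape) for rank in range(world_size)}
```

Under that layout assumption, ranks 0 to 3 form one SPMD group sharing seed 123 and ranks 4 to 7 form the other sharing seed 124, which matches the rule of one seed per SPMD group and different seeds across groups.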

… avoid overflow (#141532)

**Summary**
DTensor RNG code raises an error if the seed passed in is beyond the `torch.int64` range (e.g. `torch.tensor([2**64-1])` raises an error). The solution is to specify `dtype=torch.uint64` in the `torch.tensor()` call.

Pull Request resolved: #141532
Approved by: https://github.com/wconstab
ghstack dependencies: #141731, #141220, #141223
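The range arithmetic behind this fix can be checked without PyTorch. The sketch below uses the standard `struct` module as an analogy for signed versus unsigned 64-bit storage; it does not call `torch.tensor`, and the helper `fits` is hypothetical.

```python
import struct

INT64_MAX = 2**63 - 1   # largest value a signed 64-bit integer can hold
UINT64_MAX = 2**64 - 1  # largest value an unsigned 64-bit integer can hold


def fits(fmt: str, value: int) -> bool:
    """Return True if `value` can be packed with the given struct format."""
    try:
        struct.pack(fmt, value)
    except (struct.error, OverflowError):
        return False
    return True


def roundtrip_uint64(value: int) -> int:
    """Pack and unpack `value` as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<Q", value))[0]
```

A seed of `2**64 - 1` does not fit the signed format (`"<q"`) but fits the unsigned one (`"<Q"`) and round-trips unchanged, which is the same reason an unsigned 64-bit dtype avoids the overflow described above.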

**Summary**
Added tests for model meta init on a 1-d mesh (TP) and a 2-d mesh (FSDP+TP). These tests expose the issue where DTensor RNG failed to initialize weights differently across FSDP ranks.

**Test**
`pytest test/distributed/_tensor/test_random_ops.py -s -k meta_init`

Pull Request resolved: pytorch#141731
Approved by: https://github.com/wconstab

[dtensor][random] add 1d and 2d model meta init tests

d4f1b46

[ghstack-poisoned]

XilunWu mentioned this pull request Nov 27, 2024

[dtensor][random][tp] remove the adhoc DTensor RNG tracker TensorParallelRNGTracker since it does not match FSDP2+TP #141220

Closed

XilunWu mentioned this pull request Nov 27, 2024

[dtensor][random] allow user to manual_seed different seed on device mesh; only sync RNG state in WORLD when manual_seed has not been called #141223

Closed

pytorch-bot bot added the oncall: distributed label Nov 27, 2024

XilunWu mentioned this pull request Nov 27, 2024

[dtensor][random] use torch.uint64 as the seed/offset tensor dtype to avoid overflow #141532

Closed

XilunWu requested review from fegin and wconstab November 27, 2024 23:31

XilunWu added the topic: not user facing label Nov 27, 2024

wconstab approved these changes Nov 27, 2024

wconstab reviewed Nov 27, 2024

pytorchmergebot closed this in c55191f Nov 29, 2024

pytorchmergebot added the Merged label Nov 29, 2024

github-actions bot deleted the gh/XilunWu/106/head branch December 30, 2024 02:07
