
Conversation

awgu (Collaborator) commented Sep 17, 2024

Stack from ghstack (oldest at bottom):

TL;DR

This PR adds a shard_placement_fn: Optional[Callable[[nn.Parameter], Optional[Shard]]] arg to fully_shard that allows users to specify FSDP sharding on a nonzero tensor dim. If doing so, the tensor dim size must be divisible by the FSDP shard world size.

# Example:
def shard_placement_fn(param: nn.Parameter) -> Optional[Shard]:
    largest_dim = largest_dim_size = -1
    for dim, dim_size in enumerate(param.shape):
        if dim_size > largest_dim_size:
            largest_dim = dim
            largest_dim_size = dim_size
    return Shard(largest_dim)

fully_shard(module, shard_placement_fn=shard_placement_fn)

Follow-Ups

  • Copy kernels: For all-gather copy-out, we currently copy out to temporaries and then chunk-dim-0 -> cat-shard-dim, incurring an extra copy for parameters sharded on a nonzero tensor dim (a standalone sketch of this shuffle is below). Similarly, for reduce-scatter copy-in, we currently chunk-shard-dim -> cat-dim-0, incurring an extra copy for gradients sharded on a nonzero tensor dim. @yifuwang has ideas for adding additional split-size args to the copy ops that would allow fusing these extra copies into the existing all-gather copy-out and reduce-scatter copy-in.
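
A minimal standalone sketch of the data movement behind that extra copy, using plain torch ops and made-up shapes (world_size=4, an 8x16 parameter sharded on dim 1); this only illustrates the shuffle and is not the actual FSDP copy kernels:

```
import torch

world_size, shard_dim = 4, 1
param = torch.randn(8, 16)  # hypothetical unsharded parameter

# Shard(1): each rank owns one (8, 4) slice of the parameter.
shards = list(torch.chunk(param, world_size, dim=shard_dim))

# The all-gather output is laid out rank-major, i.e. as if the per-rank shards
# were stacked along dim 0.
all_gather_output = torch.cat(shards, dim=0)  # (32, 4)

# Copy-out today: chunk on dim 0, then cat along the shard dim -- this cat is
# the extra copy that the follow-up wants to fuse into the copy-out itself.
chunks = torch.chunk(all_gather_output, world_size, dim=0)
reconstructed = torch.cat(chunks, dim=shard_dim)  # (8, 16)
assert torch.equal(reconstructed, param)
```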

cc @XilunWu @H-Huang @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

Differential Revision: D62964657

pytorch-bot commented Sep 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136221

Note: Links to docs will display an error until the docs builds have been completed.

❌ 40 Cancelled Jobs

As of commit 4942493 with merge base d1b87e2:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the oncall: distributed and release notes: distributed (fsdp) labels on Sep 17, 2024
This is WIP and quite messy right now.

cc XilunWu H-Huang kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
This is WIP and quite messy right now.

2D train parity seems broken when using new code path

cc XilunWu H-Huang kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
This is WIP.

cc XilunWu H-Huang kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
awgu pushed a commit that referenced this pull request Sep 18, 2024
ghstack-source-id: 60eedca
Pull Request resolved: #136221
awgu pushed a commit that referenced this pull request Oct 4, 2024
ghstack-source-id: acf7973
Pull Request resolved: #136221
This is WIP.


cc XilunWu H-Huang kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

Differential Revision: [D62964657](https://our.internmc.facebook.com/intern/diff/D62964657)

[ghstack-poisoned]
This is WIP.


For `Shard(i)` and `i != 0`, the uneven sharding case must incur extra copies before/after all-gather/reduce-scatter in order for the sharded parameters and gradients to have contiguous strides.


cc XilunWu H-Huang kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

Differential Revision: [D62964657](https://our.internmc.facebook.com/intern/diff/D62964657)

[ghstack-poisoned]
awgu pushed a commit that referenced this pull request Oct 7, 2024
ghstack-source-id: 04abf49
Pull Request resolved: #136221
awgu (Collaborator, Author) commented Oct 7, 2024

To-do:

  • Add some state dict test
  • (Optional) add some memory test (will use more memory though)
  • Test on 8-GPU devgpu in torchtitan and sanity check profiler trace

self.padded_sharded_param_size = padded_sharded_param.size()
if sharded_param.numel() > 0:
    padded_sharded_param[: sharded_param.size(0)].copy_(sharded_param)
    padded_sharded_param.narrow(
weifengpy (Contributor) commented Oct 8, 2024:

I was worried about NF4 but found that .narrow and [:] both dispatch to slice.Tensor. That's good! Any ideas on how to find the mapping (other than printing torch dispatch)? I am a little bit surprised since I found narrow in native_functions.yaml.

Contributor replied:

import torch
from torch.utils._python_dispatch import TorchDispatchMode


class LoggingMode(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}
        # Print the function being dispatched and the arguments
        print(f"Dispatching function: {func.__name__}")
        return func(*args, **kwargs)

# Example usage
data = torch.rand(3, 3, device="cuda")

# Use LoggingMode for the duration of this block
with LoggingMode():
    data.narrow(dim=1, start=0, length=2)
    data[:2]
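# Per the observation in the comment above, both calls here dispatch to
# slice.Tensor, so the mode logs the same op for .narrow and [:2]
# (assuming no other dispatch mode intercepts the calls first).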

assert (
    unsharded_grad.size(shard_dim) % world_size == 0
), f"Shard({shard_dim}) requires even sharding: {unsharded_grad.size()=} {world_size=}"
chunks = torch.chunk(unsharded_grad, world_size, dim=shard_dim)
unsharded_grads[i] = torch.cat(chunks, dim=0)
Contributor commented:

Are we doing chunk + cat to make unsharded_grads contiguous? Does the .cat trigger copies on GPU?

Collaborator Author replied:

We have to chunk -> cat to do a data shuffle (a standalone sketch is below). If I have some time, I may try to draw some diagrams to add to the PR description.
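
A rough standalone illustration of that shuffle (assumed shapes, world_size=4, gradient sharded on dim 1; plain torch ops, not the actual reduce-scatter copy-in):

```
import torch

world_size, shard_dim = 4, 1
unsharded_grad = torch.randn(8, 16)
assert unsharded_grad.size(shard_dim) % world_size == 0

# Chunk on the shard dim, then cat on dim 0 so the reduce-scatter input is one
# contiguous buffer whose dim-0 segments line up with the per-rank shards.
chunks = torch.chunk(unsharded_grad, world_size, dim=shard_dim)  # 4 x (8, 4)
reduce_scatter_input = torch.cat(chunks, dim=0)  # (32, 4); this cat is the extra copy

# Reduce-scatter splits its input evenly over dim 0, so rank r receives rows
# [r * 8, (r + 1) * 8), which is exactly rank r's Shard(1) slice of the gradient.
r = 2
assert torch.equal(reduce_scatter_input[r * 8 : (r + 1) * 8], chunks[r])
```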

if fsdp_param.fsdp_placement.dim != 0:
    # Copy to a temporary and then chunk-cat into the final all-gather
    # output tensors
    param_all_gather_outputs = [
weifengpy (Contributor) commented Oct 8, 2024:

[old] It seems this is overwriting init_all_gather_outputs. Is it better to move the logic into init_all_gather_outputs, or is this only temporary?

Edited: sorry, it's not overwriting; it's a temporary place for the copy-out. Feel free to ignore.

post_param_size[shard_dim] *= world_size
cat_out = target_all_gather_output.view(post_param_size)
torch.cat(chunks, dim=shard_dim, out=cat_out)
torch._C._autograd._unsafe_set_version_counter(
Contributor commented:

Is this similar to `with _unsafe_preserve_version_counter(target_all_gather_output)`? Do we use _unsafe_set_version_counter because param_all_gather_outputs is defined in a `for ... in` loop and the API is easier?

Collaborator Author replied:

Yes, `with _unsafe_preserve_version_counter(target_all_gather_output)` calls into this _unsafe_set_version_counter.
Since we know we just need to decrement once, I think it is simpler and lower overhead to call the API directly.

dp_shard_tp_placement = (
(
_StridedShard(0, split_factor=split_factor)
_StridedShard(shard_dim, split_factor=split_factor)
weifengpy (Contributor) commented Oct 8, 2024:

I never truly understood strided sharding. I guess it's meant for model.parameters() outside of fwd/bwd? Mostly for DTensor ops outside of FSDP, like state dict, optimizer, and grad norm clipping?

Collaborator Author replied:

yep

## TL;DR
This PR adds a `shard_placement_fn: Optional[Callable[[nn.Parameter], Optional[Shard]]]` arg to `fully_shard` that allows users to specify FSDP sharding on a nonzero tensor dim. If doing so, the tensor dim size must be divisible by the FSDP shard world size.

```
# Example:
def shard_placement_fn(param: nn.Parameter) -> Optional[Shard]:
    largest_dim = largest_dim_size = -1
    for dim, dim_size in enumerate(param.shape):
        if dim_size > largest_dim_size:
            largest_dim = dim
            largest_dim_size = dim_size
    return Shard(largest_dim)

fully_shard(module, shard_placement_fn=shard_placement_fn)
```

## Follow-Ups
- **Copy kernels:** For all-gather copy-out, we currently copy out to temporaries and then chunk-dim-0 -> cat-shard-dim, incurring an extra copy for parameters sharded on a nonzero tensor dim. Similarly, for reduce-scatter copy-in, we currently chunk-shard-dim -> cat-dim-0, incurring an extra copy for gradients sharded on a nonzero tensor dim. yifuwang has ideas for adding additional split-size args to the copy ops that would allow fusing these extra copies into the existing all-gather copy-out and reduce-scatter copy-in.

cc XilunWu H-Huang kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

Differential Revision: [D62964657](https://our.internmc.facebook.com/intern/diff/D62964657)

[ghstack-poisoned]
awgu pushed a commit that referenced this pull request Oct 8, 2024
ghstack-source-id: 71ac63d
Pull Request resolved: #136221
awgu (Collaborator, Author) commented Oct 8, 2024

Re-opened as a different PR (#137496) since the test-config/distributed label seems to persist even after removing it.

awgu closed this on Oct 8, 2024
github-actions bot deleted the gh/awgu/640/head branch on November 21, 2024

Labels: oncall: distributed, release notes: distributed (fsdp2)