
[6/N][BE] Remove Sharded Linear Op for ShardedTensor #95948

Closed
fduwjj wants to merge 2 commits into gh/fduwjj/78/base from gh/fduwjj/78/head

Conversation

@fduwjj
Contributor

@fduwjj fduwjj commented Mar 3, 2023

@pytorch-bot

pytorch-bot bot commented Mar 3, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/95948

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a9d6b23:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@fduwjj fduwjj added the better-engineering Relatively self-contained tasks for better engineering contributors label Mar 3, 2023
@fduwjj fduwjj changed the title [6/N] Remove Sharded Linear Op for ShardedTensor [6/N][BE] Remove Sharded Linear Op for ShardedTensor Mar 3, 2023
@fduwjj fduwjj added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 6, 2023
@fduwjj
Contributor Author

fduwjj commented Mar 6, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@fduwjj
Contributor Author

fduwjj commented Mar 7, 2023

@fduwjj has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@huydhn
Contributor

huydhn commented Mar 7, 2023

@fduwjj I suspect that this PR causes a test failure in the periodic multi-GPU CI job, for example https://hud.pytorch.org/pytorch/pytorch/commit/893aa5df3f2a475c91ea8eadb1353812e52fb227. The error is

Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 657, in run_test
    getattr(self, test_name)()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 543, in wrapper
    fn()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
    return func(*args, **kwargs)
  File "/var/lib/jenkins/workspace/test/distributed/fsdp/test_fsdp_tp_integration.py", line 469, in test_fsdp_tp_checkpoint_integration
    tp_fsdp_model(inp).sum().backward()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1533, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 748, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1533, in _call_impl
    return forward_call(*args, **kwargs)
  File "/var/lib/jenkins/workspace/test/distributed/fsdp/test_fsdp_tp_integration.py", line 178, in forward
    return self.net3(self.net2(self.relu(self.net1(x))))
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1533, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1098, in __torch_function__
    return dispatch(st_instance, func)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1081, in dispatch
    raise RuntimeError(
RuntimeError: torch function 'linear', with args: (tensor([[[0.7576, 0.2793, 0.4031, 0.7347, 0.0293],
         [0.7999, 0.3971, 0.7544, 0.5695, 0.4388],
         [0.6387, 0.5247, 0.6826, 0.3051, 0.4635]],

        [[0.4550, 0.5725, 0.4980, 0.9371, 0.6556],
         [0.3138, 0.1980, 0.4162, 0.2843, 0.3398],
         [0.5239, 0.7981, 0.7718, 0.0112, 0.8100]]], device='cuda:0'), ShardedTensor(ShardedTensorMetadata(shards_metadata=[ShardMetadata(shard_offsets=[0, 0], shard_sizes=[4, 5], placement=rank:0/cuda:0), ShardMetadata(shard_offsets=[4, 0], shard_sizes=[4, 5], placement=rank:1/cuda:1)], size=torch.Size([8, 5]), tensor_properties=TensorProperties(dtype=torch.float32, layout=torch.strided, requires_grad=True, memory_format=torch.contiguous_format, pin_memory=False))), tensor([ 0.0235, -0.2293,  0.0757, -0.4176, -0.3231, -0.2306,  0.2822,  0.2622],
       device='cuda:0', grad_fn=<ViewBackward0>)) and kwargs: None not supported for ShardedTensor!

Should that test also be removed as part of this change removing the linear op for ShardedTensor? Please help take a look.
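The failure in the traceback above comes out of ShardedTensor's `__torch_function__` dispatch: once the sharded linear handler is removed, `F.linear` has no registered handler and the dispatch table raises. Below is a minimal pure-Python sketch of that registry pattern; the names `_HANDLERS`, `register_op`, and `dispatch` are simplified stand-ins for illustration, not the actual code in `torch/distributed/_shard/sharded_tensor/api.py`.

```python
# Illustrative sketch of an op-dispatch registry (hypothetical names,
# simplified from the ShardedTensor __torch_function__ mechanism).
_HANDLERS = {}

def register_op(func_name):
    """Decorator that registers a handler for a torch function name."""
    def wrap(handler):
        _HANDLERS[func_name] = handler
        return handler
    return wrap

def dispatch(func_name, *args, **kwargs):
    """Look up a handler; unregistered ops raise, as in the CI failure."""
    handler = _HANDLERS.get(func_name)
    if handler is None:
        raise RuntimeError(
            f"torch function '{func_name}' not supported for ShardedTensor!"
        )
    return handler(*args, **kwargs)

@register_op("embedding")
def _embedding(weight, indices):
    # Toy handler: gather rows of `weight` by index.
    return [weight[i] for i in indices]
```

With the 'linear' handler deleted from the registry, `dispatch("linear", ...)` raises the same kind of RuntimeError shown in the traceback, while still-registered ops continue to work.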

@fduwjj
Contributor Author

fduwjj commented Mar 7, 2023

@huydhn I will send a forward fix for that. Not sure why this was not caught in the CI flow.

@fduwjj
Contributor Author

fduwjj commented Mar 7, 2023

Oh, it's in the periodic large test, no wonder it wasn't caught...

@huydhn
Contributor

huydhn commented Mar 7, 2023

@huydhn I will send a forward fix for that. Not sure why this was not caught in the CI flow.

Thank you for looking into this. That test is periodic, so it only runs every 4 hours or so. You could add `ciflow/periodic` to your PR to have it run. It's OK to forward-fix periodic tests, though.

@fduwjj
Contributor Author

fduwjj commented Mar 8, 2023

@huydhn Fixed in #96254.

fduwjj added a commit that referenced this pull request Mar 8, 2023
We removed ShardedLinear in #95948, but it broke the TP_FSDP integration test because that test uses ShardedTensor. Migrating the test to DTensor fixes it. DTensor shards the bias as well, so the test needs to change slightly.


[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Mar 8, 2023
…st and other tests depending on Sharded Linear"


We removed ShardedLinear in #95948, but it broke the TP_FSDP integration test because that test uses ShardedTensor. Migrating the test to DTensor fixes it. DTensor shards the bias as well, so the test needs to change slightly.


[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Mar 8, 2023
…r tests depending on Sharded Linear (#96254)

We removed ShardedLinear in #95948, but it broke the TP_FSDP integration test because that test uses ShardedTensor. Migrating the test to DTensor fixes it. DTensor shards the bias as well, so the test needs to change slightly.

Pull Request resolved: #96254
Approved by: https://github.com/huydhn
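The commit message's note that DTensor shards the bias as well is what forces the test change: each rank's expected local bias shrinks from the full vector to a slice. A hedged plain-Python sketch of that shape difference (no process group or torch needed; `shard_rows` is a hypothetical stand-in for a dim-0 `Shard(0)` placement, and the 8x5 shapes mirror the weight in the traceback above):

```python
def shard_rows(rows, world_size):
    """Row-wise (dim-0) shard a list across ranks using ceil-division
    chunking. A plain-Python stand-in for a Shard(0) placement."""
    n = len(rows)
    chunk = (n + world_size - 1) // world_size  # ceil-div chunk size
    return [rows[r * chunk:(r + 1) * chunk] for r in range(world_size)]

# Linear layer with out_features=8, in_features=5, as in the traceback.
weight_rows = [[0.0] * 5 for _ in range(8)]
bias = [0.0] * 8

# The old ShardedTensor-based test sharded only the weight, leaving a
# replicated 8-element bias on every rank. With DTensor, Shard(0) also
# splits the bias, so each of the 2 ranks holds a 4-element local bias.
weight_shards = shard_rows(weight_rows, world_size=2)
bias_shards = shard_rows(bias, world_size=2)
```

This is why the forward-fix adjusts the test's expected local shapes rather than its numerics: the math of the linear layer is unchanged, only where the bias lives differs.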
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 12, 2023
…r tests depending on Sharded Linear (#96254)

We removed ShardedLinear in pytorch/pytorch#95948, but it broke the TP_FSDP integration test because that test uses ShardedTensor. Migrating the test to DTensor fixes it. DTensor shards the bias as well, so the test needs to change slightly.

Pull Request resolved: pytorch/pytorch#96254
Approved by: https://github.com/huydhn
ydwu4 added a commit to ydwu4/pytorch that referenced this pull request Mar 13, 2023
…r tests depending on Sharded Linear (pytorch#96254)

We removed ShardedLinear in pytorch#95948, but it broke the TP_FSDP integration test because that test uses ShardedTensor. Migrating the test to DTensor fixes it. DTensor shards the bias as well, so the test needs to change slightly.

Pull Request resolved: pytorch#96254
Approved by: https://github.com/huydhn
@facebook-github-bot facebook-github-bot deleted the gh/fduwjj/78/head branch June 8, 2023 17:12

Labels

better-engineering Relatively self-contained tasks for better engineering contributors ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: distributed (sharded) release notes category
