
Conversation

@tohtana
Collaborator

@tohtana commented Nov 3, 2025

Currently, DeepSpeed's backward API has more constraints than PyTorch's normal backward API.
Here is the usage as described in the documentation:

```python
loss = model_engine(batch)
model_engine.backward(loss)
```

In this example:

  1. Only a (scalar) loss value is accepted.
  2. The engine's backward API must be called.

In contrast, in standard PyTorch, you can do:

```python
output = model(batch)
output.backward(out_grad)
```

There are several use cases that rely on this flexibility, such as combining multiple models or using loss functions defined separately from the main model.
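As a concrete illustration, here is a minimal sketch of that pattern in plain PyTorch; the two-model setup and every name below are made up for this example and are not taken from this PR:

```python
import torch

# Hypothetical setup: two separately defined models trained jointly, with the
# loss function living outside either model. All names here are illustrative.
model_a = torch.nn.Linear(16, 8)
model_b = torch.nn.Linear(8, 4)
criterion = torch.nn.MSELoss()

batch = torch.randn(2, 16)
target = torch.randn(2, 4)

# Stage the forward passes so the intermediate activation can be detached,
# e.g. when the two models are wrapped and optimized independently.
a_out = model_a(batch)
b_in = a_out.detach().requires_grad_(True)
b_out = model_b(b_in)

loss = criterion(b_out, target)
loss.backward()                 # gradients for model_b's parameters and for b_in

# Propagate the upstream gradient back through model_a: a non-scalar
# tensor.backward(grad) call, which plain PyTorch supports directly.
a_out.backward(b_in.grad)
```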

If you attempt the same pattern with a DeepSpeed engine, some preprocessing and postprocessing steps will be silently skipped, which can lead to incorrect results.

The [documentation](https://deepspeed.readthedocs.io/en/latest/training.html#jointly-training-models-with-shared-loss) explains that we can call `_backward_epilogue` manually (and possibly `backward_prologue` as well). However, it is easy for users to miss these calls, and passing a non-scalar gradient is still not supported.
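For reference, the manual workaround looks roughly like the following. This is only a sketch built from the method names mentioned above; the call signatures are assumed, and the exact, supported sequence is described in the linked documentation:

```python
# Sketch only: the method names come from the description above, and the
# argument-free call signatures are assumed; see the linked docs for details.
output = model_engine(batch)
loss = external_criterion(output, target)   # loss defined outside the engine

# model_engine.backward_prologue() may also be needed before the backward call.
loss.backward()                             # plain autograd backward
model_engine._backward_epilogue()           # easy to forget, hence this PR
model_engine.step()
```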

This PR introduces the same `.backward()` behavior as PyTorch, allowing `.backward()` to be called directly on tensors and supporting non-scalar outputs.
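A minimal sketch of the usage this change enables, based on the description above (variable names are illustrative):

```python
output = model_engine(batch)            # non-scalar output is allowed
out_grad = torch.ones_like(output)      # stand-in for a gradient computed elsewhere
output.backward(out_grad)               # the engine's pre/post-backward steps still run
model_engine.step()
```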

To implement post-backward hooks, we had to use some torch internal APIs. See [comments](https://github.com/deepspeedai/DeepSpeed/blob/73f7ff1aab9d1387eb7dd4eca7453a25024533f4/deepspeed/runtime/engine.py#L424) for more details. When the internal APIs are not available, the DeepSpeed engine only accepts the traditional `model_engine.backward(loss)` form.
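The description does not spell out which internal APIs are involved. Purely as an illustration of the general technique, one internal hook point PyTorch offers for running code after a backward pass finishes is `Variable._execution_engine.queue_callback`; the sketch below uses it and is not necessarily what this PR does:

```python
import torch
from torch.autograd import Variable

def _post_backward_epilogue():
    # Placeholder for end-of-backward work (e.g. gradient post-processing).
    print("post-backward epilogue ran")

def _queue_epilogue(grad):
    # Runs while the autograd engine is executing, so the callback below is
    # queued for the end of the current backward graph task (internal API).
    Variable._execution_engine.queue_callback(_post_backward_epilogue)
    return grad

x = torch.randn(4, requires_grad=True)
out = (x * 2).sum()
out.register_hook(_queue_epilogue)   # fires when out's gradient is computed
out.backward()                       # epilogue prints after all grads are ready
```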

Signed-off-by: Masahiro Tanaka <[email protected]>
@sfc-gh-truwase
Collaborator

@tohtana, this is a very exciting usability improvement. Please remember to update the documentation.

tohtana and others added 18 commits November 6, 2025 16:24
@tohtana marked this pull request as ready for review November 14, 2025 02:13
@tohtana
Collaborator Author

tohtana commented Nov 14, 2025

@sfc-gh-truwase I think this PR is now ready for review, though the latest change in HF Transformers causes an error in `test_zero_nesting_init.py::TestNestedParallelInit::test_nested_parallel_init`.

@tohtana enabled auto-merge (squash) November 19, 2025 00:01
@tohtana merged commit 53e91a0 into deepspeedai:master Nov 19, 2025
12 checks passed
rraminen pushed a commit to rraminen/DeepSpeed that referenced this pull request Dec 1, 2025