
Conversation

@tohtana
Collaborator

@tohtana commented Nov 3, 2025

Currently, DeepSpeed's backward API has more constraints than PyTorch's normal backward API.
Here is the usage as described in the documentation:

```python
loss = model_engine(batch)
model_engine.backward(loss)
```

In this example:

  1. Only a (scalar) loss value is accepted.
  2. The engine's backward API must be called.

In contrast, in standard PyTorch, you can do:

```python
output = model(batch)
output.backward(out_grad)
```

There are several use cases that rely on this flexibility, such as combining multiple models or using loss functions defined separately from the main model.
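As a concrete illustration, here is a minimal sketch of that pattern in plain PyTorch; the two-model setup and every name below are made up for this example and are not taken from this PR:

```python
import torch

# Hypothetical setup: two separately defined models trained jointly, with the
# loss function living outside either model. All names here are illustrative.
model_a = torch.nn.Linear(16, 8)
model_b = torch.nn.Linear(8, 4)
criterion = torch.nn.MSELoss()

batch = torch.randn(2, 16)
target = torch.randn(2, 4)

# Stage the forward passes so the intermediate activation can be detached,
# e.g. when the two models are wrapped and optimized independently.
a_out = model_a(batch)
b_in = a_out.detach().requires_grad_(True)
b_out = model_b(b_in)

loss = criterion(b_out, target)
loss.backward()                 # gradients for model_b's parameters and for b_in

# Propagate the upstream gradient back through model_a: a non-scalar
# tensor.backward(grad) call, which plain PyTorch supports directly.
a_out.backward(b_in.grad)
```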

If you attempt the same pattern with a DeepSpeed engine, some preprocessing and postprocessing steps will be silently skipped, which can lead to incorrect results.

The [documentation](https://deepspeed.readthedocs.io/en/latest/training.html#jointly-training-models-with-shared-loss) explains that we can call `_backward_epilogue` manually (and possibly `backward_prologue` as well). However, it is easy for users to miss these calls, and passing a non-scalar gradient is still not supported.
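For reference, the manual workaround looks roughly like the following. This is only a sketch built from the method names mentioned above; the call signatures are assumed, and the exact, supported sequence is described in the linked documentation:

```python
# Sketch only: the method names come from the description above, and the
# argument-free call signatures are assumed; see the linked docs for details.
output = model_engine(batch)
loss = external_criterion(output, target)   # loss defined outside the engine

# model_engine.backward_prologue() may also be needed before the backward call.
loss.backward()                             # plain autograd backward
model_engine._backward_epilogue()           # easy to forget, hence this PR
model_engine.step()
```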

This PR introduces the same `.backward()` behavior as PyTorch, allowing `.backward()` to be called directly on tensors and supporting non-scalar outputs.
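A minimal sketch of the usage this change enables, based on the description above (variable names are illustrative):

```python
output = model_engine(batch)            # non-scalar output is allowed
out_grad = torch.ones_like(output)      # stand-in for a gradient computed elsewhere
output.backward(out_grad)               # the engine's pre/post-backward steps still run
model_engine.step()
```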

To implement post-backward hooks, we had to use some torch internal APIs. See [comments](https://github.com/deepspeedai/DeepSpeed/blob/73f7ff1aab9d1387eb7dd4eca7453a25024533f4/deepspeed/runtime/engine.py#L424) for more details. When the internal APIs are not available, the DeepSpeed engine only accepts the traditional `model_engine.backward(loss)` form.
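The description does not spell out which internal APIs are involved. Purely as an illustration of the general technique, one internal hook point PyTorch offers for running code after a backward pass finishes is `Variable._execution_engine.queue_callback`; the sketch below uses it and is not necessarily what this PR does:

```python
import torch
from torch.autograd import Variable

def _post_backward_epilogue():
    # Placeholder for end-of-backward work (e.g. gradient post-processing).
    print("post-backward epilogue ran")

def _queue_epilogue(grad):
    # Runs while the autograd engine is executing, so the callback below is
    # queued for the end of the current backward graph task (internal API).
    Variable._execution_engine.queue_callback(_post_backward_epilogue)
    return grad

x = torch.randn(4, requires_grad=True)
out = (x * 2).sum()
out.register_hook(_queue_epilogue)   # fires when out's gradient is computed
out.backward()                       # epilogue prints after all grads are ready
```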

Signed-off-by: Masahiro Tanaka <[email protected]>
@sfc-gh-truwase
Collaborator

@tohtana, this is a very exciting usability improvement. Please remember to update the documentation.

tohtana and others added 18 commits November 6, 2025 16:24
@tohtana marked this pull request as ready for review November 14, 2025 02:13
@tohtana
Collaborator Author

tohtana commented Nov 14, 2025

@sfc-gh-truwase I think this PR is now ready for review, though the latest change in HF Transformers causes an error in `test_zero_nesting_init.py::TestNestedParallelInit::test_nested_parallel_init`.

@tohtana enabled auto-merge (squash) November 19, 2025 00:01
@tohtana merged commit 53e91a0 into deepspeedai:master Nov 19, 2025
12 checks passed
rraminen pushed a commit to rraminen/DeepSpeed that referenced this pull request Dec 1, 2025