
Conversation

@tohtana
Collaborator

@tohtana tohtana commented Feb 3, 2025

DeepSpeed supports mixed precision training, but its behavior differs from torch.autocast. DeepSpeed maintains parameters and gradients in both FP32 and a lower precision (FP16/BF16), in the style of NVIDIA Apex AMP, and computes all modules in the lower precision, while torch.autocast keeps parameters in FP32 and computes only certain operators in the lower precision.
This leads to differences in:

  • performance: torch.autocast needs to downcast in forward/backward
  • memory usage: DeepSpeed needs more memory to keep copies of parameters and gradients in the lower precision
  • accuracy: torch.autocast maintains a list of operators that can safely be computed in lower precision; precision-sensitive operators (e.g. softmax) are computed in FP32

To align DeepSpeed's behavior with torch.autocast when necessary, this PR adds integration of torch.autocast with ZeRO. Here is an example of the configuration.

"torch_autocast": {
  "enabled": true,
  "dtype": "bfloat16",
  "lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"]
}

Each configuration item works as follows:

  • enabled: Enables the integration with torch.autocast when set to True. You don't need to call torch.autocast in your code; the grad scaler is also applied inside the DeepSpeed optimizer.
  • dtype: The lower-precision dtype passed to torch.autocast. Gradients for allreduce (reduce-scatter) and parameters for allgather (ZeRO3 only) of lower_precision_safe_modules are also downcast to this dtype.
  • lower_precision_safe_modules: Downcasting for allreduce (reduce-scatter) and allgather (ZeRO3) is applied only to modules listed here. (The precision of PyTorch operators in forward/backward follows torch.autocast's policy, not this list.) Specify class names together with their packages. If this item is not set, DeepSpeed uses the default list: [torch.nn.Linear, torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d].

Note that only FP32 parameters are maintained when this feature is enabled. For consistency, you cannot enable fp16 or bf16 in the DeepSpeed config.
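
For illustration, here is a minimal end-to-end sketch of how the feature could be used. Only the torch_autocast block comes from this PR; the model, optimizer settings, and batch size are placeholder assumptions, and the script would normally be run via the DeepSpeed launcher.

```python
import torch
import deepspeed

# Minimal illustrative config: the "torch_autocast" section is the feature
# described above; the other entries are placeholders for a small test run.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 2},
    "torch_autocast": {
        "enabled": True,
        "dtype": "bfloat16",
        "lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"],
    },
    # Note: fp16/bf16 sections are intentionally absent; they cannot be
    # combined with torch_autocast.
}

model = torch.nn.Linear(16, 4)
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# No torch.autocast context manager is needed in user code; the engine
# applies autocast (and grad scaling, where applicable) internally.
x = torch.randn(1, 16, device=engine.device)
loss = engine(x).sum()
engine.backward(loss)
engine.step()
```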

tjruwase and others added 30 commits February 28, 2025 22:53
Fix #6772

---------

Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
…#6967)

- Issues with nv-sd updates, will follow up with a subsequent PR

Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
The NVIDIA Blackwell GPU generation has number 10. The SM code and
architecture should be `100`, but the current code generates `1.`
because it expects a 2-character string.

This change modifies the logic to treat the capability as a string that
contains a `.`, splitting it and using the resulting array of strings.
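
An illustrative, standalone sketch of the described parsing change (not the actual DeepSpeed builder code): splitting the compute capability on `.` yields `100` for Blackwell instead of a truncated `1.`.

```python
def sm_code(compute_capability: str) -> str:
    # Split on "." instead of assuming a fixed 2-character capability string,
    # so two-digit major versions such as Blackwell's "10.0" are handled.
    major, minor = compute_capability.split(".")
    return major + minor

assert sm_code("8.0") == "80"
assert sm_code("9.0") == "90"
assert sm_code("10.0") == "100"  # Blackwell
```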

Signed-off-by: Fabien Dupont <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: Fabien Dupont <[email protected]>
Co-authored-by: Fabien Dupont <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
1. Update Intel oneAPI basekit to 2025.0
2. Update torch/ipex/oneccl to 2.5

Signed-off-by: Masahiro Tanaka <[email protected]>
Same as [this PR](#6922).
[affeb88](affeb88)
I noticed the CI updated the DCO check recently. Using the suggested
rebase method for sign-off would reintroduce many conflicts, so I opted
for a squash merge with sign-off instead. Thanks :)

Signed-off-by: inkcherry <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Those files have code that runs at import time, so on systems that
don't support triton but have triton installed, this causes issues.

In general, I think it is better to import triton only when it is
installed and supported.
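
A minimal sketch of the guarded-import pattern described above (illustrative only, not the exact DeepSpeed code): triton is touched only if it imports cleanly, so having it installed on an unsupported system no longer breaks module import.

```python
# Import triton defensively; callers check the flag instead of importing
# triton at module scope.
try:
    import triton  # noqa: F401
    HAS_TRITON = True
except (ImportError, RuntimeError):
    HAS_TRITON = False

def triton_supported() -> bool:
    """Return True only when triton imported successfully."""
    return HAS_TRITON
```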

Signed-off-by: Omar Elayan <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Fix #7014
Avoid naming collision on `partition()`

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Fix typos

Signed-off-by: Masahiro Tanaka <[email protected]>
BUGFIX for Apple Silicon hostname
#6497

---------

Signed-off-by: Fabien Dupont <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: inkcherry <[email protected]>
Signed-off-by: Roman Fitzjalen <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Fabien Dupont <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Liangliang Ma <[email protected]>
Co-authored-by: inkcherry <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
- Update existing workflows that use cu121 to cu124. Note that where we
download the latest torch, we will now get torch 2.6 rather than torch
2.5, the latest version provided with CUDA 12.1.
- Note that nv-nightly is currently failing in master due to unrelated
errors, so it can be ignored in this PR (nv-nightly was tested locally,
where it passes with both 12.1 and 12.4).

---------

Signed-off-by: Fabien Dupont <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: inkcherry <[email protected]>
Signed-off-by: Omar Elayan <[email protected]>
Co-authored-by: Fabien Dupont <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Liangliang Ma <[email protected]>
Co-authored-by: inkcherry <[email protected]>
Co-authored-by: Omar Elayan <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
This change is required to successfully build fp_quantizer extension on
ROCm.

---------

Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
cc @tjruwase @jomayeri

---------

Co-authored-by: root <root@ftqtmec25000000.taxzvufipdhelhupulxcbvr15f.ux.internal.cloudapp.net>
Signed-off-by: Masahiro Tanaka <[email protected]>
Fix #7029
- Add Chinese blog for deepspeed windows
- Fix format in README.md

Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Adding compile support for AIO library on AMD GPUs.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Make trace cache warnings configurable, and disable them by default.

Fix #6985, #4081, #5033, #5006, #5662

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Update the CUDA compute capabilities for cross-compilation according to the Wikipedia page.
https://en.wikipedia.org/wiki/CUDA#GPUs_supported

---------

Signed-off-by: Hongwei <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Propagate API change.

Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
@tohtana
Collaborator Author

tohtana commented May 24, 2025

Then assert is the way to go, Masahiro

Thank you @stas00, then can you approve this PR?

@stas00
Collaborator

stas00 commented May 27, 2025

Hmm, I can't just hit approve, that would defeat the purpose of doing the review.

We have only discussed one small aspect of this PR, which has been resolved, but I don't know the rest of the PR, and I'm currently rushing to finish porting Ulysses to HF/DS, so until that is done I won't have time to do a serious review.

@tohtana tohtana enabled auto-merge (squash) June 19, 2025 20:23
@tohtana tohtana merged commit ed5f737 into master Jun 19, 2025
12 checks passed
@tohtana tohtana deleted the tohtana/support_autocast branch June 19, 2025 21:36
tohtana added a commit that referenced this pull request Jun 22, 2025
#6993 broke many paths in the ZeRO1/2 optimizer. This PR fixes most of
the issues it caused. We still have one failing test in
`unit/runtime/zero`:

```
====================================== short test summary info ======================================
FAILED test_zero.py::TestParamPartitioningSkipInit::test[dtype1] - RuntimeError: mat1 and mat2 must have the same dtype, but got Half and BFloat16
========= 1 failed, 204 passed, 66 skipped, 15 deselected, 5 warnings in 2305.03s (0:38:25) =========
```

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Antlera pushed a commit to Antlera/DeepSpeed that referenced this pull request Jun 27, 2025
DeepSpeed supports mixed precision training, but the behavior is
different from `torch.autocast`. DeepSpeed maintains parameters and
gradients both in FP32 and a lower precision (FP16/BF16) (NVIDIA Apex
AMP style) and computes all modules in the lower precision while
`torch.autocast` maintains parameters in FP32 but computes only certain
operators in the lower precision.
This leads to differences in:
- performance: `torch.autocast` needs downcast in forward/backward
- memory usage: DeepSpeed needs more memory to keep copies of parameters
and gradients in lower precision
- accuracy: `torch.autocast` has a list of modules that can safely be
computed in lower precision. Some precision-sensitive operators (e.g.
softmax) are computed in FP32.

To align DeepSpeed's behavior with `torch.autocast` when necessary, this
PR adds integration of `torch.autocast` with ZeRO. Here is an
example of the configuration.

```json
"torch_autocast": {
  "enabled": true,
  "dtype": "bfloat16",
  "lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"]
}
```

Each configuration works as follows:
- `enabled`: Enable the integration with `torch.autocast` if this is set
to `True`. You don't need to call `torch.autocast` in your code. The
grad scaler is also applied in the DeepSpeed optimizer.
- `dtype`: lower precision dtype passed to `torch.autocast`. Gradients
for allreduce (reduce-scatter) and parameters for allgather (only for
ZeRO3) of `lower_precision_safe_modules` are also downcasted to this
dtype.
- `lower_precision_safe_modules`: Downcast for allreduce
(reduce-scatter) and allgather (ZeRO3) are applied only to modules
specified in this list. (The precision for PyTorch operators in
forward/backward follows `torch.autocast`'s policy, not this list.) You
can set names of classes with their packages. If you don't set this
item, DeepSpeed uses the default list: `[torch.nn.Linear,
torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d]`.

Note that we only maintain FP32 parameters with this feature enabled.
For consistency, you cannot enable `fp16` or `bf16` in DeepSpeed config.

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Fabien Dupont <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: inkcherry <[email protected]>
Signed-off-by: Omar Elayan <[email protected]>
Signed-off-by: Roman Fitzjalen <[email protected]>
Signed-off-by: Hongwei <[email protected]>
Signed-off-by: shaomin <[email protected]>
Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: siqi <[email protected]>
Signed-off-by: Wei Wu <[email protected]>
Signed-off-by: ShellyNR <[email protected]>
Signed-off-by: Lai, Yejing <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Fabien Dupont <[email protected]>
Co-authored-by: Liangliang Ma <[email protected]>
Co-authored-by: inkcherry <[email protected]>
Co-authored-by: Omar Elayan <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Roman Fitzjalen <[email protected]>
Co-authored-by: Ramya Ramineni <[email protected]>
Co-authored-by: Guanhua Wang <[email protected]>
Co-authored-by: root <root@ftqtmec25000000.taxzvufipdhelhupulxcbvr15f.ux.internal.cloudapp.net>
Co-authored-by: Hongwei Chen <[email protected]>
Co-authored-by: Joe Mayer <[email protected]>
Co-authored-by: wukong1992 <[email protected]>
Co-authored-by: shaomin <[email protected]>
Co-authored-by: loadams <[email protected]>
Co-authored-by: siqi654321 <[email protected]>
Co-authored-by: siqi <[email protected]>
Co-authored-by: Wei Wu <[email protected]>
Co-authored-by: Shelly Nahir <[email protected]>
Co-authored-by: snahir <[email protected]>
Co-authored-by: Yejing-Lai <[email protected]>
Co-authored-by: Siddharth Singh <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
lpnpcs pushed a commit to lpnpcs/DeepSpeed that referenced this pull request Jul 30, 2025
deepspeedai#6993 broke many paths in the ZeRO1/2 optimizer. This PR fixes most of
the issues it caused. We still have one failing test in
`unit/runtime/zero`:

```
====================================== short test summary info ======================================
FAILED test_zero.py::TestParamPartitioningSkipInit::test[dtype1] - RuntimeError: mat1 and mat2 must have the same dtype, but got Half and BFloat16
========= 1 failed, 204 passed, 66 skipped, 15 deselected, 5 warnings in 2305.03s (0:38:25) =========
```

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
@Vervious

Vervious commented Aug 8, 2025

I'm running into an issue where turning on this feature results in massive grad norms, using zero 2; have you seen this before?

[deepspeed.torch_autocast]
enabled = true  # NOTE: turning this on makes grad norms explode in stage 2 (but not stage 3)
dtype = "bfloat16"

# [deepspeed.bf16]
# enabled = true  # this works properly

[deepspeed.zero_optimization]
stage = 2
allgather_partitions = true
overlap_comm = false
reduce_scatter = true
contiguous_gradients = true
stage3_prefetch_bucket_size = 0
stage3_max_live_parameters = 0
stage3_max_reuse_distance = 0
stage3_gather_16bit_weights_on_model_save = true

Grad norms were reported via model_engine.get_global_grad_norm() and also observed via safe_get_full_grad after a backward call (but before step). Stage 3 seems to have reasonable grad norms. For some reason the loss curves also don't match exactly between stages 2 and 3 (but they do match exactly when using deepspeed.bf16 instead).
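
For context, a rough sketch of the kind of inspection described above (hypothetical loop; engine, batch, and loss_fn are placeholders, while safe_get_full_grad and get_global_grad_norm are the utilities mentioned in the comment):

```python
from deepspeed.utils import safe_get_full_grad

loss = loss_fn(engine(batch))
engine.backward(loss)

# Per-parameter gradient norms, gathered across ZeRO partitions,
# inspected after backward but before step.
for name, param in engine.module.named_parameters():
    grad = safe_get_full_grad(param)
    if grad is not None:
        print(name, grad.norm().item())

engine.step()
# Global gradient norm as reported by the engine.
print("global grad norm:", engine.get_global_grad_norm())
```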

eternalNight added a commit to openanolis/DeepSpeed that referenced this pull request Sep 30, 2025
PR deepspeedai#6993 replaces the flat IPG buffers with a dict maintaining
type-indexed buckets. The member is also renamed from
`_ipg_bucket_flat_buffer` to `ipg_buckets`.

Update the bucket clearing logic in `init_z3` accordingly.

Signed-off-by: Junjie Mao <[email protected]>
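
A loose, self-contained illustration of the structural change (the class and attribute names below are assumptions for illustration, not the actual DeepSpeed types): the IPG state moves from a single flat buffer to a dict of per-dtype buckets, so clearing logic has to iterate the dict.

```python
from typing import Dict, List
import torch

class ToyBucket:
    """Stand-in for a per-dtype IPG bucket (illustrative only)."""
    def __init__(self, numel: int, dtype: torch.dtype):
        self.buffer = torch.zeros(numel, dtype=dtype)
        self.params: List[torch.nn.Parameter] = []

    def clear(self) -> None:
        self.buffer.zero_()
        self.params.clear()

# Previously: one flat buffer (e.g. `_ipg_bucket_flat_buffer`).
# Now: a dict keyed by dtype (e.g. `ipg_buckets`), cleared per bucket.
ipg_buckets: Dict[torch.dtype, ToyBucket] = {
    torch.bfloat16: ToyBucket(1024, torch.bfloat16),
    torch.float32: ToyBucket(1024, torch.float32),
}

for bucket in ipg_buckets.values():
    bucket.clear()
```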
tohtana pushed a commit that referenced this pull request Oct 1, 2025
PR #6993 replaces the flat IPG buffers with a dict maintaining
type-indexed buckets. The member is also renamed from
`_ipg_bucket_flat_buffer` to `ipg_buckets`.

Update the bucket clearing logic in `init_z3` accordingly.

Signed-off-by: Junjie Mao <[email protected]>
snorkelopstesting1-a11y pushed a commit to snorkel-marlin-repos/deepspeedai_DeepSpeed_pr_6993_98d95b49-81a8-4e37-a589-1378c216295f that referenced this pull request Oct 2, 2025
Original PR #6993 by tohtana
Original: deepspeedai/DeepSpeed#6993
snorkelopstesting1-a11y added a commit to snorkel-marlin-repos/deepspeedai_DeepSpeed_pr_6993_98d95b49-81a8-4e37-a589-1378c216295f that referenced this pull request Oct 2, 2025
delock pushed a commit that referenced this pull request Oct 3, 2025
PR #6993 replaces the flat IPG buffers with a dict maintaining
type-indexed buckets. The member is also renamed from
`_ipg_bucket_flat_buffer` to `ipg_buckets`.

Update the bucket clearing logic in `init_z3` accordingly.

Signed-off-by: Junjie Mao <[email protected]>
Signed-off-by: Guokai Ma <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
DeepSpeed supports mixed precision training, but the behavior is
different from `torch.autocast`. DeepSpeed maintains parameters and
gradients both in FP32 and a lower precision (FP16/BF16) (NVIDIA Apex
AMP style) and computes all modules in the lower precision while
`torch.autocast` maintains parameters in FP32 but computes only certain
operators in the lower precision.
This leads to differences in:
- performance: `torch.autocast` needs downcast in forward/backward
- memory usage: DeepSpeed needs more memory to keep copies of parameters
and gradients in lower precision
- accuracy: `torch.autocast` has a list of modules that can safely be
computed in lower precision. Some precision-sensitive operators (e.g.
softmax) are computed in FP32.

To align DeepSpeed's behavior with `torch.autocast` when necessary, this
PR adds integration of `torch.autocast` with ZeRO. Here is an
example of the configuration.

```json
"torch_autocast": {
  "enabled": true,
  "dtype": "bfloat16",
  "lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"]
}
```

Each configuration works as follows:
- `enabled`: Enable the integration with `torch.autocast` if this is set
to `True`. You don't need to call `torch.autocast` in your code. The
grad scaler is also applied in the DeepSpeed optimizer.
- `dtype`: lower precision dtype passed to `torch.autocast`. Gradients
for allreduce (reduce-scatter) and parameters for allgather (only for
ZeRO3) of `lower_precision_safe_modules` are also downcasted to this
dtype.
- `lower_precision_safe_modules`: Downcast for allreduce
(reduce-scatter) and allgather (ZeRO3) are applied only to modules
specified in this list. (The precision for PyTorch operators in
forward/backward follows `torch.autocast`'s policy, not this list.) You
can set names of classes with their packages. If you don't set this
item, DeepSpeed uses the default list: `[torch.nn.Linear,
torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d]`.

Note that we only maintain FP32 parameters with this feature enabled.
For consistency, you cannot enable `fp16` or `bf16` in DeepSpeed config.

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Fabien Dupont <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: inkcherry <[email protected]>
Signed-off-by: Omar Elayan <[email protected]>
Signed-off-by: Roman Fitzjalen <[email protected]>
Signed-off-by: Hongwei <[email protected]>
Signed-off-by: shaomin <[email protected]>
Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: siqi <[email protected]>
Signed-off-by: Wei Wu <[email protected]>
Signed-off-by: ShellyNR <[email protected]>
Signed-off-by: Lai, Yejing <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Fabien Dupont <[email protected]>
Co-authored-by: Liangliang Ma <[email protected]>
Co-authored-by: inkcherry <[email protected]>
Co-authored-by: Omar Elayan <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Roman Fitzjalen <[email protected]>
Co-authored-by: Ramya Ramineni <[email protected]>
Co-authored-by: Guanhua Wang <[email protected]>
Co-authored-by: root <root@ftqtmec25000000.taxzvufipdhelhupulxcbvr15f.ux.internal.cloudapp.net>
Co-authored-by: Hongwei Chen <[email protected]>
Co-authored-by: Joe Mayer <[email protected]>
Co-authored-by: wukong1992 <[email protected]>
Co-authored-by: shaomin <[email protected]>
Co-authored-by: loadams <[email protected]>
Co-authored-by: siqi654321 <[email protected]>
Co-authored-by: siqi <[email protected]>
Co-authored-by: Wei Wu <[email protected]>
Co-authored-by: Shelly Nahir <[email protected]>
Co-authored-by: snahir <[email protected]>
Co-authored-by: Yejing-Lai <[email protected]>
Co-authored-by: Siddharth Singh <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
deepspeedai#6993 broke many paths in the ZeRO1/2 optimizer. This PR fixes most of
the issues it caused. We still have one failing test in
`unit/runtime/zero`:

```
====================================== short test summary info ======================================
FAILED test_zero.py::TestParamPartitioningSkipInit::test[dtype1] - RuntimeError: mat1 and mat2 must have the same dtype, but got Half and BFloat16
========= 1 failed, 204 passed, 66 skipped, 15 deselected, 5 warnings in 2305.03s (0:38:25) =========
```

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
PR deepspeedai#6993 replaces the flat IPG buffers with a dict maintaining
type-indexed buckets. The member is also renamed from
`_ipg_bucket_flat_buffer` to `ipg_buckets`.

Update the bucket clearing logic in `init_z3` accordingly.

Signed-off-by: Junjie Mao <[email protected]>