
Conversation

@rraminen
Contributor

Relaxing the tolerance values to enable the two unit tests below, with FP16 and BF16 data types, on ROCm:

```
unit/runtime/half_precision/test_fp8.py::TestFp8ComposabilityAcrossZero::test[bf16]
unit/runtime/half_precision/test_fp8.py::TestFp8ComposabilityAcrossZero::test[fp32]
```

```python
if is_rocm_pytorch() and model_dtype == torch.float16:
    rtol, atol = 3e-07, 3e-05
if is_rocm_pytorch() and base_datatype in ["fp16", "bf16"]:
    rtol, atol = 1e-07, 1e-04
```
Contributor

@rraminen Suggest figuring out if this is HW-specific e.g. MI200/MI300?

Contributor Author

@rraminen rraminen Nov 10, 2025

For ROCm, FP8 features are only supported on MI300; this workaround is needed for MI300 onwards.
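
If this ever needs to be gated to MI300-class hardware specifically rather than ROCm in general, a minimal sketch could look like the following; the `gcnArchName` attribute and the gfx94x architecture list are assumptions about ROCm builds of PyTorch, not something this PR adds:

```python
import torch

def is_mi300() -> bool:
    # Best-effort check for an MI300-class ROCm GPU (assumed archs: gfx940/941/942).
    if not torch.cuda.is_available():
        return False
    props = torch.cuda.get_device_properties(0)
    arch = getattr(props, "gcnArchName", "")  # present on ROCm builds (assumption)
    return arch.split(":")[0] in ("gfx940", "gfx941", "gfx942")
```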

Collaborator

I am reviewing.

@rraminen rraminen marked this pull request as draft October 30, 2025 20:04
@rraminen rraminen marked this pull request as ready for review November 10, 2025 17:09
@rraminen rraminen marked this pull request as draft November 14, 2025 17:33
@rraminen rraminen marked this pull request as ready for review November 24, 2025 12:13
rraminen and others added 22 commits December 1, 2025 07:26
Signed-off-by: rraminen <[email protected]>
)

As I was integrating ALST/Ulysses SP into HF Accelerate/Trainer, I noticed that the initial `UlyssesSPAttentionHF.register_with_transformers` API was a bit inflexible/confusing wrt variable seqlen.

This PR deprecates the misleading `max_length` arg name, replaces it
with `seq_length` and makes the latter optional if
`seq_length_is_variable` is True.

Updated tests and docs.
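
A minimal usage sketch of the updated API, based only on the description above; the import path and the remaining call arguments are assumptions:

```python
# Sketch only -- the import path and other arguments are assumed, not verified.
from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPAttentionHF

# Fixed sequence length: pass the new `seq_length` arg (formerly `max_length`).
UlyssesSPAttentionHF.register_with_transformers(..., seq_length=4096)

# Variable sequence length: `seq_length` may now be omitted.
UlyssesSPAttentionHF.register_with_transformers(..., seq_length_is_variable=True)
```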

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: rraminen <[email protected]>
…edai#7645)

This PR fixes an issue in deepspeed/runtime/fp16/fused_optimizer.py
where the gradient overflow handling logic incorrectly exited the
function too early, resulting in wrong forward pass and loss
calculations in certain FP16 training scenarios.

The `return self.overflow` and `self.timers.log(OVERFLOW_TIMERS)` calls
are now correctly moved inside the `if self.overflow:` block so that the
function only returns early when an actual overflow is detected.
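
A simplified before/after sketch of the control flow being described; the names are illustrative, not a verbatim excerpt from `fused_optimizer.py`:

```python
def step_buggy(self):
    self.overflow = self.overflow_checker.has_overflow(self.fp16_groups)
    if self.overflow:
        self._clear_grads()
    self.timers.log(OVERFLOW_TIMERS)
    return self.overflow  # BUG: always exits early, even without an overflow

def step_fixed(self):
    self.overflow = self.overflow_checker.has_overflow(self.fp16_groups)
    if self.overflow:
        self._clear_grads()
        self.timers.log(OVERFLOW_TIMERS)
        return self.overflow  # early exit only on an actual overflow
    # ... unscale, clip, and run the real optimizer step as normal ...
```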

Origin of the error:
deepspeedai@889f0ea

cc: @jithunnair-amd

Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: rraminen <[email protected]>
Make it very clear that `TiledMLP`'s memory saving comes at the cost of recomputing the forward pass.

Signed-off-by: rraminen <[email protected]>
…eepspeedai#7659)

fixes deepspeedai#7650

Adding a `value.dim() > 0` check to prevent slicing of 0-dim tensors.
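
The guard in a minimal hedged sketch; the surrounding slicing helper is illustrative, not the actual call site:

```python
import torch

def shard_value(value: torch.Tensor, start: int, end: int) -> torch.Tensor:
    # 0-dim tensors (scalars) cannot be sliced; return them unchanged.
    if value.dim() > 0:
        return value[start:end]
    return value
```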

cc @sfc-gh-truwase

Signed-off-by: Naveenraj Kamalakannan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: rraminen <[email protected]>
Corrected the `record_function` parameter in the code example from the incorrect `"""):` to `model_forward`.
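
For context, the corrected pattern is presumably the standard `torch.profiler` usage below (a sketch, not the exact doc snippet):

```python
import torch
from torch.profiler import profile, record_function

model = torch.nn.Linear(8, 8)
batch = torch.randn(4, 8)

with profile() as prof:
    with record_function("model_forward"):  # label the forward-pass region
        out = model(batch)
```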

Signed-off-by: kunhee <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: rraminen <[email protected]>
Long overdue

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Signed-off-by: rraminen <[email protected]>
1. `modal-accelerate` now needs `uv` installed explicitly since the image changed to the 2025 one.
2. Moved the accelerate repo cloning into the job, since the original way was incorrect: it was caching some accelerate version and not updating it.
3. Annotated how to actually test the CI when changing the workflow, as `pull_request_target` will not run the updated .py+.yaml files.

---------

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: rraminen <[email protected]>
Add Masahiro's explanation of why that code is there.

---------

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: rraminen <[email protected]>
As we lost the V100s: disable first so that this stops interfering with PRs, then port to Modal.

Signed-off-by: rraminen <[email protected]>
…er (deepspeedai#7658)

This PR allows a separate learning rate for the Muon and Adam parts of the Muon optimizer. Follow-up to deepspeedai#7657
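
A hypothetical sketch of what separate learning rates might look like in a DeepSpeed config; the key names (`adam_lr` in particular) are assumptions for illustration, not the actual interface added here:

```python
# Hypothetical config fragment -- key names are illustrative only.
ds_config = {
    "optimizer": {
        "type": "Muon",
        "params": {
            "lr": 2e-2,       # learning rate for the Muon-updated (matrix) params
            "adam_lr": 3e-4,  # hypothetical key: learning rate for the Adam-updated params
        },
    },
}
```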

Signed-off-by: Guokai Ma <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: rraminen <[email protected]>
`logger` is the wrong facility for `see_mem_usage` messages, since they should always print (if `force=True`) and the current implementation doesn't guarantee that, leading to wasted dev time puzzling over missing output. So fixing it to use `print` instead.

Signed-off-by: rraminen <[email protected]>
When using the param debug utils, handle use cases where the `.ds_*` attributes aren't there.
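
A minimal sketch of the kind of guard this implies; the helper name and reported fields are illustrative assumptions:

```python
import torch

def describe_param(param: torch.Tensor) -> str:
    # Illustrative helper: .ds_* attrs only exist on ZeRO-3 partitioned params.
    if hasattr(param, "ds_id"):
        return f"param ds_id={param.ds_id} ds_shape={getattr(param, 'ds_shape', None)}"
    return f"param shape={tuple(param.shape)} (no .ds_* attributes)"
```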

Signed-off-by: rraminen <[email protected]>
Currently, we get pinned CPU memory even when it's not configured to do so; this PR fixes that.

Signed-off-by: rraminen <[email protected]>
Currently only one Modal CI job is possible across all PRs, which is not workable: all running jobs get cancelled on a new PR or an existing PR update. Fixing the concurrency dependency so that group concurrency works across PRs and valuable resources aren't wasted.

Signed-off-by: Stas Bekman <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Signed-off-by: rraminen <[email protected]>
This PR replaces the ninja detection based on `import ninja` with the
corresponding pytorch utils to detect ninja.

The motivation behind this change is twofold.

1. The detection based on importing the python interface works in the
pip-world, but fails in the conda-world, as the corresponding ninja
conda package does not include the python interface. The implication for detection is
[known](conda-forge/ninja-feedstock#26) and
the recommended solution is to use the `subprocess` module to detect
ninja. This approach is followed by, e.g.,
[meson](https://github.com/mesonbuild/meson-python/blob/1c8092dc477cbc7e1e4d40913608d9daae75f793/mesonpy/__init__.py#L1077-L1088)
and
[pytorch](https://github.com/pytorch/pytorch/blob/d33d125c9413c5043aa5f74fad909a576288242d/torch/utils/cpp_extension.py#L2312-L2325).
As the `subprocess`-based check works in both the pip- and the conda-world, I think it would make sense to switch over.
2. As ninja is only invoked through pytorch, it should be sufficient to
check if pytorch can find ninja. Hence, instead of reimplementing the
`subprocess` check, I think it would be better to use the pytorch utils
(which implement said test anyways).
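
Concretely, the pytorch-utils-based detection reduces to something like this (a sketch of the approach, not the exact diff):

```python
# Detect ninja via pytorch's own utilities instead of `import ninja`.
from torch.utils.cpp_extension import is_ninja_available

def ninja_installed() -> bool:
    # True when the `ninja` binary is usable, regardless of whether the
    # python-interface package is present (pip) or absent (conda).
    return is_ninja_available()
```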

Without this refactor, every conda environment that depends on DeepSpeed
will need to install ninja as a PyPI dependency, even though the conda
version of ninja would be sufficient for the compilation. In my opinion,
this adds unnecessary complexity to these environments.

I tried to keep the changes minimal.

As some additional context, @sdvillal and I stumbled over this issue
while working on packaging aqlaboratory/openfold-3 for conda-forge.

Signed-off-by: Tim Adler <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: rraminen <[email protected]>
The list of AutoTP supported models does not include Qwen2.5, which is already supported. Update the document.

(https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/#supported-models)

---------

Signed-off-by: Ma, Guokai <[email protected]>
Signed-off-by: rraminen <[email protected]>
The SSL certificate of Intel's wheel server has expired. To unblock PRs,
trust `pytorch-extension.intel.com`.

Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: rraminen <[email protected]>
Currently DeepSpeed's backward API has more constraints than PyTorch's normal backward API.
Here is the usage as described in the documentation:
```python
    loss = model_engine(batch)
    model_engine.backward(loss)
```

In this example:
1. Only a (scalar) loss value is accepted.
2. The engine's backward API must be called.

In contrast, in standard PyTorch, you can do:
```python
    output = model(batch)
    output.backward(out_grad)
```

There are several use cases that rely on this flexibility. For example,
combining multiple models or using loss functions defined separately
from the main model.

If you attempt the same pattern with a DeepSpeed engine, some
preprocessing and postprocessing steps will be silently skipped, which
can lead to incorrect results.

The
[document](https://deepspeed.readthedocs.io/en/latest/training.html#jointly-training-models-with-shared-loss)
explains that we can call `_backward_epilogue` manually (possibly
`backward_prologue` as well). However, it's easy for users to miss these
calls, and passing a non-scalar gradient is still not supported.

This PR introduces the same `.backward()` behavior as PyTorch, allowing `.backward()` to be called directly on tensors and supporting non-scalar outputs.

To implement post-backward hooks, we had to use some torch internal
APIs. See
[comments](https://github.com/deepspeedai/DeepSpeed/blob/73f7ff1aab9d1387eb7dd4eca7453a25024533f4/deepspeed/runtime/engine.py#L424)
for more details. When the internal APIs are not available, DeepSpeed
engine only accepts the traditional way `model_engine.backward(loss)`.
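
With the change, the PyTorch-style pattern should work directly on engine outputs; a hedged sketch of the intended usage (`model_engine` and `batch` are illustrative):

```python
import torch

# Illustrative only: assumes `model_engine` came from deepspeed.initialize().
output = model_engine(batch)        # output need not be a scalar loss
out_grad = torch.ones_like(output)  # upstream gradient for the non-scalar output
output.backward(out_grad)           # DeepSpeed's pre/post-backward hooks run automatically
```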

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: rraminen <[email protected]>
@rraminen rraminen force-pushed the relax_tol_testFP8_ROCm branch from 1089fa4 to a7ea3f6 Compare December 1, 2025 07:27
@rraminen rraminen requested a review from jomayeri as a code owner December 1, 2025 07:27
@sfc-gh-truwase sfc-gh-truwase merged commit 28fbb80 into deepspeedai:master Dec 3, 2025
11 checks passed