
Conversation

@rraminen
Contributor

Relaxing the tolerance values to enable the two unit tests below, with FP16 and BF16 data types, on ROCm:

```
unit/runtime/half_precision/test_fp8.py::TestFp8ComposabilityAcrossZero::test[bf16]
unit/runtime/half_precision/test_fp8.py::TestFp8ComposabilityAcrossZero::test[fp32]
```

```python
if is_rocm_pytorch() and model_dtype == torch.float16:
    rtol, atol = 3e-07, 3e-05
if is_rocm_pytorch() and base_datatype in ["fp16", "bf16"]:
    rtol, atol = 1e-07, 1e-04
```
Contributor

@rraminen Suggest figuring out if this is HW-specific e.g. MI200/MI300?

Contributor Author

@rraminen rraminen Nov 10, 2025

For ROCm, FP8 features are only supported on MI300; this workaround is needed for MI300 onwards.
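
If this ever needs to be gated to MI300-class hardware specifically rather than ROCm in general, a minimal sketch could look like the following; the `gcnArchName` attribute and the gfx94x architecture list are assumptions about ROCm builds of PyTorch, not something this PR adds:

```python
import torch

def is_mi300() -> bool:
    # Best-effort check for an MI300-class ROCm GPU (assumed archs: gfx940/941/942).
    if not torch.cuda.is_available():
        return False
    props = torch.cuda.get_device_properties(0)
    arch = getattr(props, "gcnArchName", "")  # present on ROCm builds (assumption)
    return arch.split(":")[0] in ("gfx940", "gfx941", "gfx942")
```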

Collaborator

I am reviewing.

@rraminen rraminen marked this pull request as draft October 30, 2025 20:04
@rraminen rraminen marked this pull request as ready for review November 10, 2025 17:09
@rraminen rraminen marked this pull request as draft November 14, 2025 17:33
@rraminen rraminen marked this pull request as ready for review November 24, 2025 12:13
rraminen and others added 22 commits December 1, 2025 07:26
Signed-off-by: rraminen <[email protected]>
)

As I was integrating ALST/Ulysses SP into HF Accelerate/Trainer, I noticed that the initial `UlyssesSPAttentionHF.register_with_transformers` API was a bit inflexible/confusing wrt variable seqlen.

This PR deprecates the misleading `max_length` arg name, replaces it
with `seq_length` and makes the latter optional if
`seq_length_is_variable` is True.

Updated tests and docs.
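
A minimal usage sketch of the updated API, based only on the description above; the import path and the remaining call arguments are assumptions:

```python
# Sketch only -- the import path and other arguments are assumed, not verified.
from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPAttentionHF

# Fixed sequence length: pass the new `seq_length` arg (formerly `max_length`).
UlyssesSPAttentionHF.register_with_transformers(..., seq_length=4096)

# Variable sequence length: `seq_length` may now be omitted.
UlyssesSPAttentionHF.register_with_transformers(..., seq_length_is_variable=True)
```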

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: rraminen <[email protected]>
…edai#7645)

This PR fixes an issue in deepspeed/runtime/fp16/fused_optimizer.py
where the gradient overflow handling logic incorrectly exited the
function too early, resulting in wrong forward pass and loss
calculations in certain FP16 training scenarios.

The `return self.overflow` and `self.timers.log(OVERFLOW_TIMERS)` calls
are now correctly moved inside the `if self.overflow:` block so that the
function only returns early when an actual overflow is detected.
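
A simplified before/after sketch of the control flow being described; the names are illustrative, not a verbatim excerpt from `fused_optimizer.py`:

```python
def step_buggy(self):
    self.overflow = self.overflow_checker.has_overflow(self.fp16_groups)
    if self.overflow:
        self._clear_grads()
    self.timers.log(OVERFLOW_TIMERS)
    return self.overflow  # BUG: always exits early, even without an overflow

def step_fixed(self):
    self.overflow = self.overflow_checker.has_overflow(self.fp16_groups)
    if self.overflow:
        self._clear_grads()
        self.timers.log(OVERFLOW_TIMERS)
        return self.overflow  # early exit only on an actual overflow
    # ... unscale, clip, and run the real optimizer step as normal ...
```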

Origin of the error:
deepspeedai@889f0ea

cc: @jithunnair-amd

Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: rraminen <[email protected]>
Make it very clear that `TiledMLP`'s memory saving comes at the cost of recomputing the forward pass.

Signed-off-by: rraminen <[email protected]>
…eepspeedai#7659)

fixes deepspeedai#7650

Adding a `value.dim() > 0` check to prevent slicing of 0-dim tensors.
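
The guard in a minimal hedged sketch; the surrounding slicing helper is illustrative, not the actual call site:

```python
import torch

def shard_value(value: torch.Tensor, start: int, end: int) -> torch.Tensor:
    # 0-dim tensors (scalars) cannot be sliced; return them unchanged.
    if value.dim() > 0:
        return value[start:end]
    return value
```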

cc @sfc-gh-truwase

Signed-off-by: Naveenraj Kamalakannan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: rraminen <[email protected]>
Corrected the `record_function` parameter in the code example from the incorrect `"""):` to `model_forward`.
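
For context, the corrected pattern is presumably the standard `torch.profiler` usage below (a sketch, not the exact doc snippet):

```python
import torch
from torch.profiler import profile, record_function

model = torch.nn.Linear(8, 8)
batch = torch.randn(4, 8)

with profile() as prof:
    with record_function("model_forward"):  # label the forward-pass region
        out = model(batch)
```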

Signed-off-by: kunhee <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: rraminen <[email protected]>
Long overdue

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Signed-off-by: rraminen <[email protected]>
1. `modal-accelerate` now needs `uv` installed explicitly since the image changed to the 2025 one.
2. Moved the accelerate repo cloning into the job, since the original way was incorrect: it was caching some accelerate version and not updating it.
3. Annotated how to actually test the CI when changing the workflow, as `pull_request_target` will not run the updated .py+.yaml files.

---------

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: rraminen <[email protected]>
Add Masahiro's explanation of why that code is there.

---------

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: rraminen <[email protected]>
As we lost the V100s: disable first so that this stops interfering with PRs, then port to Modal.

Signed-off-by: rraminen <[email protected]>
…er (deepspeedai#7658)

This PR allows a separate learning rate for the Muon and Adam parts of the Muon optimizer. Follow-up to deepspeedai#7657
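
A hypothetical sketch of what separate learning rates might look like in a DeepSpeed config; the key names (`adam_lr` in particular) are assumptions for illustration, not the actual interface added here:

```python
# Hypothetical config fragment -- key names are illustrative only.
ds_config = {
    "optimizer": {
        "type": "Muon",
        "params": {
            "lr": 2e-2,       # learning rate for the Muon-updated (matrix) params
            "adam_lr": 3e-4,  # hypothetical key: learning rate for the Adam-updated params
        },
    },
}
```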

Signed-off-by: Guokai Ma <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: rraminen <[email protected]>
`logger` is the wrong facility for `see_mem_usage` messages, since they should always print (if `force=True`) and the current implementation doesn't guarantee that, leading to wasted dev time puzzling over missing output. So fixing it to use `print` instead.

Signed-off-by: rraminen <[email protected]>
When using the param debug utils, handle use cases where the `.ds_*` attributes aren't there.
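
A minimal sketch of the kind of guard this implies; the helper name and reported fields are illustrative assumptions:

```python
import torch

def describe_param(param: torch.Tensor) -> str:
    # Illustrative helper: .ds_* attrs only exist on ZeRO-3 partitioned params.
    if hasattr(param, "ds_id"):
        return f"param ds_id={param.ds_id} ds_shape={getattr(param, 'ds_shape', None)}"
    return f"param shape={tuple(param.shape)} (no .ds_* attributes)"
```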

Signed-off-by: rraminen <[email protected]>
Currently, we get pinned CPU memory even when it's not configured to do so; this PR fixes that.

Signed-off-by: rraminen <[email protected]>
Currently only one Modal CI job is possible across all PRs, which is not workable: all running jobs get cancelled on a new PR or an existing PR update. Fixing the concurrency dependency so that group concurrency works across PRs and valuable resources aren't wasted.

Signed-off-by: Stas Bekman <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Signed-off-by: rraminen <[email protected]>
This PR replaces the ninja detection based on `import ninja` with the
corresponding pytorch utils to detect ninja.

The motivation behind this change is twofold.

1. The detection based on importing the python interface works in the
pip-world, but fails in the conda-world, as the corresponding ninja
conda package does not include the python interface. The implication for detection is
[known](conda-forge/ninja-feedstock#26) and
the recommended solution is to use the `subprocess` module to detect
ninja. This approach is followed by, e.g.,
[meson](https://github.com/mesonbuild/meson-python/blob/1c8092dc477cbc7e1e4d40913608d9daae75f793/mesonpy/__init__.py#L1077-L1088)
and
[pytorch](https://github.com/pytorch/pytorch/blob/d33d125c9413c5043aa5f74fad909a576288242d/torch/utils/cpp_extension.py#L2312-L2325).
As the `subprocess`-based check works in both the pip- and the conda-world, I think it would make sense to switch over.
2. As ninja is only invoked through pytorch, it should be sufficient to
check if pytorch can find ninja. Hence, instead of reimplementing the
`subprocess` check, I think it would be better to use the pytorch utils
(which implement said test anyways).
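
Concretely, the pytorch-utils-based detection reduces to something like this (a sketch of the approach, not the exact diff):

```python
# Detect ninja via pytorch's own utilities instead of `import ninja`.
from torch.utils.cpp_extension import is_ninja_available

def ninja_installed() -> bool:
    # True when the `ninja` binary is usable, regardless of whether the
    # python-interface package is present (pip) or absent (conda).
    return is_ninja_available()
```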

Without this refactor, every conda environment that depends on DeepSpeed
will need to install ninja as a PyPI dependency, even though the conda
version of ninja would be sufficient for the compilation. In my opinion,
this adds unnecessary complexity to these environments.

I tried to keep the changes minimal.

As some additional context, @sdvillal and I stumbled over this issue
while working on packaging aqlaboratory/openfold-3 for conda-forge.

Signed-off-by: Tim Adler <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: rraminen <[email protected]>
The list of AutoTP supported models does not include Qwen2.5, which is already supported. Update the document.

(https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/#supported-models)

---------

Signed-off-by: Ma, Guokai <[email protected]>
Signed-off-by: rraminen <[email protected]>
The SSL certificate of Intel's wheel server has expired. To unblock PRs,
trust `pytorch-extension.intel.com`.

Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: rraminen <[email protected]>
Currently DeepSpeed's backward API has more constraints than PyTorch's normal backward API.
Here is the usage as described in the documentation:
```python
    loss = model_engine(batch)
    model_engine.backward(loss)
```

In this example:
1. Only a (scalar) loss value is accepted.
2. The engine's backward API must be called.

In contrast, in standard PyTorch, you can do:
```python
    output = model(batch)
    output.backward(out_grad)
```

There are several use cases that rely on this flexibility. For example,
combining multiple models or using loss functions defined separately
from the main model.

If you attempt the same pattern with a DeepSpeed engine, some
preprocessing and postprocessing steps will be silently skipped, which
can lead to incorrect results.

The
[document](https://deepspeed.readthedocs.io/en/latest/training.html#jointly-training-models-with-shared-loss)
explains that we can call `_backward_epilogue` manually (possibly
`backward_prologue` as well). However, it's easy for users to miss these
calls, and passing a non-scalar gradient is still not supported.

This PR introduces the same `.backward()` behavior as PyTorch, allowing `.backward()` to be called directly on tensors and supporting non-scalar outputs.

To implement post-backward hooks, we had to use some torch internal
APIs. See
[comments](https://github.com/deepspeedai/DeepSpeed/blob/73f7ff1aab9d1387eb7dd4eca7453a25024533f4/deepspeed/runtime/engine.py#L424)
for more details. When the internal APIs are not available, DeepSpeed
engine only accepts the traditional way `model_engine.backward(loss)`.
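
With the change, the PyTorch-style pattern should work directly on engine outputs; a hedged sketch of the intended usage (`model_engine` and `batch` are illustrative):

```python
import torch

# Illustrative only: assumes `model_engine` came from deepspeed.initialize().
output = model_engine(batch)        # output need not be a scalar loss
out_grad = torch.ones_like(output)  # upstream gradient for the non-scalar output
output.backward(out_grad)           # DeepSpeed's pre/post-backward hooks run automatically
```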

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: rraminen <[email protected]>
@rraminen rraminen force-pushed the relax_tol_testFP8_ROCm branch from 1089fa4 to a7ea3f6 Compare December 1, 2025 07:27
@rraminen rraminen requested a review from jomayeri as a code owner December 1, 2025 07:27
@sfc-gh-truwase sfc-gh-truwase merged commit 28fbb80 into deepspeedai:master Dec 3, 2025
11 checks passed