tohtana commented Jun 21, 2025

#6993 broke many paths in the ZeRO1/2 optimizer. This PR fixes most of the issues that PR caused. One error still remains in the tests under `unit/runtime/zero`:

```
====================================== short test summary info ======================================
FAILED test_zero.py::TestParamPartitioningSkipInit::test[dtype1] - RuntimeError: mat1 and mat2 must have the same dtype, but got Half and BFloat16
========= 1 failed, 204 passed, 66 skipped, 15 deselected, 5 warnings in 2305.03s (0:38:25) =========
```

Signed-off-by: Masahiro Tanaka <[email protected]>
tohtana requested a review from tjruwase as a code owner June 21, 2025 20:58
Masahiro Tanaka added 2 commits June 21, 2025 21:02
Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
tohtana commented Jun 21, 2025

I confirmed that #6993 is not related to the above error. The same error occurs with v0.17.1, which does not include #6993. #7377 should address the error.

tohtana merged commit d5f6915 into master Jun 22, 2025
12 checks passed
tohtana deleted the tohtana/fix_zero_bucket branch June 22, 2025 04:24
Antlera pushed a commit to Antlera/DeepSpeed that referenced this pull request Jun 27, 2025
lpnpcs pushed a commit to lpnpcs/DeepSpeed that referenced this pull request Jul 30, 2025
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025