
Conversation

@eternalNight
Contributor

The following assertion error arises when torch autocast is enabled.

[rank3]:   File "/opt/deepspeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 337, in fetch_sub_module
[rank3]:     self.__inflight_param_registry.pop(param).wait(handle_dependency=not fast_fetch)
[rank3]:   File "/opt/deepspeed/deepspeed/runtime/zero/partition_parameters.py", line 787, in wait
[rank3]:     handle.wait(handle_dependency)
[rank3]:   File "/opt/deepspeed/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/opt/deepspeed/deepspeed/runtime/zero/partition_parameters.py", line 750, in wait
[rank3]:     assert param.ds_status == ZeroParamStatus.INFLIGHT, f"expected param {param.ds_summary()} to be inflight"
[rank3]: AssertionError: expected param {'id': 685, 'status': 'AVAILABLE', 'numel': 131334144, 'ds_numel': 131334144, 'shape': (32064, 4096), 'ds_shape': (32064, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([16416768])} to be inflight

This is caused by multiple all-gather ops in the same coalesced all-gather sharing the same list of params (of mixed dtypes): once the first handle completes, the shared params are already marked AVAILABLE, so waiting on a later handle trips the INFLIGHT assertion above.

The fix: make each all-gather exchange only params of a single dtype, and pass an all-gather dtype that matches those params.
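
A minimal sketch of the grouping idea, for illustration only. launch_allgather is a hypothetical stand-in for whatever issues a coalesced all-gather over a batch of partitioned params and returns a handle; it is not DeepSpeed's actual API.

from collections import defaultdict

def allgather_by_dtype(params, launch_allgather):
    """Issue one all-gather per dtype instead of one over mixed dtypes.

    launch_allgather(group, dtype) is a hypothetical launcher whose returned
    handle, on wait(), marks exactly the params in `group` as AVAILABLE.
    """
    # Bucket params so every all-gather carries a single, consistent dtype.
    groups = defaultdict(list)
    for p in params:
        groups[p.dtype].append(p)

    handles = []
    for dtype, group in groups.items():
        # Handles now own disjoint param sets, so no param is transitioned
        # INFLIGHT -> AVAILABLE twice by different handles, and the
        # communication buffer dtype matches the params it carries.
        handles.append(launch_allgather(group, dtype=dtype))
    return handles

Grouping this way also makes it straightforward to pass the matching all-gather dtype per group, which is the second half of the change described above.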


Signed-off-by: Junjie Mao <[email protected]>
@tohtana tohtana self-assigned this Aug 1, 2025
@tohtana tohtana merged commit 0aff6b2 into deepspeedai:master Aug 3, 2025
9 checks passed
LYMDLUT pushed a commit to LYMDLUT/DeepSpeed that referenced this pull request Aug 20, 2025
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025