@alexk101
Contributor

In `comms_logging.py`, when `log_all` is called with the `show_straggler` option enabled, an all_reduce is performed across all nodes to calculate the minimum latency and find stragglers. However, the tensors used in that all_reduce are not moved to the configured device. This commit adds that step using DeepSpeed's abstract accelerator API.
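To make the straggler calculation concrete, here is a minimal pure-Python sketch of what the MIN all_reduce is used for. It is a simulation only: no DeepSpeed or torch calls are made, the function names and latency values are hypothetical, and the "straggler time = own latency minus the cross-rank minimum" formula is an assumption about how the minimum is used, not a quote of the library code. In the real code path the latency tensors would also have to live on the accelerator device (for example the device string returned by `get_accelerator().current_device_name()`), which is the part this PR addresses and which a pure-Python simulation cannot show.

```python
# Pure-Python simulation of the straggler calculation.
# Assumption: straggler time is a rank's latency minus the minimum
# latency observed across all ranks for the same call.


def all_reduce_min(per_rank_lats):
    """Simulate an all_reduce with a MIN op: element-wise minimum across ranks."""
    return [min(vals) for vals in zip(*per_rank_lats)]


def straggler_summary(per_rank_lats, rank):
    """Return (total_min_latency, total_straggler_time) as seen by one rank."""
    min_lats = all_reduce_min(per_rank_lats)
    lats = per_rank_lats[rank]
    total_lat = sum(min_lats)
    total_straggler = sum(own - fastest for own, fastest in zip(lats, min_lats))
    return total_lat, total_straggler


# Hypothetical latencies (ms) for three calls recorded on two ranks.
latencies = [
    [1.0, 2.0, 3.0],  # rank 0
    [1.5, 2.0, 5.0],  # rank 1
]

summary_rank0 = straggler_summary(latencies, rank=0)
summary_rank1 = straggler_summary(latencies, rank=1)
```

In this made-up data, rank 0 is always the fastest, so its straggler time is zero, while rank 1 accumulates the time it spent waiting beyond the fastest rank.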

Resolves #7397

Signed-off-by: Alex Kiefer <[email protected]>
Collaborator

@tohtana left a comment

Great catch! Thank you @alexk101!

@tohtana enabled auto-merge (squash) June 28, 2025 01:10
@tohtana merged commit 4c687bf into deepspeedai:master Jun 28, 2025
9 checks passed
lpnpcs pushed a commit to lpnpcs/DeepSpeed that referenced this pull request Jul 30, 2025
Resolves deepspeedai#7397

Signed-off-by: Alex Kiefer <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
Linked issue: [BUG] Comms Logging Straggler All Reduce
