
Conversation


@tohtana tohtana commented Sep 4, 2025

This PR improves error logging and relaxes loss value checks in the autocast test.

Previously, the test displayed error messages and mismatched loss values on all ranks, even when the mismatch occurred on only some of them. This was confusing, since logs from the other ranks could appear correct. This PR changes the behavior so that error messages are shown only on the ranks where the mismatch occurs.
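
For illustration, here is a minimal sketch of per-rank loss comparison in which only the mismatching ranks raise an error (hypothetical helper names, assuming `torch.distributed`; not the actual test code):

```python
# Minimal sketch, not the actual DeepSpeed test: compare per-step losses on the
# local rank and raise only if *this* rank saw a mismatch, so logs from the
# other ranks stay clean.
import torch
import torch.distributed as dist


def compare_losses_on_this_rank(actual_losses, expected_losses, rtol=1e-5, atol=1e-6):
    rank = dist.get_rank() if dist.is_initialized() else 0
    mismatches = []
    for step, (actual, expected) in enumerate(zip(actual_losses, expected_losses)):
        if not torch.allclose(torch.tensor(actual), torch.tensor(expected),
                              rtol=rtol, atol=atol):
            mismatches.append(
                f"rank {rank}, step {step}: got {actual}, expected {expected}")
    if mismatches:
        # Only the ranks where a mismatch occurred raise an AssertionError.
        raise AssertionError("\n".join(mismatches))
```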

Additionally, this PR skips loss value validation for test_lower_precision_model, where we intentionally use a different communication dtype from the baseline (standard PyTorch autocast).
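
A minimal sketch of the relaxed check (hypothetical flag and helper names; `compare_losses_on_this_rank` refers to the sketch above):

```python
# Minimal sketch with hypothetical names: when a test intentionally uses a
# different communication dtype than the standard PyTorch autocast baseline,
# bit-for-bit loss agreement cannot be expected, so only sanity-check the losses.
import math


def validate_losses(actual_losses, expected_losses, compare_loss_values=True):
    if not compare_loss_values:
        # e.g. test_lower_precision_model: gradients are communicated in a lower
        # precision dtype, so just make sure training produced finite losses.
        assert all(math.isfinite(loss) for loss in actual_losses), "non-finite loss"
        return
    compare_losses_on_this_rank(actual_losses, expected_losses)
```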

Signed-off-by: Masahiro Tanaka <[email protected]>
@tohtana tohtana requested a review from stas00 September 4, 2025 21:58

@stas00 stas00 left a comment


a small suggestion left earlier. Thank you for fixing this, @tohtana

@tohtana tohtana enabled auto-merge (squash) September 5, 2025 06:00
@stas00 stas00 disabled auto-merge September 5, 2025 06:02
@stas00 stas00 enabled auto-merge (squash) September 5, 2025 06:02
@stas00 stas00 merged commit b82ef71 into master Sep 5, 2025
12 checks passed
@stas00 stas00 deleted the tohtana/fix_assert_autocast_test branch September 5, 2025 07:04
Flakes342 pushed a commit to Flakes342/DeepSpeed that referenced this pull request Sep 9, 2025
…edai#7547)

mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
…edai#7547)
