
Conversation


@tohtana tohtana commented Sep 4, 2025

This PR improves error logging and relaxes loss value checks in the autocast test.

Previously, the test displayed error messages and mismatched loss values on all ranks, even when the mismatch occurred on only some of them. This was confusing, since logs from the other ranks could appear correct. This PR changes the behavior so that error messages are shown only on the ranks where the mismatch occurs.
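
For illustration, here is a minimal sketch of per-rank loss comparison in which only the mismatching ranks raise an error (hypothetical helper names, assuming `torch.distributed`; not the actual test code):

```python
# Minimal sketch, not the actual DeepSpeed test: compare per-step losses on the
# local rank and raise only if *this* rank saw a mismatch, so logs from the
# other ranks stay clean.
import torch
import torch.distributed as dist


def compare_losses_on_this_rank(actual_losses, expected_losses, rtol=1e-5, atol=1e-6):
    rank = dist.get_rank() if dist.is_initialized() else 0
    mismatches = []
    for step, (actual, expected) in enumerate(zip(actual_losses, expected_losses)):
        if not torch.allclose(torch.tensor(actual), torch.tensor(expected),
                              rtol=rtol, atol=atol):
            mismatches.append(
                f"rank {rank}, step {step}: got {actual}, expected {expected}")
    if mismatches:
        # Only the ranks where a mismatch occurred raise an AssertionError.
        raise AssertionError("\n".join(mismatches))
```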

Additionally, this PR skips loss value validation for test_lower_precision_model, where we intentionally use a different communication dtype from the baseline (standard PyTorch autocast).
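
A minimal sketch of the relaxed check (hypothetical flag and helper names; `compare_losses_on_this_rank` refers to the sketch above):

```python
# Minimal sketch with hypothetical names: when a test intentionally uses a
# different communication dtype than the standard PyTorch autocast baseline,
# bit-for-bit loss agreement cannot be expected, so only sanity-check the losses.
import math


def validate_losses(actual_losses, expected_losses, compare_loss_values=True):
    if not compare_loss_values:
        # e.g. test_lower_precision_model: gradients are communicated in a lower
        # precision dtype, so just make sure training produced finite losses.
        assert all(math.isfinite(loss) for loss in actual_losses), "non-finite loss"
        return
    compare_losses_on_this_rank(actual_losses, expected_losses)
```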

Signed-off-by: Masahiro Tanaka <[email protected]>
@tohtana tohtana requested a review from stas00 September 4, 2025 21:58

@stas00 stas00 left a comment


a small suggestion left earlier. Thank you for fixing this, @tohtana

@tohtana tohtana enabled auto-merge (squash) September 5, 2025 06:00
@stas00 stas00 disabled auto-merge September 5, 2025 06:02
@stas00 stas00 enabled auto-merge (squash) September 5, 2025 06:02
@stas00 stas00 merged commit b82ef71 into master Sep 5, 2025
12 checks passed
@stas00 stas00 deleted the tohtana/fix_assert_autocast_test branch September 5, 2025 07:04
Flakes342 pushed a commit to Flakes342/DeepSpeed that referenced this pull request Sep 9, 2025
…edai#7547)

mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
…edai#7547)
