[SP] make eval work #259

sfc-gh-sbekman · 2025-08-09T00:51:00Z

The PR that added eval support didn't include SP, so this PR adds it.

It seems that liger-kernel's `LigerFusedLinearCrossEntropyFunction` returns loss=NaN in eval mode.
I don't see any code that checks whether the function is called in inference mode here:
https://github.com/linkedin/Liger-Kernel/blob/65c0ad123e5905208ff11332ba50498308e047bb/src/liger_kernel/ops/fused_linear_cross_entropy.py#L16
Unless I'm missing something, it always runs backward.

So I'm not sure whether we need to disable `LigerFusedLinearCrossEntropyFunction` for eval mode.

But since we have a fallback to our `TiledFusedLogitsLoss`, I made it inference-friendly here: deepspeedai/DeepSpeed#7477
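The idea of skipping `backward` inside `forward` can be illustrated without any framework. The following is a minimal sketch under stated assumptions, not the DeepSpeed implementation: the function `tiled_squared_error`, its scalar model, and the explicit `requires_grad` flag are hypothetical stand-ins for the tensor-based `TiledFusedLogitsLoss`, which would instead look at whether the incoming tensor requires grad.

```python
from typing import Optional, Sequence, Tuple


def tiled_squared_error(
    weight: float,
    inputs: Sequence[float],
    targets: Sequence[float],
    tile_size: int,
    requires_grad: bool,
) -> Tuple[float, Optional[float]]:
    """Compute loss = sum((weight * x - y) ** 2) one tile at a time.

    Hypothetical illustration: when requires_grad is False (inference/eval),
    the per-tile gradient work is skipped and None is returned for the gradient.
    """
    loss_sum = 0.0
    # Only allocate a gradient accumulator when a gradient is actually wanted.
    grad: Optional[float] = 0.0 if requires_grad else None

    for start in range(0, len(inputs), tile_size):
        tile_x = inputs[start:start + tile_size]
        tile_y = targets[start:start + tile_size]
        for x, y in zip(tile_x, tile_y):
            err = weight * x - y
            loss_sum += err * err
            if requires_grad:
                # d/d(weight) of (weight * x - y) ** 2 is 2 * err * x
                grad += 2.0 * err * x

    return loss_sum, grad


# Training-style call: loss and gradient are both produced.
train_loss, train_grad = tiled_squared_error(2.0, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0], 2, True)

# Eval-style call: same loss, no gradient computed.
eval_loss, eval_grad = tiled_squared_error(2.0, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0], 2, False)
```

Both calls return the same loss; only the training-style call does the gradient work.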

So to use this PR you also need the above DeepSpeed PR, and you need to remove `model.type:liger` from the config.

For Liger-Kernel I opened a detailed issue here: #266. It might need more investigation.

This PR also:

- makes `testing_utils.py`'s `execute_subprocess_async` work with Python 3.12
- fixes how loss summing is done so that it correctly handles mbs>1 (avoids a division-by-zero error, thanks to @yanrui27)
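To make the division-by-zero case concrete, here is a minimal sketch of one way to combine per-shard mean losses when some shards contain no trainable tokens. It is an assumption about the general technique, not the code in this PR: `combine_shard_losses` and `count_valid_tokens` are hypothetical helpers, and `-100` is used as the ignore label following the PyTorch cross-entropy default.

```python
from typing import List, Sequence

IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross entropy


def count_valid_tokens(labels: Sequence[Sequence[int]]) -> int:
    """Count labels that contribute to the loss in a [mbs, seq_shard] batch."""
    return sum(1 for row in labels for tok in row if tok != IGNORE_INDEX)


def combine_shard_losses(
    shard_losses: Sequence[float],
    shard_labels: Sequence[Sequence[Sequence[int]]],
) -> float:
    """Token-weighted average of per-shard mean losses.

    Shards with zero valid tokens are skipped, so a NaN mean from an
    all-ignored shard never enters the sum, and the final division is
    guarded against a zero token count.
    """
    total_loss = 0.0
    total_tokens = 0
    for loss, labels in zip(shard_losses, shard_labels):
        n_valid = count_valid_tokens(labels)
        if n_valid == 0:
            continue  # nothing to learn from this shard
        total_loss += loss * n_valid
        total_tokens += n_valid
    if total_tokens == 0:
        return 0.0  # hypothetical choice: report zero loss for an empty batch
    return total_loss / total_tokens


# mbs=2: the second shard has only ignored labels, so its NaN loss is skipped.
labels_a: List[List[int]] = [[1, 2], [IGNORE_INDEX, 3]]
labels_b: List[List[int]] = [[IGNORE_INDEX, IGNORE_INDEX], [IGNORE_INDEX, IGNORE_INDEX]]
combined = combine_shard_losses([2.0, float("nan")], [labels_a, labels_b])
```

Weighting by the number of valid tokens rather than averaging the shard means directly keeps the result independent of how the sequence was split.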

Fixes: #258

Signed-off-by: Stas Bekman <[email protected]>

Adding inference support for `TiledFusedLogitsLoss` by skipping `backward` inside `forward` if the incoming tensor doesn't require grad.

xref: snowflakedb/ArcticTraining#259

Signed-off-by: Stas Bekman <[email protected]>
Co-authored-by: Rui Yan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>

sfc-gh-sbekman · 2025-08-20T18:33:27Z

Important: before merging, wait for a new DeepSpeed release and add the requirement `deepspeed>=0.17.5` to this PR.
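For anyone who wants to verify the requirement at runtime, the check below is a minimal stdlib-only sketch. The helpers `numeric_version`, `meets_minimum`, and `deepspeed_is_recent_enough` are hypothetical names, not part of ArcticTraining or DeepSpeed, and the comparison deliberately ignores pre-release and local suffixes.

```python
import re
from importlib import metadata
from typing import Tuple


def numeric_version(version: str) -> Tuple[int, ...]:
    """Extract the leading dotted-number part, e.g. '0.17.5+cu121' -> (0, 17, 5)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numeric components only; suffixes such as '.dev0' are ignored."""
    a = numeric_version(installed)
    b = numeric_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '0.18' compares like '0.18.0'.
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


def deepspeed_is_recent_enough(minimum: str = "0.17.5") -> bool:
    """Return False if deepspeed is missing or older than the minimum."""
    try:
        installed = metadata.version("deepspeed")
    except metadata.PackageNotFoundError:
        return False
    return meets_minimum(installed, minimum)
```

In a real project the pin would normally live in the package requirements; this check is only useful for a clearer error message at startup.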

sfc-gh-sbekman · 2025-08-20T21:21:50Z

Thank you for reviewing, Mike!

[SP] make eval work

819d566

Signed-off-by: Stas Bekman <[email protected]>

sfc-gh-sbekman requested a review from sfc-gh-jrasley as a code owner August 9, 2025 00:51

sfc-gh-sbekman mentioned this pull request Aug 9, 2025

Eval data loader is empty #258

Closed

stas00 mentioned this pull request Aug 9, 2025

[TiledFusedLogitsLoss] support inference deepspeedai/DeepSpeed#7477

Merged

sfc-gh-sbekman mentioned this pull request Aug 9, 2025

fix(loss): Prevent division by zero in tiled loss calculation #257

Closed

fix for breaking ordering of chat datasets

07a29b6

sfc-gh-sbekman added 2 commits August 20, 2025 18:04

wip

825f58a

Signed-off-by: Stas Bekman <[email protected]>

Merge branch 'main' into stas/sp-eval

c885e9e

clean exception for liger-kernel

c168516

Signed-off-by: Stas Bekman <[email protected]>

sfc-gh-mwyatt approved these changes Aug 20, 2025

View reviewed changes

Merge branch 'main' into stas/sp-eval

2444602

new ds release

b85e2d0

Signed-off-by: Stas Bekman <[email protected]>

sfc-gh-sbekman enabled auto-merge (squash) August 20, 2025 22:01

sfc-gh-sbekman merged commit 4cd0ff5 into main Aug 20, 2025
4 checks passed

sfc-gh-sbekman deleted the stas/sp-eval branch August 20, 2025 22:09

stas00 mentioned this pull request Sep 9, 2025

[ALST tutorial] support bs>1 deepspeedai/DeepSpeed#7550

Merged
