Skip to content

Conversation

@sfc-gh-sbekman
Copy link
Collaborator

@sfc-gh-sbekman sfc-gh-sbekman commented Aug 9, 2025

The PR that added eval support didn't include SP, so this PR adds it.

It seems that liger-kernel LigerFusedLinearCrossEntropyFunction returns loss=NaN with eval.
I don't see any special code that checks if the function is called in inference mode here
https://github.com/linkedin/Liger-Kernel/blob/65c0ad123e5905208ff11332ba50498308e047bb/src/liger_kernel/ops/fused_linear_cross_entropy.py#L16
unless I'm missing something it always runs backward

So I'm not sure if we need to disable LigerFusedLinearCrossEntropyFunction for eval-mode.

But since we have a fallback to our TiledFusedLogitsLoss I made it inference-friendly here: deepspeedai/DeepSpeed#7477

So to use this PR you need to also use the above deepspeed PR and remove model.type:liger in the config

For Liger-Kernel I opened a detailed issue here #266 - might need to investigate more.


This PR also

  • makes testing_utils.py.execute_subprocess_async work with py-3.12.
  • and fixes how loss summing is done to correctly deal with mbs>1 (avoids /0 error thanks to @yanrui27)

Fixes: #258

Signed-off-by: Stas Bekman <[email protected]>
sfc-gh-truwase added a commit to deepspeedai/DeepSpeed that referenced this pull request Aug 11, 2025
Adding inference support for `TiledFusedLogitsLoss` by skipping
`backward` inside `forward` if the incoming tensor doesn't require grad.

xref: snowflakedb/ArcticTraining#259

---------

Signed-off-by: Stas Bekman <[email protected]>
Co-authored-by: Rui Yan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
LYMDLUT pushed a commit to LYMDLUT/DeepSpeed that referenced this pull request Aug 20, 2025
Adding inference support for `TiledFusedLogitsLoss` by skipping
`backward` inside `forward` if the incoming tensor doesn't require grad.

xref: snowflakedb/ArcticTraining#259

---------

Signed-off-by: Stas Bekman <[email protected]>
Co-authored-by: Rui Yan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: lym <[email protected]>
@sfc-gh-sbekman
Copy link
Collaborator Author

sfc-gh-sbekman commented Aug 20, 2025

Important: Before merging wait for a new ds release and add this requirement deepspeed>=0.17.5 to this PR

@sfc-gh-sbekman
Copy link
Collaborator Author

Thank you for reviewing, Mike!

Signed-off-by: Stas Bekman <[email protected]>
@sfc-gh-sbekman sfc-gh-sbekman enabled auto-merge (squash) August 20, 2025 22:01
@sfc-gh-sbekman sfc-gh-sbekman merged commit 4cd0ff5 into main Aug 20, 2025
4 checks passed
@sfc-gh-sbekman sfc-gh-sbekman deleted the stas/sp-eval branch August 20, 2025 22:09
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
Adding inference support for `TiledFusedLogitsLoss` by skipping
`backward` inside `forward` if the incoming tensor doesn't require grad.

xref: snowflakedb/ArcticTraining#259

---------

Signed-off-by: Stas Bekman <[email protected]>
Co-authored-by: Rui Yan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Eval data loader is empty

3 participants