Contributor

@xylian86 xylian86 commented Dec 3, 2025

This PR fixes #7703.

The root cause is that when swap_tensors is empty, no AIO request is queued. However, aio.wait() still asserts that at least one AIO request is pending, so it is reached with _num_pending_ops == 0 and the assertion reported in the issue fails:

assert(_num_pending_ops > 0);

The fix is to skip aio_handle.wait() when there are no swap tensors queued.
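
A minimal sketch of the guard described above, not the actual diff. `FakeAioHandle`, its `async_pwrite` method, and the `swap_out` helper are hypothetical stand-ins written for illustration; only the names `swap_tensors`, `wait()`, and `_num_pending_ops` come from this PR and the linked issue.

```python
class FakeAioHandle:
    """Hypothetical stand-in for an async I/O handle (not DeepSpeed's real class)."""

    def __init__(self):
        self._num_pending_ops = 0

    def async_pwrite(self, tensor, path):
        # Queue one request; a real handle would submit it to the AIO backend.
        self._num_pending_ops += 1

    def wait(self):
        # Same condition as the assertion in the linked issue:
        # waiting with nothing queued is treated as a bug.
        assert self._num_pending_ops > 0
        completed = self._num_pending_ops
        self._num_pending_ops = 0
        return completed


def swap_out(aio_handle, swap_tensors, swap_paths):
    """Queue one write per tensor, then wait only if something was queued."""
    for tensor, path in zip(swap_tensors, swap_paths):
        aio_handle.async_pwrite(tensor, path)

    # The guard: with no swap tensors, no request was issued, so skip wait().
    if len(swap_tensors) == 0:
        return 0
    return aio_handle.wait()
```

With this guard, an empty `swap_tensors` list returns immediately, while a non-empty list still goes through `wait()` and keeps the pending-ops check intact for genuine misuse.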

Add check for swap_tensors length before assertion.
@sfc-gh-truwase sfc-gh-truwase enabled auto-merge (squash) December 4, 2025 03:10
@sfc-gh-truwase sfc-gh-truwase merged commit 78edd6f into deepspeedai:master Dec 4, 2025
13 of 15 checks passed

Development

Successfully merging this pull request may close these issues.

[BUG] Training LLM through ZeRO-3 with libaio-dev installed fails with Assertion `_num_pending_ops > 0'

2 participants