
Conversation

@xylian86
Contributor

What this PR does

  • This PR fixes an occasional deadlock / hang when using DeepSpeed Async I/O (AIO) for NVMe swap-in/swap-out.
  • The hang occurs inside aio_handle.wait(), where training can stall indefinitely.

Reproduction
ds_config.json
finetune_zero3.py

Steps

  1. Replace `{NVME_PATH}` in `ds_config.json` with a valid NVMe mount path on your cluster.
  2. Build/install DeepSpeed with AIO enabled: `DS_BUILD_AIO=1 pip install --no-build-isolation .`
  3. Run: `CUDA_VISIBLE_DEVICES=0 deepspeed finetune_zero3.py`

Fix:
Release the Python GIL while aio_handle.wait() is blocking by adding a pybind11 call guard (py::gil_scoped_release) to the wait() binding.
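For context, the sketch below shows how a blocking method is bound with that call guard. It is a minimal, self-contained illustration: demo_aio_handle_t and the module name aio_sketch are hypothetical stand-ins, not the DeepSpeed sources; the actual change to the deepspeed_aio_handle_t::wait binding is excerpted in the diff further down.

```cpp
#include <pybind11/pybind11.h>
namespace py = pybind11;

// Hypothetical stand-in for an AIO handle with a blocking wait().
struct demo_aio_handle_t {
    int wait() { /* block until all queued I/O operations complete */ return 0; }
};

PYBIND11_MODULE(aio_sketch, m) {
    py::class_<demo_aio_handle_t>(m, "aio_handle")
        .def(py::init<>())
        .def("wait",
             &demo_aio_handle_t::wait,
             "Wait for (ongoing) asynchronous operations to complete",
             // Release the GIL for the duration of wait(), so AIO worker
             // threads that need the GIL for cleanup can acquire it.
             py::call_guard<py::gil_scoped_release>());
}
```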

Why this is needed (root cause)
Two threads are involved:

  • Python main thread: calls aio_handle.wait() and blocks until all async I/O operations complete.
  • AIO worker thread(s): perform the actual file I/O in the background.

In some cases, after an I/O operation completes, the worker thread triggers cleanup of PyTorch tensors (e.g., decref / refcount updates for Python-backed objects). That cleanup path may require acquiring the Python GIL.
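To illustrate that worker-side pattern (illustrative only, not the actual DeepSpeed I/O code; release_after_io and pending are hypothetical names), dropping a Python-backed buffer from a worker thread looks roughly like this:

```cpp
#include <pybind11/pybind11.h>
namespace py = pybind11;

// Illustrative shape of a worker-thread cleanup that needs the GIL.
// `pending` stands in for a Python-backed tensor kept alive for the I/O op.
void release_after_io(py::object& pending) {
    // Dropping the reference updates a Python refcount, so the worker must
    // hold the GIL first. If the main thread keeps the GIL while blocked in
    // wait(), this acquire never returns.
    py::gil_scoped_acquire gil;
    pending = py::object();  // refcount decrement happens under the GIL
}
```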

Before this PR:

  • The Python main thread enters aio_handle.wait() while still holding the GIL.
  • wait() blocks, waiting for the worker thread(s) to finish.
  • A worker thread completes an I/O op and reaches a cleanup path that attempts to acquire the GIL.
  • The worker thread cannot acquire the GIL because it is held by the Python thread blocked in wait().
  • Result: the Python thread is waiting for the worker, and the worker is waiting for the GIL → deadlock.
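py::call_guard<py::gil_scoped_release>() breaks this cycle: it releases the GIL before the C++ wait() runs and re-acquires it when the call returns. A hand-written equivalent of what the guard does (reusing the hypothetical demo_aio_handle_t from the sketch above) would be:

```cpp
// Hand-written equivalent of the call guard, for illustration only.
int wait_without_holding_gil(demo_aio_handle_t& self) {
    py::gil_scoped_release no_gil;  // main thread drops the GIL before blocking...
    return self.wait();             // ...so a worker that needs the GIL can proceed
}                                   // the GIL is re-acquired when no_gil is destroyed
```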

@xylian86 xylian86 requested a review from tjruwase as a code owner December 16, 2025 01:12
```diff
      &deepspeed_aio_handle_t::wait,
-     "Wait for (ongoing) asynchronous operations to complete");
+     "Wait for (ongoing) asynchronous operations to complete",
+     py::call_guard<py::gil_scoped_release>());
```
Collaborator

Are there any BC concerns with Python versions?

Contributor Author

py::call_guard<py::gil_scoped_release>() requires pybind11 ≥ 2.2.0 (added in v2.2.0, released 2017-08-31, per the changelog), so this should be fine; that requirement is old relative to current pybind11 releases, and there is no Python-version BC concern.
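(Not part of this PR, but for completeness: if one wanted to assert that requirement at build time, pybind11's version macros allow a guard along these lines.)

```cpp
#include <pybind11/pybind11.h>

// Illustrative build-time guard: py::call_guard<> was added in pybind11 2.2.0.
static_assert(PYBIND11_VERSION_MAJOR > 2 ||
                  (PYBIND11_VERSION_MAJOR == 2 && PYBIND11_VERSION_MINOR >= 2),
              "py::call_guard<> requires pybind11 >= 2.2.0");
```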

@sfc-gh-truwase sfc-gh-truwase merged commit 2bc16e2 into deepspeedai:master Dec 18, 2025
11 checks passed
Rakshit-gen pushed a commit to Rakshit-gen/DeepSpeed that referenced this pull request Dec 19, 2025