
Conversation

@xylian86
Contributor

What this PR does

  • This PR fixes an occasional deadlock / hang when using DeepSpeed Async I/O (AIO) for NVMe swap-in/swap-out.
  • The hang occurs inside aio_handle.wait(), where training can stall indefinitely.

Reproduction
ds_config.json
finetune_zero3.py

Steps

  1. Replace `{NVME_PATH}` in `ds_config.json` with a valid NVMe mount path on your cluster.
  2. Build/install DeepSpeed with AIO enabled: `DS_BUILD_AIO=1 pip install --no-build-isolation .`
  3. Run: `CUDA_VISIBLE_DEVICES=0 deepspeed finetune_zero3.py`

Fix:
Release the Python GIL while aio_handle.wait() is blocking by adding a pybind11 call guard (py::gil_scoped_release) to the wait() binding.
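For context, the sketch below shows how a blocking method is bound with that call guard. It is a minimal, self-contained illustration: demo_aio_handle_t and the module name aio_sketch are hypothetical stand-ins, not the DeepSpeed sources; the actual change to the deepspeed_aio_handle_t::wait binding is excerpted in the diff further down.

```cpp
#include <pybind11/pybind11.h>
namespace py = pybind11;

// Hypothetical stand-in for an AIO handle with a blocking wait().
struct demo_aio_handle_t {
    int wait() { /* block until all queued I/O operations complete */ return 0; }
};

PYBIND11_MODULE(aio_sketch, m) {
    py::class_<demo_aio_handle_t>(m, "aio_handle")
        .def(py::init<>())
        .def("wait",
             &demo_aio_handle_t::wait,
             "Wait for (ongoing) asynchronous operations to complete",
             // Release the GIL for the duration of wait(), so AIO worker
             // threads that need the GIL for cleanup can acquire it.
             py::call_guard<py::gil_scoped_release>());
}
```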

Why this is needed (root cause)
Two threads are involved:

  • Python main thread: calls aio_handle.wait() and blocks until all async I/O operations complete.
  • AIO worker thread(s): perform the actual file I/O in the background.

In some cases, after an I/O operation completes, the worker thread triggers cleanup of PyTorch tensors (e.g., decref / refcount updates for Python-backed objects). That cleanup path may require acquiring the Python GIL.
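To illustrate that worker-side pattern (illustrative only, not the actual DeepSpeed I/O code; release_after_io and pending are hypothetical names), dropping a Python-backed buffer from a worker thread looks roughly like this:

```cpp
#include <pybind11/pybind11.h>
namespace py = pybind11;

// Illustrative shape of a worker-thread cleanup that needs the GIL.
// `pending` stands in for a Python-backed tensor kept alive for the I/O op.
void release_after_io(py::object& pending) {
    // Dropping the reference updates a Python refcount, so the worker must
    // hold the GIL first. If the main thread keeps the GIL while blocked in
    // wait(), this acquire never returns.
    py::gil_scoped_acquire gil;
    pending = py::object();  // refcount decrement happens under the GIL
}
```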

Before this PR:

  • The Python main thread enters aio_handle.wait() while still holding the GIL.
  • wait() blocks, waiting for the worker thread(s) to finish.
  • A worker thread completes an I/O op and reaches a cleanup path that attempts to acquire the GIL.
  • The worker thread cannot acquire the GIL because it is held by the Python thread blocked in wait().
  • Result: the Python thread is waiting for the worker, and the worker is waiting for the GIL → deadlock.
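py::call_guard<py::gil_scoped_release>() breaks this cycle: it releases the GIL before the C++ wait() runs and re-acquires it when the call returns. A hand-written equivalent of what the guard does (reusing the hypothetical demo_aio_handle_t from the sketch above) would be:

```cpp
// Hand-written equivalent of the call guard, for illustration only.
int wait_without_holding_gil(demo_aio_handle_t& self) {
    py::gil_scoped_release no_gil;  // main thread drops the GIL before blocking...
    return self.wait();             // ...so a worker that needs the GIL can proceed
}                                   // the GIL is re-acquired when no_gil is destroyed
```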

@xylian86 xylian86 requested a review from tjruwase as a code owner December 16, 2025 01:12
```diff
      &deepspeed_aio_handle_t::wait,
-     "Wait for (ongoing) asynchronous operations to complete");
+     "Wait for (ongoing) asynchronous operations to complete",
+     py::call_guard<py::gil_scoped_release>());
```
Collaborator

Are there any BC concerns with Python versions?

Contributor Author

py::call_guard<py::gil_scoped_release>() requires pybind11 ≥ 2.2.0 (added in v2.2.0, released 2017-08-31, per the changelog), so this should be fine; that requirement is old relative to current pybind11 releases, and there is no Python-version BC concern.
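(Not part of this PR, but for completeness: if one wanted to assert that requirement at build time, pybind11's version macros allow a guard along these lines.)

```cpp
#include <pybind11/pybind11.h>

// Illustrative build-time guard: py::call_guard<> was added in pybind11 2.2.0.
static_assert(PYBIND11_VERSION_MAJOR > 2 ||
                  (PYBIND11_VERSION_MAJOR == 2 && PYBIND11_VERSION_MINOR >= 2),
              "py::call_guard<> requires pybind11 >= 2.2.0");
```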

@sfc-gh-truwase sfc-gh-truwase merged commit 2bc16e2 into deepspeedai:master Dec 18, 2025
11 checks passed
Rakshit-gen pushed a commit to Rakshit-gen/DeepSpeed that referenced this pull request Dec 19, 2025