Fix rare hang in DeepSpeed Async I/O wait by releasing the Python GIL #7727
Conversation
```diff
       &deepspeed_aio_handle_t::wait,
-      "Wait for (ongoing) asynchronous operations to complete");
+      "Wait for (ongoing) asynchronous operations to complete",
+      py::call_guard<py::gil_scoped_release>());
```
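For context, here is a minimal, self-contained pybind11 sketch of the same technique outside of DeepSpeed. The module and function names (`gil_demo`, `blocking_wait`, `wait_with_guard`, `wait_manual_release`) are illustrative and not part of the DeepSpeed codebase; the point is only how `py::call_guard<py::gil_scoped_release>()` behaves, plus its hand-written equivalent.

```cpp
#include <pybind11/pybind11.h>

#include <chrono>
#include <thread>

namespace py = pybind11;

// Stand-in for a blocking native call such as deepspeed_aio_handle_t::wait();
// it touches no Python objects while blocking, which is what makes releasing
// the GIL safe.
void blocking_wait()
{
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}

PYBIND11_MODULE(gil_demo, m)
{
    // Form used by the PR: the call guard drops the GIL before entering the
    // bound function and re-acquires it before returning to Python.
    m.def("wait_with_guard",
          &blocking_wait,
          "Block without holding the Python GIL",
          py::call_guard<py::gil_scoped_release>());

    // Hand-written equivalent, for comparison: a scoped release inside the
    // bound callable.
    m.def("wait_manual_release", []() {
        py::gil_scoped_release release;  // GIL dropped here
        blocking_wait();
        // GIL re-acquired when `release` goes out of scope.
    });
}
```

The guard drops the GIL before the bound function runs and re-acquires it before control returns to Python, which is safe here because the blocking call never touches Python objects.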
Are there any BC concerns with Python versions?
`py::call_guard<py::gil_scoped_release>()` requires pybind11 ≥ 2.2.0 (it was added in v2.2.0, released 2017-08-31, per the changelog), so this should be fine: the requirement is old relative to current pybind11 releases, and there is no Python-version BC concern.
Fix rare hang in DeepSpeed Async I/O wait by releasing the Python GIL (deepspeedai#7727)

_**What this PR does**_

- This PR fixes an occasional deadlock/hang when using DeepSpeed Async I/O (AIO) for NVMe swap-in/swap-out.
- The hang happens inside `aio_handle.wait()`, where training can stall forever.

_**Reproduction**_

[ds_config.json](https://github.com/user-attachments/files/24179010/ds_config.json)
[finetune_zero3.py](https://github.com/user-attachments/files/24179011/finetune_zero3.py)

Steps

1. Replace `{NVME_PATH}` in `ds_config.json` with a valid NVMe mount path on your cluster.
2. Build/install DeepSpeed with AIO enabled: `DS_BUILD_AIO=1 pip install --no-build-isolation .`
3. Run: `CUDA_VISIBLE_DEVICES=0 deepspeed finetune_zero3.py`

_**Fix**_

Release the Python GIL while `aio_handle.wait()` is blocking by adding a pybind11 call guard (`py::gil_scoped_release`) to the `wait()` binding.

_**Why this is needed (root cause)**_

Two threads are involved:

- Python main thread: calls `aio_handle.wait()` and blocks until all async I/O operations complete.
- AIO worker thread(s): perform the actual file I/O in the background.

In some cases, after an I/O operation completes, the worker thread triggers cleanup of PyTorch tensors (e.g., decref/refcount updates for Python-backed objects). That cleanup path may require acquiring the Python GIL.

**Before this PR:**

- The Python main thread enters `aio_handle.wait()` while still holding the GIL.
- `wait()` blocks, waiting for the worker thread(s) to finish.
- A worker thread completes an I/O op and reaches a cleanup path that attempts to acquire the GIL.
- The worker thread cannot acquire the GIL because it is held by the Python thread blocked in `wait()`.
- Result: the Python thread is waiting for the worker, and the worker is waiting for the GIL → deadlock.

Signed-off-by: Rakshit-gen <[email protected]>
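To make the failure mode concrete, below is a minimal sketch of the deadlock pattern described above. It is not DeepSpeed's actual AIO implementation: `fake_aio_handle`, its members, and the module name are all illustrative. The worker thread keeps a reference to a Python-backed object and needs the GIL to drop it, while the main thread blocks in `wait()`; without the call guard on the `wait` binding, the two threads wait on each other forever.

```cpp
#include <pybind11/pybind11.h>

#include <condition_variable>
#include <mutex>
#include <thread>

namespace py = pybind11;

// Illustrative stand-in for an async I/O handle; not DeepSpeed's real class.
struct fake_aio_handle {
    std::mutex mtx;
    std::condition_variable cv;
    bool done = false;

    // Submit "work" that keeps a reference to a Python-backed object
    // (e.g. a tensor) and must drop that reference when the work finishes.
    void submit(py::object tensor)
    {
        std::thread([this, tensor]() mutable {
            // ... real code would perform file I/O here ...

            {
                // Dropping the Python reference requires the GIL. If the main
                // thread is blocked in wait() while still holding the GIL,
                // this acquire never succeeds and both threads hang.
                py::gil_scoped_acquire gil;
                tensor = py::object();  // release our reference under the GIL
            }

            {
                std::lock_guard<std::mutex> lock(mtx);
                done = true;
            }
            cv.notify_one();
        }).detach();
    }

    // Blocks until the worker signals completion.
    void wait()
    {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [this] { return done; });
    }
};

PYBIND11_MODULE(fake_aio, m)
{
    py::class_<fake_aio_handle>(m, "fake_aio_handle")
        .def(py::init<>())
        .def("submit", &fake_aio_handle::submit)
        // Without this call guard, wait() blocks while holding the GIL and
        // the worker above deadlocks on gil_scoped_acquire.
        .def("wait",
             &fake_aio_handle::wait,
             "Wait for the submitted operation to complete",
             py::call_guard<py::gil_scoped_release>());
}
```

From Python, `h = fake_aio.fake_aio_handle(); h.submit(obj); h.wait()` completes with the guard in place; in this toy example, removing the guard reproduces the hang described above.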