build_environment.py: Prevent deadlock on install process join#51429

Merged
scheibelp merged 3 commits into spack:develop from johnwparent:prevent-deadlock-from-failing-child
Oct 10, 2025

@johnwparent johnwparent commented Oct 10, 2025

Currently in build_environment.complete_build_process we call join on the child installer process before we call recv on the write pipe.

This can lead to a deadlock if the child is blocked in its send call on that pipe.
The Python docs for Connection.send do not say that it can block. However, if you dig into the implementation of send on Windows, it uses _winapi.WriteFile(self._handle, buf, overlapped=True) under the hood. With an OVERLAPPED (async) handle and an anonymous pipe, that call will return false if the pipe buffer is full, per the win32 docs:

If the pipe buffer is full when an application uses the WriteFile function to write to a pipe, the write operation may not finish immediately. The write operation will be completed when a read operation makes more system buffer space available for the pipe.

The send method then calls waitres = _winapi.WaitForMultipleObjects([ov.event], False, INFINITE), which blocks indefinitely until the child process is killed or the parent process frees up buffer space with a read.

So if Spack's child installer process fills up that buffer with a send, it will block on that send until the buffer is emptied. If the parent process is waiting on a join before it performs that read, neither side can make progress and Spack deadlocks.
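
The blocking behaviour can be reproduced without Spack. The following is a minimal sketch, not Spack's code: a thread stands in for the child installer process, and the function name and payload size are hypothetical. The payload is deliberately far larger than any platform's pipe buffer so the effect shows up on Linux and macOS as well as on Windows.

```python
import threading
from multiprocessing import Pipe

# Hypothetical size, chosen only to exceed any platform's pipe buffer.
PAYLOAD_SIZE = 8 * 1024 * 1024


def demonstrate_blocked_send(payload_size=PAYLOAD_SIZE, wait=0.5):
    """Return (blocked_before_read, bytes_received, finished_after_read)."""
    reader, writer = Pipe(duplex=False)
    payload = b"x" * payload_size

    # The sender thread stands in for the child installer process.
    sender = threading.Thread(target=writer.send, args=(payload,), daemon=True)
    sender.start()

    # Same situation as join-before-recv: nobody reads, so send() cannot finish.
    sender.join(timeout=wait)
    blocked_before_read = sender.is_alive()

    # Draining the pipe lets the pending write complete.
    received = reader.recv()
    sender.join(timeout=10)
    finished_after_read = not sender.is_alive()

    reader.close()
    writer.close()
    return blocked_before_read, len(received), finished_after_read
```

Without the timeout on the first join, this sketch would hang in the same way the installer does.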

We are observing this with VTK and ParaView in CI (for example: https://gitlab.spack.io/spack/spack-packages/-/jobs/18409638), which is costing us approximately 18 hours of CI time for each of these failures.
In these instances, Spack is sending 12 KB into the pipe buffer, which exceeds the default buffer size of 8 KB, leading to a deadlock.

Moving the recv call before the join empties the buffer and ensures we can join the child gracefully.
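
As an illustration of the recv-then-join ordering, here is a minimal sketch. It is not the code in build_environment.py: `_child_main` and `run_child_and_collect` are hypothetical names, and the 12 KB default only mirrors the payload size mentioned above.

```python
import multiprocessing


def _child_main(write_pipe, payload_size):
    # Stand-in for the installer child: report a large result, then exit.
    write_pipe.send("x" * payload_size)
    write_pipe.close()


def run_child_and_collect(payload_size=12 * 1024, timeout=30):
    """Read the child's result before joining it, so a full pipe cannot deadlock."""
    # Assumption for this sketch: prefer "fork" where it exists so the function
    # can be called without an import guard. On Windows only "spawn" exists and
    # the call must sit under `if __name__ == "__main__":`.
    methods = multiprocessing.get_all_start_methods()
    ctx = multiprocessing.get_context("fork" if "fork" in methods else "spawn")

    read_pipe, write_pipe = ctx.Pipe(duplex=False)
    child = ctx.Process(target=_child_main, args=(write_pipe, payload_size))
    child.start()
    write_pipe.close()  # parent keeps only the read end, so EOF is detectable

    try:
        result = read_pipe.recv()  # drain the pipe first ...
    except EOFError:
        result = None  # child exited without sending anything
    child.join(timeout)  # ... then join; the child is no longer stuck in send()
    read_pipe.close()
    return result, child.exitcode
```

Swapping the recv and join lines in this sketch reproduces the hang whenever the payload does not fit in the pipe buffer.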

Signed-off-by: John Parent <[email protected]>
@scheibelp scheibelp enabled auto-merge (squash) October 10, 2025 21:22
@scheibelp scheibelp merged commit 9286416 into spack:develop Oct 10, 2025
31 checks passed
haampie commented Oct 13, 2025

@scheibelp please label for backports next time :)

@haampie haampie added the v1.0.3 PRs to backport for v1.0.3 label Oct 13, 2025
@johnwparent johnwparent linked an issue Nov 4, 2025 that may be closed by this pull request
@johnwparent johnwparent mentioned this pull request Nov 4, 2025
becker33 pushed a commit that referenced this pull request Jan 10, 2026
Currently in build_environment.complete_build_process we call join on
the child installer process before we call recv on the write pipe.

This can lead to a deadlock if the child is blocking on the send call
to the write pipe. Per the python docs, Connection.send does not
claim to block. On Windows, however, the implementation can block
if the pipe becomes full.

The issue is fixed by reading from the pipe in the parent process
before joining the child process.

---------

Signed-off-by: John Parent <[email protected]>
@becker33 becker33 mentioned this pull request Jan 10, 2026
vjranagit pushed a commit to vjranagit/spack that referenced this pull request Jan 18, 2026
becker33 pushed a commit that referenced this pull request Feb 2, 2026
becker33 pushed a commit that referenced this pull request Feb 2, 2026
Signed-off-by: Gregory Becker <[email protected]>
becker33 pushed a commit that referenced this pull request Feb 19, 2026
Signed-off-by: Gregory Becker <[email protected]>
Labels

v1.0.3 PRs to backport for v1.0.3

Development

Successfully merging this pull request may close these issues.

Spack install hangs on failed cmake stage with long argument lists
