Skip to content

Commit 976da90

Browse files
Alexey Izbyshevgpshead
andauthored
bpo-35823: subprocess: Use vfork() instead of fork() on Linux when safe (GH-11671)
* bpo-35823: subprocess: Use vfork() instead of fork() on Linux when safe When used to run a new executable image, fork() is not a good choice for process creation, especially if the parent has a large working set: fork() needs to copy page tables, which is slow, and may fail on systems where overcommit is disabled, despite that the child is not going to touch most of its address space. Currently, subprocess is capable of using posix_spawn() instead, which normally provides much better performance. However, posix_spawn() does not support many of child setup operations exposed by subprocess.Popen(). Most notably, it's not possible to express `close_fds=True`, which happens to be the default, via posix_spawn(). As a result, most users can't benefit from faster process creation, at least not without changing their code. However, Linux provides vfork() system call, which creates a new process without copying the address space of the parent, and which is actually used by C libraries to efficiently implement posix_spawn(). Due to sharing of the address space and even the stack with the parent, extreme care is required to use vfork(). At least the following restrictions must hold: * No signal handlers must execute in the child process. Otherwise, they might clobber memory shared with the parent, potentially confusing it. * Any library function called after vfork() in the child must be async-signal-safe (as for fork()), but it must also not interact with any library state in a way that might break due to address space sharing and/or lack of any preparations performed by libraries on normal fork(). POSIX.1 permits to call only execve() and _exit(), and later revisions remove vfork() specification entirely. In practice, however, almost all operations needed by subprocess.Popen() can be safely implemented on Linux. * Due to sharing of the stack with the parent, the child must be careful not to clobber local variables that are alive across vfork() call. Compilers are normally aware of this and take extra care with vfork() (and setjmp(), which has a similar problem). * In case the parent is privileged, special attention must be paid to vfork() use, because sharing an address space across different privilege domains is insecure[1]. This patch adds support for using vfork() instead of fork() on Linux when it's possible to do safely given the above. In particular: * vfork() is not used if credential switch is requested. The reverse case (simple subprocess.Popen() but another application thread switches credentials concurrently) is not possible for pure-Python apps because subprocess.Popen() and functions like os.setuid() are mutually excluded via GIL. We might also consider to add a way to opt-out of vfork() (and posix_spawn() on platforms where it might be implemented via vfork()) in a future PR. * vfork() is not used if `preexec_fn != None`. With this change, subprocess will still use posix_spawn() if possible, but will fallback to vfork() on Linux in most cases, and, failing that, to fork(). [1] https://ewontfix.com/7 Co-authored-by: Gregory P. Smith [Google LLC] <[email protected]>
1 parent 16ee68d commit 976da90

File tree

5 files changed

+204
-29
lines changed

5 files changed

+204
-29
lines changed
Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
Use ``vfork()`` instead of ``fork()`` for :func:`subprocess.Popen` on Linux
2+
to improve performance in cases where it is deemed safe.

Modules/_posixsubprocess.c

Lines changed: 197 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -36,6 +36,12 @@
3636
# define SYS_getdents64 __NR_getdents64
3737
#endif
3838

39+
#if defined(__linux__) && defined(HAVE_VFORK) && defined(HAVE_SIGNAL_H) && \
40+
defined(HAVE_PTHREAD_SIGMASK) && !defined(HAVE_BROKEN_PTHREAD_SIGMASK)
41+
# include <signal.h>
42+
# define VFORK_USABLE 1
43+
#endif
44+
3945
#if defined(__sun) && defined(__SVR4)
4046
/* readdir64 is used to work around Solaris 9 bug 6395699. */
4147
# define readdir readdir64
@@ -407,18 +413,82 @@ _close_open_fds_maybe_unsafe(long start_fd, PyObject* py_fds_to_keep)
407413
#endif /* else NOT (defined(__linux__) && defined(HAVE_SYS_SYSCALL_H)) */
408414

409415

416+
#ifdef VFORK_USABLE
417+
/* Reset dispositions for all signals to SIG_DFL except for ignored
418+
* signals. This way we ensure that no signal handlers can run
419+
* after we unblock signals in a child created by vfork().
420+
*/
421+
static void
422+
reset_signal_handlers(const sigset_t *child_sigmask)
423+
{
424+
struct sigaction sa_dfl = {.sa_handler = SIG_DFL};
425+
for (int sig = 1; sig < _NSIG; sig++) {
426+
/* Dispositions for SIGKILL and SIGSTOP can't be changed. */
427+
if (sig == SIGKILL || sig == SIGSTOP) {
428+
continue;
429+
}
430+
431+
/* There is no need to reset the disposition of signals that will
432+
* remain blocked across execve() since the kernel will do it. */
433+
if (sigismember(child_sigmask, sig) == 1) {
434+
continue;
435+
}
436+
437+
struct sigaction sa;
438+
/* C libraries usually return EINVAL for signals used
439+
* internally (e.g. for thread cancellation), so simply
440+
* skip errors here. */
441+
if (sigaction(sig, NULL, &sa) == -1) {
442+
continue;
443+
}
444+
445+
/* void *h works as these fields are both pointer types already. */
446+
void *h = (sa.sa_flags & SA_SIGINFO ? (void *)sa.sa_sigaction :
447+
(void *)sa.sa_handler);
448+
if (h == SIG_IGN || h == SIG_DFL) {
449+
continue;
450+
}
451+
452+
/* This call can't reasonably fail, but if it does, terminating
453+
* the child seems to be too harsh, so ignore errors. */
454+
(void) sigaction(sig, &sa_dfl, NULL);
455+
}
456+
}
457+
#endif /* VFORK_USABLE */
458+
459+
410460
/*
411-
* This function is code executed in the child process immediately after fork
412-
* to set things up and call exec().
461+
* This function is code executed in the child process immediately after
462+
* (v)fork to set things up and call exec().
413463
*
414464
* All of the code in this function must only use async-signal-safe functions,
415465
* listed at `man 7 signal` or
416466
* http://www.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html.
417467
*
418468
* This restriction is documented at
419469
* http://www.opengroup.org/onlinepubs/009695399/functions/fork.html.
470+
*
471+
* If this function is called after vfork(), even more care must be taken.
472+
* The lack of preparations that C libraries normally take on fork(),
473+
* as well as sharing the address space with the parent, might make even
474+
* async-signal-safe functions vfork-unsafe. In particular, on Linux,
475+
* set*id() and setgroups() library functions must not be called, since
476+
* they have to interact with the library-level thread list and send
477+
* library-internal signals to implement per-process credentials semantics
478+
* required by POSIX but not supported natively on Linux. Another reason to
479+
* avoid this family of functions is that sharing an address space between
480+
* processes running with different privileges is inherently insecure.
481+
* See bpo-35823 for further discussion and references.
482+
*
483+
* In some C libraries, setrlimit() has the same thread list/signalling
484+
* behavior since resource limits were per-thread attributes before
485+
* Linux 2.6.10. Musl, as of 1.2.1, is known to have this issue
486+
* (https://www.openwall.com/lists/musl/2020/10/15/6).
487+
*
488+
* If vfork-unsafe functionality is desired after vfork(), consider using
489+
* syscall() to obtain it.
420490
*/
421-
static void
491+
_Py_NO_INLINE static void
422492
child_exec(char *const exec_array[],
423493
char *const argv[],
424494
char *const envp[],
@@ -432,6 +502,7 @@ child_exec(char *const exec_array[],
432502
int call_setgid, gid_t gid,
433503
int call_setgroups, size_t groups_size, const gid_t *groups,
434504
int call_setuid, uid_t uid, int child_umask,
505+
const void *child_sigmask,
435506
PyObject *py_fds_to_keep,
436507
PyObject *preexec_fn,
437508
PyObject *preexec_fn_args_tuple)
@@ -507,6 +578,13 @@ child_exec(char *const exec_array[],
507578
if (restore_signals)
508579
_Py_RestoreSignals();
509580

581+
#ifdef VFORK_USABLE
582+
if (child_sigmask) {
583+
reset_signal_handlers(child_sigmask);
584+
POSIX_CALL(pthread_sigmask(SIG_SETMASK, child_sigmask, NULL));
585+
}
586+
#endif
587+
510588
#ifdef HAVE_SETSID
511589
if (call_setsid)
512590
POSIX_CALL(setsid());
@@ -599,6 +677,81 @@ child_exec(char *const exec_array[],
599677
}
600678

601679

680+
/* The main purpose of this wrapper function is to isolate vfork() from both
681+
* subprocess_fork_exec() and child_exec(). A child process created via
682+
* vfork() executes on the same stack as the parent process while the latter is
683+
* suspended, so this function should not be inlined to avoid compiler bugs
684+
* that might clobber data needed by the parent later. Additionally,
685+
* child_exec() should not be inlined to avoid spurious -Wclobber warnings from
686+
* GCC (see bpo-35823).
687+
*/
688+
_Py_NO_INLINE static pid_t
689+
do_fork_exec(char *const exec_array[],
690+
char *const argv[],
691+
char *const envp[],
692+
const char *cwd,
693+
int p2cread, int p2cwrite,
694+
int c2pread, int c2pwrite,
695+
int errread, int errwrite,
696+
int errpipe_read, int errpipe_write,
697+
int close_fds, int restore_signals,
698+
int call_setsid,
699+
int call_setgid, gid_t gid,
700+
int call_setgroups, size_t groups_size, const gid_t *groups,
701+
int call_setuid, uid_t uid, int child_umask,
702+
const void *child_sigmask,
703+
PyObject *py_fds_to_keep,
704+
PyObject *preexec_fn,
705+
PyObject *preexec_fn_args_tuple)
706+
{
707+
708+
pid_t pid;
709+
710+
#ifdef VFORK_USABLE
711+
if (child_sigmask) {
712+
/* These are checked by our caller; verify them in debug builds. */
713+
assert(!call_setsid);
714+
assert(!call_setuid);
715+
assert(!call_setgid);
716+
assert(!call_setgroups);
717+
assert(preexec_fn == Py_None);
718+
719+
pid = vfork();
720+
} else
721+
#endif
722+
{
723+
pid = fork();
724+
}
725+
726+
if (pid != 0) {
727+
return pid;
728+
}
729+
730+
/* Child process.
731+
* See the comment above child_exec() for restrictions imposed on
732+
* the code below.
733+
*/
734+
735+
if (preexec_fn != Py_None) {
736+
/* We'll be calling back into Python later so we need to do this.
737+
* This call may not be async-signal-safe but neither is calling
738+
* back into Python. The user asked us to use hope as a strategy
739+
* to avoid deadlock... */
740+
PyOS_AfterFork_Child();
741+
}
742+
743+
child_exec(exec_array, argv, envp, cwd,
744+
p2cread, p2cwrite, c2pread, c2pwrite,
745+
errread, errwrite, errpipe_read, errpipe_write,
746+
close_fds, restore_signals, call_setsid,
747+
call_setgid, gid, call_setgroups, groups_size, groups,
748+
call_setuid, uid, child_umask, child_sigmask,
749+
py_fds_to_keep, preexec_fn, preexec_fn_args_tuple);
750+
_exit(255);
751+
return 0; /* Dead code to avoid a potential compiler warning. */
752+
}
753+
754+
602755
static PyObject *
603756
subprocess_fork_exec(PyObject* self, PyObject *args)
604757
{
@@ -836,39 +989,56 @@ subprocess_fork_exec(PyObject* self, PyObject *args)
836989
need_after_fork = 1;
837990
}
838991

839-
pid = fork();
840-
if (pid == 0) {
841-
/* Child process */
842-
/*
843-
* Code from here to _exit() must only use async-signal-safe functions,
844-
* listed at `man 7 signal` or
845-
* http://www.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html.
992+
/* NOTE: When old_sigmask is non-NULL, do_fork_exec() may use vfork(). */
993+
const void *old_sigmask = NULL;
994+
#ifdef VFORK_USABLE
995+
/* Use vfork() only if it's safe. See the comment above child_exec(). */
996+
sigset_t old_sigs;
997+
if (preexec_fn == Py_None &&
998+
!call_setuid && !call_setgid && !call_setgroups && !call_setsid) {
999+
/* Block all signals to ensure that no signal handlers are run in the
1000+
* child process while it shares memory with us. Note that signals
1001+
* used internally by C libraries won't be blocked by
1002+
* pthread_sigmask(), but signal handlers installed by C libraries
1003+
* normally service only signals originating from *within the process*,
1004+
* so it should be sufficient to consider any library function that
1005+
* might send such a signal to be vfork-unsafe and do not call it in
1006+
* the child.
8461007
*/
1008+
sigset_t all_sigs;
1009+
sigfillset(&all_sigs);
1010+
pthread_sigmask(SIG_BLOCK, &all_sigs, &old_sigs);
1011+
old_sigmask = &old_sigs;
1012+
}
1013+
#endif
8471014

848-
if (preexec_fn != Py_None) {
849-
/* We'll be calling back into Python later so we need to do this.
850-
* This call may not be async-signal-safe but neither is calling
851-
* back into Python. The user asked us to use hope as a strategy
852-
* to avoid deadlock... */
853-
PyOS_AfterFork_Child();
854-
}
1015+
pid = do_fork_exec(exec_array, argv, envp, cwd,
1016+
p2cread, p2cwrite, c2pread, c2pwrite,
1017+
errread, errwrite, errpipe_read, errpipe_write,
1018+
close_fds, restore_signals, call_setsid,
1019+
call_setgid, gid, call_setgroups, num_groups, groups,
1020+
call_setuid, uid, child_umask, old_sigmask,
1021+
py_fds_to_keep, preexec_fn, preexec_fn_args_tuple);
8551022

856-
child_exec(exec_array, argv, envp, cwd,
857-
p2cread, p2cwrite, c2pread, c2pwrite,
858-
errread, errwrite, errpipe_read, errpipe_write,
859-
close_fds, restore_signals, call_setsid,
860-
call_setgid, gid, call_setgroups, num_groups, groups,
861-
call_setuid, uid, child_umask,
862-
py_fds_to_keep, preexec_fn, preexec_fn_args_tuple);
863-
_exit(255);
864-
return NULL; /* Dead code to avoid a potential compiler warning. */
865-
}
8661023
/* Parent (original) process */
8671024
if (pid == -1) {
8681025
/* Capture errno for the exception. */
8691026
saved_errno = errno;
8701027
}
8711028

1029+
#ifdef VFORK_USABLE
1030+
if (old_sigmask) {
1031+
/* vfork() semantics guarantees that the parent is blocked
1032+
* until the child performs _exit() or execve(), so it is safe
1033+
* to unblock signals once we're here.
1034+
* Note that in environments where vfork() is implemented as fork(),
1035+
* such as QEMU user-mode emulation, the parent won't be blocked,
1036+
* but it won't share the address space with the child,
1037+
* so it's still safe to unblock the signals. */
1038+
pthread_sigmask(SIG_SETMASK, old_sigmask, NULL);
1039+
}
1040+
#endif
1041+
8721042
Py_XDECREF(cwd_obj2);
8731043

8741044
if (need_after_fork)

configure

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11732,7 +11732,7 @@ for ac_func in alarm accept4 setitimer getitimer bind_textdomain_codeset chown \
1173211732
sigaction sigaltstack sigfillset siginterrupt sigpending sigrelse \
1173311733
sigtimedwait sigwait sigwaitinfo snprintf strftime strlcpy strsignal symlinkat sync \
1173411734
sysconf tcgetpgrp tcsetpgrp tempnam timegm times tmpfile tmpnam tmpnam_r \
11735-
truncate uname unlinkat utimensat utimes waitid waitpid wait3 wait4 \
11735+
truncate uname unlinkat utimensat utimes vfork waitid waitpid wait3 wait4 \
1173611736
wcscoll wcsftime wcsxfrm wmemcmp writev _getpty rtpSpawn
1173711737
do :
1173811738
as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`

configure.ac

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3690,7 +3690,7 @@ AC_CHECK_FUNCS(alarm accept4 setitimer getitimer bind_textdomain_codeset chown \
36903690
sigaction sigaltstack sigfillset siginterrupt sigpending sigrelse \
36913691
sigtimedwait sigwait sigwaitinfo snprintf strftime strlcpy strsignal symlinkat sync \
36923692
sysconf tcgetpgrp tcsetpgrp tempnam timegm times tmpfile tmpnam tmpnam_r \
3693-
truncate uname unlinkat utimensat utimes waitid waitpid wait3 wait4 \
3693+
truncate uname unlinkat utimensat utimes vfork waitid waitpid wait3 wait4 \
36943694
wcscoll wcsftime wcsxfrm wmemcmp writev _getpty rtpSpawn)
36953695

36963696
# Force lchmod off for Linux. Linux disallows changing the mode of symbolic

pyconfig.h.in

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1301,6 +1301,9 @@
13011301
/* Define to 1 if you have the <uuid/uuid.h> header file. */
13021302
#undef HAVE_UUID_UUID_H
13031303

1304+
/* Define to 1 if you have the `vfork' function. */
1305+
#undef HAVE_VFORK
1306+
13041307
/* Define to 1 if you have the `wait3' function. */
13051308
#undef HAVE_WAIT3
13061309

0 commit comments

Comments
 (0)