Commit 9f6b562

seccomp: add support for "clone3" syscall in default policy
If no seccomp policy is requested, then the built-in default policy in dockerd applies. This has no rule for "clone3" defined, nor any default errno defined. So when runc receives the config it attempts to determine a default errno, using logic defined in its commit:

  opencontainers/runc@7a8d716

As explained in the above commit message, runc uses a heuristic to decide which errno to return by default:

[quote]
The solution applied here is to prepend a "stub" filter which returns -ENOSYS if the requested syscall has a larger syscall number than any syscall mentioned in the filter. The reason for this specific rule is that syscall numbers are (roughly) allocated sequentially and thus newer syscalls will (usually) have a larger syscall number -- thus causing our filters to produce -ENOSYS if the filter was written before the syscall existed.
[/quote]

Unfortunately clone3 appears to be one of the edge cases that does not result in use of ENOSYS, instead ending up with the historical EPERM errno.

Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use clone3 by default. If it sees ENOSYS then it will automatically fall back to using clone. Any other errno is treated as a fatal error. Thus when the docker seccomp policy triggers EPERM from clone3, no fallback occurs and programs are unable to spawn threads.

The clone3 syscall is much more complicated than clone; most notably, its flags are no longer exposed as a direct argument. Instead they are hidden inside a struct. This means that seccomp filters are unable to apply policy based on values seen in flags. Thus we can't directly replicate the current "clone" filtering for "clone3". We can at least ensure "clone3" returns the ENOSYS errno, to trigger fallback to "clone", at which point we can filter on flags.

Fixes: #42680
Signed-off-by: Daniel P. Berrangé <[email protected]>
1 parent e9b07a7 commit 9f6b562

2 files changed

Lines changed: 27 additions & 0 deletions

profiles/seccomp/default.json

Lines changed: 13 additions & 0 deletions
@@ -553,6 +553,7 @@
 			"names": [
 				"bpf",
 				"clone",
+				"clone3",
 				"fanotify_init",
 				"fsconfig",
 				"fsmount",
@@ -627,6 +628,18 @@
 				]
 			}
 		},
+		{
+			"names": [
+				"clone3"
+			],
+			"action": "SCMP_ACT_ERRNO",
+			"errnoRet": 38,
+			"excludes": {
+				"caps": [
+					"CAP_SYS_ADMIN"
+				]
+			}
+		},
 		{
 			"names": [
 				"reboot"
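
As a rough illustration of how the new JSON entry reads, the sketch below decodes it and evaluates the `excludes` clause. The `rule` type and the `parseRule` and `applies` helpers are hypothetical and are not the types used in profiles/seccomp. The "dropped if any excluded capability is held" semantics are an assumption about how dockerd interprets `excludes`. The value 38 is ENOSYS on common Linux architectures such as x86-64 and arm64.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rule mirrors only the JSON fields used by the new clone3 entry.
// It is an illustrative type, not the one defined in profiles/seccomp.
type rule struct {
	Names    []string `json:"names"`
	Action   string   `json:"action"`
	ErrnoRet *uint    `json:"errnoRet"`
	Excludes struct {
		Caps []string `json:"caps"`
	} `json:"excludes"`
}

// clone3Rule is the entry added to default.json by this commit.
const clone3Rule = `{
	"names": ["clone3"],
	"action": "SCMP_ACT_ERRNO",
	"errnoRet": 38,
	"excludes": {"caps": ["CAP_SYS_ADMIN"]}
}`

// parseRule decodes a single rule from its JSON form.
func parseRule(s string) (rule, error) {
	var r rule
	err := json.Unmarshal([]byte(s), &r)
	return r, err
}

// applies reports whether the rule is kept for a container holding the
// given capabilities. Assumption: the rule is dropped as soon as one of
// the excluded capabilities is held.
func applies(r rule, held []string) bool {
	for _, excluded := range r.Excludes.Caps {
		for _, h := range held {
			if excluded == h {
				return false
			}
		}
	}
	return true
}

func main() {
	r, err := parseRule(clone3Rule)
	if err != nil {
		panic(err)
	}
	fmt.Println("errno returned:", *r.ErrnoRet)
	fmt.Println("unprivileged container:", applies(r, []string{"CAP_CHOWN"}))
	fmt.Println("with CAP_SYS_ADMIN:", applies(r, []string{"CAP_SYS_ADMIN"}))
}
```

Under these assumptions, a container without CAP_SYS_ADMIN gets the ENOSYS rule, while a container holding CAP_SYS_ADMIN skips it and is instead covered by the first hunk, which adds "clone3" to the list that already contains "clone".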

profiles/seccomp/default_linux.go

Lines changed: 14 additions & 0 deletions
@@ -42,6 +42,7 @@ func arches() []Architecture {
 
 // DefaultProfile defines the allowed syscalls for the default seccomp profile.
 func DefaultProfile() *Seccomp {
+	nosys := uint(unix.ENOSYS)
 	syscalls := []*Syscall{
 		{
 			LinuxSyscall: specs.LinuxSyscall{
@@ -546,6 +547,7 @@ func DefaultProfile() *Seccomp {
 				Names: []string{
 					"bpf",
 					"clone",
+					"clone3",
 					"fanotify_init",
 					"fsconfig",
 					"fsmount",
@@ -615,6 +617,18 @@ func DefaultProfile() *Seccomp {
 				Caps: []string{"CAP_SYS_ADMIN"},
 			},
 		},
+		{
+			LinuxSyscall: specs.LinuxSyscall{
+				Names: []string{
+					"clone3",
+				},
+				Action: specs.ActErrno,
+				ErrnoRet: &nosys,
+			},
+			Excludes: &Filter{
+				Caps: []string{"CAP_SYS_ADMIN"},
+			},
+		},
 		{
 			LinuxSyscall: specs.LinuxSyscall{
 				Names: []string{
