Block obsolete and unusual socket families in the default seccomp profile #29076
Conversation
riyazdf
left a comment
profile changes LGTM!
We should also update the default seccomp profile docs to capture this change
cc @mstanleyjones
profiles/seccomp/seccomp_default.go
Outdated
to ensure I understand correctly: this rule allows for calling bind() and accept(), subsequent rules lock down socket()?
`socketcall` is an obsolete way of making socket calls, used on some architectures (32-bit x86 and some others). Rather than there being a `socket` syscall and one for `bind` etc., there is one syscall for all socket operations, and you call `socketcall(1, rest of args)` to call `socket`, where 1 is `socket`, 2 is `bind`, etc. So to block it we have to block both ways of calling it. So we allow `socketcall` with the first argument greater than 1, i.e. all operations except `socket`.
So yes, as you said, just spelling it out for clarity...
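To make the two entry points concrete, here is a minimal Python model of the rule as described in this thread. It is only a sketch, not the actual profile code: the function name `filter_allows` and the list-of-arguments representation are made up for illustration. The sub-call numbers are the ones from `<linux/net.h>` and the family numbers are the Linux values.

```python
# socketcall(2) sub-call numbers, as defined in <linux/net.h>.
SYS_SOCKET = 1
SYS_BIND = 2
SYS_CONNECT = 3
SYS_LISTEN = 4
SYS_ACCEPT = 5

# Linux address family numbers.
AF_UNIX, AF_INET, AF_IPX, AF_INET6, AF_NETLINK, AF_PACKET = 1, 2, 4, 10, 16, 17

# Families the PR description lists as still allowed.
ALLOWED_FAMILIES = {AF_UNIX, AF_INET, AF_INET6, AF_NETLINK, AF_PACKET}


def filter_allows(syscall, args):
    """Hypothetical model of the two rules discussed in this thread.

    `args` holds the raw syscall arguments, which is all a seccomp
    filter gets to inspect.
    """
    if syscall == "socket":
        # Direct syscall: argument 0 is the address family.
        return args[0] in ALLOWED_FAMILIES
    if syscall == "socketcall":
        # Multiplexed syscall: argument 0 is the sub-call number, so
        # "greater than 1" means everything except socket creation.
        return args[0] > SYS_SOCKET
    # Other syscalls are out of scope for this sketch.
    return True


print(filter_allows("socket", [AF_INET, 1, 0]))          # direct, allowed family
print(filter_allows("socket", [AF_IPX, 1, 0]))           # direct, blocked family
print(filter_allows("socketcall", [SYS_BIND, 0x1000]))   # bind via socketcall
print(filter_allows("socketcall", [SYS_SOCKET, 0x1000])) # socket via socketcall
```

Under this model, `bind` and `accept` still work through `socketcall`, while socket creation through `socketcall` is refused regardless of family.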
Would be great to capture this discussion in the form of a comment on the policy itself? In 6 months we won't remember why we did this ;)
Added a comment inline.
0187fb5 to
fa89362
Compare
fa89362 to
c2e7446
Compare
|
cc @jessfraz |
|
cool thanks! |
|
ping @diogomonica @ijc25 @rneugeba PTAL (thanks for looking @jessfraz !) |
|
This LGTM. I agree w/ the current list of allowed socket families, and with the impression that the other families are very unlikely to be in use by any containers out there. Do we currently have any way to test profile changes like this in a meaningful number of images? That would help us gain more confidence that this is indeed a NOP. Good work! |
|
Looking through It's possible some people might want |
|
@ijc25 |
|
I should learn to read properly! |
|
@justincormack needs a rebase 🙇 |
438bbfd to
ecc4f4b
Compare
|
rebased and added a comment as suggested by @diogomonica |
Linux supports many obsolete address families, which are usually available in common distro kernels, but they are less likely to be properly audited and may have security issues.

This blocks all socket families in the socket (and socketcall where applicable) syscall except
- AF_UNIX - Unix domain sockets
- AF_INET - IPv4
- AF_INET6 - IPv6
- AF_NETLINK - Netlink sockets for communicating with the kernel
- AF_PACKET - raw sockets, which are only allowed with CAP_NET_RAW

All other socket families are blocked, including Appletalk (native, not over IP), IPX (remember that!), VSOCK and HVSOCK, which should not generally be used in containers, etc.

Note that users can of course provide a profile per container or in the daemon config if they have unusual use cases that require these.

Signed-off-by: Justin Cormack <[email protected]>
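For readers who want to write a similar allowlist in a custom profile, the following Python sketch generates rule entries in the JSON shape used by Docker seccomp profiles (`names`, `action`, `args` with `index`, `value`, `op`). The helper names are hypothetical and the generated rules are an illustration of the commit message, not a copy of what the commit added. In particular, the sketch allows AF_PACKET unconditionally and does not model the CAP_NET_RAW condition.

```python
import json

# Linux address family numbers for the families named in the commit message.
ALLOWED_FAMILIES = {
    "AF_UNIX": 1,
    "AF_INET": 2,
    "AF_INET6": 10,
    "AF_NETLINK": 16,
    "AF_PACKET": 17,  # capability handling is not modelled in this sketch
}


def socket_rules():
    """One allow rule per family: socket() is allowed when arg 0 equals it."""
    return [
        {
            "names": ["socket"],
            "action": "SCMP_ACT_ALLOW",
            "args": [{"index": 0, "value": value, "op": "SCMP_CMP_EQ"}],
        }
        for value in ALLOWED_FAMILIES.values()
    ]


def socketcall_rule():
    """Allow socketcall only for sub-calls above 1, i.e. everything but socket."""
    return {
        "names": ["socketcall"],
        "action": "SCMP_ACT_ALLOW",
        "args": [{"index": 0, "value": 1, "op": "SCMP_CMP_GT"}],
    }


fragment = socket_rules() + [socketcall_rule()]
print(json.dumps(fragment, indent=2))
```

Separate rules for the same syscall act as alternatives, so `socket` is allowed when any one of the five family checks matches, provided the profile's default action denies everything else.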
ecc4f4b to
7e3a596
Compare
|
all 💚 now |
Block obsolete and unusual socket families in the default seccomp profile (cherry picked from commit 4818435) Signed-off-by: Sebastiaan van Stijn <[email protected]>
Since moby/moby#29076 socket(AF_ALG, ...) is being blocked by Docker default seccomp policy. Fail nicely in this case.
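"Failing nicely" here roughly means probing for the socket family and falling back when the call is refused. A minimal Python sketch of that pattern follows. The function names and the returned backend labels are hypothetical, and the sketch assumes the seccomp policy denies the call with an error (as Docker's default action does) rather than killing the process.

```python
import socket

# AF_ALG is 38 on Linux; fall back to the number where Python does not
# expose the constant (for example on non-Linux platforms).
AF_ALG = getattr(socket, "AF_ALG", 38)


def af_alg_available():
    """Return True if an AF_ALG socket can be created, False otherwise."""
    try:
        sock = socket.socket(AF_ALG, socket.SOCK_SEQPACKET, 0)
    except OSError:
        # Typically EPERM when a seccomp policy denies the call, or
        # EAFNOSUPPORT when the kernel has no AF_ALG support.
        return False
    sock.close()
    return True


def crypto_backend():
    """Pick a (hypothetical) backend label instead of crashing."""
    return "kernel-af_alg" if af_alg_available() else "userspace"


print(crypto_backend())
```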
|
@ihac ah yes, I forgot about that. Hmm, this basically makes this patch useless as we can't really block |
|
On the plus side
Not sure what the cross of that with Moby's set of supported arches is, at least x86-32, mips and s390, I think? I don't know about the others. Even of all those I didn't check if they also have the split calls and if so what versions of glibc for those platforms used which interface. So I don't think it renders the patch quite "useless" at least for a number of interesting platforms (x86-64, arm64). Although I suppose with |
|
Yes. As we allow 32 bit calls on 64 bit by default, and the glibc changes for supporting non-socketcall on 32 bit x86 are recent (and were never done for musl), we can't disable it, so the aim of removing dangerous kernel paths for exploits is impossible on amd64. The same applies to s390/s390x, which changed at the same time. |
This reverts commit 7e3a596. Unfortunately, it was pointed out in moby#29076 (comment) that the `socketcall` syscall takes a pointer to a struct so it is not possible to use seccomp profiles to filter it. This means these cannot be blocked as you can use `socketcall` to call them regardless, as we currently allow 32 bit syscalls. Users who wish to block these should use a seccomp profile that blocks all 32 bit syscalls and then just block the non socketcall versions. Signed-off-by: Justin Cormack <[email protected]>
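The limitation described in the revert can be shown with a small Python model. This is a simplified sketch with made-up names (`memory`, `filter_allows`, `kernel_family`): a seccomp filter sees only the raw argument values, and for `socketcall` the second argument is an address, so the family stored behind that address is invisible to the filter.

```python
SYS_SOCKET = 1               # socketcall sub-call number for socket()
AF_INET, AF_VSOCK = 2, 40    # Linux address family numbers

# Simulated user memory: address -> [family, type, protocol].
memory = {
    0x1000: [AF_INET, 1, 0],
    0x2000: [AF_VSOCK, 1, 0],
}


def filter_allows(call, args_ptr, allow_socket_subcall):
    """Model of the best a filter can do for socketcall.

    The filter receives `args_ptr` only as a number and cannot read the
    memory behind it, so the decision depends on `call` alone.
    """
    if call == SYS_SOCKET:
        return allow_socket_subcall
    return True


def kernel_family(args_ptr):
    """The kernel, unlike the filter, follows the pointer and reads the family."""
    return memory[args_ptr][0]


for allow in (True, False):
    inet = filter_allows(SYS_SOCKET, 0x1000, allow)
    vsock = filter_allows(SYS_SOCKET, 0x2000, allow)
    # Whatever the policy, both families get the same verdict.
    print(allow, inet, vsock, inet == vsock)
```

The only choices are to allow socket creation through `socketcall` for every family or to refuse it for every family, which is why the commit suggests blocking 32 bit syscalls entirely for users who need the restriction.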
This syncs the seccomp-profile with the latest changes in containerd's profile, applying the same changes as containerd/containerd@17a9324

Some background from the associated ticket:

> We want to use vsock for guest-host communication on KubeVirt
> (https://github.com/kubevirt/kubevirt). In KubeVirt we run VMs in pods.
>
> However since anyone can just connect from any pod to any VM with the
> default seccomp settings, we cannot limit connection attempts to our
> privileged node-agent.
>
> ### Describe the solution you'd like
> We want to deny the `socket` syscall for the `AF_VSOCK` family by default.
>
> I see in [1] and [2] that AF_VSOCK was actually already blocked for some
> time, but that got reverted since some architectures support the `socketcall`
> syscall which can't be restricted properly. However we are mostly interested
> in `arm64` and `amd64` where limiting `socket` would probably be enough.
>
> ### Additional context
> I know that in theory we could use our own seccomp profiles, but we would want
> to provide security for as many users as possible which use KubeVirt, and there
> it would be very helpful if this protection could be added by being part of the
> DefaultRuntime profile to easily ensure that it is active for all pods [3].
>
> Impact on existing workloads: It is unlikely that this will disturb any existing
> workload, because VSOCK is almost exclusively used for host-guest communication.
> However if someone would still use it: Privileged pods would still be able to
> use `socket` for `AF_VSOCK`, custom seccomp policies could be applied too.
> Further it was already blocked for quite some time and the blockade got lifted
> due to reasons not related to AF_VSOCK.
>
> The PR in KubeVirt which adds VSOCK support for additional context: [4]
>
> [1]: moby#29076 (comment)
> [2]: moby@dcf2632
> [3]: https://kubernetes.io/docs/tutorials/security/seccomp/#enable-the-use-of-runtimedefault-as-the-default-seccomp-profile-for-all-workloads
> [4]: kubevirt/kubevirt#8546

Signed-off-by: Sebastiaan van Stijn <[email protected]>
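As a rough way to check whether a given profile would still let `socket(AF_VSOCK, ...)` through, the Python sketch below evaluates `socket` rules from a profile dictionary in the Docker/containerd JSON shape. It is a simplified, assumption-laden model: it supports only the `SCMP_CMP_EQ` and `SCMP_CMP_NE` operators, assumes the default action denies, ignores `socketcall` and architecture-specific sections, and both example profiles are made up for illustration rather than taken from the commit.

```python
AF_INET, AF_VSOCK = 2, 40  # Linux address family numbers


def arg_matches(cond, args):
    """Evaluate one argument condition against the raw syscall arguments."""
    value = args[cond["index"]]
    if cond["op"] == "SCMP_CMP_EQ":
        return value == cond["value"]
    if cond["op"] == "SCMP_CMP_NE":
        return value != cond["value"]
    raise ValueError("operator not modelled in this sketch: " + cond["op"])


def socket_allowed(profile, family):
    """True if some allow rule for socket() matches the given family."""
    args = [family, 0, 0]
    for rule in profile["syscalls"]:
        if "socket" not in rule["names"] or rule["action"] != "SCMP_ACT_ALLOW":
            continue
        # Conditions within one rule must all hold; a rule without
        # conditions matches every call.
        if all(arg_matches(cond, args) for cond in rule.get("args") or []):
            return True
    return False  # assumes the profile's default action denies


# Hypothetical profile that allows socket() for every family.
permissive = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["socket"], "action": "SCMP_ACT_ALLOW"}],
}

# Hypothetical profile that allows socket() unless the family is AF_VSOCK.
restricted = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            "names": ["socket"],
            "action": "SCMP_ACT_ALLOW",
            "args": [{"index": 0, "value": AF_VSOCK, "op": "SCMP_CMP_NE"}],
        }
    ],
}

print(socket_allowed(permissive, AF_VSOCK))
print(socket_allowed(restricted, AF_VSOCK))
print(socket_allowed(restricted, AF_INET))
```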
This reverts commit 74e1d6c. Unfortunately, it was pointed out in moby/moby#29076 (comment) that the `socketcall` syscall takes a pointer to a struct so it is not possible to use seccomp profiles to filter it. This means these cannot be blocked as you can use `socketcall` to call them regardless, as we currently allow 32 bit syscalls. Users who wish to block these should use a seccomp profile that blocks all 32 bit syscalls and then just block the non socketcall versions. Signed-off-by: Justin Cormack <[email protected]>

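Linux supports many obsolete address families, which are usually available in
common distro kernels, but they are less likely to be properly audited and
may have security issues.
This blocks all socket families in the socket (and socketcall where applicable) syscall
except
- AF_UNIX - Unix domain sockets
- AF_INET - IPv4
- AF_INET6 - IPv6
- AF_NETLINK - Netlink sockets for communicating with the kernel
- AF_PACKET - raw sockets, which are only allowed with CAP_NET_RAW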
All other socket families are blocked, including Appletalk (native, not
over IP), IPX (remember that!), VSOCK and HVSOCK, which should not generally
be used in containers, etc.
Note that users can of course provide a profile per container or in the daemon
config if they have unusual use cases that require these.
Signed-off-by: Justin Cormack [email protected]
cc @diogomonica @riyazdf @ijc25 @rneugeba