Block obsolete and unusual socket families in the default seccomp profile #29076

Merged
thaJeztah merged 1 commit into moby:master from justincormack:seccomp-socket-to-them
Jan 18, 2017

@justincormack
Contributor

Linux supports many obsolete address families, which are usually available in
common distro kernels, but they are less likely to be properly audited and
may have security issues

This blocks all socket families in the socket (and socketcall where applicable) syscall
except

  • AF_UNIX - Unix domain sockets
  • AF_INET - IPv4
  • AF_INET6 - IPv6
  • AF_NETLINK - Netlink sockets for communicating with the kernel
  • AF_PACKET - raw sockets, which are only allowed with CAP_NET_RAW

All other socket families are blocked, including Appletalk (native, not
over IP), IPX (remember that!), VSOCK and HVSOCK, which should not generally
be used in containers, etc.

Note that users can of course provide a profile per container or in the daemon
config if they have unusual use cases that require these.
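
As a rough illustration of the policy described above, the allowlist can be modelled as a small predicate. This is only a sketch: the function name `socket_family_allowed` and the `has_cap_net_raw` flag are hypothetical, the real profile encodes these rules as seccomp argument comparisons, and the numeric values are the Linux address-family constants.

```python
# Linux address-family numbers (values from <linux/socket.h>).
AF_UNIX, AF_INET, AF_IPX, AF_APPLETALK = 1, 2, 4, 5
AF_INET6, AF_NETLINK, AF_PACKET = 10, 16, 17
AF_ALG, AF_VSOCK = 38, 40

# Families the PR description lists as allowed unconditionally.
ALWAYS_ALLOWED = frozenset({AF_UNIX, AF_INET, AF_INET6, AF_NETLINK})


def socket_family_allowed(family: int, has_cap_net_raw: bool = False) -> bool:
    """Hypothetical model of the allowlist, not the profile's real encoding."""
    if family in ALWAYS_ALLOWED:
        return True
    # AF_PACKET is allowed only when the container holds CAP_NET_RAW.
    if family == AF_PACKET:
        return has_cap_net_raw
    # Everything else (IPX, Appletalk, VSOCK, AF_ALG, ...) is refused.
    return False
```

Under this model, `socket_family_allowed(AF_INET)` is true, while `socket_family_allowed(AF_VSOCK)` and `socket_family_allowed(AF_PACKET)` without the capability are false.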

Signed-off-by: Justin Cormack [email protected]

cc @diogomonica @riyazdf @ijc25 @rneugeba

bats

Contributor

@riyazdf riyazdf left a comment

profile changes LGTM!

We should also update the default seccomp profile docs to capture this change

cc @mstanleyjones

Contributor

to ensure I understand correctly: this rule allows for calling bind() and accept(), subsequent rules lock down socket()?

Contributor Author

`socketcall` is a somewhat obsolete way of calling socket, used on some architectures (x86 32 bit, some others). Rather than there being a syscall `socket` and one for `bind` etc., there is one syscall for all socket operations, and you call `socketcall(1, rest of args)` to call socket, where 1 is `socket`, 2 is `bind`, etc. So to block it we have to block both ways of calling it. So we allow `socketcall` with the first argument greater than 1, i.e. all operations except `socket`.
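
A minimal sketch of the rule described above, assuming only the first `socketcall` argument (the call number) is compared. The function name is hypothetical; the call numbers are the ones defined in `<linux/net.h>`.

```python
# Call numbers multiplexed through socketcall(2), as defined in <linux/net.h>.
SYS_SOCKET, SYS_BIND, SYS_CONNECT, SYS_LISTEN, SYS_ACCEPT = 1, 2, 3, 4, 5


def socketcall_allowed(call: int) -> bool:
    """Model of the rule: permit every socketcall operation except socket."""
    return call > SYS_SOCKET
```

With this rule `bind`, `connect`, `listen` and `accept` pass through `socketcall`, and only the call that creates a socket is refused on that path.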

Contributor Author

So yes, as you said, just spelling it out for clarity...

Contributor

cool, thanks 👍

Contributor

Would be great to capture this discussion in the form of a comment on the policy itself? In 6 months we won't remember why we did this ;)

Contributor Author

Added a comment inline.

@justincormack justincormack force-pushed the seccomp-socket-to-them branch from 0187fb5 to fa89362 Compare December 2, 2016 18:02
@justincormack justincormack force-pushed the seccomp-socket-to-them branch from fa89362 to c2e7446 Compare December 2, 2016 20:15
@justincormack justincormack changed the title Block obsolete socket families in the default seccomp profile Block obsolete and unusual socket families in the default seccomp profile Dec 2, 2016
@justincormack
Contributor Author

cc @jessfraz

@jessfraz
Contributor

jessfraz commented Dec 9, 2016

cool thanks!

@thaJeztah
Member

ping @diogomonica @ijc25 @rneugeba PTAL

(thanks for looking @jessfraz !)

@diogomonica
Contributor

This LGTM. I agree w/ the current list of allowed socket families, and with the impression that the other families are very unlikely to be in use by any containers out there.

Do we currently have any way to test profile changes like this in a meaningful number of images? That would help us gain more confidence that this is indeed a NOP.

Good work!

@ijc
Contributor

ijc commented Dec 12, 2016

Looking through include/linux/socket.h I think you've picked the right ones to allow by default.

It's possible some people might want AF_PACKET but I think you are right to exclude that from the default set.

@justincormack
Contributor Author

@ijc25 AF_PACKET is specifically allowed in this patch, but it is gated by CAP_NET_RAW (which is allowed by default; that needs fixing, but that's another PR).

@ijc
Contributor

ijc commented Dec 16, 2016

I should learn to read properly!

@vdemeester
Member

@justincormack needs a rebase 🙇

@justincormack justincormack force-pushed the seccomp-socket-to-them branch 2 times, most recently from 438bbfd to ecc4f4b Compare January 17, 2017 15:56
@justincormack
Contributor Author

rebased and added a comment as suggested by @diogomonica

Member

@thaJeztah thaJeztah left a comment

looks like @jessfraz, @riyazdf, and @diogomonica are good, so

LGTM

docs PR is in docker/docs#776

Member

[Image: screenshot, 2017-01-17 at 17:51:40]

Contributor Author

ha!

Member

@vdemeester vdemeester left a comment

LGTM 🐻

@justincormack justincormack force-pushed the seccomp-socket-to-them branch from ecc4f4b to 7e3a596 Compare January 17, 2017 17:50
@thaJeztah
Member

all 💚 now

@thaJeztah thaJeztah merged commit 4818435 into moby:master Jan 18, 2017
@GordonTheTurtle GordonTheTurtle added this to the 1.14.0 milestone Jan 18, 2017
thaJeztah added a commit to thaJeztah/docker that referenced this pull request Feb 22, 2017
Block obsolete and unusual socket families in the default seccomp profile
(cherry picked from commit 4818435)

Signed-off-by: Sebastiaan van Stijn <[email protected]>
fishilico added a commit to fishilico/shared that referenced this pull request Mar 19, 2017
Since moby/moby#29076 socket(AF_ALG, ...) is
being blocked by Docker default seccomp policy. Fail nicely in this
case.
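
For illustration, "failing nicely" could look like the sketch below. This is not the code from the referenced commit; `open_af_alg` is a hypothetical helper, and the fallback value 38 is the Linux number for `AF_ALG`.

```python
import socket
from typing import Optional

# AF_ALG is 38 on Linux; fall back to the number if the constant is missing.
AF_ALG = getattr(socket, "AF_ALG", 38)


def open_af_alg() -> Optional[socket.socket]:
    """Return an AF_ALG socket, or None if the kernel or a filter refuses it."""
    try:
        return socket.socket(AF_ALG, socket.SOCK_SEQPACKET, 0)
    except OSError as exc:
        # A seccomp denial and a kernel without AF_ALG both surface as OSError.
        print(f"AF_ALG unavailable ({exc}); skipping")
        return None
```

A caller can then test the return value for `None` and skip the crypto-API path instead of crashing.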
@justincormack
Contributor Author

@ihac ah yes, I forgot about that. Hmm, this basically makes this patch useless as we can't really block socketcall at present as many programs are still compiled with it.

@ijc
Contributor

ijc commented Apr 26, 2017

On the plus side socketcall never existed for x86-64 or ARM according to socketcall(2). Looking at the kernel source it looks like only a minority of arches had or have it:

  • arm oabi (not the modern one)
  • blackfin
  • cris
  • frv
  • m32r
  • m68k
  • microblaze
  • mips32+64
  • mn10300
  • s390
  • sh
  • sparc
  • x86-32.

Not sure what the intersection of that with Moby's set of supported arches is; at least x86-32, mips and s390, I think? I don't know about the others. Even for those, I didn't check whether they also have the split calls and, if so, which versions of glibc for those platforms used which interface.

So I don't think it renders the patch quite "useless" at least for a number of interesting platforms (x86-64, arm64). Although I suppose with CONFIG_COMPAT there are paths to socketcall even on an x86-64 host.

@justincormack
Contributor Author

yes, as we allow 32 bit calls on 64 bit by default, and the glibc changes for supporting non-socketcall on 32 bit x86 are recent (and never done for Musl), we can't disable it, so the aim of removing dangerous kernel paths for exploits is impossible on amd64. The same applies to s390/s390x, which changed at the same time.

justincormack added a commit to justincormack/docker that referenced this pull request May 9, 2017
This reverts commit 7e3a596.

Unfortunately, it was pointed out in moby#29076 (comment)
that the `socketcall` syscall takes a pointer to a struct so it is not possible to
use seccomp profiles to filter it. This means these cannot be blocked as you can
use `socketcall` to call them regardless, as we currently allow 32 bit syscalls.

Users who wish to block these should use a seccomp profile that blocks all
32 bit syscalls and then just block the non socketcall versions.

Signed-off-by: Justin Cormack <[email protected]>
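
The limitation described in this revert message can be pictured with a small model. It is a simplification: a real filter is handed a `struct seccomp_data` holding the syscall number and raw argument registers, while `FilterView`, the helper functions, and the address `0x7000` below are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

AF_INET, AF_VSOCK = 2, 40   # Linux address-family numbers
SOCK_STREAM = 1
SYS_SOCKET = 1              # socketcall(2) call number for socket()


@dataclass(frozen=True)
class FilterView:
    """Simplified stand-in for what a seccomp filter can inspect."""
    syscall: str
    args: Tuple[int, ...]   # raw register values only, never dereferenced


def view_of_socket(family: int, sock_type: int, protocol: int) -> FilterView:
    # Direct socket(): the family is itself a register argument.
    return FilterView("socket", (family, sock_type, protocol))


def view_of_socketcall(call: int, args_address: int) -> FilterView:
    # socketcall(): the filter sees the call number and a pointer value.
    return FilterView("socketcall", (call, args_address))


# Hypothetical user memory: the real arguments live behind the pointer.
memory: Dict[int, List[int]] = {0x7000: [AF_INET, SOCK_STREAM, 0]}
benign = view_of_socketcall(SYS_SOCKET, 0x7000)

memory[0x7000] = [AF_VSOCK, SOCK_STREAM, 0]
blocked_family = view_of_socketcall(SYS_SOCKET, 0x7000)

# The two views are identical, so no rule can separate them.
same_view = benign == blocked_family
```

By contrast, `view_of_socket(AF_INET, SOCK_STREAM, 0)` and `view_of_socket(AF_VSOCK, SOCK_STREAM, 0)` differ in their first argument, which is why the direct `socket` syscall can be filtered by family.
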
@justincormack justincormack deleted the seccomp-socket-to-them branch October 28, 2019 15:09
thaJeztah added a commit to thaJeztah/docker that referenced this pull request Dec 1, 2022
This syncs the seccomp-profile with the latest changes in containerd's
profile, applying the same changes as containerd/containerd@17a9324

Some background from the associated ticket:

> We want to use vsock for guest-host communication on KubeVirt
> (https://github.com/kubevirt/kubevirt). In KubeVirt we run VMs in pods.
>
> However since anyone can just connect from any pod to any VM with the
> default seccomp settings, we cannot limit connection attempts to our
> privileged node-agent.
>
> ### Describe the solution you'd like
> We want to deny the `socket` syscall for the `AF_VSOCK` family by default.
>
> I see in [1] and [2] that AF_VSOCK was actually already blocked for some
> time, but that got reverted since some architectures support the `socketcall`
> syscall which can't be restricted properly. However we are mostly interested
> in `arm64` and `amd64` where limiting `socket` would probably be enough.
>
> ### Additional context
> I know that in theory we could use our own seccomp profiles, but we would want
> to provide security for as many users as possible which use KubeVirt, and there
> it would be very helpful if this protection could be added by being part of the
> DefaultRuntime profile to easily ensure that it is active for all pods [3].
>
> Impact on existing workloads: It is unlikely that this will disturb any existing
> workload, because VSOCK is almost exclusively used for host-guest communication.
> However if someone would still use it: Privileged pods would still be able to
> use `socket` for `AF_VSOCK`, custom seccomp policies could be applied too.
> Further it was already blocked for quite some time and the blockade got lifted
> due to reasons not related to AF_VSOCK.
>
> The PR in KubeVirt which adds VSOCK support for additional context: [4]
>
> [1]: moby#29076 (comment)
> [2]: moby@dcf2632
> [3]: https://kubernetes.io/docs/tutorials/security/seccomp/#enable-the-use-of-runtimedefault-as-the-default-seccomp-profile-for-all-workloads
> [4]: kubevirt/kubevirt#8546

Signed-off-by: Sebastiaan van Stijn <[email protected]>
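
As a sketch of what denying `socket` for `AF_VSOCK` might look like in a JSON seccomp profile, the snippet below builds one rule entry. It is not copied from the referenced commits; the field names (`names`, `action`, `args`, `index`, `value`, `op`) are assumed to follow the layout used by Docker-style profiles, and `deny_socket_family_rule` is a hypothetical helper.

```python
import json

AF_VSOCK = 40  # Linux address-family number for vsock


def deny_socket_family_rule(family: int) -> dict:
    """Build a rule that allows socket() unless argument 0 equals `family`."""
    return {
        "names": ["socket"],
        "action": "SCMP_ACT_ALLOW",
        "args": [{"index": 0, "value": family, "op": "SCMP_CMP_NE"}],
    }


rule = deny_socket_family_rule(AF_VSOCK)
print(json.dumps(rule, indent=2))
```

The rule only has an effect if the profile's default action denies unmatched calls and `socket` is not also listed in an unconditional allow entry. It does nothing for `socketcall`, which is the gap discussed earlier in this thread.
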
thaJeztah added a commit to thaJeztah/docker that referenced this pull request Dec 1, 2022
This syncs the seccomp-profile with the latest changes in containerd's
profile, applying the same changes as containerd/containerd@17a9324

Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 57b2290)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
thaJeztah added a commit to thaJeztah/docker that referenced this pull request Dec 1, 2022
This syncs the seccomp-profile with the latest changes in containerd's
profile, applying the same changes as containerd/containerd@17a9324

Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 57b2290)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
thaJeztah pushed a commit to moby/profiles that referenced this pull request Jul 22, 2025
This reverts commit 74e1d6c.

Unfortunately, it was pointed out in moby/moby#29076 (comment)
that the `socketcall` syscall takes a pointer to a struct so it is not possible to
use seccomp profiles to filter it. This means these cannot be blocked as you can
use `socketcall` to call them regardless, as we currently allow 32 bit syscalls.

Users who wish to block these should use a seccomp profile that blocks all
32 bit syscalls and then just block the non socketcall versions.

Signed-off-by: Justin Cormack <[email protected]>
thaJeztah added a commit to moby/profiles that referenced this pull request Jul 22, 2025
This syncs the seccomp-profile with the latest changes in containerd's
profile, applying the same changes as containerd/containerd@17a9324

Some background from the associated ticket:

> We want to use vsock for guest-host communication on KubeVirt
> (https://github.com/kubevirt/kubevirt). In KubeVirt we run VMs in pods.
>
> However since anyone can just connect from any pod to any VM with the
> default seccomp settings, we cannot limit connection attempts to our
> privileged node-agent.
>
> ### Describe the solution you'd like
> We want to deny the `socket` syscall for the `AF_VSOCK` family by default.
>
> I see in [1] and [2] that AF_VSOCK was actually already blocked for some
> time, but that got reverted since some architectures support the `socketcall`
> syscall which can't be restricted properly. However we are mostly interested
> in `arm64` and `amd64` where limiting `socket` would probably be enough.
>
> ### Additional context
> I know that in theory we could use our own seccomp profiles, but we would want
> to provide security for as many users as possible which use KubeVirt, and there
> it would be very helpful if this protection could be added by being part of the
> DefaultRuntime profile to easily ensure that it is active for all pods [3].
>
> Impact on existing workloads: It is unlikely that this will disturb any existing
> workload, because VSOCK is almost exclusively used for host-guest communication.
> However if someone would still use it: Privileged pods would still be able to
> use `socket` for `AF_VSOCK`, custom seccomp policies could be applied too.
> Further it was already blocked for quite some time and the blockade got lifted
> due to reasons not related to AF_VSOCK.
>
> The PR in KubeVirt which adds VSOCK support for additional context: [4]
>
> [1]: moby/moby#29076 (comment)
> [2]: moby/moby@d82b7d9
> [3]: https://kubernetes.io/docs/tutorials/security/seccomp/#enable-the-use-of-runtimedefault-as-the-default-seccomp-profile-for-all-workloads
> [4]: kubevirt/kubevirt#8546

Signed-off-by: Sebastiaan van Stijn <[email protected]>