[release/1.3] backport seccomp profile updates#4561

Merged
mxpv merged 14 commits into containerd:release/1.3 from thaJeztah:1.3_backport_seccomp_updates
Sep 14, 2020

@thaJeztah
Member

Just like #4503 - updating the seccomp profile with the latest changes
relates to / addresses #4535

Backports of:

thaJeztah and others added 14 commits September 14, 2020 16:17
Relates to https://patchwork.kernel.org/patch/10756415/

Added to whitelist:

- `clock_getres_time64` (equivalent of `clock_getres`, which was whitelisted)
- `clock_gettime64` (equivalent of `clock_gettime`, which was whitelisted)
- `clock_nanosleep_time64` (equivalent of `clock_nanosleep`, which was whitelisted)
- `futex_time64` (equivalent of `futex`, which was whitelisted)
- `io_pgetevents_time64` (equivalent of `io_pgetevents`, which was whitelisted)
- `mq_timedreceive_time64` (equivalent of `mq_timedreceive`, which was whitelisted)
- `mq_timedsend_time64` (equivalent of `mq_timedsend`, which was whitelisted)
- `ppoll_time64` (equivalent of `ppoll`, which was whitelisted)
- `pselect6_time64` (equivalent of `pselect6`, which was whitelisted)
- `recvmmsg_time64` (equivalent of `recvmmsg`, which was whitelisted)
- `rt_sigtimedwait_time64` (equivalent of `rt_sigtimedwait`, which was whitelisted)
- `sched_rr_get_interval_time64` (equivalent of `sched_rr_get_interval`, which was whitelisted)
- `semtimedop_time64` (equivalent of `semtimedop`, which was whitelisted)
- `timer_gettime64` (equivalent of `timer_gettime`, which was whitelisted)
- `timer_settime64` (equivalent of `timer_settime`, which was whitelisted)
- `timerfd_gettime64` (equivalent of `timerfd_gettime`, which was whitelisted)
- `timerfd_settime64` (equivalent of `timerfd_settime`, which was whitelisted)
- `utimensat_time64` (equivalent of `utimensat`, which was whitelisted)

Not added to whitelist:

- `clock_adjtime64` (equivalent of `clock_adjtime`, which was not whitelisted)
- `clock_settime64` (equivalent of `clock_settime`, which was not whitelisted)
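
The rule behind the two lists above (a `time64` variant is allowed only when its legacy counterpart already is) can be sketched as follows. This is an illustrative model, not containerd's actual profile code; the function name `time64_additions` and the sample allowlist are hypothetical, while the syscall pairs are copied from the lists above.

```python
# Legacy syscall -> time64 variant, as listed in this commit message.
TIME64_VARIANTS = {
    "clock_getres": "clock_getres_time64",
    "clock_gettime": "clock_gettime64",
    "clock_nanosleep": "clock_nanosleep_time64",
    "futex": "futex_time64",
    "io_pgetevents": "io_pgetevents_time64",
    "mq_timedreceive": "mq_timedreceive_time64",
    "mq_timedsend": "mq_timedsend_time64",
    "ppoll": "ppoll_time64",
    "pselect6": "pselect6_time64",
    "recvmmsg": "recvmmsg_time64",
    "rt_sigtimedwait": "rt_sigtimedwait_time64",
    "sched_rr_get_interval": "sched_rr_get_interval_time64",
    "semtimedop": "semtimedop_time64",
    "timer_gettime": "timer_gettime64",
    "timer_settime": "timer_settime64",
    "timerfd_gettime": "timerfd_gettime64",
    "timerfd_settime": "timerfd_settime64",
    "utimensat": "utimensat_time64",
    "clock_adjtime": "clock_adjtime64",
    "clock_settime": "clock_settime64",
}


def time64_additions(allowed):
    """Return the time64 variants whose legacy counterpart is already allowed."""
    return sorted(new for old, new in TIME64_VARIANTS.items() if old in allowed)


# Hypothetical allowlist subset: clock_settime is absent, so
# clock_settime64 is not proposed either.
sample_allowed = {"clock_gettime", "futex", "ppoll"}
additions = time64_additions(sample_allowed)
```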

Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 9529c69)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
This only allows making the syscall. CAP_SYS_TIME is still required
for time adjustment (enforced by the kernel):

```
kernel/time/posix-timers.c:

SYSCALL_DEFINE2(clock_adjtime, const clockid_t, which_clock,
                struct __kernel_timex __user *, utx)
...
        err = do_clock_adjtime(which_clock, &ktx);

int do_clock_adjtime(const clockid_t which_clock, struct __kernel_timex * ktx)
{
...
        return kc->clock_adj(which_clock, ktx);

static const struct k_clock clock_realtime = {
...
        .clock_adj              = posix_clock_realtime_adj,

static int posix_clock_realtime_adj(const clockid_t which_clock,
                                    struct __kernel_timex *t)
{
        return do_adjtimex(t);

kernel/time/timekeeping.c:

int do_adjtimex(struct __kernel_timex *txc)
{
...
        /* Validate the data before disabling interrupts */
        ret = timekeeping_validate_timex(txc);

static int timekeeping_validate_timex(const struct __kernel_timex *txc)
{
        if (txc->modes & ADJ_ADJTIME) {
...
                if (!(txc->modes & ADJ_OFFSET_READONLY) &&
                    !capable(CAP_SYS_TIME))
                        return -EPERM;
        } else {
                /* In order to modify anything, you gotta be super-user! */
                if (txc->modes && !capable(CAP_SYS_TIME))
                        return -EPERM;

```
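
To make the effect of that capability check concrete, here is a small Python model of only the branches quoted in the excerpt. It is a sketch, not the kernel's code: the function name `validate_timex_modes` is hypothetical, and the flag values are assumed from the kernel's timex headers rather than taken from this commit message.

```python
import errno

# Flag values assumed from the kernel's timex headers (not quoted above).
ADJ_OFFSET = 0x0001
ADJ_OFFSET_READONLY = 0x2000
ADJ_ADJTIME = 0x8000


def validate_timex_modes(modes, has_cap_sys_time):
    """Model only the CAP_SYS_TIME checks shown in the excerpt above."""
    if modes & ADJ_ADJTIME:
        # adjtime-style call: only the read-only form is allowed without the capability.
        if not (modes & ADJ_OFFSET_READONLY) and not has_cap_sys_time:
            return -errno.EPERM
    else:
        # adjtimex-style call: any non-zero mode is a modification.
        if modes and not has_cap_sys_time:
            return -errno.EPERM
    return 0
```

Under this model a read-only query (`modes == 0`) succeeds without `CAP_SYS_TIME`, while any request that would modify the clock is rejected with `-EPERM`, which is the behaviour the commit relies on.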

Fixes: moby/moby#40919
Signed-off-by: Stanislav Levin <[email protected]>
Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 5765991)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
query_module(2) is only in kernels before Linux 2.6.

Signed-off-by: Kenta Tada <[email protected]>
(cherry picked from commit 0375582)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
Restartable Sequences (rseq) are a kernel-based mechanism for fast
update operations on per-core data in user-space. Some libraries, like
the newest version of Google's TCMalloc, depend on it [1].

This also makes Docker's default seccomp profile on par with systemd's,
which enabled 'rseq' in early 2019 [2].

1: https://google.github.io/tcmalloc/design.html
2: systemd/systemd@6fee3be

Signed-off-by: Florian Schmaus <[email protected]>
(cherry picked from commit e977564)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
Signed-off-by: Michael Crosby <[email protected]>
(cherry picked from commit 0f83109)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
…SYSLOG

This call is what is used to implement `dmesg` to get kernel messages
about the host. This can leak substantial information about the host.
It is normally available to unprivileged users on the host, unless
the sysctl `kernel.dmesg_restrict = 1` is set, but this is not set
by default on the majority of distributions. Blocking this to restrict
leaks about the configuration seems correct.

Relates to moby/moby#37897 "docker exposes dmesg to containers by default"

See also https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html

Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 267a0cf)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 7e7545e)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
Enabled adjtimex in the default profile without requiring CAP_SYS_TIME privilege.
The kernel will check CAP_SYS_TIME and won't allow setting the time.

Fixes: Getting the system time with ntptime returns an error in an unprivileged
container

To verify, inside a CentOS 7 container:

    yum install -y ntp
    ntptime
    # ntp_gettime() returns code 0 (OK)

    ntpdate -v time.nist.gov
    # ntpdate[84]: Can't adjust the time of day: Operation not permitted

Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 1746a19)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
Add the membarrier syscall to the default seccomp profile.
It is for example used in the implementation of dlopen() in
the musl libc of Alpine images.

Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit fc9e5d1)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
From personality(2):

    Have uname(2) report a 2.6.40+ version number rather than a 3.x version
    number.  Added as a stopgap measure to support broken applications that
    could not handle the  kernel  version-numbering  switch  from 2.6.x to 3.x.

This allows both "UNAME26|PER_LINUX" and "UNAME26|PER_LINUX32".
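
As an illustration, the two `personality(2)` argument values named above can be computed like this. The constants are assumed from `linux/personality.h` (they are not quoted in this commit message), and `personality_allowed` is a hypothetical helper that models only these two combinations, not whatever else the profile permits.

```python
# Constants assumed from linux/personality.h.
PER_LINUX = 0x0000
PER_LINUX32 = 0x0008
UNAME26 = 0x0020000

# The two combinations this commit describes.
UNAME26_PERSONAS = {UNAME26 | PER_LINUX, UNAME26 | PER_LINUX32}


def personality_allowed(persona):
    """Return True if persona is one of the two UNAME26 combinations."""
    return persona in UNAME26_PERSONAS


values_hex = sorted(hex(v) for v in UNAME26_PERSONAS)
```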

Fixes: "setarch broken in docker packages from Debian stretch"

Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 117d678)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
On a ppc64le host, running postgres (tried with 9.4 to 9.6) gives the following
warning when trying to flush data to disks (which happens very frequently):

     WARNING: could not flush dirty data: Operation not permitted.

A quick dig in the postgres source code indicates it uses sync_file_range(2) to
flush data, which on ppc64le and arm64 is translated to sync_file_range2(2)
for alignment reasons.

The profile did not allow sync_file_range2(2), making postgres sad because
it cannot flush its buffers. arm_sync_file_range(2) is an ancient alias to
sync_file_range2(2), the syscall was renamed in Linux 2.6.22 when the same
syscall was added for PowerPC.
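
The alignment issue can be pictured with a simplified register-counting model. It assumes an ABI in which 64-bit arguments must start on an even-numbered 32-bit register, as the sync_file_range(2) man page describes for some architectures. The function `registers_needed` and the word-size lists are illustrative assumptions, not kernel code.

```python
def registers_needed(arg_words):
    """Count 32-bit registers used when 64-bit args must start on an even register."""
    reg = 0
    for words in arg_words:
        if words == 2 and reg % 2:
            reg += 1  # padding register to align the 64-bit pair
        reg += words
    return reg


# Argument sizes in 32-bit words.
SYNC_FILE_RANGE = [1, 2, 2, 1]   # fd, offset, nbytes, flags
SYNC_FILE_RANGE2 = [1, 1, 2, 2]  # fd, flags, offset, nbytes

legacy_regs = registers_needed(SYNC_FILE_RANGE)
reordered_regs = registers_needed(SYNC_FILE_RANGE2)
```

In this model the original argument order wastes one register as padding after `fd`, while the reordered `sync_file_range2` layout packs without padding.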

Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 5862285)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
This allows the quotactl syscall in the default seccomp profile, gated by
CAP_SYS_ADMIN.
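
A capability-gated rule of this kind can be sketched as below. The rule layout (`names`, `requires_cap`) and the function `allowed_syscalls` are hypothetical and only model the idea; they are not containerd's actual data structures.

```python
def allowed_syscalls(rules, capabilities):
    """Collect syscalls from rules whose required capability (if any) is held."""
    allowed = set()
    for rule in rules:
        required = rule.get("requires_cap")
        if required is None or required in capabilities:
            allowed.update(rule["names"])
    return allowed


# Hypothetical, heavily shortened rule set.
RULES = [
    {"names": ["read", "write"], "requires_cap": None},
    {"names": ["quotactl"], "requires_cap": "CAP_SYS_ADMIN"},
]

unprivileged = allowed_syscalls(RULES, set())
privileged = allowed_syscalls(RULES, {"CAP_SYS_ADMIN"})
```

With this shape, a container started without `CAP_SYS_ADMIN` never gets `quotactl` in its filter, while one granted the capability does.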

Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 5cdb6e8)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 0a5ee7e)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
Adds the io_uring-related system calls introduced in kernel 5.1 to the
seccomp whitelist. With older kernels or older versions of libseccomp,
this configuration will be omitted.

Note that io_uring will grow support for more syscalls in the future
so we should keep an eye on this.
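
The "omitted on older versions" behaviour can be sketched as a lookup that skips names the local syscall table does not know instead of failing. Everything here is an assumption for illustration: the helper `build_allowlist`, the two tables, and the io_uring syscall names and numbers, which are not listed in this commit message.

```python
def build_allowlist(names, resolve):
    """Keep only names the resolver knows; unknown names are skipped, not fatal."""
    kept = []
    for name in names:
        if resolve(name) is not None:
            kept.append(name)
    return kept


# Assumed io_uring syscall names (not spelled out in the commit message).
IO_URING = ["io_uring_setup", "io_uring_enter", "io_uring_register"]

# Hypothetical syscall tables: one predating io_uring, one that knows it.
OLD_TABLE = {"read": 0, "write": 1}
NEW_TABLE = {**OLD_TABLE, "io_uring_setup": 425, "io_uring_enter": 426,
             "io_uring_register": 427}

on_old = build_allowlist(["read"] + IO_URING, OLD_TABLE.get)
on_new = build_allowlist(["read"] + IO_URING, NEW_TABLE.get)
```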

Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 325bac7)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
@theopenlab-ci

theopenlab-ci Bot commented Sep 14, 2020

Build succeeded.

Member

@estesp estesp left a comment

LGTM

@mxpv mxpv merged commit 02d93ad into containerd:release/1.3 Sep 14, 2020
@thaJeztah thaJeztah deleted the 1.3_backport_seccomp_updates branch September 14, 2020 17:07