[release/1.3] backport seccomp profile updates#4561
Merged
mxpv merged 14 commits intocontainerd:release/1.3from Sep 14, 2020
Merged
[release/1.3] backport seccomp profile updates#4561mxpv merged 14 commits intocontainerd:release/1.3from
mxpv merged 14 commits intocontainerd:release/1.3from
Conversation
Relates to https://patchwork.kernel.org/patch/10756415/ Added to whitelist: - `clock_getres_time64` (equivalent of `clock_getres`, which was whitelisted) - `clock_gettime64` (equivalent of `clock_gettime`, which was whitelisted) - `clock_nanosleep_time64` (equivalent of `clock_nanosleep`, which was whitelisted) - `futex_time64` (equivalent of `futex`, which was whitelisted) - `io_pgetevents_time64` (equivalent of `io_pgetevents`, which was whitelisted) - `mq_timedreceive_time64` (equivalent of `mq_timedreceive`, which was whitelisted) - `mq_timedsend_time64 ` (equivalent of `mq_timedsend`, which was whitelisted) - `ppoll_time64` (equivalent of `ppoll`, which was whitelisted) - `pselect6_time64` (equivalent of `pselect6`, which was whitelisted) - `recvmmsg_time64` (equivalent of `recvmmsg`, which was whitelisted) - `rt_sigtimedwait_time64` (equivalent of `rt_sigtimedwait`, which was whitelisted) - `sched_rr_get_interval_time64` (equivalent of `sched_rr_get_interval`, which was whitelisted) - `semtimedop_time64` (equivalent of `semtimedop`, which was whitelisted) - `timer_gettime64` (equivalent of `timer_gettime`, which was whitelisted) - `timer_settime64` (equivalent of `timer_settime`, which was whitelisted) - `timerfd_gettime64` (equivalent of `timerfd_gettime`, which was whitelisted) - `timerfd_settime64` (equivalent of `timerfd_settime`, which was whitelisted) - `utimensat_time64` (equivalent of `utimensat`, which was whitelisted) Not added to whitelist: - `clock_adjtime64` (equivalent of `clock_adjtime`, which was not whitelisted) - `clock_settime64` (equivalent of `clock_settime`, which was not whitelisted) Signed-off-by: Sebastiaan van Stijn <[email protected]> (cherry picked from commit 9529c69) Signed-off-by: Sebastiaan van Stijn <[email protected]>
This only allows making the syscall. CAP_SYS_TIME is still required
for time adjustment (enforced by the kernel):
```
kernel/time/posix-timers.c:
1112 SYSCALL_DEFINE2(clock_adjtime, const clockid_t, which_clock,
1113 struct __kernel_timex __user *, utx)
...
1121 err = do_clock_adjtime(which_clock, &ktx);
1100 int do_clock_adjtime(const clockid_t which_clock, struct __kernel_timex * ktx)
1101 {
...
1109 return kc->clock_adj(which_clock, ktx);
1299 static const struct k_clock clock_realtime = {
...
1304 .clock_adj = posix_clock_realtime_adj,
188 static int posix_clock_realtime_adj(const clockid_t which_clock,
189 struct __kernel_timex *t)
190 {
191 return do_adjtimex(t);
kernel/time/timekeeping.c:
2312 int do_adjtimex(struct __kernel_timex *txc)
2313 {
...
2321 /* Validate the data before disabling interrupts */
2322 ret = timekeeping_validate_timex(txc);
2246 static int timekeeping_validate_timex(const struct __kernel_timex *txc)
2247 {
2248 if (txc->modes & ADJ_ADJTIME) {
...
2252 if (!(txc->modes & ADJ_OFFSET_READONLY) &&
2253 !capable(CAP_SYS_TIME))
2254 return -EPERM;
2255 } else {
2256 /* In order to modify anything, you gotta be super-user! */
2257 if (txc->modes && !capable(CAP_SYS_TIME))
2258 return -EPERM;
```
Fixes: moby/moby 40919
Signed-off-by: Stanislav Levin <[email protected]>
Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 5765991)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
query_module(2) is only in kernels before Linux 2.6. Signed-off-by: Kenta Tada <[email protected]> (cherry picked from commit 0375582) Signed-off-by: Sebastiaan van Stijn <[email protected]>
Restartable Sequences (rseq) are a kernel-based mechanism for fast update operations on per-core data in user-space. Some libraries, like the newest version of Google's TCMalloc, depend on it [1]. This also makes dockers default seccomp profile on par with systemd's, which enabled 'rseq' in early 2019 [2]. 1: https://google.github.io/tcmalloc/design.html 2: systemd/systemd@6fee3be Signed-off-by: Florian Schmaus <[email protected]> (cherry picked from commit e977564) Signed-off-by: Sebastiaan van Stijn <[email protected]>
Signed-off-by: Michael Crosby <[email protected]> (cherry picked from commit 0f83109) Signed-off-by: Sebastiaan van Stijn <[email protected]>
…SYSLOG This call is what is used to implement `dmesg` to get kernel messages about the host. This can leak substantial information about the host. It is normally available to unprivileged users on the host, unless the sysctl `kernel.dmesg_restrict = 1` is set, but this is not set by standard on the majority of distributions. Blocking this to restrict leaks about the configuration seems correct. Relates to moby/moby#37897 "docker exposes dmesg to containers by default" See also https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html Signed-off-by: Sebastiaan van Stijn <[email protected]> (cherry picked from commit 267a0cf) Signed-off-by: Sebastiaan van Stijn <[email protected]>
Signed-off-by: Sebastiaan van Stijn <[email protected]> (cherry picked from commit 7e7545e) Signed-off-by: Sebastiaan van Stijn <[email protected]>
Enabled adjtimex in the default profile without requiring CAP_SYS_TIME privilege.
The kernel will check CAP_SYS_TIME and won't allow setting the time.
Fixes: Getting the system time with ntptime returns an error in an unprivileged
container
To verify, inside a CentOS 7 container:
yum install -y ntp
ntptime
# ntp_gettime() returns code 0 (OK)
ntpdate -v time.nist.gov
# ntpdate[84]: Can't adjust the time of day: Operation not permitted
Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 1746a19)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
Add the membarrier syscall to the default seccomp profile. It is for example used in the implementation of dlopen() in the musl libc of Alpine images. Signed-off-by: Sebastiaan van Stijn <[email protected]> (cherry picked from commit fc9e5d1) Signed-off-by: Sebastiaan van Stijn <[email protected]>
From personality(2):
Have uname(2) report a 2.6.40+ version number rather than a 3.x version
number. Added as a stopgap measure to support broken applications that
could not handle the kernel version-numbering switch from 2.6.x to 3.x.
This allows both "UNAME26|PER_LINUX" and "UNAME26|PER_LINUX32".
Fixes: "setarch broken in docker packages from Debian stretch"
Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 117d678)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
On a ppc64le host, running postgres (tried with 9.4 to 9.6) gives the following
warning when trying to flush data to disks (which happens very frequently):
WARNING: could not flush dirty data: Operation not permitted.
A quick dig in postgres source code indicate it uses sync_file_range(2) to
flush data; which on ppe64le and arm64 is translated to sync_file_range2(2)
for alignements reasons.
The profile did not allow sync_file_range2(2), making postgres sad because
it can not flush its buffers. arm_sync_file_range(2) is an ancient alias to
sync_file_range2(2), the syscall was renamed in Linux 2.6.22 when the same
syscall was added for PowerPC.
Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 5862285)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
This allows the quotactl syscall in the default seccomp profile, gated by CAP_SYS_ADMIN. Signed-off-by: Sebastiaan van Stijn <[email protected]> (cherry picked from commit 5cdb6e8) Signed-off-by: Sebastiaan van Stijn <[email protected]>
Signed-off-by: Sebastiaan van Stijn <[email protected]> (cherry picked from commit 0a5ee7e) Signed-off-by: Sebastiaan van Stijn <[email protected]>
Adds the io-uring related system call introduced in kernel 5.1 to the seccomp whitelist. With older kernels or older versions of libseccomp, this configure will be omitted. Note that io_uring will grow support for more syscalls in the future so we should keep an eye on this. Signed-off-by: Sebastiaan van Stijn <[email protected]> (cherry picked from commit 325bac7) Signed-off-by: Sebastiaan van Stijn <[email protected]>
|
Build succeeded.
|
mxpv
approved these changes
Sep 14, 2020
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Just like #4503 - updating the seccomp profile with the latest changes
relates to / addresses #4535
Backports of:
clock_adjtime#4264 seccomp: Whitelistclock_adjtime