This repository was archived by the owner on Nov 15, 2025. It is now read-only.

@iaguis iaguis commented Dec 11, 2020

This PR adds the RestrictFileSystems= property. When used, processes
belonging to a service are only able to access the filesystems listed in the
property.

This is implemented by attaching a BPF program to the file_open BPF LSM hook.
The program is attached at boot time and stays there forever. Then, when a
service specifying the RestrictFileSystems= property is started, an entry is
added to a global hash of maps BPF map pinned to the BPF filesystem under
/sys/fs/bpf/systemd/lsm_bpf_map. The map stores a set of filesystem
magic numbers per cgroupID. When a process tries to open a file, the BPF
program runs and looks up the cgroup the process is running in. If the global
map has an entry for that cgroup, the program checks whether the filesystem
being accessed is in the set and denies access if it is not.

RestrictFileSystems= is only supported on systems with the LSM BPF hook
enabled and using cgroup2 (unified or hybrid).

This PR makes use of the libbpf framework proposed in systemd#17655. Like that PR,
it requires clang and llvm at compile time, as well as the
libbpf shared library.

Thanks to the usage of libbpf, the program can use the CO-RE (Compile-Once
Run-Everywhere) technology so it doesn't require kernel headers at runtime to
access internal kernel structures.

@iaguis iaguis requested a review from alban December 11, 2020 17:51
@iaguis iaguis force-pushed the iaguis/lsm-bpf branch 2 times, most recently from 020342d to 85b6a49 Compare December 12, 2020 11:08
@iaguis iaguis force-pushed the iaguis/lsm-bpf branch 4 times, most recently from b81c860 to b9715e7 Compare December 22, 2020 19:28
@github-actions github-actions bot added the mkosi label Dec 22, 2020
angdraug and others added 21 commits December 23, 2020 10:18
Explicitly document the behavior introduced in systemd#7437: when picking a new
UID shift base with "-U", a hash of the machine name will be tried
before falling back to fully random UID base candidates.
This commit adds support for disabling the read and write
workqueues with the new crypttab options no-read-workqueue
and no-write-workqueue. These correspond to the cryptsetup
options --perf-no_read_workqueue and --perf-no_write_workqueue
respectively.
IPv6 privacy extensions are plural, not singular.
When set to "kernel", systemd is not supposed to touch that sysctl.

5e0534f, part of
systemd#17240 forgot to handle that
case.

Fixes systemd#18003
…kernel

network: fix IPv6PrivacyExtensions=kernel
When a service fails to start, systemd suggests that the user run
"journalctl -xe" to get details about the failure. While running this
command does provide some additional details, most of the information is
similar to what was already printed when the service failed.

Often the actual reason for the failure can be found in the logs of the
service that fails to start.

This patch updates the wording to suggest using "-u" to view the service
logs instead.

Signed-off-by: Sebastiaan van Stijn <[email protected]>
…address

DenyList= filters provided prefixes, not the router address.
So, RouteDenyList= should do so as well, for consistency.

Fixes 16c89e6.
networkd: add support for prefix allow-list and route allow-list
…_resend()

When compiling with CFLAGS='-Werror=maybe-uninitialized -Og' we get a
warning about uninitialized "next_timeout" variable.

Avoid the warning by adding an (unreachable) "default" label.

Fixes: c24288d ("sd-dhcp-client: correct dhcpv4 renew/rebind retransmit timeouts")
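
For illustration, the pattern described above can be sketched as follows. This is a minimal sketch, not the actual sd-dhcp-client code: the `ClientState` enum and `pick_next_timeout()` are hypothetical names, and only the `next_timeout` variable comes from the commit message. Without the `default` label, a compiler building with `-Og` may be unable to prove that every path through the switch assigns the variable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical client states, named here only for illustration. */
typedef enum {
        STATE_RENEWING,
        STATE_REBINDING,
} ClientState;

/* Returns the retransmit timeout for the given state. */
uint64_t pick_next_timeout(ClientState state, uint64_t t1, uint64_t t2) {
        uint64_t next_timeout;

        switch (state) {
        case STATE_RENEWING:
                next_timeout = t1;
                break;
        case STATE_REBINDING:
                next_timeout = t2;
                break;
        default:
                /* Unreachable for valid enum values, but it shows the
                 * compiler that no path leaves next_timeout unset. */
                assert(!"unexpected client state");
                return 0;
        }

        return next_timeout;
}
```

The explicit `return 0` keeps the function well-defined even when assertions are compiled out with `NDEBUG`.
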
Let's link the three man pages together more tightly and explain what
the two targets are about, emphasizing local/quick/reliable/approximate
vs remote/slow/unreliable/accurate synchronization.

Follow-up for: 1431b2f fe934b4
man: extend time-{set,sync}.target + systemd-timesyncd/wait-sync docs
Julia Kartseva and others added 3 commits January 6, 2021 13:40
* Add `build-bpf` feature gate with 'auto', 'true' and 'false' choices
* Add libbpf [0] dependency
* Search for clang and llc binaries in the build environment.

For libbpf [0], make 0.2.0 [1] the minimum required version.
If libbpf is satisfied, set HAVE_LIBBPF config option to 1.

If the `build-bpf` feature gate is set to 'auto', the feature is enabled
only when all of libbpf, clang and llvm are present in the build
environment. With 'auto' all dependencies are optional.
If the gate is set to `true`, all of the libbpf, clang and llvm
dependencies become mandatory.
If it's set to `false`, `BUILD_BPF` is set to false and the libbpf
dependency stays optional.

The libbpf dependency is dynamic, following the common pattern in systemd.
find_program doesn't allow setting a minimum version the way the
`dependency` option does. The most recent BPF features include BTF, which
requires LLVM 10 at minimum; the allow_bind program doesn't use BTF features
and builds with clang and llvm 9.0.
Introduce a minimalistic set of helpers for bpf programs compiled from
restricted C sources.

Introduce a basic type `struct BPFProgramV2` with 'fd'
and 'attach_type' fields to represent a loaded bpf program.
The BPFProgram struct is not used since:
- v2 methods will use libbpf while v1 methods use raw syscalls
- the majority of its fields are not needed to support BPF programs
compiled from sources
- it lacks an 'attach_type' field

Introduce bpf_object_{} helpers to load bpf programs into kernel, resize
and populate bpf maps, attach program to cgroup hooks.

libbpf dependency must be satisfied to compile the code.
bpf_object_set_inner_map_fd is needed for hash of maps BPF maps, and
bpf_object_find_program_by_title is needed because
bpf_object_get_programs() doesn't return LSM BPF programs, so we need to
look the program up by title.
iaguis added 20 commits January 6, 2021 17:10
They were failing in the CI.
Returns the magic number for each filesystem.
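
A name-to-magic lookup of this kind could look roughly like the sketch below. The helper name `fs_name_to_magic()` and the table layout are assumptions made for illustration and are not taken from the PR; the four magic values are the ones defined in `<linux/magic.h>`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A few well-known superblock magic numbers (values as in <linux/magic.h>). */
static const struct {
        const char *name;
        uint32_t magic;
} fs_magic_table[] = {
        { "btrfs", 0x9123683EU },
        { "ext4",  0x0000EF53U },
        { "tmpfs", 0x01021994U },
        { "xfs",   0x58465342U },
};

/* Hypothetical helper: stores the magic number for name in *ret and
 * returns 0, or returns -1 if the filesystem name is not in the table. */
int fs_name_to_magic(const char *name, uint32_t *ret) {
        size_t n = sizeof(fs_magic_table) / sizeof(fs_magic_table[0]);

        for (size_t i = 0; i < n; i++)
                if (strcmp(fs_magic_table[i].name, name) == 0) {
                        *ret = fs_magic_table[i].magic;
                        return 0;
                }

        return -1;
}
```

A real table would cover many more filesystems, and some filesystems share a magic number or use several, which this sketch does not model.
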
It hooks into the file_open LSM hook and allows the open only when the
filesystem where the open will take place is present in a BPF map for a
particular cgroup.

The BPF map used is a hash of maps with the following structure:

    cgroupID -> (s_magic -> uint32)

The inner map is effectively a set.

When the cgroupID is present in the map, it checks the inner map for the
magic number of the filesystem associated with the file that's being
opened. If that magic number is present it allows the open to succeed,
otherwise it returns -EPERM.

If the cgroupID is not present in the map, it allows the open to
succeed.

The BPF program uses CO-RE (Compile-Once Run-Everywhere) to access
internal kernel structures without needing kernel headers present at
runtime.
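
To make the decision logic concrete, here is a plain user-space model of it. This is only a sketch under stated assumptions: it is not the BPF program itself, it replaces the hash of maps with a small array, and `CgroupEntry`, `MAX_ALLOWED` and `check_file_open()` are hypothetical names. What it keeps from the description above is the outcome: allow when the cgroupID has no entry, allow when the magic number is in that cgroup's set, and return -EPERM otherwise.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ALLOWED 8

/* One outer-map entry: a cgroup ID and its set of allowed magic numbers. */
typedef struct {
        uint64_t cgroup_id;
        uint32_t allowed[MAX_ALLOWED];
        size_t n_allowed;
} CgroupEntry;

/* Models the file_open decision: 0 allows the open, -EPERM denies it. */
int check_file_open(const CgroupEntry *entries, size_t n_entries,
                    uint64_t cgroup_id, uint32_t s_magic) {
        for (size_t i = 0; i < n_entries; i++) {
                if (entries[i].cgroup_id != cgroup_id)
                        continue;

                /* The cgroup is restricted: the magic must be in its set. */
                for (size_t j = 0; j < entries[i].n_allowed; j++)
                        if (entries[i].allowed[j] == s_magic)
                                return 0;

                return -EPERM;
        }

        /* No entry for this cgroup: the open is not restricted. */
        return 0;
}
```

For example, with a single entry for cgroup 42 that allows ext4 (0xEF53) and tmpfs (0x01021994), an open on xfs from cgroup 42 is denied, while the same open from any other cgroup succeeds.
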
It uses tools/build-bpf.py to compile the BPF program from the sources.
If systemd#17655 gets merged, there's no need to do this and we can use their
test.

This removes the bpf_object_get_programs() test because LSM programs
are not returned by libbpf.
It returns the cgroupID from a cgroup path.
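
One simple way to approximate such a lookup is sketched below. It rests on an assumption that should be checked for the target kernel: on cgroup2 with a 64-bit kernel, the cgroup ID is generally the inode number of the cgroup directory. The helper name `path_get_inode_id()` is hypothetical, and this is not necessarily how the helper in this PR obtains the ID.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical helper: stores the inode number of path in *ret.
 * Returns 0 on success or a negative errno value on failure. */
int path_get_inode_id(const char *path, uint64_t *ret) {
        struct stat st;

        if (stat(path, &st) < 0)
                return -errno;

        *ret = (uint64_t) st.st_ino;
        return 0;
}
```

Called on a cgroup directory below /sys/fs/cgroup, the value can be compared with the ID the BPF program sees for processes in that cgroup.
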
It didn't reflect the current status.
It will be used later.
They link with libcore and libcore is not using libbpf.
This adds 4 functions to implement RestrictFileSystems=

* lsm_bpf_supported() checks if LSM BPF is supported. It checks that
  cgroupv2 is used, that BPF LSM is enabled, and tries to load the BPF
  LSM program which makes sure BTF and hash of maps are supported, and
  BPF LSM programs can be loaded.
* lsm_bpf_setup() loads and attaches the LSM BPF program.
* bpf_restrict_filesystems() populates the hash of maps BPF map with the
  cgroupID and the set of allowed filesystems.
* cleanup_lsm_bpf() removes a cgroupID entry from the hash of maps.
It attaches the LSM BPF program when the system manager starts up.

It populates the hash of maps BPF map when services that have
RestrictFileSystems= set start.

It cleans up the hash of maps when the unit's cgroup is pruned.
Services only have access to filesystems that are listed here.

Accepts a list of filesystem names.
libbpf is used in core code now, so we need to add it as dependency for
tests.
For distros that ship libbpf 0.2.0.

iaguis commented Jan 6, 2021

There's an upstream PR now. Closing.

@iaguis iaguis closed this Jan 6, 2021
mauriciovasquezbernal pushed a commit that referenced this pull request May 18, 2021
C.f. 9793530.

We'd crash when trying to access an already-deallocated object:

Thread no. 1 (7 frames)
 #2 log_assert_failed_realm at ../src/basic/log.c:844
 #3 event_inotify_data_drop at ../src/libsystemd/sd-event/sd-event.c:3035
 #4 source_dispatch at ../src/libsystemd/sd-event/sd-event.c:3250
 #5 sd_event_dispatch at ../src/libsystemd/sd-event/sd-event.c:3631
 #6 sd_event_run at ../src/libsystemd/sd-event/sd-event.c:3689
 #7 sd_event_loop at ../src/libsystemd/sd-event/sd-event.c:3711
 #8 run at ../src/home/homed.c:47

The source in question is an inotify source, and the messages are:

systemd-homed[1340]: /home/ moved or renamed, recreating watch and rescanning.
systemd-homed[1340]: Assertion '*_head == _item' failed at src/libsystemd/sd-event/sd-event.c:3035, function event_inotify_data_drop(). Aborting.

on_home_inotify() got called, then manager_watch_home(), which unrefs the
existing inotify_event_source. I assume that the source gets dispatched again
because it was still in the pending queue.

I can't reproduce the issue (timing?), but this should
fix systemd#17824, https://bugzilla.redhat.com/show_bug.cgi?id=1899264.
iaguis pushed a commit that referenced this pull request Sep 20, 2023
When exiting PID 1 we most likely don't have stdio/stdout open, so the
final LSan check would not print any actionable information and would
just crash PID 1 leading up to a kernel panic, which is a bit annoying.
Let's instead attempt to open /dev/console, and if we succeed redirect
LSan's report there.

The result is a bit messy, as it's slightly interleaved with the kernel
panic, but it's definitely better than not having the stack trace at
all:

[  OK  ] Reached target final.target.
[  OK  ] Finished systemd-poweroff.service.
[  OK  ] Reached target poweroff.target.

=================================================================
3 1m  43.251782] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[   43.252838] CPU: 2 PID: 1 Comm: systemd Not tainted 6.4.12-200.fc38.x86_64 #1
==[1==ERR O R :4 3Le.a2k53562] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[   43.254683] Call Trace:
[   43.254911]  <TASK>
[   43.255107]  dump_stack_lvl+0x47/0x60
S[ a  43.n2555i05]  panic+t0x192/0x350
izer[   :43.255966 ]  do_exit+0x990/0xdb10
etec[   43.256504]  do_group_exit+0x31/0x80
[   43.256889]  __x64_sys_exit_group+0x18/0x20
[   43.257288]  do_syscall_64+0x60/0x90
o_user_mod leaks[   43.257618]  ? syscall_exit_t

+0x2b/0x40
[   43.258411]  ? do_syscall_64+0x6c/0x90
1mDirect le[   43.258755]  ak of 21 byte(s)? exc_page_fault+0x7f/0x180
[   43.259446]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
 [   43.259901] RiIP: 0033:0x7f357nb8f3ad4
 1 objec[   43.260354] Ctode: 48 89 (f7 0f 05 c3 sf3 0f 1e fa b8 3b 00 00 00) 0f 05 c3 0f 1f 4 0 00 f3 0f 1e fa 50 58 b8 e7 00 00 00 48 83 ec 08 48 63 ff 0f 051
[   43.262581] RSP: 002b:00007ffc25872440 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
a RBX: 00007f357be9b218 RCX: 00007f357b8f3ad4m:ffd
[   43.264512] RDX: 0000000000000001 RSI: 00007f357b933b63 RDI: 0000000000000001
[   43.265355] RBP: 00007f357be9b218 R08: efffffffffffffff R09: 00007ffc258721ef
[   43.266191] R10: 000000000000003f R11: 0000000000000202 R12: 00000fe6ae9e0000
[   43.266891] R13: 00007f3574f00000 R14: 0000000000000000 R15: 0000000000000007
[   43.267517]  </TASK>

    #0 0x7f357b8814a8 in strdup (/lib64/libasan.so.8+0x814a8) (BuildId: e5f0a0d511a659fbc47bf41072869139cb2db47f)
    #1 0x7f3578d43317 in cg_path_decode_unit ../src/basic/cgroup-util.c:1132
    #2 0x7f3578d43936 in cg_path_get_unit ../src/basic/cgroup-util.c:1190
    #3 0x7f3578d440f6 in cg_pid_get_unit ../src/basic/cgroup-util.c:1234
    #4 0x7f35789263d7 in bus_log_caller ../src/shared/bus-util.c:734
    #5 0x7f357a9cf10a in method_reload ../src/core/dbus-manager.c:1621
    #6 0x7f3578f77497 in method_callbacks_run ../src/libsystemd/sd-bus/bus-objects.c:406
    #7 0x7f3578f80dd8 in object_find_and_run ../src/libsystemd/sd-bus/bus-objects.c:1319
    #8 0x7f3578f82487 in bus_process_object ../src/libsystemd/sd-bus/bus-objects.c:1439
    #9 0x7f3578fe41f1 in process_message ../src/libsystemd/sd-bus/sd-bus.c:3007
    #10 0x7f3578fe477b in process_running ../src/libsystemd/sd-bus/sd-bus.c:3049
    #11 0x7f3578fe75d1 in bus_process_internal ../src/libsystemd/sd-bus/sd-bus.c:3269
    #12 0x7f3578fe776e in sd_bus_process ../src/libsystemd/sd-bus/sd-bus.c:3296
    #13 0x7f3578feaedc in io_callback ../src/libsystemd/sd-bus/sd-bus.c:3638
    #14 0x7f35791c2f68 in source_dispatch ../src/libsystemd/sd-event/sd-event.c:4187
    #15 0x7f35791cc6f9 in sd_event_dispatch ../src/libsystemd/sd-event/sd-event.c:4808
    #16 0x7f35791cd830 in sd_event_run ../src/libsystemd/sd-event/sd-event.c:4869
    #17 0x7f357abcd572 in manager_loop ../src/core/manager.c:3244
    #18 0x41db21 in invoke_main_loop ../src/core/main.c:1960
    #19 0x426615 in main ../src/core/main.c:3125
    #20 0x7f3577c49b49 in __libc_start_call_main (/lib64/libc.so.6+0x27b49) (BuildId: 245240a31888ad5c11bbc55b18e02d87388f59a9)
    #21 0x7f3577c49c0a in __libc_start_main_alias_2 (/lib64/libc.so.6+0x27c0a) (BuildId: 245240a31888ad5c11bbc55b18e02d87388f59a9)
    #22 0x408494 in _start (/usr/lib/systemd/systemd+0x408494) (BuildId: fe61e1b0f00b6a36aa34e707a98c15c52f6b960a)

SUMMARY: AddressSanitizer: 21 byte(s) leaked in 1 allocation(s).
[   43.295912] Kernel Offset: 0x7000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   43.297036] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100 ]---

Originally noticed in systemd#28579.