systemd version the issue has been seen with
255.2-3ubuntu2
Used distribution
Ubuntu Noble 24.04
Linux kernel version used
6.6.0-14-generic
CPU architectures issue was seen on
x86_64
Component
systemd, systemd-networkd, systemd-udevd
Expected behaviour you didn't see
When mounting a filesystem or creating a tmpfs mount, systemd should remain responsive and service important messages on its sockets, such as requests to restart services or renew DHCP leases, in a timely fashion.
Unexpected behaviour you saw
This issue was first seen on a cloud instance where containerd was mounting and unmounting a tmpfs filesystem every second or so while collecting telemetry.
The system had about 3000 stale mountpoints of the root filesystem under /home, left behind by a mountpoint leak in osqueryd. When the new tmpfs mounts were created, systemd ran at 100% CPU and was completely blocked. It appears to parse the udev entries for each of the 3000 mountpoints during every tmpfs mount, which takes considerable time. While systemd is blocked, you cannot restart or query the status of services, and the DHCP lease can expire without being renewed, leading to loss of network connectivity.
This is not a bug per se, given the large number of mounts involved; the issue is rather that systemd is blocked and messages sent to its sockets are not consumed, leading to problems such as DHCP lease expiry.
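For reference, the number of stale mountpoints can be counted directly from /proc/self/mountinfo; the snippet below is an illustrative check (matching mountpoints under /home, as in this incident), not part of the original report:

```shell
# Field 5 of /proc/self/mountinfo is the mount point; count entries
# under /home. Prints 0 when there are none.
awk '$5 ~ "^/home/" {n++} END {print n+0}' /proc/self/mountinfo
```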
Is it strictly necessary to parse all these udev entries for every tmpfs mount?
Would it be possible to optimise the mount path to scan the udev entries in chunks, breaking in between chunks so that systemd can service events such as DHCP lease renewal in a timely fashion, and then return to the ongoing mount?
Steps to reproduce the problem
Start a VM; the impact is easier to see if you select 1 vCPU.
Create a file, repro.sh, with the following contents:
#!/bin/bash
NUMDIRS=3000
DIRPREFIX="mnt"
NUM_MOUNTS=0
TDIR=""

cleanup() {
    for idx in $(seq 1 "$NUM_MOUNTS"); do
        DIRNAME="$DIRPREFIX/$idx"
        sudo umount "$DIRNAME"
        rmdir "$DIRNAME"
        echo "Removed $DIRNAME"
    done
    if [ -n "$TDIR" ] && [ -e "$TDIR" ]; then
        sudo umount "$TDIR"
    fi
    return
}
trap "cleanup" EXIT

for idx in $(seq 1 "$NUMDIRS"); do
    DIRNAME="$DIRPREFIX/$idx"
    mkdir -p "$DIRNAME"
    sudo mount /dev/sda1 "$DIRNAME"
    NUM_MOUNTS=$idx
    echo "Created $DIRNAME mount"
done

for idx in $(seq 1 1000); do
    TDIR=$(mktemp -d)
    sudo mount -o size=1k -t tmpfs tmpfs "$TDIR"
    sleep 1
    sudo umount "$TDIR"
done
Adjust /dev/sda1 to your system, e.g. /dev/vda1.
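If you are unsure which device backs the root filesystem, it can be read from /proc/mounts instead of guessing; this is an illustrative helper, not part of the original repro:

```shell
# Print the device currently mounted at / (the last matching entry
# wins if / has been over-mounted).
awk '$2 == "/" {dev=$1} END {print dev}' /proc/mounts
```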
Run repro.sh; it will create 3000 mounts of the primary filesystem and then,
once per second, create a tmpfs, wait a second, and unmount it.
$ chmod +x repro.sh
$ ./repro.sh
Created mnt/1
...
Once all 3000 mounts have been created, if you then restart systemd-networkd
to simulate a DHCP lease expiry, you can see that it blocks for 1.5 to
2.5 minutes, and during this time there is no DHCP lease and networking is down.
$ time sudo systemctl restart systemd-networkd
real 2m28.361s
user 0m0.004s
sys 0m0.000s
$ time sudo systemctl restart systemd-networkd
Failed to restart systemd-networkd.service: Transport endpoint is not connected
See system logs and 'systemctl status systemd-networkd.service' for details.
real 1m30.145s
user 0m0.002s
sys 0m0.003s
$ time sudo systemctl restart systemd-networkd
real 2m17.425s
user 0m0.001s
sys 0m0.005s
$ ping github.com
ping: github.com: Temporary failure in name resolution
$ ping github.com
ping: github.com: Temporary failure in name resolution
$ ping github.com
ping: github.com: Temporary failure in name resolution
Additional program output to the terminal or log subsystem illustrating the issue
Running strace during a mount shows many udev entries being parsed for each
mountpoint, which takes time.
19:55:06 newfstatat(AT_FDCWD, "/dev/sda1", {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x1), ...}, 0) = 0 <0.000061>
19:55:06 openat(AT_FDCWD, "/", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = 52 <0.000034>
19:55:06 openat(52, "sys", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000038>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000034>
19:55:06 close(52) = 0 <0.000032>
19:55:06 openat(53, "dev", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000043>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000222>
19:55:06 close(53) = 0 <0.000021>
19:55:06 openat(52, "block", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000042>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000034>
19:55:06 close(52) = 0 <0.000030>
19:55:06 openat(53, "8:1", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000048>
19:55:06 newfstatat(52, "", {st_mode=S_IFLNK|0777, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000032>
19:55:06 readlinkat(53, "8:1", "../../devices/pci0000:00/0000:00"..., 4096) = 86 <0.000037>
19:55:06 close(52) = 0 <0.000030>
19:55:06 openat(53, "..", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000037>
19:55:06 close(53) = 0 <0.000037>
19:55:06 openat(52, "..", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000179>
19:55:06 close(52) = 0 <0.000060>
19:55:06 openat(53, "devices", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000043>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000033>
19:55:06 close(53) = 0 <0.000030>
19:55:06 openat(52, "pci0000:00", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000040>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000032>
19:55:06 close(52) = 0 <0.000031>
19:55:06 openat(53, "0000:00:03.0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000053>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000032>
19:55:06 close(53) = 0 <0.000108>
19:55:06 openat(52, "virtio0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000089>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000064>
19:55:06 close(52) = 0 <0.000030>
19:55:06 openat(53, "host0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000037>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000029>
19:55:06 close(53) = 0 <0.000041>
19:55:06 openat(52, "target0:0:1", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000042>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000027>
19:55:06 close(52) = 0 <0.000044>
19:55:06 openat(53, "0:0:1:0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000044>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000036>
19:55:06 close(53) = 0 <0.000028>
19:55:06 openat(52, "block", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000040>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000104>
19:55:06 close(52) = 0 <0.000034>
19:55:06 openat(53, "sda", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000042>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000033>
19:55:06 close(53) = 0 <0.000033>
19:55:06 openat(52, "sda1", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000043>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000062>
19:55:06 close(52) = 0 <0.000051>
19:55:06 close(53) = 0 <0.000038>
19:55:06 access("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:1/0:0:1:0/block/sda/sda1/uevent", F_OK) = 0 <0.000107>
19:55:06 openat(AT_FDCWD, "/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:1/0:0:1:0/block/sda/sda1/uevent", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 52 <0.000064>
19:55:06 newfstatat(52, "", {st_mode=S_IFREG|0644, st_size=4096, ...}, AT_EMPTY_PATH) = 0 <0.000038>
19:55:06 read(52, "MAJOR=8\nMINOR=1\nDEVNAME=sda1\nDEV"..., 4104) = 65 <0.000055>
19:55:06 close(52) = 0 <0.000068>
19:55:06 readlinkat(AT_FDCWD, "/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:1/0:0:1:0/block/sda/sda1/subsystem", "../../../../../../../../../../cl"..., 4096) = 41 <0.000051>
19:55:06 openat(AT_FDCWD, "/run/udev/data/b8:1", O_RDONLY|O_CLOEXEC) = 52 <0.000063>
19:55:06 newfstatat(52, "", {st_mode=S_IFREG|0644, st_size=1509, ...}, AT_EMPTY_PATH) = 0 <0.000073>
19:55:06 newfstatat(52, "", {st_mode=S_IFREG|0644, st_size=1509, ...}, AT_EMPTY_PATH) = 0 <0.000043>
19:55:06 read(52, "S:disk/by-partuuid/04a52087-8ae8"..., 4096) = 1509 <0.000038>
19:55:06 read(52, "", 4096) = 0 <0.000056>
19:55:06 close(52) = 0 <0.000036>
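A saved strace log can be post-processed to quantify how much of the mount path goes into udev database parsing. The sketch below is illustrative only: the log file name and sample lines are hypothetical stand-ins for a real capture.

```shell
# Write a two-line sample log, then count how many times the udev
# database under /run/udev/data is opened in it.
cat > /tmp/sample.strace <<'EOF'
19:55:06 openat(AT_FDCWD, "/run/udev/data/b8:1", O_RDONLY|O_CLOEXEC) = 52 <0.000063>
19:55:07 openat(AT_FDCWD, "/run/udev/data/b8:1", O_RDONLY|O_CLOEXEC) = 52 <0.000061>
EOF
grep -c '/run/udev/data/' /tmp/sample.strace
```

Against a real log from the repro, the count shows how many device entries are consulted per tmpfs mount.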
htop shows systemd at 100% CPU:
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
1 root 20 0 40004 30796 9548 R 32.5 0.8 1:18.35 /sbin/init
792 ubuntu 20 0 36580 26880 9472 R 31.8 0.7 0:57.32 /usr/lib/systemd/systemd --user
1351 root 20 0 23112 12544 8064 R 20.1 0.3 0:15.57 (networkd)
637 root 20 0 460M 16132 11392 R 13.6 0.4 0:53.56 /usr/libexec/udisks2/udisksd