Skip to content

Parsing thousands of mounts blocks processing of systemd communication leading to dhcp lease expiry and more #31137

@matthewruffell

Description

@matthewruffell

systemd version the issue has been seen with

255.2-3ubuntu2

Used distribution

Ubuntu Noble 24.04

Linux kernel version used

6.6.0-14-generic

CPU architectures issue was seen on

x86_64

Component

systemd, systemd-networkd, systemd-udevd

Expected behaviour you didn't see

When mounting a filesystem or creating a tmpfs mount, systemd should be responsive and service important messages on its sockets, such as restarting services or renewing dhcp leases in a timely fashion.

Unexpected behaviour you saw

This issue was first seen on a cloud instance that was mounting and unmounting tmpfs filesystems every second or so by containerd collecting telemetry.

The system had about 3000 stale mountpoints of the root filesystem under /home due to a mountpoint leak by osqueryd. When the new tmpfs mounts were created, systemd is at 100% cpu and is completely blocked. It appears to parse each of the 3000 mountpoints under the udev filesystem during the tmpfs mount, which takes considerable time. While systemd is blocked, you cannot restart or query the status of services, and the dhcp lease can expire, and not be renewed leading to loss of network connectivity.

This isn't a bug per se, due to the large amount of mounts involved, but the issue is more systemd is blocked and communication to systemd sockets is not consumed, leading to issues such as dhcp lease expiry.

Is it strictly necessary to parse all these udev entries per tmpfs mount?

Is it possible to optimise the mount path to scan the udev entries in chunks, and break inbetween to let systemd service events like dhcp lease renewal etc in a timely fashion? Then it can return to the ongoing mount.

Steps to reproduce the problem

Start a VM, easier to see the impact if you select 1 vcpu.

Create a file, repro.sh with the following contents:

#!/bin/bash

NUMDIRS=3000
DIRPREFIX="mnt"
NUM_MOUNTS=0
TDIR=""

cleanup() {
	for idx in $(seq 1 $NUM_MOUNTS); do
		DIRNAME="$DIRPREFIX/$idx"
		sudo umount "$DIRNAME"
		rmdir $DIRNAME
		echo "Removed $DIRNAME"
	done
	if [ -n "$TDIR" -a -e "$TDIR" ]; then 
	   sudo umount $TDIR
	fi
	return
}

trap "cleanup" EXIT

for idx in $(seq 1 $NUMDIRS); do
	DIRNAME="$DIRPREFIX/$idx"
	mkdir -p "$DIRNAME"
	sudo mount /dev/sda1 "$DIRNAME"
	NUM_MOUNTS=$idx
	echo "Created $DIRNAME mount"
done

for idx in $(seq 1 1000); do
	TDIR=$(mktemp -d); 
	sudo mount -o size=1k -t tmpfs tmpfs $TDIR
	sleep 1
	sudo umount "$TDIR"
done

Adjust /dev/sda1 to your system, e.g. /dev/vda1.

Run repro.sh, it will create 3000 mounts of the primary filesystem, and then
once per second, create a tmpfs, wait a second, and then unmount the tmpfs.

$ chmod +x repro.sh
$ ./repro.sh
Created mnt/1
...

Once all 3000 mounts have been created, if you then restart systemd-networkd
to simulate a dhcp lease expiry, you can see that it is blocked waiting for 1.5 to
2.5 minutes, and during this time, there is no dhcp lease, and networking is down.

$ time sudo systemctl restart systemd-networkd

real	2m28.361s
user	0m0.004s
sys	0m0.000s
$ time sudo systemctl restart systemd-networkd
Failed to restart systemd-networkd.service: Transport endpoint is not connected
See system logs and 'systemctl status systemd-networkd.service' for details.

real	1m30.145s
user	0m0.002s
sys	0m0.003s
$ time sudo systemctl restart systemd-networkd

real	2m17.425s
user	0m0.001s
sys	0m0.005s
$ ping github.com
ping: github.com: Temporary failure in name resolution
$ ping github.com
ping: github.com: Temporary failure in name resolution
$ ping github.com
ping: github.com: Temporary failure in name resolution

Additional program output to the terminal or log subsystem illustrating the issue

Running an strace during mount shows many udev entries being parsed for each
mountpoint, which takes time.


19:55:06 newfstatat(AT_FDCWD, "/dev/sda1", {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x1), ...}, 0) = 0 <0.000061>
19:55:06 openat(AT_FDCWD, "/", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = 52 <0.000034>
19:55:06 openat(52, "sys", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000038>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000034>
19:55:06 close(52)                      = 0 <0.000032>
19:55:06 openat(53, "dev", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000043>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000222>
19:55:06 close(53)                      = 0 <0.000021>
19:55:06 openat(52, "block", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000042>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000034>
19:55:06 close(52)                      = 0 <0.000030>
19:55:06 openat(53, "8:1", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000048>
19:55:06 newfstatat(52, "", {st_mode=S_IFLNK|0777, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000032>
19:55:06 readlinkat(53, "8:1", "../../devices/pci0000:00/0000:00"..., 4096) = 86 <0.000037>
19:55:06 close(52)                      = 0 <0.000030>
19:55:06 openat(53, "..", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000037>
19:55:06 close(53)                      = 0 <0.000037>
19:55:06 openat(52, "..", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000179>
19:55:06 close(52)                      = 0 <0.000060>
19:55:06 openat(53, "devices", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000043>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000033>
19:55:06 close(53)                      = 0 <0.000030>
19:55:06 openat(52, "pci0000:00", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000040>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000032>
19:55:06 close(52)                      = 0 <0.000031>
19:55:06 openat(53, "0000:00:03.0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000053>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000032>
19:55:06 close(53)                      = 0 <0.000108>
19:55:06 openat(52, "virtio0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000089>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000064>
19:55:06 close(52)                      = 0 <0.000030>
19:55:06 openat(53, "host0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000037>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000029>
19:55:06 close(53)                      = 0 <0.000041>
19:55:06 openat(52, "target0:0:1", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000042>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000027>
19:55:06 close(52)                      = 0 <0.000044>
19:55:06 openat(53, "0:0:1:0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000044>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000036>
19:55:06 close(53)                      = 0 <0.000028>
19:55:06 openat(52, "block", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000040>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000104>
19:55:06 close(52)                      = 0 <0.000034>
19:55:06 openat(53, "sda", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000042>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000033>
19:55:06 close(53)                      = 0 <0.000033>
19:55:06 openat(52, "sda1", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000043>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000062>
19:55:06 close(52)                      = 0 <0.000051>
19:55:06 close(53)                      = 0 <0.000038>
19:55:06 access("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:1/0:0:1:0/block/sda/sda1/uevent", F_OK) = 0 <0.000107>
19:55:06 openat(AT_FDCWD, "/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:1/0:0:1:0/block/sda/sda1/uevent", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 52 <0.000064>
19:55:06 newfstatat(52, "", {st_mode=S_IFREG|0644, st_size=4096, ...}, AT_EMPTY_PATH) = 0 <0.000038>
19:55:06 read(52, "MAJOR=8\nMINOR=1\nDEVNAME=sda1\nDEV"..., 4104) = 65 <0.000055>
19:55:06 close(52)                      = 0 <0.000068>
19:55:06 readlinkat(AT_FDCWD, "/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:1/0:0:1:0/block/sda/sda1/subsystem", "../../../../../../../../../../cl"..., 4096) = 41 <0.000051>
19:55:06 openat(AT_FDCWD, "/run/udev/data/b8:1", O_RDONLY|O_CLOEXEC) = 52 <0.000063>
19:55:06 newfstatat(52, "", {st_mode=S_IFREG|0644, st_size=1509, ...}, AT_EMPTY_PATH) = 0 <0.000073>
19:55:06 newfstatat(52, "", {st_mode=S_IFREG|0644, st_size=1509, ...}, AT_EMPTY_PATH) = 0 <0.000043>
19:55:06 read(52, "S:disk/by-partuuid/04a52087-8ae8"..., 4096) = 1509 <0.000038>
19:55:06 read(52, "", 4096)             = 0 <0.000056>
19:55:06 close(52)                      = 0 <0.000036>

htop shows systemd at 100% cpu:

PID	USER	PRI	NI	VIRT	RES		SHR		S	CPU%	MEM%	TIME+	Command
1	root	20	0	40004	30796	9548	R	32.5	0.8	1:18.35		/sbin/init
792	ubuntu	20	0	36580	26880	9472	R	31.8	0.7	0:57.32		/usr/lib/systemd/systemd --user
1351root	20	0	23112	12544	8064	R	20.1	0.3	0:15.57		(networkd)
637	root	20	0	460M	16132	11392	R	13.6	0.4	0:53.56		/usr/libexec/udisks2/udisksd

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions