IO is bottlenecked through a single lock #12121
Description
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Debian |
| Distribution Version | 10.9 |
| Linux Kernel | 5.10.19-1~bpo10+1 |
| Architecture | amd64 |
| ZFS Version | 2.0.4 |
| SPL Version | 2.0.4 |
Describe the problem you're observing
tl;dr: IO is bottlenecked through the kstat lock in vdev_queue.c.
I have a large pool of 24x 8T NVMe devices (Intel DC P4510), configured as a single pool of 2x12 RAIDZ2.
```
NAME           STATE     READ WRITE CKSUM
i44cyrus       ONLINE       0     0     0
  raidz2-0     ONLINE       0     0     0
    i44disk1   ONLINE       0     0     0
    ...
    i44disk11  ONLINE       0     0     0
    i44disk12  ONLINE       0     0     0
  raidz2-1     ONLINE       0     0     0
    i44disk13  ONLINE       0     0     0
    ...
    i44disk24  ONLINE       0     0     0
```
Yesterday I started bringing in some 40T of data from a different machine using a normal send | recv over a direct 10G connection. We appear to be getting a sustained throughput of ~250MB/s. We haven't done a lot of tuning yet, so I wasn't entirely sure whether this was "good" or not, but it felt low, so I went digging.
(note: there is nothing else of any consequence running on this machine; just basic OS services)
On the receiving side, perf top reported that 25% of CPU time was being spent in osq_lock, the kernel's optimistic spin queue used when waiting on a contended mutex. So I fired up bpftrace and got to work.
Bucketing by the lock pointer shows that there's a single lock receiving a lot more attention than the others:
```
[fastmail root(robn)@imap44 ~]# bpftrace -e 'kprobe:osq_lock { @[arg0] = count(); }'
Attaching 1 probe...
^C
@[-117676299645548]: 1
@[-117669704412716]: 1
@[-117485265829068]: 1
@[-117625234665004]: 1
@[-117458348531436]: 1
...
@[-117723405763756]: 3502
@[-117723442693292]: 3509
@[-117716388578476]: 3540
@[-117723405747372]: 3573
@[-117719621901892]: 11432
@[-117711093901308]: 144076
```
Inspecting this single lock:
```
[fastmail root(robn)@imap44 ~]# bpftrace -e 'kprobe:osq_lock /arg0 == -117711093901308/ { @[kstack(3)] = count() }'
stdin:1:23-25: WARNING: comparison of integers of different signs: 'unsigned int64' and 'int64' can lead to undefined behavior
kprobe:osq_lock /arg0 == -117711093901308/ { @[kstack(3)] = count() }
                      ~~
Attaching 1 probe...
^C
@[
    osq_lock+1
    __mutex_lock.isra.9+137
    vdev_queue_io_to_issue+2778
]: 93753
@[
    osq_lock+1
    __mutex_lock.isra.9+137
    vdev_queue_io_done+235
]: 120459
@[
    osq_lock+1
    __mutex_lock.isra.9+137
    vdev_queue_io+273
]: 205863
@[
    osq_lock+1
    __mutex_lock.isra.9+137
    vdev_queue_io_remove+124
]: 208130
```
I didn't know very much about the internals at this point, and I was assuming there was a single lock around all devices or something like that. After studying vdev_queue_io and the headers I saw that a struct vdev has its vdev_queue (and thus vq_lock) embedded in it, so the vdev pointer is a reasonable proxy for the lock for tracing purposes, and so:
```
[fastmail root(robn)@imap44 ~]# cat zio.bt
#include <sys/zio.h>

kprobe:vdev_queue_io
{
    @[((struct zio *)arg0)->io_vd] = count()
}
[fastmail root(robn)@imap44 ~]# bpftrace -I/usr/include/libspl -I/usr/include/libzfs zio.bt
Attaching 1 probe...
^C
@[0xffff94ee60054000]: 14228
@[0xffff94edce118000]: 14288
@[0xffff94f06863c000]: 14316
@[0xffff94f0034a8000]: 14342
@[0xffff94edfb11c000]: 14491
@[0xffff94edce9f4000]: 14560
@[0xffff94f004790000]: 14703
@[0xffff94ee62378000]: 14785
@[0xffff94ee6d3a8000]: 14799
@[0xffff94edce11c000]: 14855
@[0xffff94edfc550000]: 14867
@[0xffff94f0034ac000]: 14954
@[0xffff94edce358000]: 14996
@[0xffff94ee60040000]: 15017
@[0xffff94ee60050000]: 15065
@[0xffff94f0034a0000]: 15066
@[0xffff94f004794000]: 15110
@[0xffff94ee6d3ac000]: 15184
@[0xffff94edfb118000]: 15191
@[0xffff94edfc554000]: 15199
@[0xffff94f0034a4000]: 15232
@[0xffff94ee6237c000]: 15255
@[0xffff94edce9f0000]: 15255
@[0xffff94edce35c000]: 15273
```
24 separate vdevs, each receiving roughly the same amount of traffic, with no outliers. So the vdev locks are not responsible!
Reading the code, vdev_queue_io (and friends) only take the per-vdev lock, so I couldn't understand how a single busy lock could match the traces I had. But then I saw that vdev_queue_io_add is static and called from only one place, so it has been inlined. And lo, inside it there's another lock, and a global one at that.
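To make the shape of the problem concrete, here's a toy userspace model of what the traces imply (the names vq_lock and iokstat_lock are just my labels for the per-vdev lock and the shared stats lock; this is a sketch of the structure, not the actual OpenZFS code): each queue is guarded by its own lock, but every queue operation also funnels through one pool-wide lock.

```python
import threading

NVDEVS, OPS = 24, 1000

class VdevQueue:
    """Per-vdev queue: its lock is only contended by I/O for this one vdev."""
    def __init__(self):
        self.vq_lock = threading.Lock()
        self.queued = 0

# One pool-wide stats lock shared by all vdevs -- the analogue of the
# single lock every vdev_queue_io call was seen waiting on.
iokstat_lock = threading.Lock()
waitq_entries = 0

vqs = [VdevQueue() for _ in range(NVDEVS)]

def vdev_queue_io(vq):
    global waitq_entries
    for _ in range(OPS):
        with vq.vq_lock:       # scales: one lock per device
            vq.queued += 1
        with iokstat_lock:     # does not scale: all 24 devices serialize here
            waitq_entries += 1

threads = [threading.Thread(target=vdev_queue_io, args=(vq,)) for vq in vqs]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every per-vdev operation also passed through the single shared lock.
assert sum(vq.queued for vq in vqs) == NVDEVS * OPS
assert waitq_entries == NVDEVS * OPS
```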
We can't probe an inlined function, but we can probe the thing it calls:
```
[fastmail root(robn)@imap44 ~]# bpftrace -e 'kprobe:kstat_waitq_enter { @[kstack(3)] = count() }'
Attaching 1 probe...
^C
@[
    kstat_waitq_enter+1
    vdev_queue_io+304
    zio_vdev_io_start+156
]: 270092
```
And there it is: all the vdev ops are waiting on a single lock, for a stats counter no less! Oh no!
I don't have a great theory yet, but I suspect this isn't as noticeable on higher-latency devices (spinners), because the device latency dwarfs the lock wait, and isn't as noticeable on NVMe mirrors, because a mirror usually has only two drives, so the contention isn't amplified nearly as much.
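A back-of-the-envelope model of that amplification (my own numbers, purely illustrative, not measurements): if every I/O must hold the shared lock for some time t, the lock admits at most 1/t operations per second for the whole pool, so the per-device share of that budget shrinks linearly with vdev width.

```python
# Toy serialization model (illustrative only; the 2us hold time is an
# assumption, not a measurement). A lock held for hold_s seconds per I/O
# caps the whole pool at 1/hold_s IOPS through that lock.
def per_device_iops_cap(hold_s: float, ndevices: int) -> float:
    pool_cap = 1.0 / hold_s       # the lock admits one holder at a time
    return pool_cap / ndevices    # fair share per leaf device

hold_s = 2e-6                               # assumed 2us lock hold time
mirror = per_device_iops_cap(hold_s, 2)     # 2-way NVMe mirror
raidz  = per_device_iops_cap(hold_s, 24)    # 24 leaf devices, as above

# A device in the wide pool has 12x less headroom before the shared
# lock becomes the bottleneck.
assert abs(mirror / raidz - 12.0) < 1e-9
```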
Describe how to reproduce the problem
It should be enough to have a wide RAIDZ vdev on top of low-latency devices, and then just pour data into it with zfs recv.
(I can reproduce this reliably, however I'm on tight deadlines so I don't immediately have much time for experimenting. But I would like to help, and will have a lot more hardware available to work with in the next week or two).