
IO is bottlenecked through a single lock #12121

@robn

Description


System information

Type Version/Name
Distribution Name Debian
Distribution Version 10.9
Linux Kernel 5.10.19-1~bpo10+1
Architecture amd64
ZFS Version 2.0.4
SPL Version 2.0.4

Describe the problem you're observing

tl;dr: IO is bottlenecked through the kstat lock in vdev_queue.c.

I have a large pool of 24x 8T NVMe devices (Intel DC P4510), configured as a single pool of 2x12 RAIDZ2.


	NAME           STATE     READ WRITE CKSUM
	i44cyrus       ONLINE       0     0     0
	 raidz2-0     ONLINE       0     0     0
	   i44disk1   ONLINE       0     0     0
...
	   i44disk11  ONLINE       0     0     0
	   i44disk12  ONLINE       0     0     0
	 raidz2-1     ONLINE       0     0     0
	   i44disk13  ONLINE       0     0     0
...
	   i44disk24  ONLINE       0     0     0

Yesterday I started bringing in some 40T of data from a different machine using a normal send | recv over a direct 10G connection. We're getting a sustained throughput of ~250M/s. We haven't done a lot of tuning yet, so I wasn't entirely sure whether that was "good" or not, but it felt low, so I went digging.

(note: there is nothing else of any consequence running on this machine; just basic OS services)

On the receiving side, perf top reported that 25% of its time was spent in osq_lock, the kernel's optimistic spin queue, used when spinning on a contended mutex. So I fired up bpftrace and got to work.

Bucketing by the lock pointer shows that there's a single lock receiving a lot more attention than the others:

[fastmail root(robn)@imap44 ~]# bpftrace -e 'kprobe:osq_lock { @[arg0] = count(); }'
Attaching 1 probe...
^C
@[-117676299645548]: 1
@[-117669704412716]: 1
@[-117485265829068]: 1
@[-117625234665004]: 1
@[-117458348531436]: 1
...
@[-117723405763756]: 3502
@[-117723442693292]: 3509
@[-117716388578476]: 3540
@[-117723405747372]: 3573
@[-117719621901892]: 11432
@[-117711093901308]: 144076

Inspecting this single lock:

[fastmail root(robn)@imap44 ~]# bpftrace -e 'kprobe:osq_lock /arg0 == -117711093901308/ { @[kstack(3)] = count() }'
stdin:1:23-25: WARNING: comparison of integers of different signs: 'unsigned int64' and 'int64' can lead to undefined behavior
kprobe:osq_lock /arg0 == -117711093901308/ { @[kstack(3)] = count() }
                      ~~
Attaching 1 probe...
^C

@[
    osq_lock+1
    __mutex_lock.isra.9+137
    vdev_queue_io_to_issue+2778
]: 93753
@[
    osq_lock+1
    __mutex_lock.isra.9+137
    vdev_queue_io_done+235
]: 120459
@[
    osq_lock+1
    __mutex_lock.isra.9+137
    vdev_queue_io+273
]: 205863
@[
    osq_lock+1
    __mutex_lock.isra.9+137
    vdev_queue_io_remove+124
]: 208130

I didn't know very much about the internals at this point, and I was assuming there was a single lock around all devices or something like that. After studying vdev_queue_io and the headers I saw that each struct vdev embeds its own vdev_queue (and thus its own vq_lock), so the vdev pointer is a reasonable proxy for the lock for tracing purposes, and so:

[fastmail root(robn)@imap44 ~]# cat zio.bt 
#include <sys/zio.h>

kprobe:vdev_queue_io
{
  @[((struct zio *)arg0)->io_vd] = count()
}
[fastmail root(robn)@imap44 ~]# bpftrace -I/usr/include/libspl -I/usr/include/libzfs zio.bt
Attaching 1 probe...
^C

@[0xffff94ee60054000]: 14228
@[0xffff94edce118000]: 14288
@[0xffff94f06863c000]: 14316
@[0xffff94f0034a8000]: 14342
@[0xffff94edfb11c000]: 14491
@[0xffff94edce9f4000]: 14560
@[0xffff94f004790000]: 14703
@[0xffff94ee62378000]: 14785
@[0xffff94ee6d3a8000]: 14799
@[0xffff94edce11c000]: 14855
@[0xffff94edfc550000]: 14867
@[0xffff94f0034ac000]: 14954
@[0xffff94edce358000]: 14996
@[0xffff94ee60040000]: 15017
@[0xffff94ee60050000]: 15065
@[0xffff94f0034a0000]: 15066
@[0xffff94f004794000]: 15110
@[0xffff94ee6d3ac000]: 15184
@[0xffff94edfb118000]: 15191
@[0xffff94edfc554000]: 15199
@[0xffff94f0034a4000]: 15232
@[0xffff94ee6237c000]: 15255
@[0xffff94edce9f0000]: 15255
@[0xffff94edce35c000]: 15273

24 separate vdevs, each receiving roughly the same amount of traffic, with no outliers among them. So the per-vdev locks are not responsible!

Reading the code, vdev_queue_io (and friends) only take the per-vdev lock, so I couldn't understand how the single busy lock matched the traces I had. But then I saw that vdev_queue_io_add is only called from one place and is static, so it has been inlined. And lo, inside it there's another lock, and a global lock at that.

We can't probe an inlined function, but we can probe the thing it calls:

[fastmail root(robn)@imap44 ~]# bpftrace -e 'kprobe:kstat_waitq_enter { @[kstack(3)] = count() }'
Attaching 1 probe...
^C

@[
    kstat_waitq_enter+1
    vdev_queue_io+304
    zio_vdev_io_start+156
]: 270092

And there it is: all the vdev ops are waiting on a single lock, for a stats counter no less! Oh no!

I don't have a great theory, but I suspect this isn't as noticeable on higher-latency devices (spinners), because the device is slower than the lock wait, and isn't as noticeable on NVMe mirrors, because a mirror usually has only two drives, so the contention isn't amplified quite so much.

Describe how to reproduce the problem

It should be enough to have a wide RAIDZ vdev on top of low-latency devices, and then just pour data into it with zfs recv.

(I can reproduce this reliably; however, I'm on tight deadlines, so I don't immediately have much time for experimenting. I would like to help, though, and will have a lot more hardware available to work with in the next week or two.)

Labels: Type: Defect (incorrect behavior, e.g. crash, hang); Type: Performance (performance improvement or performance problem)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions