IO is bottlenecked through a single lock #12121
Description
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Debian |
| Distribution Version | 10.9 |
| Linux Kernel | 5.10.19-1~bpo10+1 |
| Architecture | amd64 |
| ZFS Version | 2.0.4 |
| SPL Version | 2.0.4 |
Describe the problem you're observing
tl;dr: IO is bottlenecked through the kstat lock in vdev_queue.c.
I have a large pool of 24x 8T NVMe devices (Intel DC P4510), configured as a single pool of 2x12 RAIDZ2.
```
NAME           STATE     READ WRITE CKSUM
i44cyrus       ONLINE       0     0     0
  raidz2-0     ONLINE       0     0     0
    i44disk1   ONLINE       0     0     0
    ...
    i44disk11  ONLINE       0     0     0
    i44disk12  ONLINE       0     0     0
  raidz2-1     ONLINE       0     0     0
    i44disk13  ONLINE       0     0     0
    ...
    i44disk24  ONLINE       0     0     0
```
Yesterday I started bringing in some 40T of data from a different machine using a normal send | recv over a direct 10G connection. We appear to be getting a sustained throughput of ~250MB/s. We haven't done a lot of tuning yet, so I wasn't entirely sure whether this was "good" or not, but it felt low, so I went digging.
(note: there is nothing else of any consequence running on this machine; just basic OS services)
On the receiving side, perf top reported that 25% of CPU time was being spent in osq_lock, the kernel's optimistic spin queue used when waiting on a contended mutex. So I fired up bpftrace and got to work.
Bucketing by the lock pointer shows that there's a single lock receiving a lot more attention than the others:
```
[fastmail root(robn)@imap44 ~]# bpftrace -e 'kprobe:osq_lock { @[arg0] = count(); }'
Attaching 1 probe...
^C
@[-117676299645548]: 1
@[-117669704412716]: 1
@[-117485265829068]: 1
@[-117625234665004]: 1
@[-117458348531436]: 1
...
@[-117723405763756]: 3502
@[-117723442693292]: 3509
@[-117716388578476]: 3540
@[-117723405747372]: 3573
@[-117719621901892]: 11432
@[-117711093901308]: 144076
```
Inspecting this single lock:
```
[fastmail root(robn)@imap44 ~]# bpftrace -e 'kprobe:osq_lock /arg0 == -117711093901308/ { @[kstack(3)] = count() }'
stdin:1:23-25: WARNING: comparison of integers of different signs: 'unsigned int64' and 'int64' can lead to undefined behavior
kprobe:osq_lock /arg0 == -117711093901308/ { @[kstack(3)] = count() }
                      ~~
Attaching 1 probe...
^C
@[
    osq_lock+1
    __mutex_lock.isra.9+137
    vdev_queue_io_to_issue+2778
]: 93753
@[
    osq_lock+1
    __mutex_lock.isra.9+137
    vdev_queue_io_done+235
]: 120459
@[
    osq_lock+1
    __mutex_lock.isra.9+137
    vdev_queue_io+273
]: 205863
@[
    osq_lock+1
    __mutex_lock.isra.9+137
    vdev_queue_io_remove+124
]: 208130
```
I didn't know very much about the internals at this point, and I was assuming there was a single lock around all devices or something like that. After studying vdev_queue_io and the headers I saw that a struct vdev has its vdev_queue (and thus vq_lock) embedded in it, so the vdev pointer is a reasonable proxy for the lock for tracing purposes, and so:
```
[fastmail root(robn)@imap44 ~]# cat zio.bt
#include <sys/zio.h>

kprobe:vdev_queue_io
{
    @[((struct zio *)arg0)->io_vd] = count()
}
[fastmail root(robn)@imap44 ~]# bpftrace -I/usr/include/libspl -I/usr/include/libzfs zio.bt
Attaching 1 probe...
^C
@[0xffff94ee60054000]: 14228
@[0xffff94edce118000]: 14288
@[0xffff94f06863c000]: 14316
@[0xffff94f0034a8000]: 14342
@[0xffff94edfb11c000]: 14491
@[0xffff94edce9f4000]: 14560
@[0xffff94f004790000]: 14703
@[0xffff94ee62378000]: 14785
@[0xffff94ee6d3a8000]: 14799
@[0xffff94edce11c000]: 14855
@[0xffff94edfc550000]: 14867
@[0xffff94f0034ac000]: 14954
@[0xffff94edce358000]: 14996
@[0xffff94ee60040000]: 15017
@[0xffff94ee60050000]: 15065
@[0xffff94f0034a0000]: 15066
@[0xffff94f004794000]: 15110
@[0xffff94ee6d3ac000]: 15184
@[0xffff94edfb118000]: 15191
@[0xffff94edfc554000]: 15199
@[0xffff94f0034a4000]: 15232
@[0xffff94ee6237c000]: 15255
@[0xffff94edce9f0000]: 15255
@[0xffff94edce35c000]: 15273
```
24 separate vdevs, each receiving roughly the same amount of traffic, with no outliers. So the vdev locks are not responsible!
Reading the code, vdev_queue_io (and friends) only take the per-vdev lock, so I couldn't understand how a single busy lock could match the traces I had. But then I saw that vdev_queue_io_add is static and called from only one place, so it has been inlined. And lo, inside it there's another lock, and a global one at that.
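To make the shape of the problem concrete, here's a toy userspace model of what the traces imply (the names vq_lock and iokstat_lock are just my labels for the per-vdev lock and the shared stats lock; this is a sketch of the structure, not the actual OpenZFS code): each queue is guarded by its own lock, but every queue operation also funnels through one pool-wide lock.

```python
import threading

NVDEVS, OPS = 24, 1000

class VdevQueue:
    """Per-vdev queue: its lock is only contended by I/O for this one vdev."""
    def __init__(self):
        self.vq_lock = threading.Lock()
        self.queued = 0

# One pool-wide stats lock shared by all vdevs -- the analogue of the
# single lock every vdev_queue_io call was seen waiting on.
iokstat_lock = threading.Lock()
waitq_entries = 0

vqs = [VdevQueue() for _ in range(NVDEVS)]

def vdev_queue_io(vq):
    global waitq_entries
    for _ in range(OPS):
        with vq.vq_lock:       # scales: one lock per device
            vq.queued += 1
        with iokstat_lock:     # does not scale: all 24 devices serialize here
            waitq_entries += 1

threads = [threading.Thread(target=vdev_queue_io, args=(vq,)) for vq in vqs]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every per-vdev operation also passed through the single shared lock.
assert sum(vq.queued for vq in vqs) == NVDEVS * OPS
assert waitq_entries == NVDEVS * OPS
```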
We can't probe an inlined function, but we can probe the thing it calls:
```
[fastmail root(robn)@imap44 ~]# bpftrace -e 'kprobe:kstat_waitq_enter { @[kstack(3)] = count() }'
Attaching 1 probe...
^C
@[
    kstat_waitq_enter+1
    vdev_queue_io+304
    zio_vdev_io_start+156
]: 270092
```
And there it is: all the vdev ops are waiting on a single lock, for a stats counter no less! Oh no!
I don't have a great theory yet, but I suspect this isn't as noticeable on higher-latency devices (spinners), because the device latency dwarfs the lock wait, and isn't as noticeable on NVMe mirrors, because a mirror usually has only two drives, so the contention isn't amplified nearly as much.
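A back-of-the-envelope model of that amplification (my own numbers, purely illustrative, not measurements): if every I/O must hold the shared lock for some time t, the lock admits at most 1/t operations per second for the whole pool, so the per-device share of that budget shrinks linearly with vdev width.

```python
# Toy serialization model (illustrative only; the 2us hold time is an
# assumption, not a measurement). A lock held for hold_s seconds per I/O
# caps the whole pool at 1/hold_s IOPS through that lock.
def per_device_iops_cap(hold_s: float, ndevices: int) -> float:
    pool_cap = 1.0 / hold_s       # the lock admits one holder at a time
    return pool_cap / ndevices    # fair share per leaf device

hold_s = 2e-6                               # assumed 2us lock hold time
mirror = per_device_iops_cap(hold_s, 2)     # 2-way NVMe mirror
raidz  = per_device_iops_cap(hold_s, 24)    # 24 leaf devices, as above

# A device in the wide pool has 12x less headroom before the shared
# lock becomes the bottleneck.
assert abs(mirror / raidz - 12.0) < 1e-9
```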
Describe how to reproduce the problem
It should be enough to have a wide RAIDZ vdev on top of low-latency devices, and then just pour data into it with zfs recv.
(I can reproduce this reliably, however I'm on tight deadlines so I don't immediately have much time for experimenting. But I would like to help, and will have a lot more hardware available to work with in the next week or two).