Conversation

@slp
Member

@slp slp commented Feb 18, 2020

This patch series implements EVENT_IDX, a virtio feature that allows reducing the number of driver <-> device notifications and thus the number of VM exits needed while processing requests. This is especially useful when actively polling the virtqueues.

slp added 3 commits February 18, 2020 12:04
VIRTIO_RING_F_EVENT_IDX is a virtio feature that makes it possible to
avoid device <-> driver notifications under some circumstances, most
notably when actively polling the queue.

This commit implements support for it in the vm-virtio crate. Consumers
of this crate will also need to add support for it by exposing the
feature and calling update_avail_event() and get_used_event()
accordingly.

Signed-off-by: Sergio Lopez <[email protected]>
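
For reference, the decision these helpers feed into is the used_event/avail_event check from the virtio specification. Below is a minimal, self-contained Rust sketch of that check; it mirrors the spec's vring_need_event() and is not the actual vm-virtio code, and the example indices are made up:

use std::num::Wrapping;

/// Returns true if the other side asked to be notified for the progress
/// made between old_idx and new_idx, given the event index it published
/// (used_event for device -> driver interrupts, avail_event for
/// driver -> device kicks). Mirrors vring_need_event() from the virtio spec.
fn vring_need_event(event_idx: Wrapping<u16>, new_idx: Wrapping<u16>, old_idx: Wrapping<u16>) -> bool {
    (new_idx - event_idx - Wrapping(1)) < (new_idx - old_idx)
}

fn main() {
    // Used index advanced from 5 to 8 and the driver asked to be woken
    // once entry 6 was used: an interrupt is needed.
    assert!(vring_need_event(Wrapping(6), Wrapping(8), Wrapping(5)));
    // The driver asked to be woken only at entry 10: the interrupt can
    // be suppressed for now.
    assert!(!vring_need_event(Wrapping(10), Wrapping(8), Wrapping(5)));
}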
Add helpers to Vring and VhostUserSlaveReqHandler for EVENT_IDX, so
consumers of this crate can make use of this feature.

Signed-off-by: Sergio Lopez <[email protected]>
Now that vhost_user_backend and vm-virtio support EVENT_IDX, use it
in vhost_user_block to reduce the number of notifications sent between
the driver and the device.

This is especially useful when using active polling on the virtqueue,
which will be implemented by a future patch.

This is a snapshot of kvm_stat while generating ~60K IOPS with fio on
the guest without EVENT_IDX:

 Event                                         Total %Total CurAvg/s
 kvm_entry                                    393454   20.3    62494
 kvm_exit                                     393446   20.3    62494
 kvm_apic_accept_irq                          378146   19.5    60268
 kvm_msi_set_irq                              369720   19.0    58881
 kvm_fast_mmio                                370497   19.1    58817
 kvm_hv_timer_state                            10197    0.5     1715
 kvm_msr                                        8770    0.5     1443
 kvm_wait_lapic_expire                          7018    0.4     1118
 kvm_apic                                       2768    0.1      538
 kvm_pv_tlb_flush                               2028    0.1      360
 kvm_vcpu_wakeup                                1453    0.1      278
 kvm_apic_ipi                                   1384    0.1      269
 kvm_fpu                                        1148    0.1      164
 kvm_pio                                         574    0.0       82
 kvm_userspace_exit                              574    0.0       82
 kvm_halt_poll_ns                                 24    0.0        3

And this is the snapshot while doing the same thing with EVENT_IDX:

 Event                                         Total %Total CurAvg/s
 kvm_entry                                     35506   26.0     3873
 kvm_exit                                      35499   26.0     3873
 kvm_hv_timer_state                            14740   10.8     1672
 kvm_apic_accept_irq                           13017    9.5     1438
 kvm_msr                                       12845    9.4     1421
 kvm_wait_lapic_expire                         10422    7.6     1118
 kvm_apic                                       3788    2.8      502
 kvm_pv_tlb_flush                               2708    2.0      340
 kvm_vcpu_wakeup                                1992    1.5      258
 kvm_apic_ipi                                   1894    1.4      251
 kvm_fpu                                        1476    1.1      164
 kvm_pio                                         738    0.5       82
 kvm_userspace_exit                              738    0.5       82
 kvm_msi_set_irq                                 701    0.5       69
 kvm_fast_mmio                                   238    0.2        4
 kvm_halt_poll_ns                                 50    0.0        1
 kvm_ple_window_update                            28    0.0        0
 kvm_page_fault                                    4    0.0        0

The number of VM exits per second, especially the ones related to
notifications (kvm_fast_mmio and kvm_msi_set_irq), is drastically
lower.

Signed-off-by: Sergio Lopez <[email protected]>
Member

@rbradford rbradford left a comment

@slp lgtm. I'm not sure I see the value in making the number configurable, could we just set a sensible default and be done with it?

@slp
Member Author

slp commented Feb 18, 2020

@slp lgtm. I'm not sure I see the value in making the number configurable, could we just set a sensible default and be done with it?

@rbradford That can be complicated, as the best value for poll_us depends on the storage backend latency. That said, we can turn this into an opinionated feature and decide that it's only worthwhile for low-latency (<30us) backends, setting the value to, for example, 50us.

@rbradford
Member

@slp lgtm. I'm not sure I see the value in making the number configurable, could we just set a sensible default and be done with it?

@rbradford That can be complicated, as the best value for poll_us depends on the storage backend latency. That said, we can turn this into an opinionated feature and decide that it's only worthwhile for low-latency (<30us) backends, setting the value to, for example, 50us.

Sounds good.

@slp
Member Author

slp commented Feb 18, 2020

@rbradford OK, I'll turn poll_us into a boolean (poll_queue?). Do we also need an integration test just for this option?

@rbradford
Member

@slp Can we just turn it on by default?

@sboeuf
Member

sboeuf commented Feb 18, 2020

@rbradford why do you want to do polling by default?

@rbradford
Member

@rbradford why do you want to do polling by default?

EVENT_IDX is designed to improve performance so we should turn it on by default.

@slp
Member Author

slp commented Feb 19, 2020

@rbradford why do you want to do polling by default?

EVENT_IDX is designed to improve performance so we should turn it on by default.

@sameo @rbradford EVENT_IDX will always be enabled (as long as it's negotiated by the guest); the question is whether to have active polling in vhost_user_block enabled by default, with the CPU cost this implies, or make it a tunable.

Personally, I don't have a problem with enabling it by default, as I always enable it in my VMs ;-)

On the other hand, even if we decide to have it enabled by default, should we also still have a parameter to disable it?

@rbradford
Member

Can you get some simple numbers to show the improvement it makes? (bonnie? or dd or something like that?)

I would lean towards making it a boolean, on by default.

@sameo
Member

sameo commented Feb 19, 2020

Can you get some simple numbers to show the improvement it makes? (bonnie? or dd or something like that?)

I would lean towards making it a boolean, on by default.

Same here, a boolean knob enabled by default. It will help reduce our number of VM exits, but if users don't want to pay the CPU utilization price they can turn it off.

@slp
Member Author

slp commented Feb 19, 2020

Can you get some simple numbers to show the improvement it makes? (bonnie? or dd or something like that?)

Sure, I've put some numbers in the fourth commit's message:

This is a test using null_blk as a backend for the image, with the
following parameters:

 - null_blk gb=20 nr_devices=1 irqmode=2 completion_nsec=0 no_sched=1

With poll_us=0:

fio --ioengine=sync --bs=4k --rw randread --name randread --direct=1
--filename=/dev/vdb --time_based --runtime=10

randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.14
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=169MiB/s][r=43.2k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=1): err= 0: pid=433: Tue Feb 18 11:12:59 2020
  read: IOPS=43.2k, BW=169MiB/s (177MB/s)(1688MiB/10001msec)
    clat (usec): min=17, max=836, avg=21.64, stdev= 3.81
     lat (usec): min=17, max=836, avg=21.77, stdev= 3.81
    clat percentiles (nsec):
     |  1.00th=[19328],  5.00th=[19840], 10.00th=[20352], 20.00th=[21120],
     | 30.00th=[21376], 40.00th=[21376], 50.00th=[21376], 60.00th=[21632],
     | 70.00th=[21632], 80.00th=[21888], 90.00th=[22144], 95.00th=[22912],
     | 99.00th=[28544], 99.50th=[30336], 99.90th=[39168], 99.95th=[42752],
     | 99.99th=[71168]
   bw (  KiB/s): min=168440, max=188496, per=100.00%, avg=172912.00, stdev=3975.63, samples=19
   iops        : min=42110, max=47124, avg=43228.00, stdev=993.91, samples=19
  lat (usec)   : 20=5.90%, 50=94.08%, 100=0.02%, 250=0.01%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  cpu          : usr=10.35%, sys=25.82%, ctx=432417, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=432220,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=169MiB/s (177MB/s), 169MiB/s-169MiB/s (177MB/s-177MB/s), io=1688MiB (1770MB), run=10001-10001msec

Disk stats (read/write):
  vdb: ios=427867/0, merge=0/0, ticks=7346/0, in_queue=0, util=99.04%

With poll_us=20:

fio --ioengine=sync --bs=4k --rw randread --name randread --direct=1
--filename=/dev/vdb --time_based --runtime=10

randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.14
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=260MiB/s][r=66.7k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=1): err= 0: pid=422: Tue Feb 18 11:14:47 2020
  read: IOPS=68.5k, BW=267MiB/s (280MB/s)(2674MiB/10001msec)
    clat (usec): min=10, max=966, avg=13.60, stdev= 3.49
     lat (usec): min=10, max=966, avg=13.70, stdev= 3.50
    clat percentiles (nsec):
     |  1.00th=[11200],  5.00th=[11968], 10.00th=[11968], 20.00th=[12224],
     | 30.00th=[12992], 40.00th=[13504], 50.00th=[13760], 60.00th=[13888],
     | 70.00th=[14016], 80.00th=[14144], 90.00th=[14272], 95.00th=[14656],
     | 99.00th=[20352], 99.50th=[23936], 99.90th=[35072], 99.95th=[36096],
     | 99.99th=[47872]
   bw (  KiB/s): min=265456, max=296456, per=100.00%, avg=274229.05, stdev=13048.14, samples=19
   iops        : min=66364, max=74114, avg=68557.26, stdev=3262.03, samples=19
  lat (usec)   : 20=98.84%, 50=1.15%, 100=0.01%, 250=0.01%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  cpu          : usr=8.24%, sys=21.15%, ctx=684669, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=684611,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=267MiB/s (280MB/s), 267MiB/s-267MiB/s (280MB/s-280MB/s), io=2674MiB (2804MB), run=10001-10001msec

Disk stats (read/write):
  vdb: ios=677855/0, merge=0/0, ticks=7026/0, in_queue=0, util=99.04%

I would lean towards making it a boolean, on by default.

Same here, a boolean knob enabled by default. It will help reduce our number of VM exits, but if users don't want to pay the CPU utilization price they can turn it off.

Sounds good to me too.

@rbradford
Member

@slp whoops. Sorry I missed that, the improvement looks great.

Actively polling the virtqueue significantly reduces the latency of
each I/O operation, at the expense of using more CPU time. This
feature is especially useful when using low-latency devices (SSD,
NVMe) as the backend.

This change implements static polling. When a request arrives after
the queue has been idle, vhost_user_block keeps checking the virtqueue
for new requests until POLL_QUEUE_US (50us) has passed without finding
one.

POLL_QUEUE_US is defined to be 50us, based on the current latency of
enterprise SSDs (< 30us) and the overhead of the emulation.

This feature is enabled by default, and can be disabled by using the
"poll_queue" parameter of "block-backend".

This is a test using null_blk as a backend for the image, with the
following parameters:

 - null_blk gb=20 nr_devices=1 irqmode=2 completion_nsec=0 no_sched=1

With "poll_queue=false":

fio --ioengine=sync --bs=4k --rw randread --name randread --direct=1
--filename=/dev/vdb --time_based --runtime=10

randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.14
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=169MiB/s][r=43.2k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=1): err= 0: pid=433: Tue Feb 18 11:12:59 2020
  read: IOPS=43.2k, BW=169MiB/s (177MB/s)(1688MiB/10001msec)
    clat (usec): min=17, max=836, avg=21.64, stdev= 3.81
     lat (usec): min=17, max=836, avg=21.77, stdev= 3.81
    clat percentiles (nsec):
     |  1.00th=[19328],  5.00th=[19840], 10.00th=[20352], 20.00th=[21120],
     | 30.00th=[21376], 40.00th=[21376], 50.00th=[21376], 60.00th=[21632],
     | 70.00th=[21632], 80.00th=[21888], 90.00th=[22144], 95.00th=[22912],
     | 99.00th=[28544], 99.50th=[30336], 99.90th=[39168], 99.95th=[42752],
     | 99.99th=[71168]
   bw (  KiB/s): min=168440, max=188496, per=100.00%, avg=172912.00, stdev=3975.63, samples=19
   iops        : min=42110, max=47124, avg=43228.00, stdev=993.91, samples=19
  lat (usec)   : 20=5.90%, 50=94.08%, 100=0.02%, 250=0.01%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  cpu          : usr=10.35%, sys=25.82%, ctx=432417, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=432220,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=169MiB/s (177MB/s), 169MiB/s-169MiB/s (177MB/s-177MB/s), io=1688MiB (1770MB), run=10001-10001msec

Disk stats (read/write):
  vdb: ios=427867/0, merge=0/0, ticks=7346/0, in_queue=0, util=99.04%

With "poll_queue=true" (default):

fio --ioengine=sync --bs=4k --rw randread --name randread --direct=1
--filename=/dev/vdb --time_based --runtime=10

randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.14
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=260MiB/s][r=66.7k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=1): err= 0: pid=422: Tue Feb 18 11:14:47 2020
  read: IOPS=68.5k, BW=267MiB/s (280MB/s)(2674MiB/10001msec)
    clat (usec): min=10, max=966, avg=13.60, stdev= 3.49
     lat (usec): min=10, max=966, avg=13.70, stdev= 3.50
    clat percentiles (nsec):
     |  1.00th=[11200],  5.00th=[11968], 10.00th=[11968], 20.00th=[12224],
     | 30.00th=[12992], 40.00th=[13504], 50.00th=[13760], 60.00th=[13888],
     | 70.00th=[14016], 80.00th=[14144], 90.00th=[14272], 95.00th=[14656],
     | 99.00th=[20352], 99.50th=[23936], 99.90th=[35072], 99.95th=[36096],
     | 99.99th=[47872]
   bw (  KiB/s): min=265456, max=296456, per=100.00%, avg=274229.05, stdev=13048.14, samples=19
   iops        : min=66364, max=74114, avg=68557.26, stdev=3262.03, samples=19
  lat (usec)   : 20=98.84%, 50=1.15%, 100=0.01%, 250=0.01%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  cpu          : usr=8.24%, sys=21.15%, ctx=684669, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=684611,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=267MiB/s (280MB/s), 267MiB/s-267MiB/s (280MB/s-280MB/s), io=2674MiB (2804MB), run=10001-10001msec

Disk stats (read/write):
  vdb: ios=677855/0, merge=0/0, ticks=7026/0, in_queue=0, util=99.04%

Signed-off-by: Sergio Lopez <[email protected]>
@slp
Member Author

slp commented Feb 19, 2020

Done, I've replaced poll_us with the boolean poll_queue, enabled by default.

@rbradford rbradford merged commit 5c06b7f into cloud-hypervisor:master Feb 19, 2020
"vhost-user-block backend parameters \"image=<image_path>,\
sock=<socket_path>,num_queues=<number_of_queues>,\
readonly=true|false,direct=true|false\"",
readonly=true|false,direct=true|false,poll_queue=true|false\"",
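
For illustration, the new option slots into the block-backend parameter string like this (a hedged example, not taken from the patch: the image and socket paths are placeholders, and polling stays enabled unless poll_queue is explicitly set to false):

image=/path/to/disk.raw,sock=/tmp/vub.sock,num_queues=1,readonly=false,direct=true,poll_queue=false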
Member

Member Author

Sure, I'll do that in a separate patch.

Member

Thanks :)

Member

Actually, @rbradford pointed out to me that you don't need to do anything, since you're not modifying any of the VMM's devices. Sorry I got confused because now the main.rs CLI includes options for vhost-user backends.

Member Author

I was about to ask about this. Should I add poll_queue to DiskConfig and CH's disk argument? Reading the code, it looks to me that block-backend and disk need to be synchronized.

Member

@slp See #787 and #786

@sboeuf
Member

sboeuf commented Feb 19, 2020

This patch is nice @slp as it'll let the user choose perf at CPU cost.
Do you intend to apply it somewhere else? --net could definitely use this :)

@slp
Member Author

slp commented Feb 19, 2020

This patch is nice @slp as it'll let the user choose perf at CPU cost.
Do you intend to apply it somewhere else? --net could definitely use this :)

Good question. IIRC, virtio-net has its own interrupt suppression mechanism, so it may need a slightly different approach. It's probably worth creating an issue to track this, though.

// Actively poll the queue until POLL_QUEUE_US has passed
// without seeing a new request.
let mut now = Instant::now();
loop {
Contributor

All the events that come here are from the epoll in vhost_user_backend; is it necessary to add a loop here?

Member Author

This is for polling the queue for new elements without having to wait for an ioeventfd signal.
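
Roughly, the strategy is the sketch below (simplified; process_queue is a stand-in for the backend's real method and is assumed to return true when it handled at least one request): keep draining the queue while requests keep arriving, and only go back to sleeping on the ioeventfd once POLL_QUEUE_US elapses with the queue empty.

use std::time::Instant;

const POLL_QUEUE_US: u128 = 50;

// Simplified poll-then-wait loop; process_queue stands in for the
// backend's real request-handling method.
fn poll_then_wait(mut process_queue: impl FnMut() -> bool) {
    let mut now = Instant::now();
    loop {
        if process_queue() {
            // Found a request: reset the idle window and keep polling.
            now = Instant::now();
        } else if now.elapsed().as_micros() > POLL_QUEUE_US {
            // The queue stayed empty for the whole window: stop polling
            // and wait for the next ioeventfd notification.
            break;
        }
    }
}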

}
}

if self.event_idx {
Contributor

This branch is already present in process_queue; why check it again here?

Member Author

The check in process_queue is to determine whether the EVENT_IDX feature can be used to omit sending an interrupt to the guest, while the one here, as the comment says, is to tackle the fact that the Queue iterator only checks avail_index once.
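
In other words: with EVENT_IDX the driver may legitimately skip the kick for a request it added right after the device finished draining the ring, so after updating avail_event the device has to look at the queue again. A sketch of the resulting pattern (the names are stand-ins for the calls used by the patch, not exact signatures):

if self.event_idx {
    loop {
        // Publish how far we've read the avail ring, then re-check it:
        // a request added before this update may not come with a kick.
        vring.mut_queue().update_avail_event(mem);
        if !self.process_queue(vring) {
            // Nothing new after the update; any later request will be
            // accompanied by a notification, so it's safe to stop.
            break;
        }
    }
} else {
    self.process_queue(vring);
}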
