Conversation

@slp
Member

@slp slp commented Feb 18, 2020

This patch series implements EVENT_IDX, a virtio feature that allows reducing the number of driver <-> device notifications and thus the number of VM exits needed while processing requests. This is especially useful when actively polling the virtqueues.

slp added 3 commits February 18, 2020 12:04
VIRTIO_RING_F_EVENT_IDX is a virtio feature that makes it possible to
avoid device <-> driver notifications under some circumstances, most
notably when actively polling the queue.

This commit implements support for it in the vm-virtio crate. Consumers
of this crate will also need to add support for it by exposing the
feature and calling update_avail_event() and get_used_event()
accordingly.

Signed-off-by: Sergio Lopez <[email protected]>
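
For reference, the decision these helpers feed into is the used_event/avail_event check from the virtio specification. Below is a minimal, self-contained Rust sketch of that check; it mirrors the spec's vring_need_event() and is not the actual vm-virtio code, and the example indices are made up:

use std::num::Wrapping;

/// Returns true if the other side asked to be notified for the progress
/// made between old_idx and new_idx, given the event index it published
/// (used_event for device -> driver interrupts, avail_event for
/// driver -> device kicks). Mirrors vring_need_event() from the virtio spec.
fn vring_need_event(event_idx: Wrapping<u16>, new_idx: Wrapping<u16>, old_idx: Wrapping<u16>) -> bool {
    (new_idx - event_idx - Wrapping(1)) < (new_idx - old_idx)
}

fn main() {
    // Used index advanced from 5 to 8 and the driver asked to be woken
    // once entry 6 was used: an interrupt is needed.
    assert!(vring_need_event(Wrapping(6), Wrapping(8), Wrapping(5)));
    // The driver asked to be woken only at entry 10: the interrupt can
    // be suppressed for now.
    assert!(!vring_need_event(Wrapping(10), Wrapping(8), Wrapping(5)));
}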
Add helpers to Vring and VhostUserSlaveReqHandler for EVENT_IDX, so
consumers of this crate can make use of this feature.

Signed-off-by: Sergio Lopez <[email protected]>
Now that vhost_user_backend and vm-virtio support EVENT_IDX, use it
in vhost_user_block to reduce the number of notifications sent between
the driver and the device.

This is especially useful when using active polling on the virtqueue,
which will be implemented by a future patch.

This is a snapshot of kvm_stat while generating ~60K IOPS with fio on
the guest without EVENT_IDX:

 Event                                         Total %Total CurAvg/s
 kvm_entry                                    393454   20.3    62494
 kvm_exit                                     393446   20.3    62494
 kvm_apic_accept_irq                          378146   19.5    60268
 kvm_msi_set_irq                              369720   19.0    58881
 kvm_fast_mmio                                370497   19.1    58817
 kvm_hv_timer_state                            10197    0.5     1715
 kvm_msr                                        8770    0.5     1443
 kvm_wait_lapic_expire                          7018    0.4     1118
 kvm_apic                                       2768    0.1      538
 kvm_pv_tlb_flush                               2028    0.1      360
 kvm_vcpu_wakeup                                1453    0.1      278
 kvm_apic_ipi                                   1384    0.1      269
 kvm_fpu                                        1148    0.1      164
 kvm_pio                                         574    0.0       82
 kvm_userspace_exit                              574    0.0       82
 kvm_halt_poll_ns                                 24    0.0        3

And this is the snapshot while doing the same thing with EVENT_IDX:

 Event                                         Total %Total CurAvg/s
 kvm_entry                                     35506   26.0     3873
 kvm_exit                                      35499   26.0     3873
 kvm_hv_timer_state                            14740   10.8     1672
 kvm_apic_accept_irq                           13017    9.5     1438
 kvm_msr                                       12845    9.4     1421
 kvm_wait_lapic_expire                         10422    7.6     1118
 kvm_apic                                       3788    2.8      502
 kvm_pv_tlb_flush                               2708    2.0      340
 kvm_vcpu_wakeup                                1992    1.5      258
 kvm_apic_ipi                                   1894    1.4      251
 kvm_fpu                                        1476    1.1      164
 kvm_pio                                         738    0.5       82
 kvm_userspace_exit                              738    0.5       82
 kvm_msi_set_irq                                 701    0.5       69
 kvm_fast_mmio                                   238    0.2        4
 kvm_halt_poll_ns                                 50    0.0        1
 kvm_ple_window_update                            28    0.0        0
 kvm_page_fault                                    4    0.0        0

The number of VM exits per second, especially the ones related to
notifications (kvm_fast_mmio and kvm_msi_set_irq), is drastically
lower.

Signed-off-by: Sergio Lopez <[email protected]>
Member

@rbradford rbradford left a comment

@slp lgtm. I'm not sure I see the value in making the number configurable, could we just set a sensible default and be done with it?

@slp
Member Author

slp commented Feb 18, 2020

@slp lgtm. I'm not sure I see the value in making the number configurable, could we just set a sensible default and be done with it?

@rbradford That can be complicated, as the best value for poll_us depends on the storage backend latency. That said, we can turn this into an opinionated feature and decide that it's only worthwhile for low-latency (<30us) backends, setting the value to, for example, 50us.

@rbradford
Member

@slp lgtm. I'm not sure I see the value in making the number configurable, could we just set a sensible default and be done with it?

@rbradford That can be complicated, as the best value for poll_us depends on the storage backend latency. That said, we can turn this into an opinionated feature and decide that it's only worthwhile for low-latency (<30us) backends, setting the value to, for example, 50us.

Sounds good.

@slp
Member Author

slp commented Feb 18, 2020

@rbradford OK, I'll turn poll_us into a boolean (poll_queue?). Do we also need an integration test just for this option?

@rbradford
Member

@slp Can we just turn it on by default?

@sboeuf
Member

sboeuf commented Feb 18, 2020

@rbradford why do you want to do polling by default?

@rbradford
Member

@rbradford why do you want to do polling by default?

EVENT_IDX is designed to improve performance so we should turn it on by default.

@slp
Member Author

slp commented Feb 19, 2020

@rbradford why do you want to do polling by default?

EVENT_IDX is designed to improve performance so we should turn it on by default.

@sameo @rbradford EVENT_IDX will always be enabled (as long as it's negotiated by the guest); the question is whether to have active polling in vhost_user_block enabled by default, with the CPU cost this implies, or make it a tunable.

Personally, I don't have a problem with enabling it by default, as I always enable it in my VMs ;-)

On the other hand, even if we decide to have it enabled by default, should we also still have a parameter to disable it?

@rbradford
Member

Can you get some simple numbers to show the improvement it makes? (bonnie? or dd or something like that?)

I would lean towards making it a boolean, on by default.

@sameo
Member

sameo commented Feb 19, 2020

Can you get some simple numbers to show the improvement it makes? (bonnie? or dd or something like that?)

I would lean towards making it a boolean, on by default.

Same here, a boolean knob enabled by default. It will help reduce our number of VM exits, but if users don't want to pay the CPU utilization price they can turn it off.

@slp
Member Author

slp commented Feb 19, 2020

Can you get some simple numbers to show the improvement it makes? (bonnie? or dd or something like that?)

Sure, I've put some numbers in the fourth commit's message:

This is a test using null_blk as a backend for the image, with the
following parameters:

 - null_blk gb=20 nr_devices=1 irqmode=2 completion_nsec=0 no_sched=1

With poll_us=0:

fio --ioengine=sync --bs=4k --rw randread --name randread --direct=1
--filename=/dev/vdb --time_based --runtime=10

randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.14
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=169MiB/s][r=43.2k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=1): err= 0: pid=433: Tue Feb 18 11:12:59 2020
  read: IOPS=43.2k, BW=169MiB/s (177MB/s)(1688MiB/10001msec)
    clat (usec): min=17, max=836, avg=21.64, stdev= 3.81
     lat (usec): min=17, max=836, avg=21.77, stdev= 3.81
    clat percentiles (nsec):
     |  1.00th=[19328],  5.00th=[19840], 10.00th=[20352], 20.00th=[21120],
     | 30.00th=[21376], 40.00th=[21376], 50.00th=[21376], 60.00th=[21632],
     | 70.00th=[21632], 80.00th=[21888], 90.00th=[22144], 95.00th=[22912],
     | 99.00th=[28544], 99.50th=[30336], 99.90th=[39168], 99.95th=[42752],
     | 99.99th=[71168]
   bw (  KiB/s): min=168440, max=188496, per=100.00%, avg=172912.00, stdev=3975.63, samples=19
   iops        : min=42110, max=47124, avg=43228.00, stdev=993.91, samples=19
  lat (usec)   : 20=5.90%, 50=94.08%, 100=0.02%, 250=0.01%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  cpu          : usr=10.35%, sys=25.82%, ctx=432417, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=432220,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=169MiB/s (177MB/s), 169MiB/s-169MiB/s (177MB/s-177MB/s), io=1688MiB (1770MB), run=10001-10001msec

Disk stats (read/write):
  vdb: ios=427867/0, merge=0/0, ticks=7346/0, in_queue=0, util=99.04%

With poll_us=20:

fio --ioengine=sync --bs=4k --rw randread --name randread --direct=1
--filename=/dev/vdb --time_based --runtime=10

randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.14
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=260MiB/s][r=66.7k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=1): err= 0: pid=422: Tue Feb 18 11:14:47 2020
  read: IOPS=68.5k, BW=267MiB/s (280MB/s)(2674MiB/10001msec)
    clat (usec): min=10, max=966, avg=13.60, stdev= 3.49
     lat (usec): min=10, max=966, avg=13.70, stdev= 3.50
    clat percentiles (nsec):
     |  1.00th=[11200],  5.00th=[11968], 10.00th=[11968], 20.00th=[12224],
     | 30.00th=[12992], 40.00th=[13504], 50.00th=[13760], 60.00th=[13888],
     | 70.00th=[14016], 80.00th=[14144], 90.00th=[14272], 95.00th=[14656],
     | 99.00th=[20352], 99.50th=[23936], 99.90th=[35072], 99.95th=[36096],
     | 99.99th=[47872]
   bw (  KiB/s): min=265456, max=296456, per=100.00%, avg=274229.05, stdev=13048.14, samples=19
   iops        : min=66364, max=74114, avg=68557.26, stdev=3262.03, samples=19
  lat (usec)   : 20=98.84%, 50=1.15%, 100=0.01%, 250=0.01%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  cpu          : usr=8.24%, sys=21.15%, ctx=684669, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=684611,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=267MiB/s (280MB/s), 267MiB/s-267MiB/s (280MB/s-280MB/s), io=2674MiB (2804MB), run=10001-10001msec

Disk stats (read/write):
  vdb: ios=677855/0, merge=0/0, ticks=7026/0, in_queue=0, util=99.04%

I would lean towards making it a boolean, on by default.

Same here, a boolean knob enabled by default. It will help reduce our number of VM exits, but if users don't want to pay the CPU utilization price they can turn it off.

Sounds good to me too.

@rbradford
Member

@slp whoops. Sorry I missed that, the improvement looks great.

Actively polling the virtqueue significantly reduces the latency of
each I/O operation, at the expense of using more CPU time. This
feature is especially useful when using low-latency devices (SSD,
NVMe) as the backend.

This change implements static polling. When a request arrives after
the queue has been idle, vhost_user_block keeps checking the virtqueue
for new requests until POLL_QUEUE_US (50us) has passed without finding
one.

POLL_QUEUE_US is defined to be 50us, based on the current latency of
enterprise SSDs (< 30us) and the overhead of the emulation.

This feature is enabled by default, and can be disabled by using the
"poll_queue" parameter of "block-backend".

This is a test using null_blk as a backend for the image, with the
following parameters:

 - null_blk gb=20 nr_devices=1 irqmode=2 completion_nsec=0 no_sched=1

With "poll_queue=false":

fio --ioengine=sync --bs=4k --rw randread --name randread --direct=1
--filename=/dev/vdb --time_based --runtime=10

randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.14
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=169MiB/s][r=43.2k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=1): err= 0: pid=433: Tue Feb 18 11:12:59 2020
  read: IOPS=43.2k, BW=169MiB/s (177MB/s)(1688MiB/10001msec)
    clat (usec): min=17, max=836, avg=21.64, stdev= 3.81
     lat (usec): min=17, max=836, avg=21.77, stdev= 3.81
    clat percentiles (nsec):
     |  1.00th=[19328],  5.00th=[19840], 10.00th=[20352], 20.00th=[21120],
     | 30.00th=[21376], 40.00th=[21376], 50.00th=[21376], 60.00th=[21632],
     | 70.00th=[21632], 80.00th=[21888], 90.00th=[22144], 95.00th=[22912],
     | 99.00th=[28544], 99.50th=[30336], 99.90th=[39168], 99.95th=[42752],
     | 99.99th=[71168]
   bw (  KiB/s): min=168440, max=188496, per=100.00%, avg=172912.00, stdev=3975.63, samples=19
   iops        : min=42110, max=47124, avg=43228.00, stdev=993.91, samples=19
  lat (usec)   : 20=5.90%, 50=94.08%, 100=0.02%, 250=0.01%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  cpu          : usr=10.35%, sys=25.82%, ctx=432417, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=432220,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=169MiB/s (177MB/s), 169MiB/s-169MiB/s (177MB/s-177MB/s), io=1688MiB (1770MB), run=10001-10001msec

Disk stats (read/write):
  vdb: ios=427867/0, merge=0/0, ticks=7346/0, in_queue=0, util=99.04%

With "poll_queue=true" (default):

fio --ioengine=sync --bs=4k --rw randread --name randread --direct=1
--filename=/dev/vdb --time_based --runtime=10

randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.14
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=260MiB/s][r=66.7k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=1): err= 0: pid=422: Tue Feb 18 11:14:47 2020
  read: IOPS=68.5k, BW=267MiB/s (280MB/s)(2674MiB/10001msec)
    clat (usec): min=10, max=966, avg=13.60, stdev= 3.49
     lat (usec): min=10, max=966, avg=13.70, stdev= 3.50
    clat percentiles (nsec):
     |  1.00th=[11200],  5.00th=[11968], 10.00th=[11968], 20.00th=[12224],
     | 30.00th=[12992], 40.00th=[13504], 50.00th=[13760], 60.00th=[13888],
     | 70.00th=[14016], 80.00th=[14144], 90.00th=[14272], 95.00th=[14656],
     | 99.00th=[20352], 99.50th=[23936], 99.90th=[35072], 99.95th=[36096],
     | 99.99th=[47872]
   bw (  KiB/s): min=265456, max=296456, per=100.00%, avg=274229.05, stdev=13048.14, samples=19
   iops        : min=66364, max=74114, avg=68557.26, stdev=3262.03, samples=19
  lat (usec)   : 20=98.84%, 50=1.15%, 100=0.01%, 250=0.01%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  cpu          : usr=8.24%, sys=21.15%, ctx=684669, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=684611,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=267MiB/s (280MB/s), 267MiB/s-267MiB/s (280MB/s-280MB/s), io=2674MiB (2804MB), run=10001-10001msec

Disk stats (read/write):
  vdb: ios=677855/0, merge=0/0, ticks=7026/0, in_queue=0, util=99.04%

Signed-off-by: Sergio Lopez <[email protected]>
@slp
Member Author

slp commented Feb 19, 2020

Done, I've replaced poll_us with the boolean poll_queue, enabled by default.

@rbradford rbradford merged commit 5c06b7f into cloud-hypervisor:master Feb 19, 2020
"vhost-user-block backend parameters \"image=<image_path>,\
sock=<socket_path>,num_queues=<number_of_queues>,\
readonly=true|false,direct=true|false\"",
readonly=true|false,direct=true|false,poll_queue=true|false\"",
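
For illustration, the new option slots into the block-backend parameter string like this (a hedged example, not taken from the patch: the image and socket paths are placeholders, and polling stays enabled unless poll_queue is explicitly set to false):

image=/path/to/disk.raw,sock=/tmp/vub.sock,num_queues=1,readonly=false,direct=true,poll_queue=false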
Member

Member Author

Sure, I'll do that in a separate patch.

Member

Thanks :)

Member

Actually, @rbradford pointed out to me that you don't need to do anything, since you're not modifying any of the VMM's devices. Sorry I got confused because now the main.rs CLI includes options for vhost-user backends.

Member Author

I was about to ask about this. Should I add poll_queue to DiskConfig and CH's disk argument? Reading the code, it looks to me that block-backend and disk need to be synchronized.

Member

@slp See #787 and #786

@sboeuf
Member

sboeuf commented Feb 19, 2020

This patch is nice @slp as it'll let the user choose perf at CPU cost.
Do you intend to apply it somewhere else? --net could definitely use this :)

@slp
Member Author

slp commented Feb 19, 2020

This patch is nice @slp as it'll let the user choose perf at CPU cost.
Do you intend to apply it somewhere else? --net could definitely use this :)

Good question. IIRC, virtio-net has its own interrupt suppression mechanism, so it may need a slightly different approach. It's probably worth creating an issue to track this, though.

// Actively poll the queue until POLL_QUEUE_US has passed
// without seeing a new request.
let mut now = Instant::now();
loop {
Contributor

All the events that come here are from the epoll in vhost_user_backend; is it necessary to add a loop here?

Member Author

This is for polling the queue for new elements without having to wait for an ioeventfd signal.
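
Roughly, the strategy is the sketch below (simplified; process_queue is a stand-in for the backend's real method and is assumed to return true when it handled at least one request): keep draining the queue while requests keep arriving, and only go back to sleeping on the ioeventfd once POLL_QUEUE_US elapses with the queue empty.

use std::time::Instant;

const POLL_QUEUE_US: u128 = 50;

// Simplified poll-then-wait loop; process_queue stands in for the
// backend's real request-handling method.
fn poll_then_wait(mut process_queue: impl FnMut() -> bool) {
    let mut now = Instant::now();
    loop {
        if process_queue() {
            // Found a request: reset the idle window and keep polling.
            now = Instant::now();
        } else if now.elapsed().as_micros() > POLL_QUEUE_US {
            // The queue stayed empty for the whole window: stop polling
            // and wait for the next ioeventfd notification.
            break;
        }
    }
}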

}
}

if self.event_idx {
Contributor

This branch is already present in process_queue; why check it again here?

Member Author

The check in process_queue is to determine whether the EVENT_IDX feature can be used to omit sending an interrupt to the guest, while the one here, as the comment says, is to tackle the fact that the Queue iterator only checks avail_index once.
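
In other words: with EVENT_IDX the driver may legitimately skip the kick for a request it added right after the device finished draining the ring, so after updating avail_event the device has to look at the queue again. A sketch of the resulting pattern (the names are stand-ins for the calls used by the patch, not exact signatures):

if self.event_idx {
    loop {
        // Publish how far we've read the avail ring, then re-check it:
        // a request added before this update may not come with a kick.
        vring.mut_queue().update_avail_event(mem);
        if !self.process_queue(vring) {
            // Nothing new after the update; any later request will be
            // accompanied by a notification, so it's safe to stop.
            break;
        }
    }
} else {
    self.process_queue(vring);
}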
