
Conversation

@ihciah (Contributor) commented Aug 6, 2024

Currently, when converting a desc chain to iovecs, the code allocates a fresh Vec::new() each time. To save that allocation, this PR proposes using a long-lived buffer to hold the result.

A simulated benchmark comparing a persistent buffer against smallvec: https://gist.github.com/ihciah/dcfaf13393bfada1ff8f372b4fa16e3f

ihciah requested a review from a team as a code owner on August 6, 2024 16:34
  let mut next_desc = Some(desc);

- let mut iovecs = Vec::new();
+ let mut iovecs = self.iovecs.borrow();
@rbradford (Member) commented Aug 6, 2024

Thanks for your contribution - but maybe we could just use smallvec instead - like we do elsewhere in the codebase?

Member

To use SmallVec efficiently we need to know the optimum size beforehand; otherwise it will still allocate and deallocate. Not sure how feasible this is with the code here.

Member

Adding some debugging (e.g. adding info!("iocount {}", iovecs.len())) and running various workloads shows that the vector never exceeds length 4 (and that only occurs when copying a very large file). I think a smallvec with reserved capacity of 4 is the best solution.
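For illustration, a minimal sketch of how that smallvec approach might look (the alias name and the push site are assumptions, not code from this PR):

use smallvec::SmallVec;

// Up to 4 entries live inline on the stack; a longer desc chain spills
// to a single heap allocation.
type IovecSmallVec = SmallVec<[libc::iovec; 4]>;

fn example() {
    let mut iovecs: IovecSmallVec = SmallVec::new();
    iovecs.push(libc::iovec {
        iov_base: std::ptr::null_mut(), // placeholder; real code points at guest memory
        iov_len: 0,
    });
    assert!(!iovecs.spilled()); // still inline: no heap allocation yet
}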

@liuw (Member) commented Aug 6, 2024

How much improvement do you see with this change?

@ihciah (Contributor, Author) commented Aug 7, 2024

> How much improvement do you see with this change?

I haven't benchmarked it. Could you provide some docs on how to benchmark it?

Virtio-net is a component that is used very frequently in real scenarios, so removing one allocation for each frame sent and received should have a relatively good performance benefit.

Using smallvec could work here too if the desc chain's length is small enough (but it is unpredictable). However, since we make a syscall immediately after constructing the iovecs, there will never be multiple iovec lists alive at the same time, so it's more appropriate to cache the iovecs in a shared buffer stored in a long-lived structure.
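As a rough sketch of that idea (the TxQueue type and its layout here are illustrative, not the PR's actual code):

struct TxQueue {
    // Long-lived scratch space, reused for every frame.
    iovecs: Vec<libc::iovec>,
}

impl TxQueue {
    fn process_desc_chain(&mut self) {
        self.iovecs.clear(); // discard stale entries from the previous frame
        // ...push one libc::iovec per descriptor, then issue the syscall.
        // Capacity allocated on earlier frames is retained, so the steady
        // state performs no heap allocation at all.
    }
}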

@liuw (Member) commented Aug 7, 2024

> > How much improvement do you see with this change?
>
> I haven't benchmarked it. Could you provide some docs on how to benchmark it?
>
> Virtio-net is a component that is used very frequently in real scenarios, so removing one allocation for each frame sent and received should have a relatively good performance benefit.
>
> Using smallvec could work here too if the desc chain's length is small enough (but it is unpredictable). However, since we make a syscall immediately after constructing the iovecs, there will never be multiple iovec lists alive at the same time, so it's more appropriate to cache the iovecs in a shared buffer stored in a long-lived structure.

The best case is to run your target workload, or a synthetic workload you care about, and show that the change increases throughput or decreases latency. Alternatively, there is the dhat feature, with which you can see how many allocations are saved in a test run.

In general, I think this is a good change. The code is simple enough. I was just curious to know whether there is any directly visible benefit.

Somewhat related -- recently I observed that CH's block perf leaves something to be desired. I don't have any concrete ideas yet, but if I end up tuning the internal buffer size, it may make sense to move away from SmallVec to this sort of persistent buffer, too.

Lastly, please properly sign off your patch.

ihciah force-pushed the iovec-buf-for-net branch from 3188b91 to 2f81849 on August 7, 2024 14:35
@ihciah (Contributor, Author) commented Aug 7, 2024

> Lastly, please properly sign off your patch.

Ok, it is added now.

> Adding some debugging (e.g. adding info!("iocount {}", iovecs.len())) and running various workloads shows that the vector never exceeds length 4 (and that only occurs when copying a very large file). I think a smallvec with reserved capacity of 4 is the best solution.

I can confirm it won't be more than 4 in most cases, if I understood it correctly (I searched for references to virtqueue_add_sgs in the kernel's code).

SmallVec works here given the current kernel driver implementation, and a persistent buffer works too. What do you think, @rbradford @liuw? If needed, I can change it to smallvec.

@liuw (Member) commented Aug 7, 2024

> > Lastly, please properly sign off your patch.
>
> Ok, it is added now.
>
> > Adding some debugging (e.g. adding info!("iocount {}", iovecs.len())) and running various workloads shows that the vector never exceeds length 4 (and that only occurs when copying a very large file). I think a smallvec with reserved capacity of 4 is the best solution.
>
> I can confirm it won't be more than 4 in most cases, if I understood it correctly (I searched for references to virtqueue_add_sgs in the kernel's code).
>
> SmallVec works here given the current kernel driver implementation, and a persistent buffer works too. What do you think, @rbradford @liuw? If needed, I can change it to smallvec.

Assuming the kernel driver does not change very frequently, I think using SmallVec is better. You could even use 8 to leave some room for growth in the future. In the commit message, please say why the specific value was chosen.

@ihciah (Contributor, Author) commented Aug 8, 2024

> I can confirm it won't be more than 4 in most cases, if I understood it correctly (I searched for references to virtqueue_add_sgs in the kernel's code).

That was my mistake; I'm not very familiar with the kernel's code yet. I did a bit more investigation and found that at most 19 descriptors can be linked in a single chain.

  1. start_xmit is used to send an skb. After calling xmit_skb, which contains the core logic, it calls check_sq_full_and_disable to see whether there is enough descriptor room for the next packet. At the end, it kicks the device.
  2. In xmit_skb, it calls sg_set_buf to put the virtio header in the first sg, then calls skb_to_sgvec to map all skb fragments to sgs (returning the number of sgs used).
  3. In check_sq_full_and_disable, it checks if (sq->vq->num_free < 2 + MAX_SKB_FRAGS), i.e. whether there is enough room to map the descriptors. If not, it calls netif_stop_subqueue to stop further sending.

The maximum possible desc chain size is therefore 2 + MAX_SKB_FRAGS: 1 virtio header, 1 frame header, and up to MAX_SKB_FRAGS fragments. MAX_SKB_FRAGS is defined in the kernel as 17.
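In other words (constants mirrored from the kernel source at the time of this discussion, shown here for illustration):

// MAX_SKB_FRAGS as defined in the kernel at the time of this discussion.
const MAX_SKB_FRAGS: usize = 17;
// 1 descriptor for the virtio header + 1 for the frame header
// + up to MAX_SKB_FRAGS fragment descriptors.
const MAX_DESC_CHAIN_LEN: usize = 2 + MAX_SKB_FRAGS; // = 19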

In most cases the desc chain won't be very long, and a smallvec with N=4 would be enough. But given the possibility above, I still think a persistent buffer is the better approach, as it definitely avoids per-packet allocation and saves smallvec's stack-copy cost. Compared to smallvec, I don't see any drawbacks, since in our case no more than one instance exists at a time.

However, I'm open to changing it to smallvec if we decide to; either is better than the current Vec.

@rbradford @liuw, would love to hear your opinions!

@liuw (Member) commented Aug 8, 2024

> > I can confirm it won't be more than 4 in most cases, if I understood it correctly (I searched for references to virtqueue_add_sgs in the kernel's code).
>
> That was my mistake; I'm not very familiar with the kernel's code yet. I did a bit more investigation and found that at most 19 descriptors can be linked in a single chain.
>
>   1. start_xmit is used to send an skb. After calling xmit_skb, which contains the core logic, it calls check_sq_full_and_disable to see whether there is enough descriptor room for the next packet. At the end, it kicks the device.
>   2. In xmit_skb, it calls sg_set_buf to put the virtio header in the first sg, then calls skb_to_sgvec to map all skb fragments to sgs (returning the number of sgs used).
>   3. In check_sq_full_and_disable, it checks if (sq->vq->num_free < 2 + MAX_SKB_FRAGS), i.e. whether there is enough room to map the descriptors. If not, it calls netif_stop_subqueue to stop further sending.
>
> The maximum possible desc chain size is therefore 2 + MAX_SKB_FRAGS: 1 virtio header, 1 frame header, and up to MAX_SKB_FRAGS fragments. MAX_SKB_FRAGS is defined in the kernel as 17.
>
> In most cases the desc chain won't be very long, and a smallvec with N=4 would be enough. But given the possibility above, I still think a persistent buffer is the better approach, as it definitely avoids per-packet allocation and saves smallvec's stack-copy cost. Compared to smallvec, I don't see any drawbacks, since in our case no more than one instance exists at a time.
>
> However, I'm open to changing it to smallvec if we decide to; either is better than the current Vec.
>
> @rbradford @liuw, would love to hear your opinions!

I don't feel too strongly either way. Like I said, the code you have is simple to reason about. One request: if we're going to use persistent buffers, you should allocate the vectors to an appropriate size (for example 16) from the get-go.

#[derive(Default, Clone)]
struct IovecBuffer(Vec<libc::iovec>);

unsafe impl Send for IovecBuffer {}
Member

You need to document these unsafes.

@ihciah (Contributor, Author)

Send and Sync are usually auto-implemented, but not for raw pointers (libc::iovec contains pointers).
As an empty container, it can safely have its ownership moved and be shared between threads.

Member

I know, but without a comment your code won't pass the CI check. We require safety comments for all unsafes.
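For example, a safety comment along these lines would satisfy that requirement (the wording here is illustrative, not the comment that actually landed):

struct IovecBuffer(Vec<libc::iovec>);

// SAFETY: IovecBuffer stores libc::iovec values purely as data; the raw
// pointers inside are never dereferenced through a shared reference, so
// moving the buffer to another thread cannot cause a data race.
unsafe impl Send for IovecBuffer {}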

struct IovecBuffer(Vec<libc::iovec>);

unsafe impl Send for IovecBuffer {}
unsafe impl Sync for IovecBuffer {}
Member

Does this really need to be Sync? What happens if you drop it?


impl Drop for IovecBufferBorrowed<'_> {
    fn drop(&mut self) {
        self.0.clear();
Member

My Rust knowledge on this part is a bit rusty (pun intended), but why clear here?

Do you even need to implement Drop? When IovecBuffer is dropped, its only member will also get dropped, right?

@ihciah (Contributor, Author) commented Aug 9, 2024

Here IovecBufferBorrowed is a zero-cost wrapper around a mutably borrowed Vec. It is introduced so that Drop can be implemented manually (instead of just returning &mut Vec<_>).

Clearing the Vec on drop ensures no old values are left inside for the next user. With the Rust borrow checker, we can guarantee that the buffer cannot be touched by anyone else during a single {write, syscall, clear} sequence (otherwise it would fail to compile), and that no stale values remain when it is used next time.

This makes it behave just like creating a new Vec:

  • IovecBuffer::borrow -> Vec::new
  • IovecBufferBorrowed::some_fn -> Vec::some_fn because of Deref/DerefMut
  • IovecBufferBorrowed::drop -> Vec::drop
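Put together, the wrapper described above might look roughly like this (a sketch reconstructed from the discussion, not copied from the PR; the &mut-based borrow is an assumption):

struct IovecBuffer(Vec<libc::iovec>);

// Zero-cost guard over the mutably borrowed Vec; clears it on drop so the
// next borrow starts empty, like a fresh Vec, while keeping the capacity.
struct IovecBufferBorrowed<'a>(&'a mut Vec<libc::iovec>);

impl IovecBuffer {
    fn borrow(&mut self) -> IovecBufferBorrowed<'_> {
        IovecBufferBorrowed(&mut self.0)
    }
}

impl std::ops::Deref for IovecBufferBorrowed<'_> {
    type Target = Vec<libc::iovec>;
    fn deref(&self) -> &Self::Target {
        &*self.0
    }
}

impl std::ops::DerefMut for IovecBufferBorrowed<'_> {
    fn deref_mut(&mut self) -> &mut Self::Target {
        &mut *self.0
    }
}

impl Drop for IovecBufferBorrowed<'_> {
    fn drop(&mut self) {
        self.0.clear();
    }
}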

Member

I see. I misread the code. I thought you were implementing Drop for IovecBuffer.

ihciah force-pushed the iovec-buf-for-net branch from 2f81849 to 5e8ce28 on August 11, 2024 09:03
@rbradford (Member)

Typo in the commit message: convertion -> conversion

ihciah force-pushed the iovec-buf-for-net branch from 5e8ce28 to f169548 on August 12, 2024 15:48
ihciah changed the title from "virtio-devices: net: reduce vec allocations for iovec convertion" to "virtio-devices: net: reduce vec allocations for iovec conversion" on Aug 12, 2024
@ihciah (Contributor, Author) commented Aug 12, 2024

> Typo in the commit message: convertion -> conversion

Thank you, it has been updated.

liuw closed this on Aug 13, 2024
liuw reopened this on Aug 13, 2024

impl IovecBuffer {
    const fn new() -> Self {
        IovecBuffer(Vec::new())
Member

Might as well do Vec::with_capacity(4), since experimentation shows that is a sensible starting number to use (and it avoids the initial reallocations).

I'm still not 100% convinced this is better than smallvec in terms of maintenance overhead (avoiding heap allocations like this is exactly why we started using smallvec).

But I can see you've put a lot of effort into this work - so with the change I propose above lgtm.

@ihciah (Contributor, Author)

Sure. I've updated it to 4 as you suggested.

> But I can see you've put a lot of effort into this work - so with the change I propose above lgtm.

Please don't approve it just for that reason; I believe I can convince you of my point of view. This is a small change that didn't cost much time, but to justify it I read some kernel code and ran benchmarks, which was very rewarding and I think worthwhile.

I did a simple benchmark (simulated only, not with the VMM) here: https://gist.github.com/ihciah/dcfaf13393bfada1ff8f372b4fa16e3f

The result shows that even for len=3, the borrowed vec costs only 48% of the CPU time of smallvec (N=4). I believe this is because some stack-copying cost is saved; for longer data, the borrowed vec saves even more. The link above has more results and a reproduction guide.
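For reference, the constructor presumably ends up along these lines (a sketch, not the merged code; note that Vec::with_capacity is not a const fn, so the const qualifier from the earlier version would have to be dropped or the capacity reserved lazily):

struct IovecBuffer(Vec<libc::iovec>);

impl IovecBuffer {
    // Start with room for 4 entries, the length observed in practice,
    // to avoid the first few reallocations.
    fn new() -> Self {
        IovecBuffer(Vec::with_capacity(4))
    }
}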

ihciah force-pushed the iovec-buf-for-net branch from f169548 to 2bd078d on August 14, 2024 12:13
rbradford added this pull request to the merge queue on Aug 15, 2024
Merged via the queue into cloud-hypervisor:main with commit 3320015 Aug 15, 2024