
Conversation

@ihciah (Contributor) commented Aug 6, 2024

Currently, when converting a desc chain to iovecs, the code allocates a fresh Vec::new() each time. To save that allocation, this PR proposes using a long-lived buffer to hold the result.

A simulated benchmark comparing a persistent buffer against smallvec: https://gist.github.com/ihciah/dcfaf13393bfada1ff8f372b4fa16e3f

ihciah requested a review from a team as a code owner on August 6, 2024 16:34
  let mut next_desc = Some(desc);

- let mut iovecs = Vec::new();
+ let mut iovecs = self.iovecs.borrow();
@rbradford (Member) commented Aug 6, 2024

Thanks for your contribution - but maybe we could just use smallvec instead - like we do elsewhere in the codebase?

Member

To use SmallVec efficiently we need to know the optimum size beforehand; otherwise it will still allocate and deallocate. Not sure how feasible this is with the code here.

Member

Adding some debugging (e.g. adding info!("iocount {}", iovecs.len())) and running various workloads shows that the vector never exceeds length 4 (and that only occurs when copying a very large file). I think a smallvec with reserved capacity of 4 is the best solution.
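For illustration, a minimal sketch of how that smallvec approach might look (the alias name and the push site are assumptions, not code from this PR):

use smallvec::SmallVec;

// Up to 4 entries live inline on the stack; a longer desc chain spills
// to a single heap allocation.
type IovecSmallVec = SmallVec<[libc::iovec; 4]>;

fn example() {
    let mut iovecs: IovecSmallVec = SmallVec::new();
    iovecs.push(libc::iovec {
        iov_base: std::ptr::null_mut(), // placeholder; real code points at guest memory
        iov_len: 0,
    });
    assert!(!iovecs.spilled()); // still inline: no heap allocation yet
}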

@liuw (Member) commented Aug 6, 2024

How much improvement do you see with this change?

@ihciah (Contributor, Author) commented Aug 7, 2024

> How much improvement do you see with this change?

I haven't benchmarked it. Could you provide some docs on how to benchmark it?

Virtio-net is a component that is used very frequently in real scenarios, so removing one allocation for each frame sent and received should have a relatively good performance benefit.

Using smallvec could work here too if the desc chain's length is small enough (but it is unpredictable). However, since we make a syscall immediately after constructing the iovecs, there will never be multiple iovec lists alive at the same time, so it's more appropriate to cache the iovecs in a shared buffer stored in a long-lived structure.
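As a rough sketch of that idea (the TxQueue type and its layout here are illustrative, not the PR's actual code):

struct TxQueue {
    // Long-lived scratch space, reused for every frame.
    iovecs: Vec<libc::iovec>,
}

impl TxQueue {
    fn process_desc_chain(&mut self) {
        self.iovecs.clear(); // discard stale entries from the previous frame
        // ...push one libc::iovec per descriptor, then issue the syscall.
        // Capacity allocated on earlier frames is retained, so the steady
        // state performs no heap allocation at all.
    }
}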

@liuw (Member) commented Aug 7, 2024

> > How much improvement do you see with this change?
>
> I haven't benchmarked it. Could you provide some docs on how to benchmark it?
>
> Virtio-net is a component that is used very frequently in real scenarios, so removing one allocation for each frame sent and received should have a relatively good performance benefit.
>
> Using smallvec could work here too if the desc chain's length is small enough (but it is unpredictable). However, since we make a syscall immediately after constructing the iovecs, there will never be multiple iovec lists alive at the same time, so it's more appropriate to cache the iovecs in a shared buffer stored in a long-lived structure.

The best case is to run your target workload, or a synthetic workload you care about, and show that the change increases throughput or decreases latency. Alternatively, there is the dhat feature, with which you can see how many allocations are saved in a test run.

In general, I think this is a good change. The code is simple enough. I was just curious to know whether there is any directly visible benefit.

Somewhat related -- recently I observed that CH's block perf leaves something to be desired. I don't have any concrete ideas yet, but if I end up tuning the internal buffer size, it may make sense to move away from SmallVec to this sort of persistent buffer, too.

Lastly, please properly sign off your patch.

ihciah force-pushed the iovec-buf-for-net branch from 3188b91 to 2f81849 on August 7, 2024 14:35
@ihciah (Contributor, Author) commented Aug 7, 2024

> Lastly, please properly sign off your patch.

Ok, it is added now.

> Adding some debugging (e.g. adding info!("iocount {}", iovecs.len())) and running various workloads shows that the vector never exceeds length 4 (and that only occurs when copying a very large file). I think a smallvec with reserved capacity of 4 is the best solution.

I can confirm it won't be more than 4 in most cases, if I understood it correctly (I searched for references to virtqueue_add_sgs in the kernel's code).

SmallVec works here given the current kernel driver implementation, and a persistent buffer works too. What do you think, @rbradford @liuw? If needed, I can change it to smallvec.

@liuw (Member) commented Aug 7, 2024

> > Lastly, please properly sign off your patch.
>
> Ok, it is added now.
>
> > Adding some debugging (e.g. adding info!("iocount {}", iovecs.len())) and running various workloads shows that the vector never exceeds length 4 (and that only occurs when copying a very large file). I think a smallvec with reserved capacity of 4 is the best solution.
>
> I can confirm it won't be more than 4 in most cases, if I understood it correctly (I searched for references to virtqueue_add_sgs in the kernel's code).
>
> SmallVec works here given the current kernel driver implementation, and a persistent buffer works too. What do you think, @rbradford @liuw? If needed, I can change it to smallvec.

Assuming the kernel driver does not change very frequently, I think using SmallVec is better. You could even use 8 to leave some room for growth in the future. In the commit message, please say why the specific value was chosen.

@ihciah (Contributor, Author) commented Aug 8, 2024

> I can confirm it won't be more than 4 in most cases, if I understood it correctly (I searched for references to virtqueue_add_sgs in the kernel's code).

That was my mistake; I'm not very familiar with the kernel's code yet. I did a bit more investigation and found that at most 19 descriptors can be linked in a single chain.

  1. start_xmit is used to send an skb. After calling xmit_skb, which contains the core logic, it calls check_sq_full_and_disable to see whether there is enough descriptor room for the next packet. At the end, it kicks the device.
  2. In xmit_skb, it calls sg_set_buf to put the virtio header in the first sg, then calls skb_to_sgvec to map all skb fragments to sgs (returning the number of sgs used).
  3. In check_sq_full_and_disable, it checks if (sq->vq->num_free < 2 + MAX_SKB_FRAGS), i.e. whether there is enough room to map the descriptors. If not, it calls netif_stop_subqueue to stop further sending.

The maximum possible desc chain size is therefore 2 + MAX_SKB_FRAGS: 1 virtio header, 1 frame header, and up to MAX_SKB_FRAGS fragments. MAX_SKB_FRAGS is defined in the kernel as 17.
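In other words (constants mirrored from the kernel source at the time of this discussion, shown here for illustration):

// MAX_SKB_FRAGS as defined in the kernel at the time of this discussion.
const MAX_SKB_FRAGS: usize = 17;
// 1 descriptor for the virtio header + 1 for the frame header
// + up to MAX_SKB_FRAGS fragment descriptors.
const MAX_DESC_CHAIN_LEN: usize = 2 + MAX_SKB_FRAGS; // = 19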

In most cases the desc chain won't be very long, and a smallvec with N=4 would be enough. But given the possibility above, I still think a persistent buffer is the better approach, as it definitely avoids per-packet allocation and saves smallvec's stack-copy cost. Compared to smallvec, I don't see any drawbacks, since in our case no more than one instance exists at a time.

However, I'm open to changing it to smallvec if we decide to; either is better than the current Vec.

@rbradford @liuw, would love to hear your opinions!

@liuw (Member) commented Aug 8, 2024

> > I can confirm it won't be more than 4 in most cases, if I understood it correctly (I searched for references to virtqueue_add_sgs in the kernel's code).
>
> That was my mistake; I'm not very familiar with the kernel's code yet. I did a bit more investigation and found that at most 19 descriptors can be linked in a single chain.
>
>   1. start_xmit is used to send an skb. After calling xmit_skb, which contains the core logic, it calls check_sq_full_and_disable to see whether there is enough descriptor room for the next packet. At the end, it kicks the device.
>   2. In xmit_skb, it calls sg_set_buf to put the virtio header in the first sg, then calls skb_to_sgvec to map all skb fragments to sgs (returning the number of sgs used).
>   3. In check_sq_full_and_disable, it checks if (sq->vq->num_free < 2 + MAX_SKB_FRAGS), i.e. whether there is enough room to map the descriptors. If not, it calls netif_stop_subqueue to stop further sending.
>
> The maximum possible desc chain size is therefore 2 + MAX_SKB_FRAGS: 1 virtio header, 1 frame header, and up to MAX_SKB_FRAGS fragments. MAX_SKB_FRAGS is defined in the kernel as 17.
>
> In most cases the desc chain won't be very long, and a smallvec with N=4 would be enough. But given the possibility above, I still think a persistent buffer is the better approach, as it definitely avoids per-packet allocation and saves smallvec's stack-copy cost. Compared to smallvec, I don't see any drawbacks, since in our case no more than one instance exists at a time.
>
> However, I'm open to changing it to smallvec if we decide to; either is better than the current Vec.
>
> @rbradford @liuw, would love to hear your opinions!

I don't feel too strongly either way. Like I said, the code you have is simple to reason about. One request: if we're going to use persistent buffers, you should allocate the vectors to an appropriate size (for example 16) from the get-go.

#[derive(Default, Clone)]
struct IovecBuffer(Vec<libc::iovec>);

unsafe impl Send for IovecBuffer {}
Member

You need to document these unsafes.

@ihciah (Contributor, Author)

Send and Sync are usually auto-implemented, but not for raw pointers (libc::iovec contains pointers).
As an empty container, it can safely have its ownership moved and be shared between threads.

Member

I know, but without a comment your code won't pass the CI check. We require safety comments for all unsafes.
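For example, a safety comment along these lines would satisfy that requirement (the wording here is illustrative, not the comment that actually landed):

struct IovecBuffer(Vec<libc::iovec>);

// SAFETY: IovecBuffer stores libc::iovec values purely as data; the raw
// pointers inside are never dereferenced through a shared reference, so
// moving the buffer to another thread cannot cause a data race.
unsafe impl Send for IovecBuffer {}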

struct IovecBuffer(Vec<libc::iovec>);

unsafe impl Send for IovecBuffer {}
unsafe impl Sync for IovecBuffer {}
Member

Does this really need to be Sync? What happens if you drop it?


impl Drop for IovecBufferBorrowed<'_> {
    fn drop(&mut self) {
        self.0.clear();
Member

My Rust knowledge on this part is a bit rusty (pun intended), but why clear here?

Do you even need to implement Drop? When IovecBuffer is dropped, its only member will also get dropped, right?

@ihciah (Contributor, Author) commented Aug 9, 2024

Here IovecBufferBorrowed is a zero-cost wrapper around a mutably borrowed Vec. It is introduced so that Drop can be implemented manually (instead of just returning &mut Vec<_>).

Clearing the Vec on drop ensures no old values are left inside for the next user. With the Rust borrow checker, we can guarantee that the buffer cannot be touched by anyone else during a single {write, syscall, clear} sequence (otherwise it would fail to compile), and that no stale values remain when it is used next time.

This makes it behave just like creating a new Vec:

  • IovecBuffer::borrow -> Vec::new
  • IovecBufferBorrowed::some_fn -> Vec::some_fn because of Deref/DerefMut
  • IovecBufferBorrowed::drop -> Vec::drop
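Put together, the wrapper described above might look roughly like this (a sketch reconstructed from the discussion, not copied from the PR; the &mut-based borrow is an assumption):

struct IovecBuffer(Vec<libc::iovec>);

// Zero-cost guard over the mutably borrowed Vec; clears it on drop so the
// next borrow starts empty, like a fresh Vec, while keeping the capacity.
struct IovecBufferBorrowed<'a>(&'a mut Vec<libc::iovec>);

impl IovecBuffer {
    fn borrow(&mut self) -> IovecBufferBorrowed<'_> {
        IovecBufferBorrowed(&mut self.0)
    }
}

impl std::ops::Deref for IovecBufferBorrowed<'_> {
    type Target = Vec<libc::iovec>;
    fn deref(&self) -> &Self::Target {
        &*self.0
    }
}

impl std::ops::DerefMut for IovecBufferBorrowed<'_> {
    fn deref_mut(&mut self) -> &mut Self::Target {
        &mut *self.0
    }
}

impl Drop for IovecBufferBorrowed<'_> {
    fn drop(&mut self) {
        self.0.clear();
    }
}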

Member

I see. I misread the code. I thought you were implementing Drop for IovecBuffer.

ihciah force-pushed the iovec-buf-for-net branch from 2f81849 to 5e8ce28 on August 11, 2024 09:03
@rbradford (Member)

Typo in the commit message: convertion -> conversion

ihciah force-pushed the iovec-buf-for-net branch from 5e8ce28 to f169548 on August 12, 2024 15:48
ihciah changed the title from "virtio-devices: net: reduce vec allocations for iovec convertion" to "virtio-devices: net: reduce vec allocations for iovec conversion" on Aug 12, 2024
@ihciah (Contributor, Author) commented Aug 12, 2024

> Typo in the commit message: convertion -> conversion

Thank you, it has been updated.

liuw closed this on Aug 13, 2024
liuw reopened this on Aug 13, 2024

impl IovecBuffer {
    const fn new() -> Self {
        IovecBuffer(Vec::new())
Member

Might as well do Vec::with_capacity(4), since experimentation shows that is a sensible starting number to use (and it avoids the initial reallocations).

I'm still not 100% convinced this is better than smallvec in terms of maintenance overhead (avoiding heap allocations like this is exactly why we started using smallvec).

But I can see you've put a lot of effort into this work - so with the change I propose above lgtm.

@ihciah (Contributor, Author)

Sure. I've updated it to 4 as you suggested.

> But I can see you've put a lot of effort into this work - so with the change I propose above lgtm.

Please don't approve it just for that reason; I believe I can convince you of my point of view. This is a small change that didn't cost much time, but to justify it I read some kernel code and ran benchmarks, which was very rewarding and I think worthwhile.

I did a simple benchmark (simulated only, not with the VMM) here: https://gist.github.com/ihciah/dcfaf13393bfada1ff8f372b4fa16e3f

The result shows that even for len=3, the borrowed vec costs only 48% of the CPU time of smallvec (N=4). I believe this is because some stack-copying cost is saved; for longer data, the borrowed vec saves even more. The link above has more results and a reproduction guide.
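For reference, the constructor presumably ends up along these lines (a sketch, not the merged code; note that Vec::with_capacity is not a const fn, so the const qualifier from the earlier version would have to be dropped or the capacity reserved lazily):

struct IovecBuffer(Vec<libc::iovec>);

impl IovecBuffer {
    // Start with room for 4 entries, the length observed in practice,
    // to avoid the first few reallocations.
    fn new() -> Self {
        IovecBuffer(Vec::with_capacity(4))
    }
}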

ihciah force-pushed the iovec-buf-for-net branch from f169548 to 2bd078d on August 14, 2024 12:13
rbradford added this pull request to the merge queue on Aug 15, 2024
Merged via the queue into cloud-hypervisor:main with commit 3320015 Aug 15, 2024