-
Notifications
You must be signed in to change notification settings - Fork 2
Speed up dirty_log() (a lot) #36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up dirty_log() (a lot) #36
Conversation
b315ec0 to
2d294ef
Compare
Instead of using ad-hoc code, just write an extension to the Iterator trait that we can easily unit test. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
It improves iteration code a lot! I'll use it in the upcoming commit to speed up dirty bitmap scanning. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
With this change, we don't need a copy of the vector. Just something that can be coerced into an iterator. We also use the bit position iterator to make the code somewhat clearer. The new code is much faster, because it will not iterate over every bit, just each 1 bit in the input. The next commit will complete this optimization and have some concrete numbers. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
This would be a good opportunity to optimize another pointless vector away, but I don't have a good way to test this at the moment. But maybe someone else gives it a shot. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
... by passing the slice along instead. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
... by just passing the iterator along. For large VMs this bitmap is gigantic. A 12TB VM has 384MB of dirty bitmap. With all these optimizations from the previous commits in place, we see quite the improvement when it comes to scanning the dirty bitmap. For a bitmap with 1% bits (randomly) set, dirty_log() takes: Original code: 2166ms (100.0%) New code: 382ms ( 17.6%) on my system. The sparser the dirty bitmap the faster. Scanning an empty bitmap is 100x faster. For a 5% populated bitmap we are still 3x faster. If someone wants to play with this, there is a benchmark harness here: https://github.com/blitz/chv-bitmap-bench On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
olivereanderson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love these changes!
The new code definitely avoids large copies and is at least as readable as the old.
It would be interesting to see some benchmarks with real world data, but I doubt this will ever be slower than what we currently have (except perhaps for extremely dense bitmaps which should not be a common occurrence).
|
We now have a ticket to track this: https://github.com/cobaltcore-dev/cobaltcore/issues/316 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love it, thanks! Please have some more cookies 🍪 🍪
This is fallout from #30. I've noticed that the function that turns dirty bitmaps during migration into
MemoryRangeTableis very inefficient. It creates useless temporary vectors. It has a very basic method of finding the 1 bits in the bitmap.I've transformed the dirty bitmap scanning using iterators (without temporaries). To do that I introduced
BitposIteratorwhich takes an iterator over a bitmap and emits the positions of all 1 bits. With some Itertools magic (coalesce) we can then easily merge the resulting bit ranges.The code now:
Improvements
I've benchmarked this for a 384 MB dirty bitmap (equivalent to 12TB VM). I got the following results:
The 0% means that we are scanning the empty bitmap at 24 GB/s (on my system). I should have 56GB/s memory bandwidth, so this is pretty good without any SIMD magic.
For even dirtier bitmaps, the performance will be more equal, because it's dominated by constructing the huge vector inside of
MemoryRangeTable. This can also be optimized away (https://github.com/cobaltcore-dev/cobaltcore/issues/280) and the optimization in this PR is a first step.These numbers were generated using criterion using this harness.
Testing