Additional Perf. Optimizations #6
Conversation
Update: I believe this pull request has been tried and tested sufficiently and should be ready for merging.
Sorry for the late reply, I've been busy lately and my inbox buried this.
LGTM. Will push a release soon so you can use mainline on NuGet. Apologies again for the late reply.
Published as https://www.nuget.org/packages/VCDiff/4.0.0. Thanks again @Sewer56!
Hi, thanks for the performance optimizations.
I'm working on a delta patching algorithm for a mod loader, and was interested in a managed implementation of VCDiff. In the process, I ended up making some miscellaneous extra improvements to the implementation.
Summary

- New APIs across `VCEncoder`, `Adler32.Combine` and `IByteBuffer` to reduce total number of heap allocations. New API used where possible.
- `ByteBuffer` can now be passed as source in `VCEncoder` (to avoid unnecessary memory copy).
- Uninitialized allocation of `Source` in `VCEncoder` (allocated array isn't zero initialized [default .NET behaviour] as it will be entirely overwritten anyway).
- Added `[SkipLocalsInit]` attribute for the .NET 5 target to methods with 1+ million invocations (on 100MB+ files).
- Capped `bufferSize` in `VCEncoder` to target length to avoid unnecessary excess memory allocation.
- Reads of `target` in `EncodeAsync` are now asynchronous. Only writing to the output still isn't.
- `GC.AllocateUninitializedArray` for .NET 5 in `BlockHash` to avoid implicit array zero fill (see the sketch after this list).
- Added a benchmark project (`VCDiff.Benchmark`).
- Optimisations to `BlockHash` tables.
- Optimisations to `ByteStreamReader`.
- `Span<T>` for `IByteBuffer`.
- Array rentals for less heap allocations on repeat decodes.
- Reduced `MemoryStream` allocations by using array pooling in `Microsoft.IO.RecyclableMemoryStream`.
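To make the allocation items above concrete, here's a minimal sketch of the techniques involved (my own illustration with made-up helper names, not code from this PR):

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Runtime.CompilerServices;
using Microsoft.IO; // NuGet: Microsoft.IO.RecyclableMemoryStream

static class AllocationTricks
{
    // One shared manager; every stream it hands out draws from pooled
    // buffers, so repeat encodes/decodes stop churning the heap.
    private static readonly RecyclableMemoryStreamManager StreamManager = new();

    // [SkipLocalsInit] (requires <AllowUnsafeBlocks> in the project file)
    // stops the JIT from zero-filling stack locals; only worth applying
    // to very hot methods.
    [SkipLocalsInit]
    public static byte[] CopyNoZeroFill(ReadOnlySpan<byte> source)
    {
        // Skips the implicit zeroing done by `new byte[...]`; safe here
        // because every element is overwritten before the array is read.
        byte[] dest = GC.AllocateUninitializedArray<byte>(source.Length);
        source.CopyTo(dest);
        return dest;
    }

    // Array rental: borrow a scratch buffer instead of allocating per call.
    public static void WithScratchBuffer(int size, Action<byte[]> work)
    {
        byte[] rented = ArrayPool<byte>.Shared.Rent(size);
        try { work(rented); }
        finally { ArrayPool<byte>.Shared.Return(rented); }
    }

    // Pooled stand-in for `new MemoryStream()`; disposing it returns the
    // underlying buffers to the pool.
    public static MemoryStream RentStream() => StreamManager.GetStream();
}
```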
Note
Although unused code was cleaned up a little, no public APIs were removed.
Code passes all tests as expected; should be slightly faster than before but I didn't strictly benchmark it.
Benchmarks
Encode
Encoding a 16 MB file 1 time.
Before:
Current (15f0e67):
Decode
Decoding a 16 MB file 128 times (~2GB per run).
Before:
Current:
For decode I've now come to a point where `MemoryStream` itself is the bottleneck, with a speed anywhere between 740MB/s and 1870MB/s depending on the file.

Real World Tests
On an i7 4790k @ 4.4GHz & 1866MHz CL9 DDR3 RAM.
Input is a 1.4GB GameCube ROM and a modified version of said ROM with custom code and ~590MiB of music.
(Not actually what I plan on using it on, but it was a good test case.)
xdelta 3.0.11 x64 (cmd: `xdelta.exe -e -S -A -s`)
vcdiff (15f0e67)
This isn't the only ROM I've tested. Another game mod with an equal amount of custom files yielded an even greater patch size improvement, from 1320MiB to 620MiB.
There's a chance that xdelta is performing some sort of tricks to optimise speed over compression ratio; its RAM usage suggests a small table size. It may simply be splitting files into smaller chunks as it encodes. I haven't personally investigated the xdelta source to verify this 100%.
With small files, the performance difference is mostly negligible, as per the blog.
I'm bringing those numbers up because your blog post focused a lot on comparing performance with xdelta.
Don't forget about compression efficiency; xdelta seems to drop off more the bigger the file size is, though I haven't sampled enough data to verify how strong that correlation is.
Other Notes
On `Span<T>` as a bottleneck: that kind of surprises me, as creating a span in release mode is basically free.
See the JIT-generated code on SharpLab:
https://sharplab.io/#v2:C4LghgzgtgPgAgJgIwFgBQcAMACOSB0ASgK4B2wAllAKb4DCA9lAA4UA21ATgMpcBuFAMbUIAbnTo4AZlwJsdbAG902VdhVrp2MhDAAzatm7MwpADwAjAJ7BqAPmwBZABTXbAKmzMGFclwA02L7A2BykAObAABYAlOpoakoaiapwAOzYpNQA7kYm5m72zt7BAaHUEdEx4glqAL7odUA=
The only artifact of creating a Span is a bounds check (length < 0); otherwise it's basically 2 instructions.
I wonder if the profiler may have something to do with that. Or maybe it was running in Debug; I don't know.
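For reference, this is the kind of code the link above compiles (my own minimal repro, not the repo's code):

```csharp
using System;

static class SpanFromArray
{
    // Wrapping an array in a Span just copies a reference and a length;
    // in release builds the JIT emits a couple of movs plus a length check.
    public static byte First(byte[] buffer)
    {
        Span<byte> span = buffer; // no allocation, no copy
        return span[0];
    }
}
```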
That said, although it's been a while since I've experimented with SIMD, I do remember there being a noticeable performance difference in using the more generic `Vector<T>` over a specific instruction set implementation, i.e. `Avx2`. I would prefer to believe the bigger performance difference came from there.
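Purely to illustrate the comparison, here is a toy sum of my own (nothing from this repo; real code should guard the intrinsics path with `Avx2.IsSupported`):

```csharp
using System;
using System.Numerics;                  // Vector<T>
using System.Runtime.Intrinsics;        // Vector256<T>, GetElement
using System.Runtime.Intrinsics.X86;    // Avx, Avx2

static class SumVariants
{
    // Generic path: portable, the JIT picks the widest vector it can.
    public static int SumGeneric(ReadOnlySpan<int> values)
    {
        var acc = Vector<int>.Zero;
        int i = 0;
        for (; i <= values.Length - Vector<int>.Count; i += Vector<int>.Count)
            acc += new Vector<int>(values.Slice(i));
        int sum = Vector.Dot(acc, Vector<int>.One);       // horizontal add
        for (; i < values.Length; i++) sum += values[i];  // scalar tail
        return sum;
    }

    // AVX2 path: explicit 256-bit intrinsics, x64-only fast path.
    public static unsafe int SumAvx2(ReadOnlySpan<int> values)
    {
        var acc = Vector256<int>.Zero;
        int i = 0;
        fixed (int* p = values)
        {
            for (; i <= values.Length - Vector256<int>.Count; i += Vector256<int>.Count)
                acc = Avx2.Add(acc, Avx.LoadVector256(p + i));
        }
        int sum = 0;
        for (int lane = 0; lane < Vector256<int>.Count; lane++)
            sum += acc.GetElement(lane);                  // horizontal add
        for (; i < values.Length; i++) sum += values[i];  // scalar tail
        return sum;
    }
}
```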
What is actually not free is generating a Span from `Memory<T>`:
https://sharplab.io/#v2:C4LghgzgtgPgAgJgIwFgBQcAMACOSB0ASgK4B2wAllAKb4DCA9lAA4UA21ATgMpcBuFAMbUIAbnTo4AZlwJsdbAG902VdhVrp2MhDAAzatm7MwpADwAjAJ7BqAPmwBZABSOqUBpytmAQjfvYACZgwGAAlOpoakoa0apwAOxBIWD4xqbiUWoAvujZQA==
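And the contrast, matching what the second link shows (again my own minimal repro):

```csharp
using System;

static class SpanFromMemory
{
    // The backing object of a Memory<T> may be an array or a MemoryManager<T>
    // (a ReadOnlyMemory<char> can even wrap a string); the Span getter has to
    // work out which before it can construct the span, so it is noticeably
    // more code than the plain array case.
    public static byte First(Memory<byte> buffer)
    {
        Span<byte> span = buffer.Span; // branchy property, not a plain wrap
        return span[0];
    }
}
```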
You're not the only one :) https://stackoverflow.com/questions/56285370/c-sharp-memory-span-unexpectedly-slow
Due to some special handling resulting from the fact that `Memory<T>` can also hold managed types, not only `unmanaged` ones.

Edit: Here's an extra snippet. Untested. Just to show the idea. Relies on runtime implementation, would not recommend.
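The snippet itself didn't survive above, so here's my own guess at the flavour of the idea, with the same caveats (the helper name is hypothetical): bypass `Memory<T>.Span` by assuming the memory is array-backed.

```csharp
using System;
using System.Runtime.InteropServices;

static class MemorySpanShortcut
{
    // Hypothetical fast path: assumes the Memory<byte> wraps a plain array,
    // which is a usage detail rather than a guarantee. Untested, not
    // recommended; just to show the idea.
    public static Span<byte> ArrayBackedSpan(Memory<byte> memory)
    {
        if (MemoryMarshal.TryGetArray<byte>(memory, out ArraySegment<byte> seg))
            return new Span<byte>(seg.Array, seg.Offset, seg.Count);

        return memory.Span; // safe fallback for non-array-backed memory
    }
}
```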