-
Notifications
You must be signed in to change notification settings - Fork 38.7k
Cache 26% more coins: Reduce CCoinsMap::value_type from 96 to 76 bytes #17060
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
|
Would be nice to test this on a Raspberry-like device. They tend to have low memory and slow storage. Unfortunately they generally also require pruning, and the way pruning currently works we flush the coin cache. For example I have an Orange Pi chipping away at IBD this week; although it has 2 GB of RAM, because I'm pruning it to 5 GB, it never uses more than 230 MB. |
A 512 GB micro SD card should do the job, no? |
|
@TheBlueMatt You might be interested here given you commented in #16970. |
Yep, works fine. |
|
Please lets keep the discussion high level before pointing out any typos. |
promag
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK. Didn't verified the claim but 26% sounds a good improvement to me.
AFAICT dirty and fresh are almost read and written at the same time, maybe use a 2 bit flag? Or is compiler doing that anyway?
|
@promag that's what the bitfield syntax |
|
Would like to test this on my VM using a HDD. |
|
bitcoinperf is a bit messy to set up. I think it requires docker. Is there any specific issue you ran into? I guess it might be better to report such upstream at https://github.com/chaincodelabs/bitcoinperf to not bloat this thread |
In my benchmarks I deleted It would be awesome to have some kind of logger in bitcoind that can periodically log statistics into a file in an easily parseable data. |
|
There were a couple issues. Could you recommend anything else to test p.s. thanks for the info @martinus. I'll figure something out with what you did and those tools mentioned in the link. |
|
@laanwj I mean stuff like this: could be where |
|
The debug.log from first benchmark (master) got overwritten by the second benchmark, and I didn't make a script to do a graph from debug.log yet anyway, so I'm just gonna post the
|
|
Thanks @GChuf for testing this! This are really nice results.
I don't think so:
So the branch used ~6% less CPU. Note that the percentage 233% is much higher than for master because it used that CPU in a much shorter timeframe. The percentage is calculated 100*(user + system time) / elapsed. 233 percent means that on average 2.33 cores were running for this job. |
Thanks for the explaination! The math works :) |
|
Concept ACK from me, seems like some great savings here. Though it'll be crucial to test this thoroughly across platforms. Will review in depth soon. |
|
I can also test it on windows. Does anyone have the windows executable for this already, to make my life easier? @martinus maybe? |
|
These are pretty impressive gains, so concept ACK. Some overall comments:
I do want to review the prevector changes in more detail. |
Sorry, I build only in Linux, never tried to build in Windows
I'm also unsure if it was everywhere necessary to backup and then restore the previous flags. Moving the flags into |
|
@DrahtBot might have a windows build ready tomorrow |
|
I have Gitian-built windows binaries of 03cb535384eb6d2f4284524c33ab03371bd263e2 here for those who are keen (@GChuf): https://send.firefox.com/download/c78ac103a4dabe86/#S70uxKCUua-baEeR0v029g |
|
@jamesob twice the time or twice the percentage? (i'm usually using the following flags: |
|
@martinus yes it did -- are you not rebased? Worth pointing out that a prevector<28> on master fits perfectly in a 32 byte or 64 byte cache line. No longer after this change (which is fine, we can increase to prevector<31> where sensitive to it, like std::vector<prevector<>>. Also, now that CAmounts are unaligned, they too can be extra slow to do something with. I doubt that's what's going on here, but worth looking at before merge. |
I didn't saw he linked to the master revision, so both master and my branch has #16957 so it's not that. I suspect that there is something strange going on with the threading code. Maybe some timing issue causes lots of caching misses for some reason? I don't know much about how the checkqueue and surrounding classes work though. |
|
I think I might have been able to reproduce this issue. I've tried I don't know why, but this commit seems to have a dramatic effect on performance. |
|
@martinus Are you aware that blocks before the assumevalid point don't get signature/script validated? |
|
For benchmarking purposes, it might be best to set all nodes compiled from different branches to |
No, I didn't know. Also it seems that since I didn't yet have this block, -reindex-chainstate signature/script validates everything It might be helpful to have a list of assumevalid blocks instead of just a single block, to prevent revalidating everything when that specific marker block is not yet there |
This change reduces CCoinMap's value_type from 96 bytes to 80 bytes by more tightly packing it's data. This is achieved by these changes: * Refactored prevector so it uses a single byte to determine its size when in direct mode * Reduce CScriptBase from 28 to 27 indirect bytes * Introduced PackableCAmount to be able to align CTxOut to 4 bytes to prevent padding when used as a member in Coin This tighter packing means more data can be stored in the coinsCache before it is full and has to be flushed to disk. In my benchmark, -reindex-chainstate was 6% faster and used 6% less memory. The cache could fit 14% more txo's before it had to resize.
Removed CCoinsCacheEntry's 1 byte for the flags and put it directly into Coin. That way we can get rid of unnecessary padding, which reduces the memory requirement for the coinsCache. We steal 2 bits from Coin's nHeight, so now there are only 29 bits left. Still, we are save until block 2^29-1 = 536870911.
This requires the operator=(CAmount) which I've implemented here as well. Using static_cast for nicer formatting. Note that it would be cleaner to make the user defined conversion`operator CAmount()` explicit too, but that would mean that practically everywhere it is used we need to add an explicit cast. See "C.164: Avoid implicit conversion operators" http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c164-avoid-implicit-conversion-operators
03cb535 to
22d1d94
Compare
|
Did some benches over the weekend ( See modest 1-2% improvement in runtime and (oddly) on the SSD machines I'm seeing significantly higher memory usage (usually ~200MB) for this branch. |
|
If
and there's no need for (EDIT: with 32-bit pointers, CAmount alignment becomes your blocker, but not sure that optimising that heavily for 32-bit systems makes sense?) |
| Needs rebase |
|
I'll close this issue, as @ajtowns noted, this PR has illegally aligned the CScript to 4 bytes to get it to 28 bytes. |
This PR is similar to bitcoin#17060, but without the incorrect alignment, and with safeguards. CCoinsCacheEntry makes use of Coin's padding to store the flag. When using the padding of the member, great care needs to be taken that the padding is not accidentally overwritten, e.g. when assigning/moving a coin. This PR solves this potential error source by chainging CCoinsCacheEntry into a class with private members and only beeing able to modify the coin with a call to MutableCoin(). That method safely remembers and then restores the flags, no matter what happens to the coin member. The original struct layout was like this: 96 CCoinsMap::value_type 36 COutPoint 32 uint256 4 uint32_t 4 >>> PADDING <<< 56 CCoinsCacheEntry 48 Coin 40 CTxOut 8 nValue 32 CScript 4 fCoinBase & nHeight 4 >>> PADDING <<< 1 flags (dirty & fresh) 7 >>> PADDING <<< After that PR the struct layout looks like this: 88 CCoinsMap::value_type 36 COutPoint 32 uint256 4 uint32_t 4 >>> PADDING <<< 48 CCoinsCacheEntry 48 Coin 40 CTxOut 8 nValue 32 CScript 4 fCoinBase & nHeight 4 padding: stores flags (dirty & fresh) That change translates to about ~8% less memory usage of the CCoinsMap.
This PR is similar to bitcoin#17060, but without the incorrect alignment, and with safeguards. CCoinsCacheEntry makes use of Coin's padding to store the flag. When using the padding of the member, great care needs to be taken that the padding is not accidentally overwritten, e.g. when assigning/moving a coin. This PR solves this potential error source by chainging CCoinsCacheEntry into a class with private members and only beeing able to modify the coin with a call to MutableCoin(). That method safely remembers and then restores the flags, no matter what happens to the coin member. The original struct layout was like this: 96 CCoinsMap::value_type 36 COutPoint 32 uint256 4 uint32_t 4 >>> PADDING <<< 56 CCoinsCacheEntry 48 Coin 40 CTxOut 8 nValue 32 CScript 4 fCoinBase & nHeight 4 >>> PADDING <<< 1 flags (dirty & fresh) 7 >>> PADDING <<< After that PR the struct layout looks like this: 88 CCoinsMap::value_type 36 COutPoint 32 uint256 4 uint32_t 4 >>> PADDING <<< 48 CCoinsCacheEntry 48 Coin 40 CTxOut 8 nValue 32 CScript 4 fCoinBase & nHeight 4 padding: stores flags (dirty & fresh) That change translates to about ~8% less memory usage of the CCoinsMap.
This is an attempt to more tightly pack the data inside
CCoinsMap. (replaces #16970, done after my investigations on #16957) Currently, there is quite a bit of overhead involved in theCCoinsMap::value_typedata structure (total bytes to the left):So there is quite a bit of padding data. In my experiements I've noticed that the compiler is forced to use a padding size >=8 only because
nValue'sCAmounttype isint64_twhich has to be 8-byte aligned. When replacing nValue with a 4-byte aligned data structure, the wholeCCoinsMap::value_typewill be aligned to 4 bytes, reducing padding.Another 4 bytes can be saved by refactoring
prevectorto only use a single byte for direct size information, and reducing CScript's size from 28 to 27 bytes. It is still able to directly cache most scripts as most are <= 25 bytes long.The remaining 4 bytes due to the 1 byte flag + 3 bytes padding in
CCoinsCacheEntrycan be removed by moving the flags intoCoin, stealing another 2 bits fromnHeight.Finally, the resulting data structure is 20 bytes smaller:
So we can store about 26% more data into dbcache's memory. I have evalued this on my Intel i7-8700, 32GB RAM, external SSD, for
-reindex-chainstatewith both-dbcache=500and-dbcache=5000:So on my machine there is definitive an improvement, but honestly I am not sure if this ~6% improvement is worth the change. Maybe the improvement is bigger with slower disk, as ~26% more transaction can be cached.