Skip to content

Conversation

@martinus
Copy link
Contributor

@martinus martinus commented Oct 5, 2019

This is an attempt to more tightly pack the data inside CCoinsMap. (replaces #16970, done after my investigations on #16957) Currently, there is quite a bit of overhead involved in the CCoinsMap::value_type data structure (total bytes to the left):

96 CCoinsMap::value_type
    36 COutPoint
        32 uint256
         4 uint32_t
     4 >>> PADDING <<<
    56 CCoinsCacheEntry
        48 Coin
            40 CTxOut
                8 nValue
                32 CScript
             4 fCoinBase & nHeight
             4 >>> PADDING <<<
         1 flags (dirty & fresh)
         7 >>> PADDING <<< 

So there is quite a bit of padding data. In my experiements I've noticed that the compiler is forced to use a padding size >=8 only because nValue's CAmount type is int64_t which has to be 8-byte aligned. When replacing nValue with a 4-byte aligned data structure, the whole CCoinsMap::value_type will be aligned to 4 bytes, reducing padding.

Another 4 bytes can be saved by refactoring prevector to only use a single byte for direct size information, and reducing CScript's size from 28 to 27 bytes. It is still able to directly cache most scripts as most are <= 25 bytes long.

The remaining 4 bytes due to the 1 byte flag + 3 bytes padding in CCoinsCacheEntry can be removed by moving the flags into Coin, stealing another 2 bits from nHeight.

Finally, the resulting data structure is 20 bytes smaller:

76 CCoinsMap::value_type
    36 COutPoint
        32 uint256
         4 uint32_t
    40 CCoinsCacheEntry
        40 Coin
            36 CTxOut
                 8 nValue
                28 CScript
            4 fCoinBase & nHeight & flags (dirty & fresh)

So we can store about 26% more data into dbcache's memory. I have evalued this on my Intel i7-8700, 32GB RAM, external SSD, for -reindex-chainstate with both -dbcache=500 and -dbcache=5000:

out2

    time max resident set size
master -dbcache=500 05:57:20 2957532
2019-09-more-compact-Coin -dbcache=500 05:33:37 2919312
Improvement   6,64% 1,29%
    time max resident set size
master -dbcache=5000 04:22:42 7072612
2019-09-more-compact-Coin -dbcache=5000 04:09:16 6533132
Improvement   5,11% 7,63%

So on my machine there is definitive an improvement, but honestly I am not sure if this ~6% improvement is worth the change. Maybe the improvement is bigger with slower disk, as ~26% more transaction can be cached.

@DrahtBot
Copy link
Contributor

DrahtBot commented Oct 5, 2019

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #18113 ([coins] Don't allow a coin to spent and FRESH. Improve commenting. by jnewbery)
  • #18087 (Get rid of VARINT default argument by sipa)
  • #18000 ([WIP] Coin Statistics Index by fjahr)
  • #17708 (prevector: avoid misaligned member accesses by ajtowns)
  • #17487 (coins: allow write to disk without cache drop by jamesob)
  • #9384 (CCoinsViewCache code cleanup & deduplication by ryanofsky)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@Sjors
Copy link
Member

Sjors commented Oct 6, 2019

Would be nice to test this on a Raspberry-like device. They tend to have low memory and slow storage. Unfortunately they generally also require pruning, and the way pruning currently works we flush the coin cache.

For example I have an Orange Pi chipping away at IBD this week; although it has 2 GB of RAM, because I'm pruning it to 5 GB, it never uses more than 230 MB.

@maflcko
Copy link
Member

maflcko commented Oct 7, 2019

For example I have an Orange Pi chipping away at IBD this week; although it has 2 GB of RAM, because I'm pruning it to 5 GB, it never uses more than 230 MB.

A 512 GB micro SD card should do the job, no?

@fanquake
Copy link
Member

fanquake commented Oct 8, 2019

@TheBlueMatt You might be interested here given you commented in #16970.

@laanwj
Copy link
Member

laanwj commented Oct 9, 2019

A 512 GB micro SD card should do the job, no?

Yep, works fine.

@fanquake
Copy link
Member

fanquake commented Oct 9, 2019

Please lets keep the discussion high level before pointing out any typos.

Copy link
Contributor

@promag promag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK. Didn't verified the claim but 26% sounds a good improvement to me.

AFAICT dirty and fresh are almost read and written at the same time, maybe use a 2 bit flag? Or is compiler doing that anyway?

@laanwj
Copy link
Member

laanwj commented Oct 12, 2019

@promag that's what the bitfield syntax unsigned int dirty : 1; does, it makes the flags take up the minimum possible space.
Or maybe I'm misunderstanding what you're trying to accomplish.

@GChuf
Copy link
Contributor

GChuf commented Oct 12, 2019

Would like to test this on my VM using a HDD.
I'm not sure how to benchmark this - I've tried https://github.com/chaincodelabs/bitcoinperf but haven't succeeded. Any info/help would be appreciated.

@maflcko
Copy link
Member

maflcko commented Oct 12, 2019

bitcoinperf is a bit messy to set up. I think it requires docker. Is there any specific issue you ran into? I guess it might be better to report such upstream at https://github.com/chaincodelabs/bitcoinperf to not bloat this thread

@martinus
Copy link
Contributor Author

Would like to test this on my VM using a HDD.
I'm not sure how to benchmark this - I've tried https://github.com/chaincodelabs/bitcoinperf but haven't succeeded. Any info/help would be appreciated.

In my benchmarks I deleted debug.log, ran /usr/bin/time -v ./bitcoind -reindex-chainstate -stopatheight=594000 -printtoconsole=0 -dbcache=500, then used a hand written script to convert the debug.log into gnuplot-readable data, then used gnuplot to create the graphs.

It would be awesome to have some kind of logger in bitcoind that can periodically log statistics into a file in an easily parseable data.

@GChuf
Copy link
Contributor

GChuf commented Oct 12, 2019

There were a couple issues. Could you recommend anything else to test -reindex-chainstate? Should I simply run bitcoind and have linux count time ... ?
Or maybe I'll look around things mentioned in this stackexchange answer.

p.s. thanks for the info @martinus. I'll figure something out with what you did and those tools mentioned in the link.

@promag
Copy link
Contributor

promag commented Oct 12, 2019

@laanwj I mean stuff like this:

        auto oldDirty = coin.dirty;
        auto oldFresh = coin.fresh;

could be

        auto old_cache = coin.cache;

where

    unsigned int cache : 2;   // DIRTY = 1,  FRESH = 2

@GChuf
Copy link
Contributor

GChuf commented Oct 15, 2019

The debug.log from first benchmark (master) got overwritten by the second benchmark, and I didn't make a script to do a graph from debug.log yet anyway, so I'm just gonna post the /usr/bin/time -v results.
The improvement seems remarkable! I tested this on Ubuntu 16 VM running on a Seagate SSHD (also called bybrid) disk. It seems however the process used more CPU - not sure what's going on with those percentages though. Must be a VM bug.

First Header Second Header time max resident set size
2019-09-more-compact-Coin dbcache=500 2:20:09 1673108
Master dbcache=500 3:25:46 1729412
Improvement 31,9% 3,26%

	Command being timed: "./bitcoindmaster -reindex-chainstate -stopatheight=325000 -printtoconsole=0 -dbcache=500"
	User time (seconds): 19760.08
	System time (seconds): 1031.02
	Percent of CPU this job got: 168%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 3:25:46
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1729412
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 2088592
	Voluntary context switches: 38379113
	Involuntary context switches: 2185645
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

	Command being timed: "./bitcoindcoin -reindex-chainstate -stopatheight=325000 -printtoconsole=0 -dbcache=500"
	User time (seconds): 19004.87
	System time (seconds): 598.24
	Percent of CPU this job got: 233%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:20:09
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1673108
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 1737560
	Voluntary context switches: 17203875
	Involuntary context switches: 2926137
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

@martinus
Copy link
Contributor Author

martinus commented Oct 15, 2019

Thanks @GChuf for testing this! This are really nice results.

It seems however the process used more CPU

I don't think so:

  • master had 19760.08 sec user + 1031.02 sec system = 20791.1 sec
  • 2019-09-more-compact-Coin had 19004.87 sec user + 598.24 sec system = 19603,11

So the branch used ~6% less CPU. Note that the percentage 233% is much higher than for master because it used that CPU in a much shorter timeframe.

The percentage is calculated 100*(user + system time) / elapsed. 233 percent means that on average 2.33 cores were running for this job.

@GChuf
Copy link
Contributor

GChuf commented Oct 16, 2019

The percentage is calculated 100*(user + system time) / elapsed. 233 percent means that on average 2.33 cores were running for this job.

Thanks for the explaination! The math works :)
If more testing is needed I'd be glad to help, but I think HDDs will show similar results to my SSHD and it's clear that the improvements are real.

@jamesob
Copy link
Contributor

jamesob commented Oct 17, 2019

Concept ACK from me, seems like some great savings here. Though it'll be crucial to test this thoroughly across platforms. Will review in depth soon.

@GChuf
Copy link
Contributor

GChuf commented Oct 17, 2019

I can also test it on windows. Does anyone have the windows executable for this already, to make my life easier? @martinus maybe?

@sipa
Copy link
Member

sipa commented Oct 17, 2019

These are pretty impressive gains, so concept ACK.

Some overall comments:

  • Since this is making the boundary between Coin and CCoinsCacheEntry a bit less clear, maybe it would be useful to either have a Coin::AssignWithoutFlags function, or even make Coin::operator= not touch the flags. That would avoid all the oldX = ...; ...; ... = oldX code, and make review easier.
  • The coding style guidelines for variables say to use snake_case in new code.

I do want to review the prevector changes in more detail.

@martinus
Copy link
Contributor Author

I can also test it on windows. Does anyone have the windows executable for this already, to make my life easier? @martinus maybe?

Sorry, I build only in Linux, never tried to build in Windows

Since this is making the boundary between Coin and CCoinsCacheEntry a bit less clear, maybe it would be useful to either have a Coin::AssignWithoutFlags function, or even make Coin::operator= not touch the flags. That would avoid all the oldX = ...; ...; ... = oldX code, and make review easier.

I'm also unsure if it was everywhere necessary to backup and then restore the previous flags. Moving the flags into Coin only saves us 4bytes. I could also remove that change and do this in a separate PR.

@maflcko
Copy link
Member

maflcko commented Oct 17, 2019

@DrahtBot might have a windows build ready tomorrow

@dongcarl
Copy link
Contributor

I have Gitian-built windows binaries of 03cb535384eb6d2f4284524c33ab03371bd263e2 here for those who are keen (@GChuf): https://send.firefox.com/download/c78ac103a4dabe86/#S70uxKCUua-baEeR0v029g

@elichai
Copy link
Contributor

elichai commented Oct 29, 2019

@jamesob twice the time or twice the percentage?
Because I can confirm that my own benchmarks of master from ~2-3 weeks ago showed SipHashUint256Extra very very high on the list (spent a while re-implementing it with SSE instructions just to see it's actually slower with them lol)

(i'm usually using the following flags: perf report -g 'graph,0.5,caller' that way it sorts the percentages by caller instead of callee, though you probably know more about perf than I :) )

@martinus
Copy link
Contributor Author

@jamesob did your master build already contain #16957?

@JeremyRubin
Copy link
Contributor

@martinus yes it did -- are you not rebased?

Worth pointing out that a prevector<28> on master fits perfectly in a 32 byte or 64 byte cache line.

No longer after this change (which is fine, we can increase to prevector<31> where sensitive to it, like std::vector<prevector<>>.

Also, now that CAmounts are unaligned, they too can be extra slow to do something with.

I doubt that's what's going on here, but worth looking at before merge.

@martinus
Copy link
Contributor Author

@martinus yes it did -- are you not rebased?

I didn't saw he linked to the master revision, so both master and my branch has #16957 so it's not that.

I suspect that there is something strange going on with the threading code. Maybe some timing issue causes lots of caching misses for some reason? I don't know much about how the checkqueue and surrounding classes work though.

@martinus
Copy link
Contributor Author

I think I might have been able to reproduce this issue. I've tried ./bitcoind -datadir=/run/media/martinus/tera/bitcoin/db -reindex-chainstate -stopatheight=150000 -printtoconsole=0 -connect=0 -dbcache=15000 on master, and on an older version, before #16957. The old version used ~20 seconds of user time, master used 332 seconds. I've done git bisect and found this revision as the first bad commit: fa3a733

I don't know why, but this commit seems to have a dramatic effect on performance.

@sipa
Copy link
Member

sipa commented Oct 29, 2019

@martinus Are you aware that blocks before the assumevalid point don't get signature/script validated?

@maflcko
Copy link
Member

maflcko commented Oct 30, 2019

For benchmarking purposes, it might be best to set all nodes compiled from different branches to -noassumevalid or the same block hash.

@martinus
Copy link
Contributor Author

martinus commented Oct 30, 2019

@martinus Are you aware that blocks before the assumevalid point don't get signature/script validated?

No, I didn't know. Also it seems that since I didn't yet have this block, -reindex-chainstate signature/script validates everything

It might be helpful to have a list of assumevalid blocks instead of just a single block, to prevent revalidating everything when that specific marker block is not yet there

This change reduces CCoinMap's value_type from 96 bytes to 80 bytes by
more tightly packing it's data. This is achieved by these changes:

* Refactored prevector so it uses a single byte to determine its size
  when in direct mode
* Reduce CScriptBase from 28 to 27 indirect bytes
* Introduced PackableCAmount to be able to align CTxOut to 4 bytes to
  prevent padding when used as a member in Coin

This tighter packing means more data can be stored in the coinsCache
before it is full and has to be flushed to disk. In my benchmark,
-reindex-chainstate was 6% faster and used 6% less memory. The cache
could fit 14% more txo's before it had to resize.
Removed CCoinsCacheEntry's 1 byte for the flags and put it directly into
Coin. That way we can get rid of unnecessary padding, which reduces the
memory requirement for the coinsCache. We steal 2 bits from Coin's
nHeight, so now there are only 29 bits left. Still, we are save until
block 2^29-1 = 536870911.
This requires the operator=(CAmount) which I've implemented here as well. Using static_cast for nicer formatting.

Note that it would be cleaner to make the user defined conversion`operator CAmount()` explicit too, but that would mean that practically everywhere it is used we need to add an explicit cast.
See "C.164: Avoid implicit conversion operators" http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c164-avoid-implicit-conversion-operators
@martinus martinus force-pushed the 2019-09-more-compact-Coin branch from 03cb535 to 22d1d94 Compare November 8, 2019 15:59
@jamesob
Copy link
Contributor

jamesob commented Nov 12, 2019

Did some benches over the weekend (bench/master.1 vs. bench/compactcoin.1) with the same setup/reindex-to-550,000 as mentioned in the allocator PR (#16801 (comment)):

host         tag                      time       time% maxmem  cpu%  dbcache
bench-ssd-2  master.1                 9:13:54    1.00  5770.10MB 344%  4000MB
bench-ssd-2  master.1                 9:15:57    1.00  5773.23MB 343%  4000MB
bench-ssd-2  compactcoin.1            9:07:01    0.98  5899.80MB 347%  4000MB
bench-ssd-2  compactcoin.1            9:07:41    0.99  5923.72MB 346%  4000MB

bench-ssd-3  compactcoin.1            9:06:32    0.98  5959.14MB 347%  4000MB
bench-ssd-3  compactcoin.1            9:07:35    0.98  6013.83MB 346%  4000MB
bench-ssd-3  master.1                 9:17:00    1.00  5771.56MB 342%  4000MB
bench-ssd-3  master.1                 9:17:24    1.00  5773.17MB 342%  4000MB

bench-ssd-4  compactcoin.1            9:08:11    0.98  5936.75MB 346%  4000MB
bench-ssd-4  compactcoin.1            9:07:46    0.98  5916.50MB 346%  4000MB
bench-ssd-4  master.1                 9:13:56    0.99  5770.13MB 344%  4000MB
bench-ssd-4  master.1                 9:17:07    1.00  5760.78MB 342%  4000MB

bench-ssd-5  compactcoin.1            9:03:06    0.98  5943.29MB 349%  4000MB
bench-ssd-5  compactcoin.1            9:04:24    0.98  5975.77MB 348%  4000MB
bench-ssd-5  master.1                 9:11:54    1.00  5769.57MB 346%  4000MB
bench-ssd-5  master.1                 9:13:56    1.00  5775.04MB 344%  4000MB

bench-strong compactcoin.1            14:59:15   0.99  5710.87MB 546%  4000MB
bench-strong compactcoin.1            14:59:00   0.99  5701.12MB 548%  4000MB
bench-strong master.1                 15:06:04   1.00  5720.08MB 544%  4000MB
bench-strong master.1                 15:04:26   1.00  5745.98MB 545%  4000MB

See modest 1-2% improvement in runtime and (oddly) on the SSD machines I'm seeing significantly higher memory usage (usually ~200MB) for this branch.

@ajtowns
Copy link
Contributor

ajtowns commented Dec 10, 2019

If CScript needs to be 8 byte aligned (because it has a pointer), then I think this becomes:

  • 88 CCoinsMap::value_type
    • 36 COutPoint
      • 32 uint256
      • 4 uint32_t
    • 4 PADDING
    • 48 CCoinsCacheEntry
      • 48 Coin
        • 40 CTxOut
          • 8 nValue
          • 32 CScript
        • 4 nHeight
        • 1 fCoinBase & flags (dirty & fresh)
        • 3 PADDING

and there's no need for PackableCAmount, and nHeight doesn't need bit fields. 96 byte to 88 is only an 9% improvement though.

(EDIT: with 32-bit pointers, CAmount alignment becomes your blocker, but not sure that optimising that heavily for 32-bit systems makes sense?)

@DrahtBot
Copy link
Contributor

Needs rebase

@martinus
Copy link
Contributor Author

I'll close this issue, as @ajtowns noted, this PR has illegally aligned the CScript to 4 bytes to get it to 28 bytes.

@martinus martinus closed this Feb 18, 2020
martinus added a commit to martinus/bitcoin that referenced this pull request Aug 27, 2021
This PR is similar to bitcoin#17060, but without the incorrect alignment, and
with safeguards. CCoinsCacheEntry makes use of Coin's padding to store
the flag.

When using the padding of the member, great care needs to be taken that the
padding is not accidentally overwritten, e.g. when assigning/moving a coin.

This PR solves this potential error source by chainging CCoinsCacheEntry
into a class with private members and only beeing able to modify the
coin with a call to MutableCoin(). That method safely remembers and then
restores the flags, no matter what happens to the coin member.

The original struct layout was like this:

    96 CCoinsMap::value_type
        36 COutPoint
            32 uint256
            4 uint32_t
        4 >>> PADDING <<<
        56 CCoinsCacheEntry
            48 Coin
                40 CTxOut
                    8 nValue
                    32 CScript
                4 fCoinBase & nHeight
                4 >>> PADDING <<<
            1 flags (dirty & fresh)
            7 >>> PADDING <<<

After that PR the struct layout looks like this:

    88 CCoinsMap::value_type
        36 COutPoint
            32 uint256
            4 uint32_t
        4 >>> PADDING <<<
        48 CCoinsCacheEntry
            48 Coin
                40 CTxOut
                    8 nValue
                    32 CScript
                4 fCoinBase & nHeight
                4 padding: stores flags (dirty & fresh)

That change translates to about ~8% less memory usage of the CCoinsMap.
martinus added a commit to martinus/bitcoin that referenced this pull request Aug 27, 2021
This PR is similar to bitcoin#17060, but without the incorrect alignment, and
with safeguards. CCoinsCacheEntry makes use of Coin's padding to store
the flag.

When using the padding of the member, great care needs to be taken that the
padding is not accidentally overwritten, e.g. when assigning/moving a coin.

This PR solves this potential error source by chainging CCoinsCacheEntry
into a class with private members and only beeing able to modify the
coin with a call to MutableCoin(). That method safely remembers and then
restores the flags, no matter what happens to the coin member.

The original struct layout was like this:

    96 CCoinsMap::value_type
        36 COutPoint
            32 uint256
            4 uint32_t
        4 >>> PADDING <<<
        56 CCoinsCacheEntry
            48 Coin
                40 CTxOut
                    8 nValue
                    32 CScript
                4 fCoinBase & nHeight
                4 >>> PADDING <<<
            1 flags (dirty & fresh)
            7 >>> PADDING <<<

After that PR the struct layout looks like this:

    88 CCoinsMap::value_type
        36 COutPoint
            32 uint256
            4 uint32_t
        4 >>> PADDING <<<
        48 CCoinsCacheEntry
            48 Coin
                40 CTxOut
                    8 nValue
                    32 CScript
                4 fCoinBase & nHeight
                4 padding: stores flags (dirty & fresh)

That change translates to about ~8% less memory usage of the CCoinsMap.
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Feb 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.