Skip to content

Conversation

@kallewoof
Copy link
Contributor

@kallewoof kallewoof commented Jan 7, 2020

The blk files have blocks written to them as they come in. Each blk file and rev file reference the same blocks, but the order in which block data is written into the rev (undo) files is in order of activation, i.e. in block height order.

Thus, when the code decides to stop writing into blk000X.dat, it will flush this and rev000X.dat, and then move to X+1. The problem is that since blocks are activated later than they are stored in the block file, we will still be writing to rev X for awhile after this.

This code fixes this by simply finalizing the previous undo file each time we finish up a block file. This pretty much guarantees that we will never pre-allocate space and never flush for an undo file (which is happening right now).

Unfortunately for anyone with affected machines, there will be wasted drive space for each rev file if they did IBD before this fix. Probably affects all architectures.

Edit: note that this PR initially added a new critical section for undo pos and some other stuff, but I scratched that for this simpler solution, hence practicalswift's comment below.

Fixes #17890.

@DrahtBot
Copy link
Contributor

DrahtBot commented Jan 7, 2020

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #17994 (validation: flush undo files after last block write by kallewoof)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@practicalswift
Copy link
Contributor

Thread-safety annotations (GUARDED_BY etc.) seems to be missing? :)

@kallewoof kallewoof force-pushed the 200108-reflush-undo branch from 12e6f04 to 0508072 Compare January 8, 2020 03:41
@kallewoof
Copy link
Contributor Author

@practicalswift Thanks for the nudge. My apologies, though... I ended up scratching this solution for a much simpler one (see updated code).

@kallewoof kallewoof changed the title bug-fix: re-flush undo files after they have been re-opened bug-fix: delay flushing undo files until after they are finalized Jan 8, 2020
@kallewoof
Copy link
Contributor Author

kallewoof commented Jan 8, 2020

Some stats: after syncing my mac up to ~ block 186k, the blocks dir looks like this:

$ du -ch *.dat
128M	blk00000.dat
128M	blk00001.dat
128M	blk00002.dat
128M	blk00003.dat
128M	blk00004.dat
128M	blk00005.dat
128M	blk00006.dat
128M	blk00007.dat
128M	blk00008.dat
128M	blk00009.dat
128M	blk00010.dat
128M	blk00011.dat
128M	blk00012.dat
128M	blk00013.dat
144M	blk00014.dat
 19M	rev00000.dat
 16M	rev00001.dat
 16M	rev00002.dat
 16M	rev00003.dat
 16M	rev00004.dat
 17M	rev00005.dat
 17M	rev00006.dat
 17M	rev00007.dat
 18M	rev00008.dat
 17M	rev00009.dat
 17M	rev00010.dat
 17M	rev00011.dat
 18M	rev00012.dat
171M	rev00013.dat
 21M	rev00014.dat
2.3G	total

The fact the blk for 14 and rev files for 13-14 are bloated is due to a separate bug (see #17887).

Note that rev 13 is still huge despite us moving on to rev 14. When blk 14 is finalized, it will truncate rev 13 down to its right size. It may be worthwhile to actually finalize current and previous twice, to avoid this, but since we are more or less guaranteed to again pre-allocate into X-1, this seems unnecessary, and #17887 will make this have significantly less impact.

@maaku
Copy link
Contributor

maaku commented Jan 8, 2020

I can confirm the bug. I've run into it multiple times. Kalle is convinced that we are at fault, although I'm not so sure, and I think this might even be an Apple bug. It only shows up on APFS, and the implementation of rsync which ships with macOS also suffers from the same issue.

But in any case, if we can fix it on our end we should, and this should be pretty high priority. It can end up wasting hundreds of gigabytes of space during an IBD, in some cases exhausting the file on 512GB drives in a way that is very non-intuitive to figure out and fix as a user.

@eriknylund
Copy link
Contributor

ACK 0508072e1a45c339a181d24c2edfc88e3cd5201f built locally. All tests pass. I see the blow-up happening and the files being truncated as the following blk file is opened and the previous is finalized.

Here rev00004.dat is still not truncated since we are on blk00005.dat

128M	blk00000.dat
128M	blk00001.dat
128M	blk00002.dat
128M	blk00003.dat
128M	blk00004.dat
576M	blk00005.dat
 19M	rev00000.dat
 16M	rev00001.dat
 16M	rev00002.dat
 16M	rev00003.dat
153M	rev00004.dat
136M	rev00005.dat

In the next step blk00005.dat has been finalized and rev00004.dat is truncated.

128M	blk00000.dat
128M	blk00001.dat
128M	blk00002.dat
128M	blk00003.dat
128M	blk00004.dat
128M	blk00005.dat
 16M	blk00006.dat
 19M	rev00000.dat
 16M	rev00001.dat
 16M	rev00002.dat
 16M	rev00003.dat
 16M	rev00004.dat
153M	rev00005.dat
1.0M	rev00006.dat

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int last_undo_file = fFinalize && nLastBlockFile > 0 ? nLastBlockFile - 1 : nLastBlockFile;
int previous_undo_file = fFinalize && nLastBlockFile > 0 ? nLastBlockFile - 1 : nLastBlockFile;

Given the comment on line 1738, is previous more appropriate than last?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

@kallewoof kallewoof force-pushed the 200108-reflush-undo branch 2 times, most recently from 5200be2 to 7613c50 Compare January 21, 2020 07:08
@fjahr
Copy link
Contributor

fjahr commented Jan 21, 2020

Similar to here #17887 (comment) I am not able to reproduce this issue on current master (no matter if --enable-debug or not). Happy to take hints and retest as I am curious why my results are different from everyone else.

@kallewoof
Copy link
Contributor Author

@fjahr #17887 (comment) seems to indicate you were finally able to reproduce the issue. Or is that comment older?

@kallewoof kallewoof force-pushed the 200108-reflush-undo branch from 7613c50 to da439d4 Compare January 22, 2020 09:28
@fjahr
Copy link
Contributor

fjahr commented Jan 22, 2020

@fjahr #17887 (comment) seems to indicate you were finally able to reproduce the issue. Or is that comment older?

Yes, sorry, this is not worded clearly: I could reproduce #17887 with rev files up 64M large if compiled without --enable-debug but I could not reproduce it with --enable-debug. This one I can not reproduce no matter if I set debugging or not.

@kallewoof
Copy link
Contributor Author

@fjahr You can not detect the following pattern anywhere in your log file even after syncing for awhile?

(date) Leaving block file X: CBlockFileInfo(...)
(date) Pre-allocating up to position 0x1000000 in blk(X+1).dat
(date) Pre-allocating up to position 0x100000 in rev(X+1).dat
[...]
(date) Pre-allocating up to position (...) in rev(X).dat

@fjahr
Copy link
Contributor

fjahr commented Jan 22, 2020

@kallewoof I do see it, but not regularly. From my last sync up to 190k it seems to appear 3 times: For X=7, X=10 and X=14. I was trying hard to catch a larger than normal file size with du -ch *.dat, so I missed the right block/rev file transition?

@kallewoof
Copy link
Contributor Author

There's some amount of randomness involved, as the order in which blocks are received and the order in which they are connected is different.

With #17887 merged, it will be very hard to physically spot these now, but it should occasionally show larger-than-necessary rev files (in the range of 1 b up to 1 MB).

When creating blk and rev files, these are pre-allocated for performance reasons, and then truncated down to their final size after completion.
This works for blk files which are sequentially created, but since blocks are stored out of order, and since the undo data for blocks are stored sequentially (upon activation of the block), the rev file will be written to for awhile even after the blk file has been finalized. This change switches the flushing part to flush the *previous* undo file for the finalize (truncate) case, which addresses the problem, as we never write to the rev file 2 steps back.
@kallewoof kallewoof force-pushed the 200108-reflush-undo branch from da439d4 to b49b341 Compare January 22, 2020 16:11
@Sjors
Copy link
Member

Sjors commented Jan 22, 2020

Given that #17887 is merged, is it still worth the risk of back-porting? (can always do that later, after 0.20 is released)

// for awhile even after we've moved onto the next blk file; if we don't do this, we
// end up with a re-opened but unflushed rev file, resulting in pre-allocated but not
// returned disk space allocation
int previous_undo_file = fFinalize && nLastBlockFile > 0 ? nLastBlockFile - 1 : nLastBlockFile;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add brackets to clarify (correct) precedence: (nLastBlockFile > 0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly don't think that's necessary, but I'm open to change my opinion.

@kallewoof
Copy link
Contributor Author

@Sjors

Given that #17887 is merged, is it still worth the risk of back-porting? (can always do that later, after 0.20 is released)

There are nearly two thousand rev files right now. Someone doing an IBD sync using 0.19 today would end up wasting up to 2 GB of space due to this bug. I do believe we want to plug that hole, even though it has become much smaller of an issue as you mention.

@maaku
Copy link
Contributor

maaku commented Jan 23, 2020

Last time I did an IBD (a few months ago) it was >100GB wasted. That’s actually how I discovered this problem initially, it was enough wasted space to impact me (full disk).

@kallewoof
Copy link
Contributor Author

@maaku

Last time I did an IBD (a few months ago) it was >100GB wasted. That’s actually how I discovered this problem initially, it was enough wasted space to impact me (full disk).

Yeah, that was partially fixed in #17887, so you shouldn't see quite that drastic of an effect anymore, but still..

@maaku
Copy link
Contributor

maaku commented Jan 23, 2020

And this discussion is about backporting those changes, no?

@kallewoof
Copy link
Contributor Author

No, #17887 is already in the backported list and is not under discussion. @Sjors is wondering whether this patch, which delays finalize-flushing the rev files until we're (pretty) sure they are no longer appended to, should be backported or not.

Without this patch, we will experience 0-1 MB of wasted space for each rev file.

Without #17887 we would experience what you saw.

@fjahr
Copy link
Contributor

fjahr commented Jan 23, 2020

I have now done another run with this fix and could not detect the pattern described in #17892 (comment) in my debug.log

ACK b49b341

@maflcko
Copy link
Member

maflcko commented Jan 23, 2020

Without this patch, we will experience 0-1 MB of wasted space for each rev file.

It seems like this does not warrant a backport to 0.19.1

@kallewoof
Copy link
Contributor Author

kallewoof commented Jan 23, 2020

@MarcoFalke

Depends if you think "a few GBs wasted" is negligible or not. (This is for non-pruned nodes only, so a few GBs in a 250 GB datadir.)

@maflcko
Copy link
Member

maflcko commented Jan 23, 2020

Yes, based on your comments this should be less than 1% off (not noticeable for a user)

@anton48
Copy link

anton48 commented Jan 23, 2020

@MarcoFalke

Depends if you think "a few GBs wasted" is negligible or not. (This is for non-pruned nodes only, so a few GBs in a 250 GB datadir.)

I did a quick test with downloading from scratch up to year 2014 (approx. 270k blocks):

master (278292 blocks, 1931064 bytes, 10 files with size 19M)

2020-01-23T11:01:43Z Leaving block file 104: CBlockFileInfo(blocks=871, size=134096120, heights=276943...278292, time=2013-12-25...2014-01-02)

Antons-Mac-mini:HDD2 a48$ ls -la Bitcoin.nopatch/blocks/rev00* | wc -l
     105

Antons-Mac-mini:HDD2 a48$ du -ck Bitcoin.nopatch/blocks/rev00* | grep total
1931064	total

Antons-Mac-mini:HDD2 a48$ du -ch Bitcoin.nopatch/blocks/rev00* | grep 19M
 19M	Bitcoin.nopatch/blocks/rev00000.dat
 19M	Bitcoin.nopatch/blocks/rev00013.dat
 19M	Bitcoin.nopatch/blocks/rev00073.dat
 19M	Bitcoin.nopatch/blocks/rev00075.dat
 19M	Bitcoin.nopatch/blocks/rev00081.dat
 19M	Bitcoin.nopatch/blocks/rev00083.dat
 19M	Bitcoin.nopatch/blocks/rev00086.dat
 19M	Bitcoin.nopatch/blocks/rev00087.dat
 19M	Bitcoin.nopatch/blocks/rev00094.dat
 19M	Bitcoin.nopatch/blocks/rev00101.dat
Antons-Mac-mini:HDD2 a48$ du -ch Bitcoin.nopatch/blocks/rev00* | grep 19M | wc -l
      10
Antons-Mac-mini:HDD2 a48$ du -ch Bitcoin.nopatch/blocks/rev00* | grep 18M | wc -l
      80

this PR (278293 blocks, 1865004 bytes, no files with size 19M, all of them are 17M or 18M)

2020-01-23T12:48:45Z Leaving block file 104: CBlockFileInfo(blocks=865, size=134045380, heights=277411...278293, time=2013-12-28...2014-01-02)

Antons-Mac-mini:HDD2 a48$ ls -la Bitcoin.patch/blocks/rev00* | wc -l
     105

Antons-Mac-mini:HDD2 a48$ du -ck Bitcoin.patch/blocks/rev00* | grep total
1865004	total

Antons-Mac-mini:HDD2 a48$ du -ch Bitcoin.patch/blocks/rev00* | grep 19M
Antons-Mac-mini:HDD2 a48$ du -ch Bitcoin.patch/blocks/rev00* | grep 19M | wc -l
       0
Antons-Mac-mini:HDD2 a48$ du -ch Bitcoin.patch/blocks/rev00* | grep 18M | wc -l
      38

so, it looks for me that there could be more than 1G of overhead on a fullnode with all blocks (will test it, but it takes time).

@kallewoof
Copy link
Contributor Author

@anton48 Thanks for the initiative. I'll do the same thing on my side so we can get some clarity on the matter.

@MarcoFalke I agree with you, but with the caveat that it's temporary. It's ludicrous to waste several GB of space on people's machines for no good reason. My proposal is that we (1) make a PR that does a one time sweep (esp important for people with lots of disk space who are stuck with pre-#17887 datadirs, if such people exist at all) + implements this code*, and to potentially backport those two.

(* or possibly more sophisticated; ideal implementation is to, for each blk file, track which block is the highest height, and once that has been connected, to finalize the corresponding rev file)

@Sjors
Copy link
Member

Sjors commented Jan 23, 2020

I agree with you, but with the caveat that it's temporary. It's ludicrous to waste several GB of space on people's machines for no good reason.

The waste would be temporary; until the user installs 0.20 or 0.19.2. Worst case scenario is block data corruption. Because I'm personally not familiar with this area of the code, I prefer to let it stew on master for a while before shipping. But if others feel very comfortable that this code is correct then I'm fine with backporting.

@eriknylund
Copy link
Contributor

eriknylund commented Jan 23, 2020

@anton48 Thanks for the initiative. I'll do the same thing on my side so we can get some clarity on the matter.

@MarcoFalke I agree with you, but with the caveat that it's temporary. It's ludicrous to waste several GB of space on people's machines for no good reason. My proposal is that we (1) make a PR that does a one time sweep (esp important for people with lots of disk space who are stuck with pre-#17887 datadirs, if such people exist at all) + implements this code*, and to potentially backport those two.

(* or possibly more sophisticated; ideal implementation is to, for each blk file, track which block is the highest height, and once that has been connected, to finalize the corresponding rev file)

@kallewoof Given it has not been getting attention since the release of APFS in September, 2017 it seems not that many installations exist. Is it worth adding this extra complexity for little or no users? Do we know how many are potentially running a version that has the bug? The workarounds are not optimal either of course, but at least there is one are two relatively easy ones:

  1. Do the IBD from scratch using the latest version
    2. Run a truncate bash script (Blocks directory on macOS uses more disk space than expected #17827 (comment))

@eriknylund
Copy link
Contributor

@kallewoof @anton48 I can also repeat the full IBD test running Bitcoin-Qt on the Mac Mini like I did for #17887, it will take about 3 days to complete.

@kallewoof
Copy link
Contributor Author

@eriknylund Unfortunately that bash script only addresses #17887, not #17892.

@Sjors It's not technically temporary until someone writes code that does a one-time sweep of people's rev files and backport this & that, but since we only see ~1 rev file a day, people's drives shouldn't suffer more than a few hundred MBs before a patch is in place, hopefully.

@kallewoof
Copy link
Contributor Author

I am working on an improved version of this, based on #17892 (comment)

@eriknylund
Copy link
Contributor

@eriknylund Unfortunately that bash script only addresses #17887, not #17892.

@Sjors It's not technically temporary until someone writes code that does a one-time sweep of people's rev files and backport this & that, but since we only see ~1 rev file a day, people's drives shouldn't suffer more than a few hundred MBs before a patch is in place, hopefully.

@kallewoof Ah you're absolutely correct. Thanks for pointing that out.

@anton48
Copy link

anton48 commented Jan 24, 2020

@eriknylund Unfortunately that bash script only addresses #17887, not #17892.
@Sjors It's not technically temporary until someone writes code that does a one-time sweep of people's rev files and backport this & that, but since we only see ~1 rev file a day, people's drives shouldn't suffer more than a few hundred MBs before a patch is in place, hopefully.

@kallewoof Ah you're absolutely correct. Thanks for pointing that out.

however, in my case the script made some truncations at yesterday test directory (which is on AFPS volume):

Antons-Mac-mini:blocks a48$ /Users/a48/slim.sh
./rev00000.dat 19M  ->  19M
[..]
./rev00013.dat 19M  ->  18M
[..]
./rev00047.dat 18M  ->  17M
[..]
./rev00053.dat 18M  ->  17M
[..]
./rev00063.dat 18M  ->  17M
./rev00064.dat 18M  ->  17M
[..]
./rev00070.dat 18M  ->  17M
[..]
./rev00073.dat 19M  ->  18M
[..]
./rev00075.dat 19M  ->  18M
[..]
./rev00083.dat 19M  ->  18M
[..]
./rev00086.dat 19M  ->  18M
./rev00087.dat 19M  ->  18M
[..]
./rev00101.dat 19M  ->  18M
[..]
./rev00104.dat 18M  ->  18M

@kallewoof
Copy link
Contributor Author

I did some number crunching and concluded that a backport is probably not necessary. I am also closing this PR in favor of #17994. Will re-open if people object/prefer this one.

@kallewoof kallewoof closed this Jan 26, 2020
@kallewoof kallewoof deleted the 200108-reflush-undo branch January 26, 2020 11:43
laanwj added a commit that referenced this pull request Jun 4, 2020
ac94141 validation: delay flushing undo files in syncing node case (Karl-Johan Alm)

Pull request description:

  Fixes #17890. Replaces #17892.

  Data files (`{blk|rev}<number>.dat`) pre-allocate space as they are written, and then trims down to the final size once they move on to the next sequence ("finalized flush"). The code currently assumes (incorrectly) that blk and rev files finish at the same time, but because blk files are written as blocks come in, and rev files are written in block height order, rev files end up being written to for awhile after moving on to the next block file, resulting in pre-allocation and waste of up to 1 MB of space per rev file.

  The exact point at which rev file writing finishes is the highest height block found inside the corresponding block file, which is already available in the CBlockFileInfo vector. This PR moves finalized flushing of undo files to to directly after the undo data for the previous block file has been written.

  There is a branch with annotation that demonstrates how this is handling flushing here: https://github.com/kallewoof/bitcoin/tree/200124-rev-files-annotated

ACKs for top commit:
  vasild:
    ACK ac94141 (no changes in the code since ed34e00da).
  fjahr:
    Code review re-ACK ac94141
  jonatack:
    Code review ACK ac94141

Tree-SHA512: 1d4e3b3d1d99bd7ebe7a2f632b1231146dd4f9f993c54db3a4090d9c086d95d2e4c327fd936066392b3afc6277b8f3a908d5c5993d4c8e49f72b92a417716dd2
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Jun 4, 2020
ac94141 validation: delay flushing undo files in syncing node case (Karl-Johan Alm)

Pull request description:

  Fixes bitcoin#17890. Replaces bitcoin#17892.

  Data files (`{blk|rev}<number>.dat`) pre-allocate space as they are written, and then trims down to the final size once they move on to the next sequence ("finalized flush"). The code currently assumes (incorrectly) that blk and rev files finish at the same time, but because blk files are written as blocks come in, and rev files are written in block height order, rev files end up being written to for awhile after moving on to the next block file, resulting in pre-allocation and waste of up to 1 MB of space per rev file.

  The exact point at which rev file writing finishes is the highest height block found inside the corresponding block file, which is already available in the CBlockFileInfo vector. This PR moves finalized flushing of undo files to to directly after the undo data for the previous block file has been written.

  There is a branch with annotation that demonstrates how this is handling flushing here: https://github.com/kallewoof/bitcoin/tree/200124-rev-files-annotated

ACKs for top commit:
  vasild:
    ACK ac94141 (no changes in the code since ed34e00da).
  fjahr:
    Code review re-ACK ac94141
  jonatack:
    Code review ACK ac94141

Tree-SHA512: 1d4e3b3d1d99bd7ebe7a2f632b1231146dd4f9f993c54db3a4090d9c086d95d2e4c327fd936066392b3afc6277b8f3a908d5c5993d4c8e49f72b92a417716dd2
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Sep 19, 2021
ac94141 validation: delay flushing undo files in syncing node case (Karl-Johan Alm)

Pull request description:

  Fixes bitcoin#17890. Replaces bitcoin#17892.

  Data files (`{blk|rev}<number>.dat`) pre-allocate space as they are written, and then trims down to the final size once they move on to the next sequence ("finalized flush"). The code currently assumes (incorrectly) that blk and rev files finish at the same time, but because blk files are written as blocks come in, and rev files are written in block height order, rev files end up being written to for awhile after moving on to the next block file, resulting in pre-allocation and waste of up to 1 MB of space per rev file.

  The exact point at which rev file writing finishes is the highest height block found inside the corresponding block file, which is already available in the CBlockFileInfo vector. This PR moves finalized flushing of undo files to to directly after the undo data for the previous block file has been written.

  There is a branch with annotation that demonstrates how this is handling flushing here: https://github.com/kallewoof/bitcoin/tree/200124-rev-files-annotated

ACKs for top commit:
  vasild:
    ACK ac94141 (no changes in the code since ed34e00da).
  fjahr:
    Code review re-ACK ac94141
  jonatack:
    Code review ACK ac94141

Tree-SHA512: 1d4e3b3d1d99bd7ebe7a2f632b1231146dd4f9f993c54db3a4090d9c086d95d2e4c327fd936066392b3afc6277b8f3a908d5c5993d4c8e49f72b92a417716dd2
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Feb 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Non-latest rev files are sometimes re-opened but never re-flushed

10 participants