-
Notifications
You must be signed in to change notification settings - Fork 38.8k
bug-fix: delay flushing undo files until after they are finalized #17892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
|
Thread-safety annotations ( |
12e6f04 to
0508072
Compare
|
@practicalswift Thanks for the nudge. My apologies, though... I ended up scratching this solution for a much simpler one (see updated code). |
|
Some stats: after syncing my mac up to ~ block 186k, the blocks dir looks like this: $ du -ch *.dat
128M blk00000.dat
128M blk00001.dat
128M blk00002.dat
128M blk00003.dat
128M blk00004.dat
128M blk00005.dat
128M blk00006.dat
128M blk00007.dat
128M blk00008.dat
128M blk00009.dat
128M blk00010.dat
128M blk00011.dat
128M blk00012.dat
128M blk00013.dat
144M blk00014.dat
19M rev00000.dat
16M rev00001.dat
16M rev00002.dat
16M rev00003.dat
16M rev00004.dat
17M rev00005.dat
17M rev00006.dat
17M rev00007.dat
18M rev00008.dat
17M rev00009.dat
17M rev00010.dat
17M rev00011.dat
18M rev00012.dat
171M rev00013.dat
21M rev00014.dat
2.3G totalThe fact the blk for 14 and rev files for 13-14 are bloated is due to a separate bug (see #17887). Note that rev 13 is still huge despite us moving on to rev 14. When blk 14 is finalized, it will truncate rev 13 down to its right size. It may be worthwhile to actually finalize current and previous twice, to avoid this, but since we are more or less guaranteed to again pre-allocate into X-1, this seems unnecessary, and #17887 will make this have significantly less impact. |
|
I can confirm the bug. I've run into it multiple times. Kalle is convinced that we are at fault, although I'm not so sure, and I think this might even be an Apple bug. It only shows up on APFS, and the implementation of rsync which ships with macOS also suffers from the same issue. But in any case, if we can fix it on our end we should, and this should be pretty high priority. It can end up wasting hundreds of gigabytes of space during an IBD, in some cases exhausting the file on 512GB drives in a way that is very non-intuitive to figure out and fix as a user. |
|
ACK 0508072e1a45c339a181d24c2edfc88e3cd5201f built locally. All tests pass. I see the blow-up happening and the files being truncated as the following blk file is opened and the previous is finalized. Here rev00004.dat is still not truncated since we are on blk00005.dat In the next step blk00005.dat has been finalized and rev00004.dat is truncated. |
src/validation.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| int last_undo_file = fFinalize && nLastBlockFile > 0 ? nLastBlockFile - 1 : nLastBlockFile; | |
| int previous_undo_file = fFinalize && nLastBlockFile > 0 ? nLastBlockFile - 1 : nLastBlockFile; |
Given the comment on line 1738, is previous more appropriate than last?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
5200be2 to
7613c50
Compare
|
Similar to here #17887 (comment) I am not able to reproduce this issue on current master (no matter if |
|
@fjahr #17887 (comment) seems to indicate you were finally able to reproduce the issue. Or is that comment older? |
7613c50 to
da439d4
Compare
Yes, sorry, this is not worded clearly: I could reproduce #17887 with rev files up 64M large if compiled without |
|
@fjahr You can not detect the following pattern anywhere in your log file even after syncing for awhile? |
|
@kallewoof I do see it, but not regularly. From my last sync up to 190k it seems to appear 3 times: For X=7, X=10 and X=14. I was trying hard to catch a larger than normal file size with |
|
There's some amount of randomness involved, as the order in which blocks are received and the order in which they are connected is different. With #17887 merged, it will be very hard to physically spot these now, but it should occasionally show larger-than-necessary rev files (in the range of 1 b up to 1 MB). |
When creating blk and rev files, these are pre-allocated for performance reasons, and then truncated down to their final size after completion. This works for blk files which are sequentially created, but since blocks are stored out of order, and since the undo data for blocks are stored sequentially (upon activation of the block), the rev file will be written to for awhile even after the blk file has been finalized. This change switches the flushing part to flush the *previous* undo file for the finalize (truncate) case, which addresses the problem, as we never write to the rev file 2 steps back.
da439d4 to
b49b341
Compare
|
Given that #17887 is merged, is it still worth the risk of back-porting? (can always do that later, after 0.20 is released) |
| // for awhile even after we've moved onto the next blk file; if we don't do this, we | ||
| // end up with a re-opened but unflushed rev file, resulting in pre-allocated but not | ||
| // returned disk space allocation | ||
| int previous_undo_file = fFinalize && nLastBlockFile > 0 ? nLastBlockFile - 1 : nLastBlockFile; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add brackets to clarify (correct) precedence: (nLastBlockFile > 0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly don't think that's necessary, but I'm open to change my opinion.
There are nearly two thousand rev files right now. Someone doing an IBD sync using 0.19 today would end up wasting up to 2 GB of space due to this bug. I do believe we want to plug that hole, even though it has become much smaller of an issue as you mention. |
|
Last time I did an IBD (a few months ago) it was >100GB wasted. That’s actually how I discovered this problem initially, it was enough wasted space to impact me (full disk). |
|
And this discussion is about backporting those changes, no? |
|
No, #17887 is already in the backported list and is not under discussion. @Sjors is wondering whether this patch, which delays finalize-flushing the rev files until we're (pretty) sure they are no longer appended to, should be backported or not. Without this patch, we will experience 0-1 MB of wasted space for each rev file. Without #17887 we would experience what you saw. |
|
I have now done another run with this fix and could not detect the pattern described in #17892 (comment) in my debug.log ACK b49b341 |
It seems like this does not warrant a backport to 0.19.1 |
|
Depends if you think "a few GBs wasted" is negligible or not. (This is for non-pruned nodes only, so a few GBs in a 250 GB datadir.) |
|
Yes, based on your comments this should be less than 1% off (not noticeable for a user) |
I did a quick test with downloading from scratch up to year 2014 (approx. 270k blocks): master (278292 blocks, 1931064 bytes, 10 files with size 19M) this PR (278293 blocks, 1865004 bytes, no files with size 19M, all of them are 17M or 18M) so, it looks for me that there could be more than 1G of overhead on a fullnode with all blocks (will test it, but it takes time). |
|
@anton48 Thanks for the initiative. I'll do the same thing on my side so we can get some clarity on the matter. @MarcoFalke I agree with you, but with the caveat that it's temporary. It's ludicrous to waste several GB of space on people's machines for no good reason. My proposal is that we (1) make a PR that does a one time sweep (esp important for people with lots of disk space who are stuck with pre-#17887 datadirs, if such people exist at all) + implements this code*, and to potentially backport those two. (* or possibly more sophisticated; ideal implementation is to, for each blk file, track which block is the highest height, and once that has been connected, to finalize the corresponding rev file) |
The waste would be temporary; until the user installs 0.20 or 0.19.2. Worst case scenario is block data corruption. Because I'm personally not familiar with this area of the code, I prefer to let it stew on master for a while before shipping. But if others feel very comfortable that this code is correct then I'm fine with backporting. |
@kallewoof Given it has not been getting attention since the release of APFS in September, 2017 it seems not that many installations exist. Is it worth adding this extra complexity for little or no users? Do we know how many are potentially running a version that has the bug? The workarounds are not optimal either of course, but at least there is one
|
|
@kallewoof @anton48 I can also repeat the full IBD test running Bitcoin-Qt on the Mac Mini like I did for #17887, it will take about 3 days to complete. |
|
@eriknylund Unfortunately that bash script only addresses #17887, not #17892. @Sjors It's not technically temporary until someone writes code that does a one-time sweep of people's rev files and backport this & that, but since we only see ~1 rev file a day, people's drives shouldn't suffer more than a few hundred MBs before a patch is in place, hopefully. |
|
I am working on an improved version of this, based on #17892 (comment) |
@kallewoof Ah you're absolutely correct. Thanks for pointing that out. |
however, in my case the script made some truncations at yesterday test directory (which is on AFPS volume): |
|
I did some number crunching and concluded that a backport is probably not necessary. I am also closing this PR in favor of #17994. Will re-open if people object/prefer this one. |
ac94141 validation: delay flushing undo files in syncing node case (Karl-Johan Alm) Pull request description: Fixes #17890. Replaces #17892. Data files (`{blk|rev}<number>.dat`) pre-allocate space as they are written, and then trims down to the final size once they move on to the next sequence ("finalized flush"). The code currently assumes (incorrectly) that blk and rev files finish at the same time, but because blk files are written as blocks come in, and rev files are written in block height order, rev files end up being written to for awhile after moving on to the next block file, resulting in pre-allocation and waste of up to 1 MB of space per rev file. The exact point at which rev file writing finishes is the highest height block found inside the corresponding block file, which is already available in the CBlockFileInfo vector. This PR moves finalized flushing of undo files to to directly after the undo data for the previous block file has been written. There is a branch with annotation that demonstrates how this is handling flushing here: https://github.com/kallewoof/bitcoin/tree/200124-rev-files-annotated ACKs for top commit: vasild: ACK ac94141 (no changes in the code since ed34e00da). fjahr: Code review re-ACK ac94141 jonatack: Code review ACK ac94141 Tree-SHA512: 1d4e3b3d1d99bd7ebe7a2f632b1231146dd4f9f993c54db3a4090d9c086d95d2e4c327fd936066392b3afc6277b8f3a908d5c5993d4c8e49f72b92a417716dd2
ac94141 validation: delay flushing undo files in syncing node case (Karl-Johan Alm) Pull request description: Fixes bitcoin#17890. Replaces bitcoin#17892. Data files (`{blk|rev}<number>.dat`) pre-allocate space as they are written, and then trims down to the final size once they move on to the next sequence ("finalized flush"). The code currently assumes (incorrectly) that blk and rev files finish at the same time, but because blk files are written as blocks come in, and rev files are written in block height order, rev files end up being written to for awhile after moving on to the next block file, resulting in pre-allocation and waste of up to 1 MB of space per rev file. The exact point at which rev file writing finishes is the highest height block found inside the corresponding block file, which is already available in the CBlockFileInfo vector. This PR moves finalized flushing of undo files to to directly after the undo data for the previous block file has been written. There is a branch with annotation that demonstrates how this is handling flushing here: https://github.com/kallewoof/bitcoin/tree/200124-rev-files-annotated ACKs for top commit: vasild: ACK ac94141 (no changes in the code since ed34e00da). fjahr: Code review re-ACK ac94141 jonatack: Code review ACK ac94141 Tree-SHA512: 1d4e3b3d1d99bd7ebe7a2f632b1231146dd4f9f993c54db3a4090d9c086d95d2e4c327fd936066392b3afc6277b8f3a908d5c5993d4c8e49f72b92a417716dd2
ac94141 validation: delay flushing undo files in syncing node case (Karl-Johan Alm) Pull request description: Fixes bitcoin#17890. Replaces bitcoin#17892. Data files (`{blk|rev}<number>.dat`) pre-allocate space as they are written, and then trims down to the final size once they move on to the next sequence ("finalized flush"). The code currently assumes (incorrectly) that blk and rev files finish at the same time, but because blk files are written as blocks come in, and rev files are written in block height order, rev files end up being written to for awhile after moving on to the next block file, resulting in pre-allocation and waste of up to 1 MB of space per rev file. The exact point at which rev file writing finishes is the highest height block found inside the corresponding block file, which is already available in the CBlockFileInfo vector. This PR moves finalized flushing of undo files to to directly after the undo data for the previous block file has been written. There is a branch with annotation that demonstrates how this is handling flushing here: https://github.com/kallewoof/bitcoin/tree/200124-rev-files-annotated ACKs for top commit: vasild: ACK ac94141 (no changes in the code since ed34e00da). fjahr: Code review re-ACK ac94141 jonatack: Code review ACK ac94141 Tree-SHA512: 1d4e3b3d1d99bd7ebe7a2f632b1231146dd4f9f993c54db3a4090d9c086d95d2e4c327fd936066392b3afc6277b8f3a908d5c5993d4c8e49f72b92a417716dd2
The blk files have blocks written to them as they come in. Each blk file and rev file reference the same blocks, but the order in which block data is written into the rev (undo) files is in order of activation, i.e. in block height order.
Thus, when the code decides to stop writing into blk000X.dat, it will flush this and rev000X.dat, and then move to X+1. The problem is that since blocks are activated later than they are stored in the block file, we will still be writing to rev X for awhile after this.
This code fixes this by simply finalizing the previous undo file each time we finish up a block file. This pretty much guarantees that we will never pre-allocate space and never flush for an undo file (which is happening right now).
Unfortunately for anyone with affected machines, there will be wasted drive space for each rev file if they did IBD before this fix. Probably affects all architectures.
Edit: note that this PR initially added a new critical section for undo pos and some other stuff, but I scratched that for this simpler solution, hence practicalswift's comment below.
Fixes #17890.