rpc: Pruning nodes can not fetch blocks before syncing past their height #23927
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts
Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
Concept ACK. I started testing it by creating a functional test. Not sure if the approach is right: I started two nodes, the second one in prune mode:

    def run_test(self):
        self.log.info("Mine 200 blocks on Node 0")
        self.generate(self.nodes[0], 200, sync_fun=self.no_op)
        assert_equal(self.nodes[0].getblockcount(), 400)

        self.log.info("Connect nodes")
        self.connect_nodes(0, 1)
        peers = self.nodes[1].getpeerinfo()
        assert_equal(len(peers), 1)
        peer_1_peer_0_id = peers[0]["id"]

        best_block_hash_0 = self.nodes[0].getbestblockhash()
        assert_raises_rpc_error(-1, 'In prune mode, only blocks that the node has already synced previously can be fetched from a peer', self.nodes[1].getblockfrompeer, best_block_hash_0, peer_1_peer_0_id)

On the master branch this test would fail, because the exception won't be raised.
Hey @brunoerg, thanks for giving it a try but unfortunately I don't think this approach works reliably. The problem is that the outcome of this test is a race because node 1 is downloading the blocks from node 0 in the background. It may have the current tip in flight by the time the assert is called, or not. We try to avoid tests where the outcome is not 100% reliable because we are seeing intermittent test failures quite often already. What would be needed is a reliable way to let node 1 sync the headers but then prevent it from syncing the blocks. It seems we don't have something like this and I am now thinking about how this could be done and where else it could be useful.
Interesting, this is new for me.

    def setup_network(self):
        self.setup_nodes()

Even with this config (`setup_network`), will node1 download the blocks from node0 in the background after node0 mines 200 more blocks?

    def run_test(self):
        self.log.info("Mine 200 blocks on Node 0")
        self.generate(self.nodes[0], 200, sync_fun=self.no_op)
        assert_equal(self.nodes[0].getblockcount(), 400)

        self.log.info("Connect nodes")
        self.connect_nodes(0, 1)
        peers = self.nodes[1].getpeerinfo()
        assert_equal(len(peers), 1)
        peer_1_peer_0_id = peers[0]["id"]

        best_block_hash_0 = self.nodes[0].getbestblockhash()
        assert_raises_rpc_error(-1, 'In prune mode, only blocks that the node has already synced previously can be fetched from a peer', self.nodes[1].getblockfrompeer, best_block_hash_0, peer_1_peer_0_id)

        self.sync_blocks()
        self.nodes[1].getblockfrompeer(best_block_hash_0, peer_1_peer_0_id)

I thought the first …
No, the nodes indeed can not sync if they are not connected. But you are connecting the nodes in your test here: `self.connect_nodes(0, 1)`.

What …
I think we have a working test now. I remembered that we can submit the header via a `submitheader` RPC call.
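For reference, a minimal sketch of how that header-only setup could look, assuming the standard functional-test framework plus the `getblockheader`, `submitheader`, `getpeerinfo`, and `getblockfrompeer` RPCs; this is an illustration of the approach, not the exact test added in this PR:

```python
#!/usr/bin/env python3
"""Sketch: give a pruned node only the header of a new block, then check that
getblockfrompeer refuses to fetch it while the node has not synced past it."""
from test_framework.test_framework import BitcoinTestFramework
from test_framework.util import assert_raises_rpc_error


class GetBlockFromPeerPruneSketch(BitcoinTestFramework):
    def set_test_params(self):
        self.num_nodes = 3
        # Node 2 is the pruned node under test; 550 is the minimum prune target.
        self.extra_args = [[], [], ["-prune=550"]]

    def setup_network(self):
        # Only connect node 1 and node 2. Node 0 stays isolated, so the block it
        # mines below can never reach node 2 in the background (this avoids the
        # race discussed above).
        self.setup_nodes()
        self.connect_nodes(1, 2)

    def run_test(self):
        self.log.info("Mine a block on the isolated node 0")
        self.generate(self.nodes[0], 1, sync_fun=self.no_op)
        block_hash = self.nodes[0].getbestblockhash()

        self.log.info("Give the pruned node 2 only the header of that block")
        header_hex = self.nodes[0].getblockheader(block_hash, False)
        self.nodes[2].submitheader(header_hex)

        self.log.info("Fetching the block must fail while node 2 has not synced past it")
        peer_id = self.nodes[2].getpeerinfo()[0]["id"]
        assert_raises_rpc_error(
            -1,
            "In prune mode, only blocks that the node has already synced previously can be fetched from a peer",
            self.nodes[2].getblockfrompeer, block_hash, peer_id)


if __name__ == "__main__":
    GetBlockFromPeerPruneSketch().main()
```

Because node 0 never connects to the pruned node or to its only peer, the block data cannot arrive in the background, so the assertion does not race with block download.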
Yes, my bad. That approach would usually work because node0 is mining 200 more blocks, so node1 probably didn't get the last one before the assertion, but they're still syncing, so it could cause intermittent failures.
brunoerg left a comment:
tACK 105b2876c5cd4f3710e3510491070f47ae11b0e4
Compiled the branch code on macOS 12 and started bitcoind (empty):
./src/bitcoind --prune=550 --daemon
Got a hash of a recent block from a block explorer: 00000000000000000001fede733d9ad94b9a9cdb07237cba25c556f8f807db4b
Executed getpeerinfo to see the connections and get some peer ids to use.
And then, executed getblockfrompeer with 00000000000000000001fede733d9ad94b9a9cdb07237cba25c556f8f807db4b as block hash.
➜ bitcoin git:(fjahr) ✗ ./src/bitcoin-cli getblockfrompeer 00000000000000000001fede733d9ad94b9a9cdb07237cba25c556f8f807db4b 11
error code: -1
error message:
In prune mode, only blocks that the node has already synced previously can be fetched from a peer
➜ bitcoin git:(fjahr) ✗ ./src/bitcoin-cli getblockfrompeer 00000000000000000001fede733d9ad94b9a9cdb07237cba25c556f8f807db4b 8
error code: -1
error message:
In prune mode, only blocks that the node has already synced previously can be fetched from a peer
➜ bitcoin git:(fjahr) ✗ ./src/bitcoin-cli getblockfrompeer 00000000000000000001fede733d9ad94b9a9cdb07237cba25c556f8f807db4b 5
error code: -1
error message:
In prune mode, only blocks that the node has already synced previously can be fetched from a peer
Concept ACK, but light selfish preference for getting #23706 in first, since there's a (small) conflict.
Weak concept NACK. I think it's better to allow it. Pruning is only best-effort anyway, not a guarantee.
Force-pushed from 105b287 to b750720
Rebased
Why do you think it's better to allow it? Do you have any specific use cases in mind? Of course there is no guarantee to stay below the exact number but for …
Force-pushed from b750720 to 3ec2cba
In the context of ForkMonitor I use this feature to fetch blocks that are, by definition, at or below the tip height, i.e. stale blocks. This PR doesn't impact that use case, because these nodes are always up to date. In fact, this PR adds some safety for when a fresh node is added to the site, or an existing node is reinstalled, and it's still catching up (though in practice we don't call …).

Another use case that seems obvious to me is to fetch an historical block, perhaps because you're rescanning a wallet. In that case I don't see any harm in waiting until the node is synced.
Force-pushed from 3ec2cba to 92f4ca2
Rebased
Force-pushed from 92f4ca2 to 5826bf5
Rebased
utACK 5826bf5. Very nice test.
ACK 5826bf5
    // Fetching blocks before the node has syncing past their height can prevent block files from
    // being pruned, so we avoid it if the node is in prune mode.
    if (index->nHeight > chainman.ActiveChain().Tip()->nHeight && node::fPruneMode) {
In 7fa851f: (non-blocking nit)
Why not use IsBlockPruned instead?
That should be the same as saying that, in prune mode, only blocks that were downloaded and then discarded can be fetched.
If IsBlockPruned was used instead, fetching an older block that isn't pruned (blocks close to the tip, for instance) on a pruned node would result in the error "In prune mode, only blocks that the node has already synced previously can be fetched from a peer" instead of "Block already downloaded".
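To make the behaviour described here concrete, a hypothetical fragment (the names `pruned_node` and `peer_id` are illustrative, not taken from this PR's test), assuming a fully synced node started with `-prune=550` and the framework's `assert_raises_rpc_error` helper:

```python
# Hypothetical fragment inside a functional test's run_test().
recent_hash = pruned_node.getbestblockhash()  # a block the node already stores on disk

# The block is not above the active tip, so the new height/prune-mode check passes and
# the RPC reports the pre-existing "Block already downloaded" error instead.
assert_raises_rpc_error(-1, "Block already downloaded",
                        pruned_node.getblockfrompeer, recent_hash, peer_id)
```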
Doesn't chainman.ActiveChain() need to be called with cs_main locked?
CChain& ActiveChain() const EXCLUSIVE_LOCKS_REQUIRED(GetMutex()) { return ActiveChainstate().m_chain; }
Hmm, indeed it does. I really need to be building with clang...
rpc/blockchain.cpp:464:35: warning: calling function 'ActiveChain' requires holding mutex 'cs_main' exclusively [-Wthread-safety-analysis]
if (index->nHeight > chainman.ActiveChain().Tip()->nHeight && node::fPruneMode) {
^
1 warning generated.
tACK 5826bf5
I tested the behavior by invalidating the tip with invalidateblock and trying to fetch it again with getblockfrompeer which resulted in In prune mode, only blocks that the node has already synced previously can be fetched from a peer as expected.
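The same behaviour can be exercised from a functional test; a hypothetical fragment mirroring the manual steps above (`pruned_node` and `peer_id` are illustrative names, assuming a synced node started with `-prune=550` and the framework's `assert_raises_rpc_error` helper):

```python
# Hypothetical fragment inside a functional test's run_test().
tip = pruned_node.getbestblockhash()

# Invalidating the tip moves the active chain below that block's height, so the new
# check treats it like a block the node has not synced past yet.
pruned_node.invalidateblock(tip)
assert_raises_rpc_error(
    -1,
    "In prune mode, only blocks that the node has already synced previously can be fetched from a peer",
    pruned_node.getblockfrompeer, tip, peer_id)

# Undo the invalidation so the node returns to its previous tip.
pruned_node.reconsiderblock(tip)
```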
f5ff3d7 rpc: add missing lock around chainman.ActiveTip() (Andrew Toth) Pull request description: #23927 seems to have missed a lock around `chainman.ActiveChain()`. ACKs for top commit: aureleoules: ACK f5ff3d7 Tree-SHA512: 3f116ca44c1b2bc0c7042698249ea3417dfb7c0bb81158a7ceecd087f1e02baa89948f9bb7924b1757798a1691a7de6e886aa72a0a9e227c13a3f512cc59d6c9
This PR prevents getblockfrompeer from getting used on blocks that the node has not synced past yet if the node is running in prune mode.

Problem
While a node is still catching up to the tip that it is aware of via the headers, the user can currently use getblockfrompeer to fetch blocks close to or at the tip. These blocks are stored in the block/rev file that otherwise contains blocks the node is receiving as part of the syncing process.
This creates a problem for pruned nodes: the files containing a fetched block are not pruned during syncing because they contain a block close to the tip. This means the entire file (~130MB) will not be pruned until the tip has moved on far enough from the fetched block. In extreme cases with heavy pruning (like 550) and multiple blocks being fetched, this could mean that the disk usage far exceeds what the user expects, potentially running out of space.
Approach
There would certainly be other approaches that could fix the problem while still allowing the current behavior, but all of the ideas I came up with seemed like overkill for a niche problem in a new RPC where it's still unclear how and how much it will be used.
Testing
So far I did not see a simple enough way to test this. I am still looking into it and, if it turns out to be complex, will potentially add the test in a follow-up. What would be needed is a way to have a node fetch headers but not sync the blocks yet; that seems like a pattern that could be generally useful.
To manually reproduce the problematic behavior:
1. Run a node on master with -prune=550 and an empty/new datadir; Testnet and Mainnet should both work.
2. While the node is syncing, use getblockfrompeer on the current tip and a few other recent blocks.
3. Watch for blk*.dat and rev*.dat files that are not being pruned. When you have "pinned" a few of these files, the blocks folder should be significantly above the target size of 550MB.