-
Notifications
You must be signed in to change notification settings - Fork 38.7k
Hopefully final fix for the stuck blockchain issue #1315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I suspected getblocks was probably the answer :) It would be very nice to see a mini white paper on what this does and how it works.... |
Immediately issue a "getblocks", instead of a "getdata" (which will trigger the relevant "inv" to be sent anyway), and only do so when the previous set of invs led us into a known and attached part of the block tree.
|
K. Tested resync from start several times. Tested partial resync. Tested recovery from fork with reorg on non-stuck node. Tested recovery from a forkmode stuck node. Tested with loadblocks. Make sure it wasn't bloating up the chain with a ton of copies of extra block .... I can't break it, so I'm pulling. |
Hopefully final fix for the stuck blockchain issue
|
Just out of interest, does the initial getblocks (that's sent to the first peer upon starting the node) also cause the recovery from afork stuck node? Is it that this change is to enable it to become unstuck without restarting the node? Or did even restarting not fix things? |
|
Even restarting did not fix the particular issue this fix was needed to address— but normal nodes probably can't get into that state. (The nodes in question were ones that got stuck due to incorrectly rejecting the correct chain because e.g. of premature BIP16 enforcement) |
|
A bit more elaborate: if you were running an 0.6 RC, you would have code that used the old BIP16 switchover date. The date passed, but you did not update your software. Suddenly someone sends an invalid BIP transaction (so, one that is valid according to the traditional rules, but not according to the BIP16 rules). On the main network BIP16 validation is not active, so the transaction gets accepted. However, your old RC enforces BIP16 validation, so it considers this transaction invalid. This only happens after downloading the block that contains it, and adding it to the tree in the database. A few hundred blocks are added on top of this one, all in your database, but this chain does not become the best chain (as it is considers invalid). Finally, you upgrade your software, and you now have the correct BIP16 switchover date. The correct chain is already downloaded in your block database, but it is not marked as the active best chain. At startup, your node sends a getblocks from its current best tip (which is one block before the one that contained the invalid BIP16 transactions) to the top of the chain. The peer answers by sending 500 invs back, and remembers to request 500 more when the last of those is downloaded. However, we already have the first 500, so not one is requested, and nothing happens. We must somehow make the peer send us the rest of the invs, as that is our only means for reconnecting that chain to the current best block. Earlier versions of this patch forced a getdata of that 500th block, this one sends a getblocks immediately. |
|
I understand the explanation so far but it still doesn't explain how the new getblocks achieves that, and why upon receiving the very latest block, that that doesn't fix it, nor why it can't be coded to re-evaluate the last 6 or so blocks in the last known valid again to see if they are still valid by any new rules. The last solution would be better, IMHO, as it wouldn't increase network traffic, unlike this fix (kludge?). Sent from my Nokia phone A bit more elaborate: if you were running an 0.6 RC, you would have code that used the old BIP16 switchover date. The date passed, but you did not update your software. Suddenly someone sends an invalid BIP transaction (so, one that is valid according to the traditional rules, but not according to the BIP16 rules). On the main network BIP16 validation is not active, so the transaction gets accepted. However, your old RC enforces BIP16 validation, so it considers this transaction invalid. This only happens after downloading the block that contains it, and adding it to the tree in the database. A few hundred blocks are added on top of this one, all in your database, but this chain does not become the best chain (as it is considers invalid). Finally, you upgrade your software, and you now have the correct BIP16 switchover date. The correct chain is already downloaded in your block database, but it is not marked as the active best chain. At startup, your node sends a getblocks from its current best tip (which is one block before the one that contained the invalid BIP16 transactions) to the top of the chain. The peer answers by sending 500 invs back, and remembers to request 500 more when the last of those is downloaded. However, we already have the first 500, so not one is requested, and nothing happens. We must somehow make the peer send us the rest of the invs, as that is our only means for reconnecting that chain to the current best block. Earlier versions of this patch forced a getdata of that 500th block, this one sends a getblocks immediately. Reply to this email directly or view it on GitHub: |
|
Sorry, meant to say, that it could re-check the invalid again upon start up, perhaps by giving a command line option or perhaps automatically whenever the invalid again is longer by 6 blocks or more. Sent from my Nokia phone A bit more elaborate: if you were running an 0.6 RC, you would have code that used the old BIP16 switchover date. The date passed, but you did not update your software. Suddenly someone sends an invalid BIP transaction (so, one that is valid according to the traditional rules, but not according to the BIP16 rules). On the main network BIP16 validation is not active, so the transaction gets accepted. However, your old RC enforces BIP16 validation, so it considers this transaction invalid. This only happens after downloading the block that contains it, and adding it to the tree in the database. A few hundred blocks are added on top of this one, all in your database, but this chain does not become the best chain (as it is considers invalid). Finally, you upgrade your software, and you now have the correct BIP16 switchover date. The correct chain is already downloaded in your block database, but it is not marked as the active best chain. At startup, your node sends a getblocks from its current best tip (which is one block before the one that contained the invalid BIP16 transactions) to the top of the chain. The peer answers by sending 500 invs back, and remembers to request 500 more when the last of those is downloaded. However, we already have the first 500, so not one is requested, and nothing happens. We must somehow make the peer send us the rest of the invs, as that is our only means for reconnecting that chain to the current best block. Earlier versions of this patch forced a getdata of that 500th block, this one sends a getblocks immediately. Reply to this email directly or view it on GitHub: |
|
This fix will - over the course of an entire blockchain syncup - maybe cause 50 kilobytes extra communication. What you suggest is also possible, but harder and with less guarantees, in my opinion. You'd need to traverse the entire blockchain database and find stale chains, and re-evaluate them all? |
|
Doesn't this fix also increase data transfer even after the node has caught up? I thought it does getblocks upon receipt of every block, doesn't it? To re-evaluate the invalid again it would only need to re-evaluate one block upon start-up in the example you give. The first block in the longest invalid chain. |
|
Also, technically, i'd say this current fix requires a BIP. |
|
In normal operation, this patch does nothing. It only sends out a getblocks when an inv is received with blocks that are already known and part of the block tree. During normal operation, this never happens, as you only request invs for the part after the main chain. And the block-sync process has never been well-formalized, though the responses to the network requests are. Those aren't changed however. |
|
I think you are incorrect to say it doesn't happen during normal operation. This is not my experience. When i new block arrives on the network, let's say 8 nodes announce it in invs. My node will getdata it from the first one, download it and ProcessBlock it usually well before the last connected peer sends invs for it, so with this code with each new block, the slowest peers to announce it will receive the new getblocks in this fix. As the network gets bigger, this could get worse. It could also get less worse if ProcessBlocks takes longer due to larger blocks. |
|
While previously testing this I specifically looked for excess requests during normal operations and didn't see any. Either I made a mistake or just had unlucky timing— or it's something about the peer mix thats triggering it, because I see ones now— about 1769 of them on 05/16. Actually, they seem to be being caused in high volume by specific peers. E.g. I have a couple which are each responsible for several hundred of them. |
|
I've added a fix to this, in my current bitcoin-ParallelBlockDownload branch (the 3rd commit of pull #1326).... I still think the ideal solution is to do it without using the network though.... |
Hopefully final fix for the stuck blockchain issue
449872b [Build] Remove Windows 32 bit build, coming from btc@faf666f8148eeb305a9c4f78459aff2c7268016b (furszy) Pull request description: Coming from [btc@15939](bitcoin#15939). Based on bitcoin#1315. ACKs for top commit: Fuzzbawls: ACK 449872b random-zebra: ACK 449872b and merging... Tree-SHA512: 39c8785ade1202c09c76d964fdc3c739f9e162fec5c9b2991ee9a0a60c4935485c3822be82fd2e279fe536e049b3b1e4a9d5137145505adc09feea850420d8f0
Immediately issue a "getblocks", instead of a "getdata" (which will trigger the relevant "inv" to be sent anyway), and only do so when the previous set of invs led us into a known and attached part of the block tree.
This patch has been tested on a (constructed) blockchain that was effectively stuck.