Hopefully final fix for the stuck blockchain issue #1315

sipa · 2012-05-15T15:40:50Z

Immediately issue a "getblocks", instead of a "getdata" (which will trigger the relevant "inv" to be sent anyway), and only do so when the previous set of invs led us into a known and attached part of the block tree.

This patch has been tested on a (constructed) blockchain that was effectively stuck.

rebroad · 2012-05-15T15:43:25Z

I suspected getblocks was probably the answer :)

It would be very nice to see a mini white paper on what this does and how it works....

gmaxwell · 2012-05-16T02:13:22Z

K. Tested resync from start several times. Tested partial resync. Tested recovery from fork with reorg on non-stuck node. Tested recovery from a forkmode stuck node. Tested with loadblocks. Made sure it wasn't bloating up the chain with a ton of copies of extra block .... I can't break it, so I'm pulling.

rebroad · 2012-05-17T02:09:02Z

Just out of interest, does the initial getblocks (that's sent to the first peer upon starting the node) also cause the recovery from a fork-stuck node? Is it that this change is to enable it to become unstuck without restarting the node? Or did even restarting not fix things?

gmaxwell · 2012-05-17T02:23:05Z

Even restarting did not fix the particular issue this fix was needed to address, but normal nodes probably can't get into that state. (The nodes in question were ones that got stuck due to incorrectly rejecting the correct chain, e.g. because of premature BIP16 enforcement.)

sipa · 2012-05-17T10:10:53Z

A bit more elaborate: if you were running a 0.6 RC, you would have code that used the old BIP16 switchover date. The date passed, but you did not update your software. Suddenly someone sends an invalid BIP16 transaction (so, one that is valid according to the traditional rules, but not according to the BIP16 rules). On the main network BIP16 validation is not active, so the transaction gets accepted. However, your old RC enforces BIP16 validation, so it considers this transaction invalid. This only happens after downloading the block that contains it, and adding it to the tree in the database. A few hundred blocks are added on top of this one, all in your database, but this chain does not become the best chain (as it is considered invalid).

Finally, you upgrade your software, and you now have the correct BIP16 switchover date. The correct chain is already downloaded in your block database, but it is not marked as the active best chain. At startup, your node sends a getblocks from its current best tip (which is one block before the one that contained the invalid BIP16 transactions) to the top of the chain. The peer answers by sending 500 invs back, and remembers to request 500 more when the last of those is downloaded. However, we already have the first 500, so not one is requested, and nothing happens. We must somehow make the peer send us the rest of the invs, as that is our only means for reconnecting that chain to the current best block. Earlier versions of this patch forced a getdata of that 500th block; this one sends a getblocks immediately.
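The stall described above can be reproduced in a toy model. The sketch below is only an illustration, not the actual patch: `peer_getblocks`, `sync`, and the `patched` flag are hypothetical names, block hashes are plain integers, and "downloading" is modelled as adding a hash to a set. It also assumes the follow-up getblocks is triggered when the last inv of a batch is already stored, and it ignores validation and reorganisation entirely.

```python
BATCH = 500  # invs per getblocks reply, as described in the thread


def peer_getblocks(peer_chain, locator, batch=BATCH):
    """Toy peer: return up to `batch` hashes that follow `locator` on its chain."""
    start = peer_chain.index(locator) + 1
    return peer_chain[start:start + batch]


def sync(peer_chain, stored, best_tip, patched, batch=BATCH):
    """Return the set of block hashes we hold after syncing from one toy peer.

    stored   -- hashes already in our block database (hypothetical model)
    best_tip -- tip of our active best chain, used for the first getblocks
    patched  -- if True, send getblocks again when the last inv is already stored
    """
    stored = set(stored)
    locator = best_tip
    while True:
        invs = peer_getblocks(peer_chain, locator, batch)
        if not invs:
            return stored  # peer has nothing more to announce
        last_already_stored = invs[-1] in stored
        stored.update(invs)  # stands in for getdata on the unknown blocks
        if last_already_stored and not patched:
            # Nothing was requested for the last inv, so the peer never
            # continues: this is the stuck state from the explanation above.
            return stored
        locator = invs[-1]  # continue from the end of this batch


# Usage: we hold blocks 0..700, but our best tip is still block 100.
chain = list(range(2001))
print(max(sync(chain, range(701), 100, patched=False)))
print(len(sync(chain, range(701), 100, patched=True)))
```

In the unpatched run the first batch (blocks 101 to 600) is entirely known, so the loop ends without learning about anything past what was already stored. In the patched run the node keeps asking from the end of each batch until the peer has nothing left to announce.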

rebroad · 2012-05-17T10:22:59Z

I understand the explanation so far, but it still doesn't explain how the new getblocks achieves that, why receiving the very latest block doesn't fix it, or why it can't be coded to re-evaluate the last 6 or so blocks in the last known valid chain to see if they are still valid under any new rules. The last solution would be better, IMHO, as it wouldn't increase network traffic, unlike this fix (kludge?).

rebroad · 2012-05-17T10:29:57Z

Sorry, meant to say that it could re-check the invalid chain upon startup, perhaps via a command-line option, or perhaps automatically whenever the invalid chain is longer by 6 blocks or more.

sipa · 2012-05-17T10:30:36Z

Over the course of an entire blockchain sync, this fix will cause maybe 50 kilobytes of extra communication. What you suggest is also possible, but harder and with fewer guarantees, in my opinion. You'd need to traverse the entire blockchain database, find stale chains, and re-evaluate them all?

rebroad · 2012-05-17T10:35:16Z

Doesn't this fix also increase data transfer even after the node has caught up? I thought it does getblocks upon receipt of every block, doesn't it?

To re-evaluate the invalid chain, it would only need to re-evaluate one block upon startup in the example you give: the first block in the longest invalid chain.

rebroad · 2012-05-17T10:37:06Z

Also, technically, I'd say this current fix requires a BIP.

sipa · 2012-05-17T10:44:16Z

In normal operation, this patch does nothing. It only sends out a getblocks when an inv is received with blocks that are already known and part of the block tree. During normal operation, this never happens, as you only request invs for the part after the main chain. And the block-sync process has never been well-formalized, though the responses to the network requests are. Those aren't changed, however.
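The condition described here can be written as a small predicate. This is a hedged sketch rather than the code in the patch: `parent_of` (a dict from stored block hash to parent hash), `GENESIS`, `is_attached`, and `wants_followup_getblocks` are hypothetical names, and checking only the last inv in the bundle is an assumption made for the illustration.

```python
GENESIS = "genesis"  # hypothetical marker for the root of the block tree


def is_attached(block_hash, parent_of):
    """True if `block_hash` is stored and links back to GENESIS via stored parents.

    A block whose ancestry runs into a missing parent (an orphan) is known
    but not attached.
    """
    seen = set()
    while block_hash != GENESIS:
        if block_hash not in parent_of or block_hash in seen:
            return False  # unknown block, missing ancestor, or a cycle
        seen.add(block_hash)
        block_hash = parent_of[block_hash]
    return True


def wants_followup_getblocks(inv_hashes, parent_of):
    """Send another getblocks only if the bundle ended in known, attached territory."""
    return bool(inv_hashes) and is_attached(inv_hashes[-1], parent_of)


# Usage: "a" and "b" are attached, "orphan" has a parent we never stored.
tree = {"a": GENESIS, "b": "a", "orphan": "missing"}
print(wants_followup_getblocks(["a", "b"], tree))
print(wants_followup_getblocks(["a", "new"], tree))
print(wants_followup_getblocks(["orphan"], tree))
```

A bundle that ends in an unknown block falls through to the ordinary getdata path, and a bundle that ends in an orphan does not count as attached, so neither triggers the extra request in this model.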

rebroad · 2012-05-17T10:51:09Z

I think you are incorrect to say it doesn't happen during normal operation. This is not my experience. When a new block arrives on the network, let's say 8 nodes announce it in invs. My node will getdata it from the first one, download it and ProcessBlock it, usually well before the last connected peer sends invs for it. So with this code, for each new block, the slowest peers to announce it will receive the new getblocks in this fix. As the network gets bigger, this could get worse. It could also get less bad if ProcessBlock takes longer due to larger blocks.
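The argument in this comment can be made concrete with a small counting model. It is a sketch of the claim only, not a measurement: `redundant_getblocks` is a hypothetical helper, times are in milliseconds, and it assumes the block is fetched from the first announcing peer and that every inv arriving after the block is stored triggers one extra getblocks.

```python
def redundant_getblocks(inv_arrival_ms, process_delay_ms):
    """Count peers whose inv for a new block arrives after we already stored it.

    inv_arrival_ms   -- arrival time of each peer's inv for the same block
    process_delay_ms -- time from the first inv until the block is downloaded
                        and processed (hypothetical single figure)
    """
    if not inv_arrival_ms:
        return 0
    stored_at = min(inv_arrival_ms) + process_delay_ms
    return sum(1 for t in inv_arrival_ms if t > stored_at)


# Usage: 8 peers announce the same block at different times.
arrivals = [0, 100, 200, 500, 900, 1400, 2000, 3500]
print(redundant_getblocks(arrivals, 600))   # fast processing
print(redundant_getblocks(arrivals, 3000))  # slow processing
```

Under these assumptions, a shorter processing delay leaves more late announcers, which matches the point that slower block processing would reduce the number of extra requests.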

gmaxwell · 2012-05-17T13:15:34Z

While previously testing this I specifically looked for excess requests during normal operations and didn't see any. Either I made a mistake or just had unlucky timing, or it's something about the peer mix that's triggering it, because I see some now: about 1769 of them on 05/16.

Actually, they seem to be being caused in high volume by specific peers. E.g. I have a couple which are each responsible for several hundred of them.

rebroad · 2012-05-17T16:32:18Z

I've added a fix to this, in my current bitcoin-ParallelBlockDownload branch (the 3rd commit of pull #1326).... I still think the ideal solution is to do it without using the network though....

Hopefully final fix for the stuck blockchain issue

385f730

gmaxwell added a commit that referenced this pull request May 16, 2012

Merge pull request #1315 from sipa/unstuck

462f5d9

Hopefully final fix for the stuck blockchain issue

gmaxwell merged commit 462f5d9 into bitcoin:master May 16, 2012

coblee pushed a commit to litecoin-project/litecoin that referenced this pull request Jul 17, 2012

Merge pull request bitcoin#1315 from sipa/unstuck

4c682bb

Hopefully final fix for the stuck blockchain issue

bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021
