Conversation

@sipa
Member

@sipa sipa commented Jan 11, 2014

This pull request brings in several block download related improvements. It's intended as a preparation for headers-first, but it's useful on its own.

Changes:

  • Track blocks being downloaded per-peer, and limit the number in-flight.
  • Never request the same block twice, so no more duplicate downloads during sync.
  • In case a peer seems to be too slow to send blocks we need, it's disconnected and its in-flight blocks can be requested elsewhere (which saves bandwidth for both us and them).
  • Limit the number of orphans held in memory, as those still happen very frequently during sync (the result of accidentally downloading blocks from two peers at once).

This should result in generally better synchronization behaviour from random/multiple peers, but may slow down the first part of the process (due to small blocks) and slow down fetching over high-latency high-bandwidth connections (which is relatively rare).
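
The per-peer tracking and duplicate-request prevention described above can be sketched roughly as follows. This is a simplified illustration, not the actual patch: the function names echo the patch's naming, but `PeerDownloadState`, the map layout, and the in-flight limit of 16 are assumptions.

```cpp
#include <map>
#include <set>
#include <string>

// Illustrative sketch only; names and the limit are hypothetical.
static const int MAX_BLOCKS_IN_FLIGHT = 16;

struct PeerDownloadState {
    std::set<std::string> blocksInFlight; // hashes requested from this peer
};

std::map<int, PeerDownloadState> peerState;  // keyed by peer id
std::map<std::string, int> blockSources;     // hash -> peer it was requested from

// Request a block from a peer unless it is already in flight anywhere
// (never request the same block twice) or the peer is at its limit.
bool MarkBlockAsInFlight(int peer, const std::string& hash) {
    if (blockSources.count(hash)) return false; // already requested somewhere
    PeerDownloadState& st = peerState[peer];
    if ((int)st.blocksInFlight.size() >= MAX_BLOCKS_IN_FLIGHT) return false;
    st.blocksInFlight.insert(hash);
    blockSources[hash] = peer;
    return true;
}

// When a block arrives (or its source disconnects), free the slot so the
// block can be re-requested from another peer.
void MarkBlockAsReceived(const std::string& hash) {
    std::map<std::string, int>::iterator it = blockSources.find(hash);
    if (it == blockSources.end()) return;
    peerState[it->second].blocksInFlight.erase(hash);
    blockSources.erase(it);
}
```

Once a stalled peer is disconnected, calling the receive path for each of its in-flight hashes is what makes those blocks eligible to be fetched elsewhere.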

Test plan:

  • Delete blocks/ and chainstate/
  • Start a download from random peers (no special options, or use -addnode=IP optionally).
    • Verify that no "already have block ..." messages appear during sync.
    • Check that synchronization completes (it may still be slow, as there is no optimization to select particularly good peers to download from, and orphans are expected).
  • Delete blocks/ and chainstate/ again.
  • Start a download from a single fast peer (-connect=IP)
    • Verify that no "already have block ..." messages appear during sync.
    • Check that synchronization completes and progresses quickly (doesn't get stuck; it may be a bit slower than before this patch, particularly in the initial part of the chain; very occasional orphans are possible)

Headers-first will remove a part of the logic in this pull request again, but as it's useful on its own, I prefer getting this reviewed and merged anyway:

  • There won't be a per-peer to-be-downloaded queue of block hashes anymore; those will be decided globally based on good known (and verified) headers chains.
  • The per-peer block-downloads-in-flight will remain, but becomes a queue of CBlockIndex*'es instead of uint256's.
  • There won't be any orphan processing in the block download mechanism anymore (so no need to limit orphans).
  • Stalled download detection won't use as many hardcoded constants anymore (as the question becomes "is this peer too slow compared to the others to keep up" rather than "is this peer slow according to some absolute measurement").

Tested by synchronizing while running in valgrind.

Depends on #3370 (Prepare block connection logic for headers-first).

@laanwj
Member

laanwj commented Jan 13, 2014

Needs rebase after #3516

@sipa
Member Author

sipa commented Jan 13, 2014

Will rebase tonight.

@sipa
Member Author

sipa commented Jan 18, 2014

Did a sync from scratch using this branch (plus #3517) on mainnet, a significant part of which under valgrind. Seems to work fine, though sometimes long chains of accumulated orphans are connected, resulting in cs_main being locked for a long time. I'll try to fix that independently.

@sipa
Member Author

sipa commented Jan 27, 2014

Rebase after logging changes.

@jgarzik
Contributor

jgarzik commented Jan 29, 2014

ACK

@laanwj
Member

laanwj commented Jan 30, 2014

I did some testing with block syncs, no problems found, worked fine for me.

@sipa
Member Author

sipa commented Jan 30, 2014

There is one potential problem here, namely that the many orphans can result in very long chains (I've seen up to ~400) of blocks being connected within a single cs_main lock, essentially making the program look frozen. This can be solved by pushing some locks down, but may need some discussion first.

I could easily submit the limit-orphans commit as a separate pull request though, as I suppose that can be seen as a memory usage bug fix.
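
The orphan limiting mentioned above could be sketched as a simple capped map, in the spirit of the limit-orphans commit: when the cap is reached, evict an existing orphan to make room. The names, the cap value, and the eviction choice here are all hypothetical, not taken from the patch.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical cap; the patch's actual constant may differ.
static const size_t MAX_ORPHAN_BLOCKS = 750;

std::map<std::string, std::string> mapOrphans; // hash -> serialized block

// Store an orphan block, evicting arbitrary existing orphans as needed
// so memory usage stays bounded during sync.
void AddOrphan(const std::string& hash, const std::string& block) {
    while (mapOrphans.size() >= MAX_ORPHAN_BLOCKS) {
        mapOrphans.erase(mapOrphans.begin()); // evict one to stay under cap
    }
    mapOrphans[hash] = block;
}
```

Bounding the map this way is purely a memory-usage fix: an evicted orphan is simply re-downloaded later, so correctness of sync is unaffected.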

Keep track of which block is being requested (and to be requested) from
each peer, and limit the number of blocks in-flight per peer. In addition,
detect stalled downloads, and disconnect if they persist for too long.

This means blocks are never requested twice, and should eliminate duplicate
downloads during synchronization.
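
The stall detection the commit message describes — disconnect when requested blocks stop arriving for too long — might look like this minimal check. The timeout value and names are illustrative assumptions, not the constants used by the patch.

```cpp
#include <cstdint>

// Made-up threshold for illustration; the patch uses its own constants.
static const int64_t BLOCK_STALL_TIMEOUT = 2 * 60; // seconds

// A peer is considered stalled when its oldest outstanding block
// request has gone unanswered for longer than the timeout; the caller
// would then disconnect it and re-request its blocks elsewhere.
bool PeerStalled(int64_t oldestRequestTime, int64_t now) {
    return now - oldestRequestTime > BLOCK_STALL_TIMEOUT;
}
```
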
@BitcoinPullTester

Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/f59d8f0b644d49324cabd19c58cf2262d49e1392 for binaries and test log.
This test script verifies pulls every time they are updated. It, however, dies sometimes and fails to test properly. If you are waiting on a test, please check timestamps to verify that the test.log is moving at http://jenkins.bluematt.me/pull-tester/current/
Contact BlueMatt on freenode if something looks broken.

@rebroad
Contributor

rebroad commented Feb 23, 2014

Regarding slowing down the download of the small blocks - how about factoring in the block size and incrementally changing how many peers are used as the block sizes get larger?
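
A rough sketch of that suggestion — scaling the number of peers used with block size — could look like this. The thresholds and peer counts are made up entirely for illustration; nothing like this exists in the patch.

```cpp
#include <cstddef>

// Hypothetical policy: fetch tiny early-chain blocks from one peer,
// and spread larger blocks across more peers as sizes grow.
int PeersForBlockSize(size_t blockBytes) {
    if (blockBytes < 50 * 1000) return 1;   // small early blocks
    if (blockBytes < 250 * 1000) return 2;  // mid-size blocks
    return 4;                               // near-full blocks
}
```
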

Contributor

Why was this here?

Member Author

Because blocks used to be tracked using the (weak) mapAlreadyAskedFor mechanism, but this patch introduces a better way that retains information about which peer each block was requested from.

@rebroad
Contributor

rebroad commented Feb 26, 2014

I've been testing this a few days now. The stuck block detection doesn't seem to be working:-

2014-02-26 22:24:01 Requesting block 00000000000000014e06412631ff67305cde77c6631dc58a2ce35308558cee05 from 216.99.101.192:8333
2014-02-26 22:28:31 ResendWalletTransactions()
2014-02-26 22:54:16 socket recv error 110
2014-02-26 22:54:17 receive version message: /Satoshi:0.8.6/: version 70001, blocks=287993, us=70.42.240.30:57711, them=207.255.42.202:8333, peer=207.255.42.202:8333
2014-02-26 22:54:17 Added time data, samples 10, offset +0 (+0 minutes)
2014-02-26 22:56:34 socket recv error 110
2014-02-26 22:57:26 connect() to 173.79.229.22:8333 failed after select(): Connection refused
2014-02-26 22:57:28 socket recv error 110
2014-02-26 22:58:03 socket recv error 110

Almost 30 minutes went by with no download happening, and it didn't seem to notice.

Contributor

Magic number... why so high? In the later stages of block download I'd have thought a number close to 1 would be appropriate.

Member Author

That would be a further optimization, yes.

@rebroad
Contributor

rebroad commented Mar 1, 2014

So far upon testing this code, I've not yet seen it downloading blocks from more than one node. There's not even any code to send a getblocks to more than one node from what I can tell. If a node is deemed to be unresponsive it's forcefully disconnected. How come this is ok, and yet my pull request to disconnect nodes sending blocks my nodes already had was considered not ok?

@sipa
Member Author

sipa commented Mar 1, 2014

This does not do parallel block download yet, as that requires headers-first sync. This is a step towards that, though.

And disconnecting unresponsive peers is fine - it's the only way to not waste their bandwidth if we don't need/want requested blocks anymore.

What is not fine is disconnecting peers because they're the second ones to respond to a block that we requested twice. We should simply not request the same block twice in the first place. That is what this pull request is supposed to do. If you still see "already have block" messages with this patch, there is possibly a bug in it that will need to be addressed.

@rebroad
Contributor

rebroad commented Mar 2, 2014

@sipa I deleted my previous comment about "already have" messages - none seen with this patch so far! But... a misbehaving node could still send random blocks if it wanted to, so at some point in the future I suspect we'll want to set that sort of behaviour as misbehaviour - maybe even as part of this pull request...?

@sipa
Member Author

sipa commented Mar 2, 2014

@rebroad That's a good question. The problem is that the current code actually does broadcast blocks sometimes without inv, namely newly mined blocks, which are certain not to exist anywhere else.

@rebroad
Contributor

rebroad commented Mar 2, 2014

Why would it broadcast a block without broadcasting the inv first?

As an aside, I noticed the block download got stuck today and wasn't able to recover - it was still downloading blocks that were being advertised (i.e. the latest block mined), but the initial block download stayed stuck.

@rebroad
Contributor

rebroad commented Mar 2, 2014

@sipa Re misbehaviour, it could just be for nodes sending unsolicited blocks that the node already has. Perhaps +30.

@rebroad
Contributor

rebroad commented Mar 8, 2014

The stuck download detection seems like it might not be working.

laanwj added a commit that referenced this pull request Mar 10, 2014
f59d8f0 Per-peer block download tracking and stalled download detection. (Pieter Wuille)
@laanwj laanwj merged commit f59d8f0 into bitcoin:master Mar 10, 2014
@rebroad
Contributor

rebroad commented Mar 30, 2014

I think this has been pulled prematurely. It's still got problems, as I mentioned above. For one, it doesn't resume the download if the OS is suspended and resumed.

@laanwj
Member

laanwj commented Mar 30, 2014

@rebroad can you post new problems and bug reports in new issues? If you just dump them into existing, closed issues they will probably get lost.

@laanwj
Member

laanwj commented Mar 30, 2014

And IMO, going toward headers-first / parallel block download is the single most important point in continued development of Bitcoin Core right now.

As this pull is an important step in that direction, it was prioritized to be merged as soon as possible after 0.9.0 was released.

If a little temporary inconvenience while the kinks are worked out is too much for you, I suggest staying with 0.9 for now. Thanks for helping test though.

@sipa
Member Author

sipa commented Mar 30, 2014

The stuck download detection here is just a very simple prototype. It only uses some hardcoded constants as thresholds. The purpose is to turn it into a detection for one peer being slower than the others, once actual parallel downloading is there.

@rebroad Is this a regression introduced by this pullreq, or a pre-existing problem that you expected to be fixed by this but wasn't?

@rebroad
Contributor

rebroad commented Mar 31, 2014

@sipa it's a regression.

@laanwj
Member

laanwj commented Apr 4, 2014

I suspended my PC with bitcoin running a few nights and it had no problem recovering connections when re-awakened.

@rebroad Do you perhaps have a dynamic IP that changes if your computer was suspended for a while? If so, Bitcoin indeed doesn't cope with it too well (it will AFAIK still advertise the old address, so you won't get incoming connections). But that's not a regression or new issue.

@rebroad
Contributor

rebroad commented Apr 6, 2014

@laanwj yes, my IP address changes, but I don't accept incoming connections either. The getting stuck after a suspend didn't happen before I applied this pullreq AFAIK.

assert(state != NULL);

// Make sure it's not listed somewhere already.
MarkBlockAsReceived(hash);
Contributor

@rebroad rebroad Oct 28, 2016

Really don't like the use of functions that clearly don't do what their name suggests they do.

What parts of this function are needed? And what would the function be called if it did what it is needing to do here?

@satoshi-n satoshi-n left a comment

Incompetence or malicious bbbb

@bitcoin bitcoin locked as resolved and limited conversation to collaborators Feb 15, 2022