Reduce keepalive ping/timeout #2784

sipa · 2013-06-22T17:17:37Z

Ping automatically every 2 minutes (unconditionally), instead of after 30 minutes of no sending, for latency measurement and keep-alive. Also, disconnect if no reply arrives within 5 minutes, instead of after 90 minutes of inactivity.

This should make detection of stalled connections much faster.

sipa · 2013-06-22T17:30:14Z

@mikehearn Would this interact badly with BitcoinJ wallets? How frequently do they send pings?
(or anyone with an alternate implementation)

mikehearn · 2013-06-23T13:56:57Z

bitcoinj sends pings every two seconds, which is much more aggressive than this patch. We could certainly reduce it. Ping times are used to pick which peer to download from.

jgarzik · 2013-06-23T19:50:34Z

The disconnect logic seems like it would negatively impact testnet.

sipa · 2013-06-23T22:20:15Z

Inverted the logic, so it now becomes a ping if nothing has been received for a while, rather than sent.

Tested that it indeed detects broken connections within one minute.

mikehearn · 2013-07-05T10:41:12Z

Change looks good to me, although it will have the effect of disconnecting any nodes that don't respond to pings with pongs (or some other message). As pong messages were added in protocol version 60000 it would effectively EOL clients older than that and this may deserve an announcement somewhere.

However bitcoinj clients will be fine with it.

gmaxwell · 2013-07-07T16:46:13Z

So, this causes nodes to impose an upper bound on their peers' response latency. Effectively, someone who takes a minute to respond can't participate in the network at all. This may adversely impact some anonymity protocols which cause high latency; for many kinds of Bitcoin usage the latency isn't all that critical. I'd suggest that such users should be using an alternative transport, except the bitcoin p2p protocol is already reasonably well designed for high latency usage (at least it's highly asynchronous).

Other side effects of timeouts are that they can reduce link stability during a DoS attack, and they can preclude some kinds of node high availability schemes which might cause tens of seconds of unresponsiveness (e.g. during a state resync or hardware migration). Connection slot filling attacks are more effective if you can overload a peer temporarily and make them drop all their good connections.

Some other protocols, like BGP, negotiate the heartbeat interval (30 and 90 being a common set of parameters in that case) to address the problem of potentially imposing too fast a response requirement. But since our network protocol is mostly stateless (though it has become less so with bloom filtering, and never was completely, e.g. inv caching), there is less of an issue, as restarts aren't so bad.

I'd instead prefer a longer timeout and, separately, peer rotation code that periodically slays the least recently active node out of the set which hasn't been heard from in >60 seconds. That is, impose the shorter timeout, but only when we could potentially make better use of the slot.
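The rotation idea above could be sketched roughly as follows. This is only an illustration of the selection rule, not code from the patch: the `Peer` type, the `pick_eviction_candidate` function and the `slots_needed` flag are hypothetical names, and the 60-second threshold is the figure mentioned in the comment.

```python
from dataclasses import dataclass
from typing import List, Optional

# Seconds of silence before a peer becomes an eviction candidate
# (the ">60 seconds" figure from the comment; everything else is assumed).
IDLE_THRESHOLD = 60


@dataclass
class Peer:
    name: str
    last_recv: float  # time (in seconds) of the last message received


def pick_eviction_candidate(
    peers: List[Peer], now: float, slots_needed: bool
) -> Optional[Peer]:
    """Return the peer to drop, or None if nobody should be dropped.

    Peers are only considered when a slot could be put to better use,
    and only if they have been silent for longer than IDLE_THRESHOLD.
    """
    if not slots_needed:
        return None
    idle = [p for p in peers if now - p.last_recv > IDLE_THRESHOLD]
    if not idle:
        return None
    # Least recently active = smallest last_recv timestamp.
    return min(idle, key=lambda p: p.last_recv)
```

The point of the design is that a quiet but otherwise healthy peer is never dropped while spare slots remain; the short threshold only applies under slot pressure.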

TCP's design targets a maximum segment lifetime of 2 minutes. Many operating systems have an SO_KEEPALIVE timeout of around 10 minutes. Somewhere in that range

mikehearn · 2013-07-08T13:00:05Z

It would be easy to adjust the timeout for onion addresses, but really, I'm not sure we want nodes in the network that can't respond within a minute. Slow nodes can have terrible effects on the user experience for SPV wallets. They will automatically drop to the bottom of the preference list in bitcoinj and not be used for much except observing broadcasts, for that reason.

That said, just dropping the slowest peer when we run out of slots on the bitcoind side indeed sounds reasonable.

jgarzik · 2013-08-25T03:05:17Z

How about a 5-minute timeout, and get this merged? Maybe 1 minute is too short, but 90 is far too long.

sipa · 2013-10-14T00:16:43Z

Suggestion: send pings every 2 minutes (even in case something was received recently, with the new ping-response-time-measurement that means we always get some useful latency information), and disconnect after receiving nothing for 5 minutes.

gmaxwell · 2013-10-14T17:15:58Z

@sipa ACK unconditional 2 minute ping plus 5 minute disconnect.

sipa · 2013-10-14T22:42:20Z

Rebased & updated.

gavinandresen · 2013-10-14T23:10:05Z

ACK.

sipa · 2013-10-14T23:21:22Z

There may be a bug in this code; will investigate later.

EDIT: fixed.

jgarzik · 2013-10-15T02:21:55Z

ACK

Optional nit: "5 * 60" is more self-documenting than "300", and the previous code used the "M * 60" notation. All modern compilers will automatically convert the more human-readable M*60 at compile time, so there is no added runtime cost.

sipa · 2013-10-15T20:08:59Z

Moved/added the ping/timeout constants to net.h (where they can be accessed by both net & main), and changed the timeout logic a bit: either there is an unanswered ping >5 minutes old, or there has been no message received at all for 5 minutes.
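As a rough illustration of the two conditions described above (not the actual net.cpp code), the check could look like this. The constant names, the `timeout_reason` function and its parameters are assumptions made for the sketch; the 2-minute and 5-minute values are the ones under discussion at this point in the thread.

```python
from typing import Optional

PING_INTERVAL = 2 * 60     # seconds between automatic pings
TIMEOUT_INTERVAL = 5 * 60  # seconds before a silent peer is dropped


def timeout_reason(
    now: float, last_recv: float, ping_sent_at: Optional[float]
) -> Optional[str]:
    """Return why a peer should be disconnected, or None to keep it.

    ping_sent_at is the time the currently unanswered ping was sent,
    or None when no ping is outstanding.
    """
    # Condition 1: an unanswered ping older than the timeout.
    if ping_sent_at is not None and now - ping_sent_at > TIMEOUT_INTERVAL:
        return "ping timeout"
    # Condition 2: no message of any kind received within the timeout.
    if now - last_recv > TIMEOUT_INTERVAL:
        return "inactivity timeout"
    return None
```

For example, `timeout_reason(1000, 900, 650)` reports a ping timeout because the outstanding ping is 350 seconds old, even though another message arrived 100 seconds ago.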

Diapolo · 2013-10-15T20:55:38Z

src/net.h

Small nit: Perhaps indent the comments to be in line.

sipa · 2013-10-16T18:25:54Z

Currently running this patch myself on bitcoin.sipa.be. I see a surprisingly high number of ping timeouts and inactivity timeouts (every 10 minutes or so, very irregularly). I'll investigate whether these are actual connections that go dead (in which case this patch is actually useful...) or a problem with the disconnection logic.

luke-jr · 2014-02-21T01:11:17Z

@sipa What did you find?

sipa · 2014-05-10T13:30:27Z

Rebased, and made some changes (better reporting, and don't enforce 5-minute receive timeout on pre-BIP31 peers).

I'm testing it again myself.
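A minimal sketch of the pre-BIP31 exemption, under two assumptions that are not spelled out in this thread: that peers answer pings with pongs only when their protocol version is above 60000 (as BIP 31 specifies), and that older peers simply keep the previous 90-minute inactivity limit. The names below are illustrative.

```python
# Assumption: peers with a protocol version above this value send pongs.
BIP0031_VERSION = 60000

TIMEOUT_INTERVAL = 5 * 60       # new receive timeout, in seconds
LEGACY_RECV_TIMEOUT = 90 * 60   # assumed fallback: the old 90-minute limit


def supports_pong(protocol_version: int) -> bool:
    """True if the peer is expected to answer a ping with a pong."""
    return protocol_version > BIP0031_VERSION


def receive_timeout(protocol_version: int) -> int:
    """Receive timeout in seconds that applies to a peer of this version."""
    if supports_pong(protocol_version):
        return TIMEOUT_INTERVAL
    # Older peers cannot be kept alive by ping/pong, so the short
    # timeout would disconnect them whenever they are merely quiet.
    return LEGACY_RECV_TIMEOUT
```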

sipa · 2014-05-10T13:35:37Z

Strange, I have one peer (/Satoshi:0.8.6/, proto v70001) which seems to disconnect when it receives a ping message, and then immediately reconnects.

sipa · 2014-05-11T09:34:36Z

Seems the majority of timeouts detected are ping timeouts from getaddr.bitnodes.io (which apparently only recently implemented ping/pong - though probably not yet deployed).

Apart from that I do see quite a few others, from various clients and IPs, every ~10 minutes on average over the past hours. These look like genuine connection problems.

ayeowch · 2014-05-11T09:48:13Z

@sipa I have just deployed the ping/pong update to getaddr.bitnodes.io. The crawler should be responding to incoming ping properly now.

sipa · 2014-05-11T10:17:26Z

@ayeowch Cool, good to know. I'm not seeing any pongs yet, though...

ayeowch · 2014-05-11T10:26:24Z

@sipa Hmm.. If the crawler (subver=/getaddr.bitnodes.io:0.1/) is already connected to your node, try sending a ping with bitcoind ping. I did get a pong from the crawler by doing this.

sipa · 2014-05-11T10:31:38Z

2014-05-11 10:27:53 Sending ping to [2a01:4f8:202:81b1::2]:48980 /getaddr.bitnodes.io:0.1/

No pong afterwards (while I do receive pongs from other nodes).

sipa · 2014-05-31T11:50:17Z

For reference: the problem with bitnodes.io was lingering connections from a crawler that didn't support BIP31 (but advertised a protocol version that did), which has now been fixed on their side.

I'm seeing occasional disconnects because of the new ping timeouts, but I believe these are correct.

Any objections to merging?

laanwj · 2014-06-05T04:55:20Z

@sipa No objections to pinging more, but I'm a bit wary about the 'disconnect nodes faster' part of this. Going from 90 minutes to 5 minutes is a big change. It wouldn't be the first time that something is introduced that can hang bitcoind for a few minutes.

leofidus · 2014-06-06T00:55:27Z

I think there are already scenarios where network pings are not answered within 5 minutes. For example the importprivkey rpc command can lock cs_main for many minutes during a rescan. If during that time a network message arrives which also requires a lock on cs_main (for example a harmless inv), this locks the network thread until the rescan is complete. This leaves the peer unable to answer pings for more than 10 minutes.

sipa · 2014-06-06T01:02:54Z

I think it's reasonable that peers disconnect you during that time. You can't use their service, and they can't use yours.

laanwj · 2014-06-06T03:53:51Z

To me it just doesn't feel very robust to have the network fall apart when somehow 5 minutes of lag are introduced.

sipa · 2014-06-08T21:25:46Z

@laanwj What timeout would you find acceptable?

laanwj · 2014-06-09T11:11:29Z

Let's pick uhm... 20 minutes. That slices the current inactivity timeout in four already. We can always lower the value later.

gmaxwell · 2014-06-09T20:57:17Z

@wumpus TCP itself will fail over long before that if the network itself is delayed. A node that's holding up our connection but not responding for >5 minutes is really broken as far as we should care. It can reconnect when it can start responding again. (And yes, in some cases bitcoind is 'broken', but that's okay... reconnecting is fairly cheap.)

The flip side of a long timeout is that we'll leave connections up to busted peers, keeping ourselves offline (or our resources consumed) for up to the timeout, when we could otherwise have useful connections up.

Maybe it would make sense to have a conservative upper limit for now; I certainly wouldn't want to delay pulling this further to debate the timeout. I'm happy with anything 5 minutes or longer but would prefer closer to 5 minutes. Future peer rotation could prioritize punting long idle connections, which would address the problem that you might sit partitioned from the network for many minutes with no working connections.

sipa · 2014-06-09T21:00:24Z

Increased the timeout to 20 minutes.

gmaxwell · 2014-06-09T21:05:54Z

@sipa The commit message is now wrong.

... instead of after 30 minutes of no sending, for latency measurement and keep-alive. Also, disconnect if no reply arrives within 20 minutes, instead of 90 of inactivity (for peers supporting the 'pong' message).

sipa · 2014-06-09T21:07:14Z

Fixed.

BitcoinPullTester · 2014-06-09T21:31:41Z

Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/f1920e86063d0ed008c6028d8223b0e21889bf75 for binaries and test log.
This test script verifies pulls every time they are updated. It, however, dies sometimes and fails to test properly. If you are waiting on a test, please check timestamps to verify that the test.log is moving at http://jenkins.bluematt.me/pull-tester/current/
Contact BlueMatt on freenode if something looks broken.

laanwj · 2014-06-09T21:56:01Z

@gmaxwell I just want to be careful here.

@sipa Looks good to me now.

f1920e8 Ping automatically every 2 minutes (unconditionally) (Pieter Wuille)

cozz · 2014-06-13T17:58:27Z

Just tested on mainnet:

start with maxconnections=1
initial pingtime under a second
wait for block sync to start
ping
pingtime=140s

The problem is the node is busy sending us blocks, so does not respond to the ping.
I think we should delay the ping when blocks are in flight.

gmaxwell · 2014-06-13T18:06:13Z

@cozz I don't understand your concern. Yes, pings will take a lot of time in that case, but the peer will not be disconnected.

cozz · 2014-06-13T18:32:44Z

Well, I thought if the pingtime is used for peer selection later, then you probably don't want to ping if you know that the ping will be much slower than normal because blocks are in flight.
I don't have any concerns for now, just saying that if we are going to judge a peer by its periodic pingtime, a slow ping does not necessarily mean a slow node. In my case the node is a good node, just busy.
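One way to address this concern, sketched under stated assumptions: instead of delaying the ping, only count round-trip samples taken while no blocks were in flight, and rank peers by the best such sample. Nothing here is part of the patch; `PingStats`, `record` and `score` are hypothetical names, and the sample values in the usage note are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PingStats:
    # Round-trip times (seconds) measured while the peer was not busy
    # sending us blocks.
    idle_samples: List[float] = field(default_factory=list)

    def record(self, rtt: float, blocks_in_flight: bool) -> None:
        """Store a ping measurement unless it was taken mid-download."""
        if not blocks_in_flight:
            self.idle_samples.append(rtt)

    def score(self) -> Optional[float]:
        """Best idle round-trip time, or None if nothing usable yet."""
        return min(self.idle_samples) if self.idle_samples else None
```

With a sub-second sample recorded before sync and a 140-second sample recorded during block download, `score()` still reflects the sub-second value, so a good but busy node is not ranked as slow.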


Diapolo reviewed Oct 15, 2013

This was referenced May 19, 2014

Push cs_main locks down in ProcessBlock #4148

Merged

Add comments/documentation for CNode structure. #4200

Closed

Qt: Add GUI view of peer information. #4133 #4225

Merged

sipa mentioned this pull request Jun 4, 2014

[Qt] tweak new peers tab in console window #4285

Merged

Ping automatically every 2 minutes (unconditionally)

f1920e8

... instead of after 30 minutes of no sending, for latency measurement and keep-alive. Also, disconnect if no reply arrives within 20 minutes, instead of 90 of inactivity (for peers supporting the 'pong' message).

laanwj merged commit f1920e8 into bitcoin:master Jun 12, 2014

laanwj added a commit that referenced this pull request Jun 12, 2014

Merge pull request #2784

3f39b9d

f1920e8 Ping automatically every 2 minutes (unconditionally) (Pieter Wuille)

Bushstar pushed a commit to Bushstar/omnicore that referenced this pull request Apr 5, 2019

Honor bloom filters when announcing LLMQ based IS locks (bitcoin#2784)

9e70209

* Split out GetInstantSendLockHashByTxid from GetInstantSendLockByTxid * Filter ISLOCK messages based on provided filter

hebasto mentioned this pull request May 11, 2019

Flushing database cache causes p2p connections ping timeout #16008

Closed

bitcoin deleted a comment May 11, 2019

bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021
