Skip to content

Conversation

@jonatack
Copy link
Member

@jonatack jonatack commented Mar 13, 2025

When an addnode peer is disconnected by our IBD headers/blocks download timeout or stalling logic, ThreadOpenAddedConnections attempts to immediately reconnect it (unless "onetry" was passed to the addnode RPC) up to the limit of 8 addnode connections that is separate from regular peer connection limits.

During IBD in a low-bandwidth environment, I see these disconnections+reconnections very frequently with high-quality addnode peers; see https://www.erisian.com.au/bitcoin-core-dev/log-2025-01-22.html#l-85. The logging for these disconnections was: "Peer is stalling block download, disconnecting ." The ping times of the affected peers were often 20-100 seconds.

After IBD, I continue to see the following addnode peer disconnections, albeit much less frequently: "Timeout downloading block , disconnecting ."

This is probably network, resource and time intensive to an extent, particularly in the case of I2P peers that involves destroying and rebuilding 2 tunnels per peer connection, and worth avoiding where it is simple to do so.

Automatic (non-manually added) peers are also disconnected, but they are a different category and case (non-trusted non-protected peers, no immediate reconnection) that would require monitoring over time to adjust the timeouts accordingly; mzumsande was looking into this (see https://www.erisian.com.au/bitcoin-core-dev/log-2025-01-22.html#l-105).

The goal of this pull is therefore to avoid unnecessary disconnections/immediate reconnections of addnode peers.

@DrahtBot
Copy link
Contributor

DrahtBot commented Mar 13, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32051.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK pinheadmz, yuvicc
Concept ACK vasild, enirox001, hodlinator, 1440000bytes, l0rinc

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

@DrahtBot DrahtBot added the P2P label Mar 13, 2025
@fanquake
Copy link
Member

[23:32:25.656] /ci_container_base/src/rpc/net.cpp:310:125: error: invalid operands to binary expression ('const char[198]' and 'const char[196]')
[23:32:25.656]                 "Nodes added using addnode (or -connect) are protected from disconnection due to DoS or IBD header/block\n" +
[23:32:25.656]                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
[23:32:25.656] 1 error generated.

@jonatack jonatack force-pushed the 2025-03-addnode-p2p branch from 7a2f7d7 to ef46dc7 Compare March 13, 2025 03:35
@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/38682150505

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@jonatack jonatack force-pushed the 2025-03-addnode-p2p branch from ef46dc7 to 93b0799 Compare March 13, 2025 04:24
@jonatack jonatack marked this pull request as ready for review March 13, 2025 15:17
@jonatack
Copy link
Member Author

Ready for review.

}
// In case there is a block that has been in flight from this peer for block_interval * (1 + 0.5 * N)
// (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout.
// (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout
// (with N being the number of peers from which we're downloading validated blocks), disconnect due to timeout

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, since you're already touching it, consider updating for grammatical correctness

Copy link
Contributor

@mzumsande mzumsande left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main conceptual question is what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion?

While we have more trust in manual peers than automatic ones, I don't think that means the trust should be without bounds. Even trusted nodes can become unresponsive temporarily (e.g. very slow internet). I'm not sure if we should give one addnode peer the power to let our entire IBD come to a halt completely.

An alternative would be to not give these peers unlimited time but clear their block requests (so that these blocks can be requested from other peers) so that they wouldn't be disconnected but also wouldn't be able to stall us indefinitely.

if (pto->IsManualConn()) {
LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
} else {
LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the timeout specific for IBD stalling but the general one (which is generally longer). So we'd still disconnect for stalling (see log message "Peer is stalling block download") - is that on purpose / can you explain a bit more the reasoning behind this?

@hodlinator
Copy link
Contributor

I think the main conceptual question is what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion?

Maybe one could add yet another option to make manually specified nodes be treated with extra leniency?
First I was thinking that maybe we could treat -connect-nodes vs -addnodes differently automatically in this respect. But it would be more natural for -connect to be protected, and your (jonatack's) use-case is -addnode from what I gather.

@ajtowns
Copy link
Contributor

ajtowns commented Apr 24, 2025

I think the main conceptual question is what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion?

IMHO: It could be nice if we could "score" peers during IBD, and scale the number of blocks we request from them at once according to that score. In that case, automatic peers with low scores would get dropped, and manual peers with low scores would just not get any blocks requested.

@jonatack
Copy link
Member Author

Appreciate the thoughtful feedback. I'll look at an update.

LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
pto->fDisconnect = true;
if (pto->IsManualConn()) {
LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
Copy link
Contributor

@naiyoma naiyoma Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think the message can be a bit clearer, e.g., "Timeout downloading block %s from manual peer (addnode/connect), not disconnecting

@1440000bytes

This comment was marked as abuse.

@vasild
Copy link
Contributor

vasild commented May 23, 2025

Concept ACK to improve this, but:

what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion? ... one addnode peer the power to let our entire IBD come to a halt completely

Would it be too difficult to request a given block from more than one peer concurrently? That is, if a block is requested from a peer and not received within some time, then request it from other(s) as well, but don't cancel the original request. Then we take the block from whoever delivers it first. Subsequent receptions of the block from others would be an excess/redundant traffic, but I guess that is ok, if this technique is used carefully. Is this the same:

An alternative would be to not give these peers unlimited time but clear their block requests (so that these blocks can be requested from other peers)

Copy link
Member

@pinheadmz pinheadmz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concept ACK and untested code review 93b0799

I believe this also closes an 11-year-old issue: #5097

return true;
} else {
LogInfo("Timeout downloading headers from noban peer, not %s\n", pto->DisconnectMsg(fLogIPs));
LogInfo("Timeout downloading headers from %s peer, not %s\n", pto->IsManualConn() ? "addnode" : "noban", pto->DisconnectMsg(fLogIPs));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

64b956f

Nit: could you use NetPermissions::ToStrings(NetPermissionFlags flags) just to avoid hard coding things?

@DrahtBot DrahtBot requested a review from vasild May 23, 2025 14:46
@pinheadmz
Copy link
Member

I think the main conceptual question is what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion?

If we're considering a case where the user specifically chooses specific peers for IBD for privacy or whatever reasons, don't we just let them have the slow sync they trade-off for?

@yuvicc
Copy link
Contributor

yuvicc commented May 26, 2025

Concept ACK

@yuvicc
Copy link
Contributor

yuvicc commented May 28, 2025

Code Review ACK 93b0799

I've made a review notes for this pr -> here

@stringintech
Copy link
Contributor

stringintech commented May 28, 2025

An alternative would be to not give these peers unlimited time but clear their block requests (so that these blocks can be requested from other peers) so that they wouldn't be disconnected but also wouldn't be able to stall us indefinitely.

This makes sense, as the PR tries to achieve a similar goal with initial headers sync timeouts in commit 64b956f - when a timeout occurs, the sync state is reset to allow retrying with other peers while preserving the connection, and the PR extends this behavior to also apply to addnode peers.

@enirox001
Copy link
Contributor

Concept ACK and Code Review ACK.

I think the PR's increased timeout tolerance for addnode peers is valuable. Given my limited domain knowledge, I'd suggest test coverage on potential silent stalls to ensure blocks are efficiently re-requested

@achow101
Copy link
Member

achow101 commented Jun 2, 2025

I think the main conceptual question is what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion?

I'd like to see this addressed before moving forward here.

Maybe one could add yet another option to make manually specified nodes be treated with extra leniency?

It could be a whitebind/whitelist option.

@pinheadmz
Copy link
Member

@achow101 re: slow preferred peer "causing congestion" -- wouldn't this only affect the single user who explicitly set the preferred peer?

Copy link
Contributor

@hodlinator hodlinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK 93b0799

LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
pto->fDisconnect = true;
if (pto->IsManualConn()) {
LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any risk that this log message becomes spammy if we keep the peer connected? - Should we do something like reset state.m_downloading_since to current_time, clear state.vBlocksInFlight or add another state-field to throttle the log?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it could still be spammy, #32604 should mitigate it somewhat.

@achow101
Copy link
Member

achow101 commented Jul 4, 2025

@achow101 re: slow preferred peer "causing congestion" -- wouldn't this only affect the single user who explicitly set the preferred peer?

Yes, however I have seen people post on various forums suggestions of specific nodes that they can addnode, and users may be addnode'ing such nodes without realizing that there are potential adverse consequences of doing so. Generally, I would prefer to have things that change how we treat nodes be limited to the whitelist options rather than addnode also implying that those nodes will get specific special treatment.

fanquake added a commit that referenced this pull request Jul 15, 2025
61e800e test: headers sync timeout (stringintech)

Pull request description:

  When reviewing PR #32051 and considering which functional tests might need to be adapted/extended accordingly, I noticed there appears to be limited functional test coverage for header sync timeouts and thought it might be helpful to add one.

  This test attempts to cover two scenarios:

  1. **Normal peer timeout behavior:** When a peer fails to respond to initial getheaders requests within the timeout period, it should be disconnected and the node should attempt to sync headers from the next available peer.

  2. **Noban peer behavior:** When a peer with noban privileges times out, it should remain connected while the node still attempts to sync headers from other peers.

ACKs for top commit:
  maflcko:
    re-ACK 61e800e 🗝
  stratospher:
    reACK 61e800e.

Tree-SHA512: b8a867e7986b6f0aa00d81a84b205f2bf8fb2e6047a2e37272e0244229d1f43020e9031467827dabbfe7849a91429f2685e00a25356e2ed477fa1a035fa0b1fd
@glozow
Copy link
Member

glozow commented Aug 11, 2025

What's the status / are you still working on this? It doesn't look like #32051 (comment) has been addressed

@jonatack
Copy link
Member Author

jonatack commented Aug 11, 2025

Copy link
Contributor

@l0rinc l0rinc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK

LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
pto->fDisconnect = true;
if (pto->IsManualConn()) {
LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it could still be spammy, #32604 should mitigate it somewhat.

// Detect whether we're stalling
auto stalling_timeout = m_block_stalling_timeout.load();
if (state.m_stalling_since.count() && state.m_stalling_since < current_time - stalling_timeout) {
// Allow more time for addnode peers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it a bit confusing, I’d prefer separating stall detection logic from the action we take. Peers are stalling regardless of whether they're added manually or not, so the stalling logic should be unaffected.

I think we should always log when a stall is detected, even for manual peers, so operators can see what’s happening:

  • for automatic peers, disconnect as usual;
  • for manual peers, only disconnect if they exceed BLOCK_STALLING_TIMEOUT_MAX - otherwise keep them connected.

That way the logs clearly show tolerated stalls, and the policy difference is obvious without blending it into the timeout calculation, something like:

diff --git a/src/net_processing.cpp b/src/net_processing.cpp
--- a/src/net_processing.cpp	(revision 8405fdb06e8ff3220590bfac84e98547067ec6b7)
+++ b/src/net_processing.cpp	(date 1755201170971)
@@ -5814,14 +5814,16 @@
             // the download window should be much larger than the to-be-downloaded set of blocks, so disconnection
             // should only happen during initial block download.
             LogInfo("Peer is stalling block download, %s\n", pto->DisconnectMsg(fLogIPs));
-            pto->fDisconnect = true;
-            // Increase timeout for the next peer so that we don't disconnect multiple peers if our own
-            // bandwidth is insufficient.
-            const auto new_timeout = std::min(2 * stalling_timeout, BLOCK_STALLING_TIMEOUT_MAX);
-            if (stalling_timeout != new_timeout && m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout)) {
-                LogDebug(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", count_seconds(new_timeout));
-            }
-            return true;
+            if (!pto->IsManualConn() || state.m_stalling_since < current_time - BLOCK_STALLING_TIMEOUT_MAX) { // Allow more time for addnode peers
+                pto->fDisconnect = true;
+                // Increase timeout for the next peer so that we don't disconnect multiple peers if our own
+                // bandwidth is insufficient.
+                const auto new_timeout = std::min(2 * stalling_timeout, BLOCK_STALLING_TIMEOUT_MAX);
+                if (stalling_timeout != new_timeout && m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout)) {
+                    LogDebug(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", count_seconds(new_timeout));
+                }
+                return true;
+            }
         }
         // In case there is a block that has been in flight from this peer for block_interval * (1 + 0.5 * N)
         // (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout.

I would also prefer seeing a motivation being explained in the commit message for this change, maybe something like:

p2p: allow BLOCK_STALLING_TIMEOUT_MAX for addnode peers

When a manually added peer stalls during IBD, the default timeout is often too short for high-latency/low-bandwidth environments.
This results in frequent disconnect/reconnect cycles, especially over Tor/I2P, which are wasteful and slow down IBD.

This change uses `BLOCK_STALLING_TIMEOUT_MAX` for manual peers when detecting block download stalls, giving them a longer grace period before disconnection.
Non-manual peers keep the existing timeout and behavior.

}
// In case there is a block that has been in flight from this peer for block_interval * (1 + 0.5 * N)
// (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout.
// (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, since you're already touching it, consider updating for grammatical correctness

Comment on lines +5815 to +5820
if (pto->IsManualConn()) {
LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
} else {
LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
pto->fDisconnect = true;
}
Copy link
Contributor

@l0rinc l0rinc Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're logging very similarly in both cases, most of the message and params are duplicated, I think it would help if we extracted the differences in code instead of heaving to compile both in our heads, e.g. something like:

Suggested change
if (pto->IsManualConn()) {
LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
} else {
LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
pto->fDisconnect = true;
}
LogInfo("Timeout downloading block %s (%s)",
queuedBlock.pindex->GetBlockHash().ToString(),
pto->IsManualConn() ? "addnode peer, keeping" : pto->DisconnectMsg(fLogIPs));
pto->fDisconnect = !pto->IsManualConn();

Note: that the fDisconnect value is also simpler if we assign it unconditionally


Or if @mzumsande is correct and this is meant to be done in the condition above, consider:

LogInfo("Peer is stalling block download (%s)\n",
        pto->IsManualConn() ? "addnode peer, keeping" : pto->DisconnectMsg(fLogIPs));
pto->fDisconnect = !pto->IsManualConn();

Comment on lines +5829 to +5830
// Disconnect a peer (if it is neither an addnode peer, nor has
// NetPermissionFlags::NoBan permission) if it is our only sync peer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this nesting quite confusing now, consider something like:

Suggested change
// Disconnect a peer (if it is neither an addnode peer, nor has
// NetPermissionFlags::NoBan permission) if it is our only sync peer
// If this peer is our only active sync peer, and we have other peers
// available for syncing, disconnect it - unless it is an addnode peer
// or has NetPermissionFlags::NoBan permission.

static RPCHelpMan addnode()
{
return RPCHelpMan{"addnode",
"\nAttempts to add or remove a node from the addnode list.\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a rebase

"Or try a connection to a node once.\n"
"Nodes added using addnode (or -connect) are protected from DoS disconnection and are not required to be\n"
"Nodes added using addnode (or -connect) are protected from disconnection due to DoS or IBD header/block\n"
"download timeouts (and given more time before considered to be stalling), and are not required to be\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"download timeouts (and given more time before considered to be stalling), and are not required to be\n"
"download timeouts (and are given more time before being considered stalled), and are not required to be\n"

@glozow
Copy link
Member

glozow commented Sep 23, 2025

Yes, I plan to update this as mentioned at the last meeting: https://www.erisian.com.au/bitcoin-core-dev/log-2025-08-07.html#l-302

@jonatack I'm going to mark this as draft since it's been a while. Please mark as ready for review whenever you get the chance to address the comments.

@glozow glozow marked this pull request as draft September 23, 2025 18:46
@jonatack
Copy link
Member Author

jonatack commented Oct 22, 2025

I suppose I need to convince myself that this might be merged, to work further on it. I'll re-open if I do.

@jonatack jonatack closed this Oct 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.