p2p: protect addnode peers during IBD #32051
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Code Coverage & Benchmarks: for details see https://corecheck.dev/bitcoin/bitcoin/pulls/32051.
Reviews: see the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
[23:32:25.656] /ci_container_base/src/rpc/net.cpp:310:125: error: invalid operands to binary expression ('const char[198]' and 'const char[196]')
[23:32:25.656] "Nodes added using addnode (or -connect) are protected from disconnection due to DoS or IBD header/block\n" +
[23:32:25.656] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
[23:32:25.656] 1 error generated.
Force-pushed from 7a2f7d7 to ef46dc7
🚧 At least one of the CI tasks failed. Hints: try to run the tests locally, according to the documentation. However, a CI failure may still happen for reasons unrelated to this change.
Leave a comment here, if you need help tracking down a confusing failure.
Force-pushed from ef46dc7 to 93b0799
Ready for review.
}
// In case there is a block that has been in flight from this peer for block_interval * (1 + 0.5 * N)
// (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout.
// (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout
-// (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout
+// (with N being the number of peers from which we're downloading validated blocks), disconnect due to timeout
Agree, since you're already touching it, consider updating for grammatical correctness
mzumsande
left a comment
I think the main conceptual question is: what should happen in a stalling situation where the addnode peer is actually slow and causing congestion?
While we have more trust in manual peers than automatic ones, I don't think that means the trust should be without bounds. Even trusted nodes can become unresponsive temporarily (e.g. very slow internet). I'm not sure if we should give one addnode peer the power to let our entire IBD come to a halt completely.
An alternative would be to not give these peers unlimited time but clear their block requests (so that these blocks can be requested from other peers) so that they wouldn't be disconnected but also wouldn't be able to stall us indefinitely.
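To make this alternative concrete, here is a minimal self-contained sketch (the types and names are illustrative only, not Bitcoin Core's internal structures): on a detected stall, a manual peer keeps its connection but its in-flight block assignments are handed back so other peers can be asked, while an automatic peer is disconnected as before.

```cpp
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct PeerState {
    bool is_manual{false};
    bool disconnect{false};
    std::vector<std::string> blocks_in_flight; // block hashes currently assigned to this peer
};

// On a detected stall: manual peers keep their connection but lose their
// in-flight assignments (returned so they can be re-queued for other peers);
// automatic peers are marked for disconnection as before.
std::vector<std::string> HandleStall(PeerState& peer)
{
    std::vector<std::string> requeue;
    if (peer.is_manual) {
        requeue = std::move(peer.blocks_in_flight);
        peer.blocks_in_flight.clear();
    } else {
        peer.disconnect = true;
    }
    return requeue;
}

int main()
{
    PeerState manual_peer{/*is_manual=*/true, /*disconnect=*/false, {"blockA", "blockB"}};
    for (const auto& hash : HandleStall(manual_peer)) {
        std::cout << "re-request " << hash << " from another peer\n";
    }
    std::cout << "manual peer disconnected: " << manual_peer.disconnect << '\n';
}
```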
if (pto->IsManualConn()) {
    LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
} else {
    LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
This is not the timeout specific for IBD stalling but the general one (which is generally longer). So we'd still disconnect for stalling (see log message "Peer is stalling block download") - is that on purpose / can you explain a bit more the reasoning behind this?
Maybe one could add yet another option to make manually specified nodes be treated with extra leniency?
IMHO: It could be nice if we could "score" peers during IBD, and scale the number of blocks we request from them at once according to that score. In that case, automatic peers with low scores would get dropped, and manual peers with low scores would just not get any blocks requested.
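As a rough illustration of that scoring idea (purely hypothetical, not existing Bitcoin Core code): the per-peer score is a crude recent delivery ratio, the per-round block budget scales with it, and a zero budget replaces disconnection for manual peers.

```cpp
#include <algorithm>
#include <cstdio>

struct PeerScore {
    bool is_manual{false};
    double blocks_delivered{0.0};
    double blocks_requested{1.0}; // avoid dividing by zero for a fresh peer
};

// Budget of blocks to request from this peer in the next round, capped at an
// arbitrary maximum of 16; the score is simply the recent delivery ratio.
int BlockBudget(const PeerScore& p)
{
    const double score = std::clamp(p.blocks_delivered / p.blocks_requested, 0.0, 1.0);
    return static_cast<int>(16 * score);
}

// Automatic peers with a zero budget would be dropped; manual peers never are,
// they simply get no block requests until their score recovers.
bool ShouldDisconnect(const PeerScore& p)
{
    return !p.is_manual && BlockBudget(p) == 0;
}

int main()
{
    PeerScore slow_manual{true, 0.0, 10.0};
    PeerScore slow_auto{false, 0.0, 10.0};
    std::printf("manual peer: budget=%d disconnect=%d\n", BlockBudget(slow_manual), ShouldDisconnect(slow_manual));
    std::printf("auto peer:   budget=%d disconnect=%d\n", BlockBudget(slow_auto), ShouldDisconnect(slow_auto));
}
```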
Appreciate the thoughtful feedback. I'll look at an update.
| LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs)); | ||
| pto->fDisconnect = true; | ||
| if (pto->IsManualConn()) { | ||
| LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs)); |
nit: I think the message can be a bit clearer, e.g., "Timeout downloading block %s from manual peer (addnode/connect), not disconnecting"
Concept ACK to improve this, but:
Would it be too difficult to request a given block from more than one peer concurrently? That is, if a block is requested from a peer and not received within some time, then request it from other(s) as well, but don't cancel the original request. Then we take the block from whoever delivers it first. Subsequent receptions of the block from others would be excess/redundant traffic, but I guess that is ok, if this technique is used carefully. Is this the same:
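A small self-contained sketch of that redundant-request idea (illustrative only; Bitcoin Core currently requests each block from a single peer at a time): a block may be in flight from several peers, the first delivery is accepted, and later copies are treated as redundant traffic and dropped.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>

using NodeId = int;

// block hash -> peers the block is currently requested from
std::map<std::string, std::set<NodeId>> g_in_flight;

void RequestBlock(const std::string& hash, NodeId peer)
{
    g_in_flight[hash].insert(peer); // earlier requests are NOT cancelled
    std::cout << "getdata " << hash << " -> peer " << peer << '\n';
}

// Returns true if this delivery is the one we keep; later copies from the
// other peers are considered redundant and ignored.
bool OnBlockReceived(const std::string& hash, NodeId peer)
{
    auto it = g_in_flight.find(hash);
    if (it == g_in_flight.end()) return false; // already accepted from someone else
    std::cout << "accepted " << hash << " from peer " << peer << '\n';
    g_in_flight.erase(it);
    return true;
}

int main()
{
    RequestBlock("blockA", 1);
    RequestBlock("blockA", 2);    // peer 1 was slow, ask peer 2 as well
    OnBlockReceived("blockA", 2); // first delivery wins
    OnBlockReceived("blockA", 1); // redundant, ignored
}
```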
pinheadmz
left a comment
    return true;
} else {
    LogInfo("Timeout downloading headers from noban peer, not %s\n", pto->DisconnectMsg(fLogIPs));
    LogInfo("Timeout downloading headers from %s peer, not %s\n", pto->IsManualConn() ? "addnode" : "noban", pto->DisconnectMsg(fLogIPs));
Nit: could you use NetPermissions::ToStrings(NetPermissionFlags flags) just to avoid hard coding things?
If we're considering a case where the user deliberately chooses specific peers for IBD for privacy or other reasons, don't we just let them have the slow sync they traded off for?
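To illustrate the nit above, here is a toy stand-in for deriving the label from the permission flags rather than hardcoding the strings (the enum and helpers below are hypothetical, not the real NetPermissions API):

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

enum class PermFlags : uint32_t { None = 0, NoBan = 1 << 0, Manual = 1 << 1 };

// Stand-in for something like NetPermissions::ToStrings(): turn the flag bits
// into human-readable labels instead of hardcoding "addnode"/"noban".
std::vector<std::string> FlagsToStrings(uint32_t flags)
{
    std::vector<std::string> out;
    if (flags & static_cast<uint32_t>(PermFlags::NoBan)) out.push_back("noban");
    if (flags & static_cast<uint32_t>(PermFlags::Manual)) out.push_back("addnode");
    return out;
}

std::string Join(const std::vector<std::string>& parts, const std::string& sep)
{
    std::string s;
    for (size_t i = 0; i < parts.size(); ++i) {
        if (i) s += sep;
        s += parts[i];
    }
    return s;
}

int main()
{
    const uint32_t flags = static_cast<uint32_t>(PermFlags::NoBan) | static_cast<uint32_t>(PermFlags::Manual);
    std::cout << "Timeout downloading headers from " << Join(FlagsToStrings(flags), "|")
              << " peer, not disconnecting\n";
}
```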
Concept ACK
This makes sense, as the PR tries to achieve a similar goal with initial headers sync timeouts in commit 64b956f - when a timeout occurs, the sync state is reset to allow retrying with other peers while preserving the connection, and the PR extends this behavior to also apply to addnode peers.
Concept ACK and Code Review ACK. I think the PR's increased timeout tolerance for addnode peers is valuable. Given my limited domain knowledge, I'd suggest test coverage on potential silent stalls to ensure blocks are efficiently re-requested.
I'd like to see this addressed before moving forward here.
It could be a
@achow101 re: slow preferred peer "causing congestion" -- wouldn't this only affect the single user who explicitly set the preferred peer? |
hodlinator
left a comment
Concept ACK 93b0799
| LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs)); | ||
| pto->fDisconnect = true; | ||
| if (pto->IsManualConn()) { | ||
| LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs)); |
Is there any risk that this log message becomes spammy if we keep the peer connected? Should we do something like reset state.m_downloading_since to the current time, clear state.vBlocksInFlight, or add another state field to throttle the log?
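A minimal sketch of the throttling idea (the field and helper names below are hypothetical; nothing like this exists in net_processing today): remember when we last warned about a given peer and suppress repeats inside a cooldown window.

```cpp
#include <chrono>
#include <iostream>

using namespace std::chrono_literals;

// Hypothetical per-peer field for throttling; purely illustrative of the
// "add another state field to throttle the log" suggestion.
struct PeerDownloadState {
    std::chrono::steady_clock::time_point last_timeout_log{}; // default value means "never logged"
};

// Log the timeout at most once per cooldown window per peer.
bool MaybeLogTimeout(PeerDownloadState& state,
                     std::chrono::steady_clock::time_point now,
                     std::chrono::seconds cooldown = 600s)
{
    if (state.last_timeout_log != std::chrono::steady_clock::time_point{} &&
        now - state.last_timeout_log < cooldown) {
        return false; // still inside the cooldown window, stay quiet
    }
    state.last_timeout_log = now;
    std::cout << "Timeout downloading block from addnode peer, not disconnecting\n";
    return true;
}

int main()
{
    PeerDownloadState state;
    const auto t0 = std::chrono::steady_clock::now();
    MaybeLogTimeout(state, t0);         // logged
    MaybeLogTimeout(state, t0 + 60s);   // suppressed
    MaybeLogTimeout(state, t0 + 700s);  // logged again
}
```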
While it could still be spammy, #32604 should mitigate it somewhat.
Yes, however I have seen people post suggestions on various forums of specific nodes to addnode, and users may be addnode'ing such nodes without realizing the potential adverse consequences of doing so. Generally, I would prefer that things changing how we treat nodes be limited to the whitelist options, rather than addnode also implying that those nodes will get specific special treatment.
61e800e test: headers sync timeout (stringintech)

Pull request description:

When reviewing PR #32051 and considering which functional tests might need to be adapted/extended accordingly, I noticed there appears to be limited functional test coverage for header sync timeouts and thought it might be helpful to add one. This test attempts to cover two scenarios:

1. **Normal peer timeout behavior:** When a peer fails to respond to initial getheaders requests within the timeout period, it should be disconnected and the node should attempt to sync headers from the next available peer.
2. **Noban peer behavior:** When a peer with noban privileges times out, it should remain connected while the node still attempts to sync headers from other peers.

ACKs for top commit:
maflcko: re-ACK 61e800e 🗝
stratospher: reACK 61e800e.

Tree-SHA512: b8a867e7986b6f0aa00d81a84b205f2bf8fb2e6047a2e37272e0244229d1f43020e9031467827dabbfe7849a91429f2685e00a25356e2ed477fa1a035fa0b1fd
What's the status / are you still working on this? It doesn't look like #32051 (comment) has been addressed.
Yes, I plan to update this as mentioned at the last meeting: https://www.erisian.com.au/bitcoin-core-dev/log-2025-08-07.html#l-302
l0rinc
left a comment
Concept ACK
| LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs)); | ||
| pto->fDisconnect = true; | ||
| if (pto->IsManualConn()) { | ||
| LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs)); |
While it could still be spammy, #32604 should mitigate it somewhat.
// Detect whether we're stalling
auto stalling_timeout = m_block_stalling_timeout.load();
if (state.m_stalling_since.count() && state.m_stalling_since < current_time - stalling_timeout) {
    // Allow more time for addnode peers
I find it a bit confusing; I'd prefer separating the stall detection logic from the action we take. Peers are stalling regardless of whether they're added manually or not, so the stalling logic should be unaffected.
I think we should always log when a stall is detected, even for manual peers, so operators can see what’s happening:
- for automatic peers, disconnect as usual;
- for manual peers, only disconnect if they exceed BLOCK_STALLING_TIMEOUT_MAX - otherwise keep them connected.
That way the logs clearly show tolerated stalls, and the policy difference is obvious without blending it into the timeout calculation, something like:
diff --git a/src/net_processing.cpp b/src/net_processing.cpp
--- a/src/net_processing.cpp (revision 8405fdb06e8ff3220590bfac84e98547067ec6b7)
+++ b/src/net_processing.cpp (date 1755201170971)
@@ -5814,14 +5814,16 @@
// the download window should be much larger than the to-be-downloaded set of blocks, so disconnection
// should only happen during initial block download.
LogInfo("Peer is stalling block download, %s\n", pto->DisconnectMsg(fLogIPs));
- pto->fDisconnect = true;
- // Increase timeout for the next peer so that we don't disconnect multiple peers if our own
- // bandwidth is insufficient.
- const auto new_timeout = std::min(2 * stalling_timeout, BLOCK_STALLING_TIMEOUT_MAX);
- if (stalling_timeout != new_timeout && m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout)) {
- LogDebug(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", count_seconds(new_timeout));
- }
- return true;
+ if (!pto->IsManualConn() || state.m_stalling_since < current_time - BLOCK_STALLING_TIMEOUT_MAX) { // Allow more time for addnode peers
+ pto->fDisconnect = true;
+ // Increase timeout for the next peer so that we don't disconnect multiple peers if our own
+ // bandwidth is insufficient.
+ const auto new_timeout = std::min(2 * stalling_timeout, BLOCK_STALLING_TIMEOUT_MAX);
+ if (stalling_timeout != new_timeout && m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout)) {
+ LogDebug(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", count_seconds(new_timeout));
+ }
+ return true;
+ }
}
// In case there is a block that has been in flight from this peer for block_interval * (1 + 0.5 * N)
// (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout.
I would also prefer seeing the motivation explained in the commit message for this change, maybe something like:
p2p: allow BLOCK_STALLING_TIMEOUT_MAX for addnode peers
When a manually added peer stalls during IBD, the default timeout is often too short for high-latency/low-bandwidth environments.
This results in frequent disconnect/reconnect cycles, especially over Tor/I2P, which are wasteful and slow down IBD.
This change uses `BLOCK_STALLING_TIMEOUT_MAX` for manual peers when detecting block download stalls, giving them a longer grace period before disconnection.
Non-manual peers keep the existing timeout and behavior.
}
// In case there is a block that has been in flight from this peer for block_interval * (1 + 0.5 * N)
// (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout.
// (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout
Agree, since you're already touching it, consider updating for grammatical correctness
if (pto->IsManualConn()) {
    LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
} else {
    LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
    pto->fDisconnect = true;
}
We're logging very similarly in both cases; most of the message and params are duplicated. I think it would help if we extracted the differences in code instead of having to compile both in our heads, e.g. something like:
-if (pto->IsManualConn()) {
-    LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
-} else {
-    LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
-    pto->fDisconnect = true;
-}
+LogInfo("Timeout downloading block %s (%s)",
+        queuedBlock.pindex->GetBlockHash().ToString(),
+        pto->IsManualConn() ? "addnode peer, keeping" : pto->DisconnectMsg(fLogIPs));
+pto->fDisconnect = !pto->IsManualConn();
Note that the fDisconnect value is also simpler if we assign it unconditionally.
Or if @mzumsande is correct and this is meant to be done in the condition above, consider:
LogInfo("Peer is stalling block download (%s)\n",
pto->IsManualConn() ? "addnode peer, keeping" : pto->DisconnectMsg(fLogIPs));
pto->fDisconnect = !pto->IsManualConn();

// Disconnect a peer (if it is neither an addnode peer, nor has
// NetPermissionFlags::NoBan permission) if it is our only sync peer
I find this nesting quite confusing now, consider something like:
-// Disconnect a peer (if it is neither an addnode peer, nor has
-// NetPermissionFlags::NoBan permission) if it is our only sync peer
+// If this peer is our only active sync peer, and we have other peers
+// available for syncing, disconnect it - unless it is an addnode peer
+// or has NetPermissionFlags::NoBan permission.
static RPCHelpMan addnode()
{
    return RPCHelpMan{"addnode",
        "\nAttempts to add or remove a node from the addnode list.\n"
This needs a rebase
| "Or try a connection to a node once.\n" | ||
| "Nodes added using addnode (or -connect) are protected from DoS disconnection and are not required to be\n" | ||
| "Nodes added using addnode (or -connect) are protected from disconnection due to DoS or IBD header/block\n" | ||
| "download timeouts (and given more time before considered to be stalling), and are not required to be\n" |
| "download timeouts (and given more time before considered to be stalling), and are not required to be\n" | |
| "download timeouts (and are given more time before being considered stalled), and are not required to be\n" |
@jonatack I'm going to mark this as draft since it's been a while. Please mark as ready for review whenever you get the chance to address the comments.
I suppose I need to convince myself that this might be merged, to work further on it. I'll re-open if I do. |
When an addnode peer is disconnected by our IBD headers/blocks download timeout or stalling logic,
ThreadOpenAddedConnections attempts to immediately reconnect it (unless "onetry" was passed to the addnode RPC), up to the limit of 8 addnode connections, which is separate from the regular peer connection limits. During IBD in a low-bandwidth environment, I see these disconnections+reconnections very frequently with high-quality addnode peers; see https://www.erisian.com.au/bitcoin-core-dev/log-2025-01-22.html#l-85. The logging for these disconnections was: "Peer is stalling block download, disconnecting ." The ping times of the affected peers were often 20-100 seconds.
After IBD, I continue to see the following addnode peer disconnections, albeit much less frequently: "Timeout downloading block , disconnecting ."
This is probably network-, resource- and time-intensive to an extent, particularly in the case of I2P peers, where each peer connection involves destroying and rebuilding 2 tunnels, and is worth avoiding where it is simple to do so.
Automatic (non-manually added) peers are also disconnected, but they are a different category and case (non-trusted, non-protected peers, no immediate reconnection) that would require monitoring over time to adjust the timeouts accordingly; mzumsande was looking into this (see https://www.erisian.com.au/bitcoin-core-dev/log-2025-01-22.html#l-105).
The goal of this pull is therefore to avoid unnecessary disconnections/immediate reconnections of addnode peers.