-
Notifications
You must be signed in to change notification settings - Fork 38.8k
net: replace manual reference counting of CNode with shared_ptr #32015
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code Coverage & BenchmarksFor details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32015. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
|
cc @willcl-ark, @dergoegge, @theuni |
|
cc @hodlinator, you might be interested in this as well |
Run p2p_headers_presync with args ['/Users/runner/work/bitcoin/bitcoin/ci/scratch/build-aarch64-apple-darwin23.6.0/src/test/fuzz/fuzz', PosixPath('/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync')]net_processing.cpp:295 SetTxRelay: Assertion `!m_tx_relay' failed.
Error processing input "/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync/c12f87158cda99b4fb2e1fe7312af392a1f4634a"
net_processing.cpp:295 SetTxRelay: Assertion `!m_tx_relay' failed.
Error processing input "/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync/c12f87158cda99b4fb2e1fe7312af392a1f4634a" |
src/net.cpp
Outdated
| // after m_nodes_mutex has been released. Destructing a node involves | ||
| // calling m_msgproc->FinalizeNode() which acquires cs_main. Lock order | ||
| // should be cs_main, m_nodes_mutex. | ||
| disconnected_nodes.push_back(node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that ~CNode is responsible for calling FinalizeNode, the guarantees around which thread does the actual cleanup are lost. It's basically just whichever thread ends up holding the last instance of a CNode. This seems bug prone.
E.g. there are no guarantees (afaict) that disconnected_nodes will actually be the last owner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that it does not matter which thread will destroy the CNode objects.
there are no guarantees (afaict) that
disconnected_nodeswill actually be the last owner.
That is right, but there is no need for such a guarantee. disconnected_nodes is to guarantee that the objects will not be destroyed while holding m_nodes_mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In thread #32015 (comment):
Maybe one could have an additional vector<shared_ptr<CNode>> member in CConnman beyond m_nodes that always keeps the shared_ptr::use_count() above 0. Then use removal from that vector (when use_count() == 1) for controlled deletion with enforced lock order, more like how the code was before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think that is necessary. I mean - if we would assume that the code is going to be modified in a careless way to break the logic, in other words to destroy CNode objects while holding m_nodes_mutex, then we might as well assume that the code could be modified in a way that breaks the manual reference counting. IMO the manual reference counting is more fragile and error prone.
Further, if the code is changed to destroy CNode objects while holding m_nodes_mutex, then 83 functional tests fail. I think it is safe to assume that such a thing will not go unnoticed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In thread #32015 (comment):
Started proving out my suggestion and landed on something closer to master.
- Resurrect
m_nodes_disconnected(now holdingshared_ptr) along with clearly defined lifetimes. - Resurrect
DeleteNode(), but make it take an r-valueshared_ptrand assert thatuse_count() == 1. - Avoid adding
~CNode()andCNode::m_destruct_cbalong with sending it in everywhere in the constructor. This means we don't need to touch 5 of the test files.
master...hodlinator:bitcoin:pr/32015_suggestion
Passes unit tests and functional tests.
|
🚧 At least one of the CI tasks failed. HintsTry to run the tests locally, according to the documentation. However, a CI failure may still
Leave a comment here, if you need help tracking down a confusing failure. |
|
|
hodlinator
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK 36d932cebbb235a79e90575960646e86a23d39a8
src/net.cpp
Outdated
| // after m_nodes_mutex has been released. Destructing a node involves | ||
| // calling m_msgproc->FinalizeNode() which acquires cs_main. Lock order | ||
| // should be cs_main, m_nodes_mutex. | ||
| disconnected_nodes.push_back(node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In thread #32015 (comment):
Maybe one could have an additional vector<shared_ptr<CNode>> member in CConnman beyond m_nodes that always keeps the shared_ptr::use_count() above 0. Then use removal from that vector (when use_count() == 1) for controlled deletion with enforced lock order, more like how the code was before?
36d932c to
af622d0
Compare
|
|
|
Instead of passing around shared pointers to mutable data (aka global variables in disguise), would it be possible to increase the level of abstraction and pass around values with regular semantics? Every C++ programmer should rewatch https://youtu.be/W2tWOdzgXHA from time to time. |
I remembered that as "the seminal
Guess one could have an |
|
After reviewing the code line by line, I realized that the only places where |
It is also called in the RAII helper
It's probably just there to have it "covered". |
Got it. |
67bc008 to
0cd86a5
Compare
|
|
0cd86a5 to
8218daa
Compare
|
|
8218daa to
d026d54
Compare
|
This is now a little bit simplified after #32326 was merged - no need to touch |
This is a non-functional change - the manual reference counting, disconnection, finalization and deletion are unchanged. `DeleteNode()` used to call `FinalizeNode()` + `delete` the raw pointer. Expand `DeleteNode()` at the call sites - call `FinalizeNode()` at the call sites and instead of raw `delete` either get the container vector go out of scope or explicitly call `clear()` on it.
…aries The other copies have a purpose for locking, these were just an awkward convenience for deleting while iterating. Co-authored-by: Vasil Dimov <[email protected]>
Subsequent commits will allow this to be shared with the message handling thread. Co-authored-by: Vasil Dimov <[email protected]>
Rather than relying on refcounting to abstract away our teardown procedure, make the rules explicit: - FinalizeNode() is only called when ProcessMessages() and SendMessages() is not running. This is enforced by finalizing only at the top of the message handler loop. - FinalizeNode() may only be called on nodes which have been disconnected, by virtue of finding them in m_nodes_disconnected. This enforces the order: disconnect -> finalize -> delete. This order is the current behavior but is not strictly necessary, and the restriction will be removed in a later commit. - Nodes are only deleted after they've been disconnected and finalized. This is enforced by deleting them from m_nodes_disconnected and only after calling FinalizeNode(). - Callers of ForNode() and ForEachNode() are currently protected from finalization or deletion due to their locking of m_nodes_mutex. Co-authored-by: Vasil Dimov <[email protected]>
Snapshots of `m_nodes` are not needed anymore because finalization and deletion of nodes only happens in `ThreadMessageHandler()` and only a after nodes have been disconnected and moved from `m_nodes` to `m_nodes_disconnected`. The manual reference counting becomes unused, remove it too. Co-authored-by: Cory Fields <[email protected]>
The shared_ptrs now ensure that the CNodes will not be deleted for the duration of the callbacks. Disconnection/Finalization may happen while the shared_ptr is being held, though it won't be deleted for the duration of the callback. The majority of uses are in ProcessMessages and SendMessages, during which disconnection may occur but FinalizeNode will not be called. Co-authored-by: Vasil Dimov <[email protected]>
This keeps us from waiting for the condvar timeout (currently 100ms) before finalizing a node that has been disconnected on the socket handler thread. Co-authored-by: Vasil Dimov <[email protected]>
d026d54 to
abc29c4
Compare
|
|
|
🐙 This pull request conflicts with the target branch and needs rebase. |
Before this change the code used to count references to
CNodeobjects manually viaCNode::nRefCount. UnneededCNodes were scheduled for deletion by putting them inCConnman::m_nodes_disconnectedand were deleted after their reference count reached zero. Deleting consists of callingPeerManager::FinalizeNode()and destroying theCNodeobject.Replace this scheme with
std::shared_ptr. This simplifies the code and removes:CNode::nRefCountCNode::GetRefCount()CNode::AddRef()CNode::Release()CConnman::m_nodes_disconnectedCConnman::NodesSnapshotNow creating a snapshot of
CConnman::m_nodesis done by simply copying it (under the mutex).Call
PeerManager::FinalizeNode()from the destructor ofCNode, which is called when the reference count reaches 0.This removes brittle code and replaces it by a code from the standard library which is more robust. All the below are possible with the manual reference counting and not possible with
shared_ptr:There is zero learning curve for the average C++ programmer who knows how
shared_ptrworks. Not so much with the manual reference counting because one has to learn aboutAddRef(),Release(), check and maintain where are they called and their relationship withm_nodes_disconnected.This has some history in #28222 and #10738. See below #32015 (comment) and #32015 (comment).
Possible followup: change
ProcessMessages()andSendMessages()to takeCNode&: #32015 (comment)