llmq|init|test: Implement DKG data recovery / quorum verification vector sync #3964
Conversation
Just added 744c8d0 with the last force push. This is an issue which comes from #3953 where …
With this change it also validates that "secretKeyShare" is not in `quorum_info` if it's not expected to be in there. Before, this was basically just not tested.
And one more force push:
UdjinM6
left a comment
See below or maybe UdjinM6@16a59e0 instead of suggestions in src/llmq/quorums.cpp
```cpp
void CQuorumManager::TriggerQuorumDataRecoveryThreads(const CBlockIndex* pIndex) const
{
    if (!fMasternodeMode || !CLLMQUtils::QuorumDataRecoveryEnabled() || pIndex == nullptr) {
```
argument checks are probably the most expensive (and most likely to be okay) so imo they should be the last thing to check:
```diff
-    if (!fMasternodeMode || !CLLMQUtils::QuorumDataRecoveryEnabled() || pIndex == nullptr) {
+    if (!fMasternodeMode || pIndex == nullptr || !CLLMQUtils::QuorumDataRecoveryEnabled()) {
```
Hm actually, I guess since pIndex can't even be nullptr in this case (because UpdatedBlockTip won't ever be called with nullptr and I don't yet see a reason to trigger it from somewhere else) it should still be better how it is right now, no? That at least was my reason to put them in this order. Basically, with the suggested change you would always do a (trivial, but) redundant pIndex == nullptr check if QuorumDataRecoveryEnabled is false, and if it's true it doesn't matter, if I don't miss something?
I like 16a59e0 👍 Partially picked it in 00434a9, not the cache thread related stuff though, because that seemed unrelated to this PR. But imo it still makes sense in a separate PR; unless you both feel we should just push it in here, I can, let me know :D
UdjinM6
left a comment
utACK
UdjinM6
left a comment
utACK
This should probably be accessible via nodes connecting via qwatch, no? Additionally, could this be a DoS vector? What happens if I repeatedly request from the same node?
Can you explain a bit more what you mean here?
I'm not sure where your concerns are coming from, since this PR doesn't change any behaviour of the underlying P2P messages, it just uses them (meaning the same rules as in #3953 still apply). Can you explain your thoughts?
PastaPastaPasta
left a comment
I think I misunderstood what was going on :)
Now that I thought a bit more and drank more coffee, I think everything makes sense.
utACK
llmq|init|test: Implement DKG data recovery / quorum verification vector sync (dashpay#3964)

* llmq: Implement automated DKG recovery threads
* llmq: Implement quorum verification vector sync
* init: Validate quorum data recovery related command line parameter
* test: Add quorum_data_request_timeout_seconds in DashTestFramework
* test: Test quorum data recovery in feature_llmq_data_recovery.py
* test: Add feature_llmq_data_recovery.py to BASE_SCRIPTS
* test: Fix quorum_data_request_expiration_timeout in wait_for_quorum_data
* test: Always test the existence of secretKeyShare in test_mn_quorum_data

  With this change it also validates that "secretKeyShare" is not in `quorum_info` if it's not expected to be in there. Before, this was basically just not tested.

* llmq|test: Use bool as argument type for -llmq-data-recovery
* llmq: Always set nTimeLastSuccess to 0
* test: Set -llmq-data-recovery=0 in p2p_quorum_data.py
* test: Simplify test_mns
* refactor: pass CQuorumCPtr to StartQuorumDataRecoveryThread
* test: Fix thread name in comment

Co-authored-by: UdjinM6 <[email protected]>
This PR implements DKG data recovery in 4d54892127a9d139e1e3042e90c1c520030de396, which allows a member of `LLMQx` that loses its required DKG data (quorum verification vector and secret key share), e.g. due to unexpected loss/corruption of the database, to recover itself with the help of the other members of `LLMQx` and the P2P messages `QGETDATA`/`QDATA`. Before this PR such an invalid member would just stay invalid for the remaining "active time" of the LLMQ.

To solve a requirement of platform/tenderdash it also introduces quorum verification vector sync in eef2691defeb3e903db28eaf8afd247a7810961b, which makes it possible to configure a masternode to request the quorum verification vectors of all existing/new LLMQs of the type `llmqType` if the masternode itself is a member of any other active LLMQ of the type `llmqType`.

New command line parameters (hidden?):

- `-llmq-data-recovery` - Enable (`1`) / disable (`0`) automated data recovery / qvvec requests.
- `-llmq-qvvec-sync` - Defines from which LLMQ type the masternode should sync quorum verification vectors (can be used multiple times with different LLMQ types).
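For example, a masternode started with `-llmq-data-recovery=1 -llmq-qvvec-sync=llmq_100_67` (a purely illustrative combination of the two parameters above) keeps automated data recovery enabled and, under the membership condition just described, also syncs the verification vectors of `llmq_100_67` quorums.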
New thread `dash-q-recovery`:

This new thread can exist exactly once for each quorum (`llmqType`, `quorumHash`) and its job is to query the missing DKG data from the related quorum members. This is done in a loop: calculate the next masternode to try, connect to it, send the `QGETDATA` request and wait until the `QDATA` message comes in, with a timeout of 10 seconds before we try the next member. This continues until the required data has been received successfully or until all members of the quorum were asked without success.
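To make that loop concrete, here is a minimal standalone sketch of the control flow only, not the actual `CQuorumManager` code; the `Member` struct and the `RequestQuorumData()` helper are invented placeholders for "connect, send `QGETDATA`, wait up to 10 seconds for `QDATA`":

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

struct Member {
    std::string proTxHash; // masternode identity, used to sort the members
};

// Placeholder for "connect, send QGETDATA, wait up to `timeout` for QDATA".
// Returns true if the missing DKG data was received in time.
bool RequestQuorumData(const Member& m, std::chrono::seconds timeout)
{
    (void)timeout;
    std::cout << "QGETDATA -> " << m.proTxHash << '\n';
    return m.proTxHash == "c..."; // pretend only one member answers in time
}

// Try the quorum members one by one, starting at startOffset, until the data
// was received or every member has been asked once without success.
bool RunRecoveryLoop(std::vector<Member> members, size_t startOffset)
{
    std::sort(members.begin(), members.end(),
              [](const Member& a, const Member& b) { return a.proTxHash < b.proTxHash; });

    for (size_t i = 0; i < members.size(); ++i) {
        const Member& m = members[(startOffset + i) % members.size()];
        if (RequestQuorumData(m, std::chrono::seconds(10))) {
            return true; // got the qvvec (+ secret key share if requested)
        }
        // timeout or failure -> move on to the next member
    }
    return false;
}

int main()
{
    std::vector<Member> members{{"a..."}, {"b..."}, {"c..."}, {"d..."}};
    std::cout << (RunRecoveryLoop(members, 2) ? "recovered\n" : "failed\n");
}
```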
The recovery threads are started by `CQuorumManager::TriggerQuorumDataRecoveryThreads()` if required, which gets called with every block update in `CQuorumManager::UpdatedBlockTip()`.

The masternode decides, based on `CQuorum::GetQuorumRecoveryStartOffset()` and the sorted proTx hashes of all members of the quorum, which member of the quorum it tries to connect to first. It also sleeps `offset * 100ms` before it tries to connect to the next member. I implemented those two as an attempt to balance the incoming requests equally across the members of the target quorum. This means that if `-llmq-qvvec-sync=llmq_100_67` is used on all masternodes by platform and a new `llmq_100_67` quorum just formed, we will in the worst case have ~2400 nodes asking ~100 members for their quorum verification vectors, which means each member will then in the best case get 24 requests, and then all nodes should have the verification vector. Should be just good or do I miss something?

Let me know your thoughts!
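As a rough illustration of that balancing idea only (not the actual `CQuorum::GetQuorumRecoveryStartOffset()` implementation; the hash list, ranking and back-off below are simplified assumptions):

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Derive a start offset for this node from its rank among the sorted
// proTxHashes, wrapped to the quorum size, so that different nodes start
// their recovery attempts at different quorum members.
size_t GetStartOffset(const std::string& ownProTxHash,
                      std::vector<std::string> hashes, size_t quorumSize)
{
    std::sort(hashes.begin(), hashes.end());
    const auto it = std::lower_bound(hashes.begin(), hashes.end(), ownProTxHash);
    return static_cast<size_t>(it - hashes.begin()) % quorumSize;
}

int main()
{
    const size_t quorumSize = 100;
    const size_t offset = GetStartOffset("00ff...", {/* all member proTxHashes */}, quorumSize);

    // Back off offset * 100ms before contacting the next member so requests
    // from many nodes don't all hit the same member at the same time.
    std::this_thread::sleep_for(std::chrono::milliseconds(offset * 100));
}
```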
Based on #3953