llmq|init|test: Implement DKG data recovery / quorum verification vector sync #3964
Conversation
Just added 744c8d0 with the last force push. This is an issue which comes from #3953 where …
With this change it also validates that "secretKeyShare" is not in `quorum_info` if it's not expected to be in there. Before, this was basically just not tested.
And one more force push:
UdjinM6
left a comment
See below or maybe UdjinM6@16a59e0 instead of suggestions in src/llmq/quorums.cpp
```cpp
void CQuorumManager::TriggerQuorumDataRecoveryThreads(const CBlockIndex* pIndex) const
{
    if (!fMasternodeMode || !CLLMQUtils::QuorumDataRecoveryEnabled() || pIndex == nullptr) {
```
argument checks are probably the most expensive (and most likely to be okay) so imo they should be the last thing to check:
```diff
-    if (!fMasternodeMode || !CLLMQUtils::QuorumDataRecoveryEnabled() || pIndex == nullptr) {
+    if (!fMasternodeMode || pIndex == nullptr || !CLLMQUtils::QuorumDataRecoveryEnabled()) {
```
Hm actually, I guess since pIndex can't even be nullptr in this case (because UpdatedBlockTip won't ever be called with nullptr and I don't yet see a reason to trigger it from somewhere else) it should still be better how it is right now, no? That at least was my reason to put them in this order. Basically, with the suggested change you would always do a (trivial, but) redundant pIndex == nullptr check if QuorumDataRecoveryEnabled is false, and if it's true it doesn't matter, if I don't miss something?
I like 16a59e0 👍 Partially picked it in 00434a9, not the cache thread related stuff though, because that seemed unrelated to this PR. But imo it still makes sense in a separate PR; unless you both feel we should just push it in here, I can, let me know :D
UdjinM6
left a comment
utACK
UdjinM6
left a comment
utACK
This should probably be accessible via nodes connecting via qwatch, no? Additionally, could this be a DoS vector? What happens if I repeatedly request from the same node?
Can you explain a bit more what you mean here?
I'm not sure where your concerns are coming from, since this PR doesn't change any behaviour of the underlying P2P messages, it just uses them (meaning the same rules as in #3953 still apply). Can you explain your thoughts?
PastaPastaPasta
left a comment
I think I misunderstood what was going on :)
Now that I thought a bit more and drank more coffee, I think everything makes sense.
utACK
llmq|init|test: Implement DKG data recovery / quorum verification vector sync (dashpay#3964)

* llmq: Implement automated DKG recovery threads
* llmq: Implement quorum verification vector sync
* init: Validate quorum data recovery related command line parameter
* test: Add quorum_data_request_timeout_seconds in DashTestFramework
* test: Test quorum data recovery in feature_llmq_data_recovery.py
* test: Add feature_llmq_data_recovery.py to BASE_SCRIPTS
* test: Fix quorum_data_request_expiration_timeout in wait_for_quorum_data
* test: Always test the existence of secretKeyShare in test_mn_quorum_data

  With this change it also validates that "secretKeyShare" is not in `quorum_info` if it's not expected to be in there. Before, this was basically just not tested.

* llmq|test: Use bool as argument type for -llmq-data-recovery
* llmq: Always set nTimeLastSuccess to 0
* test: Set -llmq-data-recovery=0 in p2p_quorum_data.py
* test: Simplify test_mns
* refactor: pass CQuorumCPtr to StartQuorumDataRecoveryThread
* test: Fix thread name in comment

Co-authored-by: UdjinM6 <[email protected]>
This PR implements DKG data recovery in 4d54892127a9d139e1e3042e90c1c520030de396, which allows a member of `LLMQx` that loses its required DKG data (quorum verification vector and secret key share), e.g. due to unexpected loss/corruption of the database, to recover itself with the help of the other members of `LLMQx` and the P2P messages `QGETDATA`/`QDATA`. Before this PR such an invalid member would just stay invalid for the remaining "active time" of the LLMQ.

To solve a requirement of platform/tenderdash it also introduces quorum verification vector sync in eef2691defeb3e903db28eaf8afd247a7810961b, which makes it possible to configure a masternode to request the quorum verification vectors of all existing/new LLMQs of the type `llmqType` if the masternode itself is a member of any other active LLMQ of the type `llmqType`.

New command line parameters (hidden?):

- `-llmq-data-recovery` - Enable (`1`) / disable (`0`) automated data recovery / qvvec requests.
- `-llmq-qvvec-sync` - Defines from which LLMQ type the masternode should sync quorum verification vectors (can be used multiple times with different LLMQ types).
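For example, a masternode started with `-llmq-data-recovery=1 -llmq-qvvec-sync=llmq_100_67` (a purely illustrative combination of the two parameters above) keeps automated data recovery enabled and, under the membership condition just described, also syncs the verification vectors of `llmq_100_67` quorums.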
New thread `dash-q-recovery`:

This new thread can exist exactly once for each quorum (`llmqType`, `quorumHash`) and its job is to query the missing DKG data from the related quorum members. This is done in a loop: calculate the next masternode to try, connect to it, send the `QGETDATA` request and wait until the `QDATA` message comes in, with a timeout of 10 seconds before we try the next member. This continues until the required data has been received successfully or until all members of the quorum were asked without success.
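To make that loop concrete, here is a minimal standalone sketch of the control flow only, not the actual `CQuorumManager` code; the `Member` struct and the `RequestQuorumData()` helper are invented placeholders for "connect, send `QGETDATA`, wait up to 10 seconds for `QDATA`":

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

struct Member {
    std::string proTxHash; // masternode identity, used to sort the members
};

// Placeholder for "connect, send QGETDATA, wait up to `timeout` for QDATA".
// Returns true if the missing DKG data was received in time.
bool RequestQuorumData(const Member& m, std::chrono::seconds timeout)
{
    (void)timeout;
    std::cout << "QGETDATA -> " << m.proTxHash << '\n';
    return m.proTxHash == "c..."; // pretend only one member answers in time
}

// Try the quorum members one by one, starting at startOffset, until the data
// was received or every member has been asked once without success.
bool RunRecoveryLoop(std::vector<Member> members, size_t startOffset)
{
    std::sort(members.begin(), members.end(),
              [](const Member& a, const Member& b) { return a.proTxHash < b.proTxHash; });

    for (size_t i = 0; i < members.size(); ++i) {
        const Member& m = members[(startOffset + i) % members.size()];
        if (RequestQuorumData(m, std::chrono::seconds(10))) {
            return true; // got the qvvec (+ secret key share if requested)
        }
        // timeout or failure -> move on to the next member
    }
    return false;
}

int main()
{
    std::vector<Member> members{{"a..."}, {"b..."}, {"c..."}, {"d..."}};
    std::cout << (RunRecoveryLoop(members, 2) ? "recovered\n" : "failed\n");
}
```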
The recovery threads are started by `CQuorumManager::TriggerQuorumDataRecoveryThreads()` if required, which gets called with every block update in `CQuorumManager::UpdatedBlockTip()`.

The masternode decides, based on `CQuorum::GetQuorumRecoveryStartOffset()` and the sorted proTx hashes of all members of the quorum, which member of the quorum it tries to connect to first. It also sleeps `offset * 100ms` before it tries to connect to the next member. I implemented those two as an attempt to balance the incoming requests equally across the members of the target quorum. This means that if `-llmq-qvvec-sync=llmq_100_67` is used on all masternodes by platform and a new `llmq_100_67` quorum just formed, we will in the worst case have ~2400 nodes asking ~100 members for their quorum verification vectors, which means each member will then in the best case get 24 requests, and then all nodes should have the verification vector. Should be just good or do I miss something?

Let me know your thoughts!
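As a rough illustration of that balancing idea only (not the actual `CQuorum::GetQuorumRecoveryStartOffset()` implementation; the hash list, ranking and back-off below are simplified assumptions):

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Derive a start offset for this node from its rank among the sorted
// proTxHashes, wrapped to the quorum size, so that different nodes start
// their recovery attempts at different quorum members.
size_t GetStartOffset(const std::string& ownProTxHash,
                      std::vector<std::string> hashes, size_t quorumSize)
{
    std::sort(hashes.begin(), hashes.end());
    const auto it = std::lower_bound(hashes.begin(), hashes.end(), ownProTxHash);
    return static_cast<size_t>(it - hashes.begin()) % quorumSize;
}

int main()
{
    const size_t quorumSize = 100;
    const size_t offset = GetStartOffset("00ff...", {/* all member proTxHashes */}, quorumSize);

    // Back off offset * 100ms before contacting the next member so requests
    // from many nodes don't all hit the same member at the same time.
    std::this_thread::sleep_for(std::chrono::milliseconds(offset * 100));
}
```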
Based on #3953