refreshes ContactInfo.outset before initializing validator #3135
Nodes join gossip during the bootstrap process with a stub contact-info which, in particular, has an invalid TVU socket address. Once the bootstrap is done they re-join gossip a second time with a fully populated contact-info, but this contact-info has an outset timestamp older than the first one because it was initiated earlier. In v2.0 the outset timestamp determines which contact-info overrides the other, so the v2.0 nodes refrain from updating their CRDS table with the fully initialized contact-info. The commit refreshes ContactInfo.outset before initializing the validator so that it overrides the one pushed to gossip by the bootstrap stage.
8cc1c32 to 9a89a91
Hey @steviez @gregcusack, we'll need approval from a subject matter expert in addition to the backport reviewers.
Curious what the testing process for this will be if we backport a commit to v1.18, which has been running on mainnet for some time now.
I hope we get enough soak time on testnet and mainnet before proceeding with the v2.0 upgrade.
We can adjust the upgrade schedule as needed. How much time would you want this to run on each cluster? The original plan was to do some quick upgrade/downgrade cycles on testnet, and then start ramping v2.0 on mainnet-beta. Something like this:
Testnet, over the course of ~2-4 days:
Mainnet-beta:
If we publish a new v1.18 release with this change, most mainnet-beta operators will ignore it unless we tell them it's a critical patch. As I understand it, this problem only manifests if there's a restart while the cluster is split v1.18 / v2.0. So we could recommend v1.18.26 for mainnet-beta, but only push it as a critical patch if we happen to have a restart during the upgrade. Is that correct?
Without this patch, if some v1.18 node restarts and runs the bootstrap code, then v2.0 nodes will pick up its stub contact-info from bootstrap and stick to it. In that case the node will effectively be left out of the cluster, because the stub contact-info only has a gossip address. So I would say it is still much safer if nodes upgrade to this code before more stake is running the v2.0 branch.
been thinking if this is going to cause any issues for v1.18. I don't think it will. v1.18 uses With the upgrade to v2, now v1.18 needs
Devnet testing succeeded:
The v2.0 version of this patch (#2681) went into v2.0.7, which was announced for testnet on 2024-08-25 and then used for the testnet restart on 2024-08-26.
@anza-xyz/backport-reviewers Any other testing you'd like to see here?
…nza-xyz#3135)" This reverts commit c2b3500.
Problem
Nodes join gossip during the bootstrap process with a stub contact-info which, in particular, has an invalid TVU socket address.
Once the bootstrap is done they re-join gossip a 2nd time with a fully populated contact-info, but this contact-info has an outset timestamp older than the 1st one because it was initiated earlier.
In v2.0, the outset timestamp determines which contact-info overrides the other, so the v2.0 nodes refrain from updating their CRDS table with the fully initialized contact-info.
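The failure mode described above can be reproduced with a small model. The sketch below is illustrative only: `ContactInfoSketch`, `overrides`, and `upsert` are hypothetical names and not the actual agave types or CRDS API, and the rule "newer outset wins" is a simplification of whatever comparison v2.0 really performs.

```rust
use std::net::SocketAddr;

/// Simplified stand-in for a gossip contact-info record.
/// Hypothetical type for illustration, not the real agave struct.
#[derive(Clone, Debug)]
pub struct ContactInfoSketch {
    /// Timestamp recorded when this instance was first created.
    pub outset: u64,
    pub gossip: SocketAddr,
    /// `None` models the invalid TVU address of the bootstrap stub.
    pub tvu: Option<SocketAddr>,
}

/// Assumed override rule: a candidate replaces the stored record
/// only if its outset is strictly newer.
pub fn overrides(candidate: &ContactInfoSketch, existing: &ContactInfoSketch) -> bool {
    candidate.outset > existing.outset
}

/// Applies the rule to a single table entry and reports whether
/// the entry was replaced.
pub fn upsert(entry: &mut ContactInfoSketch, candidate: ContactInfoSketch) -> bool {
    if overrides(&candidate, entry) {
        *entry = candidate;
        true
    } else {
        false
    }
}
```

With this model, a stub created at outset 200 is never replaced by a fully populated record created at outset 100, so the table keeps the entry without a TVU address.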
Summary of Changes
The commit refreshes ContactInfo.outset before initializing the validator so that it overrides the one pushed to gossip by the bootstrap stage.
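As a rough illustration of the fix, the sketch below re-stamps a record's outset with the current time just before it would be published. `NodeInfoSketch`, `now_micros`, and `refresh_outset` are hypothetical names chosen for this example, and the microsecond unit is an assumption; the real change lives in the validator startup path and may differ in detail.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical, minimal record carrying only what the example needs.
pub struct NodeInfoSketch {
    /// Creation timestamp (assumed here to be microseconds since UNIX_EPOCH).
    pub outset: u64,
    pub fully_populated: bool,
}

/// Current wall-clock time in microseconds since UNIX_EPOCH.
fn now_micros() -> u64 {
    let elapsed = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before UNIX_EPOCH");
    elapsed.as_secs() * 1_000_000 + u64::from(elapsed.subsec_micros())
}

impl NodeInfoSketch {
    /// Re-stamps the record so it compares as newer than anything
    /// created earlier, such as the stub pushed during bootstrap.
    pub fn refresh_outset(&mut self) {
        self.outset = now_micros();
    }
}
```

Calling `refresh_outset` on the fully populated record right before the validator starts gives it a later outset than the bootstrap stub, so peers that compare by outset accept it.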