Skip to content

refreshes ContactInfo.outset before initializing validator#3135

Merged
behzadnouri merged 1 commit intoanza-xyz:v1.18from
behzadnouri:patch-contact-info-outset-v1.18
Oct 11, 2024
Merged

refreshes ContactInfo.outset before initializing validator#3135
behzadnouri merged 1 commit intoanza-xyz:v1.18from
behzadnouri:patch-contact-info-outset-v1.18

Conversation

@behzadnouri
Copy link
Copy Markdown

Problem

Nodes join gossip during bootstrap process with a stub contact-info which in particular has invalid TVU socket address.

Once the bootstrap is done they re-join gossip a 2nd time with a fully populated contact-info, but this contact-info has an outset timestamp older than the 1st one because it was initiated earlier.

In v2.0, the outset timestamp determines which contact-info overrides the other, so the v2.0 nodes refrain from updating their CRDS table with the fully initialized contact-info.

Summary of Changes

The commit refreshes ContactInfo.outset before initializing validator so that it overrides the one pushed to gossip by the bootstrap stage.

@behzadnouri behzadnouri requested a review from a team as a code owner October 10, 2024 19:16
Nodes join gossip during bootstrap process with a stub contact-info
which in particular has invalid TVU socket address.

Once the bootstrap is done they re-join gossip a 2nd time with a fully
populated contact-info, but this contact-info has an outset timestamp
older than the 1st one because it was initiated earlier.

In v2.0 the outset timestamp determines which contact-info overrides the
other, so the v2.0 nodes refrain from updating their CRDS table with the
fully initialized contact-info.

The commit refreshes ContactInfo.outset before initializing the
validator so that it overrides the one pushed to the gossip by the
bootstrap stage.
@behzadnouri behzadnouri force-pushed the patch-contact-info-outset-v1.18 branch from 8cc1c32 to 9a89a91 Compare October 10, 2024 19:20
@willhickey
Copy link
Copy Markdown

Hey @steviez @gregcusack, we'll need approval from a subject matter expert in addition to the backport reviewers.

@gregcusack
Copy link
Copy Markdown

curious what the testing process for this will be if we backport a commit to v1.18 which has been running on mainnet for some time now.

@behzadnouri
Copy link
Copy Markdown
Author

curious what the testing process for this will be if we backport a commit to v1.18 which has been running on mainnet for some time now.

I hope we get enough soak time on testnet and mainnet before proceeding with v2.0 upgrade.

@willhickey
Copy link
Copy Markdown

willhickey commented Oct 10, 2024

I hope we get enough soak time on testnet and mainnet before proceeding with v2.0 upgrade.

We can adjust the upgrade schedule as needed. How much time would you want this to run on each cluster?

The original plan was to do some quick upgrade/downgrade cycles on testnet, and then start ramping v2.0 on mainnet-beta. Something like this:

Testnet, over the course of ~2-4 days:

  1. Restart on v1.18
  2. Upgrade live to v2.0
  3. Downgrade live to v1.18
  4. Upgrade to v2.0

Mainnet-beta:

  1. Ask for volunteers to take 10% of stake to v2.0
  2. 1 week after 1: ask for volunteers to take 25% of stake to v2.0
  3. 1 week after 2: Recommend v2.0 for general adoption

If we publish a new v1.18 release with this change most mainnet-beta operators will ignore it unless we tell them it's a critical patch. As I understand it, this problem only manifests if there's a restart while the cluster is split v1.18 / v2.0. So we could recommend v1.18.26, for mainnet-beta, but only push it as a critical patch if we happen to have a restart during the upgrade. Is that correct?

@behzadnouri
Copy link
Copy Markdown
Author

Without this patch, if some v1.18 node decides to restart and runs the bootstrap code, then v2.0 nodes will pick up its stub contact-info from bootstrap and stick to it, in which case the node will be effectively left out of cluster because the stub contact-info only has a gossip address.

So I would say it is still much safer if nodes upgrade to this code before more stake is running v2.0 branch.

@gregcusack
Copy link
Copy Markdown

been thinking if this is going to cause any issues for v1.18. I don't think it will.

v1.18 uses ContactInfo for startup in rpc_bootstrap(), but then really relies on LegacyContactInfo for communication with the rest of the cluster. It's really just sending and accepting ContactInfo in the background.

With the upgrade to v2, now v1.18 needs ContactInfo to communicate with v2, since v2 doesn't use LegacyContactInfo for much. v2 has an override check that v1.18 does not have that looks at ContactInfo.outset, which is why if a v2 node receives an outset older than the current one, the new value will fail to override.

Copy link
Copy Markdown

@gregcusack gregcusack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Copy link
Copy Markdown

@steviez steviez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just about identical to #2666 when you control for the addition of the helpers

@willhickey
Copy link
Copy Markdown

Devnet testing succeeded:
https://discord.com/channels/428295358100013066/1293704291437121537/1294175307162451988

The v2.0 version of this patch (#2681) went into v2.0.7 which was announced for testnet on 2024-08-25, and then used for the testnet restart on 2024-08-26

@anza-xyz/backport-reviewers Any other testing you'd like to see here?

Copy link
Copy Markdown

@t-nelson t-nelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as per discord discussion, lgtm

@behzadnouri behzadnouri merged commit c2b3500 into anza-xyz:v1.18 Oct 11, 2024
@behzadnouri behzadnouri deleted the patch-contact-info-outset-v1.18 branch October 11, 2024 16:31
nibty added a commit to x1-labs/tachyon that referenced this pull request Nov 26, 2024
@anza-xyz anza-xyz deleted a comment from YASSINE-cmd10 Dec 14, 2024
Gabo139

This comment was marked as spam.

@anza-xyz anza-xyz deleted a comment Feb 10, 2025
@anza-xyz anza-xyz deleted a comment Feb 10, 2025
ndeaf53-glitch

This comment was marked as spam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants