Healthcheck for zero ingress connection count#3719
Merged
StephenButtolph merged 6 commits intoava-labs:masterfrom Mar 12, 2025
Merged
Healthcheck for zero ingress connection count#3719StephenButtolph merged 6 commits intoava-labs:masterfrom
StephenButtolph merged 6 commits intoava-labs:masterfrom
Conversation
a5a21ff to
b8ad75a
Compare
de6b79f to
d7fdf19
Compare
Contributor
tsachiherman
left a comment
There was a problem hiding this comment.
Overall, looks great. I've added few comments.
Contributor
Author
Thanks, addressed/responded to your comments. |
Contributor
|
Thanks. Will re-review shortly.
…On Tue, Feb 18, 2025, 2:31 PM yacovm ***@***.***> wrote:
Overall, looks great. I've added few comments.
Thanks, addressed/responded to your comments.
—
Reply to this email directly, view it on GitHub
<#3719 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AF2OOHY34J6BCLZ6OD4ZA532QODB7AVCNFSM6AAAAABXDA2QMCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNRWG42DSNJRG4>
.
You are receiving this because you commented.Message ID:
***@***.***>
[image: yacovm]*yacovm* left a comment (ava-labs/avalanchego#3719)
<#3719 (comment)>
Overall, looks great. I've added few comments.
Thanks, addressed/responded to your comments.
—
Reply to this email directly, view it on GitHub
<#3719 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AF2OOHY34J6BCLZ6OD4ZA532QODB7AVCNFSM6AAAAABXDA2QMCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNRWG42DSNJRG4>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
tsachiherman
approved these changes
Feb 19, 2025
ceyonur
reviewed
Feb 24, 2025
maru-ava
approved these changes
Feb 26, 2025
cd6d014 to
a52673e
Compare
maru-ava
approved these changes
Feb 27, 2025
d820099 to
2005233
Compare
Contributor
Author
|
Parking this as a draft before I manually test it again after this code change |
4169d09 to
1cf5056
Compare
Contributor
Author
|
@StephenButtolph the PR is ready again for re-review. |
ff2344e to
d1b4d92
Compare
This commit adds a healthcheck that fails if: - The node is a validator of the primary network - The node has zero ingress connections - Enough time (defaults to 20 min) has passed since the node finished bootstrapping Signed-off-by: Yacov Manevich <[email protected]>
Signed-off-by: Yacov Manevich <[email protected]>
Signed-off-by: Yacov Manevich <[email protected]>
Signed-off-by: Yacov Manevich <[email protected]>
Signed-off-by: Yacov Manevich <[email protected]>
cf8c3d6 to
9c9ca56
Compare
Signed-off-by: Yacov Manevich <[email protected]>
StephenButtolph
approved these changes
Mar 12, 2025
cam-schultz
pushed a commit
that referenced
this pull request
Mar 24, 2025
Signed-off-by: Yacov Manevich <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why this should be merged
This PR adds a new health check that monitors if the node is a primary network validator and has no node connecting to it.
Our current health checks only monitor whether the node is connected to enough stake. However, a node may be connected to enough stake by opening egress connections to all other validators, but also at the same time, not be reachable from any validator or non-validator node.
This health check does just that - it alerts when:
How this works
A node can establish a connection through two pathways:
We count the number of connections of the latter type via an atomic counter, and add a health check that checks the three conditions mentioned earlier.
How this was tested
Added unit tests and also did a manual test:
I launched a validator node on Fuji (testnet) and I added it to the validators:
I then caused all other nodes to disconnect from it by blocking port 9651:
After a short time, the health check started failing:
and probing the health check API showed it:
I then restored connectivity via deleting the
iptablesrule:and observed that the health check recovered:
Need to be documented in RELEASES.md?
Added a section about the health check added.