Skip to content

Conversation

@jimmygchen
Copy link
Member

@jimmygchen jimmygchen commented Aug 27, 2024

Issue Addressed

This PR addresses two issues:

  • Add validation to get_custody_columns to prevent infinite loop when supplied custody subnet count is invalid (> total number of data column subnets).
  • Refactor error handling of network_globals.custody_columns to the caller site (see @pawanjay176's comment here)
  • Store the node's custody subnets and columns on NetworkGlobals initialization rather than computing it every time.

Error handling

For node's custody_subnet_count from local metadata:

  • if custody subnet count is not set or invalid after PeerDAS is enabled, we just panic on startup - this indicates a bug in the code, and we should detect this immediately on testing

For peer's custody_subnet_count from ENR & metadata:

  • if custody_subnet_count is invalid, fall back to custody_requirement
  • I think it's fine to not penalise peer, if they can still serve the columns from custody_requirement, they are still useful to us

@jimmygchen jimmygchen added the das Data Availability Sampling label Aug 27, 2024
@jimmygchen jimmygchen mentioned this pull request Aug 27, 2024
@jimmygchen
Copy link
Member Author

I don't really like the complexity in error handling here, will work on simplifying it tomorrow.

@jimmygchen jimmygchen force-pushed the get-custody-columns-error-handing branch 2 times, most recently from b9a2cfd to b455de6 Compare August 30, 2024 13:08
@jimmygchen jimmygchen changed the title Add validation to get_custody_columns and improve error handling Improve get_custody_columns validation, caching and error handling Aug 30, 2024
@jimmygchen
Copy link
Member Author

@pawanjay176 I've re-written the PR and I think it's in a better place now and ready for review.

@jimmygchen jimmygchen marked this pull request as ready for review August 30, 2024 13:11
@jimmygchen jimmygchen added the ready-for-review The code is ready for review label Aug 30, 2024
@jimmygchen jimmygchen force-pushed the get-custody-columns-error-handing branch from b4c6f0d to 3e040e5 Compare August 30, 2024 13:19
@jimmygchen jimmygchen requested a review from pawanjay176 August 30, 2024 13:20
@jimmygchen jimmygchen force-pushed the get-custody-columns-error-handing branch from f3e99e9 to 73a340c Compare August 30, 2024 14:12
@jimmygchen jimmygchen force-pushed the get-custody-columns-error-handing branch from 73a340c to 82a578f Compare August 30, 2024 14:30
# Conflicts:
#	beacon_node/lighthouse_network/src/discovery/subnet_predicate.rs
#	beacon_node/lighthouse_network/src/peer_manager/peerdb.rs
#	beacon_node/lighthouse_network/src/peer_manager/peerdb/peer_info.rs
#	beacon_node/lighthouse_network/src/types/globals.rs
#	beacon_node/network/src/service.rs
#	consensus/types/src/data_column_subnet_id.rs
@jimmygchen jimmygchen added work-in-progress PR is a work-in-progress and removed ready-for-review The code is ready for review labels Sep 3, 2024
…onnection since we get them on metadata anyway.
@jimmygchen jimmygchen force-pushed the get-custody-columns-error-handing branch from 5def040 to 0a717ff Compare September 3, 2024 07:59
@jimmygchen jimmygchen added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Sep 4, 2024
@jimmygchen jimmygchen requested a review from dapplion September 4, 2024 12:50
try_create_int_gauge_vec(
"peers_per_custody_subnet_count",
"The current count of peers by custody subnet count",
&["custody_subnet_count"],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider doing a histogram here as we can have buckets instead of exact values. Say we have a peer with csc 128 and another with 127, can grafana bucket these two labels into the same range?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think buckets are useful, but historgrams accumulate observersations over time and it doesn't have the ability to track the current count for each bucket at a specific moment. I feel like gauge is more suitable but if you have a way to represent the metric in histogram I'm interested to hear.

Copy link
Collaborator

@dapplion dapplion Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Lodestar we reset the histogram buckets before every scrape. It feels weird but renders nicely on Grafana.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good idea. I've tried it and unfortunately couldn't find a way to set it with the prometheus crate though.

Closest thing that I could find was histogram.metric().clear_histogram(); but it doesn't do anything to the actual metric.

I'm going to merge this now, if we find any better way to do it, happy to do it in a follow up PR.

Copy link
Collaborator

@dapplion dapplion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Sep 6, 2024
@jimmygchen
Copy link
Member Author

@mergify queue

@mergify
Copy link

mergify bot commented Sep 6, 2024

queue

✅ The pull request has been merged automatically

Details

The pull request has been merged automatically at c0b4f01

mergify bot added a commit that referenced this pull request Sep 6, 2024
@mergify mergify bot merged commit c0b4f01 into sigp:unstable Sep 6, 2024
ThreeHrSleep added a commit to ThreeHrSleep/lighthouse that referenced this pull request Sep 8, 2024
Add retropgf funding (sigp#6324)

Migrate from `ethereum-types` to `alloy-primitives` (sigp#6078)

* Remove use of ethers_core::RlpStream

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into remove_use_of_ethers_core

* Remove old code

* Simplify keccak call

* Remove unused package

* Merge branch 'unstable' of https://github.com/ethDreamer/lighthouse into remove_use_of_ethers_core

* Merge branch 'unstable' into remove_use_of_ethers_core

* Run clippy

* Merge branch 'remove_use_of_ethers_core' of https://github.com/dospore/lighthouse into remove_use_of_ethers_core

* Check all cargo fmt

* migrate to alloy primitives init

* fix deps

* integrate alloy-primitives

* resolve dep issues

* more changes based on dep changes

* add TODOs

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into remove_use_of_ethers_core

* Revert lock

* Add BeaconBlocksByRange v3

* continue migration

* Revert "Add BeaconBlocksByRange v3"

This reverts commit e3ce7fc.

* impl hash256 extended trait

* revert some uneeded diffs

* merge conflict resolved

* fix subnet id rshift calc

* rename to FixedBytesExtended

* debugging

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into migrate-to-alloy-primitives

* fix failed test

* fixing more tests

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into remove_use_of_ethers_core

* introduce a shim to convert between the two u256 types

* move alloy to wrokspace

* align alloy versions

* update

* update web3signer test certs

* refactor

* resolve failing tests

* linting

* fix graffiti string test

* fmt

* fix ef test

* resolve merge conflicts

* remove udep and revert cert

* cargo patch

* cyclic dep

* fix build error

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into migrate-to-alloy-primitives

* resolve conflicts, update deps

* merge unstable

* fmt

* fix deps

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into migrate-to-alloy-primitives

* resolve merge conflicts

* resolve conflicts, make necessary changes

* Remove patch

* fmt

* remove file

* merge conflicts

* sneaking in a smol change

* bump versions

* Merge remote-tracking branch 'origin/unstable' into migrate-to-alloy-primitives

* Updates for peerDAS

* Update ethereum_hashing to prevent dupe

* updated alloy-consensus, removed TODOs

* cargo update

* endianess fix

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into migrate-to-alloy-primitives

* fmt

* fix merge

* fix test

* fixed_bytes crate

* minor fixes

* convert u256 to i64

* panic free mixin to_low_u64_le

* from_str_radix

* computbe_subnet api and ensuring we use big-endian

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into migrate-to-alloy-primitives

* fix test

* Simplify subnet_id test

* Simplify some more tests

* Add tests to fixed_bytes crate

* Merge branch 'unstable' into migrate-to-alloy-primitives

update libp2p to version 0.54 (sigp#6249)

* update libp2p to version 0.54.0

* address review

* Merge branch 'unstable' of github.com:sigp/lighthouse into update-libp2p

* Merge branch 'update-libp2p' of github.com:sigp/lighthouse into update-libp2p

Drop block data from BlockError and BlobError (sigp#5735)

* Drop block data from BlockError and BlobError

* Debug release tests

* Fix release tests

* Revert unnecessary changes to lighthouse_metrics

Update manual checkpoint sync content in Lighthouse book (sigp#6338)

* Update manual checkpiont sync

* Update faq

* Minor revision

* Minor revision

Light Client Bug Fix (sigp#6299)

* Light Client Bug Fix

Ignore Rust 1.82 warnings about void patterns (sigp#6357)

* Ignore Rust 1.82 warnings about void patterns

Enable `large_stack_frames` lint (sigp#6343)

* Enable `large_stack_frames` lint

Add blob count label to `DATA_COLUMN_SIDECAR_COMPUTATION` metric (sigp#6340)

* Add blob count label to `DATA_COLUMN_SIDECAR_COMPUTATION` metric, and move metrics into the compute function, recording only successful computation.

* Move `discard_timer_on_break` usage to caller site.

* Merge branch 'unstable' into compute-data-column-metric

* Merge branch 'unstable' into compute-data-column-metric

Metadata request ordering (sigp#6336)

* Send metadata request ordering

* Merge branch 'unstable' into metadata-order

Clarify validator monitor block log (sigp#6342)

* Clarify validator monitor block log

* Merge branch 'unstable' into clarify-block-log

Check known parent on rpc blob process (sigp#5893)

* Check known parent on rpc blob process

* fix test

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into blob-unknown-parent

Remove beta tag from gossipsub 1.2 (sigp#6344)

* Remove the beta tag from gossipsub v1.2

* fix clippy

* Merge branch 'unstable' into remove-beta-tag

Fix lints for Rust 1.81 (sigp#6363)

* Fix lints for Rust 1.81

Use tikv-jemallocator instead of jemallocator (sigp#6354)

* Use tikv-jemallocator instead of jemallocator

* Merge branch 'unstable' into tikv-jemallocator

* Bump tikv-jemallocator and tikv-jemalloc-ctl

Improve `get_custody_columns` validation, caching and error handling (sigp#6308)

* Improve `get_custody_columns` validation, caching and error handling.

* Merge branch 'unstable' into get-custody-columns-error-handing

* Fix failing test and add more test.

* Fix failing test and add more test.

* Merge branch 'unstable' into get-custody-columns-error-handing

* Add unit test to make sure the default specs won't panic on the `compute_custody_requirement_subnets` function.

* Add condition when calling `compute_custody_subnets_from_metadata` and update logs.

* Validate `csc` when returning from enr. Remove `csc` computation on connection since we get them on metadata anyway.

* Add `peers_per_custody_subnet_count` to track peer csc and supernodes.

* Disconnect peers with invalid metadata and find other peers instead.

* Fix sampling tests.

* Merge branch 'unstable' into get-custody-columns-error-handing

* Merge branch 'unstable' into get-custody-columns-error-handing

Delete legacy payload reconstruction (sigp#6213)

* Delete legacy payload reconstruction

* Delete unneeded failing test

* Merge remote-tracking branch 'origin/unstable' into remove-more-ethers

* Merge remote-tracking branch 'origin/unstable' into remove-more-ethers

* Cleanups

simplify rpc codec logic (sigp#6304)

* simplify rpc codec logic

* Merge branch 'unstable' of github.com:sigp/lighthouse into simplify-rpc-codec

* Merge branch 'unstable' of github.com:sigp/lighthouse into simplify-rpc-codec

* Merge branch 'unstable' of github.com:sigp/lighthouse into simply-rpc-codec

* Merge branch 'unstable' into simplify-rpc-codec

* Merge branch 'unstable' into simplify-rpc-codec
ThreeHrSleep added a commit to ThreeHrSleep/lighthouse that referenced this pull request Sep 22, 2024
Add retropgf funding (sigp#6324)

Migrate from `ethereum-types` to `alloy-primitives` (sigp#6078)

* Remove use of ethers_core::RlpStream

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into remove_use_of_ethers_core

* Remove old code

* Simplify keccak call

* Remove unused package

* Merge branch 'unstable' of https://github.com/ethDreamer/lighthouse into remove_use_of_ethers_core

* Merge branch 'unstable' into remove_use_of_ethers_core

* Run clippy

* Merge branch 'remove_use_of_ethers_core' of https://github.com/dospore/lighthouse into remove_use_of_ethers_core

* Check all cargo fmt

* migrate to alloy primitives init

* fix deps

* integrate alloy-primitives

* resolve dep issues

* more changes based on dep changes

* add TODOs

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into remove_use_of_ethers_core

* Revert lock

* Add BeaconBlocksByRange v3

* continue migration

* Revert "Add BeaconBlocksByRange v3"

This reverts commit e3ce7fc.

* impl hash256 extended trait

* revert some uneeded diffs

* merge conflict resolved

* fix subnet id rshift calc

* rename to FixedBytesExtended

* debugging

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into migrate-to-alloy-primitives

* fix failed test

* fixing more tests

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into remove_use_of_ethers_core

* introduce a shim to convert between the two u256 types

* move alloy to wrokspace

* align alloy versions

* update

* update web3signer test certs

* refactor

* resolve failing tests

* linting

* fix graffiti string test

* fmt

* fix ef test

* resolve merge conflicts

* remove udep and revert cert

* cargo patch

* cyclic dep

* fix build error

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into migrate-to-alloy-primitives

* resolve conflicts, update deps

* merge unstable

* fmt

* fix deps

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into migrate-to-alloy-primitives

* resolve merge conflicts

* resolve conflicts, make necessary changes

* Remove patch

* fmt

* remove file

* merge conflicts

* sneaking in a smol change

* bump versions

* Merge remote-tracking branch 'origin/unstable' into migrate-to-alloy-primitives

* Updates for peerDAS

* Update ethereum_hashing to prevent dupe

* updated alloy-consensus, removed TODOs

* cargo update

* endianess fix

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into migrate-to-alloy-primitives

* fmt

* fix merge

* fix test

* fixed_bytes crate

* minor fixes

* convert u256 to i64

* panic free mixin to_low_u64_le

* from_str_radix

* computbe_subnet api and ensuring we use big-endian

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into migrate-to-alloy-primitives

* fix test

* Simplify subnet_id test

* Simplify some more tests

* Add tests to fixed_bytes crate

* Merge branch 'unstable' into migrate-to-alloy-primitives

update libp2p to version 0.54 (sigp#6249)

* update libp2p to version 0.54.0

* address review

* Merge branch 'unstable' of github.com:sigp/lighthouse into update-libp2p

* Merge branch 'update-libp2p' of github.com:sigp/lighthouse into update-libp2p

Drop block data from BlockError and BlobError (sigp#5735)

* Drop block data from BlockError and BlobError

* Debug release tests

* Fix release tests

* Revert unnecessary changes to lighthouse_metrics

Update manual checkpoint sync content in Lighthouse book (sigp#6338)

* Update manual checkpiont sync

* Update faq

* Minor revision

* Minor revision

Light Client Bug Fix (sigp#6299)

* Light Client Bug Fix

Ignore Rust 1.82 warnings about void patterns (sigp#6357)

* Ignore Rust 1.82 warnings about void patterns

Enable `large_stack_frames` lint (sigp#6343)

* Enable `large_stack_frames` lint

Add blob count label to `DATA_COLUMN_SIDECAR_COMPUTATION` metric (sigp#6340)

* Add blob count label to `DATA_COLUMN_SIDECAR_COMPUTATION` metric, and move metrics into the compute function, recording only successful computation.

* Move `discard_timer_on_break` usage to caller site.

* Merge branch 'unstable' into compute-data-column-metric

* Merge branch 'unstable' into compute-data-column-metric

Metadata request ordering (sigp#6336)

* Send metadata request ordering

* Merge branch 'unstable' into metadata-order

Clarify validator monitor block log (sigp#6342)

* Clarify validator monitor block log

* Merge branch 'unstable' into clarify-block-log

Check known parent on rpc blob process (sigp#5893)

* Check known parent on rpc blob process

* fix test

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into blob-unknown-parent

Remove beta tag from gossipsub 1.2 (sigp#6344)

* Remove the beta tag from gossipsub v1.2

* fix clippy

* Merge branch 'unstable' into remove-beta-tag

Fix lints for Rust 1.81 (sigp#6363)

* Fix lints for Rust 1.81

Use tikv-jemallocator instead of jemallocator (sigp#6354)

* Use tikv-jemallocator instead of jemallocator

* Merge branch 'unstable' into tikv-jemallocator

* Bump tikv-jemallocator and tikv-jemalloc-ctl

Improve `get_custody_columns` validation, caching and error handling (sigp#6308)

* Improve `get_custody_columns` validation, caching and error handling.

* Merge branch 'unstable' into get-custody-columns-error-handing

* Fix failing test and add more test.

* Fix failing test and add more test.

* Merge branch 'unstable' into get-custody-columns-error-handing

* Add unit test to make sure the default specs won't panic on the `compute_custody_requirement_subnets` function.

* Add condition when calling `compute_custody_subnets_from_metadata` and update logs.

* Validate `csc` when returning from enr. Remove `csc` computation on connection since we get them on metadata anyway.

* Add `peers_per_custody_subnet_count` to track peer csc and supernodes.

* Disconnect peers with invalid metadata and find other peers instead.

* Fix sampling tests.

* Merge branch 'unstable' into get-custody-columns-error-handing

* Merge branch 'unstable' into get-custody-columns-error-handing

Delete legacy payload reconstruction (sigp#6213)

* Delete legacy payload reconstruction

* Delete unneeded failing test

* Merge remote-tracking branch 'origin/unstable' into remove-more-ethers

* Merge remote-tracking branch 'origin/unstable' into remove-more-ethers

* Cleanups

simplify rpc codec logic (sigp#6304)

* simplify rpc codec logic

* Merge branch 'unstable' of github.com:sigp/lighthouse into simplify-rpc-codec

* Merge branch 'unstable' of github.com:sigp/lighthouse into simplify-rpc-codec

* Merge branch 'unstable' of github.com:sigp/lighthouse into simply-rpc-codec

* Merge branch 'unstable' into simplify-rpc-codec

* Merge branch 'unstable' into simplify-rpc-codec
chong-he pushed a commit to chong-he/lighthouse that referenced this pull request Nov 26, 2024
…igp#6308)

* Improve `get_custody_columns` validation, caching and error handling.

* Merge branch 'unstable' into get-custody-columns-error-handing

* Fix failing test and add more test.

* Fix failing test and add more test.

* Merge branch 'unstable' into get-custody-columns-error-handing

# Conflicts:
#	beacon_node/lighthouse_network/src/discovery/subnet_predicate.rs
#	beacon_node/lighthouse_network/src/peer_manager/peerdb.rs
#	beacon_node/lighthouse_network/src/peer_manager/peerdb/peer_info.rs
#	beacon_node/lighthouse_network/src/types/globals.rs
#	beacon_node/network/src/service.rs
#	consensus/types/src/data_column_subnet_id.rs

* Add unit test to make sure the default specs won't panic on the `compute_custody_requirement_subnets` function.

* Add condition when calling `compute_custody_subnets_from_metadata` and update logs.

* Validate `csc` when returning from enr. Remove `csc` computation on connection since we get them on metadata anyway.

* Add `peers_per_custody_subnet_count` to track peer csc and supernodes.

* Disconnect peers with invalid metadata and find other peers instead.

* Fix sampling tests.

* Merge branch 'unstable' into get-custody-columns-error-handing

* Merge branch 'unstable' into get-custody-columns-error-handing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

das Data Availability Sampling ready-for-merge This PR is ready to merge.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants