execution/bbd: el backward block downloader v2 #16073
Merged
Conversation
mh0lt approved these changes on Jul 15, 2025
taratorio added a commit that referenced this pull request on Jul 18, 2025
This reverts commit ee48d8a.
taratorio added a commit that referenced this pull request on Jul 18, 2025
taratorio added a commit that referenced this pull request on Jul 24, 2025
taratorio added a commit that referenced this pull request on Aug 1, 2025
Adding back #16073 (it had to be reverted because it broke the Hive tests).

closes #15877
closes #15879
relates to erigontech/erigon-qa#74 (comment)

This change is introduced behind a feature flag, `--el.block.downloader.v2` (off by default), until sufficient testing has been done. The old flow will be used until we enable it by default. This should minimise any breaking changes.

**Problems addressed**

1. We weren't able to join any of the perfnet shadow forks:
   - When we send "get header" requests over devp2p with Sentry, we always send them to the same 5 peers (due to a max limit of 5 peers when sending the message out), which causes issues for shadow forks.
   - In a shadow-fork environment we have very few peers that are on the shadow fork (e.g. 5 bootnodes), and the majority of the other peers we are connected to are mainnet peers (the shadow fork uses the same networkid=1 as mainnet).
   - What ended up happening is that, when trying to backward sync using the EL block downloader, we would always hit the same 5 peers that were on mainnet instead of the shadow fork, leaving us stuck indefinitely.
   - The same issue happened for the bodies download; there it was even worse because we send bodies requests to at most 1 peer (and we would always hit the same peer).
   - My workaround/short-term fix to unblock the perfnet work on the `performance` branch was to always send all messages to all peers, which is bad for the network, so we can't merge that temp fix into main; hence we need a better solution, which this PR aspires to be.
2. Extreme chaindata growth for large backward chain downloads, due to the EL downloader not taking into account the `--sync.loop.block.limit` batched execution (in batches of 5,000 blocks) for E3.
3. Extreme chaindata growth for large backward chain downloads, due to a long-lived chaindata RoTx:
   - The long-lived RoTx is used in `membatchwithdb.NewMemoryBatchWithCustomDB(tx, tmpDb, tmpTx)` after the headers have been downloaded into an ETL.
   - Even after point 2 above was fixed and we called execution every 5,000 blocks (in order for pruning to kick in), the chaindata was still growing uncontrollably. This is because there is a long-lived RoTx holding the data (even though it has been pruned by the execution loop), and this RoTx lives for the entirety of the backward chain download.
4. Issues with the EL block downloader getting stuck for Gnosis, as per erigontech/erigon-qa#74 (comment).

**Solutions**

To address problem 1:
- We devise a new backward fetch algorithm which leverages the devp2p API I wrote for Astrid (a convenient wrapper on top of Sentry). It is a much better choice than the old `HeaderDownload` and `BodyDownload` approach because it gives the user more control over which peers to fetch from (e.g. it allows you to list peers and make smarter selections about which peers to download from), which makes a lot of the code simpler. It also allows us to do things like parallel body download from many peers, which we weren't able to do in an easy way previously. Another convenience of the API is that it blocks every time you send a request to a peer, so you don't have to worry about the async nature of devp2p. Doing this will also allow us to remove `HeaderDownload`, `BodyDownload` and the headers and bodies stages, so that we can finally unify our devp2p code paths between Polygon and Ethereum.
- The algo looks like this (a minimal sketch of step 3 follows right after this block):
```
1. List all peers.
2. Try to download the initial header (the one with the hash we need to download backwards from) from all peers, so that we know which peers have this data and are available for future requests.
3. Send devp2p backward header requests in batches of 500 headers (configurable per request) to 1 peer at a time.
   - Send every new batch to the next available peer (which has the initial header) in a circular fashion to distribute the requests.
   - For every batch of headers that we receive, we traverse the headers backwards (in descending number) and check if the parent hash is available in our DB. If it is, we have found a connection point and we stop; otherwise we continue fetching headers backwards.
   - We accumulate the headers in an ETL collector.
   - If any of the available peers fail to respond, we mark them as "exhausted" so we don't try to use them again.
4. Once we have a connection point, we start doing a forward body download in batches of 500 (configurable per request) by traversing the sorted ETL (without needing any temporary MDBX). Here we do some parallelisation, because we already know all the hashes:
   - Depending on how many non-exhausted peers we have, we split the batch of 500 headers into N smaller batches, where N is the number of non-exhausted peers.
   - We fetch the bodies of these N batches in parallel (we have a configurable max workers property, currently set to 10).
   - If any of the peers fail to serve us the bodies, we again mark them as exhausted so we don't try to use them again, and retry the batch with the next available peer in a circular fashion.
   - Once we've downloaded the batch of 500 blocks, we feed it to a `ResultFeed` where the caller can consume it (paging-like API).
```
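For concreteness, here is a minimal Go sketch of the backward header fetch loop (step 3 above): circular peer rotation, exhausted-peer bookkeeping, and the connection-point check. All names below (`HeaderFetcher`, `HeaderStore`, `fetchBackwards`, etc.) are hypothetical stand-ins rather than the actual `bbd` API, and the real implementation accumulates headers into an ETL collector rather than a slice.

```go
package sketch

import (
	"context"
	"errors"
)

type Hash [32]byte

type Header struct {
	Number     uint64
	Hash       Hash
	ParentHash Hash
}

// HeaderFetcher stands in for the blocking devp2p request API: fetch up to
// `amount` headers walking backwards from `from`, using the given peer.
type HeaderFetcher interface {
	FetchHeadersBackwards(ctx context.Context, from Hash, amount uint64, peerID string) ([]*Header, error)
}

// HeaderStore answers whether a header is already known locally, i.e.
// whether we have reached a connection point with our existing chain.
type HeaderStore interface {
	HasHeader(h Hash) bool
}

// fetchBackwards walks the chain backwards from `start` in batches of
// batchSize headers, sending each batch request to the next available peer
// in a circular fashion and marking unresponsive peers as exhausted. It
// returns the collected headers (newest first) once a locally known parent
// hash is found.
func fetchBackwards(ctx context.Context, fetcher HeaderFetcher, store HeaderStore,
	peers []string, start Hash, batchSize uint64) ([]*Header, error) {
	var collected []*Header
	next := start
	exhausted := map[string]bool{}
	peerIdx := 0
	for {
		peerID, ok := nextPeer(peers, exhausted, &peerIdx)
		if !ok {
			return nil, errors.New("all peers exhausted")
		}
		headers, err := fetcher.FetchHeadersBackwards(ctx, next, batchSize, peerID)
		if err != nil || len(headers) == 0 {
			exhausted[peerID] = true // don't use this peer again
			continue                 // retry the same batch with the next peer
		}
		// Traverse the batch in descending block number, checking whether
		// the parent hash is already in our DB (the connection point).
		for _, h := range headers {
			collected = append(collected, h)
			if store.HasHeader(h.ParentHash) {
				return collected, nil // connection point found, stop
			}
			next = h.ParentHash
		}
	}
}

// nextPeer returns the next non-exhausted peer in circular order.
func nextPeer(peers []string, exhausted map[string]bool, idx *int) (string, bool) {
	for range peers {
		p := peers[*idx%len(peers)]
		*idx++
		if !exhausted[p] {
			return p, true
		}
	}
	return "", false
}
```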
To address problem 2:
- We change the logic in the EngineBlockDownloader in the following way: if a backward-downloaded header chain is > dbg.MaxReorgDepth (512 blocks) in NewPayload, we abandon the download request.
- The algo in the solution to problem 1 has an option to abandon the backward header download (step 3) early if the backward chain length exceeds a configurable number (512 in this case).
- The CL will eventually have to send a ForkChoiceUpdated request for that large backward chain if it is the canonical one; within an FCU we allow long backward downloads.
- When doing these long backward downloads with FCU, we add logic to take into account `--sync.loop.block.limit`: we insert blocks into chaindata (in batches of 500 blocks, as currently done), then when we reach 5,000 blocks inserted we call ValidateChain+UpdateForkChoice to trigger execution + pruning (to keep the chaindata size low). Note this works quite well with the `ResultFeed` paging approach in the solution for problem 1 (a sketch of this consumption loop follows at the end of this description).

To address problem 3:
- Thanks to the solution to problem 1, we no longer need to use `membatchwithdb.NewMemoryBatchWithCustomDB(tx, tmpDb, tmpTx)` and a long-lived chaindata RoTx; we only need an ETL collector, which can be used without a tx/db.

To address problem 4:
- Solved by the solution to problem 1 as well.

**Future follow-ups**

- We should move the shared code from `polygon/p2p` to the `p2p` package.
- We should use the new `bbd.BackwardBlockDownloader` in Astrid too (there is 1 situation in the tip-sync mode there that may occasionally need to do long backward block downloads, when handling a long period of no milestones).
- Remove the old code path and the `--el.block.downloader.v2` feature flag in EngineBlockDownloader.
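And here is a minimal Go sketch of the consumer side referenced in the solution to problem 2: consuming the `ResultFeed` paging API and triggering execution + pruning every 5,000 inserted blocks. Again, all names (`ResultFeed`, `chainWriter`, `consume`) are hypothetical stand-ins rather than the actual EngineBlockDownloader interfaces.

```go
package sketch

import "context"

type Block struct{ Number uint64 }

// ResultFeed stands in for the paging-like API: downloaded blocks are
// delivered in batches of 500 as they arrive.
type ResultFeed interface {
	// Next blocks until the next batch is ready; ok=false signals that the
	// download has finished (or was abandoned).
	Next(ctx context.Context) (batch []*Block, ok bool, err error)
}

// chainWriter stands in for inserting blocks and triggering execution.
type chainWriter interface {
	InsertBlocks(blocks []*Block) error
	ValidateChainAndUpdateForkChoice() error // triggers execution + pruning
}

// execEvery mirrors the 5,000-block batched execution used to respect
// --sync.loop.block.limit for E3.
const execEvery = 5_000

func consume(ctx context.Context, feed ResultFeed, c chainWriter) error {
	sinceExec := 0
	for {
		batch, ok, err := feed.Next(ctx)
		if err != nil {
			return err
		}
		if !ok {
			break // download complete
		}
		if err := c.InsertBlocks(batch); err != nil {
			return err
		}
		sinceExec += len(batch)
		if sinceExec >= execEvery {
			// Execute and prune periodically so chaindata stays small
			// during long backward downloads.
			if err := c.ValidateChainAndUpdateForkChoice(); err != nil {
				return err
			}
			sinceExec = 0
		}
	}
	// Final execution for any remaining tail of inserted blocks.
	return c.ValidateChainAndUpdateForkChoice()
}
```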
taratorio added a commit that referenced this pull request on Sep 15, 2025
Addressing the 1st follow-up from #16073: now that we use the new backward block downloader in the Ethereum EL Engine API server, we should move its p2p code from polygon/p2p to execution/p2p (the core pkg) so that the Ethereum EL Engine API server and Astrid can import and use the same code from execution/p2p.
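In import-path terms, the move looks roughly like this (assuming the `github.com/erigontech/erigon` module path; the blank import is only there to illustrate the new location, not real usage):

```go
package sketch

import (
	// Previously the downloader's p2p code lived under the Polygon-specific
	// tree: "github.com/erigontech/erigon/polygon/p2p".
	// After the move, the Ethereum EL Engine API server and Astrid both
	// import the shared code from the core package:
	_ "github.com/erigontech/erigon/execution/p2p"
)
```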
taratorio added a commit that referenced this pull request on Sep 22, 2025
The 2nd follow-up from #16073 is to have Astrid use the same flow for backward block downloads as Ethereum, unifying the two code paths. The 2 benefits of this are:
- The new backward block downloader abstraction matures more. I found 2 small issues with it while doing this work that are now fixed, and its interface is now more flexible, catering for both use cases.
- There will be performance gains in Astrid when handling milestone mismatches. The new downloader is much quicker to figure out which peers it can backward download the milestone from, and gives a fail/success answer much sooner (no 60-second stalling when we hit a peer that doesn't have the milestone, as before). Also, when getting a new block hash event, we can directly backward download to a connection point in the canonical chain builder (instead of first downloading 1 block from a peer, then realising that we need to download more because there is a gap, and sending more requests).