Optimize Mempool Reorg logic using Epochs, improving memory usage and runtime. #24158
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
What is the relationship of this PR to your #18191? Just judging from the title, they seem to do very similar things, although the code changes are not the same.

See the PR description of #18191.

Oh oops, TBH I forgot I had that other PR open. They should be mostly the same. The main difference is that the earlier one also applies an optimization that gets rid of setExclude and uses the cache line presence instead, since setExclude ends up being redundant. We can add that optimization as a separate PR since it's a little less obvious why it works; I left it out when I rewrote this one.
glozow left a comment:

Seems like this could be an improvement, but I'm not convinced unless there's a bench or a breakdown of the memory used.
```cpp
const CTxMemPoolEntry& descendant = *descendants[i];
const CTxMemPoolEntry::Children& children = descendant.GetMemPoolChildrenConst();
for (const CTxMemPoolEntry& childEntry : children) {
    cacheMap::iterator cacheIt = cachedDescendants.find(mapTx.iterator_to(childEntry));
```
Shouldn't you first look for descendant in cachedDescendants before fetching its children? If its descendant set is available there, you wasted a cycle looking at the first generation.
I think you are right on this. The patch here is a 100% behavioral match, but it does seem like one could evade caching a little because we don't check whether children are cached. Can you think of any functional differences? (This code is incredibly subtle, so I'm reticent to change it.)
I think it should only be done if children is non-empty. This change should also allow getting rid of the current find call in the children loop and just adding straight to descendants. The current logic in master of checking cachedDescendants is a bit odd: if there are three txns (t1 -> t2 -> t3) in a chain that depend on each other sequentially, the find call will only get a "hit" when t3 is updateIt, since it "skips" t2.
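To illustrate the pattern being suggested, here is a minimal sketch with hypothetical names (`GetDescendants`, `ChildMap`, and `DescCache` are made up, not the mempool's types): the cache is consulted for every node before its children are expanded, so computing t1's descendant set reuses t2's cached set instead of skipping a generation.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins: each node's direct children, and a memo of
// already-computed descendant sets.
using ChildMap = std::unordered_map<std::string, std::vector<std::string>>;
using DescCache = std::unordered_map<std::string, std::unordered_set<std::string>>;

std::unordered_set<std::string> GetDescendants(const std::string& node,
                                               const ChildMap& children,
                                               DescCache& cache)
{
    // Check the cache for this node *before* expanding its children, so a
    // chain t1 -> t2 -> t3 reuses t2's cached set when computing t1's set.
    if (auto it = cache.find(node); it != cache.end()) return it->second;

    std::unordered_set<std::string> result;
    if (auto kids = children.find(node); kids != children.end()) {
        for (const auto& child : kids->second) {
            result.insert(child);
            // Recursing hits the cache check above for each child, rather
            // than only getting a hit two generations down.
            for (const auto& d : GetDescendants(child, children, cache)) {
                result.insert(d);
            }
        }
    }
    cache.emplace(node, result);
    return result;
}
```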
```cpp
if (!(descendants.size() == i+2)) {
    std::swap(descendants[i+1], descendants.back());
}
```
IIUC, you're swapping here in order to insert all of the descendants between this child and the next entry in the descendants vector, so you can just increment i to skip to the next child in the same generation. Note that you're still moving everything downwards n times, where n is the number of cached descendants.
Approach-wise, perhaps a std::deque is more appropriate if you really want constant-time insert at an arbitrary position. Also, I'm not too familiar with the implementation of std::vector, but I feel like it should be optimized enough for you to feel okay using insert(i+1) without using swaps. It might even do that in the background.
Also, please add a comment, because it was not immediately obvious that you're doing swaps to maintain ordering.
I think you have it kinda backwards: we do not care about ordering whatsoever; we are keeping a "partition" of processed and unprocessed elements.
Essentially we have a queue with processed and unprocessed elements, and a pointer to where we should process next:

```
// start
[P1 P2 P3 P4 *U5 U6 U7 U8]
// process element
[P1 P2 P3 P4 P5 *U6 U7 U8]
// insert processed element using push_back and swap
[P1 P2 P3 P4 P5 *U6 U7 U8 P9]
[P1 P2 P3 P4 P5 *P9 U7 U8 U6]
[P1 P2 P3 P4 P5 P9 *U7 U8 U6]
// insert unprocessed element
[P1 P2 P3 P4 P5 P9 *U7 U8 U6 U10]
```

If we were to try to do the same with inserts, it would cause N^2 behavior. As you can see, we also do not preserve order with the swap-to-back approach.
If we were to do an insert, it would look like:

```
[P1 P2 P3 P4 P5 *U6 U7 U8]
// insert processed element using insert
[P1 P2 P3 P4 P5 P9 *U6 U7 U8]
```

and shifting U6..U8 down would cost O(N).
Deque is a good data structure, but it has an awful lot of extra overhead, making it a poor fit for a performance-critical code section. One of the key benefits of this approach is that we keep our data structure quick to iterate over and add to.
The swap approach is O(1) per element added, which is fine. If we used insert/shifting it would be O(N) per element, which would cause a quadratic blowup.
By "preserve ordering", I merely meant that you're keeping the unprocessed elements behind the processed ones. We're saying the same thing, sorry for being unclear.

> If we were to try to do the same with inserts, it would cause N^2 behavior.

Sure, this is the case if you're shifting everything per-insert inside the vector. This is why I suggested using a deque, where it's O(1) per element. It'd be the same performance as what you're doing here, except you just call insert instead of swapping. But my concerns with the complexity/readability of the code will be gone if you just comment what you're doing here.
Yeah, deque is the same complexity, but it's not the same performance, so it makes sense to continue using a vector even if we have to do a little bit of index management. Can add a comment.
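For concreteness, a minimal self-contained sketch of the swap-to-back partition described above (the element type and the "discovery" rule are made up for illustration; this is not the PR's actual code):

```cpp
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

int main()
{
    // Invariant: work[0..next) are processed, work[next..) are unprocessed.
    std::vector<int> work{1, 2, 3, 4};
    std::size_t next = 0;

    while (next < work.size()) {
        int cur = work[next];
        ++next;  // cur has moved into the processed region

        // Pretend that processing small elements discovers one new element,
        // and that even discoveries are "already processed" (standing in
        // for the cached descendants in the real code).
        if (cur < 5) {
            int discovered = cur + 10;
            work.push_back(discovered);              // O(1) append
            if (discovered % 2 == 0) {
                std::swap(work[next], work.back());  // O(1) swap into the
                ++next;                              // processed region
            }
            // Odd discoveries simply stay at the back, unprocessed.
        }
    }
    for (int v : work) std::printf("%d ", v);  // partition was maintained
    std::printf("\n");
}
```

Every append and swap is O(1), so the pass stays linear in the number of elements ever enqueued, while the contiguous vector keeps iteration cache-friendly.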
```cpp
size_t i = 0;
for (const auto& child : children) {
    descendants[i] = mapTx.iterator_to(child);
    ++i;
}
```
What is the purpose of i here? You're re-assigning it later anyway.
Also, why can't you just do a std::transform from children to populate descendants?
`i` is needed to index `descendants`, which is allocated up front.
It could be a std::transform, I suppose, but I don't think std::transform provides a readability improvement here.
> i is needed to index descendants which is allocated up front

Just push_back, and you don't need to allocate anything.
If I do a call to reserve instead of the sized constructor (https://en.cppreference.com/w/cpp/container/vector/vector), that would work without extra allocations.
The sized constructor + default insert + indexed insert is a bit better IMO, because push_back/emplace_back have some overhead from updating the vector's bookkeeping that the simple indexed insert above does not have, but it's all something that would need measuring, as the compiler probably does an OK job at it.
Do you have a strong preference around this?
```cpp
// Note: the below contains code which does some hacks to keep memory tight.
// It could be improved in the future to detect if the vector is already tight
// and then directly move it to cachedDescendants. For simplicity, we just
// do a copy for now.
// ...
// emplace into a new vector to guarantee we trim memory
const auto& it = cachedDescendants.emplace(std::piecewise_construct,
                                           std::forward_as_tuple(updateIt),
                                           std::forward_as_tuple(descendants.begin(), included_upto));
// swap with descendants to release it early!
std::vector<txiter>().swap(descendants);
```
Would using std::shrink_to_fit() (https://www.cplusplus.com/reference/vector/vector/shrink_to_fit/) "guarantee we trim" or no?
Also I'm curious as to why it's necessary to free this memory here. We don't need to allocate anything in the rest of the function, and descendants goes out of scope when we return.
From what you cited:

> The request is non-binding, and the container implementation is free to optimize otherwise and leave the vector with a capacity greater than its size.

Ideally we would be happy trusting shrink_to_fit, but I'd rather get exactly a trimmed vector here.
We do allocate (descendants_to_remove.insert), so it's more cautious to free it explicitly once it will never be used again than to wait until later.
Sounds good.
And my other question: what's the point of releasing descendants early?
> we do allocate (descendants_to_remove.insert), so it's more cautious to free it explicitly once it will never be used again than to wait until later.
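A small self-contained demonstration of the distinction being discussed (illustrative only, not the PR's code): shrink_to_fit is a non-binding request, whereas constructing a fresh vector from the range gives a tight allocation in practice, and the empty-vector swap releases the old buffer immediately.

```cpp
#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> v;
    v.reserve(1000);
    for (int i = 0; i < 10; ++i) v.push_back(i);
    std::printf("before: size=%zu capacity=%zu\n", v.size(), v.capacity());

    // Non-binding: the implementation may ignore this and keep the capacity.
    v.shrink_to_fit();
    std::printf("after shrink_to_fit: capacity=%zu (implementation-defined)\n",
                v.capacity());

    // Building a fresh vector from a range in practice allocates exactly
    // what is needed: this is the "copy into a new vector" the PR comment
    // describes.
    std::vector<int> trimmed(v.begin(), v.end());
    std::printf("trimmed: size=%zu capacity=%zu\n",
                trimmed.size(), trimmed.capacity());

    // Swapping with an empty temporary releases v's buffer immediately,
    // rather than waiting until v goes out of scope.
    std::vector<int>().swap(v);
    std::printf("after swap: size=%zu capacity=%zu\n", v.size(), v.capacity());
}
```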
```cpp
// initialize to hold all direct children
// children guaranteed to be unique at this point
std::vector<txiter> descendants(children.size());
```
Actually, it might make more sense to size this to the children's aggregated descendant counts. You could overestimate, but there would most likely be even fewer resizes.
This is a good point, but how do we know what that would be in advance of iterating? I figured we do not.
🐙 This pull request conflicts with the target branch and needs rebase. Want to unsubscribe from rebase notifications on this pull request? Just convert this pull request to a "draft".
@JeremyRubin as discussed offline, closing this for now and assigning to @stickies-v. |
This is a follow-up PR to #21464, improving the memory usage and runtime of the reorg update logic. The main changes are: