Optimize Mempool Reorg logic using Epochs, improving memory usage and runtime. #24158
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
What is the relationship of this PR to your #18191? Just judging from the title, they seem to do very similar things, although the code changes are not the same.

See the PR description of #18191.

Oh oops, TBH I forgot I had that other PR open. They should be mostly the same. The main difference is that the earlier one also applies an optimization that gets rid of setExclude and uses the cache line presence instead, since setExclude ends up being redundant. We can add that optimization as a separate PR since it's a little less obvious why it works; I left it out when I rewrote this one.
glozow left a comment:

Seems like this could be an improvement, but I'm not convinced unless there's a bench or a breakdown of the memory used.
```cpp
const CTxMemPoolEntry& descendant = *descendants[i];
const CTxMemPoolEntry::Children& children = descendant.GetMemPoolChildrenConst();
for (const CTxMemPoolEntry& childEntry : children) {
    cacheMap::iterator cacheIt = cachedDescendants.find(mapTx.iterator_to(childEntry));
```
Shouldn't you first look for descendant in cachedDescendants before fetching its children? If its descendant set is available there, you wasted a cycle looking at the first generation.
I think you are right on this. The patch here is a 100% behavioral match, but it does seem like one could evade caching a little because we don't check whether children are cached. Can you think of any functional differences? (This code is incredibly subtle, so I'm reticent to change it.)
I think it should only be done if children is non-empty. This change should also allow getting rid of the current find call in the children loop and just adding straight to descendants. The current logic in master of checking cachedDescendants is a bit odd: if there are three txns (t1 -> t2 -> t3) in a chain that depend on each other sequentially, the find call will only get a "hit" when t3 is updateIt, since it "skips" t2.
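To illustrate the pattern being suggested, here is a minimal sketch with hypothetical names (`GetDescendants`, `ChildMap`, and `DescCache` are made up, not the mempool's types): the cache is consulted for every node before its children are expanded, so computing t1's descendant set reuses t2's cached set instead of skipping a generation.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins: each node's direct children, and a memo of
// already-computed descendant sets.
using ChildMap = std::unordered_map<std::string, std::vector<std::string>>;
using DescCache = std::unordered_map<std::string, std::unordered_set<std::string>>;

std::unordered_set<std::string> GetDescendants(const std::string& node,
                                               const ChildMap& children,
                                               DescCache& cache)
{
    // Check the cache for this node *before* expanding its children, so a
    // chain t1 -> t2 -> t3 reuses t2's cached set when computing t1's set.
    if (auto it = cache.find(node); it != cache.end()) return it->second;

    std::unordered_set<std::string> result;
    if (auto kids = children.find(node); kids != children.end()) {
        for (const auto& child : kids->second) {
            result.insert(child);
            // Recursing hits the cache check above for each child, rather
            // than only getting a hit two generations down.
            for (const auto& d : GetDescendants(child, children, cache)) {
                result.insert(d);
            }
        }
    }
    cache.emplace(node, result);
    return result;
}
```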
```cpp
if (!(descendants.size() == i+2)) {
    std::swap(descendants[i+1], descendants.back());
}
```
IIUC, you're swapping here in order to insert all of the descendants between this child and the next entry in the descendants vector, so you can just increment i to skip to the next child in the same generation. Note that you're still moving everything downwards n times, where n is the number of cached descendants.
Approach-wise, perhaps a std::deque is more appropriate if you really want constant-time insert at an arbitrary position. Also, I'm not too familiar with the implementation of std::vector, but I feel like it should be optimized enough for you to feel okay using insert(i+1) without using swaps. It might even do that in the background.
Also, please add a comment, because it was not immediately obvious that you're doing swaps to maintain ordering.
I think you have it kinda backwards: we do not care about ordering whatsoever; we are keeping a "partition" of processed and unprocessed elements.
Essentially we have a queue with processed and unprocessed elements, and a pointer to where we should process next:

```
// start
[P1 P2 P3 P4 *U5 U6 U7 U8]
// process element
[P1 P2 P3 P4 P5 *U6 U7 U8]
// insert processed element using push_back and swap
[P1 P2 P3 P4 P5 *U6 U7 U8 P9]
[P1 P2 P3 P4 P5 *P9 U7 U8 U6]
[P1 P2 P3 P4 P5 P9 *U7 U8 U6]
// insert unprocessed element
[P1 P2 P3 P4 P5 P9 *U7 U8 U6 U10]
```

If we were to try to do the same with inserts, it would cause N^2 behavior. As you can see, we also do not preserve order with the swap-to-back approach.
If we were to do an insert, it would look like:

```
[P1 P2 P3 P4 P5 *U6 U7 U8]
// insert processed element using insert
[P1 P2 P3 P4 P5 P9 *U6 U7 U8]
```

and shifting U6..U8 down would cost O(N).
Deque is a good data structure, but it has an awful lot of extra overhead, making it a poor fit for a performance-critical code section. One of the key benefits of this approach is that we keep our data structure quick to iterate over and add to.
The swap approach is O(1) per element added, which is fine. If we used insert/shifting it would be O(N) per element, which would cause a quadratic blowup.
By "preserve ordering", I merely meant that you're keeping the unprocessed elements behind the processed ones. We're saying the same thing, sorry for being unclear.

> If we were to try to do the same with inserts, it would cause N^2 behavior.

Sure, this is the case if you're shifting everything per-insert inside the vector. This is why I suggested using a deque, where it's O(1) per element. It'd be the same performance as what you're doing here, except you just call insert instead of swapping. But my concerns with the complexity/readability of the code will be gone if you just comment what you're doing here.
Yeah, deque is the same complexity, but it's not the same performance, so it makes sense to continue using a vector even if we have to do a little bit of index management. Can add a comment.
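For concreteness, a minimal self-contained sketch of the swap-to-back partition described above (the element type and the "discovery" rule are made up for illustration; this is not the PR's actual code):

```cpp
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

int main()
{
    // Invariant: work[0..next) are processed, work[next..) are unprocessed.
    std::vector<int> work{1, 2, 3, 4};
    std::size_t next = 0;

    while (next < work.size()) {
        int cur = work[next];
        ++next;  // cur has moved into the processed region

        // Pretend that processing small elements discovers one new element,
        // and that even discoveries are "already processed" (standing in
        // for the cached descendants in the real code).
        if (cur < 5) {
            int discovered = cur + 10;
            work.push_back(discovered);              // O(1) append
            if (discovered % 2 == 0) {
                std::swap(work[next], work.back());  // O(1) swap into the
                ++next;                              // processed region
            }
            // Odd discoveries simply stay at the back, unprocessed.
        }
    }
    for (int v : work) std::printf("%d ", v);  // partition was maintained
    std::printf("\n");
}
```

Every append and swap is O(1), so the pass stays linear in the number of elements ever enqueued, while the contiguous vector keeps iteration cache-friendly.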
```cpp
size_t i = 0;
for (const auto& child : children) {
    descendants[i] = mapTx.iterator_to(child);
    ++i;
}
```
What is the purpose of i here? You're re-assigning it later anyway.
Also, why can't you just do a std::transform from children to populate descendants?
`i` is needed to index `descendants`, which is allocated up front.
It could be a std::transform, I suppose, but I don't think std::transform provides a readability improvement here.
> i is needed to index descendants which is allocated up front

Just push_back, and you don't need to allocate anything.
If I do a call to reserve instead of the sized constructor (https://en.cppreference.com/w/cpp/container/vector/vector), that would work without extra allocations.
The sized constructor + default insert + indexed insert is a bit better IMO, because push_back/emplace_back have some overhead from updating the vector's bookkeeping that the simple indexed insert above does not have, but it's all something that would need measuring, as the compiler probably does an OK job at it.
Do you have a strong preference around this?
```cpp
// Note: the below contains code which does some hacks to keep memory tight.
// It could be improved in the future to detect if the vector is already tight
// and then directly move it to cachedDescendants. For simplicity, we just
// do a copy for now.
// ...
// emplace into a new vector to guarantee we trim memory
const auto& it = cachedDescendants.emplace(std::piecewise_construct,
                                           std::forward_as_tuple(updateIt),
                                           std::forward_as_tuple(descendants.begin(), included_upto));
// swap with descendants to release it early!
std::vector<txiter>().swap(descendants);
```
Would using std::shrink_to_fit() (https://www.cplusplus.com/reference/vector/vector/shrink_to_fit/) "guarantee we trim" or no?
Also I'm curious as to why it's necessary to free this memory here. We don't need to allocate anything in the rest of the function, and descendants goes out of scope when we return.
From what you cited:

> The request is non-binding, and the container implementation is free to optimize otherwise and leave the vector with a capacity greater than its size.

Ideally we would be happy trusting shrink_to_fit, but I'd rather get exactly a trimmed vector here.
We do allocate (descendants_to_remove.insert), so it's more cautious to free it explicitly once it will never be used again than to wait until later.
Sounds good.
And my other question: what's the point of releasing descendants early?
> we do allocate (descendants_to_remove.insert), so it's more cautious to free it explicitly once it will never be used again than to wait until later.
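A small self-contained demonstration of the distinction being discussed (illustrative only, not the PR's code): shrink_to_fit is a non-binding request, whereas constructing a fresh vector from the range gives a tight allocation in practice, and the empty-vector swap releases the old buffer immediately.

```cpp
#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> v;
    v.reserve(1000);
    for (int i = 0; i < 10; ++i) v.push_back(i);
    std::printf("before: size=%zu capacity=%zu\n", v.size(), v.capacity());

    // Non-binding: the implementation may ignore this and keep the capacity.
    v.shrink_to_fit();
    std::printf("after shrink_to_fit: capacity=%zu (implementation-defined)\n",
                v.capacity());

    // Building a fresh vector from a range in practice allocates exactly
    // what is needed: this is the "copy into a new vector" the PR comment
    // describes.
    std::vector<int> trimmed(v.begin(), v.end());
    std::printf("trimmed: size=%zu capacity=%zu\n",
                trimmed.size(), trimmed.capacity());

    // Swapping with an empty temporary releases v's buffer immediately,
    // rather than waiting until v goes out of scope.
    std::vector<int>().swap(v);
    std::printf("after swap: size=%zu capacity=%zu\n", v.size(), v.capacity());
}
```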
```cpp
// initialize to hold all direct children
// children guaranteed to be unique at this point
std::vector<txiter> descendants(children.size());
```
Actually, it might make more sense to size this to the children's aggregated descendant counts. You could overestimate, but there would most likely be even fewer resizes.
This is a good point, but how do we know what that would be in advance of iterating? I figured we do not.
🐙 This pull request conflicts with the target branch and needs rebase. Want to unsubscribe from rebase notifications on this pull request? Just convert this pull request to a "draft".
@JeremyRubin as discussed offline, closing this for now and assigning to @stickies-v. |
This is a follow-up PR to #21464, improving the memory usage and runtime of the reorg update logic. The main changes are: