Make prevector::resize() and other prevector operations much faster #12549

eklitzke · 2018-02-27T04:16:40Z

This branch optimizes various prevector operations, especially resizing vectors. While profiling the loadblk thread I noticed that a lot of time was being spent in prevector::resize() which led to this work. I have some data here indicating that it takes up 37% of the time in ReadBlockFromDisk(): https://monad.io/readblockfromdisk.svg

This branch improves things significantly. For trivial types, the new results for the prevector benchmark are:

PrevectorClearTrivial which tests prevector::clear() becomes 24.6x faster
PrevectorDestructorTrivial which tests prevector::~prevector() becomes 20.5x faster
PrevectorResizeTrivial which tests prevector::resize() becomes 20.3x faster

Note that in practice it looks like the prevector is only used to contain unsigned char types, which is a trivial type. The benchmarks are testing a bit of an extreme case, but the changes here are motivated by the profiling data for ReadBlockFromDisk() I linked to above.

The pull request here consists of a series of three commits:

The first adds new benchmarks but does not change the prevector code.
The second is from @AkioNak , and merges some prevector optimizations he submitted in Reduce redundant code of prevector and speed it up #11988
The third optimizes prevector::resize() to use memset() when the prevector contains trivially constructible types

meshcollider · 2018-02-27T04:25:41Z

Nice speedup! Concept ACK

kallewoof · 2018-02-27T04:37:45Z

This exists in #11988 no?

eklitzke · 2018-02-27T04:40:57Z

It looks like this branch is failing because std::is_trivially_constructible wasn't added until GCC 5 (according to https://www.gnu.org/software/gcc/gcc-5/changes.html ), even though it is part of C++11. I can find a workaround for GCC 4.8.

This isn't quite the same as #11988 because that still uses placement new.

kallewoof · 2018-02-27T04:48:31Z

If it's not hugely different consider reviewing #11988 and suggesting changes to make it better instead.

eklitzke · 2018-02-27T04:52:45Z

That's a good idea, I'll see if I can cherry-pick @AkioNak's commit into my branch while I'm testing the GCC 4.8 compatibility issues.

eklitzke · 2018-02-27T06:46:46Z

This new branch cherry-picks @AkioNak 's commit from #11988 and then adds my memset optimization. I constructed this PR such that it has three commits:

First commit just adds new benchmarks
Second commit is from Reduce redundant code of prevector and speed it up #11988 (unmodified, preserves authorship)
Third commit implements my memset() optimization

I have a script that benchmarks the performance for each of the three commits. Here's the summarized data. The first commit is considered to have a baseline speedup of 1x, and then each of the subsequent commits shows the speedup from the base:

34d238     PrevectorClear                 1.00x
34d238     PrevectorDestructor            1.00x
34d238     PrevectorResizeNontrivial      1.00x
34d238     PrevectorResizeTrivial         1.00x
d1d28e     PrevectorClear                 2.36x
d1d28e     PrevectorDestructor            2.16x
d1d28e     PrevectorResizeNontrivial      3.24x
d1d28e     PrevectorResizeTrivial         2.76x
f8b078     PrevectorClear                 8.31x
f8b078     PrevectorDestructor            7.62x
f8b078     PrevectorResizeNontrivial      6.56x
f8b078     PrevectorResizeTrivial         18.99x

Here's the raw data I generated the above from:

git 34d2387b1d5c753d6ba130eb6b4730dcc15072c5
# Benchmark, evals, iterations, total, min, max, median
PrevectorClear, 5, 5600, 3.47561, 0.00012317, 0.000125752, 0.000123494
PrevectorDestructor, 5, 5700, 3.2223, 0.000112914, 0.000113208, 0.000113068
PrevectorResizeNontrivial, 5, 4100, 3.24072, 0.000158005, 0.000158212, 0.000158075
PrevectorResizeTrivial, 5, 4900, 3.24407, 0.000131094, 0.000133749, 0.000132495

git d1d28e72154a9f2e7fb56001f2b790074d1cfa30
# Benchmark, evals, iterations, total, min, max, median
PrevectorClear, 5, 5600, 1.4682, 5.23405e-05, 5.2611e-05, 5.24265e-05
PrevectorDestructor, 5, 5700, 1.49587, 5.24034e-05, 5.26179e-05, 5.2448e-05
PrevectorResizeNontrivial, 5, 4100, 1.00747, 4.87854e-05, 5.04511e-05, 4.88389e-05
PrevectorResizeTrivial, 5, 4900, 1.17581, 4.79055e-05, 4.81338e-05, 4.79753e-05

git f8b078c8cda6d4ac6e3b112ce605b08f6ac002b9
# Benchmark, evals, iterations, total, min, max, median
PrevectorClear, 5, 5600, 0.419837, 1.48182e-05, 1.55947e-05, 1.48556e-05
PrevectorDestructor, 5, 5700, 0.422994, 1.48187e-05, 1.48808e-05, 1.48327e-05
PrevectorResizeNontrivial, 5, 4100, 0.494272, 2.40686e-05, 2.41813e-05, 2.4084e-05
PrevectorResizeTrivial, 5, 4900, 0.171037, 6.89859e-06, 7.06054e-06, 6.97555e-06

Empact · 2018-02-27T07:35:13Z

src/prevector.h

Maybe use ptrdiff_t here and in fill? Would allow you to assert against the negative case rather than have a very large value.

Not that the negative case is possible with the current code, but it could be protective going forward.

AkioNak · 2018-02-27T14:27:21Z

src/bench/prevector.cpp

I think diganostic message can not be omitted on c++11.

AkioNak · 2018-02-27T14:28:06Z

src/bench/prevector.cpp

I think diganostic message can not be omitted on c++11.

laanwj · 2018-02-27T14:57:43Z

Ping @sipa who wrote this code.

eklitzke · 2018-02-27T17:30:00Z

I've addressed the feedback I've gotten so far, and expanded the other benchmarks to test both trivial and non-trivial types.

One thing that I also found interesting while doing this change was the way the current code is written to not use SFINAE. On GCC 7.3 (what I have locally) the compiler is smart enough to do the type checks at compile time and not generate runtime checks, meaning it generates optimal code (i.e. the same code it would generate using SFINAE). On GCC 4.8 it's smart enough to do that for trivial types, but does not generate optimal code in the nontrivial path. This is probably fine because the main use of prevector is to hold unsigned char (a trivially constructible and destructible type).

sipa · 2018-02-27T18:55:18Z

I believe your first commit won't compile on GCC 4.8. Also, if you're going to use IS_TRIVIALLY_CONSTRUCTIBLE outside of prevector.h, perhaps it's better to move it elsewhere (like util.h or compat.h)?

eklitzke · 2018-02-27T19:03:55Z

Good catch, I only tested the last commit. I'll rebase this branch so all of the commits build cleanly, and move the macro definition.

This prepares for a series of two additional commits which optimize prevector performance.

In prevector.h, the code which like item_ptr(size()) apears in the loop. Both item_ptr() and size() judge whether values are held directly or indirectly, but in most cases it is sufficient to make that judgement once outside the loop. This PR adds 2 private function fill() which has the loop to initialize by specified value (or iterator of the other prevector's element), but don't call item_ptr() in their loop. Other functions(assign(), constructor, operator=(), insert()) that has similar loop, call fill() instead of original loop. Also, resize() was changed like fill(), but it calls the default constructor for that element each time.

eklitzke · 2018-02-27T20:49:09Z

Here are the latest results, tested under both GCC 7.3 and GCC 4.8. For both compilers I ran the full test suite and benchmarks for each of the three commits in this PR.

GCC 4.8

Commit     Test                                     Median (s)   Speedup
------------------------------------------------------------------------
f0e7aa     PrevectorClearNontrivial                 8.15034e-05   1.00x
f0e7aa     PrevectorClearTrivial                    1.49740e-04   1.00x
f0e7aa     PrevectorDestructorNontrivial            8.27660e-05   1.00x
f0e7aa     PrevectorDestructorTrivial               1.48346e-04   1.00x
f0e7aa     PrevectorResizeNontrivial                8.26464e-05   1.00x
f0e7aa     PrevectorResizeTrivial                   1.50579e-04   1.00x
e46be2     PrevectorClearNontrivial                 2.59100e-05   3.15x
e46be2     PrevectorClearTrivial                    4.80695e-05   3.12x
e46be2     PrevectorDestructorNontrivial            2.54231e-05   3.26x
e46be2     PrevectorDestructorTrivial               4.80365e-05   3.09x
e46be2     PrevectorResizeNontrivial                2.54693e-05   3.24x
e46be2     PrevectorResizeTrivial                   4.82063e-05   3.12x
3f9abf     PrevectorClearNontrivial                 2.60759e-05   3.13x
3f9abf     PrevectorClearTrivial                    6.08490e-06  24.61x
3f9abf     PrevectorDestructorNontrivial            2.55872e-05   3.23x
3f9abf     PrevectorDestructorTrivial               6.09886e-06  24.32x
3f9abf     PrevectorResizeNontrivial                2.56024e-05   3.23x
3f9abf     PrevectorResizeTrivial                   6.15214e-06  24.48x

GCC 7.3

Commit     Test                                     Median (s)   Speedup
------------------------------------------------------------------------
f0e7aa     PrevectorClearNontrivial                 1.58989e-04   1.00x
f0e7aa     PrevectorClearTrivial                    1.31495e-04   1.00x
f0e7aa     PrevectorDestructorNontrivial            1.58322e-04   1.00x
f0e7aa     PrevectorDestructorTrivial               1.31456e-04   1.00x
f0e7aa     PrevectorResizeNontrivial                1.58518e-04   1.00x
f0e7aa     PrevectorResizeTrivial                   1.29851e-04   1.00x
e46be2     PrevectorClearNontrivial                 4.89286e-05   3.25x
e46be2     PrevectorClearTrivial                    4.81217e-05   2.73x
e46be2     PrevectorDestructorNontrivial            4.94412e-05   3.20x
e46be2     PrevectorDestructorTrivial               4.79828e-05   2.74x
e46be2     PrevectorResizeNontrivial                4.89689e-05   3.24x
e46be2     PrevectorResizeTrivial                   4.80065e-05   2.70x
3f9abf     PrevectorClearNontrivial                 2.59529e-05   6.13x
3f9abf     PrevectorClearTrivial                    6.39799e-06  20.55x
3f9abf     PrevectorDestructorNontrivial            2.58583e-05   6.12x
3f9abf     PrevectorDestructorTrivial               6.40997e-06  20.51x
3f9abf     PrevectorResizeNontrivial                2.56362e-05   6.18x
3f9abf     PrevectorResizeTrivial                   6.40775e-06  20.26x

It's interesting that GCC 4.8 generates better code in some cases. Anyway, hopefully this is enough data to convince people that this is a good change.

JeremyRubin · 2018-02-27T20:54:23Z

utACK

This is a good optimization @AkioNak / @eklitzke! I didn't have a hack for getting cross platform support for trivial constructors when I wrote #9505, so glad to see trivial types handled more optimally now (noting this to clarify that there was no other reason not to do this that I came across in #9505)

I would only nit that you should amend the line:

prevector::~prevector() for trivial types becomes 7.6x faster (most of its time was spent in resize())

to make it clear you are only referring to the Benchmark, not to the actual destructor (there should be no performance change under this patchset, and it would be good for you to demonstrate that -- ironically it means I probably under-reported the performance advantage when I added the trivial_destructor commit, because I counted the resizes).

Empact · 2018-02-27T21:00:55Z

src/compat.h

Is this redundant with the same in prevector.h?

Good catch, pushing a rebase that fixes this.

Further optimize prevector::resize() (which is called by a number of other prevector methods) to use memset to initialize memory when the prevector contains trivial types.

sipa · 2018-02-27T21:44:49Z

utACK 5aad635. Those benchmark numbers look very nice; thanks @AkioNak as well (sorry I never got around to reviewing your previous PR).

eklitzke · 2018-02-27T23:18:35Z

Updated profile for ReadBlockFromDisk() with these changes: https://monad.io/readblockfromdisk2.svg . The time spent waiting on prevector::resize() has dropped from about 32% of the total time to 4% of the total time.

AkioNak · 2018-02-27T23:55:54Z

@sipa please never mind.

laanwj · 2018-03-01T11:08:52Z

utACK 5aad635

@AkioNak

… much faster 5aad635 Use memset() to optimize prevector::resize() (Evan Klitzke) e46be25 Reduce redundant code of prevector and speed it up (Akio Nakamura) f0e7aa7 Add new prevector benchmarks. (Evan Klitzke) Pull request description: This branch optimizes various `prevector` operations, especially resizing vectors. While profiling the `loadblk` thread I noticed that a lot of time was being spent in `prevector::resize()` which led to this work. I have some data here indicating that it takes up **37%** of the time in `ReadBlockFromDisk()`: https://monad.io/readblockfromdisk.svg This branch improves things significantly. For trivial types, the new results for the prevector benchmark are: * `PrevectorClearTrivial` which tests `prevector::clear()` becomes 24.6x faster * `PrevectorDestructorTrivial` which tests `prevector::~prevector()` becomes 20.5x faster * `PrevectorResizeTrivial` which tests `prevector::resize()` becomes 20.3x faster Note that in practice it looks like the prevector is only used to contain `unsigned char` types, which is a trivial type. The benchmarks are testing a bit of an extreme case, but the changes here are motivated by the profiling data for `ReadBlockFromDisk()` I linked to above. The pull request here consists of a series of three commits: * The first adds new benchmarks but does not change the prevector code. * The second is from @AkioNak , and merges some prevector optimizations he submitted in #11988 * The third optimizes `prevector::resize()` to use `memset()` when the prevector contains trivially constructible types Tree-SHA512: 28f7cbb91a19f9f43b6a5942781d7eb2e3197389186b666f086b69df12bee37773140f765426d715bfb8ebff79cb27a5f1206d0325b54b4aa65598b50fb18368

eklitzke · 2018-03-01T20:50:45Z

I just discovered this causes a warning in GCC 8, which added a new -Wclass-memaccess warning. The warning is spurious/incorrect, and GCC 8 is still not released as stable so I'm effectively testing a pre-release compiler. I filed a bug upstream: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84656

If GCC 8 is released without fixing this we can add an explicit void* cast in the memset() call to suppress the warning. The other option would be to SFINAE the entire class (more work).

@AkioNak

…rations much faster 5aad635 Use memset() to optimize prevector::resize() (Evan Klitzke) e46be25 Reduce redundant code of prevector and speed it up (Akio Nakamura) f0e7aa7 Add new prevector benchmarks. (Evan Klitzke) Pull request description: This branch optimizes various `prevector` operations, especially resizing vectors. While profiling the `loadblk` thread I noticed that a lot of time was being spent in `prevector::resize()` which led to this work. I have some data here indicating that it takes up **37%** of the time in `ReadBlockFromDisk()`: https://monad.io/readblockfromdisk.svg This branch improves things significantly. For trivial types, the new results for the prevector benchmark are: * `PrevectorClearTrivial` which tests `prevector::clear()` becomes 24.6x faster * `PrevectorDestructorTrivial` which tests `prevector::~prevector()` becomes 20.5x faster * `PrevectorResizeTrivial` which tests `prevector::resize()` becomes 20.3x faster Note that in practice it looks like the prevector is only used to contain `unsigned char` types, which is a trivial type. The benchmarks are testing a bit of an extreme case, but the changes here are motivated by the profiling data for `ReadBlockFromDisk()` I linked to above. The pull request here consists of a series of three commits: * The first adds new benchmarks but does not change the prevector code. * The second is from @AkioNak , and merges some prevector optimizations he submitted in bitcoin#11988 * The third optimizes `prevector::resize()` to use `memset()` when the prevector contains trivially constructible types Tree-SHA512: 28f7cbb91a19f9f43b6a5942781d7eb2e3197389186b666f086b69df12bee37773140f765426d715bfb8ebff79cb27a5f1206d0325b54b4aa65598b50fb18368

d856989 warnings: Compiler warning on memset usage for non-trivial type (Lenny Maiorani) 4cb188e Drop defunct IS_TRIVIALLY_CONSTRUCTIBLE handling from prevector.h (Ben Woosley) c412d5f Reduce redundant code of prevector and speed it up (random-zebra) e6741d0 Add new prevector benchmarks. (random-zebra) Pull request description: Backports from Bitcoin a series of optimizations for `prevector` operations, especially around resizing, which takes a significative portion of the time spent by of `ReadBlockFromDisk()`. Running the benchmark tool before the patch: ``` #Benchmark, count, min, max, avg, minc, maxc, avgc ----------- ------ ---- ---- ---- ----- ----- ---- PrevectorClearTrivial, 3328, 312631, 329431, 320311, 812854, 856536, 832823 PrevectorDestructorTrivial, 3328, 317011, 353344, 325141, 824243, 918711, 845382 PrevectorResizeTrivial, 3328, 321333, 327672, 324713, 835482, 851963, 844269 ``` and after: ``` #Benchmark, count, min, max, avg, minc, maxc, avgc ----------- ------ ---- ---- ---- ----- ----- ---- PrevectorClearTrivial, 65536, 15726, 16368, 16067, 40889, 42559, 41775 PrevectorDestructorTrivial, 65536, 15865, 16416, 16141, 41249, 42684, 41968 PrevectorResizeTrivial, 65536, 16036, 16320, 16213, 41696, 42433, 42156 ``` shows that `prevector::clear()`, `prevector::~prevector()`, and `prevector::resize()` become (on avg) **20X** faster. Based on: - bitcoin#12549 - bitcoin#14651 - bitcoin#14715 ACKs for top commit: Fuzzbawls: utACK d856989 furszy: Quite cool. utACK d856989 Tree-SHA512: 8060458bfd5373448d83a3d905fbc95379cfc1c1aed19b750c98b06a1a6a4ba85a6715865388752a5bd579676e2b2e26d3584ec5b26d51e9ba1380bc4fa73f33

eklitzke force-pushed the prevector branch from da95b7b to f8b078c Compare February 27, 2018 06:42

eklitzke mentioned this pull request Feb 27, 2018

Reduce redundant code of prevector and speed it up #11988

Closed

Empact reviewed Feb 27, 2018

View reviewed changes

AkioNak reviewed Feb 27, 2018

View reviewed changes

src/bench/prevector.cpp Outdated

Copy link

Contributor

AkioNak Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think diganostic message can not be omitted on c++11.

AkioNak reviewed Feb 27, 2018

View reviewed changes

src/bench/prevector.cpp Outdated

Copy link

Contributor

AkioNak Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think diganostic message can not be omitted on c++11.

eklitzke force-pushed the prevector branch 3 times, most recently from 3bc0cad to cb1dad1 Compare February 27, 2018 17:26

eklitzke and others added 2 commits February 27, 2018 11:42

Add new prevector benchmarks.

f0e7aa7

This prepares for a series of two additional commits which optimize prevector performance.

eklitzke force-pushed the prevector branch from cb1dad1 to 3f9abf4 Compare February 27, 2018 20:32

Empact reviewed Feb 27, 2018

View reviewed changes

Use memset() to optimize prevector::resize()

5aad635

Further optimize prevector::resize() (which is called by a number of other prevector methods) to use memset to initialize memory when the prevector contains trivial types.

eklitzke force-pushed the prevector branch from 3f9abf4 to 5aad635 Compare February 27, 2018 21:28

fanquake added the Refactoring label Feb 28, 2018

laanwj merged commit 5aad635 into bitcoin:master Mar 1, 2018

eklitzke deleted the prevector branch March 4, 2018 06:55

AkioNak mentioned this pull request Jun 20, 2019

speed up Unserialize_impl for prevector #12324

Merged

sickpig mentioned this pull request Mar 2, 2020

[port] update prevector.h BitcoinUnlimited/BitcoinUnlimited#2109

Merged

random-zebra mentioned this pull request Dec 19, 2020

[Core] Prevector Optimizations PIVX-Project/PIVX#2083

Merged

bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021

Make prevector::resize() and other prevector operations much faster #12549

Make prevector::resize() and other prevector operations much faster #12549

Uh oh!

Conversation

eklitzke commented Feb 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

meshcollider commented Feb 27, 2018

Uh oh!

kallewoof commented Feb 27, 2018

Uh oh!

eklitzke commented Feb 27, 2018

Uh oh!

kallewoof commented Feb 27, 2018

Uh oh!

eklitzke commented Feb 27, 2018

Uh oh!

eklitzke commented Feb 27, 2018

Uh oh!

Empact Feb 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AkioNak Feb 27, 2018

Choose a reason for hiding this comment

Uh oh!

AkioNak Feb 27, 2018

Choose a reason for hiding this comment

Uh oh!

laanwj commented Feb 27, 2018

Uh oh!

eklitzke commented Feb 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sipa commented Feb 27, 2018

Uh oh!

eklitzke commented Feb 27, 2018

Uh oh!

eklitzke commented Feb 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

GCC 4.8

GCC 7.3

Uh oh!

JeremyRubin commented Feb 27, 2018

Uh oh!

Empact Feb 27, 2018

Choose a reason for hiding this comment

Uh oh!

eklitzke Feb 27, 2018

Choose a reason for hiding this comment

Uh oh!

sipa commented Feb 27, 2018

Uh oh!

eklitzke commented Feb 27, 2018

Uh oh!

AkioNak commented Feb 27, 2018

Uh oh!

laanwj commented Mar 1, 2018

Uh oh!

eklitzke commented Mar 1, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

eklitzke commented Feb 27, 2018 •

edited

Loading

Empact Feb 27, 2018 •

edited

Loading

eklitzke commented Feb 27, 2018 •

edited

Loading

eklitzke commented Feb 27, 2018 •

edited

Loading