Skip to content

Conversation

@eklitzke
Copy link
Contributor

@eklitzke eklitzke commented Feb 27, 2018

This branch optimizes various prevector operations, especially resizing vectors. While profiling the loadblk thread I noticed that a lot of time was being spent in prevector::resize() which led to this work. I have some data here indicating that it takes up 37% of the time in ReadBlockFromDisk(): https://monad.io/readblockfromdisk.svg

This branch improves things significantly. For trivial types, the new results for the prevector benchmark are:

  • PrevectorClearTrivial which tests prevector::clear() becomes 24.6x faster
  • PrevectorDestructorTrivial which tests prevector::~prevector() becomes 20.5x faster
  • PrevectorResizeTrivial which tests prevector::resize() becomes 20.3x faster

Note that in practice it looks like the prevector is only used to contain unsigned char types, which is a trivial type. The benchmarks are testing a bit of an extreme case, but the changes here are motivated by the profiling data for ReadBlockFromDisk() I linked to above.

The pull request here consists of a series of three commits:

  • The first adds new benchmarks but does not change the prevector code.
  • The second is from @AkioNak , and merges some prevector optimizations he submitted in Reduce redundant code of prevector and speed it up #11988
  • The third optimizes prevector::resize() to use memset() when the prevector contains trivially constructible types

@meshcollider
Copy link
Contributor

Nice speedup! Concept ACK

@kallewoof
Copy link
Contributor

This exists in #11988 no?

@eklitzke
Copy link
Contributor Author

It looks like this branch is failing because std::is_trivially_constructible wasn't added until GCC 5 (according to https://www.gnu.org/software/gcc/gcc-5/changes.html ), even though it is part of C++11. I can find a workaround for GCC 4.8.

This isn't quite the same as #11988 because that still uses placement new.

@kallewoof
Copy link
Contributor

If it's not hugely different consider reviewing #11988 and suggesting changes to make it better instead.

@eklitzke
Copy link
Contributor Author

That's a good idea, I'll see if I can cherry-pick @AkioNak's commit into my branch while I'm testing the GCC 4.8 compatibility issues.

@eklitzke
Copy link
Contributor Author

This new branch cherry-picks @AkioNak 's commit from #11988 and then adds my memset optimization. I constructed this PR such that it has three commits:

I have a script that benchmarks the performance for each of the three commits. Here's the summarized data. The first commit is considered to have a baseline speedup of 1x, and then each of the subsequent commits shows the speedup from the base:

34d238     PrevectorClear                 1.00x
34d238     PrevectorDestructor            1.00x
34d238     PrevectorResizeNontrivial      1.00x
34d238     PrevectorResizeTrivial         1.00x
d1d28e     PrevectorClear                 2.36x
d1d28e     PrevectorDestructor            2.16x
d1d28e     PrevectorResizeNontrivial      3.24x
d1d28e     PrevectorResizeTrivial         2.76x
f8b078     PrevectorClear                 8.31x
f8b078     PrevectorDestructor            7.62x
f8b078     PrevectorResizeNontrivial      6.56x
f8b078     PrevectorResizeTrivial         18.99x

Here's the raw data I generated the above from:

git 34d2387b1d5c753d6ba130eb6b4730dcc15072c5
# Benchmark, evals, iterations, total, min, max, median
PrevectorClear, 5, 5600, 3.47561, 0.00012317, 0.000125752, 0.000123494
PrevectorDestructor, 5, 5700, 3.2223, 0.000112914, 0.000113208, 0.000113068
PrevectorResizeNontrivial, 5, 4100, 3.24072, 0.000158005, 0.000158212, 0.000158075
PrevectorResizeTrivial, 5, 4900, 3.24407, 0.000131094, 0.000133749, 0.000132495

git d1d28e72154a9f2e7fb56001f2b790074d1cfa30
# Benchmark, evals, iterations, total, min, max, median
PrevectorClear, 5, 5600, 1.4682, 5.23405e-05, 5.2611e-05, 5.24265e-05
PrevectorDestructor, 5, 5700, 1.49587, 5.24034e-05, 5.26179e-05, 5.2448e-05
PrevectorResizeNontrivial, 5, 4100, 1.00747, 4.87854e-05, 5.04511e-05, 4.88389e-05
PrevectorResizeTrivial, 5, 4900, 1.17581, 4.79055e-05, 4.81338e-05, 4.79753e-05

git f8b078c8cda6d4ac6e3b112ce605b08f6ac002b9
# Benchmark, evals, iterations, total, min, max, median
PrevectorClear, 5, 5600, 0.419837, 1.48182e-05, 1.55947e-05, 1.48556e-05
PrevectorDestructor, 5, 5700, 0.422994, 1.48187e-05, 1.48808e-05, 1.48327e-05
PrevectorResizeNontrivial, 5, 4100, 0.494272, 2.40686e-05, 2.41813e-05, 2.4084e-05
PrevectorResizeTrivial, 5, 4900, 0.171037, 6.89859e-06, 7.06054e-06, 6.97555e-06

src/prevector.h Outdated
Copy link
Contributor

@Empact Empact Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use ptrdiff_t here and in fill? Would allow you to assert against the negative case rather than have a very large value.

Not that the negative case is possible with the current code, but it could be protective going forward.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think diganostic message can not be omitted on c++11.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think diganostic message can not be omitted on c++11.

@laanwj
Copy link
Member

laanwj commented Feb 27, 2018

Ping @sipa who wrote this code.

@eklitzke eklitzke force-pushed the prevector branch 3 times, most recently from 3bc0cad to cb1dad1 Compare February 27, 2018 17:26
@eklitzke
Copy link
Contributor Author

eklitzke commented Feb 27, 2018

I've addressed the feedback I've gotten so far, and expanded the other benchmarks to test both trivial and non-trivial types.

One thing that I also found interesting while doing this change was the way the current code is written to not use SFINAE. On GCC 7.3 (what I have locally) the compiler is smart enough to do the type checks at compile time and not generate runtime checks, meaning it generates optimal code (i.e. the same code it would generate using SFINAE). On GCC 4.8 it's smart enough to do that for trivial types, but does not generate optimal code in the nontrivial path. This is probably fine because the main use of prevector is to hold unsigned char (a trivially constructible and destructible type).

@sipa
Copy link
Member

sipa commented Feb 27, 2018

I believe your first commit won't compile on GCC 4.8. Also, if you're going to use IS_TRIVIALLY_CONSTRUCTIBLE outside of prevector.h, perhaps it's better to move it elsewhere (like util.h or compat.h)?

@eklitzke
Copy link
Contributor Author

Good catch, I only tested the last commit. I'll rebase this branch so all of the commits build cleanly, and move the macro definition.

eklitzke and others added 2 commits February 27, 2018 11:42
This prepares for a series of two additional commits which optimize
prevector performance.
In prevector.h, the code which like item_ptr(size()) apears in the loop.
Both item_ptr() and size() judge whether values are held directly or
indirectly, but in most cases it is sufficient to make that judgement
once outside the loop.

This PR adds 2 private function fill() which has the loop to initialize
by specified value (or iterator of the other prevector's element),
but don't call item_ptr() in their loop.
Other functions(assign(), constructor, operator=(), insert())
that has similar loop, call fill() instead of original loop.

Also, resize() was changed like fill(), but it calls the default
constructor for that element each time.
@eklitzke
Copy link
Contributor Author

eklitzke commented Feb 27, 2018

Here are the latest results, tested under both GCC 7.3 and GCC 4.8. For both compilers I ran the full test suite and benchmarks for each of the three commits in this PR.

GCC 4.8

Commit     Test                                     Median (s)   Speedup
------------------------------------------------------------------------
f0e7aa     PrevectorClearNontrivial                 8.15034e-05   1.00x
f0e7aa     PrevectorClearTrivial                    1.49740e-04   1.00x
f0e7aa     PrevectorDestructorNontrivial            8.27660e-05   1.00x
f0e7aa     PrevectorDestructorTrivial               1.48346e-04   1.00x
f0e7aa     PrevectorResizeNontrivial                8.26464e-05   1.00x
f0e7aa     PrevectorResizeTrivial                   1.50579e-04   1.00x
e46be2     PrevectorClearNontrivial                 2.59100e-05   3.15x
e46be2     PrevectorClearTrivial                    4.80695e-05   3.12x
e46be2     PrevectorDestructorNontrivial            2.54231e-05   3.26x
e46be2     PrevectorDestructorTrivial               4.80365e-05   3.09x
e46be2     PrevectorResizeNontrivial                2.54693e-05   3.24x
e46be2     PrevectorResizeTrivial                   4.82063e-05   3.12x
3f9abf     PrevectorClearNontrivial                 2.60759e-05   3.13x
3f9abf     PrevectorClearTrivial                    6.08490e-06  24.61x
3f9abf     PrevectorDestructorNontrivial            2.55872e-05   3.23x
3f9abf     PrevectorDestructorTrivial               6.09886e-06  24.32x
3f9abf     PrevectorResizeNontrivial                2.56024e-05   3.23x
3f9abf     PrevectorResizeTrivial                   6.15214e-06  24.48x

GCC 7.3

Commit     Test                                     Median (s)   Speedup
------------------------------------------------------------------------
f0e7aa     PrevectorClearNontrivial                 1.58989e-04   1.00x
f0e7aa     PrevectorClearTrivial                    1.31495e-04   1.00x
f0e7aa     PrevectorDestructorNontrivial            1.58322e-04   1.00x
f0e7aa     PrevectorDestructorTrivial               1.31456e-04   1.00x
f0e7aa     PrevectorResizeNontrivial                1.58518e-04   1.00x
f0e7aa     PrevectorResizeTrivial                   1.29851e-04   1.00x
e46be2     PrevectorClearNontrivial                 4.89286e-05   3.25x
e46be2     PrevectorClearTrivial                    4.81217e-05   2.73x
e46be2     PrevectorDestructorNontrivial            4.94412e-05   3.20x
e46be2     PrevectorDestructorTrivial               4.79828e-05   2.74x
e46be2     PrevectorResizeNontrivial                4.89689e-05   3.24x
e46be2     PrevectorResizeTrivial                   4.80065e-05   2.70x
3f9abf     PrevectorClearNontrivial                 2.59529e-05   6.13x
3f9abf     PrevectorClearTrivial                    6.39799e-06  20.55x
3f9abf     PrevectorDestructorNontrivial            2.58583e-05   6.12x
3f9abf     PrevectorDestructorTrivial               6.40997e-06  20.51x
3f9abf     PrevectorResizeNontrivial                2.56362e-05   6.18x
3f9abf     PrevectorResizeTrivial                   6.40775e-06  20.26x

It's interesting that GCC 4.8 generates better code in some cases. Anyway, hopefully this is enough data to convince people that this is a good change.

@JeremyRubin
Copy link
Contributor

utACK

This is a good optimization @AkioNak / @eklitzke! I didn't have a hack for getting cross platform support for trivial constructors when I wrote #9505, so glad to see trivial types handled more optimally now (noting this to clarify that there was no other reason not to do this that I came across in #9505)

I would only nit that you should amend the line:

prevector::~prevector() for trivial types becomes 7.6x faster (most of its time was spent in resize())

to make it clear you are only referring to the Benchmark, not to the actual destructor (there should be no performance change under this patchset, and it would be good for you to demonstrate that -- ironically it means I probably under-reported the performance advantage when I added the trivial_destructor commit, because I counted the resizes).

src/compat.h Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this redundant with the same in prevector.h?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, pushing a rebase that fixes this.

Further optimize prevector::resize() (which is called by a number of
other prevector methods) to use memset to initialize memory when the
prevector contains trivial types.
@sipa
Copy link
Member

sipa commented Feb 27, 2018

utACK 5aad635. Those benchmark numbers look very nice; thanks @AkioNak as well (sorry I never got around to reviewing your previous PR).

@eklitzke
Copy link
Contributor Author

Updated profile for ReadBlockFromDisk() with these changes: https://monad.io/readblockfromdisk2.svg . The time spent waiting on prevector::resize() has dropped from about 32% of the total time to 4% of the total time.

@AkioNak
Copy link
Contributor

AkioNak commented Feb 27, 2018

@sipa please never mind.

@laanwj
Copy link
Member

laanwj commented Mar 1, 2018

utACK 5aad635

@laanwj laanwj merged commit 5aad635 into bitcoin:master Mar 1, 2018
laanwj added a commit that referenced this pull request Mar 1, 2018
… much faster

5aad635 Use memset() to optimize prevector::resize() (Evan Klitzke)
e46be25 Reduce redundant code of prevector and speed it up (Akio Nakamura)
f0e7aa7 Add new prevector benchmarks. (Evan Klitzke)

Pull request description:

  This branch optimizes various `prevector` operations, especially resizing vectors. While profiling the `loadblk` thread I noticed that a lot of time was being spent in `prevector::resize()` which led to this work. I have some data here indicating that it takes up **37%** of the time in `ReadBlockFromDisk()`: https://monad.io/readblockfromdisk.svg

  This branch improves things significantly. For trivial types, the new results for the prevector benchmark are:

   * `PrevectorClearTrivial` which tests `prevector::clear()` becomes 24.6x faster
   * `PrevectorDestructorTrivial` which tests `prevector::~prevector()` becomes 20.5x faster
   * `PrevectorResizeTrivial` which tests `prevector::resize()` becomes 20.3x faster

  Note that in practice it looks like the prevector is only used to contain `unsigned char` types, which is a trivial type. The benchmarks are testing a bit of an extreme case, but the changes here are motivated by the profiling data for `ReadBlockFromDisk()` I linked to above.

  The pull request here consists of a series of three commits:
   * The first adds new benchmarks but does not change the prevector code.
   * The second is from @AkioNak , and merges some prevector optimizations he submitted in #11988
   * The third optimizes `prevector::resize()` to use `memset()` when the prevector contains trivially constructible types

Tree-SHA512: 28f7cbb91a19f9f43b6a5942781d7eb2e3197389186b666f086b69df12bee37773140f765426d715bfb8ebff79cb27a5f1206d0325b54b4aa65598b50fb18368
@eklitzke
Copy link
Contributor Author

eklitzke commented Mar 1, 2018

I just discovered this causes a warning in GCC 8, which added a new -Wclass-memaccess warning. The warning is spurious/incorrect, and GCC 8 is still not released as stable so I'm effectively testing a pre-release compiler. I filed a bug upstream: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84656

If GCC 8 is released without fixing this we can add an explicit void* cast in the memset() call to suppress the warning. The other option would be to SFINAE the entire class (more work).

@eklitzke eklitzke deleted the prevector branch March 4, 2018 06:55
codablock pushed a commit to codablock/dash that referenced this pull request Oct 2, 2019
…rations much faster

5aad635 Use memset() to optimize prevector::resize() (Evan Klitzke)
e46be25 Reduce redundant code of prevector and speed it up (Akio Nakamura)
f0e7aa7 Add new prevector benchmarks. (Evan Klitzke)

Pull request description:

  This branch optimizes various `prevector` operations, especially resizing vectors. While profiling the `loadblk` thread I noticed that a lot of time was being spent in `prevector::resize()` which led to this work. I have some data here indicating that it takes up **37%** of the time in `ReadBlockFromDisk()`: https://monad.io/readblockfromdisk.svg

  This branch improves things significantly. For trivial types, the new results for the prevector benchmark are:

   * `PrevectorClearTrivial` which tests `prevector::clear()` becomes 24.6x faster
   * `PrevectorDestructorTrivial` which tests `prevector::~prevector()` becomes 20.5x faster
   * `PrevectorResizeTrivial` which tests `prevector::resize()` becomes 20.3x faster

  Note that in practice it looks like the prevector is only used to contain `unsigned char` types, which is a trivial type. The benchmarks are testing a bit of an extreme case, but the changes here are motivated by the profiling data for `ReadBlockFromDisk()` I linked to above.

  The pull request here consists of a series of three commits:
   * The first adds new benchmarks but does not change the prevector code.
   * The second is from @AkioNak , and merges some prevector optimizations he submitted in bitcoin#11988
   * The third optimizes `prevector::resize()` to use `memset()` when the prevector contains trivially constructible types

Tree-SHA512: 28f7cbb91a19f9f43b6a5942781d7eb2e3197389186b666f086b69df12bee37773140f765426d715bfb8ebff79cb27a5f1206d0325b54b4aa65598b50fb18368
furszy added a commit to PIVX-Project/PIVX that referenced this pull request Dec 20, 2020
d856989 warnings: Compiler warning on memset usage for non-trivial type (Lenny Maiorani)
4cb188e Drop defunct IS_TRIVIALLY_CONSTRUCTIBLE handling from prevector.h (Ben Woosley)
c412d5f Reduce redundant code of prevector and speed it up (random-zebra)
e6741d0 Add new prevector benchmarks. (random-zebra)

Pull request description:

  Backports from Bitcoin a series of optimizations for `prevector` operations, especially around resizing, which takes a significative portion of the time spent by of `ReadBlockFromDisk()`.

  Running the benchmark tool before the patch:
  ```
  #Benchmark,                    count,  min,    max,    avg,    minc,   maxc,   avgc
  -----------                    ------  ----    ----    ----    -----   -----   ----
  PrevectorClearTrivial,         3328,   312631, 329431, 320311, 812854, 856536, 832823
  PrevectorDestructorTrivial,    3328,   317011, 353344, 325141, 824243, 918711, 845382
  PrevectorResizeTrivial,        3328,   321333, 327672, 324713, 835482, 851963, 844269
  ```

  and after:
  ```
  #Benchmark,                    count, min,    max,    avg,    minc,   maxc,   avgc
  -----------                    ------ ----    ----    ----    -----   -----   ----
  PrevectorClearTrivial,         65536, 15726,  16368,  16067,  40889,  42559,  41775
  PrevectorDestructorTrivial,    65536, 15865,  16416,  16141,  41249,  42684,  41968
  PrevectorResizeTrivial,        65536, 16036,  16320,  16213,  41696,  42433,  42156
  ```

  shows that `prevector::clear()`, `prevector::~prevector()`, and `prevector::resize()` become (on avg) **20X** faster.

  Based on:
  - bitcoin#12549
  - bitcoin#14651
  - bitcoin#14715

ACKs for top commit:
  Fuzzbawls:
    utACK d856989
  furszy:
    Quite cool. utACK d856989

Tree-SHA512: 8060458bfd5373448d83a3d905fbc95379cfc1c1aed19b750c98b06a1a6a4ba85a6715865388752a5bd579676e2b2e26d3584ec5b26d51e9ba1380bc4fa73f33
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants