Avoid integer division in the benchmark inner-most loop. #8115

gmaxwell · 2016-05-29T01:51:40Z

Previously the benchmark code used an integer division (%) with
a non-constant in the inner-loop. This is quite slow on many
processors, especially ones like ARM that lack a hardware divide.

Even on fairly recent x86_64 like haswell an integer division can
take something like 100 cycles-- making it comparable to the
runtime of siphash.

This change avoids the division by using bitmasking instead. This
was especially easy since the count was only increased by doubling.

This change also restarts the timing when the execution time was
very low this avoids mintimes of zero in cases where one execution
ends up below the timer resolution. It also reduces the impact of
the overhead on the final result.

The formatting of the prints is changed to not use scientific
notation make it more machine readable (in particular, gnuplot
croaks on the non-fixedpoint, and it doesn't sort correctly).

This also hoists out all the floating point divisions out of the
semi-hot path because it was easy to do so.

It might be prudent to break out the critical test into a macro
just to guarantee that it gets inlined. It might also make sense
to just save out the intermediate counts and times and get the
floating point completely out of the timing loop (because e.g.
on hardware without a fast hardware FPU like some ARM it will
still be slow enough to distort the results). I haven't done
either of these in this commit.

fanquake · 2016-05-29T07:46:59Z

Master

Benchmark,count,min,max,average
RIPEMD160,415,0.00242996,0.00253913,0.00246098
RollingBloom-refresh,1,0.000801,0.000801,0.000801
RollingBloom-refresh,1,0.000111,0.000111,0.000111
RollingBloom-refresh,1,0.000107,0.000107,0.000107
RollingBloom-refresh,1,0.000107,0.000107,0.000107
RollingBloom-refresh,1,0.000111,0.000111,0.000111
RollingBloom-refresh,1,0.000107,0.000107,0.000107
RollingBloom-refresh,1,0.000108,0.000108,0.000108
RollingBloom-refresh,1,0.000108,0.000108,0.000108
RollingBloom-refresh,1,0.000107,0.000107,0.000107
RollingBloom-refresh,1,0.000131,0.000131,0.000131
RollingBloom-refresh,1,0.000111,0.000111,0.000111
RollingBloom-refresh,1,0.00011,0.00011,0.00011
RollingBloom-refresh,1,0.000109,0.000109,0.000109
RollingBloom-refresh,1,0.000108,0.000108,0.000108
RollingBloom-refresh,1,0.000129,0.000129,0.000129
RollingBloom-refresh,1,0.000109,0.000109,0.000109
RollingBloom-refresh,1,0.000108,0.000108,0.000108
RollingBloom-refresh,1,0.000114,0.000114,0.000114
RollingBloom-refresh,1,0.000134,0.000134,0.000134
RollingBloom-refresh,1,0.000109,0.000109,0.000109
RollingBloom-refresh,1,0.000108,0.000108,0.000108
RollingBloom-refresh,1,0.000108,0.000108,0.000108
RollingBloom-refresh,1,0.00012,0.00012,0.00012
RollingBloom-refresh,1,0.000111,0.000111,0.000111
RollingBloom,1441791,6.77479e-07,1.19209e-06,7.49941e-07
SHA1,575,0.00178885,0.00195658,0.0018124
SHA256,239,0.00424147,0.00436625,0.00429098
SHA512,383,0.00258148,0.00269461,0.00262054
Sleep100ms,10,0.100642,0.105008,0.103793
Trig,46137343,0,9.53674e-07,2.26862e-08

This PR

#Benchmark,count,min,max,average
RIPEMD160,448,0.001245033173334,0.002638196945190,0.002461894814457
RollingBloom-refresh,1,0.000635000000000,0.000635000000000,0.000635000000000
RollingBloom-refresh,1,0.000108000000000,0.000108000000000,0.000108000000000
RollingBloom-refresh,1,0.000107000000000,0.000107000000000,0.000107000000000
RollingBloom-refresh,1,0.000204000000000,0.000204000000000,0.000204000000000
RollingBloom-refresh,1,0.000112000000000,0.000112000000000,0.000112000000000
RollingBloom-refresh,1,0.000109000000000,0.000109000000000,0.000109000000000
RollingBloom-refresh,1,0.000120000000000,0.000120000000000,0.000120000000000
RollingBloom-refresh,1,0.000108000000000,0.000108000000000,0.000108000000000
RollingBloom-refresh,1,0.000112000000000,0.000112000000000,0.000112000000000
RollingBloom-refresh,1,0.000109000000000,0.000109000000000,0.000109000000000
RollingBloom-refresh,1,0.000112000000000,0.000112000000000,0.000112000000000
RollingBloom-refresh,1,0.000127000000000,0.000127000000000,0.000127000000000
RollingBloom-refresh,1,0.000109000000000,0.000109000000000,0.000109000000000
RollingBloom-refresh,1,0.000108000000000,0.000108000000000,0.000108000000000
RollingBloom-refresh,1,0.000109000000000,0.000109000000000,0.000109000000000
RollingBloom-refresh,1,0.000134000000000,0.000134000000000,0.000134000000000
RollingBloom-refresh,1,0.000113000000000,0.000113000000000,0.000113000000000
RollingBloom-refresh,1,0.000111000000000,0.000111000000000,0.000111000000000
RollingBloom-refresh,1,0.000108000000000,0.000108000000000,0.000108000000000
RollingBloom-refresh,1,0.000110000000000,0.000110000000000,0.000110000000000
RollingBloom-refresh,1,0.000107000000000,0.000107000000000,0.000107000000000
RollingBloom-refresh,1,0.000108000000000,0.000108000000000,0.000108000000000
RollingBloom,1310720,0.000000364444583,0.000000838096198,0.000000766858648
SHA1,640,0.000909024336207,0.001938136418660,0.001843086257577
SHA256,256,0.002209486499909,0.008500099182129,0.004300644621253
SHA512,384,0.001319904176016,0.002813005447388,0.002615700786312
Sleep100ms,10,0.205592155456543,0.210056066513062,0.104166316986084
Trig,67108864,0.000000014997003,0.000000015448112,0.000000015188842

laanwj · 2016-05-30T07:13:33Z

utACK

gmaxwell · 2016-05-30T16:28:05Z

There is a bug in the min/max handling (see the sleep 100 timings). I'll fix.

Previously the benchmark code used an integer division (%) with a non-constant in the inner-loop. This is quite slow on many processors, especially ones like ARM that lack a hardware divide. Even on fairly recent x86_64 like haswell an integer division can take something like 100 cycles-- making it comparable to the runtime of siphash. This change avoids the division by using bitmasking instead. This was especially easy since the count was only increased by doubling. This change also restarts the timing when the execution time was very low this avoids mintimes of zero in cases where one execution ends up below the timer resolution. It also reduces the impact of the overhead on the final result. The formatting of the prints is changed to not use scientific notation make it more machine readable (in particular, gnuplot croaks on the non-fixedpoint, and it doesn't sort correctly). This also hoists out all the floating point divisions out of the semi-hot path because it was easy to do so. It might be prudent to break out the critical test into a macro just to guarantee that it gets inlined. It might also make sense to just save out the intermediate counts and times and get the floating point completely out of the timing loop (because e.g. on hardware without a fast hardware FPU like some ARM it will still be slow enough to distort the results). I haven't done either of these in this commit.

63ff57d Avoid integer division in the benchmark inner-most loop. (Gregory Maxwell)

…t loop. 63ff57d Avoid integer division in the benchmark inner-most loop. (Gregory Maxwell)

Micro-benchmarking framework part 1 Cherry-picked from the following upstream PRs: - bitcoin/bitcoin#6733 - bitcoin/bitcoin#6770 - bitcoin/bitcoin#6892 - Excluding changes to `src/policy/policy.h` which we don't have yet. - bitcoin/bitcoin#7934 - Just the benchmark, not the performance improvements. - bitcoin/bitcoin#8039 - bitcoin/bitcoin#8107 - bitcoin/bitcoin#8115 - bitcoin/bitcoin#8914 - Required resolving several merge conflicts in code that had been refactored upstream. The changes were simple enough that I decided it was okay to impose merge conflicts on pulling in those refactors later. - bitcoin/bitcoin#9200 - bitcoin/bitcoin#9202 - Adds support for measuring CPU cycles, which is later removed in an upstream PR after the refactor. I am including it to reduce future merge conflicts. - bitcoin/bitcoin#9281 - Only changes to `src/bench/bench.cpp` - bitcoin/bitcoin#9498 - bitcoin/bitcoin#9712 - bitcoin/bitcoin#9547 - bitcoin/bitcoin#9505 - Just the benchmark, not the performance improvements. - bitcoin/bitcoin#9792 - Just the benchmark, not the performance improvements. - bitcoin/bitcoin#10272 - bitcoin/bitcoin#10395 - Only changes to `src/bench/` - bitcoin/bitcoin#10735 - Only changes to `src/bench/base58.cpp` - bitcoin/bitcoin#10963 - bitcoin/bitcoin#11303 - Only the benchmark backend change. - bitcoin/bitcoin#11562 - bitcoin/bitcoin#11646 - bitcoin/bitcoin#11654 This pulls in all changes to the micro-benchmark framework prior to December 2017, when it was rewritten. The rewrite depends on other upstream PRs we have not pulled in yet. This does not pull in all benchmarks prior to December 2017. It leaves out benchmarks that either test code we do not have yet (except for the `FastRandomContext` refactor, which I decided to pull in), or would require rewrites to work with our changes to the codebase.

3f3edde [Bench] Use PIVX address in Base58Decode test (random-zebra) 5a1be90 [Travis] Disable benchmark framework for trusty test (random-zebra) 1bd89ac Initialize recently introduced non-static class member lastCycles to zero in constructor (random-zebra) ec60671 Require a steady clock for bench with at least micro precision (random-zebra) 84069ce bench: prefer a steady clock if the resolution is no worse (random-zebra) 38367b1 bench: switch to std::chrono for time measurements (random-zebra) a24633a Remove countMaskInv caching in bench framework (random-zebra) 9e9bc22 Restore default format state of cout after printing with std::fixed/setprecision (random-zebra) 3dd559d Avoid static analyzer warnings regarding uninitialized arguments (random-zebra) e85f224 Replace boost::function with std::function (C++11) (random-zebra) 98c0857 Prevent warning: variable 'x' is uninitialized (random-zebra) 7f0d4b3 FastRandom benchmark (random-zebra) d9fa0c6 Add prevector destructor benchmark (random-zebra) e1527ba Assert that what might look like a possible division by zero is actually unreachable (random-zebra) e94cf15 bench: Fix initialization order in registration (random-zebra) 151c25f Basic CCheckQueue Benchmarks (random-zebra) 51aedbc Use std:thread:hardware_concurrency, instead of Boost, to determine available cores (random-zebra) d447613 Use real number of cores for default -par, ignore virtual cores (random-zebra) 9162a56 [Refactoring] Removed using namespace <xxx> from bench/ sources (random-zebra) 5c07f67 bench: Add support for measuring CPU cycles (random-zebra) 41ce1ed bench: Fix subtle counting issue when rescaling iteration count (random-zebra) 68ea794 Avoid integer division in the benchmark inner-most loop. (random-zebra) 3fa4f27 bench: Added base58 encoding/decoding benchmarks (random-zebra) 4442118 bench: Add crypto hash benchmarks (random-zebra) a5179b6 [Trivial] ensure minimal header conventions (random-zebra) 8607d6b Support very-fast-running benchmarks (random-zebra) 4aebb60 Simple benchmarking framework (random-zebra) Pull request description: Introduces the benchmarking framework, loosely based on google's micro-benchmarking library (https://github.com/google/benchmark), ported from Bitcoin, up to 0.16. The benchmark framework is hard-coded to run each benchmark for one wall-clock second, and then spits out .csv-format timing information to stdout. Backported PR: - bitcoin#6733 - bitcoin#6770 - bitcoin#6892 - bitcoin#8039 - bitcoin#8107 - bitcoin#8115 - bitcoin#9200 - bitcoin#9202 - bitcoin#9281 - bitcoin#6361 - bitcoin#10271 - bitcoin#9498 - bitcoin#9712 - bitcoin#9547 - bitcoin#9505 (benchmark only. Rest was in #1557) - bitcoin#9792 (benchmark only. Rest was in #643) - bitcoin#10272 - bitcoin#10395 (base58 only) - bitcoin#10963 - bitcoin#11303 (first commit) - bitcoin#11562 - bitcoin#11646 - bitcoin#11654 Current output of `src/bench/bench_pivx`: ``` #Benchmark,count,min(ns),max(ns),average(ns),min_cycles,max_cycles,average_cycles Base58CheckEncode,131072,7697,8065,7785,20015,20971,20242 Base58Decode,294912,3305,3537,3454,8595,9198,8981 Base58Encode,180224,5498,6020,5767,14297,15652,14994 CCheckQueueSpeed,320,3159960,3535173,3352787,8216030,9191602,8717388 CCheckQueueSpeedPrevectorJob,96,9184484,11410840,10823070,23880046,29668680,28140445 FastRandom_1bit,320,3143690,4838162,3199156,8173726,12579373,8317941 FastRandom_32bit,60,17097612,17923669,17367440,44454504,46602306,45156079 PrevectorClear,3072,334741,366618,346731,870340,953224,901516 PrevectorDestructor,2816,344233,368912,357281,895022,959187,928948 RIPEMD160,288,3404503,3693917,3577774,8851850,9604334,9302363 SHA1,384,2718128,2891558,2802513,7067238,7518184,7286652 SHA256,176,6133760,6580005,6239866,15948035,17108376,16223916 SHA512,240,4251468,4358706,4313463,11054006,11332826,11215186 Sleep100ms,10,100221470,100302411,100239073,260580075,260790726,260625870 ``` NOTE: Not all the tests have been pulled yet (as we might not have the code being tested, or it would require rewrites to work with our different code base), but the framework is updated to December 2017. ACKs for top commit: Fuzzbawls: ACK 3f3edde Tree-SHA512: c283311a9accf6d2feeb93b185afa08589ebef3f18b6e86980dbc3647b9845f75ac9ecce2f1b08738d25ceac36596a2c89d41e4dbf3b463502aa695611aa1f8e

fanquake mentioned this pull request May 29, 2016

[Doc] Add benchmarking notes #8110

Merged

laanwj added the Tests label May 30, 2016

laanwj merged commit 63ff57d into bitcoin:master May 31, 2016

laanwj added a commit that referenced this pull request May 31, 2016

Merge #8115: Avoid integer division in the benchmark inner-most loop.

0026e0e

63ff57d Avoid integer division in the benchmark inner-most loop. (Gregory Maxwell)

codablock pushed a commit to codablock/dash that referenced this pull request Dec 21, 2017

Merge bitcoin#8115: Avoid integer division in the benchmark inner-mos…

a3c6303

…t loop. 63ff57d Avoid integer division in the benchmark inner-most loop. (Gregory Maxwell)

dagurval mentioned this pull request Apr 30, 2018

Benchmark cherries bitcoinxt/bitcoinxt#408

Merged

andvgal pushed a commit to energicryptocurrency/gen2-energi that referenced this pull request Jan 6, 2019

Merge bitcoin#8115: Avoid integer division in the benchmark inner-mos…

92d59b3

…t loop. 63ff57d Avoid integer division in the benchmark inner-most loop. (Gregory Maxwell)

str4d mentioned this pull request Feb 22, 2019

Micro-benchmarking framework part 1 zcash/zcash#3858

Merged

sickpig mentioned this pull request Apr 28, 2020

Update benchmark suite BitcoinUnlimited/BitcoinUnlimited#2191

Merged

random-zebra mentioned this pull request May 27, 2020

[Tests] Benchmarking Framework PIVX-Project/PIVX#1650

Merged

bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Avoid integer division in the benchmark inner-most loop. #8115

Avoid integer division in the benchmark inner-most loop. #8115

Uh oh!

gmaxwell commented May 29, 2016

Uh oh!

fanquake commented May 29, 2016

Uh oh!

laanwj commented May 30, 2016

Uh oh!

gmaxwell commented May 30, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Avoid integer division in the benchmark inner-most loop. #8115

Avoid integer division in the benchmark inner-most loop. #8115

Uh oh!

Conversation

gmaxwell commented May 29, 2016

Uh oh!

fanquake commented May 29, 2016

Uh oh!

laanwj commented May 30, 2016

Uh oh!

gmaxwell commented May 30, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants