-
Notifications
You must be signed in to change notification settings - Fork 38.7k
util: improve FindByte() performance #19690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This PR was suggested by @hebasto #16981 (comment) (thank you!) |
promag
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The performance gain looks substantial - didn't verified.
src/streams.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, join with above line.
How was it measured? |
|
Any idea why this is so much faster? As far as I know, there is no faster algorithm to look for the first occurrence of a single byte in a memory array than a linear iteration over it. I'd expect |
I ran: with master (1m20s) and with master + PR (3s).
I think you're correct that there's no faster way than a loop with runtime proportional to the number of bytes to scan, but I assume I just noticed that the repetition count on the test is set to a large number (50000000) and I meant to reduce it for the commit (3 seconds is too long to add to the unit test suite). I'll reduce that number in force push in a minute. This test doesn't really need to be in this PR ( |
ab412ec to
a31aa32
Compare
|
Force-push a small fix to the test, so it doesn't take 3 seconds to run. |
That's true, it's possible to optimize that with assembly language (definitely with specific instruction sets). it still surprises me because you'd expect the I/O, to read the data from disk, to dominate greatly in the block importing. Not looking for a character already in memory! It just seems out of proportion. |
src/test/streams_tests.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand how the performance increase is so significant, but why not a bench if you're worried about burdening the unit tests? I tried to do this but I must be doing something wrong because I can't seem to reproduce the speedup. 😞
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The performance impact may be very compiler/architecture/stdlib dependent. I'm kind of surprised std::find has optimizations beyond the naive loop implementation in the first place on some platforms, so I certainly wouldn't be surprised if others don't have it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gzhao408, thank you, I wasn't aware of bench. I suspect the iteration count, 100, is far too low and the difference is swamped by the noise. I increased the iteration count to 10m (1e7) and it showed the expected difference, master: 1,659.30 ns/op, PR: 52.66 ns/op (ratio is about 31).
I just force-pushed (diff) to remove the unit test (which was only for benchmarking, not really testing anything) and cherry-pick Gloria's benchmark. I added one more commit to increase the iteration count.
|
a31aa32 to
8b07e17
Compare
|
@gmaxwell pointed out to me why this is so much faster: it's not that std::find is amazing, but that the original code (which I wrote in 2012, it seems!) is doing a modulus operation for every character (which is often orders of magnitude slower than the byte comparison or addition/subtraction). Thinking about this a bit more high level: the end goal is just to scan quickly for the 4-byte network magic in a file. If this is relevant for performance (and it seems it may be), it may be better to have a function that does exactly that in CBufferedFile, rather than a search for one byte + memcmp. |
|
Here's a version that's very close in implementation to master that eliminates the and that does make a significant difference; I do like @sipa's suggestion to generalize |
|
I just added a new commit (25413ab, can squash later) to implement @sipa's suggestion to add a method to It didn't work out to use |
|
Looks like one of the sanitizers is finding an implicit-integer-sign-change problem in the latest update: |
25413ab to
8f2e8c2
Compare
|
Thanks, @adamjonas, force-pushed a fix for that signed-unsigned CI failure, added another unit test, some improvements to |
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsNo conflicts as of last run. |
src/bench/streams_findbyte.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we maybe use something like memfd_create(2) for the benchmark, to decrease I/O noise?
Anyone knows if there is a windows equivalent or a higher abstraction in boost::filesystem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fmemopen(3) also looks interesting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestions, but I don't think this matters, because the file IO (read) occurs only on the very first benchmark loop iteration; after that, the data is in memory and there is no IO at all. Each iteration repositions the stream pointer to zero (bf.SetPos(0)) and then searches forward for the value 1 (bf.FindByte(1)), which is 200 bytes away. But after the first iteration, all of these bytes are in memory, and we're just moving the position between 0 and 200.
The reason for the 200, by the way, is that with random data, searching for a random byte value the average distance would be half of 256. But the data in blk.dat files (which is the only use of this code currently) isn't quite random; zero and 0xff occur more often (and we never search for those values). So I guessed that 200 is close to a typical distance that this function would move through before finding the requested byte. Maybe I should explain these points in a comment in bench/streams_findbyte.cpp.
|
Code review ACK 8f2e8c27bb9bb3ebb9f094129ce4a6019c46cfc0, this looks like a better abstraction, and it's an improvement not to do a modulus for every byte. |
8f2e8c2 to
566c2e2
Compare
|
Force-pushed to rationalize the commits, so I think it's in a good state for merging. I reordered the commit that adds the benchmark test, b576994, to be first, so it's easy for reviewers to checkout that commit and run the new benchmark test without the improvements: I re-ran those tests just now, and on my laptop, the ns/op without the PR is 1,840, and with the PR it's 55 (an improvement of more than a factor of 33). |
|
Code review ACK 566c2e2eec1e84af5143f1554613a89192e29433. I think it'd be good to move the refactoring to FindByte in the last commit to a separate commit. |
566c2e2 to
134de90
Compare
|
Force-pushed to implement latest review suggestion, changes are more cleanly separated among the commits (no code changes overall), thanks @sipa. |
hebasto
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK.
08cf6a9 to
dacd331
Compare
|
@john-moffett - Good idea to bring back the earlier commit (the condition branch instruction theory makes sense); I just restored (force-pushed) it as you suggested. On my x86 (ns/op, lower is better): master: 307 |
|
ACK dacd3316ca0279c56d12372957633b73a999b5b2 |
stickies-v
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach ACK dacd3316c
I'm now seeing bench performance improvement on my M1 again, from ~150ns/op -> ~85ns/op.
src/streams.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that there is only a single non-test callsite of FindByte(), would it make sense to just update the fn signature to take std::byte directly?
src/streams.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: and a bunch more of those
| size_t buf_offset = m_read_pos % vchBuf.size(); | |
| size_t buf_offset{m_read_pos % vchBuf.size()}; |
src/streams.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some of the rationale of this implementation should be in the docs so future contributors don't simplify the code again to unintentionally undo the performance gains, e.g. why the modulo operator is kept outside of the while loop seems quite important and non-trivial?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR, but I think it would also be helpful to improve the docs to specify that if ch is not found, std::ios_base::failure is thrown (from Fill()). It's an essential part of the interface imo.
Perhaps worth improving on this behaviour, and have FindByte() throw its own error, by wrapping the Fill() command in a try/catch? Orthogonal to this PR, though. (And I also don't like that we're catching a general std::exception for a FindByte() failure, but again, orthogonal.)
dacd331 to
52804a5
Compare
|
Force-pushed for review comments (thanks, @stickies-v), verified benchmark performance is unchanged (ns/op with PR: 34.76, without PR: 302). Summary of force-push changes:
|
acef167 to
0fe832c
Compare
|
Force-pushed again to fix CI failures. |
stickies-v
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 0fe832c4a
src/streams.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of changing the callsites of FindByte(), how about adding a uint8_t overload? I think it keeps the implementation clean, but since it can easily be argued that uint8_t also is a byte, this keeps the callsites straightforward and reduces the diff.
git diff
diff --git a/src/bench/streams_findbyte.cpp b/src/bench/streams_findbyte.cpp
index 175564fe9..7b2e65da2 100644
--- a/src/bench/streams_findbyte.cpp
+++ b/src/bench/streams_findbyte.cpp
@@ -20,7 +20,7 @@ static void FindByte(benchmark::Bench& bench)
bench.run([&] {
bf.SetPos(0);
- bf.FindByte(std::byte(1));
+ bf.FindByte(1);
});
// Cleanup
diff --git a/src/streams.h b/src/streams.h
index 2558bd830..9280fa013 100644
--- a/src/streams.h
+++ b/src/streams.h
@@ -763,6 +763,8 @@ public:
if (buf_offset >= vchBuf.size()) buf_offset = 0;
}
}
+
+ void FindByte(uint8_t byte) { return FindByte(static_cast<std::byte>(byte)); }
};
#endif // BITCOIN_STREAMS_H
diff --git a/src/test/fuzz/buffered_file.cpp b/src/test/fuzz/buffered_file.cpp
index 2f7ce60c7..67cac8fa4 100644
--- a/src/test/fuzz/buffered_file.cpp
+++ b/src/test/fuzz/buffered_file.cpp
@@ -53,7 +53,7 @@ FUZZ_TARGET(buffered_file)
return;
}
try {
- opt_buffered_file->FindByte(std::byte(fuzzed_data_provider.ConsumeIntegral<uint8_t>()));
+ opt_buffered_file->FindByte(fuzzed_data_provider.ConsumeIntegral<uint8_t>());
} catch (const std::ios_base::failure&) {
}
},
diff --git a/src/test/streams_tests.cpp b/src/test/streams_tests.cpp
index 79bc7b7c0..1db5b61f1 100644
--- a/src/test/streams_tests.cpp
+++ b/src/test/streams_tests.cpp
@@ -462,7 +462,7 @@ BOOST_AUTO_TEST_CASE(streams_buffered_file_rand)
size_t find = currentPos + InsecureRandRange(8);
if (find >= fileSize)
find = fileSize - 1;
- bf.FindByte(std::byte(find));
+ bf.FindByte(find);
// The value at each offset is the offset.
BOOST_CHECK_EQUAL(bf.GetPos(), find);
currentPos = find;
diff --git a/src/validation.cpp b/src/validation.cpp
index a79b81add..b42b39861 100644
--- a/src/validation.cpp
+++ b/src/validation.cpp
@@ -4438,7 +4438,7 @@ void Chainstate::LoadExternalBlockFile(
try {
// locate a header
unsigned char buf[CMessageHeader::MESSAGE_START_SIZE];
- blkdat.FindByte(std::byte(params.MessageStart()[0]));
+ blkdat.FindByte(params.MessageStart()[0]);
nRewind = blkdat.GetPos() + 1;
blkdat >> buf;
if (memcmp(buf, params.MessageStart(), CMessageHeader::MESSAGE_START_SIZE)) {
src/streams.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that it's part of the interface, I think this needs to be documented on the function level so devs wanting to use FindByte know how it behaves when the byte isn't found - they shouldn't need to dive into the implementation.
src/streams.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
| // For best performance, avoid mod operation within the loop. | |
| // The modulus operation is much more expensive than byte | |
| // comparison and addition, so we keep it out of the loop to | |
| // improve performance (see #19690 for discussion). |
|
ACK 0fe832c4a4b2049fdf967bca375468d5ac285563 |
john-moffett
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 0fe832c4a4b2049fdf967bca375468d5ac285563
|
Silent merge conflict with master: |
Avoid use of the expensive mod operator (%) when calculating the buffer offset. No functional difference. Co-authored-by: Hennadii Stepanov <[email protected]>
0fe832c to
72efc26
Compare
|
Force pushed rebase to fix hidden merge conflict, thanks @achow101 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re-ACK 72efc26
Verified that the only difference is to include <util/fs.h> instead of <fs.h> (introduced by 00e9b97)
% git range-diff HEAD~2 0fe832c4a4b2049fdf967bca375468d5ac285563 HEAD
1: 5842d92c8 ! 1: 604df63f6 [bench] add streams findbyte
@@ src/bench/streams_findbyte.cpp (new)
+
+#include <bench/bench.h>
+
-+#include <fs.h>
++#include <util/fs.h>
+#include <streams.h>
+
+static void FindByte(benchmark::Bench& bench)
2: 0fe832c4a = 2: 72efc2643 util: improve streams.h:FindByte() performance
|
re-ACK 72efc26 |
72efc26 util: improve streams.h:FindByte() performance (Larry Ruane) 604df63 [bench] add streams findbyte (gzhao408) Pull request description: This PR is strictly a performance improvement; there is no functional change. The `CBufferedFile::FindByte()` method searches for the next occurrence of the given byte in the file. Currently, this is done by explicitly inspecting each byte in turn. This PR takes advantage of `std::find()` to do the same more efficiently, improving its CPU runtime by a factor of about 25 in typical use. ACKs for top commit: achow101: re-ACK 72efc26 stickies-v: re-ACK 72efc26 Tree-SHA512: ddf0bff335cc8aa34f911aa4e0558fa77ce35d963d602e4ab1c63090b4a386faf074548daf06ee829c7f2c760d06eed0125cf4c34e981c6129cea1804eb3b719
This PR is strictly a performance improvement; there is no functional change. The
CBufferedFile::FindByte()method searches for the next occurrence of the given byte in the file. Currently, this is done by explicitly inspecting each byte in turn. This PR takes advantage ofstd::find()to do the same more efficiently, improving its CPU runtime by a factor of about 25 in typical use.