Skip to content

Conversation

@martinus
Copy link
Contributor

@martinus martinus commented May 3, 2022

As a followup of #22953, this removes the remaining occurrences of boost::split and replaces them with our own SplitString. To be able to do so, this extends the function spanparsing::Split to work with multiple separators. Finally this removes 3 more files from lint-includes.py.

@martinus martinus changed the title refactor: replace remaining boost::split with StringSplit refactor: replace remaining boost::split with SplitString May 3, 2022
Copy link
Member

@maflcko maflcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I was also working on this yesterday, but I like your version better.

@fanquake
Copy link
Member

fanquake commented May 3, 2022

Concept ACK

Copy link

@vincenzopalazzo vincenzopalazzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK

martinus and others added 4 commits May 4, 2022 07:34
Note that `SplitString` doesn't support token compression, but in this case
it does not matter as empty strings are already skipped anyways.

Also removes split.hpp and classification.hpp from expected includes
Also removes boost/algorithm/string.hpp from expected includes
@martinus martinus force-pushed the 2022-04-boost-split-exorcism branch from 8976d15 to f849e63 Compare May 4, 2022 05:55
@martinus
Copy link
Contributor Author

martinus commented May 4, 2022

Rebased with fixing nits and adding fuzzing

@fanquake fanquake requested a review from theStack May 4, 2022 14:05
Copy link
Contributor

@theStack theStack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code-review ACK f849e63

I tend to think that the single-character interface could be removed completely (especially since it uses the multi-separators version internally in the end anyways), but on the other hand this would make the diff larger as lots of calling instances had to be changed, so NBD.

Finally this removes 3 more files from lint-includes.py

🎉 🎉 🎉

@fanquake fanquake merged commit bde5836 into bitcoin:master May 4, 2022
@martinus martinus deleted the 2022-04-boost-split-exorcism branch May 4, 2022 18:50
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request May 4, 2022
…litString

f849e63 fuzz: SplitString with multiple separators (Martin Leitner-Ankerl)
d1a9850 http: replace boost::split with SplitString (Martin Leitner-Ankerl)
0d7efcd core_read: Replace boost::split with SplitString (Martin Leitner-Ankerl)
b7ab9db Extend Split to work with multiple separators (Martin Leitner-Ankerl)

Pull request description:

  As a followup of bitcoin#22953, this removes the remaining occurrences of `boost::split` and replaces them with our own `SplitString`. To be able to do so, this extends the function `spanparsing::Split` to work with multiple separators. Finally this removes 3 more files from `lint-includes.py`.

ACKs for top commit:
  theStack:
    Code-review ACK f849e63

Tree-SHA512: f37d4dbe11cab2046e646045b0f018a75f978d521443a2c5001512737a1370e22b09247d5db0e5c9e4153229a4e2d66731903c1bba3713711c4cae8cedcc775d
Copy link
Contributor

@vasild vasild left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK f849e63

auto start = it;
while (it != sp.end()) {
if (*it == sep) {
if (separators.find(*it) != std::string::npos) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is doing full scan of the iterators for each input character, thus the overall complexity is O(input_length * number_of_separators). It can be optimized to O(input_length + number_of_separators) if a map of the separators is built beforehand that would allow checking whether a given character is a separator in a constant time.

This is mostly theoretical because it would matter if lots of separators are used and currently we only use a few. Anyway, out of curiosity, I tried that and compared the performance - in the current variant it is 734 ns/op while with the map it is 130 ns/op for 10 separators.

optimize Split()
 template <typename T = Span<const char>>
 std::vector<T> Split(const Span<const char>& sp, std::string_view separators)
 {
+    std::bitset<256> m;
+    for (const auto& sep : separators) {
+        m[static_cast<unsigned char>(sep)] = true;
+    }
     std::vector<T> ret;
     auto it = sp.begin();
     auto start = it;
     while (it != sp.end()) {
-        if (separators.find(*it) != std::string::npos) {
+        if (m[static_cast<unsigned char>(*it)]) {
             ret.emplace_back(start, it);
             start = it + 1;
         }
         ++it;
     }
     ret.emplace_back(start, it);
benchmark
static void Split(benchmark::Bench& bench)
{
    bench.run([&] {
        (void)spanparsing::Split(
            "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
            "bcdefghijk"
        );
    });
}

BENCHMARK(Split);

Feel free to ignore.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants