Speed up replaceRegexpAll and replaceAll by replacing underlying regexp library#62436
Closed
taiyang-li wants to merge 8 commits intoClickHouse:masterfrom
Closed
Speed up replaceRegexpAll and replaceAll by replacing underlying regexp library#62436taiyang-li wants to merge 8 commits intoClickHouse:masterfrom
replaceRegexpAll and replaceAll by replacing underlying regexp library#62436taiyang-li wants to merge 8 commits intoClickHouse:masterfrom
Conversation
Contributor
|
This is an automated comment for commit 9ac4fed with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
Contributor
Author
replaceAllwith 'Many years later as he faced the firing squad, Colonel Aureliano Buendia was to remember that distant afternoon when his father took him to discover ice.' as s select replaceAll(materialize(s), ', ', '. ') as w from numbers(10000000) format Null;
Old:
localhost:9001, queries: 10, QPS: 1.064, RPS: 10638950.453, MiB/s: 81.169, result RPS: 0.000, result MiB/s: 0.000.
0.000% 0.867 sec.
10.000% 0.875 sec.
20.000% 0.876 sec.
30.000% 0.887 sec.
40.000% 0.888 sec.
50.000% 0.899 sec.
60.000% 0.899 sec.
70.000% 0.909 sec.
80.000% 0.928 sec.
90.000% 0.985 sec.
95.000% 1.177 sec.
99.000% 1.177 sec.
99.900% 1.177 sec.
99.990% 1.177 sec.
New:
localhost:9001, queries: 10, QPS: 1.169, RPS: 11685467.795, MiB/s: 89.153, result RPS: 0.000, result MiB/s: 0.000.
0.000% 0.827 sec.
10.000% 0.840 sec.
20.000% 0.842 sec.
30.000% 0.842 sec.
40.000% 0.842 sec.
50.000% 0.842 sec.
60.000% 0.842 sec.
70.000% 0.844 sec.
80.000% 0.853 sec.
90.000% 0.856 sec.
95.000% 0.862 sec.
99.000% 0.862 sec.
99.900% 0.862 sec.
99.990% 0.862 sec. |
replaceRegexpAll and replaceAll
rschu1ze
reviewed
Apr 9, 2024
Contributor
Author
Contributor
Author
Fixed. It is ready for review now. |
replaceRegexpAll and replaceAllreplaceRegexpAll and replaceAll by replacing underlying regexp lib.
replaceRegexpAll and replaceAll by replacing underlying regexp lib.replaceRegexpAll and replaceAll by replacing underlying regexp library
Contributor
Author
|
@rschu1ze hope for your reviews, thanks very much. |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
rschu1ze
reviewed
Apr 25, 2024
rschu1ze
reviewed
Apr 25, 2024
This comment was marked as outdated.
This comment was marked as outdated.
Contributor
Author
|
@rschu1ze any progress recently ? |
Contributor
Author
|
Anybody help to review it ? |
Member
|
Sorry for the delay, I'll check today. |
Member
|
CLosed in favor of #66185 - thanks! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
replaceRegexp(One|All)toreplace(One|All)ifpatternis trivial andreplacementdoesn't contain any captured substitutions.re2toOptimizedRegularExpression.unsigned match(std::string_view text, const char * subject, size_t subject_size, MatchVec & matches, unsigned limit) constinOptimizedRegularExpressionto process cases like:select replaceRegexpAll('Hello, World!', '^', 'here: '), making it compatiable with originalre2library.std::string::nposwith its actual value. It makes sense because even when captured pieces are empty, we still want to know its position for later replacement.Why we need to add a new interface
unsigned match(std::string_view text, const char * subject, size_t subject_size, MatchVec & matches, unsigned limit) const?We have to add a new interface to make sure
OptimizedRegularExpressionhave the same functionality with previous regexp libraryre2.When invoking
unsigned match(std::string_view text, const char * subject, size_t subject_size, MatchVec & matches, unsigned limit) const.text: the whole input: 'Hello, World!'subject: pointer to suffix of text after we have processed some prefix of text. In functionregexpRegexpAll, abovematchmethod maybe called multiple times. Each time we have the sametextbut differentsubject.subject_size: suffix lengthConsider case:
select replaceRegexpAll('Hello, World!', '^', 'here: '). The pattern^refers to the beginning oftext'Hello, World!', it should be matched only once. But old interface inOptimizedRegularExpressiondoesn't meet the the requirement.If we use the new interface above, valid match will only happend once, which makes replacement happen only once. The output matches are:
And the sql will outcome correct output.
If we use the old interface
unsigned match(const char * subject, size_t subject_size, MatchVec & matches, unsigned limit) const, it will be invoked multiple times, which makes replacement happen multiple times. The output matches are:It is obvious that the sql generate wrong output under current condition.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Speed up
replaceRegexpby 1.8x (4.41x especially when needle and replacement are trivail ) andreplaceAllby 1.10x