Improve regexp kernels performance by avoiding cloning Regex#5235
Improve regexp kernels performance by avoiding cloning Regex#5235Dandandan merged 2 commits intoapache:masterfrom
Conversation
arrow-string/src/regexp.rs
Outdated
| })?; | ||
| patterns.insert(pattern, re.clone()); | ||
| re | ||
| patterns.insert(pattern.clone(), re); |
There was a problem hiding this comment.
| let existing_pattern = patterns.get(&pattern); | ||
| let re = match existing_pattern { | ||
| Some(re) => re.clone(), | ||
| Some(re) => re, | ||
| None => { | ||
| let re = Regex::new(pattern.as_str()).map_err(|e| { | ||
| ArrowError::ComputeError(format!( | ||
| "Regular expression did not compile: {e:?}" | ||
| )) | ||
| })?; | ||
| patterns.insert(pattern, re.clone()); | ||
| re | ||
| patterns.entry(pattern).or_insert(re) |
There was a problem hiding this comment.
If you replaced the get above with this entry call, it might improve the performance further by eliminating a branch perhaps??
There was a problem hiding this comment.
Hmm, then we need to use or_insert_with like
let re = patterns.entry(pattern).or_insert_with(|| {
Regex::new(pattern.as_str()).map_err(|e| {
ArrowError::ComputeError(format!(
"Regular expression did not compile: {e:?}"
))
}?)
});But or_insert_with cannot propagate the error out of the function.
Tried with Result<Regex, ArrowError> as value type with the patterns HashMap, but then it doesn't work because the ? operator cannot be applied to type &mut Result<regex::Regex, arrow_schema::ArrowError> too.
So just keep it as is.
There was a problem hiding this comment.
Entry is an enumeration you can match on, you don't have to use or_insert_with and friends, they're just utilities like on Option
There was a problem hiding this comment.
Oh, the branch to eliminate is the one inside or_insert? Do you mean something like:
let entry = patterns.entry(pattern.clone());
let re = match entry {
Entry::Occupied(entry) => entry.into_mut(),
Entry::Vacant(entry) => {
let re = Regex::new(pattern.as_str()).map_err(|e| {
ArrowError::ComputeError(format!(
"Regular expression did not compile: {e:?}"
))
})?;
entry.insert(re)
}
};Hmm, it actually gets regression a little. 🤔
regexp time: [7.6891 ms 7.7210 ms 7.7511 ms]
change: [+14.792% +15.570% +16.321%] (p = 0.00 < 0.05)
Performance has regressed.
There was a problem hiding this comment.
That is what I meant, perhaps the regression is because of the added clone (which i think can be removed)
There was a problem hiding this comment.
clone? You mean let entry = patterns.entry(pattern.clone());? It cannot be removed.
216 | (Some(value), Some(pattern)) => {
| ------- move occurs because `pattern` has type `String`, which does not implement the `Copy` trait
217 | let entry = patterns.entry(pattern);
| ------- value moved here
...
221 | let re = Regex::new(pattern.as_str()).map_err(|e| {
| ^^^^^^^ value borrowed here after move
|
help: consider cloning the value if the performance cost is acceptable
|
217 | let entry = patterns.entry(pattern.clone());
| ++++++++
There was a problem hiding this comment.
Oh because of the unfortunate way flags are handled... Yeah this kernel would probably do with redesigning 😅
tustvold
left a comment
There was a problem hiding this comment.
As an aside it occurs to me we should probably have a version of this that takes Datum
Yea, I did a local version and tested it with DataFusion. It turns out the performance bottleneck is on cloning Regex mostly. But it is per query benchmark and I didn't run this kernel benchmark in it. I will run this benchmark against the Datum version and probably submit it later. |
|
Thank you @viirya ! |
|
Thank you @alamb @Dandandan @tustvold for review |
Which issue does this PR close?
Closes #.
Rationale for this change
The
regexp_matchscalar expression in DataFusion is quite slow. As looking for the cause, it is found that the bad performance is due to cloningRegexper row.Regexhas aCachePool. CloningRegexwill create a freshCachePooland it is somehow expensive to re-creating cache space per row instead of reusing the cache.What changes are included in this PR?
This patch removes cloning on
Regexin regexp kernels.Are there any user-facing changes?