Conversation
The last I recall thinking about this was summarized in this comment. The functions, at least as seen in other DBs or query engines, are very similar, with extract being slightly more powerful by allowing one to define which group to extract. Frankly, I could see DataFusion having one function that does both (aliased to regexp_substr and regexp_extract) where an optional 'index' or 'group' argument can be provided (defaulting to 0) that denotes which capture group to return.
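The group-index semantics described above can be sketched with a toy model (plain, std-only Rust with no real regex engine; `extract_group` and the capture-group layout are hypothetical illustrations, not DataFusion code): group 0 is the whole match, and an omitted index defaults to 0.

```rust
// Toy model of capture-group extraction (hypothetical helper, no regex engine):
// `groups[0]` holds the whole match and `groups[1..]` the capture groups, as in
// most regexp engines. An omitted index defaults to 0 (the whole match).
fn extract_group<'a>(groups: &[&'a str], idx: Option<usize>) -> Option<&'a str> {
    groups.get(idx.unwrap_or(0)).copied()
}

fn main() {
    // Pretend "100-200" matched a pattern like r"(\d+)-(\d+)".
    let groups = ["100-200", "100", "200"];
    assert_eq!(extract_group(&groups, None), Some("100-200")); // default: whole match
    assert_eq!(extract_group(&groups, Some(2)), Some("200")); // second capture group
    assert_eq!(extract_group(&groups, Some(9)), None); // out-of-range group
    println!("{:?}", extract_group(&groups, Some(2)));
}
```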
| | 200 |
| +---------------------------------------------------------+
| ```
| Additional examples can be found [here](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/builtin_functions/regexp.rs) |
If we are not going to add examples to the regexp.rs file I would suggest removing this line.
| argument(name = "str", description = "Column or column name"),
| argument(
|     name = "regexp",
|     description = r#"a string representing a regular expression. The regex string should be a
If this is indeed the case (Java), this function belongs in the spark crate, not in the main datafusion functions crate.
Also, Java regular expressions are not 100% compatible with Rust's.
| ) -> Result<ColumnarValue> {
|     let args = &args.args;
|
|     if args.len() != 2 && args.len() != 3 {
I'm not sure how this could possibly work. If args.len() == 2 it'll fail the second condition, if 3, the first.
If it's neither 2 nor 3, then it's an error.
So, if len == 2, the first condition (`!= 2`) is false, hence it won't enter the branch.
Maybe written as follows it could read easier?
| if args.len() != 2 && args.len() != 3 {
| if !(args.len() == 2 || args.len() == 3) {
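For what it's worth, the two formulations above are logically equivalent; a std-only sketch (the `arity_ok` helper is hypothetical, not in the PR) demonstrates this, with `matches!` as a third, arguably clearest option:

```rust
// Hypothetical helper: the arity check from the snippet above, written with
// `matches!`, which reads as "the length is 2 or 3".
fn arity_ok(n: usize) -> bool {
    matches!(n, 2 | 3)
}

fn main() {
    for n in 0..6 {
        let original = n != 2 && n != 3; // branch taken => error
        let suggested = !(n == 2 || n == 3); // suggested rewrite
        // All three formulations agree for every length.
        assert_eq!(original, suggested);
        assert_eq!(original, !arity_ok(n));
    }
    println!("{} {} {}", arity_ok(1), arity_ok(2), arity_ok(3));
}
```

Running it prints `false true true`.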
From what I remember, it was quite complicated to expose Rust-backed regexp to the JVM world because of the Rust/JVM regexp processing differences.
Theoretically we can still expose the function, but Spark users need to be careful and accept the nuances, and this needs to be documented.
Jefffrey left a comment:
Sounds like we should proceed with adding this as a function, given other DBs/engines have something similar; however, we should probably approach this from the angle of adding it as a DataFusion function, not necessarily matching Spark exactly, given what @comphead outlined.
| /// Extracts a group that matches `regexp`. If `idx` is not specified,
| /// it defaults to 1.
| ///
| /// Matches Spark's DataFrame API: `regexp_extract(e: Column, exp: String, groupIdx: Int)`
We probably should remove mention of Spark since we're adding this as a DataFusion function (i.e. not to the datafusion-spark crate)
I agree, but to give another perspective: it is useful to keep a reference explaining why the implementation is the way it is, like we do for other functions that match Postgres behavior.
What I do think is that we should remove the extensive comments about Spark in the implementation itself (like the Spark Catalyst reference and the inner implementation details), and just keep a single mention if we decide to.
| use std::any::Any;
| use std::sync::Arc;
|
| // See https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.regexp_extract.html
| }
|
| fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
|     use DataType::*;
We don't need all these checks in `return_type`; we can simply return `Ok(Utf8)`, as the signature should guard this for us.
| }
|
| // DataFusion passes either scalars or arrays. Convert to arrays.
| let len = args
We should just use `make_scalar_function`, which handles this boilerplate for us if we don't want to deal with `ColumnarValue`s.
Thanks, I'll try to look into it!
| }
|
| /// Helper to build args for tests and external callers.
| pub fn regexp_extract(args: &[ArrayRef]) -> Result<ArrayRef> {
This doesn't need to be public
|
| /// Helper to build args for tests and external callers.
| pub fn regexp_extract(args: &[ArrayRef]) -> Result<ArrayRef> {
|     if args.len() != 3 {
If it needs 3 arguments we should make the signature 3 distinct arguments instead of a slice
The arity is intentionally flexible here (either 2 or 3). If there's a desire to always require 3, I can change it everywhere, but it'll make it diverge slightly from Spark (where the group idx is optional and defaults to 1 when not specified).
I don't follow; what I'm saying is that this specific function requires a slice of exactly 3 elements, so we might as well pass 3 separate arguments in the signature instead of indirectly encoding this invariant via slices (we call this in only one place other than tests, and we pass a slice of 3 elements there).
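The suggestion can be sketched with std types only (all names are hypothetical; `&str` stands in for `ArrayRef`): the slice version must re-check the invariant at runtime, while the three-argument version lets the compiler enforce it.

```rust
// Slice-based version: the "exactly 3" invariant must be validated at runtime.
fn extract_from_slice(args: &[&str]) -> Result<String, String> {
    if args.len() != 3 {
        return Err(format!("regexp_extract expects 3 args, got {}", args.len()));
    }
    Ok(format!("values={} pattern={} group={}", args[0], args[1], args[2]))
}

// Signature-based version: arity is enforced by the compiler, no check needed.
fn extract(values: &str, pattern: &str, group: &str) -> String {
    format!("values={} pattern={} group={}", values, pattern, group)
}

fn main() {
    // Both succeed with three arguments...
    assert_eq!(
        extract_from_slice(&["a", "(b)", "1"]).unwrap(),
        extract("a", "(b)", "1")
    );
    // ...but only the slice version can fail at runtime.
    assert!(extract_from_slice(&["a", "(b)"]).is_err());
    println!("ok");
}
```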
| }
|
| #[cfg(test)]
| mod tests {
Could we move all these tests to be SLTs instead?
I was following the existing pattern in this mod.
My $0.02: having unit tests written next to the definition helps future evolution (I, for one, find the step-by-step debugging much more efficient).
If there's a strong desire to remove them, I can (but all the other unit tests should also be removed in order to be consistent).
I see the required boilerplate adding too much verbosity; generally we prefer SLTs because they result in less test code, less verbosity, and a more natural interface (SQL, instead of needing to manually create the arguments for invoke, for example). I do agree it can be useful to have unit tests for easier debugging, but considering how many UDFs we support, I feel it's worth trading for test maintainability/consistency across the codebase.
> (but all the other unit tests should also be removed in order to be consistent).
Which unit tests are you referring to?
| let pattern = &args[1];
| let index = &args[2];
|
| let values_array = values
Can use `as_string_array` for easier downcasting here; same idea for the int array below too.
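Arrow's `as_string_array` is, conceptually, a checked downcast with a descriptive failure message; a std-only sketch of that shape (`as_string_vec` and `Vec<String>` are hypothetical stand-ins for the arrow helper and array type):

```rust
use std::any::Any;

// Std-only stand-in for arrow's downcast helpers: a single checked downcast
// with a descriptive message, instead of repeating `downcast_ref` + `expect`
// inline at every call site.
fn as_string_vec(arr: &dyn Any) -> &Vec<String> {
    arr.downcast_ref::<Vec<String>>()
        .expect("expected a string array")
}

fn main() {
    let arr: Box<dyn Any> = Box::new(vec!["100".to_string(), "200".to_string()]);
    let strings = as_string_vec(arr.as_ref());
    assert_eq!(strings.len(), 2);
    assert_eq!(strings[1], "200");
    println!("{}", strings[1]);
}
```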
rluvaton left a comment:
Thank you for your contribution
| There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
| fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
| if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".<br><br>
| It's recommended to use a raw string literal (with the `r` prefix) to avoid escaping
| special characters in the pattern string if exists."#
This comment is not useful for DataFusion users, as there is no such config for us.
|     &self.signature
| }
|
| fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
Please, instead of `return_type`, implement `return_field_from_args` and specify the nullability; in `return_type` you should return an error. Look at other implementations for an example.
Make sure the nullability depends on the input, and please add a test for that as well.
| let a0 = args[0].to_array(len)?;
| let a1 = args[1].to_array(len)?;
|
| // Spark Catalyst: def this(s, r) = this(s, r, Literal(1))
| // When idx is omitted, default to group index 1.
| let a2 = if args.len() == 3 {
Could you please rename those to match what they represent?
| let group_index = index as usize;
|
| let regex =
|     Regex::new(pattern).map_err(|e| ArrowError::ComputeError(e.to_string()))?;
Can you please use `DataFusionError` instead?
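The change is mechanical; a std-only sketch of the same `map_err` shape (the `EngineError` enum and `compile_pattern` are hypothetical stand-ins for `DataFusionError` and regex compilation, so the example stays self-contained):

```rust
// Hypothetical error type standing in for the engine's own error enum.
#[derive(Debug)]
enum EngineError {
    Execution(String),
}

// Same map_err shape as `Regex::new(pattern).map_err(...)`, but surfacing the
// failure through the engine's error type instead of ArrowError::ComputeError.
fn compile_pattern(pattern: &str) -> Result<String, EngineError> {
    // Stand-in for real regex compilation: a trailing backslash is invalid.
    if pattern.ends_with('\\') {
        return Err(EngineError::Execution(format!(
            "invalid regex pattern: {pattern}"
        )));
    }
    Ok(pattern.to_string())
}

fn main() {
    assert!(compile_pattern("(\\d+)").is_ok());
    assert!(matches!(
        compile_pattern("abc\\"),
        Err(EngineError::Execution(_))
    ));
    println!("ok");
}
```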
| };
|
| let out: ArrayRef = regexp_extract(&[a0, a1, a2])?;
| Ok(ColumnarValue::Array(out))
This has a bug: if all inputs are scalar, you return an array of size 1 rather than a scalar or an array with the expected number of rows.
This is a common pitfall and I'm no exception 😅
Please add a test where all args are scalar and the number of rows is, for example, 8192; you should return either a scalar or a columnar array of size 8192.
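The pitfall can be modeled with std types only (the `Value` enum and `map_value` helper are hypothetical stand-ins for `ColumnarValue` and the kernel): when every input is scalar, the output should stay scalar (or be expanded to `num_rows`), never a 1-element array.

```rust
// Hypothetical stand-in for ColumnarValue: a value is either one scalar that
// logically applies to every row, or a materialized array of num_rows values.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Scalar(i64),
    Array(Vec<i64>),
}

// Applies `f` row-wise while preserving scalar-ness: an all-scalar input must
// NOT collapse into a 1-element array when num_rows is, say, 8192.
fn map_value(input: &Value, num_rows: usize, f: impl Fn(i64) -> i64) -> Value {
    match input {
        Value::Scalar(v) => Value::Scalar(f(*v)),
        Value::Array(vs) => {
            assert_eq!(vs.len(), num_rows, "array length must match num_rows");
            Value::Array(vs.iter().map(|v| f(*v)).collect())
        }
    }
}

fn main() {
    // All-scalar input with 8192 rows stays a scalar (the buggy version would
    // have produced a length-1 array instead).
    let out = map_value(&Value::Scalar(41), 8192, |v| v + 1);
    assert_eq!(out, Value::Scalar(42));

    let out = map_value(&Value::Array(vec![1, 2, 3]), 3, |v| v + 1);
    assert_eq!(out, Value::Array(vec![2, 3, 4]));
    println!("{:?}", out);
}
```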
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
What changes are NOT included in this PR?
Are these changes tested?
Are there any user-facing changes?
Yes (a new regexp function added to the docs).