
Port regex_extract #20308

Open
calcaura wants to merge 4 commits into apache:main from calcaura:regexp-extract

Conversation

@calcaura calcaura commented Feb 12, 2026

Which issue does this PR close?

Rationale for this change

  • Implement the Spark function "regexp_extract" in DataFusion.

What changes are included in this PR?

What changes are NOT included in this PR?

  • Support for LargeUtf8
  • Support for Utf8View

Are these changes tested?

  • Yes, Unit tests + SQL + CI
# Unit tests
cargo test --package datafusion-functions --lib -- regex::regexpextract::tests --nocapture
# SQL tests
cargo test --test sqllogictests -- regexp_extract

Are there any user-facing changes?

Yes (new regex function added to the docs).

@calcaura calcaura marked this pull request as draft February 12, 2026 10:33
@github-actions github-actions bot added the functions (Changes to functions implementation) label Feb 12, 2026
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) and sqllogictest (SQL Logic Tests (.slt)) labels Feb 12, 2026
@calcaura calcaura marked this pull request as ready for review February 12, 2026 14:45
@Jefffrey
Contributor

cc @Omega359 @comphead did we ever land on a consensus regarding regexp_extract and regexp_substr? We had some PRs for them before and they seemed to lapse, but looks like there was still some discussion on which regex functions we include as part of datafusion

@Omega359
Contributor

> cc @Omega359 @comphead did we ever land on a consensus regarding regexp_extract and regexp_substr? We had some PRs for them before and they seemed to lapse, but looks like there was still some discussion on which regex functions we include as part of datafusion

The last I recall thinking about this was summarized in this comment. The functions, at least as seen in other dbs or query engines, are very similar, with extract being slightly more powerful by allowing one to define which group to extract.

Frankly, I could see datafusion having one function that does both (aliased to regexp_substr and regexp_extract) where an optional 'index' or 'group' can be provided (defaulting to 0) that denotes which capture group to return.
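The optional-index convention proposed above can be sketched in plain Rust. This is a std-only illustration of the calling convention only; `resolve_group_index` and its error messages are invented names, not DataFusion API:

```rust
// Sketch of the proposed calling convention: the group index is optional
// and defaults to 0 (the whole match) when omitted. All names and error
// messages here are hypothetical, for illustration only.
fn resolve_group_index(args: &[&str]) -> Result<usize, String> {
    match args.len() {
        // regexp_substr(str, pattern): default to group 0, the full match
        2 => Ok(0),
        // regexp_extract(str, pattern, idx): explicit capture group
        3 => args[2]
            .parse::<usize>()
            .map_err(|e| format!("invalid group index: {}", e)),
        n => Err(format!("expected 2 or 3 arguments, got {}", n)),
    }
}

fn main() {
    assert_eq!(resolve_group_index(&["100-200", r"(\d+)-(\d+)"]), Ok(0));
    assert_eq!(resolve_group_index(&["100-200", r"(\d+)-(\d+)", "2"]), Ok(2));
    assert!(resolve_group_index(&["100-200"]).is_err());
    println!("ok");
}
```

Note that Spark's own `regexp_extract` defaults the index to 1, so aliasing both names onto one function would require picking one default.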

| 200 |
+---------------------------------------------------------+
```
Additional examples can be found [here](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/builtin_functions/regexp.rs)
Contributor

If we are not going to add examples to the regexp.rs file I would suggest removing this line.

argument(name = "str", description = "Column or column name"),
argument(
name = "regexp",
description = r#"a string representing a regular expression. The regex string should be a
Contributor

If this is indeed the case (java) this function belongs in the spark crate, not in the main datafusion functions crate.

Member

Also, Java regular expressions are not 100% compatible with Rust's.

) -> Result<ColumnarValue> {
let args = &args.args;

if args.len() != 2 && args.len() != 3 {
Contributor

I'm not sure how this could possibly work. If args.len() == 2 it'll fail the second condition, if 3, the first.

Author

If it's neither 2 nor 3, then it's an error.

So, if len == 2, it'll fail the check on 3, hence won't enter the branch.

Maybe written as follows it could read more easily?

Suggested change
if args.len() != 2 && args.len() != 3 {
if !(args.len() == 2 || args.len() == 3) {
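For what it's worth, Rust's `matches!` macro expresses the same arity check without double negation. A minimal std-only sketch; `check_arity` is a made-up helper name, not code from this PR:

```rust
// Arity check via the `matches!` macro: accepts 2 or 3 arguments and
// rejects everything else. `check_arity` is a hypothetical helper for
// illustration, not DataFusion API.
fn check_arity(n: usize) -> Result<(), String> {
    if matches!(n, 2 | 3) {
        Ok(())
    } else {
        Err(format!("regexp_extract expects 2 or 3 arguments, got {}", n))
    }
}

fn main() {
    assert!(check_arity(2).is_ok());
    assert!(check_arity(3).is_ok());
    assert!(check_arity(4).is_err());
    println!("arity checks pass");
}
```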

@comphead
Contributor

From what I remember, it was quite complicated to expose Rust-backed regexp to the JVM world, because of Rust/JVM regexp processing differences.
The major ones:

  • no backtracking in Rust
  • groups
  • quantifier differences
  • lookaheads

Theoretically we can still expose the function, but Spark users need to be careful and accept the nuances, and this needs to be documented.

Contributor

@Jefffrey left a comment

Sounds like we should proceed with adding this as a function, given other dbs/engines have something similar; however, we should probably approach this from the angle of adding it as a DataFusion function, not necessarily matching Spark exactly, given what @comphead outlined.

/// Extracts a group that matches `regexp`. If `idx` is not specified,
/// it defaults to 1.
///
/// Matches Spark's DataFrame API: `regexp_extract(e: Column, exp: String, groupIdx: Int)`
Contributor

We probably should remove mention of Spark since we're adding this as a DataFusion function (i.e. not to the datafusion-spark crate)

Member

@rluvaton commented Feb 18, 2026

I agree, but to give another perspective: it is useful to have a reference to the implementation this was based on and why it was done that way, as we do for other functions that match Postgres behavior

Member

What I do think is that we should remove the extensive comments about Spark in the implementation itself (like Spark catalyst reference and the inner implementations), and we can just keep a single mention if we decide to.

use std::any::Any;
use std::sync::Arc;

// See https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.regexp_extract.html
Contributor

Same here

}

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
use DataType::*;
Contributor

We don't need all these checks in return_type; we can simply return Ok(Utf8) as signature should guard this for us

}

// DataFusion passes either scalars or arrays. Convert to arrays.
let len = args
Contributor

We should just use make_scalar_function, which handles this boilerplate for us, if we don't want to deal with ColumnarValues

Author

Thanks, I'll try to look into it!

}

/// Helper to build args for tests and external callers.
pub fn regexp_extract(args: &[ArrayRef]) -> Result<ArrayRef> {
Contributor

This doesn't need to be public


/// Helper to build args for tests and external callers.
pub fn regexp_extract(args: &[ArrayRef]) -> Result<ArrayRef> {
if args.len() != 3 {
Contributor

If it needs 3 arguments we should make the signature 3 distinct arguments instead of a slice

Author

Here's a small omission (it's either 2 or 3). If there's a desire to always have exactly 3, I can change it everywhere, but it'll make it diverge slightly from Spark (where the group idx is optional and defaults to 1 when not specified).

Contributor

I don't follow; what I'm saying here is that this specific function says it wants a slice of 3 elements, so we might as well pass in 3 separate arguments as part of the signature instead of indirectly encoding this invariant via slices (we call this in only one place other than tests, and we pass in a slice of 3 elements)
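The reviewer's point can be sketched abstractly. This is a std-only illustration with types simplified to plain strings (the real code would take Arrow `ArrayRef`s); both function names below are invented for the sketch:

```rust
// Before: the three-argument invariant is enforced by a runtime length
// check, and every caller can get it wrong.
fn regexp_extract_from_slice(args: &[&str]) -> Result<(String, String, String), String> {
    if args.len() != 3 {
        return Err(format!("expected 3 arguments, got {}", args.len()));
    }
    Ok((args[0].into(), args[1].into(), args[2].into()))
}

// After: the compiler enforces the arity, so the runtime check and its
// error branch disappear entirely.
fn regexp_extract_three(values: &str, pattern: &str, group_idx: &str) -> (String, String, String) {
    (values.into(), pattern.into(), group_idx.into())
}

fn main() {
    assert!(regexp_extract_from_slice(&["a", "b"]).is_err());
    assert_eq!(
        regexp_extract_three("100-200", r"(\d+)", "1"),
        ("100-200".into(), r"(\d+)".into(), "1".into())
    );
    println!("ok");
}
```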

}

#[cfg(test)]
mod tests {
Contributor

Could we move all these tests to be SLTs instead?

Author

I was following the existing pattern in this mod.

My $0.02: having unit tests written next to the definition helps future evolution (I for one find step-by-step debugging much more efficient).

If there's a strong desire to remove them, I can (but all the other unit tests should also be removed in order to be consistent).

Contributor

I see the required boilerplate adding too much verbosity; generally we prefer having SLTs because they result in less test code, less verbosity, and a more natural interface (SQL instead of needing to manually create the arguments for invoke, for example). I do agree it can be useful to have unit tests for easier debugging, but considering how many UDFs we support I feel it's worth trading for test maintainability/consistency across the codebase.

> (but all the other unit tests should also be removed in order to be consistent).

Which unit tests are you referring to?

let pattern = &args[1];
let index = &args[2];

let values_array = values
Contributor

Can use as_string_array for easier downcasting here; same idea for int array below too

Member

@rluvaton left a comment

Thank you for your contribution

Comment on lines +58 to +62
There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".<br><br>
It's recommended to use a raw string literal (with the `r` prefix) to avoid escaping
special characters in the pattern string if exists."#
Member

This comment is not useful for DataFusion users as there is no such config for us

&self.signature
}

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
Member

Please, instead of return_type, implement return_field_from_args and specify the nullability; return_type itself should then return an error. Look at other implementations for an example.

Make sure the nullability depends on the input, and please add a test for that as well

Comment on lines +155 to +160
let a0 = args[0].to_array(len)?;
let a1 = args[1].to_array(len)?;

// Spark Catalyst: def this(s, r) = this(s, r, Literal(1))
// When idx is omitted, default to group index 1.
let a2 = if args.len() == 3 {
Member

Could you please rename those to match what they represent

let group_index = index as usize;

let regex =
Regex::new(pattern).map_err(|e| ArrowError::ComputeError(e.to_string()))?;
Member

Can you please use DataFusionError instead?

/// Extracts a group that matches `regexp`. If `idx` is not specified,
/// it defaults to 1.
///
/// Matches Spark's DataFrame API: `regexp_extract(e: Column, exp: String, groupIdx: Int)`
Member

What I do think is that we should remove the extensive comments about Spark in the implementation itself (like Spark catalyst reference and the inner implementations), and we can just keep a single mention if we decide to.

};

let out: ArrayRef = regexp_extract(&[a0, a1, a2])?;
Ok(ColumnarValue::Array(out))
Member

This has a bug: if all inputs are scalar, you return an array of size 1 rather than a scalar or an array with the expected number of rows.

This is a common pitfall and I'm no exception 😅

Please add a test where all args are scalar and the number of rows is, for example, 8192; you should return either a scalar or a columnar array of size 8192
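The pitfall in miniature, using a mock enum in place of DataFusion's real ColumnarValue type (everything here is a simplified, hypothetical stand-in):

```rust
// Simplified mock of the scalar-vs-array distinction. The bug being
// described: when every input is a scalar, the function must return a
// scalar (or an array with the full row count), never a one-element
// array. `MockColumnarValue` and `uppercase` are invented stand-ins.
#[derive(Debug, Clone, PartialEq)]
enum MockColumnarValue {
    Scalar(String),
    Array(Vec<String>),
}

fn uppercase(args: &[MockColumnarValue], num_rows: usize) -> MockColumnarValue {
    match args {
        // All-scalar fast path: compute once and return a scalar the
        // engine can broadcast to any number of rows.
        [MockColumnarValue::Scalar(s)] => MockColumnarValue::Scalar(s.to_uppercase()),
        // Array path: the output must have exactly `num_rows` rows.
        [MockColumnarValue::Array(v)] => {
            assert_eq!(v.len(), num_rows);
            MockColumnarValue::Array(v.iter().map(|s| s.to_uppercase()).collect())
        }
        _ => panic!("expected exactly one argument in this sketch"),
    }
}

fn main() {
    // The buggy version would return Array(vec!["AB"]) here: one row,
    // even though the batch has 8192 rows.
    let out = uppercase(&[MockColumnarValue::Scalar("ab".into())], 8192);
    assert_eq!(out, MockColumnarValue::Scalar("AB".into()));
    println!("scalar in, scalar out");
}
```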


Labels

documentation (Improvements or additions to documentation), functions (Changes to functions implementation), sqllogictest (SQL Logic Tests (.slt))


Development

Successfully merging this pull request may close these issues.

regexp_extract func from Spark

5 participants