
Port regex_extract #20308

Open
calcaura wants to merge 4 commits into apache:main from calcaura:regexp-extract

Conversation

@calcaura calcaura commented Feb 12, 2026

Which issue does this PR close?

Rationale for this change

  • Implement the Spark function "regexp_extract" in DataFusion.

What changes are included in this PR?

What changes are NOT included in this PR?

  • Support for LargeUtf8
  • Support for Utf8View

Are these changes tested?

  • Yes, Unit tests + SQL + CI
# Unit tests
cargo test --package datafusion-functions --lib -- regex::regexpextract::tests --nocapture
# SQL tests
cargo test --test sqllogictests -- regexp_extract

Are there any user-facing changes?

Yes (new regex function added to the docs).

@calcaura calcaura marked this pull request as draft February 12, 2026 10:33
@github-actions github-actions bot added the functions (Changes to functions implementation) label Feb 12, 2026
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) and sqllogictest (SQL Logic Tests (.slt)) labels Feb 12, 2026
@calcaura calcaura marked this pull request as ready for review February 12, 2026 14:45
@Jefffrey
Contributor

cc @Omega359 @comphead did we ever land on a consensus regarding regexp_extract and regexp_substr? We had some PRs for them before and they seemed to lapse, but looks like there was still some discussion on which regex functions we include as part of datafusion

@Omega359
Contributor

> cc @Omega359 @comphead did we ever land on a consensus regarding regexp_extract and regexp_substr? We had some PRs for them before and they seemed to lapse, but looks like there was still some discussion on which regex functions we include as part of datafusion

The last I recall thinking about this was summarized in this comment. The functions, at least as seen in other dbs or query engines, are very similar, with extract being slightly more powerful by allowing one to define which group to extract.

Frankly, I could see datafusion having one function that does both (aliased to regexp_substr and regexp_extract) where an optional 'index' or 'group' can be provided (defaulting to 0) that denotes which capture group to return.
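The optional-index convention proposed above can be sketched in plain Rust. This is a std-only illustration of the calling convention only; `resolve_group_index` and its error messages are invented names, not DataFusion API:

```rust
// Sketch of the proposed calling convention: the group index is optional
// and defaults to 0 (the whole match) when omitted. All names and error
// messages here are hypothetical, for illustration only.
fn resolve_group_index(args: &[&str]) -> Result<usize, String> {
    match args.len() {
        // regexp_substr(str, pattern): default to group 0, the full match
        2 => Ok(0),
        // regexp_extract(str, pattern, idx): explicit capture group
        3 => args[2]
            .parse::<usize>()
            .map_err(|e| format!("invalid group index: {}", e)),
        n => Err(format!("expected 2 or 3 arguments, got {}", n)),
    }
}

fn main() {
    assert_eq!(resolve_group_index(&["100-200", r"(\d+)-(\d+)"]), Ok(0));
    assert_eq!(resolve_group_index(&["100-200", r"(\d+)-(\d+)", "2"]), Ok(2));
    assert!(resolve_group_index(&["100-200"]).is_err());
    println!("ok");
}
```

Note that Spark's own `regexp_extract` defaults the index to 1, so aliasing both names onto one function would require picking one default.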

| 200 |
+---------------------------------------------------------+
```
Additional examples can be found [here](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/builtin_functions/regexp.rs)
Contributor

If we are not going to add examples to the regexp.rs file I would suggest removing this line.

argument(name = "str", description = "Column or column name"),
argument(
name = "regexp",
description = r#"a string representing a regular expression. The regex string should be a
Contributor

If this is indeed the case (java) this function belongs in the spark crate, not in the main datafusion functions crate.

Member

Also, Java regular expressions are not 100% compatible with Rust's.

) -> Result<ColumnarValue> {
let args = &args.args;

if args.len() != 2 && args.len() != 3 {
Contributor

I'm not sure how this could possibly work. If args.len() == 2 it'll fail the second condition, if 3, the first.

Author

If it's neither 2 nor 3, then it's an error.

So, if len == 2, it'll fail the check on 3, hence won't enter the branch.

Maybe written as follows it could read more easily?

Suggested change
if args.len() != 2 && args.len() != 3 {
if !(args.len() == 2 || args.len() == 3) {
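For what it's worth, Rust's `matches!` macro expresses the same arity check without double negation. A minimal std-only sketch; `check_arity` is a made-up helper name, not code from this PR:

```rust
// Arity check via the `matches!` macro: accepts 2 or 3 arguments and
// rejects everything else. `check_arity` is a hypothetical helper for
// illustration, not DataFusion API.
fn check_arity(n: usize) -> Result<(), String> {
    if matches!(n, 2 | 3) {
        Ok(())
    } else {
        Err(format!("regexp_extract expects 2 or 3 arguments, got {}", n))
    }
}

fn main() {
    assert!(check_arity(2).is_ok());
    assert!(check_arity(3).is_ok());
    assert!(check_arity(4).is_err());
    println!("arity checks pass");
}
```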

@comphead
Contributor

From what I remember, it was quite complicated to expose Rust-backed regexp to the JVM world, because of Rust/JVM regexp processing differences.
The major ones:

  • no backtracking in Rust
  • groups
  • quantifier differences
  • lookaheads

Theoretically we can still expose the function, but Spark users need to be careful and accept the nuances, and this needs to be documented.

Contributor

@Jefffrey left a comment

Sounds like we should proceed with adding this as a function, given other dbs/engines have something similar; however, we should probably approach this from the angle of adding it as a DataFusion function, not necessarily matching Spark exactly, given what @comphead outlined.

/// Extracts a group that matches `regexp`. If `idx` is not specified,
/// it defaults to 1.
///
/// Matches Spark's DataFrame API: `regexp_extract(e: Column, exp: String, groupIdx: Int)`
Contributor

We probably should remove mention of Spark since we're adding this as a DataFusion function (i.e. not to the datafusion-spark crate)

Member

@rluvaton commented Feb 18, 2026

I agree, but to give another perspective: it is useful to have a reference to the implementation this was based on and why it was done that way, as we do for other functions that match Postgres behavior

Member

What I do think is that we should remove the extensive comments about Spark in the implementation itself (like Spark catalyst reference and the inner implementations), and we can just keep a single mention if we decide to.

use std::any::Any;
use std::sync::Arc;

// See https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.regexp_extract.html
Contributor

Same here

}

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
use DataType::*;
Contributor

We don't need all these checks in return_type; we can simply return Ok(Utf8) as signature should guard this for us

}

// DataFusion passes either scalars or arrays. Convert to arrays.
let len = args
Contributor

We should just use make_scalar_function, which handles this boilerplate for us, if we don't want to deal with ColumnarValues

Author

Thanks, I'll try to look into it!

}

/// Helper to build args for tests and external callers.
pub fn regexp_extract(args: &[ArrayRef]) -> Result<ArrayRef> {
Contributor

This doesn't need to be public


/// Helper to build args for tests and external callers.
pub fn regexp_extract(args: &[ArrayRef]) -> Result<ArrayRef> {
if args.len() != 3 {
Contributor

If it needs 3 arguments we should make the signature 3 distinct arguments instead of a slice

Author

Here's a small omission (it's either 2 or 3). If there's a desire to always have exactly 3, I can change it everywhere, but it'll make it diverge slightly from Spark (where the group idx is optional and defaults to 1 when not specified).

Contributor

I don't follow; what I'm saying here is that this specific function says it wants a slice of 3 elements, so we might as well pass in 3 separate arguments as part of the signature instead of indirectly encoding this invariant via slices (we call this in only one place other than tests, and we pass in a slice of 3 elements)
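The reviewer's point can be sketched abstractly. This is a std-only illustration with types simplified to plain strings (the real code would take Arrow `ArrayRef`s); both function names below are invented for the sketch:

```rust
// Before: the three-argument invariant is enforced by a runtime length
// check, and every caller can get it wrong.
fn regexp_extract_from_slice(args: &[&str]) -> Result<(String, String, String), String> {
    if args.len() != 3 {
        return Err(format!("expected 3 arguments, got {}", args.len()));
    }
    Ok((args[0].into(), args[1].into(), args[2].into()))
}

// After: the compiler enforces the arity, so the runtime check and its
// error branch disappear entirely.
fn regexp_extract_three(values: &str, pattern: &str, group_idx: &str) -> (String, String, String) {
    (values.into(), pattern.into(), group_idx.into())
}

fn main() {
    assert!(regexp_extract_from_slice(&["a", "b"]).is_err());
    assert_eq!(
        regexp_extract_three("100-200", r"(\d+)", "1"),
        ("100-200".into(), r"(\d+)".into(), "1".into())
    );
    println!("ok");
}
```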

}

#[cfg(test)]
mod tests {
Contributor

Could we move all these tests to be SLTs instead?

Author

I was following the existing pattern in this mod.

My $0.02: having unit tests written next to the definition helps future evolution (I for one find step-by-step debugging much more efficient).

If there's a strong desire to remove them, I can (but all the other unit tests should also be removed in order to be consistent).

Contributor

I see the required boilerplate adding too much verbosity; generally we prefer having SLTs because they result in less test code, less verbosity, and a more natural interface (SQL instead of needing to manually create the arguments for invoke, for example). I do agree it can be useful to have unit tests for easier debugging, but considering how many UDFs we support I feel it's worth trading for test maintainability/consistency across the codebase.

> (but all the other unit tests should also be removed in order to be consistent).

Which unit tests are you referring to?

let pattern = &args[1];
let index = &args[2];

let values_array = values
Contributor

Can use as_string_array for easier downcasting here; same idea for int array below too

Member

@rluvaton left a comment

Thank you for your contribution

Comment on lines +58 to +62
There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".<br><br>
It's recommended to use a raw string literal (with the `r` prefix) to avoid escaping
special characters in the pattern string if exists."#
Member

This comment is not useful for DataFusion users as there is no such config for us

&self.signature
}

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
Member

Please, instead of return_type, implement return_field_from_args and specify the nullability; return_type itself should then return an error. Look at other implementations for an example.

Make sure the nullability depends on the input, and please add a test for that as well

Comment on lines +155 to +160
let a0 = args[0].to_array(len)?;
let a1 = args[1].to_array(len)?;

// Spark Catalyst: def this(s, r) = this(s, r, Literal(1))
// When idx is omitted, default to group index 1.
let a2 = if args.len() == 3 {
Member

Could you please rename those to match what they represent

let group_index = index as usize;

let regex =
Regex::new(pattern).map_err(|e| ArrowError::ComputeError(e.to_string()))?;
Member

Can you please use DataFusionError instead?

/// Extracts a group that matches `regexp`. If `idx` is not specified,
/// it defaults to 1.
///
/// Matches Spark's DataFrame API: `regexp_extract(e: Column, exp: String, groupIdx: Int)`
Member

What I do think is that we should remove the extensive comments about Spark in the implementation itself (like Spark catalyst reference and the inner implementations), and we can just keep a single mention if we decide to.

};

let out: ArrayRef = regexp_extract(&[a0, a1, a2])?;
Ok(ColumnarValue::Array(out))
Member

This has a bug: if all inputs are scalar, you return an array of size 1 rather than a scalar or an array with the expected number of rows.

This is a common pitfall and I'm no exception 😅

Please add a test where all args are scalar and the number of rows is, for example, 8192; you should return either a scalar or a columnar array of size 8192
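The pitfall in miniature, using a mock enum in place of DataFusion's real ColumnarValue type (everything here is a simplified, hypothetical stand-in):

```rust
// Simplified mock of the scalar-vs-array distinction. The bug being
// described: when every input is a scalar, the function must return a
// scalar (or an array with the full row count), never a one-element
// array. `MockColumnarValue` and `uppercase` are invented stand-ins.
#[derive(Debug, Clone, PartialEq)]
enum MockColumnarValue {
    Scalar(String),
    Array(Vec<String>),
}

fn uppercase(args: &[MockColumnarValue], num_rows: usize) -> MockColumnarValue {
    match args {
        // All-scalar fast path: compute once and return a scalar the
        // engine can broadcast to any number of rows.
        [MockColumnarValue::Scalar(s)] => MockColumnarValue::Scalar(s.to_uppercase()),
        // Array path: the output must have exactly `num_rows` rows.
        [MockColumnarValue::Array(v)] => {
            assert_eq!(v.len(), num_rows);
            MockColumnarValue::Array(v.iter().map(|s| s.to_uppercase()).collect())
        }
        _ => panic!("expected exactly one argument in this sketch"),
    }
}

fn main() {
    // The buggy version would return Array(vec!["AB"]) here: one row,
    // even though the batch has 8192 rows.
    let out = uppercase(&[MockColumnarValue::Scalar("ab".into())], 8192);
    assert_eq!(out, MockColumnarValue::Scalar("AB".into()));
    println!("scalar in, scalar out");
}
```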


Labels

documentation (Improvements or additions to documentation), functions (Changes to functions implementation), sqllogictest (SQL Logic Tests (.slt))


Development

Successfully merging this pull request may close these issues.

regexp_extract func from Spark

5 participants