
feat: support lambda function for scalar udf #17220

Closed

chenkovsky wants to merge 10 commits into apache:main from chenkovsky:feat/high_order

Conversation

@chenkovsky (Contributor) commented Aug 17, 2025:

Which issue does this PR close?

  • Closes #.

Rationale for this change

Some array-related UDFs need to accept a lambda function as an argument.

What changes are included in this PR?

Support lambda functions in scalar UDFs, and implement array_filter as an example.

Are these changes tested?

Yes, unit tests.

Are there any user-facing changes?

No

@github-actions bot added the following labels on Aug 17, 2025: sql (SQL Planner), logical-expr (Logical plan and expressions), physical-expr (Changes to the physical-expr crates), optimizer (Optimizer rules), sqllogictest (SQL Logic Tests (.slt)), substrait (Changes to the substrait crate), catalog (Related to the catalog crate), proto (Related to proto crate)
@chenkovsky chenkovsky marked this pull request as draft August 17, 2025 14:59
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Aug 17, 2025
@chenkovsky chenkovsky marked this pull request as ready for review August 18, 2025 01:19
Comment on lines 179 to 185
fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
let [arg_type] = take_function_args(self.name(), arg_types)?;
match arg_type {
List(_) | LargeList(_) => Ok(arg_type.clone()),
_ => plan_err!("{} does not support type {}", self.name(), arg_type),
}
}
Member:
Can you please implement return_field_from_args instead so it won't be nullable in case the input is not nullable
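A minimal sketch of what that could look like. The types below are simplified, self-contained stand-ins for arrow's Field/DataType; the real implementation would go through DataFusion's ScalarUDFImpl::return_field_from_args and its ReturnFieldArgs struct, so the signature here is illustrative only. The key idea is to return the input field rather than just a DataType, so the output inherits the input's nullability:

```rust
use std::sync::Arc;

// Simplified stand-ins for arrow_schema::DataType / Field, so this sketch is
// self-contained; the real code would use DataFusion's ReturnFieldArgs.
#[derive(Clone, Debug, PartialEq)]
enum DataType {
    Int32,
    List(Box<DataType>),
}

#[derive(Clone, Debug, PartialEq)]
struct Field {
    data_type: DataType,
    nullable: bool,
}

struct ArrayFilter;

impl ArrayFilter {
    fn name(&self) -> &str {
        "array_filter"
    }

    // Returning the input *field* (not just its DataType) lets the output
    // carry the input's nullability: a non-nullable list in gives a
    // non-nullable list out.
    fn return_field_from_args(&self, arg_fields: &[Arc<Field>]) -> Result<Arc<Field>, String> {
        match arg_fields {
            [arg_field] => match &arg_field.data_type {
                DataType::List(_) => Ok(Arc::clone(arg_field)),
                other => Err(format!("{} does not support type {:?}", self.name(), other)),
            },
            _ => Err(format!("{} expects exactly one array argument", self.name())),
        }
    }
}
```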

Comment on lines 281 to 286
fn coerce_types(&self, _arg_types: &[DataType]) -> Result<Vec<DataType>> {
datafusion_common::not_impl_err!(
"Function {} does not implement coerce_types",
self.name()
)
}
Member:
This is the default implementation, can you please remove it?

lambda: &dyn PhysicalLambda,
field: &Arc<Field>,
) -> Result<ArrayRef> {
let mut offsets = vec![OffsetSize::zero()];
Member:
can you use OffsetBufferBuilder instead? so we don't have to manage the offsets ourselves
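For illustration, a std-only behavioral sketch of what the builder provides (the real type is arrow_buffer::OffsetBufferBuilder; OffsetsSketch here is a stand-in mirroring its push_length/finish idea): callers push lengths and the builder maintains the running offsets, so a null or empty row is just push_length(0):

```rust
// Behavioral sketch of arrow_buffer::OffsetBufferBuilder, std-only so it runs
// anywhere. The real builder exposes the same idea: push *lengths* and it
// accumulates the running offsets, so callers never manage the offsets vector
// by hand.
struct OffsetsSketch {
    offsets: Vec<i32>, // always starts with 0, like the real builder
}

impl OffsetsSketch {
    fn new() -> Self {
        Self { offsets: vec![0] }
    }

    // push_length(n): the next list element covers n values.
    // A null or empty list is simply push_length(0).
    fn push_length(&mut self, len: usize) {
        let last = *self.offsets.last().unwrap();
        self.offsets.push(last + len as i32);
    }

    fn finish(self) -> Vec<i32> {
        self.offsets
    }
}
```

With the real builder, the manual offsets vector and per-row offset arithmetic collapse into one push_length call per row.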

/// Implementation of the `array_filter` scalar user-defined function.
///
/// This function filters array elements using a lambda function, returning a new array
/// containing only the elements for which the lambda function returns true.
Member:
Please also note that nulls will count as false

),
argument(
name = "lambda",
description = "Lambda function with one argument that returns a boolean. The lambda is applied to each element of the array."
Member:
please add that returning null will be the same as false

@rluvaton (Member) left a comment:
Good job, left some comments

Comment on lines +320 to +331
let values = list_array.values();
let value_offsets = list_array.value_offsets();
let nulls = list_array.nulls();

let batch = RecordBatch::try_new(
Schema::new(vec![field
.as_ref()
.clone()
.with_name(lambda.params()[0].clone())])
.into(),
vec![Arc::clone(values)],
)?;
Member:
I can do it in a separate PR.

this will lead to unnecessary computation, as it will include values that are not part of the list's "visible" values in either of the following cases:

1. the list is sliced, making the evaluation work on more data than is needed.
   This is how to create that:

let data = vec![
    Some(vec![Some(0), Some(1), Some(2)]),
    Some(vec![Some(3), Some(4), Some(5)]),
    Some(vec![Some(6), Some(7)]),
    Some(vec![Some(8)]),
];
let list_array = ListArray::from_iter_primitive::<Int32Type, _, _>(data);
let list_sliced_values = list_array.slice(1, 2);

2. there are nulls in the list that are not backed by an empty list.
   This is how to create that:

let data = vec![
    Some(vec![Some(0), Some(1), Some(2)]),
    Some(vec![Some(3), Some(4), Some(5)]),
    Some(vec![Some(6), Some(7)]),
];
let list_array = ListArray::from_iter_primitive::<Int32Type, _, _>(data);
let (field, offsets, values, nulls) = list_array.into_parts();
let list_array_with_null_pointing_to_non_empty_list = ListArray::try_new(
    field,
    offsets,
    values,
    Some(NullBuffer::from(&[true, false, true])),
)?;

Contributor (author):
Please view it again; the array is now compacted first. I think this solves both the unnecessary computation and the bug above. @rluvaton

Member:
it looks ok, but can you please add a unit test (as it's tricky to simulate with sql) to make sure the bug won't return?

Comment on lines +334 to +338
let ColumnarValue::Array(filter_array) = filter_array else {
return exec_err!(
"array_filter requires a lambda that returns an array of booleans"
);
};
Member:
You can add optimization for scalar if you want or I can do it in a different PR

Comment on lines 344 to 345
// Handle null arrays by keeping the offset unchanged
offsets.push(offsets[row_index]);
Member:
This has a bug in the case of a null value pointing to a non-empty list where none of the underlying values were filtered.


# array_filter with multiple array columns
statement ok
CREATE TABLE test_arrays (arr1 ARRAY<INTEGER>, arr2 ARRAY<INTEGER>) AS VALUES ([1, 2, 3], [4, 5, 6]);
Member:
Can you please add a null list here as well, and null items?

}

fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
not_impl_err!("{} does not implement return_type", self.name())
Member:
Suggested change
not_impl_err!("{} does not implement return_type", self.name())
not_impl_err!("{} does not implement return_type, call return_field_from_args instead", self.name())

@rluvaton (Member) left a comment:
Great job, but I think this lambda API adds a lot of complexity for the user.

@alamb, your take?

Member:
Because this is the first implementation of lambda functions, could you please add a lot of comments explaining how it works, so future lambda implementations will have a reference point?

///
/// # Returns
/// An optional optimized expression, or None if no optimization is available
fn try_call(&self, _args: &[Expr]) -> Result<Option<Expr>> {
Member:
the try_call function name is confusing as it sounds like it will invoke the udf.

also isn't it the same as simplify?

@chenkovsky (Contributor, author) commented Aug 25, 2025:
It's the same as simplify, except for when it is invoked. I tried simplify before, but an error was thrown, because simplify is called during the optimize stage, while we need this function to be called at the very beginning. Maybe we can reuse simplify, but I'm not sure whether that would break other functions. Maybe call this method try_currying?

Contributor:
Nitpick: agreed, a more descriptive name would be nice.

Comment on lines +793 to +813
/// Plans the scalar UDF implementation with lambda function support.
///
/// This method enables UDF implementations to work with lambda functions
/// by allowing them to plan and prepare lambda expressions for execution.
/// Returns a new implementation instance if lambda planning is needed.
///
/// # Arguments
/// * `_planner` - The lambda planner for converting logical lambdas to physical
/// * `_args` - The function arguments that may include lambda expressions
/// * `_input_dfschema` - The input schema context for lambda planning
///
/// # Returns
/// An optional new UDF implementation with planned lambdas, or None if no planning is needed
fn plan(
&self,
_planner: &dyn LambdaPlanner,
_args: &[Expr],
_input_dfschema: &DFSchema,
) -> Result<Option<Arc<dyn ScalarUDFImpl>>> {
Ok(None)
}
Member:
I find this approach confusing both to implement and understand. It requires users to call this function beforehand for the higher-order function to actually work. I had to read through it several times to grasp the concept.

This function is now a prerequisite for the lambda function UDF to operate. Previously, there was only one simple entry point (invoke_with_args) that was straightforward to implement. Adding another entry point increases complexity unnecessarily.

I suggest considering an alternative approach: create a separate trait specifically for higher-order functions with a dedicated wrapper (similar to ScalarUDF) that provides a better API suited for higher-order functions. This wrapper could handle the "compilation" of lambda expressions upfront, and the invoke call would include the pre-compiled lambda function.

Alternatively, we could add physical expressions of children to ScalarFunctionArgs, though I'm not particularly fond of that solution either.

For context on precompilation (which is meant for optimization and not required for the expression to work):

The current implementation creates confusion and adds an unnecessary prerequisite step that users must remember to perform.
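For concreteness, a toy sketch of the separate-trait alternative described above. Every name here (HigherOrderUdfImpl, CompiledLambda, ArrayFilterSketch) is hypothetical and not part of DataFusion; plain Vec and closure types stand in for ArrayRef and a planned PhysicalExpr:

```rust
use std::sync::Arc;

// Hypothetical sketch: a dedicated trait for higher-order functions whose
// lambdas are compiled up front by a wrapper (analogous to ScalarUDF), so
// invoke is again the single entry point.
struct CompiledLambda {
    // In the real system this would be a planned physical expression;
    // here, a plain closure over one element.
    body: Arc<dyn Fn(i32) -> bool>,
}

trait HigherOrderUdfImpl {
    fn name(&self) -> &str;
    // The wrapper has already compiled the lambda, so implementations never
    // deal with logical expressions or a separate planning step.
    fn invoke(&self, arg: &[Vec<i32>], lambda: &CompiledLambda) -> Vec<Vec<i32>>;
}

struct ArrayFilterSketch;

impl HigherOrderUdfImpl for ArrayFilterSketch {
    fn name(&self) -> &str {
        "array_filter"
    }

    // Keep only the elements for which the pre-compiled lambda returns true.
    fn invoke(&self, arg: &[Vec<i32>], lambda: &CompiledLambda) -> Vec<Vec<i32>> {
        arg.iter()
            .map(|list| list.iter().copied().filter(|v| (lambda.body)(*v)).collect())
            .collect()
    }
}
```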

Contributor (author):
I considered your solutions before, but they all require significant changes. This solution is not the perfect one, but I think it is the one with the least modification.

Lambda functions are just a seasoning; although they are indispensable, most UDFs do not require them. For example, in Databricks, only the following functions use lambda functions:
aggregate, array_sort, exists, filter, forall, map_filter, map_zip_with, transform, transform_keys, transform_values, zip_with.
Therefore, I feel there is no need for us to make huge changes just because of this. That's why I selected the approach with the least modification.

It requires users to call this function beforehand for the higher-order function to actually work.

It's currying; from my side, this is not hard to understand.

Comment on lines +808 to +809
_planner: &dyn LambdaPlanner,
_args: &[Expr],
Member:
What if I don't work with logical expressions and only physical ones, like in Comet?

@alamb (Contributor) commented Aug 26, 2025:

@shehabgamin / @andygrove is this functionality that might be needed for spark integration?

Thanks for the call out @rluvaton and for the PR @chenkovsky

Unfortunately, I am not likely to have the time to review this PR / feature in the near term -- as @rluvaton says, it would take some time to understand the implications of this new API for the system and other users.

@chenkovsky can you explain more of the rationale / need for this functionality? If there are other users who need this feature, perhaps we can find some others in the community to help drive it forward.

@alamb alamb added the api change Changes the API exposed to users of the crate label Aug 26, 2025
@rluvaton (Member) commented Aug 26, 2025:

I'm all in on supporting lambda functions

is this functionality that might be needed for spark integration?

yes, as @chenkovsky wrote:

in Databricks, only the following functions use lambda functions:
aggregate, array_sort, exists, filter, forall, map_filter, map_zip_with, transform, transform_keys, transform_values, zip_with.

@github-actions bot:

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Oct 28, 2025
@github-actions github-actions bot closed this Nov 4, 2025
@shehabgamin (Contributor) commented:

@shehabgamin / @andygrove is this functionality that might be needed for spark integration?

Thanks for the call out @rluvaton and for the PR @chenkovsky

Unfortunately, I am not likely to have the time to review this PR / feature in the near term -- as @rluvaton says, it would take some time to understand the implications of this new API for the system and other users.

@chenkovsky can you explain more of the rationale / need for this functionality? If there are other users who need this feature, perhaps we can find some others in the community to help drive it forward.

@alamb @chenkovsky Sorry I missed this, I think supporting lambda functions would be super valuable. I'll take a look through this PR today/tomorrow and hopefully we can reopen it?

if let Some(fm) = self.context_provider.get_function_meta(&name) {
    let args = self.function_args_to_expr(args, schema, planner_context)?;
-   return Ok(Expr::ScalarFunction(ScalarFunction::new_udf(fm, args)));
+   return fm.try_call(args);
Contributor:
The naming here is a bit confusing imo

///
/// # Returns
/// An optional new UDF implementation with planned lambdas, or None if no planning is needed
fn plan(
Contributor:
Adding a plan function to the trait feels a bit odd. Are there any potential use cases for this outside of lambdas?

@gstvg (Contributor) commented Dec 9, 2025:

@chenkovsky @rluvaton @shehabgamin @alamb I happened to also open a PR, #18921, to add lambda support. Even if this one (the first to be opened) is the approach chosen to move forward, I believe the different approach and the alternatives listed there can help make progress here. Thanks
