ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary return types#7974
ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary return types#7974jorgecarleitao wants to merge 13 commits intoapache:masterfrom jorgecarleitao:scalar
Conversation
|
FYI @andygrove @alamb @houqp : this is a draft because it depends on other PRs being reviewed and accepted. |
rust/datafusion/src/logicalplan.rs
Outdated
There was a problem hiding this comment.
I had to drop partialEq for expressions, and therefore also dropped this since it was strictly not needed.
Currently a UDF's argument can only be a single type. This PR adds support for multiple types per argument, thus allowing users to register UDFs that can handle multiple types at once.
This operation is already by the optimizer and is more complete.
This allows to declare UDFs that support multiple types. The existing UDFs (math) now support float32 and float64.
| #[derive(Debug)] | ||
| struct SumAccumulator { | ||
| sum: Option<ScalarValue>, | ||
| pub sum: Option<ScalarValue>, |
|
|
||
| pub mod dataframe; | ||
| pub mod datasource; | ||
| mod datatyped; |
There was a problem hiding this comment.
pub so that others can implement a different DataTyped?
| args: Vec<Expr>, | ||
| /// The `DataType` the expression will yield | ||
| return_type: DataType, | ||
| return_type: ReturnType, |
There was a problem hiding this comment.
THIS IS A MAJOR CHANGE: it makes Expr hold an Arc<&dyn>, which make them unserializable. I was unable to find another way, unfortunately.
| Expr::AggregateFunction { return_type, .. } => Ok(return_type.clone()), | ||
| Expr::ScalarFunction { | ||
| args, return_type, .. | ||
| } => return_type(&args.iter().map(|x| x.as_datatyped()).collect(), schema), |
There was a problem hiding this comment.
not very beautiful, but was unable to find a better pattern for this :(
| match self.schema_provider.get_function_meta(&name) { | ||
| Some(fm) => { | ||
| let args = if name == "count" { | ||
| // optimization to avoid computing expressions |
There was a problem hiding this comment.
This should probably be moved to another place (an optimizer?). For now I left it where it here as it was.
This PR is done on top of #7971 ,
Its current runtime consequence is that all our math functions now return float32 or float64 depending on their incoming column (and use float32 for other numeric types). Its API consequences is that it allows to register UDFs of variable return types.
This PR hits both physical plans and logical plans, and abstracts a bit the typing of some of our structs. In particular, it introduces a new trait that only cares about the return datatype of an object (PhysicalExpr and LogicalExpr).