Blog post on using UDFs in python #17
timsaucer merged 11 commits into apache:main from timsaucer:tsaucer/python-udf-approaches
Conversation
Quoted diff context:

```python
return pa.array(result)
```

```python
is_of_interest = udf(
```

Suggested change:

```python
# Wrap our custom function with `datafusion.udf`, annotating expected
# parameter and return types
is_of_interest = udf(
```
As a separate note, it wouldn't be hard to convert this `udf` function wrapper into a Python decorator, so we could do:

```python
@udf((pa.int64(), pa.int64(), pa.utf8()), pa.bool_(), "stable")
def is_of_interest(
    partkey_arr: pa.Array,
    suppkey_arr: pa.Array,
    returnflag_arr: pa.Array,
) -> pa.Array: ...
```

Great idea. I've added it to the issue list apache/datafusion-python#806
Quoted diff context:

```python
    returnflag_arr: pa.Array,
) -> pa.Array:
    results = None
    for partkey, suppkey, returnflag in values_of_interest:
```

I think you can use `pyarrow.compute.is_in` to speed this up, instead of doing an equality check multiple times: https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_in.html
Quoted diff context:

```rust
let values = partkey_arr
    .values()
    .iter()
    .zip(suppkey_arr.values().iter())
    .zip(returnflag_arr.iter())
    .map(|((a, b), c)| (a, b, c.unwrap_or_default()))
    .map(|v| values_to_search.contains(&v));
```

This is faster, I suppose, because it's not doing a boolean check on each individual array in its entirety and then ORing them? It's doing it all at once in a single pass?
Yes, I didn't dive any deeper, but my expectation is that by doing a single pass through the iteration we'll get a small speed improvement. In my modest test it only accounted for about a 5% boost.
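For readers following along in Python, the same single-pass idea looks roughly like this (hypothetical data mirroring the Rust snippet above; a hash-set membership test replaces the per-candidate equality scans):

```python
# Hypothetical row data for the three columns of interest
partkeys = [1530, 4031, 7]
suppkeys = [4031, 1530, 8]
returnflags = ["N", "R", "N"]

# The combinations we want to keep, as a set for O(1) lookup
values_to_search = {(1530, 4031, "N"), (7, 8, "N")}

# One pass over all three columns at once; each row tuple is checked
# against the set, so no full-array equality pass per candidate is needed
mask = [
    (p, s, r) in values_to_search
    for p, s, r in zip(partkeys, suppkeys, returnflags)
]
# mask == [True, False, True]
```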
Huge tip of the hat to @kylebarron for the thorough feedback!

@timsaucer is this something we should publish? I hadn't seen it before but it looks great.

I keep meaning to get back to it and add a portion on window functions but I get distracted with other things. Yes, let me brush it up.

I definitely found the post helpful!

I don't know if I'll have time soon to write an entire section on how to use the window functions, so I just added a blurb to look at the online documentation. I do think the online docs are actually in a pretty good spot. I think this is ready for review/merge/publish.
andygrove
left a comment
Excellent blog post! Thanks @timsaucer
Quoted diff context:

```
For a few months now I’ve been working with [Apache DataFusion](https://datafusion.apache.org/), a
fast query engine written in Rust. From my experience the language that nearly all data scientists
are working in is Python. In general, often stick to [Pandas](https://pandas.pydata.org/) for
```

Suggested change:

```
are working in is Python. In general, data scientists often use [Pandas](https://pandas.pydata.org/) for
```
Quoted diff context:

```
## User Defined Window Functions

Writing a user defined window function is slighlty more complex than an aggregate function due
processing.
```

Suggested change:

```
Writing a user defined window function is slightly more complex than an aggregate function due
```
Quoted diff context:

```
In addition to DataFusion, there is another Rust based newcomer to the DataFrame world,
[Polars](https://pola.rs/). It is growing extremely fast, and it serves many of the same use cases
```

Suggested change:

```
[Polars](https://pola.rs/). The latter is growing extremely fast, and it serves many of the same use cases
```
Quoted diff context:

```
When it comes to designing UDFs, I strongly recommend seeing if you can write your UDF using
[PyArrow functions](https://arrow.apache.org/docs/python/api/compute.html) rather than pure Python
objects. These will give you enormous speed benefits. If you must do something that isn't well
```

Some users jump directly to the conclusions section, so I'd add a short notice about what the speed benefits were.
alamb
left a comment
Thank you @timsaucer -- I think this is great (and I certainly learned a lot).
Once you merge this PR we need to do the publish process here: https://github.com/apache/datafusion-site?tab=readme-ov-file#publish-site
I am happy to do so as well, but I wanted to give you a heads up that it wouldn't just auto publish (no one has hooked that up yet).
Quoted diff context:

```
- Writing a UDF in Rust and exposing it to Python

Additionally I will demonstrate two variants of this. The first will be nearly identical to the
PyArrow library approach to simplicity of understanding how to connect the Rust code to Python. The
```

Suggested change:

```
PyArrow library approach to simplify understanding how to connect the Rust code to Python. In the
```
Quoted diff context:

```
DataFrame itself, perform a join, and select the columns from the original DataFrame. If we were
working in PySpark we would probably broadcast join the DataFrame created from the tuple list since
it is tiny. In practice, I have found that with some DataFrame libraries performing a filter rather
than a join can be significantly faster. This is worth profiling for your specific use case.
```

100% agree.
BTW this is the kind of project I think is needed to make such queries much faster: apache/datafusion#7955
Quoted diff context:

```
only to return a DataFrame with a specific combination of these three values. That is, I want
to know if part number 1530 from supplier 4031 was sold (not returned), so I want a specific
combination of `p_partkey = 1530`, `p_suppkey = 4031`, and `p_returnflag = 'N'`. I have a small
handful of these combinations I want to return.
```

I don't think we should change this blog post / example, but I wanted to point out that you can do this same calculation in SQL:

```sql
> create table foo (x int, y int) as values (1,2), (3,4), (5,6);
0 row(s) fetched.
Elapsed 0.046 seconds.

> select * from foo where (x,y) IN ((3,4), (4,5));
+---+---+
| x | y |
+---+---+
| 3 | 4 |
+---+---+
1 row(s) fetched.
Elapsed 0.018 seconds.
```

DataFusion rewrites it internally using structs:

```sql
> select * from foo where {'x': x, 'y': y} IN ({'x': 3, 'y': 4}, {'x': 4, 'y': 5});
+---+---+
| x | y |
+---+---+
| 3 | 4 |
+---+---+

> explain select * from foo where (x,y) IN ((3,4), (4,5));
+---------------+--------------------------------------------------------------------------------+
| plan_type     | plan                                                                           |
+---------------+--------------------------------------------------------------------------------+
| logical_plan  | Filter: CAST(struct(foo.x, foo.y) AS Struct([Field { name: "c0", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])) IN ([Struct({c0:3,c1:4}), Struct({c0:4,c1:5})]) |
|               |   TableScan: foo projection=[x, y]                                             |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                    |
|               |   FilterExec: CAST(struct(x@0, y@1) AS Struct([Field { name: "c0", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])) IN ([Literal { value: Struct({c0:3,c1:4}) }, Literal { value: Struct({c0:4,c1:5}) }]) |
|               |     MemoryExec: partitions=1, partition_sizes=[1]                              |
+---------------+--------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.003 seconds.
```

Quoted diff context:

```
will do here is write a Rust function to perform our computation and then expose that function to
Python. I know of two use cases where I would recommend this approach. The first is the case when
the PyArrow compute functions are insufficient for your needs. Perhaps your code is too complex or
could be greatly simplified if you pulled in some outside dependency. The second use case is when
```

These are really good points.
Quoted diff context:

```
we will process a single array and update the internal state, which we share with the `state()`
function. For larger batches we may `merge()` these states. It is important to note that the
`states` in the `merge()` function are an array of the values returned from `state()`. It is
entirely possible that the `merge` function is significantly different than the `update`, though in
```

A classic example is "avg" where the state is a sum + count and the output is a value.

avg was my first example before I found a bug (apache/datafusion-python#797) and switched it to sum to unblock myself!
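The avg case mentioned above can be sketched in plain Python. The method names mirror the update/merge/state/evaluate shape the thread describes, but this is an illustrative sketch, not the exact datafusion-python `Accumulator` API.

```python
# Illustrative accumulator for "avg": the *state* is (sum, count),
# while the final output is a single value, so merge() is genuinely
# a different computation from update().
class AvgAccumulator:
    def __init__(self):
        self._sum = 0.0
        self._count = 0

    def update(self, values):
        # Fold one batch of raw input values into the running state
        self._sum += sum(values)
        self._count += len(values)

    def state(self):
        # What gets shipped between partial aggregations
        return (self._sum, self._count)

    def merge(self, states):
        # Combine states produced by other accumulators; note these
        # are (sum, count) pairs, not raw input values
        for s, c in states:
            self._sum += s
            self._count += c

    def evaluate(self):
        return self._sum / self._count if self._count else None

# Two partial accumulators merged into a final result
a, b = AvgAccumulator(), AvgAccumulator()
a.update([1, 2, 3])
b.update([4, 5])
final = AvgAccumulator()
final.merge([a.state(), b.state()])
# final.evaluate() → 3.0
```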
Quoted diff context:

```
As expected, the conversion to Python objects is by far the worst performance. As soon as we drop
into using any functions that keep the data entirely on the Rust side we see a near 10x speed
```

I think technically using pyarrow doesn't use rust -- it is implemented in C/C++. Possibly this would be slightly more accurate:

Suggested change:

```
into using any functions that keep the data entirely on the Native (Rust or C/C++) side we see a near 10x speed
```
Thank you all for the feedback. I've incorporated the suggestions and will wait until the end of my work day to see if there are any more comments before publishing.

I wasn't able to get the site to build on my machine. I'm going to try setting up a docker container just for building it tomorrow.

I had a docker container setup so I made a PR to publish it:

This PR adds a blog post describing using UDFs, and in particular on how to combine third party Rust UDFs with datafusion-python.