Conversation

@andygrove (Member) commented Dec 30, 2025

This PR adds microbenchmarks for scanning a Parquet file and evaluating a single string expression per row.

The benchmark runs against DuckDB and DataFusion and compares the results.

Assuming this PR can be merged, I will follow up with benchmarks for other expressions.

| Function | DataFusion 50.0.0 (ms) | DuckDB 1.4.3 (ms) | Speedup | Faster |
|---|---|---|---|---|
| trim | 50.10 | 123.48 | 2.46x | DataFusion |
| ltrim | 36.93 | 59.65 | 1.62x | DataFusion |
| rtrim | 35.00 | 116.27 | 3.32x | DataFusion |
| lower | 41.78 | 61.56 | 1.47x | DataFusion |
| upper | 39.40 | 62.52 | 1.59x | DataFusion |
| length | 23.63 | 28.35 | 1.20x | DataFusion |
| char_length | 23.19 | 26.43 | 1.14x | DataFusion |
| reverse | 37.16 | 59.29 | 1.60x | DataFusion |
| repeat_3 | 54.88 | 75.05 | 1.37x | DataFusion |
| concat | 74.38 | 67.66 | 1.10x | DuckDB |
| concat_ws | 35.34 | 72.74 | 2.06x | DataFusion |
| substring_1_5 | 32.33 | 42.34 | 1.31x | DataFusion |
| left_5 | 39.47 | 48.38 | 1.23x | DataFusion |
| right_5 | 59.40 | 62.14 | 1.05x | DataFusion |
| lpad_20 | 331.96 | 95.70 | 3.47x | DuckDB |
| rpad_20 | 334.77 | 94.14 | 3.56x | DuckDB |
| replace | 51.14 | 103.57 | 2.03x | DataFusion |
| translate | 795.03 | 285.29 | 2.79x | DuckDB |
| ascii | 15.87 | 21.55 | 1.36x | DataFusion |
| md5 | 275.02 | 137.94 | 1.99x | DuckDB |
| sha256 | 59.85 | 269.76 | 4.51x | DataFusion |
| btrim | 37.81 | 123.29 | 3.26x | DataFusion |
| split_part | 79.91 | 58.93 | 1.36x | DuckDB |
| starts_with | 15.15 | 26.10 | 1.72x | DataFusion |
| ends_with | 25.36 | 19.81 | 1.28x | DuckDB |
| strpos | 44.48 | 28.06 | 1.59x | DuckDB |
| regexp_replace | 93.54 | 407.16 | 4.35x | DataFusion |
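The per-function timings above come from repeated runs against each engine. As a generic illustration of this kind of microbenchmark methodology (a sketch only, not the harness used in this PR), a timing helper might warm up, run several iterations, and report the best wall-clock time in milliseconds:

```python
import time


def bench(run, warmup=3, iterations=10):
    """Time a callable and return the best duration in milliseconds.

    Taking the minimum over several iterations reduces noise from caches
    and background load. This is a generic sketch, not the PR's harness.
    """
    for _ in range(warmup):
        run()  # warm up caches and any lazy initialization
    best = float("inf")
    for _ in range(iterations):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best * 1000.0
```

Reporting the minimum rather than the mean is a common choice for microbenchmarks, since the fastest run is closest to the true cost of the work itself.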

@andygrove marked this pull request as ready for review December 30, 2025 18:20
@andygrove requested review from comphead and viirya December 30, 2025 18:20
```python
def format_results_markdown(results: list[BenchmarkResult]) -> str:
    """Format benchmark results as a markdown table."""
    lines = [
        "# String Function Microbenchmarks: DataFusion vs DuckDB",
```

@paleolimbot (Member): Can we get the versions from DataFusion and DuckDB automatically and put them into this title?

@andygrove (Member Author): Thanks, I updated this. The header now shows:

| Function | DataFusion 50.0.0 (ms) | DuckDB 1.4.3 (ms) | Speedup | Faster |

@paleolimbot (Member) left a comment:

Cool!

Comment on lines 113 to 124:

```python
def setup_datafusion(parquet_path: str) -> datafusion.SessionContext:
    """Create and configure DataFusion context."""
    ctx = datafusion.SessionContext()
    ctx.register_parquet('test_data', parquet_path)
    return ctx


def setup_duckdb(parquet_path: str) -> duckdb.DuckDBPyConnection:
    """Create and configure DuckDB connection."""
    conn = duckdb.connect(':memory:')
    conn.execute(f"CREATE VIEW test_data AS SELECT * FROM read_parquet('{parquet_path}')")
    return conn
```
@paleolimbot (Member):
In SedonaDB we found that wildly differing concurrency resulted from the default settings of DataFusion and DuckDB in our micro-ish benchmarks. For these types of benchmarks we set DataFusion to use one partition and DuckDB to use a single thread (we don't do this for more macro-scale benchmarks, where we really do want to know what happens when a user sits down and types something against all the defaults):

https://github.com/apache/sedona-db/blob/e0e1d109480727faaf7be25923b57b4686144438/python/sedonadb/python/sedonadb/testing.py#L407-L412

https://github.com/apache/sedona-db/blob/e0e1d109480727faaf7be25923b57b4686144438/python/sedonadb/python/sedonadb/testing.py#L347-L353

I might also suggest trying the DuckDB config that returns StringViews, to see if there's any Arrow conversion overhead getting in the way (I think the setting is `SET produce_arrow_string_view = true;`).

Another thing to try is having both DuckDB and DataFusion operate on Arrow data from memory instead of Parquet (to make sure we're not just measuring the speed of the Parquet read). For DuckDB that's `SELECT ... FROM the_name_of_a_python_variable_that_is_a_pyarrow_table`.
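The suggestions above could be sketched as the following configuration fragment. This is a hypothetical sketch, not code from this PR: it pins both engines to a single thread so the comparison measures per-core work rather than scheduling, and applies the (reviewer-suggested) string-view setting:

```python
import datafusion
import duckdb

# DataFusion: one target partition => effectively single-threaded plan execution.
config = datafusion.SessionConfig().with_target_partitions(1)
ctx = datafusion.SessionContext(config)

# DuckDB: a single worker thread, plus Arrow string-view output to reduce
# conversion overhead when results cross the Arrow boundary (as suggested
# above; verify the setting name against your DuckDB version).
conn = duckdb.connect(":memory:")
conn.execute("SET threads TO 1")
conn.execute("SET produce_arrow_string_view = true")
```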

```python
    strings.append(s)

table = pa.table({
    'str_col': pa.array(strings, type=pa.string())
})
```

@paleolimbot (Member): It may be worth trying both string and string_view. In theory, passing string_view to DuckDB has less overhead because string view is closer to its internal representation. It also might be that DataFusion performs differently depending on which of the two is used as input.
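To benchmark both layouts, the same data could be materialized twice, once per Arrow string type. A hypothetical sketch (the `strings` list and column name are illustrative, and `pa.string_view()` requires pyarrow >= 16):

```python
import pyarrow as pa

strings = ["hello world"] * 1000

# Classic variable-length string column (offsets + data buffer).
string_table = pa.table({"str_col": pa.array(strings, type=pa.string())})

# String-view column: inline prefixes plus out-of-line buffers, closer to
# DuckDB's internal string representation.
view_table = pa.table({"str_col": pa.array(strings, type=pa.string_view())})
```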

github-merge-queue bot pushed a commit to apache/datafusion that referenced this pull request Dec 31, 2025
## Which issue does this PR close?


- Part of #19569

## Rationale for this change


I ran microbenchmarks comparing DataFusion with DuckDB for string
functions (see apache/datafusion-benchmarks#26)
and noticed that DF was very slow for `md5`.

This PR improves performance:

| Benchmark                  | Before | After  | Speedup     |
|----------------------------|--------|--------|-------------|
| md5_array (1024 strings)   | 206 µs | 100 µs | 2.1x faster |
| md5_scalar (single string) | 337 ns | 221 ns | 1.5x faster |

## What changes are included in this PR?


Avoid using `write!` with a format string in favor of a more efficient approach.

## Are these changes tested?


## Are there any user-facing changes?

@andygrove (Member Author): Thanks for the great feedback @paleolimbot! I have pushed commits to address those points.

@andygrove merged commit 32f6747 into apache:main Dec 31, 2025
@andygrove deleted the microbenchmarks branch December 31, 2025 22:59
github-merge-queue bot pushed a commit to apache/datafusion that referenced this pull request Jan 6, 2026
## Which issue does this PR close?


- Closes #.

## Rationale for this change


I ran microbenchmarks comparing DataFusion with DuckDB for string
functions (see apache/datafusion-benchmarks#26)
and noticed that DF was very slow for `split_part`.

This PR fixes some obvious performance issues. Speedups are:

| Benchmark                         | Before | After | Speedup      |
|-----------------------------------|--------|-------|--------------|
| single_char_delim/pos_first       | 1.27ms | 140µs | 9.1x faster  |
| single_char_delim/pos_middle      | 1.39ms | 396µs | 3.5x faster  |
| single_char_delim/pos_last        | 1.47ms | 738µs | 2.0x faster  |
| single_char_delim/pos_negative    | 1.35ms | 148µs | 9.1x faster  |
| multi_char_delim/pos_first        | 1.22ms | 174µs | 7.0x faster  |
| multi_char_delim/pos_middle       | 1.22ms | 407µs | 3.0x faster  |
| string_view_single_char/pos_first | 1.42ms | 139µs | 10.2x faster |
| many_parts_20/pos_second          | 2.48ms | 201µs | 12.3x faster |
| long_strings_50_parts/pos_first   | 8.18ms | 178µs | 46x faster   |

## What changes are included in this PR?


## Are these changes tested?


## Are there any user-facing changes?


---------

Co-authored-by: Martin Grigorov <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>