Conversation

@andygrove (Member) commented Dec 30, 2025

This PR adds microbenchmarks for scanning a Parquet file and evaluating a single string expression per row.

The benchmark runs against DuckDB and DataFusion and compares the results.

Assuming this PR can be merged, I will follow up with benchmarks for other expressions.

| Function | DataFusion 50.0.0 (ms) | DuckDB 1.4.3 (ms) | Speedup | Faster |
|---|---|---|---|---|
| trim | 50.10 | 123.48 | 2.46x | DataFusion |
| ltrim | 36.93 | 59.65 | 1.62x | DataFusion |
| rtrim | 35.00 | 116.27 | 3.32x | DataFusion |
| lower | 41.78 | 61.56 | 1.47x | DataFusion |
| upper | 39.40 | 62.52 | 1.59x | DataFusion |
| length | 23.63 | 28.35 | 1.20x | DataFusion |
| char_length | 23.19 | 26.43 | 1.14x | DataFusion |
| reverse | 37.16 | 59.29 | 1.60x | DataFusion |
| repeat_3 | 54.88 | 75.05 | 1.37x | DataFusion |
| concat | 74.38 | 67.66 | 1.10x | DuckDB |
| concat_ws | 35.34 | 72.74 | 2.06x | DataFusion |
| substring_1_5 | 32.33 | 42.34 | 1.31x | DataFusion |
| left_5 | 39.47 | 48.38 | 1.23x | DataFusion |
| right_5 | 59.40 | 62.14 | 1.05x | DataFusion |
| lpad_20 | 331.96 | 95.70 | 3.47x | DuckDB |
| rpad_20 | 334.77 | 94.14 | 3.56x | DuckDB |
| replace | 51.14 | 103.57 | 2.03x | DataFusion |
| translate | 795.03 | 285.29 | 2.79x | DuckDB |
| ascii | 15.87 | 21.55 | 1.36x | DataFusion |
| md5 | 275.02 | 137.94 | 1.99x | DuckDB |
| sha256 | 59.85 | 269.76 | 4.51x | DataFusion |
| btrim | 37.81 | 123.29 | 3.26x | DataFusion |
| split_part | 79.91 | 58.93 | 1.36x | DuckDB |
| starts_with | 15.15 | 26.10 | 1.72x | DataFusion |
| ends_with | 25.36 | 19.81 | 1.28x | DuckDB |
| strpos | 44.48 | 28.06 | 1.59x | DuckDB |
| regexp_replace | 93.54 | 407.16 | 4.35x | DataFusion |
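The per-function timings above come from repeated runs against each engine. As a generic illustration of this kind of microbenchmark methodology (a sketch only, not the harness used in this PR), a timing helper might warm up, run several iterations, and report the best wall-clock time in milliseconds:

```python
import time


def bench(run, warmup=3, iterations=10):
    """Time a callable and return the best duration in milliseconds.

    Taking the minimum over several iterations reduces noise from caches
    and background load. This is a generic sketch, not the PR's harness.
    """
    for _ in range(warmup):
        run()  # warm up caches and any lazy initialization
    best = float("inf")
    for _ in range(iterations):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best * 1000.0
```

Reporting the minimum rather than the mean is a common choice for microbenchmarks, since the fastest run is closest to the true cost of the work itself.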

@andygrove marked this pull request as ready for review December 30, 2025 18:20
@andygrove requested review from comphead and viirya December 30, 2025 18:20
```python
def format_results_markdown(results: list[BenchmarkResult]) -> str:
    """Format benchmark results as a markdown table."""
    lines = [
        "# String Function Microbenchmarks: DataFusion vs DuckDB",
```

@paleolimbot (Member): Can we get the versions from DataFusion and DuckDB automatically and put them into this title?

@andygrove (Member Author): Thanks, I updated this. The header now shows:

| Function | DataFusion 50.0.0 (ms) | DuckDB 1.4.3 (ms) | Speedup | Faster |

@paleolimbot (Member) left a comment:

Cool!

Comment on lines 113 to 124:

```python
def setup_datafusion(parquet_path: str) -> datafusion.SessionContext:
    """Create and configure DataFusion context."""
    ctx = datafusion.SessionContext()
    ctx.register_parquet('test_data', parquet_path)
    return ctx


def setup_duckdb(parquet_path: str) -> duckdb.DuckDBPyConnection:
    """Create and configure DuckDB connection."""
    conn = duckdb.connect(':memory:')
    conn.execute(f"CREATE VIEW test_data AS SELECT * FROM read_parquet('{parquet_path}')")
    return conn
```
@paleolimbot (Member):
In SedonaDB we found that wildly differing concurrency resulted from the default settings of DataFusion and DuckDB in our micro-ish benchmarks. For these types of benchmarks we set DataFusion to use one partition and DuckDB to use a single thread (we don't do this for more macro-scale benchmarks, where we really do want to know what happens when a user sits down and types something against all the defaults):

https://github.com/apache/sedona-db/blob/e0e1d109480727faaf7be25923b57b4686144438/python/sedonadb/python/sedonadb/testing.py#L407-L412

https://github.com/apache/sedona-db/blob/e0e1d109480727faaf7be25923b57b4686144438/python/sedonadb/python/sedonadb/testing.py#L347-L353

I might also suggest trying the DuckDB config that returns StringViews, to see if there's any Arrow conversion overhead getting in the way (I think the setting is `SET produce_arrow_string_view = true;`).

Another thing to try is having both DuckDB and DataFusion operate on Arrow data from memory instead of Parquet (to make sure we're not just measuring the speed of the Parquet read). For DuckDB that's `SELECT ... FROM the_name_of_a_python_variable_that_is_a_pyarrow_table`.
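The suggestions above could be sketched as the following configuration fragment. This is a hypothetical sketch, not code from this PR: it pins both engines to a single thread so the comparison measures per-core work rather than scheduling, and applies the (reviewer-suggested) string-view setting:

```python
import datafusion
import duckdb

# DataFusion: one target partition => effectively single-threaded plan execution.
config = datafusion.SessionConfig().with_target_partitions(1)
ctx = datafusion.SessionContext(config)

# DuckDB: a single worker thread, plus Arrow string-view output to reduce
# conversion overhead when results cross the Arrow boundary (as suggested
# above; verify the setting name against your DuckDB version).
conn = duckdb.connect(":memory:")
conn.execute("SET threads TO 1")
conn.execute("SET produce_arrow_string_view = true")
```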

```python
    strings.append(s)

table = pa.table({
    'str_col': pa.array(strings, type=pa.string())
})
```

@paleolimbot (Member): It may be worth trying both string and string_view. In theory, passing string_view to DuckDB has less overhead because string view is closer to its internal representation. It also might be that DataFusion performs differently depending on which of the two is used as input.
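To benchmark both layouts, the same data could be materialized twice, once per Arrow string type. A hypothetical sketch (the `strings` list and column name are illustrative, and `pa.string_view()` requires pyarrow >= 16):

```python
import pyarrow as pa

strings = ["hello world"] * 1000

# Classic variable-length string column (offsets + data buffer).
string_table = pa.table({"str_col": pa.array(strings, type=pa.string())})

# String-view column: inline prefixes plus out-of-line buffers, closer to
# DuckDB's internal string representation.
view_table = pa.table({"str_col": pa.array(strings, type=pa.string_view())})
```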

github-merge-queue bot pushed a commit to apache/datafusion that referenced this pull request Dec 31, 2025
## Which issue does this PR close?


- Part of #19569

## Rationale for this change


I ran microbenchmarks comparing DataFusion with DuckDB for string
functions (see apache/datafusion-benchmarks#26)
and noticed that DF was very slow for `md5`.

This PR improves performance:

| Benchmark                  | Before | After  | Speedup     |
|----------------------------|--------|--------|-------------|
| md5_array (1024 strings)   | 206 µs | 100 µs | 2.1x faster |
| md5_scalar (single string) | 337 ns | 221 ns | 1.5x faster |

## What changes are included in this PR?


Avoid using `write!` with a format string in favor of a more efficient approach.

## Are these changes tested?


## Are there any user-facing changes?

@andygrove (Member Author): Thanks for the great feedback @paleolimbot! I have pushed commits to address those points.

@andygrove merged commit 32f6747 into apache:main Dec 31, 2025
@andygrove deleted the microbenchmarks branch December 31, 2025 22:59
github-merge-queue bot pushed a commit to apache/datafusion that referenced this pull request Jan 6, 2026
## Which issue does this PR close?


- Closes #.

## Rationale for this change


I ran microbenchmarks comparing DataFusion with DuckDB for string
functions (see apache/datafusion-benchmarks#26)
and noticed that DF was very slow for `split_part`.

This PR fixes some obvious performance issues. Speedups are:

| Benchmark                         | Before | After | Speedup      |
|-----------------------------------|--------|-------|--------------|
| single_char_delim/pos_first       | 1.27ms | 140µs | 9.1x faster  |
| single_char_delim/pos_middle      | 1.39ms | 396µs | 3.5x faster  |
| single_char_delim/pos_last        | 1.47ms | 738µs | 2.0x faster  |
| single_char_delim/pos_negative    | 1.35ms | 148µs | 9.1x faster  |
| multi_char_delim/pos_first        | 1.22ms | 174µs | 7.0x faster  |
| multi_char_delim/pos_middle       | 1.22ms | 407µs | 3.0x faster  |
| string_view_single_char/pos_first | 1.42ms | 139µs | 10.2x faster |
| many_parts_20/pos_second          | 2.48ms | 201µs | 12.3x faster |
| long_strings_50_parts/pos_first   | 8.18ms | 178µs | 46x faster   |

## What changes are included in this PR?


## Are these changes tested?


## Are there any user-facing changes?


---------

Co-authored-by: Martin Grigorov <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>