feat: Add microbenchmark for string functions #26
Conversation
```python
def format_results_markdown(results: list[BenchmarkResult]) -> str:
    """Format benchmark results as a markdown table."""
    lines = [
        "# String Function Microbenchmarks: DataFusion vs DuckDB",
```
Can we get the DataFusion and DuckDB versions automatically and put them into this title?
Thanks, I updated this. Header now shows:
| Function | DataFusion 50.0.0 (ms) | DuckDB 1.4.3 (ms) | Speedup | Faster |
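A minimal sketch of pulling those versions at runtime, assuming both Python packages expose `__version__`:

```python
import datafusion
import duckdb

# Build the table header from the installed engine versions.
header = (
    f"| Function | DataFusion {datafusion.__version__} (ms) "
    f"| DuckDB {duckdb.__version__} (ms) | Speedup | Faster |"
)
```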
paleolimbot left a comment:
Cool!
```python
def setup_datafusion(parquet_path: str) -> datafusion.SessionContext:
    """Create and configure DataFusion context."""
    ctx = datafusion.SessionContext()
    ctx.register_parquet('test_data', parquet_path)
    return ctx


def setup_duckdb(parquet_path: str) -> duckdb.DuckDBPyConnection:
    """Create and configure DuckDB connection."""
    conn = duckdb.connect(':memory:')
    conn.execute(f"CREATE VIEW test_data AS SELECT * FROM read_parquet('{parquet_path}')")
    return conn
```
In SedonaDB we found that the default settings for DataFusion and DuckDB resulted in wildly different concurrency for our micro-ish benchmarks. For these types of benchmarks we set DataFusion to use one partition and DuckDB to use a single thread (we don't do this for more macro-scale benchmarks, where we really do want to know what happens when a user sits down and types something against all the defaults):
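A minimal sketch of that setup, assuming the datafusion-python `SessionConfig` API and DuckDB's `threads` setting:

```python
import datafusion
import duckdb

# Pin DataFusion to a single partition so plans don't fan out across cores.
config = datafusion.SessionConfig().with_target_partitions(1)
ctx = datafusion.SessionContext(config)

# Pin DuckDB to a single thread for the same reason.
conn = duckdb.connect(':memory:')
conn.execute("SET threads TO 1")
```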
I might also suggest trying the config for DuckDB to return StringViews to see if there's any Arrow conversion overhead getting in the way (I think the config is `SET produce_arrow_string_view = true;`).
Another thing to try is having both DuckDB and DataFusion operate on Arrow data from memory instead of Parquet (to make sure we're not just measuring the speed of the Parquet read). For DuckDB that's `SELECT ... FROM the_name_of_a_python_variable_that_is_a_pyarrow_table`; see the sketch below.
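A rough sketch of both suggestions, assuming DuckDB's replacement scan for local pyarrow variables and the datafusion-python `register_record_batches` API (table and column names here are illustrative):

```python
import pyarrow as pa
import duckdb
import datafusion

# In-memory Arrow table instead of Parquet, so the benchmark isn't dominated by the file scan.
tbl = pa.table({'str_col': pa.array(['a', 'bb', 'ccc'])})

# DuckDB can scan a pyarrow.Table referenced by its Python variable name.
conn = duckdb.connect(':memory:')
conn.execute("SET produce_arrow_string_view = true")  # the StringView setting suggested above
duck_result = conn.execute("SELECT upper(str_col) FROM tbl").arrow()

# DataFusion can register the same batches directly.
ctx = datafusion.SessionContext()
ctx.register_record_batches('test_data', [tbl.to_batches()])
df_result = ctx.sql("SELECT upper(str_col) FROM test_data").collect()
```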
microbenchmarks/microbenchmarks.py (outdated)
```python
        strings.append(s)

    table = pa.table({
        'str_col': pa.array(strings, type=pa.string())
    })
```
It may be worth trying both string and string_view. In theory string_view -> DuckDB has less overhead because the string view is closer to its internal representation. It also might be that DataFusion performs differently when one or the other is used as input.
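A minimal sketch of generating both layouts for comparison (note that `pa.string_view()` requires a reasonably recent pyarrow):

```python
import pyarrow as pa

strings = ['hello', 'world', 'datafusion', 'duckdb']

# Same data, two Arrow layouts: classic offsets+data vs. string view.
table_string = pa.table({'str_col': pa.array(strings, type=pa.string())})
table_string_view = pa.table({'str_col': pa.array(strings, type=pa.string_view())})
```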
## Which issue does this PR close?

- Part of #19569

## Rationale for this change

I ran microbenchmarks comparing DataFusion with DuckDB for string functions (see apache/datafusion-benchmarks#26) and noticed that DF was very slow for `md5`. This PR improves performance:

| Benchmark | Before | After | Speedup |
|----------------------------|--------|--------|-------------|
| md5_array (1024 strings)   | 206 µs | 100 µs | 2.1x faster |
| md5_scalar (single string) | 337 ns | 221 ns | 1.5x faster |

## What changes are included in this PR?

Avoid using `write!` with a format string and use a more efficient approach.
Thanks for the great feedback @paleolimbot! I have pushed commits to address those points.
## Which issue does this PR close?

- Closes #.

## Rationale for this change

I ran microbenchmarks comparing DataFusion with DuckDB for string functions (see apache/datafusion-benchmarks#26) and noticed that DF was very slow for `split_part`. This PR fixes some obvious performance issues. Speedups are:

| Benchmark | Before | After | Speedup |
|-----------------------------------|--------|-------|--------------|
| single_char_delim/pos_first       | 1.27ms | 140µs | 9.1x faster  |
| single_char_delim/pos_middle      | 1.39ms | 396µs | 3.5x faster  |
| single_char_delim/pos_last        | 1.47ms | 738µs | 2.0x faster  |
| single_char_delim/pos_negative    | 1.35ms | 148µs | 9.1x faster  |
| multi_char_delim/pos_first        | 1.22ms | 174µs | 7.0x faster  |
| multi_char_delim/pos_middle       | 1.22ms | 407µs | 3.0x faster  |
| string_view_single_char/pos_first | 1.42ms | 139µs | 10.2x faster |
| many_parts_20/pos_second          | 2.48ms | 201µs | 12.3x faster |
| long_strings_50_parts/pos_first   | 8.18ms | 178µs | 46x faster   |

---------

Co-authored-by: Martin Grigorov <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
This PR adds microbenchmarks for scanning a Parquet file and evaluating a single string expression per row.
The benchmark runs against DuckDB and DataFusion and compares the results.
Assuming that this PR can be merged, I will follow up by covering other expressions.
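For context, each benchmarked case boils down to a single string expression evaluated over the whole column; a hypothetical example, reusing the `ctx` and `conn` from the setup functions above:

```python
# Hypothetical benchmarked case: run the same expression on both engines and compare timings.
query = "SELECT upper(str_col) FROM test_data"

df_result = ctx.sql(query).collect()       # DataFusion
duck_result = conn.execute(query).arrow()  # DuckDB
```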