Co-authored-by: Yongting You <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Phillip LeBlanc <[email protected]>
Contributor
|
Starting to check this out |
Contributor
Author
|
Thanks. Odd that RustRover rendered it differently but the wording is definitely better :) |
Contributor
Yeah, the pelicanasf renderer is pretty wonky and non-standard (it also doesn't like markdown tables for some reason 🤷) |
Co-authored-by: Kevin Liu <[email protected]>
|
|
||
| Thanks to [zhuqi-lucas](https://github.com/zhuqi-lucas) for PR [#15104](https://github.com/apache/datafusion/pull/15104) | ||
|
|
||
| ### Tracing context propagation in spawned tasks |
Contributor
Thanks for mentioning this change! You can also maybe link to the related datafusion-contrib repo https://github.com/datafusion-contrib/datafusion-tracing which builds upon this, otherwise the description might be a bit too abstract :)
Contributor
It makes the post much stronger
alamb
reviewed
Jul 11, 2025
| 2 row(s) fetched. | ||
| ``` | ||
|
|
||
| Thanks to [irenjj](https://github.com/irenjj) for the initial work in PR [#14677](https://github.com/apache/datafusion/pull/14677) |
alamb
reviewed
Jul 11, 2025
| DataFusion 47.0.0 comes with numerous performance enhancements across the board. Here are some of the noteworthy | ||
| optimizations in this release: | ||
|
|
||
| - **`FIRST_VALUE` and `LAST_VALUE`:** `FIRST_VALUE` and `LAST_VALUE` functions execute much faster on high-cardinality data, such as data with many groups or partitions. DataFusion 47.0.0 executes the following in **7 seconds** compared to **36 seconds** in DataFusion 46.0.0: `select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4` (h2o.ai dataset). (PRs [#15266](https://github.com/apache/datafusion/pull/15266)
alamb
reviewed
Jul 11, 2025
| - **`FIRST_VALUE` and `LAST_VALUE`:** `FIRST_VALUE` and `LAST_VALUE` functions execute much faster on high-cardinality data, such as data with many groups or partitions. DataFusion 47.0.0 executes the following in **7 seconds** compared to **36 seconds** in DataFusion 46.0.0: `select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4` (h2o.ai dataset). (PRs [#15266](https://github.com/apache/datafusion/pull/15266) | ||
| and [#15542](https://github.com/apache/datafusion/pull/15542) by [UBarney](https://github.com/UBarney)). | ||
|
|
||
| - **`MIN`, `MAX` and `AVG` for Durations:** DataFusion executes aggregate queries up to 2.5x faster when they include `MIN`, `MAX` and `AVG` on `Duration` columns. |
alamb
approved these changes
Jul 11, 2025
| - **Short circuit evaluation for `AND` and `OR`:** DataFusion now eagerly skips the evaluation of | ||
| the right operand if the left is known to be false (`AND`) or true (`OR`) in certain cases. For complex predicates, such as those with many `LIKE` or `CASE` expressions, this optimization results in | ||
| [significant performance improvements](https://github.com/apache/datafusion/issues/11212#issuecomment-2753584617) (up to 100x in extreme cases). | ||
| (PRs [#15462](https://github.com/apache/datafusion/pull/15462) and [#15694](https://github.com/apache/datafusion/pull/15694) |
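A minimal sketch of the short-circuit idea described in this bullet, assuming plain boolean slices with no NULL handling. The function names `and_short_circuit` and `or_short_circuit` are hypothetical and are not DataFusion APIs; the closure stands in for an expensive right operand such as a `LIKE` or `CASE` expression.

```rust
/// Hypothetical helper: evaluates `left AND right` over a batch of rows.
/// The right operand is passed as a closure so it can be skipped entirely.
pub fn and_short_circuit<F>(left: &[bool], eval_right: F) -> Vec<bool>
where
    F: FnOnce() -> Vec<bool>,
{
    // If every left value is false, no row can pass: skip the right operand.
    if left.iter().all(|&l| !l) {
        return vec![false; left.len()];
    }
    let right = eval_right();
    // Assumption of this sketch: both operands cover the same rows.
    assert_eq!(left.len(), right.len());
    left.iter().zip(right.iter()).map(|(&l, &r)| l && r).collect()
}

/// Hypothetical helper: evaluates `left OR right` over a batch of rows.
pub fn or_short_circuit<F>(left: &[bool], eval_right: F) -> Vec<bool>
where
    F: FnOnce() -> Vec<bool>,
{
    // If every left value is true, every row passes: skip the right operand.
    if left.iter().all(|&l| l) {
        return vec![true; left.len()];
    }
    let right = eval_right();
    assert_eq!(left.len(), right.len());
    left.iter().zip(right.iter()).map(|(&l, &r)| l || r).collect()
}
```

The check runs once per batch rather than per row, so the saving only applies when the whole left operand is decided, which is one plausible reading of "in certain cases" above.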
| optimization (TopK) for fully sorted data. DataFusion 47.0.0 extends the optimization for partially sorted data, which is common in many real-world datasets, such as time-series data sorted by day but not within each day. | ||
| (PR [#15563](https://github.com/apache/datafusion/pull/15563) by [geoffreyclaude](https://github.com/geoffreyclaude)). | ||
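To make the partially sorted case concrete, here is a simplified sketch, not DataFusion's actual implementation. It assumes rows are `(day, time)` pairs that arrive sorted by `day` only, and that the query asks for the first `k` rows ordered by `(day, time)`. The function name `top_k_prefix_sorted` is hypothetical.

```rust
/// Hypothetical helper: returns the first `k` rows ordered by `(day, time)`,
/// given input that is already sorted by `day` but not by `time` within a day.
pub fn top_k_prefix_sorted(rows: &[(u32, u32)], k: usize) -> Vec<(u32, u32)> {
    let mut candidates: Vec<(u32, u32)> = Vec::new();
    for &row in rows {
        if candidates.len() >= k {
            if let Some(&(last_day, _)) = candidates.last() {
                // Input is sorted by day, so once we hold at least k rows and
                // a strictly later day begins, no remaining row can be in the
                // top k. Stop reading early.
                if row.0 > last_day {
                    break;
                }
            }
        }
        candidates.push(row);
    }
    // Only the rows read so far need a full sort.
    candidates.sort();
    candidates.truncate(k);
    candidates
}
```

For example, with rows `[(1, 5), (1, 2), (2, 9), (2, 1), (3, 0)]` and `k = 3`, the loop stops when day 3 begins and the row `(3, 0)` is never sorted.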
|
|
||
| - **Disable re-validation of spilled files:** DataFusion no longer does unnecessary re-validation of temporary spill files. The validation is unnecessary and expensive, as the data was known to be valid when it was written out.
|
|
||
| Previous versions of DataFusion used `Utf8View` when reading parquet files and it is faster in most cases. | ||
|
|
||
| Thanks to [zhuqi-lucas](https://github.com/zhuqi-lucas) for PR [#15104](https://github.com/apache/datafusion/pull/15104) |
Contributor
This was referenced Jul 12, 2025
First cut at a DF 47 blog post as 47.0.0 (April 2025) datafusion#15072

Please let me know of anything you wish to add/modify