-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Implement partition_statistics API for RepartitionExec
#17061
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @xudong963, this PR should be ready for review, PTAL! Thanks |
|
thank you, i'll continue reviewing tomorrow |
| stats.total_byte_size = stats | ||
| .total_byte_size | ||
| .get_value() | ||
| .map(|bytes| Precision::Inexact(bytes / partition_count)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imagine if the child of the RepartitonExec is a EmptyExec, will the partition_count be zero, then panic comes out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good callout, actually if partition_count is 0, the function will return an internal error here and will not reach this line of code, added a test case for it. But it's worth thinking that whether we should return an error or unknown statistics. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for the case partition_count is 0, it's better to return unknown statistics, which means the child node is empty and doesn't have data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the expected behavior if partition_count is 0 and the partition arg of this function is larger than 0?
partition_statistics(Some(0))->Statistics::new_unknown()partition_statistics(Some(1))->Internal ErrororStatistics::new_unknown()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the case of EmptyExec, I incline to return Statistics::new_unknown() .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, also added a case for invalid partitions where the partition_count is larger than 0
xudong963
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your patience!
|
Thank you again! |
Which issue does this PR close?
partition_statisticsAPI for more operators #15873.Rationale for this change
What changes are included in this PR?
Because the final partitioning scheme and data distribution are unknown, I’ve taken
num_rowsandtotal_bytesdivided them by the partition count and flagged them as Inexact (on the assumption that hash partitioning spreads rows evenly). For the column statistics, I’ve leftnull_count,distinct_count, andsum_valueblank, since those values can’t be preserved after repartitioning.Are these changes tested?
Yes
Are there any user-facing changes?
No