Skip to content

Conversation

@liamzwbao
Copy link
Contributor

@liamzwbao liamzwbao commented Aug 7, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Because the final partitioning scheme and data distribution are unknown, I’ve taken num_rows and total_bytes divided them by the partition count and flagged them as Inexact (on the assumption that hash partitioning spreads rows evenly). For the column statistics, I’ve left null_count, distinct_count, and sum_value blank, since those values can’t be preserved after repartitioning.

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate physical-plan Changes to the physical-plan crate labels Aug 7, 2025
@liamzwbao
Copy link
Contributor Author

Hi @xudong963, this PR should be ready for review, PTAL! Thanks

@liamzwbao liamzwbao marked this pull request as ready for review August 7, 2025 02:05
@liamzwbao liamzwbao requested a review from xudong963 August 9, 2025 14:37
@xudong963
Copy link
Member

thank you, i'll continue reviewing tomorrow

@liamzwbao liamzwbao requested a review from xudong963 August 18, 2025 14:53
stats.total_byte_size = stats
.total_byte_size
.get_value()
.map(|bytes| Precision::Inexact(bytes / partition_count))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imagine if the child of the RepartitonExec is a EmptyExec, will the partition_count be zero, then panic comes out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good callout, actually if partition_count is 0, the function will return an internal error here and will not reach this line of code, added a test case for it. But it's worth thinking that whether we should return an error or unknown statistics. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for the case partition_count is 0, it's better to return unknown statistics, which means the child node is empty and doesn't have data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the expected behavior if partition_count is 0 and the partition arg of this function is larger than 0?

  • partition_statistics(Some(0)) -> Statistics::new_unknown()
  • partition_statistics(Some(1)) -> Internal Error or Statistics::new_unknown() ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the case of EmptyExec, I incline to return Statistics::new_unknown() .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, also added a case for invalid partitions where the partition_count is larger than 0

@liamzwbao liamzwbao requested a review from xudong963 August 21, 2025 00:14
Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your patience!

@xudong963
Copy link
Member

Thank you again!

@xudong963 xudong963 merged commit 6368daf into apache:main Aug 25, 2025
27 checks passed
@liamzwbao liamzwbao deleted the issue-15873-repartition branch August 25, 2025 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants