Projection Order Propagation by berkaysynnada · Pull Request #7364 · apache/datafusion

berkaysynnada · 2023-08-22T07:40:57Z

Which issue does this PR close?

Closes #7363.

Rationale for this change

ProjectionExec currently cannot propagate the orderings of non-column expressions. However, we can find the PhysicalSortExpr of each projected expression and use it to preserve ordering information.

This is the second step of a three-step improvement process. The first step fixed some bugs, and the third will focus on ScalarFunctionExpr. With these PRs, we can propagate order information through projections for various types of PhysicalExpr.

What changes are included in this PR?

ProjectionExec now stores the ordering information of its projected expressions.
If the output ordering is lost after the projection (because the column that the input's output ordering refers to does not appear in the projection), a new output ordering can be derived from the newly projected non-column expressions.
Non-column projections are added to ordering equivalence classes.
PhysicalExpr has a new method, get_ordering(), which returns the SortProperties of the expression.
We need to be able to distinguish a non-ordered column from a literal value, so a struct that captures this distinction is added.
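
To illustrate why a literal has to be distinguished from an unordered column, here is a minimal, self-contained sketch of bottom-up ordering propagation. It is not the code from this PR: `OrderInfo`, `Expr`, `add` and this free-standing `get_ordering` are hypothetical stand-ins, and the model ignores nulls and integer overflow.

```rust
// Hypothetical, simplified model; these are NOT DataFusion's actual types.

/// What is known about the order of an expression's values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OrderInfo {
    /// Values are monotonic; `descending` gives the direction.
    Ordered { descending: bool },
    /// Nothing is known about the order of the values.
    Unordered,
    /// A constant such as a literal: it never disturbs another ordering.
    Singleton,
}

/// A tiny expression tree: columns (by index), literals, and addition.
enum Expr {
    Column(usize),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// Ordering of `lhs + rhs`, given the orderings of both operands.
fn add(lhs: OrderInfo, rhs: OrderInfo) -> OrderInfo {
    use OrderInfo::*;
    match (lhs, rhs) {
        // Adding a constant keeps whatever the other side guarantees.
        (Singleton, other) | (other, Singleton) => other,
        // Two operands ordered in the same direction stay ordered.
        (Ordered { descending: l }, Ordered { descending: r }) if l == r => lhs,
        // Opposite directions, or any unordered operand: nothing is known.
        _ => Unordered,
    }
}

/// Computes the ordering of `expr` bottom-up; `columns[i]` describes column `i`.
fn get_ordering(expr: &Expr, columns: &[OrderInfo]) -> OrderInfo {
    match expr {
        Expr::Column(i) => columns.get(*i).copied().unwrap_or(OrderInfo::Unordered),
        Expr::Literal(_) => OrderInfo::Singleton,
        Expr::Add(l, r) => add(get_ordering(l, columns), get_ordering(r, columns)),
    }
}
```

In this sketch, `a + 1` stays ordered when `a` is ordered because the literal is a `Singleton`, whereas `a + c` with an unordered `c` becomes `Unordered`. Treating the literal as `Unordered` would throw that information away.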

Are these changes tested?

Yes.

Are there any user-facing changes?

ozankabak

I reviewed this carefully. I'd appreciate it if you could take a quick look, @alamb.

Dandandan · 2023-08-22T19:48:26Z

datafusion/sqllogictest/test_files/order.slt

+--Projection: multiple_ordered_table.b + multiple_ordered_table.a + multiple_ordered_table.c AS result
+----TableScan: multiple_ordered_table projection=[a, b, c]
+physical_plan
+SortPreservingMergeExec: [result@0 ASC NULLS LAST]


The SortPreservingMergeExec still seems rather non-optimal, as the batches themselves are already sorted (but that information is thrown away by RepartitionExec). Are there any plans to address this as well in the future?

Yes, if I'm not mistaken one of the WIP PRs in our pipeline will address this 🙂

In this case, RepartitionExec does not break the ordering within partitions, since it repartitions from 1 partition to 4. Because these partitions are already ordered, the presence of SortPreservingMergeExec is correct. For the general case, a SortPreservingRepartitionExec implementation is on the way, which will be able to preserve order for all kinds of partitioning.
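
As a rough illustration of the 1-to-4 case, the sketch below distributes batches from a single sorted input across partitions round-robin and checks that each partition stays sorted. It assumes round-robin distribution at batch granularity; `round_robin` and `is_non_decreasing` are hypothetical helpers, not DataFusion functions.

```rust
/// Distributes batches over `n` partitions in round-robin fashion,
/// keeping each partition's batches in arrival order.
/// Hypothetical helper for illustration only.
fn round_robin<T>(batches: Vec<T>, n: usize) -> Vec<Vec<T>> {
    assert!(n > 0, "need at least one output partition");
    let mut parts: Vec<Vec<T>> = (0..n).map(|_| Vec::new()).collect();
    for (i, batch) in batches.into_iter().enumerate() {
        parts[i % n].push(batch);
    }
    parts
}

/// Returns true if `rows` never decreases from one element to the next.
fn is_non_decreasing(rows: &[i64]) -> bool {
    rows.windows(2).all(|w| w[0] <= w[1])
}
```

Each output partition receives a subsequence of the sorted input, so it remains sorted on its own. The global order across partitions is lost, which is why a merge step is still needed afterwards.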

What I essentially mean is this:

The batches of the table multiple_ordered_table are ordered, and that order is preserved by RepartitionExec and ProjectionExec. If SortPreservingMergeExec knew the index of each batch (batch 0, batch 1, batch 2, etc.) as returned from the table, it would only need to wait for batch 0, batch 1, batch 2, etc. to appear from the partition streams rather than merge the rows themselves, which would be much faster.

Is that something you plan as well?

Maintaining batch indices to preserve order is actually a good idea. The current sort-preserving algorithm, designed to preserve order across hash repartitioning, tends to overfit to row-level sorting. We could potentially collaborate on implementing this optimization in future work.
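
The batch-index idea discussed above could look roughly like the following. This is only a sketch under the assumption that every batch carries the index it had when it left the (already sorted) source; `Resequencer` is a hypothetical type, not something that exists in DataFusion.

```rust
use std::collections::BTreeMap;

/// Puts batches back into their original order using the index each batch
/// was tagged with at the source. Hypothetical type for illustration only.
struct Resequencer<T> {
    /// Index of the next batch that may be emitted.
    next: usize,
    /// Batches that arrived early, keyed by their source index.
    pending: BTreeMap<usize, T>,
}

impl<T> Resequencer<T> {
    fn new() -> Self {
        Self { next: 0, pending: BTreeMap::new() }
    }

    /// Accepts a batch from any partition stream and returns every batch
    /// that is now ready to be emitted, in source order.
    fn push(&mut self, index: usize, batch: T) -> Vec<T> {
        self.pending.insert(index, batch);
        let mut ready = Vec::new();
        // Drain the buffer for as long as the next expected batch is present.
        while let Some(b) = self.pending.remove(&self.next) {
            ready.push(b);
            self.next += 1;
        }
        ready
    }
}
```

No row comparisons happen here: the operator only buffers batches that arrive early and releases them once the missing predecessor shows up. This is valid only when the source is globally sorted across batches, as in the plan discussed in this thread.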

ozankabak · 2023-08-23T14:12:07Z

Any objections to this? This is gating some other improvements so I would like to go ahead and merge this if there are no objections.

alamb · 2023-08-23T15:47:17Z

I have no objections (I haven't had a chance to review it either, but I can do so after the merge)

alamb · 2023-08-23T15:53:25Z

I am checking this one out now...

ozankabak · 2023-08-23T15:57:55Z

Thanks @alamb, feel free to merge when you are done 👍

alamb

Thank you for this PR @berkaysynnada -- the tests and the documentation really help.

I don't fully understand all the parts of this PR, but what I did read made reasonable sense to me. My key piece of feedback about the sort order code in general is that it is getting quite complicated and spread out (and I therefore wonder how much duplication or inconsistency there is).

Maybe we could start to consolidate / pull more of this logic together and make it easier to work with and find (to start, just pulling it into individual modules might help)

I left some small structural / naming suggestions. Let me know if you want to address them prior to merge.

alamb · 2023-08-23T15:51:38Z

datafusion/core/src/physical_plan/projection.rs

+    /// Expressions' normalized orderings (as given by the output ordering API
+    /// and normalized with respect to equivalence classes of input plan). The
+    /// projected expressions are mapped by their indices to this vector.
+    orderings: Vec<Option<PhysicalSortExpr>>,


I don't fully understand how this is different from output_ordering (which can also be a Vec of PhysicalSortExpr). Would it make sense to always normalize the output ordering in terms of the input's equivalence classes?

I thought the notion of multiple equivalent orderings was expressed using https://docs.rs/datafusion-physical-expr/28.0.0/datafusion_physical_expr/equivalence/type.OrderingEquivalentClass.html

Or maybe the key difference is this can track the ordering of each expression independently?

You are correct in your last sentence. We calculate the order information for each expression that is projected. Since we use this information when determining ordering equivalences, I thought it would be wise to keep it in the executor's state. I would appreciate it if you have any ideas on how to handle this more effectively.
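
To make the distinction concrete, here is a minimal sketch of how per-expression orderings could be used as a fallback when the input's ordering does not survive the projection. Everything in it is an assumption for illustration: `SortKey` and `output_ordering` are hypothetical, a plain `bool` stands in for sort options, and choosing the first ordered expression is just one possible policy.

```rust
/// One sort key of a plan's output ordering, referring to a projected
/// expression by its position. Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SortKey {
    expr_index: usize,
    descending: bool,
}

/// Chooses an output ordering for a projection.
///
/// `surviving` is the input's ordering rewritten in terms of the projected
/// expressions, or `None` if its column was projected away.
/// `orderings[i]` is `Some(descending)` if projected expression `i` is
/// known to be ordered on its own, and `None` otherwise.
fn output_ordering(
    surviving: Option<Vec<SortKey>>,
    orderings: &[Option<bool>],
) -> Option<Vec<SortKey>> {
    // Prefer the input ordering whenever it survived the projection.
    if let Some(keys) = surviving {
        if !keys.is_empty() {
            return Some(keys);
        }
    }
    // Otherwise fall back to the first expression with a known ordering.
    orderings.iter().enumerate().find_map(|(expr_index, o)| {
        let descending = (*o)?;
        Some(vec![SortKey { expr_index, descending }])
    })
}
```

The `orderings` slice is index-aligned with the projected expressions, mirroring the `Vec<Option<PhysicalSortExpr>>` field quoted above, while the return value plays the role of the single plan-level output ordering.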

datafusion/core/src/physical_plan/projection.rs

datafusion/physical-expr/src/expressions/binary.rs

ozankabak · 2023-08-23T17:54:48Z

Thanks for the review, @alamb. @berkaysynnada will finalize the PR according to your comments and take notes for any follow-on work. I will merge this after he finalizes it.

ozankabak · 2023-08-23T21:02:20Z

@alamb, the code organization should now be in line with what you had in mind. I will merge after CI passes, but it'd be great if you could take a quick look in the meantime to make sure we got your suggestions right. AFAICT, there is only one potential area for improvement, which we will address in a follow-on.

berkaysynnada and others added 15 commits August 10, 2023 11:33

projection exec is updated, get_ordering method is added to physical …

31d5855

…expr's

Merge branch 'apache_main' into feature/projection-order-propagation

5fbed2d

fix after merge

3b7198f

simplifications

6032cef

Refactor, normalization code

d58b8de

Simplifications

f341250

mustafa's simplifications

d6c7868

Merge branch 'apache_main' into feature/projection-order-propagation

0119cde

test source update

8e9fbcc

Comment edited

4dda15e

Simplifications

46afd4e

Code improvements, comment reviews

2fdaf6d

Merge branch 'apache_main' into feature/projection-order-propagation

7a101ac

Comments are enriched

7875f20

tests added, ExtendedSortOptions renamed

b18bce2

github-actions bot added physical-expr Changes to the physical-expr crates core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Aug 22, 2023

berkaysynnada and others added 3 commits August 22, 2023 10:42

Merge branch 'main' into feature/projection-order-propagation

8aad516

fix after merge

e844a59

Minor change.

097b5b4

ozankabak approved these changes Aug 22, 2023

Dandandan reviewed Aug 22, 2023

Dandandan reviewed Aug 22, 2023

Address reviews

9233724

alamb approved these changes Aug 23, 2023

berkaysynnada and others added 2 commits August 23, 2023 23:41

structural changes

11793da

Finalizing changes

c3f4373

ozankabak merged commit 2fd704c into apache:main Aug 23, 2023

berkaysynnada mentioned this pull request Aug 25, 2023

ScalarFunctionExpr Maintaining Order #7417

Merged

alamb mentioned this pull request Aug 25, 2023

panicked at 'index out of bounds: the len is 0 but the index is 0' in find_orderings_of_exprs #7418

Closed

berkaysynnada deleted the feature/projection-order-propagation branch August 31, 2023 07:31

alamb mentioned this pull request Sep 5, 2023

panicked at 'index out of bounds: the len is 0 but the index is 0' in datafusion::physical_plan::projection::validate_output_ordering #7482

Closed
