-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Open
Labels
Description
This discussion stems from: #11976
Status quo
we currently
- explicitly insert logical exchanges where we might require data shuffling; and
- then determine whether those exchanges are real data shuffle or possible passthrough;
- then we determine whether to assign more/less servers to run either side of the exchanges
Current solution
This PR (#11976) only addresses step-2: to give it a better idea on whether the RelDistribution is the same before & after an exchange. for this purpose, it is OK (at this point) to only keep one of the 2 sides for JOIN Rel.
Needs revisit
there are several problems
- should we explicitly insert exchange or should we use other abstractions?
- there are other ways to add exchange nodes that are more "Calcite-suggested" when managing the Exchange insert instead of applying them during optimization IMO.
- should the exchange insert be RelDistribution-based or physical-based?
- we are mixing the concept of Exchange usage: it can mean (1) altering logical RelDistribution, (2) indicating there potentially could be physical layout differences, (3) whether we can apply leaf-stage optimization
- although (3) will be addressed by [multistage][feature] leaf planning with multi-semi join support #11937, we should consider whether to still use ExchangeNode as our abstraction or create our own to avoid confusion
- should we apply RelDistribution trait before or after Exchange insert
- currently we have to do this after insert, but technically if we addresses question (2) we can potentially apply that beforehand
ultimately utilizing Exchange was a quick decision during early stage multi-stage engine development and it might not have been the best option. it is worth taking some time to revisit