Interface for physical plan invariant checking. by wiedld · Pull Request #13986 · apache/datafusion

wiedld · 2025-01-02T18:02:30Z

Which issue does this PR close?

Part of Automatically check "invariants" #13652

Rationale for this change

The original discussion mentioned implicit changes that can cause problems when upgrading DataFusion (DF). These implicit changes often result from how DF core components interact with user-defined extensions that add and mutate plan nodes.

We previously introduced the concept of invariants as a way to isolate more quickly when an implicit change conflicts with user-defined plan extensions. A previous PR introduced the logical plan invariants. This PR introduces physical plan invariants.

What changes are included in this PR?

This WIP proposes the interface for the execution plan invariant checks. It was done a bit differently from the logical plan (LP) invariants.

The LP is a common enum with the same invokable function for checking invariants (although the level of validation may vary). In contrast, each ExecutionPlan node is its own implementation. The invariant checking is therefore defined on the implementations (with a default set of invariants defined on the trait).

As with the LP invariants, the physical plan invariants are checked as part of the default planner. Also as with the LP, the more costly check runs only in debug mode.
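To make the shape of this approach concrete, here is a minimal, self-contained sketch. It does not use the DataFusion crates: `PlanNode`, `check_plan`, `Leaf`, and `MyExtension` are hypothetical stand-ins, and the real `ExecutionPlan` trait has a different and much larger surface. The sketch only shows a trait method with a default implementation that individual nodes can override, plus a traversal that checks every node in the tree.

```rust
use std::sync::Arc;

/// Hypothetical, simplified stand-in for the `ExecutionPlan` trait.
trait PlanNode {
    fn name(&self) -> &str;
    fn children(&self) -> Vec<Arc<dyn PlanNode>>;

    /// Default implementation: no node-specific invariants to verify.
    fn check_invariants(&self) -> Result<(), String> {
        Ok(())
    }
}

/// Checks one node, then recurses into its children (pre-order traversal).
fn check_plan(node: &dyn PlanNode) -> Result<(), String> {
    node.check_invariants()
        .map_err(|e| format!("{}: {}", node.name(), e))?;
    for child in node.children() {
        check_plan(&*child)?;
    }
    Ok(())
}

/// A leaf node that relies on the default (empty) invariant check.
struct Leaf;

impl PlanNode for Leaf {
    fn name(&self) -> &str {
        "Leaf"
    }
    fn children(&self) -> Vec<Arc<dyn PlanNode>> {
        Vec::new()
    }
}

/// Hypothetical user extension node that requires exactly one input.
struct MyExtension {
    inputs: Vec<Arc<dyn PlanNode>>,
}

impl PlanNode for MyExtension {
    fn name(&self) -> &str {
        "MyExtension"
    }
    fn children(&self) -> Vec<Arc<dyn PlanNode>> {
        self.inputs.clone()
    }
    /// Overrides the default with a node-specific invariant.
    fn check_invariants(&self) -> Result<(), String> {
        if self.inputs.len() == 1 {
            Ok(())
        } else {
            Err(format!("expected 1 input, found {}", self.inputs.len()))
        }
    }
}
```

Calling `check_plan` on a `MyExtension` with no inputs returns an error naming the offending node, while nodes that do not override the method pass through the default.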

Are these changes tested?

Yes

Are there any user-facing changes?

User-defined ExecutionPlan extensions can define their own set of invariants. When a DF upgrade fails, users can run in debug mode and have their ExecutionPlan::check_node_invariants run after each optimizer pass. For example, this can isolate whether an upstream DF optimizer change has produced inputs that fail for the user's ExecutionPlan extensions.

wiedld · 2025-01-02T18:04:17Z

datafusion/physical-plan/src/execution_plan.rs

+    /// A default set of invariants is provided in the default implementation.
+    /// Extension nodes can provide their own invariants.
+    fn check_node_invariants(&self) -> Result<()> {
+        // TODO


I wasn't sure what should be the default set. The SanityCheckPlan does exactly what I had been thinking:

datafusion/datafusion/core/src/physical_optimizer/sanity_checker.rs

Lines 41 to 47 in 38ccb00

/// The SanityCheckPlan rule rejects the following query plans:
/// 1. Invalid plans containing nodes whose order and/or distribution requirements
/// are not satisfied by their children.
/// 2. Plans that use pipeline-breaking operators on infinite input(s),
/// it is impossible to execute such queries (they will never generate output nor finish)
#[derive(Default, Debug)]
pub struct SanityCheckPlan {}

Also, I think this optimizer pass does not mutate anything and instead validates?

If we change SanityCheckPlan to be an invariant checker instead, and then run it (a) after the other optimizer rules are applied (current behavior) as well as (b) after each optimizer rule in debug mode -- would this be useful?

The added debug mode check could help isolate when a user-defined optimizer rule extension, or a user defined ExecutionPlan node, does not work well with the DF upgrade (e.g. changes in DF plan nodes or optimizer rules).

Conceptually, sanity checking is a "more general" process -- it verifies that any two operators that exchange data (i.e. one's output feeds the other's input) are compatible. So I don't think we can "change" it to be an invariant checker, but we can extend it to also check "invariants" of each individual operator (however they are defined by an ExecutionPlan) as it traverses the plan tree.

However, we cannot blindly run sanity checking after every rule. Why? Because rules fall into the following types with respect to the validity of their input and output plans:

Some rules only take in valid plans and output valid plans (e.g. ProjectionPushdown). These are typically applied at later stages in the optimization/plan construction process.

Some take in invalid or valid plans, and always create valid plans (e.g. EnforceSorting and EnforceDistribution). These can be applied any time, but are typically applied in the middle of the optimization/plan construction process.

Some take invalid plans and yield still invalid plans (IIRC JoinSelection is this way). These are typically applied early in the optimization/plan construction process.

As of this writing, we don't have a formal cut-off point in our list of rules whereafter plans remain valid, but I suspect they do after EnforceSorting. In debug/upgrade mode, we can apply SanityCheckPlan after every rule after that point.
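The cut-off idea can be sketched in a few lines. This is only an illustration under assumptions: `rules_to_check` is a hypothetical helper, rules are represented by their names only, and the sketch treats the cut-off rule's own output as the first plan that gets checked.

```rust
/// Returns the names of the rules whose output would be sanity-checked in
/// debug/upgrade mode: the cut-off rule itself and every rule after it.
/// Returns an empty list if the cut-off rule is not in the list.
fn rules_to_check<'a>(rules: &[&'a str], cutoff: &str) -> Vec<&'a str> {
    match rules.iter().position(|name| *name == cutoff) {
        Some(idx) => rules[idx..].to_vec(),
        None => Vec::new(),
    }
}
```

With an illustrative ordering such as `["JoinSelection", "EnforceDistribution", "EnforceSorting", "ProjectionPushdown"]` and a cut-off of `"EnforceSorting"`, only the outputs of `EnforceSorting` and `ProjectionPushdown` would be checked.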

In the logical planner we have a split between

AnalyzerRules that make plans Executable (e.g. by coercing types, etc)

OptimizerRules that don't change the plan semantics (e.g. output types are the same, etc)

It seems like maybe we could make the same separation for physical optimizer rules as well ("not yet executable" and "ready to execute").

Some take invalid plans and yield still invalid plans (IIRC JoinSelection is this way). These are typically applied early in the optimization/plan construction process.

This was surprising to me (I am not doubting it). I looked at the other passes, and it seems there are a few others:

datafusion/datafusion/core/src/physical_optimizer/optimizer.rs

Lines 56 to 72 in 264f4c5

Arc::new(OutputRequirements::new_add_mode()),
Arc::new(AggregateStatistics::new()),
// Statistics-based join selection will change the Auto mode to a real join implementation,
// like collect left, or hash join, or future sort merge join, which will influence the
// EnforceDistribution and EnforceSorting rules as they decide whether to add additional
// repartitioning and local sorting steps to meet distribution and ordering requirements.
// Therefore, it should run before EnforceDistribution and EnforceSorting.
Arc::new(JoinSelection::new()),
// The LimitedDistinctAggregation rule should be applied before the EnforceDistribution rule,
// as that rule may inject other operations in between the different AggregateExecs.
// Applying the rule early means only directly-connected AggregateExecs must be examined.
Arc::new(LimitedDistinctAggregation::new()),
// The EnforceDistribution rule is for adding essential repartitioning to satisfy distribution
// requirements. Please make sure that the whole plan tree is determined before this rule.
// This rule increases parallelism if doing so is beneficial to the physical plan; i.e. at
// least one of the operators in the plan benefits from increased parallelism.
Arc::new(EnforceDistribution::new()),

🤔

Conceptually, sanity checking is a "more general" process -- it verifies that any two operators that exchange data (i.e. one's output feeds the other's input) are compatible. So I don't think we can "change" it to be an invariant checker, but we can extend it to also check "invariants" of each individual operator (however they are defined by an ExecutionPlan) as it traverses the plan tree.

I agree with this sentiment. It seems to me that the "SanityChecker" is verifying invariants that should be true for all nodes (regardless of what they do -- for example that the declared required input sort is the same as the produced output sort)

Thus, focusing on ExecutionPlan specific invariants might be a good first step.

Some simple invariants to start with I could imagine are:

Number of inputs (e.g. that unions have more than zero inputs)

Thank you both for the reviews. Apologies on the delayed response.

To summarize this nice explanation from @ozankabak, whether the output plan is executable (after an optimizer run) depends on what each optimizer rule does and on whether the input plan was valid. Although it is surprising, we currently permit optimizer rules to output invalid plans.

As such, I added a PhysicalOptimizerRule::executable_check which defines the expected behavior per optimizer rule (see commit here). This also helps us surface which rules may produce unexecutable plans, as well as when we can define an output plan as "executable".

Next, the InvariantChecker was expanded to conditionally check executability based on the declared expectations. If the plan is expected to be executable, then the invariants are checked against both (a) DataFusion's internal definition of "executable" from the sanity plan check and (b) any invariants defined on ExecutionPlan nodes (including user extension nodes).

Finally, there is an example test case added which demonstrates how this could be useful for catching incompatibilities with users' PhysicalOptimizerRule extensions.

Some rules only take in valid plans and output valid plans (e.g. ProjectionPushdown). These are typically applied at later stages in the optimization/plan construction process

When running our current sqllogic test suite, I found that most rules passed the plan's validity through unchanged. For example, ProjectionPushdown usually received an "unexecutable" plan (due to the output of an early OutputRequirements rule), and the pushdown rule itself did not change the validity.

That is why the default impl behavior of the PhysicalOptimizerRule::executable_check is to pass through the current plan validity expectation.
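A rough sketch of how such a pass-through default could look follows. Everything here is hypothetical: the `Rule` trait, the `executable_check` signature, and the three example rule types are simplified stand-ins written for this illustration, not the signature from the commit. The rule names attached to them in the usage note only follow the classification suggested in this thread.

```rust
/// Hypothetical, simplified stand-in for `PhysicalOptimizerRule`.
trait Rule {
    fn name(&self) -> &str;

    /// Given whether the input plan is expected to be executable, states
    /// whether the output plan is. Default: pass the expectation through.
    fn executable_check(&self, input_is_executable: bool) -> bool {
        input_is_executable
    }
}

/// A rule that keeps the default pass-through behavior.
struct PassThrough(&'static str);
impl Rule for PassThrough {
    fn name(&self) -> &str {
        self.0
    }
}

/// A rule that always yields an executable plan, whatever its input.
struct AlwaysExecutable(&'static str);
impl Rule for AlwaysExecutable {
    fn name(&self) -> &str {
        self.0
    }
    fn executable_check(&self, _input_is_executable: bool) -> bool {
        true
    }
}

/// A rule whose output is not expected to be executable yet.
struct NotYetExecutable(&'static str);
impl Rule for NotYetExecutable {
    fn name(&self) -> &str {
        self.0
    }
    fn executable_check(&self, _input_is_executable: bool) -> bool {
        false
    }
}

/// Threads the expectation through the rule list and records, per rule,
/// whether the plan it outputs is expected to be executable.
fn expectations(rules: &[Box<dyn Rule>], initial: bool) -> Vec<(String, bool)> {
    let mut state = initial;
    rules
        .iter()
        .map(|rule| {
            state = rule.executable_check(state);
            (rule.name().to_string(), state)
        })
        .collect()
}
```

For a list like `NotYetExecutable("OutputRequirements")`, `AlwaysExecutable("EnforceSorting")`, `PassThrough("ProjectionPushdown")`, the expectation flips to `false` after the first rule, back to `true` after the second, and stays `true` after the third, so the executable-level checks would run only after the last two.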

This has now changed per #13986 (review).


alamb · 2025-01-02T22:19:23Z

datafusion/physical-plan/src/execution_plan.rs

+    /// A default set of invariants is provided in the default implementation.
+    /// Extension nodes can provide their own invariants.
+    fn check_node_invariants(&self) -> Result<()> {
+        // TODO




alamb

Sorry for all the back and forth @wiedld but I feel like this PR has drifted further away from the intent.

As I said in my previous comment, I think we should focus on ExecutionPlan specific invariants. As you have uncovered and @ozankabak mentions, this will not capture all possible issues that arise, but it may prevent some issues.

Concretely I suggest:

Add ExecutionPlan::check_invariants
Add at least one simple check -- that UnionExec has more than zero inputs
Wire up the invariant checker for ExecutionPlans the same way it is wired for LogicalPlans (see code here)

Specifically verify invariants:

always on the provided plan before running any optimizer pass
always on the final plan after running all optimizer passes
in debug mode, after each individual pass
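The three check points listed above could be wired roughly as follows. This is a sketch under assumptions, not the planner code: `Plan`, `Rule`, `optimize_with_checks`, and `add_repartition` are hypothetical stand-ins, and `debug_mode` is an explicit flag here so the behavior is visible, whereas a real build would more likely key off debug assertions.

```rust
/// Hypothetical stand-in for a physical plan: just a list of operator names.
type Plan = Vec<String>;
/// Hypothetical stand-in for an optimizer rule.
type Rule = fn(Plan) -> Plan;

/// Runs `rules` over `plan`, invoking `check`:
/// - always on the provided plan, before any rule runs,
/// - after each individual rule, only when `debug_mode` is set,
/// - always on the final plan, after all rules have run.
fn optimize_with_checks(
    plan: Plan,
    rules: &[Rule],
    mut check: impl FnMut(&Plan) -> Result<(), String>,
    debug_mode: bool,
) -> Result<Plan, String> {
    check(&plan)?;
    let mut plan = plan;
    for rule in rules {
        plan = rule(plan);
        if debug_mode {
            check(&plan)?;
        }
    }
    check(&plan)?;
    Ok(plan)
}

/// Toy rule used only for illustration: appends one operator name.
fn add_repartition(mut plan: Plan) -> Plan {
    plan.push("RepartitionExec".to_string());
    plan
}
```

With two rules, the checker runs twice when `debug_mode` is off and four times when it is on. In debug mode a failing check stops the run right after the rule that produced the offending plan, which is what makes it useful for isolating the pass responsible.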

datafusion/physical-optimizer/src/optimizer.rs

wiedld · 2025-01-16T16:33:18Z

Converting to draft since I've been a bit side tracked. I'll mark it ready once updates are in. TY. 🙏🏼


alamb

Thank you @wiedld -- I think this is quite nice and can be extended over time.

I had some minor comment suggestions but we can make a follow on PR for those if preferred

datafusion/core/src/physical_planner.rs

alamb · 2025-01-19T10:02:39Z

datafusion/physical-plan/src/union.rs

        &self.cache
    }

+    fn check_invariants(&self, _check: InvariantLevel) -> Result<()> {
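The diff above is truncated, so the body of the check is not shown. As a hedged illustration of the kind of invariant suggested in review (a union must have more than zero inputs), here is a standalone sketch. `UnionNode` is a hypothetical stand-in for `UnionExec`, the `InvariantLevel` variant names are assumptions made for this sketch, and the exact threshold and error message in the merged code may differ.

```rust
/// Stand-in for the invariant level passed to the check.
/// The variant names are assumptions made for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum InvariantLevel {
    /// Invariants that should hold at every stage of planning.
    Always,
    /// Invariants that should hold once the plan is ready to execute.
    Executable,
}

/// Hypothetical stand-in for `UnionExec`; inputs are just names here.
struct UnionNode {
    inputs: Vec<String>,
}

impl UnionNode {
    /// The level is ignored because this invariant applies at every level.
    fn check_invariants(&self, _level: InvariantLevel) -> Result<(), String> {
        if self.inputs.is_empty() {
            Err("union node should have at least one input".to_string())
        } else {
            Ok(())
        }
    }
}
```

Because the level argument is unused, the same result is returned whether the check runs between optimizer passes or on the final plan.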


datafusion/physical-plan/src/execution_plan.rs

alamb · 2025-01-20T22:50:31Z

Thanks again @wiedld

github-actions bot added physical-expr Changes to the physical-expr crates core Core DataFusion crate labels Jan 2, 2025

wiedld commented Jan 2, 2025


wiedld mentioned this pull request Jan 2, 2025

Automatically check "invariants" #13652

Closed

3 tasks

feat(13652): provide interfaces for checking physical plan invariants…

3b03f6c

…, and perform check as part of the default physical planner

alamb reviewed Jan 2, 2025


alamb mentioned this pull request Jan 6, 2025

Define extension API for user-defined invariants. #14029

Closed

wiedld added 4 commits January 13, 2025 14:03

Merge branch 'main' into wiedld/physical-plan-invariant

da6cc17

feat(13652): define PhysicalOptimizerRule::executable_check interface…

5760792

… which allows each optimizer rule to state the of the output plan

feat(13652): perform invariant checking on the execution plan, condit…

94482d1

…ionally based upon the expected/stated behavior of the optimizer rule

test: update tests to reflect updated invariant checking paradigm

ad15c85

github-actions bot added the optimizer Optimizer rules label Jan 14, 2025

wiedld changed the title ~~WIP: Proposed interface for physical plan invariant checking.~~ Interface for physical plan invariant checking. Jan 14, 2025

wiedld marked this pull request as ready for review January 14, 2025 05:59

wiedld commented Jan 14, 2025

datafusion/core/src/physical_planner.rs

alamb reviewed Jan 14, 2025

datafusion/physical-optimizer/src/optimizer.rs

wiedld marked this pull request as draft January 16, 2025 16:32

wiedld added 5 commits January 18, 2025 19:27

Revert "feat(13652): define PhysicalOptimizerRule::executable_check i…

5e065e1

…nterface which allows each optimizer rule to state the of the output plan" This reverts commit 5760792.

Merge branch 'main' into wiedld/physical-plan-invariant

4a6718f

Revert "test: update tests to reflect updated invariant checking para…

10af0bd

…digm" This reverts commit ad15c85.

refactor: remove vestiges of sanity_check from the InvariantChecker

e71ef9f

refactor: introduce Invariant levels, and make explicit how the post-…

9d854a6

…optimization checker should be run

github-actions bot removed the optimizer Optimizer rules label Jan 19, 2025

feat: provide invariant for UnionExec

7b2f54b

wiedld marked this pull request as ready for review January 19, 2025 08:16

alamb approved these changes Jan 19, 2025


chore: update docs and error messages

8da07d0

alamb merged commit 2f28327 into apache:main Jan 20, 2025
25 checks passed

edmondop mentioned this pull request Jan 21, 2025

Automatically check "invariants" edmondop/arrow-datafusion#3

Open

shehabgamin mentioned this pull request Feb 4, 2025

Test DataFusion 45.0.0 with Sail #14408

Closed

alamb mentioned this pull request Feb 4, 2025

Feb 4, 2025: This week(s) in DataFusion #14491

Closed

