tests: enable fuzz for filtered anti-semi NLJoin#12360
Conversation
| .await | ||
| } | ||
|
|
||
| fn less_than_100_join_filter(schema1: Arc<Schema>, _schema2: Arc<Schema>) -> JoinFilter { |
There was a problem hiding this comment.
during NLjoin construction, join filter was completely ignored, and all NLjoin tests were running with only equijoin conditions in filter.
Just to double check my understanding, your analysis is that this test is invalid because NestedLoopJoinExec doesn't appy the less_than_100_join_filter correctly?
I reread the docs for NestedLoopJoinExec and it seems to imply the point is to run joins without a equality predicate 🤔
So is the difference that less_than_100_join_filter only referred to one join input (aka only refers to a) where NLJ needs filters that refer to both sides ?
There was a problem hiding this comment.
it seems to imply the point is to run joins without a equality predicate
That's true, but in order to somehow perform fuzz testing against other join implementations, NLJoin runs (intended to run) in fuzz tests with the following filter <on / equijoin conditions> and <optional join filter>
Just to double check my understanding, your analysis is that this test is invalid because NestedLoopJoinExec doesn't appy the
less_than_100_join_filtercorrectly?
The test is invalid because of how NLJoin was built in these tests -- if optional join filter was passed as a test case argument, it was not included into NLJoin filter expression, and the test compared join on <equijoin conditions> and <filter> for HashJoin vs join on <equijoin condition> for NLJoin -- this problem is fixed by the first commit.
So is the difference that
less_than_100_join_filteronly referred to one join input (aka only refers to a) where NLJ needs filters that refer to both sides ?
less_than_100_join_filter is removed because it's expression (t.a < 100) doesn't make much sense, since a produced by make_staggered_batches is always less than 100 -- that's why only semi/anti_filtered joins were failing (they used different filter expression -- col_lt_col_filter, which is not always true), and all other filtered test cases (inner/left/right/full) were passing even with missing filter for NLJoin (from p.1).
There was a problem hiding this comment.
yeah, I think I introduced the col_lt_col_filter when testing flaky semi join.
We probably should have used this filter for other filtered use cases.
But if the tests becoming flaky, it would be dangerous to merge the PR without ignoring tests
comphead
left a comment
There was a problem hiding this comment.
To find if test is flaky I followed the approach of running the test 1000 times in the loop. Like for AntiJoin
async fn test_anti_join_1k_filtered() {
// NLJ vs HJ gives wrong result
// Tracked in https://github.com/apache/datafusion/issues/11537
for i in 0..1000 {
println!("Iteration {i}");
JoinFuzzTestCase::new(
make_staggered_batches(1000),
make_staggered_batches(1000),
JoinType::LeftAnti,
Some(Box::new(col_lt_col_filter)),
)
.run_test(&[JoinTestType::HjSmj], false)
.await;
}
}
But again if the test becomes flaky with col_lt_col_filter we should probably ignore the test to avoid random broken CI
I've run all them in a loop with hundred iterations, and all currently present tests are passing. The only failed tests after filter change were SMJ related -- they are disabled in f9d2a90 |
Which issue does this PR close?
Closes #11537.
Rationale for this change
It turned out to be not a NLJoin issue, but fuzz tests -- during NLjoin construction, join filter was completely ignored, and all NLjoin tests were running with only equijoin conditions in filter.
In the same time filtered inner/left/right/full join fuzz tests were passing even without the filter -- the reason is that they've been using
less_than_100_join_filter, which always returns True, as generation of input data for column, used in this filter, always returns values less than 100 --make_staggered_batchescomment:so it seems to be more useful to place
col_lt_col_filterin all tests, since it has non 100% selectivity.Finally, after replacing filter in all test cases, some of HjSmj tests turned out to be flaky, so they are temporarily disabled, until #12359 fix.
What changes are included in this PR?
_filteredtest cases now usecol_lt_col_filter_filteredtestcases disabled for SortMergeJoin due to their flakinessAre these changes tested?
Yes
Are there any user-facing changes?
No, it's related exclusively to the tests