Support SortMergeJoin spilling #11218
Conversation
All existing spilling tests pass; I will add 3 more tests to exercise the spilling.

Multi-batch spill tests still fail.

All initial tests passed; I'm planning to add more tests related to result correctness in a separate PR.
```diff
-            "Spill file {:?} does not exist",
-            spill.path()
-        )));
+        return internal_err!("Spill file {:?} does not exist", spill.path());
```
I will review this in the next few days.
```rust
TestCase::new()
    .with_query(
        "select t1.* from t t1 JOIN t t2 ON t1.pod = t2.pod AND t1.time = t2.time",
    )
    .with_memory_limit(1_000)
    .with_config(config)
    .with_disk_manager_config(DiskManagerConfig::NewOs)
    .run()
    .await
```
I wonder how we know whether it triggers spilling or not?
Yeah, that is a great idea. I was overthinking how to check that the file spilled to disk, but metrics are much easier; I'm adding it.
@viirya I added metrics tests in sort_merge_join.rs, e.g. https://github.com/apache/datafusion/pull/11218/files#diff-825342e035aec56595dce761afb00dd54e3ae663a2e24ebf3a597123e636f9e2R3140
For this exact test, which runs at the SQL level, I'm wondering if I can somehow access metrics.
It doesn't seem possible to access any metrics in this case. We can rely on the fact that the test with spilling disabled fails with a memory error, while the same test with spilling enabled passes. Hope that is enough.
I plan to review this PR later today -- sorry for the delay.
```rust
}

#[tokio::test]
async fn sort_merge_join_spill() {
```
This test case can only make sure the query runs; it may or may not be spilling.
We should have some way to verify that the spilling actually happened.
Unfortunately, for exactly this test case we cannot access any spilling metrics, but there is another test above, sort_merge_join_no_spill, which is exactly the same except it has spilling explicitly disabled and expectedly fails with a memory error. This test passes without issues with spilling enabled, so we can conclude the spilling happened.
```rust
self.join_type,
on,
self.filter
    .as_ref()
    .map(|f| format!(", filter={}", f.expression()))
    .unwrap_or("".to_string())
```
Inlined the filter display and changed map_or_else to map with a default.
```rust
if buffered_batch.spill_file.is_none() && buffered_batch.batch.is_some() {
    self.reservation
        .try_shrink(buffered_batch.size_estimation)?;
}
```
We should also handle the other cases, i.e., the spill file is Some and the batch is also Some, both are None, etc.
I think those cases are not possible but the current code doesn't make that clear
Here is a proposal that I think makes it clearer what states are possible: comphead#297
```rust
// If the batch was spilled to disk, less likely
(Some(spill_file), None) => {
    let mut buffered_cols: Vec<ArrayRef> =
        Vec::with_capacity(buffered_indices.len());
```
buffered_indices.len() is the length of arrays. I think the capacity should be the number of columns of the batch.
They should work the same, right? The take kernel will check the bounds.
```rust
/// Spill the `RecordBatch` to disk as smaller batches
/// split by `batch_size_rows`
/// Return `total_rows` what is spilled
pub fn spill_record_batch_by_size(
```
Where is this function used other than in tests? I can't find it.
I think it may be left over from an earlier version of this PR
Yes, I'm planning to keep it and reuse it in row_hash in a following PR; basically the sub-batch slicing comes from row_hash.rs.
Thanks @viirya for your review, I'll address the comments today/tomorrow.
alamb left a comment
Thank you @comphead and @viirya
I think this code is now correct, though I also think it could be improved (both with the comments from @viirya, my suggestion in comphead#297, and more testing).
Specifically, for testing: given the subtlety of the code involved, I am not 100% sure it works for all corner cases. I suggest (as a follow-on) we invest in fuzz testing, both for SMJ in general and for spilling SMJ.
In particular, we should make sure the random inputs have varying numbers of repeated values (the code in this PR is only going to be exercised when there are many identical join keys, I think).
```diff
 struct BufferedBatch {
     /// The buffered record batch
-    pub batch: RecordBatch,
+    pub batch: Option<RecordBatch>,
```
While reviewing this PR, I found having to reason about the valid batch/spill_file combinations confusing (for example, I think there is an invariant that they can't both be Some).
Rather than using two fields, I tried making an enum that encodes the state, and I found it easier to reason about. Here is a proposal: comphead#297
I think it's a great idea. I'll include this in a follow-up to simplify the double Option check in favor of an enum.
Filed #11541
* Support SortMerge spilling
Which issue does this PR close?
Closes #9359.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?