perf: Reduce native shuffle memory overhead by 50% #1452

Merged

andygrove merged 7 commits into apache:main from andygrove:shuffle-no-double-alloc on Feb 28, 2025

Member

andygrove commented Feb 26, 2025

Which issue does this PR close?

Closes #1448

Rationale for this change

We were reserving memory twice in native shuffle, resulting in excessive spilling.

Here are results for TPC-H q9 with 3GB off-heap memory allocated:

Before (main branch)

        42.215020418167114,
        43.29415225982666,
        40.11583089828491,
        40.11201024055481,
        36.295708417892456

After

        33.27595615386963,
        30.483699560165405,
        31.230262994766235,
        31.28650164604187,
        31.095990657806396
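To make the comparison easier to read, the two runs can be summarized by their means. The sketch below assumes the values are wall-clock times in seconds (the description does not state the unit); the names `BEFORE`, `AFTER`, `mean`, and `percent_reduction` are illustrative only. On these numbers the mean drops from about 40.4 to about 31.5, roughly a 22% reduction in run time. This is a separate measure from the 50% memory figure in the title.

```rust
/// Timings copied from the PR description (unit assumed to be seconds).
const BEFORE: [f64; 5] = [
    42.215020418167114,
    43.29415225982666,
    40.11583089828491,
    40.11201024055481,
    36.295708417892456,
];
const AFTER: [f64; 5] = [
    33.27595615386963,
    30.483699560165405,
    31.230262994766235,
    31.28650164604187,
    31.095990657806396,
];

/// Arithmetic mean of a non-empty slice of timings.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Relative reduction from `before` to `after`, as a percentage.
fn percent_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

fn main() {
    let (b, a) = (mean(&BEFORE), mean(&AFTER));
    println!(
        "mean before: {:.2}, mean after: {:.2}, reduction: {:.1}%",
        b,
        a,
        percent_reduction(b, a)
    );
}
```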

What changes are included in this PR?

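Stop reserving memory twice. The reservation held by ShuffleRepartitioner is removed, because each PartitionBuffer already tracks its own memory.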

How are these changes tested?

andygrove marked this pull request as draft

February 26, 2025 20:36

codecov-commenter commented Feb 26, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.56%. Comparing base (f09f8af) to head (c33900b).
Report is 53 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #1452      +/-   ##
============================================
+ Coverage     56.12%   58.56%   +2.43%     
- Complexity      976     1015      +39     
============================================
  Files           119      122       +3     
  Lines         11743    12223     +480     
  Branches       2251     2295      +44     
============================================
+ Hits           6591     7158     +567     
+ Misses         4012     3913      -99     
- Partials       1140     1152      +12


          remove ShufflePartitioner reservation

4cfe993

andygrove force-pushed the shuffle-no-double-alloc branch from 47491a9 to 4cfe993 Compare

February 27, 2025 21:17

andygrove commented

native/core/src/execution/shuffle/shuffle_writer.rs

                   num_output_partitions: usize,
                   runtime: Arc<RuntimeEnv>,
                   metrics: ShuffleRepartitionerMetrics,
-                  reservation: MemoryReservation,

Member Author

andygrove Feb 27, 2025

This is the main change: removing the memory tracking in ShuffleRepartitioner, because we already track the memory in each instance of PartitionBuffer.
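Why double tracking matters can be illustrated with a toy model. The sketch below is not Comet or DataFusion code: `ToyPool` and `batches_before_spill` are hypothetical names, and the pool is a plain counter with a byte limit. It assumes that every appended buffer is reserved once by the per-partition buffer and, in the old behavior, a second time by the outer repartitioner. Under that assumption the pool refuses new reservations after half as many appends, which is the point where a real writer would spill.

```rust
/// Toy stand-in for a memory pool with a fixed byte limit
/// (hypothetical; not the DataFusion memory pool API).
struct ToyPool {
    limit: usize,
    used: usize,
}

impl ToyPool {
    fn new(limit: usize) -> Self {
        Self { limit, used: 0 }
    }

    /// Reserve `bytes`, failing if the limit would be exceeded.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.used + bytes > self.limit {
            return Err(format!("cannot reserve {} bytes", bytes));
        }
        self.used += bytes;
        Ok(())
    }
}

/// Count how many buffers of `batch_bytes` can be appended before the pool
/// refuses a reservation. With `double_count` set, the same bytes are
/// reserved a second time, modeling the redundant outer reservation.
fn batches_before_spill(limit: usize, batch_bytes: usize, double_count: bool) -> usize {
    assert!(batch_bytes > 0, "batch size must be positive");
    let mut pool = ToyPool::new(limit);
    let mut appended = 0;
    loop {
        // Reservation made by the per-partition buffer.
        if pool.try_grow(batch_bytes).is_err() {
            return appended;
        }
        // Redundant reservation for the same bytes by the outer component.
        if double_count && pool.try_grow(batch_bytes).is_err() {
            return appended;
        }
        appended += 1;
    }
}

fn main() {
    println!("double counting: {}", batches_before_spill(1024, 64, true));
    println!("single counting: {}", batches_before_spill(1024, 64, false));
}
```

With a 1024-byte limit and 64-byte buffers, the double-counting variant stops after 8 appends and the single-counting variant after 16.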

andygrove added 6 commits

February 27, 2025 14:26

fix

fba15c7


          improve error handling

eee60ae


          remove unused return values

e5cb72d


          more cleanup

5669d5f


          more accurate memory accounting

96957f8


          comment

c33900b

andygrove commented

native/core/src/execution/shuffle/shuffle_writer.rs

Comment on lines -475 to -509

-                                  let mut mem_diff = self
-                                      .append_rows_to_partition(
-                                          input.columns(),
-                                          &shuffled_partition_ids[start..end],
-                                          partition_id,
-                                      )
-                                      .await?;
-                                  if mem_diff > 0 {
-                                      let mem_increase = mem_diff as usize;
-                                      let try_grow = {
-                                          let mut mempool_timer = self.metrics.mempool_time.timer();
-                                          let result = self.reservation.try_grow(mem_increase);
-                                          mempool_timer.stop();
-                                          result
-                                      };
-                                      if try_grow.is_err() {
-                                          self.spill().await?;
-                                          let mut mempool_timer = self.metrics.mempool_time.timer();
-                                          self.reservation.free();
-                                          self.reservation.try_grow(mem_increase)?;
-                                          mempool_timer.stop();
-                                          mem_diff = 0;
-                                      }
-                                  }
-                                  if mem_diff < 0 {
-                                      let mem_used = self.reservation.size();
-                                      let mem_decrease = mem_used.min(-mem_diff as usize);
-                                      let mut mempool_timer = self.metrics.mempool_time.timer();
-                                      self.reservation.shrink(mem_decrease);
-                                      mempool_timer.stop();
-                                  }

Member Author

andygrove Feb 28, 2025

We don't need any of this memory accounting because it is already handled within append_rows_to_partition.
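The removed block follows a grow-or-spill pattern: try to grow the reservation, and if that fails, spill, release the reservation, and retry once. A minimal sketch of that control flow is shown here with hypothetical toy types (`ToyReservation`, `ToyWriter`, `reserve_or_spill`). It is synchronous, omits the metrics timers, and is not the actual Comet implementation.

```rust
/// Hypothetical stand-in for a memory reservation with a hard byte limit.
struct ToyReservation {
    limit: usize,
    size: usize,
}

impl ToyReservation {
    /// Grow the reservation by `bytes`, failing if the limit would be exceeded.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.size + bytes > self.limit {
            return Err(format!("cannot grow by {} bytes", bytes));
        }
        self.size += bytes;
        Ok(())
    }

    /// Release everything currently reserved.
    fn free(&mut self) {
        self.size = 0;
    }
}

/// Toy writer that counts spills instead of writing data to disk.
struct ToyWriter {
    reservation: ToyReservation,
    spills: usize,
}

impl ToyWriter {
    fn new(limit: usize) -> Self {
        Self {
            reservation: ToyReservation { limit, size: 0 },
            spills: 0,
        }
    }

    /// Pretend to write buffered data out, then release the reservation.
    fn spill(&mut self) {
        self.spills += 1;
        self.reservation.free();
    }

    /// Try to reserve; on failure spill once and retry. The retry can still
    /// fail if `bytes` alone exceeds the limit.
    fn reserve_or_spill(&mut self, bytes: usize) -> Result<(), String> {
        if self.reservation.try_grow(bytes).is_err() {
            self.spill();
            self.reservation.try_grow(bytes)?;
        }
        Ok(())
    }
}

fn main() {
    let mut w = ToyWriter::new(100);
    w.reserve_or_spill(60).unwrap();
    w.reserve_or_spill(60).unwrap(); // does not fit, so one spill happens first
    println!("spills: {}, reserved: {}", w.spills, w.reservation.size);
}
```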

andygrove commented

native/core/src/execution/shuffle/shuffle_writer.rs

-                          mem_diff += self.active_slots_mem_size as isize;
                       }
-                      Ok(mem_diff)

Member Author

andygrove Feb 28, 2025

No need to return the memory size here because we already reserved the memory in this method.

andygrove commented

native/core/src/execution/shuffle/shuffle_writer.rs

                           }
-                          start = end;
                       }
-                      AppendRowStatus::MemDiff(Ok(mem_diff))

Member Author

andygrove Feb 28, 2025

No need to return the memory size here because all accounting already took place in the calls to allocate_active_builders and flush in this method.

andygrove commented

native/core/src/execution/shuffle/shuffle_writer.rs

+                      mempool_timer.stop();
-                      mem_diff += (self.frozen.capacity() - frozen_capacity_old) as isize;
-                      Ok(mem_diff)

Member Author

andygrove Feb 28, 2025

No need to return memory size because memory accounting already happened in this method.

andygrove commented

native/core/src/execution/shuffle/shuffle_writer.rs

Comment on lines -1113 to +1069

-                      // TODO reservation should not be zero because there are active builders again
-                      assert_eq!(0, buffer.reservation.size());
+                      assert_eq!(106496, buffer.reservation.size());

Member Author

andygrove Feb 28, 2025

This demonstrates that the memory accounting is now more accurate: the reservation is no longer zero while there are active builders.

andygrove added the performance label

andygrove marked this pull request as ready for review

February 28, 2025 15:06

andygrove requested review from kazuyukitanimura and viirya

February 28, 2025 15:09

Member Author

andygrove commented Feb 28, 2025

@mbutrovich This PR is ready for review now

viirya approved these changes

mbutrovich approved these changes

Contributor

mbutrovich left a comment

I find "active" vague and the mixing of "active rows" and "active slots" a bit confusing in PartitionBuffer, but that shouldn't stop this PR. Nicely done, @andygrove!

Member Author

andygrove commented Feb 28, 2025

Thanks for the reviews @viirya and @mbutrovich. I agree that we could update some of the naming.

andygrove merged commit b149983 into apache:main

74 checks passed

andygrove deleted the shuffle-no-double-alloc branch

February 28, 2025 16:40

andygrove mentioned this pull request

Reduce spilling overhead in Comet shuffle #1436

Closed

coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request


          perf: Reduce native shuffle memory overhead by 50% (apache#1452)

4f529d8
