Conversation

@uchenily
Contributor

@uchenily uchenily commented Mar 25, 2025

Rationale for this change

Closes #45917

What changes are included in this PR?

Execute JoinResultMaterialize::Flush() in a task group to enhance parallelism in downstream processing, improving the performance of hash join.

Are these changes tested?

Yes.

Are there any user-facing changes?

None.

@uchenily uchenily requested a review from westonpace as a code owner March 25, 2025 02:25
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@uchenily uchenily changed the title Hash join improvement GH-45917: [C++][Acero] Join result materialize in parallel Mar 25, 2025
@github-actions

⚠️ GitHub issue #45917 has been automatically assigned in GitHub to PR creator.

@uchenily
Contributor Author

@westonpace @pitrou @zanmato1984 Please take a look when you have time. Thanks!

@uchenily uchenily force-pushed the hash-join-improvement branch from b6387ae to aeb47d3 Compare March 25, 2025 04:40
@zanmato1984
Contributor

Hi @uchenily , thank you for opening the PR.

I would like to add some more context about the problem this PR is trying to address:
By the time JoinProbeProcessor::OnFinished is invoked, there will be at most 1 << 15 (32k) pending rows (that is, rows that ought to be, but have not yet been, emitted to downstream nodes) in each JoinResultMaterialize (num_threads of them in total), and we are going to Flush them. Serial execution of these Flushes not only slows down the Flush itself, but also disables parallelism for any downstream processing, which in some cases might be computationally intensive (consider an aggregation like select sum(c) from t1 right join t2 on a = b group by d). Thus I think this PR makes sense.
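The change can be illustrated with a minimal Python sketch (not Arrow's actual API; Acero schedules these as C++ task-group tasks, and the names below are stand-ins for illustration): instead of one thread draining every per-thread materializer in sequence, submit one flush task per materializer so downstream consumers can run on all cores.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_THREADS = 4
PENDING_ROWS = 1 << 15  # at most 32k rows pending per materializer

def flush(materializer_id):
    """Stand-in for JoinResultMaterialize::Flush(): emit pending rows downstream."""
    # In the real node this would push the pending batches to downstream nodes.
    return (materializer_id, PENDING_ROWS)

# Before: one thread flushes every materializer in sequence, so all
# downstream processing of the flushed rows is serialized too.
serial_results = [flush(i) for i in range(NUM_THREADS)]

# After (this PR, via a task group): one flush task per materializer,
# so downstream processing of the flushed rows proceeds in parallel.
with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
    parallel_results = list(pool.map(flush, range(NUM_THREADS)))

# The same rows are emitted either way; only the scheduling differs.
assert sorted(serial_results) == sorted(parallel_results)
```
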

Just wondering: have you encountered any case where the above problem causes a real performance issue, and how bad is it? And how much does this PR improve it?

@uchenily
Contributor Author

uchenily commented Mar 25, 2025

@zanmato1984 I ran a test hashjoin + hash aggr (join type: RIGHT_OUTER, no key match). When each input batch was set to 1<<15, the probe * build (4096 * 512) scenario took only 17.8s (including data generation time), whereas the original serial way took 471.8s. (In this set of comparative tests, the value of kNumRowsPerScanTask was consistently set to 4 * 1024).

It should be noted that during the test above, I modified kNumRowsPerScanTask to 4 * 1024. If the original value of 512 * 1024 was used, performance remained poor; in fact, the test took so long that I couldn't even measure the runtime.

What I mean is that kNumRowsPerScanTask also significantly impacts the results. However, since I haven't fully understood how this parameter affects the results and couldn't determine a more reasonable value for it, I will leave it unchanged in this PR.

@zanmato1984
Contributor

> @zanmato1984 I ran a test hashjoin + hash aggr (join type: RIGHT_OUTER, no key match). When each input batch was set to 1<<15, the probe * build (4096 * 512) scenario took only 17.8s (including data generation time), whereas the original serial way took 471.8s. (In this set of comparative tests, the value of kNumRowsPerScanTask was consistently set to 4 * 1024).

Thank you for the info. May I know the number of threads in your test?

@uchenily
Contributor Author

@zanmato1984 I ran the test on a 112-core machine, using the default sizes for both the CPU and IO thread pools. num_threads = (GetCpuThreadPoolCapacity() + io::GetIOThreadPoolCapacity() + 1), so it should be 112 + 8 + 1.

@zanmato1984
Contributor

Thank you. After some math I think I can explain the perf boost in your setup.

Your thread count is 120, so there are 120 materialize_s. You modified kNumRowsPerScanTask to 4k - let's call each 4k rows a "batch" for short. You have 512 * (1 << 15) unmatched rows, aka 4096 batches, on the build side. Because 4096 > thread count, the parallelism of the scan will be 120. Each scan thread accumulates 4096 / 120 ≈ 34 batches into its materialize_, with most of the rows output to the downstream (the agg) in parallel and 32k rows left pending to flush in each materialize_. Finally, when the scan is finished, serial flushing needs to flush 120 * 32k rows, aka 960 batches, in sequence and output them to the downstream (the agg), whereas parallel flushing (this PR) uses 120 threads, each flushing 32k rows, aka 8 batches.

Summarizing the comparison:

  1. In the scan phase, each thread processes 34 batches; this is the same for both serial and parallel flushing.
  2. In the flushing phase, serial execution processes 960 batches, but parallel execution processes only 8 batches (in each thread).
  3. After roughly solving some equations, it takes about 0.5s to process a batch (mostly in the downstream agg). Serial flushing takes about (/*scan*/34 + /*flush*/960) * 0.5 = 497s, and parallel flushing takes about (/*scan*/34 + /*flush*/8) * 0.5 = 21s.
  4. The numbers roughly add up with the measured 471.8s and 17.8s.
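The back-of-envelope arithmetic can be rechecked in a few lines of Python (the 0.5s-per-batch figure is the rough cost inferred in the discussion, not a measurement made here):

```python
# Recomputing the batch counts and rough timings from the discussion above.
ROWS_PER_BATCH = 4 * 1024            # kNumRowsPerScanTask used in the test
BUILD_ROWS = 512 * (1 << 15)         # unmatched build-side rows
THREADS = 120                        # thread count in the test setup
PENDING_PER_MATERIALIZER = 1 << 15   # rows left pending per materialize_

total_batches = BUILD_ROWS // ROWS_PER_BATCH             # build-side batches
scan_batches_per_thread = total_batches // THREADS       # scan work per thread
serial_flush_batches = THREADS * PENDING_PER_MATERIALIZER // ROWS_PER_BATCH
parallel_flush_batches = PENDING_PER_MATERIALIZER // ROWS_PER_BATCH

SECS_PER_BATCH = 0.5  # rough per-batch cost, dominated by the downstream agg
serial_secs = (scan_batches_per_thread + serial_flush_batches) * SECS_PER_BATCH
parallel_secs = (scan_batches_per_thread + parallel_flush_batches) * SECS_PER_BATCH
```
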

As for the improvement of this PR when kNumRowsPerScanTask = 512k (the original value): though we don't have numbers, we can infer that, due to the lower parallelism (512 * (1 << 15) / 512k = 32 threads), the scan phase won't be as fast. And although there are still 120 materialize_s (the probe parallelism is still 120), only 32 of them will have data (because the scan parallelism is 32), so parallel flushing will only give a 32x boost.

But anyway, I think the effectiveness of this PR is independent of kNumRowsPerScanTask and fully justified. I will now proceed with reviewing the code.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Mar 26, 2025
@uchenily uchenily changed the title GH-45917: [C++][Acero] Join result materialize in parallel GH-45917: [C++][Acero] Add flush taskgroup to enable parallelization Mar 26, 2025
Contributor

@zanmato1984 zanmato1984 left a comment


This looks nice. Only one nit.

@uchenily uchenily force-pushed the hash-join-improvement branch from 7e5a2b0 to a1934df Compare March 26, 2025 06:26
@uchenily uchenily requested a review from zanmato1984 March 26, 2025 06:27
@zanmato1984
Contributor

@github-actions crossbow submit -g cpp

Contributor

@zanmato1984 zanmato1984 left a comment


LGTM. I'll merge once the CI is good.

Thank you for your contribution @uchenily !

@github-actions

Revision: a1934df

Submitted crossbow builds: ursacomputing/crossbow @ actions-750fc6ab42

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-meson GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@zanmato1984
Contributor

@ursabot please benchmark

@ursabot
Copy link

ursabot commented Mar 26, 2025

Benchmark runs are scheduled for commit a1934df. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 0 benchmarking runs that have been run so far on PR commit a1934df.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

@zanmato1984
Contributor

Benchmark failures are unrelated. It seems they have been broken for a while. cc @raulcd, who may know more about this.

I'm merging now.

@zanmato1984 zanmato1984 merged commit c753740 into apache:main Mar 26, 2025
38 of 39 checks passed
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit c753740.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

@raulcd
Member

raulcd commented Mar 26, 2025

Thanks, I've opened a blocker:

It seems Arrow fails to build on the buildkite runners

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Apr 15, 2025
…ation (apache#45918)

### Rationale for this change

Closes apache#45917

### What changes are included in this PR?

Execute JoinResultMaterialize.Flush() in task group to enhance parallelism in downstream processing, improving the performance of hash join.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

None.

* GitHub Issue: apache#45917

Authored-by: uchenily <[email protected]>
Signed-off-by: Rossi Sun <[email protected]>
