parallelize sort #142391

Ryo-not-rio · 2024-12-09T18:11:02Z

- use __gnu_parallel::sort for gcc compilations
- add a parallelized version of std::sort and std::stable_sort for non-gcc compilations

Using __gnu_parallel::sort:
provides ~3.7x speed up for length 50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64

The performance is measured using the following script:

import torch
import torch.autograd.profiler as profiler

torch.manual_seed(0)

N = 50000
x = torch.randn(N, dtype=torch.float)

with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof:
    for i in range(1000):
        _, _ = torch.sort(x)

print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=10))

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @malfet @snadampal @milpuz01

pytorch-bot · 2024-12-09T18:11:08Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142391

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 3 Unrelated Failures

As of commit c23712e with merge base e01a5e9:

NEW FAILURES - The following jobs have failed:

windows-binary-wheel / wheel-py3_12-cuda11_8-build (gh)
sccache: error: couldn't connect to server
windows-binary-wheel / wheel-py3_13-cuda11_8-build (gh)
sccache: error: couldn't connect to server
windows-binary-wheel / wheel-py3_13t-cuda11_8-build (gh)
sccache: error: couldn't connect to server

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

windows-binary-wheel / wheel-py3_13-cuda12_4-build (gh) (matched win rule in flaky-rules.json)
##[error]The operation was canceled.
windows-binary-wheel / wheel-py3_9-cuda11_8-build (gh) (matched win rule in flaky-rules.json)
##[error]The operation was canceled.
windows-binary-wheel / wheel-py3_9-cuda12_6-build (gh) (matched win rule in flaky-rules.json)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Skylion007 · 2024-12-09T18:20:36Z

Standard C++17 has a parallel sort algorithm built in via execution policies. Any reason we aren't using that? Is it because our supported compilers do not have full C++17 support?

aten/src/ATen/native/cpu/SortingKernel.cpp

nikitaved · 2024-12-10T15:54:02Z

aten/src/ATen/native/cpu/SortingKernel.cpp

Merge can be done in parallel and in a hierarchical manner. Here it looks like the parallelization is done only on a single level, i.e. the lowest level in the bottom-up approach, right?

Actually, sounds like a good thing to have at some point -- a generalized parallel implementation of divide-and-conquer types of algorithms...

yes, although the conquer part is quite specific to each algorithm
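To make the distinction in this thread concrete, here is a minimal sketch of a bottom-up sort in which the merges are also parallelized, level by level, rather than only the lowest-level chunk sorts. It is written in Python purely for illustration and is not the code in SortingKernel.cpp; the names `hierarchical_parallel_sort`, `merge_pair`, `num_chunks` and `max_workers` are hypothetical. CPython threads will not give a real speedup here because of the GIL, so read it as a picture of the structure only.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge  # lazy two-way merge of already sorted iterables


def merge_pair(pair):
    """Merge two sorted runs into one sorted list."""
    left, right = pair
    return list(merge(left, right))


def hierarchical_parallel_sort(values, num_chunks=4, max_workers=4):
    """Sort chunks concurrently, then merge the runs level by level."""
    if not values:
        return []
    chunk = -(-len(values) // num_chunks)  # ceiling division
    runs = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Lowest level: every chunk is sorted independently.
        runs = list(pool.map(sorted, runs))
        # Upper levels: neighbouring runs are merged pairwise. Merges on the
        # same level do not depend on each other, so they are submitted together.
        while len(runs) > 1:
            pairs = [(runs[i], runs[i + 1]) for i in range(0, len(runs) - 1, 2)]
            merged = list(pool.map(merge_pair, pairs))
            if len(runs) % 2:  # an odd run out is carried to the next level
                merged.append(runs[-1])
            runs = merged
    return runs[0]
```

The "conquer" step is isolated in `merge_pair`, which is the part that would differ for another divide-and-conquer algorithm; the level-by-level loop around it is the generic part.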

CMakeLists.txt

nikhil-arm · 2025-01-16T14:53:53Z

@pytorchbot label "module: arm"

malfet

Is the parallel sort algorithm a GNU C++ extension or something else? And why do you need tbb here?

It feels wrong adding random compiler extensions which are not yet planned to be included in the next compiler standard.

If this is something that is available via, say, the C++20 standard, I would be fine with adding a tentative standard flag, but if enabling this requires a new runtime dependency, then it's probably a non-starter to be included in any source builds.

.ci/docker/manywheel/Dockerfile_2_28_aarch64

Ryo-not-rio · 2025-01-16T16:18:25Z

> Is the parallel sort algorithm a GNU C++ extension or something else? And why do you need tbb here?
>
> It feels wrong adding random compiler extensions which are not yet planned to be included in the next compiler standard.
>
> If this is something that is available via, say, the C++20 standard, I would be fine with adding a tentative standard flag, but if enabling this requires a new runtime dependency, then it's probably a non-starter to be included in any source builds.

This is a libstdc++ extension listed in https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html. tbb seems to be a requirement for the actual parallelism to work, as we did not see any benefit without it. We also tried the execution policy for std::sort but did not see any improvement.

nikhil-arm · 2025-01-16T16:27:24Z

> Is the parallel sort algorithm a GNU C++ extension or something else? And why do you need tbb here?
> It feels wrong adding random compiler extensions which are not yet planned to be included in the next compiler standard.
> If this is something that is available via, say, the C++20 standard, I would be fine with adding a tentative standard flag, but if enabling this requires a new runtime dependency, then it's probably a non-starter to be included in any source builds.
>
> This is a libstdc++ extension listed in https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html. tbb seems to be a requirement for the actual parallelism to work, as we did not see any benefit without it. We also tried the execution policy for std::sort but did not see any improvement.

Let's try to profile without tbb on the latest main.

Skylion007 · 2025-01-16T16:32:47Z

More likely the issue is that the parallel execution policies in the stdlib were some of the last C++17 features to be implemented, so I am not even sure our stdlib version actually supports them (vs. falling back on the serial implementation).

Ryo-not-rio · 2025-01-17T10:25:20Z

@malfet @ng-05 I have removed the execution policy fallback as that is what was requiring tbb

malfet · 2025-01-23T00:08:36Z

> provides ~2x speed up for length 50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64

@Ryo-not-rio can you please share, in the PR description, the script that could be used to reproduce those results? (Though I guess that figure is from the TBB case, as I don't see how it makes any difference unless the GCC extension is used.) Which, again, makes me wonder whether this extension is slated for C++23 or will always remain an extension.

pytorchmergebot · 2025-02-05T02:21:11Z

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

Ryo-not-rio · 2025-02-06T10:53:41Z

@pytorchbot merge -r

pytorchmergebot · 2025-02-06T10:55:09Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

- use __gnu_parallel::sort for gcc compilations
- add a parallelized version of std::sort and std::stable_sort for non gcc compilations

Using __gnu_parallel::sort: provides ~3.7x speed up for length 50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64

Otherwise: provides ~2x speed up for length 50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64

pytorchmergebot · 2025-02-06T10:55:12Z

Successfully rebased parallel-sort onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout parallel-sort && git pull --rebase)

pytorchmergebot · 2025-02-06T10:56:29Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2025-02-06T15:02:56Z

Merge failed

Reason: 2 jobs have failed, first few of them are: windows-binary-wheel / wheel-py3_12-cuda11_8-build, windows-binary-wheel / wheel-py3_13t-cuda11_8-build

Details for Dev Infra team

Raised by workflow job

malfet · 2025-02-06T17:58:31Z

@pytorchbot merge -i

pytorchmergebot · 2025-02-06T18:00:23Z

Merge started

Your change will be merged while ignoring the following 6 checks: windows-binary-wheel / wheel-py3_12-cuda11_8-build, windows-binary-wheel / wheel-py3_13-cuda11_8-build, windows-binary-wheel / wheel-py3_13-cuda12_4-build, windows-binary-wheel / wheel-py3_9-cuda12_6-build, windows-binary-wheel / wheel-py3_9-cuda11_8-build, windows-binary-wheel / wheel-py3_13t-cuda11_8-build

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

PR #142391 erroneously used `USE_OMP` instead of `USE_OPENMP`.

Pull Request resolved: #149505
Approved by: https://github.com/fadara01, https://github.com/Skylion007

Parallelize sort (#149505)

PR #142391 erroneously used `USE_OMP` instead of `USE_OPENMP`.

Pull Request resolved: #149505
Approved by: https://github.com/fadara01, https://github.com/Skylion007
(cherry picked from commit 842d515)
Co-authored-by: Annop Wongwathanarat

pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Dec 9, 2024

pytorchbot added the open source label Dec 9, 2024

nikitaved reviewed Dec 9, 2024

aten/src/ATen/native/cpu/SortingKernel.cpp (outdated)

janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Dec 9, 2024

nikitaved reviewed Dec 10, 2024

Ryo-not-rio force-pushed the parallel-sort branch from 4207a53 to be42032 Compare December 12, 2024 15:21

Ryo-not-rio requested a review from nikitaved December 12, 2024 15:22

Ryo-not-rio force-pushed the parallel-sort branch 2 times, most recently from 45364be to 8373846 Compare December 12, 2024 15:27

nSircombe reviewed Dec 18, 2024

CMakeLists.txt (outdated)

Ryo-not-rio force-pushed the parallel-sort branch from 8373846 to 2f25783 Compare December 23, 2024 13:20

Ryo-not-rio requested a review from jeffdaily as a code owner December 23, 2024 13:20

pytorch-bot bot added the release notes: releng release notes category label Dec 23, 2024

Ryo-not-rio force-pushed the parallel-sort branch from 2f25783 to 6e9bc55 Compare December 23, 2024 13:21

Ryo-not-rio requested a review from nSircombe January 8, 2025 10:29

nikhil-arm requested review from digantdesai and malfet and removed request for nSircombe January 16, 2025 14:51

pytorch-bot bot added the module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 label Jan 16, 2025

malfet requested changes Jan 16, 2025

.ci/docker/manywheel/Dockerfile_2_28_aarch64 (outdated)

Ryo-not-rio force-pushed the parallel-sort branch from 14f34c7 to 408ca44 Compare January 17, 2025 10:34

huydhn requested a review from malfet January 17, 2025 18:10

pytorchmergebot removed the merging label Feb 5, 2025

Ryo-not-rio force-pushed the parallel-sort branch from d756c09 to 4a8ef5c Compare February 5, 2025 15:11

Ryo-not-rio added 3 commits February 6, 2025 10:55

Remove tbb and execution policy from parallel sort

518202e

Use -D_GLIBCXX_PARALLEL to parallelize sort

c23712e

pytorchmergebot force-pushed the parallel-sort branch from 5c04e67 to c23712e Compare February 6, 2025 10:55

pytorchmergebot added the merging label Feb 6, 2025

pytorchmergebot removed the merging label Feb 6, 2025

pytorchmergebot added the merging label Feb 6, 2025

pytorchmergebot closed this in 49082f9 Feb 6, 2025

pytorchmergebot added Merged and removed merging labels Feb 6, 2025

annop-w mentioned this pull request Mar 19, 2025

Parallelize sort #149505

Closed

pytorchbot mentioned this pull request Mar 21, 2025

Parallelize sort #149765

Merged

fadara01 mentioned this pull request Mar 21, 2025

[v.2.7.0] Release Tracker #149044

Closed

annop-w mentioned this pull request Mar 28, 2025

Parallelize sort using libstdc++ parallel mode #150195

Closed

Ryo-not-rio commented Dec 9, 2024 • edited by malfet

10 participants
