Skip to content

Conversation

@Ryo-not-rio
Copy link
Collaborator

@Ryo-not-rio Ryo-not-rio commented Dec 9, 2024

  • use __gnu_parallel::sort for gcc compilations
  • add a parallelized version of std::sort and std::stable_sort for non gcc compilations

Using __gnu_parallel::sort:
provides ~3.7x speed up for length 50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64

The performance is measured using the following script:

import torch
import torch.autograd.profiler as profiler

torch.manual_seed(0)

N = 50000
x = torch.randn(N, dtype=torch.float)

with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof:
    for i in range(1000):
        _, _ = torch.sort(x)

print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=10))

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @malfet @snadampal @milpuz01

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Dec 9, 2024
@pytorch-bot
Copy link

pytorch-bot bot commented Dec 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142391

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 3 Unrelated Failures

As of commit c23712e with merge base e01a5e9 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Skylion007
Copy link
Collaborator

Skylion007 commented Dec 9, 2024

std C++17 has a parallel sort algorithm builtin via the execution policy. Any reason we aren't using that? Is it because our supported compilers do not have full C++17 support?

@janeyx99 janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Dec 9, 2024
Comment on lines 160 to 168
Copy link
Collaborator

@nikitaved nikitaved Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge can be done in parallel and in a hierarchical manner. Here it looks like the parallelization is done only on a single level, i.e. the lowest level in the bottom-up approach, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, sounds like a good thing to have at some point -- a generalized parallel implementation of divide-and-conquer types of algorithms...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, although the conquer part is quite specific to each algorithm

@Ryo-not-rio Ryo-not-rio force-pushed the parallel-sort branch 2 times, most recently from 45364be to 8373846 Compare December 12, 2024 15:27
@pytorch-bot pytorch-bot bot added the release notes: releng release notes category label Dec 23, 2024
@Ryo-not-rio Ryo-not-rio requested a review from nSircombe January 8, 2025 10:29
@nikhil-arm nikhil-arm requested review from digantdesai and malfet and removed request for nSircombe January 16, 2025 14:51
@nikhil-arm
Copy link
Collaborator

@pytorchbot label "module: arm"

@pytorch-bot pytorch-bot bot added the module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 label Jan 16, 2025
Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is paralle sort algorithm a GNU C++ extension or something else? And why do you need tbb here.

It feels wrong adding random compiler extensions which are yet not planned to be included into next compiler standard.

If this is something that is available via say C++20 standard, I would be fine with adding tentative standard flag, but if enabling this requires new runtime dependency, than it's probably a no-starter to be included in any source builds

@Ryo-not-rio
Copy link
Collaborator Author

Is paralle sort algorithm a GNU C++ extension or something else? And why do you need tbb here.

It feels wrong adding random compiler extensions which are yet not planned to be included into next compiler standard.

If this is something that is available via say C++20 standard, I would be fine with adding tentative standard flag, but if enabling this requires new runtime dependency, than it's probably a no-starter to be included in any source builds

This is a libstdc++ extension listed in https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html. tbb seems to be a requirement for the actual parallelism to work as we did not see any benefit without it. We also tried the execution policy for std::sort but did not see any improvements

@nikhil-arm
Copy link
Collaborator

Is paralle sort algorithm a GNU C++ extension or something else? And why do you need tbb here.
It feels wrong adding random compiler extensions which are yet not planned to be included into next compiler standard.
If this is something that is available via say C++20 standard, I would be fine with adding tentative standard flag, but if enabling this requires new runtime dependency, than it's probably a no-starter to be included in any source builds

This is a libstdc++ extension listed in https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html. tbb seems to be a requirement for the actual parallelism to work as we did not see any benefit without it. We also tried the execution policy for std::sort but did not see any improvements

Lets try to profile without tbb with latest main

@Skylion007
Copy link
Collaborator

More likely the issue is that parllallel policy for the stdlib were some of the last C++17 features to be implemented so I am not even sure if our stdlib version actually supports them (vs falling back on the serial implementation)

@Ryo-not-rio
Copy link
Collaborator Author

@malfet @ng-05 I have removed the execution policy fallback as that is what was requiring tbb

@malfet
Copy link
Contributor

malfet commented Jan 23, 2025

provides ~2x speed up for length 50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64

@Ryo-not-rio can you please share the script in PR description that could be used to reproduce those results? (Though I guess it's a comment from TBB case, as I don't see how it makes any difference unless GCC extension is used) Which again, makes me wonder, if this extension is slated for C++23 or is it always an extension?

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@Ryo-not-rio
Copy link
Collaborator Author

@pytorchbot merge -r

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

- use __gnu_parallel::sort for gcc compilations
- add a parallelized version of std::sort and std::stable_sort for non gcc compilations

Using __gnu_parallel::sort:
provides ~3.7x speed up for length 50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64

Otherwise:
provides ~2x speed up for length 50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64
@pytorchmergebot
Copy link
Collaborator

Successfully rebased parallel-sort onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout parallel-sort && git pull --rebase)

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 2 jobs have failed, first few of them are: windows-binary-wheel / wheel-py3_12-cuda11_8-build, windows-binary-wheel / wheel-py3_13t-cuda11_8-build

Details for Dev Infra team Raised by workflow job

@malfet
Copy link
Contributor

malfet commented Feb 6, 2025

@pytorchbot merge -i

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged while ignoring the following 6 checks: windows-binary-wheel / wheel-py3_12-cuda11_8-build, windows-binary-wheel / wheel-py3_13-cuda11_8-build, windows-binary-wheel / wheel-py3_13-cuda12_4-build, windows-binary-wheel / wheel-py3_9-cuda12_6-build, windows-binary-wheel / wheel-py3_9-cuda11_8-build, windows-binary-wheel / wheel-py3_13t-cuda11_8-build

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@annop-w annop-w mentioned this pull request Mar 19, 2025
pytorchbot pushed a commit that referenced this pull request Mar 21, 2025
@pytorchbot pytorchbot mentioned this pull request Mar 21, 2025
ZainRizvi pushed a commit that referenced this pull request Mar 26, 2025
Parallelize sort (#149505)

PR #142391 erroneously used `USE_OMP` instead of `USE_OPENMP`.

Pull Request resolved: #149505
Approved by: https://github.com/fadara01, https://github.com/Skylion007

(cherry picked from commit 842d515)

Co-authored-by: Annop Wongwathanarat <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arm priority ciflow/binaries_wheel Trigger binary build and upload jobs for wheel on the PR ciflow/linux-aarch64 linux aarch64 CI workflow ciflow/trunk Trigger trunk jobs on your pull request Merged module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 module: cpu CPU specific problem (e.g., perf, algorithm) open source release notes: releng release notes category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants