-
Notifications
You must be signed in to change notification settings - Fork 26.3k
parallelize sort #142391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parallelize sort #142391
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142391
Note: Links to docs will display an error until the docs builds have been completed. ❌ 3 New Failures, 3 Unrelated FailuresAs of commit c23712e with merge base e01a5e9 ( NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
std C++17 has a parallel sort algorithm builtin via the execution policy. Any reason we aren't using that? Is it because our supported compilers do not have full C++17 support? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merge can be done in parallel and in a hierarchical manner. Here it looks like the parallelization is done only on a single level, i.e. the lowest level in the bottom-up approach, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, sounds like a good thing to have at some point -- a generalized parallel implementation of divide-and-conquer types of algorithms...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, although the conquer part is quite specific to each algorithm
4207a53 to
be42032
Compare
45364be to
8373846
Compare
8373846 to
2f25783
Compare
2f25783 to
6e9bc55
Compare
|
@pytorchbot label "module: arm" |
malfet
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is paralle sort algorithm a GNU C++ extension or something else? And why do you need tbb here.
It feels wrong adding random compiler extensions which are yet not planned to be included into next compiler standard.
If this is something that is available via say C++20 standard, I would be fine with adding tentative standard flag, but if enabling this requires new runtime dependency, than it's probably a no-starter to be included in any source builds
This is a libstdc++ extension listed in https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html. |
Lets try to profile without |
|
More likely the issue is that parllallel policy for the stdlib were some of the last C++17 features to be implemented so I am not even sure if our stdlib version actually supports them (vs falling back on the serial implementation) |
|
@malfet @ng-05 I have removed the execution policy fallback as that is what was requiring tbb |
14f34c7 to
408ca44
Compare
@Ryo-not-rio can you please share the script in PR description that could be used to reproduce those results? (Though I guess it's a comment from TBB case, as I don't see how it makes any difference unless GCC extension is used) Which again, makes me wonder, if this extension is slated for C++23 or is it always an extension? |
Merge failedReason: 3 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud |
d756c09 to
4a8ef5c
Compare
|
@pytorchbot merge -r |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
- use __gnu_parallel::sort for gcc compilations - add a parallelized version of std::sort and std::stable_sort for non gcc compilations Using __gnu_parallel::sort: provides ~3.7x speed up for length 50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64 Otherwise: provides ~2x speed up for length 50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64
|
Successfully rebased |
5c04e67 to
c23712e
Compare
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 2 jobs have failed, first few of them are: windows-binary-wheel / wheel-py3_12-cuda11_8-build, windows-binary-wheel / wheel-py3_13t-cuda11_8-build Details for Dev Infra teamRaised by workflow job |
|
@pytorchbot merge -i |
Merge startedYour change will be merged while ignoring the following 6 checks: windows-binary-wheel / wheel-py3_12-cuda11_8-build, windows-binary-wheel / wheel-py3_13-cuda11_8-build, windows-binary-wheel / wheel-py3_13-cuda12_4-build, windows-binary-wheel / wheel-py3_9-cuda12_6-build, windows-binary-wheel / wheel-py3_9-cuda11_8-build, windows-binary-wheel / wheel-py3_13t-cuda11_8-build Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
PR #142391 erroneously used `USE_OMP` instead of `USE_OPENMP`. Pull Request resolved: #149505 Approved by: https://github.com/fadara01, https://github.com/Skylion007
PR #142391 erroneously used `USE_OMP` instead of `USE_OPENMP`. Pull Request resolved: #149505 Approved by: https://github.com/fadara01, https://github.com/Skylion007 (cherry picked from commit 842d515)
Parallelize sort (#149505) PR #142391 erroneously used `USE_OMP` instead of `USE_OPENMP`. Pull Request resolved: #149505 Approved by: https://github.com/fadara01, https://github.com/Skylion007 (cherry picked from commit 842d515) Co-authored-by: Annop Wongwathanarat <[email protected]>
PR pytorch#142391 erroneously used `USE_OMP` instead of `USE_OPENMP`. Pull Request resolved: pytorch#149505 Approved by: https://github.com/fadara01, https://github.com/Skylion007
Using __gnu_parallel::sort:
provides ~3.7x speed up for length 50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64
The performance is measured using the following script:
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @malfet @snadampal @milpuz01