-
-
Notifications
You must be signed in to change notification settings - Fork 26.5k
ENH Parallelize gradient computation in t-SNE #13264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Here is a small benchmark for the performance gain with the current implementation. The benchmark was run using the suite from jeremiedbb/scikit-learn_benchmarks#6. We do not have a large impact in term of peak memory and we seem to gain 30% in performance when using 2 cores. |
jeremiedbb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good.
Needs #13297 to replace effective_n_jobs by effective_n_jobs_openmp (but don't merge it in this PR, it would make the review quite harder).
|
This is still current, merely awaiting review, yeah? What's new needs to be moved to 0.22.rst |
|
This PR is still valid but I think it was stalled because there were no clear concensus on how to control the number of openmp threads. I will move the whatnew. |
What's wrong with the approach of passing Maybe remove the "WIP" tag ? |
prange(num_threads=n_jobs) does not work if n_jobs is None |
Doesn't work in what way?
Though I imagine the detection of the actual number of cores is not as good as in joblib.. |
|
Aww, you mean that |
|
I mean prange(num_threads=None) does compile but prange(num_threads=n_jobs) where n_jobs is None doesn't |
Possibly related to cython/cython#1957, do we want to report it upstream? In any case don't we want to manually map |
|
I think we were also discussing about the default value, and how to handle env variable such as |
|
Maybe let's open an issue about this? As it would also affect all other PRs using OpenMP? |
|
+1 for opening an issue |
|
This PR is stalled as it needs #14196 to be merged before we finalize it. I can make the required updates once the |
|
The dependency has been merged |
|
@tomMoral It's been decided to use as many threads as possible by default and to not expose a n_jobs parameter for that. #14196 is meant to be used as follows: If the prange is an inner loop of nested loops, it's more efficient to call |
f9d3e0d to
581b3b5
Compare
|
I rebased the PR on master and now use the helper to set the number of threads. |
jeremiedbb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A first pass. Here are some comments.
NicolasHug
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments and questions from me.
Any chance you have access to a machine with lots of cores so we can check whether this suffers from the same performance drop as the GBDT, when the number of CPUs gets high?
Maybe @ogrisel can try on the same machine?
| long stop, | ||
| bint compute_error) nogil: | ||
| bint compute_error, | ||
| int num_threads) nogil: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: our convention is to use n_threads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The argument in the prange is called num_threads and this is private function. I would rather keep it that way as it seems clearer to me (it is linked to *_NUM_THREADS and all the openMP naming) but if you strongly disagree, I will make the change.
jeremiedbb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm !
still needs to benchmark on a big machine (I'll do it when I have time if you haven't done it before)
|
Here is a benchmark on a machine with 44 physical cores (88 logical cores). We can see that up to 44 threads, there is "some" improvements but there is a slow down after. Note that by default, |
|
From master on the same machine: |
NicolasHug
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nits but LGTM!
|
I fixed the last comments, let me know if more clean up are needed. |
|
Thanks @tomMoral ! |
Related to #13213
Add
n_jobsto t-SNE and parallelize the computation of the nearest neighbors (with threads) and the gradient (withprange).