Conversation

@tomMoral
Contributor

Related to #13213

Add n_jobs to t-SNE and parallelize the computation of the nearest neighbors (with threads) and the gradient (with prange).
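
For illustration, here is a minimal sketch of the prange pattern (a toy per-row gradient stands in for the real t-SNE kernel; toy_gradient and its signature are made up for this example, and the extension must be compiled with OpenMP support):

# cython: language_level=3
# Toy sketch of the prange pattern; not the PR's actual kernel.
from cython.parallel import prange

def toy_gradient(double[:, :] pos, double[:, :] grad, int num_threads):
    cdef Py_ssize_t i, j
    cdef Py_ssize_t n = pos.shape[0]
    cdef Py_ssize_t d = pos.shape[1]
    # Rows are independent, so the outer loop can be distributed
    # over OpenMP threads.
    for i in prange(n, nogil=True, num_threads=num_threads):
        for j in range(d):
            grad[i, j] = 2.0 * pos[i, j]  # gradient of ||pos[i]||^2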

@tomMoral tomMoral changed the title ENH add prange for gradient computation in t-SNE [WIP] ENH add prange for gradient computation in t-SNE Feb 26, 2019
@tomMoral
Contributor Author

Here is a small benchmark of the performance gain with the current implementation. The benchmark was run using the suite from jeremiedbb/scikit-learn_benchmarks#6. The impact on peak memory is small, and we see a roughly 30% runtime gain when using 2 cores.

Runtime
============ ========= ========= =========
--                 n_jobs           
------------ -----------------------------
   method        1         2         4    
============ ========= ========= =========
   exact      7.58±0s   7.47±0s   7.49±0s 
 barnes_hut   8.44±0s   5.70±0s   4.35±0s 
============ ========= ========= =========

Peak memory
============ ======= ======= =======
--                    n_jobs        
------------ -----------------------
    method       1       2       4   
============ ======= ======= =======
   exact      68.8M   68.8M   68.8M 
 barnes_hut   72.7M   74.1M   77.8M 
============ ======= ======= =======

Member

@jeremiedbb jeremiedbb left a comment

looks good.

Needs #13297 to replace effective_n_jobs with effective_n_jobs_openmp (but don't merge it into this PR, it would make the review much harder).

@jnothman
Member

jnothman commented May 1, 2019

This is still current, merely awaiting review, yeah?

The what's new entry needs to be moved to 0.22.rst.

@tomMoral
Contributor Author

This PR is still valid, but I think it stalled because there was no clear consensus on how to control the number of OpenMP threads.

I will move the what's new entry.

@rth
Member

rth commented Jun 20, 2019

This PR is still valid, but I think it stalled because there was no clear consensus on how to control the number of OpenMP threads.

What's wrong with the approach of passing n_jobs to prange as done in this PR?

Maybe remove the "WIP" tag?

@jeremiedbb
Member

jeremiedbb commented Jun 20, 2019

What's wrong with the approach of passing n_jobs to prange as done in this PR?

prange(num_threads=n_jobs) does not work if n_jobs is None

@rth
Member

rth commented Jun 20, 2019

prange(num_threads=n_jobs) does not work if n_jobs is None

Doesn't work in what way?

From https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html#cython.parallel.prange

cython.parallel.prange([start,] stop[, step][, nogil=False][, schedule=None[, chunksize=None]][, num_threads=None])
[..]
num_threads – The num_threads argument indicates how many threads the team should consist of. If not given, OpenMP will decide how many threads to use. Typically this is the number of cores available on the machine. However, this may be controlled through the omp_set_num_threads() function, or through the OMP_NUM_THREADS environment variable.

Though I imagine the detection of the actual number of cores is not as good as in joblib.

@rth
Member

rth commented Jun 20, 2019

Aww, you mean that None doesn't map to 1 but rather to -1?

@jeremiedbb
Member

I mean prange(num_threads=None) does compile, but prange(num_threads=n_jobs) where n_jobs is None doesn't.

@rth
Member

rth commented Jun 20, 2019

I mean prange(num_threads=None) does compile, but prange(num_threads=n_jobs) where n_jobs is None doesn't.

Possibly related to cython/cython#1957, do we want to report it upstream?

In any case, don't we want to manually map n_jobs=None to num_threads=1, since num_threads=None actually means n_jobs=-1?
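
For concreteness, a sketch of that mapping (resolve_num_threads is a hypothetical helper, not an existing scikit-learn function; the negative-value handling mimics joblib's convention):

import os

def resolve_num_threads(n_jobs):
    # Hypothetical helper: turn n_jobs semantics into a concrete int
    # that can be passed to prange's num_threads.
    if n_jobs is None:
        return 1  # sequential by default, unlike OpenMP's "use all cores"
    if n_jobs < 0:
        # joblib-style mapping: -1 -> all cores, -2 -> all but one, ...
        return max(1, (os.cpu_count() or 1) + 1 + n_jobs)
    return n_jobs

prange(n, nogil=True, num_threads=resolve_num_threads(n_jobs)) would then always receive a plain int.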

@tomMoral
Contributor Author

I think we were also discussing the default value, and how to handle environment variables such as OMP_NUM_THREADS and n_jobs=-1 (which is not valid).

@rth
Member

rth commented Jun 20, 2019

Maybe let's open an issue about this, as it would also affect all other PRs using OpenMP?

@NicolasHug
Member

+1 for opening an issue

@tomMoral
Contributor Author

tomMoral commented Sep 2, 2019

This PR is stalled, as it needs #14196 to be merged before we finalize it. I can make the required updates once the _openmp_effective_n_threads helper is merged into master.

@jnothman
Member

The dependency has been merged.

@jeremiedbb
Member

jeremiedbb commented Oct 21, 2019

@tomMoral It's been decided to use as many threads as possible by default and not to expose an n_jobs parameter for that. #14196 is meant to be used as follows:

cdef int num_threads = _openmp_effective_n_threads()
for i in prange(n, nogil=True, num_threads=num_threads):
    ...

If the prange is an inner loop of nested loops, it's more efficient to call _openmp_effective_n_threads outside the outer loop, because it involves Python interaction.
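
A sketch of that hoisting (toy loop body; assumes the helper is importable from sklearn.utils._openmp_helpers, where #14196 put it):

# cython: language_level=3
from cython.parallel import prange
from sklearn.utils._openmp_helpers import _openmp_effective_n_threads

def accumulate(double[:] out, Py_ssize_t n_outer, Py_ssize_t n_inner):
    cdef Py_ssize_t i, j
    # Resolved once before the loops: the helper touches Python objects,
    # so calling it inside the outer loop would pay that cost repeatedly.
    cdef int num_threads = _openmp_effective_n_threads()
    for i in range(n_outer):
        for j in prange(n_inner, nogil=True, num_threads=num_threads):
            out[j] += <double> (i + j)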

@tomMoral tomMoral changed the title [WIP] ENH add prange for gradient computation in t-SNE [MRG] ENH add prange for gradient computation in t-SNE Oct 21, 2019
@tomMoral
Contributor Author

I rebased the PR on master and now use the helper to set the number of threads.
I think it is now ready for merge.

@rth rth added Waiting for Reviewer and removed Needs Decision Requires decision labels Oct 21, 2019
Member

@jeremiedbb jeremiedbb left a comment

A first pass. Here are some comments.

Member

@NicolasHug NicolasHug left a comment

Some minor comments and questions from me.

Any chance you have access to a machine with lots of cores, so we can check whether this suffers from the same performance drop as the GBDTs when the number of CPUs gets high?
Maybe @ogrisel can try on the same machine?

      long stop,
-     bint compute_error) nogil:
+     bint compute_error,
+     int num_threads) nogil:
Member

Nit: our convention is to use n_threads.

Contributor Author

The argument in the prange is called num_threads, and this is a private function. I would rather keep it that way, as it seems clearer to me (it is linked to *_NUM_THREADS and all the OpenMP naming), but if you strongly disagree, I will make the change.

Member

@jeremiedbb jeremiedbb left a comment

LGTM!

It still needs a benchmark on a big machine (I'll do it when I have time, if you haven't done it before then).

@tomMoral
Contributor Author

Here is a benchmark on a machine with 44 physical cores (88 logical cores). We can see that up to 44 threads there is "some" improvement, but there is a slowdown after that. Note that by default, TSNE would use 88 threads, as reported by cpu_count.

$ for i in {1,4,16,32,64,88}; do
    OMP_NUM_THREADS=$i /usr/bin/nice -5 python3 benchmarks/bench_tsne_mnist.py
done

Used number of threads: 1
PCA preprocessing down to 50 dimensions took 3.459s
Fitting sklearn TSNE on 100 samples...
Fitting sklearn TSNE on 100 samples took 0.411s in 849 iterations, nn accuracy: 0.690
Fitting sklearn TSNE on 500 samples...
Fitting sklearn TSNE on 500 samples took 2.594s in 999 iterations, nn accuracy: 0.654
Fitting sklearn TSNE on 1000 samples...
Fitting sklearn TSNE on 1000 samples took 7.097s in 999 iterations, nn accuracy: 0.642
Fitting sklearn TSNE on 5000 samples...
Fitting sklearn TSNE on 5000 samples took 37.475s in 999 iterations, nn accuracy: 0.560
Fitting sklearn TSNE on 10000 samples...
Fitting sklearn TSNE on 10000 samples took 88.579s in 999 iterations, nn accuracy: 0.506

Used number of threads: 4
PCA preprocessing down to 50 dimensions took 2.104s
Fitting sklearn TSNE on 100 samples...
Fitting sklearn TSNE on 100 samples took 0.286s in 999 iterations, nn accuracy: 0.790
Fitting sklearn TSNE on 500 samples...
Fitting sklearn TSNE on 500 samples took 1.192s in 999 iterations, nn accuracy: 0.660
Fitting sklearn TSNE on 1000 samples...
Fitting sklearn TSNE on 1000 samples took 2.481s in 999 iterations, nn accuracy: 0.634
Fitting sklearn TSNE on 5000 samples...
Fitting sklearn TSNE on 5000 samples took 14.933s in 999 iterations, nn accuracy: 0.557
Fitting sklearn TSNE on 10000 samples...
Fitting sklearn TSNE on 10000 samples took 37.520s in 999 iterations, nn accuracy: 0.515

Used number of threads: 16
PCA preprocessing down to 50 dimensions took 1.730s
Fitting sklearn TSNE on 100 samples...
Fitting sklearn TSNE on 100 samples took 0.317s in 949 iterations, nn accuracy: 0.680
Fitting sklearn TSNE on 500 samples...
Fitting sklearn TSNE on 500 samples took 1.040s in 999 iterations, nn accuracy: 0.686
Fitting sklearn TSNE on 1000 samples...
Fitting sklearn TSNE on 1000 samples took 1.909s in 999 iterations, nn accuracy: 0.633
Fitting sklearn TSNE on 5000 samples...
Fitting sklearn TSNE on 5000 samples took 10.437s in 999 iterations, nn accuracy: 0.561
Fitting sklearn TSNE on 10000 samples...
Fitting sklearn TSNE on 10000 samples took 25.551s in 999 iterations, nn accuracy: 0.510

Used number of threads: 32
PCA preprocessing down to 50 dimensions took 1.647s
Fitting sklearn TSNE on 100 samples...
Fitting sklearn TSNE on 100 samples took 0.345s in 999 iterations, nn accuracy: 0.730
Fitting sklearn TSNE on 500 samples...
Fitting sklearn TSNE on 500 samples took 0.977s in 999 iterations, nn accuracy: 0.656
Fitting sklearn TSNE on 1000 samples...
Fitting sklearn TSNE on 1000 samples took 1.825s in 999 iterations, nn accuracy: 0.625
Fitting sklearn TSNE on 5000 samples...
Fitting sklearn TSNE on 5000 samples took 10.254s in 999 iterations, nn accuracy: 0.563
Fitting sklearn TSNE on 10000 samples...
Fitting sklearn TSNE on 10000 samples took 22.869s in 999 iterations, nn accuracy: 0.510

Used number of threads: 64
PCA preprocessing down to 50 dimensions took 1.802s
Fitting sklearn TSNE on 100 samples...
Fitting sklearn TSNE on 100 samples took 0.408s in 999 iterations, nn accuracy: 0.710
Fitting sklearn TSNE on 500 samples...
Fitting sklearn TSNE on 500 samples took 1.323s in 999 iterations, nn accuracy: 0.662
Fitting sklearn TSNE on 1000 samples...
Fitting sklearn TSNE on 1000 samples took 1.934s in 999 iterations, nn accuracy: 0.624
Fitting sklearn TSNE on 5000 samples...
Fitting sklearn TSNE on 5000 samples took 10.913s in 999 iterations, nn accuracy: 0.560
Fitting sklearn TSNE on 10000 samples...
Fitting sklearn TSNE on 10000 samples took 27.168s in 999 iterations, nn accuracy: 0.504

Used number of threads: 88
PCA preprocessing down to 50 dimensions took 1.761s
Fitting sklearn TSNE on 100 samples...
Fitting sklearn TSNE on 100 samples took 0.819s in 999 iterations, nn accuracy: 0.720
Fitting sklearn TSNE on 500 samples...
Fitting sklearn TSNE on 500 samples took 1.953s in 999 iterations, nn accuracy: 0.676
Fitting sklearn TSNE on 1000 samples...
Fitting sklearn TSNE on 1000 samples took 2.960s in 999 iterations, nn accuracy: 0.628
Fitting sklearn TSNE on 5000 samples...
Fitting sklearn TSNE on 5000 samples took 13.847s in 999 iterations, nn accuracy: 0.558
Fitting sklearn TSNE on 10000 samples...
Fitting sklearn TSNE on 10000 samples took 29.670s in 999 iterations, nn accuracy: 0.508

@tomMoral
Contributor Author

From master on the same machine:

/usr/bin/nice -5 python3 benchmarks/bench_tsne_mnist.py
PCA preprocessing down to 50 dimensions took 1.852s
Fitting sklearn TSNE on 100 samples...
Fitting sklearn TSNE on 100 samples took 0.355s in 999 iterations, nn accuracy: 0.690
Fitting sklearn TSNE on 500 samples...
Fitting sklearn TSNE on 500 samples took 2.236s in 999 iterations, nn accuracy: 0.658
Fitting sklearn TSNE on 1000 samples...
Fitting sklearn TSNE on 1000 samples took 5.367s in 999 iterations, nn accuracy: 0.638
Fitting sklearn TSNE on 5000 samples...
Fitting sklearn TSNE on 5000 samples took 33.667s in 999 iterations, nn accuracy: 0.560
Fitting sklearn TSNE on 10000 samples...
Fitting sklearn TSNE on 10000 samples took 83.715s in 999 iterations, nn accuracy: 0.514

@thomasjpfan thomasjpfan added this to the 0.22 milestone Oct 26, 2019
Member

@NicolasHug NicolasHug left a comment

Nits but LGTM!

@tomMoral
Contributor Author

I fixed the last comments; let me know if more cleanup is needed.

@NicolasHug NicolasHug changed the title [MRG] ENH add prange for gradient computation in t-SNE ENH Parallelize gradient computation in t-SNE Oct 30, 2019
@NicolasHug NicolasHug merged commit 908ded8 into scikit-learn:master Oct 30, 2019
@NicolasHug
Member

Thanks @tomMoral!

@tomMoral tomMoral deleted the ENH_prange_tsne branch October 30, 2019 14:34
@dkobak dkobak mentioned this pull request Mar 9, 2021