Conversation
|
Hello @emmanuelle! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-08-25 17:24:05 UTC |
|
@scoder comments or ideas very welcome :-) |
|
Compilation with OpenMP see also https://github.com/scikit-image/scikit-image/pull/1570/files |
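Following up on the OpenMP compilation point: a minimal sketch of how per-platform OpenMP flags could be selected in a build script. The helper name `openmp_args` is hypothetical (not scikit-image API); the assumption is GCC/clang-with-libomp take `-fopenmp` while MSVC takes `/openmp`.

```python
import sys

def openmp_args():
    # Hypothetical helper: pick OpenMP compile/link flags per platform.
    # GCC and clang (with libomp installed) accept -fopenmp; MSVC uses /openmp.
    if sys.platform == "win32":
        return {"extra_compile_args": ["/openmp"], "extra_link_args": []}
    return {"extra_compile_args": ["-fopenmp"], "extra_link_args": ["-fopenmp"]}
```

The returned dict could then be splatted into an `Extension(...)` call in `setup.py`, with a fallback to empty flags when OpenMP support is made optional.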
|
On my machine tests fail randomly (half of the time). With |
Codecov Report

    @@            Coverage Diff             @@
    ##             master    #3120    +/-   ##
    ==========================================
      Coverage          ?   86.81%
    ==========================================
      Files             ?      339
      Lines             ?    27385
      Branches          ?        0
    ==========================================
      Hits              ?    23773
      Misses            ?     3612
      Partials          ?        0
    ==========================================

Continue to review full report at Codecov.
|
|
Thank you for your comments @scoder. I actually had to revert to range in the part you commented on because of concurrent writes by different threads (which was why the tests were failing). However, there is another set of nested for loops over pixels where we can also choose which one becomes a prange. My idea was that the set of array elements considered inside the prange loop would fit in cache, but maybe this is not needed? |
|
I removed the "do not merge" in the PR title because maybe this is something we want to merge :-). |
|
@scikit-image/core what do you think about this PR? |
|
@emmanuelle I think it needs to be rebased on master and the asv benchmark function there adapted to test n_jobs=1, n_jobs=2, and n_jobs=4. =D (these should be different methods on the same class, I think.) |
|
@jni ok I rebased :-) |
|
@jni I rebased and added the asv benchmarks (is it the right way to do it?). |
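On the asv benchmark shape: one option besides separate methods is asv's `params`/`param_names` mechanism, which runs the same timing method once per parameter value. A sketch under that assumption (the class name and the sorting workload are placeholders, not the PR's actual benchmark):

```python
import numpy as np

class ParallelSlicSuite:
    # Hypothetical asv benchmark: asv calls time_slic once per value in
    # `params`, passing it as `n_jobs`.
    params = [1, 2, 4]
    param_names = ["n_jobs"]

    def setup(self, n_jobs):
        rng = np.random.default_rng(0)
        self.image = rng.random((64, 64, 3))

    def time_slic(self, n_jobs):
        # Placeholder workload; the real benchmark would call something
        # like skimage.segmentation.slic(self.image, n_jobs=n_jobs).
        np.sort(self.image, axis=0)
```
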
|
The code in the setup.py to add compiler flags is basically the same as in #3267, so we'll see which one gets merged first. |
|
@emmanuelle I'm not sure whether I'm compiling it right (using ...). Have you tried it? |
|
@jni hum, this is interesting, because with asv I get the same result as you, but when using %timeit I see the improvement. About gcc: |
|
Ah! This is very interesting and annoying! Looks like asv measures CPU time, not wall time:
I’m not sure what the right way to handle this is... perhaps to mentally divide by the number of cores? I’ll investigate. |
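To illustrate the CPU-time vs wall-time distinction: `time.process_time()` sums CPU time across all threads of the process, while `time.perf_counter()` tracks wall-clock time. For a 4-thread parallel region the two can differ by roughly the thread count; the sketch below shows the divergence in the opposite direction, using `sleep` (wall time advances, CPU time does not). The `measure` helper is just for illustration.

```python
import time

def measure(fn):
    # Return (wall_seconds, cpu_seconds) for one call. process_time()
    # aggregates CPU time over all threads; perf_counter() is wall time.
    w0, c0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - w0, time.process_time() - c0

# Sleeping burns wall time but almost no CPU time, so the clocks diverge;
# a busy multi-threaded region diverges the other way (cpu > wall).
wall, cpu = measure(lambda: time.sleep(0.2))
```
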
|
@emmanuelle I agree about the problem being here:

```python
if distance[z, y, x] > dist_center:
    nearest_segments[z, y, x] = k + start_label
    distance[z, y, x] = dist_center
```

Could you use separate ...? There's also an open issue for Cython to add min/max reductions to prange. |
Or, rather than letting me make up some fishy "improvement" to an established algorithm, it's probably better to look through existing approaches to parallelising SLIC. Searching for "parallel slic implementation" gives me a few hits, at least. I understand that this was an attempt to get something working quickly, but it seems that the algorithm itself resists a little. |
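One race-free pattern for the update above (a sketch of the general technique, not what the PR currently does) is to give each thread a private distance/label buffer and reduce afterwards with an element-wise argmin. The `combine_thread_results` name is hypothetical:

```python
import numpy as np

def combine_thread_results(distances, labels):
    """Hypothetical reduction step: each thread filled its own pair of
    (distance, label) arrays; keep the per-pixel minimum across threads."""
    distances = np.stack(distances)   # shape: (n_threads, *image_shape)
    labels = np.stack(labels)
    winner = np.argmin(distances, axis=0)            # which thread won per pixel
    best_dist = np.take_along_axis(distances, winner[None], axis=0)[0]
    best_label = np.take_along_axis(labels, winner[None], axis=0)[0]
    return best_dist, best_label
```

This trades memory (one buffer per thread) for the absence of concurrent writes, which is the same trade-off discussed for the per-pixel candidate lists below.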
|
@scoder thanks for your comments! Indeed, yesterday I was a bit vexed that I could not get any reasonable speed-up, so I started looking at the literature on parallel SLIC. For example, in this paper proposing a GPU implementation, the outer loop is over image pixels, and each pixel has a list of k indices corresponding to the possible centroids it can belong to (9 such indices in 2D). Distances to these centroids are computed for each pixel, a new label is attributed to the pixel, and after looping over all pixels, the new positions of the centroids are computed. This way, there is no problem of concurrent access. You are right that keeping the lists of indices for each pixel would have a significant memory cost, unless there is a way to compute the indices on the fly (using modulo and division operations). It could be worth giving this other way of looping a try. However, I'm not overly optimistic, since I brutally tried to turn the loop over segments into a
Thanks also for pointing at the open issue on min/max reduction! |
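The "compute the indices with modulo and division" idea can be sketched as follows, assuming centroids are initialized on a regular grid with spacing `step` (function name and signature are hypothetical; real SLIC centroids drift after the first iteration, so this only locates the initial grid cells):

```python
def candidate_centroids(y, x, step, grid_shape):
    """Hypothetical helper: the up-to-9 candidate centroids for a 2D pixel,
    recovered by integer division instead of a stored per-pixel list."""
    gy, gx = y // step, x // step          # grid cell containing the pixel
    candidates = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            cy, cx = gy + dy, gx + dx
            if 0 <= cy < grid_shape[0] and 0 <= cx < grid_shape[1]:
                # flatten (row, col) grid index to a single centroid id
                candidates.append(cy * grid_shape[1] + cx)
    return candidates
```

Because each pixel only reads centroid data and writes its own label, a prange over pixels with this lookup would have no concurrent writes.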
|
Hey @scikit-image/core, |
mkcor left a comment:
@alexdesiqueira +1 let's not lose this contribution, thanks for taking care of the merge conflicts 🙏
Co-authored-by: Marianne Corvellec <[email protected]>
|
Alright, let's see what we got. 🙂 |
|
|
I suspect we'd be able to get this built fairly easily on Meson now. And the changes are not very invasive. |
|
Examining the PR more closely, it looks like it uses ... Although, I'm not even sure whether Cython would allow you to dynamically compute num_threads on prange. Anyway, this PR isn't quite ready for merging and requires some further investigation and benchmarking. So I'm going to leave it here to note the idea, and remove the "OK to Merge" label. |
Not sure. What matters is a straight memory layout, so the height (i.e. chunks of rows of consecutive points in memory that benefit from RAM prefetching) seems a good thread partitioning scheme (for a 2D image). Large chunks of consecutive memory are usually as good as it gets. In short, as long as your algorithm works in rows, your best partitioning scheme is chunks of rows. |
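The row-chunk partitioning described above can be sketched in plain Python (the function name is made up; the real code would rely on `prange`'s static scheduling over the outer row loop to get the same effect):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def process_in_row_chunks(image, fn, n_jobs=4):
    """Sketch of the partitioning idea: each worker gets one contiguous
    block of rows, so every thread streams over consecutive memory
    (assuming a C-contiguous array)."""
    bounds = np.linspace(0, image.shape[0], n_jobs + 1).astype(int)
    chunks = (image[bounds[i]:bounds[i + 1]] for i in range(n_jobs))
    with ThreadPoolExecutor(n_jobs) as ex:
        list(ex.map(fn, chunks))   # each fn call touches a disjoint row block
```

Because the row blocks are disjoint, in-place writes by `fn` never race, which is exactly the property that makes rows a natural partitioning unit here.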
|
That's a great point, thanks for the reminder. We operate over z, y, x on a pixel level, so batching on |
|
Something else I was wondering about is whether |
This PR is here to share some work I've been doing during the BIDS sprint to accelerate Cython for loops using prange. I do get a nice acceleration (a factor of 4 using 8 threads, so it's quite good).
Some remaining issues:
- As I understand, prange only works with gcc (not with clang); I haven't tried compiling this code on a Mac.
- Compilation with OpenMP can be made optional.
- How to choose the number of threads? Should we leave it to the user? Can it be an argument of the function? That's what I did.
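The "argument of the function" option for the thread count could look like the sketch below (the name `resolve_n_jobs` and the negative-value convention are assumptions, borrowed from joblib's style, not something this PR defines):

```python
import os

def resolve_n_jobs(n_jobs=None):
    """Hypothetical helper: map a user-facing n_jobs argument to a thread
    count. None means 'use all cores'; negative values leave cores free
    (joblib-style: -1 -> all cores, -2 -> all but one, ...)."""
    n_cpus = os.cpu_count() or 1
    if n_jobs is None:
        return n_cpus
    if n_jobs < 0:
        return max(1, n_cpus + 1 + n_jobs)
    return max(1, n_jobs)
```

The resolved value would then be passed to prange's num_threads, keeping the OpenMP detail out of the public API.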