
Trying to accelerate slic#3120

Open
emmanuelle wants to merge 15 commits into scikit-image:main from emmanuelle:prange

Conversation

@emmanuelle
Member

@emmanuelle emmanuelle commented May 29, 2018

This PR is here to share some work I've been doing during the BIDS sprint to accelerate Cython for loops using prange. I do get a nice acceleration (a factor of 4 using 8 threads), so it's quite good.

Some remaining issues:

  • as I understand it, prange only works with gcc (not with clang). I haven't tried compiling this code on a Mac. Compilation with OpenMP can be made optional.
  • how to choose the number of threads? Should we leave it to the user? Can it be an argument of the function? That's what I did.
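As a pure-Python analogue of the prange idea (the function and its `n_jobs` argument are illustrative, not the PR's actual API), exposing the thread count as a function argument could look like this, with each worker taking a contiguous chunk of rows, mirroring OpenMP's static scheduling:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def row_sums(image, n_jobs=1):
    """Toy stand-in for a prange over rows: each worker handles a
    contiguous chunk of rows, with no two workers writing the same row."""
    out = np.empty(image.shape[0])

    def work(rows):
        for r in rows:
            out[r] = image[r].sum()

    chunks = np.array_split(np.arange(image.shape[0]), n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        list(pool.map(work, chunks))
    return out
```

In Cython the equivalent knob is `prange(..., num_threads=n_jobs)`, which does accept a runtime value.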

@pep8speaks

pep8speaks commented May 29, 2018

Hello @emmanuelle! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-08-25 17:24:05 UTC

@emmanuelle
Member Author

@scoder comments or ideas very welcome :-)

@stefanv
Member

stefanv commented May 29, 2018

@emmanuelle
Member Author

On my machine tests fail randomly (half of the time). With test_gray_3d, for example, there seem to be some boundary effects (values are the right ones, except at the border of expected superpixels). No idea where this comes from... @jni

@codecov-io

codecov-io commented May 30, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@2f436f1). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #3120   +/-   ##
=========================================
  Coverage          ?   86.81%           
=========================================
  Files             ?      339           
  Lines             ?    27385           
  Branches          ?        0           
=========================================
  Hits              ?    23773           
  Misses            ?     3612           
  Partials          ?        0
Impacted Files Coverage Δ
skimage/segmentation/slic_superpixels.py 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2f436f1...155e247. Read the comment docs.

@emmanuelle
Member Author

Thank you for your comments @scoder. I actually had to revert to range in the part you commented on because of concurrent writes by different threads (which was the reason the tests were failing). However, there is another set of nested for loops over pixels where we can also choose which one to make a prange. My idea was that the set of array elements considered inside the prange loop should fit in cache, but maybe this is not needed?


@emmanuelle emmanuelle changed the title [WIP] Trying to accelerate slic DO NOT MERGE [WIP] Trying to accelerate slic Jun 1, 2018
@emmanuelle
Member Author

I removed the "do not merge" in the PR title because maybe this is something we want to merge :-).

@emmanuelle
Member Author

@scikit-image/core what do you think about this PR?

@jni
Member

jni commented Jul 5, 2018

@emmanuelle I think it needs to be rebased on master and the asv benchmark function there adapted to test n_jobs=1, n_jobs=2, and n_jobs=4. =D (these should be different methods on the same class, I think.)
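The layout @jni describes (one timing method per n_jobs value on the same class) can be sketched as below, using the method names from the asv output later in this thread. The workload here is a stand-in; the real benchmark would call `segmentation.slic(self.image, n_jobs=...)`:

```python
import numpy as np

class SegmentationSuite:
    """asv-style benchmark class: asv times every method named time_*."""

    def setup(self):
        # asv calls setup() before each timing run.
        self.image = np.random.rand(64, 64, 3)

    def _run(self, n_jobs):
        # Stand-in workload; the real benchmark would call
        # segmentation.slic(self.image, n_jobs=n_jobs).
        return self.image.sum()

    def time_slic_basic(self):
        self._run(n_jobs=1)

    def time_slic_basic_2jobs(self):
        self._run(n_jobs=2)

    def time_slic_basic_4jobs(self):
        self._run(n_jobs=4)
```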

@emmanuelle
Member Author

@jni ok I rebased :-)

@emmanuelle
Member Author

@jni I rebased and added the asv benchmarks (is it the right way to do it?).

@emmanuelle
Member Author

The code in the setup.py to add compiler flags is basically the same as in #3267, so we'll see which one gets merged first.
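For reference, a sketch of the kind of flag-passing involved (hypothetical module and source names, not the PR's or #3267's actual setup.py code; note that -fopenmp is gcc's spelling, and stock clang does not support it):

```python
from setuptools import Extension

# Hypothetical: forward OpenMP flags to both the compiler and the linker.
# Making this optional means only adding the flags when the compiler
# is detected to support them.
openmp_flags = ["-fopenmp"]

ext = Extension(
    "skimage.segmentation._slic",
    sources=["skimage/segmentation/_slic.c"],
    extra_compile_args=openmp_flags,
    extra_link_args=openmp_flags,
)
```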

@jni
Member

jni commented Sep 1, 2018

@emmanuelle I'm not sure whether I'm compiling it right (using make clean; pip install -U --no-deps -e .), but currently the benchmark shows a decrease in performance on my machine:

 $ asv dev --bench time_slic_basic.*
· Discovering benchmarks
· Running 3 total benchmarks (1 commits * 1 environments * 3 benchmarks)
[  0.00%] ·· Building for existing-py_home_jni_miniconda3_envs_cf_bin_python
[  0.00%] ·· Benchmarking existing-py_home_jni_miniconda3_envs_cf_bin_python
[ 33.33%] ··· Running benchmark_segmentation.SegmentationSuite.time_slic_basic        2.36s
[ 66.67%] ··· Running benchmark_segmentation.SegmentationSuite.time_slic_basic_2jobs  2.52s
[100.00%] ··· Running benchmark_segmentation.SegmentationSuite.time_slic_basic_4jobs  3.24s

Have you tried it?

 $ gcc --version
gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@emmanuelle
Member Author

@jni hmm, this is interesting, because with asv I get the same result as you, but when using %timeit I see the improvement:

In [4]: %timeit b = segmentation.slic(a, n_jobs=1)
3.47 s ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit b = segmentation.slic(a, n_jobs=2)
1.9 s ± 5.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit b = segmentation.slic(a, n_jobs=4)
1.48 s ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

About gcc:

$ gcc --version
gcc (Ubuntu 7.2.0-8ubuntu3.2) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@jni
Member

jni commented Sep 1, 2018

Ah! This is very interesting and annoying! Looks like asv measures CPU time, not wall time:

The default timing function is the POSIX CLOCK_PROCESS_CPUTIME, which measures the CPU time used only by the current process.

https://github.com/airspeed-velocity/asv/blob/master/docs/source/writing_benchmarks.rst

I’m not sure what the right way to handle this is... perhaps to mentally divide by the number of cores? I’ll investigate.
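The distinction is easy to see from Python itself: a CPU-time clock doesn't credit time spent off-CPU, and for multithreaded code it sums CPU time across all threads, so a wall-clock speedup can leave CPU time flat or worse. A minimal illustration, with sleep standing in for a thread waiting on its peers:

```python
import time

start_wall = time.perf_counter()  # wall clock
start_cpu = time.process_time()   # per-process CPU clock
time.sleep(0.2)                   # off-CPU, like a blocked thread
wall = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu

# The sleep shows up in wall time but (almost) not in CPU time.
print(f"wall={wall:.2f}s cpu={cpu:.2f}s")
```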

@scoder

scoder commented Aug 9, 2020

@emmanuelle I agree about the problem being here:

if distance[z, y, x] > dist_center:
    nearest_segments[z, y, x] = k + start_label
    distance[z, y, x] = dist_center

Could you use separate nearest_segments and distance arrays (or ones with an additional k dimension) to collect the "k local" changes in parallel, and then merge the k sub-arrays back together in a final sequential loop at the end? That's obviously trading k times the space for a bit of time, not sure if that is acceptable. Images can be quite large sometimes.
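The merge step of that idea can be sketched in NumPy, under the assumption that each of the k parallel passes fills its own distance slice (the random values and shapes here are only illustrative):

```python
import numpy as np

k, depth, height, width = 3, 2, 4, 5
start_label = 1
rng = np.random.default_rng(0)

# One "k local" distance array per segment, each filled independently,
# so no two threads ever write to the same memory (random stand-in here).
dist_k = rng.random((k, depth, height, width))

# Final sequential merge: every voxel keeps the closest segment.
nearest_segments = dist_k.argmin(axis=0) + start_label
distance = dist_k.min(axis=0)
```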

There's also an open issue for Cython to add min() and max() as OpenMP reductions. That would apply to the distance array but probably not to nearest_segments.

@scoder

scoder commented Aug 9, 2020

Could you use separate nearest_segments and distance arrays …

Or, rather than allowing me to make up some fishy "improvement" to an established algorithm, it's probably better to take a look through existing approaches to parallelise SLIC. Searching for parallel slic implementation gives me a few hits, at least.

I understand that this was an approach to get something out of it quickly, but it seems that the algorithm itself resists a little bit.

@emmanuelle
Member Author

@scoder thanks for your comments! Indeed, yesterday I was a bit vexed that I could not get any reasonable speed-up, so I started looking at the literature on parallel SLIC. For example, in this paper proposing a GPU implementation, the outer loop is over image pixels, and each pixel has a list of k indices corresponding to the possible centroids it can belong to (9 such indices in 2D). Distances to these centroids are computed for each pixel, a new label is assigned to the pixel, and after looping over all pixels, the new positions of the centroids are computed. This way, there is no problem of concurrent access. You are right that keeping the lists of indices for each pixel would have a significant memory cost, unless there is a way to compute the indices (using modulo and division operations).
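The "compute the indices instead of storing them" option can be sketched like this for a regular 2D centroid grid (the grid layout and function are hypothetical; in slic the grid spacing comes from the segment spacing):

```python
import numpy as np

def candidate_centroids(y, x, step, grid_h, grid_w):
    """Indices of the (up to) 9 centroids around pixel (y, x), computed
    on the fly with integer division -- nothing stored per pixel."""
    gy, gx = y // step, x // step  # grid cell containing the pixel
    cands = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = gy + dy, gx + dx
            if 0 <= ny < grid_h and 0 <= nx < grid_w:
                cands.append(ny * grid_w + nx)  # flattened centroid index
    return cands
```

An interior pixel gets all 9 candidates, a corner pixel only 4, and no per-pixel list ever needs to be materialised.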

It could be worth giving this other way of looping a try. However, I'm not overly optimistic, since I crudely tried to turn the loop over segments into a prange (ignoring the risk of concurrent access) and could not see an improvement. Probably @grlee77 is right that

it looks like there are only basic multiply/add operations in the inner loop, so we may be limited by memory access times more than computation in the multithreaded case.

Thanks also for pointing at the open issue on min/max reduction!

Base automatically changed from master to main February 18, 2021 18:23
@jarrodmillman jarrodmillman modified the milestones: 0.17, 0.20 Jun 4, 2022
@alexdesiqueira
Member

Hey @scikit-image/core,
how are we on this one? If we are happy, I'll fix the conflicts and merge it. Thanks!


@mkcor mkcor left a comment


@alexdesiqueira +1 let's not lose this contribution, thanks for taking care of the merge conflicts 🙏

Alexandre de Siqueira and others added 3 commits August 25, 2022 10:11
@alexdesiqueira
Member

Alright, let's see what we got. 🙂

@alexdesiqueira
Member

Azure warning:

2022-08-25T19:27:47.3102697Z setup.py:9: DeprecationWarning: 
2022-08-25T19:27:47.3104273Z 
2022-08-25T19:27:47.3106742Z   `numpy.distutils` is deprecated since NumPy 1.23.0, as a result
2022-08-25T19:27:47.3108078Z   of the deprecation of `distutils` itself. It will be removed for
2022-08-25T19:27:47.3110438Z   Python >= 3.12. For older Python versions it will remain present.
2022-08-25T19:27:47.3111790Z   It is recommended to use `setuptools < 60.0` for those Python versions.
2022-08-25T19:27:47.3113069Z   For more details, see:
2022-08-25T19:27:47.3114456Z     https://numpy.org/devdocs/reference/distutils_status_migration.html 
2022-08-25T19:27:47.3115307Z 
2022-08-25T19:27:47.3116039Z 
2022-08-25T19:27:47.3117461Z   from numpy.distutils.command.build_ext import build_ext as npy_build_ext
2022-08-25T19:27:47.3119344Z C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\config\setupcfg.py:508: SetuptoolsDeprecationWarning: The license_file parameter is deprecated, use license_files instead.
2022-08-25T19:27:47.3120776Z   warnings.warn(msg, warning_class)
2022-08-25T19:27:47.3122055Z Partial import of skimage during the build process.

Azure error:

2022-08-25T19:27:52.4272028Z --- Logging error ---
2022-08-25T19:27:52.4274135Z Traceback (most recent call last):
2022-08-25T19:27:52.4275547Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\logging\__init__.py", line 1088, in emit
2022-08-25T19:27:52.4276925Z     stream.write(msg + self.terminator)
2022-08-25T19:27:52.4278473Z ValueError: underlying buffer has been detached
2022-08-25T19:27:52.4279856Z Call stack:
2022-08-25T19:27:52.4281015Z   File "setup.py", line 263, in <module>
2022-08-25T19:27:52.4282048Z     setup(
2022-08-25T19:27:52.4283346Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\numpy\distutils\core.py", line 169, in setup
2022-08-25T19:27:52.4284637Z     return old_setup(**new_attr)
2022-08-25T19:27:52.4285939Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\__init__.py", line 87, in setup
2022-08-25T19:27:52.4287747Z     return distutils.core.setup(**attrs)
2022-08-25T19:27:52.4289555Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\core.py", line 185, in setup
2022-08-25T19:27:52.4291001Z     return run_commands(dist)
2022-08-25T19:27:52.4292778Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\core.py", line 201, in run_commands
2022-08-25T19:27:52.4294445Z     dist.run_commands()
2022-08-25T19:27:52.4296190Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\dist.py", line 973, in run_commands
2022-08-25T19:27:52.4298062Z     self.run_command(cmd)
2022-08-25T19:27:52.4299787Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\dist.py", line 1217, in run_command
2022-08-25T19:27:52.4301544Z     super().run_command(command)
2022-08-25T19:27:52.4303365Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\dist.py", line 992, in run_command
2022-08-25T19:27:52.4304766Z     cmd_obj.run()
2022-08-25T19:27:52.4306331Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\wheel\bdist_wheel.py", line 299, in run
2022-08-25T19:27:52.4307740Z     self.run_command('build')
2022-08-25T19:27:52.4309538Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\cmd.py", line 319, in run_command
2022-08-25T19:27:52.4311094Z     self.distribution.run_command(command)
2022-08-25T19:27:52.4312829Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\dist.py", line 1217, in run_command
2022-08-25T19:27:52.4314693Z     super().run_command(command)
2022-08-25T19:27:52.4316544Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\dist.py", line 992, in run_command
2022-08-25T19:27:52.4318510Z     cmd_obj.run()
2022-08-25T19:27:52.4319986Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\numpy\distutils\command\build.py", line 62, in run
2022-08-25T19:27:52.4321784Z     old_build.run(self)
2022-08-25T19:27:52.4323362Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\command\build.py", line 132, in run
2022-08-25T19:27:52.4325143Z     self.run_command(cmd_name)
2022-08-25T19:27:52.4326796Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\cmd.py", line 319, in run_command
2022-08-25T19:27:52.4328763Z     self.distribution.run_command(command)
2022-08-25T19:27:52.4330441Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\dist.py", line 1217, in run_command
2022-08-25T19:27:52.4331789Z     super().run_command(command)
2022-08-25T19:27:52.4333525Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\dist.py", line 992, in run_command
2022-08-25T19:27:52.4334749Z     cmd_obj.run()
2022-08-25T19:27:52.4336415Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\numpy\distutils\command\build_src.py", line 144, in run
2022-08-25T19:27:52.4337788Z     self.build_sources()
2022-08-25T19:27:52.4339470Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\numpy\distutils\command\build_src.py", line 161, in build_sources
2022-08-25T19:27:52.4340876Z     self.build_extension_sources(ext)
2022-08-25T19:27:52.4342754Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\numpy\distutils\command\build_src.py", line 306, in build_extension_sources
2022-08-25T19:27:52.4344311Z     sources = list(ext.sources)
2022-08-25T19:27:52.4345932Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\pythran\dist.py", line 170, in sources
2022-08-25T19:27:52.4347876Z     tc.compile_pythranfile(source, output_file,
2022-08-25T19:27:52.4351427Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\pythran\toolchain.py", line 477, in compile_pythranfile
2022-08-25T19:27:52.4354806Z     output_file = compile_pythrancode(module_name, fd.read(),
2022-08-25T19:27:52.4356764Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\pythran\toolchain.py", line 411, in compile_pythrancode

@mkcor
Member

mkcor commented Aug 26, 2022

And this error #6453 (comment) as well. [edit: re-running these jobs]

@lagru lagru removed this from the 0.20 milestone Oct 25, 2022
@stefanv
Member

stefanv commented Feb 7, 2023

I suspect we'd be able to get this built fairly easily on Meson now. And the changes are not very invasive.

@stefanv stefanv added this to the 0.21 milestone Feb 7, 2023
@stefanv
Member

stefanv commented Apr 4, 2023

Examining the PR more closely, it looks like it uses prange only on the height of the image. Ideally, I suppose, one would have to determine the best dimension to prange over. Perhaps something like:

for z in range(z_min, z_max, num_threads=z_jobs):
    for y in prange(y_min, y_max, num_threads=y_jobs):
        for x in range(x_min, x_max, num_threads=x_jobs):

Although, I'm not even sure whether Cython would allow you to dynamically compute num_threads on prange.

Anyway, this PR isn't quite ready for merging, and requires some further investigation and benchmarking. So, I'm going to leave it here to note the idea, and remove the "OK to Merge" label.

@scoder

scoder commented Apr 4, 2023

uses prange only on the height of the image. Ideally, I suppose, one would have to determine the best dimension to prange over

Not sure. What matters is a straight memory layout, so the height (i.e. chunks of rows of consecutive points in memory that benefit from RAM prefetching) seems a good thread partitioning scheme (for a 2D image). Large chunks of consecutive memory are usually as good as it gets.

In short, as long as your algorithm works in rows, your best partitioning scheme is chunks of rows.
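That intuition can be checked directly in NumPy (illustrative shapes): in a C-contiguous image one row is one contiguous block of memory, so partitioning the work by rows hands each thread large, sequential, prefetch-friendly chunks:

```python
import numpy as np

img = np.zeros((480, 640), dtype=np.float64)  # C-contiguous by default

# Along a row, consecutive pixels sit 8 bytes apart in memory...
assert img[0].strides == (8,)
# ...while stepping down a column jumps a whole row (640 * 8 bytes).
assert img.strides == (640 * 8, 8)

# Chunking by rows therefore gives each thread one big contiguous block.
row_chunks = np.array_split(np.arange(img.shape[0]), 4)
assert all(np.all(np.diff(c) == 1) for c in row_chunks)
```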

@stefanv
Member

stefanv commented Apr 4, 2023

That's a great point, thanks for the reminder. We operate over z, y, x on a pixel level, so batching on y was probably sensible.

@stefanv
Member

stefanv commented Apr 4, 2023

Something else I was wondering about is whether x += 1 is still slow in Cython; will have to do an experiment to confirm.

@stefanv stefanv dismissed jni’s stale review April 5, 2023 20:55

Not ready yet.

@jarrodmillman jarrodmillman modified the milestones: 0.21, 0.22 May 19, 2023
@jarrodmillman jarrodmillman modified the milestones: 0.22, 0.23 Oct 24, 2023
@jarrodmillman jarrodmillman removed this from the 0.23 milestone Apr 4, 2024
