
Trying to accelerate slic#3120

Open
emmanuelle wants to merge 15 commits into scikit-image:main from emmanuelle:prange

Conversation

@emmanuelle
Member

@emmanuelle emmanuelle commented May 29, 2018

This PR is here to share some work I've been doing during the BIDS sprint to accelerate Cython for loops using prange. I do get a nice acceleration (a factor of 4 using 8 threads), so it's quite good.

Some remaining issues:

  • as I understand it, prange only works with gcc (not with clang). I haven't tried compiling this code on a Mac. Compilation with OpenMP can be made optional.
  • how to choose the number of threads? Should we leave it to the user? Can it be an argument of the function? That's what I did.
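As a pure-Python analogue of the prange idea (the function and its `n_jobs` argument are illustrative, not the PR's actual API), exposing the thread count as a function argument could look like this, with each worker taking a contiguous chunk of rows, mirroring OpenMP's static scheduling:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def row_sums(image, n_jobs=1):
    """Toy stand-in for a prange over rows: each worker handles a
    contiguous chunk of rows, with no two workers writing the same row."""
    out = np.empty(image.shape[0])

    def work(rows):
        for r in rows:
            out[r] = image[r].sum()

    chunks = np.array_split(np.arange(image.shape[0]), n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        list(pool.map(work, chunks))
    return out
```

In Cython the equivalent knob is `prange(..., num_threads=n_jobs)`, which does accept a runtime value.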

@pep8speaks

pep8speaks commented May 29, 2018

Hello @emmanuelle! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-08-25 17:24:05 UTC

@emmanuelle
Member Author

@scoder comments or ideas very welcome :-)

@stefanv
Member

stefanv commented May 29, 2018

@emmanuelle
Member Author

On my machine tests fail randomly (half of the time). With test_gray_3d, for example, there seem to be some boundary effects (values are the right ones, except at the border of expected superpixels). No idea where this comes from... @jni

@codecov-io

codecov-io commented May 30, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@2f436f1). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #3120   +/-   ##
=========================================
  Coverage          ?   86.81%           
=========================================
  Files             ?      339           
  Lines             ?    27385           
  Branches          ?        0           
=========================================
  Hits              ?    23773           
  Misses            ?     3612           
  Partials          ?        0
Impacted Files Coverage Δ
skimage/segmentation/slic_superpixels.py 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2f436f1...155e247. Read the comment docs.

@emmanuelle
Member Author

Thank you for your comments @scoder. I actually had to revert to range in the part you commented on because of concurrent writes by different threads (which was the reason the tests were failing). However, there is another set of nested for loops over pixels where we can also choose which one to make a prange. My idea was that the set of array elements considered inside the prange loop should fit in cache, but maybe this is not needed?


@emmanuelle emmanuelle changed the title [WIP] Trying to accelerate slic DO NOT MERGE [WIP] Trying to accelerate slic Jun 1, 2018
@emmanuelle
Member Author

I removed the "do not merge" in the PR title because maybe this is something we want to merge :-).

@emmanuelle
Member Author

@scikit-image/core what do you think about this PR?

@jni
Member

jni commented Jul 5, 2018

@emmanuelle I think it needs to be rebased on master and the asv benchmark function there adapted to test n_jobs=1, n_jobs=2, and n_jobs=4. =D (these should be different methods on the same class, I think.)
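The layout @jni describes (one timing method per n_jobs value on the same class) can be sketched as below, using the method names from the asv output later in this thread. The workload here is a stand-in; the real benchmark would call `segmentation.slic(self.image, n_jobs=...)`:

```python
import numpy as np

class SegmentationSuite:
    """asv-style benchmark class: asv times every method named time_*."""

    def setup(self):
        # asv calls setup() before each timing run.
        self.image = np.random.rand(64, 64, 3)

    def _run(self, n_jobs):
        # Stand-in workload; the real benchmark would call
        # segmentation.slic(self.image, n_jobs=n_jobs).
        return self.image.sum()

    def time_slic_basic(self):
        self._run(n_jobs=1)

    def time_slic_basic_2jobs(self):
        self._run(n_jobs=2)

    def time_slic_basic_4jobs(self):
        self._run(n_jobs=4)
```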

@emmanuelle
Member Author

@jni ok I rebased :-)

@emmanuelle
Member Author

@jni I rebased and added the asv benchmarks (is it the right way to do it?).

@emmanuelle
Member Author

The code in the setup.py to add compiler flags is basically the same as in #3267, so we'll see which one gets merged first.
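For reference, a sketch of the kind of flag-passing involved (hypothetical module and source names, not the PR's or #3267's actual setup.py code; note that -fopenmp is gcc's spelling, and stock clang does not support it):

```python
from setuptools import Extension

# Hypothetical: forward OpenMP flags to both the compiler and the linker.
# Making this optional means only adding the flags when the compiler
# is detected to support them.
openmp_flags = ["-fopenmp"]

ext = Extension(
    "skimage.segmentation._slic",
    sources=["skimage/segmentation/_slic.c"],
    extra_compile_args=openmp_flags,
    extra_link_args=openmp_flags,
)
```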

@jni
Member

jni commented Sep 1, 2018

@emmanuelle I'm not sure whether I'm compiling it right (using make clean; pip install -U --no-deps -e .), but currently the benchmark shows a decrease in performance on my machine:

 $ asv dev --bench time_slic_basic.*
· Discovering benchmarks
· Running 3 total benchmarks (1 commits * 1 environments * 3 benchmarks)
[  0.00%] ·· Building for existing-py_home_jni_miniconda3_envs_cf_bin_python
[  0.00%] ·· Benchmarking existing-py_home_jni_miniconda3_envs_cf_bin_python
[ 33.33%] ··· Running benchmark_segmentation.SegmentationSuite.time_slic_basic        2.36s
[ 66.67%] ··· Running benchmark_segmentation.SegmentationSuite.time_slic_basic_2jobs  2.52s
[100.00%] ··· Running benchmark_segmentation.SegmentationSuite.time_slic_basic_4jobs  3.24s

Have you tried it?

 $ gcc --version
gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@emmanuelle
Member Author

@jni hmm, this is interesting, because with asv I get the same result as you, but when using %timeit I see the improvement:

In [4]: %timeit b = segmentation.slic(a, n_jobs=1)
3.47 s ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit b = segmentation.slic(a, n_jobs=2)
1.9 s ± 5.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit b = segmentation.slic(a, n_jobs=4)
1.48 s ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

About gcc:

$ gcc --version
gcc (Ubuntu 7.2.0-8ubuntu3.2) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@jni
Member

jni commented Sep 1, 2018

Ah! This is very interesting and annoying! Looks like asv measures CPU time, not wall time:

The default timing function is the POSIX CLOCK_PROCESS_CPUTIME, which measures the CPU time used only by the current process.

https://github.com/airspeed-velocity/asv/blob/master/docs/source/writing_benchmarks.rst

I’m not sure what the right way to handle this is... perhaps to mentally divide by the number of cores? I’ll investigate.
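The distinction is easy to see from Python itself: a CPU-time clock doesn't credit time spent off-CPU, and for multithreaded code it sums CPU time across all threads, so a wall-clock speedup can leave CPU time flat or worse. A minimal illustration, with sleep standing in for a thread waiting on its peers:

```python
import time

start_wall = time.perf_counter()  # wall clock
start_cpu = time.process_time()   # per-process CPU clock
time.sleep(0.2)                   # off-CPU, like a blocked thread
wall = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu

# The sleep shows up in wall time but (almost) not in CPU time.
print(f"wall={wall:.2f}s cpu={cpu:.2f}s")
```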

@scoder

scoder commented Aug 9, 2020

@emmanuelle I agree about the problem being here:

if distance[z, y, x] > dist_center:
    nearest_segments[z, y, x] = k + start_label
    distance[z, y, x] = dist_center

Could you use separate nearest_segments and distance arrays (or ones with an additional k dimension) to collect the "k local" changes in parallel, and then merge the k sub-arrays back together in a final sequential loop at the end? That's obviously trading k times the space for a bit of time, not sure if that is acceptable. Images can be quite large sometimes.
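The merge step of that idea can be sketched in NumPy, under the assumption that each of the k parallel passes fills its own distance slice (the random values and shapes here are only illustrative):

```python
import numpy as np

k, depth, height, width = 3, 2, 4, 5
start_label = 1
rng = np.random.default_rng(0)

# One "k local" distance array per segment, each filled independently,
# so no two threads ever write to the same memory (random stand-in here).
dist_k = rng.random((k, depth, height, width))

# Final sequential merge: every voxel keeps the closest segment.
nearest_segments = dist_k.argmin(axis=0) + start_label
distance = dist_k.min(axis=0)
```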

There's also an open issue for Cython to add min() and max() as OpenMP reductions. That would apply to the distance array but probably not to nearest_segments.

@scoder

scoder commented Aug 9, 2020

Could you use separate nearest_segments and distance arrays …

Or, rather than allowing me to make up some fishy "improvement" to an established algorithm, it's probably better to take a look through existing approaches to parallelise SLIC. Searching for parallel slic implementation gives me a few hits, at least.

I understand that this was an approach to get something out of it quickly, but it seems that the algorithm itself resists a little bit.

@emmanuelle
Member Author

@scoder thanks for your comments! Indeed, yesterday I was a bit vexed that I could not get any reasonable speed-up, so I started looking at the literature on parallel SLIC. For example, in this paper proposing a GPU implementation, the outer loop is over image pixels, and each pixel has a list of k indices corresponding to the possible centroids it can belong to (9 such indices in 2D). Distances to these centroids are computed for each pixel, a new label is assigned to the pixel, and after looping over all pixels, the new positions of the centroids are computed. This way, there is no problem of concurrent access. You are right that keeping the lists of indices for each pixel would have a significant memory cost, unless there is a way to compute the indices (using modulo and division operations).
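The "compute the indices instead of storing them" option can be sketched like this for a regular 2D centroid grid (the grid layout and function are hypothetical; in slic the grid spacing comes from the segment spacing):

```python
import numpy as np

def candidate_centroids(y, x, step, grid_h, grid_w):
    """Indices of the (up to) 9 centroids around pixel (y, x), computed
    on the fly with integer division -- nothing stored per pixel."""
    gy, gx = y // step, x // step  # grid cell containing the pixel
    cands = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = gy + dy, gx + dx
            if 0 <= ny < grid_h and 0 <= nx < grid_w:
                cands.append(ny * grid_w + nx)  # flattened centroid index
    return cands
```

An interior pixel gets all 9 candidates, a corner pixel only 4, and no per-pixel list ever needs to be materialised.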

It could be worth giving this other way of looping a try. However, I'm not overly optimistic, since I crudely tried to turn the loop over segments into a prange (ignoring the risk of concurrent access) and could not see an improvement. Probably @grlee77 is right that

it looks like there are only basic multiply/add operations in the inner loop, so we may be limited by memory access times more than computation in the multithreaded case.

Thanks also for pointing at the open issue on min/max reduction!

Base automatically changed from master to main February 18, 2021 18:23
@jarrodmillman jarrodmillman modified the milestones: 0.17, 0.20 Jun 4, 2022
@alexdesiqueira
Member

Hey @scikit-image/core,
how are we on this one? If we are happy, I'll fix the conflicts and merge it. Thanks!


@mkcor mkcor left a comment


@alexdesiqueira +1 let's not lose this contribution, thanks for taking care of the merge conflicts 🙏

Alexandre de Siqueira and others added 3 commits August 25, 2022 10:11
@alexdesiqueira
Member

Alright, let's see what we got. 🙂

@alexdesiqueira
Member

Azure warning:

2022-08-25T19:27:47.3102697Z setup.py:9: DeprecationWarning: 
2022-08-25T19:27:47.3104273Z 
2022-08-25T19:27:47.3106742Z   `numpy.distutils` is deprecated since NumPy 1.23.0, as a result
2022-08-25T19:27:47.3108078Z   of the deprecation of `distutils` itself. It will be removed for
2022-08-25T19:27:47.3110438Z   Python >= 3.12. For older Python versions it will remain present.
2022-08-25T19:27:47.3111790Z   It is recommended to use `setuptools < 60.0` for those Python versions.
2022-08-25T19:27:47.3113069Z   For more details, see:
2022-08-25T19:27:47.3114456Z     https://numpy.org/devdocs/reference/distutils_status_migration.html 
2022-08-25T19:27:47.3115307Z 
2022-08-25T19:27:47.3116039Z 
2022-08-25T19:27:47.3117461Z   from numpy.distutils.command.build_ext import build_ext as npy_build_ext
2022-08-25T19:27:47.3119344Z C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\config\setupcfg.py:508: SetuptoolsDeprecationWarning: The license_file parameter is deprecated, use license_files instead.
2022-08-25T19:27:47.3120776Z   warnings.warn(msg, warning_class)
2022-08-25T19:27:47.3122055Z Partial import of skimage during the build process.

Azure error:

2022-08-25T19:27:52.4272028Z --- Logging error ---
2022-08-25T19:27:52.4274135Z Traceback (most recent call last):
2022-08-25T19:27:52.4275547Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\logging\__init__.py", line 1088, in emit
2022-08-25T19:27:52.4276925Z     stream.write(msg + self.terminator)
2022-08-25T19:27:52.4278473Z ValueError: underlying buffer has been detached
2022-08-25T19:27:52.4279856Z Call stack:
2022-08-25T19:27:52.4281015Z   File "setup.py", line 263, in <module>
2022-08-25T19:27:52.4282048Z     setup(
2022-08-25T19:27:52.4283346Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\numpy\distutils\core.py", line 169, in setup
2022-08-25T19:27:52.4284637Z     return old_setup(**new_attr)
2022-08-25T19:27:52.4285939Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\__init__.py", line 87, in setup
2022-08-25T19:27:52.4287747Z     return distutils.core.setup(**attrs)
2022-08-25T19:27:52.4289555Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\core.py", line 185, in setup
2022-08-25T19:27:52.4291001Z     return run_commands(dist)
2022-08-25T19:27:52.4292778Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\core.py", line 201, in run_commands
2022-08-25T19:27:52.4294445Z     dist.run_commands()
2022-08-25T19:27:52.4296190Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\dist.py", line 973, in run_commands
2022-08-25T19:27:52.4298062Z     self.run_command(cmd)
2022-08-25T19:27:52.4299787Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\dist.py", line 1217, in run_command
2022-08-25T19:27:52.4301544Z     super().run_command(command)
2022-08-25T19:27:52.4303365Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\dist.py", line 992, in run_command
2022-08-25T19:27:52.4304766Z     cmd_obj.run()
2022-08-25T19:27:52.4306331Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\wheel\bdist_wheel.py", line 299, in run
2022-08-25T19:27:52.4307740Z     self.run_command('build')
2022-08-25T19:27:52.4309538Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\cmd.py", line 319, in run_command
2022-08-25T19:27:52.4311094Z     self.distribution.run_command(command)
2022-08-25T19:27:52.4312829Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\dist.py", line 1217, in run_command
2022-08-25T19:27:52.4314693Z     super().run_command(command)
2022-08-25T19:27:52.4316544Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\dist.py", line 992, in run_command
2022-08-25T19:27:52.4318510Z     cmd_obj.run()
2022-08-25T19:27:52.4319986Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\numpy\distutils\command\build.py", line 62, in run
2022-08-25T19:27:52.4321784Z     old_build.run(self)
2022-08-25T19:27:52.4323362Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\command\build.py", line 132, in run
2022-08-25T19:27:52.4325143Z     self.run_command(cmd_name)
2022-08-25T19:27:52.4326796Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\cmd.py", line 319, in run_command
2022-08-25T19:27:52.4328763Z     self.distribution.run_command(command)
2022-08-25T19:27:52.4330441Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\dist.py", line 1217, in run_command
2022-08-25T19:27:52.4331789Z     super().run_command(command)
2022-08-25T19:27:52.4333525Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\setuptools\_distutils\dist.py", line 992, in run_command
2022-08-25T19:27:52.4334749Z     cmd_obj.run()
2022-08-25T19:27:52.4336415Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\numpy\distutils\command\build_src.py", line 144, in run
2022-08-25T19:27:52.4337788Z     self.build_sources()
2022-08-25T19:27:52.4339470Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\numpy\distutils\command\build_src.py", line 161, in build_sources
2022-08-25T19:27:52.4340876Z     self.build_extension_sources(ext)
2022-08-25T19:27:52.4342754Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\numpy\distutils\command\build_src.py", line 306, in build_extension_sources
2022-08-25T19:27:52.4344311Z     sources = list(ext.sources)
2022-08-25T19:27:52.4345932Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\pythran\dist.py", line 170, in sources
2022-08-25T19:27:52.4347876Z     tc.compile_pythranfile(source, output_file,
2022-08-25T19:27:52.4351427Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\pythran\toolchain.py", line 477, in compile_pythranfile
2022-08-25T19:27:52.4354806Z     output_file = compile_pythrancode(module_name, fd.read(),
2022-08-25T19:27:52.4356764Z   File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\pythran\toolchain.py", line 411, in compile_pythrancode

@mkcor
Member

mkcor commented Aug 26, 2022

And this error #6453 (comment) as well. [edit: re-running these jobs]

@lagru lagru removed this from the 0.20 milestone Oct 25, 2022
@stefanv
Member

stefanv commented Feb 7, 2023

I suspect we'd be able to get this built fairly easily on Meson now. And the changes are not very invasive.

@stefanv stefanv added this to the 0.21 milestone Feb 7, 2023
@stefanv
Member

stefanv commented Apr 4, 2023

Examining the PR more closely, it looks like it uses prange only on the height of the image. Ideally, I suppose, one would have to determine the best dimension to prange over. Perhaps something like:

for z in range(z_min, z_max, num_threads=z_jobs):
    for y in prange(y_min, y_max, num_threads=y_jobs):
        for x in range(x_min, x_max, num_threads=x_jobs):

Although, I'm not even sure whether Cython would allow you to dynamically compute num_threads on prange.

Anyway, this PR isn't quite ready for merging, and requires some further investigation and benchmarking. So, I'm going to leave it here to note the idea, and remove the "OK to Merge" label.

@scoder

scoder commented Apr 4, 2023

uses prange only on the height of the image. Ideally, I suppose, one would have to determine the best dimension to prange over

Not sure. What matters is a straight memory layout, so the height (i.e. chunks of rows of consecutive points in memory that benefit from RAM prefetching) seems a good thread partitioning scheme (for a 2D image). Large chunks of consecutive memory are usually as good as it gets.

In short, as long as your algorithm works in rows, your best partitioning scheme is chunks of rows.
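That intuition can be checked directly in NumPy (illustrative shapes): in a C-contiguous image one row is one contiguous block of memory, so partitioning the work by rows hands each thread large, sequential, prefetch-friendly chunks:

```python
import numpy as np

img = np.zeros((480, 640), dtype=np.float64)  # C-contiguous by default

# Along a row, consecutive pixels sit 8 bytes apart in memory...
assert img[0].strides == (8,)
# ...while stepping down a column jumps a whole row (640 * 8 bytes).
assert img.strides == (640 * 8, 8)

# Chunking by rows therefore gives each thread one big contiguous block.
row_chunks = np.array_split(np.arange(img.shape[0]), 4)
assert all(np.all(np.diff(c) == 1) for c in row_chunks)
```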

@stefanv
Member

stefanv commented Apr 4, 2023

That's a great point, thanks for the reminder. We operate over z, y, x on a pixel level, so batching on y was probably sensible.

@stefanv
Member

stefanv commented Apr 4, 2023

Something else I was wondering about is whether x += 1 is still slow in Cython; will have to do an experiment to confirm.

@stefanv stefanv dismissed jni’s stale review April 5, 2023 20:55

Not ready yet.

@jarrodmillman jarrodmillman modified the milestones: 0.21, 0.22 May 19, 2023
@jarrodmillman jarrodmillman modified the milestones: 0.22, 0.23 Oct 24, 2023
@jarrodmillman jarrodmillman removed this from the 0.23 milestone Apr 4, 2024
