Optimization of DBSCAN with OpenMP #3771

MarkFischinger · 2024-07-22T21:35:23Z

I've been working on optimizing our DBSCAN implementation, and I'm excited to share the results with you! This pull request introduces several performance enhancements that should make the clustering algorithm much faster, especially for larger datasets.

Changes made

Implemented parallel processing for centroids calculation
Improved parallelization of cluster assignments
Optimized batch clustering with better OpenMP usage
Introduced atomic operations to reduce critical sections
Enhanced memory locality for better cache utilization
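
As a rough illustration of the per-thread accumulation idea in the list above (not the PR's actual code), the sketch below sums points per cluster into thread-local buffers and merges them once per thread in a short critical section. It uses plain `std::vector` instead of Armadillo types. The function name `ComputeCentroids` and the flat, point-by-point data layout are assumptions made for this example. When compiled without OpenMP, the pragmas are ignored and the code runs serially.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of mlpack.  `data` stores `dim` consecutive
// values per point; assignments[i] == SIZE_MAX marks a noise point.
std::vector<double> ComputeCentroids(const std::vector<double>& data,
                                     const std::vector<size_t>& assignments,
                                     const size_t dim,
                                     const size_t numClusters)
{
  std::vector<double> centroids(dim * numClusters, 0.0);
  std::vector<size_t> counts(numClusters, 0);
  const long long n = static_cast<long long>(assignments.size());

  #pragma omp parallel
  {
    // Each thread accumulates into its own buffers, so the hot loop needs
    // no locking or atomics.
    std::vector<double> localCentroids(dim * numClusters, 0.0);
    std::vector<size_t> localCounts(numClusters, 0);

    #pragma omp for nowait
    for (long long i = 0; i < n; ++i)
    {
      const size_t p = static_cast<size_t>(i);
      const size_t c = assignments[p];
      if (c == SIZE_MAX)
        continue; // Skip noise points.
      for (size_t d = 0; d < dim; ++d)
        localCentroids[c * dim + d] += data[p * dim + d];
      ++localCounts[c];
    }

    // Merge the thread-local results; this runs once per thread.
    #pragma omp critical
    {
      for (size_t j = 0; j < centroids.size(); ++j)
        centroids[j] += localCentroids[j];
      for (size_t c = 0; c < numClusters; ++c)
        counts[c] += localCounts[c];
    }
  }

  // A generic helper cannot assume that every cluster is non-empty, so it
  // guards the division here.
  for (size_t c = 0; c < numClusters; ++c)
    if (counts[c] > 0)
      for (size_t d = 0; d < dim; ++d)
        centroids[c * dim + d] /= static_cast<double>(counts[c]);

  return centroids;
}
```

The trade-off is extra memory (one buffer per thread) in exchange for a loop body with no synchronization.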

I ran some benchmarks to compare the performance of the original implementation with the optimized version.

I've tested this on a variety of datasets, and it seems to be working well.
However, I'd really appreciate it if you could take a look and let me know if there's anything I've missed or if you have any suggestions for further improvements.

shrit

Nice addition; I left some comments.

shrit · 2024-07-23T07:51:40Z

src/mlpack/methods/dbscan/dbscan_impl.hpp

+#include <omp.h>
+#include <vector>
+#include <atomic>


These are already included; search for them inside core.hpp or prereqs.hpp, and if they are there, please include those files instead.

I also think that you can do the include inside the dbscan.hpp, no need to include anything in impl.hpp

shrit · 2024-07-23T07:52:50Z

src/mlpack/methods/dbscan/dbscan_impl.hpp


  // Get a count of all clusters.
-  const size_t numClusters = max(assignments) + 1;
+  const size_t numClusters = arma::max(assignments) + 1;


Any reason to add the arma:: prefix? We removed it intentionally, so if there is no error, could you remove it please?

geekypathak21

Hey @MarkFischinger, nice work 👍

geekypathak21 · 2024-07-23T12:14:44Z

src/mlpack/methods/dbscan/dbscan_impl.hpp

-  arma::Row<size_t> counts;
-  counts.zeros(numClusters);
-  for (size_t i = 0; i < data.n_cols; ++i)
+  arma::Row<size_t> counts(numClusters, arma::fill::zeros);


We are getting rid of arma::fill::zeros. Can you try removing it? No worries if you have a specific reason for adding it.

Actually, arma::fill::zeros is not needed, since the minimum supported version of Armadillo zero-initializes by default when no fill is specified.

Oh cool we can remove this for sure then 👍

@MarkFischinger can you remove this, as described in the above comment? Thanks.

geekypathak21 · 2024-07-23T12:16:12Z

src/mlpack/methods/dbscan/dbscan_impl.hpp

+  std::vector<MatType> localCentroids(numThreads, MatType(data.n_rows, numClusters, arma::fill::zeros));
+  std::vector<arma::Row<size_t>> localCounts(numThreads, arma::Row<size_t>(numClusters, arma::fill::zeros));


rcurtin

I haven't reviewed the whole PR yet, but the initial idea looks good. There are some comments about MLPACK_USE_OPENMP and other things that I left in #3762; do you think you can apply the relevant comments here too?

rcurtin · 2024-07-24T23:25:37Z

src/mlpack/methods/dbscan/dbscan_impl.hpp

+  #endif
  for (size_t i = 0; i < numClusters; ++i)
-    centroids.col(i) /= counts[i];
+    if (counts[i] > 0)


Why remove the comment pointing out that we are guaranteed the number of points in a cluster is greater than 0? (Then this check is not needed.)

@MarkFischinger could you apply the modification that Ryan requested? Thanks.

rcurtin · 2024-07-24T23:25:44Z

src/mlpack/methods/dbscan/dbscan_impl.hpp

  return numClusters;
 }

+


Suggested change

rcurtin · 2024-07-24T23:29:01Z

src/mlpack/methods/dbscan/dbscan_impl.hpp

+  #pragma omp parallel for
+  #endif
  for (size_t i = 0; i < data.n_cols; ++i)
    assignments[i] = uf.Find(i);


I don't think UnionFind::Find() is thread-safe, so I'm not sure that this can work correctly. Or did I overlook something?
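
To illustrate the concern, here is a minimal sketch assuming a typical union-find with path compression. `SimpleUnionFind` is a simplified stand-in written for this example and is not mlpack's actual `UnionFind` class. The point is that a `Find()` of this kind writes to the shared parent array, so concurrent calls from an OpenMP loop would race on those writes unless they are synchronized.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Simplified, hypothetical union-find; not mlpack's UnionFind.
class SimpleUnionFind
{
 public:
  explicit SimpleUnionFind(const size_t n) : parent(n), rank(n, 0)
  {
    // Every element starts as the root of its own set.
    std::iota(parent.begin(), parent.end(), size_t(0));
  }

  // Note that this is not const: path compression rewrites parent[] on the
  // way back up, which is a write to state shared by all callers.
  size_t Find(const size_t x)
  {
    if (parent[x] != x)
      parent[x] = Find(parent[x]);
    return parent[x];
  }

  // Union by rank.
  void Union(const size_t a, const size_t b)
  {
    const size_t ra = Find(a);
    const size_t rb = Find(b);
    if (ra == rb)
      return;
    if (rank[ra] < rank[rb])
      parent[ra] = rb;
    else if (rank[ra] > rank[rb])
      parent[rb] = ra;
    else
    {
      parent[rb] = ra;
      ++rank[ra];
    }
  }

  // Exposed only so the example can show the mutation done by Find().
  size_t Parent(const size_t x) const { return parent[x]; }

 private:
  std::vector<size_t> parent;
  std::vector<size_t> rank;
};
```

For instance, after `Union(0, 1)`, `Union(2, 3)`, `Union(1, 3)` on four elements, `Parent(3)` is 2 before calling `Find(3)` and 0 afterwards, which shows that a lookup modifies the structure.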

shrit · 2024-08-26T15:26:03Z

@MarkFischinger could you resolve the conflict with the master branch? Thank you.

shrit · 2024-08-27T18:34:40Z

src/mlpack/methods/dbscan/dbscan_impl.hpp

-  arma::Row<size_t> counts;
-  counts.zeros(numClusters);
-  for (size_t i = 0; i < data.n_cols; ++i)
+  arma::Row<size_t> counts(numClusters, arma::fill::zeros);


@MarkFischinger can you remove this, as described in the above comment? Thanks.

shrit · 2024-08-27T18:34:52Z

src/mlpack/methods/dbscan/dbscan_impl.hpp

  {
-    if (assignments[i] != SIZE_MAX)
+    MatType localCentroids(data.n_rows, numClusters, arma::fill::zeros);
+    arma::Row<size_t> localCounts(numClusters, arma::fill::zeros);


Same thing applies here

shrit · 2024-08-27T18:35:31Z

src/mlpack/methods/dbscan/dbscan_impl.hpp

+  #endif
  for (size_t i = 0; i < numClusters; ++i)
-    centroids.col(i) /= counts[i];
+    if (counts[i] > 0)


@MarkFischinger could you apply the modification that Ryan requested? Thanks.

shrit · 2024-08-27T18:36:00Z

src/mlpack/methods/dbscan/dbscan_impl.hpp

-  arma::Col<size_t> counts(numClusters);
-  for (size_t i = 0; i < assignments.n_elem; ++i)
-    counts[assignments[i]]++;
+  arma::Col<size_t> counts(numClusters, arma::fill::zeros);


@MarkFischinger the same comment applies here.

rcurtin

@MarkFischinger thanks for working on this! I have some comments---the strategy as it currently is unfortunately won't work, but I think we can still simplify, avoid the incorrect optimization, and merge. 👍

rcurtin · 2024-08-29T21:03:50Z

src/mlpack/methods/dbscan/dbscan_impl.hpp


-  // We should be guaranteed that the number of clusters is always greater than
-  // zero.
+  // Normalize centroids


Can you revert to or include the intent of the original comment? Someone wrote that there for a reason.

rcurtin · 2024-08-29T21:09:18Z

src/mlpack/methods/dbscan/dbscan_impl.hpp


  for (size_t i = 0; i < data.n_cols; ++i)
  {
-    if (i % 10000 == 0 && i > 0)


I believe that all of the optimizations below here are invalid and cause this algorithm to be something other than DBSCAN. I hate to say that because these give nice speedups (and are the core of the speedup of this PR), but inherently DBSCAN is a serial algorithm: select a point, find other points within range, mark them as part of the same cluster, repeat. That can't be parallelized (or rather, you can, but then it's not DBSCAN, so we can't do it).
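
The serial loop described above can be sketched as follows. This is a textbook-style formulation on 1-D points with a brute-force range search, written only to show the dependency between steps. It is not mlpack's implementation, which uses a tree-based range search, a union-find structure, and a point selection policy. The names `Neighbors` and `SerialDbscan` are made up for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Brute-force stand-in for a range search: indices of all points within
// epsilon of points[q] (including q itself).
std::vector<size_t> Neighbors(const std::vector<double>& points,
                              const size_t q,
                              const double epsilon)
{
  std::vector<size_t> result;
  for (size_t j = 0; j < points.size(); ++j)
    if (std::abs(points[j] - points[q]) <= epsilon)
      result.push_back(j);
  return result;
}

// Returns one label per point; SIZE_MAX marks noise.
std::vector<size_t> SerialDbscan(const std::vector<double>& points,
                                 const double epsilon,
                                 const size_t minPoints)
{
  std::vector<size_t> labels(points.size(), SIZE_MAX);
  std::vector<bool> visited(points.size(), false);
  size_t nextCluster = 0;

  for (size_t i = 0; i < points.size(); ++i)
  {
    if (visited[i])
      continue;
    visited[i] = true;

    std::vector<size_t> frontier = Neighbors(points, i, epsilon);
    if (frontier.size() < minPoints)
      continue; // Not a core point; it may still be claimed as a border.

    const size_t cluster = nextCluster++;
    labels[i] = cluster;

    // Expand the cluster.  Each step reads the labels and visited flags
    // written by earlier steps, which is what makes the loop order-dependent.
    for (size_t k = 0; k < frontier.size(); ++k)
    {
      const size_t j = frontier[k];
      if (labels[j] == SIZE_MAX)
        labels[j] = cluster;
      if (visited[j])
        continue;
      visited[j] = true;

      const std::vector<size_t> more = Neighbors(points, j, epsilon);
      if (more.size() >= minPoints)
        frontier.insert(frontier.end(), more.begin(), more.end());
    }
  }

  return labels;
}
```

Because border points keep the label of whichever cluster reaches them first, changing the visiting order (as a parallel loop would) can change the output.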

rcurtin · 2024-08-29T21:09:45Z

src/mlpack/methods/dbscan/dbscan_impl.hpp

-      Log::Info << "DBSCAN clustering on point " << i << "..." << std::endl;
-
-    // Get the next index.
-    const size_t index = pointSelector.Select(i, data);


The use of pointSelector to allow custom next point selection strategies is also important and we shouldn't remove it.

rcurtin · 2024-08-29T21:11:23Z

src/mlpack/methods/dbscan/dbscan_impl.hpp

+      visited[index] = true;
+
+      // Do the range search for only this point.
+      rangeSearch.Search(data.col(index), RangeType<ElemType>(ElemType(0.0), epsilon),


Realistically this is the "painful" part of DBSCAN. We would likely get better results by parallelizing the range search itself, but that is a bit of a can of worms, since you would have to parallelize the tree traversal (likely with OpenMP tasks) and then ensure that RangeSearchRules is thread-safe. Not impossible for sure, but probably a good amount of work and tuning.

rcurtin · 2024-08-29T21:21:06Z

src/mlpack/methods/dbscan/dbscan_impl.hpp

    const MatType& data,
    UnionFind& uf)
 {
-  // For each point, find the points in epsilon-neighborhood and their distances.


No need to remove the comment.

rcurtin · 2024-08-29T21:21:24Z

src/mlpack/methods/dbscan/dbscan_impl.hpp

 } // namespace mlpack

-#endif
+#endif


No need to remove the newline from the end of the file.

MarkFischinger · 2024-08-31T20:20:33Z

@rcurtin I implemented the suggestion in the range search and the range search rules. Even with those changes, my benchmarks did not show a performance improvement beyond what the standard deviation could explain. The first algorithmic improvement was not consistent with how DBSCAN works, so I reverted it. Because of the nature of DBSCAN, finding a parallelization method or algorithmic improvement is not simple. I will come back to this as soon as I have a new idea.

rcurtin · 2024-09-01T01:30:36Z

Yeah, the way to do it (I did it in 2015 but for some reason never opened a PR because it needed tuning that I never had the time for) is to use OpenMP tasks in the dual-tree traversers, and tune it so a new task is only spawned every handful of levels. That was the only way I saw speedup for dual-tree algorithms. Parallel dual-tree traversers (or even single-tree traversers) are a good idea, but probably need a more in-depth dive that maybe we could do some other time. Realistically, DBSCAN speedup would then come just from speeding up the range search itself, as that (I think) is the majority of the runtime.
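
A minimal sketch of that throttling idea, assuming a toy single-tree traversal that counts 1-D points inside a range (much simpler than a dual-tree traverser, and not the 2015 code mentioned above). `Node`, `Build`, `CountInRange`, `ParallelCount`, and the `taskEvery` parameter are all hypothetical names for this example. Without OpenMP enabled the pragmas are ignored and the traversal runs serially.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical toy tree node; not an mlpack tree type.
struct Node
{
  double lo = 0.0;   // Smallest point under this node.
  double hi = 0.0;   // Largest point under this node.
  size_t count = 0;  // Number of points under this node.
  std::unique_ptr<Node> left, right;
};

// Build a balanced tree over sorted[begin, end); requires begin < end.
std::unique_ptr<Node> Build(const std::vector<double>& sorted,
                            const size_t begin,
                            const size_t end)
{
  std::unique_ptr<Node> node(new Node());
  node->lo = sorted[begin];
  node->hi = sorted[end - 1];
  node->count = end - begin;
  if (end - begin > 1)
  {
    const size_t mid = begin + (end - begin) / 2;
    node->left = Build(sorted, begin, mid);
    node->right = Build(sorted, mid, end);
  }
  return node;
}

// Count points in [a, b].  A new task is spawned only when depth is a
// multiple of taskEvery (which must be >= 1); other levels recurse inline.
size_t CountInRange(const Node* node, const double a, const double b,
                    const size_t depth, const size_t taskEvery)
{
  if (node->hi < a || node->lo > b)
    return 0;            // Prune: no overlap with the range.
  if (a <= node->lo && node->hi <= b)
    return node->count;  // The whole node lies inside the range.

  // A partially overlapping node has lo < hi, so it has two children.
  size_t leftCount = 0, rightCount = 0;
  if (depth % taskEvery == 0)
  {
    #pragma omp task shared(leftCount)
    leftCount = CountInRange(node->left.get(), a, b, depth + 1, taskEvery);
    rightCount = CountInRange(node->right.get(), a, b, depth + 1, taskEvery);
    #pragma omp taskwait
  }
  else
  {
    leftCount = CountInRange(node->left.get(), a, b, depth + 1, taskEvery);
    rightCount = CountInRange(node->right.get(), a, b, depth + 1, taskEvery);
  }
  return leftCount + rightCount;
}

// One thread starts the traversal; the other threads pick up spawned tasks.
size_t ParallelCount(const Node* root, const double a, const double b,
                     const size_t taskEvery)
{
  size_t result = 0;
  #pragma omp parallel
  #pragma omp single
  result = CountInRange(root, a, b, 0, taskEvery);
  return result;
}
```

A larger `taskEvery` means fewer, coarser tasks and less scheduling overhead; the best value would have to be found by benchmarking.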

If you think that we won't be able to get noticeable speedup here in this PR, then we can go ahead and close it, that's fine with me 👍 (or did I misunderstand your message and you still think there is something we could merge for improvement? either way works for me)

MarkFischinger added 4 commits July 22, 2024 22:05

optimization openmp

475b649

optimization

3ee3d2b

added comments back in

f605e23

add comments back in

53b0201

shrit reviewed Jul 23, 2024

geekypathak21 suggested changes Jul 23, 2024

fix with #ifdef MLPACK_USE_OPENMP

ca84021

rcurtin reviewed Jul 24, 2024

updates based on the comments

e494aa7

github-actions bot added the s: stale label Aug 26, 2024

shrit added s: keep open and removed s: stale labels Aug 26, 2024

Merge branch 'master' into opt/dbscan

197de85

shrit requested changes Aug 27, 2024

requested changes

73ba254

rcurtin reviewed Aug 29, 2024

MarkFischinger mentioned this pull request Aug 30, 2024

Optimize Naive K-means with OpenMP #3762

Merged

openmp in dbscan and range search

83d9116

MarkFischinger closed this Sep 1, 2024

4 participants