-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Optimization of DBSCAN with OpenMP #3771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
shrit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice addition, I left some comments
| #include <omp.h> | ||
| #include <vector> | ||
| #include <atomic> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are already included, search for them inside core.hpp or prereqs.hpp, if they are inside then please include these files,
I also think that you can do the include inside the dbscan.hpp, no need to include anything in impl.hpp
|
|
||
| // Get a count of all clusters. | ||
| const size_t numClusters = max(assignments) + 1; | ||
| const size_t numClusters = arma::max(assignments) + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason to add the arma, we removed this intentionally, so if there is no error could you remove it please?
geekypathak21
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @MarkFischinger Nice work 👍
| arma::Row<size_t> counts; | ||
| counts.zeros(numClusters); | ||
| for (size_t i = 0; i < data.n_cols; ++i) | ||
| arma::Row<size_t> counts(numClusters, arma::fill::zeros); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are getting rid of arma::fill::zeros can you try removing it no worries if you have any specific reason for adding this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually arma::fill::zeros is not needed, since the min version of armadillo will do zero initialization by default if we do not have anything specified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh cool we can remove this for sure then 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarkFischinger can you remove this? as described in the above comment ? thanks
| std::vector<MatType> localCentroids(numThreads, MatType(data.n_rows, numClusters, arma::fill::zeros)); | ||
| std::vector<arma::Row<size_t>> localCounts(numThreads, arma::Row<size_t>(numClusters, arma::fill::zeros)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
rcurtin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't reviewed the whole PR yet, but the initial idea looks good. There are some comments about MLPACK_USE_OPENMP and other things that I left in #3762; do you think you can apply the relevant comments here too?
| #endif | ||
| for (size_t i = 0; i < numClusters; ++i) | ||
| centroids.col(i) /= counts[i]; | ||
| if (counts[i] > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove the comment pointing out that we are guaranteed the number of points in a cluster is greater than 0? (Then this check is not needed.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarkFischinger could you apply the modification that Ryan requested? thanks?
| return numClusters; | ||
| } | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #pragma omp parallel for | ||
| #endif | ||
| for (size_t i = 0; i < data.n_cols; ++i) | ||
| assignments[i] = uf.Find(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think UnionFind::Find() is thread-safe, so I'm not sure that this can work correctly. Or did I overlook something?
|
@MarkFischinger could you resolve the conflict with the master branch ? thank you |
| arma::Row<size_t> counts; | ||
| counts.zeros(numClusters); | ||
| for (size_t i = 0; i < data.n_cols; ++i) | ||
| arma::Row<size_t> counts(numClusters, arma::fill::zeros); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarkFischinger can you remove this? as described in the above comment ? thanks
| { | ||
| if (assignments[i] != SIZE_MAX) | ||
| MatType localCentroids(data.n_rows, numClusters, arma::fill::zeros); | ||
| arma::Row<size_t> localCounts(numClusters, arma::fill::zeros); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing applies here
| #endif | ||
| for (size_t i = 0; i < numClusters; ++i) | ||
| centroids.col(i) /= counts[i]; | ||
| if (counts[i] > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarkFischinger could you apply the modification that Ryan requested? thanks?
| arma::Col<size_t> counts(numClusters); | ||
| for (size_t i = 0; i < assignments.n_elem; ++i) | ||
| counts[assignments[i]]++; | ||
| arma::Col<size_t> counts(numClusters, arma::fill::zeros); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarkFischinger same comment goes in here
rcurtin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarkFischinger thanks for working on this! I have some comments---the strategy as it currently is unfortunately won't work, but I think we can still simplify, avoid the incorrect optimization, and merge. 👍
|
|
||
| // We should be guaranteed that the number of clusters is always greater than | ||
| // zero. | ||
| // Normalize centroids |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you revert to or include the intent of the original comment? Someone wrote that there for a reason.
|
|
||
| for (size_t i = 0; i < data.n_cols; ++i) | ||
| { | ||
| if (i % 10000 == 0 && i > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that all of the optimizations below here are invalid and cause this algorithm to be something other than DBSCAN. I hate to say that because these give nice speedups (and are the core of the speedup of this PR) but inhere tly DBSCAN is a serial algorithm: select a point, find other points within range, mark them as part of the same cluster, repeat. That can't be parallelized (or I mean you can, but then it's not DBSCAN so we can't do it).
| Log::Info << "DBSCAN clustering on point " << i << "..." << std::endl; | ||
|
|
||
| // Get the next index. | ||
| const size_t index = pointSelector.Select(i, data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of pointSelector to allow custom next point selection strategies is also important and we shouldn't remove it.
| visited[index] = true; | ||
|
|
||
| // Do the range search for only this point. | ||
| rangeSearch.Search(data.col(index), RangeType<ElemType>(ElemType(0.0), epsilon), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realistically this is the "painful" part of DBSCAN. We would likely get better results by parallelizing the range search itself, but that is a bit of a can of worms, since you would have to parallelize the tree traversal (likely with OpenMP tasks) and then ensure that RangeSearchRules is thread-safe. Not impossible for sure, but probably a good amount of work and tuning.
| const MatType& data, | ||
| UnionFind& uf) | ||
| { | ||
| // For each point, find the points in epsilon-neighborhood and their distances. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to remove the comment.
| } // namespace mlpack | ||
|
|
||
| #endif | ||
| #endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to remove the newline from the end of the file.
|
@rcurtin i implemented the suggestion in the search range and search range rules. Even with those changes, performance wise my benchmarks did not show a significant improvement that could not be explained by the standard deviation. The first algorithmic improvement was not aligned with the functionality of dbscan therefore i reverted it. Because of the nature of dbscan, finding a parallelization method or algorithmic improvement is not simple. I will come back to this as soon as i have a new idea. |
|
Yeah, the way to do it (I did it in 2015 but for some reason never opened a PR because it needed tuning that I never had the time for) is to use OpenMP tasks in the dual-tree traversers, and tune it so a new task is only spawned every handful of levels. That was the only way I saw speedup for dual-tree algorithms. The idea of parallel dual-tree traversers (or even single-tree traversers) is a good idea, but probably needs a more in-depth dive that maybe we could do some other time. Realistically, DBSCAN speedup would then come just from speeding up the range search itself as that (I think) is the majority of the runtime. If you think that we won't be able to get noticeable speedup here in this PR, then we can go ahead and close it, that's fine with me 👍 (or did I misunderstand your message and you still think there is something we could merge for improvement? either way works for me) |
I've been working on optimizing our DBSCAN implementation, and I'm excited to share the results with you! This pull request introduces several performance enhancements that should make the clustering algorithm much faster, especially for larger datasets.
Changes made
I ran some benchmarks to compare the performance of the original implementation with the optimized version.
I've tested this on a variety of datasets, and it seems to be working well.
However, I'd really appreciate if you could take a look and let me know if there's anything I've missed or if you have any suggestions for further improvements.