Skip to content

Conversation

@MarkFischinger
Copy link
Contributor

I've been working on optimizing our DBSCAN implementation, and I'm excited to share the results with you! This pull request introduces several performance enhancements that should make the clustering algorithm much faster, especially for larger datasets.

Changes made

  1. Implemented parallel processing for centroids calculation
  2. Improved parallelization of cluster assignments
  3. Optimized batch clustering with better OpenMP usage
  4. Introduced atomic operations to reduce critical sections
  5. Enhanced memory locality for better cache utilization

I ran some benchmarks to compare the performance of the original implementation with the optimized version.

output

I've tested this on a variety of datasets, and it seems to be working well.
However, I'd really appreciate if you could take a look and let me know if there's anything I've missed or if you have any suggestions for further improvements.

Copy link
Member

@shrit shrit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice addition, I left some comments

Comment on lines 16 to 18
#include <omp.h>
#include <vector>
#include <atomic>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are already included, search for them inside core.hpp or prereqs.hpp, if they are inside then please include these files,

I also think that you can do the include inside the dbscan.hpp, no need to include anything in impl.hpp


// Get a count of all clusters.
const size_t numClusters = max(assignments) + 1;
const size_t numClusters = arma::max(assignments) + 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason to add the arma, we removed this intentionally, so if there is no error could you remove it please?

Copy link
Member

@geekypathak21 geekypathak21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @MarkFischinger Nice work 👍

arma::Row<size_t> counts;
counts.zeros(numClusters);
for (size_t i = 0; i < data.n_cols; ++i)
arma::Row<size_t> counts(numClusters, arma::fill::zeros);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are getting rid of arma::fill::zeros can you try removing it no worries if you have any specific reason for adding this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually arma::fill::zeros is not needed, since the min version of armadillo will do zero initialization by default if we do not have anything specified

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh cool we can remove this for sure then 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarkFischinger can you remove this? as described in the above comment ? thanks

Comment on lines 77 to 78
std::vector<MatType> localCentroids(numThreads, MatType(data.n_rows, numClusters, arma::fill::zeros));
std::vector<arma::Row<size_t>> localCounts(numThreads, arma::Row<size_t>(numClusters, arma::fill::zeros));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed the whole PR yet, but the initial idea looks good. There are some comments about MLPACK_USE_OPENMP and other things that I left in #3762; do you think you can apply the relevant comments here too?

#endif
for (size_t i = 0; i < numClusters; ++i)
centroids.col(i) /= counts[i];
if (counts[i] > 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove the comment pointing out that we are guaranteed the number of points in a cluster is greater than 0? (Then this check is not needed.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarkFischinger could you apply the modification that Ryan requested? thanks?

return numClusters;
}


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

#pragma omp parallel for
#endif
for (size_t i = 0; i < data.n_cols; ++i)
assignments[i] = uf.Find(i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think UnionFind::Find() is thread-safe, so I'm not sure that this can work correctly. Or did I overlook something?

@shrit
Copy link
Member

shrit commented Aug 26, 2024

@MarkFischinger could you resolve the conflict with the master branch ? thank you

arma::Row<size_t> counts;
counts.zeros(numClusters);
for (size_t i = 0; i < data.n_cols; ++i)
arma::Row<size_t> counts(numClusters, arma::fill::zeros);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarkFischinger can you remove this? as described in the above comment ? thanks

{
if (assignments[i] != SIZE_MAX)
MatType localCentroids(data.n_rows, numClusters, arma::fill::zeros);
arma::Row<size_t> localCounts(numClusters, arma::fill::zeros);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing applies here

#endif
for (size_t i = 0; i < numClusters; ++i)
centroids.col(i) /= counts[i];
if (counts[i] > 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarkFischinger could you apply the modification that Ryan requested? thanks?

arma::Col<size_t> counts(numClusters);
for (size_t i = 0; i < assignments.n_elem; ++i)
counts[assignments[i]]++;
arma::Col<size_t> counts(numClusters, arma::fill::zeros);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarkFischinger same comment goes in here

Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarkFischinger thanks for working on this! I have some comments---the strategy as it currently is unfortunately won't work, but I think we can still simplify, avoid the incorrect optimization, and merge. 👍


// We should be guaranteed that the number of clusters is always greater than
// zero.
// Normalize centroids
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you revert to or include the intent of the original comment? Someone wrote that there for a reason.


for (size_t i = 0; i < data.n_cols; ++i)
{
if (i % 10000 == 0 && i > 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that all of the optimizations below here are invalid and cause this algorithm to be something other than DBSCAN. I hate to say that because these give nice speedups (and are the core of the speedup of this PR) but inhere tly DBSCAN is a serial algorithm: select a point, find other points within range, mark them as part of the same cluster, repeat. That can't be parallelized (or I mean you can, but then it's not DBSCAN so we can't do it).

Log::Info << "DBSCAN clustering on point " << i << "..." << std::endl;

// Get the next index.
const size_t index = pointSelector.Select(i, data);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of pointSelector to allow custom next point selection strategies is also important and we shouldn't remove it.

visited[index] = true;

// Do the range search for only this point.
rangeSearch.Search(data.col(index), RangeType<ElemType>(ElemType(0.0), epsilon),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realistically this is the "painful" part of DBSCAN. We would likely get better results by parallelizing the range search itself, but that is a bit of a can of worms, since you would have to parallelize the tree traversal (likely with OpenMP tasks) and then ensure that RangeSearchRules is thread-safe. Not impossible for sure, but probably a good amount of work and tuning.

const MatType& data,
UnionFind& uf)
{
// For each point, find the points in epsilon-neighborhood and their distances.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to remove the comment.

} // namespace mlpack

#endif
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to remove the newline from the end of the file.

@MarkFischinger
Copy link
Contributor Author

@rcurtin i implemented the suggestion in the search range and search range rules. Even with those changes, performance wise my benchmarks did not show a significant improvement that could not be explained by the standard deviation. The first algorithmic improvement was not aligned with the functionality of dbscan therefore i reverted it. Because of the nature of dbscan, finding a parallelization method or algorithmic improvement is not simple. I will come back to this as soon as i have a new idea.

@rcurtin
Copy link
Member

rcurtin commented Sep 1, 2024

Yeah, the way to do it (I did it in 2015 but for some reason never opened a PR because it needed tuning that I never had the time for) is to use OpenMP tasks in the dual-tree traversers, and tune it so a new task is only spawned every handful of levels. That was the only way I saw speedup for dual-tree algorithms. The idea of parallel dual-tree traversers (or even single-tree traversers) is a good idea, but probably needs a more in-depth dive that maybe we could do some other time. Realistically, DBSCAN speedup would then come just from speeding up the range search itself as that (I think) is the majority of the runtime.

If you think that we won't be able to get noticeable speedup here in this PR, then we can go ahead and close it, that's fine with me 👍 (or did I misunderstand your message and you still think there is something we could merge for improvement? either way works for me)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants