Improve Hamerly K-Means with OpenMP Implementation #3761
Conversation
rcurtin
left a comment
Nice improvement! Just a couple small comments.
```cpp
double centroidMovement = 0.0;

#pragma omp parallel for reduction(+:distanceCalculations, centroidMovement) \
    reduction(max:furthestMovement, secondFurthestMovement)
```
What about furthestMovingCluster? That is a little trickier to handle...
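Since `furthestMovingCluster` is the *index* of the cluster with the maximum movement, a plain `reduction(max:...)` clause cannot carry it along. One way to handle it (a hedged sketch, not the mlpack source; `MaxMove` and `FurthestMovingCluster` are hypothetical names) is a user-declared reduction over a small value/index pair:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch (not the mlpack source): reduce both the maximum
// movement and the index of the cluster that moved furthest.
struct MaxMove
{
  double value = -1.0;
  std::size_t index = 0;
};

// Custom reduction: keep whichever pair holds the larger movement.
#pragma omp declare reduction(maxMove : MaxMove : \
    omp_out = (omp_in.value > omp_out.value ? omp_in : omp_out)) \
    initializer(omp_priv = MaxMove())

std::size_t FurthestMovingCluster(const std::vector<double>& movements)
{
  MaxMove best;
  #pragma omp parallel for reduction(maxMove : best)
  for (std::size_t i = 0; i < movements.size(); ++i)
  {
    if (movements[i] > best.value)
    {
      best.value = movements[i];
      best.index = i;
    }
  }
  return best.index;
}
```

Note that on a tie the winning index can depend on thread scheduling, so this is only deterministic when the maximum is unique.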
```cpp
// First bound test.
if (upperBounds(i) <= m)
{
  #pragma omp atomic
```
Does this need to be atomic if we already have the reduction for hamerlyPruned?
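For reference, a variable named in a `reduction(+: ...)` clause gets a private per-thread copy that is merged at the end of the loop, so an extra `#pragma omp atomic` on its increment is redundant. A minimal sketch (hypothetical `CountHamerlyPruned` helper, not the mlpack source):

```cpp
#include <cstddef>
#include <vector>

// Hedged sketch: the reduction clause gives each thread its own private
// copy of hamerlyPruned, so no atomic is needed on the increment.
std::size_t CountHamerlyPruned(const std::vector<double>& upperBounds,
                               const double m)
{
  std::size_t hamerlyPruned = 0;
  #pragma omp parallel for reduction(+ : hamerlyPruned)
  for (std::size_t i = 0; i < upperBounds.size(); ++i)
  {
    // First bound test: points whose upper bound is at most m are pruned.
    if (upperBounds[i] <= m)
      ++hamerlyPruned;  // no atomic needed; the reduction merges the copies
  }
  return hamerlyPruned;
}
```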
I fixed the error and made sure to use the benchmark:

```
mark@mark:~/gsoc/mlpack-benchmarks/src/kmeans$ ./hamerly_kmeans
Running benchmark...
Benchmark completed.
Total time: 4.95009 seconds
```
rcurtin
left a comment
I'm not yet 100% sure everything here is correct (I am out of time for now, will come back later), but I think the general idea works and the speedup is nice. I have some structure and cleanliness comments; let me know if I can clarify them. 👍
```cpp
{
  // Nothing to do.
}
```
No need to remove the empty line.
```cpp
{
  const double halfDist = dist / 2.0;
  localMinClusterDistances(i) = std::min(localMinClusterDistances(i), halfDist);
  localMinClusterDistances(j) = std::min(localMinClusterDistances(j), halfDist);
```
Can you explain the reasoning behind these changes? I don't understand the purpose or how this improves the implementation. Unless I have missed something, this makes the algorithm significantly incorrect.
```cpp
}
else if (dist < lowerBounds(i))

// The bounds failed. So test against all other clusters.
```
The original comment said significantly more than this for a reason. Please don't remove comments unless there is a very good reason.
```diff
-// Normalize centroids and calculate cluster movement (contains parts of
-// Move-Centers() and Update-Bounds()).
+// Normalize centroids and calculate cluster movement
```
There is no reason to change this comment.
```diff
 // Calculate movement.
-const double movement = distance.Evaluate(centroids.col(c),
-    newCentroids.col(c));
+const double movement = std::sqrt(arma::sum(arma::square(centroids.col(c) - newCentroids.col(c))));
```
This is incorrect for any distance metric that is not the Euclidean distance. Please revert to the original implementation (even if it is slower due to the issue with arma::norm() that was discovered in the naive k-means PR).
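To illustrate the problem: the distance type is a template parameter, so the movement has to go through `distance.Evaluate()`; hardcoding `sqrt(sum(square(...)))` silently computes the wrong value for every non-Euclidean metric. A standalone sketch with hypothetical stand-in metric classes (plain `std::vector` instead of Armadillo; not mlpack's actual policy classes):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in metric classes with an Evaluate() shape similar
// to mlpack's distance policies.
struct EuclideanDistance
{
  static double Evaluate(const std::vector<double>& a,
                         const std::vector<double>& b)
  {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
      sum += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(sum);
  }
};

struct ManhattanDistance
{
  static double Evaluate(const std::vector<double>& a,
                         const std::vector<double>& b)
  {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
      sum += std::fabs(a[i] - b[i]);
    return sum;
  }
};

// Correct for any metric; a hardcoded sqrt-of-sum-of-squares is only
// equivalent when DistanceType happens to be EuclideanDistance.
template<typename DistanceType>
double CentroidMovement(const std::vector<double>& oldCentroid,
                        const std::vector<double>& newCentroid)
{
  return DistanceType::Evaluate(oldCentroid, newCentroid);
}
```

For the point (0, 0) moving to (3, 4), the Euclidean movement is 5 but the Manhattan movement is 7, so the hardcoded formula would break Hamerly's bounds under any non-Euclidean metric.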
```diff
-// Now update bounds (lines 3-8 of Update-Bounds()).
+// Now update bounds
 for (size_t i = 0; i < dataset.n_cols; ++i)
```
Revert the comment please. It is extremely important for maintenance to be able to connect the code to the parts of the original paper.
```cpp
#pragma omp critical
{
  Log::Warn << "Invalid assignment for point " << i << std::endl;
}
```
But can this ever happen? I don't think that it can (at least it couldn't in the original implementation).
```cpp
  return std::sqrt(centroidMovement);
}
```
No need to remove the empty line separating the end of the function from the closing of the namespace.
```cpp
} // namespace mlpack

#endif
```
No need to remove the newline from the end of the file.
@rcurtin I simplified the code. Before the changes these were the execution time results: … After: … (tested with this benchmark script). I also created this graph: …
rcurtin
left a comment
Awesome, thanks for the cleanups @MarkFischinger. I think this one is basically ready to go but I had a thought about custom reductions that I want to see what you think about. If you can address or respond to the handful of comments I've left, I think we can get this merged shortly. 👍
```diff
-arma::mat threadNewCentroids(centroids.n_rows, centroids.n_cols,
-    arma::fill::zeros);
+arma::mat threadNewCentroids(centroids.n_rows, centroids.n_cols);
```
(Armadillo defaults to fill::zeros, so we don't need to include it here.)
```diff
-arma::Col<size_t> threadCounts(centroids.n_cols, arma::fill::zeros);
+arma::Col<size_t> threadCounts(centroids.n_cols);
```
```cpp
}

for (size_t i = 0; i < dataset.n_cols; ++i)
#pragma omp parallel
```
I was thinking to myself, "wouldn't it be nice if we could just use #pragma omp parallel for reduction(+: centroids) or something like this?" That would be much nicer code-wise than making the threadNewCentroids object and threadCounts objects and then updating them. Then I read about OpenMP custom reductions:
Do you want to give that a try and see if it works? Don't feel obligated, we could also do it in another PR or another time. But it could be nice for simplifying not just this code but other code that uses similar patterns for the other k-means strategies. I guess that we could define Armadillo-specific custom reductions in some core header file, and then just use it where relevant.
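The idea can be sketched outside of mlpack like this, using `std::vector<double>` as a stand-in for `arma::mat` (the real code would declare the reduction over `arma::mat` with `omp_out += omp_in`); `AddInto` and `SumPoints` are hypothetical names:

```cpp
#include <cstddef>
#include <vector>

// Element-wise accumulation; used both as the loop body and as the
// OpenMP reduction combiner.
static void AddInto(std::vector<double>& out, const std::vector<double>& in)
{
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] += in[i];
}

// User-declared reduction: each thread gets a zero-filled private copy
// (sized like the original), and the copies are merged with AddInto().
#pragma omp declare reduction(vecAdd : std::vector<double> : \
    AddInto(omp_out, omp_in)) \
    initializer(omp_priv = std::vector<double>(omp_orig.size(), 0.0))

std::vector<double> SumPoints(const std::vector<std::vector<double>>& points,
                              const std::size_t dim)
{
  std::vector<double> sum(dim, 0.0);
  #pragma omp parallel for reduction(vecAdd : sum)
  for (std::size_t i = 0; i < points.size(); ++i)
    AddInto(sum, points[i]);
  return sum;
}
```

The reduction clause replaces the manual thread-local accumulator plus merge step, which is exactly the simplification being discussed for `threadNewCentroids` and `threadCounts`.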
```cpp
    arma::fill::zeros);
arma::Col<size_t> threadCounts(centroids.n_cols, arma::fill::zeros);
size_t threadHamerlyPruned = 0;
size_t threadDistanceCalculations = 0;
```
I think for these two variables you could directly use an OpenMP reduction? Let me know if I overlooked something and there's a better reason to do it this way.
```cpp
const double dist = distance.Evaluate(dataset.col(i), centroids.col(c));

// Is this a better cluster? At this point, upperBounds[i] = d(i, c(i))
```
```diff
-// Is this a better cluster? At this point, upperBounds[i] = d(i, c(i))
+// Is this a better cluster? At this point, upperBounds[i] = d(i, c(i)).
```
The original period seems to have gotten lost 😄
I removed it on purpose because it is above the 80-character limit; due to the indent, the period is the 81st character. I thought removing it would be better than breaking the line and causing a larger diff. Let me know what is best here :)
```diff
-#pragma omp parallel for reduction(+:distanceCalculations,centroidMovement) \
-    schedule(static)
+#pragma omp parallel for reduction(+: distanceCalculations, centroidMovement) \
+    schedule(static)
```
Just some small pedantic style fixes. 👍
@rcurtin Reading the Stack Overflow comment, I implemented the idea of custom reductions. The code is now cleaner and has similar or slightly better performance:
Awesome that the custom reduction worked! That really simplifies the changes. Do you think you would be up for opening a PR to make that change to use a custom reduction in the naive strategy too? It should be a quick simplification (and I guess provide a slight amount of speedup too, given your results here).
rcurtin
left a comment
Nice! I think this is just about ready. Do you want to add a note in HISTORY.md about the OpenMP k-means improvements? You could reference this PR and the naive k-means PR, and then we could update it as we get the other PRs in. 👍
I'm excited about getting this merged, the speedups for this one are really nice. Once you handle the couple of comments from my side everything seems good to merge. 🚀
```cpp
// Custom reduction for arma::mat
#pragma omp declare reduction(matAdd : arma::mat : omp_out += omp_in) \
    initializer(omp_priv = arma::mat(omp_orig.n_rows, omp_orig.n_cols).zeros())
```
I think the .zeros() is unnecessary here since Armadillo will automatically initialize the matrix to zeros in recent versions. (That should provide a little additional speedup too.)
Also, some other little thoughts. This is really nice, so it would be great if we could make it available to other techniques. Can you move it to a new header file that we can put in, say, mlpack/core/util/ and include in mlpack/base.hpp? I guess you could call it omp_reductions.hpp or something like that. I'm not totally tied to that so if you think there is a better place that works too. But basically the idea is just that these reductions can be available for any mlpack technique.
```cpp
  ++counts(assignments[i]);
}

distanceCalculations += threadDistanceCalculations;
```
If we're reducing threadDistanceCalculations, then couldn't we just use distanceCalculations directly?
rcurtin
left a comment
Thanks for handling the issues! It looks good to me now. There are a couple other tiny comments I have, all very minor. I can handle them before merge if you don't get to it, it's no big deal either way.
Co-authored-by: Ryan Curtin <[email protected]>
Second approval provided automatically after 24 hours. 👍

I've been working on optimizing the Hamerly K-means implementation, and I'm happy to share the results with you.
Benchmark Results
I ran a benchmark with the following settings:
Here's what I found:
That's a 42.6% decrease in execution time!
Looking forward to your feedback! Let me know if you need any clarification or have any questions.