Conversation

@MarkFischinger
Contributor

I've been working on optimizing the Hamerly K-means implementation, and I'm happy to share the results with you.

Benchmark Results
I ran a benchmark with the following settings:

Number of data points: 100,000
Number of dimensions: 50
Number of clusters: 10
Maximum iterations: 100

Here's what I found:

Original version: 12.1609 seconds
OpenMP version: 6.98171 seconds

That's a 42.6% decrease in execution time!
Looking forward to your feedback! Let me know if you need any clarification or have any questions.

Member

@rcurtin rcurtin left a comment

Nice improvement! Just a couple small comments.

double centroidMovement = 0.0;

#pragma omp parallel for reduction(+:distanceCalculations, centroidMovement) \
reduction(max:furthestMovement, secondFurthestMovement)
Member

What about furthestMovingCluster? That is a little trickier to handle...
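For context on why this is trickier: OpenMP's built-in `max` reduction tracks only the maximum value, not which iteration produced it, so an argmax-style variable like `furthestMovingCluster` cannot go in a plain `reduction` clause. A common pattern is to keep per-thread candidates and merge them in a critical section; a minimal sketch (the `ArgMaxMovement` helper is hypothetical, not the PR's actual code):

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the largest movement. A plain reduction(max:...)
// would lose the index, so each thread tracks its own (value, index) pair
// and the pairs are merged atomically-as-a-unit in a critical section.
std::size_t ArgMaxMovement(const std::vector<double>& movements)
{
  double furthestMovement = -1.0;
  std::size_t furthestMovingCluster = 0;

  #pragma omp parallel
  {
    double localMax = -1.0;
    std::size_t localArg = 0;

    #pragma omp for nowait
    for (std::size_t c = 0; c < movements.size(); ++c)
    {
      if (movements[c] > localMax)
      {
        localMax = movements[c];
        localArg = c;
      }
    }

    // Merge each thread's candidate; the critical section keeps the value
    // and its index consistent with each other.
    #pragma omp critical
    {
      if (localMax > furthestMovement)
      {
        furthestMovement = localMax;
        furthestMovingCluster = localArg;
      }
    }
  }

  return furthestMovingCluster;
}
```

(OpenMP 5.x also offers user-defined reductions over a value/index pair, which would avoid the critical section entirely.)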

// First bound test.
if (upperBounds(i) <= m)
{
#pragma omp atomic
Member

Does this need to be atomic if we already have the reduction for hamerlyPruned?
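For reference, a variable named in a `reduction` clause gets a private per-thread copy, so increments to it never race and no `atomic` is needed; OpenMP combines the copies after the loop. A minimal sketch of the pattern (the `CountPruned` helper is hypothetical, not the PR's code):

```cpp
#include <cstddef>

// Count points whose upper bound passes the first Hamerly-style test.
// Each thread increments its own private copy of `pruned`; OpenMP sums
// the copies when the parallel region ends.
std::size_t CountPruned(const double* upperBounds,
                        const std::size_t n,
                        const double m)
{
  std::size_t pruned = 0;
  #pragma omp parallel for reduction(+ : pruned)
  for (std::size_t i = 0; i < n; ++i)
  {
    if (upperBounds[i] <= m)
      ++pruned;  // Private per-thread copy; no #pragma omp atomic required.
  }
  return pruned;
}
```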

@MarkFischinger
Contributor Author

I fixed the error and made sure to use #ifdef MLPACK_USE_OPENMP in the correct places. Here are the new benchmarks:

mark@mark:~/gsoc/mlpack-benchmarks/src/kmeans$ ./hamerly_kmeans
Running benchmark...
Benchmark completed.
Total time: 4.95009 seconds

Member

@rcurtin rcurtin left a comment

I'm not yet 100% sure everything here is correct (I am out of time for now, will come back later), but I think the general idea works and the speedup is nice. I have some structure and cleanliness comments; let me know if I can clarify them. 👍

{
// Nothing to do.
}

Member

No need to remove the empty line.

{
const double halfDist = dist / 2.0;
localMinClusterDistances(i) = std::min(localMinClusterDistances(i), halfDist);
localMinClusterDistances(j) = std::min(localMinClusterDistances(j), halfDist);
Member

Can you explain the reasoning behind these changes? I don't understand the purpose or how this improves the implementation. Unless I have missed something, this makes the algorithm significantly incorrect.

}
else if (dist < lowerBounds(i))

// The bounds failed. So test against all other clusters.
Member

The original comment said significantly more than this for a reason. Please don't remove comments unless there is a very good reason.


// Normalize centroids and calculate cluster movement (contains parts of
// Move-Centers() and Update-Bounds()).
// Normalize centroids and calculate cluster movement
Member

There is no reason to change this comment.

// Calculate movement.
const double movement = distance.Evaluate(centroids.col(c),
newCentroids.col(c));
const double movement = std::sqrt(arma::sum(arma::square(centroids.col(c) - newCentroids.col(c))));
Member

This is incorrect for any distance metric that is not the Euclidean distance. Please revert to the original implementation (even if it is slower due to the issue with arma::norm() that was discovered in the naive k-means PR).


// Now update bounds (lines 3-8 of Update-Bounds()).
for (size_t i = 0; i < dataset.n_cols; ++i)
// Now update bounds
Member

Revert the comment please. It is extremely important for maintenance to be able to connect the code to the parts of the original paper.

#pragma omp critical
{
Log::Warn << "Invalid assignment for point " << i << std::endl;
}
Member

But can this ever happen? I don't think that it can (at least it couldn't in the original implementation).


return std::sqrt(centroidMovement);
}

Member

No need to remove the empty line separating the end of the function from the closing of the namespace.

} // namespace mlpack

#endif
#endif
Member

No need to remove the newline from the end of the file.

@MarkFischinger
Contributor Author

@rcurtin I simplified the minClusterDistance update logic, made the OpenMP usage more efficient (for example, by moving the centroid normalization and movement calculation into a parallel loop), and fixed the calculation issue you pointed out.

Before the changes these were the execution time results:

DatasetSize,Time
1000,0.00677065
10000,0.148302
100000,1.62585
1000000,17.6833

After:

DatasetSize,Time
1000,0.004081
10000,0.0428959
100000,1.04919
1000000,3.32927

Tested with this benchmark script.

I also created this graph:

(graph: benchmark_comparison)

Member

@rcurtin rcurtin left a comment

Awesome, thanks for the cleanups @MarkFischinger. I think this one is basically ready to go but I had a thought about custom reductions that I want to see what you think about. If you can address or respond to the handful of comments I've left, I think we can get this merged shortly. 👍

Comment on lines 70 to 71
arma::mat threadNewCentroids(centroids.n_rows, centroids.n_cols,
arma::fill::zeros);
Member

Suggested change
arma::mat threadNewCentroids(centroids.n_rows, centroids.n_cols,
arma::fill::zeros);
arma::mat threadNewCentroids(centroids.n_rows, centroids.n_cols);

(Armadillo defaults to fill::zeros, so we don't need to include it here.)

if (upperBounds(i) <= m)
arma::mat threadNewCentroids(centroids.n_rows, centroids.n_cols,
arma::fill::zeros);
arma::Col<size_t> threadCounts(centroids.n_cols, arma::fill::zeros);
Member

Suggested change
arma::Col<size_t> threadCounts(centroids.n_cols, arma::fill::zeros);
arma::Col<size_t> threadCounts(centroids.n_cols);

}

for (size_t i = 0; i < dataset.n_cols; ++i)
#pragma omp parallel
Member

I was thinking to myself, "wouldn't it be nice if we could just use #pragma omp parallel for reduction(+: centroids) or something like this?" That would be much nicer code-wise than making the threadNewCentroids object and threadCounts objects and then updating them. Then I read about OpenMP custom reductions:

https://stackoverflow.com/questions/59034114/custom-omp-reduction-on-stdmaps

Do you want to give that a try and see if it works? Don't feel obligated, we could also do it in another PR or another time. But it could be nice for simplifying not just this code but other code that uses similar patterns for the other k-means strategies. I guess that we could define Armadillo-specific custom reductions in some core header file, and then just use it where relevant.
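The declare-reduction idea from that answer looks roughly like this. A minimal sketch using `std::vector<double>` as a stand-in for `arma::mat` (the `vecAdd`, `VecAdd`, and `SumColumns` names are hypothetical, not mlpack API):

```cpp
#include <cstddef>
#include <vector>

// Combiner used by the custom reduction: element-wise accumulation.
static void VecAdd(std::vector<double>& out, const std::vector<double>& in)
{
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] += in[i];
}

// omp_out/omp_in are the two partial results being merged; omp_priv is each
// thread's private copy, sized to match the original variable (omp_orig).
#pragma omp declare reduction(vecAdd : std::vector<double> : \
    VecAdd(omp_out, omp_in)) \
    initializer(omp_priv = std::vector<double>(omp_orig.size(), 0.0))

// Sum `dim`-dimensional rows in parallel; the custom reduction replaces the
// hand-rolled threadNewCentroids/threadCounts pattern.
std::vector<double> SumColumns(const std::vector<std::vector<double>>& rows,
                               const std::size_t dim)
{
  std::vector<double> total(dim, 0.0);
  #pragma omp parallel for reduction(vecAdd : total)
  for (std::size_t r = 0; r < rows.size(); ++r)
    for (std::size_t d = 0; d < dim; ++d)
      total[d] += rows[r][d];
  return total;
}
```

The same `declare reduction` shape should carry over to `arma::mat` (combine with `omp_out += omp_in`, initialize `omp_priv` from `omp_orig.n_rows`/`n_cols`), as the later commits in this PR do.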

arma::fill::zeros);
arma::Col<size_t> threadCounts(centroids.n_cols, arma::fill::zeros);
size_t threadHamerlyPruned = 0;
size_t threadDistanceCalculations = 0;
Member

I think for these two variables you could directly use an OpenMP reduction? Let me know if I overlooked something and there's a better reason to do it this way.


const double dist = distance.Evaluate(dataset.col(i), centroids.col(c));

// Is this a better cluster? At this point, upperBounds[i] = d(i, c(i))
Member

Suggested change
// Is this a better cluster? At this point, upperBounds[i] = d(i, c(i))
// Is this a better cluster? At this point, upperBounds[i] = d(i, c(i)).

The original period seems to have gotten lost 😄

Contributor Author

I removed it on purpose because it was over the 80-character limit: due to the indent, the period was the 81st character. I thought removing it would be better than breaking the line and causing a larger diff. Let me know what is best here :)

Comment on lines 152 to 153
#pragma omp parallel for reduction(+:distanceCalculations,centroidMovement) \
schedule(static)
Member

Suggested change
#pragma omp parallel for reduction(+:distanceCalculations,centroidMovement) \
schedule(static)
#pragma omp parallel for reduction(+: distanceCalculations, centroidMovement) \
schedule(static)

Just some small pedantic style fixes. 👍

@MarkFischinger
Contributor Author

@rcurtin After reading the Stack Overflow answer, I implemented the custom reduction idea. The code is now cleaner and has similar or slightly better performance:

Running benchmark for 1000 points...
Dataset size: 1000, Time: 0.00392151 seconds
Running benchmark for 10000 points...
Dataset size: 10000, Time: 0.0348763 seconds
Running benchmark for 100000 points...
Dataset size: 100000, Time: 0.352331 seconds
Running benchmark for 1000000 points...
Dataset size: 1000000, Time: 3.49524 seconds
Benchmark completed. Results saved to hamerly_kmeans_benchmark.csv

@rcurtin
Member

rcurtin commented Sep 1, 2024

Awesome that the custom reduction worked! That really simplifies the changes. Do you think you would be up for opening a PR to make that change to use a custom reduction in the naive strategy too? It should be a quick simplification (and I guess provide a slight amount of speedup too, given your results here).

Member

@rcurtin rcurtin left a comment

Nice! I think this is just about ready. Do you want to add a note in HISTORY.md about the OpenMP k-means improvements? You could reference this PR and the naive k-means PR, and then we could update it as we get the other PRs in. 👍

I'm excited about getting this merged, the speedups for this one are really nice. Once you handle the couple of comments from my side everything seems good to merge. 🚀


// Custom reduction for arma::mat
#pragma omp declare reduction(matAdd : arma::mat : omp_out += omp_in) \
initializer(omp_priv = arma::mat(omp_orig.n_rows, omp_orig.n_cols).zeros())
Member

I think the .zeros() is unnecessary here since Armadillo will automatically initialize the matrix to zeros in recent versions. (That should provide a little additional speedup too.)

Member

Also, some other little thoughts. This is really nice, so it would be great if we could make it available to other techniques. Can you move it to a new header file that we can put in, say, mlpack/core/util/ and include in mlpack/base.hpp? I guess you could call it omp_reductions.hpp or something like that. I'm not totally tied to that so if you think there is a better place that works too. But basically the idea is just that these reductions can be available for any mlpack technique.

++counts(assignments[i]);
}

distanceCalculations += threadDistanceCalculations;
Member

If we're reducing threadDistanceCalculations, then couldn't we just use distanceCalculations directly?

Member

@rcurtin rcurtin left a comment

Thanks for handling the issues! It looks good to me now. There are a couple other tiny comments I have, all very minor. I can handle them before merge if you don't get to it, it's no big deal either way.

Contributor

@github-actions github-actions bot left a comment

Second approval provided automatically after 24 hours. 👍

@shrit shrit merged commit 620a770 into mlpack:master Sep 3, 2024