Optimize Naive K-means with OpenMP #3762
Conversation
The speedup is definitely off. See https://doi.org/10.1109/ICSPCS.2017.8270510 for an alternative implementation (which is part of Armadillo: https://arma.sourceforge.net/docs.html#kmeans). arXiv version: https://arxiv.org/abs/1707.09094
Have you checked whether the distance computation itself is parallelized? If it's not, simply parallelizing that instead might give significantly more speedup. (Note that mlpack doesn't require OpenMP to be available or enabled, so going beyond ...)
@conradsnicta @rcurtin I experimented with various OpenMP approaches. The most effective method is dividing the calculations into blocks.

~/gsoc/mlpack-benchmarks/src/kmeans$ ./naive_kmeans
Running benchmark for naive KMeans...
Benchmark for naive KMeans completed.
Total time: 5.10537 seconds

The problem is that blockSize needs to be adjusted for the specific system (that's why the memory tests currently fail). I could add it as another parameter, but I'm not sure if that's valid/best practice. Is there maybe a solution where I could set that size dynamically?
@MarkFischinger I don't think hardcoding the block size is the way to go. It would be better to divide the available data into segments based on the number of available threads, which is determined at run-time. The nominal segment size can be simply the number of vectors divided by the number of threads (integer division). As an example, say we have 123 vectors and 4 threads, resulting in a nominal segment size of 30. The first 3 segments would each have a size of 30, thereby taking up 90 vectors in total. The last segment size would then be 123 - 90 = 33. A bunch of corner cases would also need to be taken care of, such as ensuring a minimum number of vectors per thread, and dealing with the possibility of having more available threads than vectors.
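A minimal sketch of that segmenting scheme (illustrative code only, not from the PR; the function name is a placeholder):

#include <omp.h>
#include <algorithm>
#include <cstddef>

// Divide `points` vectors across the available threads at run-time; each
// thread takes `nominalSegmentSize` vectors, and the last thread also takes
// the remainder (123 - 90 = 33 in the example above).
void forEachSegment(const size_t points)
{
  const size_t threads = std::min<size_t>((size_t) omp_get_max_threads(),
                                          std::max<size_t>(points, 1));
  const size_t nominalSegmentSize = points / threads; // e.g. 123 / 4 = 30

  #pragma omp parallel num_threads(threads)
  {
    const size_t t = (size_t) omp_get_thread_num();
    const size_t begin = t * nominalSegmentSize;
    const size_t end = (t == threads - 1) ? points
                                          : (t + 1) * nominalSegmentSize;
    for (size_t i = begin; i < end; ++i)
    {
      // ... process vector i ...
    }
  }
}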
I fixed the error and implemented a new ...

mark@mark:~/gsoc/mlpack-benchmarks/src/kmeans$ ./naive_kmeans
Running benchmark for naive KMeans...
Benchmark for naive KMeans completed.
Total time: 7.37996 seconds
rcurtin left a comment
Hey there @MarkFischinger, the numbers are looking better. I took a look through the code and I think that there are several places where we could get additional speedup while changing less of the original underlying code. Let me know what you think or if I can clarify any of my comments. 👍
#ifdef MLPACK_USE_OPENMP
#include <omp.h>
#endif
This is already done in core.hpp, so shouldn't be needed here (so long as naive_kmeans.hpp includes core.hpp).
// Pre-compute squared norms of centroids
arma::vec centroidNorms(clusters);
#ifdef MLPACK_USE_OPENMP
#pragma omp parallel for schedule(static)
No need to wrap just a #pragma in the #ifdef---the compiler will ignore it if OpenMP is not enabled. Only where the actual code involves OpenMP do we need to use those (like if you are calling omp_get_num_threads() or something).
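For illustration, the pragma from the hunk above can stand on its own (a fragment reusing the names from the diff; not the exact PR code):

// Ignored by the compiler when OpenMP is not enabled; no #ifdef needed:
#pragma omp parallel for schedule(static)
for (size_t j = 0; j < clusters; ++j)
  centroidNorms(j) = arma::dot(centroids.col(j), centroids.col(j));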
closestCluster = j;
}
const arma::vec& centroid = centroids.col(j);
distances(j) = std::max(0.0, dataNorm + centroidNorms(j) - 2 * arma::dot(dataPoint, centroid));
It is true that a distance can be expressed this way, but in general it is not more efficient to compute distances like this. Note that computing the data norm takes one pass over the data column, but then when you take the dot product of the data point and the centroid, it's a second pass over the data column. Plus, given that this computes the norm of every data point at every iteration of k-means, I'd strongly expect that using distance.Evaluate() would be a lot faster. What was the reason you came to this solution?
// Determine the number of threads and calculate segment size
size_t effectiveThreads = 1;
#ifdef MLPACK_USE_OPENMP
const size_t numThreads = static_cast<size_t>(std::max(1, omp_get_max_threads()));
Is it even allowed that omp_get_max_threads() could return 0? I read the documentation and it doesn't explicitly disallow it, but it says it returns an 'upper bound'... which 0 could never be. Personally I think it would be safe to use omp_get_max_threads() directly.
size_t effectiveThreads = 1;
#ifdef MLPACK_USE_OPENMP
const size_t numThreads = static_cast<size_t>(std::max(1, omp_get_max_threads()));
const size_t minVectorsPerThread = 100;
Honestly I would just take numThreads = 1 when OpenMP isn't available and numThreads = omp_get_max_threads() otherwise. If the user has so few points that each OpenMP thread can't get enough work, then k-means is going to run so fast anyway that the difference won't really matter.
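In other words, something along these lines (a sketch of the suggestion, not code from the PR):

#ifdef MLPACK_USE_OPENMP
const size_t numThreads = (size_t) omp_get_max_threads();
#else
const size_t numThreads = 1;
#endif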
threadId = omp_get_thread_num();
#endif
const size_t segmentStart = threadId * nominalSegmentSize;
const size_t segmentEnd = (threadId == effectiveThreads - 1) ? points : (threadId + 1) * nominalSegmentSize;
Couldn't you just use an OpenMP for loop over the points with a static schedule, just making sure to update the correct element in threadCentroids and threadCounts? I think that could simplify this code a good bit (or at least make it much closer to the original code)
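A sketch of that shape, reusing the PR's threadCentroids/threadCounts per-thread accumulators (illustrative, not the exact PR code):

#pragma omp parallel
{
  size_t threadId = 0;
  #ifdef MLPACK_USE_OPENMP
  threadId = (size_t) omp_get_thread_num();
  #endif
  arma::mat& localCentroids = threadCentroids[threadId];
  arma::Col<size_t>& localCounts = threadCounts[threadId];

  #pragma omp for schedule(static)
  for (size_t i = 0; i < (size_t) dataset.n_cols; ++i)
  {
    // Find the closest centroid to this point (running minimum, as in the
    // original code).
    double minDistance = std::numeric_limits<double>::infinity();
    size_t closestCluster = centroids.n_cols; // Invalid value.
    for (size_t j = 0; j < centroids.n_cols; ++j)
    {
      const double dist = distance.Evaluate(dataset.col(i), centroids.col(j));
      if (dist < minDistance)
      {
        minDistance = dist;
        closestCluster = j;
      }
    }
    localCentroids.col(closestCluster) += dataset.col(i);
    localCounts(closestCluster)++;
  }
}
// (After the parallel region, the per-thread accumulators still need to be
// reduced into newCentroids/counts, as the PR already does.)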
arma::mat& localCentroids = threadCentroids[threadId];
arma::Col<size_t>& localCounts = threadCounts[threadId];

arma::vec distances(clusters);
I think it is a slowdown to cache all the distances like this, and then later iterate over them all with .index_min() to find the minimum cluster. It should be quicker to use the previous if (dist < minDistance) approach.
{
cNorm += std::pow(distance.Evaluate(centroids.col(i), newCentroids.col(i)),
2.0);
cNorm += arma::norm(centroids.col(j) - newCentroids.col(j), 2);
This actually makes this computation incorrect if distance is not EuclideanDistance.
return std::sqrt(cNorm);
distanceCalculations += clusters * points;

return cNorm;
I don't follow why you have made these changes here?
// Normalize the centroids
for (size_t j = 0; j < clusters; ++j)
{
if (counts(j) > eps)
But counts(j) is a size_t and so you would want to compare it with 0. (The previous version of this loop was just fine, as far as I can tell.)
// Find the closest centroid to this point.
double minDistance = std::numeric_limits<double>::infinity();
size_t closestCluster = centroids.n_cols; // Invalid value.
const auto dataPoint = dataset.col(i);
@MarkFischinger Do not use auto with Armadillo expressions. As part of its meta-programming framework, Armadillo uses short-lived temporaries that are not properly handled by auto.
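For example (illustrative fragment):

// Risky: `auto` deduces an Armadillo expression/subview type here, not a
// concrete vector; with compound expressions this can leave a reference to a
// short-lived temporary.
const auto dataPoint = dataset.col(i);

// Safer: force evaluation into a concrete type (or simply use dataset.col(i)
// directly where it is needed).
const arma::vec dataPointVec = dataset.col(i);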
{
const arma::vec& centroid = centroids.col(j);
distances(j) = std::max(0.0, dataNorm + centroidNorms(j) - 2 * arma::dot(dataPoint, centroid));
// Optimized Euclidean distance calculation
@MarkFischinger I'm not convinced this is optimized; can you clarify what led you to the solution of expressing the Euclidean distance d(x, y) as <x, x> + <y, y> - 2 <x, y>? There are a number of disadvantages to this approach including extra memory usage, multiple passes over data points, and so forth.
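For concreteness, the two forms being compared are roughly the following (illustrative fragment; dataNorm and centroidNorms(j) are the precomputed squared norms from the diff):

// Direct form: one pass over the column, equivalent to what
// EuclideanDistance::Evaluate() computes.
const double d1 = arma::norm(dataPoint - centroids.col(j), 2);

// Expanded form from the PR: needs the cached squared norms, a second pass
// for the dot product, and a max() guard against small negative values from
// rounding before the square root.
const double d2 = std::sqrt(std::max(0.0,
    dataNorm + centroidNorms(j) - 2.0 * arma::dot(dataPoint, centroids.col(j))));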
rcurtin left a comment
Thanks for the updates @MarkFischinger. Can you be sure to address all the comments I leave please? If you can also provide new benchmarking numbers when you update the code it would be really helpful. 👍
arma::mat& localCentroids = threadCentroids[threadId];
arma::Col<size_t>& localCounts = threadCounts[threadId];

for (size_t i = segmentStart; i < segmentEnd; ++i)
Does a simple omp parallel for not work here? I don't think you need the segmentStart and segmentEnd variables---I think all you need to do is get the thread id and then add to the local centroids.
{
const double dist = distance.Evaluate(dataset.col(i),
centroids.unsafe_col(j));
const double dist = std::max(0.0, dataNorm + centroidNorms(j) - 2 * arma::dot(dataPoint, centroids.col(j)));
I am still unclear on why you are doing distance computations like this. Can you clarify your thinking and any advantage to doing it like this? I can see several disadvantages and I don't think it will be faster.
@rcurtin The idea was to optimize performance for high-dimensional data by precomputing squared norms for the centroids (centroidNorms) and data points (dataNorm). This lets us reduce the number of dot products, which can be expensive, and potentially speed things up. The formula I used,

const double dist = std::max(0.0, dataNorm + centroidNorms(j) - 2 * arma::dot(dataPoint, centroids.col(j)));

aims to minimize the calculations and help numerical stability.

However, I see your point. While this method might be faster in certain cases, it adds some complexity and overhead, especially for smaller datasets. Your suggestion to just compute the distance directly,

const double dist = distance.Evaluate(dataPoint, centroids.col(j));

is definitely simpler and easier to maintain. I'll run some benchmarks to compare the two. If my approach doesn't show a significant speedup, switching to the simpler method makes sense :)
@rcurtin I updated the code following the feedback and optimized the distance calculation and OpenMP parallelization. Below are the additional benchmarks comparing the performance before and after these changes:

Before Optimization:
After Optimization:

I went back to the optimized Euclidean distance formula, precomputing squared norms for both centroids and data points. This cuts down on redundant calculations, particularly when dealing with high-dimensional data. I also improved the OpenMP parallelization by making the parallel loops more fine-grained. Using a static schedule helps minimize synchronization overhead, leading to better thread utilization and faster execution. Finally, I focused on memory and cache efficiency by reusing precomputed values and making the data access more cache friendly.
The reason I am hammering so hard (now for I believe the fifth time) about the distance calculation is because it is not always the right thing to simply observe that something is faster and then do it. The real question, when observing something like this, is understanding why the other thing is slower. (This is going to be a long comment...)

Starting from the basics, we are talking about the following two approaches: computing the distance directly as the norm of the difference (the EuclideanDistance way), versus expanding it as sqrt(<x, x> + <c, c> - 2 <x, c>) with precomputed squared norms.

From a standpoint of what a computer should do, the first approach should be faster for several reasons: as noted above, it makes only a single pass over each data column, it needs no extra memory for cached norms, and it does not need the max(0.0, ...) guard against cancellation. My intuition is that memory latency will be the biggest issue here. So the question, from my perspective, is not "which is faster", but instead "why is the first one not faster", if that's what you are observing.

One key observation is that the dot-based approach will use OpenBLAS (which is very heavily optimized, and also uses OpenMP for parallelization, but I think that should not come into play here). I would imagine that Armadillo has the same optimizations going on, but perhaps the compiler is not emitting the code we want. I therefore wrote up the following code:

#include <armadillo>
double compute_ed(const arma::mat& set1, const arma::mat& set2, const size_t points)
{
double sum = 0.0;
for (size_t i = 0; i < points; ++i)
{
for (size_t j = 0; j < points; ++j)
{
sum += arma::norm(set1.col(i) - set2.col(j), 2);
}
}
return sum;
}
double compute_ed2(const arma::mat& set1, const arma::mat& set2, const size_t points)
{
double sum = 0.0;
const size_t n_rows = set1.n_rows;
for (size_t i = 0; i < points; ++i)
{
for (size_t j = 0; j < points; ++j)
{
double point_sum = 0.0;
for (size_t k = 0; k < set1.n_rows; k += 4)
{
// take 4 elements at a time to try and "trick" the vectorizer into actually working
const double a1 = set1.mem[k + i * n_rows];
const double a2 = set1.mem[k + 1 + i * n_rows];
const double a3 = set1.mem[k + 2 + i * n_rows];
const double a4 = set1.mem[k + 3 + i * n_rows];
const double b1 = set2.mem[k + j * n_rows];
const double b2 = set2.mem[k + 1 + j * n_rows];
const double b3 = set2.mem[k + 2 + j * n_rows];
const double b4 = set2.mem[k + 3 + j * n_rows];
const double ps1 = (a1 - b1) * (a1 - b1);
const double ps2 = (a2 - b2) * (a2 - b2);
const double ps3 = (a3 - b3) * (a3 - b3);
const double ps4 = (a4 - b4) * (a4 - b4);
point_sum += ps1 + ps2 + ps3 + ps4;
}
sum += std::sqrt(point_sum);
}
}
return sum;
}
double compute_ip(const arma::mat& set1, const arma::mat& set2, const arma::vec& set1_norms, const arma::vec& set2_norms, const size_t points)
{
double sum = 0.0;
for (size_t i = 0; i < points; ++i)
{
for (size_t j = 0; j < points; ++j)
{
sum += std::sqrt(set1_norms[i] + set2_norms[j] - 2 * arma::dot(set1.col(i), set2.col(j)));
}
}
return sum;
}
int main()
{
const size_t dims = 100;
const size_t points = 1000;
arma::mat set1(dims, points, arma::fill::randu);
arma::mat set2(dims, points, arma::fill::randu);
arma::wall_clock c;
// Compute distances the EuclideanDistance way.
c.tic();
double sum = compute_ed(set1, set2, points);
const double euclidean_time = c.toc();
std::cout << "norm()-based calculation: " << euclidean_time << "s (result "
<< sum << ")." << std::endl;
c.tic();
sum = compute_ed2(set1, set2, points);
const double euclidean_2_time = c.toc();
std::cout << "norm()-based manual calculation: " << euclidean_2_time << "s (result "
<< sum << ")." << std::endl;
// Now compute the norm-based way.
c.tic();
arma::vec set1_norms(points, arma::fill::none);
arma::vec set2_norms(points, arma::fill::none);
for (size_t i = 0; i < points; ++i)
{
set1_norms[i] = arma::dot(set1.col(i), set1.col(i));
}
for (size_t i = 0; i < points; ++i)
{
set2_norms[i] = arma::dot(set2.col(i), set2.col(i));
}
const double selfnorm_time = c.toc();
std::cout << "dot()-based norm preparation time: " << selfnorm_time << "s."
<< std::endl;
c.tic();
sum = compute_ip(set1, set2, set1_norms, set2_norms, points);
const double norm_time = c.toc();
std::cout << "dot()-based calculation: " << norm_time << "s (result " << sum
<< ")." << std::endl;
}

Here I implemented the plain norm()-based computation (compute_ed), a manually unrolled version (compute_ed2), and the dot()-based expansion (compute_ip). So you can see that I am getting ... Digging deeper would involve compiling with ...

CC: @conradsnicta, would be interested to see what you think. I suppose we could be using ...
@rcurtin @MarkFischinger Interim hazy perspective below. I'm currently recovering from surgery, so brain is working at reduced capacity.

For simple expressions where the memory is directly accessible in a linear column-major order (so excluding subrows and any compound expressions), the functions ... For compound expressions and subrows, ...

There is a bunch of caveats here: ...

Other observations: ...
@conradsnicta and @rcurtin, thanks for your feedback. I've been digging deeper into the performance of the sum of squared differences versus the dot-product based approach. Despite initial expectations that the sum of squared differences would be faster, my benchmarks show that both methods perform almost identically, prompting further investigation.

Here are the initial benchmark results: ...

After applying the ... Enabling the ... These results show how aggressive compiler optimizations impact both methods, bringing their performance closer together.

I also considered whether temporary Armadillo objects during operations like ... For further comparison, I benchmarked our implementation against ... I'm a bit unsure why the performance difference is so huge. Armadillo is more optimized, but I'm looking into it. When reviewing the assembly code, I observed that both the ... Currently, I'm focusing on improving the ...
It's not possible to directly compare mlpack's and Armadillo's k-means implementations without further digging. They will run for different numbers of iterations, to different tolerance measures, etc., etc... the way to compare them with each other would be to time how long a single iteration takes, or how long a specific number of iterations takes. (That's not to say that the iteration or termination strategies in mlpack couldn't be improved. But here we are considering only the distance computations, so ensuring they are equal across both implementations is needed.)
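A rough sketch of that kind of fixed-cap comparison (assuming mlpack 4.x's KMeans and Armadillo's kmeans(); the matrix sizes and iteration count are illustrative):

#include <mlpack.hpp>
#include <iostream>

int main()
{
  arma::mat data(100, 100000, arma::fill::randu);
  const size_t k = 10;
  const size_t iterations = 25;  // same iteration cap for both

  arma::wall_clock timer;

  // mlpack's k-means, capped at `iterations` Lloyd steps.
  arma::Row<size_t> assignments;
  arma::mat centroids;
  timer.tic();
  mlpack::KMeans<> km(iterations);
  km.Cluster(data, k, assignments, centroids);
  std::cout << "mlpack:    " << timer.toc() << "s" << std::endl;

  // Armadillo's k-means, also capped at `iterations`.
  arma::mat means;
  timer.tic();
  arma::kmeans(means, data, k, arma::random_subset, iterations, false);
  std::cout << "Armadillo: " << timer.toc() << "s" << std::endl;

  // Note: either implementation may still stop early on convergence, so
  // timing a single iteration is the more robust comparison.
}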
@MarkFischinger, looking at this PR, it seems to me that it is getting further from the objective. For the sake of GSoC, please could you do the following: ...

Please do all of the above requests by the end of today. If you are interested in optimising the distance calculation, let us do that in a different PR and merge this one with the OpenMP modification only. Thank you very much.
I do think that we need to get to the bottom of why arma::norm() is slower here.
There are a lot of changes here that aren't actually functional changes---do you mind going through and reducing the changes to only the actual functionality changes? Thanks 👍
I recognize that most of my comments will seem trivial, but the underlying idea is to only change code when there is a good reason to do so. If you go through the 'files' tab of the PR that shows the diff, you will see what I mean---there are lots of changes, but they aren't substantial; they are just naming changes or comment changes (or comment deletions, which is really generally not a good idea unless you have a good reason why). If we can reduce this to the minimum possible set of changes, then that will look good to me and we should merge it. 👍
(I still haven't had a chance to look deeper into the Armadillo norm() issues. If you want to open another issue for that feel free, otherwise I will do it when I have a chance.)
arma::Col<size_t> localCounts(centroids.n_cols, arma::fill::zeros);
// Thread-local storage for partial sums
arma::mat threadCentroids(centroids.n_rows, centroids.n_cols, arma::fill::zeros);
arma::Col<size_t> threadCounts(centroids.n_cols, arma::fill::zeros);
These changes don't actually change anything. Can you revert any changes where there is no actual functionality change? The only change here is a change to the name of the variables, and I don't see any reason to do that (plus it makes the diffs across history unnecessarily larger).
counts.zeros(centroids.n_cols);

// Find the closest centroid to each point and update the new centroids.
// Computed in parallel over the complete dataset
I don't see any reason to remove this comment.
#pragma omp for
for (size_t i = 0; i < (size_t) dataset.n_cols; ++i)
#pragma omp for schedule(static) nowait
for (size_t i = 0; i < dataset.n_cols; ++i)
The (size_t) cast fixed a compilation warning for some compilers and we should therefore keep it.
{
// Find the closest centroid to this point.
double minDistance = std::numeric_limits<double>::infinity();
double minDistance = std::numeric_limits<double>::max();
Why make this change? infinity() was fine. I mean max() works too, but again, better to avoid making changes when possible.
{
const double dist = distance.Evaluate(dataset.col(i),
centroids.unsafe_col(j));
const double dist = distance.Evaluate(dataset.col(i), centroids.col(j));
I believe now this line is too long. Can you revert this to the previous version?
localCounts(closestCluster)++;
// Update thread-local centroids and counts
threadCentroids.col(closestCluster) += dataset.col(i);
threadCounts(closestCluster)++;
I don't see a reason to remove the assertion or change the comments. I think the switch from .unsafe_col() to .col() is a good change, though. (I believe once upon a time that did not compile correctly, many many years ago.)
{
cNorm += std::pow(distance.Evaluate(centroids.col(i), newCentroids.col(i)),
2.0);
cNorm += std::pow(distance.Evaluate(centroids.col(i), newCentroids.col(i)), 2.0);
I believe that now this line is too long too.
} // namespace mlpack

#endif
#endif
Can you update your editor setting to leave newlines at the end of files?
@rcurtin @shrit I added all the important comments back in, replaced ... Running the benchmark again using this benchmark script and this visualization script, I generated this graphic:
rcurtin left a comment
Thanks for cleaning this up---I know that it's not the original level of optimization you were hoping for, but it is still a nice (albeit more modest) speedup, and we uncovered an issue with arma::norm() that is good to know about and indicates that there is a lot of speedup available for various mlpack algorithms if we can get arma::norm() to properly compile into SIMD instructions. We can investigate that later, but for now, let's get this merged. 🚀
Co-authored-by: Ryan Curtin <[email protected]>
Thanks for the hard work @MarkFischinger! I opened #3789 to follow up on the arma::norm() issue. I think I need to fix the documentation build so that it issues warnings and not errors---in this case, it looks like the CS department's website for Boston College has gone offline, causing that build to fail. Oh well.
Following up on the Hamerly K-means optimization, I've also applied OpenMP to the naive K-means implementation. Here are the results.
Benchmark Results
I ran a benchmark with the following settings: ...
Here's what I found: ...
This represents a 6.3% decrease in execution time.
Looking forward to your thoughts :)