
EuclideanDistance::Evaluate() is not properly vectorizing #3789

@rcurtin

Description


In #3762, @MarkFischinger found that the implementation of EuclideanDistance::Evaluate() (which internally uses arma::norm()) is slower than computing distances via the equivalent identity d(x, y)^2 = <x, x> + <y, y> - 2<x, y>.
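For reference, the identity follows directly from expanding the squared norm as an inner product:

||x - y||^2 = <x - y, x - y> = <x, x> - 2<x, y> + <y, y>

so the self-terms <x, x> and <y, y> can be precomputed once per point, leaving only a single dot product per pair of points.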

Here is an example code snippet that I ran (also in the linked PR):

#include <armadillo>
#include <cmath>
#include <iostream>

double compute_ed(const arma::mat& set1, const arma::mat& set2, const size_t points)
{
  double sum = 0.0;
  for (size_t i = 0; i < points; ++i)
  {
    for (size_t j = 0; j < points; ++j)
    {
      sum += arma::norm(set1.col(i) - set2.col(j), 2);
    }
  }
  return sum;
}

double compute_ed2(const arma::mat& set1, const arma::mat& set2, const size_t points)
{
  double sum = 0.0;
  const size_t n_rows = set1.n_rows;
  for (size_t i = 0; i < points; ++i)
  {
    for (size_t j = 0; j < points; ++j)
    {
      double point_sum = 0.0;
      // Take 4 elements at a time to try and "trick" the vectorizer into
      // actually working.  (This assumes n_rows is a multiple of 4, which
      // holds here since dims = 100.)
      for (size_t k = 0; k < n_rows; k += 4)
      {
        const double a1 = set1.mem[k + i * n_rows];
        const double a2 = set1.mem[k + 1 + i * n_rows];
        const double a3 = set1.mem[k + 2 + i * n_rows];
        const double a4 = set1.mem[k + 3 + i * n_rows];

        const double b1 = set2.mem[k + j * n_rows];
        const double b2 = set2.mem[k + 1 + j * n_rows];
        const double b3 = set2.mem[k + 2 + j * n_rows];
        const double b4 = set2.mem[k + 3 + j * n_rows];

        const double ps1 = (a1 - b1) * (a1 - b1);
        const double ps2 = (a2 - b2) * (a2 - b2);
        const double ps3 = (a3 - b3) * (a3 - b3);
        const double ps4 = (a4 - b4) * (a4 - b4);

        point_sum += ps1 + ps2 + ps3 + ps4;
      }
      sum += std::sqrt(point_sum);
    }
  }
  return sum;
}

double compute_ip(const arma::mat& set1, const arma::mat& set2, const arma::vec& set1_norms, const arma::vec& set2_norms, const size_t points)
{
  double sum = 0.0;
  for (size_t i = 0; i < points; ++i)
  {
    for (size_t j = 0; j < points; ++j)
    {
      sum += std::sqrt(set1_norms[i] + set2_norms[j] - 2 * arma::dot(set1.col(i), set2.col(j)));
    }
  }
  return sum;
}

int main()
{
  const size_t dims = 100;
  const size_t points = 1000;

  arma::mat set1(dims, points, arma::fill::randu);
  arma::mat set2(dims, points, arma::fill::randu);

  arma::wall_clock c;

  // Compute distances the EuclideanDistance way.
  c.tic();
  double sum = compute_ed(set1, set2, points);
  const double euclidean_time = c.toc();
  std::cout << "norm()-based calculation: " << euclidean_time << "s (result "
      << sum << ")." << std::endl;

  c.tic();
  sum = compute_ed2(set1, set2, points);
  const double euclidean_2_time = c.toc();
  std::cout << "norm()-based manual calculation: " << euclidean_2_time << "s (result "
      << sum << ")." << std::endl;

  // Now compute distances the dot-product-based way, precomputing the
  // self-inner-product term for every point.
  c.tic();
  arma::vec set1_norms(points, arma::fill::none);
  arma::vec set2_norms(points, arma::fill::none);

  for (size_t i = 0; i < points; ++i)
  {
    set1_norms[i] = arma::dot(set1.col(i), set1.col(i));
  }
  for (size_t i = 0; i < points; ++i)
  {
    set2_norms[i] = arma::dot(set2.col(i), set2.col(i));
  }
  const double selfnorm_time = c.toc();
  std::cout << "dot()-based norm preparation time: " << selfnorm_time << "s."
      << std::endl;

  c.tic();
  sum = compute_ip(set1, set2, set1_norms, set2_norms, points);
  const double norm_time = c.toc();
  std::cout << "dot()-based calculation: " << norm_time << "s (result " << sum
      << ")." << std::endl;
}

which resulted in (on my system):

$ ./dot_test 
norm()-based calculation: 0.0977767s (result 4.07441e+06).
norm()-based manual calculation: 0.0344042s (result 4.07441e+06).
dot()-based norm preparation time: 0.000114631s.
dot()-based calculation: 0.0290754s (result 4.07441e+06).

The reason that the first norm()-based calculation is slower appears to be the lack of vectorization in the compiled output.
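One way to confirm the diagnosis is to ask the compiler for its vectorization report when building the snippet above. (The filename dot_test.cpp and the armadillo link flag are just placeholders for however you build the snippet; flags shown for GCC and Clang.)

```shell
# GCC: report which loops were vectorized, and which were not (and why).
g++ -O3 -march=native -fopt-info-vec-optimized -fopt-info-vec-missed \
    dot_test.cpp -o dot_test -larmadillo

# Clang: equivalent remarks from the loop vectorizer.
clang++ -O3 -march=native -Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
    dot_test.cpp -o dot_test -larmadillo
```

If the report shows the arma::norm() path failing to vectorize while the manual loop and arma::dot() succeed, that would narrow the problem down to how the expression-template code is being lowered.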

Due to the way distances are used internally in mlpack, it's not really feasible to adapt the entire codebase to the dot-product-based solution (you would have to store the self-inner-product term for every point, which costs extra memory, and it would be a huge overhaul just to do that). So, solving this issue directly could result in a significant (guessing 2-3x?) speedup for mlpack's distance-based algorithms.

In essence, the issue is that we need to understand why EuclideanDistance::Evaluate() (and thus arma::norm()) is not generating vectorized code. It could be that we are using arma::norm() in a way that happens not to vectorize, or maybe we need a specific compiler setting, or perhaps something else. Once it's diagnosed, we just need to figure out how to fix it in an uninvasive way. :)
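For what it's worth, here is the kind of loop shape that auto-vectorizers reliably handle, in contrast to the expression-template path through arma::norm(set1.col(i) - set2.col(j), 2). This is just a sketch (not mlpack's actual code, and the function names are made up): a squared-distance kernel written as a single flat loop over raw pointers.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical sketch: a flat accumulation loop with no aliasing surprises,
// which GCC/Clang typically auto-vectorize at -O3.
inline double squared_distance(const double* a, const double* b,
                               const std::size_t n)
{
  double sum = 0.0;
  for (std::size_t k = 0; k < n; ++k)
  {
    const double d = a[k] - b[k];
    sum += d * d;
  }
  return sum;
}

inline double euclidean_distance(const double* a, const double* b,
                                 const std::size_t n)
{
  return std::sqrt(squared_distance(a, b, n));
}
```

If arma::norm() on a column-difference expression is not being lowered into something equivalent to this, that would explain the gap, and a specialization that takes this path for the Euclidean case might be the uninvasive fix.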

CC: @conradsnicta (for visibility, I figure you might want to just stay apprised on this one)
