
EuclideanDistance::Evaluate() is not properly vectorizing #3789

@rcurtin

Description


In #3762, @MarkFischinger found that the implementation of EuclideanDistance::Evaluate() (which internally uses arma::norm()) is slower than computing distances via the equivalent identity d(x, y)^2 = <x, x> + <y, y> - 2<x, y>.
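For reference, the identity follows directly from expanding the squared norm as an inner product:

||x - y||^2 = <x - y, x - y> = <x, x> - 2<x, y> + <y, y>

so the self-terms <x, x> and <y, y> can be precomputed once per point, leaving only a single dot product per pair of points.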

Here is an example code snippet that I ran (also in the linked PR):

#include <armadillo>
#include <cmath>
#include <iostream>

double compute_ed(const arma::mat& set1, const arma::mat& set2, const size_t points)
{
  double sum = 0.0;
  for (size_t i = 0; i < points; ++i)
  {
    for (size_t j = 0; j < points; ++j)
    {
      sum += arma::norm(set1.col(i) - set2.col(j), 2);
    }
  }
  return sum;
}

double compute_ed2(const arma::mat& set1, const arma::mat& set2, const size_t points)
{
  double sum = 0.0;
  const size_t n_rows = set1.n_rows;
  for (size_t i = 0; i < points; ++i)
  {
    for (size_t j = 0; j < points; ++j)
    {
      double point_sum = 0.0;
      // Take 4 elements at a time to try and "trick" the vectorizer into
      // actually working.  (This assumes n_rows is a multiple of 4, which
      // holds here since dims = 100.)
      for (size_t k = 0; k < n_rows; k += 4)
      {
        const double a1 = set1.mem[k + i * n_rows];
        const double a2 = set1.mem[k + 1 + i * n_rows];
        const double a3 = set1.mem[k + 2 + i * n_rows];
        const double a4 = set1.mem[k + 3 + i * n_rows];

        const double b1 = set2.mem[k + j * n_rows];
        const double b2 = set2.mem[k + 1 + j * n_rows];
        const double b3 = set2.mem[k + 2 + j * n_rows];
        const double b4 = set2.mem[k + 3 + j * n_rows];

        const double ps1 = (a1 - b1) * (a1 - b1);
        const double ps2 = (a2 - b2) * (a2 - b2);
        const double ps3 = (a3 - b3) * (a3 - b3);
        const double ps4 = (a4 - b4) * (a4 - b4);

        point_sum += ps1 + ps2 + ps3 + ps4;
      }
      sum += std::sqrt(point_sum);
    }
  }
  return sum;
}

double compute_ip(const arma::mat& set1, const arma::mat& set2, const arma::vec& set1_norms, const arma::vec& set2_norms, const size_t points)
{
  double sum = 0.0;
  for (size_t i = 0; i < points; ++i)
  {
    for (size_t j = 0; j < points; ++j)
    {
      sum += std::sqrt(set1_norms[i] + set2_norms[j] - 2 * arma::dot(set1.col(i), set2.col(j)));
    }
  }
  return sum;
}

int main()
{
  const size_t dims = 100;
  const size_t points = 1000;

  arma::mat set1(dims, points, arma::fill::randu);
  arma::mat set2(dims, points, arma::fill::randu);

  arma::wall_clock c;

  // Compute distances the EuclideanDistance way.
  c.tic();
  double sum = compute_ed(set1, set2, points);
  const double euclidean_time = c.toc();
  std::cout << "norm()-based calculation: " << euclidean_time << "s (result "
      << sum << ")." << std::endl;

  c.tic();
  sum = compute_ed2(set1, set2, points);
  const double euclidean_2_time = c.toc();
  std::cout << "norm()-based manual calculation: " << euclidean_2_time << "s (result "
      << sum << ")." << std::endl;

  // Now compute distances the dot-product-based way, precomputing the
  // self-inner-product term for every point.
  c.tic();
  arma::vec set1_norms(points, arma::fill::none);
  arma::vec set2_norms(points, arma::fill::none);

  for (size_t i = 0; i < points; ++i)
  {
    set1_norms[i] = arma::dot(set1.col(i), set1.col(i));
  }
  for (size_t i = 0; i < points; ++i)
  {
    set2_norms[i] = arma::dot(set2.col(i), set2.col(i));
  }
  const double selfnorm_time = c.toc();
  std::cout << "dot()-based norm preparation time: " << selfnorm_time << "s."
      << std::endl;

  c.tic();
  sum = compute_ip(set1, set2, set1_norms, set2_norms, points);
  const double norm_time = c.toc();
  std::cout << "dot()-based calculation: " << norm_time << "s (result " << sum
      << ")." << std::endl;
}

which resulted in (on my system):

$ ./dot_test 
norm()-based calculation: 0.0977767s (result 4.07441e+06).
norm()-based manual calculation: 0.0344042s (result 4.07441e+06).
dot()-based norm preparation time: 0.000114631s.
dot()-based calculation: 0.0290754s (result 4.07441e+06).

The reason that the first norm()-based calculation is slower appears to be the lack of vectorization in the compiled output.
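One way to confirm the diagnosis is to ask the compiler for its vectorization report when building the snippet above. (The filename dot_test.cpp and the armadillo link flag are just placeholders for however you build the snippet; flags shown for GCC and Clang.)

```shell
# GCC: report which loops were vectorized, and which were not (and why).
g++ -O3 -march=native -fopt-info-vec-optimized -fopt-info-vec-missed \
    dot_test.cpp -o dot_test -larmadillo

# Clang: equivalent remarks from the loop vectorizer.
clang++ -O3 -march=native -Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
    dot_test.cpp -o dot_test -larmadillo
```

If the report shows the arma::norm() path failing to vectorize while the manual loop and arma::dot() succeed, that would narrow the problem down to how the expression-template code is being lowered.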

Due to the way distances are used internally in mlpack, it's not really feasible to adapt the entire codebase to the dot-product-based solution (you would have to store the self-inner-product term for every point, which costs extra memory, and it would be a huge overhaul just to do that). So, solving this issue directly could result in a significant (guessing 2-3x?) speedup for mlpack's distance-based algorithms.

In essence, the issue is that we need to understand why EuclideanDistance::Evaluate() (and thus arma::norm()) is not generating vectorized code. It could be that we are using arma::norm() in a way that happens not to vectorize, or maybe we need a specific compiler setting, or perhaps something else. Once it's diagnosed, we just need to figure out how to fix it in an uninvasive way. :)
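For what it's worth, here is the kind of loop shape that auto-vectorizers reliably handle, in contrast to the expression-template path through arma::norm(set1.col(i) - set2.col(j), 2). This is just a sketch (not mlpack's actual code, and the function names are made up): a squared-distance kernel written as a single flat loop over raw pointers.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical sketch: a flat accumulation loop with no aliasing surprises,
// which GCC/Clang typically auto-vectorize at -O3.
inline double squared_distance(const double* a, const double* b,
                               const std::size_t n)
{
  double sum = 0.0;
  for (std::size_t k = 0; k < n; ++k)
  {
    const double d = a[k] - b[k];
    sum += d * d;
  }
  return sum;
}

inline double euclidean_distance(const double* a, const double* b,
                                 const std::size_t n)
{
  return std::sqrt(squared_distance(a, b, n));
}
```

If arma::norm() on a column-difference expression is not being lowered into something equivalent to this, that would explain the gap, and a specialization that takes this path for the Euclidean case might be the uninvasive fix.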

CC: @conradsnicta (for visibility, I figure you might want to just stay apprised on this one)
