Vectorizing Fortran code

I would like to ask about the performance of a code pattern that shows up quite often in my projects:

Nested loops that call a function defined in a separate module. The first input arguments of the function are elements of vectors, so the result is a 3D array F(a', a, z), where a', a, and z are the vectors.

I came up with an MWE. The bottleneck is the three nested loops in the main program.
I have tried to speed up the computation using OpenMP and achieved a good improvement.
However, it was pointed out by some users (cc @kkifonidis, @PierU, @JohnCampbell) in another post that I could also try to vectorize the code. How would you vectorize this part of the code? The function is defined in a module and accepts scalar arguments and returns a scalar result.

This MWE is also available on GitHub: aledinola/FortranVec2 (Loops vs vectorized code in Fortran).

Any help is really appreciated!

module mymod
  implicit none

contains

function linspace(a, b, n) result(x)
    implicit none
    real(kind=8), intent(in) :: a, b
    integer, intent(in)      :: n
    real(kind=8) :: x(n)

    integer :: i
    real(kind=8) :: step

    if (n < 1) then
        ERROR STOP 'linspace: n must be at least 1'
    endif

    if (n == 1) then
        x(1) = a
    else
        step = (b - a) / real(n - 1, kind=8)
        do i = 1, n
            x(i) = a + step * real(i - 1, kind=8)
        enddo
    endif
end function linspace

function return_fn(aprime,a,z,r,w) result(res)
    implicit none
    ! Declare input arguments
    real(8), intent(in) :: aprime, a, z
    real(8), intent(in) :: r, w
    ! Declare function result
    real(8) :: res
    ! Declare local variables
    real(8) :: consumption

    consumption = (1.0d0+r)*a + w*z - aprime
    if (consumption > 0.0d0) then
        res = log(consumption)
    else
        res = -1.0d10
    endif
end function return_fn

end module mymod

program main
  use mymod,   only: return_fn, linspace
  use omp_lib, only: omp_get_wtime
  implicit none

  integer, parameter :: n_a = 1000, n_z = 100
  real(8), parameter :: r = 0.04d0, w = 1.0d0

  integer :: iap, ia, iz, alloc_status
  real(8) :: a_grid(n_a), z_grid(n_z)
  real(8), allocatable :: payoff(:,:,:), payoff2(:,:,:)
  real(8) :: t_start, t_end, err
  integer :: par_fortran

  ! Toggle OpenMP region on/off (set to 0 to force serial)
  par_fortran = 1

  ! Build vectors for a and z
  a_grid = linspace(0.001d0, 50.0d0, n_a)
  z_grid = linspace(0.5d0, 1.5d0, n_z)

  ! Allocate 3D arrays to hold results
  allocate(payoff(n_a, n_a, n_z), payoff2(n_a, n_a, n_z), stat=alloc_status)
  if (alloc_status /= 0) then
    print *, "Error allocating memory for payoff arrays"
    stop
  endif

  ! -------------------------
  ! Serial computation timing
  ! -------------------------
  t_start = omp_get_wtime()

  do iz = 1, n_z
    do ia = 1, n_a
      do iap = 1, n_a
        payoff(iap, ia, iz) = return_fn(a_grid(iap), a_grid(ia), z_grid(iz), r, w)
      enddo
    enddo
  enddo

  t_end = omp_get_wtime()
  print *, "Serial wall time (s): ", t_end - t_start

  ! -------------------------
  ! OpenMP computation timing
  ! -------------------------
  t_start = omp_get_wtime()

  !$omp parallel default(shared) private(iz, ia, iap) if (par_fortran == 1)
  !$omp do collapse(3) schedule(static)
  do iz = 1, n_z
    do ia = 1, n_a
      do iap = 1, n_a
        payoff2(iap, ia, iz) = return_fn(a_grid(iap), a_grid(ia), z_grid(iz), r, w)
      enddo
    enddo
  enddo
  !$omp end do
  !$omp end parallel

  t_end = omp_get_wtime()
  print *, "OpenMP wall time (s): ", t_end - t_start

  ! Check results are correct
  err = maxval(abs(payoff - payoff2))

  if (err > 1.0d-10) then
    print *, "Error: results do not match! max error = ", err
  else
    print *, "Results match! max error = ", err
  endif

  deallocate(payoff, payoff2)

end program main

Less work maybe? And might optimise better.

Are n_z=100 and n_a=1000 the largest grids you need? The result array is 1000 × 1000 × 100 × 8 bytes = 800 MB in this case. I'm interested to hear whether you need bigger.

The model I am working on now has n_a=1800 and n_z=245 (I doubt I will ever need anything larger). I have 64 GB of RAM, and even more on the server, but I realize that OpenMP is quite memory-intensive.
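(For reference, and assuming the same real(8) storage as in the MWE, a single (n_a, n_a, n_z) array at those sizes is 1800 × 1800 × 245 × 8 bytes ≈ 6.4 GB, so the two payoff arrays together take roughly 13 GB.)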

Thanks, this is quite helpful! I will test this later. I just note (as a reminder to my later self) that your code introduces three different modifications:

(1) Eliminate the function return_fn by copying its content into the main program. This saves the overhead of calling the function from module mymod, among other things.

(2) Move loop invariants before the loop over iap: income, defined as income = r*a_grid(ia) + w*z_grid(iz), can be computed before the loop over iap starts.

(3) Use array syntax to eliminate the loop over iap: the do loop and the if condition are replaced by a where construct.

I will test how large the overall speedup is, and the relative contributions of 1 vs 2 vs 3. I guess 2 is what matters most, but who knows.
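For reference, here is a minimal sketch of what (2) and (3) could look like together, written directly against the MWE's return_fn logic rather than the quoted post (cash is just a placeholder name that would need a real(8) declaration in the main program):

do iz = 1, n_z
  do ia = 1, n_a
    ! (2) hoist the loop invariant out of the iap loop
    cash = (1.0d0 + r)*a_grid(ia) + w*z_grid(iz)
    ! (3) array syntax over the iap dimension replaces the do loop,
    !     and the where mask replaces the if inside return_fn
    where (cash - a_grid > 0.0d0)
      payoff2(:, ia, iz) = log(cash - a_grid)
    elsewhere
      payoff2(:, ia, iz) = -1.0d10
    end where
  enddo
enddo

Whether the compiler turns the where construct into masked SIMD stores or an explicit temporary varies between compilers, so it is worth timing this against the plain loop.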

You’ve already been given some good advice here, so I am simply going to focus on what @JohnCampbell has already pointed out in the other thread that you posted in.

The triple loop that is being benchmarked in the code you've posted above needs only on the order of one second or less of run-time on a single processor core. Even if you increased your array dimensions to n_a=1800 and n_z=245, we would still be talking about a very small problem (one that by itself doesn't seem to be worth the trouble of parallelization).

Hence, I am wondering what I am missing about your use case. Is your example part of a larger application that calls this triple loop repeatedly as some sort of computational kernel? If so, you should instead parallelize the outermost loop of that application (the one that does the actual calling).
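Purely as an illustration of that idea (all names here, n_params, param_grid, solve_model, and results, are hypothetical and not from the posted code), the OpenMP directives would move one level up, around the loop that repeatedly invokes the kernel, while the triple loop inside each call stays serial:

!$omp parallel do default(shared) private(ip) schedule(static)
do ip = 1, n_params
  ! each call runs the (iap, ia, iz) triple loop serially on its own
  ! private work arrays and writes a distinct element of results
  results(ip) = solve_model(param_grid(ip), a_grid, z_grid)
enddo
!$omp end parallel do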

When it comes to vectorization of the inner loop, the if branch forces the compiler to use predicated (masked) SIMD instructions, which can be slightly more expensive.

Since consumption is a linear function of aprime, it is possible to determine the cross-over point in advance and avoid the branching in the inner loop:

  real(8) :: apz, a_step
  integer :: iapz, iz, ia

  a_step = (50.0d0 - 0.001d0) / real(n_a - 1, kind=8)

  do iz = 1, n_z
    do ia = 1, n_a

      ! Zero point
      apz = (1.0d0 + r)*a_grid(ia) + w*z_grid(iz)

      ! Zero index
      iapz = min(max(int((apz - a_grid(1)) / a_step), 0),n_a)
      if (iapz < n_a) then
        ! Check next value, because we rounded down earlier
        if (apz > a_grid(iapz+1)) iapz = iapz + 1
      endif

      ! Assignment becomes no-op if slice is empty
      payoff2(1:iapz,ia,iz) = log(apz - a_grid(1:iapz))
      payoff2(iapz+1:n_a,ia,iz) = -1.0d10

    enddo
  enddo

Using gfortran -O3 -mcpu=native on an Apple M2, the branchless version is 20-30% faster. Using flang -O3 -mcpu=native, it was 20-40% faster.

Edit: since the consumption condition is linear in the other variables too, it becomes a line cutting the (iap,ia)-plane, or a plane cutting the (iap,ia,iz)-space. I'm not sure whether one can capitalize on this (a triangular iteration space?).

Edit 2: I get another small boost by calling the logarithm function vvlog in the Apple Accelerate library.
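For anyone curious, a rough sketch of how vvlog might be bound from Fortran via ISO_C_BINDING (this assumes the vForce C signature void vvlog(double *y, const double *x, const int *n) and linking with -framework Accelerate; untested here, so treat it as a starting point):

interface
  ! Accelerate/vForce vectorized natural log: y(1:n) = log(x(1:n))
  subroutine vvlog(y, x, n) bind(c, name="vvlog")
    use iso_c_binding, only: c_double, c_int
    real(c_double), intent(out) :: y(*)
    real(c_double), intent(in)  :: x(*)
    integer(c_int), intent(in)  :: n
  end subroutine vvlog
end interface

The slice assignment in the inner part would then read something like call vvlog(payoff2(1:iapz, ia, iz), apz - a_grid(1:iapz), iapz), with iapz declared as integer(c_int) to keep the kinds matching.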

Thanks for the great suggestion! I slightly modified your code to allow for a generic a_grid (not necessarily equally spaced), as follows:

real(8), allocatable :: payoff(:,:,:), payoff2(:,:,:), payoff3(:,:,:), consumption(:)
real(8) :: cash
integer :: iapz

allocate(consumption(n_a), payoff(n_a, n_a, n_z), payoff2(n_a, n_a, n_z), payoff3(n_a, n_a, n_z), stat=alloc_status)
if (alloc_status /= 0) then
  print *, "Error allocating memory for payoff arrays"
  stop
endif

do iz = 1, n_z
  do ia = 1, n_a
      cash = (1.0d0+r)*a_grid(ia) + w*z_grid(iz)
      consumption = cash - a_grid
      iapz = count(consumption > 0.0d0)
      payoff3(1:iapz, ia, iz) = log(consumption(1:iapz))
      if (iapz < n_a) then
          payoff3(iapz+1:n_a, ia, iz) = -1.0d10
      endif
  enddo
enddo

Unfortunately, the performance of this code is bad, i.e. slower than the brute-force method with the three nested loops, whereas your code is quite fast. Is it because I used the function count? The problem with your method is that it works only if a_grid is equally spaced (please correct me if I am wrong).

On the Apple M2 the variant using count is on average slightly slower than the brute-force method. I think count may be the culprit, because you essentially have two new loops in:

consumption = cash - a_grid
iapz = count(consumption > 0.0d0)

Since a_grid is ordered, you could use a bisection search to locate the zero crossing in the case of irregular a_grid intervals (the interval where a_grid(i) < cash < a_grid(i+1)). This loop would need approximately log2(n_a) iterations, so around 10 or 11. Counting with AVX512 units (8 doubles per iteration) would need 1000 / 8 = 125 inner-loop iterations. The a_grid should fit entirely in the L1 cache in both cases. But it does increase the code complexity.
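A minimal sketch of such a bisection search (my own illustration, assuming a_grid is sorted in ascending order; it returns the largest index i with a_grid(i) < cash, or 0 if there is none):

pure function last_below(a_grid, cash) result(iapz)
  ! Largest index i with a_grid(i) < cash; 0 if a_grid(1) >= cash.
  real(8), intent(in) :: a_grid(:)
  real(8), intent(in) :: cash
  integer :: iapz
  integer :: lo, hi, mid
  lo = 0                  ! invariant: a_grid(lo) < cash ("none yet" when lo = 0)
  hi = size(a_grid) + 1   ! invariant: a_grid(hi) >= cash ("none yet" when hi = n+1)
  do while (hi - lo > 1)
    mid = (lo + hi) / 2
    if (a_grid(mid) < cash) then
      lo = mid
    else
      hi = mid
    endif
  enddo
  iapz = lo
end function last_below

With that, iapz = last_below(a_grid, cash) would replace the count line, and the two slice assignments stay as they are.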

On a regular grid the interval lookup has an O(1) cost, with one division and a little branching.
Btw, I am not sure my logic yesterday was totally correct:

      ! Zero index
      iapz = min(max(int((apz - a_grid(1)) / a_step), 0),n_a)
      if (iapz < n_a) then
        ! Check next value, because we rounded down earlier
        if (apz > a_grid(iapz+1)) iapz = iapz + 1
      endif

It might be sufficient to use:

iapz = min(max(floor((apz - a_grid(1)) / a_step) + 1, 0), n_a)

I was “drawing” these lines in my head and I may have missed something. Apologies if this causes any confusion.
