Petalisp generates high performance code for parallel computers by JIT-compiling array definitions. It augments the existing general purpose programming language Common Lisp (and soon also Python) with parallelism and lazy evaluation.
- Install Lisp and a suitable IDE. If unsure, pick Portacle.
- Download Petalisp via Quicklisp.
- Check out some of the examples.
Petalisp is still under development, so the following examples may still change slightly. Nevertheless they give a good glimpse on what programming with Petalisp will be like.
Example 1: transposing a matrix
(defun lazy-transpose (A)
(lazy-reshape A (transform m n to n m)))
Example 2: matrix-matrix multiplication
(defun matrix-multiplication (A B)
(lazy-reduce #'+
(lazy #'*
(lazy-reshape A (transform m n to n m 1))
(lazy-reshape B (transform n k to n 1 k)))))
Example 3: the numerical Jacobi scheme in two dimensions
(defun lazy-jacobi-2d (grid iterations)
(let ((interior (interior grid)))
(if (zerop iterations) grid
(lazy-jacobi-2d
(lazy-fuse
x
(lazy #'* 0.25
(lazy #'+
(lazy-reshape x (transform i0 i1 to (+ i0 1) i1) interior)
(lazy-reshape x (transform i0 i1 to (- i0 1) i1) interior)
(lazy-reshape x (transform i0 i1 to i0 (+ i1 1)) interior)
(lazy-reshape x (transform i0 i1 to i0 (- i1 1)) interior))))
(- iterations 1)))))
All subsequent benchmarks have been measured on a Intel i7-8750H CPU, with six cores running at 2.20 GHz (4.1 GHz turbocore). It has the following characteristics:
- L3 cache 9MB L3 cache, shared across all cores
- L2 cache 256kB per core
- L1 cache 32kB per core
- SIMD Support for AVX, AVX2 and FMA, but not AVX512
- Petalisp version commit 9f9cfd6328141ba3d52a5fee1343c825f6a55bcd
All benchmark results are given double-precision floating-point operations per second. The reported values are averages over four seconds of repeated execution.
This first benchmark compares the performance of Petalisp and the likwid
benchmark suite for a single kernel of the form
(in-package #:petalisp.benchmarks)
(with-temporary-backend (make-native-backend :threads 1)
(print-benchmark-table daxpy 20))
likwid-bench -t daxpy_avx -w S0:SIZE:1
For a single daxpy run, Petalisp reaches about 45-90% of the single-thread performance of likwid-bench. This picture changes if multiple runs of daxpy are scheduled in succession, because it allows Petalisp the chance to apply temporal blocking. In such cases, Petalisp can outperform high-quality single-pass kernels such as those of likwid.
These daxpy results use an experimental scheduler that is not yet upstream. The upstream scheduler doesn’t yet parallelize a single daxpy sweep because it operates directly on an input array rather than on partitioned data (Stay tuned for future updates!). Using the experimental scheduler and 6 threads, Petalisp reaches 32% to 70% of the performance of the daxpy version of likwid-bench. A possible explanation for the remaining performance difference are the multiple thread barriers for synchronization that Petalisp doesn’t (yet) optimize away.
Benchmark code:
(in-package #:petalisp.benchmarks)
(with-temporary-backend (make-native-backend :threads 6)
(print-benchmark-table daxpy 20))
likwid-bench -t daxpy_avx -w S0:SIZE:6
Matrix-matrix multiplication is an interesting case for Petalisp because it
doesn’t have a built-in reduction operator. Instead, Petalisp treats
reductions naively as repeated summations of all odd and even elements.
Although there are optimizations in place to optimize such repeated operations,
right now the multiplication of an
Benchmark code:
(in-package #:petalisp.benchmarks)
(with-temporary-backend (make-native-backend :threads 1)
(print-benchmark-table dgemm 20))
However, this picture changes for skinny matrices. For the multiplication of skinny matrices of the form Nx8 @ 8xK, the single-thread performance of Petalisp significantly faster than OpenBLAS:
Benchmark code:
(in-package #:petalisp.benchmarks)
(with-temporary-backend (make-native-backend :threads 1)
(print-benchmark-table dgemm-n=8 20))
OpenBLAS underperforms in this setting because it has no special handling of skinny matrices. Petalisp overperforms in this setting because it can generate specialized code and its buffer pruning technique can eliminate all intermediate storage because the reduction loop is short enough.
For Jacobi’s method in two dimensions, Petalisp achieves between 36% and 89% of the single-core performance of auto-vectorized C++ code. The parallel performance is not yet on par with OpenMP parallelized C++ code, mainly because Petalisp’s automatic parallel scheduler is very recent and contains a suboptimal synchronization mechanim.
Benchmark code:
(in-package #:petalisp.benchmarks)
(loop for threads from 1 to 6 do
(with-temporary-backend (make-native-backend :threads threads)
(print-benchmark-table stencil-jacobi-2d 20)))
The Red-Black Gauss-Seidel method differs from Jacobi’s method in that it touches elements in a chessboard-like pattern, with two sweeps over the domain per iteration. This results in a more complicated data-flow graph. Nevertheless, the measured performance is quite similar to that of Jacobi’s method, apart from the cost of having to traverse the domain twice.
Benchmark code:
(in-package #:petalisp.benchmarks)
(loop for threads from 1 to 6 do
(with-temporary-backend (make-native-backend :threads threads)
(print-benchmark-table rbgs 20)))
A Multigrid V-Cycle combines several numerical primitives to solve partial differential equations efficiently. It contains stencils for smoothing high-frequency components of a grid, interpolation and prolongation for transferring data between smaller and larger grids, and calculations of the residual on each grid level. Despite these complexities, Petalisp achieves decent floating-point performance and even a modest parallel speedup:
Benchmark code:
(in-package #:petalisp.benchmarks)
(loop for threads from 1 to 6 do
(with-temporary-backend (make-native-backend :threads threads)
(print-benchmark-table multigrid-v-cycle 20)))
NumPy is a widely used Python library for scientific computing on arrays. It provides powerful N-dimensional arrays and a variety of functions for working with these arrays.
Petalisp works on a more fundamental level. It provides even more powerful N-dimensional arrays, but just a few building blocks for working on them - element-wise function application, reshaping and array fusion.
So Petalisp is not a substitute for NumPy. However, it could be used to write a library that behaves like NumPy, but that is much faster and fully parallelized. In fact, writing such a library is one of my future goals.
Not necessarily. Not everyone has the time to learn Common Lisp. That is why I am also working on some convenient Python bindings for Petalisp.
But: If you ever have time to learn Lisp, do it! It is an enlightening experience.
Put the following code in your initialization file:
(put 'lazy 'common-lisp-indent-function '(1 &rest 1))
(put 'lazy-reduce 'common-lisp-indent-function '(1 &rest 1))
(put 'lazy-multiple-value 'common-lisp-indent-function '(1 1 &rest 1))
(put 'lazy-reshape 'common-lisp-indent-function '(1 &rest 1))
I am aware that this license prevents some people from using or contributing to this piece of software, which is a shame. But unfortunately the majority of software developers have not yet understood that
- In a digital world, free software is a necessary prerequisite for a free society.
- When developing software, open collaboration is way more efficient than competition.
So as long as distribution of non-free software is socially accepted, copyleft licenses like the AGPL seem to be the lesser evil.
That being said, I am willing to discuss relicensing on an individual basis.
I couldn’t wish for a better tool for the job. Common Lisp is extremely rich in features, standardized, fast, safe and mature. The Lisp community is amazing and there are excellent libraries for almost every imaginable task.
To illustrate why Lisp is particularly well suited for a project like Petalisp, consider the following implementation of a JIT-compiler for mapping a function over a vector of a certain element type:
(defun vector-mapper (element-type)
(compile nil `(lambda (fn vec)
(declare (function fn)
(type (simple-array ,element-type (*)) vec)
(optimize (speed 3) (safety 0)))
(loop for index below (length vec) do
(symbol-macrolet ((elt (aref vec index)))
(setf elt (funcall fn elt)))))))
Not only is this JIT-compiler just 8 lines of code, it is also 20 times faster than invoking GCC or Clang on a roughly equivalent piece of C code.