Addition-Only Matrix Multiplication

ABSTRACT
Matrix multiplication consumes a large fraction of the time taken in many machine-learning algorithms. Thus, accelerator chips that perform matrix multiplication faster than conventional processors or even GPUs are of increasing interest. In this paper, we demonstrate a method of performing matrix multiplication without a scalar multiplier circuit. In many cases of practical interest, only a single addition and a single on-chip copy operation are needed to replace a multiplication. It thus becomes possible to design a matrix-multiplier chip that, because it does not need time-, space-, and energy-consuming multiplier circuits, can hold many more processors, and thus provide a net speedup.
1. INTRODUCTION
In this paper we show that when multiplying matrices, scalar multiplication is not really necessary and can be replaced by a surprisingly small number of additions. The advantage of performing matrix multiplication using only addition for arithmetic is that it then becomes feasible to build special-purpose chips with no multiplier circuits. Such chips will take up less space per on-chip processor, allowing more, but simpler, processors to be packed into a single chip. [7] is an example of an architecture that implements this strategy. Further, since a multiplier circuit can take significantly more time than addition or other typical machine operations, it is possible that the addition-only approach can be faster, even on conventional architectures. But there is little advantage if the number of additions needed to replace a single multiplication is large. Thus, we must show that in practice, very few additions are needed to replace a multiplication – fewer than one addition (plus a little more than one copy operation) in some cases of practical interest.

Before getting deep into the weeds, let us see the essential idea that the algorithm uses. Start with a vector of integers in some large range, say 1-to-k. We sort, eliminate duplicates, and take differences between consecutive numbers in the sorted order. The result is a new vector of integers, but they are much more constrained than the original list. Specifically, rather than simply being numbers in the range up to k, their sum is at most k. Thus, when we iterate the process of sorting, eliminating duplicates, and taking differences several times, the lists rapidly become much shorter than the original list. We can then multiply the original vector by any constant c by recursively multiplying the vector of differences by c and then obtaining the original vector by accumulating the differences after they have been multiplied by c.
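To see the constraint concretely, the short Python fragment below (an illustration of ours, not code from the paper; the vector length and the bound k are arbitrary choices) performs one round of sorting, duplicate elimination, and differencing, and checks that the resulting differences sum to no more than k.

    import random

    k = 1 << 24                                     # elements drawn from 1..k
    v = [random.randint(1, k) for _ in range(1000)]

    s = sorted(set(v))                              # sort and eliminate duplicates
    d = [s[0]] + [s[j] - s[j - 1] for j in range(1, len(s))]   # consecutive differences

    # The differences sum to the largest element of v, which is at most k,
    # so the new list is far more constrained than the original one.
    assert sum(d) == max(v) <= k
    print(len(v), len(s), len(d), sum(d))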
1.1 Motivation and Background
The rising importance of deep-learning and other machine-learning applications has made multiplication of large matrices take on a new importance. For example, backpropagation [8] is essentially a sequence of matrix multiplications. At the same time, special-purpose chips or boards, such as [7] and [5], are proliferating. We therefore offer a new approach to matrix multiplication that:

1. Works for both sparse and dense matrices.

2. Is more efficient, the larger the matrices are.

3. Works better when the matrix elements require fewer bits, an important trend as people search for ways to make machine learning more efficient.

4. Supports a chip design that avoids multiplication circuits, thus saving chip real estate and allowing more arithmetic units to be placed on one chip.

5. Uses a very small number of additions in place of one scalar multiplication, thus offering an opportunity to speed up the computation, since multiplication can take significantly more time than addition.

The search for algorithms that multiply n-by-n matrices in less than the O(n^3) time taken by the straightforward algorithm has been ongoing for more than 50 years, from Strassen [9] at O(n^2.81) to the best known [1] at O(n^2.37). Unfortunately, all these algorithms to date, while they have better asymptotic running times than the obvious one, have constant factors that make them unattractive, even for very large matrices, and they also assume the matrices are dense.
If we were only to multiply two b-bit scalars, we could make some small improvements to the Russian-Peasants approach in some cases, but the latter method is about as good as we can do. However, when matrices are large and the number of bits in their elements is small – exactly the common case for machine-learning applications – we can do significantly better by multiplying a vector of n integers, each of b bits, by a b-bit constant. We preprocess the vector in a way we shall describe, involving sorting, eliminating duplicates, and taking differences of successive numbers. Once preprocessed, we can multiply the vector by n constants, and thus compute one of the n outer products we need.

2.2 Vector-Scalar Multiplication
The algorithm for multiplying a vector [v_1, ..., v_n] by a scalar c is suggested by Fig. 1. We assume that each v_i is a positive b-bit integer; signed integers can be handled by ignoring the sign until the products are created. The algorithm is recursive, in that it modifies the initial vector to produce a (typically) shorter vector [d_1, ..., d_m]. Not only is this vector shorter, but the sum of its elements is at most 2^b. This constraint is much stronger than the constraint on the original vector – that each element be less than 2^b.

[Figure 1: The vector V = [v_1, ..., v_n] is sorted (once) into S = [s_1, ..., s_m], with pointers P = [p_1, ..., p_n] tracking the sorting order; taking differences of S gives D = [d_1, ..., d_m], which is then multiplied by c (recursively, once for each scalar c).]

1. Sort: Sort the elements of V, eliminating duplicates, to produce the list [s_1, ..., s_m]; for each v_i, record a pointer p_i to the position of v_i in the sorted list.

2. Take Differences: Compute d_1 = s_1 and d_i = s_i − s_{i−1} for i = 2, 3, ..., m.

3. Recursion: Either by using Russian-Peasants multiplication as a base case, or using this algorithm recursively, produce the vector-scalar product [cd_1, ..., cd_m].

4. Accumulate: Compute the product of c and the vector [s_1, ..., s_m] by cs_1 = cd_1 and cs_i = cs_{i−1} + cd_i for i = 2, 3, ..., m. Note that we use m − 1 additions here, and it is the only place, other than the recursive step, where additions are used.

5. Follow Pointers: For each v_i, the pointer p_i is that j such that v_i = s_j. Therefore, cv_i = cs_j. We may therefore copy the value cs_j just computed at the previous step and make that be cv_i.

Observe that while step (1) requires O(n log n) time to sort, and step (2) requires O(n) time for subtractions, these are done only n times each. Thus, the total time spent for these two steps is O(n^2 log n) at most. It is only steps (4) and (5), each of which takes O(n) time but is repeated n^2 times, that require O(n^3) work. In particular, step (4) is where the O(n^3) additions occur, and step (5) requires O(n^3) copy operations.

Example 2.1. Suppose we start with the vector

    V = [3, 1, 4, 1, 5, 9]
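To make the outline above concrete, here is a minimal Python sketch of it (ours, not the paper's code), run on the vector of Example 2.1 with an arbitrarily chosen scalar c = 7. The recursion-depth cutoff is an arbitrary simplification, and the alignment refinement discussed later in the paper is omitted.

    def russian_peasants(x, c):
        """Multiply x by c using only shifts and additions."""
        result = 0
        while x:
            if x & 1:
                result += c
            c <<= 1
            x >>= 1
        return result

    def scalar_vector_product(v, c, depth=2):
        """Multiply every element of v by c, following steps (1)-(5) above."""
        if depth == 0 or len(v) <= 1:
            return [russian_peasants(x, c) for x in v]       # base case of step (3)

        s = sorted(set(v))                                   # step (1): sort, eliminate duplicates
        pointers = {x: j for j, x in enumerate(s)}           #           and remember positions
        d = [s[0]] + [s[j] - s[j - 1] for j in range(1, len(s))]   # step (2): differences

        cd = scalar_vector_product(d, c, depth - 1)          # step (3): recurse on the shorter list

        cs = [cd[0]]                                         # step (4): accumulate; len(s) - 1 additions
        for j in range(1, len(s)):
            cs.append(cs[-1] + cd[j])

        return [cs[pointers[x]] for x in v]                  # step (5): follow pointers (copies only)

    print(scalar_vector_product([3, 1, 4, 1, 5, 9], 7))      # [21, 7, 28, 7, 35, 63]

As in the analysis above, every addition happens either in the base case or while accumulating a difference list; recovering the n answers themselves costs only copies.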
3.4.3 Nonsquare Matrices
Suppose we need to multiply a matrix A that is n rows by k columns times a matrix B that is k rows by m columns. Then we need to take k outer products, each of a column of A times the corresponding row of B. Nothing about the algorithm described so far requires that n = m or that either equals k. The only extension is that we need to decide whether to make the columns of A or the rows of B play the role of the vector V. Intuitively, the longer the vector, the fewer additions per element we need. Thus, the choice normally depends on picking columns of A if n > m and picking rows of B otherwise. The only time that choice might not be better is if there is so much parallelism available that we can process more than min(n, m) scalars at once.

3.4.4 Sparse Matrices
There are two cases to consider here. If the matrices are represented in the ordinary manner, as arrays, then the first step, where we sort and eliminate duplicates, essentially eliminates all the 0's. We cannot do better, because we have to look at the entire matrices, regardless of how sparse they are.

In the second case, the matrices A and B are represented as a set of triples (i, j, v), meaning that the element in row i and column j has the nonzero value v. We can assemble columns of A or rows of B by finding all the triples that belong to that row or column. The rows and columns are missing the 0 elements, but because we have location data for each element, we can then take the outer product of column k of A and row k of B by working with only the nonzero elements. Any product where one or both of the arguments is 0 would yield a 0 product and thus never influence the result anyway. The only significant modification to the algorithm is that along with the vector V we need to have a parallel vector of the locations of its elements.
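As an illustration of the triples-based case, the Python sketch below (our own construction; the stand-in scalar_vector_product would, in the scheme of this paper, be the addition-only routine of Section 2.2) assembles column k of A and row k of B from the triples and accumulates the k-th outer product using only nonzero entries, placing each result with the help of the parallel location data.

    from collections import defaultdict

    def sparse_matmul(a_triples, b_triples, n, m):
        """C = A * B, where A and B are given as sparse triples (i, j, v)."""
        a_cols = defaultdict(list)                    # column -> list of (row, value)
        for i, j, v in a_triples:
            a_cols[j].append((i, v))
        b_rows = defaultdict(list)                    # row -> list of (column, value)
        for i, j, v in b_triples:
            b_rows[i].append((j, v))

        def scalar_vector_product(vec, c):
            # Stand-in: the paper would compute this with the sort/difference/
            # accumulate routine of Section 2.2, using additions only.
            return [c * x for x in vec]

        result = [[0] * m for _ in range(n)]
        for k in set(a_cols) & set(b_rows):           # one outer product per shared index k
            rows, col_vals = zip(*a_cols[k])          # column k of A plus its row locations
            for j, b in b_rows[k]:                    # each nonzero of row k of B
                products = scalar_vector_product(list(col_vals), b)
                for i, p in zip(rows, products):
                    result[i][j] += p                 # location data places each product
        return result

    A = [(0, 0, 2), (1, 1, 3)]                        # 2x2 diagonal matrix as triples
    B = [(0, 0, 5), (1, 0, 7)]                        # 2x1 matrix as triples
    print(sparse_matmul(A, B, 2, 1))                  # [[10], [21]]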
Column (B) shows the lengths of the lists after taking differences and performing the same operations on the list of differences – align (if permitted), sort, and eliminate duplicates. Then, columns (C) and (D) represent the lengths of the lists that result after repeating this operation twice more. The last column gives the average number of additions that would be needed to multiply the initial vector of length n by a scalar. To be precise, it is 12 times column (D) (for the Russian-peasants multiplication of each element on the list of third differences), plus columns (A), (B), and (C), all divided by n.

4.1 Intuition
If we look at the first row of Fig. 2, we see a typical situation where the length of the vector, n, is much smaller than the number of possible integers. In this case, the chances that a random list of 1000 integers, each chosen from around 16 million 24-bit integers, would have even one duplicate are small. That remains true even if we use alignment, which in effect divides integers into about 8 million groups, each corresponding to a 24-bit odd integer. Moreover, as we see from column (B), even the differences between elements of the sorted list are almost all distinct. It is not until we take differences of the differences that we begin to see duplicates, as suggested by column (C). By the time we take third differences, there are very few distinct numbers indeed, as seen in column (D).

In contrast, let us look at the last two rows, where there are a million random integers initially, again chosen from the roughly 16 million possible 24-bit integers. Now there are good chances of seeing some duplication, and indeed we do. But the most significant effect is seen in column (B), where the number of distinct differences is tiny, even if we do not align. The reason is that when we have a million numbers out of 16 million possibilities, the average difference in the sorted list is only 16. Surely, there will be some larger gaps, but the chances of a really large gap, say over 100, are small, and there cannot be too many of those large gaps, because the sum of all the gaps is at most 2^24. As a result, the total work dealing with all the difference lists is negligible, and the vast majority of the work occurs computing the result of multiplying the list represented by column (A). In fact, we require less than one addition per scalar multiplication, because the number of duplicates in the original list exceeds the length of all the difference lists.

The effect of alignment is perhaps more profound than might be expected, especially for the smaller values of n. Our first assumption might be that alignment doubles the number of duplicates. That is, the 16 million possible integers are grouped into 8 million groups, each corresponding to an odd number. That would appear to double the chances that a randomly chosen integer has appeared before on the initial list. But we are in fact beneficiaries of the “class-size paradox.” The groups are not all the same size. For example, the group to which the integer 3 belongs has 23 members – all the 24-bit integers that are all 0's except for two consecutive 1's somewhere. So a random integer tends to belong to a larger group, and further it will be a duplicate if any of the members of that group have appeared previously. For example, if we compare the last two lines, we see that if we do not align the one million integers, we find an average of 29,228 duplicates, while if we do align we average 82,460 duplicates, almost three times as many.
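The following Python experiment (ours, and only a rough analogue of the runs summarized in Fig. 2; the parameters and helper names are our own choices) draws random 24-bit values, optionally aligns them by stripping factors of 2, and reports how quickly the sort / eliminate-duplicates / take-differences iteration shrinks the list.

    import random

    def odd_part(x):
        """Alignment: remove all factors of 2, leaving the odd part of x."""
        while x % 2 == 0:
            x //= 2
        return x

    def list_lengths(n, bits=24, align=False, rounds=4, seed=0):
        """Lengths of the successive distinct lists, analogous to columns (A)-(D)."""
        random.seed(seed)
        values = [random.randint(1, (1 << bits) - 1) for _ in range(n)]
        lengths = []
        for _ in range(rounds):
            if align:
                values = [odd_part(x) for x in values]
            s = sorted(set(values))                       # sort and eliminate duplicates
            lengths.append(len(s))
            values = [s[0]] + [s[j] - s[j - 1] for j in range(1, len(s))]   # differences
        return lengths

    print(list_lengths(1000))                   # a short list shrinks slowly at first
    print(list_lengths(1_000_000, align=True))  # a long list: the difference lists collapse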
5. AN UPPER BOUND ON REQUIRED ADDITION STEPS
The results of Section 4 are based on the assumption that vectors are chosen randomly and uniformly. Especially for long vectors, randomness forces difference lists to be small. However, in many applications it is not reasonable to assume uniformity in the distribution of values; there might well be many small values but also enough large values that the differences in the sorted list are mostly distinct.

We shall show that, regardless of the initial list, computing a scalar-vector product using only additions and the operations needed to sort, eliminate duplicates, align bit strings, and follow pointers, but not multiplication, requires only a small number of additions per element of the vector. The exact number of additions per element depends on the relationship between the length of the vector and the size of its elements, but it is certainly much less than the baseline of Section 1.3 (except perhaps in the uninteresting case of very small matrices).

Here is the intuition behind the proof. As we mentioned, when we have a list of integers, each no more than k, and we sort, eliminate duplicates, and take differences, the sum of those differences is at most k. The list of differences may have some small numbers, say numbers at most x – a value we shall select judiciously. But when we take differences a second time, there cannot be many differences among the small numbers, since their sum is at most x. There may also be some larger numbers, those bigger than x. However, there can be at most k/x of those, since the sum of all the numbers is at most k. Thus, when taking differences a second time, there cannot be many differences among the large numbers either. These two observations let us put a strong upper bound on the lengths of the lists as we repeat the sort / eliminate-duplicates / take-differences process.

5.1 Two Cost Functions
It is useful to define two mutually recursive cost functions, C and D.

• Define C(n, k) to be the number of additions needed to multiply, by a constant, a vector of n elements, each of which is a positive, odd integer no larger than k. (Note that k is 2^b if we are talking about b-bit integers.)

• Define D(n, k) to be the number of additions needed to multiply, by a constant, a vector of length n, whose elements are distinct, odd positive numbers that sum to k.

Observe that the significant difference between C and D is that in the former case, k is a bound on each individual element, while in the latter case, k bounds the sum of all the elements, and therefore represents a more limited class of vectors.

5.2 Bounds on the Cost Functions
We can observe four rules that let us put bounds on C(n, k) and D(n, k) mutually.

Rule 1: C(n, k) ≤ D(n, k) + n. This rule reflects the idea that we can align the given vector [v_1, v_2, ..., v_n] of length n, sort, eliminate duplicates, and take differences of the resulting vector. We shall surely have a difference vector of length no greater than n, and the sum of all its elements cannot exceed k, the maximum possible element in the original vector. Recursively multiply this vector by a constant c. We can then compute the value of each cv_i by accumulating the differences. This step uses at most n shift-and-add additions. The shifting is necessary because some of the differences have had factors of 2 removed during the alignment process. The only other additions needed are in the recursive computation of the product of c and the vector of differences. That cost is no greater than D(n, k).

Rule 2: D(n, k) ≤ C(n, x) + D(k/x, k) for any x. This rule doesn't involve any operations, per se. Rather, it states that we can conceptually treat the vector of length n as two vectors. The first consists of the "front" part – those elements less than or equal to x. The second is the "back" part – elements greater than x. For the front, we have an upper bound on each element; they can be no larger than x. We also have the condition from the definition of D, which is that the sum of these elements can be no greater than k, but we shall not use that. Presumably, the upper bound x on each element will prove stronger. For the back part, we know that the sum of these elements is not greater than k. We also know that there can be no more than k/x of these elements, since each is greater than x.
Rule 3: D(n, k) ≤ D(√k, k). Each of the elements on a list is odd and the elements are distinct. Therefore, given that the numbers sum to k, the longest the list can be is that r such that 1 + 3 + 5 + ··· + (2r − 1) ≤ k. Since the sum of the first r odd numbers is exactly r^2, it follows that √k is actually an upper bound on the length of the list described by D(n, k). Of course this observation is only valuable if n > √k, but the inequality of the rule is true regardless.

Rule 4: C(n, k) and D(n, k) are each no larger than n log_2 k. This rule holds because we could always use the Russian-peasants algorithm if nothing better is available.
(i + 2)n. To simplify, substitute j for i + 2. We can then
We can use these rules to prove the following theorem. It assert that if n ≥ j+1
2
k1/j log2 k then C(n, k) ≤ jn.
says, roughly, that the number of additions per element you
need to multiply a vector of length n by a constant, with
elements no larger than k, is logn k. Example √5.2. Suppose k = 224 . Then 2n additions suf-
fice if n ≥ 32 k log2 k = 147,456. Also, 3n additions suffice
if n ≥ 2k1/3 log2 k = 12,288. One needs at most 4n addi-
Theorem 5.1. If tions if n ≥ 52 k1/4 log2 k = 3840.
j + 1
n≥ k1/j log2 k Example 5.3. The purpose of this example is to address
2
concerns one may have regarding extreme cases where the
then C(n, k) ≤ jn. process sort-eliminate-duplicates-take-differences makes lit-
tle progress. Such cases exist, but they require that the size
of the matrix be very small – comparable to the number of
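Before turning to the proof, here is a quick numeric check of the hypothesis (a sketch of ours, anticipating Example 5.2 below): for k = 2^24, the threshold on n is evaluated for j = 2, 3, 4.

    import math

    def n_threshold(j, k):
        """Bound on n in the hypothesis of Theorem 5.1: if n >= this, C(n, k) <= j*n."""
        return (j + 1) / 2 * k ** (1 / j) * math.log2(k)

    k = 2 ** 24
    for j in (2, 3, 4):
        print(j, round(n_threshold(j, k)))   # 147456, 12288, 3840, matching Example 5.2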
Proof. Start with C(n, k), but let us replace k by x_0 for reasons that will become obvious. By Rule 1 we have

    C(n, x_0) ≤ D(n, x_0) + n

Apply Rule 2 to give us

    C(n, x_0) ≤ C(n, x_1) + D(x_0/x_1, x_0) + n

We may alternate Rules 1 and 2 as many times as we like, ending with Rule 1, and introducing a new unknown x_j each time we do, to get

    C(n, x_0) ≤ D(n, x_i) + Σ_{j=0}^{i−1} D(x_j/x_{j+1}, x_j) + (i + 1)n

Next, apply Rule 3 to the first term on the right, which gives us

    C(n, x_0) ≤ D(√x_i, x_i) + Σ_{j=0}^{i−1} D(x_j/x_{j+1}, x_j) + (i + 1)n

Now, we choose values for each of the x_j's in order to make the terms equal. That is, pick x_p = k^{(i+2−p)/(i+2)} for p = 0, 1, ..., i. In particular, x_i = k^{2/(i+2)}, so the first term D(√x_i, x_i) becomes D(k^{1/(i+2)}, k^{2/(i+2)}). The summation Σ_{j=0}^{i−1} D(x_j/x_{j+1}, x_j) becomes Σ_{j=0}^{i−1} D(k^{1/(i+2)}, k^{(i+2−j)/(i+2)}). The first term D(k^{1/(i+2)}, k^{2/(i+2)}) fits into this sum, so we can write

    C(n, k) ≤ Σ_{j=0}^{i} D(k^{1/(i+2)}, k^{(i+2−j)/(i+2)}) + (i + 1)n

Note that on the left, we replaced x_0 by the original value k, for which it stood. Finally, we use Rule 4 to bound each of the terms in the above sum. That gives

    C(n, k) ≤ Σ_{j=0}^{i} ((i+2−j)/(i+2)) k^{1/(i+2)} log_2 k + (i + 1)n

The summation is 1/(i+2) times 2 + 3 + ··· + (i+2). The latter is the sum of the first i + 2 integers, although it is missing 1. That sum is therefore (i+2)(i+3)/2 − 1. We may drop the "−1" and conclude

    C(n, k) ≤ ((i+3)/2) k^{1/(i+2)} log_2 k + (i + 1)n

As long as the first term is at most n, we have C(n, k) ≤ (i + 2)n. To simplify, substitute j for i + 2. We can then assert that if n ≥ ((j+1)/2) k^{1/j} log_2 k, then C(n, k) ≤ jn.

Example 5.2. Suppose k = 2^24. Then 2n additions suffice if n ≥ (3/2)√k log_2 k = 147,456. Also, 3n additions suffice if n ≥ 2k^{1/3} log_2 k = 12,288. One needs at most 4n additions if n ≥ (5/2)k^{1/4} log_2 k = 3840.

Example 5.3. The purpose of this example is to address concerns one may have regarding extreme cases where the sort / eliminate-duplicates / take-differences process makes little progress. Such cases exist, but they require that the size of the matrix be very small – comparable to the number of bits used to represent elements.

For instance, suppose we start with powers of 2. If we do not align, then when we take differences we get the original vector with the highest power of 2 removed. However, in this case, n is equal to log_2 k + 1, and the hypothesis of Theorem 5.1 cannot be satisfied except in trivial cases – specifically, for k = 1 (i.e., a Boolean matrix), and the case k = 2, j = 1. Moreover, if we align, then all powers of 2 immediately become 1, and we actually need no additions at all, just shifts.

Another seemingly bad case is the Fibonacci numbers, where after sorting and taking differences we only lose the two highest values. For example, starting with the vector

    [1, 2, 3, 5, 8, 13, 21]

when we take differences we get [1, 2, 3, 5, 8]. Here, alignment doesn't help. But we still have the constraint that n must be logarithmic in k, although the base of the logarithm is now approximately 1.61. It is still not possible to find nontrivial values for n, k, and j to satisfy the hypothesis of Theorem 5.1.

6. ATTRIBUTION
Note: The algorithm described here was invented solely by Daniel Cussen. The proof of Section 5 is solely the work of Jeffrey Ullman.

7. REFERENCES
[1] J. Alman and V. V. Williams. A refined laser method and faster matrix multiplication. In D. Marx, editor, Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, Virtual Conference, January 10-13, 2021, pages 522–539. SIAM, 2021.
[2] K. E. Batcher. Sorting networks and their
applications. In AFIPS Spring Joint Computing
Conference, volume 32 of AFIPS Conference
Proceedings, pages 307–314. Thomson Book Company,
Washington D.C., 1968.
[3] A. Bhowmick, A. S. Rathore, M. Mishra, S. Mittal,
and R. Singhal. Realizing CNN inference in log-domain.
In Deployable AI Workshop (with AAAI), 2023.
[4] A. Fog. Instruction tables, 2021.
[5] N. Jouppi, C. Young, N. Patil, and D. Patterson.
Motivation for and evaluation of the first tensor
processing unit. IEEE Micro, 38(3):10–19, 2018.
[6] P. M. Kogge and H. S. Stone. A parallel algorithm for
the efficient solution of a general class of recurrence
equations. IEEE Trans. Computers, 22(8):786–793,
1973.
[7] R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman,
T. Zhao, S. Hadjis, A. Pedram, C. Kozyrakis, and
K. Olukotun. Plasticine: A reconfigurable architecture
for parallel patterns. In Proceedings of the 44th
Annual International Symposium on Computer
Architecture, ISCA ’17, pages 389–402, New York, NY,
USA, 2017. Association for Computing Machinery.
[8] D. E. Rumelhart, G. E. Hinton, and R. J. Williams.
Learning representations by back-propagating errors.
Nature, 323(6088):533–536, 1986.
[9] V. Strassen. Gaussian elimination is not optimal.
Numer. Math., 13(4):354–356, Aug. 1969.
[10] Wikipedia contributors. Binary multiplier —
Wikipedia, the free encyclopedia, 2021.
[11] Wikipedia contributors. Ancient Egyptian
multiplication — Wikipedia, the free encyclopedia,
2022. [Online; accessed 29-March-2023].