
Matrix Multiplication Using Only Addition

Daniel L. Cussen†, Jeffrey D. Ullman♯

†Fgemm SPA, ♯Stanford University

arXiv:2307.01415v1 [cs.DS] 4 Jul 2023

ABSTRACT
Matrix multiplication consumes a large fraction of the time taken in many machine-learning algorithms. Thus, accelerator chips that perform matrix multiplication faster than conventional processors or even GPUs are of increasing interest. In this paper, we demonstrate a method of performing matrix multiplication without a scalar multiplier circuit. In many cases of practical interest, only a single addition and a single on-chip copy operation are needed to replace a multiplication. It thus becomes possible to design a matrix-multiplier chip that, because it does not need time-, space- and energy-consuming multiplier circuits, can hold many more processors, and thus provide a net speedup.

1. INTRODUCTION
In this paper we show that when multiplying matrices, scalar multiplication is not really necessary and can be replaced by a surprisingly small number of additions. The advantage of performing matrix multiplication using only addition for arithmetic is that it then becomes feasible to build special-purpose chips with no multiplier circuits. Such chips will take up less space per on-chip processor, allowing more, but simpler, processors to be packed into a single chip. [7] is an example of an architecture that implements this strategy. Further, since a multiplier circuit can take significantly more time than addition or other typical machine operations, it is possible that the addition-only approach can be faster, even on conventional architectures. But there is little advantage if the number of additions needed to replace a single multiplication is large. Thus, we must show that in practice, very few additions are needed to replace a multiplication – fewer than one addition (plus a little more than one copy operation) in some cases of practical interest.

Before getting deep into the weeds, let us see the essential idea that the algorithm uses. Start with a vector of integers in some large range, say 1-to-k. We sort, eliminate duplicates, and take differences between consecutive numbers in the sorted order. The result is a new vector of integers, but they are much more constrained than the original list. Specifically, rather than simply being numbers in the range up to k, their sum is at most k. Thus, when we iterate the process of sorting, eliminating duplicates, and taking differences several times, the lists rapidly become much shorter than the original list. We can then multiply the original vector by any constant c by recursively multiplying the vector of differences by c and then obtaining the original vector by accumulating the differences after they have been multiplied by c.
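To see the shrinkage concretely, here is a minimal Python sketch of one round of the sort-eliminate-duplicates-take-differences step (the code and names are ours, for illustration only, not the authors' implementation); iterating it on a vector of 1000 random 12-bit integers prints rapidly decreasing lengths:

    import random

    def diff_step(values):
        # One preprocessing round: sort, drop duplicates, then take
        # differences of consecutive elements (with an implied 0 in
        # front, so the output sums to the largest input value).
        s = sorted(set(values))
        return [s[0]] + [s[i] - s[i - 1] for i in range(1, len(s))]

    random.seed(1)
    v = [random.randint(1, 2**12) for _ in range(1000)]  # n = 1000, k = 2^12
    while len(v) > 1:
        v = diff_step(v)
        print(len(v))  # lengths drop quickly, as Section 2.3 explains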

1.1 Motivation and Background
The rising importance of deep-learning and other machine-learning applications has made multiplication of large matrices take on a new importance. For example, backpropagation [8] is essentially a sequence of matrix multiplications. At the same time, special-purpose chips or boards, such as [7] [5], are proliferating. We therefore offer a new approach to matrix multiplication that:

1. Works for both sparse and dense matrices.

2. Is more efficient, the larger the matrices are.

3. Works better when the matrix elements require fewer bits, an important trend as people search for ways to make machine learning more efficient.

4. Supports a chip design that avoids multiplication circuits, thus saving chip real estate and allowing more arithmetic units to be placed on one chip.

5. Uses a very small number of additions in place of one scalar multiplication, thus offering an opportunity to speed up the computation, since multiplication can take significantly more time than addition.

The search for algorithms that multiply n-by-n matrices in less than the O(n^3) time taken by the straightforward algorithm has been ongoing for more than 50 years, from Strassen [9] at O(n^2.81) to the best known [1] at O(n^2.37). Unfortunately, all these algorithms to date, while they have better asymptotic running times than the obvious one, have constant factors that make them unattractive, even for very large matrices, and they also assume the matrices are dense.

Our central thesis is that it makes sense to replace multiplication by a small number of additions. This point can only be justified if the time taken for multiplication significantly exceeds the time taken for addition. [4] is a recent examination of the time taken by all instructions on many of the most commonly used processors. The conclusion is that typically, integer multiplication has 3–6 times the latency of integer addition.¹ Thus, replacing multiplication by one or two additions looks promising.

¹ When we talk of floating-point operations, the difference is much less. The reason is that when multiplying, we have only to add exponents, while a floating-point addition requires that we align the mantissas according to the difference in the exponents. However, we are not proposing to replace floating-point multiplications by floating-point additions. We are proposing to replace the multiplication of mantissas, which are integers, by integer addition.

The issue of how much space and energy a multiplier takes on a chip also argues that there are potential advantages to avoiding a multiplier altogether. The faster multipliers involve complex circuits (compressors, where several of the partial products are combined not into a single sum, but into two or more integers that themselves must be combined further [10]). Even more space- and energy-consuming circuits allow pipelined multiplication, where the same circuit can be used to do several multiplications at once. Since it is then possible, in principle, to produce the result of one multiplication in the same time as it takes to perform one integer addition, the utility of our addition-only method depends on being able to fit several adders in the space that one pipelined multiplier takes, and/or use less energy with the adder-only approach.

There is another approach to addition-only matrix multiplication. One can use a table lookup to convert numbers to the log domain, add the logarithms, and convert back [3]. In this way, one multiplication is replaced by one addition. This method works, although its efficiency depends on how large the log and antilog tables are, which in turn depends on how precise we wish the result to be. And at any rate, the cost of the two lookups must be included in the total cost.

1.2 Simplifying Assumptions
To describe simply the ideas behind our addition-based approach to matrix multiplication, we shall make several assumptions. First, we assume that we are multiplying n-by-n matrices, and that the elements of these matrices are b-bit integers. Typically, b would be 32 or 24, but even smaller integers are becoming more interesting. For example, the TPU (tensor-processing unit) [5] is based on a 16-bit floating-point number, which has effectively a 12-bit mantissa, counting the implied leading 1. Note that if elements are single-precision floats, they have (effectively) a 24-bit mantissa, and it is only the mantissas that need to be multiplied. The exponent portions are added, and there may be a shift of a single position needed in the product of mantissas. Section 3.4 will show how the ideas presented here can be extended more broadly, including to matrices of arbitrary shapes and to sparse matrices.

We assume that our processors have a "shift-and-add" operation add(x, y, i). This operation shifts the integer y left by i positions and adds the result to x. We assume y is a b-bit integer, x can be as large as 2b − 1 bits, and 0 ≤ i < b. That is, the effect of this operation is

    x := x + 2^i y

In addition, we shall assume processors have available the operations needed to sort, eliminate duplicates, connect elements of an initial list to their representation in the sorted list (i.e., follow pointers), and take the first differences of a sorted list. However, since the total number of operations of these types needed in the proposed algorithm is O(n^2 log n), and thus small compared with the total number of operations, we shall not count the operations of these types exactly.

1.3 Russian-Peasants Multiplication
Suppose we multiply n-by-n matrices P and Q with elements [p_ik] and [q_kj], respectively, each of which is a b-bit integer. The product has in position (i, j) the value Σ_{k=1}^{n} p_ik × q_kj. We therefore have to take n^3 products of b-bit integers. Another (n − 1)n^2 additions are needed to sum the terms that form each of the n^2 result elements, but we need those additions no matter how the multiplications are done. Thus, in discussions that follow, we shall not count these additions.

As a baseline for replacing multiplication by addition, we can always simulate a multiplication as a sequence of shift-and-add operations. This technique, which is often referred to as "Russian-Peasants Multiplication," uses only addition and was apparently known as far back as ancient Egypt [11]. The method is usually explained in decimal, but when numbers are expressed in binary, it is especially simple. Suppose we are multiplying p × q. Think of p as a bit string p_{b−1}p_{b−2}···p_0. Starting with x = 0, for each p_i that is 1 we perform the shift-and-add operation add(x, q, i); the result will be p × q.
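In code, the baseline is a few lines. The following Python sketch (ours, for reference, not the authors' code) implements the binary method exactly as just described, spending one addition per 1-bit of p:

    def add_shift(x, y, i):
        # The add(x, y, i) primitive: x := x + 2^i * y, where the
        # factor 2^i is applied by a left shift, not a multiplication.
        return x + (y << i)

    def russian_peasants(p, q):
        # Multiply p * q using only shift-and-add operations;
        # uses one addition per set bit in the binary form of p.
        x, i = 0, 0
        while p:
            if p & 1:
                x = add_shift(x, q, i)
            p >>= 1
            i += 1
        return x

    assert russian_peasants(27, 14) == 27 * 14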
We can thus replace n^3 multiplications by at most bn^3 additions. If elements of the matrices are chosen at random, then we would expect about half the bits to be 0, and therefore bn^3/2 is a better approximation to the number of additions required to simulate n^3 multiplications using the Russian-Peasants approach. However, in fact we can, in practical cases, do much better than b or b/2 additions per multiplication.

2. THE ADDITION-ONLY MATRIX-MULTIPLICATION ALGORITHM
We shall begin by describing the overall approach to matrix multiplication, followed by the details of the basic zero-multiplication algorithm. In Section 3.1, we shall give an additional idea that can reduce the number of additions needed still further. Section 5 will prove that the algorithm offered does use significantly fewer additions than the Russian-Peasants approach in cases of practical interest.

2.1 Overview of Matrix Multiplication
The goal is to multiply n-by-n matrices A and B. The elements of these matrices are b-bit integers. A common approach is to take n outer products, each the outer product of a column of A and the corresponding row of B. That is, the kth outer product applies to column [a_1k, . . . , a_nk] of A and the row [b_k1, . . . , b_kn] of B. The result of this outer product is n^2 scalar products a_ik b_kj for all i and j; we add this product to the element of the result that is in row i and column j.
The additions that accumulate these products are part of any matrix-multiplication algorithm. We therefore do not count them when tallying additions made by the proposed method. That is, we are only counting the additions that we use to replace the n^3 scalar products that are needed to perform the n outer products just described.

If we were only to multiply two b-bit scalars, we could make some small improvements to the Russian-Peasants approach in some cases, but the latter method is about as good as we can do. However, when matrices are large and the number of bits in their elements is small – exactly the common case for machine-learning applications – we can do significantly better by multiplying a vector of n integers, each of b bits, by a b-bit constant. We preprocess the vector in a way we shall describe, involving sorting, eliminating duplicates, and taking differences of successive numbers. Once preprocessed, we can multiply the vector by n constants, and thus compute one of the n outer products we need.

2.2 Vector-Scalar Multiplication
The algorithm for multiplying a vector [v_1, . . . , v_n] by a scalar c is suggested by Fig. 1. We assume that each v_i is a positive b-bit integer; signed integers can be handled by ignoring the sign until the products are created. The algorithm is recursive, in that it modifies the initial vector to produce a (typically) shorter vector [d_1, . . . , d_m]. Not only is this vector shorter, but the sum of its elements is at most 2^b. This constraint is much stronger than the constraint on the original vector – that each element be less than 2^b.

[Figure 1: Algorithm Outline. Top half, done once per vector: V = [v_1, . . . , v_n] is sorted into S = [s_1, . . . , s_m], with pointers P = [p_1, . . . , p_n] tracking the sorting order, and differences are taken to produce D = [d_1, . . . , d_m]. Bottom half, done once for each constant c: D′ = [cd_1, . . . , cd_m] is produced by recursion (multiply by c), the differences are accumulated to form S′ = [cs_1, . . . , cs_m], and the pointers are followed, with copies, to yield V′ = [cv_1, . . . , cv_n].]

Either using Russian-Peasants multiplication as a base case, or by applying the same ideas recursively, we produce the vector-scalar product [cd_1, . . . , cd_m], which is then used to produce the desired output [cv_1, . . . , cv_n]. The recursive part of the algorithm consists of the following steps.

1. Sort: Begin by sorting the given vector [v_1, . . . , v_n] in a way that eliminates duplicates. The resulting sorted vector [s_1, . . . , s_m] has s_1 < s_2 < · · · < s_m. It is necessary to create an array of pointers [p_1, . . . , p_n], where p_i gives the unique value j such that v_i = s_j. These pointers are suggested by the red, dotted lines in Fig. 1.²

2. Differences: Construct a new vector [d_1, . . . , d_m] giving the differences between successive elements of the sorted array, with an imaginary 0 preceding the first number. That is, d_1 = s_1 and d_i = s_i − s_{i−1} for i = 2, 3, . . . , m.

3. Recursion: Either by using Russian-Peasants multiplication as a base case, or using this algorithm recursively, produce the vector-scalar product [cd_1, . . . , cd_m].

4. Accumulate: Compute the product of c and the vector [s_1, . . . , s_m] by cs_1 = cd_1 and cs_i = cs_{i−1} + cd_i for i = 2, 3, . . . , m. Note that we use m − 1 additions here, and it is the only place, other than the recursive step, where additions are used.

5. Follow Pointers: For each v_i, the pointer p_i is that j such that v_i = s_j. Therefore, cv_i = cs_j. We may therefore copy the value cs_j just computed at the previous step and make that be cv_i.

² Depending on how a chip is organized, it may be more efficient to store the array of pointers inverted. That is, for each i there would be easy access to all the values of j for which v_j = s_i.

Observe that while step (1) requires O(n log n) time to sort, and step (2) requires O(n) time for subtractions, these are done only n times each. Thus, the total time spent for these two steps is O(n^2 log n) at most. It is only steps (4) and (5), each of which takes O(n) time but is repeated n^2 times, that require O(n^3) work. In particular, step (4) is where the O(n^3) additions occur, and step (5) requires O(n^3) copy operations.
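As a concrete, purely sequential rendering of these five steps, here is a Python sketch. The function names are ours, russian_peasants is the baseline sketched in Section 1.3, and pointers are 0-based rather than 1-based; a real implementation would run preprocess once per vector and reuse its outputs for every constant c:

    def preprocess(v):
        # Steps 1-2, done once per vector: sort, eliminate duplicates,
        # build pointers into the sorted list, and take first differences.
        s = sorted(set(v))
        position = {x: j for j, x in enumerate(s)}
        p = [position[x] for x in v]  # pointers: v[i] == s[p[i]]
        d = [s[0]] + [s[j] - s[j - 1] for j in range(1, len(s))]
        return p, d

    def times_scalar(v, c):
        # Steps 3-5: recursively multiply the difference vector by c,
        # accumulate (the only additions), then follow pointers (copies).
        if len(v) == 1:
            return [russian_peasants(v[0], c)]  # base case
        p, d = preprocess(v)
        cd = times_scalar(d, c)       # Step 3: recursion on the differences
        cs = [cd[0]]                  # Step 4: m - 1 additions
        for j in range(1, len(cd)):
            cs.append(cs[-1] + cd[j])
        return [cs[j] for j in p]     # Step 5: follow pointers and copy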
Example 2.1. Suppose we start with the vector

    V = [3, 1, 4, 1, 5, 9]

When we sort and eliminate duplicates (Step 1), we get the vector S = [1, 3, 4, 5, 9] and the vector P = [2, 1, 3, 1, 4, 5]. That is, the first element of P is 2 because the first element of V, which is 3, is in the second position of S. The second element of P is 1, because the second element of V, which is 1, has moved to position 1 in S, and so on.

Next, take differences of S (Step 2) to get D = [1, 2, 1, 1, 4]. That is, the first element of D is the first element of S, i.e., 1. The second element of D is the difference between the first and second elements of S, that is, 3 − 1 = 2, and so on.

For the recursion, Step 3, we treat the vector D as if it were V. We sort and eliminate duplicates, to get [1, 2, 4]. When we take differences, we have [1, 1, 2]. Sort and eliminate duplicates again, and we get [1, 2], whose difference vector is [1, 1]. Now, when we sort and eliminate duplicates we have only the vector [1]. We can multiply this vector by any constant c using Russian-Peasants multiplication.³

³ In fact, multiplication by 1 is trivial, and we really do not need any operation at all in this special case. However, in rare cases, the limit of the recursion will be a vector whose length is 1 but whose one element is bigger than 1. It may also make sense to stop the recursion before the vector length reaches 1.

Suppose that c = 5; that is, we want to compute 5V. Let us assume that recursively, we have managed to compute D′ = 5D, the vector [5, 10, 5, 5, 20]. We compute S′ = 5S, the vector S′ = [5, 15, 20, 25, 45], by accumulating the elements of D′ (Step 4). That is, the first element of S′ is the first element of D′, i.e., 5. The second element of S′ is the sum of the first element of S′ and the second element of D′, or 5 + 10 = 15, and so on.

Lastly, we construct V′ = 5V by applying the vector of pointers P = [2, 1, 3, 1, 4, 5] to the vector S′ = [5, 15, 20, 25, 45] (Step 5). The first element of V′ is the second element of S′, or 15. The second element of V′ is the first element of S′, or 5, and so on. The result is V′ = [15, 5, 20, 5, 25, 45].

In practice, we would compute a complete outer product by multiplying V by n different constants, of which 5 was just an example. We use the same D and P vectors for each of the n constants, so there is no duplication of work.
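Running the hypothetical sketch of Section 2.2 on these numbers reproduces the example (with 0-based pointers):

    V = [3, 1, 4, 1, 5, 9]
    P, D = preprocess(V)
    print(P)                   # [1, 0, 2, 0, 3, 4], the 0-based form of [2, 1, 3, 1, 4, 5]
    print(D)                   # [1, 2, 1, 1, 4]
    print(times_scalar(V, 5))  # [15, 5, 20, 5, 25, 45]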
2.3 How Many Additions?
As mentioned, we shall later prove an upper bound on the number of additions needed by this algorithm; see Section 5. But here, let us look at the intuition why we would expect relatively few additions to be needed. Intuitively, it is because when n is large and b is small, there can't be too many different differences, and therefore, as we recurse, the lengths of the vectors involved drop quickly.

Example 2.2. Suppose we have a vector of n = 1000 randomly chosen 12-bit integers. When we sort them, we expect to find some duplicates, but not many. So at the first level of recursion, we would have m, the length of the sorted list, close to 1000. But at the second level, we note that the vector of differences, [d_1, . . . , d_m], has almost 1000 elements that sum to at most 2^12 = 4096. Thus, there cannot be many different values among these, and the value of m at the second level will be quite small compared with 1000. In particular, the sum of the first 91 integers exceeds 4096, so 90 is an absolute upper bound on the number of second differences.

For the third level of recursion, there may be so few that it makes sense to use Russian Peasants on them, or we may recurse a third time, and get a still smaller list. But the work at levels after the second will be negligible no matter how we do it.

Thus, we require at most 1000 additions at Step 4 of the first level, and at most 91 additions at Step 4 of the second level. Subsequent levels require even fewer additions, so the total number of additions required at all levels will be just a little more than 1000. In comparison, if we used Russian Peasants only, we would expect around 6000 additions.

Of course, the count of additions as suggested by Example 2.2 is for one vector-scalar multiplication. For the complete multiplication of n-by-n matrices, we would have to do n^2 such operations, for a total of a little more than n^3 additions (assuming again that n = 1000 and b = 12). But those additions replace the n^3 multiplications that would be needed, so we are still claiming a little more than one addition replacing one multiplication.
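These counts are easy to check empirically. The following illustrative counter (ours, and a slight simplification: it stops as soon as the sorted list collapses to one value) tallies the Step-4 additions at each level plus the Russian-Peasants base case:

    import random

    def count_additions(v):
        # m - 1 accumulation additions at this level, plus the additions
        # of deeper levels; the base case pays one addition per set bit.
        s = sorted(set(v))
        if len(s) == 1:
            return bin(s[0]).count("1")
        d = [s[0]] + [s[j] - s[j - 1] for j in range(1, len(s))]
        return (len(s) - 1) + count_additions(d)

    random.seed(2)
    v = [random.randint(1, 2**12) for _ in range(1000)]
    print(count_additions(v))  # typically a little over 1000; cf. ~6000 for the baseline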
2.4 Running Time of the Complete Algorithm
Remember that Figure 1 represents two different phases. The first phase, above the line, is done once for each row of the second matrix, i.e., n times. Thus, even though we are sorting a list of length n, which takes O(n log n) time, the total time spent above the line is O(n^2 log n). That cost can be neglected compared with the O(n^3) running time of the entire algorithm. However, if we think in terms of chip design, we do have to include on chip the capability of doing the sorting and setting up the pointer array that is implied by Fig. 1.

For the operations below the line in Fig. 1, they clearly take O(n) time. The constant factor includes the number of additions needed to replace one multiplication. For each of the n rows of the second matrix, we perform the operations below the line n times, so the total time taken is O(n^3) as it should be.

In this analysis, we have assumed a serial execution, which is not realistic or desirable if we are to design a special-purpose chip. Presumably we would design for parallel execution, for example implementing a parallel sort or processing several vector-scalar multiplications at the same time. However, O(n^3) will still be the measure of the number of operations the chip will execute.

3. IMPROVEMENTS TO THE BASIC ALGORITHM
There are a number of modifications to the algorithm of Section 2 that improve one or more aspects. Here, we shall mention alignment as a technique to reduce the length of vectors involved. We can take advantage of zeros, ones, and duplicates in the columns of the first matrix. We also mention ways to increase the available parallelism. And we address extensions to nonsquare and sparse matrices.
3.1 Alignment
If elements v and w of a vector differ by a factor that is a power of 2, then when we multiply the vector by any constant c, the products cv and cw will also have a ratio that is the same power of 2. Therefore, we can obtain cw from cv, or vice-versa, by shifting their binary representations. We can use this observation to treat v and w as if they were the same, if we make some small modifications to the basic algorithm of Fig. 1; a sketch of the needed bookkeeping follows the list.

1. Before sorting [v_1, . . . , v_n], we shift each element v_i right until it becomes an odd number (i.e., we drop 0's from the lower-order bits).

2. In addition to the vector of pointers [p_1, . . . , p_n], we need another vector H = [h_1, . . . , h_n], where h_i is the number of positions to the right that we have shifted v_i.

3. When constructing the result vector [cv_1, . . . , cv_n], we construct cv_i by first following the pointer p_i and then shifting the result h_i positions to the left. Alternatively, if we can shift and add in one step, we can perform this shifting when we add cv_i/2^{h_i} to the element of the result matrix to which it belongs.
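A minimal version of that bookkeeping, again extending our hypothetical times_scalar sketch from Section 2.2:

    def align(v):
        # Shift each element right until it is odd; remember the counts.
        h, odd = [], []
        for x in v:
            shifts = 0
            while x % 2 == 0:
                x >>= 1
                shifts += 1
            h.append(shifts)
            odd.append(x)
        return odd, h

    def times_scalar_aligned(v, c):
        # Multiply v by c, treating elements that differ only by a
        # power of 2 as the same element.
        odd, h = align(v)
        products = times_scalar(odd, c)   # products[i] == c * odd[i]
        return [cv << hi for cv, hi in zip(products, h)]  # restore the factors of 2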
Example 3.1. Suppose the given vector is

    V = [3, 7, 2, 12, 8, 6]

When we divide out powers of two, we are left with

    [3, 7, 1, 3, 1, 3]

When we sort and eliminate duplicates, we have vector S = [1, 3, 7]. The vector of pointers is P = [2, 3, 1, 2, 1, 2]. For instance, the first element of V, which is 3, appears in position 2 of S. The fourth element of V, which is 12, has also become 3 after removing factors of 2, so the fourth element of P is also 2. The vector H that records the number of positions shifted is H = [0, 0, 1, 2, 3, 1]. For example, the 3 in V is not shifted at all, while the 12 in V has been shifted two positions to the right.

There are two advantages to this alignment step. First, it evidently reduces the number of elements of V that are considered distinct. Thus, it reduces the length of the sorted list and the length of the vector of differences. But it has another, more subtle, effect. The elements of the sorted list are all odd. Therefore, all differences other than (perhaps) the first are even. Thus, when called recursively the first time, differences have at most b − 1 bits after shifting right to eliminate trailing 0's.

3.2 Modified Scalar Multiplication
We have described the algorithm as taking a column of the first matrix, and processing each of its n values independently. However, if there are duplicates among those values, as will likely be the case, we should first eliminate duplicates. Further, we do nothing for those values that are 0, and for values that are 1, no multiplication is needed; we can take the original vector V as the product.

We can also apply the trick of Section 3.1 to the columns of the first matrix. For instance, if 3 and 12 are both elements of a column, and we have computed 3V, then we do not also need to compute 12V; we can simply shift the values of the vector 3V two positions left.
3.3 Parallelism
Ideally, as much of the circuitry on a chip as possible should be active at any given time. The algorithm we have described is a serial algorithm, so it will not tend to keep things active. However, there are many opportunities to increase the parallelism in the chip. First, note that other than the circuit to sort, which itself can be parallelized in known ways, e.g. [2], most of the components needed are either registers to store the various vectors needed, or adder circuits. Moreover, as described in Fig. 1, all the vectors shown are handled with a single adder circuit. As the accumulation of differences (Step 4 in Section 2.2) appears serial, it is hard to parallelize this portion of the algorithm.⁴

⁴ Strictly speaking, there is a parallel algorithm [6] for computing the accumulated sums of n elements in O(log n) time, but this approach requires O(n log n) circuit components and takes (n log n)/2 additions, so it would negate the advantage of using adders in place of multipliers.

We can multiply the vector V by many different scalars c at the same time, but we need registers to store intermediate results for each c. That change may thus speed up the time needed by a large factor, but it doesn't change the ratio of space needed for registers compared with space for adders. Likewise, we can process many different rows V in parallel; that also speeds the process but doesn't change the space allocation.

There is one modification to the algorithm that will increase the ratio of adder space to register space. After sorting and eliminating duplicates, we can break the vector S into several segments: one segment for the smallest values, another for the next smallest values, and so on. We can then process each segment independently, in parallel. That change allows us to use many adders at once to accumulate differences for a single vector S.

There is a downside to this approach: When we take differences within the segments, the same difference may occur in many different segments, so we may be doing the same work several times without realizing it. But because we create segments after sorting, the sum of all the differences among all the segments has not changed; it is still at most 2^b. Thus, we can still expect a significant reduction in the total length of vectors after we take second differences. An example suggests what to expect.

Example 3.2. Let us reconsider the situation of Example 2.2, where we had 1000 12-bit integers and argued that there could be no more than 90 differences. After sorting, we might have fewer than 1000 elements, but let us assume the sorted vector S still has 1000 elements. Suppose we divide S into ten segments of 100 elements each and take differences within each segment. The sum of all those differences is still at most 4096.

Suppose that these differences divide as evenly as possible.⁵ Then each segment has differences totaling at most 410. Since the sum of the first 29 integers exceeds 410, there can be at most 28 differences in any of the ten segments, or a total of 280 differences. That number is much larger than the 90 differences that can occur if we do not divide S into ten segments, but it is much smaller than 1000; i.e., we are still guaranteed to reduce significantly the total length of all the segments when we take differences.

⁵ It can be shown that an even distribution is the worst case; i.e., it allows the largest number of differences.

A second possible approach is to divide a vector of length n into √n segments of length √n each. Accumulate the sum within each segment. Then, accumulate the final sums of the segments, to get the value that must be added to each member of each segment. That is, we must add to each element of the ith segment the sum of the last elements in each of the segments 1 through i − 1. This approach gives √n-fold parallelism, while requiring 2n + √n additions in place of the n additions that would be needed to do a sequential accumulation of all n elements.
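A sketch of that segmented accumulation (ours; in hardware each segment's inner loop would run on its own adder):

    import math

    def segmented_accumulate(d):
        # Prefix sums of d, computed in sqrt(n)-sized segments. Each
        # segment accumulates independently (in parallel, on its own
        # adder); a final pass adds to each segment the sum of all
        # earlier segments.
        n = len(d)
        width = max(1, math.isqrt(n))
        out, offset = [], 0
        for start in range(0, n, width):
            acc, local = 0, []
            for x in d[start:start + width]:
                acc += x
                local.append(acc)
            out.extend(offset + y for y in local)
            offset += acc
        return out

    assert segmented_accumulate([1, 2, 1, 1, 4]) == [1, 3, 4, 5, 9]  # recovers S of Example 2.1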
3.4 Extension to Other Matrix Forms
Matrices of interest are rarely square, dense matrices of nonnegative integers. Here is a list of some of the extensions that can be made to the basic algorithm.

3.4.1 Positive and Negative Integers
We have assumed that all matrix elements are nonnegative integers. But a sign can be carried along with each element and not used until it is time to place a sign on each scalar product. The algorithm for choosing the correct sign should be obvious.

3.4.2 Floating-Point Numbers
Multiplication of floating-point numbers involves multiplying the mantissas and adding the exponents. Multiplication of the mantissas is an integer multiplication. Adding of exponents is, of course, an addition, one that is necessary no matter how we perform the multiplication.

3.4.3 Nonsquare Matrices
Suppose we need to multiply a matrix A that is n rows by k columns times a matrix B that is k rows by m columns. Then we need to take k outer products, each of a column of A times the corresponding row of B. Nothing about the algorithm described so far requires that n = m or that either equals k. The only extension is that we need to decide whether to make the columns of A or the rows of B play the role of the vector V. Intuitively, the longer the vector, the fewer additions per element we need. Thus, the choice normally depends on picking columns of A if n > m and picking rows of B otherwise. The only time that choice might not be better is if there is so much parallelism available that we can process more than min(n, m) scalars at once.

3.4.4 Sparse Matrices
There are two cases to consider here. If the matrices are represented in the ordinary manner, as arrays, then the first step, where we sort and eliminate duplicates, essentially eliminates all the 0's. We cannot do better, because we have to look at the entire matrices, regardless of how sparse they are.

In the second case, the matrices A and B are represented as a set of triples (i, j, v) meaning that the element in row i and column j has the nonzero value v. We can assemble columns of A or rows of B by finding all the triples that belong to that row or column. The rows and columns are missing the 0 elements, but because we have location data for each element, we can then take the outer product of column k of A and row k of B by working with only the nonzero elements. Any product where one or both of the arguments is 0 would yield a 0 product and thus never influence the result anyway. The only significant modification to the algorithm is that along with the vector V we need to have a parallel vector that gives, for each element, the row (if V is a column of A) or the column (if V is a row of B).
4. EXPERIMENTAL RESULTS
Figure 2 shows the lengths of the lists that result from the following experiments. For four values of n ranging from one thousand to one million, we generated 100 lists of random 24-bit numbers. These lists were sorted and duplicates were eliminated. In some cases, we right-shifted (aligned) the numbers first to eliminate factors of 2. The lengths of the resulting sorted lists are shown in the column labeled (A); it and the following columns are averages rounded to the nearest integer.

         n   align?      (A)     (B)   (C)   (D)   +/*
      10^3       no     1000     985   228    39   2.68
      10^3      yes     1000     871    73    13   2.12
      10^4       no     9997    3963    72    17   1.42
      10^4      yes     9991    1395    28     6   1.15
      10^5       no    99706    1170    22     7   1.01
      10^5      yes    99119     470     9     3   1.00
      10^6       no   970772     193     6     3   0.97
      10^6      yes   917540      85     3     1   0.92

    Figure 2: Result of sorting and taking differences three times

Column (B) shows the lengths of the lists after taking differences and performing the same operations on the list of differences – align (if permitted), sort, and eliminate duplicates. Then, columns (C) and (D) represent the lengths of the lists that result after repeating this operation twice more. The last column gives the average number of additions that would be needed to multiply the initial vector of length n by a scalar. To be precise, it is 12 times column (D) (for the Russian-Peasants multiplication of each element on the list of third differences), plus columns (A), (B), and (C), all divided by n.
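These experiments are easy to re-run. The sketch below (ours) reports the four lengths for one random list; the figures above are averages over 100 such lists, so a single run will vary a little:

    import random

    def list_lengths(n, b=24, do_align=True, rounds=4):
        # Lengths (A)-(D): the list after each round of (optional)
        # alignment, sorting, duplicate elimination, and differencing.
        v = [random.randrange(1, 2**b) for _ in range(n)]
        lengths = []
        for _ in range(rounds):
            if do_align:
                v = [x >> ((x & -x).bit_length() - 1) for x in v]  # drop trailing 0 bits
            s = sorted(set(v))
            lengths.append(len(s))
            v = [s[0]] + [s[j] - s[j - 1] for j in range(1, len(s))]
        return lengths

    print(list_lengths(10**5))  # compare with the n = 10^5, align = yes row of Fig. 2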
4.1 Intuition
If we look at the first row of Fig. 2, we see a typical situation where the length of the vector, n, is much smaller than the number of possible integers. In this case, the chance that a random list of 1000 integers, each chosen from around 16 million 24-bit integers, would have even one duplicate is small. That remains true even if we use alignment, which in effect divides the integers into about 8 million groups, each corresponding to a 24-bit odd integer. Moreover, as we see from column (B), even the differences between elements of the sorted list are almost all distinct. It is not until we take differences of the differences that we begin to see duplicates, as suggested by column (C). By the time we take third differences, there are very few distinct numbers indeed, as seen in column (D).

In contrast, let us look at the last two rows, where there are a million random integers initially, again chosen from the roughly 16 million possible 24-bit integers. Now there are good chances of seeing some duplication, and indeed we do. But the most significant effect is seen in column (B), where the number of distinct differences is tiny, even if we do not align. The reason is that when we have a million numbers out of 16 million possibilities, the average difference in the sorted list is only 16. Surely, there will be some larger gaps, but the chance of a really large gap, say over 100, is small, and there cannot be too many of those large gaps, because the sum of all the gaps is at most 2^24. As a result, the total work dealing with all the difference lists is negligible, and the vast majority of the work occurs computing the result of multiplying the list represented by column (A). In fact, we require less than one addition per scalar multiplication, because the number of duplicates in the original list exceeds the length of all the difference lists.

The effect of alignment is perhaps more profound than might be expected, especially for the smaller values of n. Our first assumption might be that alignment doubles the number of duplicates. That is, the 16 million possible integers are grouped into 8 million groups, each corresponding to an odd number. That would appear to double the chances that a randomly chosen integer has appeared before on the initial list. But we are in fact beneficiaries of the "class-size paradox." The groups are not all the same size. For example, the group to which the integer 3 belongs has 23 members – all the 24-bit integers that are all 0's except for two consecutive 1's somewhere. So a random integer tends to belong to a larger group, and further it will be a duplicate if any of the members of that group have appeared previously. For example, if we compare the last two lines, we see that if we do not align the one million integers, we find an average of 29,228 duplicates, while if we do align we average 82,460 duplicates, almost three times as many.

5. AN UPPER BOUND ON REQUIRED ADDITION STEPS
The results of Section 4 are based on the assumption that vectors are chosen randomly and uniformly. Especially for long vectors, randomness forces difference lists to be small. However, in many applications it is not reasonable to assume uniformity in the distribution of values; there might well be many small values but also enough large values that the differences in the sorted list are mostly distinct.

We shall show that, regardless of the initial list, computing a scalar-vector product using only additions and the operations needed to sort, eliminate duplicates, align bit strings, and follow pointers, but not multiplication, requires only a small number of additions per element of the vector. The exact number of additions per element depends on the relationship between the length of the vector and the size of its elements, but it is certainly much less than the baseline of Section 1.3 (except perhaps in the uninteresting case of very small matrices).

Here is the intuition behind the proof. As we mentioned, when we have a list of integers, each no more than k, and we sort, eliminate duplicates, and take differences, the sum of those differences is at most k. The list of differences may have some small numbers, say numbers at most x – a value we shall select judiciously. But when we take differences a second time, there cannot be many differences among the small numbers, since their sum is at most x. There may also be some larger numbers, those bigger than x. However, there can be at most k/x of those, since the sum of all the numbers is at most k. Thus, when taking differences a second time, there cannot be many differences among the large numbers either. These two observations let us put a strong upper bound on the lengths of the lists as we repeat the sort-eliminate duplicates-take differences process.

5.1 Two Cost Functions
It is useful to define two mutually recursive cost functions, C and D.

• Define C(n, k) to be the number of additions needed to multiply, by a constant, a vector of n elements, each of which is a positive, odd integer no larger than k.⁶

• Define D(n, k) to be the number of additions needed to multiply, by a constant, a vector of length n, whose elements are distinct, odd positive numbers that sum to k.

⁶ Note that k is 2^b if we are talking about b-bit integers.

Observe that the significant difference between C and D is that in the former case, k is a bound on each individual element, while in the latter case, k bounds the sum of all the elements, and therefore represents a more limited class of vectors.

5.2 Bounds on the Cost Functions
We can observe four rules that let us put bounds on C(n, k) and D(n, k) mutually.

Rule 1: C(n, k) ≤ D(n, k) + n. This rule reflects the idea that we can align the given vector [v_1, v_2, . . . , v_n] of length n, sort, eliminate duplicates, and take differences of the resulting vector. We shall surely have a difference vector of length no greater than n, and the sum of all its elements cannot exceed k, the maximum possible element in the original vector. Recursively multiply this vector by a constant c. We can then compute the value of each cv_i by accumulating the differences. This step uses at most n shift-and-add additions. The shifting is necessary because some of the differences have had factors of 2 removed during the alignment process. The only other additions needed are in the recursive computation of the product of c and the vector of differences. That cost is no greater than D(n, k).

Rule 2: D(n, k) ≤ C(n, x) + D(k/x, k) for any x. This rule doesn't involve any operations, per se. Rather, it states that we can conceptually treat the vector of length n as two vectors. The first consists of the "front" part – those elements less than or equal to x. The second is the "back" part – elements greater than x. For the front, we have an upper bound on each element; they can be no larger than x. We also have the condition from the definition of D, which is that the sum of these elements can be no greater than k, but we shall not use that. Presumably, the upper bound x on each element will prove stronger. For the back part, we know that the sum of these elements is not greater than k. We also know that there can be no more than k/x of these elements, since each is greater than x.

Rule 3: D(n, k) ≤ D(√k, k). Each of the elements on a list is odd and the elements are distinct. Therefore, given that the numbers sum to k, the longest the list can be is that r such that 1 + 3 + 5 + · · · + (2r − 1) ≤ k. Since the sum of the first r odd numbers is exactly r^2, it follows that √k is actually an upper bound on the length of the list described by D(n, k). Of course this observation is only valuable if n > √k, but the inequality of the rule is true regardless.

Rule 4: C(n, k) and D(n, k) are each no larger than n log_2 k. This rule holds because we could always use the Russian-Peasants algorithm if nothing better is available.

We can use these rules to prove the following theorem. It says, roughly, that the number of additions per element you need to multiply a vector of length n by a constant, with elements no larger than k, is log_n k.

Theorem 5.1. If

    n ≥ ((j + 1)/2) k^{1/j} log_2 k

then C(n, k) ≤ jn.

Proof. Start with C(n, k), but let us replace k by x_0 for reasons that will become obvious. By Rule 1 we have

    C(n, x_0) ≤ D(n, x_0) + n

Apply Rule 2 to give us

    C(n, x_0) ≤ C(n, x_1) + D(x_0/x_1, x_0) + n

We may alternate Rules 1 and 2 as many times as we like, ending with Rule 1, and introducing a new unknown x_j each time we do, to get

    C(n, x_0) ≤ D(n, x_i) + Σ_{j=0}^{i−1} D(x_j/x_{j+1}, x_j) + (i + 1)n

Next, apply Rule 3 to the first term on the right, which gives us

    C(n, x_0) ≤ D(√x_i, x_i) + Σ_{j=0}^{i−1} D(x_j/x_{j+1}, x_j) + (i + 1)n

Now, we choose values for each of the x_j's in order to make the terms equal. That is, pick x_p = k^{(i+2−p)/(i+2)} for p = 0, 1, . . . , i. In particular, x_i = k^{2/(i+2)}, so the first term D(√x_i, x_i) becomes D(k^{1/(i+2)}, k^{2/(i+2)}). The summation Σ_{j=0}^{i−1} D(x_j/x_{j+1}, x_j) becomes Σ_{j=0}^{i−1} D(k^{1/(i+2)}, k^{(i+2−j)/(i+2)}). The first term D(k^{1/(i+2)}, k^{2/(i+2)}) fits into this sum, so we can write

    C(n, k) ≤ Σ_{j=0}^{i} D(k^{1/(i+2)}, k^{(i+2−j)/(i+2)}) + (i + 1)n

Note that on the left, we replaced x_0 by the original value k, for which it stood. Finally, we use Rule 4 to bound each of the terms in the above sum. That gives

    C(n, k) ≤ Σ_{j=0}^{i} k^{1/(i+2)} ((i+2−j)/(i+2)) log_2 k + (i + 1)n

The summation is 1/(i + 2) times 2 + 3 + · · · + (i + 2). The latter is the sum of the first i + 2 integers, although it is missing 1. That sum is therefore (i+2)(i+3)/2 − 1. We may drop the "−1" and conclude

    C(n, k) ≤ ((i + 3)/2) k^{1/(i+2)} log_2 k + (i + 1)n

As long as the first term is at most n, we have C(n, k) ≤ (i + 2)n. To simplify, substitute j for i + 2. We can then assert that if n ≥ ((j + 1)/2) k^{1/j} log_2 k, then C(n, k) ≤ jn.

Example 5.2. Suppose k = 2^24. Then 2n additions suffice if n ≥ (3/2)√k log_2 k = 147,456. Also, 3n additions suffice if n ≥ 2k^{1/3} log_2 k = 12,288. One needs at most 4n additions if n ≥ (5/2)k^{1/4} log_2 k = 3,840.

Example 5.3. The purpose of this example is to address concerns one may have regarding extreme cases where the process sort-eliminate-duplicates-take-differences makes little progress. Such cases exist, but they require that the size of the matrix be very small – comparable to the number of bits used to represent elements.

For instance, suppose we start with powers of 2. If we do not align, then when we take differences we get the original vector with the highest power of 2 removed. However, in this case, n is equal to log_2 k + 1, and the hypothesis of Theorem 5.1 cannot be satisfied except in trivial cases – specifically, for k = 1 (i.e., a Boolean matrix) and the case k = 2, j = 1. Moreover, if we align, then all powers of 2 immediately become 1, and we actually need no additions at all, just shifts.

Another seemingly bad case is the Fibonacci numbers, where after sorting and taking differences we only lose the two highest values. For example, starting with the vector

    [1, 2, 3, 5, 8, 13, 21]

when we take differences we get [1, 2, 3, 5, 8]. Here, alignment doesn't help. But we still have the constraint that n must be logarithmic in k, although the base of the logarithm is now approximately 1.61. It is still not possible to find nontrivial values for n, k, and j to satisfy the hypothesis of Theorem 5.1.

6. ATTRIBUTION
Note: The algorithm described here was invented solely by Daniel Cussen. The proof of Section 5 is solely the work of Jeffrey Ullman.

7. REFERENCES
[1] J. Alman and V. V. Williams. A refined laser method
and faster matrix multiplication. In D. Marx, editor,
Proceedings of the 2021 ACM-SIAM Symposium on
Discrete Algorithms, SODA 2021, Virtual Conference,
January 10 - 13, 2021, pages 522–539. SIAM, 2021.
[2] K. E. Batcher. Sorting networks and their
applications. In AFIPS Spring Joint Computing
Conference, volume 32 of AFIPS Conference
Proceedings, pages 307–314. Thomson Book Company,
Washington D.C., 1968.
[3] A. Bhowmick, A. S. Rathore, M. Mishra, S. Mittal,
and R. Singhal. Realizing cnn inference in log-domain.
In Deployable AI Workshop (with AAAI), 2023.
[4] A. Fog. Instruction tables, 2021.
[5] N. Jouppi, C. Young, N. Patil, and D. Patterson.
Motivation for and evaluation of the first tensor
processing unit. IEEE Micro, 38(3):10–19, 2018.
[6] P. M. Kogge and H. S. Stone. A parallel algorithm for
the efficient solution of a general class of recurrence
equations. IEEE Trans. Computers, 22(8):786–793,
1973.
[7] R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman,
T. Zhao, S. Hadjis, A. Pedram, C. Kozyrakis, and
K. Olukotun. Plasticine: A reconfigurable architecture
for parallel patterns. In Proceedings of the 44th
Annual International Symposium on Computer
Architecture, ISCA ’17, pages 389–402, New York, NY,
USA, 2017. Association for Computing Machinery.
[8] D. E. Rumelhart, G. E. Hinton, and R. J. Williams.
Learning Representations by Back-propagating Errors.
Nature, 323(6088):533–536, 1986.
[9] V. Strassen. Gaussian elimination is not optimal.
Numer. Math., 13(4):354–356, Aug. 1969.
[10] Wikipedia contributors. Binary multiplier —
Wikipedia, the free encyclopedia, 2021.
[11] Wikipedia contributors. Ancient egyptian
multiplication — Wikipedia, the free encyclopedia,
2022. [Online; accessed 29-March-2023].
