
Notes on Low Degree Toom-Cook Multiplication with

Small Characteristic

Marco Bodrato

Centro Interdipartimentale “Vito Volterra” – Università di Roma Tor Vergata


Via Columbia 2 – 00133 Rome, Italy

Abstract. The use of Toom-Cook sub-quadratic polynomial multiplication was recently
shown to be possible even when the coefficient field does not have enough elements,
particularly for F2[x]. This paper focuses on how Toom's strategies can be adapted to
polynomials over non-binary small fields. In particular we describe the Toom-3 algorithm
for Fp[x] with p ∈ {3, 5, 7}, and some other unbalanced algorithms.
Algorithms are described in full detail: not only the asymptotic complexity is given, but
also the exact sequence of operations for possibly optimal implementations.
The algorithms given here for Fp[x] work equally well for Fpn[x], and may be useful for
computations inside Fpn for large enough n.
Moreover, some connections with FFT-based multiplication are found, slightly improving
the worst cases of the FFT.

Keywords: Polynomial multiplication, finite fields, Toom-Cook, Karatsuba, optimal, convolution, squaring.

1 Introduction
In a previous paper [Bod07] the author faced the problem of writing fast multiplication algorithms,
using Toom-Cook splitting strategies [Too63a,Coo66], for polynomials over a field with very few
elements: F2[x]. This is a task some authors believe impossible, because “there are not enough
evaluation points”.
The main idea comes from the fact that Toom-Cook algorithms are efficient only for quite big
polynomials and should be used recursively. That is why we rewrite the operands as polynomials
in Fp[x][Y] with a small degree in Y. After this rewriting we can choose evaluation elements
not only in the field Fp, but in the bigger ring Fp[x].
In this work we focus on fields with characteristic 3, 5 and 7, because binary fields have already
received enough attention, and for bigger primes we can usually adapt the algorithms optimised for Z[x].

2 Toom-Cook Algorithm for Polynomials


We start by recalling Toom's algorithm, as generalised in [Bod07]. Starting from two polynomials
u, v ∈ R[x], over some integral domain R, we want to compute the product w = u · v ∈ R[x]. The
whole algorithm can be described in five steps.
Splitting : Choose some base Y = x^b, and represent u and v by means of two polynomials
u(y, z) = Σ_{i=0}^{n−1} u_i z^{n−1−i} y^i and v(y, z) = Σ_{i=0}^{m−1} v_i z^{m−1−i} y^i, both
homogeneous, with respectively n and m coefficients and degrees deg(u) = n − 1, deg(v) = m − 1,
such that u(x^b, 1) = u and v(x^b, 1) = v. The coefficients u_i, v_i ∈ R[x] are themselves
polynomials and can be chosen to have degrees deg(u_i) < b, deg(v_i) < b for all i.
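The Splitting step above can be illustrated with a short sketch (entirely ours; the little-endian list representation and the helper names `split`/`recompose` are assumptions, not from the paper): a polynomial over F_p is cut into n chunks of b coefficients each, so that concatenating the chunks evaluates the result at Y = x^b.

```python
# Sketch of the Splitting step: u in F_p[x] as a little-endian
# coefficient list is cut into n chunks u_i with deg(u_i) < b,
# so that u = sum_i u_i * x^(b*i), i.e. u(Y, 1) = u with Y = x^b.
def split(u, n, b, p):
    chunks = [u[i * b:(i + 1) * b] for i in range(n)]
    # pad the last chunk so every u_i has exactly b coefficients
    chunks[-1] = chunks[-1] + [0] * (b - len(chunks[-1]))
    return [[c % p for c in ui] for ui in chunks]

def recompose(chunks, b, p):
    # evaluating at Y = x^b just concatenates the chunks back
    out = []
    for ui in chunks:
        out.extend(ui)
    return [c % p for c in out]

u = [1, 2, 0, 1, 2, 2]         # 1 + 2x + x^3 + 2x^4 + 2x^5 over F_3
parts = split(u, 3, 2, 3)      # Toom-3 splitting: n = 3, b = 2
assert recompose(parts, 2, 3) == u
```

This assumes len(u) ≤ n·b; in a real implementation b is chosen from the operand length.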
2 Marco Bodrato (on-line version http: // bodrato. it/ papers/ #CIVV2007 )

Traditionally the Toom-n algorithm requires balanced operands, so that m = n, but we
can easily generalise to unbalanced ones. We assume commutativity, hence we can also assume
n ≥ m > 1.
Evaluation : We want to compute w = u · v, whose degree is d = n + m − 2, so we need
d + 1 = n + m − 1 evaluation points Pd = {(α_0, β_0), . . . , (α_d, β_d)}, where the α_i, β_i ∈ R[x]
can be polynomials. We define c = max_i {deg(α_i), deg(β_i)}.
The evaluation of a single polynomial (for example u) on the points (α_i, β_i) can be computed
with a matrix-by-vector multiplication, where E_{d,n} is the (d + 1) × n Vandermonde-like
matrix whose i-th row is (β_i^{n−1}, α_i β_i^{n−2}, . . . , α_i^{n−2} β_i, α_i^{n−1}):

    (u(α_0, β_0), u(α_1, β_1), . . . , u(α_d, β_d))^T = E_{d,n} · (u_0, u_1, . . . , u_{n−1})^T    (1)

Recursive multiplication : We compute w(α_i, β_i) = u(α_i, β_i) · v(α_i, β_i) for all i, with d + 1
multiplications of polynomials whose degrees are comparable to that of Y = x^b. We have
deg(u(α_i, β_i)) ≤ c(n − 1) + b, deg(v(α_i, β_i)) ≤ c(m − 1) + b, and for the results
deg(w(α_i, β_i)) ≤ c(n + m − 2) + 2b = cd + 2b. We note that c, d, m, n are fixed numbers for a
chosen implementation, while b grows as the operands grow.
Interpolation : This step depends only on the expected degree d of the result and on the d + 1
chosen points (α_i, β_i), no longer on n and m separately. We now need the coefficients of the
polynomial w(y, z) = Σ_{i=0}^{d} w_i z^{d−i} y^i. We know the values of w at d + 1 points, so
we face a classical interpolation problem: we apply the inverse of A_d, the (d + 1) × (d + 1)
Vandermonde-like matrix whose i-th row is (β_i^d, α_i β_i^{d−1}, . . . , α_i^{d−1} β_i, α_i^d):

    (w_0, w_1, . . . , w_d)^T = A_d^{−1} · (w(α_0, β_0), w(α_1, β_1), . . . , w(α_d, β_d))^T    (2)

Recomposition : The desired result can simply be computed with one more evaluation: w =
w(x^b, 1). This step requires at most d shifts and sums.

The two critical phases are evaluation and interpolation. As stated by formulas (1) and (2),
both require a matrix-by-vector multiplication. These two phases can require many sums and
subtractions, shifts, and even small multiplications or exact divisions (interpolation only) by
small elements of R[x]. In this paper we give explicit Evaluation Sequences of operations
(called ES from now on) as well as Interpolation Sequences (IS) for the described Toom algorithms,
paying much attention to giving sequences with the lowest possible computational cost.
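To make the five steps concrete, here is a minimal, self-contained instance over F_p[x] using the smallest case, Toom-2 (Karatsuba) with points {(0, 1), (1, 1), (1, 0)}. This harness is entirely ours and purely illustrative, not one of the paper's optimised sequences.

```python
# Toom-2 (Karatsuba) over F_p[x] as a five-step Toom instance:
# split, evaluate at (0,1), (1,1), (1,0), multiply, interpolate,
# recompose. Polynomials are little-endian coefficient lists.
def poly_mul(a, b, p):                # schoolbook product (recursive base)
    w = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            w[i + j] = (w[i + j] + x * y) % p
    return w

def poly_add(a, b, p, s=1):           # a + s*b, s = -1 gives subtraction
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a)); b = b + [0] * (n - len(b))
    return [(x + s * y) % p for x, y in zip(a, b)]

def toom2(u, v, p, b):
    u0, u1 = u[:b], u[b:]             # splitting: u = u0 + u1*Y, Y = x^b
    v0, v1 = v[:b], v[b:]
    # evaluation and recursive multiplication
    w0 = poly_mul(u0, v0, p)                                      # (0,1)
    w1 = poly_mul(poly_add(u0, u1, p), poly_add(v0, v1, p), p)    # (1,1)
    w2 = poly_mul(u1, v1, p)                                      # (1,0)
    # interpolation: w1 <- w1 - w0 - w2
    w1 = poly_add(w1, poly_add(w0, w2, p), p, -1)
    # recomposition: w0 + w1*Y + w2*Y^2
    out = [0] * (len(u) + len(v) - 1)
    for shift, part in ((0, w0), (b, w1), (2 * b, w2)):
        for i, c in enumerate(part):
            out[shift + i] = (out[shift + i] + c) % p
    return out

u = [1, 2, 0, 1]; v = [2, 1, 1, 2]    # two degree-3 operands over F_3
assert toom2(u, v, 3, 2) == poly_mul(u, v, 3)
```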

3 Searching for Optimality

The two previous papers [BZ07,Bod07] described techniques for the computer-aided search of optimal
Toom-n algorithms. Those techniques are valid for Z, Z[x] and F2[x] with very small adaptations.
Here we will not explain them from scratch, but only highlight the differences to take care of for
different polynomial rings.
In both papers we basically assume we have four operations: addition and subtraction,
together with multiplication and exact division, both by small fixed¹ constants. We also silently
assume that the costs of these operations are linear with respect to operand length; this assumption
is valid for any Fpn[x]. Readers unfamiliar with exact division can find a detailed explanation
in §A. Moreover, we assumed some relations between the costs of the various operations or
some combinations of them, and these relations can be very different for different characteristics.

3.1 Linear Combinations


The basic operation we really need is the linear combination l_i ← ±c_j · l_j/d_j ± c_k · l_k/d_k. We assume
it is possible, without temporary variables, for any small constants c_j, d_j, c_k, d_k ∈ R[x], even if
i = j or i = k. The costs of the various types of linear combinations are carefully classified in [BZ07],
but here we simply compute them by adding up the costs of the single operations, converted to
shifts whenever possible or skipped when the coefficient is trivial.
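A hypothetical rendering of this basic operation in Python, for dense coefficient rows over F_p (all names are assumptions of ours; exact divisions by field constants are realised as multiplications by the modular inverse, as discussed in the next section):

```python
# In-place linear combination l_i <- cj*l_j/d_j + ck*l_k/d_k over F_p,
# on rows of equal length; dj, dk must be invertible mod p.
def lincomb(rows, i, cj, dj, j, ck, dk, k, p):
    fj = cj * pow(dj, p - 2, p) % p      # cj/dj as one field constant
    fk = ck * pow(dk, p - 2, p) % p
    # building a fresh list makes i == j or i == k safe
    rows[i] = [(fj * a + fk * b) % p for a, b in zip(rows[j], rows[k])]

rows = [[0, 0], [1, 4], [2, 3]]
lincomb(rows, 0, 1, 2, 1, -1, 1, 2, 5)   # l0 <- l1/2 - l2  (mod 5)
assert rows[0] == [1, 4]                 # 1/2 = 3 mod 5, so (3-2, 12-3) mod 5
```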

3.2 Cost of Basic Operations


We keep assuming that addition and subtraction cost the same, and we call this cost
add.
We used to assume that exact division by a small constant always costs the same regardless of the
constant, but this is no longer true: if the constant is an invertible element δ ∈ Fpn of
the field, then the division reduces to a multiplication by the inverse δ^{−1} ∈ Fpn, which should
cost far less than exact division by a generic element of Fpn[x]. Vice versa, in Z[x] we could
count on a cost-free multiplication (or division) by the invertible element −1; now we can no
longer count on a simple sign change, since, depending on the representation, multiplication by −1 can
have a very different cost. That is why we admit operations like l1 ← l0 − l1, whose result equals
−(l1 − l0).

Doable shifts. We also assumed that some particular multiplications and divisions could be replaced
by fast shifting: by powers of 2 in Z[x] and by powers of x in F2[x]. Powers of x are simple
multipliers or divisors for any polynomial ring; the case is very different for powers of 2.
Multiplication by 2 cannot cost more than a sum, because we can compute a ← 2 · b as
a ← b + b, and this can even be faster than a generic sum, because we have to read only one
operand. Other powers of 2, instead, cannot be easily optimised, because of the needed reduction.
Division by 2 can be obtained with a multiplication by 2^{−1} ∈ Fp, but if the elements of Fp are
represented in memory as integers in the range [0..p − 1] we can also divide even numbers with a
simple shift, and odd numbers with a shift followed by a correction: the addition of 2^{−1} ∈ Fp. This
strategy gives the correct result because, for odd a,
a · 2^{−1} ≡ (a − 1) · 2^{−1} + 2^{−1} ≡ ((a − 1) ≫ 1) + 2^{−1} (mod p).
This trick also works with Montgomery's representation [Mon85], with only one difference: the 1
in the above formula is not the neutral element for multiplication, but the residue represented by
all zeroes except the least significant bit.
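The halving strategy can be sketched as follows for elements of F_p stored as integers in [0, p−1] (the function name is ours):

```python
# Division by 2 in F_p via shift-with-correction: even values are just
# shifted; odd values are shifted and corrected by adding 2^{-1} mod p.
def halve(a, p):
    half = (p + 1) // 2              # 2^{-1} mod p for any odd prime p
    if a % 2 == 0:
        return a >> 1
    # for 0 <= a < p the result is already reduced below p
    return ((a - 1) >> 1) + half

p = 5
for a in range(p):
    assert halve(a, p) == a * pow(2, p - 2, p) % p
```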

4 The Matrices
Two kinds of matrices are involved in any Toom-k algorithm: the square invertible matrix Ad, and
the two, possibly equal, matrices Ed,n, Ed,m with the same number d + 1 = 2k − 1 of rows, but
fewer (respectively n ≤ d and m ≤ d) columns.
¹ Here “small” means that we can store those constants in a CPU word, but it is not as important an
attribute as “fixed”, meaning asymptotically small; thus the operations are asymptotically linear.

4.1 Matrices for the Interpolation Sequence

Since the matrix from equation (2) must be invertible, we are interested in its determinant,
which can be computed from the points in Pd.
Theorem 1. For the Vandermonde-like matrix Ad generated from the d + 1 points in Pd =
{(α_0, β_0), . . . , (α_d, β_d)}, the determinant is

    det(Ad) = ∏_{0≤i<j≤d} (α_i β_j − α_j β_i)

Proof. It can easily be seen that a matrix with two points having β_i = β_j = 0 is not invertible, and
the above formula correctly gives 0.
If exactly one point, say (α_0, β_0), has β_0 = 0, the matrix will contain the row (0, . . . , 0, α_0^d).
Expanding the determinant along this row, we obtain α_0^d det(Ã_d), where Ã_d is the
complementary minor. Ã_d is a d × d Vandermonde matrix for the points α_i/β_i, in which the i-th
row was multiplied by β_i^d.
Using the classical formula for Vandermonde matrices, we obtain:

    det(Ad) = α_0^d det(Ã_d) = α_0^d ∏_{0<i≤d} β_i^d ∏_{0<i<j≤d} (α_i/β_i − α_j/β_j) = ∏_{0≤i<j≤d} (α_i β_j − α_j β_i)
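Theorem 1 is easy to spot-check numerically over Z; the points below are an arbitrary choice of ours, and the naive Leibniz determinant is adequate only for such tiny matrices:

```python
# Numerical check of Theorem 1: build A_d from d+1 projective points
# and compare det(A_d) with the pairwise product formula.
from itertools import permutations

def det(M):                          # Leibniz formula, exact over Z
    n = len(M); total = 0
    for perm in permutations(range(n)):
        inv = sum(perm[i] > perm[j]
                  for i in range(n) for j in range(i + 1, n))
        term = (-1) ** inv
        for row, col in enumerate(perm):
            term *= M[row][col]
        total += term
    return total

pts = [(0, 1), (1, 1), (-1, 1), (2, 1), (1, 0)]      # d = 4
d = len(pts) - 1
# row i of A_d is (b^d, a*b^(d-1), ..., a^d) for the point (a, b)
A = [[a**k * b**(d - k) for k in range(d + 1)] for a, b in pts]
formula = 1
for i in range(d + 1):
    for j in range(i + 1, d + 1):
        ai, bi = pts[i]; aj, bj = pts[j]
        formula *= ai * bj - aj * bi
assert det(A) == formula
```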

4.2 The Choice of Evaluation Points

The choice of the evaluation points Pd is one of the most important steps towards an optimal
implementation, and it completely determines the matrices Ad, Ed,2, . . . , Ed,d.
We consider two of them as automatically chosen: (0, 1) and (1, 0), representing re-
spectively 0 and ∞, immediately giving w_0 = u_0 · v_0 and w_d = u_{n−1} · v_{m−1}, and the rows
(1, 0, . . .), (. . . , 0, 1). Another good choice is the point (1, 1), and (if the characteristic is ≠ 2) (−1, 1).
We need an invertible matrix Ad, so if we use a point (α_i, β_i), no multiple of it (λα_i, λβ_i)
can be added, otherwise the factor (α_i λβ_i − λα_i β_i) would nullify the determinant.
Since the amount of extra space needed for the carries depends on the parameter
c = max_i {deg(α_i), deg(β_i)}, we try to keep it as small as possible. In F2[x] the use of degree-one
polynomials allows 9 possible couples, so that algorithms up to Toom-5 can be analysed. Even
for higher characteristic we restrict to degree one, because we are not seeking higher Toom
instances anyway.
In Z and in F2[x] we could prove that any Toom-k with k > 2 requires at least one division,
because of the value of the determinant. Working in Fpn[x] we have many invertible elements, and the
determinant of the matrices Ad, even for higher Toom, can be 1 or the inverse of a power of two.
On the other hand, when we use any non-constant polynomial in Fpn[x] \ Fpn, theorem 2 in
[Bod07, p.120] can be used to prove the need for a polynomial division.

4.3 Matrices for the Evaluation Sequence

The matrices Ed,n are non-square, so we cannot speak of a determinant, but we can compute
the rank.
Theorem 2. If the points Pd give an invertible Ad, then the rank of any Ed,n matrix is n.

Proof. Since the Ed,n are sub-matrices of the matrix Ad, modulo some multiplications of rows by
non-zero constants, all the n columns are linearly independent, so that the rank is n.

4.4 Unusual Evaluations


Dealing with general finite fields, one could consider some more general evaluation/interpolation
matrices.
In Z[x] one basically considers two possible evaluations of a polynomial p(x) of degree
d: p(a) with a ∈ Z, or b^d p(a/b). Thus if we have p(x) = p_0 + p_1 x + p_2 x^2 we usually evaluate
p(a) = p_0 + p_1 a + p_2 a^2 or b^2 p(1/b) = b^2 p_0 + p_1 b + p_2.
But in Fpn[x] one can also consider a general β p(α); with the quadratic example above we can
also obtain, for example, a p(a^{−1}) = a p_0 + p_1 + p_2 a^{−1}, giving a row (a, 1, a^{−1}) in the
matrix Ed,3 and a fairly easy evaluation sequence.
For squaring purposes we had better evaluate “both” operands with the same matrix, but
for general products we can even mix evaluations, for example:

    v(α, β) · u(β^{−1}, α^{−1}) = (αβ)^{−n+1} w(α, β)

or, more generally:

    ∀(α_v, β_v, α_u, β_u, γ, δ) ∈ R^6,  α_v β_u = α_u β_v  ⇒  ∃λ : γ v(α_v, β_v) · δ u(α_u, β_u) = λ w(α_v, β_v)

This new kind of mixed evaluation does not affect theorem 2, but it changes the determinant.
Anyway, the author did not find any optimal sequence with this kind of asymmetrical evaluation;
further investigation can be done in this direction.
From now on we will consider only plain evaluations, so that the algorithms can easily be
adapted from products to squarings.

5 Results and Algorithms in Characteristic 3


The algorithms described in the following sections were designed to work in F3[x], but can be
applied in general in characteristic 3. A recent work [AHM07] suggests that arithmetic in F3m
is very fast with a polynomial basis representation: this fact opens the way to experimenting
with the multiplication of elements of those fields with our algorithms. We skip Toom-2 because it
coincides with the well-known Karatsuba algorithm [KO62].
To verify the algorithms proposed in this section with GP/PARI, some lines defining coefficients
in F3 should be pre-inserted:
U0=u0*Mod(1,3) ; U1=u1*Mod(1,3) ; U2=u2*Mod(1,3) ; U3=u3*Mod(1,3)
V0=v0*Mod(1,3) ; V1=v1*Mod(1,3) ; V2=v2*Mod(1,3)

5.1 Choosing Evaluation/Interpolation Points


Possible values for α, β up to degree 1 are: {0, ±1, ±x, ±x + 1, ±x − 1}. Contrary to what happens
in characteristic 2, the square of a binomial is a trinomial. Even a simple evaluation using x ± 1
requires lots of computations; therefore, when possible, it is better to use ±x.

5.2 Toom-2.5 in F3 [x]


The Toom-2.5 algorithm can be used to multiply two operands whose sizes differ: in
particular, one is divided in 3 parts, the other in 2 parts.
Toom-2.5 requires four evaluations, so we need four “values”. In F2[x] we could not just
use elements of the field F2, so we also added the variable x. Here we use the three values we
have in F3, plus ∞, named in our homogenised notation {(1, 0), (1, 1), (−1, 1), (0, 1)}. This choice
is possible for all greater characteristics, and the algorithm described in [BZ07] for Z[x] can be
used. The only small trick possible in F3 is to avoid the division: playing with signs we can
have just one division, by −2 ≡ 1 (mod 3); the code is shown in figure 1.
The matrices are the same for any characteristic ≠ 2. Swapping rows, we can obtain det(A3) = ±2,
as needed.

E3,3 = [1,0,0; 1,1,1; 1,-1,1; 0,0,1]     E3,2 = [1,0; 1,1; -1,1; 0,1]
A3   = [1,0,0,0; 1,1,1,1; -1,1,-1,1; 0,0,0,1]

U = U2*X^2 + U1*X*Y + U0*Y^2
V = V1*X + V0*Y

\\ Evaluation: 5 add; 4 mul (n)
W3 = U0 + U2 ; W0 = V0 + V1
W2 = W3 + U1
W1 = W2 * W0
W3 = W3 - U1 ; W0 = V1 - V0
W2 = W3 * W0
W3 = U2 * V1 ; W0 = U0 * V0
\\ Interpolation: 4 add (2n)
W2 -= W1
W1 -= W2 + W3
W2 -= W0
\\ Recomposition
W = W3*X^3 + W2*X^2*Y + W1*X*Y^2 + W0*Y^3
W == U*V

Fig. 1. Toom-2.5 in F3
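The sequence of figure 1 can be transcribed and tested outside GP/PARI as well; the following Python harness (entirely ours) runs it on random coefficient polynomials over F_3 and compares the result with a schoolbook product:

```python
# Transcription of the Fig. 1 Toom-2.5 sequence over F_3[x];
# polynomials are little-endian coefficient lists of length b.
import random

p, b = 3, 4
def mul(a, c):
    w = [0] * (len(a) + len(c) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(c):
            w[i + j] = (w[i + j] + x * y) % p
    return w
def add(a, c, s=1):                   # a + s*c, coefficient-wise mod 3
    n = max(len(a), len(c))
    a = a + [0] * (n - len(a)); c = c + [0] * (n - len(c))
    return [(x + s * y) % p for x, y in zip(a, c)]

random.seed(1)
U = [[random.randrange(p) for _ in range(b)] for _ in range(3)]  # u0,u1,u2
V = [[random.randrange(p) for _ in range(b)] for _ in range(2)]  # v0,v1
# Evaluation: 5 add; 4 mul
W3 = add(U[0], U[2]); W0 = add(V[0], V[1])
W2 = add(W3, U[1])
W1 = mul(W2, W0)
W3 = add(W3, U[1], -1); W0 = add(V[1], V[0], -1)
W2 = mul(W3, W0)
W3 = mul(U[2], V[1]); W0 = mul(U[0], V[0])
# Interpolation: 4 add
W2 = add(W2, W1, -1)
W1 = add(W1, add(W2, W3), -1)
W2 = add(W2, W0, -1)
# Recomposition at Y = x^b, checked against a schoolbook product
u = U[0] + U[1] + U[2]; v = V[0] + V[1]
res = [0] * (len(u) + len(v) - 1)
for s, part in enumerate([W0, W1, W2, W3]):
    for i, c in enumerate(part):
        res[s * b + i] = (res[s * b + i] + c) % p
assert res == mul(u, v)
```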

5.3 Toom-3 in F3 [x]

Toom-3 is by far the best known and most widely used variant of the Toom-Cook algorithms,
but usually only in characteristic 0, for multiplication in Z or Z[x]. While writing this paper, no
implementation of balanced Toom-3 in F3[x] by other authors was found on the net, nor a detailed
description in the literature. Here we propose one.
Toom-3 has two variants: the balanced one, which is the most interesting, because it can be
used recursively, and the unbalanced one, good when one operand is about twice as big as the other.
The two variants share the same IS but have different evaluation matrices. The balanced version
uses E4,3 twice, while the unbalanced one uses E4,2 for the smaller operand and E4,4 for the
bigger one.
The set of points used is P4 = {(0, 1), (1, −1), (1, 1), (1, x), (1, 0)}; the code is shown in figure 2.
The IS needs one exact division, by the small constant element x^3 − x. This division can
be obtained by a shift and a division by x^2 − 1, possibly replacing the two central lines of the
interpolation with:

W3 = (W0 - W3)/x + W1 + W2*x + W4*x^3
W3 /= (1-x^2)

a new sequence requiring the same number of operations.
Moreover, the point (1, x) can be changed to some different (1, x^w), with a suitable w, maybe
the number of elements of the field stored in each CPU word, so that the shifts cost far less.
The drawback is that the operands get bigger after evaluation, and the recursive multiplications have
to manage longer values. In that case the division will be by x^{3w} − x^w = x^w (x^{2w} − 1), and the
algorithm in §A.3 can be useful.
When working in some F3n, or another extension, an element a ∈ F3n can be used instead of x.
In this case (a^3 − a) needs to be invertible, i.e. a ∉ F3.
E4,3 = [1,0,0; 1,-1,1; 1,1,1; 1,x,x^2; 0,0,1]
E4,4 = [1,0,0,0; 1,-1,1,-1; 1,1,1,1; 1,x,x^2,x^3; 0,0,0,1]
E4,2 = [1,0; 1,-1; 1,1; 1,x; 0,1]
A4   = [1,0,0,0,0; 1,-1,1,-1,1; 1,1,1,1,1; 1,x,x^2,x^3,x^4; 0,0,0,0,1]

Balanced version:
U = U2*Y^2 + U1*Y + U0
V = V2*Y^2 + V1*Y + V0
\\ Evaluation: 10 add, 4 shift; 5 mul (n)
W0 = U0 + U2 ; W2 = V0 + V2
W3 = W0 - U1 ; W4 = W2 + V1
W0 = W0 + U1 ; W2 = W2 - V1
W1 = W3 * W2 ; W2 = W0 * W4
W0 = U0 + (U1 + U2*x)*x
W4 = V0 + (V1 + V2*x)*x
W3 = W0 * W4
W0 = U0 * V0 ; W4 = U2 * V2

Unbalanced version:
U = U3*Y^3 + U2*Y^2 + U1*Y + U0
V = V1*Y + V0
\\ Evaluation: 10 add, 4 shift; 5 mul (n)
W0 = U0 + U2
W1 = U1 + U3
W3 = W0 - W1 ; W4 = V0 + V1
W0 = W0 + W1 ; W2 = V0 - V1
W1 = W3 * W2 ; W2 = W0 * W4
W0 = U0 + (U1 + (U2 + U3*x)*x)*x
W4 = V0 + V1*x
W3 = W0 * W4
W0 = U0 * V0 ; W4 = U3 * V1

Shared interpolation and recomposition:
\\ Interpolation: 9 add, 3 shift, 1 Sdiv (2n)
W1 -= W2
W2 -= W1 + W0 + W4
W3 -= W0 + W1*x + W2*x^2 + W4*x^4
W3 /= (x^3-x)
W1 -= W3
\\ Recomposition
W = W4*Y^4 + W3*Y^3 + W2*Y^2 + W1*Y + W0
W == U*V

Fig. 2. Toom-3 in F3[x]
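The balanced sequence of figure 2 can likewise be checked mechanically. In this Python sketch (the harness and the helper names are ours; only the operation sequence comes from the figure), the exact division by x^3 − x is a classical long division asserting a zero remainder:

```python
# Transcription of the balanced Fig. 2 Toom-3 sequence over F_3[x];
# polynomials are little-endian coefficient lists.
import random

p, b = 3, 5
def mul(a, c):
    w = [0] * (len(a) + len(c) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(c):
            w[i + j] = (w[i + j] + x * y) % p
    return w
def add(a, c, s=1):                    # a + s*c mod 3
    n = max(len(a), len(c))
    a = a + [0] * (n - len(a)); c = c + [0] * (n - len(c))
    return [(x + s * y) % p for x, y in zip(a, c)]
def xpow(a, k):                        # multiply by x^k (a shift)
    return [0] * k + a
def divexact(a, d):                    # exact division in F_3[x]
    a = a[:]; q = [0] * (len(a) - len(d) + 1)
    inv = pow(d[-1], p - 2, p)
    for i in reversed(range(len(q))):
        q[i] = a[i + len(d) - 1] * inv % p
        for j, c in enumerate(d):
            a[i + j] = (a[i + j] - q[i] * c) % p
    assert not any(a)                  # remainder must vanish
    return q

random.seed(3)
U = [[random.randrange(p) for _ in range(b)] for _ in range(3)]
V = [[random.randrange(p) for _ in range(b)] for _ in range(3)]
# Evaluation: 10 add, 4 shift; 5 mul
W0 = add(U[0], U[2]); W2 = add(V[0], V[2])
W3 = add(W0, U[1], -1); W4 = add(W2, V[1])
W0 = add(W0, U[1]); W2 = add(W2, V[1], -1)
W1 = mul(W3, W2); W2 = mul(W0, W4)
W0 = add(U[0], xpow(add(U[1], xpow(U[2], 1)), 1))
W4 = add(V[0], xpow(add(V[1], xpow(V[2], 1)), 1))
W3 = mul(W0, W4)
W0 = mul(U[0], V[0]); W4 = mul(U[2], V[2])
# Interpolation: 9 add, 3 shift, 1 exact division by x^3 - x
W1 = add(W1, W2, -1)
W2 = add(W2, add(W1, add(W0, W4)), -1)
W3 = add(W3, add(W0, add(xpow(W1, 1), add(xpow(W2, 2), xpow(W4, 4)))), -1)
W3 = divexact(W3, [0, 2, 0, 1])        # x^3 - x over F_3
W1 = add(W1, W3, -1)
# Recomposition at Y = x^b against a schoolbook product
u = U[0] + U[1] + U[2]; v = V[0] + V[1] + V[2]
res = [0] * (len(u) + len(v) - 1)
for s, part in enumerate([W0, W1, W2, W3, W4]):
    for i, cc in enumerate(part):
        res[s * b + i] = (res[s * b + i] + cc) % p
assert res == mul(u, v)
```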

6 Results and Algorithms in Characteristic 5

The algorithms described in the following sections were designed to work in F5[x], but can be
applied in general in characteristic 5. Again we skip Toom-2; we could skip Toom-2.5 too, because
it works as in Z[x], but a small consideration is given here to introduce Toom-3.5, which is
described in full detail. The classical and most important Toom-3 is then given. The fact that
x^4 ≡ 1 for all non-zero elements x ∈ F5 gives many cancellations and the shortest sequences,
both for the 3-way and for the unbalanced Toom-3.5.
To verify algorithms proposed in this section with GP/PARI, some lines defining coefficients
in F5 should be pre-inserted:

U0=u0*Mod(1,5); U1=u1*Mod(1,5); U2=u2*Mod(1,5); U3=u3*Mod(1,5); U4=u4*Mod(1,5)
V0=v0*Mod(1,5); V1=v1*Mod(1,5); V2=v2*Mod(1,5)

6.1 Toom-2.5 in F5 [x], FFT-like

The Toom-2.5 algorithm was dissected in §5.2; in F5[x] the algorithm described
in [BZ07] can be used. The only possible trick is to play with signs, obtaining det(A3) = −2, so
that the division is replaced by a product by 2 = −2^{−1}.
With different sets of evaluation points we can even obtain det(A3) = 1 and sequences with
neither multiplications nor divisions of any kind, but they require a greater number of operations,
so we discarded them; after all, a product by two is nothing but one more addition.

6.2 Toom-3 in F5 [x]


With the same choice of points as in Z, the determinant can be 2 or −2 = 2−1 , so that only one
product (or division) by two is needed.
Again we will give the two variants for Toom-3: the balanced and the unbalanced one. In the
code shown for the balanced product, we give two different ES, one for each operands: they are
equivalent with respect to operation count. The preferred one can be chosen during implementa-
tion.
The set of points used is P4 = {(0, 1), (1, −1), (1, 1), (1, 2), (1, 0)}; code is shown in figure 3

E4,3 = [1,0,0; 1,-1,1; 1,1,1; 1,2,-1; 0,0,1]
E4,4 = [1,0,0,0; 1,-1,1,-1; 1,1,1,1; 1,2,-1,-2; 0,0,0,1]
E4,2 = [1,0; 1,-1; 1,1; 1,2; 0,1]
A4   = [1,0,0,0,0; 1,-1,1,-1,1; 1,1,1,1,1; 1,2,-1,-2,1; 0,0,0,0,1]

Balanced version:
U = U2*Y^2 + U1*Y + U0
V = V2*Y^2 + V1*Y + V0
\\ Evaluation: 10 add, 2 shift; 5 mul (n)
W3 = U2 + U0 ; W1 = V2 + V0
W4 = W1 + V1
W0 = W3 + U1
W2 = W0 * W4
W3 = W3 - U1 ; W1 = W1 - V1
W0 = (W0 + U2)*2 ; W4 = V0 + V1*2
W0 = W0 - U0 ; W4 = W4 - V2
W1 = W3 * W1 ; W3 = W0 * W4
W4 = U2 * V2 ; W0 = U0 * V0

Unbalanced version:
U = U3*Y^3 + U2*Y^2 + U1*Y + U0
V = V1*Y + V0
\\ Evaluation: 10 add, 1 shift; 5 mul
W3 = U2 + U0 ; W4 = V0 + V1
W1 = U3 + U1
W0 = W3 + W1
W2 = W0 * W4
W3 = W3 - W1 ; W1 = V0 - V1
W0 = (U1 - U3)*2 ; W4 = W4 + V1
W0 = W0 + U0 - U2
W1 = W3 * W1 ; W3 = W0 * W4
W4 = U3 * V1 ; W0 = U0 * V0

Shared interpolation and recomposition:
\\ Interpolation: 7 add, 2 shift (2n)
W3 -= W1
W1 -= W2 ; W1 *= 2
W2 -= W0 + W4
W3 += W2 * 2
W2 -= W1
W1 -= W3
\\ Recomposition
W = W4*Y^4 + W3*Y^3 + W2*Y^2 + W1*Y + W0
W == U*V

Fig. 3. Toom-3 in F5[x]
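For scalar coefficients (b = 1) the balanced sequence of figure 3 can even be verified exhaustively over F_5; the transcription below is ours, with /2 realised as multiplication by 3 = 2^{−1} mod 5:

```python
# Exhaustive scalar check of the balanced Fig. 3 sequence over F_5.
from itertools import product

p = 5
def toom3_f5(u0, u1, u2, v0, v1, v2):
    W3 = (u2 + u0) % p; W1 = (v2 + v0) % p
    W4 = (W1 + v1) % p
    W0 = (W3 + u1) % p
    W2 = W0 * W4 % p
    W3 = (W3 - u1) % p; W1 = (W1 - v1) % p
    W0 = (W0 + u2) * 2 % p; W4 = (v0 + v1 * 2) % p
    W0 = (W0 - u0) % p;     W4 = (W4 - v2) % p
    W1 = W3 * W1 % p; W3 = W0 * W4 % p
    W4 = u2 * v2 % p; W0 = u0 * v0 % p
    # Interpolation: 7 add, 2 shift
    W3 = (W3 - W1) % p
    W1 = (W1 - W2) * 2 % p
    W2 = (W2 - W0 - W4) % p
    W3 = (W3 + W2 * 2) % p
    W2 = (W2 - W1) % p
    W1 = (W1 - W3) % p
    return [W0, W1, W2, W3, W4]

for u in product(range(p), repeat=3):
    for v in product(range(p), repeat=3):
        ref = [0] * 5
        for i in range(3):
            for j in range(3):
                ref[i + j] = (ref[i + j] + u[i] * v[j]) % p
        assert toom3_f5(*u, *v) == ref
```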

6.3 Toom-3.5 in F5 [x]


The unbalanced Toom-3.5 in F5[x] is interesting for two reasons. First of all, it can be done
using only products and divisions by 2, with a sequence very similar to an FFT; this relationship
will be further investigated in §8. Moreover, it has two versions, the slightly unbalanced one (4 parts
versus 3) and the strongly unbalanced one (5 parts versus 2), with quite interesting evaluation
sequences: in particular, the 5-part evaluation is only a slight modification of the 4-part one,
because the matrix repeats.
The set of points used is P5 = {(0, 1), (1, 2), (1, −2), (1, −1), (1, 1), (1, 0)}. The code in figure 4
uses those points and the resulting matrices.
E5,4 = [1,0,0,0; 1,2,-1,-2; 1,-2,-1,2; 1,-1,1,-1; 1,1,1,1; 0,0,0,1]
E5,3 = [1,0,0; 1,2,-1; 1,-2,-1; 1,-1,1; 1,1,1; 0,0,1]
E5,5 = [1,0,0,0,0; 1,2,-1,-2,1; 1,-2,-1,2,1; 1,-1,1,-1,1; 1,1,1,1,1; 0,0,0,0,1]
E5,2 = [1,0; 1,2; 1,-2; 1,-1; 1,1; 0,1]
A5   = [1,0,0,0,0,0; 1,2,-1,-2,1,2; 1,-2,-1,2,1,-2; 1,-1,1,-1,1,-1; 1,1,1,1,1,1; 0,0,0,0,0,1]

4-by-3 version:
U = U3*Y^3 + U2*Y^2 + U1*Y + U0
V = V2*Y^2 + V1*Y + V0
\\ Evaluation: 14 add, 2 shift; 6 mul
W1 = U0 + U2 ; W5 = V0 + V2
W3 = U1 + U3
W2 = W1 - W3 ; W4 = W5 - V1
W1 = W1 + W3 ; W5 = W5 + V1
W3 = W2 * W4 ; W4 = W1 * W5
W2 = (U1 - U3)*2 ; W0 = V0 + V1*2
W1 = U0 - U2 ; W0 = W0 - V2
W5 = W1 - W2
W2 = W2 + W1
W1 = W2 * W0 ; W0 = W0 + V1
W2 = W5 * W0
W5 = U3 * V2 ; W0 = U0 * V0

5-by-2 version:
U = U4*Y^4 + U3*Y^3 + U2*Y^2 + U1*Y + U0
V = V1*Y + V0
\\ Evaluation: 13 add, 1 shift; 6 mul (n)
W0 = U0 + U4
W1 = W0 + U2
W3 = U1 + U3
W2 = W1 - W3 ; W4 = V0 - V1
W1 = W1 + W3 ; W5 = V0 + V1
W3 = W2 * W4 ; W4 = W1 * W5
W2 = (U1 - U3)*2
W1 = W0 - U2 ; W0 = W5 + V1
W5 = W1 - W2
W2 = W2 + W1
W1 = W2 * W0 ; W0 = W0 + V1
W2 = W5 * W0
W5 = U4 * V1 ; W0 = U0 * V0

Shared interpolation and recomposition:
\\ Interpolation: 10 add, 4 shift (2n)
W4 = (W4 + W3)/2
W3 = W4 - W3
W1 = W1 - W2
W2 = W2 - W1*2
W4 = (W4 + W2)/2
W2 = W4 - W2
W3 = (W3 + W1)/2
W1 = W3 - W1
W1 = W1 - W5
W4 = W4 - W0
\\ Recomposition
W = W5*Y^5 + W4*Y^4 + W3*Y^3 + W2*Y^2 + W1*Y + W0
W == U*V

Fig. 4. Toom-3.5 in F5[x]
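The 4-by-3 sequence of figure 4 admits the same exhaustive scalar check over F_5 (again /2 becomes multiplication by 3; the harness is ours):

```python
# Exhaustive scalar check of the 4-by-3 Fig. 4 sequence over F_5.
from itertools import product

p = 5
def toom35_f5(u0, u1, u2, u3, v0, v1, v2):
    W1 = (u0 + u2) % p; W5 = (v0 + v2) % p
    W3 = (u1 + u3) % p
    W2 = (W1 - W3) % p; W4 = (W5 - v1) % p
    W1 = (W1 + W3) % p; W5 = (W5 + v1) % p
    W3 = W2 * W4 % p; W4 = W1 * W5 % p
    W2 = (u1 - u3) * 2 % p; W0 = (v0 + v1 * 2) % p
    W1 = (u0 - u2) % p; W0 = (W0 - v2) % p
    W5 = (W1 - W2) % p
    W2 = (W2 + W1) % p
    W1 = W2 * W0 % p; W0 = (W0 + v1) % p
    W2 = W5 * W0 % p
    W5 = u3 * v2 % p; W0 = u0 * v0 % p
    # Interpolation: 10 add, 4 shift; /2 is *3 = 2^{-1} mod 5
    W4 = (W4 + W3) * 3 % p
    W3 = (W4 - W3) % p
    W1 = (W1 - W2) % p
    W2 = (W2 - W1 * 2) % p
    W4 = (W4 + W2) * 3 % p
    W2 = (W4 - W2) % p
    W3 = (W3 + W1) * 3 % p
    W1 = (W3 - W1) % p
    W1 = (W1 - W5) % p
    W4 = (W4 - W0) % p
    return [W0, W1, W2, W3, W4, W5]

for u in product(range(p), repeat=4):
    for v in product(range(p), repeat=3):
        ref = [0] * 6
        for i in range(4):
            for j in range(3):
                ref[i + j] = (ref[i + j] + u[i] * v[j]) % p
        assert toom35_f5(*u, *v) == ref
```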

7 Results and Algorithms in Characteristic 7


The algorithms described in the following sections were designed to work in F7[x], but can be
applied in general in characteristic 7. Again we skip Toom-2, and Toom-2.5 too, because they
work as in Z[x]; only some specifications for Toom-3 are given. It would be interesting to analyse
also the 4-way and the unbalanced Toom-4.5, where cancellations due to the relation x^6 ≡ 1
should show up, but this is left for future implementations.
To verify the algorithms proposed in this section with GP/PARI, some lines defining coefficients
in F7 should be pre-inserted:

U0=u0*Mod(1,7) ; U1=u1*Mod(1,7) ; U2=u2*Mod(1,7) ; U3=u3*Mod(1,7)
V0=v0*Mod(1,7) ; V1=v1*Mod(1,7) ; V2=v2*Mod(1,7)

7.1 Toom-3 in F7 [x]

With the same choice of points as in Z, the determinant can be ±2, requiring one shift. Other
choices are possible, even with determinant 1, but no shorter ES were found.
We propose here the sequences for balanced and unbalanced operands with the usual set of
points P4 = {(0, 1), (1, 2), (1, 1), (1, −1), (1, 0)}. The code and the matrices are given in figure 5.

E4,3 = [1,0,0; 1,2,4; 1,1,1; 1,-1,1; 0,0,1]
E4,4 = [1,0,0,0; 1,2,4,1; 1,1,1,1; 1,-1,1,-1; 0,0,0,1]
E4,2 = [1,0; 1,2; 1,1; 1,-1; 0,1]
A4   = [1,0,0,0,0; 1,2,4,1,2; 1,1,1,1,1; 1,-1,1,-1,1; 0,0,0,0,1]

Balanced version:
U = U2*Y^2 + U1*Y + U0
V = V2*Y^2 + V1*Y + V0
\\ Evaluation: 10 add, 2 shift; 5 mul (n)
W1 = U2 + U0 ; W3 = V2 + V0
W0 = W1 + U1 ; W4 = W3 + V1
W2 = W0 * W4
W0 = (W0 + U2)*2 ; W4 = (W4 + V2)*2
W0 = W0 - U0 ; W4 = W4 - V0
W1 = W1 - U1 ; W3 = W3 - V1
W3 = W1 * W3 ; W1 = W0 * W4
W4 = U2 * V2 ; W0 = U0 * V0

Unbalanced version:
U = U3*Y^3 + U2*Y^2 + U1*Y + U0
V = V1*Y + V0
\\ Evaluation: 10 add, 1 shift; 5 mul (n)
W1 = U2 + U0
W3 = U3 + U1
W0 = W1 + W3 ; W4 = V0 + V1
W2 = W0 * W4
W1 = W1 - W3 ; W3 = V0 - V1
W0 = (W0 + U2)*2 ; W4 = W4 + V1
W0 = W0 - U0 - U3
W3 = W1 * W3 ; W1 = W0 * W4
W4 = U3 * V1 ; W0 = U0 * V0

Shared interpolation and recomposition:
\\ Interpolation: 8 add, 2 shift (2n)
W1 -= W2
W3 = (W2 - W3)/2
W2 = W2 - W0 - W3
W1 -= W2
W2 -= W4
W1 -= W2*2
W3 -= W1
\\ Recomposition
W = W4*Y^4 + W3*Y^3 + W2*Y^2 + W1*Y + W0
W == U*V

Fig. 5. Toom-3 in F7[x]
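And similarly for the balanced sequence of figure 5 over F_7, where /2 becomes multiplication by 4 = 2^{−1} mod 7 (transcription and harness are ours):

```python
# Exhaustive scalar check of the balanced Fig. 5 sequence over F_7.
from itertools import product

p = 7
def toom3_f7(u0, u1, u2, v0, v1, v2):
    W1 = (u2 + u0) % p; W3 = (v2 + v0) % p
    W0 = (W1 + u1) % p; W4 = (W3 + v1) % p
    W2 = W0 * W4 % p
    W0 = (W0 + u2) * 2 % p; W4 = (W4 + v2) * 2 % p
    W0 = (W0 - u0) % p; W4 = (W4 - v0) % p
    W1 = (W1 - u1) % p; W3 = (W3 - v1) % p
    W3 = W1 * W3 % p; W1 = W0 * W4 % p
    W4 = u2 * v2 % p; W0 = u0 * v0 % p
    # Interpolation: 8 add, 2 shift; /2 is *4 = 2^{-1} mod 7
    W1 = (W1 - W2) % p
    W3 = (W2 - W3) * 4 % p
    W2 = (W2 - W0 - W3) % p
    W1 = (W1 - W2) % p
    W2 = (W2 - W4) % p
    W1 = (W1 - W2 * 2) % p
    W3 = (W3 - W1) % p
    return [W0, W1, W2, W3, W4]

for u in product(range(p), repeat=3):
    for v in product(range(p), repeat=3):
        ref = [0] * 5
        for i in range(3):
            for j in range(3):
                ref[i + j] = (ref[i + j] + u[i] * v[j]) % p
        assert toom3_f7(*u, *v) == ref
```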

8 Toom and FFT interactions


Toom's ideas for multiplication and the use of the FFT for convolution [SS71] are tightly related:
both are based on the “evaluate, multiply, interpolate” paradigm. When we work on finite fields,
the difference becomes even smaller. In the correct ring, Toom-2.5 and FFT-4 coincide or, more
correctly, NTT-4 is one of the possible Toom-2.5 algorithms. The main difference is that, to
multiply two polynomials in Fpn[x], Toom uses evaluations in the ring itself, while the FFT
approach always maps into some ring extension.

8.1 Toom-3.5 for Any Ring with Imaginary Unit i

The inversion sequence described for Toom-3.5 in F5[x] can be generalised to any characteristic
p ≡ 1 (mod 4) or, more generally, to any integral domain containing ±i, the square roots of
−1 = (±i)^2.
With the six evaluation points (0, 1, −1, i, −i, ∞), we obtain evaluation and interpolation
matrices whose central rows require exactly those operations needed to compute the product
modulo (y^4 − 1) via FFT.

    [ 1  0  0  0  0  0 ]                 [ 1  0  0  0  0  0 ]
    [ 1  1  1  1  1  1 ]                 [ 0  1  0  0  0  1 ]
    [ 1 -1  1 -1  1 -1 ]   --FFT-4-->    [ 0  0  1  0  0  0 ]
    [ 1  i -1 -i  1  i ]  (mod y^4-1)    [ 0  0  0  1  0  0 ]
    [ 1 -i -1  i  1 -i ]                 [ 1  0  0  0  1  0 ]
    [ 0  0  0  0  0  1 ]                 [ 0  0  0  0  0  1 ]

The same can be done if we have ω^4 = −1, ω^2 = ±i, and we use the negacyclic convolution.

    [ 1    0     0     0    0    0   ]                 [  1  0  0  0  0  0 ]
    [ 1    ω    ω^2   ω^3  -1   -ω   ]                 [  0  1  0  0  0 -1 ]
    [ 1   -ω    ω^2  -ω^3  -1    ω   ]   --FFT-4-->    [  0  0  1  0  0  0 ]
    [ 1   ω^3  -ω^2    ω   -1  -ω^3  ]  (mod y^4+1)    [  0  0  0  1  0  0 ]
    [ 1  -ω^3  -ω^2   -ω   -1   ω^3  ]                 [ -1  0  0  0  1  0 ]
    [ 0    0     0     0    0    1   ]                 [  0  0  0  0  0  1 ]
Finally, the two 1's (respectively −1's) can easily be cleared with two subtractions (additions).
The argument can be reversed, using Toom strategies to handle the FFT worst cases. Assume
we need to compute the product c = ab, where a, b ∈ R[x]. If deg(ab) = K − 1 and K is a
good value for the FFT, we can compute c = ab = ab (mod x^K ± 1) very fast, using K point-wise
multiplications. On the other hand, if deg(ab) = K + 1 we have two possible choices: use the FFT
with a bigger K′, or compute C = ab (mod x^K ± 1) as above, then c_0 = a(0)b(0) = a_0 · b_0
and c_{K+1} = a(∞)b(∞) = a_n · b_m; finally correct the coefficients, obtaining c_K = ±(c_0 − C_0),
c_1 = C_1 ± c_{K+1}, and c_i = C_i otherwise. Thus we need K + 2 point-wise multiplications to
obtain K + 2 coefficients, with negligible overhead, while a plain FFT would have required 2K or more.
This approach is quite similar to Granlund's trick described in [GKZ07], but generalises
it using polynomials. This allows us to rebuild the correct coefficients from both (low and high)
ends, and not only from the lowest bits. It can be used only for the first step, not inside the recursion.
In general the FFT strategy [SS71] can be used to compute a total number of coefficients
that is a power of two; thus we can compare Toom-4.5 with FFT-8 and, in general, Toom-(2^k + 1)/2
with FFT-2^k. The argument above also allows FFT-(2^k + 1) and FFT-(2^k + 2), so that Toom-3.5 is
in effect an FFT-(4+2), and the algorithm in §5.2 implements an FFT-(2+2), which is not FFT-4.
The generalisation in F2[x] uses powers of three [Sch77], but again the parallelism works,
comparing Toom-5 with FFT-9, Toom-6 with FFT-(9+2), and so on.

9 Conclusions
We gave several hints on how to search for optimal evaluation and interpolation sequences for
Toom-Cook algorithms, also in the event that more evaluation points are needed than the
base field provides.
The 3-way Toom-Cook algorithm was given in full detail for all the significant characteristics; for
bigger ones (eleven and more) the sequences described for integers in [BZ07] can be used.
A final trick to enhance the worst cases of the FFT multiplication strategies was suggested;
this too can be applied in every characteristic with a very small overhead.

Acknowledgements
The author wants to thank Marica Tarantino for her patience which made this work possible;
Alberto Zanoni, Paul Zimmermann, Jörg Arndt and Richard Brent for helpful discussions.

References
AHM07. Omran Ahmadi, Darrel Hankerson, and Alfred Menezes, Software implementation of arithmetic
in F3m , in Carlet and Sunar [CS07].
Arn07. Jörg Arndt, Algorithms for programmers (working title), http://www.jjj.de/fxt/, June 2007,
Draft version.
Bod07. Marco Bodrato, Towards optimal Toom-Cook multiplication for univariate and multivariate
polynomials in characteristic 2 and 0, in Carlet and Sunar [CS07], http://bodrato.it/papers/
#WAIFI2007, pp. 116–133.
Bro07. Christopher W. Brown (ed.), ISSAC 2007 - International Symposium on Symbolic and Algebraic
Computation, Waterloo, Ontario, Canada, July 29-August 1, 2007, ACM press, July 2007.
BZ07. Marco Bodrato and Alberto Zanoni, Integer and polynomial multiplication: Towards optimal
Toom-Cook matrices, in Brown [Bro07], http://bodrato.it/papers/#ISSAC2007.
Coo66. Stephen A. Cook, On the minimum computation time of functions, Ph.D. thesis, Dept. of
Mathematics, Harvard University, 1966.
CS07. Claude Carlet and Berk Sunar (eds.), WAIFI 2007 - International Workshop on the Arithmetic
of Finite fields, Madrid, Spain, June 21-22, 2007, Lecture Notes in Computer Science, vol. 4547,
Springer, June 2007.
GKZ07. Pierrick Gaudry, Alexander Kruppa, and Paul Zimmermann, A GMP-based implementation of
Schönhage-Strassen's large integer multiplication algorithm, in Brown [Bro07].
Jeb93. Tudor Jebelean, An algorithm for exact division, Journal of Symbolic Computation 15 (1993),
169–180.
KO62. Anatolii Alekseevich Karatsuba and Yurii Ofman, Umnozhenie mnogoznachnykh chisel na
avtomatakh, Doklady Akademii Nauk SSSR 145 (1962), no. 2, 293–294 (Russian), English translation
in [KO63].
KO63. Anatoly Alexeevich Karatsuba and Yuri Ofman, Multiplication of multidigit numbers on au-
tomata, Soviet Physics Doklady 7 (1963), 595–596.
Mon85. Peter L. Montgomery, Modular multiplication without trial division, Mathematics of Computa-
tion 44 (1985), no. 170, 519–521.
SS71. Arnold Schönhage and Volker Strassen, Schnelle Multiplikation großer Zahlen, Computing 7 (1971),
281–292.
Sch77. Arnold Schönhage, Schnelle Multiplikation von Polynomen über Körpern der Charakteristik 2,
Acta Informatica 7 (1977), 395–398.
Too63a. Andrei Leonovich Toom, O slozhnosti skhemy iz funktsional'nykh elementov, realizuyushchei
umnozhenie tselykh chisel, Doklady Akademii Nauk SSSR 150 (1963), no. 3,
496–498 (Russian), English translation in [Too63b].
Too63b. Andrei Leonovich Toom, The complexity of a scheme of functional elements realizing the mul-
tiplication of integers, Soviet Mathematics Doklady 3 (1963), 714–716.
ZQ03. Paul Zimmermann and Michael Quercia, irred-ntl source code, 2003, http://www.loria.fr/~zimmerma/irred/.
Notes on low degree Toom-Cook multiplication with small characteristic 13

A Exact Division
The cost of exact division by a fixed element of the ring (or field) is linear in the degree of the dividend.
The first reference the author found for a proof of this fact is the paper by Jebelean [Jeb93]. Code
by Quercia [ZQ03] for exact division by 1 + x^2 in F2[x] gave the author the idea for the generalisation
to polynomial rings. Finally, Arndt's description of inversion for power series [Arn07] gave hints for a clean
explanation of the procedure.

A.1 Exact Division by 1 ± xn


The divisor we most frequently have to divide by is D = 1 ± x^n, so in this section we assume
we have an element A ∈ A[x] of any polynomial ring which is known to be a multiple of our
divisor D. Then we know the quotient Q ∈ A[x] exists, so that A = Q · D. We want to compute
the element Q.
We can at first compute the degree d = deg(A) of the polynomial A; this immediately gives
us the degree of Q: deg(Q) = d − n.
To compute Q we start by computing A_0, then we recurse:

A_0 ← A · (1 ∓ x^n) = Q(1 ± x^n)(1 ∓ x^n) = Q(1 − x^(2n))

A_1 ← A_0 · (1 + x^(2n)) = Q(1 − x^(2n))(1 + x^(2n)) = Q(1 − x^(4n))

...

A_k ← A_{k−1} · (1 + x^(n·2^k)) = Q(1 − x^(n·2^(k+1)))      (3)

until n · 2^(k+1) > d − n; then A_k ≡ Q (mod x^(d−n)) and we can easily extract Q from the lowest
coefficients of A_k.
Since we need the result modulo x^(d−n), all the computations can be performed modulo that
polynomial. Each step then requires a shift and an addition of at most d elements of A. The
number of necessary steps is 1 + ⌊log_2((d − n)/n)⌋ = O(log_2 d).
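A direct Python transcription of this doubling scheme follows (the helper name and conventions are this sketch's, not the paper's): coefficient lists are low-to-high, `sign` selects D = 1 + sign·x^n, and everything is computed modulo x^(d−n+1), so the returned list is exactly Q.

```python
def exact_div_one_pm_xn(A, n, sign):
    # divide A (a multiple of D = 1 + sign*x^n, sign = +1 or -1)
    # by D, working modulo x^(d-n+1) where d = deg(A)
    d = len(A) - 1
    qlen = d - n + 1              # Q has degree d - n
    Ak = list(A[:qlen])           # truncate: we only need Q's coefficients
    step, s = n, -sign            # first factor is 1 - sign*x^n
    while step < qlen:
        # multiply in place by 1 + s*x^step, in descending index order
        # so each source coefficient is still unmodified when read
        for i in range(qlen - 1, step - 1, -1):
            Ak[i] += s * Ak[i - step]
        step *= 2                 # exponents double: n, 2n, 4n, ...
        s = 1                     # all later factors are 1 + x^(n*2^k)
    return Ak
```

For example, dividing 1 + 2x + 2x^2 + 2x^3 + 2x^4 + x^5 by 1 + x yields 1 + x + x^2 + x^3 + x^4.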
Another possible way to clear the factor 1 ± x^n from A = Q(1 ± x^n) is to compute
Ã_0 ← A · (1 ∓ x^n + x^(2n)) = Q(1 ± x^n)(1 ∓ x^n + x^(2n)) = Q(1 ± x^(3n)), and recursively Ã_k until n · 3^(k+1) > d − n.
This is usually slower, because each step requires two additions and only gains a factor 3 in the
exponent. One or two steps of this computation can be useful anyway, whenever a factor 2
is not enough while a 3 is, or 8 is not and 9 is. As an example:

(1 − x^3)^(−1) ≡ (1 + x^3)(1 + x^6)(1 + x^(12))(1 + x^(24)) ≡ (1 + x^3 + x^6)(1 + x^9)(1 + x^(18)) (mod x^(32))

in both cases an additional factor, resp. (1 + x^(48)) or (1 + x^(36)), gives the inverse modulo x^(64).
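Both chains are easy to check mechanically; here is a throwaway Python verification with truncated products over the integers (the helper names are ad hoc):

```python
def pmul_mod(a, b, N):
    # product of two coefficient lists, truncated modulo x^N
    c = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < N:
                c[i + j] += ai * bj
    return c

def binom(e, N, coef=1):
    # the truncated polynomial 1 + coef*x^e
    p = [0] * N
    p[0] = 1
    if e < N:
        p[e] += coef
    return p

N = 32
one = [1] + [0] * (N - 1)
inv2 = one                              # (1+x^3)(1+x^6)(1+x^12)(1+x^24)
for e in (3, 6, 12, 24):
    inv2 = pmul_mod(inv2, binom(e, N), N)
tri = binom(3, N)
tri[6] += 1                             # 1 + x^3 + x^6
inv3 = pmul_mod(pmul_mod(tri, binom(9, N), N), binom(18, N), N)
# both chains are inverses of 1 - x^3 modulo x^32:
assert pmul_mod(binom(3, N, -1), inv2, N) == one
assert pmul_mod(binom(3, N, -1), inv3, N) == one
```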

A.2 Exact Division with Linear Complexity


Assume we have a representation of the finite field where each CPU word stores w coefficients,
and a_0, …, a_n are the words such that A = Σ_{i=0}^{n} a_i · x^(iw) and ∀ i, deg(a_i) < w. Moreover, we assume
that the divisor D is invertible modulo x^w. Then we can implement exact division with linear
complexity; the algorithm follows.

for i = 0 … n − 1
    a_i ← a_i · D^(−1) (mod x^w)
    a_{i+1} ← a_{i+1} − (a_i · D) / x^w
When D = 1 ± x^n, multiplication by the inverse D^(−1) can be optimised as above, and

(a_i · D) / x^w = a_i / x^w ± (a_i · x^n) / x^w = ± a_i / x^(w−n) = ± a_i ≫ (w − n)

reduces to one shift operation by (w − n) positions.


The very last step, a_{n+1} ← a_{n+1} − a_n ≫ (w − n), should be avoided; anyway, it can have
no effect at all: if the division is exact, we must have a_n ≫ (w − n) = 0.
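As a concrete (purely illustrative) instance, the loop can be written for F2[x] with a tiny word size W = 8, using carry-less multiplication and a standard char-2 Newton iteration to invert D modulo x^W; over F2 the subtraction is an XOR. All names below are hypothetical:

```python
W = 8                       # coefficients (bits) per word, small for readability
MASK = (1 << W) - 1

def clmul(a, b):
    # carry-less product of two GF(2)[x] polynomials packed into integers
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def inv_mod_xw(D):
    # char-2 Newton iteration I <- I*(D*I), doubling the precision each
    # round; requires D(0) = 1, i.e. an odd word
    assert D & 1
    inv, prec = 1, 1
    while prec < W:
        prec *= 2
        mask = (1 << prec) - 1
        inv = clmul(inv, clmul(D, inv) & mask) & mask
    return inv

def exact_div_words(a, D):
    # word-wise exact division in GF(2)[x]: `a` is a list of W-bit words,
    # low word first, assumed to be an exact multiple of D (deg D < W)
    Dinv = inv_mod_xw(D)
    q = list(a)
    for i in range(len(q) - 1):
        q[i] = clmul(q[i], Dinv) & MASK        # a_i <- a_i * D^-1 mod x^W
        q[i + 1] ^= clmul(q[i], D) >> W        # a_{i+1} -= (a_i * D) / x^W
    q[-1] = clmul(q[-1], Dinv) & MASK
    return q
```

For example, with D = 1 + x^2 (the word 0b101), `exact_div_words([0x75, 0x19], 0b101)` returns `[0xB1, 0x07]`.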

A.3 Exact Division with Same Cost of an Addition


The above algorithm greatly simplifies if D = 1 ± x^(kw), since then D^(−1) ≡ 1 (mod x^w); the loop reduces to:

for i = 0 … (n − k)
    a_{i+k} ← a_{i+k} ∓ a_i
This requires one word addition per word. The last k steps of this loop should all
give zero; they can be used as a check or simply skipped, because we started from a multiple A
of D. With (n + 1) words, only (n + 1 − 2k) steps are then necessary: 2k fewer than those needed for
an addition, and here we just read/write one operand. The cost of this division can therefore be even less
than the cost of an addition.
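In F2[x], where ∓ is also an XOR, the whole loop is just the following sketch (`words` holds the n + 1 machine words of A, low word first; the name is hypothetical):

```python
def exact_div_1_plus_xkw(words, k):
    # divide an exact multiple of 1 + x^(k*w) by it, word-wise;
    # over F2, addition and subtraction are both XOR
    q = list(words)
    for i in range(len(q) - k):
        q[i + k] ^= q[i]
    # exactness forces the top k words to zero (the "check" steps, which
    # could also simply be skipped); drop them: the quotient is k words shorter
    assert all(t == 0 for t in q[-k:])
    return q[:-k]
```

For example, if the quotient has words [3, 5, 7] and D = 1 + x^w (k = 1), the dividend's words are [3, 3^5, 5^7, 7] = [3, 6, 2, 7], and the function recovers [3, 5, 7].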

B Toom-k Complexity
Asymptotic complexity of Toom-k algorithm to multiply two degree-d polynomials has been
analysed many times. Here we recall and reorganise the results, with the aim of computing the
number of operations tightly, showing the influence of the optimisations in the linear part on the
usually implied constant.
We will call M (d) the number of operations needed to multiply two polynomials with d
coefficients. All Toom-k algorithms require some linear algebra, for evaluation and interpolation
phases; all operations used in those steps are linear in the number of coefficients d, so we will
condense the cost of all of them into c_k · d, with a constant c_k depending on the specific Toom-k method.
One step of a multiplication method splitting operands in k parts, and requiring m sub-
products and c_k · d linear operations, will cost

M_k(d) = m · M(d/k) + c_k · d

operations. We will use the well-known relations

α^(log_b β) = β^(log_b α) ,    Σ_{i=0}^{n} α^i = (α^(n+1) − 1)/(α − 1) ,
to compute the cost of r recursions of the same algorithm:


M_k^r(d) = m^r · M(d/k^r) + d · c_k · Σ_{i=0}^{r−1} (m/k)^i = m^r · M(d/k^r) + d · c_k · ((m/k)^r − 1)/(m/k − 1)

= m^r · M(d/k^r) + d · c_k · (k/(m − k)) · ((m/k)^r − 1)

= m^r · M(d/k^r) + c_k · (k/(m − k)) · (d · m^r / k^r) − d · c_k · (k/(m − k))

= m^r · [ M(d/k^r) + (d/k^r) · c_k · (k/(m − k)) ] − d · c_k · (k/(m − k)) .
Then we analyse the asymptotic behaviour of the Toom-k algorithm. Let m = 2k − 1, and assume
we use all possible recursions, r = log_k d; then in the above formula m^r = d^(log_k m), k^r = d, and
k/(m − k) = k/(k − 1):

M_k(d) = d^(log_k m) · ( M(1) + c_k · k/(k − 1) ) − d · c_k · k/(k − 1) = Θ(d^(log_k(2k−1))) .
THIS PAGE IS NOT PART OF THE ARTICLE²
BibTEX entry

@TechReport{Bodrato:CIVV2007,
author = {Marco Bodrato},
title = {Notes on Low Degree {Toom-Cook} Multiplication with Small Characteristic},
institution ={Centro "Vito Volterra", Universit\`a di Roma "Tor Vergata"},
year = {2007},
number = {621},
pages = {14},
month = {December},
note = {\url{http://bodrato.it/papers/\#CIVV2007}},
}

² Paper edited with Emacs-21 and LaTeX-3.0 on a Debian GNU/Linux box.
