Comparing Fast Implementations of Bit Permutation Instructions

Abstract

Recently, a number of candidate instructions have been proposed to efficiently compute arbitrary bit permutations. Among these, GRP is the most attractive: it is useful for applications other than permutation, such as sorting, and it has good inherent cryptographic properties. However, the current implementation of GRP is the slowest of the candidates; BFLY, on the other hand, is the fastest. In this paper, we examine the possibility of executing GRP on a butterfly or an inverse butterfly network.


The paper is organized as follows: Section II examines the possibility of performing GRP on a butterfly or inverse butterfly network. We show that GRP cannot be performed on a butterfly or inverse butterfly network, but that two inverse butterfly networks may be used to group the R bits and L bits in parallel. The outputs of the two networks are merged to complete the GRP operation. We show the circuit that dynamically decodes the n GRP control bits to the n lg(n)/2 IBFLY control bits. This circuit has significant latency, which offsets the speed of the faster inverse butterfly network. However, the new design is a viable alternative to the original GRP circuit. In addition, a fixed GRP operation can bypass the decoder and directly use the fast IBFLY network; the “decoding” can be done statically by the compiler.

Fig. 1: GRP operation on 8 bits. bfh are L bits; acdeg are R bits. Instruction is GRP Rs,Rc,Rd, where Rs is the source register, Rc the control register and Rd the destination register.

Fig. 2: Overview of GRP operation using parallel IBFLY circuits
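To make the operation in Fig. 1 and the parallel decomposition in Fig. 2 concrete, the following Python sketch models GRP, GRPL and GRPR on strings of symbols. The function names, the string representation, and the convention that a control bit of 0 marks an L bit and 1 marks an R bit are assumptions made for this sketch; the control word used in the example is simply the one that makes bfh the L bits and acdeg the R bits.

```python
def grpl(src: str, ctrl: str) -> str:
    """Group the L bits (control 0) at the left and zero-fill the rest."""
    left = "".join(s for s, c in zip(src, ctrl) if c == "0")
    return left + "0" * (len(src) - len(left))


def grpr(src: str, ctrl: str) -> str:
    """Group the R bits (control 1) at the right and zero-fill the rest."""
    right = "".join(s for s, c in zip(src, ctrl) if c == "1")
    return "0" * (len(src) - len(right)) + right


def merge(l_part: str, r_part: str) -> str:
    """Combine the two partial results.

    The L bits occupy the leftmost positions and the R bits the rightmost,
    so the two groups never overlap; on real bits this is a bitwise OR.
    """
    return "".join(r if l == "0" else l for l, r in zip(l_part, r_part))


def grp(src: str, ctrl: str) -> str:
    """GRP as the merge of the GRPL and GRPR results."""
    return merge(grpl(src, ctrl), grpr(src, ctrl))


print(grp("abcdefgh", "10111010"))  # L bits b, f, h move left; R bits a, c, d, e, g move right
```

Modelling the data as symbols rather than bits keeps the relative order of each group visible in the output.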

Fig. 3: Example of conflicting greedy paths on 8-bit butterfly network when attempting to route GRP operation with Rc = 00010001.

Fig. 4: Example of conflicting greedy paths on 8-bit inverse butterfly network when attempting to route GRP operation with Rc = 01010110.

GRPR on inverse butterfly. Both GRPR and GRPL can be achieved using the inverse butterfly circuit. Fig. 5 shows an 8-bit example. We provide the basis of an inductive proof by describing how GRPR is done at stage k+1, assuming that both the right half circuit and the left half circuit through stage k have performed GRPR on their respective data bits. The result from the left half circuit of stage k is then rotated right by the number of zeroed L bits in the right half. At level k+1, the bits in the left half that wrap are swapped into the most significant bits of the right half via the inverse butterfly operation at this stage. This completes the GRPR operation for stage k+1.

Fig. 5: Example GRPR operation on an 8-bit inverse butterfly network. The output from stage 2 is the GRPR operation within the left and right parts: 00ac, 0efh. The left part is rotated right by the number of zeros in the right part: 00ac -> c00a. Bit c is then swapped (control bit = “0”) into the right half to produce the output 000acefh.
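The inductive step described above can be traced with a short recursive sketch. It works on symbol lists rather than on a gate-level network, performs the rotation as an explicit step, and assumes that a control bit of 1 marks an R bit and that L bits are replaced by the symbol "0"; the function name is hypothetical.

```python
def grpr_by_stages(src, ctrl):
    """GRPR built from the inductive step: solve both halves, rotate, swap."""
    n = len(src)
    if n == 1:
        # Base case: keep an R bit, zero an L bit.
        return [src[0] if ctrl[0] == "1" else "0"]
    h = n // 2
    left = grpr_by_stages(src[:h], ctrl[:h])
    right = grpr_by_stages(src[h:], ctrl[h:])
    z = ctrl[h:].count("0")               # zeroed L bits in the right half
    left = left[h - z:] + left[:h - z]    # rotate the left half right by z
    for i in range(z):
        # The z wrapped bits swap into the most significant positions
        # of the right half, which hold zeros.
        left[i], right[i] = right[i], left[i]
    return left + right


# Fig. 5 example: halves 00ac and 0efh, one zero in the right half.
print("".join(grpr_by_stages("abcdefgh", "10101101")))
```

With the control word 10101101 the two halves come back as 00ac and 0efh, the left half becomes c00a after the rotation, and the single swap yields 000acefh, matching Fig. 5.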

Fig. 7: Performing rotation at level k+1 assuming rotation through level k. Fig. 7a shows the case that two bits are not swapped in the original permutation. Fig. 7b shows that, if these bits are rotated m positions and wrap, completing the rotation requires swapping the bits. Fig. 7c and 7d show the case where the two bits are swapped in the original permutation.

To achieve the desired rotation of the data bits, the control bits for GRPR specified above must be modified. In general, to perform a rotation of a permutation x by m positions on an inverse butterfly network, the right part circuit and left part circuit through stage k rotate their respective parts of x by m positions. To complete the rotation at stage k+1, the control bits at that stage are also rotated by m positions, so that each control bit stays associated with its paired data bits; however, the control bits are complemented upon wrap, reversing the routing of the data bits (Fig. 7). Thus a rotate and complement (ROTC) operation on the control bits is needed for a rotation of the data bits. Note that in order to rotate x by m positions at a given stage, the same rotation must have been performed at the previous stage. Thus the rotation is propagated up and performed at each stage of the inverse butterfly network.
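The rotate-and-complement rule can be checked on a single stage in isolation. The sketch below assumes a left rotation, a stage that pairs item i with item i + h, and a control convention of 0 = swap and 1 = pass; all function names are hypothetical. It feeds the stage two halves that have each been rotated by m and applies ROTC to the control bits.

```python
def ibfly_stage(data, ctrl):
    """One stage over 2h items: item i is paired with item i + h.
    Control 0 swaps the pair, control 1 passes it (assumed convention)."""
    h = len(ctrl)
    out = list(data)
    for i, c in enumerate(ctrl):
        if c == 0:
            out[i], out[i + h] = out[i + h], out[i]
    return out


def rotl(seq, m):
    """Plain left rotation by m positions."""
    m %= len(seq)
    return list(seq[m:]) + list(seq[:m])


def rotc(ctrl, m):
    """Left rotation of the control bits by m, complementing each bit that wraps."""
    out = list(ctrl)
    for _ in range(m):
        out = out[1:] + [1 - out[0]]
    return out


def rotated_stage(data, ctrl, m):
    """Rotate each half by m, then apply the stage with ROTC'd control bits."""
    h = len(ctrl)
    left, right = data[:h], data[h:]
    return ibfly_stage(rotl(left, m) + rotl(right, m), rotc(ctrl, m))


data = list("abcdefgh")
ctrl = [1, 0, 0, 1]
# The result equals the unrotated stage output rotated as a whole by m.
print(rotated_stage(data, ctrl, 3) == rotl(ibfly_stage(data, ctrl), 3))
```

A pair that wraps past the end of its half lands with its two members in exchanged halves of the full-width rotation, which is why the control bit that travels with it has to be complemented.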

Fig. 6: Left rotate and complement on wrap of “0000” by POPCNT(“1101”) = 3 produces the result “0111.”

A left rotate and complement (LROTC) operation of the zero string “0...0” by the POPCNT will produce a one in the least significant bit for each rotation by one position, thus expressing a unary encoding of the value.
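The example in Fig. 6 can be reproduced with a few lines of Python. The helper names are hypothetical and the bit strings are written most significant bit first; this is an illustrative model, not the circuit.

```python
def popcount(bits: str) -> int:
    """Number of ones in a bit string."""
    return bits.count("1")


def lrotc(bits: str, m: int) -> str:
    """Left rotate by m positions, complementing every bit that wraps around."""
    for _ in range(m):
        wrapped = "1" if bits[0] == "0" else "0"
        bits = bits[1:] + wrapped
    return bits


print(lrotc("0000", popcount("1101")))  # the Fig. 6 case: rotate "0000" by 3

# Starting from all zeros, each single-position rotation shifts in one more 1,
# so the result is a unary encoding of the rotation amount.
for m in range(5):
    print(m, lrotc("0000", m))
```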

Fig. 8: 64-bit GRPR circuit with decoder detail.

Fig. 9: The basic network fragment of the parallel prefix popcount circuit, with the network being truncated for small i. $PC_{i,j}$ refers to the POPCNT of positions i to j.
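As a rough software analogue of a parallel prefix popcount, the sketch below computes all running popcounts in log-many levels. The doubling scheme used here is an assumption chosen for brevity; the exact topology of the circuit in Fig. 9 is not reproduced, and the indexing (position 0 is the first character) is likewise an assumption.

```python
def prefix_popcounts(bits: str) -> list:
    """Return counts where counts[i] is the popcount of positions 0..i."""
    counts = [int(b) for b in bits]
    d = 1
    while d < len(counts):
        # Every addition in a level is independent of the others, so hardware
        # could evaluate a whole level in parallel. Positions with i < d have
        # nothing left to add, so the network is truncated there.
        counts = [c + (counts[i - d] if i >= d else 0)
                  for i, c in enumerate(counts)]
        d *= 2
    return counts


print(prefix_popcounts("10101101"))
```

After the level with distance d, each entry covers the 2d positions ending at i (or fewer near the start), so lg(n) levels suffice for n bits.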

Fig. 10a: Barrel rotator implementation of 4-bit LROTC circuit.

Fig. 10b: Simplified circuit obtained from propagating zeros at input.

The output from the population counters controls the LROTC circuits. Each LROTC circuit can be realized as a barrel rotator modified to complement the bits that wrap around (Fig. 10a). However, while a standard $2^k$-bit rotator has k stages and k control bits, this shifter has k+1 stages and k+1 control bits. The final stage selects between its input and the complement, as the bits wrap $2^k$ positions, back to the same spot. Propagating the zeros at the input can greatly simplify the circuit (Fig. 10b). The outputs from the rotate circuits are routed directly to the appropriate inverse butterfly switches they control.
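The staged structure of such a rotator can be modelled in software as follows. This is a behavioural sketch only: the function name is hypothetical, bit strings are written most significant bit first, and stage s is assumed to rotate by 2^s positions when bit s of the rotation amount is set.

```python
def lrotc_barrel(bits: str, amount: int) -> str:
    """LROTC on a 2**k-bit string using k + 1 conditional stages."""
    n = len(bits)
    k = n.bit_length() - 1
    assert n == 1 << k, "width must be a power of two"
    assert 0 <= amount < 2 * n, "amount must fit in k + 1 control bits"
    for stage in range(k + 1):
        if (amount >> stage) & 1:
            shift = 1 << stage
            # The bits that wrap around are complemented. In the final stage
            # shift == n, so every bit wraps and the whole word is complemented.
            wrapped = "".join("1" if b == "0" else "0" for b in bits[:shift])
            bits = bits[shift:] + wrapped
    return bits


print(lrotc_barrel("0000", 3))  # stages 0 and 1 are active
print(lrotc_barrel("1010", 4))  # only the final stage is active
```

With an all-zero input the output depends only on the rotation amount, which is consistent with the simplification obtained by propagating the zeros through the circuit.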

The carry-save adders in the POPCNT units consist of parallel full adders (FA). Each FA is composed of an asymmetric 3-input XOR gate and an asymmetric 3-input inverting majority gate. The XOR gate has logical effort $g_a^X = 12$, $g_b^X = 6$ and $g_c^X = 6$ for the three input bundles, where a bundle is composed of the complemented and uncomplemented input signal, and the majority gate has logical effort $g_a^M = 2$, $g_b^M = 4$ and $g_c^M = 4$ for the three complemented inputs. Both gates have a parasitic delay $p_M = p_X = 6$ [13]. In order to limit the effort of any single input, the XOR input with the highest effort, $a^X$, is tied to the majority input with the lowest effort, $a^M$, yielding an effort $g_a = 14$ for the bundle and $g_{\bar a} = 8$ for the complemented input, and $g_b = g_c = 10$ and $g_{\bar b} = g_{\bar c} = 7$. Each FA is driven by a 2:1 fork stage that generates the complemented and uncomplemented inputs. We assume each fork inverter is 4x drive strength.
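The combined efforts quoted above can be reproduced with simple arithmetic, under the assumption (not stated explicitly in the text) that each XOR bundle effort divides equally between the true and complemented signals and that the majority gate loads only the complemented signal. The superscripts X and M for the XOR and majority gates and the overbar for a complemented input are notation chosen for this sketch.

```latex
\begin{aligned}
g_a &= g_a^X + g_a^M = 12 + 2 = 14,
  & g_{\bar a} &= \tfrac{1}{2}\, g_a^X + g_a^M = 6 + 2 = 8,\\
g_b = g_c &= 6 + 4 = 10,
  & g_{\bar b} = g_{\bar c} &= \tfrac{1}{2} \cdot 6 + 4 = 7.
\end{aligned}
```

Pairing the heaviest XOR input with the lightest majority input keeps the worst single-signal effort at 8, whereas pairing the heaviest XOR input with a majority input of effort 4 would, under the same assumption, raise it to 6 + 4 = 10.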

3.2 FO4 due to branching to the GRPR and GRPL circuits and combining the results. The GRP on IBFLY latency of 38.4 FO4 is much greater than the 13.0 FO4 latency of the original inverse butterfly network due to the high latency through the decoder. This latency can be attributed to the high delay of the full adders. Each full adder level contributes 2.5 to 3.0 FO4 of delay. Additionally, the branching required by the parallel prefix architecture, together with the large equivalent capacitance of a wire spanning a full adder, causes a large branching effort at each stage of the unit.

We synthesized both circuits using a TSMC 90 nm standard cell library [14]. Table III compares the latency from logical effort estimates, the latency from synthesis and the area from synthesis. The latency from synthesis verifies the logical effort result, with both circuits having comparable latency. However, the area results clearly favor the GRP on IBFLY implementation.

In this paper, we examine the possibility of performing the GRP operation on a butterfly or inverse butterfly network. We show that GRP cannot be routed on either network but that GRPR and GRPL can be routed on the inverse butterfly network. We design a decoder circuit that can produce the required n lg(n)/2 butterfly control bits from the n GRP control bits. However, the latency through this decoder is quite large and thus the benefit of the fast routing inverse butterfly network is negated. The overall latency, however, is comparable to that of the original GRP circuit, and this circuit has the benefit of using a more general purpose routing circuit. Furthermore, if we wish to perform a static GRP operation, the compiler can decode the bits in advance and produce control bits for the fast inverse butterfly network directly, bypassing the hardware decoder (this requires adding a bypass multiplexer in Fig. 8).
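To illustrate what decoding in advance could look like for GRPR, the following Python sketch derives the control bits for every stage from a fixed control word and then applies them in a software model of the inverse butterfly network. Several points are assumptions of this sketch rather than statements from the paper: the function names, the convention 0 = swap and 1 = pass, the masking of L bits to zero before routing, and the rule that each block's control bits are the LROTC of a zero string by the popcount of the GRP control bits from the block's right half through the least significant bit.

```python
def lrotc(bits, m):
    """Left rotate a list of 0/1 by m, complementing each bit that wraps."""
    for _ in range(m):
        bits = bits[1:] + [1 - bits[0]]
    return bits


def decode_grpr(ctrl):
    """Static decode: n GRP control bits -> n lg(n)/2 IBFLY control bits."""
    n = len(ctrl)
    suffix = [0] * (n + 1)              # suffix[i] = popcount of ctrl[i:]
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + ctrl[i]
    stages, h = [], 1
    while h < n:
        # One group of h control bits per block of 2h data bits.
        stages.append([lrotc([0] * h, suffix[start + h])
                       for start in range(0, n, 2 * h)])
        h *= 2
    return stages


def ibfly(data, stages):
    """Apply the stages; within a block, item i is paired with item i + h."""
    data, h = list(data), 1
    for stage in stages:
        for b, bits in enumerate(stage):
            start = b * 2 * h
            for j, c in enumerate(bits):
                if c == 0:              # 0 = swap (assumed convention)
                    i = start + j
                    data[i], data[i + h] = data[i + h], data[i]
        h *= 2
    return data


def grpr_static(src, ctrl):
    """GRPR using precomputed control bits; L bits are zeroed first."""
    masked = [s if c else "0" for s, c in zip(src, ctrl)]
    return "".join(ibfly(masked, decode_grpr(ctrl)))


print(grpr_static("abcdefgh", [1, 0, 1, 0, 1, 1, 0, 1]))
```

For the control word 10101101 the last stage receives the control bits 0111 and the result is 000acefh, in line with Figs. 5 and 6. Because `decode_grpr` depends only on the control word, its output could be computed once ahead of time and reused for every source value.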