VLSI Digital Signal Processing Systems
Retiming
Lan-Da Van (), Ph. D. Department of Computer Science National Chiao Tung University Taiwan, R.O.C. Fall, 2010
[email protected] http://www.cs.nctu.tw/~ldvan/
VLSI Digital Signal Processing Systems
Outlines
Introduction
Definitions and Properties Solving Systems of Inequalities Retiming Techniques Conclusion Supplement
Lan-Da Van
VLSI-DSP-4-2
VLSI Digital Signal Processing Systems
Introduction (1/2)
Retiming is a transformation technique used to change the locations of delay elements in a circuit without affecting the input/output characteristics of the circuit. Critical path is defined to be the path with the longest computation time among all paths that contain zero delay. The lower bound on the clock period of the circuit can be achieved by retiming.
Lan-Da Van
VLSI-DSP-4-3
VLSI Digital Signal Processing Systems
Introduction (2/2)
Application of Retiming
Reduce the clock period. Reduce the number of registers. Reduce the power consumption and logic synthesis.
(1) (1)
x(n)
W(n) (1)
y(n)
D D
(2) a b (2)
x(n)
D
(1)
y(n)
D
W1(n) (2) a b
2D
2D
33% reduction
D
W2(n)
(2)
3 u.t.
w(n)=ay(n-1)+by(n-2) y(n)=w(n-1)+x(n) =ay(n-2)+by(n-3)+x(n)
Lan-Da Van
2 u.t.
w1(n)=ay(n-1) w2(n)=by(n-2) y(n)=w1(n-1)+w2(n-1)+x(n) =ay(n-2)+by(n-3)+x(n)
VLSI-DSP-4-4
VLSI Digital Signal Processing Systems
Outlines
Introduction Definitions and Properties Solving Systems of Inequalities
Retiming Techniques
Conclusion Supplement
Lan-Da Van
VLSI-DSP-4-5
VLSI Digital Signal Processing Systems
Definitions
(1)
V , U : node r (V) : retiming value
D D
(2)
G
(1)
w(e) : weight of the edge e in G
3 2D
wr(e) : weight of the edge e in Gr
(2)
wr(e)=w(e)+r(V)-r(U) ex: r(1)=0, r(2)=1, r(3)=0, r(4)=0
e e
(1)
Gr
(1)
1 D 2 D D
(2)
wr(3
3 2D
2)=w(3
2)+r(2)-r(3)
=0+1-0=1 wr(2
e
1)=w(2
1)+r(1)-r(2)
(2)
=1+0-1=0
Lan-Da Van VLSI-DSP-4-6
VLSI Digital Signal Processing Systems
Properties of Retiming
4.2.1 The weight of the retimed path p=V0
given by wr(p)=w(p)+r(Vk)-r(V0).
e0
V1 e1
ek-1
Vk is
4.2.2 4.2.3 4.2.4
Retiming does not change the number of delays in a cycle. Retiming does not alter the iteration bound in a DFG.
Adding the constant value j to the retiming value of each node does not change the mapping from G to Gr.
Lan-Da Van
VLSI-DSP-4-7
VLSI Digital Signal Processing Systems
Property 4.2.1
The weight of the retimed path p=V0 given by wr(p)=w(p)+r(Vk)-r(V0). Proof
e0
V1 e1
ek-1
Vk is
wr ( p ) wr (e)
i 0
k 1
( w(ei ) r (Vi 1 ) r (Vi ))
i 0
k 1
w(ei ) ( r (Vi 1 ) r (Vi ))
i 0 i 0 i 0
k 1
k 1
k 1
w( p) r (Vk ) r (V0 )
Lan-Da Van VLSI-DSP-4-8
VLSI Digital Signal Processing Systems
Outlines
Introduction Definitions and Properties Solving Systems of Inequalities
Retiming Techniques
Conclusion Supplement
Lan-Da Van
VLSI-DSP-4-9
VLSI Digital Signal Processing Systems
Solving Systems of Inequalities
Step 1: Draw a constraint graph
(a) Draw the node i for each of the N variables ri, i=1,2.,N. (b) Draw the node N+1. (c) For each inequality ri-rjk, draw the edge j i from node j to i with length k. (d) For each node i, i=1,2,N, draw the edge N+1i from the node N+1 to i with length 0.
Step 2: Solve using a shortest path algorithm
(a) The system of inequalities has a solution if and only if the constraint graph contains no negative cycles. (b) If a solution exists, one solution is where ri is the minimumlength path from the node N+1 to the node i.
Lan-Da Van
VLSI-DSP-4-10
VLSI Digital Signal Processing Systems
Example 4.3.1 with M=5, N=4 (1/2)
Given Inequalities: r1 - r2 0 r3 - r1 5 Step 2 : Bellman-Ford algorithm
V=1 V=2 V=3 V=4 V=5 0 0 0 0 0 0 0 0 -1 0 0 0 0 -1 0 0 0 0 -1 0
Ref=U5 r(k)(V) r(1) r(2) r(3) r(4)
r4 - r1 4
r4 - r3 -1 r3 - r2 2 Step 1 :
0 0 0
Floyd-Warshall algorithm
5 5 4 0 2 1 -1 0 0 0 -1
2
5
R(6) =
1
4
3
-1
Solution: r1=0, r2=0, r3=0, r4=-1
Lan-Da Van VLSI-DSP-4-11
VLSI Digital Signal Processing Systems
Example 4.3.1 with M=5, N=4 (2/2)
Alternative Computation for Floyd-Warshall algorithm
0 0 0
0 0 0
R(1) =
5 4 0 2 0 -1 0 0 0 0
5 4 0 2 1 0 -1 0 0 0 -1
0
0
R(2) =
0 0 0
0 0 0
0 0
0 0
5 4 2 1 0 -1 0 0 -1 0
5 2 0 0 4 1 -1 0 -1 0
R(3) =
0 0 0 0 0 0 0 0
5 2 0 0 5 2 0
4 1 -1 0 -1 4 1 -1 -1
R(4) =
R(5) =
R(6) =
ri,j(m+1)=min{ri,k(m)+rk,j(1)}
Lan-Da Van VLSI-DSP-4-12
VLSI Digital Signal Processing Systems
Outlines
Introduction
Definitions and Properties Solving Systems of Inequalities
Retiming Techniques
Cutset retiming and pipelining Systolic transformation
Retiming for clock period minimization
Conclusion Supplement
Lan-Da Van
VLSI-DSP-4-13
VLSI Digital Signal Processing Systems
Cutset Retiming (1/2)
Definition: A cutset is a set of edges that can be removed from the graph to create 2 disconnected subgraphs. Procedures:
Step 1: Construct 2 disjointed sub-graphs G1 and G2 Step 2: Add k delays to each edge from G1 to G2 and remove k delays from each edge from G2 to G1. wr(G1->G2)=w(G1->G2)+r(G2)-r(G1)=w(G1->G2)+k wr(G2->G1)=w(G2->G1)+r(G1)-r(G2)=w(G2->G1)-k => -min{w(G1->G2)} k min{w(G2->G1)}
Feasible solution of cutset retiming
Lan-Da Van
VLSI-DSP-4-14
VLSI Digital Signal Processing Systems
Cutset Retiming (2/2)
Upper case
Cutset = {2->1, 3->2, 1->4} and feasible solution: 0 k 1 Cutset = {2->1, 3->2, 4->2} and feasible solution: 0 k 1
Lower case
1
D D 3 4 2D 2 1
G1
D 3 4
k=1
2
3 4
3D
G2
1 D 2 D 3 4 2D 2 3 4 1 D
G1 k=1
2D
(1)
1
D
(1) 2
(2)
3 4
2D
G2
Lan-Da Van
(2) VLSI-DSP-4-15
VLSI Digital Signal Processing Systems
Pipelining
Pipelining is a special case of cutset retiming where there are no edges in the cutset from G2 to G1, i.e., pipelining applies to graphs without loops.
Lan-Da Van
VLSI-DSP-4-16
VLSI Digital Signal Processing Systems
Cutset Retiming and Slow-Down
Procedure:
Replace each delay with N delays to create an N-slow version of the DFG. Perform cutset retiming on the N-slow DFG.
N=2
In (a) , CP=101 2
Tcp=101*1+2*2=105
In (c) , CP = 2 2 Tcp=2*1+2*2=6
Lan-Da Van VLSI-DSP-4-17
VLSI Digital Signal Processing Systems
Systolic Transformation
z 1
z 1
z 1
ai , 2
S z 1
bi , 2
ai , 2
S z 1
bi , 2
ai ,1
S z 1
bi ,1
ai ,1
bi ,1
z 1
bi ,0
z 1 ai , 0
ai , 0
bi ,0
Source: L. D. Van, "A new 2-D systolic digital filter architecture without global broadcast," IEEE Trans. VLSI Systs., vol. 10, pp. 477-486, Aug. 2002.
Lan-Da Van VLSI-DSP-4-18
VLSI Digital Signal Processing Systems
Retiming Relationship
Cutset Retiming Retiming
Pipelining
Systolic Transformation
Lan-Da Van
VLSI-DSP-4-19
Retiming for Clock Period Minimization (1/2)
Definition:
VLSI Digital Signal Processing Systems
(G) = max{t(p) : w(p)=0} (min. feasible clk period) W(U,V) = min{w(p) : U e V} D(U,V) = max{t(p) : U e V and w(p) = W(U,V)}
Algorithm for computing W(U,V) and D(U,V):
1. Let M=tmaxn, where tmax is the maximum computation time of the nodes in G and n is the number of nodes in G. 2. Form a new graph G which is the same as G except the edge weights are replaced by w(e)=Mw(e)-t(U) for all edges. 3. Solve the all-pairs shortest path problem on G. Let SUV be the shortest path from U to V. 4. If UV, then W(U,V)=ceiling (SUV /M) and D(U,V)=MW(U,V)SUV+t(V). If U=V, then W(U,V)=0 and D(U,V)=t(U).
Lan-Da Van
VLSI-DSP-4-20
Retiming for Clock Period Minimization (2/2)
VLSI Digital Signal Processing Systems
W(U,V) and D(U,V) are used to determine if there is a retiming solution that can achieve a desired clock period. Given a desired clock period c, if :
1. feasibility constraint : r(U)-r(V)w(e) for every edge e :UV 2. critical path constraint : r(U)-r(V)W(U,V)-1 for all vertices U,V in G such that D(U,V)>c.
There is a feasible retiming solution r such that
(Gr)c.
Lan-Da Van
VLSI-DSP-4-21
VLSI Digital Signal Processing Systems
Example with c=3 (1/3)
G
Step 1: tmax=2, n=4, M=tmaxn=8
(1)
1 D
(1) 2
D
(2)
3 4
2D
(2)
G
(1)
(1)
1
7 2 7 -2 -2
(2)
3 4 (2)
15
Step 2: As G shows. Step 3: SUV 1 1 12 2 7 3 5 4 5 Step 4: W(U,V) 1 2 3 4 1 0 1 1 2 2 1 0 2 3 3 1 0 0 3 4 1 0 2 0
Lan-Da Van
2 5 12 -2 -2
3 7 14 12 12
4 15 22 20 20
D(U,V) 1 2 3 4
1 2 3 4
1 2 4 4
4 1 3 3
3 4 2 6
3 4 6 2
VLSI-DSP-4-22
VLSI Digital Signal Processing Systems
Example with c=3 (2/3)
Step 5: Constraints
Step 6: Constraint Graph 0 0 1 1 2 02 0 0 2 1 0 0 2 1 3 1 0 0 5
Feasibility constraint:
r(1)-r(3)1 r(1)-r(4)2 r(2)-r(1)1 r(3)-r(2)0 r(4)-r(2)0
Critical path constraint:
r(1)-r(2)0 r(2)-r(3)1 r(2)-r(4)2 r(3)-r(1)0 r(3)-r(4)2 r(4)-r(1)0 r(4)-r(3)1
Lan-Da Van
VLSI-DSP-4-23
VLSI Digital Signal Processing Systems
Example with c=3 (3/3)
Step 7: Using either Bellman-Ford algorithm or Floyd-Warshall algorithm, the retiming values for each node are calculated as
r(1)=r(2)=r(3)=r(4)=0
Step 8: Retimed DFG can be obtained as
(1) 1 D (1) 2
D
(2) 2D 3
(2)
VLSI-DSP-4-24
Lan-Da Van
VLSI Digital Signal Processing Systems
Example with c=2 (1/2)
Feasibility constraint:
r(1)-r(3)1 r(1)-r(4)2 r(2)-r(1)1 r(3)-r(2)0 r(4)-r(2)0
Constraint Graph
0
0 1 1 0 0 2 0 -1 1 3 1 2 0 0 0 5
Critical path constraint:
r(1)-r(4)1 r(2)-r(3)1 r(2)-r(4)2 r(3)-r(1)0 r(3)-r(2)-1 r(3)-r(4)2 r(4)-r(1)0 r(4)-r(2)-1 r(4)-r(3)1
1 -1 4
Lan-Da Van
VLSI-DSP-4-25
VLSI Digital Signal Processing Systems
Example with c=2 (2/2)
Using either Bellman-Ford algorithm or FloydWarshall algorithm, the retiming values for each node are calculated as
r(1)=-1 , r(3)=-1 , r(2)=0 r(4)=-1 j+1 r(1)=0 , r(3)=0 , r(2)=1 r(4)=0
Retimed DFG can be obtained as
(1) 1 D (1) 2 D D 3 4 (2)
2D
(2)
VLSI-DSP-4-26
Lan-Da Van
VLSI Digital Signal Processing Systems
Conclusion
Reduction of clock period, number of registers, and power consumption by retiming for synchronous systems Shortest path algorithms can be used to obtain a retiming solution if one exists. Retiming can also be used as a preprocessing step for folding and for computation of round-off noise as discussed in Chap 6 and 11, respectively.
Lan-Da Van
VLSI-DSP-4-27
VLSI Digital Signal Processing Systems
Supplement Part The Shortest Path Algorithm
VLSI Digital Signal Processing Systems
Shortest Path Algorithms
Length : sum of weights Negative cycle : when sum of lengths is negative ex : cycle 2412 SUV : the shortest path from the node U to the node V. The Bellman-Ford algorithm The Floyd-Warshall algorithm
1 -3 2 1 3 1 2 1 4
Lan-Da Van
VLSI-DSP-4-29
VLSI Digital Signal Processing Systems
The Bellman-Ford algorithm
Single-point shortest path algorithm
r(1)(U)=0 For k=1 to n If kU r(1)(k)=w(U e k) For k=1 to n-2 For V=1 to n r(k+1)(V)=r(k)(V) For W=1 to n If r(k+1)(V)> r(k)(W)+w(W e V) r(k+1)(V)= r(k)(W)+w(W e V) For V=1 to n For W=1 to n If r(n-1)(V)>r(n-1)(W)+w(W e V) return FALSE and exit return TRUE and exit
Lan-Da Van VLSI-DSP-4-30
VLSI Digital Signal Processing Systems
Example
U=2 ( node 2 )
r(k)(V) k=1 k=2 k=3 V=1 3 3 V=2 0 0 0 V=3 1 1 1 V=4 2 2 2
r(3)(V) is the shortest path from node 2 to node V for V=1,2,3,4
1
-3 2 1 2 1 4
3
2
Lan-Da Van
VLSI-DSP-4-31
VLSI Digital Signal Processing Systems
The Floyd-Warshall algorithm
All-points shortest path algorithm
For V=1 to n For U=1 to n r(1)(U,V)=w(U e V) For k=1 to n For V=1 to n For U=1 to n r(k+1)(U,V)=r(k)(U,V) If r(k+1)(U,V)> r(k)(U,k)+r(k)(k,V) r(k+1)(U,V)= r(k)(U,k)+r(k)(k,V) For k=1 to n For U=1 to n If r(k)(U,U)<0 return FALSE and exit return TRUE and exit
Lan-Da Van
VLSI-DSP-4-32
VLSI Digital Signal Processing Systems
Example
R(1) -3 1 2 2 1 R(3) -3 1 -2 -2 -1 1 2 2 -1 0
R(2) 1
R(4) 1
-3 -2
-3
1 2 2 -2 -1 1 2 2 -1 0
-3
2 1 3 2 2
1 4
R(5) 0 -3 -2 -1 3 0 1 2 3 0 1 2 1 -2 -1 0
R(5)(U,V) is the shortest path from
the node U to the node V.
Lan-Da Van VLSI-DSP-4-33