Co-Synthesis Algorithms: HW/SW Partitioning
Part of HW/SW Codesign of Embedded Systems Course (CE 40-226)
Winter-Spring 2001 Codesign of Embedded Systems 1
Topics
Introduction Preliminaries Hardware/Software Partitioning Distributed System Co-Synthesis
Winter-Spring 2001
Codesign of Embedded Systems
Topics
Introduction A Classification Examples
Vulcan Cosyma
Winter-Spring 2001
Codesign of Embedded Systems
Introduction to HW/SW Partitioning
The first variety of co-synthesis applications Definition
A HW/SW partitioning algorithm implements a specification on some sort of multiprocessor architecture Multiprocessor architecture = one CPU + some ASICs on CPU bus
Usually
Winter-Spring 2001
Codesign of Embedded Systems
Introduction to HW/SW Partitioning (contd)
A Terminology
Allocation
Synthesis methods which design the multiprocessor topology along with the PEs and SW architecture The process of assigning PE (CPU and/or ASICs) time to processes to get executed
Scheduling
Winter-Spring 2001
Codesign of Embedded Systems
Introduction to HW/SW Partitioning (contd)
In most partitioning algorithms
Type of CPU is fixed and given ASICs must be synthesized
What function to implement on each ASIC? What characteristics should the implementation have? CDFG is the starting model
Are single-rate synthesis problems
Winter-Spring 2001
Codesign of Embedded Systems
HW/SW Partitioning (contd)
Normal use of architectural components
CPU performs less computationally-intensive functions ASICs used to accelerate core functions High-performance applications
Where to use?
No CPU is fast enough for the operations ASIC accelerators allow use of much smaller, cheaper CPU
Codesign of Embedded Systems 7
Low-cost application
Winter-Spring 2001
A Classification
Criterion: Optimization Strategy
Trade-off between Performance and Cost Performance is the primary goal First, all functionality in ASICs. Progressively move more to CPU to reduce cost. Cost is the primary goal First, all functions in the CPU. Move operations to the ASIC to meet the performance goal.
Primal Approach
Dual Approach
Winter-Spring 2001
Codesign of Embedded Systems
A Classification (contd)
Classification due to optimization strategy (contd)
Example co-synthesis systems
Vulcan (Stanford): Primal strategy Cosyma (Braunschweig, Germany): Dual strategy
Winter-Spring 2001
Codesign of Embedded Systems
Co-Synthesis Algorithms: HW/SW Partitioning
HW/SW Partitioning Examples: Vulcan
Winter-Spring 2001
Codesign of Embedded Systems
10
Partitioning Examples: Vulcan
Gupta, De Micheli, Stanford University Primal approach
1. All-HW initial implementation. 2. Iteratively move functionality to CPU to reduce cost.
System specification language
HardwareC
Is compiled into a flow graph
Winter-Spring 2001
Codesign of Embedded Systems
11
Partitioning Examples: Vulcan (contd)
x=a; y=b;
nop
HardwareC
1
x=a
1
y=b
cond
if (c>d) x=e; else y=f;
c>d
x=e
c<=d
y=f
HardwareC
Winter-Spring 2001 Codesign of Embedded Systems 12
Partitioning Examples: Vulcan (contd)
Flow Graph Definition
A variation of a (single-rate) task graph Nodes
Represent operations Typically low-level operations: mult, add Represent data dependencies Each contains a Boolean condition under which the edge is traversed
Edges
Winter-Spring 2001
Codesign of Embedded Systems
13
Partitioning Examples: Vulcan (contd)
Flow Graph
is executed repeatedly at some rate can have initiation-time constraints for each node
t(vj)+lij t(vj) t(vj)+uij mi Ri Mi
can have rate constraints on each node
Winter-Spring 2001
Codesign of Embedded Systems
14
Partitioning Examples: Vulcan (contd)
Vulcan Co-synthesis Algorithm
Partitioning quantum is a thread
Algorithm divides the flow graph into threads and allocates them Thread boundary is determined by
1. (always) a non-deterministic delay element, such as wait for an external variable 2. (on choice) other points of flow graph
Target architecture
CPU + Co-processor (multiple ASICs)
Winter-Spring 2001
Codesign of Embedded Systems
15
Partitioning Examples: Vulcan (contd)
Vulcan Co-synthesis algorithm (contd)
Allocation
Primal approach
Scheduling
is done by a scheduler on the target CPU
is generated as part of synthesis process schedules all threads (both HW and SW threads)
cannot be static, due to some threads non-deterministic initiation-time
Winter-Spring 2001
Codesign of Embedded Systems
16
Partitioning Examples: Vulcan (contd)
Vulcan Co-synthesis algorithm (contd)
Cost estimation
SW implementation
Code size relatively straight forward Data size Biggest challenge. Vulcan puts some effort to find bounds for each thread
HW implementation ?
Codesign of Embedded Systems 17
Winter-Spring 2001
Partitioning Examples: Vulcan (contd)
Vulcan Co-synthesis algorithm (contd)
Performance estimation
Both SW- and HW-implementation
From flow-graph, and basic execution times for the operators
Winter-Spring 2001
Codesign of Embedded Systems
18
Partitioning Examples: Vulcan (contd)
Algorithm Details
Partitioning goal
Allocate each thread to one of two partitions
CPU Set: FS Co-processor set: FH
Required execution-rate must be met, and total cost minimized
Winter-Spring 2001
Codesign of Embedded Systems
19
Partitioning Examples: Vulcan (contd)
Algorithm Details (contd) Algorithm steps
1. Put all threads in FH set 2. Iteratively do
2.1. Move some operations to FS. 2.1.1. Select a group of operations to move to FS. 2.1.2. Check performance feasibility, by computing worst-case delay through flow-graph given the new thread times 2.1.3. Do the move, if feasible 2.2. Incrementally update the new cost-function to reflect the new partition
Winter-Spring 2001
Codesign of Embedded Systems
20
Partitioning Examples: Vulcan (contd)
Algorithm Details (contd)
Vulcan cost function
f(w) = c1Sh(FH) - c2Ss(FS) + c3B - c4P + c5|m|
c: weight constants S(): Size functions B: Bus utilization (<1) P: Processor utilization (<1) m: total number of variables to be transferred between the CPU and the co-processor
Winter-Spring 2001
Codesign of Embedded Systems
21
Partitioning Examples: Vulcan (contd)
Algorithm Details (contd) Complementary notes
A heuristic to minimize communication
Once a thread is moved to FS, its immediate successors
are placed in the list for evaluation in the next iteration.
No back-track
Once a thread is assigned to FS, it remains there
Experimental results
considerably faster implementations than all-SW, but much cheaper than all-HW designs are produced
Codesign of Embedded Systems 22
Winter-Spring 2001
Co-Synthesis Algorithms: HW/SW Partitioning
HW/SW Partitioning Examples: Cosyma
Winter-Spring 2001
Codesign of Embedded Systems
23
Partitioning Examples: Cosyma
Rolf Ernst, et al: Technical University of Braunschweig, Germany Dual approach
1. All-SW initial implementation. 2. Iteratively move basic blocks to the ASIC accelerator to meet performance objective.
System specification language
Cx
Is compiled into an ESG (Extended Syntax Graph) ESG is much like a CDFG
Codesign of Embedded Systems 24
Winter-Spring 2001
Partitioning Examples: Cosyma (contd)
Cosyma Co-synthesis Algorithm
Partitioning quantum is a Basic Block
A Basic Blocks is a branch-free block of program
Target Architecture
CPU + accelerator ASIC(s)
Scheduling Allocation Cost Estimation Performance Estimation Algorithm Details
Codesign of Embedded Systems 25
Winter-Spring 2001
Partitioning Examples: Cosyma (contd)
Cosyma Co-synthesis Algorithm (contd)
Performance Estimation
SW implementation
Done by examining the object code for the basic block generated by a compiler Assumes one operator per clock cycle. Creates a list schedule for the DFG of the basic block. Depth of the list gives the number of clock cycles required. Done by data-flow analysis of the adjacent basic blocks. In Shared-Memory Proportional to number of variables to be accessed
Codesign of Embedded Systems
HW implementation
Communication
Winter-Spring 2001
26
Partitioning Examples: Cosyma (contd)
Algorithm Steps
Change in execution-time caused by moving basic block b from CPU to ASIC:
Dc(b) = w( tHW(b)-tSW(b) + tcom(Z) - tcom(ZUb)) x It(b)
w: Constant weight t(b): Execution time of basic block b tcom(b): Estimated communication time between CPU and the accelerator ASIC, given a set Z of basic blocks
implemented on the ASIC It(b): Total number of times that b is executed
Codesign of Embedded Systems
Winter-Spring 2001
27
Partitioning Examples: Cosyma (contd)
Experimental Results
By moving only basic-blocks to HW
Typical speedup of only 2x Reason:
Limited intra-basic-block parallelism Implement several control-flow optimizations to increase parallelism in the basic block, and hence in ASIC Examples: loop pipelining, speculative branch execution with multiple branch prediction, operator pipelining Speedups: 2.7 to 9.7 CPU times: 35 to 304 seconds on a typical workstation
Codesign of Embedded Systems
Cure:
Result:
Winter-Spring 2001
28
What we learned today
HW/SW Partitioning: One broad category of co-synthesis algorithms Criteria by which a co-synthesis algorithm is categorized
Winter-Spring 2001
Codesign of Embedded Systems
29