Hardware-Software Partitioning in Embedded System Design
Péter Arató, Sándor Juhász, Zoltán Ádám Mann, András Orbán, Dávid Papp
Budapest University of Technology and Economics
Department of Control Engineering and Information Technology
H-1117 Budapest, Magyar tudósok körútja 2, Hungary
arato@[Link], jsanyi@[Link], {[Link], [Link], pappd}@[Link]
Abstract – One of the most crucial steps in the design of embedded systems is hardware-software partitioning, i.e. deciding which components of the system should be implemented in hardware and which ones in software. In this paper, different versions of the partitioning problem are defined, corresponding to real-time systems and cost-constrained systems, respectively. The authors prove the NP-hardness of these problems, identify efficiently solvable special cases, and present two new partitioning algorithms, one based on integer linear programming and one based on genetic algorithms, together with an empirical comparison of the two.

(This work was supported by OTKA T043329. This paper appeared in Proceedings of the 2003 IEEE International Symposium on Intelligent Signal Processing.)

I. INTRODUCTION
Today's computer systems typically consist of both hardware and software components. For instance, in an embedded signal processing application it is common to use both application-specific hardware accelerator circuits and general-purpose, programmable units with the appropriate software [1], [2].

This is beneficial since application-specific hardware is usually much faster than software, but it is also significantly more expensive. Software, on the other hand, is cheaper to create and to maintain, but slow. Hence, performance-critical components of the system should be realized in hardware, and non-critical components in software. This way, a good trade-off between cost and performance can be achieved.

However, this kind of system design is not without challenges. Usual hardware and software design methodologies are in many aspects inadequate for such design tasks. The composition of hardware and software elements also creates new problems, e.g. related to the communication of hardware and software components, as well as system architecture issues. In order to address these problems, hardware-software co-design (HSCD) methods have to be used [3]. One of the most crucial design steps in HSCD is hardware-software partitioning.

II. RELATED WORK

One of the most relevant works is presented in [5], [6]: a very sophisticated integer linear programming model for the joint partitioning and scheduling problem for a wide range of target architectures. This integer program is part of a 2-phase heuristic optimization scheme which aims at gaining better and better timing estimates using repeated scheduling phases, and using the estimates in the partitioning phases.

[7] presents a method for allocation of hardware resources for optimal partitioning. During the allocation algorithm, an estimated hardware/software partition is also built. The algorithm for this is basically a greedy algorithm: it takes the components one by one, and allocates the most critical building block of the current component to hardware. If the finite automaton realizing the control of the component also fits into hardware, then that component is moved to hardware.

An even more heuristic approach is described in [8]. This paper deals with a partitioning problem in which even the cost function to be optimized is a very complicated, heuristically weighted sum, which tries to take into account several optimization criteria. The paper also describes two heuristic algorithms: one based on simulated annealing and one based on tabu search.
[9] shows an algorithm to solve the joint problem of partitioning and scheduling. It consists of basically two local search heuristics: one for partitioning and one for scheduling. The two algorithms operate on the same graph, at the same time.

[10] considers partitioning in the design of ASIPs (application-specific integrated processors). It presents a formal framework and proposes a partitioning algorithm based on branch and bound.

[11] presents an approach that is largely orthogonal to other partitioning methods: it deals with the problem of hierarchically matching tasks to resources. It also shows a method for weighting partially defined user preferences, which can be very useful for multiple-objective optimization problems.

III. PROBLEM DEFINITION

We observed that previous works tried to capture too many details of the partitioning problem and the target architecture. Therefore, the solutions proposed in the literature

- do not scale well for large inputs but fall victim to combinatorial explosion; and/or
- are heuristic in nature, and drift away too much from optimal solutions.

Moreover, the complexity of the problem is hard to determine. Therefore, our main goal was to introduce a simplified model for partitioning, using which we can

- define different versions of the partitioning problem formally;
- analyze the complexity of the problems formally;
- give algorithms that can be used for the design of large systems as well.

Note that algorithms published previously in the literature have been tested on systems with some dozens of components. Our aim was to make our algorithms scalable for systems with hundreds or even thousands of components, so that they can indeed be used in practical, large-scale projects. For these reasons we had to keep the model as simple as possible.

A. Informal model

The characteristics of our model are the following:

- We consider only one software context (i.e. one general-purpose processor) and one hardware context (e.g. one FPGA). The components of the system have to be mapped to either one of these two contexts.
- Software implementation of a component is associated with a software cost, which is the running time of the component if implemented in software.
- Hardware implementation of a component is associated with a hardware cost, which can be for instance area, heat dissipation, energy consumption etc.
- Since hardware is significantly faster than software, the running time of components implemented in hardware is taken to be zero.
- If two communicating components are mapped to different contexts, this is associated with a (time-dimensional) communication overhead. If two components are mapped to the same context, then the communication between them is neglected.

One of the most important advantages of this simplified model is that scheduling does not have to be addressed explicitly. Hardware components do not have to be scheduled, because their running time is assumed to be zero. Software components do not have to be scheduled because there is only one processor, so that the overall running time will be the sum of the running times of the software components, regardless of their schedule. Therefore, we can completely decouple the partitioning problem from the scheduling problem, and focus solely on the partitioning problem.

B. Formal problem definition

An undirected simple graph $G = (V, E)$ is given. $G$ is the so-called task graph of the system: its nodes are the components of the system that have to be partitioned, and the edges represent communication between the components. $s(v)$ (or $s_v$) and $h(v)$ (or $h_v$) denote the software and hardware cost of node $v$, respectively, while $c(u, v)$ (or $c_{uv}$) denotes the communication cost between $u$ and $v$ if they are in different contexts (HW or SW).

$P$ is called a hardware-software (HW-SW) partition if it is a bipartition of $V$: $P = (V_H, V_S)$, where $V_H \cup V_S = V$ and $V_H \cap V_S = \emptyset$. The crossing edges are: $E_P = \{(u, v) \in E : u \in V_H, v \in V_S \text{ or } u \in V_S, v \in V_H\}$. The hardware cost of $P$ is $H_P = \sum_{v \in V_H} h_v$; the software cost of $P$ is $S_P = \sum_{v \in V_S} s_v + \sum_{(u,v) \in E_P} c_{uv}$, i.e. the software cost of the nodes and the communication cost; since both are time-dimensional, it makes sense to add them, and together they make up the running time of the system.

The following optimization and decision problems can be defined for the partitioning problem:

- P1: $H_0$ and $S_0$ are given. Is there a HW-SW partition $P$ so that $H_P \le H_0$ and $S_P \le S_0$?
- P2: $H_0$ is given. Find a HW-SW partition $P$ so that $H_P \le H_0$ and $S_P$ is minimal. (Cost-constrained systems)
- P3: $S_0$ is given. Find a HW-SW partition $P$ so that $S_P \le S_0$ and $H_P$ is minimal. (Systems with hard real-time constraints)
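To make the cost model concrete, here is a small Python sketch (our own illustration; the task graph, cost values and function names are hypothetical, not taken from the paper) that computes $H_P$ and $S_P$ for a given partition and checks the P3 constraint $S_P \le S_0$.

```python
# Minimal sketch of the partitioning cost model (illustrative data and names).

def hw_cost(hw_nodes, h):
    """Hardware cost H_P: sum of hardware costs of the nodes mapped to hardware."""
    return sum(h[v] for v in hw_nodes)

def sw_cost(hw_nodes, nodes, s, edges):
    """Software cost S_P: software costs of the software nodes plus the
    communication costs of the crossing edges (both are time-dimensional)."""
    sw_nodes = set(nodes) - set(hw_nodes)
    sw = sum(s[v] for v in sw_nodes)
    comm = sum(c for (u, v, c) in edges
               if (u in hw_nodes) != (v in hw_nodes))   # crossing edge
    return sw + comm

# Tiny example task graph with 4 components.
nodes = ["n1", "n2", "n3", "n4"]
s = {"n1": 5, "n2": 3, "n3": 8, "n4": 2}                # software running times
h = {"n1": 10, "n2": 6, "n3": 4, "n4": 7}               # hardware costs
edges = [("n1", "n2", 1), ("n2", "n3", 2), ("n3", "n4", 1)]

partition_hw = {"n3"}                                   # candidate partition: n3 in hardware
S0 = 15                                                 # software limit (real-time constraint, P3)
H_P = hw_cost(partition_hw, h)
S_P = sw_cost(partition_hw, nodes, s, edges)
print(H_P, S_P, S_P <= S0)                              # valid for P3 iff S_P <= S0
```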
C. Complexity

Theorem 1: P1 is NP-complete even if only graphs with no edges are considered.
Proof: P1 $\in$ NP, since if partitioning is possible with the given limits, a suitable partition itself serves as a certificate that can be checked in polynomial time. To prove NP-hardness, the KNAPSACK problem is reduced to P1 as follows. Given items with weights $w_i$, profits $p_i$, weight capacity $W_0$ and required profit $B$, take a graph with one node $v_i$ per item and no edges, and let $h_i := w_i$, $s_i := p_i$, $H_0 := W_0$ and $S_0 := \sum_i p_i - B$. (Since $E$ is empty, there is no need to define $c$.) We state that this instance of P1 is solvable iff the original KNAPSACK problem has a solution.

Assuming that P1 has a solution $P = (V_H, V_S)$, we have

$\sum_{v_i \in V_H} h_i \le H_0$,  (1)

$\sum_{v_i \in V_S} s_i \le S_0$.  (2)

(1) and (2) prove that the item set corresponding to $V_H$ is a solution of the original KNAPSACK problem: by (1) its total weight is at most $W_0$, and by (2) the total profit of the remaining items is at most $\sum_i p_i - B$, i.e. the total profit of the chosen items is at least $B$. Let us now assume that an item set $I$ solves the KNAPSACK problem. Therefore, $\sum_{i \in I} w_i \le W_0$ and $\sum_{i \notin I} p_i \le \sum_i p_i - B$, so putting the nodes corresponding to $I$ into hardware and the remaining nodes into software yields a partition with $H_P \le H_0$ and $S_P \le S_0$, i.e. a solution of this instance of P1.
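The reduction can be exercised on toy instances. The sketch below is our own illustration (hypothetical data and names, with a brute-force check that is only feasible for tiny inputs); it builds the edgeless P1 instance from a knapsack instance exactly as in the cost assignment sketched in the proof above.

```python
from itertools import combinations

# Knapsack instance (toy data): weights w, profits p, capacity W0, required profit B.
w = [4, 3, 2]
p = [5, 4, 3]
W0, B = 6, 8

# Reduction to P1 on an edgeless graph, as in the proof:
# hardware costs h := w, software costs s := p, H0 := W0, S0 := sum(p) - B.
h, s = list(w), list(p)
H0, S0 = W0, sum(p) - B

def knapsack_solvable():
    items = range(len(w))
    return any(sum(w[i] for i in sel) <= W0 and sum(p[i] for i in sel) >= B
               for r in range(len(w) + 1) for sel in combinations(items, r))

def p1_solvable():
    nodes = range(len(h))
    # With no edges, S_P is just the software cost of the nodes left in software.
    return any(sum(h[i] for i in hw) <= H0 and
               sum(s[i] for i in nodes if i not in hw) <= S0
               for r in range(len(h) + 1) for hw in combinations(nodes, r))

print(knapsack_solvable(), p1_solvable())   # the two answers coincide by construction
```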
It can be proven in the same way that the optimization versions P2 and P3 are NP-hard as well. Alternatively, P1 can be reduced to P2: P2 provides a solution in which $S_P$ is minimal (subject to $H_P \le H_0$); let this value be $S_{\min}$. Clearly, P1 is solvable iff $S_{\min} \le S_0$.

On the other hand, if communication is the only significant factor, i.e. the software and hardware costs are negligible compared to the communication costs, then the trivial optimal solution is to put every node into software. However, if there are some predefined constraints concerning the context of some nodes (i.e. the nodes in $V_S'$ are prescribed to be in software and the ones in $V_H'$ to be in hardware, where $V_S', V_H' \subseteq V$ and $V_S' \cap V_H' = \emptyset$), the problem reduces to finding a minimal weighted $s$-$t$-cut, which is an efficiently solvable special case.
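As an illustration of this special case, the following sketch (our own construction, not taken from the paper, and assuming the objective is to minimize the total communication cost of the crossing edges subject to the prescribed assignments) computes such a partition with networkx: prescribed software nodes are tied to the source and prescribed hardware nodes to the sink with infinite capacities, each communication edge becomes a pair of opposite arcs, and a minimum $s$-$t$ cut then gives the partition.

```python
# Sketch of the min-cut special case (assumes networkx is available; illustrative data).
# Only communication costs matter; some nodes are fixed to SW, others to HW.
import networkx as nx

nodes = ["n1", "n2", "n3", "n4"]
edges = [("n1", "n2", 3), ("n2", "n3", 1), ("n3", "n4", 2), ("n1", "n3", 1)]
fixed_sw = {"n1"}          # prescribed to software
fixed_hw = {"n4"}          # prescribed to hardware

G = nx.DiGraph()
for u, v, c in edges:                         # undirected communication edge -> two arcs
    G.add_edge(u, v, capacity=c)
    G.add_edge(v, u, capacity=c)
for v in fixed_sw:                            # tie prescribed nodes to the terminals
    G.add_edge("SRC", v, capacity=float("inf"))
for v in fixed_hw:
    G.add_edge(v, "SNK", capacity=float("inf"))

cut_value, (sw_side, hw_side) = nx.minimum_cut(G, "SRC", "SNK")
print(cut_value, sorted(sw_side - {"SRC"}), sorted(hw_side - {"SNK"}))
```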
IV. ILP-BASED ALGORITHM

We now give an integer linear programming (ILP) formulation of P3. Let $h$, $s$ and $c$ be the vectors representing the cost functions ($n$ is the number of nodes, $m$ is the number of edges). Let $E$ denote the transposed incidence matrix of $G$, that is, $E_{ji} = 1$ if edge $e_j$ starts in node $v_i$, $E_{ji} = -1$ if edge $e_j$ ends in node $v_i$, and $E_{ji} = 0$ otherwise. Furthermore, let $x$ be the vector with $x_i = 1$ if node $v_i$ is realized in hardware and $x_i = 0$ if node $v_i$ is realized in software. It can be seen that the components of the vector $|Ex|$ indicate exactly the crossing edges. This yields the program

minimize $h^T x$   (5a)
subject to $s^T (1 - x) + c^T |Ex| \le S_0$   (5b)
$x \in \{0, 1\}^n$   (5c)

The absolute value is handled by introducing the variables $y_1, \dots, y_m$ to eliminate the $|Ex|$:

minimize $h^T x$   (6a)
subject to $s^T (1 - x) + c^T y \le S_0$   (6b)
$y \ge Ex$   (6c)
$y \ge -Ex$   (6d)
$x \in \{0, 1\}^n$   (6e)

The last two programs are equivalent. If $x$ solves (5b)-(5c), then $(x, |Ex|)$ solves (6b)-(6e). On the other hand, if $(x, y)$ solves (6b)-(6e), then $x$ will solve (5b)-(5c) too, since $y \ge |Ex|$ and $c \ge 0$. Hence (6a)-(6e) is the ILP formulation of the P3 problem. We solve this integer program using LP-relaxation and branch-and-bound [15].
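For illustration, program (6a)-(6e) can be prototyped with an off-the-shelf modelling library; the sketch below uses PuLP (an assumption of this example, not the authors' implementation, which relies on LP-relaxation and branch-and-bound) on hypothetical toy data.

```python
# Sketch of the ILP (6a)-(6e) for P3 using PuLP (illustrative; not the paper's own solver).
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, value

# Toy task graph: node costs and communication costs on the edges.
n_nodes = 4
h = [10, 6, 4, 7]                            # hardware costs
s = [5, 3, 8, 2]                             # software costs (running times)
edges = [(0, 1, 1), (1, 2, 2), (2, 3, 1)]    # (u, v, c_uv)
S0 = 15                                      # software limit (real-time constraint)

prob = LpProblem("P3_partitioning", LpMinimize)
x = [LpVariable(f"x_{i}", cat=LpBinary) for i in range(n_nodes)]     # 1 = hardware
y = [LpVariable(f"y_{j}", lowBound=0) for j in range(len(edges))]    # y_j >= |(Ex)_j|

# (6a) minimize the hardware cost h^T x
prob += lpSum(h[i] * x[i] for i in range(n_nodes))

# (6b) software cost of software nodes plus communication cost of crossing edges <= S0
prob += lpSum(s[i] * (1 - x[i]) for i in range(n_nodes)) + \
        lpSum(c * y[j] for j, (_, _, c) in enumerate(edges)) <= S0

# (6c)-(6d) linearization of |x_u - x_v| for every edge
for j, (u, v, _) in enumerate(edges):
    prob += y[j] >= x[u] - x[v]
    prob += y[j] >= x[v] - x[u]

prob.solve()
print([int(value(xi)) for xi in x], value(prob.objective))
```

Since $c \ge 0$ and the objective does not involve $y$, the solver drives each $y_j$ down to $|x_u - x_v|$, so constraint (6b) behaves exactly like (5b).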
V. GENETIC ALGORITHM

Although the ILP-based solution is efficient for graphs with up to some hundreds of nodes, and it produces the exact optimum, it cannot be used to partition even bigger graphs. For this purpose, we also developed a genetic algorithm (GA). For GA in general, see [16], [17] and references therein.

Individuals. The partitioning problem is fortunate from the point of view of a genetic algorithm. The applicability of genetic algorithms requires that the solutions of the optimization problem can be represented by means of a vector with meaningful components: this is the condition for recombination to work on the actual features of a solution. Fortunately, there is an obvious vector representation in the case of the partitioning problem: each partition is a bit vector, just like in the ILP program.

Population. The population is a set of individuals. The question is whether non-valid individuals, i.e. those violating for instance the software limit in the case of P3, should also be allowed in the population. Since non-valid individuals violate some important design constraint, it seems to be logical at first glance to work with valid partitions only. However, this approach would have several drawbacks: first, invalid individuals may contain valuable patterns that should be propagated, and second, it is hard to guarantee that genetic operations do not generate non-valid individuals even from valid ones. This holds for both mutation and recombination. For these reasons we decided to allow non-valid individuals as well in the population. Of course the GA must produce a valid partition at the end, so we must make sure to insert some valid individuals into the initial population, and choose the fitness function in such a way that it punishes invalidity.

Our tests showed that the population size should be around 300.

Initial population. In order to guarantee diversity in the population, the initial population usually consists of randomly chosen individuals. However, this method does not guarantee that there will be valid individuals in the initial population; in fact, if the problem space is big and the constraints are tight, then the chances are very low for this. Generating random valid individuals with approximately uniform distribution would be a promising alternative, but it is by no means obvious how one could implement such a scheme. Therefore we chose to fill the initial population partly with randomly selected, not necessarily valid individuals, and partly with valid individuals generated by a fast greedy algorithm. This way, there are valid individuals, but also a wide variety of other individuals in the initial population. Clearly, the ratio between the two kinds of individuals in the initial population is a crucial parameter of the GA. The tests have shown that about one third of the individuals in the initial population should be chosen randomly.
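The paper does not detail the fast greedy algorithm used to seed valid individuals; the sketch below is one plausible interpretation (entirely our assumption): starting from the all-software partition, nodes are greedily moved to hardware until the software limit is met, and the rest of the initial population is filled with random, possibly invalid bit vectors.

```python
import random

# One plausible reading (our assumption, not the paper's algorithm) of how the initial
# population could be seeded for P3.

def software_cost(x, s, edges):
    sw = sum(s[i] for i, xi in enumerate(x) if xi == 0)
    comm = sum(c for (u, v, c) in edges if x[u] != x[v])
    return sw + comm

def greedy_valid_individual(n, s, h, edges, S0):
    x = [0] * n                                    # start with everything in software
    while software_cost(x, s, edges) > S0:
        candidates = [i for i in range(n) if x[i] == 0]
        sample = random.sample(candidates, min(3, len(candidates)))
        # heuristic choice (our assumption): best software-cost per hardware-cost ratio
        best = max(sample, key=lambda i: s[i] / max(h[i], 1e-9))
        x[best] = 1                                # move it to hardware
    return x

def initial_population(size, n, s, h, edges, S0, random_ratio=1/3):
    # about one third random (possibly invalid) individuals, the rest greedily built valid ones
    pop = [[random.randint(0, 1) for _ in range(n)]
           for _ in range(int(size * random_ratio))]
    while len(pop) < size:
        pop.append(greedy_valid_individual(n, s, h, edges, S0))
    return pop
```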
Fitness function. Since we mainly focused on the P3 problem, the objective is to minimize the hardware cost. However, since invalid individuals should be punished, the fitness has a second component: the measure of invalidity $i(p)$, defined as the amount by which the software cost (including communication cost) exceeds the software limit (and 0 for a valid partition).

We tried several versions of the fitness function, which fall basically into two categories: those ranking every valid individual in front of the invalid ones, and the less rigorous ones that allow invalid individuals to beat valid ones. Later in our tests a rigorous version proved to be best:

$f(p) = H_P$ if $p$ is valid, and $f(p) = K + \alpha \cdot i(p)$ if $p$ is invalid,

where $H_P$ is the hardware cost, $\alpha$ is an appropriate constant determined by the tests, and $K$ is a sufficiently large constant that makes each invalid individual have a higher fitness than any valid individual. (Note that the fitness has to be minimized.)

Genetic operations. Mutation, recombination and selection are used.

Mutation is done in the new population; each gene of each individual is altered with the same probability.

In the case of recombination we tested both one-point crossover and two-point crossover. Moreover, we tested two schemes for choosing the individuals for crossover: in the first one, all individuals are chosen with the same probability; in the second one, the probability of choosing a particular individual depends on its fitness, so that better individuals are chosen with a higher probability. So we tested four different versions of the recombination operator. According to the test results, the best strategy is to choose individuals with probabilities determined by their fitness, and to use two-point crossover (although one-point crossover is not much worse).

Selection is usually realized as filling some part of the new population with the best individuals of the old population. However, since some versions of the fitness function rank all invalid individuals behind the valid ones, this would mean that invalid individuals are completely omitted from selection. Therefore, a given ratio of the selected individuals is taken from the invalid individuals.
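The fitness function and the genetic operators described above can be summarised in code; the following sketch is our own illustration (constants, probabilities and helper names are placeholders rather than values from the paper, except where the text states them explicitly).

```python
import random

# Illustrative sketch of the GA ingredients described above (not the authors' implementation).
# An individual is a bit vector: 1 = hardware, 0 = software.

def fitness(x, h, s, edges, S0, K=10**9, alpha=1.0):
    """Rigorous fitness variant: valid individuals are ranked by hardware cost,
    invalid ones by K + alpha * invalidity, so every valid beats every invalid.
    The fitness is to be minimized."""
    hw = sum(h[i] for i, xi in enumerate(x) if xi == 1)
    sw = sum(s[i] for i, xi in enumerate(x) if xi == 0) + \
         sum(cc for (u, v, cc) in edges if x[u] != x[v])
    invalidity = max(0, sw - S0)          # amount by which the software limit is exceeded
    return hw if invalidity == 0 else K + alpha * invalidity

def mutate(x, p_gene=0.01):
    """Each gene is flipped with the same probability."""
    return [1 - xi if random.random() < p_gene else xi for xi in x]

def two_point_crossover(a, b):
    """Two-point crossover of two parent bit vectors."""
    i, j = sorted(random.sample(range(len(a)), 2))
    return a[:i] + b[i:j] + a[j:]

def select_parent(pop, fitnesses):
    """Fitness-dependent choice: lower fitness (better) gets a higher weight."""
    weights = [1.0 / (1.0 + f) for f in fitnesses]
    return random.choices(pop, weights=weights, k=1)[0]
```

In a full run, part of the new population would also be filled with the best old individuals, with a fixed share reserved for invalid ones, as described above.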
Stopping criteria. The GA takes a given minimum number of steps (generations). After that, it stops if the best individual has not improved during a given number of further steps, which is also a parameter.
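A minimal sketch of this stopping rule (our own illustration; parameter names and default values are placeholders): the loop below runs at least `min_generations` generations and then stops once the best fitness has not improved for `patience` further generations.

```python
# Sketch of the stopping rule described above (illustrative; parameter values are placeholders).
def run_ga(step, initial_pop, fitness, min_generations=100, patience=30):
    """step(pop) -> new population; fitness(x) -> value to minimize."""
    pop = initial_pop
    best = min(fitness(x) for x in pop)
    since_improvement = 0
    generation = 0
    while generation < min_generations or since_improvement < patience:
        pop = step(pop)
        current_best = min(fitness(x) for x in pop)
        if current_best < best:
            best, since_improvement = current_best, 0
        else:
            since_improvement += 1
        generation += 1
    return pop, best
```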
VI. EMPIRICAL RESULTS

Since the GA has a large number of parameters, we could not test all configurations. Rather, we first tuned the less sensitive parameters and fixed them to the best values found, and then turned to the more sensitive ones. After having tuned the parameters of the GA, we fixed them to the best found configuration, and moved on to compare it with the ILP algorithm on industry benchmarks. The benchmarks we used are cryptographic algorithms (IDEA, MARS and RC6), in communication-dominated and processing-dominated variants.

[Fig. 3. The ratio of the results of GA and the optimum (y-axis: ratio in %; x-axis: test cases 1-7; curves for IDEA, MARS and RC6 in communication-dominated and processing-dominated variants).]
On average, the solutions found by the GA are not far from the optimum. However, the gap increases with the problem size.

Figure 3 shows the deviation of the result of the GA from the optimum in detail. On the x-axis the different tests according to the software limit are shown; on the y-axis the ratio of the solution found by the GA and the optimum can be seen. In test cases 1 and 7 the software limit was chosen in such a way that the optimum is to put every component in software or hardware, respectively. It can be seen that the GA finds these extremes in every case. Another consequence of the figure is that the difficulty of the problem depends strongly on the software limit. Tests 5 and 6 seemed to be the hardest problems, where in the worst case only an approximation ratio of 2 could be achieved. The best results were found in the 'IDEA, communication-dominated' scenario, in which all the seven test cases resulted in the optimal solution.

The easy-hard-easy pattern that can be recognized in Figure 3 is in accordance with previous findings in the literature [18]. We could expect a similar pattern in the running time of the ILP algorithm; however, this is not true. In fact, the running time of the ILP algorithm oscillates wildly. (In contrast, the running time of the GA is quite consistent.) This might be caused by the previously observed heavy-tailed runtime distribution of search algorithms [19]. It is possible that the ILP algorithm exhibits a similarly clear easy-hard-easy pattern on average, but this phenomenon is hidden by the large deviation caused by the heavy-tailed property. In order to decide this, we are planning to implement a randomized ILP algorithm and test its average behaviour.

VII. CONCLUSION

In this paper, we have introduced a new, simplified model for the hardware-software partitioning problem. This model makes it possible to define different versions of the partitioning problem formally. In particular, we have proven that the problem is NP-hard in the general case, but we could also identify some efficiently solvable, meaningful special cases. Moreover, we presented two new partitioning algorithms: one based on ILP, and a genetic algorithm. It turned out in our empirical tests that the ILP-based solution works efficiently for graphs with several hundreds of nodes and yields optimal solutions, whereas the genetic algorithm gives near-optimal solutions on average, and works efficiently even with graphs of thousands of nodes. Moreover, we observed an easy-hard-easy pattern in the performance of the GA, and wild oscillations in the running time of the ILP algorithm.

Our future plans include randomization of our ILP-based algorithm, developing better bounds for the branch-and-bound scheme used in the ILP algorithm, as well as the inclusion of some simple scheduling methods into our partitioning model.

REFERENCES

[1] P. D. Randall, "Hardware/software co-design and digital signal processing," MSc thesis, University of Oxford, May 1996.
[2] P. Arató, Z. Á. Mann, and A. Orbán, "Hardware-software co-design for Kohonen's self-organizing map," in Proceedings of the IEEE 7th International Conference on Intelligent Engineering Systems, 2003.
[3] R. K. Gupta, Co-Synthesis of Hardware and Software for Digital Embedded Systems, PhD thesis, Stanford University, December 1993.
[4] Z. Á. Mann and A. Orbán, "Optimization problems in system-level synthesis," in Proceedings of the 3rd Hungarian-Japanese Symposium on Discrete Mathematics and Its Applications, 2003.
[5] R. Niemann and P. Marwedel, "An algorithm for hardware/software partitioning using mixed integer linear programming," Design Automation for Embedded Systems, special issue: Partitioning Methods for Embedded Systems, vol. 2, pp. 165–193, March 1997.
[6] R. Niemann, Hardware/Software Co-Design for Data Flow Dominated Embedded Systems, Kluwer Academic Publishers, 1998.
[7] J. Grode, P. V. Knudsen, and J. Madsen, "Hardware resource allocation for hardware/software partitioning in the LYCOS system," in Proceedings of Design Automation and Test in Europe (DATE '98), 1998.
[8] P. Eles, Z. Peng, K. Kuchcinski, and A. Doboli, "Hardware/software partitioning of VHDL system specifications," in Proceedings of EURO-DAC '96, 1996.
[9] K. S. Chatha and R. Vemuri, "MAGELLAN: Multiway hardware-software partitioning and scheduling for latency minimization of hierarchical control-dataflow task graphs," in Proceedings of CODES '01, 2001.
[10] N. N. Binh, M. Imai, A. Shiomi, and N. Hikichi, "A hardware/software partitioning algorithm for designing pipelined ASIPs with least gate counts," in Proceedings of the 33rd Design Automation Conference, 1996.
[11] G. Quan, X. Hu, and G. Greenwood, "Preference-driven hierarchical hardware/software partitioning," in Proceedings of the IEEE/ACM International Conference on Computer Design, 1999.
[12] C. H. Papadimitriou, Computational Complexity, Addison-Wesley, 1994.
[13] D. S. Hochbaum, Ed., Approximation Algorithms for NP-Hard Problems, PWS Publishing, Boston, MA, 1997.
[14] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, Network Flows: Theory, Algorithms, and Applications, Prentice-Hall, 1993.
[15] A. Schrijver, Theory of Linear and Integer Programming, Wiley, 1998.
[16] L. Davis, Handbook of Genetic Algorithms, Van Nostrand Reinhold, 1991.
[17] W. Kinnebrock, Optimierung mit genetischen und selektiven Algorithmen, Oldenburg, 1994.
[18] D. Achlioptas, C. P. Gomes, H. Kautz, and B. Selman, "Generating satisfiable instances," in Proceedings of the Seventeenth National Conference on Artificial Intelligence, 2000.
[19] C. P. Gomes, B. Selman, N. Crato, and H. Kautz, "Heavy-tailed phenomena in satisfiability and constraint satisfaction problems," Journal of Automated Reasoning, vol. 24, no. 1-2, pp. 67–100, 2000.